Understanding Bioinformatics

Senior Publisher: Jackie Harbor
Editor: Dom Holdsworth
Development Editor: Eleanor Lawrence
Illustrations: Nigel Orme
Typesetting: Georgina Lucas
Cover design: Matthew McClements, Blink Studio Limited
Production Manager: Tracey Scarlett
Copyeditor: Jo Clayton
Proofreader: Sally Livitt
Accuracy Checking: Eleni Rapsomaniki
Indexer: Lisa Furnival
Vice President: Denise Schanck

This book contains information obtained from authentic and highly regarded sources. Reprinted material is quoted with permission, and sources are indicated. Every attempt has been made to source the figures accurately. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or for the consequences of their use. All rights reserved. No part of this book covered by the copyright herein may be reproduced or used in any format in any form or by any means—graphic, electronic, or mechanical, including photocopying, recording, taping, or information storage and retrieval systems—without permission of the publisher.

10-digit ISBN 0-8153-4024-9 (paperback)
13-digit ISBN 978-0-8153-4024-9 (paperback)

Library of Congress Cataloging-in-Publication Data

Zvelebil, Marketa J.
Understanding bioinformatics / Marketa Zvelebil & Jeremy O. Baum.
p. ; cm.
Includes bibliographical references and index.
ISBN-13: 978-0-8153-4024-9 (pbk.)
ISBN-10: 0-8153-4024-9 (pbk.)
1. Bioinformatics.
[DNLM: 1. Computational Biology-methods. QU 26.5 Z96u 2008]
I. Baum, Jeremy O. II. Title.
QH324.2.Z84 2008
572.80285-dc22
2007027514

Published by Garland Science, Taylor & Francis Group, LLC, an informa business, 270 Madison Avenue, New York, NY 10016, USA, and 2 Park Square, Milton Park, Abingdon, OX14 4RN, UK.

Printed in the United States of America.

PREFACE

The analysis of data arising from biomedical research has undergone a revolution over the last 15 years, brought about by the combined impact of the Internet and the development of increasingly sophisticated and accurate bioinformatics techniques. All research workers in the areas of biomolecular science and biomedicine are now expected to be competent in several areas of sequence analysis and often, additionally, in protein structure analysis and other more advanced bioinformatics techniques.

When we began our research careers in the early 1980s all of the techniques that now comprise bioinformatics were restricted to specialists, as databases and user-friendly applications were not readily available and had to be installed on laboratory computers. By the mid-1990s many datasets and analysis programs had become available on the Internet, and the scientists who produced sequences began to take on tasks such as sequence alignment themselves. However, there was a delay in providing comprehensive training in these techniques. At the end of the 1990s we started to expand our teaching of bioinformatics at both undergraduate and postgraduate level. We soon realized that there was a need for a textbook that bridged the gap between the simplistic introductions available, which concentrated on results almost to the exclusion of the underlying science, and the very detailed monographs, which presented the theoretical underpinnings of a restricted set of techniques. This textbook is our attempt to fill that gap.

Therefore on the one hand we wanted to include material explaining the program methods, because we believe that to perform a proper analysis it is not sufficient to understand how to use a program and the kind of results (and errors!) it can produce. It is also necessary to have some understanding of the technique used by the program and the science on which it is based. But on the other hand, we wanted this book to be accessible to the bioinformatics beginner, and we recognized that even the more advanced students occasionally just want a quick reminder of what an application does, without having to read through the theory behind it.

From this apparent dilemma was born the division into Applications and Theory Chapters. Throughout the book, we wrote dedicated Applications Chapters to provide a working knowledge of bioinformatics applications, quick and easy to grasp. In most places, an Applications Chapter is then followed by a Theory Chapter, which explains the program methods and the science behind them. Inevitably, we found this created a small amount of duplication between some chapters, but to us this was a small sacrifice if it left the reader free to choose at what level they could engage with the subject of bioinformatics.

We have created a book that will serve as a comfortable introduction to any new student of bioinformatics, but which they can continue to use into their postgraduate studies. The book assumes a certain level of understanding of the background biology, for example gene and protein structure, where it is important to appreciate the variety that exists and not only know the canonical examples of first-year textbooks. In addition, to describe the techniques in detail a level of mathematics is required which is more appropriate for more advanced students. We are aware that many postgraduate students of bioinformatics have a background in areas such as computer science and mathematics. They will find many familiar algorithmic approaches presented, but will see their application in unfamiliar territory. As they read the book they will also appreciate that to become truly competent at bioinformatics they will require knowledge of biomedical science.

There is a certain amount of frustration inherent in producing any book, as the writing process seems often to be as much about what cannot be included as what can. Bioinformatics as a subject has already expanded to such an extent that we had to be careful not to diminish the book's teaching value by trying to squeeze every possible topic into it. We have tried to include as broad a range of subjects as possible, but some have been omitted. For example, we do not deal with the methods of constructing a nucleotide sequence from the individual reads, nor with a number of more specialized aspects of genome annotation.

The final chapter is an introduction to the even-faster-moving subject of systems biology. Again, we had to balance the desire to say more against the practical constraints of space. But we hope this chapter gives readers a flavor of what the subject covers and the questions it is trying to answer. The chapter will not answer every reader's every query about systems biology, but if it prompts more of them to inquire further, that is already an achievement.

We wish to acknowledge many people who have helped us with this project. We would almost certainly not have got here without the enthusiasm and support of Matthew Day who guided us through the process of getting a first draft. Getting from there to the finished book was made possible by the invaluable advice and encouragement from Chris Dixon, Dom Holdsworth, Jackie Harbor, and others from Garland Science. We also wish to thank Eleanor Lawrence for her skills in massaging our text into shape, and Nigel Orme for producing the wonderful illustrations. We received inspiration and encouragement from many others, too many to name here, but including our students and those who read our draft chapters. Finally, we wish to thank the many friends and family members who have had to suffer while we wrote this book. In particular JB wishes to thank his wife Hilary for her encouragement and perseverance. MZ wishes to specially thank her parents, Martin Scurr, Nick Lee, and her colleagues at work.

Marketa Zvelebil
Jeremy O. Baum
May 2007

A NOTE TO THE READER

Organization of this Book

Applications and Theory Chapters

Careful thought has gone into the organization of this book. The chapters are grouped in two ways. Firstly, the chapters are organized into seven parts according to topic. Within the parts, there is a second, less traditional, level of organization: most chapters are designated as either Applications or Theory Chapters. This book is designed to be accessible both to students who wish to obtain a working knowledge of the bioinformatics applications and to students who want to know how the applications work and maybe write their own. So at the start of most parts, there are dedicated Applications Chapters, which deal with the more practical aspects of the particular research area, and are intended to act as a useful hands-on introduction. Following these are Theory Chapters, which explain the science, theory, and techniques employed in generally available applications. These are more demanding and should preferably be read after having gained a little experience of running the programs. In order to become truly proficient in the techniques you need to read and understand these more technical aspects. On the opening page of each chapter, and in the Table of Contents, it is clearly indicated whether it is an Applications or a Theory Chapter.

Part 1: Background Basics

Background Basics provides three introductory chapters to key knowledge that will be assumed throughout the remainder of the book. The first two chapters contain material that should be well-known to readers with a background in biomedical science. The first chapter describes the structure of nucleic acids and some of the roles played by them in living systems, including a brief description of how the genomic DNA is transcribed into mRNA and then translated into protein. The second chapter describes the structure and organization of proteins. Both of these chapters present only the most basic information required, and should not in any way be regarded as an adequate grounding in these topics for serious work. The intention is to provide enough information to make this book self-sufficient. The third chapter in this part describes databases, again at a very introductory level. Many biomedical research workers have large datasets to analyze, and these need to be stored in a convenient and practical way. Databases can provide a complete solution to this problem.

Part 2: Sequence Alignments

Sequence Alignments contains three chapters that deal with a variety of analyses of sequences, all relating to identifying similarities. Chapter 4 is a practical introduction to the area, following some examples through different analyses and showing some potential problems as well as successful results. Chapters 5 and 6 deal with several of the many different techniques used in sequence analysis. Chapter 5 focuses on the general aspects of aligning two sequences and the specific methods employed in database searches. A number of techniques are described in detail, including dynamic programming, suffix trees, hashing, and chaining. Chapter 6 deals with methods involving many sequences, defining commonly occurring patterns, defining the profile of a family of related proteins, and constructing a multiple alignment. A key technique presented in this chapter is that of hidden Markov models (HMMs).

Part 3: Evolutionary Processes

Evolutionary Processes presents the methods used to obtain phylogenetic trees from a sequence dataset. These trees are reconstructions of the evolutionary history of the sequences, assuming that they share a common ancestor. Chapter 7 explains some of the basic concepts involved, and then shows how the different methods can be applied to two different scientific problems. In Chapter 8 details are given of the techniques involved and how they relate to the assumptions made about the evolutionary processes.

Part 4: Genome Characteristics

Genome Characteristics deals with the analysis required to interpret raw genome sequence data. Although by the time a genome sequence is published in the research journals some preliminary analysis will have been carried out, often the unanalyzed sequence is available before then. This part describes some of the techniques that can be used to try to locate genes in the sequence. Chapter 9 describes some of the range of programs available, and shows how complex their output can be and illustrates some of the possible pitfalls. Chapter 10 presents a survey of the techniques used, especially different Markov models and how models of whole genes can be built up from models of individual components such as ribosome-binding sites.

Part 5: Secondary Structures

Secondary Structures provides two chapters on methods of predicting secondary structures based on sequence (or primary structure). Chapter 11 introduces the methods of secondary structure prediction and discusses the various techniques and ways to interpret the results. Later sections of the chapter deal with prediction of more specialized secondary structure such as protein transmembrane regions, coiled coil and leucine zipper structures, and RNA secondary structures. Chapter 12 presents the underlying principles and details of the prediction methods from basic concepts to in-depth understanding of techniques such as neural networks and Markov models applied to this problem.

Part 6: Tertiary Structures

Tertiary Structures extends the material in Part 5 to enable the prediction and modeling of protein tertiary and quaternary structure. Chapter 13 introduces the reader to the concepts of energy functions, minimization, and ab initio prediction. It deals in more detail with the method of threading and focuses on homology modeling of protein structures, taking the student in a stepwise fashion through the process. The chapter ends with example studies to illustrate the techniques. Chapter 14 contains methods and techniques for further analysis of structural information and describes the importance of structure and function relationships. This chapter deals with how fold prediction can help to identify function, as well as giving an introduction to ligand docking and drug design.

Part 7: Cells and Organisms

Cells and Organisms consists of two chapters that deal in some detail with expression analysis and an introductory chapter on systems biology. Chapter 15 introduces the techniques available to analyze protein and gene expression data. It shows the reader the information that can be learned from these experimental techniques as well as how the information could be used for further analysis. Chapter 16 presents some of the clustering techniques and statistics that are touched upon in Chapter 15 and are commonly used in gene and protein expression analysis. Chapter 17 is a standalone chapter dealing with the modeling of systems processes. It introduces the reader to the basic concepts of systems biology, and shows what this exciting and rapidly growing field may achieve in the future.

Appendices

Three appendices are provided that expand on some of the concepts mentioned in the main part of this book. These are useful for the more inquisitive and advanced reader. Appendix A deals with probability and Bayesian analysis, Appendix B is mainly associated with Part 6 and deals with molecular energy functions, while Appendix C describes function optimization techniques.

Organization of the Chapters

Learning Outcomes

Each chapter opens with a list of learning outcomes which summarize the topics to be covered and act as a revision checklist.

Flow Diagrams

Within each chapter every section is introduced with a flow diagram to help the student to visualize and remember the topics covered in that section. A flow diagram from Chapter 5 is given below, as an example. Those concepts which will be described in the current section are shown in yellow boxes with arrows to show how they are connected to each other. For example two main types of optimal alignments will be described in this section of the chapter: local and global. Those concepts which were described in previous sections of the chapter are shown in grey boxes, so that the links can easily be seen between the topics of the current section and what has already been presented. For example, creating alignments requires methods for scoring gaps and for scoring substitutions, both of which have already been described in the chapter. In this way the major concepts and their inter-relationships are gradually built up throughout the chapter.

Mind Maps

Each chapter has a mind map, which is a specialized pedagogical feature, enabling the student to visualize and remember the steps that are necessary for specific applications. The mind map for Chapter 4 is given above, as an example. In this example, four main areas of the topic 'producing and analyzing sequence alignments' have been identified: measuring matches, database searching, aligning sequences, and families. Each of these areas, colored for clarity, is developed to identify the key concepts involved, creating a visual aid to help the reader see at a glance the range of the material covered in discussing this area. Occasionally there are important connections between distinct areas of the mind map, as here in linking BLAST and PHI-BLAST, with the latter method being derived directly from the former, but having a quite different function, and thus being in a different area of the mind map.

Illustrations

Each chapter is illustrated with four-color figures. Considerable care has been put into ensuring simplicity as well as consistency of representation across the book. Figure 4.16 is given below, as an example.

List of Symbols

Bioinformatics makes use of numerous symbols, many of which will be unfamiliar to those who do not already know the subject well. To help the reader navigate the symbols used in this book, a comprehensive list is given at the back which quotes each symbol, its definition, and where its most significant occurrences in the book are located.

Glossary

All technical terms are highlighted in bold where they first appear in the text and are then listed and explained in the Glossary. Further, each term in the Glossary also appears in the Index, so the reader can quickly gain access to the relevant pages where the term is covered in more detail. The book has been designed to cross-reference in as thorough and helpful a way as possible.

Garland Science Website

Garland Science has made available a number of supplementary resources on its website, which are freely available and do not require a password. For more details, go to www.garlandscience.com/gs_textbooks.asp and follow the link to Understanding Bioinformatics.

Artwork

All the figures in Understanding Bioinformatics are available to download from the Garland Science website. The artwork files are saved in zip format, with a single zip file for each chapter. Individual figures can then be extracted as jpg files.

Additional Material

The Garland Science website has some additional material relating to the topics in this book. For each of the seven parts a pdf is available, which provides a set of useful weblinks relevant to those chapters. These include weblinks to relevant and important databases and to file format definitions, as well as to free programs and to servers which permit data analysis on-line. In addition to these, the sets of data which were used to illustrate the methods of analysis are also provided. These will allow the reader to reanalyze the same data, reproducing the results shown here and trying out other techniques.

LIST OF REVIEWERS

The Authors and Publishers of Understanding Bioinformatics gratefully acknowledge the contribution of the following reviewers in the development of this book:

Stephen Altschul, National Center for Biotechnology Information, Bethesda, Maryland, USA
Petri Auvinen, Institute of Biotechnology, University of Helsinki, Finland
Joel Bader, Johns Hopkins University, Baltimore, USA
Tim Bailey, University of Queensland, Brisbane, Australia
Alex Bateman, Wellcome Trust Sanger Institute, Cambridge, UK
Meredith Betterton, University of Colorado at Boulder, USA
Andy Brass, University of Manchester, UK
Chris Bystroff, Rensselaer Polytechnic University, Troy, USA
Charlotte Deane, University of Oxford, UK
John Hancock, MRC Mammalian Genetics Unit, Harwell, Oxfordshire, UK
Steve Harris, University of Oxford, UK
Steve Henikoff, Fred Hutchinson Cancer Research Center, Seattle, USA
Jaap Heringa, Free University, Amsterdam, Netherlands
Sudha Iyengar, Case Western Reserve University, Cleveland, USA
Sun Kim, Indiana University Bloomington, USA
Patrice Koehl, University of California Davis, USA
Frank Lebeda, US Army Medical Research Institute of Infectious Diseases, Fort Detrick, Maryland, USA
David Liberles, University of Bergen, Norway
Peter Lockhart, Massey University, Palmerston North, New Zealand
James McInerney, National University of Ireland, Maynooth, Ireland
Nicholas Morris, University of Newcastle, UK
William Pearson, University of Virginia, Charlottesville, USA
Marialuisa Pellegrini-Calace, European Bioinformatics Institute, Cambridge, UK
Mihaela Pertea, University of Maryland, College Park, Maryland, USA
David Robertson, University of Manchester, UK
Rob Russell, EMBL, Heidelberg, Germany
Ravinder Singh, University of Colorado, USA
Deanne Taylor, Brandeis University, Waltham, Massachusetts, USA
Jen Taylor, University of Oxford, UK

CONTENTS IN BRIEF

PART 1 Background Basics
Chapter 1: The Nucleic Acid World
Chapter 2: Protein Structure
Chapter 3: Dealing With Databases

PART 2 Sequence Alignments
Chapter 4: Producing and Analyzing Sequence Alignments (Applications Chapter)
Chapter 5: Pairwise Sequence Alignment and Database Searching (Theory Chapter)
Chapter 6: Patterns, Profiles, and Multiple Alignments (Theory Chapter)

PART 3 Evolutionary Processes
Chapter 7: Recovering Evolutionary History (Applications Chapter)
Chapter 8: Building Phylogenetic Trees (Theory Chapter)

PART 4 Genome Characteristics
Chapter 9: Revealing Genome Features (Applications Chapter)
Chapter 10: Gene Detection and Genome Annotation (Theory Chapter)

PART 5 Secondary Structures
Chapter 11: Obtaining Secondary Structure from Sequence (Applications Chapter)
Chapter 12: Predicting Secondary Structures (Theory Chapter)

PART 6 Tertiary Structures
Chapter 13: Modeling Protein Structure (Applications Chapter)
Chapter 14: Analyzing Structure-Function Relationships (Applications Chapter)

PART 7 Cells and Organisms
Chapter 15: Proteome and Gene Expression Analysis
Chapter 16: Clustering Methods and Statistics
Chapter 17: Systems Biology

APPENDICES Background Theory
Appendix A: Probability, Information, and Bayesian Analysis
Appendix B: Molecular Energy Functions

Part 1 Background Basics

Chapter 1 The Nucleic Acid World

1.1 The Structure of DNA and RNA
DNA is a linear polymer of only four different bases
Two complementary DNA strands interact by base pairing to form a double helix
RNA molecules are mostly single stranded but can also have base-pair structures

1.2 DNA, RNA, and Protein: The Central Dogma
DNA is the information store, but RNA is the messenger
Messenger RNA is translated into protein according to the genetic code
Translation involves transfer RNAs and RNA-containing ribosomes

1.3 Gene Structure and Control
RNA polymerase binds to specific sequences that position it and identify where to begin transcription
The signals initiating transcription in eukaryotes are generally more complex than those in bacteria
Eukaryotic mRNA transcripts undergo several modifications prior to their use in translation
The control of translation

1.4 The Tree of Life and Evolution
A brief survey of the basic characteristics of the major forms of life
Nucleic acid sequences can change as a result of mutation

Summary
Further Reading

Chapter 2 Protein Structure

2.1 Primary and Secondary Structure
Protein structure can be considered on several different levels
Amino acids are the building blocks of proteins
The differing chemical and physical properties of amino acids are due to their side chains
Amino acids are covalently linked together in the protein chain by peptide bonds
Secondary structure of proteins is made up of α-helices and β-strands
Several different types of β-sheet are found in protein structures
Turns, hairpins and loops connect helices and strands

2.2 Implication for Bioinformatics
Certain amino acids prefer a particular structural unit
Evolution has aided sequence analysis
Visualization and computer manipulation of protein structures

2.3 Proteins Fold to Form Compact Structures
The tertiary structure of a protein is defined by the path of the polypeptide chain
The stable folded state of a protein represents a state of low energy
Many proteins are formed of multiple subunits

Summary

Chapter 3 Dealing with Databases

3.1 The Structure of Databases
Flat-file databases store data as text files
Relational databases are widely used for storing biological information
XML has the flexibility to define bespoke data classifications
Many other database structures are used for biological data
Databases can be accessed locally or online and often link to each other

3.2 Types of Database
There's more to databases than just data
Primary and derived data
How we define and connect things is very important: Ontologies

3.3 Looking for Databases
Sequence databases
Protein interaction databases
Structural databases

3.4 Data Quality
Nonredundancy is especially important for some applications of sequence databases
Automated methods can be used to check for data consistency
Initial analysis and annotation is usually automated
Human intervention is often required to produce the highest quality annotation
The importance of updating databases and entry identifier and version numbers

Summary
Further Reading

Part 2 Sequence Alignments

APPLICATIONS CHAPTER

Chapter 4 Producing and Analyzing Sequence Alignments

4.1 Principles of Sequence Alignment
Alignment is the task of locating equivalent regions of two or more sequences to maximize their similarity
Alignment can reveal homology between sequences
It is easier to detect homology when comparing protein sequences than when comparing nucleic acid sequences

4.2 Scoring Alignments
The quality of an alignment is measured by giving it a quantitative score
The simplest way of quantifying similarity between two sequences is percentage identity
The dot-plot gives a visual assessment of similarity based on identity
Genuine matches do not have to be identical
There is a minimum percentage identity that can be accepted as significant
There are many different ways of scoring an alignment

4.3 Substitution Matrices
Substitution matrices are used to assign individual scores to aligned sequence positions
The PAM substitution matrices use substitution frequencies derived from sets of closely related protein sequences
The BLOSUM substitution matrices use mutation data from highly conserved local regions of sequence
The choice of substitution matrix depends on the problem to be solved

4.4 Inserting Gaps
Gaps inserted in a sequence to maximize similarity require a scoring penalty
Dynamic programming algorithms can determine the optimal introduction of gaps

4.5 Types of Alignment
Different kinds of alignments are useful in different circumstances
Multiple sequence alignments enable the simultaneous comparison of a set of similar sequences
Multiple alignments can be constructed by several different techniques
Multiple alignments can improve the accuracy of alignment for sequences of low similarity
ClustalW can make global multiple alignments of both DNA and protein sequences
Multiple alignments can be made by combining a series of local alignments
Alignment can be improved by incorporating additional information

4.6 Searching Databases
Fast yet accurate search algorithms have been developed
FASTA is a fast database-search method based on matching short identical segments
BLAST is based on finding very similar short segments
Different versions of BLAST and FASTA are used for different problems
PSI-BLAST enables profile-based database searches
SSEARCH is a rigorous alignment method

4.7 Searching with Nucleic Acid or Protein Sequences
DNA or RNA sequences can be used either directly or after translation
The quality of a database match has to be tested to ensure that it could not have arisen by chance
Choosing an appropriate E-value threshold helps to limit a database search
Low-complexity regions can complicate homology searches
Different databases can be used to solve particular problems

4.8 Protein Sequence Motifs or Patterns
Creation of pattern databases requires expert knowledge
The BLOCKS database contains automatically compiled short blocks of conserved multiply aligned protein sequences

4.9 Searching Using Motifs and Patterns
The PROSITE database can be searched for
The pattern-based program PHI-BLAST searches for both homology and matching motifs
Patterns can be generated from multiple sequences using PRATT
The PRINTS database consists of fingerprints representing sets of conserved motifs that describe a protein family
The Pfam database defines profiles of protein families

4.10 Patterns and Protein Function
Searches can be made for particular functional sites in proteins
Sequence comparison is not the only way of analyzing protein sequences

Summary
Further Reading

THEORY CHAPTER

Chapter 5 Pairwise Sequence Alignment and Database Searching

5.1 Substitution Matrices and Scoring
Alignment scores attempt to measure the likelihood of a common evolutionary ancestor
The PAM (MDM) substitution scoring matrices were designed to trace the evolutionary origins of proteins
The BLOSUM matrices were designed to find conserved regions of proteins
Scoring matrices for nucleotide sequence alignment can be derived in similar ways
The substitution scoring matrix used must be appropriate to the specific alignment problem
Gaps are scored in a much more heuristic way than substitutions

5.2 Dynamic Programming Algorithms
Optimal global alignments are produced using efficient variations of the Needleman-Wunsch algorithm
Local and suboptimal alignments can be produced by making small modifications to the dynamic programming algorithm
Time can be saved with a loss of rigor by not calculating the whole matrix

5.3 Indexing Techniques and Algorithmic Approximations
Suffix trees locate the positions of repeats and unique sequences
Hashing is an indexing technique that lists the starting positions of all k-tuples
The FASTA algorithm uses hashing and chaining for fast database searching
The BLAST algorithm makes use of finite-state automata
Comparing a nucleotide sequence directly with a protein sequence requires special modifications to the BLAST and FASTA algorithms

5.4 Alignment Score Significance
The statistics of gapped local alignments can be approximated by the same theory

5.5 Aligning Complete Genome Sequences
Indexing and scanning whole genome sequences efficiently is crucial for the sequence alignment of higher organisms
The complex evolutionary relationships between the genomes of even closely related organisms require novel alignment algorithms

Summary

Chapter 6 Patterns, Profiles, and Multiple Alignments

6.1 Profiles and Sequence Logos
Position-specific scoring matrices are an extension of substitution scoring matrices
Methods for overcoming a lack of data in deriving the values for a PSSM
PSI-BLAST is a sequence database searching program
Representing a profile as a logo

6.2 Profile Hidden Markov Models
The basic structure of HMMs used in sequence alignment to profiles
Estimating HMM parameters using aligned sequences
Scoring a sequence against a profile HMM: The most probable path and the sum over all paths
Estimating HMM parameters using unaligned sequences

6.3 Aligning Profiles
Comparing two PSSMs by alignment
Aligning profile HMMs

6.4 Multiple Sequence Alignments by Gradual Sequence Addition
The order in which sequences are added is chosen based on the estimated likelihood of incorporating errors in the alignment
Many different scoring schemes have been used in constructing multiple alignments
The multiple alignment is built using the guide tree and profile methods and may be further refined

6.5 Other Ways of Obtaining Multiple Alignments
The multiple sequence alignment program DIALIGN aligns ungapped blocks
The SAGA method of multiple alignment uses a genetic algorithm

6.6 Sequence Pattern Discovery
Discovering patterns in a multiple alignment: eMOTIF and AACC
Probabilistic searching for common patterns in sequences: Gibbs and MEME
Searching for more general sequence patterns

Summary

Part 3 Evolutionary Processes

Chapter 7 Recovering Evolutionary History

7.1 The Structure and Interpretation of Phylogenetic Trees 225

Phylogenetic trees reconstruct evolutionary relationships 225

Tree topology can be described in several ways 230

Consensus and condensed trees report the results of comparing tree topologies 232

7.2 Molecular Evolution and its Consequences 235

Most related sequences have many positions that have mutated several times 236

The rate of accepted mutation is usually not the same for all types of base substitution 236

Different codon positions have different mutation rates 238

Only orthologous genes should be used to construct species phylogenetic trees 239

Major changes affecting large regions of the genome are surprisingly common 247

7.3 Phylogenetic Tree Reconstruction 248

Small ribosomal subunit rRNA sequences are well suited to reconstructing the evolution of species 249

The choice of the method for tree reconstruction depends to some extent on the size and quality of the dataset 249

A model of evolution must be chosen to use with the method 251

All phylogenetic analyses must start with an accurate multiple alignment 255

Phylogenetic analyses of a small dataset of 16S RNA sequence data 255

Building a gene tree for a family of enzymes can help to identify how enzymatic functions evolved 259

Summary 264

Chapter 8 Building Phylogenetic Trees

8.1 Evolutionary Models and the Calculation of Evolutionary Distance 268

A simple but inaccurate measure of evolutionary distance is the p-distance 268

The Poisson distance correction takes account of multiple mutations at the same site 270

The Gamma distance correction takes account of mutation rate variation at different sequence positions 270

The Jukes-Cantor model reproduces some basic features of the evolution of nucleotide sequences 271

More complex models distinguish between the relative frequencies of different types of mutation 272

There is a nucleotide bias in DNA sequences 275

Models of protein-sequence evolution are closely related to the substitution matrices used for sequence alignment 276

8.2 Generating Single Phylogenetic Trees 276

Clustering methods produce a phylogenetic tree based on evolutionary distances 276

The UPGMA method assumes a constant molecular clock and produces an ultrametric tree 278

The Fitch-Margoliash method produces an unrooted additive tree 279

The neighbor-joining method is related to the concept of minimum evolution 282

Stepwise addition and star-decomposition methods are usually used to generate starting trees for further exploration, not the final tree 285

8.3 Generating Multiple Tree Topologies 286

The branch-and-bound method greatly improves the efficiency of exploring tree topology 288

Optimization of tree topology can be achieved by making a series of small changes to an existing tree 288

Finding the root gives a phylogenetic tree a direction in time 291

8.4 Evaluating Tree Topologies 293

Functions based on evolutionary distances can be used to evaluate trees 293

Unweighted parsimony methods look for the trees with the smallest number of mutations 297

Mutations can be weighted in different ways in the parsimony method 300

Trees can be evaluated using the maximum likelihood method 302

The quartet-puzzling method also involves maximum likelihood in the standard implementation 305

Bayesian methods can also be used to reconstruct phylogenetic trees 306

8.5 Assessing the Reliability of Tree Features and Comparing Trees 307

The long-branch attraction problem can arise even with perfect data and methodology 308

Tree topology can be tested by examining the interior branches 309

Tests have been proposed for comparing two or more alternative trees 310

Summary 311

Part 4 Genome Characteristics

Chapter 9 Revealing Genome Features

9.1 Preliminary Examination of Genome Sequence 318

Whole genome sequences can be split up to simplify gene searches 319

Structural RNA genes and repeat sequences can be excluded from further analysis 319

Homology can be used to identify genes in both prokaryotic and eukaryotic genomes 322

9.2 Gene Prediction in Prokaryotic Genomes 322

9.3 Gene Prediction in Eukaryotic Genomes 323

Programs for predicting exons and introns use a variety of approaches 323

Gene predictions must preserve the correct reading frame 324

Some programs search for exons using only the query sequence and a model for exons 327

Some programs search for genes using only the query sequence and a gene model 332

Genes can be predicted using a gene model and sequence similarity 334

Genomes of related organisms can be used to improve gene prediction 336

9.4 Splice Site Detection 337

Splice sites can be detected independently by specialized programs 338

9.5 Prediction of Promoter Regions 338

Prokaryotic promoter regions contain relatively well-defined motifs 339

Eukaryotic promoter regions are typically more complex than prokaryotic promoters 340

A variety of promoter-prediction methods are available online 340

Promoter prediction results are not very clear-cut 341

9.6 Confirming Predictions 342

There are various methods for calculating the accuracy of gene-prediction programs 342

Translating predicted exons can confirm the correctness of the prediction 343

Constructing the protein and identifying homologs 343

9.7 Genome Annotation 346

Genome annotation is the final step in genome analysis 347

Gene ontology provides a standard vocabulary for gene annotation 348

9.8 Large Genome Comparisons 353

Summary 354

Chapter 10 Gene Detection and Genome Annotation

10.1 Detection of Functional RNA Molecules Using Decision Trees 361

Detection of tRNA genes using the tRNAscan algorithm 361

Detection of tRNA genes in eukaryotic genomes 362

10.2 Features Useful for Gene Detection in Prokaryotes 364

10.3 Algorithms for Gene Detection in Prokaryotes 368

GeneMark uses inhomogeneous Markov chains and dicodon statistics 368

GLIMMER uses interpolated Markov models of coding potential 371

ORPHEUS uses homology, codon statistics, and ribosome-binding sites 372

GeneMark.hmm uses explicit state duration hidden Markov models 373

EcoParse is an HMM gene model 376

10.4 Features Used in Eukaryotic Gene Detection 377

Differences between prokaryotic and eukaryotic genes 377

Introns, exons, and splice sites 379

Promoter sequences and binding sites for

10.5 Predicting Eukaryotic Gene Signals 381

Detection of core promoter binding signals is a key element of some eukaryotic gene-prediction methods 381

A set of models has been designed to locate the site of core promoter sequence signals 383

Predicting promoter regions from general sequence properties can reduce the numbers of false-positive results 387

Predicting eukaryotic transcription and translation start sites 389

Translation and transcription stop signals complete the gene definition 389

10.6 Predicting Exon/Intron Structure 389

Exons can be identified using general sequence properties 390

Splice-site prediction 392

Splice sites can be predicted by sequence patterns combined with base statistics 393

GenScan uses a combination of weight matrices and decision trees to locate splice sites 394

GeneSplicer predicts splice sites using first-order Markov chains 394

NetPlantGene uses neural networks with intron and exon predictions to predict splice sites 395

Other splicing features may yet be exploited for splice-site prediction 396

Specific methods exist to identify initial and terminal exons 396

Exons can be defined by searching databases for homologous regions 397

10.7 Complete Eukaryotic Gene Models 397

10.8 Beyond the Prediction of Individual Genes 399

Functional annotation 400

Comparison of related genomes can help resolve uncertain predictions 403

Evaluation and reevaluation of gene-detection methods 405

Summary 405

Part 5 Secondary Structures

Chapter 11 Obtaining Secondary Structure from Sequence

11.1 Types of Prediction Methods 413

Statistical methods are based on rules that give the probability that a residue will form part of a particular secondary structure 414

Nearest-neighbor methods are statistical methods that incorporate additional information about protein structure 414

Machine-learning approaches to secondary structure prediction mainly make use of neural networks and HMM methods 415

11.2 Training and Test Databases 416

There are several ways to define protein secondary structures 417

11.3 Assessing the Accuracy of Prediction Programs 417

Q3 measures the accuracy of individual residue assignments 417

Secondary structure predictions should not be expected to reach 100% residue accuracy 418

The Sov value measures the prediction accuracy for whole elements 419

CAFASP/CASP: Unbiased and readily available protein prediction assessments 419

11.4 Statistical and Knowledge-Based Methods 421

The GOR method uses an information theory approach 422

The program Zpred includes multiple alignment of homologous sequences and residue conservation information 425

There is an overall increase in prediction accuracy using multiple sequence information 426

The nearest-neighbor method: The use of multiple nonhomologous sequences 428

PREDATOR is a combined statistical and knowledge-based program that includes the nearest-neighbor approach 428

11.5 Neural Network Methods of Secondary Structure Prediction 430

Assessing the reliability of neural net predictions 432

Several examples of Web-based neural network secondary structure prediction programs 432

PROF: Protein forecasting 434

PSIPRED 434

Jnet: Using several alternative representations of the sequence alignment 434

11.6 Some Secondary Structures Require Specialized Prediction Methods 435

Transmembrane proteins 436

Quantifying the preference for a membrane environment 437

11.7 Prediction of Transmembrane Protein Structure 438

Multi-helix membrane proteins 439

A selection of prediction programs to predict

Statistical methods 443

Knowledge-based prediction 443

Evolutionary information from protein families improves the prediction 444

Neural nets in transmembrane prediction 445

Predicting transmembrane helices with hidden Markov models 446

Comparing the results: What to choose 447

What happens if a non-transmembrane protein is submitted to transmembrane prediction programs 448

Prediction of transmembrane structure containing β-strands 448

11.8 Coiled-coil Structures 451

The COILS prediction program 452

PAIRCOIL and MULTICOIL are an extension of the COILS algorithm 453

Zipping the Leucine zipper: A specialized coiled coil 453

11.9 RNA Secondary Structure Prediction 455

Summary 458

THEORY CHAPTER

Chapter 12 Predicting Secondary Structures

12.1 Defining Secondary Structure and Prediction Accuracy 463

The definitions used for automatic protein secondary structure assignment do not give identical results 464

There are several different measures of the accuracy of secondary structure prediction 469

12.2 Secondary Structure Prediction Based on Residue Propensities 472

Each structural state has an amino acid preference which can be assigned as a residue propensity 473

The simplest prediction methods are based on the average residue propensity over a sequence window 476

Residue propensities are modulated by nearby sequence 479

Predictions can be significantly improved by including information from homologous sequences 484

12.3 The Nearest-Neighbor Methods are Based on Sequence Segment Similarity 485

Short segments of similar sequence are found to have similar structure 487

Several sequence similarity measures have been used to identify nearest-neighbor segments 488

A weighted average of the nearest-neighbor segment structures is used to make the prediction 490

A nearest-neighbor method has been developed to predict regions with a high potential to misfold 491

12.4 Neural Networks Have Been Employed Successfully for Secondary Structure Prediction 492

Layered feed-forward neural networks can transform a sequence into a structural prediction 494

Inclusion of information on homologous sequences improves neural network accuracy 502

More complex neural nets have been applied to predict secondary and other structural features 503

12.5 Hidden Markov Models Have Been Applied to Structure Prediction 504

HMM methods have been found especially effective for transmembrane proteins 506

Nonmembrane protein secondary structures can also be successfully predicted with HMMs 509

12.6 General Data Classification Techniques Can Predict Structural Features 510

Support vector machines have been successfully used for protein structure prediction 511

Discriminants, SOMs, and other methods have also been used 512

Summary 514

Part 6 Tertiary Structures

Chapter 13 Modeling Protein Structure

13.1 Potential Energy Functions and Force Fields 524

The conformation of a protein can be visualized in terms of a potential energy surface 525

Conformational energies can be described by simple mathematical functions 525

Similar force fields can be used to represent conformational energies in the presence of averaged environments 526

Potential energy functions can be used to assess a modeled structure 527

Energy minimization can be used to refine a modeled structure and identify local energy minima 527

Molecular dynamics and simulated annealing are used to find global energy minima 528

13.2 Obtaining a Structure by Threading 529

The prediction of protein folds in the absence of known structural homologs 531

Libraries or databases of nonredundant protein folds are used in threading 531

Two distinct types of scoring schemes have been used in threading methods 531

Dynamic programming methods can identify optimal alignments of target sequences and

Several methods are available to assess the confidence to be put on the fold prediction 534

The C2-like domain from the Dictyostelia: A practical example of threading 535

13.3 Principles of Homology Modeling 537

Closely related target and template sequences give better models 539

Significant sequence identity depends on the length of the sequence 540

Homology modeling has been automated to deal with the numbers of sequences that can now be modeled 541

Model building is based on a number of assumptions 541

13.4 Steps in Homology Modeling 542

Structural homologs to the target protein are found in the PDB 543

Accurate alignment of target and template sequences is essential for successful modeling 543

The structurally conserved regions of a protein are modeled first 544

The modeled core is checked for misfits before proceeding to the next stage 545

Sequence realignment and remodeling may improve the structure 545

Insertions and deletions are usually modeled as loops 545

Nonidentical amino acid side chains are modeled mainly by using rotamer libraries 547

Energy minimization is used to relieve structural errors 548

Molecular dynamics can be used to explore possible conformations for mobile loops 548

Models need to be checked for accuracy 549

How far can homology models be trusted? 551

13.5 Automated Homology Modeling 552

The program MODELLER models by satisfying protein structure constraints 553

COMPOSER uses fragment-based modeling to automatically generate a model 553

Automated methods available on the Web for comparative modeling 554

Assessment of structure prediction 554

13.6 Homology Modeling of PI3 Kinase p110α 557

Swiss-Pdb Viewer can be used for manual or semi-manual modeling 557

Alignment, core modeling, and side-chain modeling are carried out all in one 558

The loops are modeled from a database of possible structures 559

Energy minimization and quality inspection can be carried out within Swiss-Pdb Viewer 559

MolIDE is a downloadable semi-automatic modeling package 560

Automated modeling on the Web illustrated with p110α kinase 561

Modeling a functionally related but sequentially dissimilar protein: mTOR 563

Generating a multidomain three-dimensional structure from sequence 564

Summary 564

Chapter 14 Analyzing Structure-Function Relationships

14.1 Functional Conservation 568

Functional regions are usually structurally conserved 569

Similar biochemical function can be found in proteins with different folds 570

Fold libraries identify structurally similar proteins regardless of function 571

14.2 Structure Comparison Methods 574

Finding domains in proteins aids structure comparison 574

Structural comparisons can reveal conserved functional elements not discernible from a sequence comparison 576

The CE method builds up a structural alignment from pairs of aligned protein segments 576

The Vector Alignment Search Tool (VAST) aligns secondary structural elements 577

DALI identifies structure superposition without maintaining segment order 578

FATCAT introduces rotations between rigid segments 579

14.3 Finding Binding Sites 580

Highly conserved, strongly charged, or hydrophobic surface areas may indicate interaction sites 582

Searching for protein-protein interactions using surface properties 584

Surface calculations highlight clefts or holes in a protein that may serve as binding sites 585

Looking at residue conservation can identify binding sites 586

14.4 Docking Methods and Programs 587

Simple docking procedures can be used when the structure of a homologous protein bound to a ligand analog is known 588

Specialized docking programs will automatically

Scoring functions are used to identify the most likely docked ligand 590

The DOCK program is a semirigid-body method that analyzes shape and chemical complementarity of ligand and binding site 590

Fragment docking identifies potential substrates by predicting types of atoms and functional groups in the binding area 591

GOLD is a flexible docking program, which utilizes a genetic algorithm 591

The water molecules in binding sites should also be considered 592

Summary 593

Part 7 Cells and Organisms

Chapter 15 Proteome and Gene Expression Analysis

15.1 Analysis of Large-scale Gene Expression 601

The expression of large numbers of different genes can be measured simultaneously by DNA microarrays 602

Gene expression microarrays are mainly used to detect differences in gene expression in different conditions 602

Serial analysis of gene expression (SAGE) is also used to study global patterns of gene expression 604

Digital differential display uses bioinformatics and statistics to detect differential gene expression in different tissues 605

Facilitating the integration of data from different places and experiments 606

The simplest method of analyzing gene expression microarray data is hierarchical cluster analysis 606

Techniques based on self-organizing maps can be used for analyzing microarray data 608

Self-organizing tree algorithms (SOTAs) cluster from the top down by successive subdivision of clusters 610

Clustered gene expression data can be used as a tool for further research 610

15.2 Analysis of Large-scale Protein Expression 612

Two-dimensional gel electrophoresis is a method for separating the individual proteins in a cell 613

Measuring the expression levels shown in 2D gels 614

Differences in protein expression levels between different samples can be detected by 2D gels 615

Clustering methods are used to identify protein spots with similar expression patterns 615

Principal component analysis (PCA) is an alternative to clustering for analyzing microarray and 2D gel data 618

The changes in a set of protein spots can be tracked over a number of different samples 618

Databases and online tools are available to aid the interpretation of 2D gel data 620

Protein microarrays allow the simultaneous detection of the presence or activity of large numbers of different proteins 621

Mass spectrometry can be used to identify the proteins separated and purified by 2D gel electrophoresis or other means 621

Protein-identification programs for mass spectrometry are freely available on the Web 622

Mass spectrometry can be used to measure protein concentration 623

Summary 623

Chapter 16 Clustering Methods and Statistics

16.1 Expression Data Require Preparation Prior to Analysis 626

Data normalization is designed to remove systematic experimental errors 627

Expression levels are often analyzed as ratios and are usually transformed by taking logarithms 628

Sometimes further normalization is useful after the data transformation 630

Principal component analysis is a method for combining the properties of an object 631

16.2 Cluster Analysis Requires Distances to be Defined Between all Data Points 633

Euclidean distance is the measure used in everyday life 634

The Pearson correlation coefficient measures distance in terms of the shape of the expression response 635

The Mahalanobis distance takes account of the variation and correlation of expression responses 636

16.3 Clustering Methods Identify Similar and Distinct Expression Patterns 637

Hierarchical clustering produces a related set of alternative partitions of the data 639

k-means clustering groups data into several clusters but does not determine a relationship between clusters 641

Self-organizing maps (SOMs) use neural network methods to cluster data into a predetermined number of clusters 644

Evolutionary clustering algorithms use selection, recombination, and mutation to find the best

The self-organizing tree algorithm (SOTA) determines the number of clusters required 648

Biclustering identifies a subset of similar expression level patterns occurring in a subset of the samples 649

The validity of clusters is determined by independent methods 650

16.4 Statistical Analysis can Quantify the Significance of Observed Differential Expression 651

t-tests can be used to estimate the significance of the difference between two expression levels 654

Nonparametric tests are used to avoid making assumptions about the data sampling 656

Multiple testing of differential expression requires special techniques to control error rates 657

16.5 Gene and Protein Expression Data Can be Used to Classify Samples 659

Many alternative methods have been proposed that can classify samples 660

Support vector machines are another form of supervised learning algorithms that can produce classifiers 661

Summary 662

Chapter 17 Systems Biology

17.1 What is a System? 669

A system is more than the sum of its parts 669

A biological system is a living network 670

Databases are useful starting points in constructing a network 671

To construct a model more information is needed than a network 672

There are three possible approaches to constructing a model 674

Kinetic models are not the only way in systems biology 678

17.2 Structure of the Model 679

Control circuits are an essential part of any biological system 680

The interactions in networks can be represented as simple differential equations 680

17.3 Robustness of Biological Systems 683

Robustness is a distinct feature of complexity in biology 684

Modularity plays an important part in robustness 685

Redundancy in the system can provide robustness 686

Living systems can switch from one state to another by means of bistable switches 688

17.4 Storing and Running System Models 689

Specialized programs make simulating systems easier 691

Standardized system descriptions aid their storage and reuse 692

Summary 692

APPENDICES Background Theory

Appendix A: Probability, Information, and Bayesian Analysis

Probability Theory, Entropy, and Information 695

Mutually exclusive events 695

Occurrence of two events 696

Occurrence of two random variables 696

Bayesian Analysis 697

Bayes' theorem 697

Inference of parameter values 698

Appendix B: Molecular Energy Functions

Force Fields for Calculating Intra- and Intermolecular Interaction Energies 701

Bonding terms 702

Nonbonding terms 704

Potentials used in Threading 706

Potentials of mean force 706

Potential terms relating to solvent effects 707

Appendix C: Function Optimization

Full Search Methods 710

Dynamic programming and branch-and-bound 710

Local Optimization 710

The downhill simplex method 711

The steepest descent method 711

The conjugate gradient method 714

Methods using second derivatives 714

Thermodynamic Simulation and Global Optimization 715

Monte Carlo and genetic algorithms 716

Molecular dynamics 718

Simulated annealing 719

Summary 719

Further Reading 719

List of Symbols 721

Glossary 734

Index 751

PART 1

1

THE NUCLEIC ACID WORLD

When you have read Chapter 1, you should be able to:

State the chemical structures of nucleic acids.

Explain base-pairing and the double helix.

Explain how DNA stores genetic information.

Summarize the intermediate role of mRNA between DNA and proteins.

Outline how mRNA is translated into protein by ribosomes.

Outline how gene control is exercised by binding to short nucleotide sequences.

Show that eukaryotic mRNA often has segments (introns) removed before translation.

Discuss how all life probably evolved from a single common ancestor.

information about the biological context of the bioinformatics problems we discuss is also given in the biology boxes and glossary items throughout the book. This chapter will deal with the nucleic acids, deoxyribonucleic acid (DNA) and ribonucleic acid (RNA), and how they encode proteins, while the structure and functions of proteins themselves will be discussed in Chapter 2. In these two chapters we shall also discuss how DNA changes its information-coding and functional properties over time as a result of the processes of mutation, giving rise to the enormous diversity of life, and the need for bioinformatics to understand it.

Mind Map 1.1

A mind map schematic of the topics covered in this chapter and divided, in general, according to the topic sections. This is to help you visualize the topics, understand the structure of the chapter, and memorize the important elements.

The main role of DNA is information storage. In all living cells, from unicellular bacteria to multicellular plants and animals, DNA is the material in which genetic instructions are stored and is the chemical structure in which these instructions are transmitted from generation to generation; all the information required to make and maintain a new organism is stored in its DNA. Incredibly, the information required to reproduce even very complex organisms is stored on a relatively small number of DNA molecules. This set of molecules is called the organism's genome. In humans there are just 46 DNA molecules in most cells, one in each chromosome. Each DNA molecule is copied before cell division, and the copies are distributed such that each daughter cell receives a full set of genetic information. The basic set of 46 DNA molecules together encode everything needed to make a human being. (We will skip over the important influence of the environment and the nature-nurture debate, as they are not relevant to this book.)

Proteins are manufactured using the information encoded in DNA and are the molecules that direct the actual processes on which life depends. Processes essential to life, such as energy metabolism, biosynthesis, and intercellular communication, are all carried out through the agency of proteins. A few key processes such as the