Understanding
Bioinformatics
Senior Publisher: Jackie Harbor
Editor: Dom Holdsworth
Development Editor: Eleanor Lawrence
Illustrations: Nigel Orme
Typesetting: Georgina Lucas
Cover design: Matthew McClements, Blink Studio Limited
Production Manager: Tracey Scarlett
Copyeditor: Jo Clayton
Proofreader: Sally Livitt
Accuracy Checking: Eleni Rapsomaniki
Indexer: Lisa Furnival
Vice President: Denise Schanck
© 2008 by Garland Science, Taylor & Francis Group, LLC
This book contains information obtained from authentic and highly regarded sources. Reprinted material is quoted with permission, and sources are indicated. Every attempt has been made to source the figures accurately. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or for the consequences of their use. All rights reserved. No part of this book covered by the copyright herein may be reproduced or used in any format in any form or by any means—graphic, electronic, or mechanical, including photocopying, recording, taping, or information storage and retrieval systems—without permission of the publisher.
10-digit ISBN 0-8153-4024-9 (paperback) 13-digit ISBN 978-0-8153-4024-9 (paperback)
Library of Congress Cataloging-in-Publication Data Zvelebil, Marketa J.
Understanding bioinformatics / Marketa Zvelebil & Jeremy O. Baum. p. ; cm.
Includes bibliographical references and index. ISBN-13: 978-0-8153-4024-9 (pbk.)
ISBN-10: 0-8153-4024-9 (pbk.) 1. Bioinformatics.
[DNLM: 1. Computational Biology-methods. QU 26.5 Z96u 2008] I. Baum, Jeremy O. II. Title. QH324.2.Z84 2008
572.80285-dc22
2007027514
Published by Garland Science, Taylor & Francis Group, LLC, an informa business 270 Madison Avenue, New York, NY 10016, USA, and
2 Park Square, Milton Park, Abingdon, OX14 4RN, UK. Printed in the United States of America.
PREFACE
The analysis of data arising from biomedical research has undergone a revolution over the last 15 years, brought about by the combined impact of the Internet and the development of increasingly sophisticated and accurate bioinformatics techniques. All research workers in the areas of biomolecular science and biomedicine are now expected to be competent in several areas of sequence analysis and often, additionally, in protein structure analysis and other more advanced bioinformatics techniques.
When we began our research careers in the early 1980s all of the techniques that now comprise bioinformatics were restricted to specialists, as databases and user-friendly applications were not readily available and had to be installed on laboratory computers. By the mid-1990s many datasets and analysis programs had become available on the Internet, and the scientists who produced sequences began to take on tasks such as sequence alignment themselves. However, there was a delay in providing comprehensive training in these techniques. At the end of the 1990s we started to expand our teaching of bioinformatics at both undergraduate and postgraduate level. We soon realized that there was a need for a textbook that bridged the gap between the simplistic introductions available, which concentrated on results almost to the exclusion of the underlying science, and the very detailed monographs, which presented the theoretical underpinnings of a restricted set of techniques. This textbook is our attempt to fill that gap.
Therefore on the one hand we wanted to include material explaining the program methods, because we believe that to perform a proper analysis it is not sufficient to understand how to use a program and the kind of results (and errors!) it can produce. It is also necessary to have some understanding of the technique used by the program and the science on which it is based. But on the other hand, we wanted this book to be accessible to the bioinformatics beginner, and we recognized that even the more advanced students occasionally just want a quick reminder of what an application does, without having to read through the theory behind it.
From this apparent dilemma was born the division into Applications and Theory Chapters. Throughout the book, we wrote dedicated Applications Chapters to provide a working knowledge of bioinformatics applications, quick and easy to grasp. In most places, an Applications Chapter is then followed by a Theory Chapter, which explains the program methods and the science behind them. Inevitably, we found this created a small amount of duplication between some chapters, but to us this was a small sacrifice if it left the reader free to choose at what level they could engage with the subject of bioinformatics.
We have created a book that will serve as a comfortable introduction to any new student of bioinformatics, but which they can continue to use into their postgraduate studies. The book assumes a certain level of understanding of the background biology, for example gene and protein structure, where it is important to appreciate the variety that exists and not only know the canonical examples of first-year textbooks. In addition, to describe the techniques in detail a level of mathematics is required which is more appropriate for more advanced students. We are aware that many postgraduate students of bioinformatics have a background in areas such as computer science and mathematics. They will find many familiar algorithmic approaches presented, but will see their application in unfamiliar territory. As they read the book they will also appreciate that to become truly competent at bioinformatics they will require knowledge of biomedical science.
There is a certain amount of frustration inherent in producing any book, as the writing process seems often to be as much about what cannot be included as what can. Bioinformatics as a subject has already expanded to such an extent that we had to be careful not to diminish the book's teaching value by trying to squeeze every possible topic into it. We have tried to include as broad a range of subjects as possible, but some have been omitted. For example, we do not deal with the methods of constructing a nucleotide sequence from the individual reads, nor with a number of more specialized aspects of genome annotation.
The final chapter is an introduction to the even-faster-moving subject of systems biology. Again, we had to balance the desire to say more against the practical constraints of space. But we hope this chapter gives readers a flavor of what the subject covers and the questions it is trying to answer. The chapter will not answer every reader's every query about systems biology, but if it prompts more of them to inquire further, that is already an achievement.
We wish to acknowledge many people who have helped us with this project. We would almost certainly not have got here without the enthusiasm and support of Matthew Day who guided us through the process of getting a first draft. Getting from there to the finished book was made possible by the invaluable advice and encouragement from Chris Dixon, Dom Holdsworth, Jackie Harbor, and others from Garland Science. We also wish to thank Eleanor Lawrence for her skills in massaging our text into shape, and Nigel Orme for producing the wonderful illustrations. We received inspiration and encouragement from many others, too many to name here, but including our students and those who read our draft chapters. Finally, we wish to thank the many friends and family members who have had to suffer while we wrote this book. In particular JB wishes to thank his wife Hilary for her encouragement and perseverance. MZ wishes to specially thank her parents, Martin Scurr, Nick Lee, and her colleagues at work.
Marketa Zvelebil
Jeremy O. Baum
May 2007
A NOTE TO THE READER
Organization of this Book
Applications and Theory Chapters
Careful thought has gone into the organization of this book. The chapters are grouped in two ways. Firstly, the chapters are organized into seven parts according to topic. Within the parts, there is a second, less traditional, level of organization: most chapters are designated as either Applications or Theory Chapters. This book is designed to be accessible both to students who wish to obtain a working knowledge of the bioinformatics applications and to students who want to know how the applications work and maybe write their own. So at the start of most parts, there are dedicated Applications Chapters, which deal with the more practical aspects of the particular research area, and are intended to act as a useful hands-on introduction. Following these are Theory Chapters, which explain the science, theory, and techniques employed in generally available applications. These are more demanding and should preferably be read after having gained a little experience of running the programs. In order to become truly proficient in the techniques you need to read and understand these more technical aspects. On the opening page of each chapter, and in the Table of Contents, it is clearly indicated whether it is an Applications or a Theory Chapter.
Part 1: Background Basics
Background Basics provides three introductory chapters to key knowledge that will be assumed throughout the remainder of the book. The first two chapters contain material that should be well-known to readers with a background in biomedical science. The first chapter describes the structure of nucleic acids and some of the roles played by them in living systems, including a brief description of how the genomic DNA is transcribed into mRNA and then translated into protein. The second chapter describes the structure and organization of proteins. Both of these chapters present only the most basic information required, and should not in any way be regarded as an adequate grounding in these topics for serious work. The intention is to provide enough information to make this book self-sufficient. The third chapter in this part describes databases, again at a very introductory level. Many biomedical research workers have large datasets to analyze, and these need to be stored in a convenient and practical way. Databases can provide a complete solution to this problem.
Part 2: Sequence Alignments
Sequence Alignments contains three chapters that deal with a variety of analyses of sequences, all relating to identifying similarities. Chapter 4 is a practical introduction to the area, following some examples through different analyses and showing some potential problems as well as successful results. Chapters 5 and 6 deal with several of the many different techniques used in sequence analysis. Chapter 5 focuses on the general aspects of aligning two sequences and the specific methods employed in database searches. A number of techniques are described in detail, including dynamic programming, suffix trees, hashing, and chaining. Chapter 6 deals with methods involving many sequences, defining commonly occurring patterns, defining the profile of a family of related proteins, and constructing a multiple alignment. A key technique presented in this chapter is that of hidden Markov models (HMMs).
Part 3: Evolutionary Processes
Evolutionary Processes presents the methods used to obtain phylogenetic trees from a sequence dataset. These trees are reconstructions of the evolutionary history of the sequences, assuming that they share a common ancestor. Chapter 7 explains some of the basic concepts involved, and then shows how the different methods can be applied to two different scientific problems. In Chapter 8 details are given of the techniques involved and how they relate to the assumptions made about the evolutionary processes.
Part 4: Genome Characteristics
Genome Characteristics deals with the analysis required to interpret raw genome sequence data. Although by the time a genome sequence is published in the research journals some preliminary analysis will have been carried out, often the unanalyzed sequence is available before then. This part describes some of the techniques that can be used to try to locate genes in the sequence. Chapter 9 describes some of the range of programs available, shows how complex their output can be, and illustrates some of the possible pitfalls. Chapter 10 presents a survey of the techniques used, especially different Markov models and how models of whole genes can be built up from models of individual components such as ribosome-binding sites.
Part 5: Secondary Structures
Secondary Structures provides two chapters on methods of predicting secondary structures based on sequence (or primary structure). Chapter 11 introduces the methods of secondary structure prediction and discusses the various techniques and ways to interpret the results. Later sections of the chapter deal with prediction of more specialized secondary structure such as protein transmembrane regions, coiled coil and leucine zipper structures, and RNA secondary structures. Chapter 12 presents the underlying principles and details of the prediction methods from basic concepts to in-depth understanding of techniques such as neural networks and Markov models applied to this problem.
Part 6: Tertiary Structures
Tertiary Structures extends the material in Part 5 to enable the prediction and modeling of protein tertiary and quaternary structure. Chapter 13 introduces the reader to the concepts of energy functions, minimization, and ab initio prediction. It deals in more detail with the method of threading and focuses on homology modeling of protein structures, taking the student in a stepwise fashion through the process. The chapter ends with example studies to illustrate the techniques. Chapter 14 contains methods and techniques for further analysis of structural information and describes the importance of structure and function relationships. This chapter deals with how fold prediction can help to identify function, as well as giving an introduction to ligand docking and drug design.
Part 7: Cells and Organisms
Cells and Organisms consists of two chapters that deal in some detail with expression analysis and an introductory chapter on systems biology. Chapter 15 introduces the techniques available to analyze protein and gene expression data. It shows the reader the information that can be learned from these experimental techniques as well as how the information could be used for further analysis. Chapter 16 presents some of the clustering techniques and statistics that are touched upon in Chapter 15 and are commonly used in gene and protein expression analysis. Chapter 17 is a standalone chapter dealing with the modeling of systems processes. It introduces the reader to the basic concepts of systems biology, and shows what this exciting and rapidly growing field may achieve in the future.
Appendices
Three appendices are provided that expand on some of the concepts mentioned in the main part of this book. These are useful for the more inquisitive and advanced reader. Appendix A deals with probability and Bayesian analysis, Appendix B is mainly associated with Part 6 and deals with molecular energy functions, while Appendix C describes function optimization techniques.
Organization of the Chapters
Learning Outcomes
Each chapter opens with a list of learning outcomes which summarize the topics to be covered and act as a revision checklist.
Flow Diagrams
Within each chapter every section is introduced with a flow diagram to help the student to visualize and remember the topics covered in that section. A flow diagram from Chapter 5 is given below, as an example. Those concepts which will be described in the current section are shown in yellow boxes with arrows to show how they are connected to each other. For example two main types of optimal alignments will be described in this section of the chapter: local and global. Those concepts which were described in previous sections of the chapter are shown in grey boxes, so that the links can easily be seen between the topics of the current section and what has already been presented. For example, creating alignments requires methods for scoring gaps and for scoring substitutions, both of which have already been described in the chapter. In this way the major concepts and their inter-relationships are gradually built up throughout the chapter.
Mind Maps
Each chapter has a mind map, which is a specialized pedagogical feature, enabling the student to visualize and remember the steps that are necessary for specific applications. The mind map for Chapter 4 is given above, as an example. In this example, four main areas of the topic 'producing and analyzing sequence alignments' have been identified: measuring matches, database searching, aligning sequences, and families. Each of these areas, colored for clarity, is developed to identify the key concepts involved, creating a visual aid to help the reader see at a glance the range of the material covered in discussing this area. Occasionally there are important connections between distinct areas of the mind map, as here in linking BLAST and PHI-BLAST, with the latter method being derived directly from the former, but having a quite different function, and thus being in a different area of the mind map.
Illustrations
Each chapter is illustrated with four-color figures. Considerable care has been put into ensuring simplicity as well as consistency of representation across the book. Figure 4.16 is given below, as an example.
Further Reading
It is not possible to summarize all current knowledge in the confines of this book, let alone anticipate future developments in this rapidly developing subject. Therefore at the end of each chapter there are references to research literature and specialist monographs to help readers continue to develop their knowledge and skills. We have grouped the books and articles according to topic, such that the sections within the Further Reading correspond to the sections in the chapter itself: we hope this will help the reader target their attention more quickly onto the appropriate extension material.
List of Symbols
Bioinformatics makes use of numerous symbols, many of which will be unfamiliar to those who do not already know the subject well. To help the reader navigate the symbols used in this book, a comprehensive list is given at the back which quotes each symbol, its definition, and where its most significant occurrences in the book are located.
Glossary
All technical terms are highlighted in bold where they first appear in the text and are then listed and explained in the Glossary. Further, each term in the Glossary also appears in the Index, so the reader can quickly gain access to the relevant pages where the term is covered in more detail. The book has been designed to cross-reference in as thorough and helpful a way as possible.
Garland Science Website
Garland Science has made available a number of supplementary resources on its website, which are freely available and do not require a password. For more details, go to www.garlandscience.com/gs_textbooks.asp and follow the link to Understanding Bioinformatics.
Artwork
All the figures in Understanding Bioinformatics are available to download from the Garland Science website. The artwork files are saved in zip format, with a single zip file for each chapter. Individual figures can then be extracted as jpg files.
Additional Material
The Garland Science website has some additional material relating to the topics in this book. For each of the seven parts a pdf is available, which provides a set of useful weblinks relevant to those chapters. These include weblinks to relevant and important databases and to file format definitions, as well as to free programs and to servers which permit data analysis on-line. In addition to these, the sets of data which were used to illustrate the methods of analysis are also provided. These will allow the reader to reanalyze the same data, reproducing the results shown here and trying out other techniques.
LIST OF REVIEWERS
The Authors and Publishers of Understanding Bioinformatics gratefully acknowledge the contribution of the following reviewers in the development of this book:
Stephen Altschul National Center for Biotechnology Information, Bethesda, Maryland, USA
Petri Auvinen Institute of Biotechnology, University of Helsinki, Finland
Joel Bader Johns Hopkins University, Baltimore, USA
Tim Bailey University of Queensland, Brisbane, Australia
Alex Bateman Wellcome Trust Sanger Institute, Cambridge, UK
Meredith Betterton University of Colorado at Boulder, USA
Andy Brass University of Manchester, UK
Chris Bystroff Rensselaer Polytechnic University, Troy, USA
Charlotte Deane University of Oxford, UK
John Hancock MRC Mammalian Genetics Unit, Harwell, Oxfordshire, UK
Steve Harris University of Oxford, UK
Steve Henikoff Fred Hutchinson Cancer Research Center, Seattle, USA
Jaap Heringa Free University, Amsterdam, Netherlands
Sudha Iyengar Case Western Reserve University, Cleveland, USA
Sun Kim Indiana University Bloomington, USA
Patrice Koehl University of California Davis, USA
Frank Lebeda US Army Medical Research Institute of Infectious Diseases, Fort Detrick, Maryland, USA
David Liberles University of Bergen, Norway
Peter Lockhart Massey University, Palmerston North, New Zealand
James McInerney National University of Ireland, Maynooth, Ireland
Nicholas Morris University of Newcastle, UK
William Pearson University of Virginia, Charlottesville, USA
Marialuisa Pellegrini-Calace European Bioinformatics Institute, Cambridge, UK
Mihaela Pertea University of Maryland, College Park, Maryland, USA
David Robertson University of Manchester, UK
Rob Russell EMBL, Heidelberg, Germany
Ravinder Singh University of Colorado, USA
Deanne Taylor Brandeis University, Waltham, Massachusetts, USA
Jen Taylor University of Oxford, UK
CONTENTS IN BRIEF
PART 1 Background Basics
Chapter 1: The Nucleic Acid World 3
Chapter 2: Protein Structure 25
Chapter 3: Dealing With Databases 45
PART 2 Sequence Alignments
Chapter 4: Producing and Analyzing Sequence Alignments (Applications Chapter) 71
Chapter 5: Pairwise Sequence Alignment and Database Searching (Theory Chapter) 115
Chapter 6: Patterns, Profiles, and Multiple Alignments (Theory Chapter) 165
PART 3 Evolutionary Processes
Chapter 7: Recovering Evolutionary History (Applications Chapter) 223
Chapter 8: Building Phylogenetic Trees (Theory Chapter) 267
PART 4 Genome Characteristics
Chapter 9: Revealing Genome Features (Applications Chapter) 317
Chapter 10: Gene Detection and Genome Annotation (Theory Chapter) 357
PART 5 Secondary Structures
Chapter 11: Obtaining Secondary Structure from Sequence (Applications Chapter) 411
Chapter 12: Predicting Secondary Structures (Theory Chapter) 461
PART 6 Tertiary Structures
Chapter 13: Modeling Protein Structure (Applications Chapter) 521
Chapter 14: Analyzing Structure-Function Relationships (Applications Chapter) 567
PART 7 Cells and Organisms
Chapter 15: Proteome and Gene Expression Analysis 599
Chapter 16: Clustering Methods and Statistics 625
Chapter 17: Systems Biology 667
APPENDICES Background Theory
Appendix A: Probability, Information, and Bayesian Analysis 695
Appendix B: Molecular Energy Functions 700
CONTENTS
Preface v
A Note to the Reader vii
List of Reviewers xii
Contents in Brief xiii
Part 1 Background Basics
Chapter 1 The Nucleic Acid World
1.1 The Structure of DNA and RNA 5
DNA is a linear polymer of only four different bases 5
Two complementary DNA strands interact by base pairing to form a double helix 7
RNA molecules are mostly single stranded but can also have base-pair structures 9
1.2 DNA, RNA, and Protein: The Central Dogma 10
DNA is the information store, but RNA is the messenger 11
Messenger RNA is translated into protein according to the genetic code 12
Translation involves transfer RNAs and RNA-containing ribosomes 13
1.3 Gene Structure and Control 14
RNA polymerase binds to specific sequences that position it and identify where to begin transcription 15
The signals initiating transcription in eukaryotes are generally more complex than those in bacteria 17
Eukaryotic mRNA transcripts undergo several modifications prior to their use in translation 18
The control of translation 19
1.4 The Tree of Life and Evolution 20
A brief survey of the basic characteristics of the major forms of life 21
Nucleic acid sequences can change as a result of mutation 22
Summary 23
Further Reading 24
Chapter 2 Protein Structure
2.1 Primary and Secondary Structure 25
Protein structure can be considered on several different levels 26
Amino acids are the building blocks of proteins 27
The differing chemical and physical properties of amino acids are due to their side chains 28
Amino acids are covalently linked together in the protein chain by peptide bonds 29
Secondary structure of proteins is made up of α-helices and β-strands 33
Several different types of β-sheet are found in protein structures 35
Turns, hairpins and loops connect helices and strands 36
2.2 Implication for Bioinformatics 37
Certain amino acids prefer a particular structural unit 37
Evolution has aided sequence analysis 38
Visualization and computer manipulation of protein structures 38
2.3 Proteins Fold to Form Compact Structures 40
The tertiary structure of a protein is defined by the path of the polypeptide chain 41
The stable folded state of a protein represents a state of low energy 41
Many proteins are formed of multiple subunits 42
Summary 43
Further Reading 44
Chapter 3 Dealing with Databases
3.1 The Structure of Databases 46
Flat-file databases store data as text files 48
Relational databases are widely used for storing biological information 49
XML has the flexibility to define bespoke data classifications 50
Many other database structures are used for biological data 51
Databases can be accessed locally or online and often link to each other 52
3.2 Types of Database 52
There's more to databases than just data 53
Primary and derived data 53
How we define and connect things is very important: Ontologies 54
3.3 Looking for Databases 55
Sequence databases 55
Protein interaction databases 58
Structural databases 59
3.4 Data Quality 61
Nonredundancy is especially important for some applications of sequence databases 62
Automated methods can be used to check for data consistency 63
Initial analysis and annotation is usually automated 64
Human intervention is often required to produce the highest quality annotation 65
The importance of updating databases and entry identifier and version numbers 65
Summary 66
Further Reading 67
Part 2 Sequence Alignments
APPLICATIONS CHAPTER
Chapter 4 Producing and Analyzing Sequence Alignments
4.1 Principles of Sequence Alignment 72
Alignment is the task of locating equivalent regions of two or more sequences to maximize their similarity 73
Alignment can reveal homology between sequences 74
It is easier to detect homology when comparing protein sequences than when comparing nucleic acid sequences 75
4.2 Scoring Alignments 76
The quality of an alignment is measured by giving it a quantitative score 76
The simplest way of quantifying similarity between two sequences is percentage identity 76
The dot-plot gives a visual assessment of similarity based on identity 77
Genuine matches do not have to be identical 79
There is a minimum percentage identity that can be accepted as significant 81
There are many different ways of scoring an alignment 81
4.3 Substitution Matrices 81
Substitution matrices are used to assign individual scores to aligned sequence positions 81
The PAM substitution matrices use substitution frequencies derived from sets of closely related protein sequences 82
The BLOSUM substitution matrices use mutation data from highly conserved local regions of sequence 84
The choice of substitution matrix depends on the problem to be solved 84
4.4 Inserting Gaps 85
Gaps inserted in a sequence to maximize similarity require a scoring penalty 85
Dynamic programming algorithms can determine the optimal introduction of gaps 86
4.5 Types of Alignment 87
Different kinds of alignments are useful in different circumstances 87
Multiple sequence alignments enable the simultaneous comparison of a set of similar sequences 90
Multiple alignments can be constructed by several different techniques 90
Multiple alignments can improve the accuracy of alignment for sequences of low similarity 91
ClustalW can make global multiple alignments of both DNA and protein sequences 92
Multiple alignments can be made by combining a series of local alignments 92
Alignment can be improved by incorporating additional information 93
4.6 Searching Databases 93
Fast yet accurate search algorithms have been developed 94
FASTA is a fast database-search method based on matching short identical segments 95
BLAST is based on finding very similar short segments 95
Different versions of BLAST and FASTA are used for different problems 95
PSI-BLAST enables profile-based database searches 96
SSEARCH is a rigorous alignment method 97
4.7 Searching with Nucleic Acid or Protein Sequences 97
DNA or RNA sequences can be used either directly or after translation 97
The quality of a database match has to be tested to ensure that it could not have arisen by chance 97
Choosing an appropriate E-value threshold helps to limit a database search 98
Low-complexity regions can complicate homology searches 100
Different databases can be used to solve particular problems 102
4.8 Protein Sequence Motifs or Patterns 103
Creation of pattern databases requires expert knowledge 104
The BLOCKS database contains automatically compiled short blocks of conserved multiply aligned protein sequences 105
4.9 Searching Using Motifs and Patterns 107
The PROSITE database can be searched for
The pattern-based program PHI-BLAST searches for both homology and matching motifs 108
Patterns can be generated from multiple sequences using PRATT 108
The PRINTS database consists of fingerprints representing sets of conserved motifs that describe a protein family 109
The Pfam database defines profiles of protein families 109
4.10 Patterns and Protein Function 109
Searches can be made for particular functional sites in proteins 109
Sequence comparison is not the only way of analyzing protein sequences 110
Summary 111
Further Reading 112
THEORY CHAPTER
Chapter 5 Pairwise Sequence Alignment and Database Searching
5.1 Substitution Matrices and Scoring 117
Alignment scores attempt to measure the likelihood of a common evolutionary ancestor 117
The PAM (MDM) substitution scoring matrices were designed to trace the evolutionary origins of proteins 119
The BLOSUM matrices were designed to find conserved regions of proteins 122
Scoring matrices for nucleotide sequence alignment can be derived in similar ways 125
The substitution scoring matrix used must be appropriate to the specific alignment problem 126
Gaps are scored in a much more heuristic way than substitutions 126
5.2 Dynamic Programming Algorithms 127
Optimal global alignments are produced using efficient variations of the Needleman-Wunsch algorithm 129
Local and suboptimal alignments can be produced by making small modifications to the dynamic programming algorithm 135
Time can be saved with a loss of rigor by not calculating the whole matrix 139
5.3 Indexing Techniques and Algorithmic Approximations 141
Suffix trees locate the positions of repeats and unique sequences 141
Hashing is an indexing technique that lists the starting positions of all k-tuples 143
The FASTA algorithm uses hashing and chaining for fast database searching 144
The BLAST algorithm makes use of finite-state automata 147
Comparing a nucleotide sequence directly with a protein sequence requires special modifications to the BLAST and FASTA algorithms 150
5.4 Alignment Score Significance 153
The statistics of gapped local alignments can be approximated by the same theory 156
5.5 Aligning Complete Genome Sequences 156
Indexing and scanning whole genome sequences efficiently is crucial for the sequence alignment of higher organisms 157
The complex evolutionary relationships between the genomes of even closely related organisms require novel alignment algorithms 159
Summary 159
Further Reading 161
THEORY CHAPTER
Chapter 6 Patterns, Profiles, and Multiple
Alignments
6.1 Profiles and Sequence Logos 167
Position-specific scoring matrices are an
extension of substitution scoring matrices 168
Methods for overcoming a lack of data in deriving
the values for a PSSM 171
PSI-BLAST is a sequence database searching
program 176
Representing a profile as a logo 177
6.2 Profile Hidden Markov Models 179
The basic structure of HMMs used in sequence
alignment to profiles 180
Estimating HMM parameters using aligned
sequences 185
Scoring a sequence against a profile HMM: The most probable path and the sum over
all paths 187
Estimating HMM parameters using unaligned
sequences 190
6.3 Aligning Profiles 193
Comparing two PSSMs by alignment 193
Aligning profile HMMs 195
6.4 Multiple Sequence Alignments by Gradual
Sequence Addition 196
The order in which sequences are added is chosen based on the estimated likelihood of incorporating
errors in the alignment 198
Many different scoring schemes have been used in constructing multiple alignments 200
The multiple alignment is built using the guide tree and profile methods and may be further
refined 204
6.5 Other Ways of Obtaining Multiple Alignments 207
The multiple sequence alignment program
DIALIGN aligns ungapped blocks 207
The SAGA method of multiple alignment uses
a genetic algorithm 209
6.6 Sequence Pattern Discovery 211
Discovering patterns in a multiple alignment:
eMOTIF and AACC 213
Probabilistic searching for common patterns in
sequences: Gibbs and MEME 215
Searching for more general sequence patterns 217
Summary 218
Further Reading 219
Part 3 Evolutionary Processes
APPLICATIONS CHAPTER
Chapter 7 Recovering Evolutionary History
7.1 The Structure and Interpretation of
Phylogenetic Trees 225
Phylogenetic trees reconstruct evolutionary
relationships 225
Tree topology can be described in several ways 230
Consensus and condensed trees report the
results of comparing tree topologies 232
7.2 Molecular Evolution and its Consequences 235
Most related sequences have many positions
that have mutated several times 236
The rate of accepted mutation is usually not the same for all types of base substitution 236
Different codon positions have different
mutation rates 238
Only orthologous genes should be used to
construct species phylogenetic trees 239
Major changes affecting large regions of the
genome are surprisingly common 247
7.3 Phylogenetic Tree Reconstruction 248
Small ribosomal subunit rRNA sequences are well suited to reconstructing the evolution of species 249
The choice of the method for tree reconstruction
depends to some extent on the size and quality of
the dataset 249
A model of evolution must be chosen to use with
the method 251
All phylogenetic analyses must start with an
accurate multiple alignment 255
Phylogenetic analyses of a small dataset of
16S RNA sequence data 255
Building a gene tree for a family of enzymes can help to identify how enzymatic functions evolved 259
Summary 264
Further Reading 265
THEORY CHAPTER
Chapter 8 Building Phylogenetic Trees
8.1 Evolutionary Models and the Calculation
of Evolutionary Distance 268
A simple but inaccurate measure of evolutionary
distance is the p-distance 268
The Poisson distance correction takes account of multiple mutations at the same site 270
The Gamma distance correction takes account of mutation rate variation at different sequence
positions 270
The Jukes-Cantor model reproduces some basic features of the evolution of nucleotide sequences 271
More complex models distinguish between the relative frequencies of different types of mutation 272
There is a nucleotide bias in DNA sequences 275
Models of protein-sequence evolution are closely related to the substitution matrices used for
sequence alignment 276
8.2 Generating Single Phylogenetic Trees 276
Clustering methods produce a phylogenetic tree
based on evolutionary distances 276
The UPGMA method assumes a constant
molecular clock and produces an ultrametric tree 278
The Fitch-Margoliash method produces an
unrooted additive tree 279
The neighbor-joining method is related to the
concept of minimum evolution 282
Stepwise addition and star-decomposition methods are usually used to generate starting
trees for further exploration, not the final tree 285
8.3 Generating Multiple Tree Topologies 286
The branch-and-bound method greatly improves the efficiency of exploring tree topology 288
Optimization of tree topology can be achieved
by making a series of small changes to an existing
tree 288
Finding the root gives a phylogenetic tree a
direction in time 291
8.4 Evaluating Tree Topologies 293
Functions based on evolutionary distances can
be used to evaluate trees 293
Unweighted parsimony methods look for the trees with the smallest number of mutations 297
Mutations can be weighted in different ways
in the parsimony method 300
Trees can be evaluated using the maximum
likelihood method 302
The quartet-puzzling method also involves maximum likelihood in the standard implementation 305
Bayesian methods can also be used to reconstruct
phylogenetic trees 306
8.5 Assessing the Reliability of Tree Features
and Comparing Trees 307
The long-branch attraction problem can arise
even with perfect data and methodology 308
Tree topology can be tested by examining the
interior branches 309
Tests have been proposed for comparing two
or more alternative trees 310
Summary 311
Further Reading 312
Part 4 Genome Characteristics
APPLICATIONS CHAPTER
Chapter 9 Revealing Genome Features
9.1 Preliminary Examination of Genome Sequence 318
Whole genome sequences can be split up to
simplify gene searches 319
Structural RNA genes and repeat sequences
can be excluded from further analysis 319
Homology can be used to identify genes in both prokaryotic and eukaryotic genomes 322
9.2 Gene Prediction in Prokaryotic Genomes 322
9.3 Gene Prediction in Eukaryotic Genomes 323
Programs for predicting exons and introns use
a variety of approaches 323
Gene predictions must preserve the correct
reading frame 324
Some programs search for exons using only
the query sequence and a model for exons 327
Some programs search for genes using only
the query sequence and a gene model 332
Genes can be predicted using a gene model
and sequence similarity 334
Genomes of related organisms can be used
to improve gene prediction 336
9.4 Splice Site Detection 337
Splice sites can be detected independently by
specialized programs 338
9.5 Prediction of Promoter Regions 338
Prokaryotic promoter regions contain relatively
well-defined motifs 339
Eukaryotic promoter regions are typically more complex than prokaryotic promoters 340
A variety of promoter-prediction methods are
available online 340
Promoter prediction results are not very clear-cut 341
9.6 Confirming Predictions 342
There are various methods for calculating the
accuracy of gene-prediction programs 342
Translating predicted exons can confirm the
correctness of the prediction 343
Constructing the protein and identifying homologs 343
9.7 Genome Annotation 346
Genome annotation is the final step in genome
analysis 347
Gene ontology provides a standard vocabulary
for gene annotation 348
9.8 Large Genome Comparisons 353
Summary 354
Further Reading 355
THEORY CHAPTER
Chapter 10 Gene Detection and Genome
Annotation
10.1 Detection of Functional RNA Molecules Using Decision Trees 361
Detection of tRNA genes using the tRNAscan
algorithm 361
Detection of tRNA genes in eukaryotic genomes 362
10.2 Features Useful for Gene Detection in Prokaryotes 364
10.3 Algorithms for Gene Detection in Prokaryotes 368
GeneMark uses inhomogeneous Markov chains
and dicodon statistics 368
GLIMMER uses interpolated Markov models of
coding potential 371
ORPHEUS uses homology, codon statistics, and
ribosome-binding sites 372
GeneMark.hmm uses explicit state duration
hidden Markov models 373
EcoParse is an HMM gene model 376
10.4 Features Used in Eukaryotic Gene Detection 377
Differences between prokaryotic and
eukaryotic genes 377
Introns, exons, and splice sites 379
Promoter sequences and binding sites for
10.5 Predicting Eukaryotic Gene Signals 381
Detection of core promoter binding signals is a key element of some eukaryotic
gene-prediction methods 381
A set of models has been designed to locate
the site of core promoter sequence signals 383
Predicting promoter regions from general
sequence properties can reduce the numbers
of false-positive results 387
Predicting eukaryotic transcription and
translation start sites 389
Translation and transcription stop signals
complete the gene definition 389
10.6 Predicting Exon/Intron Structure 389
Exons can be identified using general sequence
properties 390
Splice-site prediction 392
Splice sites can be predicted by sequence patterns
combined with base statistics 393
GenScan uses a combination of weight matrices and decision trees to locate splice sites 394
GeneSplicer predicts splice sites using first-order
Markov chains 394
NetPlantGene uses neural networks with
intron and exon predictions to predict splice sites 395
Other splicing features may yet be exploited for
splice-site prediction 396
Specific methods exist to identify initial and
terminal exons 396
Exons can be defined by searching databases for
homologous regions 397
10.7 Complete Eukaryotic Gene Models 397
10.8 Beyond the Prediction of Individual Genes 399
Functional annotation 400
Comparison of related genomes can help resolve
uncertain predictions 403
Evaluation and reevaluation of gene-detection
methods 405
Summary 405
Further Reading 406
Part 5 Secondary Structures
APPLICATIONS CHAPTER
Chapter 11 Obtaining Secondary Structure
from Sequence
11.1 Types of Prediction Methods 413
Statistical methods are based on rules that give the probability that a residue will form part of a
particular secondary structure 414
Nearest-neighbor methods are statistical methods
that incorporate additional information about
protein structure 414
Machine-learning approaches to secondary structure prediction mainly make use of neural
networks and HMM methods 415
11.2 Training and Test Databases 416
There are several ways to define protein
secondary structures 417
11.3 Assessing the Accuracy of Prediction
Programs 417
Q3 measures the accuracy of individual residue
assignments 417
Secondary structure predictions should not be
expected to reach 100% residue accuracy 418
The Sov value measures the prediction accuracy
for whole elements 419
CAFASP/CASP: Unbiased and readily available
protein prediction assessments 419
11.4 Statistical and Knowledge-Based Methods 421
The GOR method uses an information theory
approach 422
The program Zpred includes multiple alignment of homologous sequences and residue
conservation information 425
There is an overall increase in prediction accuracy using multiple sequence information 426
The nearest-neighbor method: The use of multiple
nonhomologous sequences 428
PREDATOR is a combined statistical and knowledge-based program that includes the
nearest-neighbor approach 428
11.5 Neural Network Methods of Secondary Structure
Prediction 430
Assessing the reliability of neural net predictions 432
Several examples of Web-based neural network secondary structure prediction programs 432
PROF: Protein forecasting 434
PSIPRED 434
Jnet: Using several alternative representations
of the sequence alignment 434
11.6 Some Secondary Structures Require Specialized Prediction Methods 435
Transmembrane proteins 436
Quantifying the preference for a membrane
environment 437
11.7 Prediction of Transmembrane Protein Structure 438
Multi-helix membrane proteins 439
A selection of prediction programs to predict
Statistical methods 443
Knowledge-based prediction 443
Evolutionary information from protein families
improves the prediction 444
Neural nets in transmembrane prediction 445
Predicting transmembrane helices with
hidden Markov models 446
Comparing the results: What to choose 447
What happens if a non-transmembrane protein is submitted to transmembrane prediction programs 448
Prediction of transmembrane structure
containing β-strands 448
11.8 Coiled-coil Structures 451
The COILS prediction program 452
PAIRCOIL and MULTICOIL are an extension
of the COILS algorithm 453
Zipping the Leucine zipper: A specialized
coiled coil 453
11.9 RNA Secondary Structure Prediction 455
Summary 458
Further Reading 459
THEORY CHAPTER
Chapter 12 Predicting Secondary Structures
12.1 Defining Secondary Structure and Prediction
Accuracy 463
The definitions used for automatic protein secondary structure assignment do not give identical results 464
There are several different measures of the
accuracy of secondary structure prediction 469
12.2 Secondary Structure Prediction Based on
Residue Propensities 472
Each structural state has an amino acid preference which can be assigned as a residue propensity 473
The simplest prediction methods are based on the average residue propensity over a sequence window 476
Residue propensities are modulated by nearby
sequence 479
Predictions can be significantly improved by
including information from homologous sequences 484
12.3 The Nearest-Neighbor Methods are Based on
Sequence Segment Similarity 485
Short segments of similar sequence are found
to have similar structure 487
Several sequence similarity measures have been used to identify nearest-neighbor segments 488
A weighted average of the nearest-neighbor
segment structures is used to make the prediction 490
A nearest-neighbor method has been developed to predict regions with a high potential to misfold 491
12.4 Neural Networks Have Been Employed
Successfully for Secondary Structure Prediction 492
Layered feed-forward neural networks can
transform a sequence into a structural prediction 494
Inclusion of information on homologous
sequences improves neural network accuracy 502
More complex neural nets have been applied to predict secondary and other structural features 503
12.5 Hidden Markov Models Have Been Applied to Structure Prediction 504
HMM methods have been found especially
effective for transmembrane proteins 506
Nonmembrane protein secondary structures can also be successfully predicted with HMMs 509
12.6 General Data Classification Techniques Can
Predict Structural Features 510
Support vector machines have been successfully used for protein structure prediction 511
Discriminants, SOMs, and other methods have
also been used 512
Summary 514
Further Reading 515
Part 6 Tertiary Structures
APPLICATIONS CHAPTER
Chapter 13 Modeling Protein Structure
13.1 Potential Energy Functions and Force Fields 524
The conformation of a protein can be visualized in terms of a potential energy surface 525
Conformational energies can be described by
simple mathematical functions 525
Similar force fields can be used to represent conformational energies in the presence of
averaged environments 526
Potential energy functions can be used to assess
a modeled structure 527
Energy minimization can be used to refine a modeled structure and identify local energy minima 527
Molecular dynamics and simulated annealing
are used to find global energy minima 528
13.2 Obtaining a Structure by Threading 529
The prediction of protein folds in the absence of
known structural homologs 531
Libraries or databases of nonredundant protein
folds are used in threading 531
Two distinct types of scoring schemes have been
used in threading methods 531
Dynamic programming methods can identify optimal alignments of target sequences and
Several methods are available to assess the
confidence to be put on the fold prediction 534
The C2-like domain from the Dictyostelia:
A practical example of threading 535
13.3 Principles of Homology Modeling 537
Closely related target and template sequences give
better models 539
Significant sequence identity depends on the
length of the sequence 540
Homology modeling has been automated to deal with the numbers of sequences that can now be modeled 541
Model building is based on a number of
assumptions 541
13.4 Steps in Homology Modeling 542
Structural homologs to the target protein are
found in the PDB 543
Accurate alignment of target and template
sequences is essential for successful modeling 543
The structurally conserved regions of a protein
are modeled first 544
The modeled core is checked for misfits before
proceeding to the next stage 545
Sequence realignment and remodeling may
improve the structure 545
Insertions and deletions are usually modeled
as loops 545
Nonidentical amino acid side chains are modeled mainly by using rotamer libraries 547
Energy minimization is used to relieve
structural errors 548
Molecular dynamics can be used to explore
possible conformations for mobile loops 548
Models need to be checked for accuracy 549
How far can homology models be trusted? 551
13.5 Automated Homology Modeling 552
The program MODELLER models by satisfying
protein structure constraints 553
COMPOSER uses fragment-based modeling to
automatically generate a model 553
Automated methods available on the Web for
comparative modeling 554
Assessment of structure prediction 554
13.6 Homology Modeling of PI3 Kinase p110α 557
Swiss-Pdb Viewer can be used for manual
or semi-manual modeling 557
Alignment, core modeling, and side-chain
modeling are carried out all in one 558
The loops are modeled from a database of
possible structures 559
Energy minimization and quality inspection
can be carried out within Swiss-Pdb Viewer 559
MolIDE is a downloadable semi-automatic
modeling package 560
Automated modeling on the Web illustrated with
p110α kinase 561
Modeling a functionally related but sequentially
dissimilar protein: mTOR 563
Generating a multidomain three-dimensional
structure from sequence 564
Summary 564
Further Reading 565
APPLICATIONS CHAPTER
Chapter 14 Analyzing Structure-Function
Relationships
14.1 Functional Conservation 568
Functional regions are usually structurally
conserved 569
Similar biochemical function can be found
in proteins with different folds 570
Fold libraries identify structurally similar proteins
regardless of function 571
14.2 Structure Comparison Methods 574
Finding domains in proteins aids structure
comparison 574
Structural comparisons can reveal conserved functional elements not discernible from a
sequence comparison 576
The CE method builds up a structural alignment from pairs of aligned protein segments 576
The Vector Alignment Search Tool (VAST) aligns
secondary structural elements 577
DALI identifies structure superposition without
maintaining segment order 578
FATCAT introduces rotations between rigid
segments 579
14.3 Finding Binding Sites 580
Highly conserved, strongly charged, or hydrophobic surface areas may indicate interaction sites 582
Searching for protein-protein interactions
using surface properties 584
Surface calculations highlight clefts or holes
in a protein that may serve as binding sites 585
Looking at residue conservation can identify
binding sites 586
14.4 Docking Methods and Programs 587
Simple docking procedures can be used when the structure of a homologous protein bound
to a ligand analog is known 588
Specialized docking programs will automatically
Scoring functions are used to identify the most
likely docked ligand 590
The DOCK program is a semirigid-body method that analyzes shape and chemical
complementarity of ligand and binding site 590
Fragment docking identifies potential substrates by predicting types of atoms and functional
groups in the binding area 591
GOLD is a flexible docking program, which
utilizes a genetic algorithm 591
The water molecules in binding sites should also
be considered 592
Summary 593
Further Reading 594
Part 7 Cells and Organisms
Chapter 15 Proteome and Gene Expression Analysis
15.1 Analysis of Large-scale Gene Expression 601
The expression of large numbers of different genes can be measured simultaneously by DNA
microarrays 602
Gene expression microarrays are mainly used to detect differences in gene expression in
different conditions 602
Serial analysis of gene expression (SAGE) is also used to study global patterns of gene expression 604
Digital differential display uses bioinformatics
and statistics to detect differential gene
expression in different tissues 605
Facilitating the integration of data from different
places and experiments 606
The simplest method of analyzing gene expression microarray data is hierarchical cluster analysis 606
Techniques based on self-organizing maps
can be used for analyzing microarray data 608
Self-organizing tree algorithms (SOTAs) cluster from the top down by successive subdivision
of clusters 610
Clustered gene expression data can be used as
a tool for further research 610
15.2 Analysis of Large-scale Protein Expression 612
Two-dimensional gel electrophoresis is a method for separating the individual proteins in a cell 613
Measuring the expression levels shown in 2D gels 614
Differences in protein expression levels between different samples can be detected by 2D gels 615
Clustering methods are used to identify protein spots with similar expression patterns 615
Principal component analysis (PCA) is an alternative to clustering for analyzing microarray
and 2D gel data 618
The changes in a set of protein spots can be
tracked over a number of different samples 618
Databases and online tools are available to aid the interpretation of 2D gel data 620
Protein microarrays allow the simultaneous
detection of the presence or activity of large
numbers of different proteins 621
Mass spectrometry can be used to identify the proteins separated and purified by 2D gel
electrophoresis or other means 621
Protein-identification programs for mass
spectrometry are freely available on the Web 622
Mass spectrometry can be used to measure
protein concentration 623
Summary 623
Further Reading 624
Chapter 16 Clustering Methods and Statistics
16.1 Expression Data Require Preparation Prior
to Analysis 626
Data normalization is designed to remove
systematic experimental errors 627
Expression levels are often analyzed as ratios
and are usually transformed by taking logarithms 628
Sometimes further normalization is useful after
the data transformation 630
Principal component analysis is a method for
combining the properties of an object 631
16.2 Cluster Analysis Requires Distances to be Defined Between all Data Points 633
Euclidean distance is the measure used in
everyday life 634
The Pearson correlation coefficient measures distance in terms of the shape of the expression
response 635
The Mahalanobis distance takes account of the variation and correlation of expression responses 636
16.3 Clustering Methods Identify Similar and Distinct Expression Patterns 637
Hierarchical clustering produces a related set of alternative partitions of the data 639
k-means clustering groups data into several
clusters but does not determine a relationship
between clusters 641
Self-organizing maps (SOMs) use neural network methods to cluster data into a predetermined
number of clusters 644
Evolutionary clustering algorithms use selection, recombination, and mutation to find the best
The self-organizing tree algorithm (SOTA)
determines the number of clusters required 648
Biclustering identifies a subset of similar
expression level patterns occurring in a subset
of the samples 649
The validity of clusters is determined by
independent methods 650
16.4 Statistical Analysis can Quantify the Significance of Observed Differential Expression 651
t-tests can be used to estimate the significance of the difference between two expression levels 654
Nonparametric tests are used to avoid making
assumptions about the data sampling 656
Multiple testing of differential expression requires special techniques to control error rates 657
16.5 Gene and Protein Expression Data Can be Used to Classify Samples 659
Many alternative methods have been proposed
that can classify samples 660
Support vector machines are another form of supervised learning algorithms that can produce
classifiers 661
Summary 662
Further Reading 664
Chapter 17 Systems Biology
17.1 What is a System? 669
A system is more than the sum of its parts 669
A biological system is a living network 670
Databases are useful starting points in
constructing a network 671
To construct a model more information is
needed than a network 672
There are three possible approaches to
constructing a model 674
Kinetic models are not the only way in
systems biology 678
17.2 Structure of the Model 679
Control circuits are an essential part of any
biological system 680
The interactions in networks can be represented as simple differential equations 680
17.3 Robustness of Biological Systems 683
Robustness is a distinct feature of complexity
in biology 684
Modularity plays an important part in robustness 685
Redundancy in the system can provide robustness 686
Living systems can switch from one state to
another by means of bistable switches 688
17.4 Storing and Running System Models 689
Specialized programs make simulating
systems easier 691
Standardized system descriptions aid their
storage and reuse 692
Summary 692
Further Reading 693
APPENDICES Background Theory
Appendix A: Probability, Information, and
Bayesian Analysis
Probability Theory, Entropy, and Information 695
Mutually exclusive events 695
Occurrence of two events 696
Occurrence of two random variables 696
Bayesian Analysis 697
Bayes' theorem 697
Inference of parameter values 698
Further Reading 699
Appendix B: Molecular Energy Functions
Force Fields for Calculating Intra- and Intermolecular Interaction Energies 701
Bonding terms 702
Nonbonding terms 704
Potentials used in Threading 706
Potentials of mean force 706
Potential terms relating to solvent effects 707
Further Reading 708
Appendix C: Function Optimization
Full Search Methods 710
Dynamic programming and branch-and-bound 710
Local Optimization 710
The downhill simplex method 711
The steepest descent method 711
The conjugate gradient method 714
Methods using second derivatives 714
Thermodynamic Simulation and Global Optimization 715
Monte Carlo and genetic algorithms 716
Molecular dynamics 718
Simulated annealing 719
Summary 719
Further Reading 719
List of Symbols 721
Glossary 734
Index 751
PART I
1
THE NUCLEIC ACID WORLD
When you have read Chapter 1, you should be able to:
State the chemical structures of nucleic acids.
Explain base-pairing and the double helix.
Explain how DNA stores genetic information.
Summarize the intermediate role of mRNA between DNA and proteins.
Outline how mRNA is translated into protein by ribosomes.
Outline how gene control is exercised by binding to short nucleotide sequences.
Show that eukaryotic mRNA often has segments (introns) removed before translation.
Discuss how all life probably evolved from a single common ancestor.
information about the biological context of the bioinformatics problems we discuss is also given in the biology boxes and glossary items throughout the book. This chapter will deal with the nucleic acids—deoxyribonucleic acid (DNA) and
ribonucleic acid (RNA)—and how they encode proteins, while the structure and
functions of proteins themselves will be discussed in Chapter 2. In these two chapters we shall also discuss how DNA changes its information-coding and functional properties over time as a result of the processes of mutation, giving rise to the enormous diversity of life, and the need for bioinformatics to understand it.
Mind Map 1.1
A mind map schematic of the topics covered in this chapter and divided, in general, according to the topic sections. This is to help you visualize the topics, understand the structure of the chapter, and memorize the important elements.
The main role of DNA is information storage. In all living cells, from unicellular bacteria to multicellular plants and animals, DNA is the material in which genetic instructions are stored and is the chemical structure in which these instructions are transmitted from generation to generation; all the information required to make and maintain a new organism is stored in its DNA. Incredibly, the information required to reproduce even very complex organisms is stored on a relatively small number of DNA molecules. This set of molecules is called the organism's genome. In humans there are just 46 DNA molecules in most cells, one in each chromosome. Each DNA molecule is copied before cell division, and the copies are distributed such that each daughter cell receives a full set of genetic information. The basic set of 46 DNA molecules together encode everything needed to make a human being. (We will skip over the important influence of the environment and the nature-nurture debate, as they are not relevant to this book.)
Proteins are manufactured using the information encoded in DNA and are the molecules that direct the actual processes on which life depends. Processes essential to life, such as energy metabolism, biosynthesis, and intercellular communication, are all carried out through the agency of proteins. A few key processes such as the