
As a communications system, the computer is like an asynchronous communications medium, typified by server-based e-mail. Data are deposited in the system and retrieved later. As in an e-mail system, the information can be retrieved by the person who originally created it, or more often by another party. Similarly, biological database systems can be private or public, where the former is intended to provide asynchronous communications to the same user, and the latter approximates sending mail to others.

From a data-management perspective, computers are increasingly used as asynchronous communications devices. That is, unlike a telephone or other synchronous communications device, communications through computer networks don't necessarily occur in real time and are independent of any clock. Instead, communications are event driven. For example, in the normal operation of a database server, data are sent to a computer where they may be stored for a few microseconds or for a decade before they are output to a printer or monitor. When computers are configured as communications devices, the users who generate the data tend to be different from the users who receive the results, and the response time may be hours or days.

The difference between e-mail and a typical biological database is that contributions to biological databases such as GenBank are meant for others, but the identity of the recipients is unknown and largely unknowable by the sender. Unlike an e-mail that is automatically deleted by the server after it is read, the data in a biological database are considered permanent, or at least are not altered by the process of accessing them.
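The contrast between a mailbox and a sequence database can be sketched in a few lines of Python. This is only an illustrative toy, not a description of how any real mail server or GenBank is implemented; the class names `Mailbox` and `SequenceArchive`, their methods, and the accession label used in the note that follows are all hypothetical.

```python
from collections import deque


class Mailbox:
    """E-mail-like store (hypothetical): reading a message removes it."""

    def __init__(self):
        self._messages = deque()

    def deposit(self, message):
        self._messages.append(message)

    def retrieve(self):
        # Destructive read: the oldest message is removed once delivered.
        return self._messages.popleft() if self._messages else None


class SequenceArchive:
    """Database-like store (hypothetical): reading leaves the entry intact."""

    def __init__(self):
        self._entries = {}

    def deposit(self, accession, sequence):
        self._entries[accession] = sequence

    def retrieve(self, accession):
        # Non-destructive read: any number of readers, unknown to the
        # depositor, can fetch the same entry at any later time.
        return self._entries.get(accession)
```

In this sketch, a second call to `Mailbox.retrieve()` after a single deposit returns `None`, whereas repeated calls to `SequenceArchive.retrieve("DEMO001")` keep returning the same deposited sequence.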

The communication of sequence and protein structure information is hindered by the lack of a standard format for creating and storing gene data, even within companies. Contenders for the standard include Gene Expression Markup Language (GEML), based on the eXtensible Markup Language (XML), and Microarray Markup Language (MAML). The latter is based on collaboration between the National Center for Biotechnology Information, Stanford University, and the European Bioinformatics Institute.

Endnote

The history of innovation and development in computing and molecular biology is much more complicated than that suggested by the timeline in Figure 1-2 or by the Central Dogma. The flow of information through replication, transcription, and translation is more involved than described here, and there are unknowns and exceptions to most of the theories put forth by investigators. For example, the Central Dogma is only partially correct, in that the flow of information isn't unidirectional as Watson initially proposed. In contrast to the Central Dogma, information can flow from external sources into the genome. For example, retroviruses or RNA-based viruses such as HIV copy their genetic information into the host cell's DNA, where the cell's machinery obediently duplicates the retrovirus.

In addition, there are many more unknowns than the role of introns and other apparently non-coding DNA in the chromosomes. Many of the proteins in the human proteome haven't been cataloged, and the roles of those that have been cataloged are poorly understood. Similarly, the processes of replication, transcription, and translation are exceedingly complex, involving hundreds of thousands of operations mediated by hundreds of factors, only a few of which are understood. Furthermore, the information-transfer process described by the Central Dogma differs somewhat from that used by mitochondria and some microorganisms.

What's more, it's possible that the source of much of the work in bioinformatics, the human genome, is inherently biased. Because much of the sequence data is derived from analysis of Craig Venter's DNA, with minimal contributions from five other donors, the data necessarily reflect Venter's genotype. Although it was recognized early on that his DNA carries a variant gene associated with abnormal fat metabolism and Alzheimer's disease, other variants carried by Venter that have not yet been studied may be considered normal for the human genome until more research is performed. Undoubtedly, over the next decade, when scientists finally finish and verify the genome sequence, other discoveries will be made as well. For example, it's unclear what the sequences in the centromeres will reveal, especially because the sequences in those regions of each chromosome have been resistant to sequencing techniques used on the other parts of the chromosomes.

Similarly, developments in computer science have not been as straightforward as suggested by the timeline. For example, the promise of AI, the darling of the computer science community throughout the 1980s, never materialized. After the massive military funding for language translation evaporated, the few companies that attempted to survive in the commercial world folded. Even the notable academic systems, such as MYCIN (the first rule-based expert system in medicine), were never put to practical use. What survives today are the various pattern-recognition methods and object-oriented programming techniques that are invaluable in genome and proteome research.

The timeline offered here also glosses over much of the human struggle involved in the discoveries and triumphs in both molecular biology and computing. For example, James Watson was initially in charge of the Human Genome Project, but resigned after only a few years because of a feud with the director of the National Institutes of Health over gene patenting. His successor, Francis Collins, was then embroiled in competition with Craig Venter's private research institute over methodology.

Although Venter prevailed and won the race to decode the majority of what is currently understood to be the human genome, the commercial viability of his company is less certain. Similarly, there is turmoil—and millions of dollars at stake—over determining who should be credited with the basic sequencing technique.

Just as the hype of what AI was supposed to deliver served to kill the industry for many years, many of the favored genomics research firms have performed less profitably than expected on Wall Street. Some genetically engineered drugs have not taken off as expected, and companies such as Genentech have been forced to turn to modifications of conventional pharmaceuticals to stay in business.

When exploring the computational methods described in this book, the reader is encouraged to apply basic business metrics to the information. For example, what is the added value of each step in the computerization process? How can the computing method described save time, provide a more accurate result, or save valuable resources? In the end, computers and computational methods are simply tools. Like a sculptor chipping away at the rock covering a statue, readers must select the tools that can best help create their vision.

Chapter 2. Databases

Prefoldin Chaperone, PDB entry 1FXK. Image produced with PDB Structure Explorer, which is based on MolScript and Raster3D.

What a piece of work is man! How noble in reason! How infinite in faculty! In form, and moving, how express and admirable! In action how like an angel! In apprehension how like a god! The beauty of the world! The paragon of animals! And yet, to me, what is this quintessence of dust?

—William Shakespeare, Hamlet

Computers serve four interdependent functions in bioinformatics: communications, computation, control, and storage. Embedded computer controllers in sequencing machines, fermentation tanks, and bioreactors direct the programmable robotic arms that automate intricate processes and markedly decrease the need for human operators. When time is of the essence, computer-controlled devices are superior to manual operations, in part because they can operate virtually unattended around the clock. Venter's company, Celera Genomics, followed by government-funded sequencing laboratories, was able to make unprecedented gains in sequencing throughput primarily through computer-directed robots that automated the tedious sequencing process.

As a communications device, not only has the computer helped researchers craft more journal articles in less time than at any other point in history, but an increasingly large proportion of academic research information appears online. Up until the mid-1990s, newly discovered nucleotide sequences from human and other species of DNA were published in printed journals, requiring that researchers interested in using computer techniques to explore the sequences either key them in by hand or use optical character recognition (OCR) systems to automatically capture the printed sequences and translate them into machine-readable form. Today, no researcher would think of consulting a printed journal for a nucleotide sequence, but would immediately turn to either one of the numerous public databases on the Web or one of the value-added commercial databases. Furthermore, if a printed journal article isn't referenced by one of the electronic databases, such as PubMed, then the chances of the article ever being read in any form are low.

As computational devices in bioinformatics, computers are used for tasks that range from searching for nucleotide sequences and visualizing protein folding patterns to simulating complex 3D protein-protein interactions, for applications ranging from drug discovery to biomaterials research and development. As an example of computer processing power focused on numeric computation in bioinformatics, consider that Celera Genomics' network of 800 Compaq AlphaServers has the capacity to compare up to 250 billion genomic sequences per hour generated by its hundreds of robotic gene sequencing machines. Even lesser-endowed companies and academic centers are creating high-performance Beowulf clusters for bioinformatics work. These massively parallel systems, which are constructed from dedicated PC hardware, are generally affordable and available to anyone.

Researchers at another pharmacological powerhouse, GlaxoSmithKline (GSK), are studying how individual variations in the genetic code cause adverse drug reactions in some patients. To pursue this research, GSK partners with biotech research firms that store clinical data from drug trials and correlate them with patients' genetic information to create a genetic profile of patients at risk. Similarly, clinicians with the Mayo Clinic in Minnesota are working with researchers to identify gene markers that indicate which patients should respond to specific anticancer therapy. Elsewhere, pharmaceutical research firms are using genetic traits to predict whether a patient will respond to therapy as well as the likelihood of serious side effects. Several biotech startups are developing panels of DNA tests that will allow clinicians to quickly determine how patients metabolize drugs so that dosage regimens can be tailored to their individual metabolism.

All of these activities revolve around database technology. For example, both communications and computation operations in bioinformatics depend on data that have to be maintained. Electronic databases maintain data in a persistent, non-volatile form that allows operations to be repeated and compared with other operations, with the results communicated to other researchers and developers. The electronic database (a file composed of records, each containing fields, together with a set of operations for searching, sorting, recombining, and other functions) is the silicon, plastic, and iron-oxide equivalent of the experimenter's private notebook, and the basis for electronic publishing to the scientific community.
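The definition above (records made of fields, plus operations over them) can be illustrated with a minimal in-memory sketch in Python. It is not how any production database is built: the `SequenceRecord` type, its field names, the `search` and `sort_by` helpers, and the "DEMO" accession labels and sequences are all invented for illustration.

```python
from dataclasses import dataclass
from operator import attrgetter


@dataclass(frozen=True)
class SequenceRecord:
    """One record; each attribute is a field (hypothetical layout)."""
    accession: str
    organism: str
    sequence: str


def search(records, **criteria):
    """Return the records whose fields equal every given criterion."""
    return [
        r for r in records
        if all(getattr(r, field) == value for field, value in criteria.items())
    ]


def sort_by(records, field):
    """Return a new list of records ordered by the named field."""
    return sorted(records, key=attrgetter(field))


# Made-up demonstration data, not real database entries.
records = [
    SequenceRecord("DEMO002", "Mus musculus", "GGCATT"),
    SequenceRecord("DEMO001", "Homo sapiens", "ATGCCG"),
    SequenceRecord("DEMO003", "Homo sapiens", "TTAGGC"),
]

human = search(records, organism="Homo sapiens")
ordered = sort_by(records, "accession")
```

Here `human` holds the two records whose organism field matches, and `ordered` lists all three records by accession. A real system would add persistence to disk, which is what makes the stored data non-volatile and the operations repeatable.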

As an illustration of how central databases are to molecular biology research and development, consider a sampling of the public bioinformatics databases listed in Table 2-1. Perhaps the best-known of the hundreds of DNA sequence databases accessible through the Internet are the international nucleotide sequence database collaborators: GenBank, supported by the National Center for Biotechnology Information (NCBI); the DNA DataBank of Japan (DDBJ); and the European Molecular Biology Laboratory (EMBL). Another major database, PubMed, which is maintained by the U.S. National Library of Medicine, is a key resource for biomedical literature.