
From a hardware perspective, implementing a database requires more than servers, large hard drives, perhaps a network, and the associated cables and electronics. Power conditioners and uninterruptible power supplies are needed to protect sensitive equipment and the data it contains from power surges and sudden, unplanned power outages. Providing a secure environment for data includes the usual use of usernames and passwords to protect accounts. However, for higher levels of assurance against data theft or manipulation, secure ID cards, dongles, and biometrics (such as voice, fingerprint, and retinal recognition) may be appropriate.

Secure ID cards are credit card–sized pseudorandom number generators that are synchronized with a similar generator on the server. Users enter the 16-digit number displayed on the secure ID card as their password to gain access to the system. Biometric security systems use personal biological characteristics, such as a fingerprint, the voice, or the pattern of capillaries on the retina, to verify the identity of a user. Dongles are hardware keys that applications look for on either the serial or USB port of a workstation before users can access their data and applications. Dongles can be considered a form of hardware-based encryption. Dedicated hardware capable of high-speed encryption and decryption is available as well.
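The idea of a card and a server generating the same number in step can be sketched in a few lines of Python. This is a minimal illustration only, not the algorithm used by any commercial secure ID product: the shared secret, the 60-second time step, the default code length, and the function names `token_code` and `server_accepts` are all assumptions made for the example.

```python
import hashlib
import hmac
import struct


def token_code(shared_secret: bytes, timestamp: float, step: int = 60, digits: int = 6) -> str:
    """Derive a numeric code from a shared secret and the current time window."""
    # Card and server both reduce the clock to the same window counter.
    counter = int(timestamp // step)
    digest = hmac.new(shared_secret, struct.pack(">Q", counter), hashlib.sha256).digest()
    # Fold the first 8 bytes of the keyed hash into a fixed number of decimal digits.
    number = int.from_bytes(digest[:8], "big") % (10 ** digits)
    return str(number).zfill(digits)


def server_accepts(shared_secret: bytes, submitted: str, timestamp: float, step: int = 60) -> bool:
    """Recompute the code on the server side and compare it with what the user typed."""
    expected = token_code(shared_secret, timestamp, step, len(submitted))
    return hmac.compare_digest(expected, submitted)
```

Because both sides derive the code from the same secret and the same time window, the server can check a code without the card ever transmitting the secret itself. A code read from the card at one moment stops being valid once the window rolls over.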

Encryption is the use of a key or code to scramble a message so that it can only be deciphered by someone with knowledge of the key and the algorithm used to encrypt the original message. From a practical perspective, encryption is the processing of data so that it's at least challenging for casual eavesdroppers to read, even if the data are intercepted.

For Web-based databases, Secure Sockets Layer (SSL) is the dominant security protocol. Information transmitted over the Web using SSL is automatically encrypted, and the user's Web browser and the computer serving content can communicate only when they have the same key. Both Netscape and Internet Explorer support the optional use of SSL.

One of the limitations of SSL is that it's wedded to the client/server architecture, where a secure session is established, through which any amount of data may be securely transmitted for the duration of the session. A complementary communications protocol that makes use of encryption is Secure Hypertext Transfer Protocol (S-HTTP), which is designed to transmit individual messages securely over the Web. That is, SSL provides a secure communications channel for the length of the connection between the client and the server, regardless of whether or not data are flowing from one to the other. In contrast, S-HTTP is more appropriate for short communications that use the channel only when data are moving from sender to receiver.
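The session-oriented behavior of SSL can be illustrated with Python's standard `ssl` module, which negotiates TLS, the successor to SSL. The sketch below is an assumption-laden illustration rather than a recommended client: the function names `make_client_context` and `send_over_secure_session`, the 10-second timeout, and the 4096-byte read size are choices made for the example.

```python
import socket
import ssl
from typing import List


def make_client_context() -> ssl.SSLContext:
    """Build a client context that verifies the server's certificate and host name."""
    return ssl.create_default_context()


def send_over_secure_session(host: str, messages: List[bytes], port: int = 443) -> List[bytes]:
    """Open one encrypted session and pass every message through it."""
    context = make_client_context()
    replies = []
    with socket.create_connection((host, port), timeout=10) as raw_sock:
        # The handshake happens once, here; the channel then stays open
        # for as many messages as the caller wants to send.
        with context.wrap_socket(raw_sock, server_hostname=host) as secure_sock:
            for message in messages:
                secure_sock.sendall(message)
                replies.append(secure_sock.recv(4096))
    return replies
```

The key point is structural: the cost of establishing the channel is paid once per connection, and the channel persists whether or not data are moving, which is the property the text contrasts with the message-by-message model of S-HTTP.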

Regardless of whether SSL or S-HTTP is used, at the core of communications over the Internet is an encryption technology called Public Key Encryption (PKE), which is based on a pair of keys or data strings. One key is public, known or at least knowable to everyone, and one key is private, known only to its owner, the intended recipient of the message. The private key, which is not shared with anyone, is used to decrypt information that's been encrypted by someone using the public key. In other words, encoding uses a generally available public key and decoding is performed using a private key available only to the intended recipient. PKE is like a physical padlock, where one key is used to lock the padlock and another key is used to open it.
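The relationship between the two keys can be made concrete with a toy RSA-style key pair. The sketch below uses deliberately tiny primes chosen for illustration, so it offers no real security, and RSA is only one of several public key schemes; the variable and function names are assumptions made for the example. It requires Python 3.8 or later for the modular inverse form of `pow`.

```python
# Toy key pair built from two small primes (illustrative values only).
p, q = 61, 53
n = p * q                  # modulus, part of both the public and private key
phi = (p - 1) * (q - 1)    # used only while generating the keys
e = 17                     # public exponent, published to everyone
d = pow(e, -1, phi)        # private exponent, kept by the recipient


def encrypt(message: int, public_exponent: int = e, modulus: int = n) -> int:
    """Anyone holding the public key can lock a message."""
    return pow(message, public_exponent, modulus)


def decrypt(ciphertext: int, private_exponent: int = d, modulus: int = n) -> int:
    """Only the holder of the private key can unlock it."""
    return pow(ciphertext, private_exponent, modulus)


ciphertext = encrypt(65)
recovered = decrypt(ciphertext)   # recovers the original value, 65
```

In practice, public key operations are slow, so they are typically used to exchange a shared session key, and the bulk of the data is then encrypted with that shared key.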

Endnote

Looking to the immediate future, the database technologies that will most likely have a significant impact on bioinformatics are the ones that deal with systems integration, the process in which disparate computer applications and systems can share data. Because the applications in a typical biotech laboratory are often cobbled together from different vendors and custom, in-house development, and may be running on multiple generations of hardware, system integration is still a custom-programming task. As a result, integrating every database in an organization can take months of effort, incur considerable expense, and yield only mixed results. Part of the challenge is that, due to the relative youth of the bioinformatics arena, the market has yet to respond to the need for commercial integration tools that address the specific needs of the community. Two areas in which rapid innovation is required for database integration and overall improved interoperability of bioinformatics tools are vocabulary standards and DBMSs.

Although organizations such as NCBI and the National Library of Medicine are actively involved in developing tools for the molecular biologist working in the field of bioinformatics, a vocabulary of bioinformatics has yet to be defined. As a result, most data warehouses and data dictionaries are based on ad-hoc compilations of existing vocabularies with additions made on an as-needed basis. Part of the challenge of creating a standard bioinformatics vocabulary is determining the appropriate level of granularity needed to adequately describe everything from nucleotide sequences and protein structure to species data. This challenge is intensified as the focus of bioinformatics research shifts from nucleotide sequencing to proteomics, which necessarily includes phenotypic expression data stored in clinical systems. As a result, an all-encompassing vocabulary must increasingly incorporate data in the medical record and public health as well.

In the area of DBMSs, although the relational model currently dominates the market, the complexity of clinical and laboratory data is driving many researchers to seriously consider other DBMS technologies, such as object-oriented DBMSs (ODBMSs). While there is a great deal of interest in object-oriented approaches to supporting bioinformatics computing, the information technology community is still expressing caution toward the technology. This is partly because many object-oriented database systems are incomplete, in that they lack backup and recovery functions. In addition, data models often conflict, the languages supported by vendors are proprietary, scalability is unproven, and the systems require huge amounts of memory and computational resources. In the recent past, vendors have partially addressed these and other limitations of ODBMSs, but performance and scalability concerns remain.

Several vendors are building what they consider the next generation of bioinformatics database systems, but it's uncertain which of these systems will establish a standard. Meanwhile, the most promising technologies in the systems integration arena are aimed at the general computing market, such as Web Services, Storage Area Networks, Storage Service Providers, or Application Service Providers. Time will tell which of these models, if any, can be shown to be economically viable, as opposed to simply technologically viable. In most cases, this translates to technologies that are transparent to the research workflow, thereby augmenting current processes and contributing to the effectiveness of R&D.

By far the most significant challenges surrounding the effective use of database technology in bioinformatics relate to issues of security, privacy, and bioethics, and how these issues will eventually affect legislation that will either support or hamper advances in the field. Consider the privacy and security issues associated with having an individual's medical records and DNA analysis online and instantly available to teachers, employers, the courts, police, the FBI, and, inevitably, hackers. For now, the challenge is achieving the level of database integration that would make these issues a reality. At best, integration is limited to what Internet and intranet technology can support, through both fixed, or hard-wired, links and, more commonly, dynamic links provided by online search engines. As described in Chapter 4, "Search Engines," significant progress in molecular biology database integration is being made in this arena.

Chapter 3. Networks

Ebola Virus structure, superimposed over its PDB summary information. Image produced with PDB Structure Explorer, which is based on MolScript and Raster3D.

People seldom improve when they have no other model but themselves to copy after. —Oliver Goldsmith

Comparing a data network to a living organism, the hardware provides the skeleton or basic infrastructure upon which the nervous system is built. Similarly, a few hundred meters of cable running through the walls of a laboratory is necessary but insufficient to constitute a network. Rather, the data pulsing through cables or other media in a coordinated fashion define a network. This coordination is provided by electronics that connect workstations and shared computer peripherals with the network and that amplify, route, filter, block, and translate data. Every competent bioinformatics researcher should have a basic understanding of the limits, capabilities, and benefits of specific network hardware, if only to be able to converse intelligently with hardware vendors or to direct the management of an information services provider.

According to Chaos Theory, the ability to adapt and the capacity for spontaneous self-organization are the two main characteristics of complex systems, that is, systems that have many independent variables interacting with each other in many ways and that have the ability to balance order and chaos. In this regard, computer networks qualify as complex systems, always at the edge of failure, but still working. In some sense, it's difficult to define success and failure for these systems, in part because of the so-called law of unintended consequences, which holds that these systems can provide results so beneficial, and so out of proportion to the intended "success," that they overshadow the significance of the intended goal. Consider that gunpowder was intended as an elixir to prolong life, that the adhesive on 3M Post-It Notes® was intended to be a superglue, that Edison's phonograph was intended to be a telephone message recorder, and that Jacquard's punch card was intended to automate the loom, not to give the computer its instructions or determine presidential elections. Such is the case with the Internet, one of the greatest enabling technologies in bioinformatics, which allows researchers in laboratories anywhere on the globe to access data maintained by the National Center for Biotechnology Information (NCBI), the National Institutes of Health (NIH), and other government agencies.

The Internet was never intended to serve as the portal to the code of life, but was a natural successor to the cold war projects in the 1950s and early 1960s. During this time, the military establishment enjoyed the nearly unanimous respect and support of politicians and the public.

Universities with the top science and engineering faculties received nearly unlimited funding, and the labors of the nation's top scientists filtered directly into industry. Military demand and government grants funded the development of huge projects that helped establish the U.S. as a Mecca for technological developments in computing and communications networks.

The modern Internet was the unintended outcome of two early complex systems developed for the military: the SAGE system (Semi-Automatic Ground Environment), developed in the 1950s, and the ARPANET (Advanced Research Projects Agency Network), developed in the 1960s. SAGE was the national air defense system, composed of an elaborate, ad hoc network of incompatible command and control computers, early warning radar systems, weather centers, air traffic control centers, ships, planes, and weapons systems. The communications network component of the SAGE system was comprehensive, extending beyond the borders of the U.S. to include ships and aircraft. It was primarily a military system, with a civil defense link as its only tie to civilian communications systems.

Government-sponsored R&D increasingly required reliable communications between industry, academia, and the military. Out of this need, and spurred by the fear of disruption of the civilian communications grid through eventual nuclear attack, a group of scientists designed a highly redundant communications system, starting with a single node at UCLA in September of 1969. By 1977, the ARPANET stretched across the U.S. and extended from Hawaii to Europe. The ARPANET quickly grew and became more complex, with an increasing number of nodes and redundant cross-links that provided alternate communications paths in the event that any particular node or link failed.
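The value of redundant cross-links can be shown with a small routing sketch. The four-node topology below is entirely hypothetical and is not a map of the ARPANET; the function name `find_route` and the breadth-first search strategy are likewise assumptions made for the example.

```python
from collections import deque


def find_route(links, source, target, failed=frozenset()):
    """Return the shortest hop path from source to target, avoiding failed nodes."""
    if source in failed or target in failed:
        return None
    queue = deque([[source]])
    visited = {source}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == target:
            return path
        for neighbor in links.get(node, ()):
            if neighbor not in visited and neighbor not in failed:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None  # no surviving path


# Hypothetical network: A reaches D through either B or C.
LINKS = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C"],
}

normal_route = find_route(LINKS, "A", "D")
backup_route = find_route(LINKS, "A", "D", failed={"B"})
```

With every node working, traffic from A to D passes through B. When B fails, the cross-link through C still connects the two endpoints, and only the loss of both intermediate nodes isolates them.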

Although the ARPANET's infrastructure was an interdependent network of nodes and interconnections, the data available from the network was indistinguishable from data available from any standalone computer. The infrastructure of the system provided redundant data communications, but no quick and intuitive way for content authors to cross-link data throughout the network for later access, the mechanism that allows today's Internet users to search for information. In 1990, ARPANET was retired in favor of the National Science Foundation Network (NSFNET), which connected the NSF's supercomputers to regional networks. NSFNET then served as the high-speed backbone of the Internet until it was decommissioned in 1995 and commercial backbones took over.

Fortunately, and apparently coincidentally, during the period of military expansion in the 1950s and 1960s, federally funded researchers at academic institutions explored ways to manage the growing store of digital data amid the increasingly complex network of computers and networks. One development was hypertext, a cross-referencing scheme, where a word in one document is linked to a word in the same or a different document.

Around the time the ARPANET was born, a number of academic researchers began experimenting with computer-based systems that used hypertext. For example, in the early 1970s, a team at Carnegie-Mellon University developed ZOG, a hypertext-based system that was eventually installed on a U.S. aircraft carrier. ZOG was a reference application that provided the crew with online documentation that was richly cross-linked to improve speed and efficiency of locating data relevant to operating shipboard equipment.

In addition to applications for the military, a variety of commercial, hypertext-based document management systems were spun out of academia and commercial laboratories, such as the Owl Guide hypertext program from the University of Kent, England, and the Notecards system from Xerox PARC in California. Both of these systems were essentially stand-alone equivalents of a modern Web browser, but based on proprietary document formats with content limited to what could be stored on a hard drive or local area network (LAN). The potential market for these products was limited because of specialized hardware requirements. For example, the initial version of Owl Guide, which predated Apple's HyperCard hypertext program, was only available for the Apple Macintosh. Similarly, Notecards required a Xerox workstation running under a LISP-based operating system. These and other document management systems allowed researchers to create limited Web-like environments, but without the advantage of the current Web of millions of documents authored by others.

In this circuitous way, out of the quest for national security through an indestructible communications network, the modern Internet was born. Today, the Internet connects bioinformatics researchers in China, Japan, Europe, and the rest of the world, regardless of political or national affiliation. It not only provides communications, including e-mail, videoconferencing, and remote information access, but, together with other networks, also provides for resource sharing and alternate, reliable sources of bioinformatics data.

As an example of how important networks are in bioinformatics R&D, consider that the typical microarray laboratory involved in creating genetic profiles for custom drug development and other purposes generates huge amounts of data. Not only does an individual microarray experiment generate thousands of data points, usually in the form of 16-bit TIFF (Tagged Image File Format) files, but the experimental design leading up to the experiments, including gene data analysis, involves access to volumes of timely data as well. Furthermore, analysis and visualization of the experimental data require that the data be seamlessly and immediately available to other researchers.
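A back-of-the-envelope calculation shows why these image files put pressure on a laboratory network. The sketch below is illustrative only: the scan dimensions, the two-channel assumption, and the link speeds are hypothetical values chosen for the example, and the arithmetic ignores file headers, compression, and protocol overhead.

```python
def image_size_bytes(width_px: int, height_px: int, bits_per_pixel: int = 16, channels: int = 1) -> int:
    """Uncompressed size of a scanned image."""
    return width_px * height_px * channels * bits_per_pixel // 8


def transfer_seconds(size_bytes: int, link_megabits_per_second: float) -> float:
    """Ideal time to move a file across a link running at its full rated speed."""
    return size_bytes * 8 / (link_megabits_per_second * 1_000_000)


# Hypothetical scan: 2,000 x 5,000 pixels, two fluorescent channels, 16 bits each.
scan_bytes = image_size_bytes(2000, 5000, channels=2)   # 40,000,000 bytes
slow_link = transfer_seconds(scan_bytes, 10)            # 10 Mbps link
fast_link = transfer_seconds(scan_bytes, 100)           # 100 Mbps link
```

Under these assumptions a single scan occupies about 40 MB and takes roughly half a minute to move over a 10 Mbps link, before any of the associated design and analysis data are counted.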

The scientific method involves not only formulating a hypothesis and then devising creative and logical methods of supporting or refuting it, but also arriving at a hypothesis that will withstand the scrutiny of others. Results must be verifiable and reproducible under similar conditions in different laboratories. One of the challenges of working with microarrays is that there is still considerable art involved in creating meaningful results. Results are often difficult to reproduce, even within the same laboratory. Fortunately, computational methods, including statistical methods, can help identify and control for some sources of error.

As shown in Figure 3-1, computers dedicated to experimental design, scanning and image analysis, expression analysis, and gene data manipulation support the typical microarray laboratory. The microarray device is only one small component of the overall research and design process. For example, once the experiment is designed using gene data gleaned from an online database, the microarray containing the clones of interest has to be designed and manufactured. After hybridization with cDNA or RNA from tissue samples, the chips are optically scanned and the relative intensity of fluorescent markers on the images is analyzed and stored. The data are subsequently subject to further image processing and gene expression analysis.
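One common way to express the relative intensity of two fluorescent markers at a spot is as a base-2 logarithm of their ratio. The sketch below assumes a two-channel experiment with a single background estimate per spot; the function name `log2_ratio` and the intensity values in the usage lines are hypothetical, and real image analysis packages apply more elaborate background correction and normalization.

```python
import math
from typing import Optional


def log2_ratio(sample_intensity: float, reference_intensity: float, background: float = 0.0) -> Optional[float]:
    """Log2 of the background-corrected sample-to-reference intensity ratio."""
    sample = sample_intensity - background
    reference = reference_intensity - background
    # A spot at or below background carries no usable signal.
    if sample <= 0 or reference <= 0:
        return None
    return math.log2(sample / reference)


up_regulated = log2_ratio(4000, 1000)     # sample four times brighter
down_regulated = log2_ratio(500, 1000)    # sample half as bright
unusable = log2_ratio(100, 1000, background=100)
```

The logarithmic scale treats increases and decreases symmetrically: a fourfold increase gives 2, a twofold decrease gives -1, and equal intensities give 0.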

Figure 3-1. Microarray Laboratory Network. The computers in a typical