
provide for security.

Issue                          Description
Indexing                       Vocabulary, metadata, language, completeness, efficiency
Space Requirements             Index space versus data space
Hardware Requirements          Hard drives, network
Scalability                    Ability to expand functionality without investing in new hardware and software
Database Design                Data model
Archival Process               Responsibilities for overseeing the process
Space Requirements             Current and projected archival capacity
Completeness                   Relative quantity of total data that are archived
Media Selection                Compatibility, speed, capacity, data density, cost, volatility, durability, and stability
Location                       Local, server-based, or network
Infrastructure Requirements    Network and computer hardware
Relative Value                 Value of data vs. archival overhead
Hardware Configuration         RAID and other configurations
Longevity                      Technical obsolescence of media and MTBF rating of related equipment
Security                       Limited access to data

The hardware involved in the archiving process may include a PC-based CD-ROM burner, a large database server that's networked to a number of workstations and routinely backed up onto magnetic tape, or network-based storage that may be located offsite. As discussed later in this chapter, each option has security, cost, and performance issues. The software tools selected for archiving data also define the usability and performance of the data archive, especially regarding data indexing and retrieval functions.

After data have been created and, if necessary, modified for use, and before they can be archived, they are typically named, indexed, and filed to facilitate locating them in the future. As such, the filing system, naming conventions, and accuracy and specificity of indexing limit the efficiency with which the data can be located later. For example, each document can be assigned one or more keywords, but if the keywords aren't appropriate, the keyword vocabulary is undefined or not enforced, or too few keywords are used, then a document may be effectively lost in the system. Not only the choice and number of keywords, but also the indexing hierarchy can make data hard to find.
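The indexing failures described above (keywords outside the vocabulary, an unenforced vocabulary, too few keywords) can be illustrated with a minimal sketch. The vocabulary terms, the two-keyword minimum, and the function names below are hypothetical choices made for illustration, not part of any particular archiving tool.

```python
# Hypothetical controlled vocabulary; a real archive would define its own terms.
CONTROLLED_VOCABULARY = {"microarray", "gene sequence", "protein structure", "homo sapiens"}
MIN_KEYWORDS = 2  # assumed policy: at least two distinct keywords per document


def validate_keywords(keywords):
    """Return a list of problems that would make a document hard to find later."""
    problems = []
    normalized = [k.strip().lower() for k in keywords]
    unknown = sorted(k for k in normalized if k not in CONTROLLED_VOCABULARY)
    if unknown:
        problems.append("not in vocabulary: " + ", ".join(unknown))
    if len(set(normalized)) < MIN_KEYWORDS:
        problems.append(f"too few keywords (need at least {MIN_KEYWORDS})")
    return problems


def build_index(documents):
    """Map each accepted keyword to the set of document names that carry it."""
    index = {}
    for name, keywords in documents.items():
        if validate_keywords(keywords):
            continue  # documents with keyword problems are held back until fixed
        for k in keywords:
            index.setdefault(k.strip().lower(), set()).add(name)
    return index
```

Enforcing the vocabulary at filing time, rather than at search time, is the design choice here: a document that fails validation is flagged immediately instead of being silently archived under terms that nobody will search for.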

The process of data archiving is far more important than the associated technology, in that the best software and hardware are useless if they aren't used. Of the technical issues involved in archiving gene sequences, microarray analysis, and other bioinformatics data, scalability is typically the most important. Even relatively small laboratories generate megabytes of data every week, which is fueling demand for very-large-capacity archival storage devices.

One of the primary determinants of archive capacity is the storage media, the physical material used to form a tape, disk, or cartridge. In addition to capacity, media can be characterized in terms of compatibility, speed, data density, cost, volatility, durability, and stability. Compatibility is the ability of media to function within a particular software and hardware environment. Speed is a multi-faceted performance characteristic that encompasses both the time to locate data (seek time) and the time to write it to or read it from the media (data transfer rate), both of which are functions of the construction of the supporting hardware and electronics. Seek time may be several hundred milliseconds for a CD-ROM, a few milliseconds for a hard drive, and a few microseconds for a flash memory card. Capacity, the maximum amount of data the media can store, is a function of the media construction, the tolerance of the casing or cartridge for tape- and disk-based media, and the technology used to read and write the data. Capacity is also a function of data density, which is in turn a function of the media, the drive mechanism, and the error coding and compression technologies. Error-control and compression schemes in hard drives and other media allow higher data densities than the raw media would support otherwise.
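The two components of speed described above combine into a simple estimate: total access time is roughly seek time plus data size divided by transfer rate. The sketch below uses seek times in the ranges mentioned in the text; the transfer rates are assumed placeholder values, not measurements, and should be replaced with figures for the actual devices.

```python
def access_time_seconds(size_bytes, seek_time_s, transfer_rate_bytes_per_s):
    """Estimate retrieval time as seek time plus size divided by transfer rate."""
    if transfer_rate_bytes_per_s <= 0:
        raise ValueError("transfer rate must be positive")
    return seek_time_s + size_bytes / transfer_rate_bytes_per_s


# (seek time in seconds, transfer rate in bytes per second)
# Seek times follow the ranges in the text; transfer rates are ASSUMED for illustration.
MEDIA = {
    "cd-rom": (0.300, 1_000_000),
    "hard drive": (0.005, 50_000_000),
    "flash card": (0.000005, 10_000_000),
}

if __name__ == "__main__":
    file_size = 10_000_000  # a hypothetical 10 MB data file
    for name, (seek, rate) in MEDIA.items():
        print(f"{name}: {access_time_seconds(file_size, seek, rate):.3f} s")
```

For large files the transfer term dominates, while for many small files the seek term dominates, which is why both figures matter when comparing archival media.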

Cost is a function of the raw materials involved in the creation of media, but has more to do with what the market will bear and what the competition has to offer. Volatility, a characteristic normally ascribed to solid-state memory, refers to the status of the data when external power is removed. Flash memory, like magnetic disk or tape, is considered relatively non-volatile, and can hold data for years without loss.

Durability refers to the physical properties of the media that contribute to the longevity of the surface, mechanisms, and housing, if any, during normal use. For example, the bearings and other components in the rotational system of a hard drive undergo wear and tear over time. Stability reflects the physical properties of the media in a given environment that contribute to the longevity of the media, and therefore the data, in a dormant state. For example, even in storage, the bearings, metal, and plastic parts of a hard drive are subject to the same problems that beset every complex electro-mechanical device. Lubrication dries out, leaving bearings without protection, rubber becomes brittle, plastic parts deform, and dust and lint accumulate in the cooling system. Furthermore, the magnetic patterns induced in the iron-oxide coating on the disk platters fade over the years, especially in the heat. Similarly, the plastic-based optical media of a CD-ROM are susceptible to damage from high humidity, rapid and extreme temperature fluctuations, and contamination from airborne pollution. Over time, oil from our fingers can also damage the plastic surface of a CD-ROM. Fluctuations in temperature and humidity can also cause magnetic tape to shrink and expand, distorting the position of data tracks and resulting in data loss.

The longevity or life expectancy of the devices in an archive system is usually expressed in the Mean Time Between Failures (MTBF) rating. The MTBF, an estimate of the average time a device operates between failures during its expected lifetime, is one metric that can be used to estimate the life span of an archive. Typical MTBF ratings for tape drives and commercial-grade hard drives are over 20 years. However, this figure assumes ideal conditions of constant low temperature and humidity, and freedom from biological agents, static-electricity discharges, and mechanical abuse. Another consideration is that even if a tape survives a decade or more in a fireproof safe, it's likely that the data it contains will be inaccessible because of changes in tape-drive standards. Most of the disk packs, tapes, and magnetic cartridges that were standard archival media a decade ago are incompatible with current computer hardware.

Archives vary considerably in configuration and in proximity to the source data. For example, servers typically employ several independent hard drives configured as a Redundant Array of Independent Disks (RAID system) that function in part as an integrated archival system. The idea behind a RAID system is to provide real-time backup of data by increasing the odds that data written to a server will survive the crash of any given hard drive in the array. RAID was originally introduced in the late 1980s as a means of turning relatively slow and inexpensive hard disks into fast, large-capacity, more reliable storage systems.
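The MTBF rating and the motivation for RAID can be connected with a back-of-the-envelope calculation. The sketch below assumes a constant failure rate (an exponential failure model) and independent drives; both are simplifying assumptions that are not stated in the text, and the function names are hypothetical.

```python
import math


def failure_probability(hours, mtbf_hours):
    """Probability that one device fails within `hours`, assuming a constant failure rate."""
    return 1.0 - math.exp(-hours / mtbf_hours)


def any_drive_fails(hours, mtbf_hours, n_drives):
    """Probability that at least one of n independent drives fails within `hours`."""
    p = failure_probability(hours, mtbf_hours)
    return 1.0 - (1.0 - p) ** n_drives


if __name__ == "__main__":
    year = 8760             # hours in one year
    mtbf = 20 * year        # a 20-year MTBF rating expressed in hours
    print(f"one drive, one year:   {failure_probability(year, mtbf):.3f}")
    print(f"five drives, one year: {any_drive_fails(year, mtbf, 5):.3f}")
```

Under these assumptions, a single drive with a 20-year MTBF has roughly a 5 percent chance of failing in a given year, and an array of five such drives has roughly a 22 percent chance of losing at least one. The more drives a server holds, the more likely a failure becomes, which is why redundancy across the array matters.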

RAID systems derive their speed from reading and writing to multiple disks in parallel. The increased reliability is achieved through mirroring or replicating data across the array and by using error-detection and correction schemes. Although there are seven levels of RAID, level 3 is most applicable to bioinformatics computing. In RAID-3, a disk is dedicated to storing a parity bit (an extra bit used to determine the accuracy of data transfer) for error detection and correction. If analysis of the parity bit indicates an error, the faulty disk can be identified and replaced. The data can be reconstructed by using the remaining disks and the parity disk.

For example, in Figure 2-10, disks A–D are dedicated to data and disk P is used to store the parity bit. In this case, an odd number of "1" bits corresponds to a high ("1") parity bit. When data are written in parallel to the data disks, the corresponding parity bit is stored on the parity disk. Immediately after the data are written to the data disks, the data are read and the parity bits are compared. The discrepancy noted in Figure 2-10 is typical of a case when there is an error on one disk. The error on disk "C" can be repaired, or if groups of errors are suddenly becoming apparent indicating imminent disk failure, then the entire disk can be replaced.
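The parity scheme described above can be sketched in a few lines. The example follows the convention in the text (an odd number of "1" bits gives a parity bit of "1"), which is the XOR of the data bits. The function names are hypothetical, and the sketch assumes the failed disk is already known, for instance because the drive itself reported an error; a parity mismatch alone shows that something is wrong but not where.

```python
from functools import reduce
from operator import xor


def parity(bits):
    """Return 1 when the bits contain an odd number of 1s (XOR of all bits), else 0."""
    return reduce(xor, bits, 0)


def check_stripe(data_bits, stored_parity):
    """Re-read the data disks and compare the recomputed parity with the stored bit."""
    return parity(data_bits) == stored_parity


def reconstruct(data_bits, stored_parity, failed_index):
    """Rebuild the bit on a known-failed disk from the surviving disks and the parity disk."""
    survivors = [b for i, b in enumerate(data_bits) if i != failed_index]
    return parity(survivors) ^ stored_parity


if __name__ == "__main__":
    written = [1, 0, 1, 1]         # bits written to disks A, B, C, D
    p = parity(written)            # bit stored on parity disk P
    read_back = [1, 0, 0, 1]       # disk C returns a wrong bit
    print(check_stripe(read_back, p))    # False: parity mismatch detected
    print(reconstruct(read_back, p, 2))  # 1: original bit on disk C recovered
```

The same XOR relationship works for whole bytes or blocks rather than single bits, which is how the remaining disks and the parity disk together can regenerate the contents of a replaced drive.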

Figure 2-10. RAID-3. Data disks are read and written to in parallel,