
Archive Storage Media Trade Studies

This section justifies the implementation strategy chosen for the GSA bulk data storage subsystem.

3.1 Bulk Data Storage Media


FIGURE 24. Bulk Data Storage Subsystem overview

[Figure 24 labels. Subsystems: Data Ingest, Data Processing, Media Creation. Components: Archive Directory Database, Archive Storage Media. Commands: acp, mfsIngest, mfsOnline, mediaMigrate, AD library. Annotations: data copied from removable media to magnetic disk; archive directory for removable media and archive storage media (execute mfsIngest); data retrieval; file re-arrangement; media creation commands; Data Ingest indicates archive volumes are mounted by executing mfsOnline.]

TABLE 55. Bulk data storage subsystem requirements

SR1.1 The GSA shall archive all data.
    All data received from Gemini will be available from the bulk data storage system. The storage system will have sufficient capacity to store all data.

SR1.3.2 Data processing shall proceed promptly.
    Science and calibration data will be stored on-line, allowing the data to be accessed without manual operator intervention.


off-line: Data are stored on a shelf and are available only through operator intervention when there is a need for the data. If off-line storage is used, software must be available to prompt the operator to mount the appropriate media, and to coordinate the execution of the requesting task with the availability of the data.

on-line: All of the data are stored on spinning magnetic disk.

near-line: The data are stored in a jukebox and loaded automatically by the jukebox robotics when needed. Superficially, the data appear to be on-line; however, there are important differences between on-line data and near-line data:

• It takes between 10 and 30 seconds for the jukebox to put a volume into a reader and make it available for reading.

• The readers used in jukeboxes have considerably slower data transfer rates than magnetic disk.

• Jukeboxes have a small number of media readers (four in the jukeboxes used by the CADC). If the number of volumes in use does not exceed the number of readers in the jukebox, data throughput is limited primarily by the performance of the readers. When the number of volumes in use exceeds the number of readers, the jukebox begins “thrashing”: loading and using each volume for a short period of time before unloading it to load another. When a jukebox is thrashing, performance becomes limited by the jukebox robotics. Thrashing also causes more wear and tear on the jukeboxes.

• If more than one file is to be retrieved from each volume, the order in which the files are read can affect performance. Reading 10 files from one volume, followed by 10 other files from a second volume, can be significantly more efficient than reading the same 20 files in random order.
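The effect of read order on a jukebox can be illustrated with a small sketch. It is not part of the GSA software: the function names are hypothetical, and the eviction of the least recently used volume when all readers are full is an assumption made only to have a concrete model. The default of four readers matches the CADC jukeboxes mentioned above.

```python
from collections import defaultdict


def order_by_volume(requests):
    """Reorder (volume, filename) requests so each volume is read in one pass.

    Volumes are visited in the order they first appear in `requests`;
    files keep their original relative order within a volume.
    """
    by_volume = defaultdict(list)
    for volume, filename in requests:
        by_volume[volume].append(filename)
    # dicts preserve insertion order, so volumes come out in first-seen order
    return [(volume, name) for volume, names in by_volume.items() for name in names]


def count_volume_loads(requests, readers=4):
    """Count volume loads for a request sequence, starting from an empty jukebox.

    Assumption (not from the source): when every reader is occupied, the
    least recently used volume is unloaded to make room.
    """
    mounted = []  # least recently used volume first
    loads = 0
    for volume, _ in requests:
        if volume in mounted:
            mounted.remove(volume)  # already in a reader, no load needed
        else:
            loads += 1
            if len(mounted) == readers:
                mounted.pop(0)  # unload the least recently used volume
        mounted.append(volume)
    return loads


# Ten files spread over five volumes, requested in interleaved order.
requests = [(f"vol{i % 5}", f"file{i}") for i in range(10)]
print(count_volume_loads(requests))                   # 10 loads: every read triggers a load
print(count_volume_loads(order_by_volume(requests)))  # 5 loads: one per volume
```

With five volumes and four readers the interleaved order thrashes, while the grouped order loads each volume once. At 10 to 30 seconds per load, each avoided load is significant.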

The costs associated with each of these options are listed in Table 56 on page 128. Only CD-ROM and DVD-ROM are considered for off-line storage. Other options exist, but they are not “consumer” products, so their costs are significantly higher. The cost of on-line storage is also dropping faster than that of near-line storage, so on-line storage is becoming an increasingly attractive option. The costs given in Table 56 are those in effect in early 2001 and are subject to change. For instance:

• The cost of blank CD-ROM media is dropping slowly.

• The cost of blank DVD-ROM media is dropping moderately quickly.

• Double-sided DVD-ROM may soon be available, nearly halving the hardware cost, but the cost of the media is unknown.

• The cost of magnetic disk is dropping steadily.

TABLE 55. Bulk data storage subsystem requirements (continued)

SR4.2 Maximum and average requirements for raw data ingest.
    The bulk data storage system will be able to accept new data at the required rates. This includes retrieval of data for automatic processing (preview, derived descriptors, derived objects, etc.).

SR4.3 Maximum and average requirements for data retrieval by users.
    The bulk data storage system will be able to supply data for user requests at the required rates.

SR4.6 Data for Internet retrieval should be available promptly.
    Data must be stored on-line. Data must be available reasonably quickly.

SR5.3 Data shall be electronically secure.
    Provide tools to create backup media as necessary. Failures and recoveries must be incorporated. Proprietary data will be secure from unauthorized electronic snooping.

SR5.7 GSA operations staff shall incorporate evolving technologies into the GSA.
    It will be necessary to periodically copy both the “live” archive and the backup media to new technology. The Bulk Data Subsystem will provide facilities to do the copying. There must be no interruption in service from the Bulk Data Subsystem while the copying and media upgrades take place.

The requirement that GSA users have 24-hour-a-day access to data effectively eliminates the off-line media option. The cost of having an operator available at the archive site at all times is far higher than the approximately $20,000/year cost of storing Gemini data on-line or near-line.

Although the costs of near-line DVD-ROM and on-line storage listed in Table 56 are essentially the same, we have selected on-line storage for the Gemini Science Archive for the following reasons:

1. The cost of on-line storage is currently about the same as the cost of near-line storage.

2. Historically, the cost of on-line storage has been dropping faster than that of near-line storage. If this trend continues, on-line storage will become an increasingly attractive option.

3. The performance of on-line storage is significantly better than that of near-line storage (although properly managed jukeboxes would meet GSA performance requirements).

4. The near-line storage option would need software to manage data retrieval in order to use the jukeboxes efficiently. This software adds complexity and cost to the GSA software development.

5. Upgrading to newer, higher-density, cheaper storage will be necessary with every option; however, the upgrade path is easier and cheaper with magnetic disk than with any of the other options.

TABLE 56. Total cost of ownership of archive storage media (USD)

| Category | Need | Cost |
|---|---|---|
| off-line CD-ROM | Media | $0.65/GB |
| | Operator: To meet the availability requirement, operators would have to be available 24 hours a day. | |
| | Software to manage data retrieval | |
| off-line DVD-ROM | Media | $6.5/GB |
| | Operator: To meet the availability requirement, operators would have to be available 24 hours a day. | |
| | Software to manage data retrieval | |
| near-line CD-ROM | Media | $0.65/GB |
| | Jukeboxes | $44/GB |
| | Jukebox host system | $14/GB |
| | Jukebox driver software | $16/GB |
| | Maintenance | |
| | Software to manage data retrieval | |
| near-line DVD-ROM | Media | $6.5/GB |
| | Jukeboxes | $6/GB |
| | Jukebox host systems | $2.5/GB |
| | Jukebox driver software | $2/GB |
| | Maintenance | |
| | Software to manage data retrieval | |
| on-line RAID disk array | Media | $13/GB |
| | Host computer system | $2.5/GB |
| | Maintenance | |
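As a check on the statement that near-line DVD-ROM and on-line storage cost essentially the same, the per-gigabyte figures listed in Table 56 can be summed for each option. The sketch below uses only the figures as read from that table. Items listed without a cost (operators, maintenance, retrieval-management software) are left out, so the totals are lower bounds, not full costs of ownership. The dictionary and function names are illustrative.

```python
# Per-gigabyte costs in USD (early 2001), as read from Table 56.
# Needs that the table lists without a figure are omitted.
LISTED_COSTS = {
    "near-line CD-ROM": {
        "media": 0.65,
        "jukeboxes": 44.0,
        "jukebox host system": 14.0,
        "jukebox driver software": 16.0,
    },
    "near-line DVD-ROM": {
        "media": 6.5,
        "jukeboxes": 6.0,
        "jukebox host systems": 2.5,
        "jukebox driver software": 2.0,
    },
    "on-line RAID disk array": {
        "media": 13.0,
        "host computer system": 2.5,
    },
}


def listed_total(option):
    """Sum the costs that Table 56 quantifies for one storage option."""
    return sum(LISTED_COSTS[option].values())


for option in LISTED_COSTS:
    print(f"{option}: ${listed_total(option):.2f}/GB")
# near-line CD-ROM: $74.65/GB
# near-line DVD-ROM: $17.00/GB
# on-line RAID disk array: $15.50/GB
```

On these figures the near-line DVD-ROM and on-line options differ by $1.50/GB, while near-line CD-ROM is several times more expensive because of the jukebox hardware.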

The GSA will use off-line CD-ROM and/or DVD-ROM as backup media for the on-line archive. The choice between CD-ROM and DVD-ROM will have to take into account the following factors:

• The difference in cost between the media.

• The handling costs required to write the media.

• The handling costs required to ingest the media into the DHS system.
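One way to see how these factors interact is a simple per-gigabyte model, sketched below under stated assumptions. The media costs are the Table 56 figures. The disc capacities (0.65 GB for CD-ROM, 4.7 GB for single-layer DVD-ROM) are nominal values that do not come from this document. The model also assumes that the combined handling cost to write and ingest a disc is the same for both media types. No handling cost is given in the source, so the function solves for the break-even value instead of assuming one.

```python
# Nominal disc capacities in GB (assumptions, not taken from this document).
CD_CAPACITY_GB = 0.65
DVD_CAPACITY_GB = 4.7

# Blank media costs in USD per GB, from Table 56 (early 2001).
CD_MEDIA_COST_PER_GB = 0.65
DVD_MEDIA_COST_PER_GB = 6.5


def cost_per_gb(media_cost_per_gb, capacity_gb, handling_cost_per_disc):
    """Media cost plus the per-disc handling cost spread over the disc capacity."""
    return media_cost_per_gb + handling_cost_per_disc / capacity_gb


def break_even_handling_cost(cost_a, cap_a, cost_b, cap_b):
    """Per-disc handling cost at which media types A and B cost the same per GB."""
    return (cost_b - cost_a) / (1.0 / cap_a - 1.0 / cap_b)


h = break_even_handling_cost(
    CD_MEDIA_COST_PER_GB, CD_CAPACITY_GB, DVD_MEDIA_COST_PER_GB, DVD_CAPACITY_GB
)
print(f"break-even handling cost: ${h:.2f} per disc")
```

Under these assumptions the break-even comes out near $4.41 per disc. If writing and ingesting a disc costs more than that in operator time, DVD-ROM is cheaper per gigabyte despite its higher media price; below it, CD-ROM is cheaper.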

Gemini is currently writing CD-ROMs for delivery to the GSA as archive media. Since the costs of purchasing and writing the archive media dominate the total cost of using a media type, and since these costs are borne by Gemini, we expect Gemini will choose when to switch to DVD-ROM as the archive delivery media type.