Monitor meta-data store - Data ingest software

4. Data ingest software

4.1 Monitor meta-data store

A program named gemIngest will be created to examine the GSA meta-data tables described in [3]. gemIngest will be run daily to copy any new data from the Gemini meta-data stores into the GSA catalogues. This program will perform the following tasks:

• Populate the received table with any new datasets found in the meta-data stores from both Gemini North and Gemini South.

• Populate the dataSuperset catalogues described in Chapter 4 from the Gemini North and Gemini South meta-data stores.

• Copy the observing programs from the Gemini North and Gemini South meta-data stores to the GSA catalogues observing program table. This will also involve extracting various parameters from the observing programs and inserting them into the GSA observing program tables as searchable fields.

• Merge the Gemini North and Gemini South observing logs into a single observing log.

• Merge the Gemini North and Gemini South environmental data tables into a single table.

• If necessary, merge the Gemini North and Gemini South publications tables into a single table. However, since it seems more efficient to have only one group monitoring publications for both telescopes, there will probably be only one publications table.

• Queue the first level data processing for each dataset for execution. The first level data processing will include:

— Examining external object catalogues (SIMBAD and NED) to determine which objects should be in the field of view, and inserting these objects into the object catalogue tables.

FIGURE 17. Data Ingest subsystem overview (diagram not reproduced here; it shows the Meta Data Store, Bulk Data Storage, Data Processing and Catalogues subsystems together with gemIngest, gemVerify, fitsIngest, the received table, the received library, the header ingest data dictionary library and the header ingest table library).

— A Data Processing Discovery Agent task which will identify any further data processing appropriate for the dataset. The Data Processing Discovery Agents are described in Chapter 5, section 4.3 on page 89.

The initial level of data processing will be scheduled for high batch priority execution. The initial data processing will not require any access to proprietary data, and so is not dependent on the release date of the raw data.

• Detect any modifications to existing data in the Gemini meta-data stores, and begin the propagation of the changes through the GSA catalogues. This will involve re-copying the modified meta-data to the catalogues.

• A change to a release date which causes a public dataset to become proprietary will cause a high priority data processing task to be started. This task will examine the GSA system and determine which archive data products should be removed from the system (previews, image descriptors derived from the data, sources, object parameters derived from the data, etc.). Initially, this task will send a message to the archive operators. The message will include a script which an operator can execute to perform any required deletions from the archive tables. If making public data proprietary becomes a common occurrence, and when confidence is gained that the deletions are correct, then this process may be automated. (Changes to a release date should normally occur before the data becomes public, and so making public data proprietary should be considered error recovery and not part of the normal way of operating.)
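
The release-date check in the last task above could be sketched as follows. This is only an illustration under stated assumptions: the derived-product table names, the datasetId column and the rule that a dataset is public once its release date is on or before the current date are all hypothetical and are not defined in this document.

```python
from datetime import date

# Hypothetical names of GSA tables that hold products derived from a dataset.
DERIVED_TABLES = ("previews", "imageDescriptors", "sources", "objectParameters")


def is_public(release_date, today):
    # Assumption: a dataset is public once its release date has been reached.
    return release_date <= today


def became_proprietary(old_release, new_release, today):
    """True when a release-date change turns a public dataset proprietary."""
    return is_public(old_release, today) and not is_public(new_release, today)


def deletion_script(dataset_id):
    """Build SQL text for an operator to review and run by hand."""
    quoted = dataset_id.replace("'", "''")  # escape quotes inside the literal
    return "\n".join(
        f"DELETE FROM {table} WHERE datasetId = '{quoted}';"
        for table in DERIVED_TABLES
    )


def operator_message(dataset_id, old_release, new_release, today):
    """Return the message for the archive operators, or None if not needed."""
    if not became_proprietary(old_release, new_release, today):
        return None
    header = (
        f"Dataset {dataset_id} was public (release {old_release}) and is now "
        f"proprietary until {new_release}. Review and run this script:"
    )
    return header + "\n" + deletion_script(dataset_id)
```

Keeping the script generation separate from its execution matches the initial manual procedure: nothing is deleted until an operator has read the message and run the script.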

For efficiency, this program will use the timestamps in the Gemini meta-data stores to determine which data needs to be examined. By default, the program will process only new rows in each of the meta-data store tables in timestamp order. Each time the program completes processing one of the tables, the largest timestamp processed will be recorded in the ingestProcess table. The next time the program executes, it can resume processing at the next newest row in the tables.
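
A minimal sketch of this resumable, timestamp-driven processing is shown below, using SQLite purely for illustration. The column names (sourceTable, lastTimestamp, id, ts) and the integer timestamps are assumptions; only the ingestProcess table name comes from the text above.

```python
import sqlite3


def process_new_rows(conn, source_table, handle_row):
    """Process rows newer than the recorded timestamp, oldest first."""
    cur = conn.cursor()
    cur.execute(
        "SELECT lastTimestamp FROM ingestProcess WHERE sourceTable = ?",
        (source_table,),
    )
    found = cur.fetchone()
    watermark = found[0] if found else 0  # no record yet: start from the beginning

    # Table names cannot be bound as SQL parameters; source_table is assumed
    # to come from trusted configuration, never from user input.
    cur.execute(
        f"SELECT id, ts FROM {source_table} WHERE ts > ? ORDER BY ts",
        (watermark,),
    )
    rows = cur.fetchall()

    latest = watermark
    for row_id, ts in rows:
        handle_row(row_id, ts)
        latest = max(latest, ts)

    # Record the largest timestamp only after the whole table is processed.
    cur.execute(
        "INSERT OR REPLACE INTO ingestProcess (sourceTable, lastTimestamp) "
        "VALUES (?, ?)",
        (source_table, latest),
    )
    conn.commit()
    return len(rows)


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE ingestProcess "
        "(sourceTable TEXT PRIMARY KEY, lastTimestamp INTEGER)"
    )
    conn.execute("CREATE TABLE datasets (id TEXT, ts INTEGER)")
    conn.executemany("INSERT INTO datasets VALUES (?, ?)", [("a", 10), ("b", 20)])
    seen = []
    first = process_new_rows(conn, "datasets", lambda i, t: seen.append(i))
    conn.execute("INSERT INTO datasets VALUES ('c', 30)")
    second = process_new_rows(conn, "datasets", lambda i, t: seen.append(i))
    return first, second, seen
```

One caveat of this sketch: because it selects rows strictly newer than the recorded timestamp, a row that arrives later carrying a timestamp equal to the recorded one would be skipped. A real implementation would need to decide how to handle such ties.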

The program will have command-line options to enable the following functions:

• Process all datasets, not just the new datasets.

• Process all datasets in a given range of datasets.

• Only process datasets from a specified meta-data database (either Gemini North or Gemini South).

• Process data from the dataset meta-data store tables, but not the other tables in the meta-data stores unless explicitly requested by other command line options.

• Process environmental data from the meta-data stores, but not the other tables in the meta-data stores unless explicitly requested by other command line options.

• Process publication information from the meta-data stores, but not the other tables in the meta-data stores unless explicitly requested by other command line options.

• Process the electronic observing logs from the meta-data stores, but not the other tables in the meta-data stores unless explicitly requested by other command line options.

• Process the Observing program tables from the meta-data stores, but not the other tables in the meta-data stores unless explicitly requested by other command line options.
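
The options listed above might map onto a command-line parser along the following lines. All flag names are hypothetical, since this document does not specify them. The sketch also assumes that --all and --range are mutually exclusive, and that when no table flag is given every table is processed.

```python
import argparse

# Hypothetical flag names, one per group of meta-data store tables.
TABLE_FLAGS = ("datasets", "environment", "publications", "obslogs", "programs")


def build_parser():
    parser = argparse.ArgumentParser(prog="gemIngest")
    scope = parser.add_mutually_exclusive_group()
    scope.add_argument(
        "--all", action="store_true",
        help="process all datasets, not just the new ones",
    )
    scope.add_argument(
        "--range", nargs=2, metavar=("FIRST", "LAST"),
        help="process all datasets in the given range",
    )
    parser.add_argument(
        "--site", choices=("north", "south"),
        help="only process one meta-data database",
    )
    for name in TABLE_FLAGS:
        parser.add_argument(
            f"--{name}", action="store_true",
            help=f"process the {name} tables; other tables only if requested",
        )
    return parser


def selected_tables(args):
    """Tables to process: the requested ones, or all when none is requested."""
    chosen = [name for name in TABLE_FLAGS if getattr(args, name)]
    return chosen or list(TABLE_FLAGS)
```

For example, parsing `--site north --datasets --obslogs` would restrict a run to the Gemini North dataset and observing log tables.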

The program will have a configuration file which contains the following data:

• The database name and table names of the meta-data stores described in [3]. This should include information describing separate Gemini North and Gemini South meta-data stores, and should also allow for meta-data stores generated by the GSA using the fitsIngest program described in Section 4.3 on page 100.

• A list of data processing tasks to start for each new dataset.

• The names of the GSA catalogue tables to receive the meta-data.
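
As an illustration only, a configuration file holding the items listed above could look like the INI-style sample below. The file format, section names, keys and values are all assumptions; this document does not define them.

```python
import configparser

# Hypothetical configuration: one section per meta-data store, plus the
# processing tasks to start and the GSA catalogue tables to fill.
SAMPLE = """
[store:north]
database = gemini_north_meta
dataset_table = datasets

[store:south]
database = gemini_south_meta
dataset_table = datasets

[store:gsa_fits]
database = gsa_fits_meta
dataset_table = datasets

[processing]
tasks = objectCatalogueLookup, discoveryAgent

[catalogues]
received_table = received
"""


def load_config(text):
    """Parse the configuration text into stores, tasks and catalogue tables."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    stores = {
        section.split(":", 1)[1]: dict(parser[section])
        for section in parser.sections()
        if section.startswith("store:")
    }
    tasks = [
        task.strip()
        for task in parser["processing"]["tasks"].split(",")
        if task.strip()
    ]
    catalogues = dict(parser["catalogues"])
    return stores, tasks, catalogues
```

Using one section per store leaves room for stores generated by fitsIngest alongside the Gemini North and Gemini South stores without changing the parsing code.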

In document Conceptual Design Document (Revised Initial) (Page 103-106)