Catalogue maintenance - Conceptual Design Document (Revised Initial)

The maintenance of the data in the catalogues will be done by a variety of tasks described in the following sections. An overview of the catalogue subsystem is shown in Figure 12 on page 71. The shaded area indicates the components which are a part of the catalogue subsystem. The data flow through the catalogue tables is shown in Figure 13 on page 72.

4.1 Data superset catalogue

The data superset catalogue will be maintained by the Data Processing Discovery Agents described in Chapter 5, section 4.3 on page 89, and by the Data Ingest task described in Chapter 6, section 4.1 on page 97. The Data Ingest task will add simple datasets to the data superset table, and the Data Processing Discovery Agent tasks will identify associations and insert them into these tables. Some of the information destined for these tables may be produced by further data processing.

TABLE 27. Columns of table objectPropertiesInt

Name | Comment | Data Type | Mandatory | Primary | Foreign Key
error | The error associated with this value. | integer | TRUE | FALSE | FALSE
rank | The rank of this value relative to other known values for this attribute. | tinyint | TRUE | FALSE | FALSE

TABLE 28. Columns of table objectPropertiesString

Name | Comment | Data Type | Mandatory | Primary | Foreign Key
objectId | The unique object id. | binary(8) | TRUE | TRUE | FALSE
attributeId | The attribute id. | binary(4) | TRUE | TRUE | TRUE
recipeInstanceId | The recipe instance which produced this value. | binary(8) | TRUE | TRUE | TRUE
value | The value of the attribute. | varchar(255) | TRUE | FALSE | FALSE
certainty | This is an estimate of the degree of certainty for this value. | float | TRUE | FALSE | FALSE
rank | The rank of this value relative to other values for the attribute. | tinyint | TRUE | FALSE | FALSE


4.2 Science table

The science table will be maintained by the gemScience program. This program will examine the data superset catalogue created by the GSA Data Ingest subsystem and the data processing subsystem. The program will have rules to determine which data supersets are “science” data, and will move these datasets into the science table. The GSA data processing system will run this program periodically (daily) to update the science catalogue. The data in the science table is always public, and so this table will be populated as soon as Gemini inserts catalogue data into the meta-data store.

The rules for determining when a data superset is useful for science will be developed in cooperation with Gemini. The rules used by gemScience will be similar to the rules used by other archives at the CADC. Some examples of the rules used by this program would be:

• Examining observation type and observing program id to eliminate calibration data.

• Checking for the existence of associations which should replace member datasets.

• Checking data quality to remove data which is not scientifically useful.

In addition to the above rules, new rules will be created as experience is gained with the Gemini data, and with the content of the science table.
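As a rough illustration only, the three example rules above could be expressed as independent predicates that are applied to each data superset. The sketch below assumes hypothetical field names (observation_type, program_id, quality, member_of) and hypothetical values such as "OBJECT", "CAL" and "FAIL"; the actual rules and catalogue columns are to be developed in cooperation with Gemini.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class DataSuperset:
    # All field names and values here are hypothetical placeholders.
    superset_id: str
    observation_type: str            # e.g. "OBJECT", "BIAS", "FLAT"
    program_id: str
    quality: str                     # e.g. "PASS", "USABLE", "FAIL"
    member_of: Optional[str] = None  # id of an association containing this dataset


Rule = Callable[[DataSuperset], bool]


def not_calibration(ds: DataSuperset) -> bool:
    # Rule 1: use observation type and program id to eliminate calibration data.
    return ds.observation_type == "OBJECT" and not ds.program_id.startswith("CAL")


def not_replaced_by_association(ds: DataSuperset) -> bool:
    # Rule 2: a member dataset is replaced by the association that contains it.
    return ds.member_of is None


def acceptable_quality(ds: DataSuperset) -> bool:
    # Rule 3: remove data which is not scientifically useful.
    return ds.quality != "FAIL"


RULES: List[Rule] = [not_calibration, not_replaced_by_association, acceptable_quality]


def select_science(supersets: List[DataSuperset], rules: List[Rule] = RULES) -> List[DataSuperset]:
    # A data superset is treated as science data only if every rule accepts it.
    return [ds for ds in supersets if all(rule(ds) for rule in rules)]
```

Keeping each rule as a separate function would allow new rules to be appended to the list as experience is gained, without changing the selection logic.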

4.3 Observations table

The observations table will be maintained by a set of data processing recipes. These recipes will be executed by the data processing system as each dataset becomes public. The recipes may be instrument specific, and will calculate the data to be inserted into the observations table based on information in the science table, and when necessary, based on the pixel data. As each row is inserted into the observations table, the recipes will search for any other observations with overlapping fields of view, and insert appropriate rows into the observationOverlap table.
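The overlap search could take many forms. The following is a minimal sketch which assumes, purely for illustration, that each field of view is approximated by a circle on the sky with a centre (right ascension and declination in degrees) and a radius in degrees. The Observation class, its field names and the overlap_rows function are hypothetical; real fields of view may need instrument-specific footprints.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Observation:
    # Hypothetical circular approximation of a field of view.
    obs_id: str
    ra_deg: float
    dec_deg: float
    radius_deg: float


def angular_separation_deg(a: Observation, b: Observation) -> float:
    # Haversine formula for the great-circle distance between two sky positions.
    ra1, dec1, ra2, dec2 = map(math.radians, (a.ra_deg, a.dec_deg, b.ra_deg, b.dec_deg))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))


def overlap_rows(new_obs: Observation, existing: List[Observation]) -> List[Tuple[str, str]]:
    # Return one (new id, other id) pair for each existing observation whose
    # circular field of view overlaps the new one.
    rows = []
    for other in existing:
        if other.obs_id == new_obs.obs_id:
            continue
        if angular_separation_deg(new_obs, other) < new_obs.radius_deg + other.radius_deg:
            rows.append((new_obs.obs_id, other.obs_id))
    return rows
```

Each returned pair corresponds to a candidate row for the observationOverlap table. A production recipe would probably use a spatial index rather than comparing against every existing observation.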

FIGURE 12. Catalogue Subsystem Overview

[Figure not reproduced. Labelled components: gemScience, science table, dataSuperset catalogue tables, observation table, source tables, object tables, Data Ingest <<subsystem>>, Data Processing <<subsystem>>.]


FIGURE 13. Data flow through the catalogue tables

[Figure not reproduced. Tables shown: dataset, dataSupersetCatalogue, science, observations, sourceCatalogue, object. The annotations from the figure follow.]

Simple datasets are added to the dataSuperset catalogue table as Gemini adds datasets to the meta-data store.

Associated data supersets are added to this table as they are discovered by the Data Processing Discovery Agent. Public attributes will be filled in immediately. Attributes which must be derived from pixel data will be added by data processing done after the data becomes public.

The science table will be maintained by the gemScience program. The data superset catalogue will be the source of information inserted into the science table.

The observations table will be maintained by a set of data processing recipes triggered when data becomes public.

The source catalogue is populated by data processing recipes. The recipes will examine public data to detect sources and calculate object parameters.

The object catalogue will be maintained by data processing recipes. Some recipes will examine the field of view of images and extract object information from external catalogues. Some recipes will examine the source catalogue and merge the sources into objects.


4.4 Source catalogue

The source catalogue will be maintained by a set of data processing recipes. These recipes will be executed as each dataset becomes public. The recipes will examine the calibrated pixel data of the science observations, and produce source parameters based on the calculations. These recipes will be data class specific (i.e. some will process spectral data, others will process image data), but will probably not be instrument specific. Initially, the recipes will execute the sextractor program to detect sources [9], and the galfit program to determine the galaxy morphology [10]. The recipes used to populate the source catalogue will be expanded over the life of the archive.
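One possible way to organise data class specific recipes is a registry keyed by data class, so that the data processing system only needs to know the class of a newly public dataset. The sketch below is an assumption for illustration, not the design of the data processing system: the data class labels ("image", "spectrum"), the recipe names and the registry itself are hypothetical, and the recipe bodies are placeholders that do not invoke any external program.

```python
from typing import Callable, Dict, List

# A recipe takes a dataset id and returns a short description of what it did.
Recipe = Callable[[str], str]

# Hypothetical registry: data class label -> recipes for that class.
RECIPES: Dict[str, List[Recipe]] = {}


def recipe(data_class: str) -> Callable[[Recipe], Recipe]:
    # Decorator that registers a recipe under the given data class.
    def register(func: Recipe) -> Recipe:
        RECIPES.setdefault(data_class, []).append(func)
        return func
    return register


@recipe("image")
def detect_sources(dataset_id: str) -> str:
    # Placeholder: a real recipe would run a source detection program here.
    return f"detect_sources({dataset_id})"


@recipe("image")
def fit_morphology(dataset_id: str) -> str:
    # Placeholder: a real recipe would run a morphology fitting program here.
    return f"fit_morphology({dataset_id})"


@recipe("spectrum")
def measure_spectrum(dataset_id: str) -> str:
    # Placeholder for a hypothetical spectral recipe.
    return f"measure_spectrum({dataset_id})"


def run_recipes(dataset_id: str, data_class: str) -> List[str]:
    # Run every recipe registered for the data class, in registration order.
    return [r(dataset_id) for r in RECIPES.get(data_class, [])]
```

With this arrangement, expanding the set of recipes over the life of the archive would amount to registering additional functions.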

4.5 Object catalogue

The object catalogue will be maintained by a set of data processing recipes. There will be two categories of recipes to maintain the object catalogue:

• Recipes which examine the science table and extract objects from other catalogues (Simbad and NED) which are in each observation’s field of view. These recipes will only use field of view information which is always public, and so the data processing system can execute these recipes as soon as datasets are inserted into the science table.

• Recipes which examine the content of the source catalogue, merge attributes which describe the same object into a single set of attribute-values describing the object, and rank the various values for each attribute. These recipes will be dependent on the source catalogue content, which is in turn dependent on the pixel data being public.

Each of the data processing recipes will use a shared object identification algorithm to determine which objects are the same, and which objects are different.

Each of the data processing recipes will use a shared attribute ranking algorithm to determine the relative goodness of attributes taken from various sources.
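The ranking algorithm itself is not specified here. As a minimal sketch, the example below assumes that the certainty estimate stored with each value is the only ranking criterion, and that rank 1 denotes the best value. The AttributeValue class and the rank_values function are hypothetical, and the choice of "higher certainty ranks first" is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AttributeValue:
    # One candidate value for a single attribute of a single object.
    recipe_instance_id: str
    value: str
    certainty: float
    rank: int = 0  # 0 means "not yet ranked"


def rank_values(values: List[AttributeValue]) -> List[AttributeValue]:
    # Order candidates by descending certainty and assign ranks starting at 1.
    # sorted() is stable, so values with equal certainty keep their input order.
    ordered = sorted(values, key=lambda v: v.certainty, reverse=True)
    for position, candidate in enumerate(ordered, start=1):
        candidate.rank = position
    return ordered
```

A shared function of this kind would let every recipe rank values from different sources in the same way, whatever criterion is eventually chosen.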

Developing the object identification algorithm is a challenging task, and the algorithm may not be available with the initial release of the system.
