
3 Data Reduction Approach

3.3 Data Management

As commented in this chapter, DPAC has enforced the adoption of a well-defined Interface Control Document (ICD) to ensure the intercommunication of all the processing systems participating in the data reduction effort [Hernandez, 2014]. This ICD defines the interfaces between all the systems through a well-defined and controlled Data Model (DM). All the data produced following this DM ends up in the Gaia MDB. The MDB can be understood as a hub for all the data produced by the Gaia data processing systems, as shown in Figure 3.1.

As mentioned several times, new data is received on a daily basis and enters the data processing pipeline continuously. It seems obvious that the data, which will reach very large volumes, needs to be partitioned. This partition is also useful for defining and coordinating the data entering the iterative processing. The data partition, in the form of Data Segments (DSs), is defined according to the On Board Mission Time (OBMT) and nominally will cover periods of approximately six months. The length of the DS will ultimately be adapted to the Gaia release schedule. Finally, the DRC will always be phased with the definition of the DS. The plan is to version the MDB at regular intervals in phase with the DS and DRC [Hernandez, 2013]. Due to the iterative nature of the science processing, new versions of the MDB will always be derived from the data of the previous version, and therefore the previous version will be completely superseded by the new one. For this purpose, all the data products are uniquely tagged with a solution identifier, encoding at least the DRC and the software that generated the data [Hernandez, 2012]. This solution identifier is also of great help for tracking the input data used to generate subsequent data products, or when data qualification is required, allowing for example the retraction or deletion of invalid data.
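The partitioning and tagging described above can be illustrated with a minimal sketch. It is not the DPAC implementation: the source only states that the solution identifier encodes at least the DRC and the producing software, so the field widths, the extra run counter, the segment boundaries and all names below are assumptions made for illustration.

```python
from bisect import bisect_right
from dataclasses import dataclass

# Hypothetical bit layout (8 + 16 + 40 = 64 bits); not taken from the source.
DRC_BITS, SOFTWARE_BITS, RUN_BITS = 8, 16, 40


def data_segment_index(obmt, boundaries):
    """Return the index of the Data Segment containing `obmt`.

    `boundaries` is a sorted list of segment start times (hypothetical
    values); segment i covers [boundaries[i], boundaries[i + 1]).
    """
    idx = bisect_right(boundaries, obmt) - 1
    if idx < 0:
        raise ValueError("OBMT precedes the first Data Segment")
    return idx


@dataclass(frozen=True)
class SolutionId:
    drc: int       # Data Reduction Cycle number
    software: int  # hypothetical numeric code of the producing system
    run: int       # hypothetical counter distinguishing runs

    def pack(self):
        """Encode the three fields into one integer."""
        fields = ((self.drc, DRC_BITS), (self.software, SOFTWARE_BITS),
                  (self.run, RUN_BITS))
        for value, bits in fields:
            if not 0 <= value < (1 << bits):
                raise ValueError("field out of range")
        return ((self.drc << (SOFTWARE_BITS + RUN_BITS))
                | (self.software << RUN_BITS) | self.run)

    @staticmethod
    def unpack(packed):
        """Recover the fields from a packed identifier."""
        run = packed & ((1 << RUN_BITS) - 1)
        software = (packed >> RUN_BITS) & ((1 << SOFTWARE_BITS) - 1)
        drc = packed >> (SOFTWARE_BITS + RUN_BITS)
        return SolutionId(drc, software, run)


def retract(products, bad_solution_id):
    """Drop every product tagged with an invalidated solution identifier."""
    return [p for p in products if p["solution_id"] != bad_solution_id]
```

Because every product carries its identifier, a faulty run can be retracted with a single filter, without inspecting the contents of the products themselves.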

In general, each DPC stores a fraction of the current and past MDB versions depending on the needs of its processing systems. Only DPCE stores a full version of the MDB, including the raw data from the spacecraft. Once an MDB is closed, a new MDB is initialised, in general with a new DM adjusted to the most recent updates in the calibration models and data product changes.

It will be a normal scenario to have to deal with data from an earlier version of the MDB and DM; typically, we need to read older input data with a newer version of the DM. In general, if no breaking changes have been introduced in the DM, the DPAC software is capable of reading old versions of the data. In case of breaking changes, tools are available for the conversion, which will be the responsibility of each DPC.
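The reading policy described above can be sketched as follows. This is only an illustration under stated assumptions: the (major, minor) version scheme, the converter registry and the renamed field are hypothetical and do not reflect the actual DPAC tools.

```python
# Hypothetical registry of converters, keyed by the DM major version
# they upgrade from. A major version bump is assumed to mark a breaking change.
CONVERTERS = {}


def register_converter(from_major):
    """Decorator registering a converter from `from_major` to the next major."""
    def decorator(func):
        CONVERTERS[from_major] = func
        return func
    return decorator


@register_converter(1)
def v1_to_v2(record):
    # Illustrative breaking change: a field renamed between major versions.
    out = dict(record)
    out["flux"] = out.pop("flux_e_per_s")
    return out


def read_record(record, data_version, reader_version):
    """Return `record` in the layout expected by `reader_version`.

    Versions are (major, minor) tuples. Minor differences are assumed to be
    non-breaking, so the record is returned unchanged in that case.
    """
    major, _ = data_version
    target_major, _ = reader_version
    if major > target_major:
        raise ValueError("data is newer than the reader's data model")
    current = record
    while major < target_major:
        if major not in CONVERTERS:
            raise LookupError(f"no converter from DM major version {major}")
        current = CONVERTERS[major](current)
        major += 1
    return current
```

Chaining one converter per breaking change keeps each conversion small, at the cost of several passes over data that is many versions old.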

All the operations related to the MDB and DM handling are fully described in Els [2014]. This technical note also summarises all the details related to data transfers, data naming conventions and the set of procedures for data conversion and data recovery.

As described in previous sections, during the DRC the different processing systems produce specific data products which are sent to DPCE for their addition to the MDB. In general, integrating these data products does not imply any additional task, except for the data contributing to or updating the source parameters of the Gaia catalogue. In practice, a final process must be executed to integrate the partial source parameter solutions. This system is called the MDB Integrator [Hutton, 2014] and is in charge of merging the updates and resolving any conflicts found in the data to be integrated. More details are provided in Section 4.3, devoted to the IDU Cross-Match task.
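The merging step can be pictured with a short sketch. It does not describe the actual MDB Integrator: the priority-based conflict rule, the record layout and the system names used in the usage example are assumptions chosen only to make the idea concrete.

```python
def integrate(catalogue, updates, priority):
    """Merge partial source-parameter updates into a copy of `catalogue`.

    catalogue: {source_id: {parameter: value}}
    updates:   list of (system, source_id, {parameter: value})
    priority:  {system: rank}; hypothetical rule in which the higher rank
               wins a conflict and ties keep the value received first
    Returns (merged_catalogue, conflicts), where each conflict is a tuple
    (source_id, parameter, winning_system, losing_system).
    """
    merged = {sid: dict(params) for sid, params in catalogue.items()}
    owner = {}      # (source_id, parameter) -> system holding the value
    conflicts = []
    for system, sid, params in updates:
        target = merged.setdefault(sid, {})
        for name, value in params.items():
            key = (sid, name)
            previous = owner.get(key)
            if previous is None or priority[system] > priority[previous]:
                if previous is not None:
                    conflicts.append((sid, name, system, previous))
                target[name] = value
                owner[key] = system
            else:
                conflicts.append((sid, name, previous, system))
    return merged, conflicts
```

For instance, with `priority = {"IDU": 1, "AGIS": 2}` (illustrative ranks), two updates to the same parameter of the same source are resolved in favour of the second system, and the conflict is recorded for later inspection rather than silently discarded.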

It is only when the MDB Integrator has run, at the end of the DRC, that the current MDB can be effectively closed. DPCE both hosts the MDB and runs the MDB Integrator. The expected data volume of the MDB is summarised in Appendix D, whereas Appendix C describes the data exchange scheme and the technology adopted for the GTS.

Regarding the Gaia data releases, they will be created from the MDB when a given DRC is closed. Initially, only a fraction of the catalogue will be extracted, but ultimately the full MDB contents will be made public to the scientific community, as described in Luri et al. [2013].

3.4 Conclusions & Contributions of this thesis

We have summarised the data reduction approach adopted for Gaia. This summary has covered the most relevant systems, starting from the daily processing (IDT and FL) and finishing with the main systems involved in the astrometric core solution: IDU, AGIS and PhotPipe. We have also described in detail the main steps involved in the iterative reduction of the astrometric data and the essential role of the IDU and AGIS systems.

As remarked in Section 3.2, IDT and IDU have similar features. A large fraction of the algorithms developed for IDU during this thesis has also been integrated in IDT and is actively used in the daily pipeline at DPCE. We must highlight the contributions made to the raw data handling (see Section 2.8), the Cross-Match and the Image Parameters Determination (IPD) tasks. The work done for this thesis has also contributed to the improvement of the algorithms provided by other CUs, mainly CU1 and CU5. Additionally, many monitoring tools have been developed during this thesis and integrated in both systems. A detailed list of these tools is available in Section 4.8.

It is worth pointing out that the definition of this data reduction approach has been achieved thanks to the efforts of all the CU and DPC teams. We have participated as CU3 members but also as DPCB members, thus ensuring the fulfilment of both the scientific requirements of IDU and the technical aspects related to data provision and processing scheduling. Due to the iterative nature of the processing and the interdependencies between all the systems, the definition of a detailed operations schedule is fundamental. The adoption of a common DM and ICD also plays an important role in the successful integration of all the systems. We have participated actively in the definition of this DM, which has required several updates to fulfil the main interface requirements between the data reduction systems, in our case AGIS and IDU.

At the time of writing, the first AGIS run is being conducted, already using the Cross-Match (IDU-XM) results (see Chapter 6). In a few months, when this data is distributed to all DPCs, the very first iterative loop will start. This is a very relevant milestone and achievement that should demonstrate the correct interfacing of all the systems developed during the last years.

In document High performance computing of massive Astrometry and Photometry data from Gaia (Page 96-100)