D7.7 Data Management Plan (a)


Management Environment Based on Nested Recursive Parallelism

Project Number 671603

D7.7 – Data Management Plan (a)

WP7: Dissemination and Exploitation

Version: 1.0

Author(s): Pierre Lemarinier (IBM), Emanuele Ragnoli (IBM), Benoit Leonard (NUMECA), Roman Iakymchuk (KTH)


Due date: PM6

Submission date: day/month/year

Project start date: 01/10/2015

Project duration: 36 months

Deliverable lead organization: IBM

Version: 1.0

Status: Final

Author(s): Pierre Lemarinier (IBM), Emanuele Ragnoli (IBM), Benoit Leonard (NUMECA), Roman Iakymchuk (KTH)

Reviewer(s): Erwin Laure (KTH), Michel Pottiez (NUMECA)

Dissemination level: PU - Public

Disclaimer

This deliverable has been prepared by the responsible Work Package of the Project in accordance with the Consortium Agreement and the Grant Agreement Nr 671603. It solely reflects the opinion of the parties to such agreements on a collective basis in the context of the Project and to the extent foreseen in such agreements.


Acknowledgements

The work presented in this document has been conducted in the context of the EU Horizon 2020 programme. AllScale is a 36-month project that started on October 1st, 2015 and is funded by the European Commission.

The partners in the project are UNIVERSITÄT INNSBRUCK (UIBK), FRIEDRICH-ALEXANDER-UNIVERSITÄT ERLANGEN NÜRNBERG (FAU), THE QUEEN'S UNIVERSITY OF BELFAST (QUB), KUNGLIGA TEKNISKA HÖGSKOLAN (KTH), NUMERICAL MECHANICS APPLICATIONS INTERNATIONAL SA (NUMECA), IBM IRELAND LIMITED (IBM).

The content of this document is the result of extensive discussions within the AllScale Consortium as a whole.

More information

Public AllScale reports and other information pertaining to the project are available on the AllScale public website at http://www.allscale.eu.

Version History

Version | Date | Comments, Changes, Status | Authors, contributors, reviewers
0.1 | 08/02/16 | General document shaping | Author: Pierre Lemarinier (IBM)
0.2 | 03/03/16 | Pilot applications sections | Authors: Emanuele Ragnoli (IBM), Benoit Leonard (NUMECA), Roman Iakymchuk (KTH)
0.3 | 07/03/16 | Runtime section | Author: Pierre Lemarinier (IBM); Reviewer: Kostas Katrinis (IBM)
0.4 | 10/03/16 | Introduction, conclusion | Author: Pierre Lemarinier (IBM); Reviewers: Emanuele Ragnoli (IBM), Michel Pottiez (NUMECA), Roman Iakymchuk (KTH)
0.5 | 14/03/16 | Working group review amendment | Reviewers: Erwin Laure (KTH), Michel Pottiez (NUMECA)




Executive Summary

This document describes the Data Management Plan of the AllScale project. The project will provide an environment and runtime for the effective development of highly scalable, resilient and performance-portable parallel applications for Exascale systems. Three pilot applications will be developed using this environment in order to demonstrate its features and capabilities.

At this stage, the benchmarks for the subcomponents of the AllScale environment, and the data they will produce, have not been identified. We will describe and open these data sets in a future data management plan once the relevant benchmarks are identified. The three pilot applications have identified their input and output data sets, which are described in Section 3.

The resulting data size is expected to be in the order of terabytes, and we are in the process of identifying a data repository that can store such a volume. We cannot yet provide details on the repository we will use. Because the next data management plan, deliverable D7.8, is not due until PM36, we will provide an update on our solution in deliverable D1.1: Periodic report, due at PM18.


1 Introduction

AllScale participates in the Open Data pilot [1], [2] and will therefore aim to share open data by striving to make dataset derivatives publicly available. This will apply to datasets used for the validation of publicly disseminated outcomes of the technical work (e.g. publications, white papers), and may also extend to datasets derived from experimental or evaluation work that does not form part of the formal communication of research and innovation outcomes.

This deliverable outlines the initial data management plan (DMP) for AllScale, in line with the H2020 guidelines for data management plan creation [3]. It identifies the initial classes of datasets that the project expects to contribute to the respective communities, and then outlines their type, format, metadata and sharing modalities. At this stage, we foresee the following classes of datasets to be made available: a) data corresponding to the execution of benchmarks and applications on target machines, which will appeal to the technical HPC software/systems community (e.g. power consumption time series resulting from execution in the AllScale environment, data movement patterns, etc.), and b) data with application semantics, which will appeal to the domains corresponding to the three AllScale pilot applications. We detail these sets of data in the corresponding sections below.

The data opened to the community will amount to terabytes and must be stored in publicly accessible storage. Currently, no institution in the AllScale consortium has a storage service that can host data of this size. Nonetheless, the academic partner institutions have ongoing efforts to set up such a service, and we plan to use one of these storage services once available. We are also studying the possibility of using the B2SHARE [4] storage service in case none of these services is available in time. As our investigation of the precise storage service progresses, and considering that the next deliverable on the data management plan is due at the end of the project, we will report on this in D1.1: Periodic report, due at PM18.

The rest of the document is structured as follows: Section 2 presents our plan for data related to the AllScale distributed runtime and its components. Section 3 presents our plan for the data of the three pilot applications, and Section 4 draws our conclusions and outlines future work.

2 Management of Runtime Data

The capability and performance of the AllScale environment will be evaluated through the three pilot applications. In the process, the various components of the environment, such as the compiler, runtime, scheduler, monitoring system and resiliency component, will be qualitatively evaluated, both individually and as a system. Initially, and in addition to the application evaluation, we expect to use micro-benchmarks to evaluate the environment components. These micro-benchmarks, along with their corresponding data sets and descriptions, will be identified at a later stage of the project, after the design and initial prototyping of the target components to be benchmarked. We thus plan to elaborate on them in subsequent refined versions of the data management plan. Because the data sets that will be used are not yet identified, it is possible that some of this data will not be made public if doing so would compromise future exploitation.

Copyright © AllScale Consortium Partners 2016

3 Application Data Management Plan

The AllScale project includes three pilot applications to demonstrate the capabilities and performance improvements its concepts provide: AMDADOS, iPIC3D, and Fine/Open. For each of these pilot applications, data will be generated and opened to the community. Below we detail the relevant data for each application, following the requested description format.

3.1 AMDADOS: Adaptive Meshing and Data Assimilation for the Deepwater Horizon Oil Spill

Data set reference and name

AMDADOS DeepWater Horizon Simulation

Data set description

Three sub-datasets will be collected or generated and stored in the same dataset:

- Topological data: bathymetry and temporal data to define the spatial and temporal framework of the oil spill.

- Input data: atmospheric, boundary and petroleum datasets collected from open sources.

- Output data: data on the predicted location and density of the oil spill.

Input and topological data are collected from open sources such as NOAA. Output data are generated by the simulation runs.

Standards and metadata

We will use the NetCDF standard.

Data sharing

There is no embargo or confidentiality issue on the dataset, and the data can be made available on an open access basis. The issue then becomes the availability of a server with an open link for public consumption, considering that the dataset will be in the order of terabytes. A potential solution is to offer selected samples of the entire dataset.

Archiving and preservation (including storage and backup)

We do not have a procedure in place yet for the long-term preservation and maintenance of the data set. Summary statistics and snapshots of the prediction, with the relevant validation error, will be submitted to relevant scientific publications.
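As one possible reading of the "selected samples" option in the AMDADOS data sharing plan, the sketch below picks evenly spaced output snapshots so that their total size fits a storage budget. The function name, the notion of fixed-size snapshots and the figures used are assumptions made for illustration; the project has not specified a sampling strategy.

```python
def select_snapshots(n_snapshots, snapshot_bytes, budget_bytes):
    """Return evenly spaced snapshot indices whose total size fits the budget.

    Hypothetical helper: assumes every snapshot has the same size.
    """
    # Number of snapshots that fit, capped at the number available.
    keep = min(n_snapshots, budget_bytes // snapshot_bytes)
    if keep <= 0:
        return []
    if keep == 1:
        return [0]
    # Integer arithmetic spreads the picks from the first to the last snapshot.
    return [(i * (n_snapshots - 1)) // (keep - 1) for i in range(keep)]


# Example with made-up sizes: 11 snapshots of 100 bytes, 350-byte budget.
print(select_snapshots(11, 100, 350))
```

With these made-up numbers three snapshots fit, so the first, middle and last snapshots are selected.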

3.2 iPIC3D: implicit Particle-in-Cell code for Space Weather Applications

Data set reference and name

Data set description

We are going to record particle positions and velocities (x,y,z,u,v,w) in order to reconstruct particle trajectories. In particular, we focus on detecting high-energy particles trapped in the Earth's radiation belts and storing their trajectories. In the AllScale iPIC3D application, we expect to track 1,000,000 particles initially. We also intend to perform a wave-particle interaction simulation as a benchmark. These test results can be compared with the reference simulations performed with the parent iPIC3D code.

Standards and metadata

HDF5 and binary files will be used to store particle trajectories in the I/O stage.

Data sharing

Access to iPIC3D data will be open to the entire scientific community. We will provide a description of the binary file containing the particle trajectories, while the HDF5 files will carry the metadata necessary to reconstruct the particle trajectories.

Archiving and preservation (including storage and backup)

We do not have a procedure in place yet for the long-term preservation and maintenance of the data set.
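To illustrate what a description of the binary trajectory file could look like, the following sketch defines a fixed-size record holding the six recorded values (x,y,z,u,v,w) for one particle at one time step. The layout (little-endian, a 64-bit particle identifier followed by six 64-bit floats) and all names are assumptions for illustration and are not the actual iPIC3D format.

```python
import struct

# Hypothetical record layout (an assumption, not the iPIC3D format):
# little-endian uint64 particle id, then float64 x, y, z, u, v, w.
RECORD = struct.Struct("<Q6d")


def pack_step(particles):
    """Serialize (pid, x, y, z, u, v, w) tuples for one time step."""
    return b"".join(RECORD.pack(*p) for p in particles)


def unpack_step(buf):
    """Inverse of pack_step: return a list of (pid, x, y, z, u, v, w) tuples."""
    return list(RECORD.iter_unpack(buf))


def bytes_per_step(n_particles):
    """Storage needed for one recorded time step under the assumed layout."""
    return n_particles * RECORD.size


step = [(1, 0.5, 1.5, 2.5, -0.1, 0.2, 0.3)]
assert unpack_step(pack_step(step)) == step
print(RECORD.size, bytes_per_step(1_000_000))
```

Under this assumed layout a record takes 56 bytes, so one recorded time step for the 1,000,000 tracked particles would occupy 56 MB, and roughly 18,000 recorded steps would reach one terabyte.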

3.3 Fine/Open: Large Industrial unsteady CFD simulations

Data set reference and name

FINE™/Open Full Aircraft DES Simulation

Data set description

As input data, we will use a public geometry of a full aircraft. At this stage of the project, the geometry and its origin are not yet defined. Once they are known, all necessary information will be provided and made available.

A 3D unstructured mesh will be generated and accessible through the standard CGNS format.

The DES simulation will generate three types of output:

- 3D averaged solutions like density, static pressure, static temperature, velocity, etc. Binary file in standard CGNS-HDF5 format.

- Temporal evolution of global quantities like lift, drag, moment, forces per surface or group of surfaces. ASCII files with self-explanatory header file.

- Temporal evolution of the resolved quantities at given x,y,z points. ASCII files with self-explanatory header file, one file per control point.

The solutions will be compared to our reference FINE™/Open solver.

Standards and metadata

We will use the standard CGNS-HDF5 format to store the 3D averaged solutions. For the temporal evolution of global quantities and control points, we will use ASCII files with a self-explanatory header.

Data sharing

Access to the FINE™/Open mesh and solutions will be open to the scientific community, with a clear description of the case (scheme, boundary conditions, Mach number, Reynolds number, ...).

Currently, NUMECA has no server or web address through which this type of data, which can be in the order of terabytes, can be accessed.

Archiving and preservation (including storage and backup)

We do not have a procedure in place yet for the long-term preservation and maintenance of the data set.
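The ASCII outputs with self-explanatory headers described above could be read along the following lines. The header convention used here (comment lines starting with "#", with the last such line naming the columns) and the sample values are assumptions for illustration; the actual FINE™/Open file layout is not specified in this plan.

```python
import io


def read_series(stream):
    """Parse a whitespace-separated ASCII table into {column name: values}.

    Assumed convention: header lines start with '#', and the last header
    line lists the column names.
    """
    names, rows = [], []
    for line in stream:
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Later header lines overwrite earlier ones.
            names = line.lstrip("#").split()
            continue
        rows.append([float(v) for v in line.split()])
    return {name: [r[i] for r in rows] for i, name in enumerate(names)}


# Made-up sample of a global-quantity file.
sample = io.StringIO(
    "# hypothetical global-quantity file\n"
    "# time lift drag\n"
    "0.0 1.20 0.31\n"
    "0.1 1.25 0.30\n"
)
series = read_series(sample)
print(series["lift"])
```

Keeping the column names in the file itself is what lets a reader such as this one work without a separate schema, which is the practical benefit of a self-explanatory header.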

4 Conclusions and Future Work

In this deliverable we presented the Data Management Plan for the AllScale project. Because AllScale relies on three pilot applications to demonstrate its success, most data produced during the project lifetime will stem from these applications. Accordingly, and following the recommendation of the European Commission, we detailed for each application the identified data sets that we will open to the community. Other data, related to evaluating the AllScale runtime or produced by the different AllScale components, have not yet been identified or formatted. We thus plan to describe them in a future release of the Data Management Plan and open them to the wider community. We are also in the process of identifying a data repository for storing the AllScale open data, and we will use D1.1: Periodic report, due at PM18, to report our solution to this issue.

Bibliography

[1] European Open Data Platform OpenAIRE:

https://www.openaire.eu/opendatapilot

[2] European Commission, "Guidelines on Open Access to Scientific Publications and Research Data in Horizon 2020":

http://ec.europa.eu/research/participants/data/ref/h2020/grants_manual/hi/oa_pilot/h2020-hi-oa-pilot-guide_en.pdf

[3] European Commission, "Guidelines on Data Management in Horizon 2020":

https://ec.europa.eu/research/participants/data/ref/h2020/grants_manual/hi/oa_pilot/h2020-hi-oa-data-mgt_en.pdf
