

3.5 Data as a Service

A typical scientific archive follows the Data as a Service (DaaS) model, since the scientific information is distributed over a network (the Internet). The VO may be seen as the standardization of a set of DaaS services for astronomical data. The benefits of the DaaS model include, among others, the ability to move data easily from one platform to another, the option to outsource the presentation (or visualization) layer, the preservation of data integrity through access control measures, compatibility among different platforms, the automated collection of data metrics for quality purposes, and global accessibility. A DaaS offering is an entity that comprises both the platform and the data.

However, DaaS should not be limited to well-defined interfaces to the data. It should also offer the ability to interact with and process the data: data wrangling, or a higher-level set of operations, i.e. some kind of Domain Specific Language (DSL), that helps close the gap between the scientific and technological fields. This will increase adoption among users who are less experienced in technology, by providing functionality that can be used and customized on the fly. Many of the traditional (most generic) machine learning and data mining techniques will be made available in distributed processing engines such as Apache Spark (through MLlib), but others more specific to science and astronomy will not. Examples range from operations as simple as a geometrical search that selects the stars in a particular area of the sky, to a more complex cross-match between different wavelengths (with a configurable lambda that defines what a match is), or a combination/composition of both. The design of these higher-level operations should take into account their potential reusability and therefore provide proper configuration options (and default values), in the same way as is done in popular programming languages used for analytics such as R or Python. A powerful DSL makes it possible to query any single data set and eases the exploration of data or any development on top of it.
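The two operations mentioned above can be sketched in a few lines of Python. This is only an illustrative sketch, not the archive's actual interface: the names `Source`, `cone_search`, `cross_match` and `within` are hypothetical, the default radius and match tolerance are arbitrary assumptions, and the cross-match is a naive pairwise comparison rather than the indexed, distributed implementation that a real archive would need.

```python
import math
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class Source:
    """Hypothetical minimal catalogue entry (coordinates in degrees)."""
    source_id: str
    ra: float
    dec: float


def angular_separation(a: Source, b: Source) -> float:
    """Great-circle separation in degrees, using the haversine formula."""
    ra1, dec1, ra2, dec2 = map(math.radians, (a.ra, a.dec, b.ra, b.dec))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))


def cone_search(sources: Iterable[Source], centre_ra: float, centre_dec: float,
                radius_deg: float = 1.0) -> List[Source]:
    """Geometrical search: keep the sources inside a circle on the sky.

    The 1 degree default radius is an assumed value for illustration.
    """
    centre = Source("centre", centre_ra, centre_dec)
    return [s for s in sources if angular_separation(s, centre) <= radius_deg]


def within(max_sep_deg: float = 1.0 / 3600.0) -> Callable[[Source, Source], bool]:
    """Build a match predicate; the default tolerance (1 arcsec) is assumed."""
    return lambda a, b: angular_separation(a, b) <= max_sep_deg


def cross_match(left: Iterable[Source], right: Iterable[Source],
                is_match: Callable[[Source, Source], bool] = within()
                ) -> List[Tuple[Source, Source]]:
    """Naive O(n*m) cross-match; `is_match` is the configurable lambda."""
    right = list(right)
    return [(a, b) for a in left for b in right if is_match(a, b)]


# Made-up sample data for two catalogues at different wavelengths.
optical = [Source("opt-1", 10.0, 20.0), Source("opt-2", 10.5, 20.0),
           Source("opt-3", 50.0, -5.0)]
infrared = [Source("ir-1", 10.0001, 20.0), Source("ir-2", 50.0, -5.0002)]

# Composition of both operations: cross-match only the stars in one sky area.
pairs = cross_match(cone_search(optical, 10.0, 20.0), infrared)
print([(a.source_id, b.source_id) for a, b in pairs])
```

Passing a different predicate, for example `is_match=within(0.1 / 3600.0)` or any other two-argument lambda, changes what counts as a match without touching the operation itself, which is the kind of configurability with sensible defaults argued for above.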

Chapter 3. Enabling Large Scale Data Science and Data Products

official data deliverables but for any other derived or incorporated data sets, will both encourage collaboration and enable the reuse of intermediate results, which is required to unlock the potential of a real living archive (Brown, 2012). This refers not only to the metadata, which can be queried to identify a subset of the data archive with certain features, but also to the data lineage information: when and how a particular data set was created, along with the inputs that it took and the software or model that was run.
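As an illustration of the lineage information just described, the following sketch records when a data set was created, which inputs it took and which software produced it, and then walks the chain of inputs back to the original deliverables. The `LineageRecord` structure, its field names and the sample identifiers are hypothetical and chosen only for this example; they do not correspond to any existing archive schema.

```python
from collections import deque
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Mapping, Tuple


@dataclass(frozen=True)
class LineageRecord:
    """Hypothetical lineage entry for one derived data set."""
    dataset_id: str
    created_at: datetime          # when the data set was created
    inputs: Tuple[str, ...]       # identifiers of the data sets it took as input
    software: str                 # software or model that was run
    software_version: str


def trace_inputs(dataset_id: str,
                 catalogue: Mapping[str, LineageRecord]) -> List[str]:
    """Return every upstream data set, nearest ancestors first (breadth-first)."""
    ordered: List[str] = []
    queue = deque([dataset_id])
    while queue:
        record = catalogue.get(queue.popleft())
        if record is None:
            continue  # an original deliverable with no recorded lineage
        for parent in record.inputs:
            if parent not in ordered:
                ordered.append(parent)
                queue.append(parent)
    return ordered


# Made-up example: a cross-match built from a catalogue and a derived subset.
when = datetime(2015, 1, 1, tzinfo=timezone.utc)
catalogue = {
    "ir-subset": LineageRecord("ir-subset", when, ("ir-survey",),
                               "cone_search", "0.1"),
    "xmatch-v1": LineageRecord("xmatch-v1", when, ("gaia-sources", "ir-subset"),
                               "cross_match", "0.1"),
}
print(trace_inputs("xmatch-v1", catalogue))
```

Keeping such records queryable alongside the metadata is what would let a user decide whether an intermediate result shared by someone else can be reused as it is.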

In addition, interactive access to the platform (both data processing engines and data sets) will facilitate community engagement and will be key to enabling the exploration of data sets and the preliminary testing of hypotheses or assumptions (as remarked above). It will also lower the barriers to bringing software and models to the data centre, encouraging collaborative and reproducible research through the sharing of intermediate results, which would otherwise be impossible for the largest data sets. Proper interoperability with visualization engines and tools will effectively close the loop, improving the experience offered throughout the platform.

Last but not least, Functional Programming is gaining traction as the best programming paradigm for Big Data, given the scalability that it offers. This makes it very suitable for DaaS. Functional Programming is based on the idea of immutable state, i.e. the state of data does not change over time, so there is no need to synchronize accesses to the data from different threads (synchronization being what limits scalability). This programming paradigm has been boosted by the current trend in multicore CPU architectures, which lower the clock speed and provide a significant number of cores that can execute independent flows of instructions. The virtually unlimited scalability of Cloud computing has also contributed to the popularity of this paradigm, which has existed for many decades. The benefits attributed to Functional Programming range from better error handling and code modularity to shorter (more expressive) code and increased developer productivity.
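The point about immutable state can be made concrete with a small sketch. It assumes the data are split into immutable partitions (plain tuples here) and that each partition is summarized by a pure function, so the workers need no locks and the result does not depend on how many of them run. The function names are hypothetical, and threads from the Python standard library are used only to keep the example self-contained; an engine such as Apache Spark would distribute the same map and reduce steps across cores or nodes.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
from typing import Sequence, Tuple

Partition = Tuple[float, ...]   # immutable: cannot be modified by any worker
Partial = Tuple[int, float]     # (count, sum) for one partition


def partial_stats(partition: Partition) -> Partial:
    """Pure function: the output depends only on the input, no shared state."""
    return (len(partition), sum(partition))


def merge(a: Partial, b: Partial) -> Partial:
    """Associative combination of two partial results."""
    return (a[0] + b[0], a[1] + b[1])


def distributed_mean(partitions: Sequence[Partition], workers: int = 4) -> float:
    """Map each partition independently, then reduce the partial results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_stats, partitions))
    count, total = reduce(merge, partials, (0, 0.0))
    return total / count if count else float("nan")


# Made-up data split into three partitions.
partitions = ((1.0, 2.0, 3.0), (4.0, 5.0), (6.0,))
print(distributed_mean(partitions, workers=1))  # 3.5
print(distributed_mean(partitions, workers=3))  # 3.5
```

Because nothing is mutated after creation, adding workers changes only how the work is scheduled, not the answer, which is the property that makes this style attractive for scaling out.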

Architecture and Techniques for the Gaia Mission Archive

If you want something new, you have to stop doing something old. Peter F. Drucker.

The ESA Gaia mission represents a breakthrough in astrophysics, a cornerstone mission aimed at producing the most accurate three-dimensional map of the Milky Way to date. The resulting stereoscopic census of our Galaxy will represent a giant leap in astrometric accuracy, complemented by the only full sky homogeneous photometric survey with an angular resolution comparable to that of the Hubble Space Telescope (HST), as well as the largest spectroscopic survey ever undertaken. The scientific bounty will be immense, not only unravelling the formation history and evolution of our Galaxy but also revealing and classifying thousands of extra-solar planetary systems, minor bodies within our solar system and millions of extragalactic objects, including some 500,000 quasars. Moreover, such a massive survey is bound to uncover many surprises that the universe still holds in store for us.

The Gaia mission poses several challenges for current data archiving technologies, mainly due to the unprecedented amount of data that will be produced. The final data delivery will not only include the catalogue of one billion sources but also the single epoch Charge Coupled Device (CCD) transit data that was used in its computation (including spectra), making an estimated total data set of around 1 PB.

New challenges arise in two main areas. First, state-of-the-art computing technologies need to be applied to the ingestion, data management and storage processes of the OAIS model (see Figure 2.4). Second, new methodologies and protocols will have to arise, based on current VO technology (see Section 2.2), to fulfil new requirements in the access process. The combination of both processes will give the opportunity to deliver unprecedented levels of accessibility to perform high performance computation over large amounts of