Managing Complexity in Distributed Data Life Cycles Enhancing Scientific Discovery

(1)

Richard Grunzke*, Jens Krüger, Sandra Gesing, Sonja Herres-Pawlis, Alexander Hoffmann, Alvaro Aguilera, Wolfgang E. Nagel



(2)

Data Life Cycles

Data from creation through management, analysis and utilization to archiving

Focus on generating insights based on data

(3)


Data Life Cycles – Big Data and HPC

Large-scale simulations with HPC

– Result data can be in petabyte range

Instruments such as high-throughput microscopes

– 0.85 GB/s → about 2 petabytes per month

Big Data and growing rapidly
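The monthly figure follows directly from the sustained acquisition rate. A quick check is sketched below, assuming a 30-day month, decimal units (1 PB = 10^6 GB) and uninterrupted acquisition. The function name and these assumptions are illustrative and not taken from the slides.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def monthly_volume_pb(rate_gb_per_s: float, days: int = 30) -> float:
    """Return the data volume in petabytes produced at a sustained rate."""
    total_gb = rate_gb_per_s * SECONDS_PER_DAY * days
    return total_gb / 1_000_000  # assumption: 1 PB = 10^6 GB (decimal units)


if __name__ == "__main__":
    # 0.85 GB/s is the microscope data rate quoted on the slide.
    print(f"{monthly_volume_pb(0.85):.2f} PB per month")
```

Under these assumptions the result is roughly 2.2 PB, consistent with the rounded "2 petabytes" above.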

(4)

Data Life Cycles – Complexity

Infrastructures ever more complex

Data sources: detectors, simulations, distributed sensors, ...

Data management: storage hierarchy, geographical distribution, transfers, protocols, HPC and user access, AAI, ...

HPC: heterogeneous architectures, cores, nodes, OS, network, ...

Data sinks: scratch, home, repository, archive, ...

(5)

Data Life Cycles – Complexity

Users expected to learn all this?

Few will even attempt it, as they want to concentrate on their science → Many potential new HPC users would never start

Users do better science faster via accessible HPC and Big Data → Driving and sustaining force behind HPC

(6)

Data Life Cycles – Complexity

As complexity increases, productivity decreases

Maintaining usefulness via abstraction to hide complexity and automation to avoid manual tasks

– Frameworks and libraries

– Modeling and simulation approaches

– Automated parallelization and error detection

– Graphically aided performance analysis and optimization

– Computing and workflow middlewares

– Data and metadata management systems

– Science gateways and virtual research environments

(7)


Data Life Cycles – Data Sources

Instruments

– Detectors in particle accelerators

– High-throughput microscopes

– Distributed sensors measuring properties of wind power stations

Computing Resources

– Large scale simulations

(8)

Data Life Cycles – Data Management

Storage hierarchy:

– Ramdisk, SSD, HDD, SAN, NAS, Tape

Parallel file systems, focused on storing data in the form of files

– GPFS, Lustre, pNFS, HDFS, ...

Distributed data management systems with advanced features

(9)

Data Life Cycles – Metadata

Metadata as information about data to organize it based on content

Higher level functionality on top of data management

Easy discovery of data fundamental for its usefulness

Highly complex situation with many standards and systems

Copyright 2009-2010 Jenn Riley. This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License http://creativecommons.org/licenses/by-nc-sa/3.0/us/

(10)

Data Life Cycles – Metadata Management

Centralized metadata catalog

+ Consistent uniform view, + Directly searchable

- Potential bottleneck, - Single point of failure, - Archiving complex

– AMGA Metadata Service, DSpace, Fedora Commons, ISOcat, ...

Systems with metadata in close proximity to data

+ More failure-resistant and better scalable, + More suitable for long-term archiving

- Central component for searchability necessary, - No uniform view, - Possibly more files

– HDF5, NeXus, NetCDF, …

Systems with a combined proximity approach

+ Combination of earlier approaches
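The combined approach can be pictured as metadata stored next to each data file plus a central index that is rebuilt from those files for searching. The following is a minimal sketch under stated assumptions: plain JSON sidecar files stand in for self-describing formats such as HDF5 or NetCDF, the index lives only in memory, and all function names and the `.meta.json` suffix are hypothetical.

```python
import json
import tempfile
from pathlib import Path

SIDECAR_SUFFIX = ".meta.json"  # hypothetical naming convention


def write_with_sidecar(data_path: Path, payload: bytes, metadata: dict) -> None:
    """Store the data and place its metadata right next to it."""
    data_path.write_bytes(payload)
    sidecar = data_path.with_name(data_path.name + SIDECAR_SUFFIX)
    sidecar.write_text(json.dumps(metadata))


def build_index(root: Path) -> dict:
    """Scan all sidecars under root and build a searchable central index."""
    index = {}
    for sidecar in sorted(root.rglob("*" + SIDECAR_SUFFIX)):
        data_name = sidecar.name[: -len(SIDECAR_SUFFIX)]
        index[str(sidecar.with_name(data_name))] = json.loads(sidecar.read_text())
    return index


def search(index: dict, key: str, value) -> list:
    """Return the paths of all data sets whose metadata has key == value."""
    return [path for path, meta in index.items() if meta.get(key) == value]


def demo() -> list:
    """Write two data sets, index them, and search by one metadata field."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        write_with_sidecar(root / "run1.dat", b"...", {"method": "docking"})
        write_with_sidecar(root / "run2.dat", b"...", {"method": "md"})
        index = build_index(root)
        return [Path(p).name for p in search(index, "method", "docking")]


if __name__ == "__main__":
    print(demo())
```

If the index is lost it can be regenerated from the sidecars, which is the failure-resistance argument made for the proximity approach; the index itself supplies the uniform, directly searchable view of the centralized approach.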

(11)

Data Life Cycles – Computing Management

Supercomputers, clusters, architectures, CPUs, RAM, operating systems, racks, nodes, interconnects, batch systems

Abstraction of highly complex computing resources

User-driven - User directly initiates tasks

– UNICORE, Globus Toolkit, gLite, …

Workflow-driven - User creates and submits workflow

– gUSE, UNICORE, ...

Data-driven - Tasks automatically executed by pre-defined rules
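The data-driven mode can be illustrated with a small rule table that is evaluated whenever new data arrives. This is only a sketch: the `Rule` class, the two example rules and the returned task descriptions are hypothetical and do not reflect the API of any middleware named on this slide.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    name: str
    matches: Callable[[str], bool]  # condition on the incoming file name
    action: Callable[[str], str]    # task to run when the condition holds


def on_new_data(filename: str, rules: list[Rule]) -> list[str]:
    """Run every rule whose condition matches the newly arrived file."""
    return [rule.action(filename) for rule in rules if rule.matches(filename)]


# Hypothetical pre-defined rules; real actions would submit jobs or transfers.
RULES = [
    Rule("extract-metadata",
         lambda f: f.endswith(".h5"),
         lambda f: f"extract metadata from {f}"),
    Rule("archive-raw",
         lambda f: f.startswith("raw_"),
         lambda f: f"archive {f}"),
]

if __name__ == "__main__":
    for task in on_new_data("raw_scan.h5", RULES):
        print(task)
```

The user never initiates a task here; arrival of the data is the trigger, which is what distinguishes this mode from the user-driven and workflow-driven ones.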

(12)

Data Life Cycles – Workflow Management

Higher level functionality based on computing management

Workflow as chaining together of multiple applications

Support for dependencies, loops, and sequential and parallel execution
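Dependency handling with sequential and parallel stages can be sketched with Python's standard `graphlib` module (Python 3.9+). The workflow below and its step names are hypothetical. A dependency graph of this kind cannot express loops directly, so loop support is assumed to be provided by the workflow engine, for example by unrolling iterations.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each step maps to the set of steps it depends on.
WORKFLOW = {
    "prepare": set(),
    "simulate_a": {"prepare"},
    "simulate_b": {"prepare"},
    "analyze": {"simulate_a", "simulate_b"},
}


def execution_stages(workflow: dict) -> list:
    """Group steps into stages; steps within one stage may run in parallel."""
    sorter = TopologicalSorter(workflow)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # all steps whose inputs are done
        stages.append(ready)
        sorter.done(*ready)
    return stages


if __name__ == "__main__":
    for number, stage in enumerate(execution_stages(WORKFLOW), start=1):
        print(number, stage)
```

Here the two simulation steps end up in the same stage and could be submitted concurrently, while `prepare` and `analyze` run strictly before and after them.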

(13)

Data Life Cycles – Data Sinks

Data stored according to re-use probability

– Scratch file system

– Home directory

– Digital data repository
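Placement by re-use probability can be sketched as a simple threshold policy. The thresholds below, and the assumption that a higher re-use probability maps to a more durable sink, are illustrative choices and are not given on the slide.

```python
def choose_sink(reuse_probability: float) -> str:
    """Pick a data sink from an estimated re-use probability in [0, 1]."""
    if not 0.0 <= reuse_probability <= 1.0:
        raise ValueError("reuse_probability must be between 0 and 1")
    if reuse_probability < 0.2:   # hypothetical threshold: temporary data
        return "scratch file system"
    if reuse_probability < 0.7:   # hypothetical threshold: personal working data
        return "home directory"
    return "digital data repository"  # likely re-used, worth curating


if __name__ == "__main__":
    for p in (0.05, 0.5, 0.9):
        print(p, "->", choose_sink(p))
```

In practice the probability would itself be an estimate, for instance derived from project policy or from whether the data underpins a publication.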

(14)

Data Life Cycles – Utilization

User interfaces important for acceptance among scientists

Flexibility vs usability

Command-line-based access - Highly customizable and scriptable

– UNICORE, Globus Toolkit, gLite, ...

Rich-Client-based access - Local software installation required

– UNICORE, Taverna Workbench, …

Web-based access - Always up-to-date, single point of entry to infrastructures

(15)

Data Life Cycles – MoSGrid Science Gateway

HPC and workflow enabled science gateway for molecular simulations

Built in BMBF project

350 users

3 chemical application domains

70 workflows with 90 applications

Extended in two EU projects & being ported to US XSEDE infrastructure

Further follow-up funding proposals submitted

J. Krüger*, R. Grunzke*, S. Gesing*, et al.: The MoSGrid Science Gateway - A Complete Solution for Molecular Simulations, Journal of Chemical Theory and Computation, 2014.

[Figure: application domain panels, including Docking and Quantum Chemistry]

(16)

Data Life Cycles – VAVID

HPC and workflow enabled science gateway for car crash simulations and wind turbine sensor data

BMBF project based on the MoSGrid idea

Duration of 3 years

(17)

Summary

Challenge of quickly rising data and computing demands

Increasing complexity of data-intensive HPC needs to be managed to maintain and increase relevancy to users

Done by abstraction and automation

Data, computing, metadata, workflow management

Science gateways for productivity

Important goals

– Federated security

– Big Data

– Resilience

– Usability

– Sustainability

(18)