Richard Grunzke*, Jens Krüger, Sandra Gesing, Sonja Herres-Pawlis, Alexander Hoffmann, Alvaro Aguilera, Wolfgang E. Nagel
Managing Complexity in Distributed Data Life Cycles Enhancing Scientific Discovery
Data Life Cycles
Data from creation through management, analysis, and utilization to archiving
Focus on generating insights based on data
Richard Grunzke, Alvaro Aguilera 3
Data Life Cycles – Big Data and HPC
Large-scale simulations with HPC
– Result data can be in petabyte range
Instruments such as high-throughput microscopes
– 0.85 GB/s → about 2 petabytes monthly
Big Data and growing rapidly
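The microscope figure can be checked with a short calculation. The sketch below assumes decimal units (1 PB = 1,000,000 GB), a 30-day month, and an instrument that streams continuously; none of these assumptions are stated on the slide, and the function name is illustrative.

```python
# Back-of-the-envelope check: sustained data rate -> monthly volume.
# Assumptions (not from the slides): decimal units (1 PB = 1e6 GB),
# a 30-day month, and an instrument that streams without pause.
SECONDS_PER_DAY = 24 * 60 * 60


def monthly_volume_pb(rate_gb_per_s: float, days: int = 30) -> float:
    """Return the data volume in petabytes produced over `days` days."""
    total_gb = rate_gb_per_s * SECONDS_PER_DAY * days
    return total_gb / 1_000_000  # GB -> PB (decimal)


if __name__ == "__main__":
    print(f"{monthly_volume_pb(0.85):.2f} PB per month")
```

Under these assumptions the result is roughly 2.2 PB, consistent with the rounded value on the slide.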
Data Life Cycles – Complexity
Infrastructures ever more complex
Data sources: detectors, simulations, distributed sensors, ...
Data management: storage hierarchy, geographical distribution, transfers, protocols, HPC and user access, AAI, ...
HPC: heterogeneous architectures, cores, nodes, OS, network, ...
Data sinks: scratch, home, repository, archive, …
Data Life Cycles – Complexity
Users expected to learn all this?
Few will even attempt as they want to concentrate on their science → Many potential new HPC users would not begin
Users do better science faster via accessible HPC and Big Data → Driving and sustaining force behind HPC
Data Life Cycles – Complexity
As complexity increases, productivity decreases
Maintaining usefulness via abstraction to hide complexity and automation to avoid manual tasks
– Frameworks and libraries
– Modeling and simulation approaches
– Automated parallelization and error detection
– Graphically aided performance analysis and optimization
– Computing and workflow middlewares
– Data and metadata management systems
– Science gateways and virtual research environments
Data Life Cycles – Data Sources
Instruments
– Detectors in particle accelerators
– High-throughput microscopes
– Distributed sensors measuring properties of wind power stations
Computing Resources
– Large scale simulations
Data Life Cycles – Data Management
Storage hierarchy:
– Ramdisk, SSD, HDD, SAN, NAS, Tape
Parallel file systems with a focus on storing data in the form of files
– GPFS, Lustre, pNFS, HDFS, ...
Distributed data management systems with advanced features
Data Life Cycles – Metadata
Metadata as information about data to organize it based on content
Higher level functionality on top of data management
Easy discovery of data fundamental for its usefulness
Highly complex situation with many standards and systems
Copyright 2009-2010 Jenn Riley. This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License http://creativecommons.org/licenses/by-nc-sa/3.0/us/
Data Life Cycles – Metadata Management
Centralized metadata catalog
+ Consistent uniform view, + Directly searchable
- Potential bottleneck, - Single point of failure, - Archiving complex
– AMGA Metadata Service, DSpace, Fedora Commons, ISOcat, …
Systems with metadata in close proximity to data
+ More failure-resistant and more scalable, + More suitable for long-term archiving
- Central component necessary for searchability, - No uniform view, - Possibly more files
– HDF5, NeXus, NetCDF, …
Systems with a combined proximity approach
+ Combination of earlier approaches
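One way to picture the combined approach is a minimal sketch in which metadata lives next to each data file and a searchable catalog is rebuilt by scanning it. The `.meta.json` sidecar convention, the function names, and the example metadata keys are assumptions made for illustration; they do not describe any of the systems named above.

```python
import json
import tempfile
from pathlib import Path

SIDECAR_SUFFIX = ".meta.json"  # hypothetical naming convention


def write_with_sidecar(data_path: Path, payload: bytes, metadata: dict) -> None:
    """Store data and keep its metadata next to it as <name>.meta.json."""
    data_path.write_bytes(payload)
    sidecar = data_path.with_name(data_path.name + SIDECAR_SUFFIX)
    sidecar.write_text(json.dumps(metadata))


def build_catalog(root: Path) -> dict:
    """Rebuild a central, searchable index by scanning all sidecar files."""
    catalog = {}
    for sidecar in sorted(root.rglob("*" + SIDECAR_SUFFIX)):
        data_name = sidecar.name[: -len(SIDECAR_SUFFIX)]
        data_path = sidecar.with_name(data_name).relative_to(root)
        catalog[str(data_path)] = json.loads(sidecar.read_text())
    return catalog


def search(catalog: dict, key: str, value) -> list:
    """Return the data paths whose metadata has key == value."""
    return sorted(path for path, meta in catalog.items() if meta.get(key) == value)


def demo() -> list:
    """Write two files with sidecars, index them, and query the index."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        write_with_sidecar(root / "run1.dat", b"\x00" * 8, {"method": "docking"})
        write_with_sidecar(root / "run2.dat", b"\x00" * 8, {"method": "md"})
        return search(build_catalog(root), "method", "docking")
```

Because the catalog can always be regenerated from the sidecar files, losing it does not lose metadata, while searches still go through one uniform view.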
Data Life Cycles – Computing Management
Supercomputers, clusters, architectures, CPUs, RAM, operating systems, racks, nodes, interconnects, batch systems
Abstraction of highly complex computing resources
User-driven - User directly initiates tasks
– UNICORE, Globus Toolkit, gLite, …
Workflow-driven - User creates and submits workflow
– gUSE, UNICORE, ...
Data-driven - Tasks automatically executed by pre-defined rules
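The data-driven mode can be sketched as a small rule table that is evaluated whenever new data arrives. This is an illustration only: the `Rule` class, the rule names, and the string "tasks" are hypothetical, and a real system would submit jobs to a batch system or middleware rather than return strings.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    """A hypothetical rule: when `matches(filename)` is true, run `action`."""
    name: str
    matches: Callable[[str], bool]
    action: Callable[[str], str]


def on_new_data(filename: str, rules: list[Rule]) -> list[str]:
    """Apply every matching rule to a newly arrived file; return the task log."""
    return [rule.action(filename) for rule in rules if rule.matches(filename)]


# Pre-defined rules (illustrative): the user never triggers these by hand.
RULES = [
    Rule("convert-images", lambda f: f.endswith(".tif"), lambda f: f"convert {f}"),
    Rule("extract-metadata", lambda f: True, lambda f: f"extract-metadata {f}"),
]
```

Calling `on_new_data("scan_001.tif", RULES)` triggers both rules, while a file with another extension triggers only the metadata extraction.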
Data Life Cycles – Workflow Management
Higher level functionality based on computing management
Workflow as chaining together of multiple applications
Support for dependencies, loops, and sequential and parallel execution
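Dependency handling can be sketched with Python's standard `graphlib` module (Python 3.9+). The step names are hypothetical, and the sketch only covers dependencies and parallel stages; loops are not modeled, since a cyclic dependency graph is rejected.

```python
from graphlib import TopologicalSorter


def execution_stages(dependencies: dict[str, set[str]]) -> list[list[str]]:
    """Group workflow steps into stages; all steps within one stage have
    their dependencies satisfied and could run in parallel."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError on cyclic dependencies
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        stages.append(ready)
        sorter.done(*ready)
    return stages


# Each step maps to the set of steps it depends on (illustrative names).
WORKFLOW = {
    "prepare_input": set(),
    "simulate_a": {"prepare_input"},
    "simulate_b": {"prepare_input"},
    "analyze": {"simulate_a", "simulate_b"},
}
```

Here `execution_stages(WORKFLOW)` yields three stages: the input preparation, the two simulations side by side, and the final analysis.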
Data Life Cycles – Data Sinks
Data stored according to re-use probability
– Scratch file system
– Home directory
– Digital data repository
Data Life Cycles – Utilization
User interfaces important for acceptance among scientists
Flexibility vs usability
Command-line-based access - Highly customizable and scriptable
– UNICORE, Globus Toolkit, gLite, ...
Rich-Client-based access - Local software installation required
– UNICORE, Taverna Workbench, …
Web-based access - Always up to date, single point of entry to infrastructures
Data Life Cycles – MoSGrid Science Gateway
HPC- and workflow-enabled science gateway for molecular simulations
Built in BMBF project
350 users
3 chemical application domains
70 workflows with 90 applications
Extended in two EU projects and being ported to US XSEDE infrastructure
Further follow-up funding proposals submitted
J. Krüger*, R. Grunzke*, S. Gesing*, et al.: The MoSGrid Science Gateway - A Complete Solution for Molecular Simulations, Journal of Chemical Theory and Computation, 2014.
[Figure panels: Docking, Quantum Chemistry]
Data Life Cycles – VAVID
HPC- and workflow-enabled science gateway for car crash simulations and wind turbine sensor data
BMBF project based on the MoSGrid idea
Duration of 3 years
Summary
Challenge of quickly rising data and computing demands
Increasing complexity of data-intensive HPC needs to be managed to maintain and increase its relevance to users
Done by abstraction and automation
Data, computing, metadata, workflow management
Science gateways for productivity
Important goals
– Federated security
– Big Data
– Resilience
– Usability
– Sustainability