Managing Complexity in Distributed Data Life Cycles Enhancing Scientific Discovery

Academic year: 2021

(1)

Richard Grunzke*, Jens Krüger, Sandra Gesing, Sonja Herres-Pawlis, Alexander Hoffmann, Alvaro Aguilera, Wolfgang E. Nagel

Managing Complexity in Distributed Data Life Cycles Enhancing Scientific Discovery

(2)

Data Life Cycles

Data from creation through management, analysis, and utilization to archiving

Focus on generating insights based on data

(3)

Data Life Cycles – Big Data and HPC

Large-scale simulations with HPC

– Result data can be in petabyte range

Instruments such as high-throughput microscopes

– 0.85 GB/s → about 2 petabytes per month

Big Data and growing rapidly
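The microscope figure can be checked with a short calculation. This is only a back-of-the-envelope sketch: the 30-day month, the decimal units (1 PB = 1,000,000 GB), and the assumption of a sustained, uninterrupted data rate are not stated on the slide, and the function name is made up for illustration.

```python
# Back-of-the-envelope check of the slide's throughput figure.
# Assumptions (not from the slide): 30-day month, decimal units, sustained rate.
SECONDS_PER_DAY = 24 * 60 * 60
GB_PER_PB = 1_000_000


def monthly_volume_pb(rate_gb_per_s: float, days: int = 30) -> float:
    """Data volume in petabytes produced at a sustained rate over `days` days."""
    return rate_gb_per_s * SECONDS_PER_DAY * days / GB_PER_PB


volume = monthly_volume_pb(0.85)
print(f"{volume:.2f} PB per month")
```

Under these assumptions the result is a little over 2 PB, which agrees with the rounded value on the slide.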

(4)

Data Life Cycles – Complexity

Infrastructures ever more complex

Data sources: detectors, simulations, distributed sensors, ...

Data management: storage hierarchy, geographical distribution, transfers, protocols, HPC and user access, AAI, ...

HPC: heterogeneous architectures, cores, nodes, OS, network, ...

Data sinks: scratch, home, repository, archive, ...

(5)

Data Life Cycles – Complexity

Users expected to learn all this?

Few will even attempt it, because they want to concentrate on their science → Many potential new HPC users would never get started

Users do better science faster via accessible HPC and Big Data → Driving and sustaining force behind HPC

(6)

Data Life Cycles – Complexity

As complexity increases, productivity decreases

Maintaining usefulness via abstraction to hide complexity and automation to avoid manual tasks

– Frameworks and libraries

– Modeling and simulation approaches

– Automated parallelization and error detection

– Graphically aided performance analysis and optimization

– Computing and workflow middlewares

– Data and metadata management systems

– Science gateways and virtual research environments

(7)

Data Life Cycles – Data Sources

Instruments

– Detectors in particle accelerators

– High-throughput microscopes

– Distributed sensors measuring properties of wind power stations

Computing Resources

– Large scale simulations

(8)

Data Life Cycles – Data Management

Storage hierarchy:

– Ramdisk, SSD, HDD, SAN, NAS, Tape

Parallel file systems with focus on storing data in form of files

– GPFS, Lustre, pNFS, HDFS, ...

Distributed data management systems with advanced features

(9)

Data Life Cycles – Metadata

Metadata as information about data to organize it based on content

Higher level functionality on top of data management

Easy discovery of data fundamental for its usefulness

Highly complex situation with many standards and systems

Copyright 2009-2010 Jenn Riley. This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License http://creativecommons.org/licenses/by-nc-sa/3.0/us/

(10)

Data Life Cycles – Metadata Management

Centralized metadata catalog

+ Consistent uniform view, + Directly searchable

- Potential bottleneck, - Single point of failure, - Archiving complex

– AMGA Metadata Service, DSpace, Fedora Commons, ISOcat, ...

Systems with metadata in close proximity to data

+ More failure-resistant and better scalable, + More suitable for long-term archiving

- Central component for searchability necessary, - No uniform view, - Possibly more files

– HDF5, NeXus, NetCDF, …

Systems with a combined proximity approach

+ Combination of earlier approaches
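The trade-off between the approaches above can be illustrated with a small sketch. It is not taken from any of the systems named on the slide: metadata is written as a JSON sidecar next to each data file (metadata close to data), and a catalog is then harvested from the sidecars to make them searchable (the central component). The `.meta.json` suffix, the function names, and the sample metadata values are all hypothetical.

```python
import json
import tempfile
from pathlib import Path

SIDECAR_SUFFIX = ".meta.json"  # hypothetical naming convention


def write_with_sidecar(data_path: Path, payload: bytes, metadata: dict) -> Path:
    """Store the data file and a JSON sidecar right next to it."""
    data_path.write_bytes(payload)
    sidecar = data_path.with_name(data_path.name + SIDECAR_SUFFIX)
    sidecar.write_text(json.dumps(metadata))
    return sidecar


def build_index(root: Path) -> dict:
    """Harvest every sidecar below `root` into one searchable catalog."""
    index = {}
    for sidecar in sorted(root.rglob("*" + SIDECAR_SUFFIX)):
        data_name = sidecar.name[: -len(SIDECAR_SUFFIX)]
        data_file = sidecar.with_name(data_name).relative_to(root)
        index[str(data_file)] = json.loads(sidecar.read_text())
    return index


def search(index: dict, key: str, value) -> list:
    """Return the data files whose metadata has `key` equal to `value`."""
    return sorted(path for path, meta in index.items() if meta.get(key) == value)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    write_with_sidecar(root / "run1.dat", b"placeholder", {"method": "docking"})
    write_with_sidecar(root / "run2.dat", b"placeholder", {"method": "quantum chemistry"})
    catalog = build_index(root)
    hits = search(catalog, "method", "docking")
    print(hits)
```

If the catalog is lost it can be rebuilt from the sidecars, which is why this arrangement tends to be more failure-resistant; the cost is the extra files and the need for the harvesting step.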

(11)

Data Life Cycles – Computing Management

Supercomputers, clusters, architectures, CPUs, RAM, operating systems, racks, nodes, interconnects, batch systems

Abstraction of highly complex computing resources

User-driven - User directly initiates tasks

– UNICORE, Globus Toolkit, gLite, …

Workflow-driven - User creates and submits workflow

– gUSE, UNICORE, ...

Data-driven - Tasks automatically executed by pre-defined rules
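A data-driven setup can be sketched as a set of rules that are evaluated whenever new data arrives. The example below is a minimal illustration only and does not reproduce the rule language of any particular middleware; the `Rule` class, the event fields `path` and `size_gb`, the 100 GB threshold, and the task strings are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    name: str
    matches: Callable[[dict], bool]  # condition evaluated on a data event
    action: Callable[[dict], str]    # task description to launch on a match


def on_new_data(event: dict, rules: list) -> list:
    """Return the tasks triggered by one data event, in rule order."""
    return [rule.action(event) for rule in rules if rule.matches(event)]


# Hypothetical pre-defined rules; no user interaction is needed per file.
rules = [
    Rule("checksum-everything", lambda e: True,
         lambda e: f"checksum {e['path']}"),
    Rule("analyze-images", lambda e: e["path"].endswith(".tiff"),
         lambda e: f"segment {e['path']}"),
    Rule("archive-large", lambda e: e["size_gb"] > 100,
         lambda e: f"archive {e['path']}"),
]

launched = on_new_data({"path": "scan_0001.tiff", "size_gb": 250}, rules)
print(launched)
```

In a real system the returned task descriptions would be handed to a batch system or workflow engine instead of being printed.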

(12)

Data Life Cycles – Workflow Management

Higher level functionality based on computing management

Workflow as chaining together of multiple applications

Support for dependencies, loops, and sequential or parallel execution
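How dependencies determine sequential and parallel execution can be shown with Python's standard `graphlib` module (Python 3.9 or later). This is a sketch, not the scheduling logic of gUSE or UNICORE; the step names are hypothetical, and loops are not covered because the graph must be acyclic.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each step maps to the set of steps it depends on.
workflow = {
    "prepare_structure": set(),
    "optimize_geometry": {"prepare_structure"},
    "docking": {"prepare_structure"},
    "collect_results": {"optimize_geometry", "docking"},
}


def execution_stages(graph: dict) -> list:
    """Group steps into stages; steps within one stage have no mutual
    dependencies and could run in parallel."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        stages.append(ready)
        sorter.done(*ready)  # mark the whole stage as finished
    return stages


stages = execution_stages(workflow)
print(stages)
```

Here the two middle steps end up in the same stage because both depend only on the first step, while the final step waits for both of them.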

(13)

Data Life Cycles – Data Sinks

Data stored according to re-use probability

– Scratch file system

– Home directory

– Digital data repository

(14)

Data Life Cycles – Utilization

User interfaces important for acceptance among scientists

Flexibility vs usability

Command-line-based access - Highly customizable and scriptable

– UNICORE, Globus Toolkit, gLite, ...

Rich-Client-based access - Local software installation required

– UNICORE, Taverna Workbench, …

Web-based access - Always up to date, single point of entry to infrastructures

(15)

Data Life Cycles – MoSGrid Science Gateway

HPC and workflow enabled science gateway for molecular simulations

Built in BMBF project

350 users

3 chemical application domains

70 workflows with 90 applications

Extended in two EU projects and being ported to US XSEDE infrastructure

Further follow-up funding proposals submitted

J. Krüger*, R. Grunzke*, S. Gesing*, et al.: The MoSGrid Science Gateway - A Complete Solution for Molecular Simulations, Journal of Chemical Theory and Computation, 2014.

Docking Quantum Chemistry

(16)

Data Life Cycles – VAVID

HPC and workflow enabled science gateway for car crash simulations and wind turbine sensor data

BMBF project based on the MoSGrid idea

Duration of 3 years

(17)

Summary

Challenge of quickly rising data and computing demands

Increasing complexity of data-intensive HPC needs to be managed to maintain and increase its relevance to users

Done by abstraction and automation

Data, computing, metadata, workflow management

Science gateways for productivity

Important goals

– Federated security

– Big Data

– Resilience

– Usability

– Sustainability
