(1)

Data Intensive Science

Education

Thomas J. Hacker

Associate Professor, Computer & Information Technology

Purdue University, West Lafayette, Indiana USA

Gjesteprofessor (Visiting Professor), Department of Electrical

Engineering and Computer Science

University of Stavanger, Norway

EU-China-North America Workshop on HPC Cloud and Big Data

June 20, 2013

(2)

Introduction and Motivation

Theory and Experiment (1800s)

Computational Simulation

Third leg of science

Past 50 years or so (1950s)

Data (21st century science)

Fourth “leg” of science

Researchers are flooded with data

Tremendous quantity and multiple scales of data

Difficult to collect, store, and manage

(3)

Data is the 4th Paradigm

Producing an avalanche of high resolution

digital data

All (or most) of the data needs to be accessible

over a long period of time

Much of the data is not reproducible

Example – NEES project

Structure or sample destroyed

through testing

Very expensive to

(4)

We are surrounded by data that we want,

but it is difficult to find the information

that we need

Private, shared, and public data

repositories

– Files on your computer

– E-mail

– Group documents and files

– Experimental results

– Published papers

• Data are scattered across many systems and devices

– Personal computer, old diskettes in a box, several email systems,

– Old computer systems

Data, data every where…

“Water, water every where, Nor any drop

to drink.”

Samuel Taylor Coleridge, Rime of the

Ancient Mariner

The Rime of the Ancient Mariner: Plate 32: The Pilot, by Gustave Doré

(5)

Need for Data Education

Data is the 4th paradigm of Science and Engineering

We are losing valuable data every day

The techniques we were taught to maintain a “lab

notebook” have not been effectively transferred to

computer based data collection and registration systems.

So much data is available and collected today, it is not

(6)

Two Examples of Data Intensive

Science

Two large-scale science and engineering projects

illustrate the problems related to data intensive

science

National Science Foundation George E. Brown

Network for Earthquake Engineering Simulation

(NEES)

Purdue operates the headquarters for NEEScomm,

the community of NEES research facilities

The Compact Muon Solenoid project

(7)

NSF Network for Earthquake

Engineering Simulation (NEES)

Safer buildings and civil infrastructure are needed to reduce

damage and loss from earthquakes and tsunamis

To facilitate research to improve seismic design of buildings and

civil infrastructure, the National Science Foundation established

NEES

NEES Objectives

Develop a national, multi-user, research infrastructure to support research

and innovation in earthquake and tsunami loss reduction

Create an educated workforce in hazard mitigation

(8)

Facilitate access to the world's best integrated

network of state-of-the-art physical simulation

facilities

Build a cyber-enabled community that shares

ideas, data, and computational tools and models.

Promote education and training for the next

generation of researchers and practitioners.

Cultivate partnerships with other organizations to

disseminate research results, leverage

cyberinfrastructure, and reduce risk by transferring

results into practice.

(9)

NEES has a broad set of experimental facilities

Each type of equipment produces unique data

Located at 14 sites across the United States

Shake Table

Tsunami Wave Basin

Large-Scale Testing Facilities

Centrifuge

Field and Mobile Facilities

Large-Displacement Facility

Cyberinfrastructure

(10)

University of California Santa Barbara

University of California San Diego

University of California Los Angeles

University of California Davis

Lehigh University

Rensselaer Polytechnic Institute

Cornell University

University of Buffalo

University of Minnesota

University of Illinois Urbana-Champaign

Oregon State University

University of California Berkeley

University of Nevada Reno

University of Texas Austin

https://www.nees.org

(11)

Lehigh University

Reaction wall, strong floor

dynamic actuators

UC-Berkeley

Reconfigurable Reaction Wall

University of Illinois Urbana-Champaign

Multi-Axial Full-Scale Sub-Structured Testing & Simulation (MUST-SIM)

University of Minnesota

Reaction walls

Multi-Axial Subassemblage

Testing (MAST)

Large-Scale Testing Facilities

(12)

NEEShub at

Nees.org

(13)

Compact Muon Solenoid Project

Another example of a “big data” project

Two primary computational goals

Move detector data from Large Hadron Collider at CERN to

remote sites for processing

Examine detector data for evidence of Higgs boson

~15 PB/yr data

Applications used by CMS are not inherently parallel

Data is split up and distributed across nodes
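To give a feel for what "~15 PB/yr" means for data movement, the following is a rough back-of-the-envelope sketch. It assumes decimal petabytes and a 365-day year, neither of which is stated on the slide, and it gives only the long-run average rate, not the peak rate the detector or the network must sustain.

```python
# Average transfer rate implied by a yearly data volume.
# Assumptions (not from the slides): 1 PB = 10**15 bytes, 1 year = 365 days.
PETABYTE = 10**15                    # bytes, decimal convention
SECONDS_PER_YEAR = 365 * 24 * 3600   # 31,536,000 seconds


def average_rate_bytes_per_s(petabytes_per_year: float) -> float:
    """Return the average rate in bytes/s needed to move the yearly volume."""
    return petabytes_per_year * PETABYTE / SECONDS_PER_YEAR


rate = average_rate_bytes_per_s(15)
print(f"{rate / 1e6:.0f} MB/s, or {rate * 8 / 1e9:.1f} Gb/s on average")
```

Under these assumptions the average works out to a few hundred megabytes per second, sustained around the clock for a year.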

(14)

CMS Project Overview

CERN Large Hadron Collider Project (LHC)

Particle accelerator and collider – largest in the

world

17 mile circumference tunnel

Providing evidence to support the existence of

the Higgs boson

Six detector experiments at the LHC

ATLAS, CMS, LHCb, ALICE, TOTEM, LHCf

Compact Muon Solenoid (CMS)

Very large solenoid with 4 Tesla magnetic field

(15)
(16)
(17)

Purdue CMS Tier-2 Center

Computing Infrastructure

~10,000 computing cores within the Purdue

University Community Cluster program

– Purdue recently (June 18) announced the Conte Supercomputer

– Fastest university-owned supercomputer in the United States

3 PB of disk storage running Hadoop

Sharing a 100 Gb/sec network uplink to Indianapolis and

Chicago

Ultimately connecting to Fermi National Accelerator Laboratory (Fermilab), near Chicago

(18)

Purdue CMS Tier-2 Center

Physicists from around the world submit

computational jobs to Purdue

Data is copied from the Tier-1 to Purdue storage on

user request

Simulation codes also run at Purdue, with results

pushed up to Tier-1 center or other Tier-2s.

International data sharing

Data interoperability is designed into the project

from the beginning. There is one instrument (the

CMS detector), which greatly simplifies the sharing

and reuse of data compared with a project like NEES

(19)

Challenges involved in Big Data

Performance at Scale

How can we effectively match data performance with HPC

capabilities?

How can we ensure good reliability of these systems?

Data Curation Challenges

What should we preserve, how should we preserve it, and how

can we ensure the long-term viability of the data?

Disciplinary Sociology and Cyberinfrastructure

How can we effectively promote and support the adoption and

use of new technologies?

How can we foster the development of new disciplinary

practices focused on the long-term accessibility of data?

(20)

Performance at Scale

Petaflop scale systems are now available for use by researchers

Example: Purdue Conte system announced this week (Rmax 943 TF,

Rpeak 1.342 Petaflops)

Conte was built with 580 HP ProLiant SL250 Generation 8 (Gen8) servers,

each incorporating two Intel Xeon processors and two Intel Xeon Phi

coprocessors, integrated with Mellanox 56 Gb/s FDR InfiniBand.

Conte has 580 servers (570 at the time of testing) with 9,120 standard

cores and 68,400 Phi cores, for a total of 77,520 cores.

Big data analytics coupled with petascale systems requires high

bandwidth storage systems

Avoid wasteful and expensive CPU stalls

Scaling up is along two axes:

Large volume of data (example: CMS Project)
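The Conte figures quoted above can be cross-checked with a short calculation. This is only a sketch: the per-chip core counts (8 per Xeon, 60 per Xeon Phi) are not stated on the slide and are inferred here from the quoted totals for the 570 servers present at the time of testing.

```python
# Cross-check of the Conte core counts and benchmark efficiency.
SERVERS_TESTED = 570     # from the slide (580 servers in the full system)
XEONS_PER_SERVER = 2     # from the slide
PHIS_PER_SERVER = 2      # from the slide
CORES_PER_XEON = 8       # assumption, inferred: 9,120 / (570 * 2)
CORES_PER_PHI = 60       # assumption, inferred: 68,400 / (570 * 2)

standard_cores = SERVERS_TESTED * XEONS_PER_SERVER * CORES_PER_XEON
phi_cores = SERVERS_TESTED * PHIS_PER_SERVER * CORES_PER_PHI
total_cores = standard_cores + phi_cores

RMAX_TF = 943.0          # measured performance, from the slide
RPEAK_TF = 1342.0        # theoretical peak (1.342 PF), from the slide
efficiency = RMAX_TF / RPEAK_TF

print(standard_cores, phi_cores, total_cores)
print(f"Rmax / Rpeak = {efficiency:.1%}")
```

The totals match the slide (9,120 + 68,400 = 77,520), and the ratio of Rmax to Rpeak comes out at roughly 70 percent.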

(21)

Curation Challenges

Data production rate is tremendous

Volume of data is growing over time

Sensor sampling rate increasing

High definition video

Managing data transfer

Time required to upload and download data is growing

Uploads and downloads can take a long time if there are network bottlenecks

Ensuring data integrity

Filtering, cleaning, and calibration are often needed before uploading and

curating data

The community needs to also retain the raw data in case an error is

made or in case a researcher can later distill further insights from the

data.
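One common way to support the data integrity point above is to record a checksum of the raw file before upload and compare it after every transfer. The sketch below uses SHA-256 from the Python standard library; the choice of hash and the function names `sha256_of_file` and `verify_copy` are assumptions for illustration, not a description of how NEES handles integrity.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size chunks so large files never sit fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_copy(original: Path, copy: Path) -> bool:
    """Return True when the copy is byte-for-byte identical to the original."""
    return sha256_of_file(original) == sha256_of_file(copy)
```

Storing the digest next to the raw data also makes it possible to confirm, years later, that the retained raw file has not been silently altered.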

(22)

Curation Challenges

File type management

Data is stored in files through the intermediary of an application

This means that the information in the data will be encoded into some kind of

format

It’s difficult (if not impossible) to restrict the file formats used by

the research community

As these applications change (or disappear) over time, the information

encoded in the data may become stranded

Risk of stranded data

When the file format cannot be precisely identified, then we don’t

know which application can be used as an intermediary for reading

the information encoded in the data.
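The identification problem described above is often approached by inspecting the first bytes of a file rather than trusting its extension. The following is a minimal sketch with a deliberately tiny signature table; the function name `identify_format` is hypothetical, and a real curation pipeline would likely rely on a maintained format registry instead.

```python
from typing import Optional

# A few well-known leading byte signatures. Illustrative only, not exhaustive.
SIGNATURES = {
    b"%PDF-": "PDF document",
    b"\x89PNG\r\n\x1a\n": "PNG image",
    b"\x89HDF\r\n\x1a\n": "HDF5 container",
    b"PK\x03\x04": "ZIP archive (also the container for .docx and .xlsx)",
}


def identify_format(header: bytes) -> Optional[str]:
    """Return a format name if the header starts with a known signature."""
    for magic, name in SIGNATURES.items():
        if header.startswith(magic):
            return name
    return None  # unknown format: a candidate for stranded data


print(identify_format(b"%PDF-1.4\n"))
print(identify_format(b"custom instrument dump"))
```

A file for which this kind of check returns nothing is exactly the case the slide warns about: the bytes are preserved, but the application needed to read them is unknown.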

(23)

Curation Challenges

Linking computation with data and archived data

Will need the ability to quickly search archived data –

much more detailed than what Google can deliver

How can we quickly discover, convert, and transfer

archived data to be close to the user and to

computation? (especially HPC)

Need to match data I/O capabilities with growth in the

(24)

Long-term accessibility

We have data in the NEEShub from the 1970s

Science: “Rescue of Old Data Offers Lesson for

Particle Physicists” by Andrew Curry (Feb 2011)

Described the need to find old, almost lost data for a

physics experiment from the 1980s

The data will need to remain viable and

(25)

Discipline Sociology

Sociological factors in data curation

Disciplinary differences in how data are archived, how to value

archived data, and determining what is worth retaining

Who determines what is worth keeping?

What is the practice in the specific discipline?

International standards and practices in metadata tagging,

representing numbers, and even character sets

NEES is working with partners in Japan and China – we need to

determine how to represent their data in a common standard

framework

Terminology for numbers (“,” vs. “.”, lakh vs. 100,000)

Changing the behavior of scientists to value curation and long-term

accessibility
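The number-representation issue mentioned above (“,” vs. “.”, lakh vs. 100,000) can be made concrete with a small parser that refuses to guess the convention and instead takes it from metadata. This is a sketch under that assumption; the function name `parse_number` and its `decimal_mark` parameter are hypothetical.

```python
from decimal import Decimal

LAKH = 100_000  # South Asian counting unit: 1 lakh = 100,000


def parse_number(text: str, decimal_mark: str = ".") -> Decimal:
    """Parse a number whose decimal mark is declared explicitly, never guessed."""
    if decimal_mark not in (".", ","):
        raise ValueError("decimal_mark must be '.' or ','")
    group_mark = "," if decimal_mark == "." else "."
    # Drop spaces and grouping marks, then normalize the decimal mark to '.'.
    cleaned = text.strip().replace(" ", "").replace(group_mark, "")
    return Decimal(cleaned.replace(decimal_mark, "."))


print(parse_number("1,234.56"))                    # decimal point convention
print(parse_number("1.234,56", decimal_mark=","))  # decimal comma convention
print(parse_number("1,00,000") == LAKH)            # lakh-style digit grouping
```

The same string, such as "1,234", means different values under the two conventions, which is why the convention has to travel with the data as metadata.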

(26)

Managing Curation at Scale

How can we efficiently use data curators’ time?

NEES now has 1.8M files, what will happen in 3 more years?

How can we manage 10M files with a limited curation staff?

For NEES, we are using the OAIS model as a guideline for designing

a pipeline for curating NEES data

The OAIS model is proving to be a useful model for thinking about how to

undertake data curation

We are developing a curation pipeline to help automate curation for the

many files in the NEES Project Warehouse

(27)

Data Analytics

There are technologies available today that can be

used to provide solutions to these problems

High performance computing

Parallel file systems

Map Reduce/Hadoop

A sustainable solution requires more than a set of

technologies

An effective data cyberinfrastructure involves both

sociological and technological components.

What is needed to educate and train researchers to

(28)

Our approach

Developing a joint research and education program in big

data analytics among the University of Stavanger, Purdue

University, and AMD Research.

Chunming Rong, Tomasz Wlodarczyk (Stavanger)

Thomas Hacker, Ray Hansen, Natasha Nikolaidis (Purdue)

Greg Rodgers (AMD Research)

Funded by SIU: “Strategic Collaboration on Advanced Data Analysis and

Communication between Purdue University and University of Stavanger”

Developing a semester-long joint course in HPC and Big Data Analytics,

and a short summer course (to be delivered next week)

(29)

Planned Course Objectives

Students will learn to put modern tools to use in order to do data

analysis of large and complex data sets.

Students will be able to: design, construct, test, and benchmark a small

data processing cluster (based on Hadoop)

Demonstrate knowledge of MapReduce functionalities through the

development of a MapReduce program

Understand Hadoop job tracker, task tracker, scheduling issues,

communications, and resource management.

Construct programs based on MapReduce paradigm for typical algorithmic

problems

Use functional programming concepts to describe data dependencies and

analyze complexity of MapReduce programs
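The MapReduce objectives above can be illustrated without a cluster. The sketch below imitates the three stages (map, shuffle, reduce) for the classic word-count problem in plain Python; it is a teaching aid under that simplification, not Hadoop code, and all function names are illustrative.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit one (word, 1) pair per word. Each line is independent of the others."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all values by key, as the framework does between map and reduce."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: fold the values for one key into a single result."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


print(word_count(["big data", "big compute"]))
```

Because `map_phase` has no shared state and `reduce_phase` depends only on the values for its own key, both stages can run in parallel, which is the data-dependency argument the functional programming objective refers to.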

(30)

Planned Course Objectives

Algorithms

Understand algorithmic complexity of the worst case, expected case, and best

case running time (big-Oh notation), and the orders of complexity (e.g. N, N^2,

Log N, NP-Hard)

Examine a basic algorithm and identify the algorithmic complexity order

File Systems

Describe the concepts of a distributed file system, how it differs from a local file

system, and the performance of distributed file systems.

Describe a parallel file system, the performance advantages possible through the

use of a parallel file system, and the inherent 

reliability and fault tolerance mechanisms needed for parallel file systems.

Examples include OrangeFS and Lustre

Understand peak and sustained bandwidth rates

Understand the differences between RDBMS, data warehouse, unstructured big

data, and keyed files.
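For the algorithmic complexity objective on this slide, one simple exercise is to count steps rather than measure time. The sketch below compares a linear search (order N) with a binary search (order log N) on a sorted list; the step-counting functions are illustrative and not part of the course material.

```python
from typing import Sequence


def linear_search_steps(items: Sequence[int], target: int) -> int:
    """Count the elements inspected by a left-to-right scan: O(N) worst case."""
    steps = 0
    for value in items:
        steps += 1
        if value == target:
            break
    return steps


def binary_search_steps(sorted_items: Sequence[int], target: int) -> int:
    """Count the probes made by binary search on sorted input: O(log N) worst case."""
    low, high, steps = 0, len(sorted_items) - 1, 0
    while low <= high:
        steps += 1
        mid = (low + high) // 2
        if sorted_items[mid] == target:
            break
        if sorted_items[mid] < target:
            low = mid + 1
        else:
            high = mid - 1
    return steps


data = list(range(1024))
print(linear_search_steps(data, 1023), binary_search_steps(data, 1023))
```

Doubling the list length doubles the worst-case linear count but adds only one probe to the binary search, which is the practical meaning of the two orders.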

(31)

Short Course Format

Lecture in the morning followed by lab in the

afternoon

Labs are built on a set of Desktop PCs running Hadoop

in an RHEL6 virtual machine running on top of VMware

Using pfsense (open source firewall) to create a secure

network connection from the instruction site to the

computers running Hadoop

Working to refine the network and lab equipment

setup based on our experiences delivering the short

course next week.

(32)

Short Course – Day 1 Topics

Lecture

– Introduction and motivation for the course

– History of HPC, big data, Moore's Law.

– Science domain areas, and problems in each of those areas that motivate the need for this. Where are we today, and what is the projected need later? How are things driven by

increases in computing power?

– Definition of big data, big compute, why we need both combined

– Mixture of trends, principles, and implementation in historic context that students should understand.

– Parallel application types

– Introduction to MapReduce

– Dataflow within MapReduce with plug-in

Labs

The hadoop command, HDFS, and Linux basics

(33)

Short Course – Day 2 Topics

Lectures

–Introduction to MapReduce, continued

–Combiners

–More complex MapReduce example (search assist)

–Hadoop Architecture

–Motivation for Hadoop

–Hadoop building blocks (name node, data node, etc.)

–Fault tolerance and failures, replication, and data aware scheduling.

–Main components (HDFS, MapReduce, modes (local, distributed, pseudo distributed), etc.)

–HDFS GUI

Labs

–We will use combiners and multiple reducers to improve performance. We will look at network traffic

and data counters to evaluate. 

–Students will evaluate the performance improvement from each optimization of the MapReduce program.
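The effect a combiner has on shuffle traffic can be previewed with a small simulation. The sketch below is plain Python rather than Hadoop code: each inner list stands for one input split, and the record count stands in for the data counters examined in the lab. All names are illustrative.

```python
from collections import Counter
from typing import Dict, List, Tuple

Pair = Tuple[str, int]


def map_split(lines: List[str]) -> List[Pair]:
    """Map one input split: emit (word, 1) for every word."""
    return [(word, 1) for line in lines for word in line.split()]


def combine(pairs: List[Pair]) -> List[Pair]:
    """Combiner: pre-sum counts on the map side, before anything crosses the network."""
    local: Counter = Counter()
    for word, count in pairs:
        local[word] += count
    return list(local.items())


def run_job(splits: List[List[str]], use_combiner: bool) -> Tuple[Dict[str, int], int]:
    """Return the final counts and the number of records sent through the shuffle."""
    shuffled: List[Pair] = []
    for split in splits:
        pairs = map_split(split)
        shuffled.extend(combine(pairs) if use_combiner else pairs)
    totals: Counter = Counter()
    for word, count in shuffled:
        totals[word] += count
    return dict(totals), len(shuffled)


splits = [["a a a b"], ["a b b"]]
print(run_job(splits, use_combiner=False))
print(run_job(splits, use_combiner=True))
```

The final counts are identical in both runs; only the number of shuffled records changes. This works because addition is associative and commutative, which is the condition a reducer must meet before it can safely be reused as a combiner.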

(34)

Short Course – Day 3 Topics

Lectures

– Hadoop Architecture, continued

– Comparison of HDFS with other Parallel File System architectures (GoogleFS, Lustre, OrangeFS), and how Hadoop differs from these systems

– Chaining MapReduce jobs

– MapReduce algorithms: K-means or other algorithms

– Schemas for unstructured data using Hive

– Introduction to data organization. Why are we concerned about data organization? What are the impacts of poor organization on performance and correctness?

– Data organization: Level of data organization - data structure, file level, cluster level, data parallelization, organization level.

– How do we deal with large sequential files from a performance perspective, and how would they be represented in a parallel file system (e.g. HDFS)?
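The lecture topics on chaining MapReduce jobs and K-means fit together naturally: one K-means iteration can be written as a single map and reduce pass, and the output centroids of one pass feed the next. The following is a plain-Python sketch of that idea for 2-D points, not the course's actual lab code; all names are illustrative.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]


def nearest(point: Point, centroids: Sequence[Point]) -> int:
    """Index of the centroid closest to the point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: (point[0] - centroids[i][0]) ** 2
        + (point[1] - centroids[i][1]) ** 2,
    )


def kmeans_iteration(points: Sequence[Point], centroids: Sequence[Point]) -> List[Point]:
    # Map: emit (index of nearest centroid, point); grouping plays the role of the shuffle.
    groups: Dict[int, List[Point]] = defaultdict(list)
    for point in points:
        groups[nearest(point, centroids)].append(point)
    # Reduce: average the points assigned to each centroid.
    updated: List[Point] = []
    for index, old in enumerate(centroids):
        members = groups.get(index)
        if not members:
            updated.append(old)  # keep a centroid that attracted no points
            continue
        n = len(members)
        updated.append((sum(p[0] for p in members) / n, sum(p[1] for p in members) / n))
    return updated


def kmeans(points: Sequence[Point], centroids: Sequence[Point], iterations: int = 10) -> List[Point]:
    """Chain the jobs: each iteration consumes the previous iteration's centroids."""
    for _ in range(iterations):
        centroids = kmeans_iteration(points, centroids)
    return list(centroids)


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
print(kmeans(points, [(0.0, 0.0), (10.0, 10.0)]))
```

On a real cluster each iteration would be a separate job reading the previous job's output, so the cost of launching jobs and rereading the data becomes part of the performance discussion.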

Lab

(35)

Expected Outcomes

Provide education and training that allow researchers

to think effectively about big data and to use these

technologies in their research and daily work.

Improved data collection and management practices

by researchers

Development of new techniques for collaboration on a

joint course across the Atlantic with a shared lab

infrastructure for lab assignments.

(36)

Conclusions

There is a need for data intensive training and education for scientists and

engineers

– Effectively use existing technologies

– Develop new disciplinary practices for annotating and preserving valuable data

– Understand the critical need for data curation for the viability and long-term accessibility of data

We are developing an education and research program focused on these issues

– Short course

– Semester length joint course at University of Stavanger and Purdue University

Holding a symposium at the CloudCom conference in December

DataCom - Symposium on High Performance and Data Intensive Computing 

– Thomas Hacker, Purdue Univ., USA

– Tomasz Wiktor Wlodarczyk, University of Stavanger, Norway

– DataCom is organized under CloudCom as two tracks

– Big Data
