Big Data and Clouds

(1)

https://portal.futuregrid.org

Big Data and Clouds

May 7 2013

Geoffrey Fox

[email protected]

http://www.infomall.org http://www.futuregrid.org

School of Informatics and Computing Digital Science Center

(2)

Abstract

• We explore the principle that much of “the future” will be characterized by “Using Clouds running Data Analytics processing Big Data to solve problems in X-Informatics”. Applications (values of X) already include, explicitly, Astronomy, Biology, Biomedicine, Business, Chemistry, Crisis, Energy, Environment, Finance, Health, Intelligence, Lifestyle, Marketing, Medicine, Pathology, Policy, Radar, Security, Sensor, Social, Sustainability, Wealth and Wellness, with more fields defined implicitly. We discuss the implications of this concept for education and research. Education requires new curricula, generically called data science, which will be hugely popular because of the many millions of jobs opening up both in “core technology” and within applications, where of course most of the opportunities lie. We discuss the possibility of using MOOCs to jumpstart the field. On the research side, big data (i.e. large applications) requires big (i.e. scalable) algorithms on big infrastructure running robust, convenient programming environments. We discuss clustering and information visualization using dimension reduction as examples of scalable algorithms. We compare the Message Passing Interface (MPI) and extensions of MapReduce as the core technology to execute data analytics.

• We mention FutureGrid and a software defined Computing Testbed as a Service

(3)

Big Data Ecosystem in One Sentence

Use Clouds running Data Analytics Collaboratively processing Big Data to solve problems in X-Informatics (or e-X)

X = Astronomy, Biology, Biomedicine, Business, Chemistry, Climate, Crisis, Earth Science, Energy, Environment, Finance, Health, Intelligence, Lifestyle, Marketing, Medicine, Pathology, Policy, Radar, Security, Sensor, Social, Sustainability, Wealth and Wellness with more fields (physics) defined implicitly

Spans Industry and Science (research)

Education: Data Science see recent New York Times articles

http://datascience101.wordpress.com/2013/04/13/new-york-times-data-science-articles/

(4)

(5)

Issues of Importance

• Economic Imperative: There are a lot of data and a lot of jobs

• Computing Model: Industry adopted clouds which are attractive for

data analytics

• Research Model: 4th Paradigm; From Theory to Data driven science?

• Confusion in a new-old field: lack of consensus academically in

several aspects of data intensive computing from storage to algorithms, to processing and education

• Progress in Data Intensive Programming Models

• Progress in Academic (open source) clouds

• Progress in scalable robust Algorithms: new data need better algorithms?

• Progress in Data Science Education: opportunities at universities

(6)

Economic Imperative

There are a lot of data and a lot of jobs

(7)

Data Deluge

(8)

Some Trends

• The Data Deluge is a clear trend from Commercial (Amazon, e-commerce), Community (Facebook, Search) and Scientific applications

• Light weight clients from smartphones, tablets to sensors

• Multicore reawakening parallel computing

• Exascale initiatives will continue drive to high end with a simulation orientation

• Clouds with cheaper, greener, easier to use IT for (some) applications

• New jobs associated with new curricula

• Clouds as a distributed system (classic CS courses)

• Data Analytics (Important theme in academia and industry)

• Network/Web Science

(9)


(10)

Network v Data v Compute Growth

• Moore’s Law Unnormalized

• Slope of #1 Top 500 > Total Data > Moore > IP Traffic

[Figure: growth curves plotted against Year (1995 to 2020) on a logarithmic scale from 0.1 to 10000]

(11)

Some Data sizes

• ~40 × 10^9 Web pages at ~300 kilobytes each = 10 Petabytes

• Youtube: 48 hours video uploaded per minute (2011)

– in 2 months in 2010, uploaded more than total NBC ABC CBS

– ~2.5 petabytes per year uploaded?

• LHC: 15 petabytes per year

• Radiology: 69 petabytes per year

• Square Kilometer Array Telescope: will be 100 terabits/second

• Earth Observation: becoming ~4 petabytes per year

• Earthquake Science: few terabytes total today

• PolarGrid: 100’s terabytes/year

• Exascale simulation data dumps: terabytes/second
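The web-page estimate at the top of this list can be checked with a few lines of arithmetic. The sketch below uses the slide's rounded inputs and assumes decimal units (1 kilobyte = 1000 bytes, 1 petabyte = 10^15 bytes); the variable names are illustrative only. The result is of the same order as the roughly 10 petabytes quoted.

```python
# Rough check of the web-page estimate, using the slide's approximate inputs.
WEB_PAGES = 40e9          # ~40 x 10^9 pages
BYTES_PER_PAGE = 300e3    # ~300 kilobytes each (assumes 1 kB = 1000 bytes)
PETABYTE = 1e15           # decimal petabyte (assumption)

total_bytes = WEB_PAGES * BYTES_PER_PAGE
total_petabytes = total_bytes / PETABYTE

print(f"Estimated web corpus: {total_petabytes:.0f} PB")
```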

(12)

(13)

(14)

(15)

Ruh VP Software GE http://fisheritcenter.haas.berkeley.edu/Big_Data/index.html

(16)

“Taming the Big Data Tidal Wave” 2012

(Bill Franks, Chief Analytics Officer Teradata)

• Web Data (“the original big data”)

– Analyze customer web browsing of e-commerce site to see topics looked at etc.

• Auto Insurance (telematics monitoring driving)

– Equip cars with sensors

• Text data in multiple industries

– Sentiment analysis, identify common issues (as in eBay lamp example), Natural Language processing

• Time and location (GPS) data

– Track trucks (delivery), vehicles (track), people (tell them nearby goodies)

• Retail and manufacturing: RFID

– Asset and inventory management,

• Utility industry: Smart Grid

– Sensors allow dynamic optimization of power

• Gaming industry: Casino Chip tracking (RFID)

– Track individual players, detect fraud, identify patterns

• Industrial engines and equipment: sensor data

– See GE engine

• Video games: telemetry

– This is like monitoring web browsing but rather monitor actions in a game

• Telecommunication and other industries: Social Network data

– Connections make this big data.

(17)


http://www.genome.gov/sequencingcosts/

Faster than Moore’s Law

Slower?

Why need cost effective

Computing!

(18)

The Long Tail of Science

80-20 rule: 20% users generate 80% data but not necessarily 80% knowledge

Collectively “long tail” science is generating a lot of data

Estimated at over 1PB per year and it is growing fast.

(19)

Jobs

(20)

Jobs v. Countries


(21)

McKinsey Institute on Big Data Jobs

• There will be a shortage of talent necessary for organizations to take advantage of big data. By 2018, the United States alone could face a

shortage of 140,000 to 190,000 people with deep analytical skills as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions.

• Informatics aimed at 1.5 million jobs. Computer Science covers the 140,000 to 190,000

(22)

Tom Davenport Harvard Business School

(23)

(24)

(25)

Industry Trends

(26)

(27)

(28)

(29)

(30)

(31)

(32)

(33)

(34)

Computing Model

Industry adopted clouds which are

attractive for data analytics

(35)

5 years Cloud Computing

(36)

Amazon making money

• It took Amazon Web Services (AWS) eight

years to hit $650 million in revenue, according

to Citigroup in 2010.

• Just three years later, Macquarie Capital

(37)

Physically Clouds are Clear

• A bunch of computers in an efficient data center with an

excellent Internet connection

• They were produced to meet need of public-facing Web

2.0 e-Commerce/Social Networking sites

• They can be considered as “optimal giant data center”

plus internet connection

• Note enterprises use private clouds that are giant data

centers but not optimized for Internet access

• Exascale build-out of commercial cloud infrastructure: for 2014-15 expect 10,000,000 new servers and 10

(38)

Virtualization made several things more

convenient

• Virtualization = abstraction; run a job – you know not

where

• Virtualization = use hypervisor to support “images”

– Allows you to define complete job as an “image” – OS + application

• Efficient packing of multiple applications into one

server as they don’t interfere (much) with each other

if in different virtual machines;

• They interfere if put as two jobs in same machine as

for example must have same OS and same OS

services

• Also security model between VM’s more robust than

(39)

Clouds Offer

From different points of view

• Features from NIST:

– On-demand service (elastic);

– Broad network access;

– Resource pooling;

– Flexible resource allocation;

– Measured service

• Economies of scale in performance and electrical power (Green IT)

• Powerful new software models

– Platform as a Service is not an alternative to Infrastructure as a Service – it is instead an incredible value added

– Amazon now offers PaaS, which is where Azure and Google started; Azure and Google now offer IaaS, which is where Amazon started

• They are cheaper than classic clusters unless latter 100% utilized

(40)

Research Model

4th Paradigm; From Theory to Data driven science?

(41)

(42)

The 4 paradigms of Scientific Research

1. Theory

2. Experiment or Observation

• E.g. Newton observed apples falling to design his theory of

mechanics

3. Simulation of theory or model

4. Data-driven (Big Data) or The Fourth Paradigm: Data-Intensive Scientific Discovery (aka Data Science)

• http://research.microsoft.com/en-us/collaboration/fourthparadigm/

A free book

(43)

Anand Rajaraman is Senior Vice President at Walmart Global eCommerce, where he heads up the newly created @WalmartLabs

More data usually beats better algorithms

Here's how the competition works. Netflix has provided a large data set that tells you how nearly half a million people have rated about 18,000 movies. Based on these ratings, you are asked to predict the ratings of these users for movies in the set that they have not rated. The first team to beat the accuracy of Netflix's proprietary algorithm by a certain margin wins a prize of $1 million!

Different student teams in my class adopted different approaches to the problem, using both published algorithms and novel ideas. Of these, the results from two of the teams illustrate a broader point. Team A came up with a very sophisticated algorithm using the Netflix data. Team B used a very simple algorithm, but they added in additional data beyond the Netflix set: information about movie genres from the Internet Movie Database(IMDB). Guess which team did better?

http://anand.typepad.com/datawocky/2008/03/more-data-usual.html

(44)

Confusion in the new-old data field

lack of consensus academically in several aspects

from storage to algorithms, to processing and

education

(45)

Data Communities Confused I?

• Industry seems to know what it is doing although it’s secretive –

Amazon’s last paper on their recommender system was 2003

– Industry runs the largest data analytics on clouds

– But industry algorithms are rather different from science

• Academia confused on repository model: traditionally one stores

data but one needs to support “running Data Analytics” and one is taught to bring computing to data as in Google/Hadoop file system

– Either store data in compute cloud OR enable high performance networking between distributed data repositories and “analytics engines”

• Academia confused on data storage model: Files (traditional) v.

Database (old industry) v. NOSQL (new cloud industry)

– Hbase, MongoDB, Riak, Cassandra are typical NOSQL systems

• Academia confused on curation of data: University Libraries,

Projects, National repositories, Amazon/Google?

(46)

Data Communities Confused II?

• Academia agrees on principles of Simulation Exascale Architecture:

HPC Cluster with accelerator plus parallel wide area file system

– Industry doesn’t make extensive use of high end simulation

• Academia confused on architecture for data analysis: Grid (as in

LHC), Public Cloud, Private Cloud, re-use simulation architecture with database, object store, parallel file system, HDFS style data

• Academia has not agreed on Programming/Execution model: “Data

Grid Software”, MPI, MapReduce ..

• Academia has not agreed on need for new algorithms: Use natural

extension of old algorithms, R or Matlab. Simulation successes built on great algorithm libraries;

• Academia has not agreed on what algorithms are important?

• Academia could attract more students: with data-oriented curricula

that prepare for industry or research careers

(47)

Clouds in Research

(48)

2 Aspects of Cloud Computing:

Infrastructure and Runtimes

• Cloud infrastructure: outsourcing of servers, computing, data, file space, utility computing, etc.

• Cloud runtimes or Platform: tools to do data-parallel (and other) computations. Valid on Clouds and traditional clusters

– Apache Hadoop, Google MapReduce, Microsoft Dryad, Bigtable, Chubby and others

– MapReduce designed for information retrieval but is excellent for a wide range of science data analysis applications

– Can also do much traditional parallel computing for data-mining if extended to support iterative operations

(49)

Clouds have highlighted SaaS PaaS IaaS

• Software (Application Or Usage) SaaS: Software Services are building blocks of applications

– Education

– Applications

– CS Research Use e.g. test new compiler or storage model

• Platform PaaS: The middleware or computing environment including HPC, Grids …

– Cloud e.g. MapReduce

– HPC e.g. PETSc, SAGA

– Computer Science e.g. Compiler tools, Sensor nets, Monitors

• Infrastructure IaaS: Nimbus, Eucalyptus, OpenStack, OpenNebula, CloudStack plus Bare-metal

– Software Defined Computing (virtual Clusters)

– Hypervisor, Bare Metal

– Operating System

• Network NaaS: OpenFlow – likely to grow in importance

– Software Defined Networks

– OpenFlow GENI

(50)

Science Computing Environments

• Large Scale Supercomputers – Multicore nodes linked by high

performance low latency network

– Increasingly with GPU enhancement

– Suitable for highly parallel simulations

• High Throughput Systems such as European Grid Initiative EGI or

Open Science Grid OSG typically aimed at pleasingly parallel jobs

– Can use “cycle stealing”

– Classic example is LHC data analysis

• Grids federate resources as in EGI/OSG or enable convenient access

to multiple backend systems including supercomputers

• Use Services (SaaS)

– Portals make access convenient and

– Workflow integrates multiple processes into a single job

(51)

Clouds HPC and Grids

• Synchronization/communication Performance

Grids > Clouds > Classic HPC Systems

• Clouds naturally execute effectively Grid workloads but are less

clear for closely coupled HPC applications

• Classic HPC machines as MPI engines offer highest possible

performance on closely coupled problems

• The 4 forms of MapReduce/MPI

1) Map Only – pleasingly parallel

2) Classic MapReduce as in Hadoop; single Map followed by reduction with fault tolerant use of disk

3) Iterative MapReduce used for data mining such as Expectation Maximization in clustering etc.; Cache data in memory between iterations and support the large collective communication (Reduce, Scatter, Gather, Multicast) used in data mining
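The first two forms listed above can be sketched in a few lines of single-process Python. This is only an illustration of the programming pattern, not Hadoop or any real runtime: the function names `map_only` and `classic_mapreduce` and the sample `energies` list are hypothetical, and there is no distribution, disk use or fault tolerance here.

```python
from collections import defaultdict
from functools import reduce

def map_only(records, map_fn):
    """Form 1: independent maps with no reduction (pleasingly parallel)."""
    return [map_fn(record) for record in records]

def classic_mapreduce(records, map_fn, reduce_fn):
    """Form 2: one map phase, a shuffle by key, then one reduction per key."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)          # shuffle: group values by key
    return {key: reduce(reduce_fn, values) for key, values in groups.items()}

# Hypothetical data: a histogram with unit-width bins, built by map + reduce.
energies = [0.5, 1.2, 1.7, 2.1, 2.9, 0.1]
squared = map_only(energies, lambda e: e * e)
histogram = classic_mapreduce(
    energies,
    map_fn=lambda e: [(int(e), 1)],            # emit (bin, 1) for each value
    reduce_fn=lambda a, b: a + b,              # sum the counts in each bin
)
```

In a real deployment each map call would run on a different worker, and the shuffle and reduce would move data between machines.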

(52)

Cloud Applications

(53)

What Applications work in Clouds

• Pleasingly (moving to modestly) parallel applications of all sorts

with roughly independent data or spawning independent simulations

– Long tail of science and integration of distributed sensors

• Commercial and Science Data analytics that can use MapReduce

(some of such apps) or its iterative variants (most other data analytics apps)

• Which science applications are using clouds?

– Venus-C (Azure in Europe): 27 applications not using Scheduler, Workflow or MapReduce (except roll your own)

– 50% of applications on FutureGrid are from Life Science – Locally Lilly corporation is commercial cloud user (for drug

discovery) but not IU Biology

• But overall very little science use of clouds

(54)

Parallelism over Users and Usages

• “Long tail of science” can be an important usage mode of clouds.

• In some areas like particle physics and astronomy, i.e. “big science”, there are just a few major instruments generating now petascale data driving discovery in a coordinated fashion.

• In other areas such as genomics and environmental science, there are many “individual” researchers with distributed collection and analysis of data whose total data and processing needs can match the size of big science.

• Clouds can provide scaling convenient resources for this important

aspect of science.

• Can be map only use of MapReduce if different usages naturally linked e.g. exploring docking of multiple chemicals or alignment of multiple DNA sequences

– Collecting together or summarizing multiple “maps” is a simple Reduction

(55)

Internet of Things and the Cloud

• Cisco projects that there will be 50 billion devices on the Internet by 2020. Most will be small sensors that send streams of information into the cloud where it will be processed and integrated with other streams and turned into knowledge that will help our lives in a

multitude of small and big ways.

• The cloud will become increasingly important as a controller of and

resource provider for the Internet of Things.

• As well as today’s use for smart phone and gaming console support, “Intelligent River” “smart homes and grid” and “ubiquitous cities” build on this vision and we could expect a growth in cloud

supported/controlled robotics.

• Some of these “things” will be supporting science

• Natural parallelism over “things”

• “Things” are distributed and so form a Grid

(56)

Sensors (Things) as a Service

Sensors as a Service

Sensor Processing as

a Service (could use MapReduce)

A larger sensor ……… Output Sensor

(57)

(58)

(59)

(60)

4 Forms of MapReduce

(a) Map Only: Input → map → Output. BLAST Analysis, Parametric sweep, Pleasingly Parallel

(b) Classic MapReduce: Input → map → reduce. High Energy Physics (HEP) Histograms, Distributed search

(c) Iterative MapReduce: Input → map → reduce with Iterations. Expectation maximization, Clustering e.g. Kmeans, Linear Algebra, Page Rank

(d) Loosely Synchronous: map with P_ij communication. Classic MPI, PDE Solvers and particle dynamics

Domain of MapReduce and Iterative Extensions: Science Clouds

MPI: Exascale

(61)

Classic Parallel Computing

• HPC: Typically SPMD (Single Program Multiple Data) “maps” typically processing particles or mesh points interspersed with multitude of low latency messages supported by specialized networks such as Infiniband and technologies like MPI

– Often run large capability jobs with 100K (going to 1.5M) cores on same job

– National DoE/NSF/NASA facilities run 100% utilization

– Fault fragile and cannot tolerate “outlier maps” taking longer than others

• Clouds: MapReduce has asynchronous maps typically processing data points with results saved to disk. Final reduce phase integrates

results from different maps

– Fault tolerant and does not require map synchronization

– Map only useful special case

• HPC + Clouds: Iterative MapReduce caches results between

“MapReduce” steps and supports SPMD parallel computing with large messages as seen in parallel kernels (linear algebra) in

clustering and other data mining
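As an illustration of the iterative MapReduce pattern described above, the sketch below runs K-means with the data partitions held in memory across iterations and a global sum as the only communication step. It is a single-process NumPy sketch under stated assumptions, not Twister or any specific runtime; `kmeans_map`, `kmeans_reduce` and `iterative_kmeans` are hypothetical names.

```python
import numpy as np

def kmeans_map(partition, centers):
    """Map: assign each point to its nearest center and emit partial sums."""
    d2 = ((partition[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)
    sums = np.zeros_like(centers)
    counts = np.zeros(centers.shape[0])
    np.add.at(sums, nearest, partition)        # per-center coordinate sums
    np.add.at(counts, nearest, 1)              # per-center point counts
    return sums, counts

def kmeans_reduce(partials):
    """Reduce (collective): global sum of the per-partition partial results."""
    sums = sum(p[0] for p in partials)
    counts = sum(p[1] for p in partials)
    return sums, counts

def iterative_kmeans(partitions, centers, iterations=10):
    """Repeat map + reduce; partitions stay cached, only centers change."""
    centers = np.asarray(centers, dtype=float)
    for _ in range(iterations):
        # In a real runtime these map calls would run in parallel.
        partials = [kmeans_map(p, centers) for p in partitions]
        sums, counts = kmeans_reduce(partials)
        nonempty = counts > 0                  # leave empty clusters unchanged
        centers = centers.copy()
        centers[nonempty] = sums[nonempty] / counts[nonempty][:, None]
    return centers
```

Only the small array of centers and the partial sums move between steps, which is why the communication takes the form of large reductions rather than many small messages.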

(62)

Data Intensive Applications

• Applications tend to be new and so can consider emerging

technologies such as clouds

• Do not have lots of small messages but rather large reduction (aka Collective) operations

– New optimizations e.g. for huge messages

• EM (expectation maximization) tends to be good for clouds and

Iterative MapReduce

– Quite complicated computations (so compute largish compared to communicate)

– Communication is Reduction operations (global sums or linear algebra in our case)

• We looked at Clustering and Multidimensional Scaling using

deterministic annealing which are both EM

– See also Latent Dirichlet Allocation and related Information Retrieval algorithms with similar EM structure

(63)

Data Intensive Programming Models

(64)

Map Collective Model (Judy Qiu)

• Combine MPI and MapReduce ideas

• Implement collectives optimally on Infiniband,

Azure, Amazon ……

[Figure labels: Input, map, Generalized Reduce, Initial Collective Step, Final Collective Step]

(65)

Twister for Data Intensive

Iterative Applications

• (Iterative) MapReduce structure with Map-Collective is

framework

• Twister runs on Linux or Azure

• Twister4Azure is built on top of Azure tables, queues, storage

[Figure labels: Compute, Communication, Reduce/barrier]

(66)

Pleasingly Parallel

Performance Comparisons

BLAST Sequence Search

Cap3 Sequence Assembly

(67)

Multi Dimensional Scaling

[Figure panels: Weak Scaling; Data Size Scaling. Performance adjusted for sequential performance difference]

Each iteration chains three Map → Reduce → Merge steps:

• BC: Calculate BX

• X: Calculate invV (BX)

• Calculate Stress

then starts a New Iteration

(68)

Hadoop adjusted for Azure: Hadoop KMeans run time adjusted for the performance difference of iDataplex vs Azure

[Figure: Time (ms), 0 to 1400, against Num cores x Num Data Points (32 x 32 M, 64 x 64 M, 128 x 128 M, 256 x 256 M). Series: Twister4Azure, T4A + tree broadcast, T4A + AllReduce, Hadoop Adjusted for Azure]

(69)

Kmeans Strong Scaling

(with Hadoop Adjusted)

128 Million data points. 500 Centroids (clusters). 20 Dimensions. 10 iterations Parallel efficiency relative to the 32 core run time.

Note Hadoop slower by factor of 2

[Figure: Relative Parallel Efficiency, 0.5 to 1, against Num Cores (32 to 256). Series: T4A + AllReduce, T4A + tree broadcast, Twister4Azure-legacy, Hadoop]
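The efficiency measure used on this slide, parallel efficiency relative to the 32 core run time at fixed problem size, can be written as a one-line formula. The sketch below assumes the usual definition (baseline time × baseline cores) / (time × cores); the function name and the timings in the example are hypothetical, not measurements from the talk.

```python
def relative_parallel_efficiency(base_cores, base_time, cores, time):
    """Strong-scaling efficiency relative to a baseline run.

    Equals 1.0 when doubling the cores exactly halves the run time.
    """
    return (base_time * base_cores) / (time * cores)

# Hypothetical timings in seconds, for illustration only.
perfect = relative_parallel_efficiency(32, 100.0, 64, 50.0)
lossy = relative_parallel_efficiency(32, 100.0, 256, 16.0)
print(f"64 cores: {perfect:.2f}, 256 cores: {lossy:.2f}")
```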

(70)

Kmeans Clustering

[Figure: Number of Executing Map Tasks (0 to 300) against Elapsed Time (s) (0 to 250)]

This shows that the communication and synchronization overheads between iterations are very small (less than one second, which is the lowest measured unit for this graph).

(71)

Kmeans Clustering

128 Million data points(19GB), 500 centroids (78KB), 20 dimensions 10 iterations, 256 cores, 256 map tasks per iteration

[Figure: plotted against Map Task ID (0 to 2304)]

(72)

FutureGrid Technology

(73)

FutureGrid Testbed as a Service

• FutureGrid is part of XSEDE set up as a testbed with cloud focus

• Operational since Summer 2010 (i.e. now in third year of use)

• The FutureGrid testbed provides to its users:

– Support of Computer Science and Computational Science research

– A flexible development and testing platform for middleware and application users looking at interoperability, functionality,

performance or evaluation

– FutureGrid is user-customizable, accessed interactively and supports Grid, Cloud and HPC software with and without VM’s

– A rich education and teaching platform for classes

(74)

4 Use Types for FutureGrid

TestbedaaS

• 292 approved projects (1734 users) April 6 2013

– USA(79%), Puerto Rico(3%- Students in class), India, China, lots of European countries (Italy at 2% as class)

– Industry, Government, Academia

• Computer science and Middleware (55.6%)

– Core CS and Cyberinfrastructure; Interoperability (3.6%) for Grids and Clouds such as Open Grid Forum OGF Standards

• New Domain Science applications (20.4%)

– Life science highlighted (10.5%), Non Life Science (9.9%)

• Training Education and Outreach (14.9%)

– Semester and short events; focus on outreach to HBCU

• Computer Systems Evaluation (9.1%)

– XSEDE (TIS, TAS), OSG, EGI; Campuses

(75)

Sample FutureGrid Projects I

• FG18 Privacy preserving gene read mapping developed hybrid

MapReduce. Small private secure + large public with safe data. Won 2011 PET Award for Outstanding Research in Privacy Enhancing

Technologies

• FG132, Power Grid Sensor analytics on the cloud with distributed Hadoop. Won the IEEE Scaling challenge at CCGrid2012.

• FG156 Integrated System for End-to-end High Performance Networking showed that the RDMA over Converged Ethernet

(InfiniBand made to work over Ethernet network frames) protocol could be used over wide-area networks, making it viable in cloud computing environments.

• FG172 Cloud-TM on distributed concurrency control (software transactional memory): "When Scalability Meets Consistency: Genuine Multiversion Update Serializable Partial Data Replication," 32nd International Conference on Distributed Computing Systems (ICDCS'12) (good conference) used 40 nodes of FutureGrid

(76)

Sample FutureGrid Projects II

• FG42,45 SAGA Pilot Job P* abstraction and applications. XSEDE

Cyberinfrastructure used on clouds

• FG130 Optimizing Scientific Workflows on Clouds. Scheduling Pegasus on distributed systems with overhead measured and reduced. Used Eucalyptus on FutureGrid

• FG133 Supply Chain Network Simulator Using Cloud Computing with dynamic virtual machines supporting Monte Carlo simulation with Grid Appliance and Nimbus

• FG257 Particle Physics Data analysis for ATLAS LHC experiment used FutureGrid + Canadian Cloud resources to study data analysis on

Nimbus + OpenStack with up to 600 simultaneous jobs

• FG254 Information Diffusion in Online Social Networks is evaluating NoSQL databases (Hbase, MongoDB, Riak) to support analysis of

Twitter feeds

• FG323 SSD performance benchmarking for HDFS on Lima

(77)

Education and Training Use of FutureGrid

• 27 Semester long classes: 563+ students

– Cloud Computing, Distributed Systems, Scientific Computing and Data Analytics

• 3 one week summer schools: 390+ students

– Big Data, Cloudy View of Computing (for HBCU’s), Science Clouds

• 1 two day workshop: 28 students

• 5 one day tutorials: 173 students

• From 19 Institutions

• Developing 2 MOOC’s (Google Course Builder) on Cloud Computing and use of FutureGrid supported by either FutureGrid or

downloadable appliances (custom images)

– See http://cgltestcloud1.appspot.com/preview

• FutureGrid appliances support Condor/MPI/Hadoop/Iterative MapReduce virtual clusters

(78)

Performance of Dynamic Provisioning

• 4 Phases a) Design and create image (security vet) b) Store in

repository as template with components c) Register Image to VM Manager (cached ahead of time) d) Instantiate (Provision) image

(79)

FutureGrid is an onramp to other systems

• FG supports Education & Training for all systems

• User can do all work on FutureGrid OR

• User can download Appliances on local machines (Virtual Box) OR

• User soon can use CloudMesh to jump to chosen production system

• CloudMesh is similar to OpenStack Horizon, but aimed at multiple

federated systems.

– Built on RAIN and tools like libcloud, boto with protocol (EC2) or programmatic API (python)

– Uses general templated image that can be retargeted

– One-click template & image install on various IaaS & bare metal including Amazon, Azure, Eucalyptus, Openstack, OpenNebula, Nimbus, HPC

– Provisions the complete system needed by user and not just a single image; copes with resource limitations and deploys full range of software

– Integrates our VM metrics package (TAS collaboration) that links to XSEDE (VM's are different from traditional Linux in metrics supported and needed)

(80)

Lessons learnt from FutureGrid

• Unexpected major use from Computer Science and Middleware

• Rapid evolution of Technology Eucalyptus → Nimbus → OpenStack

• Open source IaaS maturing as in “Paypal To Drop VMware From 80,000 Servers and Replace It With OpenStack” (Forbes)

– “VMWare loses $2B in market cap”; eBay expects to switch broadly?

• Need interactive not batch use; nearly all jobs short

• Substantial TestbedaaS technology needed, and FutureGrid developed some of it (RAIN, CloudMesh, Operational model)

• Lessons more positive than DoE Magellan report (aimed as an early science cloud) but goals different

• Still serious performance problems in clouds for networking and device (GPU) linkage; many activities outside FG addressing

– One can get good Infiniband performance on a peculiar OS + Mellanox drivers but not general yet

• We identified characteristics of “optimal hardware”

• Run system with integrated software (computer science) and systems administration team

• Build Computer Testbed as a Service Community

(81)

Future Directions for FutureGrid

• Poised to support more users as technology like OpenStack matures

– Please encourage new users and new challenges

• More focus on academic Platform as a Service (PaaS) - high-level middleware (e.g. Hadoop, Hbase, MongoDB) – as IaaS gets easier to deploy

• Expect increased Big Data challenges

• Improve Education and Training with model for MOOC laboratories

• Finish CloudMesh (and integrate with Nimbus Phantom) to make FutureGrid as hub to jump to multiple different “production” clouds commercially, nationally and on campuses; allow cloud bursting

– Several collaborations developing

• Build underlying software defined system model with integration with GENI and high performance virtualized devices (MIC, GPU)

• Improved ubiquitous monitoring at PaaS IaaS and NaaS levels

• Improve “Reproducible Experiment Management” environment

• Expand and renew hardware via federation

(82)

Direct GPU Virtualization

• Allow VMs to directly access GPU hardware

• Enables CUDA and OpenCL code – no need for custom APIs

• Utilizes PCI-passthrough of device to guest VM

– Hardware directed I/O virt (VT-d or IOMMU)

– Provides direct isolation and security of device from host or other VMs

– Removes much of the Host <-> VM overhead

• Similar to what Amazon EC2 uses (proprietary)

(83)

Performance 1

[Figure: Max FLOPS (Autotuned). Benchmarks maxspflops and maxdpflops; GFLOPS axis from 0 to 1200]

(84)

Performance 2

[Figure: Fast Fourier Transform and Matrix-Matrix Multiplication. Benchmarks fft_sp, fft_sp_pcie, ifft_sp, ifft_sp_pcie, fft_dp, fft_dp_pcie, ifft_dp, ifft_dp_pcie, sgemm_n, sgemm_n_pcie, sgemm_t, sgemm_t_pcie, dgemm_n, dgemm_n_pcie, dgemm_t, dgemm_t_pcie; GFLOPS axis from 0 to 300]

(85)

Algorithms

Scalable Robust Algorithms: new data need better algorithms?

(86)

Algorithms for Data Analytics

• In simulation area, it is observed that equal contributions

to improved performance come from increased computer

power and better algorithms

http://cra.org/ccc/docs/nitrdsymposium/pdfs/keyes.pdf

• In data intensive area, we haven’t seen this effect so

clearly

– Information retrieval revolutionized but

– Still using Blast in Bioinformatics (although Smith Waterman etc. better)

– Still using R library which has many non optimal algorithms

– Parallelism and use of GPU’s often ignored

(87)

(88)

Data Analytics Futures?

• PETSc and ScaLAPACK and similar libraries very important in

supporting parallel simulations

• Need equivalent Data Analytics libraries

• Include datamining (Clustering, SVM, HMM, Bayesian Nets …), image

processing, information retrieval including hidden factor analysis

(LDA), global inference, dimension reduction

– Many libraries/toolkits (R, Matlab) and web sites (BLAST) but typically not aimed at scalable high performance algorithms

• Should support clouds and HPC; MPI and MapReduce

– Iterative MapReduce an interesting runtime; Hadoop has many limitations

• Need a coordinated Academic Business Government Collaboration to build robust algorithms that scale well

– Crosses Science, Business Network Science, Social Science

• Propose to build community to define & implement SPIDAL or Scalable Parallel Interoperable Data Analytics Library

(89)

Deterministic Annealing

• Deterministic Annealing works in many areas including clustering, latent factor analysis, dimension reduction for both metric and non metric spaces

– ~Always gets better answers than K-means and R?

– But can be parallelized and put on GPU

(90)

DA is Multiscale and Parallel

• Start at high temperature with one cluster and keep splitting

• Parallelism over points (easy) and centers

• Improve using triangle inequality test in high dimensions
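One common form of a triangle inequality test can be sketched as follows, in the style of accelerated K-means methods. It is an assumption that the slides refer to a test of this kind, and the helper names `dist`, `pairwise` and `nearest_center` are hypothetical. If the distance between the current best center and another center is at least twice the distance from the point to the current best, the other center cannot be closer, so its distance to the point need not be computed. Each skipped distance saves work proportional to the dimension, which is why the saving matters most in high dimensions.

```python
import math


def dist(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pairwise(centers):
    """Center-to-center distances, computed once per iteration."""
    return [[dist(a, b) for b in centers] for a in centers]


def nearest_center(point, centers, center_dists):
    """Return (index of nearest center, its distance, number of skipped tests)."""
    best = 0
    best_d = dist(point, centers[0])
    skipped = 0
    for j in range(1, len(centers)):
        # d(x, c_j) >= d(c_best, c_j) - d(x, c_best), so if
        # d(c_best, c_j) >= 2 * d(x, c_best) then c_j cannot beat c_best.
        if center_dists[best][j] >= 2.0 * best_d:
            skipped += 1
            continue
        d = dist(point, centers[j])
        if d < best_d:
            best, best_d = j, d
    return best, best_d, skipped


if __name__ == "__main__":
    centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
    cd = pairwise(centers)
    print(nearest_center((1.0, 1.0), centers, cd))
    print(nearest_center((6.0, 0.0), centers, cd))
```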

(91)

Dimension Reduction/MDS

• You can get answers but do you believe them!

• Need to visualize

• HMDS = x<y=1N weight(x,y) ((x, y) – d3D(x, y))2

• Here x and y separately run over all points in the system, (x, y) is distance between x and y in original space while d3D(x, y) is distance

between them after mapping to 3 dimensions. One needs to minimize HMDSfor optimal choices of mapped positions X3D(x).
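The objective described above can be evaluated directly. The sketch below is illustrative only: the names `mds_stress`, `delta` (original-space distances), `mapped` (mapped positions) and `weight` are chosen for this example, and unit weights are assumed when none are given. It computes the weighted sum of squared differences between original and mapped distances over all pairs x < y; an optimizer would then adjust the mapped positions to reduce this value.

```python
import math


def euclid(a, b):
    """Euclidean distance between two mapped positions."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


def mds_stress(delta, mapped, weight=None):
    """Weighted MDS objective summed over pairs x < y.

    delta[x][y]  : distance between points x and y in the original space
    mapped[x]    : coordinates of point x after mapping (e.g. to 3D)
    weight[x][y] : optional pair weights; 1.0 is assumed if omitted
    """
    n = len(mapped)
    total = 0.0
    for x in range(n):
        for y in range(x + 1, n):
            w = 1.0 if weight is None else weight[x][y]
            diff = delta[x][y] - euclid(mapped[x], mapped[y])
            total += w * diff * diff
    return total


if __name__ == "__main__":
    # Three points that are mutually at distance 1 in the original space,
    # placed on a line in 3D: only the (0, 2) pair is misrepresented.
    delta = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
    mapped = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
    print(mds_stress(delta, mapped))  # 1.0
```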

[Figures: Lymphocytes 4D; LC-MS 2D]

(92)

MDS runs as well in Metric and non Metric Cases

• DA Clustering also runs in non metric case with rather different formalism

(93)

~125 Clusters from Fungi sequence set

(94)

Phylogenetic tree using MDS

200 Sequences

(126 centers of clusters found from 446K)

Tree found from mapping sequences to 10D using Neighbor Joining

Whole collection mapped to 3D

2133 Sequences Extended from set of 200

Trees by Neighbor Joining in 3D map

(95)

Data Science Education

Opportunities at universities

see recent New York Times articles

http://datascience101.wordpress.com/2013/04/13/new-york-times-data-science-articles/

(96)

Data Science Education

• Broad range of topics from policy to curation to applications and algorithms, programming models, data systems, statistics, and broad range of CS subjects such as Clouds, Programming, HCI

• Plenty of jobs and broader range of possibilities than computational science but similar cosmic issues

– What type of degree (Certificate, minor, track, “real” degree)

– What implementation (department, interdisciplinary group supporting education and research program)

(97)

Computational Science

• Interdisciplinary field between computer science and applications with primary focus on simulation areas

• Very successful as a research area

– XSEDE and Exascale systems enable

• Several academic programs but these have been less successful than computational science research as

– No consensus as to curricula and jobs (don’t appoint faculty in computational science; do appoint to DoE labs)

– Field relatively small

(98)

MOOC’s

(99)

(100)

Massive Open Online Courses (MOOC)

• MOOC’s are very “hot” these days with Udacity and Coursera as start-ups; perhaps over 100,000 participants

• Relevant to Data Science (where IU is preparing a MOOC) as this is a new field with few courses at most universities

• Typical model is collection of short prerecorded segments (talking head over PowerPoint) of length 3-15 minutes

• These “lesson objects” can be viewed as “songs”

• Google Course Builder (Python, open source) builds customizable MOOC’s as “playlists” of “songs”

• Tells you to capture all material as “lesson objects”

• We are aiming to build a repository of many “songs”; used in many ways – tutorials, classes …

(101)

MOOC’s for Traditional Lectures

• We can take MOOC lessons and view them as a “learning object” that we can share between different teachers

• i.e. as a way of teaching typical sized classes but with less effort as shared material

• Start with what’s in repository;

• pick and choose;

• Add custom material of individual teachers

• The ~15 minute Video over PowerPoint of MOOC’s much easier to re-use than PowerPoint

• Do not need special mentoring support

• Defining how to support computing labs with

(102)

• Twelve ~10 minute lesson objects in this lecture

• IU wants

(103)

(104)

(105)

(106)

Meeker/Wu May 29 2013 Internet

(107)

Customizable MOOC’s I

• We could teach one class to 100,000 students or 2,000 classes to 50 students

• The 2,000 class choice has 2 useful features

– One can use the usual (electronic) mentoring/grading technology

– One can customize each of 2,000 classes for a particular audience given their level and interests

– One can even allow student to customize – that’s what one does in making play lists in iTunes

• Both models can be supported by a repository of lesson objects (3-15 minute video segments) in the cloud

• The teacher can choose from existing lesson objects and add their own to produce a new customized course with new lessons contributed back to repository
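The repository and "playlist" model above can be pictured as a small data structure. The following sketch is purely illustrative and does not describe Google Course Builder or any IU system; the class names `LessonObject` and `Repository`, the function `build_course` and the lesson titles are all hypothetical. A course is an ordered list of lesson objects chosen from the repository, and any new lessons a teacher adds are contributed back.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class LessonObject:
    """One short video segment (a "song"); the fields are illustrative."""
    title: str
    minutes: float
    author: str


@dataclass
class Repository:
    """Shared store of lesson objects, keyed by title."""
    lessons: dict = field(default_factory=dict)

    def add(self, lesson):
        self.lessons[lesson.title] = lesson

    def get(self, title):
        return self.lessons[title]


def build_course(repo, titles, custom=()):
    """Assemble a course "playlist" from existing and new lesson objects.

    New lessons are contributed back to the repository so that other
    teachers can reuse them.
    """
    for lesson in custom:
        repo.add(lesson)
    return [repo.get(t) for t in titles] + list(custom)


if __name__ == "__main__":
    repo = Repository()
    repo.add(LessonObject("Cloud basics", 10, "Teacher A"))
    repo.add(LessonObject("MapReduce", 12, "Teacher A"))
    extra = LessonObject("Clouds in remote sensing", 15, "Teacher B")
    course = build_course(repo, ["Cloud basics", "MapReduce"], custom=[extra])
    print([lesson.title for lesson in course])
    print(sum(lesson.minutes for lesson in course), "minutes")
```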

(108)

Customizable MOOC’s II

• The 3-15 minute Video over PowerPoint of MOOC lesson objects is easy to re-use

• Qiu (IU) and Hayden (ECSU, Elizabeth City State University) will customize a module

– Starting with Qiu’s cloud computing course at IU

– Adding material on use of Cloud Computing in Remote Sensing (area covered by ECSU course)

• This is a model for adding cloud curricula material to a diverse set of universities

• Defining how to support computing labs associated with MOOC’s with FutureGrid or appliances + Virtual Box

– Appliances scale as download to student’s client

– Virtual machines essential

(109)

Conclusions

(110)

Conclusions

• Clouds and HPC are here to stay and one should plan on using both

• Data Intensive programs are not like simulations as they have large “reductions” (“collectives”) and do not have many small messages

– Clouds suitable

• Iterative MapReduce an interesting approach; need to optimize collectives for new applications (Data analytics) and resources (clouds, GPU’s …)

• Need an initiative to build scalable high performance data analytics library on top of interoperable cloud-HPC platform

• Many promising data analytics algorithms such as deterministic annealing not used as implementations not available in R/Matlab etc.

– More sophisticated software and runs longer but can be efficiently parallelized so runtime not a big issue
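To make the "large reductions" pattern concrete, here is a minimal single-process sketch of K-means written in an iterative MapReduce style. It is an illustration under stated assumptions, not the runtime discussed in this talk: the function names `map_partition`, `combine` and `iterative_kmeans` are hypothetical, and the partitions are processed sequentially where a real runtime would run them on separate workers. Each iteration maps over data partitions to produce partial sums, combines them in one reduction, and feeds the new centers into the next iteration.

```python
from functools import reduce


def map_partition(points, centers):
    """Map step: per-partition partial sums and counts for each center."""
    k, dim = len(centers), len(centers[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for p in points:
        j = min(range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])))
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += p[d]
    return sums, counts


def combine(a, b):
    """Reduce step: element-wise sum of two partial results."""
    sums = [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a[0], b[0])]
    counts = [x + y for x, y in zip(a[1], b[1])]
    return sums, counts


def iterative_kmeans(partitions, centers, iterations=10):
    """Repeat map + reduce; the reduction is one large collective per pass."""
    for _ in range(iterations):
        partials = [map_partition(part, centers) for part in partitions]
        sums, counts = reduce(combine, partials)
        centers = [
            [s / c for s in row] if c else list(old)
            for row, c, old in zip(sums, counts, centers)
        ]
    return centers


if __name__ == "__main__":
    partitions = [
        [(0, 0), (10, 10)],
        [(0, 1), (10, 11)],
        [(1, 0), (11, 10)],
    ]
    print(iterative_kmeans(partitions, [(0, 0), (10, 10)]))
```

The only communication per iteration is the combined sums and counts, whose size depends on the number of centers and dimensions rather than on the number of points.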

(111)

Conclusions II

• Software defined computing systems linking NaaS, IaaS, PaaS, SaaS (Network, Infrastructure, Platform, Software) likely to be important

• More employment opportunities in clouds than HPC and Grids and in data than simulation; so cloud and data related activities popular with students

• Community activity to discuss data science education

– Agree on curricula; is such a degree attractive?

• Role of MOOC’s as either

– Disseminating new curricula

– Managing course fragments that can be assembled into custom courses for particular interdisciplinary students