
SIAM PP 2014: MapReduce in Scientific Computing, February 19, 2014


(1)

SIAM PP 2014

MapReduce in Scientific Computing

February 19, 2014

David F. Gleich, Computer Science, Purdue University

Paul G. Constantine, Applied Math & Stats, Colorado School of Mines

Hans De Sterck, Applied Mathematics, University of Waterloo

10:35  Gleich & De Sterck: Two introductions to MapReduce

11:00  Constantine & Benson: MapReduce-based model reduction

11:25  Papalexakis: Scaling up tensor factorization

11:50  Plantenga: Generating large graphs

2:40  Ching: Apache Giraph for big graphs

3:05  Zaharia: Data flow computing

3:30  Weimer: Relayering the big-data stack

3:55  Plimpton: MapReduce & MPI

(2)

minisymposium: Parallel Algorithms for MapReduce-Based Scientific Computing

Hans De Sterck

Department of Applied Mathematics, University of Waterloo, Canada

SIAM PP14, Portland, February 2014

(3)

origins of MapReduce

•  Google engineers invented MapReduce

•  Google went from nothing to $400B market cap in 15 years (“organize the world's information”)

•  Google’s initial success was built on two pillars:

–  PageRank algorithm (random walk on web graph; spam-resistant compared to counting inlinks; better search results!)

–  MapReduce framework for scalable (parallel) processing of big data (file-based) on commodity hardware (fault-tolerant, (private) cloud pioneers)

–  new business/legal models (advertising, ‘creative’ new

(4)

Google’s big data processing framework

1.  Google File System (published 2003)

–  fault-tolerant: store every file ‘chunk’ 3 times

–  scalable

2.  MapReduce (published 2004)

–  fault-tolerant: restrict expressivity (e.g., no easy point-to-point messages); asynchronous within map and reduce; fault tolerance through restart

–  scalable, and efficient for big data: put computing where data resides

3.  BigTable (published 2006)

–  scalable data store
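The "fault-tolerant through restart" idea can be illustrated with a minimal sketch. It assumes each task is a stateless function of its own input chunk, so a failed task can simply be run again. The names `run_with_restart` and `make_flaky`, and the simulated failure rate, are hypothetical and are not part of Google's or Hadoop's APIs.

```python
import random


def run_with_restart(task, chunk, max_attempts=50):
    """Re-run a stateless task on its input chunk until it succeeds."""
    for _ in range(max_attempts):
        try:
            return task(chunk)
        except RuntimeError:
            # Nothing to roll back: the task shares no state with other
            # tasks, so rerunning it on the same chunk is safe.
            continue
    raise RuntimeError("task failed on every attempt")


def make_flaky(task, failure_rate, rng):
    """Wrap a task so that it randomly 'crashes' (simulated lost worker)."""
    def flaky(chunk):
        if rng.random() < failure_rate:
            raise RuntimeError("simulated worker failure")
        return task(chunk)
    return flaky


rng = random.Random(0)  # fixed seed so the simulation is repeatable
chunks = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
flaky_sum = make_flaky(sum, failure_rate=0.3, rng=rng)

# Each chunk is processed independently; failed attempts are just retried.
results = [run_with_restart(flaky_sum, chunk) for chunk in chunks]
total = sum(results)
```

Because no task depends on messages from another task, the partial sums are the same however many simulated failures occur along the way.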

(5)

Hadoop: open source version of Google’s framework

1.  Google File System → Hadoop Distributed File System (HDFS)

2.  MapReduce

3.  BigTable → HBase

•  used (and co-developed) by Yahoo, Facebook, Twitter, ... and many, many other companies

(6)

MapReduce example (wordcount)

•  fault-tolerant, scalable, compute where data resides

[figure: wordcount data flow over a very large file; adapted from blog.trifork.com]

•  file/disk-based: slow communication, and slow to iterate (stateless, read stored data from disk, not from memory) (slow but scalable)
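The wordcount example can be sketched in a few lines of single-process Python. This only illustrates the map, shuffle, and reduce steps; it is not Hadoop code, and the function names and sample input are assumptions made for the sketch.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Reduce: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(w, c) for w, c in shuffle(pairs).items())


counts = word_count(["deer bear river", "car car river", "deer car bear"])
```

In a real deployment the map calls would run on the machines that hold the file chunks, and the shuffle would write intermediate pairs to disk, which is where the slow communication and slow iteration noted above come from.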

(7)

large-scale distributed/parallel computing

•  traditional large-scale distributed/parallel computing:

–  science, engineering, ...

–  linear algebra, PDEs, optimization, molecular dynamics, Markov chain Monte Carlo, ...

–  mostly in MPI-type (messaging) environments

•  last decade: large-scale parallel/distributed computing has become essential in many new areas:

–  web ranking, graph processing, social networks, data mining, machine learning, cyber security/spying, business intelligence, big data, ...

–  a significant part of these applications use MapReduce-type paradigms

(8)

large-scale distributed/parallel computing is a much bigger space now

•  aspects of the MPI and MapReduce paradigms may converge... (opportunities for SIAM PP community!)

•  e.g., can MapReduce-type paradigms act as inspiration for exascale parallel computing? (fault-tolerance, scalability, compute where data resides, ..., but slow...)

•  it makes sense to consider ‘Scientific Computing in the broad sense’ (linear algebra, optimization, data mining,

(9)

MapReduce for scientific computing

•  basic algorithms (e.g., linear algebra) not much explored yet (libraries: Pegasus, Mahout, ...)

•  MapReduce framework inspires (new?) ‘recursive’ algorithms for linear algebra and combinatorial scientific computing (e.g., ‘Matrix Inversion’ (recursive block LU) and ‘Scalable Maximum Clique Computation’ Using MapReduce, Jingen Xiang, Waterloo)

•  we have 3 talks on MapReduce for scientific computing in the rest of this morning session:

(10)

this session

Scientific Computing Applications with MapReduce

•  Matrix Factorizations in MapReduce with Applications to Model Reduction

Paul Constantine, Colorado School of Mines; Austin Benson, Stanford

•  Scaling Up Tensor Decompositions with MapReduce

Evangelos Papalexakis, Carnegie Mellon University

•  Generating Large Graphs with Desired Community Structure

(11)

afternoon session

scalable data analytics environments beyond MapReduce: can we extend and improve MapReduce-type approaches? (make it faster? HPC?)

•  Apache Giraph: Large-Scale Graph Processing Infrastructure on Hadoop

Avery Ching, Facebook

graph algorithms (Giraph, Pregel, in memory)

•  Large-Scale Numerical Computation Using a Data Flow Engine

Matei Zaharia, MIT

Spark: (fault-tolerant, scalable) data flow engine in memory

(12)

afternoon session

•  REEF - Beyond MapReduce by Re-Layering the Big Data Stack

Markus Weimer, Microsoft

YARN/REEF: more versatile scheduling, maintaining state

•  Traditional and Streaming MapReduce via MPI for Graph Analytics

Steve Plimpton, Karen D. Devine, Timothy Shead, Sandia National Labs

(13)


Questions?

(14)

Two themes

AM Session: What is possible in the MapReduce model & Hadoop?

PM Session: How can we build on or improve the MapReduce model?

#SIAMPP14

Gleich & De Sterck: Two introductions to MapReduce

Constantine & Benson: MapReduce-based model reduction

Papalexakis: Scaling up tensor factorization

Plantenga: Generating large graphs

Ching: Apache Giraph for big graphs

Zaharia: Spark & data flow computing

Weimer: Relayering the big-data stack

Plimpton
