SIAM PP 2014! MapReduce in Scientific Computing! February 19, 2014

(1)

SIAM PP 2014 !

MapReduce in Scientific Computing!

February 19, 2014

David F. Gleich!

Computer Science! Purdue University

Paul G. Constantine!

Applied Math & Stats! Colorado School of Mines

Hans De Sterck!

Applied Mathematics! University of Waterloo

Gleich & De Sterck ! Two introductions to MapReduce Constantine & Benson! MapReduce-based model reduction

Papalexakis! Scaling up tensor factorization

Plantenga! Generating large graphs

Ching! Apache Giraph for big graphs

Zaharia! Data flow computing

Weimer! Relayering the big-data stack

Plimpton! MapReduce & MPI

10:35 11:00 11:25 11:50 2:40 3:05 3:30 3:55

(2)

minisymposium

: Parallel Algorithms for

MapReduce-Based Scientific Computing

Hans De Sterck

Department of Applied Mathematics University of Waterloo, Canada

SIAM PP14, Portland, February 2014

(3)

origins of MapReduce

•  Google engineers invented MapReduce

•  Google went from nothing to $400B market cap in 15 years (“organize the world's information”)

•  Google’s initial success was built on two pillars:

–  PageRank algorithm (random walk on web graph; spam-resistant compared to counting inlinks; better search results!)

–  MapReduce framework for scalable (parallel)

processing of big data (file-based) on commodity hardware (fault-tolerant, (private) cloud pioneers) –  new business/legal models (advertising, ‘creative’ new

(4)

Google’s big data processing framework

1.  Google File System (published 2003)

–  fault-tolerant: store every file ‘chunk’ 3 times –  scalable

2.  MapReduce (published 2004)

–  fault-tolerant: restrict expressivity (e.g., no easy point-to-point messages), asynchronous within map and reduce: fault-tolerant through restart

–  scalable, and efficient for big data: put computing were data resides

3.  BigTable (published 2006) –  scalable data store

(5)

Hadoop: open source version of

Google’s framework

1.  Google File System Hadoop Distributed File System (HDFS)

2.  MapReduce

3.  BigTable HBase

•  used (and co-developed) by Yahoo, Facebook, Twitter, ... and many, many other companies

(6)

MapReduce example (wordcount)

•  fault-tolerant, scalable, compute where data resides

(adapted from blog.trifork.com) (very large file)

•  file/disk-based: slow communication, and slow to iterate (stateless, read stored data from disk, not from memory) (slow but scalable)

(7)

large-scale distributed/parallel computing

•  traditional large-scale distributed/parallel computing: –  science, engineering, ...

–  linear algebra, PDEs, optimization, molecular dynamics, Markov chain Monte Carlo, ...

–  mostly in MPI-type (messaging) environments

•  last decade: large-scale parallel/distributed computing has become essential in many new areas:

–  web ranking, graph processing, social networks, data

mining, machine learning, cyber security/spying, business intelligence, big data, ...

–  a significant part of these applications use MapReduce-type paradigms

(8)

large-scale distributed/parallel

computing is a much bigger space now

•  aspects of the MPI and MapReduce paradigms may converge... (opportunities for SIAM PP community!)

•  e.g., can MapReduce-type paradigms act as inspiration

for exascale parallel computing? (fault-tolerance,

scalability, compute where data resides, ..., but slow...) •  it makes sense to consider ‘Scientific Computing in the

broad sense’ (linear algebra, optimization, data mining,

(9)

MapReduce for scientific computing

•  basic algorithms (e.g., linear algebra) not much explored yet (libraries: Pegasus, Mahout, ...)

•  MapReduce framework inspires (new?) ‘recursive’

algorithms for linear algebra and combinatorial scientific computing (e.g., ‘Matrix Inversion’ (recursive block LU) and

‘Scalable Maximum Clique Computation’ Using MapReduce, Jingen Xiang, Waterloo)

•  we have 3 talks on MapReduce for scientific computing in the rest of this morning session:

(10)

this session

Scientific Computing Applications with MapReduce •  Matrix Factorizations in MapReduce with

Applications to Model Reduction

Paul Constantine, Colorado School of Mines; Austin Benson, Stanford

•  Scaling Up Tensor Decompositions with MapReduce

Evangelos Papalexakis, Carnegie Mellon University

•  Generating Large Graphs with Desired Community

Structure

(11)

afternoon session

scalable data analytics environments beyond MapReduce: can we extend and improve MapReduce-type

approaches? (make it faster? HPC?)

•  Apache Giraph: Large-Scale Graph Processing

Infrastructure on Hadoop

Avery Ching, Facebook

graph algorithms (Giraph, Pregel, in memory)

•  Large-Scale Numerical Computation Using a Data

Flow Engine

Matei Zaharia, MIT

Spark: (fault-tolerant, scalable) data flow engine in memory

(12)

afternoon session

•  REEF - Beyond MapReduce by Re-Layering the Big

Data Stack

Markus Weimer, Microsoft

YARN/REEF: more versatile scheduling, maintaining state

•  Traditional and Streaming MapReduce via MPI for

Graph Analytics

Steve Plimpton, Karen D. Devine, Timothy Shead, Sandia National Labs

(13)

SIAM PP 2014 !

MapReduce in Scientific Computing!

February 19, 2014

David F. Gleich!

Computer Science! Purdue University

Paul G. Constantine!

Applied Math & Stats! Colorado School of Mines

Hans De Sterck!

Applied Mathematics! University of Waterloo

Gleich & De Sterck ! Two introductions to MapReduce Constantine & Benson! MapReduce-based model reduction

Papalexakis! Scaling up tensor factorization

Plantenga! Generating large graphs

Ching! Apache Giraph for big graphs

Zaharia! Data flow computing

Weimer! Relayering the big-data stack

Plimpton! MapReduce & MPI

10:35 11:00 11:25 11:50 2:40 3:05 3:30 3:55

Questions?

(14)

Two themes

AM Session!

What is possible in the MapReduce model & Hadoop?

PM Session!

How can we build-on or improve the

MapReduce model?

#SIAMPP14

David Gleich · Purdue 27

Gleich & De Sterck !

Two introductions to MapReduce Constantine & Benson!

MapReduce-based model reduction Papalexakis!

Scaling up tensor factorization Plantenga!

Generating large graphs

Ching!

Apache Giraph for big graphs Zaharia!

Spark & data flow computing Weimer!

Relayering the big-data stack Plimpton!