SIAM PP 2014
MapReduce in Scientific Computing
February 19, 2014
David F. Gleich, Computer Science, Purdue University
Paul G. Constantine, Applied Math & Stats, Colorado School of Mines
Hans De Sterck, Applied Mathematics, University of Waterloo
10:35 Gleich & De Sterck: Two introductions to MapReduce
11:00 Constantine & Benson: MapReduce-based model reduction
11:25 Papalexakis: Scaling up tensor factorization
11:50 Plantenga: Generating large graphs
2:40 Ching: Apache Giraph for big graphs
3:05 Zaharia: Data flow computing
3:30 Weimer: Relayering the big-data stack
3:55 Plimpton: MapReduce & MPI
minisymposium: Parallel Algorithms for MapReduce-Based Scientific Computing
Hans De Sterck
Department of Applied Mathematics University of Waterloo, Canada
SIAM PP14, Portland, February 2014
origins of MapReduce
• Google engineers invented MapReduce
• Google went from nothing to $400B market cap in 15 years (“organize the world's information”)
• Google’s initial success was built on two pillars:
– PageRank algorithm (random walk on web graph; spam-resistant compared to counting inlinks; better search results!)
– MapReduce framework for scalable (parallel) processing of big data (file-based) on commodity hardware (fault-tolerant, (private) cloud pioneers)
– new business/legal models (advertising, ‘creative’ new
Google’s big data processing framework
1. Google File System (published 2003)
– fault-tolerant: store every file ‘chunk’ 3 times
– scalable
2. MapReduce (published 2004)
– fault-tolerant: restrict expressivity (e.g., no easy point-to-point messages), asynchronous within map and reduce: fault-tolerant through restart
– scalable, and efficient for big data: put computing where data resides
3. BigTable (published 2006)
– scalable data store
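The "fault-tolerant through restart" point can be illustrated with a small sketch: because a map or reduce task is stateless and reads its input from a stored chunk, a failed task can simply be run again on the same chunk. The names `run_with_restart` and `make_flaky` are hypothetical and exist only for this illustration; a real framework would presumably reschedule the task on another machine rather than loop inside one process.

```python
def run_with_restart(task, chunk, max_attempts=5):
    """Re-run a stateless task on the same input chunk until it succeeds.

    Assumption for this sketch: a worker failure shows up as RuntimeError.
    """
    for _ in range(max_attempts):
        try:
            return task(chunk)
        except RuntimeError:
            # No task state needs to be recovered: the chunk is still
            # stored, so the task is simply started again from scratch.
            continue
    raise RuntimeError(f"task failed after {max_attempts} attempts")


def make_flaky(task, failures):
    """Hypothetical helper: wrap `task` so its first `failures` calls raise."""
    state = {"remaining": failures}

    def flaky(chunk):
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise RuntimeError("simulated worker failure")
        return task(chunk)

    return flaky


if __name__ == "__main__":
    # The task fails twice, then succeeds on the third attempt.
    print(run_with_restart(make_flaky(sum, 2), [1, 2, 3]))
```

The sketch only works because the task has no side effects and no messages in flight, which is the restriction on expressivity the list above refers to.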
Hadoop: open source version of Google’s framework
1. Google File System → Hadoop Distributed File System (HDFS)
2. MapReduce → Hadoop MapReduce
3. BigTable → HBase
• used (and co-developed) by Yahoo, Facebook, Twitter, ... and many, many other companies
MapReduce example (wordcount)
• fault-tolerant, scalable, compute where data resides
[Figure: wordcount data flow for a very large input file, adapted from blog.trifork.com]
• file/disk-based: slow communication, and slow to iterate (stateless, read stored data from disk, not from memory) (slow but scalable)
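The wordcount example can be sketched in a few lines of Python. This is a single-process illustration of the map, shuffle, and reduce phases, not Hadoop code: the function names are chosen for this sketch, and everything is held in memory, whereas the framework described above reads and writes stored files between phases.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key (the word)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence over an iterable of text lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["deer bear river", "car car river", "deer car bear"]
    print(word_count(sample))
```

Each call to `map_phase` depends only on its own line, and each call to `reduce_phase` only on its own key, which is what lets a framework spread the calls over many machines.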
large-scale distributed/parallel computing
• traditional large-scale distributed/parallel computing: – science, engineering, ...
– linear algebra, PDEs, optimization, molecular dynamics, Markov chain Monte Carlo, ...
– mostly in MPI-type (messaging) environments
• last decade: large-scale parallel/distributed computing has become essential in many new areas:
– web ranking, graph processing, social networks, data mining, machine learning, cyber security/spying, business intelligence, big data, ...
– a significant part of these applications use MapReduce-type paradigms
large-scale distributed/parallel computing is a much bigger space now
• aspects of the MPI and MapReduce paradigms may converge... (opportunities for SIAM PP community!)
• e.g., can MapReduce-type paradigms act as inspiration for exascale parallel computing? (fault-tolerance, scalability, compute where data resides, ..., but slow...)
• it makes sense to consider ‘Scientific Computing in the broad sense’ (linear algebra, optimization, data mining,
MapReduce for scientific computing
• basic algorithms (e.g., linear algebra) not much explored yet (libraries: Pegasus, Mahout, ...)
• MapReduce framework inspires (new?) ‘recursive’ algorithms for linear algebra and combinatorial scientific computing (e.g., ‘Matrix Inversion’ (recursive block LU) and ‘Scalable Maximum Clique Computation’ Using MapReduce, Jingen Xiang, Waterloo)
• we have 3 talks on MapReduce for scientific computing in the rest of this morning session
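To make the ‘recursive’ idea concrete, here is a minimal serial sketch of matrix inversion by recursive 2x2 block elimination (block LU via the Schur complement). It is not the MapReduce implementation cited above: the function names are hypothetical, no pivoting is done (so every leading block and Schur complement is assumed invertible), and the sketch does not show how the block products would be distributed.

```python
def matmul(X, Y):
    """Dense matrix product of two lists-of-rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]


def add(X, Y, sign=1.0):
    """Entrywise X + sign * Y."""
    return [[x + sign * y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def negate(X):
    return [[-x for x in row] for row in X]


def block_inverse(A):
    """Invert A recursively by splitting it into 2x2 blocks (no pivoting)."""
    n = len(A)
    if n == 1:
        return [[1.0 / A[0][0]]]
    k = n // 2
    A11 = [row[:k] for row in A[:k]]
    A12 = [row[k:] for row in A[:k]]
    A21 = [row[:k] for row in A[k:]]
    A22 = [row[k:] for row in A[k:]]

    A11_inv = block_inverse(A11)                 # recursive call 1
    L21 = matmul(A21, A11_inv)                   # A21 * A11^-1
    U12 = matmul(A11_inv, A12)                   # A11^-1 * A12
    S = add(A22, matmul(L21, A12), sign=-1.0)    # Schur complement
    S_inv = block_inverse(S)                     # recursive call 2

    B11 = add(A11_inv, matmul(matmul(U12, S_inv), L21))
    B12 = negate(matmul(U12, S_inv))
    B21 = negate(matmul(S_inv, L21))
    top = [r1 + r2 for r1, r2 in zip(B11, B12)]
    bottom = [r1 + r2 for r1, r2 in zip(B21, S_inv)]
    return top + bottom


if __name__ == "__main__":
    A = [[4.0, 1.0, 0.0], [1.0, 5.0, 2.0], [0.0, 2.0, 6.0]]
    print(matmul(A, block_inverse(A)))  # should be close to the identity
```

Apart from the two recursive calls, the work consists of matrix products and additions on blocks, which is the kind of bulk operation that fits a MapReduce-style job.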
this session
Scientific Computing Applications with MapReduce
• Matrix Factorizations in MapReduce with Applications to Model Reduction
Paul Constantine, Colorado School of Mines; Austin Benson, Stanford
• Scaling Up Tensor Decompositions with MapReduce
Evangelos Papalexakis, Carnegie Mellon University
• Generating Large Graphs with Desired Community Structure
afternoon session
scalable data analytics environments beyond MapReduce: can we extend and improve MapReduce-type approaches? (make it faster? HPC?)
• Apache Giraph: Large-Scale Graph Processing Infrastructure on Hadoop
Avery Ching, Facebook
graph algorithms (Giraph, Pregel, in memory)
• Large-Scale Numerical Computation Using a Data Flow Engine
Matei Zaharia, MIT
Spark: (fault-tolerant, scalable) data flow engine in memory
afternoon session
• REEF - Beyond MapReduce by Re-Layering the Big Data Stack
Markus Weimer, Microsoft
YARN/REEF: more versatile scheduling, maintaining state
• Traditional and Streaming MapReduce via MPI for Graph Analytics
Steve Plimpton, Karen D. Devine, Timothy Shead, Sandia National Labs
Questions?
Two themes
AM Session: What is possible in the MapReduce model & Hadoop?
PM Session: How can we build on or improve the MapReduce model?
#SIAMPP14