Evaluating MapReduce and Hadoop for Science

(1)

Evaluating MapReduce and Hadoop

for Science

Lavanya Ramakrishnan [email protected]

Lawrence Berkeley National Lab

(2)

Computation and Data are critical parts of the scientific process Experiment Theory Computation Data (Fourth Paradigm)

Three Pillars of Science

Advance Light Source Data Rates

2009 65 TB/yr 2011 312 TB/yr 2013 1900 TB/yr

(3)

Internet BigData led to the MapReduce and Hadoop Evolution

3

Map

(4)

A central component of the MapReduce model is its file system

HDFS GPFS and Lustre

Typical Replication 3 1

Storage Location Compute Node Servers Access Model Custom (except with

Fuse)

POSIX

Stripe Size 64 MB 1 MB Concurrent Writes No Yes

Scales with # of Compute Nodes # of Servers Scale of Largest

Systems

O(10k) Nodes O(100) Servers

(5)

Evaluating the Hype from Reality 5 MapReduce Clusters HPC NoSQL Cloud Hadoop on HPC MongoDB +Hadoop Hadoop on VM

(6)

Streaming adds a performance overhead

6

Evaluating Hadoop for Science, IEEE Cloud 2012

(7)

High performance parallel file systems can be used with Hadoop for small to medium

concurrency 0 2 4 6 8 10 12 0 500 1000 1500 2000 2500 3000 T im e (m in u te s) Number of maps Teragen (1TB) _HDFS GPFS Linear (HDFS) Expon. (HDFS) Linear (GPFS) Expon. (GPFS) 7 Better

(8)

We evaluate three data-intensive operations with different testbed configurations

Filter _Merge Reorder

(9)

Data operations impacts the performance

differences across file systems: Wikipedia (2TB)

0 5 10 15 Pr o ce ss in g ti m e (1 00 0s ) WriteTime ProcessingTime ReadTime Better

(10)

0.0 0.5 1.0 1.5 2.0 2.5 3.0 0 200 400 600 800 Size (TB) Processing time (s) HDFS GPFS

Read-intensive applications benefit from HDFS

(11)

Scientific Ensembles have similarities with MapReduce structure

11

A large number of loosely coupled tasks, each with their own internal parallelism.

(12)

All patterns could be implemented in Hadoop but with varying levels of difficulty

low _high

(13)

There are challenges when using Hadoop for scientific applications 13 High throughput workflows Scaling up from desktops

File system: non POSIX

Language: Java

Input and output formats:

mostly line-oriented text

Streaming mode: restrictive i/p and o/p model

Data locality: what happens when multiple inputs?

File permissions: jobs run as user hadoop

(14)

Tigres: Design templates for common patterns of parallelism "LightSrc" Domain templates Base Tigres templates Scale up Application "LightSrc-1" Application "LightSrc-2" Create and Debug Share Create and Debug

Implement templates as a library in an

existing language

(15)

Templates

•  Sequence ( name, task_array, input_array )

–  e.g., output [ ] = Sequence (“my seq”, task_array_12,

input_array_12)

•  Parallel ( name, task_array, input_array )

–  e.g., output[ ] = Parallel(“abc”, task_array_12,

input_array_12)

•  Split ( split_task, split_input_values, task_array, task_array_in )

–  e.g., Split( task_x1, input_value_1, spl_t_arr, spl_i_arr)

•  Merge ( task_array, input_array, merge_task, merge_input_values)

(16)

Evaluating the Hype from Reality 16 MapReduce Clusters HPC NoSQL Cloud Hadoop on HPC MongoDB +Hadoop Hadoop on VM

(17)

Reorder and Merge: Writes to Mongo

can be expensive

MongoDB HDFS 4.6 Million Input Records

Processing time (s) 0 200 400 600 800

*Sharded MongoDB vs HDFS on a 8 node Hadoop cluster (R=W)

Read Time Processing Time Write Time

MongoDB HDFS

4.6 Million Input Records

Processing time (s)

0

200

400

600

800 *Sharded MongoDB vs HDFS on a 8 node Hadoop cluster R<W

Read Time Processing Time Write Time Reorder Merge Better

(18)

4.6 9.3 18.6 37.2 Number of Input Records (Million)

Processing Time(min)

50

100

150

Hadoop−MongoDB MapReduce (2 workers)

MongoDB MapReduce

Filter: Hadoop MapReduce provides a way

to scale up analysis on MongoDB

(19)

Data analysis with Hadoop and MongoDB: Offload the MapReduce writes to HDFS

Better Sharding helps Reading from MongoDB Writing to MongoDB Move data to HDFS

(20)

Evaluating the Hype from Reality 20 MapReduce Clusters HPC NoSQL Cloud Hadoop on HPC MongoDB +Hadoop Hadoop on VM

(21)

Teragen and Terasort take longer on virtual machines 21 0 100 200 300 400 500 600 100 GB 200 GB 300 GB 400 GB 500 GB

Execution time (= sec)

Teragen performance Physical Virtual 0 500 1000 1500 2000 2500 3000 100 GB 200 GB 300 GB 400 GB 500 GB

Terasort performance

Physical Virtual

(22)

Reorder on virtual machines is faster (still investigating) 22 0 500 1000 1500 2000 34 GB 74 GB 111 GB

Wikibench reorder performance

Physical Virtual

(23)

Physical and virtual have different power profiles but correlate with maps and reduces

23 5 6 7 8 0 200 400 600 800 0 10 20 30 40 50 60 70 80 90 100 Power (= kW) Left percentage (= %) Time (= sec)

Wikibench reorder power consumption - Physical

37 GB Map Reduce 5 6 7 8 0 200 400 600 800 0 10 20 30 40 50 60 70 80 90 100 Power (= kW) Left percentage (= %) Time (= sec)

Wikibench reorder power consumption - Virtual

37 GB Map Reduce

(24)

Configuring Hadoop on Virtual Machines can benefit from different configurations

0 1000 2000 3000 4000 5000 6000 7000 8000 9000 30D 30C 30D 80C 30D 130C 80D 30C 80D 80C 130D 30C T im e (s ec o n d s) Different Configurations Filter Reorder Merge Better

(25)

Reorder (virtual) needs more compute nodes than data nodes

0.02 0.04 0.06 0.08 0.1 0.12 0.14 0.16 0.18 0.2 30-30 80-30 130-30 30-80 80-80 30-130 collocation Performance/power Wikibench on VMs, reorder 37GB 74GB 111GB 25 Better

(26)

Filter (virtual) can benefit from more data nodes 0 0.2 0.4 0.6 0.8 1 1.2 1.4 30-30 80-30 130-30 30-80 80-80 30-130 collocation Performance/power Wikibench on VMs, filter 37GB 74GB 111GB 26 Better

(27)

FRIEDA: Storage and Data Management on VMs 27    





           http://frieda.lbl.gov

(28)

Summary

• MapReduce and Hadoop ecosystem are

powerful paradigms for science

– But may not be out of box solutions

• It is possible to run Hadoop in

non-traditional configurations to enable use

in existing environments

(29)

Questions?

• Email:

[email protected]

• Collaborators

– Shane Canon, Elif Dede, Zacharia Fadika,

Madhu Govindaraju, Daniel Gunter, Eugen Feller, Christine Morin