MapReduce. Simplified Data Processing on Large Clusters

(1)

MapReduce

Simpliﬁed Data Processing on Large Clusters

(2)

“As a reaction to this

complexity, we designed a new abstraction that allows us to express the simple

computations we were trying to perform but hides the

messy details of parallelization,

fault-tolerance, data distribution and load balancing in a library.”

“The issues of how to parallelize the computation, distribute the data, and handle failures conspire to obscure the original simple computation with large amounts of complex code.”

[2]

(3)

What is MapReduce?

● Jeﬀrey Dean and Sanjay Ghemawat of Google, 2004

● Born out of Google’s need to perform distributed computations on large datasets (petabytes)

● Programming model for such computations

● Partitioning data, communication, fault-tolerance, load balancing are abstracted in a library

● Fundamentally comprised of Map and Reduce functions

● Runs on clusters of “commodity” machines, not supercomputers

● Reduces cognitive overhead of parallel programming

(4)

MapReduce

● Takes set of input key/value pairs, returns a set of output key/value pairs

● The user implements the Map and Reduce functions

● Map

○ Takes input key/value pairs and does some computation (implemented by the user), producing intermediate key/value pairs

○ MapReduce library groups intermediate values associated with the same key, and passes key and values to the Reduce function

● Reduce

○ Takes intermediate key and associated set of values and merges these values (according to the user), typically producing zero or one output value

● Additional mapreduce speciﬁcation object speciﬁed by user, containing input

and output ﬁles, optional tuning parameters

(5)

Example: Word Count

map(String key, String value):

// key: document name

// value: document contents for each word w in value:

EmitIntermediate(w, "1");

reduce(String key, Iterator values):

// key: a word

// values: a list of counts int result = 0;

for each v in values:

result += ParseInt(v);

Emit(AsString(result));

(6)

[1]

(7)

Implementation

● Google use case:

○ Dual processor x86 Linux machines with 2-4 GB memory

○ 100 Mbps to 1 Gbps switched ethernet

○ Cluster consists of 100s to 1,000s of commodity machines (machine failures common)

○ Storage is inexpensive hard drives

○ In-house distributed ﬁle system uses replication to ensure availability and reliability

○ Users submit jobs, consisting of tasks, to scheduler

○ Scheduler maps tasks to available nodes in cluster

(8)

Step 1

MapReduce partitions the input ﬁle(s) into M splits, typically 16 to 64 MB each.

MapReduce forks the program and copies it to each node in the cluster.

[2]

(9)

Step 2

One fork is the master node, the rest are workers. The master picks idle workers and assigns each one a map task or a

reduce task.

There are M map tasks and R reduce tasks. M is determined automatically, while R is

speciﬁed by the user (number of

output ﬁles).

(10)

Step 3

A worker who is assigned a map task reads the contents of its corresponding input split(s) and parses key/value pairs.

Key/value pairs are passed to the user implemented Map

function. Intermediate key/value

pairs are stored in a buﬀer.

(11)

Step 4

Buﬀered intermediate key/value pairs are periodically written to local memory and partitioned into R regions.

Locations of written

intermediate key/value pairs are

sent to the master node, who

forwards the locations to reduce

workers.

(12)

Step 5

When a reduce worker is

notiﬁed by the master, it reads the intermediate key/value pairs from the local memory of map workers.

When all intermediate data has

been read, the reduce worker

sorts the key/value pairs,

grouping by key.

(13)

Step 6

The reduce worker iterates over the sorted intermediate

key/value pairs and for each unique key, it passes the key and set of values to the user implemented Reduce function.

The output of the Reduce

function is appended to the ﬁnal

output ﬁle for its corresponding

reduce partition.

(14)

Step 7

When all map tasks and reduce tasks have completed, the

master and workers join with the

user program. The output is

available in the R output ﬁles.

(15)

Master Data Structures

● The master node stores the state (idle, in-progress, completed) of each map task and reduce task, along with the id of the worker node (non-idle)

● The master node is the “conduit” through which the location of intermediate data propagates from map tasks to reduce tasks

○ For each completed map task, the master stores the location and size of the R intermediate regions produced by the map task

○ Updates to size and location are received when a map task completes

○ Size and location information is forwarded incrementally to in-progress reduce tasks

(16)

Fault Tolerance

● With 100s to 1,000s of machines involved, machine failures are common--we want to tolerate them gracefully, and abstract this from the programmer

● Worker failure

○ Master pings workers regularly

○ If no response received

■ Worker is marked failed

■ In-progress map tasks and reduce tasks on failed workers are reset to idle state and rescheduled

○ Complete map tasks are re-executed on failure because output is stored on failed local disk

○ Complete reduce task are not re-executed on failure because output is stored on global fs

○ Reduce workers are notiﬁed of re-execution

(17)

Fault Tolerance

● Master failure

○ Master writes periodic checkpoints of master data structures to global fs

○ If master fails, new master assigned and started from checkpointed state

(18)

Task Granularity

● What values should we choose for M and R?

● M and R >> number of workers

○ Improves load balancing

○ Quicker recovery after failure because completed map tasks can be distributed

● In practice (at Google)

○ Choose M such that map tasks input is <= fs block size

○ Choose R to be a small multiple of number of workers

(19)

Backup Tasks

● “Straggler” machine takes longer

○ Hardware issues e.g. faulty disk

○ Resource contention e.g. other tasks scheduled on machine, sharing CPU, memory, network

● Backup Tasks

○ When a MapReduce program nears completion, the master schedules backup executions of remaining in-progress tasks

○ Task marked complete when either task completes

(20)

[2]

(21)

Apache Hadoop

● Open-source framework for distributed computing using MapReduce

[3]

(22)

WordCount.java

[4]

(23)

Sources

[1] MapReduce Tutorial | Mapreduce Example in Apache Hadoop

[2] MapReduce: Simpliﬁed Data Processing on Large Clusters

[3] An Analysis of the Hadoop HDFS Distributed File System Architecture | by Amit Kriplani

[4] MapReduce Tutorial - Apache Hadoop