MapReduce
Simplified Data Processing on Large Clusters
“As a reaction to this complexity, we designed a new abstraction that allows us to express the simple computations we were trying to perform but hides the messy details of parallelization, fault-tolerance, data distribution and load balancing in a library.”
“The issues of how to parallelize the computation, distribute the data, and handle failures conspire to obscure the original simple computation with large amounts of complex code.”
[2]
What is MapReduce?
● Jeffrey Dean and Sanjay Ghemawat of Google, 2004
● Born out of Google’s need to perform distributed computations on large datasets (petabytes)
● Programming model for such computations
● Partitioning data, communication, fault-tolerance, load balancing are abstracted in a library
● Built around two user-supplied functions, Map and Reduce
● Runs on clusters of “commodity” machines, not supercomputers
● Reduces cognitive overhead of parallel programming
MapReduce
● Takes a set of input key/value pairs and returns a set of output key/value pairs
● The user implements the Map and Reduce functions
● Map
○ Takes input key/value pairs and does some computation (implemented by the user), producing intermediate key/value pairs
○ MapReduce library groups intermediate values associated with the same key, and passes key and values to the Reduce function
● Reduce
○ Takes an intermediate key and its associated set of values and merges these values (as defined by the user), typically producing zero or one output value
● The user also supplies a MapReduce specification object containing the input and output files and optional tuning parameters
Example: Word Count
map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");

reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));
[1]
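The pseudocode above can be tried out on a single machine. The following is a minimal sketch in Python, not the library's implementation: `run_mapreduce` is a hypothetical in-memory driver that stands in for the grouping the MapReduce library performs between Map and Reduce, and the two sample documents are made up for illustration.

```python
from collections import defaultdict


def map_fn(key, value):
    # key: document name, value: document contents
    for word in value.split():
        yield word, "1"


def reduce_fn(key, values):
    # key: a word, values: the counts emitted for that word
    yield str(sum(int(v) for v in values))


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Hypothetical single-process driver: map, group by key, reduce."""
    intermediate = defaultdict(list)
    for key, value in inputs:
        for ikey, ivalue in map_fn(key, value):
            intermediate[ikey].append(ivalue)
    # Reduce is called once per unique intermediate key, in sorted key order.
    return {k: list(reduce_fn(k, intermediate[k])) for k in sorted(intermediate)}


docs = [("a.txt", "the quick fox"), ("b.txt", "the lazy dog the end")]
counts = run_mapreduce(docs, map_fn, reduce_fn)
print(counts["the"])  # the word appears three times across both documents
```

Only `map_fn` and `reduce_fn` correspond to code the user would write; everything in `run_mapreduce` is what the library hides.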
Implementation
● Google use case:
○ Dual processor x86 Linux machines with 2-4 GB memory
○ 100 Mbps to 1 Gbps switched ethernet
○ Cluster consists of 100s to 1,000s of commodity machines (machine failures common)
○ Storage is inexpensive hard drives
○ In-house distributed file system uses replication to ensure availability and reliability
○ Users submit jobs, consisting of tasks, to scheduler
○ Scheduler maps tasks to available nodes in cluster
Step 1
The MapReduce library partitions the input file(s) into M splits, typically 16 to 64 MB each.
It then starts many copies of the program on the machines in the cluster.
[2]
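As a rough illustration of Step 1, the sketch below computes byte ranges for the M splits of a single file. The function name `compute_splits` and the 200 MB file size are assumptions for the example, and the sketch ignores record boundaries, which a real input reader would have to respect.

```python
MB = 1024 * 1024


def compute_splits(file_size, split_size=64 * MB):
    """Return (offset, length) pairs that cover file_size bytes."""
    if split_size <= 0:
        raise ValueError("split_size must be positive")
    splits = []
    offset = 0
    while offset < file_size:
        length = min(split_size, file_size - offset)
        splits.append((offset, length))
        offset += length
    return splits


splits = compute_splits(200 * MB, 64 * MB)
print(len(splits))  # M for this file: three full splits plus one short one
```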
Step 2
One copy is the master; the rest are workers. The master picks idle workers and assigns each one a map task or a reduce task.
There are M map tasks and R reduce tasks. M follows from the number of input splits, while R is specified by the user and equals the number of output files.
Step 3
A worker assigned a map task reads the contents of its corresponding input split and parses key/value pairs from it.
Each pair is passed to the user-implemented Map function, and the intermediate key/value pairs it produces are buffered in memory.
Step 4
Buffered intermediate key/value pairs are periodically written to the map worker's local disk, partitioned into R regions by the partitioning function.
The locations of the written pairs are sent to the master, which forwards them to the reduce workers.
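The partitioning into R regions can be sketched as follows. The paper [2] describes a default partitioning function of the form hash(key) mod R; the use of `zlib.crc32` as the hash and the helper names here are assumptions made so that the example gives the same result in every process.

```python
import zlib


def partition(key, num_reduce_tasks):
    # crc32 is stable across processes, unlike Python's built-in hash() for str.
    return zlib.crc32(key.encode("utf-8")) % num_reduce_tasks


def partition_buffer(pairs, num_reduce_tasks):
    """Split buffered (key, value) pairs into one region per reduce task."""
    regions = [[] for _ in range(num_reduce_tasks)]
    for key, value in pairs:
        regions[partition(key, num_reduce_tasks)].append((key, value))
    return regions


buffered = [("the", "1"), ("fox", "1"), ("the", "1"), ("dog", "1")]
regions = partition_buffer(buffered, 3)
```

Because the region depends only on the key, every pair for a given key ends up with the same reduce task, no matter which map worker produced it.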
Step 5
When a reduce worker is notified of these locations by the master, it reads the intermediate key/value pairs from the local disks of the map workers using remote procedure calls.
When all of its intermediate data has been read, the reduce worker sorts the pairs by key so that all values for the same key are grouped together.
Step 6
The reduce worker iterates over the sorted intermediate key/value pairs and, for each unique key, passes the key and its set of values to the user-implemented Reduce function.
The output of the Reduce function is appended to the final output file for that reduce partition.
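Steps 5 and 6 together amount to a sort followed by a grouped scan. The sketch below assumes the fetched data fits in memory and uses a list of strings in place of the output file; `reduce_partition` and `sum_counts` are hypothetical names, not part of any library.

```python
from itertools import groupby
from operator import itemgetter


def reduce_partition(fetched_regions, reduce_fn):
    """fetched_regions: one list of (key, value) pairs per map worker."""
    pairs = [pair for region in fetched_regions for pair in region]
    # Sorting brings all occurrences of a key together, which groupby needs.
    pairs.sort(key=itemgetter(0))
    output_lines = []
    for key, group in groupby(pairs, key=itemgetter(0)):
        values = [value for _, value in group]
        for result in reduce_fn(key, values):
            output_lines.append(f"{key}\t{result}")
    return output_lines


def sum_counts(key, values):
    yield str(sum(int(v) for v in values))


from_map_0 = [("the", "1"), ("fox", "1")]
from_map_1 = [("the", "1"), ("dog", "1"), ("the", "1")]
lines = reduce_partition([from_map_0, from_map_1], sum_counts)
```

A side effect of the sort is that each output file is ordered by key.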
Step 7
When all map tasks and reduce tasks have completed, the master wakes up the user program and the MapReduce call returns to the user code. The output is available in the R output files, one per reduce task.
Master Data Structures
● The master node stores the state (idle, in-progress, completed) of each map task and reduce task and, for non-idle tasks, the id of the worker node running it
● The master node is the “conduit” through which the location of intermediate data propagates from map tasks to reduce tasks
○ For each completed map task, the master stores the location and size of the R intermediate regions produced by the map task
○ Updates to size and location are received when a map task completes
○ Size and location information is forwarded incrementally to in-progress reduce tasks
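One way to picture this bookkeeping is the sketch below. It is an assumption-heavy simplification: the class and method names are invented, the location strings are made-up placeholders, and a real master would also handle scheduling policy and failures.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional, Tuple


class TaskState(Enum):
    IDLE = "idle"
    IN_PROGRESS = "in-progress"
    COMPLETED = "completed"


@dataclass
class Task:
    task_id: int
    kind: str  # "map" or "reduce"
    state: TaskState = TaskState.IDLE
    worker_id: Optional[str] = None  # set only for non-idle tasks
    # For completed map tasks: reduce index -> (location, size in bytes)
    regions: Dict[int, Tuple[str, int]] = field(default_factory=dict)


class Master:
    def __init__(self, num_map: int, num_reduce: int) -> None:
        self.tasks: List[Task] = [Task(i, "map") for i in range(num_map)]
        self.tasks += [Task(i, "reduce") for i in range(num_reduce)]

    def assign(self, worker_id: str) -> Optional[Task]:
        """Give the first idle task to an idle worker."""
        for task in self.tasks:
            if task.state is TaskState.IDLE:
                task.state = TaskState.IN_PROGRESS
                task.worker_id = worker_id
                return task
        return None

    def map_completed(self, task_id: int, regions: Dict[int, Tuple[str, int]]) -> None:
        """Record where a finished map task left its R regions."""
        task = next(t for t in self.tasks if t.kind == "map" and t.task_id == task_id)
        task.state = TaskState.COMPLETED
        task.regions = dict(regions)

    def locations_for(self, reduce_index: int) -> List[Tuple[str, int]]:
        """Locations a reduce task can fetch so far."""
        return [t.regions[reduce_index] for t in self.tasks
                if t.kind == "map" and t.state is TaskState.COMPLETED
                and reduce_index in t.regions]


master = Master(num_map=2, num_reduce=2)
first = master.assign("worker-a")
master.map_completed(0, {0: ("worker-a:/tmp/m0-r0", 120), 1: ("worker-a:/tmp/m0-r1", 80)})
locations = master.locations_for(1)
```

Calling `locations_for` again after more map tasks finish returns a longer list, which mirrors the incremental forwarding to in-progress reduce tasks.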
Fault Tolerance
● With 100s to 1,000s of machines involved, machine failures are common. We want to tolerate them gracefully and hide them from the programmer
● Worker failure
○ Master pings workers regularly
○ If no response received
■ Worker is marked failed
■ In-progress map tasks and reduce tasks on failed workers are reset to idle state and rescheduled
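○ Completed map tasks on a failed worker are re-executed because their output is stored on the failed machine's local disk and is no longer reachable
○ Completed reduce tasks are not re-executed because their output is stored on the global fs
○ When a map task is re-executed on a different worker, reduce workers are notified so that they read from the new worker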
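The rescheduling rule for a failed worker can be written down compactly. This is a sketch using plain dictionaries with assumed field names (`id`, `kind`, `state`, `worker`); it covers only the state reset, not the pinging or the notification of reduce workers.

```python
def handle_worker_failure(tasks, failed_worker):
    """Reset tasks that must run again; return their ids."""
    rescheduled = []
    for task in tasks:
        if task["worker"] != failed_worker:
            continue
        in_progress = task["state"] == "in-progress"
        # Completed map output lived on the failed machine's local disk.
        lost_map_output = task["kind"] == "map" and task["state"] == "completed"
        if in_progress or lost_map_output:
            task["state"] = "idle"
            task["worker"] = None
            rescheduled.append(task["id"])
    return rescheduled


tasks = [
    {"id": "map-0", "kind": "map", "state": "completed", "worker": "w1"},
    {"id": "map-1", "kind": "map", "state": "in-progress", "worker": "w2"},
    {"id": "reduce-0", "kind": "reduce", "state": "completed", "worker": "w1"},
    {"id": "reduce-1", "kind": "reduce", "state": "in-progress", "worker": "w1"},
]
reset = handle_worker_failure(tasks, "w1")
```

Here `reduce-0` keeps its completed state because its output is already on the global file system.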
Fault Tolerance
● Master failure
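○ Master can write periodic checkpoints of its data structures to the global fs
○ If the master fails, a new copy can be started from the last checkpointed state; the implementation described in [2] instead aborts the computation, since failure of the single master is unlikely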
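A checkpoint of the master's state could look roughly like the sketch below. It assumes the state is JSON-serializable and uses a local temporary directory in place of the global file system; the file name `master.ckpt` and the function names are made up for the example.

```python
import json
import os
import tempfile


def write_checkpoint(state, path):
    # Write to a temporary file, then rename, so a crash mid-write
    # never leaves a partial checkpoint at `path`.
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def restore_checkpoint(path):
    with open(path, encoding="utf-8") as f:
        return json.load(f)


checkpoint_path = os.path.join(tempfile.mkdtemp(), "master.ckpt")
state = {
    "map-0": {"state": "completed", "worker": "w1"},
    "reduce-0": {"state": "idle", "worker": None},
}
write_checkpoint(state, checkpoint_path)
restored = restore_checkpoint(checkpoint_path)
```

A new master started from `restored` would see the task states as of the last checkpoint and would need to re-learn anything that changed afterwards.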
Task Granularity
● What values should we choose for M and R?
● M and R >> number of workers
○ Improves load balancing
○ Quicker recovery after failure because the many map tasks a failed worker had completed can be spread across all the other workers
● In practice (at Google)
○ Choose M such that each map task's input is <= fs block size
○ Choose R to be a small multiple of the number of workers
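The load-balancing argument can be checked with a toy model. The sketch below assumes an idealized scheduler in which each task goes to whichever worker becomes idle first, with no scheduling or startup cost; the task durations are invented and the function name `makespan` is an assumption.

```python
import heapq


def makespan(task_durations, num_workers):
    """Time at which the last worker finishes under greedy assignment."""
    free_at = [0.0] * num_workers  # min-heap of times each worker becomes idle
    heapq.heapify(free_at)
    for duration in task_durations:
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + duration)
    return max(free_at)


# The same 12 units of work on 3 workers, split two different ways.
coarse = makespan([6, 3, 3], 3)   # one task per worker, unevenly sized
fine = makespan([1] * 12, 3)      # many small tasks
print(coarse, fine)
```

With one uneven task per worker, the job waits on the largest task; with many small tasks, the work spreads evenly. In a real system the gain is bounded by per-task overhead, which this model leaves out.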
Backup Tasks
● A “straggler” is a machine that takes unusually long to complete one of the last few tasks
○ Hardware issues e.g. faulty disk
○ Resource contention e.g. other tasks scheduled on machine, sharing CPU, memory, network
● Backup Tasks
○ When a MapReduce operation is close to completion, the master schedules backup executions of the remaining in-progress tasks
○ A task is marked complete when either the primary or the backup execution finishes
[2]
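The effect of backup tasks on job completion time can be illustrated with a small calculation. This is a simplified model with invented numbers: it assumes every backup runs at normal speed and that backups for all still-running tasks start at one fixed time; `job_finish` is a hypothetical helper.

```python
def job_finish(primary_finish_times, backup_at=None, normal_duration=None):
    """Job completion time, optionally with backups launched at backup_at."""
    finish = []
    for t in primary_finish_times:
        if backup_at is not None and t > backup_at:
            # Task is still in progress when backups launch; whichever
            # execution finishes first marks the task complete.
            t = min(t, backup_at + normal_duration)
        finish.append(t)
    return max(finish)


primary = [10, 11, 40]  # the third task runs on a straggler
without_backup = job_finish(primary)
with_backup = job_finish(primary, backup_at=12, normal_duration=10)
print(without_backup, with_backup)
```

In this model the straggler no longer determines when the job ends, at the cost of running one task twice.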
Apache Hadoop
● Open-source framework for distributed computing using MapReduce
[3]
WordCount.java
[4]
Sources
[1] MapReduce Tutorial | Mapreduce Example in Apache Hadoop
[2] MapReduce: Simplified Data Processing on Large Clusters
[3] An Analysis of the Hadoop HDFS Distributed File System Architecture | by Amit Kriplani
[4] MapReduce Tutorial - Apache Hadoop