2.3 Possible Choices for Parallelization
2.3.3 MapReduce
Dean et al. [22] proposed MapReduce for large-scale data processing in parallel. It simplifies the design of the scalable solutions for processing a large amount of data, by hiding lower-level details of parallel execution such as fault-tolerance, load balancing and synchronization.
MapReduce follows the client/server architecture. One of the machines works as a master node which performs scheduling of tasks, coordinates the distribution of data to the worker nodes and holds, all the book-keeping about the workers, jobs and the data.
2.3 Possible Choices for Parallelization 29
Each client node is assumed to have its own local memory and a global distributed memory shared among all nodes.
Working Mechanism: A MapReduce computation consists of a number of MapRe- duce rounds. Each MapReduce round consists of two computational stages called Map and Reduce, intermediated by a communication stage called Shuffle. An schematic representation of a typical MapReduce round is shown in Figure 2.4.
Figure 2.4: A schematic representation of a MapReduce round
The map/reduce computations take input and produce output as a set of key/value pairs. The input and output are usually stored in a distributed file system [38]. A typical MapReduce cluster consists of a single master node (computer) which schedules and monitors different map/reduce data processing tasks over the slave nodes. Concep- tually, a MapReduce round consists of map, shuffle and reduce phases, and can be expressed as:
M ap(k1, v1)→list(k2, v2)
Reduce(k2, list(v2))→list(k3, v3)
In the Map stage, the given data are logically partitioned into a number of disjoint sub- sets and each subset is assigned to a mapper. The assignment of partitions to mappers
2.3 Possible Choices for Parallelization 30
is made dynamically and is determined at runtime. A mapper (in parallel to other map- pers) reads a record in the form(k1, v1)from its assigned data partition, performs the
user-specified computation, and outputs another setlist(k2, v2). If a mapper runs out
of memory at this stage, the part of output is temporarily moved to the local file system (LFS).
The output pairs produced by mappers (intermediate output) go through a shuffle phase. At this stage, each mapper locally performs the partitioning of its output. Given
r reducers, the key space of intermediate output is divided into r partitions in a way each partition has almost an equal number of distinct keys. This is to ensure that the amount of work associated with each partition is approximately the same. The parti- tioning is usually done via a hash functionhon a keykof each pair,i.e. h(k)mod r. This ensures that the pairs with the same key from different mappers are assigned to a single reducer. The partitioning strategy is based on the assumption that the amount of work associated with each key of intermediate output is the same. Note that a mapper output may not be local to its assigned reducer. In this case, shuffle also involves trans- ferring mapper output to a reducer over the network. The shuffle phase starts as soon as the first mapper finishes its processing, so it may overlap with the map phase.
A reducerriacquires its assigned part of map output by copying the pairs(k2, v2)from bi,j, an ith partition from the jth mapper (1 ≤ j ≤ M), sorts all the assigned pairs
by their keys to construct(k2, list(v2))and processes list(v2)for each distinct keyk2
using a user-specified reduce function. The output produced by the reduce phase can be the final output or can be input to another mapper in the next MapReduce round.
It is worth noting that there is an overhead for achieving parallelization in MapReduce. This mostly consists of the cost of initializing a MapReduce job, I/O cost of reading input and writing output by mappers and reducers during the map and reduce stages, and the cost of shuffling mapper output over the network to reducers. It is also worth observing that output produced by mappers is an input to reducers, which is transferred via shuffle phase. Therefore, less the intermediate output produced by the map stage,
2.3 Possible Choices for Parallelization 31
the lower the cost of transferring it over the network will be and the lower the I/O cost of writing intermediate output and reading input by the mappers and reducers will be. The number of mappers and reducers must also be specified based on the available resources. Setting them to a high value may result in a small amount of work for each mapper or reducer, and the performance gain by parallel processing may be offset by their setup cost. Setting them to a low value on the other hand, may result in ineffective resource utilization.
Example 1. Consider a simple word count using MapReduce. A given text is divided into M = 3 chunks and assigned to mappers. During the map stage, a mapper is assigned a chunk of text. Each ith line of the assigned chunk is read in the form of a
pairhi, lii, wherelirepresents the contents ofithline in the partition. It then counts the
occurrences of each distinct word inli, and outputshword, occurrencesjifor each dis-
tinct word. Once the lines are processed, the output pairs of each mapper are divided into r = 2partitions. For each map output pair hword, occurrencesji, it computes
h(word)mod r to decide to which of the two reducers the pair is sent to. The pairs are then copied to their assigned reducers. A reducer sums up all localoccurrencesj
(1≤j ≤M) of aword, and outputs the global counthword, occurencei.
MapReduce and other Parallel Architectures: Mapreduce was originally designed to run on a large cluster of commodity machines, but later was also adopted to other parallel architectures. Some existing works [49, 108, 16] proposed a design and im- plementation of MapReduce over graphics processing unit (GPU) clusters. Chen et al. [17] attempted to scale up MapReduce applications by proposing a number of work partitioning schemes which leverages both CPUs and GPUs to perform the given computations in parallel. Karloff et al. [58] and Goodrich et al. [44] showed that MapReduce can also be used to simulate PRAM and BSP models.
Unlike other parallel models such as BSP, MapReduce was not introduced with any formal cost estimation formulation. Therefore, an analytical study taking into account the characteristics of physical architecture and the given problem instance is needed to