A New Programming Paradigm: MapReduce - GRDI2020 Final Roadmap Report

Many data-intensive applications require hundreds of special-purpose computations that process large amounts of raw data. Most such computations are conceptually straightforward. However, the input data is usually large and the computations have to be distributed across hundreds or thousands of machines in order to finish in a reasonable amount of time. The main issues for such kind of applications are how to parallelize the computation, distribute the data, and handle failures.

What is MapReduce?

MapReduce is a programming model and an associated implementation for processing and

generating large data sets while hiding the messy details of parallelization, fault-tolerance, data distribution, and load balancing.

The basic idea of Map Reduce is straightforward. It consists of two programs that the user writes called map and reduce plus a framework for executing a possibly large number of instances of each program on a compute cluster [75].

The map program reads a set of “records” from an input file, does any desired filtering and/or transformations, and then outputs a set of records of the form (key, data). As the map program produces output records, a “split” function partitions the records into M disjoint buckets by applying a function to the key of each output record. This split function is typically a hash function, though any deterministic function will suffice. When a bucket fills, it is written to disk. The map program terminates with M output files, one for each bucket.

In general, there are multiple instances of the map program running on different nodes of a compute cluster. Each map instance is given a distinct portion of the input file by the MapReduce

scheduler to process. If N nodes participate in the map phase, then there are M files on disk

storage at each of N nodes, for a total of N * M files; Fi,j, 1 ≤ i ≤ N, 1 ≤ j ≤ M.

The key thing to observe is that all map instances use the same hash function. Hence, all output records with the same hash value will be in corresponding output files.

The second phase of a MapReduce job executes M instances of the reduce program, Rj, 1 ≤ j ≤

M. The input for each reduce instance Rj consists of the files Fi,j, 1 ≤ i ≤ N. Again notice that all

output records from the map phase with the same hash value will be consumed by the same reduce instance — no matter which map instance produced them. After being collected by the map-reduce framework, the input records to a reduce instance are grouped on their keys (by sorting or hashing) and feed to the reduce program. Like the map program, the reduce program is an arbitrary computation in a general-purpose language. Hence, it can do anything it wants with its records. For example, it might compute some additional function over other data fields in the record. Each reduce instance can write records to an output file, which forms part of the “answer” to a MapReduce computation.

The MapReduce Framework

The MapReduce Framework is composed of the following components/functions:

Input reader: It divides the input into appropriate size 'splits' (in practice typically 16MB to 128MB) and the framework assigns one split to each Map function. The input reader reads data from stable storage (typically a distributed file system) and generates key/value pairs.

Map Function: Each Map function takes a series of key/value pairs, processes each, and generates zero or more output key/value pairs. The input and output types of the map can be (and often are) different from each other.

Partition Function: Each Map function output is allocated to a particular reducer by the application's partition function for sharding purposes. The partition function is given the key and the number of reducers and returns the index of the desired reduce.

A typical default is to hash the key and modulo the number of reducers. It is important to pick a partition function that gives an approximately uniform distribution of data per shard for load balancing purposes, otherwise the MapReduce operation can be held up waiting for slow reducers to finish.

Between the map and reduce stages, the data is shuffled (parallel-sorted / exchanged between nodes) in order to move the data from the map node that produced it to the shard in which it will be reduced. The shuffle can sometimes take longer than the computation time depending on network bandwidth, CPU speeds, data produced and time taken by map and reduce computations. Compare Function: The input for each Reduce is pulled from the machine where the Map ran and sorted using the application's comparison function.

Reduce Function: The framework calls the application's Reduce function once for each unique key in the sorted order. The Reduce can iterate through the values that are associated with that key and output 0 or more values.

Output Writer: It writes the output of the Reduce to stable storage, usually a distributed file system.

An Example

Consider the problem of counting the number of occurrences of each word in a large collection of documents [76]. The user would write code similar to the following pseudo-code:

Map (String key, String value): // key: document name // value: document contents for each word w in value: EmitIntermediate(w, "1");

reduce(String key, Iterator values): // key: a word

// values: a list of counts int result = 0;

for each v in values: result += ParseInt(v); Emit(AsString(result));

The map function emits each word plus an associated count of occurrences (just `1' in this simple example). The reduce function sums together all counts emitted for a particular word.

Advantage of MapReduce

The advantage of MapReduce is that it allows for distributed processing of the map and reduction operations. Provided each mapping operation is independent of the others, all maps can be performed in parallel — though in practice it is limited by the data source and/or the number of CPUs near that data. Similarly, a set of 'reducers' can perform the reduction phase - all that is required is that all outputs of the map operation which share the same key are presented to the same reducer, at the same time. While this process can often appear inefficient compared to algorithms that are more sequential, MapReduce can be applied to significantly larger datasets than "commodity" servers can handle — a large server cluster can use MapReduce to sort a petabyte of data in only a few hours. The parallelism also offers some possibility of recovering from partial failure of servers or storage during the operation: if one mapper or reducer fails, the work can be rescheduled — assuming the input data is still available.

In fact, MapReduce achieves an excellent fault-tolerance by parceling out a number of operations on the set of data to each node in the network. Each node is expected to report back periodically with completed work and status updates. If a node falls silent for longer than that interval, the master node records the node as dead and sends out the node's assigned work to other nodes. Individual operations use atomic operations for naming file outputs as a check to ensure that there are not parallel conflicting threads running. When files are renamed, it is possible to also copy them to another name in addition to the name of the task.

The reduce operations operate much the same way. Because of their inferior properties with regard to parallel operations, the master node attempts to schedule reduce operations on the same node, or in the same rack as the node holding the data being operated on. This property is desirable as it conserves bandwidth across the backbone network of the datacenter.

Criticisms

The database research community has expressed some concerns regarding this new computing paradigm [75]:

 The database community has learned three important lessons: (i) schemas are good, (ii) separation of the schema from the application is good, and (iii) high-level access languages are good. MapReduce has learned none of these lessons.

 MapReduce is a poor implementation.

 All modern DBMSs use hash and B-tree indexes to accelerate access to data. In addition, there is a query optimizer to decide whether to use an index or perform a brute-force sequential search.

 MapReduce has no indexes and therefore has only brute force as a processing option.

 MapReduce is not novel.

 The MapReduce community seems to feel that they have discovered an entirely new paradigm for processing large data sets. In actuality, the techniques employed by MapReduce are more than 20 years old.

 MapReduce is missing features. All of the following features are routinely provided by modern DBMSs, and all are missing from MapReduce: Bulk loader, Indexing, Updates, Transactions, Integrity constraints, Referential integrity, Views.

 MapReduce is incompatibile with the DBMS tools. A modern SQL DBMS has available all of the following classes of tools: Report writers, Business intelligence tools, Data mining tools, Replication tools, Database design tools. MapReduce cannot use these tools and has none of its own.

The advocates of the MapReduce paradigm reject these views. They assert that DeWitt and Stonebraker's entire analysis is groundless as MapReduce was never designed nor intended to be used as a database. MapReduce is not a data storage or management system — it’s an algorithmic technique for the distributed processing of large amounts of data.

However, it is worthwhile to note that some database researchers are beginning to explore using the MapReduce framework as the basis for building scalable database systems. The Pig project at Yahoo! Research is one such effort.

Factors of success

The MapReduce programming model has been successfully used for many different purposes. This success can be attributed to several reasons. First, the model is easy to use, even for programmers without experience with parallel and distributed systems, since it hides the details of parallelization, fault-tolerance, locality optimization, and load balancing. Second, a large variety of problems are easily expressible as MapReduce computations. For example, MapReduce is used for the generation of data for Google's production web search service, for sorting, for data mining, for machine learning, and many other systems. Third, several implementations of MapReduce have been developed that scale to large clusters of machines comprising thousands of machines. These implementations make efficient use of these machine resources and therefore are suitable for use on many of the large computational problems encountered in data-intensive applications.

In document GRDI2020 Final Roadmap Report (Page 86-90)