• No results found

Writing Hadoop MapReduce Programs

In the previous chapter, we learned how to set up the R and Hadoop development environment. Since we are interested in performing Big Data analytics, we need to learn Hadoop to perform operations with Hadoop MapReduce. In this chapter, we will discuss what MapReduce is, why it is necessary, how MapReduce programs can be developed through Apache Hadoop, and more.

In this chapter, we will cover:

• Understanding the basics of MapReduce • Introducing Hadoop MapReduce

• Understanding the Hadoop MapReduce fundamentals • Writing a Hadoop MapReduce example

• Understanding several possible MapReduce definitions to solve business problems

• Learning different ways to write Hadoop MapReduce in R

Understanding the basics of MapReduce

Understanding the basics of MapReduce could well be a long-term solution if one doesn't have a cluster or uses Message Passing Interface (MPI). However, a more realistic use case is when the data doesn't fit on one disk but fits on a Distributed File System (DFS), or already lives on Hadoop-related software.

Writing Hadoop MapReduce Programs

Moreover, MapReduce is a programming model that works in a distributed fashion, but it is not the only one that does. It might be illuminating to describe other

programming models, for example, MPI and Bulk Synchronous Parallel (BSP). To process Big Data with tools such as R and several machine learning techniques requires a high-configuration machine, but that's not the permanent solution. So, distributed processing is the key to handling this data. This distributed computation can be implemented with the MapReduce programming model.

MapReduce is the one that answers the Big Data question. Logically, to process data we need parallel processing, which means processing over large computation; it can either be obtained by clustering the computers or increasing the configuration of the machine. Using the computer cluster is an ideal way to process data with a large size. Before we talk more about MapReduce by parallel processing, we will discuss Google MapReduce research and a white paper written by Jeffrey Dean and Sanjay Ghemawat in 2004. They introduced MapReduce as simplified data processing software on large clusters. MapReduce implementation runs on large clusters with commodity hardware. This data processing platform is easier for programmers to perform various operations. The system takes care of input data, distributes data across the computer network, processes it in parallel, and finally combines its output into a single file to be aggregated later. This is very helpful in terms of cost and is also a time-saving system for processing large datasets over the cluster. Also, it will efficiently use computer resources to perform analytics over huge data. Google has been granted a patent on MapReduce.

For MapReduce, programmers need to just design/migrate applications into two phases: Map and Reduce. They simply have to design Map functions for processing a key-value pair to generate a set of intermediate key-value pairs, and Reduce functions to merge all the intermediate keys. Both the Map and Reduce functions maintain MapReduce workflow. The Reduce function will start executing the code after completion or once the Map output is available to it.

Chapter 2

[ 39 ]

MapReduce assumes that the Maps are independent and will execute them in

parallel. The key aspect of the MapReduce algorithm is that if every Map and Reduce is independent of all other ongoing Maps and Reduces in the network, the operation will run in parallel on different keys and lists of data.

A distributed filesystem spreads multiple copies of data across different machines. This offers reliability as well as fault tolerance. If a machine with one copy of the file crashes, the same data will be provided from another replicated data source.

The master node of the MapReduce daemon will take care of all the responsibilities of the MapReduce jobs, such as the execution of jobs, the scheduling of Mappers, Reducers, Combiners, and Partitioners, the monitoring of successes as well as failures of individual job tasks, and finally, the completion of the batch job. Apache Hadoop processes the distributed data in a parallel manner by

running Hadoop MapReduce jobs on servers near the data stored on Hadoop's distributed filesystem.

Companies using MapReduce include:

Amazon: This is an online e-commerce and cloud web service provider for Big Data analytics

eBay: This is an e-commerce portal for finding articles by its description • Google: This is a web search engine for finding relevant pages relating to a

particular topic

LinkedIn: This is a professional networking site for Big Data storage and generating personalized recommendations

Trovit: This is a vertical search engine for finding jobs that match a given description

Twitter: This is a social networking site for finding messages

Apart from these, there are many other brands that are using Hadoop for Big Data analytics.

Introducing Hadoop MapReduce

Basically, the MapReduce model can be implemented in several languages, but apart from that, Hadoop MapReduce is a popular Java framework for easily written applications. It processes vast amounts of data (multiterabyte datasets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable and fault- tolerant manner. This MapReduce paradigm is divided into two phases, Map and Reduce, that mainly deal with key-value pairs of data. The Map and Reduce tasks run sequentially in a cluster, and the output of the Map phase becomes the input of the Reduce phase.

Writing Hadoop MapReduce Programs

All data input elements in MapReduce cannot be updated. If the input (key, value) pairs for mapping tasks are changed, it will not be reflected in the input files. The Mapper output will be piped to the appropriate Reducer grouped with the key attribute as input. This sequential data process will be carried away in a parallel manner with the help of Hadoop MapReduce algorithms as well as Hadoop clusters. MapReduce programs transform the input dataset present in the list format into output data that will also be in the list format. This logical list translation process is mostly repeated twice in the Map and Reduce phases. We can also handle these repetitions by fixing the number of Mappers and Reducers. In the next section, MapReduce concepts are described based on the old MapReduce API.