Chapter 4: Parallel Pattern Matching using MPI

4.6 Introduction to Hadoop

Hadoop is an Apache open source framework for reliable, scalable and distributed computing of large data sets [70]. It was created by Doug Cutting and Mike Cafarella in 2005. Cutting named it Hadoop after his son’s toy elephant. Written in Java, Hadoop has three main objectives:

- Distributed computing: Hadoop is designed to work in an environment of distributed storage and computation of vast amounts of data (multi-terabyte data sets) across large clusters of computers.

- Reliability: Hadoop is designed to handle hardware failure and data congestion in a reliable, fault tolerant way.

- Scalability: Hadoop is designed to scale up from a single server to several thousands of servers.

The Hadoop framework is made up of the following four modules [71]:

- Hadoop Common: Java libraries and utilities required by the other modules to start Hadoop

- Hadoop YARN: a framework for job scheduling and cluster resource management

- Hadoop Distributed File System (HDFS): a distributed file system that stores data across the nodes of the cluster

- Hadoop MapReduce: parallel processing of large data sets

Since 2012, the name Hadoop has also been used for the wider set of software frameworks that can be installed alongside it. These include Apache Pig, Apache Hive, Apache HBase and Apache Spark.

Hadoop’s main modules are HDFS and MapReduce. They are modeled on the Google File System (GFS) and Google’s MapReduce, respectively.

4.6.1 HDFS

The Hadoop Distributed File System, as the name suggests, is a distributed file system. It is highly fault tolerant and is designed to be installed on low cost hardware and to run on large clusters. HDFS uses a master/slave architecture. The master is a single node, the NameNode, and the slaves are one or more DataNodes. The NameNode manages the file system metadata, and the DataNodes store the data. A file in HDFS is split into several blocks, which are stored on the DataNodes according to a mapping maintained by the NameNode. Like any other file system, HDFS has a shell and a set of commands to access and manipulate the file system.
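The split between metadata and data described above can be illustrated with a small in-memory model. This is only a sketch under stated assumptions, not HDFS code: the class names mirror the roles in the text, while `BLOCK_SIZE`, `REPLICATION`, the block id format and the round-robin placement are hypothetical choices made to keep the example short.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 8   # bytes; toy value for illustration, real HDFS blocks are far larger
REPLICATION = 2  # copies kept of each block; hypothetical setting


@dataclass
class DataNode:
    """Stores raw block contents, keyed by block id."""
    name: str
    blocks: dict = field(default_factory=dict)


@dataclass
class NameNode:
    """Holds metadata only: which blocks form a file and where they live."""
    datanodes: list
    block_map: dict = field(default_factory=dict)  # filename -> [(block_id, [node names])]

    def write(self, filename: str, data: bytes) -> None:
        entries = []
        for offset in range(0, len(data), BLOCK_SIZE):
            index = offset // BLOCK_SIZE
            block_id = f"{filename}#{index}"
            # Round-robin placement: an assumption, not the real HDFS policy.
            targets = [self.datanodes[(index + r) % len(self.datanodes)]
                       for r in range(REPLICATION)]
            for node in targets:
                node.blocks[block_id] = data[offset:offset + BLOCK_SIZE]
            entries.append((block_id, [node.name for node in targets]))
        self.block_map[filename] = entries

    def read(self, filename: str) -> bytes:
        by_name = {node.name: node for node in self.datanodes}
        # Fetch each block from the first DataNode listed for it.
        return b"".join(by_name[locations[0]].blocks[block_id]
                        for block_id, locations in self.block_map[filename])
```

Writing a 20-byte file to a three-node model produces three blocks, and the file contents never pass through the `NameNode` object's own storage; it only records where each block was placed.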

4.6.2 MapReduce

MapReduce is the programming module used for processing data in parallel. A MapReduce job is divided into two phases. The map phase splits the input data into chunks (depending on the input file format specified) and processes these chunks as independent parallel tasks, converting the input into a set of <key, value> output pairs. The framework then sorts these pairs by key and sends them as input to the reduce tasks. The reduce phase aggregates the <key, value> pairs according to the reduction function and produces an output consisting of a smaller set of <key, value> pairs. Both the primary input and the resulting output are stored in HDFS.
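The map, sort and reduce steps described above can be sketched in a few lines of plain Python. This is an in-memory illustration using word counting as the job, not Hadoop's API: the function names `map_phase`, `reduce_phase` and `run_job` are hypothetical, and everything runs sequentially in one process rather than as parallel tasks.

```python
from itertools import groupby
from operator import itemgetter


def map_phase(chunk):
    """Map: turn one input chunk (a line of text) into <word, 1> pairs."""
    return [(word.lower(), 1) for word in chunk.split()]


def reduce_phase(key, values):
    """Reduce: aggregate all values that share a key into a single pair."""
    return (key, sum(values))


def run_job(chunks):
    # Each chunk is mapped independently; in Hadoop these would be parallel tasks.
    pairs = [pair for chunk in chunks for pair in map_phase(chunk)]
    # Sort by key so that equal keys are adjacent before they reach the reducer.
    pairs.sort(key=itemgetter(0))
    return [reduce_phase(key, [value for _, value in group])
            for key, group in groupby(pairs, key=itemgetter(0))]
```

For the two chunks `"the cat"` and `"the dog the"`, the map step emits five pairs and the reduce step returns three, one per distinct word.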

4.6.3 Procedure

The MapReduce framework consists of a single master ResourceManager and one slave NodeManager per cluster node. An application submits a job to Hadoop, specifying the input and output locations, the map and reduce Java classes in the form of a jar file, and any other parameters and configurations that might be needed. The Hadoop job client then sends the jar and configurations to the ResourceManager. The ResourceManager performs resource management, tracks resource consumption and availability, and schedules tasks on the slaves. The slaves execute the tasks and periodically report task-status information to the master.
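The interaction between the master and the slaves can be pictured with a toy scheduler. The sketch below is a simplification under stated assumptions and does not use Hadoop's API: the class names follow the text, but the slot counts, the status strings and the `heartbeat` method are hypothetical, and details such as containers and the per-application master are left out.

```python
from collections import deque


class NodeManager:
    """A slave node with a fixed number of task slots (hypothetical model)."""

    def __init__(self, name, slots):
        self.name = name
        self.free_slots = slots
        self.running = []

    def launch(self, task):
        self.free_slots -= 1
        self.running.append(task)

    def heartbeat(self):
        """Finish the running tasks and report their status to the master."""
        finished = [(task, "SUCCEEDED") for task in self.running]
        self.free_slots += len(self.running)
        self.running = []
        return finished


class ResourceManager:
    """The master: tracks free capacity and schedules pending tasks."""

    def __init__(self, nodes):
        self.nodes = nodes
        self.pending = deque()
        self.status = {}

    def submit(self, job_name, num_tasks):
        for i in range(num_tasks):
            task = f"{job_name}-task{i}"
            self.pending.append(task)
            self.status[task] = "PENDING"

    def schedule(self):
        # Hand pending tasks to every node that still has free capacity.
        for node in self.nodes:
            while node.free_slots > 0 and self.pending:
                task = self.pending.popleft()
                node.launch(task)
                self.status[task] = f"RUNNING on {node.name}"

    def run(self):
        """Alternate scheduling and status collection until nothing is pending."""
        if not any(node.free_slots > 0 for node in self.nodes):
            raise ValueError("cluster has no free slots")
        rounds = 0
        while self.pending:
            self.schedule()
            for node in self.nodes:
                for task, state in node.heartbeat():
                    self.status[task] = state
            rounds += 1
        return rounds
```

With two nodes of two slots each, a five-task job needs two scheduling rounds: four tasks run in the first round and the remaining one in the second.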

4.6.4 Usage and Advantages

Although Hadoop is a large scale distributed processing framework, it can also be used on a single machine. However, to obtain the full potential of Hadoop, it must be scaled to hundreds or even thousands of nodes, each containing multiple cores. Hadoop has been demonstrated to work on clusters of up to 4000 nodes. The data that Hadoop deals with should also be huge. Hadoop was built to process web-scale data ranging from hundreds of gigabytes to terabytes or even petabytes [72]. Hadoop is not particularly known for being runtime efficient, yet it is widely used by industries and developers across the world. The reason for that is scale. The input data sets that HDFS typically holds are far too large to fit on the hard drive of one computer, let alone in the memory of a single machine. For many businesses, a distributed file system such as Hadoop’s is the only practical way to store and process their data.

Another reason for Hadoop’s popularity is its cost effectiveness. For example, consider a single computer with one thousand CPUs. Such a machine, assuming it exists, would be fast and very efficient, but also very expensive to obtain. In contrast, consider one thousand single-CPU computers. Hadoop connects these cheaper machines together to form a more cost effective, efficient, and reliable cluster. Other advantages of Hadoop include:

- It is open source and, being Java based, portable across platforms. Although it is implemented in Java, applications can be written in other languages. Hadoop Streaming allows developers to create and run jobs with any executable as the mapper or the reducer.

- It has a simplified programming model. Developers can quickly write and test distributed systems, since Hadoop automatically distributes the data and work across the nodes.

- It detects and handles failures without relying on hardware to provide fault tolerance.

- Nodes can be added dynamically to the cluster without interrupting operation.
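As an illustration of writing a job in a language other than Java, the following sketch shows a mapper and a reducer in the style used with Hadoop Streaming, where each step reads lines from standard input and writes tab-separated key and value lines to standard output. The input format (one `host bytes` record per line), the function names and the `map`/`reduce` command-line argument are assumptions made for this example.

```python
import sys
from itertools import groupby

ROLES = ("map", "reduce")


def mapper(lines):
    """Emit 'host<TAB>bytes' for every well-formed 'host bytes' input line."""
    for line in lines:
        fields = line.split()
        if len(fields) == 2 and fields[1].isdigit():
            yield f"{fields[0]}\t{fields[1]}"


def reducer(lines):
    """Sum the values per key; the input lines must arrive sorted by key."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in lines)
    for key, group in groupby(parsed, key=lambda pair: pair[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"


# Run as a script with one argument, "map" or "reduce" (hypothetical convention).
if __name__ == "__main__" and len(sys.argv) == 2 and sys.argv[1] in ROLES:
    step = mapper if sys.argv[1] == "map" else reducer
    for output_line in step(sys.stdin):
        print(output_line)
```

Locally, the same pipeline can be approximated by piping the mapper output through a sort and into the reducer. On a cluster, scripts of this kind would be passed to the streaming utility through its `-mapper` and `-reducer` options, with the framework performing the sort between the two steps.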

The compute nodes and the storage nodes are the same. MapReduce and HDFS run on the same set of nodes, which gives high data locality and therefore better performance. Data is mostly read from the local disk of the node that processes it. This reduces network bandwidth consumption and prevents unnecessary transfers. Hence, computation is moved to the data, instead of moving the data to the computation.
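The idea of moving computation to the data can be sketched as a scheduling rule: run each task on a node that already stores the block it needs, and fall back to another node only when no such node is free. The function below is a greedy illustration under that assumption; `assign_tasks` and its arguments are hypothetical names, and Hadoop's actual scheduler is more elaborate.

```python
def assign_tasks(block_locations, free_slots):
    """Assign one task per block, preferring a node that already stores it.

    block_locations: block id -> list of nodes holding a copy of the block
    free_slots: node -> number of tasks the node can still accept
    Returns: block id -> (chosen node, True if the read is local)
    """
    slots = dict(free_slots)  # work on a copy so the caller's dict is untouched
    plan = {}
    for block, holders in block_locations.items():
        local = [node for node in holders if slots.get(node, 0) > 0]
        if local:
            node, is_local = local[0], True
        else:
            # No holder is free: pick any free node and read over the network.
            remote = [node for node, free in slots.items() if free > 0]
            if not remote:
                raise RuntimeError("no free slots left")
            node, is_local = remote[0], False
        slots[node] -= 1
        plan[block] = (node, is_local)
    return plan
```

In a three-node example where two blocks live on the same single-slot node, the first task runs locally and the second is placed on a free node elsewhere, which is exactly the case where data has to cross the network.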
