• No results found

When you set up Resource Manager high availability, you must specify the addresses for all the Resource Managers in the yarn-site.xml configuration files used by clients, Application Masters, and Node Managers to communicate with the Resource Managers. Clients, Application Masters, and Node Managers try to connect to the Resource

Managers in a round-robin manner until they reach the active Resource Manager and then cache this address—until it starts getting a communication exception (when the active Resource Manager dies) and have to go again through the round-robin to identify the new active Resource Manager.

Note: MapReduce for Hadoop 1.0 and Hadoop 2.0

MapReduce has undergone a complete overhaul in Hadoop 2.0 and is now called MapReduce 2. However, the MapReduce programming model has not changed—you can still use the MapReduce APIs as discussed in this book. YARN provides a new resource-management and job scheduling model, and its implementation executes MapReduce jobs. In most cases, your existing MapReduce jobs will run without any changes. In some cases, you might need to make minor changes and deal with recompilation.

Summary

In this hour, you learned advanced concepts related to MapReduce, including the

MapReduce Streaming utility, MapReduce joins, Distributed Caches, different types of failures, and performance optimization for MapReduce jobs.

You also took a look at YARN, which brings a major architectural change in the Hadoop framework and opens a new window for scalability, performance, and multitenancy. You explored the YARN architecture, its components, its job execution pipeline, and how YARN transparently handles different types of failures.

Q&A

Q. What is DistributedCache and why it is used?

A. The DistributedCache feature shares static data globally among all nodes in the cluster. Unlike reading from HDFS, in which data from a single file typically is read from a single Mapper, DistributedCache comes in handy when you want to share some data (for example, global lookup, or JAR files or archives that contain executable code) among all the Mappers of the job or for initialization, or when libraries of code need to be accessed on all nodes in the cluster.

Q. What benefits does reuse of JVM provide?

A. By default, a TaskTracker kicks off a Mapper or Reducer task in a separate JVM as a child process. Although separation or isolation sounds like a good idea, to keep a failed map task from breaking the TaskTracker itself, an initialization cost is

incurred each time a separate JVM spins off for the task execution. This additional initialization cost increases even further if the task needs to read a large data

structure in memory (for example, DistributedCache or data needed for joining). To solve this problem, the MapReduce framework has the property

mapred.job.reuse.jvm.num.tasks, which supports reuse of a JVM across multiple tasks of the same job. Hence, the additional initialization cost can be

compensated or shared across many tasks.

Q. How does job scheduling happen in Hadoop?

A. The earlier MapReduce framework used a first in, first out (FIFO) method to process the job requests. When a client submitted a job, it was queued at one end and JobTracker picked it up from the other end (the oldest one) for execution. This FIFO approach for processing jobs was intermixed with JobTracker logic, making it inflexible and giving administrators no control over this. Later, however, Hadoop made it extensible and started support for a pluggable job scheduling component. Now you can use either the FIFO approach, the Capacity Scheduler or Fair

Scheduler, or any other available scheduling component that plugs in to your Hadoop cluster to resolve job contention and schedule a job.

Q. What is a speculative task and when it is executed?

A. The MapReduce framework has a concept of speculative execution, in which it launches a backup or speculative task for an identified slow-running task to do the same work, in parallel, as the original task. The MapReduce framework keeps monitoring both tasks (the original one and the speculative or backup); whichever task completes first is taken into consideration. The MapReduce framework kills the other task (which was doing the same work). Speculative execution is intended to prevent the slow tasks from dragging down the job’s overall execution time.

The important point to note here is that a speculative or backup task is executed only when the MapReduce framework identifies a task running slower than expected (based on the overall progress it has made with respect to other tasks of the same job). The MapReduce framework never kicks off two copies of the same task at the beginning of job execution, to avoid a race for the best execution time. Q. What are counters and why are they used?

A. Counters provide a means of gathering application-level statistics for assessing quality control, diagnosing a problem, monitoring and tracking the progress of job, and so on. Some built-in counters report various metrics for your job, including

FILE_BYTES_READ, FILE_BYTES_WRITTEN, NUM_KILLED_MAPS,

NUM_KILLED_REDUCES, NUM_FAILED_MAPS, and NUM_FAILED_REDUCES. You can create and define other user-defined or custom counters as well.

Q. What is YARN and what does it do?

A. In Hadoop 2.0, MapReduce has undergone a complete overhaul, with a new layer created on top of HDFS. This new layer, called YARN (Yet Another Resource Negotiator), takes care of two major functions: resource management and application life-cycle management. The JobTracker previously handled those

functions. Now MapReduce is just a batch-mode computational layer sitting on top of YARN, whereas YARN acts like an operating system for the Hadoop cluster by providing resource management and application life-cycle management

functionalities. This makes Hadoop a general-purpose data processing platform that is not constrained only to MapReduce.

Q. What is uber-tasking optimization?

A. The concept of uber-tasking in YARN applies to smaller jobs. Those jobs are executed in the same container or in the same JVM in which that application- specific Application Master is running. The basic idea behind uber-tasking optimization is that the distributed task allocation and management overhead exceeds the benefits of executing tasks in parallel for smaller jobs, hence its optimum to execute smaller job in the same JVM or container of the Application Master.

Quiz

1. Hadoop is written in Java, so Hadoop natively supports writing MapReduce jobs in Java. Imagine that you are not a Java expert, though. Do you have any other option for writing MapReduce jobs in any other language?

2. What’s the difference between a map-side join and reduce-side join? Which offers better performance?

3. Why is JobTracker a single point of failure?

4. How can you handle bad records when you are doing analysis on the large set of data?

Answers

1. Hadoop provides the Hadoop Streaming utility (implemented in the form of a JAR file) for creating Map and Reduce executables in other languages (including C# and Python) for execution on a Hadoop cluster. When you execute your executables (Map and Reduce), the Hadoop Streaming utility creates a MapReduce job, submits the job to the intended Hadoop cluster, monitors the progress of the job until it

completes, and reports back. During execution, the Mapper and Reducer executables read the input data line by line from STDIN and emit the output data to STDOUT.

2. Based on implementation, if the join is performed by the Mapper of the MapReduce job, it is called a map-side join. If the Reducer performs the join, it is called a

reduce-side join.

Map-side joins are better in terms of performance and thus are recommended. If one of the datasets is small enough (which is the case in most Big Data scenarios), you can consider using the Distributed Cache (to cache the smaller table and use it as lookup during the join by each Mapper) to do joining during the map phase only. If both datasets are large, though first see whether a map-side join is possible before you opt for a reduce-side join.

3. If the JobTracker fails, all the job clients must resubmit all the running and queued jobs because there is no built-in mechanism to recover from the failure. The

possibility of JobTracker failure is low, but its failure is a serious issue.

The good news is that JobTracker failure is no longer a problem in Hadoop 2.0 YARN.

4. When you are doing analysis in a large dataset, it’s generally acceptable to ignore a small percentage of bad records so that job execution completes successfully. You might write your Mapper or Reducer to gracefully handle these bad records while processing without actually causing the runtime exception to happen or the task to fail. But in some cases, such as if you are using a third-party library or you don’t have access to the source code, you won’t have that option. In those cases, you can use the MapReduce framework to skip bad records automatically on subsequent task retries.

When you enable this feature, two task failures are considered to be normal failure. On the third failure, the MapReduce framework captures the data that caused the failure. On the fourth failure, it skips those identified data to avoid the failure again. All the bad records detected and skipped are written to HDFS in the sequence file format in the job’s output directory under the _logs/skip subdirectory, for later analysis.

5. Aligning to the original master-slave architecture principle, even YARN has a global or master Resource Manager for managing cluster resources and a per-node and - slave Node Manager that takes direction from the Resource Manager and manages resources on the node. These two form the computation fabric for YARN. Apart from that is a per-application Application Master, which is merely an application- specific library tasked with negotiating resources from the global Resource Manager

and coordinating with the Node Manager(s) to execute the tasks and monitor their execution. Containers also are present—these are a group of computing resources, such as memory, CPU, disk, and network.

Part II: Getting Started with HDInsight