Aligning with the original master-slave architecture principle, even YARN has a global or master Resource Manager for managing cluster resources. YARN also has a per–
node/slave Node Manager that takes direction from the Resource Manager and manages resources on that specific node. This forms the computation fabric for YARN. Apart from that, a per-application Application Master is nothing but an application-specific library tasked with negotiating resources from the global Resource Manager and coordinating with the Node Manager(s) to execute and monitor the execution of the tasks. Other components are containers, a group of computing resources such as memory, CPU, disk, and network. This section covers these components in detail.
Resource Manager
For a YARN-based Hadoop cluster, a global Resource Manager manages the global assignment of compute resources to the applications targeting the Hadoop cluster. The Resource Manager has two main components (see Figure 5.4).
FIGURE 5.4 YARN components. Resource Manager—Application Manager
The job client initiates a talk with the Application Manager, a component of the Resource Manager that handles job submission and negotiates for the first container (from the Scheduler, as discussed next) so that the application-specific Application Master can be executed. The Application Manager also provides services to restart the application- specific Application Master container in case it detects any failure of an Application Master instance.
Resource Manager—Scheduler
The Scheduler component of the Resource Manager plays a role in scheduling and allocating cluster resources that are subject to various constraints, such as capacities, queues, and SLA. This is based on the resource requirements of the applications. The cluster resources are described in terms of the number of containers (memory, CPU, disk, network, and so on) and the application-specific tasks in those containers, based on the chosen pluggable scheduling policy (FIFO, Fair Scheduler, Capacity Scheduler, and so on). The yarn.resourcemanager.scheduler.class property is in the yarn- default.xml configuration file. The Scheduler component is a pure scheduler, which means it performs the scheduling functions to allocate computing resources to run the applications’ tasks, based on the resource requirements of the applications. The Scheduler does not itself monitor or track the status of the execution of the applications’ tasks or restart any failed tasks.
GO TO To learn more about pluggable schedulers, refer to the section “MapReduce Job Scheduling,” earlier in this hour.
The notion of moving computation to the data (instead of moving data to the computation) has not changed in YARN. The Scheduler component of the Resource Manager has
enough information about the application’s resource needs, which helps it decide on locating and allocating containers available on the same nodes as the data. If it does not find available resources on those nodes (which contain replicas of the data), it looks out for another node in the same rack.
Application Master
Hadoop 2.0 added multitenancy capabilities in which multiple applications can
simultaneously utilize the Hadoop cluster. Each new application can implement a custom Application Master, which requests resources from the Resource Manager based on the specific application need and manages the application’s scheduling and task execution coordination. Many of these applications use the Hadoop cluster simultaneously with the help of Application Masters; these applications run under the notion of isolation, and each one gets guaranteed capacity.
Based on the job submitted by the job client, the Application Manager executes that application-specific Application Master in the first container assigned. Next, the
application-specific Application Master has the responsibility to negotiate the appropriate resources from the Scheduler component of the Resource Manager, based on that specific application’s need. Other responsibilities include maintaining the task queue, managing the application or job life-cycle by coordinating with the Node Manager(s) to launch the tasks, monitoring the tasks’ execution progress, tracking their status, and handling task failure.
In terms of responsibilities, an Application Master is similar to the JobTracker. However, unlike the JobTracker, which runs on the master node, the Application Master runs on the slave node. The Application Master relieves the Resource Manager (master node) from being overcrowded, manages the life cycle of that specific application, and provides the scaled-out model for the Application Master so that many applications can work together.
Multiple Application Masters from different versions of the same application also can run side by side, allowing users to implement a rolling upgrade of the user application or the cluster according to their own schedule. For example, you can run multiple versions (each version might be served by the same Application Master or a different one) of the
MapReduce jobs on the same cluster.
Note: Common Application Master
Typically, you will find a separate Application Master for each application, but that is not a rule. You could have a single common Application Master for more than one application.
When the task finishes executing, the Application Master releases the container assigned to it and returns it to the Resource Manager so that other queued requests can use it. When all the tasks of the job are completed and all the containers are released, the Application Master deregisters itself with the Resource Manager and then terminates itself—but before that, the Application Master passes the management and metrics information it has
accumulated and aggregated from all the tasks of the job to the Job History Server so that it can be accessed later as needed.
Note: Job History Server
The application-specific Application Master is instantiated when a request comes in for that specific application and is terminated when the job
completes. The Application Master also accumulates and aggregates task metrics and logs—but what happens to those collected metrics and logs when Application Master terminates?
To preserve this useful data, the MapReduce application framework has
another daemon, the Job History Server. It is a kind of centralized location for storing all the job or task execution-related metrics and logs so that the
information is available even after the Application Master is terminated. By default, job history files are stored in the directory /mr-history/done on HDFS. The directories are organized by date when the MapReduce jobs were executed. To access this history log later, the Job History server provides REST APIs to get the status on finished applications.
Note: Timeline Server
Currently, the Job History server supports only MapReduce and provides information on finished MapReduce jobs only. This information is very
specific to MapReduce because it contains the MapReduce counter, Mappers, and Reducers. This is fine if you have only MapReduce, but YARN is
intended to be used beyond MapReduce. YARN thus introduces a new
Timeline Server to collect job execution logs beyond MapReduce, which will eventually take over the Job History server.
The Timeline Server provides generic information about completed applications and per-framework information for running, along with completed applications through two different interfaces.
Application-generic information provides an execution log related to YARN resource management or application-level data, such as queue name,
username, application attempts, containers used by each application, and metrics related to those containers. Application-specific information is specific to an application, as with MapReduce, counters, Mappers, and Reducers.
Based on the failure of the Application Master, the Resource Manager (or, more
specifically, the Application Manager component of Resource Manager) can relaunch the Application Master in another container.
Note: Multitenancy
You can think of Apache MapReduce, Apache Giraph, Apache Tez,
Distributed Shell, and Open MPI as applications. Each has an Application Master specific to it. Here, multitenancy means running these many types of applications side by side, targeting the same YARN-based cluster.