Transparent Resource Management - Support for flexible and transparent distributed computing

naturally to this paradigm. N-body simulations [GF96], genetic algorithms [CP97], Monte Carlo simulations [BRL99] and materials science simulations [PL96] are just a few examples of natural computations that fit in the generalized master-worker paradigm.

4.2.4 Developing Adaptive Applications

Developing an adaptive application is not much different from developing a traditional distributed application. If programmers choose to use the master-worker paradigm, no extra work will be added. If programmers choose to use the distributed object graph, programmers should design the object graphs according to the above instructions. With applying the object graphs, load balancing policies should also be embedded into the code. In the following chapters we will introduce several adaptive application examples developed based on the two paradigms.

4.3 Transparent Resource Management

We establish the flexible execution model by introducing the AA and adaptive applications. One of advantages of the model is that the execution can adapt to the dynamics of the resource environment. For example, if an application requires 100 processors to run, even though there is only 50 processors, the application does not wait for all the processors but still starts and adds processors on the fly when they become available. The problem is, if there is a QoS requirement, e.g. a deadline, how the application dynamically adds resources to meet the deadline? In this example, the application is supposed to run on 100 processors from beginning to meet the deadline. However, as we start the application with only 50 processors (to save the resource assembling time), during the execution the application must need more than 100 processors to compensate the time lost due to the insufficient resources in the beginning. The on-line resource allocation procedure could be complex with the QoS involved. In the meantime, it must adapt to the resource environment which could change at any time as well as the dynamics of the application behavior.

How to dynamically request and allocate resources to meet certain QoS is a challenge in the flexible distributed computing model. In order to introduce high transparency, the AA must be able to intelligent configure the resources without application’s explicit requests. Current automatic resource selections are based on the static (off-line) and information principles (recall Section 3.2.2), therefore we propose a dynamic automatic resource selection approach.

From the overview of the mode (Figure 4.2) we can see, the AA acts like a private agent to au- tonomously configure the execution environment; and the application is an autonomous system to adapt to any environments that the AA provides. This is similar to the concept of a closed-loop control system in the Control Theory. Motivated by the theory and reinforcement learning, we propose an on-line feedback-based automatic resource configuration to provide full resource management transparency to the application layer. As Figure 4.6 shows, during the execution the application continuously reports its execution feedback (i.e. a certain performance utility value) to the AA, which compares it with the performance reference (a QoS related value) and adjust the input (execution environment) to make the

4.4. Chapter Summary 50

AA Ap p lic atio n

Exec utio n enviro nment P erfo rmanc e feed b ac k

referenc e+

Enviro nment rec o nfigure

Figure 4.6: An overview of the online feedback-based automatic resource configuration.

performance to approach the reference value. This approach is an autonomous closed-loop and does not need the user’s participation.

By this means, with the AA’s other services, the AA provides comprehensive resource management including resource selection, resource discovery and resource allocation. While the application only needs to make use of the provided execution environment to get maximized performance. The relation- ship between the AA, the application, and the user can be interestingly described as a stage manager, a rockstar, and audience, where the manager takes care of decorating the stage (configure the execution environment), and the rockstar performs on this stage (application execution). If the stage is very good the rockstar can perform very well. If the rockstar does not like the stage, the manager must do some replacement (resource reconfiguration). The show’s performance does not necessarily depend on the stage but the stage deployment definitely influences the rockstar’s performance. Certainly, the audience do not need to do anything but enjoy the show (the user does not need to know anything about underlying management).

4.4 Chapter Summary

In this chapter, we have introduced the proposal of the two-layer flexible and highly transparent execution model. The key questions to implementing this model are: 1. how the AA configures the virtual machine to run applications with collaborating underlying DRMs, including how to contact DRMs to obtain resources, how to deploy processes, how to enable communication, etc; and 2. how the on-line resource configuration is implemented, including how the application reports performance feedbacks, how the AA learns from feedbacks and tunes the resource environment, etc. In the next chapters we give full descriptions of constructing the model.

Chapter 5 Application Agent

The Application Agent (AA) provides the platform and mechanism for the applications to dynamically interact and negotiate resource with the underlying resource environment on a one-to-one basis. In addition, it provides the mechanism for the applications to control computational resources irrespective of the location of the resources. The primary services it provides are:

Flexible Execution:

• Service 1. On-line resource allocation: Resources can be added/released on-the-fly according to the demand of application.

Transparent Execution:

• Service 2. Transparent resource access: The user or the application can have no knowledge of where and how resources are discovered and allocated. The application can be invoked as easily as starting a desktop program, rather than using job submission in traditional grid computing. • Service 3. Run applications anywhere: An application can be started on any Internet connected

machines and its components will be deployed on the remote resources automatically.

• Service 4. Automatic resource configuration1_{: The required resources are automatically selected}

to satisfy performance QoS requirements (e.g. deadline, budget, execution speed, etc), without explicit resource requests from the application.

Except those four, we provide the wide-area computing service to make AA more practical in current distributed computing field:

• Service 5. Wide-area computing: An application can be deployed and executed on resources from multiple clusters/domains.

In the following these services will be referred to explain how they are provided. The chapter is further organized as follows: Section 5.1 introduces the overall picture of the AA architecture and how it supports an application execution. Section 5.2 introduces the description of the API and how programmers embed its interface into their application code. Section 5.3 introduces the technical details of the

5.1. Overview 52 architecture of AA. Section 5.4 gives the current implementation of the AA. In Section 5.5 we evaluate the AA’s basic performance and demonstrate the ability of AA to support the flexible and transparent execution.

5.1 Overview

The AA is a software framework which is composed of a library of AA interface routines (API) that enable an application to be integrated with the AA, and a set of daemons that reside on the physical machines making up the virtual machine to execute the application. When an AA-enabled application is started, daemons will be started by the AA library. Those daemons are dedicated to this virtual machine, and have the same life cycle as the application. The AA has three basic types of daemon: LComm, PMW- Comm, and RD, which respectively take charge of local communication, process management/wide-area communication, and resource acquisition.

• LComm: Local Communication. Each process is associated with a LComm. It is responsible for message passing for the process with other processes that belong to the same domain.

• PMWComm: Process Management and Wide-area Communication. Each involved domain has one PMWComm. It is responsible for process startup/stop and cross-domain (cross firewall) message routing.

• RD: Resource Discoverer. Each application has one RD. It is responsible for resource discovery and allocation by contacting DRMs. It contains a NameServer component, which is responsible for assigning domain ID to generate process ID.

After an application is invoked, a LComm daemon and a RD daemon will be started on the local machine with the startup of the first process M (normally the master process). The application then requests to add processes/resources via AA API. After receiving a process addition request from the application (actually from the process M ), the AA library generates a resource request to the RD, which broadcasts the request to one or multiple DRMs to allocate a resource. RD will make use of the first returned eligible resource and keep the resources returned from other DRMs in a buffer in case that a similar request is made in the near future. After obtaining a resource, RD passes the granted resource information (e.g. host IP address) to the associated PMWComm in the resource domain to deploy a process on the resource. The PMWComm then finally passes the added process information back to the process M . A host thus has been added into the virtual machine for the application. Each process is assigned a Unique Process ID (UPID) which enables them to communicate with each other through LComm and PMWComm daemons.

Figure 5.1 depicts where the three types of daemons are placed in an AA virtual machine and an example of a virtual machine configuration. In this example, the AA leverages eight remote hosts to compose a wide-area virtual machine to run the application. Those hosts are from two environments: four of them are allocated in a SGE cluster. The hosts are discovered and granted through a SGE DRM.

In document Support for flexible and transparent distributed computing (Page 49-53)