Programming Models for Massively Parallel Machines Most applications written for the current generation of parallel computers

Multiphysics Simulations and Petascale Computing

3.3 Programming Models for Massively Parallel Machines Most applications written for the current generation of parallel computers

use a data parallel model of computation, also known as single program, multiple data (SPMD). In this model, all tasks execute the same algorithm on different parts of the data at the same time. (The tasks need not execute precisely the same instruction at a given instant, since some regions of the problem domain may require additional iterations or other special handling.) In contrast to data parallelism, task parallelism assigns different types of work to different threads or processes in an application. One way to implement task parallelism is by giving multiple threads in a process separate tasks to carry out on a common data set. In practice, however, scientific applications most often use threading as an extended form of data parallelism, such as running independent loop iterations in parallel. (Other types of applications, such as those that do commercial data processing, may have a genuinely task parallel architecture, but these are outside the scope of this chapter.)

One difficulty in writing an efficient data parallel program is keeping all the cores busy all of the time. Forcing some tasks to wait while others finish larger units of work can drastically reduce overall performance. Keeping a program well load-balanced becomes more difficult with increasing parallelism: There may not be a natural way to divide a problem into 100,000 or more equal parts that can be worked on concurrently. Even if partitioning the problem is easy, there may be so little work for each core to do that the cost of communicating data between the large number of tasks overwhelms the gains from increased

∗_{In the rest of this chapter, the term “processor” will refer to a discrete chip, and “core”} will refer to one or more CPUs on those chips. A “node” is a unit of the computer with one or more processors that share direct access to a pool of memory.

parallelism. The remainder of this section describes several approaches to using massively parallel machines eﬃciently.

3.3.1 New parallel languages

Several research groups are developing new languages for parallel programming that are intended to facilitate application development. Their goals include improving programmer productivity, using parallel resources more ef- fectively, and supporting higher levels of abstraction. Three of these new languages are being developed for the U.S. Defense Advanced Research Projects Agency (DARPA) program on High Performance Computer Systems (HPCS). They are Chapel [5] (being developed by Cray), Fortress [1] (Sun), and X10 [6] (IBM). All three languages feature a global view of the program’s address space, though in Chapel and X10 this global address space has partitions vis- ible to the programmer. The languages also include numerous methods for expressing parallelism in loops or between blocks of code. Although these languages support task-level parallelism in addition to data parallelism, the developer still writes a single application that coordinates all activity.

While new languages may simplify writing new parallel applications from scratch, they are not an attractive means to extract further parallelism from existing large parallel codes. At the Lawrence Livermore National Laboratory (LLNL), for example, several important simulation codes have been under continuous development for more than ten years, and each has a million or more source lines. Rewriting even one of these applications in a new language, however expressive and eﬃcient it may be, is not feasible.

3.3.2 MPI-2

Another approach to writing a task parallel application is to use the job- creation and one-sided communication features in the MPI-2 standard [11]. MPI-2 allows a code to partition itself into several parts that run diﬀerent executables. Processes can communicate using either the standard MPI message passing calls or the “one-sided” calls that let processes store data in another process or retrieve it remotely.

While MPI-2 has the basic capabilities necessary to implement task parallel applications, it is a low-level programming model that has not yet been adopted as widely as the original MPI standard. One obstacle to broad adop- tion of MPI-2 has been the slow arrival of full implementations of the standard. Although at least one complete implementation appeared shortly after MPI-2 was ﬁnalized in 1997, MPI-1 implementations remain more widely available.

3.3.3 Cooperative parallelism

To oﬀer a diﬀerent way forward, LLNL is developing an alternative parallel programming model, called cooperative parallelism [14], that existing

applications can adopt incrementally.

Cooperative parallelism is a task parallel, or multiple program, multiple data (MPMD) programming model that complements existing data parallel models. The basic unit of computation is asymponent (short for “simulation component”), which consists of one or more processes executing on one or more nodes. In other words, a symponent can be a single-process program or a data parallel program. A symponent has an object-oriented interface, with defined methods for carrying out tasks. When an application starts, it allocates a fixed set of nodes and processors on a parallel machine, and the runtime system (called Co-op) launches a symponent on a subset of these nodes. Any thread or process within this initial symponent can launch additional symponents on other processors in the pool, and these new symponents can issue further launch requests. Each symponent can run a different exe- cutable, and these executables can be written entirely separately from each other.

When a thread launches a symponent, it receives an identifying handle. This handle can be passed to other threads or processes, and even to other symponents. Any thread with this handle can issue a remote method invo- cation (RMI) on the referenced symponent. Remote methods can be invoked as a blocking, nonblocking, or one-way call. (One-way RMI calls never return data or synchronize with the caller.)

Except for their RMI interfaces, symponents are opaque to one another. They do not share memory or exchange messages. Issuing an RMI to a symponent requires no previous setup, other than obtaining the handle of the target symponent, and no state information regarding the RMI persists in the caller or the remote method after the call is completed. This design avoids potential scaling issues that would arise if each of many thousands of symponents maintained information on all the symponents it had ever called or might potentially call in the future.

Symponents can terminate other symponents, either when they detect an error or for other reasons, and a parent symponent can be notiﬁed when its child terminates. Co-op is implemented using Babel middleware [7], which supplies the RMI functionality and also allows functions written in diﬀerent languages to call each other seamlessly. This means that symponents written in C, C++, and Fortran can call each other without any knowledge of each other’s language.

3.3.4 Example uses of cooperative parallelism

Cooperative parallelism enables applications to exploit several diﬀerent kinds of parallelism at the same time. Uses for this ﬂexibility include factoring out load imbalance and enabling federated computations.

Factoring out load imbalance. Some simulations incur load imbalance because extra calculations are necessary in certain regions of the problem domain. If the application synchronizes after each time step, then all the

Balanced data parallel computation

Server proxy

Server pool

FIGURE 3.1: Cooperative parallelism can improve the performance of unbal- anced computations. The boxes on the left represent a group of processors running a data parallel simulation. The highlighted boxes are tasks that have additional work to do, and they sublet this work to a pool of servers on the right. Each server is itself a parallel job. A server proxy keeps track of which servers are free and assigns work accordingly. This arrangement improves load balance by helping the busier tasks to ﬁnish their work more quickly.

tasks may wait while a subset of them complete these extra calculations. To reduce the time that the less-busy tasks wait for the busy one, the developer can assign the extra calculations to a pool of server symponents, as shown in Figure 3.1. Here, a large symponent running on many cores executes a normal data parallel computation. When any task determines that it needs additional computations, it sends a request to aserver proxy,which forwards the request to an available server in a pool that has been allocated for this purpose. To the main simulation, this request looks like a simple function call, except that the caller can proceed with other work while it waits for the result. It can even submit additional nonblocking requests while the first one is executing. The servers themselves may run as parallel symponents, adding another level of parallelism. The server’s ability to apply several processors to the extra calculation, combined with the caller’s ability to invoke several of these computations concurrently, allows the caller to finish its extra work sooner. This reduces the time that less-busy tasks need to wait, so the overall load balance improves. We expect this approach to be helpful in materials science applications and a number of other fields.

Federated computations. A more ambitious use of cooperative parallelism is to build a large application from existing data parallel codes. For example, a multiphysics model of airﬂow around a wing could combine a ﬂuid dynamics model with a structural mechanics model; or a climate application could combine models of the atmosphere, the ocean, and sea ice. In either case, a separate symponent would represent each independent element of the system. Periodically, the symponents would issue RMI calls to update either a central database or a group of independent symponents with data about

their current state. A fully general solution would employ a collective form of RMI, in which multiple callers could send a request to multiple recipients; however, cooperative parallelism does not yet support this model. Although federated simulations can be written with current technology, they may en- counter diﬃcult load balance problems and other complications. Cooperative parallelism, when fully realized, should provide a simpler way to build multiphysics simulations as a federation of existing single-physics codes.

In document Petascale Computing Algorithms and Applications pdf (Page 105-109)