5·3·4 Runtime Architecture

At runtime in parallel executions, the user’s compiled program is combined with the GUM runtime system (see Section 5·3·5) and the result is executed. The execution algorithm (also presented in detail in Section 5·3·5) relies upon communication between virtual processing elements, facilitated by a hardware abstraction, PVM (described in Section 5·4). This architecture is illustrated in Figure 5·2. In fact, this diagram illustrates the model of a multiprocessor, such as a Sun SPARCstation multiprocessor (as utilised in the implementation described in this thesis). One drawback of PVM when implemented on a SPARCstation architecture is the absence of a routine to place a thread on a given processor. This is not a failing of PVM, but is due to Sun’s restriction that programmers cannot allocate tasks, processes, threads, et cetera to a particular processor without using unadvertised kernel interfaces.

[Diagram labels: System Manager (SysMan); PVM task 1 … PVM task n; PVM Daemon (pvmd); PEs]

Figure 5·2: ‘Hardware’ architecture for parallel program execution on a multiprocessor.

Where a network of computers is used, each will run its own PVM daemon for communication with the others. This is shown diagrammatically (with only one PVM task per computer for simplicity) in Figure 5·3.

5·3·5 GUM

5·3·5·1 Introduction

The GHC compiler is, in fact, many compilers in one. Command-line options supplied at compile time enable or disable different facilities and select different behaviours [AQUA 1995]. When the -parallel option is provided on […] execution of programs: GUM. GUM [Trinder et al. 1996; Hammond et al. 1995; Hammond, Loidl, et al. unpub. a; Trinder, Barry et al. 1998 and unpub.] stands for “Graph reduction for a Unified Machine model”.

[Diagram labels: System Manager (SysMan); PVM task 1 … PVM task n; PEs; PVM Daemon 1 (pvmd), PVM Daemon 2 (pvmd), … PVM Daemon m (pvmd); Interconnection Network; Computer 1, Computer 2, … Computer m]

Figure 5·3: ‘Hardware’ architecture for parallel program execution on a network of computers.

GUM implements graph reduction using an abstract machine modelled on GRIP [Hammond and Peyton Jones 1992; Hammond unpub.; Hammond, Mattson, and Peyton Jones 1994]. GRIP consists of two types of processor: Processing Elements (PEs), which perform graph reduction, and Intelligent Memory Units (IMUs), which hold the shared graph and manage global thread pools.

Although GUM had its inception on the specialised GRIP architecture, GUM is now architecture neutral and portable [Trinder et al. 1996; Hammond et al. 1995; Trinder et al. 1998 and unpub.]. Earlier redesigns of the GRIP reduction system [Hammond, Loidl, Mattson, Partridge, Peyton Jones, and Trinder unpub. b] retained the IMUs, but in GUM these are removed by distributing the global memory amongst all PEs (rather than IMUs) using a globalised address space. (Each processing element has a local heap, which is independently garbage collected; the collection of all local heaps provides a virtual global heap.)

The abstract graph-reduction machine is an extension of the STG-machine, which was developed by Peyton Jones and Salkild [Peyton Jones and Salkild 1988 and 1989; Peyton Jones 1992]. The extensions [Loidl 1998] include a new closure type (the FETCHME closure; see Section 5·3·5·9), blocking queues (also see Section 5·3·5·9), and global addresses (see Section 5·3·5·2).

GUM utilises PVM for (asynchronous) communication via message-passing and thus obtains portability: PVM version 3 was designed to run on heterogeneous machines, with either shared or distributed memory, running a version of the Unix operating system. PVM is the subject of the next section.

GUM comprises one System Manager task with the remaining processing elements being ‘workers’. Each worker task has a copy of both the program and the runtime system with one processing element known to contain the ‘main program’ thread. At start-up, the system manager spawns the worker tasks and synchronises them. Each task then executes the runtime system which results in program evaluation, load distribution, garbage collection, et cetera. When the main thread terminates, the task indicates this by communicating with the System Manager which then initiates and coordinates the shutting down of all tasks.
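The start-up and shutdown sequence described above can be sketched as a small simulation. This is only an illustration under stated assumptions: the class and method names are hypothetical, no PVM calls are made, and placing the main thread on worker 0 is an arbitrary choice rather than something the text specifies.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Worker:
    """One worker task; conceptually holds the program and the runtime system."""
    task_id: int
    has_main_thread: bool = False
    state: str = "spawned"  # spawned -> ready -> running -> stopped


@dataclass
class SystemManager:
    """Hypothetical stand-in for SysMan; real GUM exchanges PVM messages."""
    workers: List[Worker] = field(default_factory=list)
    log: List[str] = field(default_factory=list)

    def start(self, n_workers: int) -> None:
        # Spawn the workers; exactly one holds the 'main program' thread
        # (worker 0 here, which is an assumption of this sketch).
        self.workers = [Worker(i, has_main_thread=(i == 0))
                        for i in range(n_workers)]
        self.log.append("spawn")
        # Synchronise: every worker is marked ready before any starts running.
        for w in self.workers:
            w.state = "ready"
        self.log.append("synchronise")
        for w in self.workers:
            w.state = "running"
        self.log.append("run")

    def main_thread_finished(self, worker: Worker) -> None:
        # Only the task holding the main thread reports termination.
        if not worker.has_main_thread:
            raise ValueError("only the main-thread task signals completion")
        # The manager then coordinates the shutdown of all tasks.
        for w in self.workers:
            w.state = "stopped"
        self.log.append("shutdown")


sysman = SystemManager()
sysman.start(4)
sysman.main_thread_finished(sysman.workers[0])
```

After the last call, `sysman.log` records the four phases in order and every worker is in the `stopped` state.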

Figures 5·2 and 5·3 illustrate the use of GUM and PVM together.

5·3·5·2 Operation

Each task maintains four main graph reduction-related data structures (apart from the graph itself): thread pools, a pool of required sparks, a pool of advisory sparks, and global address tables. Collections of threads are maintained as linked lists (see Section 5·3·5·3) and exist for the runnable thread pool, as well as for blocking queues. Each spark pool is a queue, allocated at initialisation as a contiguous block of memory large enough to hold a full pool; pointer variables are used to indicate that the pool is in fact empty. Required sparks are those created via explicit parallelism (i.e. within Concurrent Haskell [Peyton Jones, Gordon, and Finne unpub.; AQUA 1995] using the fork annotation). Advisory sparks are those created using the par annotation. See Section 5·3·5·6 for more details.

Closures possess an address local to the processing element on which they reside. Such local addresses are not unique between tasks. Closures known to tasks other than the one on which they reside have a global address (GA) through which all tasks can locate them. Two main tables must be maintained for using global addresses:


• the global-address-to-local-address (or GALA) table, which indicates the local address of a closure with a known global address (if the closure is not stored locally, a sentinel value is present); and

• the local-address-to-global-address (or LAGA) table, which indicates the global address of a local closure, if one exists (again, a sentinel value is present if the closure is not yet associated with a global address, i.e. is not global).

Initially the thread and spark pools are empty and no closures have global addresses. A timer is then initialised for the purpose of thread pre-emption and context switching (which, depending upon the scheduling algorithm in use, either ensures fair scheduling in a round-robin fashion or simply allows the runtime system to execute interspersed with the reduction of threads in an unfair manner). If the current task possesses the main program, a new thread is created for the main program and placed into the thread pool.
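The per-task spark pools and global address tables described in this section can be sketched as follows. This is a minimal illustration, not the GUM implementation: the class names, the `(task id, slot)` shape of a global address, the circular-buffer layout, and the policy of dropping a spark when the pool is full are all assumptions made for the example.

```python
from typing import Dict, List, Optional, Tuple

GlobalAddress = Tuple[int, int]  # assumed shape: (owning task id, slot number)
NOT_LOCAL = -1                   # sentinel: closure is not stored on this task


class SparkPool:
    """Fixed-capacity FIFO queue over a block allocated once, up front."""

    def __init__(self, capacity: int) -> None:
        self.slots: List[Optional[int]] = [None] * capacity  # full-size block
        self.head = 0   # index of the oldest spark
        self.count = 0  # zero means the pool is empty

    def is_empty(self) -> bool:
        return self.count == 0

    def add(self, closure_addr: int) -> bool:
        if self.count == len(self.slots):
            return False  # pool full: the spark is dropped (assumed policy)
        tail = (self.head + self.count) % len(self.slots)
        self.slots[tail] = closure_addr
        self.count += 1
        return True

    def take(self) -> Optional[int]:
        if self.is_empty():
            return None
        addr = self.slots[self.head]
        self.head = (self.head + 1) % len(self.slots)
        self.count -= 1
        return addr


class GlobalAddressTables:
    """The GALA and LAGA tables, modelled as two dictionaries."""

    def __init__(self) -> None:
        self.gala: Dict[GlobalAddress, int] = {}  # GA -> local address or NOT_LOCAL
        self.laga: Dict[int, GlobalAddress] = {}  # local address -> GA

    def globalise(self, task_id: int, local_addr: int) -> GlobalAddress:
        """Give a local closure a GA, reusing the existing one if present."""
        if local_addr in self.laga:
            return self.laga[local_addr]
        ga = (task_id, len(self.laga))
        self.laga[local_addr] = ga
        self.gala[ga] = local_addr
        return ga

    def note_remote(self, ga: GlobalAddress) -> None:
        """Record a GA whose closure lives on another task."""
        self.gala.setdefault(ga, NOT_LOCAL)

    def local_address(self, ga: GlobalAddress) -> int:
        return self.gala.get(ga, NOT_LOCAL)

    def global_address(self, local_addr: int) -> Optional[GlobalAddress]:
        # None plays the role of the 'not global' sentinel.
        return self.laga.get(local_addr)


# Initial state of a task: empty pools and no global addresses.
required_sparks = SparkPool(capacity=64)
advisory_sparks = SparkPool(capacity=64)
address_tables = GlobalAddressTables()
```

Tracking a head index and a count keeps the backing block at a fixed size for the life of the task, which mirrors the description of a pool allocated as if full but marked empty.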

The main runtime system algorithm cycle now begins. Firstly, the task inspects its incoming communication for requests from other tasks for closures. If such requests have been received, the closures referred to are located, and replies are constructed and sent. The thread pool is then examined. If it is empty, the spark pools are examined (first the required pool and, if that is empty, the advisory pool). If these are also empty, receiver-initiated load distribution (see Section 5·3·5·8) is attempted, and any incoming communication is processed. The main algorithm cycle then begins again.

If a spark pool was found not to be empty when it was examined, the head of the pool is extracted and the closure referred to by the spark is examined. If it is found to be evaluated already, the spark is discarded and the next one is chosen. If, conversely, the closure pointed to by the spark is unevaluated, the spark is turned into a thread (i.e. a new thread is created to evaluate the sparked closure) and this thread is placed into the thread pool. The main algorithm cycle then begins again.

If the thread pool was found not to be empty, the spark pools are not examined. Instead, all incoming communication is processed, the head of the thread pool is extracted, and execution of that thread resumes (or commences). This continues until the timer expires, at which point the thread is returned to the thread pool and the main algorithm cycle begins again.
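One iteration of the cycle described above can be sketched as a single function. This is a simplified model under stated assumptions, not the GUM scheduler: the task is a plain dictionary, the timer is modelled as a fixed number of reduction steps (`TIME_SLICE`), every thread costs `STEPS_PER_THREAD` steps, a finished thread simply marks its closure as evaluated, and load distribution is reduced to incrementing a counter. All names are hypothetical.

```python
from collections import deque

TIME_SLICE = 2        # assumed: reduction steps allowed before the timer fires
STEPS_PER_THREAD = 3  # assumed: cost of evaluating any one closure


def new_task(graph):
    """Create the state of one task; pools start empty."""
    return {
        "graph": dict(graph),     # local address -> "evaluated"/"unevaluated"
        "threads": deque(),       # runnable thread pool
        "required": deque(),      # required spark pool
        "advisory": deque(),      # advisory spark pool
        "requests": deque(),      # (requesting task, local address) pairs
        "inbox": deque(),         # other incoming messages
        "replies": [],
        "work_requests_sent": 0,
    }


def scheduler_step(task):
    """Run one iteration of the cycle; return a label for the branch taken."""
    # 1. Answer pending requests from other tasks for closures.
    while task["requests"]:
        requester, addr = task["requests"].popleft()
        task["replies"].append((requester, addr, task["graph"][addr]))

    # 2. A runnable thread exists: handle messages, then run it for one slice.
    if task["threads"]:
        task["inbox"].clear()  # stand-in for processing incoming messages
        thread = task["threads"].popleft()
        thread["steps_left"] -= min(TIME_SLICE, thread["steps_left"])
        if thread["steps_left"] > 0:
            task["threads"].append(thread)  # pre-empted: back into the pool
        else:
            task["graph"][thread["closure"]] = "evaluated"
        return "ran thread"

    # 3. No threads: try the required pool first, then the advisory pool.
    for pool in (task["required"], task["advisory"]):
        while pool:
            addr = pool.popleft()
            if task["graph"][addr] == "evaluated":
                continue  # spark discarded; choose the next one
            task["threads"].append(
                {"closure": addr, "steps_left": STEPS_PER_THREAD})
            return "activated spark"

    # 4. Nothing to do: ask other tasks for work (receiver-initiated).
    task["work_requests_sent"] += 1
    task["inbox"].clear()
    return "requested work"


task = new_task({1: "unevaluated", 2: "evaluated"})
task["advisory"].extend([2, 1])  # the spark for closure 2 is already stale
trace = [scheduler_step(task) for _ in range(4)]
# trace == ["activated spark", "ran thread", "ran thread", "requested work"]
```

In this run the stale spark for closure 2 is discarded, the spark for closure 1 becomes a thread that needs two time slices to finish, and the task then falls through to requesting work.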

When the main thread terminates, the task communicates this to the System Manager task, which then coordinates the shutting-down of all tasks.

In document Effective run time management of parallelism in a functional programming context (Page 110-114)