5·3·5·6 Spark Pool Structure and Operation

Parallel execution is obtained through sparks. Pointers to annotated code fragments are called sparks, and are collected together on each processing element in a spark pool. A spark does not have the same structure (or weight) as a thread. It does not have a TSO — it is simply a pointer to a code fragment. This minimalism improves efficiency since the creation of a fully-weighted thread incurs an overhead of time and memory usage and is only necessary if the spark is to be executed. When a spark is to be executed the process of creating a thread can be undertaken; conversely, if a spark is to be discarded thread creation can be omitted. Thus the delaying of thread creation is beneficial. Sparks are purely advisory. If ignored the correct result is guaranteed through conservative evaluation. As indicated above, if a task’s thread pool is empty then it considers its spark pool. Once a spark is selected it is converted into a thread and (assuming that the thread has not already been evaluated via conservative evaluation or an earlier spark) it is added to the thread pool for scheduling. (The idea of retaining potential parallelism in an encapsulated form until it is required — if indeed it ever is — is called lazy task creation [Mohr, Krantz, and Halstead 1991 and 1995].)

If a parent thread requires the value of a closure it has sparked a child to evaluate, it attempts to evaluate it itself [Hammond and Peyton Jones 1992 and unpub.]. If the child has not started evaluation, the parent simply takes over, and the child can be discarded. If the child has commenced evaluation, the parent blocks and is resumed when the child completes the evaluation. Lastly, if the child has completed the evaluation, the result is extant for the parent to extract. This is the evaluate-and-die thread creation model [Peyton Jones et al. 1987; Loidl 1998] and has the advantage that notification between threads is only required when the original thread demands the result of the newly created thread, and hence it is safe to discard excess sparks if necessary [Hammond et al. 1994].
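The three cases of the evaluate-and-die model can be sketched as a small state machine. This is an illustrative Python model only, not GUM's runtime code: the names `State`, `Closure`, and `demand` are hypothetical, and evaluation is simulated by calling an ordinary function.

```python
from enum import Enum, auto

class State(Enum):
    UNEVALUATED = auto()       # the child has not started evaluation
    UNDER_EVALUATION = auto()  # the child has commenced evaluation
    EVALUATED = auto()         # the result is available

class Closure:
    def __init__(self, compute):
        self.compute = compute   # stands in for the sparked code fragment
        self.state = State.UNEVALUATED
        self.value = None
        self.blocked = []        # threads waiting for the result

def demand(closure, thread_name):
    """What a parent does when it needs a closure it has sparked."""
    if closure.state is State.EVALUATED:
        return ("value", closure.value)
    if closure.state is State.UNDER_EVALUATION:
        closure.blocked.append(thread_name)   # parent blocks until the child finishes
        return ("blocked", None)
    # UNEVALUATED: the parent takes over and the spark becomes redundant.
    closure.state = State.UNDER_EVALUATION
    closure.value = closure.compute()
    closure.state = State.EVALUATED
    return ("value", closure.value)
```

Because the parent either finds the result, blocks, or evaluates the closure itself, discarding the spark never changes the outcome in this model.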

An alternative thread creation model is the notification model in which the parent blocks when it encounters an unevaluated closure that it has sparked a child to evaluate. When the child completes the evaluation it notifies the parent thread. The advantage of this model [Hammond and Peyton Jones 1992 and unpub.] is that parent threads may suspend pending notification from several child threads. The disadvantage is that no child threads may be easily discarded.

The required and advisory spark pools are created to their full extent. A pointer ‘base’ is set to the start of each, while another pointer ‘limit’ is set to the end of each. Then, ‘head’ and ‘tail’ pointers are set to the ‘base’ to indicate that each spark pool is empty. A spark pool is full whenever its ‘tail’ pointer reaches its ‘limit’. A spark creation throttling strategy [Hammond and Peyton Jones 1990] is implemented: when a spark pool fills, spark creation is suspended and new sparks are discarded.
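The pointer arrangement and the throttling rule can be sketched as below. This is a minimal Python model under stated assumptions: the class name `SparkPool`, the use of list indices in place of pointers, and the reset of ‘head’ and ‘tail’ to ‘base’ once the pool drains are choices made for the example and are not taken from the GUM sources.

```python
class SparkPool:
    def __init__(self, capacity):
        self.slots = [None] * capacity  # pool created to its full extent
        self.base = 0                   # start of the pool
        self.limit = capacity           # end of the pool
        self.head = self.base           # next spark to remove
        self.tail = self.base           # next free slot
        self.discarded = 0              # sparks dropped by throttling

    def is_empty(self):
        return self.head == self.tail

    def is_full(self):
        return self.tail == self.limit

    def add(self, spark):
        """Add a spark at the rear; discard it if the pool is full."""
        if self.is_full():
            self.discarded += 1
            return False
        self.slots[self.tail] = spark
        self.tail += 1
        return True

    def remove(self):
        """Remove a spark from the front, or return None if empty."""
        if self.is_empty():
            return None
        spark = self.slots[self.head]
        self.slots[self.head] = None
        self.head += 1
        if self.is_empty():
            # Assumption: a drained pool is reset so its space can be reused.
            self.head = self.tail = self.base
        return spark
```

Discarding on overflow is safe only because sparks are advisory: the closure is still evaluated when its value is demanded.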

In its obtainable form [AQUA 1996], GUM adds new sparks to the rear of the spark pool (using an STG macro) and sparks are removed from the front (as part of the main runtime system cycle — but only if there are no runnable threads). When an attempt is made to remove one spark, the entire spark pool is actually processed: all candidate sparks are sparked into threads, all unsuitable sparks are deleted.

Once an annotated expression is sparked into a thread it competes equally with all mandatory threads (including those that subsequently return to the runnable thread pool from the blocked state) and other speculative threads. This should logically not be the case since the evaluation of the mandatory threads is of paramount importance. Further, converting all sparks into threads hinders load distribution since GUM utilises thread placement only and no threads are actually migrated. Additionally, at any time during execution, speculative threads can be seen to contribute to the final program outcome to varying degrees but the single par annotation obscures this and deems all

Chapter Five: Haskell & GHC

5·3·5·7 Spark Structure

As stated above, a spark is simply a pointer to graph code. It is free from the accounting and management detail of a TSO, and is therefore light-weight. In order to be executed, however, these details must be present, hence the need to spark it into a thread. In its Spartan form, however, there is no opportunity for storing associated information such as a priority, a name, a processing element allocation, a granularity measure, an opportunity cost, et cetera.

5·3·5·8 Load Distribution

GUM’s inherent load distribution is simplistic. A task executes mandatory threads until its thread pool is empty. If either of its spark pools is non-empty a spark is converted into a thread and the scheduler executes it. If, however, the spark pools are also empty then load distribution is begun.
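The scheduling decision described above can be sketched as follows. This is a minimal Python illustration rather than GUM's runtime code: the function name `next_action`, the list-based pools, and the choice to consult the required pool before the advisory pool are assumptions made for the example.

```python
def next_action(thread_pool, required_sparks, advisory_sparks):
    """Decide what a task does next (illustrative only)."""
    if thread_pool:
        # Runnable threads always come first.
        return ("run_thread", thread_pool.pop(0))
    # Assumption: the required pool is consulted before the advisory pool.
    for pool in (required_sparks, advisory_sparks):
        if pool:
            return ("convert_spark", pool.pop(0))
    # Nothing local to do: begin load distribution.
    return ("start_load_distribution", None)
```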

The receiver-initiated load distribution algorithm used by GUM [Trinder et al. 1996; Hammond et al. 1995; AQUA 1996] is as follows:

• one neighbouring processing element is selected at random;

• this task is sent a FISH message;

• if the receiving task has a spark that may be donated then this is packed and sent to the originating processing element for unpacking, conversion, and execution (assuming the originating processing element is still idle) and the FISH message is then destroyed; alternatively

• if the receiving processing element has no available sparks it forwards the FISH message to another randomly selected processing element.

Each FISH message contains an amount of ‘food’. Each ‘swim’ between processing elements consumes food and when a FISH message’s food supply is gone the FISH message is returned to the originating task. When a FISH message arrives back at its home the task’s state is checked. If it is still idle then the task waits “briefly” [Trinder et al. 1996, page 3; Hammond et al. 1995, page 4] before re-sending the now-replenished FISH message. This delay is to avoid saturating the machine with unserviceable FISH messages.

Once a task has sent a FISH message, it will not send another until either the sent FISH message returns ‘hungry’ or, if a spark has been received from another processing element, the task is again idle.
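The journey of a single FISH message can be sketched as a small simulation. This is an illustrative Python model, not the GUM implementation: the function `send_fish`, the dictionary of spark pools, the unit cost of one food item per swim, and the exclusion of the originating processing element from forwarding are all assumptions made for the example. It also assumes there are at least two processing elements.

```python
import random

def send_fish(origin, spark_pools, food, rng):
    """Route one FISH message from `origin`.

    spark_pools maps a processing element id to its list of sparks.
    Returns (donor, spark) on success, or (None, None) if the FISH
    runs out of food and returns home hungry.
    """
    others = [pe for pe in spark_pools if pe != origin]
    current = rng.choice(others)          # first target chosen at random
    while food > 0:
        food -= 1                         # each swim consumes food
        if spark_pools[current]:
            # Donate the first available spark; the FISH is then destroyed.
            return current, spark_pools[current].pop(0)
        # No spark here: forward to another randomly selected element.
        onward = [pe for pe in others if pe != current]
        if not onward:
            break
        current = rng.choice(onward)
    return None, None
```

For example, `send_fish(0, {0: [], 1: [], 2: ["s1"]}, food=10, rng=random.Random(0))` finds the spark on element 2 whichever element is tried first, because the FISH is forwarded until its food runs out.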

There are three messages associated with load distribution: FISH, SCHEDULE, and ACK. The use of the FISH message has already been discussed. The SCHEDULE message accompanies the transferred spark. Once the spark is safely installed at its new home, the local address-global address pair accompanying it is reconciled/adopted at the new home and the source task is notified of the successful arrival and definitive global address by an ACK message.

The following summarises the existing system in terms of the policies defined in Chapter Four:

• load estimation: not applicable;

• information: not applicable;

• initiation: receiver;

• location: random;

• selection: first;

• migration-limiting: unrestricted; and

• transfer: immediate.

5·3·5·9 Remote Behaviour

There are many closure types in GUM [AQUA 1996]:

• SPEC — closures requiring specialised garbage collection code including:

• STATIC — statically declared (non-heap) closures;

• CONST — constant closures;

• CHARLIKE — character closure;

• INTLIKE — integer closure;

• BH — black hole closure (indicating closure is under evaluation) upon which no other threads are waiting — see below;

• BQ — black hole closure (again indicating closure is under evaluation) that possesses a blocking queue of threads waiting for the result — see below;

• IND — indirection closure;

• FETCHME — place-holder closure inviting the first thread that enters to fetch the closure details from a remote processing element — see below;

• FMBQ — place-holder closure that contains a blocking queue of threads waiting for the remotely held information to arrive — see below;

• BF — a blocking queue entry (for a BQ or FMBQ) acting as a proxy for a remote task — see below;

• TSO — thread state object — see Section 5·3·5·4;

• STKO — stack object; and

• SPEC_RBH and GEN_RBH — revertible black hole for a SPEC or GEN closure respectively which effectively holds a duplicate of a remote closure and may or may not have a blocking queue — see below;

• GEN — closures requiring generic garbage collection code;

• DYN — dynamic closures;

• TUPLE — pointer-only closures;

• DATA — non-pointer-only closures;

• MUTUPLE — mutable pointer-only closures; and

• IMMUTUPLE — immutable pointer-only closures.

Six of these closure types are related to the parallel evaluation of the user’s program: BH, BQ, FETCHME, FMBQ, BF, and RBH. The messages FISH, SCHEDULE, ACK, FETCH, and RESUME are also related to the parallel evaluation of the user’s program.

When a local closure is entered to commence evaluation²⁰, it is turned into a BH closure. This effectively locks the closure and prevents modification by other threads. If other threads encounter a BH closure, the closure becomes a BQ closure and the entering threads’ TSOs are added onto the BQ closure’s blocking queue. When the closure evaluation is completed these TSOs are added again to the runnable thread pool and the closure is rewritten as its resultant closure type.
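The local transitions just described can be sketched as follows. This is a minimal Python model and not GUM's closure layout: the class `LocalClosure`, the functions `enter` and `update`, and the kind names "THUNK" (an unevaluated closure) and "VALUE" (the resultant closure) are hypothetical labels introduced for the example.

```python
class LocalClosure:
    def __init__(self):
        self.kind = "THUNK"        # hypothetical name: not yet entered
        self.blocking_queue = []   # TSOs waiting for the result
        self.value = None

def enter(closure, tso):
    """A thread (represented by its TSO) enters the closure."""
    if closure.kind == "THUNK":
        closure.kind = "BH"        # lock: evaluation has commenced
        return "evaluating"
    if closure.kind in ("BH", "BQ"):
        closure.kind = "BQ"        # a black hole with waiting threads
        closure.blocking_queue.append(tso)
        return "blocked"
    return "value"                 # already evaluated

def update(closure, value, runnable):
    """Evaluation has completed: wake the waiters and store the result."""
    runnable.extend(closure.blocking_queue)
    closure.blocking_queue = []
    closure.kind = "VALUE"         # hypothetical name for the resultant type
    closure.value = value
```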

When a FISH message arrives, a thread is sparked (if possible) and its closure is replaced with a revertible black hole (RBH) closure and the original is packed and sent as a SCHEDULE message. When the ACK message is returned (see Section 5·3·5·8) the RBH closure becomes a FETCHME (or global indirection) closure.

²⁰ A closure is evaluated by a thread when that thread’s program counter jumps to the code that the closure points to. The thread is said to enter the closure.

When an RBH closure is entered the TSO for the entering thread is placed on the end of the RBH closure’s blocking queue (perhaps creating it). When the closure becomes a FETCHME closure these TSOs are added again to the runnable thread pool and from there make their way back onto the FETCHME closure’s blocking queue.
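The life of a donated closure on the donating processing element can be sketched as below. This is an illustrative Python model, not GUM code: closures are plain dictionaries, messages are tuples appended to an `outbox` list, and the function names `donate`, `enter_rbh`, and `on_ack` are hypothetical.

```python
def donate(closure, fisher, outbox):
    """Reply to a FISH: leave an RBH behind and ship the original."""
    original = dict(closure)                 # packed copy of the closure
    closure.clear()
    closure.update(kind="RBH", queue=[])
    outbox.append(("SCHEDULE", fisher, original))

def enter_rbh(closure, tso):
    """A local thread enters the RBH and must wait."""
    closure["queue"].append(tso)

def on_ack(closure, global_address, runnable):
    """The ACK has arrived: the RBH becomes a FETCHME closure."""
    runnable.extend(closure["queue"])        # waiters become runnable again
    closure.clear()
    closure.update(kind="FETCHME", global_address=global_address)
```

In this model the woken threads simply re-enter the closure, which by then is a FETCHME, so they end up waiting for the remote result.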

When a FETCHME closure is entered, it becomes an FMBQ closure and the TSO of the entering thread is placed onto the FMBQ closure’s blocking queue. The details of the location of the closure referenced by the FETCHME closure are extracted and a FETCH message is sent to the relevant processing element requesting the closure (or its result). If other threads encounter an FMBQ closure those threads’ TSOs are added onto the FMBQ closure’s blocking queue. (There is no need to send additional FETCH messages as the first will be sufficient to ensure the result arrives when calculated.) When the closure evaluation is completed the TSOs from the blocking queue are added again to the runnable thread pool and the closure is rewritten as its resultant closure type.

When a FETCH message arrives, the closure is located and examined. If it is under evaluation the FETCH message is added to the blocking queue as a BF entry. If the closure has been evaluated to a simple type the value is packed and a copy of that closure sent as a RESUME message. If the closure has been evaluated to something other than a simple type, or has not been evaluated at all, it is replaced with an appropriate RBH closure and the original is packed and sent as a RESUME message. When the ACK message is received the RBH closure becomes a FETCHME closure.
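Both sides of the FETCH exchange can be sketched as follows. This is a simplified Python model under stated assumptions, not the GUM protocol code: closures are dictionaries, messages are tuples in an `outbox` list, the kind names "SIMPLE_VALUE" and "THUNK" are hypothetical, and treating only BH and BQ closures as "under evaluation" is a simplification made for the example.

```python
def enter_remote(closure, tso, outbox):
    """Requesting side: a thread enters a FETCHME or FMBQ place-holder."""
    if closure["kind"] == "FETCHME":
        closure["kind"] = "FMBQ"
        closure["queue"] = [tso]
        outbox.append(("FETCH", closure["global_address"]))
    elif closure["kind"] == "FMBQ":
        closure["queue"].append(tso)     # the first FETCH is sufficient

def handle_fetch(closure, requester, outbox):
    """Owning side: respond to an incoming FETCH message."""
    if closure["kind"] in ("BH", "BQ"):
        # Under evaluation: queue the request as a BF entry.
        closure["kind"] = "BQ"
        closure.setdefault("queue", []).append(("BF", requester))
    elif closure["kind"] == "SIMPLE_VALUE":
        # Evaluated to a simple type: send a copy and keep the original.
        outbox.append(("RESUME", requester, dict(closure)))
    else:
        # Otherwise: leave an RBH behind and send the original.
        original = dict(closure)
        closure.clear()
        closure.update(kind="RBH", queue=[])
        outbox.append(("RESUME", requester, original))
```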

When a RESUME message arrives, the closure is unpacked, the local address-global address pair accompanying it is reconciled/adopted at the new home, the blocking queues (if any) are reconciled, and the source task is notified of the successful arrival and definitive global address by an ACK message. The FETCHME/FMBQ closure is then rewritten and the blocking queue (if one exists) is awoken.

If a BF entry is found on any blocking queue when it is awoken the contents are not placed in the runnable thread pool (as is done for TSO entries). Instead, the entry is re-queued as an incoming FETCH message to be responded to during the runtime system

5·4 PVM

The use of specialised hardware as the basis of an implementation brings disadvantages. These include [Hammond unpub.] the fact that the processors (Motorola MC68020s and microcoded AMD 29000s) are no longer state-of-the-art. In order to increase the availability and accessibility of the GRIP software, and also to produce a portable system (and therefore retain currency), the GRIP software was ported to use PVM²¹ [Hammond unpub.].

PVM [Geist et al. 1994] is a software system that interfaces to a number of heterogeneous multiprocessors and multicomputers running the Unix operating system. In this way programs/systems that use PVM library routines can behave as if they are connected to a single multicomputer that possesses a distributed memory. Library routines exist for the definition of the virtual machine, the spawning of tasks onto processing elements, asynchronous communication between tasks on those processing elements, and synchronisation between tasks on those processing elements.

The use of PVM by GUM is a trade-off. The complexities of low-level communication, differing hardware communication routines, and heterogeneous computers and networks are avoided through its use. But this is at the cost of additional runtime overheads (see Chapter Ten). Certainly for development and portability purposes, the use of PVM (or a similar system) is expedient. If, however, speed is the ultimate goal for the GHC designers, an abstraction from the hardware may result in an unsatisfactory delay.

²¹ The software has also been ported to use MPI [Trinder, Barry, et al. 1998]. Information on MPI may be found in [MPI 1994].

Chapter Six: Related Work

In document Effective run time management of parallelism in a functional programming context (Page 117-125)