6·4·4 Diffusion Scheduling - Effective run time management of parallelism in a functional progr

Diffusion scheduling [Hudak and Goldberg 1984; Hofman and Vree 1992] is a load distribution algorithm in which load distribution activity occurs between neighbours25_. A thread transfer will occur between processing elements within a neighbourhood if a lightly loaded task becomes aware of a heavily loaded task with excess threads. Diffusion scheduling is based on the assumption (proven by Hong, Tan, and Chen [Hong, Tan, and Chen 1998] for a hypercube26_{architecture) that distributing the load} across a neighbourhood will result in distributed load across all neighbourhoods. The restriction of transferring only to neighbours attempts to ensure locality of reference to required data.

Hudak and Goldberg [Hudak and Goldberg 1984] describe their implementation of diffusion scheduling where through the use of heuristic functions a thread is migrated to a neighbouring processing element. The migration depends upon three factors: the processor and memory load on the particular processing element, the processor and

25_{A processing element neighbourhood is sometimes referred to as a}_{buddy set}_{, for example in [Chang} and Shin 1993].

memory load on the neighbouring processing elements (which advertise their load when it changes by at least 10%), and a weighted measure of the “direction” of the thread’s locality of reference. This algorithm requires static analysis to discover the locality of reference.

In the multicomputer implementation, Alfalfa, described in Goldberg’s doctoral thesis [Goldberg 1988], the diffusion scheduling algorithm has the following properties:

• communication occurs between neighbours only;

• each task remembers the load of its neighbouring tasks; • the load metric is the length of the runnable thread pool;

• the transfer policy depended only upon the length of the (potential) donor task’s runnable thread pool — if this was below a certain threshold no thread transfer would take place;

• the location policy varied from round-robin to least-loaded neighbour; and • the information policy dictated that load messages would be transmitted when

there was a “significant change” [Goldberg 1988, page 175] to the task’s load. Goldberg concluded that Alfalfa performed quite well for such simple selections gaining significant but sub-linear speed-up. No comment on the most appropriate location policy is given.

Lin and Keller [Lin and Keller 1987] propose an incremental, adaptive, diffusion scheduling, load distribution scheme (the “Gradient Model”) that is similar to spark percolation in that load distribution is between neighbours only and is receiver-initiated. Thread migration occurs in their scheme as a result of indirect requests from

neighbouring idle tasks. Their implementation considers locality of reference and possesses a restriction that a thread may not migrate wider than some threshold from its originating processing element. Their scheme does not involve speculative evaluation and makes no effort to be efficient, nor non-intrusive. Unfortunately, it is possible for tasks to become flooded (see below).

Hofman and Vree [Hofman and Vree 1992] include the following pitfalls of diffusion scheduling in their presentation:

Chapter Six: Related Work

• flooding — when a lightly loaded task receives threads from multiple heavily loaded neighbours (which cannot occur under the informed-receiver-initiated behaviour of spark percolation); and

• slow spreading of work — due to the ‘incremental’ distribution throughout

neighbourhoods the load distribution achieved through diffusion scheduling is relatively slow.

6·4·5 Stankovic’s Contribution

In 1984, Stankovic [Stankovic 1984] reported on simulation experiments for three adaptive decentralised load distribution algorithms of his devising. For all algorithms, each task sent an estimate of its load (the number of threads) to every other task in two second intervals.

The first scheme was a single threshold27_{scheme: the difference in load between two} tasks must differ by a certain amount before a thread transfer would occur. The second scheme was a double threshold scheme: if the difference in load between two tasks was greater than a first threshold but smaller than a second a thread transfer would occur. Interestingly, if the difference in load was greater than the second threshold value two threads would be transferred. The third scheme was a variation on the first in which the task that lost the thread would remember to which task it had transferred the thread; thread transfers to this task would be banned for some time period.

Stankovic found that for light and moderate loads, the first and third algorithms improved the performance significantly with the third algorithm’s performance surpassing that of the first at moderate loads. Overall, however, the second algorithm had the best performance but tended to move too many threads.

Given the results of Eager et al. [Eager et al. 1986a] the improvements seen due to the introduction of load distribution aren’t surprising. It is also not surprising that sending multiple threads to a processing element able to execute only one gave rise to excess thread movement! Although Stankovic doesn’t state the migration-limiting policy, allowing a thread to move only once would still cause problems if the execution time of a thread exceeds the load distribution period or if the level of parallelism resulted in the

creation of additional threads. Of some surprise also, is that the third algorithm (which prevented multiple transfers to the same task for a period of time) outperformed the first algorithm (which had no such limitation). It is likely under a real workload for threads spawned on the same processing element to be related in terms of data. This is the driving force behind data locality concerns. The third algorithm forcibly transfers such threads to different locations. It would be interesting to observe results of this load distribution algorithm on real, rather than artificially-created threads.

In document Effective run time management of parallelism in a functional programming context (Page 155-158)