Grid-GUM - Load Balancing - Load balancing of irregular parallel applications on heterogeneous

2.3 Load Balancing

2.4.3 Grid-GUM

Grid-GUM [AZ06] is the extension to the GUM runtime system. Grid-GUM aims to adapt GUM to the Grids. It is build on top of Globus [Fos03] Grid middleware and it uses the MPICH-G2 [KTF03] implementation of MPI communication library.

The main difference between Grid-GUM and GUM lies in the work-stealing algorithm used for load balancing. Grid-GUM replaces the Random Stealing algorithm used in GUM (which performs very well on low-latency homogeneous systems, but badly on Computational Grids) with the advanced, adaptive work-stealing mecha- nism, based on the cheap exchange of load information between PEs and its use in the selection of steal targets. It accounts for different latencies between the PEs and different PE computing capabilities, and tries to adapt the load distribution to it. The load-balancing algorithm is described in detail in Al Zain et al. [AZTML06] and eval-

uated in Al Zain et al. [AZTML08]. Since this algorithm is important for this thesis, as it uses the load information in making steal decisions, we now describe it in more detail.

Grid-GUM assumes that the set of all PEs is organised into a set of clusters. Compared to the GUM work stealing, Grid-GUM makes the following improvements :

• The fastest PE (in terms of computing capability) from the largest cluster is nominated as the main PE. Since the assumption is that the main thread will generate the most of the parallelism (typical situation in divide-and-conquer and data-parallel applications), in this way it is ensured that the most work is generated in the largest cluster.

• Grid-GUM differs between steal attempts (FISH messages) that arrive from the same cluster and the ones from remote clusters. If the FISH arrives from the same cluster, one spark is sent as a response in a SCHEDULEmessage (as is done in GUM). On the other hand, if it originates from a different cluster, then Grid- GUM tries to send more sparks in the single SCHEDULE message, to cover the possibly high latency. In this case, it tries to balance the number of sparks (relative to the PE computing capability) between the two PEs involved in the steal operation.

• In Grid-GUM, the PEs exchange information about the load of the environment as the application execution progresses. The load information (the number of sparks and threads that each PE has in its pools, together with the timestamps when this information was last updated) that a PE has is attached to eachFISH

and SCHEDULE message sent by it. When some other PE receives the FISH or

SCHEDULEmessage, it compares the timestamps of the load information from the

message with the timestamps of its own load information, and updates the latter one for each PE for which it received a more recent information. Conversely, the load information in the message is also updated for all PEs for which the PE that receives it posses more recent information. In this way, hopefully, every PE in the environment will have the accurate information about the load of every other PE. When a PE gets idle, it will know where it is most likely to find some work, and it chooses the target for FISH message accordingly (see below). It is worth noting that, since the load information is attached to FISH

and SCHEDULE messages that would be exchanged anyway, the exchange of load

information in Grid-GUM comes at almost no extra cost (except for the small constant additional overhead in processing these two types of messages).

2.4. (PARALLEL) FUNCTIONAL PROGRAMMING 47

• Each PE also has a communication map table, which holds the current communication latencies between that PE and every other PE in the environment (again, together with the timestamps of these information). This table is con- sulted when the PE needs to choose the destination for sending (or forwarding) a FISH message. If the PE ’knows’ about more than one other PE that has some work to offload, it will prefer sending a FISH message to the one which is the nearest to it (i.e. the one for which the communication latency to it is the lowest).

The Algorithms 5 and 6 (taken from Al Zain [AZ06]) give the pseudocodes for the Grid-GUM scheduling loop and the function that processes theFISHmessage (the central parts of Grid-GUM adaptive load balancing system).

Algorithm 5 Grid-GUM schedulingLoopStep() function 1: repeat

2: processMessages()

3: if run queue empty then

4: if no sparks in spark pool >low-watermark then 5: Choose spark s for local execution

6: Turn s into a thread t 7: Add t to run queue 8: else

9: update local load info

10: calculate localRatio=localSpeed/localLoad 11: sort communication latencies om ascending order 12: for each destPE in a system do

13: calculate destRatio = destSpeed/destLoad 14: if localRatio>destRatiothen

15: attach to FISH message my system load information (together with timestamps)

16: end if

17: send FISH message to destPE

18: break

19: end for

20: if not main PE and no FISH message sent then 21: destPE = mainPE

22: attach to FISH message my system load information (together with timestamps)

23: Send FISH message to destPE 24: end if

25: end if 26: end if

27: if run queue not empty then

28: Run one of the threads in run queue 29: if thread blocked on remote data then 30: Send FETCH message

31: end if 32: end if

2.4. (PARALLEL) FUNCTIONAL PROGRAMMING 49

Algorithm 6 Grid-GUM processFISHMessage() function 1: update communication map from FISH message 2: update system load information from FISH message 3: if spark pool non-empty then

4: calculate localRatio (=localLoad/localSpeed) and originRatio (=originLoad/o- riginSpeed)

5: if originRatio >localRatio then

6: if originPE and localPE withing the same cluster then 7: send one spark in SCHEDULE message to originPE 8: else

9: localClusterRatio=localClusterLoad/localClusterSpeed 10: originClusterRatio=originClusterLoad/originClusterSpeed 11: calculate noSparksToSend

12: send noSparksToSend sparks to originPE 13: end if

14: else

15: if FISH exceeded age then 16: return FISH to originPE 17: else

18: for each destPE in a system do 19: calculate destRatio

20: if originRatio>destRatiothen

21: send FISH (together with system load information and timestamps) to destPE

22: break

23: end if 24: end for

25: if not mainPE and no FISH message sent then

26: send FISH (together with system load information and timestamps) to mainPE

27: end if 28: end if 29: end if 30: end if

Chapter 3 SCALES Work-Stealing Simulator

In this chapter, we present SCALES, a work-stealing simulator for parallel applications on distributed computing environments. Section 3.1 gives an overview of SCALES. Sections 3.2 and 3.3 describe, respectively, the applications and the computing environments that SCALES can simulate, and Section 3.4 describes how the application execution is simulated. Section 3.5 gives an overview of other simulators for heterogeneous computing environments.

3.1 Overview of SCALES

SCALES can simulate the execution of a wide class of parallel applications using different work-stealing algorithms. Currently, it supports Random, Hierarchical, Cluster- Aware Random, Adaptive Cluster-Aware Random, Grid-GUM and Feudal (see Section 5.3 for more details) work-stealing algorithms. To implement a new work-stealing algorithm in SCALES, we simply need to implement the functions that deal with the selection of steal targets.

The level of abstraction provided by SCALES enables us to model applications which conform to many parallel programming paradigms, including divide-and-conquer, master-worker and data-parallel applications. One important restriction, however, is that the application modelled cannot have any non-trivial data-dependencies between the tasks forked by the same parent task. It is also possible to simulate heterogeneous computing environments, where heterogeneity comes from different communication latencies between different PE pairs as well as different computing capabilities of PEs. Additionally, SCALES allows the tuning of overhead which certain operations take during the application execution (such as processing messages, task creation and task finish). This enables users to fine-tune the simulator, in order to model more accurately

specific runtime systems.

The following figure shows the output of an example run of SCALES simulator:

> sim_crs -p SimpleDC_10_5ms -s Grid_8_8 Total time 386002

Successful/Attempted steal messages ratio 0.048302 --

Tasks executed in cluster 0 - 687 Tasks executed in cluster 1 - 506 Tasks executed in cluster 2 - 567 Tasks executed in cluster 3 - 425 Tasks executed in cluster 4 - 453 Tasks executed in cluster 5 - 465 Tasks executed in cluster 6 - 491 Tasks executed in cluster 7 - 500

For the simulation of each application execution, SCALES requires three pieces of information:

• work-stealing algorithm being simulated (sim_crs in the above example, which represents Cluster-Aware Random stealing)

• the file with the application model (SimpleDC_10_5msin the above example, see Section 4.2 for details)

• the file with the computing-environment model (Grid_8_8in the above example, see Section 5.2).

SCALES returns the number of time units that the application execution takes. It also returns some statistics about the steal messages sent and the tasks executed. Precisely which statistics are to be returned is decided by user. In the above example, we have chosen to display the ratio between all steal messages sent and those successful in obtaining work. We have also chosen to display the number of tasks executed in each cluster, in order to see how evenly the load was spread across the environment.

In document Load balancing of irregular parallel applications on heterogeneous computing environments (Page 59-67)