Load-based Work-stealing Algorithms - Work-stealing on Heterogeneous Computing Environments


5.1.1 Load-based Work-stealing Algorithms

Choosing a steal target when perfect load information is available is similar to choosing one without load information. The only difference is that, with perfect load information, a thief already knows which PEs have work to offload, so it can restrict the set of possible steal targets to those PEs (rather than considering all PEs as possible targets). The question is then whether it is best to choose a random target with work, to choose the nearest target with work, or to steal from local and remote targets with work in parallel in order to prefetch work, and so on. To answer these questions, we consider extensions to the state-of-the-art work-stealing algorithms presented in Section 2.3.1. At each step where a thief needs to choose a steal target, these extensions restrict the set of possible targets to those that have some work to offload. We also consider two algorithms (CV and HLV) that depend on the presence of load information and that, therefore, have no equivalent that does not use load information.

In the following discussion, each algorithm is executed by a thief whenever it needs to select a steal target. We assume that, in all algorithms, only one task is sent in each steal operation. The algorithms we consider are:

• Perfect Random Stealing. In the basic version of Random Stealing, a thief always chooses a random target from the set of all PEs in a system. In Perfect Random Stealing, a thief always chooses a random target from the set of all PEs with non-zero load. See Algorithm 9.

Algorithm 9 Perfect Random Stealing

S := set of all PEs that have work to offload
Select a random victim v from S
Send steal attempt to v
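As an illustration only, Algorithm 9 can be sketched in Python as follows. The function name `perfect_random_victim` and the `loads` dictionary (PE id mapped to the number of tasks in its pool) are hypothetical stand-ins for the perfect load information. Excluding the thief from the candidate set is an assumption of this sketch. The fallback to a uniformly random PE when no PE has work follows the behaviour this section describes for Perfect Random Stealing.

```python
import random
from typing import Mapping, Optional


def perfect_random_victim(loads: Mapping[int, int], thief: int,
                          rng: random.Random) -> Optional[int]:
    """Pick a steal target for `thief` under Perfect Random Stealing.

    `loads` is a hypothetical representation of the perfect load
    information: PE id -> number of tasks in that PE's task pool.
    """
    # S := all PEs (other than the thief, by assumption) with work to offload.
    candidates = [pe for pe, load in loads.items() if pe != thief and load > 0]
    if candidates:
        return rng.choice(candidates)
    # No PE has work: behave like basic Random Stealing over all other PEs.
    others = [pe for pe in loads if pe != thief]
    return rng.choice(others) if others else None
```

Passing the random number generator explicitly keeps the sketch reproducible, which is convenient in a simulation setting.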

• Perfect Hierarchical Stealing. Recall that in Hierarchical Stealing, the set of all PEs in the environment is organised in a PE-tree (based on communication latencies between PEs), and that a thief first looks for work in its whole PE-subtree; only if no work is found there does it ask its parent for work. Whereas the basic version always searches the whole PE-subtree of the thief before going up the tree, in the perfect version the thief asks a child for work only if that child’s PE-subtree has non-zero load (that is, if any PE in this PE-subtree has non-zero load). This child PE then recursively repeats the same procedure (i.e. it forwards the steal attempt to one of its children whose PE-subtree has non-zero load) until a PE with work is reached. If the thief’s whole PE-subtree has zero load, the thief forwards the steal attempt to its parent. In this way, the thief avoids searching for work in PE-subtrees that have no work to offload. See Algorithm 10.

Algorithm 10 Perfect Hierarchical Work Stealing

S := set of all thief’s children whose PE-subtrees have work to offload
if S is nonempty then
    v = random PE from S
else if thief has parent then
    v = thief’s parent
else
    v = NULL
end if
if v != NULL then
    send steal attempt to v
end if
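One step of Algorithm 10 might look as follows in Python. This is a minimal sketch under stated assumptions: the `PE` class, `add_child`, `subtree_load` and `next_hop` are hypothetical names, and recomputing the subtree load recursively on every call stands in for whatever global state a simulator would actually keep.

```python
import random
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)  # identity comparison; the tree contains parent/child cycles
class PE:
    name: str
    load: int = 0                                # tasks in this PE's own pool
    parent: Optional["PE"] = None
    children: List["PE"] = field(default_factory=list)

    def add_child(self, child: "PE") -> "PE":
        child.parent = self
        self.children.append(child)
        return child

    def subtree_load(self) -> int:
        # Total load of this PE and of every PE below it in the PE-tree.
        return self.load + sum(c.subtree_load() for c in self.children)


def next_hop(thief: PE, rng: random.Random) -> Optional[PE]:
    """Choose where `thief` sends (or forwards) a steal attempt."""
    # Children whose PE-subtree has non-zero load.
    loaded = [c for c in thief.children if c.subtree_load() > 0]
    if loaded:
        return rng.choice(loaded)
    # Nothing below the thief has work: go up, or give up at the root (None).
    return thief.parent
```

A PE that receives the attempt and has no task of its own would call `next_hop` again with itself as the thief, which gives the recursive descent described above.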

5.1. THE USE OF LOAD INFORMATION 97

• Perfect Cluster-aware Random Stealing. In Cluster-Aware Random Stealing (CRS), local and remote stealing are done in parallel. That is, a thief looks for work within its own cluster in parallel with looking for work remotely. Similarly to Perfect Random Stealing, in the version of the CRS algorithm that uses perfect load information a thief always sends the steal attempt to a PE that has some work to offload (if such a PE exists). If the steal attempt needs to be sent within the cluster, a random PE with work from the thief’s cluster is selected. Otherwise, in the case of remote stealing, a random PE with work from outside the cluster is selected (see Algorithm 11).

Algorithm 11 Perfect Cluster-Aware Work Stealing

if outsideStealing then
    S1 = set of all PEs outside of thief’s cluster that have work to offload
    if S1 is nonempty then
        v1 = random PE from S1
    else
        v1 = random PE outside of the thief’s cluster
    end if
    Send steal attempt to v1
    outsideStealing = true
end if
S2 = set of all PEs from thief’s cluster that have work to offload
if S2 is nonempty then
    v2 = random PE from S2
else
    v2 = random PE from thief’s cluster
end if
Send steal attempt to v2
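The target selection of Algorithm 11 can be sketched as below. The names `pick`, `perfect_crs_targets`, `cluster_of` and `loads` are hypothetical, and the `outside_stealing` argument simply mirrors the `outsideStealing` flag of the pseudocode. The sketch only selects the two victims; sending the steal attempts, and doing so in parallel, is left to the surrounding runtime or simulator.

```python
import random
from typing import Dict, List, Optional, Tuple


def pick(pes: List[int], loads: Dict[int, int],
         rng: random.Random) -> Optional[int]:
    """Random PE with work from `pes`, or any PE from `pes` if none has work."""
    with_work = [pe for pe in pes if loads[pe] > 0]
    pool = with_work or pes
    return rng.choice(pool) if pool else None


def perfect_crs_targets(thief: int, cluster_of: Dict[int, int],
                        loads: Dict[int, int], outside_stealing: bool,
                        rng: random.Random
                        ) -> Tuple[Optional[int], Optional[int]]:
    """Return (remote victim v1 or None, local victim v2 or None)."""
    home = cluster_of[thief]
    local = [pe for pe in loads if pe != thief and cluster_of[pe] == home]
    remote = [pe for pe in loads if cluster_of[pe] != home]
    # Remote victim is only chosen when outside stealing is enabled.
    v1 = pick(remote, loads, rng) if outside_stealing else None
    # Local victim is always chosen.
    v2 = pick(local, loads, rng)
    return v1, v2
```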

• Perfect Adaptive Cluster-aware Random Stealing. Recall that in Adaptive Cluster-Aware Random Stealing (ACRS), when a thief steals remotely, the probability that it chooses a target from a given cluster is inversely proportional to the communication latency between the thief and that cluster (relative to the latencies to the other clusters). In other words, the probability of thief $p$ choosing a target from cluster $Q$ for a remote steal attempt is equal to

$$\frac{1/\mathrm{lat}(p,Q)}{\sum_{C \in S} 1/\mathrm{lat}(p,C)},$$

where $S$ is the set of all remote clusters. After the cluster is chosen, the steal attempt is sent to a random PE from it. Thieves therefore prefer remote stealing from nearer clusters. In Perfect ACRS, the set of clusters considered is restricted to those that have some PEs with work and, once the cluster is selected, the steal attempt is sent to a random PE with work from it. See Algorithm 12.

• Closest-Victim (CV) Stealing. In this algorithm, a thief always chooses the closest target with work. Note that this algorithm is similar to the perfect version of Grid-GUM work stealing, with the difference that only one task is transferred between thief and victim. In Grid-GUM, a thief chooses the closest target for which it assumes (based on the load information it has, which might be inaccurate or outdated) that it has some work to offload. In CV stealing, the load information that thieves have is perfect, so the closest victim that really has some work is chosen. See Algorithm 13.

• Highest-Loaded-Victim (HLV) Stealing. In this algorithm, a thief chooses a target with the highest load (in our case, the largest number of tasks in its task pool). See Algorithm 14.
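The CV and HLV selection rules differ only in the key used to rank the PEs that have work, which the following sketch tries to make explicit. The function names, the `latency` dictionary (latency from the thief to each PE) and the `loads` dictionary are hypothetical. How ties are broken is not specified here; the sketch takes the first candidate found by `min` or `max`. The random fallback when no PE has work mirrors Algorithms 13 and 14.

```python
import random
from typing import Dict, List, Optional


def _others(thief: int, loads: Dict[int, int]) -> List[int]:
    return [pe for pe in loads if pe != thief]


def closest_victim(thief: int, latency: Dict[int, float],
                   loads: Dict[int, int], rng: random.Random) -> Optional[int]:
    """CV: the lowest-latency PE that has work, else a random PE."""
    with_work = [pe for pe in _others(thief, loads) if loads[pe] > 0]
    if with_work:
        return min(with_work, key=lambda pe: latency[pe])
    others = _others(thief, loads)
    return rng.choice(others) if others else None


def highest_loaded_victim(thief: int, loads: Dict[int, int],
                          rng: random.Random) -> Optional[int]:
    """HLV: the PE with the most tasks in its pool, else a random PE."""
    with_work = [pe for pe in _others(thief, loads) if loads[pe] > 0]
    if with_work:
        return max(with_work, key=lambda pe: loads[pe])
    others = _others(thief, loads)
    return rng.choice(others) if others else None
```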

Algorithm 12 Perfect Adaptive Cluster-Aware Work Stealing

if outsideStealing then
    S1 = set of all remote clusters that have at least one PE with non-zero load
    if S1 is nonempty then
        SumLat = $\sum_{C \in S1} \frac{1}{\mathrm{lat}(\mathrm{thief}, C)}$
        for all C ∈ S1 do
            prob(C) = $\frac{1/\mathrm{lat}(\mathrm{thief}, C)}{\mathrm{SumLat}}$
        end for
        D = random cluster from S1 (where cluster C is selected with probability prob(C))
        v1 = random PE with work from D
    else
        v1 = random PE outside of the thief’s cluster
    end if
    Send steal attempt to v1
end if
S2 = set of all PEs from thief’s cluster that have work to offload
if S2 is nonempty then
    v2 = random PE from S2
else
    v2 = random PE from thief’s cluster
end if
Send steal attempt to v2

Algorithm 13 Closest Victim Work Stealing

v = closest PE to the thief that has some work to offload
if v == NULL then
    v = random PE
end if

Note that in all of the algorithms that we consider, it can happen that a thief needs to choose a target, but all of the targets being considered have zero load. For example, a thief may need to do local stealing in the CRS algorithm when all local targets have zero load. In this case, we have decided that Perfect CRS and Perfect ACRS behave in the same way as their basic versions (i.e. they send the steal attempt to a random PE inside or outside of the cluster), and that Perfect Random, CV and HLV stealing behave in the same way as basic Random Stealing (i.e. they choose the target randomly from the set of all possible targets). Other possibilities are to delay sending a steal attempt until one of the targets being considered obtains some work, or to send the attempt back to where it originated, in which case the thief that started the stealing might wait for some time and then start the stealing process again. We have also tried these possibilities, but did not observe any notable difference in the performance of our algorithms.
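To make the remote-stealing branch of Algorithm 12 concrete, including the fallback used when no remote cluster has work, here is a minimal Python sketch. The names `cluster_probabilities`, `perfect_acrs_remote_victim`, `lat` (latency from the thief to each remote cluster), `members` (remote cluster to its PEs) and `loads` are all hypothetical. The weighting follows the inverse-latency form of Algorithm 12.

```python
import random
from typing import Dict, List, Optional


def cluster_probabilities(lat: Dict[str, float]) -> Dict[str, float]:
    """prob(C) = (1 / lat(thief, C)) / SumLat, with SumLat = sum of 1 / lat."""
    sum_lat = sum(1.0 / value for value in lat.values())
    return {c: (1.0 / value) / sum_lat for c, value in lat.items()}


def perfect_acrs_remote_victim(lat: Dict[str, float],
                               members: Dict[str, List[int]],
                               loads: Dict[int, int],
                               rng: random.Random) -> Optional[int]:
    # S1: remote clusters with at least one PE that has non-zero load.
    s1 = {c: lat[c] for c, pes in members.items()
          if any(loads[pe] > 0 for pe in pes)}
    if s1:
        prob = cluster_probabilities(s1)
        clusters = list(prob)
        chosen = rng.choices(clusters,
                             weights=[prob[c] for c in clusters], k=1)[0]
        return rng.choice([pe for pe in members[chosen] if loads[pe] > 0])
    # No remote cluster has work: behave like basic ACRS/CRS and pick a
    # random PE outside the thief's cluster.
    remote_pes = [pe for pes in members.values() for pe in pes]
    return rng.choice(remote_pes) if remote_pes else None
```

For example, with two remote clusters at latencies 1 and 4 (in the same unit), the inverse latencies are 1 and 0.25, so SumLat is 1.25 and the clusters are chosen with probabilities 0.8 and 0.2 respectively.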

Note also that, even with the perfect load information that we assume, it can happen that a target that receives a steal attempt does not have any tasks in its task pool. This can happen if the target’s load changes between the time a thief sends the steal attempt to it and the time when the target receives it. In this case, we assume that the target will forward the steal attempt to another appropriately chosen target. In Perfect Random, Perfect Hierarchical, Perfect CV and Perfect HLV, this new target is chosen in the same way as if the original target itself was a thief. In Perfect CRS and Perfect ACRS, the original target will always forward the steal attempt to some PE from its cluster. Additionally, if the steal attempt has visited some predefined number of targets, and it did not find work in any of them, it will be returned to the thief who started it.
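The forwarding behaviour described above, including the limit on the number of targets visited, could be modelled roughly as follows. This is only a sketch: `route_steal`, `choose` and `max_hops` are hypothetical names, and `choose` stands for whichever target-selection rule applies (for Perfect CRS and Perfect ACRS it would restrict forwarding to the target's own cluster). The `loads` dictionary represents the load each target actually has when the attempt arrives, which may differ from what the sender saw.

```python
from typing import Callable, Dict, Optional


def route_steal(thief: int, loads: Dict[int, int],
                choose: Callable[[int], Optional[int]],
                max_hops: int) -> Optional[int]:
    """Follow one steal attempt until it finds work or visits `max_hops` targets.

    Returns the victim that supplied a task, or None if the attempt is
    returned to the thief without work.
    """
    current = thief
    for _ in range(max_hops):
        # The next target is chosen as if `current` itself were a thief.
        target = choose(current)
        if target is None:
            return None
        if loads[target] > 0:
            loads[target] -= 1      # exactly one task per steal operation
            return target
        # The target's load changed before the attempt arrived: forward it.
        current = target
    return None
```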

Although, of course, the algorithms that we consider here do not cover all possible work-stealing algorithms that use load information, we feel that the scope of considered algorithms is broad enough to make conclusions about the way in which the load information is useful in work-stealing, as they cover most of the state-of-the-art algorithms used on distributed environments.

Algorithm 14 Highest-Loaded-Victim Work Stealing

v = PE with the highest load greater than 0
if v == NULL then
    v = random PE
end if

We will compare the algorithms described above with the basic versions of the Random, Hierarchical, CRS and ACRS algorithms.¹ Our specific goals in this section are:

1. We want to compare the speedups obtained under the Perfect Random, Perfect Hierarchical, Perfect CRS, Perfect ACRS, CV, HLV, Random, Hierarchical, CRS and ACRS algorithms for various applications on various computing environments, in order to discover which algorithm gives the best speedups for most of the application/computing environment combinations. This will tell us how best to choose steal targets when load information is available during application execution.

2. Additionally, we want to compare the Perfect Random, Perfect Hierarchical, Perfect CRS and Perfect ACRS algorithms against the Random, Hierarchical, CRS and ACRS algorithms, respectively, in order to find out how much the speedups under each of these algorithms can be improved if perfect load information is used. This will indicate how much better performance we can expect from each basic algorithm we consider (Random, Hierarchical, CRS and ACRS) if it has access to information about the load of the environment. This is useful for developers who must use one of these basic algorithms in their systems, because it lets them estimate whether it is worth extending the algorithm with mechanisms for estimating load information during application execution.

¹ In this section we will not consider Grid-GUM since, although it uses load information, it works with an approximation of PE loads rather than with perfect information. We will, therefore, postpone considering this algorithm until the next section, where we will compare it with Feudal Stealing.

Concerning the methodology for evaluating the performance of the algorithms, our aim of exploring the use of perfect load information leads us to use simulations. In real runtime systems on distributed computing environments, communication latency generally makes it impossible for PEs to have totally accurate load information about the rest of the environment. Even if each PE sends its load information to some central PE (to which all steal messages are also sent), by the time the message with the load information arrives at this central PE, the load of the PE from which the message originates might have changed. This is especially an issue in systems with high communication latencies. In simulations, however, we can assume that each PE has access to some kind of global system state, in which the information about the loads of all PEs is kept. This enables thieves to have accurate information about the load of all possible targets at the time when they need to decide where to send steal requests.

In document Load balancing of irregular parallel applications on heterogeneous computing environments (Page 109-115)