task size need not be perfect. We do not have to know the exact sizes of tasks, we only need to be able to compare two tasks, and determine which one is the larger.
Since the problem of selecting the task to execute or to send to the thief is orthog- onal to the problem of selecting the stealing targets, the policies that we propose here are not specifically tied to any of the work-stealing algorithms considered in Chapter 5. In other words, our policies can be readily “plugged” into any of these algorithms. This makes the results and the conclusions that we obtain in this chapter very general and applicable to a wide class of runtime systems.
6.2
Granularity-Driven Task Selection Policies
Figure 6.1: A computationally-uniform Grid
Since our focus is on computationally uniform distributed computing environments, the main problem for work-stealing that comes from the environment side is the pres- ence of different communication latencies between a victim and different thieves. Con- sider the example computing environment on Figure 6.1, where a cluster that comprise
8 PEs, located in St. Andrews, is connected to the similar cluster in Timisoara, Ro- mania. We can observe that a steal attempt can arrive to a victim from three different latency levels:
1. same-PE level – When a PE finishes executing a task (or when the executed task gets blocked on communication), it needs to choose one of the tasks from its task pool and execute it. As we have already mentioned before, we can see this as a special case of stealing, where a thief is attempting to steal from itself. The latency between the thief and the victim in this case is zero.
2. same-cluster (local, LAN) level - When a victim receives a steal attempt from a thief that is in the same cluster. We assume that in this case the latency between the thief and the victim is low, but that it is still higher than in the previous case.
3. remote-cluster (remote, WAN) level - When a victim receives a steal attempt from a thief from the remote cluster. In this case, we assume that the latency between the thief and the victim is much higher than in either of the previous two cases.
In this example, we distinguish between only two levels of communication latencies in the environment – one between PEs in the same cluster and one between PEs in dif- ferent clusters. Of course, in the environments consisting of many clusters, the latency between various clusters can be highly different (take, for example, more heterogeneous WorldGrid environments we studied in Chapter 5). In addition, some clusters might consist of multicore machines1, and the latency between different cores on the same machine is much lower than the latency between different machines. Therefore, both the same-cluster and remote-cluster levels can themselves consist of several different levels of latencies. However, most of the time it is enough to consider just two levels, since usually the latencies in the remote-cluster level are a few orders of magnitude higher that the latencies in the same-cluster level, and the differences between the latencies inside the individual levels are not too significant.
Consider what happens during the execution of parallel applications we consider in this thesis on distributed environments. At the beginning of the execution, the main task (possibly after some initial sequential computation) creates a set of child tasks and then blocks until all child tasks finish execution and (if they are offloaded) send their results back. These child tasks can, of course, create their own child tasks. If a
6.2. GRANULARITY-DRIVEN TASK SELECTION POLICIES 169
task is stolen from the PE that is executing its parent task, it needs to be transferred to the thief. In our simulated environment, each task can fit into a single message. In realistic runtime-systems, however, depending on the size of the data that needs to be transferred in order to execute the task, the task transfer may involve sending multiple messages from the victim to the thief. In the case of Grid-GUM, since data fetching is done lazily, the task transfer may involve the exchange of multiple FETCH
and RESUME messages, which increases the overhead in task transfer. After the task
finishes execution on the thief, its result needs to be sent back to the victim. Again, in realistic runtime systems, this may involve sending multiple messages, and in the case of Grid-GUM, the exchange of the series of FETCH-SCHEDULE messages is needed. From the discussion above, we can see that executing a task on some PE, other than the one where its parent task is, involves possibly large overheads. However, in many cases these overheads are constant, no matter how large (in terms of the task size, as defined in Section 4.2.1) the task is. Therefore, sending a large task from a victim to a thief has the same overhead as sending a small one. The question is whether we can use this fact, plus the knowledge about the sizes of application tasks to select different tasks at different latency levels in order to better hide the WAN latency (maybe at the expense of increasing the number of messages sent over LAN). When the tasks sizes are not known (or approximated) in advance of the execution, we can use the same policy which is generally used for the divide-and-conquer applica- tions – First-Come-First-Served (FCFS) policy at the same-cluster and remote-cluster levels, and Last-Come-First-Served (LCFS) at the same-PE level. The last task that a PE creates will be the first one to be executed on it, and the first task that the PE creates will be the first one to be sent to a thief. In the following discussion, we will refer to this policy simply as FCFS, and we will use it as a baseline against which we will compare more advanced policies.
Assuming, however, that we havea-priori information about the size of every task, we can instead organise the task pool into a priority list ordered by task sizes and then use different selection policies at different latency levels. We will consider the following granularity-driven task selection policies for the three different latency levels:
• Small-Small-Large (SSL). When responding to a steal attempt from the same- PE or the same-cluster level, select the smallest task. At the remote-cluster level, select the largest one. The rationale for this policy is that a victim wants to reserve as many large tasks as possible for the remote thieves, to compensate for the high communication costs that they will incur. The victim also wants to keep the remote thieves busy for a longer period than the PEs from the same cluster, so that they request work less often. This means that they would need to send fewer messages over high-latency networks, in the case that they repeatedly need to steal from the same victim. In this way, we aim to both improve the CPU utilisation and to decrease the number of messages sent (at least, the ones sent over high-latency networks), compared to the basic FCFS policy.
• Small-Large-Large (SLL). Choose the smallest task only for the same-PE level, and select the largest task for all other levels. The rationale for preferring this policy to SSL is that the only important thing when answering to a steal attempt is to send the large task, and that it does not matter whether the attempt comes from the same or different cluster. The goal here is only to avoid the overheads of offloading too small tasks, and not to save the largest ones for more remote PEs. Using this policy, the PE that creates parallel tasks should, therefore, execute all or most of the small tasks and only large tasks should ever be offloaded. This is the policy that is implicitly used in all algorithms considered in Section 2.3.1, as for the simple divide-and-conquer applications it is the same as the FCFS policy.
• Large-Large-Large (LLL). Always choose the largest task. This is the greedy approach, where we try to get the largest task executed as soon as possible (i.e. as soon as there is an idle PE to execute it). Smaller tasks are left to be executed later.
• Large-Large-Small (LLS). Choose the smallest task for the remote-cluster level, and the largest one for the same-PE and same-cluster levels. In this way, we hope that the PEs nearby the victim will execute all large tasks, and by the time they finish, the remote PEs will finish executing the small tasks and send their results back. This, perhaps counter-intuitive, policy is actually optimal when the number of tasks altogether is approximately the same as the number of PEs.
Note that we are not trying to achieve load balance in such a way that each PE will execute approximately the same number of tasks, since for irregular applications