Conclusions - Load balancing of irregular parallel applications on heterogeneous computing envi

notably increases the chances of the thieves obtaining work from the remote targets in fewer attempts. We showed that the percentage of successful remote steal attempts under Feudal Stealing is much higher than under CRS or the Grid-GUM stealing. This shows that the Feudal stealing has exactly those benefits that we predicted – the ability to quickly obtain work over high latency networks, with few steal attempts.

5.5 Conclusions

In this chapter we attempted to answer the question of how to choose steal targets during the execution of irregular parallel applications. Our main assumption was that the use of dynamic information about PE loads can help in answering this question. We split the problem of using the load information to drive stealing decisions into two parts.

The first part deals with the issue of the most appropriate way of choosing steal targets, under the assumption that perfect (i.e. totally accurate) load information is present in the runtime system. We have investigated which algorithm for selecting the targets should thieves use in this settings. Additionally, we have investigated how much would the state-of-the-art work-stealing algorithms used in modern runtime sys- tems (described in Section 2.3.1) benefit from the presence of perfect load information. Since these algorithms were previously evaluated only for very regular applications (i.e. simple divide-and-conquer ones) on more homogeneous computing environments, the questions of how would they perform for highly irregular applications on highly heterogeneous computing environments, and whether they could benefit from the load information was open. In order to address both of these issues, in Section 5.2 we performed a number of experiments using different regular and irregular parallel applications on different computing environments

Our conclusions in the first part of the chapter were the following:

• The best way to use load information is to do stealing locally (within a cluster) and remotely (outside of a cluster) in parallel, and to choose random steal targets with work. We named this algorithm for choosing the targets the Perfect CRS algorithm, since it is essentially the CRS algorithm with added perfect load information.

• Of all the algorithms that we considered, when no load information is used the CRS algorithm gives the best speedups for most of the irregular parallel applications. However, for highly irregular applications, Hierarchical Stealing

outperforms it. On the other hand, when load information is added to the algorithms that we considered, the CRS algorithm performs the best for all these applications on all computing environments.

• All of the state-of-the-art work-stealing algorithms were able to benefit from the use of load information. This is especially the case for highly irregular parallel applications, where we have noted a significant improvement in their speedups with the use of load information. The improvements were especially significant on highly heterogeneous computing environments. The only exception was the Hierarchical Stealing algorithm, where we did not observe a clear correlation between the amount of improvement obtained with the use of load information and the heterogeneity of the computing environment.

After demonstrating that load information can indeed be useful in work-stealing, in the second part of this chapter we focused on the way in which an accurate approximation of this information could be obtained during the application execution. In Section 5.3 we presented the Feudal Stealing algorithm. This algorithm uses the same principle as CRS for selecting steal targets. In addition, it uses a combination of locally-centralised (i.e. within the cluster) and remotely-distributed ways of exchang- ing information about PE loads. The main goal of the exchange of load information is for PEs to obtain a reasonable estimation of the load of other PEs during the application execution. We can see Feudal Stealing as an “approximation” of the Perfect CRS algorithm in realistic conditions, where perfect load information is not available. In Section 5.4 we evaluated Feudal Stealing for the same set of applications considered in Section 5.2. We compared it with the Perfect CRS, CRS, Hierarchical Stealing and Grid-GUM algorithms. We showed that Feudal Stealing gives better speedups than either CRS or Grid-GUM, especially for highly irregular applications, for which we concluded in Section 5.2 that the use of load information is crucial. For these applications, the speedups under Feudal Stealing are close to those that were obtained under Perfect CRS. Furthermore, we showed that Feudal Stealing has a much higher percentage of successful remote (i.e. wide-area) steals than either CRS or Grid-GUM. This shows that its way of approximating the load information is better than the mechanism used in Grid-GUM. We also showed that Feudal Stealing gives speedups that are approximately the same as under Hierarchical Stealing for highly-irregular applications on highly-heterogeneous computing environments, and that are better than speedups under Hierarchical Stealing for all other applications and computing environments.

5.5. CONCLUSIONS 163

In the next chapter, we turn our attention to the question of how victims should respond to the steal requests from thieves.

Chapter 6 Granularity-Driven Work Stealing

In this chapter, we consider the question of how a victim should respond to a steal attempt from a thief, i.e. what task to send as a response. We assume that the runtime system has information about the sizes (as defined in Section 4.2.1) of all tasks of the application being executed. In Section 6.2, we propose several task selection policies that victims can use to choose the tasks they send to the thieves. In addition to information about the task sizes, these policies also rely on information about the network topology of the underlying computing environment. In Sections 6.3 and 6.4, we present the evaluation of these policies using both simulations and their implemen- tation in Grid-GUM. We evaluate the extent to which the applications’ speedups can be improved under the policies we propose, in comparison to the trivial FCFS task selection policy, which is used in most of the state-of-the-art work-stealing algorithms. Most of the material in this chapter was published in Janjic and Hammond [JH10].

6.1 Introduction

In the previous chapter, we have considered the question ofwhere thieves should look for work during the application execution. To ensure that thieves are not idle for a long time, it is important to select the stealing targets smartly. We have seen that the strategy to do remote and local stealing in parallel seems to be the best choice for a wide range of parallel applications, especially if PEs have a good approximation of the loads of the PEs in the computing environment. However, we have thus far ignored another very important question that arises during work-stealing – namely, if a victim has more than one task in its task pool,what task should it send as a response to the thief’s steal attempt. Sending the “right” task to the thief can make that thief, and possibly neighbouring PEs as well, busy for a long time, thus eliminating the need for

them to look for work more often. This results in increased utilisation of thieves and, consequently, in the better speedups.

We proceed to investigate different policies for selecting the tasks to be sent as responses to the steal attempts. These policies also deal with selection of tasks that PEs should execute next (once the tasks currently being executed finish or are blocked). The situation where a PE has finished task execution and needs to choose the next task to execute constitute a special case of steal operation, where the PE steals a task from itself.

The most basic task selection policy is First-Come-First-Served (FCFS), where the victim always chooses the oldest task from its task pool, either for execution or to be sent to a thief. To consider any other policy only makes sense if the application’s tasks are sufficiently “different” in some way, and if there is a difference in cost of sending different tasks to different PEs. In other words, the application needs to have a non- zero degree of irregularity, and the computing environment needs to be heterogeneous to some extent.

All the work-stealing algorithms proposed in literature assume that FCFS is the best task selection policy to use when the thief and the victim are different, and Last-Come-First-Served (LCFS) when the thief and the victim are the same. This approach seems natural for simple divide-and-conquer applications, as the data locality is preserved (since PEs execute the tasks most recently created tasks and whose data is still found in cache memory). The remote thieves in this case receive large chunks of work, as older tasks tend to be larger that the newer ones.

We can see that the FCFS-LCFS policy described above uses some implicit as- sumptions about the tasks that comprise the application being executed, such that, the older the task is, the larger it is, and the more parallelism it creates. While this is generally the case for the SimpleDC applications, in the case of the DCFixedPar applications, for example, we see that some of the older tasks might be sequential, while some of the newer tasks might have a lot of parallelism. For irregular SingleDataPar applications, task size is unrelated to age. Yet another drawback of the FCFS-LCFS policy is that it does not distinguish between thieves that are at different distances (in terms of communication latency) from the victim. The oldest task will be sent to the thief regardless of whether it belongs to the same cluster as the victim or not. This may not always be the best solution.

For our purposes, we assume that the runtime system has a good approximation of the size of each parallel task in the application. We propose novel policies that use this information to make better task selection decisions. The approximation of the

In document Load balancing of irregular parallel applications on heterogeneous computing environments (Page 175-181)