GVT Computation - Adaptive techniques for scalable optimistic parallel discrete event simulation

CHAPTER 6 CONCLUSION

6.2 GVT COMPUTATION

Chapter 4 explores a number of different GVT algorithms and gives an in-depth analysis of their effects on key simulation characteristics. We first look at the basic Blocking GVT computation and show the concrete ways in which different configurations affect synchronization costs, event efficiency, and overall simulation performance. This analysis is used throughout the rest of the work as a baseline against which to evaluate other techniques. It also reveals a strong coupling between event efficiency and synchronization costs, and shows the importance of achieving a good tradeoff between these two quantities. As the thesis progresses, we aim to decouple these two quantities so that each can be treated separately.
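To make the two coupled quantities concrete, the following is a minimal sketch rather than the simulator's code. It assumes the common PDES definition of event efficiency as the fraction of executed events that are eventually committed (the exact definition used in Chapter 4 may differ), and it assumes that under a blocking scheme every processor has stopped and all in-flight events have been delivered before the minimum is taken. The function names are hypothetical.

```python
def event_efficiency(committed: int, executed: int) -> float:
    """Assumed definition: committed events divided by all executed events.

    'executed' includes events that were later rolled back.
    """
    if executed == 0:
        return 1.0  # nothing executed, so nothing was wasted
    return committed / executed


def blocking_gvt(local_minimums):
    """GVT under a blocking scheme (assumption: no events are in flight).

    Each entry is the smallest unprocessed timestamp on one processor.
    """
    return min(local_minimums)


efficiency = event_efficiency(committed=80, executed=100)
gvt = blocking_gvt([12.0, 7.5, 9.0])
```

Computing the GVT more often bounds optimism and tends to raise efficiency, but every call to a blocking computation stalls event execution, which is the coupling described above.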

We then show an effective and completely non-blocking implementation of Mattern’s Phase-Based GVT algorithm [19], which again relies on message-driven execution to work effectively. Our implementation allows the GVTManager and Scheduler to each perform their own communication and computation, while allowing the runtime system to dynamically schedule each piece. No special communication routines needed to be written, and no changes to the Scheduler were required for it to run alongside the GVTManager. Because event execution does not block during the GVT computation, synchronization costs are effectively reduced to zero, and most models and configurations saw significant speedups. However, two problems remained. First, event efficiency was low because there was no longer a bound on the simulator’s optimism. Second, because the algorithm was not aware of any causality information, we often saw an increase in the number of GVT computations required to complete a simulation. This also sometimes made it difficult to configure models with high memory consumption.
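The core idea of a phase-based computation can be sketched in a few lines. This is a simplified, single-threaded illustration in the spirit of Mattern's coloring scheme, not the message-driven implementation described in this thesis. The class and function names (Proc, start_phase, try_gvt) are hypothetical, and the sketch assumes that all counters can be read at one instant.

```python
import math
from dataclasses import dataclass


@dataclass
class Proc:
    lvt: float                      # earliest unprocessed timestamp on this processor
    red: bool = False               # False = white (before the cut), True = red
    white_sent: int = 0
    white_recv: int = 0
    min_red_sent: float = math.inf  # smallest timestamp sent while red

    def send(self, timestamp):
        """Record a send and return the message tagged with the sender's color."""
        if self.red:
            self.min_red_sent = min(self.min_red_sent, timestamp)
        else:
            self.white_sent += 1
        return (self.red, timestamp)

    def receive(self, message):
        """Record a receive; the new event may lower the local minimum."""
        sent_while_red, timestamp = message
        if not sent_while_red:
            self.white_recv += 1
        self.lvt = min(self.lvt, timestamp)


def start_phase(procs):
    """First cut: every processor turns red. Event execution is never paused."""
    for p in procs:
        p.red = True


def try_gvt(procs):
    """Second cut: return a GVT estimate, or None while white messages are in transit."""
    in_transit = sum(p.white_sent for p in procs) - sum(p.white_recv for p in procs)
    if in_transit != 0:
        return None
    # Red messages may still be in flight, so their minimum send timestamp is included.
    return min(min(p.lvt, p.min_red_sent) for p in procs)


p0, p1 = Proc(lvt=10.0), Proc(lvt=20.0)
white = p0.send(5.0)            # sent before the phase starts
start_phase([p0, p1])
red = p1.send(8.0)              # sent during the phase, still in flight
first_try = try_gvt([p0, p1])   # the white message has not arrived yet
p1.receive(white)
gvt = try_gvt([p0, p1])
```

In this sketch the red message only contributes its timestamp as a lower bound; it is not counted, which mirrors the limitation that events sent after a phase begins are not fully accounted for until the next computation.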

Here we also propose a new GVT algorithm, the Adaptive Bucketed GVT algorithm. The algorithm is able to run almost completely independently of the simulation Scheduler, and never blocks event execution. Furthermore, it incorporates timestamp information from sent and received events to adapt to the flow of the simulation. This provides two important benefits. First, once the algorithm begins, newly sent events can still be incorporated into the computation. With the Phase-Based GVT algorithm, by contrast, events sent after a computation began were not incorporated until the next GVT computation. Second, it allows the algorithm to adaptively expand or contract the virtual time window it encompasses based on the progress of each processor. For the models tested, this results in far less collective communication required by the GVT algorithm. The algorithm still resulted in low event efficiency, similar to the Phase-Based algorithm. However, it also presented an avenue for dealing with this problem by using information already collected in the process of the computation. By allowing the GVTManager to selectively throttle certain events, we were able to significantly improve event efficiency while also lowering the total number of events and anti-events sent between remote processors. In certain model configurations, this resulted in higher event rates. It also became an important part of the work done in Chapter 5.
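One way to picture how timestamp information lets a computation cover a variable span of virtual time is to count sent and received events per fixed-width bucket of virtual time. The sketch below is an illustration under stated assumptions, not the algorithm from Chapter 4: the bucket width, the names BucketCounts and completed_buckets, and the completion rule (global sent equals global received, and every processor has advanced past the bucket) are all hypothetical, and the counters are read in a single consistent snapshot.

```python
from collections import Counter

BUCKET_SIZE = 10.0  # hypothetical bucket width, in units of virtual time


def bucket_of(timestamp):
    """Index of the virtual time bucket that contains this timestamp."""
    return int(timestamp // BUCKET_SIZE)


class BucketCounts:
    """Per-processor send and receive counts, keyed by bucket index."""

    def __init__(self):
        self.sent = Counter()
        self.recv = Counter()

    def on_send(self, timestamp):
        self.sent[bucket_of(timestamp)] += 1

    def on_recv(self, timestamp):
        self.recv[bucket_of(timestamp)] += 1


def completed_buckets(all_counts, local_vts, first_bucket):
    """Count consecutive buckets, starting at first_bucket, that are complete.

    A bucket is treated as complete when every processor's local virtual time
    is past it and every event sent into it has been received.
    """
    slowest = bucket_of(min(local_vts))
    done = 0
    bucket = first_bucket
    while bucket < slowest:
        sent = sum(c.sent[bucket] for c in all_counts)
        recv = sum(c.recv[bucket] for c in all_counts)
        if sent != recv:
            break
        done += 1
        bucket += 1
    return done


a, b = BucketCounts(), BucketCounts()
a.on_send(3.0)
a.on_send(14.0)
b.on_recv(3.0)                                        # the event at 14.0 is in flight
before = completed_buckets([a, b], [25.0, 22.0], 0)
b.on_recv(14.0)                                       # arrives while the check is pending
after = completed_buckets([a, b], [25.0, 22.0], 0)
gvt_estimate = after * BUCKET_SIZE
```

The number of buckets covered by one check grows when processors are far ahead and all traffic has settled, and shrinks when a single bucket still has events in flight, which is the expand-or-contract behavior described above. The same per-bucket counts are the kind of already-collected information a throttling policy could consult.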

6.3 DYNAMIC LOAD BALANCING

In Chapter 5 we looked at dynamic load balancing as another way in which our simulator could adapt to complex model workloads to improve performance. We started by evaluating a number of different load balancing algorithms on each model. Starting with the Blocking GVT algorithm, we showed that load balancing effectiveness depends strongly on the trigger used to determine when to compute the GVT. Whether the trigger was a specific event count or a virtual time window had a significant impact on what work had been completed when load balancing began. As a result, the trigger also had a significant impact on the loads that were presented to the load balancer, and affected how the load balancer chose which objects to migrate. This in turn affected which characteristics were most influenced by load balancing. When the trigger was an event count, load balancing had less effect on the balance of work across processors, but a more significant effect on event efficiency. When the trigger was a virtual time window, the opposite held true.
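The difference between the two triggers can be illustrated with a toy model. The sketch below is not taken from the simulator: it assumes each processor executes events at a fixed density (events per unit of virtual time), a hypothetical simplification, and reports how much work and how much virtual time each processor has covered when the trigger fires.

```python
def run_until_trigger(density, trigger, limit):
    """Return (events_executed, virtual_time_reached) when the trigger fires.

    density: events per unit of virtual time on this processor (hypothetical).
    trigger: "count" stops after `limit` events,
             "window" stops after `limit` units of virtual time.
    """
    if trigger == "count":
        return limit, limit / density
    if trigger == "window":
        return int(limit * density), limit
    raise ValueError(f"unknown trigger: {trigger}")


dense, sparse = 2.0, 0.5  # two processors with different event densities

count_dense = run_until_trigger(dense, "count", 100)
count_sparse = run_until_trigger(sparse, "count", 100)
window_dense = run_until_trigger(dense, "window", 50)
window_sparse = run_until_trigger(sparse, "window", 50)
```

With an event count trigger, both processors report the same amount of work but sit at very different virtual times. With a virtual time window, they stop at the same virtual time but report very different amounts of work, so the load balancer sees a different picture in each case.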

We next explored a variety of load metrics that could be used to attribute load to LPs. In traditional HPC applications, CPU time is used to assign load to objects. This is also what we used in our initial experiments. However, speculative execution means that CPU time also includes erroneous event executions and rollback work as part of an LP’s load, which intuitively seems problematic. A common solution in other PDES load balancing work has been to use the number of committed events to represent the load of each LP. We evaluated this along with a large number of other metrics and analyzed the results. We found that, as expected, CPU time often performed poorly compared to PDES-specific metrics. However, we also found that although committed events did sometimes prove more effective than CPU time, a number of other metrics performed even better. The problem with the committed event count is that it ignores incorrect work, and it may not be indicative of future workloads once the LP mapping changes. Even though incorrect work and subsequent rollbacks are not part of the final result, they still affect how long each processor takes to reach the next GVT computation. With blocking GVTs, an imbalance in that time can result in more overhead and more time spent idle. Here we showed that metrics which more directly capture total work, future work, or rate of progress can sometimes capture imbalance even better.
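The effect of the choice of metric can be seen in a small example. The sketch below is illustrative only: the LPStats fields, the metric names, and the numbers are hypothetical and are not measurements from this thesis. Imbalance is computed as the maximum processor load divided by the mean processor load.

```python
from dataclasses import dataclass


@dataclass
class LPStats:
    processor: int   # processor the LP is currently mapped to
    cpu_time: float  # seconds, including rolled-back and rollback work
    committed: int   # events that were eventually committed
    executed: int    # all executed events, including those later rolled back


# Hypothetical metrics: each maps an LP's statistics to a load value.
METRICS = {
    "cpu_time": lambda s: s.cpu_time,
    "committed": lambda s: float(s.committed),
    "executed": lambda s: float(s.executed),
}


def imbalance(stats, metric, num_procs):
    """Ratio of the maximum to the mean per-processor load under a metric."""
    loads = [0.0] * num_procs
    for s in stats:
        loads[s.processor] += METRICS[metric](s)
    mean = sum(loads) / num_procs
    return max(loads) / mean if mean > 0 else 1.0


# Two LPs commit the same number of events, but the second one rolls back often.
stats = [
    LPStats(processor=0, cpu_time=4.0, committed=100, executed=100),
    LPStats(processor=1, cpu_time=6.0, committed=100, executed=300),
]

by_committed = imbalance(stats, "committed", 2)
by_executed = imbalance(stats, "executed", 2)
by_cpu_time = imbalance(stats, "cpu_time", 2)
```

Under the committed event metric the two processors look perfectly balanced, even though one of them does three times as much total work before it reaches the next GVT computation.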

Finally, we showed that combining dynamic load balancing with the non-blocking GVT algorithms from Chapter 4 provides the best performance of all experiments performed in this thesis. First, we showed that load balancing was able to increase event rate by increasing event efficiency when running the PHOLD model using the Phase-Based GVT algorithm. The Traffic model, however, had such a low event efficiency that load balancing had difficulty improving performance. Based on these observations, we combined the Adaptive Bucketed GVT algorithm, adaptive event throttling, and dynamic load balancing to run the Traffic model. By completely decoupling concerns, we allow event throttling to handle event efficiency, load balancing to handle the mapping of workload, and the GVT algorithm to keep synchronization costs low. This yielded significant performance improvements for Traffic, and outperformed every other scenario from this thesis.
