Charades LB Framework - DYNAMIC LOAD BALANCING

CHAPTER 5 DYNAMIC LOAD BALANCING

5.1 Charades LB Framework

As discussed in Chapter 2, Charm++ offers robust support for dynamic load balancing. This support stems from the fact that the Charm++ runtime system manages the locations of parallel objects (called chares) for the user, as well as scheduling computation and communication for these objects. As such, there is a large body of research in dynamic load balancing that has been done on top of the Charm++ runtime system in many application domains and with varying goals [23, 25, 61, 26, 62, 27, 63, 24, 64, 65, 66, 67]. In this thesis, we build upon this research by exploring the effects of load balancing in optimistic PDES simulations where there is speculative execution. The speculative execution of events adds additional challenges to load balancing which we must address. First of all, effectively measuring the load of each object is not straightforward when considering the fact that work done by the object may not always contribute to the final simulation result. Secondly, changing the locations of objects may also change the total amount of work done

by the simulator by affecting the amount of speculative execution that requires rollbacks. To address these challenges, we test a variety of metrics for attributing load to objects which more effectively capture a useful notion of “load” in an optimistic simulation.

In Charades, load balancing is accomplished by migrating LPs during execution to achieve a more effective mapping of LPs to processors. In our design of Charades, described in Chapter 3, LPs are encapsulated within chares, which means that the Charm++ runtime system is able to manage their locations, computation, and communication. This also allows the runtime system to move LPs around during execution, as well as automatically measure load statistics for each LP. The combination of these two features allows the runtime to intelligently re-balance LPs to maintain an effective balance of load during execution of a simulation.

In order for an application to use dynamic load balancing it must do three things: • Inform the runtime system how to serialize and deserialize objects

• Inform the runtime system when to perform load balancing • Select one or more load balancing strategies and load metrics

Object serialization and load balancing timing is discussed in the following paragraphs. The specific strategies and metrics studied in this thesis are discussed in detail in the next section.

5.1.1 Object Serialization

In order to serialize and deserialize objects, the application uses the Charm++ Pack-and- UnPack (PUP) framework [68]. The PUP framework uses macros and operator overloading to allow applications to specify which parts of an object to migrate, and how to deal with more complex structures involving pointers. Each chare type has a virtual pup function defined in its base class, which users can override to instruct the runtime on how to migrate that chare. Migration occurs in three steps: sizing, packing, and unpacking. During each step, the same pup function is called, and passed a reference to a PUP object which behaves differently based on which step is being executed. During the sizing step, the PUP object determines the total size of the data being migrated, and allocates an appropriately sized buffer. During packing, the data to be migrated is copied to the buffer. The buffer can then be sent to the chares new location. During unpacking the data is copied from the buffer to appropriate fields of the newly allocated chare. Any fields not included in the pup function will not be migrated. An example pup function is shown in Listing 5.1. The | operator is overloaded and used to tell the PUP object which object members should be migrated.

Listing 5.1: Example PUP function 1 class ExampleChare { 2 public: 3 int member1; 4 SimpleObject member2; 5 ComplexObject* member3;

6 OtherObject member4; // Some transient state that doesn’t need to be migrated

8 void pup(PUP& p) {

9 p | member1; // Tell p to size, pack or unpack member1 depending on state of p

10 p | member2; // Same for member2

12 // Since member3 is heap allocated, we need to check if we are unpacking so

13 // that we can allocate space for it before unpacking it

14 if (p.isUnpacking()) {

15 member3 = new MemberObject();

16 }

17 p | member3; // Size, pack, or unpack member3

18 }

19 };

Primitive types and flat objects can be directly piped through the PUP object as shown on lines 9 and 10. If SimpleObject contains its own pup function then it will be called to recursively migrate member2. Otherwise, it will just treat member2 as a flat chunk of bytes the same size as SimpleObject. For pointers to data, an extra step needs to be taken to ensure that space is correctly allocated during the unpacking step of migration. This is shown on lines 14 to 16 before piping member3 through the PUP object on line 17. In this example, member4 does not need to be migrated and is therefore not included in the pup function. This reduces the amount of data sent across the network during migration.

In Charades, LPs are implemented as chares, which allows the runtime to automatically track their load and migrate them during load balancing. The LP chares have a pup function to correctly serialize all the required simulator information for each LP, including things like pending and passed event lists, causality information, cancellation information, and current virtual time. By default, it migrates the model specific state as if it were a contiguous block of memory; however complex models can provide their own PUP function handle for LPs that have state that requires more careful serialization.

5.1.2 Load Balancing Synchronization

In terms of when to perform load balancing, the application must let the runtime system know when it is at a point at which it is safe to migrate chares. For complicated applications,

migrating a chare at an arbitrary time may cause issues, especially if chares are using shared data structures. Because of this, the runtime system provides a few methods for informing it when load balancing can occur. For this work we will focus on the most common of these methods, AtSync(). In this mode, AtSync() is a function provided by the runtime that must be called on every chare before migration can occur. Once every chare calls AtSync(), the runtime knows it is safe to start load balancing. At this point, the runtime system calls the specified load balancing strategy which collects load data from each object and makes a decision on how to redistribute the objects. Once the decision has been made, migrations are performed, and optionally a ResumeFromSync() method is called informing chares that load balancing has been completed. Based on application needs, chares can block until ResumeFromSync() is called, or continue execution immediately after calling AtSync().

For Charades, we chose to synchronize for load balancing immediately after the comple- tion of certain GVT computations. This decision was made for a couple of reasons. First, depending on the GVT algorithm chosen, the GVT computation may already be an exist- ing synchronization point within a simulation, which provides a natural place to do load balancing. How frequently, and how many times, to perform load balancing is a runtime parameter. Secondly, after the GVT computation completes, fossil collection occurs which frees up unneeded memory within LPs. Calling AtSync() right after fossil collection will minimize the amount of data that has to be migrated between processors. In the case of blocking GVTs, the simulation resumes after load balancing has completed. For non-blocking GVT algorithms, we would like to avoid blocking the simulation for load balancing as well. In these cases, we have also experimented with allowing load balancing to be overlapped with the simulation itself, much like the GVT algorithm. In this case, we simply ignore the ResumeFromSync() message from the runtime and continue the simulation immediately after calling AtSync(). Exactly which GVTs to perform load balancing after is configured via runtime parameters. In this work, we manually tune these parameters experimentally, however there has also been work done within the Charm++ runtime system to automatically select load balancing frequency based on application performance characteristics [24].

In document Adaptive techniques for scalable optimistic parallel discrete event simulation (Page 94-97)