LULESH - Iterative Applications Benchmarking

5.3 Resilient Application Frameworks

5.4.3 Iterative Applications Benchmarking

5.4.3.5 LULESH

X10 provides an implementation of the LULESH proxy application [Karlin et al., 2013], which simulates shock hydrodynamics through a series of time steps. Each step executes a series of stencil operations on a block of elements and nodes that define these elements. The elements and their nodes are evenly partitioned among the places. Therefore, exchanging ghost regions between neighboring places is required for the

§5.4 Performance Evaluation 159

Figure 5.16: PageRank weak scaling with number of places (1 core per place).

Table 5.13:PageRank performance.

128 places 1024 places Non-res. initialization 1384 ms 3385 ms step 232 ms 570 ms total 15324 ms 37565 ms Res. (O-dist) initialization 1237 ms 3546 ms step 259 ms 563 ms checkpoint data (1 ckpt) 89 ms 324 ms checkpoint agree (1 ckpt) 53 ms 236 ms failure detection ( 1 failure) 129 ms 380 ms store recovery (1 failure) 165 ms 445 ms application remake (1 failure) 831 ms 2029 ms team remake (1 failure) 183 ms 201 ms restore data (1 failure) 19 ms 68 ms total (6 ckpts, 3 failures) 25495 ms 58500 ms

stencil computations at each step. Reduction and barrier collective operations are also invoked by LULESH steps for synchronization and evaluating the convergence state. We enhanced LULESH with resilience support using the SPMD iterative executor. Our resilient implementation has been contributed to the X10 applications repository [X10 Applications] in the lulesh2 resilientfolder. Approximately 200 lines of codes out of 4100 total lines were added or modified from the original implementation to conform with theIterativeAppinterface.

We evaluate the performance of LULESH with a problem size of 303 elements per place. The number of places is 343, 512, 729, and 1000, because LULESH requires a perfect cube number of places. The performance results are shown in Figure5.17and Table 5.14.

LULESH uses a communication-intensive initialization kernel that generates a large number of concurrentfinishobjects and remote tasks. Each place pre-allocates buffers for holding the ghost regions and communicates with 26 neighboring places using remote asyncs for exchanging references to these buffers. When places die, some of these references will dangle and require updating. On recovering the application, the initialization kernel is re-executed at all places for updating the buffer references. Using an efficient termination detection protocol is crucial for achiev- ing acceptable performance for LULESH initialization and recovery kernels. The initialization kernel incurs the following resilience overhead with 1000 places: 482% for P-p0, 335% for O-p0, 45% for P-dist, and only 40% for O-dist. When tracking place-zero finishes, the overhead is: 475% for P-dist-slow and 377% for O-dist-slow.

Unlike theGMLapplications, LULESH steps are subject to thefinishresilience overhead as they generate pair-wise remote tasks for exchanging ghost regions. The measured resilience overhead of a single step with 1000 places is: 13% forP-p0, 8% for

O-p0, 10% forP-dist, and only 4% forO-dist. The optimistic protocol is successfully reducing the resilience overhead for LULESH during initialization and step execution. As initializing the places and exchanging the ghost regions execute in parallel at all places, the distributed implementations outperform the centralized implementations in this application.

The following analysis focuses on the O-dist performance results shown in Ta- ble5.14. LULESH has a modest checkpointing state that takes 24 ms for saving in the resilient store and 35 ms for checkpoint agreement with 1000 places. The resulting resilience overhead for a failure-free execution with six checkpoints is less than 19% (the high percentage is due to the small execution time of LULESH steps, compared to theGMLapplications).

The failure recovery performance with 1000 places is as follows. The resilient executor detects the failure of a place in about 364 ms. Recovering the resilient store takes only 60 ms because the checkpointed state is limited in size. Similar to theGML

applications, Teamrecovery takes about 200 ms. The most expensive part of failure recovery is reinitializing the application, which takes about 2 seconds. Restoring the application state using the last checkpoint takes only 3 ms. The total time for recovering the application from one failure, including the time to re-execute the last 5 steps, is about 3 seconds (50% of the total execution time in non-resilient mode).

§5.4 Performance Evaluation 161

Figure 5.17: Resilient LULESH weak scaling with number of places (1 core per place).

Table 5.14:LULESH performance.

343 places 1000 places Non-res. initialization 973 ms 1720 ms step 73 ms 85 ms total 5333 ms 6820 ms Res. (O-dist) initialization 1355 ms 2410 ms step 82 ms 89 ms checkpoint data (1 ckpt) 24 ms 24 ms checkpoint agree (1 ckpt) 31 ms 35 ms failure detection ( 1 failure) 123 ms 364 ms store recovery (1 failure) 48 ms 60 ms application remake (1 failure) 1146 ms 2243 ms team remake (1 failure) 198 ms 194 ms restore data (1 failure) 3 ms 3 ms total (6 ckpts, 3 failures) 12362 ms 18005 ms

Table 5.15:Changed lines of code for adding resilience using the iterative application framework.

Total LOC Changed LOC Changed %

Linear regression 414 86 21%

Logistic regression 585 110 19%

PageRank 383 65 17%

LULESH 4100 200 5%

LULESH with DMTCP

At the early stage of developing the iterative application framework, we conducted a preliminary experiment to compare the performance of application-level checkpointing versus transparent checkpointing with theDMTCP[Ansel et al.,2009] tool. The

DMTCPtool does not require any code changes. It checkpoints all the application data on disk even if part of this data is not needed for recovery or is computable from other values. On the other hand, our framework checkpoints in memory and relies on users to specify the values to be checkpointed. We measured the time to checkpoint LULESH with a problem size of 303elements per place using 216 places onNECTAR virtual machines. The size of the user-defined checkpointing state is about 2.6 MB per place. The non-resilient X10 mode was used with theDMTCPruns and the resilient

P-p0mode, the only resilient mode available at that time, was used with our resilient framework runs. DMTCP took 6.67 seconds to checkpoint the entire application state, while our framework took only 0.35 seconds (19X faster thanDMTCP). With

DMTCPthe application stops when it faces a place failure and should be restarted for recovery (ideally on the same nodes to avoid moving the checkpointing files to new locations). This limitation is avoided by our resilient framework which does not kill the application upon a failure and encapsulates the checkpoints within the application’s memory.

X10 was compiled from revision41ab27eof thebr07 dmtcp branch of the lan- guage repository https://github.com/shamouda/x10.git. LULESH was compiled from revision8439557of the br07 dmtcp branch of the applications repository https://github.com/shamouda/x10-applications. The X10 places used 1 worker thread by settingX10 NTHREADS=1 and did not use any immediate threads. The MPI threading level was configured asMPI_THREAD_MULTIPLE. MPICH was used as the transport layer of X10 due to technical difficulties integratingMPI-ULFMand

DMTCPat that time. Because MPICH is not resilient, we could not simulate a failure recovery scenario in this experiment.

In document Resilience in high level parallel programming languages (Page 178-182)