Experiments and Analysis - Semi-Blocking Checkpointing

CHAPTER 3 Semi-Blocking Checkpointing

3.6 Experiments and Analysis

This section presents an evaluation of the semi-blocking checkpoint algorithm with two different applications. The first application is wave2D, which uses a finite differencing scheme to calculate pressure information over a discretized 2D grid. The second application, ChaNGa, is for N-Body based parallel simulations and is used in cosmology and astronomy [60]. We also present results related to the performance of restarting applications after a failure.

The experiments were performed on Trestles supercomputer at the San Diego Supercom- puter Center. Trestles consists of 324 nodes with 32 cores per node. The theoretical peak performance of the system is 100 teraflops. Each compute node contains four sockets, each with a 8-core 2.4 GHZ AMD Magny-Cores processor. Each node has 64 GB of DDR3 RAM and 120 GB of flash memory(SSD).

3.6.1 Scalability

Figure 3.8a shows the overhead of one checkpoint based on a weak scaling test with wave2D using blocking and semi-blocking checkpoint protocol from 128 cores to 1K cores. The checkpoint size is 4 GB per node. Semi-blocking algorithm reduces the checkpoint overhead from 52 s to 10 s. The optimal checkpoint interval of the blocking algorithm is 372 s given an M value of 1800 s (M is the MTBF of the system) and δblocking of 52 s. This requires

the wave2D application to checkpoint every 960 iterations. In Figure 3.8b, we show the checkpoint overhead to the execution of the applications using the two algorithms. The checkpoint overhead is reduced from 14% to 3% by using the semi-blocking algorithm. With the decreasing checkpoint overhead, the semi-blocking algorithm can afford to checkpoint more frequently to reduce the amount of rework time. So, the optimum checkpoint interval of the semi-blocking algorithm is different from that of the blocking algorithm.

0 5 10 15 20 25 30 35 40 128 256 512 1024 Checkpoint Overhead(s) Number of Cores blocking checkpoint semiïblocking checkpoint

(a) Single Checkpoint

0 2000 4000 6000 8000 10000 128 256 512 1024 Execution Time(s) Number of Cores no checkpoint blocking checkpoint semi−blocking checkpoint (b) Checkpoint Overhead 0 5 10 15 20 25 30 128 256 512 1024 Benefit (%) Number of Cores MTBF:300s 600s 1200s900s 1500s1800s (c) Semi-Blocking Benefit

Figure 3.9: Strong scaling results - ChaNGa.

Now, we compare the performance of the semi-blocking algorithm with the performance of the blocking algorithm using our model and considering both the checkpoint and rollback- recover overhead. As seen in Figure 3.8c, for different M values, the benefit of the semi- blocking checkpoint protocol is mostly constant from 128 cores to 1K cores. When M decreases, checkpoint and restart overhead for the blocking checkpoint protocol increases, hence the semi-blocking protocol shows more benefit. The benefit of the semi-blocking protocol varies from 10% for M of 1800 s to 22% for M of 300 s for checkpoint size of 4 GB/node.

ChaNGa is used to demonstrate the strong scalability of the semi-blocking checkpoint algorithm. We use a 100 million particle system. In one main step of ChaNGa, it first does domain decomposition of the particle space, then builds the Barnes-Hut trees, computes the gravitational forces, and finally updates the particles. Checkpoint is taken periodically after a certain number of steps. Figure 3.10 displays the view of communication bytes sent over time from our Projections performance analysis tool. There is less amount of communication data in the first two phases of one step: domain decomposition and tree building as seen in the figure. Sending more checkpoint messages in these phases can help us incur less interference to the application. With the opportunistic sending of the checkpoint message, our scheme can identify such phases without application knowledge.

Figure 3.9a shows the checkpoint overhead of one checkpoint based on a strong scaling test of ChaNGa application using the blocking and semi-blocking algorithms separately. Checkpoint size per node decreases for a strong scaling test, so the blocking checkpoint time is reduced from 33 s on 128 cores to 5 s on 1K cores. The semi-blocking checkpoint time decreases from 2.6 s to 0.27 s, almost hiding the checkpoint overhead.

In Figure 3.9b, we display the checkpoint overhead of the execution of ChaNGa. The optimum checkpoint interval is adjusted to the blocking checkpoint time. The blocking checkpoint overhead decreases from 12% on 128 cores to 5% on 1K cores because of the de-

Domain

DecompositionTree Build

Calculation and Update

Figure 3.10: Bytes sent in one step of ChaNGa.

creased checkpoint size per node. With the semi-blocking algorithm, the checkpoint overhead is below 1%. Of course with such low checkpoint overhead of the semi-blocking algorithm, applications can benefit more from frequent checkpoints so as not to lose lots of wall clock cycles when failure happens.

Figure 3.9c depicts the percentage benefit of the semi-blocking algorithm to the blocking algorithm at their own optimal checkpoint intervals. Semi-blocking algorithm achieves the largest benefit on 128 cores where the blocking checkpoint overhead is at its maximum in a strong scaling experiment. Even when running on 1K cores with M of 1800 s, the semi- blocking algorithm has over 6% benefit compared to the blocking algorithm. Given that the memory consumption is only 763 MB per node when running on 1K cores, we expect more benefit of the semi-blocking algorithm for applications with larger memory consumption.

3.6.2 Virtualization Analysis

Over-decomposition and asynchronous communication in Charm++ can greatly help overlap communication and computation of applications. In Charm++, programs are broken up into objects called chares. Usually, there are more chares than the number of processors. The number of chares divided by the number of processors is called virtualization ratio. With high virtualization ratio, the communication of the checkpoint or application message of one chare can be overlapped with the computation of other chares on the same core which can further hide the checkpoint overhead, so we can expect more benefits.

0 0.2 0.4 0.6 0.8 1 1.2 1.4 1 2 4 8 16 0 5 10 15 20 Interference(s) Benefit(%) Virtualization Ratio interference benefit

Figure 3.11: Effect of virtualization.

the semi-blocking algorithm. The checkpoint size is 0.9 GB per node. The interference of remote checkpoint per checkpoint interval decreases from 1.4 s with 1 chare per core to 0.3 s with 4 chares per core in Figure 3.11. Correspondingly, the benefit of the semi-blocking protocol to the blocking version calculated from the model increases from 12.3% to 15.9% when M is 300 s as expected. However, when the virtualization ratio is increased to 8 and beyond, there is extra overhead to schedule the work of multiple chares, so there is more interference.

3.6.3 Checkpoint and Restart with SSD

As discussed in Section 3.5, half SSD and full SSD schemes can be used depending on the memory consumption of applications.

Figure 3.12 shows the checkpoint timing penalty using SSD with checkpoint data size ranging from 0.45 GB to 2.23 GB per node for the wave2D benchmark. We compare the performance of half and full SSD scheme with asynchronous (half-aio, full-aio) and synchronous IO access (half-sio, full-sio) respectively. Using full SSD scheme with asynchronous IO access saves us more than half the time of writing checkpoint data to SSD with synchronous IO. In Figure 3.13, we show the restart time of in-memory checkpointing, half and full SSD checkpointing with asynchronous IO access. Half SSD scheme has almost negligible overhead compared to the in-memory checkpointing, while full SSD scheme has around one second overhead. During restart, objects first need to get their checkpoints either from their own or their buddy node’s local disk or memory, and then restore the application and process data from the checkpoints. With asynchronous IO access and high virtualization ratio, the restoring of one object can be overlapped with the process of getting checkpoints for another

0 5 10 15 20 25 30 0.45 1.34 2.23 Timing Penalty(s) Checkpoint Size/Node(GB) half−aio full−aio half−sio full−sio

Figure 3.12: Penalty of checkpointing to SSD.

5 10 15 20 25 30 35 40 45 0.45 1.34 2.23 Restart Time(s) Checkpoint Size/Node(GB) in−memory half−aio full−aio

object. Thus we see the restart time is not affected much by checkpointing to SSD.

In document Mitigation of failures in high performance computing via runtime techniques (Page 50-55)