5.2.1 Microbenchmark Results

The following set of microbenchmarks analyzes the bandwidth of the NAM CR and its scaling behavior in the DEEP-ER SDV from 1 to 16 nodes with 4 processes each. The checkpoint sizes range from 4 KB up to 2 GB per node. The benchmarks call libNAM CR functions directly, without an additional layer such as SIONlib, and each process is treated as an independent rank. Hence, a maximum of 64 checkpoints is created, evenly assigned to the two NAMs (at most 32 checkpoints per NAM).
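The checkpoint counts quoted above follow from simple arithmetic, which the short sketch below spells out. The helper name `checkpoint_counts` and its parameters are hypothetical and only serve this illustration; they are not part of libNAM.

```python
# Illustrative helper (hypothetical, not a libNAM function).
def checkpoint_counts(nodes, procs_per_node=4, nams_used=2):
    """Return (total checkpoints, checkpoints per NAM) for one benchmark run."""
    total = nodes * procs_per_node          # one checkpoint per rank
    if total % nams_used:
        raise ValueError("checkpoints cannot be split evenly across NAMs")
    return total, total // nams_used

# Largest configuration in the text: 16 nodes, 4 processes each, 2 NAMs.
total, per_nam = checkpoint_counts(16)
print(total, per_nam)
```

The `nams_used` parameter is exposed because the single-node run only reaches one NAM, so the even split across two NAMs applies to the multi-node runs.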

5.2.1.1 Checkpointing

The first benchmark measures the overall bandwidth for creating XOR parity checkpoints. A root process configures the NAM CR unit and distributes the job to all participating ranks. Each rank then creates a checkpoint and notifies the NAM, which fetches the data and generates the parity. The bandwidth measurement starts as soon as the MPI job starts and stops when all ranks have been notified that the parity has been generated. The checkpointing bandwidth is calculated as the total amount of data processed divided by the elapsed time, which includes MPI start-up and synchronization. The results of this benchmark are depicted in Figure 5.6. The bandwidth scales with the number of available nodes.
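To make the two quantities in this benchmark concrete, the following minimal sketch computes an XOR parity in software and derives a bandwidth figure from a data volume and an elapsed time. It is not the NAM hardware path or the libNAM API: the function names are hypothetical, the elapsed time is an example value rather than a measurement, and MB is assumed to mean 10^6 bytes, which the text does not specify.

```python
from functools import reduce

def xor_parity(checkpoints):
    """Bytewise XOR over equally sized checkpoints (software illustration only)."""
    length = len(checkpoints[0])
    if any(len(c) != length for c in checkpoints):
        raise ValueError("checkpoints must have equal size")
    # Treat each checkpoint as one large integer so that ^ acts on all bytes.
    as_ints = (int.from_bytes(c, "little") for c in checkpoints)
    parity = reduce(lambda a, b: a ^ b, as_ints)
    return parity.to_bytes(length, "little")

def bandwidth_mb_s(total_bytes, elapsed_s):
    """Processed data divided by elapsed time; MB = 10^6 bytes (assumption)."""
    return total_bytes / elapsed_s / 1e6

# Four ranks with one 4 KB checkpoint each (example data).
checkpoints = [bytes([rank]) * 4096 for rank in range(1, 5)]
parity = xor_parity(checkpoints)

# Example elapsed time; in the benchmark it spans from MPI job start to the
# last parity notification, so start-up and synchronization are included.
example_elapsed_s = 0.002
print(bandwidth_mb_s(sum(len(c) for c in checkpoints), example_elapsed_s))
```

Because the elapsed time covers the whole job, small checkpoints are dominated by fixed start-up cost, which is consistent with bandwidth rising with checkpoint size.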

For one participating node, only one NAM is utilized and only one link of this NAM is accessed, since there is a static route between the two endpoints. The resulting peak bandwidth is 6.2 GB/s, which is less than what has been measured for PUT requests from a node to the NAM. This is surprising because the NAM issues GET requests, and GET responses traveling back to the NAM are handled by the EXTOLL network in much the same way as PUTs. The reason for this disparity is software synchronization overhead and the generation of the XOR parity, which is then also written to the HMC. It is reasonable to include this overhead in the measurements since it is part of the overall CR process.

With two nodes, the effective bandwidth more than doubles to 14 GB/s, as both NAMs are now involved and the software overhead remains at a comparable level. Adding more nodes to the checkpointing process eventually leads to bandwidth saturation at 24.8 GB/s with 16 nodes. At first glance this result is surprising, as it implies that the bandwidth per NAM, assuming an equal distribution, is

$$\frac{24.8\ \text{GB/s}}{2\ \text{NAMs}} = 12.4\ \text{GB/s}.$$

This is higher than what has been measured for writing data to a NAM via both links.

5.2 Checkpoint/Restart

Fig. 5.6 XOR checkpointing bandwidth with 2 NAMs in the DEEP-ER SDV. 4 processes per node with one checkpoint per process. [Plot: bandwidth in MB/s over checkpoint size per node in bytes (8K to 2G), one curve each for 1, 2, 4, 8, and 16 nodes with 4 processes each.]

However, the theoretical NAM bandwidth analysis in Section 4.4 pointed out that the bottleneck for a two-link operation sits in the HTL protocol conversion logic. In the case of Checkpoint/Restart, this module is bypassed entirely except for writing out the XOR parity to the HMC. All other data is directed to the CR layer, which operates at a higher throughput (17.54 GB/s) than two EXTOLL links can deliver (16.62 GB/s). Achieving even higher checkpointing bandwidths remains difficult due to the inherent overhead of generating and storing the XOR parity, and of process synchronization among participating nodes.
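The argument can be checked with the figures quoted in the text. The sketch below compares the per-NAM share of the saturated bandwidth with the lower of the two limits on the CR path. Modeling the ceiling as a plain minimum is a simplifying assumption of this example, and the variable names are illustrative.

```python
# Figures taken from the text, in GB/s.
CR_LAYER = 17.54        # throughput of the CR layer
TWO_LINKS = 16.62       # what two EXTOLL links can deliver
MEASURED_TOTAL = 24.8   # saturated checkpointing bandwidth with 16 nodes
NAMS = 2

# Per-NAM share, assuming the load is spread evenly over both NAMs.
per_nam = MEASURED_TOTAL / NAMS

# Simplified model: the slower of the two stages bounds the CR data path.
ceiling = min(CR_LAYER, TWO_LINKS)
utilization = per_nam / ceiling

print(f"per NAM: {per_nam:.1f} GB/s, ceiling: {ceiling:.2f} GB/s, "
      f"utilization: {utilization:.1%}")
```

Under this model the links, not the CR layer, set the ceiling, and the measured per-NAM share stays below it, with the remainder attributable to parity generation and synchronization as described above.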

5.2.1.2 Restart

Benchmarking a restart requires that an XOR parity has already been generated. Hence, a checkpoint is first created following the scheme presented in the previous section. The bandwidth measurement starts as soon as the root process informs the NAM that a rank failure has occurred and stops after the failed rank has retrieved its missing checkpoint in full. Figure 5.7 shows that restart scales similarly to checkpointing for an increasing number of participating nodes. The resulting bandwidths, however, are consistently lower than for checkpointing. The reason for this behavior is the additional read required to fetch the missing checkpoint after reconstruction has finished.

NAM Performance Evaluation

Fig. 5.7 XOR restart bandwidth with 2 NAMs in the DEEP-ER SDV. 4 processes per node with one checkpoint per process. [Plot: bandwidth in MB/s over checkpoint size per node in bytes (8K to 2G), one curve each for 1, 2, 4, 8, and 16 nodes with 4 processes each.]
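Restart relies on the standard XOR property that a missing block equals the parity XORed with all surviving blocks. The sketch below demonstrates this in software. It assumes equally sized checkpoints and uses hypothetical function names; it does not describe how the NAM hardware or libNAM carries out the reconstruction.

```python
def xor_bytes(a, b):
    """Bytewise XOR of two equally sized byte strings."""
    if len(a) != len(b):
        raise ValueError("blocks must have equal size")
    return bytes(x ^ y for x, y in zip(a, b))

def xor_all(blocks):
    """XOR an iterable of equally sized blocks together."""
    blocks = iter(blocks)
    result = next(blocks)
    for block in blocks:
        result = xor_bytes(result, block)
    return result

def reconstruct(parity, surviving):
    """Recover the single missing checkpoint from parity and the survivors."""
    return xor_all([parity, *surviving])

# Example: four ranks, rank 2 fails after the parity has been generated.
checkpoints = [bytes([17 * (rank + 1)]) * 8 for rank in range(4)]
parity = xor_all(checkpoints)
failed = 2
surviving = [c for rank, c in enumerate(checkpoints) if rank != failed]
recovered = reconstruct(parity, surviving)
print(recovered == checkpoints[failed])
```

The recovered block must then still be read back by the failed rank, which is the additional step that the text names as the reason for the lower restart bandwidth.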

5.2.1.3 Impact of XOR Set Mapping on CR Performance

One important property that affects CR performance is the assignment of nodes to an XOR set or, more specifically, the mapping of ranks to one of the two NAMs. The libNAM library currently maps nodes to a set in pseudo-random fashion and does not consider the actual topology and routing setup. As Section 4.7.2 highlighted, there are good and bad mappings for the same node/routing/NAM setup. The measurements so far were executed with manually assigned XOR sets. This is reasonable for a system such as the DEEP-ER SDV. For larger systems and many different applications, however, it is up to libNAM to form these sets. Therefore, it is necessary to measure the performance impact of the mapping scheme.

Figure 5.8 compares the checkpointing bandwidth for two different mappings with 4 nodes. It shows that the potential performance loss for a bad mapping scheme is significant. With the current libNAM implementation and without additional effort, the best mapping is therefore not guaranteed. In addition, the job scheduler may make a bad mapping inevitable. In this case, the user is responsible for reserving nodes whose routing is guaranteed to target all available NAM links.

Fig. 5.8 Impact of XOR set to NAM mapping on achievable bandwidth. [Plot: bandwidth in MB/s over checkpoint size per node in bytes (8K to 2G) for a good and a bad mapping with 4 nodes.]
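A user who has to verify a node reservation could check whether the static routes of the reserved nodes cover every NAM link. The sketch below shows one way to express that check. The routing table, node names, and link identifiers are entirely hypothetical, and the sketch assumes one static route per node; it does not reflect the actual DEEP-ER SDV routing or any libNAM interface.

```python
def uncovered_links(reserved_nodes, routes, all_links):
    """Return the NAM links that no reserved node is routed to."""
    used = {routes[node] for node in reserved_nodes}
    return set(all_links) - used

# Hypothetical setup: 2 NAMs with 2 links each, one static route per node.
ALL_LINKS = [("NAM0", 0), ("NAM0", 1), ("NAM1", 0), ("NAM1", 1)]
ROUTES = {
    "node0": ("NAM0", 0), "node1": ("NAM0", 1),
    "node2": ("NAM1", 0), "node3": ("NAM1", 1),
    "node4": ("NAM0", 0), "node5": ("NAM1", 0),
}

good = ["node0", "node1", "node2", "node3"]   # reaches all four links
bad = ["node0", "node4", "node2", "node5"]    # two nodes share each link

print(uncovered_links(good, ROUTES, ALL_LINKS))
print(sorted(uncovered_links(bad, ROUTES, ALL_LINKS)))
```

An empty result means the reservation can drive all links. A non-empty result identifies links that would stay idle, which is the kind of situation the bad mapping in Figure 5.8 illustrates.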

In document Accelerating Checkpoint/Restart Application Performance in Large-Scale Systems with Network Attached Memory (Page 132-135)