Numerical results - Algorithm Based Fault Tolerance: A Perspective from Algorithmic and Communi

In this section, we evaluate our proposed fault tolerance scheme, FTPIDA*, and compare its performance with the CP/R technique. All experiments were performed on the Guillimin supercomputer at McGill University, managed by Calcul Québec and Compute Canada.¹ Guillimin has 1200 nodes and 7200 cores connected by 4 TB/s InfiniBand. All nodes run the CentOS 6.3 operating system. Both FTPIDA* and CP/R are implemented with Open MPI 1.8.6.

For our experiments, we implement FTPIDA* and CP/R and use the 24-puzzle problem as the test case to compare the performance of the two fault tolerance methods. We allow the simultaneous failure of at most 50% of the processes in each test case. To produce multiple process failures, we randomly choose the processes to fail, subject to the condition that at most two consecutive neighbours can fail simultaneously. In our implementation, we take MTTF = 400 seconds, and with CP/R a checkpoint is taken just before every 400 seconds elapse.
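The failure-injection rule described above can be sketched as follows. This is only an illustrative sketch, not the experimental code: it assumes the processes are arranged in a ring (so that rank 0 and rank n-1 are neighbours) and reads "at most two consecutive neighbours" as "no run of three or more adjacent failed ranks". The names `choose_failures` and `max_failed_run` are hypothetical.

```python
import random


def max_failed_run(failed, n):
    """Length of the longest run of adjacent failed ranks on a ring of n ranks."""
    if len(failed) == n:
        return n
    longest = 0
    for start in range(n):
        # Start counting only at a failed rank whose predecessor survived.
        if start in failed and (start - 1) % n not in failed:
            run = 0
            while (start + run) % n in failed:
                run += 1
            longest = max(longest, run)
    return longest


def choose_failures(n, k, max_run=2, seed=None, max_tries=10_000):
    """Pick k of n ranks to fail by rejection sampling (assumed ring layout)."""
    if k > n // 2:
        raise ValueError("at most 50% of the processes may fail")
    rng = random.Random(seed)
    for _ in range(max_tries):
        failed = set(rng.sample(range(n), k))
        if max_failed_run(failed, n) <= max_run:
            return sorted(failed)
    raise RuntimeError("no valid failure set found")


if __name__ == "__main__":
    print(choose_failures(20, 10, seed=1))
```

Rejection sampling is used here only because it is simple to check; a constructive sampler would be preferable if many failure sets had to be drawn at large scale.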

Figure 3.5 shows the overhead of fault tolerance in a failure-free execution for both FTPIDA* and CP/R with the existing application. The result indicates that the backup time for FTPIDA* is small and nearly constant as the number of processes increases. In contrast, the checkpoint overhead of CP/R increases almost linearly with scale, and at a much higher rate than that of our scheme. The reason is that more checkpoints have to be performed with more processes, which adds more coordination and checkpoint message time to the execution time.

¹ The operation of this supercomputer is funded by the Canada Foundation for Innovation (CFI),

[Plot: overhead (in seconds) versus number of processors, 20 to 80; series: FTPIDA* and CCP/R]

Figure 3.5: Fault Tolerance Overhead (without fault): FTPIDA* vs. CP/R

Figure 3.6 compares the fault recovery overhead of FTPIDA* and CP/R for different numbers of processes. CP/R always spends more time recovering the system from faults, and this recovery time increases with scale. In contrast, the recovery time of FTPIDA* is almost constant and always smaller than that of CP/R, irrespective of the number of processes. This is because CP/R restores all the processes to a globally consistent state even after a single process failure, whereas FTPIDA* only resumes the failed process from its last consistent state, independently, using the critical data held by one of its two neighbors. FTPIDA* takes nearly constant time to recover from one or more process failures because all the recoveries happen independently and in parallel. The result shows that our proposed method has a distinct advantage over CP/R: FTPIDA* reduces the overhead of fault recovery by up to 90% compared to CP/R.

[Plot: overhead (in seconds) versus number of processors, 20 to 80; series: FTPIDA* and CCP/R]

Figure 3.6: Failure Recovery Overhead: FTPIDA* vs. CP/R
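The independence of the neighbour-based recovery described above can be illustrated with a small sketch. It is not the FTPIDA* implementation: it assumes a ring in which rank r's two neighbours are r-1 and r+1 (modulo n) and that each neighbour holds a replica of r's critical data. The function name `recovery_sources` is hypothetical.

```python
def recovery_sources(failed, n):
    """Map each failed rank to a surviving ring neighbour holding its replica.

    Assumes rank r replicates its critical data to (r - 1) % n and (r + 1) % n.
    """
    failed = set(failed)
    sources = {}
    for rank in sorted(failed):
        left, right = (rank - 1) % n, (rank + 1) % n
        survivors = [p for p in (left, right) if p not in failed]
        if not survivors:
            # Both replicas are gone: this sketch cannot recover the rank.
            raise RuntimeError(f"rank {rank}: both neighbours failed")
        sources[rank] = survivors[0]
    return sources


if __name__ == "__main__":
    # Ranks 3 and 4 fail together, rank 10 fails alone, on a ring of 20.
    print(recovery_sources([3, 4, 10], 20))  # {3: 2, 4: 5, 10: 9}
```

Under these assumptions, as long as no three adjacent ranks fail together, every failed rank has at least one surviving neighbour. Each recovery then needs data from a single neighbour and can proceed in parallel with the others, which is consistent with the nearly constant recovery time reported for FTPIDA*.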

Figure 3.7 shows the performance improvement of FTPIDA* over CP/R with multiple process failures. The improvement of FTPIDA* over CP/R increases linearly with the number of processes. In a failure scenario, our proposed fault tolerance method reduces the overhead by roughly 30% to 67% as the number of processes grows from 20 to 80. The reason is that the more processes are involved, the more time CP/R requires for checkpointing and for recovery from a larger number of failures.

[Plot: performance improvement (%) versus number of processors, 20 to 80; series: FTPIDA* vs. CCP/R]

Figure 3.7: Performance Gain: FTPIDA* vs. CP/R

Table 3.2 shows the total backup data size produced by FTPIDA* and CP/R. The sizes are given in megabytes, and the column labeled "Reduction" gives the percentage by which FTPIDA* reduces the total backup data size compared to CP/R. In all cases, the backup data produced by FTPIDA* is much smaller than the checkpoint data of CP/R. This is primarily because each FTPIDA* process only replicates its own critical data to its two neighbor processes. CP/R, on the other hand, freezes the whole system for a substantial amount of time to collect the backup data from all the processes and create a consistent global backup from which the system can recover. Therefore, the critical (backup) data that FTPIDA* needs to recover a system from failure is always smaller than the recovery data of the CP/R method.

Table 3.2: Backup data size for FTPIDA* and CP/R

No. of      Checkpoint Data   Backup Data in FTPIDA*   Total Backup Data   Reduction
Processes                     (per process)            in FTPIDA*
20          238.04 MByte      0.86 MByte               17.2 MByte          92.77%
40          358.51 MByte      0.88 MByte               35.2 MByte          90.18%
60          514 MByte         0.90 MByte               54 MByte            89.49%
80          895.12 MByte      0.84 MByte               67.2 MByte          92.49%
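The derived columns of Table 3.2 can be checked with a few lines of arithmetic. The sketch below assumes that the total FTPIDA* backup size is the per-process size multiplied by the number of processes, and that "Reduction" is one minus the ratio of that total to the CP/R checkpoint size; both readings are inferred from the table values rather than stated in the text. The name `summarize` is hypothetical.

```python
# (processes, CP/R checkpoint size in MB, FTPIDA* backup per process in MB)
ROWS = [
    (20, 238.04, 0.86),
    (40, 358.51, 0.88),
    (60, 514.00, 0.90),
    (80, 895.12, 0.84),
]


def summarize(rows):
    """Return (processes, total FTPIDA* backup in MB, reduction in %) per row."""
    out = []
    for n, checkpoint_mb, per_process_mb in rows:
        total_mb = n * per_process_mb
        reduction = 100.0 * (1.0 - total_mb / checkpoint_mb)
        out.append((n, round(total_mb, 2), round(reduction, 2)))
    return out


if __name__ == "__main__":
    for n, total_mb, reduction in summarize(ROWS):
        print(f"{n:3d} processes: total {total_mb:6.2f} MB, reduction {reduction:5.2f}%")
```

With the values from the table, this reproduces the "Total Backup Data in FTPIDA*" and "Reduction" columns (for example, 17.2 MB and 92.77% for 20 processes).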
