Overhead analysis - Prediction-based checkpoint model

4.2 Prediction-based checkpoint model

4.2.2 Overhead analysis

The proposed checkpoint mechanism differs from the original checkpoint mechanism in when checkpoints are constructed, according to the four scenarios given in Figure 4.9. An accurate prediction algorithm helps the system to establish checkpoints at appropriate times, whereas inaccurate prediction wastes time on constructing extra checkpoints. Table 4.3 lists the time penalties of the prediction-based mechanism against those of the original checkpoint mechanism for each of the four scenarios. The prediction module may reduce the time cost in scenario (a), while it results in extra wasted time in scenario (d).

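| Scenario | Original time waste | New time penalty | Difference |
|----------|---------------------|------------------|------------|
| (a) | $t_b + t_r$ | $t_s + t'_b + t_r$ | $t_b - t_s - t'_b$ |
| (b) | $t_b + t_r$ | $t_b + t_r$ | $0$ |
| (c) | $0$ | $0$ | $0$ |
| (d) | $0$ | $t_s$ | $-t_s$ |

Table 4.3: Comparison between original checkpoint mechanism and prediction-based mechanism in terms of wasted time for each of the four scenarios introduced in Figure 4.9.

$$W'_N - W_N = \left(\frac{t_s}{P} + t'_b - t_b\right) \cdot R \cdot N \qquad (4.12)$$

$$W'_N - W_N < 0 \;\Rightarrow\; \left(\frac{t_s}{P} + t'_b - t_b\right) \cdot R \cdot N < 0 \;\Rightarrow\; \frac{t_s}{P} + t'_b - t_b < 0 \;\Rightarrow\; P > \frac{t_s}{t_b - t'_b} \qquad (4.13)$$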
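As a rough cross-check on Equation 4.12, the per-scenario penalties in Table 4.3 can be weighted by how often each scenario occurs. The sketch below rests on assumptions that are inferred from the table rather than stated in this section: scenario (a) is taken to be a correctly predicted failure, (b) a missed failure, (c) a correct no-failure prediction and (d) a false alarm, and $N$ is taken to be the number of failures. Under those assumptions there are $RN$ correct predictions, $(1-R)N$ missed failures and $RN(1-P)/P$ false alarms, and the extra time (new penalty minus original waste) sums to:

```latex
\begin{aligned}
W'_N - W_N
  &= \underbrace{RN\,(t_s + t'_b - t_b)}_{\text{(a)}}
   + \underbrace{(1-R)N \cdot 0}_{\text{(b)}}
   + \underbrace{RN\,\frac{1-P}{P}\,t_s}_{\text{(d)}} \\
  &= RN\left(t_s + \frac{1-P}{P}\,t_s + t'_b - t_b\right) \\
  &= \left(\frac{t_s}{P} + t'_b - t_b\right) R\,N .
\end{aligned}
```

The result agrees with Equation 4.12, which suggests the scenario reading above is at least consistent with the thesis's formal derivation from Equations 4.4 and 4.11.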

To analyse the improvement made by the prediction-based checkpoint mechanism, Equation 4.12 is derived from Equations 4.4 and 4.11. For the prediction module to be beneficial it must satisfy $W'_N - W_N < 0$; the derivation is shown in Equation 4.13. This leads to the conclusion that the prediction module must meet the necessary condition $P > t_s/(t_b - t'_b)$ if it is to offer improved fault tolerance. As for the recall $R$, once this condition holds, the higher the recall, the more computational time the system saves. In terms of $t_b$ and $t'_b$, they have dif-

CHAPTER 4. PROACTIVE FAILURE RECOVERY MECHANISM 78

in [13, 71]. In this thesis, $t_b$ is estimated as approximately $t_c/2$, and $t'_b$ is ignored for simplicity.
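Equations 4.12 and 4.13 can be evaluated directly. The following is a minimal sketch, not code from the thesis: the function names are hypothetical, the parameter values are taken from the figure captions in this section, and the failure count of 10 is purely an illustrative assumption.

```python
def overhead_difference(t_s, t_b, t_b_prime, precision, recall, n_failures):
    """Equation 4.12: W'_N - W_N.

    A negative result means the prediction-based mechanism wastes less time
    than the original checkpoint mechanism.
    """
    if not 0 < precision <= 1:
        raise ValueError("precision must be in (0, 1]")
    return (t_s / precision + t_b_prime - t_b) * recall * n_failures


def min_precision(t_s, t_b, t_b_prime=0.0):
    """Equation 4.13: precision must strictly exceed this value to be beneficial."""
    if t_b <= t_b_prime:
        raise ValueError("no precision satisfies the condition when t_b <= t'_b")
    return t_s / (t_b - t_b_prime)


# t_s and t_c come from the figure captions; t_b = t_c / 2 is the thesis's estimate;
# t'_b is ignored (set to 0) as in the thesis.
t_s, t_c = 0.2, 1.6
t_b = t_c / 2
threshold = min_precision(t_s, t_b)
print("minimum precision:", threshold)

# ASSUMPTION: 10 failures, chosen only to make the sign of the difference visible.
for p in (0.1, 0.5, 1.0):
    print(p, overhead_difference(t_s, t_b, 0.0, p, recall=1.0, n_failures=10))
```

With these values the difference is positive below the threshold and negative above it, which is the behaviour Equation 4.13 describes.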

Comparison with traditional checkpoint model

The expected running time of applications under the traditional checkpoint mechanism is given in Equation 2.2. As analysed above, the overhead difference between the traditional checkpoint mechanism and the prediction-based checkpoint mechanism is $W'_N - W_N$; hence the expected execution time of applications under the prediction-based checkpoint mechanism ($E'_C$) can be expressed as Equation 4.14. The overhead of the proactive failure recovery approach therefore depends on the failure prediction accuracy (precision and recall). To investigate how execution time varies with the relevant independent variables, especially prediction accuracy, Figure 4.10 plots the formulae under three different parameter settings. The proactive checkpoint mechanism with better prediction accuracy decreases the overhead compared with the traditional mechanism, whereas a lower accuracy adds extra overhead.

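$$E'_C = E_C + (W'_N - W_N) = \frac{F}{t_c}\left(t_s + \left(t_r + \frac{1}{\lambda}\right)\left(e^{\lambda t_c} - 1\right)\right) + \left(\frac{t_s}{P} + t'_b - t_b\right) \cdot R \cdot N \qquad (4.14)$$

Dependence on precision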

Figure 4.10 shows the effects of accuracy (precision and recall) on the prediction-based checkpoint mechanism. To investigate how the execution time varies with precision, recall is fixed at 100%. Figure 4.11 plots Equation 4.14 with precision varying from 0% to 100%. When precision equals 28%, the prediction-based checkpoint mechanism has the same overhead as the traditional checkpoint mechanism, which corresponds to $P = t_s/(t_b - t'_b)$. As precision increases beyond this point ($P > t_s/(t_b - t'_b)$), the curve declines and the prediction-based checkpoint mechanism performs better. Conversely, the traditional checkpoint mechanism is better when precision is lower than 28%.

[Figure 4.10 plot. x-axis: Failure free running time (days); y-axis: Expected execution time (days); curves: Traditional checkpoint, Proactive checkpoint with higher accuracy, Proactive checkpoint with lower accuracy.]

Figure 4.10: Comparison between the traditional checkpoint mechanism and the prediction-based checkpoint mechanism for various application lengths when recall is 100%. The curve of the proactive checkpoint with higher accuracy (P = 100%) shows the effect when precision meets the condition $P > t_s/(t_b - t'_b)$, whereas the curve of the proactive checkpoint with lower accuracy (P = 10%) shows the results of the formulae when $P < t_s/(t_b - t'_b)$. Other parameters are configured as $\lambda = 0.1$, $t_s = 0.2$, $t_c = 1.6$, $t_b = 0.8$, and $t_r = 0.2$.

[Figure 4.11 plot. x-axis: Precision; y-axis: Application execution time (days).]

Figure 4.11: The effects of Equation 4.14 with varying prediction precision; the horizontal line in the chart denotes the application execution time of the traditional checkpoint mechanism. Other parameters are configured as $R = 100\%$, $\lambda = 0.1$, $t_s$ = [...]


Dependence on recall

Equation 4.13 shows that precision determines whether the prediction-based checkpoint mechanism performs better, and Equation 4.14 shows that the amount of saved overhead scales with recall. To show the effect of recall in Equation 4.14, the parameters are set so that the condition in Equation 4.13 is met. Figure 4.12 shows the results. The total saved overhead is linear in recall: the higher the recall, the greater the overhead saved by the prediction-based checkpoint mechanism.
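The linear relationship follows directly from Equation 4.14, since $R$ appears only as a multiplicative factor in the second term. As a short worked step (using the caption values $P = 1$, $t_s = 0.2$, $t_b = 0.8$, and $t'_b = 0$ as ignored in the thesis):

```latex
\frac{\partial E'_C}{\partial R}
  = \left(\frac{t_s}{P} + t'_b - t_b\right) N
  = (0.2 + 0 - 0.8)\,N
  = -0.6\,N
```

The slope is constant in $R$ and negative whenever the condition in Equation 4.13 holds, so each additional unit of recall removes the same amount of execution time.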

[Figure 4.12 plot. x-axis: Recall; y-axis: Application execution time (days).]

Figure 4.12: The effects of Equation 4.14 with varying prediction recall; the running time using the traditional coordinated checkpoint approach has also been plotted. Other parameters are configured as $P = 100\%$, $\lambda = 0.1$, $t_s = 0.2$, $t_c = 1.6$, $t_b = 0.8$, and $t_r = 0.2$.

Dependence on precision and recall

The proactive checkpoint mechanism performs better when precision meets the condition shown in Equation 4.13, while Equation 4.14 shows that both precision and recall affect the execution time. Figure 4.13 shows the execution time of an application with varying precision and recall. Increasing precision always reduces the overall execution time, and increasing recall reduces it as well, except at low precision, where the condition in Equation 4.13 is not met.
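The exception at low precision can be seen by tabulating the second term of Equation 4.14 over a small grid. This is a hedged sketch rather than a reproduction of Figure 4.13: the function name is hypothetical, $t_b = 0.8$ and $N = 1$ are assumed values (neither is listed in the caption of Figure 4.13), and $t'_b$ is set to 0 as in the thesis.

```python
def overhead_delta(P, R, t_s=0.2, t_b=0.8, t_b_prime=0.0, N=1.0):
    """Second term of Equation 4.14: change in execution time (days)
    relative to the traditional mechanism. Defaults are assumptions."""
    return (t_s / P + t_b_prime - t_b) * R * N


precisions = (0.1, 0.25, 0.5, 1.0)
recalls = (0.2, 0.6, 1.0)

# Rows are precision values, columns are recall values.
print("P \\ R " + "".join(f"{R:>8.1f}" for R in recalls))
for P in precisions:
    row = "".join(f"{overhead_delta(P, R):>8.2f}" for R in recalls)
    print(f"{P:<6.2f}" + row)
```

Below the break-even precision the entries are positive and grow with recall, so higher recall makes things worse; above it they are negative and fall with recall.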

[Figure 4.13 plot. Axes: Recall, Precision, Expected execution time.]

Figure 4.13: The relationship between expected execution time, precision and recall of the proactive failure recovery mechanism. A failure-free running time of 5 days is chosen, and the other parameters are configured as $\lambda = 0.1$, $F = 5$ days, $t_s = 0.2$, and $t_r = 0.2$.

In document Prediction-based failure management for supercomputers (Page 77-81)