Summary - Prediction-based failure management for supercomputers

This chapter has introduced the concept of a failure management framework for large-scale computing systems and has analyzed its fundamental components: failure detection and failure recovery. Diverse failures tend to have different root causes and are therefore handled in different ways. A brief survey of failure detection mechanisms has been presented. In terms of failure recovery, there are four basic approaches: Retry, Checkpoint, Replication and Alternation. Apart from failure detection and failure recovery, several other techniques, such as availability analysis, fault diagnosis and failure prediction, can help to reduce the adverse effects of failures on runtime.

The second goal of this chapter was to present related work on the most common failure recovery technique, the checkpoint. A definition and a comprehensive survey of checkpoint algorithms have been given. Coordinated checkpoint, which is widely accepted as the failure recovery approach in supercomputers, has been referenced and discussed.

CHAPTER 2. FAILURE MANAGEMENT FRAMEWORK 42

Contributions of this chapter

This chapter has provided a review of the general framework for failure management in large-scale computing systems. Secondly, this chapter has presented comprehensive taxonomies in terms of two distinct rules: creator and coordination.

Relation to other chapters

This chapter has presented background related to failure management, then introduced checkpoint-based failure recovery approaches, based on which the proposed proactive recovery mechanism is developed in Chapter 4.

Chapter 3 Failure prediction methods review

The challenges of real-time failure prediction are addressed in this chapter, and the problem definitions are presented in Section 3.1. A vast amount of work has been published in the area of failure prediction; for the specific problem of real-time failure prediction on the IBM BlueGene/L, for example, there exist a variety of approaches. In Section 3.2, a survey of general failure prediction methods and of prediction algorithms specifically designed for the IBM BlueGene/L is presented. For the problem of failure prediction, CRF models are considered later in this thesis. Section 3.3 introduces the definitions of both the standard CRF model and the semi-Markov CRF model, presents some of their applications, and then compares them with other models.

3.1 Failure prediction statement

Failure prediction is a common term in the field of dependable computing, used either to assess the future reliability of a system according to its specification or to make maintenance adjustments based on the analysis of historical events [110, 73]. The term is used in a wide range of domains from software to hardware. In this thesis, the term is used specifically to denote real-time forecasting of failure occurrence at a future point using a specific range of historical system states or events.

Diverse failures have various root causes, and the same failure may have different sources. More importantly, failure prediction methods may differ between systems according to their unique architectural design, data flows and component dependencies. This means that an effective failure prediction method is system-oriented and must take into account the specific characteristics of the system. This is particularly true of high-performance systems, which usually make use of customised and system-specific components.

In dependable computing, failure prediction can be grouped into two classes from the perspective of time scale: real-time failure prediction (online failure prediction) and long-term failure prediction (reliability analysis). Real-time failure prediction forecasts for a short time period, such as ten minutes ahead. Four different time parameters are defined for the purpose of analysis, as shown in Figure 3.1.

Figure 3.1: Temporal concepts in the failure prediction process ordered by time t, where m > k > j, ∆td is the data window, ∆tl is the lead-time window, ∆tw is the warning time window, ∆tp is the prediction period, and di, li, and pi are events that occurred in the various windows.

∆td defines the time length of the data window used for failure prediction. Not all prediction algorithms use the same method: some approaches, such as Markov process-based mechanisms, use only the current system status, while others take into account the system status during a short time period just before the current time. Other approaches measure the window differently: for example, some algorithms use a fixed number of messages or events rather than a time window. ∆td therefore has a different meaning depending on the mechanism.

∆tl defines the time span from the current time to the future point at which a failure may occur. The value of ∆tl must be carefully selected in online failure prediction mechanisms: if it is too short, administrators may not have enough time to recover or solve the potential problems, whereas a long lead time may reduce the prediction accuracy.

∆tw is termed the warning time window, which defines a minimum value for ∆tl.

∆tp defines the time window during which a predicted failure is expected to occur. The shorter the length of the prediction window, the lower the accuracy of the result is likely to be. A longer ∆tp may increase the precision of forecasting where it is unclear exactly when a future failure will occur, but it is unlikely to be of use in real time.
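The relationship between the four time parameters can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the mechanism developed in this thesis: the class name `PredictionWindows`, its field names, the use of seconds as the unit, the contiguous layout (data window ending at the current time, prediction period starting one lead time later), and the event names in the sample log are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PredictionWindows:
    """Window lengths in seconds (names and units are assumptions of this sketch)."""
    data: float        # ∆td: length of the data window
    lead: float        # ∆tl: lead time from "now" to the start of the prediction period
    warning: float     # ∆tw: minimum acceptable lead time
    prediction: float  # ∆tp: length of the prediction period

    def __post_init__(self):
        # ∆tw defines a minimum value for ∆tl, so reject shorter lead times.
        if self.lead < self.warning:
            raise ValueError("lead time is shorter than the warning time window")

    def data_window(self, now):
        """Interval of history examined when predicting at time `now`."""
        return (now - self.data, now)

    def prediction_period(self, now):
        """Interval in which a failure predicted at `now` is expected to occur."""
        start = now + self.lead
        return (start, start + self.prediction)


def events_in(events, interval):
    """Select (event_type, time) pairs whose time lies in the closed interval."""
    lo, hi = interval
    return [(e, t) for (e, t) in events if lo <= t <= hi]


# Hypothetical log of (event_type, time) pairs; the event names are invented.
log = [("ECC_WARN", 100), ("LINK_ERR", 450), ("ECC_WARN", 900), ("NODE_FAIL", 1400)]
windows = PredictionWindows(data=600, lead=300, warning=120, prediction=600)
now = 1000
history = events_in(log, windows.data_window(now))
failure_expected_in = windows.prediction_period(now)
```

With these assumed values, the predictor at time 1000 would examine only the events in the preceding 600 seconds and would forecast a failure for the interval that starts 300 seconds later. An approach that uses a fixed number of events instead of a time window would replace `data_window` with a slice over the most recent events.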


Formalization

Each event has an associated occurrence time; this association is described as an attribute-value pair. Given a set of event types E, an event can be written as (e, t), where e ∈ E is the event type and t is the occurrence time. More generally, the event type e can contain multiple attributes or features. Sometimes, however, it is necessary to consider a series of events that occur within a fixed period, ordered by time; this is termed an event sequence. Three elements are needed to specify an event sequence: the event type e, the start time tb and the end time te. An event sequence s can therefore be represented as a triple (ei, tb, te), where ei ∈ E, as shown in Equation 3.1.

\[ s = \langle (e_1, t_1), (e_2, t_2), \cdots, (e_n, t_n) \rangle \quad \text{for } e_i \in E,\ t_i \le t_{i+1},\ t_b \le t_i \le t_e \tag{3.1} \]

For example, in Figure 3.1 the event sequence sd in the data window ∆td can be written as:

\[ s_d = \langle (d_1, t_{i+1}), (d_2, t_{i+2}), (d_3, t_{i+3}), \cdots \rangle \]

and the event sequence sl in the lead-time window ∆tl can be written as:

\[ s_l = \langle (l_1, t_{i+j+1}), (l_2, t_{i+j+2}), (l_3, t_{i+j+3}), \cdots \rangle \]
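The formalization in Equation 3.1 maps directly onto a small data structure. The following Python sketch is one possible reading, given only as an illustration: the function name `event_sequence`, the use of floating-point timestamps, the closed interval boundaries and the sample values are assumptions and do not come from the thesis.

```python
from typing import List, Tuple

Event = Tuple[str, float]  # (event type e, occurrence time t)


def event_sequence(events: List[Event], t_b: float, t_e: float) -> List[Event]:
    """Return the events with t_b <= t <= t_e, ordered by occurrence time."""
    selected = [(e, t) for (e, t) in events if t_b <= t <= t_e]
    # sorted() is stable, so events with equal timestamps keep their input order,
    # which satisfies the non-strict ordering t_i <= t_{i+1} of Equation 3.1.
    return sorted(selected, key=lambda event: event[1])


# Hypothetical, unordered raw events: d* fall in a data window [0, 10],
# l* in a lead-time window [10, 20]. All timestamps are invented.
raw = [("l1", 11.0), ("d2", 2.0), ("d1", 1.0), ("d3", 3.0), ("l2", 12.0)]
s_d = event_sequence(raw, t_b=0.0, t_e=10.0)
s_l = event_sequence(raw, t_b=10.0, t_e=20.0)
```

Here `s_d` and `s_l` play the roles of the sequences sd and sl above: each is a time-ordered list of (event type, time) pairs restricted to its own window.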
