A SURVEY ON FAILURE PREDICTION METHODS


MUTHUMANI N

Department of Computer Applications, SNR Sons College, Coimbatore-641 006.

DR. ANTONY SELVADASS THANAMANI

Department of Computer Science, NGM College, Pollachi.

Abstract:

Preventive measures against anomalous system behavior depend on a failure prediction mechanism. An enormous number of faults can occur in a computing system and lead to system failure. Because faults are initially unknown and cannot be measured directly, they become visible only through the error messages produced when they are detected. This paper presents a survey of various failure prediction methods.

Keywords: Fault; Error; Failure; Failure prediction; Log files.

1. Introduction

An important characteristic of an intelligent agent is its ability to learn from previous experience in order to predict future events. The mechanization of the learning process by computer algorithms has led to a vast amount of research on the construction of predictive algorithms. Failure prediction methods differ fundamentally in whether they evaluate the current state of the system. Since the current state can only be considered if some monitoring of the system is used as input data, such methods are also called monitoring-based methods. Methods that evaluate the current system state can be further divided into three categories according to the stage of failure evolution at which observations are taken: monitoring of symptoms, detection of errors, or observation of failures.

2. Literature Survey

Errin W. Fulp et al. [8] introduced a new system failure prediction method using Support Vector Machines (SVM). The source of data is system log files. The proposed approach takes advantage of the sequential nature of log messages and determines which sequences of messages are precursors to failure. Each message was represented using its tag value, which offers an indication of message criticality. Experimental results using log files from a large 1024-node Linux-based compute cluster indicate that the spectrum representation of messages combined with an SVM classifier can achieve an accuracy of 73%.
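
As a rough illustration of the spectrum representation mentioned above, the following sketch counts every contiguous length-k subsequence of message tags in a window; a count vector of this kind is what a classifier such as an SVM could consume. The tag alphabet, the value of k, the sample window and the function name `spectrum_features` are assumptions made for illustration and are not taken from [8].

```python
from collections import Counter
from itertools import product


def spectrum_features(tags, k=2, alphabet=(0, 1, 2)):
    """Count every contiguous length-k subsequence (k-mer) of message tags.

    Returns a fixed-length vector with one entry per possible k-mer over
    the (assumed) tag alphabet, in lexicographic order.
    """
    counts = Counter(tuple(tags[i:i + k]) for i in range(len(tags) - k + 1))
    return [counts[kmer] for kmer in product(alphabet, repeat=k)]


# Hypothetical window of message tags; the alphabet {0, 1, 2} is a toy choice.
window = [2, 2, 1, 0, 2, 1, 0]
features = spectrum_features(window, k=2)
print(features)
```

Each window of the log would yield one such vector, labeled by whether a failure followed it; training the classifier itself is left out of this sketch.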

Fu et al. [9] developed a spherical covariance model with an adjustable timescale parameter to quantify temporal correlation, and a stochastic model to characterize spatial correlation. They discovered more correlations among failure instances by taking into account information on application allocation. They clustered the failure events based on their correlations and predicted their future occurrences. Experimental results on a production coalition system, the Wayne State Grid, show that the offline and online predictions of their system can forecast 72.7% to 85.3% of the failure occurrences and capture failure correlations in a cluster coalition environment.

In [21], Xiaojuan Ren et al. developed a multi-state model to represent the characteristics of resource failures in Fine-Grained Cycle Sharing (FGCS) systems. They applied a semi-Markov Process (SMP) to predict the probability that no resource failure will happen in a future time window, based on the host resource usage history. The SMP-based prediction model was implemented and tested in the iShare Internet sharing system. Experimental results show that the prediction algorithm adds less than 0.006% overhead to a guest job and that the prediction accuracy is higher than 86.5% on average. The effectiveness of the prediction in accommodating deviations of host workloads was also tested, and the results show that the impact of the deviations on their prediction is negligible.


demonstrated that the proposed adaptive fault management scheme can be effective even with modest prediction accuracy.

F. Salfner et al. presented a new approach to failure prediction called Similar Events Prediction (SEP) [22]. It is based on the recognition of suspicious patterns of error events. They compared SEP with two failure prediction techniques of the same class, which evaluate event logs such as error or failure logs: the Dispersion Frame Technique (DFT) and a reliability-based technique. All three models were applied to data from a complex commercial telecommunication system. The predictive power of the approaches was compared in terms of precision, recall, F-measure and accumulated runtime cost. They demonstrated that SEP outperformed the other failure prediction techniques in all measures and achieved a precision of 80% and a recall of 92%.
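
The metrics named above can be computed from a confusion count as in the minimal sketch below. The counts are hypothetical and were chosen only so that precision and recall match the 80% and 92% quoted for SEP. The balanced form of the F-measure (F1) is an assumption, since the survey does not state which weighting was used.

```python
def precision_recall_f(tp, fp, fn):
    """Return (precision, recall, F1) from true-positive, false-positive
    and false-negative counts. Empty denominators yield 0.0."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 92 failures predicted correctly, 23 false alarms,
# 8 failures missed.
p, r, f = precision_recall_f(tp=92, fp=23, fn=8)
print(f"precision={p:.2f} recall={r:.2f} F1={f:.3f}")
```

Precision penalizes false alarms and recall penalizes missed failures, which is why surveys of this kind usually report both rather than accuracy alone.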

Woochul Kang and Andrew Grimshaw presented a new method for failure prediction, Filtered Failure Prediction (FFP) [27], in which periodic failures are first determined and then filtered from the failure list. The remaining failures are then used in a traditional statistical method. The use of prefiltering leads to an order of magnitude better predictions.

Liang et al. [18] collected event logs over an extensive period from IBM BlueGene/L and developed a prediction model based on the real failure data. They partitioned time into intervals and tried to find the fatal and failure events within the prediction window based on the event characteristics of the preceding intervals. They addressed two main challenges: feature selection and classification.

Zhiguo Li et al. [29] presented an effective data-driven technique to predict the occurrence of failure events based on event sequence data. The Cox proportional hazard model was used to provide a rigorous statistical prediction of system failure events. They developed an algorithm to extract the frequent failure signatures, and two types of failure signatures, parallel and serial, were identified efficiently. By coding the failure signatures as time-dependent covariates and interactions, a Cox prediction model was developed based on the frequent failure signatures.

Turnbull and Alldrin [24] analyzed hardware sensor data to predict failures in a high-end compute server. Features are extracted using sensor windows and potential failure windows. They trained radial basis function (RBF) networks on these features and achieved a 0.87 true positive rate and a 0.10 false positive rate for predicting failures, using a data set of sensor and failure information collected over a 5-month period. This shows that sensor data can be used to predict failures in hardware systems. They demonstrated that RBF network classifiers work well in terms of both computational performance and classification accuracy. Classification accuracy was further improved by feature subset selection.

P. Gujrati et al. [10] presented a new framework for failure prediction in Blue Gene/L, which comprises three phases: event preprocessing, base prediction and meta-learning prediction. They proposed the use of meta-learning to improve failure prediction in large-scale clusters such as Blue Gene/L. The proposed framework adaptively integrates and combines two widely used base prediction methods, a statistical method and a rule-based method, for discovering various fault modes. They demonstrated that the proposed meta-learning prediction can improve failure prediction accuracy by up to three times.

In [11], Jiexing Gu et al. presented a dynamic meta-learning prediction engine for large-scale systems. Here, the “dynamic” part has two aspects: one is to continuously enlarge the training set during system operation, and the other is to dynamically modify the rules of failure patterns by tracing prediction accuracy at runtime. They used a 130-week RAS log from the production Blue Gene/L system at SDSC and showed that the engine can effectively forecast failures with a precision of 0.9-1.0 and a recall of 0.7-0.8.

Hoffmann et al. [13] employed two modelling approaches for failure forecasting: an extended Markov chain model and a function approximation technique utilising universal basis functions (UBF). Their results show that these methods can achieve 82%-92% accuracy in predicting rare failure events.


R. Vilalta and S. Ma [25] described an approach to detect patterns in event sequences. They assumed the existence of special events called target events. Using association rule mining techniques, they found patterns that frequently occur before target events. The patterns are then combined into a rule-based model for prediction. They demonstrated the importance of the size of the time window preceding target events. Experiments on two different combinations of event type and host of interest show that the false negative error rate decreases significantly as the time window increases.
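
The role of the time window can be illustrated with a deliberately simplified sketch: for each target event, collect the event types seen in the preceding window, then measure how often each type precedes a target. This is only the single-item support step of association rule mining, not the full procedure of [25]. The log entries, event names and function names are hypothetical.

```python
from collections import Counter


def preceding_event_sets(events, target, window):
    """For each target event, collect the set of other event types seen
    in the `window` time units before it.

    `events` is a list of (timestamp, event_type) pairs sorted by time.
    """
    sets = []
    for t_target, kind in events:
        if kind != target:
            continue
        sets.append({
            k for t, k in events
            if t_target - window <= t < t_target and k != target
        })
    return sets


def support(sets):
    """Fraction of target events that were preceded by each event type."""
    counts = Counter(k for s in sets for k in s)
    return {k: c / len(sets) for k, c in counts.items()} if sets else {}


# Hypothetical event log with "crash" as the target event.
log = [(1, "link_down"), (3, "cpu_high"), (5, "crash"),
       (20, "cpu_high"), (28, "link_down"), (30, "crash")]
for w in (3, 15):
    print(w, support(preceding_event_sets(log, "crash", w)))
```

In this toy log the narrow window sees each precursor before only half of the crashes, while the wider window sees both precursors before every crash, which mirrors the observation that a larger window reduces missed predictions.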

Hamerly and Elkan [12] introduced two Bayesian methods for disk drive failure prediction. The first is a mixture model of naive Bayes submodels (i.e. clusters) that is trained using expectation-maximization. The second is a naive Bayes classifier, a supervised learning approach. Both methods were tested on real-world data concerning 1936 drives. The predictive accuracy of both algorithms is far higher than that of the thresholding methods used in the disk drive industry at the time, and they perform well enough to be useful in practice.

Bianca Schroeder et al. [1] analyzed failure data from a high-performance computing site. They used data collected at Los Alamos National Laboratory that includes 23000 failures recorded on more than 20 different systems, mostly large clusters of SMP and NUMA nodes. They studied the statistics of the data, including the root cause of failures, the mean time between failures, and the mean time to repair. The time between failures is modeled well by a Weibull distribution with decreasing hazard rate.
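
For reference, the hazard rate of a Weibull distribution with shape k and scale λ is h(t) = (k/λ)(t/λ)^(k-1), which decreases over time exactly when k < 1. The sketch below evaluates it at a few time points. The shape and scale values are illustrative assumptions, not parameters reported in [1].

```python
def weibull_hazard(t, shape, scale):
    """Weibull hazard rate h(t) = (shape/scale) * (t/scale)**(shape - 1), t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)


# Illustrative parameters: shape < 1 gives a decreasing hazard rate.
shape, scale = 0.7, 1.0
hazards = [weibull_hazard(t, shape, scale) for t in (1.0, 10.0, 100.0)]
print(hazards)
```

A decreasing hazard rate means that the longer a system has run since its last failure, the less likely it is to fail in the next instant. With shape equal to 1 the hazard is constant and the model reduces to the exponential distribution.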

I. Lee et al. [16] presented a methodology for the analysis of automatically generated event logs from fault-tolerant systems. They used event log data from three Tandem systems. The raw event log was reduced by event filtering and time-domain clustering. Probability distributions that characterize the error detection and recovery processes were obtained, and the corresponding hazards were calculated. Multivariate statistical techniques (factor analysis and cluster analysis) were used to investigate error and failure dependency among different system components.

In [4], Chang-Hua Hu et al. developed a novel reliability prediction technique based on the evidential reasoning (ER) algorithm. The ER algorithm is applied to forecast reliability in turbocharger engine systems, and its feasibility and validity in system reliability prediction are examined. Nonlinear optimization models are used to find the optimal parameters of the forecasting model by minimizing the mean square error (MSE) criterion.

J. Brevik et al. [3] examined the problem of predicting machine availability in desktop and enterprise computing environments. They compared one parametric and two non-parametric methods for predicting machine availability. They evaluated the methods on a synthetic trace and on machine availability traces from three separate desktop and enterprise computing environments. Their results show that a non-parametric approach is better in most experiments at estimating the lower bound of a given quantile, especially when the sample size is small. They found that a non-parametric method based on a binomial approach generates the most accurate estimates.

Ei-Aroui and Soler [7] used a Bayesian statistical model to track and predict software reliability. They assumed a stochastic model in which the successive times between software failures are exponentially distributed. Using numerical examples, they showed that the proposed method is useful for both simulated failure data and real data.

T.-T. Y. Lin and D. P. Siewiorek [20] considered two types of errors: transient and intermittent. They developed a technique called the Dispersion Frame Technique (DFT), which is based on the shape of the interarrival time function of the intermittent errors observed in actual error logs. The DFT was implemented in a distributed on-line monitoring and predictive diagnostic system for the campus-wide Andrew file system at Carnegie Mellon University. Data collected from 13 file servers over a 22-month period were analyzed using both the DFT and conventional statistical methods. The results show that the DFT can extract intermittent errors from the error log and that it uses only one fifth of the error log entry points required by statistical methods for failure prediction.

Liang et al. [19] predict failures of IBM’s BlueGene/L from event logs containing reliability, availability and serviceability data. They use temporal and spatial compression. Temporal compression includes all events at a single location occurring with inter-event times lower than some threshold, and spatial compression includes all messages that refer to the same location within some time window. Berenji et al. [2] present a novel hybrid Model based and Data Clustering (MDC) architecture for fault monitoring and diagnosis, which is suitable for complex dynamic systems with continuous and discrete variables. Cheng et al. [5] proposed an application cluster service (APCS) scheme. The proposed APCS provides both a failover scheme and a state recovery scheme for failure management.
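
The temporal compression described for [19] above can be sketched as follows: events at the same location that follow one another within a threshold are treated as one burst, and only the first event of each burst is kept. Keeping the first event, measuring the gap to the immediately preceding event at that location, and the sample records and location names are all assumptions made for illustration. Spatial compression is not shown.

```python
def temporal_compress(events, threshold):
    """Collapse bursts of events reported at the same location.

    `events` is a list of (timestamp, location, message) tuples sorted by
    timestamp. An event is dropped when the previous event at the same
    location occurred less than `threshold` time units earlier.
    """
    last_seen = {}  # location -> timestamp of its most recent event
    kept = []
    for ts, loc, msg in events:
        prev = last_seen.get(loc)
        if prev is None or ts - prev >= threshold:
            kept.append((ts, loc, msg))
        last_seen[loc] = ts
    return kept


# Hypothetical RAS-style records: (seconds, location, message).
raw = [(0, "R00", "ecc"), (2, "R00", "ecc"), (3, "R01", "link"),
       (4, "R00", "ecc"), (60, "R00", "ecc")]
compressed = temporal_compress(raw, threshold=5)
print(compressed)
```

Because the gap is measured to the most recent event rather than to the start of the burst, a long chain of closely spaced messages collapses to a single record.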


The present warning-algorithm based on maximum error thresholds is replaced by distribution-free statistical hypothesis tests.

Daidone et al. [6] proposed a hidden Markov model approach. Taking advantage of the characteristics of the hidden Markov model formalism, which is widely used in pattern recognition, they proposed a formalization of the diagnosis process that addresses the complete chain of monitored component, deviation detection and state diagnosis. The method is based on concurrent monitoring, so it could also be used for failure prediction: if a component is detected to be faulty, a failure is likely to occur.

Weiss [26] introduced a failure prediction technique called “timeweaver” that is based on a genetic training algorithm. Timeweaver is a genetic-based machine learning system that solves the event prediction problem by identifying predictive temporal and sequential patterns within data. Leangsuksun et al. [15] describe their implementation of predictive checkpointing for a high-availability, high-performance Linux cluster.

3. Conclusion

Preventive measures against anomalous system behavior depend on a failure prediction mechanism. An enormous number of faults can occur in a computing system and lead to system failure. Because faults are initially unknown and cannot be measured directly, they become visible only through the error messages produced when they are detected. A survey of failure prediction methods has been presented here.

Table 1: Failure Prediction Methods

| Study | Date | Length | Environment | Type of Data | Approach |
|---|---|---|---|---|---|
| 1 | 2008 | 24 Months | 1024-Node Linux-Based Compute Cluster | System Log Files | SVM |
| 2 | 2007 | - | Wayne State Grid | Failure Log | Spherical Covariance & Stochastic Model |
| 3 | 2006 | 2 Days | Commercial Telecommunication Platform | Error Logs | SEP |
| 4 | 2006 | 8 Months | Supercomputer Platinum at NCSA | Failure Log | FT-Pro |
| 5 | 2006 | 3 Months | iShare Internet Sharing System | Log | Semi-Markov |
| 6 | 2005 | 1 Month | CT | Log Files | Cox Proportional Hazard Model |
| 7 | 2005 | 20 Weeks | IBM BlueGene/L | RAS Event Logs | Customized Nearest Neighbor |
| 8 | 2005 | 3 Months | University of Virginia | Monitoring | FFP |
| 9 | 2004 | 5 Months | Single Server | Sensor and Failure Information | RBF |
| 10 | 2004 | 130 Weeks | IBM BlueGene/L Systems at SDSC | RAS Event Logs | Dynamic Meta-Learner |
| 11 | 2004 | 20 Months | IBM BlueGene/L Systems at ANL and | RAS Event Logs | Meta-Learner |
| 12 | 2004 | 53 Days | Commercial Telecommunication Platform | RAS Event Logs & Error Logs | UBF |
| 14 | 2002 | 1 Month | Network Having 750 Hosts | Event Log | Rule-Based Model |
| 15 | 2002 | 1 Year | 350-Node Cluster System | Event Log, SAR Data, Node Topology | Time Series, Rule-Based, Bayesian Network |
| 16 | 2001 | - | Hardware Systems-Disk Drives | Quantum SMART Dataset | Naive Bayes EM |
| 17 | 1997 | 9 Years | Los Alamos National Laboratory | Failure Data | Weibull Distribution |
| 18 | 1991 | - | 3 Tandem Systems | Event Log | Multivariate Statistical Techniques |
| 21 | 1996 | - | 40 Suits of Turbochargers | Time to Failure Data | ER Algorithm |


References

[1] Bianca Schroeder, Garth A. Gibson, A Large-Scale Study of Failures in High-Performance Computing Systems, Proceedings of the International Conference on Dependable Systems and Networks (DSN 2006), Philadelphia, USA, June 25-28, 2006.

[2] Berenji, H., Ametha, J., & Vengerov, D. Inductive Learning for Fault Diagnosis. In IEEE Proceedings of the 12th International Conference on Fuzzy Systems (FUZZ’03), volume 1, 2003.

[3] Brevik, J., D. Nurmi and R. Wolski, Automatic Methods for Predicting Machine Availability in Desktop Grid and Peer-to-Peer Systems, CCGRID, IEEE, 2004, pp. 190-199.

[4] Chang-Hua Hu, Xiao-Sheng Si, Jian-Bo Yang, System Reliability Prediction Model Based on Evidential Reasoning Algorithm with Nonlinear Optimization.

[5] Cheng, F., Wu, S., Tsai, P., Chung, Y., & Yang, H. Application Cluster Service Scheme for Near-Zero-Downtime Services. In IEEE Proceedings of the International Conference on Robotics and Automation, 4062–4067, 2005.

[6] Daidone, A., Di Giandomenico, F., Bondavalli, A., & Chiaradonna, S. Hidden Markov Models as a Support for Diagnosis: Formalization of the Problem and Synthesis of the Solution. In IEEE Proceedings of the 25th Symposium on Reliable Distributed Systems (SRDS 2006), Leeds, UK, 2006.

[7] Ei-Aroui, M., & Soler, J. A Bayes Nonparametric Framework for Software Reliability Analysis. IEEE Transactions on Reliability, 45, 652–660, 1996.

[8] Errin W. Fulp, Glenn A. Fink, Jereme N. Haack, Predicting Computer System Failures Using Support Vector Machines, Proceedings of the First USENIX Conference on Analysis of System Log

[9] Fu, S. & Xu, C.-Z. Quantifying Temporal and Spatial Fault Event Correlation for Proactive Failure Management. In IEEE Proceedings of Symposium on Reliable and Distributed Systems (SRDS 07), 2007.

[10] Gujrati, P., Y. Li, Z. Lan, R. Thakur, and J. White, “A Meta-Learning Failure Predictor for BlueGene/L Systems,” Proc. of ICPP’07.

[11] Jiexing Gu, Ziming Zheng, Zhiling Lan, John White, Eva Hocks, Byung-Hoon Park, Dynamic Meta-Learning for Failure Prediction in Large-Scale Systems: A Case Study, Proceedings of the International Conference on Parallel Processing, 2008.

[12] Hamerly, G. & Elkan, C. Bayesian Approaches to Failure Prediction for Disk Drives. In Proceedings of the Eighteenth International Conference on Machine Learning, 202–209. Morgan Kaufmann Publishers Inc., 2001.

[13] Hoffmann, G. A., Salfner, F., Malek, M. Advanced Failure Prediction in Complex Software Systems, SRDS 2004.

[14] Hughes, G., Murray, J., Kreutz-Delgado, K., & Elkan, C. Improved Disk-Drive Failure Warnings. IEEE Transactions on Reliability, volume 51(3): 350–357, 2002.

[15] Leangsuksun, C., Liu, T., Rao, T., Scott, S., & Libby, R. A Failure Predictive and Policy-Based High Availability Strategy for Linux High Performance Computing Cluster. In The 5th LCI International Conference on Linux Clusters: The HPC Revolution, 18–20, 2004.

[16] Lee, I., R. K. Iyer and D. Tang, Error/Failure Analysis Using Event Logs from Fault Tolerant Systems, Proceedings of the 21st Intl. Symposium on Fault-Tolerant Computing, 1991, pp. 10-17.

[17] Li, Y. & Lan, Z. Exploit Failure Prediction for Adaptive Fault-Tolerance in Cluster Computing. In IEEE Proceedings of the Sixth International Symposium on Cluster Computing and the Grid (CCGRID’06), 531–538. IEEE Computer Society, Los Alamitos, CA, USA.

[18] Liang, Y., Zhang, Y., Xiong, H., and Sahoo, R. Failure Prediction in IBM BlueGene/L Event Logs. In Proceedings of the IEEE International Conference on Data Mining, 2007.

[19] Liang, Y., Zhang, Y., Sivasubramaniam, A., Jette, M., & Sahoo, R. BlueGene/L Failure Analysis and Prediction Models. In IEEE Proceedings of the International Conference on Dependable Systems and Networks (DSN 2006), 425–434, 2006.

[20] Lin, T.-T. Y. and D. P. Siewiorek. Error Log Analysis: Statistical Modeling and Heuristic Trend Analysis. IEEE Transactions on Reliability, 39(4): 419–432, Oct. 1990.

[21] X. Ren, S. Lee, R. Eigenmann and S. Bagchi, Resource Failure Prediction in Fine-Grained Cycle Sharing System, IEEE HPDC, Paris, France, 2006.

[22] Salfner, F., M. Schieschke, and M. Malek. Predicting Failures of Computer Systems: A Case Study for a Telecommunication System. In Proceedings of IEEE International Parallel and Distributed Processing Symposium (IPDPS 2006), DPDNS Workshop, Rhodes Island, Greece, Apr. 2006.

[23] Sahoo, R. K., A. J. Oliner, I. Rish, M. Gupta, J. E. Moreira, S. Ma, R. Vilalta and A. Sivasubramaniam, Critical Event Prediction for Proactive Management in Large-Scale Computer Clusters, KDD '03: Proceedings of the Ninth ACM International Conference on Knowledge Discovery and Data Mining, ACM Press, Washington, D.C., 2003, pp. 426-435.

[24] Turnbull, D. & Alldrin, N. Failure Prediction in Hardware Systems. Technical Report, University of California, San Diego, 2003.

[25] Vilalta, R. and S. Ma, “Predicting Rare Events in Temporal Domains”, Proc. of IEEE Intl. Conf. on Data Mining, 2002.

[26] Weiss, G. Timeweaver: A Genetic Algorithm for Identifying Predictive Patterns in Sequences of Events. In Proceedings of the Genetic and Evolutionary Computation Conference, 718–725. Morgan Kaufmann, San Francisco, CA, 1999.

[27] Woochul Kang and Andrew Grimshaw, Failure Prediction in Computational Grids.

[28] Yang, S. A Condition-Based Failure-Prediction and Processing-Scheme for Preventive Maintenance. IEEE Transactions on Reliability, volume 52(3): 373–383, 2003.

[29] Zhiguo Li, Shiyu Zhou, Suresh Choubey and Crispian Sievenpiper, Failure Event Prediction Using the Cox Proportional Hazard