Data Mining Concepts - LnCm fault model : complexity and validation

predicates for error detection mechanism in order to enhance dependability and address vulnerabilities in software systems.

In contrast, the data mining-based approach proposed in this chapter seeks to discover injection points for multiple soft-errors in order to enhance dependability validation and address vulnerabilities in software systems.

8.2 Data Mining Concepts

Technological advances have led to a flood of data being generated and recorded, and the amount of data in the world continues to rise. These data are of no use until they are converted into useful information. Thus, it is necessary to analyse and understand this huge amount of data and extract useful information from it. The ability to extract useful knowledge hidden in these data and to act on that knowledge is becoming vital in today’s increasingly information-driven world. In the case of the research presented in this chapter, it is important to understand behavioural patterns of software systems that can be used for building high-level soft-error fault models.

8.2.1 Fundamentals of Data Mining

Data mining is a technology that automatically sifts through huge amounts of raw data, seeking regularities and patterns therein, with the aim of using the information obtained to forecast the behaviour of future data or, if the data itself is obscure, to derive knowledge about it. Data mining is well motivated in areas where processes generate vast volumes of raw data and analysing the data is highly complex.

Data mining sorts through large-scale data to discover patterns and establish relationships. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases. This process of finding correlations or patterns is called learning. The thing to be learned (the patterns or correlations) is called a concept, a target function or a model. The data input used for learning the concept is a set of instances. Each instance is an individual, independent example of the concept to be learned. It should be mentioned that some learning tasks, for example those involving time sequences, make it difficult to express the raw data as individual, independent instances and often require background knowledge to be considered as part of the input. However, the research presented in this chapter employs simple learning schemes, and the data used can be presented in the form of individual instances. Each instance is characterised by the values of attributes that measure different aspects of the instance. There are different types of attribute, although the research here deals only with numeric and nominal (or categorical) ones. The output produced by a learning scheme is called a concept description, a target function or a model. Data mining learning styles include:

• Classification: This involves seeking novel and informative patterns. If an existing structure is already known, data mining can be used to classify new cases into these pre-determined categories. That is, the learning scheme, called a classifier, is presented with classified instances from which it is expected to learn a way of classifying previously unseen instances. Classified instances are labelled with class values, and class values for new instances are then determined. In the case of the research presented in this chapter, classification algorithms are applied to fault injection instances to learn which fault injection points are likely to induce failure. In a sense, classification learning operates under supervision, as the actual outcome, i.e., the class, of each learning example is given. Thus, classification learning is sometimes called supervised learning. The target function of supervised learning is a discrete function and is also referred to as a classifier.


• Association: This involves finding patterns in which one event is correlated with another event. It seeks associations between attributes in general, not just ones that predict a particular class value. Association learning differs from classification in two ways: (i) it can determine the value of any attribute, not just the class, and (ii) it can determine the value of more than one attribute at a time.

• Clustering: This involves discovering and recognising distinct categories of facts not previously known within the data. It seeks groups of instances that naturally belong together. Clustering finds these clusters and assigns the instances to them and, if need be, assigns new instances to the clusters.

• Forecasting (or prediction): This finds patterns in the data that can lead to reasonable predictions about future probabilities and trends. This area of data mining is known as predictive analytics. It is used to predict missing or unavailable numerical data values rather than class labels. Regression analysis is generally used for prediction. Prediction can also be used to identify distribution trends based on available data.

• Sequence (or path) analysis: This is concerned with finding relevant patterns between data examples where the values are delivered in a sequence, i.e., where one event leads to another later event. The input data is a set of sequences called data sequences. Each data sequence is a list of transactions, where each transaction is a set of items. A sequential pattern also consists of a list of sets of items. Sequence pattern analysis aims to find all sequential patterns with a user-specified minimum support, where the support of a sequential pattern is the percentage of data sequences that contain the pattern.
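The support definition in the sequence analysis item can be made concrete with a short sketch. The Python below is an illustration only: the function names `contains_pattern` and `support` and the toy data are assumptions introduced here, not part of the original work. It assumes that a data sequence contains a pattern when the pattern's item sets can be matched, in order, to transactions that are supersets of them.

```python
from typing import FrozenSet, Sequence

ItemSet = FrozenSet[str]


def contains_pattern(data_sequence: Sequence[ItemSet],
                     pattern: Sequence[ItemSet]) -> bool:
    """True if every item set of `pattern` is a subset of some transaction,
    with the matched transactions occurring in the same order."""
    position = 0  # index of the next pattern item set still to be matched
    for transaction in data_sequence:
        if position < len(pattern) and pattern[position] <= transaction:
            position += 1
    return position == len(pattern)


def support(data_sequences: Sequence[Sequence[ItemSet]],
            pattern: Sequence[ItemSet]) -> float:
    """Percentage of data sequences that contain `pattern`."""
    if not data_sequences:
        return 0.0
    hits = sum(contains_pattern(seq, pattern) for seq in data_sequences)
    return 100.0 * hits / len(data_sequences)


# Toy data (hypothetical): three data sequences, each a list of transactions.
sequences = [
    [frozenset({"a"}), frozenset({"b", "c"}), frozenset({"d"})],
    [frozenset({"a", "b"}), frozenset({"d"})],
    [frozenset({"b"}), frozenset({"a"})],
]
pattern = [frozenset({"a"}), frozenset({"d"})]
print(support(sequences, pattern))
```

A pattern would then be reported only if its support reaches the user-specified minimum; here the first two sequences contain the pattern and the third does not.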

The work in this chapter focuses on classification learning; hereafter, discussion is restricted to concepts relating to supervised learning. In a simple domain, each instance is characterised by a set of $n$ attributes, and the set of instances is a subset of an $n$-dimensional space called the instance space, $I$. Every point in $I$ is a potential state of the process being modelled. In supervised learning, a data mining algorithm is tasked with learning a good approximation, $\hat{f}$, of the target function, $f$, given a set of instances called the training data set, $T_{train}$ ($T_{train} \subset I$), consisting of $N$ pairs $\langle x_i, f(x_i) \rangle$. The success of supervised learning is judged by trying out $\hat{f}$ on an independent set of instances called the test data set, $T_{test}$ ($T_{test} \subset I$), for which the true classifications are known but not made known to the learner. The success rate on $T_{test}$ gives an objective measure of how well the concept has been learned.
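As a minimal sketch of this evaluation scheme, the Python below learns an approximation of a target function from a training set and measures its success rate on an independent test set. Everything in it is an assumption made for illustration: the target function `target`, the two numeric attributes, the grid of training points and the choice of a 1-nearest-neighbour learner are not taken from the original work.

```python
from typing import Callable, List, Tuple

Instance = Tuple[float, float]   # a point in a 2-dimensional instance space
Labelled = Tuple[Instance, str]  # a pair <x_i, f(x_i)>


def target(x: Instance) -> str:
    """Hypothetical target function f; in practice it is unknown to the learner."""
    return "fail" if x[0] + x[1] > 1.0 else "ok"


def learn_1nn(train: List[Labelled]) -> Callable[[Instance], str]:
    """Return f_hat, which copies the label of the nearest training instance."""
    def f_hat(x: Instance) -> str:
        nearest = min(
            train,
            key=lambda pair: (pair[0][0] - x[0]) ** 2 + (pair[0][1] - x[1]) ** 2,
        )
        return nearest[1]
    return f_hat


def success_rate(f_hat: Callable[[Instance], str], test: List[Labelled]) -> float:
    """Fraction of test instances whose predicted class equals the true class."""
    correct = sum(f_hat(x) == label for x, label in test)
    return correct / len(test)


# Training set: a 5 x 5 grid of labelled instances.
train = [((i / 4, j / 4), target((i / 4, j / 4)))
         for i in range(5) for j in range(5)]

# Test set: independent instances; the labels are used only for scoring.
test_points = [(0.1, 0.1), (0.9, 0.9), (0.3, 0.2), (0.8, 0.6),
               (0.45, 0.45), (0.7, 0.1), (0.55, 0.55)]
test = [(x, target(x)) for x in test_points]

f_hat = learn_1nn(train)
rate = success_rate(f_hat, test)
print(f"success rate on test set: {rate:.3f}")
```

The learner only ever sees `train`; the labels in `test` are consulted solely inside `success_rate`, which mirrors the requirement that the true classifications are not made known to the learner.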

While there are many classification algorithms, most use the same workflow for approximating a function. Data mining involves a sequence of important steps. The steps for supervised learning include:

• Preparing data: This step includes transforming the data into an appropriate data mining format, i.e., creating data mining data sets, cleaning data and dealing with missing values, scaling and normalising data, transforming and reducing variables, partitioning the data set into training, validation and test data sets, addressing class imbalance, and carrying out exploratory data analysis using graphical and statistical techniques.

• Choosing an algorithm: Classification algorithms include regression, decision trees, rule induction, support vector machines (SVM), neural networks, genetic algorithms, naïve Bayes and nearest-neighbour methods. The key differences between classification algorithms lie in the kind of decision boundary that is defined between classes, i.e., their functional form and the set of parameters they fit, and in the heuristic they employ in searching for the optimal function, also known as the hypothesis, within the space of possible hypotheses as defined by the functional form of the hypotheses.


• Fitting a model: This involves adjusting the learning parameters of the chosen classification scheme and building the model from a training data set. The learning parameters and the type of model generated depend on the algorithm used.

• Choosing a validation method: This involves selecting measures to examine the accuracy of the resulting fitted model. The model is validated in order to obtain a measure of its expected accuracy on unseen data. Often the accuracy of a model is evaluated with respect to the percentage of test data instances correctly classified; hence, most algorithms seek to learn hypotheses that minimise the number of errors.

• Examining fit and updating until satisfied (model refinement): After validating the model, there might be a need to change it for better accuracy, better speed, or to use less memory.

• Using fitted model for predictions: This involves interpreting the model and drawing conclusions.
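Two of the data preparation tasks listed above, scaling numeric attributes and partitioning the data set, can be sketched as follows. This is a minimal illustration rather than the procedure used in this work: the helper names `min_max_scale` and `partition`, the 60/20/20 split, the fixed seed and the toy rows are all assumptions.

```python
import random
from typing import List, Sequence, Tuple

Row = List[float]


def min_max_scale(rows: Sequence[Sequence[float]]) -> List[Row]:
    """Rescale every numeric attribute (column) to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [
            (value - lo) / (hi - lo) if hi > lo else 0.0  # constant column maps to 0
            for value, lo, hi in zip(row, lows, highs)
        ]
        for row in rows
    ]


def partition(rows: Sequence[Row], train_frac: float = 0.6,
              validation_frac: float = 0.2,
              seed: int = 0) -> Tuple[List[Row], List[Row], List[Row]]:
    """Shuffle the rows and split them into training, validation and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_train = round(len(shuffled) * train_frac)
    n_validation = round(len(shuffled) * validation_frac)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_validation],
        shuffled[n_train + n_validation:],
    )


# Toy data (hypothetical): ten instances with two numeric attributes each.
raw = [[float(i), float(100 - 3 * i)] for i in range(10)]
scaled = min_max_scale(raw)
train_set, validation_set, test_set = partition(scaled)
print(len(train_set), len(validation_set), len(test_set))
```

Keeping the test partition untouched until the final evaluation is what allows the chosen validation measure to estimate accuracy on unseen data.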

The approach proposed in this chapter is to generate a simple model to guide the selection of efficient fault injection points. The main goal of the approach is to detect combinations of multiple bit-flips that are likely to result in system failure. For example, the approach aims to be able to predict that flipping bit 6 in variable A, bit 18 in variable B and bits 11 and 29 in variable C will most probably induce a system failure. Considering the objective of the approach and the nature of the datasets, this chapter focuses on discriminant methods to predict bit-flip combinations that may cause the system to fail; as such, decision trees, rule induction and naïve Bayes algorithms are considered. The efficiency of a fault injection point is determined by evaluating the quality of the model produced by the learning schemes, which is done by measuring the prediction capabilities of the model.
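To make the bit-flip example concrete, the sketch below shows how a multiple bit-flip injection might be applied and then recorded as a single instance with nominal attributes and a class label. The 32-bit variable width, the pre-injection values, the instance layout and the helper names `flip_bits` and `make_instance` are illustrative assumptions, not the tooling used in this work.

```python
from typing import Dict, Iterable

WORD_BITS = 32  # assumed width of each variable


def flip_bits(value: int, bits: Iterable[int]) -> int:
    """Return `value` with each listed bit inverted (bit 0 is least significant)."""
    for bit in bits:
        if not 0 <= bit < WORD_BITS:
            raise ValueError(f"bit {bit} is outside 0..{WORD_BITS - 1}")
        value ^= 1 << bit
    return value & ((1 << WORD_BITS) - 1)


def make_instance(injection: Dict[str, Iterable[int]], outcome: str) -> Dict[str, str]:
    """Encode one fault injection experiment as attribute values plus a class."""
    instance = {
        name: ",".join(str(bit) for bit in sorted(bits))
        for name, bits in injection.items()
    }
    instance["class"] = outcome
    return instance


# The combination mentioned in the text: bit 6 in A, bit 18 in B, bits 11 and 29 in C.
injection = {"A": [6], "B": [18], "C": [11, 29]}
state = {"A": 0, "B": 0, "C": 0}  # hypothetical pre-injection values

faulty = {name: flip_bits(state[name], injection[name]) for name in state}
print(faulty)

# The "failure" label is a placeholder for illustration, not a measured result.
print(make_instance(injection, outcome="failure"))
```

A collection of such labelled instances, one per injection experiment, is the kind of input on which a decision tree, rule induction or naïve Bayes learner could be trained.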

In document LnCm fault model : complexity and validation (Page 168-173)