Experiments - Using unsupervised machine learning for fault identification in virtual machines

This section broadly describes the approach taken, whilst the technical implementation details are described in Section 4.6. That section details what is necessary to run the framework and use it to identify faults. The rest of this section describes the overall operating strategy of the FDFs.

Both FDFs periodically sample behavioural feature data from a local system using the WMI. This data is then converted into vectors, one for each observed feature over a specified window of time, which in turn provides contextual inference and avoids convergence problems. The vectors are used to train stochastic primitives that predict the behaviour of features, so that features can be analysed for potentially errant behaviour.
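The conversion from periodic samples to per-feature vectors can be illustrated with a minimal sketch. It assumes that each sample is a dictionary of feature values and that behaviour is encoded as 1 for a change and 0 for no change between consecutive samples; the function names and the feature names `FreeMemoryMB` and `ServiceState` are hypothetical and are not taken from the framework.

```python
def change_vector(values):
    """Encode a series of values as 1 (changed) or 0 (unchanged) per step."""
    return [int(curr != prev) for prev, curr in zip(values, values[1:])]


def feature_windows(samples, window):
    """Build one change vector per feature from the last `window` samples."""
    recent = list(samples)[-window:]
    if not recent:
        return {}
    return {
        name: change_vector([sample[name] for sample in recent])
        for name in recent[0]
    }


# Hypothetical samples, one per polling interval.
samples = [
    {"FreeMemoryMB": 2048, "ServiceState": "Running"},
    {"FreeMemoryMB": 2048, "ServiceState": "Running"},
    {"FreeMemoryMB": 1900, "ServiceState": "Running"},
    {"FreeMemoryMB": 1900, "ServiceState": "Stopped"},
]
vectors = feature_windows(samples, window=4)
```

Each vector has one entry fewer than the window, because a change is only defined between two consecutive samples.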

4.3. EXPERIMENTS 69

Errant behaviour is determined by comparing the actual and forecasted changes for each monitored feature when an SLO fails. Any feature that does not exhibit the behaviour predicted by its respective stochastic primitive is short-listed as a potential lead for the root cause of a fault. These leads are prioritised by sorting them by the inverse likelihood of the change observed, so that less expected events are moved further toward the top of the list.
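The ordering of leads can be sketched as a sort on the likelihood that each primitive assigned to the observed change. The likelihood values and feature names below are invented purely for illustration, and `rank_leads` is a hypothetical helper rather than part of the framework.

```python
def rank_leads(observed_likelihoods):
    """Return feature names ordered from least to most expected change."""
    return sorted(observed_likelihoods, key=observed_likelihoods.get)


# Hypothetical likelihoods assigned by each feature's primitive
# to the change that was actually observed.
likelihoods = {"ServiceState": 0.02, "FreeMemoryMB": 0.40, "CpuLoad": 0.15}
leads = rank_leads(likelihoods)
```

Here the lead at index 0 is the feature whose observed change the primitive considered least likely.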

In order to determine if the data should be used to train a stochastic primitive, a series of performance tests determines the overall health of the system being observed. How these tests operate is described in Section 4.6 – but, broadly stated, a passing series of performance tests reinforces behaviours in the primitives through training, and a failure sends a signal to begin forecasting and temporally comparing feature behaviours.
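This pass-or-fail dispatch can be sketched as follows, assuming that performance tests are callables returning a boolean and that training and analysis are supplied as callbacks. All names, including the `ResponseMs` feature and its threshold, are hypothetical.

```python
def process_sample(sample, performance_tests, train, analyse):
    """Train on a healthy sample; otherwise analyse it for leads."""
    if all(test(sample) for test in performance_tests):
        train(sample)  # a passing series reinforces learned behaviour
        return None
    return analyse(sample)  # a failure triggers forecasting and comparison


trained = []
tests = [lambda s: s["ResponseMs"] < 500]  # hypothetical health check


def find_leads(sample):
    return ["lead"]  # stand-in for the real analysis step


healthy_result = process_sample({"ResponseMs": 120}, tests, trained.append, find_leads)
faulty_result = process_sample({"ResponseMs": 900}, tests, trained.append, find_leads)
```

Only the healthy sample reaches the training callback, so faulty data never reinforces the primitives.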

Using this approach validates or invalidates the hypothesis by testing the forecasting capabilities of stochastic primitives. If the primitive successfully indicates the correct root cause of the fault after a performance test fails, then it is clear that the forecasting abilities are working correctly. Conversely, if the primitives do not indicate the correct feature then the hypothesis is not supported.

Fidelity being a chief concern in these experiments, different volumes of data are used to show trends in the approach. Specifically, how much data is needed to train the primitives is explored by using 5, 10, 15, 20, 25, and 30 samples as input. Each sample corresponds to a one-minute interval. These values were chosen arbitrarily with the intent to provide enough time for the system to accommodate changes.

This approach operates on a few assumptions. The first is that without changes within the observed features’ values the stochastic primitives used in this experiment would not function at all. This is one of the reasons for waiting 60 seconds between samples. The second is that the fault must lie within the observed features’ behavioural data to have a chance of being accurately indicated. One test explores beyond this assumption with surprisingly positive results, but those results are, expectedly, not accurate.

Several key aspects are addressed with the FDFs that appear to be missing from current self-healing systems research. In addition to open questions about fidelity and a lack of basic comparison of performance between different types of primitives, little research exists that explores solutions in the aforementioned fashion. Specifically, the simple observation of feature changes rather than their explicit values had not been examined. Additionally, only one study so far has attempted to use evolutionary programming techniques to explore recovery strategies [27]. This work attempts to address some of the search-space challenges within that study by providing a mechanism for guiding GAs (see Chapter 6, Future Work).

4.3.1 Hidden Markov Models & Artificial Neural Networks

The first experiment leverages HMMs and ANNs to periodically sample configuration data via an interface, then classify this data using a series of performance tests. Based on the collective results of these tests, the information is categorised as being in either a good or a faulty state. Afterwards, the FDF takes one of two potential actions.

The first action is to update a local data-store. When the system passes all of its performance tests, the existing data-store is examined to make sure its total number of configuration samples does not exceed the maximum threshold. Any data beyond the maximum number of data-sets is dropped. The primitives are then greedily trained with their respective learning algorithms on the remaining and latest configuration samples.
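A bounded data-store of this kind can be sketched with a fixed-length queue. The sketch assumes that the oldest samples are the ones dropped once the threshold is exceeded, which the text does not state explicitly; `SampleStore` is a hypothetical name.

```python
from collections import deque


class SampleStore:
    """Keep at most `max_samples` configuration samples."""

    def __init__(self, max_samples):
        # deque discards items from the opposite end once maxlen is reached,
        # so appending on the right drops the oldest sample on the left.
        self._samples = deque(maxlen=max_samples)

    def add(self, sample):
        self._samples.append(sample)

    def window(self):
        """Return the retained samples, oldest first, ready for training."""
        return list(self._samples)


store = SampleStore(max_samples=3)
for minute in range(5):
    store.add({"minute": minute})
```

After five additions the store holds only the three most recent samples, which would then be passed to the learning algorithm.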

The second is to perform an analysis on the system’s feature behaviour data. Each feature in the previous ‘good’ sample is compared with its counterpart in the ‘faulty’ configuration. If a change is noted, the feature is short-listed for comparison. This is an optimisation technique that reduces the maximum number of features for investigation. Any changes are fed into their respective stochastic primitive, where the likelihood of the change is then compared to a forecasted value. The differences between the expected (i.e. forecasted) value and the actual value (located within the faulty configuration) provide a measure of confidence or likelihood for the potential cause of the fault. Once this comparison is complete the feature is added to a list of potential root causes. Finally, this list is sorted by highest likelihood, starting with the first (0th) index (Figure 4.5). Using this list, metrics are generated via the FDFs that indicate precision, accuracy, prediction time, the aforementioned confidence value, and the total number of leads generated. The conditions of these metrics, such as what constitutes True and False Positives or their respective Negatives, are explained in Section 4.3.
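The short-listing step can be sketched as a comparison of two configuration dictionaries. This is an illustration only: it assumes both samples expose the same feature names, and `shortlist_changes` and the feature names are hypothetical.

```python
def shortlist_changes(good, faulty):
    """Return features whose value differs between the two samples."""
    shared = good.keys() & faulty.keys()
    return {
        name: (good[name], faulty[name])
        for name in sorted(shared)
        if good[name] != faulty[name]
    }


# Hypothetical last 'good' sample and the 'faulty' configuration.
good = {"FreeMemoryMB": 2048, "ServiceState": "Running", "PageFileMB": 4096}
faulty = {"FreeMemoryMB": 2048, "ServiceState": "Stopped", "PageFileMB": 1024}
shortlist = shortlist_changes(good, faulty)
```

Unchanged features such as `FreeMemoryMB` never reach a primitive, which is where the reduction in work comes from.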

Testing is done through fault injections. These take two forms: DFIs and ACCs. The details of their implementations and differences are discussed in Section 4.3, but the theory behind these two approaches can be summarised as examining the differences between software errors and human errors, respectively. In each case the root cause is known to the administrator but not to the FDF. This allows for validation of the result provided by the FDF, and an unbiased attempt at identifying the source of the fault – the latter being the primary goal of these experiments. Both the ANN and HMM approaches operate using single-step prediction that has been implemented in a reactive manner. Whilst unsupervised learning is generally meant to forecast behaviours into the future, this experiment is meant to be a baseline to determine the accuracy of future endeavours. Understanding their operational capacities is therefore emphasised.

Figure 4.5: Fault Detection Framework Logic & Architecture using Hidden Markov Models and Artificial Neural Networks. Fault Detection Frameworks are provided three inputs, set to run, and then injected with faults at varying time intervals. The result is an ordered list of leads based on forecasted feature behaviours.

Lastly, each of the FDFs requires basic instantiation before operating. As mentioned previously, this book does not centre on self-configuring (i.e. self-provisioning) methods, and an initial, minimal setup is required. The FDF must be provided with a polling interval, a set of performance tests from which to ascertain the system’s overall health, and a stochastic primitive with a coupled learning algorithm.
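The three inputs can be pictured as a small configuration object. The sketch below is hypothetical: the class name, field names, and validation rules are assumptions made for illustration and do not reflect the framework's actual interface.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


class PlaceholderPrimitive:
    """Stand-in for a stochastic primitive and its learning algorithm."""


@dataclass(frozen=True)
class FDFConfig:
    polling_interval_s: float  # seconds between samples
    performance_tests: Sequence[Callable[[dict], bool]]
    primitive_factory: Callable[[], object]  # builds a fresh primitive

    def __post_init__(self):
        if self.polling_interval_s <= 0:
            raise ValueError("polling interval must be positive")
        if not self.performance_tests:
            raise ValueError("at least one performance test is required")


config = FDFConfig(
    polling_interval_s=60,
    performance_tests=[lambda sample: True],
    primitive_factory=PlaceholderPrimitive,
)
```

Passing a factory rather than an instance is a design choice of this sketch; it allows a fresh primitive to be built whenever retraining from scratch is required.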

4.3.2 Restricted Boltzmann Machines

FDFs that leverage RBMs operate under nearly identical assumptions and conditions as those used for ANNs and HMMs. The polling interval, performance tests, and learning modules are provided to the FDF. Afterwards, it is allowed to run for 30 minutes before being subjected to the same DFIs and ACCs.

The primary difference when using RBMs is how the primitives are trained. In the former experiment a greedy approach is used in conjunction with a windowed data-set. This necessitates that the primitives are destroyed and retrained after every successful data collection. Thus, although faster predictions can be made under these conditions, the approach also requires more persistent use of a system’s resources. This can create an artificial limitation if the number of features grows beyond what the system can parse within the polling interval. To alleviate this problem the RBM approach uses a lazy implementation. Data is not directly parsed until a potential fault has been detected. The caveat is that the FDF is unable to return potential root causes as quickly as its counterparts.
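The trade-off between the two training strategies can be sketched by counting how often a model is fitted. The classes below are hypothetical, the built-in `len` stands in for a real learning algorithm, and the windowing of the data-set is omitted for brevity.

```python
class GreedyTrainer:
    """Refit from scratch after every healthy sample."""

    def __init__(self, fit):
        self._fit = fit
        self._samples = []
        self.model = None
        self.fit_calls = 0

    def on_healthy_sample(self, sample):
        self._samples.append(sample)
        self.model = self._fit(self._samples)  # cost paid on every poll
        self.fit_calls += 1

    def on_fault(self):
        return self.model  # model is already available


class LazyTrainer:
    """Store samples and fit only once a fault is suspected."""

    def __init__(self, fit):
        self._fit = fit
        self._samples = []
        self.model = None
        self.fit_calls = 0

    def on_healthy_sample(self, sample):
        self._samples.append(sample)  # no parsing yet

    def on_fault(self):
        self.model = self._fit(self._samples)  # cost paid at fault time
        self.fit_calls += 1
        return self.model


greedy, lazy = GreedyTrainer(len), LazyTrainer(len)
for minute in range(30):
    greedy.on_healthy_sample(minute)
    lazy.on_healthy_sample(minute)
```

After 30 healthy samples the greedy trainer has fitted 30 times and the lazy trainer not at all; the lazy trainer defers its single fit to the moment a fault is reported, which is why its answer arrives later.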

A secondary difference exists in that some of the vectors used to train the RBMs are partially incomplete – this is intentional. Although the goals of these experiments remain intact by not using simulated feature behaviour data, the way in which the RBM is trained imposes some unique requirements. Specifically, the RBMs cannot be trained without a complete dataset because the learning inputs must be vectors of equal size.

There are two ways to address this problem. The first is to wait double the amount of time, in this case 60 minutes, so that a full window containing the maximum number of configuration samples can populate. The second is to assume vectors the size of the data-set window (i.e. maximum sample size) and populate them as more information becomes available. In the former, the amount of data used by the RBM is greater than that used in the other experiment. As the total time spent observing the system is a key variable in understanding how quickly the primitives can be trained, the first option is ruled out. Adopting the second option additionally allows the experiments to use the same amount of time to attempt to generate results.
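The second option can be sketched as a fixed-size vector that is filled in as observations arrive. The sketch uses `None` as a hypothetical placeholder for values not yet observed and places the placeholders at the end of the vector; both choices are assumptions made for illustration.

```python
def padded_vector(observations, max_samples):
    """Return a vector of exactly `max_samples` entries.

    Observed entries keep their value; missing entries are None.
    """
    recent = list(observations)[-max_samples:]
    return recent + [None] * (max_samples - len(recent))


# Three observations so far, with a maximum sample size of five.
partial = padded_vector([1, 0, 1], max_samples=5)
full = padded_vector([1, 0, 1, 1, 0, 0], max_samples=5)
```

Both vectors have the same length regardless of how much data has been collected, which satisfies the equal-size requirement on the learning inputs.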

RBMs are trained using half as much data as they could otherwise use. The vectors are instantiated based on the provided maximum sample size. Each vector contains a series of values, with each value indicating a certain observation of feature behaviour – change, no change, or unknown (1, 0 and null, respectively). Using the latter of these indicators, vector
