Implementation - Change-aware Anomaly Detection

4. Change-aware Anomaly Detection

4.6. Implementation

This section describes the main post-processing steps that are applied to the monitoring data after it has been extracted from the monitored environment. Steps prefixed with “Kieker” use existing filters of the Kieker monitoring framework.

4.6.1. Reconstruct Traces (Kieker: TraceReconstructionFilter)

After the monitoring data has been read from the file system, the Kieker filter ExecutionRecordTransformationFilter transforms the monitored data from OperationExecutionRecords into Execution objects. These data structures hold the signature of the called operation, the times at which it was called and at which it returned, the host name on which it was executed, a call depth indicator, and a sequence counter. This is purely a conversion step; it does not alter the underlying data.

4.6.2. Extract Dependencies from Traces (MessageTraceDependencyExtractor)

In this step, the reconstructed traces are used to build objects that associate each monitored record with the complete execution path in which it was recorded. Because the monitoring data is assumed to come from a microservice environment, there may be multiple instances of a microservice, and a single instance may have performance problems on its own. Therefore, the exact path has to be taken into account so that anomalies of other instances are not mixed into the resulting data.

4.6.3. Extract Required Data from Records (Kieker: ExtractionFilter)

Once the dependencies have been extracted and associated with the records, some parts of the records are no longer needed. Therefore, all data is removed except the identifying name (which also contains a reference to the stored trace information), the timestamp, and the response time.

4.6.4. Aggregation (Kieker: TimeSeriesPointAggregatorFilter)

The monitored data is aggregated to obtain equidistant data points. The subsequent steps require data points of this kind.
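As a rough illustration of this aggregation step, the following Python sketch groups (timestamp, response time) pairs into fixed windows and emits one value per window. The function name, the use of the mean as the aggregate, and the skipping of empty windows are assumptions made for illustration; the Kieker filter may behave differently.

```python
from collections import defaultdict
from statistics import mean


def aggregate(points, span_s=5.0):
    """Aggregate (timestamp_s, value) pairs into equidistant windows.

    Returns a list of (window_start_s, mean_value) sorted by time.
    Using the mean and skipping empty windows are illustrative assumptions.
    """
    buckets = defaultdict(list)
    for timestamp_s, value in points:
        # Integer index of the window this point falls into.
        index = int(timestamp_s // span_s)
        buckets[index].append(value)
    return [(index * span_s, mean(values)) for index, values in sorted(buckets.items())]
```

With a span of five seconds, measurements at 0.5 s and 4.9 s end up in the same window, while a measurement at 5.0 s opens the next one.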

4.6.5. Forecasting (Kieker: ForecastingFilter)

The forecasting step uses the previous data points to predict the next data point. This implementation uses the mean forecaster, which derives the next value with the Java arithmetic mean implementation. This method was chosen because most of the other approaches assume some kind of seasonality, which is not present here.
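The idea of the mean forecaster can be sketched in a few lines of Python. This is not the Kieker implementation; the function name is hypothetical, and whether the forecaster uses the full history or a sliding window is left open here (the sketch simply averages whatever it is given).

```python
from statistics import mean


def mean_forecast(history):
    """Predict the next value as the arithmetic mean of the previous values."""
    if not history:
        raise ValueError("at least one previous data point is required")
    return mean(history)
```

For the previous values 100, 200 and 300, the forecast for the next data point is 200.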

4.6.6. Anomaly Score Calculation (Kieker: AnomalyScoreCalculationFilter)

Based on the forecasted value from the previous step and the actual measured value, an anomaly score is calculated. This represents the anomaly score output of the ΘPAD approach.
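The text does not state the formula used for the score. The following Python sketch assumes a normalized absolute difference between forecast and measurement, which yields a value in [0, 1] for non-negative response times; both the formula and the function name are assumptions for illustration only.

```python
def anomaly_score(forecast, measured):
    """Normalized distance between forecast and measurement, in [0, 1].

    Assumed formula: |forecast - measured| / (forecast + measured),
    defined for non-negative values. Returns 0.0 if both values are 0.
    """
    if forecast < 0 or measured < 0:
        raise ValueError("values must be non-negative")
    total = forecast + measured
    if total == 0:
        return 0.0
    return abs(forecast - measured) / total
```

Under this assumption, a measurement equal to the forecast gives a score of 0, and a measurement three times as large as the forecast gives a score of 0.5.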

4.6.7. Anomaly Detection (Kieker: AnomalyDetectionFilter)

Even though this Kieker filter was initially used as a way to extract only anomalous or non-anomalous data points, it is used here to add the anomaly threshold to the record.

4.6.8. Remapping of ΘPAD Anomaly Scores

To be able to compare the anomaly scores of ΘPAD, RanCorr and our approach, we need to remap the anomaly scores of ΘPAD. ΘPAD uses an anomaly score interval of [0, 1], ranging from 0 (no anomaly) to 1 (anomaly). RanCorr internally uses the interval [−1, 1], and its results are remapped to the ΘPAD interval. To obtain comparable results, we mapped the ΘPAD anomaly scores to the RanCorr interval and back.
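A minimal sketch of such a remapping, assuming a simple linear mapping between the two intervals (the text does not specify the mapping, and the function names are hypothetical):

```python
def to_rancorr_interval(score):
    """Map a score from the [0, 1] interval to [-1, 1] (assumed linear)."""
    return 2.0 * score - 1.0


def to_opad_interval(score):
    """Map a score from the [-1, 1] interval back to [0, 1] (assumed linear)."""
    return (score + 1.0) / 2.0
```

Under this assumption the two functions are inverses of each other, so mapping a score to the RanCorr interval and back returns the original value.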

4.6.9. RanCorr

The RanCorr implementation of Düllmann et al. [DEE+14] was used as a reference and was customized because the structural assumptions were partly different. The existing implementation distinguishes multiple hierarchy levels (e.g., host, package, class, operation), whereas the aim in this thesis was to use only the dependency levels between microservice operations. Therefore, the RanCorr approach was adapted and partly simplified to fit the input data available here.

Algorithm 4.1 Customized RanCorr Algorithm

procedure CUSTOMIZEDRANCORR(currentItem)
    currentAnomalyScore ← currentItem.getAnomalyScore()
    forwardAnomalyScore ← CUSTOMIZEDRANCORR(currentItem.getNext())
    if forwardAnomalyScore >= currentAnomalyScore then
        return 0.5 * (currentAnomalyScore - 1)
    else
        return max(currentAnomalyScore, forwardAnomalyScore)
    end if
end procedure

Algorithm 4.1 describes the simplified RanCorr algorithm in pseudocode. It recursively obtains the anomaly scores of the items the current item depends on and takes into account whether these upstream anomaly scores are higher than the local one.
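The recursion of Algorithm 4.1 can be sketched in Python as follows. The pseudocode does not show how the recursion terminates, so the base case below (an item without a successor returns its own score) is an assumption, as are the Item class and its field names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Item:
    anomaly_score: float
    next: Optional["Item"] = None  # the item this one depends on, if any


def customized_rancorr(item):
    """Recursive score computation following Algorithm 4.1."""
    current = item.anomaly_score
    if item.next is None:
        # Assumed base case: without a dependency, keep the local score.
        return current
    forward = customized_rancorr(item.next)
    if forward >= current:
        # The dependency is at least as anomalous as this item.
        return 0.5 * (current - 1)
    return max(current, forward)
```

For example, an item with score 0.5 that depends on an item with score 0.75 is rated 0.5 * (0.5 - 1) = -0.25, whereas an item with score 0.9 that depends on an item with score 0.1 keeps its score of 0.9.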

4.6.10. EAR

EAR takes into account the monitoring data associated with timestamps and the events that were recorded. Because it is known where updates were conducted, the effects of the algorithm can be applied only to the records that are affected by the events. Algorithm 4.2 illustrates the idea of the basic EAR algorithm. The inputs are the events (e.g., update events), which contain the time and the scope of the change (e.g., which service type was affected), and the anomaly records, which contain their scope and timestamp. The algorithm first sets the default anomaly threshold for all anomaly records and then applies the increased anomaly threshold to those anomaly records that are affected by the events. The variable gracePeriod defines the duration for which increased anomaly scores are tolerated, and anomalyFactor is the factor that is applied to the normal threshold during the grace period. Note: Due to time limitations it was not possible to implement the advanced EAR approach.

Algorithm 4.2 EAAR Basic Algorithm

procedure EAARBASIC(events, anomalyRecords, threshold)
    gracePeriod ← 15
    anomalyFactor ← 2.0
    anomalyRecords.setThreshold(threshold)
    for all event in events do
        affectedAnomalyRecords ← anomalyRecords.getAffected(event, gracePeriod)
        affectedAnomalyRecords.setThreshold(threshold * anomalyFactor)
    end for
end procedure
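Algorithm 4.2 can be sketched in Python as follows. The classes and field names are hypothetical, and the meaning of getAffected is an assumption: a record counts as affected if it has the same scope as the event and its timestamp lies within the grace period after the event.

```python
from dataclasses import dataclass


@dataclass
class Event:
    time_s: float
    scope: str  # e.g. the affected service type


@dataclass
class AnomalyRecord:
    timestamp_s: float
    scope: str
    threshold: float = 0.0


def ear_basic(events, records, threshold, grace_period_s=15.0, anomaly_factor=2.0):
    """Set the default threshold everywhere, then raise it near events."""
    for record in records:
        record.threshold = threshold
    for event in events:
        for record in records:
            in_scope = record.scope == event.scope
            in_grace = event.time_s <= record.timestamp_s <= event.time_s + grace_period_s
            if in_scope and in_grace:
                # Assumed notion of "affected": same scope, inside grace period.
                record.threshold = threshold * anomaly_factor
    return records
```

With a threshold of 0.1 and an event for the scope "logic" at 80 s, a "logic" record at 85 s gets the threshold 0.2, while a "logic" record at 100 s and a "portal" record at 85 s keep 0.1.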

5. Evaluation

5.1. Evaluation Goals

• See whether the combined approach performs better than the naive approach.

• Compare the results with a commercial product (if possible).

• Find out whether the approach also performs well in large microservice environments with many service instances.

5.1.1. Research Questions

1. Can the anomaly detection quality be improved by incorporating architectural knowledge?

H0arch: Anomaly detection quality does not change or is reduced by incorporating architectural knowledge.

H1arch: Anomaly detection quality is increased by incorporating architectural knowledge.

2. Can the anomaly detection quality be improved by incorporating event knowledge?

H0event: Anomaly detection quality does not change or is reduced by incorporating event knowledge.

H1event: Anomaly detection quality is increased by incorporating event knowledge.

To make sure that the customization of the RanCorr implementation was an improvement, it is necessary to check whether it improved the anomaly detection quality compared with the ΘPAD approach. Finally, the EAR approach is compared with the RanCorr approach, as it uses the results of RanCorr to further improve the anomaly detection quality.

5.2. Evaluation Methodology

The generated SUT is executed in a lab experiment on the minikube Kubernetes environment, together with the supplemental microservices required to monitor the system and to provide load for the microservices in the form of HTTP requests. During the experiment, anomalies are injected that delay the execution of operations on the SUT and thus increase response times. The injections can have different scopes: they target either all operations of one microservice type or specific operations of a microservice type. At the beginning of the experiment, the microservice operations are assigned initial delays, which are intended to resemble usual response times and are not treated as actual injections. Once the initialization phase is finished, the load generation is started. This produces the actual monitoring measurements and gives a first impression of the initial state of the SUT. Afterwards, the anomaly injections are made step by step. Every injection is labeled with an injection type of either “real” or “init”. The “real” type stands for real anomalies, which are not expected and should still be detected. The “init” type stands for anomalies caused by the initialization phase of a microservice. These events are expected, and their starting time is logged.

After the experiment is finished, the monitoring data is extracted and the event and injection logs are stored. The following steps are conducted offline; the experiment setup is no longer needed at this point. Next, the monitoring data and the event logs are fed into the evaluation application, which processes the data with ΘPAD, RanCorr, and the approach of this thesis, EAR, in order to compare their detection quality. Another application uses the data about the injected anomalies labeled “real” to calculate the quality metrics (TP, FN, TN, FP). Based on this data, we compare the approaches to see whether the customized RanCorr approach and the EAR approach improve the detection quality.

5.3. Evaluation Setup

The SUT was generated by using the generator described in Section 3.3.

The SUT is a three-tier microservice environment with one instance of each tier. Figure 5.1 shows the dependencies between the operations of the tiers.

Figure 5.1.: SUT

The portal microservice has three operations, two of which depend on operations of the logic microservice. The logic microservice, in turn, depends on operations of the database microservice.

Based on the monitoring data obtained during the experiment, the following diagram was extracted. It shows the call structure and the number of requests sent to the operations of the microservices. In addition to the operations shown in Figure 5.1, the interception methods are visible here; they are required to track the complete call hierarchy throughout the whole system.

5.3.1. Supplemental Setup

The setup described above is embedded in an evaluation environment that contains several additional microservices required to conduct the experiment.

Figure 5.3 depicts the supplemental setup, which is described in this section.

Figure 5.3.: Evaluation Setup for the generated SUT

JMeter

The JMeter microservice provides the load testing that is directed to the SUT. It contains a test plan that is automatically executed after a short delay.

JMS Server

The JMS Server is responsible for collecting the monitoring data from all microservices in the SUT. During the SUT generation process, all microservices are instrumented and set up to send their monitoring data to the JMS Server. The JMS Server runs an ActiveMQ server (http://activemq.apache.org/) in version 5.14.1, which is configured to create an initial queue for this monitoring data. This way there is a single point from which the monitoring data can be read.

Monitoring Server

The monitoring server uses the Kieker filters to connect to the JMS Server, reads the monitoring data, and writes it to the local file system. After the experiment is finished, the monitoring data can be copied from it. One way to do so is to use the installed OpenSSH server, which makes it possible to connect from a remote system and copy files with the secure copy tool (SCP), which is part of the OpenSSH suite.

Registry and Injector

The registry is responsible for keeping track of the injections. The microservices of the SUT send requests to the registry with information about their type, operation and unique ID. The registry component looks up whether delays are defined for these properties and returns the delay value. That way it is possible to have a central component managing all the injections. If injections with a defined duration are set up, it removes such injections after the duration has expired.

The injector component is located inside the registry microservice, where it executes the experiment plan in which the injections are defined. Based on the plan, it sends injection requests to the registry, which stores the injections and provides them to the microservices in the SUT.
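The registry lookup described above can be sketched as a small in-memory store. This is only an illustration: the class and method names are hypothetical, time is passed in explicitly to keep the sketch deterministic, an unspecified operation or instance ID is treated as a wildcard, and the largest matching delay wins. None of these choices is taken from the actual implementation.

```python
class Registry:
    """Central store of delay injections (illustrative sketch)."""

    def __init__(self):
        # Each entry: (service_type, operation, instance_id, delay_ms, expires_s)
        self._injections = []

    def add_injection(self, delay_ms, now_s, service_type,
                      operation=None, instance_id=None, duration_s=None):
        """Register a delay; None for operation or instance_id matches all."""
        expires_s = None if duration_s is None else now_s + duration_s
        self._injections.append((service_type, operation, instance_id, delay_ms, expires_s))

    def lookup(self, service_type, operation, instance_id, now_s):
        """Return the delay in ms for a request, or 0 if none is defined."""
        # Remove injections whose duration has expired.
        self._injections = [i for i in self._injections if i[4] is None or now_s < i[4]]
        delay = 0
        for s_type, op, inst, delay_ms, _ in self._injections:
            if s_type == service_type and op in (None, operation) and inst in (None, instance_id):
                delay = max(delay, delay_ms)
        return delay
```

An injection of 250 ms for portal:getIndex that is added at 20 s with a duration of 30 s is returned for lookups at 30 s and is gone for lookups at 60 s.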

5.4. Experiment Settings

In the post-processing environment, several parameters can be set that may influence the results.

Maximum Trace Duration (5 seconds)

The maximum trace duration is set in the Kieker filter that extracts the traces from the monitoring data. This means that traces that take longer than the given number of seconds are discarded. In this case the maximum trace duration was set to five seconds. As there are no traces that take longer than this time, it did not lead to any discarded traces.

Aggregation Span (5 seconds)

The aggregation span is set in the Kieker filter that aggregates the monitoring data into an equidistant time series. The aggregation span defines the window size for which one representative value is calculated and was set to five seconds. That means that one representative value is calculated from the monitoring data of every five seconds.

Forecasting Method (Java Mean)

The forecasting method is set in the Kieker filter that forecasts the next value based on the historical data. It was set to Java mean because the data is not seasonal, so forecasting models that rely on seasonality would not be appropriate. In addition, the Java mean method has a fast execution time.

Event Grace Period (15 seconds)

The event grace period is set in the EAR algorithm to define for how long the grace period should be applied after an event was detected. In the experiment this value was set to 15 seconds.

Event Grace Period Factor (2.0)

The event grace period factor is set in the EAR algorithm to define the factor for the normal threshold during the grace period. In the experiment this value was set to 2.

Anomaly Threshold (variable)

The anomaly threshold is set in ΘPAD, RanCorr and EAR to be added to the records for later evaluation. This value was varied step-wise from 0.05 to 0.95 in steps of 0.05.
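The threshold sweep can be sketched as follows. The function names are hypothetical, and the decision rule (a data point counts as anomalous when its score reaches the threshold) is an assumption, since the text only states that the threshold is attached to the records for later evaluation.

```python
def threshold_grid(start=0.05, stop=0.95, step=0.05):
    """Return the thresholds from start to stop (inclusive) in the given steps."""
    count = round((stop - start) / step) + 1
    # Rounding avoids accumulated floating-point error such as 0.15000000000000002.
    return [round(start + i * step, 2) for i in range(count)]


def is_anomalous(score, threshold):
    """Assumed decision rule: anomalous if the score reaches the threshold."""
    return score >= threshold
```

The sweep from 0.05 to 0.95 in steps of 0.05 yields 19 thresholds, each of which produces one set of quality metrics.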

5.4.1. Injections

Table 5.1 shows the injections that were executed throughout the experiment. The time is given in seconds after the start of the experiment. For all injections that are not of the type “real”, the timestamps of the event occurrence were logged and used as input for the post-processing. Injections of the type “real” resemble real anomalies, which should still be detected as anomalies, while the other types should not. The scope defines which microservices and which operations are affected by an injection. The values in parentheses in the offset column represent the delay that persists after the end of the duration. This is meant to resemble code changes that affect the performance of the application in the long term.

Time (s)  Type  Scope            Offset (ms)  Duration (s)
0         Start of load generation
20        real  portal:getIndex  250          30
80        init  logic            200          15
110       init  database         350 (100)    20
160       real  logic            300          30
200       init  portal           330 (30)     15
240       init  database         150          20
280       real  portal:getOrder  250          30
380       init  logic            250 (50)     20
420       real  database         250          30
520       End of the experiment

Table 5.1.: Injections in the experiment

5.4.2. Load Driver

The requests directed to the SUT were generated by Apache JMeter, which is an open source tool for load testing.

The test plan for the experiment was set to a total duration of 500 seconds, using 25 threads which loop forever until the test duration ends. The number of threads was ramped up to the maximum value over 30 seconds (ramp-up period).

Every thread sent HTTP requests in the given order:

• GET http://portal:8080/getIndex

• GET http://portal:8080/getLogin

• GET http://portal:8080/getOrder

5.4.3. Evaluation environment

The experiment was conducted in a minikube (v0.14.0) environment that was assigned 4 CPUs and 8196MB of RAM. The minikube tool uses VirtualBox to run the virtual machine. The installed version was Oracle VirtualBox 5.1.12.

The host system on which the minikube virtual machine ran was an Arch Linux system with a quad-core CPU (Intel(R) Core(TM) i7-3630QM CPU @ 2.40GHz) and 16GB of RAM in total.

5.5. Description of Results

Of the datasets for the different thresholds, the one with the highest F-measure is presented here; this is the dataset for the anomaly threshold of 0.1. The dataset containing all values for all measured thresholds is publicly available at [Dül17]. The total number of data points is 718, and the values are rounded to three decimal places.

Approach     TP   TN   FP   FN   TPR    FPR    PPV    F-Measure  Accuracy
ΘPAD (1)     137  218  218  145  0.486  0.5    0.386  0.43       0.494
RanCorr (2)  98   248  187  184  0.348  0.429  0.344  0.346      0.397
EAR (3)      95   257  179  187  0.337  0.411  0.347  0.342      0.411

Table 5.2.: Statistics for the different approaches for an anomaly threshold of 0.1

Table 5.2 compares the different approaches and the corresponding statistical metrics for an anomaly threshold of 0.1. ΘPAD achieves the best TPR, PPV, F-measure and accuracy, but also the highest false positive rate.
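For reference, the rate metrics in the table follow from the four counts via the standard definitions. The Python sketch below (the function name is hypothetical) computes them; applied to the ΘPAD counts it reproduces the rounded values reported for that row.

```python
def quality_metrics(tp, tn, fp, fn):
    """Derive the standard classification metrics from the confusion counts."""
    tpr = tp / (tp + fn)          # true positive rate (recall)
    fpr = fp / (fp + tn)          # false positive rate
    ppv = tp / (tp + fp)          # positive predictive value (precision)
    f_measure = 2 * ppv * tpr / (ppv + tpr)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {"TPR": tpr, "FPR": fpr, "PPV": ppv,
            "F-Measure": f_measure, "Accuracy": accuracy}


# Counts of the ΘPAD row for the anomaly threshold of 0.1.
opad = quality_metrics(tp=137, tn=218, fp=218, fn=145)
```

For example, the TPR is 137 / (137 + 145) ≈ 0.486 and the FPR is 218 / (218 + 218) = 0.5.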

Compared  TP   TN   FP   FN   TPR     FPR     PPV     F-Measure  Accuracy
1/2       -39  +30  -31  +39  -0.138  -0.071  -0.042  -0.084     -0.097
2/3       -3   +9   -8   +3   -0.011  -0.018  +0.003  -0.004     +0.014

Table 5.3.: Comparison of the different approaches for an anomaly threshold of 0.1 (1: ΘPAD, 2: Customized RanCorr, 3: EAR)

Table 5.3 shows the differences between the results of the approaches. 1/2 stands for the comparison between ΘPAD and the customized RanCorr approach, and 2/3 represents the comparison between the customized RanCorr and the EAR approach. While the differences between the results of ΘPAD and the customized RanCorr approach are quite large, the differences between RanCorr and the EAR approach are very small.

5.5.1. Receiver Operating Characteristic (ROC) curves

To visualize the results, the results of each approach are plotted as a Receiver Operating Characteristic (ROC) curve, which plots the false positive rate on the x-axis and the true positive rate on the y-axis. This visualization makes it easier to see how well the classification of the different approaches performs. The diagonal line is the baseline, which represents the result a random classification of values would have. Data points lying below that line perform worse than a random guess. Optimal results have a high true positive rate and a low false positive rate. That means that the optimal result would be many data points in the upper left corner.
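Whether a data point lies above or below the diagonal can be checked directly by comparing its two coordinates. The following sketch (the function name is hypothetical) applies this check to the FPR and TPR values reported in Table 5.2 for the anomaly threshold of 0.1.

```python
def relative_to_baseline(fpr, tpr):
    """Classify a ROC point relative to the diagonal TPR = FPR."""
    if tpr > fpr:
        return "above"
    if tpr < fpr:
        return "below"
    return "on"


# (FPR, TPR) pairs taken from Table 5.2 (anomaly threshold 0.1).
points = {
    "ΘPAD": (0.5, 0.486),
    "RanCorr": (0.429, 0.348),
    "EAR": (0.411, 0.337),
}
positions = {name: relative_to_baseline(fpr, tpr) for name, (fpr, tpr) in points.items()}
```

At this threshold the TPR is smaller than the FPR for all three approaches, so all three points lie below the baseline.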

Figure 5.4.: Results of the prediction performance of the ΘPAD approach visualized by a ROC curve (plot title: opad-remapped; x-axis: FPR, y-axis: TPR)

Figure 5.4 shows the ROC curve of the results of the ΘPAD evaluation. Its data points lie close to the line that indicates a random classification and partially even fall below it. Figure 5.5 depicts the ROC curve for the prediction performance of the customized RanCorr approach. It also lies close to the random-classification line and crosses it at some point.

Figure 5.6 shows the ROC curve for the prediction performance of the EAR approach. Even though the curve lifts off the baseline, it crosses the baseline at a higher threshold.

Figure 5.5.: Results of the prediction performance of the customized RanCorr approach visualized by a ROC curve (plot title: rancorr; x-axis: FPR, y-axis: TPR)

Figure 5.6.: Results of the EAR approach in a ROC curve (plot title: eaar; x-axis: FPR, y-axis: TPR)

5.6. Discussion of Results

Based on the metrics reported in Table 5.2, the general anomaly detection quality is rather poor: for all three approaches, TPR, PPV, F-measure and accuracy are below 0.5.

As even the existing ΘPAD approach does not perform very well in this context, there might be aspects that could be improved to increase the overall anomaly detection quality. One option is to revise the parameters of the Kieker ΘPAD filter chain, which may influence the results to some extent.

Due to the large difference between the ΘPAD approach and the customized RanCorr approach shown in Table 5.3, we have to assume that the intended positive impact of the customized RanCorr approach was not achieved. Therefore, we have to accept the null hypothesis H0arch and reject H1arch, as the customized RanCorr approach did not improve but mainly worsened the overall anomaly prediction performance.
