Anomaly Detection Engine - CRMF Component Details

4.2 CRMF Component Details

4.2.2 Anomaly Detection Engine

The anomaly detection engine is one of the core components of CRMF and comprises three sub-components: the Data Collection Engine (DCE), the System Analysis Engine (SAE) and the Network Analysis Engine (NAE). The DCE monitors and collects data from each VM and host for various metrics in order to produce a feature vector for the subsequent detection engines (i.e., the NAE and SAE). The system-level and network-level engines identify potential symptoms of anomalous behaviour without performing a lot of computation. They do not confirm the presence of an anomaly, but merely suggest the possibility of one. Further analysis is needed to confirm whether an anomaly is present and, if so, to classify it. The initial event is generated by looking at each system-level and network-level metric in isolation. Hence, a generated event indicates that either a host or a VM is experiencing an anomaly. A broader correlation between metrics can be very expensive, as the number of such comparisons grows exponentially with the number of VMs and monitored metrics.

A Fine Grain Analysis (FGA) is then required, which uses classification algorithms such as AutoClass [103] to identify the cause of anomalies and localize them. The idea is to generate and update inference rules in a fully unsupervised manner, creating different classes for the anomalies being detected in such a way that the fuzzy inference rules can be updated autonomously for efficient localization of anomalies [83,103]. This analysis of the monitored data is required to further understand anomalous behaviour and to narrow down the scope of remediation. It can also generate an alert when an anomaly cannot be classified, in which case manual intervention is required.

Chapter 4. Cloud Resilience Management

The following sections give a brief overview of the sub-components of the anomaly detection engine.

Data Collection Engine

An essential question in resolving the resilience puzzle is what information should be provided and where that information should come from. The DCE is therefore designed to collect and process various relevant metrics pertaining to the system (such as CPU and memory) and the network (such as number of packets, number of bytes and throughput) for every VM and physical host. All metrics are collected periodically, at an interval set by a configurable monitoring interval parameter. The DCE has sub-components that perform normalization and smoothing of the data and produce feature vectors, which are sequences of values over a fixed interval of time and form the basic input into the subsequent detection engines (i.e., the NAE and SAE).

The first stage of online detection is data collection, performed by a set of scripts that provide feature extraction and normalisation. Within the SAE, this is achieved using Volatility² in conjunction with libVMI³. Every 3 seconds, the Volatility tool is invoked with a custom plug-in that crawls VM memory for every resident process structure. From each process, a number of raw features are extracted, including:

• the current size of virtual memory belonging to each process

• the peak virtual size (i.e. the requested memory allocation) of each process

• the number of threads belonging to each process

• the total number of handles belonging to each process (which includes process threads, file handles, registry entries, etc.)

The raw features are per process, so they cannot be used directly to represent each sample, or snapshot, as a single feature vector. Therefore, the raw features are used to build meta-features: the mean, variance and standard deviation of each feature across all processes. The result of feature extraction is a feature vector of the form x = (x_1, x_2, ..., x_n), where n = 12 due to the three groups of four meta-features.
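The aggregation from per-process raw features to a 12-element snapshot vector can be sketched as follows. This is a minimal illustration, not the custom Volatility plug-in itself: the dictionary keys, the function name `build_meta_features`, the ordering of the vector and the use of population (rather than sample) statistics are all assumptions.

```python
from statistics import fmean, pstdev, pvariance
from typing import Dict, List

# Hypothetical key names for the four raw per-process features listed above.
RAW_FEATURES = ("virtual_size", "peak_virtual_size", "thread_count", "handle_count")


def build_meta_features(processes: List[Dict[str, float]]) -> List[float]:
    """Collapse per-process raw features into one 12-element snapshot vector.

    Assumed layout: 4 means, then 4 variances, then 4 standard deviations.
    Population statistics are an assumption; the text does not say which form is used.
    """
    if not processes:
        raise ValueError("a snapshot needs at least one process")
    # One column of values per raw feature, taken across all processes.
    columns = [[float(p[name]) for p in processes] for name in RAW_FEATURES]
    vector: List[float] = []
    for statistic in (fmean, pvariance, pstdev):
        vector.extend(statistic(column) for column in columns)
    return vector


# Toy snapshot with two processes (values are made up for illustration).
snapshot = [
    {"virtual_size": 100, "peak_virtual_size": 200, "thread_count": 2, "handle_count": 10},
    {"virtual_size": 300, "peak_virtual_size": 400, "thread_count": 4, "handle_count": 30},
]
x = build_meta_features(snapshot)
print(len(x))  # 12
```

Grouping the vector by statistic rather than by raw feature is an arbitrary choice here; any fixed ordering works as long as it is the same for every snapshot.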

² Volatility framework: https://code.google.com/p/volatility/
³ libVMI: https://code.google.com/p/vmitools/
⁴ tcpdump/libpcap: http://www.tcpdump.org/
⁵ libpcap API: http://www.tcpdump.org/

At the network level, the NAE collects traffic data through tcpdump⁴ from the network bridge interface (br0) of each host. This traffic is then passed to a Summary Extraction Script, which is based on libpcap⁵ and converts the traffic into normalised statistical properties on a per-packet basis. In order to capture the dynamics of varying attack types, both volume-based features (e.g., counts of bytes and packets) and distribution-based features (computed as the Shannon entropy of all values observed in the bin, as used in many seminal pieces of work [78]) are extracted. The resulting feature vector therefore has dimension n = 8 and contains:

• Number of packets

• Number of bytes

• Number of active flows in each bin

• Entropy of source IP address distribution

• Entropy of destination IP address distribution

• Entropy of source port distribution

• Entropy of destination port distribution

• Entropy of packet size distribution
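The eight features listed above can be computed for a single time bin roughly as follows. This is a sketch under stated assumptions rather than the Summary Extraction Script itself: the `Packet` record, the function names, the definition of a flow as a distinct (source IP, destination IP, source port, destination port) tuple, and the use of base-2 logarithms are illustrative choices, and the normalisation step is omitted.

```python
import math
from collections import Counter
from typing import Hashable, Iterable, List, NamedTuple


class Packet(NamedTuple):
    """Hypothetical per-packet summary record; field names are illustrative."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    size: int  # packet size in bytes


def shannon_entropy(values: Iterable[Hashable]) -> float:
    """Shannon entropy (in bits) of the empirical distribution of the values."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # p * log2(1/p) with p = c / total, summed over the distinct values.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def bin_features(packets: List[Packet]) -> List[float]:
    """Build the 8-element feature vector for the packets of one time bin."""
    # Assumption: a flow is a distinct address/port 4-tuple within the bin.
    flows = {(p.src_ip, p.dst_ip, p.src_port, p.dst_port) for p in packets}
    return [
        float(len(packets)),                   # number of packets
        float(sum(p.size for p in packets)),   # number of bytes
        float(len(flows)),                     # number of active flows
        shannon_entropy(p.src_ip for p in packets),
        shannon_entropy(p.dst_ip for p in packets),
        shannon_entropy(p.src_port for p in packets),
        shannon_entropy(p.dst_port for p in packets),
        shannon_entropy(p.size for p in packets),
    ]


# Toy bin: two clients sending to one web server (values are made up).
example_bin = [
    Packet("10.0.0.1", "10.0.0.9", 1000, 80, 60),
    Packet("10.0.0.1", "10.0.0.9", 1001, 80, 60),
    Packet("10.0.0.2", "10.0.0.9", 1002, 80, 1500),
    Packet("10.0.0.2", "10.0.0.9", 1003, 80, 1500),
]
print(bin_features(example_bin))
```

Entropy is low when traffic concentrates on a few values (here, a single destination IP and port) and high when it is spread out, which is why the distribution-based features complement the raw volume counts.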

Figure 4.5 below shows an overview of the Data Collection Engine.

[Figure 4.5 diagram labels: Deployment function (provision resources); DCE (resilience metrics, summary extraction, feature selection, pre/post processing (normalization)); feature vector/time series; instructions to monitor; monitoring; resource cluster of VMs]

Figure 4.5: Overview of the Data Collection Engine

Network Analysis Engine

The purpose of the NAE is to detect anomalous traffic at the physical node level of the cloud. This is achieved by modelling normal traffic patterns and identifying anomalies through online/offline monitoring of traffic on the network interfaces of the cloud node with the aid of the DCE. The NAE provides a reference implementation of different anomaly detection techniques for offline and online analysis. The one-class Support Vector Machine (SVM) algorithm is chosen for the implementation of the SAE, and the Recursive Density Estimation (RDE) [12] technique is used for the implementation of the NAE.
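To illustrate how a recursive density estimate can flag an unusual sample online, the sketch below uses one commonly published form of RDE with a Cauchy-type kernel, in which only a running mean and a running mean of squared norms are stored. The class name and interface are hypothetical, and the exact formulation used in [12] and in the NAE may differ.

```python
from typing import List, Sequence


class RecursiveDensityEstimator:
    """Illustrative RDE sketch (assumed Cauchy-type form, not the NAE code)."""

    def __init__(self, dim: int) -> None:
        self.k = 0                   # number of samples seen so far
        self.mean: List[float] = [0.0] * dim  # running mean of the samples
        self.mean_sq_norm = 0.0      # running mean of squared sample norms

    def update(self, x: Sequence[float]) -> float:
        """Absorb sample x and return its density in (0, 1]; low means unusual."""
        self.k += 1
        k = self.k
        w = (k - 1) / k
        # Recursive updates: no past samples need to be kept in memory.
        self.mean = [w * m + xi / k for m, xi in zip(self.mean, x)]
        self.mean_sq_norm = w * self.mean_sq_norm + sum(xi * xi for xi in x) / k
        dist_sq = sum((xi - m) ** 2 for xi, m in zip(x, self.mean))
        mean_norm_sq = sum(m * m for m in self.mean)
        return 1.0 / (1.0 + dist_sq + self.mean_sq_norm - mean_norm_sq)


# Toy stream: ten identical samples followed by one far-away sample.
estimator = RecursiveDensityEstimator(dim=2)
for _ in range(10):
    normal_density = estimator.update([1.0, 1.0])
outlier_density = estimator.update([10.0, 10.0])
print(normal_density, outlier_density)
```

In this toy stream the density stays close to 1 while the samples agree with the history and drops sharply for the outlier. A detector would compare the density against a threshold, which is not specified here.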

System Analysis Engine

The System Analysis Engine (SAE) is designed to detect anomalies through the observation of VM properties. The SAE builds a model of normal VM operation and detects deviations from it using the selected anomaly detection technique.

Using the toolchain and experimental setup described in Chapter 3, the detection aspects of the system- and network-wide unified resilience architecture have been tested under a malware scenario. The Kelihos trojan (Trojan.Kelihos-5) is chosen to test the performance of the SAE and NAE components. The trojan is a Win32 binary, which allows it to be executed on the target VM. Upon execution, the malware spawns many child processes and subsequently exits from its main process. This is likely an obfuscation method to avoid detection, but it has the effect of skewing various features (system and network), resulting in an anomaly. The system and network features obtained from the DCE are aggregated into a single dataset. This dataset is then applied to the detector (implementing the PCA technique).

In order to validate whether the features provided by the DCE also apply in the elastic scenario of the cloud, VM migration is performed during the experiments. The results indicate that the malware is spread and re-initiated when intra- or inter-cloud VM/service migration is performed. For the experiment presented in Figure 4.6, each VM runs Apache HTTPd. The client host runs custom scripts that send random HTTP requests to the VMs. The SAE acquires VM memory at the hypervisor level using introspection to collect raw system features such as process and memory usage. During each 20-minute run, web traffic occurs continuously at a fixed rate and so generates the system-level (normal) activity. At 9 minutes into a run, Kelihos is injected to generate malicious activity in the system. At exactly 10 minutes, a migration of the infected VM is initiated. As a result of the migration, a run therefore generates two 10-minute system-level datasets (features): the trace from the node of the departing VM and the trace from the node of the arriving VM.

Each dataset is divided into 3-second bins, giving 200 bins per 10-minute dataset (600 s / 3 s). Each node therefore yields a 200×12 system-level and a 200×8 network-level feature matrix, which gives 400×12 and 400×8 feature matrices across the outward and inward nodes. The combined feature matrix is submitted to the PCA-based detector to obtain the k-dimensional subspace that corresponds to the normal behaviour of the traffic and is spanned by pc_1 through pc_k, whereas the remaining subspace (i.e., pc_{k+1} through pc_m) captures the anomalous behaviour with respect to the variance of the dataset. Subsequently, the magnitude of the projection of each data point x_i onto the anomalous subspace is computed to quantify its malicious behaviour, and these magnitudes are used to produce an Anomaly Score Graph (ASG). The ASG is a time-series representation that summarizes the anomaly score of each bin in the trace, indicating how anomalous each time bin is with respect to the others.
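The projection step can be written out as follows. This is a sketch assuming the standard PCA subspace method with orthonormal principal components; the matrix P, the residual \tilde{x}_i and the score s_i are notation introduced here for illustration, and x_i is assumed to be mean-centred.

```latex
P = [\, pc_1 \;\; pc_2 \;\; \cdots \;\; pc_k \,], \qquad
\tilde{x}_i = (I - P P^{\top})\, x_i, \qquad
s_i = \lVert \tilde{x}_i \rVert_2
    = \sqrt{\sum_{j=k+1}^{m} \left( pc_j^{\top} x_i \right)^{2}}
```

Under these assumptions, plotting s_i against the bin index i gives a time series of the kind the ASG describes: bins whose feature vectors are well explained by the first k components score near zero, while bins with a large residual stand out.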

[Figure 4.6 plot. x-axis: Bins (50 to 400); y-axis: Anomaly statistics (0.2 to 1); annotations: Monitoring starts on Compute1, Malware injected, Anomalous period, VM migration starts, Monitoring starts on Compute2]

Figure 4.6: Results of detection for Kelihos using system and network wide features

In document Anomaly detection for resilience in cloud computing infrastructures (Page 94-98)