Major Contributions - Anomaly detection for resilience in cloud computing infrastructures

In the course of this research, the following technical contributions are made:

Cloud Resilience Management Framework (CRMF): The main focus of the evaluation presented in Chapter 3 has been to gain an understanding of how migration affects anomaly detection accuracy in a cloud environment. The Cloud Resilience Management Framework (CRMF) (see Chapter 4) models and then applies an existing resilience strategy in a cloud operating context to diagnose anomalies. The framework uses an end-to-end feedback loop that allows remediation to be integrated with existing cloud management systems. The key aspects of the framework are demonstrated in the prototype, namely anomaly detection for network and system level analysis, and remediation strategies with the aid of a policy engine. An experimental infrastructure is then built to include a real cloud infrastructure resembling two cloud data centres. This is further extended with implementations of resilience management systems to realise resilience management under different attack scenarios.

One possible future research direction is to refine the framework further in light of the lessons learned from developing the prototype. This mainly includes extending analytical capabilities to different outside applications. Moreover, additions to the framework will be made to cover extra aspects to be identified in due course, which may include the need for inter-provider resilience management and support for more resilience strategies to combat challenges specific to service performance.

Management and resilience of cloud environments are closely linked. Resilience in cloud environments needs to be managed so that relevant policies are disseminated to the cloud provider that will implement them. Policies specify the actions needed to deal with challenges and, because challenges vary in nature, these policies need to be further refined so that they adapt in response to challenges. Recent work is presented on a refinement process for policies that has iterated phases of decomposition at its core. There are many avenues for future work in this discipline, but a particularly interesting one is policy-based resilience management in the cloud.

It would also be interesting to explore how example policies, e.g. those described in templates, can be optimised with the refinement process. The work this thesis presented on refinement procedures can be merged with the policy analysis framework in future to see whether this can be used to guide resilience against challenges. There are a number of different conflicts that can arise from policies. For example, some policies will trigger complex management procedures which require the execution of actions that may be specified as part of different policies. Determining the existence of conflicting configurations for cloud environments is of critical importance, under the assumption that the system may be loaded with a number of policies to address many different challenges simultaneously. Multiple policies that need to co-exist may specify conflicting actions on the same virtual resources, or may trigger the activation of incompatible mechanisms, thereby rendering a particular resilience strategy ineffective. A further direction is to investigate approaches for the automatic identification and resolution of policy conflicts. One possible approach to this problem is the use of meta-policies, which resolve conflicting situations at run-time.
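To make the notion of conflicting actions on the same virtual resource concrete, the following is a minimal sketch, not the thesis's policy engine. The `Policy` fields, the action names, the `INCOMPATIBLE` table and the priority-based meta-policy are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import List, Tuple


@dataclass(frozen=True)
class Policy:
    name: str
    resource: str   # virtual resource the action targets (hypothetical identifier)
    action: str     # action executed when the policy fires (hypothetical name)
    priority: int   # consulted only by the meta-policy below


# Assumption: pairs of actions that cannot both be applied to one resource.
INCOMPATIBLE = {
    frozenset({"migrate", "shutdown"}),
    frozenset({"throttle", "scale_up"}),
}


def find_conflicts(policies: List[Policy]) -> List[Tuple[Policy, Policy]]:
    """Return pairs of policies prescribing incompatible actions on the same resource."""
    return [
        (a, b)
        for a, b in combinations(policies, 2)
        if a.resource == b.resource
        and frozenset({a.action, b.action}) in INCOMPATIBLE
    ]


def resolve_by_priority(conflict: Tuple[Policy, Policy]) -> Policy:
    """Illustrative meta-policy: keep the higher-priority policy of a conflicting pair."""
    a, b = conflict
    return a if a.priority >= b.priority else b
```

A pairwise check like this only covers direct action conflicts on one resource. Conflicts that arise through chains of triggered management procedures, as described above, would need a richer analysis.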

Novel Anomaly Detection Algorithm: The thesis presents a modified anomaly detector based on the density of observed feature vectors (see Chapter 5). The modifications improve detection for dynamically evolving workload patterns without requiring pre-defined anomaly models. This encourages further work on monitoring scalability, in terms of efficiency and accuracy, with cloud-specific workloads. The density computation is expressed recursively, i.e., based only on the signal at the previous step together with the latest vector. This makes the algorithm memoryless, i.e., it does not need to store historical data. The lightweight nature of this approach makes it particularly suitable for deployment in cloud environments. In particular, a detector associated with a specific VM is sufficiently lightweight to travel with the migrating VM. This is especially valuable, as a detector that does not migrate with a VM may interpret the sudden departure or arrival of the VM as an anomaly.
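The recursive, memoryless update can be illustrated with a short sketch. It assumes one common formulation of recursive density estimation (a Cauchy-type kernel over a running mean and a running mean of squared norms). The modified detector in Chapter 5 may differ in its exact formula and thresholding, and the class and field names here are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Sequence


@dataclass
class RecursiveDensityDetector:
    """Streaming density estimate whose state is a counter, a mean vector and a scalar."""
    k: int = 0                                        # samples seen so far
    mean: List[float] = field(default_factory=list)   # running mean of feature vectors
    sq_norm_mean: float = 0.0                         # running mean of squared norms

    def update(self, x: Sequence[float]) -> float:
        """Fold in one feature vector and return its density, a value in (0, 1]."""
        self.k += 1
        sq_norm = sum(v * v for v in x)
        if self.k == 1:
            # The first sample defines the initial state and has maximal density.
            self.mean = list(x)
            self.sq_norm_mean = sq_norm
            return 1.0
        w = 1.0 / self.k
        # Both statistics depend only on their previous value and the new vector.
        self.mean = [(1 - w) * m + w * v for m, v in zip(self.mean, x)]
        self.sq_norm_mean = (1 - w) * self.sq_norm_mean + w * sq_norm
        dist_sq = sum((v - m) ** 2 for v, m in zip(x, self.mean))
        spread = self.sq_norm_mean - sum(m * m for m in self.mean)
        return 1.0 / (1.0 + dist_sq + spread)
```

Because the whole state is `k`, `mean` and `sq_norm_mean`, it can be serialised and moved along with a migrating VM. A vector far from the running mean yields a density close to 0, which a threshold (not shown, and an assumption of this sketch) could flag as anomalous.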

Anomaly Detection as a Service (ADaaS): The thesis proposed an Anomaly Detection as a Service model for cloud infrastructures called ADaaS (see Chapter 6). ADaaS uses an unsupervised, lightweight and memoryless detector. The embodied detection technique is based on the density of observed feature vectors. ADaaS monitors a range of metrics (VM, host and network) using the Monasca API to diagnose anomalies. The extensive evaluation of ADaaS is conducted offline, using representative cloud workloads, as well as online, running in a European critical infrastructure provider’s set-up for real-time anomaly detection. The results show that ADaaS achieves high accuracy with a low false alarm rate in detecting anomalies. To the best of the author's knowledge, ADaaS is the first system that offers anomaly detection as a service for cloud infrastructure. Cloud infrastructure will benefit from this proposed approach since it is designed to be flexible and distributed, as well as being able to detect anomalies in real time. Future work will include identification of anomalies, zooming in on detections in order to focus on remediation actions in response to detected anomalies using the service model. Further work will look into integrating ADaaS with Software Defined Networking (SDN) solutions, because these evolving technologies will create new vulnerabilities in cloud computing environments; an anomaly detection system would therefore be needed to reduce the overall risk.
