infrastructure IT
Contract No 312758
Deliverable: 4.2
Resilient Cloud Management
AIT Austrian Institute of Technology•ETRA Investigación y Desarrollo•
Fraunhofer Institute for Experimental Software Engineering IESE•
Karlsruhe Institute of Technology•NEC Europe Ltd.•Lancaster University•Mirasys•
Document Control Information
Title Deliverable 4.2 - Resilient Cloud Management
Editor Steven Simpson (ULANC)
Author(s) Steven Simpson (ULANC), Noor-ul-hassan Shirazi (ULANC), Simon Oechsner (NEC), Michael Watson (ULANC), Andreas Mauthe (ULANC), David Hutchison (ULANC), Markus Tauber (AIT), Christian Jung (IESE), Manuel Rudolph (IESE), Reinhard Schwarz (IESE), Matthias Flittner (KIT), Roland Bless (KIT)
Classification
Red – highly sensitive information, limited access for:
Yellow – restricted, limited access for:
Green – restricted to consortium members
White – Public
Internal reviewer(s) Paul Smith (AIT), Thomas Puschacher (AMARIS)
Review Status Draft
WP manager accepted
Coordinator accepted
Requested deadline 2014/12/19
Versions
Version | Date | Change | Comment/Editor
1 | 2014/12/08 | Version for review | Steven Simpson (ULANC)
2 | 2013/12/18 | Incorporated comments from review | Steven Simpson (ULANC)
3 | 2013/12/19 | Updated header |
Summary
Cloud environments make resilience more challenging due to the sharing of non-virtualized resources, frequent reconfigurations, cyber-attacks and legal aspects. We present a Cloud Resilience Management Framework (CRMF), which models and then applies an existing resilience strategy to a cloud operating context to diagnose anomalies. The framework uses an end-to-end feedback loop that allows remediation to be integrated with existing cloud management systems. We demonstrate the applicability of the framework with use cases for effective cloud resilience management. This framework will be further refined and modified according to results from implementations of a prototype and experiments in the context of the demonstration scenarios being considered in the SECCRIT project.
Contents
List of Figures
List of Tables
1 Introduction
1.1 Motivation
1.2 SECCRIT objectives
1.3 Purpose of the deliverable
1.4 Legal Issues
2 Background
2.1 SECCRIT architecture
2.2 Network resilience
2.3 Resilience in the cloud
2.4 Policy-based Resilience Management
2.5 Related Work
3 Cloud Resilience Management Framework (CRMF)
3.1 Architecture
3.1.1 Design requirements
3.2 Deployment function
3.2.1 SLA parser
3.2.2 Placement function
3.2.3 Deployment template generator
3.3 Anomaly detection
3.3.1 Data collection engine
3.3.2 Network analysis engine
3.3.3 System analysis engine
3.3.4 Resilience policies (RP)
3.4 Policy Engine
3.4.1 IND2UCE Framework
3.4.2 Data Usage Control Policies
3.4.3 Policy Engine in CRMF
4.1 Deployment function
4.1.1 SLA parser
4.1.2 Placement function
4.1.3 Deployment template generator
4.2 Interaction with tools for Audit Trails and Root-Cause Analysis (AT-RCA)
4.3 Online anomaly detection
4.3.1 Data collection engine
4.3.2 One-class SVM for SAE
4.3.3 Recursive density estimation
4.4 Resilience policies
4.5 Data-usage policies
5 Quantitative evaluation
5.1 Test-bed
5.2 Software tools
5.3 Evaluation Metrics
5.4 Tuning the SAE detection algorithm
5.5 Proposed RDE algorithm for NAE
5.6 Evaluation of SAE with malware samples
5.6.1 Detecting Kelihos
5.6.2 Detecting Zeus
5.7 Evaluation of NAE with network-level attacks
6 Qualitative evaluation
6.1 Failure recovery of a virtual machine with minimum interruption to a service
6.2 Anti-affinity
6.3 Service of traffic management system is not available due to Denial-of-Service attack
7 Conclusions and Future Work
Appendices
A Linkage to other project results
B Detection results of portscan using RDE
List of Figures
1 SECCRIT Objective mapping to CRMF
2 SECCRIT Architectural Framework
3 Components mapping to architecture
4 The D2R2+DR resilience strategy
5 D2R2 components
6 A Resilience-oriented View
7 Collaborative interface for resilience management
8 An overview of the CRMF architecture
9 The flow of information and hierarchy of engines within a CRMF
10 IND2UCE Framework
11 CRMF system architecture
12 Internal structure of the deployment function
13 Parameter Classes
14 A monitoring-oriented view – identifying different interfaces
15 Overview of Data Collection Engine
16 An Overview of Policy Engine
17 Experimental setup test-bed
18 Time taken to train the classifier vs. training data-set size
19 Time taken to output a class vs. training data-set size
20 Original RDE
21 Modified RDE
22 The performance metric comparison of original vs modified algorithm
23 Results of detection for Kelihos-5 using end-system features and a variety of kernel parameters
24 Detection of Zeus-1433 using end-system features with a tuned classifier
25 Detection of various Zeus samples using end-system data and a tuned classifier
26 Depiction of migration during normal period, with anomaly directed at static host (NM-MT0)
27 Depiction of migration during anomalous period, with anomaly directed at static host (AM-MT0)
28 Depiction of migration during normal period, with anomaly directed at migrating host (NM-MT1)
29 Depiction of migration during anomalous period, with anomaly directed at migrating host (AM-MT1)
30 Legend for depictions of anomaly/migration interaction
31 Detection results of high-intensity DoS attack when migration happening during anomalous period
32 Detection results of high-intensity DoS attack when migration happening during normal period
33 The Mirasys test case for VM failure recovery
34 Organisations involved in use case
35 Results from simulations that implement an incremental detection and remediation approach to a DDoS attack [YFSF+11]
List of Tables
1 Summary of characterisation
2 Detection results of DoS attack with MT0 under high and low intensity
3 Detection results of DoS attack with MT1 under high and low intensity
4 Linkage to SECCRIT Deliverables
1 Introduction
Due to the recent trend of using cloud computing for Critical Infrastructure (CI) services, it is becoming increasingly important for clouds and cloud-based services to be resilient, i.e., able to maintain operations and services even in the presence of challenges. To do this, a number of resilience-supporting mechanisms are needed at various locations and layers (tenant and physical infrastructures), such as provisioning of redundant components, traffic capturing, anomaly detection and classification, and network reconfiguration. Moreover, these mechanisms require appropriate coordination to identify the occurrence of a challenge. The characteristics of the challenges to be detected determine which mechanisms can then be used, and where they should be placed. More than one detection algorithm could be used, for example, an always-on, low-cost anomaly detection algorithm alongside an offline, more computationally expensive classifier to diagnose the root cause of an anomaly. However, it is non-trivial to co-ordinate such mechanisms, for several reasons:
• They might work on different timescales. For example, the deployment function of cloud orchestration might foresee the need for scalable components and react to load changes on a longer timescale by scaling out a service, whereas reacting to an anomaly requires short-term decision-making to mitigate a potential attack.
• Distinct mechanisms might also have different objectives, and attempt to perform conflicting actions. A reaction to an anomaly might be to migrate VMs away from an affected node, but this migration might conflict with standing policies to keep some VMs apart (because they belong to rival tenants) or together (to reduce latency).
• Reconfiguration of the mechanisms could unexpectedly produce such conflicts and discrepancies.
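The second reason above, a remediation action colliding with standing placement policies, can be illustrated with a minimal sketch. The rule representation, the VM and host names, and the helper functions are all hypothetical and are not part of the CRMF; the sketch only checks whether migrating a VM away from an affected host would break a keep-apart or keep-together rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlacementRule:
    kind: str        # "anti-affinity" (keep apart) or "affinity" (keep together)
    vms: frozenset   # names of the VMs the rule applies to


def violations(placement, rules):
    """Return the rules broken by a VM-to-host placement."""
    broken = []
    for rule in rules:
        hosts = [placement[vm] for vm in rule.vms if vm in placement]
        if rule.kind == "anti-affinity" and len(set(hosts)) < len(hosts):
            broken.append(rule)   # two VMs that must be apart share a host
        elif rule.kind == "affinity" and len(set(hosts)) > 1:
            broken.append(rule)   # VMs that must be together are split
    return broken


def safe_targets(vm, placement, candidate_hosts, rules):
    """Hosts to which `vm` could migrate without breaking any rule."""
    ok = []
    for host in candidate_hosts:
        trial = {**placement, vm: host}
        if not violations(trial, rules):
            ok.append(host)
    return ok


# Hypothetical situation: host1 is affected by an anomaly.
placement = {"vm-a": "host1", "vm-b": "host2", "vm-c": "host1"}
rules = [
    PlacementRule("anti-affinity", frozenset({"vm-a", "vm-b"})),  # rival tenants
    PlacementRule("affinity", frozenset({"vm-a", "vm-c"})),       # latency-sensitive pair
]
targets = safe_targets("vm-a", placement, ["host2", "host3"], rules)
```

In this example no single-VM migration is acceptable: host2 breaks the keep-apart rule and host3 breaks the keep-together rule, so the remediation would have to move vm-a and vm-c together or be escalated.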
This deliverable introduces a Cloud Resilience Management Framework (CRMF), whose goal is to facilitate the interactions of components that implement resilience mechanisms, as well as the (re-)configuration of such a system. The components of this framework aim to detect challenges and events in the system, such as an attack or a network overload, and to react to these events by reconfiguration, in order to maintain the operation of the system and, therefore, the critical services running on it.
Many of the decisions to react to events are expressed as Event-Condition-Action (ECA) policies, which bind the components and mechanisms together. Policies sit idle until associated events occur and their conditions are met. Their actions may cause the reconfiguration of the system, including the deployment or activation of new components, which can in turn generate new events, and trigger further policies. This continuous process constitutes a policy-driven resilience feedback control loop, which allows (for example) an initial network anomaly to activate closer inspection of traffic, eventually leading to reconfiguration of a firewall. Such a process thus automatically handles a challenge to the system, without having expensive mechanisms (such as the closer traffic inspection) active at all times.
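The cascade described above (anomaly, closer inspection, firewall reconfiguration) can be sketched as a small event loop. This is an illustration under stated assumptions rather than the CRMF implementation: the event names, the 0.8 threshold, the `system` dictionary and both actions are invented for the example.

```python
from collections import deque


def start_inspection(system, attrs):
    """Action: activate the expensive mechanism only when it is needed."""
    system["inspection_active"] = True
    # Hypothetical outcome: inspection identifies the offending source,
    # which is reported as a new, more detailed event.
    return [("attack_classified", {"source": attrs["suspect"]})]


def block_source(system, attrs):
    """Action: reconfigure the firewall; generates no further events."""
    system["firewall_blocked"].add(attrs["source"])
    return []


# Each policy is (event name, condition on the event attributes, action).
POLICIES = [
    ("network_anomaly", lambda a: a["score"] > 0.8, start_inspection),
    ("attack_classified", lambda a: True, block_source),
]


def run_loop(system, initial_events):
    """Process events until none are left; actions may enqueue new events."""
    queue = deque(initial_events)
    handled = []
    while queue:
        name, attrs = queue.popleft()
        handled.append(name)
        for event, condition, action in POLICIES:
            if event == name and condition(attrs):
                queue.extend(action(system, attrs))
    return handled


system = {"inspection_active": False, "firewall_blocked": set()}
trace = run_loop(
    system,
    [("network_anomaly", {"score": 0.93, "suspect": "203.0.113.7"})],
)
```

The point of the design is that the second policy only ever fires because the first one activated a component that produced a new event, so the costly inspection stays off until a coarse detector raises an anomaly.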
It is of paramount importance that cloud administrators are able to effectively manage the underlying technologies which enable configurations, elasticity, dynamic invocation of services, and monitoring activities such that security and resilience requirements of the critical service users and providers are met. The set of policies defining the possible reconfiguration actions is not fixed, and different policies may be loaded or unloaded over time to reflect better resilience practices or a better understanding of the challenges. The expression of resilience mechanisms and strategies as policies allows new policies to be studied and analysed off-line before deployment, to determine correct interaction with existing policies.
1.1 Motivation
The motivation for the work presented here comes from previous work [SFSM11, SHS+11a] and from the objectives of Work Packages 3 and 4 of the SECCRIT project, which state:
• To develop a suitable policy vocabulary, policy mechanisms, and a policy editor that supports the user in specifying his security requirements and in deriving the resulting, machine enforceable security policy.
• To develop concepts for policy deployment and re-deployment to the cloud in a secure manner.
• Anomaly-based detection techniques to discover deviations in system and network behaviour. This is essential to disclose cyber-attacks and predict their impact.
Moreover, the stringent security requirements of critical infrastructures demand reliable detection of any deviation from normal behaviour. Anomaly-based detection techniques are used to detect such deviations in, e.g., system and network behaviour, at run time. Deviations can, for example, be caused by cyber-attacks or component failures. However, in this work, our focus is not on specific anomaly detection techniques, but on exploring the concept of resilience in clouds, i.e., timely (possibly on-line) detection and identification of anomalies, and remediation actions to ensure continuous operation. This analysis will help further research into more robust cloud management systems for critical infrastructure provisioning, and help to identify how and where in cloud environments those components of resilience management interfaces can be operationally applied.
1.2 SECCRIT objectives
The overall objectives of the SECCRIT project include the following, which are directly relevant for this deliverable:
1. Understand Cloud-Computing risks.
2. Understand Cloud behaviour.
3. Establish best practices for secure Cloud service implementations.
From the Description of Work (DOW) for Work Package 4: “For understanding and managing risks associated with cloud environments at operation time, SECCRIT will provide anomaly-based detection techniques to discover deviations in system behaviour. This is essential to disclose cyber-attacks and predict their impact.” Figure 1 shows the mapping of the SECCRIT objectives to the proposed resilience management framework.
[Figure 1 (diagram): the SECCRIT objectives (1. Definition of legal guidelines; 2. Understand cloud behaviour; 3. Understand cloud computing risks; 4. Establish best practices for secure cloud service implementations; 5. Demonstration of SECCRIT in real use cases) are mapped to the Work Package 4 objectives (1. Anomaly-based detection techniques to discover deviations in system and network behaviour; 2. Technologies for implementing policy enforcement; 3. Concepts for policy deployment and re-deployment) and to the SECCRIT frameworks (1. Anomaly Evaluation Framework for Cloud Computing (AEFCC); 2. Cloud Resilience Management Framework (CRMF); 3. Integrated Distributed Data Usage Control Enforcement (IND²UCE)).]
Figure 1: SECCRIT Objective mapping to CRMF
1.3 Purpose of the deliverable
This deliverable describes the cloud resilience management framework (CRMF). It introduces management interfaces, logical functions, and a description of the data flow of the various sub-components to provide complete end-to-end resilience. This framework supports security requirement specifications in terms of resilience policies and is composed of different components which take input from various monitoring and detection mechanisms to ensure high-level security and resilience requirements are being met.
The cloud resilience architecture we describe satisfies the roles of the inner loop elements by realising them as software engines that together form a Cloud Resilience Manager.
This deliverable has the following structure. Background and related work is discussed in Section 2, with reference to the SECCRIT architecture. Section 3 presents the resilience management framework and Section 4 provides more specific detail of its components, focusing on online anomaly-detection techniques for providing end-to-end resilience. Section 5 presents a quantitative evaluation of anomaly detection, while Section 6 evaluates the framework qualitatively against project use cases. Section 7 concludes the deliverable.
1.4 Legal Issues
This brief section summarizes the legal issues that have been addressed in the preparation of this deliverable. It mainly focuses on the anomaly detection component, since the deployment function and the policy framework have no direct contact with user data.
Collection and storage of personal data: The anomaly-detection techniques described herein use data acquired by observing network traffic, which includes any sensitive data contained in headers or payload, even if these aspects are not necessary for detection.
• Reason for collection: This data is collected for statistical analysis.
• Minimization of quantity of personal data collected: Most information from any observed packet is immediately discarded (e.g., payload) or condensed to inferred information (e.g., packet size). Furthermore, the remaining information (such as IP addresses) is aggregated into entropy measurements, from which the original data cannot be recovered.
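The aggregation step in the last item can be illustrated with a short sketch. It assumes Shannon entropy over the addresses seen in one observation window; the text above does not state which entropy definition is used, and the function and field names are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Source addresses observed in one (hypothetical) measurement window.
window = ["10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.3"]

# Only the aggregate is kept; the addresses themselves are discarded.
features = {"src_ip_entropy": shannon_entropy(window), "packets": len(window)}
del window
```

Because many different sets of addresses yield the same entropy value, the stored feature does not allow the individual addresses to be reconstructed, which is the property the minimization argument relies on.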
Prevention of unauthorized access: Collection of data and its processing does not need to leave the physical network of a data centre.
Real-time tracing of data: There is no personal data to be traced.
Assured deletion of data: Most personal data is effectively deleted, masked or reduced at the point of collection, or kept within the private physical networks from where the data was collected. Derived data does not need to be retained beyond a time span of (say) 20 minutes. (This does not eliminate the possibility that organizational policies with a broader scope than anomaly detection would keep records for longer for administrative purposes such as forensics and logging.)
Configuration of non-functional requirements: Anomaly-detection systems could be misconfigured such that unprocessed records containing personal data could be exported from the networks where they have been captured, and exposed to misuse. Care should be taken to be aware of this potential if requirements appear to dictate such unusual configurations. Ordinarily, there should be no reason not to eliminate personal data before it leaves the private network. The configuration of database servers holding personal data (for example) forms part of the deployment function. Therefore, non-functional requirements such as the location of such servers within specific countries are to be handled by this function.
Tracing of failure: No personal data is exposed in a recoverable form.
Forensic value/credibility of data: No personal data is exposed in a recoverable form. However, there may be indirect actions related to anomaly detection that lead to other personal data being stored. For example, if an anomaly is detected, more detailed data might be collected for other purposes, e.g. to counter the anomaly, or to log the issue. However, this is beyond the scope of this document.
2 Background
This section considers the positioning of the CRMF in an overall cloud architecture in §2.1, defines resilience for networking in §2.2 and for clouds in §2.3, discusses the use of policies for resilience in §2.4, and finally discusses related work in this area in §2.5.
2.1 SECCRIT architecture
The SECCRIT Consortium has developed an architectural framework for deploying critical infrastructure services in the cloud [BFH+14], which provides a basis for the development of technical solutions such as Cloud Resilience Management. The reference architecture used in SECCRIT (Figure 2) divides activities in a cloud environment into four layers [sec13]. The term ‘tenant’ refers to cloud customers who use virtual resources provided by the physical infrastructure provider. Tenants can use them to provide services to their own users; we will refer to ‘users’ as those who use the services of tenants.
In SECCRIT, we use the more specific term ‘cloud infrastructure provider’ instead of ‘cloud provider’, because we assume that the provider offers Infrastructure as a Service (IaaS). Several tenants may host critical services on a single physical infrastructure, with varying security and resilience requirements. For example, not all components of a single tenant might have the same importance, and different components of different tenants might be mixed-and-matched in the physical infrastructure. The security and resilience mechanisms offered by critical service providers need to ensure continuous system operation even in the presence of these challenges.
[Figure 2 (diagram): abstraction levels (User Level, Critical Infrastructure (CI) Service Level, Tenant Infrastructure Level, Physical Cloud Infrastructure Level), their resources (client devices; CI service with service components A, B and C; tenant infrastructure with virtual compute, storage and network resources; cloud infrastructure (data centre) with compute, storage and network) and stakeholders (CI service user, CI service provider, tenant infrastructure provider, cloud infrastructure provider), with SLAs.]
Figure 2: SECCRIT Architectural Framework
With respect to security and resilience requirements, the deployment function as part of the orchestration component (and a component of the CRMF) translates high-level service descriptions and Service Level Agreements (SLAs) into automatically deployable descriptions such as Heat¹ templates in the OpenStack² environment. Subsequently, the anomaly-detection engine of the CRMF uses input from network and system activity to detect challenges, which are then remediated using the policy engine. The activity is measurable at the physical (cloud-infrastructure provider) layer, which has physical networks and machines, and has an external view of system activity in VMs. Moreover, network activity can be measured in the tenant-infrastructure layer by observing traffic on virtual networks, which could be performed by the tenant by running anomaly detectors on VMs that have access to these networks. A mapping of components to the architecture is shown in Figure 3.
[Figure 3 (diagram): the deployment function/orchestration, anomaly detection and policy framework components placed against the abstraction levels and resources of the SECCRIT architecture (User Level, CI Service Level, Tenant Infrastructure Level, Physical Cloud Infrastructure Level).]
Figure 3: Components mapping to architecture
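As an illustration of the translation step performed by the deployment function in this section, the following sketch turns a simplified requirement record into a Heat Orchestration Template (HOT) structure. It is not the SECCRIT SLA parser or template generator: the `sla` fields, the flavor and image names, and the function itself are assumptions made for the example. The resource types `OS::Nova::Server` and `OS::Nova::ServerGroup` are standard Heat resource types.

```python
import json


def build_hot_template(service_name, replicas, flavor, image, anti_affinity):
    """Build a HOT document (as a dict) for `replicas` identical servers."""
    resources = {}
    hints = {}
    if anti_affinity:
        # A server group with the anti-affinity policy asks the scheduler
        # to place the member servers on different physical hosts.
        resources["server_group"] = {
            "type": "OS::Nova::ServerGroup",
            "properties": {"policies": ["anti-affinity"]},
        }
        hints = {"group": {"get_resource": "server_group"}}
    for i in range(replicas):
        props = {"name": f"{service_name}-{i}", "flavor": flavor, "image": image}
        if hints:
            props["scheduler_hints"] = hints
        resources[f"server_{i}"] = {"type": "OS::Nova::Server", "properties": props}
    return {"heat_template_version": "2013-05-23", "resources": resources}


# Hypothetical, already-parsed SLA requirements for one service.
sla = {"service": "traffic-mgmt", "replicas": 2, "separate_hosts": True}
template = build_hot_template(
    sla["service"], sla["replicas"], "m1.small", "ubuntu-14.04", sla["separate_hosts"]
)
# HOT is normally written as YAML; JSON is emitted here to avoid a
# third-party dependency.
print(json.dumps(template, indent=2))
```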
2.2 Network resilience
Network resilience [CMH+07] is a wide-ranging concern, and can be defined as “the ability of a network to provide an acceptable level of service in light of various challenges” [SHÇ+10]. A target level of service can be defined in a Service Level Agreement (SLA) using resilience metrics, such as performability and dependability metrics. The challenges to normal operation considered under the umbrella of network resilience transcend those typically considered within any one discipline, and include attacks by malicious adversaries, large-scale disasters, software and hardware faults, environmental concerns and human mistakes.
¹ Heat Orchestration Template: http://docs.openstack.org/developer/heat/template_guide/hot_guide.html
Based on work by [SHÇ+10], the ResumeNet project devised a framework whereby a number of resilience principles are defined, including the resilience strategy D2R2+DR: Defend, Detect, Remediate, Recover, Diagnose and Refine, which is outlined in Figure 4.
[Figure 4 (diagram): the six phases Defend, Detect, Remediate, Recover, Diagnose and Refine.]
Figure 4: The D2R2+DR resilience strategy
At its core is a control loop comprising a number of phases that realise the real-time aspect of the D2R2+DR strategy and consequently implement cloud resilience. Based on the resilience control loop, the other necessary elements of our framework are derived, namely the deployment function, anomaly detection and the policy engine, which aim to build situational awareness, and multilevel information sharing and control mechanisms.
Under the D2R2+DR framework, there must exist components capable of reconfiguring underlying infrastructure in response to challenges using policies (Fig. 5). Reconfiguration need not apply to the same components on which the detection was based. A policy engine is responsible for mapping detection events to reconfigurations, accepting a resilience strategy as a collection of policies [SHS+11a, SHS+11b].
Figure 5: D2R2 components
We can apply D2R2 to the SECCRIT architectural framework (described in §2.1) to provide a resilience view of it (Fig. 6). At the physical layer, the cloud-infrastructure operator has access to physical nodes and network, which can be monitored to inform the detection process. The operator can also reconfigure these devices, in response to detected challenges, using policies. Cloud-infrastructure D2R2 may exist as monitoring and reconfiguration points on physical hosts and networks, and on some virtual components. Resilience managers and detectors need not exist on any physical equipment used directly to provide virtual resources to the layer above.
At the tenant-infrastructure layer, the tenant has access to VMs, and possibly virtual taps on virtual networks (VNs), which can inform detection. In response to challenges, the tenant may reconfigure the hosted machines, and some functionality of the virtual networks might also be exposed. Thus, tenant-infrastructure D2R2 is spread across components visible to this layer. Within the inner D2R2 loop, some interaction between these layers may exist in the form of events and reconfigurability exposed by the lower layer. For details on policy and resilience viewpoints, we refer the reader to the SECCRIT architecture white paper [sec13].
2.3 Resilience in the cloud
We reinterpret and define cloud resilience as the ability to maintain an acceptable level of system operation and service even in the presence of challenges. Resilience is already supposed to be a fundamental property of Cloud service provisioning platforms. However, a number of significant outages have occurred that demonstrate Cloud services are not as resilient as one would hope, particularly for providing critical infrastructure services [Nea11]. In a number of cases, these outages were caused by problems in the network infrastructure. As a recent ENISA report on the security and resilience of Governmental Clouds suggests: “the availability of a cloud service is often dependent on the network used to access it. . . ” and measures should be taken to ensure the resilience of access networks [Cat11]. To the best of our knowledge, there is limited or no understanding about how to provide a resilient Cloud infrastructure that can collectively address challenges in a coordinated manner; these are treated as separate concerns.
In general, one of the benefits of deploying services in the Cloud is the relative simplicity due to its obviation of the direct purchase and maintenance of physical infrastructure. However, this comes at the cost of having to rely on commercial off-the-shelf (COTS) components and a loss of control over parts of the service deployment process. Meanwhile, economies of scale brought about by a unified service offering built on top of a virtualised infrastructure enable Cloud service providers to exist and make a profit. However, the deployment of critical services in the Cloud potentially breaks this economic model by requiring additional configurations to address security and resilience concerns related to a particular organisation and service. To help mitigate this problem, we propose to extend the concept of resilience patterns [SFSM+12] for critical service provisioning in the Cloud. Resilience patterns are specifications of reusable strategies, or configurations, for responding to commonly understood challenges, e.g., they specify the relationship between various detection and remediation mechanisms in order to mitigate a challenge. A pattern can be evaluated using off-line tools, e.g., a simulator, for its suitability to mitigate a particular challenge, such as a DDoS attack, and subsequently deployed and configured for a specific deployment at run-time. Specifying patterns and sharing those that are known to be effective can reduce the overhead of addressing security and resilience concerns, maintaining one of the benefits of utilising the Cloud, i.e., that of reduced deployment costs and complexity.
2.4 Policy-based Resilience Management
Management and resilience of cloud environments are closely linked. The extra layer of resource virtualization makes it difficult to plan effective management due to varying user demands, co-hosted VMs and the arbitrary deployment of multiple applications. Generally, management policies are used to govern the behaviour of a system. These management policies can be viewed as: “the constraints and preferences on the state or the state transition, of a system and is a guide on the way to achieving the overall objective which itself is also represented by a desire system state” [Goh97]. When using policy-based management, it is critical that the rules being specified actually stem from the higher-level requirements and that they are implementable.
Challenges to the operation of a Cloud Infrastructure can occur rapidly and with little warning, requiring a fast response in order to maintain acceptable service levels. In order to mitigate a challenge, complex multi-phase strategies are required, which combine various monitoring and detection mechanisms that influence the behaviour of remediation mechanisms. To address these issues, SECCRIT advocates the use of policy-based network management techniques for the configuration of resilience strategies.
These techniques allow descriptions of real-time adaptation strategies, which are separate from the implementation of the mechanisms that realise the strategy. This separation allows changes to be made to strategies without the need to take resilience mechanisms off-line. In short, two forms of policy are supported: authorisation (or access control) policies and obligation policies. Both are supported by IND2UCE, which is the policy environment used by our resilience framework.
Obligation policies specify management operations that must be performed when a particular event occurs, provided that some supplementary conditions are true. These policies follow the Event-Condition-Action (ECA) paradigm and are of the form:
on <event>
if <conditions>
do <target> <action>;
Therefore, the occurrence of the specific event is a necessary condition for the mandated operation to be performed. The event is a term of the form $e(a_1, \ldots, a_n)$, where $e$ is the name of the event and $a_1, \ldots, a_n$ are the names of its attributes. The condition is a boolean expression that may check local properties of the nodes and the attributes of the event. The target is the name of a role (i.e., a placeholder) where the action will be executed, so the service or resource assigned to the target role must support an implementation of the action. The action is a term of the form $a(a_1, \ldots, a_m)$, where $a$ is the name of the action and $a_1, \ldots, a_m$ are the names of its attributes. To simplify notation, an obligation policy can have a list of target-action pairs, all invoked when the event occurs and the condition holds. The attributes of an event may be used for evaluating the condition (to decide whether to invoke the action or not), or they may be passed as arguments to the action itself. Implicitly, the role to which the obligation policy belongs is the subject of the obligation, i.e. the entity enforcing the policy, and the action is invoked on a target role. Note that the target may be the same as the subject, i.e. a role may perform actions on itself. Obligations can also be used to load other policies (obligations or authorisations) into the system, or existing policies may be enabled or disabled to change the management strategy at run-time.
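The structure of an obligation policy described above can be mirrored in a small data model. This is a sketch for illustration only and not the IND2UCE policy language: the class names, role names, event name and threshold are hypothetical. It shows the role-as-placeholder idea (actions are looked up on whatever object is currently assigned to the target role), several target-action pairs per policy, and event attributes being used both in the condition and as action arguments.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Obligation:
    event: str                           # name e of the triggering event
    condition: Callable[[dict], bool]    # boolean expression over the attributes
    do: List[Tuple[str, str]]            # (target role, action name) pairs
    enabled: bool = True                 # policies can be disabled at run-time


class Firewall:
    def __init__(self):
        self.blocked = []

    def block(self, **attrs):
        self.blocked.append(attrs["src"])


class AuditLog:
    def __init__(self):
        self.lines = []

    def record(self, **attrs):
        self.lines.append(f"blocked {attrs['src']} at rate {attrs['rate']}")


def dispatch(event, attrs, policies, roles):
    """Invoke every target-action pair of each enabled, matching policy."""
    for policy in policies:
        if policy.enabled and policy.event == event and policy.condition(attrs):
            for target, action in policy.do:
                # The role is a placeholder: the object assigned to it must
                # implement the named action.
                getattr(roles[target], action)(**attrs)


roles = {"edge_firewall": Firewall(), "audit": AuditLog()}
policies = [
    Obligation(
        event="high_rate",
        condition=lambda a: a["rate"] > 1000,
        do=[("edge_firewall", "block"), ("audit", "record")],
    )
]
dispatch("high_rate", {"src": "198.51.100.4", "rate": 2500}, policies, roles)
dispatch("high_rate", {"src": "198.51.100.9", "rate": 10}, policies, roles)
```

Only the first event satisfies the condition, so only its source is blocked and logged; setting `enabled = False` on the policy would switch the behaviour off without touching the firewall or logger implementations.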
Authorisation policies specify what actions a subject is allowed (positive authorisation) or forbidden (negative authorisation) to invoke on a target. The subject and the target are role names. The action and the condition are defined as in obligations. Authorisation decisions could be made by one or more specific roles in the network, but commonly implementations are based on the target making decisions and enforcing the policy, as it is assumed that target roles wish to protect the resources they provide to the network.
auth[+/-] <subject> if <condition>
then <target><action>;
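A decision function for policies of this form might look as follows. The sketch makes two assumptions that the text does not state: a matching negative authorisation overrides any positive one, and a request with no matching positive authorisation is denied. The role names, action and context fields are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Authorisation:
    positive: bool                      # True for auth+, False for auth-
    subject: str                        # role requesting the action
    target: str                         # role on which the action is invoked
    action: str
    condition: Callable[[dict], bool]


def is_allowed(subject, target, action, context, policies):
    """Decide a request at the target, as in target-based enforcement."""
    matching = [
        p for p in policies
        if (p.subject, p.target, p.action) == (subject, target, action)
        and p.condition(context)
    ]
    if any(not p.positive for p in matching):
        return False                    # assumption: negative takes precedence
    return any(p.positive for p in matching)   # assumption: default deny


policies = [
    # auth+ tenant_manager if requester owns the VM then hypervisor migrate;
    Authorisation(True, "tenant_manager", "hypervisor", "migrate",
                  lambda c: c["owner"] == c["requester"]),
    # auth- tenant_manager if host is under maintenance then hypervisor migrate;
    Authorisation(False, "tenant_manager", "hypervisor", "migrate",
                  lambda c: c["maintenance"]),
]

own_vm = {"owner": "t1", "requester": "t1", "maintenance": False}
during_maintenance = {"owner": "t1", "requester": "t1", "maintenance": True}
foreign_vm = {"owner": "t1", "requester": "t2", "maintenance": False}
```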
2.5 Related Work
The real-time and increasingly dynamic nature of cloud infrastructure makes the task of defining a resilience framework very challenging. Policy-based management has proven to be very effective for the management of complex systems, as is evident in previous literature. The policy-based approach to network and system management proposed in [MAP96, HAW96] defines a framework for the management of policies, policy hierarchies and policy transformation. In [RPB93], the authors propose to enrich managed objects with policy goals as required by the management policy. They describe policies in two parts: an active part, containing application-specific functionality, and a passive part, which can be re-used without any change. In [KLM+99], the authors enforce policies by means of rules, but their understanding of a rule is more restrictive.
The works mentioned above address the specification and implementation of policies and, more recently, SLAs; little focus has been given to the refinement of high-level requirements into low-level policies. Verma presented an approach to policy translation that is based on a set of tables [Ver00]. The tables identify the relationships between users, applications, servers, routers and classes of service supported by the network. Whilst this technique offers the advantage of being fully automated, it is an inflexible approach, supporting only a very specific type of high-level SLA policy and low-level device configuration policy.
The work in [CMBG00] outlines a policy authoring environment that provides a policy tool, called POWER, for refining policy. A domain expert first develops a set of policy templates, expressed as Prolog programs, and the policy authoring tool has an integrated inference engine that interprets these programs to guide the user in selecting the appropriate elements from the management information model to be included in the final policy. The main limitation of this approach is the absence of any analysis capabilities to evaluate the consistency of the refined policies. Similarly, the work presented in [BCV04] allows translation of service-level objectives into configuration parameters of a managed system. The transformation engine takes the service requirements of the user as input, and searches a database to determine the optimal parameter values that provide the required level of service. Limitations of this technique include its dependence on a rich enough database, which is only possible by observing the system for some period of time, and the inability to deal with situations where a given requirement specification results in different configurations. Several relevant projects of interest are highlighted below.
2. TClouds [tcl] was an EU FP7 project aimed at developing a cloud infrastructure that achieves security, privacy and resilience. Its objectives included identifying and addressing legal and business issues, defining a security architecture for the cloud, and providing resilient middleware for adaptive security on the cloud-of-clouds.
3. PRECYSE [pre] is an EU FP7 project. The strategic goal of PRECYSE is to define, develop and validate a methodology, an architecture and a set of technologies and tools to improve – by design – the security, reliability, and resilience of the information and communication technology (ICT) systems that support critical infrastructures (CIs).
4. Cloud Controls Matrix (CCM) [Tea10] is specifically designed to provide fundamental security principles to guide cloud vendors and to assist prospective cloud customers in assessing the overall security risk of a cloud provider.
5. OrBAC [KBB+03] was developed inside the RNRT MP6 project (communication and information system models and security policies of healthcare and social matters). The purpose of this project is to define a conceptual and industrial framework to meet the needs of information security and sensitive healthcare communications.
ResumeNet provides blueprints and design guidelines for our cloud resilience management framework. The resilience strategy proposed in ResumeNet is validated by detailing guidelines which can be applied to the problem of channel interference in wireless mesh networks and to explore the implications of multi-staged and collaborative detection.
The TClouds project targets cloud computing security and minimization of the widespread concerns about the security of personal data by putting its focus on privacy protection in cross-border infrastructures and on ensuring resilience against failures and attacks. They published work about an advanced cloud infrastructure that can deliver computing and storage that achieves a new level of security, privacy, and resilience. Their demonstration focus is on socially significant application areas such as energy and healthcare.
OrBAC provides a well-defined access control policy model, which can be integrated into the CRMF, and shall enable fine-grained access control of the resources. The OrBAC API has been created to help software developers introduce security mechanisms into their software. This API implements the OrBAC model, which is used to specify security policies, and also implements the AdOrBAC model [CM03], which is used to manage the administration of the security policies. The MotOrBAC [CCBC06] tool has been developed using this API to edit and manage OrBAC security policies. OrBAC has only been realized on homogeneous systems (such as firewalls) or at software level.
The CSA (Cloud Security Alliance) CCM (Cloud Controls Matrix) provides a controls framework that gives a detailed understanding of security concepts and principles that are aligned to the Cloud Security Alliance guidance in 13 domains. The foundations of the Cloud Security Alliance Controls Matrix rest on its customized relationship to other industry-accepted security standards, regulations, and controls frameworks such as ISO 27001/27002, ISACA CoBIT, PCI and NIST, and it will augment or provide internal control direction for SAS 70 attestations provided by cloud providers. This controls framework could serve as the backbone for evaluating the security levels of the CRMF.
3
Cloud Resilience Management Framework (CRMF)
The Cloud Resilience Management Framework defines several automatic functions of an administrative cloud domain, at various levels of the SECCRIT architecture, and how they interact to make the domain resilient to unplanned challenges. It also describes how functions in separate domains can collaborate in dealing with challenges. Implementations of CRMF functions are additional units to be closely integrated with an existing cloud management framework.
Conceptually, the resilience management can be instantiated in any virtual instance or level of an infrastructure service provider, addressing management in both single- and cross-domain cases, given that the infrastructure service provider implements it according to its specific management purposes and with respect to the available infrastructure.
[Figure 7 diagram. Elements shown: Resources (Compute, Network, Storage); Existing management framework; CRMF (SECCRIT); Collaborative interface]
Figure 7: Collaborative interface for resilience management
Information can be exchanged between different administrative domains through the collaborative interface shown in Figure 7, possibly with the aid of the Coordination and Organization Engine (COE). Typically, the kinds of information available through these interfaces are detection metrics, performance measurements, configurations, enforcement actions, etc.
3.1 Architecture
The CRMF consists of two main functions, the Deployment Function (DF) and the Resilience Manager (RM). The Deployment Function is concerned with supplying the Virtual or Cloud Infrastructure Manager with configurations describing the creation and deployment of virtual machines for instantiating a service. On the one hand, the DF acts as a controller for RMs by placing virtual machines in parts of the infrastructure supervised by an RM instance. Alternatively, it can also provide input to the RMs in terms of the types of instances started and their respective locations, which may allow detection mechanisms to work. On the other hand, and in addition, it directly creates and places instances according to the resilience requirements of the service. If high-availability service components needing (for example) backup VMs are defined, it deploys redundant instances for higher availability and places them so that the given levels of availability of the service components are met. Details of the deployment function are presented in Section 3.2 below.
Each Resilience Manager (RM) is composed of three software components, or engines, which are shown in Figure 8. (A in the figure represents a single hardware node in the cloud, and B represents the RM, with the aforementioned software engines residing within it.) For simplicity, only three nodes are shown and the network links that connect them have been omitted. The anomalies of particular interest are those caused by anomalous activity in the cloud, such as malware on cloud VMs or a denial-of-service attack on a host.
[Figure 8 diagram. Cloud infrastructure with elements labelled A, B, C and D. Legend: DF: Deployment function; ADE: Anomaly detection engine; PE: Policy engine; COE: Coordination and organization engine]
Figure 8: An overview of the CRMF architecture
The software components within each RM are the Anomaly Detection Engine (ADE), the Policy Engine (PE) and the Coordination and Organization Engine (COE). The RM on each node performs local detection based on features that a sub-component called the Data Collection Engine (DCE) gathers from its node's VMs and its local network view; the analysis of these features is handled by the System Analysis Engine (SAE) and Network Analysis Engine (NAE) components respectively. The PE component is in charge of remediation and recovery actions based on the output from the ADE, which is conveyed to it by the COE. Finally, the COE component coordinates and disseminates information between other instances and the components within its own node. It is ultimately in charge of maintaining the connections to its RM's peers and embodies the self-organizing aspect of the overall system.
In addition to physical node level resilience, the CRMF is capable of gathering and analysing data at the network component level through the deployment of network RMs, as shown by C in Figure 8. The component on which it is deployed (D in the figure) is an ingress/egress router for the cloud; the DCE is tightly coupled with the router/SDN and as such can gather features from all traffic passing through it.
Another important component of an RM is the Coordination and Organization Engine (COE), since it performs a critical role in the maintenance of the overall system. The COE acts autonomously to control the other components within its RM while at the same time communicating with its RM peers.
Figure 9 shows the relationship between RM engines and the coordinating role that the COE plays. The dashed arrow in Figure 9 indicates communication between the local COE and the remote COEs of each peer RM.
[Figure 9 diagram. Engines shown: DF, COE, PE, ADE, DCE, NAE, SAE, with a peer ADE, PE and COE; stages: Detect, Remediate & Recover, Defend. Legend: DF: Deployment function; ADE: Anomaly detection engine; PE: Policy engine; NAE: Network analysis engine; COE: Coordination and organization engine; SAE: System analysis engine]
Figure 9: The flow of information and hierarchy of engines within a CRMF
Self-organization in a system of RMs is achieved through the dissemination and exchange of meaningful information with respect to the system and network activities of each VM. In practice, and as depicted in Figure 8, there are various system/network interfaces that act as information dispatch points in order to allow efficient event dissemination [MHP10]. Here we highlight some design requirements of our resilience framework. More detail about individual components follows in subsequent sections.
3.1.1 Design requirements
• The framework shall be reactive and pro-active to challenges for fast detection and identification, preferably before challenges cause noticeable degradation.
• The framework shall have a modular design that allows re-use of resilience strategies. This is vital for handling failure of surrounding components.
• The components of the framework shall easily integrate with each other to allow complete flexibility and adaptability.
• The framework should be easy to modify or integrate according to security and resilience requirements.
• The framework shall easily integrate with existing cloud management systems.
• The framework shall provide functionality for efficient and effective management of resilience where needed.
• The framework shall use policies to realize overall resilience.
• The framework shall be composed of various distributed components to provide overall resilience which is desirable in the case of failure of individual components.
3.2 Deployment function
The deployment function concerns itself with supplying the Virtual or Cloud Infrastructure Manager with configurations describing the creation and deployment of virtual machines for the critical infrastructure. Its task is to translate high-level service descriptions and SLAs into automatically deployable descriptions, such as Heat templates in the OpenStack environment. It is therefore a part of a Tenant Infrastructure Management System (TIMS), being located between the cloud user and the Cloud Infrastructure Operator, cf. Figure 3.
While orchestration, of which the deployment function is a part, generally concerns itself with lifecycle management, configuration and resource assignment, these issues are not in the focus of the particular Critical Infrastructure use cases relevant in SECCRIT. On the other hand, with respect to project objectives and use cases, the inclusion of resilience patterns and mechanisms is of particular interest, since these support high availability of the infrastructure. Resilience is currently normally handled by offering the concept of availability zones (or similar concepts) on the part of the cloud infrastructure providers. An availability zone denotes a collection of resources that is not included in other availability zones, thus ensuring isolation of parts of the infrastructure if instantiated in different zones.
Thus, components that need a higher level of availability are recommended to be started redundantly in separate availability zones, which are promised to be fault independent and have a short interconnection delay (at least in the Amazon EC2 framework). Neither the resulting availability nor the actual delay (or bandwidth) is necessarily given in the form of an SLA. Comparing this with the higher availability and synchronization requirements of critical infrastructure shows the need for enhancing this area of service deployment. In particular, a deployment function is needed that places instances of virtual resources according to the resilience and performance requirements of the service user. However, this should be done without necessarily disclosing the detailed infrastructure of the cloud architecture, since this may be information an infrastructure provider is not willing to share. Thus, to not discourage cloud providers from offering such a service, the deployment function needs to provide information on a relatively abstract level. However, the deployment should be documented and correlated to the infrastructure in order to provide another piece of the audit trail.
In addition, the placement of virtual machines can be done in cooperation with the anomaly detection (AD) component, towards at least two ends: a) to place critical instances in AD-supervised parts of the infrastructure, and b) to provide input to the AD in terms of the types of instances started and their respective locations, which may allow it to infer standard traffic and behavioural patterns of and between these instances.
The aforementioned functionality can be separated into the following sub-components:
3.2.1 SLA parser
This subcomponent is responsible for parsing a user description of their service, i.e., the needed service components and their relation/connectivity, as well as the required performance and availability.
3.2.2 Placement function
The requirements extracted by the SLA parser translate into a set of virtual machines that need to be deployed in the physical infrastructure. The placement function maps these instances to the offered infrastructure of the cloud infrastructure provider.
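As an illustration of such a mapping, the following Python sketch places the redundant instances of each component in distinct availability zones. The function name `place_instances`, the abstract zone capacities and the greedy heuristic are assumptions made for this example; the placement algorithm actually used by the deployment function is not specified here.

```python
from typing import Dict, List


def place_instances(components: Dict[str, int],
                    zone_capacity: Dict[str, int]) -> Dict[str, List[str]]:
    """Map each component's replicas to distinct availability zones.

    components: component name -> number of redundant instances required.
    zone_capacity: zone name -> number of free VM slots. This is an abstract
    view, so the provider need not disclose its detailed infrastructure.
    Greedy heuristic (an assumption): each replica goes to the zone with the
    most free slots that does not yet host a replica of the same component.
    """
    free = dict(zone_capacity)
    placement: Dict[str, List[str]] = {}
    # Place the components with the most replicas first.
    for name, replicas in sorted(components.items(), key=lambda kv: -kv[1]):
        used: List[str] = []
        for _ in range(replicas):
            candidates = [z for z in free if free[z] > 0 and z not in used]
            if not candidates:
                raise ValueError(f"cannot place {name}: not enough independent zones")
            # Most free slots first; ties are broken by zone name.
            zone = min(candidates, key=lambda z: (-free[z], z))
            free[zone] -= 1
            used.append(zone)
        placement[name] = used
    return placement
```

Calling `place_instances({"db": 2, "web": 1}, {"az1": 2, "az2": 1})` puts the two `db` replicas into different zones and the `web` instance into the remaining free slot.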
3.2.3 Deployment template generator
Finally, the mapping must be provided in a form that is processable by the cloud management system. This means generating a deployment, e.g., a Heat template, using as input the placement and the additional service description parts generated by the SLA parser.
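A minimal sketch of this step is given below in Python. It assumes that a placement is available as a mapping from component names to lists of availability zones, and that flavour and image names are supplied per component; these inputs, the function name `heat_template` and the resource naming scheme are hypothetical. The result is a plain dictionary following the structure of a Heat Orchestration Template; serialising it to YAML is left out to keep the sketch free of external dependencies.

```python
from typing import Dict, List


def heat_template(placement: Dict[str, List[str]],
                  flavors: Dict[str, str],
                  images: Dict[str, str]) -> dict:
    """Build a Heat (HOT) template as a dictionary.

    One OS::Nova::Server resource is emitted per placed instance, pinned to
    the availability zone chosen for it.
    """
    resources = {}
    for component, zones in placement.items():
        for index, zone in enumerate(zones):
            resources[f"{component}_{index}"] = {
                "type": "OS::Nova::Server",
                "properties": {
                    "name": f"{component}-{index}",
                    "flavor": flavors[component],
                    "image": images[component],
                    "availability_zone": zone,
                },
            }
    return {
        "heat_template_version": "2013-05-23",
        "description": "Generated deployment (illustrative)",
        "resources": resources,
    }
```

A real generator would also have to emit networks, scaling groups and configuration scripts derived from the service description; they are omitted here.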
3.3 Anomaly detection
The anomaly detection engine is one of the core components of the CRMF and comprises three sub-components: the Data Collection Engine (DCE), the System Analysis Engine (SAE) and the Network Analysis Engine (NAE). The DCE monitors each VM and host and collects various metrics in order to produce feature vectors for the subsequent detection engines (i.e. NAE and SAE). The system- and network-level engines identify potential symptoms of anomalous behaviour without performing a lot of computation. They do not confirm the presence of anomalies, but merely suggest the possibility of one. Further analysis is needed to confirm whether an anomaly is present and, if so, to classify it. The initial event is generated by looking at each system- and network-level metric in isolation. Hence, a generated event indicates that either a host or a VM is experiencing an anomaly. A broader correlation between metrics can be very expensive, as the number of such comparisons grows exponentially with the number of VMs and monitored metrics.
Fine Grain Analysis (FGA) is then required; it uses a statistical algorithm to identify the cause of anomalies and to localize them. This analysis of the monitored data is required to further understand anomalous behaviour and to narrow down the scope of remediation. It can also generate an event in case the anomaly cannot be classified, in which case manual intervention is required. Below we give a brief overview of the sub-components.
3.3.1 Data collection engine
An essential question for resilience management is what information should be provided and where it should come from. Once the information that might affect the decision on remediation has been identified, it can serve as the basis for reasoning about effective resilience actions.
Therefore, the DCE is designed so that it can collect and process various relevant metrics pertaining to the system (such as CPU and memory) and the network (such as number of packets, number of bytes and throughput) for every VM and physical host. All metrics are collected at periodic intervals, with a configurable monitoring interval parameter. The DCE has sub-components which perform normalization and smoothing of the data and produce feature vectors; each feature vector is a sequence of values over a fixed interval of time and forms the basic input to the subsequent detection engines (i.e. NAE and SAE).
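The normalization, smoothing and windowing steps could look roughly as follows in Python. The concrete techniques (min-max scaling, an exponentially weighted moving average, non-overlapping windows) and all function names are assumptions chosen for illustration; the text does not prescribe which methods the DCE uses.

```python
from typing import List, Sequence


def min_max_normalise(samples: Sequence[float]) -> List[float]:
    """Scale samples to [0, 1]; a constant series maps to all zeros."""
    lo, hi = min(samples), max(samples)
    if hi == lo:
        return [0.0 for _ in samples]
    return [(s - lo) / (hi - lo) for s in samples]


def ewma(samples: Sequence[float], alpha: float = 0.5) -> List[float]:
    """Exponentially weighted moving average; alpha weights the newest sample."""
    smoothed: List[float] = []
    for s in samples:
        prev = smoothed[-1] if smoothed else s
        smoothed.append(alpha * s + (1 - alpha) * prev)
    return smoothed


def feature_vectors(samples: Sequence[float], window: int) -> List[List[float]]:
    """Cut the normalised, smoothed series into fixed-length windows.

    Windows do not overlap; a trailing partial window is discarded.
    """
    series = ewma(min_max_normalise(samples))
    return [series[i:i + window] for i in range(0, len(series) - window + 1, window)]
```

With a monitoring interval of, say, five seconds and `window=12`, each feature vector would cover one minute of a single metric.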
3.3.2 Network analysis engine
The purpose of the NAE is to detect anomalous traffic at the physical node level of the cloud. This is achieved by modelling normal traffic patterns and identifying anomalies through online/offline monitoring of traffic on the network interfaces of the cloud node with the aid of the DCE. The NAE provides reference implementations of different anomaly detection techniques for offline and online analysis.
3.3.3 System analysis engine
The System Analysis Engine (SAE) is designed to detect anomalies through the observation of VM properties. The SAE builds a model of normal VM operation and detects deviations from it using a selected anomaly detection technique.
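The idea of learning a model of normal operation and flagging deviations from it can be sketched with a very simple technique. The z-score detector below is only one possible choice, picked here for brevity; the class name `ZScoreDetector` and the threshold of three standard deviations are assumptions and do not describe the technique selected for the SAE.

```python
from statistics import mean, pstdev
from typing import Sequence


class ZScoreDetector:
    """Learns the mean and standard deviation of one metric during normal
    operation and flags values that deviate by more than `threshold`
    standard deviations."""

    def __init__(self, threshold: float = 3.0) -> None:
        self.threshold = threshold
        self.mu = 0.0
        self.sigma = 0.0

    def fit(self, normal_samples: Sequence[float]) -> None:
        """Build the model from samples observed during normal operation."""
        self.mu = mean(normal_samples)
        self.sigma = pstdev(normal_samples)

    def is_anomalous(self, value: float) -> bool:
        if self.sigma == 0.0:
            # Degenerate model: any change from the constant baseline is flagged.
            return value != self.mu
        return abs(value - self.mu) / self.sigma > self.threshold
```

The boolean result corresponds to the simplest kind of output an analysis engine could hand on, namely an indication of whether a VM is behaving anomalously.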
3.3.4 Resilience policies (RP)
Policies are rules which govern the choices in behaviour of the cloud system. These policies follow the Event-Condition-Action (ECA) paradigm and take the form of certain re-configurations and actions based on the output of both the SAE and the NAE. The output depends on the detection algorithm in use by the components, but is ultimately an indication of the current health of the VMs and of the host on which each VM resides. The output could be as simple as an alert or a boolean indicating whether the VM or host is behaving anomalously or not. Once this alert is received, initial resilience policies (actions such as sandboxing the VM or lowering the bandwidth) are instantiated, while in parallel Fine Grain Analysis (FGA) performs a more rigorous analysis to classify and localize the anomalies, hence narrowing down the scope of remediation for more fine-grained actions.
3.4 Policy Engine
In order to control resilience behaviour and to manage cloud environments, a policy-based approach has been chosen. Policies are enforced by IND2UCE, a framework originally developed for the enforcement of usage control policies. IND2UCE supports context-aware policies and freely configurable monitoring and enforcement components.
3.4.1 IND2UCE Framework
The policy enforcement framework IND2UCE consists of several components and is illustrated in Figure 10. This is a conceptual view on the infrastructure, as the components do not necessarily need to be on one system and not all components need to be instantiated at all. In the following a short description of the IND2UCE components is given, followed by an overview of the generic interaction behavior of the framework. Finally the components are described in detail.
Figure 10: IND2UCE Framework
• Policy Administration Point, PAP
The PAP is a user interface for the specification of policies and for interacting with the Policy Management Point for managing existing and new policies.
• Policy Management Point, PMP
The PMP is the main management point in the framework, responsible for storing policies in the PRP as well as for deploying/revoking policies in PDPs. Additionally the PMP provides capabilities of registration and query of other framework components for providing a dynamic enforcement infrastructure.
• Policy Retrieval Point, PRP
The PRP is a secure storage for specified policies and can be queried by the PDP for a policy deployment request.
• Policy Enforcement Point, PEP
The technology dependent PEP intercepts system events and provides a common event representation to the PDP for the policy evaluation. Finally the decision drawn by the PDP with respect to installed policies is enforced.
• Policy Decision Point, PDP
The PDP is the reasoning component in the policy enforcement framework. Policies installed by the PMP are instantiated and evaluated upon event notification by PEPs. Depending on the policies, additional information may be required from Policy Information Points and additional actions may be executed using Policy Execution Points.
• Policy Information Point, PIP
The PIP is an independent component in the policy enforcement framework responsible for resolving additional attribute values. Among others, this includes for instance data flow information between several layers of abstraction and context information, e.g., of a mobile device or the cloud environment.
• Policy Execution Point, PXP
The PXP is a component for executing concrete actions such as notifying a user, writing a log entry or sending a mail. The execution is triggered by the PDP based on installed policies.
3.4.2 Data Usage Control Policies
We distinguish between preventive and detective policies. The difference between these two types of policies is the result after applying them and the level of controllability required on the system to enforce the policies. Detective policies report the policy violation, which can result in some compensation or correction of the action. In the broadest sense, the correction can be punishment by a court, which uses detected and reported policy violations for its line of argument. In contrast to this approach are the preventive mechanisms, which ensure that an undesired action will not take place. The following example clarifies the difference: patient data is stored in central storage which has authentication, authorization and accounting mechanisms implemented. If an unauthorized doctor attempts to access the data and this action is blocked by the system, then preventive enforcement has been applied. If the doctor can access the file and the violation is reported to the responsible person, then detective enforcement has been applied.
In general, preventive approaches require more control and possibly imply bigger performance issues, because every action needs to be stopped, transformed by the PEP and checked by the PDP. This results in some delaying overhead. The detective approach requires less control and imposes less performance overhead. However, it limits the policy to compensation or penalties in case of an observed violation. Finally, in some cases preventive policies cannot be enforced due to lack of support in the target systems and the best resort is to observe and react in case violations happen.
We distinguish four enforcement mechanisms that are explained with a short example. For simplicity, we will assume a message that is checked by a firewall-like component enforcing the policy “no personal data is allowed to leave the system”.
• Enforcement by inhibition: The message is blocked or dropped.
• Enforcement by modification: All personal data related message fields are modified (e.g., censoring by inserting blanks).
• Enforcement by delay: The message is delayed until a compensation action, for instance, has taken place.
• Enforcement by execution: An action such as sending a notification message to the system administrator or deleting the data source.
Enforcement by execution is the only possibility for detective enforcement, while inhibition, modification, and delay are exclusively types of enforcement performed by preventive mechanisms. A fifth possible enforcement mechanism is to simply allow the message. This mechanism is usually the standard case. This is similar to the definition of firewall rules with a white-listing approach. In this case, the enforcement “allow” is used to permit the message exchange.
IND2UCE usage control policies are specified in a concrete XML syntax following an XSD schema. In general, a policy consists of one or more enforcement mechanisms. These enforcement mechanisms refer to the introduced preventive and detective mechanisms and describe what may (not) happen to the affected data item. They are specified in the form of Event-Condition-Action (ECA) rules: if a system event E is detected and allowing its execution satisfies condition C, then action A should be performed. An example of such a policy is given in Listing 1.
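The firewall-like example can be made concrete with a small Python sketch. The message representation, the set `PERSONAL_FIELDS`, the action names and the function `enforce` are hypothetical and serve only to show how the enforcement mechanisms differ in their effect on a message.

```python
from typing import Dict, List, Optional, Tuple

Message = Dict[str, str]
PERSONAL_FIELDS = {"name", "address"}  # hypothetical personal-data fields


def enforce(message: Message, mechanism: str) -> Tuple[Optional[Message], List[str]]:
    """Return (message to forward now, or None; additional actions to run).

    mechanism is one of "inhibition", "modification", "delay", "execution"
    or "allow". Messages without personal data always pass unchanged.
    """
    has_personal_data = any(field in PERSONAL_FIELDS for field in message)
    if mechanism == "allow" or not has_personal_data:
        return dict(message), []
    if mechanism == "inhibition":
        return None, []  # the message is dropped
    if mechanism == "modification":
        # Censor personal fields by blanking their values.
        censored = {k: ("" if k in PERSONAL_FIELDS else v) for k, v in message.items()}
        return censored, []
    if mechanism == "delay":
        # The message is held back until a compensation action has taken place.
        return None, ["queue_until_compensated"]
    if mechanism == "execution":
        # Detective enforcement: the message passes, the violation is reported.
        return dict(message), ["notify_administrator"]
    raise ValueError(f"unknown mechanism: {mechanism}")
```

Only the `"execution"` branch lets the violating message through unchanged, which reflects its detective character; the other three branches intervene before the message leaves the system.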
<policy name="acp2">
  <preventiveMechanism name="ACPfire">
    <description>Enocean DoorLocking Trigger</description>
    <timestep amount="3" unit="SECONDS" />
    <trigger action="SmartIntegoMessage" isTry="false">
      <paramMatch name="msgType" value="14" />
    </trigger>
    <condition>
      <within amount="60" unit="SECONDS">
        <eventMatch action="EnoceanTelegram" isTry="false">
          <paramMatch name="ID" value="180080e" />
        </eventMatch>
      </within>
    </condition>
    <authorizationAction name="default">
      <allow>
        <executeAction name="SmartIntegoAction">
          <parameter name="action" value="ShortTermActivation"/>
        </executeAction>
      </allow>
    </authorizationAction>
  </preventiveMechanism>
</policy>
Each mechanism consists of three main parts following the mentioned ECA principle. Additionally it has an optional description and a timestep attribute, which is required for temporal evaluations.
The event, i.e. the trigger event, is a concrete action executed in the system under observation. Once the trigger event is intercepted by a policy enforcement point, it is forwarded to the PDP and the specified condition is evaluated.
Differentiating between preventive and detective mechanisms, the preventive ones additionally contain an authorization action part, describing the decision which has to be performed, if the condition evaluates to true. This decision may be an allowing of the intercepted event, potentially with one or more modifications of event parameters, a delaying of the event execution or a complete inhibition of the event. Hence, for preventive mechanisms the specification of a trigger event, i.e., the action that potentially has to be inhibited, is mandatory, whereas for detective mechanisms it is optional, because here only compensating, i.e., additional execute actions may be executed after occurrence of a specific situation. Such a situation can be the interception of a system event and/or specified as condition. Detective mechanisms can be activated by an event or a time trigger; preventive mechanisms can only be activated by an event trigger.
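To make the ECA structure of such a mechanism tangible, the following Python sketch evaluates a trimmed copy of the policy in Listing 1 against a stream of events. It is not the IND2UCE implementation: the interpretation of `within` as "a matching event occurred in the last N seconds", the event tuple format and the function names are assumptions, and the `isTry`, `timestep` and `description` parts are ignored.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Optional, Tuple

POLICY_XML = """
<policy name="acp2">
  <preventiveMechanism name="ACPfire">
    <trigger action="SmartIntegoMessage" isTry="false">
      <paramMatch name="msgType" value="14" />
    </trigger>
    <condition>
      <within amount="60" unit="SECONDS">
        <eventMatch action="EnoceanTelegram" isTry="false">
          <paramMatch name="ID" value="180080e" />
        </eventMatch>
      </within>
    </condition>
    <authorizationAction name="default">
      <allow>
        <executeAction name="SmartIntegoAction">
          <parameter name="action" value="ShortTermActivation"/>
        </executeAction>
      </allow>
    </authorizationAction>
  </preventiveMechanism>
</policy>
"""

Event = Tuple[float, str, Dict[str, str]]  # (timestamp in seconds, action, parameters)


def matches(element: ET.Element, event: Event) -> bool:
    """True if the event has the element's action and all listed parameter values."""
    _, action, params = event
    return action == element.get("action") and all(
        params.get(p.get("name")) == p.get("value")
        for p in element.findall("paramMatch"))


def decide(policy_xml: str, event: Event, history: List[Event]) -> Optional[List[str]]:
    """Return the actions to execute when the mechanism allows the event.

    None is returned when the mechanism does not apply, i.e. the trigger does
    not match or the condition does not hold.
    """
    mechanism = ET.fromstring(policy_xml).find("preventiveMechanism")
    if not matches(mechanism.find("trigger"), event):
        return None
    within = mechanism.find("condition/within")
    window = float(within.get("amount"))  # the unit is assumed to be seconds
    pattern = within.find("eventMatch")
    now = event[0]
    if not any(0 <= now - e[0] <= window and matches(pattern, e) for e in history):
        return None
    allow = mechanism.find("authorizationAction/allow")
    return [a.get("name") for a in allow.findall("executeAction")]
```

Under these assumptions, a `SmartIntegoMessage` with `msgType` 14 is allowed, and `SmartIntegoAction` is executed, only if the `EnoceanTelegram` with the given ID was observed within the preceding 60 seconds.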
3.4.3 Policy Engine in CRMF
The IND2UCE policy engine is based on the ECA principle. In the CRMF context, events from the anomaly detection (AD) or from the fine grain analysis (FGA) are used as trigger events for the policy engine, which can perform remediation actions. Depending on the deployed policy mechanisms, remediation actions can be performed by PXPs in the system, such as migrating virtual machines to a dedicated host in the cloud environment (sandboxing) or starting another instance of the virtual machine to compensate for overload.
4
CRMF component detail
Based on the presented CRMF design and the D2R2+DR resilience strategy described in Section 1, we now illustrate how a cloud infrastructure could be enhanced to cope with challenges. The overall strategy we adopt to mitigate challenges is depicted in Figure 11. A more detailed description of CRMF components follows.
4.1 Deployment function
As described in §3, the deployment function component can be logically separated into sub-components (cf. Figure 12), each with a specific task. We describe these sub-components in the following.
4.1.1 SLA parser
A first sub-component is a parser for the input coming from the cloud user or the service operator. It needs to analyse the service description and extract the relevant information for further processing by the other sub-components. Ideally, the input parameters are formatted in a way that allows for an automatic analysis and translation into input for the virtual infrastructure manager and orchestrator. In [BSRS13], a YAML-based format was proposed, but no full specification of such a format exists yet. Therefore, we will in the following describe the general parts of such an input file and outline which parameters the deployment function should be able to handle and translate into a resource request and VM deployment. It should be noted that it will not be possible to implement all of these in the scope of this project (also due to the application-dependency of an implementation). Nevertheless, the list should outline some basic input classes and parameters the deployment function should ideally be able to translate into a deployment template, cf. Figure 13.
[Figure 11 diagram. Elements shown: SLA; Security & resilience requirements (S&RR); Resilience target mapping (RTM); Data usage policies (DUP); Resilience policies (RP); Deployment function/orchestration; Heat API and Heat REST API; Ceilometer API; OpenStack APIs; message bus (RabbitMQ); VMs, VNs, VSs; Data collection engine (DCE); Aggregation; Anomaly detection (AD); Fine grain analysis (FGA); Policy engine. Flows shown: monitoring instructions, pcap, feature vector time-series, events, actions, reconfigurations/actions; the CRMF scope is marked]
Figure 11: CRMF system architecture
Figure 12: Internal structure of the deployment function
Figure 13: Parameter Classes
Non-functional parameters
• This includes geo-location constraints, e.g., due to legal reasons for the storage of personal data. Storage nodes can thus only be placed in zones that have the appropriate location characteristic.
• This also includes the placement of virtualized functions and components under the supervision of anomaly detection (and possibly the Transparency Enhancement Framework, for which the need was identified in D5.1 [BSH+13] and which is developed in WP5 and will be described in the future D5.3). This can be generalised to other security and management features the infrastructure might offer. While these components are functionality offered by the cloud infrastructure, they are not functional in the scope of the deployed services.
Resilience parameters
• The protection model describes which resources and which configuration should be used to increase resilience. Examples are 1:1 or N:M protection with dedicated backup instances not normally used for production traffic, or active-active protection where all instances handle user load but can cope with the failure of instances by redirecting their load to the remaining instances.
• Related to the protection model is the required availability of the resilient components. This requires placing the protection instances in different availability zones in order to cope with failure of parts of the infrastructure. Depending on the required level of availability, this requires placing instances at a certain (topological) distance from each other, in contrast to the next class of requirements.
• Particularly important here is the description of the synchronisation requirements, i.e., the (networking) resources necessary to interconnect and synchronise the instances belonging to a protection scheme. This mainly means the maximum acceptable delay and minimum acceptable bandwidth for a connection between these instances. Since these connections are vital in keeping fail-over times short and loss of data low, the placement of these instances with respect to each other and the available network resources becomes of high importance.
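The availability and synchronisation requirements above can be read as a check on one candidate assignment of protection instances to availability zones. The following is a minimal sketch under stated assumptions, not part of the CRMF: the names SyncRequirement, Link and satisfies_resilience are hypothetical, zones are identified by plain strings, and "different availability zones" is interpreted strictly as one instance per zone.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class SyncRequirement:
    """Synchronisation requirement between instances of one protection scheme."""
    max_delay_ms: float        # maximum acceptable delay
    min_bandwidth_mbps: float  # minimum acceptable bandwidth


@dataclass(frozen=True)
class Link:
    """Measured or configured characteristics of the path between two zones."""
    delay_ms: float
    bandwidth_mbps: float


def satisfies_resilience(zones, links, sync):
    """Check one candidate assignment of protection instances to zones.

    zones: list with the availability zone chosen for each instance
    links: dict mapping frozenset({zone_a, zone_b}) to a Link
    sync:  SyncRequirement that every pair of instances must meet
    """
    # Availability (assumption): no two protection instances share a zone.
    if len(set(zones)) != len(zones):
        return False
    # Synchronisation: every pair needs a known link that is fast and wide enough.
    for a, b in combinations(zones, 2):
        link = links.get(frozenset((a, b)))
        if link is None:
            return False
        if link.delay_ms > sync.max_delay_ms:
            return False
        if link.bandwidth_mbps < sync.min_bandwidth_mbps:
            return False
    return True
```

The two conditions pull in opposite directions, as noted in the text: the zone check pushes instances apart, while the delay bound pulls them together.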
Scaling parameters
• Basic parameters for scaling include the thresholds denoting when to scale out or in (e.g., based on CPU load, memory usage and network utilisation of virtual machines), and minimum and maximum scaling group sizes.
• More advanced parameters are the relation of a scaling group with other instances, i.e., having to scale out or in when another group of instances scales out or in.
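As an illustration of the scaling parameters, the sketch below evaluates a threshold-based policy for one group and derives the size of a dependent group. It is a simplified example under assumptions: a single utilisation value between 0 and 1 stands in for CPU load, memory usage or network utilisation, groups change by one instance per evaluation, and the names ScalingPolicy, target_size and dependent_size as well as the fixed ratio between groups are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalingPolicy:
    scale_out_above: float  # utilisation above which one instance is added
    scale_in_below: float   # utilisation below which one instance is removed
    min_size: int           # minimum scaling group size
    max_size: int           # maximum scaling group size


def clamp(policy, size):
    """Keep a group size within the configured minimum and maximum."""
    return max(policy.min_size, min(policy.max_size, size))


def target_size(policy, current_size, utilisation):
    """Return the desired group size after one evaluation step."""
    size = current_size
    if utilisation > policy.scale_out_above:
        size += 1
    elif utilisation < policy.scale_in_below:
        size -= 1
    return clamp(policy, size)


def dependent_size(policy, leader_size, ratio):
    """Size a group that must follow another group (assumed fixed ratio).

    One dependent instance is requested per `ratio` leader instances,
    rounded up, then clamped to the dependent group's own limits.
    """
    needed = -(-leader_size // ratio)  # integer ceiling division
    return clamp(policy, needed)
```

For example, with thresholds 0.8 and 0.2 and a group size between 2 and 5, a group of 3 instances at 90% utilisation grows to 4, while a group already at 5 stays at 5.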
Virtualized component parameters
• This includes the resources necessary for the instances themselves, i.e., number of CPU cores, memory, interfaces etc. This is typically mapped to the flavours of the virtual machines to be started, as supplied by the CIMS.
• The connectivity describes the ‘regular’ network resources necessary for the deployed service, both between different parts and components of the service (excluding protection schemes) as well as connectivity to the outside world, as seen from the cloud infrastructure.
• This group also includes the necessary configuration scripts that need to be executed in order to set up and interconnect machines at the application level for their role, e.g., whether a server is configured as the active or a passive backup server.
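The mapping of instance resources to flavours mentioned above could look as follows. This is only a sketch: the Flavour type, the select_flavour function and the choice of "smallest fitting flavour" are assumptions made for illustration, and a real CIMS supplies its own flavour catalogue with further attributes (disk, interfaces).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flavour:
    """Simplified flavour entry; real catalogues carry more attributes."""
    name: str
    vcpus: int
    memory_mb: int


def select_flavour(flavours, vcpus, memory_mb):
    """Return the smallest flavour that meets the request, or None."""
    fitting = [
        f for f in flavours
        if f.vcpus >= vcpus and f.memory_mb >= memory_mb
    ]
    if not fitting:
        return None
    # "Smallest" is taken here as fewest vCPUs, then least memory.
    return min(fitting, key=lambda f: (f.vcpus, f.memory_mb))
```

A request that no flavour can satisfy returns None, so the deployment function can reject the input file instead of silently under-provisioning.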
4.1.2 Placement function
The main sub-component is a function that maps the specific requested resources and, if applicable, their resilience pattern to a placement of instances in the cloud infrastructure. In particular, it takes the number of redundant or otherwise connected instances, as well as their interconnection requirements (in particular delay and bandwidth), as an input from the service operator via the SLA parser. A second input is a description of the cloud infrastructure with information about the availability of its components, the physical network characteristics, as well as geo-location and status of supervision by anomaly detection. This infrastructure description does not need to be made public or even visible to the service operator or cloud user, but should be disclosed only in case of a dispute. It is based on the measured and/or configured performance data of the cloud infrastructure provider, or could use input from services like ALTO [alt].
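To make the two inputs more concrete, the sketch below models a simplified request and infrastructure description and filters the zones that may host an instance. All type and field names are hypothetical, network characteristics are left out for brevity, and the assumption that each instance needs its own zone is ours rather than a stated property of the placement function.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Zone:
    """One entry of the provider-internal infrastructure description."""
    name: str
    country: str             # geo-location of the zone
    availability: float      # measured or configured, e.g. 0.999
    anomaly_detection: bool  # supervised by anomaly detection?


@dataclass(frozen=True)
class PlacementRequest:
    """Requirements handed over by the SLA parser."""
    instances: int
    allowed_countries: frozenset
    min_availability: float
    require_anomaly_detection: bool


def eligible_zones(request, zones):
    """Return the zones that may host an instance of this request."""
    return [
        z for z in zones
        if z.country in request.allowed_countries
        and z.availability >= request.min_availability
        and (z.anomaly_detection or not request.require_anomaly_detection)
    ]


def is_placeable(request, zones):
    """True if every instance can get its own eligible zone (assumption)."""
    return len(eligible_zones(request, zones)) >= request.instances
```

Because the filter runs on the provider side, only the outcome needs to be reported to the service operator, and the infrastructure description itself can remain undisclosed.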
With these two inputs, the placement function maps the instances needed to meet the user's resilience requirements to physical resources or resource pools in the infrastructure that provide the required performance. To this end, it identifies combinations