
2.2 Autonomic Computing

2.2.1 MAPE-K reference model

IBM’s vision of Autonomic Computing is structured around a reference model for autonomic control loops (Huebscher and McCann, 2008), known as the MAPE-K (Monitor – Analyse – Plan – Execute – Knowledge) adaptation loop and depicted in Figure 2.2. Stemming from Agent Theory, the MAPE-K model can be seen as an evolution of the generic model for intelligent agents proposed in (Russell et al., 2010). According to this model, agents are equipped with sensors to perceive their environment, and are able to execute corrective actions on the environment based on the observed values. This continuous process of sensing and acting upon sensed values corresponds to the closed adaptation loop of the MAPE-K model. Applying the model to the domain of self-management at the PaaS level, we now consider each of its elements in more detail.

[Figure 2.2 (diagram): an autonomic element consisting of an autonomic manager (Monitor, Analyse, Plan, Execute) connected to a managed element through sensors and effectors.]

Figure 2.2: IBM’s MAPE-K reference model for autonomic control loops (Kephart and Chess, 2003).

Managed element

Managed elements represent any software and hardware resources which are enhanced with autonomic behaviour by coupling them with an autonomic manager. In the context of clouds, the managed element may be a cloud platform as a whole, a web server, a virtual machine, an operating system, etc. In the presented research work, as explained in Chapters 5 and 6, primary managed elements include applications deployed on a CAP, and add-on services offered to users through the CAP marketplace. Managed elements are equipped with sensors and effectors. Sensors, also known as probes or gauges, are software or hardware components responsible for collecting information about the managed elements. They are typically associated with metrics – certain characteristics of the managed element which need to be monitored (e.g., response time, memory utilisation, network bandwidth, etc.). Effectors are hardware or software components whose responsibility is to propagate required adaptation actions to the managed element. Depending on the scale, adaptations can be coarse-grained (e.g., completely substituting a malfunctioning Web service) or fine-grained (e.g., re-configuring that service to fix it).

Autonomic manager and knowledge

The autonomic manager is the core element of the MAPE-K model – essentially, this is where the whole MAPE-K functionality is encapsulated. This software component, configured by human administrators to follow high-level goals, uses monitored data received from sensors and internal (i.e., self-reflective) knowledge of the system to analyse these observations, plan possible adaptation actions if needed, and execute them on the managed element through effectors so as to achieve those goals.

The internal self-reflective knowledge base of the system, shared between the four MAPE components, may include any type of information required to perform the Monitor-Analyse-Plan-Execute (MAPE) activities. In the first instance, it includes an architectural model of the managed element – a formal representation of its internal organisation, subcomponents, and the connections between them. Another important component is a set of diagnosis and adaptation rules – formally defined policies which serve to analyse critical situations and choose a relevant adaptation plan among existing alternatives. Among other things, the knowledge base can also include historical observations, data logs, repositories of previously detected critical situations and applied adaptation solutions, etc. This implies that the knowledge base is not designed to be static (i.e., populated once by administrators at design-time), but rather has to evolve dynamically by accumulating new information at run-time. In connection with this, there is room for applying certain techniques from Machine Learning research, and we describe this possibility in more detail in Section 9.2.
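The interplay between the four MAPE activities and the shared knowledge base can be illustrated with a minimal sketch. It is not taken from the presented research work: the class names, the threshold-based analysis, and the `mitigate:<metric>` action strings are all assumptions made for illustration, and a real autonomic manager would rely on a much richer knowledge base (architectural model, diagnosis and adaptation rules).

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Knowledge:
    """Hypothetical knowledge base: static policies plus accumulated history."""
    thresholds: Dict[str, float]
    history: List[Dict[str, float]] = field(default_factory=list)


class AutonomicManager:
    """Illustrative MAPE loop around a sensor and an effector callable."""

    def __init__(self,
                 sensor: Callable[[], Dict[str, float]],
                 effector: Callable[[str], None],
                 knowledge: Knowledge) -> None:
        self.sensor = sensor
        self.effector = effector
        self.knowledge = knowledge

    def monitor(self) -> Dict[str, float]:
        observation = self.sensor()
        # The knowledge base evolves at run-time by accumulating observations.
        self.knowledge.history.append(observation)
        return observation

    def analyse(self, observation: Dict[str, float]) -> List[str]:
        # Report every metric that exceeds its configured threshold.
        thresholds = self.knowledge.thresholds
        return [metric for metric, value in observation.items()
                if metric in thresholds and value > thresholds[metric]]

    def plan(self, violations: List[str]) -> List[str]:
        # Placeholder planning step: one action per violated metric.
        return [f"mitigate:{metric}" for metric in violations]

    def execute(self, actions: List[str]) -> None:
        for action in actions:
            self.effector(action)

    def run_once(self) -> List[str]:
        """Perform one Monitor-Analyse-Plan-Execute iteration."""
        actions = self.plan(self.analyse(self.monitor()))
        self.execute(actions)
        return actions
```

In use, the sensor and effector would be bound to a concrete managed element; for a quick experiment, a lambda returning fixed readings and a list's `append` method are sufficient stand-ins.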

Monitoring

The monitoring component of the MAPE-K loop is responsible for gathering information about the environment that is relevant to the self-managing behaviour of the system. In a broad sense, monitoring may be defined as the process of collecting and reporting relevant information about the execution and evolution of a computer system, and can be performed by various mechanisms¹ (Kazhamiakin, Benbernou, Baresi, Plebani, Uhlig and Barais, 2010). These monitoring processes typically target the collection of data concerning a specific artefact, the monitored subject (Bratanis et al., 2012). In the context of CAPs, monitored subjects include platform components, deployed applications, service compositions, individual services, etc., and monitored properties can be the number of simultaneous client connections, data storage utilisation, number of tasks in a messaging queue, CPU and memory utilisation, response times to user requests, frequency of I/O operations, etc. Appropriate monitored data helps the autonomic manager to recognise failures or sub-optimal performance of the managed element, and execute appropriate changes to the system. The types of monitored properties, and the sensors used, will often be application-specific, just as the effectors used to execute changes to the managed element are also application-specific (Huebscher and McCann, 2008).

¹ In our research we focus on run-time monitoring. Related activities can also include such techniques as post-mortem log analysis, data mining, and online or offline testing – the interested reader is referred to (Kazhamiakin, Benbernou, Baresi, Plebani, Uhlig and Barais, 2010).

Two types of monitoring are usually identified in the literature:

• Passive monitoring, also known as non-intrusive monitoring, assumes that no changes are made to the managed element. This kind of monitoring is typically targeted at the context of the managed element, rather than the element itself. For example, Linux offers special commands for monitoring the metrics of a software component (e.g., top reports CPU utilisation per process, while vmstat reports system-wide CPU and memory statistics (Huebscher and McCann, 2008)). Linux also provides the /proc directory, which contains runtime system information, such as current CPU and memory utilisation levels for the whole system and for each process individually, information about mounted devices, hardware configuration, etc. (Huebscher and McCann, 2008). Similar passive monitoring tools exist for most operating systems. As we will explain in more detail later, in our use case, to collect information about the database space occupied by a table, we did not measure the size of that table per se, but rather queried a special separate table, whose role in relational databases is to keep live and up-to-date statistical data about the database. This is another example of passive monitoring. The drawback of this approach, however, is that the monitored information is often not enough to reason unambiguously about possible sources of problems in the system.

• Active monitoring assumes designing and implementing software in such a way that it provides an API with entry points for capturing required metrics and collecting monitored data. For example, to monitor the response time of an external web service, an application should provide an API that allows the monitoring component to intercept the requests. This can often be automated to some extent. For instance, aspect-oriented programming techniques allow ‘injecting’ additional functionality into the program source code. Moreover, there are tools for inserting probes into already compiled Java byte code (e.g., ProbeMeister¹), which makes it possible to actively monitor legacy systems. This type of monitoring is also known as intrusive, as it inevitably implies making changes (i.e., intrusions) to the managed element by instrumenting it with probes to facilitate inspection of its characteristics. As with any code instrumentation, it is essential that this is done with care, since the instrumentation can itself affect the subject’s performance, providing a distorted picture of its inherent capabilities.
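The difference between the two types of monitoring can be made concrete with a short sketch. The passive part parses text in the format of the Linux /proc/meminfo file without touching the monitored software; it assumes the `MemTotal` and `MemAvailable` fields are present, which holds on reasonably recent Linux kernels. The active part instruments a function with a timing probe, in the spirit of the aspect-oriented ‘injection’ mentioned above. The function names `parse_meminfo`, `memory_utilisation` and `probe` are hypothetical and chosen for illustration only.

```python
import functools
import time
from typing import Callable, Dict, List


def parse_meminfo(text: str) -> Dict[str, int]:
    """Passive: parse /proc/meminfo-style text into {field: integer value}."""
    values: Dict[str, int] = {}
    for line in text.splitlines():
        name, separator, rest = line.partition(":")
        parts = rest.split()
        # Keep only well-formed "Name:   <integer> [unit]" lines.
        if separator and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def memory_utilisation(meminfo: Dict[str, int]) -> float:
    """Fraction of memory in use, derived from MemTotal and MemAvailable."""
    total = meminfo["MemTotal"]
    return (total - meminfo["MemAvailable"]) / total


def probe(samples: List[float]) -> Callable:
    """Active: decorator that records the duration of every call in `samples`."""
    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Recorded even if the call raises, so failures are timed too.
                samples.append(time.perf_counter() - start)
        return wrapper
    return decorator
```

On a Linux host the passive part could be fed with the contents of /proc/meminfo, e.g. `parse_meminfo(open("/proc/meminfo").read())`. Note that the probe adds a small cost to every call, which is precisely the intrusion effect discussed above.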

It is also important to consider who is responsible for the monitoring process and data collection. Typically, two data collection methods (or a combination of them) are used (Bratanis et al., 2012). In polling mode, the autonomic manager is responsible for querying the managed element and its sensors at regular intervals. By contrast, in push mode it is the managed element’s responsibility to notify the monitoring authority of any significant events or changes. In practice, implementing monitoring as part of the self-management functionality may require a deliberate and flexible combination of several approaches. In the context of CAPs, on the one hand, it can be a condition of deployment that add-on services and user applications must conform to a platform-specific API and expose some sort of ‘hooks’, to which monitoring sensors will be attached, thus implementing an intrusive approach to data collection. This in turn can be seen as a potential security threat, since such hooks imply providing access to potentially sensitive internal application data. This may be unacceptable for particularly sensitive software systems and business data, in which case a less intrusive or non-intrusive approach would be preferred.

Moreover, a malfunctioning application or service cannot be relied upon to act according to its designed specification – that is, it is risky to assume that a report signalling its own failure will be pushed to the monitor by the managed element. In these circumstances, it is important that a PaaS-level autonomic manager implements ‘heartbeat monitoring’ and polls managed elements at regular intervals to check whether they are still alive and active. On the other hand, however, the SOC paradigm suggests that software systems are often combined in novel, emerging and unpredicted ways, which makes it impossible to make the autonomic manager aware of all possible situations in advance. In such circumstances, applications are also required to be able to push event notifications to the autonomic manager, since it may not itself issue the necessary requests.
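A combination of the two data collection modes can be sketched as follows. This is an illustrative assumption rather than the design adopted in the presented work: the class `HybridMonitor` and its methods are hypothetical, each managed element is represented by a `ping` callable, and an element is treated as failed when its ping returns a false value or raises an exception.

```python
from typing import Callable, Dict, List, Tuple


class HybridMonitor:
    """Hypothetical monitor combining heartbeat polling with pushed events."""

    def __init__(self) -> None:
        self._pings: Dict[str, Callable[[], bool]] = {}
        self.events: List[Tuple[str, str]] = []

    def register(self, name: str, ping: Callable[[], bool]) -> None:
        """Register a managed element together with its liveness check."""
        self._pings[name] = ping

    def notify(self, name: str, event: str) -> None:
        """Push mode: called by the managed element itself."""
        self.events.append((name, event))

    def poll(self) -> List[str]:
        """Polling mode: one heartbeat round, returning unresponsive elements."""
        failed = []
        for name, ping in self._pings.items():
            try:
                alive = ping()
            except Exception:
                # A broken element cannot be trusted to report its own failure.
                alive = False
            if not alive:
                failed.append(name)
        return failed
```

In a running system `poll` would be invoked at regular intervals by a scheduler, while `notify` remains available to elements that detect situations the manager did not anticipate.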

¹ ProbeMeister: http://www.objs.com/ProbeMeister/

As self-managing systems grow and the number of sensors increases, monitoring activities may result in a considerable performance overhead. That is, in a system with thousands of probes constantly generating values, the monitoring component may not be able to cope with the overwhelming amount of data. To avoid ‘bottlenecks’, system architects have to distinguish between values which are relevant to self-managing activities and so-called ‘noise’ – data which can be neglected. Another potential solution to this problem is to perform high-level monitoring first and then, once an anomaly is localised, to activate additional monitoring resources (Ehlers et al., 2011). With this approach, computational resources are provisioned to the monitoring component on demand, only when a problem is detected, thus resulting in higher efficiency and lower resource consumption.
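The idea of activating additional monitoring resources on demand can be sketched in a few lines. The sketch below is a simplified illustration and not the approach of Ehlers et al. (2011) itself: the class `AdaptiveMonitor`, the single coarse-grained probe and the fixed threshold are all assumptions.

```python
from typing import Callable, Dict


class AdaptiveMonitor:
    """Hypothetical monitor that runs detailed probes only after an anomaly."""

    def __init__(self,
                 coarse_probe: Callable[[], float],
                 detailed_probes: Dict[str, Callable[[], float]],
                 threshold: float) -> None:
        self.coarse_probe = coarse_probe
        self.detailed_probes = detailed_probes
        self.threshold = threshold

    def sample(self) -> Dict[str, float]:
        """Take one reading; detailed probes run only above the threshold."""
        reading = {"coarse": self.coarse_probe()}
        if reading["coarse"] > self.threshold:
            # Anomaly suspected: pay for fine-grained data only now.
            for name, probe in self.detailed_probes.items():
                reading[name] = probe()
        return reading
```

As long as the coarse-grained value stays below the threshold, none of the detailed probes is invoked, so their cost is incurred only while a problem is being localised.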

Analysis

The analysis component’s main responsibility is to assess the current situation and detect failures in the managed element. In its simplest form, the analysis engine, based on Event-Condition-Action (ECA) rules, simply detects situations when a single monitored value exceeds its threshold (e.g., CPU utilisation reaches 100%), and immediately sends this diagnosis to the planning component. However, problem determination may be a challenging task, especially in a distributed environment where the monitored data comes from multiple remote sources. Based on its internal knowledge, the autonomic manager should decide whether a particular combination of monitored values represents or may lead to a failure.
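The simplest form of analysis described above can be sketched as a list of ECA rules evaluated against each incoming metric value. The names `EcaRule` and `evaluate`, as well as the metric and action identifiers used in the example, are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EcaRule:
    event: str                          # metric whose update triggers the rule
    condition: Callable[[float], bool]  # predicate over the monitored value
    action: str                         # diagnosis passed on to planning


def evaluate(rules: List[EcaRule], metric: str, value: float) -> List[str]:
    """Return the actions of all rules matching the event and its condition."""
    return [rule.action for rule in rules
            if rule.event == metric and rule.condition(value)]
```

For instance, a rule `EcaRule("cpu_utilisation", lambda v: v >= 100.0, "report_cpu_saturation")` fires only when a `cpu_utilisation` reading of 100% arrives. Since each rule inspects a single value in isolation, this scheme cannot express the combinations of values mentioned at the end of the paragraph.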

In connection with this, techniques from the area of Complex Event Processing (CEP) research (Margara and Cugola, 2011) have proved helpful in the context of data analysis and situation assessment. From CEP’s point of view, everything happening in the environment and changing the current state of affairs is an atomic event. Sequences of atomic events build up complex events, which, in turn, may be part of an even more complex event, thus building event hierarchies. For example, when the CPU and memory utilisation levels of several VMs running on the same physical machine reach 100% (i.e., atomic events) within a short period of time, this indicates that the utilisation of the whole physical machine has reached its limit (i.e., the complex event). In Chapters 5, 6 and 7 we will explain in more detail how CEP techniques were utilised in our own research work.
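The physical machine example can be expressed as a small detector that derives a complex event from atomic ones. The sketch is hand-written for illustration and does not use a CEP engine; the names `AtomicEvent` and `host_saturated`, the event kinds `"cpu"` and `"memory"`, and the time window parameter are assumptions.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Set, Tuple


@dataclass(frozen=True)
class AtomicEvent:
    timestamp: float  # seconds
    vm: str           # identifier of the VM that reported the event
    kind: str         # which resource reached 100%, e.g. "cpu" or "memory"


def host_saturated(events: Iterable[AtomicEvent],
                   vms_on_host: Set[str],
                   window: float,
                   kinds: Tuple[str, ...] = ("cpu", "memory")) -> bool:
    """Complex event: every VM on the host reported every kind of
    saturation within some time span of at most `window` seconds."""
    required = {(vm, kind) for vm in vms_on_host for kind in kinds}
    latest: Dict[Tuple[str, str], float] = {}
    for event in sorted(events, key=lambda e: e.timestamp):
        key = (event.vm, event.kind)
        if key not in required:
            continue
        latest[key] = event.timestamp
        # All required events seen, and the oldest one is still in the window.
        if (len(latest) == len(required)
                and event.timestamp - min(latest.values()) <= window):
            return True
    return False
```

The detector keeps only the most recent timestamp per VM and resource, so stale saturation reports fall out of the window naturally; a CEP engine would typically express the same pattern declaratively.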

In document EXCLAIM framework: a monitoring and analysis framework to support self-governance in Cloud Application Platforms (Page 46-51)