Dynamic Cloud Management
Introduction
Across many years of analysing and designing large infrastructure domains, two issues keep recurring. The first, and the topic of this paper, is dynamically and efficiently handling rapidly changing business scenarios and unforecast business demand. The second is the impact of poor integration and the overall immaturity of the tooling used to support business operations; the approach described in this paper can also support the maturation of that tooling.
The dynamic business issues faced every day include:
1. Implementing rapid business scenario change;
2. Managing support for business demand in real time; and
3. Optimising the environment (for TCO, green targets, business priorities or any other objective).
These issues are faced continuously and are artefacts of the world we live in. A competitor opens a new sales channel and takes the market by storm; something happens in the marketplace (or the economy) with huge and immediate business implications; a new acquisition is announced; or a new compliance regime is required. How fast can you address these scenarios and win back your customers and sales?
This paper focuses on the systems supporting business operations: the ICT infrastructure layer. The goal is to make IT a trusted, efficient and intelligent supplier to the business. The approach works for any modern environment (heterogeneous, multi-site, cloud, fixed and virtualised resources) and is agnostic to size, vendor, technology and topology.
What is Dynamic Cloud Management
Dynamic Cloud Management is the automatic and continuous optimisation of your business environment for agility and efficiency. The key drivers are current business demand and the current business scenario. Both are critical to the survival of the business, and the ability to respond to them quickly and efficiently is becoming paramount. If a competitor opens a new sales channel, how fast can you address or replicate it? If an economic tsunami hits, how fast can you respond?
Why is this Important
This approach is about delivering to business expectations. It brings the business to the IT strategy table and can help bring IT to the business strategy table. It supports making IT a supplier that meets and satisfies business demand, if there is an appetite for this. It brings all players into real-time, dynamic business service management and puts demand management into the hands of the business.
One example of an operational management issue is the sensitivity of a business process, resource or other element to subtle changes in a seemingly unrelated area. This is the butterfly effect: a major business revenue stream can be rendered unserviceable for a period by the operation of a minor process in a different part of the business, because both share a common service.
No business runs a single process; it runs many. These processes are typically partially dependent on other processes and on shared services. As the number of processes increases, the complexity grows exponentially, producing a web of relationships that nobody fully understands. The direct implication is that when one element changes (increased volumes, a changed process or an upgrade), nobody can state categorically what the cross-impacts will be. Examples of unexpected cross-impacts abound, and some have contributed to the demise of corporations.
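To make the shared-service problem concrete, here is a minimal Python sketch, assuming dependencies can be captured as a simple map from each element to the elements it depends on. The process and service names (`online_sales`, `month_end_batch`, `customer_db` and so on) and the helpers `reachable` and `invert` are hypothetical illustrations, not part of the approach as described. Walking the inverted map shows everything a change could reach; intersecting two processes' dependencies exposes the common service behind a butterfly effect.

```python
from collections import deque

# Hypothetical dependency map: element -> elements it depends on.
DEPENDS_ON = {
    "online_sales": ["payments_api", "customer_db"],
    "month_end_batch": ["customer_db", "batch_scheduler"],
    "payments_api": ["customer_db"],
    "customer_db": [],
    "batch_scheduler": [],
}


def reachable(start, edges):
    """Breadth-first walk: every node reachable from `start` (excluding `start` itself)."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, ()):
            if nxt not in seen and nxt != start:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def invert(edges):
    """Turn 'depends on' edges into 'is used by' edges."""
    used_by = {}
    for element, deps in edges.items():
        for dep in deps:
            used_by.setdefault(dep, set()).add(element)
    return used_by


# Everything that could feel a change to the shared database.
print(sorted(reachable("customer_db", invert(DEPENDS_ON))))
# ['month_end_batch', 'online_sales', 'payments_api']

# Services shared by a revenue stream and a minor batch process.
print(reachable("online_sales", DEPENDS_ON) & reachable("month_end_batch", DEPENDS_ON))
# {'customer_db'}
```

In a real enterprise the map would be far larger and would also carry tiering and synchronicity information, but the same traversal answers the question "what else could this change touch?".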
What else can this approach do? It supports strategic scenario analysis at the enterprise and domain level: merger, demerger, acquisition, sale, upgrade and change. It also provides instantaneous operational views (business, service, application and infrastructure). What is your need?
How Does This Work
This approach has three major components. They work together and can detect environment changes and effect the changes needed to meet requirements within minutes (real time in business terms). The major components are summarised as follows:
1. Real-time Capture Analysis:
This tracks multiple data sources in real time, including actual current customer demand and responsiveness on all defined channels. It also tests the profiles and searches for aberrant patterns and profile breaks in real time.
2. Dynamic Modelling of the Enterprise:
This uses the captured profiles to analyse the operational environment for near-term constraints, both resource and operational. The modelling horizon is deliberately short: for example, typically a maximum of 10 to 15 minutes from the current time.
3. Environment Change:
This component analyses the resource constraints together with business scenario and business prioritisation information to determine what service environment changes are needed. If a constraint (within the defined business scenario and prioritisation context) is about to be realised, alerts are sent advising of it. If resources are required, they are reallocated and/or acquired to change the environment.
Supporting the major components are several technical pieces of work that need to be accomplished. These are categorised in the following points:
• The support for services that incorporate multi-tenancy.
• Analysis of the business environment, encompassing dependencies, tiering and synchronicity.
• The ability to rapidly deploy changes into the environment.
Viewed technically, this approach dynamically provides true on-demand, pay-per-use computing across the software (SaaS), platform (PaaS) and infrastructure (IaaS) layers.
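The interaction of the three components can be pictured as one iteration of a control loop. The Python sketch below assumes each component can be expressed as a plain function; `control_cycle` and the stand-in components in the usage lines (including the 100-unit capacity) are hypothetical, and a real deployment would plug in the actual capture feeds, enterprise model and provisioning or alerting tools.

```python
from typing import Callable, Dict, List

Demand = Dict[str, int]        # business process -> observed demand this interval
Constraints = Dict[str, int]   # business process -> projected shortfall (capacity units)


def control_cycle(
    capture: Callable[[], Demand],
    model: Callable[[Demand], Constraints],
    change: Callable[[Constraints], List[str]],
) -> List[str]:
    """Run one iteration: capture -> model -> change."""
    observed = capture()           # 1. real-time capture and analysis
    constraints = model(observed)  # 2. near-term constraints from the enterprise model
    return change(constraints)     # 3. environment changes made or alerts raised


# Stand-in components for illustration only (100 units of capacity per process assumed).
actions = control_cycle(
    capture=lambda: {"sales": 120, "reporting": 80},
    model=lambda d: {p: v - 100 for p, v in d.items() if v > 100},
    change=lambda c: [f"add {n} units to {p}" for p, n in c.items()],
)
print(actions)  # ['add 20 units to sales']
```

In real-time mode a function like this would be invoked once per iteration period; keeping the three components as separate, replaceable functions mirrors their separation in the approach.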
Approach Characteristics
This approach operates in two modes. The first is as the dynamic, real-time controller of the business operational domain. The second is as an off-line environment analyser in which scenarios can be tested and trialled.
In Real-time Mode:
In real-time mode, business and operational metrics are captured and analysed. These are tested and compiled into profiles for validation and characterisation. Any variance in the profiles, along with next-cycle changes, is used as source reference data for the operational analysis engine. [Close attention is paid to the inputs and to profile development to ensure the system is not affected by aberrant data or false triggers.]
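One way to guard against aberrant data and false triggers is to compare each new observation with a robust statistic of the historical profile and to require several consecutive outliers before declaring a profile break. The sketch below uses the modified z-score (median and median absolute deviation, with the conventional 0.6745 scaling and 3.5 threshold). This is a swapped-in, commonly used technique, not necessarily the method used by the approach; the function names, the sample history and the two-observation rule are assumptions.

```python
import statistics


def robust_z(value, history):
    """Modified z-score of `value` relative to `history` (median / MAD based)."""
    med = statistics.median(history)
    mad = statistics.median(abs(x - med) for x in history)
    if mad == 0:
        # Flat history: treat any deviation as infinitely far (sign ignored).
        return 0.0 if value == med else float("inf")
    return 0.6745 * (value - med) / mad


def profile_break(recent, history, threshold=3.5, consecutive=2):
    """True only if the last `consecutive` observations are all outliers (debounces single spikes)."""
    tail = recent[-consecutive:]
    return len(tail) == consecutive and all(
        abs(robust_z(v, history)) > threshold for v in tail
    )


history = [100, 102, 98, 101, 99, 100, 103, 97]  # hypothetical per-interval demand profile
print(profile_break([100, 140], history))  # False: a single spike is ignored
print(profile_break([140, 150], history))  # True: sustained deviation
```

The median and MAD are used instead of the mean and standard deviation so that a few aberrant samples in the history do not distort the baseline itself.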
These profiles are then used to build the operational model of the enterprise. The model incorporates not only the driving business demand and customer responsiveness information, but also the available services and resources within the infrastructure. Information on relationships and dependencies is included as well, to provide a valid business and operational view of the enterprise or domain. Running the model gives insight into business impact, resource operations, constraints, interactions and achievable operation.
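A very small fragment of such a model can be sketched by combining projected demand per process with how heavily each process uses each shared service. The Python sketch below assumes a linear cost per transaction; the `usage` figures, capacities, service names and function names are hypothetical. It flags services projected to exceed capacity and lists the processes that share them, which is where a butterfly effect would surface.

```python
def service_load(projected, usage):
    """Total load per shared service.

    projected: {process: projected transactions in the next interval}
    usage:     {process: {service: capacity units consumed per transaction}}
    """
    load = {}
    for proc, tx in projected.items():
        for svc, per_tx in usage.get(proc, {}).items():
            load[svc] = load.get(svc, 0) + tx * per_tx
    return load


def near_term_constraints(projected, usage, capacity):
    """{service: (overload, sorted processes sharing it)} for services over capacity."""
    out = {}
    for svc, used in service_load(projected, usage).items():
        if used > capacity[svc]:
            sharers = sorted(p for p, svcs in usage.items() if svc in svcs)
            out[svc] = (used - capacity[svc], sharers)
    return out


usage = {"online_sales": {"db": 2, "web": 1}, "month_end_batch": {"db": 3}}
capacity = {"db": 2500, "web": 1000}
projected = {"online_sales": 900, "month_end_batch": 400}
print(near_term_constraints(projected, usage, capacity))
# {'db': (500, ['month_end_batch', 'online_sales'])}
```

With the batch process idle (0 projected transactions), the database load would be 1,800 units and no constraint would be raised: the sales volume alone fits, but the shared service does not.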
Where issues are detected, candidate remedies can be evaluated within the modelling environment. A remedy may involve cost (such as increasing the number of available servers or changing a licence requirement) and must be evaluated against business rules that codify business priorities, criticality and anything else the business deems appropriate. If a resource constraint is predicted for a low-priority business process that is not to be remedied, the identified business stakeholder can be alerted immediately to the impending action and its effect, and further action can then be taken to change the scenario or priorities.
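Evaluating remedies against codified business rules might look something like the following Python sketch. It assumes each business process has a priority, a flag saying whether automatic remedies are permitted and a named stakeholder, and that remedies have a simple per-unit cost drawn from a budget; the `Rule` structure, the cost model and all names and figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    priority: int      # 1 = most business-critical
    remedy: bool       # may the system remedy this process automatically?
    stakeholder: str   # who is alerted when no remedy is applied


def decide(shortfalls, rules, unit_cost, budget):
    """Apply remedies in priority order while the budget allows; alert stakeholders otherwise."""
    actions = []
    for proc in sorted(shortfalls, key=lambda p: rules[p].priority):
        need, rule = shortfalls[proc], rules[proc]
        cost = need * unit_cost
        if rule.remedy and cost <= budget:
            budget -= cost
            actions.append(("provision", proc, need, cost))
        else:
            actions.append(("alert", rule.stakeholder, proc, need))
    return actions


rules = {
    "online_sales": Rule(1, True, "sales owner"),
    "reporting": Rule(2, True, "finance owner"),
    "archiving": Rule(3, False, "records owner"),
}
for action in decide({"online_sales": 5, "reporting": 3, "archiving": 2}, rules, unit_cost=10, budget=70):
    print(action)
```

Here the sales shortfall is remedied (cost 50), the reporting owner is alerted because the remaining budget is insufficient, and the archiving owner is alerted because that process is not to be remedied automatically.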
Depending on the environment, unused resources could be redeployed to solve complex problems, augment automated testing, be powered down, or be used in whatever other way is appropriate. This redeployment reduces underutilised resources and can significantly decrease cost and corporate risk.
The real-time mode operates iteratively. The iteration period is a function of the dynamism of the business and the size of the customer base. For example, a business that is subject to dynamic, unforecast demand surges and has a small customer base may need a short iteration period (possibly under 10 minutes). However, if the period is set too short, the prediction algorithm may be overly sensitive to short, normal demand surges; any over-reaction to such a surge would be corrected in the next interval. A large customer base has an inbuilt inertia, so demand surges may develop over an hour or two rather than minutes; for larger enterprises the iteration period might, for example, be set to 15 minutes. Choosing the interval requires careful analysis of the company, its operations, its imperatives and its customer base.
In Off-Line Mode:
In off-line mode, the model is run against actual profiles to evaluate various scenarios. Using actual business load is critical to giving any analysis situational reality. Many scenario analysis methods rely on estimates or expectations, which may vary wildly from reality. Using actual business load data largely eliminates ill-informed guesstimates and their outcomes. Examples of the scenario analysis that can be performed with such a model include, but are not limited to:
1. Support business strategy discussions relative to business demand – model what business initiatives will do to other business processes and divisions.
2. Understand the impact (and cost) of changing product mix, channels or demand.
3. Test the impact of project changes on the environment, validating their effectiveness, risk and cost.
4. Analyse the impact of business continuity scenarios without affecting operations.
5. Analyse the technical impact of mergers, acquisitions, divestitures and other corporate changes without involving squads of people.
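A simple what-if replay illustrates why actual load matters: recorded demand is scaled per channel according to the scenario and compared with available capacity. The Python sketch below assumes demand is recorded per channel per interval and that a scenario can be expressed as a percentage of recorded demand; the channel names, figures and the `replay` function are hypothetical.

```python
def replay(profile, scenario_pct, capacity):
    """Replay recorded demand under a what-if scenario.

    profile:      {channel: [recorded demand per interval]} from actual business load
    scenario_pct: {channel: percentage of recorded demand}, e.g. 125 for +25%
    Returns (peak combined demand, number of intervals over capacity).
    """
    intervals = len(next(iter(profile.values())))
    totals = [
        sum(series[i] * scenario_pct.get(ch, 100) // 100 for ch, series in profile.items())
        for i in range(intervals)
    ]
    over = sum(1 for t in totals if t > capacity)
    return max(totals), over


profile = {"web": [400, 600, 800, 500], "branch": [200, 300, 300, 100]}
print(replay(profile, {}, capacity=1100))             # (1100, 0): today's load just fits
print(replay(profile, {"web": 125}, capacity=1100))   # (1300, 1): an acquisition adding 25% web demand does not
```

Because the replay uses the recorded shape of demand rather than an averaged estimate, it shows not only whether capacity is exceeded but in which intervals.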
Examples
Two examples of this approach in dynamic (real-time) operation mode are presented.
The first demonstrates the effectiveness of the basic forecasting algorithm in handling actual business demand for a single business process; this is the first component of the approach. The second gives an example of operational monitoring and reporting from an observer's perspective.
The two figures below illustrate the effectiveness of real-time business demand tracking. The first (Figure 1) shows how closely actual demand can be tracked, including when an unforecast demand is presented (described in the notes below). The second compares the cost impact of two approaches: simple timed resource provisioning (without demand sensing) and real-time demand sensing as described in this approach. With this approach, the cost per service or transaction remains relatively flat.
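The cost effect can be illustrated with simple arithmetic. The Python sketch below assumes a fixed cost per resource unit per interval, a fixed number of transactions each unit can handle, and a timed schedule that allocates the same number of units regardless of demand; all figures and function names are hypothetical, not the data behind the paper's figure.

```python
import math


def sensed_units(demand, per_unit):
    """Units allocated when provisioning follows demand (rounded up to whole units)."""
    return [math.ceil(d / per_unit) for d in demand]


def cost_per_tx(demand, units, unit_cost):
    """Cost per transaction for each interval."""
    return [u * unit_cost / d for d, u in zip(demand, units)]


demand = [1000, 400, 900]   # transactions per interval (hypothetical)
timed = [10, 10, 10]        # fixed, time-based allocation
print(cost_per_tx(demand, timed, unit_cost=5))                                # rises when demand falls
print(cost_per_tx(demand, sensed_units(demand, per_unit=100), unit_cost=5))  # [0.05, 0.05, 0.05]
```

Under timed provisioning the cost per transaction in the quiet interval is 2.5 times that of the busy one, whereas demand-sensed allocation keeps it constant apart from rounding to whole units.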
Business forecasting in real-time.
1. In this example, the forecast (“FNext”) is calculated from the business demand profile (“Forecast”) and the actual transaction demand (“Trans”). No additional buffering or algorithm is included for safety margins or error handling; the example purely demonstrates the effectiveness of the basic approach.
2. The time interval has been set to one hour to simplify visualisation. “FNext” shows the expected business demand for the next interval, based on the previous one. Therefore the 10:00 “FNext” point is derived from the 09:00 “Trans” point.
3. Between 13:00 and 15:00, 200,000 records were removed from the actual transaction record to synthesise a negative business demand event. At 16:00 the business demand was restored, synthesising a positive business demand spike. The forecast recovers quickly from these synthetic demand changes.
Figure 1: Business Demand Prediction Algorithm Operation
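The paper does not give the forecasting formula, so the following is only one plausible reading, stated as an assumption: the next interval's forecast scales the profiled demand for that interval by the ratio of actual to profiled demand in the interval just completed. The Python sketch below implements that hypothetical rule; `fnext` and the sample figures are illustrative and are not the data behind Figure 1.

```python
def fnext(forecast, trans):
    """Hypothetical ratio-scaling forecast.

    forecast[i]: profiled demand for interval i ("Forecast")
    trans[i]:    actual transactions in interval i ("Trans")
    Returns a list where out[i + 1] is the prediction for interval i + 1, made from interval i.
    """
    out = [None]  # no prediction for the first interval
    for i in range(len(trans) - 1):
        ratio = trans[i] / forecast[i]          # how actual demand tracked the profile
        out.append(round(forecast[i + 1] * ratio))
    return out


forecast = [1000, 1200, 1500, 1500]
trans = [1100, 1320, 1200, 1650]   # interval 2 contains a synthetic dip
print(fnext(forecast, trans))      # [None, 1320, 1650, 1200]
```

The dip in the third interval (1,200 against a profile of 1,500) pulls the next prediction down to 1,200 while actual demand has already returned to 1,650: by construction this rule lags actual changes by exactly one interval, consistent with each “FNext” point being derived from the previous “Trans” point.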
Reporting Example
In this example, a single business process is shown in operation. The basic components of the graph are the overall business process capacity, the forecast business demand profile, the actual business demand and the response time. The graph is continuously updated, giving a near real-time view of operational performance. The example shows operation up to 5 pm.
As the business load increases, the provisioning algorithm includes a safety margin in its calculation when determining resource changes. In this example, the peak allocation of resources available to this process is 40 units.
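A provisioning rule with a safety margin and the 40-unit ceiling might be sketched as follows. The 20% margin, the capacity each unit provides and the function name are assumptions for illustration; only the 40-unit peak comes from the example.

```python
MAX_UNITS = 40  # peak allocation available to this process in the example


def units_required(forecast_demand, per_unit_capacity, margin_pct=20, max_units=MAX_UNITS):
    """Units to provision: forecast plus a safety margin, rounded up, clamped to [1, max_units]."""
    # Integer ceiling division of forecast * (1 + margin) by per-unit capacity.
    needed = -(-forecast_demand * (100 + margin_pct) // (100 * per_unit_capacity))
    return max(1, min(needed, max_units))


print(units_required(1000, per_unit_capacity=100))  # 12 (1,000 + 20% = 1,200)
print(units_required(5000, per_unit_capacity=100))  # 40 (capped at the peak allocation)
```

Once the cap is reached, further demand can only be absorbed by the existing units, which is where the saturation described next begins.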
The business process behaves as a simple transactional response system: when business demand exceeds the capacity of the system, saturation occurs and response times blow out.
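The blowout can be illustrated with a textbook queueing model. The Python sketch below assumes an M/M/1 queue (Poisson arrivals, exponential service times, one server), for which the mean response time is 1 / (mu - lambda). This is a swapped-in model for illustration, not necessarily how the process in the example behaves, and the service rate is hypothetical.

```python
def mm1_response_time(arrival_rate, service_rate):
    """Mean response time of an M/M/1 queue, 1 / (mu - lambda); infinite once saturated."""
    if arrival_rate >= service_rate:
        return float("inf")  # demand at or above capacity: the queue grows without bound
    return 1.0 / (service_rate - arrival_rate)


service_rate = 100.0  # transactions per second the system can complete (hypothetical)
for load in (50, 80, 90, 95, 99, 100):
    print(load, mm1_response_time(load, service_rate))
```

Response time doubles between 90% and 95% utilisation and grows fivefold again between 95% and 99%, which is why provisioning ahead of demand matters more than reacting after saturation.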
In this example, the SLA response-time target is set at 30 units, even though the average response time is 12.
Additionally, the SLA defined the anticipated business demand as 30 units, but efforts are made to accommodate excess demand. In this case the business owner would have received a message stating that their business demand had exceeded the agreed level and that, while it remains above that level, service is provided on a best-efforts basis with no enforceable penalties.
One advantage of real-time reporting to the service desk is that a call to the desk can be qualified quickly and analytically. If the service response time rises to 20 units and the SLA is set at 30 units, the response may be slower, but no escalation or analysis is launched, saving money, resources and pain. Similarly, if the customer is angry because of ‘other’ events (crashed the car coming in, upset with the kids, …), the call can again be qualified and appropriate instructions issued. Note: this class of reporting is one by-product of this approach.
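The qualification described above reduces to a few comparisons. The Python sketch below assumes the service desk can see the current response time, the SLA response-time target, current demand and the agreed demand level; the function name and return labels are hypothetical.

```python
def qualify_call(response_time, sla_response, demand, agreed_demand):
    """Classify a service-desk call using live metrics."""
    if demand > agreed_demand:
        # Owner has already been notified; service is best-efforts with no penalties.
        return "best_effort"
    if response_time <= sla_response:
        # Slower than usual, but inside the SLA: no escalation or analysis.
        return "within_sla"
    return "escalate"


print(qualify_call(20, 30, demand=25, agreed_demand=30))  # within_sla
```

Checking demand first reflects the example: once demand exceeds the agreed level, a slow response is expected behaviour rather than an SLA breach.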
These are only two facets of what a mature, optimising CMMI level 4/5 function can deliver; such a function partially integrates at least four of the ITIL domains.