Different types of escalation
14.0 Availability Management
14.1 Introduction
Organizations are increasingly dependent on IT services. When these services are unavailable, in most cases the business stops as well. There is also an increasing demand for availability of IT services 24 hours a day, 7 days a week.
It is therefore vital for the IT organization to manage and control the availability of the IT Services. This is done by defining the requirements from the business regarding the availability of the IT services and then matching them with the possibilities of the IT organization.
14.2 Objective
The objective of Availability Management is to get a clear picture of business requirements regarding IT Services availability and then optimize infrastructure capabilities to align with these needs.
Or one can put it this way:
To ensure the highest availability possible of the IT services as required by the business to reach its goals.
14.3 Process Description
Availability Management depends on a lot of inputs to be able to function well. Among the inputs are:
• The business's requirements regarding availability
• Information regarding reliability, maintainability, recoverability and serviceability of the CIs
• Information from the other processes: Incidents, Problems, SLAs and achieved service levels
The outputs of the process are:
• Recommendations regarding the IT infrastructure to ensure its resilience
• Reports about the availability of the services
• Procedures to ensure availability and recovery are dealt with for every new or improved IT service.
Key terminology and actions that form the basis of this process are:
Availability:
The availability and flexibility of components of the infrastructure. This is expressed in the following formula:
A = (ST - DT) / ST x 100, where A = Availability (as a percentage), ST = agreed Service Time and DT = Down Time.
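The formula above can be worked through with a small sketch. The function name and the sample figures (200 agreed service hours, 4 hours of downtime) are illustrative assumptions, not values from the ITIL text.

```python
def availability_percent(agreed_service_time: float, downtime: float) -> float:
    """Return A = (ST - DT) / ST x 100 for one service or component.

    Both arguments must use the same unit (for example hours).
    """
    if agreed_service_time <= 0:
        raise ValueError("agreed service time must be positive")
    if not 0 <= downtime <= agreed_service_time:
        raise ValueError("downtime must lie between 0 and the agreed service time")
    return (agreed_service_time - downtime) / agreed_service_time * 100


# Hypothetical month: 200 agreed service hours, 4 hours of downtime.
print(f"{availability_percent(200, 4):.1f}%")
```

Note that only downtime falling inside the agreed service time counts against availability under this formula.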
Availability is defined as:
“Availability of an IT Service or component to perform its required function at a stated instant or over a stated period” (ITIL Service Delivery Book, OGC, 2001)
Reliability:
The reliability of components of the infrastructure. In this case the Mean Time Between Failures (MTBF) can be used as a measuring tool.
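As a rough illustration of MTBF as a measure, the sketch below divides the time a component was actually up by the number of failures in the observation period. This is one common way to calculate it and is an assumption here; the function name and sample figures are hypothetical.

```python
def mean_time_between_failures(observed_hours: float,
                               downtime_hours: float,
                               failures: int) -> float:
    """Average uptime per failure over an observation period (assumed definition)."""
    if failures <= 0:
        raise ValueError("at least one failure is needed to compute MTBF")
    if not 0 <= downtime_hours <= observed_hours:
        raise ValueError("downtime must lie within the observed period")
    return (observed_hours - downtime_hours) / failures


# Hypothetical 30-day period: 720 hours observed, 6 hours down, 3 failures.
print(mean_time_between_failures(720, 6, 3), "hours between failures on average")
```

A rising MTBF over successive reporting periods suggests that the component is becoming more reliable.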
Reliability is defined as:
“…freedom from operational failure” (ITIL Service Delivery Book, OGC, 2001)
Resilience is a key aspect of reliability.
Resilience is defined as:
“The ability of an IT component to continue to operate even though one or more of its sub components has failed”
Maintainability:
The capability to maintain or restore a service or component of the infrastructure at a certain level, so that the required functionality can be delivered. Some services, or indeed components of the infrastructure, are easier than others to maintain and/or restore to service in the event of a failure. For example, an application that requires daily housekeeping to keep it running, which only a highly qualified Database Administrator can perform, is not easy to maintain. The maintainability of CIs within the infrastructure is an important consideration, as the speed of recovery and the ease of maintenance affect uptime and hence the availability of services. Operational Level Agreements (OLAs) within the Service Level Management process tie in here.
Note: CI = Configuration Item.
Serviceability:
Serviceability refers to the agreements that are held with third parties providing services to the IT organization. These contracts will define how these external parties will perform to ensure the availability of the services they interface with. For example, how will they ensure resilience, how will they maintain the infrastructure they are responsible for.
Underpinning Contracts within Service Level Management tie in here.
Security:
This is divided into confidentiality, integrity and availability (CIA). For security reasons, it can be desirable not to make certain components of the infrastructure available, logically or physically, even though this may reduce availability.
Security is of great concern to most organizations these days and it is important to ensure that IT services are made available to the organization in a secure way. That means that services and information are available to the right people. It is also important to ensure that services are not so secure that this impedes the ability of the organization to use them.
14.4 Activities
The activities within the process can be divided into three main activities, which will be discussed in detail in the remainder of this chapter:
• Planning
• Improving
• Measuring and reporting
Planning
It is important not only to find out the requirements but also to find out if and how the IT organization can meet these requirements. The Service Level Management process maintains contact with the business and will be able to provide the availability expectations to the Availability Management process. The business may have unrealistic expectations with respect to availability without understanding what this means in real terms.
For example, they may want 99.9% availability yet not realize that, for their organization's infrastructure, this will cost five times more than providing 98% availability. It is the responsibility of the Service Level Management and Availability Management processes to manage expectations.
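To show what these percentages mean in real terms, the sketch below converts an availability target into the downtime it permits. It assumes a service agreed for 24 hours a day over a 365-day year; that assumption and the function name are illustrative only.

```python
HOURS_PER_YEAR = 24 * 365  # assumption: service agreed 24x7 over a 365-day year


def allowed_downtime_hours(target_percent: float,
                           agreed_service_hours: float = HOURS_PER_YEAR) -> float:
    """Downtime permitted by an availability target, in hours."""
    if not 0 <= target_percent <= 100:
        raise ValueError("target must be a percentage between 0 and 100")
    return agreed_service_hours * (100 - target_percent) / 100


for target in (98.0, 99.9):
    hours = allowed_downtime_hours(target)
    print(f"{target}% -> {hours:.2f} hours of downtime per year")
```

Under this assumption, 98% allows roughly 175 hours of downtime a year while 99.9% allows under 9 hours, which helps explain why the higher target is so much more expensive to deliver.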
Design
When considering the design of the infrastructure the IT organization can either design for “availability” or “recovery”.
Design for Availability
When the business cannot afford for a particular service (or services) to be down for any length of time, designing the infrastructure for availability should be the approach. In this instance the IT organization will need to build resilience into the infrastructure and ensure that preventative maintenance can be performed to keep services in operation. In many cases building “extra availability” into the infrastructure is an expensive task that must be justified by business need. Designing for availability is considered a pro-active approach to avoiding downtime in IT services.
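One way to reason about built-in resilience is to compare a single component with a redundant pair. The sketch below uses standard availability arithmetic, which is not taken from the ITIL text, and assumes that components fail independently of each other. The function names and the 98% figure are hypothetical.

```python
from typing import Iterable


def serial_availability(availabilities: Iterable[float]) -> float:
    """All components are needed: the service is up only when every one is up."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def parallel_availability(availabilities: Iterable[float]) -> float:
    """Redundant components: the service is down only when every one is down."""
    all_down = 1.0
    for a in availabilities:
        all_down *= (1.0 - a)
    return 1.0 - all_down


single = 0.98  # hypothetical availability of one server, as a fraction
print(f"one server:      {single:.4f}")
print(f"redundant pair:  {parallel_availability([single, single]):.4f}")
print(f"two in a chain:  {serial_availability([single, single]):.4f}")
```

The independence assumption rarely holds perfectly in practice (shared power, shared network), so figures like these are best read as an upper bound when justifying the cost of extra resilience.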
Design for Recovery
When the business can tolerate some downtime of services, or the cost of building additional resilience into the infrastructure cannot be justified, then designing for recovery is the appropriate approach. In this approach the infrastructure will be designed such that, in the event of a service failure, recovery will be as fast as possible. Spare parts, for example, will assist in the speedy recovery of infrastructure components that fail.
Designing for recovery can be seen as more reactive management of availability.
The processes (like Incident Management) need to be in place to recover as soon as possible in case of a service interruption.
Other Considerations
• Security issues
Define the security areas and the impact they might have on the availability of services. Make sure it is clear who has access to what and where.
• Maintenance management
Agree on a maintenance window, known to the customers, in which the IT organization can carry out maintenance and repairs. This reduces the impact of maintenance and repairs on the IT service.
Improving
Developing the Availability Plan
The Availability Plan will look into the future (for example 12 months) and document what measures will be put in place to ensure that the infrastructure and IT services will be available to meet business requirements.
Input from monitoring and other processes, such as Service Level Management, will provide the basis for decisions on what “availability” measures will be put in place. All plans must be cost justifiable and aligned with business needs.
Measuring and reporting
This involves reporting about the availability of each service, the down times and recovery times. These reports will often go to the Service Level Management process to use in reporting comparisons (planned versus actual) on service levels back to the customer.
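A planned-versus-actual comparison of this kind can be sketched as follows. The record layout, the service names and the figures are hypothetical. The calculation is the availability formula given earlier in this chapter.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ServiceRecord:
    name: str
    target_percent: float        # planned availability (from the SLA)
    agreed_service_hours: float  # ST for the reporting period
    downtime_hours: float        # DT for the reporting period


def availability_report(records: List[ServiceRecord]) -> List[Tuple[str, float, float, bool]]:
    """Return (service, planned %, actual %, target met?) for each record."""
    rows = []
    for r in records:
        actual = (r.agreed_service_hours - r.downtime_hours) / r.agreed_service_hours * 100
        rows.append((r.name, r.target_percent, round(actual, 2), actual >= r.target_percent))
    return rows


# Hypothetical reporting period with two services.
sample = [
    ServiceRecord("email", 99.0, 720, 3),
    ServiceRecord("payroll", 99.5, 200, 4),
]
for name, planned, actual, met in availability_report(sample):
    status = "met" if met else "MISSED"
    print(f"{name:<10} planned {planned:>6.2f}%  actual {actual:>6.2f}%  {status}")
```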
It is also important to measure and report on the perception of the customers on the availability of the IT service.
There are many ways to identify (un)availability and potential problems. The following are a few mentioned by the OGC:
CFIA
Component Failure Impact Assessment can be used to predict and evaluate the impact on IT Service arising from component failures within the IT infrastructure.
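The idea behind CFIA can be sketched as a simple lookup from components to the services they would bring down. This is a minimal illustration, not the full OGC technique. The data layout, the component names and the service names are all hypothetical.

```python
from typing import Dict, List

# Hypothetical dependency map: for each service, a list of "roles" it needs.
# Each role lists the components able to fulfil it (more than one = redundancy).
SERVICE_DEPENDENCIES: Dict[str, List[List[str]]] = {
    "email": [["mail-server-1", "mail-server-2"], ["core-switch"]],
    "payroll": [["db-server"], ["core-switch"]],
}


def failure_impact(dependencies: Dict[str, List[List[str]]]) -> Dict[str, List[str]]:
    """Map each component to the services that stop if that component alone fails."""
    impact: Dict[str, List[str]] = {}
    for service, roles in dependencies.items():
        for alternatives in roles:
            for component in alternatives:
                impact.setdefault(component, [])
            if len(alternatives) == 1:
                # No alternative exists: this component is a single point of failure.
                impact[alternatives[0]].append(service)
    return {component: sorted(services) for component, services in impact.items()}


for component, services in sorted(failure_impact(SERVICE_DEPENDENCIES).items()):
    affected = ", ".join(services) if services else "none (alternative available)"
    print(f"{component:<15} -> {affected}")
```

Components that affect several services with no alternative are the natural candidates for added resilience.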
FTA
Fault Tree Analysis is a technique that can be used to determine the chain of events that causes a disruption to IT services.
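A fault tree is commonly drawn with AND and OR gates that combine basic events into a top-level disruption. The sketch below evaluates such a tree for a given set of failed events. The tree structure, the event names and the nested-tuple representation are assumptions made for illustration.

```python
from typing import Set, Tuple, Union

# A node is either a basic event (a string) or a gate: (gate type, list of child nodes).
Node = Union[str, Tuple[str, list]]

# Hypothetical tree: the service is disrupted if the network is down,
# OR if both the primary and the standby server fail.
SERVICE_DISRUPTED: Node = (
    "OR",
    [
        "network down",
        ("AND", ["primary server fails", "standby server fails"]),
    ],
)


def occurs(node: Node, failed_events: Set[str]) -> bool:
    """Return True if the event represented by this node occurs."""
    if isinstance(node, str):
        return node in failed_events
    gate, children = node
    results = [occurs(child, failed_events) for child in children]
    if gate == "AND":
        return all(results)
    if gate == "OR":
        return any(results)
    raise ValueError(f"unknown gate type: {gate}")


print(occurs(SERVICE_DISRUPTED, {"primary server fails"}))
print(occurs(SERVICE_DISRUPTED, {"primary server fails", "standby server fails"}))
```

In this example a single server failure does not disrupt the service, while a failure of both servers does.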
CRAMM
CCTA Risk Analysis and Management Methodology can be used to identify new risks and provide appropriate countermeasures associated with any change to the business availability requirements and revised IT infrastructure design.
SOA
Systems Outage Analysis is a technique designed to provide a structured approach to identifying the underlying causes of service interruption to the user.