Availability management - CSI AND OTHER SERVICE MANAGEMENT PROCESSES

5 Continual service improvement methods and techniques

5.8 CSI AND OTHER SERVICE MANAGEMENT PROCESSES

5.8.1 Availability management

Availability management’s methods are part of the measuring process explained in Chapter 4, specifically its gathering, processing and analysing activities. When the information is provided to CSI in the form of a report or presentation, it becomes part of CSI’s gathering activity. For more details on each method, please consult ITIL Service Design.

Availability management provides IT with the business and user perspective about how deficiencies in the infrastructure and the underpinning processes and procedures impact the business operation. The use of business-driven metrics can demonstrate this impact in real terms and help quantify the benefits of improvement opportunities.

Availability management plays an important role in helping the IT support organization recognize where it can add value by exploiting technical skills and competencies in an availability context. The continual improvement technique can be used by availability management to harness this technical capability. This can be used with either small groups of technical staff or a wider group within a workshop environment. The information provided by availability management is made available to CSI through the availability management information system (AMIS).

This section provides practical usage and details on how each availability management method mentioned below can be used in various activities of CSI.

5.8.1.1 Component failure impact analysis

Component failure impact analysis (CFIA) identifies single points of failure, IT services at risk from failure of various CIs and the alternatives that are available should a CI fail. It should also be used to assess the existence and validity of recovery procedures for the selected CIs. The same approach can be used for a single IT service by mapping the component CIs against the vital business functions and users supported by each component.

When a single point of failure is identified, the information is provided to CSI. This information, combined with business requirements, enables CSI to make recommendations on how to address the failure.
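The core of a CFIA can be sketched in a few lines of code. The example below is a minimal illustration, not a prescribed ITIL tool: the service names, CI names and the `SERVICE_DEPENDENCIES` structure are all hypothetical. It assumes each service needs every listed function, and that each function can be delivered by any one of its alternative CIs, so a CI with no alternative is a single point of failure for that service.

```python
from collections import defaultdict

# Hypothetical configuration data: each service needs every listed function,
# and each function is delivered by one or more alternative CIs.
SERVICE_DEPENDENCIES = {
    "email": {
        "mail server": ["mail-01", "mail-02"],
        "network switch": ["switch-a"],
        "storage": ["san-01"],
    },
    "payroll": {
        "app server": ["app-01"],
        "network switch": ["switch-a"],
        "storage": ["san-01", "san-02"],
    },
}


def single_points_of_failure(dependencies):
    """Return {ci: sorted list of services that fail if that CI alone fails}."""
    at_risk = defaultdict(set)
    for service, functions in dependencies.items():
        for alternatives in functions.values():
            # A function with exactly one distinct CI has no alternative.
            if len(set(alternatives)) == 1:
                at_risk[alternatives[0]].add(service)
    return {ci: sorted(services) for ci, services in at_risk.items()}


if __name__ == "__main__":
    findings = single_points_of_failure(SERVICE_DEPENDENCIES)
    for ci, services in sorted(findings.items()):
        print(f"{ci}: single point of failure for {', '.join(services)}")
```

A list of this kind is the sort of output that could be passed to CSI alongside the business requirements for each affected service.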

5.8.1.2 Fault tree analysis

Fault tree analysis (FTA) is a technique that can be used to determine the chain of events that has caused an incident, or may cause an incident in the future. It offers detailed models of availability and represents a chain of events using Boolean algebra and notation. Essentially, FTA distinguishes between four types of event: basic events, resulting events, conditional events and trigger events.

When provided to CSI, FTA information indicates which part of the infrastructure, process or service was responsible for the service disruption. This information, combined with business requirements, enables CSI to make recommendations about how to address the fault.
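The Boolean logic behind a fault tree can be illustrated with a small sketch. The code below is an assumption-laden simplification: it models only basic events and the resulting events they produce through AND and OR gates, and the tree itself (`SERVICE_DOWN`) and its event names are hypothetical.

```python
def evaluate(node, basic_events):
    """Return True if the event described by node occurs.

    A node is either the name of a basic event (str) or a tuple
    (gate, child, child, ...) where gate is "AND" or "OR".
    """
    if isinstance(node, str):
        return basic_events[node]
    gate, *children = node
    results = [evaluate(child, basic_events) for child in children]
    if gate == "AND":
        return all(results)
    if gate == "OR":
        return any(results)
    raise ValueError(f"unknown gate: {gate}")


# Hypothetical tree: the service is lost if the network is down,
# or if both the primary and the standby server are down.
SERVICE_DOWN = (
    "OR",
    "network_down",
    ("AND", "primary_server_down", "standby_server_down"),
)

if __name__ == "__main__":
    observed = {
        "network_down": False,
        "primary_server_down": True,
        "standby_server_down": False,
    }
    print("Service down:", evaluate(SERVICE_DOWN, observed))
```

Evaluating the same tree against different combinations of basic events shows which combinations lead to the resulting event, which is the kind of information FTA passes to CSI.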

5.8.1.3 Service failure analysis

Service failure analysis (SFA) is a technique designed to provide a structured approach to identifying end-to-end availability improvement opportunities that deliver benefits to the user. Many of the activities involved in SFA are closely aligned with those of problem management, and in a number of organizations these activities are performed jointly by problem and availability management. Because SFA should identify improvement opportunities that benefit the end user, it is important to take an end-to-end view of the service requirements.

CSI and SFA work hand in hand. SFA identifies the business impact of an outage on a service, system or process. This information, combined with business requirements, enables CSI to make recommendations about how to address improvement opportunities.

5.8.1.4 Technical observation

A technical observation (TO) is a prearranged gathering of specialist technical support staff from within IT support. They are brought together to focus on specific aspects of IT availability. The TO’s purpose is to monitor events in real time as they occur, with the specific aim of identifying improvement opportunities within the current IT infrastructure. The TO is best suited to delivering proactive business and end-user benefits from within the real-time IT environment. Bringing together specialist technical staff to observe specific activities and events within the IT infrastructure and operational processes creates an environment to identify improvement opportunities.

The TO gathers, processes and analyses information about the situation. Too often the TO is reactive by nature and is assembled hastily to deal with an emergency. Why wait? If the TO is included as part of the launch of a new service, system or process for example, a lot of the issues inherent to any new component would be identified and dealt with more quickly.

One of the best examples of a TO is the mission control room for a space agency. All the specialists from all aspects of the mission are gathered in one room. Space agencies don’t wait for the rocket to be launched and experience a problem before gathering specialists to monitor, observe and provide feedback. They set it up well before the actual launch and practise monitoring, observing and providing feedback.

Certainly, launching a rocket is very costly, but so is launching a new service, system or process. Can the business afford a catastrophic failure of a new enterprise resource planning (ERP) application, for example? Incidentally, rocket launches are often aborted seconds before the launch. Shouldn’t organizations (including yours) do the same when someone discovers a major potential flaw in a service or system? CSI starts from the beginning and includes preventing things from failing in the first place. Let’s fix the flaw before it goes into production instead of fixing the fixes (what a concept!). This information, combined with business requirements, enables CSI to make recommendations about how to address the TO’s findings.

5.8.1.5 Expanded incident lifecycle

First, let’s define a few items:

■ Availability management: The process responsible for defining, analysing, planning, measuring and improving all aspects of the availability of IT services. Availability management is responsible for ensuring that all IT infrastructure, processes, tools, roles etc. are appropriate for the agreed service level targets for availability.

■ Expanded incident lifecycle: A technique to help with the technical analysis of incidents affecting the availability of components and IT services (see Figure 5.13). The expanded incident lifecycle is made up of two parts: time to restore service (also known as downtime) and time between failures (also known as uptime). There is a diagnosis part to the incident lifecycle as well as repair, restoration and recovery of the service.

Let’s assume that CSI has decided to improve the incident lifecycle by reducing the mean time to restore service (MTRS) and expanding the mean time between failures (MTBF).
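To make these measures concrete, the sketch below computes MTRS, MTBF and the mean time between service incidents (MTBSI) from a list of incident records. It is a minimal illustration under stated assumptions: the `lifecycle_metrics` function and the sample figures are hypothetical, times are expressed as hours on a common clock, downtime is taken as incident start to restoration, uptime as restoration to the start of the next incident, and MTBSI as the interval from one incident start to the next.

```python
from statistics import mean


def lifecycle_metrics(incidents):
    """Compute MTRS, MTBF and MTBSI in hours.

    incidents: list of (start_hour, restored_hour) tuples, sorted by start.
    At least two incidents are needed to measure the time between them.
    """
    pairs = list(zip(incidents, incidents[1:]))
    # Downtime: from the start of an incident until service is restored.
    downtimes = [restored - start for start, restored in incidents]
    # Uptime: from one restoration until the start of the next incident.
    uptimes = [nxt[0] - cur[1] for cur, nxt in pairs]
    # Time between service incidents: from one incident start to the next.
    between_starts = [nxt[0] - cur[0] for cur, nxt in pairs]
    return {
        "MTRS": mean(downtimes),
        "MTBF": mean(uptimes),
        "MTBSI": mean(between_starts),
    }


if __name__ == "__main__":
    # Hypothetical sample: three incidents within one reporting period.
    sample = [(100, 102), (300, 303), (650, 651)]
    for name, hours in lifecycle_metrics(sample).items():
        print(f"{name}: {hours} hours")
```

Tracking these figures from one reporting period to the next would show whether an improvement initiative is actually reducing MTRS and increasing MTBF.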

Here is an example of how availability management can assist in reducing downtime in the expanded incident lifecycle using several techniques:

■ Monitoring (detection of incident): By adequately monitoring the availability of vital business functions through automated monitoring tools (set at the right threshold) that record and escalate incidents, the time it takes to detect and record incidents is reduced.

■ Incident recording: Since one of availability management’s goals is to ‘optimize the … support organization’, educating and training first-line staff as well as simplifying and/or automating incident recording helps reduce the time it takes to record incidents.

■ Investigation: Using the FTA method, availability management assists in reducing the time to investigate by creating proper investigation procedures for incident management staff. The same logic applies to the diagnosis of the incident cause, resolution and recovery.

Here is an example of how availability management can assist in increasing uptime in the expanded incident lifecycle using several techniques:

■ Using SFA, availability management can make recommendations to increase the reliability of components, thus reducing the likelihood of an incident occurring in the first place.

■ By scheduling and performing adequate and required internal maintenance of components (maintainability), availability management can help to increase the resilience of components, thus reducing the likelihood of an incident causing an outage.

■ By ensuring that external maintenance of components (serviceability) is properly scheduled and performed by external vendors, availability management can help to increase the resilience of components, thus reducing the likelihood of an incident causing an outage.

■ Conducting a CFIA to predict and evaluate the impact on IT service availability arising from component failures assists in identifying single points of failure. Availability management will either submit recommendations for enhancements to the resilience and reliability of such components or provide better troubleshooting procedures to the support groups.

■ Implementing security recommendations coming from information security management regarding the confidentiality, integrity and availability of associated data helps reduce malicious or unauthorized access to data and ensure data integrity, thus reducing the likelihood of an incident occurring or decreasing the time it takes to respond to or resolve an incident.

[Figure 5.13 Expanded incident lifecycle. Labels: incident start; uptime (availability), service available; downtime (time to restore, MTRS), service unavailable; detect, diagnose, repair, recover, restore; time between service incidents (MTBSI); time between failures (MTBF); time.]
