4.4 Availability Requirements

4.4.2 Repair and Testing

The speed and accuracy of repairs can have a significant impact on availability. Ideally, when a hardware component fails, the system accurately identifies the failed field replaceable unit (FRU) on the first attempt. Then, once a replacement unit has been installed, it can be fully tested before being put into service.

Typical requirements are:

• A failed FRU must be identified correctly by the system so that it can be replaced correctly on the first attempt 95% of the time.

• Failed FRU identification and replacement must not affect service availability.

• A replacement FRU must not be put into service until it has been fully tested within the system.

• Testing a replacement FRU must not affect service availability.

The frequency of failures and the urgency of repair also affect the cost of operating a system. The more often service personnel have to visit a system to make repairs, the higher the operating cost. FRU reliability can be increased by choice of parts, as well as by providing redundancy built into the FRU.
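The effect of redundancy built into an FRU can be illustrated with a small calculation. The sketch below is only an illustration under a strong assumption: the redundant parts fail independently and the FRU works as long as at least one part works. The function name and the example figures are hypothetical and do not come from any specification.

```python
def prob_at_least_one_works(part_probability: float, copies: int) -> float:
    """Probability that at least one of `copies` redundant parts is working.

    Assumes each part works with probability `part_probability` and that
    failures are independent, which real hardware only approximates.
    """
    if not 0.0 <= part_probability <= 1.0:
        raise ValueError("part_probability must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    # The FRU is down only if every redundant part is down at the same time.
    return 1.0 - (1.0 - part_probability) ** copies
```

Under these assumptions, two redundant parts that each work 99% of the time give a combined figure of approximately 0.9999. Shared failure causes, such as a common power feed, reduce the benefit in practice.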

4.4.3 Upgrades and Changes

All systems inevitably need changes to add or remove hardware components, install operating system changes, and install new application versions. Having the service down during such upgrades will usually violate the availability requirements, so some sort of rolling upgrade strategy is needed.

A rolling upgrade often requires that parts of the system be shut down and upgraded while the redundant components keep the service available. Once the first parts of the system are upgraded, they are put into service, and the remaining parts are in turn shut down, upgraded, and returned to service.
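The sequence described above can be sketched in a few lines. This is a minimal illustration, not a real upgrade tool: the `Node` class, the `rolling_upgrade` function, and the `min_in_service` parameter are hypothetical names, and the upgrade itself is reduced to changing a version string.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    version: str
    in_service: bool = True


def rolling_upgrade(nodes, new_version, min_in_service=1):
    """Upgrade nodes one at a time while at least `min_in_service` keep serving."""
    log = []
    for node in nodes:
        serving = sum(n.in_service for n in nodes)
        # Refuse to proceed if removing this node would drop below the floor.
        if node.in_service and serving - 1 < min_in_service:
            raise RuntimeError(f"cannot take {node.name} out of service")
        node.in_service = False      # shut down this part of the system
        log.append(("shutdown", node.name))
        node.version = new_version   # upgrade while the redundant parts serve
        log.append(("upgrade", node.name))
        node.in_service = True       # return the upgraded part to service
        log.append(("return", node.name))
    return log
```

With two nodes the loop upgrades each in turn and the service always has one node available. With a single node the function raises an error, which reflects the point that a rolling upgrade depends on redundancy.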

The implication for the Open HA Framework and the application is that some level of compatibility between versions of software is needed. This may include:

• Message formats, files, and data common to different versions of the software are compatible, or are tagged with version numbers for specific handling.

• Behavior of the different versions of software is compatible.

• Testing of the new software on the split system is possible before bringing it into service.
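Tagging messages with version numbers for specific handling, as mentioned in the list above, might look like the following sketch. The message layout, the field names `user` and `user_id`, and the handler table are all hypothetical and chosen only to show the idea of version-specific decoding.

```python
import json


# Hypothetical formats: version 1 carries "user"; version 2 renamed it "user_id".
def _read_v1(body):
    return {"user_id": body["user"]}


def _read_v2(body):
    return {"user_id": body["user_id"]}


HANDLERS = {1: _read_v1, 2: _read_v2}


def encode(version, body):
    """Wrap a message body with its format version."""
    return json.dumps({"version": version, "body": body})


def decode(raw):
    """Dispatch on the version tag so old and new senders can coexist."""
    message = json.loads(raw)
    version = message["version"]
    handler = HANDLERS.get(version)
    if handler is None:
        raise ValueError(f"unsupported message version: {version}")
    return handler(message["body"])
```

During a rolling upgrade, upgraded and not-yet-upgraded parts of the system exchange messages, so a receiver built this way can accept both formats and reject ones it does not understand.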

4.5 HA Configuration and Cluster Management

Configuration and cluster management is the key controlling entity in an HA configuration and is implemented by HA middleware distributed among the HA cluster systems. It maintains a system model of the components that comprise the cluster, and it defines how faults are detected in the cluster and what action should be taken as a result. When failures occur, it takes the appropriate actions to reconfigure the cluster and to notify and/or restart affected parts of the application.
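One way to picture "what action should be taken as a result" is a policy table that maps each kind of fault to an ordered list of actions. The sketch below is an assumption made for illustration: the fault kinds, the `Action` names, and the default of notifying only are hypothetical and are not taken from any HA middleware.

```python
from enum import Enum


class Action(Enum):
    RECONFIGURE = "reconfigure cluster"
    NOTIFY = "notify application"
    RESTART = "restart application part"


# Hypothetical policy table: fault kind -> ordered actions to take.
POLICY = {
    "node_down": [Action.RECONFIGURE, Action.NOTIFY],
    "process_crash": [Action.RESTART, Action.NOTIFY],
}


def handle_fault(kind, component, policy=POLICY):
    """Return the (action, component) steps the manager would carry out.

    Unknown fault kinds fall back to notifying the application only.
    """
    actions = policy.get(kind, [Action.NOTIFY])
    return [(action, component) for action in actions]
```

Keeping the policy as data rather than code means the same manager can be configured differently per cluster, which fits the idea of a model-driven controlling entity.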

The HA cluster configuration and management must be able to operate in a heterogeneous cluster. The member systems in the cluster may be of different types with different operating environments, and may have HA management middleware from different suppliers. The system model, the message protocols between cluster members, and application APIs must allow proper heterogeneous interoperability.

4.6 Data Replication and Data Integrity

One of the most difficult aspects of HA application design can be the preservation of the dynamic data and context despite component failures. Not only does it affect the design and testing complexity, but it may also have significant impacts on the system’s performance, dependability and accuracy.

An application usually stores long term data and context in a file system or database, but often has other more transient data that represents the instantaneous context of dialogues and/or transactions. The ability to efficiently replicate this kind of data can dramatically improve the application recovery time.
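Replication of transient context can be sketched as follows. This is a deliberately simplified, in-process illustration: the `ReplicatedContext` class and its methods are hypothetical, and a real system would send each update to a standby on another machine rather than to a second dictionary.

```python
import copy


class ReplicatedContext:
    """Holds transient session state on an active copy and mirrors it to a standby."""

    def __init__(self):
        self.active = {}
        self.standby = {}

    def update(self, session_id, state):
        self.active[session_id] = state
        # Synchronous replication: the standby receives its own deep copy,
        # so later changes to `state` do not leak into the replica.
        self.standby[session_id] = copy.deepcopy(state)

    def fail_over(self):
        """Simulate loss of the active copy; the standby takes over."""
        self.active = self.standby
        self.standby = {}
        return self.active
```

Because the standby already holds the latest replicated context, recovery does not need to rebuild each dialogue or transaction from long-term storage, which is where the improvement in recovery time comes from.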

The Open HA Framework needs to provide for at least some of the following:

• An HA file system native to the operating system environment

• An HA database system

• An HA shared memory subsystem

• An application context replication and recovery subsystem

The HA file system, database and shared memory need to be integrated with the overall HA configuration and management subsystem so that their operation can be coordinated with the HA reconfiguration activities (e.g., failure, repair and recovery handling).

5.0 System Capabilities – Configuration Management

5.1 Introduction

Configuration management involves knowing what types of hardware, firmware, and software components are actually in a system. It also tracks the intended configuration of the system (the system model), which may or may not match the actual system configuration. Finally, configuration management contains the capability to modify the configuration of the individual components that comprise a system.
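The difference between the intended configuration and the actual configuration can be made concrete with a small comparison. The sketch below assumes, purely for illustration, that both are represented as dictionaries mapping a component name to a version string; the function name and the result keys are hypothetical.

```python
def compare_configuration(intended, actual):
    """Compare two {component: version} maps and report where they differ."""
    intended_names = set(intended)
    actual_names = set(actual)
    return {
        # In the system model but not found in the system.
        "missing": sorted(intended_names - actual_names),
        # Found in the system but not in the system model.
        "unexpected": sorted(actual_names - intended_names),
        # Present in both, but with a different version.
        "mismatched": sorted(
            name
            for name in intended_names & actual_names
            if intended[name] != actual[name]
        ),
    }
```

A result with three empty lists means the actual system matches the system model. Any non-empty list is a discrepancy that either the model or the system needs to be changed to resolve.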

To provide accurate fault management, one must know the intended makeup of the system. The actions taken by a fault management system typically require changes to the system configuration, which in turn requires corresponding capabilities in the configuration management strategy. These capabilities are attained by maintaining an inventory of the system’s components (hardware, firmware, and software) and by accurately maintaining information on the health of each component.

Although the title of this section is “System Capabilities – Configuration Management”, the capabilities discussed here are only the subset of configuration management activities that are necessary to support high availability. Details of fault management are outlined in Section 6.0.

5.2 Characteristics of System Components

System components are the hardware, software and data that comprise the system. A component is any part of a system that is individually managed. In infrastructure equipment, components (managed or not) typically include:

• Hardware – Disk drives, boards, fans, switches, cables, chassis, lights, and power supplies

• Software – Operating systems, drivers, servers, content, and applications

• Logical Entities – Databases, network topologies, routes, clusters and geographical regions

To lay the foundation for the sections that follow, there are six characteristics of system components that should be addressed in modeling and controlling a system:

• Heterogeneity. System components include hardware (boards, drives, fans, LEDs, etc.) and software (O/S, drivers, applications, middleware, etc.)

• Transience. System components are available or not available, according to either their health or their presence in the system due to a service event or upgrade

• Dynamics. The state of system components is typically dynamic and includes attributes such as health, operational state, and administrative state
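The characteristics of heterogeneity, transience, and dynamics listed above suggest what a record for one managed component might hold. The following is a sketch under stated assumptions: the class names, the enumeration values, and the availability rule are hypothetical and chosen only to show how the attributes could fit together.

```python
from dataclasses import dataclass
from enum import Enum


class Health(Enum):
    OK = "ok"
    DEGRADED = "degraded"
    FAILED = "failed"


class OperationalState(Enum):
    ENABLED = "enabled"
    DISABLED = "disabled"


class AdministrativeState(Enum):
    UNLOCKED = "unlocked"
    LOCKED = "locked"


@dataclass
class Component:
    name: str
    kind: str                     # heterogeneity: "hardware", "software", ...
    present: bool = True          # transience: may be removed for service
    health: Health = Health.OK    # dynamics: state changes over time
    operational: OperationalState = OperationalState.ENABLED
    administrative: AdministrativeState = AdministrativeState.UNLOCKED

    def is_available(self) -> bool:
        """Assumed rule: present, not failed, enabled, and not locked."""
        return (
            self.present
            and self.health is not Health.FAILED
            and self.operational is OperationalState.ENABLED
            and self.administrative is AdministrativeState.UNLOCKED
        )
```

A single record type with a `kind` field lets hardware and software components be tracked uniformly, while the separate state fields distinguish a component that has failed from one that an operator has deliberately taken out of use.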

5.3 Dynamic System Model

5.3.1 Introduction

The system model provides the basis for both configuration and fault management and is a critical component in meeting availability targets within an HA system. The model is typically implemented in an in-memory database within the management middleware. This complete system model may use information from models of other components, such as a model of the hardware. Information about these component models may be made available via OS hooks or hardware management processors. The discussion of data collection in Section 5.4 covers this in more detail.

A system model is an abstract description of component capabilities, interfaces, relevant data structures, and interactions. The system model dynamically tracks all of the managed components in the system and incorporates the configuration, state, relationships and dependencies of these components.
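Tracking relationships and dependencies is what lets the model answer questions such as "what is affected if this component fails?". The sketch below is a minimal in-memory illustration: the `SystemModel` class, its methods, and the component names in the usage note are hypothetical, and a real model would hold far more than a state string per component.

```python
from collections import deque


class SystemModel:
    """Tiny in-memory model of components, their state, and their dependencies."""

    def __init__(self):
        self.state = {}        # component name -> state string
        self.dependents = {}   # component name -> components that depend on it

    def add(self, name, depends_on=()):
        self.state[name] = "up"
        self.dependents.setdefault(name, set())
        for dependency in depends_on:
            self.dependents.setdefault(dependency, set()).add(name)

    def affected_by(self, failed):
        """Return every component that depends, directly or indirectly, on `failed`."""
        seen = set()
        queue = deque([failed])
        while queue:
            current = queue.popleft()
            for dependent in self.dependents.get(current, ()):
                if dependent not in seen:
                    seen.add(dependent)
                    queue.append(dependent)
        return seen
```

For example, if an application depends on a board and the board depends on a power supply, asking the model what is affected by the power supply returns both the board and the application. A fault manager could use such a result to decide which parts of the application to notify or restart.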

In document Providing Open Architecture High Availability Solutions (Page 37-40)