DEFINITION OF AVAILABILITY

6 High Availability

Availability is the status of a system, a subsystem, or a module in which it is able to perform its intended function under the stated conditions.

An available system performs its functions as specified, responds to commands, then sends status data back. A system that fulfills High Availability requirements is characterized by the following:

• High Availability in TelCo terms is constituted by a cumulative system downtime of less than 3 minutes a year (99.9996% system uptime).

• System downtime includes downtimes caused by unrecoverable hardware defects, software errors and traps, and software maintenance.

• To achieve High Availability status, the hardware must be capable of supporting more than 99.9996% uptime.
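
The relationship between an uptime percentage and the annual downtime budget quoted above can be checked with simple arithmetic. The following is a minimal sketch, assuming a 365-day year; the function names are illustrative and do not come from the source.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years are ignored (assumption)


def downtime_minutes_per_year(uptime_percent: float) -> float:
    """Annual downtime, in minutes, implied by an uptime percentage."""
    return (1.0 - uptime_percent / 100.0) * MINUTES_PER_YEAR


def uptime_percent(downtime_minutes: float) -> float:
    """Uptime percentage implied by a cumulative annual downtime in minutes."""
    return (1.0 - downtime_minutes / MINUTES_PER_YEAR) * 100.0


if __name__ == "__main__":
    print(f"99.9996% uptime -> {downtime_minutes_per_year(99.9996):.2f} min/year")
    print(f"3 min/year      -> {uptime_percent(3.0):.4f}% uptime")
```

By this arithmetic, 99.9996% uptime corresponds to roughly 2.1 minutes of downtime a year, which sits inside the 3-minute limit, while exactly 3 minutes a year corresponds to roughly 99.9994% uptime.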

The following consequences can be derived directly from these requirements:

• All subsystems and modules must be hot-swappable, independent of whether they are redundant or Single Points of Failure. A subsystem that is not hot-swappable will force a system shutdown for the replacement to be installed. A subsystem or module that is not swappable at all will force the entire device to be replaced.

• A system capable of HA must support redundancy on a subsystem and module level for all modules and subsystems that are crucial to the fully functional status of the device.

• Hot standby (Hot Spare) or Triple Modular Redundant Systems must be supported by the subsystem and module architecture, both in hardware and in software, for all modules and subsystems that are crucial to the fully functional status of the device.

• The Hot Swap process should not affect system data integrity, so that system availability is not impacted during a module swap.

These requirements will most likely have a direct impact on some internal components and subsystems.

A preliminary assessment of the number of internal components that have an impact on the system availability and the acceptable error rates before a link must be assumed to be malfunctioning gives the following:

• To achieve a system uptime of more than 99.9996%, internal links mostly require components with bit error rates (BER) of less than 10^-15.

• All internal communication paths (buses, links, channels) must provide BERs less than or equal to 10^-15.

• The BERs will have to be monitored and compared to two thresholds: one for counting errors at expected rates indicating normal conditions, and one for taking a link or a component out of operation.

• It is crucial that errors in datagrams (payload and header or LCI) are detected.

• Errors that span multiple datagrams must be detected.

• Whenever a datagram is altered, the integrity of the incoming data must be verified and a new protection must be applied.
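
Two of the points above lend themselves to a short illustration: comparing a measured BER against two thresholds, and verifying a datagram before altering it and protecting it again. The sketch below is hypothetical throughout. The class and function names, the out-of-operation threshold of 10^-12, and the use of CRC-32 as the protection are assumptions made for illustration, not values or mechanisms given by the source.

```python
import zlib
from dataclasses import dataclass


@dataclass
class LinkMonitor:
    """Tracks the observed bit error rate of one internal link."""
    expected_ber: float = 1e-15  # threshold for normal conditions (from the text)
    removal_ber: float = 1e-12   # threshold for removal from operation (assumed value)
    bits_seen: int = 0
    bit_errors: int = 0

    def record(self, bits: int, errors: int) -> None:
        self.bits_seen += bits
        self.bit_errors += errors

    def status(self) -> str:
        if self.bits_seen == 0:
            return "unknown"
        ber = self.bit_errors / self.bits_seen
        if ber <= self.expected_ber:
            return "normal"          # errors arrive at the expected rate
        if ber < self.removal_ber:
            return "degraded"        # above expected, below the removal threshold
        return "out_of_service"      # link or component should be taken out of operation


def protect(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 trailer (an illustrative choice of protection)."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def verify(datagram: bytes) -> bytes:
    """Return the payload if the trailer matches; raise ValueError otherwise."""
    if len(datagram) < 4:
        raise ValueError("datagram too short to carry a trailer")
    payload, trailer = datagram[:-4], datagram[-4:]
    if zlib.crc32(payload).to_bytes(4, "big") != trailer:
        raise ValueError("datagram failed integrity check")
    return payload


def alter(datagram: bytes, transform) -> bytes:
    """Verify the incoming data first, alter it, then apply new protection."""
    return protect(transform(verify(datagram)))
```

The ordering inside `alter` is the point of the last bullet: if the check were skipped, a corrupted datagram would receive a fresh, valid-looking protection and the error would become undetectable downstream.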

It is important to stress one more time that HA in TelCo terms is a cumulative system downtime of less than 3 minutes a year, including hardware defects, software errors and traps, and software maintenance such as patches and version updates or upgrade installations. This can only be achieved using redundancy mechanisms. Typically, this involves component redundancy and board redundancy, plus software support for the switchover on all affected modules and subsystems, as well as on the OAM&P card.
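
Why redundancy helps can be illustrated with the standard formula for the availability of units in parallel. This is a rough sketch under strong assumptions that the source does not state: the units fail independently, switchover is instantaneous and always succeeds, and there is no shared single point of failure. The function name is illustrative.

```python
def parallel_availability(unit_availability: float, units: int = 2) -> float:
    """Availability of a group that is up as long as at least one unit is up.

    Assumes independent failures and perfect switchover (idealized model).
    """
    return 1.0 - (1.0 - unit_availability) ** units


if __name__ == "__main__":
    single = 0.999  # hypothetical unit that is up 99.9% of the time
    pair = parallel_availability(single)
    print(f"single unit: {single:.6f}, redundant pair: {pair:.6f}")
```

Under these idealized assumptions, a unit with 99.9% availability (about 8.8 hours of downtime a year) yields a pair with about 99.9999% availability (about half a minute a year). Real systems fall short of this because of shared resources and imperfect switchover.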

1+1 REDUNDANCY

1+1 Redundancy is defined as an architecture in which a non-core subsystem is connected to two core subsystems. The two core subsystems operate in a load-sharing way while both are able to perform their intended function, and they operate as a single subsystem with appropriately reduced performance when only one of them is able to perform its intended function. The 1+1 redundancy can be implemented as a simple dual-CPU (or dual-function) Symmetric Multi Processor (SMP) system, or any other dual symmetric architecture, and therefore is relatively cheap. Both redundant subsystems must be able to signal their load and availability status to each other in order for the system to perform load-sharing and redundancy functions. These functions can be executed in software or in hardware, but it must be ensured that the protocol is robust against violations of itself. For example, one processor must take over all incoming load if the second processor is not responding to requests to take excess load.

During normal operation, in which both subsystems are able to perform their intended function, each of the units operates at 50% of its intended maximum sustainable load. During an error or an outage of one of the two subsystems, the other one must take over and then operates at 100% of its intended maximum sustainable load in order to maintain the rated throughput of the redundant module. This means that each of the subsystems will have to be rated at a sustainable throughput identical to the throughput of the entire redundant module. Typically, both redundant subsystems share resources such as parts of their memory, I/O, and some semaphores. Failure of any of these can lead to an outage of the entire redundant module. In an error case, the remaining subsystem will alert the OAM&P entity of this. The OAM&P entity will then take appropriate steps.
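
The load-sharing and takeover behavior described above can be sketched as follows. This is an illustrative model only: the `Unit` class, the `share_load` function, and the `alert` callback standing in for the OAM&P notification are hypothetical names, and a real implementation would exchange load and status signals rather than read a flag.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Unit:
    name: str
    rated_load: float        # sustainable throughput; equals the whole module's rating
    responding: bool = True  # stands in for the availability status signal


def share_load(offered: float, units: List[Unit],
               alert: Callable[[str], None] = print) -> Dict[str, float]:
    """Split the offered load across the responding units.

    With both units up, each carries half. If one stops responding, the
    other takes the full load and the OAM&P entity is alerted.
    """
    alive = [u for u in units if u.responding]
    if not alive:
        raise RuntimeError("redundant module down: no unit responding")
    if len(alive) < len(units):
        failed = ", ".join(u.name for u in units if not u.responding)
        alert(f"OAM&P alert: {failed} not responding; {alive[0].name} carries full load")
    share = offered / len(alive)
    for u in alive:
        if share > u.rated_load:
            raise RuntimeError(f"{u.name} would exceed its rated load")
    return {u.name: (share if u.responding else 0.0) for u in units}


if __name__ == "__main__":
    cpu0, cpu1 = Unit("cpu0", 100.0), Unit("cpu1", 100.0)
    print(share_load(100.0, [cpu0, cpu1]))  # both units at 50% of rating
    cpu1.responding = False
    print(share_load(100.0, [cpu0, cpu1]))  # cpu0 at 100% of rating
```

The overload check makes the rating requirement explicit: if each unit were rated at only half the module throughput, the takeover case would fail.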

1:1 REDUNDANCY

1:1 Redundancy is defined as an architecture in which a non-core subsystem is connected to two core subsystems such that the two core subsystems operate in a configuration with an active core subsystem and a hot or cold standby core subsystem (see Figure 6.3). Both subsystems receive the same data, process the same data, and send out the same data for as long as both subsystems are able to perform their intended functions. If any error occurs, the data that both redundant systems send out will differ. The multiplexer and demultiplexer chips and the select logic cannot determine which inputs are valid and which are not. They will forward the input of the active subsystem to the remainder of the logic of the non-core subsystem, independent of whether it is correct or not. While this appears to be flawed, it is not. The advantage lies in the simplicity and the resulting reliability of these components that are single points of failure. The remainder of the logic of the non-core subsystem will evaluate the data it receives and determine its validity. In an error case, it will alert the OAM&P entity. The OAM&P entity will then take appropriate steps.

FIGURE 6.3 1:1 redundancy, hot standby. (Figure labels: Subsystem 0, Subsystem 1, Redundant Module, Mux/Demux, Select, Logical unit.)

Each of the subsystems will have to be rated at a sustainable throughput identical to the throughput of the entire redundant module. This is a significantly more reliable but much more complex architecture than the 1+1 redundancy, and it provides nominal throughput even during times in which one of the two core devices is out of service. However, it does not require any logic or software for task scheduling, load sharing, or detection of the outage in the redundant subsystem, and switchover is faster. Additionally, since the two subsystems share no resources, an outage of a shared resource cannot affect the subsystem or the redundant module as a whole. In some cases, both redundant core subsystems must work in lockstep (cycle or even microcycle synchronous).
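
The division of labor in the 1:1 scheme, a simple selector that forwards the active subsystem's output unchecked and downstream logic that validates it, can be sketched as follows. Everything here is a hypothetical illustration: the class and method names are invented, CRC-32 stands in for whatever integrity check the downstream logic uses, and switching to the standby is only one step the OAM&P entity might take.

```python
import zlib
from typing import List, Optional


def protect(payload: bytes) -> bytes:
    """Append a CRC-32 trailer (illustrative integrity protection)."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def is_valid(datagram: bytes) -> bool:
    """Check the trailer against the payload."""
    return (len(datagram) >= 4
            and zlib.crc32(datagram[:-4]).to_bytes(4, "big") == datagram[-4:])


class RedundantModule:
    def __init__(self) -> None:
        self.active = 0              # index of the active core subsystem
        self.alerts: List[str] = []  # stands in for messages to the OAM&P entity

    def mux(self, out0: bytes, out1: bytes) -> bytes:
        """Select logic: forwards the active output without judging validity."""
        return out1 if self.active else out0

    def receive(self, out0: bytes, out1: bytes) -> Optional[bytes]:
        """Downstream logic: validates what the mux forwarded."""
        chosen = self.mux(out0, out1)
        if is_valid(chosen):
            return chosen[:-4]
        self.oamp_alert(f"invalid data from subsystem {self.active}")
        return None

    def oamp_alert(self, reason: str) -> None:
        """Assumed OAM&P reaction: record the alert and switch to the standby."""
        self.alerts.append(reason)
        self.active = 1 - self.active


if __name__ == "__main__":
    module = RedundantModule()
    good = protect(b"data")
    bad = b"dbta" + good[4:]            # corrupted output of subsystem 0
    print(module.receive(good, good))   # both subsystems agree
    print(module.receive(bad, good))    # error detected, alert raised
    print(module.receive(bad, good))    # standby is now active
```

Keeping `mux` free of any validity logic mirrors the argument in the text: the selector is a single point of failure, so it is kept as simple as possible and the checking is done elsewhere.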

Source: Axel K. Kloth, Advanced Router Architectures (2005), pages 52-55.