3 — Failure protection and redundancy provisions in ISAM
3.1 Overview
When you provide protection for system functions and subsystems by use of redundancy, you improve the reliability of those parts of the ISAM, and hence the availability of the whole ISAM.
Redundancy aspects
Redundancy has different aspects, and each aspect has its advantages and disadvantages which must be taken into account. The following aspects are described:
• relation between essential and redundant resources
• operational mode of the additional redundant resources
• the scope of the protection - the impact of a failure
• the average duration of an outage - time to repair
• the number of simultaneous failures that have to be coped with
Relation between essential and redundant resources
• Bilateral:
One redundant resource can back up only a single dedicated essential resource (notation 1:1 or 1+1).
The advantage is that the redundant resource can be fully preconfigured, and that protection normally takes a minimal time. Also, the configuration data (static, or dynamic, or both) necessary for the redundant resource can be kept on the redundant resource itself.
The disadvantage is that each essential resource has to be duplicated, which adds to the cost, the space requirements, and the power consumption.
• Dynamic:
A redundant resource can replace any one resource out of a group of identical essential resources (notation N:1 or N+1, or N:M or N+M in general).
The advantage is that the essential resources do not each have to be duplicated: one or a few additional resources can protect a much larger group of identical essential resources.
The disadvantage is that this scheme is applicable only when multiple identical essential resources are present in the ISAM. In many cases, the redundant resource cannot be fully preconfigured. The redundant resource can only be configured after the failing resource has been identified, which means that the protection time increases by the configuration time. Also, an up-to-date copy of the configuration data (static, or dynamic, or both) for the multiple essential resources has to be kept in a place that is not affected by failure of the related resource. This requires either additional storage on the redundant resource, or a more complex data storage mechanism across all the protected resources.
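The extra configuration step in the dynamic scheme can be illustrated with a minimal sketch. All names used here (`Resource`, `ProtectionGroup`, `config_store`) are hypothetical and do not correspond to ISAM software interfaces. The sketch only shows that, in an N:1 group, the spare is configured after the failing resource has been identified, from a configuration copy kept outside the protected resources.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Resource:
    name: str
    config: dict | None = None  # None means "not configured yet"
    failed: bool = False


@dataclass
class ProtectionGroup:
    """Hypothetical N:1 group: one spare protects N identical resources."""

    essentials: list[Resource]
    spare: Resource
    # Copy of each essential resource's configuration, stored outside the
    # protected resources so that it survives their failure.
    config_store: dict[str, dict] = field(default_factory=dict)

    def save_config(self, resource: Resource) -> None:
        self.config_store[resource.name] = dict(resource.config or {})

    def protect(self, failed_name: str) -> Resource:
        """Configure the spare only once the failing resource is known."""
        if self.spare.config is not None:
            raise RuntimeError("spare already in use: only one failure is covered")
        for res in self.essentials:
            if res.name == failed_name:
                res.failed = True
                # This copy is the configuration time that adds to the
                # protection time in the dynamic scheme.
                self.spare.config = dict(self.config_store[failed_name])
                return self.spare
        raise KeyError(failed_name)
```

In a bilateral (1:1) arrangement the `protect` step would not need to load any configuration, because the redundant resource already holds the configuration of its single partner.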
Operational mode of the additional redundant resources
• Standby:
One or more redundant resources are kept inactive or on standby while one or more essential resources perform all the required processing (notation 1:1, N:1, or N:M in general).
The advantages are that the ISAM architecture is relatively simple, and the configuration and initialization of the redundant resource(s) starts from a well-known state at the time of activation of the redundant resource(s) in case of a protection switchover. The standby state can apply on the data path, the control path and/or the management path (see “Redundancy provision” for more information and practical examples).
The disadvantages are that the redundant resource does not contribute to the operation (performance) of the ISAM for 99.9% or more of the time, while requiring an additional, up to 100% investment in cost, space and power consumption. Also, in many cases the redundant resource cannot be monitored or tested for 100% of the functions that it has to perform, so a certain risk of dormant faults exists.
• Active and load sharing:
All resources (reflected in the data path, control path and/or management path) are active or operational, normally in a load-sharing mode, but the number of resources in the ISAM exceeds the minimum needed to perform all the necessary processing by one or more (notation 1+1, N+1, or N+M in general). Some resources can be implemented in load-sharing mode, while others are implemented in active/standby mode (see “Redundancy provision” for more information and practical examples).
If one or more of the active resources fail, the remaining resources take over the whole processing load. Also, all the resources can be monitored in operational conditions, and dormant faults cannot occur.
The advantage of this type of redundancy is that the ISAM performance increases while no faults occur, by virtue of the more-than-necessary active resources.
The disadvantages are that the ISAM usually becomes more complex. A dispatching or processing load distribution function is necessary, which must be fair (that is, the load must be shared evenly over all the resources) and must be able to recognize resource failures in time and to respond to them. Also, this function must not constitute a (significant) single-point-of-failure in itself.
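The requirements on the load distribution function can be sketched as follows. This is a simplified illustration and not the ISAM implementation: the class name `LoadSharingDispatcher` and its methods are hypothetical, fairness is approximated by plain round-robin, and failure detection is reduced to an explicit `mark_failed` call.

```python
class LoadSharingDispatcher:
    """Hypothetical round-robin dispatcher over the operational resources."""

    def __init__(self, resources):
        self.healthy = list(resources)
        self._next = 0  # running counter used for round-robin selection

    def mark_failed(self, resource):
        # A failed resource is removed, so the remaining resources
        # take over the whole processing load.
        if resource in self.healthy:
            self.healthy.remove(resource)

    def dispatch(self):
        if not self.healthy:
            raise RuntimeError("no operational resource left")
        resource = self.healthy[self._next % len(self.healthy)]
        self._next += 1
        return resource
```

Note that a single dispatcher instance like this one would itself be a single point of failure, which is exactly what the text above warns against; a real design has to distribute or protect this function as well.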
The scope of the protection - the impact of a failure
Usually, it is not economical to protect functions or sub-systems that affect only a limited number of subscribers, interfaces or a limited amount of traffic. An often applied principle is that central resources or aggregation resources (that is, resources whose availability determines the availability of the whole ISAM) are protected, while tributary resources are not protected. However, it depends on the specifics of each individual case whether this principle is economically viable, in either direction.
The average duration of an outage - time to repair
Redundancy of a resource should nearly always be optional. In many cases, whether redundancy is needed for a given resource is determined by the average time to repair. A resource in a system may be reliable enough (that is, its Mean Time Between Failures (MTBF) is high enough) to operate in a non-protected way. This is the case, for example, in an attended CO environment, where a stock of spare parts and skilled staff are available and where short detection and intervention times can be guaranteed. However, the same resource may require redundancy when deployed in an unattended outdoor cabinet, in order to meet the same availability as in the CO.
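The influence of the time to repair can be made concrete with the standard steady-state relation availability = MTBF / (MTBF + MTTR), where MTTR is the mean time to repair. This formula is not given in this document, and the MTBF and repair times below are purely illustrative assumptions, not ISAM figures.

```python
HOURS_PER_YEAR = 8760


def availability(mtbf_hours, mttr_hours):
    """Steady-state availability of a single, non-protected resource."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_hours_per_year(mtbf_hours, mttr_hours):
    """Expected outage time per year for the same resource."""
    return (1.0 - availability(mtbf_hours, mttr_hours)) * HOURS_PER_YEAR


# Illustrative values only: the same resource in two environments.
MTBF = 200_000  # hours, assumed
attended_co = availability(MTBF, mttr_hours=4)       # spares and staff on site
outdoor_cabinet = availability(MTBF, mttr_hours=48)  # travel time dominates

print(f"attended CO:     {attended_co:.6f}")
print(f"outdoor cabinet: {outdoor_cabinet:.6f}")
```

With identical MTBF, the longer repair time alone lowers the availability of the outdoor deployment, which is why the same resource may need redundancy there but not in the CO.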
The number of simultaneous failures that have to be coped with
Individual Replaceable Items (RI) in modern, carrier-grade telecommunication equipment are already highly reliable, and provide an intrinsic availability of 99.99% or even 99.999%, within the boundaries of the specified environmental operating conditions. In order to achieve the generally required 99.9999% availability, coping with a single resource failure (that is, providing at most one redundant resource) is sufficient in all circumstances. The probability of dual simultaneous failures, affecting the same type of resource, is low enough, and does not have to be taken into account for protection.
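A rough check of this claim is possible under simplifying assumptions that are not stated in this document: the two resources of a protected pair fail independently, and the switchover itself is instantaneous and perfectly reliable. The pair is then unavailable only when both resources are down at the same time. The function name below is hypothetical.

```python
TARGET = 0.999999  # the generally required availability quoted above


def pair_availability(single_availability):
    """Availability of a 1:1 protected pair with independent failures."""
    unavailability = 1.0 - single_availability
    return 1.0 - unavailability ** 2


for single in (0.9999, 0.99999):
    pair = pair_availability(single)
    print(f"single {single}: pair meets target = {pair >= TARGET}")
```

Under these assumptions even the lower intrinsic availability of 99.99% is enough for a protected pair to exceed the 99.9999% target, which is consistent with the statement that a second redundant resource is not needed.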
Redundancy provision
The ISAM basically provides redundancy as an option for essential central or aggregation functions and resources. These include:
• External link protection for:
• network links
• links with sub-tended ISAMs
• Equipment protection for the ISAM:
• Data path: the Ethernet switch fabric
• Control path: the Network Termination (NT) board processor
• Management path: the NT board processor
The ISAM does not protect all the central functions or resources by default. Essential functions and resources reside on the NT board, which can be made redundant. In practice, a number of different configurations are possible, with a single or redundant NT board and a single NT IO board, each supporting a different amount or type of protection.
The ISAM can be configured in active/standby mode by means of an optional standby NT board. The standby NT board is synchronized with the active NT board.
In order to speed up the reconfiguration of the data plane after switchover and to facilitate the rebuilding of the control plane, the dynamic switch configuration (L1 and L2) is also synchronized between the active NT board and the standby NT board.
The management plane is fully restored at the moment the new active NT board is initialized.
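The active/standby behavior described above can be pictured with a toy model. It is only a sketch under stated assumptions: the `NtBoard` class, its fields, and the `synchronize` and `switchover` functions are hypothetical, and the real synchronization and initialization mechanisms of the NT boards are not described here.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class NtBoard:
    name: str
    role: str  # "active", "standby" or "failed"
    config: dict = field(default_factory=dict)        # static configuration
    switch_state: dict = field(default_factory=dict)  # dynamic L1/L2 switch data
    management_ready: bool = False


def synchronize(active, standby):
    """Copy static and dynamic switch configuration to the standby board."""
    standby.config = copy.deepcopy(active.config)
    standby.switch_state = copy.deepcopy(active.switch_state)


def switchover(active, standby):
    """Promote the standby board after a failure of the active board."""
    active.role = "failed"
    standby.role = "active"
    # The synchronized switch_state is already present, so the data plane
    # does not have to be rebuilt from scratch. In this model, the
    # management plane is flagged as restored once the new active board
    # has been initialized.
    standby.management_ready = True
    return standby
```

Without the synchronized `switch_state`, the new active board in this model would have to relearn the dynamic switch configuration before traffic could flow normally again.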