Achieving High Availability
in an IP Telephony System
Scott St. Clair
Director, Corporate Communications
An 8x8 White Paper
Introduction
The standard set for telephone system availability in the United States and other industrialized countries is extremely high, so high that most telephone service customers have never picked up a telephone receiver and failed to hear a dial tone, the system’s indication that it is available for use.
The user expectation that a telephone system will always be available is so high that no company can reasonably introduce a telephone system that does not provide that level of availability. This holds even though many technologies that modern consumers rely upon -- automobiles, personal computers and cable television systems come to mind -- are much less reliable but are nevertheless considered perfectly useful.
This white paper discusses the mechanisms employed in 8x8’s Netergy iPBX Server System to provide carrier-class system availability. In addition to treating the specific schemes that 8x8 has adopted to ensure high availability for the Netergy system, this paper also attempts to provide sufficient background for the reader to understand the difficulties that engineers face in attempting to deliver carrier-class reliability.
About the Netergy iPBX Server System
The Netergy™ iPBX Server System from 8x8 is the first full-function private branch exchange (PBX) system designed specifically to allow service providers to deliver iPBX hosting services. Completely packet-network based, the Netergy iPBX allows service providers to support up to 100 discrete iPBXs -- each dedicated to an individual customer -- and up to 10,000 total extensions. The Netergy iPBX Server System is part of the Netergy Advanced Telephony System (ATS), which includes cost-effective terminal adapters for customers and the most advanced user control software in the world.
System Architecture
The iPBX Server System consists of a cluster of carrier-grade Sun Microsystems Netra t1 Servers running the Netergy iPBX Server Software. The iPBX Server System is located in the service provider's data center, and it is connected to the customer's premises using any broadband IP connection, usually xDSL or T1, which can also be used to offer Internet connectivity. For telephone sets, customers can use inexpensive Netergy Media Hubs to adapt standard analog telephones to IP service, or they can use next-generation IP phones. The Netergy ATS connects to the PSTN and the long-distance IP backbone through a SIP softswitch combined with an MGCP gateway or an H.323 gateway and gatekeeper.
System Functionality
The Netergy iPBX Server Software provides complete PBX functionality: call hold, call transfer, three-way conferencing, multi-line phone support, paging, hunt groups, voice mail (optional, includes interactive voice response menuing and automated call distribution), direct inbound dialing, and more. Each Netergy iPBX Server can be custom configured for each customer, and complete support for the Java Telephony Application Programming Interface (JTAPI 1.3) allows customers to deploy CTI applications from third-party vendors.
High Availability Defined
Simply put, when a user makes a telephone call or clicks on a Web site, he wants a prompt connection or a rapid response at any time of the night or day. Failure to deliver good performance or, even worse, failure to make the connection or display the web page, could well be grounds for selecting an alternate service provider. Thus, providing high availability is a chief goal of telephone systems.
In the telecommunications market, availability is typically defined as the percentage of time per year that a system is available to its subscribers. The following table lists some availability percentages and the corresponding number of minutes of downtime allowed under that standard.
TABLE 1. Downtime Minutes
Availability Percentage    Allowed Downtime (minutes/year, approx.)
99.000                     5,000
99.900                     500
99.990                     50
99.999                     5
The current standard for availability in telecommunication systems in the industrialized world is 99.999 percent. To achieve this goal, systems must be designed for both reliability and serviceability, both of which are important elements in providing high availability.
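The downtime figures in Table 1 are rounded. As a minimal sketch, the exact values can be derived from the availability percentage. The function names and the 365-day year are assumptions made for illustration, not part of the Netergy system.

```python
# Illustrative sketch: names and the 365-day year are assumptions.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the minutes per year a system may be down at this availability."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100 percent")
    return MINUTES_PER_YEAR * (1.0 - availability_percent / 100.0)


def max_outage_minutes(availability_percent: float, failures_per_year: int) -> float:
    """Return the longest each outage may last if failures share the budget equally."""
    if failures_per_year < 1:
        raise ValueError("failures_per_year must be at least 1")
    return allowed_downtime_minutes(availability_percent) / failures_per_year


for pct in (99.0, 99.9, 99.99, 99.999):
    print(f"{pct:7.3f}%  {allowed_downtime_minutes(pct):10.2f} minutes/year")
```

At 99.999 percent the budget is a little over five minutes per year, and it must be shared among every failure that occurs in that year.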
Reliability
Reliability is the starting point for building increasingly available systems, since a measure of a system’s reliability is how long it has been up, and/or how long it typically stays up between failures. The nature of the failure is not important; any failure affects the system’s overall availability.
Mean Time Between Failures, or MTBF, is often considered an important metric for measuring system reliability. However, there is currently no industry-adopted standard for measuring MTBF, which makes the MTBF number for a given system or component of questionable use for comparison against other vendors.
There are two primary means of achieving greater reliability:
• Building high MTBF components into the system, and
• Adding them in redundant (N+1) configurations
Serviceability
Serviceability defines the time it takes to isolate and repair a fault, or, more succinctly, the time it takes to restore a system to service following a failure. Mean Time To Repair, or MTTR, is considered an important metric when discussing the serviceability of a system or some component of the system.
MTTR, however, is a unit of time and does not factor in the cost of service. Consider the cost of a 7x24 (seven days a week, 24 hours a day) service contract versus 5x8 coverage. In some environments, a system that recovers automatically from most failures can obviate the need for 24-hour service coverage, and can reduce support staff requirements. The net result is a highly available system with a lower cost of ownership.
Availability
Availability, then, is the time that the system is running divided by the time the system is meant to be in service, which for telephone systems is all of the time. The following equation shows the relationship:
\[ \text{Availability} = \frac{\text{Time Running}}{\text{Time Measured}} \]
Since a telephony system is supposed to be available all of the time, the Time Measured is essentially MTBF plus the MTTR, so the equation becomes:
\[ \text{Availability} = \frac{\text{MTBF}}{\text{MTBF} + \text{MTTR}} \]
Thus, it can be seen that the availability of the system depends on the intrinsic reliability of the components and the time needed to repair the system in case of failure.
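The MTBF/MTTR relationship can be checked numerically. The sketch below is illustrative only: the function names and the sample figures (one failure per year, a four-hour repair) are assumptions chosen for the example, not measured values for any product.

```python
# Illustrative sketch: function names and sample figures are assumptions.

def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability = MTBF / (MTBF + MTTR), as a fraction between 0 and 1."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def max_mttr_hours(mtbf_hours: float, target_availability: float) -> float:
    """Solve the same equation for MTTR: MTTR = MTBF * (1 - A) / A."""
    if not 0.0 < target_availability <= 1.0:
        raise ValueError("target availability must be in (0, 1]")
    return mtbf_hours * (1.0 - target_availability) / target_availability


HOURS_PER_YEAR = 365 * 24  # assumed: one failure per year on average

# A four-hour manual repair once a year falls well short of five nines.
print(f"4 h repair:  {availability(HOURS_PER_YEAR, 4.0):.6f}")

# Repair time that five nines would permit with the same MTBF, in minutes.
print(f"max MTTR:    {max_mttr_hours(HOURS_PER_YEAR, 0.99999) * 60:.2f} minutes")
```

With these assumed figures, the permitted repair time is only a few minutes, far shorter than a four-hour manual repair.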
An availability of 99.999%, which is the standard for telecommunications systems, allows for a service interruption of no more than about five minutes per year (less per failure if the system suffers more than one failure per year). Since it is unrealistic to expect that a technician can repair any failure in less than five minutes, the system must be designed to replace the failing element automatically, without any human intervention.
User Perception of Availability
Building automatic repair capabilities into a system is a good strategy to improve the system user’s perception of availability because short failures are less visible to users than long ones, even if short failures occur more often.
Short failures are preferable for telephone system users because most subscribers use the system infrequently and for short periods of time, so not all system failures will be visible to the average user. In addition, if the system operator can schedule system downtime for a period during which the system is lightly used, then user perception of availability will remain higher than otherwise.
Mechanisms to Achieve High Availability
There are a variety of mechanisms that system designers can use to build highly available systems. These mechanisms center on providing hardware and software redundancy, and on the notion of distributed processing.
System Elements that Require Backup
There are essentially two system elements for which backup must be provided in order to achieve high availability in the system:
• Hardware -- Server systems, terminal adapters, the network plant, disk drives (when shared by multiple servers) and power supplies
• Data -- Programs, system configuration data, long-life user data, short-life data
Redundancy Mechanisms
Hardware and software redundancy may be divided into grades that provide a level of protection against failures that varies with the cost of the solution. Essentially, the slower the recovery mechanism, the lower its cost. There are three principal grades as described below.
Cold Standby (CS): Following a component fault, users of the component are disconnected and lose any work in progress. An automatic fault detection and recovery mechanism then detects the fault, and brings into service a redundant component. This redundant component has been lying dormant and must be initialized in order to enter service. Once this is done, users are able to reconnect to it and begin processing again.
Typically, the length of time required for the fault detection process to detect the fault and invoke the redundant component is quite low (tens of seconds). However, the time required for the initialization of the redundant component can be much longer. This recovery time is application dependent, but usually involves cleaning up file systems, databases and other persistent resources to a consistent “rolled back” state.
Warm Standby (WS): Following a component fault, users of the component are disconnected and may lose some of their work in progress. An automatic fault detection and recovery mechanism detects the fault, and notifies the redundant component to take over. This redundant component has been actively running, and is partially initialized. Furthermore, it may already share some of the processing state of its failed peer. Hence, not all work in progress may have to be rolled back. Clients of the component must still actively reconnect to the redundant component.
Fault detection times for Warm Standby systems are similar to those in Cold Standby systems, but the recovery times are dramatically lower than Cold Standby (typically tens of seconds), due to the partial initialization and state sharing.
Hot Standby (HS): Active and redundant components are tightly coupled, and processing state is actively and completely shared between the group components. Following a component fault, users of the component are not disconnected and do not observe the fault in any way. Work in progress continues with the remaining redundant component(s) in the group providing the component functionality.
In this model, the masking is complete and transparent: clients of the system are uninterrupted. Recovery times are instantaneous. More accurately, the concept of recovery time does not apply, since from a client perspective there is no recovery.
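The three grades can be compared with a rough model of yearly downtime. This is a sketch under stated assumptions: the detection and recovery timings are illustrative placeholders that match the orders of magnitude described above, not measured values, and the class and function names are hypothetical.

```python
from dataclasses import dataclass

# ASSUMPTION: every timing in this sketch is an illustrative placeholder.
DETECTION_SECONDS = 20.0  # "tens of seconds" to detect a fault


@dataclass(frozen=True)
class StandbyGrade:
    name: str
    users_disconnected: bool  # must clients reconnect after a fault?
    work_lost: str            # "all", "some" or "none" of the work in progress
    recovery_seconds: float   # ASSUMED time for the standby to take over


COLD = StandbyGrade("cold", True, "all", 300.0)
WARM = StandbyGrade("warm", True, "some", 30.0)
HOT = StandbyGrade("hot", False, "none", 0.0)


def yearly_downtime_minutes(grade: StandbyGrade, failures_per_year: int) -> float:
    """Estimate user-visible downtime per year for a standby grade."""
    if not grade.users_disconnected:
        return 0.0  # hot standby masks the fault, so users see no outage
    per_failure = DETECTION_SECONDS + grade.recovery_seconds
    return failures_per_year * per_failure / 60.0


for grade in (COLD, WARM, HOT):
    minutes = yearly_downtime_minutes(grade, failures_per_year=2)
    print(f"{grade.name:4s} standby: {minutes:6.2f} minutes/year")
```

Under these assumed timings and two failures per year, only the warm and hot grades stay within a five-minute yearly budget.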
Distributed Processing
Distributing an application over a number of hardware platforms and a number of software processes has several advantages. First and foremost, such a system eliminates the chance that a single point of failure will bring down the entire service. If one hardware platform fails, only those processes that are running on it will be affected. Similarly, if any one software process fails, only the users served by that particular process are affected, as long as there is no interdependence between processes.
Distributed processing is an important high availability mechanism for software because software failures generally have one of two causes: an unexpected conjunction of events, or a degenerative defect such as a memory leak.
Unexpected events are unexpected because testing failed to consider a particular combination or configuration. Monolithic software implementations that are designed to serve thousands of users or perform a great number of functions are much harder to test because, as the number of functions and users rises, the potential combinations rise exponentially. Limiting the functionality and the number of users allows a software module to be tested more exhaustively, thereby limiting the chances of an unexpected event causing a failure.
In a similar way, using a modular approach limits the effect of a degenerative defect, both by allowing more thorough testing and by limiting the effect of a defect to a subset of the total system user base. For example, an application such as a “softswitch” that is intended to serve 10,000 users (or telephone lines) could be implemented as a monolithic piece of software or divided into 100 independent modules, each serving 100 lines and linked together. In the monolithic approach, a degenerative failure caused by an unusually high load on the system would affect all 10,000 users. A similar failure in the modular system, however, would affect only 100 users, and given that all 100 would not be trying to use the system during the repair period, only a fraction of the 10,000 users would be aware that the system was not available.
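The softswitch example reduces to simple arithmetic. The sketch below uses the 10,000-user and 100-module figures from the text. The function names are hypothetical, and the fraction of users who try to place a call during the repair period is an assumed value chosen only for illustration.

```python
# Illustrative sketch: names and the active-user fraction are assumptions.

def users_affected(total_users: int, modules: int, failed_modules: int = 1) -> int:
    """Users served by the failed modules when users are split evenly."""
    if modules < 1 or total_users % modules != 0:
        raise ValueError("users must divide evenly across at least one module")
    if not 0 <= failed_modules <= modules:
        raise ValueError("failed_modules must be between 0 and modules")
    return (total_users // modules) * failed_modules


def users_noticing(affected: int, active_fraction: float) -> float:
    """Expected number of affected users who actually try to use the system."""
    return affected * active_fraction


ACTIVE_FRACTION = 0.10  # ASSUMED share of users calling during the repair period

monolithic = users_affected(10_000, modules=1)
modular = users_affected(10_000, modules=100)

print(f"monolithic: {monolithic} affected, ~{users_noticing(monolithic, ACTIVE_FRACTION):.0f} notice")
print(f"modular:    {modular} affected, ~{users_noticing(modular, ACTIVE_FRACTION):.0f} notice")
```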
Managing Distributed Processing Systems
If an application is distributed over both multiple hardware platforms and multiple software modules, a mechanism is required both to detect failures and to transfer functionality to other resources. The elements in a distributed processing system are as follows:
• Resources: The actual hardware or software that provides the service for which the system is designed.
• Controllers: A software module that controls the system’s resources.
• Supervisor: A software module that monitors the individual system elements to determine whether they are functioning properly.
In actual practice, the controller and the supervisor functions are often combined because they are so tightly linked.
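The division of labor between supervisor and controller can be sketched as follows. This is a generic illustration, not the Netergy implementation: the class names, the heartbeat mechanism, the timeout and the server and PBX identifiers are all assumptions.

```python
import time
from typing import Callable, Dict, List


class Supervisor:
    """Monitors resources via heartbeats (an assumed detection mechanism)."""

    def __init__(self, timeout_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._timeout = timeout_seconds
        self._clock = clock
        self._last_seen: Dict[str, float] = {}

    def heartbeat(self, resource_id: str) -> None:
        """Record that a resource reported in just now."""
        self._last_seen[resource_id] = self._clock()

    def failed_resources(self) -> List[str]:
        """Return resources that have been silent longer than the timeout."""
        now = self._clock()
        return sorted(r for r, seen in self._last_seen.items()
                      if now - seen > self._timeout)


class Controller:
    """Owns the assignment of work items to resources."""

    def __init__(self, assignments: Dict[str, List[str]]) -> None:
        self.assignments = {r: list(items) for r, items in assignments.items()}

    def fail_over(self, failed: str) -> None:
        """Move work from a failed resource to the least-loaded survivors."""
        orphaned = self.assignments.pop(failed, [])
        if orphaned and not self.assignments:
            raise RuntimeError("no surviving resource to take over")
        for item in orphaned:
            target = min(self.assignments,
                         key=lambda r: (len(self.assignments[r]), r))
            self.assignments[target].append(item)


# Usage with a fake clock so the example runs instantly.
now = [0.0]
supervisor = Supervisor(timeout_seconds=5.0, clock=lambda: now[0])
controller = Controller({"server-a": ["pbx-1", "pbx-2"],
                         "server-b": ["pbx-3"],
                         "server-c": ["pbx-4"]})
for server in controller.assignments:
    supervisor.heartbeat(server)

now[0] = 4.0                       # server-a goes silent; the others report in
supervisor.heartbeat("server-b")
supervisor.heartbeat("server-c")

now[0] = 6.0                       # server-a has now exceeded the timeout
for failed in supervisor.failed_resources():
    controller.fail_over(failed)
print(controller.assignments)
```

Because failure detection feeds directly into reassignment, it is easy to see why the two roles are often merged into a single program.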
The Netergy Solution for High Availability
As noted above, there are essentially two types of resources that can fail, hardware and software, and two types of backup that are required for recovery, redundant hardware and the replication of configuration and process state data.
When designing the Netergy iPBX Server System, 8x8 engineers were faced with a choice between two different architectures: monolithic or distributed.
A monolithic design, one that places the entire capacity and functionality of the system in a single program running on a single processor, would offer the advantage of easy management. There is no coordination needed between many different elements to accomplish the system’s function. However, such a system would be expensive because a very large processor would be required to achieve adequate performance and a second, equally large processor would be needed to provide redundancy. Such systems also scale poorly. For example, if a single system could support 10,000 extensions (or users), the 10,001st extension would be an expensive one.
A distributed architecture, one that divides the functionality of the system up over several programs and several processing platforms, offered different challenges, primarily coordination. Many different processes would have to be synchronized to offer the required system functionality, and system configuration data would need to be shared between many different programs, which could be running on different processors.
A distributed solution offered distinct advantages, however. First, smaller, less expensive processors could be used, and each could serve as a backup for the others. For example, if a cluster of 10 processors was used, the work load from a failed processor would be distributed over the remaining nine, reducing performance only slightly (about 10% at peak usage times). Further, such an architecture would scale much more readily. Adding a processor to our 10-processor cluster would add 10 percent system capacity.
The Netergy iPBX Architecture
The Netergy iPBX Server System uses a distributed architecture for both hardware and software.
For the processor, 8x8 engineers chose Sun Microsystems’ Netra t1 Server. The Netra offered a number of advantages. First, the Netra t1 is a true carrier-grade system, with full NEBS (Network Equipment Building System) Level 3 compliance. Second, the price/performance ratio for the server is high, but the t1 does not incorporate hardware redundancy. To provide that redundancy, 8x8 uses a cluster of five Netra t1 Servers, which provide support for up to 100 iPBXs with up to 100 extensions each.
During normal operation, the PBX processing load is distributed evenly across all five systems. Should one of the systems fail, its load is redistributed among the remaining four systems. When the failed system is repaired or replaced, the PBX load is again distributed over five machines. The Netergy iPBX Server System can be scaled in the same way. When additional capacity is required (more PBXs or more extensions), another Netra t1 is added to the system and the additional load is then distributed over the new resource.
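The effect of losing or adding a server can be worked through with the figures given above (five servers, up to 100 iPBXs). The sketch is illustrative: the function names are hypothetical and it assumes every iPBX imposes the same load, which real traffic would not guarantee.

```python
from typing import List

# Illustrative sketch: assumes every iPBX instance imposes an equal load.

def distribute(instances: int, servers: int) -> List[int]:
    """Spread instances as evenly as possible; returns the count per server."""
    if servers < 1:
        raise ValueError("at least one server is required")
    base, extra = divmod(instances, servers)
    return [base + (1 if i < extra else 0) for i in range(servers)]


def load_increase_after_failure(servers: int) -> float:
    """Fractional rise in per-server load when one of `servers` equal servers fails."""
    if servers < 2:
        raise ValueError("need at least two servers to survive a failure")
    return 1.0 / (servers - 1)


print("normal operation: ", distribute(100, 5))   # five servers share 100 iPBXs
print("one server failed:", distribute(100, 4))   # four survivors absorb the load
print("one server added: ", distribute(100, 6))   # scaling out lowers the load
print(f"load increase per surviving server: {load_increase_after_failure(5):.0%}")
```

In a five-server cluster each survivor takes on a quarter more load after a failure. In a ten-server cluster the same failure costs each survivor about 11 percent.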
To take advantage of a distributed hardware architecture, a distributed software architecture is required. It is possible to divide the PBX functionality over several different modules and distribute the modules over the hardware platform. Modules might include a call routing module, call control module, feature module and so on. The issue with dividing the functionality in this manner is that each module represents a single point of failure. That is, if any one fails, the whole system becomes unavailable.
Rather than dividing the PBX functionality between modules, 8x8’s engineers instead implemented the entire PBX in one program and limited the number of extensions a single PBX program could support to 100. In this architecture, each PBX is considered an “instance” and can essentially be assigned to support a specific customer. Using this distributed capacity method has a number of advantages:
• Limiting extensions reduces the processing power required to support an individual PBX, so twenty can be supported on one modestly powerful server.
• Distributing PBX capacity rather than functionality eliminates single points of failure. Should one PBX fail, others would be unaffected.
• Scaling PBX capacity is very easy. One need only start a new PBX process to support additional customers, and when more processing resources are needed, add an additional server to the PBX cluster.
• Server failures can be managed effectively. Should a server fail, the PBX instances running on that server can be restarted on the other servers in a cluster. Since the incremental load on each remaining server is modest, users should see a minimal reduction in performance.
• As noted above, PBX instances can be dedicated to individual customers, which allows the opportunity for the customization of each instance.
This distributed capacity PBX is a unique solution to providing PBX functionality. In addition to having excellent high-availability characteristics, it makes possible the hosted PBX model pioneered by 8x8. In the hosted model, the PBX software runs on a server cluster located in a service provider’s central office or data center. This cluster can provide PBX functionality over a broadband IP network to hundreds of PBX service subscribers.
Because the cluster offers both hardware and software redundancy and because the server cluster is located in a climate-controlled data center with backup power, hosted PBX services can be both more cost effective and more reliable than if the PBX were located on the customer’s premises.
Supervising and Controlling a Distributed Capacity PBX
The most substantial challenge in developing a distributed capacity PBX is creating a supervisor/controller program that is capable of managing a hundred or more real-time processes. There are several tasks the controller must perform:
• Resource status -- Monitor hardware and software for failures
• Data replication -- Keep multiple copies of PBX configuration and call status data
• Resource management -- Link controllers and terminals to PBX instances, and distribute PBX instances over the server cluster to balance load.
To perform these functions, the Netergy iPBX incorporates a program called the SupraManager. SupraManagers run on each of the servers in a cluster, each providing backup for the others.
The SupraManagers monitor not only the iPBX instances on their own servers, but also the iPBX instances on the other servers. In addition, the SupraManagers all maintain copies of the configuration for each iPBX instance running on the cluster. Call state data is not maintained, however.
Behavior in Case of Failure
In case of an iPBX instance failure due to a runtime error such as a fatal error, program loop or memory leak, the SupraManager will halt and restart the failed instance. During restart, the
QUESTIONS
• When a single instance fails, does the SupraManager just restart it? Or does the SupraManager give its terminals to another, already running backup instance?
• How long does it take to start/restart an instance?
• When an instance restarts, doesn’t it have to reset the terminals to get them into a known state? Won’t that disrupt a stable call? Isn’t that a no-no?
• When an entire server fails, which SupraManager is responsible for reinstantiating the “displaced” iPBXs? I understand that all SMs watch all machines in a cluster, so presumably they all have the data necessary to restart the failed instances.
• I gather from JF that we do not replicate dynamic data (like call states). Is that what mandates resetting the terminals? Will SP4.0 eliminate this problem (true hot backup)?
• When a failed server is returned to service, I gather that the SupraManagers redistribute the iPBXs over the cluster at low load times. Is that true?
Summary
The Selection of Java as the Programming Language for Netergy
When designing the Netergy iPBX, 8x8’s system architects decided to implement the core iPBX in the Java programming language. The architects picked Java for a number of reasons, chiefly reliability and security rather than portability.
Running each instance of the iPBX in its own Java virtual machine or “sandbox” offers a number of advantages. First, every instance of the iPBX is completely insulated from every other instance, so a failure in one will never cause a failure in any other. This is so because the instances do not share system resources such as dynamic link libraries, device drivers or other operating system services: all of those resources are built into the Java virtual machine. This arrangement prevents dynamic data from one process, for example, from accidentally corrupting another process that uses the same library (and thus the same scratch pad in memory).
A second advantage is security. Because each iPBX instance is separate from all others, its configuration data is separate also, so it is much less likely that another user will inadvertently (or purposely) misconfigure someone else’s iPBX. From the user’s point of view, there is only one iPBX. The other instances are invisible because they exist in other virtual machines.
Java is also an inherently reliable language because the operating environment is highly structured. Such
QUESTIONS:
• What makes Java programs less likely to have memory leaks (some kind of framework structure?)
• Any other stuff I should know about Java and why it’s so cool?