Scalability, availability and maintainability

In an architecture of this size, service availability and ease of maintenance are essential. With so many components, any update may cause some downtime, and this was one of the major concerns during the development of the solution presented in this thesis.

Starting with scalability, which is the easiest task: adding a new Asterisk server should be simple. The solution is indeed simple and consists of no more than adding the server information to the database and bringing the server up. A server is added to the database through a very simple web interface in which the administrator must enter: the hostname (the name of the machine on which Asterisk is running), the server's IP address, the server number (recall that each server must have a unique number so that it can be mapped in the SIP URI) and the maximum number of users that the machine can hold (this value depends on the hardware used).

Illustration 16: Administrator interface: should display the status of the Asterisk servers currently online and allow adding/removing servers

using this new server to book conferences, generating SIP URIs with this new server number.
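The registration step described above can be sketched as follows. This is a minimal illustration, assuming an SQLite database; the table name, column names and the `add_server` helper are hypothetical and are not taken from the thesis. The only constraint taken from the text is that each server number must be unique.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative assumptions.
SCHEMA = """
CREATE TABLE IF NOT EXISTS servers (
    server_number INTEGER PRIMARY KEY,        -- unique number mapped into the SIP URI
    hostname      TEXT NOT NULL,              -- machine on which Asterisk runs
    ip_address    TEXT NOT NULL UNIQUE,
    max_users     INTEGER NOT NULL CHECK (max_users > 0)
)
"""

def add_server(conn, hostname, ip_address, server_number, max_users):
    """Insert a server record; return False if a constraint is violated."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO servers (server_number, hostname, ip_address, max_users) "
                "VALUES (?, ?, ?, ?)",
                (server_number, hostname, ip_address, max_users),
            )
        return True
    except sqlite3.IntegrityError:
        # Duplicate server number or IP address, or an invalid user limit.
        return False

conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
add_server(conn, "asterisk-1", "10.0.0.1", server_number=1, max_users=200)
```

Enforcing uniqueness in the schema rather than in the web interface means a second server can never be registered under a number that is already encoded in existing SIP URIs.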

Providing full service availability is trickier but very important, so considerable effort must go into achieving it. There are several ways of detecting that an Asterisk server is down, and the system reacts in a way that makes the service unavailable, at worst, for a couple of seconds.

The following scenarios may occur:

Consider an Asterisk server (named X) that is not holding any conferences and suddenly crashes or becomes unavailable. It will obviously stop updating the database, since the application is no longer running (every Asterisk server periodically, every 5 seconds, updates the database with the current timestamp). When a new call arrives at the SIP proxy load balancer, the proxy checks in the database (among other things) the last timestamp of the server to which the user is trying to connect. If that timestamp is older than a certain threshold (15 seconds is currently used), the proxy assumes that the server has crashed and forwards the call to another Asterisk server (let us call it Y). Server Y detects that the call is meant for server X and tries to bridge to it. Since this operation fails, Y concludes that server X is down and "replaces" it. This replacement means that a new entry is added to the database stating that the server with the IP address of server X is down and that calls directed to it are now to be sent to server Y (this is called the substitution process). Once this is done, the call is established and the client will not even know that a server is down. Any new calls carrying the prefix of server X are now automatically directed to server Y, without the client having any idea that he or she is not actually connecting to the intended server.
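The liveness check and the substitution lookup described above can be sketched as follows. This is a simplified in-memory model, not the thesis implementation: the function names and the two dictionaries standing in for the database are assumptions, and the step in which server Y first tries to bridge to X before recording the substitution is reduced to a single assignment. The 5-second heartbeat and the 15-second threshold are the values given in the text.

```python
HEARTBEAT_INTERVAL_S = 5   # each server refreshes its timestamp every 5 seconds
STALE_THRESHOLD_S = 15     # threshold currently used, according to the text

def is_alive(last_seen, now, threshold=STALE_THRESHOLD_S):
    """A server counts as alive if its last heartbeat is no older than the threshold."""
    return (now - last_seen) <= threshold

def resolve_target(server_number, heartbeats, substitutions, now):
    """Pick the server that should receive a call addressed to server_number.

    heartbeats:    dict server_number -> last heartbeat timestamp in seconds
    substitutions: dict crashed server_number -> substitute server_number
    """
    # An existing substitution entry wins: calls go straight to the substitute.
    target = substitutions.get(server_number, server_number)
    if is_alive(heartbeats.get(target, float("-inf")), now):
        return target
    # Otherwise forward the call to any other server with a fresh heartbeat.
    for candidate, last_seen in sorted(heartbeats.items()):
        if candidate != server_number and is_alive(last_seen, now):
            return candidate
    return None  # no server is available

heartbeats = {1: 100.0, 2: 118.0}   # server 1 stopped updating at t=100
substitutions = {}
target = resolve_target(1, heartbeats, substitutions, now=120.0)
if target is not None and target != 1:
    substitutions[1] = target        # the "substitution process"
```

After the substitution entry exists, later calls with the prefix of server 1 are resolved directly to the substitute without scanning the other servers.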

A slightly different scenario occurs when a server that is holding some conferences crashes. Naturally, all its calls are dropped, since RTP traffic no longer flows between client and server. From this point on, the flow of events is the same as the one explained earlier: a user calls in to the crashed server and is redirected to a different server, which becomes the substitute server, and the call is established on the substitute server.

Finally, the most complex situation arises when a server X that has bridges to other conferencing servers crashes. In this situation, again, all the clients connected to the server are dropped, as are the IAX connections between the servers. However, the bridged servers (for instance, Y and Z) will retry to establish the connection (the Asterisk servers should retry, since the connection may be down for a number of reasons). When this fails, the bridged servers will try to replace the crashed server (if there is more than one server connected, only

Illustration 17: Flow of events when a user joins a conference which causes the replacement of a server which has just crashed

However, imagine the following scenario:

1. A conference is booked to server X.

2. At a certain point, server X becomes too loaded and a new call is established on server Y, which bridges the conference to server X.

3. After some time, server Y becomes too loaded and a new call is established on server Z, which bridges the conference to server X.

4. Server X crashes.

Illustration 18: First picture describes the servers configuration when a conference is being bridged to 2 servers: after the crash of asterisk-X, userB gets disconnected, asterisk-Y becomes the substitute server and creates a bridge to asterisk-Z. UserB dials in again and gets connected directly to asterisk-Y

As described previously, either server Y or server Z will replace server X. Still, some users connected to the same conference on servers Y and Z are unable to talk to each other, since they were connected to each other through server X. To solve this issue, the following rule must be added to the substitution process:

– Whenever a server substitutes for a crashed server, it must check for any bridges that existed before the crash and try to recreate them.

Since information regarding bridges is kept in the database, the replacement server simply needs to look up the bridges that originated from server X. Having obtained this information, it is then only a matter of making new IAX connections to the conferences and servers specified in the database and updating that same information.
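The bridge recreation rule can be illustrated with a small sketch. The `Bridge` record and the `recreate_bridges` function are hypothetical. The sketch assumes that each bridge is stored with the server that originated it and the server at the other end, and it only rewrites the stored information, whereas the real system would also open the new IAX connections.

```python
from typing import NamedTuple

class Bridge(NamedTuple):
    conference: str
    origin: int   # server number that originated the bridge
    peer: int     # server number at the other end

def recreate_bridges(bridges, crashed, substitute):
    """Return the bridge table after `substitute` takes over from `crashed`."""
    updated = []
    for b in bridges:
        if b.origin != crashed:
            updated.append(b)  # unrelated bridge, keep as is
        elif b.peer != substitute:
            # Re-home the bridge on the substitute (a new IAX connection in practice).
            updated.append(Bridge(b.conference, substitute, b.peer))
        # A bridge from the crashed server to the substitute itself is dropped:
        # the substitute now holds that conference directly.
    return updated

# Server 1 (X) was bridged to servers 2 (Y) and 3 (Z); Y becomes the substitute.
before = [Bridge("conf-42", 1, 2), Bridge("conf-42", 1, 3)]
after = recreate_bridges(before, crashed=1, substitute=2)
# after == [Bridge("conf-42", 2, 3)]: Y is now bridged directly to Z
```

In the scenario above, this is what allows the users on servers Y and Z to keep talking to each other after X has crashed.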

Illustration 19: Flow of messages exchanged when an asterisk server holding bridges goes down. Asterisk-X crashes causing userB to get disconnected and destroying the bridges from asterisk-X to Y and Z. Asterisk-Y will manage to become the substitute server, creating a bridge to asterisk-Z and keeping the conference running. When userB dials in again, he will join server asterisk-Y

because there is a chance that two users trying to join the same conference actually end up on different servers. This can happen for the following reason: an astman books a conference Z to run on server X, which is later replaced by server Y. A user connects to conference Z, which is then held on server Y. Then server X is put back up and a new user tries to connect. This new user will (logically) end up on server X, but no bridge will be created, since that Asterisk server does not know that the conference is running on another server. So the idea is that a server can only be put back up when no conferences that were supposed to run on it are still running on a substitute server (whether the server crashed or was taken down for maintenance).

This last paragraph leads into the topic of server maintainability. In the computing world it is quite common that a server X needs to go down for maintenance, a software upgrade or anything else that makes a service unavailable for a certain amount of time. This issue is also addressed in this thesis and is handled in the following way: when the administrator decides that server X needs to go into maintenance, he or she accesses the administrator web interface and deactivates the server. Deactivating it causes some database changes, which mean that no more conferences will be booked to this server, no new users will be connected to it from that point on, and another Asterisk server Y will become a substitute server for X. However, the server cannot be shut down immediately, since there are still ongoing conferences and users connected to it.

If a user decides to join a conference that was booked for server X, the user will end up on server Y. If the conference is still running on server X, a bridge is created to the existing conference on server X. Otherwise, the normal process of joining a conference takes place. The users connected to server X will eventually leave their conferences and, since the maximum duration allowed for a conference is 4 hours, this is also the maximum time it may take to bring a server down.
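The join decision during maintenance can be sketched as a small function. The name `plan_join` and the dictionary of conference hosts are assumptions made for illustration; the thesis keeps this state in the database.

```python
MAX_CONFERENCE_HOURS = 4  # from the text; also bounds how long draining a server takes

def plan_join(conference, booked_server, substitute, hosts_by_conference):
    """Return (landing_server, bridge_target) for a caller joining `conference`
    while `booked_server` is deactivated. bridge_target is None if no bridge is needed.

    hosts_by_conference: dict conference id -> set of server numbers that
    currently hold participants of that conference (hypothetical structure).
    """
    hosts = hosts_by_conference.get(conference, set())
    # New callers always land on the substitute, never on the deactivated server.
    if booked_server in hosts:
        # The conference is still running on the old server: bridge to it.
        return substitute, booked_server
    # The conference is not running on the old server: normal join on the substitute.
    return substitute, None

# Server 1 (X) is in maintenance, server 2 (Y) substitutes for it.
plan_join("conf-42", 1, 2, {"conf-42": {1}})   # lands on 2, bridged to 1
plan_join("conf-43", 1, 2, {})                 # lands on 2, no bridge
```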

Putting a server back up may also take some time, but all of this logic is implemented in Asterisk: it is not desirable that an administrator should have to check, at all times, how many users who should be connected to server X are connected to the replacement server Y, and bring the server back up only when that count reaches 0. So the idea is quite simple: all the logic is built into the Asterisk server and, when it is booted, it performs all the checks. It reads the database and becomes active once server Y is no longer holding conferences that were booked for server X. Since no conferences are booked to server X while it is down, bringing a server back up can also take, at most, 4 hours.
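The boot-time check can be expressed as a single predicate. This is a sketch under assumptions: `may_activate` is a hypothetical name, and the list of (booked server, running server) pairs stands in for the database query that the Asterisk server would perform when it starts.

```python
def may_activate(server_number, running_conferences):
    """Return True if `server_number` can safely become active again.

    running_conferences: iterable of (booked_server, running_server) pairs,
    one per conference currently in progress.
    """
    # The server must stay inactive while any conference booked for it
    # is still being held on another (substitute) server.
    return not any(
        booked == server_number and running != server_number
        for booked, running in running_conferences
    )

# A conference booked for server 1 is still running on substitute 2: stay down.
may_activate(1, [(1, 2), (3, 3)])   # False
# Only unrelated conferences are in progress: server 1 may become active.
may_activate(1, [(3, 3)])           # True
```

A booting server would repeat this check periodically until it returns True, which, given the 4-hour conference limit, happens after at most 4 hours.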
