An Architecture for Centralized SIP-based Audio Conferencing using Application Layer Multicast

José Simões¹, Ravic Costa¹, Paulo Nunes¹,³, Rui Lopes¹,², Laurent Mathy⁴

¹ Departamento de Ciências e Tecnologias da Informação - Instituto Superior das Ciências do Trabalho e da Empresa (ISCTE)

² Associação para o Desenvolvimento das Técnicas e Tecnologias de Informação (ADETTI)

³ Instituto de Telecomunicações (IT)

Lisboa, Portugal

{Jose.Simoes, Ravic.Costa, Paulo.Nunes, Rui.Lopes}

⁴ Computing Department - Lancaster University, Lancaster, United Kingdom

Abstract — Audio conferencing is an important aspect of Internet Telephony services. In this article, starting from a centralized conferencing architecture, we propose to employ application layer multicast for media distribution, using “agents” responsible for delivering streaming media to end-clients, with the aim of reducing both the traffic in the network and the server workload. In this model, we use the concepts of active client, which exchanges media directly with the server, and passive client, which only receives media from its agent. Our implementation uses the Real-time Transport Protocol (RTP) for data transmission and the Session Initiation Protocol (SIP) for signaling, which makes it compatible with most existing state-of-the-art hardware and software.

Keywords — Audio Conferencing; SIP; ALM; VoIP

I. INTRODUCTION

As Internet Telephony services spread, there is a need to bring traditional telephony functionality to this new environment, and conferencing is no exception. An audio conference call is a telephone call (IP or not) in which the calling party wants more than one called party to listen in to the audio. Conferencing is widely used today in many applications and scenarios, for entertainment, social, educational or business purposes. Our proposal targets large-scale applications with a large number of participants, where a few of them produce media (active participants) while the majority (passive participants) just listen to what is produced.

Conferencing models can be classified as centralized; full mesh; unicast receive and multicast send; multicast; and end mixing [1]. This classification is based on the topology of signaling, media delivery and architecture component relationships [1]. In this paper we consider the centralized conference scenario, where a server receives media streams from all participants, mixes them if needed, and redistributes the appropriate media stream(s) back to the participants. This model has the advantage that clients do not need to be modified to perform media mixing and transcoding.

In addition, it is relatively easy to support heterogeneous media clients [1]. Since it is difficult for each sender to subtract its own contribution, the server needs to create a customized stream for each of the active callers, e.g., [2]. Assuming that not all users are using the same media format, the server needs to decode the audio streams to a non-compressed audio format to mix them. After that, it encodes the mixed stream in the appropriate media format for each of the participants. This will lead to media distribution and server workload scalability problems limiting the number of participants in a conference call.
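The per-caller mixing described above can be sketched as follows. This is a minimal illustration, not the mixer of any particular media server: it assumes every stream has already been decoded to 16-bit linear PCM frames of equal length, and the names `mix_minus` and `clip16` are hypothetical.

```python
from typing import Dict, List

INT16_MIN, INT16_MAX = -32768, 32767


def clip16(sample: int) -> int:
    """Clamp a mixed sample to the 16-bit PCM range."""
    return max(INT16_MIN, min(INT16_MAX, sample))


def mix_minus(frames: Dict[str, List[int]]) -> Dict[str, List[int]]:
    """For each active caller, mix every frame except the caller's own.

    `frames` maps a caller id to one decoded PCM frame (assumption:
    all frames have the same number of samples).
    """
    if len({len(f) for f in frames.values()}) > 1:
        raise ValueError("all frames must have the same length")
    # Sum all contributions once, then subtract each caller's own samples.
    total = [sum(column) for column in zip(*frames.values())]
    return {
        caller: [clip16(t - s) for t, s in zip(total, samples)]
        for caller, samples in frames.items()
    }
```

Summing once and subtracting per caller keeps the mixing cost linear in the number of active callers, but the server still has to encode one customized stream per active caller, which is the workload the text refers to.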

To mitigate some of the limitations of the centralized conferencing model, notably the amount of traffic in the network and the server workload, we decided to study the impact of multicast for media distribution.

Because there is no globally deployed inter-domain multicast routing protocol at the network layer, our architecture uses Application Layer Multicast (ALM) for media distribution [3]. Another important requirement for our architecture is that it should be SIP-based [4]. This allows the use of popular software for both terminals and server (such as X-Lite, Kphone, SER, etc.). For media distribution, RTP is used over either unicast or ALM connections, depending on whether clients are instantaneously active or passive.


Many conference servers in the market today are H.323-based. However, SIP-based conferencing systems are gaining more and more enthusiasts [1]. As an example, in [1], a centralized SIP-based conferencing system provides a suitable multimedia conferencing platform that allows advanced scenarios and services (e.g., transcoding) without requiring that end systems are conferencing aware.

Another important aspect of audio conferencing is how the media is delivered. In this context, an overlay network architecture scheme for streaming media distribution is proposed in [3].

The use of dedicated unicast connections and ALM for audio-conferencing media distribution is proposed in the Application-Level Network Audio-Conferencing protocol (ALNAC) [5].

The architecture proposed in this paper (Section III) tries to overcome some problems identified in the proposals presented above while maintaining their major strengths. In [1], the SIP-based centralized server does not use application layer multicast for media distribution; therefore, there will be a large number of duplicated packets in the network for conferences with a high number of participants located far apart. The approach presented in [3] uses application layer multicast for media delivery, but it does not distinguish between types of clients according to their instantaneous activity. Therefore, the conference server processes all incoming RTP streams from all active clients and has to generate an outgoing RTP stream for each passive client, wasting CPU cycles on processing the passive clients' RTP streams.

The main differences between the proposed architecture in this paper and ALNAC [5] lie in the fact that our aim is to reduce the bandwidth and server workload whilst ALNAC focuses on the latency experienced by active participants. The other major difference is found in the mechanism for activeness classification: counting of media packets produced in our architecture (see Section IV.D) and Next-Speaker prediction in ALNAC [5].


The architecture proposed in this paper is based on four key elements:

Client – The entity that consumes and potentially produces the audio media (e.g., a softphone). It must support SIP for signaling and RTP for media transport. A Client is classified by its associated agent as either an Active Client (AC) or a Passive Client (PC), depending on whether it is contributing media to the conference or just listening.

Conference Server (CS) – Composed of a SIP server and a Media Server. It handles the conference signaling and performs the media mixing. The signaling module receives SIP requests from agents when clients want to join and leave conferences. The media mixing module receives and sends RTP media streams from and to participants (via their associated agent).

Agent (AG) – Agents are at the core of our architecture. They have some similarities with a SIP Proxy, as they receive SIP requests and responses and forward them to their parent in the tree when appropriate. Agents are also capable of producing SIP messages (e.g., supporting client status transitions, see Sections IV.B and IV.D).

Regarding media distribution, agents act as media reflectors, since they receive media from the dummy agent and relay it to their children (agents or passive clients). They also act as media proxies, as they receive media from ACs and forward it to the CS (see Fig. 1).

Another agent functionality is the classification of clients as active or passive based on the media that they instantaneously produce (see Section IV.B).

Fig. 1. Conference media distribution in the proposed architecture.

Dummy Agent (DAG) – Its main goal is to redistribute the mixed media to the other Agents, thereby acting as the ALM tree root. Since the DAG does not send RTP packets to the server, the stream it receives is the mix of all active participants, which is the one passed to the PCs (all of which use the same media format). The DAG is needed because each stream sent to an AC mixes only N-1 audio channels (where N is the number of ACs), since the AC's own contribution is left out. Such a stream is inadequate for PCs, which would not receive all contributions. From the CS point of view, the DAG is equivalent to any SIP-compliant client (e.g., a SIP softphone). Fig. 1 shows the media paths in the proposed architecture, highlighting the use of multicast trees at the application layer for media distribution to the PCs. The construction of these trees is outside the scope of this paper; protocols such as TBCP [6] could be used for this purpose.
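To make the effect of the ALM tree on the server load concrete, the following sketch counts how many copies of one mixed packet each node sends when the DAG is the root. The tree layout, the node names and the function `distribute` are assumptions made for illustration only; the paper does not prescribe a data structure.

```python
from typing import Dict, List, Tuple


def distribute(
    tree: Dict[str, List[str]],
    passive: Dict[str, List[str]],
    root: str,
) -> Tuple[Dict[str, int], List[str]]:
    """Relay one mixed packet from `root` down a static ALM tree.

    `tree` maps an agent to its child agents and `passive` maps an agent
    to its passive clients. Returns the number of copies each agent
    sends and the list of passive clients reached.
    """
    sends: Dict[str, int] = {}
    delivered: List[str] = []
    stack = [root]
    while stack:
        node = stack.pop()
        children = tree.get(node, [])
        clients = passive.get(node, [])
        # A reflector sends one copy per child agent and per passive client.
        sends[node] = len(children) + len(clients)
        delivered.extend(clients)
        stack.extend(children)
    return sends, delivered


tree = {"DAG": ["AG1", "AG2"], "AG1": ["AG3"], "AG2": [], "AG3": []}
passive = {"AG1": ["pc1", "pc2"], "AG2": ["pc3"], "AG4": [], "AG3": ["pc4", "pc5"]}
sends, delivered = distribute(tree, passive, "DAG")
```

In this hypothetical layout the CS sends a single mixed stream (to the DAG) for five passive clients, whereas a purely unicast server would send five; the replication work is spread over the agents.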


In this section we show how a user joins a conference call, changes its state, or leaves. Fig. 2 lists the control messages (other than SIP) that are required in the architecture (note that these messages are produced and consumed only by Agents, not by Clients or the CS). The design focuses on a solution that can be integrated seamlessly with existing open source and commercial SIP-based solutions for both media server and media client signaling.

Legend: A - Is first Client?, B - Yes or No, C - Start Conference, D - Agent Added, E - End Conference.

Fig. 2. Non-SIP conference control messages.

A. Joining a Conference

The joining process is very similar to a normal call. However, the agent must perform some actions to determine whether a user is active or passive when it joins. Fig. 3 presents the messages exchanged when a user tries to join. When the Agent receives the INVITE request, if it is not already engaged in a conference, it asks its parent in the ALM tree the same question, and so on until the DAG is reached. It is important to note that the first participant to join a conference is considered to be in the active state (i.e., an AC). The client state information received in UDP control message 1, namely the IP address and RTP port, is kept in the AG. These details are dynamic: they can be stored in the active clients table at one time and in the passive clients table at another, depending on the client's current state. The AG uses them to forward the media to the appropriate client. When starting the conference, before relaying UDP control message 2 to the child AG, the AG adds this child to a passive clients list, where its IP address and RTP port are also stored. In Fig. 3, the modified INVITE message contains the Session Description Protocol (SDP) information (IP address and RTP port) of the AG instead of the Client.
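The SDP substitution in the modified INVITE can be illustrated with a small sketch. It assumes a simple SDP body with one session-level IPv4 `c=` line and one `m=audio` line; the function names and the `clients` table are hypothetical, and a real agent would rely on a full SIP/SDP stack rather than string handling.

```python
from typing import Dict, Tuple

Endpoint = Tuple[str, int]  # (IP address, RTP port)


def parse_audio_endpoint(sdp: str) -> Endpoint:
    """Extract the IPv4 address and audio RTP port announced in an SDP body."""
    ip, port = None, None
    for line in sdp.splitlines():
        if line.startswith("c=IN IP4 "):
            ip = line.split()[2]
        elif line.startswith("m=audio "):
            port = int(line.split()[1])
    if ip is None or port is None:
        raise ValueError("SDP lacks an IPv4 connection line or an audio line")
    return ip, port


def rewrite_audio_endpoint(sdp: str, agent: Endpoint) -> str:
    """Return the SDP with the agent's address and port in place of the client's.

    Only the c= and m=audio lines are touched; the o= line is left as is.
    """
    out = []
    for line in sdp.splitlines():
        if line.startswith("c=IN IP4 "):
            line = "c=IN IP4 " + agent[0]
        elif line.startswith("m=audio "):
            parts = line.split()
            parts[1] = str(agent[1])
            line = " ".join(parts)
        out.append(line)
    return "\r\n".join(out) + "\r\n"


def handle_invite(clients: Dict[str, Endpoint], client_id: str,
                  sdp: str, agent: Endpoint) -> str:
    """Remember where the client expects media, then advertise the agent instead."""
    clients[client_id] = parse_audio_endpoint(sdp)
    return rewrite_audio_endpoint(sdp, agent)
```

Storing the client's original endpoint is what later lets the agent forward media to that client, while the CS only ever sees the agent's address.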

B. Changing Activity State

Handling the transition from one state to the other is one of the Agent's responsibilities. The mechanism used to detect the changes is explained in Section IV.D. The message flow works as follows (see Fig. 4).

In the passive state, the media sent by the Client is only used for calculating its state, and not for the conference itself, as the media packets are discarded in the Agent and not passed to the CS. As for the AC, it contacts the Agent directly which is responsible for forwarding the media to the CS using a unicast connection and not the ALM tree.

The Agent is the entity that decides whether a Client should change its state. The Agent implements this state change by sending an INVITE message. This INVITE message, generated by the Agent, conveys the new settings (the IP address and RTP port number of either the Conference Server or the Agent) that enforce the associated change in the media distribution path.

Fig. 3. Message exchange for the joining process.


C. Leaving a Conference

Before leaving a conference gracefully, the client sends a BYE message to the Agent. If the Client is active, the request is forwarded to the CS; otherwise, the Agent is responsible for generating the appropriate responses (TRYING and OK). The message flow is presented in Fig. 5.

If the Client leaving the conference is the last passive client registered in the AG, the AG sends a UDP control message 4 to its parent AG to stop sending media in this ALM branch. This process is recursive between Agents in the tree. If it reaches the tree root (i.e., the DAG), and the root has no further child AGs or active clients, the conference is closed.
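The recursive pruning described above can be sketched as follows. This is an illustration under stated assumptions rather than the prototype's code: the `Agent` class and the event strings are hypothetical, and the sketch additionally assumes that an agent only asks to be pruned when it has no child agents left to feed.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass(eq=False)  # identity-based equality: agents are nodes, not values
class Agent:
    name: str
    parent: Optional["Agent"] = None
    active: Set[str] = field(default_factory=set)
    passive: Set[str] = field(default_factory=set)
    children: List["Agent"] = field(default_factory=list)


def passive_client_leaves(agent: Agent, client: str) -> List[str]:
    """Remove a passive client and prune empty branches towards the root."""
    events: List[str] = []
    agent.passive.discard(client)
    node = agent
    # Walk up while the current agent no longer needs the mixed stream.
    while node.parent is not None and not node.passive and not node.children:
        parent = node.parent
        parent.children.remove(node)
        events.append(f"{node.name} -> {parent.name}: stop branch")
        node = parent
    # At the root (the DAG): close when nothing is left to serve.
    if node.parent is None and not node.children and not node.active:
        events.append("conference closed")
    return events
```

The "stop branch" event stands in for the UDP control message sent to the parent AG; what "closing the conference" involves on the SIP side is not modelled here.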

D. Mechanism for Deciding Client State Transitions

During a conference call, clients can change their state according to their participation in the call. The entity that measures this activity is the Agent, and the procedure differs depending on whether the client is active or passive. A client state change implies an update of the passive and active clients tables in the associated agent.

Passive to Active Transition - In the passive state, the Agent counts the number of RTP packets sent by the client in a time interval TP, and if this number exceeds a threshold value PP, a transition from passive to active occurs. TP should be as small as possible; otherwise, a large quantity of media will be lost. As for PP, its optimal value would be zero, because the slightest activity by the speaker would then trigger the transition, resulting in less packet loss. However, clients periodically send “keep-alive” RTP packets when no media is being sent (typically every 5 seconds), which is why the condition PP > 1 must be satisfied. The smaller TP is, the smaller PP should be to improve the detection mechanism.
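A minimal sketch of this counting rule is given below. It assumes fixed, non-overlapping windows of length TP and integer millisecond timestamps; the class name `PassiveMonitor` is hypothetical, and the prototype's actual timer mechanism may differ.

```python
from typing import Optional


class PassiveMonitor:
    """Counts RTP packets per TP window for one passive client."""

    def __init__(self, t_p_ms: int, p_p: int) -> None:
        self.t_p_ms = t_p_ms  # window length TP, in milliseconds
        self.p_p = p_p        # threshold PP
        self._window: Optional[int] = None  # index of the current window
        self._count = 0

    def on_packet(self, now_ms: int) -> bool:
        """Record one packet; return True if the client should become active."""
        window = now_ms // self.t_p_ms
        if window != self._window:
            # A new TP window has started: reset the counter.
            self._window = window
            self._count = 0
        self._count += 1
        return self._count > self.p_p


# Example values: TP = 100 ms, PP = 2, speech packets every 20 ms
# (the 20 ms spacing is an assumption about the codec packetization).
monitor = PassiveMonitor(t_p_ms=100, p_p=2)
decisions = [monitor.on_packet(t) for t in (300, 320, 340)]
```

With these values the third speech packet within a window triggers the transition, while an isolated keep-alive packet every 5 seconds never raises the count above one.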

Fig. 5. Message exchange for the leaving process.

Active to Passive Transition - For the active to passive transition, a larger time interval TA is considered, and the transition occurs when the number of packets sent is less than PA. One reason for the larger interval is related to the normal SIP message flow: if this time is very small, two consecutive transitions (active to passive and passive to active) can occur in a short period of time, which may generate two parallel INVITE messages. This is forbidden by SIP, which specifies that “a User Agent Client MUST NOT initiate a new INVITE transaction within a dialog while another INVITE transaction is in progress in either direction” [4]. Moreover, in a conversation, even an active participant has periods of silence, which is another reason why TA should be larger. Once again, the PA value must take into account that clients send “keep-alive” packets: if R is the “keep-alive” rate, then PA > R * TA.
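The constraint PA > R * TA and the window check can be sketched as below. The function names are hypothetical, R is taken as the inverse of the keep-alive period, and the window is assumed to be evaluated once it has closed; the values in the example (5 s keep-alive period, TA = 3 s, PA = 3, packets every 20 ms while talking) are illustrative.

```python
from typing import Iterable


def keepalive_floor(keepalive_period_s: float, t_a_s: float) -> float:
    """Expected keep-alive packets in one TA window: R * TA with R = 1 / period."""
    return t_a_s / keepalive_period_s


def p_a_is_valid(p_a: int, keepalive_period_s: float, t_a_s: float) -> bool:
    """Check the condition PA > R * TA."""
    return p_a > keepalive_floor(keepalive_period_s, t_a_s)


def should_become_passive(packet_times_ms: Iterable[int],
                          window_start_ms: int, t_a_ms: int, p_a: int) -> bool:
    """True if fewer than PA packets arrived in [window_start, window_start + TA)."""
    window_end = window_start_ms + t_a_ms
    count = sum(1 for t in packet_times_ms if window_start_ms <= t < window_end)
    return count < p_a


# A client talks from 0 s to 3 s, then only sends a keep-alive at 5 s.
packets = list(range(0, 3000, 20)) + [5000]
still_active = not should_become_passive(packets, 0, 3000, 3)
now_passive = should_become_passive(packets, 3000, 3000, 3)
```

With a 5 s keep-alive period and TA = 3 s, R * TA is 0.6, so PA = 3 satisfies the condition with some margin: a client that only sends keep-alives is demoted, while a talking client is not.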


The only architecture components that needed coding are the Agents and the DAG. These entities are capable of understanding SIP messages and redistributing RTP traffic, as well as communicating with other Agents by using our own set of UDP control messages (see Fig. 2).

In our particular case, we used the SIP Express Router (SER) [7] and SIP Express Media Server (SEMS) [2] as our SIP Server and Media Server respectively, forming the CS. It is important to notice that in our prototype, at this stage, we are not considering security or authentication issues.

In the tests performed with our implementation, a static ALM tree was used. Furthermore, we tested the ACs (X-Lite softphones) with three media formats, namely G.711u, G.711a and GSM. For the PCs, G.711u was used.

Fig. 6 presents an example of the state classification mechanism used in our tests, with parameters TP = 100 ms, PP = 2, TA = 3 s and PA = 3, in which the PC started to speak at t = 0.3 s and stopped at t = 3 s. The TP value was chosen so that it would not compromise the CPU and memory usage of our machines, as the time windows were implemented using software interrupts. In our model a “keep-alive” period of 5 seconds was used (configured in the X-Lite softphones).


Fig. 6. Example of client state transitions.


The presented architecture has been successfully implemented, deployed and tested, providing a SIP-based conferencing platform that uses application layer multicast for media distribution. We observed that this architecture reduces the number of packets traveling on the network when compared with the traditional conference approach. This is in line with our goal of improving the performance of the CS for a high number of participants when most of them are consistently in the passive state. The improvement comes from reducing the CPU and memory usage in the CS, which allows more users in the system.

We plan to enhance our prototype in two ways. Firstly, we intend to test our implementation with a large number of users, using a SIP traffic generator such as the SIPp tool [8] to simulate them. Secondly, we are considering the use of an algorithm to choose an optimal tree in a dynamic way.

Finally, we would like to stress that the proposed architecture can be used for other types of interactive data in conference scenarios such as a videoconference.


ADETTI / ISCTE and Lancaster University are part of the E-NEXT Network of Excellence. The architecture was implemented and tested using a testbed provided by E-NEXT.


[1] Singh, K., Nair, G., Schulzrinne, H.: Centralized Conferencing using SIP. Columbia University. (2001)


[3] Patrikakis, C.; Despotopoulos, Y.; Rompotis, A.; Minogiannis, N., Lambiris, A., Salis, A.: An implementation of an overlay network architecture scheme for streaming media distribution. Telecommunications Laboratory of the NTUA. (2003)

[4] Rosenberg, J.; Schulzrinne, H.; Camarillo, G.: “SIP: Session Initiation Protocol”; RFC 3261; IETF; June 2002

[5] Blundell, N.; Egi, N.; Mathy, L.: “Audio Conferencing over Application Level Multicast”, In Proc. of Workshop on Multimedia Systems and Networking (WMSN’06); Phoenix, AZ, USA; April 2006

[6] Mathy, L.; Canonico, R.; Hutchison, D.: “An Overlay Tree Building Control Protocol”; In Proc. of Networked Group Communication (NGC2001); London, UK; November 2001

[7] [8]
