VoIP in 3G Networks: An End-to-End Quality of Service Analysis


Renaud Cuny, Nokia Networks, P.O.Box 301, 00045 Nokia Group, Finland, renaud.cuny@nokia.com

Ari Lakaniemi, Nokia Research Center, P.O.Box 407, 00045 Nokia Group, Finland, ari.lakaniemi@nokia.com

Abstract-- This paper presents the results of a Quality of Service (QoS) study of VoIP service over 3G WCDMA networks, carried out with an end-to-end simulation platform. The simulations were run using the Adaptive Multi-Rate (AMR) speech codec at 12.2 kbit/s in combination with the RTP, UDP and IPv6 protocols. The simulated transmission path includes two radio links (uplink and downlink), each with its UTRAN radio access network, connected by a packet switched core network; several different radio transmission conditions are considered. Furthermore, RObust Header Compression (ROHC) is applied on both radio links. The results include buffering statistics, end-to-end delay estimates, and packet loss statistics.


During the last few years, voice services over data networks have gained increasing popularity. The rapid growth of Internet Protocol (IP) based networks, especially the Internet, has directed a lot of interest towards Voice over IP (VoIP).

VoIP technology has in some cases been used to replace traditional long-distance telephone technology in order to reduce costs for the end user. Naturally, to make VoIP infrastructure and services commercially viable, the Quality of Service (QoS) needs to be at least close to that provided by the Public Switched Telephone Network (PSTN). On the other hand, VoIP technology will bring the end user value added services that are currently not available in the PSTN.

On another front, current developments in cellular radio network technologies are paving the way towards IP capable radio networks. The so-called Third Generation (3G) cellular networks, developed and standardized by the Third Generation Partnership Project (3GPP), will provide IP over wireless services and therefore also enable VoIP. In current cellular systems, e.g. in GSM, the telephony service is based on a circuit switched approach. This service is highly optimized for the transmission of voice, and thereby provides good speech quality and good spectral efficiency. However, carrying VoIP will also be possible in 3G WCDMA networks, e.g. 3GPP release 5, and may be of special interest to mobile network operators for several reasons. Firstly, as the bandwidth for individual flows in the packet switched domain is not reserved in advance, the multiplexing effects should bring significant capacity savings. Secondly, the VoIP service will be supported by the Session Initiation Protocol (SIP), which is a text-based protocol, similar to HTTP and SMTP, for initiating interactive communication sessions between users [1]. Such sessions can include voice, but also e.g. video, chat, interactive games, and virtual reality. Finally, the convergence towards packet switched and IP technology may convince mobile operators to go for solutions that are truly all-IP in order to simplify network interconnection and network management.

Naturally, for wide end user acceptance and deployment, the VoIP service is required to provide perceived voice quality similar to that of current highly optimized GSM networks. The challenges in achieving this include typical VoIP related QoS problems, such as packet loss, delay, and delay variation (i.e. jitter), as well as the additional overhead brought by the VoIP protocol stack. Therefore the end-to-end VoIP QoS should be studied and evaluated carefully. As an example, it is likely that the packet switched technology, although managed by e.g. Differentiated Services [2], will generate more delay and jitter than the circuit switched technology. Further delay and jitter may be caused by packet segmentation in the radio interface. Since the end-to-end delay is likely to be close to the maximum delay that still provides acceptable conversational quality (around 250-300 ms [3]), extra attention needs to be paid to jitter: too much jitter in a voice stream may be problematic, since basic jitter compensation methods may not apply very well or may have limited effect. One important issue to investigate is therefore whether the jitter in 3G networks will have a negative impact on the voice quality perceived by the end user.

This paper is organized as follows. Section II describes in detail the end-to-end VoIP simulator used for this study and each of its components. Section III presents the simulation results, focusing on packet loss ratio and end-to-end delays. Finally, the conclusion in Section IV summarizes the main findings of this study and points out areas that could be investigated further.


Protocols used by VoIP over 3G can be roughly divided into two categories: signalling related protocols and media related protocols. Although the signalling protocols, such as SIP, are a very important part of a VoIP system, in this study we concentrate only on media related protocols and the transmission of media data.


To run the end-to-end simulations we developed a VoIP speech simulator application for modeling the telephony application and protocol layers from application down to IP and PDCP. The lower layers required for radio link and core network modelling were simulated using external simulation tools and the resulting network conditions were applied in the VoIP speech simulator using error pattern files. The different components of the simulation chain are described in detail in the following subsections.

A. Speech application

At the application level we assumed the use of the Adaptive Multi-Rate (AMR) speech codec, which is a mandatory codec for conversational speech services within 3G systems. For all simulation runs we selected the AMR 12.2 kbit/s mode with DTX functionality enabled, and employed the bandwidth efficient mode of the AMR RTP payload format. This implies that during talk spurts the source generates a 32-byte speech payload at 20 ms intervals, while during silence periods, due to DTX, we have a 7-byte payload carrying a Silence Descriptor (SID) frame at 160 ms intervals.
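The payload sizes quoted above can be checked from the bit budget of the bandwidth efficient payload format. The following is a minimal sketch of that arithmetic, assuming one frame per packet, a 4-bit codec mode request field, a 6-bit table-of-contents entry per frame, 244 speech bits for the 12.2 kbit/s mode and 39 bits for a SID frame. These bit counts are taken from the AMR codec and RTP payload format specifications rather than from this paper, and the function name is only illustrative.

```python
import math

# Assumed field sizes (bits) for the bandwidth efficient AMR RTP payload.
CMR_BITS = 4          # codec mode request, once per payload
TOC_ENTRY_BITS = 6    # one table-of-contents entry per speech frame
AMR_122_BITS = 244    # speech bits in one AMR 12.2 kbit/s frame
AMR_SID_BITS = 39     # bits in one AMR SID frame


def payload_bytes(frame_bits: int, frames_per_packet: int = 1) -> int:
    """RTP payload size in whole bytes (padded up to a byte boundary)."""
    total_bits = CMR_BITS + frames_per_packet * (TOC_ENTRY_BITS + frame_bits)
    return math.ceil(total_bits / 8)


speech_payload = payload_bytes(AMR_122_BITS)  # payload during talk spurts
sid_payload = payload_bytes(AMR_SID_BITS)     # payload during silence
print(speech_payload, sid_payload)
```

Under these assumptions the sketch reproduces the 32-byte and 7-byte payloads used in the simulations.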

We further assumed the typical VoIP protocol stack employing the Real-Time Transport Protocol (RTP) encapsulated in the User Datagram Protocol (UDP), which is in turn carried by IP. The combination of these protocols introduces a total of 40 bytes of header data when using IP version 4 (IPv4), and 60 bytes when using IP version 6 (IPv6). We selected IPv6, which has two implications: the size of an IP packet carrying one AMR frame will be either 92 bytes (speech) or 67 bytes (SID), and we need to enable the UDP checksum, because the IPv6 header does not include a checksum of its own but the most critical fields of the header are covered as part of the UDP pseudo header.
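The packet sizes follow directly from the fixed header lengths. A small sketch of the calculation, assuming the minimum header sizes without options or extensions (20 bytes for IPv4, 40 bytes for IPv6, 8 bytes for UDP, 12 bytes for RTP); the helper names are illustrative only:

```python
# Minimum header sizes in bytes (no IP options, no RTP extensions or CSRCs).
IPV4_HEADER = 20
IPV6_HEADER = 40
UDP_HEADER = 8
RTP_HEADER = 12


def header_bytes(ip_header: int) -> int:
    """Combined IP/UDP/RTP header size."""
    return ip_header + UDP_HEADER + RTP_HEADER


def packet_bytes(payload: int, ip_header: int) -> int:
    """Size of the IP packet carrying one RTP payload."""
    return header_bytes(ip_header) + payload


for name, payload in (("speech", 32), ("SID", 7)):
    size = packet_bytes(payload, IPV6_HEADER)
    overhead = header_bytes(IPV6_HEADER) / size
    print(f"{name}: {size} bytes, header share {overhead:.0%}")
```

For the speech packets roughly two thirds of each uncompressed IPv6 packet is header data, which is the motivation for the header compression described in the next subsection.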

Protocol layers below IP follow the 3GPP release 5 specifications, as illustrated in Figure 1.

Figure 1: 3GPP protocol stack.

B. Robust Header Compression (ROHC)

When operating in bandwidth limited 3G networks it is important to use the radio band as efficiently as possible, and header overhead of up to 60 bytes can seriously degrade the spectral efficiency of a VoIP service over such a link. The RObust Header Compression (ROHC) protocol [4] has been developed to tackle this problem. ROHC provides link-based compression of IP/UDP/RTP headers, in the best case down to 1 byte. The compression makes use of the fact that the majority of the fields in the combined IP/UDP/RTP header either remain constant or change by a constant amount throughout a session. However, the maximum compression mentioned above can only be reached under some limitations; a more typical compressed header size is three or four bytes. The ROHC operation is based on synchronized compression (at the sender side) and decompression (at the receiver side) contexts. The decompression context is initialised by transmitting full IP/UDP/RTP headers at the beginning of the session. Irregularities in the transmitted stream, caused e.g. by DTX operation or a lost packet, can also lead to compressed headers slightly larger than in the optimal state. In error prone transmission conditions a feedback mechanism is an important part of robust compression operation, enabling recovery in case the synchronization between compressor and decompressor is lost. The ROHC protocol was implemented in our simulator.

ROHC in R-mode is assumed on both radio links, providing a feedback mechanism that enables safe convergence to the optimal compression state. We also assume that the ROHC Context Identifier is transmitted as part of the compressed packet. These settings imply that the minimum size of a compressed IP/UDP/RTP header is four bytes.

C. Radio network modeling

The model for the radio network included the actual radio link, processing in the layers below PDCP, and access transport in UTRAN. The radio link error patterns were prepared using a separate WCDMA system simulator. Three different radio conditions were investigated, introducing frame error rates (FER) of 1%, 3% and 5%. In addition, we included an error-free case in the set of simulation conditions. Separate error patterns were prepared for uplink (UL) and downlink (DL), and the error patterns were obtained from a traced terminal moving along a predefined route.

For the UL radio network we assumed a processing and transport delay of 36 ms, and for the DL radio network the corresponding delay is 49 ms. Note that 36+49=85 ms is the lower limit for the time before the ROHC compressor can receive a feedback message from the decompressor regarding a specific packet. This delay is significant because at the beginning of a stream the ROHC decompressor context needs to be initialized by sending full headers, and these are sent until a feedback message indicating successful decompressor context initialization is received. A similar situation can also occur if the decompression context gets corrupted for some reason, e.g. a hard handover or an excessive amount of transmission errors. However, for this work we assumed that no ROHC decompressor context re-initialization is required during a session.
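To give a feeling for the effect of the feedback delay, the sketch below estimates the minimum number of packets that leave the compressor with full headers at the start of a talk spurt. It assumes that packets are generated every 20 ms starting at time zero, that feedback for the first packet arrives after exactly the 85 ms lower limit, and that queueing and segmentation delays are ignored, so the real figure in the simulator may well be larger. The function names are illustrative.

```python
import math

UL_DELAY_MS = 36        # UL radio network processing and transport delay
DL_DELAY_MS = 49        # DL radio network processing and transport delay
FRAME_INTERVAL_MS = 20  # one AMR frame per packet during talk spurts


def min_feedback_delay_ms() -> int:
    """Lower limit for the compressor to see feedback on a given packet."""
    return UL_DELAY_MS + DL_DELAY_MS


def min_full_header_packets(feedback_delay_ms: float,
                            interval_ms: float = FRAME_INTERVAL_MS) -> int:
    """Packets sent at 0, interval, 2*interval, ... before feedback arrives."""
    return math.ceil(feedback_delay_ms / interval_ms)


delay = min_feedback_delay_ms()
print(delay, min_full_header_packets(delay))
```

With these assumptions at least five packets are sent with full headers before the compressor can switch to compressed headers.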


[Figure 1 labels: protocol layers (Application; IP, PPP; Relay; GTP-U; L2; MAC) across the Uu, Gn and Gi interfaces.]


D. RLC, MAC and PDCP layers

The WCDMA unacknowledged mode is the natural choice for transmitting VoIP packets over the radio link. This mode makes it possible to segment and pad IP packets into radio Transmission Time Intervals (TTIs) to make the best possible use of the allocated radio resources. The radio bearer was configured at 16 kbit/s; with a TTI length of 20 ms this enables transmission of 40 bytes of user data at 20 ms intervals.
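The bearer dimensioning can be verified with a few lines of arithmetic. The sketch below assumes that the whole bearer rate is available for user data, i.e. RLC and PDCP overhead is ignored, and uses the 4-byte minimum compressed header and the 92-byte full-header speech packet mentioned earlier. The function names are illustrative.

```python
import math


def bytes_per_tti(bearer_bps: int, tti_ms: int) -> int:
    """User data bytes that fit in one TTI at the given bearer rate."""
    return bearer_bps * tti_ms // 8000


def ttis_needed(packet_bytes: int, tti_bytes: int) -> int:
    """TTIs needed to carry one packet that starts at a TTI boundary."""
    return math.ceil(packet_bytes / tti_bytes)


tti_capacity = bytes_per_tti(16_000, 20)       # 16 kbit/s bearer, 20 ms TTI
compressed_packet = 32 + 4                     # speech payload + ROHC header
full_header_packet = 92                        # speech packet, full headers

print(tti_capacity)
print(ttis_needed(compressed_packet, tti_capacity))
print(ttis_needed(full_header_packet, tti_capacity))
```

A compressed speech packet fits into a single TTI with four bytes to spare, whereas a packet with full headers occupies three TTIs.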

E. Packet switched domain modelling

In the packet switched domain we considered the following delay components: delay in the IP backbone, delay in the gateway elements (SGSN and GGSN), and delay in the IuPS interface. Typically, the backbone elements (IP routers) and gateways may introduce some jitter into VoIP traffic, depending on the load in the network. However, appropriate traffic prioritisation (e.g. based on Differentiated Services) can limit the queuing delay (and thus potential jitter) to specific values defined by the operator.

We modelled this kind of PS domain structure to generate a delay distribution file for a stream of 30 000 packets transmitted at 20 ms intervals. The resulting delay distribution is illustrated in Figure 2, and it introduces 19 ms average delay with 1.0 ms standard deviation. The minimum and maximum values for the delay are 12.4 ms and 23.7 ms, respectively.
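The delay distribution file itself is not reproduced here. As a rough stand-in, the sketch below draws 30 000 per-packet delays from a Gaussian with the reported mean and standard deviation, clipped to the reported minimum and maximum, and then summarizes them. The Gaussian shape, the seed and the function names are assumptions made purely for illustration; the distribution in Figure 2 came from the PS domain model and need not be Gaussian.

```python
import random
import statistics

MEAN_MS, STD_MS = 19.0, 1.0   # reported average and standard deviation
MIN_MS, MAX_MS = 12.4, 23.7   # reported minimum and maximum


def synthetic_delays(count: int = 30_000, seed: int = 1) -> list:
    """Clipped Gaussian delays in ms (assumed shape, not the paper's data)."""
    rng = random.Random(seed)
    return [min(MAX_MS, max(MIN_MS, rng.gauss(MEAN_MS, STD_MS)))
            for _ in range(count)]


def summarize(delays: list) -> dict:
    """Basic statistics of a per-packet delay trace."""
    return {
        "mean": statistics.mean(delays),
        "std": statistics.pstdev(delays),
        "min": min(delays),
        "max": max(delays),
    }


delays = synthetic_delays()
stats = summarize(delays)
print({key: round(value, 2) for key, value in stats.items()})
```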

Figure 2: Delay distribution in PS domain.

F. Buffering

Typically an audio playout device in the receiving terminal is synchronized to a local clock signal to make sure that there is always signal available for playback. In practice this implies that a new frame is required regularly at intervals determined by the frame rate. On the other hand, due to jitter the packets can arrive at the receiver at an irregular rate that is not synchronous with the playout. Therefore, buffering of speech packets is needed to ensure continuous data flow between the asynchronous input and the synchronous output. In VoIP this kind of jitter buffering plays an important part in the overall speech quality. The basic approach to jitter buffering is to wait for a predetermined time after the reception of the first packet before playing out the frame carried by this packet. The purpose of the playout delay is to allow some variation in the arrival times of subsequent packets. Frames arriving after their scheduled playout time are discarded, and from the speech decoder's point of view they are lost frames. Naturally, in this approach the predetermined buffering delay is the most important factor in the buffering performance: too short a buffering delay risks buffer underflows when packets do not arrive in time due to jitter, while too long a buffering time introduces unnecessarily long delay and can also cause buffer overflows.
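The fixed playout delay scheme described above can be sketched as follows. The function assumes that arrival_ms[i] is the arrival time of frame i, that no frames are missing, and that frame i is scheduled for playout at the arrival time of the first frame plus the playout delay plus i frame intervals. The arrival times in the example are made up for illustration.

```python
def late_frames(arrival_ms, playout_delay_ms, frame_ms=20):
    """Indices of frames that arrive after their scheduled playout time."""
    first_arrival = arrival_ms[0]
    late = []
    for index, arrival in enumerate(arrival_ms):
        # Frame 0 is played after the fixed delay, later frames follow
        # at regular frame intervals.
        deadline = first_arrival + playout_delay_ms + index * frame_ms
        if arrival > deadline:
            late.append(index)
    return late


# Hypothetical arrival times (ms): frame 3 is delayed by a burst of jitter.
arrivals = [0, 25, 41, 95, 96]
print(late_frames(arrivals, playout_delay_ms=10))
print(late_frames(arrivals, playout_delay_ms=30))
print(late_frames(arrivals, playout_delay_ms=40))
```

The example shows the trade-off stated in the text: a longer playout delay removes late losses at the price of a longer end-to-end delay.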

However, for this study we configured the jitter buffer in the receiving terminal in such a way that no frames were discarded, neither due to late arrival nor due to buffer overflow. The main reason for this choice was the aim to concentrate on the QoS issues that depend on the network.

When considering VoIP traffic over a wireless 3G network, it is not sufficient to buffer only in the receiving terminal. In this environment the most critical link between asynchronous input and synchronous output is actually between the PS core network and the DL radio network. At this point of the data path the units we are buffering are IP packets received from the packet switched core network, which will be forwarded to the radio path. Here we assume a slightly different buffering strategy from the one described above for jitter buffering in the receiving terminal: instead of relying on a long enough buffering delay, we use a FIFO buffer with limited size (expressed as the number of packets in the buffer) and specify a maximum time a packet can be stored in the buffer. That is, if a predetermined number of packets are already stored in the buffer, a new incoming packet will be dropped. And if a packet has been waiting in the buffer for longer than specified by the discard timer, it will be dropped to avoid accumulating delay for subsequent packets. However, to make sure that the large packets required for ROHC initialization get through without an unfeasibly large value for the discard timer, we made the assumption that a (tail of a) packet that has already been partially transmitted due to segmentation is never dropped, even if the timer has elapsed.
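The buffering rules above can be summarized in a small sketch. The class below is not the simulator code; it only illustrates the three rules (size limit, discard timer, and protection of partially transmitted packets). The class name, the buffer size of two packets and the 50 ms timer are hypothetical values chosen for the example, since the actual settings are not given here.

```python
from collections import deque


class DiscardFifo:
    """FIFO with a packet-count limit and a per-packet discard timer."""

    def __init__(self, max_packets, discard_timer_ms):
        self.max_packets = max_packets
        self.discard_timer_ms = discard_timer_ms
        self.queue = deque()   # entries: [arrival_ms, packet_id, bytes_sent]
        self.dropped = []      # ids of dropped packets, in drop order

    def enqueue(self, now_ms, packet_id):
        """Store a packet, or drop it if the buffer is already full."""
        if len(self.queue) >= self.max_packets:
            self.dropped.append(packet_id)
            return False
        self.queue.append([now_ms, packet_id, 0])
        return True

    def expire(self, now_ms):
        """Drop packets that waited too long and are not yet partly sent."""
        kept = deque()
        for entry in self.queue:
            arrival_ms, packet_id, bytes_sent = entry
            timed_out = now_ms - arrival_ms > self.discard_timer_ms
            if timed_out and bytes_sent == 0:
                self.dropped.append(packet_id)
            else:
                kept.append(entry)
        self.queue = kept


buf = DiscardFifo(max_packets=2, discard_timer_ms=50)
buf.enqueue(0, "p0")
buf.enqueue(20, "p1")
accepted = buf.enqueue(40, "p2")   # buffer full, p2 is dropped
buf.queue[0][2] = 40               # p0 has been partially transmitted
buf.expire(80)                     # p1 timed out; p0 is protected
print(accepted, buf.dropped, [entry[1] for entry in buf.queue])
```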

Although in general it might not seem sensible to perform buffering in the transmitting terminal for a VoIP application, we also need to consider buffering prior to UL transmission, because the strictly limited radio bandwidth is allocated according to optimally compressed headers while ROHC initialisation requires transmission of full IP/UDP/RTP headers. We apply a buffering mechanism similar to that described for DL, i.e. we specify a fixed size FIFO buffer with a discard timer to make sure that this bottleneck does not cause unfeasibly long delay.


G. Additional simulation settings

We used the same speech input sequence for all simulation runs. The sequence is approximately 6 minutes 30 seconds long and is an excerpt of a real discussion, so it has a realistic structure of alternating talk spurts and silence periods. The speech is in Finnish and was recorded in a low-noise office environment. The observed speech activity is approximately 50%.

We also repeated all simulation scenarios ten times with different randomly selected starting points in the radio link error pattern files and in the PS domain delay distribution file to make sure that the results are not affected by some local anomaly in the simulated network conditions.
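One simple way to realize such repeated runs is to rotate a pattern to a random starting point for each run. The sketch below assumes that the pattern wraps around at its end, which is an assumption made for illustration; how the end of a pattern file was handled is not stated here. The function names, the seed and the toy pattern are likewise illustrative.

```python
import random


def rotated(pattern, start):
    """Pattern read from index start, wrapping around at the end."""
    return pattern[start:] + pattern[:start]


def repeated_runs(pattern, runs=10, seed=0):
    """One rotated copy of the pattern per run, each with a random start."""
    rng = random.Random(seed)
    return [rotated(pattern, rng.randrange(len(pattern)))
            for _ in range(runs)]


# Toy radio error pattern: 1 marks a lost radio frame.
error_pattern = [0, 0, 1, 0, 0, 0, 0, 1, 0, 0]
variants = repeated_runs(error_pattern)
print(len(variants), rotated(error_pattern, 2))
```

Because every variant is a rotation of the same pattern, all runs see the same overall frame error rate and differ only in where the errors fall relative to the speech signal.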


Since a fixed-delay jitter buffering scheme was assumed in the receiving terminal, the total end-to-end delay is fixed throughout the session. However, because of the TTI structure, packet segmentation at the RLC level and ROHC behaviour, the network delay at packet level is not fixed throughout the connection. Packets too large to be carried by a single TTI need to be segmented over several TTIs, thus introducing a longer transmission delay over a radio link, in most cases in both uplink and downlink. Furthermore, some of the packets following the large packets are also segmented over two TTIs, although in principle they could fit into a single TTI, because they are not aligned with the TTI structure. The reason for this is that when these packets are obtained from the buffer, there is still some room in the tail of the current radio frame, and as much data as possible from the beginning of the next packet, if available, is carried there.
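The misalignment effect can be reproduced with a simplified model of the segmentation. The sketch assumes that packet i becomes available at the start of TTI i, that each TTI carries up to 40 bytes taken in FIFO order, and that RLC/PDCP overhead and buffer discards are ignored. The packet sizes (one 92-byte full-header packet followed by 36-byte compressed packets) are illustrative values derived from the figures given earlier.

```python
def ttis_spanned(packet_sizes, tti_bytes=40):
    """Number of TTIs touched by each packet under greedy FIFO filling."""
    remaining = list(packet_sizes)
    spans = [0] * len(packet_sizes)
    head = 0   # first packet that is not yet fully transmitted
    tti = 0
    while head < len(remaining):
        room = tti_bytes
        index = head
        # Fill this TTI from the queue; packet i is available from TTI i on.
        while room > 0 and index < len(remaining) and index <= tti:
            take = min(room, remaining[index])
            remaining[index] -= take
            room -= take
            spans[index] += 1
            if remaining[index] == 0:
                index += 1
        head = index
        tti += 1
    return spans


print(ttis_spanned([36, 36, 36]))        # aligned, compressed packets
print(ttis_spanned([92, 36, 36, 36]))    # one full-header packet first
```

In the second case the full-header packet occupies three TTIs, and the next two packets each straddle a TTI boundary even though they would fit into a single TTI if they were aligned.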

For these scenarios there are two causes of frame losses (packet losses): a packet can be lost on the radio path due to transmission errors, or a packet can be dropped due to buffering, either in the transmitting terminal, in the DL RNC or in the receiving terminal. We would like to point out one observation regarding frame losses: the observed packet loss rate in the radio link seems to be slightly higher than the nominal frame error rate specified for the error patterns, over all radio FER conditions. Because of the segmentation, the loss of a single radio frame can cause the loss of two packets: when a radio frame carrying data from two separate packets is lost, both these packets will be unusable and will be dropped by the receiver.
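The relation between lost radio frames and lost packets can be illustrated with a short sketch. It assumes that the set of TTIs carrying data of each packet is known and that a packet is unusable as soon as any one of those TTIs is lost. The mappings in the example are hypothetical.

```python
def lost_packets(packet_ttis, lost_ttis):
    """Indices of packets that had data in at least one lost radio frame."""
    return [index for index, ttis in enumerate(packet_ttis)
            if ttis & lost_ttis]


# Aligned case: every packet travels in its own TTI.
aligned = [{0}, {1}, {2}, {3}]
# Misaligned case: packets straddle TTI boundaries after a large packet.
misaligned = [{0, 1, 2}, {2, 3}, {3, 4}, {4}]

print(lost_packets(aligned, {3}))
print(lost_packets(misaligned, {3}))
```

With aligned packets one lost radio frame costs one packet, whereas in the misaligned case the same lost frame removes two packets.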

The simulation results are summarized in Table 1. The results include packet loss rate (PLR) and buffering time statistics, as well as end-to-end delays in the different scenarios. There is also a further breakdown of packet loss statistics into losses due to DL buffering and losses due to transmission errors on the radio path. Since we carry one AMR frame per packet, the FER at the speech decoder input equals the PLR. Note that losses in the UL terminal buffering are not presented in the table, but they are included in the total packet loss rate. Note also that the average network delay includes the jitter buffering time in the receiving terminal.

Table 1: Simulation results.

Radio link FER   PLR in DL buff   PLR on radio   Total PLR   Avg. DL buff delay   Avg. network delay
0%               0.02%            0%             0.06%       9.79 ms              221.96 ms
1%               0.02%            2.05%          2.08%       9.79 ms              221.96 ms
3%               0.02%            6.02%          6.08%       9.79 ms              221.96 ms
5%               0.02%            10.26%         10.31%      9.79 ms              221.96 ms

The overall frame error rate (FER) can be used as a rough objective estimate of speech quality. Typically, with the AMR codec the speech quality can still be considered good when the FER is around 1-2%, but it should be noted that the distribution of frame losses also has an effect on the subjective speech quality.


Our end-to-end Quality of Service analysis shows that 3GPP networks will be able to offer an adequate level of quality for Voice over IP (VoIP) services. The difference in QoS compared with current voice service technology (CS voice) is very small: the additional packet loss ratio introduced by packet switched characteristics is less than 1%, whereas the average end-to-end network delay is expected to be around 220 ms.

The enabling features for the obtained quality level are summarized below:

- WCDMA unacknowledged mode in radio.

- ROHC at the PDCP layer, which allows the use of a limited bandwidth radio bearer (16 kbit/s).

- Relevant buffering limits and discarding rules in the PDCP buffer (DL) and in the transmitting terminal to avoid potential cumulative delay and jitter.

- Differentiated Services support in the core network and backbone to ensure minimal buffering delay in the packet switched domain.

Nevertheless, there are a few other important aspects that require further investigation in order to determine whether VoIP services will be quickly deployed in 3G networks.

1. The User Equipment (UE) may contribute to the mouth-to-ear delay: the processing time needed to compress the VoIP headers may not be negligible. Also, because the first few packets during the ROHC initialisation phase are transmitted with full headers, the UE may require a special buffering mechanism in order to minimize the delay.

2. The radio capacity needed to transfer VoIP flows is slightly higher than the capacity needed for sending circuit switched voice frames, even with header compression. A detailed analysis that takes pricing into account would be useful to determine whether offering VoIP services in 3G networks is efficient and attractive from an operator's business perspective.


The authors wish to thank Zhi-chun Honkasalo and Mattias Wahlqvist for their frequent feedback throughout this study. Mika Kolehmainen and Outi Hiironniemi also contributed to this work by providing support for radio link error and PS domain delay modelling.


[1] IETF Session Initiation Protocol (SIP) Working Group, http://www.ietf.org/html.charters/sip-charter.html

[2] IETF Differentiated Services (DiffServ) Working Group, http://www.ietf.org/html.charters/diffserv-charter.html

[3] ITU-T Recommendation G.114, “One-way transmission time”, 05/2000

[4] RFC 3095, “RObust Header Compression (ROHC): Framework and four profiles: RTP, UDP, ESP, and uncompressed”, July 2001




