VoIP Playout Buffer Adjustment using Adaptive Estimation of Network Delays

Miroslaw Narbutt and Liam Murphy*

Department of Computer Science, University College Dublin, Belfield, Dublin 4, IRELAND

Abstract

The poor quality of Voice over IP can be improved by adaptive playout buffering at the receiver. This technique dynamically adapts the playout deadline to network conditions, thus minimizing both late packet loss and buffering time. A standard playout buffer strategy uses an estimate (Exponentially Weighted Moving Average) of the mean and variance of network delay to set the playout deadline. This estimation is characterized by a fixed, constant weighting factor. We show that tuning of this parameter so that the strategy works very well for all network conditions is not feasible. Therefore we propose to extend this standard buffer strategy by replacing the fixed, constant weighting factor with a dynamic one. In our solution, the weighting factor is dynamically adjusted according to the observed delay variations. When these variations are high (which implies that the network conditions are changing), the parameter is set low, and vice-versa. This allows rapid adaptation to network variations and reduces the frequency of late packets (or buffering time). Simulations and experimental results show that with our strategy, the trade-off between buffering delay and late packet loss at the receiver is improved significantly.

1. INTRODUCTION

A typical VoIP application buffers incoming packets and delays their playout in order to compensate for variable network delays (jitter). This allows the slowest packets to arrive in time to be played out. Fluctuating end-to-end network delays may cause playout times to grow to a level that is irritating to users (when the buffer is too big) or may cause packets to be lost because they arrive too late (when the buffer is too small). The two conflicting goals of minimizing buffering time and minimizing late packet loss have engendered various playout algorithms. Adaptive buffering is needed when the end-to-end delay is high (close to or above the interactivity constraint of 100-150 ms) and when the delay is unknown, so that the receiver does not know how to select appropriate playout times [1]. An adaptive playout mechanism makes it possible to balance the length of the buffer, a major contributor to end-to-end delay, against the possibility of packet loss. Generally, a good playout algorithm should achieve the best possible trade-off between loss and delay. In this paper we present a new playout buffer algorithm that significantly improves this trade-off.

Section 2 presents the motivation for our work and outlines the basic idea of the proposed algorithm. Section 3 describes the new algorithm and its potential improvements. Its effectiveness is then evaluated through simulations using a network emulator (Section 4) and through experiments on a real network (Section 5). Section 6 addresses the effects of the new buffering scheme on subjective quality. Finally, Section 7 draws conclusions.

2. MOTIVATION

Most of the adaptive playout algorithms described in the literature continuously estimate the network delay and its variation in order to dynamically adjust the talkspurt playout time. The standard adaptive playout algorithm [2] is based on Jacobson's work on TCP round-trip time estimation [3]. The algorithm estimates two statistics, the delay itself and its variance, and uses them to calculate the playout time. Both estimates have the form:

$$\hat{d}_i = \alpha \cdot \hat{d}_{i-1} + (1-\alpha) \cdot n_i ; \qquad \hat{v}_i = \alpha \cdot \hat{v}_{i-1} + (1-\alpha) \cdot |\hat{d}_i - n_i| ;$$

where $\hat{d}_i$ and $\hat{v}_i$ are the i-th estimates of delay and its variance respectively, while $n_i$ is the i-th packet delay.

Parameter α has a critical impact on the rate of convergence of this estimation. Following the claim made in [2], and in accordance with NeVot [4], the weighting factor α is fixed and chosen to be high (α = 0.998002) to limit the sensitivity of the estimation to short-term packet jitter. By experimenting with different values of α we observed that such a high value of α is good only when network conditions are stable (delay and jitter are constant). When network conditions change rapidly (sudden increases or decreases in delay), smaller values of α (0.7, 0.8, 0.9) were more appropriate.
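The effect of α on convergence can be illustrated with a minimal Python sketch of the fixed-α estimator. The function name `ewma_estimates`, the choice to seed the estimate with the first observed delay, and the synthetic step trace are assumptions made for this example; they are not taken from [2] or from our measurements.

```python
def ewma_estimates(delays, alpha):
    """Fixed-alpha estimates of delay and its variation for a list of delays (ms)."""
    d_hat = delays[0]  # assumption: seed the delay estimate with the first sample
    v_hat = 0.0
    d_out, v_out = [], []
    for n in delays:
        d_hat = alpha * d_hat + (1.0 - alpha) * n
        v_hat = alpha * v_hat + (1.0 - alpha) * abs(d_hat - n)
        d_out.append(d_hat)
        v_out.append(v_hat)
    return d_out, v_out


# Synthetic trace: delay steps from 100 ms to 150 ms after 50 packets.
step_trace = [100.0] * 50 + [150.0] * 50
for alpha in (0.7, 0.9, 0.998002):
    d, _ = ewma_estimates(step_trace, alpha)
    print(f"alpha={alpha}: delay estimate 50 packets after the step = {d[-1]:.1f} ms")
```

With the small values of α the estimate has essentially reached the new delay level 50 packets after the step, whereas with α = 0.998002 it has moved only a few milliseconds away from the old level.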

Figure 1 illustrates that as α decreases, the calculated playout times (solid lines) track variations in network delay (dots) more closely. As a result, fewer packets arrive too late (from 3.5% down to 1%) and the average buffering time is smaller (from 27.8 ms down to 7.4 ms).

Unfortunately, finding a single setting of the parameter α that works well for all network conditions is not easy, and may not even be feasible. Figures 2 and 3 show that there is no optimal fixed value of α when network conditions vary in time.

Fig. 2, 3. Calculated playout times for various values of α

When jitter is small and fluctuations in the end-to-end delays are large (Fig. 2), the best results are achieved when α is small. In this case both the packet loss ratio and average buffering time are relatively small (3.7% of lost packets and 3ms of buffering time). When α is set to 0.998002, the packet loss ratio is high (11.7%), and the buffering time is much larger than necessary (36.6ms).

On the other hand, when jitter is large but average network delay is constant (Fig. 3), the best results are achieved when α = 0.998002. In this case, the packet loss ratio is below 1%. When α is small, the algorithm is too sensitive to short-term delay jitter and this causes larger late packet loss (2.7%).

Since there is no optimal fixed value of α that works well for all network conditions we claim that the accuracy of the estimates can be greatly improved by dynamically choosing the values of α.

3. PLAYOUT BUFFER ALGORITHM WITH ADAPTIVE α

The idea behind our algorithm is to adaptively adjust the value of α depending on the variation in the network delays (α is set high when end-to-end variations are small and vice-versa). This new, dynamic parameter α (recomputed with each incoming packet) can be used to perform continuous estimation of the network delay and its variation in the same way as before.

Let $\alpha_i$ be a dynamic parameter based on new estimates $\hat{v}'_i$ of the variance of the end-to-end delays between source and destination:

$$\alpha_i = f(\hat{v}'_i)$$

where the function $f(\hat{v}'_i)$ was chosen experimentally to maximize the performance of our algorithm over a large set of network traces.

The dynamic version of parameter α is now used to maintain adaptive estimations of average delay and its variation:

$$\hat{d}_i = \alpha_i \cdot \hat{d}_{i-1} + (1-\alpha_i) \cdot n_i$$

$$\hat{v}_i = \alpha_i \cdot \hat{v}_{i-1} + (1-\alpha_i) \cdot |\hat{d}_i - n_i|$$
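The following Python sketch shows how a per-packet α could be plugged into the two estimates. It is only an illustration under stated assumptions: the paper does not give the function f, so `alpha_from_variation` (a linear map from a variation estimate to α, clipped to the range 0.7 to 0.998002), the reference variation `v_ref`, the smoothing factor `gamma`, and the use of a short-memory average of |n_i − d̂_{i−1}| as the variation estimate are all hypothetical choices made for this example.

```python
def alpha_from_variation(v_prime, alpha_min=0.7, alpha_max=0.998002, v_ref=20.0):
    """Hypothetical f: alpha_max at zero variation, falling linearly to alpha_min at v_ref ms."""
    frac = min(max(v_prime / v_ref, 0.0), 1.0)
    return alpha_max - frac * (alpha_max - alpha_min)


def adaptive_estimates(delays, gamma=0.9):
    """Delay and variation estimates with a per-packet alpha_i (all values in ms)."""
    d_hat = delays[0]  # assumption: seed the delay estimate with the first sample
    v_hat = 0.0
    v_prime = 0.0      # assumed variation estimate that drives alpha_i
    d_out, v_out, alphas = [], [], []
    for n in delays:
        # Assumed proxy for the variation: smoothed deviation from the previous estimate.
        v_prime = gamma * v_prime + (1.0 - gamma) * abs(n - d_hat)
        alpha_i = alpha_from_variation(v_prime)
        d_hat = alpha_i * d_hat + (1.0 - alpha_i) * n
        v_hat = alpha_i * v_hat + (1.0 - alpha_i) * abs(d_hat - n)
        d_out.append(d_hat)
        v_out.append(v_hat)
        alphas.append(alpha_i)
    return d_out, v_out, alphas


# Synthetic trace: delay steps from 100 ms to 150 ms after 50 packets.
d, v, a = adaptive_estimates([100.0] * 50 + [150.0] * 50)
print(f"smallest alpha used: {min(a):.3f}, final delay estimate: {d[-1]:.1f} ms")
```

In this sketch α stays at its high value while the delay is flat, drops for a few packets after the step so that the estimate catches up, and then returns towards the high value. The function actually used in our algorithm was tuned on network traces and may behave differently.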

Finally, the playout time $p_i$ of the i-th packet, assumed to be the first packet in a talkspurt, is calculated at the destination as follows:

$$p_i = t_i + \hat{d}_i + \beta \cdot \hat{v}_i$$

where $t_i$ is the sending time of the packet (see Fig. 4).

Parameter β controls the trade-off between delay and packet loss. The larger the coefficient, the more packets are played out, at the expense of longer delays.
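As an illustrative calculation (the numbers are assumed for the example and are not taken from our measurements), suppose the estimates for the first packet of a talkspurt are $\hat{d}_i = 100$ ms and $\hat{v}_i = 10$ ms. Then the playout delay $p_i - t_i = \hat{d}_i + \beta \cdot \hat{v}_i$ is

$$100 + 2 \cdot 10 = 120 \text{ ms for } \beta = 2, \qquad 100 + 4 \cdot 10 = 140 \text{ ms for } \beta = 4.$$

Raising β from 2 to 4 therefore adds 20 ms of playout delay, and packets of that talkspurt whose network delay lies between 120 ms and 140 ms are played out instead of being discarded as late.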

Any subsequent packets of that talkspurt are played out at a rate equal to the generation rate at the sender, that is,

$$p_j = p_i + t_j - t_i$$

Fig. 4. Playout time estimation.

This mechanism uses the same playout delay throughout a given talkspurt but permits different playout delays for different talkspurts. The variation of the playout delay introduces artificially elongated or reduced silence periods between successive talkspurts.
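The per-talkspurt schedule can be sketched in a few lines of Python. The function names and the sample numbers are assumptions for illustration; the only logic used is the rule that the first packet is played at its sending time plus the estimated delay plus β times the estimated variation, and that later packets keep the sender's spacing.

```python
def talkspurt_playout_times(send_times, d_hat, v_hat, beta):
    """Playout times (ms) for one talkspurt, given estimates taken at its first packet."""
    p_first = send_times[0] + d_hat + beta * v_hat
    # Later packets keep the sender's spacing: p_j = p_i + (t_j - t_i).
    return [p_first + (t - send_times[0]) for t in send_times]


def late_packets(send_times, network_delays, playout_times):
    """Indices of packets that arrive after their scheduled playout time."""
    return [k for k, (t, n, p) in enumerate(zip(send_times, network_delays, playout_times))
            if t + n > p]


send = [0.0, 30.0, 60.0]                      # packets generated every 30 ms
playout = talkspurt_playout_times(send, d_hat=100.0, v_hat=10.0, beta=2.0)
print(playout)                                # [120.0, 150.0, 180.0]
print(late_packets(send, [100.0, 125.0, 110.0], playout))  # [1]
```

In the example the second packet has a network delay of 125 ms, which exceeds the 120 ms playout delay fixed for the talkspurt, so it is counted as late.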

[Fig. 4 diagram: timelines for SENDER (sending times t_i, t_j), RECEIVER (reception time) and SPEAKER (playout times p_i, p_j); marked intervals: network delay n_i, buffering delay, playout delay.]

4. BUFFERING PERFORMANCE TESTS THROUGH NETWORK EMULATIONS

We have tested the performance of the new algorithm through network emulations. For the test we have chosen NISTNET 2.1.0 network emulation software [5] and we modeled various delay patterns (Fig. 5, 6, 7, 8) using its default Pareto distribution.

Fig. 5. First delay pattern - delay and jitter are constant (delay = 100 ms, jitter = 50 ms).

Fig. 6. Second delay pattern - delay is constant and jitter varies in time (delay = 100 ms, jitter jumps between 0, 10, 20, 30, 40, 50 ms every minute).

Fig. 7. Third delay pattern - delay varies in time, jitter is constant (delay jumps between 100, 150 and 200 ms every minute, jitter = 30 ms).

Fig. 8. Fourth delay pattern - delay and jitter vary in time (delay jumps between 50, 100 and 150 ms, jitter jumps between 0, 10, 20, 30, 40, 50 ms every 10 seconds).

During the experiments we used two voice sources (with and without hangover time). Following ITU-T Recommendation P.59 [6], human speech was modeled as a process that alternates between talkbursts and silence periods, both exponentially distributed (Fig. 9, 10), with means of 227 ms and 596 ms respectively without hangover time, or 1004 ms and 1587 ms with hangover time. In our model voice packets were generated every 30 ms. No packets were generated during silence periods. The total duration of each simulation was 1 hour.
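The voice source model used in the emulations (exponentially distributed talkbursts and gaps, one packet every 30 ms during talkbursts) can be sketched as follows. This is a simplified illustration, not the generator used in our tests: the function name, the fixed random seed, and the rule that every talkburst emits at least one packet are assumptions of this example.

```python
import random


def generate_talkspurts(duration_ms, mean_talk_ms=227.0, mean_gap_ms=596.0,
                        packet_interval_ms=30.0, seed=0):
    """Return a list of talkspurts, each a list of packet sending times in ms."""
    rng = random.Random(seed)
    t = 0.0
    spurts = []
    while t < duration_ms:
        talk = rng.expovariate(1.0 / mean_talk_ms)       # talkburst length
        # Assumption: a talkburst always produces at least one packet.
        n_packets = max(1, int(talk // packet_interval_ms))
        spurt = [t + k * packet_interval_ms for k in range(n_packets)]
        spurt = [s for s in spurt if s < duration_ms]
        if spurt:
            spurts.append(spurt)
        t += talk + rng.expovariate(1.0 / mean_gap_ms)   # silence: no packets
    return spurts


spurts = generate_talkspurts(60_000.0)   # one minute, source without hangover time
print(len(spurts), "talkspurts,", sum(len(s) for s in spurts), "packets")
```

Passing `mean_talk_ms=1004.0` and `mean_gap_ms=1587.0` gives the corresponding sketch of the source with hangover time.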

Fig. 9. Talkbursts and gaps generated by the voice source without hangover time.

Fig. 10. Talkbursts and gaps generated by the voice source with hangover time.

In order to compare the performance of the new playout algorithm with the basic one, we recorded network delays at the receiver and processed that data with a program that simulated the behaviour of the two algorithms. The trade-off between delay and packet loss was controlled by varying the β factor (2<β<4). The figures below show the delay/loss trade-off of both algorithms for different network conditions and the two voice sources. The solid lines represent the performance of the standard algorithm (four different fixed values of α) while the lines with circles represent the new algorithm with dynamic α.
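A trace-driven evaluation of this kind can be sketched as below for the fixed-α algorithm. The sketch is an assumption-laden illustration rather than the program we used: the function name, the seeding of the estimates, the tiny sample trace, and the definition of average buffering delay as the mean of (playout delay minus network delay) over the packets that are played out are all choices made for the example.

```python
def delay_loss_point(talkspurts, alpha, beta):
    """Return (late loss rate in %, average buffering delay in ms).

    talkspurts: list of talkspurts, each a list of network delays (ms) in sending order.
    """
    d_hat = talkspurts[0][0]  # assumption: seed the estimate with the first delay
    v_hat = 0.0
    total = late = 0
    buffered_ms = 0.0
    for spurt in talkspurts:
        playout_delay = 0.0
        for k, n in enumerate(spurt):
            # EWMA updates in incremental form (algebraically the same as
            # alpha * old + (1 - alpha) * new).
            d_hat += (1.0 - alpha) * (n - d_hat)
            v_hat += (1.0 - alpha) * (abs(d_hat - n) - v_hat)
            if k == 0:
                # The playout delay is fixed at the first packet of the talkspurt.
                playout_delay = d_hat + beta * v_hat
            total += 1
            if n > playout_delay:
                late += 1              # arrives after its playout time
            else:
                buffered_ms += playout_delay - n
    played = total - late
    avg_buffering = buffered_ms / played if played else 0.0
    return 100.0 * late / total, avg_buffering


# Tiny illustrative trace: three talkspurts of four packets each (delays in ms).
trace = [[100.0, 110.0, 95.0, 130.0],
         [105.0, 98.0, 140.0, 102.0],
         [150.0, 155.0, 149.0, 160.0]]
for beta in (2.0, 3.0, 4.0):
    loss, buf = delay_loss_point(trace, alpha=0.9, beta=beta)
    print(f"beta={beta}: late loss = {loss:.1f}%, average buffering delay = {buf:.1f} ms")
```

Sweeping β for each α and plotting the resulting (buffering delay, loss rate) pairs yields one trade-off curve per α, which is the form of comparison used in the figures.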

Fig. 11, 12. Algorithms performance comparison - delay and jitter constant (voice source w. and w/o hangover time).

Fig. 13, 14. Algorithms performance comparison - average delay is constant but jitter varies in time (voice source w. and w/o hangover time).

[Fig. 9 panels, "TALKBURSTS AND GAPS w/o HANGOVER TIME" (histograms of duration [ms] against # talkbursts and # gaps): mean talkburst = 227 ms, min = 33 ms, max = 1760 ms, total talk time = 1001 s; mean gap = 596 ms, min = 52 ms, max = 5122 ms, total gap time = 2599 s.]

[Fig. 10 panels, "TALKBURSTS AND GAPS w. HANGOVER TIME" (histograms of duration [ms] against # talkbursts and # gaps): mean talkburst = 1004 ms, min = 79 ms, max = 7363 ms, total talk time = 1447 s; mean gap = 1587 ms, min = 79 ms, max = 11840 ms, total gap time = 2152 s.]

[Fig. 11-14 plots, "ALGORITHMS PERFORMANCE COMPARISON (2<β<4)"; axes: late packet loss rate [%] and average buffering delay [ms]; curves: α=0.7, α=0.8, α=0.9, α=0.998, dynamic α.]

Fig. 15, 16. Algorithms performance comparison - delay varies in time and jitter is constant (voice source w. and w/o hangover time).

Fig. 17, 18. Algorithms performance comparison - delay and jitter vary in time (voice source w. and w/o hangover time).

From the figures above it can be seen that when network conditions were stable (jitter and delay were constant), the proposed algorithm performed at least as well as the algorithm with fixed α. When network conditions were changing (jitter and delay varied in time), our algorithm performed better for all delay patterns.

5. EXPERIMENTAL MEASUREMENTS AND ALGORITHM COMPARISON

To examine the performance of the new playout algorithm, two packet audio terminals were built based on the OpenH323 source code [7]. One terminal was set up at the Performance Engineering Laboratory in Dublin (IRELAND), and the other at the Computer Center of the Lodz University of Technology - LODMAN (POLAND). The path between sender and receiver was 14 hops long, and the interconnecting links had bandwidths of between 2 and 155 Mbit/s. The clocks of the terminals were synchronized using NTP software, which is sufficiently precise for our purposes.

For the experiments the simplest G.711 A-law encoding scheme (PCM) was chosen. The terminal encoder sent one frame of audio (240 bytes) every 30 ms. The input signal was a sequence of alternating audio signals and silence periods (following ITU-T Recommendation P.59, without hangover time), and no audio packets were generated during silence periods. During one hour of transmission all experimental data (arrival times, timestamps, sequence numbers, and marker bits) were collected at the receiving host.
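The frame size quoted above follows from the standard G.711 parameters of 8000 samples per second and 8 bits (1 byte) per sample, which are assumed here because they are not restated in the text:

$$8000 \ \tfrac{\text{samples}}{\text{s}} \times 1 \ \tfrac{\text{byte}}{\text{sample}} \times 0.030 \ \text{s} = 240 \ \text{bytes per frame}, \qquad \frac{240 \times 8 \ \text{bits}}{0.030 \ \text{s}} = 64 \ \text{kbit/s}.$$

During a talkburst this corresponds to roughly 33 packets per second of audio payload, before any packet header overhead is counted.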

Fig. 19 shows the delays experienced by audio packets during the one-hour experiment, together with a histogram of these delays. The delay/loss trade-off of the two algorithms is shown in Fig. 20.

Fig. 19. Delays experienced by audio packets and a histogram of these delays.

Fig. 20. Algorithm performance comparison.

The calculated playout times are compared for the whole network trace in Fig. 21 and for 500 seconds of the transmission in Fig. 22.

Fig. 21, 22. Calculated playout times for fixed and dynamic α and dynamic α vs. time.

6. EFFECTS OF THE NEW BUFFERING SCHEME ON SUBJECTIVE QUALITY

To estimate the subjective quality of packet voice for various α, the E-Model (ITU-T Recommendation G.107) [8] was used. The E-Model combines individual impairments (loss, delay, echo, codec type, noise, etc.) due to both the signal's properties and the network characteristics into a single R-rating that ranges from 0 to 100. Everything below 50 is clearly unacceptable and everything above 94.15 is unobtainable in narrowband telephony. The R-rating is a linear combination of the individual impairments and is given by the following formula:

$$R = (R_o - I_s) - I_d - I_e + A$$

From our point of view, the delay impairment $I_d$ (which captures the effect of delay) and the equipment impairment $I_e$ (which captures the effect of information loss due to the encoding scheme and packet loss) are the most interesting. The other terms, namely the loud connection and quantization impairment $I_s$, the basic signal-to-noise ratio $R_o$, and the "advantage factor" $A$ (zero in the fixed Internet), do not depend on the transmission parameters. Therefore, we can write the R-rating (for undistorted G.711 audio) as:

$$R = 94.15 - I_d - I_e$$
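The simplified rating and its mapping to user satisfaction can be sketched in Python as follows. The function names are illustrative, and the category thresholds (90, 80, 70, 60, 50) are the bands commonly quoted from the ITU-T E-model documentation; they are an assumption of this sketch, since only the 50 and 94.15 limits are stated in the text. The impairment values $I_d$ and $I_e$ must be supplied by the caller, for example read from curves such as those in [9] and [10].

```python
def r_rating(i_d, i_e, r_max=94.15):
    """Simplified R-rating for undistorted G.711: R = 94.15 - Id - Ie."""
    return r_max - i_d - i_e


def user_satisfaction(r):
    """Map an R value to a satisfaction category (assumed E-model bands)."""
    bands = [
        (90.0, "very satisfied"),
        (80.0, "satisfied"),
        (70.0, "some users dissatisfied"),
        (60.0, "many users dissatisfied"),
        (50.0, "almost all users dissatisfied"),
    ]
    for threshold, label in bands:
        if r >= threshold:
            return label
    return "not recommended"


# Illustrative impairment values (assumed, not measured).
r = r_rating(i_d=10.0, i_e=10.0)
print(f"R = {r:.2f}: {user_satisfaction(r)}")
```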

The figures below show, for several encoders and different levels of echo cancellation, how the call quality decreases with one-way delay (Fig. 23) and how the equipment impairment increases with the packet loss ratio (Fig. 24).

Based on the R-rating, we assessed transmission quality and subjective user satisfaction over a one-hour period. First we calculated average playout delays and average packet loss over 10-second periods. Assuming G.711 encoding with PLC and echo cancellation implemented (TELR = 55, 65), we calculated the delay impairments $I_d$ and equipment impairments $I_e$, and finally obtained the time-varying quality of the call.
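The first step of this procedure, averaging over 10-second periods, can be sketched as below. The record layout, the function name, and the decision to average the playout delay over all packets of a window (including late ones) are assumptions made for this example.

```python
from collections import defaultdict


def window_averages(records, window_ms=10_000.0):
    """Average playout delay (ms) and late loss (%) per time window.

    records: iterable of (send_time_ms, playout_delay_ms, was_late) tuples.
    Returns {window_index: (average_playout_delay_ms, loss_percent)}.
    """
    acc = defaultdict(lambda: [0, 0, 0.0])  # packets, late packets, summed delay
    for send_time, playout_delay, was_late in records:
        w = int(send_time // window_ms)
        acc[w][0] += 1
        acc[w][1] += 1 if was_late else 0
        acc[w][2] += playout_delay
    return {w: (total_delay / count, 100.0 * late / count)
            for w, (count, late, total_delay) in sorted(acc.items())}


records = [(0.0, 120.0, False), (5_000.0, 120.0, True), (10_000.0, 140.0, False)]
print(window_averages(records))  # {0: (120.0, 50.0), 1: (140.0, 0.0)}
```

Each window's average delay and loss can then be converted to $I_d$ and $I_e$ and combined into an R value for that 10-second period.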

Fig. 23. Transmission rating factor R as a function of the one-way delay [9].

Fig. 24. Equipment impairment Ie as a function of the packet loss [10].

Fig. 25. User satisfaction for various α when TELR=65.

Fig. 26. User satisfaction for various α when TELR=55.

[Fig. 25, 26 charts, "USER SATISFACTION vs. α for TELR=65 dB" and "USER SATISFACTION vs. α for TELR=55 dB"; one chart each for α = 0.8, α = 0.9, α = 0.99 and dynamic α; categories: not recommended, almost all users dissatisfied, many users dissatisfied, some users dissatisfied, satisfied, very satisfied.]

[Fig. 24 plot, "Equipment Impairment Ie vs. Packet Loss"; axes: packet loss [%], Ie; curves: G.711 w/o PLC, G.723.1, GSM, G.729A, G.711 Bursty Loss w. PLC, G.711 w. PLC Random Loss.]

[Fig. 23 plot, "Transmission Rating Factor R vs. Delay"; axes: one-way delay [ms], R; curves: TELR=65dB, TELR=55dB, TELR=45dB.]

Figures 25 and 26 show user satisfaction levels (based on the calculated R values) for two levels of echo cancellation (TELR=55, 65) and for various values of α (0.8, 0.9, 0.998, dynamic α). As we can see, when TELR=65 the adaptive buffering scheme with dynamic α was best at maximizing R values and thus user satisfaction (72% of the time with very good results). Second best was adaptive buffering with fixed α=0.8 (47% of the time with very good results). When the echo cancellation level was TELR=55dB, adaptive buffering with dynamic α was again best (79% of the time with good results), while with fixed α=0.8 good results were achieved only 50% of the time.

7. CONCLUSIONS

The proposed playout buffer algorithm predicts and follows network delays more efficiently than the basic algorithm with fixed α. We compared the two algorithms through simulations and experiments on a real network, using realistic voice sources and various delay patterns. The results show that with dynamic α one can achieve a better delay/loss trade-off and thus better call quality and user satisfaction.

ACKNOWLEDGMENT

The support of the Research Innovation Fund of Enterprise Ireland is gratefully acknowledged.

REFERENCES

1. A. P. Markopoulou, F. A. Tobagi, and M. J. Karam, "Assessment of VoIP Quality over Internet Backbones", in Proceedings of IEEE Infocom, 2002.

2. R. Ramjee, J. Kurose, D. Towsley, and H. Schulzrinne, "Adaptive playout mechanisms for packetized audio applications in wide-area networks", in Proceedings of the Conference on Computer Communications (IEEE Infocom), Toronto, Canada, 1994.

3. V. Jacobson, "Congestion avoidance and control", in Proceedings of the ACM SIGCOMM Conference, Stanford, 1988.

4. H. Schulzrinne, "Voice Communication Across the Internet: a Network Voice Terminal", Technical Report, Dept. of Computer Science, U. Massachusetts, Amherst MA, July 1992.

5. Source code available from: www.antd.nist.gov

6. ITU-T Recommendation P.59, "Telephone transmission quality objective measuring apparatus: Artificial conversational speech", Geneva, March 1993.

7. Source code available from: www.openh323.org

8. ITU-T Recommendation G.107, "The E-model, A Computational Model for Use in Transmission Planning", 1998.

9. Telecommunications Industry Association, "Voice Quality Recommendations for IP Telephony - TIA/EIA/TSB116", 2001.

10. ITU-T Recommendation G.113, "General Characteristics of General Telephone Connections and Telephone Circuits - Transmission Impairments", February 1996.