was obtained by applying PESQ/E-model directly as shown in Section 7.4. The scatter diagram of the predicted versus the measured MOSc scores for Internet trace data (#1 to #4) using the regression model is illustrated in Figure 7.10. Preliminary results show a correlation coefficient of 0.98 and an MSE of 0.12. In this application, it seems that the non-linear regression model achieves higher accuracy of voice quality prediction than the neural network model.
[Scatter plot: "Quality prediction for traces 1 to 4 (using regression model)"; axes: Measured MOSc and Predicted MOSc]
Figure 7.10: Predicted MOSc vs measured MOSc for trace data (#1 to #4) using regression model
7.6 Performance Analysis/Comparison between NN and Regression Models
To compare the MOSc measured using PESQ/E-model with the MOSc predicted by the neural network model and by the non-linear regression model, the average measured MOSc and the average predicted MOSc from each model are summarized in Table 7.3 for trace data #1 to #4. The network delay and network loss, together with the actual end-to-end delay and packet loss, are also shown in the table.
Table 7.3: Comparison of measured against predicted MOSc for trace data #1 to #4

| Trace | Network Delay (ms) | Network Loss (%) | Actual Delay (ms) | Actual Loss (%) | MOSc (Measured) | MOSc (NN-predicted) | MOSc (Regression-predicted) |
|---|---|---|---|---|---|---|---|
| 1 | 153 | 1.2 | 205 | 10.1 | 2.27 | 2.33 | 2.35 |
| 2 | 46 | 0.3 | 58 | 0.4 | 3.33 | 3.13 | 3.37 |
| 3 | 186 | 14.2 | 309 | 22.3 | 1.22 | 1.47 | 1.27 |
| 4 | 16 | 4.2 | 71 | 4.3 | 2.90 | 2.84 | 2.96 |
In general, trace #2 has the best network performance and the highest perceived speech quality (MOSc = 3.33), and trace #3 has the worst network performance and the lowest perceived speech quality (MOSc = 1.22). The MOSc predicted by the neural network model and by the regression model are both close to the measured MOSc obtained directly from PESQ/E-model, with higher accuracy from the regression model (correlation coefficient of 0.98) than from the neural network model (correlation coefficient of 0.94).
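As a quick cross-check of this comparison, the per-trace averages in Table 7.3 can be used to compute a mean absolute error for each model. This is only a sketch over the four tabulated averages; it is not the computation behind the correlation coefficients reported above, which are presumably based on the individual samples. The function name `mean_abs_error` is introduced here for illustration.

```python
# Average MOSc values per trace (#1 to #4), copied from Table 7.3.
measured = [2.27, 3.33, 1.22, 2.90]
nn_pred = [2.33, 3.13, 1.47, 2.84]
reg_pred = [2.35, 3.37, 1.27, 2.96]


def mean_abs_error(predicted, reference):
    """Mean absolute difference between two equal-length sequences."""
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(reference)


mae_nn = mean_abs_error(nn_pred, measured)
mae_reg = mean_abs_error(reg_pred, measured)

print(f"Neural network model MAE: {mae_nn:.4f}")
print(f"Regression model MAE:     {mae_reg:.4f}")
```

On these four averages the regression model deviates from the measured MOSc by roughly 0.06 on average and the neural network model by roughly 0.14, which is consistent with the ranking given by the correlation coefficients.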
It is noticed that the regression model has higher accuracy than the neural network model. The possible reasons are as follows:
- The regression model used here was developed only for the G.723.1 codec with a packet size of one, whereas the neural network model applied here is a general model covering four codecs (G.723.1, G.729, AMR and iLBC) and different packet sizes (1 to 5). The generality of the neural network model leads to lower accuracy in speech quality prediction compared with the regression model.
- There is no difference between the regression model and the PESQ/E-model method in considering the impact of end-to-end delay (both use Equation 5.7). The only difference between them lies in the calculation of Ie: the PESQ/E-model obtains the Ie value from the PESQ algorithm, whereas the regression model calculates Ie using a simplified regression function (see Equation 7.3). In contrast, only five delay values (100, 150, 200, 300 and 400 ms) are considered in the training set for the neural network model, which also contributes to the lower accuracy of the neural network model compared with the regression one.
In general, there are both advantages and disadvantages for neural network and non-linear regression models for voice quality prediction. These are summarized in Table 7.4.
Table 7.4: Comparison between neural network and regression models

| Models | Advantages | Disadvantages |
|---|---|---|
| Neural network | learning ability with adaptability and generality, robust | low accuracy, more complex |
| Non-linear regression | simple and straight-forward, high accuracy for specific scenario | static, lack of generality, application inconvenient (one equation for one scenario) |
7.7 Summary
In this chapter, the measurement, collection and preprocessing of Internet trace data have been presented. The trace data from international links between UK and USA, UK and China, and UK and Germany have been chosen for analysing the IP network performance (e.g. delay, jitter, packet loss and their distributions). Results show that different traces have different delay and delay variation (e.g. the trace between UK and USA has lower delay and delay variation, whereas the trace between UK and China (BUPT) has higher delay and delay variation). All the traces show that the 2-state Gilbert model gives a better fit for packet loss characterisation than the Bernoulli model. The neural network models and regression models developed in previous chapters have been used to predict voice quality from real Internet traces. Preliminary results show that both regression and neural network models can predict voice quality well. Non-linear regression models achieve higher accuracy of voice quality prediction (correlation coefficient of 0.98) than neural network models (correlation coefficient of 0.94). The reasons for this difference, together with a performance analysis and comparison of the neural network and regression models, were also presented.
Chapter 8
Perceived Speech Quality Prediction for Buffer Optimization
8.1 Introduction/Motivation
In this chapter, an application of the voice quality prediction models to perceived quality driven jitter buffer optimization is investigated. Nonlinear regression models are used for simplicity.
In Voice over IP (VoIP) applications, delay, jitter (i.e. delay variation) and packet loss are the main network impairments that affect perceived speech quality. Jitter can be partially compensated for by using a playout buffer at the receiving end, but this introduces further delay (buffer delay) and additional packet loss (packets arriving after their playout times will be dropped by the receiver). A tradeoff is necessary between increased packet loss and buffer delay to achieve a satisfactory result for any playout buffer algorithm. For example, the longer the buffer delay, the lower the late arrival loss and vice versa.
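The tradeoff described above can be illustrated with a minimal sketch of a fixed playout buffer. The function name `fixed_playout` and the delay values are hypothetical and chosen only for illustration; they are not taken from the trace data. Each packet is assumed to be played out a fixed time after it was sent, so a packet whose network delay exceeds that playout delay is counted as late and dropped.

```python
def fixed_playout(network_delays_ms, playout_delay_ms):
    """Return (late_loss_fraction, mean_buffer_wait_ms) for a fixed playout delay.

    A packet is dropped as late when its network delay exceeds the playout
    delay. Packets that arrive in time wait in the buffer for the remainder
    of the playout delay.
    """
    played = [d for d in network_delays_ms if d <= playout_delay_ms]
    late_count = len(network_delays_ms) - len(played)
    late_loss = late_count / len(network_delays_ms)
    if played:
        mean_wait = sum(playout_delay_ms - d for d in played) / len(played)
    else:
        mean_wait = 0.0
    return late_loss, mean_wait


# Hypothetical per-packet network delays in milliseconds.
delays = [40, 55, 60, 90, 150]

for playout in (60, 100, 200):
    loss, wait = fixed_playout(delays, playout)
    print(f"playout={playout} ms  late loss={loss:.2f}  mean buffer wait={wait:.1f} ms")
```

With these illustrative values, increasing the playout delay from 60 ms to 200 ms removes the late arrival loss entirely, but the packets that arrive in time spend correspondingly longer in the buffer.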
In the past, the choice/design of buffer algorithms was largely based on buffer delay and loss performance (e.g. a design objective could be to achieve a minimum average end-to-end delay for a specified packet loss rate [29, 117–119] or minimum late arrival loss [117]). This approach is inappropriate as it does not provide a direct link to perceived speech quality. From a QoS perspective, the choice of the best buffer algorithm for a given situation should be determined by the likely perceived speech quality. The importance of this is now starting to be recognised [10, 22, 120]. For example, in [120], perceived voice quality is used to control the playout buffer in order to maximise the MOS values in terms of delay and loss. The concept of perceptual optimization has also been extended to other QoS control problems, such as joint playout buffer/FEC control [101] to maximise MOS values in terms of delay, loss and rate.
However, current methods of perceptual optimization are based on assumptions about perceived voice quality which are inappropriate. In [120], the method is based on the assumption that the effects of packet loss and delay on voice quality are linearly additive on the MOS scale, which is doubtful. A further assumption is that the relationship between MOS and packet loss for codecs is linear, which is not correct for most codecs. It has also been suggested in [101] that one equation may be used to represent the impairments due to packet loss for all codecs. This may not be appropriate, especially for newer codecs.
In all perceptual-based buffer design/optimisation and QoS control for VoIP, voice quality is used as the key metric because it provides a direct link to user perceived QoS. However, this requires an efficient and accurate objective way to measure perceived voice quality. Most current methods [101, 121] use the ITU-T E-model [7] to predict voice quality, but the E-model requires subjective tests to derive model parameters, which is time-consuming and often impractical. As a result, the E-model is only applicable to a limited number of codecs and network conditions. It is also inevitable that discontinuities exist in subjective results [9] because only a limited range of scenarios can be tested. PESQ [4] gives a good measure of voice quality, but it is not appropriate for optimisation because of the overhead involved in its use in real time.
In Chapter 5, novel methods to predict voice quality non-intrusively based on a combined structure of PESQ and E-model have been proposed, and two models (i.e. a statistical nonlinear regression model and a neural network-based model) have been developed. The nonlinear regression models developed there are used in this chapter for perceived quality driven playout buffer optimization.
For perceived buffer design, it is important to understand delay distribution modelling as it is directly related to buffer loss (or late arrival loss). The characteristics of packet transmission delay over the Internet can be represented by statistical models which follow Normal, Exponential, Pareto and Weibull distributions depending on the application. For example, the delay distribution for Internet packets (for UDP traffic) has been shown to be consistent with an Exponential distribution [42], whereas a Pareto distribution may be the most appropriate one to represent the tail delay characteristics for streaming media [116]. As delay characteristics may change with networks and applications, it is unclear which delay distribution model is the best fit for current VoIP traffic (with an on/off pattern). This motivated us to investigate delay distribution modelling for the VoIP trace data described in the previous chapter.
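The link between the delay distribution and late arrival loss can be made concrete: for a playout deadline d, the late arrival loss is the tail probability P(D > d) of the delay distribution. The sketch below gives this tail probability for the standard textbook forms of the Exponential, Pareto and Weibull distributions, each shifted or scaled by a minimum delay. The function names and all parameter values are assumptions for illustration, not values fitted to the trace data.

```python
import math


def exponential_tail(d_ms, d_min, mean_excess):
    """P(D > d) for an Exponential distribution shifted by the minimum delay."""
    if d_ms <= d_min:
        return 1.0
    return math.exp(-(d_ms - d_min) / mean_excess)


def pareto_tail(d_ms, d_min, shape):
    """P(D > d) for a Pareto distribution with scale d_min."""
    if d_ms <= d_min:
        return 1.0
    return (d_min / d_ms) ** shape


def weibull_tail(d_ms, d_min, scale, shape):
    """P(D > d) for a Weibull distribution shifted by the minimum delay."""
    if d_ms <= d_min:
        return 1.0
    return math.exp(-(((d_ms - d_min) / scale) ** shape))


# Hypothetical parameters: 50 ms minimum delay, playout deadline of 120 ms.
deadline = 120.0
print("Exponential:", exponential_tail(deadline, d_min=50.0, mean_excess=30.0))
print("Pareto:     ", pareto_tail(deadline, d_min=50.0, shape=3.0))
print("Weibull:    ", weibull_tail(deadline, d_min=50.0, scale=30.0, shape=1.5))
```

With a shape parameter of 1 the Weibull tail reduces to the Exponential tail, so the Weibull form can be viewed as a generalisation with one extra degree of freedom for the tail behaviour.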
In this chapter, four existing jitter buffer algorithms are examined using perceptual speech quality analysis. An adaptive playout buffer algorithm, which selects the most suitable traditional buffer algorithm according to network delay and delay variation, is proposed. A perceived quality driven playout buffer algorithm is then investigated. It is proposed to use minimum overall impairment as a criterion for buffer optimization or QoS control. This criterion is more efficient than the traditional maximum MOS score. It is also shown that the delay characteristics of Voice over IP traffic are better characterized by a Weibull distribution than by a Pareto or an Exponential distribution. Based on the new regression models for voice quality prediction, the Weibull delay distribution model and the minimum impairment criterion, a perceptual optimization playout buffer algorithm is proposed and its performance is compared with other jitter buffer algorithms.
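The minimum impairment criterion can be sketched as a search over candidate playout delays for the one that minimises the sum of a delay impairment Id and a loss impairment Ie. The sketch below rests on several assumptions that are not taken from this chapter: Id uses the commonly quoted simplified E-model form 0.024d + 0.11(d - 177.3) for d above 177.3 ms (whether this matches Equation 5.7 is an assumption); Ie uses a logarithmic form with placeholder coefficients rather than the fitted regression function; the playout delay is treated as the whole end-to-end delay; and the Weibull parameters are hypothetical.

```python
import math


def delay_impairment(d_ms):
    """Assumed simplified E-model delay impairment Id (see lead-in)."""
    extra = 0.11 * (d_ms - 177.3) if d_ms > 177.3 else 0.0
    return 0.024 * d_ms + extra


def loss_impairment(loss_pct, a=20.0, b=0.2, c=15.0):
    """Hypothetical Ie = a*ln(1 + b*loss) + c; a, b, c are placeholders."""
    return a * math.log(1.0 + b * loss_pct) + c


def weibull_late_loss(playout_ms, d_min, scale, shape):
    """Late arrival loss as the Weibull tail probability P(D > playout delay)."""
    if playout_ms <= d_min:
        return 1.0
    return math.exp(-(((playout_ms - d_min) / scale) ** shape))


def total_impairment(playout_ms, network_loss, d_min, scale, shape):
    """Id + Ie for a given playout delay (treated here as end-to-end delay)."""
    late = weibull_late_loss(playout_ms, d_min, scale, shape)
    # Packets are lost either in the network or by arriving late at the buffer.
    total_loss = network_loss + (1.0 - network_loss) * late
    return delay_impairment(playout_ms) + loss_impairment(100.0 * total_loss)


def best_playout_delay(candidates_ms, network_loss, d_min, scale, shape):
    """Pick the candidate playout delay with the smallest overall impairment."""
    return min(
        candidates_ms,
        key=lambda p: total_impairment(p, network_loss, d_min, scale, shape),
    )


# Hypothetical scenario: 1% network loss, Weibull delay with 50 ms minimum.
candidates = list(range(60, 401, 10))
best = best_playout_delay(candidates, network_loss=0.01,
                          d_min=50.0, scale=40.0, shape=1.5)
print("Playout delay with minimum impairment (ms):", best)
```

Because the search compares Id + Ie directly, no conversion from the R factor to a MOS value is needed for each candidate, which is one plausible reading of why the criterion is described as more efficient.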
The structure of the Chapter is as follows. In Section 8.2, the existing four playout buffer algorithms and their performance are analyzed. In Section 8.3, a new adaptive playout buffer algorithm based on traditional jitter buffer algorithms is presented. In Section 8.4, a perceptual optimum playout buffer algorithm is described based on the speech quality prediction models, a minimum impairment criterion and Weibull delay distribution modeling. The performance