Simulation System Structure - Neural Network Models to Predict Listening Voice Quality

6.2 Neural Network Models to Predict Listening Voice Quality

6.2.1 Simulation System Structure

A block diagram of the speech quality prediction system used in the study is shown in Figure 6.1. It is a PC-based software system that simulates the key processes in voice over IP, allowing a variety of network conditions to be simulated and their effects on perceived speech quality to be measured objectively. The system comprises a speech database, an encoder/decoder, a packet loss simulator, a speech quality measurement module, a parameter extraction module and an ANN model. The speech database is taken from the TIMIT data set [92]. Speech files from different male and female talkers were chosen to generate a database for ANN model development.

Three modern codecs were chosen for the study: G.729 CS-ACELP (8 kbps), G.723.1 MP-MLQ/ACELP (6.3/5.3 kbps) and the Adaptive Multi-Rate (AMR) codec with eight modes (4.75 to 12.2 kbps).

[Figure 6.1 diagram blocks: Speech Database, Encoder, Packet loss simulator, Decoder, Quality measure (PESQ), Parameter Extraction, ANN Model; signals: Reference speech, Degraded speech, Measured MOS, Predicted MOS]

Figure 6.1: System structure for speech quality analysis and prediction

A 2-state Gilbert model was used to simulate packet loss (see Figure 2.6).
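A minimal sketch of such a loss simulator is given below. It assumes the usual parameterization of the 2-state Gilbert model, which this section does not restate: p is the probability of moving from State 0 (no loss) to State 1 (loss), q is the probability of moving back, and ulp and clp denote the unconditional and conditional loss probabilities, with ulp = p / (p + q) and clp = 1 - q. The function names are illustrative only.

```python
import random


def gilbert_from_ulp_clp(ulp, clp):
    """Convert (ulp, clp) to transition probabilities (p, q).

    Assumes the steady-state relations ulp = p / (p + q) and clp = 1 - q.
    """
    if not (0.0 <= ulp < 1.0 and 0.0 <= clp < 1.0):
        raise ValueError("ulp and clp must lie in [0, 1)")
    q = 1.0 - clp
    p = ulp * q / (1.0 - ulp)
    return p, q


def simulate_gilbert(n_packets, p, q, seed=None):
    """Return a list of 0/1 flags (1 = packet lost) from a 2-state Gilbert model."""
    rng = random.Random(seed)
    state = 0  # assumption: the simulation starts in the no-loss state
    trace = []
    for _ in range(n_packets):
        if state == 0:
            state = 1 if rng.random() < p else 0
        else:
            state = 0 if rng.random() < q else 1
        trace.append(state)
    return trace
```

For example, `simulate_gilbert(1000, *gilbert_from_ulp_clp(0.1, 0.3), seed=1)` produces a trace whose long-run loss rate approaches 10%, with losses more clustered than independent loss at the same rate.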

In our system, PESQ, the latest ITU perceptual measurement algorithm, is used to measure perceived speech quality under different network conditions and for different talkers/languages. PESQ compares the degraded speech with the reference speech and computes an objective MOS value on a 5-point scale. In the study, the MOS score obtained from PESQ is referred to as the 'measured MOS' to differentiate it from the 'predicted MOS' obtained from the ANN model. The Parameter Extraction module extracts salient information from the IP network and the decoder (including the codec type and network packet loss). In real VoIP applications, codec type and packet loss would be parsed from the RTP header. After processing, the information is fed to the ANN model to predict speech quality.
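As an illustration of how codec type and packet loss might be recovered from RTP headers, the sketch below reads the payload type and sequence number from the fixed 12-byte RTP header and turns gaps in the sequence numbers into a loss trace. It is not the thesis implementation. The function names are hypothetical, and the loss detection assumes packets arrive in order without duplicates.

```python
import struct

# Static RTP payload types (RFC 3551). AMR uses a dynamic payload type
# negotiated out of band, so it cannot be listed here.
STATIC_PAYLOAD_TYPES = {4: "G.723", 18: "G.729"}


def parse_rtp_header(packet):
    """Return (payload_type, sequence_number) from the fixed RTP header."""
    if len(packet) < 12:
        raise ValueError("RTP packet shorter than the 12-byte fixed header")
    first, second, seq = struct.unpack("!BBH", packet[:4])
    if first >> 6 != 2:
        raise ValueError("not an RTP version 2 packet")
    return second & 0x7F, seq  # low 7 bits of the second byte are the payload type


def loss_trace(sequence_numbers):
    """Build a 0/1 loss trace (1 = lost) from received sequence numbers.

    Simplification: every gap is treated as loss; reordered or duplicated
    packets are not handled.
    """
    trace = []
    prev = None
    for seq in sequence_numbers:
        if prev is not None:
            gap = (seq - prev) % 65536  # sequence numbers wrap at 2**16
            trace.extend([1] * (gap - 1))
        trace.append(0)
        prev = seq
    return trace
```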

A network packet payload may contain a normal speech frame (speech talkspurt) or a silence frame. The number of silence frames depends on whether VAD (Voice Activity Detection) is activated at the encoder side. If VAD is activated, the only silence frames are SID (Silence Insertion Descriptor) frames. Packet loss during a silence period or a low-energy signal segment has little or no impact on perceived speech quality.

We therefore calculated the ulp and clp according to the Gilbert model only during speech talkspurts. In this case, State 1 in Figure 2.6 represents loss during a talkspurt, and State 0 represents no loss or loss during silence. We use ulp(VAD) and clp(VAD) to differentiate them from the simulated network ulp and clp. One benefit is that this always counts the packet losses that are perceptually relevant, regardless of whether VAD is used or which kind of VAD is used in the system. Another benefit is that ulp(VAD) and clp(VAD) can be calculated on a frame basis, which accounts for the impact of different packet sizes and may save one input parameter for the neural network analysis. The frame size depends on the codec used: 10 ms for G.729, 20 ms for AMR and 30 ms for G.723.1.
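The frame-based calculation can be sketched as follows. This is one reading of the description above rather than the exact procedure used in the study. A frame is placed in State 1 only if it was lost and belongs to a talkspurt, ulp(VAD) is taken as the fraction of frames in State 1, and clp(VAD) as the fraction of State 1 frames that are followed by another State 1 frame. The function names and the talkspurt mask are assumptions.

```python
def packets_to_frames(packet_lost, frames_per_packet):
    """Expand a per-packet loss trace to a per-frame trace."""
    return [lost for lost in packet_lost for _ in range(frames_per_packet)]


def vad_loss_stats(frame_lost, frame_is_speech):
    """Estimate (ulp, clp), counting only losses that fall inside talkspurts.

    frame_lost[i]      -- truthy if frame i was lost
    frame_is_speech[i] -- truthy if frame i belongs to a talkspurt
    State 1 = frame lost during a talkspurt; State 0 = everything else.
    """
    if len(frame_lost) != len(frame_is_speech):
        raise ValueError("inputs must have the same length")
    if not frame_lost:
        raise ValueError("empty trace")
    states = [1 if lost and speech else 0
              for lost, speech in zip(frame_lost, frame_is_speech)]
    ulp = sum(states) / len(states)
    # clp: probability of State 1 given that the previous frame was in State 1.
    prev_in_loss = sum(states[:-1])
    consecutive = sum(a and b for a, b in zip(states, states[1:]))
    clp = consecutive / prev_in_loss if prev_in_loss else 0.0
    return ulp, clp
```

Because the statistics are computed per frame, a lost packet carrying several frames contributes several consecutive State 1 entries, which is how packet size enters the result without a separate input.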

The pitch delay can be extracted from the decoder, and the gender can be decided by comparing the pitch delay against a preset threshold that separates male from female talkers. At this stage of the research, we simply set the gender value according to the speech file chosen.
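A threshold rule of this kind could look like the sketch below. The threshold value is hypothetical, since the text does not give one, and taking the median over voiced frames is an illustrative choice. The only fact relied on is that a longer pitch delay corresponds to a lower-pitched voice.

```python
from statistics import median

# Hypothetical threshold, in samples at 8 kHz sampling; not taken from the source.
PITCH_DELAY_THRESHOLD = 55


def classify_gender(pitch_delays, threshold=PITCH_DELAY_THRESHOLD):
    """Return 'male' if the median pitch delay exceeds the threshold, else 'female'."""
    if not pitch_delays:
        raise ValueError("no pitch delay values supplied")
    return "male" if median(pitch_delays) > threshold else "female"
```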

6.2.2 Artificial Neural Network Model

An important objective of our study is to develop neural-network-based models that learn the non-linear relationships between the key impairment parameters and perceived voice quality. Learning models are necessary because these relationships are not explicit. Unlike conventional models which are static, e.g. the E-model, a neural-network-based model can also be re-trained to learn new relationships as IP networks continually change.

For simplicity, a three-layer, feed-forward neural net architecture and the standard back-propagation learning algorithm were used (see Figure 6.2). Four variables were identified as inputs to the neural network model, namely: codec type, gender, ulp(VAD) and clp(VAD). The predicted MOS score was the only output (see Figure 6.3).

For a three-layer feed-forward neural net, the network is made up of the input layer, the hidden layer and the output layer. Input data is fed to the input layer and processed layer by layer up to the output layer. The activation function of a node controls the output signal from the node. To start with, a set of randomized values is assigned to the weights and biases of the network. The connection weights are then updated to decrease the difference (error) between the network output and the desired output using a minimization algorithm. The process is repeated until the error falls below a specified limit. The neural net is then said to have been trained. Outputs of the hidden and output layers are generated using the asymmetric sigmoid activation function. Input and output values are scaled to the range 0 to 1 using the minimum and maximum values in the training data.

[Figure 6.2 diagram labels: Simulated VoIP system; Reference speech; Degraded speech; Network, Codec & Speech parameters; Quality measure (PESQ); Measured MOS; Predicted MOS; Backprop; error summation (+/-)]

Figure 6.2: Conceptual diagram of the training process for neural network model (for listening quality prediction)

[Figure 6.3 diagram labels: inputs Gender, Codec type, ulp(VAD), clp(VAD); nodes numbered 1-4, 1-5 and 1; output MOS. Caption not recovered.]
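The training procedure described above can be sketched in plain Python as follows. This is an illustrative reimplementation, not the simulator used in the study. It assumes that "asymmetric sigmoid" means the logistic function with outputs in (0, 1), uses online (per-sample) weight updates, and treats the learning rate, initial weight range and stopping tolerance as arbitrary choices. The four inputs and single output follow the text; the number of hidden nodes is left as a parameter.

```python
import math
import random


def minmax_fit(rows):
    """Per-column (min, max) taken from the training data."""
    return [(min(col), max(col)) for col in zip(*rows)]


def minmax_apply(row, ranges):
    """Scale one row to [0, 1]; constant columns map to 0."""
    return [(v - lo) / (hi - lo) if hi > lo else 0.0
            for v, (lo, hi) in zip(row, ranges)]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class ThreeLayerNet:
    def __init__(self, n_in, n_hidden, seed=0):
        rng = random.Random(seed)
        # One weight row per hidden node; the last entry of each row is the bias.
        self.w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                         for _ in range(n_hidden)]
        self.w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]

    def forward(self, x):
        """Return (hidden activations, network output) for one input list."""
        hidden = [sigmoid(sum(w * v for w, v in zip(ws, x + [1.0])))
                  for ws in self.w_hidden]
        out = sigmoid(sum(w * v for w, v in zip(self.w_out, hidden + [1.0])))
        return hidden, out

    def train_step(self, x, target, lr):
        """One back-propagation update; returns the squared error before it."""
        hidden, out = self.forward(x)
        err = target - out
        delta_out = err * out * (1.0 - out)
        # Hidden deltas are computed with the output weights before they change.
        delta_hidden = [delta_out * self.w_out[j] * h * (1.0 - h)
                        for j, h in enumerate(hidden)]
        for j, v in enumerate(hidden + [1.0]):
            self.w_out[j] += lr * delta_out * v
        for j, ws in enumerate(self.w_hidden):
            for i, v in enumerate(x + [1.0]):
                ws[i] += lr * delta_hidden[j] * v
        return err * err


def train(net, inputs, targets, lr=0.5, max_epochs=5000, tol=1e-3):
    """Repeat training epochs until the mean squared error falls below tol."""
    mse = float("inf")
    for _ in range(max_epochs):
        mse = sum(net.train_step(x, t, lr)
                  for x, t in zip(inputs, targets)) / len(inputs)
        if mse < tol:
            break
    return mse
```

Since the targets are scaled to 0 to 1, a network output `y` is mapped back to the MOS scale as `lo + y * (hi - lo)`, where `(lo, hi)` is the MOS range found by `minmax_fit` on the training targets.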

ulp(Real) and clp(Real) are generated from the Gilbert model and represent the contribution from packet loss. In this context, the Gilbert model serves as a means of pre-processing the received packet streams to capture and represent the underlying features of packet loss before they are applied to the neural network to facilitate learning. It allows the packet loss behaviour of IP networks to be represented as a Markov process, because several of the mechanisms that contribute to loss are transient in nature (e.g. network congestion, late arrival of packets at a gateway/terminal, buffer overflow or transmission errors), which is why packet loss is bursty [46]. An attraction is that it provides a compact representation of the loss behaviour of IP networks which can be used directly as input to the learning models.

The Stuttgart Neural Network Simulator (SNNS) package [102] was used for neural network training and testing. The neural network was trained to learn the non-linear relationship between four input variables and one output variable.

In document SPEECH QUALITY PREDICTION FOR VOICE OVER INTERNET PROTOCOL NETWORKS. L. Sun (Page 113-117)