A New Method of Objective Speech Quality Assessment in Communication System
IV. DESCRIPTION OF NPESQ ALGORITHM
The NPESQ algorithm is a new objective method for assessing the speech quality of a speech communication system. The algorithm consists of three parts: level and time-alignment pre-processing, the auditory transform, and cognitive modeling. Each part is explained below.
A. Level and Time-alignment Pre-processing
a) Level alignment: The inputs of the system are the reference signal $X(t)$ and the degraded signal $Y(t)$. The reference signal is the original speech signal; the degraded signal is the reference signal after it has passed through the system under test, which is a voice system. Passing through the voice system changes the signal level, so the reference and degraded signals generally differ in level. To compare them, both signals must first be adjusted to a uniform level.
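The level adjustment can be illustrated with a minimal sketch that scales both signals to the same RMS level. The target level `target_rms` and the helper names are assumptions made for this illustration; the paper does not state which reference level NPESQ uses.

```python
import math


def rms(signal):
    """Root-mean-square level of a sequence of samples."""
    return math.sqrt(sum(s * s for s in signal) / len(signal))


def align_level(signal, target_rms=0.1):
    """Scale a signal so that its RMS level equals target_rms.

    target_rms is a hypothetical common level, not a value from the paper.
    """
    level = rms(signal)
    if level == 0.0:
        return list(signal)  # silent input: nothing to scale
    gain = target_rms / level
    return [gain * s for s in signal]


# Toy example: the "degraded" signal is an attenuated copy of the reference.
reference = [math.sin(2 * math.pi * 5 * t / 100) for t in range(100)]
degraded = [0.3 * s for s in reference]

x_s = align_level(reference)
y_s = align_level(degraded)
```

After scaling, `x_s` and `y_s` have the same RMS level, so later differences between them reflect distortion rather than gain.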
b) IRS filtering: It is assumed that the listening tests were carried out using an IRS receiver or a modified IRS receiver characteristic in the handset. The inputs of the IRS filter are $X_S(t)$ and $Y_S(t)$, and its outputs are $X_{IRSS}(t)$ and $Y_{IRSS}(t)$.
c) The time-alignment process: Before the auditory transform is performed, the time delay between the degraded speech and the original speech must be estimated. The parameters behind the PESQ score are calculated frame by frame, and because the speech signal is time-varying, the delay is not the same for every frame. If corresponding frames are not aligned in time, large errors result. The time alignment comprises envelope-based alignment and fine alignment.
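A minimal sketch of a two-stage delay estimate is given below: a coarse estimate from the cross-correlation of frame envelopes, refined by a sample-level cross-correlation around it. The envelope definition (mean absolute value per frame), the frame length, the search ranges, and all function names are assumptions for illustration, and a single global delay is estimated rather than a per-frame delay.

```python
import math


def envelope(signal, frame_len):
    """Crude envelope: mean absolute value of each non-overlapping frame."""
    n_frames = len(signal) // frame_len
    return [
        sum(abs(s) for s in signal[i * frame_len:(i + 1) * frame_len]) / frame_len
        for i in range(n_frames)
    ]


def best_lag(ref, deg, lags):
    """Return the lag in `lags` that maximizes sum(ref[i] * deg[i + lag])."""
    def score(lag):
        return sum(
            r * deg[i + lag]
            for i, r in enumerate(ref)
            if 0 <= i + lag < len(deg)
        )
    return max(lags, key=score)


def estimate_delay(ref, deg, frame_len=4, max_frame_lag=10):
    """Delay of deg relative to ref in samples (positive: deg is late)."""
    # Coarse stage: correlate the envelopes, resolution of one frame.
    coarse = frame_len * best_lag(
        envelope(ref, frame_len),
        envelope(deg, frame_len),
        range(-max_frame_lag, max_frame_lag + 1),
    )
    # Fine stage: correlate the samples in a small range around the coarse lag.
    fine_lags = range(coarse - 2 * frame_len, coarse + 2 * frame_len + 1)
    return best_lag(ref, deg, fine_lags)


# Toy example: a short burst, delayed by 14 samples in the degraded signal.
ref = [0.0] * 200
for i in range(40, 80):
    ref[i] = (1.0 + (i - 40) / 40.0) * math.sin(0.9 * i)
true_delay = 14
deg = [0.0] * true_delay + ref[:-true_delay]

delay = estimate_delay(ref, deg)
```

The coarse stage keeps the search cheap; the fine stage recovers the delay to single-sample resolution.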
B. Auditory Transform
The auditory transform is a physiological acoustic model that maps the speech signal to the time-frequency domain in order to simulate how the human ear receives speech. It consists of four steps: processing by a Hanning window, the short-term fast Fourier transform, mapping of the frequency spectrum to the ERB spectrum, and mapping of the ERB spectrum to loudness. The structure of the auditory transform is shown in Fig. 2.
[Figure 2 signal chain: $X_{IRSS}(t)$ → $X_{WIRSS}(t)$ → $PX_{WIRSS}(f)$ → $PPX^{N}_{WIRSS}(f)$ → $LX^{N}(f)$]
Fig.2. The Auditory transform of NPESQ
a) Processing by Hanning window: The Hanning window is 32 ms long. At a sampling rate of 8 kHz this amounts to 256 samples per window, and at 16 kHz to 512 samples. Adjacent frames overlap by 50%.
The Hanning window function is:
$$W(n) = 0.5\left(1 - \cos\frac{2\pi n}{N}\right), \quad 0 \le n \le N-1 \tag{8}$$
where $n$ is the sample index within the window and $N$ is the window length in samples.
After windowing, the reference and degraded signals become $X_{WIRSS}(t)_n$ and $Y_{WIRSS}(t)_n$:
$$X_{WIRSS}(t)_n = W(n)\,X_{IRSS}(t), \qquad Y_{WIRSS}(t)_n = W(n)\,Y_{IRSS}(t) \tag{9}$$
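The windowing step can be sketched as follows, assuming the 32 ms window and 50% overlap stated above. The function names are hypothetical, and frames that would run past the end of the signal are simply dropped, which is a choice made for this sketch.

```python
import math


def window_length(sample_rate_hz, window_ms=32):
    """Number of samples in a window of window_ms milliseconds."""
    return sample_rate_hz * window_ms // 1000


def hann_window(n_samples):
    """W(n) = 0.5 * (1 - cos(2*pi*n / N)) for n = 0 .. N-1, as in Eq. (8)."""
    return [
        0.5 * (1.0 - math.cos(2.0 * math.pi * n / n_samples))
        for n in range(n_samples)
    ]


def windowed_frames(signal, frame_len):
    """Split a signal into Hanning-windowed frames with 50% overlap."""
    hop = frame_len // 2
    window = hann_window(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        segment = signal[start:start + frame_len]
        frames.append([w * s for w, s in zip(window, segment)])
    return frames


frame_len = window_length(8000)  # 32 ms at 8 kHz
frames = windowed_frames([1.0] * 1024, frame_len)
```

With a 1024-sample input and 256-sample frames, the hop is 128 samples, so the frames start at samples 0, 128, ..., 768.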
b) Short-term Fast Fourier Transform: The human ear performs a time-frequency transformation, which NPESQ implements as a short-term FFT with a Hanning window. The power spectra, that is, the sums of the squared real and squared imaginary parts of the complex FFT components, are stored in separate real-valued arrays for the original and degraded signals. The FFT transforms the signals from the time domain into the frequency domain:
$$X_{WIRSS}(t)_n \xrightarrow{\text{FFT}} X_{WIRSS}(k)_n, \qquad Y_{WIRSS}(t)_n \xrightarrow{\text{FFT}} Y_{WIRSS}(k)_n \tag{10}$$
The power spectral density is then calculated. $PX_{WIRSS}(k)_n$ and $PY_{WIRSS}(k)_n$ are the power spectra of the reference and degraded signals:
$$\begin{aligned} PX_{WIRSS}(k)_n &= \left(\mathrm{Re}\,X_{WIRSS}(k)_n\right)^2 + \left(\mathrm{Im}\,X_{WIRSS}(k)_n\right)^2 \\ PY_{WIRSS}(k)_n &= \left(\mathrm{Re}\,Y_{WIRSS}(k)_n\right)^2 + \left(\mathrm{Im}\,Y_{WIRSS}(k)_n\right)^2 \end{aligned} \tag{11}$$
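The power spectrum of one frame can be computed as in the sketch below. A textbook recursive radix-2 FFT is used so that the example is self-contained; it is an illustrative stand-in, not the FFT routine used by the authors.

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def power_spectrum(frame):
    """Squared real part plus squared imaginary part of each FFT bin."""
    return [c.real ** 2 + c.imag ** 2 for c in fft(frame)]


# Toy frame: a cosine with exactly 8 cycles in 64 samples.
frame = [math.cos(2 * math.pi * 8 * n / 64) for n in range(64)]
px = power_spectrum(frame)
```

For this frame, all of the power falls in bin 8 and its mirror bin 56, each with value $(64/2)^2 = 1024$.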
c) Frequency spectrum is mapped to ERB-spectrum: The ERB spectrum is calculated in two steps. The first step divides the entire auditory frequency range into a series of frequency bands, each spanning a fixed length on the ERB scale. The second step calculates the center frequency and sound pressure level of each band using the energy center-of-gravity method.
The center frequency is calculated as:
$$f_c = \frac{\sum_{i=1}^{n} f_i \cdot Y_i}{\sum_{i=1}^{n} Y_i} \tag{12}$$
where $f_c$ is the center frequency of the band, $n$ is the total number of spectral lines in the band, $f_i$ is the frequency of the $i$-th spectral line within the band, and $Y_i$ is the power spectrum amplitude of the $i$-th spectral line.
The sound pressure level of each band is calculated as:
$$SPL_c = 10 \lg \frac{\sum_{i=1}^{n} Y_i^2}{P_{ref}^2} \tag{13}$$
where $SPL_c$ is the sound pressure level of the band and $P_{ref}$ is the reference sound pressure. The center frequencies $f_c$ and the band levels $SPL_c$ together make up the ERB spectrum, which serves as the input to the loudness calculation. The warping function that maps the frequency scale in hertz to the pitch scale in ERB does not exactly follow the values given in the literature. The resulting signals are known as the pitch power densities $PPX^{N}_{WIRSS}(f)_n$ and $PPY^{N}_{WIRSS}(f)_n$.
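The two steps of the ERB mapping can be sketched as follows. Three points are assumptions of this sketch rather than statements of the paper: the Glasberg-Moore ERB-rate formula is used for the frequency warping (the paper notes that its own warping deviates from literature values), each band is taken to be one ERB wide, and Eq. (13) is read as $10 \lg (\sum Y_i^2 / P_{ref}^2)$. All function names are hypothetical.

```python
import math


def hz_to_erb_rate(freq_hz):
    """Glasberg-Moore ERB-rate scale (an assumed warping function)."""
    return 21.4 * math.log10(4.37 * freq_hz / 1000.0 + 1.0)


def group_into_bands(freqs_hz, values, band_width_erb=1.0):
    """Step 1: group spectral lines into bands of equal width on the ERB scale."""
    bands = {}
    for f, y in zip(freqs_hz, values):
        index = int(hz_to_erb_rate(f) // band_width_erb)
        bands.setdefault(index, []).append((f, y))
    return [bands[i] for i in sorted(bands)]


def band_center_frequency(band):
    """Step 2a: energy center of gravity, sum(f_i * Y_i) / sum(Y_i)."""
    return sum(f * y for f, y in band) / sum(y for _, y in band)


def band_spl(band, p_ref=1.0):
    """Step 2b: 10 * lg(sum(Y_i^2) / p_ref^2), under the reading stated above."""
    return 10.0 * math.log10(sum(y * y for _, y in band) / (p_ref * p_ref))


# Toy spectrum: three lines that share a band, and one line far away.
freqs = [100.0, 105.0, 110.0, 1000.0]
values = [1.0, 2.0, 1.0, 3.0]
bands = group_into_bands(freqs, values)
erb_spectrum = [(band_center_frequency(b), band_spl(b)) for b in bands]
```

Each entry of `erb_spectrum` pairs a center frequency with a band level, which is the form of input the loudness mapping expects.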
d) ERB-spectrum is mapped to loudness: The ERB spectrum is mapped to a loudness scale in sone, using a frequency-dependent threshold and exponent. This gives the perceived loudness in each time-frequency cell.
The resulting two-dimensional arrays $LX^{N}(f)_n$ and $LY^{N}(f)_n$ are called loudness densities.
C. Cognitive Modeling
a) Calculation of the disturbance density: The signed difference between the loudness densities of the degraded and reference signals is calculated. A positive difference means that the degraded signal contains components that were introduced by the system; a negative difference means that components of the reference signal have been lost. This difference is known as the raw disturbance density. Each value of the raw disturbance density is then moved toward zero, that is, in the direction of decreasing absolute value, by the distance given by the masking threshold.
Because of the masking effect of the human ear, each time-frequency component must be processed in this way to obtain the disturbance density $D(f)_n$.
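The masking step can be sketched as a dead zone around zero. The paper does not say how the masking threshold is obtained; the sketch borrows the rule from ITU-T P.862 PESQ (0.25 times the smaller of the two loudness values), which is an assumption here, and the function names are hypothetical.

```python
def masking_threshold(loud_ref, loud_deg, factor=0.25):
    """Assumed mask: factor * min(reference, degraded), as in P.862 PESQ."""
    return [factor * min(lx, ly) for lx, ly in zip(loud_ref, loud_deg)]


def disturbance_density(loud_ref, loud_deg, mask):
    """Pull the raw disturbance (degraded - reference) toward zero by the mask."""
    result = []
    for lx, ly, m in zip(loud_ref, loud_deg, mask):
        raw = ly - lx
        if raw > m:
            result.append(raw - m)   # added components, reduced by the mask
        elif raw < -m:
            result.append(raw + m)   # lost components, reduced by the mask
        else:
            result.append(0.0)       # difference is masked completely
    return result


# One frame with three time-frequency cells.
lx = [4.0, 4.0, 4.0]
ly = [6.0, 4.5, 1.0]
mask = masking_threshold(lx, ly)
d = disturbance_density(lx, ly, mask)
```

Here the first cell keeps a positive disturbance (added components), the second is fully masked, and the third keeps a negative disturbance (lost components).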
b) The processing of asymmetry: The asymmetry factor is calculated from a stabilized ratio of the ERB spectral densities of the degraded and reference signals in each time-frequency cell.
The asymmetrical disturbance density $DA(f)_n$ is calculated by multiplying the disturbance density $D(f)_n$ by the asymmetry factor.
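A minimal sketch of the asymmetry processing follows. The paper gives no formula for the stabilized ratio, so the sketch borrows the form used in ITU-T P.862 PESQ: a stabilizing constant of 50, an exponent of 1.2, factors below 3 set to zero, and factors clipped at 12. These constants and the function names are assumptions and may differ in NPESQ.

```python
def asymmetry_factor(ppx, ppy, stabilizer=50.0, exponent=1.2,
                     floor=3.0, ceiling=12.0):
    """Stabilized ratio of degraded to reference density (assumed P.862 form)."""
    ratio = (ppy + stabilizer) / (ppx + stabilizer)
    factor = ratio ** exponent
    if factor < floor:
        return 0.0  # small ratios contribute no asymmetric disturbance
    return min(factor, ceiling)


def asymmetric_disturbance(d, ppx, ppy):
    """DA(f)_n = D(f)_n multiplied by the asymmetry factor of the same cell."""
    return d * asymmetry_factor(ppx, ppy)


da_added = asymmetric_disturbance(2.0, ppx=50.0, ppy=350.0)
da_equal = asymmetric_disturbance(2.0, ppx=100.0, ppy=100.0)
```

Adding the constant to both densities keeps the ratio stable when the reference density is close to zero.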
c) Aggregation of the disturbance densities: The disturbance values are aggregated using an $L_p$ norm, which calculates a non-linear average using the following formulas:
$$D^{N}_n = M_n \sqrt[3]{\sum_{f} \left( D(f)_n\, W_f \right)^3} \tag{13}$$
$$DA^{N}_n = M_n \sqrt[3]{\sum_{f} \left( DA(f)_n\, W_f \right)^3} \tag{14}$$
where $M_n$ is a multiplication factor that emphasizes the disturbances occurring during silences in the original speech fragment, and $W_f$ is a series of constants proportional to the width of the modified ERB bins. The aggregated values $D^{N}_n$ and $DA^{N}_n$ are called frame disturbances.
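The aggregation over frequency can be sketched as a weighted $L_p$ norm with $p = 3$, matching the cube and cube root in the formulas above. Taking the absolute value of each disturbance before raising it to the power is an assumption of this sketch (it keeps the root well defined for negative disturbances); the function name and the toy values are hypothetical.

```python
def frame_disturbance(density, widths, m_n=1.0, p=3):
    """m_n * (sum((|D(f)| * W_f) ** p)) ** (1 / p) for one frame."""
    total = sum((abs(d) * w) ** p for d, w in zip(density, widths))
    return m_n * total ** (1.0 / p)


# One frame with three bands of equal width and no silence emphasis.
d_frame = frame_disturbance([3.0, -4.0, 5.0], [1.0, 1.0, 1.0])
```

For this frame the sum of cubes is $27 + 64 + 125 = 216$, so the frame disturbance is 6. Compared with a plain average, the cubic norm gives more weight to the bands with the largest disturbance.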
d) Realignment of bad intervals: Runs of consecutive frames whose frame disturbance exceeds a threshold are called bad intervals. Each bad interval is realigned and its disturbance is recalculated. For each frame, the new value is used if the realignment results in a lower disturbance. The frame disturbances are then aggregated over split-second intervals to calculate the average symmetric disturbance $d^{N}_{SYM}$ and the average asymmetric disturbance $d^{N}_{ASYM}$.
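The bookkeeping of this step can be sketched as follows. The threshold value is hypothetical, the realignment itself is not shown, and a single frame above the threshold is treated as an interval of length one, which is a simplification of this sketch.

```python
def find_bad_intervals(frame_disturbances, threshold):
    """Return (start, end) index pairs, end exclusive, of runs above threshold."""
    intervals = []
    start = None
    for i, value in enumerate(frame_disturbances):
        if value > threshold and start is None:
            start = i                      # a bad interval begins
        elif value <= threshold and start is not None:
            intervals.append((start, i))   # the bad interval ends before frame i
            start = None
    if start is not None:
        intervals.append((start, len(frame_disturbances)))
    return intervals


def keep_lower(original, recalculated):
    """Per frame, keep the recalculated disturbance only if it is lower."""
    return [min(o, r) for o, r in zip(original, recalculated)]


disturbances = [1.0, 50.0, 60.0, 2.0, 70.0]
bad = find_bad_intervals(disturbances, threshold=45.0)  # hypothetical threshold
updated = keep_lower(disturbances, [1.0, 10.0, 80.0, 2.0, 30.0])
```

In the example, frame 2 keeps its original value because the recalculated disturbance is higher, while frames 1 and 4 adopt the lower recalculated values.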
e) Computation of the NPESQ score: A combination of two parameters, the symmetric disturbance $d^{N}_{SYM}$ and the asymmetric disturbance $d^{N}_{ASYM}$, shows a good balance between accuracy of prediction and ability of generalization. The final NPESQ score is a linear combination of $d^{N}_{SYM}$ and $d^{N}_{ASYM}$