
4.4 Adaptive estimation of the shape parameter

4.4.2 Estimation of a for the 1 sided priors

4.4.2.1 1 sided Chi priors

Given the model for the complex STFT coefficients, X = S + N, the fourth moment of the noisy speech spectral amplitude can be written as:

$$E[R^4] = E[A^4] + 4E[A^2]E[B^2] + E[B^4] \tag{4.16}$$

where R, A and B are the amplitudes of the noisy speech, the clean speech and the noise respectively. Based on the Gaussian noise model of eq. 3.5, the second and fourth moments of the noise spectral amplitude are given by

$$E[B^2] = 2\sigma_N^2, \quad \text{and} \quad E[B^4] = 8\sigma_N^4$$

and the second and fourth moments are related by

$$E[B^4] = 2\,E[B^2]^2 \tag{4.17}$$
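The relation in eq. 4.17 can be checked numerically. The sketch below is a minimal Monte Carlo check, assuming that the Gaussian noise model of eq. 3.5 means independent zero-mean real and imaginary parts, each with variance σN² (an assumption, since eq. 3.5 is not reproduced here). The value of `sigma`, the sample count and the seed are arbitrary illustrative choices.

```python
import random

random.seed(0)        # fixed seed so the run is repeatable
sigma = 0.7           # hypothetical per-component noise standard deviation
n = 100_000           # number of simulated noise coefficients

sum2 = 0.0
sum4 = 0.0
for _ in range(n):
    re = random.gauss(0.0, sigma)   # real part of the noise coefficient
    im = random.gauss(0.0, sigma)   # imaginary part of the noise coefficient
    power = re * re + im * im       # B^2, the squared noise amplitude
    sum2 += power
    sum4 += power * power

m2 = sum2 / n             # sample estimate of E[B^2], expected near 2 * sigma^2
m4 = sum4 / n             # sample estimate of E[B^4], expected near 8 * sigma^4
ratio = m4 / m2 ** 2      # expected near 2, as in eq. 4.17
```

With a finite number of samples the ratio only approaches 2, so any comparison should allow a tolerance.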

The corresponding moments for the speech spectral amplitude for the Chi prior density are:

$$E[A^2] = \frac{\theta a}{2}, \quad \text{and} \quad E[A^4] = \frac{\theta^2 a(a+2)}{4}$$

and subsequently

$$E[A^4] = \frac{a+2}{a}\,E[A^2]^2 \tag{4.18}$$

Substituting eqs. 4.17 and 4.18 in 4.16 we have

$$\frac{a+2}{a} = \frac{E[R^4] - 4E[A^2]E[B^2] - 2\,(E[B^2])^2}{(E[A^2])^2} = \kappa_1 \tag{4.19}$$

The estimator of a then reads

$$\hat{a} = \frac{2}{\kappa_1 - 1} \tag{4.20}$$

where $\kappa_1$ is the kurtosis of the clean speech amplitude, defined as $\kappa_1 \equiv E[A^4]/E[A^2]^2$.

Note that the form of eq. 4.20 is the same as eq. 4.11, which is a consequence of the second and fourth moments being the same for the 1 sided and the 2 sided Chi pdf’s.
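As an illustration of the moment matching steps of eqs. 4.19 and 4.20, the following sketch computes the kurtosis from the three moments and then the shape estimate. The function names and the numerical values of a, θ and σN² are hypothetical. How the expectations are estimated from data in practice is not covered here, so the example builds exact moments from the Chi prior expressions and checks that the estimator recovers the shape value it started from.

```python
def kurtosis_from_moments(r4, a2, b2):
    """kappa_1 as in eq. 4.19, from E[R^4], E[A^2] and E[B^2]."""
    return (r4 - 4.0 * a2 * b2 - 2.0 * b2 ** 2) / a2 ** 2


def chi_shape_estimate(kappa1):
    """Shape estimate of eq. 4.20: a_hat = 2 / (kappa_1 - 1)."""
    if kappa1 <= 1.0:
        raise ValueError("kappa_1 must exceed 1 for a positive shape estimate")
    return 2.0 / (kappa1 - 1.0)


# Round trip with hypothetical values: a = 0.8, theta = 1.5, sigma_N^2 = 0.3.
a, theta, sigma2 = 0.8, 1.5, 0.3
a2 = theta * a / 2.0                       # E[A^2] for the Chi prior
a4 = theta ** 2 * a * (a + 2.0) / 4.0      # E[A^4] for the Chi prior
b2 = 2.0 * sigma2                          # E[B^2] for the Gaussian noise model
r4 = a4 + 4.0 * a2 * b2 + 2.0 * b2 ** 2    # E[R^4] implied by eq. 4.19
a_hat = chi_shape_estimate(kurtosis_from_moments(r4, a2, b2))
```

The guard on `kappa1` is a design choice of this sketch: with moments estimated from short data segments the kurtosis can fall to 1 or below, where eq. 4.20 no longer gives a positive value.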

4.4.2.2 1 sided Gamma priors

The procedure for obtaining the estimates of a is identical to that of §4.4.2.1, except for the expressions of the speech prior moments. For the 1 sided Gamma prior these are:

and

$$E[A^4] = \frac{(a+2)(a+3)}{a(a+1)}\,E[A^2]^2 \tag{4.21}$$

Following the same steps as in §4.4.2.1 we have:

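$$\frac{(a+2)(a+3)}{a(a+1)} = \frac{E[R^4] - 4E[A^2]E[B^2] - 2\,(E[B^2])^2}{(E[A^2])^2} = \kappa_1 \tag{4.22}$$

Or finally, solving the quadratic equation w.r.t. a:

$$\hat{a} = \frac{5 - \kappa_1 + \sqrt{\kappa_1^2 + 14\kappa_1 + 1}}{2\kappa_1 - 2} \tag{4.23}$$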

The valid root from the solution of the quadratic equation is the one with the (+) sign, for the same reasons as those stated in §4.4.1.2. Note again that eq. 4.23 is identical to eq. 4.15, which is the consequence of the second and fourth raw moments of the 1 sided and 2 sided Gamma density functions being identical.
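A minimal sketch of the Gamma prior estimator of eq. 4.23 follows, with hypothetical function names. It also evaluates the root with the (−) sign, which is negative whenever κ1 > 1 and therefore cannot serve as a shape value; this is a property of the quadratic itself and is not a restatement of the argument of §4.4.1.2.

```python
import math


def gamma_kurtosis(a):
    """Moment ratio E[A^4] / E[A^2]^2 of the Gamma prior (eq. 4.21)."""
    return (a + 2.0) * (a + 3.0) / (a * (a + 1.0))


def gamma_shape_roots(kappa1):
    """Both roots of (kappa_1 - 1) a^2 + (kappa_1 - 5) a - 6 = 0."""
    if kappa1 <= 1.0:
        raise ValueError("kappa_1 must exceed 1")
    disc = math.sqrt(kappa1 ** 2 + 14.0 * kappa1 + 1.0)
    denom = 2.0 * kappa1 - 2.0
    plus_root = (5.0 - kappa1 + disc) / denom    # eq. 4.23
    minus_root = (5.0 - kappa1 - disc) / denom   # negative for kappa_1 > 1
    return plus_root, minus_root


# Round trip for a few hypothetical shape values.
recovered = []
for a_true in (0.5, 1.0, 2.5):
    a_plus, a_minus = gamma_shape_roots(gamma_kurtosis(a_true))
    recovered.append((a_true, a_plus, a_minus))
```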

4.4.2.3 Lognormal priors

The expressions for the second and the fourth moments of the Lognormal priors are [56]:

$$E[A^2] = \exp\left(2\theta + a^{-1}\right), \quad \text{and} \quad E[A^4] = \exp\left(4\theta + 4a^{-1}\right)$$

and the two moments are related by

$$E[A^4] = \exp\left(2a^{-1}\right) E[A^2]^2 \tag{4.24}$$

Following the same procedure as in §4.4.2.1 we can show that

$$\exp\left(2a^{-1}\right) = \frac{E[R^4] - 4E[A^2]E[B^2] - 2\,(E[B^2])^2}{(E[A^2])^2} = \kappa_1 \tag{4.25}$$

Solving the above equation with respect to a, we have the following expression for the estimator

$$\hat{a} = \frac{2}{\ln(\kappa_1)}$$
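The Lognormal case can be sketched in the same way. Inverting eq. 4.25, exp(2/a) = κ1, gives a = 2 / ln(κ1), which is what the hypothetical function below computes. The values of a and θ are illustrative only; the moments are built from the expressions given above for the Lognormal prior.

```python
import math


def lognormal_shape_estimate(kappa1):
    """Shape estimate a_hat = 2 / ln(kappa_1), from exp(2 / a) = kappa_1."""
    if kappa1 <= 1.0:
        raise ValueError("kappa_1 must exceed 1 so that ln(kappa_1) > 0")
    return 2.0 / math.log(kappa1)


# Round trip with hypothetical values a = 1.5, theta = -0.2.
a, theta = 1.5, -0.2
a2 = math.exp(2.0 * theta + 1.0 / a)    # E[A^2] for the Lognormal prior
a4 = math.exp(4.0 * theta + 4.0 / a)    # E[A^4] for the Lognormal prior
kappa1 = a4 / a2 ** 2                   # equals exp(2 / a), independent of theta
a_hat = lognormal_shape_estimate(kappa1)
```

Note that θ cancels in the moment ratio, so the estimate depends on the data only through κ1, as in the Chi and Gamma cases.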

4.5 Summary

The priors we employ for modelling the speech STFT data have two parameters: the scale parameter θ and the shape parameter a. In this chapter we proposed a number of methods for estimating their values. The proposed methods were grouped in two categories: the first category contains methods that estimate the parameters by fitting the priors to long term speech data, while the second consists of adaptive methods.

The methods that use long term speech data were two: the first method used data from all the available frequency bins, while the second method involved fitting the priors to data from each frequency bin separately. In both cases, the best fit was provided by the Lognormal priors. The Gamma priors offered a somewhat poorer fit and the Chi priors were generally the least successful models. The priors estimated with the above methods can be called long term priors, because long term speech data are used for the estimation of their parameters.

Enhancing speech using fixed values of θ, as estimated from the long term priors, results in musical noise artifacts, as we will show in the next chapter. For this reason we investigated an adaptive method for the estimation of the scale parameter θ, which is based on the DD method for the estimation of the a priori SNR. The DD method is renowned for aiding the reduction of the musical noise artifacts, while the priors it defines are short term, as the values of their parameters change during the enhancement of speech.

The selection of an adaptive method for the estimation of the scale parameter implies that the use of long term estimates for the shape parameter a is not justified theoretically. We implemented a method for the estimation of a that is found in the literature and is compatible with the estimation of θ via the DD method. This method estimates a via fitting the priors to data from narrow a priori SNR intervals. We showed that the results of this method are not consistent and depend strongly on the selection of the a priori SNR interval. In view of the shortcomings of this method, in the following chapter we evaluate the performance of the algorithms as a function of the shape parameter a and seek an optimal value based on the results.

An adaptive method for the estimation of a was also developed, which was based on moment matching. Expressions for the estimators of a were analytically derived for each of the employed priors, and the results of this method are also evaluated in the next chapter.

Chapter 5

Evaluation

In this chapter we present the results from the evaluation of the Bayesian algorithms described in chapter 3. The evaluation is based on simulations performed with a number of clean speech phrases, artificially corrupted with additive white Gaussian and car noise, which are then enhanced with the proposed algorithms. The performance of the algorithms is measured using a number of objective measures, while formal and informal listening tests are employed to subjectively assess the quality of the enhanced speech.

Of particular interest in this evaluation is the effect of the priors’ shape parameter a on the quality of the enhanced speech. In §5.3 the performance of the algorithms is evaluated as a function of the shape parameter a, where it is revealed that its value essentially controls the trade off between the musical character of the residual noise and its overall level, while the preservation of the weaker speech spectral components is influenced to some extent. In the same section there is also a discussion on the performance of the algorithms with values extracted with the methods presented in §4.1 - §4.3. In §5.4, optimal values for a that maximise the speech quality are sought, by means of a formal subjective listening test. Finally, the adaptive scheme for the estimation of a presented in §4.4 is evaluated in §5.5. Prior to the presentation of the results however, some details about the specifics of the performed simulations and the employed evaluation measures will first be given in §5.1 and §5.2.

[Figure: car noise PSD [dB] against frequency [Hz]; frequency axis from 0 to 4000 Hz, PSD axis from −40 to −15 dB]

Figure 5.1: Car noise power spectral density.

5.1 Simulation setup

The clean speech database used for the simulations in this chapter is a subset of the database that was used in chapter 4. It comprises three male and three female speakers, each uttering 8 sentences. The total duration of the database is 2 minutes and 10 seconds and the sampling frequency is 8 kHz. The transformation to the frequency domain was performed using Hamming windows 256 samples long, overlapped by 75%. The windows were also normalised so that their amplitude when overlapped and added was 1.
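The window normalisation described above can be sketched as follows. This is an illustration under stated assumptions rather than the exact procedure used: it assumes a periodic Hamming window (the text does not say whether the window is periodic or symmetric) and that the window is applied once, so that the shifted copies themselves must sum to 1. With a hop of a quarter of the window length, the shifted copies of a periodic Hamming window sum to a constant, which makes a single scale factor sufficient.

```python
import math

N = 256          # window length in samples (from the text)
HOP = N // 4     # 75% overlap corresponds to a hop of 64 samples

# Periodic Hamming window (assumed variant).
window = [0.54 - 0.46 * math.cos(2.0 * math.pi * n / N) for n in range(N)]

# Steady-state overlap-add sum at each position within one hop.
ola = [sum(window[n + k * HOP] for k in range(N // HOP)) for n in range(HOP)]

# Scale the window so that the overlapped copies add up to 1.
scale = 1.0 / ola[0]
window = [w * scale for w in window]
ola_norm = [sum(window[n + k * HOP] for k in range(N // HOP)) for n in range(HOP)]
```

Before scaling, the overlap-add sum equals 4 × 0.54 = 2.16 at every position, because the four shifted cosine terms cancel.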

The speech phrases were corrupted with white Gaussian and car noise at 0, 10 and 20 dB input Segmental SNR. For these input Segmental SNR levels the corresponding noisy speech PESQ scores were 2.11, 2.80 and 3.46 for the white noise and 2.89, 3.49 and 4.07 for the car noise respectively¹. The white noise was computer generated, while the car noise was recorded in a car travelling on a motorway at 60 mph. The car noise contained no apparent transients or long term trends, and its power spectrum is shown in figure 5.1. To eliminate the effect of a noise estimation algorithm on the speech enhancement schemes, the noise power was estimated directly from the noise samples, which were known because the mixing of the noise with speech was performed artificially. In practice however, the noise power can be estimated with a noise estimation algorithm, such as those described in chapter 6.
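One possible way of scaling the noise to a target input Segmental SNR is sketched below. The Segmental SNR definition used here is an assumption: the mean of the per-frame SNRs in dB over non-overlapping frames, with no clamping and with zero-energy frames skipped. The definition actually used in this work may differ. All function names are hypothetical, and the "speech" signal is a synthetic amplitude-modulated sine that stands in for real data.

```python
import math
import random


def frame_energies(x, frame_len):
    """Energy of each non-overlapping frame of x."""
    return [sum(v * v for v in x[i:i + frame_len])
            for i in range(0, len(x) - frame_len + 1, frame_len)]


def segmental_snr_db(speech, noise, frame_len=256):
    """Mean of the per-frame SNRs in dB (assumed definition, no clamping)."""
    pairs = zip(frame_energies(speech, frame_len), frame_energies(noise, frame_len))
    per_frame = [10.0 * math.log10(s / n) for s, n in pairs if s > 0.0 and n > 0.0]
    return sum(per_frame) / len(per_frame)


def noise_gain(speech, noise, target_db, frame_len=256):
    """Gain g such that speech and g * noise have the target Segmental SNR."""
    current_db = segmental_snr_db(speech, noise, frame_len)
    return 10.0 ** ((current_db - target_db) / 20.0)


random.seed(1)   # fixed seed so the run is repeatable
fs = 8000        # sampling frequency in Hz (from the text)
speech = [(0.2 + 0.8 * i / 2048) * math.sin(2.0 * math.pi * 200.0 * i / fs)
          for i in range(2048)]                       # synthetic stand-in signal
noise = [random.gauss(0.0, 1.0) for _ in range(2048)]  # white Gaussian noise

g = noise_gain(speech, noise, target_db=10.0)
scaled_noise = [g * v for v in noise]
noisy = [s + v for s, v in zip(speech, scaled_noise)]
achieved_db = segmental_snr_db(speech, scaled_noise)
```

Under this definition a single gain suffices, because scaling the noise by g shifts every per-frame SNR by the same −20 log10(g) dB. Since `scaled_noise` is known, its power can then be measured directly, as described above.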

¹ For a definition of Segmental SNR and PESQ see
