In this chapter we derived a number of speech enhancement algorithms that form the backbone of this thesis. We started by formulating the problem of enhancing speech as an estimation problem in the STFT domain. We then derived a framework of STFT speech enhancement algorithms that can be grouped according to three features. The first is the STFT feature they estimate, which is either the Re and Im parts (DFT algorithms) or the STFT amplitude (amplitude algorithms). The second is the estimator they employ, which is either the MMSE or the MAP. The third is the prior used to model the clean speech samples. For the DFT algorithms, the 2 sided Chi and Gamma priors were used. For the amplitude algorithms, the priors used were the 1 sided Chi and Gamma and the Lognormal.
Two assumptions made during the development of the algorithms were that the Re and Im parts of the speech STFT are independent, and that its amplitude and phase are independent. These assumptions, which cannot hold simultaneously for models other than the Gaussian, were tested in the last section of this chapter. The results showed that although the amplitude and phase are independent, some dependencies exist between the Re and Im parts. Nevertheless, these dependencies were not taken into account in the development of the respective algorithms, because they were rather weak and their incorporation was likely to result in a significant increase in the complexity of the estimators.
Chapter 4
Parameter estimation
The prior densities used in the development of the estimators of chapter 3 have two parameters: the shape parameter a and the scale parameter θ. In the present chapter we shall examine a number of approaches for estimating their values. The estimation methods we discuss can be divided into two categories. The first is based on fitting the prior densities to a large amount of speech data and extracting the parameter values that provide the optimal fit. The second includes methods that estimate the parameters adaptively during the enhancement process. Two methods of the first category are discussed in §4.1 and §4.2, while §4.3 and §4.4 discuss two adaptive methods.
The optimal fit of the prior densities to the speech data can be found via the Kullback-Leibler (KL) divergence. Its definition for the discrete case is [60]:
\[
\mathrm{KL} = \sum_{m=1}^{N_{\mathrm{bin}}} \bigl( p_d(m) - p_s(m) \bigr) \ln \frac{p_d(m)}{p_s(m)} \tag{4.1}
\]
where p_d(m) is the pdf of the data, calculated from a histogram, and p_s(m) is the speech prior evaluated at the position of the histogram's bins. N_bin is the number of bins used for the creation of the histogram. The values of the density function parameters that provide the best fit to the data are those that minimize the KL divergence. The purpose of fitting densities to the data is twofold. Apart from extracting values for the parameters, which can subsequently be used with the estimation algorithms, the fit also shows how appropriate the proposed densities are for modelling the data.
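The fitting procedure described above can be sketched as follows. This is a minimal illustration and not the implementation used in the thesis: the one-sided Gamma density is written in its standard shape/scale form, which may differ from the parametrization of chapter 3, the data is synthetic, and the function names (`gamma_pdf`, `kl_fit_measure`, `fit_by_grid`), the grid ranges and the small constant `eps` are all assumptions made for the example.

```python
import math
import numpy as np

def gamma_pdf(x, a, theta):
    """One-sided Gamma density, standard shape/scale form (an assumption)."""
    x = np.asarray(x, dtype=float)
    log_p = (a - 1.0) * np.log(x) - x / theta - math.lgamma(a) - a * math.log(theta)
    return np.exp(log_p)

def kl_fit_measure(p_d, p_s, eps=1e-12):
    """Equation (4.1): sum over bins of (p_d - p_s) * ln(p_d / p_s)."""
    p_d = np.maximum(p_d, eps)  # guard against empty histogram bins
    p_s = np.maximum(p_s, eps)
    return float(np.sum((p_d - p_s) * np.log(p_d / p_s)))

def fit_by_grid(data, shapes, scales, n_bin=100):
    """Grid search for the (a, theta) pair that minimizes the divergence."""
    p_d, edges = np.histogram(data, bins=n_bin, density=True)
    centres = 0.5 * (edges[:-1] + edges[1:])  # positions of the histogram bins
    best = (None, None, np.inf)
    for a in shapes:
        for theta in scales:
            kl = kl_fit_measure(p_d, gamma_pdf(centres, a, theta))
            if kl < best[2]:
                best = (float(a), float(theta), kl)
    return best

# Synthetic stand-in for speech data, drawn from a known Gamma density.
rng = np.random.default_rng(0)
data = rng.gamma(shape=2.0, scale=0.5, size=200_000)

shapes = np.linspace(1.0, 3.0, 21)
scales = np.linspace(0.25, 0.75, 21)
a_hat, theta_hat, kl_min = fit_by_grid(data, shapes, scales)
```

A grid search is used only because it is easy to read; any numerical minimizer could take its place. Since every term of (4.1) is non-negative, the minimum found is also non-negative.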
A first approach to obtaining parameter estimates via the fitting method is to fit the priors to the entire STFT data (full data set) obtained from a large speech database. The results of this method are presented in §4.1. A more refined approach is to fit the priors separately to the data extracted from each single frequency bin, thus allowing for variations in the form of the densities that model data from different frequencies. The results of this second approach are shown in §4.2. In both cases, however, it must be ensured that the data to which the priors are fitted is scaled so that it has the same standard deviation as the speech data which is to be enhanced. In the present work, the data used in the evaluation of the speech enhancement algorithms is a subset of that used for fitting the priors, hence the above requirement is met.
Although the above methods can yield estimates for both a and θ, it is beneficial in the implementation of the algorithms to couple one of them with the a priori SNR. The incorporation of the a priori SNR, and its estimation with a method such as the DD method [31], is reported to aid the reduction of the background noise level and also to suppress the musical noise artifacts [16]. The a priori SNR is linked by definition to the second moment of the speech samples. Although the second moment of all the considered densities is controlled by both a and θ, simulations show that the parameter θ is related to the scale of the density, while the parameter a controls the shape. This can be easily verified by fitting a density function to a random variable multiplied by two different constants, in which case the value of a that provides the best fit remains unaffected, while θ changes according to the multiplying constant. It therefore seems more appropriate that the parameter coupled with the a priori SNR is θ. The adaptive estimation of the scale parameter via the a priori SNR and the DD method is discussed in §4.3.
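The scaling check described above can be illustrated with a short sketch. For brevity it uses moment matching for a standard one-sided Gamma density instead of the KL fit, so it is a stand-in for the experiment and not a reproduction of it; the test variable, the two constants and the helper name `gamma_moment_fit` are hypothetical.

```python
import numpy as np

def gamma_moment_fit(x):
    """Moment-matching estimates for a Gamma density: a = mean^2 / var, theta = var / mean."""
    m, v = np.mean(x), np.var(x)
    return m * m / v, v / m

# Hypothetical test variable; any positive random variable would do.
rng = np.random.default_rng(1)
x = rng.gamma(shape=0.5, scale=1.0, size=100_000)

# Fit the same variable after multiplying it by two different constants.
a1, theta1 = gamma_moment_fit(2.0 * x)
a2, theta2 = gamma_moment_fit(10.0 * x)

# Up to rounding, a1 and a2 agree, while theta2 / theta1 follows the ratio
# of the multiplying constants (10 / 2 = 5).
shape_unchanged = bool(np.isclose(a1, a2))
scale_ratio = theta2 / theta1
```

With moment matching the result holds exactly, because multiplying the data by c scales the mean by c and the variance by c squared.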
From a Bayesian point of view, the methods of §4.1 and §4.2 model speech with a long term prior. That is, a prior with fixed values of the scale and shape parameters is employed for modelling the louder and quieter portions of speech as well as the small segments of silence between words. With the introduction of the DD method, on the other hand, the priors become local or short term, because their scale is now a function of the a priori SNR, which changes with time.
The estimation of θ via the a priori SNR implies that the use of the estimates of a obtained from long term speech data (§4.1, §4.2) is not justified theoretically. The reason is that the latter methods assume a constant value of θ for the whole duration of the data, which is not the case when θ is adaptively estimated from the a priori SNR. A method for estimating a via the fitting of priors that is compatible with the adaptive estimation model of θ is shown in §4.3.3. Finally, in §4.4 we will present a method for the adaptive estimation of a, which is based on the moment matching method and is also compatible with the estimation of θ from the a priori SNR.
The speech data to which all the priors of this chapter are fitted was taken from the TIMIT database. The data used consisted of 16 male and 16 female speakers, each uttering 8 sentences. After removing the silent frames with a Voice Activity Detector (VAD), the total length of the data was 12.5 minutes. The sampling frequency was 8 kHz, while the STFT was computed with Hamming windows of 256 samples and a 75% overlap. It is conceivable that there are differences between the distribution of clean speech data and that of speech data extracted from real life noisy speech recordings. A possible source of such discrepancies is, for example, the Lombard effect. We assume, however, that the differences are not major and proceed with the use of clean speech data, which is significantly easier to obtain.
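The analysis settings quoted above (8 kHz sampling, Hamming windows of 256 samples, 75% overlap) can be sketched as follows. This is an illustrative framing routine and not the code used in the thesis: the input is random noise standing in for speech, the VAD step is omitted, and the names `FS`, `FRAME_LEN`, `HOP` and `stft` are chosen for the example.

```python
import numpy as np

FS = 8000              # sampling frequency in Hz
FRAME_LEN = 256        # Hamming window length in samples
HOP = FRAME_LEN // 4   # 75% overlap corresponds to a hop of 64 samples

def stft(signal):
    """Return the one-sided STFT of a 1-D signal, one row per frame."""
    window = np.hamming(FRAME_LEN)
    n_frames = 1 + (len(signal) - FRAME_LEN) // HOP
    frames = np.stack(
        [signal[i * HOP : i * HOP + FRAME_LEN] for i in range(n_frames)]
    )
    return np.fft.rfft(frames * window, axis=1)

# One second of noise as a placeholder for a speech recording.
rng = np.random.default_rng(2)
speech = rng.standard_normal(FS)

X = stft(speech)
re_part, im_part = X.real, X.imag   # features modelled by the 2 sided priors
amplitude = np.abs(X)               # feature modelled by the 1 sided priors
```

Pooling `re_part`, `im_part` or `amplitude` over all frames and frequency bins gives data sets of the kind that §4.1 calls the full data set.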
4.1 Fitting densities to the full data set
We begin by demonstrating the fitting of the proposed densities to the full data set, first for the Re and Im parts and then for the amplitude. Figure 4.1(a) shows the histogram of the real part of the full data set and the 2 sided Gamma and Chi densities. The respective histograms for the imaginary parts are essentially identical and are not shown. The parameters used in the densities are those that provided the best fit according to the KL divergence. Figure 4.1(b) shows the central part of figure 4.1(a). Table 4.1 shows the parameter values and the KL divergence values for the Re/Im parts.
Figure 4.1: (a) Histogram (solid) of the real part of the full data set and fitting of the Gamma (dash) and Chi (dash dot) densities, (b) zoom in the central part of (a).
Density    a (Re/Im)    θ (Re/Im)      KL (Re/Im)
Chi        0.15/0.14    0.024/0.028    669/564
Gamma      0.25/0.24    0.036/0.038    289/228
Table 4.1: Parameter values that minimize the KL divergence when fitting the 2 sided Chi and Gamma densities to the Re/Im parts of the full data set.
The results show that the Gamma density models the speech data more accurately than the Chi density, as its lower KL divergence in table 4.1 indicates. In their attempt to capture the large peak at zero, however, both distributions underestimate the long tails of the speech data histogram.
Figure 4.2(a) shows the histogram of the spectral amplitude of the full data set and the three densities with parameter values that provide the best fit according to the KL divergence. Because the speech spectral amplitude distribution has a high concentration close to zero, while a few samples have relatively large amplitudes, it is difficult for a histogram with linearly spaced bins to provide a good resolution over the whole range of values. A remedy for this problem is to calculate the histogram of the logarithm of the speech spectral amplitude instead. This is feasible since amplitude values are always non-negative and are practically never zero. Visual evaluation of the fit, however, requires that the densities are also transformed into the logarithmic domain. Figure 4.2(b) shows the histogram of the natural logarithm of the speech spectral amplitudes and the corresponding transformed densities. Table 4.2 shows the parameter values that provide the best fit according to the KL divergence. The functional forms of the densities transformed in the logarithmic domain are shown in appendix B.

Figure 4.2: Histogram (solid) of the amplitude of the full data set and fitting of the Gamma (dash), Chi (dash dot) and Lognormal (dot) densities: (a) amplitude, (b) logarithm of amplitude.
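The transformation of a density into the logarithmic domain follows the usual change of variables: if the amplitude A has density p_A(x), then L = ln A has density p_L(l) = p_A(e^l) e^l. The sketch below applies this general rule to a standard one-sided Gamma density and checks numerically that the transformed density still integrates to one. It does not reproduce the expressions of appendix B; the parameter values, the parametrization and the function names are assumptions made for illustration.

```python
import math
import numpy as np

def gamma_pdf(x, a, theta):
    """One-sided Gamma density, standard shape/scale form (an assumption)."""
    x = np.asarray(x, dtype=float)
    log_p = (a - 1.0) * np.log(x) - x / theta - math.lgamma(a) - a * math.log(theta)
    return np.exp(log_p)

def log_domain_pdf(pdf, l, *params):
    """Density of L = ln(A), obtained as p_L(l) = p_A(exp(l)) * exp(l)."""
    x = np.exp(l)
    return pdf(x, *params) * x

# Hypothetical parameter values, chosen only to exercise the transformation.
l = np.linspace(-30.0, 5.0, 20001)
p_l = log_domain_pdf(gamma_pdf, l, 0.5, 1.0)

# Trapezoidal integration over the log axis; the result should be close to 1.
area = float(np.sum(0.5 * (p_l[1:] + p_l[:-1]) * np.diff(l)))
```

The factor e^l removes the singularity that a Gamma density with a < 1 has at zero, which is one reason the logarithmic axis is convenient for visual comparison with the histogram.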
Density      a       θ        KL
Chi          0.17    0.034    1042
Gamma        0.28    0.056    464
Lognormal    0.16    -5.49    13
Table 4.2: Parameter values that minimize the KL divergence when fitting the 1 sided Chi and Gamma and the Lognormal densities to the amplitude of the full data set.
The results demonstrate clearly that the Lognormal density fits the data better than either the Gamma or the Chi. The Lognormal density is able to capture the heavy tails of the speech amplitude data and, at the same time, to model the drop of the distribution as the amplitude values approach zero. The Chi and Gamma densities, on the other hand, underestimate the tails of the distribution and additionally predict that the probability density increases as we move toward zero, which is not in agreement with the evidence provided by the data.