International Journal of Emerging Technology and Advanced Engineering
Website: www.ijetae.com (ISSN 2250-2459, ISO 9001:2008 Certified Journal, Volume 5, Issue 6, June 2015)
Wavelet Package-Based Thresholding Methods for Speech
Enhancement: A Review
Zhiyuan Chen¹, Yutong Zhang², M. Abdulghafour³
¹,²,³New York Institute of Technology - Nanjing Site, China
Abstract: The Wavelet Package Threshold Algorithm (WPTA) is an effective tool for reducing noise in speech signals, and the threshold method plays a key role in it. The commonly used invariant threshold method can cause severe loss of speech information because its threshold never changes. In this paper, three adaptive threshold methods are investigated: the Time-Frequency (TF) threshold, the Teager Energy Operator (TEO) threshold, and the Adaptive Noise Estimation (ANE) threshold. Speech signals with stationary noise and with non-stationary noise are used as samples to test these threshold methods. Experimental results are presented.
Keywords: Wavelet, Invariant Threshold, Time-Frequency Threshold, Teager Energy Operator Threshold, Adaptive Noise Estimation Threshold.
I. INTRODUCTION
Speech enhancement is the foundation of speech signal processing. Its purpose is to reduce noise and improve the quality of speech. Among the many speech enhancement methods available, wavelet analysis is a fundamental and widely used method for noise reduction, in particular for non-stationary speech signals. However, wavelet analysis focuses mainly on the low-frequency part of the speech signal and ignores the high-frequency part, which contains much important, detailed information. To analyze speech signals better and improve the de-noising results, Wavelet Package Analysis (WPA), which provides a more flexible and accurate way of analyzing speech signals, was introduced [1,2]. The Wavelet Package Threshold Algorithm (WPTA), based on WPA (as shown in Fig.1), is an effective method for removing noise from speech signals. In this algorithm, the choice of threshold is critical because the threshold determines which part of the speech signal remains. The hard threshold and soft threshold proposed by Donoho and Johnstone are the two fundamental thresholds [3]. Although they can yield a high output Signal to Noise Ratio (SNR), some useful speech components are suppressed together with the noise, which leads to severe speech distortion. Moreover, in more realistic environments, where the noise changes all the time, the threshold for high-frequency noise segments should differ from the one for low-frequency noise segments. Hence, it is necessary to vary the threshold throughout the speech enhancement process to get a better de-noising result.
Many adaptive threshold methods have been proposed. In this paper, the threshold method for WPTA is studied. The main concern is to investigate three adaptive threshold methods that improve the accuracy of threshold selection for speech signals under low SNR. The performance of these adaptive threshold methods is compared with that of one invariant threshold method through output SNR and a subjective test. Speech signals with stationary noise and with non-stationary noise are both tested to assess the performance of the different threshold methods in various realistic speech environments.
Fig.1 Wavelet Package Threshold Algorithm
II. WAVELET PACKAGE THRESHOLD ALGORITHM
A. Wavelet Package Transform
Let a discrete noisy speech signal be written as
$$x(n) = s(n) + d(n)$$
where $s(n)$ is the clean speech signal and $d(n)$ is the noise signal.
For a $J$-level Wavelet Package Transform (WPT) of $x(n)$, $x(n)$ can be decomposed into $2^J$ subbands corresponding to a wavelet package coefficient set
$$x_i(k) = \mathrm{WPT}(x(n), i)$$
where $i = 1, 2, \ldots, M$, $k = 1, 2, \ldots, N_i$, $x_i(k)$ is the $k$th coefficient of the $i$th subband, $N_i$ is the length of the $i$th subband coefficients, and $M = 2^J$.
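The full-tree decomposition described above can be illustrated with a minimal Python sketch. It uses the Haar wavelet purely for brevity (the paper does not state which wavelet was used), and the function names `haar_split` and `wavelet_packet` are illustrative, not taken from the paper.

```python
import math

def haar_split(x):
    """One Haar analysis step: return (approximation, detail), each half as long as x."""
    s = 1.0 / math.sqrt(2.0)
    approx = [(x[2 * j] + x[2 * j + 1]) * s for j in range(len(x) // 2)]
    detail = [(x[2 * j] - x[2 * j + 1]) * s for j in range(len(x) // 2)]
    return approx, detail

def wavelet_packet(x, levels):
    """Split every node at every level, giving M = 2**levels subbands.

    Subbands come back in natural filter-bank order, not frequency order.
    The signal length must be a multiple of 2**levels.
    """
    if len(x) % (2 ** levels) != 0:
        raise ValueError("signal length must be a multiple of 2**levels")
    nodes = [list(x)]
    for _ in range(levels):
        next_nodes = []
        for node in nodes:
            approx, detail = haar_split(node)
            next_nodes.extend([approx, detail])
        nodes = next_nodes
    return nodes

# Hypothetical test signal: 64 samples decomposed with J = 3.
x = [math.sin(0.3 * n) for n in range(64)]
subbands = wavelet_packet(x, levels=3)
print(len(subbands), len(subbands[0]))
```

Unlike the ordinary wavelet transform, which keeps splitting only the approximation branch, every node is split here, which is why the high-frequency part of the signal is resolved as finely as the low-frequency part.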
B. Threshold Choosing
1) Invariant Threshold
The invariant threshold proposed by Donoho and Johnstone is an initial threshold widely used by many researchers [3,4]:
$$\lambda = \sigma \sqrt{2 \log(N \log_2 N)}, \quad \sigma = \mathrm{MAD} / 0.6745$$
where $N$ is the length of the subband and the median absolute deviation (MAD) is the median of the absolute values of each subband's WPT coefficients [3].
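As a small illustration, the invariant threshold of one subband can be computed as follows. This is a sketch under the definitions given above (MAD taken as the median of the absolute coefficient values, natural logarithm for the outer log); the function name `invariant_threshold` and the sample coefficients are hypothetical.

```python
import math
import statistics

def invariant_threshold(coeffs):
    """lambda = sigma * sqrt(2 * log(N * log2(N))) with sigma = MAD / 0.6745."""
    n = len(coeffs)
    if n < 2:
        raise ValueError("need at least two coefficients")
    mad = statistics.median(abs(c) for c in coeffs)  # median of absolute values
    sigma = mad / 0.6745
    return sigma * math.sqrt(2.0 * math.log(n * math.log2(n)))

# Hypothetical subband of eight coefficients.
coeffs = [0.2, -0.1, 0.05, -0.3, 0.15, -0.05, 0.1, -0.2]
lam = invariant_threshold(coeffs)
print(round(lam, 4))
```

The same value is applied to every coefficient of the subband, which is the behaviour the adaptive methods below try to improve on.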
2) Time-Frequency (TF) Threshold
The TF threshold modulates the threshold with characteristics of the speech signal obtained in a pre-estimation step, following the masking property of the human auditory system [5].
First, select the invariant threshold proposed by Donoho and Johnstone as the initial threshold $\lambda_0$ [2]:
$$\lambda_0 = \sigma_0 \sqrt{2 \log(N \log_2 N)}, \quad \sigma_0 = \mathrm{MAD} / 0.6745$$
The traditional speech enhancement method Spectral Subtraction (SS) is used in the pre-estimation to obtain an estimated clean signal $\hat{s}(n)$ [6]. Then the WPT is applied to obtain the TF wavelet package coefficient set $\hat{s}_i(k)$ of the estimated signal,
$$\hat{s}_i(k) = \mathrm{WPT}(\hat{s}(n), i)$$
and the initial threshold $\lambda_0$ is modulated to obtain the threshold
$$\lambda_i(k) = \max(\lambda_0 - |\hat{s}_i(k)|, 0), \quad k = 1, 2, \ldots, N_i$$
3) Teager Energy Operator (TEO) Threshold
The TEO is a powerful nonlinear operator proposed by Teager; its discrete form was introduced by Kaiser [7, 8]. For a speech signal $x(n)$, it is computed as
$$\Psi[x(n)] = x^2(n) - x(n+1)\,x(n-1)$$
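The discrete operator is simple enough to check numerically. The sketch below is illustrative only: the function name `teager` is hypothetical, and copying the nearest interior value to the two end samples is an assumption made to keep the output the same length as the input.

```python
import math

def teager(x):
    """Discrete Teager energy: psi[n] = x[n]**2 - x[n+1] * x[n-1].

    The first and last samples have only one neighbour, so they are filled
    with the nearest interior value (an assumption, not from the paper).
    """
    if len(x) < 3:
        raise ValueError("need at least three samples")
    interior = [x[n] * x[n] - x[n + 1] * x[n - 1] for n in range(1, len(x) - 1)]
    return [interior[0]] + interior + [interior[-1]]

# For a pure tone A*cos(w*n) the operator returns the constant A**2 * sin(w)**2.
A, w = 2.0, 0.4
tone = [A * math.cos(w * n) for n in range(50)]
psi = teager(tone)
print(round(psi[10], 6), round((A * math.sin(w)) ** 2, 6))
```

Because the output depends on both amplitude and frequency, it responds strongly to speech-like components and only weakly to low-level noise, which is what makes it usable as a speech/noise indicator.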
The use of the TEO in speech enhancement was proposed by Bahoura and Rouat [8]. In this method, the TEO of the wavelet package coefficients is calculated to determine whether a speech frame is speech-dominated or noise-dominated, and the threshold is then adjusted according to the result.
First, calculate the TEO coefficients $T_i(k)$ of the wavelet package coefficients $x_i(k)$,
$$T_i(k) = \Psi[x_i(k)]$$
and then smooth the TEO coefficients according to
$$M_i(k) = H * T_i(k)$$
where $H$ is a second-order IIR low-pass filter applied to $T_i(k)$. In our experiment, we choose a Butterworth filter. The digital cutoff frequency is set to 0.1, a choice that depends on the length of each subband: a longer subband requires a lower cutoff frequency to make $M_i(k)$ smooth enough. In addition, the amplitude of $M_i(k)$ should be normalized so that its maximum value is 1.
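The smoothing and normalization step can be sketched in plain Python as below. The sketch assumes that the cutoff 0.1 is expressed as a fraction of the Nyquist frequency (the convention used by Matlab's filter design functions) and that the filter starts from zero initial conditions; the function names and the test input are hypothetical.

```python
import math

def butter2_lowpass(wn):
    """Second-order Butterworth low-pass designed with the bilinear transform.

    wn is the cutoff as a fraction of the Nyquist frequency (0 < wn < 1).
    Returns (b, a) with a[0] == 1.
    """
    k = math.tan(math.pi * wn / 2.0)
    norm = 1.0 / (1.0 + math.sqrt(2.0) * k + k * k)
    b = [k * k * norm, 2.0 * k * k * norm, k * k * norm]
    a = [1.0, 2.0 * (k * k - 1.0) * norm, (1.0 - math.sqrt(2.0) * k + k * k) * norm]
    return b, a

def iir_filter(b, a, x):
    """Direct-form difference equation with zero initial conditions."""
    y = []
    for n in range(len(x)):
        acc = b[0] * x[n]
        if n >= 1:
            acc += b[1] * x[n - 1] - a[1] * y[n - 1]
        if n >= 2:
            acc += b[2] * x[n - 2] - a[2] * y[n - 2]
        y.append(acc)
    return y

def smooth_and_normalize(teo, wn=0.1):
    """Low-pass the TEO coefficients, then scale so the maximum equals 1."""
    b, a = butter2_lowpass(wn)
    m = iir_filter(b, a, teo)
    peak = max(m)
    if peak <= 0:
        raise ValueError("smoothed TEO has no positive peak to normalize by")
    return [v / peak for v in m]

# Hypothetical TEO sequence: a burst of energy between two silent stretches.
teo = [0.0] * 20 + [1.0] * 20 + [0.0] * 20
m = smooth_and_normalize(teo)
print(round(max(m), 3), len(m))
```

Lowering `wn` widens the smoothing window, which matches the remark that longer subbands call for a lower cutoff.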
Next, a threshold modulation criterion is set as follows. Define an offset $S_i$ to distinguish speech frames from noise frames:
$$S_i = \mathrm{abscissa}[\max(F(M_i(k)))]$$
where $F$ is the amplitude distribution of $M_i(k)$. Note that $M_i(k)$ is in the wavelet domain, where the abscissa represents time. The abscissa values are obtained by dividing the index of each point by the sampling rate. When $S_i$ is close to the origin of the abscissa, the frame is speech-dominated; when $S_i$ is close to the end of the abscissa, the frame is noise-dominated. If $S_i$ is below the discriminating value of half the length of the speech signal $x(n)$, the threshold should be modulated; otherwise the threshold remains unchanged.
Based on this criterion, the following formula is used to suppress and normalize the offset before modulating the threshold:
$$M'_i(k) = \left[ \frac{M_i(k) - S_i}{\max(M_i(k) - S_i)} \right]^{\gamma}$$
In our experiment, $\gamma$ is set to $1/2$. The threshold is then
$$\lambda_i(k) = \lambda \, (1 - \alpha M'_i(k))$$
where $\lambda = \sigma \sqrt{2 \log(N \log_2 N)}$, $\alpha = 1$, and $N$ is the length of the noisy speech.
4) Adaptive Noise Estimation (ANE) Threshold
Define the noisy speech power as $\sigma^2(i,d)$ and the noise power as $\sigma_w^2(i,d)$, where $i$ and $d$ denote the $i$th subband of the $d$th speech frame:
$$\sigma^2(i,d) = \frac{1}{N} \sum_{n=1}^{N} \left( x_i(n,d) \right)^2$$
where $x_i$ is the wavelet package coefficient of the $i$th subband. In practice the frames overlap somewhat; in our experiment, this threshold method assumes that each frame is independent, and we ignore the last several points for better results and convenience.
Assume the first five frames are noise frames, so that the noisy speech power $\sigma^2(i,d)$ can be regarded as the noise power $\sigma_w^2(i,d)$ for the first five frames. Then set the average noise power
$$\sigma_w^2(i,d) = \frac{1}{5} \sum_{m=1}^{5} \sigma_w^2(i,d_m)$$
where $\sigma_w^2$ is the variance of the noise signal. Next, a smoothing filter is used to estimate the noise power [10]:
$$\sigma_w^2(i,d) = \alpha(i,d)\,\sigma_w^2(i,d-1) + (1 - \alpha(i,d))\,\sigma^2(i,d)$$
where
$$\alpha(i,d) = \frac{1}{1 + e^{-\beta(\mathrm{SNR}(i,d) - T)}}$$
and the SNR is given by
$$\mathrm{SNR}(i,d) = \frac{\sigma^2(i,d)}{\sigma_w^2(i,d-1)}$$
In our experiment, $T$ is set to 5 and $\beta$ is set to 0.4. Finally, the threshold is
$$\lambda(i,d) = (1 - \alpha(i,d))\,\hat{\sigma}(i,d)\,\sqrt{2 \ln N}$$
where $\hat{\sigma}(i,d) = \mathrm{MAD} / 0.6745$.
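The noise tracking recursion can be sketched for a single subband as below. The sketch is an interpretation of the printed formulas, not the authors' code: it assumes a sigmoid smoothing factor `alpha = 1 / (1 + exp(-beta * (snr - T)))`, an SNR defined as the current frame power over the previous noise estimate, and a threshold scaled by `(1 - alpha)`. The names `alpha`, `beta`, `track_noise_power`, `ane_threshold` and the sample frame powers are assumptions made for illustration.

```python
import math

def track_noise_power(frame_powers, beta=0.4, T=5.0, n_init=5):
    """Recursive noise power estimate for one subband.

    frame_powers[d] is the noisy speech power of frame d. The first n_init
    frames are treated as noise only and initialise the estimate with their mean.
    Returns (noise_estimates, alphas), one value per frame.
    """
    init = sum(frame_powers[:n_init]) / n_init
    if init <= 0:
        raise ValueError("initial noise power must be positive")
    noise = [init] * n_init
    alphas = [0.0] * n_init
    for d in range(n_init, len(frame_powers)):
        snr = frame_powers[d] / noise[d - 1]
        alpha = 1.0 / (1.0 + math.exp(-beta * (snr - T)))
        # High SNR (speech present): alpha near 1, the old estimate is kept.
        # Low SNR (noise only): alpha small, the estimate follows the frame power.
        noise.append(alpha * noise[d - 1] + (1.0 - alpha) * frame_powers[d])
        alphas.append(alpha)
    return noise, alphas

def ane_threshold(alpha, sigma_hat, n):
    """Threshold (1 - alpha) * sigma_hat * sqrt(2 * ln(n)) for one frame."""
    return (1.0 - alpha) * sigma_hat * math.sqrt(2.0 * math.log(n))

# Hypothetical frame powers: five noise frames, one more noise frame, a loud speech frame, noise again.
powers = [1.0] * 5 + [1.0, 50.0, 1.0]
noise, alphas = track_noise_power(powers)
print([round(a, 3) for a in alphas[5:]])
```

Under these assumptions the loud frame barely changes the noise estimate, and its threshold drops close to zero so that speech coefficients pass through almost unchanged.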
C. Thresholding
Soft thresholding and hard thresholding are the most widely used thresholding algorithms. However, hard thresholding can introduce residual noise such as musical noise [11]. In this paper, we choose soft thresholding [3]:
$$\tilde{x}_i(k) = \begin{cases} \mathrm{sign}(x_i(k))\,(|x_i(k)| - \lambda), & \text{if } |x_i(k)| \ge \lambda \\ 0, & \text{if } |x_i(k)| < \lambda \end{cases}$$
where $\mathrm{sign}(\cdot)$ is the sign function.
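A minimal Python sketch of soft thresholding follows; the function name `soft_threshold` and the sample values are illustrative.

```python
import math

def soft_threshold(coeffs, lam):
    """Shrink each coefficient toward zero by lam; zero out anything not above lam in magnitude."""
    out = []
    for c in coeffs:
        mag = abs(c) - lam
        out.append(math.copysign(mag, c) if mag > 0 else 0.0)
    return out

shrunk = soft_threshold([3.0, -2.0, 0.5, -0.2, 1.0], lam=1.0)
print(shrunk)  # [2.0, -1.0, 0.0, 0.0, 0.0]
```

Hard thresholding would keep 3.0 and -2.0 unchanged; the shrinkage in the soft rule avoids the jump at the threshold, which is the usual explanation for its lower level of musical noise.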
D. Inverse Wavelet Package Transform
The enhanced speech signal is synthesized by applying the inverse wavelet package transform (IWPT) to the final wavelet coefficients:
$$\tilde{s}(k) = \mathrm{IWPT}(\tilde{x}_i(k))$$
III. EXPERIMENTAL RESULTS
The experiment compares these methods in two respects: the change in SNR, examined by an objective test (output SNR), and speech distortion, examined by a subjective test, the Mean Opinion Score (MOS) [12].
Since most noises in realistic environments are non-stationary, we test these methods thoroughly by using speech samples with stationary noise (additive Gaussian noise) as well as various samples with non-stationary noise to simulate real speech environments.
A. Input SNR and output SNR
The SNR is calculated [13] by
$$\mathrm{SNR} = 10 \log_{10}\left(\frac{P_s}{P_d}\right)$$
where $P_s$ is the average power of the clean speech signal and $P_d$ is the average power of the noise signal. The average power of a signal $x(n)$ of $N$ samples, such as the noisy speech signal, can be written as
$$P[x(n)] = \frac{1}{N} \sum_{n=0}^{N-1} x^2(n)$$
In this experiment, because it is very difficult to estimate the noise from the output signal, we proceed as follows:
Step 1: Plot the output signal in Matlab and identify the pure noise segments by inspection. The criterion is that the magnitude of the residual noise is usually much smaller than that of the speech part.
Step 2: Calculate the average noise power from the selected segments. The more segments selected, the closer the result is to the accurate SNR.
Step 3: Calculate the average output signal power and subtract the average noise power obtained in Step 2.
Step 4: Calculate the SNR with the SNR formula.
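Steps 2 to 4 can be sketched in Python as follows. The manual selection of noise segments in Step 1 is represented by a list of index ranges; the function names and the synthetic signal are hypothetical and serve only to show the arithmetic.

```python
import math

def average_power(x):
    """Mean of the squared samples."""
    return sum(v * v for v in x) / len(x)

def estimate_output_snr(output, noise_segments):
    """Estimate the output SNR in dB from hand-picked noise-only segments.

    noise_segments is a list of (start, stop) index pairs (stop exclusive).
    """
    noise_samples = [v for start, stop in noise_segments for v in output[start:stop]]
    p_noise = average_power(noise_samples)            # Step 2
    p_speech = average_power(output) - p_noise        # Step 3
    if p_noise <= 0 or p_speech <= 0:
        raise ValueError("powers must be positive to form an SNR")
    return 10.0 * math.log10(p_speech / p_noise)      # Step 4

# Synthetic example: 100 low-level "noise" samples followed by 100 louder samples.
output = [0.1, -0.1] * 50 + [1.0, -1.0] * 50
snr_db = estimate_output_snr(output, [(0, 100)])
print(round(snr_db, 2))
```

The estimate is only as good as the segment selection: if a chosen segment still contains speech, the noise power is overestimated and the reported SNR is too low.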
1) Male speech with stationary noises
All noise signals are sampled at 44 kHz with input SNR from -3 dB to 3 dB, and all speech samples are decomposed by a 10-level ($J = 10$) WPT (see Fig.2).
Fig.2: Output SNR of male speech with stationary noises
2) Male speech with non-stationary noises
Three real noise environments are chosen in the experiment. We take desk friction sound as the unintended noise, laundry machine noise as the machine noise, and several different combined noises as the mixed noise. These noises are common in daily life. We add these noises to a clean male speech sample and calculate the input SNR, then compare the four threshold methods via output SNR (see TABLE 1).
TABLE 1
OUTPUT SNR OF MALE SPEECH WITH NON-STATIONARY NOISES
Noise Type | Input SNR (dB) | Output SNR (dB), Invariant | Output SNR (dB), TF | Output SNR (dB), TEO | Output SNR (dB), ANE
Unintended Noise | 3.84 | 13.22 | 6.10 | 11.33 | 5.98
Mixed Noise | 4.46 | 17.69 | 8.01 | 12.86 | 8.87
Machine Noise | 6.67 | 15.3 | 7.62 | 12.11 | 11.44
B. Mean Opinion Score(MOS)
Thirty students took part in the listening test. Each listener gives a score to each test speech sample. MOS scores range from 1 to 5, with a higher score representing better listening perception.
MOS is defined as
$$\mathrm{MOS} = \frac{1}{N} \sum_{i=1}^{5} W_i N_i$$
where $N$ is the total number of listeners, $N_i$ is the number of listeners who gave score $W_i$, and $W_i = 1, 2, 3, 4, 5$.
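As a worked example of the MOS formula, the sketch below averages a made-up tally for 30 listeners. The counts are hypothetical and are not data from this experiment; the function name `mean_opinion_score` is illustrative.

```python
def mean_opinion_score(counts):
    """counts maps each score W_i (1 to 5) to the number of listeners N_i who gave it."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no listeners")
    return sum(score * n for score, n in counts.items()) / total

# Hypothetical tally: 3 listeners gave 2, 12 gave 3, 10 gave 4, 5 gave 5.
tally = {1: 0, 2: 3, 3: 12, 4: 10, 5: 5}
mos = mean_opinion_score(tally)
print(round(mos, 2))  # (6 + 36 + 40 + 25) / 30 = 3.57
```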
Mixed speech, which contains male and female voices, is tested with stationary and non-stationary noise.
1) Mixed speech with stationary noises
TABLE 2
MOS OF MIXED SPEECH SIGNAL WITH STATIONARY NOISE
Input SNR (dB) | Invariant | TF | TEO | ANE
3 | 3.6 | 3.8 | 3.7 | 3.4
0 | 3.4 | 3.3 | 3.3 | 3.1
-3 | 3.1 | 3.0 | 2.9 | 2.8
2) Mixed speech with non-stationary noises
TABLE 3
MOS OF MIXED SPEECH SIGNAL WITH NON-STATIONARY NOISE
Noise Type | Input SNR (dB) | Invariant | TF | TEO | ANE
Unintended Noise | 3.84 | 3.1 | 3.4 | 3.6 | 3.5
Mixed Noise | 4.46 | 2.9 | 3.6 | 3.7 | 3.4
Machine Noise | 6.67 | 3.6 | 4.0 | 4.3 | 3.8
C. Enhanced Speech Waveform
Fig.3 Waveform of male speech
1) Male speech with stationary noise
Fig.4 Waveform of speech with additive Gaussian noise(SNR=3dB)
Fig.6 Waveform of enhanced speech by TF threshold method
Fig.7 Waveform of enhanced speech by TEO threshold method
Fig.8 Waveform of enhanced speech by ANE threshold method
2) Male speech with non-stationary noise
Fig.9 Waveform of male speech with machine noise(SNR=6.67dB)
Fig.10 Waveform of enhanced speech by invariant threshold method
Fig.11 Waveform of enhanced speech by TF threshold method
Fig.12 Waveform of enhanced speech by TEO threshold method
Fig.13 Waveform of enhanced speech by ANE threshold method
IV. CONCLUSION
The invariant threshold method has the best SNR performance of the four thresholds, even though it does not have the best subjective test result. Also, the invariant threshold cannot deal well with speech signals containing non-stationary noise. The other three threshold methods do not have excellent SNR performance, but their enhanced speech signals have less distortion and are easier to understand. The TEO threshold method is the most suitable for processing speech signals with non-stationary noise: it obtains the highest MOS for all three noise types in TABLE 3. The results therefore indicate that the TEO threshold method is suitable in daily practice when applying WPTA to speech enhancement.
Although the SNR performance of the TF threshold method is considerable, it relies mainly on an ideal estimation of the noise, which is difficult in reality. That is to say, the drawback of the TF threshold method is that it cannot stand alone without accurate noise estimation.
Moreover, during the experiment we found that de-noising usually performs better as the WPT level $J$ grows. However, a bigger $J$ means a longer processing time. Also, the ANE threshold requires framing, and each frame should have a certain length, so framing becomes impossible and pointless if $J$ is too big. That is the reason why we choose $J = 9$ for ANE and $J = 10$ for the other three threshold methods.
By adjusting the values of the two parameters $T$ and $\beta$ of the ANE smoothing factor, we found that they are crucial to the performance of the ANE threshold method. The values of $\beta$ and $T$ should differ for each subband, but the ANE method uses average values of $\beta$ and $T$ on every subband. Therefore, how to decide $T_i$ and $\beta_i$ for the $i$th subband is a promising topic for future work.
REFERENCES
[1] John G. Ackenhusen, Real-Time Signal Processing: Design and Implementation of Signal Processing Systems, Pearson Education, Inc., publishing as Prentice Hall PTR, 2006.
[2] Y.H. Xu, G. Wang, Y. Gu and H.Y. Liu, "A Novel Wavelet Packet Speech Enhancement Algorithm Based on Time-Frequency Threshold," International Conference on Innovative Computing, Information and Control, pp. 493, Sept. 2007.
[3] D.L. Donoho, "Denoising by soft thresholding," IEEE Transactions on Information Theory, vol. 41, pp. 613-627, May 1995.
[4] D.L. Donoho and I.M. Johnstone, "Ideal Spatial Adaptation by Wavelet Shrinkage," Biometrika, vol. 81, pp. 425-455, 1994.
[5] N. Virag, "Single Channel Speech Enhancement Based on Masking Properties of the Human Auditory System," IEEE Transactions on Speech and Audio Processing, vol. 7, pp. 126-137, 1999.
[6] S.F. Boll, "Suppression of Acoustic Noise in Speech Using Spectral Subtraction," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 27, pp. 113-120, Apr. 1979.
[7] J.F. Kaiser, "On a simple algorithm to calculate the 'energy' of a signal," IEEE International Conference on Acoustics, Speech, and Signal Processing, Albuquerque, NM, pp. 381-384, 1990.
[8] Mohammed Bahoura and Jean Rouat, "Wavelet Speech Enhancement Based on the Teager Energy Operator," IEEE Signal Processing Letters, vol. 8, pp. 10-12, Jan. 2001.
[9] R.W. Li, C.C. Bao and H.J. Dou, "Speech Enhancement Using Adaptive Threshold Based on Bi-orthogonal Wavelet Packet Decomposition," Chinese Journal of Scientific Instrument, 2008.
[10] S.F. Lei and Y.K. Tung, "Speech Enhancement for Nonstationary Noises by Wavelet Package Transform and Adaptive Noise Estimation," Proceedings of 2005 International Symposium on Intelligent Signal Processing and Communication Systems, pp. 41-44, Dec. 2005.
[11] S. Chang, Y. Kwon and S. Yang, "Speech enhancement for non-stationary noise environment by adaptive wavelet package," IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 1, pp. 561-564, May 2002.
[12] Y.J. Tian, H.Y. Zuo, Y.M. Dong and C. Wang, "A new algorithm of wavelet package adaptive threshold speech de-noising," Applied Acoustics, vol. 1, pp. 72-80, Jan. 2011.