Wavelet Package-Based Thresholding Methods for Speech Enhancement: A Review


International Journal of Emerging Technology and Advanced Engineering

Website: www.ijetae.com (ISSN 2250-2459,ISO 9001:2008 Certified Journal, Volume 5, Issue 6, June 2015)


Zhiyuan Chen¹, Yutong Zhang², M. Abdulghafour³

¹,²,³ New York Institute of Technology - Nanjing Site, China

Abstract: The Wavelet Package Threshold Algorithm (WPTA) is an effective tool for reducing noise in speech signals. The threshold method plays a key role in WPTA. The commonly used invariant threshold method can cause severe loss of speech information because its threshold never changes. In this paper, three adaptive threshold methods are investigated: the Time-Frequency (TF) threshold, the Teager Energy Operator (TEO) threshold, and the Adaptive Noise Estimation (ANE) threshold. Speech signals with stationary and non-stationary noise are used as samples to test these threshold methods, and experimental results are presented.

Keywords: Wavelet, Invariant Threshold, Time-Frequency Threshold, Teager Energy Operator Threshold, Adaptive Noise Estimation Threshold.

I. INTRODUCTION

Speech enhancement is the foundation of speech signal processing. Its purpose is to reduce noise and improve the quality of speech. Among the many speech enhancement methods available, wavelet analysis is a fundamental and widely used method for noise reduction, in particular for non-stationary speech signals. However, wavelet analysis mainly focuses on the low-frequency part of the speech signal and ignores the high-frequency part, which contains much important, detailed information. To analyze speech signals better and improve the de-noising results, Wavelet Package Analysis (WPA), which provides a more flexible and accurate way of analyzing speech signals, was introduced [1,2]. The Wavelet Package Threshold Algorithm (WPTA) based on WPA (as shown in Fig.1) is an effective method for removing noise from speech signals. In this algorithm, the choice of threshold is critical because the threshold determines which part of the speech signal is retained. The hard threshold and soft threshold proposed by Donoho and Johnstone are two fundamental thresholds [3]. Although they can yield a high output Signal to Noise Ratio (SNR), some useful speech components are suppressed together with the noise, which leads to severe speech distortion. Moreover, in more realistic environments, where the noise changes all the time, the threshold for high-frequency noise segments should differ from that for low-frequency noise segments. Hence, it is necessary to adapt the threshold throughout the speech enhancement process to get a better de-noising result.

Many adaptive threshold methods have been proposed. In this paper, the threshold method for WPTA is studied. The paper is mainly concerned with three adaptive threshold methods that improve the accuracy of threshold selection for speech signals under low SNR. The performance of these adaptive threshold methods is compared with that of one invariant threshold method through output SNR and a subjective test. Speech signals with stationary and non-stationary noise are both tested to assess the performance of the different threshold methods in various realistic speech environments.

Fig.1 Wavelet Package Threshold Algorithm

II. WAVELET PACKAGE THRESHOLD ALGORITHM

A. Wavelet Package Transform

Let a discrete noisy speech signal be written as

$$x(n) = s(n) + d(n)$$

where $s(n)$ is the clean speech signal and $d(n)$ is the noise signal.

For a $J$-level Wavelet Package Transform (WPT) of $x(n)$, $x(n)$ can be decomposed into $2^J$ subbands corresponding to a wavelet package coefficient set

$$x_i(k) = \mathrm{WPT}(x(n), i)$$

where $i = 1, 2, \ldots, M$, $k = 1, 2, \ldots, N_i$, $x_i(k)$ is the $k$th coefficient of the $i$th subband, $N_i$ is the number of coefficients in the $i$th subband, and $M = 2^J$.
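As a minimal sketch of the decomposition above, the following Python code builds a full $J$-level wavelet packet tree with the Haar wavelet. The Haar filters, the function names `haar_split` and `haar_wpt`, and the natural (not frequency-sorted) ordering of the returned subbands are assumptions made for illustration; the paper does not state which wavelet was used.

```python
import math


def haar_split(band):
    """Split one band into approximation and detail halves (orthonormal Haar)."""
    s = math.sqrt(2.0)
    half = len(band) // 2
    approx = [(band[2 * j] + band[2 * j + 1]) / s for j in range(half)]
    detail = [(band[2 * j] - band[2 * j + 1]) / s for j in range(half)]
    return approx, detail


def haar_wpt(x, levels):
    """Full wavelet packet tree: every band is split again at every level.

    Returns M = 2**levels subbands in natural (tree) order.
    """
    if len(x) % (2 ** levels) != 0:
        raise ValueError("signal length must be a multiple of 2**levels")
    bands = [list(x)]
    for _ in range(levels):
        next_bands = []
        for band in bands:
            approx, detail = haar_split(band)
            next_bands.extend([approx, detail])
        bands = next_bands
    return bands


# Illustrative signal (not from the paper): J = 2 gives M = 4 subbands.
signal = [4.0, 2.0, 5.0, 7.0, 1.0, 1.0, 3.0, 9.0]
subbands = haar_wpt(signal, levels=2)
print(len(subbands), [len(b) for b in subbands])
```

With 8 samples and $J = 2$, each of the $M = 4$ subbands holds $N_i = 2$ coefficients. Because the Haar filters are orthonormal, the total energy of the coefficients equals that of the input signal.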

B. Threshold Choosing

1) Invariant Threshold

The invariant threshold proposed by Donoho and Johnstone is an initial threshold widely used by researchers [3,4]:

$$\lambda = \sigma \sqrt{2 \log(N \log_2 N)}$$

$$\sigma = \mathrm{MAD} / 0.6745$$

where $N$ is the length of the subband and the median absolute deviation (MAD) is the median of the absolute values of each subband's WPT coefficients [3].
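The invariant threshold can be sketched in a few lines of Python. This is an illustration under stated assumptions: the outer logarithm is taken as the natural logarithm, and the function name `invariant_threshold` and the sample coefficients are hypothetical.

```python
import math
import statistics


def invariant_threshold(coeffs):
    """Invariant threshold for one subband of WPT coefficients.

    sigma = MAD / 0.6745, where MAD is the median of the absolute
    coefficient values; the outer log is assumed to be natural.
    """
    n = len(coeffs)
    if n < 2:
        raise ValueError("need at least two coefficients")
    mad = statistics.median(abs(c) for c in coeffs)
    sigma = mad / 0.6745
    return sigma * math.sqrt(2.0 * math.log(n * math.log2(n)))


# Illustrative subband coefficients (not from the paper).
coeffs = [0.5, -1.0, 0.2, 3.0, -0.1, 0.4, -0.3, 0.6]
lam = invariant_threshold(coeffs)
print(round(lam, 3))
```

The threshold depends only on the subband length and the MAD of the subband, so it stays fixed for all coefficients in that subband.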

2) Time-Frequency (TF) Threshold

The TF threshold is modulated by the characteristics of the speech signal obtained in pre-estimation, following the masking property of the human auditory system [5].

First, select the invariant threshold proposed by Donoho and Johnstone as the initial threshold $\lambda_0$ [2],

$$\lambda_0 = \sigma_0 \sqrt{2 \log(N \log_2 N)}, \quad \sigma_0 = \mathrm{MAD} / 0.6745$$

The traditional speech enhancement method Spectral Subtraction (SS) is used in pre-estimation to get the estimated clean signal $\hat{s}(n)$ [6]. Then, the WPT is applied to obtain the TF wavelet package coefficient set $\hat{s}_i(k)$ of the estimated signal,

$$\hat{s}_i(k) = \mathrm{WPT}(\hat{s}(n), i)$$

and the initial threshold $\lambda_0$ is modulated to obtain the threshold

$$\lambda_i(k) = \max(\lambda_0 - \hat{s}_i(k),\, 0), \quad k = 1, 2, \ldots, N_i$$

3) Teager Energy Operator (TEO) Threshold

TEO is a powerful nonlinear operator proposed by Teager, and its discrete form was given by Kaiser [7, 8]. The Teager energy of the speech signal $x(n)$ can be estimated by

$$\Psi[x(n)] = x^2(n) - x(n+1)\,x(n-1)$$

The use of TEO in speech enhancement was proposed by Bahoura and Rouat [8]. In this method, the TEO of the wavelet package coefficients is calculated to determine whether a speech frame is speech-dominated or noise-dominated, and the threshold is then adjusted according to the result.
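The discrete operator is simple enough to sketch directly. In the Python code below, the function name `teager_energy` and the handling of the two end samples (copying the nearest interior value) are assumptions for illustration; the paper does not say how the boundaries are treated.

```python
import math


def teager_energy(x):
    """Discrete Teager energy: psi[n] = x[n]**2 - x[n+1] * x[n-1].

    The first and last samples have no two-sided neighbours, so they
    copy the nearest interior value (an assumption of this sketch).
    """
    if len(x) < 3:
        raise ValueError("need at least three samples")
    interior = [x[n] ** 2 - x[n + 1] * x[n - 1] for n in range(1, len(x) - 1)]
    return [interior[0]] + interior + [interior[-1]]


# For a pure tone A*cos(w*n), the operator returns the constant A**2 * sin(w)**2.
amplitude, omega = 2.0, 0.3
tone = [amplitude * math.cos(omega * n) for n in range(32)]
psi = teager_energy(tone)
print(round(psi[10], 6), round((amplitude * math.sin(omega)) ** 2, 6))
```

The constant output for a pure tone shows why the operator is useful as an energy tracker: it responds to both the amplitude and the frequency of the local oscillation.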

First, calculate the TEO coefficients $T_i(k)$ of the wavelet package coefficients $x_i(k)$,

$$T_i(k) = \Psi[x_i(k)]$$

and then smooth the TEO coefficients with the following formula,

$$M_i(k) = T_i(k) * H(k)$$

where $*$ denotes convolution and $H$ is a second-order IIR low-pass filter. In our experiment, we choose the Butterworth filter. The digital cutoff frequency we choose is 0.1; the appropriate value depends on the length of each subband, since a longer subband requires a lower cutoff frequency to make $M_i(k)$ smooth enough. In addition, the amplitude of $M_i(k)$ should be normalized so that its maximum value is 1.
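The smoothing and normalization step can be sketched as follows. This is an illustration, not the authors' implementation: the cutoff of 0.1 is assumed to be normalized so that 1.0 is the Nyquist frequency, the filter starts from zero initial conditions, and the names `butter2_lowpass`, `iir_filter` and `smooth_and_normalise` are hypothetical.

```python
import math


def butter2_lowpass(cutoff):
    """Second-order Butterworth low-pass coefficients (bilinear transform).

    cutoff is assumed normalised so that 1.0 is the Nyquist frequency.
    """
    w0 = math.pi * cutoff
    alpha = math.sin(w0) / math.sqrt(2.0)  # sin(w0) / (2 * Q), Q = 1 / sqrt(2)
    cos_w0 = math.cos(w0)
    a0 = 1.0 + alpha
    b = [(1.0 - cos_w0) / (2.0 * a0), (1.0 - cos_w0) / a0, (1.0 - cos_w0) / (2.0 * a0)]
    a = [1.0, -2.0 * cos_w0 / a0, (1.0 - alpha) / a0]
    return b, a


def iir_filter(b, a, x):
    """Direct-form difference equation with zero initial conditions."""
    y = []
    for n in range(len(x)):
        acc = b[0] * x[n]
        if n >= 1:
            acc += b[1] * x[n - 1] - a[1] * y[n - 1]
        if n >= 2:
            acc += b[2] * x[n - 2] - a[2] * y[n - 2]
        y.append(acc)
    return y


def smooth_and_normalise(teo, cutoff=0.1):
    """Low-pass the TEO coefficients and scale so the maximum is 1."""
    b, a = butter2_lowpass(cutoff)
    smoothed = iir_filter(b, a, teo)
    peak = max(smoothed)
    if peak <= 0.0:
        raise ValueError("smoothed TEO has no positive peak to normalise by")
    return [v / peak for v in smoothed]


# Illustrative TEO coefficients (not from the paper).
teo = [0.0, 0.0, 1.0, 4.0, 9.0, 4.0, 1.0, 0.0, 0.0, 0.0]
m = smooth_and_normalise(teo)
print(round(max(m), 6))
```

A lower `cutoff` gives a smoother mask at the cost of a longer delay, which matches the remark that longer subbands call for a lower cutoff frequency.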

Next, a threshold modulation criterion is set as follows. Define an offset $S_i$ to distinguish speech frames from noise frames:

$$S_i = \mathrm{abscissa}[\max(F(M_i(k)))]$$

where $F$ is the amplitude distribution of $M_i(k)$. It should be noted that $M_i(k)$ is in the wavelet domain, where the abscissa represents time. To get the abscissa values, divide the index of each point by the sampling rate. When $S_i$ is close to the origin of the abscissa, the frame is speech-dominated; when $S_i$ is close to the end of the abscissa, the frame is noise-dominated. If $S_i$ is below the discriminatory value of half the length of the speech signal $x(n)$, the threshold should be modulated; otherwise the threshold remains unchanged.

Based on this criterion, the following formula is used to suppress and normalize the offset before modulating the threshold:

$$M'_i(k) = \left[\frac{M_i(k) - S_i}{\max(M_i(k) - S_i)}\right]^{\gamma}$$

In our experiment, $\gamma$ is set to be $1/2$. So, the threshold is

$$\lambda_i(k) = \lambda\,(1 - \alpha M'_i(k))$$

where $\lambda = \sigma\sqrt{2\log(N \log_2 N)}$ and $\alpha = 1$; $N$ is the length of the noisy speech.

4) Adaptive Noise Estimation (ANE) Threshold

Define the noisy speech power as $\sigma^2(i,d)$ and the noise power as $\sigma_w^2(i,d)$, where $i$ and $d$ denote the $i$th subband of the $d$th speech frame:

$$\sigma^2(i,d) = \frac{1}{N}\sum_{n=1}^{N} \left(x_i(n,d)\right)^2$$

where $x_i$ is the wavelet package coefficient of the $i$th subband. In practice, adjacent frames overlap somewhat; in our experiment, this threshold method treats each frame as independent, and the last several points are ignored for better results and convenience.

Assume the first five frames are noise frames, so the noisy speech power $\sigma^2(i,d)$ can be regarded as the noise power $\sigma_w^2(i,d)$ for the first five frames. Then, set the average noise power

$$\sigma_w^2(i,d) = \frac{1}{5}\sum_{m=1}^{5}\sigma^2(i,d_m)$$

where $d_m$ denotes the $m$th of the first five frames and $\sigma_w^2$ is the variance of the noise signal.

Next, a smoothing filter is used to estimate the noise power [10]:

$$\sigma_w^2(i,d) = \alpha(i,d)\,\sigma_w^2(i,d-1) + (1 - \alpha(i,d))\,\sigma^2(i,d)$$

where

$$\alpha(i,d) = \frac{1}{1 + e^{-\beta\,(\mathrm{SNR}(i,d) - T)}}$$

and the SNR is determined by the formula

$$\mathrm{SNR}(i,d) = \frac{\sigma^2(i,d)}{\sigma_w^2(i,d-1)}$$

In our experiment, $T$ is set to be 5 and $\beta$ is set to be 0.4.

Finally, the threshold is

$$\lambda(i,d) = (1 - \alpha(i,d))\,\hat{\sigma}(i,d)\sqrt{2\ln N}$$

where $\hat{\sigma}(i,d) = \mathrm{MAD}/0.6745$.
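The noise tracking and threshold steps for a single subband can be sketched as below. This is a sketch under assumptions, not the authors' code: the smoothing factor is taken to be a sigmoid of the SNR with slope 0.4 and offset $T = 5$, the first five frames are averaged to initialize the noise power, and the names `track_noise_power`, `ane_threshold`, `slope` and `snr_threshold` are hypothetical.

```python
import math


def track_noise_power(frame_power, slope=0.4, snr_threshold=5.0, init_frames=5):
    """Recursively estimate the noise power of one subband, frame by frame.

    frame_power[d] is the noisy-speech power of frame d. The first
    init_frames frames are assumed to contain noise only.
    Returns (noise_power, smoothing_factors); the smoothing factor is
    None for the initial frames, where no recursion is applied.
    """
    if len(frame_power) < init_frames:
        raise ValueError("need at least init_frames frames")
    initial = sum(frame_power[:init_frames]) / init_frames
    if initial <= 0.0:
        raise ValueError("initial noise power must be positive")
    noise = [initial] * init_frames
    factors = [None] * init_frames
    for d in range(init_frames, len(frame_power)):
        snr = frame_power[d] / noise[d - 1]
        factor = 1.0 / (1.0 + math.exp(-slope * (snr - snr_threshold)))
        noise.append(factor * noise[d - 1] + (1.0 - factor) * frame_power[d])
        factors.append(factor)
    return noise, factors


def ane_threshold(factor, sigma_hat, n):
    """Threshold for one frame: (1 - factor) * sigma_hat * sqrt(2 ln n)."""
    return (1.0 - factor) * sigma_hat * math.sqrt(2.0 * math.log(n))


# Illustrative frame powers (not from the paper): frame 6 is a loud speech frame.
power = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 50.0, 1.0]
noise, factors = track_noise_power(power)
print([round(v, 4) for v in noise])
```

In a speech-dominated frame the SNR is high, the smoothing factor approaches 1, and the noise estimate is left almost unchanged while the threshold shrinks toward zero. In a noise-dominated frame the factor is small, so the estimate follows the measured power and the threshold stays close to its full value.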

C. Thresholding

Soft thresholding and hard thresholding are the most widely used thresholding algorithms. However, residual noise such as musical noise [11] can be introduced when applying hard thresholding. In this paper, we choose soft thresholding [3]:

$$\tilde{x}_i(k) = \begin{cases} \mathrm{sign}(x_i(k))\,(|x_i(k)| - \lambda), & \text{if } |x_i(k)| \ge \lambda \\ 0, & \text{if } |x_i(k)| < \lambda \end{cases}$$

where $\mathrm{sign}(\cdot)$ is the sign function.
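A minimal Python sketch of the soft thresholding rule follows. The function name `soft_threshold` is hypothetical, and accepting either one threshold for the whole subband or one threshold per coefficient is a convenience added here so that the same function covers both the invariant and the adaptive thresholds.

```python
def soft_threshold(coeffs, thresholds):
    """Shrink each coefficient toward zero by its threshold.

    thresholds is either a single number or a list with one threshold
    per coefficient. Coefficients whose magnitude does not exceed the
    threshold become 0.
    """
    if isinstance(thresholds, (int, float)):
        thresholds = [thresholds] * len(coeffs)
    if len(thresholds) != len(coeffs):
        raise ValueError("need one threshold per coefficient")
    out = []
    for c, lam in zip(coeffs, thresholds):
        excess = abs(c) - lam
        if excess > 0:
            out.append(excess if c > 0 else -excess)
        else:
            out.append(0.0)
    return out


# Illustrative coefficients (not from the paper).
print(soft_threshold([3.0, -0.5, -2.0, 1.0], 1.0))
```

Unlike hard thresholding, the surviving coefficients are reduced by the threshold as well, which avoids the abrupt jump at the threshold value.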

D. Inverse Wavelet Package Transform

The enhanced speech signal is synthesized with the inverse wavelet package transform (IWPT) of the final wavelet coefficients:

$$\tilde{s}(n) = \mathrm{IWPT}(\tilde{x}_i(k))$$

III. EXPERIMENTAL RESULTS

The experiment compares these methods in two respects: the change in SNR is examined by an objective test, the output SNR, and speech distortion is examined by a subjective test, the Mean Opinion Score (MOS) [12].

Because most noises in realistic environments are non-stationary, we test these methods thoroughly by taking speech samples with stationary noise (additive Gaussian noise) and various samples with non-stationary noise to simulate real speech environments.

A. Input SNR and output SNR

The SNR is calculated [13] by

$$\mathrm{SNR} = 10\log_{10}\left(\frac{P_s}{P_d}\right)$$

where $P_s$ is the average power of the clean speech signal and $P_d$ is the average power of the noise signal. The average power of the noisy speech signal $x(n)$ of $N$ samples can be written as

$$P[x(n)] = \frac{1}{N}\sum_{n=0}^{N-1} x^2(n)$$

In this experiment, as it is very difficult to estimate the noise from the output signal, we proceeded as follows:

Step 1: Plot the output signal in Matlab and identify the pure-noise segments by inspection. The criterion is that the magnitude of residual noise is usually much smaller than that of the speech part.

Step 2: Calculate the average noise power from the selected segments. The more segments selected, the closer the result is to the accurate SNR.

Step 3: Calculate the average output signal power and subtract the average noise power obtained in Step 2.

Step 4: Calculate the SNR by the SNR formula.
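Steps 2 to 4 can be sketched in Python as follows. The manual selection of Step 1 is represented by a list of index ranges supplied by the user; the function names `average_power` and `estimate_output_snr` and the sample data are hypothetical.

```python
import math


def average_power(x):
    """Mean of the squared samples."""
    return sum(v * v for v in x) / len(x)


def estimate_output_snr(output, noise_segments):
    """Estimate the output SNR in dB from manually chosen noise-only segments.

    noise_segments is a list of (start, stop) index pairs, stop exclusive,
    marking the parts of the output judged to contain residual noise only.
    """
    noise_samples = [v for start, stop in noise_segments for v in output[start:stop]]
    if not noise_samples:
        raise ValueError("no noise samples selected")
    p_noise = average_power(noise_samples)             # Step 2
    p_speech = average_power(output) - p_noise         # Step 3
    if p_noise <= 0.0 or p_speech <= 0.0:
        raise ValueError("powers must be positive to form an SNR")
    return 10.0 * math.log10(p_speech / p_noise)       # Step 4


# Illustrative output: 8 low-level noise samples followed by 8 speech samples.
output = [0.1, -0.1] * 4 + [1.0, -1.0] * 4
snr_db = estimate_output_snr(output, [(0, 8)])
print(round(snr_db, 2))
```

Passing several `(start, stop)` pairs pools the samples of all selected segments, in line with the remark in Step 2 that more segments give a more accurate estimate.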

1) Male speech with stationary noises

All noise signals are sampled at 44 kHz with input SNR from -3 dB to 3 dB, and all the speech samples are decomposed by a 10-level (J = 10) WPT (see Fig.2).

Fig.2: Output SNR of male speech with stationary noises

2) Male speech with non-stationary noises

Three real noise environments are chosen in the experiment. We take a desk friction sound as the unintended noise, laundry machine noise as the machine noise, and a combination of several different noises as the mixed noise. These noises are common in daily life. We add these noises to a clean male speech sample and calculate the input SNR, then compare the four threshold methods via output SNR (see TABLE 1).

TABLE 1

OUTPUT SNR OF MALE SPEECH WITH NON-STATIONARY NOISES

Noise Type         Input SNR (dB)   Output SNR (dB)
                                    Invariant   TF     TEO     ANE
Unintended Noise   3.84             13.22       6.10   11.33   5.98
Mixed Noise        4.46             17.69       8.01   12.86   8.87
Machine Noise      6.67             15.3        7.62   12.11   11.44

B. Mean Opinion Score(MOS)

Thirty students took part in the listening test. Each listener gives a score to each test speech sample. MOS scores range from 1 to 5, with a higher score representing better listening perception.

MOS is defined as

$$\mathrm{MOS} = \frac{1}{N}\sum_{i=1}^{5} W_i N_i$$

where $N$ is the total number of listeners, $N_i$ is the number of listeners who gave the score $W_i$, and $W_i = 1, 2, 3, 4, 5$.
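The MOS definition amounts to a weighted average, as the short Python sketch below shows. The function name `mean_opinion_score` and the score counts are hypothetical; the counts are not taken from the listening test reported here.

```python
def mean_opinion_score(score_counts):
    """Weighted average of scores 1..5.

    score_counts maps each score W_i to N_i, the number of listeners
    who gave that score.
    """
    for score in score_counts:
        if score not in (1, 2, 3, 4, 5):
            raise ValueError("scores must be integers from 1 to 5")
    total = sum(score_counts.values())
    if total == 0:
        raise ValueError("no listeners")
    return sum(score * count for score, count in score_counts.items()) / total


# Hypothetical votes from 30 listeners for one speech sample.
counts = {1: 0, 2: 3, 3: 12, 4: 10, 5: 5}
mos = mean_opinion_score(counts)
print(round(mos, 2))
```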

Mixed speech containing male and female voices with stationary and non-stationary noise is tested.

1) Mixed speech with stationary noises

TABLE 2

MOS OF MIXED SPEECH SIGNAL WITH STATIONARY NOISE

Input SNR (dB)   Invariant   TF    TEO   ANE
3                3.6         3.8   3.7   3.4
0                3.4         3.3   3.3   3.1
-3               3.1         3.0   2.9   2.8

2) Mixed speech with non-stationary noises

TABLE 3

MOS OF MIXED SPEECH SIGNAL WITH NON-STATIONARY NOISE

Noise Type         Input SNR (dB)   Invariant   TF    TEO   ANE
Unintended Noise   3.84             3.1         3.4   3.6   3.5
Mixed Noise        4.46             2.9         3.6   3.7   3.4
Machine Noise      6.67             3.6         4.0   4.3   3.8

C. Enhanced Speech Waveform

Fig.3 Waveform of male speech

1) Male speech with stationary noise

Fig.4 Waveform of speech with additive Gaussian noise(SNR=3dB)


Fig.6 Waveform of enhanced speech by TF threshold method

Fig.7 Waveform of enhanced speech by TEO threshold method

Fig.8 Waveform of enhanced speech by ANE threshold method

2) Male speech with non-stationary noise

Fig.9 Waveform of male speech with machine noise(SNR=6.67dB)

Fig.10 Waveform of enhanced speech by invariant threshold method

Fig.11 Waveform of enhanced speech by TF threshold method

Fig.12 Waveform of enhanced speech by TEO threshold method

Fig.13 Waveform of enhanced speech by ANE threshold method

IV. CONCLUSION

The invariant threshold method has the best SNR performance of the four thresholds, even though it does not have the best subjective test result. Also, the invariant threshold cannot deal well with speech signals with non-stationary noise. The other three threshold methods do not have excellent SNR performance, but their enhanced speech signals have less distortion and are easier to understand. The TEO threshold method is the most suitable for processing speech signals with non-stationary noise; the results indicate that the TEO threshold method is suitable for daily practice when applying WPTA to speech enhancement.

Although the SNR performance of the TF threshold method is considerable, it relies mainly on ideal estimation of the noise, which is difficult in practice. That is to say, the drawback of the TF threshold method is that it cannot stand alone without accurate noise estimation.

Moreover, during the experiment we found that de-noising usually performs better as the WPT level J grows. However, a bigger J means a longer processing time. Also, the ANE threshold requires framing, and each frame must have a certain length. Framing becomes impossible and pointless if J is too big. That is why we choose J = 9 for ANE and J = 10 for the other three threshold methods.

By adjusting the values of the two ANE parameters, the threshold $T$ and the slope $\beta$, we found that they are crucial to the performance of the ANE threshold method. Ideally, the values of $\beta$ and $T$ would be chosen separately for each subband, but the ANE method uses the same average values of $\beta$ and $T$ on every subband. Therefore, how to decide $T_i$ and $\beta_i$ for the $i$th subband is a promising topic for future work.

REFERENCES

[1] John G. Ackenhusen, Real-Time Signal Processing: Design and Implementation of Signal Processing Systems, Pearson Education, Inc., publishing as Prentice Hall PTR, 2006.

[2] Y.H. Xu, G. Wang, Y. Gu and H.Y. Liu, "A Novel Wavelet Packet Speech Enhancement Algorithm Based on Time-Frequency Threshold," International Conference on Innovative Computing, Information and Control, pp. 493, Sept. 2007.

[3] D.L. Donoho, "Denoising by soft thresholding," IEEE Transactions on Information Theory, vol. 41, pp. 613-627, May 1995.

[4] D.L. Donoho and I.M. Johnstone, "Ideal Spatial Adaptation by Wavelet Shrinkage," Biometrika, vol. 81, pp. 425-455, 1994.

[5] N. Virag, "Single Channel Speech Enhancement Based on Masking Properties of the Human Auditory System," IEEE Transactions on Speech and Audio Processing, vol. 7, pp. 126-137, 1999.

[6] S.F. Boll, "Suppression of Acoustic Noise in Speech Using Spectral Subtraction," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 27, pp. 113-120, Apr. 1979.

[7] J.F. Kaiser, "On a simple algorithm to calculate the 'energy' of a signal," IEEE International Conference on Acoustics, Speech, and Signal Processing, Albuquerque, NM, pp. 381-384, 1990.

[8] Mohammed Bahoura and Jean Rouat, "Wavelet Speech Enhancement Based on the Teager Energy Operator," IEEE Signal Processing Letters, vol. 8, pp. 10-12, Jan. 2001.

[9] R.W. Li, C.C. Bao and H.J. Dou, "Speech Enhancement Using Adaptive Threshold Based on Bi-orthogonal Wavelet Packet Decomposition," Chinese Journal of Scientific Instrument, 2008.

[10] S.F. Lei and Y.K. Tung, "Speech Enhancement for Nonstationary Noises by Wavelet Package Transform and Adaptive Noise Estimation," Proceedings of 2005 International Symposium on Intelligent Signal Processing and Communication Systems, pp. 41-44, Dec. 2005.

[11] S. Chang, Y. Kwon and S. Yang, "Speech enhancement for non-stationary noise environment by adaptive wavelet package," IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 1, pp. 561-564, May 2002.

[12] Y.J. Tian, H.Y. Zuo, Y.M. Dong and C. Wang, "A new algorithm of wavelet package adaptive threshold speech de-noising," Applied Acoustics, vol. 1, pp. 72-80, Jan. 2011.