An Improved Psychoacoustic Model for Audio Coding Based on Wavelet Packet


Technologies of Information and Telecommunications March 25-29, 2007 – TUNISIA


Samar Krimi, Kaïs Ouni and Noureddine Ellouze

Laboratory of Systems and Signal Processing (LSTS) National Engineering School of Tunis (ENIT)

BP 37 Le Belvédère, 1002, Tunis, TUNISIA [email protected]

(kais.ouni, N.ellouze)@enit.rnu.tn

Abstract: This paper describes a new design of a psychoacoustic model for audio coding. It follows the model used in the MPEG-1 audio layer 3 standard but relies on an appropriate wavelet packet decomposition of the speech/audio signal. The connections of the wavelet packet tree are selected so that the sub bands match the critical bands as closely as possible. We use a gammachirp wavelet packet because the gammachirp filter has been very successful in psychoacoustic research. To evaluate the new design, we compute the masking threshold curve for a 1 kHz tone and for a real complex signal with two psychoacoustic models: the existing one based on FFT analysis and the new one based on gammachirp wavelet packet analysis. We also show gammachirp scalograms of several audio sounds.

Key words: Audio coding, gammachirp filter, psychoacoustic model, wavelet packet.

Introduction

The MPEG-1/Audio coding standard [1],[17],[18] is about to become a universal standard in many application areas with totally different requirements, in the fields of consumer electronics, professional audio processing, telecommunications, and broadcasting [31]. The MPEG-1/Audio standard represents the state of the art in audio coding. MPEG/Audio coders are controlled by psychoacoustic models, which may be improved, thus leaving room for an evolutionary improvement of codecs.

The human auditory system has some interesting properties, which are exploited in perceptual audio coding. We hear frequencies from about 20 to 22000 Hz, and we hear sounds with intensity varying over many orders of magnitude. The hearing system may thus seem to be a very wide-range instrument, which is not altogether true. To obtain those characteristics, hearing is very adaptive: what we hear depends on what kind of audio environment we are in.

Most psychoacoustic models for coding applications use a uniform (equal bandwidth) spectral decomposition as a first step to approximate the frequency selectivity of the human auditory system.

However, the equal filter properties of the uniform sub bands do not match the non uniform characteristics of cochlear filters and reduce the precision of psychoacoustic modelling.

Our study proposes an analysis method which incorporates results from psychoacoustic studies of perceptual masking and critical bands of the human auditory system. Since our ears are more sensitive to low frequencies than to high frequencies and our hearing threshold is very high in the high frequency regions, we used a compression method in which the detail coefficients (corresponding to high frequency components) of the wavelet packets are thresholded such that the error due to thresholding is inaudible to our ears [2]. In fact, a speech/audio signal is a signal whose component localization varies widely in time and frequency; it contains both high and low frequency components and both short and long duration sounds. It is therefore important to decompose audio signals into waveforms whose time-frequency properties are adapted to their local structures [11].

1. Wavelet packet sub band decomposition

The wavelet transform [12]-[13]-[14] was recently introduced as an alternative technique for analyzing non stationary signals. It provides a new way of representing a signal as a well behaved expression that yields useful properties.

The continuous wavelet transform of a signal x relative to the basic wavelet $\psi$, called the mother wavelet, which is itself a band pass function, is given by:

$$W_\psi x(a,b) = \frac{1}{\sqrt{|a|}} \int_{-\infty}^{+\infty} x(t)\, \psi^{*}\!\left(\frac{t-b}{a}\right) dt \qquad (1)$$

where $a$ and $b$ ($a, b \in \mathbb{R}$, $a \neq 0$) are the scale and translation parameters, respectively. Furthermore, if the basic wavelet satisfies the admissibility condition [13], then the wavelet reconstruction formula is, up to the normalizing admissibility constant:

$$x(t) = \iint_{\mathbb{R}^2} W_\psi x(a,b)\, \psi_{a,b}(t)\, \frac{da\, db}{a^{2}} \qquad (2)$$

where $\psi_{a,b}(t) = \frac{1}{\sqrt{|a|}}\, \psi\!\left(\frac{t-b}{a}\right)$.

The discrete wavelet transform is computed with a cascade of filter stages. The first stage splits the signal into a high pass and a low pass band, each of which is spread to full band by the subsequent down sampling. Given this spreading that accompanies down sampling, the second stage can be viewed as simply splitting the low pass portion of the original signal into halves. Each stage of the discrete wavelet transform thus splits the low pass spectrum from the previous stage; this results in an octave band pass filter bank in which the sampling rate of a sub band is proportional to its bandwidth. The wavelet analysis is sometimes inefficient because it partitions the frequency axis finely only toward the low frequencies.
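The octave-band structure described above can be illustrated with a short sketch. It assumes ideal half-band filters and the 44.1 kHz sampling rate quoted in this paper; the function name `dwt_band_edges` and the choice of five levels are hypothetical and serve only as an illustration.

```python
def dwt_band_edges(sample_rate_hz, levels):
    """Return (low, high) edges in Hz of the sub bands of an idealised
    dyadic DWT: `levels` detail bands, then the final approximation band."""
    bands = []
    high = sample_rate_hz / 2.0  # Nyquist frequency
    for _ in range(levels):
        low = high / 2.0
        bands.append((low, high))  # detail (high pass) band of this stage
        high = low                 # the next stage splits only the low pass half
    bands.append((0.0, high))      # remaining approximation band
    return bands


bands = dwt_band_edges(44100.0, 5)
for low, high in bands:
    print(f"{low:9.2f} - {high:9.2f} Hz  (width {high - low:9.2f} Hz)")
```

Each band is half as wide as the previous one, so the resolution is fine at low frequencies and coarse at high frequencies, which is the limitation that motivates the wavelet packet transform.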

The wavelet packet transform [11] constitutes a solution that permits a finer and adjustable frequency resolution at high frequencies and gives a rich structure that allows adaptation to particular signals or signal classes [12]. Compared with the wavelet transform, the wavelet packet decomposition leaves us free to iterate the filter bank through both the low pass and the high pass branches. It can be used to approximate several non uniform bandwidth decompositions, for example the critical band decomposition in the inner ear. This transformation creates a division of the frequency domain to represent the signal optimally.

Indeed, the wavelet packet decomposition offers a library of wavelets organized according to their analysis and time-frequency localization properties, and thus a band pass filtering organized as a binary tree [10]. (The depicted decomposition scheme is for a sampling rate fe = 44.1 kHz.)

The wavelet packet decomposition defined by Sinha and Tewfik [2] can be adapted to the critical-band decomposition, as shown in Figure 1. We therefore decompose the signal x(t) with a wavelet packet tree of 28 sub bands:

$$x(t) = \sum_{i=1}^{28} W_\psi x(i)\, \psi_i(t) \qquad (3)$$

where $W_\psi x(i)$ is the wavelet packet (WP) transform of x, $i$ is the sub band index, and $\psi_i$ is the WP function of the i-th sub band.

[Figure 1: wavelet packet tree with numbered sub band leaves]

Figure 1. Wavelet packet tree covering 0-22 kHz

Table 1. Parameters of the wavelet packet tree: filter number, center frequency and pass band

Filter   Center frequency (Hz)   Pass band (Hz)
1        43.75                   0-87.5
2        131.25                  87.5-175
3        218.75                  175-262.5
4        306.25                  262.5-350
5        393.75                  350-437.5
6        481.25                  437.5-525
7        568.75                  525-612.5
8        656.25                  612.5-700
9        787.5                   700-875
10       962.5                   875-1050
11       1137.5                  1050-1225
12       1312.5                  1225-1400
13       1575                    1400-1750
14       1925                    1750-2100
15       2275                    2100-2450
16       2625                    2450-2800
17       3137.5                  2800-3475
18       3812.5                  3475-4150
19       4487.5                  4150-4825
20       5162.5                  4825-5500
21       6187.5                  5500-6875
22       7562.5                  6875-8250
23       8937.5                  8250-9625
24       10312.5                 9625-11000
25       12375                   11000-13750
26       15125                   13750-16500
27       17875                   16500-19250
28       20625                   19250-22000
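As a quick check of Table 1, the sketch below rebuilds the center frequencies from the band edges and looks up the sub band that contains a given frequency. The edge values are copied from the table; the helper names `band_table` and `band_index` are hypothetical.

```python
# Band edges (Hz) of the 28 sub bands, copied from Table 1.
EDGES_HZ = [0, 87.5, 175, 262.5, 350, 437.5, 525, 612.5, 700, 875, 1050,
            1225, 1400, 1750, 2100, 2450, 2800, 3475, 4150, 4825, 5500,
            6875, 8250, 9625, 11000, 13750, 16500, 19250, 22000]


def band_table(edges):
    """Return (filter number, center, low, high) for each pair of adjacent edges."""
    return [(i + 1, (lo + hi) / 2.0, lo, hi)
            for i, (lo, hi) in enumerate(zip(edges[:-1], edges[1:]))]


def band_index(freq_hz, edges=EDGES_HZ):
    """Return the 1-based number of the sub band that contains freq_hz."""
    for i, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        if lo <= freq_hz < hi:
            return i + 1
    raise ValueError("frequency outside the 0-22000 Hz range of Table 1")


for number, center, lo, hi in band_table(EDGES_HZ):
    print(f"{number:2d}  center {center:8.2f} Hz  band {lo}-{hi} Hz")
print("1 kHz falls in sub band", band_index(1000.0))
```

The center frequency of every row is simply the midpoint of its pass band, which is how the second column of the table can be recovered from the third.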

1.1. The design of the psychoacoustic model


This model analyzes the input signal in several stages: it determines the spectrum of the signal, models the masking properties of the human auditory system, and estimates the minimum audible level. The spectrum of the input signal is computed with a wavelet packet decomposition whose connections are selected so that the sub bands match the critical bands as closely as possible. In the first stage, we apply a wavelet packet decomposition to 1024 samples of the wav signal. We adopted the decomposition of Sinha and Tewfik (28 sub bands) [4], which is a good approximation of the critical bands. In the second stage, we compute the tonal and non tonal components. This stage begins with the determination of the local maxima, followed by the extraction of the tonal (sinusoidal) and non tonal (noise) components within the bandwidth of a critical band. The selective suppression of tonal and non tonal masking components then reduces the number of maskers taken into account in the calculation of the global masking threshold. The remaining tonal and non tonal components are those above the absolute threshold of hearing. An individual masking threshold is then computed for each remaining component. Lastly, the global masking threshold is calculated from all the tonal and non tonal components deduced from the spectrum of the wavelet packet decomposition.
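Two of the stages described above, the search for local maxima and the rejection of components below the absolute threshold of hearing, can be sketched as follows. This is only an illustration under stated assumptions: the paper does not give its threshold formula, so the sketch uses Terhardt's widely quoted approximation of the threshold in quiet, and the function names and the toy spectrum are hypothetical. Tonality classification and the spreading of masking thresholds are not covered.

```python
import math


def absolute_threshold_db(freq_hz):
    """Terhardt's approximation of the threshold in quiet (dB SPL).
    Assumption: the paper does not state which formula it uses."""
    k = freq_hz / 1000.0
    return (3.64 * k ** -0.8
            - 6.5 * math.exp(-0.6 * (k - 3.3) ** 2)
            + 1e-3 * k ** 4)


def local_maxima(levels_db):
    """Indices whose level exceeds both immediate neighbours."""
    return [i for i in range(1, len(levels_db) - 1)
            if levels_db[i] > levels_db[i - 1] and levels_db[i] > levels_db[i + 1]]


def audible_maskers(freqs_hz, levels_db):
    """Local maxima that lie above the absolute threshold of hearing."""
    return [(freqs_hz[i], levels_db[i]) for i in local_maxima(levels_db)
            if levels_db[i] > absolute_threshold_db(freqs_hz[i])]


# Toy spectrum with made-up levels in dB SPL (not data from the paper).
freqs = [500.0, 1000.0, 2000.0, 8000.0, 16000.0, 18000.0]
levels = [20.0, 70.0, 25.0, 10.0, 40.0, 5.0]
print(audible_maskers(freqs, levels))
```

In this toy spectrum the peak at 16 kHz is a local maximum but lies below the assumed threshold in quiet, so only the 1 kHz component survives as a candidate masker.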

Figure 2. Design of psychoacoustic model I for layer 3 by an analysis with wavelet packet decomposition

1.2. Gammachirp model of cochlear Filter

The cochlear filter pass bands have non uniform bandwidths; the "critical bandwidth" is a function of frequency that quantifies these pass bands. Furthermore, the cochlear filter bank is based on a novel structure that supports the time and frequency resolution necessary to simulate psychophysical data closely related to cochlear spectral decomposition properties. The wavelet packet decomposition organizes the information according to a frequency segmentation, as shown in Figure 1: at each decomposition level, the information corresponding to the entire signal is divided into frequency bands of equal width on the Bark scale. The human ear has a remarkable ability to integrate certain frequency zones into bands called critical bands. This concept shows, among other things, that the ear is equipped with frequency-selective receivers, each treating a frequency zone whose width is precisely the critical bandwidth. Two sounds separated by more than one critical band excite completely disjoint receivers and are thus completely discriminated. We can therefore model the ear as a filter bank spread along the audible range [4]. Indeed, the decomposition of the frequency spectrum into critical bands corresponds to a fundamental property of hearing. Hearing can form a critical band at any point of the frequency scale. Arranging them arbitrarily side by side, we find approximately 28 critical bands in the frequency range from 0 Hz to 22 kHz [4].

1.2.1. Gammachirp filter as a wavelet

Many models have been proposed to simulate filtering properties in the inner ear. Among them, the gammachirp filters are a reasonably accurate description for auditory filtering at moderate intensity levels [6]-[7].

The choice of the gammachirp filter is based on two reasons [7]:

• The first reason is that the gammachirp filter has a well defined impulse response, unlike the conventional roex auditory filter, so it is an excellent candidate for an asymmetric, level-dependent auditory filter bank in time-domain models of auditory processing [6]-[8].

• The second reason is that this filter was derived by Irino as a theoretically optimal auditory filter that can achieve minimum uncertainty in a joint time-scale representation.

The gammachirp is constructed by adding a frequency modulation term to the gammatone function. This function has minimal uncertainty in a joint time/scale representation. The gammachirp auditory filter, which is the real part of the analytic gammachirp function, has an asymmetric amplitude characteristic and provides an excellent fit to human masking data. Its impulse response is the following function [7]:

$$g_c(t) = A\, t^{n-1}\, e^{-2\pi\beta t}\, e^{i 2\pi f_0 t + i c \ln t + i\varphi} \qquad (4)$$

$$\beta = b\, \mathrm{ERB}(f_0) \qquad (5)$$

[Figure 2 blocks: 1024 samples; wavelet packet decomposition; local maxima; elimination of the spectral components masked by the absolute threshold of hearing; determination of tonal and non tonal components; selective suppression of tonal and non tonal components of masking; individual masking threshold]

Figure 3. Examples of impulse response and corresponding spectrum of gammachirp

where time t > 0, A is the amplitude, n and b are parameters defining the envelope of the gamma distribution, and f0 is the asymptotic frequency. c is a factor introducing the asymmetry of this filter: it is the parameter for the frequency modulation, or chirp rate. φ is the initial phase, ln t is the natural logarithm of time, and ERB(f0) is the equivalent rectangular bandwidth of an auditory filter at f0.

The function ERB(f0) is defined by the expression [7]:

$$\mathrm{ERB}(f_0) = 24.7 + 0.108\, f_0 \quad \text{(Hz)} \qquad (6)$$
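Equations 4 to 6 can be evaluated directly. The following is a minimal sketch, not the authors' implementation: the function names, the default values n = 4 and b = 1.019 (taken from the figure captions of this paper), the 1 kHz asymptotic frequency, the chirp value and the 44.1 kHz sampling rate are choices made for illustration only.

```python
import cmath
import math


def erb(f0_hz):
    """Equivalent rectangular bandwidth in Hz (Equation 6)."""
    return 24.7 + 0.108 * f0_hz


def gammachirp(t, f0_hz, n=4, b=1.019, c=0.0, amplitude=1.0, phase=0.0):
    """Analytic gammachirp at time t > 0 in seconds (Equations 4 and 5)."""
    beta = b * erb(f0_hz)
    envelope = amplitude * t ** (n - 1) * math.exp(-2.0 * math.pi * beta * t)
    angle = 2.0 * math.pi * f0_hz * t + c * math.log(t) + phase
    return envelope * cmath.exp(1j * angle)


# Real part sampled at an assumed 44.1 kHz rate; t = 0 is skipped because of ln t.
fs = 44100.0
samples = [gammachirp(k / fs, 1000.0, c=-2.0).real for k in range(1, 1025)]
print(len(samples), max(abs(s) for s in samples))
```

The modulus of the analytic function is the gamma envelope alone, so the chirp term c ln t changes only the instantaneous frequency, not the envelope.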

The gammachirp function, which is a window modulated in amplitude by the frequency f0 and modulated in phase by the parameter c, can thus be seen as a roughly analytic wavelet [8].

This wavelet has the following properties: it has non compact support, it is not symmetric, it is non orthogonal and it has no scaling function [8]. The gammachirp filter bank extends logarithmically in frequency, which can be closely linked to the logarithmic tonotopic organisation of the cochlea [19].

However, in order to keep the signal decomposition functional unit as generic as possible, the spectral bands of the filter bank are assumed to be linearly spaced in frequency and symmetrical around the centre frequencies.

The individual pass bands are identical in shape and can be thought of as a base-band filter being translated up in the frequency domain to the required centre frequencies [16].

The amplitude spectrum of the gammachirp can be written in terms of the gammatone as:

$$|G_c(f)| = A_\Gamma(c)\, |G_T(f)|\, e^{c\,\theta(f)} \qquad (7)$$

where $G_c(f)$ is the Fourier transform of the gammachirp function, $G_T(f)$ is the Fourier transform of the corresponding gammatone function, $c$ is the chirp parameter, $A_\Gamma(c)$ is a gain factor which depends on $c$, and $\theta$ is given by:

$$\theta(f) = \arctan\!\left(\frac{f - f_0}{b\, \mathrm{ERB}(f_0)}\right) \qquad (8)$$

$e^{c\,\theta(f)}$ is an asymmetric function since $\theta(f)$ is an antisymmetric function centred at the asymptotic frequency $f_0$. The asymmetric function is a low pass filter for negative values of c and a high pass filter when c is positive; when c = 0 the gammachirp reduces to the complex form of the gammatone filter. Figure 4 shows a family of level dependent gammachirp filters.

[Figure 4 plot: amplitude (dB), from -80 to 0, versus frequency (Hz), from 0 to 2.2 x 10^4, with one curve for each of c = -3, -2, -1, 0, 1, 2, 3]

Figure 4. Examples of normalized gammachirp filters centered on a frequency of 10 kHz according to the c parameter (n=4, b=1.019)
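The low pass and high pass behaviour of the asymmetric function in Equations 7 and 8 can be checked numerically. The sketch below evaluates only the factor exp(c θ(f)) in dB, leaving out the gammatone spectrum and the gain factor; the function names and the probe frequencies of 5 kHz and 15 kHz are illustrative assumptions, while f0 = 10 kHz and b = 1.019 follow the caption of Figure 4.

```python
import math


def erb(f0_hz):
    """Equivalent rectangular bandwidth in Hz (Equation 6)."""
    return 24.7 + 0.108 * f0_hz


def theta(f_hz, f0_hz, b=1.019):
    """Equation 8: antisymmetric about the asymptotic frequency f0."""
    return math.atan((f_hz - f0_hz) / (b * erb(f0_hz)))


def asymmetric_gain_db(f_hz, f0_hz, c, b=1.019):
    """Gain of exp(c * theta(f)) in dB; the gammatone factor is omitted."""
    return 20.0 * c * theta(f_hz, f0_hz, b) / math.log(10.0)


f0 = 10000.0
for c in (-3.0, 0.0, 3.0):
    below = asymmetric_gain_db(5000.0, f0, c)
    above = asymmetric_gain_db(15000.0, f0, c)
    print(f"c = {c:+.0f}: {below:+.2f} dB at 5 kHz, {above:+.2f} dB at 15 kHz")
```

A negative c boosts frequencies below f0 and attenuates those above it (low pass), a positive c does the opposite (high pass), and c = 0 leaves the gammatone spectrum unchanged.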

The peak frequency fp in the amplitude spectrum can be obtained analytically by setting the derivative of Equation 7 to zero and solving for the frequency. The result is given by [15]:

$$f_p = f_0 + \frac{c\, b\, \mathrm{ERB}(f_0)}{n} \qquad (9)$$

Therefore, the size of the peak shift is proportional to the chirp parameter c and to the ratio of the envelope term b ERB(f0) to n. This shift as well as the

[Figure 5 plot: normalised amplitude, from 0 to 1, versus frequency (Hz), from 0 to 2.2 x 10^4]

Figure 5. Examples of normalized gammachirp amplitude spectrum (n=4, b=1.019 and c=3)
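As a worked instance of Equation 9, the sketch below computes the peak frequency for the parameter values quoted in the captions of Figures 4 and 5 (n = 4, b = 1.019, c = 3), assuming an asymptotic frequency of 10 kHz. The function name `peak_frequency` is hypothetical.

```python
def erb(f0_hz):
    """Equivalent rectangular bandwidth in Hz (Equation 6)."""
    return 24.7 + 0.108 * f0_hz


def peak_frequency(f0_hz, n=4, b=1.019, c=3.0):
    """Peak of the gammachirp amplitude spectrum (Equation 9)."""
    return f0_hz + c * b * erb(f0_hz) / n


f0 = 10000.0
shift = peak_frequency(f0) - f0
print(f"ERB(f0) = {erb(f0):.1f} Hz, peak shift = {shift:.2f} Hz")
```

With these values ERB(f0) is 1104.7 Hz, so the peak moves up by about 844 Hz, to roughly 10.84 kHz; changing the sign of c moves it down by the same amount.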

2. Experimental Results

This new psychoacoustic model has been implemented in a reference coder based on the MPEG-1 audio layer 3 standard [1]. Figure 6 shows the masking threshold curve for a 1 kHz tone calculated with the two psychoacoustic models: the existing one based on FFT analysis and the new one based on gammachirp wavelet packet analysis. The figure shows that the masking curve of the new psychoacoustic model matches the real masking threshold better. The masking curve of MPEG psychoacoustic model 1, in contrast, is more sensitive to the distribution of spectral energy, and the slope of its pre-energy spreading at the energy peak is nearly equal to that of its post-energy spreading, which differs from observed acoustic phenomena. The new psychoacoustic model better represents the hearing property that the pre-energy spreading effect is bigger than the post-energy spreading effect.

Figure 6. Distribution curve of masking threshold for tone at 1 kHz

Figure 7 shows the distribution of the masking threshold for a real complex signal calculated with the two psychoacoustic models.

Similarly, compared with MPEG psychoacoustic model 1, the spectral energy crests in our new psychoacoustic model have a stronger masking effect on the energy troughs in the high frequency band. For the masking effect on energy troughs in the low frequency band, both models are similar. The new psychoacoustic model is more consistent with hearing observations and experiments than MPEG psychoacoustic model 1.

Figure 7. Distribution curve of masking threshold for complex signal

We also found during the experiments that this new psychoacoustic model is not yet mature, so a subjective listening test is necessary. We initially found that the subjective quality of the new model was not better than that of MPEG psychoacoustic model 1, because the original rate control and bit allocation strategy must be adjusted to the changed psychoacoustic model. After this adjustment, the subjective quality improved remarkably.

2.1. Gammachirp scalogram of some audio sounds

The types of sounds chosen for the tests try to cover some aspects that are difficult to code, such as percussions and pure sounds.

• Rock music: this type of sound contains the electric guitar; it is not dense.

• Classic music: this type of sound contains violin as well as some percussions.

• Jazz music: this type of sound contains piano.

• Voice: a recorded sentence spoken by the first author. The recording was made in a quiet environment.

[Figures 8 to 12 each pair a gammachirp scalogram (time versus frequency in Hz, 1000 to 8000 Hz) with the waveform of the corresponding signal.]

Figure 8. Classic gammachirp scalogram & its signal

Figure 9. Voice gammachirp scalogram & its signal

Figure 10. Rock gammachirp scalogram & its signal

[Figure: Opera gammachirp scalogram and Opera signal]

Figure 12. Jazz gammachirp scalogram & its signal

It has been found that both the formant frequencies and the harmonic structures of speech are well preserved in the gammachirp scalogram. We can clearly distinguish the vowel zones and the fricative zones, as well as the clear bands that correspond to transition zones or silences.

3. Conclusion

The improved psychoacoustic model based on the gammachirp wavelet packet takes into account both the critical bands and the masking phenomenon. The essential characteristic of this model is that it performs a wavelet packet analysis on frequency bands that come close to the critical bands of the ear.

This new model proved to be practical. Because it considers more acoustic properties, the proposed psychoacoustic model can characterize the auditory properties of the human ear more precisely than the existing one based on FFT analysis. It performs well and opens some interesting perspectives for audio coding.

References

[1] ISO/IEC 11172-3 (F), “Norme internationale : technologies de l’information, Codage de l’image animée et du son associé pour les supports de stockage numérique jusqu’à environ 1,5 Mbits/s,” Partie 3 : Audio, 1993 (in French).

[2] D. Sinha and A. Tewfik, “Low bit rate transparent audio compression using adapted wavelets,” IEEE Trans. Sig. Proc., pp. 3463-3479, December 1993.

[3] E. Zwicker, G. Flottorp and S.S. Stevens, “Critical bandwidth in loudness summation,” Psychoacoustic Laboratory, Harvard University, Cambridge, Massachusetts, The Journal of the Acoustical Society, 1957.

[4] Calliope, J.P. Tubach (principal editor), “La parole et son traitement automatique,” Collection Technique et Scientifique des Télécommunications, Masson, Paris, 1989.

[5] Ted Painter, Andreas Spanias, “Perceptual coding of digital Audio,” Proceedings of the IEEE, Vol.88, No.4, April 2000.

[6] Irino, T., Patterson, R. D, “A time-domain, level-dependent auditory filter: the gammachirp,” JASA, Vol. 101, No. 1, pp. 412-419, January 1997.

[7] Ouni Kais, “Contribution to the vocal signal analysis using knowledges on the auditory perception and multiresolution time frequency representation of the speech signals,” (in french), PhD Thesis on Electrical Engineering. National Engineering School of Tunis. February 2003.

[8] Alex Park, “Using the gammachirp filter for auditory analysis of speech,” 18.327: Wavelets and Filter Banks, May 14, 2003.

[9] E. Zwicker and U. Zwicker, “Audio engineering and psychoacoustics: matching signals to the final receiver, the human auditory system,” J. Audio Eng. Soc., pp. 115-126, Mar. 1991.

[10] Laurent Buniet, “Traitement automatique de la parole en milieu bruité : étude de modèles connexionnistes statiques et dynamique,” (in French) PhD Thesis, Université Henri Poincaré-Nancy 1, spécialité informatique, February 1997.

[11] M. V. Wickerhauser, “Adapted wavelet analysis from theory to software,” Wellesley, Massachusetts, 1994.

[12] C.S. Burrus, R.A. Gopinath and H. Guo, “Introduction to wavelets and wavelet transforms: A Primer,” Prentice Hall, 1998.

[13] I. Daubechies, “Ten lectures on wavelets,” SIAM Press, 1992.

[14] S. Mallat, “A wavelet tour of signal processing,” Second Edition, Academic Press, 1999.

[15] Irino, T. and Unoki, M., “An analysis/synthesis auditory Filter bank based on an IIR implementation of the Gammachirp,” The Journal of the Acoustical


[16] Yost, W. A, “Fundamentals of hearing, an introduction,” 4th edn, Academic Press, London, 2000.

[17] Van der Waal, R.G., Brandenburg, K. and Stoll, G., “Current and future standardization of high quality digital audio coding in MPEG,” Proc. IEEE ASSP Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, NY, 1993.

[18] Brandenburg, K. and Stoll, G., “The ISO/MPEG-audio codec: A generic standard for coding of high quality digital audio,” J. Audio Eng. Soc. (AES), 42(10), 780-792, Oct. 1994.

[19] S. Krimi, K. Ouni and N. Ellouze, “Realization of a Psychoacoustic Model for MPEG 1 using Gammachirp Wavelet transform,” 13th European Signal Processing Conference, EUSIPCO 2005, Antalya, Turkey, 2005.