Artificial Bandwidth Extension Method of Telephony Speech in Mobile Terminal: A Review


International Journal of Emerging Technology and Advanced Engineering

Website: www.ijetae.com (ISSN 2250-2459, Volume 2, Issue 10, October 2012)



Tejal Chauhan¹, Ninad Bhatt², Shraddha Singh³

¹,³ Patel College of Science and Technology, Bhopal, Madhya Pradesh, India

² Veer Narmad South Gujarat University, Surat, Gujarat, India

Abstract: The restricted audio quality of today's telephone networks is mainly due to the Narrow Band (NB) limitation of the frequency range to about 300 Hz to 3.4 kHz. Meanwhile, codecs for Wide Band (WB) telephony (50 Hz to 7 kHz) exist that offer significantly improved speech intelligibility and naturalness. However, the broad introduction of wideband speech coding will require strong efforts from both network operators and customers, because many elements of the networks have to be modified. An intermediate solution to this narrowband limitation is to apply Artificial Bandwidth Extension (ABE, also abbreviated BWE) in the receiving terminal, where wideband speech is produced artificially from the received narrowband signal. BWE algorithms can be realized with or without low bit rate side information. In this paper we review the basic principles of bandwidth extension and discuss both classes of method.

Keywords: Artificial Bandwidth Extension, Narrow Band Coding, Wide Band Coding, Data Hiding, Side Information.

I. INTRODUCTION

Quality, intelligibility and naturalness of speech are the main factors in digital telecommunication systems. Speech quality is degraded when the bandwidth of the speech signal is limited to the telephone frequency band of 300-3400 Hz. Many noise reduction and error concealment techniques have been devised to improve speech quality and intelligibility, yet the speech may still sound unnatural and muffled. In particular, it is difficult to distinguish between certain unvoiced or plosive utterances, such as /s/ and /f/ or /p/ and /t/, when only a narrowband speech signal is available. This is because a considerable portion of their energy is located in the higher frequency components, while their low-frequency characteristics are easily confused [1]. Human speech contains considerably more frequency components than are used in NB telephone speech coding; the restriction stems from limits on storage, coding complexity and the bandwidth provided by telephone networks. Since the inception of pulse code modulation (PCM), a speech coding algorithm that has been used in telecommunications for more than 30 years, the frequency bandwidth has been limited to 300 Hz to 3.4 kHz.

This so-called telephone bandwidth has been used both in the Public Switched Telephone Network (PSTN) and in second-generation (2G) mobile communications such as the Global System for Mobile Communications (GSM). The major degradation of narrowband speech quality compared with wideband speech (0-8 kHz) [4] is due to the loss of information in the 50-300 Hz and 3400-8000 Hz ranges, which causes a muffled effect and reduces speech quality and intelligibility. Solutions to this bottleneck are outlined in the remainder of this section:

A. Implementation of Wide-Band coder

Implementing a wideband system yields higher signal quality and enables new applications such as hands-free speaking and teleconferencing. Several wideband speech codecs have been standardized in the past. In 1985, a first wideband speech codec (G.722) was specified by CCITT (now ITU-T) for ISDN and teleconferencing with bit rates of 64, 56 and 48 kbit/s. It is mainly used by external reporters of radio broadcast stations, who connect to the studio with special terminals over ISDN. In 1999, a second wideband codec (G.722.1) was introduced by ITU-T that produces almost comparable speech quality at reduced bit rates of 32 and 24 kbit/s. Most recently, the adaptive multi-rate wideband (AMR-WB) speech codec was standardized by ETSI and 3GPP for CDMA cellular networks such as UMTS. The AMR-WB codec has also been adopted for fixed network applications by ITU-T (G.722.2) [25]. The AMR-WB standard defines a family of wideband codecs with nine data rate modes between 6.6 and 23.85 kbit/s, together with control mechanisms to adapt the codec mode to channel conditions. Further research has led to the AMR-WB+ codec, which supports general audio in mono/stereo with frequency bandwidths from 7 to more than 16 kHz and bit rates between 6.6 and 32 kbit/s.


Despite the better speech quality offered by WB coders, a sudden replacement of all NB coding and transmission systems is not feasible because of the tremendous infrastructure expense it would impose on operators and customers alike. The current speech transmission system is a mixture of traditional narrowband terminals and new wideband terminals. It will take a long time to replace all the equipment, protocols and transmission links so that they support wideband transmission. This long transitional period between narrowband and wideband systems calls for ways to enhance speech quality without much modification of the existing network infrastructure, which has motivated the bandwidth extension approach. During the transition, different technical solutions may be employed, all of which produce WB speech at the near-end terminal.

B. Implementation of BWE in NB Coder

Many technical solutions can be considered during the long transitional period from NB to WB telephony for generating wideband speech at the near-end terminal. One alternative is to implement a bandwidth extension (BWE) algorithm in the legacy narrowband coder. Bandwidth extension artificially adds the missing frequencies of the signal at the receiver [4,6], using either only the information contained in the narrowband signal or additionally transmitted side information. This produces more natural-sounding speech, and the user can benefit from the improved wideband capabilities of the terminal. BWE approaches fall into two main groups:

• Stand-alone BWE

• BWE with side information


BWE can be applied to enhance the received speech signal. This approach does not require any modification of the sending terminal and the network. The implementation of BWE is particularly attractive for manufacturers with respect to the competition on the terminal market. For reasons of compatibility, the narrowband encoder has to be used in the WB terminal for the reverse direction. Naturally, all of the schemes described below to take the step from NB to WB quality can be applied again to realize efficient transmission of super-wideband speech based on a WB codec.

Fig.1 Steps from narrowband telephony to wideband telephony [4]

The panels of Fig. 1 show, respectively:

a) Narrowband transmission and bandwidth extension in the receiver

b) Narrowband transmitter and bandwidth extension in the network

c) Transmission of parallel BWE information for bandwidth extension

d) Embedding of BWE information into the narrowband signal

e) Speech transmission using true wideband coding

f) Wideband transmission plus bandwidth extension for super-wideband speech quality


II. STAND-ALONE BWE

Bandwidth expansion can be defined as the process of widening the signal bandwidth by artificially generating the missing frequency components of the signal at the receiving terminal using only the information contained in the narrowband signal. The main objective of ABE methods has been to enhance the quality, intelligibility and naturalness of narrowband speech.

Feature vectors Xf are extracted from the narrowband signal, and from these a set of wideband AR coefficients is estimated. To obtain a wideband estimate, the narrowband speech is first interpolated and then fed to an analysis filter derived from the estimated envelope. The resulting excitation is then extended and passed to the synthesis filter, whose output is the wideband estimate of the speech. The methods can be classified into the following categories:

• BWE with speech production model

• BWE without speech production model

The model describing the production of speech is called the source-filter model. It is motivated by studies of the human speech production system and decomposes a given speech signal into two parts: one part describes the excitation signal from the source, and the other describes the filter that is driven by this excitation. Driving the filter with the excitation yields the speech signal.
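The source-filter decomposition can be illustrated with a small sketch. It assumes an all-pole (AR) synthesis filter 1/A(z) and its FIR inverse A(z) as the analysis filter; the second-order coefficients and the impulse-train excitation are made-up values chosen only for illustration, and the function names are hypothetical.

```python
def synthesis_filter(excitation, a):
    """All-pole filter 1/A(z), with A(z) = 1 + a[0] z^-1 + ... + a[p-1] z^-p."""
    out = []
    for n, e in enumerate(excitation):
        s = e
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                s -= ak * out[n - k]
        out.append(s)
    return out


def analysis_filter(speech, a):
    """FIR inverse filter A(z): recovers the excitation (prediction residual)."""
    res = []
    for n, s in enumerate(speech):
        e = s
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                e += ak * speech[n - k]
        res.append(e)
    return res


# Hypothetical stable second-order filter and a voiced-like impulse train.
a = [-1.2, 0.8]
excitation = [1.0 if n % 20 == 0 else 0.0 for n in range(80)]

speech = synthesis_filter(excitation, a)   # source drives the filter
residual = analysis_filter(speech, a)      # inverse filtering gives the source back
```

Because the analysis filter is the exact inverse of the synthesis filter, `residual` matches `excitation` up to rounding. In a BWE system the two filters differ: the analysis filter uses the narrowband envelope and the synthesis filter uses the estimated wideband envelope.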

A. BWE with Speech Production Model

Most recent BWE algorithms are based on the linear model of speech production, for example [4], [9] and [11]. Using this model, bandwidth extension can be divided into two subtasks [4]:

• Estimation of the wideband spectral envelope

• Extension of the narrowband excitation signal

Fig.2 Stand-alone BWE algorithm [9].

1) Estimation of the wideband spectral envelope:

Many methods can be used to estimate the wideband spectral envelope. The main techniques discussed and compared in the literature are as follows:

a) Codebook Mapping based method: Codebook mapping [3] is used to estimate the high band spectral envelope. A codebook is trained on a standard wideband speech dataset, and the high band envelope is predicted from this pre-trained codebook. The received narrowband envelope is compared with the entries in the codebook, and the entry closest to the received narrowband envelope is chosen. The wideband envelope corresponding to the selected entry is used as the spectral envelope estimate. Codebook mapping can be seen as the most basic method, against which other methods are compared. It produces good results in terms of spectral distortion, but tends to overestimate high band (HB) power, which creates perceptually annoying artifacts. The method rests on a limiting assumption: two wideband spectral envelopes with the same narrowband envelope must also have the same high band envelope. It also incurs considerable computational cost and delay, because every codeword in the codebook must be compared with the narrowband spectrum to find the best match.
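A minimal sketch of the lookup step follows. It assumes the trained codebook is stored as two paired tables (a narrowband part used for matching and a wideband part returned as the estimate) and uses squared Euclidean distance; both choices are illustrative assumptions, and the numbers are toy values rather than trained envelopes.

```python
def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def codebook_map(nb_envelope, nb_codebook, wb_codebook):
    """Return the index and wideband entry paired with the nearest narrowband codeword."""
    best = min(range(len(nb_codebook)),
               key=lambda i: squared_distance(nb_envelope, nb_codebook[i]))
    return best, wb_codebook[best]


# Toy paired codebooks: row i of nb_codebook belongs to row i of wb_codebook.
nb_codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.5]]
wb_codebook = [[0.0, 0.0, 0.1, 0.1], [1.0, 1.0, 0.6, 0.4], [2.0, 0.5, 0.9, 0.2]]

index, wb_estimate = codebook_map([0.9, 1.2], nb_codebook, wb_codebook)
```

The exhaustive `min` over all codewords is the source of the search cost mentioned above: every frame requires one distance computation per codebook entry.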

b) Adaptive Codebook Mapping: In the fixed codebook mapping method, the number of high band envelopes that can be predicted is limited by the size of the codebook. J. Epps [12] proposed an interpolative codebook mapping to overcome this limitation: rather than selecting the single codeword closest to the narrowband spectral envelope, the N closest codewords are selected and their corresponding wideband code vectors are combined using a weighted average.
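The interpolative variant can be sketched as below. The text does not state how the weights are chosen, so inverse-distance weighting is used here purely as an assumption; the paired-table layout, the function name and the toy numbers are likewise illustrative.

```python
def interpolative_map(nb_envelope, nb_codebook, wb_codebook, n_closest=2, eps=1e-9):
    """Weighted average of the wideband vectors of the n_closest narrowband codewords."""
    dists = [sum((a - b) ** 2 for a, b in zip(nb_envelope, cw)) ** 0.5
             for cw in nb_codebook]
    nearest = sorted(range(len(dists)), key=lambda i: dists[i])[:n_closest]
    # Assumed weighting: closer codewords contribute more (inverse distance).
    weights = [1.0 / (dists[i] + eps) for i in nearest]
    total = sum(weights)
    dim = len(wb_codebook[0])
    return [sum(w * wb_codebook[i][d] for w, i in zip(weights, nearest)) / total
            for d in range(dim)]


nb_codebook = [[0.0], [2.0], [10.0]]
wb_codebook = [[0.0, 0.0], [2.0, 4.0], [10.0, 20.0]]

# The input lies midway between the first two codewords.
estimate = interpolative_map([1.0], nb_codebook, wb_codebook, n_closest=2)
```

The estimate falls between two stored wideband vectors, which is exactly what a fixed codebook cannot produce.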

Several other modifications have been made to improve the results. For example, [12] suggests using different codebooks for voiced and unvoiced speech frames. The voiced and unvoiced codebooks are trained separately using voicing detection, and this additional voicing information helps the expansion process.


Park and Kim [14] also report a strong preference for the GMM based method over the VQ codebook mapping method in a subjective preference test, although the computational requirements of the two methods are of about the same order. The BWE in [14] uses a GMM with Minimum Mean Square Error (MMSE) estimation, which provides reasonably high conversion accuracy. W. Fujitsuru [13] proposed a bandwidth extension algorithm based on Maximum Likelihood Estimation (MLE) with a GMM that considers dynamic features and the Global Variance (GV); the subjective test shows that the proposed algorithm outperforms the conventional MMSE based algorithm.

d) Hidden Markov Models (HMM): Compared to both a codebook and a GMM based approach, the properties of an HMM can be used to describe a time varying process. The main benefit of using HMM for envelope prediction is its capability of implicitly exploiting information from the preceding signal frames to improve the estimation quality. Each state of the HMM usually corresponds to a specific speech sound, so the change of the state of the HMM defines the envelope change of speech.

An HMM of lower order offers better results than a GMM of even higher order. In [10], the HMM is combined with Gaussian Mixture Models trained by Expectation Maximization (EM) to approximate the observation probability density functions, and the method produces a cepstral estimate of the wideband coefficients. The state and state transition probabilities are estimated from the wideband training data using the true state sequence. For the estimation, several rules were considered, namely Maximum Likelihood (ML), Maximum A Posteriori (MAP) and MMSE, of which MMSE was found to be the most appropriate. Other methods modify or extend the HMM approach, such as the continuous density HMM proposed in [15]. The drawback of the HMM method is that it requires a considerable amount of data to reliably estimate the state models and transition probabilities. It needs a large training dataset such as the TIMIT corpus (million frames), and the cost of calculating state probabilities grows with the size of such a database.
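How an HMM exploits preceding frames can be shown with a simplified sketch. It assumes the per-frame observation likelihoods are already available (in [10] they come from GMMs) and attaches one fixed envelope vector to each state, which is a simplification of the MMSE rule rather than the estimator of [10]; all numbers and names are made up.

```python
def forward_posteriors(initial, transition, likelihoods):
    """Forward recursion: P(state at frame m | observations up to frame m)."""
    n_states = len(initial)
    posteriors = []
    prev = None
    for frame_lik in likelihoods:
        if prev is None:
            alpha = [initial[i] * frame_lik[i] for i in range(n_states)]
        else:
            alpha = [frame_lik[j] * sum(prev[i] * transition[i][j] for i in range(n_states))
                     for j in range(n_states)]
        norm = sum(alpha)
        prev = [a / norm for a in alpha]
        posteriors.append(prev)
    return posteriors


def mmse_estimate(posterior, state_envelopes):
    """Posterior-weighted average of the per-state envelope vectors."""
    dim = len(state_envelopes[0])
    return [sum(p * env[d] for p, env in zip(posterior, state_envelopes))
            for d in range(dim)]


initial = [0.5, 0.5]
transition = [[0.9, 0.1], [0.1, 0.9]]
likelihoods = [[0.8, 0.2], [0.8, 0.2], [0.3, 0.7]]   # made-up p(x_m | state)
state_envelopes = [[1.0, 0.0], [0.0, 1.0]]

posteriors = forward_posteriors(initial, transition, likelihoods)
estimates = [mmse_estimate(p, state_envelopes) for p in posteriors]
```

In the third frame the observation alone favours the second state, yet the posterior still leans towards the first state because of the two preceding frames. A memoryless codebook or GMM mapping would not show this smoothing.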

e) Neural Network Based Methods: Iser and Schmidt [16] compared a neural network based spectral envelope prediction method with a codebook method and achieved better Log Spectral Distortion (LSD) and Cepstral Distortion (CD), but slightly worse Log Area Ratio (LAR) distortion. The computational cost of the neural networks is considerably lower than that of the codebook methods.
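For orientation, a common form of the log spectral distortion measure is sketched below: the root-mean-square difference, in dB, between two power spectra on the same frequency grid. The exact definition used in [16] (frequency range, averaging over frames) may differ, so this is an assumed textbook form with made-up inputs.

```python
import math


def log_spectral_distortion(power_ref, power_est, floor=1e-12):
    """RMS difference in dB between two power spectra on the same frequency grid."""
    diffs = [10.0 * math.log10(max(p, floor) / max(q, floor))
             for p, q in zip(power_ref, power_est)]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


reference = [1.0, 0.5, 0.25, 0.125]

same = log_spectral_distortion(reference, reference)
# An estimate that is uniformly 10 dB too strong, e.g. an overestimated high band.
scaled = log_spectral_distortion(reference, [10.0 * p for p in reference])
```

Identical spectra give zero distortion, and a uniform 10 dB overestimate gives an LSD of 10 dB.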

In [17], a neural network is used together with several further modifications, which is reported to give much better results than the earlier basic algorithms.

2) Extension of narrowband excitation towards high frequencies:

Spectral folding and spectral translation are presented in [18] for regenerating the high band excitation signal. Both involve up-sampling the excitation signal by an integer factor. Spectral folding mirrors the baseband spectral components into the high band, whereas spectral translation shifts the spectrum without mirroring. Spectral translation can also be made dependent on the fundamental frequency of voiced speech, so that the duplicated spectrum extends the harmonic structure correctly [18]. Pitch-adaptive modulation methods have also been developed [1], but the improvement obtained was found to be small relative to the increased complexity. Other proposed methods include nonlinear transformations [11] and sinusoidal transform coding [12].
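The difference between folding and translation can be seen on a single tone. The sketch below assumes up-sampling from 8 kHz to 16 kHz, a 1 kHz test tone and a 4 kHz modulation frequency; these values and the helper names are illustrative choices, not parameters taken from [18].

```python
import cmath
import math


def dft_magnitudes(x):
    n = len(x)
    return [abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
            for k in range(n)]


def peak_bins(x, threshold=1.0):
    """Bins from 0 up to the Nyquist bin whose DFT magnitude exceeds the threshold."""
    mags = dft_magnitudes(x)
    return [k for k in range(len(x) // 2 + 1) if mags[k] > threshold]


FS_NB, FS_WB, TONE_HZ = 8000, 16000, 1000

# Spectral folding: insert a zero after every narrowband sample (up-sampling by 2, no filter).
nb = [math.cos(2 * math.pi * TONE_HZ * t / FS_NB) for t in range(64)]
folded = [v for s in nb for v in (s, 0.0)]

# Spectral translation: modulate an already interpolated signal with a 4 kHz cosine.
interp = [math.cos(2 * math.pi * TONE_HZ * t / FS_WB) for t in range(128)]
translated = [s * math.cos(2 * math.pi * 4000 * t / FS_WB) for t, s in enumerate(interp)]

hz_per_bin = FS_WB / 128
folded_hz = [k * hz_per_bin for k in peak_bins(folded)]
translated_hz = [k * hz_per_bin for k in peak_bins(translated)]
```

Folding keeps the 1 kHz tone and adds its mirror image at 7 kHz (reflected about 4 kHz). Translation moves the tone up by the modulation frequency to 5 kHz without mirroring; the lower sideband at 3 kHz would be removed by a high-pass filter before the result is added to the narrowband excitation.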

B. BWE without Speech Production Model


H. Pulakka [8] has produced better results than other approaches.

III. REASON TO INTRODUCE BWE WITH SIDE INFORMATION

All the methods discussed in the previous section perform BWE without transmitting side information. Their main advantage is backward compatibility with both the legacy telecommunication network and the end-user terminal. They all aim at a better estimate of the high band spectral envelope using features of the narrowband speech only. Still, these methods have some disadvantages from an implementation perspective.

The disadvantage of stand-alone BWE is that an accurate estimate of the envelope usually involves complicated speaker-dependent training of statistical models, which is computationally very costly and thus not feasible for real-time processing. Although the training can also be carried out off-line and speaker-independently as an average over a large speech database, the performance of the WB reconstruction for a specific person's voice then degrades significantly. Training the models requires a standard wideband speech corpus, access to which needs authorization and is costly. Good results in terms of recovered speech can be expected only when the same or similar data are used for training and testing; for speech whose high band differs considerably from that of the trained model, the estimate yields degraded output. Stand-alone BWE is therefore not sufficient for high quality wideband speech recovery at the receiver unless additional information about the high band spectral envelope is transmitted.

IV. BWE WITH SIDE-INFORMATION

All the drawbacks mentioned above can be overcome by transmitting side information that represents the high band envelope of the wideband speech. Using this side information, wideband speech can be regenerated efficiently. To reduce the computational load while maintaining the quality of the reconstruction, high band feature parameters of the original extended band (EB) signal are estimated and encoded as side information at the transmitter. These parameters are transmitted to the receiver, which uses them to estimate the high band envelope.

Fig.3 Basic block diagram of BWE with side-information

The conventional approach is to provide a dedicated side-information channel for the transmission, but this is no longer backward compatible with the network, and reserving a network channel is not feasible. The alternative is to transmit the side information within the narrowband signal itself using an efficient data hiding scheme.

The approach here is to extract a vector representing the extended band, encode it at the transmitter terminal and send it as side information within the narrowband speech. To retain backward compatibility [4] with existing 2G networks, algorithms have been proposed that hide this side information in the narrowband speech signal or in the bit stream using data hiding methods. The side information is sent as hidden data within the NB speech; at the receiver it is extracted and decoded. This makes the estimation of wideband speech easier and improves speech quality.

The basic block diagram of BWE with side information is shown in Fig. 3. The original wideband speech is first band-split to separate the NB and HB (high band) components. The NB speech is passed to the legacy NB coder for encoding. High band feature vectors are estimated and encoded in the side-information extraction block. These coded bits are watermarked into the narrowband speech signal in the NB coder in such a manner that the quality of the NB speech is not degraded. A robust data hiding scheme is necessary to embed the side information. At the receiver, the hidden data is extracted and the BWE procedure is applied to obtain the wideband signal.


Therefore, an estimate of spectral high band envelope from the NB signal at the receiver is good enough for the reconstruction performance.

A. Side Information Extraction

The method presented in [5] targets the PSTN telephone network only: the high-pass filtered wideband signal is down-sampled to 8 kHz, and from this signal AR coefficients representing the high band features are estimated and encoded by LPC. Embedded wideband coders such as AMR-WB and AMR-WB+ also use LPC techniques for envelope shaping. In [22, 23] the sub-band envelope is computed by Selective Linear Prediction (SLP), i.e., computation of the wideband power spectrum followed by an IDFT of its upper band components and a subsequent Levinson-Durbin recursion of order 8. The resulting sub-band LPC coefficients are converted into the cepstral domain and finally quantized by a vector quantizer. The side information transmitted as watermark message m is the codebook index of the quantized cepstral vector for each speech frame. In [21] the parameter set of the side information comprises a (logarithmic) high band gain and a spectral envelope, i.e., (logarithmic) DFT domain sub-band energies estimated from the high band. These estimated vectors are then quantized using a vector quantizer.
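The Levinson-Durbin step mentioned for [22, 23] can be sketched as follows. The sketch assumes the autocorrelation sequence has already been obtained (in SLP it would come from the IDFT of the upper band power spectrum) and uses the convention A(z) = 1 + a[1] z^-1 + ... ; the test input is a made-up first-order AR autocorrelation, not speech data.

```python
def levinson_durbin(r, order):
    """Solve the normal equations for A(z) = 1 + a[1] z^-1 + ... + a[order] z^-order.

    r is the autocorrelation sequence r[0..order]; returns (a, prediction_error).
    """
    a = [1.0] + [0.0] * order
    error = r[0]
    for m in range(1, order + 1):
        acc = r[m] + sum(a[k] * r[m - k] for k in range(1, m))
        reflection = -acc / error
        prev = a[:]
        for k in range(1, m):
            a[k] = prev[k] + reflection * prev[m - k]
        a[m] = reflection
        error *= 1.0 - reflection * reflection
    return a, error


# Autocorrelation of a first-order AR process with its pole at 0.9.
r = [0.9 ** k for k in range(9)]
coeffs, err = levinson_durbin(r, order=8)
```

For this input the recursion recovers a[1] close to -0.9 and leaves the remaining seven coefficients near zero, as expected for a first-order process. In the scheme described above, the eight coefficients would next be converted to cepstral values and vector quantized.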

B. Data Hiding

In [24] various methods of data hiding in the narrowband signal are described, such as Signal Domain Data Hiding (before encoding), Bitstream Data Hiding (after encoding), and Joint Coding and Data Hiding (inside the encoder). According to [24], several requirements should be met when hiding the side information in the NB coded speech signal: negligible or small degradation of the speech quality of the narrowband codec (caused by replacing NB coded parameter bits with side information bits), low additional computational complexity, and low (ideally zero) additional algorithmic delay.
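The idea of carrying side-information bits inside the signal can be illustrated with generic least-significant-bit substitution on PCM samples. This is not one of the schemes of [24]: plain LSB embedding would not survive a lossy speech codec and is shown only as an assumed, simplified stand-in for the embed and extract steps. Sample values and bit pattern are made up.

```python
def embed_bits(samples, bits):
    """Overwrite the least significant bit of the first len(bits) samples."""
    marked = list(samples)
    for i, bit in enumerate(bits):
        marked[i] = (marked[i] & ~1) | bit
    return marked


def extract_bits(samples, count):
    """Read the hidden bits back from the least significant bits."""
    return [s & 1 for s in samples[:count]]


pcm = [1200, -873, 15, -2, 0, 32767, -32768, 501]   # made-up 16-bit samples
side_info = [1, 0, 1, 1, 0, 0, 1, 0]                # e.g. a quantizer index, bit by bit

marked = embed_bits(pcm, side_info)
recovered = extract_bits(marked, len(side_info))
max_change = max(abs(a - b) for a, b in zip(pcm, marked))
```

Each sample changes by at most one quantization step, which mirrors the requirement of small degradation, and extraction needs no extra delay. A practical scheme must additionally remain decodable after narrowband encoding and transmission.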

V. CONCLUSION

Speech quality in telephone networks is inherently limited by the restricted NB frequency range inherited from the standards of the old analogue telephone networks. If the bandwidth is increased artificially at the receiver side, the intelligibility and naturalness of the received speech signal can be improved. By applying a bandwidth extension algorithm, the missing higher frequency components can be added at the receiver to produce speech comparable to WB speech.

This is attractive because the potential quality improvement is obtained without increasing the bit rate: no extra information has to be transmitted over a separate channel, since it can be embedded in the narrowband signal itself. Stand-alone bandwidth extension is limited in real-time implementation because the computation of high band features depends entirely on a statistical model, which in turn relies on training data sets whose size limits its use. Therefore the side information describing the high band spectrum is embedded as a digital watermark at the transmitter. This hidden data is extracted at the receiving terminal and combined with the decoded narrowband speech signal, which improves the quality and intelligibility of the speech and makes it sound more natural to the user.

REFERENCES

[1] P. Jax and P. Vary, "On artificial bandwidth extension of telephone speech," Signal Process., vol. 83, no. 8, pp. 1707–1719, 2003.

[2] P. Vary and B. Geiser, "Steganographic wideband telephony using narrowband speech codecs," 41st Asilomar Conference on Signals, Systems, and Computers, Pacific Grove, CA, USA, Nov. 2007.

[3] Y. Qian and P. Kabal, "Wideband speech recovery from narrowband speech using classified codebook mapping," (McGill University) Proceedings of the 9th Australian International Conference on Speech Science & Technology, Melbourne, December 2 to 5, 2002.

[4] P. Jax and P. Vary, "Bandwidth extension of speech signals: A catalyst for the introduction of wideband speech coding?," IEEE Commun. Mag., vol. 44, no. 5, pp. 106–111, 2006.

[5] S. Chen and H. Leung, "Artificial bandwidth extension of telephony speech by data hiding," in Proc. of Intl. Symp. on Circuits and Systems (ISCAS), Kobe, Japan, May 2005.

[6] P. Jax, "Artificial Bandwidth Extension of Speech Signals," (IND) Aachen University (RWTH), Germany.

[7] Murali Mohan D, Dileep B. Karpur, Manoj Narayan and Kishore J, "Artificial bandwidth extension of narrowband speech using Gaussian Mixture Model," BNM Institute of Technology, IEEE, 2011, pp. 410-412.

[8] H. Pulakka, L. Laaksonen, M. Vainio, J. Pohjalainen, and P. Alku, "Evaluation of an artificial speech bandwidth extension method in three languages," IEEE Trans. Audio, Speech, Lang. Process., vol. 16, no. 6, Aug. 2008.

[9] P. Jax and P. Vary, "Wideband extension of telephone speech using a hidden Markov model," in IEEE Speech Coding Worksh., Delavan, WI, USA, Sept. 2000, pp. 133–135.

[10] P. Jax and P. Vary, "Artificial bandwidth extension of speech signals using MMSE estimation based on a hidden Markov model," Institute of Communication Systems and Data Processing (IND), Aachen University (RWTH), 52056 Aachen, Germany, IEEE ICASSP, 2003.


[12] J. Epps and W. H. Holmes, "A new technique for wideband enhancement of coded narrowband speech," in Proc. IEEE Workshop Speech Coding, 1999, pp. 174–176.

[13] W. Fujitsuru, H. Sekimoto, "Bandwidth extension of cellular phone speech based on maximum likelihood estimation with GMM," International Workshop on Nonlinear Circuits and Signal Processing NCSP'08, Gold Coast, Australia, March 6-8, 2008.

[14] K. Y. Park and H. S. Kim, "Narrowband to wideband conversion of speech using GMM based transformation," Proc. ICSLP, pp. 1847–1850, Istanbul, June 2000.

[15] S. Yao and C.-F. Chan, "Block-based bandwidth extension of narrowband speech signal by using CDHMM," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2005.

[16] B. Iser and G. Schmidt, "Neural networks versus codebooks in an application for bandwidth extension of speech signals," European Conference on Speech Communication and Technology, Geneva, Switzerland, September 2003.

[17] H. Pulakka, V. Myllylä, L. Laaksonen, and P. Alku, "Bandwidth extension of telephone speech using a filter bank implementation for highband mel spectrum," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 7, September 2011.

[18] J. Makhoul and M. Berouti, "High-frequency regeneration in speech coding systems," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 1979, vol. 4, pp. 428–431.

[19] H. Yasukawa, "Enhancement of telephone speech quality by simple spectrum extrapolation methods," Proc. of Eurospeech'95, pp. 1545-1548, September 1995.

[20] A. Uncini, F. Gobbi and F. Piazza, "Frequency recovery of narrowband speech using Adaptive Spline Neural Networks," pp. 997-999, 1999.

[21] B. Geiser and P. Vary, "Backwards compatible wideband telephony in mobile networks: CELP watermarking and bandwidth extension," in Proc. of ICASSP, Honolulu, Hawaii, USA, Apr. 2007.

[22] B. Geiser, P. Jax, and P. Vary, "Artificial bandwidth extension of speech supported by watermark-transmitted side information," Proc. INTERSPEECH, Lisbon, Portugal, Sept. 2005.

[23] B. Geiser, P. Jax and P. Vary, "Robust wideband enhancement of speech by combined coding and artificial bandwidth extension," Institute of Communication Systems and Data Processing, RWTH Aachen University, Templergraben 55, Aachen, Germany.

[24] P. Vary and B. Geiser, "Steganographic wideband telephony using narrowband speech codecs," 41st Asilomar Conference on Signals, Systems, and Computers, Pacific Grove, CA, USA, Nov. 2007.