COMPARATIVE STUDY OF FEATURE EXTRACTION TECHNIQUES FOR PUNJABI SPEECH RECOGNITION

(1)

COMPARATIVE STUDY OF FEATURE EXTRACTION TECHNIQUES FOR PUNJABI SPEECH

RECOGNITION

Jyoti Guglani

¹

,A. N. Mishra

²

1

Department of Electronics & Communication, Institute of Management Studies, Ghaziabad, (India)

2

Department of Electronics & Communication, Krishna Engineering College, Ghaziabad, (India)

ABSTRACT

This paper evaluates comparative recognition performance of Linear Predictive Coefficients (LPC), Linear Predictive Cepstral Coefficient (LPCC), Perceptual Linear Prediction Coefficients (PLP), Mel Frequency Cepstral Coefficient (MFCC) and Wavelet coefficients. These features have been tested for speaker independent isolated Punjabi digits recognition. The comparative study of recognition performance of LPC, LPCC, PLP, MFCC and Wavelet coefficients has been studied. The recognition performance of PLP is better than LPC and LPCC coefficients. The recognition performance of LPCC coefficients is 4.2% better than LPC. The recognition performance of PLP is 10.95% better than LPCC coefficient. And MFCC coefficient gives 2.76% better performance than PLP. Wavelet Coefficients gives 14.79% better recognition performance than MFCC. A classifier is implemented by using Hidden Markov Model (HMM) on MATLAB for training and testing of Punjabi digits.

Keywords: Linear Prediction Coefficient, Linear Prediction Cepstral Coefficient, Mel Frequency Cepstral Coefficients, Perceptual Linear Prediction Coefficients, Wavelet Coefficient, Hidden Markov Model

I. INTRODUCTION

Speech is the primary mode of communication. It is a way of sharing facts, thoughts and emotions and also a way of transferring human intelligence from one person to another [1]. Speech recognition is the process of converting an acoustic signal, captured by a microphone or a telephone, to a set of words. The recognized words can be the final results, as for applications such as commands and control, data entry, and document preparation.Speech recognition has been a goal of research for more than four decades but the desired goal is yet to be achieved, namely, a machine that can understand spoken discourse on any subject by all speakers in all environments[2,3]. A speech recognition system basically comprises of a collection of algorithms from a variety of disciplines, such as statistical pattern recognition, communication theory, signal processing, combinatorial mathematics, linguistic etc. In different recognizers each of these areas is relied on to different degrees. One of the most difficult aspects of performing research in speech recognition by machine is it‟s inter disciplinary nature and the tendency of most researcher to apply a monolithic approach to individual problem.

A speech recognition system has three major components: database preparation, feature extraction and classification as shown in Figure 1. A prerequisite for the development and evaluation of automatic speech

(2)

recognition system is the availability of appropriate database. The recognition performance depends on the performance of the feature extraction technique. Thus choice of features and its extraction from the speech signal should be such that it gives high recognition performance with reasonable amount of computation. LPC, LPCC, PLP, MFCC coefficients and Wavelet features techniques [4, 5, 6, 7, 8] have been used by various

researchers for speech recognition with different languages.

In this work Hidden Markov Model (HMM) [9] is used for classification. Satisfactory results are obtained by using all the kinds of feature extraction techniques in combination with this classifier for recognition of speaker independent Isolated Punjabi digits. Database preparation and feature extraction techniques are given in Sections 2 and 3 respectively. Experimental setup and results are explained in Section 4. Conclusions are drawn in Section 5.

II. DATABASE PREPARATION

„Cool software‟ was used for preparation of this database. A database of twenty four speakers, twelve females and twelve males was made for a total of ten Punjabi digits -‘sifar’ ,‘ikk’, ’do’, ’tinn’, ’char’,

‘panj’,chhe’,’satt’,’atth’,’nau’(i.e.‟zero‟,‟one‟,‟two‟, ‟three‟, „four‟,‟five‟,‟six‟,‟seven‟,‟eight‟,‟nine‟). It was prepared with sampling frequency 16 kHz and 8 bits per sample. Speakers were chosen from different geographical areas of Punjab, different social classes and of different age groups. Every speaker was asked to repeat each digit ten times with short inter-digit pauses. Further, all ten repetitions of each digit were segmented manually. This Database for twelve female and twelve male speakers from different regions of Punjab was prepared by using „Cool software‟. The people from different age group is chosen because of convinces of availability. A distance of 2-6 inch was maintained between microphone and the speaker at the time of database recording. The microphone of Sony make is used for recording the database.

III. FEATURE EXTRACTION

The feature extraction involves simplifying the amount of resources required to describe a large set of data accurately. The raw speech signal is complex and may not be suitable for feeding as input to an automatic speech recognition system; hence the need for a good front-end arises. The task of this front-end is to extract all relevant acoustic information in a compact form compatible with the acoustic models. In other words, the preprocessing should remove all non-relevant information such as background noise and characteristics of the recording device, and encode the remaining (relevant) information in a compact set of features that can be given as input to the classifier. Features can be defined as a minimal unit, which distinguishes maximally close classes.

The entire scheme for feature extraction using LPC, LPCC, PLP, MFCC and Wavelet Coefficients are shown in Figure 2.

3.1 Linear Prediction Coefficient

Linear prediction technique [4] is used to derive the filter coefficients (corresponding to the vocal tract) by minimizing the mean square error between the input and the estimated sample. These coefficients are extracted

(3)

using the auto-correlation method or the covariance method.The all pole representation of the vocal tract transfer function H

 

z can be represented by the following equation:

  

2 p ^p



1 2

1z a z ... a z a

1

G )

z ( A z G

H _ _ _



 

 (1)

The values

a

iare called the prediction coefficients while G represents the amplitude or gain associated with the vocal tract excitation.

Pre-processing, framing and windowing are common steps for the LPC, LPCC, PLP, MFCC and Wavelet Coefficients feature extraction techniques. Pre-emphasis filtering, normalization and mean subtraction are the three steps in Pre-processing. The digitized speech is pre-emphasized using a digital filter with a transfer function

^H   ^z ^ ¹ ^ ⁰ ^. ⁹⁸ ^z

^¹. Pre-emphasis filter spectrally flattens the signal and makes it less susceptible to finite precision effects later in the signal processing. Due to possible mismatch between training and test conditions, it is considered good practice to reduce the amount of variations in the data that does not carry important speech information as much as possible. For instance, differences in loudness between recordings are irrelevant for recognition. For reduction of such irrelevant sources of variation, normalization transforms are applied. During normalization every sample value of the speech signal is divided by the highest amplitude sample value. Mean of the speech signal is subtracted from the speech signal to remove the DC offset.

After preprocessing, Punjabi digit samples were divided into different frames of 25ms duration. The second frame starts after 15ms of first frame and overlaps the first frame by 10ms. The third frame starts after 15ms of second frame and overlaps the second frame by 10ms and this process continue till the end of all the frames of speech sample. Next each frame was multiplied by a hamming window and 13^thorder LPC analysis was performed on each frame. First thirteen LPC coefficients were chosen for each frame. Finally thirteen LPC coefficients were selected for each sample of Punjabi digits by applying vector quantization on features of all frames of each sample of each Punjabi digit.

3.2 Linear Prediction Cepstral Coefficient

A very important LPC parameter set, directly derived from LPC coefficients are the cepstral coefficients [4, 5].

The cepstral coefficients, which are the coefficients of the Fourier transform representation of the log magnitude spectrum, have proved more robust, reliable feature set for speech recognition application. A special feature of cepstral coefficient is orthogonality [5].

(4)

The recursion used for LPCC features from LPC is as follows:

2 0

 ln 

c

(2)



2is the gain term in the LPC model.

p m a

c m k a

c

_k _m _k

m

k m

m

 

^ _

 



¹

⁽ ^/ ⁾ ^, ¹

1

(3)

p m a

c m k

c

_m _k

m

k

m



^ _





¹

⁽ ^/ ⁾ ^,

1

(4)

3.3 Perceptual Linear Prediction

PLP inherits its characteristics from LPC as well as MFCC. PLP features [7] can be obtained by following the steps as shown in Figure1. The speech signal is pre-emphasized and normalized similarly as in LPC. The frame duration and overlapping are also taken same as in LPC. The power spectral estimate for the windowed speech signal is computed. The power spectrum is integrated within overlapping critical band filter responses. For obtaining PLP, trapezoidal shaped filters are applied at 1-bark intervals. Restrictions are imposed on higher frequencies of band-pass filter in such a way that pass-band in the domain of critical band modulations is established to a range that appears to be required for speech intelligibility. The elements of critical band spectrum are explicitly weighted. To reduce the effects of amplitude variations for spectral resonances cube root of spectral amplitudes are taken after that IDFT is performed. Durbin‟s algorithm is used for computing the PLP coefficients. Again 13 features are taken for each speech sample by applying vector quantization on the features of all frames of a sample.

3.4 Wavelet Based Feature Extraction

The speech signal is pre-emphasized and normalized similarly as in LPC. The frame duration and overlapping are also taken same as in LPC. In Wavelet based feature extraction technique [8] the pre-emphasized signal is decomposed into 24 bands using different wavelets. The total energy of wavelet coefficients in each sub-band is calculated. Logarithm of the energy of all the 24 frequency bands is computed. Then DCT is applied which arranges the coefficients in ascending order of the weightage of the information contained in the coefficients.

First 13 coefficients are selected out of 24 DCT coefficients.

IV. EXPERIMENTAL SETUP AND RESULTS

(5)

Theexperimental set-up is shown in Figure 1. By using LPC, LPCC, PLP, MFCC and Wavelet coefficients, 13 features are extracted for each sample of all 24 Punjabi digits. The features of all the twenty four speakers were stored for future processing after extraction. The feature vectors of first twenty speakers were used for the purpose of training and the feature vectors of last four speakers of each digit were used for testing the classifier.

Comparative speaker independent results for the LPC, LPCC, PLP, MFCC and Wavelet Coefficient feature extraction techniques with HMM are shown in Figure 3. The recognition performance of LPCC Coefficients is 4.2% better than LPC while the recognition performance of PLP is further improved by 10.95% than that of LPCC Coefficients. One of the reasons for better performance of LPCC features is that these are cepstral features. MFCC coefficient gives 2.76% better than PLP coefficient. Wavelet coefficient gives 14.79 % better than MFCC. The Cepstral features are useful because they operate in a domain in which the excitation function and the vocal tract filter function are separable. Wavelet coefficient gives better efficiency than other techniques as it gives

V. CONCLUSIONS

The size of twenty four speakers was selected based on convince. The system provides satisfactory results for speaker independent cases. The recognition performance with PLP Coefficients features was better than that with LPC and LPCC features. The recognition performance with MFCC features was further improved than that with PLP Coefficients features. Wavelet Coefficient gives best recognition performance among all feature extraction techniques

REFERENCES

[1] Allen, J.B., 1994. How do humans process and recognize speech. IEEE Transactions on Speech and Audio Process, vol. 2, no. 4, pp. 567–576.

[2] Gong, Y., 1995.Speech recognition in noisy environments: A survey. Speech Communication, vol. 16, no.3, pp. 261–291.

[3] Sreenivas, T. and Kirnapure, P., 1996. Codebook constrained Wiener filtering for speech enhancement.

IEEE Transactions on Speech & Audio Process., vol. 4, pp. 383–389.

[4] J. Makhoul, “Linear prediction: A tutorial review,” Proc. of IEEE, vol. 63, no. 4, pp. 561-580, 1975.

[5] Atal, B.S., 1974. Effectiveness of Linear Prediction characteristic of the speech wave for automatic speaker identification and verification. Journal of the Acoustical Society of America, vol.55, no.6, pp.1304–1312.

[6] Hasan, M.R., Jamil, M. and Saifur Rahman, M.G., 2004. Speaker identification using Mel frequency Cepstral Coefficients. 3rd International conference on Electrical and Computer Engineering, Dhaka, Bangladesh, pp. 565-568.

[7] Hermansky, H., 1990. Perceptual linear prediction (PLP) analysis of speech. Journal of Acoustic Society America 87: pp.1738-1752.

[8] M. Krishnan, C. P. Neophytou and G. Prescott, “Wavelet Transform Speech Recognition Using Vector Quantization, Dynamic Time Warping and Artificial Neural Networks,” International Conference on Spoken Language Processing, 1994, Yokohama, Japan.

[9] Levinson, S.E., Rabiner L.R., and Sondhi, M.M.,1983. Speaker independent isolated digits recognition using Hidden Markov Model. ICASSP, Boston

COMPARATIVE STUDY OF FEATURE EXTRACTION TECHNIQUES FOR PUNJABI SPEECH RECOGNITION