Volume 1, Issue 3, November 2012, Page 69

ABSTRACT

Biometric recognition systems are used for law enforcement, legal purposes and secure entry into private and government institutions and organizations. This paper applies speech recognition, one of the biometric techniques, in which spoken words are collected and converted into an acoustic signal. Feature extraction from isolated Marathi words, in order to identify and verify each spoken word, is a field that is being actively researched. The results of applying the MFCC and VQ algorithms to the IWAMSR databases show that the system offers end users easy access.

Keywords: Speech recognition, Marathi speech recognition, IWAMSR, MFCC and vector quantization.

1. INTRODUCTION

Biometrics first came into extensive use for law-enforcement and legal purposes—identification of criminals and illegal aliens, security clearances for employees in sensitive jobs, paternity determinations, forensics, positive identifications of convicts and prisoners, and so on. Today, however, many civilian and private-sector applications are increasingly using biometrics to establish personal recognition [1]. Biometrics in its various forms provides the potential for reliable, convenient, automatic person recognition. Systems based on fingerprints, faces, hand geometries, iris images, vein patterns and voice signals are now all available commercially. Even so, the take-up of this technology has been slow. Biometric applications act as a guard control, limiting access to something or somewhere; it is the something or the somewhere that in turn defines the potential merits of biometric-based solutions. Yet biometrics has not entered everyday life to any great extent. If it had, then conventional guards, physical keys and the all-pervading passwords and PINs essential to gain access to mobile phones, computer software, buildings and so on would all be in decline. This lack of take-up is even more surprising given the rate at which passwords and PINs are forgotten, as indicated by the prevalence of the web-page rescue message "Forgotten your password?" In short, why has biometric technology not made PINs, passwords and many physical keys things of the past? The answer must lie in factors such as cost, accuracy and convenience of use [13].

The main use of speech recognition is to convert collected words (captured from a talker via microphone or telephone) into an acoustic signal. Feature extraction from isolated Marathi words, in order to identify and verify each spoken word, is a field that is being actively researched. Isolated speech recognition refers to the ability of the speech recognizer to identify and recognize a specific word in a stream of words [20].

Automatic Speech Recognition for Marathi

Isolated Words

Gajanan Pandurang Khetri1, Satish L. Padme2, Dinesh Chandra Jain3, Dr. H. S. Fadewar4, Dr. B. R. Sontakke5 and Dr. Vrushsen P. Pawar6

1 Research Scholar, Department of Computer Science, Singhania University, Rajasthan, India

2 Assistant Librarian, Dr. Babasaheb Ambedkar Marathwada University, Aurangabad, Maharashtra, India

3 Research Scholar, Department of Computer Science, SRTM University, Nanded, Maharashtra, India

4 Assistant Professor, Department of Computational Science, Swami Ramanand Teerth Marathwada University, Nanded, Maharashtra, India

5 Associate Professor, Department of Mathematics, Pratisthan Mahavidyalaya, Paithan, Maharashtra, India

6 Assistant Professor and Head, Department of Computational Science, Swami Ramanand Teerth Marathwada University, Nanded, Maharashtra, India


2. RELATED WORK

Feature extraction in ASR is the computation of a sequence of feature vectors which provides a compact representation of the given speech signal. Ursin [2] identified the most common operations required by front-end (feature extraction) techniques; these include sampling the speech at a preferred rate between 16,000 and 22,000 samples per second (16 kHz to 22 kHz). According to Martens [3], there are various speech feature extraction techniques, including Linear Predictive Coding (LPC), Perceptual Linear Prediction (PLP) and Mel-Frequency Cepstral Coefficients (MFCC); however, MFCC has been the most frequently used technique, especially in speech recognition and speaker verification applications. Feature extraction is usually performed in three main stages. The first stage, called speech analysis or the acoustic front end, performs spectro-temporal analysis of the speech signal and generates raw features describing the envelope of the power spectrum of short speech intervals. The second stage compiles an extended feature vector composed of static and dynamic features. Finally, the last stage transforms these extended feature vectors into more compact and robust vectors that are then supplied to the recognizer [3]. The main objective of feature extraction is to extract characteristics from the speech signal that are unique, discriminative, robust and computationally efficient for each word, and that can then be used to differentiate between different words [2].

Based on a highly simplified model of speech production, the linear prediction coding (LPC) algorithm is one of the earliest standardized coders; it works at low bit rates, was inspired by observations of the basic properties of speech signals, and represents an attempt to mimic the human speech production mechanism [9]. In addition, Chetouani et al. [10] considered LPC a useful method for feature extraction based on a model of the vocal tract. Another popular feature set is the set of perceptual linear prediction (PLP) coefficients, first introduced by Hermansky [11], who formulated PLP as a method for deriving a more auditory-like spectrum based on linear predictive (LP) analysis of speech, achieved by making some engineering approximations of the psychophysical attributes of the human hearing process [12]. PLP introduces an estimation of the auditory properties of the human ear, and takes features of the human ear into account more specifically than many other acoustic front-end techniques [11].

The most prevalent and dominant method for extracting spectral features is the calculation of Mel-Frequency Cepstral Coefficients (MFCC). MFCC is one of the most popular feature extraction techniques in speech recognition; it operates in the frequency domain using the Mel scale, which models the frequency sensitivity of the human ear [10]. As frequency-domain features, MFCCs are much more accurate than time-domain features [14].

Huang et al. [14] also defined the Mel-Frequency Cepstral Coefficients (MFCC) as a representation of the real cepstrum of a windowed short-time signal derived from the Fast Fourier Transform (FFT) of that signal; the difference from the real cepstrum is that a nonlinear frequency scale is used, which approximates the behavior of the auditory system. According to Zoric [15], MFCC is an audio feature extraction technique which extracts from the speech parameters similar to those used by humans for hearing speech, while de-emphasizing all other information. Because MFCCs take the characteristics of the human auditory system into consideration, they are commonly used in automatic speech recognition systems.

MFCCs are extracted by transforming the speech signal, which is the convolution between the glottal pulse and the vocal tract impulse response, into a sum of two components known as the cepstrum, which can be separated by band-pass linear filtering [16]. MFCC extraction involves a frame-based analysis in which the speech signal is broken down into a sequence of frames. Each frame undergoes a sinusoidal transform (the Fast Fourier Transform) to obtain certain parameters, which then undergo Mel-scale perceptual weighting and de-correlation. The result is a sequence of feature vectors describing useful, logarithmically compressed amplitude and simplified frequency information [17].

According to Martens [3], MFCC features are obtained by first performing a standard Fourier analysis and then converting the power spectrum to a mel-frequency spectrum; the MFCCs are then obtained by taking the logarithm of that spectrum and computing its inverse Fourier transform.

Quantization is the process of approximating continuous-amplitude signals by discrete symbols; it can be the quantization of a single signal value or parameter, known as scalar quantization, or the joint quantization of multiple signal values or parameters, known as vector quantization [14]. VQ offers several unique advantages over scalar quantization, including the ability to exploit the linear and nonlinear dependencies among the vector components, and it is highly versatile in the selection of multidimensional quantizer cell shapes. For these reasons, VQ results in lower distortion than scalar quantization [9]. According to Goldberg and Riek [18], VQ is a general class of methods that encode groups of data rather than individual samples, in order to exploit the relations among the elements of a group and so represent the group as a whole more efficiently than each element by itself.

(3)

Volume 1, Issue 3, November 2012

Page 71

3. PREPROCESSING

3.1 Features extraction using MFCC algorithm

According to Ursin [2], MFCC is considered the standard method for feature extraction in speech recognition tasks. Perhaps the most popular features used for speech recognition today are the mel-frequency cepstral coefficients (MFCCs), which are obtained by first performing a standard Fourier analysis and then converting the power spectrum to a mel-frequency spectrum. By taking the logarithm of that spectrum and computing its inverse Fourier transform, one then obtains the MFCCs. The individual MFCC features appear to be only weakly correlated, which turns out to be an advantage for the creation of statistical acoustic models [3]. MFCC generally achieves better accuracy and lower computational complexity than alternative feature extraction techniques [4].

Mel-Frequency Cepstral Coefficient (MFCC) extraction is a deconvolution algorithm applied in order to obtain the vocal tract impulse response from the speech signal. It transforms the speech signal, which is the convolution between the glottal pulse and the vocal tract impulse response, into a sum of two components known as the cepstrum. This computation is carried out by taking the inverse DFT of the logarithm of the magnitude spectrum of the speech frame [8]:

x^[n] = h^[n] + u^[n]    (1)

where h^[n], u^[n] and x^[n] are the complex cepstra of h[n], u[n] and x[n], respectively.
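The paper does not include an implementation of this front end. The sketch below is a minimal pure-Python illustration of the MFCC pipeline described above (spectral analysis, mel-scale weighting, log compression, de-correlation); it is not the authors' code. The naive DFT, the DCT in place of the inverse Fourier transform, and the small filterbank and coefficient counts are choices made here for brevity, and a practical system would add pre-emphasis, windowing and an FFT.

```python
import math
import cmath

def hz_to_mel(f):
    """Map frequency in Hz onto the perceptual mel scale."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def dft_power(frame):
    """Naive DFT power spectrum (positive-frequency bins only)."""
    N = len(frame)
    spec = []
    for k in range(N // 2 + 1):
        s = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
        spec.append(abs(s) ** 2 / N)
    return spec

def mel_filterbank(num_filters, num_bins, sample_rate):
    """Triangular filters spaced evenly on the mel scale."""
    low, high = hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0)
    edges = [mel_to_hz(low + i * (high - low) / (num_filters + 1))
             for i in range(num_filters + 2)]
    bins = [int(round(f * (num_bins - 1) / (sample_rate / 2.0))) for f in edges]
    banks = []
    for i in range(1, num_filters + 1):
        fb = [0.0] * num_bins
        for k in range(bins[i - 1], bins[i]):        # rising slope
            fb[k] = (k - bins[i - 1]) / (bins[i] - bins[i - 1])
        for k in range(bins[i], bins[i + 1]):        # falling slope
            fb[k] = (bins[i + 1] - k) / (bins[i + 1] - bins[i])
        banks.append(fb)
    return banks

def mfcc(frame, sample_rate, num_filters=12, num_coeffs=8):
    power = dft_power(frame)                         # spectral analysis
    banks = mel_filterbank(num_filters, len(power), sample_rate)
    # Log-compressed mel filterbank energies (floored to avoid log(0)).
    log_e = [math.log(max(sum(w * p for w, p in zip(fb, power)), 1e-10))
             for fb in banks]
    # DCT-II de-correlates the log energies into cepstral coefficients.
    return [sum(e * math.cos(math.pi * c * (m + 0.5) / num_filters)
                for m, e in enumerate(log_e))
            for c in range(num_coeffs)]
```

For a 256-sample frame of a pure tone at an 8 kHz sampling rate, `mfcc(frame, 8000)` returns one vector of eight cepstral coefficients; a word is then represented by the sequence of such vectors over its frames.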

3.2 Features training using VQ algorithm

The training of the VQ codebook applies an important algorithm known as the LBG VQ algorithm, which clusters a set of L training vectors into a set of M codebook vectors [5]. The following steps are required for training the VQ codebook using the LBG algorithm, as described by Rabiner and Juang [6].

1. Design a 1-vector codebook; this is the centroid of the entire set of training vectors. Therefore, no iteration is required in this step.

2. Double the size of the codebook by splitting each current codebook entry yn according to the following rule:

yn+ = yn(1 + ε),  yn− = yn(1 − ε)    (2)

where n varies from 1 to the current size of the codebook and ε is the splitting parameter, usually in the range 0.01 ≤ ε ≤ 0.05.
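The steps above can be sketched as follows. This is a minimal illustration of the LBG split-and-refine procedure, not the authors' implementation; the nearest-neighbour search and centroid-update iterations performed after each split are the standard remaining steps of the algorithm as given by Linde, Buzo and Gray [5] and Rabiner and Juang [6].

```python
import math
import random

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(vectors):
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

def lbg_codebook(training, M, eps=0.01, iters=20):
    """Grow a codebook from 1 to M vectors by repeated splitting (LBG)."""
    book = [centroid(training)]               # step 1: 1-vector codebook
    while len(book) < M:
        # step 2: split each codeword yn into yn(1 + eps) and yn(1 - eps)
        book = [[x * (1 + s) for x in y] for y in book for s in (eps, -eps)]
        # refine the doubled codebook (standard LBG iteration):
        for _ in range(iters):
            # nearest-neighbour search: assign each training vector to a cell
            cells = [[] for _ in book]
            for v in training:
                nearest = min(range(len(book)), key=lambda i: euclidean(v, book[i]))
                cells[nearest].append(v)
            # centroid update: move each codeword to the mean of its cell
            book = [centroid(c) if c else y for c, y in zip(cells, book)]
    return book
```

Given training vectors drawn from two well-separated clusters, `lbg_codebook(data, 2)` converges to one codeword near the centre of each cluster.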

3.3 Feature matching using the Euclidean distance measure

The matching of an unknown Marathi word is performed by measuring the Euclidean distance between the feature vectors of the unknown Marathi word and the models of the known Marathi words in the database. The goal is to find the codebook with the minimum distance measurement in order to identify the unknown word [7].

For example, in the testing or identification session, the Euclidean distance between the feature vectors and the codebook of each Marathi spoken word is calculated, and the word with the smallest average minimum distance is picked, as shown in equation (3):

d(x, y) = sqrt( Σi (xi − yi)² )    (3)

where xi is the ith component of the input feature vector, yi is the ith component of the feature vector in the codebook, and d is the distance between x and y.
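A minimal sketch of this matching rule is given below. It assumes each word's trained codebook is a list of feature vectors; the function and vocabulary names are illustrative, not from the paper.

```python
import math

def euclidean(x, y):
    # Equation (3): d(x, y) = sqrt(sum over i of (xi - yi)^2)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def identify(test_vectors, codebooks):
    """Pick the word whose codebook has the smallest average
    minimum distance to the unknown word's feature vectors."""
    def avg_min_distance(book):
        return sum(min(euclidean(v, y) for y in book)
                   for v in test_vectors) / len(test_vectors)
    return min(codebooks, key=lambda word: avg_min_distance(codebooks[word]))
```

With a toy two-word vocabulary such as `{"word_a": [[0.0, 0.0]], "word_b": [[5.0, 5.0]]}`, test vectors near the origin are identified as `"word_a"`.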


4. EXPERIMENTAL SETUP

The IWAMSR database was created considering three factors: the talking person, the talking condition, and the transducers and transmission system. First, speech samples were collected from five different speakers, three male and two female, belonging to different ages and races. Second, the talking condition refers to the environment in which the recordings were made. Third, regarding the transducers and transmission system, the speech samples were recorded and collected using a normal microphone.

Database1 contains 63 unknown Marathi speech samples; six speech samples collected from two different talkers were used to train the VQ codebook. Database2 contains 90 unknown Marathi speech samples; the training samples were collected from one male talker, and the remaining samples were tested against that recording. Database3 contains 36 unknown Marathi speech samples; nine speech samples from four different talkers, both male and female, were used to train the VQ codebook. The experiments were carried out on a Pentium IV processor at 3.0 GHz with a 200 GB hard disk drive and 1 GB of random access memory.

5. EXPERIMENTAL RESULTS

The feature matching stage was carried out for the three IWAMSR databases. Figures 1 and 2 show the three different databases of IWAMSR; each database contains a number of unknown speech samples for testing purposes. Database1, database2 and database3 contain 63, 90 and 36 unknown speech samples respectively, yielding accuracy rates of approximately 85.71%, 31.11% and 88.88%. For database1, out of 63 unknown speech samples, 54 were matched correctly and the remaining 9 were false matches. For database2, out of 90 unknown speech samples, 28 were matched correctly and 62 were false matches. For database3, out of 36 unknown speech samples, 32 were matched correctly and 4 were false matches. Table 1 shows the FAR (%) results on the IWAMSR databases.

Figure 1 Testing Results of various IWAMSR DATABASES
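The reported accuracy rates follow directly from the match counts above; the short check below recomputes them (note that 32/36 rounds to 88.89% at two decimals, which the text truncates to 88.88%).

```python
# (correct matches, samples tested) as reported for each IWAMSR database
results = {
    "Database1": (54, 63),
    "Database2": (28, 90),
    "Database3": (32, 36),
}
for name, (correct, total) in results.items():
    rate = 100.0 * correct / total
    print(f"{name}: {rate:.2f}% recognition, {total - correct} false matches")
```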


Table 1: FAR (%) results on the IWAMSR databases

IWAMSR Database    FAR (%)
Database1          0.14
Database2          0.68
Database3          0.11

6. CONCLUSION

IWAMSR database1 shows better performance than IWAMSR database2. Database3 achieved the highest recognition rate; the main reason is that this database's codebook was trained on nine speech samples collected from different male and female talkers. This research helps to speed up the process of communication, and it will satisfy end users by providing easy access. The system could be implemented in various Marathi businesses, Marathi organizations, Marathi-based financial institutions and Marathi academic institutions.

References

[1] S. Prabhakar, S. Pankanti and A. K. Jain, “Biometric Recognition: Security and Privacy Concerns,” IEEE Security and Privacy Magazine, pp. 33-42, 2003.

[2] M. Ursin, “Triphone Clustering In Finnish Continuous Speech recognition,” Master Thesis, Department of Computer Science, Helsinki University of Technology, Finland, 2002.

[3] J. P. Martens, “Continuous Speech Recognition over the Telephone,” Electronics and Information Systems, Ghent University, Belgium, 2000.

[4] S. B. Davis and P. Mermelstein, “Comparison of Parametric Representations of Monosyllabic Word recognition In Continuously Spoken Sentences,” IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 28, pp. 357- 366, 1980.

[5] Y. Linde, A. Buzo and R. M. Gray, “An Algorithm for Vector Quantizer Design,” IEEE Transactions on Communications, vol. COM28, no 1, pp. 84-95, 1980.

[6] L. Rabiner and B. H. Juang, “Fundamentals of speech recognition,” Prentice Hall, NJ, USA, 1993.

[7] P. Franti, T. Kaukoranta and O. Nevalainen, “On the Splitting Method for Vector Quantization Codebook Generation,” Optical Engineering, vol. 36, no. 11, pp. 3043-3051, 1997.

[8] C. Becchetti and L. P. Ricotti, “Speech Recognition Theory and C++ Implementation,” John Wiley and Sons Ltd, England, 1999.

[9] C. C. Wai, “Speech Coding Algorithms Foundation and Evolution of Standardized Coders,” John Wiley and Sons Inc., NJ, USA, 2003

[10] M. Chetouani, B. Gas, J. L. Zarader and C. Chavy, “Neural Predictive Coding for Speech Discriminant Feature Extraction: The DFE-NPC,” ESANN'2002 Proceedings – European Symposium on Artificial Neural Networks, Bruges, Belgium, pp. 275-280, 2002.

[11]H. Hermansky, “Perceptual Linear Predictive (PLP) Analysis of Speech,” The Journal of the Acoustical Society of America, vol. 87, no. 4, pp. 1738–1752, 1990.

[12] B. Milner, “A Comparison of Front-End Configurations for Robust Speech Recognition,” ICASSP 2002, vol. 1, pp. 797-800, 2002.

[13]J. S. Mason and J. D. Brand, “The Role of dynamics in Visual Speech Biometrics,” IEEE, 2002.

[14]X. Huang, A. Acero and H. W. Hon, “Spoken Language Processing: A Guide to Theory, Algorithm, and System Development,” Prentice Hall, Upper Saddle River, NJ, USA, 2001.

[15] G. Zoric, “Automatic Lip Synchronization by Speech Signal Analysis,” Master Thesis, Department of Telecommunications, Faculty of Electrical Engineering and Computing, University of Zagreb, Croatia, 2005.

[16] O. Khalifa, S. Khan, M. R. Islam, M. Faizal and D. Dol, “Text Independent Automatic Speaker Recognition,” 3rd International Conference on Electrical & Computer Engineering, Dhaka, Bangladesh, pp. 561-564, 2004.

[17]C. R. Buchanan, “Informatics Research Proposal – Modeling the Semantics of sound,” http://www.inf.ed.ac.uk/teaching/courses/diss/projects/s0454291.pdf, retrieved in March 2005.

[18] R. Goldberg and L. Riek, “A Practical Handbook of Speech Coders,” CRC Press, USA, 2000.

[19] A. R. Webb, “Statistical Pattern Recognition,” John Wiley and Sons Ltd, England, 2002.

[20] C. R. Spitzer, “The Avionics Handbook,” CRC Press, USA, 2001.

AUTHOR


Satish L. Padme received the MLISc (Library and Information Science) degree from Dr. Babasaheb Ambedkar Marathwada University, Aurangabad, Maharashtra (India). He cleared the NET in Library and Information Science and is now pursuing his Ph.D. at Dr. Babasaheb Ambedkar Marathwada University. He has published 15 research papers in conferences and seminars. He is currently working as an Assistant Librarian at Dr. Babasaheb Ambedkar Marathwada University, Aurangabad, Maharashtra (India).

Mr. Dinesh Chandra Jain received the M.Tech (IT) degree from BU and the B.E. (Computer Science) degree from RGPV University, Bhopal. He has published many research papers in reputed international journals. He is a member of ISTE and many engineering societies. He is pursuing a Ph.D. in Computer Science at SRTM University, Nanded (India).
