6. A Hybrid Speech Recognition Technique Based On MFCC and PLP


Osheen Nehru

Department of Electronics and Communication Engineering, Maharishi Ved Vyas Engineering College, Haryana, India

Kamal Kumar

Department of Electronics and Communication Engineering, Maharishi Ved Vyas Engineering College, Haryana, India

Abstract-A voice biometric system is implemented in three modules: Database Creation, Feature Extraction and Classifier. The voice samples consist of the voices of 35 different persons speaking their names, of which 20 samples are used for testing and the remainder for training. For each sample, the first ten features are MFCC (mel-frequency cepstral coefficients) and features 11 to 20 are PLP (perceptual linear prediction). The voice samples include both male and female speakers reading out their names. The well-known mel-frequency cepstral coefficients (MFCC) have been among the most widely used speech features for speech recognition for many years. The Perceptual Linear Prediction (PLP) model was developed by Hermansky. PLP models human speech based on the concept of the psychophysics of hearing. PLP discards irrelevant information in the speech and thus improves the speech recognition rate. The basic idea of the classifier is to characterize words as probabilistic models in which the phonemes that make up the word represent the states of the HMM, while the transition probabilities are the probabilities of the next phoneme being uttered.

General Terms: Voice samples, Database creation, Feature extraction, Classifiers, Acoustic speaker recognition, Voice identification.

Keywords: MFCC (Mel-frequency cepstrum coefficients), PLP (perceptual linear prediction), Acoustic recognition, Vocalization.

1. INTRODUCTION

The well-known mel-frequency cepstral coefficients (MFCC) have been among the most widely used speech features for speech recognition for many years.

1.1 Acoustic Speaker Recognition:

Classical authentication relies on one of three elements: what you have, what you are and what you know. Key or card-based systems exemplify what you have. Lock and password based systems rely on what you know. Voice passwords encompass utterance verification for access control and password compliance. Biometrics, and in particular speaker recognition, rely on what you are. The new approach of speech biometrics, or conversational biometrics, employs text-independent speaker recognition to acoustically identify or verify answers from the user in dialog with the system. The questions addressed to the user can be randomly selected, follow a pre-defined sequence or follow a business logic. With this approach, user verification and identification rely on acoustic recognition and on the content of the answers to the questions.

1.2 The Voice Identification and Verification Agent:

Figure 1: VIVA Overview [1]

For the communication between a VIVA client and the server, a proprietary protocol was introduced involving the concepts of a verification "session" and an "interview." An interview represents a simple dialog consisting of a small number of questions under a given security policy. The policy can be defined as the ratio of the maximum number of questions to the minimum number of correct answers per interview. Within one session, multiple interviews with varying policies can be opened, which allows the session length to be adapted to the current voice-print match confidence. The questions asked within one session are generated so as to avoid repetitions across interviews and to guarantee sufficient security through appropriate topic coverage, e.g. there will be at least one password question among questions about family, hobbies, or favorite colors.

The typical procedure is as follows. The application creates an instance of the CSB VIVA component when the user tries to log on. The VIVA then takes over control and first tries to obtain the user's ID claim. This is achieved through a direct prompt or by an identification procedure based on voice and verbal information. In this implementation the claim ID is a digit string (extension number). If the user does not explicitly specify this number and issues commands to the application directly, the VIVA suspends the speech command and starts determining the claim in a short dialog using open-set acoustic speaker identification. If identification is unsuccessful, the user is prompted for the claim number explicitly. Using the claim ID, the speech biometrics control generates a verification session and an initial interview with the VIVA server. Questions generated by the server are synthesized for the user, and the user's response is decoded using the IBM ViaVoice Telephony engine. Appropriate vocabularies and grammars are switched by the biometrics control module on a per-question basis according to the current topic. The decoded response is returned to the server for evaluation. In parallel, the control component collects the audio data in a buffer. Once the security policy for the open interview is satisfied, the server returns a positive result and closes the verbal interview (the session remains open). The control module triggers the voice-print verification based on the collected audio and then decides whether to accept the speaker, to continue the verification session by creating an additional interview, or to reject the speaker because of several unsuccessful interviews (incorrect or vague answers) or a poor acoustic match. After this initial verification, if it closes positively, control is given back to the application. The instance of the CSB object may be terminated if the application does not require any further authentication, or it may remain instantiated.

If the instance remains instantiated, the speech biometrics control creates a listener associated with the audio stream and collects the speech from the regular user-application interaction. A voice-print re-verification can then be requested by the application at any time (particularly before committing critical operations), thus achieving continuous speaker tracking and detecting potential speaker changes.

The VIVA system supports automatic user enrollment via HTML for the knowledge database and telephone voice collection for the acoustic information. [1]
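The interview policy and the accept, continue or reject decision described in this section can be sketched as follows. This is a minimal illustration, not the VIVA protocol itself: the names `InterviewPolicy`, `interview_passes` and `session_decision`, the score thresholds and the limit on failed interviews are all hypothetical values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class InterviewPolicy:
    """Hypothetical policy: ask at most max_questions, require min_correct right."""
    max_questions: int
    min_correct: int


def interview_passes(policy, answers):
    """Return True once enough correct answers arrive within the question budget.

    answers is a list of booleans, one per question, in the order asked.
    """
    correct = 0
    for is_correct in answers[:policy.max_questions]:
        correct += int(is_correct)
        if correct >= policy.min_correct:
            return True
    return False


def session_decision(interview_results, voiceprint_score,
                     accept_threshold=0.8, reject_threshold=0.3, max_failed=2):
    """Decide the next step of a session (all thresholds are assumed values)."""
    failed = sum(1 for passed in interview_results if not passed)
    if failed >= max_failed or voiceprint_score < reject_threshold:
        return "reject"      # several failed interviews or a poor acoustic match
    if interview_results and interview_results[-1] and voiceprint_score >= accept_threshold:
        return "accept"      # last interview passed and the voice-print matches well
    return "continue"        # open an additional interview


policy = InterviewPolicy(max_questions=3, min_correct=2)
print(interview_passes(policy, [True, False, True]))   # True
print(session_decision([True], 0.5))                   # continue
```

Opening a further interview when the acoustic score is inconclusive mirrors the idea in the text that the session length adapts to the current voice-print match confidence.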

1.3 Biometric System:

[...] the models are based on specific word fragments, with text-independent modeling, where models are generated using larger word groupings. The amount of information gathered from each question in the challenge-response is thus increased by extending it from binary to multiple choices. In this protocol, part of the security results from the server scrambling data to generate a challenge. Naturally, every challenge must also contain a finite number of possible choices for the server to scramble and send to the client. This makes it different from traditional biometrics, in that the protocol does not extract data from the response and compare it to a stored value; rather, it uses the response to choose between multiple possibilities presented by the server. It is important to note that the analysis of security, in terms of bits, is stated in terms of how much security is added on top of encryption. The implementation of Vaulted Voice Verification provides P bits of security from encryption with the salted user password, under the assumption that the device locks after a number of failed attempts, S bits of security from the server's encryption of the template, K bits of "knowledge-based" security, and B-K bits of "biometric-identity" security. K bits of knowledge-based security means security gained from each challenge-response question because of something the legitimate user knows and an attacker does not. Thus, depending on the attack model, the odds of an attacker guessing correctly mirror those of random chance. B bits of biometric-identity security means security gained through the use of voice-based models that take advantage of the differences in the voices and speech patterns of different speakers [4]. Like other speech processing tools, voice biometrics extract information from the stream of speech to accomplish their work.
They can be configured to operate on many of the same acoustic parameters as their closest speech processing relative, speech recognition. Like speech recognition, they benefit from large amounts of data, good microphones and noise cancellation software. There are nevertheless many important differences between voice biometrics systems and other speech-processing technologies, including speech recognition. The most significant is that voice biometrics do not recognize what a person is saying, so speech recognition alone cannot serve as voice biometrics. By definition, voice biometrics is always tied to a particular speaker. As a result, it requires some type of enrollment for every user.

The development of speech systems operating on degraded speech obtained from HF communication channels has received renewed interest owing to the DARPA Robust Automatic Transcription of Speech (RATS) program. For data distributed under the RATS program, an autoregressive (AR) model based technique for frequency offset estimation, which exploits the harmonic properties of the speech signal, was proposed. When the speech signal is frequency shifted, the fundamental harmonics of the voiced regions are also shifted linearly. However, the separation between successive harmonics is unchanged, and this spectral distance can be used to estimate the fundamental frequency. If the spectral peaks in the frequency-shifted signal are identified, then the differences between the actual locations of these peaks and their expected locations in the baseband signal provide likely candidates for the frequency shift estimate. The proposed AR model based shift estimation procedure provides significant robustness compared to other methods in terms of shift estimation accuracy (with relative improvements of about 25%). The signal quality, objectively measured using the perceptual evaluation of speech quality (PESQ), is also shown to improve over baseline methods. Moreover, the enhanced signal is used to improve the performance of an automatic language identification (LID) system [5].
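The relationship between harmonic spacing and frequency shift can be illustrated with a small sketch. It is not the AR model based method of [5]: it assumes the spectral peak locations are already known, that every peak is a harmonic of one fundamental, and that the shift is smaller than half the fundamental frequency. The function name `estimate_shift` is hypothetical.

```python
import numpy as np


def estimate_shift(peak_freqs_hz):
    """Estimate (f0, shift) in Hz from the peak locations of a shifted harmonic spectrum.

    Assumes |shift| < f0 / 2, otherwise the nearest-harmonic assignment is ambiguous.
    """
    peaks = np.sort(np.asarray(peak_freqs_hz, dtype=float))
    # The spacing between successive harmonics is unchanged by the shift.
    f0 = float(np.median(np.diff(peaks)))
    # Expected baseband location of each peak is the nearest multiple of f0.
    harmonic_index = np.round(peaks / f0)
    # Actual location minus expected location gives one shift candidate per peak.
    candidates = peaks - harmonic_index * f0
    return f0, float(np.median(candidates))


# Harmonics of 120 Hz, all shifted upward by 35 Hz.
print(estimate_shift([155.0, 275.0, 395.0, 515.0, 635.0]))  # (120.0, 35.0)
```

Taking the median over the per-peak candidates is one simple way to combine them; the text describes combining estimates across neighbouring frames, which this sketch does not attempt.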

2. RELATED WORK

Stephane H. Maes et al. (2001): discussed a new modality for speaker recognition, conversational biometrics, as a high-security voice-based authentication method for E-commerce applications. By simultaneously combining diverse conversational technologies, high-accuracy transparent speaker recognition becomes possible even across channel or environment mismatches. For identifying speakers in large populations, dialogs are combined to reduce the set of confusable speakers, and text-independent speaker identification is used to pinpoint the actual speaker. Likewise, dialogs with personal random or predefined questions are used to perform knowledge-based and acoustic-based verification of the user at once. Users who are familiar with the system can log into it with 0.8% or 1.3% false rejection and ca. 5 × 10^-12 % or 2 × 10^-6 % false acceptance rates within about 40 sec or 20 sec respectively, which is an outstanding result compared to purely voice-print based authentication. [1]

R.C. Johnson et al. (2013): examined a novel [...] taking place on the client side, where a mobile device can be used. [2]

Yekini N.A. et al. (2012): described how an automatic teller machine requires a user to pass an identity test before any transaction can be carried out. The current technique available for access control in ATMs is based on the smartcard. Efforts were made to conduct an interview with structured questions among ATM users, and the results showed that many problems are associated with the ATM smartcard for access control. Among the problems: it is very difficult to prevent another person from obtaining and using a legitimate person's card, and a conventional smartcard can be lost, duplicated, stolen or impersonated with accuracy. To address these problems, the authors proposed the use of a biometric voice-based access control system in automatic teller machines. [3]

Nicolas Scheffer et al. (2013): reported that researchers have tackled difficult voice biometrics applications that resonate with the defense and research communities. Such problems include non-ideal recording conditions that are often found in operational scenarios, such as noise, degraded channels, reverberation and compressed audio. The article highlights SRI's innovations that resulted from the IARPA Biometrics Exploitation Science & Technology (BEST) and the DARPA Robust Automatic Transcription of Speech (RATS) programs, in addition to SRI's approach for codec-degraded speech. [4]

Sriram Ganapathy et al. (2012): described how the spectral integrity of speech signals communicated over high frequency single side band (HF-SSB) radio channels is affected by acoustic artifacts such as linear frequency shifts. They propose an approach to automatic estimation and correction of the frequency shift given the degraded signal at the SSB receiver. The proposed technique exploits the harmonic nature of the speech signal in the voiced regions. The fundamental harmonic frequency, obtained from an autoregressive model of the spectrum, is used to estimate the offset value for the current frame. The offset values from neighbouring frames are combined to provide the most likely estimate for the incoming signal. The proposed algorithm provides significant improvements over other baseline offset estimation methods in terms of the accuracy of offset estimation as well as LID classification performance (with relative improvements of about 10-25%). [5]

3. PROPOSED WORK

The Voice Biometric System consists of three modules: Database Creation, Feature Extraction and Classifier.

Database Creation:- The voice samples consist of the voices of 35 different persons speaking their names, of which 20 samples are used for testing and the remainder for training. For each sample, the first ten features are MFCC (mel-frequency cepstral coefficients) and features 11 to 20 are PLP (perceptual linear prediction). The voice samples include both male and female speakers reading out their names.

Feature Extraction:- We have used two techniques, MFCC and PLP. For each sample, the first ten features are MFCC and the others are PLP:

MFCC: - The well-known mel-frequency cepstral coefficients (MFCC) have been among the most widely used speech features for speech recognition for many years. In deriving the MFCC, the short-time Fourier transform (STFT) is applied. However, because of its time-frequency properties, the STFT is not well suited to analyzing a non-stationary signal such as speech, which implies that the resulting MFCC is not always the best representation of the speech signal and may yield lower recognition accuracy.
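The MFCC derivation mentioned above (STFT, mel filtering, logarithm, cosine transform) can be sketched with NumPy alone. This is a minimal illustration under assumed settings, not the authors' implementation: the 16 kHz sample rate, 25 ms frames with a 10 ms hop, 26 mel filters and the function names are all assumptions, and ten coefficients are kept only to match the ten MFCC features used per sample in this work.

```python
import numpy as np


def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters whose centers are evenly spaced on the mel scale."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    bank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):      # rising edge of the triangle
            bank[i - 1, k] = (k - left) / (center - left)
        for k in range(center, right):     # falling edge of the triangle
            bank[i - 1, k] = (right - k) / (right - center)
    return bank


def mfcc(signal, sample_rate=16000, frame_len=400, hop=160,
         n_fft=512, n_filters=26, n_coeffs=10):
    """Return an array of shape (n_frames, n_coeffs); needs len(signal) >= frame_len."""
    signal = np.asarray(signal, dtype=float)
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)          # windowed short-time frames
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2 / n_fft   # STFT power
    energies = power @ mel_filterbank(n_filters, n_fft, sample_rate).T
    log_energies = np.log(energies + 1e-10)               # small offset avoids log(0)
    # DCT-II of the log mel energies; coefficient 0 is included here.
    n = np.arange(n_filters)
    basis = np.cos(np.pi * np.arange(n_coeffs)[:, None] * (2 * n + 1) / (2 * n_filters))
    return log_energies @ basis.T


t = np.arange(16000) / 16000.0
features = mfcc(np.sin(2.0 * np.pi * 440.0 * t))   # one second of a 440 Hz tone
print(features.shape)  # (98, 10)
```

The fixed frame length is what gives the STFT its fixed time-frequency resolution, which is the limitation noted in the text for non-stationary signals.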

PLP: - The Perceptual Linear Prediction (PLP) model was developed by Hermansky. PLP models human speech based on the concept of the psychophysics of hearing. PLP discards irrelevant information in the speech and thus improves the speech recognition rate.

Classifier: - We have used a hidden Markov model (HMM).

Hidden Markov Model: The basic idea here is to characterize words as probabilistic models in which the phonemes that make up the word represent the states of the HMM, while the transition probabilities are the probabilities of the next phoneme being uttered. Models for the words that are part of the vocabulary are created in the training phase.
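The word-model idea can be illustrated with a small discrete HMM scored by the forward algorithm. This is a toy sketch, not the trained models of this work: the two word models "A" and "B", their two phoneme states, the three observation symbols (standing in for quantized feature vectors) and all probabilities are invented for the example.

```python
import numpy as np


def forward_log_likelihood(obs, start, trans, emit):
    """Scaled forward algorithm: log P(obs | model) for a discrete HMM."""
    alpha = start * emit[:, obs[0]]
    log_lik = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Move between phoneme states, then weight by the emission probability.
            alpha = (alpha @ trans) * emit[:, obs[t]]
        scale = alpha.sum()
        log_lik += np.log(scale)
        alpha = alpha / scale          # rescale to avoid numerical underflow
    return log_lik


def classify(obs, models):
    """Pick the word whose model gives the observation sequence the highest likelihood."""
    return max(models, key=lambda word: forward_log_likelihood(obs, *models[word]))


# Left-to-right transitions: stay in a phoneme or move on to the next one.
start = np.array([1.0, 0.0])
trans = np.array([[0.7, 0.3],
                  [0.0, 1.0]])
models = {
    "A": (start, trans, np.array([[0.8, 0.1, 0.1],     # phoneme 1 mostly emits symbol 0
                                  [0.1, 0.8, 0.1]])),  # phoneme 2 mostly emits symbol 1
    "B": (start, trans, np.array([[0.1, 0.1, 0.8],     # phoneme 1 mostly emits symbol 2
                                  [0.1, 0.8, 0.1]])),
}

print(classify([0, 0, 1, 1], models))  # A
print(classify([2, 2, 1], models))     # B
```

In a real system the transition and emission probabilities would be estimated in the training phase rather than written by hand.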

4. RESULT

Figure 2:- Testing Module

Figure 3:- Browsing sample for testing

Figure 4:-Selecting the first trained sample

Figure 5:- Showing the result of testing

Figure 6

Figure 7

Graph 1:- Comparison of existing MFCC and proposed MFCC & PLP

The graph compares the accuracy of the existing MFCC approach, which is 70%, with the accuracy of the proposed MFCC & PLP approach, which is 85.71%.

5. CONCLUSION

To calculate accuracy, we first test the 35 trained samples and then test 10 untrained samples. Of the 35 trained samples, 30 are correct and 5 are incorrect, so the accuracy for trained samples is 85.71%. Of the 10 untrained samples, 2 are correct and 8 are incorrect, so the accuracy for untrained samples is 80%. These results indicate the FAR (False Acceptance Ratio), where the system accepts samples that are not trained, and the FRR (False Rejection Ratio), where it rejects samples that are already trained.
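The arithmetic behind these percentages can be reproduced as follows. The sketch rests on one assumption about the reported counts: the 2 "correct" untrained samples are read as samples the system accepted, so that 2 of 10 gives the false acceptance ratio and 8 of 10 gives the 80% figure. The variable and function names are illustrative only.

```python
def percentage(part, total):
    return 100.0 * part / total


trained_total, trained_accepted = 35, 30
untrained_total, untrained_accepted = 10, 2   # assumed reading: "correct" = accepted

trained_accuracy = percentage(trained_accepted, trained_total)
frr = percentage(trained_total - trained_accepted, trained_total)      # trained but rejected
far = percentage(untrained_accepted, untrained_total)                  # untrained but accepted
untrained_rejection = percentage(untrained_total - untrained_accepted, untrained_total)

print(round(trained_accuracy, 2), round(frr, 2), far, untrained_rejection)
# 85.71 14.29 20.0 80.0
```

Under this reading the false rejection ratio works out to about 14.29%, which is close to the value of 15 listed in Table 4.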

Table 1:- Accuracy of trained samples

No. of trained samples    Correct    Incorrect    Accuracy (%)
35                        30         5            85.71

Table 2:- Accuracy of untrained samples

No. of untrained samples    Correct    Incorrect    Accuracy (%)
10                          2          8            80


Table 3:- False Acceptance Ratio

FAR (%)    Correct    Incorrect
20         2          8

Table 4:- False Rejection Ratio

FRR (%)    Correct    Incorrect
15         30         35

REFERENCES

[1]. Stéphane H. Maes, Jiří Navrátil, and Upendra V. Chaudhari, “Conversational Speech Biometrics”, IBM T.J. Watson Research Center, Rt. 134, Yorktown Heights, NY, USA.

[3]. Zia Saquib, Nirmala Salam, Rekha Nair, Nipun Pandey, “Voiceprint Recognition Systems for Remote Authentication-A Survey”, International Journal of Hybrid Information Technology Vol. 4, No. 2, April, 2011.

[4]. Nicolas Scheffer, Luciana Ferrer, Aaron Lawson, Yun Lei, Mitchell McLaren, “Recent Developments in Voice Biometrics: Robustness and High Accuracy”, Speech Technology and Research Laboratory (STAR) SRI International Menlo Park, CA.

[5]. Rupali L. Telgad, Almas M. N. Siddiqui and Dr. Prapti D. Deshmukh, “Automated Biometric Verification: A Survey on Multimodal Biometrics”, International Journal of Computer Science and Business Informatics, ISSN: 1694-2108 | Vol. 6, No. 1. OCTOBER 2013.

[6]. Adewole, Kayode S, Abdulsalam Sulaiman Olaniyi and Jimoh R. G., “Application of Voice Biometrics as an Ecological and Inexpensive Method of Authentication”, International Journal of Science and Advanced Technology (ISSN 2221-8386) Volume 1 No 6 August 2011.

[7]. Sanjay Kumar and Dr. Ekta Walia, “Analysis of various Biometric Techniques”, (IJCSIT) International Journal of Computer Science and Information Technologies, Vol. 2 (4), 2011, 1595-1597.