A Speech Recognition System Based Improved Algorithm of Dual-template HMM


Procedia Engineering 15 ( 2011 ) 2286 – 2290

Available online at www.sciencedirect.com


Jing Zhang a,b, Min Zhang c, a*

a Faculty of Automation, Guangdong University of Technology, Guangzhou, Guangdong 510006, China

b Dept. of Computer Science and Technology, Guangdong University of Foreign Studies, Guangzhou, Guangdong 510006, China

c Faculty of Information, Guangdong University of Technology, Guangzhou, Guangdong 510006, China

Abstract

This paper studies the hidden Markov model (HMM) and the speech recognition algorithm based on it, and improves both the model and the recognition algorithm relative to the traditional HMM. In modeling, training on multiple observation sequences is used to achieve speaker-independent recognition. Two templates, one coarse and one of high precision, are built from HMMs with different numbers of states, and a second matching pass is used to achieve a higher recognition rate. A speech recognition system combining MFCC parameters with the improved HMM algorithm was constructed. Experimental results show that the recognition rate for large-vocabulary, speaker-independent speech was greatly improved.

Keywords: Speech recognition; Hidden Markov model; Double template matching; Multiple observation sequences; Data overflow

1. Introduction

A speech signal is stationary and time-invariant over a very short analysis interval, where its statistical features can be described by the parameters of a classical linear model. Over a whole utterance, however, the signal is time-varying and the parameters of the linear model change as well. As a statistical model, the HMM can both represent the short-term stationary linear models and describe the transitions between them. An HMM therefore models the speech signal with two stochastic processes: one is a Markov chain with a finite number of states, which simulates the hidden stochastic process of the changing statistical properties of the speech signal; the other is the stochastic process of the observation sequence associated with each state of the Markov chain. As an idealized solution for speech recognition, the HMM gives a complete expression of the acoustic model of speech. In practical applications of the HMM to speech recognition, however, many problems remain, such as the large number of observation sequences involved in training and the estimation of the initial parameters. Resolving these issues is the key to improving the speech recognition rate.

* Corresponding author. Tel.: +-0-020-61990270 E-mail address: [email protected].

Open access under CC BY-NC-ND license.


2. The improved HMM model and recognition algorithm

2.1. The improved HMM model: multiple observation sequence modeling

The re-estimation formulas of the classic Baum-Welch algorithm are derived under the assumption of a single observation sequence. In practice, a large number of observation sequences are involved in training: for each HMM, a large amount of speech data is collected, the corresponding MFCC parameter sequence is computed for each recording, and these sequences are then used to re-estimate the parameters of that HMM.

For example, to build the HMM for the word "ball", many speakers are recruited and each records several wav files of "ball". After endpoint detection, the MFCC parameter sequence of each recording is computed; this is the so-called observation sequence, from which the parameters of the model are trained.

In practice, more than one observation sequence is usually used to train an HMM, so when an HMM is trained with L observation sequences, the re-estimation formulas of the Baum-Welch algorithm must be amended. Let the L observation sequences be $O^{(l)}$, $l = 1, 2, \dots, L$, in which $O^{(l)} = O^{(l)}_1, O^{(l)}_2, \dots, O^{(l)}_{T_l}$, and assume that the observation sequences are independent of each other. Then formula (1) holds:

$$P(O \mid \lambda) = \prod_{l=1}^{L} P(O^{(l)} \mid \lambda) \qquad (1)$$

Since the re-estimation formulas are based on frequencies of occurrence at different times, for L training sequences the revised re-estimation formulas are (2), (3) and (4), where $\alpha$ and $\beta$ are the forward and backward variables and a bar marks the re-estimated value:

$$\bar{\pi}_i = \sum_{l=1}^{L} \frac{\alpha_1^{(l)}(i)\, \beta_1^{(l)}(i)}{P(O^{(l)} \mid \lambda)}, \quad 1 \le i \le N \qquad (2)$$

$$\bar{a}_{ij} = \frac{\sum_{l=1}^{L} \sum_{t=1}^{T_l - 1} \alpha_t^{(l)}(i)\, a_{ij}\, b_j(O_{t+1}^{(l)})\, \beta_{t+1}^{(l)}(j) \,/\, P(O^{(l)} \mid \lambda)}{\sum_{l=1}^{L} \sum_{t=1}^{T_l - 1} \alpha_t^{(l)}(i)\, \beta_t^{(l)}(i) \,/\, P(O^{(l)} \mid \lambda)}, \quad 1 \le i, j \le N \qquad (3)$$

$$\bar{b}_{jk} = \frac{\sum_{l=1}^{L} \sum_{t=1,\, O_t^{(l)} = v_k}^{T_l} \alpha_t^{(l)}(j)\, \beta_t^{(l)}(j) \,/\, P(O^{(l)} \mid \lambda)}{\sum_{l=1}^{L} \sum_{t=1}^{T_l} \alpha_t^{(l)}(j)\, \beta_t^{(l)}(j) \,/\, P(O^{(l)} \mid \lambda)}, \quad 1 \le j \le N,\ 1 \le k \le M \qquad (4)$$
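The re-estimation step over several sequences can be illustrated with a minimal Python sketch. It assumes a discrete HMM with observation symbols coded as integers 0..M-1, uses unscaled forward and backward variables (so it is only suitable for short sequences), and divides the initial-state estimate by its sum so that it is a probability vector; that normalization and the names `forward_backward` and `reestimate` are assumptions of this sketch, not part of the paper.

```python
import numpy as np

def forward_backward(A, B, pi, obs):
    """Unscaled forward/backward variables for one discrete observation sequence."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    beta = np.ones((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return alpha, beta, alpha[-1].sum()  # last value is P(O | lambda)

def reestimate(A, B, pi, sequences):
    """One re-estimation step over L independent sequences, in the spirit of (2)-(4)."""
    N, M = B.shape
    pi_num = np.zeros(N)
    a_num, a_den = np.zeros((N, N)), np.zeros(N)
    b_num, b_den = np.zeros((N, M)), np.zeros(N)
    for obs in sequences:
        obs = np.asarray(obs)
        alpha, beta, p = forward_backward(A, B, pi, obs)
        gamma = alpha * beta / p  # every term is weighted by 1 / P(O^(l) | lambda)
        pi_num += gamma[0]
        for t in range(len(obs) - 1):
            # alpha_t(i) * a_ij * b_j(O_{t+1}) * beta_{t+1}(j) / P(O^(l) | lambda)
            a_num += np.outer(alpha[t], B[:, obs[t + 1]] * beta[t + 1]) * A / p
        a_den += gamma[:-1].sum(axis=0)
        for k in range(M):
            b_num[:, k] += gamma[obs == k].sum(axis=0)
        b_den += gamma.sum(axis=0)
    new_pi = pi_num / pi_num.sum()  # assumption: normalize so that pi sums to one
    return a_num / a_den[:, None], b_num / b_den[:, None], new_pi

# Toy model with N = 2 states and M = 3 symbols, trained on L = 3 sequences.
A = np.array([[0.6, 0.4], [0.3, 0.7]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
pi = np.array([0.5, 0.5])
sequences = [[0, 1, 2, 2], [0, 0, 1, 2, 2], [1, 2, 2]]
new_A, new_B, new_pi = reestimate(A, B, pi, sequences)
```

For utterances of realistic length the unscaled products underflow, so a practical implementation would use scaling or log-domain arithmetic instead.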

2.2. The improved HMM recognition algorithm: double-template matching

The most important issue in HMM training is the estimation of the initial parameters. Different initial values may produce different training results, and the suitability of the initial estimates also determines whether the final model parameters converge to the global optimum. The improvement over the traditional HMM recognition algorithm consists mainly of a two-template matching method in which the templates are initialized differently.

For a speech model, a continuous HMM has more parameters than a discrete HMM and can characterize the spatial distribution of the feature vectors more accurately. Its drawbacks are heavy computation and slow convergence.


In theory, one notable property is that the model converges after a finite number of iterations from any initial value. In practice, an improper initial value, combined with an imprecise characterization of the model parameters, may cause the iteration to diverge or to require excessive iterations. Training then takes a long time or fails to converge, and the requirements of the actual application cannot be met.

Research and experiments showed that the choice of initial values for the model parameters π and A has little effect; in general they are chosen randomly or uniformly. In the traditional HMM, B is generally initialized by averaging: the observation vector sequence of each unit to be recognized is divided into N segments, where N is the number of states. Since the HMM has a left-to-right structure, the observation vectors correspond to the states in time order. After segmentation, each observation vector corresponds to one state of the HMM, and the mean, correlation, variance and other parameters of each segment are then computed. Initializing B in this way, however, depends on the chosen number of states: the fewer the states, the coarser the division, and the less well the characteristics of the speech signal are reflected.
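The segment-based initialization of B described above can be sketched as follows. The sketch assumes that each state is described by a per-dimension mean and variance (for example, one diagonal Gaussian per state), which the paper does not specify; the function name `init_state_statistics` is hypothetical.

```python
import numpy as np

def init_state_statistics(features, n_states):
    """Split a (T, D) feature sequence into n_states consecutive time segments
    and return the per-segment mean and variance, one row per state."""
    features = np.asarray(features, dtype=float)
    if len(features) < n_states:
        raise ValueError("need at least one frame per state")
    # Left-to-right structure: segment k in time order is assigned to state k.
    segments = np.array_split(features, n_states, axis=0)
    means = np.stack([seg.mean(axis=0) for seg in segments])
    variances = np.stack([seg.var(axis=0) for seg in segments])
    return means, variances

# Six frames of two-dimensional features split over three states.
features = np.arange(12.0).reshape(6, 2)
means, variances = init_state_statistics(features, n_states=3)
```

With fewer states each segment spans more frames, so the per-state statistics average over a longer stretch of the utterance, which is the coarsening effect noted in the text.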

Speech recognition is mainly pattern matching against each HMM template in the system: matching is performed with the Viterbi algorithm, and the template closest to the speech to be recognized, according to the matching probability, is selected as the recognition result. In the implementation of the system, a multi-template matching algorithm is therefore proposed, in which recognition is performed through two matching passes in order to improve recognition accuracy.
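As an illustration of template matching by the Viterbi algorithm, the following sketch scores one observation sequence against several word models in the log domain and picks the best one. It assumes discrete HMMs with strictly positive parameters (so the logarithms are finite); the two toy models and the names `viterbi_log_score` and `recognize` are hypothetical.

```python
import numpy as np

def viterbi_log_score(log_A, log_pi, log_B_obs):
    """log_B_obs[t, j] = log b_j(O_t). Returns the best-path log probability and the path."""
    T, N = log_B_obs.shape
    delta = log_pi + log_B_obs[0]
    back = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_A  # scores[i, j]: leave state i, enter state j
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_B_obs[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return float(delta.max()), path[::-1]

def recognize(models, obs):
    """models maps a word to (A, B, pi); returns the best-matching word and all scores."""
    scores = {}
    for word, (A, B, pi) in models.items():
        log_B_obs = np.log(B[:, obs]).T  # shape (T, N)
        scores[word], _ = viterbi_log_score(np.log(A), np.log(pi), log_B_obs)
    return max(scores, key=scores.get), scores

A = np.array([[0.6, 0.4], [0.1, 0.9]])
pi = np.array([0.9, 0.1])
models = {
    "zero": (A, np.array([[0.8, 0.2], [0.2, 0.8]]), pi),
    "one": (A, np.array([[0.2, 0.8], [0.8, 0.2]]), pi),
}
obs = [0, 0, 1, 1]
word, scores = recognize(models, obs)
```

Working with log probabilities keeps the scores comparable across templates and avoids numerical underflow on long sequences.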

The basic idea of the two-pass matching algorithm is that the system is trained twice, producing a coarse HMM template and a high-precision HMM template; the number of states N in the high-precision HMM is larger than in the coarse one.

In the recognition process, the speech to be recognized is first matched against the coarse HMM parameters. The N recognition results with the highest matching probabilities are then taken as the candidates to be matched in the high-precision stage, and a second match is performed against the high-precision templates of these candidates, thereby improving recognition accuracy.
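The two-pass procedure can be sketched independently of how each template is scored. In the sketch below, `coarse` and `fine` are assumed to map each word to a scoring function (for example, a Viterbi log probability under the coarse or high-precision HMM), and `n_best` plays the role of the number of candidates kept after the first pass; all names and the toy scores are hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Scorer = Callable[[Sequence[int]], float]

def two_pass_recognize(
    obs: Sequence[int],
    coarse: Dict[str, Scorer],
    fine: Dict[str, Scorer],
    n_best: int = 3,
) -> Tuple[str, List[str]]:
    """First pass: score every coarse template and keep the n_best candidates.
    Second pass: rescore only those candidates with the high-precision templates."""
    coarse_scores = {word: score(obs) for word, score in coarse.items()}
    shortlist = sorted(coarse_scores, key=coarse_scores.get, reverse=True)[:n_best]
    fine_scores = {word: fine[word](obs) for word in shortlist}
    return max(fine_scores, key=fine_scores.get), shortlist

# Toy scorers returning fixed log scores, only to show the control flow.
coarse = {"ball": lambda o: -10.0, "bowl": lambda o: -9.5,
          "goal": lambda o: -20.0, "swim": lambda o: -30.0}
fine = {"ball": lambda o: -8.0, "bowl": lambda o: -12.0,
        "goal": lambda o: -1.0, "swim": lambda o: -40.0}
best, shortlist = two_pass_recognize([0, 1, 2], coarse, fine, n_best=2)
```

Only the shortlisted words are scored with the larger templates, so the cost of the second pass grows with `n_best` rather than with the vocabulary size; a word pruned in the first pass ("goal" in the toy data) cannot be recovered in the second.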

In theory, the more states the better, because the recognition error rate falls to a stable level as the number of states increases. However, since the training samples are limited, the number of states N cannot be too large; otherwise, after training, many of the entries of the parameter λ = (A, B, π) corresponding to those states would be 0 or very close to 0 and thus redundant. In the experiments the number of states was set between 3 and 8 according to the complexity of the speech. Using the spoken English digits "0-9" as test data, experiments were carried out with different numbers of states. Table 1 shows the results, averaged over three recognition runs, for templates with different numbers of states.

The data in Table 1 show that as the number of states N increases, the recognition rate of the system increases accordingly, but the recognition time also tends to increase, because the amount of computation, and hence the complexity of the system, grows.

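Table 1. Recognition results with different numbers of states

| Number of states | Number of templates | Average number recognized | Average recognition rate | Average recognition time |
|---|---|---|---|---|
| 3 | 10 | 9.003 | 90.03% | 0.2806 |
| 4 | 10 | 9.012 | 90.12% | 0.2913 |
| 5 | 10 | 9.223 | 92.23% | 0.3122 |
| 6 | 10 | 9.311 | 93.11% | 0.3143 |
| 7 | 10 | 9.345 | 93.45% | 0.3221 |
| 8 | 10 | 9.371 | 93.71% | 0.3219 |

3. System evaluation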

The recognition module of the system uses dual-template matching. To improve recognition efficiency, the two templates are not both used for every recognition: if the first recognition is successful, only the coarse template is used. The second recognition ("re-recognition" or "twice recognition") is used only when the first recognition is unsuccessful or the recognition result is not among the top two results. Fig. 1 shows a case in which recognition succeeds after the second pass.

Fig. 1. Result of twice recognition

Multi-template matching was carried out in the training window on four categories of sports vocabulary; the experimental data are shown in Table 2.

Table 2. Recognition results of dual-template matching

| Category | Average recognition rate, once recognition | Average recognition time, once recognition | Average recognition rate, re-recognition | Average recognition time, re-recognition |
|---|---|---|---|---|
| Basketball class | 90.12% | 0.2903 | 92.14% | 0.3249 |
| Swimming class | 89.76% | 0.2899 | 91.35% | 0.3197 |
| Track and Field class | 88.57% | 0.2917 | 90.68% | 0.3092 |

The data in Table 2 show that with the dual template the recognition rate of the system increases, as does the recognition time. The overall average recognition rate of the system reaches 90% or more, which meets the system performance requirements.
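The trade-off between recognition rate and recognition time can be checked directly from the three rows reported in Table 2. The short script below is only an arithmetic aid; the variable names are hypothetical and the values are copied from the table.

```python
# (rate once, time once, rate re-recognition, time re-recognition) from Table 2
table2 = {
    "Basketball": (90.12, 0.2903, 92.14, 0.3249),
    "Swimming": (89.76, 0.2899, 91.35, 0.3197),
    "Track and Field": (88.57, 0.2917, 90.68, 0.3092),
}

# Gain in recognition rate (percentage points) and relative time cost per category.
gains = {name: r2 - r1 for name, (r1, t1, r2, t2) in table2.items()}
time_ratios = {name: t2 / t1 for name, (r1, t1, r2, t2) in table2.items()}

mean_once = sum(row[0] for row in table2.values()) / len(table2)
mean_re = sum(row[2] for row in table2.values()) / len(table2)

for name in table2:
    print(f"{name}: +{gains[name]:.2f} points, time x{time_ratios[name]:.3f}")
print(f"mean rate: {mean_once:.2f}% once, {mean_re:.2f}% after re-recognition")
```

Over these three categories the rate gains are 2.02, 1.59 and 2.11 percentage points, and the mean rate rises from about 89.48% to 91.39%, while the recognition time grows in every category.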


Conclusion

A dual-template matching method was proposed on the basis of an improved HMM algorithm aimed at large-vocabulary, speaker-independent recognition, together with effective solutions to issues encountered in practical applications of the HMM algorithm, such as initial model selection, training with multiple observation sequences, and data underflow. With these, the speaker-independent, large-vocabulary recognition rate exceeded 90% on a PC platform. Although the system based on the HMM algorithm has advantages, it also has disadvantages: as the number of recognition templates increases, the total number of states of the HMM models also increases exponentially, which requires a large amount of storage. To obtain a robust model that can be applied well to embedded systems, state clustering methods need to be studied.

Acknowledgements

This work was partially supported by the Ministry of Education Humanities and Social Science Project #10YJCZH220.
