A Fast Framework for the Constrained Mean Trajectory Segment Model by Avoidance of Redundant Computation on Segment

(1)

Computational Linguistics and Chinese Language Processing

Vol. 11, No. 1, March 2006, pp. 73-86 73 © The Association for Computational Linguistics and Chinese Language Processing

[Received Feb. 24, 2005; Revised Oct. 8, 2005; Accepted Oct. 12, 2005]

A Fast Framework for the Constrained Mean

Trajectory Segment Model by Avoidance of

Redundant Computation on Segment

1

Yun Tang

∗

, Wenju Liu

∗

, Yiyan Zhang

+

and Bo Xu

∗

Abstract

The segment model (SM) is a family of methods that use the segmental distribution rather than frame-based density (e.g. HMM) to represent the underlying characteristics of the observation sequence. It has been proved to be more precise than HMM. However, their high level of complexity prevents these models from being used in practical systems. In this paper, we propose a framework that can reduce the computational complexity of the Constrained Mean Trajectory Segment Model (CMTSM), one type of SM, by fixing the number of regions in a segment so as to share the intermediate computation results. Our work is twofold. First, we compare the complexity of SM with that of HMM and point out the source of the complexity in SM. Secondly, a fast CMTSM framework is proposed, and two examples are used to illustrate this framework. The fast CMTSM achieves a 95.0% string accurate rate in the speaker-independent test on our mandarin digit string data corpus, which is much higher than the performance obtained with HMM-based system. At the mean time, we successfully keep the computation complexity of SM at the same level as that of HMM.

Keywords: Speech Recognition, Segment Model, Mandarin Digit String Recognition

1. Introduction

The Hidden Markov Model (HMM) [Rabiner et al. 1993] has been used successfully for

1

This work was supported in part by the China National Nature Science Foundation (No. 60172055) and the Beijing Nature Science Foundation (No.4042025).

∗

Institute of Automation, Chinese Academy of Sciences, P.O.Box 2728, nlpr, Beijing 100080 Tel: +086 01062659279 Fax: +086 01062551993

E-mail: [email protected]

+

(2)

74 Yun Tang et al.

acoustic modeling in many speech recognition systems. Given the state sequence, feature vectors are assumed to be conditionally independent, and the task of extracting the trajectory can be elegantly achieved by applying the Viterbi algorithm frame by frame. However, the above assumption is far from realistic, which limits the HMM’s ability to capture the relations within a segment. Another weakness of HMM is that it is not accurate enough to represent a non-stationary observation sequence by means of a piecewise constant state [Deng et al. 1994; Hon et al. 1999]. In order to handle these problems, a lot of methods have been proposed, including SM [Ostendorf et al. 1996], which is a family of methods among them.

(3)

(4)

(5)

(6)

(7)

(8)

(9)

(10)

82 Yun Tang et al.

was 10ms. For each frame, a 39-dimension vector, composed of 12 MFCC and 1 normalized energy, 13 first order deviations and 13 secondary deviations, was calculated.

Baseline Systems: three systems were studied, HMM, SSM, and PTM. The state in HMM (or region in SSM) was modeled by the Gaussian Mixture Model (GMM). In all the experiments, a diagonal covariance matrix was assumed for each GMM. Table 2 compares the baselines’ configures. Sts is the number of states, Res is the number of regions, MCs is the number of mixture components, “ID” means the acoustic unit is modeled by the whole-word (context independent), and “D” means the acoustic unit is tri-word based (context dependent) in Table 2.

Table 2. The settings of the models in the experiments

Model Sts (Res) MCs Type HMMI 8 16 ID HMMII 8 16 D SSMI 25 5 ID SSMII 40 10 ID

PTM 15 ID

The HMMs in the experiments were structured left to right with 8 states, 6 emitting distributions, and no state skipping, except for the "silence" model, which had 3 states and 1 emitting distribution. HMMI was decoded with the conventional Viterbi algorithm, and HMMII adopted a two-pass search strategy: the first pass was implemented using the forward Viterbi algorithm, and the second pass using the backward A* decoding to integrate the duration distribution [Deng et al. 2000]. HMMII was modeled using the tri-word model, while the other systems were modeled by the whole-word model. SSMI and SSMII were two SSM systems. SSMI had Gaussian densities comparable with those of HMMI so that a comparison of the performance between SSM and HMM would be meaningful. SSMII, which had more region models and mixture components than SSMI, achieved the best performance in the digit string recognition task. The baseline PTM was consisted of three sub-segments [Deng et al. 1994] and the polynomial regression order was 2.

4.2 Experimental Results

(11)

A Fast Framework for the Constrained Mean Trajectory Segment Model by 83 Avoidance of Redundant Computation on Segment

substitution error rate respectively.

Table 3. Comparison of digit string recognition performance achieved with SSM and HMM

For the purpose of comparison, the number of regions in a sub-region sequence was fixed at 20 and the total number of region models was 60 (20

×

3) in each fast PTM. The feature frames in a segment were mapping to these 60 region models using the time linear resampling. The other parameters were the same as those for the baseline PTM system. Table 4 presents the recognition results obtained with the fixed PTM and the original PTM. It shows that the performance of the PTM system was slightly downgraded following the modifications but still acceptable (0.6% string accuracy loss).

Table 4. Recognition results obtained with PTM and fixed PTM

Methods S Corr. W err Ins err Del err Sub err

PTM 95.10% 1.53% 0.30% 0.24% 0.99%

Fixed PTM 94.50% 1.82% 0.14% 0.68% 1.00%

The efficiency of the different recognition systems, including the conventional SSM, fast SSM, PTM, fixed PTM, and HMM, is compared in Table 5. We used the utterances of one person (80 strings) in the test set. As shown in Table 5, the fast algorithm boosted SMs and reduced the complexity of SM to the same level of that of HMM. The most noticeable achievement was made by the fixed PTM system, which was 90 times faster than the original one.

Table 5. Time comparison of SM, Fast SM, and HMM

T (s)

HMMI 35

HMMII 87

Conventional SSMI 1816

Fast SSMI 101

Fast SSMII 162

PTM 23854

Fixed PTM 271

S Corr. W err Ins err Del err Sub err

HMMI 87.10% 3.88% 0.64% 2.14% 1.10%

HMMII 91.80% 2.53% 0.19% 0.87% 1.47%

SSMI 92.52% 2.58% 0.23% 0.72% 1.63%

(12)

84 Yun Tang et al.

5. Conclusions

In this paper, a fast framework has been proposed to boost the speed of CMTSM based on the assumption that the region model of SM is independent from the segment duration, so that intermediate results are shared during the computation of segment scores. Two examples, SSM and PTM, have been used to illustrate this framework. The improved systems are far more effective than the original models. Based on this framework, it is potential to implement SM to LVCSR [Tang et al. 2005] in current computation condition and this will be our focus of future work.

Acknowledgements

The authors would like to thank the anonymous reviewers and Mr. Ludwig for their useful comments.

References

Deng, L., M. Aksmanovic, X. Sun, and C. Wu, “Speech Recognition Using Hidden Markov Models with Polynomial Regression Functions as Non-stationary States,” IEEE Trans. Speech Audio Processing, 2, 1994, pp. 507-520

Deng, Y., T. Huang, and B. Xu, “Towards high performance continuous mandarin digit string recognition,” In Proceeding of Int. Conf. on Spoken Language Processing, 2000, Beijing, China, vol.3, 642-645.

Digalakis, V., M. Ostendorf, and J. Rohlicek, “Fast Algorithms for phone classification and recognition using Segment-based Models,” IEEE Trans. Speech Audio Processing, 40(12), 1992, pp 2885-2896.

Gish, H., K.Ng, and J. Rohlicek, “Secondary Processing using Speech Segments for an HMM Word Spotting System,” In Proceeding of Int. Conf. on Spoken Language Processing, 1992, Banff, Canada, pp. 17-20.

Glass, J., "A probabilistic framework for segment-based speech recognition," Computer Speech and Language, 17, 2003, pp. 137-152.

Hon, H.W., and K. Wang. “Combining Frame and Segment Based Models for Large Vocabulary Continuous Speech Recognition”, In IEEE Workshop on Automatic Speech Recognition and Understanding. 1999, Keystone, USA, pp. 221-224.

Lee S., and J. Glass, “Real-Time Probabilistic Segmentation for Segment-Based Speech Recognition,” In Proceeding of Int. Conf. on Spoken Language Processing, 1998, Sydney, Australia, pp. 1803-1806.

(13)

A Fast Framework for the Constrained Mean Trajectory Segment Model by 85 Avoidance of Redundant Computation on Segment

Ostendorf, M., V. Digalakis, and O. Kimball, “From HMM’s to Segment Models: A Unified View of Stochastic Modeling for Speech Recognition.” IEEE Trans. Speech Audio Processing, 4(5), 1996, pp. 360-378.

Rabiner, L., and B. H. Juang, Fundamentals of speech recognition, Prentice Hall, 1993. Rabiner, L., J. Wilpon, and F. Soong, “High Performance Connected Digit Recognition Using

Hidden Markov Models,” IEEE Transactions on Acoustics, Speech, and Signal Processing, 37(8), 1989, pp. 1214-1225.

(14)