In the first experiments involving training of two-component intra-segment mixture models (Holmes and Russell, 1996a), promising results were obtained by adding a second (small
Static GSHMMs of sub-phone segments 155
variance) intra-segment mixture component to the single-Gaussian models. However, as the second mixture component seemed to be necessary for the correct operation of these models and hence for appropriate state-alignment of the models to the training data, later experiments focused on finding a suitable initialisation and training strategy which incorporated the mixture model at the very first stage of training (Holmes and Russell, 1996b). After initial investigations, it was found that a good approach was to initialise the models from a standard-HMM segmentation in the same way as for single-Gaussian models, but then to add the second mixture component with a small variance and low weight before training the models. This strategy was adopted for all of the experiments reported here, using a variance value of 0.1 and a weight of 0.1. The second mixture component thus acted to describe the peak of distributions such as the examples shown in Figure 10.3.
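The initialisation step described above can be illustrated with a minimal Python sketch for a single scalar feature. Only the variance of 0.1 and the weight of 0.1 for the added component come from the text. The rest are assumptions made for illustration: that both components are centred on the same target value, that the original component keeps the remaining weight of 0.9, the example broad variance of 4.0, and the function names.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a univariate Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def add_peak_component(broad_var, peak_var=0.1, peak_weight=0.1):
    """Turn a single-Gaussian intra-segment model into a two-component mixture.

    Returns a list of (weight, variance) pairs. The existing broad component
    keeps the remaining weight (an assumption, so that the weights sum to one).
    """
    return [(1.0 - peak_weight, broad_var), (peak_weight, peak_var)]

def mixture_pdf(x, target, components):
    """Mixture density with every component centred on the same target (assumed)."""
    return sum(w * gaussian_pdf(x, target, v) for w, v in components)

# Hypothetical single-Gaussian model with intra-segment variance 4.0.
components = add_peak_component(broad_var=4.0)
single = gaussian_pdf(0.0, 0.0, 4.0)      # density at the target, one Gaussian
mixed = mixture_pdf(0.0, 0.0, components)  # density at the target, two components
print(components, single, mixed)
```

In this sketch the narrow component raises the density at the target while leaving the skirts almost unchanged, which is the role the text attributes to the second mixture component.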
When iteratively calculating the optimal target according to equation (10.1), a reasonable initial estimate was provided by the approximate value defined in equation (10.2). The optimal target value was generally found to change a small amount at the first iteration, with very little change on further iterations. In view of the practical issue of computing time, a single iteration was considered to be adequate.
[Plot: log probability of the training set (vertical axis, approximately -6.4e+07 to -4.6e+07) against iteration number (horizontal axis, 1 to 15), with curves for HMM, GSHMM and 2-mix GSHMM.]
Figure 10.5: Log probability of the digit training set as a function of iteration number for two-component intra-segment mixture static GSHMMs compared with single-Gaussian static GSHMMs. The baseline standard HMMs are also shown for reference (segmental models are shown as starting at iteration 6, as they were initialised from the HMMs after 5 training iterations).
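The iterative target calculation can be sketched as follows. Equations (10.1) and (10.2) are not reproduced in this section, so the code below is a generic illustration rather than the formulation used in the experiments. It assumes a scalar feature, a Gaussian extra-segmental distribution for the target, and an intra-segment mixture whose components are all centred on the target. The update is a standard EM-style step for the most probable target, and the starting point is the closed-form value for a single-Gaussian intra-segment model. All names and numerical values are hypothetical.

```python
import math

def log_gauss(x, mean, var):
    """Log density of a univariate Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def component_logs(y, target, components):
    """Log of weight times density for each mixture component."""
    return [math.log(w) + log_gauss(y, target, v) for w, v in components]

def initial_target(obs, prior_mean, prior_var, intra_var):
    """Closed-form most probable target when the intra-segment model is one Gaussian."""
    num = prior_mean / prior_var + sum(obs) / intra_var
    den = 1.0 / prior_var + len(obs) / intra_var
    return num / den

def refine_target(obs, target, prior_mean, prior_var, components):
    """One EM-style update of the target under the mixture intra-segment model."""
    num = prior_mean / prior_var
    den = 1.0 / prior_var
    for y in obs:
        logs = component_logs(y, target, components)
        top = max(logs)
        post = [math.exp(l - top) for l in logs]
        total = sum(post)
        for (_, var), p in zip(components, post):
            resp = p / total          # responsibility of this component for y
            num += resp * y / var
            den += resp / var
    return num / den

def joint_log_prob(obs, target, prior_mean, prior_var, components):
    """Extra-segmental log probability of the target plus intra-segment terms."""
    total = log_gauss(target, prior_mean, prior_var)
    for y in obs:
        logs = component_logs(y, target, components)
        top = max(logs)
        total += top + math.log(sum(math.exp(l - top) for l in logs))
    return total

# Hypothetical segment and model parameters.
obs = [1.0, 1.2, 0.9, 3.0]
components = [(0.9, 4.0), (0.1, 0.1)]
target = initial_target(obs, prior_mean=0.0, prior_var=10.0, intra_var=4.0)
history = [joint_log_prob(obs, target, 0.0, 10.0, components)]
for _ in range(3):
    target = refine_target(obs, target, 0.0, 10.0, components)
    history.append(joint_log_prob(obs, target, 0.0, 10.0, components))
print(target, history)
```

Each update of this kind cannot decrease the joint log probability, so stopping after a single iteration trades a small loss in optimality for computing time.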
The probability of the training set for the two-component mixture models was found to increase with each training iteration and to converge after around five iterations, as can be seen from Figure 10.5. It therefore appears that the GSHMM training algorithm still operates correctly when extended to incorporate two-component Gaussian intra-segmental distributions, and gives higher probabilities than the single-Gaussian models. This is encouraging, although a higher probability of the training set is not necessarily associated with better recognition performance, as evidenced by the poor performance of the single-Gaussian left-right GSHMMs. Good recognition performance is dependent on other factors, particularly the ability to generalise to new data and to perform segmentation simultaneously with recognition.
10.4.2 Trained-model characteristics
After training, the two intra-segment variance components had retained the characteristic of one having a much smaller value than the other, so modelling the peak and the skirts of the distribution respectively. The new variance component was found almost always to remain at its minimum value of 0.1, and could in fact be fixed at this value during training without loss in accuracy of the trained models. The other variance component tended to have a somewhat higher value than in the corresponding single-Gaussian model, as it was no longer so greatly influenced by the sharp peak at the distribution mean. The weights of the two components adjusted appropriately during training, such that the models now provided a good fit to the data they represented, as can be seen from the examples in Figure 10.6.
[Six histogram panels: energy, c1, c2, c3, c4, c5; horizontal axes from -50 to 50.]
Figure 10.6: Observed intra-segment distributions compared with calculated model distributions for two-component intra-segment mixture static GSHMMs, for the energy feature and first five cepstral coefficients of the last state of /n/.
The observed distributions for the two-component intra-segment mixture models show a sharper peak at the mean than was seen with the single Gaussian models (Figure 10.3). This peak reflects a return to the situation seen with the HMMs and HSMMs, with a high
proportion of single-frame segments, as can be seen from the training-data duration distribution shown in Figure 10.7. The alignments of the models to the training utterances were generally more sensible, which suggested a better balance between the extra- and intra-segmental probabilities than had been achieved with the single-Gaussian models.
[Histogram panels of state-occupancy duration, one per state; horizontal axes from 0 to 10 frames.]
Figure 10.7: Distributions showing duration of state occupancies for two-component intra-segment mixture static GSHMMs of the three states for the phones /a/ (V.1, V.2, V.3) and /n/ (n.1, n.2, n.3).
10.4.3 Recognition results
The two-component intra-segment mixture GSHMMs gave much improved recognition performance on the connected-digit test set (see Table 10.2), overcoming the excessive problems with deletion errors which had been encountered with the single-Gaussian models. Performance is however still not quite as good as that obtained with the HSMMs. Both sets of models appeared to be adequately trained after five iterations of re-estimation, as performing a further five iterations did not reduce the word error-rate.
Model set          % Correct   % Subs.   % Del.   % Ins.   % Errors
2-mixture GSHMM         93.2       5.3      1.5      0.4        7.2
HSMM                    94.1       5.2      0.7      0.7        6.6
Table 10.2: Connected-digit recognition results for two-component intra-segment mixture static GSHMMs compared with HSMMs.
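The columns of Table 10.2 are related by simple arithmetic, which the short Python check below makes explicit. It assumes the usual scoring convention for connected-word recognition, in which the error rate is the sum of substitutions, deletions and insertions, and the percentage correct excludes insertions. The text does not state this convention, but the figures in the table are consistent with it. The dictionary layout and function names are illustrative only.

```python
# Figures copied from Table 10.2 (all values are percentages).
results = {
    "2-mixture GSHMM": {"correct": 93.2, "subs": 5.3, "del": 1.5, "ins": 0.4, "errors": 7.2},
    "HSMM": {"correct": 94.1, "subs": 5.2, "del": 0.7, "ins": 0.7, "errors": 6.6},
}

def error_rate(row):
    """Word error rate: substitutions plus deletions plus insertions."""
    return row["subs"] + row["del"] + row["ins"]

def percent_correct(row):
    """Percentage correct: insertions are not counted against it."""
    return 100.0 - row["subs"] - row["del"]

for name, row in results.items():
    print(name, round(percent_correct(row), 1), round(error_rate(row), 1))
```

On this reading, the remaining gap between the two model sets comes mainly from deletions (1.5% against 0.7%), partly offset by fewer insertions for the GSHMMs.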
10.4.4 Discussion
The results have demonstrated improvements over standard-HMM recognition performance by using segmental models with three segments per phone and a maximum segment duration of ten frames.¹ However, making a distinction between within- and between-segment variability by introducing the optimal-target static GSHMM has not improved performance over that obtained with a much simpler HSMM structure. There are various possible contributory reasons for this disappointing performance of the static GSHMMs, which are discussed below.
¹ The static GSHMMs have improved recognition performance relative to standard HMMs, as reported in Holmes and Russell (1996b). However, the (more recent) comparison with HSMMs revealed that the advantage could in fact be explained by the imposed duration constraint rather than the model of variability.
Recognition performance for these segmental models was found to be critically dependent on describing the distributions accurately according to the model assumptions, in order to correctly balance the two types of probability. Reasonable representations were obtained by using a two-component intra-segment mixture distribution, but it is possible that performance could be further improved by more experimentation to find better parameter settings associated with the second intra-segment mixture component. Also, although the model behaviour seems to be reasonable given the limitations of a three-segment-per-phone model structure, the segmental approach could be more useful with phones represented by a smaller number of (longer) segments, and explicit alternative routes through the model if required.
Although it may be possible to improve the static GSHMM performance to some degree, it seems probable that there is a more fundamental limitation due to the static assumption in the trajectory model. While the standard-HMM assumption of complete independence is less valid, it may lead to better performance by treating all frames equally and relying on cumulative discrimination over several frames. In contrast, the static GSHMM imposes a trajectory model and then computes the observation probabilities conditioned on that trajectory. The intra-segmental probability acts to ensure that the trajectory is a plausible representation of a sequence of observations, while discrimination capability relies mainly on the extra-segmental probability, which is only computed once per segment. If the trajectory model is overly simplistic, it may not provide sufficient discrimination power even if the distribution assumptions are reasonable. Problems with the trajectory assumption would also explain the poor performance obtained by Gales and Young (1993a, 1993b) with their static segmental HMM. Although some success was achieved by Digalakis (1992) with a similar “target-state” segment model, he used up to five regions per phone (compared with three in the current experiments and those of Gales and Young). With more regions (or states), the static trajectory assumption is a better approximation, and so some benefit could be expected from introducing the segmental model of variability. This hypothesis is further supported by the recognition performance improvements demonstrated here for self-loop static GSHMMs (see Table 10.1), which in effect also use more regions per phone. In the next chapter, experiments are described which investigated the influence of the trajectory assumption by progressing to the much more realistic linear dynamic trajectory model.