
Static GSHMMs of sub-phone segments

10.1 Initial digit-recognition experiments

The aim of the experiments described in this chapter was to use the digit-recognition task to assess the performance of a static segmental model of sub-phone segments. To ensure that each phone could only be represented by three segments, the self-loop transitions (used in the experiments described in Chapter 8) were removed from the model structure. In addition to comparing recognition performance with that of the baseline standard HMMs, comparisons were made with two types of model using a more limited segmental structure. Firstly, comparisons were made with simple static “self-loop” GSHMMs as used in the early recognition experiments described in Chapter 8, in order to evaluate the effect of using the segmental HMM with a fixed number of three segments per phone rather than simply to model sequences of similar frames. Secondly, comparisons were performed using the three-segment-per-phone model structure with the same duration constraints as for the new static GSHMMs, but with a conventional-HMM probability calculation. These models thus have the same structure as the hidden semi-Markov models investigated by Russell and Moore (1985), but with simply a duration constraint (and uniform duration distribution) rather than a trained duration model.

For naming convenience, the duration-constrained HMMs studied here will be referred to as a form of hidden semi-Markov model (HSMM). By including the “HSMMs”, it was possible to evaluate effects due to distinguishing between intra- and extra-segment variability separately from segment-duration effects. It was considered important to include this comparison because, although the experiments were set up so that neither the segmental nor the standard HMMs distinguished between models on the basis of duration characteristics, there were still differences in their representations: the segmental models impose a maximum segment duration and were assigned a uniform duration distribution, whereas standard HMMs have an implicit geometric duration distribution and allow arbitrarily long segment durations.
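The difference between the two duration representations can be illustrated with a minimal sketch. The self-loop probability of 0.8 is a hypothetical value chosen only for illustration, and the function names are not taken from the original work; the maximum duration of 10 frames is the value used for the left-right models in Section 10.1.1.

```python
def geometric_duration_pmf(self_loop_prob, max_shown):
    """Duration probabilities implied by a standard-HMM state with a self-loop.

    P(d) = a^(d-1) * (1 - a) for d = 1, 2, ..., with no upper limit on d.
    Only the first max_shown values are returned.
    """
    return [self_loop_prob ** (d - 1) * (1.0 - self_loop_prob)
            for d in range(1, max_shown + 1)]


def uniform_duration_pmf(max_duration, max_shown):
    """Uniform duration probabilities with a hard maximum segment duration."""
    return [1.0 / max_duration if d <= max_duration else 0.0
            for d in range(1, max_shown + 1)]


SELF_LOOP_PROB = 0.8   # hypothetical value, for illustration only
MAX_DURATION = 10      # maximum segment duration in frames

geometric = geometric_duration_pmf(SELF_LOOP_PROB, 20)
uniform = uniform_duration_pmf(MAX_DURATION, 20)

# Probability mass that the geometric model places on durations longer
# than the segmental model's maximum; the uniform model places none there.
geometric_tail = 1.0 - sum(geometric[:MAX_DURATION])
uniform_tail = sum(uniform[MAX_DURATION:])
```

With these assumed values the geometric model always favours the shortest duration and still leaves some probability for durations beyond 10 frames, whereas the uniform model treats all durations up to the maximum equally and excludes anything longer.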

10.1.1 Method

New GSHMM model structure

For the new type of segmental model, which will be referred to as a left-right GSHMM, a strict left-right topology was used with no self-loops so that the only allowed transition was from any given state onto the next state ($a_{ij} = 0$ if $j \neq i + 1$). The maximum segment duration was increased from five to 10 frames, so imposing a maximum phone duration of 300 ms, which was considered adequate for most speech sounds in connected utterances. The self-loops were retained for the non-speech models, to provide a simple way of accommodating any long periods without speech.
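As a minimal sketch of this topology, the transition matrix and the resulting maximum phone duration can be written out as follows. The extra exit column and the 10 ms frame interval are assumptions: the frame interval is not stated in this section but is implied by three segments of up to 10 frames giving 300 ms.

```python
def left_right_transitions(num_states):
    """Strict left-right transition matrix with no self-loops.

    Row i holds the transition probabilities out of state i. Column
    num_states is a hypothetical exit state, so that the final state
    also has exactly one allowed transition.
    """
    return [[1.0 if j == i + 1 else 0.0 for j in range(num_states + 1)]
            for i in range(num_states)]


SEGMENTS_PER_PHONE = 3
MAX_SEGMENT_FRAMES = 10
FRAME_MS = 10  # assumed frame interval, inferred from the 300 ms figure

transitions = left_right_transitions(SEGMENTS_PER_PHONE)
max_phone_frames = SEGMENTS_PER_PHONE * MAX_SEGMENT_FRAMES
max_phone_ms = max_phone_frames * FRAME_MS
```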

The left-right model structure was also used for the HSMMs. As explained in Chapter 6, HSMMs can be regarded as special cases of static GSHMMs in which all extra-segment variances are equal to zero so that all variability is modelled by the intra-segment variance. Both models include provision for a duration component, which has not been investigated here.
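The special-case relationship can be sketched for a single scalar feature. The formulation below is an assumption, since the probability calculation itself is defined in Chapter 6 and is not restated here: a segment target is taken to be drawn from a Gaussian with the extra-segment variance, each frame from a Gaussian centred on that target with the intra-segment variance, and the segment is scored using the most likely target. All function names are illustrative.

```python
import math


def gaussian_logpdf(x, mean, var):
    """Log density of a scalar Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def segment_logprob(frames, mean, extra_var, intra_var):
    """Log probability of one segment under the assumed static segment model.

    With extra_var == 0 the target is fixed at the state mean, so the
    score reduces to a sum of conventional per-frame Gaussian terms.
    """
    n = len(frames)
    if extra_var == 0.0:
        target = mean
        target_term = 0.0  # target is fixed, so it contributes no density term
    else:
        # Most likely target: precision-weighted combination of the
        # state mean and the observed frames.
        target = ((mean / extra_var + sum(frames) / intra_var)
                  / (1.0 / extra_var + n / intra_var))
        target_term = gaussian_logpdf(target, mean, extra_var)
    return target_term + sum(gaussian_logpdf(y, target, intra_var)
                             for y in frames)


frames = [1.0, 1.2, 0.9]
hsmm_score = segment_logprob(frames, mean=1.0, extra_var=0.0, intra_var=0.5)
hmm_score = sum(gaussian_logpdf(y, 1.0, 0.5) for y in frames)
```

Under these assumptions `hsmm_score` and `hmm_score` coincide, which is the sense in which the duration-constrained HSMM keeps the conventional-HMM probability calculation.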

Segmental-HMM initialisation strategy

The three sets of segmental models were initialised as follows:

Self-loop GSHMMs: The self-loop GSHMMs were initialised directly from the trained standard-HMM parameters, using the same procedure of setting the intra-segment variances to a fixed small value as was adopted in the earlier experiments (Section 8.2). This strategy allows a small degree of variability within a segment, similar to the extent of variability which is successful for VFR analysis.

HSMMs: The HSMMs were also initialised directly from the trained standard-HMM parameters, but for these models the intra-segment variances were copied from the standard-HMM variances and the extra-segment variances were set to zero.

Left-right GSHMMs: In the case of the left-right GSHMMs, it was necessary to initialise the intra-segment variances to values which were smaller than the total variability as represented by the HSMMs, but larger than the intra-segment variances for the self-loop GSHMMs so as to accommodate the greater variability associated with using a smaller number of segments per phone. Although the precise values are not critical, preliminary experiments showed that it was important to initialise the extra- and intra-segment variances to appropriate values for representing typical (phone-dependent) variability for each model state, to ensure sensible segmentation at the start of training. The importance of initialisation is not surprising, because the model relies on correctly balancing extra- and intra-segmental probabilities, and also because the re-estimation equations for the variances use previous values in place of re-estimated values on their right-hand sides. Therefore, rather than relying on making the correct arbitrary choice for the model initialisation, in these experiments an automatic procedure was used to initialise the model parameters directly from the complete set of training data as segmented by the trained standard HMMs. Extra-segment means and variances and intra-segment variances were all initialised using the procedure described in Chapter 9 for estimating data distributions.
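One way such a data-driven initialisation could work is sketched below for a single scalar feature. The Chapter 9 procedure is not reproduced in this section, so the estimator here is an assumption: simple moment estimates computed from the segments that the standard-HMM alignment assigns to one model state. The function name and data layout are illustrative only.

```python
def initialise_state(segments):
    """Moment-based initial estimates for one model state.

    segments: list of segments aligned to the state, each a non-empty
    list of scalar frame values.
    Returns (extra_mean, extra_var, intra_var).
    """
    segment_means = [sum(seg) / len(seg) for seg in segments]

    # Extra-segment statistics: spread of the segment means.
    extra_mean = sum(segment_means) / len(segment_means)
    extra_var = (sum((m - extra_mean) ** 2 for m in segment_means)
                 / len(segment_means))

    # Intra-segment variance: spread of frames about their own segment mean.
    num_frames = sum(len(seg) for seg in segments)
    intra_var = sum((y - m) ** 2
                    for seg, m in zip(segments, segment_means)
                    for y in seg) / num_frames
    return extra_mean, extra_var, intra_var


# Two toy segments aligned to the same state.
extra_mean, extra_var, intra_var = initialise_state([[1.0, 3.0], [5.0, 7.0]])
```

In this toy case the segment means are 2 and 6, so the extra-segment variance (4) exceeds the intra-segment variance (1), which is the kind of balance the initialisation is meant to capture before re-estimation begins.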

Segmental-HMM training procedure

Several iterations of segmental-HMM re-estimation were applied and, as can be seen from Figure 10.1, the probability of the training set was observed to increase with each successive iteration and to be consistently highest for the left-right GSHMMs. The HSMMs, which incorporate the same duration constraints as the left-right GSHMMs but retain the conventional-HMM probability calculation, give a slightly higher probability than the basic standard HMMs. The optimised probabilities are, however, considerably higher when using the GSHMM probability calculation. The left-right GSHMMs converged quickly to their locally optimum parameter values, presumably because the model initialisation procedure ensured that the extra- and intra-segment variances started with approximately correct values. It therefore appears that the training procedure operates appropriately for the left-right segmental models as well as for the self-loop models.

10.1.2 Trained-model characteristics

As in the previous experiments, all the models were considered to have been adequately trained after five iterations of re-estimation. The intra-segment variances of the self-loop GSHMMs showed similar characteristics to those observed previously (see Section 8.3). For the left-right GSHMMs, which need to represent longer segments and therefore allow greater within-segment variability, the intra-segment variance was generally larger than for the self-loop models, but it was usually still considerably smaller than the extra-segment variance.

[Plot: log probability of the training set (vertical axis, approximately -6.4e+07 to -4.8e+07) against iteration number (horizontal axis, 1 to 15), with curves for HMM, HSMM, self-loop GSHMM and left-right GSHMM.]

Figure 10.1: Log probability of the digit training set as a function of iteration number for left-right static GSHMMs compared with self-loop GSHMMs and with conventional HMMs and HSMMs. Training for the HSMMs and two sets of GSHMMs is shown starting from iteration number 6, as these models were initialised from the HMMs after 5 training iterations.

10.1.3 Recognition results and discussion

The connected-digit recognition results are shown in Table 10.1 for the two types of GSHMM compared with the standard HMMs and the HSMMs. Here, the different model sets have all been trained using the improved experimental framework described in Section 7.7.2, and the performance of both the HMMs and the self-loop GSHMMs is therefore better than the corresponding performance figures presented in Table 8.1. As with the initial experiments, the self-loop GSHMMs have shown some improvement over the standard HMMs. However, the recognition performance of the left-right GSHMMs was disastrous, with many more word substitution and deletion errors than the other model sets. These errors corresponded to a preference for representing frame sequences by a single long segment, rather than using multiple shorter segments.

This poor performance did not appear to be a result of insufficient training, as performing a further five iterations of re-estimation only reduced the word error rate for the left-right GSHMMs from 30.7% to 29.8%. Furthermore, the problems were not due to the imposed segment duration structure, as the HSMMs with a maximum segment duration of 10 frames gave a lower error rate than the conventional HMMs even when further training had been applied to the conventional HMMs. Thus, for the experimental task investigated here, there were considerable advantages in constraining the maximum segment duration, which acts to prevent unrealistically long state occupancies for the speech model states. The problems with the left-right GSHMMs must therefore have been in the acoustic model descriptions. It was concluded that, at least for a cepstral representation, the current formulation of the static GSHMM is only successful when restricted to modelling short segments. Investigations into possible reasons are discussed in the next section.

Model set                      % Correct   % Subs.   % Del.   % Ins.   % Errors
HMM (5 training iterations)         92.3       6.2      1.5      0.9        8.6
Self-loop GSHMM                     93.0       5.4      1.6      0.9        7.9
HMM (10 training iterations)        92.4       6.0      1.6      0.8        8.4
HSMM                                94.1       5.2      0.7      0.7        6.6
Left-right GSHMM                    69.4      20.4     10.2      0.1       30.7

Table 10.1: Connected-digit recognition results for two sets of static GSHMMs and for HSMMs compared with the original standard HMMs.
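The columns of Table 10.1 are related by simple arithmetic, which the following sketch checks. The definitions used (errors as the sum of substitutions, deletions and insertions, and percentage correct as 100 minus substitutions and deletions) are the usual scoring conventions and are assumed here rather than quoted from the text; they are consistent with every row of the table.

```python
# (% correct, % subs, % del, % ins, % errors) as listed in Table 10.1
RESULTS = {
    "HMM (5 training iterations)": (92.3, 6.2, 1.5, 0.9, 8.6),
    "Self-loop GSHMM": (93.0, 5.4, 1.6, 0.9, 7.9),
    "HMM (10 training iterations)": (92.4, 6.0, 1.6, 0.8, 8.4),
    "HSMM": (94.1, 5.2, 0.7, 0.7, 6.6),
    "Left-right GSHMM": (69.4, 20.4, 10.2, 0.1, 30.7),
}


def derived_scores(subs, dels, ins):
    """Return (% correct, % errors) from the three error percentages."""
    percent_correct = 100.0 - subs - dels
    percent_errors = subs + dels + ins
    return percent_correct, percent_errors


def row_is_consistent(row, tolerance=0.05):
    """Check a table row against the assumed definitions, allowing for rounding."""
    correct, subs, dels, ins, errors = row
    derived_correct, derived_errors = derived_scores(subs, dels, ins)
    return (abs(derived_correct - correct) < tolerance
            and abs(derived_errors - errors) < tolerance)


consistency = {name: row_is_consistent(row) for name, row in RESULTS.items()}
```

The breakdown also makes the failure mode of the left-right GSHMMs visible: almost all of their errors are substitutions and deletions, with very few insertions, in line with the preference for single long segments described above.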