

12.3.4 Recognition experiment

Training procedure

Using trajectory-independent probabilities, sets of linear GSHMMs were trained from different initial conditions in the same way as for the models previously trained with optimal-trajectory probabilities. One set of models was trained based on the flexible-slope initialisation strategy (set 1 of Section 11.1.1), and another set was trained using initial estimates with a zero-mean slope and constrained variance (similar to set 3 of Section 11.1.1). When using trajectory-independent probabilities, it was more meaningful to explicitly set the slope variance to zero for the constrained-slope models, rather than to rely on choosing some suitable small value as was necessary for the optimal-trajectory approach. In all cases, the probability of the training set showed the expected pattern of increasing with each iteration and converging after around the fifth (Figure 12.4).

[Figure 12.4 plot: log probability of the training set (vertical axis, approximately -4.65e+07 to -4.2e+07) against iteration number (horizontal axis); two curves, labelled "flexible slope" and "slope var=0.0".]

Figure 12.4: Log probability of the digit training set as a function of iteration number for different sets of linear GSHMMs trained using the trajectory-independent approach.

Both sets of linear GSHMMs trained to give non-zero values for the means of the slope parameters. As expected, when the slope variances were initialised to zero, they remained at zero after training. Similarly, the models initialised with the flexible slope representation retained this characteristic after training. It is interesting that the optimised probability of the training set was considerably higher when the model slopes were allowed to vary than when there was no slope variability. This difference was not seen with the optimal-trajectory models (compare sets 1 and 3 in Figures 11.1 and 11.6), for which the training-set probabilities were much more similar for models with flexible versus constrained slope variances, even when using two-component intra-segment mixture distributions (Figure 11.6). The trajectory-independent approach may be providing greater compensation than the mixture distribution for the underestimation of short-segment probabilities which occurs with the optimal-trajectory approach. The problem with penalising short segments was known to be greatest for models with extensive variation in the slope parameters (Figure 12.3). It is therefore to be expected that any differences between the approaches would be most apparent when there is variability in the slope as well as the mid-point parameters.
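The duration dependence discussed above can be illustrated with a small sketch. It is not taken from the thesis equations: it assumes a linear trajectory with frame times measured from the segment mid-point, independent Gaussian priors on the mid-point and slope, and independent Gaussian frame noise. The names `tau`, `gamma_mid` and `gamma_slope` are hypothetical. Under these assumptions, integrating over all trajectories differs from evaluating at the single most likely trajectory by a factor that depends only on the posterior variances of the trajectory parameters, and hence on the segment duration.

```python
import math

def centred_times(num_frames):
    """Frame times relative to the segment mid-point, e.g. 3 frames -> [-1.0, 0.0, 1.0]."""
    return [k - (num_frames - 1) / 2 for k in range(num_frames)]

def log_ratio(num_frames, tau, gamma_mid, gamma_slope):
    """log(integral over all trajectories) - log(value at the most likely trajectory).

    Assumptions (illustrative only): independent Gaussian priors on mid-point
    and slope with variances gamma_mid and gamma_slope (both > 0), and
    independent Gaussian frame noise with variance tau around the trajectory.
    With mid-point-centred times the two parameters stay uncorrelated in the
    posterior, so the Gaussian integral factorises into two one-dimensional terms.
    """
    sum_t2 = sum(t * t for t in centred_times(num_frames))
    post_var_mid = 1.0 / (1.0 / gamma_mid + num_frames / tau)
    post_var_slope = 1.0 / (1.0 / gamma_slope + sum_t2 / tau)
    return (0.5 * math.log(2 * math.pi * post_var_mid)
            + 0.5 * math.log(2 * math.pi * post_var_slope))

# The ratio shrinks as the segment gets longer, so a probability evaluated only
# at the most likely trajectory is lowest, relative to the integrated one, for
# the shortest segments.
for T in range(1, 11):
    print(T, round(log_ratio(T, tau=1.0, gamma_mid=1.0, gamma_slope=1.0), 3))
```

In this sketch the slope term changes much faster with duration than the mid-point term, because the sum of squared frame times grows roughly with the cube of the duration. That is consistent with the differences between the two approaches being most apparent when the slope is allowed to vary.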

Recognition results

After five training iterations, connected-digit recognition with the models trained using the trajectory-independent approach showed an influence of initialisation strategy similar to that found for the optimal-trajectory approach (see Table 12.2). The best performance was obtained when there was no variability in the slope representation, with these models giving an error rate of only 2.9%. The results achieved with the trajectory-independent approach compare favourably with the optimal-trajectory results, particularly for the flexible-slope models, which include variability in the slope parameter. This performance difference may reflect an advantage of explicitly taking greater account of segment-duration effects, by modelling the effect of duration on the range of possible underlying trajectories as well as on the identity of the most likely trajectory. However, although the flexible-slope models performed better with the trajectory-independent approach (and gave higher probabilities for the training set) than with the optimal-trajectory approach, performance was still significantly worse than for the constrained-slope models. It therefore appears that there are still problems with modelling variability in the slope parameter, and possible causes and solutions are considered further in Chapter 14.

Model set                                        % Correct  % Subs.  % Del.  % Ins.  % Errors
Single-Gaussian opt. traj. (flexible slope)           69.1     16.5    14.4     0.1      31.0
2-mix. opt. traj. (flexible slope)                    86.2      9.1     4.7     0.1      13.9
Traj. independent (flexible slope)                    91.0      4.9     4.0     0.0       9.0
Single-Gaussian opt. traj. (constrained slope)        92.9      3.3     3.8     0.1       7.2
2-mix. opt. traj. (constrained slope)                 96.9      2.0     1.1     0.2       3.2
Traj. independent (constrained slope)                 97.1      2.0     0.8     0.1       2.9

Table 12.2: Connected-digit recognition results for linear GSHMMs using trajectory-independent probabilities compared with those using optimal-trajectory probabilities, for different initialisation strategies.
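As a reading aid for Table 12.2, the following sketch checks how the columns relate to one another. It assumes the usual scoring convention, which the text does not state explicitly: % Errors is the sum of substitutions, deletions and insertions, and % Correct is 100 minus substitutions and deletions. The helper names are hypothetical, and a tolerance of 0.15 allows for rounding in the published figures.

```python
# Rows of Table 12.2: (% correct, % subs., % del., % ins., % errors).
TABLE_12_2 = {
    "Single-Gaussian opt. traj. (flexible slope)": (69.1, 16.5, 14.4, 0.1, 31.0),
    "2-mix. opt. traj. (flexible slope)": (86.2, 9.1, 4.7, 0.1, 13.9),
    "Traj. independent (flexible slope)": (91.0, 4.9, 4.0, 0.0, 9.0),
    "Single-Gaussian opt. traj. (constrained slope)": (92.9, 3.3, 3.8, 0.1, 7.2),
    "2-mix. opt. traj. (constrained slope)": (96.9, 2.0, 1.1, 0.2, 3.2),
    "Traj. independent (constrained slope)": (97.1, 2.0, 0.8, 0.1, 2.9),
}

def derived_columns(subs, dels, ins):
    """Assumed convention: correct = 100 - subs - dels; errors = subs + dels + ins."""
    return 100.0 - subs - dels, subs + dels + ins

def relative_error_reduction(baseline_errors, new_errors):
    """Percentage by which new_errors is lower than baseline_errors."""
    return 100.0 * (baseline_errors - new_errors) / baseline_errors

def row_is_consistent(row, tolerance=0.15):
    """True if the published row agrees with the assumed convention up to rounding."""
    correct, subs, dels, ins, errors = row
    expected_correct, expected_errors = derived_columns(subs, dels, ins)
    return (abs(expected_correct - correct) <= tolerance
            and abs(expected_errors - errors) <= tolerance)

for name, row in TABLE_12_2.items():
    print(f"{name}: consistent={row_is_consistent(row)}")

# Error reduction of the trajectory-independent models relative to the
# single-Gaussian optimal-trajectory models with the same initialisation.
flexible = relative_error_reduction(31.0, 9.0)
constrained = relative_error_reduction(7.2, 2.9)
print(f"flexible slope: {flexible:.1f}%  constrained slope: {constrained:.1f}%")
```

Comparing relative reductions rather than absolute error rates makes it easier to see that the trajectory-independent calculation helps both initialisation strategies, while the constrained-slope models remain the better of the two.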


12.4 Discussion

The experiments described in the previous sections have demonstrated good recognition performance with single-Gaussian segmental HMMs by using a “trajectory-independent” probability calculation which explicitly takes account of the duration-dependent bias in estimating the true underlying trajectory given a short sequence of data samples. For all the model sets tested, the performance using this approach was similar to the performance of the corresponding model set using optimal-trajectory probabilities with a two-component intra-segment distribution. There is thus good evidence to suggest that the problems experienced with the single-Gaussian optimal-trajectory approach were indeed caused by a tendency to underestimate the variance around the optimal trajectory for short segments. Two alternative solutions to this problem have been identified. The first is to use a Richter distribution to describe the observed variation. The second is to use the trajectory-independent probability, which in effect considers all possible trajectories within the probability calculation.

Having studied and compared the two approaches to calculating segmental-HMM probabilities, it has become apparent that the trajectory-independent approach of considering all possible trajectories has several advantages over the original, conceptually simpler, optimal-trajectory technique. Firstly, given that two-component intra-segment distributions seem to be necessary with the optimal-trajectory approach, the trajectory-independent probability expression is simpler and therefore requires less computation, and the models have fewer parameters. In addition, issues of initialising the small-variance mixture component and setting its minimum variance appropriately during training are avoided. The trajectory-independent approach is also mathematically more elegant in a number of respects. It explicitly takes into account the fact that variance around the estimated trajectory depends on segment duration, and does not suffer the complications associated with the mixture distributions (especially when deriving re-estimation formulae). Furthermore, simpler segmental models and conventional HMMs arise naturally as special cases by simply setting appropriate parameters to zero. Overall, the expression for the optimal trajectory provides a useful way of describing model representations of data sequences, by identifying the most likely trajectory. However, the trajectory-independent approach of integrating over all the possible trajectories is more satisfactory for calculating observation probabilities.
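The point about special cases can be made concrete with a minimal sketch. It is an illustration under stated assumptions, not the probability expression used in the thesis: a single feature dimension, a linear trajectory with times measured from the segment mid-point, independent Gaussian priors on mid-point and slope, and independent Gaussian frame noise. All function and parameter names (`tau`, `mu_mid`, `gamma_mid`, `mu_slope`, `gamma_slope`) are hypothetical. Under these assumptions the integral over all trajectories has a closed form that stays well defined when a variance is exactly zero.

```python
import math

def centred_times(num_frames):
    """Frame times relative to the segment mid-point."""
    return [k - (num_frames - 1) / 2 for k in range(num_frames)]

def traj_independent_log_prob(y, tau, mu_mid, gamma_mid,
                              mu_slope=0.0, gamma_slope=0.0):
    """Log probability of frames y with the trajectory integrated out.

    Assumed model: y[k] = mid + slope * t[k] + noise, with
    mid ~ N(mu_mid, gamma_mid), slope ~ N(mu_slope, gamma_slope) and
    noise ~ N(0, tau), so y is jointly Gaussian. Because the centred times
    sum to zero, the determinant and quadratic form split into separate
    mid-point and slope terms.
    """
    T = len(y)
    times = centred_times(T)
    d = [obs - (mu_mid + mu_slope * t) for obs, t in zip(y, times)]
    sum_t2 = sum(t * t for t in times)
    sum_d = sum(d)
    sum_td = sum(t * x for t, x in zip(times, d))
    sum_d2 = sum(x * x for x in d)
    log_det = (T * math.log(tau)
               + math.log(1.0 + T * gamma_mid / tau)
               + math.log(1.0 + sum_t2 * gamma_slope / tau))
    quad = (sum_d2
            - sum_d ** 2 * gamma_mid / (tau + T * gamma_mid)
            - sum_td ** 2 * gamma_slope / (tau + sum_t2 * gamma_slope)) / tau
    return -0.5 * (T * math.log(2 * math.pi) + log_det + quad)

def frame_independent_log_prob(y, tau, mu):
    """Conventional single-Gaussian state output: frames treated as independent."""
    return sum(-0.5 * (math.log(2 * math.pi * tau) + (obs - mu) ** 2 / tau)
               for obs in y)

y = [0.9, 1.4, 2.1, 2.4]  # made-up frames for illustration

# All trajectory variances and the slope mean set to zero: conventional HMM.
print(traj_independent_log_prob(y, tau=0.2, mu_mid=1.5, gamma_mid=0.0))
print(frame_independent_log_prob(y, tau=0.2, mu=1.5))

# Slope mean and variance zero, mid-point variance non-zero: static segmental model.
print(traj_independent_log_prob(y, tau=0.2, mu_mid=1.5, gamma_mid=0.3))

# All parameters non-zero: linear segmental model.
print(traj_independent_log_prob(y, tau=0.2, mu_mid=1.5, gamma_mid=0.3,
                                mu_slope=0.5, gamma_slope=0.1))
```

No division by a trajectory variance appears in this form, so a variance can be set to exactly zero rather than to some small floor value. The first two printed values agree, which is the conventional-HMM special case.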

The static GSHMM investigated in this chapter used the same probability calculations as those described by Gales and Young (1993a, 1993b), although their re-estimation formulae were different. As with the earlier static-GSHMM experiments described in Chapter 10, the results presented here are in agreement with those of Gales and Young, who also found no performance advantage with static GSHMMs. When considered in conjunction with the good performance demonstrated here with linear GSHMMs, this provides further evidence to support the idea that the trajectory assumption needs to be appropriate for segmental HMMs to improve recognition performance relative to conventional HMMs.
