2.3 Deriving units automatically
2.3.2 Discussion
This section compares the approaches reviewed above with one another and with the approach of this thesis. In this way, the approach of this thesis is introduced in the context of the research that precedes it.
Fenones (Bahl et al.), among the earliest automatically derived SWUs, are a simple, data-driven unit inventory and have been seen to improve word recognition. As described above, prototype fenones are output by a vector quantizer, and the training data is then segmented by each frame’s distance from these prototypes. Fenones are in the same class when they are closer to a particular prototype than to any other prototype, using some distance measure. Markov models of each fenone are trained subsequently for use in recognition. This is typical of SWU generation: segmentation by some means, followed by acoustic modelling of groups of the segments. This simple data-driven method can achieve gains in recognition. Theoretically, more improvement is possible if the sub-word unit derivation allows the statistical model to play a part in determining the units. This is the approach taken in this thesis: segment boundaries and the acoustic models for each unit are determined jointly. Instead of segmentation and clustering deciding which frames of speech are modelled by the same statistical model, the models themselves define the boundaries. Segments of speech are in the same class if they are modelled by the same acoustic model, rather than being modelled by the same acoustic model because they are in the same class as determined by a separate objective function.
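The nearest-prototype labelling described above can be illustrated with a minimal sketch. It assumes squared Euclidean distance as the distance measure and uses toy two-dimensional frames; the function names and the data are hypothetical and are not taken from Bahl et al.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def label_frames(frames, prototypes):
    """Give each frame the index of its nearest prototype."""
    labels = []
    for frame in frames:
        nearest = min(
            range(len(prototypes)),
            key=lambda k: squared_distance(frame, prototypes[k]),
        )
        labels.append(nearest)
    return labels


# Toy data (assumed for illustration): four frames, two prototypes.
frames = [[0.0, 0.1], [0.2, 0.0], [1.0, 0.9], [0.9, 1.1]]
prototypes = [[0.0, 0.0], [1.0, 1.0]]
labels = label_frames(frames, prototypes)
```

Frames that share a label fall in the same class, and a separate model would then be trained for each label.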
It is possible that the work using fenones did not continue due to the challenges of generating a pronunciation dictionary, as is implied in Holter & Svendsen (1997). As discussed above, fenone generation requires word boundaries, and pronunciations are sequences of fenones from a single example of each word. This does not allow for pronunciation variation, and clearly relies heavily on the example of each word being representative of many occurrences of that word. The work using fenones indicates that using sub-word units that are automatically derived can improve upon the modelling and recognition power of using phones. However, without methods to search for a good prototype pronunciation, the process is limited. The work in this thesis looks at these problems and presents a method of dictionary generation.
The method presented in this thesis combines segmentation and clustering into a single step by using an ergodic HMM, followed by dictionary generation using joint multi-grams. Each state of the ergodic HMM is designated to be a sub-word unit. The segmentation and clustering step jointly determines segment boundaries, clusters, and HMM parameters. The Bayesian information criterion is used as an objective function to determine the number of sub-word units in the system, and the complexity of the models (in terms of the number of Gaussian mixture components per HMM state). The dictionary generation process is automatic, aligning word transcriptions with sub-word unit transcriptions, but not requiring word boundary information. The result of this process is a multiple-pronunciation, probabilistic dictionary.
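Because each state of the ergodic HMM is a sub-word unit, a frame-level state path through the model implies both a segmentation and a labelling. The following sketch shows only that bookkeeping step. It assumes a decoded state path is already available (for example from Viterbi decoding); the function name and the toy path are hypothetical.

```python
from itertools import groupby


def path_to_segments(state_path):
    """Collapse a frame-level state path into (unit, start, end) segments.

    The end frame index is exclusive. A boundary falls wherever the
    state changes between consecutive frames.
    """
    segments = []
    start = 0
    for state, run in groupby(state_path):
        length = sum(1 for _ in run)
        segments.append((state, start, start + length))
        start += length
    return segments


# Toy state path (assumed for illustration): one state index per frame.
path = [2, 2, 2, 0, 0, 1, 1, 1, 1, 2]
segments = path_to_segments(path)
```

Here frames 0 to 2 and frame 9 are both labelled with unit 2, so they belong to the same class because the same state models them.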
The sub-word unit generation process in this thesis differs from a standard process. Typically, acoustic models of the sub-word unit set are trained following some kind of segmentation of the data. For example, the fenone SWU (Bahl et al.) is the output of a vector quantizer. Fenones are in the same class when they are closer to each other than to fenones in another class, using some distance measure. Markov models of each fenone are trained subsequently for use in recognition. Fenones are a simple, data driven, unit inventory, and have been seen to improve word recognition. If the sub-word unit derivation allows the statistical model to play a part in determining the units, theoretically more improvement is possible.
The task of jointly determining a unit inventory and dictionary is complex, as we have seen. In order to achieve a match between the unit design and the dictionary, some iteration between the two tends to be used. Bacchiani iterates between data segmentation given the unit inventory, and unit parameters given the segmentation. Singh et al. iterate between fixing the dictionary and using it to find acoustic models, and then using the acoustic models to update the dictionary. A further layer of iteration is also required by Singh et al., in the search for the number of sub-word units, N. N is gradually increased, and the units and dictionaries retrained accordingly, for as long as the increase in N increases the recognition rate on some held-out test data. In the BIC-multigrams approach presented in this thesis, re-estimation of segment boundaries and model parameters is done using a trained dictionary in order to ensure that the units and dictionary are matched. This is the only iterative aspect of this process.
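The outer search over N described for Singh et al. can be sketched as a simple loop. This is an illustration under stated assumptions, not their implementation: `train_and_score` is a hypothetical callback standing in for retraining the units and dictionary with N units and scoring held-out data, N is assumed to grow by one unit at a time, and the toy scores are invented.

```python
def grow_inventory(train_and_score, n_start, n_max):
    """Increase N while the held-out score keeps improving.

    train_and_score(n) is a hypothetical callback that retrains the
    system with n units and returns a held-out recognition score.
    """
    best_n = n_start
    best_score = train_and_score(n_start)
    for n in range(n_start + 1, n_max + 1):
        score = train_and_score(n)
        if score <= best_score:
            break  # no further gain on held-out data, so stop growing
        best_n, best_score = n, score
    return best_n, best_score


# Invented held-out scores, indexed by the number of units.
toy_scores = {2: 0.61, 3: 0.68, 4: 0.72, 5: 0.71, 6: 0.75}
best_n, best_score = grow_inventory(toy_scores.__getitem__, 2, 6)
```

With these toy scores the search stops at N = 4, even though N = 6 scores higher, which shows that a stopping rule of this kind is greedy.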
Jointly optimising a unit inventory and a dictionary for a particular data set is a highly unconstrained problem, with many parameters. Typically, to make the problem more tractable, a number of constraints are employed. In most of the systems reviewed above, with the exception of Singh et al. and Hersch, the search for sub-word units is constrained at the outset by pre-defining the number of sub-word units the system will have. The methods which require this figure to be predefined focus on different aspects of the overall unit inventory derivation, and need the simplification of this constraint in order to explore other system parameters, for example word pronunciations, or the clustering and modelling of segments. However, without allowing the number of units to vary, it is impossible to tell whether the final trained systems are optimal for any optimisation criterion. In Hersch, this constraint is not required; instead the size of the unit set is adjusted by merging clusters according to some criteria. However, the technique requires the number of units per word to be predefined, which is another way of simplifying the search. As noted above, in Singh et al. (2002) the size of the unit inventory is a variable in the system, and the criterion to be met in increasing the inventory size is an increase in recognition scores for some held-out data. In the BIC-multigrams approach of this thesis, the number of units is not pre-defined, but is determined on the basis of the BIC value that each model set size scores during training.
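Selecting the number of units by BIC can be illustrated with a minimal sketch. It assumes one common form of the criterion, the log-likelihood minus half the number of free parameters times the log of the number of frames, with an optional penalty weight; the exact form used in this thesis may differ. The function names and the candidate figures are hypothetical.

```python
import math


def bic(log_likelihood, n_params, n_frames, penalty_weight=1.0):
    """One common form of BIC (assumed here): higher is better."""
    return log_likelihood - 0.5 * penalty_weight * n_params * math.log(n_frames)


def select_by_bic(candidates, n_frames):
    """Pick the model set size with the highest BIC.

    candidates maps number of units -> (log_likelihood, n_params).
    """
    scores = {n: bic(ll, k, n_frames) for n, (ll, k) in candidates.items()}
    best_n = max(scores, key=scores.get)
    return best_n, scores


# Invented figures: larger model sets fit better but use more parameters.
candidates = {
    10: (-56000.0, 400),
    20: (-50500.0, 800),
    40: (-50000.0, 1600),
}
best_n, scores = select_by_bic(candidates, n_frames=10000)
```

In this toy case the 40-unit set has the highest likelihood, but its parameter penalty outweighs the gain, so the 20-unit set is selected.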
Another input required by many of the systems reviewed here to constrain the search for a unit inventory is word boundaries. The early systems (Bahl et al., Paliwal, Svendsen et al.) were developed on isolated word data, and hence any decision about searching for word boundaries was avoided entirely. Bacchiani, Fukada et al., Paliwal, and Holter & Svendsen all require word boundary locations as inputs to their methods, removing the need to search for possible word boundaries as part of their algorithms.
This constraint is exploited by Paliwal in his deterministic dictionary generation process: pronunciations for each word can be collected by splitting the sub-word unit transcriptions by the word boundaries. Bacchiani requires word boundaries in order to introduce two pronunciation constraints, which limit another potentially large search space, for the number of units per word and each word’s pronunciation. The methods presented in this thesis do not require this knowledge of word boundary locations.
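Splitting a sub-word unit transcription by known word boundaries can be sketched as follows. This is an illustration only, not Paliwal's implementation: it assumes time-aligned unit and word segments with exclusive end frames, and it assigns a unit to the word whose interval contains the unit's start frame. The names and the toy alignment are hypothetical.

```python
from collections import defaultdict


def collect_pronunciations(unit_segments, word_segments):
    """Collect one unit sequence per word token.

    unit_segments: list of (unit, start, end) tuples.
    word_segments: list of (word, start, end) tuples.
    A unit is assigned to the word whose interval contains its start frame.
    """
    pronunciations = defaultdict(list)
    for word, w_start, w_end in word_segments:
        units = tuple(u for u, s, _ in unit_segments if w_start <= s < w_end)
        pronunciations[word].append(units)
    return dict(pronunciations)


# Toy alignment (assumed for illustration).
unit_segments = [("u3", 0, 4), ("u7", 4, 9), ("u1", 9, 15),
                 ("u3", 15, 18), ("u7", 18, 24)]
word_segments = [("two", 0, 9), ("nine", 9, 15), ("two", 15, 24)]
lexicon = collect_pronunciations(unit_segments, word_segments)
```

Each word token contributes exactly one unit sequence, so the procedure is deterministic once the boundaries are given, and no search over pronunciations is involved.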
The next chapter fully presents the method of joint unit and model design. Chapter 4 presents the dictionary generation method.
Author         Task                                   Acoustic Modelling  Requirements / Constraints
Bacchiani      read speech (RM) & spontaneous speech  HMM                 word bounds, N
Bahl           isolated word                          Markov model        word bounds, N
Fukada         isolated word                          trajectory model    word bounds, N, phone transcription, phone lexicon
Hersch         spontaneous speech (OGI Numbers)       HMM                 number of units per word
Holter         isolated word                          HMM                 word bounds, N
Paliwal        isolated word                          HMM                 word bounds, N
Singh          read speech (RM)                       HMM                 none necessary
Svendsen       isolated word                          HMM                 word bounds, N
Couper Kenney  spontaneous speech (OGI Numbers)       HMM                 none necessary
Table 2.1: A comparison of the systems reviewed in this chapter. All systems require acoustic training data, word transcriptions, and a prescribed acoustic model topology. Requirements beyond these are listed in the final column, where N indicates that the number of units in the system is pre-determined.
Chapter 3
Segmentation and Clustering
As outlined in Section 2.3, the problem of designing a unit inventory can be broken down into the steps of (1) segmentation, (2) clustering, and (3) lexicon generation. This chapter is concerned with steps (1) and (2). A literature review of methods available for segmentation and clustering is given in Section 3.1, followed by details of two experiments investigating different methods of segmentation in Sections 3.2 and 3.3.