Model-based segmentation and clustering - Automatic determination of sub-word units for automat

The acoustic segmentation carried out in the section above does not attempt to take into account the type of statistical model that will be used in the speech recognition system.

The acoustic models representing each sub-word unit are of fundamental importance in a system, since poor modelling cannot lead to accurate recognition. It would therefore be better to give the segmentation algorithm the modelling structure as an input, alongside the acoustic data, so that the units found

1. are modelled well by the given model type, and

2. occur frequently in the given data.

So, simply put, the goal of the segmentation step is to find a segmentation of the acoustic data that best fits the type of statistical model used and the given data.

Having chosen a statistical model type, we want to train a set of these models, allowing each model to choose which pieces of data to model. The locations where transitions between models occur will be considered to be segment boundaries. Each of the trained models will be considered to represent a sub-word unit, automatically derived. This process enables segmentation and clustering to be achieved simultaneously. Figure 3.6 illustrates this basic model topology, for a generic statistical model. In order to use this method for generating sub-word units, we need (a) a way of training the connected set without a transcription, and (b) a way of determining the location of transitions between models.
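To make the idea of "transitions between models are segment boundaries" concrete, the following is a minimal sketch in Python. It assumes that each frame of an utterance has already been assigned to one model; the function name `labels_to_segments` and the example label sequence are hypothetical and chosen only for illustration.

```python
from itertools import groupby

def labels_to_segments(frame_labels):
    """Collapse a per-frame model-index sequence into (unit, start, end) segments.

    `start` is inclusive and `end` is exclusive, both in frames.
    """
    segments = []
    start = 0
    for unit, run in groupby(frame_labels):
        length = sum(1 for _ in run)
        segments.append((unit, start, start + length))
        start += length
    return segments

# Hypothetical frame-level assignment of 10 frames to 3 models (0, 1, 2).
frame_labels = [2, 2, 2, 0, 0, 1, 1, 1, 1, 2]

segments = labels_to_segments(frame_labels)
# A boundary is the first frame of every segment except the first one.
boundaries = [start for _, start, _ in segments[1:]]
# The unit transcription is the sequence of models visited.
transcription = [unit for unit, _, _ in segments]

print(segments)       # [(2, 0, 3), (0, 3, 5), (1, 5, 9), (2, 9, 10)]
print(boundaries)     # [3, 5, 9]
print(transcription)  # [2, 0, 1, 2]
```

The same model (here model 2) can appear in several segments, which is how segmentation and clustering fall out of a single labelling.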

There are training and decoding algorithms for the hidden Markov model (HMM) which meet these requirements ((a) and (b)), providing a way of training the connected model set, and a way of locating the transitions between models. If (as in Bacchiani (1999) and Fukada et al. (1996)) each sub-word unit is to be modelled by a single HMM state, the search for segment boundaries and training of sub-word unit models can be achieved by embedded training of an ergodic HMM¹. This is made possible using existing training algorithms, and is the main technique used for segmentation and clustering in this thesis. In training an ergodic HMM, the number of states and initial parameter set must be specified. As we will see in Section 3.4 below, a criterion must be devised which enables the number of states used to be chosen in a motivated way, and in the experiments here, the Bayesian Information Criterion is used. First, though, we will look at the process for generating the sub-word unit set using an ergodic HMM in more detail.

¹ In an ergodic HMM, all states can follow all other states, i.e. the transition matrix is full.

Figure 3.6 (panels: "model set"; "models connected"): Illustration of allowing the models to determine segments and clusters: take a model set (a set of a particular type of statistical model), and connect the models in the set, such that each model can follow any other model. Train the large, connected model on acoustic data, and interpret each sub-model as the model of an individual sub-word unit. See text.

3.3.1 Process

The process used to find the set of sub-word units is:

1. Take n HMM states, and connect them as an ergodic HMM.

2. Initialise the mean and variance of each state to be close to the data mean and variance, but distinct from all other states. The method used to achieve this is described fully in Section 3.5.3 below.

3. Train the ergodic HMM on all training data, providing the training algorithm with no transcription information, at either word or sub-word level.

4. The most likely state sequence of the ergodic HMM for each utterance is the sub-word unit transcription. The training of a single ergodic n-state HMM leads to a set of n automatically derived sub-word unit models and a transcription of the training data in terms of these. There is no need for a ‘clustering’ step. Standard decoding techniques are used to find the most likely state sequence; details are below.

5. The sub-word unit models are extracted from the ergodic HMM, each state representing one sub-word unit. Now we have a set of models and a transcription of the training data in terms of the model names. With a pronunciation dictionary in terms of these models, we would be able to carry out a standard recognition task, i.e. generate a word transcription for acoustic data. The generation of a dictionary is covered in Chapter 4.
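Steps 1 and 2 of the process can be sketched as follows. This is an illustration only: the actual initialisation method is the one described in Section 3.5.3, which is not reproduced here, so the random perturbation of the global mean (and the `spread` and `seed` parameters) is an assumption made for the example. The function name `init_ergodic_hmm` is likewise hypothetical. Each state is a single diagonal-covariance Gaussian, and the transition matrix is full and uniform.

```python
import numpy as np

def init_ergodic_hmm(features, n_states, spread=0.1, seed=0):
    """Build initial parameters for an n-state ergodic HMM.

    features: array of shape (num_frames, dim) holding all training frames.
    Returns (initial, transitions, means, variances).
    """
    rng = np.random.default_rng(seed)
    dim = features.shape[1]
    global_mean = features.mean(axis=0)
    global_var = features.var(axis=0)

    # Assumption: make each state mean distinct by adding a small random
    # offset, scaled by the global standard deviation.
    offsets = rng.standard_normal((n_states, dim)) * spread * np.sqrt(global_var)
    means = global_mean + offsets

    # Every state starts with the global variance.
    variances = np.tile(global_var, (n_states, 1))

    # Ergodic topology: every state can follow every other state,
    # so the transition matrix has no zero entries.
    transitions = np.full((n_states, n_states), 1.0 / n_states)
    initial = np.full(n_states, 1.0 / n_states)
    return initial, transitions, means, variances

# Stand-in data: 200 frames of 3-dimensional features.
features = np.random.default_rng(1).standard_normal((200, 3))
initial, transitions, means, variances = init_ergodic_hmm(features, n_states=4)
```

Starting every state near the global statistics but at a slightly different point gives the training algorithm a symmetric starting position that it can still break apart.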

Embedded training of ergodic HMM

The outcome of training an HMM is a set of statistics:

• the transition probabilities

• the mean and variance of each Gaussian

• the observation probabilities: the probability that observations were generated by the HMM

HMMs are typically trained using the forward-backward (or Baum-Welch) algorithm, which (iteratively) maximises the likelihood of observations given models. Details of the forward-backward algorithm can be found in Rabiner & Juang (1993, chapter 6) or Jurafsky & Martin (2000, Appendix D) or the HTK² manual.

This algorithm is used without alteration (as implemented by HTK) in the experiments of this thesis. Instead of a set of HMMs being trained, as in a standard speech recognizer, here just one (large) ergodic HMM is trained.

² http://htk.eng.cam.ac.uk/
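For readers who want to see what one forward-backward re-estimation pass looks like for an ergodic HMM, here is a minimal NumPy sketch. It is not the HTK implementation used in the thesis: it handles a single observation sequence, assumes one diagonal-covariance Gaussian per state, and uses per-frame scaling to avoid underflow. The names `log_gauss`, `baum_welch_step` and `var_floor`, and the synthetic two-cluster data, are assumptions made for the example.

```python
import numpy as np

def log_gauss(obs, means, variances):
    """Log density of each frame under each state's diagonal Gaussian: (T, N)."""
    diff = obs[:, None, :] - means[None, :, :]
    const = np.log(2 * np.pi * variances).sum(axis=1)
    quad = (diff ** 2 / variances[None, :, :]).sum(axis=2)
    return -0.5 * (const[None, :] + quad)

def baum_welch_step(obs, initial, trans, means, variances, var_floor=1e-3):
    """One EM iteration; returns updated parameters and the log-likelihood
    of `obs` under the parameters passed in."""
    T, N = obs.shape[0], len(initial)
    log_b = log_gauss(obs, means, variances)
    shift = log_b.max(axis=1, keepdims=True)   # per-frame shift for stability
    b = np.exp(log_b - shift)

    # Scaled forward pass.
    alpha = np.zeros((T, N))
    scale = np.zeros(T)
    alpha[0] = initial * b[0]
    scale[0] = alpha[0].sum()
    alpha[0] /= scale[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ trans) * b[t]
        scale[t] = alpha[t].sum()
        alpha[t] /= scale[t]

    # Scaled backward pass.
    beta = np.ones((T, N))
    for t in range(T - 2, -1, -1):
        beta[t] = trans @ (b[t + 1] * beta[t + 1]) / scale[t + 1]

    gamma = alpha * beta                       # state occupancies, rows sum to 1
    xi = (alpha[:-1, :, None] * trans[None, :, :]
          * (b[1:] * beta[1:])[:, None, :] / scale[1:, None, None])
    log_lik = np.log(scale).sum() + shift.sum()

    # Re-estimation.
    occ = gamma.sum(axis=0)[:, None]
    new_means = (gamma.T @ obs) / occ
    new_vars = np.maximum((gamma.T @ obs ** 2) / occ - new_means ** 2, var_floor)
    counts = xi.sum(axis=0)
    new_trans = counts / counts.sum(axis=1, keepdims=True)
    return gamma[0], new_trans, new_means, new_vars, log_lik

# Synthetic 1-D "utterance": four 20-frame segments alternating between two sources.
rng = np.random.default_rng(0)
obs = np.concatenate([rng.normal(m, 0.5, size=(20, 1)) for m in (-2, 2, -2, 2)])

# Two-state ergodic HMM started close to the global mean and variance.
mean, var = obs.mean(axis=0), obs.var(axis=0)
means = np.stack([mean - 0.1 * np.sqrt(var), mean + 0.1 * np.sqrt(var)])
variances = np.tile(var, (2, 1))
trans = np.full((2, 2), 0.5)
initial = np.full(2, 0.5)

log_liks = []
for _ in range(15):
    initial, trans, means, variances, ll = baum_welch_step(
        obs, initial, trans, means, variances)
    log_liks.append(ll)
```

No transcription is supplied anywhere in this loop; the states are free to claim whichever frames they model best, and the log-likelihood should not decrease from one iteration to the next.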

Decoding to find the most likely state sequence

Clearly a necessary outcome of any sub-word unit generation process is a transcription of the data in terms of the new unit inventory - without this, it is not possible to train a dictionary. For these experiments, where the units are determined using an ergodic HMM, the transcription of each utterance directly corresponds to the state sequence through the HMM for that utterance. However, since the state sequence is hidden, it is only possible to discover the likelihood of any state sequence for an observation sequence. In many decoding tasks, it is only necessary to find the most likely state sequence. This is true here; we do this using the Viterbi algorithm.

Details of this algorithm can be found in Rabiner & Juang (1993, page 339), Jurafsky & Martin (2000, sections 5.9 and 7.3) and the HTK manual.
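As a rough illustration of the decoding step, the sketch below implements the Viterbi recursion in the log domain with NumPy. It is a generic textbook version, not the HTK decoder used in the experiments, and the two-state transition matrix and per-frame observation likelihoods are made-up values chosen so that the result is easy to follow.

```python
import numpy as np

def viterbi(log_b, log_initial, log_trans):
    """Most likely state sequence for one utterance.

    log_b:       (T, N) log observation likelihood of each frame in each state
    log_initial: (N,)   log initial state probabilities
    log_trans:   (N, N) log transition probabilities, indexed [from, to]
    Returns (path, log probability of the best path).
    """
    T, N = log_b.shape
    delta = np.empty((T, N))            # best log score ending in each state
    psi = np.zeros((T, N), dtype=int)   # back-pointers to the best predecessor

    delta[0] = log_initial + log_b[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans   # rows: from, columns: to
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_b[t]

    # Trace back from the best final state.
    path = np.empty(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        path[t] = psi[t + 1, path[t + 1]]
    return path, delta[-1].max()

# Illustrative values: the first three frames fit state 0, the last three fit state 1.
b = np.array([[0.9, 0.1]] * 3 + [[0.1, 0.9]] * 3)
trans = np.array([[0.9, 0.1],
                  [0.1, 0.9]])
initial = np.array([0.5, 0.5])

path, best_log_prob = viterbi(np.log(b), np.log(initial), np.log(trans))
print(path.tolist())  # [0, 0, 0, 1, 1, 1]
```

The returned path is the frame-level sub-word unit transcription of the utterance, and the single change of state between frames 2 and 3 is the segment boundary.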
