Unit inventory and model structure - Framework for digit-modelling experiments

Framework for digit-modelling experiments

7.5 Unit inventory and model structure

7.5.1 Phonetic transcriptions and units for modelling

The basic modelling units were phones, although the task was word recognition and the training data were transcribed at the word level. The sequence of models representing any one word was defined by a dictionary, which specified a single phonetic transcription for each word in the vocabulary. This approach is common in speech recognition systems, but it will not always describe actual phonetic events accurately, as it does not accommodate alternative word pronunciations or connected-speech effects. For a digit vocabulary, the pronunciations of the individual words are generally predictable, although speakers vary in whether they pronounce “seven” with or without a schwa between the /v/ and the /n/. In all the digit experiments described in this thesis, the schwa was included in the transcription of “seven”, although initial experiments showed that results were similar for the two possible transcriptions.
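The dictionary lookup described above can be sketched as follows. This is a minimal illustration only: the phone symbols and the three dictionary entries are assumptions chosen for the example, not the inventory used in the thesis, and `words_to_phones` is a hypothetical helper name.

```python
# Hypothetical pronunciation dictionary: exactly one phone sequence per word.
# The phone symbols are illustrative and are NOT taken from the thesis.
PRONUNCIATIONS = {
    "six": ["s", "ih", "k", "s"],
    "seven": ["s", "eh", "v", "ax", "n"],  # schwa ("ax") kept between /v/ and /n/
    "nine": ["n", "ay", "n"],
}


def words_to_phones(words, dictionary=PRONUNCIATIONS):
    """Expand a word-level transcription into a flat sequence of phone labels."""
    phones = []
    for word in words:
        # A single transcription per word: no pronunciation alternatives
        # and no connected-speech effects are represented.
        phones.extend(dictionary[word])
    return phones
```

Because each word maps to exactly one sequence, the alternative schwa-less pronunciation of “seven” could only be tested by swapping the dictionary entry, not by listing both.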


7.5.2 Model sets

Experiments were carried out with two different model sets. In the first set, a single context-independent “monophone” model was used to represent each distinct phoneme symbol in the transcription dictionary. The second model set included some representation of phonetic context effects on acoustic realisation, by using “triphone” models dependent on the immediately adjacent left and right phone context. For the vocabulary-dependent training condition, a model was trained for every triphone in the digit vocabulary. When using vocabulary-independent training, it was important to ensure that each triphone model was trained on a sample of data that was sufficiently representative for the model set to generalise to a new vocabulary. Therefore, a model was only trained for a triphone context if that triphone occurred in at least two different word contexts (as suggested in the “context adaptive phone” approach proposed by Moore (1993)). When training triphones in this way, examples which did not contribute to a triphone model were used to update the relevant monophone model, so that the resulting monophone model should be a reasonable representation of the different contexts for which no triphone model was trained.
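The selection rule for vocabulary-independent training can be sketched as below. Several details are assumptions made for illustration: “two different word contexts” is interpreted here as two distinct dictionary words, triphones are taken word-internally with a hypothetical boundary symbol `#`, and the function names are invented for the example.

```python
from collections import defaultdict


def select_triphones(dictionary, min_words=2, boundary="#"):
    """Return the triphones that occur in at least `min_words` distinct words.

    `dictionary` maps each word to its phone sequence. Word edges are padded
    with `boundary`, which is an assumption of this sketch.
    """
    words_per_triphone = defaultdict(set)
    for word, phones in dictionary.items():
        padded = [boundary] + list(phones) + [boundary]
        for i in range(1, len(padded) - 1):
            triphone = (padded[i - 1], padded[i], padded[i + 1])
            words_per_triphone[triphone].add(word)
    return {t for t, words in words_per_triphone.items() if len(words) >= min_words}


def model_for(left, phone, right, trained_triphones):
    """Pick the triphone model if one was trained, else back off to the monophone."""
    triphone = (left, phone, right)
    return triphone if triphone in trained_triphones else phone
```

Under this sketch, a training example whose triphone is not selected is routed to the monophone model, which mirrors the back-off described in the text.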

For both monophone and triphone model sets, the speech models were supplemented by four context-independent non-speech models, to represent silence, breath noise and other non-speech noises of varying length. Both sets of training data had previously been annotated at the sentence or word-group level, with each non-speech region labelled as one of the four non-speech models.

7.5.3 Model structure

The experiments reported here all used the simple left-right model structure that is typical of HMM systems, with three states per phone and single-state non-speech models. For standard HMMs, transitions were allowed from each state back to itself and on to the immediately following state. From the viewpoint of acoustic-phonetic modelling, the simple three-state-per-phone model structure is unlikely to be ideal, especially for segment models, but it was considered to provide a good baseline as it represents an approach which has been used successfully for conventional HMMs.

When comparing segmental-HMM recognition performance with the performance of conventional HMMs, it was considered important to evaluate the different models of the acoustics separately from any duration-modelling differences. Both the conventional and the segmental HMMs were therefore structured so that they did not distinguish between alternative models on the basis of transition probabilities, nor on the basis of segment duration probabilities in the case of the segmental HMMs. All transitions from a state were assigned the same transition probability, and for the segmental HMMs all segment durations within the allowed duration range were assigned equal probability. In the case of standard HMMs, this restriction led to only a small drop in performance relative to that achieved by training the model-dependent transition probabilities. For the experimental conditions reported here, training the transition probabilities typically reduced the word error rate by around 8% of the corresponding error rate with the fixed transition probabilities (that is, a relative rather than an absolute reduction).
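The fixed, untrained probabilities described above can be illustrated with a short sketch. It assumes that the final state's forward transition leads to a model exit, and that the allowed duration range is given by hypothetical `min_dur` and `max_dur` parameters; neither detail is stated in this section.

```python
def left_right_transitions(num_states=3):
    """Transition matrix for a left-right model with equal probabilities.

    Rows are emitting states; columns are the emitting states plus one
    final exit column (the exit column is an assumption of this sketch).
    Each state has two allowed transitions, so each receives 0.5.
    """
    matrix = [[0.0] * (num_states + 1) for _ in range(num_states)]
    for state in range(num_states):
        matrix[state][state] = 0.5      # self-loop
        matrix[state][state + 1] = 0.5  # on to the next state, or exit
    return matrix


def uniform_durations(min_dur, max_dur):
    """Equal probability for every segment duration in the allowed range."""
    count = max_dur - min_dur + 1
    return {d: 1.0 / count for d in range(min_dur, max_dur + 1)}
```

With every model sharing the same values, neither the transition nor the duration terms can favour one model over another, so recognition differences reflect the acoustic models alone.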

7.5.4 Interpretation of model variance

It is useful for any one value of model variance to have a similar meaning for each feature, especially when initialising to arbitrary fixed values or when defining minimum values for parameter re-estimation. In all the experiments described in this thesis, the individual acoustic features were therefore scaled to have approximately the same total per-state variance across the complete training data set. The scaling was performed with the aim of equalising average variance per state rather than total variance over all data frames as, for the average amplitude feature in particular, the total variance could be much greater than the variance associated with any one phonetic event.

For each feature, a scale factor was determined from the feature’s standard deviation over all data frames, after taking into account an estimate of its mean value for the relevant phonetic event. Some appropriate labelling of the training database was therefore required, and was provided by an HMM state labelling, obtained from a Viterbi alignment procedure using monophone models trained without the scale factors. A different set of scale factors was derived for each set of training data, but the same values were used for all experiments based on any one training set. It should be emphasised that, although a set of models was used to identify phonetic events and hence to derive the scale factors, the factors themselves are simply multiplying constants which are applied in the same way to every data frame in both training and recognition. Because of the purpose for which they are used, the precise values of these scale factors are in any case not at all critical.
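One way to derive such scale factors is sketched below. It is an assumption-laden illustration, not the thesis's exact procedure: the per-state mean stands in for the “estimate of its mean value for the relevant phonetic event”, deviations are pooled over all frames, and `target_std` is a hypothetical common target value.

```python
import math
from collections import defaultdict


def feature_scale_factors(frames, state_labels, target_std=1.0):
    """Per-feature multipliers that equalise the pooled within-state spread.

    `frames` is a list of feature vectors; `state_labels` gives the aligned
    state label of each frame (for example from a Viterbi alignment).
    """
    num_features = len(frames[0])
    sums = defaultdict(lambda: [0.0] * num_features)
    counts = defaultdict(int)
    for frame, state in zip(frames, state_labels):
        counts[state] += 1
        for j, value in enumerate(frame):
            sums[state][j] += value
    means = {s: [total / counts[s] for total in sums[s]] for s in sums}

    # Squared deviation of each frame from the mean of its own state.
    squared = [0.0] * num_features
    for frame, state in zip(frames, state_labels):
        for j, value in enumerate(frame):
            squared[j] += (value - means[state][j]) ** 2

    factors = []
    for total in squared:
        std = math.sqrt(total / len(frames))
        # Guard against a constant feature; leaving it unscaled is an assumption.
        factors.append(target_std / std if std > 0.0 else 1.0)
    return factors
```

The resulting factors are plain constants: the same multiplier is applied to a given feature in every frame, in both training and recognition.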


In document Modelling Segmental Variability for Automatic Speech Recognition (Page 111-114)