13.1. Introduction
The investigations described in the preceding three chapters provided the basis for a good understanding of the practical implications of modelling trajectories and of distinguishing between intra- and extra-segmental variability. Having developed a modelling approach with segmental HMMs which gave advantages over standard HMMs for the simple connected-digit recognition task, the next step was to progress to a more demanding modelling task. The aim of the experiments described in this chapter was to perform an initial evaluation of the performance of GSHMMs with trajectory-independent probabilities on a phonetic classification task, which involves determining the identity of speech segments with specified phonetic boundaries. This task provided a means to investigate and compare phonetic-modelling capabilities for different speech sounds. Studying classification rather than recognition has computational advantages, but also allows for the investigation of description and discrimination abilities separately from segmentation properties. Considerable variability needed to be accommodated as the tasks required “speaker-independent” modelling, but they were constrained to the extent of using data only from male speakers. The main emphasis was still on representing dynamics within the model applied to “single-frame” acoustic features representing the spectrum at particular instants in time. However, in order to also assess whether there were performance benefits from applying the segmental HMM to dynamic features as well as single-frame features, some experiments included delta features.
13.2. Phonetic classification task
A useful set of data for evaluating phonetic classification performance is the DARPA TIMIT acoustic-phonetic continuous-speech database of American English (Lamel, Kassel and Seneff, 1986; Garofolo et al., 1993). This database comprises 6300 utterances, all of which have been phonetically transcribed, segmented and labelled. TIMIT was designed to provide broad phonetic coverage, and is therefore particularly appropriate for testing approaches to improved acoustic-phonetic modelling.
13.2.1 Speech data
The complete TIMIT database comprises 10 spoken sentences from each of 630 native speakers of a range of dialects of American English. The sentences were divided into different types, so that each speaker read two “dialect-calibration” sentences, five “phonetically-compact” sentences to provide good coverage of phonetic contexts, and three “phonetically-diverse” sentences to supplement the word coverage. The dialect-calibration sentences are not intended to be used in either training or testing, but the remaining 5040 utterances have been divided into training and test sets such that there is no overlap of speakers or texts. The testing material has been subdivided into two sets: a small core test set and a much larger complete test set.
The experiments reported here have adopted the designated subdivision into training and test sets, using only data from the male speakers and minimising computation time for the testing task by performing recognition on only the core test material. The sizes of these data sets were 2608 spoken sentences for training (8 utterances from each of 326 speakers), and 128 for testing (8 utterances from each of 16 speakers).
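The utterance counts quoted above follow directly from the per-speaker sentence allocation. As a quick consistency check, the arithmetic can be written out as in the sketch below; the variable names are illustrative only, and all figures are those given in the text.

```python
# Consistency check on the TIMIT utterance counts quoted in the text.
# Variable names are illustrative; all figures come from the section above.

SPEAKERS_TOTAL = 630
SENTENCES_PER_SPEAKER = 10
DIALECT_SENTENCES_PER_SPEAKER = 2  # "dialect-calibration" sentences, excluded

total_utterances = SPEAKERS_TOTAL * SENTENCES_PER_SPEAKER
usable_per_speaker = SENTENCES_PER_SPEAKER - DIALECT_SENTENCES_PER_SPEAKER
usable_utterances = SPEAKERS_TOTAL * usable_per_speaker

MALE_TRAIN_SPEAKERS = 326      # male speakers in the designated training set
MALE_CORE_TEST_SPEAKERS = 16   # male speakers in the core test set

train_utterances = MALE_TRAIN_SPEAKERS * usable_per_speaker
test_utterances = MALE_CORE_TEST_SPEAKERS * usable_per_speaker

print(total_utterances, usable_utterances, train_utterances, test_utterances)
```

The four printed values correspond to the complete database, the utterances available once the dialect-calibration sentences are excluded, the male training set and the male core test set.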
13.2.2 Acoustic features
Analysed versions of the TIMIT data files were available (Madsen, 1994), for which a 20 ms Hamming window had been applied to the 16 kHz-sampled speech at a rate of 100 frames/s, and a fast Fourier transform had been computed. The output had been converted to a mel scale with 20 channels, and a cosine transform had been applied. The first 12 cosine coefficients together with an average amplitude feature formed the basic feature set for the work described
here. Some experiments have also included derivative features, computed for each frame by applying linear regression over a five-frame window centred on the current frame.
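The framing arithmetic and the derivative computation can be illustrated with a minimal sketch. It assumes that the derivative is the least-squares slope over the five-frame window, that is the sum over k = -2..2 of k times the feature at frame t+k, divided by the sum of k squared. The treatment of the first and last two frames (repeating the edge frame) is an assumption, as the text does not say how utterance boundaries were handled, and the function name `delta_features` is hypothetical.

```python
# Sketch of the front-end framing arithmetic and regression-based deltas.
# Assumptions (not stated in the text): edge frames are padded by repetition,
# and features are stored as a list of per-frame lists of floats.

SAMPLE_RATE_HZ = 16000
WINDOW_MS = 20
FRAME_RATE_HZ = 100

window_samples = SAMPLE_RATE_HZ * WINDOW_MS // 1000  # samples per analysis window
shift_samples = SAMPLE_RATE_HZ // FRAME_RATE_HZ      # samples between frame starts


def delta_features(frames, half_width=2):
    """Least-squares slope of each feature over 2*half_width+1 frames."""
    n = len(frames)
    if n == 0:
        return []
    # Sum of k^2 for k = -half_width..half_width (10 for a five-frame window).
    denom = 2 * sum(k * k for k in range(1, half_width + 1))
    deltas = []
    for t in range(n):
        row = []
        for d in range(len(frames[t])):
            num = 0.0
            for k in range(1, half_width + 1):
                ahead = frames[min(t + k, n - 1)][d]  # repeat last frame at the end
                behind = frames[max(t - k, 0)][d]     # repeat first frame at the start
                num += k * (ahead - behind)
            row.append(num / denom)
        deltas.append(row)
    return deltas


# Usage: a feature rising by 3 per frame has slope 3 away from the edges.
ramp = [[3.0 * t] for t in range(7)]
ramp_deltas = delta_features(ramp)
```

With these settings each window spans 320 samples and successive windows start 160 samples apart, so adjacent windows overlap by half their length.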
13.2.3 Unit inventory and model structure
Phonetic transcriptions
The time-aligned phonetic transcriptions provided with TIMIT use a total of 61 symbols, which are given in Table 13.1. These symbols include some allophones and non-speech symbols in addition to the basic phoneme set. The additional symbols are defined as follows:
• Closure intervals of stops are distinguished from stop releases. The closure symbols for the stops p, t, k, b, d, g are pcl, tcl, kcl, bcl, dcl, gcl, respectively. The closure portions of the affricates jh and ch are represented by dcl and tcl.
• The following allophones are defined:
• Flap (dx), which occurs in words such as “muddy” and “dirty”.
• Nasal flap (nx), which is found in words such as “winner”.
• Glottal stop (q), which can be an allophone of /t/, or mark an initial vowel or vowel-vowel boundary.
• Syllabic versions of /l/ (el) and the three nasals (en, em, and eng).
• Voiced allophone of /h/ (hv), typically found intervocalically.
• A fronted allophone of /u/ (ux), typically found in alveolar contexts.
• Four types of schwa are defined: axr is a destressed version of er, which tends to occur in words such as “butter”. Of the other instances of schwa, ix is generally used between two alveolars (e.g. “roses”), with ax being used for most other contexts (e.g. “ahead”). In addition, there is a devoiced schwa (ax-h), which is a very short devoiced vowel that tends to occur for reduced vowels surrounded by voiceless consonants.
• There are additional symbols for three types of silence:
• A pause (pau) during an utterance.
• Silence and/or non-speech events (h#) found at the beginning and end of the signal.
• Epenthetic silence (epi) between a fricative and a semivowel or nasal, as in “slow” or