In this section we present the experimental setup and the results of the phoneme recognition study on the TIMIT corpus.
7.2. Phoneme Sequence Recognition Study
7.2.1 Experimental Setup
TIMIT Corpus
The training, validation and test sets are the same as in the previous chapters, as detailed in Section 3.3.1. The phoneme set is composed of 61 phonemes. For evaluation, the 61 phonemes are mapped to the 39-phoneme set [Lee and Hon, 1989]. A phoneme segmentation is provided with this corpus. We refer to this segmentation as “manual segmentation”.
CNN-based System Setup
The input to the system in this part of the study is the raw speech waveform, as described in Chapter 3. The architecture is composed of four filter stages. The hyper-parameters are tuned based on the phoneme error rate on the validation set, and are presented in Table 7.1.
Table 7.1 – Network hyper-parameters.
System  # hidden layers  w_in    nhu             kW        dW       dn               kWmp
CNN     1                310 ms  1000            30,7,7,7  5,1,1,1  200,100,100,100  4,2,2,2
CNN     2                310 ms  1000,1000       30,7,7,7  5,1,1,1  200,100,100,100  4,2,2,2
CNN     3                310 ms  1000,1000,1000  30,7,7,7  5,1,1,1  200,100,100,100  4,2,2,2
In the CNN-based architecture, the number of output labels, i.e. the length of the inferred phoneme sequence, is determined directly by the hyper-parameters. The duration of one output label $T_{lab}$ (in seconds) is given by the duration of one sample of the input waveform (the inverse of the sampling frequency $f_s$) multiplied by the total pooling $N_{pool}$, i.e.

$$T_{lab} = \frac{1}{f_s} \cdot N_{pool} \qquad (7.12)$$

Using four filter stages, the total pooling is given by:

$$N_{pool} = \prod_{i=1}^{4} dW_i \cdot dW_{mp,i} \qquad (7.13)$$

To be consistent with the baselines, the output label duration was set to $T_{lab} = 10$ ms, thus $N_{pool} = 160$. The hyper-parameter grid search was restricted to configurations satisfying this constraint.
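As a quick check of Equations (7.12) and (7.13), the following minimal sketch computes the total pooling and the label duration from the values in Table 7.1. Two things are assumptions rather than statements from the text: the max-pooling stride of each stage is taken to be equal to the pooling kernel width kWmp, and the sampling frequency is taken to be 16 kHz. The function names are hypothetical.

```python
from math import prod


def total_pooling(dW, dW_mp):
    """Eq. (7.13): product over filter stages of conv stride times pooling stride."""
    return prod(d * p for d, p in zip(dW, dW_mp))


def label_duration(fs, n_pool):
    """Eq. (7.12): duration in seconds of one output label."""
    return n_pool / fs


# Convolution strides dW from Table 7.1.
dW = [5, 1, 1, 1]
# Assumption: pooling stride equals the pooling kernel width kWmp from Table 7.1.
dW_mp = [4, 2, 2, 2]
# Assumption: 16 kHz sampling frequency.
fs = 16000

n_pool = total_pooling(dW, dW_mp)
t_lab = label_duration(fs, n_pool)
print(n_pool, t_lab)
```

Under these assumptions the product is 5 x 4 x 2 x 2 x 2 = 160 samples per label, which corresponds to 10 ms at 16 kHz and matches the constraint stated above.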
[Figure: block diagram. Raw speech utterance S → Feature Extraction → X → MLP → CRF → Phoneme sequence L∗; the diagram is annotated "Joint Training".]
Figure 7.6 – Illustration of the ANN-based system using MFCC features as input.
Baselines
We compare the CNN-based system using raw speech as input to ANN-based systems using MFCC features as input. The score for a path in Equation (7.1) becomes:

$$c(X, L, \Theta) = \sum_{t=1}^{T} \left( f^{l_t}_{t}(X, \theta_f) + A_{l_t, l_{t-1}} \right) \qquad (7.14)$$

where $X = \{x_1 \dots x_T\}$ is a sequence of features, as illustrated in Figure 7.6. The system is trained using the three training strategies presented above. We use the same MFCC features as in the previous TIMIT study in Chapter 3. The classifier is an MLP composed of one to three hidden layers. The number of hidden units in each layer is set to 1000.
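The path score of Equation (7.14) can be illustrated with a small sketch. It assumes the network scores are available as a T x K table `f` (one row per frame, one column per label) and the transition scores as a K x K table `A` indexed as `A[current][previous]`. Skipping the transition term for the first frame is an assumption of this sketch, since the text does not say how the initial transition is handled. The function name and toy values are hypothetical.

```python
def path_score(f, A, labels):
    """Score of a label path: sum over frames of f[t][l_t] + A[l_t][l_{t-1}].

    f      : list of T rows, each with K network scores
    A      : K x K transition scores, indexed A[current][previous]
    labels : list of T label indices
    Assumption: no transition term is added for the first frame.
    """
    score = 0.0
    for t, label in enumerate(labels):
        score += f[t][label]
        if t > 0:
            score += A[label][labels[t - 1]]
    return score


# Toy example with T = 3 frames and K = 2 labels (values are made up).
f = [[1.0, 0.0],
     [0.5, 2.0],
     [0.0, 1.0]]
A = [[0.5, -1.0],
     [-0.5, 0.25]]

example_score = path_score(f, A, [0, 1, 1])
print(example_score)
```

For the path (0, 1, 1) the network terms sum to 1.0 + 2.0 + 1.0 and the transition terms to -0.5 + 0.25, giving 3.75.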
For the sake of completeness, we also compare our results to the CRF-based system proposed in [Morris and Fosler-Lussier, 2008]. This system uses local posterior estimates provided by an ANN (trained separately using PLP features) as features for the CRF. This system is referred to as “CRF”. The second baseline is an ANN/CRF-based system [Prabhavalkar and Fosler-Lussier, 2010], where the ANN using PLP features as input is trained jointly with the CRF by back-propagation. It is referred to as “ML-CRF”. All these systems are trained using the 61-phoneme set, mapped to the 39-phoneme set for evaluation.
CRF Hyper-parameters
The hyper-parameters of the segmentation graph are the minimum and maximum phoneme durations $t_{min}$ and $t_{max}$. They are tuned on the phoneme error rate of the validation set. The minimum duration $t_{min}$ was set to 30 ms, or 3 frames. The maximum duration $t_{max}$ was set to 300 ms, or 30 frames. The maximum duration of the silence class was set to 1.5 s, or 150 frames.
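To make the duration constraints concrete, the sketch below converts the durations to frames at 10 ms per frame and checks whether a frame-level label sequence respects them. It is only an illustration, not the segmentation graph itself. The helper names are hypothetical, and two points are assumptions: the minimum duration is applied to the silence class as well, and consecutive identical labels are treated as a single segment.

```python
from itertools import groupby

FRAME_MS = 10  # one frame (output label) every 10 ms


def to_frames(duration_ms):
    """Convert a duration in milliseconds to a number of 10 ms frames."""
    return duration_ms // FRAME_MS


def segment_lengths(frame_labels):
    """Collapse a frame-level label sequence into (label, length in frames) runs."""
    return [(label, len(list(run))) for label, run in groupby(frame_labels)]


def satisfies_constraints(frame_labels, t_min=3, t_max=30, t_max_sil=150, sil="sil"):
    """Check every segment against the minimum and maximum durations (in frames).

    Assumption: t_min also applies to silence; only the upper bound differs.
    """
    for label, length in segment_lengths(frame_labels):
        upper = t_max_sil if label == sil else t_max
        if length < t_min or length > upper:
            return False
    return True


valid = ["sil"] * 40 + ["ih"] * 5 + ["t"] * 3
too_short = ["sil"] * 40 + ["ih"] * 2
print(satisfies_constraints(valid), satisfies_constraints(too_short))
```

A 40-frame silence segment is accepted because silence has its own upper bound of 150 frames, whereas a 2-frame phoneme segment violates the 3-frame minimum.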
7.2.2 Results
The results on the phoneme sequence recognition task are reported in Table 7.2 for the two training strategies using manual segmentation, namely separate training and joint training, and for the weakly-supervised training strategy. Using manual segmentation, one can see that the ANN-based system with a single hidden layer yields similar performance to the CRF baseline (30.2% and 30.7% PER, respectively) and to the ML-CRF baseline (29.1% and 28.9% PER, respectively). Adding more layers improves the performance. The end-to-end CNN-based system clearly outperforms the CRF baselines and the ANN-based systems. Moreover, the CNN-based system with one hidden layer yields better performance than the ANN-based system using three hidden layers. One can see that the joint training approach leads to similar or better systems than the separate approach.
Systems trained using the weakly-supervised training approach yield similar or better performance than systems trained using manual segmentation. Figure 7.7 compares the segmentation obtained by the proposed approach with the manual segmentation for an utterance. It can be observed that there are only minor differences between the segmentations. These results clearly indicate that the proposed weakly-supervised training approach, which maximizes P(L|X), can be a good alternative to the independent training approach, based on maximizing P(L, X).
[Figure: phoneme label versus time in ms for one utterance, comparing the manual segmentation with the learned segmentation.]
Figure 7.7 – Phoneme segmentation example using the 39-phoneme set, for sequence sx32 of speaker mcdc0.
Table 7.2 – Evaluation of the proposed approach on the TIMIT core test set. Results are expressed in terms of PER (%). The CRF baseline performance is reported in [Morris and Fosler-Lussier, 2008] and the ML-CRF performance is reported in [Prabhavalkar and Fosler-Lussier, 2010].
Input  System  # Hidden layers  Separate training  Joint training  Weakly-sup. training
Previous works
MFCC   CRF     1                30.7               -               -
PLP    ML-CRF  1                -                  28.9            -
Proposed approach
MFCC   ANN     1                30.2               29.1            28.7
MFCC   ANN     2                29.9               28.0            27.9
MFCC   ANN     3                29.7               27.6            27.3
RAW    CNN     1                25.6               25.5            26.6
RAW    CNN     2                25.0               25.4            25.7
RAW    CNN     3                24.9               25.4            25.7
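For readers unfamiliar with the metric, the sketch below shows one common way to compute a phoneme error rate: the Levenshtein distance (substitutions, deletions and insertions) between reference and hypothesis phoneme sequences, divided by the total number of reference phonemes. The text does not specify the scoring tool used for Table 7.2, so this is an assumed, generic definition rather than the exact evaluation procedure. The function names and the toy sequences are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two label sequences (unit costs)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def phoneme_error_rate(refs, hyps):
    """PER in percent over a set of utterances."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total = sum(len(r) for r in refs)
    return 100.0 * errors / total


# Toy example (made-up sequences): one substitution and one insertion.
refs = [["sil", "ih", "t", "s", "sil"]]
hyps = [["sil", "ih", "s", "s", "ah", "sil"]]
per = phoneme_error_rate(refs, hyps)
print(per)
```

In this toy case two errors against five reference phonemes give a PER of 40%. In the setting described above, both sequences would first be mapped to the 39-phoneme set.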