Phoneme Sequence Recognition Study - Towards End-to-End Speech Recognition

In this section we present the experimental setup and the results of the phoneme recognition study on the TIMIT corpus.

7.2. Phoneme Sequence Recognition Study

7.2.1 Experimental Setup

TIMIT Corpus

The training, validation and test sets are the same as in the previous chapters and are detailed in Section 3.3.1. The phoneme set is composed of 61 phonemes. For evaluation, the 61 phonemes are mapped to the 39-phoneme set [Lee and Hon, 1989]. A phoneme segmentation is provided with this corpus; we refer to it as “manual segmentation”.

CNN-based System Setup

The input for this part of the study is the raw speech waveform, as described in Chapter 3. The architecture is composed of four filter stages. The hyper-parameters are tuned on the phoneme error rate of the validation set and are presented in Table 7.1.

Table 7.1 – Network hyper-parameters.

System  # hidden layers  w_in    nhu             kW        dW       d_n              kW_mp
CNN     1                310 ms  1000            30,7,7,7  5,1,1,1  200,100,100,100  4,2,2,2
        2                310 ms  1000,1000       30,7,7,7  5,1,1,1  200,100,100,100  4,2,2,2
        3                310 ms  1000,1000,1000  30,7,7,7  5,1,1,1  200,100,100,100  4,2,2,2

In the CNN-based architecture, the number of output labels, i.e. the length of the inferred phoneme sequence, is determined directly by the hyper-parameters. The duration of one output label $T_{lab}$ (in seconds) is given by the duration of one sample of the input waveform (the inverse of the sampling frequency $f_s$) multiplied by the total pooling $N_{pool}$, i.e.

$$T_{lab} = \frac{1}{f_s} \cdot N_{pool} \qquad (7.12)$$

Using four filter stages, the total pooling is given by:

$$N_{pool} = \prod_{i=1}^{4} dW_i \cdot dW_{mp,i} \qquad (7.13)$$

To be consistent with the baselines, the output label duration was set to $T_{lab}$ = 10 ms, thus $N_{pool}$ = 160. The hyper-parameter grid search was limited to configurations that satisfy this constraint.
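As a quick check of this constraint, the following minimal sketch evaluates the total pooling and the label duration from the values in Table 7.1. It assumes a 16 kHz sampling frequency, which is not stated in this section, and it assumes that the max-pooling shift of each stage equals the max-pooling kernel width listed in the table. The variable names are illustrative only.

```python
from math import prod

# Per-stage hyper-parameters taken from Table 7.1 (four filter stages).
dW = [5, 1, 1, 1]      # convolution shifts
dW_mp = [4, 2, 2, 2]   # max-pooling shifts (assumed equal to kW_mp)

# Assumption: 16 kHz sampling frequency (not stated in this section).
fs = 16000

# Total pooling: product of all convolution and max-pooling shifts.
n_pool = prod(d * p for d, p in zip(dW, dW_mp))

# Duration of one output label in seconds: one sample period times the pooling.
t_lab = n_pool / fs

print(n_pool, t_lab)
```

Under these assumptions, the table values give a total pooling of 160 and a label duration of 10 ms, matching the constraint stated above.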

[Figure 7.6 diagram: Raw speech utterance S → Feature Extraction → X → MLP → CRF → Phoneme sequence L∗; the MLP and CRF are marked “Joint Training”.]

Figure 7.6 – Illustration of the ANN-based system using MFCC features as input.

Baselines

We compare the CNN-based system, which uses raw speech as input, with ANN-based systems that use MFCC features as input. The score for a path in Equation (7.1) becomes:

$$c(X, L, \Theta) = \sum_{t=1}^{T} \left( f^{t}_{l_t}(X, \theta_f) + A_{l_t, l_{t-1}} \right) \qquad (7.14)$$

where $X = \{x_1 \dots x_T\}$ is a sequence of features, as illustrated in Figure 7.6. The system is trained using the three training strategies presented above. We use the same MFCC features as in the previous TIMIT study in Chapter 3. The classifier is an MLP composed of one to three hidden layers. The number of hidden units in each layer is set to 1000.
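To make the path score concrete, here is a minimal sketch that sums the per-frame network scores and the transition scores along one label path. The function name `path_score` and the array layout are assumptions made for illustration. The transition term of the first frame, which has no predecessor, is simply omitted here, which may differ from the treatment in the original system.

```python
import numpy as np

def path_score(f, A, labels):
    """Score of one label path: per-frame scores plus transition scores.

    f      : (T, K) array, f[t, k] is the network score of label k at frame t
    A      : (K, K) array, A[i, j] is the transition score from label j to label i
    labels : length-T sequence of label indices l_1 ... l_T
    """
    labels = np.asarray(labels)
    T = len(labels)
    # Sum of the network scores of the chosen label at each frame.
    emission = f[np.arange(T), labels].sum()
    # Sum of A[l_t, l_{t-1}] for t = 2..T (assumption: no term for t = 1).
    transition = A[labels[1:], labels[:-1]].sum()
    return float(emission + transition)

# Toy example with T = 3 frames and K = 2 labels.
f = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 0.0]])
A = np.array([[0.5, -1.0], [-2.0, 0.25]])
score = path_score(f, A, [0, 1, 0])
print(score)
```

In the toy example the per-frame scores sum to 6 and the two transitions contribute -2 and -1, so the path score is 3.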

For the sake of completeness, we also compare our results to the CRF-based system proposed in [Morris and Fosler-Lussier, 2008]. This system uses local posterior estimates provided by an ANN (trained separately using PLP features) as features for the CRF. It is referred to as “CRF”. The second baseline is an ANN/CRF-based system [Prabhavalkar and Fosler-Lussier, 2010], where the ANN using PLP features as input is trained jointly with the CRF by back-propagation. It is referred to as “ML-CRF”. All these systems are trained using the 61-phoneme set, which is mapped to the 39-phoneme set for evaluation.

CRF Hyper-parameters

The hyper-parameters of the segmentation graph are the minimum and maximum phoneme durations $t_{min}$ and $t_{max}$. They are tuned on the phoneme error rate of the validation set. The minimum duration $t_{min}$ was set to 30 ms, or 3 frames. The maximum duration $t_{max}$ was set to 300 ms, or 30 frames. The maximum duration of the silence class was set to 1.5 s, or 150 frames.
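The duration constraints can be illustrated with a small sketch that converts the durations to frames and checks whether a frame-level label sequence respects them. This is not the segmentation graph itself. The helper names, the silence label `"sil"`, and the choice to apply the same minimum duration to silence are assumptions. Two adjacent occurrences of the same phoneme are treated as a single run in this sketch.

```python
from itertools import groupby

FRAME_MS = 10  # one output label every 10 ms

def ms_to_frames(duration_ms):
    """Convert a duration in milliseconds to a number of frames."""
    return duration_ms // FRAME_MS

T_MIN = ms_to_frames(30)         # minimum phoneme duration, in frames
T_MAX = ms_to_frames(300)        # maximum phoneme duration, in frames
T_MAX_SIL = ms_to_frames(1500)   # maximum silence duration, in frames

def satisfies_durations(frame_labels, silence="sil"):
    """Return True if every run of identical labels has an allowed length."""
    for label, run in groupby(frame_labels):
        n = sum(1 for _ in run)
        t_max = T_MAX_SIL if label == silence else T_MAX
        if not (T_MIN <= n <= t_max):
            return False
    return True

ok = satisfies_durations(["sil"] * 100 + ["ih"] * 5 + ["t"] * 3)
too_short = satisfies_durations(["ih"] * 2)
print(ok, too_short)
```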

7.2.2 Results

The results on the phoneme sequence recognition task are reported in Table 7.2 for the two training strategies using manual segmentation, namely separate training and joint training, and for the weakly-supervised training strategy. Using manual segmentation, one can see that the ANN-based system with a single hidden layer yields performance similar to the CRF baseline under separate training (30.2% and 30.7% PER, respectively) and to the ML-CRF baseline under joint training (29.1% and 28.9% PER, respectively). Adding more layers improves the performance. The end-to-end CNN-based system clearly outperforms the CRF baselines and the ANN-based systems. Moreover, the CNN-based system with one hidden layer yields better performance than the ANN-based system with three hidden layers. One can see that the joint training approach leads to similar or better systems than the separate training approach.

Systems trained using the weakly-supervised training approach yield similar or better performance than systems trained using manual segmentation. Figure 7.7 compares the segmentation obtained by the proposed approach with the manual segmentation for one utterance. It can be observed that there are only minor differences between the two segmentations. These results clearly indicate that the proposed weakly-supervised training approach, which maximizes $P(L|X)$, can be a good alternative to the independent training approach, based on maximizing $P(L, X)$.

[Figure 7.7 plot: x-axis Time [ms], 0 to 2500; y-axis Label, 0 to 40; curves: Manual segmentation, Learned segmentation. Phoneme sequence: /sil/ /ih/ /t/ /s/ /f/ /ah/ /n/ /t/ /uw/ /r/ /ow/ /s/ /t/ /m/ /ao/ /r/ /sh/ /sil/ /m/ /eh/ /l/ /ah/ /z/ /ao/ /n/ /ih/ /g/ /ae/ /s/ /b/ /er/ /n/ /er/ /sil/]

Figure 7.7 – Phoneme segmentation example using the 39-phoneme set, for sequence sx32 of speaker mcdc0.

Table 7.2 – Evaluation of the proposed approach on the TIMIT core test set. Results are expressed in terms of PER (%). The CRF baseline performance is reported in [Morris and Fosler-Lussier, 2008] and the ML-CRF performance is reported in [Prabhavalkar and Fosler-Lussier, 2010].

Input  Systems  # Hidden Layers  Separate Training  Joint Training  Weakly-sup. Training

Previous works
MFCC   CRF      1                30.7               -               -
PLP    ML-CRF   1                -                  28.9            -

Proposed approach
MFCC   ANN      1                30.2               29.1            28.7
MFCC   ANN      2                29.9               28.0            27.9
MFCC   ANN      3                29.7               27.6            27.3
RAW    CNN      1                25.6               25.5            26.6
RAW    CNN      2                25.0               25.4            25.7
RAW    CNN      3                24.9               25.4            25.7
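For readers unfamiliar with the metric, PER is commonly computed as the Levenshtein (edit) distance between the reference and the recognized phoneme sequences, divided by the reference length. The sketch below illustrates that common definition for a single, non-empty reference. It is not the scoring tool used for the results above, which is not specified in this section, and the function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two label sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference label
                curr[j - 1] + 1,         # insertion of a hypothesis label
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1]

def phoneme_error_rate(ref, hyp):
    """PER in percent for one utterance (assumes a non-empty reference)."""
    return 100.0 * edit_distance(ref, hyp) / len(ref)

ref = ["sil", "ih", "t", "s"]
hyp = ["sil", "ih", "s"]
per = phoneme_error_rate(ref, hyp)
print(per)
```

Corpus-level figures are usually obtained by summing the edit distances and the reference lengths over all test utterances before dividing, rather than by averaging per-utterance rates.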
