
7.1.3 Post-evaluation analysis

The results of the Dutch and Flemish broadcast news (BN) tasks are disappointing compared to those on the Dutch BN development set. In part, the word error rates are higher because the evaluation data contain a considerable amount of interviews and discussions, whereas the development data consist mainly of prepared studio speech. Table 7.6 shows the evaluation results of the Dutch BN task for the four conditions with which the evaluation data were labeled. The word error rate of the clean studio condition is 26.3%, which is comparable to the result obtained on the development set (27.5%).

Audio condition     Number of words   %SUB   %DEL   %INS   %WER
Broadcast (F0)                 7177   15.1    7.6    3.6   26.3
Spontaneous (F1)              10126   22.3   15.8    2.1   40.2
Telephone (F2)                 3775   17.5   43.1    1.9   62.5
Degraded (F4)                  2953   24.9   11.1    3.0   39.0

Table 7.6: The N-Best evaluation results for the Dutch BN task. The results are shown for the four main audio conditions that are present in the task. The word error rate (WER) is divided into substitution (SUB), deletion (DEL) and insertion (INS) errors.
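The decomposition in Table 7.6 can be checked with a few lines of Python. This is only a sanity-check sketch: the numbers are copied from the table, while the names `TABLE_7_6`, `wer_from_components` and `deletion_share` are illustrative and not part of the evaluation tooling.

```python
# Rows of Table 7.6: condition -> (words, %SUB, %DEL, %INS, %WER).
TABLE_7_6 = {
    "Broadcast (F0)": (7177, 15.1, 7.6, 3.6, 26.3),
    "Spontaneous (F1)": (10126, 22.3, 15.8, 2.1, 40.2),
    "Telephone (F2)": (3775, 17.5, 43.1, 1.9, 62.5),
    "Degraded (F4)": (2953, 24.9, 11.1, 3.0, 39.0),
}


def wer_from_components(sub, dele, ins):
    """WER is the sum of substitution, deletion and insertion rates."""
    return round(sub + dele + ins, 1)


def deletion_share(dele, wer):
    """Fraction of the word error rate that is caused by deletions."""
    return dele / wer


for name, (words, sub, dele, ins, wer) in TABLE_7_6.items():
    total = wer_from_components(sub, dele, ins)
    share = deletion_share(dele, wer)
    print(f"{name:18s} WER={total:5.1f}%  deletions={share:.0%} of errors")
```

In the telephone condition, deletions account for roughly two thirds of all errors, which is far more than in the other three conditions.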

The difference between evaluation and development data is not the only problem. After studying the evaluation results, it became clear that a large amount of speech was discarded by the segmentation subsystem, which falsely labeled this speech as audible non-speech. Inspection of these segments revealed that the subsystem filtered out all speech that was recorded over a telephone line. The deletion rate of 43.1% in the F2 condition in table 7.6 is a clear indication of this problem. The development set did not contain any telephone speech and therefore this flaw in the system was not noticed before. Next, a short explanation of this problem in the segmentation subsystem will be given.

Telephone speech in broadcast news

In chapter 4 the segmentation subsystem was described. This subsystem first identifies speech segments using a bootstrapping SAD component. Using this initial segmentation, a speech model, silence model and audible non-speech model are trained. For training the audible non-speech model, audio fragments with high energy levels that are classified as non-speech are used. Because in some cases the audible non-speech model is actually trained on speech fragments, the speech and non-speech models are compared and if they are considered similar, the audible non-speech model is discarded. The method failed because of two assumptions that are not valid in this case.

First, the bootstrapping is performed by a model-based speech/silence segmentation component. This means that segments are only classified as speech when they fit the speech model well enough. If the audio conditions differ too much from the conditions of the training data for the speech model, speech segments might fit the more general silence model better, causing the segments to be classified as non-speech. This is what happened for the telephone speech.

Second, it is assumed that if the audible non-speech model is trained with speech data, the BIC method makes it possible to detect that the speech model and the audible non-speech model both contain speech, so that the error can be fixed. This is indeed possible as long as the conditions of the speech in both models are similar enough. In this case, the telephone speech with which the audible non-speech model was trained did not match the speech from the speech model, and the error was not fixed at all.
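Why the BIC comparison breaks down under a channel mismatch can be illustrated with a toy example. The sketch below is an assumption-laden simplification, not the subsystem's implementation: it uses one-dimensional features and a single Gaussian per model, and the function name `delta_bic` and the `penalty_weight` parameter are hypothetical. A negative value means the two data sets are considered similar (one shared model suffices); a positive value means they are considered different.

```python
import math
from statistics import pvariance


def delta_bic(x, y, penalty_weight=1.0):
    """Delta-BIC for 1-D data: two separate Gaussians versus one shared Gaussian.

    Positive: the data sets differ enough to justify two models.
    Negative: one shared model is preferred (the sets look similar).
    Assumes both lists have non-zero variance.
    """
    n_x, n_y = len(x), len(y)
    n = n_x + n_y
    # Log-likelihood gain of modelling x and y separately.
    gain = 0.5 * (
        n * math.log(pvariance(x + y))
        - n_x * math.log(pvariance(x))
        - n_y * math.log(pvariance(y))
    )
    # The extra Gaussian adds two parameters (mean and variance).
    penalty = 0.5 * penalty_weight * 2 * math.log(n)
    return gain - penalty


# Toy "studio speech" features and a copy of them from the same condition.
studio = [0.0, 1.0, -1.0, 0.5, -0.5] * 20
same_condition = list(studio)
# Toy "telephone speech": same content, but shifted by a channel offset.
telephone = [value + 10.0 for value in studio]

print(delta_bic(studio, same_condition) < 0)  # similar: model would be discarded
print(delta_bic(studio, telephone) > 0)       # mismatch: model would be kept
```

In this toy setting the shifted data are judged to be different from the studio data even though both are speech, which mirrors the failure described above: the audible non-speech model trained on telephone speech is kept instead of discarded.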

To avoid this problem, a narrow-band/broadband detection subsystem can be applied before segmentation is performed. It is also possible to adjust the segmentation subsystem so that it is more robust to this channel problem. In the future work section of the next chapter, ideas for improving the segmentation subsystem will be given.
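One possible form of such a narrow-band/broadband detector is sketched below. It is not the method proposed in this work, only an illustration under stated assumptions: telephone channels pass little energy above roughly 3.4 kHz, so a frame whose spectral energy above a cutoff is negligible is flagged as narrow-band. The 4 kHz cutoff, the 1% threshold and all function names are hypothetical choices, and a naive DFT is used to keep the example dependency-free.

```python
import cmath
import math


def high_band_energy_ratio(frame, sample_rate, cutoff_hz=4000.0):
    """Fraction of spectral energy at or above cutoff_hz (naive DFT, DC skipped)."""
    n = len(frame)
    total = 0.0
    high = 0.0
    for k in range(1, n // 2):
        coeff = sum(
            frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)
        )
        power = abs(coeff) ** 2
        total += power
        if k * sample_rate / n >= cutoff_hz:
            high += power
    return high / total if total > 0 else 0.0


def looks_narrow_band(frame, sample_rate, cutoff_hz=4000.0, max_ratio=0.01):
    """Flag a frame as narrow-band when almost no energy lies above the cutoff."""
    return high_band_energy_ratio(frame, sample_rate, cutoff_hz) < max_ratio


def sine(freq_hz, sample_rate=16000, n=256):
    """Synthetic test tone; stands in for real audio in this sketch."""
    return [math.sin(2 * math.pi * freq_hz * t / sample_rate) for t in range(n)]


low_only = sine(1000)
low_and_high = [a + b for a, b in zip(sine(1000), sine(6000))]

print(looks_narrow_band(low_only, 16000))      # only energy below the cutoff
print(looks_narrow_band(low_and_high, 16000))  # energy on both sides of the cutoff
```

A practical detector would average the ratio over many frames of real speech and tune the threshold on held-out data.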

Task          BN model (%WER)   CTS model (%WER)   Significance (p)
BN-Dutch                 35.5               34.9            < 0.001
BN-Flanders              33.5               31.7            < 0.001

Table 7.7: The N-Best evaluation results for the Dutch BN task, where segments originally labeled as audible non-speech are processed by the ASR subsystem using the telephone acoustic models. The significance levels are measured compared to the original submission (table 7.5).

For this specific evaluation, where it is guaranteed that the audio fragments do not contain any audible non-speech, it is possible to interpret the results of the segmentation subsystem slightly differently. In this case, all audio that is used for training the audible non-speech model is known to contain high energy levels and to mismatch the acoustic conditions of the training data. The high energy levels indicate that the fragments are not silence and therefore must be speech. The obvious condition that does not match the training data is speech over telephone lines, and therefore the segmentation results can be interpreted as a silence/studio-speech/telephone-speech classification. Table 7.7 contains the results of experiments on the Dutch and Flemish BN task where the audible non-speech segments were interpreted as being telephone speech. These segments are passed to both the broadcast news ASR subsystem and the CTS ASR subsystem [3]. The experiments show that the best results are indeed obtained by applying the CTS acoustic models. The word error rate of the telephone condition (F2) for the Dutch BN task is 39.6% (8.5% deletions) when using the CTS models.
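The reinterpretation of the segmentation labels can be summarized as a small routing rule. The sketch below is illustrative only: the label strings, the function name `choose_acoustic_model` and the `non_speech_possible` flag are hypothetical and do not correspond to identifiers in the actual system.

```python
def choose_acoustic_model(segment_label, non_speech_possible=True):
    """Map a segmentation label to the acoustic models used for decoding.

    Returns "BN", "CTS", or None when the segment is not decoded at all.
    When the audio is guaranteed to contain no audible non-speech, segments
    labeled as audible non-speech are treated as telephone speech instead.
    """
    if segment_label == "silence":
        return None
    if segment_label == "speech":
        return "BN"
    if segment_label == "audible_non_speech":
        return None if non_speech_possible else "CTS"
    raise ValueError(f"unknown segment label: {segment_label!r}")


# Default behaviour: audible non-speech segments are discarded.
print(choose_acoustic_model("audible_non_speech"))
# Evaluation setting described above: they are decoded with the CTS models.
print(choose_acoustic_model("audible_non_speech", non_speech_possible=False))
```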

Post-processing

In section 7.1.1, the post-processing steps that were added to the system for the N-Best evaluation were described. Table 7.8 contains the word error rates after each post-processing step for the Dutch BN task (where the telephone speech is decoded with the CTS acoustic models). Without any post-processing, the WER would have been 38.6%. Table 7.8 shows that each of the steps improves this result. Although the contribution of some post-processing steps is marginal, all improvements are significant with p < 0.001.

[3] For this experiment, only one decoding pass is used. The models are not adapted for a second

Post-processing step                            %WER
No post-processing, scored case-sensitive       38.6
Filled pauses                                   37.9
Filled pauses and compounds                     37.6
Filled pauses, compounds and case               35.3
Filled pauses, compounds, case and numbers      34.9
All post-processing, scored case-insensitive    33.8

Table 7.8: Results of post-processing experiments on the N-Best Dutch-BN evaluation data. All improvements are significant with p < 0.001.

If the task is scored case-insensitively, the WER is reduced by about one percentage point (from 34.9% to 33.8%). Although the case of the reference transcription is not correct in all places, this means that the case normalization step can be improved further.
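The effect of case on scoring can be reproduced with a minimal word error rate computation. The sketch below uses the standard word-level edit distance and is not the official N-Best scoring tool; the function name `word_error_rate`, its `case_sensitive` parameter and the example sentences are illustrative assumptions.

```python
def word_error_rate(reference, hypothesis, case_sensitive=True):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    if not case_sensitive:
        ref = [word.lower() for word in ref]
        hyp = [word.lower() for word in hyp]
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = cur[j - 1] + 1
            cur.append(min(substitution, deletion, insertion))
        prev = cur
    return prev[-1] / len(ref)


reference = "de Tweede Kamer vergadert"
hypothesis = "de tweede kamer vergadert"

print(word_error_rate(reference, hypothesis))                        # case errors count
print(word_error_rate(reference, hypothesis, case_sensitive=False))  # case is ignored
```

Every word whose only error is its case counts as a substitution under case-sensitive scoring, so the gap between the two scores gives an upper bound on what better case normalization could recover.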

In document Segmentation, Diarization and Speech Transcription: Surprise Data Unraveled (Page 139-142)