4.2 Forced alignment
4.2.1 Forced alignment accuracy
The procedure of orthographic transcription combined with forced alignment can quickly produce phone boundary labels for relatively large speech corpora. However, these automatically produced labels will necessarily be less accurate than manual labels, because acoustic models cannot account for all the variation that could be present in the speech signal. Nevertheless, for the purposes of automatic vowel analysis, a small degree of error in the phone alignments is acceptable, as long as the formant extraction procedure is still able to obtain accurate results from the output of the forced alignment. In one study using the P2FA forced alignment system, Yuan and Liberman (2008a) report that the vast majority of automatically generated word onset boundaries differed from the manual boundaries by less than 50 msec.
In order to test P2FA’s performance on the current corpus, the phone boundaries for all stressed vowels from two word list recordings were manually segmented. One recording was taken from a face-to-face interview, and one was taken from a telephone interview. This was done in order to determine whether P2FA performs worse on telephone speech (since the acoustic models were not trained using telephone speech). The results for the two recordings were similar, though, and both are pooled together for the analysis below.
For this experiment, 324 vowels with primary stress were manually provided with onset and offset labels. Figure 4.1 shows a histogram of the absolute difference between the forced alignment (FA) onset boundaries and the manual ones, and Figure 4.2 provides the same comparison for the vowel offset boundaries. These results are quite good: a difference of 10 msec or less is by far the most common result.
Table 4.1 summarizes the results presented in the two histograms in Figures 4.1 and 4.2, and shows that about two thirds of the automatically assigned boundaries fall within 20 msec of the manual ones in both cases, and all but one fall within 50 msec for the vowel onset. These numbers are promising, especially since accurate alignment performance is most important for the vowel onset. As Section 4.3.4 will show, most manual vowel formant measurements are taken closer to the onset. The approach that will be adopted for automatic measurement point selection in Section 4.3 takes this into account, so alignment errors that occur in the vowel offset will generally not have an effect on automatic formant extraction.

[Figure 4.1: histogram of the absolute difference between FA and manual vowel onset boundaries; x-axis: Absolute difference (sec), y-axis: Frequency]
[Figure 4.2: histogram of the absolute difference between FA and manual vowel offset boundaries; x-axis: Absolute difference (sec), y-axis: Frequency]

            within 20 msec    within 50 msec
V Onset     224 (69.1%)       323 (99.7%)
V Offset    204 (63.0%)       276 (85.2%)
Table 4.1: Comparison between FA and manual vowel boundaries (N = 324)
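The tolerance counts reported in Table 4.1 could be computed along the following lines. This is only a minimal sketch, not the procedure actually used for the study: the function name `boundaries_within` is hypothetical, boundary times are assumed to be in seconds, and "within" is assumed to include the threshold itself. The example values are invented for illustration and are not taken from the corpus.

```python
from typing import Sequence, Tuple


def boundaries_within(
    fa: Sequence[float], manual: Sequence[float], tol_sec: float
) -> Tuple[int, float]:
    """Return (count, proportion) of boundaries whose absolute
    FA-vs-manual difference is at most tol_sec (assumed inclusive)."""
    if len(fa) != len(manual):
        raise ValueError("fa and manual must have the same length")
    if len(fa) == 0:
        raise ValueError("at least one boundary pair is required")
    # Compare each automatically assigned boundary with its manual counterpart.
    hits = sum(1 for a, m in zip(fa, manual) if abs(a - m) <= tol_sec)
    return hits, hits / len(fa)


if __name__ == "__main__":
    # Hypothetical onset times in seconds (illustrative only).
    fa_onsets = [0.100, 0.250, 0.400, 0.610]
    manual_onsets = [0.105, 0.240, 0.430, 0.500]
    for tol in (0.020, 0.050):
        count, share = boundaries_within(fa_onsets, manual_onsets, tol)
        print(f"within {tol * 1000:.0f} msec: {count} ({share:.1%})")
```

Running the same function separately on the onset and the offset boundaries would yield the two rows of a table like Table 4.1.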
As an additional metric for evaluating the performance of P2FA for the purpose of automatic formant extraction, we can consider the number of cases in which the manually selected measurement point falls inside of the vowel boundaries produced by P2FA. Cases in which this does not occur are serious errors, since the point marked as the correct measurement point by the human annotator is not available to the automatic vowel analysis system. Conversely, for all cases in which the measurement point does fall inside the automatically produced vowel boundaries, the automatic vowel analysis system has the potential to choose the same point for formant measurement as the human annotator did.
In order to test this, manual vowel formant measurements were extracted for the 324 tokens from the preceding analysis, and the manual measurement points were compared with the vowel boundary labels produced by P2FA. In only 6 out of the 324 cases (less than 2%) did the manual vowel measurement point fall outside of the FA vowel boundary labels.² In four out of these six cases, the FA boundary errors were caused by neighboring liquid consonants in the words rider, fool, and two tokens of the word full. The other two errors were in the words hammer and began, and were caused by a mis-alignment of the preceding segment.

² There was one further case, in the word route, that initially appeared to be a mis-alignment. However, upon visual inspection it was clear that the FA boundaries were correct, and that the manual formant mea-
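The containment check described above can be sketched as follows. This is an illustration under stated assumptions rather than the code used for the study: the `Token` record and the function `outside_fa_interval` are hypothetical names, times are assumed to be in seconds, and a measurement point lying exactly on a boundary is assumed to count as inside. The token labels and times are invented.

```python
from typing import Iterable, List, NamedTuple


class Token(NamedTuple):
    """One vowel token (hypothetical record layout)."""
    label: str
    fa_onset: float      # vowel onset assigned by forced alignment (sec)
    fa_offset: float     # vowel offset assigned by forced alignment (sec)
    manual_point: float  # manually selected measurement point (sec)


def outside_fa_interval(tokens: Iterable[Token]) -> List[Token]:
    """Return the tokens whose manual measurement point lies outside
    the FA vowel interval (endpoints assumed to count as inside)."""
    return [
        t for t in tokens
        if not (t.fa_onset <= t.manual_point <= t.fa_offset)
    ]


if __name__ == "__main__":
    # Invented example tokens (illustrative only).
    tokens = [
        Token("token_a", 1.20, 1.35, 1.18),  # point precedes the FA onset
        Token("token_b", 2.00, 2.15, 2.05),  # point inside the interval
        Token("token_c", 3.10, 3.22, 3.22),  # point on the FA offset
    ]
    misses = outside_fa_interval(tokens)
    print(f"{len(misses)} of {len(tokens)} points fall outside the FA interval")
    for t in misses:
        print(t.label)
```

Returning the offending tokens rather than a bare count makes it easy to inspect each case by hand, as was done for the six errors discussed above.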
Based on these results, it seems safe to conclude that the forced alignment results obtained by using P2FA will be accurate enough for conducting automatic vowel analysis. Though the mis-alignment rate for interview data is likely to be higher than the rate determined in this study for word list speech, this study suggests that any further errors that arise will be due to problems inherent to interview data, such as disfluencies, laughter, etc., and not due to inadequacies of the forced alignment system. In the case of interview data, there are generally thousands of tokens from each speaker. Thus, even with some measurement errors due to mis-alignments, the Law of Large Numbers will ensure that the means calculated for each vowel will be stable.³
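The appeal to the Law of Large Numbers can be made explicit with a simple model. The notation below is an assumption introduced here for illustration and does not appear elsewhere in the text: each formant measurement $x_i$ of a given vowel is written as the true mean $\mu$ plus a disturbance $e_i$ (token-to-token variation plus any mis-alignment error), and the $e_i$ are assumed to be independent with common mean $b$ and finite variance $\sigma^2$.

```latex
\begin{align*}
x_i &= \mu + e_i, \qquad \mathbb{E}[e_i] = b, \quad \operatorname{Var}(e_i) = \sigma^2, \\
\bar{x}_n &= \frac{1}{n}\sum_{i=1}^{n} x_i
          \;\longrightarrow\; \mu + b \quad \text{as } n \to \infty, \\
\operatorname{SD}(\bar{x}_n) &= \frac{\sigma}{\sqrt{n}}.
\end{align*}
```

Under these assumptions the spread of the vowel mean shrinks in proportion to $1/\sqrt{n}$, so with thousands of tokens per speaker the mean is stable. The mean also coincides with $\mu$ when the mis-alignment errors do not shift measurements systematically in one direction, that is, when $b = 0$.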