User Interface - Speech verification for computer assisted pronunciation training

Figure 5.2: A Screenshot of MAT

5.2 User Interface

MAT is built with a user-friendly interface, as depicted in Figure 5.2. The annotators' task is straightforward: load a set of audio files with their transcriptions, listen to each of them carefully, and mark the pronunciation errors they find.

After the data is loaded with the open button at the top, the audio files are listed on the left, and the phonetic information of the selected file is shown on the right. The transcription of the selected audio file is shown in the middle as text. The upper pane displays the audio aligned with its transcription, segmented into phonemes or words. Annotators can play the audio clip of a single speech unit by clicking on the header bar, or play any part of the audio by selecting a region of the waveform with the mouse. In this way, annotators can easily focus on individual speech units they find suspicious, without having to listen to the whole sentence repeatedly.

The lower pane is where the annotation takes place. Possible pronunciation errors are arranged in a table. The columns give the error types, and the rows give the speech units each error can apply to, which can be phonemes, syllables, or words, ordered and grouped according to their appearance in the transcription. Generally, an error is annotated by simply selecting the checkbox in the table cell for the corresponding speech unit (row) and error type (column). Beyond that, annotators are asked to provide more information about the errors according to their perception. Hence each type of error is handled slightly differently:

• For deletion, no other annotation is required: annotators simply select the checkbox when a phoneme is omitted in the speech.

• For insertion, annotators should write the inserted phoneme in the spoken column.

• In case of substitution, the phoneme that was actually spoken should also be written in the spoken column.

• Distortion is handled with more diligence. Besides checking the distortion box, annotators also need to perceive in which way a phoneme is distorted and provide hints on how to correct it. Hints are chosen from a previously configured list. If the hint for correcting the distortion is not available in the list, it can be added from the configuration panel.

• If annotators find that a phoneme or word is spoken with an obvious foreign accent, they can check foreign_accent or foreign_accent_W.

• If the stress is misplaced on a syllable in a word, stress should be checked.

• Pauses after certain words should also be annotated.
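The per-type rules above can be summarised in a small data structure. The following is only a sketch: the class `ErrorAnnotation`, the function `missing_fields`, the three rule sets, and the label `pause` are hypothetical names chosen for illustration, and it assumes that the column names match the error type names used in the text. It is not MAT's actual implementation.

```python
from dataclasses import dataclass
from typing import List, Optional

# Which extra information each error type requires besides the ticked checkbox.
# The grouping follows the list above; the names are illustrative assumptions.
NEEDS_SPOKEN = {"insertion", "substitution"}
NEEDS_HINT = {"distortion"}
CHECKBOX_ONLY = {"deletion", "foreign_accent", "foreign_accent_W", "stress", "pause"}


@dataclass
class ErrorAnnotation:
    unit: str                     # speech unit in the table row, e.g. a phoneme
    error_type: str               # column whose checkbox was ticked
    spoken: Optional[str] = None  # what the annotator actually heard
    hint: Optional[str] = None    # correction hint, used for distortions


def missing_fields(a: ErrorAnnotation) -> List[str]:
    """Return the names of fields the error type requires but that are empty."""
    if a.error_type not in NEEDS_SPOKEN | NEEDS_HINT | CHECKBOX_ONLY:
        raise ValueError(f"unknown error type: {a.error_type}")
    missing = []
    if a.error_type in NEEDS_SPOKEN and not a.spoken:
        missing.append("spoken")
    if a.error_type in NEEDS_HINT and not a.hint:
        missing.append("hint")
    return missing
```

For example, `missing_fields(ErrorAnnotation("T", "substitution"))` reports that the spoken column is still empty, while a deletion needs nothing beyond the checkbox.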

Sentence-level errors are annotated below the table. Furthermore, annotators are asked to give a score that takes all phonetic and prosodic errors in the speech into account. These scores can be used to evaluate machine-generated ones. Scores are by default integers between 1 and 10, and can be either typed in manually or chosen from the list. Fractional numbers such as 8.5 are also allowed, for example if an annotator thinks the score should be between 8 and 9. Annotators can also leave comments if there is anything not covered by the user interface.
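The scoring rule can be sketched as follows. The function names `parse_score` and `mean_abs_diff` are hypothetical, and the mean absolute difference is just one simple way to compare annotator scores with machine-generated ones; it is an assumption made for illustration, not necessarily the measure used in this work.

```python
from typing import Sequence


def parse_score(text: str, low: float = 1.0, high: float = 10.0) -> float:
    """Parse a manually entered score; integers and fractions such as 8.5 are accepted."""
    value = float(text.strip())  # raises ValueError on non-numeric input
    if not (low <= value <= high):
        raise ValueError(f"score {value} is outside the range {low} to {high}")
    return value


def mean_abs_diff(human: Sequence[float], machine: Sequence[float]) -> float:
    """Average absolute gap between annotator scores and machine scores (illustrative)."""
    if len(human) != len(machine) or len(human) == 0:
        raise ValueError("score lists must be non-empty and of equal length")
    return sum(abs(h - m) for h, m in zip(human, machine)) / len(human)
```

With annotator scores `[8.0, 8.5, 6.0]` and machine scores `[7.0, 9.0, 6.0]`, `mean_abs_diff` returns 0.5.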

Annotators can define which types of error they want to annotate by opening the error configuration panel. In the tree view, annotators can adapt the error types for each level. Furthermore, they can modify the hints list before and during annotation, so that the appropriate hint can be conveniently chosen. Hints play a key role in classifying distortion errors, as they not only show how annotators suggest correcting the distortions but also indicate how the errors are made; e.g. the hint “Lips need to be rounded” suggests that the distortion is caused by pronouncing with unrounded lips. Hints gathered by annotators for German learners of English are listed in Table 5.2.

Tongue needs to be slightly further forward.
Tongue needs to be slightly further back.
Mouth needs to be slightly more closed.
Mouth needs to be slightly more open.
Lips need to be rounded.
Lips need to be unrounded.
Mouth needs to start slightly more open.
Mouth needs to start slightly more closed.
Tongue needs to start slightly further back.
Tongue needs to start slightly further forward.
Lips need to be rounded at the end.
Vowel needs to be longer.
Vowel needs to be longer and tongue needs to be slightly further back.

Table 5.2: Collections of distortions specified by annotators.
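The use of hints as class labels for distortions can be sketched as follows. This is a minimal illustration under assumptions: the function `group_by_hint` is hypothetical, and the phoneme labels in the usage example are invented placeholders, not data from the corpus.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_by_hint(distortions: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group (phoneme, hint) pairs by hint so that each hint forms one distortion class."""
    classes: Dict[str, List[str]] = defaultdict(list)
    for phoneme, hint in distortions:
        classes[hint].append(phoneme)
    return dict(classes)


# Invented example annotations; the hints are taken from Table 5.2.
example = [
    ("u", "Lips need to be rounded."),
    ("i", "Vowel needs to be longer."),
    ("o", "Lips need to be rounded."),
]
classes = group_by_hint(example)
```

Here `classes` maps each hint to the phonemes annotated with it, so all distortions attributed to unrounded lips end up in the same group.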

MAT also inherits speech signal processing capabilities from MARY. Annotators who wish to explore the corpus further can turn on the speech analysis view, as shown in Figure 5.4. There they can inspect the waveform, spectrum, pitch contour, and energy graph of the speech signal, and view the values at a point in time chosen with a mouse click.

When all annotations are done, annotators press the Commit button. Several checks are performed before the annotation is stored to file, including:

• Errors need to be annotated in the correct way, i.e. the annotation rules described above must be followed; for example, if substitution is checked, the substituted phoneme must also be given in spoken.

• Only one type of phonetic error per phoneme. There is no clear boundary between substitution and distortion: a phoneme could be so heavily distorted that it ends up sounding like another one, which amounts to a substitution. In this case we ask annotators to choose only one of the two types of errors

Figure 5.3: MAT Configuration Panels: (a) Configuring errors of different levels; (b) Managing hints.

