Supporting Collaborative Transcription of Recorded Speech with a 3D Game Interface

Saturnino Luz¹, Masood Masoodian², and Bill Rogers²

¹ School of Computer Science and Statistics, Trinity College Dublin, Dublin, Ireland
[email protected]

² Department of Computer Science, The University of Waikato, Hamilton, New Zealand
{masood,coms0108}@cs.waikato.ac.nz

Abstract. The amount of speech data available on-line and in institutional repositories, including recordings of lectures, “podcasts”, news broadcasts, etc., has increased greatly in the past few years. Effective access to such data demands transcription. While current automatic speech recognition technology can help with this task, the results of automatic transcription alone are often unsatisfactory. Recently, approaches which combine automatic speech recognition and collaborative transcription have been proposed, in which geographically distributed users edit and correct automatically generated transcripts. These approaches, however, are based on traditional text-editor interfaces which provide little satisfaction to the users who perform these time-consuming tasks, most often on a voluntary basis. We present a 3D “transcription game” interface which aims at improving the user experience of the transcription task and, ultimately, creating an extra incentive for users to engage in a process of collaborative transcription in the first place.

Key words: Computer-assisted speech transcription, collaborative transcription, 3D interfaces, automatic speech recognition, single-player games.

1 Introduction

The unprecedented growth in availability of content on the Internet in the past decade has spurred a great deal of research on technologies for rendering this content more easily accessible to users. From an initial focus on fully automated content indexing, technology has steadily moved towards incorporating input from the user community.

Textual content exemplifies this situation quite clearly. Although vast volumes of data can be efficiently searched by the existing large-scale search “engines”, language (access to multilingual content) and semantics (access to structured content) remain barriers to widespread access. While progress continues to be made in machine translation and text analytics research, tools that support user collaboration in text structuring (through collaborative filtering [13], or extensions to popular “wiki” platforms [16], for instance) and machine translation (via distributed content platforms [3], or through “crowd-sourcing” projects, etc.) have received increasing attention.

The volume of speech content publicly available on the network, through both audio and video media, has also increased greatly in recent years³. In addition to the requirements common to textual data indexing, speech content also requires conversion to text for practical and effective access. For the purposes of information retrieval tasks similar to text document search, techniques based on automatic speech recognition (ASR) have been devised which employ word lattices [18, 2] as input to the indexing process. However, other tasks, such as browsing of lecture “podcasts” and instructional video material, require users to be able to read the spoken content as they would a written text. For such tasks, the levels of accuracy attained by current ASR are inadequate and manual correction is often necessary. As in the case of translation and semantic structuring, schemes which harness the contributions of the user community have been devised.

Ogata & Goto [12], for instance, present a Web-based system which displays transcriptions of “podcasts” which the user can correct at will. These transcriptions are initially generated by a large vocabulary speech recogniser so that, in correction mode, the user can access for each word a set of alternatives produced by the recogniser. Any of these words can be selected if it corresponds to the speech actually heard by the user, or the user can enter a new word if the speech does not correspond to any word in the list. Corrected sentences can then be fed back to ASR training, potentially improving future performance, in addition to improving the quality of the existing transcriptions. The success of this kind of system obviously relies heavily on user involvement. Although Ogata & Goto claim that users of their system found the task of correcting transcriptions “enjoyable”, it is hard to imagine that many people would feel compelled to do it on a regular basis.

In this paper we describe TRAEDRIS, a “transcription editing game” loosely inspired by the popular Tetris game, which aims to provide users with a stronger motivation to correct transcripts, namely, entertainment. TRAEDRIS displays sentence transcription candidates through animated 3D representations of word lattices generated by speech recognition. The user can interact with these sentence representations by selecting the correct paths as the words move towards the background.

1.1 Problem definition

ASR accuracy varies widely, depending on a number of factors. These include: the level of noise in the recording, variations in accent and voice quality, speaker change, the accuracy of pre-processing (e.g. sentence boundary detection or other types of segmentation [14]), and how well the training data match the speech data to be transcribed in terms of audio quality, vocabulary and language models.

³ This growth in volume is apparently matched by a growth in use. According to a Nielsen Netratings report (http://en-us.nielsen.com/main/insights/reports), video content delivered by the main video websites grew by 41% from 2008 to 2009.

Although relatively versatile user interaction techniques, such as alternative lists [10], “respeaking” [1] and other variants of multimodal interaction [15], have been developed for dictation systems and ASR-assisted transcription [11], users tend to prefer retyping from scratch when the accuracy of the initial ASR-generated transcription is low [7].

Accuracy in speech recognition is usually measured in terms of word error rates (WER), that is, the total number of deletions, insertions and substitutions in the transcribed sentence with respect to the correct sentence (usually normalised by the number of words in the latter). In other words, WER is based on the word-level Levenshtein distance wer(W, R) between the ASR hypothesis W and the reference sentence R. Another, less forgiving way of measuring accuracy is at the sentence level. The sentence error rate (SER) can be defined as the ratio between the number of sentences containing at least one error and the total number of sentences transcribed. Even the best recognisers available today typically have high SER while exhibiting relatively low WER. This is especially true when the ASR system is used with good quality audio produced by a small number of speakers, as is the case with much broadcast, “podcast”, and some recorded meeting data available on the Internet. A state-of-the-art system employed to transcribe the Hub5 conversational telephone speech dataset, for example, achieves a WER of 26.5% while its SER is about 66.2% [4]. Similar results have been reported for other systems that attempt to minimise WER directly on the Switchboard corpus [9] and the North American Business corpus (NAB’94). On the latter, the system presented in [17] attains a WER as low as 11.1% even though its SER is 74%. As regards user input, this disparity between WER and SER basically means that such transcripts are well suited to manual correction in that, most of the time, the user will only need to correct a few words at a time in order to repair an erroneous sentence.
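The contrast between the two measures can be made concrete with a small sketch. The code below is illustrative only: the function names `wer` and `error_rates` and the three toy sentence pairs are assumptions introduced here, and WER is normalised by the total number of reference words, which is the common convention rather than something the text prescribes.

```python
from typing import List, Sequence, Tuple


def wer(hyp: Sequence[str], ref: Sequence[str]) -> int:
    """Word-level Levenshtein distance: the minimum number of insertions,
    deletions and substitutions needed to turn `hyp` into `ref`."""
    prev = list(range(len(ref) + 1))          # distances for an empty hypothesis
    for i, h in enumerate(hyp, 1):
        cur = [i]                              # distance to an empty reference
        for j, r in enumerate(ref, 1):
            cur.append(min(prev[j] + 1,        # extra word in hyp
                           cur[j - 1] + 1,     # missing word in hyp
                           prev[j - 1] + (h != r)))  # match or substitution
        prev = cur
    return prev[-1]


def error_rates(pairs: List[Tuple[List[str], List[str]]]) -> Tuple[float, float]:
    """Return (WER, SER) for a list of (hypothesis, reference) word lists."""
    word_errors = sum(wer(h, r) for h, r in pairs)
    ref_words = sum(len(r) for _, r in pairs)
    bad_sentences = sum(1 for h, r in pairs if wer(h, r) > 0)
    return word_errors / ref_words, bad_sentences / len(pairs)


# Toy data (hypothetical): two of three sentences contain a single word error.
pairs = [
    ("the cat sat on the mat".split(), "the cat sat on a mat".split()),
    ("we like speech".split(), "we like speech".split()),
    ("turn left at the next light".split(), "turn left at the next lights".split()),
]
word_error_rate, sentence_error_rate = error_rates(pairs)
print(f"WER = {word_error_rate:.3f}, SER = {sentence_error_rate:.3f}")
```

In this toy set only 2 of 15 reference words are wrong, yet 2 of 3 sentences count as erroneous, which is the kind of gap between WER and SER that makes word-by-word correction attractive.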

In addition to high accuracy, applications often require precise time alignment between the speech signal and the transcription at a phrase [14] or word level. Given the scenario and requirements described above, the design problem which motivated the development of TRAEDRIS can be stated as follows: to design an interactive 3D graphics game for the generation of accurate speech transcripts through rapid, collaborative correction, by users on the Internet, of speech recognition results with medium to low WER.

2 The 3D Transcription Game

TRAEDRIS is a single-player game in which ASR-transcribed sentences appear on the screen as their corresponding audio is played, and move slowly towards the background. The player’s aim is to keep their screen clear by correcting each sentence before it disappears towards the background, or before it reaches the last unoccupied position on the z axis. When a sentence reaches its limit for placement on the z axis (i.e. either the position in front of a previous sentence or a pre-defined limit point, if the z horizon is empty) a confidence score is computed. If the confidence score is greater than a threshold value, the sentence simply fades out into the horizon. Otherwise, the sentence stops, “piling up” on the screen. As in Tetris, the game can be set to different initial speeds, and the speed at which the sentences move increases as the game progresses. The game ends when the z axis has accumulated enough sentences that a newly appearing sentence cannot move towards the back. The player’s total score can be computed in different ways, but for simplicity let us define it as the number of words in the sentences that faded out (i.e. those whose scores exceeded the confidence threshold).
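The rules above can be summarised in a few lines of code. This is a minimal sketch rather than the TRAEDRIS implementation: the class name `GameState`, the `depth_slots` parameter (the number of sentence positions available on the z axis) and the `settle` method are all hypothetical, and animation, speed changes and audio are left out.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class GameState:
    depth_slots: int      # assumed: how many sentences fit along the z axis
    threshold: float      # confidence threshold for fading out
    pile: List[List[str]] = field(default_factory=list)  # sentences stuck on screen
    score: int = 0        # words in sentences that faded out
    over: bool = False

    def settle(self, words: List[str], confidence: float) -> None:
        """Called when a sentence reaches its placement limit on the z axis."""
        if self.over:
            return
        if confidence > self.threshold:
            # Sentence fades into the horizon; its words count towards the score.
            self.score += len(words)
        else:
            # Sentence stops and piles up, using one position on the z axis.
            self.pile.append(words)
            if len(self.pile) >= self.depth_slots:
                # No room left for a new sentence to move back: game over.
                self.over = True


game = GameState(depth_slots=2, threshold=-1.0)
game.settle("fork handles please".split(), confidence=-0.5)  # fades out
game.settle("four candles peas".split(), confidence=-2.0)    # piles up
game.settle("for candle peace".split(), confidence=-3.0)     # piles up, ends game
print(game.score, len(game.pile), game.over)
```

Keeping the pile as an explicit list makes the end-of-game condition a simple length check; a graphical front-end would map each pile entry to a position along the z axis.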

2.1 The Transcription System

The overall architecture of the TRAEDRIS system is shown in Fig. 1. We assume that the speech input is already segmented into sentences and that these sentences are appropriately stored in a speech database. This audio signal is initially fed to the ASR system, which decodes it producing, for each sentence W, a word lattice and a confusion network which store the recognition hypotheses for that sentence along with their posterior probabilities P(W|A). This is done offline, in batch, so that all recognition hypotheses are available to the scoring module and the game’s graphical front-end when the game starts.

[Figure: block diagram with the components “speech data”, “ASR”, “word lattices”, “confusion nets”, “dictionary”, “Scoring module” and “Game interface”.]

Fig. 1. TRAEDRIS system architecture.

Word lattices are employed in computing the scores and determining the confidence thresholds for each user-corrected sentence, as explained in Section 2.2. Optionally, an external, domain-specific dictionary can be used to improve the scores when out-of-vocabulary corrections are entered. The graphical representation of the sentence candidates shown on the user interface, on the other hand, is based on the hypotheses encoded in the confusion network. Confusion nets can be regarded as compact representations of word lattices [9] and are used by TRAEDRIS, as well as other computer-assisted transcription systems [11, 12, 7], in order to simplify the correction task from the user’s perspective.
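To illustrate why a confusion network simplifies correction, the sketch below models one as an ordered list of slots, each holding competing words with their posterior probabilities. The representation, the example words and probabilities, and the helper names `best_path` and `apply_choices` are assumptions made for illustration; the empty string stands for the “null” word that lets a slot be skipped.

```python
from typing import Dict, List

# One dict per slot: candidate word -> posterior probability (hypothetical values).
ConfusionNetwork = List[Dict[str, float]]

network: ConfusionNetwork = [
    {"four": 0.6, "fork": 0.4},
    {"candles": 0.55, "handles": 0.45},
    {"": 0.7, "and": 0.3},          # "" is the null word: nothing was said here
    {"please": 0.8, "peas": 0.2},
]


def best_path(net: ConfusionNetwork) -> List[str]:
    """Pick the most probable word in every slot and drop null words."""
    words = [max(slot, key=slot.get) for slot in net]
    return [w for w in words if w]


def apply_choices(net: ConfusionNetwork, choices: Dict[int, str]) -> List[str]:
    """Override the top candidate in selected slots (slot index -> chosen word).
    A choice need not be one of the listed candidates, so the user can type
    a word the recogniser never proposed."""
    words = [choices.get(i, max(slot, key=slot.get)) for i, slot in enumerate(net)]
    return [w for w in words if w]


print(best_path(network))
print(apply_choices(network, {0: "fork", 1: "handles"}))
```

Because each slot is independent, a correction amounts to choosing one word per position, which is considerably simpler for a player than tracing paths through a full word lattice.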

The game interface also supports audio playback. Once the confusion network for a given sentence is selected for display, the corresponding speech is retrieved from the database and played back to the user.

2.2 The Scoring Method

Ideally, corrected sentences should be scored against a “gold standard”, perhaps a faithful transcription produced by a professional transcriber. However, such a scoring mechanism would defeat the point of supporting collaborative transcription in the first place. It is precisely because such reference transcriptions are not readily available for most speech content that one would like to support transcription by the user community. Since there is no gold standard, possible sources of information that can be exploited in generating a score for a user-corrected transcription are: (a) measures derived directly from the ASR system’s posterior probabilities, and (b) measures that incorporate the changes made by previous players.

The latter might involve, for instance, clustering the various versions of a sentence (based on, say, their Levenshtein distances to each other), aligning all sentences against the best cluster and then scoring the newly edited sentence against the alignments so that edits that contradict the majority get penalised. This scheme would require the system to present the same sentences to many players so as to gather a number of hypotheses large enough to produce a meaningful consensus. However, this would once again defeat the purpose of the game, which is to encourage manual correction of as much transcription as possible without repeating the same transcripts many times over to different players. We therefore chose to focus on a score derived from the posterior probabilities available through the word lattice.

Sentence posteriors in the maximum likelihood approach to ASR [6] are approximated as the product of a language model prior P(W) and acoustic likelihoods P(A|W) learnt from training data:

$$P(W \mid A) \approx P(W)\,P(A \mid W) \qquad (1)$$
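The form of equation (1) can be read as a consequence of Bayes’ rule. The short derivation below is the standard textbook step, added here as a reminder rather than taken from the original text: since the probability of the acoustic evidence P(A) does not depend on W, it can be ignored when hypotheses for the same utterance are compared with each other.

```latex
P(W \mid A) \;=\; \frac{P(W)\,P(A \mid W)}{P(A)} \;\propto\; P(W)\,P(A \mid W),
\qquad
\hat{W} \;=\; \arg\max_{W}\; P(W)\,P(A \mid W)
```

Here \hat{W} denotes the hypothesis that maximises the posterior for a given acoustic signal A.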

In this framework, the transcription hypothesis chosen by the system is the one which maximises the posterior (MAP hypothesis). It has been shown, however, that although MAP minimises SER it is not guaranteed to minimise WER [5, 9]. Since we would like our score to somehow reflect the number of words the user corrected, we need to base it on an alternative criterion of error minimisation. Goel et al. [5] propose a decision rule based on minimising the expected word error according to the posterior distribution. We derive our scoring strategy from that rule.

First, we compute the best (i.e. lowest) expected word error for the n best hypotheses in the word lattice [9]:

$$E_{\min} = \min_{i=1}^{n} \sum_{j=1}^{n} \mathrm{wer}(W_i, W_j)\,P(W_j \mid A) \qquad (2)$$

We then set the threshold τ to be a fraction of −E_min and score the user-corrected sentence W_u by weighting it against the ASR hypotheses as shown in equation (3), so that the sentence is allowed to disappear from the screen if S_u exceeds τ. The loss function l(·, ·) for this calculation can be simply wer(·, ·) or a modified edit distance which does not penalise substitutions or insertions of words missing from the ASR vocabulary but contained in the domain-specific dictionary (see Fig. 1).

$$S_u = -\sum_{j=1}^{n} l(W_u, W_j)\,P(W_j \mid A) \qquad (3)$$
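The scoring steps can be sketched as follows. This is an illustrative reading of equations (2) and (3), not the authors’ code: the function names, the `fraction` parameter (the multiplier applied to −E_min to obtain τ), the `free_words` set (dictionary words absent from the ASR vocabulary) and the toy n-best list with its posteriors are all assumptions. Each hypothesis W_j is weighted by its own posterior, as in the expected word error rule of Goel et al. [5].

```python
from typing import AbstractSet, List, Sequence, Tuple

NBest = List[Tuple[List[str], float]]   # (hypothesis W_j, posterior P(W_j|A))


def loss(user: Sequence[str], hyp: Sequence[str],
         free_words: AbstractSet[str] = frozenset()) -> int:
    """Edit distance between the user sentence and an ASR hypothesis.
    Substituting or inserting a word from `free_words` costs nothing;
    with an empty set this is the plain word-level Levenshtein distance."""
    prev = list(range(len(hyp) + 1))
    for u in user:
        step = 0 if u in free_words else 1
        cur = [prev[0] + step]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j - 1] + (0 if u == h else step),  # match/substitution
                           prev[j] + step,                         # word only in user sentence
                           cur[j - 1] + 1))                        # word only in hypothesis
        prev = cur
    return prev[-1]


def expected_loss(sentence: Sequence[str], nbest: NBest,
                  free_words: AbstractSet[str] = frozenset()) -> float:
    """Posterior-weighted loss of `sentence` against every n-best hypothesis."""
    return sum(loss(sentence, w, free_words) * p for w, p in nbest)


def e_min(nbest: NBest) -> float:
    """Equation (2): lowest expected word error among the n-best hypotheses."""
    return min(expected_loss(w, nbest) for w, _ in nbest)


def fades_out(user: Sequence[str], nbest: NBest, fraction: float,
              free_words: AbstractSet[str] = frozenset()) -> bool:
    """True if the score S_u of equation (3) exceeds the threshold tau."""
    tau = -fraction * e_min(nbest)
    s_u = -expected_loss(user, nbest, free_words)
    return s_u > tau


# Hypothetical 3-best list: each hypothesis gets exactly one word wrong.
nbest: NBest = [
    ("four handles please".split(), 0.4),
    ("fork candles please".split(), 0.3),
    ("fork handles peas".split(), 0.3),
]
corrected = "fork handles please".split()
print(fades_out(corrected, nbest, fraction=0.9))     # corrected sentence
print(fades_out(nbest[1][0], nbest, fraction=0.9))   # uncorrected hypothesis
```

In this toy case the corrected sentence lies one edit away from every hypothesis, so its expected loss is lower than that of any single ASR hypothesis and it clears the threshold, whereas an uncorrected hypothesis does not. How the fraction should be chosen in practice is left open by the text.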

The scoring mechanism can also be made to vary as new hypotheses are entered by the users (and the transcription presumably improves) by adding such sentences to the n-best list with maximum posterior probabilities.

2.3 Visualisation and Interaction

The TRAEDRIS user interface is based on an earlier prototype developed for visualisation of speech recognition results through a simulated 3D environment [8]. The system represents the recognition hypotheses as paths in a graph and displays this representation, initially in the foreground. The recognition alternatives with the greatest posterior probabilities are highlighted and connected through red-coloured edges. Hypotheses of lower likelihood are dimmed. As time passes, the word graphs move towards the background until it is no longer possible for the user to interact with them. The interface is shown in Fig. 2.

For speed, the user is allowed to move the cursor through the graph with the keyboard’s arrow keys, highlighting the path containing the corrected sentence. Special split and merge operations can be activated through key combinations if a word must be split into two or more words, or if a group of word slots must be merged into a single word. The user is also allowed to stop the sentence in mid-air a limited number of times (depending on the difficulty level set), in order to enter a correction not shown in the current confusion network by typing it in. Once a sentence reaches the horizon or the top of the pile, its score is computed in the manner indicated above, and a new confusion network is selected for display. TRAEDRIS displays the player’s current cumulative score based on the scores of the individual sentences that have been processed. At the end of the game the player’s final score and overall ranking are presented.
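One way to picture the arrow-key interaction is as a cursor over the slots of the confusion network. The sketch below is a guess at such a scheme, not a description of the actual TRAEDRIS key bindings: the `Cursor` class, the key names and the mapping (left/right to change slot, up/down to cycle through the alternatives in the current slot) are all assumptions, the network is assumed to be non-empty, and the split, merge and pause operations are omitted.

```python
from typing import List


class Cursor:
    """Keyboard cursor over a confusion network given as a non-empty list of
    slots, each a list of candidate words ordered by decreasing posterior."""

    def __init__(self, network: List[List[str]]) -> None:
        self.network = network
        self.slot = 0                          # slot currently under the cursor
        self.choice = [0] * len(network)       # selected candidate per slot

    def press(self, key: str) -> None:
        """Handle one (hypothetical) arrow-key press."""
        if key == "RIGHT":
            self.slot = min(self.slot + 1, len(self.network) - 1)
        elif key == "LEFT":
            self.slot = max(self.slot - 1, 0)
        elif key in ("UP", "DOWN"):
            step = 1 if key == "DOWN" else -1
            n = len(self.network[self.slot])
            self.choice[self.slot] = (self.choice[self.slot] + step) % n

    def sentence(self) -> List[str]:
        """Words on the currently highlighted path."""
        return [slot[c] for slot, c in zip(self.network, self.choice)]


cursor = Cursor([["four", "fork"], ["candles", "handles"], ["please", "peas"]])
for key in ["DOWN", "RIGHT", "DOWN"]:
    cursor.press(key)
print(" ".join(cursor.sentence()))
```

With a mapping of this kind, a sentence containing one or two misrecognised words can be repaired in a handful of key presses, which fits the emphasis on simple and quick actions.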

Fig. 2. TRAEDRIS user interface

3 Conclusions and Further Work

In this paper we have presented a 3D single-player game as an alternative to the conventional text editor interfaces common to most manual speech transcription correction systems. We made a design decision to keep the TRAEDRIS game rather simple in its user interaction because the actual task of transcription correction is mentally demanding, requiring the user to listen to the corresponding speech audio segments while also reading and correcting the ASR-generated transcripts. Despite its simplicity, however, TRAEDRIS presents a potentially more engaging activity than standard text-editing. Past experience has also shown that even simple games, such as Tetris, can be fun and stimulating if the player can use simple and quick actions (e.g. the keyboard arrow keys) to play the game. These improvements in user experience are meant to provide a motivation for greater involvement from the community of users of speech archives in collaboratively improving these archives.

We have also introduced a novel application and extension of the consensus technique [9] for scoring user-edited sentences. Planned future work includes a detailed evaluation of TRAEDRIS in order to clarify the type of user experience it provides, and whether it facilitates generation of more accurate transcripts.

References

1. Ainsworth, W.A., Pratt, S.R.: Feedback strategies for error correction in speech recognition systems. International Journal of Man-Machine Studies 36(6), 833–842 (Jun 1992)

2. Chelba, C., Silva, J., Acero, A.: Soft indexing of speech content for search in spoken documents. Computer Speech & Language 21(3), 458–478 (2007)

3. Désilets, A., Gonzalez, L., Paquet, S., Stojanovic, M.: Translation the Wiki way. In: WikiSym ’06: Proceedings of the 2006 international symposium on Wikis. pp. 19–32. ACM, New York, NY, USA (2006)

4. Evermann, G., Woodland, P.C.: Posterior probability decoding, confidence estimation and system combination. In: Proceedings of the Speech Transcription Workshop. College Park, MD (Oct 2000)

5. Goel, V., Byrne, W., Khudanpur, S.: LVCSR rescoring with modified loss functions: a decision theoretic perspective. In: Procs. of the IEEE Intl Conf on Acoustics, Speech and Signal Processing (ICASSP’98). vol. 1, pp. 425–428 (1998)

6. Jelinek, F.: Statistical Methods for Speech Recognition. MIT Press (1998)

7. Luz, S., Masoodian, M., Rogers, B., Deering, C.: Interface design strategies for computer-assisted speech transcription. In: Proceedings of the Australasian Conference on Human-Computer Interaction (OZCHI’08). pp. 203–210. ACM (2008)

8. Luz, S., Masoodian, M., Rogers, B., Zhang, B.: A system for dynamic 3D visualisation of speech recognition paths. In: Bottoni, P., Levialdi, S. (eds.) Proceedings of Advanced Visual Interfaces (AVI’08). pp. 482–483. ACM Press (2008)

9. Mangu, L., Brill, E., Stolcke, A.: Finding consensus in speech recognition: word error minimization and other applications of confusion networks. Computer Speech & Language 14(4), 373–400 (2000)

10. Munteanu, C., Baecker, R., Penn, G.: Collaborative editing for improved usefulness and usability of transcript-enhanced webcasts. In: Proceedings of the 26th SIGCHI Conference on Human Factors in Computing Systems (CHI’08). pp. 373–382. ACM (2008)

11. Nanjo, H., Kawahara, T.: Towards an efficient archive of spontaneous speech: Design of computer-assisted speech transcription system. The Journal of the Acoustical Society of America 120, 3042 (2006)

12. Ogata, J., Goto, M.: PodCastle: a spoken document retrieval system for podcasts and its performance improvement by anonymous user contributions. In: SSCS’09: Proceedings of the ACM Multimedia Workshop on Searching Spontaneous Conversational Speech. pp. 37–38. ACM, New York, NY, USA (2009)

13. Resnick, P., Iacovou, N., Suchak, M., Bergstrom, P., Riedl, J.: GroupLens: an open architecture for collaborative filtering of netnews. In: Proceedings of the 1994 ACM conference on Computer supported cooperative work. pp. 175–186. CSCW ’94, ACM, New York, NY, USA (1994)

14. Roy, B., Roy, D.: Fast transcription of unstructured audio recordings. In: Proceedings of the 10th Annual Conference of the International Speech Communication Association (INTERSPEECH). p. 4. Bristol, UK (2009)

15. Suhm, B., Myers, B., Waibel, A.: Multimodal error correction for speech user interfaces. ACM Transactions on Computer-Human Interaction 8(1), 60–98 (2001)

16. Völkel, M., Krötzsch, M., Vrandecic, D., Haller, H., Studer, R.: Semantic wikipedia. In: Proceedings of the 15th international conference on World Wide Web. pp. 585–594. WWW ’06, ACM, New York, NY, USA (2006)

17. Wessel, F., Schluter, R., Ney, H.: Explicit word error minimization using word hypothesis posterior probabilities. In: Procs. of the IEEE Intl Conf on Acoustics, Speech, and Signal Processing (ICASSP’01). vol. 1, pp. 33–36 (2001)

18. Zhou, Z.Y., Yu, P., Chelba, C., Seide, F.: Towards spoken-document retrieval for the internet: lattice indexing for large-scale web-search architectures. In: Proceedings of the Conference of the North American Chapter of the ACL. pp. 415–422 (2006)