Paul Thompson
1. What are specialised audio-visual corpora and what are they used for?
Writing a chapter about building audio-visual corpora is a challenge as this is an area of considerable growth in corpus linguistics, computational linguistics, behavioural sciences and language pedagogy, among others, and, by the time this chapter appears, it is likely that technological advances will have moved the field substantially further forward.
In broad terms, an audio-visual corpus is a corpus that consists of orthographic transcripts of spoken language communication events, and the audio and/or video recordings of the original events. Such a corpus is likely to have links in the transcripts, which makes it possible to locate the relevant parts of the audio-visual records. In a basic form the links could consist of indexical information included with the transcripts which would allow the researcher to find the section of the recording manually, but in a more sophisticated form such annotation, included in an electronic document, would allow the user to click on a button or activate a hyperlink within the electronic version of the transcript and automatically open the file in a media player at the exact point. A further, alternative type of audio-visual corpus is one in which existing audio-visual texts, such as films, poster advertisements or online news pages, are annotated for multimodal analysis (see Adolphs and Knight, this volume). Such annotations may be organised on a range of levels, coding features such as voice, music, other sounds, graphically represented words, hand gestures, facial gestures, location, and so on, and these codes can be organised in parallel rows or columns. Baldry and Thibault (2006), for example, present a framework for transcription and analysis of multimodal texts using television advertisements and websites as example texts, and their approach can be applied to collections of multimodal texts.
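To make the idea of indexical links concrete, the following is a minimal sketch in Python, not a description of any existing corpus tool. It assumes that each stretch of transcript is stored with the name of its recording and its start and end times in seconds; the `Segment` structure, the function names and the file name `consultation01.mp4` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    recording: str  # hypothetical media file name
    start: float    # seconds from the start of the recording
    end: float
    text: str       # orthographic transcript of this stretch of talk


def find_offsets(segments, phrase):
    """Return (recording, start time) for every segment containing the phrase."""
    phrase = phrase.lower()
    return [(s.recording, s.start) for s in segments if phrase in s.text.lower()]


def as_timestamp(seconds):
    """Format seconds as HH:MM:SS so a researcher can locate the point manually."""
    total = int(seconds)
    return f"{total // 3600:02d}:{(total % 3600) // 60:02d}:{total % 60:02d}"


# Invented sample data, for illustration only.
segments = [
    Segment("consultation01.mp4", 0.0, 4.2, "good morning how are you feeling today"),
    Segment("consultation01.mp4", 4.2, 9.8, "not too bad but the cough is still there"),
]

hits = find_offsets(segments, "cough")
for recording, start in hits:
    print(recording, as_timestamp(start))
```

In the basic form described above, the printed timestamp is all the researcher receives; in the more sophisticated form, the same offset would be passed to a media player so that playback begins at that point.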
A specialised audio-visual corpus may therefore contain recordings of sets of spoken language events that are used for analysis of situated language behaviours in specialised settings – such as doctor–patient consultations, child–caregiver interactions or classroom task activities – or it may contain samples of certain categories of multimodal texts. The purpose of constructing an audio-visual corpus is to make it possible to identify relationships between the non-linguistic and linguistic features of human or textual interaction, or to allow access to information that supplements the plain orthographic transcription. In this chapter, the focus is primarily on corpora in which the transcripts are linked to the video or audio recordings, or in which the video data have been made searchable for certain coded features.
Some linguistic investigations are more heavily dependent on audio-visual information than others. An example of the former is the study of sign languages, which are gestural and where the facility to record language performance on video (frontal view of the signer, with facial expression and hand gestures clearly presented) constitutes an excellent alternative to simple orthographic representation, or to a succession of still photographs, each portraying a single gesture. Such a project does, however, also present its own challenges as the video data have to be searchable by some means. If one is to look within a sign language corpus for the representation of ‘a large red ball’, for example, one has to have either the means to enter the orthographic form ‘a large red ball’ (which would use non-sign language means to retrieve sign language representations), or some graphic means by which a sign language user could formulate a non-orthographic query capable of locating all examples of this concept within the corpus. The British Sign Language corpus and the corpus of German Sign Language data are two major projects building large-scale audio-visual corpus resources for the sign linguistics community.
There are many purposes for which linked transcript and video data can be used. In language teaching, the presentation of communicative events visually as well as orthographically can help the learner to relate language use to the contexts in which it occurs. An audio-visual corpus can be used in the same way as a multimedia language learning package, except that it also offers the user the opportunity to retrieve multiple examples of a phrase or a grammatical structure and hear/see those examples, one after another. The EU-funded SACODEYL project, for instance, exploits clips of commissioned video recordings of teenagers, from seven different European language groups, speaking about their interests, experiences, friends and families, and the SACODEYL website (see also Chambers, this volume) contains language learning activities which prompt students to watch clips from the videos and search for answers to set questions. At one level, the video provides language learners with good listening practice, with orthographic transcripts provided so that the learner can check his or her understanding, but, on another level, the learner can search the data to locate certain features. The SACODEYL data have been annotated so that one can search by topic, grammatical point and part of speech, among other features, and one can also do concordance searches. When the concordance lines appear, it is then possible to select any one line, click ‘Go to section’, and open the relevant wider section of the transcript. The learner can then choose to view that section of the video, online.
Another example of the use of an audio-visual corpus is in the investigation of language use in education. The Singapore Corpus of Research in Education (SCoRE) project at the Centre for Research in Pedagogy and Practice, National Institute of Education, Singapore, is collecting recordings of classroom interactions in a variety of subject areas (English, Mandarin, Malay, Tamil, Maths, Science) in different levels of education in Singapore. The corpus interface allows the user to search for words or phrases in the corpus and then choose to view a video clip (if available) or listen to an audio clip. The corpus data have also been annotated on a number of levels: it is consequently possible to search by part of speech, by semantic category or by syntactic, pragmatic or pedagogical features. The recordings have been divided into speaker turns, and for each turn there is a sound file. The user is given the choice for any search to receive the results in turns (in other words, with each word shown within the full turn of the speaker) rather than as KWIC concordances; if this option is chosen, the user is given the text for each turn in which the search items occur and also a link to the audio or video file. In addition to the access to the audio-visual material, the interface also generates statistics on the frequency of occurrence of each feature in each file, both in raw terms and also as a percentage of the entire file.
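The turn-based presentation of results and the per-file statistics can be illustrated with a short sketch. It is not the SCoRE implementation: the field names, file names and sample turns are invented, tokenisation is reduced to splitting on white space, and the percentage is calculated against the number of tokens in each file, which is only one possible reading of ‘percentage of the entire file’.

```python
from collections import defaultdict

# Invented sample turns; "audio" holds the per-turn sound file name.
turns = [
    {"file": "maths01", "speaker": "T", "audio": "maths01_t001.wav",
     "text": "what is the answer to question two"},
    {"file": "maths01", "speaker": "S", "audio": "maths01_t002.wav",
     "text": "the answer is four"},
    {"file": "science01", "speaker": "T", "audio": "science01_t001.wav",
     "text": "today we look at plants"},
]


def search_turns(turns, word):
    """Return whole turns (not KWIC lines) in which the word occurs."""
    word = word.lower()
    return [t for t in turns if word in t["text"].lower().split()]


def frequency_by_file(turns, word):
    """Return {file: (raw count, percentage of all tokens in that file)}."""
    word = word.lower()
    counts = defaultdict(int)
    totals = defaultdict(int)
    for t in turns:
        tokens = t["text"].lower().split()
        totals[t["file"]] += len(tokens)
        counts[t["file"]] += tokens.count(word)
    return {f: (counts[f], 100 * counts[f] / totals[f]) for f in totals}


for t in search_turns(turns, "answer"):
    print(t["speaker"], t["text"], "->", t["audio"])

stats = frequency_by_file(turns, "answer")
```

Each hit carries the full text of the turn together with the name of its sound file, which is the information an interface would need in order to offer a playback link beside each result.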
Constructing an audio-visual corpus involves providing the links between the transcript and the audio or video files. In the previous example, that of the SCoRE, the corpus developers have devoted an enormous amount of time, resources and expertise to preparing the corpus. Recording the data is in itself a major task, but after that the recordings have to be transcribed and speaker turns identified. Audio files for each turn are then created and given unique identifying names. The transcripts are annotated for the various features mentioned above, and the information stored in a searchable database. The interface then has to be built, trialled, revised and extended, exploiting existing technologies. Not every audio-visual corpus will have the same levels of multilayered annotation as the SCoRE but it has to be recognised that working with audio-visual corpora is a demanding enterprise.
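The step of creating one audio file per turn can be sketched with Python's standard `wave` module. This is an illustration under stated assumptions rather than an account of how the SCoRE developers worked: it assumes uncompressed .wav input, turn boundaries already known in seconds, and a hypothetical naming scheme of the form `lesson01_turn001.wav`. A short silent recording is generated so that the example is self-contained.

```python
import os
import tempfile
import wave


def split_turns(source_path, turns, out_dir, prefix):
    """Write one WAV file per (start, end) turn and return the file names."""
    names = []
    with wave.open(source_path, "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        for i, (start, end) in enumerate(turns, 1):
            # Move to the first frame of the turn and read its duration.
            src.setpos(int(start * rate))
            frames = src.readframes(int((end - start) * rate))
            # Hypothetical naming scheme: recording name plus turn number.
            name = f"{prefix}_turn{i:03d}.wav"
            with wave.open(os.path.join(out_dir, name), "wb") as dst:
                dst.setparams(params)
                dst.writeframes(frames)
            names.append(name)
    return names


# Create a three-second silent mono recording as stand-in data.
work = tempfile.mkdtemp()
source = os.path.join(work, "lesson01.wav")
with wave.open(source, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(8000)
    w.writeframes(b"\x00\x00" * 8000 * 3)

names = split_turns(source, [(0.0, 1.0), (1.0, 2.5)], work, "lesson01")
print(names)
```

Even in this reduced form the sketch shows why the task is demanding: every turn boundary has to be established accurately before any file can be cut.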
An alternative way to work with audio data is to use a popular concordance program such as WordSmith Tools (Scott 2008) with a corpus of transcripts and audio recordings (see chapters by Scott and Tribble, this volume). Such a solution might be more appropriate for end-users who are trained in the use of the particular computer program for corpus analysis work and who, on a specific investigation, require access to the audio files for closer analysis. In the case of a study of phraseology in seminar talk, for example, analysts may want to be able to do concordance searches in a corpus of seminar transcripts for a variety of lexical chunks. Within WordSmith Tools, provided the corpus has been prepared in advance and the program’s tag settings have been configured, the user can click in the Tag column of WordSmith Tools Concord to activate the audio player at the right point, and hear the intonational contours of the lexical chunks. To prepare the files, one needs to insert tags into the transcripts that refer to the audio recordings (the default audio file formats supported in WordSmith Tools are .mp3 and .wav but other formats can be accommodated). An example of the tagging is as follows, where the first tag is placed at the part of the transcript referred to and it identifies the .mp3 file that is to be played, while the closing tag indicates where the recording ends:
<soundfile name=ah02e001.mp3> on a double-sided sheet and once again i haven’t put a summary on this one but what i have put </soundfile>
The Help files for the program offer some guidance in this, but, again, it must be recognised that this is a time-consuming task, and there are several complexities involved. The files can be set up in such a way that it is possible to listen to small clips of the audio files, as in the above example, but this requires creating many small files, with a high degree of precision, from the original audio recording. The more fine-grained the detail, the more time-intensive the task, but without fine granularity the corpus may be too limited in its uses. Another point that needs to be taken into account is that annotation of the data in order to make it useable in WordSmith Tools would not necessarily make it useable in other applications. In other words, the corpus is then tied into a particular package, when a more useful solution would be to make it useable in a range of programs.
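One way of reducing the dependence on a single package is to read the sound file tags back out of the transcripts, so that the same pairing of clip and text can be exported to another format. The sketch below uses only Python's standard `re` module and does not call WordSmith Tools in any way. The pattern is an assumption based solely on the example tag shown above, and it tolerates optional spaces inside the tags; the function name `extract_clips` is hypothetical.

```python
import re

# Matches an opening soundfile tag, the transcript text, and the closing tag.
TAG = re.compile(
    r"<\s*soundfile\s+name\s*=\s*([^\s>]+)\s*>(.*?)<\s*/\s*soundfile\s*>",
    re.DOTALL,
)


def extract_clips(transcript):
    """Return (audio file name, transcript text) pairs found in a tagged transcript."""
    return [
        (m.group(1), " ".join(m.group(2).split()))  # normalise white space
        for m in TAG.finditer(transcript)
    ]


# Shortened version of the example tag, for illustration.
sample = "<soundfile name=ah02e001.mp3> on a double-sided sheet and once again </soundfile>"
clips = extract_clips(sample)
print(clips)
```

The resulting list of pairs is independent of any particular concordancer and could be written out in whatever form another program requires.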
So far we have proposed a number of reasons why a researcher might want to build an audio-visual corpus, and we have identified some of the ways in which a researcher might link audio and video inputs with orthographic transcripts. The purpose of the rest of this chapter is to give an overview of what the process of building an audio-visual corpus entails, from initial conception through project design to data collection, processing and finally the development of tools and interfaces for exploitation. Design criteria and data collection are discussed in Section 2, and transcription and annotation issues are reviewed in Section 3.
There are several tools available for the development of audio-visual corpora which make the job of linking points in the transcript to points in the video and audio files much easier. Some of these tools tie the developer into the proprietary system, while others use systems which have a higher degree of potential for interchangeability. A number of these tools will be discussed in Sections 3 and 4. As suggested in the first paragraph of this chapter, technology is advancing quickly, and it is dangerous to provide too much information on specific tools and platforms, so the discussion below will not attempt to be exhaustive. It is useful at this point to suggest that XML technologies offer flexibility (the ‘X’ in XML stands for ‘extensible’) and power, and that with researchers now starting to build better tools and interfaces for handling XML documents, it is probable that XML will become a standard for audio-visual corpora in the future. The final section of this chapter looks towards the future and speculates on what advances may be made in the coming years.
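As an indication of what an XML representation might look like, the following sketch stores each utterance with its speaker and its start and end times, and reads the document with Python's standard `xml.etree.ElementTree` module. The element and attribute names (`transcript`, `u`, `who`, `start`, `end`, `media`) are invented for this illustration and do not reproduce any published schema.

```python
import xml.etree.ElementTree as ET

# Invented sample document: one recording, two time-aligned utterances.
doc = """<transcript media="seminar01.mp4">
  <u who="S1" start="0.0" end="3.4">shall we start with the handout</u>
  <u who="S2" start="3.4" end="7.1">yes it is on a double-sided sheet</u>
</transcript>"""

root = ET.fromstring(doc)


def utterances_by(root, speaker):
    """Return (start time in seconds, text) for each utterance by the speaker."""
    return [
        (float(u.get("start")), u.text)
        for u in root.findall("u")
        if u.get("who") == speaker
    ]


print(root.get("media"), utterances_by(root, "S2"))
```

Because the timing information is held in ordinary attributes, further layers of annotation can be added as new attributes or elements without disturbing programs that read only the existing ones.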