Synchronized Subtitles in Live Television Programmes
4.2 Subtitling live TV programmes
The real-time subtitling of live television programmes is a multidisciplinary research field that draws on, amongst others, audiovisual translation, automatic speech recognition, respeaking, natural language processing, computer science, network transmission and broadcasting. For a better understanding of the processes involved, we should take note of the steps outlined in Figure 4.1:
1. Audio transcription: manual, semi-automatic or automatic real-time transcription of speech from one or many speakers into text.
2. Subtitle generation: the text is split into subtitles according to standards (AENOR 2012). Editing/correction and natural language processing may be included in this step.
3. Coding and packetization: coding into the appropriate subtitle protocol (e.g. DVB Sub).
4. Transmission/broadcasting: network transmission/broadcasting, including video and audio.
5. Reception, decoding and presentation on the user’s screen.
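The subtitle generation step (step 2 above) can be illustrated with a minimal sketch. The limits of 37 characters per line and two lines per subtitle, as well as the function name split_into_subtitles, are assumptions chosen for illustration; they are not taken from the AENOR standard, and a real system would also consider linguistic break points.

```python
import textwrap


def split_into_subtitles(text, max_chars=37, max_lines=2):
    """Split transcribed text into subtitles.

    Each subtitle is a list of at most max_lines lines, and each line
    holds at most max_chars characters. Both limits are assumed values.
    """
    # Wrap on whitespace so that no line exceeds max_chars.
    lines = textwrap.wrap(text, width=max_chars)
    # Group consecutive lines into subtitles of max_lines lines each.
    return [lines[i:i + max_lines] for i in range(0, len(lines), max_lines)]


if __name__ == "__main__":
    transcript = "the quick brown fox jumps over the lazy dog " * 3
    for subtitle in split_into_subtitles(transcript):
        print(subtitle)
```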
4.2.1 Audio transcription
One option for live speech transcription is the use of stenotypists, who produce the text transcription of the audio manually in real time. Although quality and speed are good, the cost of this process and the low availability of stenotypists constitute limitations as far as real-time mass subtitling is concerned. Another alternative is the use of automatic speech recognition (ASR) engines directly applied to the audio. Costs are drastically reduced, but available technology is not yet of an acceptable quality in areas requiring speaker independence and large dictionaries. Error rates are very sensitive to audio quality, signal-to-noise ratios and even the type of noise.
In order to obtain better results than with direct ASR, a technique known as respeaking may also be used. In respeaking, an intermediate respeaker uses ASR systems trained to his/her voice and specific vocabulary. The editing/correction of the generated subtitles is also common. Currently, respeaking is the normal practice in live subtitling for television and is the most common procedure in countries where live television subtitling is widely available, such as the UK, Spain, France, Germany and the USA (Eugeni 2009; Romero-Fresco 2011). ASR systems can also be used to minimize human intervention in certain television programmes, such as the news, where the text is available in advance (García et al. 2009), or can be adapted to support multiple speakers simultaneously (Wald 2008). As Boulianne et al. (2008) highlight, ASR can also be used as a remote captioning application, which should bring substantial cost savings. A complete description of the different methods used for the real-time audio transcription of live programmes can be found in Romero-Fresco (2011).
Figure 4.1 Subsystems involved in real-time subtitling on live television
Mercedes de Castro et al.
4.2.2 ASR in live subtitling
Today’s ASR systems are able to recognize arbitrary sentences with a large, but finite, vocabulary. Typical vocabulary sizes are of the order of 10,000–100,000 word forms (Bisani and Ney 2005). A large-vocabulary speech recognition system is mainly composed of the acoustic model, the language model and the decoder. The acoustic model assigns probabilities to phonetic elements (phonemes, triphones, sub-words, etc.) for every sequence of input observations. The language model takes the output sequence of the acoustic model and evaluates word probabilities. Combining the two models yields multiple word sequence hypotheses (Ruokolainen 2009). To find the best recognition hypothesis, the decoder should try all possible transcripts and pick the one with the highest probability (Siivola 2007). As a consequence, ASR does not deliver transcriptions at the same pace as the audio, but uses the current audio input to re-evaluate the most probable alternatives for earlier fragments, thus increasing accuracy. Transcriptions are held back until there is confidence that new incoming words will not change the probability of earlier ones, which usually occurs during periods of silence. The longer the voice fragment, the lower the probability of a word at the beginning being changed in the final transcript. For this reason, many systems tend to use long speech fragments before issuing a final decision, which increases accuracy but penalizes response time, because no output is produced until a full fragment has been received. It is possible to balance accuracy and delay by setting a maximum waiting time before issuing a final hypothesis. This is a critical issue in live TV subtitling.
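The trade-off between accuracy and delay described above can be sketched as an output policy. This is a deliberately simplified illustration rather than a description of any real decoder: it assumes that the recognizer delivers a stream of timed events, with None standing for a detected silence, and the names finalize_fragments and max_wait are hypothetical.

```python
def finalize_fragments(events, max_wait=3.0):
    """Group recognized words into finalized fragments.

    events: iterable of (time_s, word), where word is None during silence.
    A fragment is finalized at the first silence, or as soon as its oldest
    word has been held back for max_wait seconds, whichever comes first.
    Returns a list of (emit_time_s, words).
    """
    finalized, pending, started = [], [], None
    t = None
    for t, word in events:
        if word is not None:
            if not pending:
                started = t  # time of the oldest held-back word
            pending.append(word)
        if pending and (word is None or t - started >= max_wait):
            finalized.append((t, pending))
            pending = []
    if pending:
        finalized.append((t, pending))  # flush at the end of the input
    return finalized


if __name__ == "__main__":
    stream = [(0.0, "good"), (0.5, "evening"), (1.0, None),
              (2.0, "the"), (3.0, "news"), (4.0, "today"), (5.5, "is")]
    print(finalize_fragments(stream, max_wait=3.0))
    print(finalize_fragments(stream, max_wait=1.0))
```

A smaller max_wait yields shorter fragments that are emitted sooner, at the cost of giving the recognizer less context with which to revise earlier words.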
To mitigate the ASR quality problems that result from speaker variability, noisy environments and low-quality audio, an intermediary respeaker is usually in charge of live TV subtitling. This approach allows the phonetic model to be adjusted to the respeaker, who usually works in an isolated room; all this results in better rates of accuracy.
4.2.3 Composing a digital multimedia stream
In DTT broadcast and IPTV, TV channels are delivered to users by way of MPEG (ISO 2007) and DVB (ETSI 2009) Transport Stream coding techniques. Subtitles can be conveyed in the Transport Stream in the form of DVB Subtitle stream(s) (ETSI 2006) or as Teletext subtitles embedded in the Teletext stream (ETSI 2003). Video, audio and subtitles are assembled according to MPEG and DVB standards to create a multimedia service (the MPEG term for TV channel) transmitted over IPTV or DTT broadcast networks. The use of a common clock reference is essential to the process of multiplexing video, audio and data in the same Transport Stream (according to the MPEG structure, subtitles lie in the data category). MPEG presentation time-stamps are assigned to video, audio and subtitle packets when multiplexing takes place; for this reason, the delay between a subtitle packet and its corresponding audio packets, which is caused by the time spent on the audio transcription, editing and coding processes, is maintained during transmission and reproduction. Different coding and packetization are used for Internet TV, but the same principles apply.
As is shown in Figure 4.2, real-time subtitle generation runs in parallel with the encoding and packetization of the audio and video input signals, which results in transmission packets. The packets from the three sources (audio, video and subtitles) are finally multiplexed and transmitted. Irrespective of the method used (respeaking, direct ASR, stenotype), subtitle packets are available several seconds after the corresponding video and audio packets have been created and sent. As a result, the appearance of subtitles on the user’s screen will be out of step with the audio/video by several seconds.
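The way in which the transcription delay is preserved by time-stamping can be sketched as follows. MPEG presentation time-stamps count ticks of a 90 kHz clock; everything else here is an assumption made for illustration, including the function names, the neglect of audio/video encoding delay and the 4-second transcription delay used in the example.

```python
PTS_CLOCK_HZ = 90_000  # MPEG presentation time-stamps count 90 kHz ticks


def to_pts(seconds):
    """Convert a time in seconds on the common clock into PTS ticks."""
    return round(seconds * PTS_CLOCK_HZ)


def stamp_at_multiplex(audio_time_s, transcription_delay_s):
    """Return (audio_pts, subtitle_pts) for one spoken fragment.

    Each packet is stamped when it reaches the multiplexer, so the
    subtitle packet carries a time-stamp that is later than that of the
    audio it transcribes by the full transcription delay.
    """
    audio_pts = to_pts(audio_time_s)
    subtitle_pts = to_pts(audio_time_s + transcription_delay_s)
    return audio_pts, subtitle_pts


if __name__ == "__main__":
    audio_pts, subtitle_pts = stamp_at_multiplex(10.0, 4.0)
    offset_s = (subtitle_pts - audio_pts) / PTS_CLOCK_HZ
    print(audio_pts, subtitle_pts, offset_s)
```

Because the receiver presents each packet according to its time-stamp, the offset computed here reappears unchanged on the user’s screen.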
Figure 4.2 Delay of several seconds between audio/video and subtitles in the transcription process [diagram labels: video; audio; coding and packetization; subtitle generation; statistical multiplexing; subtitle is several seconds later]
4.2.4 The quality of live subtitling
Spanish standards concerned with subtitle quality (AENOR 2012) not only address the formal presentation aspects such as use of colours, the number of lines or reading time, but also encompass the content of the subtitles and refer to parameters such as literality, density and synchronization.
Literality reflects the closeness of the written text to the spoken words in the audio, whereas density measures the number of words per minute presented in the subtitles according to the assumed reading speed of the viewers (Romero-Fresco 2011). Synchronization is related to the ideal time-in and time-out settings, enabling subtitles to appear on the screen in synchrony with the audio and images. Density and literality are closely related and a compromise is necessary when the word rate of the speakers exceeds the viewers’ reading speed. Lack of synchronization is highly disturbing as it creates dissociation between the essential elements within the audiovisual communication and tends to be the main reason for complaint.
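As an illustration of the density parameter, the following sketch computes the word rate of a single subtitle from its time-in and time-out and compares it with an assumed reading speed. The threshold of 180 words per minute and both function names are hypothetical values chosen for the example, not figures taken from the standard.

```python
def words_per_minute(text, time_in_s, time_out_s):
    """Word rate of a subtitle shown from time_in_s to time_out_s."""
    duration = time_out_s - time_in_s
    if duration <= 0:
        raise ValueError("time-out must be later than time-in")
    return len(text.split()) * 60.0 / duration


def exceeds_reading_speed(text, time_in_s, time_out_s, max_wpm=180.0):
    """True if the subtitle is denser than the assumed reading speed."""
    return words_per_minute(text, time_in_s, time_out_s) > max_wpm


if __name__ == "__main__":
    subtitle = "one two three four five six"
    print(words_per_minute(subtitle, 0.0, 2.0))
    print(exceeds_reading_speed(subtitle, 0.0, 1.5))
```

When the check fails, the subtitler has to choose between condensing the text, which lowers literality, and extending the time on screen, which affects synchronization.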
4.2.5 Subtitle delays on live TV
In live subtitling, real-time constraints strongly affect all the quality parameters noted above. Literality and synchronization conflict with one another and, as will be shown in the following paragraphs, significant delays occur between audio/video and subtitles. In ASR-based live subtitling environments, the more closely the written text matches the speech, the greater the delay, so any solution intended to minimize the negative impact of subtitle delays on the audience must always take accuracy into account. According to certain studies, the accuracy achieved can, in some cases, be close to 97 or 98 per cent (Lambourne et al. 2004). An exception to this trade-off occurs when the speech content is known in advance and pre-prepared subtitles can then be broadcast synchronously, either with human intervention or automatically (García et al. 2009; Gao et al. 2010).
Taking into account the fact that audio, video and subtitles undergo parallel coding, packetization and transmission processes, it is in the audio transcription subsystem (see Figure 4.1) where delays between audio/video and subtitles are generated. According to Romero-Fresco (2011), when a stenotype is used, delays are small if subtitles are emitted word by word, but may be significant if block subtitles are used.
With respeaking, the main sources of delay with regard to the audio include the time needed by the respeaker to listen to an audio fragment, the time needed to respeak it into an ASR system and the time the ASR needs to produce the transcription. Respeakers can be trained to insert silences so that the ASR can generate shorter text strings and thus reduce transcription time. The additional tasks of correcting errors and adding punctuation and colours also add to the overall time required before the final subtitles are obtained (Wald et al. 2007).
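The delay components listed above can be combined in a simple model. The sketch below treats the stages as strictly sequential, which is an assumption that gives an upper bound, since in practice a respeaker may listen and speak at the same time. The class name and the durations in the example are hypothetical and are not measurements.

```python
from dataclasses import dataclass


@dataclass
class RespeakingDelay:
    listen_s: float      # respeaker listens to an audio fragment
    respeak_s: float     # respeaker dictates the fragment to the ASR system
    asr_s: float         # ASR produces the transcription
    edit_s: float = 0.0  # error correction, punctuation and colours

    def total(self):
        """Total subtitle delay if the stages do not overlap."""
        return self.listen_s + self.respeak_s + self.asr_s + self.edit_s


if __name__ == "__main__":
    # Illustrative durations only.
    delay = RespeakingDelay(listen_s=2.0, respeak_s=2.0, asr_s=1.5, edit_s=1.0)
    print(delay.total())
```

In this model, inserting silences so that the ASR works on shorter strings corresponds to lowering asr_s for each fragment.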