ASR transcripts versus human transcripts and written text

1.1 Spoken Content Retrieval system overview

1.1.2 ASR transcripts versus human transcripts and written text


The purpose of automatic speech recognition (ASR) is to identify the words spoken in the audio stream. Without this transformation, the information contained in the audio stream remains so-called “tacit knowledge” that does not easily lend itself to structured representation and is harder to transfer from one person to another (Brown et al., 2001). Figure 1.2 shows the main principles of how a transcript is created by a statistical ASR system (Jurafsky and Martin, 2000). Acoustic and language models of the data to be recognised are built from selected training data (labelled audio signals and collections of texts that use the same language as the targeted spoken content). The same front-end feature extraction module is used both to train the acoustic model and to extract the acoustic features from the input signal. The ASR decoder finds the sequence of symbols that is most probable according to the acoustic, language and pronunciation models.
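
The decoding step described above is commonly written as a maximisation over candidate word sequences. As a sketch in standard noisy-channel notation (the symbols O, for the sequence of acoustic feature vectors, and W, for a candidate word sequence, are assumptions introduced here and are not notation from the text):

```latex
\hat{W} = \operatorname*{arg\,max}_{W} P(W \mid O)
        = \operatorname*{arg\,max}_{W} P(O \mid W)\, P(W)
```

Under this reading, P(O | W) is supplied by the acoustic model, with the pronunciation model linking words to the sub-word units that the acoustic model scores, and P(W) is supplied by the language model. The term P(O) is dropped because it does not depend on W.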

The transcript created by a statistical ASR system always has a certain probability assigned to each constituent unit (e.g. sub-word, phoneme, syllable or word) that reflects how reliable that unit in the transcript might be (Jurafsky and Martin, 2000; Huang et al., 2001). As the spoken content might be informal, and the conversation might go back and forth between several topics, even 100% accuracy of the transcription cannot guarantee its readability. In practice, the transcript is even harder to read because of recognition errors and the absence of structures such as sentences and paragraphs. While conversion into textual format does not necessarily produce a readable transcript, it does mean that the content becomes easier to store and to process within other applications, e.g. an IR system (Brown et al., 2001; Lee and Chen, 2005; Goldman et al., 2005; Chelba et al., 2008). However, when the transcript is processed in the same way as written text, it has to be treated carefully, bearing in mind that potential ASR errors may affect further processing. Important meaningful words spoken in the audio data might not be recognized correctly by the ASR system, in which case they will be replaced in the transcript by other, incorrect words, which may be either meaningful words or common words. When the substitute is a common word, it is usually removed as a stop word by the IR system; such errors decrease the chances of the corresponding item being retrieved, since important information has been lost or changed, but they do not result in false retrieval. Substitution by a more meaningful incorrect word may be more problematic, since that word is indexed and can match unrelated queries.
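
The difference between the two kinds of substitution can be illustrated with a minimal sketch. The stop-word list, the transcripts and the helper `index_terms` below are all hypothetical and invented for illustration; a real IR system would use its own tokenisation and stop list.

```python
# Hypothetical stop-word list; a real IR system would use a much longer one.
STOP_WORDS = {"the", "a", "of", "in", "is", "to", "and", "at", "on", "was"}


def index_terms(transcript: str) -> set[str]:
    """Lower-case, split on whitespace, and drop stop words."""
    return {w for w in transcript.lower().split() if w not in STOP_WORDS}


# Invented example: the spoken content word "glacier" is misrecognised.
reference = "the glacier retreated in the valley"
asr_common = "the was retreated in the valley"            # replaced by a common word
asr_meaningful = "the gladiator retreated in the valley"  # replaced by a content word

ref_terms = index_terms(reference)

# Terms lost from the index (missed retrieval) in each case.
lost_common = ref_terms - index_terms(asr_common)
lost_meaningful = ref_terms - index_terms(asr_meaningful)

# Terms wrongly added to the index (possible false retrieval) in each case.
spurious_common = index_terms(asr_common) - ref_terms
spurious_meaningful = index_terms(asr_meaningful) - ref_terms

print(lost_common, spurious_common)
print(lost_meaningful, spurious_meaningful)
```

In both cases the term "glacier" is lost, so the item is less likely to be retrieved for a relevant query. Only the second case adds a spurious index term, which is what may lead to false retrieval.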

ASR systems vary depending on the task for which they have been created. Early in their development, when computational and model training resources were limited, the target was the recognition of individual sounds and isolated words (Rabiner and Levinson, 1981). Current ASR technologies, which use much larger training sets and more powerful computers, can deal with continuous speech in a variety of environments. The actual quality of the ASR result depends on the data the system was built with, the form of speech to be recognized, the acoustic environment, and the hardware used for data capture and recognition.

Large vocabulary continuous speech recognition (LVCSR) systems generally attempt to provide a full transcript of the spoken input. They require the collection of data to build detailed acoustic models that correspond to the data to be recognized, and large amounts of text to build language models. However, the vocabulary of the system is limited to that chosen when the system was constructed, and any word outside this selected vocabulary cannot be recognized correctly. This is therefore the first source of potential errors in ASR, even if the rest of the transcript is perfect. As outlined above, errors in the transcript can arise from other sources as well. If the word sequence that was actually spoken has a low probability of occurrence, it might be replaced in the final ASR output by a similar-sounding sequence with a higher overall probability. Alternatively, the sounds may be best matched to the wrong acoustic models, so that further decoding of the sounds into words results in a high error rate.
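
The effect of a closed vocabulary can be sketched as follows. The vocabulary, the spoken utterance and the helper `oov_words` are hypothetical and chosen only for illustration; real LVCSR vocabularies contain tens of thousands of entries.

```python
def oov_words(spoken_words: list[str], vocabulary: set[str]) -> list[str]:
    """Return the spoken words that are outside the recogniser vocabulary."""
    return [w for w in spoken_words if w not in vocabulary]


# Hypothetical closed vocabulary fixed when the system was constructed.
vocabulary = {"the", "meeting", "starts", "at", "noon", "in", "room"}

# Invented utterance containing one word the system has never seen.
spoken = "the meeting starts at noon in dublin".split()

missed = oov_words(spoken, vocabulary)
oov_rate = len(missed) / len(spoken)

print(missed)
print(f"OOV rate: {oov_rate:.3f}")
```

Each out-of-vocabulary word is an error that no amount of acoustic or language model quality can remove, because the decoder can only output words from its vocabulary.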

The vocabulary of an ASR system may be larger than the set of words that actually appears in the output transcript, because certain word combinations are made more likely by the training data and the language modelling in the ASR system, and these appear in the output instead of the potentially correct ones. Thus the ASR transcript output is limited not only by the ASR system vocabulary, but also by the word probabilities that are learnt from the training data collection (Jones et al., 2007).
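
How learnt word probabilities can override what was actually said may be sketched with two competing hypotheses. All scores below are invented log-probabilities, and `total_score` with its `lm_weight` parameter is a hypothetical simplification of how a decoder combines its models.

```python
# Invented log-probabilities for two similar-sounding hypotheses.
# Assume the speaker actually said "wreck a nice beach".
candidates = {
    "wreck a nice beach": {"acoustic": -11.5, "lm": -14.0},
    "recognise speech": {"acoustic": -12.0, "lm": -9.5},
}


def total_score(scores: dict[str, float], lm_weight: float = 1.0) -> float:
    """Combine acoustic and language model log-probabilities."""
    return scores["acoustic"] + lm_weight * scores["lm"]


# With the language model, the more probable word sequence wins.
best = max(candidates, key=lambda hyp: total_score(candidates[hyp]))

# Without it (weight 0), the better acoustic match wins.
best_acoustic_only = max(
    candidates, key=lambda hyp: total_score(candidates[hyp], lm_weight=0.0)
)

print(best)
print(best_acoustic_only)
```

In this toy setting the spoken sequence is the better acoustic match, yet the sequence favoured by the language model is the one that reaches the transcript, even though every word of the spoken sequence could be in the vocabulary.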

The structure that is usually an intrinsic part of a written text is much harder to deduce from an audio stream (Lee and Chen, 2005). For example, while ASR transcripts may contain information about the pauses between speech segments and about speaker changes, the usefulness of these units depends heavily on the context, and they are hard to generalize upon. The same speaker may talk throughout a long lecture and cover many topics, so segmentation at the speaker level will not be very helpful. In the case of conversations, a part of the discussion involving several speakers may represent a single relevant segment, so again speaker segmentation might not be useful for this task. The pauses that a speaker naturally takes to breathe, or makes deliberately as a break in the delivery, cannot be relied upon as indicators of topic structure, because they do not always correspond to a topic break. They might instead mark an important speech segment within the same topic.

While sentence segmentation is currently available for some ASR systems and languages (Gauvain et al., 2002), it does not feature in most ASR systems. Segmenting ASR-generated transcripts is therefore a harder task than segmenting manually written text, because topical segmentation methods, e.g. the TextTiling and C99 algorithms (Hearst, 1993; Choi, 2000), usually rely on sentences. It is possible to use so-called pseudo sentences, assuming that a full stop should be put in the transcript after every N words; however, this does not reflect actual semantically coherent sentence units.
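
The pseudo-sentence approach can be sketched in a few lines. The function name `pseudo_sentences`, the example transcript and the choice of N = 5 are assumptions made for illustration only.

```python
def pseudo_sentences(words: list[str], n: int) -> list[str]:
    """Group a flat word sequence into consecutive units of n words.

    The last unit may be shorter when len(words) is not a multiple of n.
    """
    if n <= 0:
        raise ValueError("n must be a positive integer")
    return [" ".join(words[i:i + n]) for i in range(0, len(words), n)]


# Invented unpunctuated transcript, as an ASR system might output it.
transcript = (
    "so today we look at speech retrieval and then we move on to evaluation"
)

units = pseudo_sentences(transcript.split(), n=5)
for unit in units:
    print(unit + ".")
```

The unit boundaries fall wherever the word count dictates, for instance between "look at" and "speech retrieval" here, which illustrates why such units are not semantically coherent sentences.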

Overall, an ASR transcript is not equivalent to a written text for several reasons: the words contain potential errors; the ASR vocabulary and internal weighting pose potential limitations on the words that can appear in the transcript; and finally, the transcript is harder to read and to segment further because it lacks structural information, even when it is perfect at the level of words. In addition, much spoken data (particularly spontaneous content) has a different linguistic structure from written text. Since the language model (LM) of an ASR system is often trained on written text, it is often not a good model of the spoken data that is to be recognized. Developments in the ASR field are seeking to close this gap; however, the technology is as yet far from perfect, especially because ASR systems usually need costly adaptation of both their acoustic and language models when there is a change in the content to be recognized.

In document Towards effective retrieval of spontaneous conversational spoken content (Page 32-35)