Representing Speech with Subwords - Exploiting Automatic Speech Recognition Output


4.3 Representing Speech with Subwords

The challenge of covering all possible words is generally referred to, more prosaically, as “The OOV Problem.” Recall that, in order to recognize a word, an ASR system must have that word included in its vocabulary. For many SCR use scenarios, it is not possible to assume that all necessary words can be known in advance. For this reason, SCR systems suffer from the problem of out-of-vocabulary (OOV) words: words that are encountered in the speech signal, but are not contained in the vocabulary. In order to address the OOV problem, words are not recognized directly, but rather as sequences of smaller building blocks, called subwords. This subsection presents an introduction to subwords that will provide the necessary background for the discussion that follows.

4.3.1 Introduction to Subwords

A subword is a unit that is smaller than an orthographic word. It may correspond to a linguistic unit (cf. subsection 3.1), such as a phoneme, syllable or morpheme, but it may also be any small unit that is used by an SCR system. The main motivation for SCR systems to use subword indexing units is to address the OOV problem. A subword inventory represents the speech stream with a smaller, more flexible set of building blocks, which gives greater coverage of the spoken content than a predefined vocabulary. This subsection provides a background on the principle of subword units.

Subword indexing units also have the potential of addressing the general problem of error. Although OOV is a major contributor to ASR error, acoustic mismatches and sub-optimal pronunciation modeling can lead to the recognizer mis-recognizing an in-vocabulary word, substituting another word or word string in its place. The substituted word string represents a “best fit” with the signal and for this reason stands to share a large degree of similarity with the correct word. Subwords make it possible for the retrieval system to exploit partial matches (also referred to as “inexact matches” or “fuzzy matches”) within speech transcripts. These partial matches are particularly useful in situations where unexpected pronunciations or channel conditions cause a recognition error in the ASR transcripts.

The creation of a subword inventory for the representation of spoken content follows one of two basic strategies: either subwords are based on the orthographic forms of words or on word phonemizations.

Orthographic subwords are derived from the written forms of words and consist of a sequence of graphemes. For example, under the orthographic subword approach “knowledge” would be represented as know ledge. The advantage of this method is that a text corpus can be easily decomposed into subwords for the purposes of training a subword language model for the speech recognizer. Component orthographic subwords are similar for words with similar spellings. In this example, “knowledge” (know ledge) shares a common subword with “knowing” (know ing) and with “acknowledging” (ac know ledg ing). In the case of orthographic subwords, it is necessary to provide their pronunciations to the ASR system in the lexicon. A single subword can have multiple pronunciations. For example, orthographic subword know is pronounced differently in “knowledge” and “knowing.”
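The sharing of orthographic subwords between similarly spelled words can be sketched in a few lines. This is a minimal illustration only: the decompositions are copied by hand from the example above, and the names `ORTHOGRAPHIC_SUBWORDS` and `shared_subwords` are hypothetical. A real system would derive the decompositions automatically with a segmentation method of its own choosing.

```python
# Hand-written decompositions taken from the running example in the text.
# ASSUMPTION: a real system would produce these automatically.
ORTHOGRAPHIC_SUBWORDS = {
    "knowledge": ["know", "ledge"],
    "knowing": ["know", "ing"],
    "acknowledging": ["ac", "know", "ledg", "ing"],
}


def shared_subwords(word_a, word_b, inventory=ORTHOGRAPHIC_SUBWORDS):
    """Return the sorted subwords that the two words have in common."""
    return sorted(set(inventory[word_a]) & set(inventory[word_b]))


print(shared_subwords("knowledge", "knowing"))        # ['know']
print(shared_subwords("knowledge", "acknowledging"))  # ['know']
print(shared_subwords("knowing", "acknowledging"))    # ['ing', 'know']
```

Note that the lookup works purely on spelling: the shared unit know is found even though it is pronounced differently in the two words, which is why the lexicon must list pronunciations per subword.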

Phonetic subwords are derived from word pronunciations. Here, subwords are represented as strings of phonemes. For example, under the phonetic subword approach “knowledge” would be represented as n A l I dZ. In this case, component phonetic subwords are similar for words with similar pronunciations. Under the phonetic subword approach, “knowledge” (n A l I dZ) does not share a common subword with “knowing” (n o w I N).1 Notice, however, that it shares two common subwords with “acknowledging” (I k n A l I dZ I N). Under the phonetic subword approach there is only a single pronunciation per subword. The number of different orthographic subwords that map to a phonetic subword with a single pronunciation varies from language to language. Another important language-dependent effect is the number of shared subwords resulting when semantically related words are decomposed.

1 The underscore, ‘_’, in the phonetic representation is used for readability, but also can be

The error compensation potential of subword units can be best illustrated by considering an example. We choose to examine the orthographic subword decomposition of the word “Shenandoah.” The same principles apply to other words and to phonetic subwords. Two possible subword decompositions of “Shenandoah” are syllables (she nan do ah) and overlapping grapheme strings (shen hena enan nand ando ndoa doah). Systems that make use of subwords are attempting to leverage two effects.

First, if a word is not in the vocabulary of the ASR system, it can be reconstructed from a series of subword units. For example, the original audio could be recognized using an ASR system with a syllable vocabulary. If this vocabulary contained the units she nan do ah, it would be able to recognize the string shenandoah without explicit knowledge of the existence of the word “Shenandoah.”
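The first effect can be illustrated with a small sketch. It assumes a made-up syllable-level transcript and a hypothetical helper `find_term`; neither comes from a particular ASR or SCR system. The point is only that a query term can be located as a sequence of syllable units even though no word-level entry for it exists.

```python
def find_term(query_units, transcript_units):
    """Return the start positions at which the query unit sequence occurs
    contiguously in the transcript."""
    n = len(query_units)
    if n == 0:
        return []
    return [
        i
        for i in range(len(transcript_units) - n + 1)
        if transcript_units[i:i + n] == query_units
    ]


# ASSUMPTION: invented output of a recognizer with a syllable vocabulary.
transcript = ["the", "she", "nan", "do", "ah", "val", "ley"]
query = ["she", "nan", "do", "ah"]  # syllabification from the text

hits = find_term(query, transcript)
print(hits)                           # [1]
print("".join(transcript[1:1 + 4]))   # shenandoah
```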

Second, if the ASR system makes an error with a particular word, subwords can help to provide a partial match. Take again the example of the ASR system with a syllable vocabulary. If an error occurs, for instance, because the word is spoken in the speech signal with a pronunciation differing slightly from that included in the ASR system’s lexicon, the following string might result: she nen do ah. In this case, three out of the four syllables are recognized correctly. The possibility of making use of a partial match during the IR process is left open. If the ASR system had a word-based vocabulary, it would output a word-level error for the misrecognized word, for example, crescendo. This word-level error is difficult to match with the original spoken word, “Shenandoah.” It is also possible to take this partial match one step further. Note that our mis-recognized strings she nen do ah (output of the syllable-level recognizer) and cre scen do (syllabified output of the word-level recognizer) contain syllables that are very similar to those in the correct syllabification of the word, she nan do ah. Specifically, nen is not far from nan and scen bears a resemblance to she. Some subword-based systems attempt to use matches on multiple levels, in order to create the most reliable inexact match possible between the target word and its realization in the ASR transcripts.
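A two-level partial match of this kind might be sketched as follows. The scoring scheme is an assumption made for illustration, not the method of any particular system: exact syllable matches are counted after aligning the two syllable sequences, and a second, finer level compares individual syllables character by character. Both levels use `difflib.SequenceMatcher` from the Python standard library; the function names are hypothetical.

```python
from difflib import SequenceMatcher


def exact_syllable_match(reference, hypothesis):
    """Fraction of reference syllables that also appear, in order,
    in the hypothesis (sequence alignment on whole syllables)."""
    blocks = SequenceMatcher(None, reference, hypothesis).get_matching_blocks()
    matched = sum(block.size for block in blocks)
    return matched / len(reference)


def syllable_similarity(syl_a, syl_b):
    """Character-level similarity of two syllables, between 0 and 1."""
    return SequenceMatcher(None, syl_a, syl_b).ratio()


reference = ["she", "nan", "do", "ah"]     # what was spoken
syllable_asr = ["she", "nen", "do", "ah"]  # syllable-level recognizer output
word_asr = ["cre", "scen", "do"]           # syllabified word-level output

print(exact_syllable_match(reference, syllable_asr))  # 0.75
print(exact_syllable_match(reference, word_asr))      # 0.25

# Finer level: near-miss syllables still score above unrelated ones.
print(syllable_similarity("nen", "nan") > syllable_similarity("cre", "nan"))  # True
print(syllable_similarity("scen", "she") > 0)                                 # True
```

A retrieval system could combine the two levels, for example by giving partial credit to a reference syllable whose best character-level similarity exceeds a threshold. How to weight the levels is a design decision that this sketch leaves open.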

It is not necessary to have a recognizer with a subword language model in order to exploit subword matching effects. A word-level transcript containing crescendo in place of shenandoah could be decomposed. A syllabification (syllable-based decomposition) of the substitution error word would yield the syllable sequence cre scen do. Here, one out of four syllables matches a syllable in the original spoken word. This match is rather distant, but could still prove useful to the retrieval system. Using subword units, however they are generated, thus implements a partial match between words.
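Post hoc decomposition of a word-level transcript can be sketched as below. The syllabifications are copied by hand from the text, since automatic syllabification is outside the scope of this illustration. The overlapping grapheme strings are generated by a hypothetical helper, `grapheme_ngrams`. The choice of length 3 for the final comparison is an assumption made here to show that a different unit inventory can still recover a match.

```python
def grapheme_ngrams(word, n=4):
    """Decompose a word into overlapping grapheme strings of length n."""
    word = word.lower()
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def shared_units(units_a, units_b):
    """Return the sorted subword units common to both decompositions."""
    return sorted(set(units_a) & set(units_b))


# Overlapping grapheme strings of length 4, as listed in the text.
print(grapheme_ngrams("Shenandoah"))
# ['shen', 'hena', 'enan', 'nand', 'ando', 'ndoa', 'doah']

# Hand-copied syllabifications of the spoken word and the substitution error.
spoken = ["she", "nan", "do", "ah"]
recognized = ["cre", "scen", "do"]
print(shared_units(spoken, recognized))  # ['do']

# The same comparison with overlapping grapheme strings of length 3.
print(shared_units(grapheme_ngrams("shenandoah", 3),
                   grapheme_ngrams("crescendo", 3)))  # ['ndo']
```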

In document Spoken content retrieval: A survey of techniques and technologies (Page 92-95)