Motivation - Using Auxiliary Sources of Knowledge for Automatic Speech Recognition

State-of-the-art HMM-based ASR models the joint likelihood p(Q, X), i.e., the evolution of the hidden state space Q and the observed feature space X over time. The states represent the subword units (typically phonemes) which describe the word model. The feature vectors are typically derived from the smoothed spectral envelope of the speech signal. In Chapter 4, we studied how to model the evolution of an auxiliary knowledge source A = {a1, ..., an, ..., aN} along with Q and X, i.e., to model p(Q, X, A) instead of p(Q, X). The auxiliary knowledge sources mainly investigated were the auxiliary features pitch frequency, short-time energy and rate-of-speech. In this chapter, we extend this strategy of modelling auxiliary sources of knowledge to model additional subword units. Here, these additional subword units will be referred to as auxiliary subword units, and the auxiliary subword units that we investigated are graphemes.

In recent studies, good results have been reported using graphemes as subword units for languages such as German, Dutch and Swedish (Schukat-Talamazzini et al., 1993; Kanthak and Ney, 2002; Killer et al., 2003). There are certain advantages in using graphemes as subword units, such as:

• The definition of the lexicon is easy, i.e., the orthographic transcription of the word can be easily derived.

• The word model representation is unique, e.g., the word ZERO can be pronounced as /z/ /ih/ /r/ /ow/ or /z/ /iy/ /r/ /ow/, but the grapheme-based representation remains as [Z][E][R][O].

• Graphemes could complement the phonetic information.

• There is no need for phonetic transcription.
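The first two advantages can be illustrated with a minimal sketch. The toy lexicon and the function names `grapheme_model` and `phoneme_models` are hypothetical and introduced only for illustration; the two pronunciations of ZERO are the ones quoted above.

```python
# Hypothetical toy phoneme lexicon; the two variants of ZERO are those quoted in the text.
PHONEME_LEXICON = {
    "ZERO": [["z", "ih", "r", "ow"], ["z", "iy", "r", "ow"]],
}


def grapheme_model(word):
    """Derive the grapheme-based word model directly from the spelling."""
    return [ch.upper() for ch in word if ch.isalpha()]


def phoneme_models(word, lexicon=PHONEME_LEXICON):
    """Return all pronunciation variants; this requires a hand-built lexicon entry."""
    return lexicon[word.upper()]


# One grapheme sequence, but two phoneme sequences, for the same word.
print(grapheme_model("zero"))
print(phoneme_models("zero"))
```

The grapheme model needs no lookup at all, whereas the phoneme model fails for any word missing from the lexicon.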

While there are certain advantages in using graphemes as subword units, there are certain drawbacks too, such as:

• There is no obvious relationship with acoustic features. In other words, the acoustic feature vectors derived from the smoothed spectral envelope of the speech signal typically depict the characteristics of phonemes.

• There is a weak correspondence between the graphemes and the phonemes in languages such as English (Sejnowski and Rosenberg, 1987). For instance, the grapheme [E] in the word ZERO is associated with the phoneme /ih/, whereas in the word EIGHT it is associated with the phoneme /ey/.

The Finnish ASR system is an ideal example of a grapheme-based ASR system, as mismatches between the written form and the spoken form of words are quite exceptional (Kurimo, 1997). Thus, in the Finnish ASR system, although the speech is modelled by phonemes, they are written down as graphemes. The mismatch errors and some unmodelled rare phonemes have been found to increase the phoneme error rate. More recent work in Finnish ASR is looking into other subword units such as syllables and morphs (Siivola et al., 2003).

As mentioned earlier, unlike Finnish, other languages such as English, German and Dutch have no direct correspondence between the written form and the spoken form. Schukat-Talamazzini et al. (1993) used “polygraphs” as subword units for word modelling, which are essentially letters-in-context, similar to polyphones (phonemic units allowing preceding and following context of arbitrary length). Experimental studies conducted on continuous speech and isolated word recognition tasks showed that good results (better than context-independent phones) could be obtained using “polygraphs” as subword units.
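To make the notion of letters-in-context concrete, the following is a simplified sketch that expands a word into letters with a fixed-width left and right context. It is not the procedure of the cited work, where the context length is arbitrary; the function name `letters_in_context`, the tuple representation and the `#` boundary symbol are assumptions made for illustration.

```python
def letters_in_context(word, left=1, right=1, pad="#"):
    """Expand a word into (left context, letter, right context) units.

    Fixed context widths and the '#' word-boundary symbol are assumptions
    of this sketch, not part of the cited polygraph definition.
    """
    letters = list(word.upper())
    padded = [pad] * left + letters + [pad] * right
    units = []
    for i in range(len(letters)):
        centre = i + left  # position of the current letter in the padded list
        units.append((
            "".join(padded[centre - left:centre]),
            padded[centre],
            "".join(padded[centre + 1:centre + 1 + right]),
        ))
    return units


print(letters_in_context("zero"))
```

With `left=0` and `right=0` the units reduce to context-independent graphemes.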

In a more recent study, an approach of explicitly mapping the orthographic transcription to a phonetic one was investigated in the context of speech recognition (Kanthak and Ney, 2002). In this approach, the orthographic transcription of the words is used to map them onto acoustic HMM state models using phonetically motivated decision tree questions, e.g., a grapheme is assigned to a phonetic question if the grapheme is part of the phoneme. The decision tree was generated manually as well as automatically (using log-likelihood gain and observation count). Recognition studies were performed on databases of three different languages (Dutch, German and English). For Dutch and German, where there is a stronger association between phonemes and graphemes, this approach yielded performance comparable to that of the respective phoneme-based ASR systems. For English, though, where the grapheme-to-phoneme mapping is more complex, the performance of the system was fairly poor compared to a purely phoneme-based ASR system.

Killer et al. (2003) investigated context-dependent grapheme-based speech recognition, where the context is modelled through a decision-tree-based clustering procedure. Experimental studies conducted on English, German and Spanish yielded competitive results compared to the phoneme-based system for German and Spanish, but fairly poor performance for English.

In this chapter, we propose a phoneme-grapheme based ASR system that, during training, jointly models the phoneme and grapheme subword units. During recognition, the decoding is done using either one or both of the subword units (Magimai.-Doss et al., 2003b, 2004a). Basically, this can be seen as a system where word models are described by two different complementary subword units, i.e., the phonemes and the graphemes (as shown in Figure 7.1).

[Figure 7.1: word-model diagram. Recoverable labels: phoneme states /ay/, /t/; grapheme states e, i, g, h, t.]

Figure 7.1. A word model in phoneme-grapheme based ASR. In a standard ASR system, several states make up a phoneme. For simplicity, in this figure every phoneme is represented by a single state.

The architecture of the resulting system is then similar to factorial HMMs (Ghahramani and Jordan, 1997), where there are several chains of states as opposed to a single chain in standard HMMs. Each chain has its own states and dynamics, but the observation at any time depends upon the current state in all the chains (see Figure 7.2). Logan and Moreno (1997) were among the first to use factorial HMMs for ASR. They modified the factorial HMM so that the same discrete state space was used for each chain and each chain had different observations. This system did not yield promising results. In our case, instead of dividing states representing the same subword units into chains, there are two parallel chains, each corresponding to a specific subword unit representation, and the observation is the same for both chains. Similar models have also been used for ASR more recently, e.g., DBN-based multi-stream speech recognition (Zhang et al., 2003), modelling of articulatory features (Wester et al., 2004) and multi-rate modelling of speech (Cetin and Ostendorf, 2005).
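The factorial structure described above (two state chains with their own dynamics and one shared observation) can be sketched with a forward recursion over the joint state space. This is only an illustrative sketch under stated assumptions, not the system proposed in this chapter: the chains are assumed to evolve independently of each other, the observations are assumed to be discrete symbols, and all names and toy parameter values are hypothetical.

```python
from itertools import product


def factorial_forward(obs, init_q, trans_q, init_g, trans_g, emit):
    """Likelihood of `obs` under two state chains sharing one observation stream.

    Assumptions of this sketch: independent chain transitions and discrete
    observations, with emit[q][g][x] = p(x | q, g).
    """
    joint = list(product(range(len(init_q)), range(len(init_g))))
    # Initialisation: each chain starts from its own initial distribution.
    alpha = {(q, g): init_q[q] * init_g[g] * emit[q][g][obs[0]] for q, g in joint}
    for x in obs[1:]:
        new_alpha = {}
        for q, g in joint:
            # Sum over all previous joint states; each chain uses its own transitions.
            total = sum(alpha[(pq, pg)] * trans_q[pq][q] * trans_g[pg][g]
                        for pq, pg in joint)
            # The emission depends on the current state of both chains.
            new_alpha[(q, g)] = total * emit[q][g][x]
        alpha = new_alpha
    return sum(alpha.values())


# Hypothetical toy parameters: 2 states in the first chain, 3 in the second, 2 symbols.
init_q = [0.6, 0.4]
trans_q = [[0.7, 0.3], [0.2, 0.8]]
init_g = [0.5, 0.3, 0.2]
trans_g = [[0.6, 0.3, 0.1], [0.1, 0.6, 0.3], [0.2, 0.2, 0.6]]
emit = [
    [[0.9, 0.1], [0.5, 0.5], [0.3, 0.7]],
    [[0.6, 0.4], [0.2, 0.8], [0.1, 0.9]],
]

print(factorial_forward([0, 1, 1], init_q, trans_q, init_g, trans_g, emit))
```

The cost of each time step grows with the square of the number of joint states, which is why exact inference in factorial models becomes expensive as chains are added.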

In the following section, we describe the modelling process of the phoneme-grapheme based ASR in detail.

In document Using Auxiliary Sources of Knowledge for Automatic Speech Recognition (Page 96-98)