Evaluation - Dynamic Language Model evaluation

5.2 Dynamic Language Model evaluation

5.2.3 Evaluation

The proposed technique is evaluated on the Wall Street Journal corpus described by Paul et al. [167]. The computational environment had limited hardware resources, similar to those typically found in car head units or mobile phones. Accordingly, the acoustic model was evaluated with integer values, and no advanced acoustic adaptation technique such as MLLR was applied. Both the acoustic and the language model have to be as small as possible to meet the memory requirements of embedded devices. The SRI Language Model toolkit proposed by Stolcke [216] was used to learn a 5000-word 2-gram language model.

The proposed technique is intended to exploit specific language knowledge that is given in the form of grammars. For this purpose, the corpus was preprocessed to simulate various grammar coverages. Five grammars were used to evaluate the proposed method on the Wall Street Journal corpus, each representing one of the following sets. The set of all monetary terms is denoted by $S_\$$, the set of percentages by $S_\%$, the set of months by $S_\mu$, the set of weekdays by $S_\eta$, and the set of abbreviations by $S_\kappa$. Table 5.1 gives a detailed definition in Backus–Naur form. A statistical N-gram model for abbreviations was also used; the result was a slightly degraded speech recognition accuracy over all experiments. Further, let $S'_i \subseteq S_i$ with $i \in \{\$, \%, \mu, \eta, \kappa\}$ be the observable portion of $S_i$ in the test corpus. Each set can be used alone or in combination with others. Counting the empty combination, there are $2^5 = 32$ combinations, of which $\binom{5}{1} + \binom{5}{2} + \binom{5}{3} + \binom{5}{4} + 1 = 31$ are non-empty.
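The number of combinations can be checked by enumerating all subsets of the five sets. The following is a minimal sketch; the labels in `GRAMMARS` are illustrative names for the five sets, and it assumes that the empty combination (no grammar at all) is counted as one of the 32.

```python
from itertools import combinations

# Illustrative labels for the five grammar sets (not taken from the source).
GRAMMARS = ["monetary", "percent", "month", "weekday", "abbreviation"]


def all_combinations(items):
    """Return every subset of items, from the empty one to the full set."""
    return [
        combo
        for size in range(len(items) + 1)
        for combo in combinations(items, size)
    ]


subsets = all_combinations(GRAMMARS)
non_empty = [s for s in subsets if s]
print(len(subsets), len(non_empty))  # 32 31
```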

Table 5.1: Overview of grammars in Backus–Naur form

〈NUM〉 ::= 〈num〉 [‘point’ 〈num〉] [‘comma’ 〈num〉]

〈num〉 ::= ‘one’ 〈num〉 | ‘two’ 〈num〉 | ... | 〈empty〉

〈MONETARY〉 ::= 〈NUM〉 ‘dollar’ | 〈NUM〉 ‘cent’ | ...

〈PERCENT〉 ::= 〈NUM〉 ‘percent’

〈ABBREVIATION〉 ::= 〈Letter〉 | 〈ABB〉

〈Letter〉 ::= ‘A.’ 〈Letter〉 | ... | 〈empty〉

〈ABB〉 ::= ‘SVOX’ | ...
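To illustrate how the rules of Table 5.1 can be read, the following sketch recognizes phrases of the 〈PERCENT〉 and 〈MONETARY〉 rules. It is an assumption-laden toy: `NUMBER_WORDS` and `CURRENCY_WORDS` are small hypothetical word lists standing in for the alternatives elided by "..." in the table, and the function names are invented for this example.

```python
# Hypothetical stand-ins for the alternatives elided by "..." in the table.
NUMBER_WORDS = {"one", "two", "three", "four", "five"}
CURRENCY_WORDS = {"dollar", "cent"}


def skip_num(tokens, pos):
    """<num>: consume zero or more number words, return the new position."""
    while pos < len(tokens) and tokens[pos] in NUMBER_WORDS:
        pos += 1
    return pos


def skip_NUM(tokens, pos):
    """<NUM>: <num>, then optional 'point' <num> and optional 'comma' <num>."""
    pos = skip_num(tokens, pos)
    for separator in ("point", "comma"):
        if pos < len(tokens) and tokens[pos] == separator:
            pos = skip_num(tokens, pos + 1)
    return pos


def is_percent(phrase):
    """<PERCENT>: <NUM> followed by the single word 'percent'."""
    tokens = phrase.split()
    pos = skip_NUM(tokens, 0)
    return pos == len(tokens) - 1 and tokens[pos] == "percent"


def is_monetary(phrase):
    """<MONETARY>: <NUM> followed by a single currency word."""
    tokens = phrase.split()
    pos = skip_NUM(tokens, 0)
    return pos == len(tokens) - 1 and tokens[pos] in CURRENCY_WORDS
```

Because 〈num〉 may be empty in the table, this sketch also accepts a bare "percent" or "dollar"; a practical grammar would probably require at least one number word.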

Let $\pi$ be one arbitrary combination. This defines the sets $S_\pi$ and $S'_\pi$ as follows: $S_\pi = \bigcup_{i \in \pi} S_i$ and $S'_\pi = \bigcup_{i \in \pi} S'_i$. $S'_\pi$ is used to adjust the grammar coverage in the text corpora, which are then used for language model learning. This enables a realistic application with similar coverages. The corpus $C_\pi$ includes each sentence of $C$ that does not contain any term from $S'_\pi$:

$$C_\pi = \{ w \in C \mid \nexists\, m, n : w_{m:n} \in S'_\pi \}.$$
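The filtering step can be sketched as follows. This is a minimal illustration, assuming that sentences are whitespace-separated strings and that the observed terms are available as a finite set of word tuples; the names `contains_term` and `filter_corpus` and the toy data are invented for this example.

```python
def contains_term(sentence, terms):
    """Return True if some contiguous span w[m:n] of the sentence is a term."""
    words = sentence.split()
    longest = max((len(term) for term in terms), default=0)
    for m in range(len(words)):
        # Only spans up to the length of the longest term can match.
        for n in range(m + 1, min(len(words), m + longest) + 1):
            if tuple(words[m:n]) in terms:
                return True
    return False


def filter_corpus(corpus, observed_terms):
    """Keep only the sentences that contain no term from the observed set."""
    return [s for s in corpus if not contains_term(s, observed_terms)]


# Toy data, invented for illustration only.
corpus = [
    "stocks rose five percent",
    "the market closed early",
    "it cost two dollar",
]
observed_terms = {("five", "percent")}
filtered = filter_corpus(corpus, observed_terms)
print(filtered)  # ['the market closed early', 'it cost two dollar']
```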

Each corpus $C_\pi$ was used to learn a language model for a traditional speech recognizer. The proposed technique requires one further step: a function $R$ is applied, which replaces every term of $S_\pi$ with a corresponding grammar tag:

$$C'_\pi = \{ w \in ((W \cup T)^* \setminus S_\pi)^* \mid \exists\, w' \in C_\pi : w = R(w') \}.$$
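A possible reading of the replacement function R is sketched below. It is not the implementation used in the experiments: it assumes a finite dictionary from word tuples to tags instead of a full grammar, resolves overlaps by greedy longest match from left to right, and uses hypothetical tag names such as `<PERCENT>`.

```python
def replace_terms(sentence, term_tags):
    """Replace every known term in the sentence with its grammar tag."""
    words = sentence.split()
    longest = max((len(term) for term in term_tags), default=0)
    output, m = [], 0
    while m < len(words):
        # Try the longest candidate span first, then shorter ones.
        for n in range(min(len(words), m + longest), m, -1):
            tag = term_tags.get(tuple(words[m:n]))
            if tag is not None:
                output.append(tag)
                m = n
                break
        else:
            # No term starts at position m: keep the word unchanged.
            output.append(words[m])
            m += 1
    return " ".join(output)


# Hypothetical terms and tag names, for illustration only.
term_tags = {
    ("five", "percent"): "<PERCENT>",
    ("two", "dollar"): "<MONETARY>",
}
tagged = replace_terms("stocks rose five percent", term_tags)
print(tagged)  # stocks rose <PERCENT>
```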

Then, a language model was learned for each $C'_\pi$ as well. Both the N-gram model and the grammar of $S_\pi$ were then used to evaluate the proposed technique by means of recognition accuracy.

In addition, the corpus size is defined as the number of sentences in percent of the original corpus. Let the grammar coverage for $C_\pi$ be defined as the percentage of sentences that contain a term from $S_\pi$:

$$\frac{100 \cdot |\{ w \in C_\pi \mid \exists\, m, n : w_{m:n} \in S_\pi \}|}{|C_\pi|}$$
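The coverage formula translates directly into a count over sentences. The sketch below again assumes whitespace-separated sentences and a finite set of word tuples as terms; the function names and the toy corpus are invented for illustration.

```python
def contains_term(sentence, terms):
    """Return True if some contiguous span w[m:n] of the sentence is a term."""
    words = sentence.split()
    longest = max((len(term) for term in terms), default=0)
    for m in range(len(words)):
        for n in range(m + 1, min(len(words), m + longest) + 1):
            if tuple(words[m:n]) in terms:
                return True
    return False


def grammar_coverage(corpus, terms):
    """Percentage of sentences that contain at least one term."""
    if not corpus:
        return 0.0  # Convention chosen here for an empty corpus.
    hits = sum(1 for sentence in corpus if contains_term(sentence, terms))
    return 100.0 * hits / len(corpus)


# Toy data, invented for illustration only.
toy_corpus = [
    "stocks rose five percent",
    "the market closed early",
    "trading was slow",
    "analysts were surprised",
]
coverage = grammar_coverage(toy_corpus, {("five", "percent")})
print(coverage)  # 25.0
```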

The test corpus has a grammar coverage between 3.64% and 40.61%, depending on the considered $S_\pi$. Different $S'_\pi$ for $C_\pi$ cause a tag coverage between 6.72% and 36.44%.

Best recognition accuracy is achieved if the grammar coverage in $C_\pi$ is high while the coverage in the test corpus is low. Conversely, the error rate increases if the grammar coverage in $C_\pi$ is low while the coverage in the test corpus is high. In this setting, a word error rate of 19.3% was obtained. The effect can be reduced by using dynamic language models, as the experiments confirmed. With the proposed dynamic language model, a word error rate of 13.3% was achieved at a low grammar coverage of 6.72%. Figure 5.2 shows the results for each experiment. The novel technique is beneficial for situations with a low grammar coverage.


Figure 5.2: Using 2-gram models reduces the word error rate by up to 31% for low grammar coverage. No significant word error rate improvements were achieved for high grammar coverages of 36%.

Besides grammar coverage, another aspect is the size of the corpus used to learn the language model. An increasing error rate was expected for decreasing corpus size. The proposed technique could also counteract this deterioration. An analysis is provided in Figure 5.3.

Finally, the influence of the corpus size for constant grammar coverage was analyzed. Two use cases are presented here. In the first case, the grammar coverage is good, at 36.44%. The corpus was randomly shrunk for each speech recognition experiment. Here, no further improvements could be achieved: no improvements were observed for any of the corpus sizes. This can be seen in Figure 5.4.

In the second case, the grammar coverage is low (6.86% ± 0.22% variance). Here, an improvement could be achieved using the proposed technique for large corpora: the word error rate fell by 6 percentage points, from 19.3% to 13.3%. The improvement is still observable for small corpora, although the relative improvement decreases in this case. The baseline system has a word error rate of 27.0% using a word 2-gram language model, while the proposed technique with the grammar 2-gram model achieves a word error rate of 23.5%. The results are shown in Figure 5.5.
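The absolute and relative reductions quoted above can be reproduced with a few lines of arithmetic. The helper name `relative_reduction` is chosen for this sketch; the four word error rates are the ones reported in the text.

```python
def relative_reduction(baseline_wer, new_wer):
    """Relative word error rate reduction in percent of the baseline."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Large corpora: 19.3% baseline versus 13.3% with the proposed technique.
large_absolute = 19.3 - 13.3
large_relative = relative_reduction(19.3, 13.3)

# Small corpora: 27.0% baseline versus 23.5% with the proposed technique.
small_absolute = 27.0 - 23.5
small_relative = relative_reduction(27.0, 23.5)

print(round(large_absolute, 1), round(large_relative, 1))  # 6.0 31.1
print(round(small_absolute, 1), round(small_relative, 1))  # 3.5 13.0
```

The relative reduction thus drops from roughly 31% for large corpora to roughly 13% for small ones, even though the technique still helps in both cases.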

Figure 5.3: The novel technique improves speech recognition for domains with a limited amount of training data. The word error rate could be reduced from 19% to 13% using a grammar 2-gram model for small corpora.

Figure 5.4: No improvement was achieved for a constant grammar coverage of 36%. The word error rate increases with a potential dependence for a randomly shrunk corpus.


Figure 5.5: With a constant grammar coverage of 7%, the relative word error rate reduction falls from 31% to 12% as the corpus is randomly shrunk.

In document Dynamic language models for hybrid speech recognition (Page 121-125)