Considering Time Alignment with Simple Accumulation

3.4 Posterior Probability as Confidence Measure

3.4.2 Considering Time Alignment with Simple Accumulation

The experimental results presented in Wessel et al. (2001) show that the performance of the confidence measure calculated in the previous section can be significantly improved by summing up the posterior probabilities of all hypotheses of the same word with overlapping time intervals. The reason is that word graphs usually contain, besides the best hypothesis, several hypotheses of the same word with slightly different time alignments. Using fixed starting and ending times as in the previous section does not allow these hypotheses to be considered in the computation of the posterior probability, so the probability of the word is split among them. Figure 3.4 on page 42 shows schematically seven word graph arcs of the same word, which belong to different sentence hypotheses but have overlapping time intervals. This example is an excerpt of a word graph as typically produced by the HTK speech recognition system. To compute the confidence score of the best word hypothesis (arc $r_2$, boldface in Figure 3.4) while taking into account the intersecting time boundaries of similar word hypotheses, we must sum up the posterior probabilities of the arcs which have overlapping time intervals (e.g. arc $r_3$ in Figure 3.4). This is equivalent to

CHAPTER 3. CONFIDENCE MEASUREMENT TECHNIQUES

[Figure 3.4: schematic of the arcs r1 to r7 of the same word, drawn along a time axis marked τ1, τ2, t1, τ3, t2 and t3]

Figure 3.4: Overlapping time intervals, τ2 − t1 and τ3 − t2, for different arcs ri of a word graph as typically produced by the HTK ASR system. Boldface denotes arcs belonging to the best recognition result.

$$
C_{\mathrm{sec}}([w;\tau,t]) = \mathop{\bigoplus\nolimits^{\ln}}_{\forall\,[w;\tau',t']:\ \{\tau,\dots,t\}\cap\{\tau',\dots,t'\}\neq\emptyset} C([w;\tau',t']), \tag{3.13}
$$

or, for the example shown in Figure 3.4:

$$
C_{\mathrm{sec}}([w_2;\tau_2,t_2]) = \mathop{\bigoplus\nolimits^{\ln}}_{i=1}^{4} C_{r_i}, \tag{3.14}
$$

where $C_{r_i}$ is the confidence score of arc $r_i$ according to the definition in Equation 3.12 (page 41).
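The accumulation over overlapping arcs can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in this work: the `Arc` record, the helper names `overlaps`, `log_add` and `c_sec`, and all numbers are hypothetical. It assumes that time boundaries are inclusive frame indices, that each arc carries its log posterior, and that the accumulation operator is a summation of posteriors carried out in the logarithmic domain.

```python
import math
from typing import NamedTuple


class Arc(NamedTuple):
    word: str
    start: int       # first time frame of the arc (tau)
    end: int         # last time frame of the arc (t), inclusive
    log_post: float  # log posterior of the arc (assumed to be C of Equation 3.12)


def overlaps(a: Arc, b: Arc) -> bool:
    """True if the inclusive frame ranges of a and b share at least one frame."""
    return a.start <= b.end and b.start <= a.end


def log_add(values: list[float]) -> float:
    """Sum in the log domain, ln(sum(exp(v))), shifted by the maximum for stability."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def c_sec(target: Arc, arcs: list[Arc]) -> float:
    """Accumulate all same-word arcs overlapping target; arcs must contain target."""
    scores = [a.log_post for a in arcs
              if a.word == target.word and overlaps(a, target)]
    return log_add(scores)


# Invented example; arcs[0] plays the role of the best hypothesis.
arcs = [
    Arc("have", 10, 30, math.log(0.30)),
    Arc("have", 5, 15, math.log(0.20)),
    Arc("have", 20, 40, math.log(0.10)),
    Arc("have", 50, 60, math.log(0.05)),  # same word, no overlap: ignored
    Arc("move", 10, 30, math.log(0.40)),  # overlapping, other word: ignored
]
score = c_sec(arcs[0], arcs)  # approximately ln(0.3 + 0.2 + 0.1)
print(score)
```

Shifting by the maximum before exponentiating is a design choice of the sketch: it avoids underflow when the log posteriors are strongly negative, which is common for unlikely arcs.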

Note that $C_{\mathrm{sec}}$ does not necessarily fulfill the condition of a posterior probability as formulated in Equation 3.3 (page 35): the accumulated probabilities need not sum to unity in the normal (linear) probability space, so scores $C_{\mathrm{sec}} > 0$ can occur in the logarithmic space. It does, however, perform significantly better on five different evaluation corpora than the score defined in Equation 3.12 (page 41), as reported by Wessel et al. (2001). Evaluations carried out in the scope of this work on two additional test corpora also confirm better confidence error rates for the CM defined in Equation 3.13 than for the one based on Equation 3.12 (page 41).
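A small worked example may help to show how positive scores arise. The numbers are invented for illustration and are not taken from the evaluations. Suppose the best arc $r_b$ has posterior probability 0.5, and two further arcs of the same word, $r_a$ and $r_c$, each overlap $r_b$ in time but not each other, with posterior probability 0.4 each. Since $r_a$ and $r_c$ do not compete for the same time frames, they may lie on the same sentence hypotheses, and part of their probability mass is counted twice. Assuming that the accumulation in Equation 3.13 is a summation of posteriors carried out in the logarithmic domain, this gives:

$$
C_{\mathrm{sec}}(r_b) = \ln\left(e^{\ln 0.4} + e^{\ln 0.5} + e^{\ln 0.4}\right) = \ln 1.3 \approx 0.26 > 0 .
$$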

[Figure 3.5: a word graph running from a start node to an end node, with arcs labeled SIL, I, HAVE, MOVE, IT, VERY, VEAL, OFTEN, FINE and FAST, together with its confusion network]

Figure 3.5: Sample word graph and corresponding multiple alignment represented as confusion network, presented in Mangu & Brill (1999).

Wessel et al. (2001) propose two additional ways of summing up posterior probabilities of word hypotheses with slightly different starting and ending time boundaries. The first method, known as $C_{\mathrm{med}}$, accumulates the posterior probabilities of only those arcs with the same word hypothesis which intersect the median time frame of the best hypothesis, i.e. the hypothesis for which the CM is actually computed. This method also fulfills the original condition of posterior probabilities as given in Equation 3.3 (page 35). The second method, known as $C_{\mathrm{max}}$, accumulates posterior probabilities not only for the median time frame but for every time frame which intersects the best hypothesis, and takes the maximum of these per-frame sums as the measure of confidence.
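The two frame-based variants can be sketched as follows. The code is only an illustration under stated assumptions: the `Arc` record and the function names are hypothetical, time boundaries are taken to be inclusive frame indices, the median frame is assumed to be the integer midpoint of the arc, and accumulation is assumed to be a log-domain sum of posteriors.

```python
import math
from typing import NamedTuple


class Arc(NamedTuple):
    word: str
    start: int       # first time frame, inclusive
    end: int         # last time frame, inclusive
    log_post: float  # log posterior of the arc


def log_add(values):
    """Sum in the log domain: ln(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def frame_score(word, frame, arcs):
    """Log-domain sum over all arcs of `word` that cover `frame`."""
    return log_add([a.log_post for a in arcs
                    if a.word == word and a.start <= frame <= a.end])


def c_med(target, arcs):
    """Score at the median frame of target; arcs must contain target."""
    median_frame = (target.start + target.end) // 2  # assumed definition
    return frame_score(target.word, median_frame, arcs)


def c_max(target, arcs):
    """Maximum of the per-frame scores over all frames of target."""
    return max(frame_score(target.word, f, arcs)
               for f in range(target.start, target.end + 1))


# Invented example; arcs[0] plays the role of the best hypothesis.
arcs = [
    Arc("have", 10, 30, math.log(0.5)),
    Arc("have", 5, 15, math.log(0.4)),
    Arc("have", 25, 40, math.log(0.4)),
]
best = arcs[0]
print(c_med(best, arcs))  # only the best arc covers frame 20
print(c_max(best, arcs))  # frames 10 to 15 are covered by two arcs
```

In this example the median frame is covered by the best arc alone, so the median-based score equals its own log posterior, whereas the maximum-based score picks up the neighbouring arc that overlaps the first frames.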

3.4.3 Considering Time Alignment as Consensus Hypothesis

Another way of computing posterior probability based confidence scores on word graphs is described in Mangu et al. (2000). The algorithm, primarily used for the computation of the so-called consensus hypothesis, can also be applied to generate posterior probability based confidence scores. Figure 3.5 shows an example word graph with its corresponding multiple alignment. The approach presented in Mangu et al. (2000) selects, at each position in the alignment, the word with the highest posterior score; the resulting hypothesis is called the consensus hypothesis. For this method, the posterior scores of hypothesized words are computed in the same way as described in Section 3.4.1 on page 34. However, the accumulation of the confidence scores, which makes use of the time alignment information of the word graph, differs from that described in Section 3.4.2 on page 41.

The primary goal of the algorithm proposed in Mangu et al. (2000) is to compute the consensus hypothesis, which minimizes the word error rate of the recognition results rather than the sentence error rate. Empirical results in Mangu et al. (2000) show a significant lack of correlation between sentence error rate and word error rate, and hence a difference between optimizing for the one and for the other. To determine the consensus hypothesis, the word graph is converted to a compact format through the following computation steps:

Step 1: Computation of the posterior score of each word hypothesis (arc) of the word graph, as described in Section 3.4.1 on page 34

Step 2: Building equivalence classes, composed of all the arcs with the same word label and identical starting and ending times (see Figure 3.6 on page 45)

Step 3: Merging equivalence classes which contain the same word by computation of time similarity by overlapping time intervals (intra-word clustering)

Step 4: Grouping equivalence classes if they correspond to different words with so-called phonetic similarity (inter-word clustering)
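Step 2 of the list above amounts to grouping arcs by the triple of word label, starting time and ending time. The following is a minimal sketch of that grouping; the `Arc` record, the function name `initial_classes` and the example values are hypothetical, and the posterior of each arc is assumed to be available from Step 1.

```python
from collections import defaultdict
from typing import NamedTuple


class Arc(NamedTuple):
    word: str
    start: float  # starting time of the arc
    end: float    # ending time of the arc
    post: float   # posterior probability from Step 1


def initial_classes(arcs):
    """Group arcs that share word label, starting time and ending time."""
    groups = defaultdict(list)
    for arc in arcs:
        groups[(arc.word, arc.start, arc.end)].append(arc)
    return list(groups.values())


# Invented example arcs.
arcs = [
    Arc("very", 0.50, 0.90, 0.30),
    Arc("very", 0.50, 0.90, 0.20),  # same label and times as the first arc
    Arc("very", 0.55, 0.95, 0.10),  # same label, shifted times: own class
    Arc("veal", 0.50, 0.90, 0.15),  # same times, other label: own class
]
classes = initial_classes(arcs)
print([len(c) for c in classes])
```

Arcs that differ only slightly in their time boundaries remain in separate classes at this stage; bringing them together is the task of Step 3.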

For the analysis carried out in the scope of this work, it is not necessary to perform Step 4, because Step 3 already takes the time alignment information of similar word hypotheses into account: it allows the posterior scores of word hypotheses with overlapping time intervals, which are grouped together in common equivalence classes, to be accumulated. Merging equivalence classes in Step 3 is an iterative grouping process. In each iteration, the time similarity between all pairs of classes is computed, and the two classes most similar to each other are combined into a new equivalence class. As a measure of similarity $S$ between two equivalence classes $E_i$ and $E_j$, Mangu et al. (2000) propose the following definition for the intra-word clustering step:

$$
S(E_i, E_j) = \max_{\substack{r_i \in E_i \\ r_j \in E_j}} O(r_i, r_j)\, p(r_i)\, p(r_j), \tag{3.15}
$$

where $O$ stands for the time overlap between arc $r_i$ and arc $r_j$, normalized by the sum of their time durations (from the start time to the end time of each arc). In Equation 3.15, $O$ is weighted by the posterior probabilities of the corresponding arcs to make the measure of similarity more robust against unlikely word hypotheses.
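The similarity measure of Equation 3.15 can be written down directly. The sketch below is an illustration only: the `Arc` record and the function names are hypothetical, times are assumed to be given in seconds, and arcs of zero total duration are assigned an overlap of zero, which is a choice made here and not stated in the source.

```python
from typing import NamedTuple


class Arc(NamedTuple):
    word: str
    start: float  # starting time of the arc
    end: float    # ending time of the arc
    post: float   # posterior probability of the arc


def overlap(a, b):
    """Shared time of a and b, normalized by the sum of their durations."""
    shared = max(0.0, min(a.end, b.end) - max(a.start, b.start))
    total = (a.end - a.start) + (b.end - b.start)
    return shared / total if total > 0 else 0.0


def similarity(ei, ej):
    """S(Ei, Ej): maximum of overlap times both posteriors over all arc pairs."""
    return max(overlap(a, b) * a.post * b.post for a in ei for b in ej)


# Invented example classes.
e1 = [Arc("very", 0.0, 1.0, 0.5)]
e2 = [Arc("very", 0.5, 1.5, 0.4), Arc("very", 0.75, 1.75, 0.1)]
s = similarity(e1, e2)
print(s)  # the first pair dominates: 0.25 * 0.5 * 0.4
```

The second arc of `e2` overlaps less and is less probable, so it does not influence the result; this is the robustness against unlikely hypotheses that the weighting is meant to provide.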

The iteration is repeated until no more classes can be merged. Upon completion, all arcs of the same word with overlapping time intervals have been merged into one equivalence class. To compute the confidence measure, the posterior scores of all arcs within each resulting class are accumulated, as shown in the example in Equation 3.16 (page 45).
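The iterative merging and the final accumulation can be sketched as below. This is a simplified illustration, not the algorithm as published: it merges the most similar pair of same-word classes until no pair has a similarity above zero, and it leaves out any further constraints the published algorithm may impose. The names `Arc`, `merge_classes` and `class_confidence` are hypothetical, and the accumulation is assumed to be a plain sum of posterior probabilities.

```python
from itertools import combinations
from typing import NamedTuple


class Arc(NamedTuple):
    word: str
    start: float
    end: float
    post: float  # posterior probability of the arc


def overlap(a, b):
    """Shared time of a and b, normalized by the sum of their durations."""
    shared = max(0.0, min(a.end, b.end) - max(a.start, b.start))
    total = (a.end - a.start) + (b.end - b.start)
    return shared / total if total > 0 else 0.0


def similarity(ei, ej):
    """S(Ei, Ej) as in Equation 3.15."""
    return max(overlap(a, b) * a.post * b.post for a in ei for b in ej)


def merge_classes(classes):
    """Repeatedly merge the most similar pair of same-word classes."""
    classes = [list(c) for c in classes]  # work on a copy
    while True:
        best, best_pair = 0.0, None
        for i, j in combinations(range(len(classes)), 2):
            if classes[i][0].word != classes[j][0].word:
                continue  # intra-word clustering only
            s = similarity(classes[i], classes[j])
            if s > best:
                best, best_pair = s, (i, j)
        if best_pair is None:  # no overlapping same-word pair left
            return classes
        i, j = best_pair       # i < j, so deleting j keeps i valid
        classes[i].extend(classes[j])
        del classes[j]


def class_confidence(cls):
    """Accumulated posterior of a class (assumed: plain sum)."""
    return sum(a.post for a in cls)


# Invented initial classes, one arc each.
initial = [
    [Arc("very", 0.0, 1.0, 0.5)],
    [Arc("very", 0.5, 1.5, 0.3)],
    [Arc("very", 3.0, 4.0, 0.1)],  # same word, no overlap: stays separate
    [Arc("veal", 0.0, 1.0, 0.2)],  # other word: never merged in Step 3
]
merged = merge_classes(initial)
print([(c[0].word, class_confidence(c)) for c in merged])
```

Recomputing all pairwise similarities in every iteration is quadratic in the number of classes per pass; this keeps the sketch close to the description in the text and is not meant as an efficient implementation.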

[Figure 3.6: the arcs r1 to r7 grouped into the initial equivalence classes E1, E2 and E3, drawn along a time axis marked τ1, τ2, t1, τ3, t2 and t3]

Figure 3.6: Initial equivalence classes, created in the second step of the consensus network algorithm, with inter-class time overlaps τ2 − t1 and τ3 − t2.

In document Confidence Measurement Techniques in Automatic Speech Recognition and Dialog Management (Page 53-57)