Compensation by Alternative Recognition Hypotheses

4. Compensation of Spoken Term Detection Errors

4.2. Compensation by Alternative Recognition Hypotheses

mating the word sequence with the highest probability of having generated the observed feature sequence. This is also true for the decoder used within this evaluation [65]. However, ASR decoders can output more information about the decoding process than just the 1-best transcription; for example, they can provide competing recognition hypotheses. This additional output from the decoder can be exploited in retrieval. In the following, we describe lattices, which represent a standard approach for encoding alternative recognition hypotheses from the ASR decoder, and show how they can be used in Spoken Term Detection.

Instead of storing only the most probable 1-best transcription of a single utterance, the speech recognizer can also produce a list of competing sentence hypotheses. This N-best list contains the N most probable sentence hypotheses, ordered by their probability. This approach is particularly suited for simple and efficient retrieval on recognition alternatives. In principle, each sentence hypothesis can be indexed as a different transcription of the same utterance, and searching for the query boils down to simple text search. We note that the difference between two hypotheses is typically very small, and will often consist of only one or two words. Hence, in the worst case, large parts of the information encoded in the N-best lists will consist of repetitions. This is a major disadvantage for longer utterances consisting of many words, where the number of sentences N that need to be stored to cover the most important alternatives is too large for practical applications. The number of required sentence alternatives increases even further if we consider using N-best lists for subword retrieval, as the number of tokens per utterance transcription is much higher than in the case of words (cf. table 3.7).
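The idea of treating each N-best entry as a separate transcription can be sketched in a few lines of Python. This is a minimal illustration, not the retrieval system used in this work: the function names `contains_query` and `search_nbest` are hypothetical, and the N-best list is made up for illustration (it reuses the sample utterance and the competing word stammt discussed later in this section).

```python
def contains_query(hyp_tokens, query_tokens):
    """Return True if query_tokens occurs as a contiguous run in hyp_tokens."""
    m = len(query_tokens)
    return any(hyp_tokens[i:i + m] == query_tokens
               for i in range(len(hyp_tokens) - m + 1))


def search_nbest(nbest, query):
    """Return the 1-based ranks of all hypotheses that contain the query."""
    query_tokens = query.split()
    return [rank for rank, hyp in enumerate(nbest, start=1)
            if contains_query(hyp.split(), query_tokens)]


# Made-up N-best list: the entries differ in a single word only.
nbest = [
    "über den ganz Deutschland stammt",
    "über den ganz Deutschland staunt",
]
hits = search_nbest(nbest, "Deutschland staunt")
```

In this toy list, four of the five tokens are stored twice, which illustrates the redundancy discussed above.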

More compact representations of the ASR hypothesis space can be used in order to overcome the storage drawbacks of the N-best list. In particular, lattices have been used extensively for this task, and they have been successfully applied across different retrieval units such as words [99], syllables [77] or phonemes [112].

Formally, a lattice is an acyclic directed graph

G = (V, E) (4.1)

with a set of nodes V and a set of edges E. Each node n ∈ V is a tuple

n = (i, l, t_s, t_e, c) (4.2)

The tuple specifies a unique node id i, a node label l representing the transcribed token identity (i.e., the word, syllable or phoneme) and the start and end times t_s and t_e of the speech segment covered by this node. Moreover, for each node we store a confidence c, 0 ≤ c ≤ 1, which indicates the degree of uncertainty of the decoder. The node set contains two special nodes: an initial node n_init with no incoming and at least one outgoing edge, and a terminal node n_end with at least one incoming and no outgoing edges.

Figure 4.2 illustrates exemplary word and syllable lattices, generated on the same sample utterance from the DiSCo corpus with the reference transcription über den ganz Deutschland staunt. We used the ASR configuration described in section 3.3 to generate the lattices. Note that both ASR decoders use the same acoustic model, but differ in language model and lexicon. Hence, paths through the syllable lattice can be substantially different from syllabified paths through a word lattice. For example, the alternatives for the correct word staunt include the word stammt, but the corresponding syllable S t a m t is not part of the syllable lattice.

The syllable lattice contains a good example for pronunciation variation and its impact on retrieval from subword ASR output. Here, the lattice contains the syllable sequence d OY t S l a n corresponding to the spoken word Deutschland, which is the correct syllabic representation if the speaker deleted the final t of the second syllable. In such cases, it is possible that the correct syllable sequence is not part of the lattice (because it has a relatively low acoustic likelihood), and retrieval would fail to recover from the pronunciation variation.

Figure 4.2.: Example for word and syllable lattices, generated on the same DiSCo sample utterance.

Lattice retrieval, i.e., detecting query occurrences in a lattice, can be formulated as a search problem on the lattice graph. Consider a lattice G and a given word query q = q_1 ··· q_n with n words. Then, we search for all node paths p = p_1 ··· p_n with n nodes where the sequence of the corresponding node labels equals q. This description can easily be adapted to retrieval from subword graphs, such as syllable or phoneme lattices. In this case, the query is broken down into subwords, just as in the case of exact search on the 1-best transcript, and the retrieval is performed on the subword graph just as in the case of word lattices.
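The following Python sketch illustrates the node tuple from equation 4.2 and the path search described above. It is a simplified illustration under stated assumptions: the names `Node` and `find_paths` are hypothetical, the edge set is given as pairs of node ids, and all time stamps and confidence values in the toy lattice are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    node_id: int    # unique node id i
    label: str      # token identity l (word, syllable or phoneme)
    t_start: float  # start time t_s
    t_end: float    # end time t_e
    conf: float     # confidence c, 0 <= c <= 1


def find_paths(nodes, edges, query):
    """Return all node-id paths whose label sequence equals the query."""
    if not query:
        return []
    by_id = {n.node_id: n for n in nodes}
    successors = {}
    for left_id, right_id in edges:
        successors.setdefault(left_id, []).append(right_id)
    # Start a partial path at every node that matches the first query token,
    # then extend each partial path along the edges, one query token at a time.
    paths = [[n.node_id] for n in nodes if n.label == query[0]]
    for token in query[1:]:
        paths = [path + [nxt]
                 for path in paths
                 for nxt in successors.get(path[-1], [])
                 if by_id[nxt].label == token]
    return paths


# Toy word lattice with one competing hypothesis (values are made up).
nodes = [
    Node(0, "<s>", 0.0, 0.0, 1.0),
    Node(1, "über", 0.0, 0.3, 0.9),
    Node(2, "den", 0.3, 0.5, 0.9),
    Node(3, "ganz", 0.5, 0.8, 0.8),
    Node(4, "Deutschland", 0.8, 1.5, 0.9),
    Node(5, "staunt", 1.5, 2.0, 0.6),
    Node(6, "stammt", 1.5, 2.0, 0.4),
    Node(7, "</s>", 2.0, 2.0, 1.0),
]
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (4, 6), (5, 7), (6, 7)]
matches = find_paths(nodes, edges, ["Deutschland", "staunt"])
```

For a subword lattice, the same function applies unchanged once the query has been split into syllables or phonemes.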

Several important design decisions have to be taken when using lattices for retrieval. In particular, one has to decide

1. on the size of the lattice graph, i.e., on the number of competing hypotheses that shall be included,

2. how to control the number of nodes that will be included in the recognition lattice and

3. how to estimate the uncertainty of the decoder for a certain lattice node.

The optimal size of the graph that will be indexed depends on the requirements during retrieval. Basically, there are two alternatives. The graph can be pruned by removing unlikely nodes either (i) until the graph contains only hypotheses that are assumed to yield correct STD results, or (ii) only as far as needed to keep as many hypotheses as possible within the given storage constraints. In the first case, only nodes with a high local recognition confidence are kept, and all others are removed during an offline process at indexing time. In the second case, we keep all hypotheses above a certain minimal confidence threshold and defer the actual STD decision to the retrieval process. The latter approach offers greater flexibility for the user, who can adjust the balance between precision and recall of the search at query time. We refer to the first variant as offline graph pruning and to the second variant as online graph pruning.

Both approaches require a process for controlling the number of nodes in the lattice graph, which in turn requires a confidence measure for assessing the decoding quality at a certain node.

Our process for pruning nodes with low confidence is as follows. First, we obtain a large and unpruned lattice from the ASR decoder. Then, we estimate a confidence score for each lattice node based on the acoustic and language model likelihoods that were estimated by the ASR process. We follow the approach described in [14]. Given a node q, the lattice confidence C_q that the label of the node was spoken at the given time stamp is estimated by

$$C_q = \frac{L_\alpha(q)\, L_{AM}(q)\, L_{LM}(q)\, L_\beta(q)}{L_{max}} \tag{4.3}$$

Here, L_α and L_β are the forward and backward scores of the considered node q. The forward score of a node represents the likelihood that a lattice path leads to this particular node, while the backward score represents the likelihood for a path from this node to the end of the lattice. The acoustic likelihood of the node is denoted by L_AM(q), and L_LM(q) is the language model likelihood. The confidence score is normalized by the maximum likelihood L_max of the Viterbi path through the lattice, yielding a confidence score between 0 and 1. As usual in ASR decoders, we use log-likelihood scores instead of probabilities to cope with the problem of very small probability values. Hence, the multiplication of likelihoods in equation 4.3 becomes an addition of the corresponding scores, and the division by L_max becomes a subtraction. In [14], the authors further distinguish between word and subword systems, as their subword phoneme system is not constrained by a language model, and is thus completely unconstrained from a linguistic point of view. In our case, this further distinction is not necessary and we can use the same confidence scoring approach for all considered units.

For the actual implementation, a standard forward-backward algorithm is used [90]. First, the list of nodes is sorted in decreasing order by the time stamps of the nodes. Starting with the rightmost node (which is now the first node in the list), we perform the following forward procedure on all nodes:

• If the node is the rightmost node, we initialize the forward score with a value of 0.0.

• The forward pass terminates if an initial node is encountered (the last node in the list).

• For all other nodes, we propagate the forward score of the current node to all its left neighbors by adding the acoustic model score of the current node and the language model score for the transition between neighbor and current node. For each left neighbor q′ of the current node q, the forward score L_α(q′) is given by

$$L_\alpha(q') = L_\alpha(q)\, L_{AM}(q)\, L_{LM}(q', q) \tag{4.4}$$

In a similar fashion, we carry out the backward procedure by sorting the nodes by time stamp in increasing order. Starting with the leftmost node, we estimate the backward scores from left to right. For the actual confidence score, we apply equation 4.3 for each node q, using the estimated forward and backward scores. For the normalizing factor L_max, we take the maximal forward score (i.e., the forward score at the initial node).
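The forward and backward passes described above can be sketched as follows. This is a minimal illustration and not the implementation from [14] or [90]. Several points are assumptions: the names `Node` and `node_confidences` are hypothetical, scores are natural-log likelihoods, the language model term is attached to the transitions only (so no separate node-level L_LM(q) term is added), competing scores arriving at the same node are combined by taking the maximum, and the initial and terminal nodes carry an acoustic score of 0. All numeric values are invented.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    node_id: int
    label: str
    t_start: float
    t_end: float
    am: float  # acoustic log-likelihood L_AM of the node


def node_confidences(nodes, lm):
    """Return {node_id: confidence} for a lattice.

    lm maps an edge (left_id, right_id) to its language model log score.
    """
    left = {n.node_id: [] for n in nodes}
    right = {n.node_id: [] for n in nodes}
    for a, b in lm:
        right[a].append(b)
        left[b].append(a)

    def time_key(n):
        return (n.t_start, n.t_end)

    # Forward pass as described in the text: start at the rightmost node
    # and propagate scores to the left neighbors (equation 4.4, log domain).
    alpha = {n.node_id: -math.inf for n in nodes}
    by_time_desc = sorted(nodes, key=time_key, reverse=True)
    alpha[by_time_desc[0].node_id] = 0.0
    for q in by_time_desc:
        for p in left[q.node_id]:
            score = alpha[q.node_id] + q.am + lm[(p, q.node_id)]
            alpha[p] = max(alpha[p], score)  # assumption: keep the best path

    # Backward pass: mirror image, from the leftmost node to the right.
    beta = {n.node_id: -math.inf for n in nodes}
    by_time_asc = sorted(nodes, key=time_key)
    beta[by_time_asc[0].node_id] = 0.0
    for q in by_time_asc:
        for s in right[q.node_id]:
            score = beta[q.node_id] + q.am + lm[(q.node_id, s)]
            beta[s] = max(beta[s], score)

    # L_max: forward score at the initial node (its own acoustic score is 0 here).
    init = by_time_asc[0]
    l_max = alpha[init.node_id] + init.am
    # Equation 4.3 in the log domain, mapped back to the range [0, 1].
    return {n.node_id: math.exp(alpha[n.node_id] + n.am + beta[n.node_id] - l_max)
            for n in nodes}


nodes = [
    Node(0, "<s>", 0.0, 0.0, 0.0),
    Node(1, "staunt", 0.0, 1.0, -1.0),
    Node(2, "stammt", 0.0, 1.0, -2.0),
    Node(3, "</s>", 1.0, 1.0, 0.0),
]
lm = {(0, 1): -0.5, (0, 2): -0.5, (1, 3): 0.0, (2, 3): 0.0}
conf = node_confidences(nodes, lm)
```

In this toy lattice, the node on the best path receives a confidence of 1, while the competing node is scored by the likelihood ratio between the best path through it and the overall best path.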

For offline graph pruning, we directly use this confidence score for limiting the number of nodes per time frame by applying the following procedure which was used successfully to prune lattices in [65]:

• for each time frame t, we obtain a list of nodes that have a start time t_s ≤ t and an end time t_e ≥ t

• we sort the list of obtained nodes in increasing order by their confidence value as estimated by the forward-backward procedure above

• for a graph cut of n, we remove all but the last n nodes from the list, i.e., we keep the n nodes with the highest confidence at the current time frame
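The per-frame procedure above can be sketched as follows. This is an illustration only, not the implementation from [65]. The function name `prune_offline` and the representation of nodes as `(node_id, start_frame, end_frame, confidence)` tuples with integer frame indices are assumptions. The text does not specify how to treat a node that is among the top n in some frames but not in others; the sketch assumes that a node is removed as soon as it is cut in any frame it covers.

```python
def prune_offline(nodes, n, num_frames):
    """Keep at most the n most confident nodes per time frame.

    nodes: list of (node_id, start_frame, end_frame, confidence) tuples.
    """
    removed = set()
    for t in range(num_frames):
        # Nodes that cover the current frame: t_s <= t <= t_e.
        active = [node for node in nodes if node[1] <= t <= node[2]]
        # Sort in increasing order of confidence.
        active.sort(key=lambda node: node[3])
        # Remove all but the last n entries of the sorted list.
        cut = active[:-n] if n > 0 else active
        removed.update(node[0] for node in cut)
    return [node for node in nodes if node[0] not in removed]


# Made-up example with five frames and a graph cut of n = 2.
nodes = [
    (1, 0, 4, 0.9),
    (2, 0, 4, 0.5),
    (3, 0, 2, 0.2),
    (4, 3, 4, 0.6),
]
kept = prune_offline(nodes, n=2, num_frames=5)
kept_ids = [node[0] for node in kept]
```

A complete implementation would also have to drop the edges of removed nodes and make sure that the initial and terminal nodes survive the cut.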

For online graph pruning, we have proposed a method in [77], which assesses the confidence of a path through the lattice at search time. The idea is to pre-calculate the confidence of each node at indexing time, and combine the confidences of a matching lattice path at runtime. We use the same approach as described above for pre-calculating the confidence for a single node at indexing time. Then, at runtime we can approximate the confidence score for the whole sequence in several ways. One approach presented in [77] is to calculate the product of the normalized node confidence scores as a lower bound for the query confidence score, i.e., for a query with n tokens we obtain:

$$C_q = \prod_{i=1}^{n} C_{q_i} \tag{4.5}$$

We note that this approach is particularly sensitive to outliers, i.e., the overall confidence score can become very small if only one of many query tokens has a low confidence. As an alternative, one can also consider using the average of the confidence scores:

$$C_q = \frac{1}{n} \sum_{i=1}^{n} C_{q_i} \tag{4.6}$$
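The difference between the two combination rules can be seen in a small numeric sketch. The function names, the confidence values and the threshold are hypothetical and chosen for illustration only; the sketch assumes that the pre-calculated node confidences of a matching path are available as a list.

```python
import math


def product_confidence(confs):
    """Equation 4.5: product of the node confidences of a matching path."""
    return math.prod(confs)


def average_confidence(confs):
    """Equation 4.6: arithmetic mean of the node confidences."""
    return sum(confs) / len(confs)


def accept(confs, threshold, combine=product_confidence):
    """Query-time decision: keep the hit if its score reaches the threshold."""
    return combine(confs) >= threshold


# A three-token match in which one token has a low confidence.
path_confs = [0.9, 0.8, 0.1]
prod_score = product_confidence(path_confs)  # 0.9 * 0.8 * 0.1
avg_score = average_confidence(path_confs)   # (0.9 + 0.8 + 0.1) / 3
```

With a threshold of 0.5, the hit would be rejected under the product rule and accepted under the average rule, which illustrates the sensitivity of the product to a single low-confidence token.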

For both approaches, we only need access to the node confidence scores at runtime and can ignore all other node-specific information (such as acoustic and language model likelihoods). This is especially useful in the case of subword decoding, where the graphs are typically substantially larger than word lattices. The size of the large subword ASR output graph is a major drawback when using subwords for lattice decoding and retrieval, and there is a need for efficient approaches to scoring and accessing the information contained in the lattice. In our experiments, the already pruned lattices from syllable decoding contained on average 13 times more nodes than the 1-best syllable transcription (see section 6). This requires attention in scenarios where users perform ad-hoc searches on large data sets, as exhaustive graph search for long query path matches (and optionally additional online pruning) can be computationally expensive. Moreover, storing such large lattices for thousands of hours of data requires compact representations for data persistence. Within the scope of this section, we focus on the baseline performance of lattice indexing and retrieval, i.e., we are interested in the performance without tight restrictions on the indexing and retrieval efficiency. Scalability aspects of the retrieval system, both in terms of retrieval efficiency and storage requirements, will be investigated in detail in section 6.
