4.2.2 Computation of posterior probabilities
The sparse coding procedure needs a dictionary of speech and noise exemplars. In all experiments in this chapter we used a dictionary that comprises 17,148 speech exemplars and 13,504 noise exemplars. For each configuration of the modulation filterbank a new dictionary was constructed. Exemplars consist of a single feature frame (EMS vector). Given the amplitude response of the modulation filters with the lowest centre frequencies, information about continuity of spectral changes over time is preserved in the EMS features. For all configurations of the modulation filterbank the exact same time frames extracted from the training set in aurora-2 were used as exemplars.
Chapter 4. Human-inspired Modulation Frequency Features for Noise-robust ASR 87
Figure 4.3: Block diagram of the posterior probability computation block. A sample posterior probability matrix is visualized on the right side of the figure. The activation vector (S) and state posterior probability vector (P) of a single time frame of the sample signal are shown in the bottom part of the figure.
The speech and noise exemplars were obtained by means of a semi-random selection procedure. We made sure that we had the same number of exemplars from female and male speakers, and almost the same number of exemplars associated with the 179 states in the aurora-2 task. For that purpose we labelled the clean training speech by means of a conventional HMM system using forced alignment. Most states were represented by 98 exemplars in the dictionary. The remaining states, which had fewer frames in the training material, were represented by at least 86 exemplars. To obtain the noise exemplars the noise signals were reconstructed by subtracting the clean speech from the noisified speech in the multi-condition training set. The resulting signals were processed by the modulation frequency frontend, and the noise exemplars were randomly selected from these output signals.
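The per-state balancing described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `select_exemplars`, the fixed random seed, and the rule that states with too few frames contribute all of their frames are hypothetical, and the balancing over female and male speakers is left out.

```python
import random
from collections import defaultdict


def select_exemplars(frame_labels, per_state=98, seed=0):
    """Pick at most `per_state` random frame indices for every state label."""
    rng = random.Random(seed)
    frames_by_state = defaultdict(list)
    for index, label in enumerate(frame_labels):
        frames_by_state[label].append(index)

    selected = []
    for label in sorted(frames_by_state):
        frames = frames_by_state[label]
        # Assumption: a state with fewer frames than the quota gives all of them.
        quota = min(per_state, len(frames))
        selected.extend(rng.sample(frames, quota))
    return sorted(selected)


# Toy usage: state 1 has ten labelled frames, state 2 has only three.
toy_labels = [1] * 10 + [2] * 3
toy_selection = select_exemplars(toy_labels, per_state=5)
```

In the toy call, state 1 is capped at the quota while state 2 contributes every frame it has, which mirrors the situation in which some states end up with fewer exemplars than the majority.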
As can be seen in Figure 4.3, the procedure for estimating posterior probabilities of sub-word units consists of several steps. The first step involves a normalization of the EMS features (i.e., standard deviation equalization and Euclidean normalization), the second implements the reconstruction of unknown observations as a sparse sum of exemplars in a dictionary (sparse coding), and the third step converts the exemplar activations to posterior probabilities.
We used a Lasso procedure for reconstructing EMS vectors as a sparse sum of exemplars from the dictionary (Efron et al., 2004). Lasso is able to handle the positive and negative components in the EMS vectors. The Lasso procedure minimizes the root mean square of the difference between an observation and its reconstruction.
Standard deviation equalization and Euclidean normalization. The range and variance of the components of the EMS vectors differ considerably (Ahmadi et al., 2014). To make sure that all gammatone bands can make an effective contribution to the distance measure, some equalization in the EMS vectors is required. We follow the strategy used in Ahmadi et al. (2014), in which the standard deviations of the samples of the gammatone envelope signals $E_g(t)$ within each modulation band are equalized using weights obtained from the speech exemplars in the dictionary. Each $E_{m,g}(t)$ is multiplied by an equalization weight $w_g$:
$$w_g = \frac{1}{\dfrac{1}{M+1} \sum_{m=1}^{M+1} \sigma_{15 \cdot (m-1)+g}} \quad \text{for } 1 \le g \le 15, \tag{4.3}$$
where $\sigma_i$, with $i = 15 \cdot (m-1) + g$ and $1 \le i \le 15 \cdot (M+1)$, is the standard deviation of the $i$th element of the speech dictionary exemplars. With this procedure the standard deviation of these modified features is equalized within each modulation band, while the relative importance of the different modulation bands is retained. The equalization weights were recomputed for each configuration of the modulation filterbank.
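As an illustration of eq. (4.3), the sketch below computes the 15 equalization weights from a list of speech exemplars and applies them to an EMS vector. It assumes that each vector stores its 15 gammatone bands contiguously per modulation band (0-based index `15 * m + g`) and that the population standard deviation is used; the names `equalization_weights` and `equalize` and the toy dictionary are hypothetical.

```python
import statistics

N_GAMMATONE = 15  # number of gammatone bands g in eq. (4.3)


def equalization_weights(speech_exemplars):
    """Return the weights w_g for EMS exemplars of length 15 * (M + 1)."""
    dim = len(speech_exemplars[0])
    n_mod = dim // N_GAMMATONE  # number of modulation bands, M + 1
    # Standard deviation of every feature dimension over the speech exemplars.
    # Assumption: population standard deviation; no band is constant.
    sigma = [statistics.pstdev([ex[i] for ex in speech_exemplars]) for i in range(dim)]
    weights = []
    for g in range(N_GAMMATONE):
        # Average sigma of gammatone band g over all modulation bands.
        mean_sigma = sum(sigma[N_GAMMATONE * m + g] for m in range(n_mod)) / n_mod
        weights.append(1.0 / mean_sigma)
    return weights


def equalize(ems, weights):
    """Multiply every element E_{m,g} of an EMS vector by its weight w_g."""
    return [value * weights[i % N_GAMMATONE] for i, value in enumerate(ems)]


# Toy dictionary with two modulation bands (M + 1 = 2) and two exemplars.
toy_exemplars = [
    [0.0] * 30,
    [2.0 * (g + 1) for g in range(15)] + [4.0 * (g + 1) for g in range(15)],
]
toy_weights = equalization_weights(toy_exemplars)
toy_equalized = equalize(toy_exemplars[1], toy_weights)
```

Because a single weight per gammatone band is shared by all modulation bands, the ratio between the two modulation bands in the toy exemplar is unchanged after equalization, which corresponds to retaining the relative importance of the modulation bands.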
Algorithms for finding the optimal representation of unknown observations in the form of a sparse sum of exemplars are sensitive to the (Euclidean) norm of the observations and exemplars. Therefore, we normalized all exemplars and all unknown feature vectors to unit Euclidean norm. However, for speech-silence segmentation, information about the absolute magnitude of the filter outputs is needed. We used the unnormalized EMS vectors for that purpose.
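A minimal sketch of the unit-norm scaling is given below. The helper name `unit_norm` and the handling of an all-zero vector are assumptions; the point is only that the raw frame is kept alongside the normalized one, because the magnitude information is lost after scaling.

```python
import math


def unit_norm(vector):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(value * value for value in vector))
    if norm == 0.0:
        return list(vector)  # assumption: leave an all-zero vector unchanged
    return [value / norm for value in vector]


raw_frame = [3.0, -4.0]              # kept for speech-silence segmentation
coding_frame = unit_norm(raw_frame)  # used as input to sparse coding
```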
Sparse coding. Unknown observations $\overrightarrow{EMS}(t)$ are reconstructed as a sparse linear combination of exemplars from a dictionary $A$ that contains both speech and noise exemplars,
$$\overrightarrow{EMS}(t) \approx \sum_{n=1}^{N} s_n \vec{a}_n = A\vec{S}, \tag{4.4}$$
where $\vec{S}$ is a sparse weight vector that contains the non-negative exemplar activation scores of the dictionary exemplars that minimize the Euclidean distance between the test vector $\overrightarrow{EMS}(t)$ and the reconstructed version, subject to a sparsity constraint controlled by $\lambda$, as formalized in eq. (4.5).
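To make this optimization concrete, the following is a minimal sketch of a non-negative Lasso solver based on cyclic coordinate descent. It is not the procedure of Efron et al. (2004) used in this chapter. The objective $\frac{1}{2}\|\vec{x} - A\vec{s}\|_2^2 + \lambda \sum_n s_n$ with $s_n \ge 0$, the function name `nonneg_lasso`, the iteration count, and the toy dictionary are all assumptions chosen for illustration.

```python
def nonneg_lasso(x, exemplars, lam, n_iter=200):
    """Approximately minimize 0.5*||x - sum_n s_n a_n||^2 + lam*sum_n s_n, s_n >= 0."""
    s = [0.0] * len(exemplars)
    residual = list(x)  # equals x - A s, and s is all zeros at the start
    sq_norms = [sum(v * v for v in a) for a in exemplars]
    for _ in range(n_iter):
        for n, a in enumerate(exemplars):
            if sq_norms[n] == 0.0:
                continue
            # Correlation of exemplar n with the residual that excludes
            # the current contribution of exemplar n itself.
            rho = sum(ai * ri for ai, ri in zip(a, residual)) + sq_norms[n] * s[n]
            # Soft threshold, clipped at zero to keep the activation non-negative.
            new_s = max(0.0, (rho - lam) / sq_norms[n])
            delta = new_s - s[n]
            if delta != 0.0:
                residual = [ri - delta * ai for ri, ai in zip(residual, a)]
                s[n] = new_s
    return s


# Toy dictionary of three unit-norm exemplars, one with a negative component.
toy_dictionary = [[1.0, 0.0], [0.0, -1.0], [0.0, 1.0]]
toy_observation = [0.5, -0.2]
toy_activations = nonneg_lasso(toy_observation, toy_dictionary, lam=0.1)
```

In the toy case the observation has a negative component, which is matched by the exemplar that also has a negative component, while the activations themselves stay non-negative. The third exemplar points away from the observation and receives zero activation.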
From activations to posterior probabilities. The exemplar activation scores must be converted into state posterior probabilities. For that purpose, we use the state labels of the speech exemplars in the dictionary. As the exemplar dictionary $A = [A_s, A_n]$ is the concatenation of a speech and a noise dictionary, the activation vector $\vec{S}$ in eq. (4.5) can be split into two separate parts, $\vec{S} = \begin{bmatrix} \vec{S}_s \\ \vec{S}_n \end{bmatrix}$, indicating the weights corresponding to speech and noise exemplars, respectively. Since the noise exemplar activations are irrelevant for estimating the posterior state probabilities, we ignore the noise exemplar activations ($\vec{S}_n$). With $\vec{L}_{1 \times N_{A_s}}$ the label vector ($N_{A_s} = 17{,}148$ is the number of speech exemplars), and the $i$th element $1 \le L_i \le 179$ representing the label of the $i$th exemplar in the speech dictionary, we compute a cumulative state activation vector $\vec{C}$ in which each element $C_j$, $j = 1, 2, \ldots, 179$, is the sum of the activation scores corresponding to dictionary exemplars that have state label number $j$:
$$C_j = \sum_{\{i \mid L_i = j\}} S_i, \tag{4.6}$$
where $S_i$ is the $i$th element in $\vec{S}_s$. The state posterior probability estimate is then computed by normalizing the vector $\vec{C}$ to L1 norm 1:
$$\vec{P} = \frac{\vec{C}}{\sum_{j=1}^{179} C_j}. \tag{4.7}$$
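Equations (4.6) and (4.7) can be sketched in a few lines. The function name `state_posteriors` and the uniform fallback for a frame without any speech activation are assumptions; labels are taken to be 1-based, as in the text.

```python
N_STATES = 179  # number of states in the aurora-2 task


def state_posteriors(speech_activations, labels, n_states=N_STATES):
    """Sum speech activations per state label (4.6) and normalize to L1 norm 1 (4.7)."""
    cumulative = [0.0] * n_states
    for activation, label in zip(speech_activations, labels):
        cumulative[label - 1] += activation  # labels run from 1 to n_states
    total = sum(cumulative)
    if total == 0.0:
        # Assumption: without any speech activation, fall back to a uniform estimate.
        return [1.0 / n_states] * n_states
    return [c / total for c in cumulative]


# Toy usage with three states and four speech exemplars.
toy_posteriors = state_posteriors([0.2, 0.0, 0.6, 0.2], [1, 2, 1, 3], n_states=3)
```

In the toy call the two exemplars labelled with state 1 pool their activations, so state 1 receives most of the probability mass even though no single exemplar dominates.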
As in Gemmeke et al. (2011b), it appeared that the procedure of eq. (4.6) systematically underestimates the posterior probability of the three silence states. This is due to the fact that the normalization of all EMS vectors to unit length effectively equalizes the overall magnitude, thereby destroying most of the information that distinguishes silence from speech. Therefore, we implemented an additional procedure that estimates the probability of a frame being either speech or silence on the basis of the unnormalized feature values. In frames that were classified as silence by this procedure the posterior probability of the three silence states was set to 0.333, and the posterior probability of the 176 speech states was set to some small floor value.
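The override for silence frames can be sketched as follows. The speech-silence decision itself is not specified here, so it enters as a boolean argument. The positions of the three silence states in the posterior vector (`SILENCE_STATES`) and the floor value are hypothetical placeholders, not values taken from this chapter.

```python
SILENCE_STATES = (176, 177, 178)  # assumption: 0-based indices of the silence states
FLOOR = 1e-7                      # assumption: the small floor value is not given


def apply_silence_override(posteriors, is_silence,
                           silence_states=SILENCE_STATES, floor=FLOOR):
    """Replace the posteriors of a frame that was classified as silence."""
    if not is_silence:
        return list(posteriors)
    overridden = [floor] * len(posteriors)
    for state in silence_states:
        overridden[state] = 0.333
    return overridden


# Toy usage: a uniform 179-state posterior for a frame classified as silence.
uniform_frame = [1.0 / 179] * 179
silence_frame = apply_silence_override(uniform_frame, is_silence=True)
speech_frame = apply_silence_override(uniform_frame, is_silence=False)
```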