Figure 3.7: Restricted NMF reconstructions, R = 12 speaker independent basis and 500 iterations, of the spectrograms
4 NEAR END SPEAKER EXTRACTION USING NONNEGATIVE
4.2 Nonnegative Matrix Factorization Near-end Speaker Extraction 1 Formulation of NMF-NSE
4.3.2 Parameter Study
4.3.2.2 Study of Rv, Rd and N
This study elucidates the inherent trade-off between echo reduction and distortion (during DT) by examining the influence that different values for Rd and Rv have on NMF-NSE
performance. This study also examines the effect of window size on NMF-NSE performance. The values tested for these parameters were Rd = [1, 2…16], Rv = [1, 2…16], and N = [64, 128, 256, 512, 1024, 2048] samples. The results of this study are displayed in Figure 4.1, in which each surface plot contains the results for a particular performance measure and window size for each combination of Rv and Rd. A separate close-up view of the results for N = 512 is displayed in Figure 4.2.
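The parameter sweep described above can be enumerated as in the following minimal sketch. The ranges are those quoted in the text; the function name `parameter_grid` and the loop order are illustrative assumptions, not part of the original experimental code.

```python
from itertools import product

# Parameter ranges as stated in the text.
RD_VALUES = range(1, 17)                     # Rd = 1, 2, ..., 16
RV_VALUES = range(1, 17)                     # Rv = 1, 2, ..., 16
N_VALUES = [64, 128, 256, 512, 1024, 2048]   # window sizes in samples


def parameter_grid():
    """Yield every (N, Rd, Rv) combination covered by the study."""
    for n, rd, rv in product(N_VALUES, RD_VALUES, RV_VALUES):
        yield n, rd, rv


grid = list(parameter_grid())
print(len(grid))  # 6 window sizes x 16 x 16 rank pairs = 1536
```

Each surface plot in Figure 4.1 then corresponds to the 16 x 16 slice of this grid for one value of N and one performance measure.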
In general, the averaged ERLE results in Figure 4.1 and Figure 4.2 indicate that in the absence of near-end speech (no DT), less echo and noise reside in the output v̂(n) signals for higher values of Rd (more basis vectors in Bd(k)) and for lower values of Rv (fewer basis vectors in Bv) [...] Rd = 1 and Rv = 16 producing the highest averaged ERLE for N. Ignoring noise for the present,
the averaged ERLE values imply that by increasing the number of basis vectors in Bd(k), for a
fixed Rv and N, less echo matching occurs. This outcome was expected since an increase in
the number of basis vectors in Bd(k) better enables this basis to express the variability of its
source d(k). The averaged ERLE results also imply that by decreasing the number of basis vectors in Bv (down to the minimum of Rv = 1), for a fixed Rd and N, echo matching is also reduced. This reduction is attributable to the reduced speech variability that a lower rank Bv can express; it is therefore less able to erroneously express d(k). The rise in ERLE for decreases in Rv or increases in Rd is not, however, uniform across all values of N, Rv or Rd. For example, for N = 1024 the averaged ERLE rises relatively sharply for Rd between 1 and 7 for all Rv, followed by relatively small increases for Rd > 7, particularly for Rv > 2.
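For readers unfamiliar with the measure, ERLE over an echo-only segment can be sketched as below. This uses the commonly quoted definition (ratio of microphone signal power to output signal power, in dB); the exact definition and averaging scheme used for the results here are not restated in this section, so the function `erle_db` and its `eps` guard are assumptions for illustration only.

```python
import math


def erle_db(mic, out, eps=1e-12):
    """Echo return loss enhancement in dB for an echo-only segment.

    mic: samples of the microphone signal y(n)
    out: samples of the processed output for the same segment
    eps: small constant guarding against division by zero (assumption)
    """
    p_in = sum(s * s for s in mic) / len(mic)
    p_out = sum(s * s for s in out) / len(out)
    return 10.0 * math.log10((p_in + eps) / (p_out + eps))


# Toy example: the output carries one tenth of the input amplitude,
# i.e. one hundredth of the power, which is roughly 20 dB of ERLE.
y = [1.0, -1.0, 1.0, -1.0]
v_hat = [0.1, -0.1, 0.1, -0.1]
print(round(erle_db(y, v_hat), 2))
```

A higher value means less residual echo in the output, which is the sense in which the surfaces in Figure 4.1 are read.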
Figure 4.1: Experimental results illustrating the influence of Rd, Rv and N on NMF-NSE performance in the absence of DT and during DT. Each column of plots displays a different value for N while each row displays a different performance measure; the z-axis label of the leftmost plot indicates the particular measure. For each row all plots are plotted across the same scale, with the lowermost and uppermost z-axis labels indicating the minimum and maximum values (rounded to nearest 1/100) attained across
The averaged ERLE results also depend on N: for each increase in N from 64 to 2048 samples there is a rise in averaged ERLE for each pair of Rd and Rv values, demonstrating that the output v̂(n) signals contain less echo for longer windows in the absence of DT. However, if Rd is adjusted such that Bd(k) spans approximately the same time interval of x(n) for each N,
these performance disparities are reduced. For example, for Rv = 2, Rd = 8, and N = [128, 256, 512, 1024, 2048], averaged ERLE = [15.5089, 19.7951, 25.8791, 30.1866, 32.0126] dB respectively, while for Rd adjusted, i.e. Rd = [16, 8, 4, 2, 1], the corresponding averaged ERLE values become [18.0708, 19.7951, 20.2526, 20.8338, 21.2105] dB. Nonetheless, it is apparent that longer frame lengths are still preferable for maximal echo reduction in the absence of DT. It is probable that less echo matching occurs for longer frames because the basis vectors of the composite basis B(k) are required to fit more frequency bins, and so the more general basis vectors in the Bv component of B(k) are less likely to be matched to a portion of y(k). Furthermore, shorter windows generate spectra with lower resolution of their constituent spectral components, and therefore Bd(k), Bv and y(k) are populated with less
distinctive and more blurred spectral features, enabling the less specific basis vectors of Bv to
more easily fit portions of y(k).
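The adjusted Rd values quoted above, Rd = [16, 8, 4, 2, 1] for N = [128, 256, 512, 1024, 2048], are consistent with holding the product Rd x N fixed at 2048 samples. The sketch below reproduces that mapping under this assumption; it ignores frame overlap, and the function name and the `span_samples` parameter are hypothetical.

```python
def rank_for_constant_span(window_sizes, span_samples=2048):
    """Map each window size N to the Rd for which Rd * N == span_samples.

    Assumes the time span covered by Bd(k) is simply Rd * N samples,
    which neglects any overlap between successive frames.
    """
    ranks = {}
    for n in window_sizes:
        if span_samples % n != 0:
            raise ValueError(f"span {span_samples} is not a multiple of N = {n}")
        ranks[n] = span_samples // n
    return ranks


adjusted = rank_for_constant_span([128, 256, 512, 1024, 2048])
print(adjusted)  # {128: 16, 256: 8, 512: 4, 1024: 2, 2048: 1}
```

With Rd tied to N in this way, the averaged ERLE values quoted in the text vary by about 3 dB across N rather than by more than 16 dB for fixed Rd = 8.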
Assessing now the averaged SIR surface plots in Figure 4.1, which pertain to the echo reduction performance of NMF-NSE during DT: for each increase in Rd there is a slight increase in averaged SIR for all Rv, indicating lower levels of echo remaining in the output v̂(n) signals. A similar trend is seen in the averaged ERLE results, which suggests, as expected, that both in the absence of DT and during DT echo matching declines with an increase in Rd for a fixed Rv. As described in the context of averaged ERLE, the reductions in echo matching are due to the enhanced ability of a higher rank Bd(k) to express d(k). The influence
of Rv on the averaged SIR results is more complex; for example, for N = 256 averaged SIR peaks for Rv = 9 and declines thereafter. Similar features are seen for higher values of N, though in general for different values of Rv. This feature of the averaged SIR results is somewhat surprising, given the expectation that a reduction in Rv, which should consistently decrease echo matching by curtailing the speech variability Bv can express, would lead to an increase in averaged SIR, as it did for averaged ERLE. For N = 64 and 128, however, averaged SIR decreases on average for an increase in Rv, as expected. Although not shown here, for smaller NFR values, i.e. a greater proportion of echo in y(n) during DT, the averaged SIR values vary much more closely with those of averaged ERLE, as expected, suggesting that for NFR = 0 dB, SIR is somewhat ineffective at measuring the echo interference in v̂(n).
Figure 4.2: Experimental results illustrating the influence of Rd and Rv for N = 512 samples on NMF-NSE performance. [Six surface plots over Rd and Rv, each axis running from 1 to 16. z-axis ranges: ERLE 2.84 to 38.04 dB; SIR 9 to 20.67 dB; SAR 3.19 to 16.58 dB; SNR 33.85 to 39.19 dB; SDR 2.65 to 12.84 dB; LSD 3.58 to 5.3 dB.]
The averaged SAR results in Figure 4.1 are characterized by rising values for increasing Rv and for decreasing Rd, for all N. This is to be expected: by increasing the number of basis vectors in Bv, which extends the range of speech it can express for a fixed Bd(k), or by decreasing the number of basis vectors in Bd(k), which restricts the range of speech it can express for a fixed Bv, a larger portion of v(k) is matched onto Bv. Speaker matching is therefore reduced, with a commensurate reduction of distortion in the output v̂(n) signals. However, as indicated by the averaged SIR values, a decrease in Rd promotes echo matching during DT; consequently, the averaged SIR and SAR results jointly demonstrate that during DT the choice of Rd is a trade-off between echo matching and speaker matching. A similar trade-off is seen for Rv for N = 64 and 128; but for N = 256, 512, 1024 and 2048 the choice of Rv is less of a compromise, since both averaged SAR and SIR increase with Rv up to a point, after which averaged SIR starts to decrease while averaged SAR continues to increase. The averaged SAR results vary negligibly across N.
The averaged SNR results in Figure 4.1 suggest that during DT more noise is omitted from the output v̂(n) signals for higher values of Rv, Rd and N. The omitted noise corresponds to the portion of noise that is matched onto Bd(k) and is therefore absent from Bv. Given the lack of structure of w(n) in the magnitude-STFT domain, it is natural to assume that the noise contribution in y(k) is divided equally between the basis vectors of B(k), such that the noise is assigned to Bv and Bd(k) in proportion to Rv and Rd respectively. This assumption is partly supported by the averaged SNR results, which increase with Rd across all N; however, the results also exhibit a slight increase with Rv for higher values of N, which does not corroborate this assumption. The averaged SNR values peak at N = 1024 for most values of Rv and Rd.
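The equal-split assumption stated above has a simple numerical consequence, sketched below: if the noise is shared equally among all Rv + Rd basis vectors, the share reaching the output through Bv is Rv / (Rv + Rd). Treating that share as a power ratio, and the function names themselves, are assumptions made purely for illustration.

```python
import math


def noise_share_in_output(rv, rd):
    """Fraction of the noise assigned to Bv when it is split equally
    among all Rv + Rd basis vectors (the assumption stated in the text)."""
    return rv / (rv + rd)


def predicted_noise_reduction_db(rv, rd):
    """Noise reduction in dB implied by the equal-split assumption,
    treating the share as a power ratio (an additional assumption)."""
    return -10.0 * math.log10(noise_share_in_output(rv, rd))


for rv, rd in [(2, 8), (2, 16), (4, 8)]:
    print(rv, rd, round(predicted_noise_reduction_db(rv, rd), 2))
```

Under this model the predicted reduction grows with Rd and shrinks with Rv, which matches the observed trend in Rd but not the slight rise with Rv reported for higher N.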
The averaged SDR values of this study, which can be thought of as a combination of the averaged SAR, SIR and SNR values, correlate substantially more with the averaged SAR values than with either averaged SIR or averaged SNR. This implies that distortion is the primary error in the output v̂(n) signals during DT. Thus, while averaged SAR and SDR also reflect model error and error due to phase substitution, speaker matching, as opposed to echo matching, contributes most to the error during DT. The averaged LSD values, which like averaged SDR are an overall performance measure, vary negatively with the averaged SDR values for all N, implying that the averaged LSD values also reflect primarily distortion. The predominance of speaker matching error in NMF-NSE during DT is ascribable to the generality of the basis vectors in Bv, which limits its ability to fit v(k), with [...] of v(k). On the other hand, relatively low echo matching error occurs during DT, owing to the likeness of the basis vectors in Bd(k) to d(k), and the lack of specificity of Bv for d(k).
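The remark that SDR combines SAR, SIR and SNR can be made concrete with the widely used BSS_EVAL style energy ratios, sketched below. It is assumed here, not confirmed by this section, that the measures follow those definitions; the decomposition of the output into target, interference, noise and artifact components is taken as given, and all names are illustrative.

```python
import math


def _energy(x):
    return sum(s * s for s in x)


def _add(*signals):
    # Sample-wise sum of equally long signals.
    return [sum(vals) for vals in zip(*signals)]


def _ratio_db(num, den):
    return 10.0 * math.log10(num / den)


def bss_measures(s_target, e_interf, e_noise, e_artif):
    """Return (SDR, SIR, SNR, SAR) in dB from the four output components."""
    sdr = _ratio_db(_energy(s_target), _energy(_add(e_interf, e_noise, e_artif)))
    sir = _ratio_db(_energy(s_target), _energy(e_interf))
    snr = _ratio_db(_energy(_add(s_target, e_interf)), _energy(e_noise))
    sar = _ratio_db(_energy(_add(s_target, e_interf, e_noise)), _energy(e_artif))
    return sdr, sir, snr, sar


# Toy components in which the artifact (distortion) term dominates the error.
sdr, sir, snr, sar = bss_measures(
    s_target=[1.0, 0.0, 0.0, 0.0],
    e_interf=[0.0, 0.1, 0.0, 0.0],
    e_noise=[0.0, 0.0, 0.01, 0.0],
    e_artif=[0.0, 0.0, 0.0, 0.5],
)
print(round(sdr, 2), round(sir, 2), round(snr, 2), round(sar, 2))
```

When the artifact term is the largest error component, as in this toy case, SDR sits close to SAR and well below SIR and SNR, which is the pattern the paragraph above describes.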
Contrasting now the performance of NMF-NSE in the absence of DT (echo only) and during DT: according to the DT performance measure values, speaker matching, or equivalently distortion during DT, can be reduced by either lowering Rd (decreasing
the number of basis vectors in Bd(k)) or by increasing Rv (increasing the number of basis
vectors in Bv); however, either of these changes would also increase echo matching during
periods of echo only, leading to increased residual echo for such periods. It follows therefore that the choice of Rd and the choice of Rv are each a trade-off between echo reduction during
echo only periods, and distortion during DT. Furthermore, although the effect of echo matching during DT is less significant than that of speaker matching, a decrease in Rd, and to
a lesser extent Rv, also increases residual echo during DT. Consequently, the choice of Rd and
of Rv can be stated more generally as a trade-off between echo matching and speaker
matching, or as a trade-off between increased echo reduction and increased distortion of the near-end speech.
Contrasting now the results in terms of N, it can be observed that in the absence of DT longer window lengths produce more echo reduction, while during DT optimal performance is attained for window lengths of 512 or 1024 samples. The variability in the DT performance values across N may be related to the influence that N has on the validity of the assumption of pairwise disjoint supports of speech signals in the STFT domain, which was given as a justification for the model (4.1). It was demonstrated in [155] that speech signals generally satisfy this assumption in practice, but that the degree to which they do so varies with N, as well as with other factors. For example, for speech signals sampled at 16 kHz the optimal value for N was shown empirically to be 1024 samples (64 ms), which we assume corresponds to a window size of 512 samples (64 ms) for speech sampled at 8 kHz. Relating this finding to the performance of NMF-NSE during DT, it is apparent from the results in Figure 4.1 that averaged SIR peaks at N = 512 for all Rv and Rd, indicating that optimum echo reduction occurs during DT for this value of N. This in turn influences averaged SDR, which also exhibits a small peak, though spread over N = 512 and N = 1024, and the averaged LSD values, which have a slight trough at N = 512 and N = 1024 for the same Rv and Rd values. The results therefore indicate that an optimum value of N exists during DT, which we contend is linked to the level of pairwise disjointness between d(k) and v(k). This relationship may arise because increased disjointness of the sources facilitates more accurate matching of their energy onto their respective basis vectors in B(k) during the φ updates, i.e. the restricted NMF, with less overlap between bases. Furthermore, increased disjointness between d(k) and v(k) implies that the energy in each time-frequency bin of y(k) is more likely to belong exclusively to either d(k) or v(k). If the spectral energy in such a bin is assigned to the correct basis during the φ updates, the possibility of cross matching during the assignment of e(k) in the subsequent ψ updates of both B(k) and g(k) is mitigated.
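To make the notion of disjointness between d(k) and v(k) more tangible, the sketch below scores how little two magnitude spectrograms overlap in the time-frequency plane. This is a simple illustrative score, not the measure used in [155]; the function name and the list-of-lists representation of the spectrograms are assumptions.

```python
def disjointness(mag_d, mag_v):
    """Illustrative disjointness score in [0, 1] for two magnitude spectrograms.

    mag_d, mag_v: equally sized lists of frames, each a list of bin magnitudes.
    Returns 1.0 when no time-frequency bin holds energy from both sources
    and 0.0 when both sources have identical energy in every bin.
    """
    shared = 0.0
    total = 0.0
    for frame_d, frame_v in zip(mag_d, mag_v):
        for d, v in zip(frame_d, frame_v):
            p_d, p_v = d * d, v * v
            shared += min(p_d, p_v)   # energy present in both sources
            total += p_d + p_v
    if total == 0.0:
        raise ValueError("both spectrograms are empty of energy")
    return 1.0 - 2.0 * shared / total


# Two sources occupying different bins, then two identical sources.
print(disjointness([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]))  # 1.0
print(disjointness([[1.0, 1.0]], [[1.0, 1.0]]))                          # 0.0
```

Evaluating such a score on echo and near-end spectrograms computed with different window sizes would be one way to examine the link between N and DT performance suggested above.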