In this section, we explain different DTD algorithms that can be useful for AEC. We start with the Geigel algorithm since it was the very first DTD proposal.
3.1 THE GEIGEL ALGORITHM
A very simple algorithm due to A. A. Geigel [8] is to declare the presence of near-end speech whenever

$$\xi^{(g)} = \frac{\max\{|x(n)|, \ldots, |x(n-L_g+1)|\}}{|y(n)|} < T,$$

where $L_g$ and $T$ are suitably chosen constants. This detection scheme is based on a waveform level comparison between the microphone signal $y(n)$ and the far-end speech $x(n)$, assuming the near-end speech $v(n)$ in the microphone signal will typically be stronger than the echo $\mathbf{h}^T\mathbf{x}$. The maximum, or $\ell_\infty$ norm, of the $L_g$ most recent samples of $x(n)$ is taken for the comparison because of the undetermined delay in the echo path. The threshold $T$ compensates for the energy level of the echo path response $\mathbf{h}$ and is often set to 2 for network echo cancelers because the hybrid loss is typically about 6 dB or more. For an AEC, however, it is not clear how to set a universal threshold that works reliably in all situations, because the loss through the acoustic echo path can vary greatly depending on many factors. For $L_g$, one choice is to set it equal to the adaptive filter length $L$, since we can assume that the echo path is covered by this length.
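The comparison above can be sketched in a few lines of Python. This is a minimal illustration, not a production detector: the function name `geigel_dtd`, the window length of 4, and the toy signals are assumptions made for the example. The test `max|x| / |y| < T` is written as `max|x| < T * |y|` to avoid dividing by a zero microphone sample.

```python
from collections import deque

def geigel_dtd(x, y, window=4, threshold=2.0):
    """Return a list of booleans, True where near-end speech is declared."""
    recent = deque(maxlen=window)  # magnitudes of the last `window` far-end samples
    flags = []
    for xn, yn in zip(x, y):
        recent.append(abs(xn))
        # Declare double-talk when max|x| / |y| < T (written without division).
        flags.append(max(recent) < threshold * abs(yn))
    return flags

# Toy data (hypothetical): echo is the far-end signal delayed by one sample
# and scaled by 0.4 (about 8 dB of loss); a near-end burst occurs at n = 4.
far = [1.0, -0.8, 0.6, -1.0, 0.9, -0.7]
echo = [0.0] + [0.4 * s for s in far[:-1]]
near = [0.0, 0.0, 0.0, 0.0, 1.5, 0.0]
mic = [e + v for e, v in zip(echo, near)]
flags = geigel_dtd(far, mic)
print(flags)
```

With these signals only the sample carrying the near-end burst is flagged; with a louder echo path (less than 6 dB of loss) the same threshold would start to raise false alarms, which is the difficulty noted above for acoustic paths.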
3.2 THE CROSS-CORRELATION METHOD
In [9] the cross-correlation coefficient vector between $\mathbf{x}$ and the error signal $e$ was proposed as a means for double-talk detection. A similar idea using the cross-correlation coefficient vector between $\mathbf{x}$ and $y$ has proven more robust and reliable [10, 6].
This section will therefore focus on the cross-correlation coefficient vector

$$\mathbf{c}_{xy}^{(1)} = \frac{E\{\mathbf{x}(n)\,y(n)\}}{\sqrt{E\{x^2(n)\}\,E\{y^2(n)\}}} = \frac{\mathbf{r}_{xy}}{\sigma_x \sigma_y} = [c_{xy,0}\;\; c_{xy,1}\;\; \cdots\;\; c_{xy,L-1}]^T,$$

where $c_{xy,i}$ is the cross-correlation coefficient between $x(n-i)$ and $y(n)$. The idea here is to compare

$$\xi^{(1)} = \|\mathbf{c}_{xy}^{(1)}\|_\infty = \max_i |c_{xy,i}| \tag{6.7}$$

to a threshold level $T$. The decision rule is very simple: if $\xi^{(1)} \ge T$, then double-talk is not present; if $\xi^{(1)} < T$, then double-talk is present.
Although the $\ell_\infty$ norm used in (6.7) is perhaps the most natural, other scalar metrics, e.g., the $\ell_1$ or $\ell_2$ norm, could alternatively be used to assess the cross-correlation coefficient vectors. However, there is a fundamental problem here which is not linked to the type of metric used. The problem is that these cross-correlation coefficient vectors are not well normalized. Indeed, we can only say in general that $\xi^{(1)} \le 1$. If $v(n) = 0$, that does not imply that $\xi^{(1)} = 1$ or any other known value. We do not know the value of $\xi^{(1)}$ in general. The amount of correlation will depend a great deal on the statistics of the signals and of the echo path. As a result, the best value of $T$ will vary a lot from one experiment to another. So there is no natural threshold level associated with the variable $\xi^{(1)}$ when $v(n) = 0$.
The next section presents a decision variable that exhibits better properties than the cross-correlation algorithm. This decision variable is formed by properly normalizing the cross-correlation vector between $\mathbf{x}$ and $y$.
3.3 THE NORMALIZED CROSS-CORRELATION METHOD
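There is a simple way to normalize the cross-correlation vector between a vector $\mathbf{x}$ and a scalar $y$ in order to have a natural threshold level for $\xi$ when $v = 0$. Suppose that $v = 0$. In this case:

$$\sigma_y^2 = \mathbf{h}^T \mathbf{R}_{xx} \mathbf{h}, \tag{6.9}$$

where $\mathbf{R}_{xx} = E\{\mathbf{x}\mathbf{x}^T\}$. Since $y = \mathbf{h}^T\mathbf{x}$, we have

$$\mathbf{r}_{xy} = E\{\mathbf{x}y\} = \mathbf{R}_{xx}\mathbf{h}, \tag{6.10}$$

and (6.9) can be re-written as

$$\sigma_y^2 = \mathbf{r}_{xy}^T \mathbf{R}_{xx}^{-1} \mathbf{r}_{xy}. \tag{6.11}$$

In general, for $v \neq 0$ we have,

$$\sigma_y^2 = \mathbf{r}_{xy}^T \mathbf{R}_{xx}^{-1} \mathbf{r}_{xy} + \sigma_v^2. \tag{6.12}$$

If we divide (6.11) by (6.12) and compute its square root, we obtain the decision variable [5, 14]

$$\xi^{(2)} = \sqrt{\mathbf{r}_{xy}^T (\sigma_y^2 \mathbf{R}_{xx})^{-1} \mathbf{r}_{xy}} = \|\mathbf{c}_{xy}^{(2)}\|_2, \tag{6.13}$$

where

$$\mathbf{c}_{xy}^{(2)} = (\sigma_y^2 \mathbf{R}_{xx})^{-1/2}\, \mathbf{r}_{xy} \tag{6.14}$$

is what we will call the normalized cross-correlation vector between $\mathbf{x}$ and $y$.
Substituting (6.10) and (6.12) into (6.13), we show that the decision variable is:

$$\xi^{(2)} = \frac{\sqrt{\mathbf{h}^T \mathbf{R}_{xx} \mathbf{h}}}{\sqrt{\mathbf{h}^T \mathbf{R}_{xx} \mathbf{h} + \sigma_v^2}}. \tag{6.15}$$

We easily deduce from (6.15) that $\xi^{(2)} = 1$ for $v = 0$ and $\xi^{(2)} < 1$ for $v \neq 0$.
Note also that $\xi^{(2)}$ is not sensitive to changes of the echo path when $v = 0$. For the particular case when $x$ is white Gaussian noise, the autocorrelation matrix is diagonal: $\mathbf{R}_{xx} = \sigma_x^2 \mathbf{I}$. Then (6.14) becomes:

$$\mathbf{c}_{xy}^{(2)} = \frac{\mathbf{r}_{xy}}{\sigma_x \sigma_y},$$

i.e., the cross-correlation coefficient vector of Section 3.2.
Hence, in general, what we are doing in (6.13) is equivalent to prewhitening the signal $\mathbf{x}$, which is one of many known “generalized cross-correlation” techniques [15]. Thus, when $\mathbf{x}$ is white, no prewhitening is necessary and the two vectors coincide. This suggests a more practical implementation, whereby matrix operations are replaced by an adaptive prewhitening filter [16].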
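The behavior of the normalized statistic can be checked numerically. The sketch below evaluates $\sqrt{\mathbf{r}_{xy}^T \mathbf{R}_{xx}^{-1} \mathbf{r}_{xy} / \sigma_y^2}$ for a two-tap example with known statistics. The matrix `R`, the echo path `h`, the near-end powers, and the helper names `solve` and `ncc_statistic` are all assumptions chosen for illustration; a real detector would work from estimated quantities.

```python
import math

def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        z[r] = (M[r][n] - sum(M[r][k] * z[k] for k in range(r + 1, n))) / M[r][r]
    return z

def ncc_statistic(R, r, var_y):
    """Normalized cross-correlation statistic sqrt(r^T R^{-1} r / var_y)."""
    z = solve(R, r)
    return math.sqrt(sum(ri * zi for ri, zi in zip(r, z)) / var_y)

R = [[1.0, 0.5], [0.5, 1.0]]   # far-end autocorrelation matrix (colored input)
h = [0.6, -0.3]                # hypothetical echo path
r = [sum(R[i][j] * h[j] for j in range(2)) for i in range(2)]   # r = R h
echo_power = sum(h[i] * r[i] for i in range(2))                 # h^T R h

xi_single = ncc_statistic(R, r, echo_power)          # no near-end signal
xi_double = ncc_statistic(R, r, echo_power + 0.27)   # near-end power = echo power
print(xi_single, xi_double)
```

Without near-end signal the statistic is one regardless of the chosen echo path; when the near-end power equals the echo power it drops to $\sqrt{0.5} \approx 0.707$, so a threshold slightly below one is a natural choice.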
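Finally, a fast version of (6.15) can be derived by recursively updating $\mathbf{R}_{xx}^{-1}\mathbf{r}_{xy}$ using the Kalman gain $\mathbf{k}(n) = \mathbf{R}_{xx}^{-1}(n)\mathbf{x}(n)$. Estimated quantities of the cross-correlation and the near-end signal power have to be introduced for the derivation of a fast version. Equation (6.15) can be rewritten as

$$\xi^2(n) = \frac{\hat{\mathbf{r}}_{xy}^T(n)\, \hat{\mathbf{R}}_{xx}^{-1}(n)\, \hat{\mathbf{r}}_{xy}(n)}{\hat\sigma_y^2(n)} = \frac{\eta^2(n)}{\hat\sigma_y^2(n)},$$

where we squared the statistic for simplicity. The correlation variables are estimated recursively as,

$$\hat{\mathbf{r}}_{xy}(n) = \lambda\, \hat{\mathbf{r}}_{xy}(n-1) + \mathbf{x}(n)\, y(n),$$

$$\hat\sigma_y^2(n) = \lambda\, \hat\sigma_y^2(n-1) + y^2(n), \tag{6.20}$$

where $\lambda$ is a forgetting factor. The statistic can be shown to be updated as

$$\eta^2(n) = \lambda\, \eta^2(n-1) + y^2(n) - \varphi(n)\, e^2(n), \tag{6.22}$$

where $\varphi(n) = 1 - \mathbf{k}^T(n)\mathbf{x}(n)$ is the likelihood variable and $e(n) = y(n) - \hat{\mathbf{h}}^T(n-1)\mathbf{x}(n)$ is the residual error. Hence, the quantities required to form the test statistic of the fast version of the NCC DTD are given by the simple first-order recursions in (6.20) and (6.22).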
Table 6.1 gives the calculations for the fast NCC DTD, where it is assumed that the Kalman gain has been calculated “for free” by the FRLS algorithm [17].
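To make the recursive form concrete, the following sketch tracks the squared statistic for a one-tap ($L = 1$) canceler, where the Kalman gain is a scalar and no FRLS machinery is needed. It assumes exponentially weighted correlation estimates and the standard RLS relation between the minimum cost and the a priori error; the function name `fast_ncc_scalar`, the initial values, and the test signals are hypothetical. The second field of each output tuple recomputes $r^2/R$ directly, so the recursion can be compared against the direct evaluation.

```python
import random

def fast_ncc_scalar(x, y, lam=0.99, delta=1e-3):
    """Track eta^2 and xi^2 = eta^2 / sigma_y^2 for a one-tap RLS canceler."""
    R, r, h = delta, 0.0, 0.0      # input power, cross-correlation, filter tap
    eta2, var_y = 0.0, 1e-12       # recursive r^2 / R and microphone power
    out = []
    for xn, yn in zip(x, y):
        e = yn - h * xn            # a priori residual error
        R = lam * R + xn * xn
        r = lam * r + xn * yn
        k = xn / R                 # Kalman gain (scalar case)
        phi = 1.0 - k * xn         # likelihood variable
        h += k * e                 # RLS coefficient update
        var_y = lam * var_y + yn * yn
        eta2 = lam * eta2 + yn * yn - phi * e * e
        out.append((eta2, r * r / R, eta2 / var_y))
    return out

random.seed(3)
x = [random.gauss(0.0, 1.0) for _ in range(500)]
single = fast_ncc_scalar(x, [0.5 * s for s in x])
double = fast_ncc_scalar(x, [0.5 * s + random.gauss(0.0, 1.0) for s in x])
print(single[-1][2], double[-1][2])
```

In this toy run the squared statistic settles near one for far-end-only input and well below one once a near-end signal of comparable power is added.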
3.4 THE COHERENCE METHOD
Instead of using the cross-correlation vector, a detection statistic can be formed by using the squared magnitude coherence. A DTD based on coherence was proposed in [4]. The idea is to estimate the coherence between $x$ and $y$. The coherence is close to one when there is no double-talk and it is close to zero in a double-talk situation. Figure 6.2 shows an example of estimated coherence between loudspeaker and microphone signals in the presence and absence of double-talk. The squared coherence is defined as,

$$\gamma_{xy}^2(k) = \frac{|S_{xy}(k)|^2}{S_{xx}(k)\, S_{yy}(k)},$$

where $S_{xy}(k)$ is the DFT-based cross-power spectrum and $k$ is the DFT frequency index. As decision parameter, an average over a few frequency intervals is used as the detection statistic,

$$\xi = \frac{1}{I} \sum_{i=0}^{I-1} \gamma_{xy}^2(k_i),$$

where $I$ is the number of intervals used. Typical choices of these parameters are $I = 3$, with $k_0$, $k_1$, $k_2$ the intervals chosen such that their centers correspond to approximately 300, 1200, and 1800 Hz, respectively. In practice, this gives significantly better performance than averaging over the whole frequency range, since the speech-to-noise ratio is poorer at the upper frequencies (the average speech spectrum declines by about 6 dB/octave above 2 kHz).
The spectra can be estimated with the multiple window method,

$$S_{xy}(k) = \frac{1}{K} \sum_{l=0}^{K-1} X_l^*(k)\, Y_l(k),$$

where

$$X_l(k) = \sum_{n=0}^{N-1} w_l(n)\, x(n)\, e^{-j 2\pi k n / N}$$

is the eigenspectrum and $Y_l(k)$ is analogously defined. The window $w_l$ is the discrete prolate spheroidal wave function [19]. $N$ is the block length of the DFT. The multiple window method has advantages such as an easy tradeoff between bias and variance.
Another possibility is to use the Welch spectrum estimation method [20].
Since this DTD is based on block processing of the signals, there is a tradeoff between computational complexity and the time between decisions. It is desirable to keep the time between decisions as short as possible in order to have as few detection failures as possible (both false alarms and detection misses).
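A coherence-based statistic can be sketched as follows. This is an illustration under simplifying assumptions, not the estimator of [4]: the spectra are averaged over non-overlapping rectangular blocks (a crude Welch-style estimate rather than the multiple window method), and the block length of 32, the bin indices (roughly 250, 1250, and 1750 Hz at an assumed 8 kHz sampling rate), the echo path, and the white test signals are all hypothetical choices.

```python
import cmath
import random

def dft_bin(block, k):
    """Single DFT coefficient of `block` at frequency index k."""
    n_fft = len(block)
    return sum(s * cmath.exp(-2j * cmath.pi * k * n / n_fft)
               for n, s in enumerate(block))

def coherence_statistic(x, y, n_fft=32, bins=(1, 5, 7)):
    """Average squared coherence over `bins`, spectra averaged over blocks."""
    total = 0.0
    n_blocks = len(x) // n_fft
    for k in bins:
        sxx = syy = 0.0
        sxy = 0j
        for b in range(n_blocks):
            seg = slice(b * n_fft, (b + 1) * n_fft)
            xk, yk = dft_bin(x[seg], k), dft_bin(y[seg], k)
            sxx += abs(xk) ** 2
            syy += abs(yk) ** 2
            sxy += xk.conjugate() * yk
        total += abs(sxy) ** 2 / (sxx * syy)
    return total / len(bins)

random.seed(0)
x = [random.gauss(0.0, 1.0) for _ in range(2048)]
echo = [0.6 * x[n] + (0.2 * x[n - 1] if n else 0.0) for n in range(len(x))]
near = [random.gauss(0.0, 2.0) for _ in x]
xi_single = coherence_statistic(x, echo)
xi_double = coherence_statistic(x, [e + v for e, v in zip(echo, near)])
print(round(xi_single, 2), round(xi_double, 2))
```

Each decision here consumes 64 blocks of 32 samples, which illustrates the tradeoff mentioned above: longer averaging gives a more reliable coherence estimate but a longer time between decisions.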
3.5 THE NORMALIZED CROSS-CORRELATION MATRIX
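Obviously, the cross-correlation and coherence methods are related in some sense. This link can be established by extending the definition of the cross-correlation method to incorporate cross-correlation between two vectors $\mathbf{x}$ and $\mathbf{y}$ instead of only the scalar $y$ [5]. Define the normalized cross-correlation matrix between two vectors $\mathbf{x}$ and $\mathbf{y}$ as follows

$$\mathbf{C}_{xy} = \mathbf{R}_{xx}^{-1/2}\, \mathbf{R}_{xy}\, \mathbf{R}_{yy}^{-1/2},$$

where $\mathbf{R}_{xy} = E\{\mathbf{x}\mathbf{y}^T\}$, $\mathbf{R}_{yy} = E\{\mathbf{y}\mathbf{y}^T\}$, and

$$\mathbf{y} = [y(n)\;\; y(n-1)\;\; \cdots\;\; y(n-N+1)]^T$$

is a vector of size $N$. There are two interesting cases:
(i) $N = 1$: $\mathbf{C}_{xy}$ is the normalized cross-correlation vector between $\mathbf{x}$ and $y$ defined in (6.14);
(ii) $N = L = 1$: $\mathbf{C}_{xy}$ is the cross-correlation coefficient between $x$ and $y$.
By extension to (6.13), we then form the detection statistic

$$\xi = \frac{1}{\sqrt{N}}\, \|\mathbf{C}_{xy}\|_F,$$

where the subscript “F” denotes the Frobenius norm. We note that for case (i), $\xi$ reduces to the decision variable of (6.13), as before. Again, we can interpret this formulation as a “generalized cross-correlation”, where now both $\mathbf{x}$ and $\mathbf{y}$ are prewhitened, which is also known as the “smoothed coherence transform” (SCOT) [15].
The link between the normalized cross-correlation matrix and the coherence can now be established as follows. Suppose that $N = L \to \infty$. In this case, a Toeplitz matrix is asymptotically equivalent to a circulant matrix if its elements are absolutely summable [21], which is the case for the intended application.
Hence we can decompose $\mathbf{R}_{ab}$ as

$$\mathbf{R}_{ab} = \mathbf{F}^{-1}\, \mathbf{S}_{ab}\, \mathbf{F}, \tag{6.30}$$

where $\mathbf{F}$ is the discrete Fourier transform (DFT) matrix,

$$\mathbf{S}_{ab} = \mathrm{diag}\{S_{ab}(0),\, S_{ab}(1),\, \ldots,\, S_{ab}(L-1)\}$$

is a diagonal matrix formed by the first column of $\mathbf{F}\mathbf{R}_{ab}$, and $S_{ab}(k)$ is the DFT cross-power spectrum. Now:

$$\xi^2 = \frac{1}{L}\, \mathrm{tr}\left[\mathbf{C}_{xy}^T \mathbf{C}_{xy}\right] = \frac{1}{L}\, \mathrm{tr}\left[\mathbf{R}_{yy}^{-1}\, \mathbf{R}_{yx}\, \mathbf{R}_{xx}^{-1}\, \mathbf{R}_{xy}\right],$$

since tr(AB) = tr(BA). Using (6.30), we easily find that

$$\xi^2 = \frac{1}{L} \sum_{k=0}^{L-1} \gamma_{xy}^2(k),$$

where

$$\gamma_{xy}^2(k) = \frac{|S_{xy}(k)|^2}{S_{xx}(k)\, S_{yy}(k)} = \frac{|H(k)|^2}{|H(k)|^2 + \rho(k)},$$

where $H(k)$ is the transfer function of $\mathbf{h}$ and $\rho(k) = S_{vv}(k)/S_{xx}(k)$ is the near-end talker to far-end talker spectral ratio at frequency $k$.
Except for an unrestricted frequency range, this form is identical to the coherence-based double-talk detector presented in Section 3.4. We find that this idea is very appropriate since, when $v = 0$, $\gamma_{xy}^2(k) = 1$: the two signals are completely coherent and $\xi = 1$; and when $v \neq 0$, $\gamma_{xy}^2(k) < 1$ and then $\xi < 1$.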
3.6 THE TWO-PATH MODEL
An interesting approach to double-talk handling was proposed in [13]. This method was introduced for network echo cancellation. However, it has proven far more useful for the AEC application. In this method, two filters model the echo path: a background filter, which is adaptive as in a conventional AEC solution, and a foreground filter, which is not adaptive. The foreground filter cancels the echo. Whenever the background filter performs better than the foreground filter, its coefficients are copied to the foreground. Coefficients are copied only when a set of conditions is met, in contrast to the single-statistic decision declaring “no double-talk” in the traditional DTDs presented in the previous sections.
The basic set of conditions found in [13] is given by (6.38)-(6.40). Copying is performed, equivalent to no double-talk present, if all of (6.38)-(6.40) are fulfilled,
where each level is the short-time smoothed absolute magnitude of the corresponding signal. A hangover time is also imposed when (6.40) is fulfilled. The last condition (6.40) is basically the same as in the Geigel DTD with a unity threshold, i.e., the echo path is assumed not to amplify the far-end speech. If all three conditions are satisfied over D consecutive decisions, copying of background coefficients is resumed.
Condition (6.38) ensures the background adaptive filter is canceling echo, while condition (6.39) ensures the background filter is outperforming the foreground filter. The above decision logic is effective for certain applications, but is not without shortcomings. First, conditions (6.38) and (6.39) are not always sufficient to prevent coefficient transfer in the presence of double-talk and/or high background noise. For speech or any other non-spectrally diverse excitation, the inequalities in (6.38) and (6.39) can be satisfied in the short term (over duration D, for example) even though the actual misalignment error of the background coefficients is worse than that of the foreground coefficients.
Second, (6.38) and (6.39) employ thresholds that limit the responsiveness of the logic to changes in the performance of the background canceler and to changes
tory of the echo canceler. Last, condition (6.40) ensures no update is performed unless the smoothed far-end level exceeds the smoothed microphone level. But this property can be used to inhibit adaptation only in cases for which the physical echo path introduces signal loss. If the echo path introduces gain, condition (6.40) prevents adaptation even in the absence of near-side speech and noise. For this reason, these rules cannot in general be used in echo-canceling speakerphones, where the acoustic echo path may introduce gain.
3.6.1 A Threshold-Free Decision Logic. In addition to the beneficial aspects of the original two-path logic, a two-path canceler decision logic should possess the following characteristics:
Faster initial convergence and reconvergence following echo path changes.
Applicability to echo paths having signal gain
Reduced dependence upon user-selected constants, such as thresholds and timers.
A logic that exhibits these properties is proposed and described in detail in [22]. This decision logic differs from that of prior works in that it does not use decision thresholds (constants). A smoothing parameter is the only constant that has to be chosen. Moreover, the logic applies to both lossy and gain-incurring echo paths, and possesses favorable convergence properties for many scenarios encountered in practice. Hence, the great advantage of this algorithm is that it is not sensitive to echo path changes, since the background filter is allowed to track changes freely and, as soon as it performs better than the foreground filter, it is copied over.
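The two-path structure itself can be sketched compactly. The code below is only a structural illustration: the copy rule (copy whenever the smoothed squared error of the background filter is below that of the foreground filter) is a simplified stand-in chosen for this example, not the conditions of [13] nor the logic of [22]. The NLMS background update, the function name `two_path`, the smoothing constant, and the test signals are likewise assumptions.

```python
import random

def two_path(x, y, taps=4, mu=0.5, smooth=0.9):
    """Return the foreground coefficients after processing x (far end), y (mic)."""
    fg = [0.0] * taps           # foreground: fixed between copies, cancels echo
    bg = [0.0] * taps           # background: adapts continuously (NLMS)
    p_fg = p_bg = 0.0           # smoothed squared errors of both filters
    buf = [0.0] * taps          # most recent far-end samples, newest first
    for xn, yn in zip(x, y):
        buf = [xn] + buf[:-1]
        e_fg = yn - sum(w * s for w, s in zip(fg, buf))
        e_bg = yn - sum(w * s for w, s in zip(bg, buf))
        p_fg = smooth * p_fg + (1 - smooth) * e_fg ** 2
        p_bg = smooth * p_bg + (1 - smooth) * e_bg ** 2
        norm = sum(s * s for s in buf) + 1e-8
        bg = [w + mu * e_bg * s / norm for w, s in zip(bg, buf)]
        if p_bg < p_fg:         # background currently outperforms foreground
            fg = bg[:]
    return fg

random.seed(1)
h = [0.5, -0.3, 0.2, 0.1]       # hypothetical echo path
x = [random.gauss(0.0, 1.0) for _ in range(2000)]
y = [sum(h[k] * x[n - k] for k in range(4) if n - k >= 0) for n in range(2000)]
fg = two_path(x, y)
print([round(w, 3) for w in fg])
```

The only constant in the copy rule is the smoothing parameter, and nothing in it assumes a lossy echo path. As the discussion of (6.38) and (6.39) points out, though, a bare short-term error comparison can be fooled by double-talk or noise, which is why practical logics add further safeguards.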
3.7 DTD COMBINATIONS WITH ROBUST STATISTICS
All practical double-talk detectors have a nonzero probability of miss, i.e., $P_m \neq 0$.
Requiring the probability of miss to be smaller will undoubtedly increase the probability of false alarms, hence slowing down the convergence rate. As a consequence, no matter what DTD is used, undetected near-end speech will perturb the adaptive algorithm from time to time. Figure 6.4 shows the remaining undetected near-end speech (double-talk) after double-talk detection with a Geigel detector with T = 2. The impact of this perturbation is governed by the echo to near-end speech ratio as described in Section 1.
In practice, what has been done in the past is this: first, the DTD is designed to be “as good as” one can afford; then, the adaptive algorithm is slowed down so that it copes with the detection errors made by the DTD. This is natural to do since, if the adaptive algorithm is very fast, it can react faster to situation changes (e.g., double-talk) than the DTD and thus can diverge. However, this approach severely penalizes the convergence rate of the AEC when the situation is good, i.e., when far-end but no near-end talk is present.
In the light of these facts, it may be fruitful to look at adaptive algorithms that can handle at least a small amount of double-talk without diverging. This approach has been studied and proven very successful in the network echo canceler case [23], where outlier-resistant adaptive algorithms are combined with a DTD.
As for any AEC/DTD, adaptation is inhibited by setting the step-size parameter to zero when double-talk is detected. The scaled non-linearity in (6.41) can be chosen to be the limiter [24],

$$\psi\!\left(\frac{|e(n)|}{s(n)}\right) = \min\left\{k_0,\, \frac{|e(n)|}{s(n)}\right\},$$

where $s(n)$ is an adaptive scale factor. Making the scale factor adaptive and supervised by the DTD is the key to the success of this approach. The scale factor should reflect the background noise level at the near end, be robust to short burst disturbances (double-talk), and track long-term changes of the residual error (echo path changes). To fulfill these requirements, one can choose the scale factor estimate as

$$s(n+1) = \lambda_s\, s(n) + \frac{1-\lambda_s}{\beta}\, s(n)\, \psi\!\left(\frac{|e(n)|}{s(n)}\right),$$

where $\lambda_s$ is a smoothing factor and $\beta$ is a normalization constant. Adaptation of $s(n)$ is performed as long as the DTD has not detected double-talk. Justification and details of the above derivations can be found in [23].
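The effect of a limiter with an adaptive scale can be illustrated with two small functions. This sketch assumes a limiter of the form $\min\{k_0, |e|/s\}$ and a first-order scale recursion driven by the limited error; the function names, the values `k0 = 1.1`, `lam = 0.997`, and `beta = 0.66` (roughly the mean of $\min\{k_0, |z|\}$ for a unit Gaussian $z$), and the toy inputs are assumptions for illustration only.

```python
def limited_error(e, s, k0=1.1):
    """Error after the limiter: sign(e) * s * min(k0, |e| / s)."""
    psi = min(k0, abs(e) / s)
    return (1.0 if e >= 0 else -1.0) * s * psi

def update_scale(s, e, dtd_active=False, lam=0.997, beta=0.66, k0=1.1):
    """First-order scale update; frozen while the DTD reports double-talk."""
    if dtd_active:
        return s
    psi = min(k0, abs(e) / s)
    return lam * s + (1.0 - lam) / beta * s * psi

s = 0.1
small = limited_error(0.05, s)       # below the limit: passed unchanged
burst = limited_error(-5.0, s)       # undetected near-end burst: clipped to -k0 * s
s_after_burst = update_scale(s, -5.0)
print(small, burst, s_after_burst)
```

A residual error within the scale passes through unchanged, while a large burst contributes at most `k0 * s` to the coefficient update and moves the scale by well under one percent per sample, so a short undetected burst has limited influence until the DTD reacts.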