DOUBLE-TALK DETECTION ALGORITHMS - Audio signal processing

In this section, we explain different DTD algorithms that can be useful for AEC. We start with the Geigel algorithm since it was the very first DTD proposal.

3.1 THE GEIGEL ALGORITHM

A very simple algorithm due to A. A. Geigel [8] is to declare the presence of near-end speech whenever

$$\xi^{(g)} = \frac{\max\{|x(n)|, \ldots, |x(n-L_g+1)|\}}{|y(n)|} < T,$$

where $L_g$ and $T$ are suitably chosen constants. This detection scheme is based on a waveform level comparison between the microphone signal $y(n)$ and the far-end speech $x(n)$, assuming the near-end speech $v(n)$ in the microphone signal will typically be stronger than the echo. The maximum, or $\ell_\infty$ norm, of the $L_g$ most recent samples of $x(n)$ is taken for the comparison because of the undetermined delay in the echo path. The threshold $T$ compensates for the energy level of the echo path response and is often set to 2 for network echo cancelers because the hybrid loss is typically about 6 dB or more. For an AEC, however, it is not clear how to set a universal threshold that works reliably in all situations, because the loss through the acoustic echo path can vary greatly depending on many factors. For $L_g$, one choice is to set it equal to the adaptive filter length $L$, since we can assume that the echo path is covered by this length.
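The Geigel rule is simple enough to sketch in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the function name `geigel_dtd`, its argument names, and the choice to test $\max|x| < T\,|y(n)|$ instead of dividing by $|y(n)|$ (which avoids a division by zero) are all introduced here.

```python
from collections import deque

def geigel_dtd(x, y, window, threshold=2.0):
    """Return a list of booleans, True where near-end speech is declared.

    x: far-end samples, y: microphone samples (same length).
    window: number of recent far-end samples scanned (L_g in the text).
    threshold: T in the text (2 corresponds to about 6 dB of echo path loss).
    """
    recent = deque(maxlen=window)      # holds |x(n)|, ..., |x(n-window+1)|
    decisions = []
    for xn, yn in zip(x, y):
        recent.append(abs(xn))
        # max|x| / |y| < T, rewritten without the division (assumption made
        # here so that y(n) = 0 needs no special case).
        decisions.append(max(recent) < threshold * abs(yn))
    return decisions
```

With `x = [1.0, 0.0, 0.0, 0.0]` and a pure echo `y = [0.0, 0.4, 0.0, 0.0]`, no sample is flagged for `window=3`; raising the third microphone sample to 0.9 (near-end activity) flags that sample only.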

3.2 THE CROSS-CORRELATION METHOD

In [9] the cross-correlation coefficient vector between $\mathbf{x}$ and $e$ was proposed as a means for double-talk detection. A similar idea using the cross-correlation coefficient vector between $\mathbf{x}$ and $y$ has proven more robust and reliable [10, 6].

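This section will therefore focus on the cross-correlation coefficient vector

$$\mathbf{c}^{(1)}_{xy} = \frac{E\{\mathbf{x}(n)\,y(n)\}}{\sqrt{E\{x^2(n)\}\,E\{y^2(n)\}}} = [c_{xy,0}\ \ c_{xy,1}\ \ \cdots\ \ c_{xy,L-1}]^T,$$

where $c_{xy,i}$ is the cross-correlation coefficient between $x(n-i)$ and $y(n)$. The idea here is to compare

$$\xi^{(1)} = \|\mathbf{c}^{(1)}_{xy}\|_\infty = \max_i |c_{xy,i}| \tag{6.7}$$

to a threshold level $T$. The decision rule is very simple: if $\xi^{(1)} \ge T$, then double-talk is not present; if $\xi^{(1)} < T$, then double-talk is present.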
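A sample-based estimate of this statistic can be sketched as follows. The expectations are replaced by sums over the available samples (zero-mean signals are assumed), and the helper names `simulate_echo` and `cross_correlation_statistic`, as well as the two toy echo paths, are hypothetical choices made for illustration.

```python
import math
import random

def simulate_echo(x, h):
    """Microphone signal without near-end speech: y(n) = sum_k h[k] x(n-k)."""
    return [sum(hk * x[n - k] for k, hk in enumerate(h) if n - k >= 0)
            for n in range(len(x))]

def cross_correlation_statistic(x, y, L):
    """Sample estimate of max_i |c_i|, where c_i is the correlation
    coefficient between x(n-i) and y(n), i = 0, ..., L-1."""
    start = L - 1                       # first n for which x(n-L+1) exists
    y_seg = y[start:]
    y_norm = math.sqrt(sum(v * v for v in y_seg))
    best = 0.0
    for i in range(L):
        x_seg = x[start - i:len(x) - i]           # x(n-i) aligned with y(n)
        x_norm = math.sqrt(sum(v * v for v in x_seg))
        if x_norm > 0.0 and y_norm > 0.0:
            c_i = sum(a * b for a, b in zip(x_seg, y_seg)) / (x_norm * y_norm)
            best = max(best, abs(c_i))
    return best

rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(4000)]   # white far-end signal
# Two echo paths, both WITHOUT near-end speech:
xi_delay = cross_correlation_statistic(x, simulate_echo(x, [0.0, 1.0, 0.0, 0.0]), 4)
xi_flat = cross_correlation_statistic(x, simulate_echo(x, [0.5, 0.5, 0.5, 0.5]), 4)
```

For the pure delay the statistic is essentially 1, whereas for the four equal taps each coefficient is theoretically $0.5$, so the statistic sits near one half even though there is no double-talk in either case. The value observed without near-end speech therefore depends on the echo path.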

Although the $\ell_\infty$ norm used in (6.7) is perhaps the most natural, other scalar metrics, e.g., $\ell_1$ or $\ell_2$, could alternatively be used to assess the cross-correlation coefficient vectors. However, there is a fundamental problem here which is not linked to the type of metric used. The problem is that these cross-correlation coefficient vectors are not well normalized. Indeed, we can only say in general that $\xi^{(1)} \le 1$. If $v(n) = 0$, that does not imply that $\xi^{(1)} = 1$ or any other known value. We do not know the value of $\xi^{(1)}$ in general. The amount of correlation will depend a great deal on the statistics of the signals and of the echo path. As a result, the best value of $T$ will vary a lot from one experiment to another. So there is no natural threshold level associated with the variable $\xi^{(1)}$ when $v(n) = 0$.

The next section presents a decision variable that exhibits better properties than the cross-correlation algorithm. This decision variable is formed by properly normalizing the cross-correlation vector between $\mathbf{x}$ and $y$.

3.3 THE NORMALIZED CROSS-CORRELATION METHOD

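There is a simple way to normalize the cross-correlation vector between a vector $\mathbf{x}$ and a scalar $y$ in order to have a natural threshold level for the decision variable $\xi$ when $v(n) = 0$. Suppose that $v(n) = 0$. In this case:

$$\sigma_y^2 = \mathbf{h}^T\mathbf{R}\,\mathbf{h}, \tag{6.9}$$

where $\mathbf{R} = E\{\mathbf{x}(n)\mathbf{x}^T(n)\}$. Since $y(n) = \mathbf{h}^T\mathbf{x}(n)$, we have

$$\mathbf{r} = E\{\mathbf{x}(n)\,y(n)\} = \mathbf{R}\,\mathbf{h}, \tag{6.10}$$

and (6.9) can be re-written as

$$\sigma_y^2 = \mathbf{r}^T\mathbf{R}^{-1}\mathbf{r}. \tag{6.11}$$

In general, for $v(n) \neq 0$ we have

$$\sigma_y^2 = \mathbf{r}^T\mathbf{R}^{-1}\mathbf{r} + \sigma_v^2. \tag{6.12}$$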

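If we divide (6.11) by (6.12) and compute its square root, we obtain the decision variable [5, 14]

$$\xi^{(2)} = \sqrt{\mathbf{r}^T(\sigma_y^2\mathbf{R})^{-1}\mathbf{r}} = \|\mathbf{c}^{(2)}_{xy}\|_2, \tag{6.13}$$

where

$$\mathbf{c}^{(2)}_{xy} = (\sigma_y^2\mathbf{R})^{-1/2}\,\mathbf{r} \tag{6.14}$$

is what we will call the normalized cross-correlation vector between $\mathbf{x}$ and $y$.

Substituting (6.10) and (6.12) into (6.13), we show that the decision variable is:

$$\xi^{(2)} = \frac{\sqrt{\mathbf{h}^T\mathbf{R}\,\mathbf{h}}}{\sqrt{\mathbf{h}^T\mathbf{R}\,\mathbf{h} + \sigma_v^2}}. \tag{6.15}$$

We easily deduce from (6.15) that for $v(n) = 0$, $\xi^{(2)} = 1$ and for $v(n) \neq 0$, $\xi^{(2)} < 1$.

Note also that $\xi^{(2)}$ is not sensitive to changes of the echo path when $v(n) = 0$. For the particular case when $x(n)$ is white Gaussian noise, the autocorrelation matrix is diagonal: $\mathbf{R} = \sigma_x^2\mathbf{I}$. Then (6.14) becomes:

$$\mathbf{c}^{(2)}_{xy} = \frac{\mathbf{r}}{\sigma_x\sigma_y}.$$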

Hence, in general what we are doing in (6.13) is equivalent to prewhitening the signal $\mathbf{x}$, which is one of many known “generalized cross-correlation” techniques [15]. Thus, when $\mathbf{x}$ is white, no prewhitening is necessary and the normalized cross-correlation vector coincides with the cross-correlation coefficient vector of Section 3.2. This suggests a more practical implementation, whereby matrix operations are replaced by an adaptive prewhitening filter [16].
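The behaviour of the decision variable in (6.15) can be checked numerically. The sketch below evaluates $\sqrt{\mathbf{h}^T\mathbf{R}\mathbf{h}/(\mathbf{h}^T\mathbf{R}\mathbf{h}+\sigma_v^2)}$ for a known echo path and correlation matrix; in practice these quantities are unknown and must be estimated, so this is only an illustration. The function name `ncc_statistic` and the numerical values are assumptions made here.

```python
import math

def ncc_statistic(h, R, noise_power):
    """Decision variable of (6.15) for echo path h, far-end correlation
    matrix R and near-end power sigma_v^2 (all assumed known here).
    Assumes the echo power h^T R h is nonzero."""
    L = len(h)
    Rh = [sum(R[i][j] * h[j] for j in range(L)) for i in range(L)]
    echo_power = sum(h[i] * Rh[i] for i in range(L))       # h^T R h
    return math.sqrt(echo_power / (echo_power + noise_power))

R = [[1.0, 0.5], [0.5, 1.0]]           # coloured far-end signal (toy values)
h = [0.5, 0.25]                        # toy echo path, h^T R h = 0.4375
xi_single = ncc_statistic(h, R, 0.0)                        # no near-end speech
xi_scaled = ncc_statistic([10.0 * c for c in h], R, 0.0)    # echo path changed
xi_double = ncc_statistic(h, R, 0.4375)                     # near-end power = echo power
```

Without near-end speech the statistic equals 1 for both echo paths, which is the natural threshold property. When the near-end power equals the echo power it drops to $\sqrt{0.5} \approx 0.71$.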

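Finally, a fast version of (6.15) can be derived by recursively updating $\mathbf{R}^{-1}\mathbf{r}$ using the Kalman gain $\mathbf{R}^{-1}\mathbf{x}$. Estimated quantities of the cross-correlation and the near-end signal power have to be introduced for the derivation of a fast version. Equation (6.15) can be rewritten as

$$\left[\xi^{(2)}\right]^2 = \frac{\mathbf{r}^T\mathbf{R}^{-1}\mathbf{r}}{\sigma_y^2} = \frac{\eta^2(n)}{\sigma_y^2(n)},$$

where we squared the statistic for simplicity. The correlation variables are estimated recursively as,

$$\mathbf{R}(n) = \lambda\mathbf{R}(n-1) + \mathbf{x}(n)\mathbf{x}^T(n), \qquad \mathbf{r}(n) = \lambda\mathbf{r}(n-1) + \mathbf{x}(n)\,y(n),$$

$$\sigma_y^2(n) = \lambda\sigma_y^2(n-1) + y^2(n), \tag{6.20}$$

where $\lambda$ is a forgetting factor. The statistic can be shown to be updated as

$$\eta^2(n) = \lambda\eta^2(n-1) + y^2(n) - \varphi(n)\,e^2(n), \tag{6.22}$$

where the likelihood variable is $\varphi(n) = \lambda / [\lambda + \mathbf{x}^T(n)\mathbf{R}^{-1}(n-1)\mathbf{x}(n)]$ and $e(n)$ is the residual error, $e(n) = y(n) - \hat{\mathbf{h}}^T(n-1)\mathbf{x}(n)$. Hence, the quantities required to form the test statistic of the fast version of the NCC DTD are given by the simple first-order recursions in (6.20) and (6.22).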

Table 6.1 gives the calculations for the fast NCC DTD, where it is assumed that the Kalman gain has been calculated “for free” by the FRLS algorithm [17].
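The first-order recursions can be illustrated with a short sketch. It is not the procedure of Table 6.1: a plain exponentially weighted RLS update (cost $O(L^2)$ per sample) stands in for the FRLS algorithm that would supply the Kalman gain, and the function name `fast_ncc`, the regularization `delta` and the default forgetting factor are assumptions made here.

```python
def fast_ncc(x, y, L, lam=0.99, delta=1e-2):
    """Return the squared NCC statistic eta^2(n) / sigma_y^2(n) per sample.

    Plain RLS is used here to obtain the Kalman gain (an assumption; the
    text takes the gain from FRLS). delta regularizes the initial R.
    """
    P = [[(1.0 / delta if i == j else 0.0) for j in range(L)]
         for i in range(L)]                     # estimate of R^{-1}
    h = [0.0] * L                               # adaptive filter coefficients
    xv = [0.0] * L                              # [x(n), ..., x(n-L+1)]
    eta2 = 0.0
    sy2 = 0.0
    out = []
    for xn, yn in zip(x, y):
        xv = [xn] + xv[:-1]
        Px = [sum(P[i][j] * xv[j] for j in range(L)) for i in range(L)]
        q = sum(xv[i] * Px[i] for i in range(L))          # x^T R^{-1}(n-1) x
        phi = lam / (lam + q)                             # likelihood variable
        k = [p / (lam + q) for p in Px]                   # Kalman gain
        e = yn - sum(h[i] * xv[i] for i in range(L))      # a priori residual
        h = [h[i] + k[i] * e for i in range(L)]
        P = [[(P[i][j] - k[i] * Px[j]) / lam for j in range(L)]
             for i in range(L)]
        sy2 = lam * sy2 + yn * yn                         # power recursion
        eta2 = lam * eta2 + yn * yn - phi * e * e         # statistic recursion
        out.append(eta2 / sy2 if sy2 > 0.0 else 1.0)
    return out
```

Fed with a white far-end signal and a noise-free echo, the returned ratio settles very close to 1; adding a strong independent near-end signal to the microphone pulls it well below 1.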

3.4 THE COHERENCE METHOD

Instead of using the cross-correlation vector, a detection statistic can be formed by using the squared magnitude coherence. A DTD based on coherence was proposed in [4]. The idea is to estimate the coherence between $x(n)$ and $y(n)$. The coherence is close to one when there is no double-talk and it is close to zero in a double-talk situation. Figure 6.2 shows an example of estimated coherence between loudspeaker and microphone signals in the presence and absence of double-talk. The squared coherence is defined as,

$$\gamma^2_{xy}(k) = \frac{|S_{xy}(k)|^2}{S_{xx}(k)\,S_{yy}(k)},$$

where $S_{xy}(k)$ is the DFT-based cross-power spectrum and $k$ is the DFT frequency index. As decision parameter, an average of the squared coherence over a few frequency intervals is used as the detection statistic, where $I$ is the number of intervals used. A typical choice is $I = 3$, with the intervals chosen such that their centers correspond to approximately 300, 1200, and 1800 Hz, respectively. This gives in practice significantly better performance than averaging over the whole frequency range, since there is a poorer speech-to-noise ratio at the upper frequencies (the average speech spectrum declines by about 6 dB/octave above 2 kHz).

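The spectra can be estimated with the multiple window method, in which each spectrum is formed from eigenspectra computed over DFT blocks of the signals; the eigenspectra of $x$ and $y$ are defined analogously. The windows are the discrete prolate spheroidal wave functions [19]. The multiple window method has advantages such as an easy tradeoff between bias and variance.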

Another possibility is to use the Welch spectrum estimation method [20].

Since this DTD is based on block processing of the signals, there is a tradeoff between computational complexity and the time between decisions. It is desirable to keep the time between decisions as short as possible in order to have as few detection failures as possible (both false alarms and detection misses).
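A coherence-based statistic can be sketched with Welch-style averaging, the alternative mentioned above, rather than the multiple window method. Several simplifications are assumptions made here: a sampling rate of 8 kHz, blocks of 80 samples (so that DFT bins 3, 12 and 18 fall at 300, 1200 and 1800 Hz), a Hann window with 50% overlap, and a single bin per interval instead of an average over an interval. The function names are hypothetical.

```python
import cmath
import math

def dft_bin(segment, k):
    """Value of the DFT of `segment` at frequency index k."""
    N = len(segment)
    return sum(s * cmath.exp(-2j * math.pi * k * n / N)
               for n, s in enumerate(segment))

def coherence_statistic(x, y, block=80, bins=(3, 12, 18)):
    """Average squared coherence over the selected DFT bins.
    Assumes both signals have nonzero power at every selected bin."""
    window = [0.5 - 0.5 * math.cos(2.0 * math.pi * n / block)
              for n in range(block)]                       # Hann window
    total = 0.0
    for k in bins:
        sxx = syy = 0.0
        sxy = 0j
        for start in range(0, len(x) - block + 1, block // 2):   # 50% overlap
            X = dft_bin([w * s for w, s in zip(window, x[start:start + block])], k)
            Y = dft_bin([w * s for w, s in zip(window, y[start:start + block])], k)
            sxx += abs(X) ** 2                 # auto-spectrum of x
            syy += abs(Y) ** 2                 # auto-spectrum of y
            sxy += X.conjugate() * Y           # cross-spectrum
        total += abs(sxy) ** 2 / (sxx * syy)   # squared coherence at bin k
    return total / len(bins)
```

The number of blocks averaged per decision controls the tradeoff noted above: more blocks give a less noisy coherence estimate but a longer time between decisions.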

3.5 THE NORMALIZED CROSS-CORRELATION MATRIX

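Obviously, the cross-correlation and coherence methods are related in some sense. This link can be established by extending the definition of the cross-correlation method to incorporate cross-correlation between two vectors $\mathbf{x}$ and $\mathbf{y}$ instead of only the scalar $y$ [5]. Define the normalized cross-correlation matrix between two vectors $\mathbf{x}$ and $\mathbf{y}$ as follows

$$\mathbf{C}_{xy} = \mathbf{R}_{xx}^{-1/2}\,\mathbf{R}_{xy}\,\mathbf{R}_{yy}^{-1/2},$$

where $\mathbf{R}_{xy} = E\{\mathbf{x}(n)\mathbf{y}^T(n)\}$, $\mathbf{R}_{xx}$ and $\mathbf{R}_{yy}$ are defined analogously, and

$$\mathbf{y}(n) = [y(n)\ \ y(n-1)\ \ \cdots\ \ y(n-N+1)]^T$$

is a vector of size $N$. There are two interesting cases:

(i) $N = 1$, $\mathbf{C}_{xy} = \mathbf{c}^{(2)}_{xy}$ (normalized cross-correlation vector between $\mathbf{x}$ and $y$).
(ii) $N = L = 1$, $\mathbf{C}_{xy} = c_{xy}$ (cross-correlation coefficient between $x$ and $y$).

By extension to (6.13), we then form the detection statistic

$$\xi^{(3)} = \frac{1}{\sqrt{N}}\,\|\mathbf{C}_{xy}\|_F,$$

where the subscript “F” denotes the Frobenius norm. We note that for case (i), $\xi^{(3)} = \xi^{(2)}$ as before. Again, we can interpret this formulation as a “generalized cross-correlation”, where now both $\mathbf{x}$ and $\mathbf{y}$ are prewhitened, which is also known as the “smoothed coherence transform” (SCOT) [15].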

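The link between the normalized cross-correlation matrix and the coherence can now be established as follows. Suppose that $N = L \to \infty$. In this case, a Toeplitz matrix is asymptotically equivalent to a circulant matrix if its elements are absolutely summable [21], which is the case for the intended application.

Hence we can decompose each correlation matrix $\mathbf{R}_{ab}$, with $a, b \in \{x, y\}$, as

$$\mathbf{R}_{ab} = \mathbf{F}^{-1}\,\mathbf{S}_{ab}\,\mathbf{F}, \tag{6.30}$$

where $\mathbf{F}$ is the discrete Fourier transform (DFT) matrix and

$$\mathbf{S}_{ab} = \mathrm{diag}\{S_{ab}(0), S_{ab}(1), \ldots, S_{ab}(L-1)\}$$

is a diagonal matrix formed from the DFT of the first column of $\mathbf{R}_{ab}$, and $S_{ab}(k)$ is the DFT cross-power spectrum. Now:

$$\left[\xi^{(3)}\right]^2 = \frac{1}{L}\,\mathrm{tr}\!\left(\mathbf{C}_{xy}^T\mathbf{C}_{xy}\right) = \frac{1}{L}\,\mathrm{tr}\!\left(\mathbf{R}_{yy}^{-1}\mathbf{R}_{yx}\mathbf{R}_{xx}^{-1}\mathbf{R}_{xy}\right),$$

since tr(AB) = tr(BA). Using (6.30), we easily find that

$$\left[\xi^{(3)}\right]^2 = \frac{1}{L}\sum_{k=0}^{L-1}\gamma^2(k),$$

where

$$\gamma^2(k) = \frac{|S_{xy}(k)|^2}{S_{xx}(k)\,S_{yy}(k)} = \frac{|H(k)|^2}{|H(k)|^2 + S_{vv}(k)/S_{xx}(k)},$$

$H(k)$ is the transfer function of $\mathbf{h}$, and $S_{vv}(k)/S_{xx}(k)$ is the near-end talker to far-end talker spectral ratio at frequency $k$.

Except for an unrestricted frequency range, this form is identical to the coherence-based double-talk detector presented in Section 3.4. We find that this idea is very appropriate since when $v(n) = 0$ the two signals are completely coherent, so that $\gamma^2(k) = 1$ and $\xi^{(3)} = 1$, and when $v(n) \neq 0$, $\gamma^2(k) < 1$ and $\xi^{(3)} < 1$.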

3.6 THE TWO-PATH MODEL

An interesting approach to double-talk handling was proposed in [13]. This method was introduced for network echo cancellation. However, it has proven far more useful for the AEC application. In this method, two filters model the echo path, one background filter which is adaptive as in a conventional AEC solution and one foreground filter which is not adaptive. The foreground filter cancels the echo. Whenever the background filter performs better than the foreground, its coefficients are copied to the foreground. Coefficients are copied only when a set of conditions are met, which should be compared to the single statistic decision declaring “no double-talk” in a traditional DTD presented in the previous sections.
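The two-filter structure can be sketched as follows. The background filter is adapted here with NLMS, and the copy rule is deliberately reduced to a single comparison of smoothed error magnitudes. Both choices are assumptions made for illustration and do not reproduce the condition set of [13]. The function name `two_path_canceler` and its parameters are hypothetical.

```python
def two_path_canceler(x, y, L, mu=0.5, eps=1e-6, alpha=0.95):
    """Return (foreground residual per sample, number of coefficient copies)."""
    fg = [0.0] * L        # foreground filter: not adaptive, produces the output
    bg = [0.0] * L        # background filter: adapts continuously (NLMS here)
    xv = [0.0] * L        # [x(n), ..., x(n-L+1)]
    pf = pb = 0.0         # smoothed |error| of foreground and background
    residual, copies = [], 0
    for xn, yn in zip(x, y):
        xv = [xn] + xv[:-1]
        ef = yn - sum(f * v for f, v in zip(fg, xv))
        eb = yn - sum(b * v for b, v in zip(bg, xv))
        residual.append(ef)                       # the foreground cancels the echo
        norm = sum(v * v for v in xv) + eps
        bg = [b + mu * eb * v / norm for b, v in zip(bg, xv)]
        pf = alpha * pf + (1.0 - alpha) * abs(ef)
        pb = alpha * pb + (1.0 - alpha) * abs(eb)
        if pb < pf:       # hypothetical rule: background outperforms foreground
            fg = list(bg)                         # copy coefficients
            pf = pb
            copies += 1
    return residual, copies
```

Because the output is always taken from the foreground filter, a disturbance that degrades the background filter reaches the output only if the copy rule lets it through, which is why the design of that rule matters.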

The basic set of conditions found in [13] is given by (6.38)-(6.40). Copying is performed, equivalent to no double-talk present, if all of (6.38)-(6.40) are fulfilled. The conditions are expressed in terms of the short-time smoothed absolute magnitude of each signal. A hangover time is also imposed when (6.40) is fulfilled. The last condition, (6.40), is basically the same as in the Geigel DTD with a unity threshold, i.e., the echo path is assumed not to attenuate the far-end speech. If all three conditions are satisfied over $D$ consecutive decisions, copying of background coefficients is resumed.

Condition (6.38) ensures the background adaptive filter is canceling echo, while condition (6.39) ensures the background filter is outperforming the foreground filter. The above decision logic is effective for certain applications, but is not without shortcomings. First, conditions (6.38) and (6.39) are not always sufficient to prevent coefficient transfer in the presence of double-talk and/or high background noise. For speech or any other non-spectrally diverse excitation, the inequalities in (6.38) and (6.39) can be satisfied in the short term (over duration $D$, for example) even though the actual misalignment error of the background coefficients is worse than that of the foreground coefficients.

Second, (6.38) and (6.39) employ thresholds that limit the responsiveness of the logic to changes in the performance of the background canceler and to changes [...] of the echo canceler. Last, condition (6.40) ensures no update is performed unless the microphone signal level is below the far-end signal level. But this property can be used to inhibit adaptation only in cases for which the physical echo path introduces signal loss. If the echo path introduces gain, condition (6.40) prevents adaptation even in the absence of near-side speech and noise. For this reason, these rules cannot in general be used in echo-canceling speakerphones, where the echo path may introduce gain.

3.6.1 A Threshold-Free Decision Logic. In addition to the beneficial aspects of the original two-path logic, a two-path canceler decision logic should possess the following characteristics:

Faster initial convergence and reconvergence following echo path changes.

Applicability to echo paths having signal gain.

Reduced dependence upon user-selected constants, such as thresholds and timers.

A logic that exhibits these properties is proposed and described in detail in [22]. This decision logic differs from that of prior works in that it does not use decision thresholds (constants). A smoothing parameter is the only constant that has to be chosen. Moreover, the logic applies to both lossy and gain-incurring echo paths, and possesses favorable convergence properties for many scenarios encountered in practice. Hence, the great advantage of this algorithm is that it is not sensitive to echo path changes, since the background filter is allowed to track changes freely and, as soon as it performs better than the foreground, it is copied over.

3.7 DTD COMBINATIONS WITH ROBUST STATISTICS

All practical double-talk detectors have a probability of miss, i.e. $P_m \neq 0$. Requiring the probability of miss to be smaller will undoubtedly increase the probability of false alarms, hence slowing down the convergence rate. As a consequence, no matter what DTD is used, undetected near-end speech will perturb the adaptive algorithm from time to time. Figure 6.4 shows the remaining undetected near-end speech (double-talk) after double-talk detection with a Geigel detector with $T = 2$. The impact of this perturbation is governed by the echo to near-end speech ratio as described in Section 1.

In practice, what has been done in the past is that the DTD is first designed to be “as good as” one can afford, and then the adaptive algorithm is slowed down so that it copes with the detection errors made by the DTD. This is natural to do since, if the adaptive algorithm is very fast, it can react faster to situation changes (e.g. double-talk) than the DTD and thus can diverge. However, this approach severely penalizes the convergence rate of the AEC when the situation is good, i.e. when far-end speech but no near-end speech is present.

In the light of these facts, it may be fruitful to look at adaptive algorithms that can handle at least a small amount of double-talk without diverging. This approach has been studied and proven very successful in the network echo canceler case [23], where outlier-resistant adaptive algorithms are combined with a DTD.

As for any AEC/DTD, adaptation is inhibited by setting the step-size parameter to zero when double-talk is detected. The scaled non-linearity in (6.41) can be chosen to be the limiter [24], applied to the residual error divided by an adaptive scale factor $s$. Making the scale factor adaptive and supervised by the DTD is the key to the success of this approach. The scale factor should reflect the background noise level at the near-end, be robust to short burst disturbances (double-talk) and track long-term changes of the residual error (echo path changes). To fulfill these requirements, the scale factor estimate can be updated recursively from the limited residual error. Adaptation of $s$ is performed as long as the DTD has not detected double-talk. Justification and details of the above derivations can be found in [23].
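The combination of a limiter with an adaptive scale factor can be sketched as a robust NLMS update. The exact expressions of this section are not reproduced here, so the forms used in the code are assumptions: the limiter is taken as $\psi = \min(|e|/s, k_0)$, the coefficient update as an NLMS step driven by $s\,\psi\,\mathrm{sign}(e)$, and the scale recursion as $s \leftarrow \lambda s + \frac{1-\lambda}{\beta}\,s\,\psi$. The parameter values, the floor `s_min` and the function name `robust_nlms` are illustrative choices.

```python
import math

def robust_nlms(x, y, L, mu=0.5, k0=1.5, lam=0.997, beta=0.6, s0=1.0,
                s_min=1e-6, eps=1e-6, double_talk=None):
    """NLMS with a limited error and an adaptive scale factor.

    double_talk: optional list of DTD decisions; when True, both the
    filter and the scale factor are frozen for that sample.
    Setting k0=float("inf") disables the limiter (ordinary NLMS).
    """
    h = [0.0] * L
    xv = [0.0] * L                     # [x(n), ..., x(n-L+1)]
    s = s0                             # adaptive scale factor
    for n, (xn, yn) in enumerate(zip(x, y)):
        xv = [xn] + xv[:-1]
        if double_talk is not None and double_talk[n]:
            continue                   # DTD active: no adaptation at all
        e = yn - sum(c * v for c, v in zip(h, xv))
        psi = min(abs(e) / s, k0)      # limiter applied to the scaled error
        step = mu * s * psi * math.copysign(1.0, e)
        norm = sum(v * v for v in xv) + eps
        h = [c + step * v / norm for c, v in zip(h, xv)]
        # Scale update (assumed form); s_min keeps the division well defined.
        s = max(lam * s + (1.0 - lam) / beta * s * psi, s_min)
    return h
```

While the residual stays small relative to `s`, the update is identical to ordinary NLMS. A burst of undetected near-end speech is clipped at `k0 * s`, so each update during the burst moves the coefficients by a bounded amount, and `s` grows only slowly because of the forgetting factor.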
