ECHO CANCELLATION
6.8 ECHO CANCELLER CONTROL FUNCTIONS
Adaptive filters use an FIR-equivalent filter to model the echo path. The echo path estimate is updated with one of the LMS or RLS family of algorithms. In recent implementations, the adaptive filter is split into two parts. In Fig. 6.7, the two filters are marked as the adaptive filter and the hold filter. The adaptive filter keeps adapting as per the previously explained algorithms. The hold filter gets an update from the adaptive filter in good conditions. While the adaptive filter is settling, the hold filter keeps taking updates from it. The hold filter itself does not perform any coefficient adaptation. The adaptive filter summing junction output is mainly used for the adaptive filter closed-loop adaptation. The hold filter summing junction output is the actual output used with the NLP operation. When the adaptive filter is disturbed, it can reload good coefficients from the hold filter. When there is a near-end signal or a double talk condition, the adaptive filter is not updated. This type of double-filtering scheme minimizes the disturbances during undesired conditions [URL (Cisco-G168)]. Echo removal is better and more stable with this scheme. The scheme requires slightly more memory and processing, mainly to validate the updates and to run the additional filtering operation. More aspects of the hold filter in association with double talk are given later in this chapter.
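The adaptive and hold filter arrangement described above can be sketched as follows. This is a minimal illustration, not the scheme of any cited implementation: the class name DualFilterCanceller, the NLMS update, the smoothed error comparison used as the "good conditions" test, and all parameter values are assumptions introduced for the example.

```python
class DualFilterCanceller:
    """Adaptive + hold FIR pair. Names, update rule, and thresholds are assumptions."""

    def __init__(self, taps, mu=0.5, margin=0.5, smooth=2 ** -5):
        self.w_adapt = [0.0] * taps   # keeps adapting (NLMS is assumed here)
        self.w_hold = [0.0] * taps    # never adapts; only receives copies
        self.hist = [0.0] * taps      # far-end history, newest sample first
        self.mu, self.margin, self.smooth = mu, margin, smooth
        self.p_adapt = 0.0            # smoothed |error| of the adaptive filter
        self.p_hold = 0.0             # smoothed |error| of the hold filter

    def process(self, y, s, freeze=False):
        """y: far-end sample (R_in), s: S_in sample. Returns the hold-filter output."""
        self.hist = [y] + self.hist[:-1]
        e_adapt = s - sum(w * h for w, h in zip(self.w_adapt, self.hist))
        e_hold = s - sum(w * h for w, h in zip(self.w_hold, self.hist))
        a = self.smooth
        self.p_adapt = (1 - a) * self.p_adapt + a * abs(e_adapt)
        self.p_hold = (1 - a) * self.p_hold + a * abs(e_hold)
        if freeze:
            # Disturbed or double talk: do not adapt, reload the good coefficients.
            self.w_adapt = list(self.w_hold)
            self.p_adapt = self.p_hold
        else:
            energy = sum(h * h for h in self.hist) + 1e-9
            gain = self.mu * e_adapt / energy
            self.w_adapt = [w + gain * h for w, h in zip(self.w_adapt, self.hist)]
            # Assumed "good condition": adaptive filter clearly outperforms hold filter.
            if self.p_adapt < self.margin * self.p_hold:
                self.w_hold = list(self.w_adapt)
        return e_hold   # this output would go on to the NLP stage
```

The extra cost mentioned in the text is visible here: two FIR convolutions per sample instead of one, a second coefficient array, and the bookkeeping needed to validate each transfer.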
In the echo canceller, the control plane includes double talk detection, the nonlinear processor (NLP), and modem/fax tone detectors. Double talk detection is the main control that influences the adaptation. Modem and fax detections are mainly single events that decide whether adaptation and NLP are enabled or disabled in the echo cancellation operation. The NLP operation is used to remove the echo residue.
6.8.1 Double Talk Detection
Double talk (DT) is the simultaneous presence of near-end and far-end speech. The main purpose of the DT detector is to avoid adaptation whenever a double talk signal is present. Double talk has to be declared on any significant presence of a near-end (nonecho) signal, irrespective of the far-end signal and echo. The double talk detector should also detect low-level near-end signals, music, and strong background noise conditions. Hence, the DT detector is also called the near-end speech detector or near-end voice activity detector. DT detection freezes the adaptive filter and disables the NLP operation. However, the linear part of the echo cancellation continues with the previously adapted filter. With reference to Figs. 6.6(a) and 6.7, DT detection passes near-end speech S_gen directly to S_out without any distortion while cancelling the linear part of the echo.
As shown in Fig. 6.7, the main inputs to the DT detector are the far-end signal (R_in) and the S_in signal, which includes both echo and the near-end signal. Echo is also a strong signal present in S_in. Echo alone should not trigger DT detection without the presence of actual near-end speech, even with low ERL. Low ERL means a strong echo in S_in. Some popular DT detection approaches are given below.
Geigel Algorithm. Early implementations of double talk detection used the Geigel algorithm. In recent times, it is supplemented with several other signal-processing techniques. The Geigel algorithm [URL (SPRA129)] compares the near-end signal S_in with a short-term history of R_in. In the following equations, signal S_in(i) is referred to as s(i), which is a combination of near-end speech x(i) and echo r(i). R_in(i) is referred to as y(i) to maintain consistency with Fig. 6.6(a). The Geigel algorithm detects the presence of DT when the following condition is satisfied:
$$|s(i)| = |x(i) + r(i)| \ge \frac{1}{2}\max\{|y(i)|, |y(i-1)|, \ldots, |y(i-N)|\} \tag{6.13}$$
where N is the FIR adaptive filter order. The factor of one half is based on the hypothesis that the echo path loss through a hybrid, known as ERL, is at least 6 dB. For different ERL requirements, this threshold has to be modified. Using one quarter in place of one half corresponds to an assumed ERL of at least 12 dB. The selected threshold, together with the variations in ERL under multiple usage scenarios, should not create double talk detections from self-echo. The preferred condition is |x(i)| >> |r(i)| for better DT detection. In Eq. (6.13), N samples are compared for every new input sample. In an actual implementation, the computation is reduced by keeping partial maxima over a few small blocks, so that not all previous N − 1 samples have to be examined at each step.
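The comparison in Eq. (6.13) and the link between the threshold and the assumed ERL can be sketched as below. The class name GeigelDetector and its parameters are hypothetical, and the sketch scans the whole history on every sample rather than using the block-wise partial maxima mentioned above.

```python
from collections import deque


class GeigelDetector:
    """Sample-by-sample Geigel comparison, |s(i)| >= T * max|y(i-k)| for k = 0..N."""

    def __init__(self, order, erl_db=6.0):
        # An amplitude ratio of 10^(-ERL/20) gives about 1/2 at 6 dB and 1/4 at 12 dB.
        self.threshold = 10 ** (-erl_db / 20.0)
        self.y_hist = deque([0.0] * (order + 1), maxlen=order + 1)

    def update(self, y, s):
        """y: far-end sample y(i); s: near-end sample s(i). Returns True on DT."""
        self.y_hist.appendleft(y)   # the oldest sample drops off the right end
        return abs(s) >= self.threshold * max(abs(v) for v in self.y_hist)
```

For example, with order 4 and a recent far-end peak of 1.0 in the history, a near-end sample of 0.4 is treated as echo, while a sample of 0.8 is declared double talk.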
A more robust version of the Geigel algorithm uses short-term power estimates, ŝ(i) and ŷ(i), of the recent past of the near-end hybrid signal s(i) and of the far-end signal y(i). These estimates are computed recursively by the equations
$$\hat{s}(i+1) = (1-\alpha)\,\hat{s}(i) + \alpha\,|s(i)| \tag{6.14}$$
$$\hat{y}(i+1) = (1-\alpha)\,\hat{y}(i) + \alpha\,|y(i)| \tag{6.15}$$
In the equations, the filter gain is α = 2^{-5}. Near-end speech or double talk is detected whenever
$$\hat{s}(i) \ge \frac{1}{2}\max\{\hat{y}(i), \hat{y}(i-1), \ldots, \hat{y}(i-N)\}$$
As the near-end speech detector algorithm detects short-term peaks, it is desirable to continue declaring near-end speech for some hangover time after the initial detection [URL (SPRA129)]. During the hangover time, the double talk conditions are maintained: adaptation stays frozen and NLP stays disabled. Once the hangover time is over, adaptation of the filter coefficients is allowed and NLP is enabled again.
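A minimal sketch of the smoothed detector with hangover follows. The class name, the hangover length in samples, and the small floor that keeps pure silence from being flagged are assumptions added for illustration. The comparison is made right after the estimates absorb the current sample.

```python
from collections import deque


class SmoothedGeigelDetector:
    """Geigel-style test on smoothed magnitudes, with a hangover counter."""

    def __init__(self, order, alpha=2 ** -5, threshold=0.5,
                 hangover=600, floor=1e-6):
        self.alpha = alpha
        self.threshold = threshold
        self.hangover = hangover   # in samples; 600 is an assumed value
        self.floor = floor         # assumed guard so that silence is not flagged
        self.s_hat = 0.0
        self.y_hat = 0.0
        self.y_hat_hist = deque([0.0] * (order + 1), maxlen=order + 1)
        self.count = 0

    def update(self, y, s):
        """Feed far-end sample y and S_in sample s; return the DT flag."""
        a = self.alpha
        self.s_hat = (1 - a) * self.s_hat + a * abs(s)   # Eq. (6.14)
        self.y_hat = (1 - a) * self.y_hat + a * abs(y)   # Eq. (6.15)
        self.y_hat_hist.appendleft(self.y_hat)
        raw = (self.s_hat > self.floor and
               self.s_hat >= self.threshold * max(self.y_hat_hist))
        if raw:
            self.count = self.hangover   # restart the hangover on every detection
        elif self.count > 0:
            self.count -= 1
        return raw or self.count > 0
```

While the returned flag is True, a caller would keep the adaptive filter frozen and the NLP disabled, and release both once the flag drops.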
Correlation-Based DT Detections. In correlation-based detection, correlation is computed between the reference y(i) and the near-end signal with echo, s(i). It is expected that the actual echo r(i) will correlate well with y(i) and that near-end speech x(i) will have less correlation with y(i). These correlations are validated with thresholds and absolute power levels. There are several extensions to this basic correlation-based method. Assuming that the echo canceller is in an adapted state, in relation to Fig. 6.6(a), r̂(i) ≈ r(i). During double talk, r(i) will be growing more quickly than r̂(i). This measure is helpful for building extra logic for DT detection.
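One way to picture the basic correlation test is a normalized cross-correlation between a block of y(i) and a block of s(i), searched over the echo delay. The function names, the lag search, and the threshold values below are assumptions for illustration only. Practical detectors typically use recursive estimates rather than block sums.

```python
import math


def max_normalized_xcorr(y, s, max_lag):
    """Largest |normalized cross-correlation| of s(i) with y(i - lag), lag = 0..max_lag."""
    best = 0.0
    n = len(s)
    for lag in range(max_lag + 1):
        ys = y[: n - lag]
        ss = s[lag:]
        num = sum(a * b for a, b in zip(ys, ss))
        den = math.sqrt(sum(a * a for a in ys) * sum(b * b for b in ss))
        if den > 0.0:
            best = max(best, abs(num) / den)
    return best


def correlation_double_talk(y, s, max_lag, corr_threshold=0.7, power_floor=1e-4):
    """Declare DT when s has significant power but correlates poorly with y.

    y and s are equal-length blocks; both thresholds are assumed values.
    """
    power = sum(v * v for v in s) / len(s)
    return power > power_floor and max_normalized_xcorr(y, s, max_lag) < corr_threshold
```

With S_in containing only a delayed, scaled copy of the far-end signal, the correlation is close to 1 and no double talk is declared. Adding uncorrelated near-end speech of comparable level pulls the correlation down and triggers the detection.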
ERL and ERLE-Based DT Detection. The adaptive echo filter models the linear part of the hybrid ERL contribution. To make use of ERL for DT detection, ERL is monitored continuously. This process may require normalization of the filter coefficients and conversion of ERL to the dB scale. During the beginning phase of DT, the adaptive filter will keep adapting and ERL will keep decreasing. This decrease is one of the early indications of double talk. It happens because the filter coefficients quickly try to adapt to a strong near-end signal even though it is not echo. This procedure also goes by the name "filter disturbance detection." As a continuation of the previous condition, a reduction of ERLE at the summing junctions is also treated as a likely indication of DT.
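The ERL monitoring idea can be sketched as below. Estimating ERL from the sum of squared coefficients assumes a roughly flat-spectrum far-end signal. That estimator, the 3-dB drop threshold, and the function names are assumptions made for this illustration.

```python
import math


def erl_db_from_taps(taps):
    """ERL estimate in dB from the filter coefficients (flat-spectrum assumption)."""
    gain = sum(w * w for w in taps)
    return float("inf") if gain == 0.0 else -10.0 * math.log10(gain)


def erle_db(s_power, e_power):
    """ERLE in dB from the powers before and after the summing junction."""
    return 10.0 * math.log10(s_power / e_power)


def filter_disturbed(erl_history_db, drop_db=3.0):
    """Flag when the newest ERL estimate fell more than drop_db below the recent maximum."""
    return max(erl_history_db) - erl_history_db[-1] > drop_db
```

A single tap of 0.5 gives an estimate of about 6 dB. A history of 12.0, 12.1, 11.9, 7.5 dB would be flagged, since the newest value is 4.6 dB below the recent maximum.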
Double Filter as a Helpful Option to DT. In recent implementations of the adaptive algorithm, double filters are used. As shown in Fig. 6.7, one filter, marked as the hold filter, preserves the good version of the adapted coefficients. The incorporation of the hold filter relaxes the conditions for DT detection [URL (Cisco-G168)]. Good versions of the adaptive coefficients are always preserved in the hold filter. If the adaptive filter is disturbed, the hold filter can upload the good coefficients to the adaptive filter. The hold filter takes a copy from the adaptive filter when all the conditions are favorable. The double talk detection is based on the hold filter echo cancellation. The double filter scheme cannot detect double talk by itself, but it is helpful when used with other detection methods.
Power-Based Normalization. The LMS algorithm given in Section 6.7 adapts slowly for lower powers. The technique indicated in [Al-Naimi (2003)] slows down adaptation by controlling the resultant step size. During a strong s(i) signal, the adaptation step is reduced so that adaptation slows down. The goal is to make the adaptation very slow under any likely DT condition. Once the DT condition is over, the adaptive filter continues to adapt at the set rate. This type of approach is usually attempted with a single adaptive filter scheme. No separate DT detector is used in this approach.
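The step-control idea can be illustrated with a normalized LMS update whose denominator also includes the S_in power. This is one plausible form chosen for illustration, not the formulation of the cited reference. The function name, the weight factor, and the way the near-end power enters the normalization are all assumptions.

```python
def nlms_power_normalized(w, hist, e, mu, s_power, weight=4.0, eps=1e-9):
    """One coefficient update whose step shrinks as the S_in power grows.

    w: current coefficients; hist: far-end history (newest first);
    e: error sample; s_power: short-term power estimate of s(i).
    """
    far_energy = sum(h * h for h in hist)
    # A strong s(i) inflates the denominator and so slows the adaptation.
    step = mu / (far_energy + weight * len(hist) * s_power + eps)
    return [wi + step * e * hi for wi, hi in zip(w, hist)]
```

With a quiet near end the update reduces to ordinary NLMS. With a strong s(i) the same error sample moves the coefficients far less, which is the intended behavior during likely double talk.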
6.8.2 NonLinear Processing
An example of a device that reduces or cancels small echo signals by nonlinear operation on the samples of the transmitted audio signal is a center clipper [ITU-T-P.340 (2000)]. Nonlinearities in the echo path of the telephone circuit, uncorrelated near-end speech, speech clipping, and quantization effects of codecs [URL (SPRA129)] limit the achievable echo rejection in the adaptive filter to 19–35 dB. In most VoIP situations, echo has to be removed to −65 dBm [ITU-T-G.168 (2004)]. Test equipment terminated on the VoIP system may behave like a linear FIR filter, which makes it possible to reject most of the echo in the linear stage. With practical phones, the adaptive FIR filter will not remove echo to this level, which calls for residual echo suppression algorithms. When the near-end signal is present, the residual echo suppressor must pass the signal without any noticeable distortion. A suppression algorithm detects when to operate the NLP to remove the residue.
In simple suppression control, a decision is made when the error signal falls below a certain level in relation to the reference signal. The residue is eliminated by turning the signal into complete silence. In some implementations, instead of complete silence, an attenuation of usually 12–24 dB is applied. Sudden application of silence or attenuation is perceived as annoying. Hence, time-dependent attenuation shaping is applied in the NLP window of operation. This process helps to reduce the annoyance, but it does not eliminate it. To help improve perception, comfort noise (CN) is created during the NLP region [URL (Cisco-G168)]. As explained in Chapter 4, VAD/CNG can be power based or can match the spectral envelope. The G.168 requirements mainly talk about power-based tracking. Several advanced implementations [Bourget (2003)] create a pleasant background that completely eliminates the hollowness. The creation of comfort noise based on power or spectral shaping has to track the background. Some of these techniques may not take care of low-level near-end speech or music. It is essential to detect these sounds to avoid removing them along with the echo residue. To improve perception, it is also essential to send the background speech or music directly, without replacing it with comfort noise. Taking care of these conditions helps to improve the perception [Bourget (2003)].
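The time-dependent attenuation shaping mentioned above can be sketched as a gain that ramps between 0 dB and the target attenuation instead of switching abruptly. The function name, the linear-in-dB ramp, and the 18-dB and 40-sample values are assumptions for illustration. Comfort noise insertion is not shown.

```python
def nlp_gain_track(active_flags, atten_db=18.0, ramp_samples=40):
    """Per-sample linear gain that ramps (linearly in dB) toward -atten_db while
    the NLP is active and back toward 0 dB when it is released."""
    gains = []
    level_db = 0.0
    step = atten_db / ramp_samples
    for active in active_flags:
        if active:
            level_db = max(-atten_db, level_db - step)
        else:
            level_db = min(0.0, level_db + step)
        gains.append(10 ** (level_db / 20.0))
    return gains
```

Each output sample would be the residual multiplied by the corresponding gain. An attenuation of 18 dB corresponds to a linear gain of about 0.126.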
Once a major part of the echo is rejected in the adaptive filter, most of the voice quality enhancement or degradation happens in the NLP block. Echo residue extends in time. Hence, a certain hangover time is applied that mainly coincides with the end of the speech utterance. This time varies based on signal conditions. Typical hangovers are from 50 to 120 ms [ITU-T-G.168 (2004)]. A long hangover is undesirable, but it has to be sufficient to achieve rejection and better perception. Simpler algorithms operate on power levels. To arrive at an optimal hangover, signal analysis is also performed.
Power-Based NLP Detection. Simple power-based detection [URL (SPRA129)] is given here. In this formulation, the echo residue is shown as e(i). The signal power is estimated as
$$\hat{e}(i) = (1-\rho)\,\hat{e}(i-1) + \rho\,|e(i)|$$
The reference y(i) power is estimated as
$$\hat{y}(i) = (1-\rho)\,\hat{y}(i-1) + \rho\,|y(i)|$$
Suppression is enabled, setting the transmitted S_out(i) to zero, whenever
$$\frac{\hat{e}(i)}{\hat{y}(i)} \le \frac{1}{16}$$
The threshold of one sixteenth corresponds to 24 dB. A recommended value of ρ for this operation is 2^{-7}. Recent implementations use higher end signal processing techniques that analyze the residual signal and apply the required rejection. During double talk and on detection of the 2100-Hz modem/fax tone, NLP-based residue detection and removal is disabled completely.
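The power-based suppression rule can be sketched as follows. The class name PowerNlp and the double_talk and tone_detected arguments are hypothetical. The smoothing constant and the one-sixteenth threshold follow the text, and the estimates use sample magnitudes, for which a ratio of 1/16 is 20·log10(16) ≈ 24 dB.

```python
class PowerNlp:
    """Zeroes the output when the smoothed residue is at least 24 dB below the reference."""

    def __init__(self, rho=2 ** -7, threshold=1.0 / 16.0):
        self.rho = rho
        self.threshold = threshold
        self.e_hat = 0.0   # smoothed residue magnitude
        self.y_hat = 0.0   # smoothed far-end (reference) magnitude

    def process(self, e, y, double_talk=False, tone_detected=False):
        """e: residue sample after the linear canceller; y: far-end sample."""
        r = self.rho
        self.e_hat = (1 - r) * self.e_hat + r * abs(e)
        self.y_hat = (1 - r) * self.y_hat + r * abs(y)
        if double_talk or tone_detected:
            return e   # NLP disabled: pass the signal untouched
        # Written as a product to avoid dividing by a zero reference estimate.
        if self.e_hat <= self.threshold * self.y_hat:
            return 0.0
        return e
```

A caller would feed the DT flag and the tone-detector decision from the control plane into process() on every sample.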
NonLinear Filtering. In theory, echo residue can be decomposed into a small linear part and a nonlinear part. The framework of nonlinear filtering is given in [Mathews et al. (2000), Borys (2001)]. In simple terms, a nonlinear filter is a higher order filter with several product terms. The higher order terms help in modeling the nonlinearities of echo. At the time of writing this book, nonlinear filters were not widely adopted in echo canceller implementations because of several limitations and constraints.
Nonlinear filters are more complex by design. They take more processing. Numerical precision for higher order product terms is also important. Memory requirements grow with the order of the filtering. Fixed nonlinear filtering is reasonably easy to establish. Adaptive nonlinear filtering complicates the operation. To establish a reasonable balance of complexity, the models are truncated in order. The truncated order limits the cancellation. Hence, another level of cancellation is required even with nonlinear filtering.
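To make the "product terms" concrete, the sketch below evaluates a polynomial filter truncated at second order, commonly called a second-order Volterra filter, and counts its coefficients. Treating this as the form intended by the cited references is an assumption, and the function names are hypothetical.

```python
def volterra2_output(hist, h1, h2):
    """Output of a filter with linear taps h1[k] and product taps h2[k][l], k <= l.

    hist holds the most recent input samples, newest first.
    """
    n = len(hist)
    linear = sum(h1[k] * hist[k] for k in range(n))
    quadratic = sum(h2[k][l] * hist[k] * hist[l]
                    for k in range(n) for l in range(k, n))
    return linear + quadratic


def volterra2_term_count(memory):
    """Number of coefficients for a given memory length: linear plus product terms."""
    return memory + memory * (memory + 1) // 2
```

For a memory of 128 samples, the linear part has 128 coefficients while the truncated second-order model has 8384. This illustrates why memory and processing grow quickly with the order.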
Future designs may adopt nonlinear filtering because of the practically unlimited (relative to the needs of nonlinear filtering) availability of processing with higher numerical precision and memory. End-to-end delays may reduce over time because of improved network and bandwidth conditions. As a result, the requirements on nonlinear echo rejection will reduce, which makes it possible to manage nonlinear filtering with a lower order.
NLP and Quality Issues in Relation to Codecs and VAD/CNG. NLP removes or reduces echo residue that consists of a linear and a nonlinear part. Depending on the NLP implementation, the NLP output can be silence, power-matched background noise, spectrally matched comfort noise with additional checks on background voice and music, and so on. Some voice quality aspects of NLP are given in [Bourget (2003)]. In the case of simple NLP, such as silence or power-level-based comfort noise, voice codecs and VAD go through distortions as illustrated in Fig. 6.8.
In VAD disable mode, all samples are compressed as speech frames. Imperfections in NLP, as well as the inability to reproduce the exact background at the NLP output, can disturb the parameters, mainly with code excited linear prediction (CELP) codecs like G.729A. At the decoder, audible artifacts (sudden ticks or hits) are produced from the disturbed parameters of the encoder. These artifacts are of very low power, but they are audible in a clean acoustic environment. In waveform-based codecs, NLP imperfections pass transparently through the encoder and decoder. As explained in Chapter 4, VAD modules used with waveform codecs use simple power-based or CELP-based techniques. Power-based VADs do not create any disturbance from the imperfections of the NLP operation. In CELP-based VAD/CNGs, artifacts can be observed from the imperfections of the NLP operation.
In general, NLP has to be made perfect to use any codec with or without VAD/CNG mode. When end-to-end delays are lower, it is possible to disable NLP, but end-to-end delays may vary in different call combinations. Hence, perfecting NLP is essential to maintain good voice quality. In the receive path, when CNG is operating, the R_in signal level will be very low, of the order of −40 to −60 dBm. It is good to disable adaptation during comfort noise generation to avoid disturbance to the adapted coefficients.
Figure 6.8. NLP, codecs and VAD/CNG relation.
In general, the NLP operation influences codec and VAD/CNG operations. The resulting disturbances are audible in a reasonably silent environment and degrade quality in multiparty conferences.
6.8.3 Monitoring and Configuration
Real-time Transport Protocol (RTP) transports voice packets, and RTP Control Protocol (RTCP) transports packet statistics. RTCP extended reports (RTCP-XR) is the RTCP extension that monitors several voice quality parameters. RTCP-XR, which is discussed in Chapter 20, makes use of parameters such as residual echo return loss, signal level, and noise level. It is possible to derive these parameters from the echo canceller functional blocks. In addition to the RTCP-XR parameters, echo canceller monitoring can be extended to validate adaptive filter convergence, the double talk detector, NLP, tone detectors, coordination of the control plane, and processing resource utilization. The echo canceller requires configuration for several modes of operation. The important configurations are echo tail length, stress on adaptation, NLP, DT controls, hangover, reset states, and echo canceller external testing controls.