Weighted Recursive Least Square Filter and Neural Network based Residual Echo Suppression for the AEC-Challenge
Ziteng Wang†‡, Yueyue Na†, Zhang Liu†, Biao Tian†, Qiang Fu†‡
† Machine Intelligence Technology, Alibaba Group
‡ Beijing Sound Connect Technology
ABSTRACT
This paper presents a real-time Acoustic Echo Cancellation (AEC) algorithm submitted to the AEC-Challenge. The algorithm consists of three modules: Generalized Cross-Correlation with PHAse Transform (GCC-PHAT) based time delay compensation, weighted Recursive Least Square (wRLS) based linear adaptive filtering and neural network based residual echo suppression. The wRLS filter is derived from a novel semi-blind source separation perspective. The neural network model predicts a Phase-Sensitive Mask (PSM) based on the aligned reference and the linear filter output. The algorithm achieved a mean subjective score of 4.00 and ranked 2nd in the AEC-Challenge.
Index Terms: AEC-Challenge, weighted RLS, residual echo suppression, deep neural network
1. INTRODUCTION
Acoustic Echo Cancellation (AEC) plays an essential part in full-duplex speech communication systems. The goal of AEC is no echo leakage when there is a loudspeaker signal (far end) and no speech distortion when the users talk (near end). It has been a challenging problem since the early days of telecommunication [1]. A practical acoustic echo cancellation solution, e.g. the one in the WebRTC project [2], usually consists of three modules: Time Delay Compensation (TDC), linear adaptive filtering and Non-Linear Processing (NLP).

Fig. 1. A typical acoustic echo cancellation solution (blocks: TDC, linear filter, NLP; signals: far end, near end speech, echo path).

Time delay compensation is necessary, especially in real systems where microphone signal capturing and loudspeaker signal rendering are handled by different threads and the sample clocks may not be synchronized. Typical delays between the far end and near end signals range from 10 ms to 500 ms. Though in theory a linear adaptive filter can handle any delay given a sufficient number of filter taps, TDC benefits the performance by avoiding over-parameterization and speeding up convergence. Time delay estimation methods include the Generalized Cross-Correlation with PHAse Transform (GCC-PHAT) algorithm [3] and audio fingerprinting technology [4].

Linear adaptive filters, such as Normalized Least Mean Square (NLMS) filters [5] and Kalman filters [6], can be designed either in the time domain or in the frequency domain. For the best possible performance, the filter length should be long enough to cover the whole echo path, which could be thousands of taps in the time domain. Frequency Domain Adaptive Filters (FDAF) [7] are more often chosen for computational savings and better modeling statistics.

NLP is introduced as a complement to linear filtering to suppress residual echoes. The methods are generally adapted from noise reduction techniques, e.g. the multi-frame Wiener filter [8]. Many recent studies also adopt deep learning methods for residual echo suppression [9, 10, 11, 12] and report reasonable objective scores on synthetic datasets. One concern is that the neural network models may degrade significantly in real applications. The AEC-Challenge [13] was thus organized to stimulate research in this area by providing recordings from more than 2,500 real audio devices and human speakers in real environments. The evaluation is based on the average P.808 Mean Opinion Score (MOS) [14] achieved across all different single talk and double talk scenarios.

This paper describes our submission to the AEC-Challenge, which consists of three cascading modules: GCC-PHAT for time delay compensation, weighted Recursive Least Square (wRLS) for linear filtering, and a Deep Feedforward Sequential Memory Network (Deep-FSMN) [15] for residual echo suppression. The wRLS filter is derived from a novel semi-blind source separation perspective and is shown to be double talk friendly. The algorithm proved its efficacy in the AEC-Challenge and is described in the following sections.
2. THE PROPOSED ALGORITHM
As in Figure 1, the captured signal at time $t$ is expressed as:
$$d(t) = x(t) \ast a(t) + s(t) + v(t) \tag{1}$$
where $x(t)$, $s(t)$ and $v(t)$ are respectively the far end signal, the near end speech signal and the signal modeling error, $a(t)$ denotes the echo path and $\ast$ denotes convolution. It is assumed that $v(t) = 0$ in the following for simplicity. The frequency representations of $d, x, a, s$ are denoted as $D, X, A, S$ respectively.

The GCC-PHAT algorithm is applied first to align the far end and near end signals. The generalized cross correlation is defined as $\Phi_{t,f} = E[X_{t,f} D_{t,f}^{*}]$, with $E[\cdot]$ denoting expectation, $f$ the frequency index and $(\cdot)^{*}$ the conjugate of a variable. The online implementation is given by:
$$\Phi_{t,f} = \alpha \Phi_{t-1,f} + (1 - \alpha) X_{t,f} D_{t,f}^{*} \tag{2}$$
where $\alpha$ is a smoothing parameter. The relative delay $\tau$ is obtained by applying an Inverse Fast Fourier Transform (IFFT) to the phase-normalized cross correlation and finding the index of the maximum:
$$\tau = \underset{\tau}{\operatorname{argmax}}\; \text{IFFT}\!\left(\frac{\Phi_{t,f}}{|\Phi_{t,f}|}\right) \tag{3}$$
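To make the delay estimation concrete, below is a minimal numpy sketch of Equations (2)-(3). The function name, the frame interface and the default smoothing constant are illustrative assumptions, not part of the original system.

```python
import numpy as np

def gcc_phat_delay(x_frames, d_frames, n_fft=16384, alpha=0.9):
    """Estimate the relative delay (in samples) between the far end
    frames x_frames and microphone frames d_frames, both shaped
    (num_frames, frame_len) with frame_len <= n_fft."""
    phi = np.zeros(n_fft // 2 + 1, dtype=np.complex128)
    for x_t, d_t in zip(x_frames, d_frames):
        X = np.fft.rfft(x_t, n_fft)
        D = np.fft.rfft(d_t, n_fft)
        # Eq. (2): recursive smoothing of the cross power spectrum
        phi = alpha * phi + (1.0 - alpha) * X * np.conj(D)
    # Eq. (3): phase transform, IFFT, and peak picking
    cc = np.fft.irfft(phi / (np.abs(phi) + 1e-12), n_fft)
    tau = int(np.argmax(cc))
    return tau - n_fft if tau > n_fft // 2 else tau  # wrap negative lags
```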
Linear filtering is performed in the frequency domain on the time-aligned signals $x(t')$ and $d(t)$. Supposing an echo path of $L$ taps, the signal model is reformulated as:
$$\begin{bmatrix} D_{t,f} \\ \mathbf{x}_{L,f} \end{bmatrix} = \begin{bmatrix} 1 & \mathbf{a}_{L,f}^{H} \\ \mathbf{0} & \mathbf{I} \end{bmatrix} \begin{bmatrix} S_{t,f} \\ \mathbf{x}_{L,f} \end{bmatrix} \tag{4}$$
where $\mathbf{x}_{L,f} = [X(t',f), X(t'-1,f), \ldots, X(t'-L+1,f)]^{T}$ and $\mathbf{a}_{L,f} = [A(t,f), A(t-1,f), \ldots, A(t-L+1,f)]^{T}$, with $(\cdot)^{T}$ denoting transpose and $(\cdot)^{H}$ Hermitian transpose. $\mathbf{I}$ is the identity matrix of order $L$. The near end speech can be separated by:
$$\begin{bmatrix} \hat{S}_{t,f} \\ \mathbf{x}_{L,f} \end{bmatrix} = \mathbf{B}_{f} \begin{bmatrix} D_{t,f} \\ \mathbf{x}_{L,f} \end{bmatrix} \tag{5}$$
where $\hat{(\cdot)}$ denotes the estimate of a variable and $\mathbf{B}_{f}$ is termed the unmixing matrix.

Equation (5) clearly defines a semi-blind source separation problem. Assuming independence of the near end source $S_{t,f}$ and the far end taps $\mathbf{x}_{L,f}$, the unmixing matrix has the unique form:
$$\mathbf{B}_{f} = \begin{bmatrix} 1 & \mathbf{w}_{L,f}^{H} \\ \mathbf{0} & \mathbf{I} \end{bmatrix} \tag{6}$$
which can be solved by well established source separation algorithms, such as Independent Component Analysis (ICA) and the auxiliary-function based (Aux-)ICA algorithm [16]. The Aux-ICA solution is briefly described as follows; a detailed derivation can be found in [17].

The Kullback-Leibler divergence is introduced as the independence measure:
$$J(\mathbf{B}_{f}) = \int_{S_{t,f}} \int_{\mathbf{x}_{L,f}} p(S_{t,f}, \mathbf{x}_{L,f}) \log \frac{p(S_{t,f}, \mathbf{x}_{L,f})}{q(S_{t,f}, \mathbf{x}_{L,f})} \tag{7}$$
where $p(\cdot)$ represents the source Probability Density Function (PDF) and $q(\cdot)$ the product of the approximated PDFs of the individual sources. The loss is upper bounded by the auxiliary loss function:
$$Q(\mathbf{B}_{f}, \mathbf{C}_{f}) = \sum_{i=1}^{L+1} \mathbf{b}_{i,f}^{H} \mathbf{C}_{i,f} \mathbf{b}_{i,f} + \text{const.} \tag{8}$$
where $\mathbf{b}_{i,f}^{H}$ is the $i$-th row vector of $\mathbf{B}_{f}$ and the auxiliary variable is:
$$\mathbf{C}_{i,f} = E\!\left[\frac{G'(r_{i,t,f})}{r_{i,t,f}} \mathbf{x}_{t,f} \mathbf{x}_{t,f}^{H}\right] \tag{9}$$
with $\mathbf{x}_{t,f} = [D_{t,f}, \mathbf{x}_{L,f}^{T}]^{T}$ and $r_{i,t,f}$ the $i$-th separated source. $G(r)$ is called the contrast function and relates to the source PDF by $G(r) = -\log p(r)$.

Equation (8) can be minimized in terms of $\mathbf{b}_{1,f}$ as:
$$\mathbf{b}_{1,f} = [\mathbf{B}_{f} \mathbf{C}_{1,f}]^{-1} \mathbf{i}_{1} = \mathbf{C}_{1,f}^{-1} \mathbf{i}_{1} \tag{10}$$
with $\mathbf{i}_{1} = [1, 0, \ldots, 0]^{T}$ an $L+1$ dimensional vector. Further, by applying block matrix inversion of $\mathbf{C}_{1,f}$, the unmixing filter coefficients are given by:
$$\mathbf{w}_{L,f} = -\mathbf{R}_{L,f}^{-1} \mathbf{r}_{L,f} \tag{11}$$
where
$$\mathbf{R}_{L,f} = E\!\left[\frac{G'(r)}{r} \mathbf{x}_{L,f} \mathbf{x}_{L,f}^{H}\right], \quad \mathbf{r}_{L,f} = E\!\left[\frac{G'(r)}{r} \mathbf{x}_{L,f} D_{t,f}^{*}\right]. \tag{12}$$
The separated near end speech is obtained as:
$$\hat{S}_{t,f} = D_{t,f} + \mathbf{w}_{L,f}^{H} \mathbf{x}_{L,f}. \tag{13}$$
Equation (11) stands for a weighted RLS filter, in which the correlation weighting factor is determined by the underlying near end source PDF. In the literature, a general super-Gaussian source PDF has a contrast function of the form:
$$G(D_{t,f}) = \left(\frac{|D_{t,f}|}{\eta}\right)^{\beta}, \quad 0 < \beta \le 2 \tag{14}$$
where $\eta$ is a scale parameter and shape parameter values of $\beta$ below 1 are suggested.
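Putting Equations (11)-(14) together, one per-frequency wRLS update could be sketched as follows. This is a plain numpy illustration under assumed parameter names (`lam` for the recursive smoothing constant, `eta` and `beta` for the contrast function), not the authors' production implementation.

```python
import numpy as np

def wrls_update(w, R, r, x_L, D, lam=0.99, beta=0.5, eta=1.0):
    """One per-frequency wRLS update (Eqs. (11)-(14)).

    w   : (L,) current filter coefficients
    R, r: running weighted correlation matrix (L, L) and vector (L,)
    x_L : (L,) stacked far end STFT taps for this frequency bin
    D   : scalar microphone STFT bin
    """
    # Eq. (13): current near end estimate (the separated source)
    S_hat = D + np.conj(w) @ x_L
    # Weighting G'(|r|)/|r| for the contrast G(r) = (|r| / eta) ** beta
    mag = max(float(np.abs(S_hat)), 1e-8)
    g = beta * mag ** (beta - 2.0) / eta ** beta
    # Eq. (12), estimated recursively as described in Section 4
    R = lam * R + (1.0 - lam) * g * np.outer(x_L, np.conj(x_L))
    r = lam * r + (1.0 - lam) * g * x_L * np.conj(D)
    # Eq. (11): closed-form filter coefficients
    w = -np.linalg.solve(R, r)
    return w, R, r, S_hat
```

Note how $\beta < 2$ makes the weighting decrease with the estimated near end magnitude, which down-weights double talk frames and is what makes the filter double talk friendly.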
Fig. 2. The Deep-FSMN model for residual echo suppression. (Inputs: spliced, mean-variance normalized fbank features (40×2) of the time-aligned far end and the linear filter output; a stack of FSMN(256, 256) blocks with memory blocks and skip connections; output: a 161-dimensional PSM.)

The Deep-FSMN model for residual echo suppression is illustrated in Figure 2. Logarithmic filter bank energies (fbank) of the time-aligned far end and wRLS filter output signals are used as input to the neural network. The computation flow is given by:
$$\begin{aligned} \mathbf{f}_{in} &= [\text{fbank}(\hat{S}_{t}), \text{fbank}(X_{t'})] \\ \mathbf{p}^{1} &= \text{ReLU}(\mathbf{U}^{0} \mathbf{f}_{in} + \mathbf{v}^{0}) \\ \mathbf{p}^{j+1} &= \text{FSMN}(\mathbf{p}^{j}), \quad j \in [1, 2, \ldots, J-1] \\ \mathbf{f}_{out} &= \text{Sigmoid}(\mathbf{U}^{J+1} \mathbf{p}^{J} + \mathbf{v}^{J+1}) \end{aligned} \tag{15}$$
where $\mathbf{U}^{j}$ and $\mathbf{v}^{j}$ are respectively the weight matrix and bias vector of the $j$-th layer. Each FSMN block has one hidden layer, one projection layer and one memory block. The realization is given by:
$$\begin{aligned} \mathbf{h}_{t}^{j} &= \text{ReLU}(\mathbf{U}^{j} \mathbf{p}_{t}^{j} + \mathbf{v}^{j}) \\ \bar{\mathbf{p}}_{t} &= \mathbf{U}^{j} \mathbf{h}_{t}^{j} \\ \mathbf{p}_{t}^{j+1} &= \mathbf{p}_{t}^{j} + \bar{\mathbf{p}}_{t} + \sum_{i=0}^{N} \mathbf{m}_{i}^{j} \odot \bar{\mathbf{p}}_{t-i} \end{aligned} \tag{16}$$
where $\mathbf{m}_{i}^{j}$ is a memory parameter weighting the history information $\bar{\mathbf{p}}_{t-i}$ and $\odot$ denotes element-wise multiplication. $N$ is the look-back order. Skip connections are added between the memory blocks to alleviate the gradient vanishing problem during training.

The training target is a modified version of the vanilla Phase-Sensitive Mask (PSM), clipped to the range [0, 1]:
$$\text{PSM} = \frac{|S_{t,f}|}{|\hat{S}_{t,f}|} \cdot \text{Re}\!\left(\frac{S_{t,f} \hat{S}_{t,f}^{*}}{|S_{t,f}| |\hat{S}_{t,f}|}\right). \tag{17}$$
Though complex masks as applied in the recent DNS-Challenge [18] potentially perform better, no significant gains were observed in our preliminary experiments.
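As a minimal illustration of the memory block recursion in Equation (16), the following numpy sketch runs one FSMN block over a (T, dim) sequence. The weight shapes follow Section 4 (256 hidden and projection units, N = 20), but the function and variable names are ours and the weights are random stand-ins.

```python
import numpy as np

def fsmn_block(p, U_h, v_h, U_p, m, N=20):
    """Run one FSMN block (Eq. (16)) over a (T, dim) sequence p.

    U_h, v_h: hidden layer weights (hidden, dim) and bias (hidden,)
    U_p     : projection weights (dim, hidden)
    m       : (N + 1, dim) memory weights over the current and N past frames
    """
    T, dim = p.shape
    h = np.maximum(p @ U_h.T + v_h, 0.0)   # ReLU hidden layer, (T, hidden)
    p_bar = h @ U_p.T                      # projection layer, (T, dim)
    out = np.empty_like(p)
    for t in range(T):                     # causal memory block
        mem = sum(m[i] * p_bar[t - i] for i in range(min(N, t) + 1))
        out[t] = p[t] + p_bar[t] + mem     # residual + projection + memory
    return out

# Example with random stand-in weights (the paper stacks J = 9 such blocks)
rng = np.random.default_rng(0)
x = rng.standard_normal((100, 256))
y = fsmn_block(x, 0.01 * rng.standard_normal((256, 256)), np.zeros(256),
               0.01 * rng.standard_normal((256, 256)),
               0.01 * rng.standard_normal((21, 256)))
```

The memory block only looks at current and past projection outputs, so the model stays causal and suitable for real-time inference.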
3. RELATION TO PRIOR WORK
Addressing AEC from the source separation perspective has been investigated in [19, 20], and ICA based solutions are discussed therein. Here, an Aux-ICA based solution is derived, which results in a novel weighted RLS filter.

Exploiting deep neural networks for residual echo suppression is a trending practice in the literature. Here, we consider the capability of the causal Deep-FSMN architecture jointly with the TDC and wRLS modules, from a whole-system perspective.
4. EXPERIMENTS
The AEC-Challenge dataset covers the following scenarios: far end (FE) single talk (ST), with and without echo path change; near end (NE) single talk, no echo path change; double talk (DT), with and without echo path change. Both far end and near end speech can be either clean or noisy. The evaluation is based on the P.808 Mean Opinion Score (MOS) [14] on a blind test set (https://aec-challenge.azurewebsites.net/). The top 3 results are given in Table 1.

Table 1. MOS across different test scenarios.

TeamId     ST NE MOS   ST FE Echo DMOS   DT Echo DMOS   DT Other DMOS
21         3.85        4.19              4.34           4.07
Ours       3.84        4.19              4.26           3.71
9          3.76        4.20              4.30           3.74
Baseline   3.79        3.84              3.84           3.28

The wRLS adaptive filter is computed on 20 ms frames with a hop size of 10 ms and a 320-point discrete Fourier transform. A filter length of L = 5 in Equation (4) is used, and the filter coefficients are updated as in Equation (11), with the correlation matrix R and correlation vector r estimated recursively by exponential smoothing and the source PDF shape parameter β of Equation (14) selected on the validation set (see Table 2).

The TDC part is configured to cover a relative delay of up to 500 ms, which requires a 16384-point discrete Fourier transform. To reduce the computational complexity, the estimate is updated every 250 ms by Equation (3) and the calculation of Φ_{t,f} for the different frequencies is spread evenly over this period.

For the residual echo suppression neural network, the inference process is computed as in Equation (15). The output f_out is point-wise multiplied with Ŝ_{t,f} for signal reconstruction. There are J = 9 FSMN blocks, each with 256 hidden units, 256 projection units and a look-back order of N = 20. The input feature is spliced with one frame in the past and one frame in the future, which leads to a vector dimension of 240, and is then mean and variance normalized.

There are 1.4M trainable parameters in the model. The average time needed to infer one frame is 0.61 ms (0.19 ms for TDC plus wRLS, and 0.42 ms for residual echo suppression) on a Surface Laptop with an Intel Core i5-8350U clocked at 1.9 GHz, based on an internal C++/SSE2 implementation.

For training the neural network, the first 500 clips of the official synthetic dataset are used as the validation set and the remaining 9,500 utterances are used for training. In addition, the training data is augmented as follows:
1. Randomly remix the echo and near end speech in the official synthetic dataset (19,000 utterances).
2. Select far end single talk utterances in the real dataset and randomly remix them with near end speech (28,998 utterances).
3. Use the sweep signals in the real dataset to estimate the echo paths and regenerate double talk data using utterances from the LibriSpeech corpus [21], with the Signal-to-Echo Ratio (SER) uniformly distributed in [-6, 10] dB (25,540 utterances; see the remixing sketch below).
4. Generate 24,000 random room impulse responses in simulated rooms and selectively add audio effects (clipping, band-limiting, equalization, sigmoid-like transformation) to the echo signal (24,000 utterances).

The Deep-FSMN model is optimized using the Adam optimizer with a learning rate of 0.0003 under the mean squared error loss. The model is first trained for 10 epochs on the 9,500 utterances and then fine-tuned on the augmented training set. The learning rate is decayed by a factor of 0.6 whenever the loss improvement is less than 0.001. The best model is selected based on the ITU-T recommendation P.862 Perceptual Evaluation of Speech Quality (PESQ) scores evaluated on the validation set.
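To illustrate the remixing in augmentation steps 2 and 3 above, the following sketch mixes near end speech with an echo signal at a target SER drawn from [-6, 10] dB. The helper name and the random placeholder signals are ours, not part of the released pipeline.

```python
import numpy as np

def remix_at_ser(near, echo, ser_db):
    """Scale the echo so that 10*log10(P_near / P_echo) equals ser_db,
    then mix with the near end speech."""
    p_near = np.mean(near ** 2)
    p_echo = np.mean(echo ** 2) + 1e-12
    gain = np.sqrt(p_near / (p_echo * 10.0 ** (ser_db / 10.0)))
    return near + gain * echo

# SER drawn uniformly from [-6, 10] dB, as in augmentation step 3
rng = np.random.default_rng(0)
near, echo = rng.standard_normal(16000), rng.standard_normal(16000)
mic = remix_at_ser(near, echo, rng.uniform(-6.0, 10.0))
```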
In Table 1, the baseline is a recurrent neural network that takes concatenated log power spectral features of the microphone signal and the far end signal as input, and outputs a spectral suppression mask [13]. It performs reasonably well in the ST NE scenario, but lags behind the top systems when echo exists. Informal listening indicates that our proposed algorithm sometimes over-suppresses the near end speech in double talk, which may explain the DT Other DMOS gap to the 1st system.

In Table 2, the proposed wRLS filter is compared with the linear filter in WebRTC-AEC3 [2] in terms of PESQ and Short-Time Objective Intelligibility (STOI) [22] on 500 clips of the validation set, and in terms of Echo Return Loss Enhancement (ERLE) on the ST FE clips of the test set. ERLE is defined as:
$$\text{ERLE} = 10 \log_{10} \frac{E[d^{2}(t)]}{E[\hat{s}^{2}(t)]} \tag{18}$$

Table 2. PESQ and STOI are evaluated on the synthetic validation set. ERLE is evaluated on the ST FE clips of the test set.

                 PESQ   STOI   ERLE (dB)
Orig             1.24   0.79   -
WebRTC-AEC3      1.28   0.82   6.29
wRLS (best β)    1.43   0.85   6.56
 + Deep-FSMN     2.07   0.91   49.39

The performance of the wRLS filter varies with the source PDF shape parameter; values of β from 0 up to 1.0 were compared on the validation set. The best-performing β is finally chosen, which outperforms AEC3 by 0.15 in PESQ, 0.03 in STOI and 0.27 dB in ERLE. The Deep-FSMN model greatly boosts the overall performance, achieving a PESQ score of 2.07 and nearly complete echo reduction when echo exists.
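For reference, Equation (18) can be evaluated on a far end single talk clip with a few lines of numpy; `mic` and `out` here stand for the unprocessed microphone signal and the AEC output.

```python
import numpy as np

def erle_db(mic, out):
    """Eq. (18): ERLE in dB on far end single talk, where the microphone
    signal is pure echo and out is the processed (residual) signal."""
    return 10.0 * np.log10(np.mean(mic ** 2) / (np.mean(out ** 2) + 1e-12))
```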
5. CONCLUSION
This paper presents our submission to the AEC-Challenge. The algorithm achieves satisfactory subjective scores on real recordings by systematically combining time delay compensation, a novel wRLS linear filter and a Deep-FSMN model for residual echo suppression. The wRLS filter is derived from a semi-blind source separation reformulation of the acoustic echo cancellation problem and a simplification of the Aux-ICA solution. An end-to-end neural network model that takes the raw near end microphone signal and the far end signal as input and directly outputs the near end speech is even more appealing, and will be a future direction of this work.

6. REFERENCES

[1] Jacob Benesty, Tomas Gänsler, Dennis R. Morgan, M. Mohan Sondhi, Steven L. Gay, et al., "Advances in network and acoustic echo cancellation," 2001.
[2] WebRTC, https://webrtc.googlesource.com/src.
[3] Charles Knapp and Clifford Carter, "The generalized correlation method for estimation of time delay," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 24, no. 4, pp. 320-327, 1976.
[4] Bjoern Voelcker and W. Bastiaan Kleijn, "Robust and low complexity delay estimation," in International Workshop on Acoustic Signal Enhancement. VDE, 2012, pp. 1-4.
[5] Simon S. Haykin, Adaptive Filter Theory, Pearson Education India, 2008.
[6] Chao Wu, Xiaofei Wang, Yanmeng Guo, Qiang Fu, and Yonghong Yan, "Robust uncertainty control of the simplified Kalman filter for acoustic echo cancelation," Circuits, Systems, and Signal Processing, vol. 35, no. 12, pp. 4584-4595, 2016.
[7] John J. Shynk et al., "Frequency-domain and multirate adaptive filtering," IEEE Signal Processing Magazine, vol. 9, no. 1, pp. 14-37, 1992.
[8] Hai Huang, Christian Hofmann, Walter Kellermann, Jingdong Chen, and Jacob Benesty, "A multiframe parametric Wiener filter for acoustic echo suppression," in International Workshop on Acoustic Signal Enhancement (IWAENC). IEEE, 2016, pp. 1-5.
[9] Guillaume Carbajal, Romain Serizel, Emmanuel Vincent, and Eric Humbert, "Multiple-input neural network-based residual echo suppression," in International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 231-235.
[10] Amin Fazel, Mostafa El-Khamy, and Jungwon Lee, "Deep multitask acoustic echo cancellation," in INTERSPEECH, 2019, pp. 4250-4254.
[11] Hao Zhang, Ke Tan, and DeLiang Wang, "Deep learning for joint acoustic echo and noise cancellation with nonlinear distortions," in INTERSPEECH, 2019, pp. 4255-4259.
[12] Amin Fazel, Mostafa El-Khamy, and Jungwon Lee, "CAD-AEC: Context-aware deep acoustic echo cancellation," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 6919-6923.
[13] Kusha Sridhar, Ross Cutler, Ando Saabas, Tanel Parnamaa, Hannes Gamper, Sebastian Braun, Robert Aichner, and Sriram Srinivasan, "ICASSP 2021 acoustic echo cancellation challenge: Datasets and testing framework," arXiv preprint arXiv:2009.04972, 2020.
[14] Babak Naderi and Ross Cutler, "An open source implementation of ITU-T recommendation P.808 with validation," arXiv preprint arXiv:2005.08138, 2020.
[15] Shiliang Zhang, Ming Lei, Zhijie Yan, and Lirong Dai, "Deep-FSMN for large vocabulary continuous speech recognition," in ICASSP. IEEE, 2018, pp. 5869-5873.
[16] Nobutaka Ono and Shigeki Miyabe, "Auxiliary-function-based independent component analysis for super-Gaussian sources," in International Conference on Latent Variable Analysis and Signal Separation. Springer, 2010, pp. 165-172.
[17] Ziteng Wang, Yueyue Na, Zhang Liu, Yun Li, Biao Tian, and Qiang Fu, "A semi-blind source separation approach for speech dereverberation," in INTERSPEECH, 2020.
[18] Chandan K. A. Reddy, Vishak Gopal, Ross Cutler, Ebrahim Beyrami, Roger Cheng, Harishchandra Dubey, Sergiy Matusevych, Robert Aichner, Ashkan Aazami, Sebastian Braun, et al., "The INTERSPEECH 2020 deep noise suppression challenge: Datasets, subjective testing framework, and challenge results," Proc. Interspeech 2020, pp. 2492-2496, 2020.
[19] Francesco Nesta, Ted S. Wada, and Biing-Hwang Juang, "Batch-online semi-blind source separation applied to multi-channel acoustic echo cancellation," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 3, pp. 583-599, 2010.
[20] Ryu Takeda, Kazuhiro Nakadai, Toru Takahashi, Kazunori Komatani, Tetsuya Ogata, and Hiroshi G. Okuno, "Efficient blind dereverberation and echo cancellation based on independent component analysis for actual acoustic signals," Neural Computation, vol. 24, no. 1, pp. 234-272, 2012.
[21] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," in International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 5206-5210.
[22] Tiago H. Falk, Chenxi Zheng, and Wai-Yip Chan, "A non-intrusive quality and intelligibility measure of reverberant and dereverberated speech," IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 7, pp. 1766-1774, 2010.