Low-Complexity, Real-Time Joint Neural Echo Control and Speech Enhancement Based On PercepNet
Jean-Marc Valin, Srikanth Tenneti, Karim Helwani, Umut Isik, Arvindh Krishnaswamy
Amazon Web Services, Palo Alto, CA, USA
{jmvalin, stenneti, helwk, umutisik, arvindhk}@amazon.com

ABSTRACT
Speech enhancement algorithms based on deep learning have greatly surpassed their traditional counterparts and are now being considered for the task of removing acoustic echo from hands-free communication systems. This is a challenging problem due to both real-world constraints like loudspeaker non-linearities, and to limited compute capabilities in some communication systems. In this work, we propose a system combining a traditional acoustic echo canceller, and a low-complexity joint residual echo and noise suppressor based on a hybrid signal processing/deep neural network (DSP/DNN) approach. We show that the proposed system outperforms both traditional and other neural approaches, while requiring only 5.5% of a CPU core for real-time operation. We further show that the system can scale to even lower complexity levels.
Index Terms — acoustic echo cancellation, neural residual echo suppression, speech enhancement
1. INTRODUCTION
In full-duplex communication applications, echo produced by the acoustic feedback from the loudspeaker to the microphone can severely degrade quality. Traditional acoustic echo cancellation (AEC) aims at cancelling the acoustic echo from the microphone signal by filtering the far-end (loudspeaker) signal with the estimated echo path, modeled by an adaptive FIR filter, and subtracting the resulting signal from the microphone signal [1, 2]. If the estimated echo path is equal to the true echo path, the echo is removed from the microphone signal. In real-world applications, residual echo remains at the output of the AEC due to issues such as non-linearities in the acoustic drivers, rapidly-varying acoustic environments, and microphone noise. Hence, residual echo suppressors are typically employed after the system-identification-based AEC in order to meet the requirements for high echo attenuation [3, 4, 5].

In addition, background noise also degrades the speech quality, while limiting the ability of the AEC to adapt fast enough to track acoustic path changes, further worsening the overall communication quality. Traditional speech enhancement methods [6, 7] – sometimes combined with acoustic echo suppression [8] – can help reduce the effect of stationary noise, but have been mostly unable to remove highly non-stationary noise. In recent years, deep-learning-based speech enhancement systems have emerged as state-of-the-art solutions [9, 10, 11, 12, 13]. Even more recently, deep-learning-based residual echo suppression algorithms have also demonstrated state-of-the-art performance [14, 15].

Fig. 1. Overview of the joint echo control and noise suppression system. The far-end signal f(n) is played through the loudspeaker. The microphone signal d(n) captures the reverberated near-end speech but also some noise v(n), as well as echo z(n) from the loudspeaker. The echo is partially cancelled by the adaptive filter ĥ_f to produce y(n). The RES then enhances y(n) by suppressing noise, reverberation, as well as the remaining echo, and produces the enhanced output x̂(n).

In this paper, we present an integrated approach to noise suppression and echo control (Section 2) which abides by the idea of incorporating prior knowledge from physics and psychoacoustics to design a low-complexity but effective architecture. Since the acoustic path between a loudspeaker and a microphone is well approximated as a linear FIR filter, we retain the traditional frequency-domain acoustic echo canceller (AEC) described in Section 3. We combine the adaptive filter with a perceptually-motivated joint noise and echo suppression algorithm (Section 4). As in [16], we focus on restoring the spectral envelope and the periodicity of the speech. Our model is trained (Section 5) to enhance the speech from the AEC using the far-end signal as side information to help remove the far-end signal while denoising the near-end speech. Results from our experiments and from the Acoustic Echo Cancellation Challenge [17] show that the proposed algorithm outperforms both traditional and other neural approaches to residual echo suppression, taking first place in the challenge (Section 6).
2. SIGNAL MODEL
The signal model we consider in this work is shown in Fig. 1. Let x(n) be a clean speech signal. The signal captured by a hands-free microphone in a noisy room is given by

$$ d(n) = x(n) \star h_x + v(n) + z(n), \qquad (1) $$

where v(n) is the additive noise from the room, z(n) is the echo caused by a far-end signal f(n), h_x is the impulse response from the talker to the microphone, and $\star$ denotes convolution. When ignoring non-linear effects, the echo signal can be expressed as z(n) = f(n) ⋆ h_f. Echo cancellation based on adaptive filtering consists in estimating h_f and subtracting the estimated echo ẑ(n) from the microphone signal to produce the echo-cancelled signal y(n). Unfortunately, the echo cancellation process is generally imperfect and echo remains in y(n). For this reason, we include a joint residual echo suppression (RES) and noise suppression (NS) algorithm (RES block in Fig. 1) such that the enhanced output x̂(n) is perceptually as close as possible to the ideal clean speech x(n).

Fig. 2. Overview of the PercepNet joint noise and residual echo suppressor. [Block diagram: the AEC output y(n) and the far-end speech f(n) pass through STFT, pitch analysis, and feature extraction; a DNN model drives the scale-bands and pitch-filtering stages and an envelope postfilter, and the ISTFT produces the output.]
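As a concrete illustration of this signal model, the following NumPy sketch synthesizes d(n) from Eq. (1) under a purely linear echo; the function name, argument names, and noise level are our own choices, not part of the paper.

```python
import numpy as np

def simulate_mic(x, f, h_x, h_f, noise_std=0.01, rng=None):
    """Builds the microphone signal of Eq. (1): d(n) = x(n) * h_x + v(n) + z(n),
    with the linear echo z(n) = f(n) * h_f. Illustrative only."""
    rng = np.random.default_rng() if rng is None else rng
    near = np.convolve(x, h_x)[: len(x)]          # reverberated near-end speech
    echo = np.convolve(f, h_f)[: len(x)]          # echo picked up from the loudspeaker
    v = noise_std * rng.standard_normal(len(x))   # additive room noise v(n)
    return near + echo + v                        # microphone signal d(n)
```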
3. ADAPTIVE FILTER
The adaptive filter component in Fig. 1 is derived from the SpeexDSP (https://gitlab.xiph.org/xiph/speexdsp/) implementation of the multidelay block frequency-domain (MDF) adaptive filter algorithm [18]. Robustness to double-talk is achieved through a combination of the learning rate control in [19] and a two-echo-path model as described in [20]. Moreover, a block variant of the PNLMS algorithm [21] is used to speed up adaptation. As a compromise between complexity and convergence, we use a variant of AUMDF [18] where most blocks are alternatively constrained, but the highest-energy block is constrained on each iteration.

There is sometimes an unknown delay between the signal f(n) sent to the loudspeaker and the corresponding echo appearing at the microphone. To estimate that delay D, we run a second AEC with a 400-ms filter and find the peak in the estimated filter. The delay-estimating AEC operates on a down-sampled version of the signals (8 kHz) to reduce complexity. We use the delayed far-end signal f(n − D) to perform the final echo cancellation at 16 kHz. We use a frame size of 10 ms, which matches the frame size used in the RES and avoids causing any extra delay.

The length of the adaptive filter affects not only the complexity, but also the convergence time and the steady-state accuracy of the filter. We have found that a 150-ms filter provides a good compromise, ensuring that the echo loudness is sufficiently reduced for the RES to correctly preserve double-talk. We do not make any attempt at cancelling non-linear distortion in the echo.
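The delay readout itself is simple once the delay-estimation filter has converged; a minimal sketch, assuming h_hat holds the converged 400-ms filter taps at 8 kHz (the MDF/PNLMS adaptation is not reproduced here):

```python
import numpy as np

def estimate_delay(h_hat, fs_est=8000, fs_aec=16000):
    """Reads the bulk delay D off the taps of the delay-estimation filter
    described above: the strongest tap marks the loudspeaker-to-microphone
    delay. Function and argument names are our own."""
    peak = int(np.argmax(np.abs(h_hat)))   # strongest tap index at 8 kHz
    return peak * fs_aec // fs_est         # delay D in samples at the AEC rate

# The far-end signal is then shifted by D samples (zero-padded, not wrapped)
# before the final 16 kHz cancellation, i.e. f(n - D).
```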
4. RESIDUAL ECHO SUPPRESSION
The linear AEC output y ( n ) contains the near-end speech x ( n ) , thenear-end noise v ( n ) , as well as some residual echo z ( n ) − ˆ z ( n ) .The residual echo component includes• misalignment (or divergence) of the estimated filter ˆ h f • non-linear distortion caused by the loudspeaker• late reverberation beyond the impulse response of ˆ h f Unlike the problem of noise suppression, residual echo suppressioninvolves isolating a speech signal from another speech signal. Sincethe echo can sometimes be indistinguishable from the near-endspeech, additional information is required for neural echo suppres-sion to work reliably. While there are multiple ways to provideinformation about the echo, we have found that using the far-endsignal f ( n ) is both the simplest and the most effective way. Specif-ically, since f ( n ) does not depend on the AEC behaviour, conver-gence problems with the echo canceller are less likely to affect theRES performance. Similarly, we found that using the delayed signal f ( n − D ) leads to slightly poorer results – most likely due to thefew cases where delay estimation fails.We implement joint RES and NS using the PercepNet algo-rithm [16], which is based on two main ideas:• scaling the energy of perceptually-spaced spectral bands tomatch that of the near-end speech;• using a multi-tap comb filter at the pitch frequency to removenoise between harmonics and match the periodicity of thenear-end speech.Let Y b ( (cid:96) ) be the magnitude of the AEC output signal y ( n ) in band b for frame (cid:96) and X b ( (cid:96) ) be similarly defined for the clean speech x ( n ) , the ideal gain that should be applied to that band is: g b ( (cid:96) ) = X b ( (cid:96) ) Y b ( (cid:96) ) . (2)Applying the gain g b ( (cid:96) ) to the magnitude spectrum in band b re-sults in an enhanced signal that has the same spectral envelope asthe clean speech. While this is generally sufficient for unvoiced seg-ments, voiced segment are likely have a higher roughness than theclean speech. This is due to noise between harmonics reducing theperceived periodicity/voicing of the speech. The noise is particularlyperceptible due to the fact that tones have relatively little masking ef-fect on noise [22]. In that situation, we use a non-causal comb filterto remove the noise between the pitch harmonics and make the sig-nal more periodic. The comb filter is controlled by strength/mixingparameters r b ( (cid:96) ) , where r b ( (cid:96) ) = 0 causes no filtering to occur and r b ( (cid:96) ) = 1 causes the band to be replaced by the comb-filtered ver-sion, maximizing periodicity. In cases where even r b ( (cid:96) ) = 1 is in-sufficient to make the noise inaudible, a further attenuation g (att) b ( (cid:96) ) is applied (Section 3 of [16]).Fig. 2 shows an overview of the RES algorithm. The short-timeFourier transform (STFT) spectrum is divided into 32 triangu-lar bands following the equivalent rectangular bandwidth (ERB)scale [23]. The features computed from the input and far-end speechsignals are used by a deep neural network (DNN) to estimate thegains ˆ g b ( (cid:96) ) and filtering strengths ˆ r b ( (cid:96) ) to use. The output gains ˆ g b ( (cid:96) ) are further modified by an envelope postfilter (Section 5of [16]) that reduces the perceptual impact of the remaining noise ineach band. RUCONV1x5FC128 CONV1x3 GRU GRU GRUGRU FCFC512 512 512 512 512 512128 3232
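To make the two main ideas concrete, here is a hedged NumPy sketch of the per-band gain of Eq. (2) and of a toy pitch comb filter. The band-weighting matrix layout, the gain clamp, and the tap weights are our assumptions, not taken from the paper or from [16].

```python
import numpy as np

def ideal_band_gains(X_mag, Y_mag, bands, eps=1e-9):
    """Ideal per-band gains of Eq. (2), g_b = X_b / Y_b. `bands` is assumed to
    be a (32 x n_bins) matrix of triangular ERB-scale weights; computing band
    magnitudes as weighted spectral energies is our reading of [16]."""
    X_b = np.sqrt(bands @ np.abs(X_mag) ** 2 + eps)   # clean-speech band magnitudes
    Y_b = np.sqrt(bands @ np.abs(Y_mag) ** 2 + eps)   # AEC-output band magnitudes
    return np.minimum(X_b / Y_b, 1.0)                 # clamp: attenuate only (our choice)

def comb_filter(y, period, taps=(0.1, 0.2, 0.4, 0.2, 0.1)):
    """Toy multi-tap comb filter at the pitch period: averaging samples one and
    two periods away in both directions (hence non-causal) reinforces harmonics
    and attenuates the noise between them. Tap weights are arbitrary here; the
    per-band mixing by r_b of [16] is omitted."""
    out = np.zeros_like(y, dtype=float)
    for k, w in zip(range(-2, 3), taps):
        out += w * np.roll(y, k * period)             # contribution of y(n - k*T)
    return out
```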
Fig. 3. Overview of the DNN architecture computing the 32 gains ĝ_b and 32 strengths r̂_b from the 100-dimensional input feature vector f. The number of units in each layer is indicated above the layer type.
5. DNN MODEL
The model uses two convolutional layers (a 1x5 layer followed by a 1x3 layer) and five GRU [24] layers, as shown in Fig. 3. The convolutional layers are aligned in time so as to use up to M frames into the future. In order to achieve the 40-ms algorithmic delay allowed by the challenge [17], including the 10-ms frame size and the 10-ms overlap, we have M = 2.

The input features used by the model are tied to the 32 bands we use. For each band, we use three features:
1. the energy in the band with look-ahead, Y_b(ℓ + M),
2. the pitch coherence [16] without look-ahead, q_{y,b}(ℓ) (the coherence estimation itself uses the full look-ahead), and
3. the energy of the far-end band with look-ahead, F_b(ℓ + M).

In addition to those 96 band-related features, we use four extra scalar features (for a total of 100 input features):
• the pitch period T(ℓ),
• an estimate of the pitch correlation with look-ahead,
• a non-stationarity estimate, and
• the ratio of the L1-norm to the L2-norm of the excitation computed from y(n).

For each band b, we have two outputs: the gain ĝ_b(ℓ) approximates g_b^(att)(ℓ) g_b(ℓ), and the strength r̂_b(ℓ) approximates r_b(ℓ).

The 8M weights in the model are forced to a ±1/2 range and quantized to 8-bit integers. This reduces the total memory requirement (and cache bandwidth), while also reducing the computational complexity of the inference when taking advantage of vectorization (more operations for the same register width).

In some situations, it is desirable to further reduce the complexity of the model. While it is always possible to reduce the number of units in each layer, it has recently been found that using sparse weight matrices (i.e. sparse network connections) can lead to better results [25, 26]. Since modern CPUs make heavy use of single instruction, multiple data (SIMD) hardware, it is important for the algorithm to allow vectorization. For that reason, we use structured sparsity – where whole sub-blocks of matrices are chosen to be either zero or non-zero – implemented in a similar way to [27, 28]. In this work, we use 16x4 sub-blocks. All fully-connected layers, as well as the first convolutional layer, are kept dense (no sparsity). The second convolutional layer is 50% dense, and the GRUs use different levels of sparsity for the different gates. The matrices that compute the new state have a density of 40%, whereas the update gate matrices are 20% dense and the reset gate matrices have only 10% density. This reflects the unequal usefulness of the different gates in recurrent units.

The resulting sparse model has 2.1M non-zero weights, or 25% of the size of the full model. We also consider an even lower-complexity model with the same density but layers limited to 256 units, resulting in 800k non-zero weights, or 10% of the full model size. When training sparse models, we use the sparsification schedule proposed in [26].
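The following PyTorch sketch mirrors our reading of Fig. 3: two time convolutions, five recurrent layers, and two 32-wide sigmoid heads. The exact layer widths and activations are assumptions, and the 8-bit weight quantization and 16x4 block sparsity of the real model are not reproduced.

```python
import torch
import torch.nn as nn

class ResDnnSketch(nn.Module):
    """Hypothetical stand-in for the Fig. 3 model: 100 input features per
    10-ms frame -> 32 gains g_b and 32 strengths r_b per frame."""
    def __init__(self, n_feat=100, width=512, n_bands=32, n_gru=5):
        super().__init__()
        self.conv1 = nn.Conv1d(n_feat, width, kernel_size=5)  # 1x5 conv in time
        self.conv2 = nn.Conv1d(width, width, kernel_size=3)   # 1x3 conv in time
        self.grus = nn.ModuleList(nn.GRU(width, width, batch_first=True)
                                  for _ in range(n_gru))
        self.gain = nn.Linear(width, n_bands)       # head for the gains g_b
        self.strength = nn.Linear(width, n_bands)   # head for the strengths r_b

    def forward(self, feats):            # feats: (batch, time, n_feat)
        x = feats.transpose(1, 2)        # Conv1d wants (batch, channels, time)
        x = torch.tanh(self.conv1(x))    # the two convs consume the M = 2
        x = torch.tanh(self.conv2(x))    # look-ahead frames plus past context
        x = x.transpose(1, 2)
        for gru in self.grus:
            x, _ = gru(x)                # recurrent state carries across frames
        return torch.sigmoid(self.gain(x)), torch.sigmoid(self.strength(x))
```

With 512-unit GRUs, the five recurrent layers alone account for roughly 8M weights, which is consistent with the model size quoted above.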
We train the model on synthetic mixtures of clean speech, noise and echo that attempt to recreate real-world conditions, including reverberation. We vary the signal-to-noise ratio (SNR) from -15 dB to 45 dB (with some noise-free examples included), and the echo-to-near-end ratio is between -15 dB and 35 dB. We use 120 hours of clean speech data along with 80 hours of various noise types. Most of the data is sampled at 48 kHz, but some of it – including the far-end single-talk data provided by the challenge organizers – is sampled at 16 kHz. We use both synthetic and real room impulse responses for the augmentation process.

In typical conditions, the effect of the room acoustics on the near-end speech, the echo, and the noise is similar, but not identical. This is due to the fact that while all three occur in the same room (same RT60), they can be in different locations and – especially – at different distances. For that reason, we pick only one room impulse response for each condition, but scale the early reflections (first 20 ms) with a gain varying between 0.5 and 1.5 to simulate the distance changing. Inspired by [29], the target signal includes the early reflections as well as an attenuated echo tail (with RT60 = 200 ms) so that late reverberation is attenuated to match the acoustics of a small room.

We improve the generalization of the model by using various filtering augmentation methods [30, 16]. That includes applying a low-pass filter with a random cutoff frequency, making it possible to use the same model on narrowband to fullband audio.

The loss function used for the gain attempts to match human perception as closely as possible. For this reason we use the following loss function for the gain estimation:

$$ \mathcal{L}_g = \sum_b D(g_b, \hat{g}_b) + \lambda \sum_b \left[ D(g_b, \hat{g}_b) \right]^2, \qquad (3) $$

with the distortion function

$$ D(g_b, \hat{g}_b) = \frac{\left( g_b^\gamma - \hat{g}_b^\gamma \right)^2}{\max\left( g_b^\gamma, \hat{g}_b^\gamma \right)^2 + \epsilon}, \qquad (4) $$

where γ = 0.3 is the generally agreed-upon exponent to convert acoustic power to the sone scale for perceived loudness [23]. The purpose of the denominator in (4) is to over-emphasize the loss when completely attenuating speech or when letting through small amounts of noise/echo during silence. We use λ = 10 for the second term of (3), an L4 term that over-emphasizes large errors in general. We use the same loss function as [16] for r̂_b.
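A direct transcription of Eqs. (3)-(4) into PyTorch follows; the epsilon value is our choice, and the gamma and lambda defaults follow the text above.

```python
import torch

def gain_loss(g, g_hat, gamma=0.3, lam=10.0, eps=1e-6):
    """Sketch of Eqs. (3)-(4). g and g_hat hold the ideal and estimated band
    gains in [0, 1], e.g. with shape (batch, time, 32)."""
    pg, pg_hat = g ** gamma, g_hat ** gamma                           # sone-scale loudness
    d = (pg - pg_hat) ** 2 / (torch.maximum(pg, pg_hat) ** 2 + eps)   # Eq. (4)
    return torch.sum(d + lam * d ** 2)                                # Eq. (3), summed over bands
```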
Table 1. AEC Challenge official results: P.808 MOS for near-end single-talk (ST NE), P.831 Echo DMOS for far-end single-talk (ST FE), P.831 Echo DMOS for double-talk (DT Echo), and P.831 other-degradations DMOS for double-talk (DT Other). The baseline model is provided by the challenge organizers. As a comparison, we also include the mean of the four algorithms statistically tied in second place.

Algorithm   ST NE  ST FE  DT Echo  DT Other  Mean
Baseline    3.79   3.84   3.84     3.28      3.68
2nd place   3.80   4.18   4.25     3.74      3.99
PercepNet   3.85   4.19   4.34     4.07      4.11
Fig. 4. P.808 MOS results as a function of complexity, for AEC+RES, RES-only, AEC-only, and SpeexDSP (quality in MOS vs. complexity in % of a CPU core). The 95% confidence interval is 0.05.
6. EXPERIMENTS AND RESULTS
The complexity of the proposed RES with the largest (non-sparse) model is dominated by the 800M multiply-accumulate operations per second required to compute the contribution of all 8M weights on 100 frames per second. The RES thus requires 4.6% of an x86 mobile CPU core (Intel i7-8565U) to operate in real-time. When combined with the AEC, the total complexity of the proposed 16 kHz echo control solution as submitted to the AEC challenge [17] is 5.5% CPU (0.55 ms per 10-ms frame). Since the RES is already designed to operate at 48 kHz, the total cost of fullband echo control only increases to 6.6%, with the difference due to the increased AEC sampling rate.

The AEC challenge organizers evaluated blind test samples processed with the above AEC, followed by the PercepNet-based RES. The mean opinion score (MOS) [31, 32] results were obtained using the crowdsourcing methodology described in P.808 [33]. The test set includes 1000 real recordings. Each utterance was rated by 10 listeners, leading to a 95% confidence interval of 0.01 MOS for all algorithms. The proposed algorithm significantly outperforms the ResRNN baseline, as shown in Table 1, and ranked in first place among the 17 submissions to the challenge. An interesting observation is that although the proposed algorithm performs well over all the metrics, the improvement over the other submitted algorithms is particularly noticeable for the "DT Other" metric, which measures the degradation caused to the near-end speech during double-talk conditions.
Fig. 5. Median ERLE* on the far-end single-talk cases as a function of complexity, for AEC+RES, RES-only, AEC-only, and SpeexDSP (ERLE* in dB vs. complexity in % of a CPU core).

In addition to the official challenge experiments, we conducted further experiments on the challenge blind test set. Those experiments were all conducted after the submission deadline so as not to influence the model to be submitted. We compared the quality obtained with lower-complexity versions of the proposed algorithm (Section 5.1). More specifically, the three RES model sizes were each evaluated with and without a linear AEC in front. In addition, the AEC alone (no RES) was evaluated, along with the AEC followed by the SpeexDSP conventional joint RES and NS. The MOS results from all 600 utterances that include near-end speech (i.e. excluding far-end single-talk samples) are shown in Fig. 4. They demonstrate that the PercepNet-based RES significantly outperforms the SpeexDSP conventional RES, even when used as a pure echo suppressor (except for the lowest complexity setting). Despite the good double-talk performance when operated as a residual echo suppressor, the results demonstrate the benefits of using the adaptive filter component.

The far-end single-talk samples are evaluated based on a modified echo return loss enhancement (denoted ERLE*) metric where both noise and echo are considered. Since the RES is meant to remove all energy from those samples, we simply compute the ratio of the input energy to the output energy. The results in Fig. 5 show that all PercepNet-based algorithms remove far more echo and noise than the conventional approach. Combined with Fig. 4, these results confirm that the linear AEC does not help in attenuating isolated (far-end-only) echo, but greatly contributes to preserving speech during double-talk.
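The ERLE* metric described above reduces to a simple energy ratio; a minimal NumPy sketch (function name ours):

```python
import numpy as np

def erle_star(mic, out, eps=1e-12):
    """ERLE* for far-end single-talk: the ideal output is silence, so we
    report the input-to-output energy ratio in dB, which covers both
    residual echo and noise."""
    return 10.0 * np.log10((np.sum(np.square(mic)) + eps) /
                           (np.sum(np.square(out)) + eps))
```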
7. CONCLUSION
We demonstrate an integrated algorithm for echo and noise suppression in hands-free communication systems. The proposed solution, based on the PercepNet model, incorporates perceptual aspects of human speech in a hybrid DSP/deep learning approach. Evaluation results show significant quality improvements over both traditional and other neural echo control algorithms while using only 5.5% of a CPU core. We further evaluate the impact of the model size on quality down to 1.5% CPU. We believe these results demonstrate the benefits of modeling speech using perceptually-relevant parameters in an echo control task.

8. REFERENCES

[1] S. Haykin, Adaptive Filter Theory, Prentice-Hall, Inc., 1996.
[2] H. Buchner, J. Benesty, and W. Kellermann, "Multichannel frequency-domain adaptive filtering with application to multichannel acoustic echo cancellation," in Adaptive Signal Processing, pp. 95–128, Springer, 2003.
[3] C. Faller and C. Tournery, "Robust acoustic echo control using a simple echo path model," in Proc. International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2006.
[4] A. Favrot, C. Faller, and F. Kuech, "Modeling late reverberation in acoustic echo suppression," in Proc. International Workshop on Acoustic Signal Enhancement (IWAENC), 2012, pp. 1–4.
[5] K. Helwani, H. Buchner, J. Benesty, and J. Chen, "A single-channel MVDR filter for acoustic echo suppression," IEEE Signal Processing Letters, vol. 20, no. 4, pp. 351–354, 2013.
[6] S. Boll, "Suppression of acoustic noise in speech using spectral subtraction," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 27, no. 2, pp. 113–120, 1979.
[7] Y. Ephraim and D. Malah, "Speech enhancement using a minimum mean-square error log-spectral amplitude estimator," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 33, no. 2, pp. 443–445, 1985.
[8] S. Gustafsson, R. Martin, P. Jax, and P. Vary, "A psychoacoustic approach to combined acoustic echo cancellation and noise reduction," IEEE Transactions on Speech and Audio Processing, vol. 10, no. 5, pp. 245–256, 2002.
[9] B. Xia and C. Bao, "Speech enhancement with weighted denoising auto-encoder," in Proc. INTERSPEECH, 2013, pp. 3444–3448.
[10] Y. Xu, J. Du, L.-R. Dai, and C.-H. Lee, "A regression approach to speech enhancement based on deep neural networks," IEEE Transactions on Audio, Speech and Language Processing, vol. 23, no. 1, pp. 7–19, 2014.
[11] F. Weninger, H. Erdogan, S. Watanabe, E. Vincent, J. Le Roux, J. R. Hershey, and B. Schuller, "Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR," in Proc. International Conference on Latent Variable Analysis and Signal Separation, Springer, 2015, pp. 91–99.
[12] U. Isik, R. Giri, N. Phansalkar, J.-M. Valin, K. Helwani, and A. Krishnaswamy, "PoCoNet: Better speech enhancement with frequency-positional embeddings, semi-supervised conversational data, and biased loss," 2020.
[13] C. K. A. Reddy, V. Gopal, R. Cutler, E. Beyrami, R. Cheng, H. Dubey, S. Matusevych, R. Aichner, A. Aazami, S. Braun, P. Rana, S. Srinivasan, and J. Gehrke, "The INTERSPEECH 2020 deep noise suppression challenge: Datasets, subjective testing framework, and challenge results," arXiv preprint arXiv:2005.13981, 2020.
[14] L. Ma, H. Huang, P. Zhao, and T. Su, "Acoustic echo cancellation by combining adaptive digital filter and recurrent neural network," in Proc. INTERSPEECH, 2020.
[15] H. Zhang, K. Tan, and D. Wang, "Deep learning for joint acoustic echo and noise cancellation with nonlinear distortions," in Proc. INTERSPEECH, 2019, pp. 4255–4259.
[16] J.-M. Valin, U. Isik, N. Phansalkar, R. Giri, K. Helwani, and A. Krishnaswamy, "A perceptually-motivated approach for low-complexity, real-time enhancement of fullband speech," in Proc. International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020.
[17] K. Sridhar, R. Cutler, A. Saabas, T. Parnamaa, H. Gamper, S. Braun, R. Aichner, and S. Srinivasan, "ICASSP 2021 acoustic echo cancellation challenge: Datasets and testing framework," arXiv preprint arXiv:2009.04972, 2020.
[18] J.-S. Soo and K. K. Pang, "Multidelay block frequency domain adaptive filter," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 38, no. 2, pp. 373–376, 1990.
[19] J.-M. Valin, "On adjusting the learning rate in frequency domain echo cancellation with double-talk," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 3, pp. 1030–1034, 2007.
[20] K. Ochiai, T. Araseki, and T. Ogihara, "Echo canceler with two echo path models," IEEE Transactions on Communications, vol. 25, no. 6, pp. 589–595, 1977.
[21] D. L. Duttweiler, "Proportionate normalized least-mean-squares adaptation in echo cancelers," IEEE Transactions on Speech and Audio Processing, vol. 8, no. 5, pp. 508–518, 2000.
[22] H. Gockel, B. C. J. Moore, and R. D. Patterson, "Asymmetry of masking between complex tones and noise: Partial loudness," The Journal of the Acoustical Society of America, vol. 114, no. 1, pp. 349–360, 2003.
[23] B. C. J. Moore, An Introduction to the Psychology of Hearing, Brill, 2012.
[24] K. Cho, B. Van Merriënboer, D. Bahdanau, and Y. Bengio, "On the properties of neural machine translation: Encoder-decoder approaches," in Proceedings of the Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, 2014.
[25] S. Narang, E. Elsen, G. Diamos, and S. Sengupta, "Exploring sparsity in recurrent neural networks," arXiv preprint arXiv:1704.05119, 2017.
[26] M. Zhu and S. Gupta, "To prune, or not to prune: exploring the efficacy of pruning for model compression," arXiv preprint arXiv:1710.01878, 2017.
[27] N. Kalchbrenner, E. Elsen, K. Simonyan, S. Noury, N. Casagrande, E. Lockhart, F. Stimberg, A. van den Oord, S. Dieleman, and K. Kavukcuoglu, "Efficient neural audio synthesis," arXiv preprint arXiv:1802.08435, 2018.
[28] J.-M. Valin and J. Skoglund, "LPCNet: Improving neural speech synthesis through linear prediction," in Proc. International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 5891–5895.
[29] Y. Zhao, D. Wang, B. Xu, and T. Zhang, "Late reverberation suppression using recurrent neural networks with long short-term memory," in Proc. International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 5434–5438.
[30] J.-M. Valin, "A hybrid DSP/deep learning approach to real-time full-band speech enhancement," in Proceedings of IEEE Multimedia Signal Processing (MMSP) Workshop, 2018.
[31] ITU-T, Recommendation P.800: Methods for subjective determination of transmission quality, 1996.
[32] ITU-T, Recommendation P.831: Subjective performance evaluation of network echo cancellers, 1998.
[33] ITU-T, Recommendation P.808: Subjective evaluation of speech quality with a crowdsourcing approach, 2018.