ICASSP 2021 Acoustic Echo Cancellation Challenge: Datasets, Testing Framework, and Results
Kusha Sridhar, Ross Cutler, Ando Saabas, Tanel Parnamaa, Markus Loide, Hannes Gamper, Sebastian Braun, Robert Aichner, Sriram Srinivasan
The University of Texas at Dallas, Microsoft Corp.
ABSTRACT
The ICASSP 2021 Acoustic Echo Cancellation Challenge is intended to stimulate research in the area of acoustic echo cancellation (AEC), which is an important part of speech enhancement and still a top issue in audio communication and conferencing systems. Many recent AEC studies report good performance on synthetic datasets where the train and test samples come from the same underlying distribution. However, AEC performance often degrades significantly on real recordings. Also, most conventional objective metrics, such as echo return loss enhancement (ERLE) and perceptual evaluation of speech quality (PESQ), do not correlate well with subjective speech quality tests in the presence of the background noise and reverberation found in realistic environments. In this challenge, we open source two large datasets to train AEC models under both single talk and double talk scenarios. These datasets consist of recordings from more than 2,500 real audio devices and human speakers in real environments, as well as a synthetic dataset. We open source two large test sets, and we open source an online subjective test framework for researchers to quickly test their results. The winners of this challenge will be selected based on the average Mean Opinion Score (MOS) achieved across all different single talk and double talk scenarios.
Index Terms — Acoustic Echo Cancellation, deep learning, single talk, double talk, subjective test
1. INTRODUCTION
With the growing popularity of and need for working remotely, the use of teleconferencing systems such as Microsoft Teams, Skype, WebEx, Zoom, etc., has increased significantly. It is imperative to have good quality calls to make the users' experience pleasant and productive. Degradation of call quality due to acoustic echoes is one of the major sources of poor speech quality ratings in voice and video calls. While digital signal processing (DSP) based AEC models have been used to remove these echoes during calls, their performance can degrade given devices with poor physical acoustics design or environments outside their design targets and lab-based tests. This problem becomes more challenging during full-duplex modes of communication, where echoes from double talk scenarios are difficult to suppress without significant distortion or attenuation [1]. With the advent of deep learning techniques, several supervised learning algorithms for AEC have shown better performance compared to their classical counterparts [2, 3, 4]. Some studies have also shown good performance using a combination of classical and deep learning methods, such as adaptive filters combined with recurrent neural networks (RNNs) [4, 5], but only on synthetic datasets. While these approaches provide a good heuristic on the performance of AEC models, there has been no evidence of their performance on real-world datasets with speech recorded in diverse noise and reverberant environments. This makes it difficult for researchers in industry to choose a good model that can perform well on a representative real-world dataset.

Most AEC publications use objective measures such as ERLE [6] and PESQ [7].

         PCC    SRCC
ERLE     0.31   0.23
PESQ     0.67   0.57

Table 1. Pearson (PCC) and Spearman rank (SRCC) correlation between ERLE, PESQ and P.808 Absolute Category Rating (ACR) results on single talk with delayed echo scenarios (see Section 5).

ERLE is defined as:
$\mathrm{ERLE} = 10 \log_{10} \frac{\mathbb{E}[y^2(n)]}{\mathbb{E}[\hat{y}^2(n)]}$   (1)

where y(n) is the microphone signal and ŷ(n) is the enhanced speech. ERLE is only appropriate when measured in a quiet room with no background noise, and only for single talk scenarios (not double talk). PESQ has also been shown not to have a high correlation to subjective speech quality in the presence of background noise [8]. Using the datasets provided in this challenge, we show that ERLE and PESQ have a low correlation to subjective tests (Table 1). As a result, neither ERLE nor PESQ can be used with a dataset recorded in real environments. A more reliable and robust evaluation framework is needed that everyone in the research community can use, which we provide as part of the challenge.

This AEC challenge is designed to stimulate research in the AEC domain by open sourcing a large training dataset, test set, and subjective evaluation framework. We provide two new open source datasets for training AEC models. The first is a real dataset captured using a large-scale crowdsourcing effort. This dataset consists of real recordings that have been collected from over 2,500 diverse audio devices and environments. The second is a synthetic dataset with added room impulse responses and background noise derived from [9]. An initial test set was released for researchers to use during development, and a blind test set near the end of the challenge was used to decide the final competition winners. We believe these datasets are not only the first open source datasets for AECs, but also large enough to facilitate deep learning and representative enough for practical usage in shipping telecommunication products.

The training datasets are described in Section 2 and the test set in Section 3. We describe a DNN-based AEC method in Section 4. The online subjective evaluation framework is discussed in Section 5. The challenge rules are described in Section 6. The results of the challenge are discussed in Section 7.
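For concreteness, a minimal sketch of the ERLE computation in Eq. (1) is given below. It assumes time-aligned microphone and enhanced signals as NumPy arrays; it is our illustration, not code from the challenge repository.

```python
import numpy as np

def erle_db(mic: np.ndarray, enhanced: np.ndarray) -> float:
    """ERLE per Eq. (1): ratio of microphone-signal power to
    enhanced-signal power, in dB. Only meaningful for far end
    single talk recorded in a quiet room."""
    # Small epsilon guards against division by zero on silent clips.
    return 10.0 * np.log10(np.mean(mic**2) / (np.mean(enhanced**2) + 1e-12))
```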
2. TRAINING DATASETS
The challenge includes two new open source datasets, one real and one synthetic. The datasets are available at https://github.com/microsoft/AEC-Challenge.
2.1. Real dataset

The first dataset was captured using a large-scale crowdsourcing effort. This dataset consists of more than 2,500 different real environments, audio devices, and human speakers in the following scenarios:

1. Far end single talk, no echo path change
2. Far end single talk, echo path change
3. Near end single talk, no echo path change
4. Double talk, no echo path change
5. Double talk, echo path change
6. Sweep signal for RT60 estimation

A total of 2,500 completed scenarios are provided in the dataset, with an additional 1,000 partial scenarios, for a total of 18K audio clips. For the far end single talk case, only the loudspeaker signal (far end) is played back to the users and users remain silent (no near end signal). For the near end single talk case, there is no far end signal and users are prompted to speak, capturing the near end signal. For double talk, both the far end and near end signals are active: a loudspeaker signal is played and users talk at the same time. Echo path change was incorporated by instructing the users to move their device around or to move around the device themselves. The near end single talk speech quality is given in Figure 2. The RT60 distribution for the dataset, estimated using the method by Karjalainen et al. [10], is shown in Figure 3. The RT60 estimates can be used to sample the dataset for training; a simple estimation sketch is given below.

Fig. 3. Distribution of reverberation time (RT60).
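The Karjalainen et al. estimator [10] is fairly involved; as a rough illustration of how an RT60 value can be obtained from a measured impulse response (e.g., one derived from the sweep scenario), the sketch below uses the simpler Schroeder backward-integration method instead. The function name and fit points are our own choices, not part of the challenge tooling.

```python
import numpy as np

def rt60_schroeder(rir: np.ndarray, fs: int) -> float:
    """Rough RT60 estimate via Schroeder backward integration: fit the
    energy decay curve between -5 dB and -25 dB and extrapolate the
    slope to -60 dB. Assumes the decay curve actually reaches -25 dB."""
    energy = np.asarray(rir, dtype=np.float64) ** 2
    edc = np.cumsum(energy[::-1])[::-1]            # remaining energy per sample
    edc_db = 10.0 * np.log10(edc / edc[0] + 1e-12)
    t = np.arange(len(rir)) / fs
    i5 = int(np.argmax(edc_db <= -5.0))            # first sample below -5 dB
    i25 = int(np.argmax(edc_db <= -25.0))          # first sample below -25 dB
    slope, _ = np.polyfit(t[i5:i25], edc_db[i5:i25], 1)
    return -60.0 / slope                           # seconds to decay by 60 dB
```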
We use Amazon Mechanical Turk as the crowdsourcing platform and wrote a custom HIT application, which includes a custom tool that raters download and execute to record the six scenarios described above. The dataset includes only Microsoft Windows devices. Each scenario includes the microphone and loopback signal (see Figure 1). Even though our application uses raw audio mode, the PC can still include audio DSP on the receive signal (e.g., equalization and dynamic range compression (DRC)); it can also include audio DSP on the send signal, such as AEC and noise suppression.

Fig. 1. The custom recording application recorded the loopback and microphone signals.

For clean speech far end signals, we use the speech segments from the Edinburgh dataset [11]. This corpus consists of short single speaker speech segments ( to  seconds). We used a long short term memory (LSTM) based gender detector to select an equal number of male and female speaker segments. Further, we combined  to  of these short segments to create clips of between  and  seconds in duration. Each clip consists of a single gender speaker. We create a gender-balanced far end signal source comprising  male and  female clips. Recordings are saved at the maximum sampling rate supported by the device and in 32-bit floating point format; in the released dataset we down-sample to 16 kHz and 16 bit, using automatic gain control to minimize clipping.

Fig. 2. Sorted near end single talk clip quality (P.808) with 95% confidence intervals.

For noisy speech far end signals, we use clips from the near end single talk scenario, gender balanced to include an equal number of male and female voices. For near end speech, the users were prompted to read sentences from the TIMIT [12] sentence list. Approximately 10 seconds of audio is recorded while the users are reading.

2.2. Synthetic dataset

The second dataset provides 10,000 synthetic scenarios, each including single talk, double talk, near end noise, far end noise, and various nonlinear distortion scenarios. Each scenario includes a far end speech, echo signal, near end speech, and near end microphone signal clip. We use 12,000 cases (100 hours of audio) from both the clean and noisy speech datasets derived in [9] from the LibriVox project (https://librivox.org) as source clips to sample far end and near end signals. The LibriVox project is a collection of public domain audiobooks read by volunteers. [9] used the online subjective test framework ITU-T P.808 to select audio recordings of good quality (4.3 ≤ MOS ≤ 5); noise is added from the Audioset [13], Freesound (https://freesound.org), and DEMAND [14] databases at signal to noise ratios sampled uniformly from [0, 40] dB.

To simulate a far end signal, we pick a random speaker from a pool of 1,627 speakers, randomly choose one of the clips from that speaker, and sample 10 seconds of audio from the clip. For the near end signal, we randomly choose another speaker and take 3-7 seconds of audio, which is then zero-padded to 10 seconds. Of the selected far end and near end speakers, 71% and 67% are male, respectively. To generate an echo, we convolve a randomly chosen room impulse response from a large internal database with the far end signal. The room impulse responses are generated using Project Acoustics technology, and the RT60 ranges from 200 ms to 1200 ms. In 80% of the cases, the far end signal is processed by a nonlinear function to mimic loudspeaker distortion. For example, the transformation can clip the maximum amplitude, apply a sigmoidal function as in [15], or apply learned distortion functions, the details of which we will describe in a future paper.
This signal is mixed with the near end signal at a signal to echo ratio sampled uniformly from -10 dB to 10 dB. The far end and near end signals are taken from the noisy dataset in 50% of the cases. The first 500 clips can be used for validation, as these have a separate list of speakers and room impulse responses. Detailed metadata information can be found in the repository.
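A minimal sketch of this generation procedure is shown below, assuming 10-second, 16 kHz mono NumPy arrays. The function name and the hard-clipping nonlinearity are our own simplifications (the actual pipeline also uses sigmoidal and learned distortions); this is not the repository's generation code.

```python
import numpy as np
from scipy.signal import fftconvolve

def make_scenario(far: np.ndarray, near: np.ndarray, rir: np.ndarray,
                  ser_db: float):
    """Build one synthetic near end microphone clip: distort the far end,
    convolve it with a room impulse response to get the echo, then mix in
    near end speech at the requested signal-to-echo ratio (SER)."""
    # Loudspeaker nonlinearity in 80% of cases (hard clipping here; the
    # dataset also uses sigmoidal and learned distortion functions).
    if np.random.rand() < 0.8:
        limit = 0.8 * np.max(np.abs(far))
        far = np.clip(far, -limit, limit)
    # Echo path: RIR convolution, truncated to the clip length.
    echo = fftconvolve(far, rir)[: len(far)]
    # Scale near end speech so that 10*log10(P_near / P_echo) == ser_db.
    gain = np.sqrt(10.0 ** (ser_db / 10.0) *
                   np.mean(echo**2) / (np.mean(near**2) + 1e-12))
    mic = echo + gain * near
    return mic, echo

# Example: SER drawn uniformly from [-10, 10] dB as described above.
# mic, echo = make_scenario(far, near, rir, np.random.uniform(-10.0, 10.0))
```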
3. TEST SET
Two test sets are included, one at the beginning of the challenge and a blind test set near the end. Both consist of approximately 1,000 real-world recordings and are partitioned into the following scenarios:

1. Clean, i.e. recordings with clean far end and near end speech (MOS > ...)
2. Noisy, i.e. recordings with noisy far end or near end speech
4. BASELINE AEC METHOD
We adapt a noise suppression model developed in [16] to the task of echo cancellation. Specifically, a recurrent neural network with gated recurrent units (GRUs) takes concatenated log power spectral features of the microphone signal and far end signal as input, and outputs a spectral suppression mask. The STFT is computed based on 20 ms frames with a hop size of 10 ms and a 320-point discrete Fourier transform. We use a stack of two GRU layers followed by a fully connected layer with a sigmoid activation function. The estimated mask is point-wise multiplied with the magnitude spectrogram of the microphone signal to suppress the far end signal. Finally, to resynthesize the enhanced signal, an inverse short-time Fourier transform is applied to the estimated magnitude spectrogram together with the phase of the microphone signal. We use a mean squared error loss between the clean and enhanced magnitude spectrograms. The Adam optimizer with a learning rate of 0.0003 is used to train the model.
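A minimal PyTorch sketch of this baseline architecture is given below. The hidden size is our assumption, since the paper does not state it; the mask is applied to the microphone magnitudes and the iSTFT is taken outside the model.

```python
import torch
import torch.nn as nn

class BaselineAEC(nn.Module):
    """Two stacked GRU layers plus a sigmoid fully connected layer that
    predict a suppression mask from concatenated log power spectra of the
    microphone and far end signals (161 bins each for a 320-point DFT)."""

    def __init__(self, n_bins: int = 161, hidden: int = 322):
        super().__init__()
        self.gru = nn.GRU(input_size=2 * n_bins, hidden_size=hidden,
                          num_layers=2, batch_first=True)
        self.fc = nn.Linear(hidden, n_bins)

    def forward(self, mic_logpow: torch.Tensor,
                far_logpow: torch.Tensor) -> torch.Tensor:
        # Inputs: (batch, frames, n_bins) log power spectra per signal.
        x = torch.cat([mic_logpow, far_logpow], dim=-1)
        h, _ = self.gru(x)
        return torch.sigmoid(self.fc(h))  # suppression mask in [0, 1]

# Training as described above: the mask scales the microphone magnitude
# spectrogram, and an MSE loss compares it to the clean magnitudes.
# model = BaselineAEC()
# optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
# enhanced_mag = model(mic_logpow, far_logpow) * mic_mag
# loss = torch.nn.functional.mse_loss(enhanced_mag, clean_mag)
```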
5. ONLINE SUBJECTIVE EVALUATION FRAMEWORK
We have extended the open source P.808 Toolkit [17] with methods for evaluating echo impairments in subjective tests. We followed the
Third-party Listening Test B from ITU-T Rec. P.831 [18] and ITU-T Rec. P.832 [19] and adapted them to our use case, as well as to the crowdsourcing approach, based on the ITU-T Rec. P.808 [20] guidance.

A third-party listening test differs from the typical listening-only tests (per ITU-T Rec. P.800) in that listeners hear the recordings from the center of the connection, rather than being positioned at one end of the connection [18]. Thus, the speech material should be recorded with this concept in mind. During the test session, we used different combinations of single- and multi-scale ACR ratings depending on the speech sample under evaluation. We distinguished between single talk and double talk scenarios. For near end single talk, we asked for the overall quality; for far end single talk, we used an echo annoyance scale. In the double talk scenario, we asked about echo annoyance and about impairments from other degradations in two separate questions:

Question 1: How would you judge the degradation from the echo of Person 1's voice?
Question 2: How would you judge degradations (missing audio, distortions, cut-outs) of Person 2's voice?

Both impairments were rated on the degradation category scale (from 1: Very annoying, to 5:
Imperceptible). The impairment scales lead to Degradation Mean Opinion Scores (DMOS).

The audio pipeline used in the challenge is shown in Figure 4. In the first stage (AGC1), a traditional automatic gain control is used to target a speech level of -24 dBFS. The output of AGC1 is saved in the test set. The next stage is the AEC, which participants run on the test set; the processed clips are uploaded to the challenge CMT site. The next stage is a traditional noise suppressor.

Fig. 4. The audio processing pipeline used in the challenge.
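As a rough illustration of the AGC1 stage, the sketch below applies a single static gain to bring a clip's RMS level to -24 dBFS. A production AGC adapts its gain over time and tracks speech activity, so this is only an illustrative stand-in, not the challenge's implementation.

```python
import numpy as np

def static_agc(x: np.ndarray, target_dbfs: float = -24.0) -> np.ndarray:
    """Apply one static gain so the clip's RMS level hits the target dBFS
    (full scale assumed to be 1.0)."""
    rms = np.sqrt(np.mean(x**2)) + 1e-12
    gain = 10.0 ** ((target_dbfs - 20.0 * np.log10(rms)) / 20.0)
    # Clip to full scale to avoid overflow after amplification.
    return np.clip(x * gain, -1.0, 1.0)
```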
6. AEC CHALLENGE RULES AND SCHEDULE

6.1. Rules
This challenge benchmarks the performance of real-time algorithms on a real (not simulated) test set. Participants evaluate their AEC on the test set and submit the results (audio clips) for evaluation. The requirements for each AEC used for submission are:

• The AEC must take less than the stride time Ts (in ms) to process a frame of size T (in ms) on an Intel Core i5 quad-core machine clocked at 2.4 GHz or an equivalent processor. For example, Ts = T/2 for 50% overlap between frames. The total algorithmic latency allowed, including the frame size T, the stride time Ts, and any look ahead, must be ≤ 40 ms. If T1 = T + Ts is less than 40 ms, then you can use up to (40 − T1) ms of future information.

• The AEC can be a deep model, a traditional signal processing algorithm, or a mix of the two. There are no restrictions on the AEC aside from the run time and algorithmic latency described above.

• Submissions must follow the instructions at http://aec-challenge.azurewebsites.net.

• Winners will be picked based on the subjective echo MOS evaluated on the blind test set using the ITU-T P.808 framework described in Section 5.

• The blind test set will be made available to the participants on October 2, 2020. Participants must send the results (audio clips) achieved by their developed models to the organizers. We will use the submitted clips to conduct the ITU-T P.808 subjective evaluation and pick the winners based on the results. Participants are forbidden from using the blind test set to retrain or tune their models. They must not submit results from other AEC methods that they are not submitting to ICASSP 2021. Failing to adhere to these rules will lead to disqualification from the challenge.

• Participants should report the computational complexity of their model in terms of the number of parameters and the time it takes to infer a frame on a particular CPU (preferably an Intel Core i5 quad-core machine clocked at 2.4 GHz); a minimal timing check is sketched after this list. Among submitted proposals differing by less than 0.1 MOS, the lower complexity model will be given a higher ranking.

• Each participating team must submit an ICASSP paper that summarizes the research efforts and provides all the details needed to ensure reproducibility. Authors may choose to report additional objective/subjective metrics in their paper.

• Submitted papers will undergo the standard peer-review process of ICASSP 2021. The paper must be accepted to the conference for the participants to be eligible for the challenge.
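A minimal sketch of how a participant might check the per-frame runtime requirement on their own machine is given below; `process_frame` and `frames` are hypothetical placeholders for a model's per-frame inference call and its inputs, not part of the challenge tooling.

```python
import time

def meets_realtime(process_frame, frames, stride_ms: float = 10.0) -> bool:
    """Return True if the mean per-frame processing time stays below the
    stride time Ts, as the rules require."""
    start = time.perf_counter()
    for frame in frames:
        process_frame(frame)
    elapsed_ms = 1000.0 * (time.perf_counter() - start) / max(len(frames), 1)
    return elapsed_ms < stride_ms
```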
6.2. Schedule

• September 8, 2020: Release of the datasets.
• October 2, 2020: Blind test set released to participants.
• October 9, 2020: Deadline for participants to submit their results for objective and P.808 subjective evaluation on the blind test set.
• October 16, 2020: Organizers notify the participants about the results.
• October 19, 2020: Regular paper submission deadline for ICASSP 2021.
• January 22, 2021: Paper acceptance/rejection notification.
• January 25, 2021: Notification of the winners with winner instructions, including a prize claim deadline.
Fig. 5. Final results of the challenge.
Participants may email the organizers at aec [email protected] with any questions related to the challenge or for clarification about any aspect of the challenge.
7. RESULTS
We received 17 submissions to the challenge. Each team submitted processed files from the blind test set, with 500 noisy and 500 clean recordings (see Section 3). We batched all submissions into three sets:

• Near end single talk files for the MOS test (NE ST MOS).
• Far end single talk files for the Echo DMOS test (ST FE Echo DMOS).
• Double talk files for the Echo and Other degradation DMOS tests (DT Echo/Other DMOS).

To obtain the final overall rating, we averaged the results from the four questionnaires, weighting them equally. The final standings are shown in Figure 5. The resulting scores show a wide variety in model performance. The score differences in the near end, echo, and double talk scenarios for individual models highlight the importance of evaluating all scenarios, since in many cases performance in one scenario comes at a cost in another. The Pearson correlations between the four tests are given in Figure 7 (omitting the last place outlier, which significantly skews the result).

For the top five teams, we ran an ANOVA test to determine statistical significance (Figure 6). While first place stands out as the clear winner, the differences between places 2-5 were not statistically significant, and per the challenge rules, places 2 and 3 were picked based on the computational complexity of the models.
Fig. 6. P-values of the ANOVA test for the top 5 teams.

Fig. 7. Pearson correlation coefficients between the different tests.
Fig. 8. MOS histograms of the top 3 models and the baseline.

Some models, including the winning entry, perform speech enhancement (noise suppression) in addition to echo cancellation. http://aec-challenge.azurewebsites.net/ includes the results for the clean and noisy subsets of the data. The tables highlight that models that do speech enhancement (noise suppression) have a small overall advantage in the tests. For example, the baseline model, which does not do noise suppression, has a delta of -0.16 on noisy NE ST when compared to the winning entry, but has similar performance on the clean NE ST data. In general, though, rankings do not differ significantly between the two sets. Histograms of the MOS and DMOS values of the top 3 submissions and the baseline are given in Figure 8.
8. CONCLUSIONS
The results of this challenge show that deep learning models or hybrid models can significantly outperform traditional DSP models, even given the low latency and low complexity requirements of the challenge. This is encouraging, as it is feasible that these new classes of AECs can be integrated into products and improve the experience for billions of users of audio telephony. It is our hope that the dataset, test set, and test framework created for the challenge will accelerate research in this area, as there is still improvement to be made. A future area of research is to improve how the overall score is computed from the individual subjective scores, beyond the unweighted mean used in Figure 5.
9. ACKNOWLEDGEMENTS
The double talk survey implementation was written by Babak Naderi.
10. REFERENCES

[1] "IEEE 1329-2010 Standard method for measuring transmission performance of handsfree telephone sets," 2010.
[2] A. Fazel, M. El-Khamy, and J. Lee, "CAD-AEC: Context-aware deep acoustic echo cancellation," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 6919-6923.
[3] M. M. Halimeh and W. Kellermann, "Efficient multichannel nonlinear acoustic echo cancellation based on a cooperative strategy," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 461-465.
[4] Lu Ma, Hua Huang, Pei Zhao, and Tengrong Su, "Acoustic echo cancellation by combining adaptive digital filter and recurrent neural network," arXiv preprint arXiv:2005.09237, 2020.
[5] Hao Zhang, Ke Tan, and DeLiang Wang, "Deep learning for joint acoustic echo and noise cancellation with nonlinear distortions," in INTERSPEECH, 2019, pp. 4255-4259.
[6] "ITU-T recommendation G.168: Digital network echo cancellers," Feb 2012.
[7] "ITU-T recommendation P.862: Perceptual evaluation of speech quality (PESQ): An objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs," Feb 2001.
[8] A. R. Avila, H. Gamper, C. Reddy, R. Cutler, I. Tashev, and J. Gehrke, "Non-intrusive speech quality assessment using neural networks," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 631-635.
[9] Chandan K. A. Reddy, Vishak Gopal, Ross Cutler, Ebrahim Beyrami, Roger Cheng, Harishchandra Dubey, Sergiy Matusevych, Robert Aichner, Ashkan Aazami, Sebastian Braun, et al., "The INTERSPEECH 2020 deep noise suppression challenge: Datasets, subjective testing framework, and challenge results," arXiv preprint arXiv:2005.13981, 2020.
[10] Matti Karjalainen, Poju Antsalo, Aki Mäkivirta, Timo Peltonen, and Vesa Välimäki, "Estimation of modal decay parameters from noisy response measurements," J. Audio Eng. Soc., vol. 50, no. 11, pp. 867, 2002.
[11] Cassia Valentini-Botinhao, Xin Wang, Shinji Takaki, and Junichi Yamagishi, "Speech enhancement for a noise-robust text-to-speech synthesis system using deep recurrent neural networks," in Interspeech, 2016, pp. 352-356.
[12] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett, and N. L. Dahlgren, "DARPA TIMIT acoustic phonetic continuous speech corpus CDROM," 1993.
[13] Jort F. Gemmeke, Daniel P. W. Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R. Channing Moore, Manoj Plakal, and Marvin Ritter, "Audio Set: An ontology and human-labeled dataset for audio events," in ICASSP. IEEE, 2017, pp. 776-780.
[14] Joachim Thiemann, Nobutaka Ito, and Emmanuel Vincent, "The diverse environments multi-channel acoustic noise database: A database of multichannel environmental noise recordings," The Journal of the Acoustical Society of America, vol. 133, no. 5, pp. 3591-3591, 2013.
[15] Chul Min Lee, Jong Won Shin, and Nam Soo Kim, "DNN-based residual echo suppression," in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
[16] Yangyang Xia, Sebastian Braun, Chandan K. A. Reddy, Harishchandra Dubey, Ross Cutler, and Ivan Tashev, "Weighted speech distortion losses for neural-network-based real-time speech enhancement," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 871-875.
[17] Babak Naderi and Ross Cutler, "An open source implementation of ITU-T recommendation P.808 with validation," arXiv preprint arXiv:2005.08138, 2020.
[18] "ITU-T recommendation P.831: Subjective performance evaluation of network echo cancellers," 1998.
[19] "ITU-T recommendation P.832: Subjective performance evaluation of hands-free terminals," 2000.
[20] "ITU-T recommendation P.808: Subjective evaluation of speech quality with a crowdsourcing approach," 2018.