INTERSPEECH 2021 DEEP NOISE SUPPRESSION CHALLENGE
Chandan K A Reddy, Harishchandra Dubey, Kazuhito Koishida, Arun Nair, Vishak Gopal, Ross Cutler, Sebastian Braun, Hannes Gamper, Robert Aichner, Sriram Srinivasan
Microsoft Corporation, Redmond, USA
ABSTRACT
The Deep Noise Suppression (DNS) challenge is designed to foster innovation in the area of noise suppression to achieve superior perceptual speech quality. We recently organized a DNS challenge special session at INTERSPEECH and ICASSP 2020. We open-sourced training and test datasets for the wideband scenario. We also open-sourced a subjective evaluation framework based on ITU-T standard P.808, which was also used to evaluate participants of the challenge. Many researchers from academia and industry made significant contributions to push the field forward, yet even the best noise suppressor was far from achieving superior speech quality in challenging scenarios. In this version of the challenge organized at INTERSPEECH 2021, we are expanding both our training and test datasets to accommodate full band scenarios. The two tracks in this challenge will focus on real-time denoising for (i) wide band, and (ii) full band scenarios. We are also making available a reliable non-intrusive objective speech quality metric for wide band called DNSMOS for the participants to use during their development phase.
Index Terms — Speech Enhancement, Perceptual Speech Quality, P.808, Deep Noise Suppressor, Machine Learning.
1. INTRODUCTION
With the explosion in the number of people working remotely due to the pandemic, there has been a surge in the demand for reliable collaboration and real-time communication tools. Excellent speech quality in audio calls is a necessity during these times as we try to stay connected and collaborate with people every day. We are easily exposed to a variety of background noises such as a leaf blower, a washing machine, a dog barking, a baby crying, kitchen noises, etc. Background noise significantly degrades the quality and intelligibility of the perceived speech, leading to fatigue. Background noise poses a challenge in other applications such as hearing aids and smart devices as well.

Real-time Speech Enhancement (SE) for perceptual quality is a decades-old classical problem, and researchers have proposed numerous solutions [1, 2]. In recent years, learning-based approaches have shown promising results [3, 4, 5]. The Deep Noise Suppression (DNS) Challenges organized at INTERSPEECH 2020 [6] and ICASSP 2021 [7] showed great progress, while also indicating that we are still about 1.6 Differential Mean Opinion Score (DMOS) away from the ideal Mean Opinion Score (MOS) of 5 when tested on the challenge test set, which was reasonably representative of realistic scenarios. The DNS Challenge is the first contest we are aware of that uses subjective evaluation to benchmark SE methods on a realistic noisy test set [6].

We open sourced a large dataset for the INTERSPEECH 2020 and ICASSP 2021 DNS challenges (https://github.com/microsoft/DNS-Challenge). For ease of reference, we will call the INTERSPEECH 2021 challenge DNS Challenge 3, the ICASSP 2021 challenge DNS Challenge 2, and the INTERSPEECH 2020 challenge DNS Challenge 1. DNS Challenge 3 will be focused on real-time denoising, similar to track 1 of DNS Challenges 1 and 2. We will have two tracks in DNS Challenge 3, for the wide band (sampling rate = 16000 Hz) and full band (sampling rate = 48000 Hz) scenarios. The datasets include over 760 hours of clean speech, including singing voice, emotion data, and non-English languages. Noise data in the training set remains the same as in DNS Challenge 2. Both clean speech and noise are made available for both wide and full band scenarios. We provide over 118,000 room impulse responses (RIRs), which include real and synthetic RIRs from public datasets, for wide band. We provide two acoustic parameters, reverberation time (T60) and clarity (C50), for each read clean speech and RIR sample. The test set includes a variety of noisy speech utterances in English and non-English languages. We also include emotional speech and singing in the presence of background noise.

Unlike ITU-T P.808, which was used for DNS Challenges 1 and 2, we will use the implementation of ITU-T P.835 [8] (https://github.com/microsoft/P.808) for DNS Challenge 3. In addition to the overall speech quality as in P.808, P.835 provides standalone quality scores for speech and noise. The standalone ratings will better guide us to focus on the areas that require improvement to achieve better overall quality. Many noise suppressors are very good at suppressing the background noise but do not improve the quality of speech, which becomes the bottleneck for improving the overall quality. We will also provide a non-intrusive objective speech quality metric for the wide band scenario, called DNSMOS, as an Azure service. We show that DNSMOS is more reliable than other widely used objective metrics such as PESQ, SDR, and POLQA [9]. Also, it does not require reference clean speech and hence can work on real recordings. Participants can request the DNSMOS API by following the instructions on the challenge GitHub page (https://github.com/microsoft/DNS-Challenge/tree/master/DNSMOS).
2. CHALLENGE TRACKS
The challenge will have the following two tracks:

1. Track 1: Real-Time Denoising track for the wide band scenario

• The noise suppressor must take less than the stride time Ts (in ms) to process a frame of size T (in ms) on an Intel Core i5 quad-core machine clocked at 2.4 GHz or an equivalent processor. For example, Ts = T/2 for 50% overlap between frames. The total algorithmic latency allowed, including the frame size T, the stride time Ts, and any look ahead, must be ≤ 40 ms. For example, if you use a frame size of 20 ms with a stride of 10 ms, resulting in an algorithmic latency of 30 ms, then you satisfy the latency requirements. If you use a frame size of 32 ms with a stride of 16 ms, resulting in an algorithmic latency of 48 ms, then your method does not satisfy the latency requirements, as the total algorithmic latency exceeds 40 ms. If your frame size plus stride, T1 = T + Ts, is less than 40 ms, then you can use up to (40 − T1) ms of future information.

2. Track 2: Real-Time Denoising track for the full band scenario

• Satisfy the Track 1 requirements.
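To make the latency rule concrete, the following is a minimal sketch (our illustration, not an official challenge script; the function name and example values simply restate the rule above) that checks whether a frame/stride/look-ahead configuration fits the 40 ms budget:

def satisfies_latency_budget(frame_ms: float, stride_ms: float,
                             lookahead_ms: float = 0.0,
                             budget_ms: float = 40.0) -> bool:
    """Track 1 rule: frame size + stride + any look-ahead must be <= 40 ms."""
    return frame_ms + stride_ms + lookahead_ms <= budget_ms

# Examples taken from the rule above:
print(satisfies_latency_budget(20, 10))      # True: 30 ms total latency
print(satisfies_latency_budget(32, 16))      # False: 48 ms exceeds the budget
print(satisfies_latency_budget(20, 10, 10))  # True: the remaining 10 ms may be used as look-ahead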
3. TRAINING DATASETS
The goal of releasing the clean speech and noise datasets is to provide researchers with an extensive and representative dataset to train their SE models. We initially released MS-SNSD [10] with a focus on extensibility, but the dataset lacked diversity in speakers and noise types. We published a significantly larger and more diverse dataset with configurable scripts for DNS Challenge 1 [6]. Many researchers found this dataset useful to train their noise suppression models and achieved good results. However, the training and the test datasets did not include clean speech with emotions such as crying, yelling, laughter, or singing. Also, the dataset only included the English language. For DNS Challenge 2, we added speech clips with other emotions and included about 10 non-English languages. Clean speech in the training set totals 760.53 hours: read speech (562.72 hours), singing voice (8.80 hours), emotion data (3.6 hours), and Chinese Mandarin data (185.41 hours). We have grown clean speech to 760.53 hours, compared to 562.72 hours in DNS Challenge 1. The details of the clean speech and noise datasets are described in the following sections.
3.1. Clean Speech

Clean speech consists of four subsets: (i) read speech recorded in clean conditions; (ii) singing clean speech; (iii) emotional clean speech; and (iv) non-English clean speech. The first subset is derived from the public audiobooks dataset called Librivox (https://librivox.org/), available under the permissive Creative Commons 4.0 license [11]. It has recordings of volunteers reading over 10,000 public domain audiobooks in various languages, the majority of which are in English. In total, there are 11,350 speakers. Many of these recordings are of excellent speech quality, meaning that the speech was recorded using good quality microphones in silent and less reverberant environments. But there are also many recordings of poor speech quality, with speech distortion, background noise, and reverberation. Hence, it is important to clean the dataset based on speech quality. We used the online subjective test framework ITU-T P.808 [12] to sort the book chapters by subjective quality. The audio chapters in Librivox are of variable length, ranging from a few seconds to several minutes. We randomly sampled 10 audio segments from each book chapter, each 10 seconds in duration. For each clip, we had 2 ratings, and the MOS across all clips was used as the book chapter MOS. Figure 1 shows the results: the quality spanned from very poor to excellent. In total, there are 562 hours of clean speech, which was part of DNS Challenge 1.

Fig. 1. Sorted near-end single-talk clip quality (P.808) with 95% confidence intervals.

The second subset consists of high-quality audio recordings of singing voice recorded in noise-free conditions by professional singers. This subset is derived from the VocalSet corpus [13], released under the Creative Commons Attribution 4.0 International (CC BY 4.0) license. It has 10.1 hours of clean singing voice recorded by 20 professional singers: 9 male and 11 female. This data was recorded on a range of vowels, a diverse set of voices on several standard and extended vocal techniques, and sung in contexts of scales, arpeggios, long tones, and excerpts. For wideband, we downsampled the mono .WAV files from 44.1 kHz to 16 kHz and added them to the clean speech used by the training data synthesizer.
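As an illustration of the downsampling step described above, here is a minimal sketch (our own, not the challenge synthesizer; librosa and soundfile are our choice of libraries, and the file paths are hypothetical):

import librosa
import soundfile as sf

def downsample_to_16k(in_path, out_path):
    # Load the mono .WAV at its native rate (e.g., 44.1 kHz for VocalSet).
    audio, sr = librosa.load(in_path, sr=None, mono=True)
    # Resample to the 16 kHz wideband rate and write the result.
    audio_16k = librosa.resample(audio, orig_sr=sr, target_sr=16000)
    sf.write(out_path, audio_16k, 16000)

# Hypothetical usage:
# downsample_to_16k("vocalset/clip.wav", "clean_speech/clip_16k.wav")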
The third subset consists of emotional speech recorded in noise-free conditions. It is derived from the Crowd-sourced Emotional Multimodal Actors Dataset (CREMA-D) [14], made available under the Open Database License. It consists of 7,442 audio clips from 91 actors: 48 male and 43 female, accounting for a total of 3.5 hours of audio. The age of the actors was in the range of 20 to 74 years, with diverse ethnic backgrounds including African American, Asian, Caucasian, Hispanic, and Unspecified. Actors read from a pool of 12 sentences to generate this emotional speech dataset. It covers six emotions (Anger, Disgust, Fear, Happy, Neutral, and Sad) at four intensity levels (Low, Medium, High, and Unspecified). The recorded audio clips were annotated by multiple human raters in three modalities: audio, visual, and audio-visual. Categorical emotion labels and real-valued emotion levels of perceived emotion were collected through crowd-sourcing from 2,443 raters. This data was provided as 16 kHz .WAV files, so we added it to our wideband clean speech as is.

The fourth subset has clean speech from non-English languages. It consists of both tonal and non-tonal languages, including Chinese (Mandarin), German, and Spanish. The Mandarin data consists of the OpenSLR18 THCHS-30 [15] and OpenSLR33 AISHELL [16] datasets, both with Apache 2.0 license. THCHS-30 was published by the Center for Speech and Language Technology (CSLT) at Tsinghua University for speech recognition. It consists of 30+ hours of clean speech recorded at 16-bit, 16 kHz in noise-free conditions. Native speakers of standard Mandarin read text prompts chosen from a list of 1000 sentences. We added the entire THCHS-30 data to the clean speech in our training set. It consists of 40 speakers (9 male, 31 female) in the age range of 19 to 55 years, with a total of 13,389 clean speech audio files [15]. The AISHELL dataset was created by Beijing Shell Shell Technology Co., Ltd. It has clean speech recorded by 400 native speakers (47% male and 53% female) of Mandarin with different accents. The audio was recorded in noise-free conditions using high fidelity microphones and is provided as 16-bit, 16 kHz .wav files. It is one of the largest open-source Mandarin speech datasets. We added the entire AISHELL corpus, with 141,600 utterances spanning 170+ hours of clean Mandarin speech, to our training set. The Spanish data is 46 hours of clean speech derived from OpenSLR39, OpenSLR61, OpenSLR71, OpenSLR73, OpenSLR74, and OpenSLR75, where we re-sampled all .WAV files from 48 kHz to 16 kHz to use them in the wideband setting. The German data is derived from four corpora, namely (i) The Spoken Wikipedia Corpora [17], (ii) the Telecooperation German Corpus for Kinect [18], (iii) the M-AILABS data [19], and (iv) the zamia-speech Forschergeist corpus. The complete German data constitutes 636 hours. Italian (128 hours), French (190 hours), and Russian (47 hours) are taken from the M-AILABS data [19]. The M-AILABS Speech Dataset is a publicly available multi-lingual corpus for training speech recognition and speech synthesis systems.

3.2. Noise

The noise clips were selected from Audioset [20] (https://research.google.com/audioset/) and Freesound (https://freesound.org/). Audioset is a collection of about 2 million human-labeled 10-second sound clips drawn from YouTube videos, belonging to about 600 audio events. Like the Librivox data, certain audio event classes are over-represented. For example, there are over a million clips with the audio classes music and speech, and fewer than 200 clips for classes such as toothbrush, creak, etc. Approximately 42% of the clips have a single class, but the rest may have 2 to 15 labels. Hence, we developed a sampling approach to balance the dataset in such a way that each class has at least 500 clips. We also used a speech activity detector to remove the clips with any kind of speech activity, to strictly separate speech and noise data. The resulting dataset has about 150 audio classes and 60,000 clips. We also augmented an additional 10,000 noise clips downloaded from the Freesound and DEMAND databases [21]. The chosen noise types are more relevant to VoIP applications. In total, there are 181 hours of noise data. The noise files were originally full band and were resampled for the wideband use case.

3.3. Room Impulse Responses

We provide 3,076 real and approximately 115,000 synthetic room impulse responses (RIRs); either one or both types of RIRs can be chosen for convolving with clean speech. Noise is then added to the reverberant clean speech, while DNS models are expected to take noisy reverberant speech and produce clean reverberant speech. Challenge participants can do both dereverberation and denoising with their models if they prefer. These RIRs are chosen from the openSLR26 [22] and openSLR28 [22] datasets, both released with the Apache 2.0 License.
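The following is a simplified sketch of the synthesis recipe just described: convolve clean speech with an RIR, then mix in noise at a target SNR. It is our illustration under stated assumptions (single-channel signals, a noise clip at least as long as the speech); the official configurable scripts are on the challenge GitHub page:

import numpy as np
from scipy.signal import fftconvolve

def synthesize_noisy_reverberant(clean, rir, noise, snr_db):
    """Return (noisy, target): the noisy reverberant input and the
    reverberant clean speech the model is expected to produce."""
    reverberant = fftconvolve(clean, rir)[:len(clean)]
    noise = noise[:len(reverberant)]  # assume the noise clip is long enough
    speech_power = np.mean(reverberant ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale noise so that 10*log10(speech_power / scaled_noise_power) = snr_db.
    gain = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    noisy = reverberant + gain * noise
    return noisy, reverberant

Since the target here is the reverberant clean speech, only denoising is strictly required, but as noted above, participants may also dereverberate.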
3.4. Acoustic Parameters

We provide two acoustic parameters, (i) reverberation time T60 [23] and (ii) clarity C50 [24], for all audio clips in the clean speech of the training set. We provide T60, C50, and an isReal Boolean flag for all RIRs, where isReal is 1 for real RIRs and 0 for synthetic ones. The two parameters are correlated: an RIR with low C50 can be described as highly reverberant, and vice versa [23, 24]. These parameters are intended to give researchers the flexibility to choose a subset of the provided data for controlled studies.
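For reference, C50 can be computed directly from an impulse response as the ratio of early (first 50 ms) to late energy. Here is a minimal sketch (ours, assuming a single-channel RIR whose direct path arrives at sample 0):

import numpy as np

def clarity_c50(rir, sample_rate):
    """Clarity index C50 in dB: early (first 50 ms) over late energy."""
    split = int(0.050 * sample_rate)         # 50 ms boundary in samples
    early = np.sum(rir[:split] ** 2)
    late = np.sum(rir[split:] ** 2) + 1e-12  # guard against division by zero
    return 10.0 * np.log10(early / late)

Consistent with the text above, a highly reverberant RIR puts more energy in the late part and yields a low C50, while a dry RIR yields a high C50.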
4. TEST SET
For DNS Challenge 3, the test set includes utterances in English and non-English languages recorded in the presence of a variety of background noises at different SNRs and target levels. The utterances were collected at a distance of 1 to 5 meters from the microphone when the speakers were not using a headphone; these clips are more reverberant. The development test set also includes utterances with emotions such as laughter, crying, yelling, and surprise in the presence of background noise. This is to ensure that human emotions are not suppressed by the noise suppressors. A small segment of the clips includes speech in the presence of musical instruments such as guitar, piano, and violin playing in the background. We will provide the T60 and C50 estimates for all the clips, which can help researchers tune their models to perform well in highly reverberant conditions.
5. CHALLENGE RULES AND SCHEDULE

5.1. Rules

• The participants must adhere to the requirements specified in Section 2 for each track.
• Participants may participate in only one track or in both tracks.
• Participants must report the number of parameters and the number of operations per second, where the number of operations per second = number of operations per frame / frame shift in seconds (see the sketch after the schedule below).
• Participants can use any data of their choice to train their models and are not limited to the challenge datasets.
• Participants can use a signal processing based model, a learning-based deep model, or a combination of both. There are no restrictions on the algorithm used.
• Submissions must follow the instructions on https://dns-challenge.azurewebsites.net/. Use Shift+F5 (Windows) or Cmd+R (Mac) to get the latest updates on that site.
• Winners will be picked based on the subjective evaluation using ITU-T P.835 overall scores.
• Participants must send the results (audio clips) achieved by their developed models to the organizers.
• For track 2, the participants must send the audio clips enhanced with and without using speaker information and must show that quality is better when using speaker information.
• We will use the submitted clips with no alteration to conduct the ITU-T P.835 subjective evaluation and pick the winners based on the results. Participants are forbidden from using the blind test set to retrain or tune their models.
• Participants must submit results only if they intend to submit a paper to INTERSPEECH 2021.
• Participants should report the computational complexity of their model in terms of the number of parameters and the time it takes to infer a frame on a particular CPU (preferably an Intel Core i5 quad-core machine clocked at 2.4 GHz).
• Among the submitted proposals, if the difference between the proposals is not statistically significant, the submission with the least number of operations per second will be ranked higher.
• Each participating team must submit a paper to INTERSPEECH 2021 that describes the research efforts and provides all the details to ensure reproducibility of the work. Authors may choose to report additional objective/subjective metrics in their paper.
• Submitted papers will undergo the standard peer-review process of INTERSPEECH 2021. The paper needs to be accepted to the conference for the participants to be eligible for the challenge.

5.2. Schedule

• January 8, 2021: Release of the datasets and scripts for training and testing.
• March 8, 2021: Blind test set released to participants.
• March 15, 2021: Deadline for participants to submit their results for P.835 subjective evaluation on the blind test set.
• March 22, 2021: Organizers will notify the participants about the results.
• March 26, 2021: Regular paper submission deadline for INTERSPEECH 2021.
• June 2, 2021: Paper acceptance/rejection notification.
• June 4, 2021: Notification of the winners.
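As a worked example of the complexity-reporting rule above (the model numbers are hypothetical):

def ops_per_second(ops_per_frame, frame_shift_s):
    """Rule from Section 5.1: ops/s = ops per frame / frame shift in seconds."""
    return ops_per_frame / frame_shift_s

# Hypothetical model: 50 million operations per frame at a 10 ms frame shift
# gives 5 billion operations per second.
print(ops_per_second(50e6, 0.010))  # 5000000000.0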
Participants may email the organizers at dns [email protected] with any questions related to the challenge or for clarification about any aspect of the challenge.
6. SUMMARY & CONCLUSIONS
The INTERSPEECH 2021 DNS Challenge was designed to advance the field of real-time noise suppression optimized for human perception in challenging noisy conditions. Large, inclusive, and diverse training and test datasets with supporting scripts were open sourced. Many participants from both industry and academia found the datasets very useful and submitted their enhanced clips for final evaluation. Only two teams participated in the personalized DNS track, which also shows that the field is in its nascent phase.
7. REFERENCES

[1] Y. Ephraim and D. Malah, "Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator," IEEE TASLP, 1984.
[2] C. Karadagur Ananda Reddy, N. Shankar, G. Shreedhar Bhat, R. Charan, and I. Panahi, "An individualized super-Gaussian single microphone speech enhancement for hearing aid users with smartphone as an assistive device," IEEE Signal Processing Letters, vol. 24, no. 11, pp. 1601–1605, 2017.
[3] S. Fu, Y. Tsao, X. Lu, and H. Kawai, "Raw waveform-based speech enhancement by fully convolutional networks," in APSIPA ASC, 2017.
[4] Hyeong-Seok Choi, Hoon Heo, Jie Hwan Lee, and Kyogu Lee, "Phase-aware single-stage speech denoising and dereverberation with U-Net," arXiv preprint arXiv:2006.00687, 2020.
[5] Yuichiro Koyama, Tyler Vuong, Stefan Uhlich, and Bhiksha Raj, "Exploring the best loss function for DNN-based low-latency speech enhancement with temporal convolutional networks," arXiv preprint arXiv:2005.11611, 2020.
[6] Chandan K. A. Reddy et al., "The INTERSPEECH 2020 deep noise suppression challenge: Datasets, subjective testing framework, and challenge results," in ISCA INTERSPEECH, 2020.
[7] Chandan K. A. Reddy, Harishchandra Dubey, Vishak Gopal, Ross Cutler, Sebastian Braun, Hannes Gamper, Robert Aichner, and Sriram Srinivasan, "ICASSP 2021 deep noise suppression challenge," arXiv preprint arXiv:2009.06122, 2020.
[8] Babak Naderi and Ross Cutler, "A crowdsourcing extension of the ITU-T recommendation P.835 with validation," arXiv preprint arXiv:2010.13200, 2020.
[9] Chandan K. A. Reddy, Vishak Gopal, and Ross Cutler, "DNSMOS: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors," arXiv preprint arXiv:2010.15258, 2020.
[10] Chandan K. A. Reddy et al., "A scalable noisy speech dataset and online subjective test framework," arXiv preprint arXiv:1909.08050, 2019.
[11] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," in IEEE ICASSP, 2015.
[12] Babak Naderi and Ross Cutler, "An open source implementation of ITU-T recommendation P.808 with validation," in ISCA INTERSPEECH, 2020.
[13] Julia Wilkins, Prem Seetharaman, Alison Wahl, and Bryan Pardo, "VocalSet: A singing voice dataset," in ISMIR, 2018.
[14] Houwei Cao, David G. Cooper, Michael K. Keutmann, Ruben C. Gur, Ani Nenkova, and Ragini Verma, "CREMA-D: Crowd-sourced emotional multimodal actors dataset," IEEE Trans. on Affective Computing, vol. 5, no. 4, pp. 377–390, 2014.
[15] Dong Wang, Xuewei Zhang, and Zhiyong Zhang, "THCHS-30: A free Chinese speech corpus," 2015.
[16] Hui Bu, Jiayu Du, Xingyu Na, Bengu Wu, and Hao Zheng, "AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline," in IEEE ICASSP, 2017.
[21] Joachim Thiemann, Nobutaka Ito, and Emmanuel Vincent, "The diverse environments multi-channel acoustic noise database (DEMAND): A database of multichannel environmental noise recordings," The Journal of the Acoustical Society of America, p. 3591, May 2013.
[22] Tom Ko, Vijayaditya Peddinti, Daniel Povey, Michael L. Seltzer, and Sanjeev Khudanpur, "A study on data augmentation of reverberant speech for robust speech recognition," in IEEE ICASSP, 2017.
[23] Poju Antsalo et al., "Estimation of modal decay parameters from noisy response measurements," in Audio Engineering Society Convention 110, 2001.
[24] Hannes Gamper, "Blind C50 estimation from single-channel speech using a convolutional neural network," in IEEE MMSP, 2020.