ICASSP 2021 DEEP NOISE SUPPRESSION CHALLENGE
Chandan K. A. Reddy, Harishchandra Dubey, Vishak Gopal, Ross Cutler, Sebastian Braun, Hannes Gamper, Robert Aichner, Sriram Srinivasan
Microsoft Corporation, Redmond, USA
ABSTRACT
The Deep Noise Suppression (DNS) challenge is designed to foster innovation in the area of noise suppression to achieve superior perceptual speech quality. We recently organized a DNS challenge special session at INTERSPEECH 2020. We open-sourced training and test datasets for researchers to train their noise suppression models. We also open-sourced a subjective evaluation framework and used the tool to evaluate and pick the final winners. Many researchers from academia and industry made significant contributions to push the field forward. We also learned that as a research community, we still have a long way to go in achieving excellent speech quality in challenging noisy real-time conditions. In this challenge, we are expanding both our training and test datasets. Clean speech in the training set has increased by 200% with the addition of singing voice, emotion data, and non-English languages. The test set has increased by 100% with the addition of singing, emotional, non-English (tonal and non-tonal) languages, and personalized DNS test clips. There are two tracks with a focus on (i) real-time denoising, and (ii) real-time personalized DNS.
Index Terms — Speech Enhancement, Perceptual Speech Quality, P.808, Deep Noise Suppressor, Machine Learning.
1. INTRODUCTION
In recent times, remote work has become the "new normal" as the number of people working remotely has increased exponentially due to the pandemic. There has been a surge in the demand for reliable collaboration and real-time communication tools. Audio calls with very good to excellent speech quality are needed during these times as we try to stay connected and collaborate with people every day. We are easily exposed to a variety of background noises such as a dog barking, a baby crying, kitchen noises, etc. Background noise significantly degrades the quality and intelligibility of the perceived speech, leading to fatigue. Background noise also poses a challenge in other applications such as hearing aids and smart devices.

Real-time Speech Enhancement (SE) for perceptual quality is a decades-old classical problem, and researchers have proposed numerous solutions [1, 2]. In recent years, learning-based approaches have shown promising results [3, 4, 5]. The Deep Noise Suppression (DNS) Challenge organized at INTERSPEECH 2020 showed promising results, while also indicating that we are still about 1.4 Differential Mean Opinion Score (DMOS) away from the ideal Mean Opinion Score (MOS) of 5 when tested on the DNS Challenge test set [6, 7]. The DNS Challenge is the first contest that we are aware of that used subjective evaluation to benchmark SE methods using a realistic noisy test set [8]. We open sourced a clean speech and noise corpus with configurable scripts to generate noisy-clean speech pairs suitable to train a supervised noise suppression model. There were two tracks, real-time and non-real-time, based on the computational complexity of the inference. We received an overwhelming response to the challenge, with participation from a diverse group of researchers, developers, students, and hobbyists from both academia and industry. We also received positive responses from the participants, as many found the open sourced datasets quite useful, and both the dataset and test framework have been cloned at a fast rate since the challenge.

The ICASSP 2021 Deep Noise Suppression (DNS) Challenge is intended to stimulate research in the area of real-time noise suppression. For ease of reference, we will call the ICASSP 2021 challenge DNS Challenge 2 and the INTERSPEECH 2020 challenge DNS Challenge 1. DNS Challenge 2 will have a real-time denoising track similar to the one in DNS Challenge 1. In addition, we will have a personalized DNS track focused on using speaker information to achieve better perceptual quality. In addition to the datasets we open sourced for DNS Challenge 1, we increased clean speech in the training set by 50%, resulting in over 760 hours, which includes singing voice, emotion data, and non-English languages (Chinese). Noise data in the training set remains the same as in DNS Challenge 1. We provide over 118,000 room impulse responses (RIRs), which include real and synthetic RIRs from public datasets. We provide acoustic parameters, reverberation time (T60) and clarity (C50), for clean speech and RIR samples. In DNS Challenge 2, we increased the test set by 100% by adding emotion data, singing voice, and non-English languages in Track 1, and real and synthetic clips for personalized DNS in Track 2. For DNS Challenge 1, we open sourced a subjective evaluation framework based on ITU-T P.808 [9]. The final evaluation of the participating models was done based on subjective evaluation using the P.808 subjective testing framework. We describe the results of the challenge at the end.
2. CHALLENGE TRACKS
The challenge had the following two tracks:

1. Track 1: Real-Time Denoising track requirements

• The noise suppressor must take less than the stride time Ts (in ms) to process a frame of size T (in ms) on an Intel Core i5 quad-core machine clocked at 2.4 GHz or equivalent processors. For example, Ts = T/2 for 50% overlap between frames. The total algorithmic latency allowed, including the frame size T, the stride time Ts, and any look-ahead, must be ≤ 40 ms; a method whose total algorithmic latency exceeds 40 ms does not satisfy the requirement. If your frame size plus stride, T1 = T + Ts, is less than 40 ms, then you can use up to (40 − T1) ms of future information (see the sketch after this list).

2. Track 2: Personalized Deep Noise Suppression (pDNS) track requirements

• Satisfy Track 1 requirements.

• You will have access to 2 minutes of speech from a particular speaker to extract and adapt speaker-related information that might be useful to improve the quality of the noise suppressor. The enhancement must be done on the noisy speech test segment of the same speaker.

• The enhanced speech using speaker information must be of better quality than enhanced speech without using the speaker information.
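As a rough illustration of the latency budget above, the following sketch (a hypothetical helper of our own, not official challenge tooling) checks whether a frame size, stride, and look-ahead combination stays within the 40 ms algorithmic latency limit:

```python
def check_latency_budget(frame_ms, stride_ms, lookahead_ms, budget_ms=40.0):
    """Track 1 rule of thumb: frame size + stride + look-ahead must not exceed 40 ms."""
    total = frame_ms + stride_ms + lookahead_ms
    max_lookahead = budget_ms - (frame_ms + stride_ms)
    return {
        "total_algorithmic_latency_ms": total,
        "meets_requirement": total <= budget_ms,
        "max_allowed_lookahead_ms": max(0.0, max_lookahead),
    }

# Example: 20 ms frames with 50% overlap (10 ms stride) leave up to 10 ms of look-ahead.
print(check_latency_budget(frame_ms=20, stride_ms=10, lookahead_ms=10))
```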
3. TRAINING DATASETS
The goal of releasing the clean speech and noise datasets is to provide researchers with an extensive and representative dataset to train their SE models. We initially released MS-SNSD [10] with a focus on extensibility, but the dataset lacked diversity in speakers and noise types. We published a significantly larger and more diverse dataset with configurable scripts for DNS Challenge 1 [8]. Many researchers found this dataset useful to train their noise suppression models and achieved good results. However, the training and the test datasets did not include clean speech with emotions such as crying, yelling, laughter, or singing. Also, the dataset only included the English language. For DNS Challenge 2, we are adding speech clips with other emotions and including about 10 non-English languages. Clean speech in the training set totals 760.53 hours: read speech (562.72 hours), singing voice (8.80 hours), emotion data (3.6 hours), and Chinese Mandarin data (185.41 hours). We have grown clean speech to 760.53 hours, compared to 562.72 hours in DNS Challenge 1. The details of the clean and noise datasets are described in the following sections.
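The configurable synthesizer scripts referenced above live in the challenge repository; as a hedged illustration of the core step they perform, the sketch below mixes a clean clip with a noise clip at a target SNR (the function name and simplifications are ours, not the official script):

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db, eps=1e-10):
    """Mix a clean-speech array with a noise array at a target SNR (dB).
    Simplified sketch of a noisy/clean pair synthesizer; the official challenge
    scripts additionally handle clip sampling, level normalization, and filtering."""
    # Loop or trim the noise so it covers the clean clip.
    if len(noise) < len(clean):
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[: len(clean)]

    clean_power = np.mean(clean ** 2) + eps
    noise_power = np.mean(noise ** 2) + eps
    # Scale the noise so that 10*log10(clean_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    noisy = clean + scale * noise
    return noisy, clean, scale * noise
```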
Clean speech consists of four subsets: (i) read speech recorded in clean conditions; (ii) singing clean speech; (iii) emotional clean speech; and (iv) non-English clean speech. The first subset is derived from the public audiobooks dataset called Librivox (https://librivox.org/). It is available under the permissive Creative Commons 4.0 license [11]. It has recordings of volunteers reading over 10,000 public domain audiobooks in various languages, the majority of which are in English. In total, there are 11,350 speakers. Many of these recordings are of excellent speech quality, meaning that the speech was recorded using good quality microphones in a silent and less reverberant environment. But there are also many recordings of poor speech quality, with speech distortion, background noise, and reverberation. Hence, it is important to clean the dataset based on speech quality. We used the online subjective test framework ITU-T P.808 [9] to sort the book chapters by subjective quality. The audio chapters in Librivox are of variable length, ranging from a few seconds to several minutes. We randomly sampled 10 audio segments from each book chapter, each of 10 seconds in duration. For each clip, we had 2 ratings, and the MOS across all clips was used as the book chapter MOS. Figure 1 shows the results; the quality spanned from very poor to excellent. In total, this is 562 hours of clean speech, which was part of DNS Challenge 1.

The second subset consists of high-quality audio recordings of singing voice recorded in noise-free conditions by professional singers. This subset is derived from the VocalSet corpus [12], released under the Creative Commons Attribution 4.0 International License (CC BY 4.0). It has 10.1 hours of clean singing voice recorded by 20 professional singers: 9 male and 11 female. This data was recorded on a range of vowels, a diverse set of voices on several standard and extended vocal techniques, and sung in the contexts of scales, arpeggios, long tones, and excerpts. We downsampled the mono .WAV files from 44.1 kHz to 16 kHz and added them to the clean speech used by the training data synthesizer.

The third subset consists of emotional speech recorded in noise-free conditions. It is derived from the Crowd-sourced Emotional Multimodal Actors Dataset (CREMA-D) [13], made available under the Open Database License. It consists of 7,442 audio clips from 91 actors, 48 male and 43 female, accounting for a total of 3.5 hours of audio. The age of the actors was in the range of 20 to 74 years, with diverse ethnic backgrounds including African American, Asian, Caucasian, Hispanic, and Unspecified. Actors read from a pool of 12 sentences to generate this emotional speech dataset. It covers six emotions (Anger, Disgust, Fear, Happy, Neutral, and Sad) at four intensity levels (Low, Medium, High, and Unspecified). The recorded audio clips were annotated by multiple human raters in three modalities: audio, visual, and audio-visual. Categorical emotion labels and real-valued emotion levels of perceived emotion were collected using crowd-sourcing from 2,443 raters. This data was provided as 16 kHz .WAV files, so we added it to our clean speech as is.

The fourth subset has clean speech from non-English languages. It consists of both tonal and non-tonal languages, including Chinese (Mandarin), German, and Spanish. The Mandarin data consists of the OpenSLR18 THCHS-30 [14] and OpenSLR33 AISHELL [15] datasets, both with Apache 2.0 license.
THCHS-30 was published by the Center for Speech and Language Technology (CSLT) at Tsinghua University for speech recognition. It consists of 30+ hours of clean speech recorded at 16-bit 16 kHz in noise-free conditions. Native speakers of standard Mandarin read text prompts chosen from a list of 1000 sentences. We added the entire THCHS-30 data to our clean speech for the training set. It consists of 40 speakers, 9 male and 31 female, in the age range of 19-55 years, with a total of 13,389 clean speech audio files [14]. The AISHELL dataset was created by Beijing Shell Shell Technology Co., Ltd. It has clean speech recorded by 400 native speakers (47% male and 53% female) of Mandarin with different accents. The audio was recorded in noise-free conditions using high-fidelity microphones and is provided as 16-bit 16 kHz .wav files. It is one of the largest open-source Mandarin speech datasets. We added the entire AISHELL corpus, with 141,600 utterances spanning 170+ hours of clean Mandarin speech, to our training set. The Spanish data is 46 hours of clean speech derived from OpenSLR39, OpenSLR61, OpenSLR71, OpenSLR73, OpenSLR74, and OpenSLR75, where all .WAV files were re-sampled to 16 kHz. The German data is derived from four corpora, namely (i) The Spoken Wikipedia Corpora [16], (ii) Telecooperation German Corpus for Kinect [17], (iii) M-AILABS data [18], and (iv) the zamia-speech forschergeist corpus. The complete German data constitutes 636 hours. Italian (128 hours), French (190 hours), and Russian (47 hours) are taken from the M-AILABS data [18]. The M-AILABS Speech Dataset is a publicly available multilingual corpus for training speech recognition and speech synthesis systems.

Fig. 1. Sorted near-end single-talk clip quality (P.808) with 95% confidence intervals.

The noise clips were selected from Audioset [19] (https://research.google.com/audioset/) and Freesound (https://freesound.org/). Audioset is a collection of about 2 million human-labeled 10-second sound clips drawn from YouTube videos, belonging to about 600 audio events. Like the Librivox data, certain audio event classes are over-represented. For example, there are over a million clips with the audio classes music and speech, and fewer than 200 clips for classes such as toothbrush, creak, etc. Approximately 42% of the clips have a single class, but the rest may have 2 to 15 labels. Hence, we developed a sampling approach to balance the dataset in such a way that each class has at least 500 clips. We also used a speech activity detector to remove clips with any kind of speech activity, to strictly separate speech and noise data. The resulting dataset has about 150 audio classes and 60,000 clips. We also augmented it with an additional 10,000 noise clips downloaded from the Freesound and DEMAND databases [20]. The chosen noise types are more relevant to VoIP applications. In total, there are 181 hours of noise data.

We provide 3,076 real and approximately 115,000 synthetic room impulse responses (RIRs); participants can choose either one or both types of RIRs for convolving with clean speech. Noise is then added to the reverberant clean speech, while DNS models are expected to take noisy reverberant speech and produce clean reverberant speech. Challenge participants can do both de-reverberation and denoising with their models if they prefer. These RIRs are chosen from the openSLR26 [21] and openSLR28 [21] datasets, both released with the Apache 2.0 License.

We provide two acoustic parameters, (i) reverberation time, T60 [22], and (ii) clarity, C50 [23], for all audio clips in the clean speech of the training set. We provide T60, C50, and an isReal Boolean flag for all RIRs, where isReal is 1 for real RIRs and 0 for synthetic ones.
The two parameters are correlated: an RIR with low C50 can be described as highly reverberant, and vice versa [22, 23]. These parameters are intended to give researchers the flexibility to choose a subset of the provided data for controlled studies.
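Since C50 is provided for every RIR, a short sketch of how such a clarity value can be computed from an impulse response may help. This is a generic textbook-style computation; the helper name and the peak-picking heuristic are our assumptions, not the challenge's exact tooling:

```python
import numpy as np

def clarity_c50(rir, fs):
    """Compute the C50 clarity index (dB) of a room impulse response: the ratio of
    energy arriving within 50 ms of the direct path to the energy arriving later.
    Assumes `rir` is a 1-D array sampled at `fs` Hz."""
    direct_idx = int(np.argmax(np.abs(rir)))   # take the strongest peak as the direct path
    split = direct_idx + int(0.05 * fs)        # 50 ms after the direct sound
    early = np.sum(rir[direct_idx:split] ** 2)
    late = np.sum(rir[split:] ** 2) + 1e-12
    return 10.0 * np.log10(early / late)
```

A low C50 (little early energy relative to late energy) corresponds to a highly reverberant RIR, consistent with the correlation to T60 noted above.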
4. TEST SET
In DNS Challenge 1, the test set consisted of 300 real recordings and 300 synthesized noisy speech clips. The real clips were recorded internally at Microsoft and also using crowdsourcing tools. Some of the clips were taken from Audioset. The synthetic clips were divided into reverberant and less reverberant clips. These utterances were predominantly in English. All the clips are sampled at 16 kHz with an average clip length of 12 seconds. The development phase test set is in the "ICASSP dev test set" directory of the DNS Challenge repository (https://github.com/microsoft/DNS-Challenge). For this challenge, the primary focus is to make the test set as realistic and diverse as possible.
Similar to DNS Challenge 1, the test set for DNS Challenge 2 is divided into real recordings and synthetic categories. However, the synthetic clips are mainly composed of the scenarios that we were not able to collect in realistic conditions. The Track 1 test clips can be found in the track 1 sub-directory of "ICASSP dev test set".

The real recordings consist of non-English and English segments. The English segment has 300 clips that are from the blind test set of DNS Challenge 1. These clips were collected using the crowdsourcing platform and internally at Microsoft, using a variety of devices, acoustic conditions, and noisy conditions. The non-English segment comprises 100 clips in the following languages: non-tonal: Portuguese, Russian, and Spanish; tonal: Mandarin, Cantonese, Punjabi, and Vietnamese. In total, there are 400 real test clips.
The synthetic test set consists of 200 noisy clips obtained by mixing clean speech (non-English, emotional speech, and singing) with noise. The test set noise is taken from Audioset and Freesound [8]. The 100 non-English test clips include the German, French, and Italian languages from Librivox audiobooks. Emotional clean speech consisting of laughter, yelling, and crying was chosen from Freesound and mixed with test set noise to generate 50 noisy clips. Similarly, clean singing voice from Freesound was used to generate 50 noisy clips for the singing test set.
The blind test set for Track 1 contains 700 noisy speech clips, of which 650 are real recordings and 50 are synthetic noisy singing clips. It contains the following categories: (i) emotional (102 clips), (ii) English (276 clips), (iii) non-English including tonal (272 clips), (iv) tonal languages (112 clips), and (v) singing (50 clips). The real recordings were collected using the crowdsourcing platform and internally at Microsoft. This is the most diverse publicly available test set for a noise suppression task.
For the pDNS track, we provide 2 minutes of clean adaptation data for each primary speaker, with the goal of suppressing neighboring speakers and background noise. pDNS models are expected to leverage speaker-aware training and speaker-adapted inference. There are two motivations for providing clean speech for the primary speaker: (1) speaker models are sensitive to false alarms in speech activity detection (SAD) [24], and clean speech can be used to obtain accurate SAD labels; (2) speaker adaptation is expected to work well using multi-conditioned data, and clean speech can be used to generate reverberant and noisy data for speaker adaptation.
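As a loose illustration only, one plausible way to use the 2-minute enrollment clip is sketched below: multi-condition it and average speaker embeddings into a single conditioning vector. Both `embed_fn` (e.g., a d-vector or x-vector network) and `augment_fn` are hypothetical stand-ins; the challenge does not prescribe this recipe.

```python
import numpy as np

def enrollment_embedding(enroll_clean, embed_fn, augment_fn=None, n_aug=10, seed=0):
    """Sketch: embed the clean enrollment audio plus several noisy/reverberant
    versions of it, then average into one unit-norm speaker vector that a pDNS
    model could be conditioned on. `embed_fn` maps a waveform to a vector;
    `augment_fn(waveform, rng)` returns a degraded copy (both are assumptions)."""
    rng = np.random.default_rng(seed)
    versions = [enroll_clean]
    if augment_fn is not None:
        versions += [augment_fn(enroll_clean, rng) for _ in range(n_aug)]
    emb = np.mean([embed_fn(v) for v in versions], axis=0)
    return emb / (np.linalg.norm(emb) + 1e-10)  # unit-norm speaker vector for conditioning
```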
The development test set contains 100 real recordings from 20 primary speakers collected using crowdsourcing. Each primary speaker has noisy test clips for three scenarios: (i) primary speaker in the presence of a neighboring speaker; (ii) primary speaker in the presence of background noise; and (iii) primary speaker in the presence of both background noise and a neighboring speaker.
The synthetic clips include 500 noisy clips from 100 primary speakers. Each primary speaker has 2 minutes of clean adaptation data. All clips have varying levels of neighboring speakers and noise. Test set noise from Track 1 was mixed with primary speech extracted from the VCTK corpus [25]. We used the VoxCeleb2 corpus [26] for neighboring speakers.
The blind test set for Track 2 contains 500 noisy real recordings from 80 unique speakers. All the real recordings were collected using the crowdsourcing platform. The noise source in the majority of these clips is a secondary speaker. We provided 2 minutes of clean speech utterances for each of the primary speakers that could be used to adapt the noise suppressor. All the utterances were in English.
5. CHALLENGE RESULTS

5.1. Evaluation Methodology
Most DNS evaluations use objective measures such as PESQ [27], SDR, and POLQA [28]. However, these metrics have been shown not to correlate well with subjective speech quality in the presence of background noise [29]. Subjective evaluation is the gold standard for this task. Hence, the final evaluation was done on the blind test set using the crowdsourced subjective speech quality metric based on ITU-T P.808 [9] to determine the DNS quality. For Track 2, we appended each processed clip with a 5-second utterance of the primary speaker at the beginning, followed by 1 second of silence. We modified P.808 slightly to instruct the raters to focus on the quality of the voice of the primary speaker in the remainder of the processed segment. We conducted a reproducibility test with this change to P.808 and found that the average Spearman Rank Correlation Coefficient (SRCC) between the 5 runs was 0.98. Hence, we concluded that the change is valid. We used 5 raters per clip, which resulted in a 95% confidence interval (CI) of 0.03. We also provided a baseline noise suppressor [30] for the participants to benchmark their methods.
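For reference, the headline metric reduces to simple averaging; the sketch below (our illustrative helper, not the P.808 toolkit) computes DMOS and an approximate 95% confidence interval from per-clip scores:

```python
import numpy as np

def dmos_with_ci(processed_ratings, noisy_ratings, z=1.96):
    """DMOS = MOS(processed) - MOS(noisy), with an approximate 95% CI on each MOS.
    Inputs are per-clip P.808 scores (e.g., 5 raters per clip, averaged per clip)."""
    proc = np.asarray(processed_ratings, dtype=float)
    noisy = np.asarray(noisy_ratings, dtype=float)
    mos_proc, mos_noisy = proc.mean(), noisy.mean()
    ci_proc = z * proc.std(ddof=1) / np.sqrt(len(proc))
    ci_noisy = z * noisy.std(ddof=1) / np.sqrt(len(noisy))
    return mos_proc - mos_noisy, (mos_proc, ci_proc), (mos_noisy, ci_noisy)
```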
A total of 16 teams from academia and industry participated in Track 1. Table 1 shows the categorized P.808 results for Track 1. The submissions are stack ranked based on the overall Differential MOS (DMOS). DMOS is the difference in MOS between the processed set and the original noisy set. We can observe from the results that most of the methods struggled to do well on singing and emotional clips; the noise suppressors tend to suppress certain emotional and singing segments.

Table 1. Track 1 P.808 Results.

Table 2. P.808 Results for Track 2.

Overall, the results show that the performance of noise suppressors is not great in the singing and emotional categories. The results also show that the training data must be balanced between English, non-English, and tonal languages for the model to generalize well. It is important to include singing and other emotions in the ground truth to achieve better quality in these categories. Only half of the submissions did better than the baseline. The best model is only 0.53 DMOS better than the noisy speech, corresponding to an absolute MOS of 3.38. This shows that we are far from achieving a noise suppressor that works robustly in almost all conditions.
This track is the first of its kind, and there is not much prior work in this field. Only 2 teams participated in Track 2. Each team submitted one set of processed clips using speaker information for model adaptation and another set without explicitly using speaker information. The results are shown in Table 2. The best model gave only 0.14 DMOS. This shows that the problem of using speaker information to adapt the model is still in its infancy.
6. SUMMARY & CONCLUSIONS
The ICASSP 2021 DNS Challenge was designed to advance the field of real-time noise suppression optimized for human perception in challenging noisy conditions. Large, inclusive, and diverse training and test datasets with supporting scripts were open sourced. Many participants from both industry and academia found the datasets very useful and submitted their enhanced clips for final evaluation. Only two teams participated in the personalized DNS track, which also shows that the field is in its nascent phase.

7. REFERENCES

[1] Y. Ephraim and D. Malah, "Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator," IEEE TASLP, 1984.
[2] C. Karadagur Ananda Reddy, N. Shankar, G. Shreedhar Bhat, R. Charan, and I. Panahi, "An individualized super-Gaussian single microphone speech enhancement for hearing aid users with smartphone as an assistive device," IEEE Signal Processing Letters, vol. 24, no. 11, pp. 1601–1605, 2017.
[3] S. Fu, Y. Tsao, X. Lu, and H. Kawai, "Raw waveform-based speech enhancement by fully convolutional networks."
[4] Hyeong-Seok Choi, Hoon Heo, Jie Hwan Lee, and Kyogu Lee, "Phase-aware single-stage speech denoising and dereverberation with U-Net," arXiv preprint arXiv:2006.00687, 2020.
[5] Yuichiro Koyama, Tyler Vuong, Stefan Uhlich, and Bhiksha Raj, "Exploring the best loss function for DNN-based low-latency speech enhancement with temporal convolutional networks," arXiv preprint arXiv:2005.11611, 2020.
[6] Jean-Marc Valin et al., "A perceptually-motivated approach for low-complexity, real-time enhancement of fullband speech," arXiv preprint arXiv:2008.04259, 2020.
[7] Umut Isik et al., "PoCoNet: Better speech enhancement with frequency-positional embeddings, semi-supervised conversational data, and biased loss," arXiv preprint arXiv:2008.04470, 2020.
[8] Chandan K. A. Reddy et al., "The INTERSPEECH 2020 deep noise suppression challenge: Datasets, subjective testing framework, and challenge results," in ISCA INTERSPEECH, 2020.
[9] Babak Naderi and Ross Cutler, "An open source implementation of ITU-T recommendation P.808 with validation," in ISCA INTERSPEECH, 2020.
[10] Chandan K. A. Reddy et al., "A scalable noisy speech dataset and online subjective test framework," arXiv preprint arXiv:1909.08050, 2019.
[11] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," in IEEE ICASSP, 2015.
[12] Julia Wilkins, Prem Seetharaman, Alison Wahl, and Bryan Pardo, "VocalSet: A singing voice dataset," in ISMIR, 2018.
[13] Houwei Cao, David G. Cooper, Michael K. Keutmann, Ruben C. Gur, Ani Nenkova, and Ragini Verma, "CREMA-D: Crowd-sourced emotional multimodal actors dataset," IEEE Trans. on Affective Computing, vol. 5, no. 4, pp. 377–390, 2014.
[14] Dong Wang, Xuewei Zhang, and Zhiyong Zhang, "THCHS-30: A free Chinese speech corpus," 2015.
[15] Hui Bu, Jiayu Du, Xingyu Na, Bengu Wu, and Hao Zheng, "AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline," in IEEE ICASSP, 2017.
[20] Joachim Thiemann, Nobutaka Ito, and Emmanuel Vincent, "The diverse environments multi-channel acoustic noise database (DEMAND): A database of multichannel environmental noise recordings," The Journal of the Acoustical Society of America, p. 3591, May 2013.
[21] Tom Ko, Vijayaditya Peddinti, Daniel Povey, Michael L. Seltzer, and Sanjeev Khudanpur, "A study on data augmentation of reverberant speech for robust speech recognition," in IEEE ICASSP, 2017.
[22] Poju Antsalo et al., "Estimation of modal decay parameters from noisy response measurements," in Audio Engineering Society Convention 110, 2001.
[23] Hannes Gamper, "Blind C50 estimation from single-channel speech using a convolutional neural network," in Proc. IEEE MMSP, 2020, pp. 136–140.
[24] John H. L. Hansen and Taufiq Hasan, "Speaker recognition by machines and humans: A tutorial review," IEEE Signal Processing Magazine, vol. 32, no. 6, pp. 74–99, 2015.
[25] Junichi Yamagishi, Christophe Veaux, Kirsten MacDonald, et al., "CSTR VCTK corpus: English multi-speaker corpus for CSTR voice cloning toolkit (version 0.92)," 2019.
[26] Joon Son Chung, Arsha Nagrani, and Andrew Zisserman, "VoxCeleb2: Deep speaker recognition," in ISCA INTERSPEECH, 2018.
[27] "ITU-T recommendation P.862: Perceptual evaluation of speech quality (PESQ): An objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs," Feb. 2001.
[28] John Beerends et al., "Perceptual objective listening quality assessment (POLQA), the third generation ITU-T standard for end-to-end speech quality measurement part II - perceptual model," AES: Journal of the Audio Engineering Society, vol. 61, pp. 385–402, June 2013.
[29] A. R. Avila, H. Gamper, C. Reddy, R. Cutler, I. Tashev, and J. Gehrke, "Non-intrusive speech quality assessment using neural networks," in