A Modulation-Domain Loss for Neural-Network-based Real-time Speech Enhancement
© 2021 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Tyler Vuong, Yangyang Xia, Richard M. Stern
Department of Electrical and Computer Engineering, Carnegie Mellon University
Language Technologies Institute, Carnegie Mellon University
[email protected], [email protected], [email protected]

ABSTRACT
We describe a modulation-domain loss function for deep-learning-based speech enhancement systems. Learnable spectro-temporal receptive fields (STRFs) were adapted to optimize for a speaker identification task. The learned STRFs were then used to calculate a weighted mean-squared error (MSE) in the modulation domain for training a speech enhancement system. Experiments showed that adding the modulation-domain MSE to the MSE in the spectro-temporal domain substantially improved the objective prediction of speech quality and intelligibility for real-time speech enhancement systems without incurring additional computation during inference.
Index Terms — Real-time speech enhancement, spectro-temporal receptive field, loss functions
1. INTRODUCTION
Supervised speech enhancement (SE) using deep neural networks (DNNs) has received tremendous attention in recent years. The availability of abundant amounts of training data and advancements in DNN architectures have resulted in systems that provide better performance than the ideal binary mask [1] – a target that highly correlates with speech intelligibility [2, 3]. The core design of a DNN-based SE system involves decisions that perform compensation in one feature domain and calculate the loss function in another. Despite the emerging time-domain compensation methods (e.g., [1]), predicting a time-varying gain function, or a time-frequency (TF) mask [4], has been the most popular and reliable approach.

Loss functions for supervised SE in the TF domain have historically been calculated in the time or frequency domain. However, most existing loss functions [5] were motivated by statistically-optimal solutions [6, 7, 8] and do not necessarily correlate with perceptual quality or intelligibility of speech [9]. More recently, perceptually-motivated loss functions have been proposed to optimize modified predictors of speech quality [10] and intelligibility [11]. Interestingly, these methods did not show improvement on objective metrics that their loss functions did not directly optimize for, suggesting that there is room for improving their generalization ability.

Modulation is closely related to speech intelligibility. The speech transmission index (STI) measures the extent to which amplitude modulation of speech is preserved in degraded environments and is highly correlated with speech intelligibility [12, 13]. The spectro-temporal modulation index (STMI) was subsequently proposed to account for joint spectro-temporal modulation [14]. SE in the modulation domain has also been explored [15, 16]. However, these methods assume that speech and noise are separable in the modulation domain. Moreover, they typically require a complete set of spectro-temporal receptive fields (STRFs) in order to invert the processed modulation spectra back to the TF domain. This may be computationally infeasible for real-time applications.

In this paper, we propose a simple mean-squared error (MSE)-based loss function in the spectro-temporal modulation domain for supervised SE. We call the loss spectro-temporal modulation error (STME) because of its close relation to template-based STMI [14], which correlates well with speech intelligibility. The calculation of the STME is based on a set of pre-selected modulation kernels, which had been shown to be critical for the accuracy of predicted speech intelligibility using speech stimuli [13]. Following our recent success in discriminating live speech from synthetic or broadcast speech using learnable spectro-temporal receptive field (STRF) kernels [17], we develop an automatic way to determine these kernels through an auxiliary speaker identification (SID) task. STME is applicable to any deep neural network (DNN) system as long as the short-time Fourier transform magnitude (STFTM) of the target and degraded speech are accessible when training the DNN. Our proposed system's loss is easy to compute, does not incur additional computation during inference, and avoids lossy inversion, which is a problem with conventional modulation-domain SE approaches [15].
Organization of this paper.
In the next section, we introduce related work in supervised SE and the background of STMI. We then describe our selection procedure for the STRFs and the loss function. Finally, we describe the evaluation procedure and discuss the experimental results.
2. BACKGROUND
In this section we briefly review supervised speech enhancement and the spectro-temporal modulation index.
Supervised DNN-based speech enhancement.
We assume that the observed noisy speech contains clean speech corrupted by additive noise. This relationship in the short-time Fourier transform (STFT) domain is described by

$X[t,k] = S[t,k] + N[t,k]$,   (1)

where $X[t,k]$, $S[t,k]$, and $N[t,k]$ represent the STFT at time frame $t$ and frequency bin $k$ of the observed noisy speech, clean speech, and noise, respectively. Without loss of generality, we assume that a DNN is trained to predict a magnitude gain, $G[t,k]$, from past and current information of degraded speech. The enhanced STFTM is obtained by element-wise multiplication of the predicted gain by the noisy STFTM,

$|\hat{S}[t,k]| = G[t,k]\,|X[t,k]|$.   (2)

A popular loss function that drives the learning for this DNN is the MSE between the enhanced STFTM and the clean STFTM, or the time-frequency error (TFE),

$\mathrm{TFE} = \lVert \vec{S} - \vec{\hat{S}} \rVert^{2}$,   (3)

where $\vec{S}$ and $\vec{\hat{S}}$ denote the vector representations of the clean STFTM and enhanced STFTM, respectively.
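As a concrete illustration, a minimal PyTorch sketch of the gain masking in Eq. (2) and a TFE-style loss as in Eq. (3) might look as follows; the tensor layout and the use of a mean rather than a sum over the squared error are assumptions for readability, not the authors' exact implementation.

```python
import torch

def tfe_loss(clean_stftm: torch.Tensor, noisy_stftm: torch.Tensor,
             gain: torch.Tensor) -> torch.Tensor:
    """Time-frequency error: the enhanced STFTM from Eq. (2) compared with
    the clean STFTM as in Eq. (3). Tensors are (batch, frames, freq_bins)."""
    enhanced_stftm = gain * noisy_stftm                       # Eq. (2)
    return torch.mean((clean_stftm - enhanced_stftm) ** 2)    # Eq. (3), as MSE
```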
Spectro-temporal modulation index.

The spectro-temporal modulation index (STMI) is a measure of speech integrity in the modulation domain as viewed by a model of the auditory system [14]. At the core of this model is a bank of STRFs that are believed to exist in the central auditory system and respond to a range of patterns of temporal and spectral modulation [18]. Each STRF is a TF template in which two seed functions control the temporal modulation (rate) and spectral modulation (scale) selectivity, respectively. The spectro-temporal modulation response (STMR) of a spectrographic representation of speech to a STRF is defined by the convolution of the two,

$\mathrm{STMR}_S[t,k] = \mathrm{STRF}[t,k] * f(|S[t,k]|)$,   (4)

where $f$ is a frequency-integration function followed by a logarithmic compression that mimics the function of the early auditory system. Finally, the template-based STMI [14] is defined as

$\mathrm{STMI}_T = 1 - \dfrac{\lVert \vec{\mathrm{STMR}}_S - \vec{\mathrm{STMR}}_X \rVert^{2}}{\lVert \vec{\mathrm{STMR}}_S \rVert^{2}}$,   (5)

where $\mathrm{STMR}[t,k]$ is typically integrated over time before being converted to the vector form. In previous implementations of modulation-domain SE, compensation was performed directly on $\mathrm{STMR}_X[t,k]$ [15, 16]. In our method, we perform enhancement in the TF domain and use a loss function in the modulation domain that is closely related to $\mathrm{STMI}_T$ for training the DNN.

It should be noted that the selection of meaningful modulation frequency becomes an issue when speech signals (instead of modulated noise) are used to calculate an estimate of speech intelligibility [13]. Previous work using STRFs typically performed dimensionality reduction on features extracted from densely-sampled STRFs [19, 20, 21]. We learn the parameters of the STRFs through an auxiliary SID task. We describe our own method next.
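For reference, once the modulation responses have been computed and stacked into tensors, the template-based STMI of Eq. (5) reduces to a few lines (a sketch with illustrative variable names, assuming time-integrated responses).

```python
import torch

def stmi_t(stmr_clean: torch.Tensor, stmr_degraded: torch.Tensor) -> torch.Tensor:
    """Template-based STMI (Eq. 5): one minus the squared error between the
    clean and degraded modulation responses, normalized by the clean energy."""
    num = torch.sum((stmr_clean - stmr_degraded) ** 2)
    den = torch.sum(stmr_clean ** 2)
    return 1.0 - num / den
```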
3. METHOD
In this section, we present the DNN system we used for speech enhancement and the calculation of our spectro-temporal modulation error loss.
Speech enhancement system.
We used the normalized log power spectra (LPS) as the input feature. The STFT is first obtained using a 20-millisecond Hamming window with 50% overlap and a 512-point discrete Fourier transform. Then we take the natural logarithm of the power of the STFT and normalize the LPS with frequency-dependent online normalization following [22].

To estimate the magnitude gain for each frame, we used a real-time network architecture similar to the one described in [5]. The network consists of a single fully connected (FC) layer followed by two stacked unidirectional Gated Recurrent Units (GRUs) and three more FC layers. Rectified linear unit (ReLU) activation is used after each of the FC layers except the very last one, where a sigmoid activation is used to bound the output magnitude gain to be between zero and one. We obtain the enhanced waveform by multiplying the magnitude gain element-wise with the noisy STFTM and using the original noisy phase for reconstruction. In total, the network contained roughly 2.8 million learnable parameters.
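A minimal PyTorch sketch of this architecture is given below; the hidden width is a placeholder rather than the paper's value (which, together with the 257-bin input implied by the 512-point DFT, determines the roughly 2.8 million parameters).

```python
import torch
import torch.nn as nn

class GRUEnhancer(nn.Module):
    """Sketch of the mask-estimation network described above: one FC layer,
    two stacked unidirectional GRUs, then three FC layers, with ReLU after
    every FC layer except the last, which uses a sigmoid to bound the gain."""

    def __init__(self, n_freq: int = 257, hidden: int = 400):
        super().__init__()
        self.fc_in = nn.Linear(n_freq, hidden)
        self.gru = nn.GRU(hidden, hidden, num_layers=2, batch_first=True)
        self.fc_mid = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.fc_out = nn.Linear(hidden, n_freq)

    def forward(self, lps: torch.Tensor) -> torch.Tensor:
        # lps: normalized log power spectra, shape (batch, frames, n_freq)
        x = torch.relu(self.fc_in(lps))
        x, _ = self.gru(x)                      # causal, frame-by-frame recurrence
        x = self.fc_mid(x)
        return torch.sigmoid(self.fc_out(x))    # magnitude gain in [0, 1]
```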
Tuning learnable STRFs on speaker identification.
One central problem involving the construction of the loss function is the selection of modulation parameters that are relevant to speech intelligibility [13]. Previous work has shown that the STMR is redundant [19] and the possible values for those modulation parameters span a wide range [20, 21]. Following the success of our previous work on discriminating live speech from synthetic speech using learnable STRFs [17], we trained the
STRFNet system on SID using the Librispeech [23] dataset with artificially-added noise from Sound Bible to learn the parameters of each Gabor-based STRF [20] automatically. The SID system was able to achieve an average of 95% accuracy with 2484 speakers and signal-to-noise ratios (SNRs) ranging from 0 to 30 dB. We then keep the learned STRFs fixed and utilize them for our loss. We use 60 STRF kernels, each with a time support of 300 milliseconds and a span of 20 channels on the Mel scale. This pipeline is depicted in the upper panel of Figure 1.

Fig. 1. Flow diagrams of the STRF tuning stage (top) and the STME loss calculation stage (bottom).
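To make the kernel geometry concrete, the following sketch constructs one 2-D Gabor-based STRF in the spirit of [20], using the 300-millisecond time support (30 frames at the 10-ms hop used here) and the 20-channel span described above; the exact envelope and parameterization of the learnable STRFs in [17] may differ, so this is illustrative only.

```python
import numpy as np

def gabor_strf(rate_hz: float, scale_cyc_per_chan: float,
               n_frames: int = 30, n_chans: int = 20,
               frame_rate_hz: float = 100.0) -> np.ndarray:
    """Illustrative 2-D Gabor-based STRF kernel: a spectro-temporal cosine
    carrier under a separable Hann envelope, parameterized by a temporal
    modulation frequency (rate) and a spectral modulation frequency (scale)."""
    t = (np.arange(n_frames) - n_frames // 2) / frame_rate_hz   # seconds
    f = np.arange(n_chans) - n_chans // 2                       # channel index
    carrier = np.cos(2.0 * np.pi * (rate_hz * t[:, None]
                                    + scale_cyc_per_chan * f[None, :]))
    envelope = np.outer(np.hanning(n_frames), np.hanning(n_chans))
    return envelope * carrier                                   # (n_frames, n_chans)
```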
Loss function.
Given $N$ STRFs, we define the response of the speech STFTM to the $i$-th STRF to be

$\mathrm{STMR}^{(i)}_S[t,k] = \mathrm{STRF}^{(i)}[t,k] \star \log\big(m(|S[t,k]|)\big)$,   (6)

where $m$ is a frequency integration function using the Mel weighting and $\star$ denotes cross-correlation. The loss function is then

$\mathrm{STME} = \dfrac{\sum_{i=1}^{N} \lVert \vec{\mathrm{STMR}}^{(i)}_S - \vec{\mathrm{STMR}}^{(i)}_{\hat{S}} \rVert^{2}}{\sum_{i=1}^{N} \lVert \vec{\mathrm{STMR}}^{(i)}_S \rVert^{2}}$,   (7)

where no time integration is applied to obtain the vector form of STMR. In general, the STME is strongly motivated by STMI and is in fact very similar to the weighted distance term in the definition of $\mathrm{STMI}_T$, with a few differences. First, the STME uses learned STRF kernels that are discriminatively trained to optimize for a SID task. Second, the STMR is calculated using cross-correlation instead of convolution. Third, the original auditory spectrogram [18] is approximated by the Mel-spectrogram and logarithmic compression. Finally, the STME calculates a weighted MSE using the instantaneous response.

To train our enhancement system, we use the STME loss in addition to the standard TFE in Eq. (3). This is motivated by the observation that the STMR is smooth and omits spectral details that are available in the TF domain. We believe that the medium-time STME loss complements the short-time TFE for supervised SE.

A block diagram of the STME calculation is shown in the bottom panel of Figure 1. In the next section, we will describe the experimental setup used to evaluate the benefits of including the STME in the training loss. We also assess the value of using automatically-learned Gabor-based STRF kernels compared to randomly-selected kernels.
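A sketch of how Eqs. (6) and (7) could be evaluated in PyTorch is shown below. The tensor layouts, the Mel matrix `mel_fb`, and the kernel stack `strf_kernels` are assumptions for illustration; note that `F.conv2d` applies its kernels without flipping, i.e., it computes a cross-correlation, matching the $\star$ operator in Eq. (6).

```python
import torch
import torch.nn.functional as F

def stme_loss(clean_stftm: torch.Tensor, enh_stftm: torch.Tensor,
              strf_kernels: torch.Tensor, mel_fb: torch.Tensor) -> torch.Tensor:
    """Spectro-temporal modulation error (Eqs. 6-7), assuming STFTMs of shape
    (batch, frames, fft_bins), mel_fb of shape (fft_bins, n_mel), and
    strf_kernels of shape (N, k_frames, k_channels)."""
    eps = 1e-8

    def stmr(stftm: torch.Tensor) -> torch.Tensor:
        # Eq. (6): Mel integration, log compression, then 2-D cross-correlation
        # of the log-Mel spectrogram with every STRF kernel.
        log_mel = stftm @ mel_fb                      # (B, T, n_mel)
        log_mel = torch.log(log_mel + eps)
        x = log_mel.unsqueeze(1)                      # (B, 1, T, n_mel)
        k = strf_kernels.unsqueeze(1)                 # (N, 1, kT, kF)
        return F.conv2d(x, k)                         # (B, N, T', F')

    r_clean = stmr(clean_stftm)
    r_enh = stmr(enh_stftm)
    # Eq. (7): squared error summed over all kernels, normalized by the
    # energy of the clean responses (no time integration).
    return torch.sum((r_clean - r_enh) ** 2) / (torch.sum(r_clean ** 2) + eps)
```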
4. EXPERIMENTAL SETUP

Datasets.
We used a small-scale and a large-scale dataset for evaluating the SE system. The small-scale dataset by Valentini et al. [24] (VBD henceforth) contains 9.4 hours and 35 minutes of noisy speech in the training and test set, respectively. We downsampled the entire dataset to 16 kHz. For the large-scale dataset, we used the Interspeech 2020 Deep Noise Suppression (DNS) dataset [25] with the room impulse responses (RIRs) provided by [26]. The DNS training set contains a total of 500 hours of noisy speech. For evaluation, the DNS dataset has two test sets, named no reverb and with reverb, which both contain 25 minutes of noisy speech. In both datasets, we are also provided with the original clean speech that was used to artificially generate the noisy speech.

We evaluated our system's capabilities to improve speaker verification performance in noisy conditions by using a modified version of the VoxCeleb1 test set [27]. The original test set contained 4874 speech pairs spoken by 40 unseen speakers. We modified the VoxCeleb1 test set by randomly adding noise from the DNS test set at SNRs ranging from -6 to 6 dB.
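The modified test set can be generated with a routine along these lines (an illustrative sketch, not the authors' data-preparation script).

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Add noise to speech at a target SNR in dB. The noise is looped or
    trimmed to match the speech length before scaling."""
    noise = np.resize(noise, speech.shape)
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise
```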
Training and evaluation procedure.
To train our SE systems on the VBD and DNS datasets, we randomly sampled 1-second and 5-second noisy speech segments, respectively, from the corresponding training data. All the SE systems were trained using the Adam optimizer with a learning rate of e− and a batch size of 64 in PyTorch. For evaluation, we used the perceptual evaluation of speech quality (PESQ) [28], scale-invariant signal-to-distortion ratio (SI-SDR) [29], and short-time objective intelligibility (STOI) [30] metrics.

To evaluate our SE systems on speaker verification, we used a DNN-based speaker verification system [31] pretrained on VoxCeleb2 [32], a dataset that contains over 1 million utterances and 6112 speakers. The verification system obtained an equal error rate (EER) of 2.2% on the VoxCeleb1 test set, which is one of the lowest reported EERs among methods with a similar number of parameters [31]. Although the system was not trained with artificial degradation, all the audio clips were extracted from YouTube and therefore naturally contain different acoustical conditions. To test whether our SE system improves the performance of this strong speaker verification system in noisy conditions, we added noise from the DNS test set to the original VoxCeleb1 test set at SNRs ranging from -6 to 6 dB. We evaluated the EER of the speaker verification system on clips enhanced by the SE system.
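Putting the pieces together, one training step might look like the following sketch, which reuses the `GRUEnhancer`, `tfe_loss`, and `stme_loss` sketches from earlier sections; the equal weighting of the two loss terms is an assumption, since the text states only that the STME loss is used in addition to the TFE.

```python
import torch

# Illustrative training step combining the TFE and STME losses.
model = GRUEnhancer()
optimizer = torch.optim.Adam(model.parameters())   # batch size 64 per the text

def training_step(noisy_lps, noisy_stftm, clean_stftm, strf_kernels, mel_fb):
    gain = model(noisy_lps)                         # predicted magnitude gain
    enh_stftm = gain * noisy_stftm                  # Eq. (2)
    loss = tfe_loss(clean_stftm, noisy_stftm, gain) \
         + stme_loss(clean_stftm, enh_stftm, strf_kernels, mel_fb)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```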
Baseline systems.

We evaluated three different baseline systems to illustrate the benefits of the additional STME loss. Each of the baseline systems has the exact same network architecture, but they were each trained with different loss functions. Our first baseline system, GRU(TFE), was trained only with the TFE loss. To evaluate the benefits of the STME loss by itself, we trained a second baseline system, GRU(STME), using only the STME loss. Our third baseline system, GRU(TFE+STME_R), was trained with both loss terms, although the parameters of the Gabor-based STRF kernels that were used to calculate the STME were randomly selected. Specifically, the temporal and spectral modulation frequencies are uniformly sampled over [0, ] Hz and [0, ] cycles per channel, respectively. This comparison allows us to assess the benefits of using automatically-learned Gabor-based STRF kernels to train the system, GRU(TFE+STME).

Table 1. Objective speech quality evaluation on the VBD test set.

Method               PESQ (MOS)   SI-SDR (dB)   STOI (%)
Noisy                1.97         8.5           92.1
ERNN [33]            2.54         —             —
GRU(TFE)             2.68
GRU(TFE+STME_R)      2.76         16.9          93.2
GRU(TFE+STME)

Table 2. Objective speech quality evaluation on the DNS no reverb (with reverb) test set.

Method               PESQ (MOS)      SI-SDR (dB)     STOI (%)
Noisy                1.58 (1.82)     9.1 (9.0)       91.5 (86.6)
DNS Baseline [22]    1.83 (1.52)     12.5 (9.2)      90.6 (82.1)
GRU(TFE)             2.27 (2.36)     14.9 (13.2)     94.2 (89.4)
GRU(STME)            2.59 (2.64)     12.4 (12.0)     94.2 (90.1)
GRU(TFE+STME_R)      2.57 (2.63)
GRU(TFE+STME)
5. EXPERIMENTAL RESULTS AND DISCUSSION
In this section, we present and discuss the results of our STME loss on speech enhancement and speaker verification.
VBD results.
In Table 1, we show the objective evaluation of each SE system trained with different losses using the VBD dataset. On a small dataset, our GRU(TFE+STME) outperformed the GRU(TFE) baseline in all the objective metrics. Interestingly, different initializations of the STRF kernels in GRU(TFE+STME_R) resulted in similar improvements over the baseline TFE loss, but roughly 20% of the time the random parameter selection resulted in a much worse performance. This highlights the benefits of using automatically-learned Gabor-based STRF kernels over randomly-selected Gabor-based STRF kernels.

DNS results.
The objective evaluation of our SE systems trained with different losses using the DNS no reverb and with reverb test sets is summarized in Table 2. Even with a large amount of training data, our GRU(TFE+STME) loss function outperformed both the GRU(TFE) baseline and the provided challenge baseline [22] in all the objective metrics. Most notably, there is a significant improvement in PESQ, which results in our system having a similar PESQ to the top system in the official DNS challenge [34]. We also evaluated the benefits of the STME loss by itself, GRU(STME). Curiously, training with only the STME loss provides a higher PESQ but a much lower SI-SDR compared to training with only the TFE loss. Nevertheless, optimizing the combination of both losses during training caused both the PESQ and SI-SDR scores to increase compared to training on each loss individually. This confirms our belief that the medium-time STME loss is complemented by the short-time TFE loss. As in the VBD experiments, the use of automatically-learned Gabor-based STRF kernels provides a greater increase in PESQ scores compared to the use of randomly-selected Gabor-based STRF kernels.

Fig. 2. Equal error rates on the VoxCeleb1 test set (EER in % versus SNR in dB for the Noisy, GRU(TFE), and GRU(TFE+STME) conditions).
Speaker verification results.
The performance of the speaker verification system with noise added at low SNRs is shown in Figure 2. At low SNRs, the speaker verification system's performance starts to substantially degrade. Our system GRU(TFE+STME) improved the EER by an average of 15.4% relative and outperformed our baseline GRU(TFE) by an average of 5.5% relative. At higher SNRs, the verification system's performance quickly approached the state-of-the-art performance of 2.2%, and our enhancement systems did not provide any additional benefits.
6. CONCLUSIONS
In this paper, we introduced a novel modulation-domain loss for training neural-network-based speech enhancement systems. We showed that by adding the spectro-temporal modulation error to the standard time-frequency error during training, all three common objective speech quality metrics substantially improved on two different datasets. Additionally, we demonstrated the value of utilizing automatically-learned Gabor-based STRF kernels over randomly-selected kernels. We also showed that our speech enhancement system can improve a strong speaker verification system at low SNRs. In the future, we plan on exploring deep-learning-based techniques to perform SE directly in the modulation domain. We will also explore ways of directly optimizing the STRF parameters for speech enhancement.

7. REFERENCES
[1] Y. Luo and N. Mesgarani, "Conv-TasNet: Surpassing ideal time-frequency magnitude masking for speech separation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 8, pp. 1256–1266, Aug. 2019.
[2] P. C. Loizou and G. Kim, "Reasons why current speech-enhancement algorithms do not improve speech intelligibility and suggested solutions," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 19, no. 1, pp. 47–56, Jan. 2011.
[3] K. D. Kryter, "Validation of the articulation index," The Journal of the Acoustical Society of America, vol. 34, no. 11, pp. 1698–1702, Nov. 1962.
[4] Y. Wang, A. Narayanan, and D. Wang, "On training targets for supervised speech separation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 12, pp. 1849–1858, Dec. 2014.
[5] S. Braun and I. Tashev, "A consolidated view of loss functions for supervised deep learning-based speech enhancement," Sep. 2020.
[6] Y. Ephraim and D. Malah, "Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 32, no. 6, pp. 1109–1121, Dec. 1984.
[7] ——, "Speech enhancement using a minimum mean-square error log-spectral amplitude estimator," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 33, no. 2, pp. 443–445, Apr. 1985.
[8] J. S. Lim and A. V. Oppenheim, "Enhancement and bandwidth compression of noisy speech," Proceedings of the IEEE, vol. 67, no. 12, pp. 1586–1604, Dec. 1979.
[9] P. C. Loizou, Speech Enhancement: Theory and Practice, Second Edition. CRC Press, Feb. 2013.
[10] J. M. Martin-Donas, A. M. Gomez, J. A. Gonzalez, and A. M. Peinado, "A deep learning loss function based on the perceptual evaluation of the speech quality," IEEE Signal Processing Letters, vol. 25, no. 11, pp. 1680–1684, Nov. 2018.
[11] Y. Zhao, B. Xu, R. Giri, and T. Zhang, "Perceptually guided speech enhancement using deep neural networks," in Proc. IEEE ICASSP, Calgary, AB, Apr. 2018, pp. 5074–5078.
[12] H. J. Steeneken and T. Houtgast, "A physical method for measuring speech-transmission quality," The Journal of the Acoustical Society of America, vol. 67, no. 1, pp. 318–326, 1980.
[13] K. L. Payton and L. D. Braida, "A method to determine the speech transmission index from speech waveforms," The Journal of the Acoustical Society of America, vol. 106, no. 6, pp. 3637–3648, Nov. 1999.
[14] M. Elhilali, T. Chi, and S. A. Shamma, "A spectro-temporal modulation index (STMI) for assessment of speech intelligibility," Speech Communication, vol. 41, no. 2-3, pp. 331–348, Oct. 2003.
[15] N. Mesgarani and S. Shamma, "Denoising in the domain of spectrotemporal modulations," EURASIP Journal on Audio, Speech, and Music Processing, vol. 2007, pp. 1–8, 2007.
[16] M. Mirbagheri, N. Mesgarani, and S. Shamma, "Nonlinear filtering of spectrotemporal modulations in speech enhancement," in Proc. IEEE ICASSP, Dallas, TX, USA, Mar. 2010, pp. 5478–5481.
[17] T. Vuong, Y. Xia, and R. M. Stern, "Learnable spectro-temporal receptive fields for robust voice type discrimination," in Interspeech 2020. ISCA, Oct. 2020, pp. 1957–1961.
[18] T. Chi, P. Ru, and S. A. Shamma, "Multiresolution spectrotemporal analysis of complex sounds," The Journal of the Acoustical Society of America, vol. 118, no. 2, pp. 887–906, Aug. 2005.
[19] N. Mesgarani, M. Slaney, and S. Shamma, "Discrimination of speech from nonspeech based on multiscale spectro-temporal modulations," IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 14, no. 3, pp. 920–930, May 2006.
[20] B. T. Meyer and B. Kollmeier, "Robustness of spectro-temporal features against intrinsic and extrinsic variations in automatic speech recognition," Speech Communication, vol. 53, no. 5, pp. 753–767, May 2011.
[21] S. V. Ravuri and N. Morgan, "Easy does it: Robust spectro-temporal many-stream ASR without fine tuning streams," in Proc. IEEE ICASSP, Kyoto, Japan, Mar. 2012, pp. 4309–4312.
[22] Y. Xia, S. Braun, C. K. A. Reddy, H. Dubey, R. Cutler, and I. Tashev, "Weighted speech distortion losses for neural-network-based real-time speech enhancement," in Proc. IEEE ICASSP, May 2020, pp. 871–875.
[23] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," in Proc. IEEE ICASSP, Apr. 2015, pp. 5206–5210.
[24] C. Valentini-Botinhao, X. Wang, S. Takaki, and J. Yamagishi, "Investigating RNN-based speech enhancement methods for noise-robust text-to-speech," in SSW, 2016, pp. 146–152.
[25] C. K. A. Reddy, E. Beyrami, H. Dubey, V. Gopal, R. Cheng, R. Cutler, S. Matusevych, R. Aichner, A. Aazami, S. Braun, P. Rana, S. Srinivasan, and J. Gehrke, "The INTERSPEECH 2020 deep noise suppression challenge: Datasets, subjective speech quality and testing framework," arXiv:2001.08662 [cs, eess], Apr. 2020.
[26] C. K. A. Reddy, H. Dubey, V. Gopal, R. Cutler, S. Braun, H. Gamper, R. Aichner, and S. Srinivasan, "ICASSP 2021 deep noise suppression challenge," 2020.
[27] A. Nagrani, J. S. Chung, and A. Zisserman, "VoxCeleb: A large-scale speaker identification dataset," in Interspeech 2017. ISCA, Aug. 2017, pp. 2616–2620.
[28] A. Rix, J. Beerends, M. Hollier, and A. Hekstra, "Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs," in Proc. IEEE ICASSP, vol. 2, Salt Lake City, UT, USA, 2001, pp. 749–752.
[29] J. Le Roux, S. Wisdom, H. Erdogan, and J. R. Hershey, "SDR–half-baked or well done?" in Proc. IEEE ICASSP, 2019, pp. 626–630.
[30] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, "A short-time objective intelligibility measure for time-frequency weighted noisy speech," in Proc. IEEE ICASSP, Dallas, TX, USA, 2010, pp. 4214–4217.
[31] J. S. Chung, J. Huh, S. Mun, M. Lee, H. S. Heo, S. Choe, C. Ham, S. Jung, B.-J. Lee, and I. Han, "In defence of metric learning for speaker recognition," arXiv:2003.11982 [cs, eess], Apr. 2020.
[32] J. S. Chung, A. Nagrani, and A. Zisserman, "VoxCeleb2: Deep speaker recognition," in Interspeech 2018. ISCA, Sep. 2018, pp. 1086–1090.
[33] D. Takeuchi, K. Yatabe, Y. Koizumi, Y. Oikawa, and N. Harada, "Real-time speech enhancement using equilibriated RNN," in Proc. IEEE ICASSP, 2020, pp. 851–855.
[34] U. Isik, R. Giri, N. Phansalkar, J.-M. Valin, K. Helwani, and A. Krishnaswamy, "PoCoNet: Better speech enhancement with frequency-positional embeddings, semi-supervised conversational data, and biased loss," arXiv preprint arXiv:2008.04470, 2020.