Dynamic Attention Based Generative Adversarial Network with Phase Post-Processing for Speech Enhancement
Andong Li*‡, Chengshi Zheng*‡, Renhua Peng*‡, Cunhang Fan†‡, Xiaodong Li*‡

* Key Laboratory of Noise and Vibration Research, Institute of Acoustics, Chinese Academy of Sciences, Beijing, China
† NLPR, Institute of Automation, Chinese Academy of Sciences, Beijing, China
‡ University of Chinese Academy of Sciences, Beijing, China
ABSTRACT
Generative adversarial networks (GANs) have facilitated the development of speech enhancement recently. Nevertheless, their performance advantage is still limited when compared with state-of-the-art models. In this paper, we propose a powerful Dynamic Attention Recursive GAN called DARGAN for noise reduction in the time-frequency domain. Different from previous works, we introduce several innovations. First, recursive learning, an iterative training protocol, is used in the generator, which consists of multiple steps. By reusing the network in each step, the noise components are reduced progressively in a step-wise manner. Second, a dynamic attention mechanism is deployed, which helps to re-adjust the feature distribution in the noise reduction module. Third, we exploit the deep Griffin-Lim algorithm as the module for phase post-processing, which facilitates further improvement in speech quality. Experimental results on the Voice Bank corpus show that the proposed GAN achieves state-of-the-art performance, outperforming previous GAN- and non-GAN-based models.

Index Terms — speech enhancement, generative adversarial network, dynamic attention, recursive learning
1. INTRODUCTION
Speech enhancement (SE) is the technique of extracting the speech components from noisy signals, which helps to improve speech quality and speech intelligibility [1]. It is widely used in automatic speech recognition (ASR), hearing assistive devices, and speech communication. Recently, due to the tremendous capability of deep neural networks (DNNs) in modeling complicated nonlinear mapping functions, a multitude of DNN-based SE approaches have been proposed to recover the speech components in low signal-to-noise ratio (SNR) and unstable noise environments [2, 3]. This paper focuses on the monaural speech enhancement task.

Recently, generative adversarial networks (GANs) have been widely used in image-to-image translation tasks [4] and have received considerable attention from the speech research community [5, 6, 7]. They encompass two principal parts, namely a generator network (G) and a discriminator network (D), which are optimized by playing a min-max game against each other [8]. The objective of the generator is to synthesize fake samples that resemble the target data distribution, while the discriminator attempts to discriminate between the real and fake samples. SEGAN was the first network incorporating a GAN for the SE task, where the speech is enhanced directly in the time domain [5]. Nonetheless, no notable improvement in objective metrics is observed over traditional signal-processing approaches. Afterward, more training strategies were introduced, which facilitate better performance for time-domain GANs [9, 10, 11]. Another line of research is based on the time-frequency (T-F) domain, where G maps the noisy T-F features to the corresponding T-F targets [12]. The experimental results indicate that T-F masking-based approaches are more beneficial for noise reduction than SEGAN [12].

Despite the impressive performance achieved by various GAN-based SE approaches, they still have several drawbacks, which are three-fold. First, although time-domain GANs effectively circumvent the phase estimation problem, they make it more challenging to optimize D. This is because the waveform has fewer structural characteristics than the T-F representation. For example, the frequency information is implicitly determined through neighboring points in the time domain, whilst the frequency distribution is explicitly represented when transformed into the T-F domain. As a consequence, D has a better discriminative capability in the T-F domain. Second, most networks adopt a complicated topology for better performance. However, Ren et al. proposed a simple baseline by unfolding a shallow network repeatedly, which achieved state-of-the-art (SOTA) performance in the deraining task [13]. It reveals the significance of the multi-stage training protocol. Third, most T-F domain GANs estimate the magnitude of the spectrum, leaving the phase information unprocessed, which causes phase mismatches.

Motivated by our proposed dynamic attention recursive convolutional network (DARCN) [14], we propose a novel GAN-based model denoted as DARGAN. Compared with previous GANs, our innovations can be summarized as three-fold. First, a recursive protocol is utilized during the training of G, i.e., different from directly generating the fake samples in G, the mapping procedure is decomposed into multiple stages, where the dependencies across the stages are bridged through a memory mechanism.
Second, a dynamic attention mechanism is introduced, where an attention generator network is specifically designed to control the feature distribution of the noise removal network. Third, different from directly reconstructing the waveform with the noisy phase information, we introduce a phase post-processing (PPP) technique based on the deep Griffin-Lim algorithm (DGLA) [15], which is capable of effectively reconstructing the phase information by unfolding the block multiple times. Experimental results indicate that the proposed model achieves SOTA performance compared with previous GAN-based SE models.

The remainder of the paper is organized as follows. The concept of GAN is introduced in Section 2. Section 3 explains the proposed network. Section 4 gives the experimental settings. Results and analysis are illustrated in Section 5. We draw some conclusions in Section 6.
Fig. 1. The schematic of the proposed architecture. It consists of two parts: a generator network (G) and a discriminator network (D). G comprises two modules, namely the Noise Removal Module (NRM) and the Attention Generator Module (AGM), which are reused with weight sharing across stages. Q denotes the number of stages in G.
2. GENERATIVE ADVERSARIAL NETWORK
The generative adversarial network (GAN) was first proposed by Goodfellow et al. [8]. It is comprised of two parts, namely a generator network (G) and a discriminator network (D). G aims to map the noise variable $z$ from the prior distribution $p_z(z)$ to generated fake samples $G(z; \theta_g)$. As for D, it is trained to accurately recognize whether the input comes from the generated samples (fake) or the training data $p_{data}(x)$ (real). Both components are optimized by playing a min-max game.

In GAN-based SE models, the conditional GAN (cGAN) is usually adopted [16], i.e., G generates the enhanced speech conditioned on the noisy speech input. In this study, the waveform is first transformed into the T-F domain with the short-time Fourier transform (STFT), and then the amplitude spectrum is utilized as both the input and output during the sample generating process. The least-squares GAN (LS-GAN) [17] is employed for training. Denoting the input noisy features, clean targets, and the generated version as $|Y|$, $|X|$, and $|\tilde{X}|$, respectively, the objective functions are formulated as:

$$\min_{D} \mathcal{L}(D) = \mathbb{E}_{|X| \sim p_{data}}\left[\left(D(|X|) - 1\right)^{2}\right] + \mathbb{E}_{|Y| \sim p_{z}}\left[D\left(G(|Y|)\right)^{2}\right], \qquad (1)$$

$$\min_{G} \mathcal{L}(G) = \mathbb{E}_{|Y| \sim p_{z}}\left[\left(D\left(G(|Y|)\right) - 1\right)^{2}\right] + \lambda_{G} \left\| G(|Y|) - |X| \right\|_{1}, \qquad (2)$$

where $D(\cdot)$ denotes the probability of the data being true, and $\lambda_G$ denotes the hyper-parameter weighting between the adversarial loss and the $L_1$-regularization loss in optimizing G. Note that the regularization term is considered necessary and important to recover the speech details [5, 10].
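To make Eqs. (1) and (2) concrete, the following is a minimal PyTorch sketch of the two LS-GAN objectives. The `generator` and `discriminator` modules and the tensor names are hypothetical placeholders, not the actual DARGAN implementation.

```python
# A minimal sketch of the LS-GAN objectives in Eqs. (1)-(2).
import torch
import torch.nn.functional as F

def d_loss(discriminator, generator, clean_mag, noisy_mag):
    # E[(D(|X|) - 1)^2]: push real samples toward the "true" label 1
    real_term = ((discriminator(clean_mag) - 1.0) ** 2).mean()
    # E[D(G(|Y|))^2]: push generated samples toward the "fake" label 0
    fake_mag = generator(noisy_mag).detach()  # no gradient into G here
    fake_term = (discriminator(fake_mag) ** 2).mean()
    return real_term + fake_term

def g_loss(discriminator, generator, clean_mag, noisy_mag, lambda_g=1.0):
    fake_mag = generator(noisy_mag)
    # E[(D(G(|Y|)) - 1)^2]: fool D into labeling fakes as real
    adv_term = ((discriminator(fake_mag) - 1.0) ** 2).mean()
    # L1 regularization toward the clean magnitude, weighted by lambda_G
    l1_term = F.l1_loss(fake_mag, clean_mag)
    return adv_term + lambda_g * l1_term
```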
3. PROPOSED NETWORK
This section describes the proposed DARGAN, which is shown in Fig. 1. First, we introduce the generator used herein; then we illustrate the discriminator; finally, we introduce the phase post-processing module.
3.1. Generator

In this paper, we use the proposed DARCN [14] as the generator module. This is because DARCN has shown satisfactory performance in noise suppression and speech recovery with a limited number of trainable parameters [14]. Compared with previous networks [2, 3], it combines recursive learning and dynamic attention together. For recursive learning, the training procedure is decomposed into multiple stages. Between adjacent stages, a stage recurrent neural network (SRNN) is proposed to bridge the relationship with a memory mechanism [14]. Therefore, the estimation in each stage can be refined progressively. To illustrate the point, we formulate the calculation of SRNN at stage $l$ as:

$$h^{l} = f_{srnn}\left(|Y|, |\tilde{X}^{l-1}|, h^{l-1}\right), \qquad (3)$$

where $h$ denotes the state term after SRNN, which is subsequently sent to the following module to update the estimation output, and $f_{srnn}$ is the mapping function of SRNN. The superscript $l$ refers to the $l$-th stage. We can see that the correlation between adjacent stages is bridged through SRNN, which facilitates the subsequent estimation.

Dynamic attention simulates the dynamic property of human auditory perception, i.e., when the real environment changes rapidly, humans tend to adjust their auditory attention accordingly. To realize that, a separate attention generator network is designed, which outputs the layer-wise values used to adjust the feature distribution. The coupling between the two modules is via point-wise convolutions and sigmoid functions. To illustrate how the two modules operate, we give the calculation process at stage $l$ as:

$$a^{l} = G_{A}\left(|Y|, |\tilde{S}^{l-1}|\right), \qquad (4)$$

$$|\tilde{S}^{l}| = G_{R}\left(|Y|, |\tilde{S}^{l-1}|, a^{l}\right), \qquad (5)$$

where $a$ refers to the values output by the attention generator network, and $G_{A}$ and $G_{R}$ refer to the mapping functions of the attention generator network and the noise removal network, respectively.

In this study, the parameter settings of G are the same as in [14]. As stated in [14], when the number of stages $Q$ equals 3, the network can adequately balance performance and computational complexity, so $Q$ is set to 3 in this paper. Additionally, we only apply the supervision to the final stage for computational convenience, which is different from [14].
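As an illustration of the recursion in Eqs. (3)-(5), here is a minimal sketch of the stage-unfolded generator loop. The sub-module names, argument orders, and the way the memory state enters the noise removal network are assumptions for illustration; the actual DARCN coupling uses point-wise convolutions and sigmoid gates [14].

```python
# A minimal sketch of the recursive generator loop with Q = 3 stages.
import torch
import torch.nn as nn

class RecursiveGenerator(nn.Module):
    def __init__(self, srnn, attention_gen, noise_removal, num_stages=3):
        super().__init__()
        # The same modules are reused in every stage (weight sharing).
        self.srnn, self.g_a, self.g_r = srnn, attention_gen, noise_removal
        self.num_stages = num_stages

    def forward(self, noisy_mag):
        est_mag = noisy_mag          # stage-0 estimate: the noisy input
        state = None                 # h^0: initial SRNN memory
        for _ in range(self.num_stages):
            # Eq. (3): update the memory from input, last estimate, state
            state = self.srnn(noisy_mag, est_mag, state)
            # Eq. (4): layer-wise attention values a^l
            attn = self.g_a(noisy_mag, est_mag)
            # Eq. (5): refined magnitude estimate |S~^l| (here we assume
            # the SRNN state also feeds the noise removal network)
            est_mag = self.g_r(noisy_mag, est_mag, attn, state)
        return est_mag               # only the final stage is supervised
```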
3.2. Discriminator

In our experiments, we use a typical convolutional recurrent network (CRN) as the discriminator, which is shown in Fig. 1. It encompasses four parts, namely a convolutional encoder (CE), a bidirectional LSTM (BLSTM), fully-connected (FC) layers, and an adaptive average pooling (ADP) layer. For the CE, six consecutive convolutional blocks are utilized, each of which consists of a convolutional layer, spectral normalization (SN) [18], and an exponential linear unit (ELU) [19]. SN is utilized herein to stabilize the training process of the discriminator. The kernel size and the stride are set to (2, ·) and (1, ·) along the temporal and frequency axes, respectively. The number of channels throughout the CE is (16, …). After the feature encoding, the BLSTM is utilized to model the contextual correlations in both directions. Here one BLSTM layer is used, which has 128 units in each direction. After that, we use two FC layers to compress the features, with (16, …) units. To tackle the variable-length issue of different utterances, the ADP layer is utilized to average the results over all timesteps, leading to a global result.

3.3. Phase post-processing

Fig. 2. The schematic of the phase post-processing. It is similar to the deep Griffin-Lim algorithm except that the clean magnitude is replaced by the estimated magnitude processed by the GAN. $\tilde{X}^{[m]}$ is the estimated complex-valued spectrum in the $m$-th iteration. $\tilde{R}$ and $\tilde{Z}$ denote the estimated spectra after $P_A$ and $P_C$, respectively.

Recently, the deep Griffin-Lim algorithm (DGLA) was proposed for phase reconstruction when only the clean magnitude is available [15]; it combines the classical Griffin-Lim algorithm (GLA) and a trainable DNN sub-block. The diagram is shown in Fig. 2. In this study, we utilize DGLA as the phase post-processing (PPP) module. Different from using the clean magnitude as the reference in [15], we provide the estimated magnitude from G as the reference amplitude. Therefore, the spectral phase can be refined by iterating the block multiple times. The block comprises three parts, namely $P_A$, $P_C$, and $\Phi$, where $P_A$ and $P_C$ work as parameter-fixed projection operations, and $\Phi$ is the module with trainable parameters for denoising. We refer the readers to [15] for details.

The calculation procedure within each iteration is given as

$$\tilde{X}^{[m]} = \tilde{Z}^{[m]} - \Phi\left(\tilde{X}^{[m-1]}, \tilde{R}^{[m]}, \tilde{Z}^{[m]}\right).$$

After $M$ iterations, the estimated complex-valued spectrum is denoted as $\tilde{X}^{[M]}$. Then the amplitude of each T-F bin is normalized to 1 to extract the phase information. As a consequence, the final estimated complex spectrum after post-processing can be computed as:

$$\tilde{X} = A \odot \tilde{X}^{[M]} \oslash \left|\tilde{X}^{[M]}\right|, \qquad (6)$$

where $A$ is the estimated amplitude from the GAN, and $\odot$ and $\oslash$ denote element-wise multiplication and division, respectively.
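The loop below sketches the PPP iteration and the final phase-magnitude combination of Eq. (6), assuming M = 5 and the standard Griffin-Lim projections ($P_A$ replaces the magnitude with the reference amplitude; $P_C$ enforces STFT consistency via iSTFT/STFT). The `phi` argument stands in for the trainable DGLA sub-block [15]; its interface here is an assumption.

```python
# A minimal sketch of the phase post-processing loop (M = 5 iterations).
import torch

N_FFT, HOP = 320, 160
WINDOW = torch.hamming_window(N_FFT)

def p_a(spec, ref_amp, eps=1e-8):
    # P_A: replace the magnitude of `spec` with `ref_amp`, keep its phase.
    return ref_amp * spec / (spec.abs() + eps)

def p_c(spec):
    # P_C: enforce STFT consistency, i.e., iSTFT followed by STFT.
    wav = torch.istft(spec, N_FFT, HOP, window=WINDOW)
    return torch.stft(wav, N_FFT, HOP, window=WINDOW, return_complex=True)

def phase_post_process(phi, ref_amp, num_iters=5):
    # Initialize with the GAN amplitude and zero phase.
    x = ref_amp.to(torch.complex64)
    for _ in range(num_iters):
        r = p_a(x, ref_amp)          # amplitude projection R^[m]
        z = p_c(r)                   # consistency projection Z^[m]
        x = z - phi(x, r, z)         # trainable correction (per-iteration update)
    # Eq. (6): keep the refined phase, restore the GAN amplitude.
    return ref_amp * x / x.abs().clamp_min(1e-8)
```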
4. EXPERIMENTAL SETTINGS

4.1. Dataset
The experiments are conducted on the dataset released by Valentini et al. [20], which is derived from the Voice Bank corpus [21]. There are 30 native speakers in total (including 14 male and 14 female), where 28 speakers are used for training (11,572 utterances) and 2 for testing (824 utterances). For each speaker, around 400 utterances are available.

For the noisy training set, 10 types of noise are utilized, including 2 synthetic and 8 real ones from the Demand database [22]. Four SNR levels are set for training: 15 dB, 10 dB, 5 dB, and 0 dB. For the noisy test set, a total of 20 conditions are created: 5 noises from [22] are mixed, each under 4 SNR levels (17.5 dB, 12.5 dB, 7.5 dB, and 2.5 dB). To select the best model during training, 572 utterances are randomly split from the training set as the validation set. As a result, the numbers of pairs for training, validation, and testing are 11,000, 572, and 824, respectively. All the utterances are downsampled from 48 kHz to 16 kHz in our experiments.
A 20 ms Hamming window is applied, with 50% overlap between adjacent frames. A 320-point STFT is adopted, leading to a 161-D feature vector; a short sketch of this pipeline is given below. For GAN training, the magnitudes of the spectra are calculated for both the feature and the target. We train the network for 100 epochs, optimized by the Adam optimizer [23]. The initial learning rates for G and D are set to 0.0005 and 0.0001, respectively. The learning rate is halved when the validation loss increases for 3 consecutive epochs, and training is terminated when the validation loss increases for 10 consecutive epochs. The minibatch size is set to 4 at the utterance level; utterances shorter than the longest one in a minibatch are zero-padded.

When the training of the GAN is finished, the noisy utterances in the training, validation, and test sets are processed with the optimal model, and the outputs are then combined with the clean versions to establish new pairs for phase post-processing training. The feature extraction procedure is the same as for the previous GAN, except that the complex-valued spectrum is computed as both the feature and the corresponding target. The mean absolute error (MAE) is used as the training criterion, which is consistent with [15]. The number of iterations M is set to 5, although more iterations could be used. The network is trained for 60 epochs, where the initial learning rate is set to 0.0002. The minibatch size is set to 4 at the utterance level.

To evaluate the performance of the proposed model, a variety of approaches are utilized as baselines, which are categorized into two types, namely GAN-based methods and non-GAN-based methods. The GAN-based methods are SEGAN [5], SERGAN [10], CSEGAN [9], MMSE-GAN [12], MetricGAN [24], and CP-GAN [11]. The non-GAN-based methods are Wavenet [25], Deep Feature Loss (DFL) [26], G+M+P [27], MDPhD [28], Wave-U-Net [29], WaveCRN [30], and STFT-TCN [31]. The reasons for choosing these models are two-fold. First, the metric results of these models are reported on the same dataset [20], which makes a fair comparison possible. Second, the performance of these models is quite competitive.
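The following minimal sketch illustrates the feature extraction described above (20 ms Hamming window at 16 kHz, 50% overlap, 320-point STFT, hence 320/2 + 1 = 161 frequency bins per frame); the function name is illustrative.

```python
# A minimal sketch of the STFT magnitude feature extraction.
import torch

def extract_magnitude(wav_16k: torch.Tensor) -> torch.Tensor:
    """wav_16k: (batch, samples) waveform at 16 kHz -> (batch, 161, frames)."""
    spec = torch.stft(
        wav_16k,
        n_fft=320,                        # 320-point STFT -> 161-D features
        hop_length=160,                   # 50% overlap
        win_length=320,                   # 20 ms window at 16 kHz
        window=torch.hamming_window(320),
        return_complex=True,
    )
    return spec.abs()                     # magnitude spectrum |Y|
```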
We adopt four metrics to compare the performance of the different approaches, all of which are open-source and described as follows:

• PESQ [32]: Perceptual Evaluation of Speech Quality, whose values range from -0.5 to 4.5. The wide-band version is used herein.
• CSIG [33]: Mean opinion score (MOS) prediction of the signal distortion, whose scores range from 1 to 5.
• CBAK [33]: MOS prediction of the background-noise intrusiveness, whose scores range from 1 to 5.
• COVL [33]: MOS prediction of the overall quality, whose scores range from 1 to 5.

Table 1. Experimental results among different models. We reimplement the results of CSEGAN in [9] (values shown in parentheses). N/A denotes that the result is not provided in the original paper. PPP denotes that the phase post-processing is applied after the GAN estimation.

Model                      PESQ    CSIG    CBAK    COVL
Noisy                      1.97    3.35    2.44    2.63
GAN-based
  SEGAN [5]                2.16    3.48    2.94    2.80
  SERGAN [10]              2.62    N/A     N/A     N/A
  CSEGAN [9]              (2.21)  (3.56)  (2.89)  (3.28)
  MMSE-GAN [12]            2.53    3.80    3.12    3.14
  MetricGAN [24]           2.86    3.99    3.18    3.42
  CP-GAN [11]              2.64    3.93    3.29    3.28
Non-GAN
  Wavenet [25]             N/A     3.62    3.23    2.98
  DFL [26]                 N/A     3.86    3.33    3.22
  G+M+P [27]               2.69    4.00    3.34    3.34
  MDPhD [28]               2.70    3.85    3.39    3.27
  Wave-U-Net [29]          2.62    3.91    3.32    3.18
  WaveCRN [30]             2.64    3.94    3.37    3.29
  STFT-TCN [31]            2.89    4.24    3.40    3.56
Proposed
  DARGAN (λ_G = 0.01)      2.82    4.22    3.35    3.53
  DARGAN (λ_G = 0.1)       2.89    4.23    3.39    3.57
  DARGAN (λ_G = 1)         2.93    4.30    3.45    3.64
  DARGAN (λ_G = 1) + PPP   2.96    –       3.47    –
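As a usage illustration, the wide-band PESQ used in this paper can be computed with the open-source `pesq` package (pip install pesq); the file names below are hypothetical, and CSIG/CBAK/COVL are the composite measures of [33], not computed here.

```python
# Scoring an enhanced utterance against its clean reference with PESQ.
import soundfile as sf
from pesq import pesq

ref, fs = sf.read("clean.wav")        # reference clean utterance (16 kHz)
deg, _ = sf.read("enhanced.wav")      # enhanced utterance to evaluate

# 'wb' selects the wide-band PESQ variant used in this paper.
score = pesq(fs, ref, deg, "wb")
print(f"Wide-band PESQ: {score:.2f}")
```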
5. RESULTS AND ANALYSIS
Table 1 presents the metric scores of the different models. For the proposed network, 3 different values of λ_G are explored, namely 0.01, 0.1, and 1, indicating different emphasis given to the L_1-regularization loss during training. For λ_G = 1, PPP is also applied to analyze the role of post-processing.

First, we compare the results of the proposed model with different λ_G. When the value is changed from 0.01 to 1, a notable improvement in all the metrics is achieved: 0.11, 0.08, 0.10, and 0.11 improvements in PESQ, CSIG, CBAK, and COVL are observed, which shows that a relatively larger weighting coefficient is beneficial to the improvement of speech quality. We attempted to increase the weighting value further, but the performance begins to decrease. This is because a further increase in λ_G diminishes the role of the adversarial loss, which may implicitly damage the performance.

We then investigate the role of PPP. When PPP is employed as the post-processing technique, a slight metric improvement is observed: for example, 0.03 and 0.02 score improvements in terms of PESQ and CBAK, which shows that PPP is beneficial to phase refinement. Note that the performance can be further improved by increasing the number of iterations [15].

Finally, we compare the proposed DARGAN with previous works. Compared with previous GAN-based models, we observe a notable improvement. For example, we surpass SEGAN by a large margin in PESQ, CSIG, CBAK, and COVL, by 0.80, 0.81, 0.53, and 0.84, respectively. Even compared with the more recently proposed MetricGAN, the proposed model still achieves consistent improvement. When it comes to recently proposed non-GAN methods, the proposed model also obtains satisfactory results. For example, DARGAN outperforms G+M+P, MDPhD, and WaveCRN in terms of all four objective measurements. We also surpass STFT-TCN, which was used in the DNS-Challenge (https://github.com/microsoft/DNS-Challenge) and ranked fourth in the non-real-time track. This demonstrates the superiority of our proposed model.

Fig. 3. Visualization of the clean, noisy, SEGAN-processed, and DARGAN-processed utterances. (a) The spectrum of the clean utterance. (b) The spectrum of the noisy utterance. (c) The spectrum of the utterance processed by SEGAN. (d) The spectrum of the utterance processed by DARGAN.

Fig. 3 presents the spectrograms of an utterance enhanced by SEGAN and DARGAN. From the figure, one can see that the proposed model can effectively suppress the noise components, whilst some unnatural residual noise components still remain for SEGAN, as shown in the black box area of Fig. 3(c). In addition, compared with SEGAN, the proposed model can also well preserve the speech components such as the formant information. More samples are provided at https://github.com/Andong-Li-speech/DARGAN.
6. CONCLUSION
In this paper, we propose a novel GAN-based model called DARGAN. Compared with previous GAN-based models, three contributions are introduced. First, we adopt recursive learning, an iterative training protocol, to decompose the generating process into multiple stages, so that the estimation result can be refined stage by stage. Second, a dynamic attention mechanism is introduced, where the feature distribution in the noise removal module can be adaptively controlled for better estimation. Third, a phase post-processing module is utilized, which facilitates the phase refinement as the number of module iterations increases. By doing so, the speech quality can be further improved. Experimental results demonstrate the superiority of DARGAN. Further research involves the direct optimization of the complex-valued spectrum with GANs.

REFERENCES

[1] P. C. Loizou, Speech Enhancement: Theory and Practice, CRC Press, 2013.
[2] DeLiang Wang and Jitong Chen, "Supervised speech separation based on deep learning: An overview," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 26, no. 10, pp. 1702–1726, 2018.
[3] Ke Tan and DeLiang Wang, "Learning complex spectral mapping with gated convolutional recurrent networks for monaural speech enhancement," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 28, pp. 380–390, 2019.
[4] Edgar Schönfeld, Bernt Schiele, and Anna Khoreva, "A U-Net based discriminator for generative adversarial networks," arXiv preprint arXiv:2002.12655, 2020.
[5] Santiago Pascual, Antonio Bonafonte, and Joan Serra, "SEGAN: Speech enhancement generative adversarial network," arXiv preprint arXiv:1703.09452, 2017.
[6] Takuhiro Kaneko, Hirokazu Kameoka, Nobukatsu Hojo, Yusuke Ijima, Kaoru Hiramatsu, and Kunio Kashino, "Generative adversarial network-based postfilter for statistical parametric speech synthesis," in ICASSP. IEEE, 2017, pp. 4910–4914.
[7] Daniel Michelsanti and Zheng-Hua Tan, "Conditional generative adversarial networks for speech enhancement and noise-robust speaker verification," arXiv preprint arXiv:1709.01703, 2017.
[8] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio, "Generative adversarial nets," in Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.
[9] Cunhang Fan, Bin Liu, Jianhua Tao, Jiangyan Yi, Zhengqi Wen, and Ye Bai, "Noise prior knowledge learning for speech enhancement via gated convolutional generative adversarial network," in APSIPA ASC. IEEE, 2019, pp. 662–666.
[10] Deepak Baby and Sarah Verhulst, "SERGAN: Speech enhancement using relativistic generative adversarial networks with gradient penalty," in ICASSP. IEEE, 2019, pp. 106–110.
[11] Gang Liu, Ke Gong, Xiaodan Liang, and Zhiguang Chen, "CP-GAN: Context pyramid generative adversarial network for speech enhancement," in ICASSP. IEEE, 2020, pp. 6624–6628.
[12] Meet H. Soni, Neil Shah, and Hemant A. Patil, "Time-frequency masking-based speech enhancement using generative adversarial network," in ICASSP. IEEE, 2018, pp. 5039–5043.
[13] Dongwei Ren, Wangmeng Zuo, Qinghua Hu, Pengfei Zhu, and Deyu Meng, "Progressive image deraining networks: A better and simpler baseline," in CVPR. IEEE, 2019, pp. 3937–3946.
[14] Andong Li, Chengshi Zheng, Cunhang Fan, Renhua Peng, and Xiaodong Li, "A recursive network with dynamic attention for monaural speech enhancement," arXiv preprint arXiv:2003.12973, 2020.
[15] Yoshiki Masuyama, Kohei Yatabe, Yuma Koizumi, Yasuhiro Oikawa, and Noboru Harada, "Deep Griffin–Lim iteration," in ICASSP. IEEE, 2019, pp. 61–65.
[16] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros, "Image-to-image translation with conditional adversarial networks," in CVPR, 2017, pp. 1125–1134.
[17] Xudong Mao, Qing Li, Haoran Xie, Raymond Y. K. Lau, Zhen Wang, and Stephen Paul Smolley, "Least squares generative adversarial networks," in ICCV, 2017, pp. 2794–2802.
[18] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida, "Spectral normalization for generative adversarial networks," arXiv preprint arXiv:1802.05957, 2018.
[19] Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter, "Fast and accurate deep network learning by exponential linear units (ELUs)," arXiv preprint arXiv:1511.07289, 2015.
[20] Cassia Valentini-Botinhao, Xin Wang, Shinji Takaki, and Junichi Yamagishi, "Investigating RNN-based speech enhancement methods for noise-robust text-to-speech," in SSW, 2016, pp. 146–152.
[21] Christophe Veaux, Junichi Yamagishi, and Simon King, "The Voice Bank corpus: Design, collection and data analysis of a large regional accent speech database," in O-COCOSDA/CASLRE. IEEE, 2013, pp. 1–4.
[22] Joachim Thiemann, Nobutaka Ito, and Emmanuel Vincent, "The diverse environments multi-channel acoustic noise database: A database of multichannel environmental noise recordings," J. Acoust. Soc. Am., vol. 133, no. 5, pp. 3591–3591, 2013.
[23] Diederik P. Kingma and Jimmy Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[24] Szu-Wei Fu, Chien-Feng Liao, Yu Tsao, and Shou-De Lin, "MetricGAN: Generative adversarial networks based black-box metric scores optimization for speech enhancement," in ICML, 2019, pp. 2031–2041.
[25] Dario Rethage, Jordi Pons, and Xavier Serra, "A Wavenet for speech denoising," in ICASSP. IEEE, 2018, pp. 5069–5073.
[26] Francois G. Germain, Qifeng Chen, and Vladlen Koltun, "Speech denoising with deep feature losses," arXiv preprint arXiv:1806.10522, 2018.
[27] Jian Yao and Ahmad Al-Dahle, "Coarse-to-fine optimization for speech enhancement," in Proc. Interspeech, 2019, pp. 2743–2747.
[28] Jang-Hyun Kim, Jaejun Yoo, Sanghyuk Chun, Adrian Kim, and Jung-Woo Ha, "Multi-domain processing via hybrid denoising networks for speech enhancement," arXiv preprint arXiv:1812.08914, 2018.
[29] Ritwik Giri, Umut Isik, and Arvindh Krishnaswamy, "Attention Wave-U-Net for speech enhancement," in WASPAA. IEEE, 2019, pp. 249–253.
[30] Tsun-An Hsieh, Hsin-Min Wang, Xugang Lu, and Yu Tsao, "WaveCRN: An efficient convolutional recurrent neural network for end-to-end speech enhancement," arXiv preprint arXiv:2004.04098, 2020.
[31] Yuichiro Koyama, Tyler Vuong, Stefan Uhlich, and Bhiksha Raj, "Exploring the best loss function for DNN-based low-latency speech enhancement with temporal convolutional networks," arXiv preprint arXiv:2005.11611, 2020.
[32] ITU-T Recommendation, "Perceptual evaluation of speech quality (PESQ): An objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs," Rec. ITU-T P.862, 2001.
[33] Yi Hu and Philipos C. Loizou, "Evaluation of objective quality measures for speech enhancement," IEEE Trans. Audio, Speech, Lang. Process., vol. 16, no. 1, pp. 229–238, 2008.