On Loss Functions and Recurrency Training for GAN-based Speech Enhancement Systems
Zhuohuang Zhang, Chengyun Deng, Yi Shen, Donald S. Williamson, Yongtao Sha, Yi Zhang, Hui Song, Xiangang Li
Department of Speech, Language and Hearing Sciences, Indiana University, USA; Department of Computer Science, Indiana University, USA; Didi Chuxing, Beijing, China
[email protected], {shen2, williads}@indiana.edu, {dengchengyun, shayongtao, zhangyi, songhui, lixiangang}@didiglobal.com

Abstract
Recent work has shown that it is feasible to use generative adversarial networks (GANs) for speech enhancement; however, these approaches have not been compared to state-of-the-art (SOTA) non GAN-based approaches. Additionally, many loss functions have been proposed for GAN-based approaches, but they have not been adequately compared. In this study, we propose novel convolutional recurrent GAN (CRGAN) architectures for speech enhancement. Multiple loss functions are adopted to enable direct comparisons to other GAN-based systems. The benefits of including recurrent layers are also explored. Our results show that the proposed CRGAN model outperforms the SOTA GAN-based models using the same loss functions, and it outperforms other non GAN-based systems, indicating the benefits of using a GAN for speech enhancement. Overall, the CRGAN model that combines an objective metric loss function with the mean squared error (MSE) provides the best performance over comparison approaches across many evaluation metrics.
Index Terms: speech enhancement, generative adversarial networks, convolutional recurrent neural network
1. Introduction
Speech enhancement can be used in many communication systems, e.g., as front-ends for speech recognition systems [1, 2, 3] or hearing aids [4, 5, 6]. Many speech enhancement algorithms estimate a time-frequency (T-F) mask that is applied to a noisy speech signal for enhancement (e.g., the ideal ratio mask [7, 8] or complex ideal ratio mask [9, 10]). Both deep neural network (DNN) and recurrent neural network (RNN) structures have been utilized to estimate T-F masks. Recent RNN approaches, such as long short-term memory (LSTM) [11] and bidirectional LSTM (BiLSTM) [12] networks, have demonstrated superior performance over DNN-based approaches [5], due to their ability to better capture the long-term temporal dependencies of speech.

More recently, generative adversarial networks (GANs) have been investigated for speech enhancement. A number of GAN-based speech enhancement algorithms have been proposed, including end-to-end approaches that directly map a noisy speech signal to an enhanced speech signal in the time domain [13, 14]. Other GAN-based speech enhancement algorithms operate in the T-F domain [15, 16] by estimating a
(Footnote: Part of the work was done while Z. Zhang was an intern at Didi AI Labs, Beijing, China.)
T-F mask. Current GAN-based end-to-end systems solely use convolutional layers that have skip connections [13, 14], and those implemented in the T-F domain either use only fully connected layers [15] or a combination of recurrent and fully connected layers [16]. Convolutional and fully-connected architectures cannot leverage long-term temporal information, which is crucial for speech, due to their small kernel sizes and individual frame-level predictions. On the other hand, recurrent-only layers do not fully explore the local correlations along the frequency axis [17]. Additionally, existing GAN-based methods adopt different loss functions while using different network architectures, so it is not clear which loss function performs best for training such a system.

In this paper, we incorporate a convolutional recurrent network (CRN) [18] into a GAN-based speech enhancement system, which has not been previously done. This convolutional recurrent GAN (CRGAN) exploits the advantages of both convolutional neural networks (CNNs) and RNNs, where a CNN can utilize the local correlations in the T-F domain [19], and an RNN can capture long-term time-dependent information. We further extend the "memory direction" of the original CRN structure in [18] by replacing the LSTM layers with BiLSTM layers [20]. We compare the performance of our CRGAN-based models with several recently proposed loss functions [13, 14, 15, 16], to determine the best performing loss function for GANs. The influence of an adversarial training scheme is also investigated by comparing the proposed CRGANs with a non GAN-based CRN. Furthermore, results from previous studies revealed only a small amount of improvement over some legacy approaches (i.e., Wiener filtering, DNN-based methods) when they are evaluated with objective metrics [13, 14, 15].
To better understand the benefits of GAN-based training, we additionally compare our model with recent state-of-the-art (SOTA) non GAN-based speech enhancement approaches.

The rest of the paper is organized as follows. Section 2 provides background information on speech enhancement using GANs. In Sections 3 and 4, we describe the proposed framework and loss functions of our CRGAN model. The experimental setup is presented in Section 5 and results are provided in Section 6. Finally, conclusions are drawn in Section 7.
2. Speech Enhancement Using GANs
GANs have gained much attention as an emerging deep-learning architecture [21]. Unlike conventional deep-learning systems, they consist of two networks, a generator (G) and a discriminator (D). This forms a minimax game scenario, where G generates fake data to fool D, and D discriminates between real

Figure 1:
Training procedure for GANs. Discriminator D and Generator G are trained alternately. x denotes the noisy log-magnitude spectrogram, y represents the target (oracle) T-F mask, and ŷ is the estimated T-F mask generated by G.

and fake data. D's output reflects the probability of the input being real or fake, and G learns to map samples x from prior distribution X to samples y from distribution Y.

A speech enhancement system operating in the T-F domain usually takes the magnitude spectrogram of noisy speech as the input, predicts a target T-F mask, and resynthesizes an audio signal from the enhanced spectrogram. Let X denote the distribution of noisy log-magnitude spectrograms x and let Y represent the distribution of target masks y. During adversarial training, G will learn a mapping from X to Y. A depiction of the GAN-based training procedure is shown in Figure 1. The discriminator D and generator G are trained alternately. In the first step of the iteration, D updates its weights given the target (oracle) T-F mask y, which is labeled as real. Then, in the second step, D updates its weights again using the predicted T-F mask ŷ generated by G, which is labeled as fake. Eventually, in the ideal situation, given the noisy log-magnitude spectrogram as input, G should be able to generate an estimated T-F mask that can fool D (i.e., D(G(x)) = 'real'). Note that while D is being trained, the weights in G remain frozen, and vice versa.
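The alternating schedule described above can be sketched as a short training loop. This is an illustrative outline, not the authors' code: `update_d` and `update_g` are hypothetical stand-ins for one optimizer step on the discriminator and generator, respectively.

```python
# Minimal sketch of the alternating GAN training schedule, assuming
# `update_d(batch, label)` performs one discriminator step (with G frozen)
# and `update_g(batch)` performs one generator step (with D frozen).
# Both callables are hypothetical placeholders for real optimizer steps.

def train_gan(x_batches, y_batches, update_d, update_g, n_iters):
    for i in range(n_iters):
        x = x_batches[i % len(x_batches)]   # noisy log-magnitude spectrograms
        y = y_batches[i % len(y_batches)]   # oracle T-F masks
        update_d(y, 1.0)  # step 1: D sees the oracle mask, labeled 'real'
        update_d(x, 0.0)  # step 2: D sees G(x), labeled 'fake'
        update_g(x)       # step 3: G tries to make D(G(x)) look 'real'
```

Each iteration thus performs two discriminator updates followed by one generator update, with the non-updated network's weights held fixed.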
3. Convolutional Recurrent GAN
The network structure for the proposed CRGAN is depicted in Figure 2, where G has an encoder-decoder structure that takes the noisy log-magnitude spectrogram as the input and estimates a T-F mask. In the encoder, we deploy five 2-D convolutional layers to extract the local correlations of the speech signal. The encoded feature is then passed to a reshape layer, which is followed by two BiLSTM layers in the middle of the G network. The recurrent layers capture long-term temporal information. The decoder is simply a reversed version of the encoder, which comprises five deconvolutional (i.e., transposed convolution) layers. We apply batch normalization (BN) [22] after each convolutional/deconvolutional layer. Exponential linear units (ELU) [23] are used as the activation function in the hidden layers, while we apply a sigmoid activation function in the output layer to estimate the T-F mask. Moreover, the G network incorporates skip connections, which pass fine-grained spectrogram information to the decoder. Each skip connection concatenates the output of a convolutional-encoding layer with the input of the corresponding deconvolutional-decoding layer. The network is deterministic, where the output depends solely on the input.

D has a similar structure to G's encoder, except that we adopt leaky ReLU activation functions after the convolutional layers and there is a flattening layer after the fifth convolutional layer, which is then followed by a single-unit fully connected (FC) layer. Here D gives two types of outputs, D(y) or D_l(y), where D(y) indicates the sigmoidal output and D_l(y) denotes the linear output such that σ(D_l(y)) = D(y), with σ being the sigmoid non-linearity. These two forms of D's output are needed for the loss functions that are used to train the network.
4. Loss Functions
Existing GAN-based speech enhancement algorithms utilize different loss functions to stabilize training and improve performance. These loss functions have different benefits, but the best performing loss function is unknown. We further investigate three different loss functions within our CRGAN model to answer this question. Below, we describe the three loss functions that are implemented in our model: the Wasserstein loss [24], the relativistic loss [25], and a metric-based loss [16].
The Wasserstein loss function improves the stability and robustness of a GAN model [24]. It is formulated as

L_D = -E_{y~Y}[D_l(y)] + E_{x~X}[D_l(G(x))]
L_G = -E_{x~X}[D_l(G(x))],   (1)

where L_D and L_G are the Wasserstein losses for the discriminator and generator, respectively. L_D maximizes the expectation of classifying the true mask as real and minimizes the expectation of classifying a fake mask as a true one. L_G maximizes the expectation of generating a fake mask that seems real. A gradient-penalty (GP) term is included in D's loss, since it prevents exploding and vanishing gradients [26]:

L_GP = E_{ỹ~Ỹ}[(||∇_ỹ D_l(ỹ)||_2 - 1)^2],   (2)

where ∇_ỹ D_l(ỹ) is the gradient of D's linear output with respect to its input, ỹ = εy + (1-ε)ŷ, ε is sampled from a uniform distribution between 0 and 1, ŷ denotes the generated mask from G, and Ỹ stands for the distribution of ỹ. An L1 loss term (i.e., ||ŷ_{t,f} - y_{t,f}||_1) is added to G's loss to improve performance, as reported in [14], where {t, f} indexes the time-frequency (T-F) points of the T-F mask y. Thus, the discriminator loss becomes L_D + λ_GP · L_GP and the generator loss becomes L_G + λ_L1 · L_1, where λ_GP and λ_L1 serve as hyperparameters that control the weights of the GP and L1 losses. We use W-CRGAN to denote this model.

A relativistic loss function is adopted in [14], since it considers the probabilities of real data being real and fake data being real, an important relationship that is not considered by conventional GANs. The discriminator and generator are made relativistic by taking the difference between the outputs of D given fake and real inputs [25]:

L_D = -E_{(x,y)~(X,Y)}[log(σ(D_l(y) - D_l(G(x))))]
L_G = -E_{(x,y)~(X,Y)}[log(σ(D_l(G(x)) - D_l(y)))].   (3)

This loss, however, has high variance as G influences D, which makes the training process unstable [14].
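As a concrete illustration of Eqs. (1)-(2), the numpy sketch below computes the Wasserstein and gradient-penalty terms for a toy linear "discriminator" D_l(y) = y·w, chosen purely so that the input gradient is the weight vector w itself. This is an illustrative assumption, not the paper's network.

```python
import numpy as np

# Toy illustration of the Wasserstein losses (Eq. 1) and gradient
# penalty (Eq. 2), using a linear "discriminator" D_l(y) = y @ w so
# that the gradient of D_l with respect to its input is simply w.
rng = np.random.default_rng(0)
w = rng.normal(size=8)
D_l = lambda y: y @ w

y_real = rng.normal(size=(16, 8))   # toy batch of oracle masks, y ~ Y
y_fake = rng.normal(size=(16, 8))   # toy batch of generated masks, G(x)

L_D = -D_l(y_real).mean() + D_l(y_fake).mean()   # discriminator loss
L_G = -D_l(y_fake).mean()                        # generator loss

# Gradient penalty on interpolates y_tilde = eps*y + (1 - eps)*y_hat.
eps = rng.uniform(size=(16, 1))
y_tilde = eps * y_real + (1 - eps) * y_fake
grads = np.broadcast_to(w, y_tilde.shape)        # dD_l/dy_tilde for linear D_l
L_GP = ((np.linalg.norm(grads, axis=1) - 1.0) ** 2).mean()
```

With a real network, `grads` would be obtained by backpropagating D_l(ỹ) to its input; the penalty pushes the gradient norm toward 1 at the interpolated points.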
[Figure 2 (diagram): generator = five Conv2D + BN+ELU encoder layers, a reshape layer, two BiLSTM layers, and five Deconv2D + BN+ELU decoder layers with a sigmoid output, linked by skip connections; discriminator = five Conv2D + BN+LeakyReLU layers, a flatten layer, and an FC layer with linear and sigmoidal outputs.]

Figure 2: CRGAN structure. The generator estimates a T-F mask. The arrows between layers represent skip connections. The target T-F mask and estimated mask are provided as inputs to the discriminator for all proposed models except W-CRGAN.

Alternatively, the relativistic average loss can be formulated as [25]:

L_D = -E_{y~Y}[log(D_y)] - E_{x~X}[log(1 - D_{G(x)})]
L_G = -E_{x~X}[log(D_{G(x)})] - E_{y~Y}[log(1 - D_y)],   (4)

where D_y = σ(D_l(y) - E_{x~X}[D_l(G(x))]) and D_{G(x)} = σ(D_l(G(x)) - E_{y~Y}[D_l(y)]). GP and L1 terms are also included to stabilize training for the discriminator and generator, respectively. R-CRGAN and Ra-CRGAN denote the models with the relativistic and relativistic average losses, respectively.

Optimizing a network using traditional loss functions may not lead to noticeable quality or intelligibility improvements. Recent approaches have thus turned to optimizing objective metrics that strongly correlate with subjective evaluations by human observers [27, 28, 29]. This is adopted here, where the metric loss is defined as [16]:

L_D = E_{(x,s)~(X,S)}[(D_l(s, s) - 1)^2 + (D_l(G(x), s) - Q'(iSTFT(G(x)), iSTFT(s)))^2]
L_G = E_{x~X}[(D_l(G(x), s) - 1)^2],   (5)

where s stands for the target speech spectrogram and Q' stands for the normalized evaluation metric [i.e., perceptual evaluation of speech quality (PESQ)], whose output range becomes [0, 1] (1 means the best). iSTFT denotes the inverse short-time Fourier transform that converts a spectrogram into a time-domain speech signal. The first input of D is the enhanced signal and the second input is the clean reference signal (i.e., D simulates the evaluation metric).

When using the metric loss, without defining the target mask explicitly, G learns a T-F mask-like representation and applies it (with an additional multiplication layer) to the noisy speech spectrogram.
The generated enhanced spectrogram (i.e., G(x)) is then fed into D to get the simulated metric scores as feedback to G. In such settings, D learns the distribution of the actual metric scores, and G should generate enhanced speech spectrograms with higher metric scores. We use M-CRGAN to represent our model with the metric loss. Furthermore, we found that adding a mean squared error (MSE) term to L_G leads to improvements for several evaluation metrics. The generator loss becomes L_G + λ_MSE · ||ŷ_{t,f} - y_{t,f}||_2^2, with λ_MSE being the hyperparameter that controls the MSE weight. We use M-CRGAN-MSE to represent this model.
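As a concrete numerical illustration of Eqs. (3)-(5), the numpy sketch below evaluates the relativistic, relativistic average, and metric losses on toy placeholder values. Here `d_real`, `d_fake`, and the scalar scores are hypothetical stand-ins for discriminator outputs, not values from the paper.

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
rng = np.random.default_rng(1)
d_real = rng.normal(size=32)   # toy batch of D_l(y)
d_fake = rng.normal(size=32)   # toy batch of D_l(G(x))

# Relativistic loss (Eq. 3): paired real/fake output differences.
L_D_r = -np.log(sigmoid(d_real - d_fake)).mean()
L_G_r = -np.log(sigmoid(d_fake - d_real)).mean()

# Relativistic average loss (Eq. 4): differences against batch means.
D_y = sigmoid(d_real - d_fake.mean())
D_Gx = sigmoid(d_fake - d_real.mean())
L_D_ra = -np.log(D_y).mean() - np.log(1.0 - D_Gx).mean()
L_G_ra = -np.log(D_Gx).mean() - np.log(1.0 - D_y).mean()

# Metric loss (Eq. 5) with the extra MSE term used by M-CRGAN-MSE.
d_clean, d_enh, q_norm = 0.95, 0.60, 0.55   # toy scalar expectations
L_D_m = (d_clean - 1.0) ** 2 + (d_enh - q_norm) ** 2
L_G_m = (d_enh - 1.0) ** 2
lambda_mse = 4.0                             # as set in Sec. 5
y_hat = np.array([0.4, 0.8, 0.2])            # toy estimated mask
y = np.array([0.5, 0.7, 0.3])                # toy target mask
L_G_mse = L_G_m + lambda_mse * np.mean((y_hat - y) ** 2)
```

Note how the metric loss drives D_l(s, s) toward 1 (a perfect score for the clean reference) and D_l(G(x), s) toward the normalized metric value, while G is pushed toward outputs that D scores as 1.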
5. Experimental Setup
The proposed algorithm is evaluated on the speech dataset presented in [30]. The dataset contains 30 native English speakers from the Voice Bank corpus [31], from which 28 speakers (14 female) are used in the training set and 2 other speakers are used in the test set. There are 10 different noises (2 artificial and 8 from the DEMAND database [32]), each of which is mixed with the target speech at 4 different signal-to-noise ratios (SNRs) (0, 5, 10, and 15 dB), resulting in a total of 11572 speech-in-noise mixtures in the training set. The test set includes 5 different noises mixed with the target speech at 4 SNRs (2.5, 7.5, 12.5, and 17.5 dB), resulting in 824 mixtures. All mixtures are resampled to 16 kHz.

The log-magnitude spectrogram is used as the input feature, where we use an FFT size of 512 with a 25 ms window length and a 10 ms step size. For W-CRGAN and R/Ra-CRGAN, the training target is the phase-sensitive mask (PSM), defined as M^{PSM}_{t,f} = (|s_{t,f}| / |n_{t,f}|) cos(θ_{t,f}) [12], where |s_{t,f}| and |n_{t,f}| denote the magnitude spectrograms of the clean and noisy speech, respectively, and θ_{t,f} is the difference between the clean and noisy phase spectrograms. The PSM not only estimates the magnitude but also takes phase into account. Hyperparameters are empirically set to λ_GP = 10 and λ_L1 = 200 for W-CRGAN, R-CRGAN and Ra-CRGAN, selected based on published papers [13, 14]. For M-CRGAN-MSE, λ_MSE is set to 4.

The numbers of feature maps in the respective convolutional layers of G's encoder are set to 16, 32, 64, 128, and 256. The kernel size is (1,3) for the first convolutional layer and (2,3) for the remaining layers, with strides all set to (1,2). The BiLSTM layers consist of 2048 neurons, with 1024 neurons in each direction and a time step of 100 frames. The decoder of G follows the reverse parameter settings of the encoder. For the discriminator D, the numbers of feature maps are 4, 8, 16, 32, and 64 for the respective convolutional layers.
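The PSM target above follows directly from the complex STFTs of the clean and noisy signals. The sketch below is a straightforward reading of that definition; the small `eps` guard is our addition to avoid division by zero and is not part of the paper's formula.

```python
import numpy as np

def phase_sensitive_mask(S, N, eps=1e-8):
    """PSM from Sec. 5: (|s|/|n|) * cos(theta), where theta is the
    clean-minus-noisy phase difference. S and N are complex STFT
    matrices of the clean and noisy speech; eps avoids division by zero."""
    theta = np.angle(S) - np.angle(N)
    return (np.abs(S) / (np.abs(N) + eps)) * np.cos(theta)

# Example: a bin where the noisy signal has twice the clean magnitude
# and the same phase yields a mask value of 0.5.
mask = phase_sensitive_mask(np.array([[1 + 1j]]), np.array([[2 + 2j]]))
```

Because of the cos(θ) term, the PSM can go negative when the clean and noisy phases disagree strongly, which is why it captures phase information that a magnitude-only ratio mask discards.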
Note that the number of input channels changes when we apply the relativistic or metric loss; in these cases D takes input pairs (i.e., clean and noisy pairs). All models are trained using the Adam optimizer [33] for 60 epochs with a learning rate of 0.002 and a batch size of 60, except for M-CRGAN and M-CRGAN-MSE, where we follow a setup similar to that described in [16] and use a batch size of 1. In M-CRGAN and M-CRGAN-MSE, the time step of the BiLSTM layers changes with the number of frames per utterance, and each epoch contains 6000 randomly selected utterances from the training set.

We compare our proposed system with the same network structure without recurrent layers in the generator (i.e., we remove the BiLSTM) to quantify the impact that recurrent layers have on performance. These systems are denoted as W-CGAN, R-CGAN, Ra-CGAN, M-CGAN and M-CGAN-MSE. We also develop a CRN with MSE loss to investigate the influence of the GAN training scheme, and a CNN-MSE model (i.e., the CRN with the BiLSTM removed) to test the influence of RNN layers on non GAN-based systems. We additionally compare with state-of-the-art non GAN-based speech enhancement systems to investigate whether it is beneficial to apply GAN training for speech enhancement. We implemented two RNN-based systems: an LSTM-based approach and a BiLSTM-based approach. Both RNN-based speech enhancement systems have two layers of RNN cells and a third layer of fully connected neurons. Each recurrent layer has 256 LSTM nodes, or 128 BiLSTM nodes per direction, with time steps set to 100. The output layer has 257 nodes with sigmoid activation functions that predict a PSM training target. The MSE is used as the loss function and RMSprop [34] is applied as the optimizer.
The other settings are identical to those of the CRGAN approach. These two approaches share similar network architectures and training schemes to those mentioned in [11, 12], where RNN-based structures achieve better performance than traditional DNN-based approaches [5].

The enhanced speech signals are evaluated using several objective metrics, including PESQ [35] (from -0.5 to 4.5), the short-time objective intelligibility (STOI) [36] (from 0 to 1), and CSIG (signal distortion), CBAK (intrusiveness of background noise) and COVL (overall effect) [37] (each from 1 to 5).
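The metric ranges above also determine the Q' normalization needed by the metric loss in Eq. (5): a raw PESQ score must be rescaled from [-0.5, 4.5] onto [0, 1]. The linear mapping below is our illustrative assumption of such a normalization; the exact transform used in [16] may differ.

```python
def normalize_pesq(p):
    """Map a raw PESQ score from [-0.5, 4.5] linearly onto [0, 1],
    so that 1 corresponds to the best achievable quality."""
    return (p + 0.5) / 5.0
```

Under this mapping, the noisy baseline PESQ of 1.97 in Table 1 corresponds to a Q' value of about 0.494.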
6. Results
The results for the different systems are shown in Table 1, where all the algorithms are able to improve speech quality over unprocessed noisy speech signals. We first compare the performance of our model with state-of-the-art GAN-based systems. The results indicate that our best approach achieves much better performance in speech quality (e.g., PESQ: 2.92 for M-CRGAN-MSE) when compared to the best performing GAN-based system (e.g., PESQ: 2.57 for RaSGAN). Similar results occur for intelligibility (e.g., STOI). This also occurs when the same loss function is used (e.g., MetricGAN). The significant improvements in CSIG, CBAK and COVL also indicate that our proposed systems better maintain speech integrity while removing the background noise. We also include some non GAN-based systems (i.e., LSTM, BiLSTM, CRN-MSE and CNN-MSE) to verify that GAN-based systems are beneficial. It is interesting to see that all the existing GAN-based speech enhancement systems, when compared to non GAN-based systems, achieve lower scores in both enhanced speech quality and intelligibility (i.e., PESQ and STOI scores). Their enhanced speech also has a greater degree of signal distortion (CSIG) and more intrusive noise (CBAK). Comparing the performance of the proposed CRGAN models with these non GAN-based systems, the proposed models (i.e., Ra-CRGAN, M-CRGAN and M-CRGAN-MSE) tend to achieve better performance across nearly all metrics (except that Ra-CRGAN achieves lower but similar performance to BiLSTM and CRN-MSE in CSIG and COVL). This indicates that GAN-based systems can outperform non GAN-based systems when local and temporal information are considered in conjunction with appropriate loss functions. We also notice that superior performance is achieved by our proposed CRGAN models compared to the CRN model alone without the GAN framework, which reveals that adversarial training is beneficial for speech enhancement.

We provide results of the proposed CRGAN-based models with different loss functions and results of these models without the recurrent layers, to quantify the importance of recurrent layers. When comparing the results, we observe a dramatic decrease in speech quality (e.g., PESQ: 2.38 for Ra-CGAN vs. 2.81 for Ra-CRGAN), intelligibility and overall effect, which implies that recurrent layers are crucial and beneficial to GAN-based systems. We also notice that for the CRN, the influence of recurrent layers is not as significant as for our proposed CRGAN-based systems, suggesting that the CRGAN-based systems are more sensitive to the recurrent layers.

Among our proposed models, the CRGAN trained with the metric loss and an additional MSE term yields the best performance across nearly all metrics (except that M-CRGAN achieves better performance in CBAK), suggesting that the metric loss is the most beneficial of the proposed loss functions for a CRGAN-based speech enhancement system. The CRGAN with the relativistic average loss achieves comparable performance.

(Footnote: We use author-provided source code, when available, for comparison approaches. The results for [16] differ from the original results, possibly due to differences in training hyperparameters (e.g., the authors in [16] use 400 epochs). We report originally published results if code is not available.)

Table 1: Results for the enhancement systems. Best scores are highlighted in bold. (* indicates previously reported results.)

Setting          PESQ  STOI   CSIG  CBAK  COVL
Noisy            1.97  0.921  3.35  2.44  2.63

GAN-based Systems
SEGAN [13]       2.31  0.933  3.55  2.94  2.91
MMSEGAN [15]*    2.53  0.930  3.80  3.12  3.14
RSGAN            2.51  0.937  3.82  3.16  3.15
RaSGAN [14]      2.57  0.937  3.83  3.28  3.20
MetricGAN [16]   2.49  0.925  3.81  3.05  3.13

Non GAN-based Systems
CNN-MSE          2.64  0.927  3.56  3.08  3.09
LSTM [11]        2.56  0.914  3.87  2.87  3.20
BiLSTM [12]      2.70  0.925  3.99  2.95  3.34
CRN-MSE [18]     2.74  0.934  3.86  3.14  3.30

Proposed CRGAN without Recurrent Layers
W-CGAN           2.29  0.920  2.60  2.88  2.42
R-CGAN           2.33  0.916  2.92  2.81  2.58
Ra-CGAN          2.38  0.917  2.97  2.87  2.64
M-CGAN           2.59  0.927  3.68  3.15  3.11
M-CGAN-MSE       2.66  0.926  3.89  3.05  3.27

Proposed CRGAN with Different Losses
W-CRGAN          2.60  0.930  3.35  3.09  2.97
R-CRGAN          2.72  0.932  3.67  3.09  3.17
Ra-CRGAN         2.81  0.936  3.72  3.16  3.25
M-CRGAN          2.87  0.938  4.11
7. Conclusions
In this study, we propose a novel GAN-based speech enhancement algorithm with a convolutional recurrent structure that operates in the T-F domain. Results show that our proposed models outperform other state-of-the-art GAN-based and non GAN-based speech enhancement systems across an array of evaluation metrics, indicating that it is promising to use a GAN framework for speech enhancement. We conclude that the introduction of recurrent layers is important for our CRGAN model. We also investigate the influence of the GAN training scheme and different loss functions. The metric loss greatly improves performance. By combining the metric and MSE loss functions, the CRGAN approach achieves even greater performance.

References

[1] J. H. Hansen and M. A. Clements, "Constrained iterative speech enhancement with application to speech recognition," IEEE Transactions on Signal Processing, vol. 39, no. 4, pp. 795–805, 1991.
[2] F. Weninger, H. Erdogan, S. Watanabe, E. Vincent, J. Le Roux, J. R. Hershey, and B. Schuller, "Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR," in LVA/ICA. Springer, 2015, pp. 91–99.
[3] Y. Xu, C. Weng, L. Hui, J. Liu, M. Yu, D. Su, and D. Yu, "Joint training of complex ratio mask based beamformer and acoustic model for noise robust ASR," in ICASSP. IEEE, 2019, pp. 6745–6749.
[4] L. P. Yang and Q. J. Fu, "Spectral subtraction-based speech enhancement for cochlear implant patients in background noise," JASA, vol. 117, no. 3, pp. 1001–1004, 2005.
[5] Z. Zhang, Y. Shen, and D. Williamson, "Objective comparison of speech enhancement algorithms with hearing loss simulation," in ICASSP. IEEE, 2019, pp. 6845–6849.
[6] Z. Zhang, D. Williamson, and Y. Shen, "Impact of amplification on speech enhancement algorithms using an objective evaluation metric."
[7] Y. Wang, A. Narayanan, and D. L. Wang, "On training targets for supervised speech separation," TASLP, vol. 22, no. 12, pp. 1849–1858, 2014.
[8] R. Gu, L. Chen, S.-X. Zhang, J. Zheng, Y. Xu, M. Yu, D. Su, Y. Zou, and D. Yu, "Neural spatial filter: Target speaker speech separation assisted with directional information," in Interspeech, 2019, pp. 4290–4294.
[9] D. Williamson, Y. Wang, and D. L. Wang, "Complex ratio masking for monaural speech separation," TASLP, vol. 24, no. 3, pp. 483–492, 2016.
[10] Y. Li and D. S. Williamson, "A return to dereverberation in the frequency domain using a joint learning approach," in ICASSP. IEEE, 2020, pp. 7549–7553.
[11] F. Weninger, J. R. Hershey, J. Le Roux, and B. Schuller, "Discriminatively trained recurrent neural networks for single-channel speech separation," in GlobalSIP. IEEE, 2014, pp. 577–581.
[12] H. Erdogan, J. R. Hershey, S. Watanabe, and J. Le Roux, "Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks," in ICASSP. IEEE, 2015, pp. 708–712.
[13] S. Pascual, A. Bonafonte, and J. Serra, "SEGAN: Speech enhancement generative adversarial network," arXiv preprint arXiv:1703.09452, 2017.
[14] D. Baby and S. Verhulst, "SERGAN: Speech enhancement using relativistic generative adversarial networks with gradient penalty," in ICASSP. IEEE, 2019, pp. 106–110.
[15] M. H. Soni, N. Shah, and H. A. Patil, "Time-frequency masking-based speech enhancement using generative adversarial network," in ICASSP. IEEE, 2018, pp. 5039–5043.
[16] S. W. Fu, C. F. Liao, Y. Tsao, and S. D. Lin, "MetricGAN: Generative adversarial networks based black-box metric scores optimization for speech enhancement," arXiv preprint arXiv:1905.04874, 2019.
[17] K. M. Nayem and D. Williamson, "Incorporating intra-spectral dependencies with a recurrent output layer for improved speech enhancement," IEEE, 2019, pp. 1–6.
[18] K. Tan and D. L. Wang, "A convolutional recurrent neural network for real-time speech enhancement," in Interspeech, 2018, pp. 3229–3233.
[19] Z. Zhang, Z. Sun, J. Liu, J. Chen, Z. Huo, and X. Zhang, "Deep recurrent convolutional neural network: Improving performance for speech recognition," arXiv preprint arXiv:1611.07174, 2016.
[20] A. Graves and J. Schmidhuber, "Framewise phoneme classification with bidirectional LSTM and other neural network architectures," Neural Networks, vol. 18, no. 5–6, pp. 602–610, 2005.
[21] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in NIPS, 2014, pp. 2672–2680.
[22] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," arXiv preprint arXiv:1502.03167, 2015.
[23] D. A. Clevert, T. Unterthiner, and S. Hochreiter, "Fast and accurate deep network learning by exponential linear units (ELUs)," arXiv preprint arXiv:1511.07289, 2015.
[24] M. Arjovsky, S. Chintala, and L. Bottou, "Wasserstein GAN," arXiv preprint arXiv:1701.07875, 2017.
[25] A. Jolicoeur-Martineau, "The relativistic discriminator: a key element missing from standard GAN," arXiv preprint arXiv:1807.00734, 2018.
[26] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville, "Improved training of Wasserstein GANs," in NIPS, 2017, pp. 5767–5777.
[27] H. Zhang, X. Zhang, and G. Gao, "Training supervised speech separation system to improve STOI and PESQ directly," in ICASSP. IEEE, 2018, pp. 5374–5378.
[28] M. Kolbæk, Z. H. Tan, and J. Jensen, "Monaural speech enhancement using deep neural networks by maximizing a short-time objective intelligibility measure," in ICASSP. IEEE, 2018, pp. 5059–5063.
[29] Y. Zhao, B. Xu, R. Giri, and T. Zhang, "Perceptually guided speech enhancement using deep neural networks," in ICASSP. IEEE, 2018, pp. 5074–5078.
[30] C. Valentini-Botinhao, X. Wang, S. Takaki, and J. Yamagishi, "Investigating RNN-based speech enhancement methods for noise-robust text-to-speech," in SSW, 2016, pp. 146–152.
[31] C. Veaux, J. Yamagishi, and S. King, "The voice bank corpus: Design, collection and data analysis of a large regional accent speech database," in O-COCOSDA/CASLRE. IEEE, 2013, pp. 1–4.
[32] J. Thiemann, N. Ito, and E. Vincent, "The diverse environments multi-channel acoustic noise database: A database of multichannel environmental noise recordings," JASA, vol. 133, no. 5, pp. 3591–3591, 2013.
[33] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[34] T. Tieleman and G. Hinton, "Lecture 6.5-RMSProp: Divide the gradient by a running average of its recent magnitude," COURSERA: Neural Networks for Machine Learning, vol. 4, no. 2, pp. 26–31, 2012.
[35] A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, "Perceptual evaluation of speech quality (PESQ) - a new method for speech quality assessment of telephone networks and codecs," in ICASSP, vol. 2. IEEE, 2001, pp. 749–752.
[36] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, "A short-time objective intelligibility measure for time-frequency weighted noisy speech," in ICASSP. IEEE, 2010, pp. 4214–4217.
[37] Y. Hu and P. C. Loizou, "Evaluation of objective quality measures for speech enhancement,"