ICASSP 2021 Deep Noise Suppression Challenge: Decoupling Magnitude and Phase Optimization with a Two-Stage Deep Network
Andong Li, Wenzhe Liu, Xiaoxue Luo, Chengshi Zheng, Xiaodong Li
Key Laboratory of Noise and Vibration Research, Institute of Acoustics, Chinese Academy of Sciences, Beijing, China
University of Chinese Academy of Sciences, Beijing, China
ABSTRACT
It remains a tough challenge to recover speech signals contaminated by various noises in real acoustic environments. To this end, we propose a novel system for denoising in complicated applications, which is mainly comprised of two pipelines, namely a two-stage network and a post-processing module. The first pipeline is proposed to decouple the optimization problem w.r.t. magnitude and phase, i.e., only the magnitude is estimated in the first stage and both of them are further refined in the second stage. The second pipeline aims to further suppress the remaining unnatural distorted noise, which is demonstrated to sufficiently improve the subjective quality. In the ICASSP 2021 Deep Noise Suppression (DNS) Challenge, our submitted system ranked first in real-time track 1 in terms of Mean Opinion Score (MOS) under the ITU-T P.808 framework.
Index Terms — Speech enhancement, two-stage, real-time, post-processing
1. INTRODUCTION
In real scenarios, environmental noise and room reverberation may have a negative impact on the performance of automatic speech recognition (ASR) systems, video/audio communication, and hearing assistant devices. To tackle these problems, many speech enhancement (SE) algorithms have been proposed to effectively estimate the clean speech while sufficiently suppressing the noise components [1]. Recent years have witnessed the rapid development of deep neural networks (DNNs) in SE research [2, 3]. By embracing a data-driven paradigm, the SE task can be formulated as a supervised learning problem, where the network attempts to uncover the complicated nonlinear relationship between noisy features and clean targets in the time-frequency (T-F) domain.

In previous studies, only the recovery of magnitude was explored, while the noisy phase was directly incorporated for speech waveform reconstruction [2, 3]. The reasons are two-fold. For one thing, the phase is considered difficult to estimate due to its unclear structure. For another, previous literature reported that the recovery of phase did not bring a notable improvement in speech perception quality [4]. More recently, the importance of phase has received continuously increasing attention for improving speech quality and intelligibility [5]. Williamson et al. [6] proposed the complex ratio mask (CRM), where the mask is applied to both real and imaginary (RI) components, so that both magnitude and phase can theoretically be perfectly estimated. Afterward, the complex spectral mapping technique was proposed, in which the network directly estimates the RI spectrum; it was reported to achieve better speech quality than masking-based methods [7]. More recently, time-domain methods have begun to thrive, where the raw waveform serves as both input and output [8]. In this way, the explicit phase estimation problem is effectively avoided.
Although both classes can obtain impressive performance in objective metrics, we resort to the complex-domain method, as we found that complex-domain methods achieved overall better mean opinion scores (MOS) than time-domain methods in the INTERSPEECH 2020 Deep Noise Suppression (DNS) Challenge (results available at https://dns-challenge.azurewebsites.net/phase1results/Interspeech2020). We attribute the reason to the better distinguishability of speech and noise in the T-F domain than in the raw waveform format.

To handle the noise reduction problem in the more challenging acoustic environments of the ICASSP 2021 DNS Challenge [9], we propose a novel SE system, called Two-Stage Complex Network with a low-complexity Post-Processing scheme (dubbed TSCN-PP). It mainly consists of two processing pipelines. Firstly, a novel two-stage network paradigm is designed, which is comprised of two sub-networks, namely a coarse magnitude estimation network (dubbed CME-Net) and a complex spectrum refinement network (dubbed CSR-Net). CME-Net offers a coarse estimation of the spectral magnitude, which is then coupled with the noisy phase to obtain a coarse complex spectrum. Afterward, CSR-Net attempts to refine the complex spectrum by receiving both the coarse estimation and the noisy spectrum as input. Note that the role of CSR-Net is two-fold. Firstly, instead of directly estimating the spectrum of the clean target, it only captures the residual details, i.e., the estimated details are added to the input to obtain the final refined spectrum. Secondly, as some noise components still remain, CSR-Net helps to further suppress the residual noise. For the second pipeline, we propose a low-complexity post-processing (PP) module to further alleviate the unnatural residual noise, which is validated to be a significant step in improving the subjective speech quality.

We explain the design rationale of our algorithm from two aspects.
Firstly, a single-stage network often fails to function well for relatively difficult tasks due to its limited mapping capability. More recently, the literature [10, 11, 12] has revealed the advantage of multi-stage training over single-stage methods in many tasks, e.g., image deraining and speech separation. Secondly, due to the nonlinear characteristics of DNNs, some nonlinear distortion may be introduced when the test set mismatches the training conditions. For example, as an SE model is often trained with a wide range of synthetic noisy-clean pairs, when the trained model is applied in a more complicated real environment, it may introduce some unpleasant nonlinear distortion, which substantially degrades the subjective quality. Therefore, it is necessary to apply a PP module to further suppress the residual noise, provided audible speech distortion can be avoided. In our subjective experiment, we do find that, after applying the PP, the overall subjective quality is consistently improved.

The remainder of the paper is structured as follows. In Section 2, both the proposed two-stage network and the post-processing module are presented. In Section 3, the experimental settings are given. The experimental results are reported in Section 4. We draw some conclusions in Section 5.
2. PROPOSED TSCN-PP

2.1. Notations
We present the diagram of the proposed approach in Fig. 1. In this paper, we denote $(X_r, X_i)$ as the noisy complex spectrum, while $(\tilde{S}^{cm}_{r|i}, S_{r|i})$, $(\tilde{S}^{cs}_{r|i}, S_{r|i})$, and $(\tilde{S}^{pp}_{r|i}, S_{r|i})$ are the real and imaginary parts of the estimated and the clean speech after CME-Net, CSR-Net, and the PP module, respectively. In addition, the mapping functions of CME-Net, CSR-Net, and the PP module are defined as $\mathcal{F}_{cm}$, $\mathcal{F}_{cs}$, and $\mathcal{F}_{pp}$, with the parameter sets being $\Phi_{cm}$, $\Phi_{cs}$, and $\Phi_{pp}$, respectively.

As illustrated in Fig. 1, the proposed two-stage network, i.e.,
TSCN, consists of two principal parts, namely CME-Net and CSR-Net. CME-Net uses the magnitude of the noisy spectrum as the input feature and the magnitude of the clean speech spectrum as the output, which is further coupled with the noisy phase to obtain a coarse complex spectrum (CCS), i.e., $(\tilde{S}^{cm}_r, \tilde{S}^{cm}_i)$. In the second stage, both the CCS and the original noisy spectrum are concatenated as the input of CSR-Net, and the network then estimates the residual spectrum, which is directly added to the CCS to obtain a refined counterpart. Specifically, in the first stage, only the magnitude is optimized, and most noise components can be removed. In the second stage, the network only needs to modify the phase while further refining the magnitude. In a nutshell, the calculation process is:

$$|\tilde{S}^{cm}| = \mathcal{F}_{cm}\left(|X|; \Phi_{cm}\right), \quad (1)$$

$$\left(\tilde{S}^{cs}_r, \tilde{S}^{cs}_i\right) = \left(\tilde{S}^{cm}_r, \tilde{S}^{cm}_i\right) + \mathcal{F}_{cs}\left(\tilde{S}^{cm}_r, \tilde{S}^{cm}_i, X_r, X_i; \Phi_{cs}\right), \quad (2)$$

where $\tilde{S}^{cm}_r = \Re\left(|\tilde{S}^{cm}| e^{j\theta_X}\right)$ and $\tilde{S}^{cm}_i = \Im\left(|\tilde{S}^{cm}| e^{j\theta_X}\right)$.

Both CME-Net and CSR-Net adopt a similar network topology to [13], which includes a gated convolutional encoder, a decoder, and stacked temporal convolution modules (dubbed TCMs) [14]. The encoder is utilized to extract the spectral patterns of the spectrum, while the decoder reconstructs the spectrum. Note that instead of using long short-term memory (LSTM) as the basic unit for sequence modeling, stacked TCMs are utilized here to better capture both short- and long-range sequence dependencies.

As shown in Fig. 2(a), in the previous TCM setting, given an input of size $(256, T)$, where 256 and $T$ denote the number of channels and the timesteps, respectively, the TCM first projects it to a higher channel space, i.e., 512, with an input $1 \times 1$ convolution. In the proposed light-weight TCM in Fig. 2(b), a gating mechanism is introduced, i.e.
, a regular dilated convolution is multiplied with another dilated branch, where the sigmoid function is applied to scale the output values into $(0, 1)$. Fig. 2(c) is an improved version named DTCM, where two gated dilated convolutions (D-Convs) are applied and the outputs from the two branches are concatenated. Note that the dilation rates between the two branches complement each other, i.e., if the dilation rate in one branch is $r$, then the other becomes $M - r$, where $M = 5$ in this study. The rationale is that a large dilation rate allows long-range dependency to be captured, while local sequence correlation can be learned with a small dilation rate. As a result, the two branches establish both short- and long-range sequence correlation during training. In this study, we adopt the TCM in Fig. 2(b) as the basic unit for CME-Net and the DTCM in Fig. 2(c) for CSR-Net.

For the two-stage network, the following training strategy is applied. First, we separately train CME-Net with the loss:

$$\mathcal{L}_{cm} = \left\| |\tilde{S}^{cm}| - |S| \right\|_F^2. \quad (3)$$

Afterward, the pretrained CME-Net model is loaded and jointly optimized with CSR-Net:

$$\mathcal{L} = \mathcal{L}^{RI}_{cs} + \mathcal{L}^{Mag}_{cs} + \lambda \mathcal{L}_{cm}, \quad (4)$$

$$\mathcal{L}^{RI}_{cs} = \left\| \tilde{S}^{cs}_r - S_r \right\|_F^2 + \left\| \tilde{S}^{cs}_i - S_i \right\|_F^2, \quad (5)$$

$$\mathcal{L}^{Mag}_{cs} = \left\| \sqrt{|\tilde{S}^{cs}_r|^2 + |\tilde{S}^{cs}_i|^2} - \sqrt{|S_r|^2 + |S_i|^2} \right\|_F^2, \quad (6)$$

where $\mathcal{L}_{cm}$ and $\mathcal{L}^{*}_{cs}$ denote the loss functions of CME-Net and CSR-Net, respectively, and $\lambda$ refers to the loss weight coefficient, which is set to 0.1 in this paper. Note that two types of losses are considered for CSR-Net, namely the RI loss and the magnitude-based loss. The motivation can be explained from two aspects. Firstly, when the RI components are gradually optimized, magnitude consistency cannot be guaranteed, i.e.
, the estimated magnitude may deviate from its optimal optimization path [15]. Secondly, experiments reveal that when a magnitude constraint is imposed, a consistent PESQ improvement can be achieved [16], which is helpful for speech quality.

With the MSE as the loss function, despite the notable improvement of speech quality, the residual noise components may become very unnatural, which may degrade the subjective quality. To improve the naturalness of the speech enhanced by deep learning approaches, many loss functions have been proposed, such as the scale-invariant signal-to-distortion ratio (SI-SDR) [17], the perceptual metric for speech quality evaluation (PMSQE) [18], and the MSE with residual noise control [19]. These loss functions can make the residual noise sound more natural than the MSE, and thus they can somewhat improve the speech quality.

In this paper, inspired by [20, 21, 22], we use an extremely low-complexity deep learning approach similar to [21] to further suppress the residual noise in the output of pipeline 1. Instead of applying the deep-learning-based gain directly to the estimated clean speech spectrum of pipeline 1, we use this gain as an estimate of the speech presence probability (SPP) to recursively estimate the noise power spectral density (NPSD). With the estimated NPSD, the MMSE-LSA estimator is introduced to compute the final gain, which is then applied to suppress the residual noise. To further improve the robustness of the proposed PP scheme, we use a cepstrum-based preprocessing scheme to suppress the harmonic components before estimating the NPSD. By doing so, the over-estimation problem of the NPSD can be avoided in most cases.
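To make the data flow above concrete, here is a minimal numpy sketch of the two-stage coupling of Eqs. (1)-(2) and of the post-processing idea. `cme_net` and `csr_net` are stand-ins for the trained networks, and a Wiener-style gain stands in for the MMSE-LSA estimator (the cepstral preprocessing is omitted); this is an illustration under those assumptions, not the paper's implementation.

```python
import numpy as np

def two_stage_inference(noisy_spec, cme_net, csr_net):
    """Eqs. (1)-(2): coarse magnitude + noisy phase -> CCS,
    then a residual complex correction from the second stage."""
    mag_coarse = cme_net(np.abs(noisy_spec))                # Eq. (1)
    ccs = mag_coarse * np.exp(1j * np.angle(noisy_spec))    # couple noisy phase
    return ccs + csr_net(ccs, noisy_spec)                   # Eq. (2)

def pp_gain(noisy_psd, dnn_gain, noise_psd_prev, alpha=0.8, g_min=0.1):
    """Pipeline 2 idea: reuse the DNN gain as an SPP, recursively update
    the NPSD, then derive a suppression gain (Wiener gain used here as a
    simple stand-in for MMSE-LSA)."""
    spp = np.clip(dnn_gain, 0.0, 1.0)
    # Update the noise estimate mostly where speech is likely absent.
    smoothing = alpha + (1.0 - alpha) * spp
    noise_psd = smoothing * noise_psd_prev + (1.0 - smoothing) * noisy_psd
    snr_prio = np.maximum(noisy_psd / np.maximum(noise_psd, 1e-10) - 1.0, 0.0)
    gain = np.maximum(snr_prio / (1.0 + snr_prio), g_min)
    return gain, noise_psd
```

With `spp = 1` (confident speech presence) the noise estimate is frozen, while with `spp = 0` it tracks the noisy PSD with smoothing factor `alpha`; the gain floor `g_min` limits speech distortion.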
Fig. 1. Proposed processing pipelines for the ICASSP 2021 DNS Challenge. (a) Proposed two-stage framework with post-processing. (b) The network detail of CME-Net. (c) The network detail of CSR-Net.

Fig. 2. Comparisons among different types of TCMs; both norm and activation layers are omitted for illustration convenience. (a) Original TCM. (b) Proposed light-weight TCM. (c) Proposed light-weight dual TCM (DTCM).

3. EXPERIMENTS

3.1. Datasets

In this study, we first explored the performance of different models on the WSJ0-SI84 dataset [23] to validate the superiority of the proposed two-stage network. Then the model, together with the post-processing module, was trained and evaluated on the ICASSP 2021 DNS Challenge dataset to assess its performance in more complicated and realistic acoustic scenarios. For WSJ0-SI84, 5428 and 957 clean utterances from 77 speakers were selected to establish the training and validation sets, respectively. For the test set, 150 utterances were selected; note that the speakers in the test set are unseen during training. We randomly select 20,000 noises from the DNS Challenge corpus to form a 55-hour noise set for training. For testing, 4 challenging noises are selected, namely babble, cafe, and white from NOISEX92 [24], and factory1 from the CHIME3 dataset [25]. In this study, we create 50,000 and 4,000 noisy-clean pairs for training and validation, respectively, with SNR ranging from -5dB to 0dB. The total duration of the training set is around 100 hours. 5 SNRs are selected for model evaluation, namely -5dB, 0dB, 5dB, 10dB, and 15dB.

For the ICASSP 2021 DNS Challenge, relatively more complicated acoustic scenarios are considered than in the INTERSPEECH 2020 DNS Challenge, including reverberation, cross-language, emotional, and singing cases. However, many of the provided utterances are relatively noisy, which we found to heavily impact the training convergence of the network. As a result, we drop the utterances with obviously poor quality. In total, we generate a large noisy-clean training set of 517 hours, where about 65,000 provided noises are utilized and the SNR ranges from -5dB to 25dB.
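As a rough illustration of how a noisy-clean pair can be generated at a target SNR, one can scale the noise before mixing as below. This is a generic sketch; the challenge's own synthesis scripts differ in detail.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so that the clean-to-noise power ratio equals
    `snr_db`, then mix. Illustrative only."""
    if len(noise) < len(clean):                  # loop noise if too short
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[:len(clean)]
    clean_pow = np.mean(clean ** 2) + 1e-12
    noise_pow = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(clean_pow / (noise_pow * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise
```

Sampling `snr_db` uniformly from the stated range then yields the training mixtures.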
In addition, considering the reverberation effect in real environments, around 30% of the utterances are convolved with 100,000 provided synthetic and real room impulse responses (RIRs) before being mixed with different noise signals. The reverberation time T60 ranges from 0.3 to 1.3 seconds in this study.

https://github.com/microsoft/DNS-Challenge

All the utterances are sampled at 16kHz. A 20ms Hanning window is adopted with 50% overlap between consecutive frames, and a 320-point FFT is utilized to extract the spectral features. Both models are optimized with Adam [26]. When the first model is trained separately, the initial learning rate (LR) is set to 0.001. When the two models are jointly trained, the LRs are set to 0.001 and 0.0001, respectively. The batch size is set to 8 at the utterance level. Note that, to decrease the training time on the DNS Challenge dataset, we directly finetune the models pre-trained on WSJ0-SI84 to help the model adapt to the new dataset rapidly.
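The front-end settings above (16kHz sampling, 20ms window, 50% overlap, 320-point FFT) can be sketched as follows; `np.hanning` (a symmetric Hann window) is used here for simplicity, which may differ slightly from the authors' exact windowing.

```python
import numpy as np

def stft_ri(x, sr=16000, n_fft=320):
    """Framing + windowing + FFT with the paper's settings:
    20 ms Hann window, 50% overlap (10 ms hop), 320-point FFT
    -> 161 frequency bins per frame."""
    win_len = int(0.02 * sr)              # 320 samples
    hop = win_len // 2                    # 160 samples (50% overlap)
    window = np.hanning(win_len)
    n_frames = 1 + (len(x) - win_len) // hop
    frames = np.stack([x[i * hop:i * hop + win_len] * window
                       for i in range(n_frames)])
    spec = np.fft.rfft(frames, n=n_fft, axis=-1)
    return spec.real, spec.imag           # RI input features
```

For a 1-second utterance at 16kHz this yields 99 frames of 161 bins each.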
In this study, we compare the proposed two-stage network with five advanced baselines, namely CRN [27], DARCN [12], TCNN [28], GCRN [7], and DCCRN [29], which are described as follows:

• CRN: a causal convolutional recurrent network in the T-F domain. 5 convolutional and deconvolutional blocks are adopted as the encoder and decoder, respectively, and 2 LSTM layers with 1024 units are adopted for sequence modeling. Only the magnitude is estimated, while the noisy phase is kept unaltered. We keep the best configuration in [27]; the number of parameters is 17.58M.

• DARCN: a causal convolutional network in the T-F domain, which combines recursive learning and a dynamic attention mechanism. The current estimation output is fed back to the input, and the network is then reused to refine the estimation in the next stage. We keep the best configuration in [12]; the number of parameters is 1.23M.

• TCNN: a causal encoder-TCMs-decoder topology defined in the time domain. Raw waveforms serve as both input and output. We keep the same configuration as in [28]; the number of parameters is 5.06M.

• GCRN: an advanced version of CRN, where both magnitude and phase are estimated. It has a similar topology to CRN, except that two decoders are utilized for RI estimation. We keep the best configuration in [7]; the number of parameters is 9.06M.

• DCCRN: the first-ranked model in the real-time track of the INTERSPEECH 2020 DNS Challenge, where complex-valued operations are applied to the CNN and RNN, and SI-SDR is utilized as the loss function. We keep the best configuration in [29]; the number of parameters is 3.72M.

• TSCN: for the encoder and decoder parts, 5 (de)convolutional blocks are set, where the number of intermediate channels in each layer is 64, and the kernel size and stride are (2, 3) and (1, 2) in the time and frequency axes. For CME-Net, 18 light-weight TCMs are utilized for sequence learning, while 12 DTCMs are used for CSR-Net (6 (D)TCMs form a group, where the dilation rates within each group are (1, 2, 4, 8, 16, 32)). The number of parameters is 1.96M for CME-Net and 4.99M for TSCN.

Table 1. Objective results in terms of PESQ on the WSJ0-SI84 dataset. BOLD denotes the best result in each case; "Cau." denotes whether the system is a causal implementation.

Model | Cau. | -5dB | 0dB | 5dB | 10dB | 15dB | Avg.
Noisy | - | 1.51 | 1.84 | 2.18 | 2.54 | 2.88 | 2.19

Table 2. Objective results in terms of ESTOI (in %) on the WSJ0-SI84 dataset. BOLD denotes the best result in each case.

Model | Cau. | -5dB | 0dB | 5dB | 10dB | 15dB | Avg.
Noisy | - | 29.85 | 44.92 | 59.74 | 73.49 | 84.73 | 58.55
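The (D)TCM stacking described for TSCN implies a large temporal receptive field, which can be computed as below. The kernel size of 3 along time is an assumption for illustration, not a value stated in the paper.

```python
def tcm_receptive_field(n_groups, rates=(1, 2, 4, 8, 16, 32), ksize=3):
    """Receptive field (in frames) of causal dilated convolutions stacked
    as `n_groups` groups with the dilation pattern used for the (D)TCMs.
    ksize=3 is an assumed kernel size, not given in the paper."""
    rf = 1
    for _ in range(n_groups):
        for d in rates:
            rf += (ksize - 1) * d   # each layer widens the field by (k-1)*d
    return rf

# CME-Net: 18 TCMs = 3 groups; CSR-Net: 12 DTCMs = 2 groups.
```

Under this assumption, one group covers 127 frames and three groups cover 379 frames, i.e., roughly 3.8 seconds of context at a 10ms hop.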
4. RESULTS AND ANALYSIS

4.1. Objective comparisons
We utilize two objective metrics to evaluate the performance of different models, namely PESQ [30] and ESTOI [31], which are closely related to human perceptual quality and intelligibility. The results are reported in Tables 1 and 2, from which one can observe the following phenomena. Firstly, the proposed TSCN notably surpasses the other baselines in both PESQ and ESTOI. For example, compared with DCCRN, a state-of-the-art approach, TSCN outperforms it by 0.14 in PESQ and 1.77% in ESTOI on average, indicating the superior performance of the proposed method. Secondly, the advantage of TSCN is more pronounced at low SNRs. For example, at -5dB, TSCN achieves about 0.23 PESQ improvement over DCCRN, whereas at a high SNR like 15dB the PESQ scores are similar. Thirdly, when PP is applied, performance in both metrics decreases. This is because PP is set to suppress some unpleasant residual noise components, so some speech components with low energy may also be cancelled. Nonetheless, we believe it is beneficial to implement PP, as the network may generate some "fake" spectral components under low SNR conditions, which sound unpleasant. We expect PP to effectively suppress these negative effects and improve the subjective quality, which will be validated in the next subsection. Overall, the proposed TSCN obtains impressive performance in objective metrics, motivating us to utilize it with PP for the DNS Challenge evaluation.
To verify the impact of PP, we conduct an AB subjective test, similar to the procedure in [19]. 10 volunteers are involved in the test. We randomly select 10 utterances from the DNS blind test set, including 2 emotional, 3 English, 3 non-English, and 2 singing speech signals. Two processed versions are provided, namely TSCN and TSCN-PP, and the volunteers are required to select the item with the better subjective quality; an "Equal" option is also provided if no decision can be made. The results are illustrated in Fig. 3. Compared with TSCN, after PP is applied, a consistent subjective preference is achieved. This indicates the gap between objective metrics and subjective opinion, i.e., despite the notable degradation in PESQ due to the spectral information lost with PP, a consistent subjective preference is still achieved, as most of the unnatural residual noise is suppressed. Interestingly, this conclusion is consistent with the study in [32].

Fig. 3. Subjective evaluation test between TSCN and TSCN-PP. An "Equal" option is also provided if no decision can be made.

Table 3. Subjective evaluation with the P.808 criterion on the DNS Challenge.

Model | Track | Singing | Tonal | Non-English | English | Emotional | Overall
Noisy | - | 2.96 | 3.00 | 2.96 | 2.80 | 2.67 | 2.86
NSnet2 (baseline) | RT | 3.10 | 3.25 | 3.28 | 3.30 | 2.88 | 3.21
TSCN-PP (Pro.) | RT |
In Table 3, we present the subjective results of the submission under the ITU-T P.808 criterion [9], as provided by the organizer. One can find that our method outperforms the baseline model by 0.17 in overall MOS. Besides, the proposed method also achieves impressive performance in some unusual scenarios, like singing, tonal, and emotional speech, which are found to be considerably more difficult to handle than conventional speech cases.

Finally, we evaluate the processing latency of the algorithm. In this study, the window size is T = 20ms, with an overlap of T_s = 10ms between consecutive frames. As a result, the algorithmic delay is T_d = T + T_s = 30ms, which meets the latency requirement. Note that no future information is utilized in this study, i.e., the system is strictly causal. We also calculate the processing time: the average processing time per frame is 4.80ms for TSCN-PP and 3.84ms for TSCN, tested on an Intel i5-4300U PC. Note that although a two-stage network is applied, as we restrict the number of convolutional channels to 64 in each encoder and decoder, and meanwhile LSTMs are replaced by parallelizable TCMs, the inference efficiency is still guaranteed.
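The latency accounting above reduces to a small calculation; the real-time-factor helper is our addition for context, not a quantity reported in the paper.

```python
def algorithmic_delay_ms(win_ms=20.0, hop_ms=10.0):
    """T_d = T + T_s: one full analysis window plus one hop."""
    return win_ms + hop_ms

def real_time_factor(proc_ms_per_frame, hop_ms=10.0):
    """RTF < 1 means each 10 ms hop is processed faster than real time."""
    return proc_ms_per_frame / hop_ms
```

With the reported per-frame times, the RTF is about 0.48 for TSCN-PP and 0.38 for TSCN, comfortably below real time.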
5. CONCLUSIONS
In this challenge, we propose a novel system for denoising, which consists of a two-stage network and a low-complexity post-processing module. For the two-stage network, we decouple the optimization of magnitude and phase, i.e., the magnitude is first coarsely estimated, followed by a second network that refines the phase information. To obtain better subjective quality, we also propose a light-weight post-processing module to further suppress the remaining unnatural residual noise, which usually arises when the test set mismatches the training conditions. Subjective results showed that the proposed algorithm ranked first in MOS for real-time track 1 of the ICASSP 2021 DNS Challenge.

6. REFERENCES

[1] P. C. Loizou,
Speech Enhancement: Theory and Practice, CRC Press, 2013.
[2] Y. Xu, J. Du, L.-R. Dai, and C.-H. Lee, "A regression approach to speech enhancement based on deep neural networks," IEEE/ACM Trans. Audio Speech Lang. Proc., vol. 23, no. 1, pp. 7–19, 2014.
[3] D. Wang and J. Chen, "Supervised speech separation based on deep learning: An overview," IEEE/ACM Trans. Audio Speech Lang. Proc., vol. 26, no. 10, pp. 1702–1726, 2018.
[4] D. Wang and J. Lim, "The unimportance of phase in speech enhancement," IEEE Trans. Acoust. Speech Signal Process., vol. 30, no. 4, pp. 679–681, 1982.
[5] K. Paliwal, K. Wójcicki, and B. Shannon, "The importance of phase in speech enhancement," Speech Commun., vol. 53, no. 4, pp. 465–494, 2011.
[6] D. Williamson and D. Wang, "Time-frequency masking in the complex domain for speech dereverberation and denoising," IEEE/ACM Trans. Audio Speech Lang. Proc., vol. 25, no. 7, pp. 1492–1501, 2017.
[7] K. Tan and D. Wang, "Learning complex spectral mapping with gated convolutional recurrent networks for monaural speech enhancement," IEEE/ACM Trans. Audio Speech Lang. Proc., vol. 28, pp. 380–390, 2020.
[8] A. Défossez, G. Synnaeve, and Y. Adi, "Real time speech enhancement in the waveform domain," in Proc. of Interspeech 2020, pp. 3291–3295, 2020.
[9] C. Reddy, H. Dubey, V. Gopal, R. Cutler, S. Braun, H. Gamper, R. Aichner, and S. Srinivasan, "ICASSP 2021 Deep Noise Suppression Challenge," arXiv preprint arXiv:2009.06122, 2020.
[10] W. Yu, Z. Huang, W. Zhang, L. Feng, and N. Xiao, "Gradual network for single image de-raining," in Proc. of ACMM, 2019, pp. 1795–1804.
[11] A. Li, M. Yuan, C. Zheng, and X. Li, "Speech enhancement using progressive learning-based convolutional recurrent neural network," Appl. Acoust., vol. 166, pp. 107347, 2020.
[12] A. Li, C. Zheng, C. Fan, R. Peng, and X. Li, "A recursive network with dynamic attention for monaural speech enhancement," in Proc. of Interspeech 2020, 2020.
[13] Y. Zhu, X. Xu, and Z. Ye, "FLGCNN: A novel fully convolutional neural network for end-to-end monaural speech enhancement with utterance-based objective functions," Appl. Acoust., vol. 170, pp. 107511, 2020.
[14] S. Bai, J. Kolter, and V. Koltun, "An empirical evaluation of generic convolutional and recurrent networks for sequence modeling," arXiv preprint arXiv:1803.01271, 2018.
[15] S. Wisdom, J. Hershey, K. Wilson, J. Thorpe, M. Chinen, B. Patton, and R. Saurous, "Differentiable consistency constraints for improved deep speech enhancement," in Proc. of ICASSP. IEEE, 2019, pp. 900–904.
[16] Z. Wang, P. Wang, and D. Wang, "Complex spectral mapping for single- and multi-channel speech enhancement and robust ASR," IEEE/ACM Trans. Audio Speech Lang. Proc., vol. 28, pp. 1778–1787, 2020.
[17] J. Le Roux, S. Wisdom, H. Erdogan, and J. R. Hershey, "SDR – half-baked or well done?," in Proc. of ICASSP, 2019, pp. 626–630.
[18] J. M. Martín-Doñas, A. M. Gomez, J. A. Gonzalez, and A. M. Peinado, "A deep learning loss function based on the perceptual evaluation of the speech quality," IEEE Signal Process. Lett., vol. 25, no. 11, pp. 1680–1684, 2018.
[19] A. Li, R. Peng, C. Zheng, and X. Li, "A supervised speech enhancement approach with residual noise control for voice communication," Appl. Sci., vol. 10, no. 8, pp. 2894, 2020.
[20] M. Tammen, D. Fischer, B. T. Meyer, and S. Doclo, "DNN-based speech presence probability estimation for multi-frame single-microphone speech enhancement," in Proc. of ICASSP, 2020, pp. 191–195.
[21] J.-M. Valin, "A hybrid DSP/deep learning approach to real-time full-band speech enhancement," in Proc. of MMSP. IEEE, 2018, pp. 1–5.
[22] X. Hu, S. Wang, C. Zheng, and X. Li, "A cepstrum-based preprocessing and postprocessing for speech enhancement in adverse environments," Appl. Acoust., vol. 74, no. 12, pp. 1458–1462, 2013.
[23] D. Paul and J. Baker, "The design for the Wall Street Journal-based CSR corpus," in Workshop on Speech and Natural Language, 1992, pp. 357–362.
[24] A. Varga and H. Steeneken, "Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems," Speech Commun., vol. 12, no. 3, pp. 247–251, 1993.
[25] J. Barker, R. Marxer, E. Vincent, and S. Watanabe, "The third 'CHiME' speech separation and recognition challenge: Dataset, task and baselines," in Proc. of ASRU. IEEE, 2015, pp. 504–511.
[26] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[27] K. Tan and D. Wang, "A convolutional recurrent neural network for real-time speech enhancement," in Proc. of Interspeech, 2018, pp. 3229–3233.
[28] A. Pandey and D. Wang, "TCNN: Temporal convolutional neural network for real-time speech enhancement in the time domain," in Proc. of ICASSP. IEEE, 2019, pp. 6875–6879.
[29] Y. Hu, Y. Liu, S. Lv, M. Xing, S. Zhang, Y. Fu, J. Wu, B. Zhang, and L. Xie, "DCCRN: Deep complex convolution recurrent network for phase-aware speech enhancement," in Proc. of Interspeech 2020, 2020, pp. 2472–2476.
[30] A. Rix, J. Beerends, M. Hollier, and A. Hekstra, "Perceptual evaluation of speech quality (PESQ) - a new method for speech quality assessment of telephone networks and codecs," in Proc. of ICASSP. IEEE, 2001, vol. 2, pp. 749–752.
[31] J. Jensen and C. Taal, "An algorithm for predicting the intelligibility of speech masked by modulated noise maskers," IEEE/ACM Trans. Audio Speech Lang. Proc., vol. 24, no. 11, pp. 2009–2022, 2016.
[32] J.-M. Valin, U. Isik, N. Phansalkar, R. Giri, K. Helwani, and A. Krishnaswamy, "A perceptually-motivated approach for low-complexity, real-time enhancement of fullband speech," in Proc. of Interspeech 2020, 2020.