TOWARDS EFFICIENT MODELS FOR REAL-TIME DEEP NOISE SUPPRESSION
Sebastian Braun, Hannes Gamper, Chandan K. A. Reddy, Ivan Tashev
Microsoft Research, Redmond, WA, [email protected]
ABSTRACT
With recent research advancements, deep learning models are becoming attractive and powerful choices for speech enhancement in real-time applications. While state-of-the-art models can achieve outstanding results in terms of speech quality and background noise reduction, the main challenge is to obtain compact models that are resource efficient during inference time. An important but often neglected aspect of data-driven methods is that results can only be convincing when tested on real-world data and evaluated with useful metrics. In this work, we investigate reasonably small recurrent and convolutional-recurrent network architectures for speech enhancement, trained on a large dataset that also considers reverberation. We show interesting tradeoffs between computational complexity and the achievable speech quality, measured on real recordings using a highly accurate MOS estimator. We show that the achievable speech quality is a function of network complexity, and point out which models have the better tradeoffs.
Index Terms — speech enhancement, noise reduction, convolutional recurrent neural network, efficient neural networks
1. INTRODUCTION
Speech enhancement using neural networks has seen large attention in research in recent years [1] and is starting to be deployed in commercial human-to-human communication applications. While the trend in research still largely follows the trajectory of developing larger networks to further improve performance and quality, for real-world applications the opposite direction is of much higher interest: How to obtain the best speech quality given a maximum computational budget?

Running current state-of-the-art noise suppression neural networks is still challenging on resource-limited devices, where noise suppression is often only a small fraction among several other tasks running on the device, such as other audio processing tasks, video, encoding, transmission, etc. Earlier network architectures were mainly recurrent neural network (RNN) structures, which were believed promising in terms of efficiency due to their temporal modeling capabilities [2–4]. While such models seem to hit a performance saturation, the use of convolutional recurrent networks (CRNs) and convolutional neural networks (CNNs) raised the performance, but resulted in the development of enormously large architectures [5–8] that are impractical to run on typical edge devices like consumer laptops, mobile phones, or even less powerful devices like wearables or hearing aids. Efficient models are also obtained by building as much prior knowledge into the models as possible, rather than trying to learn well-understood blocks, such as time-frequency transforms, from scratch. While time-domain networks such as [9] could in theory yield superior performance to frequency-domain (FD) networks, proof of generalization to real data in reverberant environments and real recordings has not yet been shown [10]. Therefore, we stick in this work to FD implementations.

To draw valid and general conclusions from our experiments, we train on large-scale data simulating the most important aspects of real-world data, such as reverberation, many different speakers, a vast amount of noise types, and varying microphone signal levels. We propose a powerful data generation and augmentation pipeline that deals with reverberant and non-reverberant speech and reduces heavy reverberation, using signal-based estimates of reverberation parameters. Results are shown on real recordings of a public dataset using a deep neural network (DNN) based MOS predictor that has shown high correlation to subjective ratings in practice.

In this work, we compare RNN with CRN architectures and show which network parts can be scaled, removed, or replaced by more efficient modules, at which gains in complexity and which loss in quality. Specifically, we investigate the influence of RNN size, type, and the use of disconnected parallel RNNs. For CRNs with a symmetric convolutional encoder/decoder structure, we investigate the convolution layers, spectral vs. spectro-temporal convolutions, and skip connections. As a result, we propose an efficient CRN structure with around 4-5 times fewer computational operations at similar quality compared to previously proposed CRNs.
2. ENHANCEMENT SYSTEM AND TRAINING OBJECTIVE
We use spectral suppression-based enhancement systems due to their robust generalization, logical interpretation and control, and easier integration with existing speech processing algorithms. The input features to the networks are log power spectra. The network predicts a real-valued, time-varying suppression gain per time-frequency bin, which is applied to the complex input spectrum and transformed back to the time domain, as shown in the upper branch of Fig. 1. To compute a single frame, the network requires only the features of the current frame or, when using causal convolutions, also several past frames. Therefore, the algorithmic delay of the system depends only on the short-time Fourier transform (STFT) window size.
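To make the data flow concrete, the following is a minimal numpy/scipy sketch of this suppression pipeline, using the STFT parameters given later in Section 4; the `model` callable stands in for any of the trained networks and is an assumption for illustration, not code from this paper.

```python
import numpy as np
from scipy.signal import stft, istft

def enhance(noisy, model, fs=16000):
    """Suppression-based enhancement (upper branch of Fig. 1):
    log-power features in, a real gain per T-F bin out."""
    win = np.sqrt(np.hanning(320))                 # sqrt-Hann, 20 ms at 16 kHz
    _, _, X = stft(noisy, fs, window=win, nperseg=320, noverlap=160)
    features = np.log10(np.abs(X) ** 2 + 1e-12)    # log power spectra
    gain = model(features)                         # placeholder DNN, gains in [0, 1]
    _, enhanced = istft(gain * X, fs, window=win, nperseg=320, noverlap=160)
    return enhanced
```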
Fig. 1. Enhancement system and training with STFT consistency and level-invariant loss [11].

Fig. 2. CRUSE network architecture with L encoder/decoder layers and a bottleneck with P parallel recurrent layers.

We train the networks enforcing STFT consistency [8] by propagating the FD output through reconstruction and another STFT to compute a FD loss, as shown in Fig. 1. This preserves the flexibility of integrating the network with other FD algorithms and of offloading the STFT operations to optimized implementations. As shown in Fig. 1, each training sequence, i.e. the predicted and target signals, is normalized by the active target utterance level to ensure balanced optimization for signal-level dependent losses [11].

We train on the complex compressed mean-squared error (MSE) loss [12], blending a magnitude-only with a phase-aware term, which we found to be superior to other losses for reverberant speech enhancement [13]. The loss per sequence is given by

$$\mathcal{L} = (1-\lambda) \sum_{k,n} \Big| |S|^c - |\hat{S}|^c \Big|^2 + \lambda \sum_{k,n} \Big| |S|^c e^{j\varphi_S} - |\hat{S}|^c e^{j\varphi_{\hat{S}}} \Big|^2, \qquad (1)$$

where $c = 0.3$ is a compression factor, $\lambda = 0.3$ [13] is a weighting between the complex and magnitude-based terms, and we omitted the dependency of the target speech spectral bins $S(k, n)$ on the frequency and time indices $k, n$ for brevity.

The networks are trained in batches of 10 sequences of 10 s length using the AdamW optimizer [14], a learning rate of $8 \cdot 10^{-5}$, and a weight decay of 0.1. The best model is picked based on the validation metric, using a heuristic optimization criterion Q combining perceptual evaluation of speech quality (PESQ) [15], scale-invariant signal-to-distortion ratio (siSDR) [16], and cepstral distance (CD) [17]:

$$Q = \mathrm{PESQ} + 0.2 \cdot \mathrm{siSDR} - \mathrm{CD}. \qquad (2)$$
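A minimal PyTorch sketch of the loss in Eq. (1) could look as follows, assuming level-normalized complex STFTs of estimate and target; the tensor layout and function name are our assumptions, not from the paper's code.

```python
import torch

def compressed_spectral_loss(S_hat, S, c=0.3, lam=0.3):
    """Complex compressed MSE of Eq. (1): blends a magnitude-only term
    with a phase-aware term on magnitude-compressed spectra.
    S_hat, S: complex tensors of shape (batch, freq, time)."""
    mag_hat, mag = S_hat.abs() ** c, S.abs() ** c
    loss_mag = ((mag - mag_hat) ** 2).sum()
    # re-attach the original phases to the compressed magnitudes
    cplx_hat = torch.polar(mag_hat, S_hat.angle())
    cplx = torch.polar(mag, S.angle())
    loss_cplx = ((cplx - cplx_hat).abs() ** 2).sum()
    return (1 - lam) * loss_mag + lam * loss_cplx
```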
3. NETWORK ARCHITECTURES
In this section, we describe RNN and CRN architectures and modify them to improve efficiency. All models use the same features, prediction targets, loss, and training strategy described in Section 2.
The network proposed in [11], referred to as NSnet2, consists only of fully connected (FC) and gated recurrent unit (GRU) [18] layers in the format FC-GRU-GRU-FC-FC-FC. All FC layers are followed by rectified linear unit (ReLU) activations, except the last layer, which has sigmoid activations to predict a constrained suppression gain. The standard layer dimensions are 400 for GRUs and 600 for FC layers, i.e. 400-400-400-600-600-K, but we also investigate different configurations. The input and output dimensions are the number of frequency bins K.

Fig. 3. Skip connections by a) doubling the decoder input channels, b) addition. We found inserting 1×1 convolutions in the add-skip connections useful.
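Interpreting this description, a compact PyTorch sketch of the NSnet2 topology might look as follows; anything beyond the stated layer order, widths, and activations is an assumption.

```python
import torch.nn as nn

class NSnet2(nn.Module):
    """Sketch of the FC-GRU-GRU-FC-FC-FC topology with ReLU activations
    and a final sigmoid gain, default widths 400-400-400-600-600-K."""
    def __init__(self, K=161, rnn=400, fc=600):
        super().__init__()
        self.fc_in = nn.Sequential(nn.Linear(K, rnn), nn.ReLU())
        self.gru = nn.GRU(rnn, rnn, num_layers=2, batch_first=True)
        self.fc_out = nn.Sequential(
            nn.Linear(rnn, fc), nn.ReLU(),
            nn.Linear(fc, fc), nn.ReLU(),
            nn.Linear(fc, K), nn.Sigmoid())

    def forward(self, x):          # x: (batch, time, K) log power spectra
        y, _ = self.gru(self.fc_in(x))
        return self.fc_out(y)      # suppression gain per T-F bin
```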
The second network is a CRN U-Net structure derived from [7], referred to in the remainder as Convolutional Recurrent U-net for Speech Enhancement (CRUSE). As shown in Fig. 2, the structure has L symmetric convolutional and deconvolutional encoder and decoder layers with kernels of size (2,3) in the time and frequency dimensions. The convolution kernels move with a stride of (1,2), i.e. they downsample the features along the frequency axis efficiently, while the number of channels C_ℓ for layers ℓ = {1, ..., L} increases per encoder layer and decreases mirrored in the decoder. In this work, the input and output channels are C_in = C_out = 1, but they can be extended to e.g. take complex values or multiple features as multiple channels. Convolutional layers are followed by leaky ReLU activations, while the last layer uses sigmoid. Between encoder and decoder sits a recurrent layer, which is fed with all features flattened along the channels. In [7], a stack of two long short-term memory (LSTM) layers was proposed at this stage. As will be shown in our experimental results in Section 5, replacement by a single GRU layer yields very little performance loss, but huge computational savings; a GRU saves 25% complexity compared to an LSTM layer. Two further modifications are addressed in the following two paragraphs.
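Abstracting this description into code, a simplified PyTorch sketch of a CRUSE-like model (CRUSE4-128 sizing) follows. The causal padding and reshape details are our assumptions, and the skip connections and RNN grouping discussed in the next paragraphs are omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CRUSE(nn.Module):
    """Simplified CRUSE sketch: L=4 causal (2,3) convolutions with
    stride (1,2) over frequency, a single GRU bottleneck on the
    flattened channels*freq features, and a mirrored transposed-conv
    decoder ending in a sigmoid gain."""
    def __init__(self, freq_bins=161, channels=(1, 16, 32, 64, 128)):
        super().__init__()
        self.enc = nn.ModuleList(
            nn.Conv2d(channels[i], channels[i + 1], (2, 3),
                      stride=(1, 2), padding=(0, 1))   # freq pad keeps 2x downsampling exact
            for i in range(4))
        f = freq_bins
        for _ in range(4):
            f = (f - 1) // 2 + 1                       # 161 -> 81 -> 41 -> 21 -> 11
        self.gru = nn.GRU(channels[-1] * f, channels[-1] * f, batch_first=True)
        self.dec = nn.ModuleList(
            nn.ConvTranspose2d(channels[i + 1], channels[i], (2, 3),
                               stride=(1, 2), padding=(0, 1))
            for i in reversed(range(4)))

    def forward(self, x):                              # x: (batch, 1, time, freq)
        for conv in self.enc:
            x = F.leaky_relu(conv(F.pad(x, (0, 0, 1, 0))))  # causal: pad 1 past frame
        b, c, t, f = x.shape
        y, _ = self.gru(x.permute(0, 2, 1, 3).reshape(b, t, c * f))
        x = y.reshape(b, t, c, f).permute(0, 2, 1, 3)
        for i, deconv in enumerate(self.dec):
            x = deconv(x)[:, :, 1:, :]                 # trim to the original frame count
            x = torch.sigmoid(x) if i == len(self.dec) - 1 else F.leaky_relu(x)
        return x                                       # gain in [0, 1]
```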
Parallel RNN grouping: As will be shown in Section 5, the performance of both CRUSE and NSnet2 scales directly with the bottleneck size, i.e. the width R of the central RNN layer(s). However, the complexity of RNN layers scales with R², making wide RNNs computationally unattractive. Therefore, we adopt the technique proposed in [19] and group the wide fully connected RNNs into P disconnected parallel RNNs, still yielding the same forward information flow as shown in Fig. 2. We denote by P the number of parallel GRUs, where P = 1 means the last convolutional encoder output is flattened to a single vector and fed to a single GRU, while for P > 1, the encoder output is reshaped to P vectors of the same length, fed through P disconnected GRUs, and reshaped again to the number of decoder channels C_L. Another practical advantage is the possible parallel execution of the disconnected RNNs.
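A minimal sketch of this grouping: splitting a width-R bottleneck into P disconnected GRUs reduces the recurrent weight count from roughly 6R² to 6R²/P, while the chunk-and-concatenate layout keeps the overall forward information flow of Fig. 2.

```python
import torch
import torch.nn as nn

class GroupedGRU(nn.Module):
    """Parallel RNN grouping in the spirit of [19]: split the flattened
    bottleneck vector into P chunks, run P disconnected (and thus
    parallelizable) small GRUs, and concatenate their outputs."""
    def __init__(self, width, P):
        super().__init__()
        assert width % P == 0
        self.grus = nn.ModuleList(
            nn.GRU(width // P, width // P, batch_first=True) for _ in range(P))

    def forward(self, x):                        # x: (batch, time, width)
        chunks = x.chunk(len(self.grus), dim=-1)
        outs = [gru(c)[0] for gru, c in zip(self.grus, chunks)]
        return torch.cat(outs, dim=-1)
```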
Skip connections: As shown in Fig. 2, each convolutional encoder layer is connected to its corresponding decoder layer by a skip connection. In [19], skip connections between corresponding encoder and decoder layers have been implemented by concatenating the encoder output to the corresponding decoder input along the channel dimension, as shown in Fig. 3a). This doubles the number of decoder input channels, resulting in higher complexity. More efficient skip connections are implemented by simply adding the encoder outputs to the decoder inputs, at the cost of a minor performance degradation. We found that adding a trainable channel-wise scaling and bias in the add-skip connections, which can be implemented as convolutions with C_ℓ channels and (1,1) kernels as in Fig. 3b) and is therefore very cheap, improves the performance at negligible additional cost.
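A sketch of this add-skip variant follows; reading "channel-wise scaling and bias" as a depthwise (groups=channels) 1×1 convolution is our interpretation.

```python
import torch.nn as nn

class ConvSkip(nn.Module):
    """Add-skip of Fig. 3b) with a trainable channel-wise scale and
    bias: a (1,1) depthwise convolution on the encoder output,
    added to the decoder input."""
    def __init__(self, channels):
        super().__init__()
        self.scale = nn.Conv2d(channels, channels, kernel_size=(1, 1),
                               groups=channels)   # per-channel weight + bias

    def forward(self, decoder_in, encoder_out):
        return decoder_in + self.scale(encoder_out)
```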
Fig. 4. Training data generation: Reverberant speech is used as is, while non-reverberant speech is augmented with RIRs, and the training targets are created using shaped RIRs.
4. EXPERIMENTAL SETUP

4.1. Dataset
We use a large-scale synthetic training set and test on real recordings to ensure generalization of our results to real-world signals. The training set uses 544 h of high mean opinion score (MOS) rated speech recordings from the LibriVox corpus, 247 h of noise recordings from Audioset, Freesound, internal noise recordings, and 1 h of colored stationary noise. Except for the 65 h of internal noise recordings, the data is publicly available as part of the 2nd DNS challenge (https://github.com/microsoft/DNS-Challenge). We estimated T60 and C50 for each speech file using [20, 21] and classified a file as reverberant if its estimated T60 exceeded and its C50 fell below fixed thresholds.

Our data generation pipeline, outlined in Fig. 4, is described in the following. While already reverberant speech files are mixed with noise as is, non-reverberant speech files were augmented with acoustic impulse responses randomly drawn from a set of 7000 measured and simulated responses from several public and internal databases. 20% of the non-reverberant speech is not reverb augmented, to represent conditions such as close-talk microphones or professionally recorded speech. To obtain natural sounding, low-reverberant, and time-aligned target speech signals, the reverberant impulse responses were shaped to a maximum decay of T_max as shown in Fig. 4. The weighting function (shown as a red line in the reverb shaping block) is an exponential decay with the desired reverberation time [22], starting at the direct sound t_0 of the room impulse response (RIR):

$$w_\mathrm{RIR}(t) = \begin{cases} \exp\left(-\dfrac{t - t_0}{T_\mathrm{max}}\right), & t \geq t_0 \\ 1, & t < t_0 \end{cases} \qquad (3)$$
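The shaping of Eq. (3) is a one-line window in practice; a numpy sketch follows, where the T_max value and the peak-picking of the direct sound are our assumptions for illustration.

```python
import numpy as np

def shape_rir(rir, fs, t_max=0.2):
    """Apply the exponential decay window of Eq. (3) to an RIR,
    starting at the direct sound, to create a low-reverberant target.
    t_max (here 0.2 s) and locating the direct sound at the RIR peak
    are illustrative assumptions."""
    t0 = np.argmax(np.abs(rir)) / fs          # direct sound arrival time
    t = np.arange(len(rir)) / fs
    w = np.where(t < t0, 1.0, np.exp(-(t - t0) / t_max))
    return rir * w
```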
To generate noisy training data, random speech and noise portions were selected to form training clips of 10 s length. Each speech and noise segment is level normalized and concatenated with other clips if its duration is too short. The reverberant speech segments are then generated as described before, shown in the green box in Fig. 4, providing the reverberant speech and non-reverberant target speech signals. The reverberant speech and noise are mixed at a signal-to-noise ratio (SNR) drawn from a Gaussian distribution N(5, 10) dB. The resulting mixture signals are re-scaled to levels distributed as N(−28, 10) dBFS. The speech targets are scaled accordingly with the same factors. The optional spectral augmentation was not used here due to the large amount of raw data. Using this pipeline, we created an augmented dataset of roughly 1000 h.

For training monitoring and hyper-parameter tuning, we generated a synthetic validation set in the same way as above, using speech from the DAPS dataset (https://ccrma.stanford.edu/~gautham/Site/daps.html) and RIRs and noise from the QUT database (https://research.qut.edu.au/saivt/databases/qut-noise-databases-and-protocols/). The final test results are shown on the public development set of the Interspeech 2020 DNS challenge [23], consisting of 400 real recordings and 300 synthetic mixtures from unseen datasets.

4.2. Evaluation metrics

Evaluating speech enhancement algorithms is a complex task, which has led to the development of various objective metrics, while subjective ratings are still the gold standard. Recently, DNNs have been developed to predict MOS [24, 25]. While we evaluated most of the proposed models using crowd-sourced ITU-T P.808 tests, we show only the predicted DNSMOS [25], for consistency of presentation and due to space constraints. Nevertheless, all rankings and trends were coherent across crowd-sourced MOS, DNSMOS, and intrusive objective metrics like PESQ, CD, and siSDR on the synthetic validation set.

For all models, we relate their audio quality to an estimate of the computational complexity during inference in terms of multiply-accumulate (MAC) operations. Note that we count only the operations related to applying the weights and biases, which usually contribute the major computational burden. We do not account for applying activation functions, and also omit feature extraction, STFT, and enhancement operations, which are common to all models and negligible compared to the burden of the DNN models.
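As an illustration of this weights-and-biases counting convention, the per-frame MACs of the two dominant layer types can be estimated as follows; these are standard back-of-the-envelope formulas, not the paper's exact accounting code.

```python
def gru_macs(input_size, hidden_size):
    """Approximate MACs per frame for a GRU layer: three gates, each
    with an input and a recurrent matrix-vector product (biases and
    pointwise ops ignored, per the convention above)."""
    return 3 * (input_size * hidden_size + hidden_size * hidden_size)

def conv2d_macs(c_in, c_out, kernel, out_freq_bins):
    """MACs per frame for a (time, freq) convolution layer: one kernel
    application per output channel and output frequency bin."""
    kt, kf = kernel
    return c_in * c_out * kt * kf * out_freq_bins

# e.g. a 400-unit GRU on 400 inputs: ~0.96M MACs per 10 ms frame
print(gru_macs(400, 400) / 1e6)
```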
4.3. Implementation details

We use a sampling frequency of 16 kHz, an STFT with 50% overlapping square-root Hann windows of 20 ms length, and a corresponding FFT size of 320 points. The inputs to the networks are 161-dimensional log power spectra. We parameterize NSnet2 models, denoted NSnet2-R, where R is the number of GRU nodes per layer. We parameterize CRUSE with different encoder/decoder sizes, always starting with C_1 = 16 channels and doubling the channels in each layer. CRUSE models are denoted CRUSE L-C_L-N xRNN P, where L is the number of encoder/decoder layers, the last encoder layer's filter count C_L can vary to scale the RNN layer width, N is the number of RNN layers, and P is the number of parallel RNNs. For example, CRUSE4-120-1xGRU4 has 4 encoder/decoder layers with filters 16-32-64-120, and 1 layer of 4 parallel GRUs. Convolution kernels are always (2,3), unless denoted explicitly as 1D convolutions with (1,3) kernels operating only across frequency.
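As a quick sanity check of this parameterization, the feature dimension follows directly from the FFT size:

```python
import numpy as np

fs = 16000
win_len = int(0.02 * fs)                 # 20 ms window -> 320 samples
hop = win_len // 2                       # 50% overlap -> 10 ms hop
window = np.sqrt(np.hanning(win_len))    # sqrt-Hann analysis/synthesis window
n_bins = 320 // 2 + 1                    # one-sided 320-point FFT -> 161 bins
print(hop, n_bins)                       # 160 161
```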
5. RESULTS
Figure 5 shows the tradeoff between computational complexity in terms of MACs vs. the predicted overall audio quality using DNSMOS [25] for several NSnet2 and CRUSE models, and other baselines. In Fig. 5, all CRUSE models use add skips without 1×1 conv blocks unless denoted explicitly. The first baseline is a classic noise suppressor (classicNS) exploiting only the stationarity of noise [26]. While this method unsurprisingly achieves only a minor MOS improvement of around 0.12 on the test set, which includes many highly non-stationary noise types, its computational
burden, in the order of a few 1000 MACs, is a tiny fraction of even our smallest models.

Fig. 5. MOS improvement vs. computational complexity (MACs).

For direct comparison of the following DNN-based baselines, we only took the architectures, but trained them using the same features, prediction targets, loss, and training procedure as for all other networks in this paper: i)
NSnet [4] yields similar MOS to the classicNS. ii) The fully convolutional architecture proposed in [27] underperforms in this task, which we mainly attribute to the absence of long-term temporal modeling, as it uses only 8 frames of temporal context. iii) The CRN architecture proposed in [28] yields a significant MOS improvement, but is computationally extremely inefficient with 283 M MACs due to large CNN filter kernels and wide recurrent layers.
NSnet2, shown with different RNN sizes denoted as NSnet2-400 and NSnet2-500, performs better than NSnet, and its quality can be improved by increasing the RNN size to 500 units, which also scales up the MACs. The CRUSE5 models gain largely in efficiency by replacing the two LSTM layers with two GRUs, and further by going to only a single GRU. We can observe how moving from fully connected GRUs (P = 1) to P = 4 parallel GRUs for different RNN sizes reduces complexity with very little performance loss. The 2D convolutional feature extraction of CRUSE is very useful and efficient, as moving to 1D convolutions using kernels of (1,3) deteriorates the tradeoff significantly for CRUSE5-256-1xGRU1 and CRUSE4-120-1xGRU1. This seems to contradict findings in [19], where no performance degradation is claimed when using 1D convolutions on non-reverberant datasets, which highlights the importance of using appropriate data.

Overall, Figure 5 reveals a few very interesting trends: i) The fully recurrent NSnet models generally have lower complexity, but also lower quality than the CRUSE models. We can conclude that the convolutional encoder-decoder structure of CRUSE is very helpful; using temporal encoder/decoder kernels in particular boosts efficiency. ii) We can observe a surprisingly linear correlation between MACs and MOS. Consequently, for the tested models, the average achieved quality is a monotonically increasing function of the computational effort. The linear MAC-MOS relation is even more pronounced for models within the same class, e.g. the NSnet models, or CRUSE models with different RNN widths. The most efficient models, i.e. the proposed CRUSE models with parallel GRUs and optimized skip connections, break out of the linear trend, pushing towards the desired upper left corner.

Fig. 6 illustrates the scalable performance depending on the RNN width. The blue and red lines show NSnet2-R and CRUSE4-C-1xGRU1 models, where the width of the RNN was scaled by changing the RNN width R for NSnet2, or changing the last encoder layer's filters C, and therefore also the RNN width, for CRUSE4, respectively. We can observe a clear monotonic relationship between RNN throughput or memory and the obtained quality. Obviously, the complexity vs. quality tradeoff deteriorates for too-large models. It is a useful property that the performance of these networks can be scaled given a certain computational budget. The yellow line in Fig. 6 shows the effect of grouping the fully connected RNN of a CRUSE4-128-1xGRUP model into P disconnected RNNs while keeping the RNN feedforward flow fixed. Significantly better tradeoffs can be achieved with P = 2 and P = 4.

Fig. 6. Tradeoff by changing RNN width for NSnet2 and CRUSE, and parallel RNN grouping. Colored numbers denote the variable parameter per model: R, C, and P.

Table 1 shows further ablations on the skip connections using the CRUSE4-128-1xGRU4 architecture. Addition skips are better than no skip connections. While concat skips improve the MOS, the same MOS can be achieved by inserting cheap 1×1 convolutions, shown in Fig. 3 by the orange block.

Table 1. Effect of skip connection type for CRUSE4-128-1xGRU4.
model               MACs (M)   ΔMOS
no skips            4.3        0.32
add skips           4.3        0.35
add conv 1×1 skips  4.3        0.38
concat skips        4.8        0.38

While the execution time of the models is subject to optimization for the targeted hardware platform, we provide a sense of the relation between MACs and actual execution time for the efficient CRUSE4-128-1xGRU4 model, measured on an Intel Core i7 quad-core at 3.5 GHz: without further optimization, the ONNX model processes one audio frame on average in 0.3 ms, resulting in a reasonable CPU utilization of less than 3% within the hop size budget of 10 ms. NSnet2-500 runs in 0.15 ms.
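Such per-frame timings can be reproduced with a simple onnxruntime loop; the sketch below uses a hypothetical model file and input layout, and omits the recurrent state inputs that a streaming export would additionally require.

```python
import time
import numpy as np
import onnxruntime as ort

# Hypothetical file and input names; the measurement procedure is the
# point, not the exact model interface.
sess = ort.InferenceSession("cruse4-128-1xgru4.onnx")
name = sess.get_inputs()[0].name
frame = np.random.randn(1, 1, 1, 161).astype(np.float32)  # one STFT frame

n = 1000
t0 = time.perf_counter()
for _ in range(n):
    sess.run(None, {name: frame})  # streaming state handling omitted
ms = 1e3 * (time.perf_counter() - t0) / n
print(f"{ms:.2f} ms per frame ({100 * ms / 10:.1f}% of the 10 ms hop budget)")
```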
6. CONCLUSIONS
We proposed a flexible and scalable CRN for speech enhancement, trained on large data and tested on real recordings. We show that simple spectral suppression based networks of comparatively small size can achieve substantial quality improvements when trained on a representative dataset with a suitable loss that takes the time-domain reconstruction into account. We show that the obtained speech quality is a function of network size, especially depending on the recurrent layer width. We show gains in the speech quality vs. computational complexity tradeoff from modified skip connections and a disconnected parallel RNN structure. While the proposed models use only a fraction of the computational budget of standard CPUs in real time, the quality gain per computational burden compared to a traditional speech enhancement method is still less efficient and can hopefully be further improved in the future.

7. REFERENCES
[1] D. Wang and J. Chen, "Supervised speech separation based on deep learning: An overview," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 26, no. 10, pp. 1702–1726, Oct. 2018.
[2] F. Weninger, H. Erdogan, S. Watanabe, E. Vincent, J. Le Roux, J. R. Hershey, and B. Schuller, "Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR," in Proc. Latent Variable Analysis and Signal Separation, 2015, pp. 91–99.
[3] D. S. Williamson and D. Wang, "Time-frequency masking in the complex domain for speech dereverberation and denoising," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 25, no. 7, pp. 1492–1501, July 2017.
[4] R. Xia, S. Braun, C. Reddy, H. Dubey, R. Cutler, and I. Tashev, "Weighted speech distortion losses for neural-network-based real-time speech enhancement," in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2020.
[5] G. Wichern and A. Lukin, "Low-latency approximation of bidirectional recurrent networks for speech denoising," in Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), Oct. 2017, pp. 66–70.
[6] M. Strake, B. Defraene, K. Fluyt, W. Tirry, and T. Fingscheidt, "Separated noise suppression and speech restoration: LSTM-based speech enhancement in two stages," in Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), Oct. 2019, pp. 239–243.
[7] K. Tan and D. Wang, "A convolutional recurrent neural network for real-time speech enhancement," in Proc. Interspeech, 2018, pp. 3229–3233.
[8] S. Wisdom, J. R. Hershey, K. Wilson, J. Thorpe, M. Chinen, B. Patton, and R. A. Saurous, "Differentiable consistency constraints for improved deep speech enhancement," in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), May 2019, pp. 900–904.
[9] Y. Luo and N. Mesgarani, "Conv-TasNet: Surpassing ideal time–frequency magnitude masking for speech separation," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 27, no. 8, pp. 1256–1266, Aug. 2019.
[10] M. Maciejewski, G. Wichern, E. McQuinn, and J. Le Roux, "WHAMR!: Noisy and reverberant single-channel speech separation," in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 696–700.
[11] S. Braun and I. Tashev, "Data augmentation and loss normalization for deep noise suppression," in Proc. Speech and Computer, 2020.
[12] A. Ephrat, I. Mosseri, O. Lang, T. Dekel, K. Wilson, A. Hassidim, W. T. Freeman, and M. Rubinstein, "Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation," ACM Trans. Graph., vol. 37, no. 4, July 2018.
[13] S. Braun and I. Tashev, "A consolidated view of loss functions for supervised deep learning-based speech enhancement," arXiv:2009.12286, 2020.
[14] I. Loshchilov and F. Hutter, "Decoupled weight decay regularization," in International Conference on Learning Representations, 2019.
[15] ITU-T, "Perceptual evaluation of speech quality (PESQ), an objective method for end-to-end speech quality assessment of narrowband telephone networks and speech codecs," Feb. 2001.
[16] J. Le Roux, S. Wisdom, H. Erdogan, and J. R. Hershey, "SDR - half-baked or well done?," in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), May 2019, pp. 626–630.
[17] N. Kitawaki, H. Nagabuchi, and K. Itoh, "Objective quality evaluation for low bit-rate speech coding systems," IEEE J. Sel. Areas Commun., vol. 6, no. 2, pp. 262–273, 1988.
[18] K. Cho, B. Van Merriënboer, D. Bahdanau, and Y. Bengio, "On the properties of neural machine translation: Encoder-decoder approaches," in Proc. 8th Workshop on Syntax, Semantics and Structure in Statistical Translation (SSST-8), 2014.
[19] K. Tan and D. Wang, "Learning complex spectral mapping with gated convolutional recurrent networks for monaural speech enhancement," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 28, pp. 380–390, 2020.
[20] H. Gamper and I. J. Tashev, "Blind reverberation time estimation using a convolutional neural network," in Proc. Intl. Workshop Acoust. Signal Enhancement (IWAENC), 2018, pp. 136–140.
[21] H. Gamper, "Blind C50 estimation from single-channel speech using a convolutional neural network," in Intl. Workshop on Multimedia Signal Processing (MMSP), 2019.
[22] J. D. Polack, La transmission de l'énergie sonore dans les salles, Ph.D. thesis, Université du Maine, Le Mans, France, 1988.
[23] C. K. A. Reddy, E. Beyrami, H. Dubey, V. Gopal, R. Cheng, R. Cutler, S. Matusevych, R. Aichner, A. Aazami, S. Braun, P. Rana, S. Srinivasan, and J. Gehrke, "The Interspeech 2020 deep noise suppression challenge: Datasets, subjective speech quality and testing framework," in Proc. Interspeech, 2020.
[24] A. R. Avila, H. Gamper, C. Reddy, R. Cutler, I. Tashev, and J. Gehrke, "Non-intrusive speech quality assessment using neural networks," in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 631–635.
[25] C. K. A. Reddy, H. Dubey, V. Gopal, R. Cutler, S. Braun, H. Gamper, R. Aichner, and S. Srinivasan, "ICASSP 2021 deep noise suppression challenge," arXiv:2009.06122, 2020.
[26] I. Tashev, Sound Capture and Processing: Practical Approaches, Wiley, July 2009.
[27] S. R. Park and J. W. Lee, "A fully convolutional neural network for speech enhancement," in Proc. Interspeech, 2017.
[28] M. Strake, B. Defraene, K. Fluyt, W. Tirry, and T. Fingscheidt, "Fully convolutional recurrent networks for speech enhancement," in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2020.