A consolidated view of loss functions for supervised deep learning-based speech enhancement
Sebastian Braun, Ivan Tashev
Microsoft Research, Redmond, WA, [email protected], [email protected]
ABSTRACT
Deep learning-based speech enhancement for real-time applications has recently made large advancements. Due to the lack of a tractable perceptual optimization target, many myths around training losses emerged, whereas the contribution of the loss function to success has in many cases not been investigated in isolation from other factors such as network architecture, features, or training procedures. In this work, we investigate a wide variety of spectral loss functions for a recurrent neural network architecture suitable for online frame-by-frame processing. We relate magnitude-only and phase-aware losses, ratios, correlation metrics, and compressed metrics. Our results reveal that combining magnitude-only with phase-aware objectives always leads to improvements, even when the phase is not enhanced. Furthermore, using compressed spectral values also yields a significant improvement. On the other hand, phase-sensitive improvement is best achieved by linear-domain losses such as the mean absolute error.
Index Terms — speech enhancement, noise reduction, recurrent neural network, loss functions
1. INTRODUCTION
Speech enhancement using neural networks has seen large attention and success in recent years [1]. While classic single-channel statistical model-driven speech enhancement techniques used in practical systems often only leverage signal models for quasi-stationary noise [2], neural networks can potentially learn more complex speech characteristics, which also allows reduction of highly non-stationary, transient noise and non-speech sound sources.

Unfortunately, state-of-the-art deep learning (DL) based noise reduction performance is currently only achieved by architectures requiring large look-ahead, large amounts of temporal context at the input [3-5], or computationally expensive network architectures [3, 6-9]. As the performance seems to scale with the network size, this often prohibits the use in real-time speech communication systems such as live messengers or mobile communication devices.

However, the training loss function is independent of the inference complexity, and therefore has the potential to improve performance at no cost. Although the most popular choice for regression-based DL is the mean squared error (MSE), this is arguably not the optimal choice for speech enhancement. Loss functions and training targets for speech enhancement have shifted from the MSE between several versions of enhancement filters or masks [3, 10] to signal-based metrics, such as the spectral magnitude MSE, the phase-sensitive MSE [11], and finally the complex spectral MSE [6]. Approaches originating from a source separation background often use the time-domain MSE or a signal-to-distortion ratio (SDR) loss [5, 12]. While recent attempts were made to integrate perceptually motivated metrics into the loss function [13, 14], optimizing on perceptual metrics alone is often insufficient, and is therefore combined again with lower-level criteria such as the spectral magnitude MSE. It is often observed that optimization on some objective metric like the perceptual evaluation of speech quality (PESQ) or short-time objective intelligibility (STOI) improves the test results for the optimized metric, but fails to outperform other baselines in terms of other metrics [13, 14]. While the log-energy sigmoid weighting proposed in [15] does not generalize, as it is highly heuristic and signal-level dependent, we also could not verify improvements using a noise shaping weighting as proposed in [14] for our tested networks and data. Therefore, we take a step back and investigate different basic signal distance metrics as optimization criteria, which does not exclude the possibility of adding perceptually motivated weightings.

As a large variety of speech enhancement loss functions have been proposed in recent years, it is impossible to quantify their individual contributions to success due to the use of different enhancement systems and datasets. The study in [16] compares a selection of loss functions for a convolutional time-domain network. Those results may differ greatly from our study due to a complex network architecture with larger delay, an inference complexity more than 30 times larger than our network, and training/evaluation on non-reverberant speech, which is rarely encountered in practice. In this work, we compile an overview and comparison of different frequency-domain optimization criteria using a small recurrent neural network suitable for on-the-edge real-time inference.
We classify the losses based on their distance metric into spectral magnitude and complex losses, propose some new losses closing gaps in this systematic search, and point out interesting relations. We show that the best performing of the tested loss functions is the compressed MSE, closely followed by the mean absolute error (MAE), which can be attributed to a better match to the signal distributions. We furthermore show that a linear combination of magnitude and complex losses leads to improvements in all cases. Another interesting finding is that our results on a reverberant speech dataset did not confirm advantages of the recently proposed speech distortion-weighted (SDW) [17] and noise shaping losses [14].
2. SIGNAL MODEL
In a pure noise reduction task, we assume that the observed signal is an additive mixture of the desired speech and noise. We denote the observed signal $X(k,n)$ directly in the short-time Fourier transform (STFT) domain, where $k$ and $n$ are the frequency and time frame indices, as
$$X(k,n) = S(k,n) + N(k,n), \quad (1)$$
where $S(k,n)$ is the potentially reverberant speech, and $N(k,n)$ is the disturbing noise signal. The objective is to recover a speech signal estimate $\hat{S}(k,n)$ by
$$\hat{S}(k,n) = G(k,n)\,X(k,n), \quad (2)$$
where $G(k,n)$ is a filter that can be either a real-valued suppression gain or a complex-valued filter. In this work, we consider only a suppression gain.
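To make Eqs. (1)-(2) concrete, the following sketch applies a predicted suppression gain to the noisy STFT and resynthesizes the time-domain signal. It is a minimal sketch in PyTorch (the paper does not prescribe a framework); `gain_fn` is a hypothetical placeholder for the network of Section 4.

```python
import torch

def enhance(x, gain_fn, n_fft=512, hop=256):
    """Apply Eq. (2): S_hat = G * X, with a real-valued gain G predicted from X."""
    win = torch.hann_window(n_fft).sqrt()  # square-root Hann analysis window
    X = torch.stft(x, n_fft, hop_length=hop, window=win, return_complex=True)
    G = gain_fn(X)                         # suppression gain in [0, 1] per T-F bin
    S_hat = G * X                          # filtered spectrum, Eq. (2)
    return torch.istft(S_hat, n_fft, hop_length=hop, window=win)
```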
3. LOSS FUNCTIONS
In this section, we review and introduce a wide range of training loss functions targeting recovery of the speech signal $S(k,n)$. All considered speech enhancement loss functions are distance metrics between the enhanced and target spectral representations, where we denote the spectral magnitudes as $A = |S|$ and $\hat{A} = |\hat{S}|$. We can classify the loss functions summarized in Table 1 into magnitude distances and complex spectral distances, which also incorporate phase information. The operator $\langle Y(k,n) \rangle = \frac{1}{KN} \sum_{k,n} Y(k,n)$ denotes the arithmetic average over frequency and time indices, $k$ and $n$, per sequence. Newly proposed loss functions are marked with a †. In the following, we introduce and discuss the loss functions in Table 1.

The most straightforward choice is the L2 norm or squared error between estimated and target signals. While this loss is often only magnitude-based as in (3) [18, 19], its complex counterpart (4) is usually only used in direct spectral mapping approaches [6, 20], but has strangely never been used in filter prediction networks so far.

An actually better distance metric for the complex error is the L1 norm or MAE, as the distribution of STFT bins follows a Laplacian rather than a Gaussian distribution, as can be observed in Fig. 1 by the blue curves. The L1 norms of the magnitude and complex signal errors are given by (5) and (6), respectively, where we define the L1 norm of a complex number as $\|x_R + j x_I\|_1 = |x_R| + |x_I|$. The complex L1-norm loss (6) has been termed the RI loss in [21].

To account for the logarithmic perceptual nature of the human ear, the log spectral distance (LSD) given by (7) can be used, which was a standard in traditional model-based speech enhancement for decades [22]. Note that so far, the LSD has only been proposed in methods directly predicting the log power spectrum instead of a filter [13, 23], while we use it to predict a filter. The log compression creates a Gaussian-like distribution, as shown in Fig. 1 by the yellow line.

To extend the LSD (7) with a phase-error term, we propose the phase-aware logarithmic spectral distance (PLSD) given by (8), where $\varphi_S$ and $\varphi_{\hat{S}}$ are the phase angles of $S(k,n)$ and $\hat{S}(k,n)$, respectively. The first term in (8), the magnitude error, is identical to (7). The second term, the phase error, is connected to the magnitude error by bin-wise multiplication, which naturally decreases the phase error at bins with small magnitude error. The constant 2 ensures that the phase term lies within the range $[1, 3]$, preventing a vanishing magnitude error at zero phase error. Note that the cosine phase difference can be calculated from the complex signals as $\cos(\varphi_{\hat{S}} - \varphi_S) = \Re\{\hat{S} S^*\} / (\hat{A} A)$.

Fig. 1. Distributions of the linear complex, compressed complex, and log spectral signals (real and imaginary parts) for 5 min of noisy speech.
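For concreteness, a minimal PyTorch sketch of the losses (3)-(8) follows, assuming `S_hat` and `S` are complex-valued STFT tensors; the paper does not prescribe an implementation, and the epsilon guards are our addition to keep the logarithms and divisions finite.

```python
import torch

def mag_mse(S_hat, S):          # Eq. (3): magnitude MSE
    return ((S_hat.abs() - S.abs()) ** 2).mean()

def c_mse(S_hat, S):            # Eq. (4): complex MSE
    return ((S_hat - S).abs() ** 2).mean()

def mag_mae(S_hat, S):          # Eq. (5): magnitude MAE
    return (S_hat.abs() - S.abs()).abs().mean()

def c_mae(S_hat, S):            # Eq. (6): complex L1, defined as |Re| + |Im|
    d = S_hat - S
    return (d.real.abs() + d.imag.abs()).mean()

def lsd(S_hat, S, eps=1e-12):   # Eq. (7): log spectral distance
    return ((torch.log(S_hat.abs() + eps) - torch.log(S.abs() + eps)) ** 2).mean()

def plsd(S_hat, S, eps=1e-12):  # Eq. (8): phase-aware LSD (proposed)
    A_hat, A = S_hat.abs() + eps, S.abs() + eps
    log_err = torch.log(A_hat / A) ** 2
    cos_dphi = torch.real(S_hat * S.conj()) / (A_hat * A)  # cos(phi_hat - phi)
    return (log_err * (2.0 - cos_dphi)).mean()             # phase term in [1, 3]
```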
Due to the logarithmic compression, the standard LSD suffers from the problem of producing large errors also at low-energy bins, which are perceptually less relevant. As limiting the log mitigates this problem only suboptimally, we propose to apply a bin-wise signal-dependent weighting in (9) with
$$W_{\mathrm{LSD}}(k,n) = |\hat{S}(k,n) + \gamma X(k,n)|^{0.3}, \quad (20)$$
where the small constant $\gamma$ blends in the noisy signal to prevent applying zero weights where high noise reduction is achieved, and a compression exponent of 0.3 is applied. The same weighting can also be applied to the PLSD as given by (10).

A dynamic compression similar to the logarithm can be achieved using power-law compression [24] applied to the magnitudes by (11) with a compression exponent $0 < c < 1$. A phase-aware compressed loss can be obtained by multiplying the phase terms with the compressed magnitudes as given by (12), which was proposed in [4, 25]. A commonly used compression exponent is $c = 0.3$. In contrast to the logarithm, this compression has the advantage of producing non-negative values. We can observe in Fig. 1 by the red lines that the compression broadens the distributions of the complex compressed spectra, while values close to zero occur less frequently.

Commonly used ratios in speech enhancement are the signal-to-noise ratio (SNR) and the SDR. The time-domain SDR has already been successfully used in DL-based speech enhancement [26-28]. However, this metric is not restricted to the time domain and can be equivalently computed in the frequency domain. We employ here the scale-variant SDR given by (14), as we believe a scaled output signal as in the scale-invariant SDR [28] is undesired. In analogy, computing this ratio from magnitudes is more commonly termed the SNR, given by (13). Note that the SNR and SDR losses, (13) and (14), are simply the MSE losses (3) and (4) normalized by the speech power, as was also pointed out in [29].
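A sketch of the power-law compressed pair (11)-(12) under the same assumptions as above; the epsilon used when re-attaching the phase to the compressed magnitude is our addition to avoid dividing by zero-magnitude bins.

```python
import torch

def compressed_losses(S_hat, S, c=0.3, eps=1e-12):
    """Power-law compressed losses, Eqs. (11)-(12), with exponent 0 < c < 1."""
    A_hat, A = S_hat.abs(), S.abs()
    mag = ((A_hat ** c - A ** c) ** 2).mean()          # magComp, Eq. (11)
    # re-attach the original phases to the compressed magnitudes
    Sc_hat = A_hat ** c * (S_hat / (A_hat + eps))
    Sc = A ** c * (S / (A + eps))
    cplx = ((Sc_hat - Sc).abs() ** 2).mean()           # cComp, Eq. (12)
    return mag, cplx
```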
The speech intelligibility index and related objective metrics [30] are based on signal envelope correlation. Motivated by this fact, we introduce the magnitude correlation loss given by (15). The complex equivalent, the complex correlation coefficient given by (16), is better known as the coherence. While the range of (15) is $[0, 1]$, the range of (16) is $[-1, 1]$. Note that in [31], the coherence loss (16) has been termed the source-to-distortion ratio. A special property of the ratio- and correlation-based losses is that they are independent of the signal level.

Table 1. Loss functions (newly proposed losses are marked with †):

  L2:                  magMSE $\langle |\hat{A}-A|^2 \rangle$ (3);  cMSE $\langle |\hat{S}-S|^2 \rangle$ (4)
  L1:                  magMAE $\langle |\hat{A}-A| \rangle$ (5);  cMAE $\langle \|\hat{S}-S\|_1 \rangle$ (6)
  log MSE:             LSD $\langle |\log\hat{A}-\log A|^2 \rangle$ (7);  PLSD† $\langle |\log\tfrac{\hat{A}}{A}|^2 \big(2-\Re\{\hat{S}S^*\}/(\hat{A}A)\big) \rangle$ (8)
  weighted log MSE:    wLSD† $\langle W_{\mathrm{LSD}} |\log\hat{A}-\log A|^2 \rangle$ (9);  wPLSD† $\langle W_{\mathrm{LSD}} \cdot (\text{as in (8)}) \rangle$ (10)
  compressed:          magComp $\langle |\hat{A}^c-A^c|^2 \rangle$ (11);  cComp $\langle |\hat{A}^c e^{j\varphi_{\hat{S}}}-A^c e^{j\varphi_S}|^2 \rangle$ (12)
  ratios:              SNR† $-\log\tfrac{\langle A^2 \rangle}{\langle |\hat{A}-A|^2 \rangle}$ (13);  SDR $-\log\tfrac{\langle |S|^2 \rangle}{\langle |\hat{S}-S|^2 \rangle}$ (14)
  correlation:         magCorr† $-\tfrac{\langle \hat{A}A \rangle}{\sqrt{\langle \hat{A}^2 \rangle \langle A^2 \rangle}}$ (15);  cCorr† $-\tfrac{\Re\{\langle \hat{S}S^* \rangle\}}{\sqrt{\langle |\hat{S}|^2 \rangle \langle |S|^2 \rangle}}$ (16)
  speech dist. weight: SDW $\lambda \langle |S-GS|^2 \rangle + (1-\lambda) \langle |GN|^2 \rangle$ (17);  no complex counterpart
  weighted L2:         MSE-AMR $\langle W_{\mathrm{AMR}} |\hat{A}-A|^2 \rangle$ (18);  cMSE-AMR $\langle W_{\mathrm{AMR}} |\hat{S}-S|^2 \rangle$ (19)

By using the signal components of speech and noise separately, the SDW loss [17, 32] given by (17) provides a trade-off parameter $0 < \lambda < 1$ between speech distortion and noise reduction. Note that while (17) does not explicitly use only magnitudes, the decomposed nature and the absence of the noisy signal $X(k,n)$ imply that a zero-phase filter $G(k,n)$ is optimal. Therefore, the loss is categorized as a magnitude loss. Drawbacks of the SDW loss are that the optimal weight $\lambda$ is data dependent, and that optimal adjustments of $\lambda$, e.g., depending on the SNR, are heuristic and difficult to determine.

In [14], a weighting for the MSE based on the AMR codec is proposed to spectrally shape the noise error. We include this loss given by (18) and (19), while other weightings can also be applied to most distance metrics.

Several works have proposed combined losses using linear combinations of magnitude-only and phase-aware metrics [4, 8, 24] as
$$\mathcal{L}_{\mathrm{mix}} = (1-\beta)\,\mathcal{L}_{\mathrm{mag}} + \beta\,\mathcal{L}_{\mathrm{complex}}, \quad (21)$$
where $\mathcal{L}_{\mathrm{mag}}$ is a magnitude-based loss, $\mathcal{L}_{\mathrm{complex}}$ is a complex signal-based loss, and $0 \le \beta \le 1$ is the mixing factor. We investigate all useful combinations per row in Table 1.
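The combination (21) and the SDW loss (17) then reduce to a few lines, as sketched below; $\beta$ and $\lambda$ are tuned by grid search on the validation set, so the `beta` default is only a placeholder, while `lam=0.6` reflects the best value found on our data (Section 5).

```python
def mixed_loss(loss_mag, loss_cplx, S_hat, S, beta=0.5):
    """Eq. (21): linear combination of a magnitude and a complex loss.
    Example: mixed_loss(mag_mae, c_mae, S_hat, S, beta=0.7)."""
    return (1.0 - beta) * loss_mag(S_hat, S) + beta * loss_cplx(S_hat, S)

def sdw_loss(G, S, N, lam=0.6):
    """Eq. (17): speech-distortion weighted loss; requires the separate clean
    speech S and noise N components, and the predicted gain G."""
    speech_dist = ((S - G * S).abs() ** 2).mean()   # distortion of kept speech
    noise_leak = ((G * N).abs() ** 2).mean()        # residual noise after filtering
    return lam * speech_dist + (1.0 - lam) * noise_leak
```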
4. NETWORK AND TRAINING
We use a recurrent network architecture based on gated recurrent units (GRUs) [33] and feed-forward (FF) layers, similar to the core architecture of [8], to estimate the enhancement filter $G(k,n)$. The architecture was chosen to meet real-time constraints without delay and with moderate complexity.

Fig. 2. Network architecture and enhancement system.

The network input is the logarithmic power spectrum $P = \log(|X(k,n)|^2 + \epsilon)$ with online mean and variance normalization [17]. We use an STFT size of 512 with 32 ms square-root Hann windows and 16 ms frame shift, but feed only the relevant 255 frequency bins into the network, omitting the 0th and highest (Nyquist) bins, which do not carry useful information. The network consists of a FF embedding layer, two GRUs, and three FF layers with rectified linear unit (ReLU) activations and an output layer with sigmoid activation. The enhancement system and network architecture with layer sizes are shown in Fig. 2; the network has 2.8 M trainable parameters.

The network was trained using the AdamW optimizer [34] with a fixed learning rate. The training was monitored every 10 epochs using a validation subset. The best model was chosen based on the highest PESQ [35] on the validation set. The optimal weighting factors for $\beta$, $\lambda$, etc. were likewise optimized by a grid search, choosing the best-performing parameter in terms of PESQ on the validation set.
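A sketch of the described architecture in PyTorch follows; the text does not give per-layer sizes, so the hidden width of 400 is an assumption chosen only to land near the reported 2.8 M parameters, not the authors' configuration.

```python
import torch
import torch.nn as nn

class GRUEnhancer(nn.Module):
    """FF embedding -> 2 GRUs -> FF(ReLU) x2 -> FF(sigmoid) gain, per Fig. 2."""
    def __init__(self, n_bins=255, hidden=400):   # hidden=400 is an assumption
        super().__init__()
        self.embed = nn.Sequential(nn.Linear(n_bins, hidden), nn.ReLU())
        self.gru = nn.GRU(hidden, hidden, num_layers=2, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_bins), nn.Sigmoid(),  # gain G(k,n) in [0, 1]
        )

    def forward(self, logpow):       # logpow: (batch, frames, n_bins)
        h, _ = self.gru(self.embed(logpow))
        return self.head(h)          # suppression gain per time-frequency bin
```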
5. EXPERIMENTS

5.1. Dataset and evaluation metrics
We used the CHiME-2 WSJ-20k dataset [36], which is, despite its medium size, a realistic self-contained public dataset including matching reverberant speech and noise conditions. The dataset contains 7138, 2418, and 1998 utterances for training, validation, and testing, respectively. The target speech signals are binaural and reverberant, and the mixtures contain noise recorded in the same rooms. Validation and test sets are mixed with SNRs from -6 to 9 dB. For testing, we used only the left channel. We evaluate the speech enhancement performance in terms of PESQ [35] as an indicator of noise reduction and speech quality, and the scale-invariant signal-to-distortion ratio (SI-SDR) [28] as a phase-sensitive metric.
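For reference, SI-SDR [28] projects the estimate onto the reference and compares target energy with residual energy in dB; a minimal sketch, assuming 1-D time-domain tensors, used here only for evaluation, not as a loss:

```python
import torch

def si_sdr(est, ref, eps=1e-12):
    """Scale-invariant SDR: 10 log10(||a*ref||^2 / ||est - a*ref||^2),
    with the optimal scaling a = <est, ref> / ||ref||^2."""
    a = torch.dot(est, ref) / (torch.dot(ref, ref) + eps)
    target = a * ref
    residual = est - target
    return 10.0 * torch.log10(target.pow(2).sum() / (residual.pow(2).sum() + eps))
```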
Fig. 3. Optimization of the magnitude vs. complex loss weight $\beta$ and the speech distortion weight $\lambda$ on the validation set.

5.2. Results

Each magnitude and complex loss per row in Table 1 was combined by linear mixing (21). The LSD losses were omitted, as the PLSD is already a combined metric. The mixing factors were determined on the development set. The PESQ results for the parameter sweeps of $\beta$ are shown in Fig. 3. We can observe that the combination of magnitude and complex losses leads to an improvement for all distance metrics. We can also see that the MAE and compressed losses outperform the other distance metrics significantly at the optimal weight $\beta$. Furthermore, it is interesting that the combined compressed loss of (11) and (12) achieves the highest performance, while for the magnitude loss only ($\beta = 0$), the compressed loss and MAE are similar, but for the fully complex loss ($\beta = 1$), the compressed loss shows a significant performance drop. Although we experimented with "out-of-metric" combinations, in particular combining magComp with a better complex loss, e.g., cMAE, this did not lead to an improvement.

The PESQ and SI-SDR results for all losses on the test set are shown in Table 2, where the combined losses in the right column use the PESQ-optimal weightings.

Table 2. PESQ (SI-SDR) on the test set:

  loss      | magnitude    | complex      | comb. (21)
  noisy     | 2.29 (1.92)  |              |
  MSE       | 3.16 (9.57)  | 3.10 (9.58)  | 3.17 (9.58)
  MAE       |              | 3.08         |
  LSD       | 3.04 (8.59)  | 3.03 (8.31)  | –
  wLSD      | 3.19 (9.12)  | (8.88)       | –
  Comp      | (9.45)       | 2.88 (9.21)  | (9.42)
  SNR / SDR | 3.15 (9.54)  | 3.11 (9.62)  | 3.19
  Corr      | 3.16 (9.56)  | 3.11 (9.60)  | 3.16 (9.58)
  SDW       | 3.12 (9.61)  | –            | –
  MSE-AMR   | 3.01 (9.39)  | 2.98 (9.45)  | –

The PESQ results align well with the development set in Fig. 3, namely that the MAE and compressed losses are strong performers. While the pure LSD is even slightly worse than the MSE, the signal power-weighted wLSD outperforms the linear MSE. While the PLSD shows no advantage over the LSD, the wPLSD gives a slight advantage over the magnitude-based wLSD, which confirms the importance of attributing low weights to unimportant frequency bins for the LSD. It is not surprising that the SNR and SDR perform on par with the L2 norm, as they are merely normalized versions. The correlation-based losses are in the same range as well. It is surprising that on this reverberant dataset, the SDW loss performs significantly worse than the magMSE or cMSE, which was shown differently on non-reverberant datasets in [17, 32]. This also highlights the data dependency of the speech distortion weight $\lambda$, which varies from 0.3 in [17] and 0.5 in [32] to 0.6 in our case. Furthermore, on the reverberant CHiME-2 dataset, we also could not confirm the effectiveness of perceptually motivated weightings, such as the AMR weighting proposed in [14], which performed significantly worse than the unweighted MSEs. While the SI-SDR is less correlated with speech quality than PESQ, it shows the best results mostly for linear losses such as the MAE and SDR.
Overall, we can say that magnitude compression and distance metrics carefully chosen according to the spectral domain's signal distribution can lead to more suitable loss functions.
6. CONCLUSIONS
We have classified several signal-based frequency-domain loss functions for speech enhancement and explored relations and performance differences on the reverberant CHiME-2 dataset. Our experiments showed that for such realistic data, compressed losses are beneficial and that combined magnitude and complex losses improve the objective speech quality. We also showed different findings for weighted losses with reverberant speech than for anechoic data. Future work has to be done especially on improved phase-aware losses to further improve the quality.
7. REFERENCES

[1] D. Wang and J. Chen, "Supervised speech separation based on deep learning: An overview," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 26, no. 10, Oct. 2018.
[2] M. S. Kavalekalam, J. K. Nielsen, M. G. Christensen, and J. B. Boldt, "A study of noise PSD estimators for single channel speech enhancement," in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 5464-5468.
[3] D. S. Williamson and D. Wang, "Time-frequency masking in the complex domain for speech dereverberation and denoising," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 25, no. 7, pp. 1492-1501, Jul. 2017.
[4] A. Ephrat, I. Mosseri, O. Lang, T. Dekel, K. Wilson, A. Hassidim, W. T. Freeman, and M. Rubinstein, "Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation," ACM Trans. Graph., vol. 37, no. 4, Jul. 2018.
[5] Y. Luo and N. Mesgarani, "Conv-TasNet: Surpassing ideal time-frequency magnitude masking for speech separation," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 27, no. 8, pp. 1256-1266, Aug. 2019.
[6] M. Strake, B. Defraene, K. Fluyt, W. Tirry, and T. Fingscheidt, "Separated noise suppression and speech restoration: LSTM-based speech enhancement in two stages," in Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), Oct. 2019.
[7] G. Wichern and A. Lukin, "Low-latency approximation of bidirectional recurrent networks for speech denoising," in Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), Oct. 2017, pp. 66-70.
[8] S. Wisdom, J. R. Hershey, K. Wilson, J. Thorpe, M. Chinen, B. Patton, and R. A. Saurous, "Differentiable consistency constraints for improved deep speech enhancement," in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), May 2019, pp. 900-904.
[9] K. Tan and D. Wang, "A convolutional recurrent neural network for real-time speech enhancement," in Proc. Interspeech, 2018, pp. 3229-3233.
[10] Y. Wang, A. Narayanan, and D. Wang, "On training targets for supervised speech separation," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 22, no. 12, pp. 1849-1858, Dec. 2014.
[11] M. Kolbæk, D. Yu, Z. Tan, and J. Jensen, "Multitalker speech separation with utterance-level permutation invariant training of deep recurrent neural networks," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 25, no. 10, Oct. 2017.
[12] F. Bahmaninezhad, J. Wu, R. Gu, S.-X. Zhang, Y. Xu, M. Yu, and D. Yu, "A comprehensive study of speech separation: Spectrogram vs waveform separation," in Proc. Interspeech, 2019, pp. 4574-4578.
[13] J. M. Martín-Doñas, A. M. Gomez, J. A. Gonzalez, and A. M. Peinado, "A deep learning loss function based on the perceptual evaluation of the speech quality," IEEE Signal Process. Lett., vol. 25, no. 11, pp. 1680-1684, Nov. 2018.
[14] Z. Zhao, S. Elshamy, and T. Fingscheidt, "A perceptual weighting filter loss for DNN training in speech enhancement," in Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), Oct. 2019, pp. 229-233.
[15] Q. Liu, W. Wang, P. J. B. Jackson, and Y. Tang, "A perceptually-weighted deep neural network for monaural speech enhancement in various background noise conditions," in Proc. European Signal Processing Conf. (EUSIPCO), Aug. 2017, pp. 1270-1274.
[16] M. Kolbæk, Z. Tan, S. H. Jensen, and J. Jensen, "On loss functions for supervised monaural time-domain speech enhancement," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 28, pp. 825-838, 2020.
[17] R. Xia, S. Braun, C. Reddy, H. Dubey, R. Cutler, and I. Tashev, "Weighted speech distortion losses for neural-network-based real-time speech enhancement," in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2020.
[18] F. Weninger, H. Erdogan, S. Watanabe, E. Vincent, J. Le Roux, J. R. Hershey, and B. Schuller, "Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR," in Proc. Latent Variable Analysis and Signal Separation, 2015, pp. 91-99.
[19] H. Zhao, S. Zarar, I. Tashev, and C. Lee, "Convolutional-recurrent neural networks for speech enhancement," in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 2401-2405.
[20] S. Fu, T. Hu, Y. Tsao, and X. Lu, "Complex spectrogram enhancement by convolutional neural network with multi-metrics learning," in Proc. IEEE Intl. Workshop on Machine Learning for Signal Processing (MLSP), 2017, pp. 1-6.
[21] Z. Wang and D. Wang, "Multi-microphone complex spectral mapping for speech dereverberation," in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2020.
[22] Y. Ephraim and D. Malah, "Speech enhancement using a minimum mean-square error log-spectral amplitude estimator," IEEE Trans. Acoust., Speech, Signal Process., vol. 33, no. 2, 1985.
[23] Y.-H. Tu, I. Tashev, S. Zarar, and C. Lee, "A hybrid approach to combining conventional and deep learning techniques for single-channel speech enhancement and recognition," in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), Apr. 2018, pp. 2531-2535.
[24] J. Lee, J. Skoglund, T. Shabestary, and H. Kang, "Phase-sensitive joint learning algorithms for deep learning-based speech enhancement," IEEE Signal Process. Lett., vol. 25, no. 8, pp. 1276-1280, 2018.
[25] K. Wilson, M. Chinen, J. Thorpe, B. Patton, J. Hershey, R. A. Saurous, J. Skoglund, and R. F. Lyon, "Exploring tradeoffs in models for low-latency speech enhancement," in Proc. Intl. Workshop on Acoustic Signal Enhancement (IWAENC), Sep. 2018, pp. 366-370.
[26] E. Vincent, R. Gribonval, and C. Févotte, "Performance measurement in blind audio source separation," IEEE Trans. Audio, Speech, Lang. Process., vol. 14, no. 4, Jul. 2006.
[27] H.-S. Choi, J. Kim, J. Hur, A. Kim, J.-W. Ha, and K. Lee, "Phase-aware speech enhancement with deep complex U-Net," in Intl. Conf. on Learning Representations (ICLR), 2019.
[28] J. Le Roux, S. Wisdom, H. Erdogan, and J. R. Hershey, "SDR - half-baked or well done?," in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), May 2019.
[29] J. Heitkaemper, D. Jakobeit, C. Boeddeker, L. Drude, and R. Haeb-Umbach, "Demystifying TasNet: A dissecting approach," in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 6359-6363.
[30] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, "An algorithm for intelligibility prediction of time-frequency weighted noisy speech," IEEE Trans. Audio, Speech, Lang. Process., vol. 19, no. 7, pp. 2125-2136, Sep. 2011.
[31] S. Venkataramani, J. Casebeer, and P. Smaragdis, "Adaptive front-ends for end-to-end source separation," in Conf. on Neural Information Processing Systems (NIPS), 2017.
[32] Z. Xu, S. Elshamy, and T. Fingscheidt, "Using separate losses for speech and noise in mask-based speech enhancement," in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 7519-7523.
[33] K. Cho, B. van Merriënboer, D. Bahdanau, and Y. Bengio, "On the properties of neural machine translation: Encoder-decoder approaches," in Proc. 8th Workshop on Syntax, Semantics and Structure in Statistical Translation (SSST-8), 2014.
[34] I. Loshchilov and F. Hutter, "Decoupled weight decay regularization," in Intl. Conf. on Learning Representations (ICLR), 2019.
[35] ITU-T, "Perceptual evaluation of speech quality (PESQ), an objective method for end-to-end speech quality assessment of narrowband telephone networks and speech codecs," ITU-T Rec. P.862, Feb. 2001.
[36] E. Vincent, J. Barker, S. Watanabe, and F. Nesta, "The second 'CHiME' speech separation and recognition challenge: Datasets, tasks and baselines," in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2013.