A Perceptually-Motivated Approach for Low-Complexity, Real-Time Enhancement of Fullband Speech
Jean-Marc Valin, Umut Isik, Neerad Phansalkar, Ritwik Giri, Karim Helwani, Arvindh Krishnaswamy
Amazon Web Services {jmvalin, umutisik, neeradp, ritwikg, helwk, arvindhk}@amazon.com
Abstract
Over the past few years, speech enhancement methods based on deep learning have greatly surpassed traditional methods based on spectral subtraction and spectral estimation. Many of these new techniques operate directly in the short-time Fourier transform (STFT) domain, resulting in a high computational complexity. In this work, we propose PercepNet, an efficient approach that relies on human perception of speech by focusing on the spectral envelope and on the periodicity of the speech. We demonstrate high-quality, real-time enhancement of fullband (48 kHz) speech with less than 5% of a CPU core.
Index Terms: speech enhancement, pitch filtering, postfilter
1. Introduction
Over the past few years, speech enhancement methods based on deep learning have greatly surpassed traditional methods based on spectral subtraction [1] and spectral estimation [2]. Many of these techniques operate directly on the short-time Fourier transform (STFT), estimating either magnitudes [3, 4, 5] or ideal ratio masks (IRM) [6, 7]. This typically requires a large number of neurons and weights, resulting in a high complexity. It also partly explains why many of those methods are restricted to 8 or 16 kHz. The use of the STFT also brings up a trade-off with the window length – long windows can cause musical noise and reverb-like effects, whereas short windows do not provide sufficient frequency resolution for removing noise between pitch harmonics. These problems can be mitigated by the use of complex ratio masks [8] or time-domain processing [9, 10, 11], at the cost of further increasing complexity.

We propose PercepNet, an efficient approach that relies heavily on human perception of speech signals and improves on RNNoise [12]. More precisely, we rely on the perception of audio in critical bands (Section 2) and on the perception of tones and noise (Section 3) with a new acausal comb filter. The deep neural network (DNN) model we use is trained using perceptual criteria (Section 4). We propose a novel envelope postfilter (Section 5) that further improves the enhanced signal.

The PercepNet algorithm operates on 10-ms frames with 40 ms of look-ahead and can enhance 48 kHz speech in real time using just 4.1% of an x86 CPU core. We show that its quality significantly exceeds that of RNNoise (Section 6).
2. Signal Model
Let x(n) be a clean speech signal. The signal captured by a hands-free microphone in a noisy room is given by

$$y(n) = x(n) \star h(n) + \eta(n), \qquad (1)$$

where η(n) is the additive noise from the room, h(n) is the impulse response from the talker to the microphone, and ⋆ denotes the convolution.

Figure 1: The current window being synthesized is shown in solid red. We use three windows of look-ahead (shown in dashed lines) such that samples up to time t = 40 ms are used to compute the audio output up to t = 0.

Figure 2: Overview of the PercepNet algorithm (STFT analysis, pitch analysis and feature extraction, DNN model, band scaling, pitch filtering, envelope postfilter, and ISTFT synthesis between input and output).

Furthermore, the clean speech can be expressed as x(n) = p(n) + u(n), where p(n) is a locally periodic component and u(n) is a stochastic component (here we consider transients such as stops as part of the stochastic component). In this work, we attempt to compute an enhanced signal x̂(n) = p̂(n) + û(n) which is as perceptually close to the clean speech x(n) as possible. Separating the stochastic component u(n) from the environmental noise η(n) is a very hard problem. Fortunately, we only need û(n) to sound like u(n), which can be achieved by filtering the mixture u(n) ⋆ h(n) + η(n) to have the same spectral envelope as u(n). Since p(n) is periodic and the noise is assumed not to have strong periodicity, p̂(n) should be easier to estimate. Again, we mostly need p̂(n) to have the same spectral envelope and the same period as p(n).

We seek to construct an enhanced signal with the same 1) spectral envelope, and 2) frequency-dependent periodic-to-stochastic ratio as the clean signal. For both these properties, we use a resolution that matches human perception.

We use the short-time Fourier transform (STFT) with 20-ms windows and 50% overlap. We use the Vorbis window function [13] – which satisfies the Princen-Bradley perfect reconstruction criterion [14] – for analysis and synthesis, as shown in Fig. 1. An overview of the algorithm is shown in Fig. 2.
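To make the windowing concrete, the following sketch (Python with NumPy; not the authors' code) constructs the Vorbis window for 20-ms frames at 48 kHz and verifies the Princen-Bradley condition at 50% overlap:

```python
import numpy as np

def vorbis_window(n_samples):
    """Vorbis window: sin((pi/2) * sin^2(pi * (n + 0.5) / N))."""
    n = np.arange(n_samples)
    return np.sin(0.5 * np.pi * np.sin(np.pi * (n + 0.5) / n_samples) ** 2)

N = 960                       # 20 ms at 48 kHz
w = vorbis_window(N)
half = N // 2

# Princen-Bradley criterion: the squared windows of two overlapping frames
# sum to 1, so analysis + synthesis windowing reconstructs perfectly.
assert np.allclose(w[:half] ** 2 + w[half:] ** 2, 1.0)
```

Because the same window is applied at analysis and synthesis, the squared-window overlap-add property is what guarantees perfect reconstruction.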
Figure 3: Frequency response of the proposed comb filter |C(z)| (red) vs. the filter |C⁽⁰⁾(z)| used in [12] (blue) for a pitch of 200 Hz.
The vast majority of noise signals have a wide bandwidth with a smooth spectrum. Similarly, both the periodic and the stochastic components of speech have a smooth spectral envelope. This allows us to represent their envelope from 0 to 20 kHz using 34 bands, spaced according to the human hearing equivalent rectangular bandwidth (ERB) [15]. To avoid bands with just one DFT bin, we impose a minimum band width of 100 Hz.

For each band of the enhanced signal to be perceptually close to the clean speech, both their total energy and their periodic content should be the same. In this paper, we denote the complex-valued spectrum of the signal x(n) for band b in frame ℓ as x_b(ℓ). We also denote the L2-norm of that band as X_b(ℓ). From the magnitude of the noisy speech signal in band b, we compute the ideal ratio mask, i.e. the gain that needs to be applied to y_b such that it has the same energy as x_b(ℓ):

$$g_b(\ell) = \frac{X_b(\ell)}{Y_b(\ell)}. \qquad (2)$$

In the case where the speech only has a stochastic component, applying the gain g_b(ℓ) to the magnitude spectrum in band b should result in an enhanced signal that is almost indistinguishable from the clean speech signal. On the other hand, when the speech is perfectly periodic, applying the gain g_b(ℓ) results in an enhanced signal that sounds rougher than the clean speech; even though the energy is the same, the enhanced signal is less harmonic than the clean speech. In that case, the noise is particularly perceptible due to the fact that tones have relatively little masking effect on noise [16]. In that situation, we use the comb filter described in the next section to remove the noise between the pitch harmonics and make the signal more periodic.
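As an illustration, here is a minimal sketch of per-band energies and ideal ratio masks (Python with NumPy; the exact 34-band layout used by PercepNet is not specified here, so the ERB-spaced edges with the 100 Hz minimum width are an approximation, and the clipping of the gain to 1 is a safeguard we add):

```python
import numpy as np

def band_edges(n_bands=34, fmin=0.0, fmax=20000.0):
    """ERB-like band edges with a 100 Hz minimum width (approximation)."""
    erb = lambda f: 21.4 * np.log10(1.0 + 0.00437 * f)       # ERB-rate scale
    erb_inv = lambda e: (10 ** (e / 21.4) - 1.0) / 0.00437
    edges = erb_inv(np.linspace(erb(fmin), erb(fmax), n_bands + 1))
    for i in range(1, len(edges)):        # enforce the minimum band width
        edges[i] = max(edges[i], edges[i - 1] + 100.0)
    return edges

def band_magnitudes(spectrum, edges, fs=48000):
    """L2 norm of the DFT bins falling in each band."""
    n_fft = 2 * (len(spectrum) - 1)
    freqs = np.arange(len(spectrum)) * fs / n_fft
    return np.array([np.linalg.norm(spectrum[(freqs >= lo) & (freqs < hi)])
                     for lo, hi in zip(edges[:-1], edges[1:])])

def ideal_band_gains(clean_spec, noisy_spec, edges, eps=1e-12):
    """Per-band ideal ratio mask g_b = X_b / Y_b (eq. 2)."""
    x = band_magnitudes(clean_spec, edges)
    y = band_magnitudes(noisy_spec, edges)
    return np.minimum(1.0, x / (y + eps))
```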
3. Pitch Filtering
To reconstruct the harmonic properties of the clean speech, we use comb filtering based on the pitch frequency. The comb filter can achieve a much finer frequency resolution than would otherwise be possible with the STFT (50 Hz using 20-ms frames). We estimate the pitch period using a correlation-based method combined with a dynamic programming search [17].
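For intuition, a much simplified correlation-based period estimator is sketched below; the paper uses the more robust RAPT-style search of [17], so this toy version (which omits the dynamic programming step) is only an illustration:

```python
import numpy as np

def estimate_period(frame, fs=48000, fmin=60.0, fmax=500.0):
    """Pick the lag that maximizes the normalized autocorrelation."""
    tmin, tmax = int(fs / fmax), int(fs / fmin)
    x = frame - np.mean(frame)
    best_t, best_corr = tmin, -1.0
    for t in range(tmin, tmax):
        a, b = x[t:], x[:-t]
        corr = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
        if corr > best_corr:
            best_t, best_corr = t, corr
    return best_t, best_corr
```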
For a voiced speech signal with period T, a simple comb filter

$$C^{(0)}(z) = 1 + z^{-T} \qquad (3)$$

introduces zeros at regular intervals between harmonics and attenuates the noisy part of the signal by around 3 dB. This provided a small but noticeable quality improvement in [12]. In this work, we extend the comb filtering to more than one period, including non-causal taps, using the following filter:

$$C_M(z) = \sum_{k=-M}^{M} w_k z^{-kT}, \qquad (4)$$

where M is the number of periods on each side of the central tap and w_k is a window function satisfying $\sum_k w_k = 1$. Using C_M(z), the noise energy is attenuated by a factor $\sigma_w^2 = \sum_k w_k^2$. Although a rectangular window would minimize $\sigma_w^2$, we use a Hann window, which shapes the remaining noise to be lower between harmonics. Due to the behavior of tone masking [15], this results in a lower perceptual noise. For M = 5, the full response is shown in Fig. 3. In practice, since the maximum look-ahead is bounded, we truncate the window w_k to the values of kT that are permitted.

The filtering occurs in the time domain, with the output denoted p̂(n) since it approximates the "perfect" periodic component p(n) from the clean speech. Its STFT is denoted p̂_b(ℓ).

The amount of comb filtering is important: not enough filtering results in roughness, whereas too much results in a robotic voice. The strength of the comb filtering in [12] is controlled by a heuristic. In this work, we instead have the neural network learn the strength that best preserves the ratio of periodic to stochastic energy in each band. The equations below describe what that ideal strength should be. Since they rely on properties of the clean speech, they are only used at training time.

We define the pitch coherence q_{x,b}(ℓ) of the clean signal as the cosine distance between the complex spectra of the signal and its periodic component (both ℓ and b are omitted for clarity):

$$q_x \triangleq \frac{\Re\left[\mathbf{p}^H \mathbf{x}\right]}{\|\mathbf{p}\| \cdot \|\mathbf{x}\|}, \qquad (5)$$

where $(\cdot)^H$ denotes the Hermitian transpose and $\Re[\cdot]$ denotes the real component. Similarly, we define q_y as the pitch coherence of the noisy signal. Since the ground truth p is not available, the coherence values need to be estimated. Considering that the noise in p̂ is attenuated by a factor σ_w, the pitch coherence of the estimated periodic signal p̂ itself can be approximated as

$$q_{\hat p} = \frac{q_y}{\sqrt{\left(1 - \sigma_w^2\right) q_y^2 + \sigma_w^2}}. \qquad (6)$$

We define the pitch filtering strength r ∈ [0, 1], where r = 0 causes no filtering to occur and r = 1 replaces the signal with p̂. Let z = (1 − r)y + r p̂ be a pitch-enhanced signal; we want the pitch coherence of z to match the clean signal:

$$q_z = \frac{\mathbf{p} \cdot \left((1-r)\,\mathbf{y} + r\,\hat{\mathbf{p}}\right)}{\|\mathbf{p}\| \cdot \|(1-r)\,\mathbf{y} + r\,\hat{\mathbf{p}}\|} = q_x. \qquad (7)$$

Solving (7) for r results in

$$r = \frac{\alpha}{a}, \qquad (8)$$

$$\alpha = \sqrt{b^2 + a\left(q_x^2 - q_y^2\right)} - b, \qquad (9)$$

where $a = q_{\hat p}^2 - q_x^2$ and $b = q_{\hat p}\, q_y \left(1 - q_x^2\right)$.
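A minimal time-domain sketch of the comb filter of (4) follows (Python with NumPy; the handling of the bounded look-ahead and the exact Hann normalization are our assumptions, not the authors' code, and the wrap-around at the frame edges is a simplification):

```python
import numpy as np

def comb_filter(y, T, M=5, max_lookahead=None):
    """Apply C_M(z) = sum_k w_k z^{-kT} with Hann weights summing to 1."""
    y = np.asarray(y, dtype=float)
    k = np.arange(-M, M + 1)
    w = 0.5 + 0.5 * np.cos(np.pi * k / (M + 1))   # Hann taps
    if max_lookahead is not None:                 # k < 0 taps need future samples
        w[(-k * T) > max_lookahead] = 0.0         # truncate non-causal taps
    w /= np.sum(w)                                # enforce sum_k w_k = 1
    p_hat = np.zeros_like(y)
    for ki, wi in zip(k, w):
        if wi != 0.0:
            p_hat += wi * np.roll(y, ki * T)      # p_hat[n] += w_k * y[n - kT]
    return p_hat
```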
Figure 4: Overview of the DNN architecture computing the 34 gains ĝ_b and 34 strengths r̂_b from the 70-dimensional input feature vector f. The number of units in each layer is indicated above the layer type.

In very noisy conditions, it is possible for the periodic estimate p̂ to have a lower coherence than the clean speech in a band (q_p̂ < q_x). In that case, we set r = 1 and compute a gain attenuation term that ensures that the stochastic component of the enhanced speech matches the level of the clean speech (at the expense of making the periodic component too quiet):

$$g^{(\mathrm{att})} = \sqrt{\frac{1 + n_0 - q_x^2}{1 + n_0 - q_{\hat p}^2}}, \qquad (10)$$

where n₀ = 0.03 (or 15 dB) limits the maximum attenuation to the noise-masking-tone threshold [18]. For the normal case (q_p̂ ≥ q_x), g^(att) = 1.
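The training-time targets of (5), (6) and (10) can be sketched as follows (a rough illustration under the reconstructions above, with band indexing omitted; in particular the exact form of the attenuation in (10) is our reading of the text):

```python
import numpy as np

def coherence(p_spec, x_spec):
    """Pitch coherence (5): normalized real part of the inner product."""
    num = np.real(np.vdot(p_spec, x_spec))        # vdot conjugates p
    return num / (np.linalg.norm(p_spec) * np.linalg.norm(x_spec) + 1e-12)

def coherence_of_estimate(q_y, sigma_w2):
    """Eq. (6): coherence of the comb-filtered signal; sigma_w2 = sum_k w_k^2."""
    return q_y / np.sqrt((1.0 - sigma_w2) * q_y ** 2 + sigma_w2)

def gain_attenuation(q_x, q_p_hat, n0=0.03):
    """Eq. (10): attenuation when the estimate is less coherent than clean."""
    if q_p_hat >= q_x:
        return 1.0
    return np.sqrt((1.0 + n0 - q_x ** 2) / (1.0 + n0 - q_p_hat ** 2))
```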
4. DNN Model
The model uses both convolutional layers (a 1x5 layer followed by a 1x3 layer) and GRU [19] layers, as shown in Fig. 4. The convolutional layers are aligned in time so as to use up to M frames into the future. To achieve 40 ms of look-ahead, including the 10-ms overlap, we use M = 3.

The input features used by the model are tied to the 34 ERB bands. For each band, we use two features: the magnitude of the band with look-ahead Y_b(ℓ + M) and the pitch coherence without look-ahead q_{y,b}(ℓ) (the coherence estimation itself uses the full look-ahead). In addition to those 68 band-related features, we use the pitch period T(ℓ), as well as an estimate of the pitch correlation [20] with look-ahead, for a total of 70 input features. For each band b, we also have 2 outputs: the gain ĝ_b(ℓ) approximates g_b^(att)(ℓ) g_b(ℓ), and the strength r̂_b(ℓ) approximates r_b(ℓ).

The weights of the model are forced to a ±1/2 range and quantized to 8-bit integers. This reduces the memory requirement (and bandwidth), while also reducing the computational complexity of the inference by taking advantage of vectorization.

We train the model on synthetic mixtures of clean speech and noise with SNRs ranging from -5 dB to 45 dB, with some noise-free examples included. The clean speech data includes 120 hours of 48 kHz speech from different public and internal databases, including more than 200 speakers and more than 20 different languages. The noise data includes 80 hours of various noise types, also sampled at 48 kHz. To ensure robustness in reverberant conditions, the noisy signal is convolved with simulated and measured room impulse responses. Inspired by [21], the target includes the early reflections so that only late reverberation is attenuated.

We improve the generalization of the model by applying a different random second-order pole-zero filter to both the speech and the noise. We also apply the same random spectral tilt to both signals to better generalize across different microphone frequency responses. To achieve bandwidth independence, we apply a low-pass filter with a random cutoff frequency between 3 kHz and 20 kHz. This makes it possible to use the same model on narrowband to fullband audio.
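The architecture can be sketched as follows (PyTorch; the layer sizes and the number of stacked GRUs follow our reading of Fig. 4 and are assumptions rather than the authors' exact topology):

```python
import torch
import torch.nn as nn

class PercepNetLike(nn.Module):
    """Sketch: FC -> conv1x5 -> conv1x3 -> stacked GRUs -> two sigmoid heads."""
    def __init__(self, n_features=70, n_bands=34, hidden=512):
        super().__init__()
        self.fc_in = nn.Linear(n_features, 128)
        # Convolutions run along the time axis (channels = feature dims).
        self.conv1 = nn.Conv1d(128, hidden, kernel_size=5)
        self.conv2 = nn.Conv1d(hidden, hidden, kernel_size=3)
        self.gru = nn.GRU(hidden, hidden, num_layers=3, batch_first=True)
        self.gain_head = nn.Linear(hidden, n_bands)      # g_hat
        self.strength_head = nn.Linear(hidden, n_bands)  # r_hat

    def forward(self, f):
        # f: (batch, time, n_features); the conv look-ahead is handled by
        # the caller shifting features M frames into the future.
        h = torch.tanh(self.fc_in(f)).transpose(1, 2)    # (B, C, T)
        h = torch.tanh(self.conv1(h))
        h = torch.tanh(self.conv2(h)).transpose(1, 2)    # (B, T, C)
        h, _ = self.gru(h)
        return torch.sigmoid(self.gain_head(h)), torch.sigmoid(self.strength_head(h))
```

The sigmoid outputs naturally constrain both ĝ_b and r̂_b to [0, 1], matching their definitions.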
We use a different loss function for the gain and for the pitch filtering strength. For the gain, we consider that the perceptual loudness of a signal is proportional to its energy raised to a power γ/2, where we use γ = 0.5. For that reason, we raise the gains to the power γ before computing the metrics. In addition to the squared error, we also use the fourth power to overemphasize the cost of making large errors (e.g. completely attenuating speech):

$$\mathcal{L}_g = \sum_b \left(g_b^\gamma - \hat g_b^\gamma\right)^2 + C \sum_b \left(g_b^\gamma - \hat g_b^\gamma\right)^4, \qquad (11)$$

where we use C = 10 to balance between the squared and fourth-power terms. Although simple, the loss function in (11) implicitly incorporates many of the characteristics of the improved loss function proposed in [22], including scale-invariance, SNR-invariance, power-law compression, and non-linear frequency resolution.

For the pitch filtering strength, we use the same principle as for L_g, but evaluating the loudness of the noisy component of the enhanced speech. Since the comb filter with strength r_b attenuates the noise by a factor (1 − r_b), we use the strength loss

$$\mathcal{L}_r = \sum_b \left((1 - r_b)^\gamma - (1 - \hat r_b)^\gamma\right)^2. \qquad (12)$$

Since the enhancement is not overly sensitive to errors in the value of r̂_b, we do not use a fourth-power term.
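In code, the two losses of (11) and (12) are straightforward (PyTorch sketch consistent with the definitions above; γ = 0.5 and C = 10 as in the text, the clamps being a numerical safeguard we add):

```python
import torch

GAMMA, C = 0.5, 10.0

def gain_loss(g, g_hat):
    """Eq. (11): squared + fourth-power error on loudness-compressed gains."""
    d = g.clamp(min=0) ** GAMMA - g_hat.clamp(min=0) ** GAMMA
    return (d ** 2).sum(dim=-1).mean() + C * (d ** 4).sum(dim=-1).mean()

def strength_loss(r, r_hat):
    """Eq. (12): squared error on the compressed residual-noise factor."""
    d = (1 - r).clamp(min=0) ** GAMMA - (1 - r_hat).clamp(min=0) ** GAMMA
    return (d ** 2).sum(dim=-1).mean()
```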
5. Envelope Postfiltering
To further enhance the speech, we slightly deviate from the gains ĝ_b produced by the DNN. The deviation is inspired by the formant postfilters [23] often used in CELP codecs. We intentionally de-emphasize noisier bands slightly further than they would be in the clean signal, while overemphasizing clean bands to compensate. This is done by computing a warped gain

$$\hat g_b^{(w)} = \hat g_b \sin\left(\frac{\pi}{2} \hat g_b\right), \qquad (13)$$

which leaves ĝ_b essentially unaffected for clean bands, while squaring it (like the gain of a Wiener filter) for very noisy bands. To avoid over-attenuating the enhanced signal as a whole, we also apply a global gain compensation heuristic computed as

$$G = \sqrt{\frac{(1 + \beta)\, E_0 / E_1}{1 + \beta \left(E_0 / E_1\right)^2}}, \qquad (14)$$

where E_0 is the total energy of the enhanced signal using the original gains ĝ_b and E_1 is the total energy when using the warped gains ĝ_b^(w). We use β = 0.02, which results in a maximum theoretical gain of 5.5 dB for clean bands. Scaling the final signal for the frame by G results in a perceptually cleaner signal that is about as loud as the clean signal. The band energy after that postfilter is given by

$$\hat X_b = G\, \hat g_b^{(w)} Y_b. \qquad (15)$$

When listening to the enhanced speech through loudspeakers in a room, the impulse response of the room is added back to the signal such that it blends with any speech coming from the room. However, when listening through headphones, the lack of any reverberation can make the enhanced signal sound overly dry and unnatural. This is addressed by enforcing a minimum decay in the energy, subject to never exceeding the energy of the noisy speech:

$$\hat X_b^{(r)}(\ell) = \min\left(\max\left(\hat X_b(\ell),\ \delta\, \hat X_b^{(r)}(\ell - 1)\right),\ Y_b(\ell)\right), \qquad (16)$$

where δ is chosen to be equivalent to a reverberation time T_60 = 100 ms.

After the frequency-domain enhanced speech is converted back to the time domain, a high-pass filter is applied to the output. The filter helps eliminate some remaining low-frequency noise; its cutoff frequency is determined by the estimated pitch of the talker [20] to avoid attenuating the fundamental.
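A compact per-frame sketch of (13)–(16) follows (Python with NumPy; the pitch-adaptive high-pass filter is omitted, β as in the text, and the per-frame δ value is a derivation from T_60 = 100 ms rather than a published constant):

```python
import numpy as np

def envelope_postfilter(g_hat, y_band_energy, beta=0.02):
    """Warp the DNN gains (13) and compensate the global level (14)-(15)."""
    g_w = g_hat * np.sin(0.5 * np.pi * g_hat)     # warped gains
    e0 = np.sum(g_hat ** 2 * y_band_energy)      # energy with original gains
    e1 = np.sum(g_w ** 2 * y_band_energy)        # energy with warped gains
    x = e0 / max(e1, 1e-12)
    G = np.sqrt((1 + beta) * x / (1 + beta * x ** 2))
    return G * g_w                               # final per-band gains

def reverb_floor(x_b, x_prev, y_b, delta=10 ** (-0.6)):
    """Eq. (16): enforce a minimum energy decay, capped by the noisy energy.
    delta = 10^(-0.6) corresponds to -6 dB per 10-ms frame, i.e. a 60 dB
    decay over 100 ms (T_60 = 100 ms), under our reading of the text."""
    return np.minimum(np.maximum(x_b, delta * x_prev), y_b)
```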
6. Experiments and Results
We evaluate the quality of the enhanced speech with two mean opinion score (MOS) [24] tests conducted using the crowdsourcing methodology P.808 [25]. First, we use the 48 kHz noisy VCTK test set provided in [26] to compare PercepNet to the original RNNoise [12], while also conducting an ablation study. The test includes 824 samples, rated by 8 listeners each, resulting in a 95% confidence interval of 0.04. We also provide PESQ-WB [27] results as a reference for comparison with other methods like SEGAN [9]. The results in Table 1 not only demonstrate a base improvement over RNNoise, but also show that both the pitch filter and the envelope postfilter help improve the quality of the enhanced speech. In addition, subjective testing clearly shows the limitations of PESQ-WB when evaluating the envelope postfilter: even though the subjective evaluation shows a strong improvement from the postfilter, PESQ-WB considers it a degradation. Note that the unusually high absolute numbers in the MOS results are likely due to the fullband samples in that test.

Table 1: P.808 MOS results based on internal testing on the VCTK test set at 48 kHz.

Algorithm                       PESQ-WB   MOS (P.808)
Noisy                           1.97      3.40
SEGAN [9]                       2.16      -
RNNoise (original) [12]         2.29      3.70
PercepNet (no pitch, no pf)     2.64      3.81
PercepNet (no pf)
PercepNet

In the second test, the DNS challenge [28] organizers evaluated blind test samples processed with PercepNet and provided us with the results in Table 2. The test set includes 150 synthetic samples without reverberation, 150 synthetic samples with reverberation, and 300 real recordings. Each sample was rated by 10 listeners, leading to a 95% confidence interval of 0.02 for all algorithms. Since PercepNet operates at 48 kHz, the 16-kHz challenge test data was internally up-sampled (and later down-sampled) in the STFT domain, avoiding any additional algorithmic delay. The same model parameters were used for both the challenge 16-kHz evaluation and our own 48-kHz VCTK evaluation, demonstrating the capability to operate on speech with different bandwidths. The quality also exceeds that of the baseline [29] algorithm.

Table 2: Challenge official P.808 MOS results. The baseline model is provided by the challenge organizers.

Algorithm    Synthetic w/o reverb   Synthetic w/ reverb   Real recordings   Overall
Noisy        3.32                   2.78                  2.97              3.01
Baseline     3.49                   2.64                  3.00              3.03
PercepNet    3.92                   3.16                  3.51              3.52

The algorithm complexity is mostly dictated by the neural network, and thus by the number of weights. For a frame size of 10 ms and 8M weights, the complexity is around 800 MMACS (one multiply-and-accumulate per weight per frame, at 100 frames per second). By quantizing the weights to 8 bits, vectorization makes it possible to run the network efficiently. With the default frame size of 10 ms, PercepNet requires 5.2% of one mobile x86 core (1.8 GHz Intel i7-8565U CPU) for real-time operation. Evaluated with a frame size of 40 ms (four internal frames of 10 ms each, to improve cache efficiency), the complexity is reduced to 4.1% on the same CPU core with an identical output. Despite a much lower complexity than the maximum allowed by the DNS challenge, PercepNet ranked second in the real-time track.

Qualitatively, the use of ERB bands – rather than operating directly on frequency bins – makes the algorithm incapable of producing musical noise (a.k.a. birdie artifacts) in the output. Similarly, the short window used for analysis avoids reverb-like smearing in the time domain. Instead, the main noticeable artifact is a certain amount of roughness caused by some noise remaining between pitch harmonics, especially for loud car noise.
7. Conclusion
We have demonstrated an efficient speech enhancement algorithm that focuses on the main perceptual characteristics of speech – spectral envelope and periodicity – to produce high-quality fullband speech in real time with low complexity. The proposed PercepNet model uses a band structure to represent the spectrum, along with pitch filtering and an additional envelope postfiltering step. Evaluation results show significant quality improvements for both wideband and fullband speech and demonstrate the effectiveness of both the pitch filtering and the postfilter. We believe the results demonstrate the benefits of modeling speech using perceptually-relevant parameters.
8. References

[1] S. Boll, "Suppression of acoustic noise in speech using spectral subtraction," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 27, no. 2, pp. 113–120, 1979.
[2] Y. Ephraim and D. Malah, "Speech enhancement using a minimum mean-square error log-spectral amplitude estimator," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 33, no. 2, pp. 443–445, 1985.
[3] D. Liu, P. Smaragdis, and M. Kim, "Experiments on deep learning for speech denoising," in Proceedings of the Fifteenth Annual Conference of the International Speech Communication Association, 2014.
[4] Y. Xu, J. Du, L.-R. Dai, and C.-H. Lee, "A regression approach to speech enhancement based on deep neural networks," IEEE Transactions on Audio, Speech and Language Processing, vol. 23, no. 1, pp. 7–19, 2015.
[5] K. Tan and D. Wang, "A convolutional recurrent neural network for real-time speech enhancement," in Proceedings of INTERSPEECH, 2018, pp. 3229–3233.
[6] A. Narayanan and D. Wang, "Ideal ratio mask estimation using deep neural networks for robust speech recognition," in Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013, pp. 7092–7096.
[7] Y. Zhao, D. Wang, I. Merks, and T. Zhang, "DNN-based enhancement of noisy and reverberant speech," in Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 6525–6529.
[8] D.S. Williamson, Y. Wang, and D. Wang, "Complex ratio masking for monaural speech separation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 3, pp. 483–492, 2016.
[9] S. Pascual, A. Bonafonte, and J. Serra, "SEGAN: Speech enhancement generative adversarial network," arXiv:1703.09452, 2017.
[10] D. Rethage, J. Pons, and X. Serra, "A Wavenet for speech denoising," in Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 5069–5073.
[11] C. Macartney and T. Weyde, "Improved speech enhancement with the Wave-U-Net," arXiv:1811.11307, 2018.
[12] J.-M. Valin, "A hybrid DSP/deep learning approach to real-time full-band speech enhancement," in Proceedings of the IEEE Multimedia Signal Processing (MMSP) Workshop, 2018.
[13] C. Montgomery, "Vorbis I specification," 2004.
[14] J. Princen and A. Bradley, "Analysis/synthesis filter bank design based on time domain aliasing cancellation," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 34, no. 5, pp. 1153–1161, 1986.
[15] B.C.J. Moore, An Introduction to the Psychology of Hearing. Brill, 2012.
[16] H. Gockel, B.C.J. Moore, and R.D. Patterson, "Asymmetry of masking between complex tones and noise: Partial loudness," The Journal of the Acoustical Society of America, vol. 114, no. 1, pp. 349–360, 2003.
[17] D. Talkin, "A robust algorithm for pitch tracking (RAPT)," in Speech Coding and Synthesis. Elsevier Science, 1995, ch. 14, pp. 495–518.
[18] T. Painter and A. Spanias, "Perceptual coding of digital audio," Proceedings of the IEEE, vol. 88, no. 4, pp. 451–515, 2000.
[19] K. Cho, B. Van Merriënboer, D. Bahdanau, and Y. Bengio, "On the properties of neural machine translation: Encoder-decoder approaches," in Proceedings of the Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation (SSST-8), 2014.
[20] K. Vos, K.V. Sorensen, S.S. Jensen, and J.-M. Valin, "Voice coding with Opus," in Proceedings of the 135th AES Convention, 2013.
[21] Y. Zhao, D. Wang, B. Xu, and T. Zhang, "Late reverberation suppression using recurrent neural networks with long short-term memory," in Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 5434–5438.
[22] H. Erdogan and T. Yoshioka, "Investigations on data augmentation and loss functions for deep learning based speech-background separation," in Proceedings of INTERSPEECH, 2018, pp. 3499–3503.
[23] J.-H. Chen and A. Gersho, "Adaptive postfiltering for quality enhancement of coded speech," IEEE Transactions on Speech and Audio Processing, vol. 3, no. 1, pp. 59–71, 1995.
[24] ITU-T, Recommendation P.800: Methods for subjective determination of transmission quality, 1996.
[25] ITU-T, Recommendation P.808: Subjective evaluation of speech quality with a crowdsourcing approach, 2018.
[26] C. Valentini-Botinhao, X. Wang, S. Takaki, and J. Yamagishi, "Investigating RNN-based speech enhancement methods for noise-robust text-to-speech," in Proceedings of the ISCA Speech Synthesis Workshop (SSW), 2016, pp. 146–152.
[27] ITU-T, Recommendation P.862.2: Wideband extension to Recommendation P.862 for the assessment of wideband telephone networks and speech codecs (PESQ-WB), 2005.
[28] C.K.A. Reddy, V. Gopal, R. Cutler, E. Beyrami, R. Cheng, H. Dubey, S. Matusevych, R. Aichner, A. Aazami, S. Braun, P. Rana, S. Srinivasan, and J. Gehrke, "The INTERSPEECH 2020 deep noise suppression challenge: Datasets, subjective testing framework, and challenge results," arXiv:2005.13981, 2020.
[29] Y. Xia, S. Braun, C.K.A. Reddy, H. Dubey, R. Cutler, and I. Tashev, "Weighted speech distortion losses for neural-network-based real-time speech enhancement," arXiv:2001.10601, 2020.