Noisy-target Training: A Training Strategy for DNN-based Speech Enhancement without Clean Speech

Takuya Fujimura†, Yuma Koizumi‡, Kohei Yatabe⋆, Ryoichi Miyazaki†

†National Institute of Technology, Tokuyama College, Yamaguchi, Japan
‡NTT Corporation, Tokyo, Japan
⋆Waseda University, Tokyo, Japan
ABSTRACT
Deep neural network (DNN)–based speech enhancement ordinarily requires clean speech signals as the training target. However, collecting clean signals is very costly because they must be recorded in a studio. This requirement currently restricts the amount of training data for speech enhancement to less than 1/1000 of that of speech recognition, which does not need clean signals. Increasing the amount of training data is important for improving the performance, and hence the requirement of clean signals should be relaxed. In this paper, we propose a training strategy that does not require clean signals. The proposed method only utilizes noisy signals for training, which enables us to use a variety of speech signals in the wild. Our experimental results show that the proposed method can achieve performance similar to that of a DNN trained with clean signals.
Index Terms — Single-channel speech enhancement, deep neural network (DNN), training target, Noise2Noise.
1. INTRODUCTION
Speech enhancement is utilized for recovering target speech from a noisy observed signal [1]. It is a fundamental task with a wide range of applications, including automatic speech recognition (ASR) [2–4]. Over the last decade, rapid progress has been made by using supervised training of deep neural networks (DNNs) [3–13]. A DNN is trained so that it predicts the target speech from an input noisy observation. In the training, the training target is clean speech, and the input signal is simulated by using clean speech and noise as shown in Fig. 1 (a). In this paper, we refer to this standard training strategy as Clean-target Training (CTT).
Although Clean-target Training is clearly a proper strategy, it has two potential problems. First, collecting studio-recorded signals is very costly and time-consuming. Unlike images, speech signals are easily contaminated by the surrounding environment. Thus, "clean" signals can only be acquired under well-controlled conditions in a studio. Such difficulty prohibits collecting a huge amount of data for training. In fact, a typical dataset in speech enhancement contains only 12 thousand utterances [14], whereas training of an ASR system may utilize over 35 million utterances [15] because ASR does not require clean speech as the target. Second, providing enough variations of the recording condition for the training is hopeless. Real-world observations vary with multiple factors, including the recording equipment, mouth-microphone distance, and the Lombard effect. Simulating all of these factors in a studio to obtain a variety of clean signals is impossible. Hence, the training dataset and real-world data have mismatches that can degrade the performance of speech enhancement, e.g., a training dataset is recorded with a studio (large-diaphragm condenser) microphone, while a real-world signal is recorded with a (low-priced MEMS) microphone implemented in a smartphone. Because of these reasons, using clean speech signals as the training target can be an essential limitation.
Fig. 1. Overview of (a) Clean-target Training, (b) Noise-target Training, and (c) Noisy-target Training.
To relax the requirement of clean signals, a training strategy that we refer to as Noise-target Training (NeTT) has been studied, as shown in Fig. 1 (b). Its training target is a mixture of speech and noise, and the input signal is simulated by using the same clean speech and some other noise. This training strategy was originally proposed in image processing by the name Noise2Noise [19], and was later applied to speech enhancement [18]. In Noise2Noise, pairs of noisy signals that consist of different noises and exactly the same clean target are utilized for training. Since the contained clean signal is the same, the trained DNN tries to map the noise in an input signal to another noise. Assuming that the noise distribution is zero-mean, the trained DNN becomes a noise suppressor [19]. By Noise2Noise training, the requirement of a clean target can be avoided for image applications because photos of exactly the same target with different noise can be easily obtained by a camera with multiple exposures. However, it is not applicable to audio applications because it is impossible to observe multiple noisy signals with exactly the same speech signal. Therefore, the previous research has utilized clean speech to simulate a pair of noisy signals [18] as in Fig. 1 (b), which inherits the limitation of Clean-target Training.
In this paper, we propose Noisy-target Training (NyTT), shown in Fig. 1 (c). Its training target is noisy speech, and the input signal is simulated by using the same noisy speech and noise. In other words, a DNN is trained to predict a noisy speech signal from a more noisy signal. Although this strategy might sound inappropriate, it can achieve results similar to those obtained by Clean-target Training when there is a mismatch between training and testing datasets. Our contributions are as follows: (1) proposing a new training strategy, (2) examining several training conditions by extensive experiments, and (3) analyzing the experimental results.
2. RELATED WORKS
Let a $T$-point-long time-domain observation $x \in \mathbb{R}^T$ be a mixture of a target speech $s$ and observation noise $n^{(obs)}$ as $x = s + n^{(obs)}$. The goal of speech enhancement is to recover $s$ only from $x$. Over the last decade, the application of DNNs to speech enhancement has substantially advanced the state-of-the-art performance [1, 3–13].

A popular method is to utilize a DNN for estimating a time-frequency (T-F) mask in the short-time Fourier transform (STFT) domain [1]. Let $\mathcal{F} : \mathbb{R}^T \to \mathbb{C}^{F \times K}$ denote the STFT, where $F$ and $K$ are the numbers of frequency and time bins, respectively. Then, DNN-based speech enhancement can be written as

$$\hat{s} = \mathcal{F}^{\dagger}(\mathcal{M}(x; \theta) \odot \mathcal{F}(x)), \qquad (1)$$

where $\hat{s}$ is the estimate of $s$, $\mathcal{F}^{\dagger}$ is the inverse STFT, $\odot$ is the element-wise product, $\mathcal{M}$ is a DNN for estimating a T-F mask, and $\theta$ is the set of its parameters. Obviously, the quality of the output signals is determined by the parameters of the DNN.

In this study, we focus on the training strategy for the DNN, $\mathcal{M}(\,\cdot\,; \theta)$. The conventional training strategies (CTT and NeTT) are explained in this section, whereas the proposed training strategy (NyTT) is introduced in the next section.

2.1. Clean-target Training

Most of the literature on DNN-based speech enhancement is based on Clean-target Training [1], which utilizes the clean speech signals as the training target. It minimizes the following prediction error between the estimated signal $\hat{s}$ and the clean target $s$:

$$\mathcal{L}_{\mathrm{CTT}} = \frac{1}{M} \sum_{m=1}^{M} \mathcal{D}(\hat{s}_m, s_m), \qquad (2)$$

where $M$ is the minibatch size, and $\mathcal{D}$ is a function that measures the difference between the input variables, such as the $\ell_2$ distance $\mathcal{D}(a, b) = \|a - b\|_2$. Since $\hat{s}_m$ is predicted from the $m$th noisy observation $x_m$ as in Eq. (1), Clean-target Training requires pairs of noisy and clean signals $(x_m, s_m)$ as training data. These training data are simulated by mixing clean speech signals and noise that are collected independently, because recording such paired signals in the real environment is not feasible.

Clean-target Training has an essential limitation due to the requirement of clean speech signals. In real use cases, recording conditions have extreme variation caused by the recording equipment, mouth-microphone angle and distance, surrounding environments, and several other factors. Covering all possible recording conditions with studio-recorded speech signals is impossible because clean signals can only be acquired in the well-controlled environment of a studio. This fact limits the amount and variation of the training data, which may degrade the performance of speech enhancement.
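For concreteness, the following is a minimal PyTorch sketch of Eqs. (1)–(2). The mask network `MaskNet`, its sigmoid output, and the layer sizes are illustrative stand-ins, not the paper's model (the actual network, described in Sec. 4, is a CNN-BLSTM estimating a complex-valued mask); only the STFT parameters (512-point Hamming window, 128-sample shift) follow Sec. 4.

    import torch

    class MaskNet(torch.nn.Module):
        """Illustrative T-F mask estimator (stand-in for the CNN-BLSTM)."""
        def __init__(self, n_freq=257):  # 257 bins for a 512-point DFT
            super().__init__()
            self.net = torch.nn.Sequential(
                torch.nn.Linear(n_freq, n_freq), torch.nn.Sigmoid())

        def forward(self, logmag):       # logmag: (batch, frames, freq)
            return self.net(logmag)      # real-valued mask in [0, 1]

    def enhance(x, model, n_fft=512, hop=128):
        """Eq. (1): s_hat = iSTFT(M(x) * STFT(x)), element-wise product."""
        win = torch.hamming_window(n_fft)
        X = torch.stft(x, n_fft, hop, window=win, return_complex=True)
        logmag = torch.log(X.abs() + 1e-8).transpose(1, 2)
        mask = model(logmag).transpose(1, 2)
        return torch.istft(mask * X, n_fft, hop, window=win,
                           length=x.shape[-1])

    def ctt_loss(model, x, s):
        """Eq. (2) with D as the l2 distance, averaged over the minibatch."""
        return torch.linalg.vector_norm(enhance(x, model) - s, dim=-1).mean()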
2.2. Noise-target Training

Another training strategy, which was originally proposed for image processing [19], is Noise-target Training [18]. It trains a DNN to predict a noisy signal from another noisy signal as follows. Noise-target Training considers two different noises $n^{(1)}$ and $n^{(2)}$ for a clean target $s$. Then, two kinds of observations can be obtained as $x^{(1)} = s + n^{(1)}$ and $x^{(2)} = s + n^{(2)}$, which form a pair of noisy signals $(x^{(1)}, x^{(2)})$. Using such noisy-noisy pairs $(x^{(1)}_m, x^{(2)}_m)$, a DNN is trained to minimize the following prediction error between the output signal $\hat{s}^{(1)}$ estimated from $x^{(1)}$ and the target $x^{(2)}$:

$$\mathcal{L}_{\mathrm{NeTT}} = \frac{1}{M} \sum_{m=1}^{M} \mathcal{D}(\hat{s}^{(1)}_m, x^{(2)}_m). \qquad (3)$$

Since random noise cannot be predicted by a DNN, the random components contained in the training data are mapped to their expected values. Therefore, by assuming the noise to be a zero-mean random variable, this training strategy yields a DNN that eliminates the noise.

Although Noise-target Training is useful for image processing [19], it must inherit the limitation of Clean-target Training for speech enhancement. For image applications, the above noisy-noisy pairs (contaminated by shot noise and thermal noise) can be easily obtained by a camera with short exposures. In contrast, for audio applications, it is impossible to observe multiple noisy signals with exactly the same speech because audio signals are time- and space-variant. Hence, we must use clean speech signals to simulate the noisy-noisy pairs, which limits the variation of the training data.
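Under the same assumptions, Eq. (3) differs from Eq. (2) only in the pair construction: the training target is the second noisy observation rather than the clean signal. A sketch reusing `enhance` (and the imports) from the block above:

    def nett_loss(model, s, n1, n2):
        """Eq. (3): predict x(2) from x(1); both share the same clean s.
        Note that clean speech s is still needed to simulate the pair."""
        x1 = s + n1                      # x(1) = s + n(1), network input
        x2 = s + n2                      # x(2) = s + n(2), training target
        s_hat = enhance(x1, model)
        return torch.linalg.vector_norm(s_hat - x2, dim=-1).mean()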
3. PROPOSED METHOD
The success of Noise-target Training has suggested the possibility of training a DNN without clean signals. However, it is not in a suitable form for audio applications, as discussed above. In this study, we investigate another possibility of training without clean signals, Noisy-target Training, which is suitable for speech enhancement.
3.1. Noisy-target Training

The above Noise-target Training has revealed two facts: (1) clean signals are not mandatory for training, and (2) noisy signals can be utilized instead. By interpreting them in the broadest sense, we propose Noisy-target Training as follows.

In the proposed training strategy, we only require a noisy observation $x$ and noise $n$, i.e., clean signals are not utilized. By mixing them, a more noisy signal $y$ is synthesized as

$$y = x + n. \qquad (4)$$

This forms a pair of noisy and more noisy signals $(y, x)$. By inputting the more noisy signal $y$ into the DNN as

$$\hat{s} = \mathcal{F}^{\dagger}(\mathcal{M}(y; \theta) \odot \mathcal{F}(y)), \qquad (5)$$

the proposed method trains the DNN by minimizing the following prediction error between $x$ and the enhanced more noisy signal $\hat{s}$:

$$\mathcal{L}_{\mathrm{NyTT}} = \frac{1}{M} \sum_{m=1}^{M} \mathcal{D}(\hat{s}_m, x_m). \qquad (6)$$

Therefore, the proposed method realizes training similar to Eq. (3) without using any clean signal.

Since $x = s + n^{(obs)}$, the proposed method can be viewed as Noise-target Training in Section 2.2 with $x^{(1)} = s + n^{(obs)} + n$ and $x^{(2)} = s + n^{(obs)}$. Hence, the validity of the proposed method depends on the statistics of $n^{(obs)} + n$ and $n^{(obs)}$ utilized during the training. Since we suppose that both $n^{(obs)}$ and $n$ are given by real-world recordings, theoretical validation cannot be made for the proposed method. Even so, our experiments in the next section confirmed its effectiveness for speech enhancement.
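As a sketch of Eqs. (4)–(6), again reusing `enhance` from Sec. 2 (variable names are illustrative), one NyTT loss evaluation only touches the noisy target x and the additional noise n:

    def nytt_loss(model, x, n):
        """Eqs. (4)-(6): predict the noisy target x from the noisier y."""
        y = x + n                        # Eq. (4): more noisy input
        s_hat = enhance(y, model)        # Eq. (5): enhance y
        return torch.linalg.vector_norm(s_hat - x, dim=-1).mean()  # Eq. (6)

    # One training step with Adam (optimizer settings as in Sec. 4):
    # opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    # opt.zero_grad(); nytt_loss(model, x, n).backward(); opt.step()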
4. EXPERIMENTS
We conducted three types of experiments to investigate the performance of the proposed method (NyTT):
Objective experiments: We investigated whether NyTT properly enables training of a DNN without clean speech. We evaluated the performance on both seen and unseen datasets, i.e., with and without mismatch between training and testing data.

Effects of SNR of noisy target: We investigated the performance of NyTT in response to the SNR between $s$ and $n^{(obs)}$.

Effects of types of additional noise: We investigated the performance of NyTT in response to the relationship between $n^{(obs)}$ and $n$ by using four types of additional noise datasets.

Datasets: Table 1 shows the datasets used in the experiments. We utilized the
VoiceBank-DEMAND [14] dataset, which is openly available and frequently used in the literature of DNN-based speech enhancement [7, 10, 13]. The train and test sets consist of 28 and 2 speakers (11572 and 824 utterances), respectively. In addition to this dataset, to evaluate the performance under a training/testing data mismatched condition, we constructed a test dataset, TIMIT-MOBILE, by mixing TIMIT [20] (speech) and TAU Urban Acoustic Scenes 2019 Mobile [21] (noise) at a signal-to-noise ratio (SNR) randomly selected from − , , , and dB. The test set consists of 1680 utterances spoken by 168 speakers (112 males and 56 females).

To mimic the use of noisy signals for training in NyTT, we additionally used Libri-Task1 and CHiME5 as noisy datasets. Libri-Task1 consists of mixed signals of the development sets of LibriTTS [24] and TAU Urban Acoustic Scenes 2020 Mobile [25] (TAU-2020), whose SNR was randomly selected from , , , and dB. This dataset includes 8.97 hours of noisy speech with 5736 utterances. CHiME5 was the training dataset of the 5th CHiME Speech Separation and Recognition Challenge [26] and consists of 77.24 hours of noisy speech with 79967 utterances, which were created by cutting out each speech interval from the continuous training data with a 0.5-sec margin before and after. In addition, we used the background noise of CHiME3 [27] as a noise dataset (CHiME3).
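All of the above mixtures are defined by their SNR. The scaling below is the standard way to realize a given SNR and is only a sketch of how such datasets can be simulated, not the authors' released tooling:

    import numpy as np

    def mix_at_snr(speech, noise, snr_db):
        """Scale `noise` so the speech-to-noise power ratio equals snr_db,
        then return the mixture (1-D arrays of equal length assumed)."""
        p_s = np.mean(speech ** 2)
        p_n = np.mean(noise ** 2) + 1e-12
        scale = np.sqrt(p_s / (p_n * 10.0 ** (snr_db / 10.0)))
        return speech + scale * noise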
Comparison methods and metrics: In order to investigate whether NyTT can solve the recording-condition mismatch problem by utilizing a larger amount of noisy targets, we evaluated the following two versions of NyTT: NyTT, trained with the noisy signals of VoiceBank-DEMAND as the target, and NyTT (L), additionally trained with the larger noisy datasets Libri-Task1 and CHiME5. These methods were compared with CTT and NeTT [18]. As the metrics, we used PESQ, CSIG, CBAK, COVL, and the scale-invariant signal-to-distortion ratio (SI-SDR). The first four metrics are the standard metrics used with VoiceBank-DEMAND, and SI-SDR is a metric widely used for the evaluation of speech enhancement.
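PESQ, CSIG, CBAK, and COVL require their respective reference implementations, but SI-SDR has a closed form; a NumPy sketch of the usual definition:

    import numpy as np

    def si_sdr(estimate, reference):
        """Scale-invariant SDR in dB: project the estimate onto the
        reference, then compare the target part with the residual."""
        alpha = (np.dot(estimate, reference)
                 / (np.dot(reference, reference) + 1e-12))
        target = alpha * reference
        return 10.0 * np.log10(np.sum(target ** 2)
                               / (np.sum((estimate - target) ** 2) + 1e-12))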
Training details:
For NyTT and NyTT (L), we randomly selected an additional noise $n$ from DEMAND, TAU-2020, and CHiME3, and mixed it into the noisy speech $x$ at an SNR randomly selected between − and dB, where the SNR is measured by considering $x$ as the signal and $n$ as the noise. For CTT, to use the same variety of noise samples as NyTT, we augmented the noisy dataset by randomly swapping the noise in noisy signals with a noise from DEMAND, TAU-2020, and CHiME3. In the same sense, we used DEMAND, TAU-2020, and CHiME3 as the noise datasets for NeTT. Therefore, the amount and variety of training noise samples were the same for all methods, and only the type and amount of the target signals were different.

The DNN estimated a complex-valued T-F mask [6] and consisted of a CNN-BLSTM with the same architecture as in [13]. The input of the DNN was the log-amplitude spectrogram of the input signal, whose size was $F \times K$. The spectrogram of the input was multiplied by the estimated complex T-F mask and transformed back to the time domain, where the STFT parameters, frame shift and window size (= DFT size), were set to 128 and 512 samples, respectively, with the Hamming window. We used the mean squared error (MSE), $\mathcal{D}(a, b) = \frac{1}{T}\|a - b\|_2^2$, and the Adam optimizer [28] with a fixed learning rate of 0.0001. We separated the training dataset into randomly selected 50 utterances and the other utterances for validation and training, respectively. The noise $n$ was added to $s$ at an SNR randomly selected from − , , , and dB.
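Putting the data preparation together, the following sketches how one NyTT training pair (y, x) can be drawn. `noise_bank` is a hypothetical list of noise clips (DEMAND/TAU-2020/CHiME3), `mix_at_snr` is the helper sketched above, and the SNR bounds are illustrative placeholders rather than the paper's exact range:

    import numpy as np

    def make_nytt_pair(x, noise_bank, snr_lo=-5.0, snr_hi=20.0,
                       rng=np.random.default_rng()):
        """Draw an additional noise clip and an SNR, then return the
        (more noisy, noisy) training pair (y, x) of Eq. (4)."""
        n = noise_bank[rng.integers(len(noise_bank))][: len(x)]
        snr_db = rng.uniform(snr_lo, snr_hi)   # x is treated as the "signal"
        return mix_at_snr(x, n, snr_db), x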
Table 1. List of training/testing datasets. Libri-Task1 and CHiME5 include only pre-mixed (noisy) signals, which imitates the situation that only noisy speech signals x are available.

Name                   | Clean s        | Noise n(obs)
VoiceBank-DEMAND [14]  | VoiceBank [22] | DEMAND [23]
TIMIT-MOBILE           | TIMIT [20]     | TAU-2019 [21]
Libri-Task1            | Only noisy signal provided (LibriTTS [24] + TAU-2020 [25])
CHiME5                 | Only noisy signal provided
Table 2. Results on VoiceBank-DEMAND (no mismatch between training and testing datasets). Input means the scores of the input signals.

Method    | SI-SDR | PESQ | CSIG | CBAK | COVL
Input     |  9.21  | 1.97 | 3.35 | 2.44 | 2.63
CTT       |        |      |      |      |
NeTT      | 19.50  | 2.63 | 3.77 | 3.34 | 3.19
NyTT      | 17.66  | 2.30 | 3.19 | 3.01 | 2.72
NyTT (L)  | 17.72  | 2.31 | 3.23 | 3.02 | 2.75
Table 3. Results on TIMIT-MOBILE (with mismatch between training and testing datasets). Input means the scores of the input signals.

Method    | SI-SDR | PESQ | CSIG | CBAK | COVL
Input     |  4.69  | 1.30 | 2.73 | 1.75 | 1.94
CTT       |        |      |      |      |
NyTT (L)  | 12.38  | 1.91 |      |      |

For the objective evaluation, we conducted two experiments. First, we conducted an experiment for verifying whether NyTT can train the DNN without clean speech signals. To remove the effect of training/testing data mismatch, we used the test dataset of
VoiceBank-DEMAND for evaluating the scores. Table 2 summarizes the evaluated scores, where NyTT and NyTT (L) achieved higher scores than those of the input signal (Noisy). This result indicates that Noisy-target Training can train a DNN without clean speech. Next, we evaluated each method on
TIMIT-MOBILE to confirm whether NyTT robustly worked on unseen test data. Here, there is a mismatch between the recording conditions of the training and testing datasets. Table 3 summarizes the evaluated scores. While CTT and NeTT performed better than NyTT in the VoiceBank-DEMAND results (no mismatch between training and testing datasets), the performance of all methods was similar in the results on
TIMIT-MOBILE (having mismatch between training and testing datasets). Even though NyTT did not use any clean speech in training, it achieved results similar to those obtained by CTT and NeTT, which used clean speech in training. This result might indicate that training with clean speech can overfit to the signals in the training dataset. The proposed NyTT has the potential to avoid such overfitting by using a huge amount of noisy data, which can be easily acquired.

Fig. 2 shows the SI-SDR/PESQ improvements of CTT and NyTT (L) corresponding to Table 3. The median of both training methods was almost the same, whereas the variance of NyTT was smaller than that of CTT. This suggests that NyTT has stable performance even when there is a mismatch between training and testing datasets.

Fig. 2. Comparison of SI-SDR/PESQ improvements between CTT and NyTT (L) on TIMIT-MOBILE (having mismatch with training data).

Since NyTT becomes CTT when the SNR of the noisy target signal $x$ is ∞ dB (i.e., $x$ is clean), we investigated the relationship between the SNR of the noisy target and the performance of NyTT. We modified VoiceBank-DEMAND so that the SNRs of all noisy targets became −5, 0, 5, 10, 15, and 20 dB. The additional noise $n$ was taken from DEMAND. To remove the training/testing data mismatch effect, we used the test dataset of VoiceBank-DEMAND for evaluation.

Fig. 3. Relationship between the SNR of the noisy signal x utilized in NyTT and its SI-SDR improvement. An SNR of ∞ dB is equivalent to CTT.

Fig. 3 shows the SI-SDR improvements for each SNR. As in the figure, when the SNR of the noisy target $x$ was greater than 0 dB, the performance increased as the SNR of $x$ increased. Meanwhile, there was almost no difference in SI-SDR improvement between the −5 and 0 dB SNR conditions. This might be because, when the power of $n^{(obs)}$ is equal to or greater than that of $s$, the DNN is trained to predict not only $s$ but also $n^{(obs)}$ by removing only the additional noise $n$. In contrast, when the SNR of the noisy target $x$ was 15 and 20 dB, the performance was almost the same as that of CTT. These results suggest that (1) the SNR of the noisy target $x$ should be greater than 0 dB for NyTT to work, and (2) noisy speech signals with an SNR greater than or equal to 15 dB can serve as "clean" signals for training in speech enhancement.

Since NyTT utilizes noisy signals $x = s + n^{(obs)}$ together with additional noise $n$, we investigated the relation between the types of noise used for $n^{(obs)}$ and $n$. For the noisy signals $x$, VoiceBank-DEMAND was utilized as in the previous experiments, i.e., $n^{(obs)}$ is from DEMAND. For the additional noise $n$, we utilized one of the following four datasets: DEMAND, TAU-2020, CHiME3, and the training
dataset of DCASE2017 Challenge Task 2 (Task2). These four noise datasets can be classified into two groups: the first three datasets include various environmental noise, while Task2 includes only monophonic sound events that would occur in an office. For the testing dataset, TIMIT-NOISEX-92 was utilized to avoid using TAU-2019, which is similar to TAU-2020. It was generated by mixing TIMIT (speech) and NOISEX-92 [29] (noise) at an SNR randomly selected from , , , and dB. Note that the type of noise in NOISEX-92 is different from all four datasets used for the additional noise $n$.

Table 4. Comparison on the type of additional noise n in NyTT.

        | Input | DEMAND | TAU-2020 | CHiME3 | Task2
SI-SDR  | 8.47  | 12.09  | 12.00    | 11.99  | 9.63
PESQ    | 1.44  | 1.74   | 1.83     | 1.95   | 1.52

Table 4 shows the SI-SDR and PESQ of NyTT using one of the four datasets for $n$. NyTT using DEMAND, TAU-2020, and CHiME3 achieved similar scores, whereas NyTT with Task2 failed to enhance the signal (the scores of Input and Task2 were almost the same). This should be because Task2 contains a very different type of noise, as mentioned in the previous paragraph. To confirm this, we visualized the distribution of the datasets as shown in Fig. 4. This figure was obtained as follows. First, we randomly selected 1000 samples from each training dataset and extracted the first 2 sec to align the length of the data. Second, we calculated the acoustic feature of each sample using VGGish [30]. Finally, the calculated features were illustrated as a 2D map by t-Distributed Stochastic Neighbor Embedding (t-SNE) [31].

Fig. 4. Visualization of the distributions of n and n(obs) using t-SNE. DEMAND (n) denotes n(obs), and the others are n in Table 4.

From the figure, it can be seen that DEMAND, TAU-2020, and CHiME3 are similarly distributed, but Task2 has almost no overlap with them. This result suggests that NyTT can successfully train a DNN when the distribution of the additional noise $n$ can hide in the distribution of $n^{(obs)}$. This should be because a different type of noise $n$ (such as Task2 in this experiment) can be distinguished from the noisy signals $x = s + n^{(obs)}$, which enables the DNN to eliminate only the additional noise $n$ while keeping $n^{(obs)}$.
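A sketch of this visualization pipeline with scikit-learn; the VGGish feature extraction itself is assumed to have been done beforehand (VGGish outputs 128-dimensional embeddings), so random features stand in for it here:

    import numpy as np
    from sklearn.manifold import TSNE

    def embed_2d(features, seed=0):
        """Project per-clip acoustic features (one row each) onto 2-D."""
        return TSNE(n_components=2, random_state=seed).fit_transform(features)

    # 1000 clips, 128-dim VGGish-like embeddings (random stand-ins)
    feats = np.random.default_rng(0).normal(size=(1000, 128))
    coords = embed_2d(feats)             # (1000, 2) points to scatter-plot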
5. CONCLUSIONS
In this study, for DNN-based speech enhancement, we proposed a training strategy that does not require clean signals. We utilized noisy signals as the target and trained a DNN to predict them from more noisy signals (Section 3.1). Our experiments showed that the proposed method (1) was able to train a DNN without clean speech signals, (2) achieved results similar to those obtained by using clean signals as the target when the training and testing datasets have a mismatch, and (3) revealed the borderline (15 dB) where a signal can be treated as clean in the training. Future work includes evaluation using a larger dataset and theoretical validation.
6. REFERENCES

[1] D. L. Wang and J. Chen, "Supervised Speech Separation Based on Deep Learning: An Overview," IEEE/ACM Trans. Audio Speech Lang. Process., 2018.
[2] Proc. IEEE Autom. Speech Recognit. Underst. Workshop (ASRU), 2015.
[3] A. Narayanan and D. L. Wang, "Ideal Ratio Mask Estimation using Deep Neural Networks for Robust Speech Recognition," Proc. Int. Conf. Acoust. Speech Signal Process. (ICASSP), 2013.
[4] H. Erdogan, J. R. Hershey, S. Watanabe, and J. Le Roux, "Phase-Sensitive and Recognition-Boosted Speech Separation using Deep Recurrent Neural Networks," Proc. Int. Conf. Acoust. Speech Signal Process. (ICASSP), 2015.
[5] Y. Xu, J. Du, L. R. Dai, and C. H. Lee, "An Experimental Study on Speech Enhancement Based on Deep Neural Networks," IEEE Signal Process. Lett., 2014.
[6] D. S. Williamson, Y. Wang, and D. L. Wang, "Complex Ratio Masking for Monaural Speech Separation," IEEE/ACM Trans. Audio Speech Lang. Process., 2016.
[7] S. Pascual, A. Bonafonte, and J. Serra, "SEGAN: Speech Enhancement Generative Adversarial Network," Proc. Interspeech, 2017.
[8] S. W. Fu, C. F. Liao, Y. Tsao, and S. D. Lin, "MetricGAN: Generative Adversarial Networks based Black-box Metric Scores Optimization for Speech Enhancement," Proc. Int. Conf. Mach. Learn. (ICML), 2019.
[9] Y. Koizumi, K. Niwa, Y. Hioka, K. Kobayashi, and Y. Haneda, "DNN-based Source Enhancement to Increase Objective Sound Quality Assessment," IEEE/ACM Trans. Audio Speech Lang. Process., 2018.
[10] Y. Koizumi, K. Yatabe, M. Delcroix, Y. Masuyama, and D. Takeuchi, "Speech Enhancement using Self-Adaptation and Multi-Head Self-Attention," Proc. Int. Conf. Acoust. Speech Signal Process. (ICASSP), 2020.
[11] D. Takeuchi, K. Yatabe, Y. Koizumi, Y. Oikawa, and N. Harada, "Data-driven Design of Perfect Reconstruction Filterbank for DNN-based Sound Source Enhancement," Proc. Int. Conf. Acoust. Speech Signal Process. (ICASSP), 2019.
[12] D. Takeuchi, K. Yatabe, Y. Koizumi, Y. Oikawa, and N. Harada, "Real-Time Speech Enhancement using Equilibriated RNN," Proc. Int. Conf. Acoust. Speech Signal Process. (ICASSP), 2020.
[13] M. Kawanaka, Y. Koizumi, R. Miyazaki, and K. Yatabe, "Stable Training of DNN for Speech Enhancement Based on Perceptually-Motivated Black-Box Cost Function," Proc. Int. Conf. Acoust. Speech Signal Process. (ICASSP), 2020.
[14] C. Valentini-Botinhao, X. Wang, S. Takaki, and J. Yamagishi, "Investigating RNN-based Speech Enhancement Methods for Noise-Robust Text-to-Speech," Proc. ISCA Speech Synth. Workshop (SSW), 2016.
[15] Y. He, T. N. Sainath, R. Prabhavalkar, I. McGraw, R. Alvarez, D. Zhao, D. Rybach, A. Kannan, Y. Wu, R. Pang, Q. Liang, D. Bhatia, Y. Shangguan, B. Li, G. Pundak, K. C. Sim, T. Bagby, S. Y. Chang, K. Rao, and A. Gruenstein, "Streaming End-to-End Speech Recognition for Mobile Devices," Proc. Int. Conf. Acoust. Speech Signal Process. (ICASSP), 2019.
[16] T. Ochiai, S. Watanabe, T. Hori, and J. R. Hershey, "Multichannel End-to-End Speech Recognition," Proc. Int. Conf. Mach. Learn. (ICML), 2017.
[17] S. Wisdom, E. Tzinis, H. Erdogan, R. J. Weiss, K. Wilson, and J. R. Hershey, "Unsupervised Sound Separation using Mixtures of Mixtures," Proc. Int. Conf. Mach. Learn. (ICML), 2020.
[18] N. Alamdari, A. Azarang, and N. Kehtarnavaz, "Improving Deep Speech Denoising by Noisy2Noisy Signal Mapping," Appl. Acoust., 2021.
[19] J. Lehtinen, J. Munkberg, J. Hasselgren, S. Laine, T. Karras, M. Aittala, and T. Aila, "Noise2Noise: Learning Image Restoration without Clean Data," Proc. Int. Conf. Mach. Learn. (ICML), 2018.
[20] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett, N. L. Dahlgren, and V. Zue, "TIMIT Acoustic-Phonetic Continuous Speech Corpus LDC93S1," Web Download, Philadelphia: Linguistic Data Consortium, 1993.
[21] A. Mesaros, T. Heittola, and T. Virtanen, "Acoustic Scene Classification in DCASE 2019 Challenge: Closed and Open Set Classification and Data Mismatch Setups," Proc. Detect. Classif. Acoust. Scenes Events (DCASE) Workshop, 2019.
[22] C. Veaux, J. Yamagishi, and S. King, "The Voice Bank Corpus: Design, Collection and Data Analysis of a Large Regional Accent Speech Database," 2013.
[23] J. Thiemann, N. Ito, and E. Vincent, "The Diverse Environments Multi-Channel Acoustic Noise Database: A Database of Multichannel Environmental Noise Recordings," J. Acoust. Soc. Am., 2013.
[24] H. Zen, V. Dang, R. Clark, Y. Zhang, R. J. Weiss, Y. Jia, Z. Chen, and Y. Wu, "LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech," arXiv:1904.02882, 2019.
[25] T. Heittola, A. Mesaros, and T. Virtanen, "Acoustic Scene Classification in DCASE 2020 Challenge: Generalization Across Devices and Low Complexity Solutions," Proc. Detect. Classif. Acoust. Scenes Events (DCASE) Workshop, 2020.
[26] J. Barker, S. Watanabe, E. Vincent, and J. Trmal, "The Fifth 'CHiME' Speech Separation and Recognition Challenge: Dataset, Task and Baselines," Proc. Interspeech, 2018.
[27] J. Barker, R. Marxer, E. Vincent, and S. Watanabe, "The Third 'CHiME' Speech Separation and Recognition Challenge: Dataset, Task and Baselines," Proc. IEEE Autom. Speech Recognit. Underst. Workshop (ASRU), 2015.
[28] D. P. Kingma and J. Ba, "Adam: A Method for Stochastic Optimization," Proc. Int. Conf. Learn. Represent. (ICLR), 2015.
[29] A. Varga and H. J. M. Steeneken, "Assessment for Automatic Speech Recognition II: NOISEX-92: A Database and an Experiment to Study the Effect of Additive Noise on Speech Recognition Systems," Speech Commun., 1993.
[30] S. Hershey, S. Chaudhuri, D. P. W. Ellis, J. F. Gemmeke, A. Jansen, R. C. Moore, M. Plakal, D. Platt, R. A. Saurous, B. Seybold, M. Slaney, R. J. Weiss, and K. Wilson, "CNN Architectures for Large-Scale Audio Classification," Proc. Int. Conf. Acoust. Speech Signal Process. (ICASSP), 2017.
[31] L. van der Maaten and G. Hinton, "Visualizing High-Dimensional Data using t-SNE," J. Mach. Learn. Res., 2008.