SPEECH ENHANCEMENT WITH MIXTURE OF DEEP EXPERTS WITH CLEAN CLUSTERING PRE-TRAINING
Shlomo E. Chazan Jacob Goldberger Sharon Gannot
Faculty of Engineering, Bar-Ilan University, Ramat-Gan, Israel
{Shlomi.Chazan, Jacob.Goldberger, Sharon.Gannot}@biu.ac.il

This project has received funding from the European Union's Horizon 2020 Research and Innovation Programme under Grant Agreement No. 871245 and was supported by the Ministry of Science & Technology, Israel.

ABSTRACT
In this study we present a mixture of deep experts (MoDE) neural-network architecture for single-microphone speech enhancement. Our architecture comprises a set of deep neural networks (DNNs), each of which is an 'expert' in a different speech spectral pattern, such as a phoneme. A gating DNN is responsible for the latent variables, namely the weights assigned to each expert's output given a speech segment. The experts estimate a mask from the noisy input, and the final mask is obtained as a weighted average of the experts' estimates, with the weights determined by the gating DNN. A soft spectral attenuation, based on the estimated mask, is then applied to enhance the noisy speech signal. As a byproduct, we gain a reduction in complexity at test time. We show that the experts' specialization allows better robustness to unfamiliar noise types.

Index Terms — Mixture of experts, clustering
1. INTRODUCTION
A plethora of approaches to solve the problem of speech enhancement using a single microphone can be found in the literature (see, e.g., [1]). Although microphone-array algorithms are nowadays widely used, there are still applications in which only a single microphone is available; however, the performance of current solutions is not always satisfactory. Classical model-based algorithms, such as the optimally modified log spectral amplitude (OMLSA) estimator with the improved minima controlled recursive averaging (IMCRA) noise estimator, were introduced to enhance speech signals contaminated by nonstationary noise [2, 3]. Nevertheless, when the noisy input exhibits rapid changes in noise statistics, the processed signal tends to exhibit musical noise artifacts at the output of the enhancement algorithm.

In recent years, DNN-based algorithms were derived to enhance noisy speech. A comprehensive summary of the common approaches can be found in [4, 5], and recent contributions in the field can be found in [6-8]. These DNN-based approaches have to cope with the large variability of the speech signal. They are thus trained on huge databases with multiple noise types to cover the large variety of noisy conditions, especially in real-life scenarios [9].

To alleviate these obstacles, algorithms that take the variability of the speech into account were developed. In [10] and [11], phoneme labels were used to enhance each phoneme separately; yet, the capabilities of the DNN were only partly utilized. A phoneme-based architecture was introduced for automatic speech recognition (ASR) applications [12]. In this architecture, a set of DNNs was separately trained with an individual database, one for each phoneme, to find the ideal ratio mask (IRM). Given a new noisy input, the ASR system outputs the index of the phoneme associated with the current input, and that phoneme's DNN is activated to estimate the IRM. This approach improved performance in terms of noise reduction and more accurate IRM estimation. However, when the ASR system is incorrect, a wrong DNN is activated. Additionally, the continuity of the speech is disrupted by mistakes in the ASR system. Finally, the ASR was not learned as part of the training phase.

In this work, we present a MoDE modeling for speech enhancement. The noisy speech signal comprises several different subspaces, each with a different relationship between the input and the output. Each expert is responsible for enhancing a single speech subspace, and the gating network finds the suitable weights for each subspace in each time frame. Each expert estimates a mask, and the local mask decisions are averaged, based on the gating network, to provide the final mask. Since the gate is trained to assign an input to one of the experts in an unsupervised manner, random initialization of the MoDE parameters may be insufficient, as it tends to utilize only a few of the experts. A pre-training stage, comprising a clustering of clean speech utterances, is therefore applied in order to capture the speech variability and to alleviate this degeneration problem. The clustering labels of the clean dataset are utilized for pre-training all experts as well as the gating network.

The contribution of this work is twofold.
First, we present a mixture of deep experts (MoDE)-based enhancement procedure that automatically decomposes the speech space into simpler subspaces and applies a different, suitable enhancement procedure to each input subspace. Second, we propose an algorithm to train the MoDE model that does not require a phoneme-labeled database.
2. PROBLEM FORMULATION
Let $x(t) = s(t) + n(t)$ denote the observed noisy signal at discrete-time $t$, where $s(t)$ denotes the clean speech signal and $n(t)$ an additive noise signal. The short-time Fourier transform (STFT) of $x(t)$ with frame-length $L$ is denoted by $\bar{x}_k(n)$, where $n$ is the frame index and $k = 0, 1, \ldots, L-1$ denotes the frequency index. Similarly, $\bar{s}_k(n)$ and $\bar{n}_k(n)$ denote the STFT of the speech and the noise-only signals, respectively.

Different speech activation masks were proposed [4, 13], and the most commonly used mask is the ideal ratio mask (IRM). The IRM of a single frame is defined as follows:
$$\mathrm{IRM}_k = \left( \frac{|\bar{s}_k|}{|\bar{s}_k| + |\bar{n}_k|} \right)^{\gamma}, \qquad (1)$$
where $\gamma$ is commonly set to $\gamma = 0.5$.

We can cast the speech enhancement problem as finding an estimate $\rho_k \in [0, 1]$ of the IRM mask $\mathrm{IRM}_k$ by only using noisy speech utterances. The DNN task is therefore to find the mask $\rho_k$, given the noisy signal.

In the enhancement task, only the noisy signal $\bar{x}$ is observed, and the goal is to estimate $\hat{\bar{s}} = [\hat{\bar{s}}_0, \ldots, \hat{\bar{s}}_{L/2}]$ of the clean speech $\bar{s} = [\bar{s}_0, \ldots, \bar{s}_{L/2}]$. Once the estimated mask $\rho = [\rho_0, \ldots, \rho_{L/2}]$ is computed, the enhanced signal can be obtained by:
$$\hat{\bar{s}} = \bar{x} \odot \rho \qquad (2)$$
where $\odot$ is an element-wise product (a.k.a. Hadamard product). In this work, we use a softer version of (2) to enhance the speech signal:
$$\hat{\bar{s}} = \bar{x} \odot \exp\{-(1 - \rho) \cdot \beta\}. \qquad (3)$$
Note that in frequency bins where $\rho_k = 1$, namely where the clean speech is dominant, the estimated signal will be $\hat{\bar{s}}_k = \bar{x}_k$. However, using $\rho_k = 0$ in (2), namely in noise-dominant bins, may result in musical noise artifacts [14, 15]. In contrast, using (3), the attenuation in noise-dominant bins is limited to $\exp\{-\beta\}$, potentially alleviating the musical noise phenomenon.

As input features to the IRM-estimating network, we use the log-spectrum of the noisy signal at a single time frame, denoted by $\mathbf{x} = \log|\bar{x}|$. The network's goal is to estimate the mask $\rho$.
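As a concrete illustration of Eqs. (1)-(3), a minimal NumPy sketch is given below; the STFT shapes, the stabilizing epsilon and the value of beta are assumptions made for this example, not values specified in the paper.

```python
import numpy as np

def ideal_ratio_mask(s_stft, n_stft, gamma=0.5, eps=1e-12):
    """Eq. (1): IRM_k = (|s_k| / (|s_k| + |n_k|))^gamma, per T-F bin."""
    return (np.abs(s_stft) / (np.abs(s_stft) + np.abs(n_stft) + eps)) ** gamma

def soft_enhance(x_stft, rho, beta=4.0):
    """Eq. (3): soft attenuation; noise-dominant bins are attenuated
    by at most exp(-beta), which limits musical noise."""
    return x_stft * np.exp(-(1.0 - rho) * beta)

# Toy usage with random complex 'STFTs' standing in for real signals.
rng = np.random.default_rng(0)
s = rng.standard_normal((100, 257)) + 1j * rng.standard_normal((100, 257))
n = rng.standard_normal((100, 257)) + 1j * rng.standard_normal((100, 257))
rho = ideal_ratio_mask(s, n)       # training target in [0, 1]
s_hat = soft_enhance(s + n, rho)   # enhanced STFT via Eq. (3)
```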
3. DEEP MIXTURE OF EXPERTS FOR SPEECH ENHANCEMENT
The mixture of experts (MoE) model, introduced by Jacobs et al. [16, 17], provides important paradigms for inferring a classifier from data.
Statistical model
We can view the MoE model as a two-step process that produces a decision $\rho$ given an input feature set $\mathbf{x}$. We first use the gating function to select an expert and then apply the expert to determine the output label. The index of the selected expert can be viewed as an intermediate hidden random variable denoted by $z$. Formally, the MoE conditional distribution can be written as follows:
$$p(\rho \mid \mathbf{x}; \Theta) = \sum_{i=1}^{m} p(z = i \mid \mathbf{x}; \theta_g)\, p(\rho \mid z = i, \mathbf{x}; \theta_i) \qquad (4)$$
such that $\mathbf{x}$ is the log-spectrum vector of the noisy speech, $\rho$ is the IRM vector, and $z$ is a speech spectral state, e.g., the phoneme identity or any other indication of a specific spectral pattern of the speech frame. The model parameter-set $\Theta$ comprises the parameters of the gating function, $\theta_g$, and the parameters $\theta_1, \ldots, \theta_m$ of all $m$ experts. We further assume that both the experts and the gating functions are implemented by DNNs; thus, this model is dubbed mixture of deep experts (MoDE).

The input to each expert DNN is the noisy log-spectrum frame together with context frames. All $m$ experts in the proposed algorithm are implemented by DNNs with the same structure. The gating DNN is fed with the corresponding mel-frequency cepstral coefficients (MFCC) features, denoted by $\mathbf{v}$. MFCC, which is based on frequency bands, is a more compact representation than a linearly spaced log-spectrum and is known for its better representation of sound classes [18]. We found that using the MFCC representation for the gating DNN both slightly improves performance and reduces the computational complexity. The output layer that provides the mask decisions is composed of $L/2+1$ sigmoid neurons, one for each frequency band. Let $\hat{\rho}_i$ be the mask vector computed by the $i$-th expert. The mask decision of the $i$-th expert and the $k$-th frequency bin is defined as:
$$\hat{\rho}_{i,k} = p(\rho_k \mid \mathbf{x}, z = i; \theta_i). \qquad (5)$$
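As an illustration of the decision rule in Eq. (4), the following NumPy sketch combines per-expert sigmoid masks with the gating softmax into the final mixture mask; the array shapes and function names are assumptions made for this example, not code from the paper.

```python
import numpy as np

def softmax(a, axis=-1):
    # Numerically stable softmax for the gating distribution.
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mode_mask(gate_logits, expert_masks):
    """gate_logits: (T, m) gate outputs per frame;
    expert_masks: (m, T, K) per-expert mask estimates.
    Returns the gate-weighted mixture mask of shape (T, K), as in Eq. (4)."""
    p = softmax(gate_logits)                      # p(z = i | v)
    return np.einsum('tm,mtk->tk', p, expert_masks)
```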
Parameter inference

We next address the problem of learning the MoDE parameters (i.e., the parameters of the experts and the gating function) given a training dataset $\{(\mathbf{x}(1), \rho(1)), \ldots, (\mathbf{x}(N), \rho(N))\}$, where $N$ is the size of the database. Our loss function follows the training strategy proposed in [16], which prefers an error function that encourages expert specialization instead of cooperation:
$$L(\Theta) = -\sum_{n=1}^{N} \log \left( \sum_{i=1}^{m} p_i(n) \exp\left(-d(\rho(n), \hat{\rho}_i(n))\right) \right) \qquad (6)$$
such that $p_i(n) = p(z(n) = i \mid \mathbf{v}; \theta_g)$ is the gating soft decision and $\hat{\rho}_i(n) = p(\rho(n) \mid z(n) = i, \mathbf{x}(n); \theta_i)$ is the $i$-th network prediction. We set $d(\rho(n), \hat{\rho}_i(n))$ to be the mean square error (MSE) function between $\rho(n)$ and $\hat{\rho}_i(n)$, i.e., $d(\rho(n), \hat{\rho}_i(n)) = \|\rho(n) - \hat{\rho}_i(n)\|^2$.

To train the network parameters we can apply standard back-propagation techniques. The gradients of the MoDE parameters provide another intuitive perspective on the model. It can be easily verified that the back-propagation equations for the parameter set of the $i$-th expert are:
$$\frac{\partial L}{\partial \theta_i} = \sum_{n=1}^{N} w_i(n) \left(\rho(n) - \hat{\rho}_i(n)\right) \cdot \frac{\partial}{\partial \theta_i} \hat{\rho}_i(n) \qquad (7)$$
such that $w_i(n)$ is the posterior probability of expert $i$:
$$w_i(n) = p(z(n) = i \mid \mathbf{x}(n), \rho(n); \Theta). \qquad (8)$$
Note that $p_i(n)$ is the posterior probability of expert $i$ given the MFCC, whereas $w_i(n)$ is the posterior given the true label and the input. Similarly, the back-propagation equation for the parameter set of the gating DNN is:
$$\frac{\partial L}{\partial \theta_g} = \sum_{n=1}^{N} \sum_{i=1}^{m} w_i(n) \cdot \frac{\partial}{\partial \theta_g} \log p_i(n). \qquad (9)$$

During the training of the MoDE, the gating DNN is learned in an unsupervised manner. Namely, the input $\mathbf{x}$ propagates through all experts, and the gate selects the output of one of the $m$ experts without any supervision. When dealing with a complex task such as clustering, parameter initialization is crucial. In fact, without a smart initialization, a trivial solution might occur in which only one or a small number of experts are activated by the gate. Pre-training each expert as well as the gating DNN is therefore essential.

In [19], phoneme labels were first used to train the gate as a phoneme classifier, and to train each expert with frames having the same phoneme. In our approach, though, no labels are available. In order to acquire labels in an unsupervised manner, we propose to apply a clustering algorithm to the clean signals. The clustering is used to find $m$ different patterns of the speech in the log-spectrum domain. The idea is that each cluster consists of frames with a similar acoustic pattern, and therefore their masks are also expected to be similar. The clustering is applied to clean speech frames to encourage the clusters to focus on different speech patterns rather than on different noise types. We used clustering based on training an autoencoder followed by $k$-means clustering in the embedded space [20]. The obtained clustering is used to initialize the MoDE parameters. The network components are then jointly trained using noisy speech data.
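The specialization loss of Eq. (6) and the expert posteriors of Eq. (8) can be prototyped in a few lines. The following NumPy sketch is a non-authoritative illustration with assumed shapes; in practice the gradients (7) and (9) are obtained by automatic differentiation rather than hand-coding.

```python
import numpy as np

def mode_loss_and_posteriors(p, rho, rho_hat, eps=1e-12):
    """p: (N, m) gating probabilities p_i(n); rho: (N, K) target IRMs;
    rho_hat: (m, N, K) per-expert mask estimates.
    Returns the loss of Eq. (6) and the posteriors w_i(n) of Eq. (8)."""
    d = ((rho[None] - rho_hat) ** 2).sum(axis=-1).T       # (N, m) squared errors
    joint = p * np.exp(-d)                                # p_i(n) exp(-d(.))
    loss = -np.log(joint.sum(axis=1) + eps).sum()         # Eq. (6)
    w = joint / (joint.sum(axis=1, keepdims=True) + eps)  # Eq. (8)
    return loss, w
```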
Network architecture
All $m$ experts in the proposed algorithm are implemented by DNNs with the same structure. In addition to the current frame, the input features include four preceding frames and four subsequent frames to add context information; therefore, each input consists of nine frames. The network consists of 3 fully-connected hidden layers with 512 rectified linear unit (ReLU) neurons each. The output layer that provides the mask decisions is composed of $L/2+1$ sigmoid neurons, one for each frequency band.

The architecture of the gating DNN is also composed of 3 fully-connected hidden layers with 512 ReLU neurons each. The output layer here is a softmax function that produces the gating distribution over the $m$ experts.

The log-spectrum of the noisy signal, $\mathbf{x}$, is only utilized as the input to the experts, while the gating DNN is fed with the corresponding MFCC features, denoted by $\mathbf{v}$ (also with context frames).
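This description maps naturally onto a small Keras model. The sketch below follows the stated layer sizes; the input dimensions (257 frequency bins, corresponding to L = 512, and 39 MFCCs per frame) are illustrative assumptions, as the paper does not state them.

```python
import tensorflow as tf

K_BINS = 257   # assumed number of frequency bins (L/2 + 1 for L = 512)
CONTEXT = 9    # current frame plus four past and four future frames
N_MFCC = 39    # assumed MFCC dimension per frame
M = 5          # number of experts used in the paper

def make_expert():
    # 3 fully-connected ReLU layers of 512 units, sigmoid mask output.
    return tf.keras.Sequential([
        tf.keras.layers.Dense(512, activation='relu',
                              input_shape=(CONTEXT * K_BINS,)),
        tf.keras.layers.Dense(512, activation='relu'),
        tf.keras.layers.Dense(512, activation='relu'),
        tf.keras.layers.Dense(K_BINS, activation='sigmoid'),
    ])

def make_gate():
    # Same trunk, but fed with MFCCs and ending in a softmax over experts.
    return tf.keras.Sequential([
        tf.keras.layers.Dense(512, activation='relu',
                              input_shape=(CONTEXT * N_MFCC,)),
        tf.keras.layers.Dense(512, activation='relu'),
        tf.keras.layers.Dense(512, activation='relu'),
        tf.keras.layers.Dense(M, activation='softmax'),
    ])

experts = [make_expert() for _ in range(M)]
gate = make_gate()
```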
4. EXPERIMENTAL STUDY

Setup
To test the proposed MoDE algorithm we contaminated the speech utterances with several types of noise from the NOISEX-92 database [21], namely Car, Room, Factory and Babble. The noise was added to the clean signals drawn from the test set of the TIMIT database (24-speaker core test set), with 5 levels of signal-to-noise ratio (SNR), at -5 dB, 0 dB, 5 dB, 10 dB and 15 dB, chosen to represent various real-life scenarios.
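The paper does not list its mixing code; a common recipe, assumed here, scales the noise so that the speech-to-noise power ratio matches the target SNR before adding it to the clean utterance.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db, eps=1e-12):
    """Scale `noise` so the mixture has the requested SNR, then add."""
    noise = noise[:len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + eps
    scale = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise

# Toy usage with random signals standing in for TIMIT and NOISEX-92 data.
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)   # stand-in for a 1 s utterance at 16 kHz
noise = rng.standard_normal(16000)   # stand-in for a NOISEX-92 segment
mixed = mix_at_snr(clean, noise, snr_db=5)
```

Compared methods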
We compared the proposed MoDE algorithm with two DNN-based algorithms: 1) the deep single expert (DSE), a fully-connected architecture that can be viewed as a single-expert network; and 2) S-MoDE, a supervised phoneme-based MoDE architecture [19]. The S-MoDE network has 39 experts, where each expert is explicitly associated with a specific phoneme, and training uses the phoneme labeling available in the TIMIT dataset.

When using the MoDE algorithm we need to set the number of experts. In most MoE studies, the number of experts was determined by an exhaustive search [22]. We found that increasing the number of experts from one to five significantly improves the performance and that additional experts had little effect. Hence, we chose the simpler model and set $m = 5$. Each expert component in the S-MoDE network has the same network architecture as the expert block of the proposed MoDE model. The DSE architecture is a single DNN chosen to have the same size (in terms of the total number of neurons in each hidden layer) as the MoDE with 5 experts, for a fair comparison.

Training procedure
All the compared DNN-based algorithms were trained with the same database. We used the TIMIT database [23] train set (462 speakers, 3.14 hours) for the training phase and the test set (168 speakers, 0.81 hours) for the test phase. Note that the train and test sets of TIMIT do not overlap. Clean utterances were contaminated by multiple noise types, stationary and non-stationary, with varying SNRs. The speech diversity modeling provided by the expert set was found to be rich enough to handle noise types that were not presented in the training phase.

The inputs to all DNN-based algorithms are the log-spectra vector and its corresponding MFCC vector. The log-spectra and the MFCC vectors were concatenated to form the input feature vector of the DSE network. In the case of MoDE, log-spectra were used as the input of the expert networks and MFCC coefficients were the input of the gating network. Additionally, all methods apply the same enhancement scheme using (3), where $\beta$ was set to correspond to a fixed maximum attenuation (in dB) that yielded high noise suppression while maintaining low speech distortion.

The network was implemented in TensorFlow [24] with the ADAM optimizer [25], and batch-normalization was applied to each layer [26]. To overcome the mismatch between the training and the test conditions, each utterance was normalized prior to the training of the network, such that the sample-mean and sample-variance of the utterance were zero and one, respectively [27]. In order to circumvent over-fitting of the DNNs to the training database, we first applied the cepstral mean and variance normalization (CMVN) procedure to the input, prior to the training and test phases [27].
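As a sketch of the normalization step described above, assuming CMVN amounts to per-coefficient standardization over each utterance (consistent with [27], though the exact implementation is not given in the paper):

```python
import numpy as np

def cmvn(features, eps=1e-8):
    """features: (T, D) feature matrix of one utterance (log-spectra or MFCC).
    Standardize each coefficient to zero mean and unit variance over time."""
    mu = features.mean(axis=0, keepdims=True)
    sigma = features.std(axis=0, keepdims=True)
    return (features - mu) / (sigma + eps)
```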
Objective quality measure results

To evaluate the performance of the proposed speech enhancement algorithm, the standard perceptual evaluation of speech quality (PESQ) measure, which is known to have a high correlation with subjective scores [28], was used. Intelligibility improvement was also evaluated using the short-time objective intelligibility measure (STOI) [29]. We also carried out informal listening tests with approximately thirty listeners; audio samples comparing the proposed MoDE algorithm with the DSE and the S-MoDE are available online.

Fig. 1: PESQ results on various noise types: (a) Car noise, (b) Room noise, (c) Factory noise, (d) Babble noise.

Fig. 2: STOI results for several noise types: (a) Speech noise, (b) Room noise, (c) Factory noise, (d) White noise.

Figure 1 depicts the PESQ results of all algorithms for the Car, Room, Factory and Babble noise types as a function of the input SNR. Figure 2 depicts the STOI results for the same experimental setup. It is evident that both MoDE and S-MoDE, which split the noisy speech enhancement task into simpler problems, outperform the fully-connected DSE network. Moreover, the proposed method, MoDE, even outperforms the supervised method, S-MoDE, which exploits the phoneme information. This indicates that splitting the noisy data according to the phonemes is not an optimal strategy for enhancement, and allowing the network to find a suitable splitting of the data by itself yields improved results.

Table 1: Experimental setup.
  Train phase (DSE, S-MoDE, MoDE): speech from TIMIT (train set); noise: white Gaussian, speech-like, F-16 cockpit, restaurant; SNR: -5, 5 dB.
  Test phase: speech from TIMIT (test set); noise from NOISEX-92: Room, Car, Babble, Factory; SNR: -5, 0, 5, 10, 15 dB.

Fig. 3: Experts and gate outputs of a network with 3 experts. (a) Noisy input. (b) Gate output, $\hat{p} = [\hat{p}_1, \hat{p}_2, \hat{p}_3]$. (c) Expert 1 mask estimation, $\hat{\rho}_1$. (d) Expert 2 mask estimation, $\hat{\rho}_2$. (e) Expert 3 mask estimation, $\hat{\rho}_3$. (f) Final estimation, $\hat{\rho} = \sum_{i=1}^{3} \hat{p}_i \cdot \hat{\rho}_i$.
Interpretation of the learned experts
To gain a deeper understanding of the role of the different experts, we next present, for each expert $i$, the mask estimation $\hat{\rho}_i$ for an example of a noisy speech utterance contaminated by white noise at SNR = 5 dB (Fig. 3a). Additionally, we show the distribution of the decisions of the gating DNN over time.

In this case, the gating network classifies the noisy speech into three classes: voiced frames, unvoiced frames and speech-inactive frames (Fig. 3b). We can see that the expertise of the first expert is to enhance the unvoiced parts of the speech (Fig. 3c), while the second expert is responsible for the voiced parts of the speech (Fig. 3d). Neither expert performs well when the opposite speech pattern is introduced. The third expert's expertise is to estimate the mask when only noise is present (Fig. 3e). The final weighted-average masking decision is shown in Fig. 3f.

We can also deduce from the gate decisions in Fig. 3b that for each time-frame the gating DNN tends to select only one expert. Consequently, each speech pattern is treated differently. Unlike the DSE DNN, in which a single network has to deal with the high variability of the speech patterns, our proposed method splits the speech enhancement task into $m$ simpler tasks, and therefore outperforms the competing DSE.

This experiment suggests that each expert is responsible for a specific pattern of the speech spectrum. Consequently, the experts preserve the speech structure, and a more robust behavior is exhibited compared to the DSE-based algorithm. The S-MoDE does preserve the phoneme structures through supervised learning; yet, it seems that the unsupervised classification of the speech patterns is more beneficial.

As a byproduct, we also gain complexity reduction at test time. For each time frame, the gate first outputs a probability vector, and the expert with the highest probability is $i' = \arg\max_i \{\hat{p}_i\}$. Consequently, we can use only one expert for each time frame:
$$\hat{\rho} = \sum_{i=1}^{m} \hat{p}_i \cdot \hat{\rho}_i \approx \hat{\rho}_{i'}. \qquad (10)$$
Therefore, even with a larger number of experts $m$, the test-time complexity remains that of the gate plus a single expert.
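A hypothetical sketch of this shortcut, reusing the `gate` and `experts` objects from the architecture sketch above; only the top-scoring expert is evaluated, so the cost per frame is one gate pass plus one expert pass regardless of $m$.

```python
import numpy as np

def enhance_frame_top1(v_mfcc, x_logspec, gate, experts):
    """v_mfcc: (CONTEXT * N_MFCC,) gate features for one frame;
    x_logspec: (CONTEXT * K_BINS,) expert features for the same frame.
    Returns the mask of the most probable expert, approximating Eq. (10)."""
    p = gate(v_mfcc[None]).numpy()[0]       # gating distribution over experts
    i_best = int(np.argmax(p))              # i' = argmax_i p_hat_i
    return experts[i_best](x_logspec[None]).numpy()[0]
```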
5. CONCLUSION
This study introduced a MoDE model for speech enhancement. This approach splits the challenging task of speech enhancement into subspaces, where each DNN expert is responsible for a simpler task corresponding to a different speech type, and the gating DNN weights the outputs of the experts. This approach makes it possible to alleviate the well-known problem of DNN-based algorithms, namely, the mismatch between the training phase and the test phase. Additionally, the proposed MoDE architecture enables training with a small database of noises and, as a byproduct, also reduces the complexity at test time. The experiments verified that the proposed algorithm outperforms other DNN-based approaches in both objective and subjective measures.
6. REFERENCES

[1] P. C. Loizou, Speech Enhancement: Theory and Practice, CRC Press, 2013.

[2] I. Cohen and B. Berdugo, "Speech enhancement for non-stationary noise environments," Signal Processing, vol. 81, no. 11, pp. 2403-2418, 2001.

[3] I. Cohen and B. Berdugo, "Noise estimation by minima controlled recursive averaging for robust speech enhancement," IEEE Signal Processing Letters, vol. 9, no. 1, pp. 12-15, 2002.

[4] D. Wang and J. Chen, "Supervised speech separation based on deep learning: An overview," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 10, pp. 1702-1726, 2018.

[5] A. Defossez, G. Synnaeve, and Y. Adi, "Real time speech enhancement in the waveform domain," arXiv preprint arXiv:2006.12847, 2020.

[6] J. Lee, Y. Jung, M. Jung, and H. Kim, "Dynamic noise embedding: Noise aware training and adaptation for speech enhancement," arXiv preprint arXiv:2008.11920, 2020.

[7] M. Kolbæk, Z.-H. Tan, S. H. Jensen, and J. Jensen, "On loss functions for supervised monaural time-domain speech enhancement," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 825-838, 2020.

[8] T. Lan, Y. Lyu, W. Ye, G. Hui, Z. Xu, and Q. Liu, "Combining multi-perspective attention mechanism with convolutional networks for monaural speech enhancement," IEEE Access, vol. 8, pp. 78979-78991, 2020.

[9] Y. Wang, J. Chen, and D. Wang, "Deep neural network based supervised speech segregation generalizes to novel noises through large-scale training," Dept. of Comput. Sci. and Eng., The Ohio State Univ., Columbus, OH, USA, Tech. Rep. OSU-CISRC-3/15-TR02, 2015.

[10] A. Das and J. H. L. Hansen, "Phoneme selective speech enhancement using parametric estimators and the mixture maximum model: A unifying approach," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 8, pp. 2265-2279, 2012.

[11] S. E. Chazan, J. Goldberger, and S. Gannot, "A hybrid approach for speech enhancement using MoG model and neural network phoneme classifier," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 12, pp. 2516-2530, 2016.

[12] Z. Q. Wang, Y. Zhao, and D. Wang, "Phoneme-specific speech separation," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016.

[13] Y. Wang, A. Narayanan, and D. Wang, "On training targets for supervised speech separation," IEEE Transactions on Audio, Speech, and Language Processing, vol. 22, no. 12, pp. 1849-1858, 2014.

[14] S. Boll, "Suppression of acoustic noise in speech using spectral subtraction," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 27, no. 2, pp. 113-120, 1979.

[15] O. Cappé, "Elimination of the musical noise phenomenon with the Ephraim and Malah noise suppressor," IEEE Transactions on Speech and Audio Processing, vol. 2, no. 2, pp. 345-349, 1994.

[16] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton, "Adaptive mixtures of local experts," Neural Computation, vol. 3, no. 1, pp. 79-87, 1991.

[17] M. I. Jordan and R. A. Jacobs, "Hierarchical mixtures of experts and the EM algorithm," Neural Computation, vol. 6, no. 2, pp. 181-214, 1994.

[18] H. Hermansky, J. R. Cohen, and R. M. Stern, "Perceptual properties of current speech recognition technology," Proceedings of the IEEE, vol. 101, no. 9, pp. 1968-1985, 2013.

[19] S. E. Chazan, S. Gannot, and J. Goldberger, "A phoneme-based pre-training approach for deep neural network with application to speech enhancement," in International Workshop on Acoustic Signal Enhancement (IWAENC), 2016.

[20] J. Xie, R. Girshick, and A. Farhadi, "Unsupervised deep embedding for clustering analysis," in International Conference on Machine Learning (ICML), 2016.

[21] A. Varga and H. J. M. Steeneken, "Assessment for automatic speech recognition: NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems," Speech Communication, vol. 12, no. 3, pp. 247-251, 1993.

[22] S. E. Yuksel, J. N. Wilson, and P. D. Gader, "Twenty years of mixture of experts," IEEE Transactions on Neural Networks and Learning Systems, vol. 23, no. 8, pp. 1177-1193, 2012.

[23] J. S. Garofalo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett, and N. L. Dahlgren, "The DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM," Tech. Rep., Linguistic Data Consortium, 1993.

[24] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al., "TensorFlow: A system for large-scale machine learning," in OSDI, 2016, vol. 16, pp. 265-283.

[25] D. Kingma and J. Ba, "ADAM: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.

[26] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in International Conference on Machine Learning (ICML), 2015.

[27] J. Li, L. Deng, Y. Gong, and R. Haeb-Umbach, "An overview of noise-robust automatic speech recognition," IEEE Transactions on Audio, Speech, and Language Processing, vol. 22, no. 4, Apr. 2014.

[28] ITU-T Recommendation P.862, "Perceptual evaluation of speech quality (PESQ): An objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs," Feb. 2001.

[29] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, "An algorithm for intelligibility prediction of time-frequency weighted noisy speech," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 7, pp. 2125-2136, 2011.