Incorporating Symbolic Sequential Modeling for Speech Enhancement
Chien-Feng Liao, Yu Tsao, Xugang Lu, Hisashi Kawai
Research Center for Information Technology Innovation, Academia Sinica, Taiwan
National Institute of Information and Communications Technology, Japan
[email protected], [email protected], {xugang.lu,hisashi.kawai}@nict.go.jp

Abstract
In a noisy environment, a lossy speech signal can be automatically restored by a listener if he/she knows the language well. That is, with the built-in knowledge of a "language model", a listener may effectively suppress noise interference and retrieve the target speech signals. Accordingly, we argue that familiarity with the underlying linguistic content of spoken utterances benefits speech enhancement (SE) in noisy environments. In this study, in addition to the conventional modeling for learning the acoustic noisy-clean speech mapping, an abstract symbolic sequential modeling is incorporated into the SE framework. This symbolic sequential modeling can be regarded as a "linguistic constraint" in learning the acoustic noisy-clean speech mapping function. The symbolic sequences for acoustic signals are obtained as discrete representations with a Vector Quantized Variational Autoencoder algorithm. The obtained symbols are able to capture high-level phoneme-like content from speech signals. The experimental results demonstrate that the proposed framework can obtain notable performance improvements in terms of perceptual evaluation of speech quality (PESQ) and short-time objective intelligibility (STOI) on the TIMIT dataset.
Index Terms: Speech enhancement, deep learning, symbolic representation, multi-head attention
1. Introduction
Speech enhancement (SE) has been commonly used as a front-end module in speech-related applications, such as robust automatic speech recognition (ASR) [1-3], automatic speaker recognition, and assistive listening devices [4-6]. Recently, deep learning (DL)-based SE models have also been proposed and extensively investigated [7-11]. The main idea in these DL-based SE models is to learn the complex mapping functions between noisy speech and clean speech. In most studies, the mapping functions are learned from a large quantity of well-prepared noisy-clean speech pairs in the acoustic domain, without considering the underlying linguistic structure.

In a noisy environment, listeners can automatically restore noise-masked speech based on their knowledge of a "language model", and this restoring ability depends on the effectiveness of the internal "language model". For example, in noisy environments, greater listening effort is required for non-native listeners [12]. Such studies indicate that linguistic information is helpful for retrieving target speech signals from noisy ones. Accordingly, it is argued in this study that it is beneficial to incorporate text information (phonemes or words) into an SE system for improved performance.

In [13], oracle transcriptions are used to extract time-aligned text features as auxiliary input to the DNN model. Even though this can be formulated as a text-to-speech application, it is not practical in SE scenarios to assume that ground-truth transcriptions are available. Several studies instead incorporate recognition results or outputs from acoustic models. In [14], a phone-class feature is appended to standard acoustic features as input for dereverberation. In [9], an ASR and an SE system are trained iteratively, where each system's input depends on the other's output. In [15, 16], a set of DNNs was trained as enhancement models, one for each specific phoneme; during inference, an ASR or a phoneme classifier was used to determine which DNN to use. Even though promising results have been obtained, these approaches have major drawbacks. First, the recognition model is not jointly trained, so the two systems cannot be jointly optimized; if the ASR system is incorrect, errors propagate to the downstream SE system. Second, equipping SE with a heavy ASR system may be undesirable, because SE is commonly used as a preprocessor. To overcome these obstacles, [17] proposed learning a Deep Mixture of Experts (DMoE) network in which the experts are DNNs whose outputs are combined by a gating DNN. The gating DNN is trained to assign a combination weight to each expert, which splits the acoustic space into sub-areas in an unsupervised manner, similar to our proposed method.

van den Oord et al. [18] recently proposed the Vector Quantized Variational Autoencoder (VQ-VAE), in which the stochastic continuous latent variables of the original VAE are replaced with deterministic discrete latent variables. It maintains a set of prototype vectors, i.e., a learnable codebook of predefined size. During the forward pass, feature vectors produced by the encoder are replaced with their nearest neighbors in the codebook.
This quantization component acts as an information bottleneck that regularizes the capacity of the encoder; moreover, the discrete latent variables are more interpretable and tend to learn higher-level representations, which naturally correspond to phoneme-like features for given speech signal inputs. In [19], a comprehensive study of the VQ-VAE applied to speech data was carried out, demonstrating that the VQ-VAE achieves better interpretability and information separation (such as disentangling speaker characteristics) than VAEs and AEs. Furthermore, the extracted representations allowed accurate mapping into phonemes and achieved competitive performance on an unsupervised acoustic unit discovery task. Overall, these characteristics make the VQ-VAE a suitable component for reinforcing an SE system with high-level linguistic information.

In this study, an SE system with a U-Net architecture [20-23] is proposed. Moreover, a "symbolic encoder" is developed, consisting of DNNs and the vector quantization mechanism of the VQ-VAE. The extracted symbolic sequence is connected to the U-Net via a multi-head attention mechanism [24]. Thereby, the two components can be jointly trained without the need for any supervised transcription or explicit constraints. The results demonstrate a notable improvement in terms of objective measures, including the perceptual evaluation of speech quality (PESQ) [25] and short-time objective intelligibility (STOI) [26].
Figure 1:
Proposed system consisting of a U-Net architecture, a symbolic encoder, and an attention mechanism. Conv1Ds and Deconvs are in the format (filterWidth, outputChannels), and the down-sample/up-sample rates are both 2. FC (outputChannels) denotes a fully connected layer.

The rest of the paper is organized as follows. Section 2 details the proposed approach, including each component of the system and the objective functions. The experimental settings and results are presented in Section 3. Finally, Section 4 concludes the paper.
2. System architecture
Consider a paired training dataset $\{x_i, y_i\}_{i=1}^{N}$, where $x_i$ is the input noisy speech and $y_i$ is the target clean speech. The proposed system is shown in Figure 1. It consists of the following parts: an encoder network $E_{se}(x)$, consisting of convolutional layers, that extracts the feature sequence; another encoder network, called the symbolic encoder $E_{symb}(x)$, that consists of fully connected layers and extracts the symbolic sequence by vector quantization; and a decoder $Dec(E_{se}(x), E_{symb}(x))$, connected to the two encoder outputs through a multi-head attention function and skip-connections. All components are jointly trained using the mean-squared-error (MSE) loss between the clean speech and the enhanced speech:

$$\mathcal{L}_{mse} = \frac{1}{N} \sum_{i=1}^{N} \big\| Dec(E_{se}(x_i), E_{symb}(x_i)) - y_i \big\|^2 \qquad (1)$$

The quantization mechanism and the multi-head attention mechanism are briefly explained below; for more detailed information, readers may refer to [18] and [24], respectively.

The symbolic encoder reads a sequence of acoustic features as input; here, mel-frequency cepstral coefficients (MFCCs) are used, as suggested in [19]. A sequence of hidden vectors $\{h_t \in \mathbb{R}^D, t = 1, \dots, T\}$ is extracted by the fully connected layers, where $D$ is the dimensionality and $T$ denotes the sequence length. A symbolic book containing a set of prototype vectors $\{e_j \in \mathbb{R}^D, j = 1, \dots, M\}$ is maintained, where $M$ is the size of the book. Each hidden vector $h_t$ is replaced by the nearest prototype vector in the symbolic book; that is, $h'_t = e_k$, where $k = \arg\min_j \|h_t - e_j\|$. During the training phase, the prototypes in the symbolic book are updated as exponential moving averages of $h$. This method is presented in the original paper as an alternative way to update the book and has the advantage of faster training than using an auxiliary loss. To prevent the symbolic encoder from diverging by producing unbounded $h$, [18] also uses a "commitment loss" that encourages the symbolic encoder to produce vectors lying close to the prototypes. Overall, the full system is optimized with two loss terms, the MSE between the enhanced acoustic features and the clean target features, and the commitment loss:

$$\mathcal{L}_{total} = \mathcal{L}_{mse} + \lambda \, \| h_t - sg(e_k) \|^2 \qquad (2)$$

where $\lambda$ is a hyperparameter that controls the importance of the commitment loss and $sg(\cdot)$ denotes the stop-gradient operation. It should be noted that the gradient of the loss can be back-propagated through the quantization to the symbolic encoder using the straight-through estimator presented in [27].
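To make the quantization step concrete, the following is a minimal PyTorch sketch of the nearest-prototype lookup, the straight-through gradient, and the commitment term of Eq. (2); it is an illustration under our own naming and tensor-shape conventions, not the authors' implementation:

```python
import torch

def quantize(h, codebook):
    """Nearest-prototype lookup with a straight-through gradient.

    h:        (T, D) hidden vectors from the symbolic encoder
    codebook: (M, D) prototype vectors (the symbolic book)
    """
    dists = torch.cdist(h, codebook)     # (T, M) Euclidean distances
    k = dists.argmin(dim=1)              # index of the nearest prototype
    h_q = codebook[k]                    # h'_t = e_k
    # Straight-through estimator [27]: the forward pass uses h_q, but the
    # gradient flows back to h as if the quantization were the identity.
    h_st = h + (h_q - h).detach()
    # Commitment term of Eq. (2): pull h toward the stop-gradient prototypes.
    commit = ((h - h_q.detach()) ** 2).mean()
    return h_st, k, commit
```

Note that the codebook itself receives no gradient in this sketch; as described above, the prototypes are instead updated as exponential moving averages of the hidden vectors assigned to them.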
Multi-head attention (MHA) was first proposed in the Transformer architecture [24] for machine translation and has recently been explored in various speech-related tasks, including end-to-end ASR [28] and text-to-speech [29]. MHA extends the conventional attention mechanism to multiple heads, where each head generates a different attention weight vector. This allows the decoder to jointly retrieve information from different representation subspaces at different positions, which facilitates focusing on the various structures of the symbolic sequence. The input arguments consist of queries $Q$, keys $K$, and values $V$, i.e., $Attn(Q, K, V)$. In this study, MHA is used before each layer in the decoder: every time-step of the decoder output acts as a query to attend on the symbolic sequence, and the output of the MHA is concatenated with the skip-connection and fed to the following decoder layer. Formally, given the symbolic sequence $h'$ and the skip-connections from the encoder at each layer $\{s^{(l)}, l = 1, \dots, L\}$, the output of each decoder layer is

$$d^{(l)} = Deconv\big(Concat\big(s^{(L-l+1)}, Attn(d^{(l-1)}, h', h')\big)\big)$$

where $l = 1, \dots, L$ indexes the depth of the decoder layer and $d^{(0)}$ is the encoder output.

The symbolic encoder consists of four fully connected layers, each followed by a ReLU activation function and a dropout layer [30] with a drop rate of 0.2. A linear projection layer then maps the hidden vectors $h_t$ to $D = 64$ dimensions in order to perform the quantization. After the quantization, a one-dimensional (1-D) convolutional layer is used to give the symbolic sequence contextual information. Four heads are used in the MHA, leading to point (a) in Figure 1. As in the original Transformer, positional encodings are added to the symbolic sequence.
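The decoder step can be sketched as follows; the layer widths and attention projection sizes are illustrative placeholders rather than the exact per-layer sizes shown in Figure 1:

```python
import torch
import torch.nn as nn

# Illustrative sizes for a single decoder layer (not the paper's exact widths).
C, H, HEADS = 512, 64, 4
attn = nn.MultiheadAttention(embed_dim=C, num_heads=HEADS,
                             kdim=H, vdim=H, batch_first=True)
deconv = nn.ConvTranspose1d(in_channels=2 * C, out_channels=C,
                            kernel_size=6, stride=2)

def decoder_layer(d_prev, skip, h_sym):
    """One step of d^(l) = Deconv(Concat(s^(L-l+1), Attn(d^(l-1), h', h'))).

    d_prev: (B, C, T)   previous decoder output d^(l-1)
    skip:   (B, C, T)   skip-connection s^(L-l+1) from the encoder
    h_sym:  (B, T_s, H) quantized symbolic sequence h'
    """
    q = d_prev.transpose(1, 2)               # decoder frames as queries (B, T, C)
    ctx, _ = attn(q, h_sym, h_sym)           # attend over the symbolic sequence
    x = torch.cat([skip, ctx.transpose(1, 2)], dim=1)   # (B, 2C, T)
    return deconv(x)                          # upsample by 2
```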
Table 1: Average PESQ and STOI scores of the baseline models and the proposed method on the test set under three unseen noise environments at five SNR levels, together with the average scores across all SNRs. The unprocessed test set is denoted by Noisy. The size of the symbolic book is shown in parentheses. The highest scores per metric, excluding Oracle, are highlighted in bold.

       Noisy         U-Net         U-Net-MOL     Proposed (64)  Oracle
SNR    PESQ  STOI    PESQ  STOI    PESQ  STOI    PESQ  STOI     PESQ  STOI
-6     1.213 0.532   1.685 0.602   1.800 0.619   –     –        –     –
3. Experiments
The experiments were conducted on the TIMIT database [31]. A total of 3696 utterances from the TIMIT training set (excluding SA files) were randomly sampled and corrupted with 100 noise types from [32] at six SNR levels, i.e., 20 dB, 15 dB, 10 dB, 5 dB, 0 dB, and -5 dB, to obtain a 40-hour multi-condition training set consisting of pairs of clean and noisy speech utterances. Another 100 utterances were randomly sampled to construct the validation set; they were mixed with cafeteria babble noise, which is unseen in the training set, at four SNR levels (-4 dB, 0 dB, 4 dB, and 8 dB). The 192 utterances of the TIMIT core test set were used to construct the test set for each combination of noise type and SNR level. To evaluate the system on unseen noise types, three noise types from the NOISEX-92 corpus [33], namely Buccaneer1, Destroyer engine, and HF channel, were adopted. In the following experiments, the SE algorithm is evaluated in terms of speech quality and speech intelligibility, using PESQ and STOI, respectively; higher scores represent better performance.
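For concreteness, the corruption step can be sketched as follows, under the standard additive-noise assumption (the function and variable names are ours, not from the paper):

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Corrupt a clean utterance with additive noise at a target SNR (dB)."""
    noise = noise[:len(clean)]                 # crop the noise to the utterance
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12      # avoid division by zero
    scale = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise               # noisy input x_i
```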
The sampling rate of the speech data was 16 kHz. For the encoder input, time-frequency (T-F) features were extracted using a 512-point short-time Fourier transform (STFT) with a Hamming window of 32 ms and a hop size of 16 ms, resulting in feature vectors of 257-point STFT log-power spectra (LPS). For the symbolic encoder, standard 13-dimensional MFCC features (extracted at a rate identical to that of the LPS features) were used, concatenated with their first and second temporal derivatives. MFCCs are commonly used in speech recognition because they are pitch-invariant and somewhat robust to noise; better quantization behavior was also observed with MFCCs than with LPS in preliminary experiments.
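A possible realization of this front-end with librosa is sketched below; the file name is hypothetical, and the paper does not specify its feature-extraction implementation:

```python
import librosa
import numpy as np

y, sr = librosa.load("utterance.wav", sr=16000)   # hypothetical file name
# 257-D log-power spectra: 512-point STFT, 32 ms Hamming window, 16 ms hop.
spec = librosa.stft(y, n_fft=512, win_length=512, hop_length=256,
                    window="hamming")
lps = np.log(np.abs(spec) ** 2 + 1e-12)           # shape (257, T)
# 39-D MFCC stream: 13 static coefficients plus delta and delta-delta.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=512, hop_length=256)
mfcc39 = np.concatenate([mfcc,
                         librosa.feature.delta(mfcc),
                         librosa.feature.delta(mfcc, order=2)])  # (39, T)
```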
The input was a segment of 64 frames (approximately 1 s), normalized by mean and standard deviation before being fed to the system. Finally, the decoder outputs were synthesized back to waveform signals via the inverse Fourier transform and an overlap-add method; the phases of the noisy signals were used for the inverse Fourier transform. All models were trained on mini-batches of 32 with the Adam optimizer [34]. The weight of the commitment loss $\lambda$ was set to 0.2, which is close to the original VQ-VAE setting, and it did not have a significant impact on performance. Early stopping was performed based on the validation set to prevent overfitting.

Table 2: Average PESQ and STOI performance on the validation set for different sizes of the symbolic book.

Book size M   PESQ    STOI
39            2.061   0.711
64            –       –
128           2.027   0.712
256           2.041   0.711

We constructed the baseline model by excluding the symbolic encoder component, i.e., the left part of Figure 1 without the MHA. This model is denoted by
U-Net. Subsequently, the multi-objective learning method proposed in [35] was adopted in the baseline model: the input of the
U-Net was augmented with MFCC features, and an additional objective of predicting clean MFCCs was added to $\mathcal{L}_{total}$ during training. This baseline is denoted by U-Net-MOL. Finally, to demonstrate the benefit of using real text information as in [13], the phoneme-level transcriptions provided by the TIMIT corpus were used to obtain frame-wise phoneme labels, and the input MFCCs of the symbolic encoder were replaced by phoneme embeddings (jointly learned). Quantization was discarded because the real phonetic information was provided. This is considered an oracle model, as it takes correct transcriptions as input; this system is called
Oracle.

Table 1 presents the average PESQ and STOI scores on the test set for the different systems. Noisy denotes unprocessed noisy speech, and the proposed model is shown with a symbolic book size of 64 as a representative. From this table, it can be observed that
Oracle performed the best, as expected. This also confirms the hypothesis that, given correct text information, an SE system can be made more robust to noisy environments. Furthermore, the proposed model outperformed U-Net and U-Net-MOL at every SNR level. It should be noted that the proposed system had fewer trainable parameters than the baselines, as the MHA reduces the dimensionality (see Section 2); the improvement is therefore not attributable to model complexity. Table 2 shows the denoising ability of the proposed method for different sizes of the symbolic book; performance peaked at a size of 64. During the experiments, it was also observed that the symbolic book suffered from the "index collapse" problem [36] (some tokens are never activated throughout training) for sizes larger than 256, implying that 256 tokens are sufficient for exploring the acoustic units, and adding more is of no benefit.

Figure 2: Left: histograms in which each bin represents a token index, and the value shows how many times that token was chosen given the corresponding phoneme. Right: the element at location (i, j) represents the JS-divergence between the histograms of the i-th and j-th phonemes; darker color implies larger divergence. Some phonemes are omitted owing to space limitations.

An advantage of the discrete representation learned by the VQ-VAE is the interpretability of the individual tokens in the symbolic book. Here, a visualization method was developed to connect input acoustic features to the activated tokens. Figure 2 (left) shows histograms corresponding to the 39 phoneme classes. More specifically, noisy speech from the test set was passed through the symbolic encoder to obtain the symbolic sequences. Given the frame-wise phoneme labels, a histogram for each phoneme class can then be plotted: each bin represents a token index, and its value counts how many times that token was chosen for frames belonging to the corresponding phoneme. The histograms were normalized into probability distribution functions (PDFs), i.e., they sum to 1. It can be seen that phonemes with similar pronunciation also have similar histogram distributions; for example, the phonemes within each of the pairs (aa, aw), (m, n), and (ch, sh) have similar distributions, whereas phonemes across different pairs have different distributions.

For a complete picture of the relations within the phoneme set, the Jensen-Shannon (JS) divergence between the phoneme histograms was measured. Figure 2 (right) shows a heat map in which each element represents the distance between two PDFs; darker color corresponds to larger distance. As the JS-divergence is symmetric, the heat map is a symmetric matrix. Light-colored squares appear along the diagonal, which implies that phonemes with similar pronunciation are clustered together; e.g., vowels have lighter colors with one another and are completely separated from fricatives. The heat map greatly facilitates visualizing the relationships between phonemes; for instance, it shows that ch is very close to s, z, and sh. In conclusion, the symbolic encoder was demonstrated to be reactive to phonetic content. It was also observed that some phonemes that are pronounced differently lie near each other; the likely explanation is that noise affected the input MFCCs, confusing the symbolic encoder. One possible solution is to explicitly constrain the symbolic encoder to be noise-invariant by adding a discriminator and using adversarial training as in [37]; this is left as future work.
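As an illustration of this analysis, the per-phoneme token histograms and their pairwise JS-divergences could be computed as follows; tokens_for_phoneme is a hypothetical mapping from a phoneme label to the token indices collected for frames carrying that label, and a book size of 64 is assumed:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

M = 64  # assumed symbolic book size

def token_pdf(token_ids, M):
    """Normalized histogram (PDF) of token usage for one phoneme class."""
    counts = np.bincount(token_ids, minlength=M).astype(float)
    return counts / counts.sum()

p = token_pdf(tokens_for_phoneme["ch"], M)   # hypothetical collected indices
q = token_pdf(tokens_for_phoneme["sh"], M)
# SciPy returns the JS *distance*, i.e., the square root of the divergence.
js_divergence = jensenshannon(p, q, base=2) ** 2
```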
4. Conclusion and future work
A novel approach for incorporating phonetic content into an SE system was proposed, without the need for a recognition system or any transcriptions during training. The symbolic encoder used the vector quantization method proposed in VQ-VAE to extract discrete representations. Consequently, the symbolic encoder learned to divide the input MFCCs into acoustic units automatically, and the system achieved notable performance improvements over the baseline systems. The learned representations were further interpreted by visualizing the behavior of the symbolic encoder, confirming that it is phoneme-sensitive. In future studies, the effect of different noise types on the symbolic encoder will be investigated, and noise-invariant training will be performed to extract purer symbolic sequences. Furthermore, an explicit language model constraint based on the learned symbols may be even more useful to the SE system.

5. References
[1] J. Li, L. Deng, Y. Gong, and R. Haeb-Umbach, "An overview of noise-robust automatic speech recognition," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 4, pp. 745-777, 2014.
[2] J. Li, L. Deng, R. Haeb-Umbach, and Y. Gong, Robust Automatic Speech Recognition: A Bridge to Practical Applications. Academic Press, 2015.
[3] Z.-Q. Wang and D. Wang, "A joint training framework for robust automatic speech recognition," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 4, pp. 796-806, 2016.
[4] P. C. Loizou, Speech Enhancement: Theory and Practice. CRC Press, 2007.
[5] Y.-H. Lai, F. Chen, S.-S. Wang, X. Lu, Y. Tsao, and C.-H. Lee, "A deep denoising autoencoder approach to improving the intelligibility of vocoded speech in cochlear implant simulation," IEEE Transactions on Biomedical Engineering, vol. 64, no. 7, pp. 1568-1578, 2017.
[6] D. Wang, "Deep learning reinvents the hearing aid," IEEE Spectrum, vol. 54, no. 3, pp. 32-37, 2017.
[7] X. Lu, Y. Tsao, S. Matsuda, and C. Hori, "Speech enhancement based on deep denoising autoencoder," in Interspeech, 2013, pp. 436-440.
[8] Y. Xu, J. Du, L.-R. Dai, and C.-H. Lee, "A regression approach to speech enhancement based on deep neural networks," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 1, pp. 7-19, 2015.
[9] Z. Chen, S. Watanabe, H. Erdogan, and J. R. Hershey, "Speech enhancement and recognition using multi-task learning of long short-term memory recurrent neural networks," in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
[10] M. Kolbæk, Z.-H. Tan, and J. Jensen, "Speech intelligibility potential of general and specialized deep neural network based speech enhancement systems," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 1, pp. 153-167, 2017.
[11] D. Wang and J. Chen, "Supervised speech separation based on deep learning: An overview," IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2018.
[12] G. Borghini and V. Hazan, "Listening effort during sentence processing is increased for non-native listeners: A pupillometry study," Frontiers in Neuroscience, vol. 12, p. 152, 2018.
[13] K. Kinoshita, M. Delcroix, A. Ogawa, and T. Nakatani, "Text-informed speech enhancement with deep neural networks," in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
[14] M. Mimura, S. Sakai, and T. Kawahara, "Deep autoencoders augmented with phone-class feature for reverberant speech recognition," in ICASSP. IEEE, 2015, pp. 4365-4369.
[15] Z.-Q. Wang, Y. Zhao, and D. Wang, "Phoneme-specific speech separation," in ICASSP. IEEE, 2016, pp. 146-150.
[16] S. E. Chazan, S. Gannot, and J. Goldberger, "A phoneme-based pre-training approach for deep neural network with application to speech enhancement," IEEE, 2016, pp. 1-5.
[17] S. E. Chazan, J. Goldberger, and S. Gannot, "Speech enhancement using a deep mixture of experts," arXiv preprint arXiv:1703.09302, 2017.
[18] A. van den Oord, O. Vinyals et al., "Neural discrete representation learning," in Advances in Neural Information Processing Systems, 2017, pp. 6306-6315.
[19] J. Chorowski, R. J. Weiss, S. Bengio, and A. van den Oord, "Unsupervised speech representation learning using wavenet autoencoders," arXiv preprint arXiv:1901.08810, 2019.
[20] O. Ronneberger, P. Fischer, and T. Brox, "U-net: Convolutional networks for biomedical image segmentation," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2015, pp. 234-241.
[21] S. Pascual, A. Bonafonte, and J. Serrà, "SEGAN: Speech enhancement generative adversarial network," arXiv preprint arXiv:1703.09452, 2017.
[22] D. Michelsanti and Z.-H. Tan, "Conditional generative adversarial networks for speech enhancement and noise-robust speaker verification," arXiv preprint arXiv:1709.01703, 2017.
[23] D. Stoller, S. Ewert, and S. Dixon, "Wave-U-Net: A multi-scale neural network for end-to-end audio source separation," arXiv preprint arXiv:1806.03185, 2018.
[24] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, 2017, pp. 5998-6008.
[25] ITU-T Recommendation, "Perceptual evaluation of speech quality (PESQ): An objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs," Rec. ITU-T P.862, 2001.
[26] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, "An algorithm for intelligibility prediction of time-frequency weighted noisy speech," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 7, pp. 2125-2136, 2011.
[27] Y. Bengio, N. Léonard, and A. Courville, "Estimating or propagating gradients through stochastic neurons for conditional computation," arXiv preprint arXiv:1308.3432, 2013.
[28] C.-C. Chiu, T. N. Sainath, Y. Wu, R. Prabhavalkar, P. Nguyen, Z. Chen, A. Kannan, R. J. Weiss, K. Rao, E. Gonina et al., "State-of-the-art speech recognition with sequence-to-sequence models," in ICASSP. IEEE, 2018, pp. 4774-4778.
[29] Y. Wang, D. Stanton, Y. Zhang, R. Skerry-Ryan, E. Battenberg, J. Shor, Y. Xiao, F. Ren, Y. Jia, and R. A. Saurous, "Style tokens: Unsupervised style modeling, control and transfer in end-to-end speech synthesis," arXiv preprint arXiv:1803.09017, 2018.
[30] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929-1958, 2014.
[31] J. S. Garofolo, "TIMIT acoustic phonetic continuous speech corpus," Linguistic Data Consortium, 1993.
[32] G. Hu, "100 nonspeech environmental sounds," 2004.
[33] A. Varga and H. J. Steeneken, "Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems," Speech Communication, vol. 12, no. 3, pp. 247-251, 1993.
[34] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[35] Y. Xu, J. Du, Z. Huang, L.-R. Dai, and C.-H. Lee, "Multi-objective learning and mask-based post-processing for deep neural network based speech enhancement," arXiv preprint arXiv:1703.07172, 2017.
[36] Ł. Kaiser, A. Roy, A. Vaswani, N. Parmar, S. Bengio, J. Uszkoreit, and N. Shazeer, "Fast decoding in sequence models using discrete latent variables," arXiv preprint arXiv:1803.03382, 2018.
[37] C.-F. Liao, Y. Tsao, H.-Y. Lee, and H.-M. Wang, "Noise adaptive speech enhancement using domain adversarial training," arXiv preprint arXiv:1807.07501, 2018.