[PDF] Semi-Supervised Learning with Data Augmentation for End-to-End ASR

Abstract

In this paper, we apply Semi-Supervised Learning (SSL) along with Data Augmentation (DA) for improving the accuracy of End-to-End ASR. We focus on the consistency regularization principle, which has been successfully applied to image classification tasks, and present sequence-to-sequence (seq2seq) versions of the FixMatch and Noisy Student algorithms. Specifically, we generate the pseudo labels for the unlabeled data on-the-fly with a seq2seq model after perturbing the input features with DA. We also propose soft label variants of both algorithms to cope with pseudo label errors, showing further performance improvements. We conduct SSL experiments on a conversational speech data set with 1.9kh manually transcribed training data, using only 25% of the original labels (475h labeled data). In the result, the Noisy Student algorithm with soft labels and consistency regularization achieves 10.4% word error rate (WER) reduction when adding 475h of unlabeled data, corresponding to a recovery rate of 92%. Furthermore, when iteratively adding 950h more unlabeled data, our best SSL performance is within 5% WER increase compared to using the full labeled training set (recovery rate: 78%).

Full PDF

SSemi-Supervised Learning with Data Augmentation for End-to-End ASR

Felix Weninger , Franco Mana , Roberto Gemello , Jes´us Andr´es-Ferrer , Puming Zhan Nuance Communications, Inc., Burlington, MA, USA, Torino, Italy, Valencia, Spain { felix.weninger,franco.mana,roberto.gemello,jesusandres.ferrer,puming.zhan } @nuance.com Abstract

In this paper, we apply Semi-Supervised Learning (SSL) alongwith Data Augmentation (DA) for improving the accuracy ofEnd-to-End ASR. We focus on the consistency regularizationprinciple, which has been successfully applied to image classi-ﬁcation tasks, and present sequence-to-sequence (seq2seq) ver-sions of the FixMatch and Noisy Student algorithms. Speciﬁ-cally, we generate the pseudo labels for the unlabeled data on-the-ﬂy with a seq2seq model after perturbing the input featureswith DA. We also propose soft label variants of both algorithmsto cope with pseudo label errors, showing further performanceimprovements. We conduct SSL experiments on a conversationalspeech data set (doctor-patient conversations) with 1.9 kh man-ually transcribed training data, using only 25 % of the originallabels (475 h labeled data). In the result, the Noisy Student al-gorithm with soft labels and consistency regularization achieves10.4 % word error rate (WER) reduction when adding 475 h ofunlabeled data, corresponding to a recovery rate of 92 %. Fur-thermore, when iteratively adding 950 h more unlabeled data,our best SSL performance is within 5 % WER increase comparedto using the full labeled training set (recovery rate: 78 %).

Index Terms : automatic speech recognition, semi-supervisedlearning, data augmentation, sequence-to-sequence, end-to-end

1. Introduction

End-to-end (E2E) systems have become a focus of ASR researchin recent years, due to their ability of integrating all compo-nents of an ASR system in a single deep neural network (DNN),which greatly simpliﬁes and uniﬁes the training and decodingprocess [1–5]. The Sequence-to-Sequence (seq2seq) model withattention is one of the model architectures for E2E ASR systemswhich has shown state-of-the-art performance [6–10]. However,a general observation is that E2E ASR needs large amounts oftraining data for achieving state-of-the-art performance, espe-cially when no language model trained with external text datais included in the system. Data Augmentation (DA) and semi-supervised learning (SSL) are two approaches that can be usedfor improving E2E model performance with limited amounts ofmanually transcribed training data.DA perturbs (usually randomly) the input data without alter-ing the corresponding labels. This not only increases the varietyof the data, but also serves as implicit regularization to avoidoverﬁtting [11]. It has been successfully used for both conven-tional [12–14] and E2E ASR systems [15, 16]. In particular, theSpecAugment approach proposed in [17] has shown impressiveimprovement for seq2seq based E2E ASR models.SSL (also called semi-supervised training) aims at lever-aging unlabeled data for improving ASR model accuracy. Inthe self-training paradigm for SSL, a seed model trained withlimited amount of labeled data is used to generate transcriptions(pseudo labels) for unlabeled data (cf. [18]). This procedure canbe iterated on additional unlabeled data [19]. Another possible implementation of SSL is via teacher-student training, where a‘student’ model is trained to replicate the outputs of a powerful‘teacher’ model on the unlabeled data [20].The central research question of our paper is how to bestintegrate SSL with DA for training E2E ASR systems. So far,the use of DA with SSL for E2E ASR has been largely limitedto a simple cascade of both, i.e., doing pseudo label generationfor the unlabeled data and then applying DA [21–24]. In con-trast, for image classiﬁcation, several algorithms have recentlybeen proposed that use DA for teacher-student training [25] and consistency regularization in SSL [26–28]. Consistency regular-ization stems from the intuition that a small perturbation to aninput data sample should not change the output distribution a lot.However, these SSL algorithms were designed only for staticclassiﬁcation and need to be modiﬁed to support the seq2seqASR use case.In this regard, our paper makes the following contributions:First, we modify the Noisy Student [25] and FixMatch [28]algorithms – which have only been applied to image classiﬁca-tion so far – for the seq2seq ASR use case. Second, we showperformance improvements for both algorithms by using softlabels and consistency training (via SpecAugment and dropout).Finally, we demonstrate additional gains from iterative genera-tion of pseudo labels by exploiting a larger amount of unlabeleddata. We show that our proposed methods outperform the simpleapproach of doing DA after generating pseudo labels.

Relation to prior work:

Several studies have recently inves-tigated SSL techniques for E2E ASR, e.g. representation learn-ing [29], the usage of external text data [30], text-to-speech [31],and transcriptions generated by conventional ASR [32]. Re-garding self-training for E2E ASR, [22] proposed data ﬁlteringand ensemble schemes, generating hard pseudo labels via beamsearch. In [33], dropout was employed to improve the pseudolabel accuracy and conﬁdence measure, due to its well-knownmodel ensembling property. All of these works did not con-sider the interaction of SSL with DA, which we investigate inour study. Very recently, [34] proposed teacher-student learningwith DA for consistency training, without considering the NoisyStudent or the FixMatch algorithm or soft labels as in our work.

2. Methods

For our E2E ASR models, we use an encoder-decoder archi-tecture with attention as described in [9], which is similar toListen-Attend-Spell (LAS) [6]. The ASR task is treated as aseq2seq learning problem: The model M is trained to predict asequence y j of symbols (here, we use sub-word units) from asequence of acoustic features (usually a spectrogram x ).The encoder e creates a hidden representation of the acous-tic features. Here, e is implemented as a stack of convolutional(CNN) layers followed by bidirectional Long Short-Term Mem-ory (bLSTM) layers. The decoder is similar to an RNN LM thattakes into account a context vector c j . Bahdanau attention [35]is used to focus c j on various parts of the encoder output. The a r X i v : . [ ee ss . A S ] J u l utput distribution p j = p ( y j | y , . . . , y j − , x ) = M ( x, y . In our training framework, DA is modeled by a function A ,which also depends on the current state of the random numbergenerator. We parameterize the function A based on SpecAug-ment (SA) [17] with hyperparameters F max , T max , m F and m T .The SA algorithm masks (replaces by zeros) up to F max contigu-ous frequency bands and up to T max contiguous time frames inthe spectrogram x . The starting positions and the actual numberof masked rows / columns are sampled from a uniform distribu-tion. This process is repeated m F times for masking frequenciesand m T times for masking time frames. Due to the combinato-rial explosion of possible corruptions applied by A to a singleinput x , the usage of SA results in a practically inﬁnite amountof training data. We obtain pseudo labels for SSL based on PT transcriptions ofthe unlabeled data. These are generated ‘ofﬂine’ prior to trainingusing beam search with a seq2seq model. Data augmentationis turned off during this ofﬂine PT generation phase. Since the , ∗ , EncoderDecoderStrong SA Weak SA , , , Labeled data PT data , ∗ Figure 1:

Sequence-to-sequence FixMatch algorithm.

PT generation is done only once, it is reasonable to use a largebeam size W to improve the PT quality (here, we set W = 16 ).Furthermore, we apply a heuristic loop ﬁltering technique tothe PT utterances similar to the one from [22]. This helps usavoid reinforcing the well-known ‘looping’ problem, where theseq2seq model keeps repeating the same n-gram.Semi-supervised training can be performed by directly usingthe PT transcriptions as pseudo labels ˆ y i,j in (7). Alternatively,we can dynamically update the PT transcriptions in the trainingprocess as in the FixMatch or Noisy Student approaches. The FixMatch algorithm is a self-training method that was pro-posed in [28] for image classiﬁcation. The method implements consistency training by applying two kinds of data augmentationto the input x in both the unlabeled loss and the pseudo labelgeneration, thus encouraging the outputs to be consistent for bothaugmented inputs. Here, we present an extension of FixMatchto seq2seq models. Speciﬁcally, we generate pseudo labels ˜ y i,j for unlabeled training examples x ui as: ˜ y i,j = M θ ( A w ( x ui ) , ˆ y i,

Sequence-to-sequence Noisy Student algorithm. student training with DA. The main difference to FixMatch isthat pseudo labels are generated by a pretrained teacher model M θ ∗ with ‘frozen’ parameters θ ∗ , instead of the current modelparameters θ : ˜ y i,j = M θ ∗ ( x ui , ˆ y i,

Figure 3:

FixMatch behavior over time for various conﬁdencethresholds C . per each encoder states that is attended more than 0.5 whiledecoding; λ wip is a constant word insertion penalty to balancethe average probability reduction per decoded token; and thelast term is a root length bonus/penalty to better approximate thelength bias [37].

3. Experiments

Our methods are evaluated on a conversational speech transcrip-tion task (doctor-patient conversations). All speech data areanonymized ﬁeld data. Our experiments are based on a corpusof 1.9 k hours manually end-pointed and transcribed speech. Wedivided the corpus into four parts of equal size, treating onlythe ﬁrst one (475 h) as labeled. In a ﬁrst set of experiments, wetreated the second part (475 h) as unlabeled data (ignoring thelabels). We performed additional experiments adding the remain-ing 950 h of data as unlabeled data. In all cases, we measure theword error rate (WER) on a test set of 300 k words (26.8 h).

Our E2E models use 80-dimensional log Mel ﬁlterbank outputsas input features. The inputs are ﬁrst passed through a stackof 3 CNN layers, which is parameterized so as to yield a 512-dimensional embedding for each input frame. Subsequently,there is a pyramid bLSTM encoder with 6 layers (512 LSTMunits per layer and direction), which performs frame decimation(by a factor of 2) after every other layer, thus reducing the framerate by a factor of 8. The decoder uses 2 (unidirectional) LSTMlayers (1 024 units per layer). The softmax output layer predictsthe posterior probabilities for 2 k word piece targets determinedon the training set. In total, the model has 66 m parameters.

The E2E models are trained with dropout [38] (with probability0.3), label smoothing for hard labels [39] (with probability 0.9 forthe target class), and early stopping (using a validation set heldable 1:

FixMatch results using 475 h labeled data (GT), 475 hunlabeled data (PT) and 200 k training steps. C PT labels PT noise Init WER

Supervised (475h) – – – random 16.77

Semi-supervised, FixMatch F = 5 ) random 16.130.5 hard + strong SA random 16.340.5 soft dropout random 15.790.0 soft dropout 475 h GT out from the training data) in order to improve generalization.The SpecAugment parameters are set as F max = 35 , T max = 50 , m F = 1 , m T = 2 . For the weak augmentation A w in pseudolabel generation, SA is parameterized with F max = 5 , m F = 1 , T max = m T = 0 (frequency masking only).

4. Results

First, we performed baseline experiments with supervised train-ing and SA. Using 475 h GT data, we obtain 16.77 % WER (cf.Table 1), which is the baseline for our SSL experiments. Using1.9 kh GT data, we achieve 13.79 % WER, which is the lowerbound on the WER attainable with SSL.

Figure 3 shows the behavior of the FixMatch algorithm fordifferent conﬁdence thresholds C . It can be seen that the fractionof selected tokens decreases signiﬁcantly with increasing C , thuseffectively training on less data. However, as training progresses,the validation accuracy varies little between different C , whichalso results in similar WER. Table 1 shows the ASR performancefor various FixMatch settings, using a short training schedule of200 k maximum steps to reduce experimental turnaround time.We observe that adding SA-based consistency training on top ofdropout does not give an improvement; however, it is conﬁrmedthat using strong augmentation instead of weak augmentationin pseudo label computation degrades the results. The usage ofsoft labels leads to a slight gain (1.1 % WERR, where WERR isdeﬁned as relative WER reduction).To further improve on the FixMatch performance, we initial-ized the FixMatch parameters from the model previously trainedon 475 h GT data, while also dispensing with the conﬁdence-based selection, since in this case we assume that the PT labelsare of high quality from the start. This approach led to a signiﬁ-cant improvement (3.6 % WERR), yielding 9.2 % WERR fromFixMatch compared to the supervised baseline. Results of the Noisy Student algorithm are shown in Table 2,using a training schedule of 70 epochs (with early stopping).We observe 2.2 % WERR from using consistency regularizationcompared to the hard label Noisy Student algorithm. More-over, the usage of soft labels improves the results considerably(2.8 % WERR). The best result (15.02 % WER) is obtained bycombining soft labels and consistency training. The best NoisyStudent setup performs similar to the best FixMatch setup, andoutperforms the iterative self-training similar to [24] by 2 %.We also assess the WER recovery rate (WRR) of our SSL Table 2:

Comparison of FixMatch and Noisy Student methods,using 475 h of labeled (GT) and 475 h of unlabeled data (PT),and 70 training epochs. SA: SpecAugment. Init(ialization):random or from the supervised training on 475 h GT.

PT labels PT noise Init WER WRR

Semi-supervised, Noisy Student hard none random 15.60 61.6hard SA ( F = 5 ) random 15.26 79.5soft none random 15.16 84.7soft dropout random 15.10 87.9soft SA ( F = 5 ) random Semi-supervised, FixMatch soft dropout 475 h GT 15.04 91.1

Semi-supervised, iterative self-training [24]hard none 475 h GT 15.34 75.3

Supervised (950 h), oracle performance – – random

Table 3:

Results with iterative PT generation using a total of475 h labeled and 1.4 kh unlabeled data.

PT labels PT noise Init WER WRR

Semi-supervised, Noisy Student hard none random 14.88 63.4soft none random 14.52 75.5soft SA ( F = 5 ) random Supervised (1.9 kh), oracle performance – – random algorithms as deﬁned in [22], which is the ratio of performancegain achieved by adding unlabeled data vs. the gain from addingthe same amount of labeled data (in this case, training with950 h labeled data). The simplest variant, which uses one-shothard label PT generation (15.60 % WER), achieves 61.6 % WRR.The other SSL methods in Table 2, which all use on-the-ﬂy PTlabel generation, can improve on this baseline. In particular, ourvariant using soft labels along with consistency training obtains92.1 % WRR.While we found only minor performance gains from multipleiterations of the Noisy Student algorithm on the same unlabeleddata, we obtained additional improvements when including addi-tional 950 h PT data, for a total of 1.4 kh PT data. In this case,the PT labels ˆ y i,j were generated using a model trained on 475 hGT and 475 h PT using the Noisy Student algorithm. Results areshown in Table 3. With soft PT labels and consistency training,we achieve a 4 % improvement over the previous best result.

5. Conclusions

In this paper, we presented a comprehensive study on SSL strate-gies for end-to-end ASR. We investigated the FixMatch andNoisy Student algorithms for ASR and demonstrated improve-ments from using soft labels and consistency training. We believethat this is due to their ability to reduce error reinforcement inSSL. In the result, the performance of SSL can approach the oneof supervised learning with similar amounts of data.In future work, we will extend our investigation to differentDA schemes beyond SpecAugment (e.g. [15], [16]). We willalso look at iteratively including more and more unlabeled datawhile increasing the model size as in [25]. . References [1] A. Graves, “Sequence transduction with recurrent neural networks,”in

Proc. of ICML Workshop on Representation Learning . Edin-burgh, UK: IEEE, 2012, pp. 1–9.[2] D. Bahdanau, J. Chorowski, D. Serdyuk, P. Brakel, and Y. Bengio,“End-to-end attention-based large vocabulary speech recognition,”in

Proc. of ICASSP . Shanghai, China: IEEE, 2016, pp. 4945–4949.[3] E. Battenberg, J. Chen, R. Child, A. Coates, Y. Gaur, Y. Li, H. Liu,S. Satheesh, A. Sriram, and Z. Zhu, “Exploring neural transducersfor end-to-end speech recognition,” in

Proc. of ASRU . Okinawa,Japan: IEEE, 2017, pp. 206–213.[4] R. Prabhavalkar, K. Rao, T. N. Sainath, B. Li, L. Johnson, andN. Jaitly, “A comparison of sequence-to-sequence models forspeech recognition,” in

Proc. of INTERSPEECH . Stockholm,Sweden: ISCA, 2017, pp. 939–943.[5] J. Li, G. Ye, A. Das, R. Zhao, and Y. Gong, “Advancing acoustic-to-word CTC model,” in

Proc. of ICASSP . Calgary, Canada:IEEE, 2018, pp. 5794–5798.[6] W. Chan, N. Jaitly, Q. Le, and O. Vinyals, “Listen, attend andspell: A neural network for large vocabulary conversational speechrecognition,” in

Proc. of ICASSP . Shanghai, China: IEEE, 2016,pp. 4960–4964.[7] C.-C. Chiu, T. N. Sainath, Y. Wu, R. Prabhavalkar, P. Nguyen,Z. Chen, A. Kannan, R. J. Weiss, K. Rao, E. Gonina et al. , “State-of-the-art speech recognition with sequence-to-sequence models,”in

Proc. of ICASSP . Calgary, Canada: IEEE, 2018, pp. 4774–4778.[8] A. Zeyer, K. Irie, R. Schl¨uter, and H. Ney, “Improved training ofend-to-end attention models for speech recognition,” in

Proc. ofINTERSPEECH . Hyderabad, India: ISCA, 2018, pp. 7–11.[9] F. Weninger, J. Andr´es-Ferrer, X. Li, and P. Zhan, “Listen, Attend,Spell and Adapt: Speaker adapted sequence-to-sequence ASR,” in

Proc. of INTERSPEECH . ISCA, 2019, pp. 3805–3809.[10] Z. T¨uske, G. Saon, K. Audhkhasi, and B. Kingsbury, “Singleheaded attention based sequence-to-sequence model for state-of-the-art results on Switchboard-300,” arXiv:2001.07263 , 2020.[11] A. Hern´andez-Garc´ıa and P. K¨onig, “Data augmentation instead ofexplicit regularization,” arXiv:1806.03852 , 2018.[12] T. Ko, V. Peddinti, D. Povey, and S. Khudanpur, “Audio augmenta-tion for speech recognition,” in

Proc. of INTERSPEECH . Dresden,Germany: ISCA, 2015, pp. 3586–3589.[13] X. Cui, V. Goel, and B. Kingsbury, “Data augmentation for deepneural network acoustic modeling,”

IEEE/ACM Transactions onAudio, Speech, and Language Processing , vol. 23, no. 9, pp. 1469–1477, 2015.[14] W. Zhou, W. Michel, K. Irie, M. Kitza, R. Schl¨uter, and H. Ney,“The RWTH ASR system for TED-LIUM Release 2: Improvinghybrid HMM with SpecAugment,” in

Proc. of ICASSP . Barcelona,Spain: IEEE, 2020, pp. 7839–7843.[15] G. Saon, Z. T¨uske, K. Audhkhasi, and B. Kingsbury, “Sequencenoise injected training for end-to-end speech recognition,” in

Proc.of ICASSP . Brighton, UK: IEEE, 2019, pp. 6261–6265.[16] T.-S. Nguyen, S. St¨uker, J. Niehues, and A. Waibel, “Improvingsequence-to-sequence speech recognition training with on-the-ﬂydata augmentation,” in

Proc. of ICASSP . Barcelona, Spain: IEEE,2020, pp. 7689–7693.[17] D. S. Park, W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph, E. D. Cubuk,and Q. V. Le, “SpecAugment: A simple data augmentation methodfor automatic speech recognition,”

Proc. of INTERSPEECH , pp.2613–2617, 2019.[18] L. Lamel, J.-L. Gauvain, and G. Adda, “Lightly supervised andunsupervised acoustic model training,”

Computer Speech & Lan-guage , vol. 16, no. 1, pp. 115–129, 2002.[19] B. Khonglah, S. Madikeri, S. Dey, H. Bourlard, P. Motlicek, andJ. Billa, “Incremental semi-supervised learning for multi-genrespeech recognition,” in

Proc. of ICASSP . Barcelona, Spain:IEEE, 2020, pp. 7419–7423.[20] M. Gibson, G. Cook, and P. Zhan, “Semi-supervised training strate-gies for deep neural networks,” in

Proc. of ASRU . Okinawa, Japan:IEEE, 2017, pp. 77–83. [21] G. Synnaeve, Q. Xu, J. Kahn, E. Grave, T. Likhomanenko,V. Pratap, A. Sriram, V. Liptchinsky, and R. Collobert, “End-to-endASR: From supervised to semi-supervised learning with modernarchitectures,” arXiv:1911.08460 , 2019.[22] J. Kahn, A. Lee, and A. Hannun, “Self-training for end-to-endspeech recognition,” in

Proc. of ICASSP . Barcelona, Spain:IEEE, 2020, 7084–7088.[23] Y. Huang, S. Thomas, M. Suzuki, Z. T¨uske, L. Sansone, andM. Picheny, “Semi-supervised training and data augmentation foradaptation of automatic broadcast news captioning systems,” in

Proc. of ASRU . Sentosa, Singapore: IEEE, 2019, pp. 867–874.[24] Y. Chen, W. Wang, and C. Wang, “Semi-supervised ASR by end-to-end self-training,” arXiv:2001.09128 , 2020.[25] Q. Xie, E. Hovy, M.-T. Luong, and Q. V. Le, “Self-training withnoisy student improves ImageNet classiﬁcation,” in

Proc. of CVPR .Seattle, WA: IEEE, 2020, to appear.[26] Q. Xie, Z. Dai, E. Hovy, M.-T. Luong, and Q. V. Le, “Unsuperviseddata augmentation for consistency training,” arXiv:1904.12848 ,2019.[27] D. Berthelot, N. Carlini, I. Goodfellow, N. Papernot, A. Oliver, andC. A. Raffel, “MixMatch: A holistic approach to semi-supervisedlearning,” in

Proc. of Advances in Neural Information ProcessingSystems . Vancouver, Canada: Curran Associates, Inc., 2019, pp.5050–5060.[28] K. Sohn, D. Berthelot, C.-L. Li, Z. Zhang, N. Carlini, E. D.Cubuk, A. Kurakin, H. Zhang, and C. Raffel, “FixMatch: Simpli-fying semi-supervised learning with consistency and conﬁdence,” arXiv:2001.07685 , 2020.[29] S. Ling, Y. Liu, J. Salazar, and K. Kirchhoff, “Deep contextualizedacoustic representations for semi-supervised speech recognition,”in

Proc. of ICASSP . Barcelona, Spain: IEEE, 2020, 6429–6433.[30] S. Karita, S. Watanabe, T. Iwata, A. Ogawa, and M. Delcroix,“Semi-supervised end-to-end speech recognition,” in

Proc. of Inter-speech . Hyderabad, India: ISCA, 2018, pp. 2–6.[31] S. Karita, S. Watanabe, T. Iwata, M. Delcroix, A. Ogawa, andT. Nakatani, “Semi-supervised end-to-end speech recognition usingtext-to-speech and autoencoders,” in

Proc. of ICASSP . Brighton,UK: IEEE, 2019, pp. 6166–6170.[32] B. Li, T. N. Sainath, R. Pang, and Z. Wu, “Semi-supervised trainingfor end-to-end models via weak distillation,” in

Proc. of ICASSP .Brighton, UK: IEEE, 2019, pp. 2837–2841.[33] S. Dey, P. Motlicek, T. Bui, and F. Dernoncourt, “Exploiting semi-supervised training through a dropout regularization in end-to-endspeech recognition,” in

Proc. of INTERSPEECH . Graz, Austria:ISCA, 2019, pp. 734–738.[34] R. Masumura, M. Ihori, A. Takashima, T. Moriya, A. Ando,and Y. Shinohara, “Sequence-level consistency training for semi-supervised end-to-end automatic speech recognition,” in

Proc. ofICASSP . Barcelona, Spain: IEEE, 2020, pp. 7054–7058.[35] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translationby jointly learning to align and translate,” in

Proc. of ICLR . SanDiego, CA: open publishing, 2015.[36] J. Chorowski and N. Jaitly, “Towards better decoding and languagemodel integration in sequence to sequence models,” in

Proc. ofINTERSPEECH . Stockholm, Sweden: ISCA, 2017, pp. 523–527.[37] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey,M. Krikun, Y. Cao, Q. Gao, K. Macherey et al. , “Google’s neuralmachine translation system: Bridging the gap between human andmachine translation,” arXiv:1609.08144 , 2016.[38] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, andR. Salakhutdinov, “Dropout: A simple way to prevent neural net-works from overﬁtting,”

Journal of Machine Learning Research ,vol. 15, no. 1, pp. 1929–1958, 2014.[39] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Re-thinking the inception architecture for computer vision,” in