Speaker Adaptation for Attention-Based End-to-End Speech Recognition
Zhong Meng, Yashesh Gaur, Jinyu Li, Yifan Gong
Microsoft Corporation, Redmond, WA, USA
{zhme, yagaur, jinyli, ygong}@microsoft.com

Abstract
We propose three regularization-based speaker adaptation approaches to adapt the attention-based encoder-decoder (AED) model with very limited adaptation data from target speakers for end-to-end automatic speech recognition. The first method is Kullback-Leibler divergence (KLD) regularization, in which the output distribution of a speaker-dependent (SD) AED is forced to be close to that of the speaker-independent (SI) model by adding a KLD regularization term to the adaptation criterion. To compensate for the asymmetric deficiency of KLD regularization, an adversarial speaker adaptation (ASA) method is proposed to regularize the deep-feature distribution of the SD AED through adversarial learning between an auxiliary discriminator and the SD AED. The third approach is multi-task learning, in which an SD AED is trained to jointly perform the primary task of predicting a large number of output units and an auxiliary task of predicting a small number of output units, alleviating the target sparsity issue. Evaluated on a Microsoft short message dictation task, all three methods are highly effective in adapting the AED model, achieving up to 12.2% and 3.0% relative word error rate improvement over an SI AED trained on 3400 hours of data for supervised and unsupervised adaptation, respectively.
Index Terms: speaker adaptation, end-to-end, attention, encoder-decoder, speech recognition
1. Introduction
Recently, remarkable progress has been made in end-to-end (E2E) automatic speech recognition (ASR) with the advance of deep learning. E2E ASR aims to directly map a sequence of input speech signals to a sequence of corresponding output labels as the transcription, by incorporating the acoustic model, pronunciation model and language model of a traditional ASR system into a single deep neural network (DNN). Three dominant approaches to E2E ASR are: connectionist temporal classification (CTC) [1, 2], the recurrent neural network transducer [3] and the attention-based encoder-decoder (AED) [4, 5, 6]. However, the performance of E2E ASR degrades when a speaker-independent (SI) model is tested on the speech of an unseen speaker. A natural solution is to adapt the SI E2E model to the speech of the target speaker. The major difficulty in speaker adaptation is that the speaker-dependent (SD) model, with a large number of parameters, can easily overfit to very limited speaker-specific data.

Many methods have been proposed for speaker adaptation in traditional DNN-hidden Markov model hybrid systems, such as regularization-based [7, 8, 9, 10, 11], transformation-based [12, 13], singular value decomposition-based [14, 15], subspace-based [16, 17] and adversarial learning-based [18, 19] approaches. Despite the broad success of these methods in hybrid systems, there has been limited investigation of speaker adaptation for E2E ASR. In [20], two regularization-based approaches are shown to be effective for CTC-based E2E ASR. In [21], constrained re-training [22] is applied to update a part of the parameters in a multi-channel AED model.

In this work, we propose three regularization-based speaker adaptation approaches for AED-based E2E ASR to overcome the adaptation data sparsity. We work with AED models predicting word or subword units (WSUs), since WSUs have been shown to yield better performance than characters as output units [23, 24]. The first method is Kullback-Leibler divergence (KLD) regularization, in which we minimize the KLD between the output distributions of the SD and SI AED models while optimizing the adaptation criterion. To offset the deficiency of KLD as an asymmetric distribution-similarity measure [25], we further propose an adversarial speaker adaptation (ASA) method in which an auxiliary discriminator network is jointly trained with the SD AED to keep the deep-feature distribution of the SD AED decoder close to that of the SI AED. Finally, to address the sparsity of WSU targets in the adaptation data, we propose multi-task learning (MTL) speaker adaptation, in which an SD AED is trained to simultaneously perform the primary task of predicting a large number of WSU units and an auxiliary task of predicting a small number of character units that improves the primary task.

We evaluate the three speaker adaptation methods on a Microsoft short message dictation (SMD) task with 3400 hours of live US English training data and 100-200 adaptation utterances per speaker. All three approaches significantly improve over a strong SI AED model. In particular, ASA achieves up to 12.2% and 3.0% relative word error rate (WER) gain over the SI baseline for supervised and unsupervised adaptation, respectively, consistently outperforming KLD regularization.
2. Speaker Adaptation for Attention-Based Encoder-Decoder (AED) Model
We first briefly describe the AED model used in this work and then elaborate on the three speaker adaptation methods for AED-based E2E ASR. In all three methods, the SD AED model is initialized from a well-trained SI AED predicting WSUs.
2.1. Attention-Based Encoder-Decoder (AED) Model

In this work, we investigate speaker adaptation methods for AED models [4, 5, 6] with WSUs as the output units. The AED model was first introduced in [26, 27] for neural machine translation. With the advantage of making no conditional independence assumption, unlike the CTC criterion [1], AED was first brought to the speech area in [4] for E2E phoneme recognition. In [5, 6], AED was further applied to large vocabulary speech recognition, and it has recently achieved performance superior to conventional hybrid systems [24].

To achieve E2E ASR, AED directly maps a sequence of speech frames to an output sequence of WSU labels via an encoder, a decoder and an attention network, as shown in Fig. 1.

[Figure 1: The architecture of the AED model for E2E ASR.]

The encoder is an RNN which encodes the sequence of input speech frames $X$ into a sequence of high-level features $H = \{h_1, \ldots, h_T\}$. AED models the conditional probability distribution $P(Y|X)$ over sequences of output WSU labels $Y = \{y_1, \ldots, y_T\}$ given a sequence of input speech frames $X = \{x_1, \ldots, x_I\}$, and, with the encoded features $H$, we have
$$P(Y|X) = P(Y|H) = \prod_{t=1}^{T} P(y_t \mid Y_{t-1}, H). \quad (1)$$
A decoder is used to model $P(Y|H)$. To capture the conditional dependence on $H$, an attention network is used to determine which encoded features in $H$ should be attended to in predicting the output label $y_t$, and to generate a context vector $g_t$ as a linear combination of $H$ [4].

At each time step $t$, the decoder RNN takes the sum of the previous WSU embedding $e_{t-1}$ and the context vector $g_{t-1}$ as the input to predict the conditional probability of each WSU, i.e., $P(u \mid Y_{t-1}, H)$, $u \in \mathbb{U}$, at time $t$, where $\mathbb{U}$ is the set of all WSUs:
$$s_t = \mathrm{RNN}_{\mathrm{dec}}(s_{t-1}, e_{t-1} + g_{t-1}), \quad (2)$$
$$\left[P(u \mid Y_{t-1}, H)\right]_{u \in \mathbb{U}} = \mathrm{softmax}\left[W_y (s_t + g_t) + b_y\right], \quad (3)$$
where $s_t$ is the hidden state of the decoder RNN. The bias $b_y$ and the matrix $W_y$ are learnable parameters.

A WSU-based SI AED model is trained to minimize the following loss on the training corpus $\mathbb{T}_r$:
$$\mathcal{L}^{\mathrm{WSU}}_{\mathrm{AED}}(\theta_{\mathrm{SI}}, \mathbb{T}_r) = -\sum_{(X, Y) \in \mathbb{T}_r} \sum_{t=1}^{|Y|} \log P(y_t \mid Y_{t-1}, H; \theta_{\mathrm{SI}}), \quad (4)$$
where $\theta_{\mathrm{SI}}$ denotes all the model parameters of the SI AED and $|Y|$ is the number of elements in the label sequence $Y$.
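To make Eqs. (2)-(3) concrete, here is a minimal PyTorch sketch of a single decoder step. This is our illustration rather than the authors' code: the class name is ours, the dimensions default to the 512-unit/33755-WSU setup described later, and the context vector $g_t$ is assumed to be supplied by an external attention module.

```python
import torch
import torch.nn as nn

class DecoderStep(nn.Module):
    """One AED decoder step, Eqs. (2)-(3). Hypothetical sketch."""

    def __init__(self, embed_dim=512, hidden_dim=512, num_wsu=33755):
        super().__init__()
        self.rnn_dec = nn.GRUCell(embed_dim, hidden_dim)  # RNN_dec in Eq. (2)
        self.proj = nn.Linear(hidden_dim, num_wsu)        # W_y and b_y in Eq. (3)

    def forward(self, s_prev, e_prev, g_prev, g_t):
        # Eq. (2): the RNN input is the sum of the previous WSU embedding
        # e_{t-1} and the previous context vector g_{t-1}.
        s_t = self.rnn_dec(e_prev + g_prev, s_prev)
        # Eq. (3): WSU posteriors from the new state plus the current
        # context g_t (g_t itself comes from the attention network).
        wsu_posteriors = torch.softmax(self.proj(s_t + g_t), dim=-1)
        return s_t, wsu_posteriors
```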
2.2. KLD Regularization for Speaker Adaptation

Given very limited speech from a target speaker, the SD AED model, usually with a large number of parameters, can easily overfit to the adaptation data. To tackle this problem, one solution is to minimize the KLD between the output distributions of the SI and SD AED models while training the SD AED on the adaptation data. We compute the WSU-level KLD between the output distributions of the SI and SD AED models as
$$\sum_{t=1}^{|Y|} \sum_{u \in \mathbb{U}} P(u \mid Y_{t-1}, X; \theta_{\mathrm{SI}}) \log\left[\frac{P(u \mid Y_{t-1}, X; \theta_{\mathrm{SI}})}{P(u \mid Y_{t-1}, X; \theta_{\mathrm{SD}})}\right], \quad (5)$$
where $\theta_{\mathrm{SI}}$ denotes all the parameters of the SI AED model. We add only the $\theta_{\mathrm{SD}}$-related terms to the AED loss as the KLD regularization, since $\theta_{\mathrm{SI}}$ is not updated during adaptation. Therefore, the regularized loss function for KLD adaptation of AED on the adaptation set $\mathbb{A}$ is
$$\begin{aligned}
\mathcal{L}_{\mathrm{KLD}}(\theta_{\mathrm{SI}}, \theta_{\mathrm{SD}}, \mathbb{A})
&= (1-\rho)\, \mathcal{L}^{\mathrm{WSU}}_{\mathrm{AED}}(\theta_{\mathrm{SD}}, \mathbb{A})
 - \rho \sum_{(X,Y)\in\mathbb{A}} \sum_{t=1}^{|Y|} \sum_{u\in\mathbb{U}} P(u \mid Y_{t-1}, X; \theta_{\mathrm{SI}}) \log P(u \mid Y_{t-1}, X; \theta_{\mathrm{SD}}) \\
&= -\sum_{(X,Y)\in\mathbb{A}} \sum_{t=1}^{|Y|} \sum_{u\in\mathbb{U}} \Big\{ (1-\rho)\,\mathbb{1}[u = y_t] + \rho\, P(u \mid Y_{t-1}, X; \theta_{\mathrm{SI}}) \Big\} \log P(u \mid Y_{t-1}, X; \theta_{\mathrm{SD}}),
\end{aligned} \quad (6)$$
$$\hat{\theta}_{\mathrm{SD}} = \arg\min_{\theta_{\mathrm{SD}}} \mathcal{L}_{\mathrm{KLD}}(\theta_{\mathrm{SI}}, \theta_{\mathrm{SD}}, \mathbb{A}), \quad (7)$$
where $\rho \in [0, 1]$ is the regularization weight, $\mathbb{1}[\cdot]$ is the indicator function and $\hat{\theta}_{\mathrm{SD}}$ denotes the optimized parameters. KLD regularization for AED is therefore equivalent to standard cross-entropy training with a new target: the linear interpolation of the hard one-hot WSU label with the soft WSU posteriors from the SI AED.
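Since Eq. (6) reduces to cross-entropy against interpolated targets, it admits a compact implementation. Below is a minimal PyTorch sketch under our own naming; it assumes per-utterance logits of shape (steps, |U|) obtained by teacher-forced decoding of the SD and frozen SI models:

```python
import torch
import torch.nn.functional as F

def kld_adaptation_loss(sd_logits, si_logits, targets, rho):
    """Eq. (6): cross-entropy against (1-rho)*one-hot + rho*SI posterior."""
    log_p_sd = F.log_softmax(sd_logits, dim=-1)     # log P(u|...; theta_SD)
    with torch.no_grad():                           # theta_SI stays frozen
        p_si = F.softmax(si_logits, dim=-1)         # P(u|...; theta_SI)
    one_hot = F.one_hot(targets, num_classes=p_si.size(-1)).float()
    mixed_target = (1.0 - rho) * one_hot + rho * p_si
    # Sum over WSUs and time steps, matching the inner sums of Eq. (6).
    return -(mixed_target * log_p_sd).sum(dim=-1).sum()
```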
2.3. Adversarial Speaker Adaptation (ASA)

As an asymmetric metric, KLD is not a perfect similarity measure between distributions [25], since minimizing $\mathrm{KL}(P_{\mathrm{SI}} \,\|\, P_{\mathrm{SD}})$ does not guarantee that $\mathrm{KL}(P_{\mathrm{SD}} \,\|\, P_{\mathrm{SI}})$ is also minimized. Adversarial learning serves as a much better solution, since it guarantees that the global optimum is achieved if and only if the SD and SI AEDs share exactly the same hidden-unit distribution at a certain layer [28]. Initially proposed for image generation [28], adversarial learning has recently been widely applied to many areas of speech, including domain adaptation [29, 30, 31, 32], noise-robust ASR [33, 34, 32], domain-invariant training [19, 35, 36], speech enhancement [37, 38, 39] and speaker verification [40]. ASA was proposed in [18] for hybrid systems; in this work, we adapt it to AED-based E2E ASR.

As in Fig. 2, we view the encoder, the attention network and the first few layers of the decoder of the SI AED as an SI feature extractor $M_f^{\mathrm{SI}}$ with parameters $\theta_f^{\mathrm{SI}}$ that maps $X$ to a sequence of deep hidden features $F^{\mathrm{SI}} = \{f_1^{\mathrm{SI}}, \ldots, f_T^{\mathrm{SI}}\}$, and the remaining layers of the SI AED decoder as an SI WSU classifier $M_y^{\mathrm{SI}}$ with parameters $\theta_y^{\mathrm{SI}}$ (i.e., $\theta_{\mathrm{SI}} = \{\theta_f^{\mathrm{SI}}, \theta_y^{\mathrm{SI}}\}$). Similarly, we divide the SD AED into an SD feature extractor $M_f^{\mathrm{SD}}$ and an SD WSU classifier $M_y^{\mathrm{SD}}$ in exactly the same way as the SI AED, and use $\theta_f^{\mathrm{SI}}$ and $\theta_y^{\mathrm{SI}}$ to initialize $\theta_f^{\mathrm{SD}}$ and $\theta_y^{\mathrm{SD}}$, respectively (i.e., $\theta_{\mathrm{SD}} = \{\theta_f^{\mathrm{SD}}, \theta_y^{\mathrm{SD}}\}$). $M_f^{\mathrm{SD}}$ extracts SD deep features $F^{\mathrm{SD}}$ from $X$.

We then introduce an auxiliary discriminator $M_d$ with parameters $\theta_d$ that takes $F^{\mathrm{SI}}$ and $F^{\mathrm{SD}}$ as input and predicts the posterior $P(f_t \in \mathbb{D}^{\mathrm{SD}} \mid Y_{t-1}, X)$ that an input deep feature $f_t$ is generated by the SD AED, with the discrimination loss
$$\mathcal{L}_{\mathrm{DISC}}(\theta_f^{\mathrm{SD}}, \theta_f^{\mathrm{SI}}, \theta_d, \mathbb{A}) = -\sum_{(X,Y)\in\mathbb{A}} \sum_{t=1}^{|Y|} \Big[ \log P(f_t^{\mathrm{SD}} \in \mathbb{D}^{\mathrm{SD}} \mid Y_{t-1}, X; \theta_f^{\mathrm{SD}}, \theta_d) + \log P(f_t^{\mathrm{SI}} \in \mathbb{D}^{\mathrm{SI}} \mid Y_{t-1}, X; \theta_f^{\mathrm{SI}}, \theta_d) \Big], \quad (8)$$
where $\mathbb{D}^{\mathrm{SD}}$ and $\mathbb{D}^{\mathrm{SI}}$ are the sets of SD and SI deep features, respectively.

[Figure 2: Adversarial speaker adaptation (ASA) of the AED model for E2E ASR.]

With ASA, our goal is to make the distribution of $F^{\mathrm{SD}}$ similar to that of $F^{\mathrm{SI}}$ through adversarial training. Therefore, we minimize $\mathcal{L}_{\mathrm{DISC}}$ with respect to $\theta_d$ and maximize $\mathcal{L}_{\mathrm{DISC}}$ with respect to $\theta_f^{\mathrm{SD}}$. This minimax competition converges to the point where $M_f^{\mathrm{SD}}$ generates deep features $F^{\mathrm{SD}}$ so confusable that $M_d$ is unable to tell whether they are generated by $M_f^{\mathrm{SD}}$ or $M_f^{\mathrm{SI}}$. At the same time, we minimize the AED loss in Eq. (4) to keep $F^{\mathrm{SD}}$ WSU-discriminative. The entire adversarial MTL procedure of ASA for the AED model is formulated as
$$(\hat{\theta}_f^{\mathrm{SD}}, \hat{\theta}_y^{\mathrm{SD}}) = \arg\min_{\theta_f^{\mathrm{SD}}, \theta_y^{\mathrm{SD}}} \Big[ \mathcal{L}^{\mathrm{WSU}}_{\mathrm{AED}}(\theta_f^{\mathrm{SD}}, \theta_y^{\mathrm{SD}}, \mathbb{A}) - \lambda\, \mathcal{L}_{\mathrm{DISC}}(\theta_f^{\mathrm{SD}}, \theta_f^{\mathrm{SI}}, \hat{\theta}_d, \mathbb{A}) \Big], \quad (9)$$
$$\hat{\theta}_d = \arg\min_{\theta_d} \mathcal{L}_{\mathrm{DISC}}(\hat{\theta}_f^{\mathrm{SD}}, \theta_f^{\mathrm{SI}}, \theta_d, \mathbb{A}), \quad (10)$$
where $\lambda$ controls the trade-off between $\mathcal{L}^{\mathrm{WSU}}_{\mathrm{AED}}$ and $\mathcal{L}_{\mathrm{DISC}}$. Note that the SI AED serves only as a reference network and $\theta_{\mathrm{SI}}$ is not updated during adaptation. After ASA, only the SD AED with adapted parameters $\hat{\theta}_{\mathrm{SD}} = \{\hat{\theta}_f^{\mathrm{SD}}, \hat{\theta}_y^{\mathrm{SD}}\}$ is used for decoding, while the auxiliary discriminator $M_d$ is discarded.
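One way to realize the alternating optimization of Eqs. (9)-(10) is the following PyTorch-style training step. It is a sketch under our own interface assumptions: sd_aed and si_aed are assumed to return the deep features $F$ and (for the SD model) the WSU AED loss, and disc is the auxiliary discriminator $M_d$ producing one logit per feature $f_t$.

```python
import torch
import torch.nn.functional as F

def asa_step(X, Y, sd_aed, si_aed, disc, opt_sd, opt_d, lam):
    """One alternating ASA update, Eqs. (9)-(10). Hypothetical sketch."""
    f_sd, aed_loss = sd_aed(X, Y)          # SD deep features + L_AED^WSU
    with torch.no_grad():                  # the SI AED is a frozen reference
        f_si, _ = si_aed(X, Y)

    # Eq. (10): train M_d to label SD features 1 and SI features 0.
    d_sd, d_si = disc(f_sd.detach()), disc(f_si)
    disc_loss = (F.binary_cross_entropy_with_logits(d_sd, torch.ones_like(d_sd))
                 + F.binary_cross_entropy_with_logits(d_si, torch.zeros_like(d_si)))
    opt_d.zero_grad()
    disc_loss.backward()
    opt_d.step()

    # Eq. (9): minimize the AED loss while maximizing L_DISC w.r.t. the
    # SD feature extractor, so that M_d can no longer spot SD features.
    adv = F.binary_cross_entropy_with_logits(disc(f_sd), torch.ones_like(d_sd))
    sd_loss = aed_loss - lam * adv
    opt_sd.zero_grad()
    sd_loss.backward()
    opt_sd.step()
```

Here opt_sd is assumed to cover only the SD AED parameters and opt_d only the discriminator parameters, so each update touches the intended subset despite the shared graph.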
2.4. Multi-Task Learning (MTL) Speaker Adaptation

One difficulty of adapting AED models is that the WSUs in the adaptation data are sparsely distributed, since very few adaptation samples are assigned to a huge number of WSU labels (about 30k). A large proportion of WSUs are unseen during adaptation, overfitting the SD AED to a small space of observed WSU sequences. Inspired by [41, 20], to alleviate this target sparsity issue we augment the primary task of predicting a large number of WSU output units with an auxiliary task of predicting a small number of character output units (around 30) via MTL. The adaptation data, though small, covers a much higher percentage (usually 100%) of the character set than of the WSU set. Predicting the fully-covered character labels as a secondary task exposes the SD AED to an enlarged acoustic space and effectively regularizes the primary task of WSU prediction.

We first introduce an auxiliary AED (with parameters $\theta_{\mathrm{CHR}}$) with character output units and initialize its encoder with the encoder parameters $\theta_{\mathrm{enc}}^{\mathrm{SI}}$ of the WSU-based SI AED. Then we train the decoder (parameters $\theta_{\mathrm{dec}}^{\mathrm{CHR}}$) and the attention network (parameters $\theta_{\mathrm{att}}^{\mathrm{CHR}}$) of the character-based AED on all the training data $\mathbb{T}_r$ to minimize the character-level AED loss below, while keeping its encoder fixed:
$$\mathcal{L}^{\mathrm{CHR}}_{\mathrm{AED}}(\theta_{\mathrm{CHR}}, \mathbb{T}_r) = -\sum_{(X,C)\in\mathbb{T}_r} \sum_{l=1}^{|C|} \log P(c_l \mid C_{l-1}, X; \theta_{\mathrm{CHR}}), \quad (11)$$
$$(\hat{\theta}_{\mathrm{dec}}^{\mathrm{CHR}}, \hat{\theta}_{\mathrm{att}}^{\mathrm{CHR}}) = \arg\min_{\theta_{\mathrm{dec}}^{\mathrm{CHR}},\, \theta_{\mathrm{att}}^{\mathrm{CHR}}} \mathcal{L}^{\mathrm{CHR}}_{\mathrm{AED}}(\theta_{\mathrm{enc}}^{\mathrm{SI}}, \theta_{\mathrm{dec}}^{\mathrm{CHR}}, \theta_{\mathrm{att}}^{\mathrm{CHR}}, \mathbb{T}_r), \quad (12)$$
where $C = \{c_1, \ldots, c_L\}$ is the sequence of character labels corresponding to $X$ and $Y$.

Then we construct an MTL network comprised of the WSU-based SI AED with initial parameters $\theta_{\mathrm{SI}} = \{\theta_{\mathrm{enc}}^{\mathrm{SI}}, \theta_{\mathrm{dec}}^{\mathrm{SI}}, \theta_{\mathrm{att}}^{\mathrm{SI}}\}$, a well-trained character-based decoder with parameters $\hat{\theta}_{\mathrm{dec}}^{\mathrm{CHR}}$ and its attention network with parameters $\hat{\theta}_{\mathrm{att}}^{\mathrm{CHR}}$, as in Fig. 3. The latter two take the encoded features $H$ from the encoder of the SI AED as input.

[Figure 3: MTL speaker adaptation of the AED model for E2E ASR.]
Finally, we jointly minimize the WSU-level and character-level AED losses on the adaptation data by updating only the encoder parameters $\theta_{\mathrm{enc}}^{\mathrm{SD}}$ of the MTL network:
$$\hat{\theta}_{\mathrm{enc}}^{\mathrm{SD}} = \arg\min_{\theta_{\mathrm{enc}}^{\mathrm{SD}}} \Big[ \beta\, \mathcal{L}^{\mathrm{WSU}}_{\mathrm{AED}}(\theta_{\mathrm{enc}}^{\mathrm{SD}}, \theta_{\mathrm{dec}}^{\mathrm{SI}}, \theta_{\mathrm{att}}^{\mathrm{SI}}, \mathbb{A}) + (1-\beta)\, \mathcal{L}^{\mathrm{CHR}}_{\mathrm{AED}}(\theta_{\mathrm{enc}}^{\mathrm{SD}}, \hat{\theta}_{\mathrm{dec}}^{\mathrm{CHR}}, \hat{\theta}_{\mathrm{att}}^{\mathrm{CHR}}, \mathbb{A}) \Big], \quad (13)$$
where $\beta \in [0, 1]$ is the interpolation weight for the WSU-level AED loss. After MTL, only the adapted WSU-based SD AED with parameters $\hat{\theta}_{\mathrm{SD}} = \{\hat{\theta}_{\mathrm{enc}}^{\mathrm{SD}}, \theta_{\mathrm{dec}}^{\mathrm{SI}}, \theta_{\mathrm{att}}^{\mathrm{SI}}\}$ is used for decoding. The character-based decoder and attention network are discarded.
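A minimal sketch of the MTL objective in Eq. (13), assuming an interface of our own (not the paper's) in which each branch bundles its frozen attention network, decoder and output layer and returns its AED loss given the shared encoded features $H$:

```python
import torch

def mtl_adaptation_loss(X, Y, C, encoder, wsu_branch, chr_branch, beta):
    """Eq. (13): interpolate WSU- and character-level AED losses.

    Only the encoder is adapted; the parameters of both branches are
    frozen beforehand, e.g. via p.requires_grad_(False).
    """
    H = encoder(X)                  # shared SD encoder, theta_enc^SD
    loss_wsu = wsu_branch(H, Y)     # L_AED^WSU with frozen SI dec/att
    loss_chr = chr_branch(H, C)     # L_AED^CHR with frozen char dec/att
    return beta * loss_wsu + (1.0 - beta) * loss_chr
```

An optimizer built over encoder.parameters() alone then implements the arg min over $\theta_{\mathrm{enc}}^{\mathrm{SD}}$.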
3. Experiments
We evaluate the three speaker adaptation methods for AED-based E2E ASR on the Microsoft Windows phone SMD task.
The training data consists of 3400 hours Microsoft internal liveUS English Cortana utterances collected via various deployedspeech services including voice search and SMD. The test setconsists of 7 speakers with a total number of 20,203 words.Two adaptation sets of 100 and 200 utterances per speaker areused for acoustic model adaptation, respectively. We extract 80-dimensional log Mel filter bank (LFB) features from the speechsignal in both the training and test sets every 10 ms over a 25 msindow. We stack 3 consecutive frames and stride the stackedframe by 30 ms to form 240-dimensional input speech framesas in [24]. Following [42], we first generate 33755 mixed unitsas the set of WSUs based on the training transcription and thenproduce mixed-unit label sequences as training targets.
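The frame stacking and striding described above amounts to grouping every 3 consecutive 10 ms LFB frames into one 240-dimensional frame advanced every 30 ms. A minimal NumPy sketch, where the handling of leftover frames at the end is our assumption:

```python
import numpy as np

def stack_frames(lfb, n=3):
    """Stack n consecutive 80-dim LFB frames (10 ms hop) into n*80-dim
    superframes with an n*10 ms stride. Hypothetical sketch."""
    num_frames, feat_dim = lfb.shape
    num_frames -= num_frames % n      # drop the ragged tail (assumption)
    return lfb[:num_frames].reshape(num_frames // n, n * feat_dim)
```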
We train a WSU-based AED model as described in Section 2.1 for E2E ASR using the 3400 hours of training data. The encoder is a bidirectional gated recurrent unit (GRU)-RNN [26, 43] with hidden layers of 512 units each, and layer normalization [44] is applied to each hidden layer. Each WSU label is represented by a 512-dimensional embedding vector. The decoder is a unidirectional GRU-RNN with 2 hidden layers, each with 512 hidden units, and an output layer predicting posteriors of the 33k WSUs. We use GRUs instead of long short-term memory (LSTM) units [22, 45] for the RNNs because a GRU has fewer parameters and trains faster than an LSTM with no loss of performance. We use PyTorch [46] for building, training and evaluating the neural networks. As shown in Table 1, the baseline SI AED achieves 14.32% WER on the test set.
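As an illustration of the baseline encoder configuration just described, here is a hedged PyTorch sketch (not the released model; the exact encoder depth is not recoverable from the text, so it is left as a constructor argument):

```python
import torch
import torch.nn as nn

class BiGRUEncoder(nn.Module):
    """Bidirectional GRU encoder with per-layer layer normalization [44].
    Hypothetical sketch of the baseline described above."""

    def __init__(self, in_dim=240, hidden=512, num_layers=4):
        super().__init__()
        self.grus, self.norms = nn.ModuleList(), nn.ModuleList()
        for i in range(num_layers):
            self.grus.append(nn.GRU(in_dim if i == 0 else 2 * hidden,
                                    hidden, bidirectional=True,
                                    batch_first=True))
            self.norms.append(nn.LayerNorm(2 * hidden))

    def forward(self, x):                 # x: (batch, frames, 240)
        for gru, norm in zip(self.grus, self.norms):
            x, _ = gru(x)                 # (batch, frames, 2*hidden)
            x = norm(x)
        return x                          # encoded features H
```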
Table 1: The WERs (%) of speaker adaptation using KLD, ASA and MTL for AED E2E ASR on the Microsoft SMD task with 3400 hours of training data. Each of the 7 test speakers has 100 or 200 adaptation utterances. In KLD and ASA adaptation, all the parameters of the AED ("All") are updated, while in MTL adaptation, only the AED encoder ("Enc") is updated.

System       | Adapt Param | Supervised 100 | Supervised 200 | Unsupervised 100 | Unsupervised 200
SI           | -           | 14.32          | 14.32          | 14.32            | 14.32
KLD (best ρ) | All         | 13.97          | 13.14          | 14.04            | 14.00
ASA (best α) | All         | 13.20          | 12.58          | 13.95            | 13.89
MTL (best β) | Enc         | 13.26          | 12.71          | 13.80            | 13.77
We first perform KLD adaptation of the SI AED with different values of ρ by updating all the parameters of the SD AED; ρ = 0 corresponds to direct re-training with no regularization. As shown in Table 1, for supervised adaptation, KLD achieves the best WERs, 13.97% and 13.14%, with 100 and 200 adaptation utterances, i.e., 2.4% and 8.2% relative WER improvements over the SI baseline. The WER increases as ρ continues to grow. For unsupervised adaptation, KLD achieves the best WERs, 14.04% and 14.00%, with 100 and 200 adaptation utterances, improving over the SI AED by 2.0% and 2.2% relative. More adaptation utterances significantly improve supervised adaptation but only slightly reduce the WER in unsupervised adaptation, since the decoded one-best path is not as accurate as the forced alignment.

To perform ASA of the AED, we construct the SI feature extractor M_f^SI from the encoder, the attention network and the first 2 hidden layers of the decoder of the SI AED model. The SI WSU classifier M_y^SI is the decoder output layer. M_f^SD and M_y^SD are initialized with M_f^SI and M_y^SI. The discriminator M_d is a feedforward DNN with 2 hidden layers of 512 units each. The output layer of M_d has 1 unit predicting the posterior of f_t ∈ D^SD. M_f^SD, M_y^SD and M_d are jointly trained with the adversarial MTL objective of Eqs. (9) and (10). We update all the parameters of the SD AED.

As shown in Table 1, for supervised adaptation, ASA achieves the best WERs, 13.20% and 12.58%, with 100 and 200 adaptation utterances, which are 7.8% and 12.2% relative improvements over the SI AED baseline, respectively. For unsupervised adaptation, ASA achieves the best WERs, 13.95% and 13.89%, both at the same α, with 100 and 200 adaptation utterances, improving over the SI AED baseline by 2.6% and 3.0% relative. ASA consistently and significantly outperforms KLD for both supervised and unsupervised adaptation and for adaptation data of different sizes. In particular, for supervised adaptation, ASA achieves 5.5% and 4.3% relative improvements over KLD with 100 and 200 adaptation utterances, respectively.

In MTL, we first train an auxiliary AED with 30 character units as the output using the training data and then adapt the SI WSU AED by simultaneously performing the WSU and character prediction tasks. The character-based AED shares the same encoder as the WSU-based AED and has a GRU decoder with 2 hidden layers, each with 512 hidden units.

Table 1 shows that, for supervised adaptation, MTL achieves the best WERs, 13.26% and 12.71%, with 100 and 200 adaptation utterances, improving over the SI AED baseline by 7.4% and 11.2%, respectively. For unsupervised adaptation, MTL achieves the best WERs, 13.80% and 13.77%, both at the same β, which are 3.6% and 3.8% relative improvements over the SI AED baseline, respectively. Note that the performance of MTL adaptation is not directly comparable with that of KLD and ASA, since in MTL only the encoder (32.4% of the AED model parameters) is updated, while in KLD and ASA the whole AED model is adapted. The KLD and ASA performance could potentially be remarkably improved by updating only a portion of the model parameters.
4. Conclusion
In this work, we propose KLD, ASA and MTL approaches for speaker adaptation of AED-based E2E ASR systems. In KLD, we minimize the KLD between the output distributions of the SD and SI AED models in addition to the AED loss to avoid overfitting. In ASA, adversarial learning is used to force the deep features of the SD AED to have a distribution similar to that of the SI AED, offsetting the asymmetric deficiency of KLD. In MTL, an additional task of predicting character units is performed in addition to the primary task of WSU-based AED prediction to resolve the target sparsity issue.

Evaluated on the Microsoft SMD task, all three methods achieve significant improvements over a strong SI AED baseline for both supervised and unsupervised adaptation. ASA improves consistently over KLD when updating all the AED parameters. By adapting only the encoder, with 32.4% of the full model parameters, the performance of MTL is not directly comparable with that of KLD and ASA. Potentially, much larger improvements can be achieved by KLD and ASA by adapting only a subset of the model parameters.

5. References

[1] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, "Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks," in ICML. ACM, 2006.
[2] A. Graves and N. Jaitly, "Towards end-to-end speech recognition with recurrent neural networks," in International Conference on Machine Learning, 2014, pp. 1764–1772.
[3] A. Graves, "Sequence transduction with recurrent neural networks," arXiv preprint arXiv:1211.3711, 2012.
[4] J. K. Chorowski, D. Bahdanau, D. Serdyuk et al., "Attention-based models for speech recognition," in NIPS, 2015, pp. 577–585.
[5] D. Bahdanau, J. Chorowski, D. Serdyuk et al., "End-to-end attention-based large vocabulary speech recognition," in Proc. ICASSP. IEEE, 2016, pp. 4945–4949.
[6] W. Chan, N. Jaitly, Q. Le, and O. Vinyals, "Listen, attend and spell: A neural network for large vocabulary conversational speech recognition," in Proc. ICASSP. IEEE, 2016, pp. 4960–4964.
[7] D. Yu, K. Yao, H. Su, G. Li, and F. Seide, "KL-divergence regularized deep neural network adaptation for improved large vocabulary speech recognition," in Proc. ICASSP, May 2013, pp. 7893–7897.
[8] H. Liao, "Speaker adaptation of context dependent deep neural networks," in Proc. ICASSP, May 2013.
[9] Z. Meng, J. Li, Y. Zhao, and Y. Gong, "Conditional teacher-student learning," in Proc. ICASSP, 2019.
[10] Z. Huang, J. Li, S. Siniscalchi et al., "Rapid adaptation for deep neural networks through multi-task learning," in Interspeech, 2015.
[11] L. Tóth and G. Gosztolya, "Adaptation of DNN acoustic models using KL-divergence regularization and multi-task training," in International Conference on Speech and Computer, 2016.
[12] F. Seide, G. Li, X. Chen, and D. Yu, "Feature engineering in context-dependent deep neural networks for conversational speech transcription," in Proc. ASRU, Dec 2011, pp. 24–29.
[13] P. Swietojanski, J. Li, and S. Renals, "Learning hidden unit contributions for unsupervised acoustic model adaptation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 8, pp. 1450–1463, Aug 2016.
[14] J. Xue, J. Li, and Y. Gong, "Restructuring of deep neural network acoustic models with singular value decomposition," in Interspeech, 2013, pp. 2365–2369.
[15] Y. Zhao, J. Li, and Y. Gong, "Low-rank plus diagonal adaptation for deep neural networks," in Proc. ICASSP, March 2016, pp. 5005–5009.
[16] S. Xue, O. Abdel-Hamid, H. Jiang, L. Dai, and Q. Liu, "Fast adaptation of deep neural network based on discriminant codes for speech recognition," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 12, pp. 1713–1725, Dec 2014.
[17] L. Samarakoon and K. C. Sim, "Factorized hidden layer adaptation for deep neural network based acoustic modeling," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 12, pp. 2241–2250, Dec 2016.
[18] Z. Meng, J. Li, and Y. Gong, "Adversarial speaker adaptation," in Proc. ICASSP, 2019.
[19] Z. Meng, J. Li, Z. Chen et al., "Speaker-invariant training via adversarial learning," in Proc. ICASSP, 2018.
[20] J. Li, R. Zhao, Z. Chen et al., "Developing far-field speaker system via teacher-student learning," in Proc. ICASSP, 2018.
[21] T. Ochiai, S. Watanabe, S. Katagiri, T. Hori, and J. Hershey, "Speaker adaptation for multichannel end-to-end speech recognition," in Proc. ICASSP. IEEE, 2018, pp. 6707–6711.
[22] H. Erdogan, T. Hayashi, J. R. Hershey et al., "Multi-channel speech recognition: LSTMs all the way through," in CHiME-4 Workshop, 2016, pp. 1–4.
[23] Y. Gaur, J. Li, Z. Meng, and Y. Gong, "Acoustic-to-phrase end-to-end speech recognition," submitted to INTERSPEECH 2019. IEEE, 2019.
[24] C.-C. Chiu, T. N. Sainath, Y. Wu et al., "State-of-the-art speech recognition with sequence-to-sequence models," in Proc. ICASSP. IEEE, 2018, pp. 4774–4778.
[25] S. Kullback and R. A. Leibler, "On information and sufficiency," The Annals of Mathematical Statistics, vol. 22, no. 1, pp. 79–86, 1951.
[26] K. Cho, B. van Merriënboer, D. Bahdanau, and Y. Bengio, "On the properties of neural machine translation: Encoder-decoder approaches," arXiv preprint arXiv:1409.1259, 2014.
[27] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," arXiv preprint arXiv:1409.0473, 2014.
[28] I. Goodfellow, J. Pouget-Abadie et al., "Generative adversarial nets," in Proc. NIPS, 2014, pp. 2672–2680. [Online]. Available: http://papers.nips.cc/paper/5423-generative-adversarial-nets.pdf
[29] Y. Ganin and V. Lempitsky, "Unsupervised domain adaptation by backpropagation," in Proc. ICML, vol. 37. Lille, France: PMLR, 2015, pp. 1180–1189.
[30] S. Sun, B. Zhang, L. Xie et al., "An unsupervised deep domain adaptation approach for robust speech recognition," Neurocomputing, vol. 257, pp. 79–87, 2017.
[31] Z. Meng, Z. Chen, V. Mazalov, J. Li, and Y. Gong, "Unsupervised adaptation with domain separation networks for robust speech recognition," in Proc. ASRU, 2017.
[32] Z. Meng, J. Li, Y. Gong, and B.-H. Juang, "Adversarial teacher-student learning for unsupervised domain adaptation," in Proc. ICASSP. IEEE, 2018, pp. 5949–5953.
[33] Y. Shinohara, "Adversarial multi-task learning of deep neural networks for robust speech recognition," in INTERSPEECH, 2016, pp. 2369–2372.
[34] D. Serdyuk, K. Audhkhasi, P. Brakel, B. Ramabhadran et al., "Invariant representations for noisy speech recognition," in NIPS Workshop, 2016.
[35] Z. Meng, J. Li, and Y. Gong, "Attentive adversarial learning for domain-invariant training," in Proc. ICASSP, 2019.
[36] Z. Meng, "Discriminative and adaptive training for robust speech recognition and understanding," Ph.D. dissertation, Georgia Institute of Technology, 2018.
[37] S. Pascual, A. Bonafonte, and J. Serrà, "SEGAN: Speech enhancement generative adversarial network," in Interspeech, 2017.
[38] Z. Meng, J. Li, and Y. Gong, "Adversarial feature-mapping for speech enhancement," in Interspeech, 2018.
[39] ——, "Cycle-consistent speech enhancement," in Interspeech, 2018.
[40] Z. Meng, Y. Zhao, J. Li, and Y. Gong, "Adversarial speaker verification," in Proc. ICASSP, 2019.
[41] Z. Huang, S. Siniscalchi, I. Chen et al., "Maximum a posteriori adaptation of network parameters in deep models," in Proc. Interspeech, 2015.
[42] J. Li, G. Ye, A. Das et al., "Advancing acoustic-to-word CTC model," in Proc. ICASSP. IEEE, 2018, pp. 5794–5798.
[43] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, "Empirical evaluation of gated recurrent neural networks on sequence modeling," arXiv preprint arXiv:1412.3555, 2014.
[44] J. L. Ba, J. R. Kiros, and G. E. Hinton, "Layer normalization," arXiv preprint arXiv:1607.06450, 2016.
[45] Z. Meng, S. Watanabe, J. R. Hershey et al., "Deep long short-term memory adaptive beamforming networks for multichannel robust speech recognition," in Proc. ICASSP. IEEE, 2017, pp. 271–275.
[46] A. Paszke, S. Gross, S. Chintala et al., "Automatic differentiation in PyTorch," in NIPS Autodiff Workshop, 2017.