Adaptable Multi-Domain Language Model for Transformer ASR
Taewoo Lee, Min-Joong Lee, Tae Gyoon Kang, Seokyeoung Jung, Minseok Kwon, Yeona Hong, Jungin Lee, Kyoung-Gu Woo, Ho-Gyeong Kim, Jiseung Jeong, Jihyun Lee, Hosik Lee, Young Sang Choi
AI R&D Group, Samsung Electronics, South Korea
Samsung Advanced Institute of Technology, Samsung Electronics, South Korea
{tw1.lee, minjoong.lee, taeg.kang, jihyun.s.lee}@samsung.com
Abstract
We propose an adapter-based multi-domain Transformer language model (LM) for Transformer ASR. The model consists of a large common LM and small adapters, and it performs multi-domain adaptation by training only the adapters and their related layers. The proposed model can reuse a fully fine-tuned LM, that is, an LM obtained by fine-tuning all layers of an original model. The proposed LM can be extended to new domains by adding about 2% of the parameters for the first domain and about 13% for each domain from the second onward. The proposed model also reduces model maintenance cost because the costly and time-consuming pre-training of the common LM can be omitted. Using the proposed adapter-based approach, we observed that a general LM with adapters can outperform a dedicated music-domain LM in terms of word error rate (WER).
Index Terms: end-to-end (E2E) automatic speech recognition (ASR), language model (LM), multi-domain adaptation
Introduction
In recent years, virtual voice assistants have spread widely into real-world applications, and end-to-end (E2E) automatic speech recognition (ASR) has become one of their key elements. As new domains continue to be added, ASR models need to be adapted quickly to them. Furthermore, domain-specific proper nouns, such as new song titles and singer names, must be recognized. It is therefore necessary to maintain the recognition accuracy of the existing domains while securing the recognition accuracy for new words in the new domain. In addition, to provide a good user experience, the response must be produced very quickly.
The Transformer was first introduced as a model for translation [1] and has since been applied successfully to ASR [2], because it has an advantage in computation and parallelism over recurrent neural network (RNN) based models. In addition, knowledge distillation has been studied to create parameter-efficient models [3,4]. Shallow fusion of E2E ASR models with external language models (LMs) has also shown a further improvement in WER [5,6], because external LMs can learn more contextual information from abundant text-only data.
In natural language processing (NLP), several methods of pre-training neural language models have led to major advances in NLP subtasks; BERT, ELMo, GPT, RoBERTa, and XLNet are typical examples [7-11]. These methods find dependencies between words and their combinations by pre-training neural networks on large amounts of data, and the resulting models can easily be applied to other NLP tasks by fine-tuning them on training data of the target task. However, it is difficult to update these models continuously because deep networks tend to forget previous knowledge when they are re-trained sequentially [12]. To solve this problem, continual learning approaches have been studied.
To preserve previous knowledge, learning without forgetting (LWF) [13] adds the output logits of previous-stage networks to the logits of current-stage networks. Elastic weight consolidation (EWC) [14] constrains weight updates by estimating which weights are important for a task. Progressive neural networks [15] avoid forgetting by preserving task-specific networks. However, these approaches are inefficient in memory and parameters [16].
In computer vision, residual adapter modules have been introduced to build a multi-task, multi-domain model [17]. In that work, a large common model is used as a base model, and small adapter modules are added in front of each batch normalization layer, either in series or in parallel. In the experiments, both variants showed better accuracy than a fully fine-tuned model. Similar approaches have been explored for BERT in NLP [18], where the authors proposed a model called projected attention layers (PALs) that handles multi-domain NLP tasks by adding only 13% adjustable parameters relative to the original model. Meanwhile, [16] proposed a method that fine-tunes models by adding only 3.6% adjustable parameters; it adds small adapters to the self-attention (SA) and feed-forward network (FFN) layers of the Transformer, respectively. In [19], the authors compared PALs and adapters and found that fine-tuning adapters together with the norm layers gave better results than PALs when a nearly equal number of parameters is used. For multilingual ASR, a structure has been introduced in which only the adapter layers are switched [20]; those experiments were conducted on recurrent neural network transducer (RNN-T) based streaming E2E ASR models.
In this paper, we study an external LM structure for Transformer-based ASR that can be adapted to multiple domains with only a 2% or 13% parameter addition per domain. To the best of our knowledge, this is the first attempt to apply adapters to a Transformer LM in ASR. The contributions of our model are: 1) Our adapter-based adaptation can be applied on top of a fully fine-tuned model, and it further reduces the word error rate (WER) of that model. 2) A multi-domain LM can be supported with fewer parameters. 3) Our approach provides a cost-efficient way to maintain existing models. The rest of the paper is structured as follows: we describe our model architecture in Section 2, report the experimental results on our data in Section 3, and draw conclusions in Section 4.
SA-based Multi-Domain LM with Adapter
Transformer-based E2E ASR
Fig. 1 shows a Transformer-based E2E ASR model with an external LM. As in [2], the encoder module, which is similar to an acoustic model, takes the input features $\mathbf{X}$ and transforms them into a higher-level feature representation with self-attention layers. The encoder outputs, the key $\mathbf{K}_{enc}$ and the value $\mathbf{V}_{enc}$, are passed to the encoder-decoder attention layers of the E2E decoder. Using $\mathbf{K}_{enc}$ and $\mathbf{V}_{enc}$, the E2E decoder iteratively predicts the output probabilities $p(y_t \mid y_1, \cdots, y_{t-1}, \mathbf{X})$ of the next output symbol $y_t$ until the maximum sequence length is reached or EOS (end-of-sequence) is emitted. An external LM, from which the encoder-decoder attention layers are removed, can be incorporated at each step of beam search to improve accuracy. Hereafter, we focus on an external LM decoder with adapters.
SA-based LM Decoder with Adapter
The SA-based LM decoder consists of three parts: an input embedding, $N_L$ LM SA layers, and a linear transform followed by a softmax (Fig. 2, left). For simplicity, we set the batch size and the number of domains to one in what follows.
Input Embedding
Let the word-piece [21] vocabulary size be $n_w$, the input one-hot vector be $x_t \in \mathbb{R}^{n_w}$, and the hidden size be $h$. The output of the embedding matrix is computed as
$$e_{in} = x_t W_e \qquad (1)$$
where $W_e \in \mathbb{R}^{n_w \times h}$ and $e_{in} \in \mathbb{R}^{h}$. Then a positional encoding vector $PE \in \mathbb{R}^{h}$ is added to $e_{in}$ [1].
SA layer in LM Decoder with Adapter
An SA layer of an LM decoder with adapters consists of four kinds of layers: layer norm [22], multi-head attention (MHA), FFN, and adapters.
Multi-Head Attention
Let the number of heads be $n_{head}$. The previous output is projected to a query, a key, and a value simultaneously for multi-head attention (Fig. 2, left). Instead of performing a single attention function with $h$-dimensional $Q$, $K$, and $V$, MHA performs the attention function $n_{head}$ times in parallel with differently learned $h/n_{head}$-dimensional projections of $Q$, $K$, and $V$. The $n_{head}$ outputs are then concatenated and projected into a single representation:
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(head_1, \cdots, head_{n_{head}}) W_O \qquad (2)$$
where
$$head_i = \mathrm{Attention}(Q W_i^Q, K W_i^K, V W_i^V) = \mathrm{softmax}\left(\frac{(Q W_i^Q)(K W_i^K)^T}{\sqrt{h / n_{head}}}\right)(V W_i^V), \qquad (3)$$
and $W_i^Q \in \mathbb{R}^{h \times d_q}$, $W_i^K \in \mathbb{R}^{h \times d_k}$, $W_i^V \in \mathbb{R}^{h \times d_v}$, and $W_O \in \mathbb{R}^{h \times h}$ are trainable parameters. Note that $d_q = d_k = d_v = h / n_{head}$ throughout the paper.
Position-wise Feed-Forward Network
Let the inner filter size be $f$. The position-wise feed-forward network consists of two linear transforms with a ReLU activation in between. Its output is calculated as
$$\mathrm{FFN}(x) = \max(0, x W_1 + b_1) W_2 + b_2 \qquad (4)$$
where $x \in \mathbb{R}^{h}$ is the input vector, and $W_1 \in \mathbb{R}^{h \times f}$, $b_1 \in \mathbb{R}^{f}$, $W_2 \in \mathbb{R}^{f \times h}$, and $b_2 \in \mathbb{R}^{h}$ are the weight matrices and bias vectors.
Figure 1. The dotted-line box shows the Transformer-based E2E ASR model, including the encoder and decoder. An external LM is incorporated at each step of beam search.
Figure 2. (Left) Architecture of the Transformer multi-domain LM. In the LM decoder, the adapter module (right) is added on top of the multi-head attention and feed-forward layers. Only the green layers (including layer norms, LN) are fine-tuned on the downstream data and expanded for each of the multiple domains. The dotted red lines show a switchable decoding path for a first domain.
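As an illustration of (2)-(4), the following is a minimal pure-Python sketch, not the Tensor2Tensor implementation used in the experiments. It evaluates the multi-head attention and the position-wise feed-forward network on small lists of rows. The helper names (`matmul`, `softmax_rows`, `attention`, `multi_head`, `ffn`) and the toy weights are assumptions made for this example, and the causal mask that a decoder LM would normally apply to future positions is omitted because (3) is written without it.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax_rows(m):
    """Apply a numerically stable softmax to each row."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def attention(q, k, v):
    """Scaled dot-product attention of (3); the scale is sqrt(h / n_head) = sqrt(d_k)."""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    return matmul(softmax_rows(scores), v)

def multi_head(x, w_q, w_k, w_v, w_o):
    """Self-attention form of (2): Q = K = V = x; w_q, w_k, w_v hold one matrix per head."""
    heads = [attention(matmul(x, wq), matmul(x, wk), matmul(x, wv))
             for wq, wk, wv in zip(w_q, w_k, w_v)]
    # Concatenate the per-head outputs position by position.
    concat = [sum((head[t] for head in heads), []) for t in range(len(x))]
    return matmul(concat, w_o)

def ffn(x, w1, b1, w2, b2):
    """Position-wise feed-forward network of (4)."""
    hidden = [[max(0.0, v + b) for v, b in zip(row, b1)] for row in matmul(x, w1)]
    return [[v + b for v, b in zip(row, b2)] for row in matmul(hidden, w2)]

# Toy check with h = 4 and n_head = 2, so each head works on h / n_head = 2 dimensions.
x = [[1.0, 2.0, 3.0, 4.0]]  # a single position
first_half = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]
second_half = [[0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
per_head = [first_half, second_half]
identity = [[float(i == j) for j in range(4)] for i in range(4)]
attended = multi_head(x, per_head, per_head, per_head, identity)
```

With a single position the softmax weight is exactly one, so in this toy setting `attended` reproduces the input; longer inputs mix positions according to the attention weights.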
[Figure 2 labels: adapter layer (feed-forward down-project, nonlinearity, feed-forward up-project); Transformer LM (input embedding, positional encoding, multi-head attention, softmax). Figure 1 labels: Transformer encoder, Transformer decoder, Transformer LM, E2E.]
Adapter
Adapter modules proposed in [16] are inserted on top of the MHA and FFN layers, as in Fig. 2 (left). An adapter module (Fig. 2, right) consists of two linear transforms with a ReLU activation in between, and a residual connection is added to the output. The outputs of the adapters $A_1$ and $A_2$ are calculated as follows:
$$A_1(z_1) = z_1 + \max(0, z_1 W_{11} + b_{11}) W_{12} + b_{12} \qquad (5)$$
$$A_2(z_2) = z_2 + \max(0, z_2 W_{21} + b_{21}) W_{22} + b_{22}, \qquad (6)$$
where $z_1 = \mathrm{MultiHead}(Q, K, V)$, $z_2 = \mathrm{FFN}(x)$, the adapter filter size is $f_A$, $W_{11}, W_{21} \in \mathbb{R}^{h \times f_A}$, $b_{11}, b_{21} \in \mathbb{R}^{f_A}$, $W_{12}, W_{22} \in \mathbb{R}^{f_A \times h}$, and $b_{12}, b_{22} \in \mathbb{R}^{h}$.
Softmax
The outputs of the decoder are transformed into probabilities over the output classes by a linear projection $W \in \mathbb{R}^{h \times n_w}$ and a subsequent softmax function.
Experiments
Table 1 shows the model architectures and model sizes used in the experiments. We use a general-domain LM (G-LM), a music-domain LM (M-LM), and their adapter-added versions (G-LM-A, M-LM-A). With single-precision floating point, the model size increases by about 2% when adapters are added for a first domain.
The G-LM is trained on 24 GiB of normalized Korean text data consisting of 353M utterances. All data were anonymized. The data consist of representative utterances of Samsung's Bixby scenarios and a general-domain corpus. The M-LM is trained on normalized Korean text data consisting of 45M utterances, in which general and music-domain (commands related to song titles and singer names) corpora are mixed. To train our models, we used the Tensor2Tensor framework [23].
For the G-LM experiments, we recorded test cases (TCs) in three categories: In-Domain, Out-Domain, and Open-Domain. The In-Domain TCs include 50K Bixby use-case utterances such as phone and device control commands and daily conversational questions and answers. The Out-Domain TCs include 8K domain-specific utterances that are not included in the In-Domain training corpus; in particular, we selected domains that have their own unique proper nouns, such as hospital or doctor's names. The Open-Domain TCs are included to test noisy environments, in which cafe, city, office, and highway noises are added to clean speech; the content of these utterances is from arbitrary domains and does not include unknown unique proper nouns. All TCs are recorded in male and female voices.
For the M-LM experiments, In-Domain and Out-Domain TCs are recorded. The In-Domain TCs include 610 utterances representing well-known song titles and singer names. The Out-Domain TCs include 3709 utterances whose content is newly added song titles and singer names.
We initialized the weights of each adapter layer with values drawn from a normal distribution with zero mean and a variance of $10^{-4}$.
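To make the adapter size and initialization concrete, the sketch below implements the adapter of (5)-(6) for a single position in pure Python and evaluates the per-domain parameter counts stated in the Results discussion. It is an illustration under stated assumptions, not the authors' code: the function names, the zero-initialized biases, and the fixed random seed are choices made for this example, and the number of adapter modules `n_adapters = 6` is an assumption chosen so that the added size matches the 1.5 MiB difference between G-LM and G-LM-A in Table 1.

```python
import random

def adapter(x, w_down, b_down, w_up, b_up):
    """Adapter of (5)-(6) for one position: x + max(0, x W_down + b_down) W_up + b_up."""
    hidden = [max(0.0, sum(xi * w for xi, w in zip(x, col)) + b)
              for col, b in zip(zip(*w_down), b_down)]
    up = [sum(hi * w for hi, w in zip(hidden, col)) + b
          for col, b in zip(zip(*w_up), b_up)]
    return [xi + ui for xi, ui in zip(x, up)]

def init_adapter(h, f_a, variance, rng):
    """Draw weights from N(0, variance); zero biases are an assumption of this sketch."""
    std = variance ** 0.5
    w_down = [[rng.gauss(0.0, std) for _ in range(f_a)] for _ in range(h)]
    w_up = [[rng.gauss(0.0, std) for _ in range(h)] for _ in range(f_a)]
    return w_down, [0.0] * f_a, w_up, [0.0] * h

def added_params_first_domain(n, h, f_a):
    """Adapters only: norms and the softmax linear layer are reused."""
    return n * (2 * f_a * h + f_a + h)

def added_params_later_domain(n, h, f_a, n_w):
    """Adapters plus per-domain norms and softmax linear layer."""
    return n * (2 * f_a * h + f_a + 3 * h) + (2 * h + h * n_w + n_w)

# Zero variance: the residual connection makes the adapter an exact bypass.
rng = random.Random(0)
x = [0.5, -1.5, 2.0, 0.0, 1.0, -1.0, 0.25, 3.0]
bypassed = adapter(x, *init_adapter(8, 4, 0.0, rng))

# Size accounting with h = 512, f_A = 64, 4096 word-pieces (values from the paper).
n_adapters = 6                         # assumption inferred from Table 1, see lead-in
base_params = 76.4 * 2 ** 20 / 4       # G-LM size in MiB at 4 bytes per parameter
first_ratio = added_params_first_domain(n_adapters, 512, 64) / base_params
later_ratio = added_params_later_domain(n_adapters, 512, 64, 4096) / base_params
print(f"first domain: {first_ratio:.1%}, later domains: {later_ratio:.1%}")
```

Under these assumptions the first-domain ratio comes out near 2% and the later-domain ratio near 12.5%, which is in line with the roughly 2% and 13% quoted in the text; most of the later-domain cost comes from the per-domain softmax linear layer rather than from the adapters.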
We tested variance values of $\{0, 10^{-7}, 10^{-6}, 10^{-5}, 10^{-4}, 10^{-3}, 10^{-2}\}$ and selected the largest stable value. Since an adapter module internally has a residual connection, a zero variance can be used to test that the adapter module is bypassed properly. Models built from scratch are trained on eight P40 GPUs, and adaptations are run on one P40 GPU. We used the Adam optimizer with $\beta_1 = 0.9$, $\beta_2 = 0.98$, and $\epsilon = 10^{-9}$. Batch sizes were tested from {32, 64, 128, 512, 1024, 4096, 8192}, and 8192 is used for all our adaptation experiments; unlike [16], a small batch size made our training unstable and it failed to converge. The learning rate was selected as 0.03 from {0.1, 0.03, 0.001, 0.0003, 0.0001}. When we train our models from scratch or adapt without adapters, we apply the Noam learning rate decay scheme with 1000 warmup steps. On the other hand, when we train the adapter-related layers, no learning rate decay scheme is used. We used 4096 word-pieces as output token units. For E2E model training, we used the same hyper-parameters as in [3]. All experiments used input feature processing identical to that of [24]. The decoding hyper-parameters (beam size, length penalty, and maximum decoding length) were tuned to minimize WER. Known proper nouns and numbers are converted with an inverse text normalization (ITN) module. We assumed that the proper domain was already known before inference.
We first define four model training or adaptation methods.
(1) build from scratch: LMs are trained on the whole corpus. All layers are trained and adapters are not added.
(2) full fine-tuning: LMs are fine-tuned on a small corpus of the target domain. All layers are tuned and adapters are not added.
(3) adapter fine-tuning: adapters are added on top of fully fine-tuned models, and the LMs are fine-tuned on a small corpus of the target domain. Only the adapter-related layers (adapters, norms, and the softmax linear layer) are tuned.
(4) iterative adapter fine-tuning: an LM is adapter fine-tuned iteratively. An iteration here means the process of a) decoding the TCs with the latest adapter fine-tuned LM, b) collecting error sentences from the results, and c) adapter fine-tuning the latest LM with the error sentences.
Our goals are to obtain 1) iterative adapter fine-tuned LMs that perform better than the best fully fine-tuned LMs, and 2) LMs that can be extended to multiple domains by iterative adapter fine-tuning on a common pre-trained LM. To show that the proposed models achieve these goals, we conduct four experiments.
Table 1. The architectures and sizes of the SA E2E model, general LM (G-LM), music LM (M-LM), and adapter-added LMs.
| | E2E Enc. | E2E Dec. | G-LM | G-LM-A | M-LM | M-LM-A |
|---|---|---|---|---|---|---|
| $h$ | 512 | 512 | 512 | 512 | 512 | 512 |
| $f_A$ | - | - | - | 64 | - | 64 |
| $n_{head}$ | 16 | 4 | 8 | 8 | 8 | 8 |
| Size (MiB) | 96.7 | 80.3 | 76.4 | 77.9 | 40.3 | 41.3 |
Results
Table 2. WERs (%) of E2E, E2E-G-LM, and E2E-G-LM-A on general-domain TCs.
| TC | E2E | E2E-G-LM | E2E-G-LM-A |
|---|---|---|---|
| In-Domain | 2.42 | 1.82 | 1.69 |
| Out-Domain | 10.62 | 8.18 | 2.84 |
| Open-Domain | 12.8 | 5.08 | 4.55 |
Table 3. WERs (%) of E2E, E2E-M-LM, and E2E-M-LM-A on music-domain TCs.
| TC | E2E | E2E-M-LM | E2E-M-LM-A |
|---|---|---|---|
| In-Domain | 8.2 | 2.68 | 2.46 |
| Out-Domain | 12.66 | 5.43 | 4.13 |
Table 4. WERs (%) of iterative adapter fine-tuning with M-LM-A on music-domain TCs.
| TC | E2E-M-LM | M1 iter1 | M1 iter2 | M1 iter3 |
|---|---|---|---|---|
| In-Domain | 2.68 | 2.46 | 1.97 | |
| Out-Domain | 5.43 | 4.13 | 3.96 | |
Table 2 shows the WERs measured with only the E2E model (E2E), the E2E model with a fully fine-tuned G-LM (E2E-G-LM), and the E2E model with an adapter fine-tuned G-LM (E2E-G-LM-A). Compared to decoding with only the E2E model, E2E-G-LM reduced WERs by 0.6, 2.44, and 7.72%p for the in-, out-, and open-domain TCs, respectively. When the fully fine-tuned G-LM was additionally adapter fine-tuned (E2E-G-LM-A), the reductions relative to the E2E-only results grew to 0.73, 7.78, and 8.25%p for the in-, out-, and open-domain TCs, respectively. The improvement was largest in the domains with unusual proper nouns, which means that adapter fine-tuning can properly bias the output probabilities toward unusual proper nouns. In addition, despite this strong biasing, the accuracy on the existing-domain TCs did not deteriorate.
Table 3 shows the WERs measured with only the E2E model (E2E), the E2E model with a fully fine-tuned M-LM (E2E-M-LM), and the E2E model with an adapter fine-tuned M-LM (E2E-M-LM-A). The E2E model with a fully fine-tuned M-LM showed better WERs than the E2E model alone: the WERs of the in- and out-domain TCs were reduced by 5.52 and 7.23%p, respectively. When the fully fine-tuned M-LM was additionally adapter fine-tuned (E2E-M-LM-A), WERs were further reduced by 0.22 and 1.3%p for the in- and out-domain TCs, respectively. As in the G-LM experiments, adapter fine-tuning improves proper noun recognition accuracy without compromising the accuracy of existing domains, even for smaller models.
In Table 4, we see how far WERs can be reduced by iterative adapter fine-tuning. M1 refers to an adapter fine-tuned M-LM trained on the error sentences from the E2E-M-LM decoding results. We considered decoding, error sentence extraction, and re-training as one iteration. In the experiment, accuracy improved up to the third iteration; beyond that, WERs did not improve further.
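The iteration described above (decode, collect error sentences, adapter fine-tune) can be sketched as a simple loop. This is a hypothetical outline rather than the authors' pipeline: `decode` and `finetune` are placeholder callables supplied by the caller, WER is computed on whitespace-separated tokens, the error sentences are taken to be the reference transcripts of misrecognized utterances, and stopping when WER no longer improves mirrors the observation that gains stopped after three iterations.

```python
def word_errors(ref, hyp):
    """Word-level Levenshtein distance between two whitespace-tokenized sentences."""
    r, h = ref.split(), hyp.split()
    prev = list(range(len(h) + 1))
    for i, rw in enumerate(r, start=1):
        cur = [i]
        for j, hw in enumerate(h, start=1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (rw != hw)))    # substitution or match
        prev = cur
    return prev[-1]

def wer(refs, hyps):
    """Corpus WER in percent: total word errors over total reference words."""
    errors = sum(word_errors(r, h) for r, h in zip(refs, hyps))
    words = sum(len(r.split()) for r in refs)
    return 100.0 * errors / words

def iterative_adapter_finetune(lm, refs, decode, finetune, max_iters=3):
    """Repeat decode -> collect error sentences -> adapter fine-tune until WER stalls.

    decode(lm) returns one hypothesis per reference; finetune(lm, sentences)
    returns a new LM. Both are hypothetical stand-ins for the real systems.
    """
    best_lm, best_wer = lm, wer(refs, decode(lm))
    for _ in range(max_iters):
        hyps = decode(best_lm)
        error_sents = [r for r, h in zip(refs, hyps) if r != h]
        if not error_sents:
            break
        candidate = finetune(best_lm, error_sents)
        cand_wer = wer(refs, decode(candidate))
        if cand_wer >= best_wer:
            break
        best_lm, best_wer = candidate, cand_wer
    return best_lm, best_wer
```

Keeping the best model and stopping on the first non-improving iteration is a design choice of this sketch; a fixed number of iterations would match the reported procedure equally well.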
Table 5 compares two cases: using a G-LM as a common base LM with iterative adapter fine-tuning (E2E-G-LM-A iter), and creating a dedicated M-LM that is fully fine-tuned or adapter fine-tuned (E2E-M-LM, E2E-M-LM-A). As expected, when we decode the music-domain TCs with the E2E model and the G-LM without any adaptation (E2E-G-LM) as a baseline, the error rates are higher than those of E2E-M-LM and E2E-M-LM-A. The last two columns of Table 5 show the word error rate reduction (WERR). When we iteratively adapter fine-tuned the G-LM three times (E2E-G-LM-A iter3), WERs were reduced by 0.49 and 0.83%p for the in- and out-domain TCs, respectively, compared to E2E-M-LM. Also, the WERs of E2E-G-LM-A iter3 were close to those of E2E-M-LM-A. This means that a common G-LM with adapters can be used as a dedicated domain LM, and that we can switch only the adapter-related layers to fit the model to each domain. Therefore, a multi-domain LM configuration with the structure shown in Fig. 2 is possible.
Since $f_A$ is relatively small, the number of parameters added per domain is about 2% for the first domain and about 13% for each domain from the second onward. Specifically, with $N$ adapter modules, $N(2 f_A h + f_A + h)$ parameters are added for the first domain, because the norm and softmax linear layers can be reused, and $N(2 f_A h + f_A + 3h) + (2h + h n_w + n_w)$ parameters are added for each domain from the second onward, where $n_w$ is the word-piece vocabulary size. This slow growth is important because memory is limited on GPUs and in on-device applications.
We built our base LMs from scratch on eight P40 GPUs and on v3-8 tensor processing units (TPUs). It took three days on eight P40 GPUs and 4 hours and 30 minutes on the TPU. The iterative adapter fine-tuning proposed in this paper can train the G-LM in 60 minutes and the M-LM in 25 minutes on one P40 GPU. Since P40 GPUs may be available in on-premise servers, we expect that cloud computing costs can be saved.
Conclusions
In this paper, an adapter-based multi-domain LM structure has been proposed. The structure combines two architectures: the adapter module proposed for BERT in NLP and the switchable adapter architecture proposed for RNN-T streaming ASR models. The proposed architecture allows LMs to be extended to multiple domains while suppressing the growth in the number of parameters. It can reduce the WERs of target domains without degrading the WERs of existing domains. We also observed that applying adapter modules to a Transformer LM improves WER, especially for proper nouns that are hard to handle with a common base LM. Finally, the proposed architecture can reuse standard fully fine-tuned LMs, so such LMs can be transferred without any changes.
Table 5. Iterative fine-tuning performance (WER, %). The results show that a G-LM with iteratively fine-tuned adapters can be used as a dedicated music LM.
| TC | E2E-M-LM | E2E-M-LM-A | E2E-G-LM | E2E-G-LM-A iter1 | E2E-G-LM-A iter2 | E2E-G-LM-A iter3 | WERR (E2E-G-LM-A iter3 - E2E-M-LM) | WERR (E2E-G-LM-A iter3 - E2E-M-LM-A) |
|---|---|---|---|---|---|---|---|---|
| In-Domain | 2.68 | 2.46 | 4.65 | 3.82 | 2.38 | 2.19 | -0.49 | -0.27 |
| Out-Domain | 5.43 | 4.13 | 11.27 | 5.75 | 4.75 | 4.60 | -0.83 | 0.47 |
References
[1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention Is All You Need," in NIPS, 2017.
[2] L. Dong, S. Xu, and B. Xu, "Speech-transformer: a No-recurrence Sequence-to-Sequence Model for Speech Recognition," in ICASSP, 2018.
[3] H. Kim, H. Na, H. Lee, J. Lee, T. Kang, M. Lee, and Y. Choi, "Knowledge Distillation Using Output Errors for Self-Attention End-To-End Models," in ICASSP, 2019.
[4] K. Kwon, H. Na, H. Lee, and N. Kim, "Adaptive Knowledge Distillation Based On Entropy," in ICASSP, 2020.
[5] C. Gulcehre, O. Firat, K. Xu, K. Cho, L. Barrault, H. C. Lin, F. Bougares, H. Schwenk, and Y. Bengio, "On Using Monolingual Corpora in Neural Machine Translation," arXiv preprint arXiv:1503.03535, 2015.
[6] A. Kannan, Y. Wu, P. Nguyen, T. N. Sainath, Z. Chen, and R. Prabhavalkar, "An Analysis of Incorporating an External Language Model into a Sequence-to-Sequence Model," in ICASSP, 2018.
[7] J. Devlin, M. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding," arXiv preprint arXiv:1810.04805, 2018.
[8] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer, "Deep contextualized word representations," in NAACL, 2018.
[9] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, "Language models are unsupervised multitask learners," Accessed on: May 7 2018. [Online]. Available: https://openai.com/blog/better-language-models
[10] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, "RoBERTa: A Robustly Optimized BERT Pretraining Approach," arXiv preprint arXiv:1907.11692, 2019.
[11] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. Salakhutdinov, and Q. V. Le, "XLNet: Generalized Autoregressive Pretraining for Language Understanding," in NeurIPS, 2019.
[12] M. McCloskey and N. J. Cohen, "Catastrophic Interference in Connectionist Networks: The Sequential Learning Problem," in Psychology of Learning and Motivation, 1989.
[13] Z. Li and D. Hoiem, "Learning without Forgetting," in ECCV, 2016.
[14] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, D. Hassabis, C. Clopath, D. Kumaran, and R. Hadsell, "Overcoming catastrophic forgetting in neural networks," in PNAS, 2017.
[15] A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, and R. Hadsell, "Progressive Neural Networks," arXiv preprint arXiv:1606.04671, 2016.
[16] N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. de Laroussilhe, A. Gesmundo, M. Attariyan, and S. Gelly, "Parameter-Efficient Transfer Learning for NLP," in ICML, 2019.
[17] S. A. Rebuffi, H. Bilen, and A. Vedaldi, "Efficient Parametrization of Multi-Domain Deep Neural Networks," in CVPR, 2018.
[18] C. Stickland and I. Murray, "BERT and PALs: Projected Attention Layers for Efficient Adaptation in Multi-Task Learning," in PMLR, 2019.
[19] S. J. Semnani, K. R. Sadagopan, and F. Tlili, "BERT-A: Fine-Tuning BERT with Adapters and Data Augmentation," [Online]. Available: http://web.stanford.edu/class/cs224n/reports/default/15848417.pdf
[20] A. Kannan, A. Datta, T. N. Sainath, E. Weinstein, B. Ramabhadran, Y. Wu, A. Bapna, Z. Chen, and S. Lee, "Large-Scale Multilingual Speech Recognition with a Streaming End-to-End Model," in INTERSPEECH, 2019.
[21] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, J. Klingner, A. Shah, M. Johnson, X. Liu, L. Kaiser, S. Gouws, Y. Kato, T. Kudo, H. Kazawa, K. Stevens, G. Kurian, N. Patil, W. Wang, C. Young, J. Smith, J. Riesa, A. Rudnick, O. Vinyals, G. Corrado, M. Hughes, and J. Dean, "Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation," arXiv preprint arXiv:1609.08144, 2016.
[22] J. L. Ba, J. R. Kiros, and G. E. Hinton, "Layer Normalization," arXiv preprint arXiv:1607.06450, 2016.
[23] A. Vaswani, S. Bengio, E. Brevdo, F. Chollet, A. N. Gomez, S. Gouws, L. Jones, L. Kaiser, N. Kalchbrenner, N. Parmar, R. Sepassi, N. Shazeer, and J. Uszkoreit, "Tensor2Tensor for Neural Machine Translation," arXiv preprint arXiv:1803.07416, 2018.
[24] C. C. Chiu, T. N. Sainath, Y. Wu, R. Prabhavalkar, P. Nguyen, Z. Chen, A. Kannan, R. J. Weiss, K. Rao, E. Gonina, N. Jaitly, B. Li, J. Chorowski, and M. Bacchiani, "State-of-the-art Speech Recognition with Sequence-to-Sequence Models," in