Adaptable Multi-Domain Language Model for Transformer ASR

Abstract

We propose an adapter-based multi-domain Transformer language model (LM) for Transformer ASR. The model consists of a large common LM and small adapters, and it performs multi-domain adaptation by training only the adapters and their related layers. The proposed model can reuse a full fine-tuned LM, that is, an LM fine-tuned using all layers of the original model. It can be extended to new domains by adding about 2% of the parameters for the first domain and about 13% for each domain from the second onward. The proposed model also reduces model maintenance cost, because the costly and time-consuming pre-training of the common LM can be omitted. Using the proposed adapter-based approach, we observed that a general LM with adapters can outperform a dedicated music-domain LM in terms of word error rate (WER).


Taewoo Lee, Min-Joong Lee, Tae Gyoon Kang, Seokyeoung Jung, Minseok Kwon, Yeona Hong, Jungin Lee, Kyoung-Gu Woo, Ho-Gyeong Kim, Jiseung Jeong, Jihyun Lee, Hosik Lee, Young Sang Choi

AI R&D Group, Samsung Electronics, South Korea
Samsung Advanced Institute of Technology, Samsung Electronics, South Korea

{tw1.lee, minjoong.lee, taeg.kang, jihyun.s.lee}@samsung.com


Index Terms: end-to-end (E2E) automatic speech recognition (ASR), language model (LM), multi-domain adaptation

Introduction

In recent years, virtual voice assistants have spread widely into real-world applications, and end-to-end (E2E) automatic speech recognition (ASR) has become one of the key elements of virtual voice assistant services. As new domains continue to be added, ASR models need to be adapted quickly to them. Furthermore, domain-specific proper nouns, such as new song titles and singer names, must be recognized. It is therefore necessary to maintain the recognition accuracy of the existing supported domains while securing the recognition accuracy for new words in the new domain. In addition, to provide a good user experience, such a response must happen very quickly.

The Transformer was first introduced as a model for translation [1] and has since been applied successfully to ASR [2], because it has an advantage in computation and parallelism over recurrent neural network (RNN) based models. In addition, knowledge distillation has been studied to create parameter-efficient models [3,4]. Shallow fusion of E2E ASR models with external language models (LMs) has also shown a further improvement in word error rate (WER) [5,6], because external LMs can learn more contextual information from abundant text-only data.

In natural language processing (NLP), several methods of pre-training neural language models have led to major advances in NLP subtasks; BERT, ELMo, GPT, RoBERTa, and XLNet are typical examples [7-11]. These methods find dependencies between words and their combinations by pre-training neural networks on large amounts of data. By fine-tuning the model on training data of a target task, these models can also be applied easily to other NLP tasks. However, it is difficult to update these models continuously, because deep networks tend to forget previous knowledge when they are re-trained sequentially [12]. To solve this problem, continual learning approaches have been studied.
To preserve previous knowledge, learning without forgetting (LWF) [13] adds output logits of previous stage networks to logits of current stage networks. Elastic weight consolidation (EWC) [14] constrains weight updates according to how important each weight is for a task. Progressive neural networks [15] avoid forgetting by preserving task-specific networks. However, those approaches are imperfect in memory and parameter efficiency [16].

In computer vision, residual adapter modules have been introduced to build multi-task and multi-domain models [17]. In that work, a large common model is used as the base model, and small adapter modules are added in front of each batch normalization layer, in series or in parallel. In the experiments, both variants showed better accuracy than a full fine-tuned model. Similar approaches have been explored for BERT in NLP [18], where the authors proposed a model (called projected attention layers, or PALs) that can handle multi-domain NLP tasks by adding only 13% of adjustable parameters compared to the original model. Meanwhile, [16] proposed a method that fine-tunes models by adding only 3.6% of adjustable parameters; it adds small adapters to the self-attention (SA) and feed-forward network (FFN) layers of the Transformer, respectively. In [19], the authors compared PALs and adapters, and fine-tuning adapters together with the norm layers showed better results than PALs when a similar number of parameters is used. For multilingual ASR, a structure has been introduced in which only the adapter layers are switched [20]; those experiments were conducted on recurrent neural network transducer (RNN-T) based streaming E2E ASR models.

In this paper, we study an external LM structure for Transformer-based ASR models that can be adapted to multiple domains with only a 2% or 13% parameter addition per domain. To the best of our knowledge, this is the first attempt to apply adapters to a Transformer LM in ASR. The contributions of our model are: 1) Our adapter-based adaptation can be used on top of a full fine-tuned model, and it further reduces the word error rate (WER) of that model. 2) A multi-domain LM can be supported with fewer parameters. 3) Our approach provides a cost-efficient way to maintain existing models. The rest of the paper is structured as follows: we describe our model architecture in Section 2. The experimental results on our data are reported in Section 3. Finally, we draw conclusions in Section 4.

SA-based Multi-Domain LM with Adapter

Transformer-based E2E ASR

Fig. 1 shows a Transformer-based E2E ASR model with an external LM. As in [2], the encoder module, which is similar to an acoustic model, takes the input features $\mathbf{x}$ and transforms them into a higher-level feature representation with self-attention layers. The encoder outputs, key $\mathbf{K}_{enc}$ and value $\mathbf{V}_{enc}$, are passed to the encoder-decoder attention layers of the E2E decoder. Using $\mathbf{K}_{enc}$ and $\mathbf{V}_{enc}$, the E2E decoder iteratively predicts the output probabilities $P(y_t \mid y_1, \cdots, y_{t-1}, \mathbf{x})$ of the next output symbol $y_t$ until the maximum sequence length is reached or an end-of-sequence (EOS) symbol is emitted. An external LM, from which the encoder-decoder attention layers are removed, can be incorporated at each step of beam search to improve accuracy. Hereafter, we focus on an external LM decoder with adapters.

SA-based LM Decoder with Adapter

The SA-based LM decoder consists of three parts: an output embedding, $N_L$ LM SA layers, and a linear transform followed by a softmax (Fig. 2, left). For simplicity, we set the batch size and the number of domains to one in what follows.

Input Embedding

Let the word-piece [21] vocabulary size be $N_w$, an input one-hot vector be $x_t \in \mathbb{R}^{N_w}$, and the hidden size be $h$. The output of the embedding matrix is computed as

$$W_{eo} = x_t \mathbf{W}_e \tag{1}$$

where $\mathbf{W}_e \in \mathbb{R}^{N_w \times h}$ and $W_{eo} \in \mathbb{R}^{h}$. Then a positional encoding vector $PE \in \mathbb{R}^{h}$ is added to $W_{eo}$ [1].

SA layer in LM Decoder with Adapter

An SA layer of an LM decoder with adapters consists of four kinds of layers: layer norm [22], multi-head attention (MHA), FFN, and adapters.

Multi-Head Attention

Let the number of heads be $N_{head}$. The previous output is projected to a query, a key, and a value simultaneously for multi-head attention (Fig. 2, left). Instead of performing a single attention function using $h$-dimensional $Q$, $K$, and $V$, MHA performs the attention function $N_{head}$ times in parallel with differently learned $h/N_{head}$-dimensional projections of $Q$, $K$, and $V$. The $N_{head}$ outputs are then concatenated and projected into a single representation:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(head_1, \cdots, head_{N_{head}})\,\mathbf{W}^{O} \tag{2}$$

where

$$head_i = \mathrm{Attention}(Q\mathbf{W}_i^{Q}, K\mathbf{W}_i^{K}, V\mathbf{W}_i^{V}) = \mathrm{softmax}\left(\frac{(Q\mathbf{W}_i^{Q})(K\mathbf{W}_i^{K})^{T}}{\sqrt{h/N_{head}}}\right)(V\mathbf{W}_i^{V}), \tag{3}$$

and $\mathbf{W}_i^{Q} \in \mathbb{R}^{h \times d_q}$, $\mathbf{W}_i^{K} \in \mathbb{R}^{h \times d_k}$, $\mathbf{W}_i^{V} \in \mathbb{R}^{h \times d_v}$, and $\mathbf{W}^{O} \in \mathbb{R}^{h \times h}$ are trainable parameters. Note that $d_q = d_k = d_v = h/N_{head}$ throughout the paper.

Position-wise Feed-Forward Network

Let the inner filter size be $f$. The position-wise feed-forward network consists of two linear transforms with a ReLU activation in between. Its output is calculated as

$$\mathrm{FFN}(i) = \max(0,\, i\mathbf{W}_1 + b_1)\,\mathbf{W}_2 + b_2 \tag{4}$$

where the input vector is $i \in \mathbb{R}^{h}$, and the weight matrices and bias vectors are $\mathbf{W}_1 \in \mathbb{R}^{h \times f}$, $b_1 \in \mathbb{R}^{f}$, $\mathbf{W}_2 \in \mathbb{R}^{f \times h}$, and $b_2 \in \mathbb{R}^{h}$.

Figure 1. The dotted-line box shows the Transformer-based E2E ASR model, including the encoder and decoder. An external LM is incorporated at each step of beam search.

Figure 2. (Left) Architecture of the Transformer multi-domain LM. In the LM decoder, the adapter module (right) is added on top of the multi-head attention and feed-forward layers. Only the green layers (including layer norms, LN) are fine-tuned on the downstream data and expanded for $N_d$ domains. The dotted red lines show a switchable decoding path for the first domain.
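As an illustration of equations (2) to (4), the following NumPy sketch implements multi-head attention with the $\sqrt{h/N_{head}}$ scaling and the position-wise FFN. It is a minimal sketch, not the authors' implementation: the function names, the stacked weight layout, the toy sizes, and the causal mask (assumed here because an LM attends only to earlier tokens) are all assumptions made for the example.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_attention(Q, K, V, Wq, Wk, Wv, Wo, causal=True):
    """Eq. (2)-(3). Q, K, V: (T, h); Wq, Wk, Wv: (n_head, h, h // n_head); Wo: (h, h)."""
    n_head, _, d_k = Wq.shape                        # d_k = h / n_head
    T = Q.shape[0]
    # Assumption: a causal mask hides future positions, as is usual for an LM.
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    heads = []
    for i in range(n_head):
        q, k, v = Q @ Wq[i], K @ Wk[i], V @ Wv[i]    # each (T, d_k)
        scores = (q @ k.T) / np.sqrt(d_k)            # (T, T)
        if causal:
            scores = np.where(future, -1e9, scores)
        heads.append(softmax(scores) @ v)            # (T, d_k)
    return np.concatenate(heads, axis=-1) @ Wo       # (T, h)


def ffn(i, W1, b1, W2, b2):
    """Eq. (4): two linear transforms with a ReLU in between."""
    return np.maximum(0.0, i @ W1 + b1) @ W2 + b2


# Toy sizes (hypothetical, much smaller than the models in Table 1).
rng = np.random.default_rng(0)
T, h, n_head, f = 5, 8, 2, 16
X = rng.standard_normal((T, h))
Wq, Wk, Wv = (rng.standard_normal((n_head, h, h // n_head)) for _ in range(3))
Wo = rng.standard_normal((h, h))
attn = multi_head_attention(X, X, X, Wq, Wk, Wv, Wo)
out = ffn(attn, rng.standard_normal((h, f)), np.zeros(f),
          rng.standard_normal((f, h)), np.zeros(h))
print(attn.shape, out.shape)
```

Splitting the $h$-dimensional projections into $N_{head}$ slices of size $h/N_{head}$ keeps the total projection cost close to that of a single full-width head, which is why the scaling factor uses the per-head dimension rather than $h$.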

Adapter

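Adapter modules proposed in [16] are inserted on top of the MHA and FFN layers, as in Fig. 2 (left). An adapter module (Fig. 2, right) consists of two linear transforms with a ReLU activation in between, and a residual connection is added to the output. The outputs of the two adapters, $A_1$ and $A_2$, are calculated as

$$A_1(i_1) = i_1 + \max(0,\, i_1\mathbf{W}_{11} + b_{11})\,\mathbf{W}_{12} + b_{12} \tag{5}$$

$$A_2(i_2) = i_2 + \max(0,\, i_2\mathbf{W}_{21} + b_{21})\,\mathbf{W}_{22} + b_{22}, \tag{6}$$

where $i_1 = \mathrm{MultiHead}(Q, K, V)$, $i_2 = \mathrm{FFN}(i)$, the adapter filter size is $f_A$, $\mathbf{W}_{11}, \mathbf{W}_{21} \in \mathbb{R}^{h \times f_A}$, $b_{11}, b_{21} \in \mathbb{R}^{f_A}$, $\mathbf{W}_{12}, \mathbf{W}_{22} \in \mathbb{R}^{f_A \times h}$, and $b_{12}, b_{22} \in \mathbb{R}^{h}$.

Softmax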

The outputs of the decoder are transformed into the probabilities of the output classes by a linear projection $\mathbf{W} \in \mathbb{R}^{h \times N_w}$ and a subsequent softmax function.

Experiments

Table 1 shows the overall model architectures and model sizes used in the experiments. We use a general-domain LM (G-LM), a music-specialized LM (M-LM), and their adapter-added versions (G-LM-A, M-LM-A). With single-precision floating point, model sizes increase by about 2% when adapters are added for a first domain.

The G-LM is trained on 24 GiB of normalized Korean text data consisting of 353M utterances. All data were anonymized. The data consist of representative utterances of Samsung’s Bixby scenarios and a general-domain corpus. The M-LM is trained on normalized Korean text data consisting of 45M utterances, in which general and music-domain (song title and singer name related commands) corpora are mixed. To train our models, we used the Tensor2Tensor framework [23].

For the G-LM experiments, we recorded test cases (TCs) in three categories: In-Domain, Out-Domain, and Open-Domain. The In-Domain TCs include 50K Bixby use-case scenario utterances, such as phone and device control commands and daily conversational question answering. The Out-Domain TCs include 8K domain-specific utterances that are not included in the In-Domain training corpus; in particular, we selected domains that have their own unique proper nouns, such as hospital or doctor’s names. The Open-Domain TCs test noisy environments: cafe, city, office, and highway noises are added to clean speech. The content of these utterances comes from arbitrary domains and does not include unknown unique proper nouns. All TCs are recorded in male and female voices.

For the M-LM experiments, In-Domain and Out-Domain TCs are recorded. The In-Domain TCs include 610 utterances representing well-known song titles and singer names. The Out-Domain TCs include 3709 utterances whose content is newly added song titles and singer names.

We initialized the weights of each adapter layer with values drawn from a normal distribution with zero mean and $10^{-4}$ variance.
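The adapter of equations (5) and (6), together with the initialization just described, can be sketched in NumPy as follows. This is a hedged illustration rather than the authors' code: the dictionary layout, the function names, and the choice to start the biases at zero are assumptions; the sizes $h = 512$ and $f_A = 64$ and the model sizes in MiB are taken from Table 1.

```python
import numpy as np

H, F_A = 512, 64  # hidden size h and adapter filter size f_A from Table 1


def init_adapter(h, f_a, variance, rng):
    """Zero-mean normal weights with the given variance; biases start at zero (assumption)."""
    std = np.sqrt(variance)
    return {
        "W_down": std * rng.standard_normal((h, f_a)),
        "b_down": np.zeros(f_a),
        "W_up": std * rng.standard_normal((f_a, h)),
        "b_up": np.zeros(h),
    }


def adapter(i, p):
    """Eq. (5)/(6): down-project, ReLU, up-project, plus a residual connection."""
    return i + np.maximum(0.0, i @ p["W_down"] + p["b_down"]) @ p["W_up"] + p["b_up"]


def num_params(p):
    return sum(a.size for a in p.values())


rng = np.random.default_rng(0)
x = rng.standard_normal((3, H))  # three positions of a hypothetical input

p = init_adapter(H, F_A, variance=1e-4, rng=rng)
y = adapter(x, p)
print("max |adapter(x) - x| at variance 1e-4:", np.abs(y - x).max())

# With all-zero weights only the residual path remains, so the input passes through.
p0 = init_adapter(H, F_A, variance=0.0, rng=rng)
print("bypassed at zero variance:", np.array_equal(adapter(x, p0), x))

# One adapter holds 2*f_A*h + f_A + h parameters.
print("parameters per adapter:", num_params(p))

# Relative size increase from Table 1 (MiB) when adapters are added.
for name, base, with_adapters in [("G-LM", 76.4, 77.9), ("M-LM", 40.3, 41.3)]:
    print(f"{name}: +{100 * (with_adapters - base) / base:.1f}%")
```

Because the residual path dominates at small variance, a freshly initialized adapter starts close to an identity mapping, so adding it leaves the behavior of the underlying LM almost unchanged before fine-tuning begins.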
We tested variance values of $\{0, 10^{-7}, 10^{-6}, 10^{-5}, 10^{-4}, 10^{-3}, 10^{-2}\}$ and selected the largest stable value. Since an adapter module internally has a residual connection, zero variance can be used to test that the adapter module is bypassed properly.

All runs were trained on eight P40 GPUs to build models from scratch and on one P40 GPU for adaptation. We used the Adam optimizer with $\beta_1 = 0.9$, $\beta_2 = 0.98$, and $\epsilon = 10^{-9}$. Batch sizes were tested from {32, 64, 128, 512, 1024, 4096, 8192}, and 8192 was used for all our adaptation experiments. Unlike [16], a small batch size made our training unstable, and it failed to converge. The learning rate was selected as 0.03 from {0.1, 0.03, 0.001, 0.0003, 0.0001}. When we trained our models from scratch or adapted them without adapters, we applied the Noam learning rate decay scheme with 1000 warmup steps. When we trained the adapter-related layers, on the other hand, no learning rate decay scheme was used. We used 4096 word-pieces as output token units. For E2E model training, we used the same hyper-parameters as in [3]. All experiments used input feature processing identical to that of [24]. The decoding hyper-parameters (beam size, length penalty, and maximum decoding length) were tuned to minimize WER. Known proper nouns and numbers are converted with an inverse text normalization (ITN) module. We assumed that the proper domain name is already known before inference.

Table 1. The architectures and sizes of SA E2E, general LM (G-LM), music LM (M-LM), and adapter added LMs

| | E2E Enc. | E2E Dec. | G-LM | G-LM-A | M-LM | M-LM-A |
|---|---|---|---|---|---|---|
| $h$ | 512 | 512 | 512 | 512 | 512 | 512 |
| $f$ | | | | | | |
| $f_A$ | - | - | - | 64 | - | 64 |
| $N_{head}$ | 16 | 4 | 8 | 8 | 8 | 8 |
| Size (MiB) | 96.7 | 80.3 | 76.4 | 77.9 | 40.3 | 41.3 |

We first define four different model training or adaptation methods.

(1) Build from scratch: LMs are trained on the whole corpus. All layers are trained and adapters are not added.

(2) Full fine-tuning: LMs are fine-tuned on a small corpus of the target domain. All layers are tuned and adapters are not added.

(3) Adapter fine-tuning: adapters are added on top of full fine-tuned models. LMs are fine-tuned on a small corpus of the target domain. Only the adapter-related layers (adapters, norms, and the softmax linear layer) are tuned.

(4) Iterative adapter fine-tuning: an LM is adapter fine-tuned iteratively. An iteration, here, means a process of a) decoding TCs with the latest adapter fine-tuned LM, b) collecting error sentences from the results, and c) adapter fine-tuning the latest LM with the error sentences.

Our goals are to obtain 1) iterative adapter fine-tuned LMs that perform best compared to the best full fine-tuned LMs, and 2) LMs that can be extended to multiple domains by iterative adapter fine-tuning, using a common pre-trained LM. To show that the proposed models can achieve these goals, we conduct four experiments.

Results

Table 2. WERs of E2E, E2E-G-LM, and E2E-G-LM-A on General Domain TCs

| TC | E2E | E2E-G-LM | E2E-G-LM-A |
|---|---|---|---|
| In-Domain | 2.42 | 1.82 | 1.69 |
| Out-Domain | 10.62 | 8.18 | 2.84 |
| Open-Domain | 12.8 | 5.08 | 4.55 |

Table 3. WERs of E2E, E2E-M-LM, and E2E-M-LM-A on Music Domain TCs

| TC | E2E | E2E-M-LM | E2E-M-LM-A |
|---|---|---|---|
| In-Domain | 8.2 | 2.68 | 2.46 |
| Out-Domain | 12.66 | 5.43 | 4.13 |

Table 4. WERs of iterative adapter fine-tuning with M-LM-A on Music Domain TCs

| TC | E2E-M-LM | M1 iter1 | M1 iter2 | M1 iter3 |
|---|---|---|---|---|
| In-Domain | 2.68 | 2.46 | 1.97 | |
| Out-Domain | 5.43 | 4.13 | 3.96 | |

Table 2 shows WERs measured with only the E2E model (E2E), the E2E model with a full fine-tuned G-LM (E2E-G-LM), and the E2E model with an adapter fine-tuned G-LM (E2E-G-LM-A). Compared to decoding with only the E2E model, E2E-G-LM reduced WERs by 0.6, 2.44, and 7.72%p for the in-, out-, and open-domain TCs, respectively. When the full fine-tuned G-LM was additionally adapter fine-tuned (E2E-G-LM-A), the reductions relative to E2E grew to 0.73, 7.78, and 8.25%p for the in-, out-, and open-domain TCs, respectively. In particular, we obtained a larger improvement in accuracy in domains with unusual proper nouns. This means that adapter fine-tuning can bias the output probability properly for unusual proper nouns. In addition, despite this strong biasing, the accuracy on the existing domain TCs did not deteriorate.

Table 3 shows WERs measured with only the E2E model (E2E), the E2E model with a full fine-tuned M-LM (E2E-M-LM), and the E2E model with an adapter fine-tuned M-LM (E2E-M-LM-A). Using the E2E model with a full fine-tuned M-LM (E2E-M-LM) gave better WERs than decoding with the E2E model alone: the WERs of the in- and out-domain TCs were reduced by 5.52 and 7.23%p, respectively. When the full fine-tuned M-LM was additionally adapter fine-tuned (E2E-M-LM-A), WERs were further reduced by 0.22 and 1.3%p for the in- and out-domain TCs, respectively. As in the G-LM experiments, adapter fine-tuning improves proper noun recognition accuracy without compromising the accuracy of the existing domains, even for smaller models.

In Table 4, we see how far WERs can be reduced by iterative adapter fine-tuning. M1 refers to an M-LM that was adapter fine-tuned using error sentences from the E2E-M-LM decoding result as training data. We considered decoding, error sentence extraction, and re-training a model as one iteration. In the experiment, accuracy improved until the iteration had been repeated three times. Beyond that, further iterations did not improve WERs.
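The iteration described above (decode, collect error sentences, adapter fine-tune) can be sketched as a small loop. This is a toy illustration under loud assumptions: `decode`, `adapter_fine_tune`, and `wer` are hypothetical callables, the "LM" is just a set of known words, reference text stands in for recorded audio, and the early-stopping rule (stop once WER no longer improves) is an assumption added for the example.

```python
def iterative_adapter_fine_tuning(lm, test_cases, decode, adapter_fine_tune, wer, max_iters=3):
    """Repeat: a) decode TCs, b) collect error sentences, c) adapter fine-tune on them."""
    history = [wer(lm, test_cases)]
    for _ in range(max_iters):
        errors = [ref for ref in test_cases if decode(lm, ref) != ref]
        if not errors:
            break
        candidate = adapter_fine_tune(lm, errors)
        candidate_wer = wer(candidate, test_cases)
        if candidate_wer >= history[-1]:
            break  # assumption: stop once WER no longer improves
        lm = candidate
        history.append(candidate_wer)
    return lm, history


# --- Toy stand-ins (hypothetical; they only mimic the control flow) ---
def toy_decode(lm, ref):
    # Words the toy LM does not know are misrecognized as "<unk>".
    return " ".join(w if w in lm else "<unk>" for w in ref.split())


def toy_fine_tune(lm, errors, budget=2):
    # Learn at most `budget` new words per iteration from the error sentences.
    new_words = []
    for sentence in errors:
        for w in sentence.split():
            if w not in lm and w not in new_words:
                new_words.append(w)
    return lm | frozenset(new_words[:budget])


def toy_wer(lm, test_cases):
    # Substitution-only word error rate in percent.
    words = [w for s in test_cases for w in s.split()]
    return 100.0 * sum(w not in lm for w in words) / len(words)


tcs = ["play dynamite by bts", "play butter by bts", "play celebrity by iu"]
base_lm = frozenset({"play", "by"})
final_lm, history = iterative_adapter_fine_tuning(
    base_lm, tcs, toy_decode, toy_fine_tune, toy_wer, max_iters=3)
print([round(w, 2) for w in history])
```

Keeping the loop independent of the decoder and trainer through callables makes it easy to swap the toy stand-ins for a real beam-search decoder and an adapter training routine.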
Table 5 compares the case of using a G-LM as a common base LM with iterative adapter fine-tuning (E2E-G-LM-A iter) and the case of creating a dedicated M-LM and full fine-tuning or adapter fine-tuning it (E2E-M-LM, E2E-M-LM-A). As expected, the baseline that decodes music-domain TCs with the E2E model and the G-LM without any adaptation (E2E-G-LM) showed higher error rates than E2E-M-LM and E2E-M-LM-A. The last two columns of Table 5 show the word error rate reduction (WERR). When we iteratively adapter fine-tuned the G-LM three times (E2E-G-LM-A iter3), WERs were reduced by 0.49 and 0.83%p for the in- and out-domain TCs, respectively, compared to E2E-M-LM. The WERs of E2E-G-LM-A iter3 were also close to the results of E2E-M-LM-A. This means that a common G-LM with adapters can be used as a dedicated domain LM, and that we can switch only the adapter-related layers to fit our model to each domain. Therefore, a multi-domain LM configuration with the structure shown in Fig. 2 is possible.

Since $f_A$ is relatively small, the increase in the number of parameters per domain is about 2% for the first domain and about 13% for each domain from the second onward. Specifically, the increase is $d\,(2 f_A h + f_A + h)$ for the first domain, because the norm and softmax linear layers can be reused, and $d\,(2 f_A h + f_A + 3h) + (2h + h N_{wp} + N_{wp})$ for each domain from the second onward. Here $2 f_A h + f_A + h$ is the parameter count of one adapter in (5) and (6). This slow growth is important because memory size is limited for GPU or on-device applications.

We built our base LMs from scratch on eight P40 GPUs and on v3-8 tensor processing units (TPU). It took three days on eight P40 GPUs and 4 hours and 30 minutes on TPU. The iterative adapter fine-tuning proposed in this paper can train the G-LM in 60 minutes and the M-LM in 25 minutes on a single P40 GPU. Since P40 GPUs may be available in on-premise servers, we expect that cloud computing cost can be saved.

Conclusions

In this paper, an adapter-based multi-domain LM structure has been proposed. The structure is a combination of two architectures: an adapter module proposed for BERT in the NLP area and a switchable adapter architecture proposed for an RNN-T streaming ASR model. The proposed architecture allows LMs to be expanded to multiple domains while suppressing the increase in the number of parameters. It can reduce the WERs of target domains without degrading the WERs of existing domains. We also observed that applying adapter modules to a Transformer LM improves WER especially for proper nouns that are hard to handle with a common base LM. Finally, the proposed architecture can reuse standard full fine-tuned LMs, so such LMs can be easily reused (or transferred) without any changes.

Table 5. Iterative fine-tuning performance (WER). The results show a G-LM with iterative fine-tuned adapters can be used as a dedicated music LM.

| TC | E2E-M-LM | E2E-M-LM-A | E2E-G-LM | E2E-G-LM-A iter1 | E2E-G-LM-A iter2 | E2E-G-LM-A iter3 | WERR (E2E-G-LM-A iter3 - E2E-M-LM) | WERR (E2E-G-LM-A iter3 - E2E-M-LM-A) |
|---|---|---|---|---|---|---|---|---|
| In-Domain | 2.68 | 2.46 | 4.65 | 3.82 | 2.38 | 2.19 | -0.49 | -0.27 |
| Out-Domain | 5.43 | 4.13 | 11.27 | 5.75 | 4.75 | 4.60 | -0.83 | 0.47 |

References

[1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention Is All You Need,” in NIPS, 2017.

[2] L. Dong, S. Xu, and B. Xu, “Speech-transformer: a No-recurrence Sequence-to-Sequence Model for Speech Recognition,” in ICASSP, 2018.

[3] H. Kim, H. Na, H. Lee, J. Lee, T. Kang, M. Lee, and Y. Choi, “Knowledge Distillation Using Output Errors for Self-Attention End-To-End Models,” in ICASSP, 2019.

[4] K. Kwon, H. Na, H. Lee, and N. Kim, “Adaptive Knowledge Distillation Based On Entropy,” in ICASSP, 2020.

[5] C. Gulcehre, O. Firat, K. Xu, K. Cho, L. Barrault, H. C. Lin, F. Bougares, H. Schwenk, and Y. Bengio, “On Using Monolingual Corpora in Neural Machine Translation,” arXiv preprint arXiv:1503.03535, 2015.

[6] A. Kannan, Y. Wu, P. Nguyen, T. N. Sainath, Z. Chen, and R. Prabhavalkar, “An Analysis of Incorporating an External Language Model into a Sequence-to-Sequence Model,” in ICASSP, 2018.

[7] J. Devlin, M. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” arXiv preprint arXiv:1810.04805, 2018.

[8] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer, “Deep contextualized word representations,” in NAACL, 2018.

[9] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, “Language models are unsupervised multitask learners,” Accessed on: May 7 2018. [Online]. Available: https://openai.com/blog/better-language-models

[10] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, “RoBERTa: A Robustly Optimized BERT Pretraining Approach,” arXiv preprint arXiv:1907.11692, 2019.

[11] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. Salakhutdinov, and Q. V. Le, “XLNet: Generalized Autoregressive Pretraining for Language Understanding,” in NeurIPS, 2019.

[12] M. McCloskey and N. J. Cohen, “Catastrophic Interference in Connectionist Networks: The Sequential Learning Problem,” in Psychology of Learning and Motivation, 1989.

[13] Z. Li and D. Hoiem, “Learning without Forgetting,” in ECCV, 2016.

[14] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, D. Hassabis, C. Clopath, D. Kumaran, and R. Hadsell, “Overcoming catastrophic forgetting in neural networks,” in PNAS, 2017.

[15] A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, and R. Hadsell, “Progressive Neural Networks,” arXiv preprint arXiv:1606.04671, 2016.

[16] N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. de Laroussilhe, A. Gesmundo, M. Attariyan, and S. Gelly, “Parameter-Efficient Transfer Learning for NLP,” in ICML, 2019.

[17] S. A. Rebuffi, H. Bilen, and A. Vedaldi, “Efficient Parametrization of Multi-Domain Deep Neural Networks,” in CVPR, 2018.

[18] C. Stickland and I. Murray, “BERT and PALs: Projected Attention Layers for Efficient Adaptation in Multi-Task Learning,” in PMLR, 2019.

[19] S. J. Semnani, K. R. Sadagopan, and F. Tlili, “BERT-A: Fine-Tuning BERT with Adapters and Data Augmentation,” [Online]. Available: http://web.stanford.edu/class/cs224n/reports/default/15848417.pdf

[20] A. Kannan, A. Datta, T. N. Sainath, E. Weinstein, B. Ramabhadran, Y. Wu, A. Bapna, Z. Chen, and S. Lee, “Large-Scale Multilingual Speech Recognition with a Streaming End-to-End Model,” in INTERSPEECH, 2019.

[21] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, J. Klingner, A. Shah, M. Johnson, X. Liu, L. Kaiser, S. Gouws, Y. Kato, T. Kudo, H. Kazawa, K. Stevens, G. Kurian, N. Patil, W. Wang, C. Young, J. Smith, J. Riesa, A. Rudnick, O. Vinyals, G. Corrado, M. Hughes, and J. Dean, “Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation,” arXiv preprint arXiv:1609.08144, 2016.

[22] J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer Normalization,” arXiv preprint arXiv:1607.06450, 2016.

[23] A. Vaswani, S. Bengio, E. Brevdo, F. Chollet, A. N. Gomez, S. Gouws, L. Jones, L. Kaiser, N. Kalchbrenner, N. Parmar, R. Sepassi, N. Shazeer, and J. Uszkoreit, “Tensor2Tensor for Neural Machine Translation,” arXiv preprint arXiv:1803.07416, 2018.

[24] C. C. Chiu, T. N. Sainath, Y. Wu, R. Prabhavalkar, P. Nguyen, Z. Chen, A. Kannan, R. J. Weiss, K. Rao, E. Gonina, N. Jaitly, B. Li, J. Chorowski, and M. Bacchiani, “State-of-the-art Speech Recognition with Sequence-to-Sequence Models,” in