When Attention Meets Fast Recurrence: Training Language Models with Reduced Compute
Tao Lei
ASAPP, Inc. [email protected]
Abstract
Large language models have become increasingly difficult to train because of the required computation time and cost. In this work, we present SRU++, a recurrent unit with optional built-in attention that exhibits state-of-the-art modeling capacity and training efficiency. On standard language modeling benchmarks such as the ENWIK8 and WIKI-103 datasets, our model obtains better perplexity and bits-per-character (bpc) while using 2.5x-10x less training time and cost compared to top-performing Transformer models. Our results reaffirm that attention is not all we need and can be complementary to other sequential modeling modules (Merity, 2019; Gulati et al., 2020). Moreover, fast recurrence with little attention can be a leading model architecture.
Introduction

Many recent advances in language modeling have come from leveraging massive data and training models with significantly increased model size. Not surprisingly, the associated computation for training has grown enormously, requiring hundreds of GPU hours or days for an experiment. As a consequence, it has become imperative to build computationally efficient models that retain top modeling power with reduced or accelerated computation.

The Transformer architecture (Vaswani et al., 2017) was proposed to accelerate model training and has become the predominant architecture in NLP. Specifically, it is built entirely upon self-attention and avoids the use of recurrence to enable strong parallelization. While this change has led to many empirical successes and improved computational efficiency, we are interested in revisiting the architectural question: Is attention all we need for modeling?
[Figure 1: Bits-per-character (BPC) on ENWIK8 dev set vs. effective GPU hours used in training, comparing Transformer-XL, SRU++, and SRU++ with a single attention layer.]
In this work, we present SRU++, a recurrent unit with built-in self-attention that achieves strong computation efficiency. Our work builds upon the SRU (Lei et al., 2018), a highly parallelizable RNN implementation that has been shown effective in NLP and speech applications (Park et al., 2018; Shangguan et al., 2020; Lin et al., 2020). We incorporate attention into the SRU by simply replacing the linear projection of the input with a self-attention component. The proposed architecture, called SRU++, enjoys high parallelization and enhanced context modeling capacity. Figure 1 compares its performance with the Transformer-XL model (Dai et al., 2019) on the ENWIK8, WIKI-103 and BILLION WORD datasets. We use SRU++ as a replacement of the Transformer layer and stack multiple layers to construct our models. Our results demonstrate that SRU++ consistently outperforms various Transformer models, delivering on par or better results while using 2.5x-10x less computation. We open source our implementation to facilitate future research at https://github.com/asappresearch/sru.

Background: SRU

We first describe the Simple Recurrent Unit (SRU) in this section. A single layer of SRU involves the following computation:

f[t] = σ(W x[t] + v ⊙ c[t−1] + b)
r[t] = σ(W′ x[t] + v′ ⊙ c[t−1] + b′)
c[t] = f[t] ⊙ c[t−1] + (1 − f[t]) ⊙ (W″ x[t])
h[t] = r[t] ⊙ c[t] + (1 − r[t]) ⊙ x[t]

where ⊙ is the element-wise multiplication, W, W′ and W″ are parameter matrices, and v, v′, b and b′ are parameter vectors to be learnt during training. The SRU architecture consists of a light recurrence component which successively reads the input vectors x[t] and computes the sequence of states c[t] capturing sequential information. The computation resembles other recurrent networks such as LSTM and GRU, by using a forget gate f[t] to control the information flow. The state vector c[t] is determined by averaging the previous state c[t−1] and the current observation W″x[t] according to f[t]. Once the internal state c[t] is produced, SRU uses a skip connection implemented via a highway network to compute the final output state h[t]. Similarly, the information flow in the highway network is controlled by a reset gate r[t].

Two code-level optimizations are performed to enhance the parallelism and therefore the speed of SRU. First, given the input sequence X = {x[1], ..., x[L]} where each x[t] ∈ ℝ^d is a d-dimensional vector, we group the three matrix multiplications across all time steps as a single multiplication. This significantly improves the computation intensity (e.g. GPU utilization). Specifically, the batched multiplication is a linear projection of the input tensor X ∈ ℝ^{L×d}:

U^⊤ = [W; W′; W″] X^⊤,    (1)

where the three parameter matrices are stacked into a single matrix, L is the sequence length, U ∈ ℝ^{L×3×d} is the output tensor and d is the hidden state size. The second optimization performs all element-wise operations in an efficient way. This involves

f[t] = σ(U[t, 0] + v ⊙ c[t−1] + b)    (2)
r[t] = σ(U[t, 1] + v′ ⊙ c[t−1] + b′)    (3)
c[t] = f[t] ⊙ c[t−1] + (1 − f[t]) ⊙ U[t, 2]    (4)
h[t] = r[t] ⊙ c[t] + (1 − r[t]) ⊙ x[t].    (5)

Note that each dimension of the hidden vectors is independent once U is computed. As a result, these operations can be done in parallel across the hidden dimension d (and the batch size B in a mini-batch scenario). Similar to other operations such as attention and LSTM, this step is implemented as a CUDA kernel to accelerate computation.
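As a concrete but unofficial reference, here is a minimal PyTorch sketch of these two optimizations: the batched projection of Equation (1) and the element-wise recurrence of Equations (2)-(5). The names `elementwise_recurrence` and `SRULayer` are ours; the released library fuses the time loop into a CUDA kernel rather than looping in Python.

```python
import torch

def elementwise_recurrence(U, x, v, b):
    """Equations (2)-(5): the sequential part of SRU. U: (L, B, 3, d), x: (L, B, d),
    v, b: (2, d). The real implementation fuses this loop into a CUDA kernel that
    parallelizes over the hidden dimension d and the batch B."""
    L, B, d = x.shape
    c = x.new_zeros(B, d)
    hs = []
    for t in range(L):
        f = torch.sigmoid(U[t, :, 0] + v[0] * c + b[0])   # forget gate, Eq. (2)
        r = torch.sigmoid(U[t, :, 1] + v[1] * c + b[1])   # reset gate,  Eq. (3)
        c = f * c + (1 - f) * U[t, :, 2]                  # light recurrence, Eq. (4)
        hs.append(r * c + (1 - r) * x[t])                 # highway output, Eq. (5)
    return torch.stack(hs), c

class SRULayer(torch.nn.Module):
    """One SRU layer: batched projection (Eq. 1) plus element-wise recurrence."""
    def __init__(self, d):
        super().__init__()
        self.W = torch.nn.Parameter(torch.randn(3 * d, d) / d ** 0.5)  # W, W', W'' stacked
        self.v = torch.nn.Parameter(torch.zeros(2, d))                 # v, v'
        self.b = torch.nn.Parameter(torch.zeros(2, d))                 # b, b'

    def forward(self, x):                       # x: (L, B, d)
        L, B, d = x.shape
        U = (x @ self.W.t()).view(L, B, 3, d)   # Eq. (1): one matmul for all steps
        h, _ = elementwise_recurrence(U, x, self.v, self.b)
        return h
```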
SRU++

The key modification of SRU++ is to incorporate more expressive non-linear operations into the recurrent network. Note that the computation of U (Equation 1) is a linear transformation of the input sequence X. We can use other parameterized neural operators to replace this transformation. Specifically, SRU++ builds in self-attention to enhance its modeling capacity.

Given the input sequence represented as a matrix X ∈ ℝ^{L×d}, the attention component computes the query, key and value representations using the following multiplications:

Q = W^q X^⊤
K = W^k Q
V = W^v Q

where W^q ∈ ℝ^{d′×d}, W^k ∈ ℝ^{d′×d′}, W^v ∈ ℝ^{d′×d′} are model parameters, and d′ is the attention dimension (typically smaller than d). Next, a weighted average output A ∈ ℝ^{d′×L} is computed using the scaled dot-product attention introduced in Vaswani et al. (2017):

A^⊤ = softmax(Q^⊤ K / √d′) V^⊤.

Finally, the output U required by the element-wise kernel is obtained by another linear projection:

U^⊤ = W^o (Q + α · A),

where α ∈ ℝ is a learned scalar and W^o ∈ ℝ^{3d×d′} is the projection matrix applied to a residual connection (Q + α · A), which improves gradient propagation and stabilizes training. α is initialized to zero, so that U^⊤ = W^o Q = (W^o W^q) X^⊤ initially falls back to a linear mapping of the input X, skipping the attention transformation. Compared to Equation (1), the linear mapping here can be interpreted as applying a factorization trick W^o W^q with a small inner dimension d′ < d that can reduce the total number of parameters. Figure 2 compares the differences of SRU, factorized SRU and SRU++.

[Figure 2: An illustration of SRU and SRU++ networks: (a) the original SRU network, (b) the SRU variant using a projection trick to reduce the number of parameters, experimented in Lei et al. (2018), and (c) SRU++ proposed in this work. Numbers indicate the hidden size of intermediate inputs/outputs for d = 2048 and d′ = 512.]
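Below is one plausible PyTorch rendering of this attention component (single head, with causal masking and the cached context from the previous batch omitted). Shapes follow the equations above, and all names are ours rather than the official API.

```python
import math
import torch

class SRUppAttention(torch.nn.Module):
    """Sketch of the SRU++ attention that replaces Equation (1)."""
    def __init__(self, d, d_prime):
        super().__init__()
        self.W_q = torch.nn.Linear(d, d_prime, bias=False)
        self.W_k = torch.nn.Linear(d_prime, d_prime, bias=False)
        self.W_v = torch.nn.Linear(d_prime, d_prime, bias=False)
        self.W_o = torch.nn.Linear(d_prime, 3 * d, bias=False)
        self.alpha = torch.nn.Parameter(torch.zeros(1))  # starts as a pure linear mapping
        self.d_prime = d_prime

    def forward(self, x):
        # x: (L, B, d) -> U: (L, B, 3, d)
        L, B, d = x.shape
        q = self.W_q(x)                               # (L, B, d')
        k = self.W_k(q)
        v = self.W_v(q)
        scores = torch.einsum("lbe,mbe->blm", q, k) / math.sqrt(self.d_prime)
        attn = torch.softmax(scores, dim=-1)          # (B, L, L)
        a = torch.einsum("blm,mbe->lbe", attn, v)     # (L, B, d')
        u = self.W_o(q + self.alpha * a)              # residual connection, U = W_o(Q + alpha*A)
        return u.view(L, B, 3, d)
```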
Layer normalization

We also experiment with adding layer normalization (Ba et al., 2016) to each SRU++ layer. In our implementation, we apply normalization after the attention operation and before the matrix multiplication with W^o:

U^⊤ = W^o layernorm(Q + α · A).

Another option is to apply normalization over the hidden states h[t] once they are produced. Applying either one or both normalizations works well. Note that these variants are post-layer normalizations in which the normalization happens after the skip connection is added. In contrast, pre-layer normalization (Xiong et al., 2020) is applied within each non-linear layer. We use post normalization for better results, following the empirical observations in Liu et al. (2020b).
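In code this variant is a one-line change to the attention sketch above, normalizing the residual branch before the output projection (the `norm` attribute is hypothetical):

```python
# In SRUppAttention.__init__ (sketch): self.norm = torch.nn.LayerNorm(d_prime)
# In forward, normalize the residual branch before the output projection:
u = self.W_o(self.norm(q + self.alpha * a))   # post-LN variant of U = W_o(Q + alpha*A)
```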
Experimental setup

Datasets

We evaluate our model on three language model benchmarks.

• ENWIK8 (Hutter, 2012) is a character-level language modeling dataset consisting of 100M tokens taken from Wikipedia articles.

• WIKI-103 (Merity et al., 2016) is a word-level language modeling dataset. The training data contains 100 million tokens extracted from Wikipedia articles. Following prior work, we use a vocabulary that has about 260K tokens, and adaptive embedding and softmax layers (Grave et al., 2017; Baevski and Auli, 2019); see the sketch after this list.

• BILLION WORD (Chelba et al., 2013) is a much larger dataset containing 768 million word tokens for training. Unlike WIKI-103, in which sentences in the same article are treated as consecutive inputs to model long context, the sentences in BILLION WORD are randomly shuffled. We follow Baevski and Auli (2019) and use adaptive embedding and softmax layers for this dataset as well. The vocabulary size is about 800K, the same as prior work.
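One way to realize the adaptive softmax layer cited above is PyTorch's built-in nn.AdaptiveLogSoftmaxWithLoss; the cutoffs below are illustrative, not the paper's exact configuration.

```python
import torch

vocab_size, d = 260_000, 3072  # WIKI-103-like sizes, used for illustration only
# Frequent words get the full dimension; rarer clusters get progressively smaller ones.
adaptive_softmax = torch.nn.AdaptiveLogSoftmaxWithLoss(
    in_features=d,
    n_classes=vocab_size,
    cutoffs=[20_000, 60_000, 180_000],   # illustrative cluster boundaries
    div_value=4.0,                       # dimension shrink factor per cluster
)

hidden = torch.randn(128, d)             # e.g. 128 token positions of model output
targets = torch.randint(0, vocab_size, (128,))
out = adaptive_softmax(hidden, targets)  # out.loss is the training NLL
```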
Model
All our language models are constructed with a word embedding layer, multiple layers of SRU++ and an output linear layer followed by a softmax operation. We set the hidden dimensions d = 4 × d′, set the number of SRU++ layers to 10, and use single-head attention in each layer. We use the same dropout probability for all layers and tune this value according to the model size and the results on the dev set. A sketch of the overall stack follows.

For simplicity, we do not use additional techniques that are shown useful for language models such as compressed memory (Rae et al., 2020), nearest-neighbor interpolation (Khandelwal et al., 2020), relative position (Shaw et al., 2018; Press et al., 2020) and attention variants to handle longer context (Sukhbaatar et al., 2019a; Roy et al., 2020).
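As a rough sketch of this stack (reusing the hypothetical `SRUppAttention` and `elementwise_recurrence` helpers from the earlier sketches; embedding and output layers omitted), attention can be enabled only in every k-th layer, matching the analysis later in the results section:

```python
import torch

class SRUppBlock(torch.nn.Module):
    """One layer: U comes from attention or a factorized linear map
    (Figure 2(b)), followed by the element-wise recurrence."""
    def __init__(self, d, d_prime, use_attention):
        super().__init__()
        self.project = (
            SRUppAttention(d, d_prime) if use_attention else
            torch.nn.Sequential(                      # projection trick, no attention
                torch.nn.Linear(d, d_prime, bias=False),
                torch.nn.Linear(d_prime, 3 * d, bias=False),
            )
        )
        self.v = torch.nn.Parameter(torch.zeros(2, d))
        self.b = torch.nn.Parameter(torch.zeros(2, d))

    def forward(self, x):                             # x: (L, B, d)
        L, B, d = x.shape
        U = self.project(x).reshape(L, B, 3, d)
        h, _ = elementwise_recurrence(U, x, self.v, self.b)
        return h

def build_sru_pp_stack(d=2048, d_prime=512, n_layers=10, k=1):
    # k = 1: attention in every layer; k = 10: only the last layer attends.
    return torch.nn.Sequential(*[
        SRUppBlock(d, d_prime, use_attention=((i + 1) % k == 0))
        for i in range(n_layers)
    ])
```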
Optimization

We use RAdam (Liu et al., 2020a), a variant of the Adam optimizer (Kingma and Ba, 2014), for training. RAdam is reported to be less sensitive to the choice of learning rate and warmup steps, while achieving similar results at the end. We use the default β values and a fixed weight decay of 0.1 for all experiments. We use a cosine learning rate schedule following Dai et al. (2019) and an initial learning rate of 0.0003. We do not change the learning rate unless otherwise specified. See Appendix A for the detailed training configuration of each model.

For the ENWIK8 and WIKI-103 datasets, the training data is partitioned into B chunks by concatenating articles and ignoring the boundaries between articles. Each training batch contains M consecutive tokens from each chunk (i.e. the unroll size), which gives an effective size of B × M tokens per batch. Following previous work, the previous training batch is provided as additional context for attention, which results in a maximum attention length of 2M. For the BILLION WORD dataset, we follow Dai et al. (2019) and concatenate sentences to create the training batches (instead of adding pad tokens). Sentences are randomly shuffled and separated by a special token indicating sentence boundaries.
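A sketch of this batching scheme, with a hypothetical helper name (the released code may differ): the corpus is split into B parallel streams, each batch is an (M, B) slice of consecutive tokens, and the previous batch serves as extra attention context.

```python
import torch

def make_lm_batches(token_ids: torch.Tensor, B: int, M: int):
    """Split a 1-D tensor of token ids into B chunks, then yield (M, B)
    blocks of consecutive tokens (the unroll size), as described above."""
    n = token_ids.numel() // B
    streams = token_ids[: n * B].view(B, n).t().contiguous()  # (n, B)
    prev = None
    for start in range(0, n - 1, M):
        x = streams[start : start + M]          # inputs,  (<=M, B)
        y = streams[start + 1 : start + M + 1]  # next-token targets
        x = x[: y.size(0)]
        # `prev` would be passed to attention as extra context (memory),
        # giving a maximum attention length of 2M.
        yield x, y, prev
        prev = x
```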
Results

Comparison with Transformer-XL
We first conduct a comparison with the PyTorch implementation of Transformer-XL (Dai et al., 2019) on ENWIK8 (implementation: https://github.com/kimiyoung/transformer-xl/tree/master/pytorch). Their base model consists of 41M parameters and 12 Transformer layers. Following the official instructions, we reproduced the reported test BPC of 1.06 by training with 4 Nvidia 2080 Ti GPUs. The training took about 4 days, or 360 total GPU hours equivalently.

We train a 10-layer SRU++ model with 42M parameters. For a fair comparison, we use the same hyperparameter setting, including the batch size, unroll size, attention context length, learning rate and training iterations, as the Transformer-XL base model. Notably, our base model can be trained using 2 GPUs due to less memory usage. Table 1 presents the results. After training, we set the attention context length to 2048 for testing, similarly to the Transformer-XL baseline. Our model achieves a test BPC of 1.03, outperforming the baseline by 0.03. By extending the training attention context / unroll size from 512 to 768, we further obtain a test BPC of 1.02.

Model      Tok/batch   BPC ↓   GPU hrs ↓
Trans-XL   24 × 512    1.06    356
SRU++      24 × 512    1.03    94
SRU++      16 × 768    1.02    41†

Table 1: Comparison between SRU++ and Transformer-XL on ENWIK8. † indicates training using automatic mixed precision and distributed data parallel in PyTorch.

How much attention is needed?
Merity (2019) demonstrated that using a single attention layer with an LSTM retains most of the modeling capacity compared to using multiple attention layers. We conduct a similar analysis to understand how much attention is needed in conjunction with the simple fast recurrence. To do so, we only enable the attention operation every k layers. The other layers without attention become the SRU variant with dimension projection illustrated in Figure 2(b). Note that k = 1 means the default SRU++ model with attention in each layer, and k = 10 means only the last layer has attention in a 10-layer SRU++ model.

Table 2 presents the results of varying k = 1, 2, 5, 10. Our baseline model is a 10-layer model with 42M parameters. We see that using 50% less attention (k = 2) achieves almost no increase in test BPC. Using only one attention module (k = 10) leads to a loss of 0.01 BPC, but reduces the training time by 40%. Our results still outperform the Transformer-XL base model and single-headed attention LSTM (Merity, 2019) by 0.03 BPC.

Model          Param   BPC ↓    GPU hrs ↓
Trans-XL       41M     1.06     356
SHA-LSTM       54M     1.07     28†
k = 1          42M     1.022    41†
k = 2          42M     1.025    33†
k = 5          41M     1.032    27†
k = 10         42M     1.033    25†
No attention   42M     1.190    25†

Table 2: Analyzing SRU++ on ENWIK8 by enabling attention every k layers. We slightly increase the hidden size for k > 1 so the number of parameters is comparable. † indicates mixed precision training.

Figure 3 showcases the training efficiency of our model. In comparison with Transformer-XL, SRU++ is 5x more efficient to reach a dev BPC of 1.09. Furthermore, using automatic mixed precision training and only one attention sub-layer (k = 10) achieves a 14x reduction in training time and cost.

[Figure 3: Dev BPC vs. total GPU hours used on ENWIK8 during training, comparing Transformer-XL, SRU++, and SRU++ (k = 10, mixed precision).]

ENWIK8

Table 3 compares our model with other top-performing models on the ENWIK8 dataset. We train a base model with d = 3072 and a large model with d = 4096. The unroll size and attention context length are set to 1024 during training and 3072 during evaluation. To compare the computation efficiency we report the effective GPU days – the number of GPUs multiplied by the number of days needed to finish training. Our base model reaches a test BPC of 0.97 using 400K training steps and 7 GPU days (specifically, about 20 hours when using 8 V100 GPUs), significantly outperforming previously reported numbers. Furthermore, our big model achieves a test BPC of 0.96, on par with the state-of-the-art results. Interestingly, we found that turning off layer normalization achieves better generalization on the evaluation sets. We present an additional analysis for layer normalization later in this section.

Model                                                  Parameters ↓   Test BPC ↓   GPU days ↓
Longformer 30L (Beltagy et al., 2020)                  102M           0.99         104†
All-attention network 36L (Sukhbaatar et al., 2019b)   114M           0.98         64
Transformer-XL 24L (Dai et al., 2019)                  277M           0.99         -
+ Compressive memory (Rae et al., 2020)                -              0.97         -
SRU++ Base                                             108M           0.97         7†
SRU++ Base (k = 5)                                     98M            0.98         5†
SRU++ Base (w/o layernorm)                             108M           0.96         14†
SRU++ Large                                            191M           0.96         15†

Table 3: Comparison between models on ENWIK8 dataset. We include the training cost (measured by the number of GPUs used × the number of days) if it is reported in the previous work. Our results are obtained using an AWS p3dn instance with 8 V100 GPUs. † indicates mixed precision training.

WIKI-103
Table 4 presents the results of SRU++ models and other top results on the WIKI-103 dataset. We train one base model with d = 3072 and 148M parameters, and a large model with d = 4096 and 232M parameters. As shown in the table, our base model obtains a test perplexity of 18.4 using 8 GPU days of training, which is about a 3x reduction compared to the Transformer model in Baevski and Auli (2019) and over a 10x reduction compared to the Feedback Transformer (Fan et al., 2020). Similar to previous work on language modeling, we can obtain better perplexity by increasing the model size and hidden dimension. Our big model achieves a test perplexity of 17.7 using 16 GPU days.

Model                                                  Parameters ↓   Test PPL ↓   GPU days ↓
All-attention network 36L (Sukhbaatar et al., 2019b)   133M           20.6         -
Feedback Transformer (Fan et al., 2020)                139M           18.2         214
Transformer (Baevski and Auli, 2019)                   247M           18.7         22†
Transformer-XL 18L (Dai et al., 2019)                  257M           18.3         -
+ Compressive memory (Rae et al., 2020)                -              17.1         -
SRU++ Base                                             148M           18.4         8†
SRU++ Large                                            232M           17.7         16†
SRU++ Large (k = 5)                                    215M           17.8         12†

Table 4: Comparison between models on WIKI-103 dataset. We include the training cost (measured by the number of GPUs used × the number of days) if it is reported in the previous work. Our results are obtained using an AWS p3dn instance with 8 V100 GPUs. † indicates mixed precision training.

BILLION WORD
Finally, we train a 10-layer SRU++ model on the BILLION WORD dataset. Our model adopts the adaptive word embedding and softmax output layers as described in Baevski and Auli (2019). Similar to this work, we use an effective batch size of 65K tokens per gradient update. We double our training iterations to 800K and use a learning rate of 0.0002 compared to the setting used for WIKI-103 and ENWIK8. We use d = 4096 and d′ = 1024. Due to the expensive training cost for this dataset, we only compare with the base model of Baevski and Auli (2019).* Table 5 presents the test perplexity and associated training cost (as measured by GPU days needed). Our model achieves a test perplexity of 25.1 using about 7.4 × 8 = 59 GPU days. In comparison, the Transformer base model in Baevski and Auli (2019) costs 2.5x more for a perplexity of 25.2 and used 32 or 64 V100 GPUs for the experiments.

Model         Param   PPL ↓   GPU days ↓
Transformer   331M    25.6    57†
SRU++         328M    25.1    59†

Table 5: Comparison between SRU++ and the Transformer model of Baevski and Auli (2019) on BILLION WORD dataset.

* A single experiment run costs over $5,000 using an AWS p3dn on-demand instance.
Inference speed

Recurrent networks enjoy a computational benefit at inference time, because the associated computation per decoding step stays constant rather than growing with the generated context length. Table 6 compares the inference speed of SRU++ with other top-performing models on the WIKI-103 test set. We use a single V100 GPU for inference. Our large model runs at least 4.5x faster than all baseline models except Shortformer (Press et al., 2020). In addition, our model achieves 0.4-0.5 perplexity lower than Shortformer and runs 50% faster by using attention every 5 layers.

Model                        Speed ↑   PPL ↓
kNN-LM (Khandelwal et al.)   145       15.8
Trans (Baevski and Auli)     2.5k      18.7
Trans-XL (Dai et al.)        3.2k      18.3
Shortformer (Press et al.)   15k       18.2
SRU++ Large                  15k       17.7
SRU++ Large (k = 5)          22k       17.8

Table 6: Inference speed (tokens per second) on WIKI-103 test set. Results of baseline models are taken from Press et al. (2020). We use a batch size of 1 and maximum attention length 2560.
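To illustrate this benefit, one decoding step of the recurrence only needs the previous state c and does constant work per token. The sketch below is built on the earlier `SRULayer` fields; it is not the library's actual incremental API.

```python
import torch

@torch.no_grad()
def sru_decode_step(layer, x_t, c):
    """One decoding step of the SRU recurrence: O(d) work per step,
    independent of how many tokens were generated before."""
    u = (x_t @ layer.W.t()).view(3, -1)            # project a single token
    f = torch.sigmoid(u[0] + layer.v[0] * c + layer.b[0])
    r = torch.sigmoid(u[1] + layer.v[1] * c + layer.b[1])
    c = f * c + (1 - f) * u[2]
    h = r * c + (1 - r) * x_t
    return h, c                                    # carry c to the next step
```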
The effectiveness of layer normalization
In our experiments, we have always observed that layer normalization stabilizes training. For large models, however, layer normalization can lead to worse generalization on the evaluation sets, even if we tune dropout carefully. On the other hand, turning off layer normalization can achieve better dev and test results but makes training sensitive to learning rate and initialization. For example, we have to use a smaller learning rate of 0.00025 or lower to avoid sudden gradient explosion. Figure 4 showcases our empirical observation on the ENWIK8 dataset.

[Figure 4: Understanding the empirical effect of layer normalization. We show the training and dev loss of the small (41M) and big (108M) SRU++ models on ENWIK8, with and without layer normalization, over 400K training steps.]
Conclusion

We present a recurrent architecture with optional built-in attention and evaluate its effectiveness on various language modeling datasets. We demonstrate that highly expressive and efficient neural models can be designed using more than just attention. In fact, by incorporating fast recurrent networks, very little attention computation is needed to achieve both top-performing modeling results and training speed. As future work, we believe that self-attentive recurrent models can be further improved using better attention implementations, normalization and optimization techniques.
Acknowledgement
We thank Hugh Perkins, Joshua Shapiro, Sam Bowman and Danqi Chen for providing invaluable feedback for this work. In addition, we thank Jeremy Wohlwend, Jing Pan, Prashant Sridhar and Kyu Han for helpful discussions, and the ASAPP Language Technology and Infra teams for the compute cluster setup for our research experiments. Finally, we'd like to thank Tao Ma, Will Robinson and Gustavo Sapoznik for their support of this work and our research work at ASAPP in general.
References
Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450.

Alexei Baevski and Michael Auli. 2019. Adaptive input representations for neural language modeling. In International Conference on Learning Representations.

Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020. Longformer: The long-document transformer. arXiv:2004.05150.

James Bradbury, Stephen Merity, Caiming Xiong, and Richard Socher. 2017. Quasi-recurrent neural networks. In International Conference on Learning Representations (ICLR).

Andrew Brock, Soham De, and Samuel L. Smith. 2021. Characterizing signal propagation to close the performance gap in unnormalized resnets. In International Conference on Learning Representations.

Víctor Campos, Brendan Jou, Xavier Giró i Nieto, Jordi Torres, and Shih-Fu Chang. 2018. Skip RNN: Learning to skip state updates in recurrent neural networks. In ICLR.

Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. 2013. One billion word benchmark for measuring progress in statistical language modeling. Technical report, Google.

Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc Le, and Ruslan Salakhutdinov. 2019. Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2978–2988. Association for Computational Linguistics.

Angela Fan, Thibaut Lavril, Edouard Grave, Armand Joulin, and Sainbayar Sukhbaatar. 2020. Accessing higher-level representations in sequential transformers with feedback memory. arXiv preprint arXiv:2002.09402.

Edouard Grave, Armand Joulin, Moustapha Cissé, Hervé Jégou, et al. 2017. Efficient softmax approximation for GPUs. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 1302–1310. JMLR.org.

Anmol Gulati, Chung-Cheng Chiu, James Qin, Jiahui Yu, Niki Parmar, Ruoming Pang, Shibo Wang, Wei Han, Yonghui Wu, Yu Zhang, and Zhengdong Zhang. 2020. Conformer: Convolution-augmented Transformer for Speech Recognition.

Elad Hoffer, Itay Hubara, and Daniel Soudry. 2017. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. In Advances in Neural Information Processing Systems.

Zhiheng Huang, Peng Xu, Davis Liang, Ajay Mishra, and Bing Xiang. 2020. TRANS-BLSTM: Transformer with bidirectional LSTM for language understanding.

Marcus Hutter. 2012. The human knowledge compression contest. URL http://prize.hutter1.net, 6.

Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. 2020. Generalization through memorization: Nearest neighbor language models. In International Conference on Learning Representations.

Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations.

Tao Lei, Yu Zhang, Sida I. Wang, Hui Dai, and Yoav Artzi. 2018. Simple recurrent units for highly parallelizable recurrence. In Empirical Methods in Natural Language Processing (EMNLP).

Alexander Lin, Jeremy Wohlwend, Howard Chen, and Tao Lei. 2020. Autoregressive knowledge distillation through imitation learning. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics.

Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, and Jiawei Han. 2020a. On the variance of the adaptive learning rate and beyond. In Proceedings of the Eighth International Conference on Learning Representations (ICLR 2020).

Liyuan Liu, Xiaodong Liu, Jianfeng Gao, Weizhu Chen, and Jiawei Han. 2020b. Understanding the difficulty of training transformers. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics.

Stephen Merity. 2019. Single headed attention RNN: Stop thinking with your head. arXiv preprint arXiv:1911.11423.

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2016. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843.

Jinhwan Park, Yoonho Boo, Iksoo Choi, Sungho Shin, and Wonyong Sung. 2018. Fully neural network based speech recognition on mobile and embedded devices. In NeurIPS, pages 10620–10630.

Ofir Press, Noah A. Smith, and Mike Lewis. 2020. Shortformer: Better language modeling using shorter inputs. arXiv preprint arXiv:2012.15832.

Jack W. Rae, Anna Potapenko, Siddhant M. Jayakumar, Chloe Hillier, and Timothy P. Lillicrap. 2020. Compressive transformers for long-range sequence modelling. In International Conference on Learning Representations.

Aurko Roy, Mohammad Saffar, Ashish Vaswani, and David Grangier. 2020. Efficient content-based sparse attention with routing transformers. arXiv preprint arXiv:2003.05997.

Yuan Shangguan, Jian Li, Qiao Liang, Raziel Alvarez, and Ian McGraw. 2020. Optimizing speech recognition for the edge.

Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. 2018. Self-attention with relative position representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers). Association for Computational Linguistics.

Sheng Shen, Zhewei Yao, Amir Gholami, Michael Mahoney, and Kurt Keutzer. 2020. PowerNorm: Rethinking batch normalization in transformers. In ICML.

Sainbayar Sukhbaatar, Edouard Grave, Piotr Bojanowski, and Armand Joulin. 2019a. Adaptive attention span in transformers. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics.

Sainbayar Sukhbaatar, Edouard Grave, Guillaume Lample, Herve Jegou, and Armand Joulin. 2019b. Augmenting self-attention with persistent memory.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems.

Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, and Tieyan Liu. 2020. On layer normalization in the transformer architecture. In Proceedings of the 37th International Conference on Machine Learning, Proceedings of Machine Learning Research. PMLR.

Jingjing Xu, Xu Sun, Zhiyuan Zhang, Guangxiang Zhao, and Junyang Lin. 2019. Understanding and improving layer normalization. In Advances in Neural Information Processing Systems.

Biao Zhang and Rico Sennrich. 2019. A lightweight recurrent network for sequence modeling. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics.
Appendix A: Training details
We use the RAdam optimizer with the default hyperparameters β₁ = 0.9 and β₂ = 0.999 for all our experiments.* We use a cosine learning rate schedule with only 1 cycle for simplicity. For faster training, we also leverage the native automatic mixed precision (AMP) training and distributed data parallel (DDP) of PyTorch in all experiments, except those in Table 1 and Figure 1 for a fair comparison with the Transformer-XL implementation.

Table 7 shows the detailed training configuration of SRU++ models on ENWIK8. For the large model, we scale the learning rate by √2 following the suggestion in Hoffer et al. (2017), which results in a rounded learning rate of 0.0004. The model without layer normalization uses the smaller learning rate 0.0003, as large learning rates can lead to gradient explosion. We train a model without layer normalization to showcase that it can achieve state-of-the-art results, following our analysis in the result section. We use an increased number of training steps of 600K for this experiment.

Table 8 presents the detailed training configuration on the WIKI-103 dataset. Similarly, we use d = 3072 and d = 4096 for the base and large model respectively, and fix d : d′ = 4 : 1. Following Baevski and Auli (2019), we use an adaptive word embedding layer and an adaptive softmax layer for our models, and we tie the weight matrices of the two layers.

* https://github.com/LiyuanLucasLiu/RAdam

                                  Base model (k = 5)   Base model   Base model (w/o layernorm)   Large model
Attention / unroll size - train   1024                 1024         1024                         1024
Attention / unroll size - test    3072                 3072         3072                         3072
Batch size × Num of GPUs          4 × 8                –            –                            –
d                                 3072                 3072         3072                         4096
d′                                768                  768          768                          1024
Learning rate                     0.0003               0.0003       0.0003                       0.0004
LR warmup steps                   16K                  16K          16K                          16K
Training steps                    400K                 400K         600K                         400K
Weight decay                      0.1                  0.1          0.1                          0.1
Model size                        98M                  108M         108M                         191M
Dev BPC                           1.002                0.997        0.983                        0.985
Test BPC                          0.980                0.974        0.961                        0.963

Table 7: Training details of SRU++ models on ENWIK8.

                                  Base model   Large model (k = 5)   Large model
Attention / unroll size - train   768          1024                  1024
Attention / unroll size - test    2560         2560                  2560
Batch size × Num of GPUs          8 × 8        –                     –
d                                 3072         4096                  4096
d′                                768          1024                  1024
Learning rate                     0.0003       0.0003                0.0003
LR warmup steps                   16K          16K                   16K
Training steps                    400K         400K                  400K
Weight decay                      0.1          0.1                   0.1
Model size                        148M         215M                  232M
Dev PPL                           17.7         17.2                  17.0
Test PPL                          18.4         17.8                  17.7

Table 8: Training details of SRU++ models on WIKI-103.
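For reference, the optimization setup above could be sketched in PyTorch as follows. torch.optim.RAdam stands in for the linked RAdam implementation; the model and loss are placeholders, and only the warmup steps, peak learning rate, betas and weight decay follow the tables above.

```python
import math
import torch

model = torch.nn.Linear(8, 8)  # placeholder for the SRU++ language model

optimizer = torch.optim.RAdam(
    model.parameters(), lr=3e-4, betas=(0.9, 0.999), weight_decay=0.1
)

warmup_steps, total_steps = 16_000, 400_000

def warmup_then_cosine(step):
    # Linear warmup for 16K steps, then a single cosine decay cycle.
    if step < warmup_steps:
        return step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, warmup_then_cosine)
scaler = torch.cuda.amp.GradScaler(enabled=torch.cuda.is_available())  # AMP, as in the paper

# One illustrative training step under AMP (real data loading omitted):
x, y = torch.randn(4, 8), torch.randn(4, 8)
with torch.cuda.amp.autocast(enabled=torch.cuda.is_available()):
    loss = torch.nn.functional.mse_loss(model(x), y)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
scheduler.step()
```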