When Attention Meets Fast Recurrence: Training Language Models with Reduced Compute
Tao Lei
ASAPP, Inc. [email protected]
Abstract
Large language models have become increasingly difficult to train because of the required computation time and cost. In this work, we present SRU++, a recurrent unit with optional built-in attention that exhibits state-of-the-art modeling capacity and training efficiency. On standard language modeling benchmarks such as the ENWIK8 and WIKI-103 datasets, our model obtains better perplexity and bits-per-character (bpc) while using 2.5x-10x less training time and cost compared to top-performing Transformer models. Our results reaffirm that attention is not all we need and can be complementary to other sequential modeling modules (Merity, 2019; Gulati et al., 2020). Moreover, fast recurrence with little attention can be a leading model architecture.
Introduction

Many recent advances in language modeling have come from leveraging massive data and training models with significantly increased model size. Not surprisingly, the associated computation for training has grown enormously, requiring hundreds of GPU hours or days for an experiment. As a consequence, it has become imperative to build computationally efficient models that retain top modeling power with reduced or accelerated computation.

The Transformer architecture (Vaswani et al., 2017) was proposed to accelerate model training and has become the predominant architecture in NLP. Specifically, it is built entirely upon self-attention and avoids the use of recurrence to enable strong parallelization. While this change has led to many empirical successes and improved computational efficiency, we are interested in revisiting the architectural question: Is attention all we need for modeling?
[Figure 1: Bits-per-character (BPC) on ENWIK8 dev set vs. effective GPU hours used in training, comparing Transformer-XL, SRU++, and SRU++ with a single attention layer.]
In this work, we present SRU++, a recurrent unit with built-in self-attention that achieves strong computation efficiency. Our work builds upon the SRU (Lei et al., 2018), a highly parallelizable RNN implementation that has been shown effective in NLP and speech applications (Park et al., 2018; Shangguan et al., 2020; Lin et al., 2020). We incorporate attention into the SRU by simply replacing the linear projection of the input with a self-attention component. The proposed architecture, called SRU++, enjoys high parallelization and enhanced context modeling capacity. Figure 1 compares its performance with the Transformer-XL model (Dai et al., 2019) on the ENWIK8, WIKI-103 and BILLION WORD datasets. We use SRU++ as a replacement of the Transformer layer and stack multiple layers to construct our models. Our results demonstrate that SRU++ consistently outperforms various Transformer models, delivering on par or better results while using 2.5x-10x less computation. We open source our implementation to facilitate future research at https://github.com/asappresearch/sru.

Background: SRU

We first describe the Simple Recurrent Unit (SRU) in this section. A single layer of SRU involves the following computation:

f[t] = σ(W x[t] + v ⊙ c[t−1] + b)
r[t] = σ(W′ x[t] + v′ ⊙ c[t−1] + b′)
c[t] = f[t] ⊙ c[t−1] + (1 − f[t]) ⊙ (W″ x[t])
h[t] = r[t] ⊙ c[t] + (1 − r[t]) ⊙ x[t]

where ⊙ is the element-wise multiplication, W, W′ and W″ are parameter matrices, and v, v′, b and b′ are parameter vectors to be learnt during training. The SRU architecture consists of a light recurrence component which successively reads the input vectors x[t] and computes the sequence of states c[t] capturing sequential information. The computation resembles other recurrent networks such as LSTM and GRU, by using a forget gate f[t] to control the information flow. The state vector c[t] is determined by averaging the previous state c[t−1] and the current observation W″x[t] according to f[t]. Once the internal state c[t] is produced, SRU uses a skip connection implemented via a highway network to compute the final output state h[t]. Similarly, the information flow in the highway network is controlled by a reset gate r[t].

Two code-level optimizations are performed to enhance the parallelism and therefore the speed of SRU. First, given the input sequence X = {x[1], ..., x[L]} where each x[t] ∈ ℝ^d is a d-dimensional vector, we group the three matrix multiplications across all time steps as a single multiplication. This significantly improves the computation intensity (e.g. GPU utilization). Specifically, the batched multiplication is a linear projection of the input tensor X ∈ ℝ^{L×d}:

U^⊤ = [W; W′; W″] X^⊤,    (1)

where the three parameter matrices are stacked into a single matrix, L is the sequence length, U ∈ ℝ^{L×3×d} is the output tensor and d is the hidden state size. The second optimization performs all element-wise operations in an efficient way. This involves

f[t] = σ(U[t, 0] + v ⊙ c[t−1] + b)    (2)
r[t] = σ(U[t, 1] + v′ ⊙ c[t−1] + b′)    (3)
c[t] = f[t] ⊙ c[t−1] + (1 − f[t]) ⊙ U[t, 2]    (4)
h[t] = r[t] ⊙ c[t] + (1 − r[t]) ⊙ x[t].    (5)

Note that each dimension of the hidden vectors is independent once U is computed. As a result, these operations can be done in parallel across the hidden dimension d (and the batch size B in a mini-batch scenario). Similar to other operations such as attention and LSTM, this step is implemented as a CUDA kernel to accelerate computation.
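As a concrete but unofficial reference, here is a minimal PyTorch sketch of these two optimizations: the batched projection of Equation (1) and the element-wise recurrence of Equations (2)-(5). The names `elementwise_recurrence` and `SRULayer` are ours; the released library fuses the time loop into a CUDA kernel rather than looping in Python.

```python
import torch

def elementwise_recurrence(U, x, v, b):
    """Equations (2)-(5): the sequential part of SRU. U: (L, B, 3, d), x: (L, B, d),
    v, b: (2, d). The real implementation fuses this loop into a CUDA kernel that
    parallelizes over the hidden dimension d and the batch B."""
    L, B, d = x.shape
    c = x.new_zeros(B, d)
    hs = []
    for t in range(L):
        f = torch.sigmoid(U[t, :, 0] + v[0] * c + b[0])   # forget gate, Eq. (2)
        r = torch.sigmoid(U[t, :, 1] + v[1] * c + b[1])   # reset gate,  Eq. (3)
        c = f * c + (1 - f) * U[t, :, 2]                  # light recurrence, Eq. (4)
        hs.append(r * c + (1 - r) * x[t])                 # highway output, Eq. (5)
    return torch.stack(hs), c

class SRULayer(torch.nn.Module):
    """One SRU layer: batched projection (Eq. 1) plus element-wise recurrence."""
    def __init__(self, d):
        super().__init__()
        self.W = torch.nn.Parameter(torch.randn(3 * d, d) / d ** 0.5)  # W, W', W'' stacked
        self.v = torch.nn.Parameter(torch.zeros(2, d))                 # v, v'
        self.b = torch.nn.Parameter(torch.zeros(2, d))                 # b, b'

    def forward(self, x):                       # x: (L, B, d)
        L, B, d = x.shape
        U = (x @ self.W.t()).view(L, B, 3, d)   # Eq. (1): one matmul for all steps
        h, _ = elementwise_recurrence(U, x, self.v, self.b)
        return h
```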
SRU++

The key modification of SRU++ is to incorporate more expressive non-linear operations into the recurrent network. Note that the computation of U (Equation 1) is a linear transformation of the input sequence X. We can use other parameterized neural operators to replace this transformation. Specifically, SRU++ builds in self-attention to enhance its modeling capacity.

Given the input sequence represented as a matrix X ∈ ℝ^{L×d}, the attention component computes the query, key and value representations using the following multiplications:

Q = W^q X^⊤
K = W^k Q
V = W^v Q

where W^q ∈ ℝ^{d′×d}, W^k ∈ ℝ^{d′×d′}, W^v ∈ ℝ^{d′×d′} are model parameters, and d′ is the attention dimension (typically smaller than d). Next, a weighted average output A ∈ ℝ^{d′×L} is computed using the scaled dot-product attention introduced in Vaswani et al. (2017):

A^⊤ = softmax(Q^⊤ K / √d′) V^⊤.

Finally, the output U required by the element-wise kernel is obtained by another linear projection:

U^⊤ = W^o (Q + α · A),

where α ∈ ℝ is a learned scalar and W^o ∈ ℝ^{3d×d′} is the projection matrix applied to a residual connection (Q + α · A), which improves gradient propagation and stabilizes training. α is initialized to zero, so that U^⊤ = W^o Q = (W^o W^q) X^⊤ initially falls back to a linear mapping of the input X, skipping the attention transformation. Compared to Equation (1), the linear mapping here can be interpreted as applying a factorization trick W^o W^q with a small inner dimension d′ < d that can reduce the total number of parameters. Figure 2 compares the differences of SRU, factorized SRU and SRU++.

[Figure 2: An illustration of SRU and SRU++ networks: (a) the original SRU network, (b) the SRU variant using a projection trick to reduce the number of parameters, experimented in Lei et al. (2018), and (c) SRU++ proposed in this work. Numbers indicate the hidden size of intermediate inputs/outputs for d = 2048 and d′ = 512.]
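Below is one plausible PyTorch rendering of this attention component (single head, with causal masking and the cached context from the previous batch omitted). Shapes follow the equations above, and all names are ours rather than the official API.

```python
import math
import torch

class SRUppAttention(torch.nn.Module):
    """Sketch of the SRU++ attention that replaces Equation (1)."""
    def __init__(self, d, d_prime):
        super().__init__()
        self.W_q = torch.nn.Linear(d, d_prime, bias=False)
        self.W_k = torch.nn.Linear(d_prime, d_prime, bias=False)
        self.W_v = torch.nn.Linear(d_prime, d_prime, bias=False)
        self.W_o = torch.nn.Linear(d_prime, 3 * d, bias=False)
        self.alpha = torch.nn.Parameter(torch.zeros(1))  # starts as a pure linear mapping
        self.d_prime = d_prime

    def forward(self, x):
        # x: (L, B, d) -> U: (L, B, 3, d)
        L, B, d = x.shape
        q = self.W_q(x)                               # (L, B, d')
        k = self.W_k(q)
        v = self.W_v(q)
        scores = torch.einsum("lbe,mbe->blm", q, k) / math.sqrt(self.d_prime)
        attn = torch.softmax(scores, dim=-1)          # (B, L, L)
        a = torch.einsum("blm,mbe->lbe", attn, v)     # (L, B, d')
        u = self.W_o(q + self.alpha * a)              # residual connection, U = W_o(Q + alpha*A)
        return u.view(L, B, 3, d)
```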
Layer normalization

We also experiment with adding layer normalization (Ba et al., 2016) to each SRU++ layer. In our implementation, we apply normalization after the attention operation and before the matrix multiplication with W^o:

U^⊤ = W^o layernorm(Q + α · A).

Another option is to apply normalization over the hidden states h[t] once they are produced. Applying either one or both normalizations works well. Note that these variants are post-layer normalizations in which the normalization happens after the skip connection is added. In contrast, pre-layer normalization (Xiong et al., 2020) is applied within each non-linear layer. We use post normalization for better results, following the empirical observations in Liu et al. (2020b).
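In code this variant is a one-line change to the attention sketch above, normalizing the residual branch before the output projection (the `norm` attribute is hypothetical):

```python
# In SRUppAttention.__init__ (sketch): self.norm = torch.nn.LayerNorm(d_prime)
# In forward, normalize the residual branch before the output projection:
u = self.W_o(self.norm(q + self.alpha * a))   # post-LN variant of U = W_o(Q + alpha*A)
```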
Experimental setup

Datasets

We evaluate our model on three language model benchmarks.

• ENWIK8 (Hutter, 2012) is a character-level language modeling dataset consisting of 100M tokens taken from Wikipedia articles.

• WIKI-103 (Merity et al., 2016) is a word-level language modeling dataset. The training data contains 100 million tokens extracted from Wikipedia articles. Following prior work, we use a vocabulary that has about 260K tokens, and adaptive embedding and softmax layers (Grave et al., 2017; Baevski and Auli, 2019); see the sketch after this list.

• BILLION WORD (Chelba et al., 2013) is a much larger dataset containing 768 million word tokens for training. Unlike WIKI-103, in which sentences in the same article are treated as consecutive inputs to model long context, the sentences in BILLION WORD are randomly shuffled. We follow Baevski and Auli (2019) and use adaptive embedding and softmax layers for this dataset as well. The vocabulary size is about 800K, the same as prior work.
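One way to realize the adaptive softmax layer cited above is PyTorch's built-in nn.AdaptiveLogSoftmaxWithLoss; the cutoffs below are illustrative, not the paper's exact configuration.

```python
import torch

vocab_size, d = 260_000, 3072  # WIKI-103-like sizes, used for illustration only
# Frequent words get the full dimension; rarer clusters get progressively smaller ones.
adaptive_softmax = torch.nn.AdaptiveLogSoftmaxWithLoss(
    in_features=d,
    n_classes=vocab_size,
    cutoffs=[20_000, 60_000, 180_000],   # illustrative cluster boundaries
    div_value=4.0,                       # dimension shrink factor per cluster
)

hidden = torch.randn(128, d)             # e.g. 128 token positions of model output
targets = torch.randint(0, vocab_size, (128,))
out = adaptive_softmax(hidden, targets)  # out.loss is the training NLL
```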
Model
All our language models are constructed with a word embedding layer, multiple layers of SRU++ and an output linear layer followed by a softmax operation. We set the hidden dimensions d = 4 × d′, set the number of SRU++ layers to 10, and use single-head attention in each layer. We use the same dropout probability for all layers and tune this value according to the model size and the results on the dev set. A sketch of the overall stack follows.

For simplicity, we do not use additional techniques that are shown useful for language models such as compressed memory (Rae et al., 2020), nearest-neighbor interpolation (Khandelwal et al., 2020), relative position (Shaw et al., 2018; Press et al., 2020) and attention variants to handle longer context (Sukhbaatar et al., 2019a; Roy et al., 2020).
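As a rough sketch of this stack (reusing the hypothetical `SRUppAttention` and `elementwise_recurrence` helpers from the earlier sketches; embedding and output layers omitted), attention can be enabled only in every k-th layer, matching the analysis later in the results section:

```python
import torch

class SRUppBlock(torch.nn.Module):
    """One layer: U comes from attention or a factorized linear map
    (Figure 2(b)), followed by the element-wise recurrence."""
    def __init__(self, d, d_prime, use_attention):
        super().__init__()
        self.project = (
            SRUppAttention(d, d_prime) if use_attention else
            torch.nn.Sequential(                      # projection trick, no attention
                torch.nn.Linear(d, d_prime, bias=False),
                torch.nn.Linear(d_prime, 3 * d, bias=False),
            )
        )
        self.v = torch.nn.Parameter(torch.zeros(2, d))
        self.b = torch.nn.Parameter(torch.zeros(2, d))

    def forward(self, x):                             # x: (L, B, d)
        L, B, d = x.shape
        U = self.project(x).reshape(L, B, 3, d)
        h, _ = elementwise_recurrence(U, x, self.v, self.b)
        return h

def build_sru_pp_stack(d=2048, d_prime=512, n_layers=10, k=1):
    # k = 1: attention in every layer; k = 10: only the last layer attends.
    return torch.nn.Sequential(*[
        SRUppBlock(d, d_prime, use_attention=((i + 1) % k == 0))
        for i in range(n_layers)
    ])
```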
Optimization

We use RAdam (Liu et al., 2020a), a variant of the Adam optimizer (Kingma and Ba, 2014), for training. RAdam is reported to be less sensitive to the choice of learning rate and warmup steps, while achieving similar results at the end. We use the default β values and a fixed weight decay of 0.1 for all experiments. We use a cosine learning rate schedule following Dai et al. (2019) and an initial learning rate of 0.0003. We do not change the learning rate unless otherwise specified. See Appendix A for the detailed training configuration of each model.

For the ENWIK8 and WIKI-103 datasets, the training data is partitioned into B chunks by concatenating articles and ignoring the boundaries between articles. Each training batch contains M consecutive tokens from each chunk (i.e. the unroll size), which gives an effective size of B × M tokens per batch. Following previous work, the previous training batch is provided as additional context for attention, which results in a maximum attention length of 2M. For the BILLION WORD dataset, we follow Dai et al. (2019) and concatenate sentences to create the training batches (instead of adding pad tokens). Sentences are randomly shuffled and separated by a special token indicating sentence boundaries.
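A sketch of this batching scheme, with a hypothetical helper name (the released code may differ): the corpus is split into B parallel streams, each batch is an (M, B) slice of consecutive tokens, and the previous batch serves as extra attention context.

```python
import torch

def make_lm_batches(token_ids: torch.Tensor, B: int, M: int):
    """Split a 1-D tensor of token ids into B chunks, then yield (M, B)
    blocks of consecutive tokens (the unroll size), as described above."""
    n = token_ids.numel() // B
    streams = token_ids[: n * B].view(B, n).t().contiguous()  # (n, B)
    prev = None
    for start in range(0, n - 1, M):
        x = streams[start : start + M]          # inputs,  (<=M, B)
        y = streams[start + 1 : start + M + 1]  # next-token targets
        x = x[: y.size(0)]
        # `prev` would be passed to attention as extra context (memory),
        # giving a maximum attention length of 2M.
        yield x, y, prev
        prev = x
```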
Results

Comparison with Transformer-XL
We first conduct a comparison with the PyTorch implementation of Transformer-XL (Dai et al., 2019) on ENWIK8 (implementation: https://github.com/kimiyoung/transformer-xl/tree/master/pytorch). Their base model consists of 41M parameters and 12 Transformer layers. Following the official instructions, we reproduced the reported test BPC of 1.06 by training with 4 Nvidia 2080 Ti GPUs. The training took about 4 days, or 360 total GPU hours equivalently.

We train a 10-layer SRU++ model with 42M parameters. For a fair comparison, we use the same hyperparameter setting, including the batch size, unroll size, attention context length, learning rate and training iterations, as the Transformer-XL base model. Notably, our base model can be trained using 2 GPUs due to less memory usage. Table 1 presents the results. After training, we set the attention context length to 2048 for testing, similarly to the Transformer-XL baseline. Our model achieves a test BPC of 1.03, outperforming the baseline by 0.03. By extending the training attention context / unroll size from 512 to 768, we further obtain a test BPC of 1.02.

Model      Tok/batch   BPC ↓   GPU hrs ↓
Trans-XL   24 × 512    1.06    356
SRU++      24 × 512    1.03    94
SRU++      16 × 768    1.02    41†

Table 1: Comparison between SRU++ and Transformer-XL on ENWIK8. † indicates training using automatic mixed precision and distributed data parallel in PyTorch.

How much attention is needed?
Merity (2019) demonstrated that using a single attention layer with an LSTM retains most of the modeling capacity compared to using multiple attention layers. We conduct a similar analysis to understand how much attention is needed in conjunction with the simple fast recurrence. To do so, we only enable the attention operation every k layers. The other layers without attention become the SRU variant with dimension projection illustrated in Figure 2(b). Note that k = 1 means the default SRU++ model with attention in each layer, and k = 10 means only the last layer has attention in a 10-layer SRU++ model.

Table 2 presents the results of varying k = 1, 2, 5, 10. Our baseline model is a 10-layer model with 42M parameters. We see that using 50% less attention (k = 2) achieves almost no increase in test BPC. Using only one attention module (k = 10) leads to a loss of 0.01 BPC, but reduces the training time by 40%. Our results still outperform the Transformer-XL base model and single-headed attention LSTM (Merity, 2019) by 0.03 BPC.

Model          Param   BPC ↓    GPU hrs ↓
Trans-XL       41M     1.06     356
SHA-LSTM       54M     1.07     28†
k = 1          42M     1.022    41†
k = 2          42M     1.025    33†
k = 5          41M     1.032    27†
k = 10         42M     1.033    25†
No attention   42M     1.190    25†

Table 2: Analyzing SRU++ on ENWIK8 by enabling attention every k layers. We slightly increase the hidden size for k > 1 so the number of parameters is comparable. † indicates mixed precision training.

Figure 3 showcases the training efficiency of our model. In comparison with Transformer-XL, SRU++ is 5x more efficient to reach a dev BPC of 1.09. Furthermore, using automatic mixed precision training and only one attention sub-layer (k = 10) achieves a 14x reduction in training time and cost.

[Figure 3: Dev BPC vs. total GPU hours used on ENWIK8 during training, comparing Transformer-XL, SRU++, and SRU++ (k = 10, mixed precision).]

ENWIK8

Table 3 compares our model with other top-performing models on the ENWIK8 dataset. We train a base model with d = 3072 and a large model with d = 4096. The unroll size and attention context length are set to 1024 during training and 3072 during evaluation. To compare the computation efficiency we report the effective GPU days – the number of GPUs multiplied by the number of days needed to finish training. Our base model reaches a test BPC of 0.97 using 400K training steps and 7 GPU days (specifically, about 20 hours when using 8 V100 GPUs), significantly outperforming previously reported numbers. Furthermore, our big model achieves a test BPC of 0.96, on par with the state-of-the-art results. Interestingly, we found that turning off layer normalization achieves better generalization on the evaluation sets. We present an additional analysis for layer normalization later in this section.

Model                                                  Parameters ↓   Test BPC ↓   GPU days ↓
Longformer 30L (Beltagy et al., 2020)                  102M           0.99         104†
All-attention network 36L (Sukhbaatar et al., 2019b)   114M           0.98         64
Transformer-XL 24L (Dai et al., 2019)                  277M           0.99         -
+ Compressive memory (Rae et al., 2020)                -              0.97         -
SRU++ Base                                             108M           0.97         7†
SRU++ Base (k = 5)                                     98M            0.98         5†
SRU++ Base (w/o layernorm)                             108M           0.96         14†
SRU++ Large                                            191M           0.96         15†

Table 3: Comparison between models on ENWIK8 dataset. We include the training cost (measured by the number of GPUs used × the number of days) if it is reported in the previous work. Our results are obtained using an AWS p3dn instance with 8 V100 GPUs. † indicates mixed precision training.

WIKI-103
Table 4 presents the results of SRU++ models and other top results on the WIKI-103 dataset. We train one base model with d = 3072 and 148M parameters, and a large model with d = 4096 and 232M parameters. As shown in the table, our base model obtains a test perplexity of 18.4 using 8 GPU days of training, which is about a 3x reduction compared to the Transformer model in Baevski and Auli (2019) and over a 10x reduction compared to the Feedback Transformer (Fan et al., 2020). Similar to previous work on language modeling, we can obtain better perplexity by increasing the model size and hidden dimension. Our big model achieves a test perplexity of 17.7 using 16 GPU days.

Model                                                  Parameters ↓   Test PPL ↓   GPU days ↓
All-attention network 36L (Sukhbaatar et al., 2019b)   133M           20.6         -
Feedback Transformer (Fan et al., 2020)                139M           18.2         214
Transformer (Baevski and Auli, 2019)                   247M           18.7         22†
Transformer-XL 18L (Dai et al., 2019)                  257M           18.3         -
+ Compressive memory (Rae et al., 2020)                -              17.1         -
SRU++ Base                                             148M           18.4         8†
SRU++ Large                                            232M           17.7         16†
SRU++ Large (k = 5)                                    215M           17.8         12†

Table 4: Comparison between models on WIKI-103 dataset. We include the training cost (measured by the number of GPUs used × the number of days) if it is reported in the previous work. Our results are obtained using an AWS p3dn instance with 8 V100 GPUs. † indicates mixed precision training.

BILLION WORD
Finally, we train a 10-layer SRU++ model on the BILLION WORD dataset. Our model adopts the adaptive word embedding and softmax output layers as described in Baevski and Auli (2019). Similar to this work, we use an effective batch size of 65K tokens per gradient update. We double our training iterations to 800K and use a learning rate of 0.0002 compared to the setting used for WIKI-103 and ENWIK8. We use d = 4096 and d′ = 1024. Due to the expensive training cost for this dataset, we only compare with the base model of Baevski and Auli (2019).* Table 5 presents the test perplexity and associated training cost (as measured by GPU days needed). Our model achieves a test perplexity of 25.1 using about 7.4 × 8 = 59 GPU days. In comparison, the Transformer base model in Baevski and Auli (2019) costs 2.5x more for a perplexity of 25.2 and used 32 or 64 V100 GPUs for the experiments.

Model         Param   PPL ↓   GPU days ↓
Transformer   331M    25.6    57†
SRU++         328M    25.1    59†

Table 5: Comparison between SRU++ and the Transformer model of Baevski and Auli (2019) on BILLION WORD dataset.

* A single experiment run costs over $5,000 using an AWS p3dn on-demand instance.
Inference speed

Recurrent networks enjoy a computational benefit at inference time, because the associated computation per decoding step stays constant rather than growing with the generated context length. Table 6 compares the inference speed of SRU++ with other top-performing models on the WIKI-103 test set. We use a single V100 GPU for inference. Our large model runs at least 4.5x faster than all baseline models except Shortformer (Press et al., 2020). In addition, our model achieves 0.4-0.5 perplexity lower than Shortformer and runs 50% faster by using attention every 5 layers.

Model                        Speed ↑   PPL ↓
kNN-LM (Khandelwal et al.)   145       15.8
Trans (Baevski and Auli)     2.5k      18.7
Trans-XL (Dai et al.)        3.2k      18.3
Shortformer (Press et al.)   15k       18.2
SRU++ Large                  15k       17.7
SRU++ Large (k = 5)          22k       17.8

Table 6: Inference speed (tokens per second) on WIKI-103 test set. Results of baseline models are taken from Press et al. (2020). We use a batch size of 1 and maximum attention length 2560.
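To illustrate this benefit, one decoding step of the recurrence only needs the previous state c and does constant work per token. The sketch below is built on the earlier `SRULayer` fields; it is not the library's actual incremental API.

```python
import torch

@torch.no_grad()
def sru_decode_step(layer, x_t, c):
    """One decoding step of the SRU recurrence: O(d) work per step,
    independent of how many tokens were generated before."""
    u = (x_t @ layer.W.t()).view(3, -1)            # project a single token
    f = torch.sigmoid(u[0] + layer.v[0] * c + layer.b[0])
    r = torch.sigmoid(u[1] + layer.v[1] * c + layer.b[1])
    c = f * c + (1 - f) * u[2]
    h = r * c + (1 - r) * x_t
    return h, c                                    # carry c to the next step
```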
The effectiveness of layer normalization
In our experiments, we have always observed that layer normalization stabilizes training. For large models, however, layer normalization can lead to worse generalization on the evaluation sets, even if we tune dropout carefully. On the other hand, turning off layer normalization can achieve better dev and test results but makes training sensitive to learning rate and initialization. For example, we have to use a smaller learning rate of 0.00025 or lower to avoid sudden gradient explosion. Figure 4 showcases our empirical observation on the ENWIK8 dataset.

[Figure 4: Understanding the empirical effect of layer normalization. We show the training and dev loss of the small (41M) and big (108M) SRU++ models on ENWIK8, with and without layer normalization, over 400K training steps.]
Conclusion

We present a recurrent architecture with optional built-in attention and evaluate its effectiveness on various language modeling datasets. We demonstrate that highly expressive and efficient neural models can be designed using more than just attention. In fact, by incorporating fast recurrent networks, very little attention computation is needed to achieve both top-performing modeling results and training speed. As future work, we believe that self-attentive recurrent models can be further improved using better attention implementations, normalization and optimization techniques.
Acknowledgement
We thank Hugh Perkins, Joshua Shapiro, Sam Bowman and Danqi Chen for providing invaluable feedback for this work. In addition, we thank Jeremy Wohlwend, Jing Pan, Prashant Sridhar and Kyu Han for helpful discussions, and the ASAPP Language Technology and Infra teams for the compute cluster setup for our research experiments. Finally, we'd like to thank Tao Ma, Will Robinson and Gustavo Sapoznik for their support of this work and our research work at ASAPP in general.
References
Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450.

Alexei Baevski and Michael Auli. 2019. Adaptive input representations for neural language modeling. In International Conference on Learning Representations.

Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020. Longformer: The long-document transformer. arXiv:2004.05150.

James Bradbury, Stephen Merity, Caiming Xiong, and Richard Socher. 2017. Quasi-recurrent neural networks. In International Conference on Learning Representations (ICLR).

Andrew Brock, Soham De, and Samuel L. Smith. 2021. Characterizing signal propagation to close the performance gap in unnormalized resnets. In International Conference on Learning Representations.

Víctor Campos, Brendan Jou, Xavier Giró i Nieto, Jordi Torres, and Shih-Fu Chang. 2018. Skip RNN: Learning to skip state updates in recurrent neural networks. In ICLR.

Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. 2013. One billion word benchmark for measuring progress in statistical language modeling. Technical report, Google.

Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc Le, and Ruslan Salakhutdinov. 2019. Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2978–2988. Association for Computational Linguistics.

Angela Fan, Thibaut Lavril, Edouard Grave, Armand Joulin, and Sainbayar Sukhbaatar. 2020. Accessing higher-level representations in sequential transformers with feedback memory. arXiv preprint arXiv:2002.09402.

Edouard Grave, Armand Joulin, Moustapha Cissé, Hervé Jégou, et al. 2017. Efficient softmax approximation for GPUs. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 1302–1310. JMLR.org.

Anmol Gulati, Chung-Cheng Chiu, James Qin, Jiahui Yu, Niki Parmar, Ruoming Pang, Shibo Wang, Wei Han, Yonghui Wu, Yu Zhang, and Zhengdong Zhang. 2020. Conformer: Convolution-augmented Transformer for Speech Recognition.

Elad Hoffer, Itay Hubara, and Daniel Soudry. 2017. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. In Advances in Neural Information Processing Systems.

Zhiheng Huang, Peng Xu, Davis Liang, Ajay Mishra, and Bing Xiang. 2020. TRANS-BLSTM: Transformer with bidirectional LSTM for language understanding.

Marcus Hutter. 2012. The human knowledge compression contest. URL http://prize.hutter1.net, 6.

Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. 2020. Generalization through memorization: Nearest neighbor language models. In International Conference on Learning Representations.

Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations.

Tao Lei, Yu Zhang, Sida I. Wang, Hui Dai, and Yoav Artzi. 2018. Simple recurrent units for highly parallelizable recurrence. In Empirical Methods in Natural Language Processing (EMNLP).

Alexander Lin, Jeremy Wohlwend, Howard Chen, and Tao Lei. 2020. Autoregressive knowledge distillation through imitation learning. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics.

Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, and Jiawei Han. 2020a. On the variance of the adaptive learning rate and beyond. In Proceedings of the Eighth International Conference on Learning Representations (ICLR 2020).

Liyuan Liu, Xiaodong Liu, Jianfeng Gao, Weizhu Chen, and Jiawei Han. 2020b. Understanding the difficulty of training transformers. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics.

Stephen Merity. 2019. Single headed attention RNN: Stop thinking with your head. arXiv preprint arXiv:1911.11423.

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2016. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843.

Jinhwan Park, Yoonho Boo, Iksoo Choi, Sungho Shin, and Wonyong Sung. 2018. Fully neural network based speech recognition on mobile and embedded devices. In NeurIPS, pages 10620–10630.

Ofir Press, Noah A. Smith, and Mike Lewis. 2020. Shortformer: Better language modeling using shorter inputs. arXiv preprint arXiv:2012.15832.

Jack W. Rae, Anna Potapenko, Siddhant M. Jayakumar, Chloe Hillier, and Timothy P. Lillicrap. 2020. Compressive transformers for long-range sequence modelling. In International Conference on Learning Representations.

Aurko Roy, Mohammad Saffar, Ashish Vaswani, and David Grangier. 2020. Efficient content-based sparse attention with routing transformers. arXiv preprint arXiv:2003.05997.

Yuan Shangguan, Jian Li, Qiao Liang, Raziel Alvarez, and Ian McGraw. 2020. Optimizing speech recognition for the edge.

Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. 2018. Self-attention with relative position representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers). Association for Computational Linguistics.

Sheng Shen, Zhewei Yao, Amir Gholami, Michael Mahoney, and Kurt Keutzer. 2020. PowerNorm: Rethinking batch normalization in transformers. In ICML.

Sainbayar Sukhbaatar, Edouard Grave, Piotr Bojanowski, and Armand Joulin. 2019a. Adaptive attention span in transformers. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics.

Sainbayar Sukhbaatar, Edouard Grave, Guillaume Lample, Herve Jegou, and Armand Joulin. 2019b. Augmenting self-attention with persistent memory.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems.

Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, and Tieyan Liu. 2020. On layer normalization in the transformer architecture. In Proceedings of the 37th International Conference on Machine Learning, Proceedings of Machine Learning Research. PMLR.

Jingjing Xu, Xu Sun, Zhiyuan Zhang, Guangxiang Zhao, and Junyang Lin. 2019. Understanding and improving layer normalization. In Advances in Neural Information Processing Systems.

Biao Zhang and Rico Sennrich. 2019. A lightweight recurrent network for sequence modeling. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics.
Appendix A: Training details
We use the RAdam optimizer with the default hyperparameters β₁ = 0.9 and β₂ = 0.999 for all our experiments.* We use a cosine learning rate schedule with only 1 cycle for simplicity. For faster training, we also leverage the native automatic mixed precision (AMP) training and distributed data parallel (DDP) of PyTorch in all experiments, except those in Table 1 and Figure 1 for a fair comparison with the Transformer-XL implementation.

Table 7 shows the detailed training configuration of SRU++ models on ENWIK8. For the large model, we scale the learning rate by √2 following the suggestion in Hoffer et al. (2017), which results in a rounded learning rate of 0.0004. The model without layer normalization uses the smaller learning rate 0.0003, as large learning rates can lead to gradient explosion. We train a model without layer normalization to showcase that it can achieve state-of-the-art results, following our analysis in the result section. We use an increased number of training steps of 600K for this experiment.

Table 8 presents the detailed training configuration on the WIKI-103 dataset. Similarly, we use d = 3072 and d = 4096 for the base and large model respectively, and fix d : d′ = 4 : 1. Following Baevski and Auli (2019), we use an adaptive word embedding layer and an adaptive softmax layer for our models, and we tie the weight matrices of the two layers.

* https://github.com/LiyuanLucasLiu/RAdam

                                  Base model (k = 5)   Base model   Base model (w/o layernorm)   Large model
Attention / unroll size - train   1024                 1024         1024                         1024
Attention / unroll size - test    3072                 3072         3072                         3072
Batch size × Num of GPUs          4 × 8                –            –                            –
d                                 3072                 3072         3072                         4096
d′                                768                  768          768                          1024
Learning rate                     0.0003               0.0003       0.0003                       0.0004
LR warmup steps                   16K                  16K          16K                          16K
Training steps                    400K                 400K         600K                         400K
Weight decay                      0.1                  0.1          0.1                          0.1
Model size                        98M                  108M         108M                         191M
Dev BPC                           1.002                0.997        0.983                        0.985
Test BPC                          0.980                0.974        0.961                        0.963

Table 7: Training details of SRU++ models on ENWIK8.

                                  Base model   Large model (k = 5)   Large model
Attention / unroll size - train   768          1024                  1024
Attention / unroll size - test    2560         2560                  2560
Batch size × Num of GPUs          8 × 8        –                     –
d                                 3072         4096                  4096
d′                                768          1024                  1024
Learning rate                     0.0003       0.0003                0.0003
LR warmup steps                   16K          16K                   16K
Training steps                    400K         400K                  400K
Weight decay                      0.1          0.1                   0.1
Model size                        148M         215M                  232M
Dev PPL                           17.7         17.2                  17.0
Test PPL                          18.4         17.8                  17.7

Table 8: Training details of SRU++ models on WIKI-103.
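For reference, the optimization setup above could be sketched in PyTorch as follows. torch.optim.RAdam stands in for the linked RAdam implementation; the model and loss are placeholders, and only the warmup steps, peak learning rate, betas and weight decay follow the tables above.

```python
import math
import torch

model = torch.nn.Linear(8, 8)  # placeholder for the SRU++ language model

optimizer = torch.optim.RAdam(
    model.parameters(), lr=3e-4, betas=(0.9, 0.999), weight_decay=0.1
)

warmup_steps, total_steps = 16_000, 400_000

def warmup_then_cosine(step):
    # Linear warmup for 16K steps, then a single cosine decay cycle.
    if step < warmup_steps:
        return step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, warmup_then_cosine)
scaler = torch.cuda.amp.GradScaler(enabled=torch.cuda.is_available())  # AMP, as in the paper

# One illustrative training step under AMP (real data loading omitted):
x, y = torch.randn(4, 8), torch.randn(4, 8)
with torch.cuda.amp.autocast(enabled=torch.cuda.is_available()):
    loss = torch.nn.functional.mse_loss(model(x), y)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
scheduler.step()
```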