Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention
Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, Vikas Singh
University of Wisconsin-Madison · UC Berkeley · Google Brain · American Family Insurance
Abstract
Transformers have emerged as a powerful tool for a broad range of natural language processing tasks. A key component that drives the impressive performance of Transformers is the self-attention mechanism that encodes the influence or dependence of other tokens on each specific token. While beneficial, the quadratic complexity of self-attention on the input sequence length has limited its application to longer sequences – a topic being actively studied in the community. To address this limitation, we propose Nyströmformer – a model that exhibits favorable scalability as a function of sequence length. Our idea is based on adapting the Nyström method to approximate standard self-attention with O(n) complexity. The scalability of Nyströmformer enables application to longer sequences with thousands of tokens. We perform evaluations on multiple downstream tasks on the GLUE benchmark and IMDB reviews with standard sequence length, and find that our Nyströmformer performs comparably, or in a few cases even slightly better, than the standard Transformer. Our code is at https://github.com/mlpen/Nystromformer.

Introduction
Transformer-based models, such as BERT (Devlin et al. 2019) and GPT-3 (Brown et al. 2020), have been very successful in natural language processing (NLP), achieving state-of-the-art performance in machine translation (Vaswani et al. 2017), natural language inference (Williams, Nangia, and Bowman 2018), paraphrasing (Dolan and Brockett 2005), text classification (Howard and Ruder 2018), question answering (Rajpurkar et al. 2016) and many other NLP tasks (Peters et al. 2018; Radford et al. 2018). A key feature of Transformers is what is known as the self-attention mechanism (Vaswani et al. 2017), where each token's representation is computed from all other tokens. Self-attention enables interactions of token pairs across the full sequence and has been shown to be quite effective.

Despite the foregoing advantages, self-attention also turns out to be a major efficiency bottleneck since it has a memory and time complexity of O(n²), where n is the length of an input sequence. This leads to high memory and computational requirements for training large Transformer-based models; for example, training a BERT-large model (Devlin et al. 2019) is already resource intensive, and the O(n²) complexity makes it prohibitively expensive to train large Transformers with long sequences (e.g., n = 2048).

To address this challenge, several recent works have proposed strategies that avoid incurring the quadratic cost when dealing with longer input sequences. For example, (Dai et al. 2019) suggests a trade-off between memory and computational efficiency. The ideas described in (Child et al. 2019; Kitaev, Kaiser, and Levskaya 2019) decrease the self-attention complexity to O(n√n) and O(n log n) respectively. In (Shen et al. 2018b; Katharopoulos et al. 2020; Wang et al. 2020), self-attention complexity can be reduced to O(n) with various approximation ideas, each with its own strengths and limitations.

In this paper, we propose an O(n) approximation, both in the sense of memory and time, for self-attention. Our model, Nyströmformer, scales linearly with the input sequence length n. This is achieved by leveraging the celebrated Nyström method, repurposed for approximating self-attention. Specifically, our Nyströmformer algorithm makes use of landmark (or Nyström) points to reconstruct the softmax matrix in self-attention, thereby avoiding computing the n × n softmax matrix. We show that this yields a good approximation of the true self-attention.

To evaluate our method, we consider a transfer learning setting using Transformers, where models are first pretrained with a language modeling objective on a large corpus, and then finetuned on target tasks using supervised data (Devlin et al. 2019; Liu et al. 2019; Lewis et al. 2020; Wang et al. 2020). Following BERT (Devlin et al. 2019; Liu et al. 2019), we pretrain our proposed model on English Wikipedia and BookCorpus (Zhu et al. 2015) using a masked-language-modeling objective. We observe a similar performance to the baseline BERT model on English Wikipedia and BookCorpus. We then finetune our pretrained models on multiple downstream tasks in the GLUE benchmark (Wang et al. 2018) and IMDB reviews (Maas et al. 2011), and compare our results to BERT in both accuracy and efficiency. Across all tasks, our model compares favorably to the vanilla pretrained BERT with promising speedups. Our model also outperforms several recent efficient transformer models, thus providing a step towards resource-efficient Transformers.
Related Work

We briefly review a few results on efficient Transformers, linearized softmax kernels, and Nyström-like methods.
Efficient Transformers. Weight pruning (Michel, Levy, and Neubig 2019), weight factorization (Lan et al. 2020), weight quantization (Zafrir et al. 2019) or knowledge distillation (Sanh et al. 2019) are several strategies that have been proposed to improve memory efficiency in Transformers. The use of a new pretraining objective in (Clark et al. 2019), product-key attention in (Lample et al. 2019), and the Transformer-XL model in (Dai et al. 2019) have shown how the overall compute requirements can be reduced. In (Child et al. 2019), a sparse factorization of the attention matrix was used for reducing the overall complexity from quadratic to O(n√n) for generative modeling of long sequences. In (Kitaev, Kaiser, and Levskaya 2019), the Reformer model further reduces the complexity to O(n log n) via locality-sensitive hashing (LSH). This relies on performing fewer dot product operations overall by assuming that the keys need to be identical to the queries. Recently, in (Wang et al. 2020), the Linformer model suggested the use of random projections based on the JL lemma to reduce the complexity to O(n) with a linear projection step. The Longformer model in (Beltagy, Peters, and Cohan 2020) achieves O(n) complexity using a local windowed attention and a task-motivated global attention for longer documents, while BIGBIRD (Zaheer et al. 2020) uses a sparse attention mechanism. There are also other existing approaches to improve optimizer efficiency, such as microbatching (Huang et al. 2019) and gradient checkpointing (Chen et al. 2016).

Linearized Softmax. In (Blanc and Rendle 2018), an adaptive sampled softmax with kernel-based sampling was shown to speed up training. It involves sampling only some of the classes at each training step using a linear dot product approximation. In (Rawat et al. 2019), the Random Fourier Softmax (RF-softmax) idea uses random Fourier features to perform efficient sampling from an approximate softmax distribution for normalized embeddings. In (Shen et al. 2018b; Katharopoulos et al. 2020), linearizing the softmax attention in Transformers was based on heuristically separating keys and queries in a linear dot product approximation. While the idea is interesting, the approximation error to the softmax matrix in self-attention can be large in some cases.
Nyström-like Methods. Nyström-like methods sample columns of a matrix to achieve a close approximation to the original matrix. The Nyström method (Baker 1977) was developed as a way of discretizing an integral equation with a simple quadrature rule and remains a widely used approach for approximating a kernel matrix with a given sampled subset of columns (Williams and Seeger 2001). Many variants, such as Nyström with k-means (Zhang, Tsang, and Kwok 2008; Zhang and Kwok 2010), randomized Nyström (Li, Kwok, and Lü 2010), Nyström with spectral shift (Wang et al. 2014), Nyström with pseudo landmarks and the prototype method (Wang and Zhang 2013; Wang, Zhang, and Zhang 2016), fast-Nys (Si, Hsieh, and Dhillon 2016), MEKA (Si, Hsieh, and Dhillon 2017), and ensemble Nyström (Kumar, Mohri, and Talwalkar 2009), have been proposed for specific improvements over the basic Nyström approximation. In (Nemtsov, Averbuch, and Schclar 2016), the Nyström method was extended to deal with a general matrix (rather than a symmetric matrix). The authors in (Musco and Musco 2017) introduced the RLS-Nyström method, which proposes a recursive sampling approach to accelerate landmark point sampling. (Fanuel, Schreurs, and Suykens 2019) developed DAS (Deterministic Adaptive Sampling) and RAS (Randomized Adaptive Sampling) algorithms to promote diversity in landmark selection.

The ideas most related to our development are (Wang and Zhang 2013; Musco and Musco 2017). These approaches are designed for general matrix approximation (which accurately reflects our setup) while only sampling a subset of columns and rows. However, directly applying these methods to approximate the softmax matrix used by self-attention does not directly reduce the computational complexity. This is because even accessing a subset of columns or rows of the softmax matrix requires calculating all elements of the full matrix before the softmax function, and calculating these entries incurs a quadratic cost in our case. Nonetheless, inspired by the key idea of using a subset of columns to reconstruct the full matrix, we propose a Nyström approximation with O(n) complexity tailored for the softmax matrix, for efficiently computing self-attention.

Nyström-Based Linear Transformers
In this section, we start by briefly reviewing self-attention, then discuss the basic idea of the Nyström approximation method for the softmax matrix in self-attention, and finally adapt this idea to achieve our proposed construction.
Self-Attention
What is self-attention?
Self-attention calculates a weighted average of feature representations, with the weight proportional to a similarity score between pairs of representations. Formally, an input sequence of n tokens of dimension d, X ∈ R^{n×d}, is projected using three matrices W_Q ∈ R^{d×d_q}, W_K ∈ R^{d×d_k}, and W_V ∈ R^{d×d_v} to extract feature representations Q, K, and V, referred to as query, key, and value respectively, with d_k = d_q. The outputs Q, K, V are computed as

Q = XW_Q,  K = XW_K,  V = XW_V.   (1)

Self-attention can then be written as

S = D(Q, K, V) = softmax( QK^T / √d_q ) V,   (2)

where softmax denotes a row-wise softmax normalization function. Thus, each element in S depends on all other elements in the same row.

Compute cost of self-attention. The self-attention mechanism requires calculating n² similarity scores between all pairs of tokens, leading to a complexity of O(n²) for both memory and time. Due to this quadratic dependence on the input length, the application of self-attention is limited to short sequences. This is a key motivation for a resource-efficient self-attention module.
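As a point of reference, here is a minimal PyTorch sketch of this standard self-attention (our own illustration, not the authors' implementation; tensor names are ours). The n × n score matrix it materializes is exactly what drives the quadratic cost.

```python
import torch
import torch.nn.functional as F

def standard_self_attention(X, W_Q, W_K, W_V):
    """Vanilla self-attention as in Eq. (1)-(2); materializes an n x n matrix."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V            # each is n x d_*
    d_q = Q.shape[-1]
    scores = Q @ K.transpose(-1, -2) / d_q ** 0.5  # n x n -- O(n^2) memory and time
    S = F.softmax(scores, dim=-1)                  # row-wise softmax
    return S @ V                                   # n x d_v
```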
Nyström Method for Matrix Approximation

The starting point of our work is to reduce the computational cost of self-attention in Transformers using the Nyström method, widely adopted for matrix approximation (Williams and Seeger 2001; Drineas and Mahoney 2005; Wang and Zhang 2013). Following (Wang and Zhang 2013), we describe a potential strategy, and its challenges, for using the Nyström method to approximate the softmax matrix in self-attention by sampling a subset of its columns and rows.

Denote the softmax matrix used in self-attention by S = softmax(QK^T/√d_q) ∈ R^{n×n}. S can be written as

S = softmax( QK^T / √d_q ) = [ A_S  B_S ; F_S  C_S ],   (3)

where A_S ∈ R^{m×m}, B_S ∈ R^{m×(n−m)}, F_S ∈ R^{(n−m)×m} and C_S ∈ R^{(n−m)×(n−m)}. A_S is designated to be our sample matrix, obtained by sampling m columns and m rows from S.

Quadrature technique. S can be approximated via the basic quadrature technique of the Nyström method. It begins with the singular value decomposition (SVD) of the sample matrix, A_S = UΛV^T, where U, V ∈ R^{m×m} are orthogonal matrices and Λ ∈ R^{m×m} is a diagonal matrix. Based on the out-of-sample column approximation (Wang and Zhang 2013), the explicit Nyström form of S can be reconstructed from m columns and m rows of S as

Ŝ = [ A_S  B_S ; F_S  F_S A_S^+ B_S ] = [ A_S ; F_S ] A_S^+ [ A_S  B_S ],   (4)

where A_S^+ is the Moore-Penrose inverse of A_S, and C_S is approximated by F_S A_S^+ B_S. Here, (4) suggests that the n × n matrix S can be reconstructed by sampling m rows (A_S, B_S) and m columns (A_S, F_S) from S and forming the Nyström approximation Ŝ.
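As a toy illustration of (4) on a generic low-rank matrix (our own sketch; the random matrix and the leading-rows/columns "sampling" are hypothetical and unrelated to the softmax matrix discussed next), the reconstruction can be written as:

```python
import torch

n, m, r = 256, 32, 16
G = torch.randn(n, r)
S = G @ G.T                                 # a rank-r n x n matrix to be approximated
A = S[:m, :m]                               # A_S: intersection of the sampled rows/columns
B = S[:m, m:]                               # B_S
Fm = S[m:, :m]                              # F_S
top = torch.cat([A, B], dim=1)              # [A_S  B_S]   (m x n): m sampled rows
left = torch.cat([A, Fm], dim=0)            # [A_S; F_S]   (n x m): m sampled columns
S_hat = left @ torch.linalg.pinv(A) @ top   # Eq. (4)
print(torch.norm(S - S_hat) / torch.norm(S))  # typically ~0 when rank(S) <= m
```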
Nyström approximation for the softmax matrix. We briefly discuss how to construct the out-of-sample approximation for the softmax matrix in self-attention using the standard Nyström method. Given a query q_i and key k_j, let

K_K(q_i) = softmax( q_i K^T / √d_q ),   K_Q(k_j) = softmax( Q k_j^T / √d_q ),

where K_K(q_i) ∈ R^{1×n} and K_Q(k_j) ∈ R^{n×1}. We can then construct

φ_K(q_i) = Λ^{-1/2} V^T [K_K(q_i)^T]_{m×1},   φ_Q(k_j) = Λ^{-1/2} U^T [K_Q(k_j)]_{m×1},

where [·]_{m×1} refers to calculating the full n × 1 vector and then taking its first m × 1 entries. With φ_K(q_i) and φ_Q(k_j) in hand, the entry of Ŝ for the standard Nyström approximation is calculated as

Ŝ_ij = φ_K(q_i)^T φ_Q(k_j),   ∀ i = 1, ..., n,  j = 1, ..., n.   (5)

Figure 1:
A key challenge of Nyström approximation. The orange block on the left shows an n × m sub-matrix of S used by the Nyström matrix approximation in (4). Computing this sub-matrix, however, requires all entries of the n × n matrix before the softmax function (QK^T). Therefore, a direct application of the Nyström approximation has the same O(n²) complexity.

In matrix form, Ŝ can be represented as

Ŝ = [ softmax( QK^T / √d_q ) ]_{n×m} A_S^+ [ softmax( QK^T / √d_q ) ]_{m×n},   (6)

where [·]_{n×m} refers to taking m columns from the n × n matrix and [·]_{m×n} refers to taking m rows from the n × n matrix. This representation is the application of (4) to softmax matrix approximation in self-attention: [A_S; F_S] in (4) corresponds to the first n × m matrix in (6) and [A_S  B_S] in (4) corresponds to the last m × n matrix in (6). More details of the matrix representation are available in the supplement.

A key challenge of Nyström approximation.
Unfortunately, (4) and (6) require calculating all entries in QK^T due to the softmax function, even though the approximation only needs to access a subset of the columns of S, i.e., [A_S; F_S]. The problem arises due to the denominator within the row-wise softmax function. Specifically, computing an element of S requires a summation of the exponentials of all elements in the same row of QK^T. Thus, calculating [A_S; F_S] requires access to the full QK^T, as shown in Fig. 1, and directly applying the Nyström approximation as in (4) is not attractive.

Linearized Self-Attention via Nyström Method
We now adapt the Nyström method to approximately calculate the full softmax matrix S. The basic idea is to use landmarks K̃ and Q̃, derived from the keys K and queries Q, to obtain an efficient Nyström approximation without accessing the full QK^T. When the number of landmarks, m, is much smaller than the sequence length n, our Nyström approximation scales linearly w.r.t. the input sequence length in the sense of both memory and time.

Following the Nyström method, we also start with the SVD of a smaller matrix, A_S, and apply the basic quadrature technique. But instead of subsampling the matrix after the softmax operation, we select landmarks Q̃ from the queries Q and K̃ from the keys K before the softmax, and then form an m × m matrix A_S by applying the softmax operation on the landmarks. We also form the matrices corresponding to the left and right matrices in (4) using the landmarks Q̃ and K̃. This provides an n × m matrix and an m × n matrix respectively. With these three n × m, m × m, m × n matrices, our Nyström approximation of the n × n matrix S involves the multiplication of three matrices as in (4).

Figure 2: Illustration of the Nyström approximation of the softmax matrix in self-attention. The left image shows the true softmax matrix used in self-attention and the right images show its Nyström approximation, computed via the multiplication of three matrices.

In the description that follows, we first define the matrix form of the landmarks. Then, based on the landmark matrices, we form the three matrices needed for our approximation.

Definition 1.
Let us assume that the selected landmarks for the inputs Q = [q_1; ...; q_n] and K = [k_1; ...; k_n] are {q̃_j}_{j=1}^m and {k̃_j}_{j=1}^m respectively. We denote the matrix form of the corresponding landmarks as

Q̃ = [q̃_1; ...; q̃_m] ∈ R^{m×d_q} for {q̃_j}_{j=1}^m,   K̃ = [k̃_1; ...; k̃_m] ∈ R^{m×d_q} for {k̃_j}_{j=1}^m.

The corresponding m × m matrix is generated by

A_S = softmax( Q̃K̃^T / √d_q ),   with SVD   A_S = U_{m×m} Λ_{m×m} V^T_{m×m}.

Note that in the SVD of A_S, U_{m×m} and V_{m×m} are orthogonal matrices. Similar to the out-of-sample approximation procedure for the standard Nyström scheme described above, given a query q_i and key k_j, let

K_K̃(q_i) = softmax( q_i K̃^T / √d_q ),   K_Q̃(k_j) = softmax( Q̃ k_j^T / √d_q ),

where K_K̃(q_i) ∈ R^{1×m} and K_Q̃(k_j) ∈ R^{m×1}. We can then construct

φ_K̃(q_i) = Λ^{-1/2}_{m×m} V^T_{m×m} K_K̃(q_i)^T,   φ_Q̃(k_j) = Λ^{-1/2}_{m×m} U^T_{m×m} K_Q̃(k_j).

The entries of Ŝ thus depend on the landmark matrices K̃ and Q̃ and are calculated as

Ŝ_ij = φ_K̃(q_i)^T φ_Q̃(k_j),   ∀ i = 1, ..., n,  j = 1, ..., n.   (7)

To derive the explicit Nyström form Ŝ of the softmax matrix with the three n × m, m × m, m × n matrices, we first assume that A_S is non-singular, so that the above expressions defining φ_K̃ and φ_Q̃ are meaningful. We will shortly relax this assumption to achieve the general form as in (4).

Algorithm 1:
Pipeline for the Nyström approximation of the softmax matrix in self-attention

Input: Query Q and Key K.
Output: Nyström approximation of the softmax matrix.
1. Compute landmarks from the input Q and landmarks from the input K, in matrix form Q̃ and K̃.
2. Compute F̃ = softmax( QK̃^T / √d_q ) and B̃ = softmax( Q̃K^T / √d_q ).
3. Compute Ã = ( softmax( Q̃K̃^T / √d_q ) )^+.
4. Return F̃ × Ã × B̃.

When A_S is non-singular,

Ŝ_ij = φ_K̃(q_i)^T φ_Q̃(k_j)   (8)
     = K_K̃(q_i) V_{m×m} Λ^{-1}_{m×m} U^T_{m×m} K_Q̃(k_j).   (9)

Let W_m = V_{m×m} Λ^{-1}_{m×m} U^T_{m×m}. Recall that an SVD of A_S is U_{m×m} Λ_{m×m} V^T_{m×m}, and so W_m A_S = I_{m×m}. Therefore,

Ŝ_ij = K_K̃(q_i) A_S^{-1} K_Q̃(k_j).   (10)

Based on (10), we can rewrite it in a form similar to (4) (i.e., not requiring that A_S is non-singular) as

Ŝ_ij = K_K̃(q_i) A_S^+ K_Q̃(k_j),   (11)

where A_S^+ is the Moore-Penrose pseudoinverse of A_S. So,

Ŝ_ij = softmax( q_i K̃^T / √d_q ) A_S^+ softmax( Q̃ k_j^T / √d_q ),   (12)

for i, j ∈ {1, ..., n}. The Nyström form of the softmax matrix S = softmax(QK^T/√d_q) is thus approximated as

Ŝ = softmax( QK̃^T / √d_q ) ( softmax( Q̃K̃^T / √d_q ) )^+ softmax( Q̃K^T / √d_q ).   (13)

Note that we arrive at (13) via an out-of-sample approximation similar to (4). The key difference is that in (13) the landmarks are selected before the softmax operation to generate the out-of-sample approximation. This avoids the need to compute the full softmax matrix S for a Nyström approximation. Fig. 2 illustrates the proposed Nyström approximation and Alg. 1 summarizes our method.

We now describe (a) the calculation of the Moore-Penrose inverse and (b) the selection of landmarks.

Moore-Penrose inverse computation. The Moore-Penrose pseudoinverse can be calculated using a singular value decomposition. However, SVD is not very efficient on GPUs. To accelerate the computation, we use an iterative method from (Razavi et al. 2014) to approximate the Moore-Penrose inverse via efficient matrix-matrix multiplications.
Lemma 1.
For A_S ∈ R^{m×m}, the sequence {Z_j}_{j=0}^∞ generated by (Razavi et al. 2014),

Z_{j+1} = (1/4) Z_j ( 13 I − A_S Z_j ( 15 I − A_S Z_j ( 7 I − A_S Z_j ) ) ),   (14)

converges to the Moore-Penrose inverse A_S^+ with third-order convergence, provided the initial approximation Z_0 satisfies ||A_S A_S^+ − A_S Z_0|| < 1.

We select Z_0 = A_S^T / ( ||A_S||_1 ||A_S||_∞ ), where

||A_S||_1 = max_j Σ_{i=1}^m |(A_S)_ij|,   ||A_S||_∞ = max_i Σ_{j=1}^m |(A_S)_ij|,

based on (Pan and Schreiber 1991). This choice ensures that ||I − A_S Z_0|| < 1. When A_S is non-singular, ||A_S A_S^+ − A_S Z_0|| = ||I − A_S Z_0|| < 1. Without the non-singularity constraint, this choice of Z_0 still provides a good approximation in our experiments. For all our experiments, a small number of iterations suffices to achieve a good approximation of the pseudoinverse.

Let A_S^+ be approximated by Z* via (14). Our Nyström approximation of S can then be written as

Ŝ = softmax( QK̃^T / √d_q ) Z* softmax( Q̃K^T / √d_q ).   (15)

Since (15) only needs matrix-matrix multiplications, the gradient computation is straightforward.
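A minimal PyTorch sketch of this iteration (our own code, not the released implementation; the transposed, norm-scaled initialization follows Pan and Schreiber (1991), and the default iteration count is our choice) might look like:

```python
import torch

def iterative_pinv(A, n_iter=6):
    """Approximate Moore-Penrose pseudoinverse of a single m x m matrix A via Eq. (14)."""
    I = torch.eye(A.shape[-1], device=A.device, dtype=A.dtype)
    # Z_0 = A^T / (||A||_1 * ||A||_inf), following Pan and Schreiber (1991)
    Z = A.transpose(-1, -2) / (A.abs().sum(dim=-2).max() * A.abs().sum(dim=-1).max())
    for _ in range(n_iter):            # a handful of iterations is typically enough
        AZ = A @ Z
        Z = 0.25 * Z @ (13 * I - AZ @ (15 * I - AZ @ (7 * I - AZ)))
    return Z
```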
Landmark selection. Landmark points (inducing points (Lee et al. 2019)) can be selected using K-means clustering (Zhang, Tsang, and Kwok 2008; Vyas, Katharopoulos, and Fleuret 2020). However, the EM-style updates of K-means are less desirable during mini-batch training. We propose to simply use Segment-means, similar to the local average pooling previously used in the NLP literature (Shen et al. 2018a). Specifically, for input queries Q = [q_1; ...; q_n], we separate the n queries into m segments. As we can pad inputs to a length divisible by m, we assume n is divisible by m for simplicity. Let l = n/m. Landmark points for Q and for the input keys K = [k_1; ...; k_n] are computed as

q̃_j = (1/l) Σ_{i=(j−1)l+1}^{jl} q_i,   k̃_j = (1/l) Σ_{i=(j−1)l+1}^{jl} k_i,   j = 1, ..., m.   (16)

Segment-means requires a single scan of the sequence to compute the landmarks, leading to a complexity of O(n). We find that a small number of landmarks is often sufficient to ensure a good approximation, although this depends on the application. More details regarding the landmark selection are in the supplement.

Approximate self-attention. With the landmark points and the pseudoinverse computed, the Nyström approximation of the softmax matrix can be calculated. By plugging in the Nyström approximation, we obtain a linearized version ŜV that approximates the true self-attention SV,

ŜV = softmax( QK̃^T / √d_q ) Z* softmax( Q̃K^T / √d_q ) V.   (17)

Fig. 3 presents an example of the fidelity of the Nyström approximate self-attention versus true self-attention.
Figure 3: An example of Nyström approximation vs. ground-truth self-attention. Top: standard self-attention computed by (2). Bottom: self-attention from our proposed Nyström approximation in (17). We see that the attention patterns are quite similar.
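Putting the pieces together, the following sketch (our own code; single attention head, 2-D tensors, n assumed divisible by the number of landmarks m, reusing the iterative_pinv helper sketched above) computes the approximation in (16)-(17) without ever forming an n × n matrix:

```python
import torch
from torch.nn.functional import softmax

def nystrom_attention(Q, K, V, m=64):
    """Nystrom-approximated self-attention, Eq. (17): (n x m)(m x m)(m x n) times V."""
    n, d_q = Q.shape
    assert n % m == 0, "pad the sequence so n is divisible by the number of landmarks m"
    # Segment-means landmarks, Eq. (16): average l = n/m consecutive queries / keys
    Q_tilde = Q.reshape(m, n // m, d_q).mean(dim=1)              # m x d_q
    K_tilde = K.reshape(m, n // m, d_q).mean(dim=1)              # m x d_q
    F_tilde = softmax(Q @ K_tilde.T / d_q ** 0.5, dim=-1)        # n x m
    A_tilde = softmax(Q_tilde @ K_tilde.T / d_q ** 0.5, dim=-1)  # m x m
    B_tilde = softmax(Q_tilde @ K.T / d_q ** 0.5, dim=-1)        # m x n
    # Multiply right-to-left so every intermediate stays O(n) in size
    return F_tilde @ (iterative_pinv(A_tilde) @ (B_tilde @ V))   # n x d_v
```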
Complexity analysis. We now provide a complexity analysis of the Nyström approximation, which needs to account for landmark selection, pseudoinverse calculation, and the matrix multiplications. Landmark selection using Segment-means takes O(n). Iterative approximation of the pseudoinverse takes O(m³) in the worst case. The matrix multiplication first calculates softmax(QK̃^T/√d_q) × Z* and softmax(Q̃K^T/√d_q) × V, and then calculates the product (softmax(QK̃^T/√d_q) × Z*) × (softmax(Q̃K^T/√d_q) × V). This costs O(nm² + mnd_v + nmd_v). The overall time complexity is thus O(n + m³ + nm² + mnd_v + nmd_v). For memory cost, storing the landmark matrices Q̃ and K̃ involves an O(md_q) cost, and storing the four matrices of the Nyström approximation has an O(nm + m² + mn + nd_v) cost. Thus, the memory footprint is O(md_q + nm + m² + mn + nd_v). When the number of landmarks m ≪ n, the time and memory complexity of our Nyström approximation is O(n), i.e., it scales linearly w.r.t. the input sequence length n.

Analysis of Nyström Approximation
The following simple result states that the Galerkin discretization of φ_K̃(q)^T φ_Q̃(k) with the same set of quadrature and landmark points induces the same Nyström matrix, in particular, the same n × n Nyström approximation Ŝ. This result agrees with the discussion in (Bremer 2012).

Lemma 2.
Given the input data sets Q = {q_i}_{i=1}^n and K = {k_i}_{i=1}^n, and the corresponding landmark point sets Q̃ = {q̃_j}_{j=1}^m and K̃ = {k̃_j}_{j=1}^m, and using (17), the Nyström approximate self-attention converges to the true self-attention if there exist landmark points q̃_p and k̃_t such that q̃_p = q_i and k̃_t = k_j, ∀ i = 1, ..., n, j = 1, ..., n.

Lemma 2 suggests that if the landmark points overlap sufficiently with the original data points, the approximation to self-attention will be good. While the condition here is problem dependent, we note that it is feasible to achieve an accurate approximation without using a large number of landmarks. This is because (Oglic and Gärtner 2017) point out that the error of the Nyström approximation depends on the spectrum of the matrix being approximated and decreases with its rank. When this result is compared with the observation in (Wang et al. 2020) that self-attention is low-rank, stronger guarantees based on structural properties of the matrix that we wish to approximate are possible.

Figure 4: The proposed architecture of efficient self-attention via Nyström approximation. Each box represents an input, output, or intermediate matrix; the variable name and the size of the matrix are inside the box. × denotes matrix multiplication and + denotes matrix addition. The orange boxes are the matrices used in the Nyström approximation, and the green boxes are the skip connection added in parallel to the approximation. The dashed bounding box illustrates the three matrices of the Nyström approximate softmax matrix in self-attention in Eq. (15). sMEANS is the landmark selection using Segment-means (averaging m segments of the input sequence), pINV is the iterative Moore-Penrose pseudoinverse approximation, and DConv denotes depthwise convolution.

Our Model: Nyströmformer
Architecture. Our proposed architecture is shown in Fig. 4. Given the input key K and query Q, our model first uses Segment-means to compute the landmark points as matrices K̃ and Q̃. With the landmark points, our model then calculates the Nyström approximation using the approximate Moore-Penrose pseudoinverse. A skip connection on the value V, implemented using a 1D depthwise convolution, is also added to the model to help training.
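A rough sketch of how such a module might be wired up is given below (our own illustration; the projection sizes, landmark count, and convolution kernel size are placeholders rather than the released configuration, and nystrom_attention refers to the sketch above):

```python
import torch
from torch import nn

class NystromSelfAttention(nn.Module):
    """Nystrom attention plus a depthwise-convolution skip connection on V (sketch)."""
    def __init__(self, d_model, num_landmarks=64, conv_kernel=33):
        super().__init__()
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.num_landmarks = num_landmarks
        # depthwise 1D convolution over the sequence dimension (groups = channels)
        self.dconv = nn.Conv1d(d_model, d_model, conv_kernel,
                               padding=conv_kernel // 2, groups=d_model)

    def forward(self, X):                                    # X: n x d_model
        Q, K, V = self.W_q(X), self.W_k(X), self.W_v(X)
        out = nystrom_attention(Q, K, V, m=self.num_landmarks)
        skip = self.dconv(V.T.unsqueeze(0)).squeeze(0).T     # DConv skip on V
        return out + skip
```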
Experiments

We now present our experiments and results. Our experiments follow a transfer learning setting that consists of two stages. In the first stage, we train our Nyströmformer on a large-scale text corpus and report the language modeling performance of our model on a held-out validation set. In the second stage, we fine-tune the pre-trained Nyströmformer on several different NLP tasks in the GLUE benchmark (Wang et al. 2019) and IMDB reviews (Maas et al. 2011), and report the performance on the individual datasets for each task. In both stages, we compare our results to a baseline Transformer model (BERT).

(Pre-)training of Language Modeling
Our first experiment evaluates whether our model can achieve performance similar to a standard Transformer on language modeling with reduced complexity. We introduce the dataset and evaluation protocol, describe implementation details, and finally present the results of our model.
Dataset and metric. We consider BookCorpus plus English Wikipedia as the training corpus, which is further split into training (80%) and validation (20%) sets. Our model is trained using the training set. We report the masked-language-modeling (MLM) and sentence-order-prediction (SOP) accuracy on the validation set, and compare the efficiency (runtime and memory consumption) of our model to a baseline model.
Baselines. Our baseline is a well-known Transformer-based model – BERT (Devlin et al. 2019). Specifically, we consider two variants of BERT:
• BERT-small is a lightweight BERT model with 4 layers. We use BERT-small to compare to linear Transformers, including ELU linearized self-attention (Katharopoulos et al. 2020) and Linformer (Wang et al. 2020).
• BERT-base is the base model from (Devlin et al. 2019). We use this model as our baseline when fine-tuning on downstream NLP tasks.

Our Nyströmformer replaces the self-attention in BERT-small and BERT-base with the proposed Nyström approximation. We acknowledge that several very recent articles (Zaheer et al. 2020; Beltagy, Peters, and Cohan 2020), concurrent with our work, have also proposed efficient O(n) self-attention for Transformers. An exhaustive comparison to a rapidly growing set of algorithms is prohibitive unless extensive compute resources are freely available. Thus, we only compare the runtime performance and memory consumption of our method to Linformer (Wang et al. 2020) and Longformer (Beltagy, Peters, and Cohan 2020) in Table 1.

Table 1: Memory consumption and running time for various input sequence lengths. We report the average memory consumption (MB) and running time (ms) for one input instance of different lengths through the self-attention module; the factor in parentheses is the saving/speed-up relative to standard self-attention. Nyströmformer-64 denotes the Nyströmformer self-attention module using 64 landmarks and Nyströmformer-32 the module using 32 landmarks. Linformer-256 denotes the Linformer self-attention module with linear projection dimension 256. Longformer-257 denotes Longformer self-attention with a sliding window of size 257. Our Nyström self-attention offers favorable memory and time efficiency over standard self-attention and Longformer self-attention. At a length of 8192, our model offers a 1.2× memory saving and a 3× speed-up over Longformer, and a 1.7× memory saving over Linformer with similar running time.

n = 512:  Transformer 54 MB / 0.8 ms; Linformer-256 41 (1.3×) / 0.7 (1.1×); Longformer-257 32.2 (1.7×) / 2.4 (0.3×); Nyströmformer-64 35 (1.5×) / 0.7 (1.1×); Nyströmformer-32 26 (2.1×) / 0.6 (1.2×)
n = 1024: Transformer 186 / 2.4; Linformer-256 81 (2.3×) / 1.3 (1.8×); Longformer-257 65 (2.9×) / 4.6 (0.5×); Nyströmformer-64 63 (3.0×) / 1.3 (1.8×); Nyströmformer-32 49 (3.8×) / 1.2 (1.9×)
n = 2048: Transformer 685 / 10.0; Linformer-256 165 (4.2×) / 2.7 (3.6×); Longformer-257 130 (5.3×) / 9.2 (1.0×); Nyströmformer-64 118 (5.8×) / 2.7 (3.6×); Nyströmformer-32 96 (7.1×) / 2.6 (3.7×)
n = 4096: Transformer 2620 / 32.9; Linformer-256 366 (7.2×) / 5.3 (6.2×); Longformer-257 263 (10.0×) / 18.5 (1.8×); Nyströmformer-64 229 (11.5×) / 5.9 (5.6×); Nyströmformer-32 193 (13.6×) / 5.6 (5.9×)
n = 8192: Transformer 10233 / 155.4; Linformer-256 635 (16.1×) / 11.3 (13.8×); Longformer-257 455 (22.5×) / 36.2 (4.3×); Nyströmformer-64 450 (22.8×) / 12.3 (12.7×); Nyströmformer-32 383 (26.7×) / 11.5 (13.4×)

Implementation details. Our model is pre-trained with the masked-language-modeling (MLM) and sentence-order-prediction (SOP) objectives (Lan et al. 2020). We use a batch size of 256, the Adam optimizer with learning rate 1e-4 and L2 weight decay, learning rate warm-up over the first 10,000 steps, and linear learning rate decay to update our model. Training BERT-base for the full number of update steps takes more than one week on 8 V100 GPUs. To keep compute costs reasonable, our baseline (BERT-base) and our model are trained with 0.5M steps. We also train our model with a reduced number of steps; more details are available in the supplement.
Results on accuracy and efficiency. We report the validation accuracy and inference efficiency of our model and compare the results to Transformer-based models. In Fig. 5 and 6, we plot the MLM and SOP pre-training validation accuracy, which shows that Nyströmformer is comparable to a standard Transformer and outperforms other variants of efficient Transformers. We also note the computation and memory efficiency of our model in Table 1. To evaluate the inference time and memory efficiency, we generate random inputs for the self-attention module with sequence length n ∈ {512, 1024, 2048, 4096, 8192}. All models are evaluated on the same machine with an Nvidia 1080Ti, and we report the improved inference speed and memory savings.

Figure 5: Results on masked-language-modeling (MLM) and sentence-order-prediction (SOP). On BERT-small, our Nyström self-attention is competitive with standard self-attention, outperforming Linformer and other linear self-attentions.
Fine-tuning on Downstream NLP tasks
Our second experiment is designed to test the generalization ability of our model on downstream NLP tasks. To this end, we fine-tune the pretrained model across several NLP tasks.
Datasets and metrics. We consider the datasets of SST-2 (Socher et al. 2013), MRPC (Dolan and Brockett 2005), QNLI (Rajpurkar et al. 2016), QQP (Chen et al. 2018), and MNLI (Williams, Nangia, and Bowman 2018) in the GLUE benchmark, and IMDB reviews (Maas et al. 2011). We follow the standard evaluation protocols, fine-tune the pre-trained model on the training set, report the results on the validation set, and compare them to our baseline BERT-base.

Figure 6: Results on MLM and SOP. We report MLM and SOP validation accuracy for each training step. BERT-base (from scratch) is trained with 0.5M steps, our Nyströmformer (from scratch) is trained with 0.5M steps as BERT-base (from scratch), and our Nyströmformer (from standard) is trained with fewer steps starting from a standard pre-trained model.

Table 2: Results on natural language understanding tasks. We report the F1 score for MRPC and QQP and accuracy for the others. Our Nyströmformer performs competitively with BERT-base.

Model           SST-2  MRPC  QNLI  QQP   MNLI m/mm  IMDB
BERT-base       90.0   88.4  90.3  87.3  82.4/82.4  93.3
Nyströmformer   91.4   88.1  88.7  86.3  80.9/82.2  93.2
Implementation details. We fine-tune our pre-trained model on the GLUE benchmark datasets and IMDB reviews respectively and report its final performance. For the larger datasets (SST-2, QNLI, QQP, MNLI, IMDB reviews), we use a batch size of 32 and the AdamW optimizer with learning rate 3e-5, and fine-tune our models for 4 epochs. For MRPC, due to the sensitivity of a smaller dataset, we follow (Devlin et al. 2019) by performing a hyperparameter search over candidate batch sizes {8, 16, 32} and a small set of candidate learning rates, and select the best validation result. As these downstream tasks do not exceed the maximum input sequence length of 512, we fine-tune our model trained on an input sequence length of 512.

Results. Table 2 presents our experimental results on natural language understanding benchmarks with different tasks. Our results compare favorably to BERT-base across all downstream tasks. Moreover, we also experiment with fine-tuning our model using longer sequences (n = 1024), yet the results remain almost identical to n = 512, e.g., 93.0 vs. 93.2 accuracy on IMDB reviews. These results further suggest that our model is able to scale linearly with input length. Additional details on longer sequences are in the supplement and project webpage.

Conclusion
It is becoming clear that scaling Transformer-based models to longer sequences, desirable in both NLP and computer vision, will involve identifying mechanisms to mitigate their compute and memory requirements. Within the last year, this need has led to a number of results describing how randomized numerical linear algebra schemes based on random projections and low-rank assumptions can help (Katharopoulos et al. 2020; Wang et al. 2020; Beltagy, Peters, and Cohan 2020; Zaheer et al. 2020). In this paper, we approach this task differently by showing how the Nyström method, a widely used strategy for matrix approximation, can be adapted and deployed within a deep Transformer architecture to provide an approximation of self-attention with high efficiency. We show that our design choices enable all key operations to be mapped to popular deep learning libraries in a convenient way. The algorithm maintains the performance profile of other self-attention approximations in the literature but offers the additional benefit of better resource utilization. Overall, we believe that our work is a step towards running Transformer models on very long sequences. Our code and supplement are available at our project webpage https://github.com/mlpen/Nystromformer.
Acknowledgments
This work was supported in part by American Family Insurance, NSF CAREER award RI 1252725, and UW CPCP (U54AI117924). We thank Denny Zhou, Hongkun Yu, and Adam Yu for discussions and help with some of the experiments.
References
Baker, C. T. 1977.
The numerical treatment of integral equations. Clarendon Press.
Beltagy, I.; Peters, M. E.; and Cohan, A. 2020. Longformer: The Long-Document Transformer. arXiv:2004.05150.
Blanc, G.; and Rendle, S. 2018. Adaptive sampled softmax with kernel based sampling. In
Proceedings of the International Conference on Machine Learning (ICML), 590–599.
Bremer, J. 2012. On the Nyström discretization of integral equations on planar curves with corners.
Applied and Computational Harmonic Analysis.
Brown, T. B.; et al. 2020. Language Models are Few-Shot Learners. arXiv preprint arXiv:2005.14165.
Chen, T.; Xu, B.; Zhang, C.; and Guestrin, C. 2016. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174.
Child, R.; Gray, S.; Radford, A.; and Sutskever, I. 2019. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509.
Clark, K.; Luong, M.-T.; Le, Q. V.; and Manning, C. D. 2019. ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators. In
International Conference on Learning Repre-sentations (ICLR) .Dai, Z.; Yang, Z.; Yang, Y.; Carbonell, J. G.; Le, Q.; and Salakhut-dinov, R. 2019. Transformer-XL: Attentive Language Models be-yond a Fixed-Length Context. In
Proceedings of the Annual Meet-ing of the Association for Computational Linguistics (ACL) , 2978–2988.Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2019. BERT:Pre-training of Deep Bidirectional Transformers for Language Un-derstanding. In
Proceedings of the Conference of the North Ameri-can Chapter of the Association for Computational Linguistics: Hu-man Language Technologies (NAACL-HLT) .Dolan, W. B.; and Brockett, C. 2005. Automatically constructinga corpus of sentential paraphrases. In
Proceedings of the ThirdInternational Workshop on Paraphrasing (IWP2005) .Drineas, P.; and Mahoney, M. W. 2005. On the Nystr¨om methodfor approximating a Gram matrix for improved kernel-based learn-ing.
Journal of Machine Learning Research (JMLR).
Fanuel, M.; Schreurs, J.; and Suykens, J. A. K. 2019. Nyström landmark sampling and regularized Christoffel functions. arXiv preprint arXiv:1905.12346.
Howard, J.; and Ruder, S. 2018. Universal Language Model Fine-tuning for Text Classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL), 328–339.
Huang, Y.; Cheng, Y.; Bapna, A.; Firat, O.; Chen, D.; Chen, M.; Lee, H.; Ngiam, J.; Le, Q. V.; Wu, Y.; et al. 2019. GPipe: Efficient training of giant neural networks using pipeline parallelism. In
Advances in Neural Information Processing Systems (NeurIPS) ,103–112.Katharopoulos, A.; Vyas, A.; Pappas, N.; and Fleuret, F. 2020.Transformers are RNNs: Fast Autoregressive Transformers withLinear Attention. In
Proceedings of the International Conferenceon Machine Learning (ICML) .Kitaev, N.; Kaiser, L.; and Levskaya, A. 2019. Reformer: The Ef-ficient Transformer. In
International Conference on Learning Rep-resentations (ICLR) .Kumar, S.; Mohri, M.; and Talwalkar, A. 2009. Ensemble Nystr¨ommethod. In
Advances in Neural Information Processing Systems(NeurIPS) , 1060–1068.Lample, G.; Sablayrolles, A.; Ranzato, M.; Denoyer, L.; and J´egou,H. 2019. Large memory layers with product keys. In
Advances inNeural Information Processing Systems (NeurIPS) , 8548–8559.Lan, Z.; Chen, M.; Goodman, S.; Gimpel, K.; Sharma, P.; and Sori-cut, R. 2020. ALBERT: A lite BERT for self-supervised learning oflanguage representations. In
International Conference on LearningRepresentations (ICLR) .ee, J.; Lee, Y.; Kim, J.; Kosiorek, A.; Choi, S.; and Teh,Y. W. 2019. Set transformer: A framework for attention-basedpermutation-invariant neural networks. In
International Confer-ence on Machine Learning , 3744–3753. PMLR.Lewis, M.; Liu, Y.; Goyal, N.; Ghazvininejad, M.; Mohamed, A.;Levy, O.; Stoyanov, V.; and Zettlemoyer, L. 2020. BART: Denois-ing sequence-to-sequence pre-training for natural language gener-ation, translation, and comprehension. 7871–7880. Association forComputational Linguistics.Li, M.; Kwok, J. T.-Y.; and L¨u, B. 2010. Making large-scaleNystr¨om approximation possible. In
Proceedings of the Interna-tional Conference on Machine Learning (ICML) , 631.Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.;Lewis, M.; Zettlemoyer, L.; and Stoyanov, V. 2019. RoBERTa:A robustly optimized BERT pretraining approach. arXiv preprintarXiv:1907.11692 .Maas, A.; Daly, R. E.; Pham, P. T.; Huang, D.; Ng, A. Y.; and Potts,C. 2011. Learning word vectors for sentiment analysis. In
Proceed-ings of the 49th Annual Meeting of the Association for Computa-tional Linguistics (ACL): Human language technologies , 142–150.Michel, P.; Levy, O.; and Neubig, G. 2019. Are sixteen heads reallybetter than one? In
Advances in Neural Information ProcessingSystems , 14014–14024.Musco, C.; and Musco, C. 2017. Recursive sampling for the nys-trom method. In
Advances in Neural Information Processing Sys-tems (NeurIPS) , 3833–3845.Nemtsov, A.; Averbuch, A.; and Schclar, A. 2016. Matrix compres-sion using the Nystr¨om method.
Intelligent Data Analysis
Journal of Machine Learning Re-search (JMLR)
Pan, V.; and Schreiber, R. 1991. An improved Newton iteration for the generalized inverse of a matrix, with applications. SIAM Journal on Scientific and Statistical Computing.
Peters, M. E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; and Zettlemoyer, L. 2018. Deep contextualized word representations. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), 2227–2237.
Radford, A.; Narasimhan, K.; Salimans, T.; and Sutskever, I. 2018. Improving language understanding with unsupervised learning.
Technical report, OpenAI .Rajpurkar, P.; Zhang, J.; Lopyrev, K.; and Liang, P. 2016. SQuAD:100,000+ Questions for Machine Comprehension of Text. In
Pro-ceedings of the Conference on Empirical Methods in Natural Lan-guage Processing (EMNLP) , 2383–2392.Rawat, A. S.; Chen, J.; Yu, F. X. X.; Suresh, A. T.; and Kumar,S. 2019. Sampled softmax with random fourier features. In
Advances in Neural Information Processing Systems (NeurIPS) ,13857–13867.Razavi, M. K.; Kerayechian, A.; Gachpazan, M.; and Shateyi, S.2014. A new iterative method for finding approximate inverses ofcomplex matrices. In
Abstract and Applied Analysis .Sanh, V.; Debut, L.; Chaumond, J.; and Wolf, T. 2019. DistilBERT,a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108 . Shen, D.; Wang, G.; Wang, W.; Min, M. R.; Su, Q.; Zhang, Y.; Li,C.; Henao, R.; and Carin, L. 2018a. Baseline Needs More Love: OnSimple Word-Embedding-Based Models and Associated PoolingMechanisms. In
Proceedings of the 56th Annual Meeting of theAssociation for Computational Linguistics (ACL) , 440–450.Shen, Z.; Zhang, M.; Zhao, H.; Yi, S.; and Li, H. 2018b. Effi-cient Attention: Attention with Linear Complexities. arXiv preprintarXiv:1812.01243 .Si, S.; Hsieh, C.-J.; and Dhillon, I. 2016. Computationally efficientNystr¨om approximation using fast transforms. In
Proceedings ofthe International Conference on Machine Learning (ICML) , 2655–2663.Si, S.; Hsieh, C.-J.; and Dhillon, I. S. 2017. Memory efficient ker-nel approximation.
Journal of Machine Learning Research (JMLR)
Socher, R.; Perelygin, A.; Wu, J.; Chuang, J.; Manning, C. D.; Ng, A.; and Potts, C. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 1631–1642.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. In
Advances in Neural Information Processing Sys-tems (NeurIPS) , 5998–6008.Vyas, A.; Katharopoulos, A.; and Fleuret, F. 2020. Fast transform-ers with clustered attention.
Advances in Neural Information Pro-cessing Systems
International Con-ference on Learning Representations (ICLR) .Wang, A.; Singh, A.; Michael, J.; Hill, F.; Levy, O.; and Bowman,S. R. 2019. GLUE: A Multi-Task Benchmark and Analysis Plat-form for Natural Language Understanding. In the Proceedings ofICLR.Wang, S.; Li, B.; Khabsa, M.; Fang, H.; and Ma, H. 2020. Lin-former: Self-Attention with Linear Complexity. arXiv preprintarXiv:2006.04768 .Wang, S.; Zhang, C.; Qian, H.; and Zhang, Z. 2014. Improving themodified nystr¨om method using spectral shifting. In
Proceedingsof the 20th ACM SIGKDD International Conference on KnowledgeDiscovery and Data Mining (KDD) , 611–620.Wang, S.; and Zhang, Z. 2013. Improving CUR matrix decomposi-tion and the Nystr¨om approximation via adaptive sampling.
Jour-nal of Machine Learning Research (JMLR)
Jour-nal of Machine Learning Research (JMLR)
Williams, A.; Nangia, N.; and Bowman, S. R. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), 1112–1122.
Williams, C. K.; and Seeger, M. 2001. Using the Nyström method to speed up kernel machines. In
Advances in Neural InformationProcessing Systems (NeurIPS) , 682–688.Zafrir, O.; Boudoukh, G.; Izsak, P.; and Wasserblat, M. 2019.Q8BERT: Quantized 8bit BERT. In
NeurIPS Workshop on Energy Efficient Machine Learning and Cognitive Computing 2019.
Zaheer, M.; Guruganesh, G.; Dubey, A.; Ainslie, J.; Alberti, C.; Ontanon, S.; Pham, P.; Ravula, A.; Wang, Q.; Yang, L.; et al. 2020. Big Bird: Transformers for longer sequences. arXiv preprint arXiv:2007.14062.
Zhang, K.; and Kwok, J. T. 2010. Clustered Nyström method for large scale manifold learning and dimension reduction.
IEEE Transactions on Neural Networks.
Zhang, K.; Tsang, I. W.; and Kwok, J. T. 2008. Improved Nyström low-rank approximation and error analysis. In Proceedings of the International Conference on Machine Learning (ICML), 1232–1239.
Zhu, Y.; Kiros, R.; Zemel, R.; Salakhutdinov, R.; Urtasun, R.; Torralba, A.; and Fidler, S. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).