Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention
Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, Vikas Singh
University of Wisconsin-Madison · UC Berkeley · Google Brain · American Family Insurance
Abstract
Transformers have emerged as a powerful tool for a broad range of natural language processing tasks. A key component that drives the impressive performance of Transformers is the self-attention mechanism that encodes the influence or dependence of other tokens on each specific token. While beneficial, the quadratic complexity of self-attention on the input sequence length has limited its application to longer sequences – a topic being actively studied in the community. To address this limitation, we propose Nyströmformer – a model that exhibits favorable scalability as a function of sequence length. Our idea is based on adapting the Nyström method to approximate standard self-attention with O(n) complexity. The scalability of Nyströmformer enables application to longer sequences with thousands of tokens. We perform evaluations on multiple downstream tasks on the GLUE benchmark and IMDB reviews with standard sequence length, and find that our Nyströmformer performs comparably, or in a few cases even slightly better, than the standard Transformer. Our code is at https://github.com/mlpen/Nystromformer.

Introduction
Transformer-based models, such as BERT (Devlin et al. 2019) and GPT-3 (Brown et al. 2020), have been very successful in natural language processing (NLP), achieving state-of-the-art performance in machine translation (Vaswani et al. 2017), natural language inference (Williams, Nangia, and Bowman 2018), paraphrasing (Dolan and Brockett 2005), text classification (Howard and Ruder 2018), question answering (Rajpurkar et al. 2016) and many other NLP tasks (Peters et al. 2018; Radford et al. 2018). A key feature of Transformers is what is known as the self-attention mechanism (Vaswani et al. 2017), where each token's representation is computed from all other tokens. Self-attention enables interactions of token pairs across the full sequence and has been shown to be quite effective.

Despite the foregoing advantages, self-attention also turns out to be a major efficiency bottleneck since it has a memory and time complexity of O(n²), where n is the length of an input sequence. This leads to high memory and computational requirements for training large Transformer-based models; for example, training a BERT-large model (Devlin et al. 2019) is already resource intensive, and the O(n²) complexity makes it prohibitively expensive to train large Transformers with long sequences (e.g., n = 2048).

To address this challenge, several recent works have proposed strategies that avoid incurring the quadratic cost when dealing with longer input sequences. For example, (Dai et al. 2019) suggests a trade-off between memory and computational efficiency. The ideas described in (Child et al. 2019; Kitaev, Kaiser, and Levskaya 2019) decrease the self-attention complexity to O(n√n) and O(n log n) respectively. In (Shen et al. 2018b; Katharopoulos et al. 2020; Wang et al. 2020), self-attention complexity can be reduced to O(n) with various approximation ideas, each with its own strengths and limitations.

In this paper, we propose an O(n) approximation, both in the sense of memory and time, for self-attention. Our model, Nyströmformer, scales linearly with the input sequence length n. This is achieved by leveraging the celebrated Nyström method, repurposed for approximating self-attention. Specifically, our Nyströmformer algorithm makes use of landmark (or Nyström) points to reconstruct the softmax matrix in self-attention, thereby avoiding computing the n × n softmax matrix. We show that this yields a good approximation of the true self-attention.

To evaluate our method, we consider a transfer learning setting using Transformers, where models are first pretrained with a language modeling objective on a large corpus, and then finetuned on target tasks using supervised data (Devlin et al. 2019; Liu et al. 2019; Lewis et al. 2020; Wang et al. 2020). Following BERT (Devlin et al. 2019; Liu et al. 2019), we pretrain our proposed model on English Wikipedia and BookCorpus (Zhu et al. 2015) using a masked-language-modeling objective. We observe a similar performance to the baseline BERT model on English Wikipedia and BookCorpus. We then finetune our pretrained models on multiple downstream tasks in the GLUE benchmark (Wang et al. 2018) and IMDB reviews (Maas et al. 2011), and compare our results to BERT in both accuracy and efficiency. Across all tasks, our model compares favorably to the vanilla pretrained BERT with promising speedups. Our model also outperforms several recent efficient transformer models, thus providing a step towards resource-efficient Transformers.
Related Work

We briefly review a few results on efficient Transformers, linearized softmax kernels, and Nyström-like methods.
Efficient Transformers. Weight pruning (Michel, Levy, and Neubig 2019), weight factorization (Lan et al. 2020), weight quantization (Zafrir et al. 2019) or knowledge distillation (Sanh et al. 2019) are several strategies that have been proposed to improve memory efficiency in Transformers. The use of a new pretraining objective in (Clark et al. 2019), product-key attention in (Lample et al. 2019), and the Transformer-XL model in (Dai et al. 2019) have shown how the overall compute requirements can be reduced. In (Child et al. 2019), a sparse factorization of the attention matrix was used for reducing the overall complexity from quadratic to O(n√n) for generative modeling of long sequences. In (Kitaev, Kaiser, and Levskaya 2019), the Reformer model further reduces the complexity to O(n log n) via locality-sensitive hashing (LSH). This relies on performing fewer dot product operations overall by assuming that the keys need to be identical to the queries. Recently, in (Wang et al. 2020), the Linformer model suggested the use of random projections based on the JL lemma to reduce the complexity to O(n) with a linear projection step. The Longformer model in (Beltagy, Peters, and Cohan 2020) achieves O(n) complexity using a local windowed attention and a task-motivated global attention for longer documents, while BIGBIRD (Zaheer et al. 2020) uses a sparse attention mechanism. There are also other existing approaches to improve optimizer efficiency, such as microbatching (Huang et al. 2019) and gradient checkpointing (Chen et al. 2016).

Linearized Softmax. In (Blanc and Rendle 2018), an adaptive sampled softmax with kernel-based sampling was shown to speed up training. It involves sampling only some of the classes at each training step using a linear dot product approximation. In (Rawat et al. 2019), the Random Fourier Softmax (RF-softmax) idea uses random Fourier features to perform efficient sampling from an approximate softmax distribution for normalized embeddings. In (Shen et al. 2018b; Katharopoulos et al. 2020), linearizing the softmax attention in Transformers was based on heuristically separating keys and queries in a linear dot product approximation. While the idea is interesting, the approximation error to the softmax matrix in self-attention can be large in some cases.
Nyström-like Methods. Nyström-like methods sample columns of a matrix to achieve a close approximation to the original matrix. The Nyström method (Baker 1977) was developed as a way of discretizing an integral equation with a simple quadrature rule and remains a widely used approach for approximating a kernel matrix with a given sampled subset of columns (Williams and Seeger 2001). Many variants, such as Nyström with k-means (Zhang, Tsang, and Kwok 2008; Zhang and Kwok 2010), randomized Nyström (Li, Kwok, and Lü 2010), Nyström with spectral shift (Wang et al. 2014), Nyström with pseudo landmarks and the prototype method (Wang and Zhang 2013; Wang, Zhang, and Zhang 2016), fast-Nys (Si, Hsieh, and Dhillon 2016), MEKA (Si, Hsieh, and Dhillon 2017), and ensemble Nyström (Kumar, Mohri, and Talwalkar 2009), have been proposed for specific improvements over the basic Nyström approximation. In (Nemtsov, Averbuch, and Schclar 2016), the Nyström method was extended to deal with a general matrix (rather than a symmetric matrix). The authors in (Musco and Musco 2017) introduced the RLS-Nyström method, which proposes a recursive sampling approach to accelerate landmark point sampling. (Fanuel, Schreurs, and Suykens 2019) developed DAS (Deterministic Adaptive Sampling) and RAS (Randomized Adaptive Sampling) algorithms to promote diversity in landmark selection.

The ideas most related to our development are (Wang and Zhang 2013; Musco and Musco 2017). These approaches are designed for general matrix approximation (which accurately reflects our setup) while only sampling a subset of columns and rows. However, directly applying these methods to approximate the softmax matrix used by self-attention does not directly reduce the computational complexity. This is because even accessing a subset of columns or rows of the softmax matrix requires calculating all elements of the full matrix before the softmax function, and calculating these entries incurs a quadratic cost in our case. Nonetheless, inspired by the key idea of using a subset of columns to reconstruct the full matrix, we propose a Nyström approximation with O(n) complexity tailored for the softmax matrix, for efficiently computing self-attention.

Nyström-Based Linear Transformers
In this section, we start by briefly reviewing self-attention, then discuss the basic idea of the Nyström approximation method for the softmax matrix in self-attention, and finally adapt this idea to achieve our proposed construction.
Self-Attention
What is self-attention?
Self-attention calculates a weighted average of feature representations, with the weight proportional to a similarity score between pairs of representations. Formally, an input sequence of n tokens of dimension d, X ∈ R^{n×d}, is projected using three matrices W_Q ∈ R^{d×d_q}, W_K ∈ R^{d×d_k}, and W_V ∈ R^{d×d_v} to extract feature representations Q, K, and V, referred to as query, key, and value respectively, with d_k = d_q. The outputs Q, K, V are computed as

Q = XW_Q,  K = XW_K,  V = XW_V.   (1)

Self-attention can then be written as

S = D(Q, K, V) = softmax( QK^T / √d_q ) V,   (2)

where softmax denotes a row-wise softmax normalization function. Thus, each element in S depends on all other elements in the same row.

Compute cost of self-attention. The self-attention mechanism requires calculating n² similarity scores between all pairs of tokens, leading to a complexity of O(n²) for both memory and time. Due to this quadratic dependence on the input length, the application of self-attention is limited to short sequences. This is a key motivation for a resource-efficient self-attention module.
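As a point of reference, here is a minimal PyTorch sketch of this standard self-attention (our own illustration, not the authors' implementation; tensor names are ours). The n × n score matrix it materializes is exactly what drives the quadratic cost.

```python
import torch
import torch.nn.functional as F

def standard_self_attention(X, W_Q, W_K, W_V):
    """Vanilla self-attention as in Eq. (1)-(2); materializes an n x n matrix."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V            # each is n x d_*
    d_q = Q.shape[-1]
    scores = Q @ K.transpose(-1, -2) / d_q ** 0.5  # n x n -- O(n^2) memory and time
    S = F.softmax(scores, dim=-1)                  # row-wise softmax
    return S @ V                                   # n x d_v
```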
Nyström Method for Matrix Approximation

The starting point of our work is to reduce the computational cost of self-attention in Transformers using the Nyström method, widely adopted for matrix approximation (Williams and Seeger 2001; Drineas and Mahoney 2005; Wang and Zhang 2013). Following (Wang and Zhang 2013), we describe a potential strategy, and its challenges, for using the Nyström method to approximate the softmax matrix in self-attention by sampling a subset of its columns and rows.

Denote the softmax matrix used in self-attention by S = softmax(QK^T/√d_q) ∈ R^{n×n}. S can be written as

S = softmax( QK^T / √d_q ) = [ A_S  B_S ; F_S  C_S ],   (3)

where A_S ∈ R^{m×m}, B_S ∈ R^{m×(n−m)}, F_S ∈ R^{(n−m)×m} and C_S ∈ R^{(n−m)×(n−m)}. A_S is designated to be our sample matrix, obtained by sampling m columns and m rows from S.

Quadrature technique. S can be approximated via the basic quadrature technique of the Nyström method. It begins with the singular value decomposition (SVD) of the sample matrix, A_S = UΛV^T, where U, V ∈ R^{m×m} are orthogonal matrices and Λ ∈ R^{m×m} is a diagonal matrix. Based on the out-of-sample column approximation (Wang and Zhang 2013), the explicit Nyström form of S can be reconstructed from m columns and m rows of S as

Ŝ = [ A_S  B_S ; F_S  F_S A_S^+ B_S ] = [ A_S ; F_S ] A_S^+ [ A_S  B_S ],   (4)

where A_S^+ is the Moore-Penrose inverse of A_S, and C_S is approximated by F_S A_S^+ B_S. Here, (4) suggests that the n × n matrix S can be reconstructed by sampling m rows (A_S, B_S) and m columns (A_S, F_S) from S and forming the Nyström approximation Ŝ.
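As a toy illustration of (4) on a generic low-rank matrix (our own sketch; the random matrix and the leading-rows/columns "sampling" are hypothetical and unrelated to the softmax matrix discussed next), the reconstruction can be written as:

```python
import torch

n, m, r = 256, 32, 16
G = torch.randn(n, r)
S = G @ G.T                                 # a rank-r n x n matrix to be approximated
A = S[:m, :m]                               # A_S: intersection of the sampled rows/columns
B = S[:m, m:]                               # B_S
Fm = S[m:, :m]                              # F_S
top = torch.cat([A, B], dim=1)              # [A_S  B_S]   (m x n): m sampled rows
left = torch.cat([A, Fm], dim=0)            # [A_S; F_S]   (n x m): m sampled columns
S_hat = left @ torch.linalg.pinv(A) @ top   # Eq. (4)
print(torch.norm(S - S_hat) / torch.norm(S))  # typically ~0 when rank(S) <= m
```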
Nyström approximation for the softmax matrix. We briefly discuss how to construct the out-of-sample approximation for the softmax matrix in self-attention using the standard Nyström method. Given a query q_i and key k_j, let

K_K(q_i) = softmax( q_i K^T / √d_q ),   K_Q(k_j) = softmax( Q k_j^T / √d_q ),

where K_K(q_i) ∈ R^{1×n} and K_Q(k_j) ∈ R^{n×1}. We can then construct

φ_K(q_i) = Λ^{-1/2} V^T [K_K(q_i)^T]_{m×1},   φ_Q(k_j) = Λ^{-1/2} U^T [K_Q(k_j)]_{m×1},

where [·]_{m×1} refers to calculating the full n × 1 vector and then taking its first m × 1 entries. With φ_K(q_i) and φ_Q(k_j) in hand, the entry of Ŝ for the standard Nyström approximation is calculated as

Ŝ_ij = φ_K(q_i)^T φ_Q(k_j),   ∀ i = 1, ..., n,  j = 1, ..., n.   (5)

Figure 1:
A key challenge of Nyström approximation. The orange block on the left shows an n × m sub-matrix of S used by the Nyström matrix approximation in (4). Computing this sub-matrix, however, requires all entries of the n × n matrix before the softmax function (QK^T). Therefore, a direct application of the Nyström approximation has the same O(n²) complexity.

In matrix form, Ŝ can be represented as

Ŝ = [ softmax( QK^T / √d_q ) ]_{n×m} A_S^+ [ softmax( QK^T / √d_q ) ]_{m×n},   (6)

where [·]_{n×m} refers to taking m columns from the n × n matrix and [·]_{m×n} refers to taking m rows from the n × n matrix. This representation is the application of (4) to softmax matrix approximation in self-attention: [A_S; F_S] in (4) corresponds to the first n × m matrix in (6) and [A_S  B_S] in (4) corresponds to the last m × n matrix in (6). More details of the matrix representation are available in the supplement.

A key challenge of Nyström approximation.
Unfortunately, (4) and (6) require calculating all entries in QK^T due to the softmax function, even though the approximation only needs to access a subset of the columns of S, i.e., [A_S; F_S]. The problem arises due to the denominator within the row-wise softmax function. Specifically, computing an element of S requires a summation of the exponentials of all elements in the same row of QK^T. Thus, calculating [A_S; F_S] requires access to the full QK^T, as shown in Fig. 1, and directly applying the Nyström approximation as in (4) is not attractive.

Linearized Self-Attention via Nyström Method
We now adapt the Nyström method to approximately calculate the full softmax matrix S. The basic idea is to use landmarks K̃ and Q̃, derived from the keys K and queries Q, to obtain an efficient Nyström approximation without accessing the full QK^T. When the number of landmarks, m, is much smaller than the sequence length n, our Nyström approximation scales linearly w.r.t. the input sequence length in the sense of both memory and time.

Following the Nyström method, we also start with the SVD of a smaller matrix, A_S, and apply the basic quadrature technique. But instead of subsampling the matrix after the softmax operation, we select landmarks Q̃ from the queries Q and K̃ from the keys K before the softmax, and then form an m × m matrix A_S by applying the softmax operation on the landmarks. We also form the matrices corresponding to the left and right matrices in (4) using the landmarks Q̃ and K̃. This provides an n × m matrix and an m × n matrix respectively. With these three n × m, m × m, m × n matrices, our Nyström approximation of the n × n matrix S involves the multiplication of three matrices as in (4).

Figure 2: Illustration of the Nyström approximation of the softmax matrix in self-attention. The left image shows the true softmax matrix used in self-attention and the right images show its Nyström approximation, computed via the multiplication of three matrices.

In the description that follows, we first define the matrix form of the landmarks. Then, based on the landmark matrices, we form the three matrices needed for our approximation.

Definition 1.
Let us assume that the selected landmarks for the inputs Q = [q_1; ...; q_n] and K = [k_1; ...; k_n] are {q̃_j}_{j=1}^m and {k̃_j}_{j=1}^m respectively. We denote the matrix form of the corresponding landmarks as

Q̃ = [q̃_1; ...; q̃_m] ∈ R^{m×d_q} for {q̃_j}_{j=1}^m,   K̃ = [k̃_1; ...; k̃_m] ∈ R^{m×d_q} for {k̃_j}_{j=1}^m.

The corresponding m × m matrix is generated by

A_S = softmax( Q̃K̃^T / √d_q ),   with SVD   A_S = U_{m×m} Λ_{m×m} V^T_{m×m}.

Note that in the SVD of A_S, U_{m×m} and V_{m×m} are orthogonal matrices. Similar to the out-of-sample approximation procedure for the standard Nyström scheme described above, given a query q_i and key k_j, let

K_K̃(q_i) = softmax( q_i K̃^T / √d_q ),   K_Q̃(k_j) = softmax( Q̃ k_j^T / √d_q ),

where K_K̃(q_i) ∈ R^{1×m} and K_Q̃(k_j) ∈ R^{m×1}. We can then construct

φ_K̃(q_i) = Λ^{-1/2}_{m×m} V^T_{m×m} K_K̃(q_i)^T,   φ_Q̃(k_j) = Λ^{-1/2}_{m×m} U^T_{m×m} K_Q̃(k_j).

The entries of Ŝ thus depend on the landmark matrices K̃ and Q̃ and are calculated as

Ŝ_ij = φ_K̃(q_i)^T φ_Q̃(k_j),   ∀ i = 1, ..., n,  j = 1, ..., n.   (7)

To derive the explicit Nyström form Ŝ of the softmax matrix with the three n × m, m × m, m × n matrices, we first assume that A_S is non-singular, so that the above expressions defining φ_K̃ and φ_Q̃ are meaningful. We will shortly relax this assumption to achieve the general form as in (4).

Algorithm 1:
Pipeline for the Nyström approximation of the softmax matrix in self-attention

Input: Query Q and Key K.
Output: Nyström approximation of the softmax matrix.
1. Compute landmarks from the input Q and landmarks from the input K, in matrix form Q̃ and K̃.
2. Compute F̃ = softmax( QK̃^T / √d_q ) and B̃ = softmax( Q̃K^T / √d_q ).
3. Compute Ã = ( softmax( Q̃K̃^T / √d_q ) )^+.
4. Return F̃ × Ã × B̃.

When A_S is non-singular,

Ŝ_ij = φ_K̃(q_i)^T φ_Q̃(k_j)   (8)
     = K_K̃(q_i) V_{m×m} Λ^{-1}_{m×m} U^T_{m×m} K_Q̃(k_j).   (9)

Let W_m = V_{m×m} Λ^{-1}_{m×m} U^T_{m×m}. Recall that an SVD of A_S is U_{m×m} Λ_{m×m} V^T_{m×m}, and so W_m A_S = I_{m×m}. Therefore,

Ŝ_ij = K_K̃(q_i) A_S^{-1} K_Q̃(k_j).   (10)

Based on (10), we can rewrite it in a form similar to (4) (i.e., not requiring that A_S is non-singular) as

Ŝ_ij = K_K̃(q_i) A_S^+ K_Q̃(k_j),   (11)

where A_S^+ is the Moore-Penrose pseudoinverse of A_S. So,

Ŝ_ij = softmax( q_i K̃^T / √d_q ) A_S^+ softmax( Q̃ k_j^T / √d_q ),   (12)

for i, j ∈ {1, ..., n}. The Nyström form of the softmax matrix S = softmax(QK^T/√d_q) is thus approximated as

Ŝ = softmax( QK̃^T / √d_q ) ( softmax( Q̃K̃^T / √d_q ) )^+ softmax( Q̃K^T / √d_q ).   (13)

Note that we arrive at (13) via an out-of-sample approximation similar to (4). The key difference is that in (13) the landmarks are selected before the softmax operation to generate the out-of-sample approximation. This avoids the need to compute the full softmax matrix S for a Nyström approximation. Fig. 2 illustrates the proposed Nyström approximation and Alg. 1 summarizes our method.

We now describe (a) the calculation of the Moore-Penrose inverse and (b) the selection of landmarks.

Moore-Penrose inverse computation. The Moore-Penrose pseudoinverse can be calculated using a singular value decomposition. However, SVD is not very efficient on GPUs. To accelerate the computation, we use an iterative method from (Razavi et al. 2014) to approximate the Moore-Penrose inverse via efficient matrix-matrix multiplications.
Lemma 1.
For A_S ∈ R^{m×m}, the sequence {Z_j}_{j=0}^∞ generated by (Razavi et al. 2014),

Z_{j+1} = (1/4) Z_j ( 13 I − A_S Z_j ( 15 I − A_S Z_j ( 7 I − A_S Z_j ) ) ),   (14)

converges to the Moore-Penrose inverse A_S^+ with third-order convergence, provided the initial approximation Z_0 satisfies ||A_S A_S^+ − A_S Z_0|| < 1.

We select Z_0 = A_S^T / ( ||A_S||_1 ||A_S||_∞ ), where

||A_S||_1 = max_j Σ_{i=1}^m |(A_S)_ij|,   ||A_S||_∞ = max_i Σ_{j=1}^m |(A_S)_ij|,

based on (Pan and Schreiber 1991). This choice ensures that ||I − A_S Z_0|| < 1. When A_S is non-singular, ||A_S A_S^+ − A_S Z_0|| = ||I − A_S Z_0|| < 1. Without the non-singularity constraint, this choice of Z_0 still provides a good approximation in our experiments. For all our experiments, a small number of iterations suffices to achieve a good approximation of the pseudoinverse.

Let A_S^+ be approximated by Z* via (14). Our Nyström approximation of S can then be written as

Ŝ = softmax( QK̃^T / √d_q ) Z* softmax( Q̃K^T / √d_q ).   (15)

Since (15) only needs matrix-matrix multiplications, the gradient computation is straightforward.
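A minimal PyTorch sketch of this iteration (our own code, not the released implementation; the transposed, norm-scaled initialization follows Pan and Schreiber (1991), and the default iteration count is our choice) might look like:

```python
import torch

def iterative_pinv(A, n_iter=6):
    """Approximate Moore-Penrose pseudoinverse of a single m x m matrix A via Eq. (14)."""
    I = torch.eye(A.shape[-1], device=A.device, dtype=A.dtype)
    # Z_0 = A^T / (||A||_1 * ||A||_inf), following Pan and Schreiber (1991)
    Z = A.transpose(-1, -2) / (A.abs().sum(dim=-2).max() * A.abs().sum(dim=-1).max())
    for _ in range(n_iter):            # a handful of iterations is typically enough
        AZ = A @ Z
        Z = 0.25 * Z @ (13 * I - AZ @ (15 * I - AZ @ (7 * I - AZ)))
    return Z
```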
Landmark selection. Landmark points (inducing points (Lee et al. 2019)) can be selected using K-means clustering (Zhang, Tsang, and Kwok 2008; Vyas, Katharopoulos, and Fleuret 2020). However, the EM-style updates of K-means are less desirable during mini-batch training. We propose to simply use Segment-means, similar to the local average pooling previously used in the NLP literature (Shen et al. 2018a). Specifically, for input queries Q = [q_1; ...; q_n], we separate the n queries into m segments. As we can pad inputs to a length divisible by m, we assume n is divisible by m for simplicity. Let l = n/m. Landmark points for Q and for the input keys K = [k_1; ...; k_n] are computed as

q̃_j = (1/l) Σ_{i=(j−1)l+1}^{jl} q_i,   k̃_j = (1/l) Σ_{i=(j−1)l+1}^{jl} k_i,   j = 1, ..., m.   (16)

Segment-means requires a single scan of the sequence to compute the landmarks, leading to a complexity of O(n). We find that a small number of landmarks is often sufficient to ensure a good approximation, although this depends on the application. More details regarding the landmark selection are in the supplement.

Approximate self-attention. With the landmark points and the pseudoinverse computed, the Nyström approximation of the softmax matrix can be calculated. By plugging in the Nyström approximation, we obtain a linearized version ŜV that approximates the true self-attention SV,

ŜV = softmax( QK̃^T / √d_q ) Z* softmax( Q̃K^T / √d_q ) V.   (17)

Fig. 3 presents an example of the fidelity of the Nyström approximate self-attention versus true self-attention.
Figure 3: An example of Nyström approximation vs. ground-truth self-attention. Top: standard self-attention computed by (2). Bottom: self-attention from our proposed Nyström approximation in (17). We see that the attention patterns are quite similar.
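Putting the pieces together, the following sketch (our own code; single attention head, 2-D tensors, n assumed divisible by the number of landmarks m, reusing the iterative_pinv helper sketched above) computes the approximation in (16)-(17) without ever forming an n × n matrix:

```python
import torch
from torch.nn.functional import softmax

def nystrom_attention(Q, K, V, m=64):
    """Nystrom-approximated self-attention, Eq. (17): (n x m)(m x m)(m x n) times V."""
    n, d_q = Q.shape
    assert n % m == 0, "pad the sequence so n is divisible by the number of landmarks m"
    # Segment-means landmarks, Eq. (16): average l = n/m consecutive queries / keys
    Q_tilde = Q.reshape(m, n // m, d_q).mean(dim=1)              # m x d_q
    K_tilde = K.reshape(m, n // m, d_q).mean(dim=1)              # m x d_q
    F_tilde = softmax(Q @ K_tilde.T / d_q ** 0.5, dim=-1)        # n x m
    A_tilde = softmax(Q_tilde @ K_tilde.T / d_q ** 0.5, dim=-1)  # m x m
    B_tilde = softmax(Q_tilde @ K.T / d_q ** 0.5, dim=-1)        # m x n
    # Multiply right-to-left so every intermediate stays O(n) in size
    return F_tilde @ (iterative_pinv(A_tilde) @ (B_tilde @ V))   # n x d_v
```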
Complexity analysis. We now provide a complexity analysis of the Nyström approximation, which needs to account for landmark selection, pseudoinverse calculation, and the matrix multiplications. Landmark selection using Segment-means takes O(n). Iterative approximation of the pseudoinverse takes O(m³) in the worst case. The matrix multiplication first calculates softmax(QK̃^T/√d_q) × Z* and softmax(Q̃K^T/√d_q) × V, and then calculates the product (softmax(QK̃^T/√d_q) × Z*) × (softmax(Q̃K^T/√d_q) × V). This costs O(nm² + mnd_v + nmd_v). The overall time complexity is thus O(n + m³ + nm² + mnd_v + nmd_v). For memory cost, storing the landmark matrices Q̃ and K̃ involves an O(md_q) cost, and storing the four matrices of the Nyström approximation has an O(nm + m² + mn + nd_v) cost. Thus, the memory footprint is O(md_q + nm + m² + mn + nd_v). When the number of landmarks m ≪ n, the time and memory complexity of our Nyström approximation is O(n), i.e., it scales linearly w.r.t. the input sequence length n.

Analysis of Nyström Approximation
The following simple result states that the Galerkin discretization of φ_K̃(q)^T φ_Q̃(k) with the same set of quadrature and landmark points induces the same Nyström matrix, in particular, the same n × n Nyström approximation Ŝ. This result agrees with the discussion in (Bremer 2012).

Lemma 2.
Given the input data sets Q = {q_i}_{i=1}^n and K = {k_i}_{i=1}^n, and the corresponding landmark point sets Q̃ = {q̃_j}_{j=1}^m and K̃ = {k̃_j}_{j=1}^m, and using (17), the Nyström approximate self-attention converges to the true self-attention if there exist landmark points q̃_p and k̃_t such that q̃_p = q_i and k̃_t = k_j, ∀ i = 1, ..., n, j = 1, ..., n.

Lemma 2 suggests that if the landmark points overlap sufficiently with the original data points, the approximation to self-attention will be good. While the condition here is problem dependent, we note that it is feasible to achieve an accurate approximation without using a large number of landmarks. This is because (Oglic and Gärtner 2017) point out that the error of the Nyström approximation depends on the spectrum of the matrix being approximated and decreases with its rank. When this result is compared with the observation in (Wang et al. 2020) that self-attention is low-rank, stronger guarantees based on structural properties of the matrix that we wish to approximate are possible.

Figure 4: The proposed architecture of efficient self-attention via Nyström approximation. Each box represents an input, output, or intermediate matrix; the variable name and the size of the matrix are inside the box. × denotes matrix multiplication and + denotes matrix addition. The orange boxes are the matrices used in the Nyström approximation, and the green boxes are the skip connection added in parallel to the approximation. The dashed bounding box illustrates the three matrices of the Nyström approximate softmax matrix in self-attention in Eq. (15). sMEANS is the landmark selection using Segment-means (averaging m segments of the input sequence), pINV is the iterative Moore-Penrose pseudoinverse approximation, and DConv denotes depthwise convolution.

Our Model: Nyströmformer
Architecture. Our proposed architecture is shown in Fig. 4. Given the input key K and query Q, our model first uses Segment-means to compute the landmark points as matrices K̃ and Q̃. With the landmark points, our model then calculates the Nyström approximation using the approximate Moore-Penrose pseudoinverse. A skip connection on the value V, implemented using a 1D depthwise convolution, is also added to the model to help training.
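A rough sketch of how such a module might be wired up is given below (our own illustration; the projection sizes, landmark count, and convolution kernel size are placeholders rather than the released configuration, and nystrom_attention refers to the sketch above):

```python
import torch
from torch import nn

class NystromSelfAttention(nn.Module):
    """Nystrom attention plus a depthwise-convolution skip connection on V (sketch)."""
    def __init__(self, d_model, num_landmarks=64, conv_kernel=33):
        super().__init__()
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.num_landmarks = num_landmarks
        # depthwise 1D convolution over the sequence dimension (groups = channels)
        self.dconv = nn.Conv1d(d_model, d_model, conv_kernel,
                               padding=conv_kernel // 2, groups=d_model)

    def forward(self, X):                                    # X: n x d_model
        Q, K, V = self.W_q(X), self.W_k(X), self.W_v(X)
        out = nystrom_attention(Q, K, V, m=self.num_landmarks)
        skip = self.dconv(V.T.unsqueeze(0)).squeeze(0).T     # DConv skip on V
        return out + skip
```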
Experiments

We now present our experiments and results. Our experiments follow a transfer learning setting that consists of two stages. In the first stage, we train our Nyströmformer on a large-scale text corpus and report the language modeling performance of our model on a held-out validation set. In the second stage, we fine-tune the pre-trained Nyströmformer on several different NLP tasks in the GLUE benchmark (Wang et al. 2019) and IMDB reviews (Maas et al. 2011), and report the performance on the individual datasets for each task. In both stages, we compare our results to a baseline Transformer model (BERT).

(Pre-)training of Language Modeling
Our first experiment evaluates whether our model can achieve performance similar to a standard Transformer on language modeling with reduced complexity. We introduce the dataset and evaluation protocol, describe implementation details, and finally present the results of our model.
Dataset and metric. We consider BookCorpus plus English Wikipedia as the training corpus, which is further split into training (80%) and validation (20%) sets. Our model is trained using the training set. We report the masked-language-modeling (MLM) and sentence-order-prediction (SOP) accuracy on the validation set, and compare the efficiency (runtime and memory consumption) of our model to a baseline model.
Baselines. Our baseline is a well-known Transformer-based model – BERT (Devlin et al. 2019). Specifically, we consider two variants of BERT:
• BERT-small is a lightweight BERT model with 4 layers. We use BERT-small to compare to linear Transformers, including ELU linearized self-attention (Katharopoulos et al. 2020) and Linformer (Wang et al. 2020).
• BERT-base is the base model from (Devlin et al. 2019). We use this model as our baseline when fine-tuning on downstream NLP tasks.

Our Nyströmformer replaces the self-attention in BERT-small and BERT-base with the proposed Nyström approximation. We acknowledge that several very recent articles (Zaheer et al. 2020; Beltagy, Peters, and Cohan 2020), concurrent with our work, have also proposed efficient O(n) self-attention for Transformers. An exhaustive comparison to a rapidly growing set of algorithms is prohibitive unless extensive compute resources are freely available. Thus, we only compare the runtime performance and memory consumption of our method to Linformer (Wang et al. 2020) and Longformer (Beltagy, Peters, and Cohan 2020) in Table 1.

Table 1: Memory consumption and running time for various input sequence lengths. We report the average memory consumption (MB) and running time (ms) for one input instance of different lengths through the self-attention module; the factor in parentheses is the saving/speed-up relative to standard self-attention. Nyströmformer-64 denotes the Nyströmformer self-attention module using 64 landmarks and Nyströmformer-32 the module using 32 landmarks. Linformer-256 denotes the Linformer self-attention module with linear projection dimension 256. Longformer-257 denotes Longformer self-attention with a sliding window of size 257. Our Nyström self-attention offers favorable memory and time efficiency over standard self-attention and Longformer self-attention. At a length of 8192, our model offers a 1.2× memory saving and a 3× speed-up over Longformer, and a 1.7× memory saving over Linformer with similar running time.

n = 512:  Transformer 54 MB / 0.8 ms; Linformer-256 41 (1.3×) / 0.7 (1.1×); Longformer-257 32.2 (1.7×) / 2.4 (0.3×); Nyströmformer-64 35 (1.5×) / 0.7 (1.1×); Nyströmformer-32 26 (2.1×) / 0.6 (1.2×)
n = 1024: Transformer 186 / 2.4; Linformer-256 81 (2.3×) / 1.3 (1.8×); Longformer-257 65 (2.9×) / 4.6 (0.5×); Nyströmformer-64 63 (3.0×) / 1.3 (1.8×); Nyströmformer-32 49 (3.8×) / 1.2 (1.9×)
n = 2048: Transformer 685 / 10.0; Linformer-256 165 (4.2×) / 2.7 (3.6×); Longformer-257 130 (5.3×) / 9.2 (1.0×); Nyströmformer-64 118 (5.8×) / 2.7 (3.6×); Nyströmformer-32 96 (7.1×) / 2.6 (3.7×)
n = 4096: Transformer 2620 / 32.9; Linformer-256 366 (7.2×) / 5.3 (6.2×); Longformer-257 263 (10.0×) / 18.5 (1.8×); Nyströmformer-64 229 (11.5×) / 5.9 (5.6×); Nyströmformer-32 193 (13.6×) / 5.6 (5.9×)
n = 8192: Transformer 10233 / 155.4; Linformer-256 635 (16.1×) / 11.3 (13.8×); Longformer-257 455 (22.5×) / 36.2 (4.3×); Nyströmformer-64 450 (22.8×) / 12.3 (12.7×); Nyströmformer-32 383 (26.7×) / 11.5 (13.4×)

Implementation details. Our model is pre-trained with the masked-language-modeling (MLM) and sentence-order-prediction (SOP) objectives (Lan et al. 2020). We use a batch size of 256, the Adam optimizer with learning rate 1e-4 and L2 weight decay, learning rate warm-up over the first 10,000 steps, and linear learning rate decay to update our model. Training BERT-base for the full number of update steps takes more than one week on 8 V100 GPUs. To keep compute costs reasonable, our baseline (BERT-base) and our model are trained with 0.5M steps. We also train our model with a reduced number of steps; more details are available in the supplement.
Results on accuracy and efficiency. We report the validation accuracy and inference efficiency of our model and compare the results to Transformer-based models. In Fig. 5 and 6, we plot the MLM and SOP pre-training validation accuracy, which shows that Nyströmformer is comparable to a standard Transformer and outperforms other variants of efficient Transformers. We also note the computation and memory efficiency of our model in Table 1. To evaluate the inference time and memory efficiency, we generate random inputs for the self-attention module with sequence length n ∈ {512, 1024, 2048, 4096, 8192}. All models are evaluated on the same machine with an Nvidia 1080Ti, and we report the improved inference speed and memory savings.

Figure 5: Results on masked-language-modeling (MLM) and sentence-order-prediction (SOP). On BERT-small, our Nyström self-attention is competitive with standard self-attention, outperforming Linformer and other linear self-attentions.
Fine-tuning on Downstream NLP tasks
Our second experiment is designed to test the generalization ability of our model on downstream NLP tasks. To this end, we fine-tune the pretrained model across several NLP tasks.
Datasets and metrics. We consider the datasets of SST-2 (Socher et al. 2013), MRPC (Dolan and Brockett 2005), QNLI (Rajpurkar et al. 2016), QQP (Chen et al. 2018), and MNLI (Williams, Nangia, and Bowman 2018) in the GLUE benchmark, and IMDB reviews (Maas et al. 2011). We follow the standard evaluation protocols, fine-tune the pre-trained model on the training set, report the results on the validation set, and compare them to our baseline BERT-base.

Figure 6: Results on MLM and SOP. We report MLM and SOP validation accuracy for each training step. BERT-base (from scratch) is trained with 0.5M steps, our Nyströmformer (from scratch) is trained with 0.5M steps as BERT-base (from scratch), and our Nyströmformer (from standard) is trained with fewer steps starting from a standard pre-trained model.

Table 2: Results on natural language understanding tasks. We report the F1 score for MRPC and QQP and accuracy for the others. Our Nyströmformer performs competitively with BERT-base.

Model           SST-2  MRPC  QNLI  QQP   MNLI m/mm  IMDB
BERT-base       90.0   88.4  90.3  87.3  82.4/82.4  93.3
Nyströmformer   91.4   88.1  88.7  86.3  80.9/82.2  93.2
Implementation details. We fine-tune our pre-trained model on the GLUE benchmark datasets and IMDB reviews respectively and report its final performance. For the larger datasets (SST-2, QNLI, QQP, MNLI, IMDB reviews), we use a batch size of 32 and the AdamW optimizer with learning rate 3e-5, and fine-tune our models for 4 epochs. For MRPC, due to the sensitivity of a smaller dataset, we follow (Devlin et al. 2019) by performing a hyperparameter search over candidate batch sizes {8, 16, 32} and a small set of candidate learning rates, and select the best validation result. As these downstream tasks do not exceed the maximum input sequence length of 512, we fine-tune our model trained on an input sequence length of 512.

Results. Table 2 presents our experimental results on natural language understanding benchmarks with different tasks. Our results compare favorably to BERT-base across all downstream tasks. Moreover, we also experiment with fine-tuning our model using longer sequences (n = 1024), yet the results remain almost identical to n = 512, e.g., 93.0 vs. 93.2 accuracy on IMDB reviews. These results further suggest that our model is able to scale linearly with input length. Additional details on longer sequences are in the supplement and project webpage.

Conclusion
It is becoming clear that scaling Transformer-based models to longer sequences, desirable in both NLP and computer vision, will involve identifying mechanisms to mitigate their compute and memory requirements. Within the last year, this need has led to a number of results describing how randomized numerical linear algebra schemes based on random projections and low-rank assumptions can help (Katharopoulos et al. 2020; Wang et al. 2020; Beltagy, Peters, and Cohan 2020; Zaheer et al. 2020). In this paper, we approach this task differently by showing how the Nyström method, a widely used strategy for matrix approximation, can be adapted and deployed within a deep Transformer architecture to provide an approximation of self-attention with high efficiency. We show that our design choices enable all key operations to be mapped to popular deep learning libraries in a convenient way. The algorithm maintains the performance profile of other self-attention approximations in the literature but offers the additional benefit of better resource utilization. Overall, we believe that our work is a step towards running Transformer models on very long sequences. Our code and supplement are available at our project webpage https://github.com/mlpen/Nystromformer.
Acknowledgments
This work was supported in part by American Family Insurance, NSF CAREER award RI 1252725, and UW CPCP (U54AI117924). We thank Denny Zhou, Hongkun Yu, and Adam Yu for discussions and help with some of the experiments.
References
Baker, C. T. 1977.
The numerical treatment of integral equations. Clarendon Press.
Beltagy, I.; Peters, M. E.; and Cohan, A. 2020. Longformer: The Long-Document Transformer. arXiv:2004.05150.
Blanc, G.; and Rendle, S. 2018. Adaptive sampled softmax with kernel based sampling. In
Proceedings of the International Conference on Machine Learning (ICML), 590–599.
Bremer, J. 2012. On the Nyström discretization of integral equations on planar curves with corners.
Applied and Computational Harmonic Analysis.
Brown, T. B.; et al. 2020. Language Models are Few-Shot Learners. arXiv preprint arXiv:2005.14165.
Chen, T.; Xu, B.; Zhang, C.; and Guestrin, C. 2016. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174.
Child, R.; Gray, S.; Radford, A.; and Sutskever, I. 2019. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509.
Clark, K.; Luong, M.-T.; Le, Q. V.; and Manning, C. D. 2019. ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators. In
International Conference on Learning Repre-sentations (ICLR) .Dai, Z.; Yang, Z.; Yang, Y.; Carbonell, J. G.; Le, Q.; and Salakhut-dinov, R. 2019. Transformer-XL: Attentive Language Models be-yond a Fixed-Length Context. In
Proceedings of the Annual Meet-ing of the Association for Computational Linguistics (ACL) , 2978–2988.Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2019. BERT:Pre-training of Deep Bidirectional Transformers for Language Un-derstanding. In
Proceedings of the Conference of the North Ameri-can Chapter of the Association for Computational Linguistics: Hu-man Language Technologies (NAACL-HLT) .Dolan, W. B.; and Brockett, C. 2005. Automatically constructinga corpus of sentential paraphrases. In
Proceedings of the ThirdInternational Workshop on Paraphrasing (IWP2005) .Drineas, P.; and Mahoney, M. W. 2005. On the Nystr¨om methodfor approximating a Gram matrix for improved kernel-based learn-ing.
Journal of Machine Learning Research (JMLR).
Fanuel, M.; Schreurs, J.; and Suykens, J. A. K. 2019. Nyström landmark sampling and regularized Christoffel functions. arXiv preprint arXiv:1905.12346.
Howard, J.; and Ruder, S. 2018. Universal Language Model Fine-tuning for Text Classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL), 328–339.
Huang, Y.; Cheng, Y.; Bapna, A.; Firat, O.; Chen, D.; Chen, M.; Lee, H.; Ngiam, J.; Le, Q. V.; Wu, Y.; et al. 2019. GPipe: Efficient training of giant neural networks using pipeline parallelism. In
Advances in Neural Information Processing Systems (NeurIPS) ,103–112.Katharopoulos, A.; Vyas, A.; Pappas, N.; and Fleuret, F. 2020.Transformers are RNNs: Fast Autoregressive Transformers withLinear Attention. In
Proceedings of the International Conferenceon Machine Learning (ICML) .Kitaev, N.; Kaiser, L.; and Levskaya, A. 2019. Reformer: The Ef-ficient Transformer. In
International Conference on Learning Rep-resentations (ICLR) .Kumar, S.; Mohri, M.; and Talwalkar, A. 2009. Ensemble Nystr¨ommethod. In
Advances in Neural Information Processing Systems(NeurIPS) , 1060–1068.Lample, G.; Sablayrolles, A.; Ranzato, M.; Denoyer, L.; and J´egou,H. 2019. Large memory layers with product keys. In
Advances inNeural Information Processing Systems (NeurIPS) , 8548–8559.Lan, Z.; Chen, M.; Goodman, S.; Gimpel, K.; Sharma, P.; and Sori-cut, R. 2020. ALBERT: A lite BERT for self-supervised learning oflanguage representations. In
International Conference on LearningRepresentations (ICLR) .ee, J.; Lee, Y.; Kim, J.; Kosiorek, A.; Choi, S.; and Teh,Y. W. 2019. Set transformer: A framework for attention-basedpermutation-invariant neural networks. In
International Confer-ence on Machine Learning , 3744–3753. PMLR.Lewis, M.; Liu, Y.; Goyal, N.; Ghazvininejad, M.; Mohamed, A.;Levy, O.; Stoyanov, V.; and Zettlemoyer, L. 2020. BART: Denois-ing sequence-to-sequence pre-training for natural language gener-ation, translation, and comprehension. 7871–7880. Association forComputational Linguistics.Li, M.; Kwok, J. T.-Y.; and L¨u, B. 2010. Making large-scaleNystr¨om approximation possible. In
Proceedings of the Interna-tional Conference on Machine Learning (ICML) , 631.Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.;Lewis, M.; Zettlemoyer, L.; and Stoyanov, V. 2019. RoBERTa:A robustly optimized BERT pretraining approach. arXiv preprintarXiv:1907.11692 .Maas, A.; Daly, R. E.; Pham, P. T.; Huang, D.; Ng, A. Y.; and Potts,C. 2011. Learning word vectors for sentiment analysis. In
Proceed-ings of the 49th Annual Meeting of the Association for Computa-tional Linguistics (ACL): Human language technologies , 142–150.Michel, P.; Levy, O.; and Neubig, G. 2019. Are sixteen heads reallybetter than one? In
Advances in Neural Information ProcessingSystems , 14014–14024.Musco, C.; and Musco, C. 2017. Recursive sampling for the nys-trom method. In
Advances in Neural Information Processing Sys-tems (NeurIPS) , 3833–3845.Nemtsov, A.; Averbuch, A.; and Schclar, A. 2016. Matrix compres-sion using the Nystr¨om method.
Intelligent Data Analysis
Journal of Machine Learning Re-search (JMLR)
Pan, V.; and Schreiber, R. 1991. An improved Newton iteration for the generalized inverse of a matrix, with applications. SIAM Journal on Scientific and Statistical Computing.
Peters, M. E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; and Zettlemoyer, L. 2018. Deep contextualized word representations. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), 2227–2237.
Radford, A.; Narasimhan, K.; Salimans, T.; and Sutskever, I. 2018. Improving language understanding with unsupervised learning.
Technical report, OpenAI .Rajpurkar, P.; Zhang, J.; Lopyrev, K.; and Liang, P. 2016. SQuAD:100,000+ Questions for Machine Comprehension of Text. In
Pro-ceedings of the Conference on Empirical Methods in Natural Lan-guage Processing (EMNLP) , 2383–2392.Rawat, A. S.; Chen, J.; Yu, F. X. X.; Suresh, A. T.; and Kumar,S. 2019. Sampled softmax with random fourier features. In
Advances in Neural Information Processing Systems (NeurIPS) ,13857–13867.Razavi, M. K.; Kerayechian, A.; Gachpazan, M.; and Shateyi, S.2014. A new iterative method for finding approximate inverses ofcomplex matrices. In
Abstract and Applied Analysis .Sanh, V.; Debut, L.; Chaumond, J.; and Wolf, T. 2019. DistilBERT,a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108 . Shen, D.; Wang, G.; Wang, W.; Min, M. R.; Su, Q.; Zhang, Y.; Li,C.; Henao, R.; and Carin, L. 2018a. Baseline Needs More Love: OnSimple Word-Embedding-Based Models and Associated PoolingMechanisms. In
Proceedings of the 56th Annual Meeting of theAssociation for Computational Linguistics (ACL) , 440–450.Shen, Z.; Zhang, M.; Zhao, H.; Yi, S.; and Li, H. 2018b. Effi-cient Attention: Attention with Linear Complexities. arXiv preprintarXiv:1812.01243 .Si, S.; Hsieh, C.-J.; and Dhillon, I. 2016. Computationally efficientNystr¨om approximation using fast transforms. In
Proceedings ofthe International Conference on Machine Learning (ICML) , 2655–2663.Si, S.; Hsieh, C.-J.; and Dhillon, I. S. 2017. Memory efficient ker-nel approximation.
Journal of Machine Learning Research (JMLR)
Socher, R.; Perelygin, A.; Wu, J.; Chuang, J.; Manning, C. D.; Ng, A.; and Potts, C. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 1631–1642.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. In
Advances in Neural Information Processing Sys-tems (NeurIPS) , 5998–6008.Vyas, A.; Katharopoulos, A.; and Fleuret, F. 2020. Fast transform-ers with clustered attention.
Advances in Neural Information Pro-cessing Systems
International Con-ference on Learning Representations (ICLR) .Wang, A.; Singh, A.; Michael, J.; Hill, F.; Levy, O.; and Bowman,S. R. 2019. GLUE: A Multi-Task Benchmark and Analysis Plat-form for Natural Language Understanding. In the Proceedings ofICLR.Wang, S.; Li, B.; Khabsa, M.; Fang, H.; and Ma, H. 2020. Lin-former: Self-Attention with Linear Complexity. arXiv preprintarXiv:2006.04768 .Wang, S.; Zhang, C.; Qian, H.; and Zhang, Z. 2014. Improving themodified nystr¨om method using spectral shifting. In
Proceedingsof the 20th ACM SIGKDD International Conference on KnowledgeDiscovery and Data Mining (KDD) , 611–620.Wang, S.; and Zhang, Z. 2013. Improving CUR matrix decomposi-tion and the Nystr¨om approximation via adaptive sampling.
Jour-nal of Machine Learning Research (JMLR)
Jour-nal of Machine Learning Research (JMLR)
Williams, A.; Nangia, N.; and Bowman, S. R. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), 1112–1122.
Williams, C. K.; and Seeger, M. 2001. Using the Nyström method to speed up kernel machines. In
Advances in Neural InformationProcessing Systems (NeurIPS) , 682–688.Zafrir, O.; Boudoukh, G.; Izsak, P.; and Wasserblat, M. 2019.Q8BERT: Quantized 8bit BERT. In
NeurIPS Workshop on Energy Efficient Machine Learning and Cognitive Computing 2019.
Zaheer, M.; Guruganesh, G.; Dubey, A.; Ainslie, J.; Alberti, C.; Ontanon, S.; Pham, P.; Ravula, A.; Wang, Q.; Yang, L.; et al. 2020. Big Bird: Transformers for longer sequences. arXiv preprint arXiv:2007.14062.
Zhang, K.; and Kwok, J. T. 2010. Clustered Nyström method for large scale manifold learning and dimension reduction.
IEEE Transactions on Neural Networks.
Zhang, K.; Tsang, I. W.; and Kwok, J. T. 2008. Improved Nyström low-rank approximation and error analysis. In Proceedings of the International Conference on Machine Learning (ICML), 1232–1239.
Zhu, Y.; Kiros, R.; Zemel, R.; Salakhutdinov, R.; Urtasun, R.; Torralba, A.; and Fidler, S. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).