Rethinking Attention with Performers
Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, David Belanger, Lucy Colwell, Adrian Weller
Google, University of Cambridge, DeepMind, Alan Turing Institute

ABSTRACT
We introduce Performers, Transformer architectures which can estimate regular (softmax) full-rank-attention Transformers with provable accuracy, but using only linear (as opposed to quadratic) space and time complexity, without relying on any priors such as sparsity or low-rankness. To approximate softmax attention-kernels, Performers use a novel Fast Attention Via positive Orthogonal Random features approach (FAVOR+), which may be of independent interest for scalable kernel methods. FAVOR+ can also be used to efficiently model kernelizable attention mechanisms beyond softmax. This representational power is crucial to accurately compare softmax with other kernels for the first time on large-scale tasks, beyond the reach of regular Transformers, and investigate optimal attention-kernels. Performers are linear architectures fully compatible with regular Transformers and with strong theoretical guarantees: unbiased or nearly-unbiased estimation of the attention matrix, uniform convergence and low estimation variance. We tested Performers on a rich set of tasks stretching from pixel-prediction through text models to protein sequence modeling. We demonstrate competitive results with other examined efficient sparse and dense attention methods, showcasing effectiveness of the novel attention-learning paradigm leveraged by Performers.
1 INTRODUCTION AND RELATED WORK
Transformers (Vaswani et al., 2017; Dehghani et al., 2019) are powerful neural network architectures that have become SOTA in several areas of machine learning including natural language processing (NLP) (e.g. speech recognition (Luo et al., 2020)), neural machine translation (NMT) (Chen et al., 2018), document generation/summarization, time series prediction, generative modeling (e.g. image generation (Parmar et al., 2018)), music generation (Huang et al., 2019), and bioinformatics (Rives et al., 2019; Madani et al., 2020; Ingraham et al., 2019; Elnaggar et al., 2019; Du et al., 2020). Transformers rely on a trainable attention mechanism that identifies complex dependencies between the elements of each input sequence. Unfortunately, the regular Transformer scales quadratically with the number of tokens $L$ in the input sequence, which is prohibitively expensive for large $L$ and precludes its usage in settings with limited computational resources even for moderate values of $L$. Several solutions have been proposed to address this issue (Beltagy et al., 2020; Gulati et al., 2020; Chan et al., 2020; Child et al., 2019; Bello et al., 2019). Most approaches restrict the attention mechanism to attend to local neighborhoods (Parmar et al., 2018) or incorporate structural priors on attention such as sparsity (Child et al., 2019), pooling-based compression (Rae et al., 2020), clustering/binning/convolution techniques (e.g. (Roy et al., 2020), which applies $k$-means clustering to learn dynamic sparse attention regions, or (Kitaev et al., 2020), where locality sensitive hashing is used to group together tokens of similar embeddings), sliding windows (Beltagy et al., 2020), or truncated targeting (Chelba et al., 2020). There is also a long line of research on using dense attention matrices, but defined by low-rank kernels substituting softmax (Katharopoulos et al., 2020; Shen et al., 2018).
Those methods critically rely on kernels admitting explicit representations as dot-products of finite positive-feature vectors.

∗ Equal contribution. Correspondence to {kchoro,lcolwell}@google.com. Code for Transformer models on protein data can be found in github.com/google-research/google-research/tree/master/protein_lm and Performer code can be found in github.com/google-research/google-research/tree/master/performer.

The approaches above do not aim to approximate regular attention, but rather propose simpler and more tractable attention mechanisms, often by incorporating additional constraints (e.g. identical query and key sets as in (Kitaev et al., 2020)), or by trading regular with sparse attention using more layers (Child et al., 2019). Unfortunately, there is a lack of rigorous guarantees for the representation power produced by such methods, and sometimes the validity of sparsity patterns can only be verified empirically through trial and error by constructing special GPU operations (e.g. either writing C++ CUDA kernels (Child et al., 2019) or using TVMs (Beltagy et al., 2020)). Other techniques which aim to reduce Transformers' space complexity include reversible residual layers allowing one-time activation storage in training (Kitaev et al., 2020) and shared attention weights (Xiao et al., 2019). These constraints may impede application to long-sequence problems, where approximations of the attention mechanism are not sufficient. Approximations based on truncated back-propagation (Dai et al., 2019) are also unable to capture long-distance correlations since the gradients are only propagated inside a localized window.
Other methods propose biased estimation of the regular attention but only in the non-causal setting and of large mean squared error (Wang et al., 2020).

In response, we introduce the first Transformer architectures, Performers, capable of provably accurate and practical estimation of regular (softmax) full-rank attention, but of only linear space and time complexity and not relying on any priors such as sparsity or low-rankness. Performers use the Fast Attention Via positive Orthogonal Random features (FAVOR+) mechanism, leveraging new methods for approximating softmax and Gaussian kernels, which we propose. We believe these methods are of independent interest, contributing to the theory of scalable kernel methods. Consequently, Performers are the first linear architectures fully compatible (via small amounts of fine-tuning) with regular Transformers, providing strong theoretical guarantees: unbiased or nearly-unbiased estimation of the attention matrix, uniform convergence and lower variance of the approximation.

FAVOR+ can also be applied to efficiently model other kernelizable attention mechanisms beyond softmax. This representational power is crucial to accurately compare softmax with other kernels for the first time on large-scale tasks that are beyond the reach of regular Transformers, and find for them optimal attention-kernels. FAVOR+ can also be applied beyond the Transformer scope as a more scalable replacement for regular attention, which itself has a wide variety of uses in computer vision (Fu et al., 2019), reinforcement learning (Zambaldi et al., 2019), training with softmax cross entropy loss and even combinatorial optimization (Vinyals et al., 2015).

We test Performers on a rich set of tasks ranging from pixel-prediction through text models to protein sequence modeling. We demonstrate competitive results with other examined efficient sparse and dense attention methods, showcasing the effectiveness of the novel attention-learning paradigm leveraged by Performers. We emphasize that in principle, FAVOR+ can also be combined with other techniques, such as reversible layers (Kitaev et al., 2020) or cluster-based attention (Roy et al., 2020).
2 FAVOR+ MECHANISM & POSITIVE ORTHOGONAL RANDOM FEATURES
Below we describe in detail the FAVOR+ mechanism - the backbone of the Performer's architecture. We introduce a new method for estimating softmax (and Gaussian) kernels with positive orthogonal random features which FAVOR+ leverages for the robust and unbiased estimation of regular (softmax) attention and show how FAVOR+ can be applied for other attention-kernels.

2.1 PRELIMINARIES - REGULAR ATTENTION MECHANISM
Let $L$ be the size of an input sequence of tokens. Then regular dot-product attention (Vaswani et al., 2017) is a mapping which accepts matrices $\mathbf{Q}, \mathbf{K}, \mathbf{V} \in \mathbb{R}^{L \times d}$ as input, where $d$ is the hidden dimension (dimension of the latent representation). Matrices $\mathbf{Q}, \mathbf{K}, \mathbf{V}$ are intermediate representations of the input and their rows can be interpreted as queries, keys and values of the continuous dictionary data structure respectively. Bidirectional (or non-directional (Devlin et al., 2018)) dot-product attention has the following form, where $\mathbf{A} \in \mathbb{R}^{L \times L}$ is the so-called attention matrix:

$$\mathrm{Att}_{\leftrightarrow}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathbf{D}^{-1} \mathbf{A} \mathbf{V}, \quad \mathbf{A} = \exp(\mathbf{Q} \mathbf{K}^{\top} / \sqrt{d}), \quad \mathbf{D} = \mathrm{diag}(\mathbf{A} \mathbf{1}_L). \tag{1}$$

Here $\exp(\cdot)$ is applied elementwise, $\mathbf{1}_L$ is the all-ones vector of length $L$, and $\mathrm{diag}(\cdot)$ is a diagonal matrix with the input vector as the diagonal. Time and space complexity of computing (1) are $O(L^2 d)$ and $O(L^2 + Ld)$ respectively, because $\mathbf{A}$ has to be stored explicitly. Hence, in principle, dot-product attention of type (1) is incompatible with end-to-end processing of long sequences. Bidirectional attention is applied in encoder self-attention and encoder-decoder attention in Seq2Seq architectures.

Another important type of attention is unidirectional dot-product attention, which has the form:

$$\mathrm{Att}_{\rightarrow}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \widetilde{\mathbf{D}}^{-1} \widetilde{\mathbf{A}} \mathbf{V}, \quad \widetilde{\mathbf{A}} = \mathrm{tril}(\mathbf{A}), \quad \widetilde{\mathbf{D}} = \mathrm{diag}(\widetilde{\mathbf{A}} \mathbf{1}_L), \tag{2}$$

where $\mathrm{tril}(\cdot)$ returns the lower-triangular part of the argument matrix including the diagonal. As discussed in (Vaswani et al., 2017), unidirectional attention is used for autoregressive generative modelling, e.g. as self-attention in generative Transformers as well as the decoder part of Seq2Seq Transformers.

We will show that attention matrix $\mathbf{A}$ can be approximated up to any precision in time $O(L d^2 \log(d))$. For comparison, popular methods leveraging sparsity via Locality-Sensitive Hashing (LSH) techniques (Kitaev et al., 2020) have $O(L d^2 \log L)$ time complexity.
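As a point of reference for the quadratic cost discussed above, Equations (1) and (2) can be sketched directly in plain Python. This is a minimal illustration, not the authors' implementation: the function name `softmax_attention` and the toy inputs are assumptions made here, and matrices are plain lists of rows.

```python
import math

def softmax_attention(Q, K, V, causal=False):
    """Exact attention of Equation (1), or Equation (2) when causal=True."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    out = []
    for i, q in enumerate(Q):
        # tril(A) keeps only keys j <= i; the bidirectional case keeps all L keys.
        last = i + 1 if causal else len(K)
        scores = [scale * sum(a * b for a, b in zip(q, K[j])) for j in range(last)]
        shift = max(scores)  # cancels between A and D; it only guards against overflow
        row = [math.exp(s - shift) for s in scores]  # row i of A (or of tril(A))
        norm = sum(row)  # diagonal entry of D, up to the same shift factor
        out.append([sum(w * V[j][c] for j, w in enumerate(row)) / norm
                    for c in range(len(V[0]))])
    return out

# Hypothetical toy inputs with L = 3 and d = 2.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
bidirectional = softmax_attention(Q, K, V)
unidirectional = softmax_attention(Q, K, V, causal=True)
```

Every output row touches up to $L$ keys, so the double loop makes the $O(L^2 d)$ time of the exact mechanism explicit. In the unidirectional case the first token can only attend to itself, so its output is simply the first row of `V`.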
In the main body of the paper we will describe FAVOR+ for bidirectional attention. Completely analogous results can be obtained for the unidirectional variant via the mechanism of prefix-sums (all details in Appendix B.1).

2.2 GENERALIZED KERNELIZABLE ATTENTION
FAVOR+ works for attention blocks using matrices $\mathbf{A} \in \mathbb{R}^{L \times L}$ of the form $\mathbf{A}(i, j) = \mathrm{K}(\mathbf{q}_i^{\top}, \mathbf{k}_j^{\top})$, with $\mathbf{q}_i / \mathbf{k}_j$ standing for the $i$-th/$j$-th query/key row-vector in $\mathbf{Q} / \mathbf{K}$ and kernel $\mathrm{K} : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}_+$ defined for the (usually randomized) mapping $\phi : \mathbb{R}^d \to \mathbb{R}^r_+$ (for some $r > 0$) as:

$$\mathrm{K}(\mathbf{x}, \mathbf{y}) = \mathbb{E}[\phi(\mathbf{x})^{\top} \phi(\mathbf{y})]. \tag{3}$$

We call $\phi(\mathbf{u})$ a random feature map for $\mathbf{u} \in \mathbb{R}^d$. For $\mathbf{Q}', \mathbf{K}' \in \mathbb{R}^{L \times r}$ with rows given as $\phi(\mathbf{q}_i^{\top})^{\top}$ and $\phi(\mathbf{k}_i^{\top})^{\top}$ respectively, Equation 3 leads directly to the efficient attention mechanism of the form:

$$\widehat{\mathrm{Att}}_{\leftrightarrow}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \widehat{\mathbf{D}}^{-1} (\mathbf{Q}' ((\mathbf{K}')^{\top} \mathbf{V})), \quad \widehat{\mathbf{D}} = \mathrm{diag}(\mathbf{Q}' ((\mathbf{K}')^{\top} \mathbf{1}_L)). \tag{4}$$

Here $\widehat{\mathrm{Att}}_{\leftrightarrow}$ stands for the approximate attention and brackets indicate the order of computations. It is easy to see that such a mechanism is characterized by space complexity $O(Lr + Ld + rd)$ and time complexity $O(Lrd)$ as opposed to $O(L^2 + Ld)$ and $O(L^2 d)$ of the regular attention (see also Fig. 1).

Figure 1: Approximation of the regular attention mechanism $\mathbf{A}\mathbf{V}$ (before $\mathbf{D}^{-1}$-renormalization) via (random) feature maps. Dashed-blocks indicate order of computation with corresponding time complexities attached.

The above scheme constitutes the FA-part of the FAVOR+ mechanism. The remaining OR+ part answers the following questions: (1) How expressive is the attention model defined in Equation 3, and in particular, can we use it in principle to approximate regular softmax attention? (2) How do we implement it robustly in practice, and in particular, can we choose $r \ll L$ for $L \gg d$ to obtain desired space and time complexity gains? We answer these questions in the next sections.

2.3 HOW TO AND HOW NOT TO APPROXIMATE SOFTMAX KERNELS FOR ATTENTION
It turns out that by taking $\phi$ of the following form for functions $f_1, ..., f_l : \mathbb{R} \to \mathbb{R}$, function $h : \mathbb{R}^d \to \mathbb{R}$ and deterministic vectors $\omega_i$ or $\omega_1, ..., \omega_m \overset{\mathrm{iid}}{\sim} \mathcal{D}$ for some distribution $\mathcal{D} \in \mathcal{P}(\mathbb{R}^d)$:

$$\phi(\mathbf{x}) = \frac{h(\mathbf{x})}{\sqrt{m}} \left( f_1(\omega_1^{\top} \mathbf{x}), ..., f_1(\omega_m^{\top} \mathbf{x}), ..., f_l(\omega_1^{\top} \mathbf{x}), ..., f_l(\omega_m^{\top} \mathbf{x}) \right), \tag{5}$$

we can model most kernels used in practice. Furthermore, in most cases $\mathcal{D}$ is isotropic (i.e. with pdf function constant on a sphere), usually Gaussian. For example, by taking $h(\mathbf{x}) = 1$, $l = 1$ and $\mathcal{D} = \mathcal{N}(0, \mathbf{I}_d)$ we obtain estimators of the so-called PNG-kernels (Choromanski et al., 2017) (e.g. $f_1 = \mathrm{sgn}$ corresponds to the angular kernel). Configurations: $h(\mathbf{x}) = 1$, $l = 2$, $f_1 = \sin$, $f_2 = \cos$ correspond to shift-invariant kernels, in particular $\mathcal{D} = \mathcal{N}(0, \mathbf{I}_d)$ leads to the Gaussian kernel $\mathrm{K}_{\mathrm{gauss}}$ (Rahimi & Recht, 2007). The softmax-kernel which defines regular attention matrix $\mathbf{A}$ is given as:

$$\mathrm{SM}(\mathbf{x}, \mathbf{y}) \overset{\mathrm{def}}{=} \exp(\mathbf{x}^{\top} \mathbf{y}). \tag{6}$$

In the above, without loss of generality, we omit $\sqrt{d}$-renormalization since we can equivalently renormalize input keys and queries. Since $\mathrm{SM}(\mathbf{x}, \mathbf{y}) = \exp(\|\mathbf{x}\|^2 / 2) \, \mathrm{K}_{\mathrm{gauss}}(\mathbf{x}, \mathbf{y}) \exp(\|\mathbf{y}\|^2 / 2)$, based on what we have said, we obtain a random feature map unbiased approximation of $\mathrm{SM}(\mathbf{x}, \mathbf{y})$ using trigonometric functions with: $h(\mathbf{x}) = \exp(\|\mathbf{x}\|^2 / 2)$, $l = 2$, $f_1 = \sin$, $f_2 = \cos$. We call it $\widehat{\mathrm{SM}}^{\mathrm{trig}}_m(\mathbf{x}, \mathbf{y})$.

There is however a caveat there. The attention module from (1) constructs, for each token, a convex combination of value-vectors with coefficients given as corresponding renormalized kernel scores. That is why kernels producing non-negative scores are used. Applying random feature maps with potentially negative dimension-values ($\sin$ / $\cos$) leads to unstable behaviours, especially when kernel scores close to $0$ (which is the case for lots of entries of $\mathbf{A}$ corresponding to not relevant tokens) are approximated by estimators with large variance in such regions. This results in abnormal behaviours, e.g. renormalizers $\mathbf{D}^{-1}$ with negative diagonal values, and consequently either completely prevents training or leads to sub-optimal models. We demonstrate empirically that this is what happens for $\widehat{\mathrm{SM}}^{\mathrm{trig}}_m$ and provide detailed theoretical explanations showing that the variance of $\widehat{\mathrm{SM}}^{\mathrm{trig}}_m$ is large as approximated values tend to $0$ (see: Section 3). This is one of the main reasons why a robust random feature map mechanism for approximating regular softmax attention was never proposed.

We propose a robust mechanism in this paper. Furthermore, the variance of our new unbiased positive random feature map estimator tends to $0$ as approximated values tend to $0$ (see: Section 3).

Lemma 1 (Positive Random Features (PRFs) for Softmax). For $\mathbf{x}, \mathbf{y} \in \mathbb{R}^d$, $\mathbf{z} = \mathbf{x} + \mathbf{y}$ we have:

$$\mathrm{SM}(\mathbf{x}, \mathbf{y}) = \mathbb{E}_{\omega \sim \mathcal{N}(0, \mathbf{I}_d)} \left[ \exp\left(\omega^{\top} \mathbf{x} - \frac{\|\mathbf{x}\|^2}{2}\right) \exp\left(\omega^{\top} \mathbf{y} - \frac{\|\mathbf{y}\|^2}{2}\right) \right] = \Lambda \, \mathbb{E}_{\omega \sim \mathcal{N}(0, \mathbf{I}_d)} \cosh(\omega^{\top} \mathbf{z}), \tag{7}$$

where $\Lambda = \exp\left(-\frac{\|\mathbf{x}\|^2 + \|\mathbf{y}\|^2}{2}\right)$ and $\cosh$ is the hyperbolic cosine. Consequently, the softmax-kernel admits a positive random feature map unbiased approximation with $h(\mathbf{x}) = \exp(-\|\mathbf{x}\|^2 / 2)$, $l = 1$, $f_1 = \exp$ and $\mathcal{D} = \mathcal{N}(0, \mathbf{I}_d)$ or: $h(\mathbf{x}) = \frac{1}{\sqrt{2}} \exp(-\|\mathbf{x}\|^2 / 2)$, $l = 2$, $f_1(u) = \exp(u)$, $f_2(u) = \exp(-u)$ and the same $\mathcal{D}$ (the latter for further variance reduction). We call related estimators: $\widehat{\mathrm{SM}}^{+}_m$ and $\widehat{\mathrm{SM}}^{\mathrm{hyp+}}_m$.

Figure 2: Left: Symmetrized (around origin) utility function $r$ (defined as a ratio of the mean squared errors (MSEs) of estimators built on: trigonometric and positive random features) as a function of the angle $\phi$ (in radians) between input feature vectors and their lengths $l$. Larger values indicate regions of $(\phi, l)$-space with better performance of positive random features. We see that for critical regions with $\phi$ large enough (small enough softmax-kernel values) our method is arbitrarily more accurate than trigonometric random features. Plot presented for domain [−π, π] × [− , ]. Right: The slice of function $r$ for fixed $l = 1$ and varying angle $\phi$. Right Upper Corner: Comparison of the MSEs of both the estimators in a low softmax-kernel value region.
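The positive random feature map of Lemma 1 and the factorized computation of Equation 4 can be combined in a few lines. The sketch below is an illustration under stated assumptions, not the released Performer code: the helper names (`dot`, `positive_features`, `favor_attention`, `exact_attention`), the toy inputs and the choice of $m = 20000$ i.i.d. Gaussian samples are all made up here, and orthogonalization (Section 2.4) is deliberately left out.

```python
import math
import random

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def positive_features(x, omegas):
    """phi(x) = exp(-|x|^2 / 2) / sqrt(m) * (exp(w_1.x), ..., exp(w_m.x)), as in Lemma 1."""
    h = math.exp(-0.5 * dot(x, x)) / math.sqrt(len(omegas))
    return [h * math.exp(dot(w, x)) for w in omegas]

def favor_attention(Q, K, V, omegas):
    """Estimate D^{-1} A V via Equation (4) without forming the L x L matrix A."""
    scale = len(Q[0]) ** -0.25  # moves the 1/sqrt(d) of Equation (1) into queries and keys
    Qp = [positive_features([scale * v for v in q], omegas) for q in Q]
    Kp = [positive_features([scale * v for v in k], omegas) for k in K]
    m, dv = len(omegas), len(V[0])
    # (K')^T V has shape m x dv and (K')^T 1_L has length m.
    KV = [[sum(kp[a] * v[c] for kp, v in zip(Kp, V)) for c in range(dv)] for a in range(m)]
    K1 = [sum(kp[a] for kp in Kp) for a in range(m)]
    out = []
    for qp in Qp:
        norm = dot(qp, K1)  # diagonal entry of D-hat, positive by construction
        out.append([sum(qp[a] * KV[a][c] for a in range(m)) / norm for c in range(dv)])
    return out

def exact_attention(Q, K, V):
    """Reference O(L^2 d) computation of Equation (1)."""
    s = 1.0 / math.sqrt(len(Q[0]))
    out = []
    for q in Q:
        w = [math.exp(s * dot(q, k)) for k in K]
        out.append([sum(wj * v[c] for wj, v in zip(w, V)) / sum(w) for c in range(len(V[0]))])
    return out

# Hypothetical toy data: d = 3, m = 20000 i.i.d. samples from N(0, I_d).
rng = random.Random(0)
d, m = 3, 20000
omegas = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]

x, y = [0.3, -0.2, 0.1], [0.1, 0.4, -0.3]
estimate = dot(positive_features(x, omegas), positive_features(y, omegas))
exact = math.exp(dot(x, y))  # SM(x, y) from Equation (6)

Q = [[0.5, 0.1, -0.3], [-0.2, 0.4, 0.0], [0.1, -0.1, 0.6]]
K = [[0.3, -0.4, 0.2], [0.0, 0.5, -0.1], [-0.3, 0.2, 0.4]]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
approx_out = favor_attention(Q, K, V, omegas)
exact_out = exact_attention(Q, K, V)
```

Because every feature is a product of exponentials, each estimated kernel score and each renormalizer entry is strictly positive, which is the property the trigonometric features lack. The large `m` is used only to make the toy comparison tight; it is not a recommendation for practical settings.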
In Fig. 2 we visualize the advantages of positive versus standard trigonometric random features. In critical regions, where kernel values are small and need careful approximation, our method outperforms its counterpart. In Section 4 we further confirm our method's advantages empirically, using positive features to efficiently train softmax-based linear Transformers. If we replace in (7) $\omega$ with $\sqrt{d} \frac{\omega}{\|\omega\|}$, we obtain the so-called regularized softmax-kernel SMREG which we can approximate in a similar manner, simply changing $\mathcal{D} = \mathcal{N}(0, \mathbf{I}_d)$ to $\mathcal{D} = \mathrm{Unif}(\sqrt{d} \mathcal{S}^{d-1})$, a distribution corresponding to Haar measure on the sphere of radius $\sqrt{d}$ in $\mathbb{R}^d$, obtaining estimator $\widehat{\mathrm{SMREG}}^{+}_m$. As we show in Section 3, such random features can also be used to accurately approximate the regular softmax-kernel.

2.4 ORTHOGONAL RANDOM FEATURES (ORFs)

The above constitutes the R+ part of the FAVOR+ method. It remains to explain the O-part. To further reduce the variance of the estimator (so that we can use an even smaller number of random features $r$), we entangle different random samples $\omega_1, ..., \omega_m$ to be exactly orthogonal. This can be done while maintaining unbiasedness whenever isotropic distributions $\mathcal{D}$ are used (i.e. in particular in all kernels we considered so far) by the standard Gram-Schmidt renormalization procedure (see: (Choromanski et al., 2017) for details). ORFs are a well-known method, yet it turns out that they work particularly well with our introduced PRFs for softmax. This leads to the first theoretical results showing that ORFs can be applied to reduce the variance of softmax/Gaussian kernel estimators for any dimensionality $d$ rather than just asymptotically for large enough $d$ (as is the case for previous methods, see: next section) and leads to the first exponentially small bounds on large deviations probabilities that are strictly smaller than for non-orthogonal methods. Positivity of random features plays a key role in these bounds.
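The Gram-Schmidt step described above can be sketched as follows. This is a minimal illustration assuming the usual ORF construction (orthonormalize Gaussian samples, then restore row lengths); the function name `orthogonal_features` and the `regularized` flag are hypothetical and are not taken from the paper's pseudocode in Appendix B.

```python
import math
import random

def orthogonal_features(m, d, rng, regularized=False):
    """Draw m <= d exactly orthogonal projection vectors in R^d."""
    if m > d:
        raise ValueError("exact orthogonality needs m <= d")
    basis = []
    for _ in range(m):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        # Gram-Schmidt: remove the components along the directions accepted so far.
        for b in basis:
            proj = sum(vi * bi for vi, bi in zip(v, b))
            v = [vi - proj * bi for vi, bi in zip(v, b)]
        norm = math.sqrt(sum(vi * vi for vi in v))
        basis.append([vi / norm for vi in v])
    rows = []
    for b in basis:
        if regularized:
            # Fixed length sqrt(d): each row lies on the sphere of radius sqrt(d) (SMREG).
            length = math.sqrt(d)
        else:
            # Length of an independent Gaussian vector, so each row is marginally N(0, I_d).
            length = math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(d)))
        rows.append([length * bi for bi in b])
    return rows

rng = random.Random(1)
W = orthogonal_features(4, 6, rng)                       # Gaussian marginals
W_reg = orthogonal_features(4, 6, rng, regularized=True)  # rows on the sphere
```

The rows of `W` can be passed wherever i.i.d. samples $\omega_1, ..., \omega_m$ are expected. Only the coupling between samples changes; each individual row keeps an isotropic distribution, which is why unbiasedness is preserved.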
The ORF mechanism requires $m \leq d$, but this will be the case in all our experiments. The pseudocode of the entire FAVOR+ algorithm is given in Appendix B.

Our theoretical results are tightly aligned with experiments. We show in Section 4 that PRFs+ORFs drastically improve accuracy of the approximation of the attention matrix and enable us to reduce $r$, which results in an accurate as well as space and time efficient mechanism which we call FAVOR+.

3 THEORETICAL RESULTS
We present here the theory of positive orthogonal random features for softmax-kernel estimation. All these results can also be applied to the Gaussian kernel, since as explained in the previous section, one can be obtained from the other by renormalization (see: Section 2.3). All proofs and additional more general theoretical results with a discussion are given in the Appendix.
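The renormalization linking the two kernels can be written out in one line. The derivation below assumes the Gaussian kernel is normalized as $\mathrm{K}_{\mathrm{gauss}}(\mathbf{x}, \mathbf{y}) = \exp(-\|\mathbf{x} - \mathbf{y}\|^2 / 2)$, which is the convention matching $\mathcal{D} = \mathcal{N}(0, \mathbf{I}_d)$ in Section 2.3; it is a worked step added for illustration rather than a statement taken from the Appendix.

```latex
\begin{aligned}
\mathrm{K}_{\mathrm{gauss}}(\mathbf{x}, \mathbf{y})
  &= \exp\left(-\frac{\|\mathbf{x} - \mathbf{y}\|^2}{2}\right)
   = \exp\left(-\frac{\|\mathbf{x}\|^2}{2}\right)
     \exp(\mathbf{x}^{\top} \mathbf{y})
     \exp\left(-\frac{\|\mathbf{y}\|^2}{2}\right) \\
  &= \exp\left(-\frac{\|\mathbf{x}\|^2}{2}\right)
     \mathrm{SM}(\mathbf{x}, \mathbf{y})
     \exp\left(-\frac{\|\mathbf{y}\|^2}{2}\right).
\end{aligned}
```

Since the two outer factors depend on $\mathbf{x}$ and $\mathbf{y}$ separately, any estimator of one kernel yields an estimator of the other after multiplying by these deterministic terms.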
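Lemma 2 (positive (hyperbolic) versus trigonometric random features). The following is true:

$$\begin{aligned}
\mathrm{MSE}(\widehat{\mathrm{SM}}^{\mathrm{trig}}_m(\mathbf{x}, \mathbf{y})) &= \frac{1}{2m} \exp(\|\mathbf{x} + \mathbf{y}\|^2) \, \mathrm{SM}^{-2}(\mathbf{x}, \mathbf{y}) \, (1 - \exp(-\|\mathbf{x} - \mathbf{y}\|^2))^2, \\
\mathrm{MSE}(\widehat{\mathrm{SM}}^{+}_m(\mathbf{x}, \mathbf{y})) &= \frac{1}{m} \exp(\|\mathbf{x} + \mathbf{y}\|^2) \, \mathrm{SM}^{2}(\mathbf{x}, \mathbf{y}) \, (1 - \exp(-\|\mathbf{x} + \mathbf{y}\|^2)), \\
\mathrm{MSE}(\widehat{\mathrm{SM}}^{\mathrm{hyp+}}_m(\mathbf{x}, \mathbf{y})) &= \frac{1}{2} (1 - \exp(-\|\mathbf{x} + \mathbf{y}\|^2)) \, \mathrm{MSE}(\widehat{\mathrm{SM}}^{+}_m(\mathbf{x}, \mathbf{y})),
\end{aligned} \tag{8}$$

for independent random samples $\omega_i$, and where MSE stands for the mean squared error.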
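The closed forms of Lemma 2 are easy to evaluate numerically, which makes the contrast between the estimators concrete. The following sketch simply transcribes the three expressions of Equation (8); the function names and the example vectors are hypothetical choices made for this illustration.

```python
import math

def _sq_norm(v):
    return sum(c * c for c in v)

def _softmax_kernel(x, y):
    return math.exp(sum(a * b for a, b in zip(x, y)))

def mse_trig(x, y, m):
    """MSE of the trigonometric estimator, first line of Equation (8)."""
    plus = _sq_norm([a + b for a, b in zip(x, y)])
    minus = _sq_norm([a - b for a, b in zip(x, y)])
    return math.exp(plus) * _softmax_kernel(x, y) ** -2 * (1.0 - math.exp(-minus)) ** 2 / (2 * m)

def mse_pos(x, y, m):
    """MSE of the positive estimator, second line of Equation (8)."""
    plus = _sq_norm([a + b for a, b in zip(x, y)])
    return math.exp(plus) * _softmax_kernel(x, y) ** 2 * (1.0 - math.exp(-plus)) / m

def mse_hyp(x, y, m):
    """MSE of the hyperbolic positive estimator, third line of Equation (8)."""
    plus = _sq_norm([a + b for a, b in zip(x, y)])
    return 0.5 * (1.0 - math.exp(-plus)) * mse_pos(x, y, m)

# Antiparallel vectors: a small kernel value SM(x, y) = exp(-1).
x, y, m = [1.0, 0.0], [-1.0, 0.0], 128
trig_error = mse_trig(x, y, m)
pos_error = mse_pos(x, y, m)
hyp_error = mse_hyp(x, y, m)
```

For the antiparallel pair above, $\mathbf{x} + \mathbf{y} = 0$, so the positive estimator is constant and its MSE vanishes, while the trigonometric estimator keeps a strictly positive error. This is the small-kernel-value regime the lemma is about.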
Thus, for $\mathrm{SM}(\mathbf{x}, \mathbf{y}) \to 0$ we have: $\mathrm{MSE}(\widehat{\mathrm{SM}}^{\mathrm{trig}}_m(\mathbf{x}, \mathbf{y})) \to \infty$ and $\mathrm{MSE}(\widehat{\mathrm{SM}}^{+}_m(\mathbf{x}, \mathbf{y})) \to 0$. Furthermore, the hyperbolic estimator provides additional accuracy improvements that are strictly better than those from $\widehat{\mathrm{SM}}^{+}_{2m}(\mathbf{x}, \mathbf{y})$ with twice as many random features. The next result shows that the regularized softmax-kernel is in practice an accurate proxy of the softmax-kernel in attention.

Theorem 1 (regularized versus softmax-kernel). Assume that the $L_\infty$-norm of the attention matrix for the softmax-kernel satisfies: $\|\mathbf{A}\|_\infty \leq C$ for some constant $C \geq 1$. Denote by $\mathbf{A}^{\mathrm{reg}}$ the corresponding attention matrix for the regularized softmax-kernel. The following holds:

$$\inf_{i,j} \frac{\mathbf{A}^{\mathrm{reg}}(i, j)}{\mathbf{A}(i, j)} \geq 1 - \frac{2}{d^{1/3}} + o\left(\frac{1}{d^{1/3}}\right), \quad \text{and} \quad \sup_{i,j} \frac{\mathbf{A}^{\mathrm{reg}}(i, j)}{\mathbf{A}(i, j)} \leq 1. \tag{9}$$

Furthermore, the latter holds for $d \geq 2$ even if the $L_\infty$-norm condition is not satisfied, i.e. the regularized softmax-kernel is a universal lower bound for the softmax-kernel.

Consequently, positive random features for SMREG can be used to approximate the softmax-kernel. Our next result shows that orthogonality provably reduces mean squared error of the estimation with positive random features for any dimensionality $d > 0$ and we explicitly provide the gap.

Theorem 2. If $\widehat{\mathrm{SM}}^{\mathrm{ort+}}_m(\mathbf{x}, \mathbf{y})$ stands for the modification of $\widehat{\mathrm{SM}}^{+}_m(\mathbf{x}, \mathbf{y})$ with orthogonal random features (and thus for $m \leq d$), then the following holds for any $d > 0$:

$$\mathrm{MSE}(\widehat{\mathrm{SM}}^{\mathrm{ort+}}_m(\mathbf{x}, \mathbf{y})) \leq \mathrm{MSE}(\widehat{\mathrm{SM}}^{+}_m(\mathbf{x}, \mathbf{y})) - \left(1 - \frac{1}{m}\right) \frac{2}{d + 2} \, \mathrm{SM}^{2}(\mathbf{x}, \mathbf{y}). \tag{10}$$

Furthermore, a completely analogous result holds for the regularized softmax-kernel
SMREG . d > . Our next result enables us to explicitly estimate the gap. Theorem 3.
Let $\mathbf{x}, \mathbf{y} \in \mathbb{R}^d$. The following holds for any $a > \mathrm{SMREG}(\mathbf{x}, \mathbf{y})$ and $m \leq d$:

$$\mathbb{P}[\widehat{\mathrm{SMREG}}^{+}_m(\mathbf{x}, \mathbf{y}) > a] \leq \exp(-m \mathcal{L}_X(a)), \quad \mathbb{P}[\widehat{\mathrm{SMREG}}^{\mathrm{ort+}}_m(\mathbf{x}, \mathbf{y}) > a] \leq \frac{d}{d + 2} \exp(-m \mathcal{L}_X(a)),$$

where $\widehat{\mathrm{SMREG}}^{\mathrm{ort+}}_m(\mathbf{x}, \mathbf{y})$ stands for the modification of $\widehat{\mathrm{SMREG}}^{+}_m(\mathbf{x}, \mathbf{y})$ with ORFs, $X = \Lambda \exp\left(\sqrt{d} \frac{\omega^{\top}}{\|\omega\|} (\mathbf{x} + \mathbf{y})\right)$, $\omega \sim \mathcal{N}(0, \mathbf{I}_d)$, $\Lambda$ is as in Lemma 1 and $\mathcal{L}_Z$ is a Legendre Transform of $Z$ defined as: $\mathcal{L}_Z(a) = \sup_{\theta > 0} \log\left(\frac{e^{\theta a}}{M_Z(\theta)}\right)$ for the moment generating function $M_Z$ of $Z$.

We see that ORFs provide exponentially small and sharper bounds for critical regions where the softmax-kernel is small. Below we show that even for the $\mathrm{SM}^{\mathrm{trig}}$ mechanism with ORFs, it suffices to take $m = \Theta(d \log(d))$ random projections to accurately approximate the attention matrix (thus, if not for attention renormalization, PRFs would not be needed). In general, $m$ depends on the dimensionality $d$ of the embeddings, radius $R$ of the ball where all queries/keys live and precision parameter $\epsilon$ (see: Appendix F.6 for the additional discussion), but does not depend on input sequence length $L$.

Theorem 4 (uniform convergence for attention approximation). Take $h(\mathbf{x}) = \exp(\|\mathbf{x}\|^2 / 2)$. Assume that $L_2$-norms of queries/keys are upper-bounded by $R > 0$. Define $l = R d^{-1/4}$ and take $h^* = \max_{\mathbf{x} \in B(l)} |h(\mathbf{x})|$, where $B(l)$ is a ball of radius $l$ and centered at $0$. Then for any $\epsilon > 0$, $\delta = \frac{\epsilon}{(h^*)^2}$ and the number of random projections $m = \Theta\left(\frac{d}{\delta^2} \log\left(\frac{4 d^{3/4} R}{\delta}\right)\right)$ the following holds for the attention approximation mechanism leveraging estimators $\widehat{\mathrm{SM}}^{\mathrm{trig}}$ with ORFs: $\|\widehat{\mathbf{A}} - \mathbf{A}\|_1 \leq \epsilon$ with any constant probability, where $\widehat{\mathbf{A}}$ is the approximation of the attention matrix $\mathbf{A}$.

4 EXPERIMENTS
We implemented our setup on top of pre-existing Transformer training code in Jax (Frostig et al., 2018) optimized with just-in-time (jax.jit) compilation, and complement our theory with empirical evidence to demonstrate the practicality of FAVOR+ in multiple settings. Unless explicitly stated, a Performer replaces only the attention component with our method, while all other components are exactly the same as for the regular Transformer. For shorthand notation, we denote unidirectional/causal modelling as (U) and bidirectional/masked language modelling as (B).

In terms of baselines, we use other Transformer models for comparison, although some of them are restricted to only one case - e.g. Reformer (Kitaev et al., 2020) is only (U), and Linformer (Wang et al., 2020) is only (B). Furthermore, we use PG-19 (Rae et al., 2020) as an alternative (B) pretraining benchmark, as it is made for long-length sequence training compared to the (now publicly unavailable) BookCorpus (Zhu et al., 2015) + Wikipedia dataset used in BERT (Devlin et al., 2018) and Linformer. All model and tokenization hyperparameters are shown in Appendix A.

Figure 3: Comparison of Transformer and Performer in terms of forward and backward pass speed and maximum $L$ allowed. "X" (OPT) denotes the maximum possible speedup achievable, when attention simply returns the $\mathbf{V}$-matrix. Plots shown up to when a model produces an out of memory error on a V100 GPU with 16GB. Vocabulary size used was 256. Best in color.

4.1 COMPUTATIONAL COSTS
We compared speed-wise the backward pass of the Transformer and the Performer in the (B) setting, as it is one of the main computational bottlenecks during training, when using the regular default size $(n_{\mathrm{heads}}, n_{\mathrm{layers}}, d_{\mathrm{ff}}, d)$ = (8, , , ), where $d_{\mathrm{ff}}$ denotes the width of the MLP layers.

We observed (Fig. 3) that in terms of $L$, the Performer reaches nearly linear time and sub-quadratic memory consumption (since the explicit $O(L^2)$ attention matrix is not stored). In fact, the Performer achieves nearly the optimal speedup and memory efficiency possible, depicted by the "X"-line when attention is replaced with the "identity function" simply returning the $\mathbf{V}$-matrix. The combination of both memory and backward pass efficiencies for large $L$ allows, respectively, large batch training and lower wall clock time per gradient step. Extensive additional results are demonstrated in Appendix E by varying layers, raw attention, and architecture sizes.

4.2 SOFTMAX ATTENTION APPROXIMATION ERROR
We further examined the approximation error via FAVOR+ in Fig. 4. We demonstrate that (1) orthogonal features produce lower error than unstructured (IID) features, and (2) positive features produce lower error than trigonometric sin/cos features. These two empirically validate the PORF mechanism.

Figure 4: MSE of the approximation output when comparing Orthogonal vs IID features and trigonometric sin/cos vs positive features. We took $L = 4096$, $d = 16$, and varied the number of random samples $m$. Standard deviations shown across 15 samples of appropriately normalized random matrix input data.

To further improve the overall approximation of attention blocks across multiple iterations, which further improves training, random samples should be periodically redrawn (Fig. 5, right). This is a cheap procedure, but can be further optimized (Appendix B.2).

4.3 SOFTMAX APPROXIMATION ON TRANSFORMERS
Even if the approximation of the attention mechanism is tight, small errors can easily propagate throughout multiple Transformer layers (e.g. MLPs, multiple heads), as we show in Fig. 14 (Appendix). In other words, the model's Lipschitz constant can easily scale up small attention approximation error, which means that very tight approximations may sometimes be needed. Thus, when applying FAVOR(+)'s softmax approximations on a Transformer model (i.e. "Performer-X-SOFTMAX"), we demonstrate that:

Backwards compatibility with pretrained models is available as a benefit from softmax approximation, via small finetuning (required due to error propagation) even for trigonometric features (Fig. 5, left) on the LM1B dataset (Chelba et al., 2014). However, on the larger PG-19 dataset, Positive (POS) softmax features (with redrawing) become crucial for achieving performance matching regular Transformers (Fig. 5, right).

Figure 5: We transferred the original pretrained Transformer's weights into the Performer, which produces an initial non-zero 0.07 accuracy (dotted orange line), but quickly recovers accuracy in a small fraction of the original number of gradient steps. However on PG-19, Trigonometric (TRIG) softmax approximation becomes highly unstable (full curve in Appendix D.2), while positive features (POS) (without redrawing) and Linformer (which also approximates softmax), even with redrawn projections, plateau at the same perplexity. Positive softmax with feature redrawing is necessary to match the Transformer, with SMREG (regularization from Sec. 3) allowing faster convergence. Additional ablation studies over many attention kernels, showing also that trigonometric random features lead even to NaN values in training, are given in Appendix D.3.

4.4 MULTIPLE LAYER TRAINING FOR PROTEINS
We further benchmark the Performer on both (U) and (B) cases by training a 36-layer model using protein sequences from the Jan. 2019 release of TrEMBL (Consortium, 2019), similar to (Madani et al., 2020). In Fig. 6, the Reformer and Linformer significantly drop in accuracy on the protein dataset. Furthermore, the usefulness of generalized attention is evidenced by Performer-RELU (taking $f$ = ReLU in Equation 5) achieving the highest accuracy in both (U) and (B) cases. Our proposed softmax approximation is also shown to be tight, achieving the same accuracy as the exact-softmax Transformer and confirming our theoretical claims from Section 3.

Figure 6: Train = Dashed, Validation = Solid. For TrEMBL, we used the exact same model parameters $(n_{\mathrm{heads}}, n_{\mathrm{layers}}, d_{\mathrm{ff}}, d)$ = (8, 36, , ) from (Madani et al., 2020) for all runs. For fairness, all TrEMBL experiments used 16x16 TPU-v2's. Batch sizes were maximized for each separate run given the compute constraints. Hyperparameters can be found in Appendix A. Extended results including dataset statistics, out of distribution evaluations, and visualizations, can be found in Appendix C.

4.5 LARGE LENGTH TRAINING - COMMON DATASETS
On the standard (U) ImageNet64 benchmark from (Parmar et al., 2018) with $L = 12288$, which is unfeasible for regular Transformers, we set all models to use the same $(n_{\mathrm{heads}}, d_{\mathrm{ff}}, d)$ but varying $n_{\mathrm{layers}}$. Performer/6-layers matches the Reformer/12-layers, while the Performer/12-layers matches the Reformer/24-layers (Fig. 7: left). Depending on hardware (TPU or GPU), we also found that the Performer can be 2x faster than the Reformer via Jax optimizations for the (U) setting.

For a proof of principle study, we also create an initial protein benchmark for predicting interactions among groups of proteins by concatenating protein sequences to length $L = 8192$ from TrEMBL, long enough to model protein interaction networks without the large sequence alignments required by existing methods (Cong et al., 2019). In this setting, a regular Transformer overloads memory even at a batch size of per chip, by a wide margin. Thus as a baseline, we were forced to use a significantly smaller variant, reducing to $(n_{\mathrm{heads}}, n_{\mathrm{layers}}, d_{\mathrm{ff}}, d)$ = (8, { , , }, , ). Meanwhile, the Performer trains efficiently at a batch size of 8 per chip using the standard (8, , , ) architecture. We see in Fig. 7 (right subfigure) that the smaller Transformer ($n_{\mathrm{layer}} = 3$) is quickly bounded at ≈ , while the Performer is able to train continuously to ≈ .

Figure 7: Train = Dashed, Validation = Solid. For ImageNet64, all models used the standard $(n_{\mathrm{heads}}, d_{\mathrm{ff}}, d)$ = (8, , ). We further show that our positive softmax approximation achieves the same performance as ReLU in Appendix D.2. For concatenated TrEMBL, we varied $n_{\mathrm{layers}}$ ∈ { , , } for the smaller Transformer. Hyperparameters can be found in Appendix A.

5 CONCLUSION
We presented Performer, a new type of Transformer, relying on our Fast Attention Via positive Orthogonal Random features (FAVOR+) mechanism to significantly improve space and time complexity of regular Transformers. Our mechanism provides, to our knowledge, the first effective unbiased estimation of the original softmax-based Transformer with linear space and time complexity and opens new avenues in the research on Transformers and the role of non-sparsifying attention mechanisms.

BROADER IMPACT
We believe that the presented algorithm can be impactful in various ways:
Biology and Medicine:
Our method has the potential to directly impact research on biological sequence analysis by enabling the Transformer to be applied to much longer sequences without constraints on the structure of the attention matrix. The initial application that we consider is the prediction of interactions between proteins on the proteome scale. Recently published approaches require large evolutionary sequence alignments, a bottleneck for applications to mammalian genomes (Cong et al., 2019). The potentially broad translational impact of applying these approaches to biological sequences was one of the main motivations of this work. We believe that modern bioinformatics can immensely benefit from new machine learning techniques, with Transformers being among the most promising. Scaling up these methods to train faster, more accurate language models opens the door to the ability to design sets of molecules with pre-specified interaction properties. These approaches could be used to augment existing physics-based design strategies that are of critical importance, for example, in the development of new nanoparticle vaccines (Marcandalli et al., 2019).
Environment:
As we have shown, Performers with FAVOR are characterized by much lower compute costs and substantially lower space complexity, which can be directly translated to CO2 emission reduction (Strubell et al., 2019) and lower energy consumption (You et al., 2020), as regular Transformers require very large computational resources.
Research on Transformers:
We believe that our results can shape research on efficient Transformer architectures, guiding the field towards methods with strong mathematical foundations. Our research may also extend Transformers beyond their standard scope (e.g. by considering the Generalized Attention mechanism and connections with kernels). Exploring scalable Transformer architectures that can handle L on the order of a few thousand and more, while preserving the accuracy of the baseline, is a gateway to new breakthroughs in bioinformatics, e.g. language modeling for proteins, as we explained in the paper. Our presented method can potentially be a first step.
Backward Compatibility:
Our Performer can be used on top of a regular pre-trained Transformer, as opposed to other Transformer variants. Even if up-training is not required, FAVOR can still be used for fast inference with no loss of accuracy. We think of this backward compatibility as a very important additional feature of the presented techniques that might be particularly attractive for practitioners.
Attention Beyond Transformers:
Finally, FAVOR can also be applied to approximate exact attention outside the scope of Transformers. This opens a large volume of new potential applications including: hierarchical attention networks (HANS) (Yang et al., 2016), graph attention networks (Velickovic et al., 2018), image processing (Fu et al., 2019), and reinforcement learning/robotics (Tang et al., 2020).
ACKNOWLEDGEMENTS
We thank Nikita Kitaev and Wojciech Gajewski for multiple discussions on the Reformer, and also thank Aurko Roy and Ashish Vaswani for multiple discussions on the Routing Transformer. We further thank Joshua Meier, John Platt, and Tom Weingarten for many fruitful discussions on biological data and useful comments on this draft. We lastly thank Yi Tay and Mostafa Dehghani for discussions on comparing baselines.

Valerii Likhosherstov acknowledges support from the Cambridge Trust and DeepMind. Lucy Colwell acknowledges support from the Simons Foundation. Adrian Weller acknowledges support from the David MacKay Newton research fellowship at Darwin College, The Alan Turing Institute under EPSRC grant EP/N510129/1 and U/B/000074, and the Leverhulme Trust via CFI.
REFERENCES
Irwan Bello, Barret Zoph, Ashish Vaswani, Jonathon Shlens, and Quoc V. Le. Attention augmented convolutional networks. CoRR, abs/1904.09925, 2019. URL http://arxiv.org/abs/1904.09925.
Iz Beltagy, Matthew E. Peters, and Arman Cohan. Longformer: The long-document transformer. CoRR, abs/2004.05150, 2020. URL https://arxiv.org/abs/2004.05150.
William Chan, Chitwan Saharia, Geoffrey E. Hinton, Mohammad Norouzi, and Navdeep Jaitly. Imputer: Sequence modelling via imputation and dynamic programming. CoRR, abs/2002.08926, 2020. URL https://arxiv.org/abs/2002.08926.
Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. One billion word benchmark for measuring progress in statistical language modeling. In INTERSPEECH 2014, 15th Annual Conference of the International Speech Communication Association, Singapore, September 14-18, 2014, pp. 2635–2639, 2014.
Ciprian Chelba, Mia Xu Chen, Ankur Bapna, and Noam Shazeer. Faster transformer decoding: N-gram masked self-attention. CoRR, abs/2001.04589, 2020. URL https://arxiv.org/abs/2001.04589.
Mia Xu Chen, Orhan Firat, Ankur Bapna, Melvin Johnson, Wolfgang Macherey, George F. Foster, Llion Jones, Mike Schuster, Noam Shazeer, Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Zhifeng Chen, Yonghui Wu, and Macduff Hughes. The best of both worlds: Combining recent advances in neural machine translation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 1: Long Papers, pp. 76–86. Association for Computational Linguistics, 2018. doi: 10.18653/v1/P18-1008.
Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. CoRR, abs/1904.10509, 2019. URL http://arxiv.org/abs/1904.10509.
Krzysztof Choromanski, Carlton Downey, and Byron Boots. Initialization matters: Orthogonal predictive state recurrent neural networks. OpenReview.net, 2018a. URL https://openreview.net/forum?id=HJJ23bW0b.
Krzysztof Choromanski, Mark Rowland, Tamás Sarlós, Vikas Sindhwani, Richard E. Turner, and Adrian Weller. The geometry of random features. In International Conference on Artificial Intelligence and Statistics, AISTATS 2018, 9-11 April 2018, Playa Blanca, Lanzarote, Canary Islands, Spain, volume 84 of Proceedings of Machine Learning Research, pp. 1–9. PMLR, 2018b. URL http://proceedings.mlr.press/v84/choromanski18a.html.
Krzysztof Choromanski, Aldo Pacchiano, Jeffrey Pennington, and Yunhao Tang. KAMA-NNs: Low-dimensional rotation based neural networks. In The 22nd International Conference on Artificial Intelligence and Statistics, AISTATS 2019, 16-18 April 2019, Naha, Okinawa, Japan, volume 89 of Proceedings of Machine Learning Research, pp. 236–245. PMLR, 2019a. URL http://proceedings.mlr.press/v89/choromanski19a.html.
Krzysztof Choromanski, Mark Rowland, Wenyu Chen, and Adrian Weller. Unifying orthogonal Monte Carlo methods. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pp. 1203–1212. PMLR, 2019b. URL http://proceedings.mlr.press/v97/choromanski19a.html.
Krzysztof Marcin Choromanski, Mark Rowland, and Adrian Weller. The unreasonable effectiveness of structured random orthogonal embeddings. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pp. 219–228, 2017.
Qian Cong, Ivan Anishchenko, Sergey Ovchinnikov, and David Baker. Protein interaction networks revealed by proteome coevolution. Science, 365(6449):185–189, 2019.
UniProt Consortium. Uniprot: a worldwide hub of protein knowledge. Nucleic acids research, 47(D1):D506–D515, 2019.
Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms, 3rd Edition. MIT Press, 2009. ISBN 978-0-262-03384-8. URL http://mitpress.mit.edu/books/introduction-algorithms.
Zihang Dai, Zhilin Yang, Yiming Yang, William W. Cohen, Jaime Carbonell, Quoc V. Le, and Ruslan Salakhutdinov. Transformer-XL: Language modeling with longer-term dependency, 2019. URL https://openreview.net/forum?id=HJePno0cYm.
Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Lukasz Kaiser. Universal transformers. OpenReview.net, 2019. URL https://openreview.net/forum?id=HyzdRiR9Y7.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805, 2018. URL http://arxiv.org/abs/1810.04805.
Yilun Du, Joshua Meier, Jerry Ma, Rob Fergus, and Alexander Rives. Energy-based models for atomic-resolution protein conformations. arXiv preprint arXiv:2004.13167, 2020.
Ahmed Elnaggar, Michael Heinzinger, Christian Dallago, and Burkhard Rost. End-to-end multitask learning, from protein language to protein features without alignments. bioRxiv, pp. 864405, 2019.
Roy Frostig, Matthew Johnson, and Chris Leary. Compiling machine learning programs via high-level tracing. In Conference on Machine Learning and Systems 2018, 2018.
Jun Fu, Jing Liu, Haijie Tian, Yong Li, Yongjun Bao, Zhiwei Fang, and Hanqing Lu. Dual attention network for scene segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pp. 3146–3154, 2019.
Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, and Ruoming Pang. Conformer: Convolution-augmented transformer for speech recognition, 2020.
Cheng-Zhi Anna Huang, Ashish Vaswani, Jakob Uszkoreit, Ian Simon, Curtis Hawthorne, Noam Shazeer, Andrew M. Dai, Matthew D. Hoffman, Monica Dinculescu, and Douglas Eck. Music transformer: Generating music with long-term structure. OpenReview.net, 2019. URL https://openreview.net/forum?id=rJe4ShAcF7.
John Ingraham, Vikas Garg, Regina Barzilay, and Tommi Jaakkola. Generative models for graph-based protein design. In Advances in Neural Information Processing Systems, pp. 15794–15805, 2019.
Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are rnns: Fast autoregressive transformers with linear attention. CoRR, abs/2006.16236, 2020. URL https://arxiv.org/abs/2006.16236.
Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya. Reformer: The efficient transformer. OpenReview.net, 2020. URL https://openreview.net/forum?id=rkgNKkHtvB.
Olga Kovaleva, Alexey Romanov, Anna Rogers, and Anna Rumshisky. Revealing the dark secrets of bert. arXiv preprint arXiv:1908.08593, 2019.
Taku Kudo and John Richardson. Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. CoRR, abs/1808.06226, 2018. URL http://arxiv.org/abs/1808.06226.
Richard E. Ladner and Michael J. Fischer. Parallel prefix computation. J. ACM, 27(4):831–838, October 1980. ISSN 0004-5411. doi: 10.1145/322217.322232. URL https://doi.org/10.1145/322217.322232.
Han Lin, Haoxian Chen, Tianyi Zhang, Clément Laroche, and Krzysztof Choromanski. Demystifying orthogonal Monte Carlo and beyond. CoRR, abs/2005.13590, 2020.
Haoneng Luo, Shiliang Zhang, Ming Lei, and Lei Xie. Simplified self-attention for transformer-based end-to-end speech recognition. CoRR, abs/2005.10463, 2020. URL https://arxiv.org/abs/2005.10463.
Ali Madani, Bryan McCann, Nikhil Naik, Nitish Shirish Keskar, Namrata Anand, Raphael R. Eguchi, Po-Ssu Huang, and Richard Socher. Progen: Language modeling for protein generation. CoRR, abs/2004.03497, 2020. URL https://arxiv.org/abs/2004.03497.
Jessica Marcandalli, Brooke Fiala, Sebastian Ols, Michela Perotti, Willem de van der Schueren, Joost Snijder, Edgar Hodge, Mark Benhaim, Rashmi Ravichandran, Lauren Carter, et al. Induction of potent neutralizing antibody responses by a designed protein nanoparticle vaccine for respiratory syncytial virus. Cell, 176(6):1420–1431, 2019.
Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Noam Shazeer, Alexander Ku, and Dustin Tran. Image transformer. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, pp. 4052–4061. PMLR, 2018. URL http://proceedings.mlr.press/v80/parmar18a.html.
Jack W. Rae, Anna Potapenko, Siddhant M. Jayakumar, Chloe Hillier, and Timothy P. Lillicrap. Compressive transformers for long-range sequence modelling. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=SylKikSYDH.
Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems 20, Proceedings of the Twenty-First Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 3-6, 2007, pp. 1177–1184. Curran Associates, Inc., 2007. URL http://papers.nips.cc/paper/3182-random-features-for-large-scale-kernel-machines.
Alexander Rives, Siddharth Goyal, Joshua Meier, Demi Guo, Myle Ott, C. Zitnick, Jerry Ma, and Rob Fergus. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. bioArxiv, 04 2019. doi: 10.1101/622803.
Mark Rowland, Jiri Hron, Yunhao Tang, Krzysztof Choromanski, Tamás Sarlós, and Adrian Weller. Orthogonal estimation of Wasserstein distances. In The 22nd International Conference on Artificial Intelligence and Statistics, AISTATS 2019, 16-18 April 2019, Naha, Okinawa, Japan, volume 89 of Proceedings of Machine Learning Research, pp. 186–195. PMLR, 2019. URL http://proceedings.mlr.press/v89/rowland19a.html.
Aurko Roy, Mohammad Saffar, Ashish Vaswani, and David Grangier. Efficient content-based sparse attention with routing transformers. CoRR, abs/2003.05997, 2020. URL https://arxiv.org/abs/2003.05997.
Zhuoran Shen, Mingyuan Zhang, Shuai Yi, Junjie Yan, and Haiyu Zhao. Factorized attention: Self-attention with linear complexities. CoRR, abs/1812.01243, 2018. URL http://arxiv.org/abs/1812.01243.
Emma Strubell, Ananya Ganesh, and Andrew McCallum. Energy and policy considerations for deep learning in NLP. CoRR, abs/1906.02243, 2019. URL http://arxiv.org/abs/1906.02243.
Yujin Tang, Duong Nguyen, and David Ha. Neuroevolution of self-interpretable agents. CoRR, abs/2003.08165, 2020. URL https://arxiv.org/abs/2003.08165.
Yao-Hung Hubert Tsai, Shaojie Bai, Makoto Yamada, Louis-Philippe Morency, and Ruslan Salakhutdinov. Transformer dissection: An unified understanding for transformer's attention via the lens of kernel. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 4335–4344, 2019.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems 30, pp. 5998–6008. Curran Associates, Inc., 2017. URL http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf.
Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. OpenReview.net, 2018. URL https://openreview.net/forum?id=rJXMpikCZ.
Jesse Vig. A multiscale visualization of attention in the transformer model. arXiv preprint arXiv:1906.05714, 2019.
Jesse Vig and Yonatan Belinkov. Analyzing the structure of attention in a transformer language model. CoRR, abs/1906.04284, 2019. URL http://arxiv.org/abs/1906.04284.
Jesse Vig, Ali Madani, Lav R. Varshney, Caiming Xiong, Richard Socher, and Nazneen Fatema Rajani. Bertology meets biology: Interpreting attention in protein language models. CoRR, abs/2006.15222, 2020. URL https://arxiv.org/abs/2006.15222.
Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. Pointer networks. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pp. 2692–2700, 2015.
Sinong Wang, Belinda Z. Li, Madian Khabsa, Han Fang, and Hao Ma. Linformer: Self-attention with linear complexity. CoRR, abs/2006.04768, 2020. URL https://arxiv.org/abs/2006.04768.
Tong Xiao, Yinqiao Li, Jingbo Zhu, Zhengtao Yu, and Tongran Liu. Sharing attention weights for fast transformer. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI 2019, Macao, China, August 10-16, 2019, pp. 5292–5298. ijcai.org, 2019. doi: 10.24963/ijcai.2019/735. URL https://doi.org/10.24963/ijcai.2019/735.
Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alexander J. Smola, and Eduard H. Hovy. Hierarchical attention networks for document classification. In NAACL HLT 2016, The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego California, USA, June 12-17, 2016, pp. 1480–1489. The Association for Computational Linguistics, 2016. doi: 10.18653/v1/n16-1174. URL https://doi.org/10.18653/v1/n16-1174.
Haoran You, Chaojian Li, Pengfei Xu, Yonggan Fu, Yue Wang, Xiaohan Chen, Richard G. Baraniuk, Zhangyang Wang, and Yingyan Lin. Drawing early-bird tickets: Toward more efficient training of deep networks. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=BJxsrgStvr.
Felix X. Yu, Ananda Theertha Suresh, Krzysztof Marcin Choromanski, Daniel N. Holtmann-Rice, and Sanjiv Kumar. Orthogonal random features. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pp. 1975–1983, 2016.
Vinícius Flores Zambaldi, David Raposo, Adam Santoro, Victor Bapst, Yujia Li, Igor Babuschkin, Karl Tuyls, David P. Reichert, Timothy P. Lillicrap, Edward Lockhart, Murray Shanahan, Victoria Langston, Razvan Pascanu, Matthew Botvinick, Oriol Vinyals, and Peter W. Battaglia. Deep reinforcement learning with relational inductive biases. 2019.
Yukun Zhu, Ryan Kiros, Richard S. Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. pp. 19–27, 2015. doi: 10.1109/ICCV.2015.11. URL https://doi.org/10.1109/ICCV.2015.11.
APPENDIX: RETHINKING ATTENTION WITH PERFORMERS
A HYPERPARAMETERS FOR EXPERIMENTS
The optimal setting that we use for the Performer (including in comparisons to approximate softmax) is specified in the Generalized Attention default values (Subsec. A.4). Unless specifically mentioned (e.g. using the name "Performer-SOFTMAX"), "Performer" refers to this generalized attention setting.
A.1 METRICS
We report the following evaluation metrics:
1. Accuracy: For unidirectional models, we measure the accuracy on next-token prediction, averaged across all sequence positions in the dataset. For bidirectional models, we mask each token with the same probability as in (Devlin et al., 2018) and measure accuracy across the masked positions.
2. Perplexity: For unidirectional models, we measure perplexity across all sequence positions in the dataset. For bidirectional models, similar to the accuracy case, we measure perplexity across the masked positions.
3. Bits Per Dimension/Character (BPD/BPC): This is calculated as the loss divided by ln(2).
We used the full evaluation dataset for TrEMBL in the plots in the main section, while for other datasets such as ImageNet64 and PG-19, which have very large evaluation dataset sizes, we used random batches (>2048 samples) for plotting curves.
A.1.1 PG-19 PREPROCESSING
The PG-19 dataset (Rae et al., 2020) is presented as a challenging long range text modeling task. It consists of out-of-copyright Project Gutenberg books published before 1919. It does not have a fixed vocabulary size, instead opting for any tokenization which can model an arbitrary string of text. We use a unigram SentencePiece vocabulary (Kudo & Richardson, 2018) with 32768 tokens, which maintains whitespace and is completely invertible to the original book text. Perplexities are calculated from the average log-likelihood per token, multiplied by the ratio of the number of SentencePiece tokens to the number of tokens in the original dataset. The original dataset token count per split is: train=1973136207, validation=3007061, test=6966499. Our SentencePiece tokenization yields the following token counts per split: train=3084760726, valid=4656945, and test=10699704. This gives log likelihood multipliers of train=1.5634, valid=1.5487, test=1.5359 per split before computing perplexity, which is equal to exp(log likelihood multiplier ∗ loss).

Preprocessing for TrEMBL is extensively explained in Appendix C.
A.2 TRAINING HYPERPARAMETERS
Unless specifically stated, all Performer + Transformer runs by default used . grad clip, . weight decay, . dropout, − fixed learning rate with Adam hyperparameters ( β = 0 . , β = 0 . , ε = 10 − ), with batch size maximized (until TPU memory overload) for a specific model.

All 36-layer protein experiments used the same amount of compute (i.e. 16x16 TPU-v2, 8GB per chip). For concatenated experiments, 16x16 TPU-v2's were also used for the Performer, while 8x8's were used for the 1-3 layer ( d = 256 ) Transformer models (using 16x16 did not make a difference in accuracy).
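The metric conversions from A.1 and the PG-19 perplexity correction from A.1.1 can be sketched as follows. This is a minimal illustration rather than the evaluation code used for the experiments: the function names are hypothetical, and `loss_nats` is assumed to be the mean per-token negative log-likelihood in nats. Only the token counts are taken from the text.

```python
import math

# Token counts per split as reported in A.1.1.
ORIGINAL_TOKENS = {"train": 1973136207, "valid": 3007061, "test": 6966499}
SENTENCEPIECE_TOKENS = {"train": 3084760726, "valid": 4656945, "test": 10699704}


def bits_per_dim(loss_nats):
    """Convert a mean loss in nats to bits per dimension/character."""
    return loss_nats / math.log(2)


def log_likelihood_multiplier(split):
    """Ratio of SentencePiece tokens to original tokens for a split."""
    return SENTENCEPIECE_TOKENS[split] / ORIGINAL_TOKENS[split]


def pg19_perplexity(loss_nats, split):
    """Perplexity per original token: exp(multiplier * per-subword loss)."""
    return math.exp(log_likelihood_multiplier(split) * loss_nats)
```

The multiplier rescales a loss measured per SentencePiece token into a loss per original token, so that perplexities remain comparable across tokenizations.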
Note that Performers are using the same training hyperparameters as Transformers, yet achieving competitive results. This shows that FAVOR can act as a simple drop-in without needing much tuning.
A.3 APPROXIMATE SOFTMAX ATTENTION DEFAULT VALUES
The optimal values, set to default parameters (see https://github.com/google-research/google-research/blob/master/performer/fast_self_attention/fast_self_attention.py), are: renormalize_attention = True, numerical stabilizer = − , number of features = 256, ortho_features = True, ortho_scaling = 0.0.
A.4 GENERALIZED ATTENTION DEFAULT VALUES
The optimal values, set to default parameters, are: renormalize_attention = True, numerical stabilizer = 0.0, number of features = 256, kernel = ReLU, kernel_epsilon = − .
A.5 REFORMER DEFAULT VALUES
For the Reformer, we used the same hyperparameters as mentioned for protein experiments, without gradient clipping, while using the defaults (which instead use learning rate decay) for ImageNet-64. In both cases, the Reformer used the same default LSH attention parameters.
A.6 LINFORMER DEFAULT VALUES
Using our standard pipeline as mentioned above, we replaced the attention function with the Linformer variant via Jax, with δ = 10 − , k = 600 (same notation used in the paper (Wang et al., 2020)), where δ is the exponent in a renormalization procedure using e^{−δ} as a multiplier in order to approximate softmax, while k is the dimension of the projections of the Q and K matrices. As a sanity check, we found that our Linformer implementation in Jax correctly approximated exact softmax's output within . error for all entries.

Note that for rigorous comparisons, our Linformer hyperparameters are even stronger than the defaults found in (Wang et al., 2020), as:
• We use k = 600, which is more than twice the default k = 256 from the paper, and also more than twice our default m = 256 number of features.
• We also use redrawing, which avoids "unlucky" projections on Q and K.

https://github.com/google-research/google-research/blob/master/performer/fast_self_attention/fast_self_attention.py
https://github.com/google/trax/blob/master/trax/supervised/configs/reformer_imagenet64.gin
B MAIN ALGORITHM: FAVOR+
We outline the main algorithm for FAVOR+ formally:
Algorithm 1: FAVOR+ (bidirectional or unidirectional).
Input: $Q, K, V \in \mathbb{R}^{L \times d}$, isBidirectional - binary flag.
Result: $\widehat{\mathrm{Att}}_{\leftrightarrow}(Q, K, V) \in \mathbb{R}^{L \times d}$ if isBidirectional, $\widehat{\mathrm{Att}}_{\rightarrow}(Q, K, V) \in \mathbb{R}^{L \times d}$ otherwise.
Compute $Q'$ and $K'$ as described in Section 2.2 and Section 2.3 and take $C := [V \; 1_L]$;
if isBidirectional then
    $\mathrm{Buf}_1 := (K')^\top C \in \mathbb{R}^{M \times (d+1)}$, $\mathrm{Buf}_2 := Q' \mathrm{Buf}_1 \in \mathbb{R}^{L \times (d+1)}$;
else
    Compute $G$ and its prefix-sum tensor $G^{PS}$ according to (11);
    $\mathrm{Buf}_2 := [G^{PS}_{1,:,:} Q'_1 \; \ldots \; G^{PS}_{L,:,:} Q'_L]^\top \in \mathbb{R}^{L \times (d+1)}$;
end
$[\mathrm{Buf}_3 \; \mathrm{buf}_4] := \mathrm{Buf}_2$, $\mathrm{Buf}_3 \in \mathbb{R}^{L \times d}$, $\mathrm{buf}_4 \in \mathbb{R}^{L}$;
return $\mathrm{diag}(\mathrm{buf}_4)^{-1} \mathrm{Buf}_3$;
B.1 UNIDIRECTIONAL CASE AND PREFIX SUMS
We explain how our analysis from Section 2.2 can be extended to the unidirectional mechanism in this section. Notice that this time the attention matrix $A$ is masked, i.e. all its entries not in the lower-triangular part (which contains the diagonal) are zeroed (see also Fig. 8).

Figure 8: Visual representation of the prefix-sum algorithm for unidirectional attention. For clarity, we omit attention normalization in this visualization. The algorithm keeps the prefix-sum, which is a matrix obtained by summing the outer products of random features corresponding to keys with value-vectors. At each given iteration of the prefix-sum algorithm, a random feature vector corresponding to a query is multiplied by the most recent prefix-sum (obtained by summing all outer-products corresponding to preceding tokens) to obtain a new row of the matrix $AV$ which is output by the attention mechanism.

For the unidirectional case, our analysis is similar to that for the bidirectional case, but this time our goal is to compute $\mathrm{tril}(Q'(K')^\top) C$ without constructing and storing the $L \times L$-sized matrix $\mathrm{tril}(Q'(K')^\top)$ explicitly, where $C = [V \; 1_L] \in \mathbb{R}^{L \times (d+1)}$. In order to do so, observe that for all $1 \le i \le L$:

$$[\mathrm{tril}(Q'(K')^\top) C]_i = G^{PS}_{i,:,:} \times Q'_i, \quad G^{PS}_{i,:,:} = \sum_{j=1}^{i} G_{j,:,:}, \quad G_{j,:,:} = K'_j C_j^\top \in \mathbb{R}^{M \times (d+1)} \qquad (11)$$

where $G, G^{PS} \in \mathbb{R}^{L \times M \times (d+1)}$ are 3d-tensors. Each slice $G^{PS}_{:,l,p}$ is therefore a result of a prefix-sum (or cumulative-sum) operation applied to $G_{:,l,p}$: $G^{PS}_{i,l,p} = \sum_{j=1}^{i} G_{j,l,p}$. An efficient algorithm to compute the prefix-sum of $L$ elements takes $O(L)$ total steps and $O(\log L)$ time when computed in parallel (Ladner & Fischer, 1980; Cormen et al., 2009). See Algorithm 1 for the whole approach.
B.2 ORTHOGONAL RANDOM FEATURES - EXTENSIONS
As mentioned in the main text, for isotropic $\Omega$ (true for most practical applications, including regular attention), instead of sampling $\omega_i$ independently, we can use orthogonal random features (ORF) (Yu et al., 2016; Choromanski et al., 2017; 2018b): these maintain the marginal distributions of samples $\omega_i$ while enforcing that different samples are orthogonal. If we need $m > d$, ORFs still can be used locally within each $d \times d$ block of $W$ (Yu et al., 2016).

ORFs were introduced to reduce the variance of Monte Carlo estimators (Yu et al., 2016; Choromanski et al., 2017; 2018b; 2019a; Rowland et al., 2019; Choromanski et al., 2018a; 2019b) and we showed in the theoretical and experimental sections from the main body that they do indeed lead to more accurate approximations and substantially better downstream results. There exist several variants of the ORF-mechanism and in the main body we discussed only the base one (that we refer to here as regular). Below we briefly review the most efficient ORF mechanisms (based on their strengths and costs) to present the most complete picture.

(1) Regular ORFs [R-ORFs]: Applies Gaussian orthogonal matrices (Yu et al., 2016). Encodes matrix $W$ of $\omega$-samples (with different rows corresponding to different samples) in $O(md)$ space. Provides algorithm for computing $Wx$ in $O(md)$ time for any $x \in \mathbb{R}^d$. Gives unbiased estimation. Requires one-time $O(md^2)$ preprocessing (Gram-Schmidt orthogonalization).

(2) Hadamard/Givens ORFs [H/G-ORFs]: Applies random Hadamard (Choromanski et al., 2017) or Givens matrices (Choromanski et al., 2019b). Encodes matrix $W$ in $O(m)$ or $O(m \log(d))$ space. Provides algorithm for computing $Wx$ in $O(m \log(d))$ time for any $x \in \mathbb{R}^d$. Gives small bias (tending to $0$ with $d \to \infty$).
B.3 TIME AND SPACE COMPLEXITY - DETAILED ANALYSIS
We see that a variant of bidirectional FAVOR+ using iid samples or R-ORFs has $O(md + Ld + mL)$ space complexity as opposed to $\Theta(L^2 + Ld)$ space complexity of the baseline. Unidirectional FAVOR+ using fast prefix-sum pre-computation in parallel (Ladner & Fischer, 1980; Cormen et al., 2009) has $O(mLd)$ space complexity to store $G^{PS}$, which can be reduced to $O(md + Ld + mL)$ by running a simple (though non-parallel in $L$) aggregation of $G^{PS}_{i,:,:}$ without storing the whole tensor $G^{PS}$ in memory. From Subsec. B.2, we know that if instead we use G-ORFs, then space complexity is reduced to $O(m \log(d) + Ld + mL)$ and if the H-ORFs mechanism is used, then space is further reduced to $O(m + Ld + mL) = O(Ld + mL)$. Thus for $m, d \ll L$ all our variants provide substantial space complexity improvements since they do not need to store the attention matrix explicitly.

The time complexity of Algorithm 1 is $O(Lmd)$ (note that constructing $Q'$ and $K'$ can be done in time $O(Lmd)$). Note that the time complexity of our method is much lower than $O(L^2 d)$ of the baseline for $L \gg m$.

As explained in Subsec. B.2, the R-ORF mechanism incurs an extra one-time $O(md^2)$ cost (negligible compared to the $O(Lmd)$ term for $L \gg d$). H-ORFs or G-ORFs do not have this cost, and when FAVOR+ uses them, computing $Q'$ and $K'$ can be conducted in time $O(L \log(m) d)$ as opposed to $O(Lmd)$ (see: Subsec. B.2). Thus even though H/G-ORFs do not change the asymptotic time complexity, they improve the constant factor from the leading term. This might play an important role in training very large models.

The number of random features $m$ allows a trade-off between computational complexity and the level of approximation: bigger $m$ results in higher computation costs, but also in a lower variance of the estimate of $A$.
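Algorithm 1 and the space-saving aggregation discussed above can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the released implementation: the function names are hypothetical, the feature matrices Q' and K' are assumed to be already computed and strictly positive, and the unidirectional branch uses the simple sequential aggregation (non-parallel in L) rather than a parallel prefix-sum. A quadratic reference function is included only to check the result.

```python
import numpy as np


def favor_attention(q_prime, k_prime, v, bidirectional=True):
    """Linear-time attention from feature maps, following Algorithm 1.

    q_prime, k_prime: (L, M) positive random-feature matrices Q', K'.
    v: (L, d) value matrix. Returns an (L, d) array.
    """
    L = v.shape[0]
    c = np.concatenate([v, np.ones((L, 1))], axis=1)  # C = [V 1_L], (L, d+1)
    if bidirectional:
        buf1 = k_prime.T @ c   # (M, d+1)
        buf2 = q_prime @ buf1  # (L, d+1)
    else:
        # Running prefix sum of outer products K'_j C_j^T. Only the current
        # (M, d+1) slice is kept, never the full (L, M, d+1) tensor.
        prefix = np.zeros((k_prime.shape[1], c.shape[1]))
        buf2 = np.empty_like(c)
        for i in range(L):
            prefix += np.outer(k_prime[i], c[i])
            buf2[i] = q_prime[i] @ prefix
    buf3, buf4 = buf2[:, :-1], buf2[:, -1]
    return buf3 / buf4[:, None]  # diag(buf4)^{-1} Buf3


def explicit_attention(q_prime, k_prime, v, bidirectional=True):
    """Reference O(L^2) computation that materializes Q'(K')^T."""
    a = q_prime @ k_prime.T
    if not bidirectional:
        a = np.tril(a)
    return (a @ v) / a.sum(axis=1, keepdims=True)
```

Appending the all-ones column to V lets a single pass produce both the unnormalized output and the row sums needed for renormalization, which is why the L × L matrix never has to be formed.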
In the theoretical section from the main body we showed that in practice we can take $M = \Theta(d \log(d))$.

Observe that the FAVOR+ algorithm is highly parallelizable, and benefits from fast matrix multiplication and broadcasted operations on GPUs or TPUs.
C EXPERIMENTAL DETAILS FOR PROTEIN MODELING TASKS
C.1 TREMBL DATASET
Dataset           Set Name   Count         Length Statistics
                                           Min     Max      Mean     STD      Median
TrEMBL            Train      104,863,744   2       74,488   353.09   311.16   289.00
                  Valid      102,400       7       11,274   353.62   307.42   289.00
                  Test       1,033,216     8       32,278   353.96   312.23   289.00
                  OOD        29,696        24      4,208    330.96   269.86   200.00
TrEMBL (concat)   Train      4,532,224     8,192   8,192    8,192    0        8,192
                  Valid      4,096

Table 1: Statistics for the TrEMBL single sequence and the long sequence task.

We used the TrEMBL dataset, which contains 139,394,261 sequences of which 106,030,080 are unique. While the training dataset appears smaller than the one used in Madani et al. (2020), we argue that it includes most of the relevant sequences. Specifically, the TrEMBL dataset consists of the subset of UniProtKB sequences that have been computationally analyzed but not manually curated, and accounts for ≈ . of the total number of sequences in the UniProtKB dataset.

Following the methodology described in Madani et al. (2020), we used both an OOD-Test set, where a selected subset of Pfam families are held out for evaluation, and an IID split, where the remaining protein sequences are split randomly into train, valid, and test sets. We held out the following protein families (PF18369, PF04680, PF17988, PF12325, PF03272, PF03938, PF17724, PF10696, PF11968, PF04153, PF06173, PF12378, PF04420, PF10841, PF06917, PF03492, PF06905, PF15340, PF17055, PF05318), which resulted in 29,696 OOD sequences. We note that, due to deduplication and potential TrEMBL version mismatch, our OOD-Test set does not match exactly the one in Madani et al. (2020). We also note that this OOD-Test selection methodology does not guarantee that the evaluation sequences are at least a minimum distance away from the sequences used during training. In future work, we will include rigorous distance based splits.

The statistics for the resulting dataset splits are reported in Table 1.
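The split methodology described above (hold out selected Pfam families as OOD, then split the remainder randomly) can be sketched as follows. This is a hypothetical illustration and not the preprocessing pipeline used for the experiments: the record format, the function name, the split fractions and the seed are all assumptions.

```python
import random


def split_dataset(records, held_out_families, valid_frac, test_frac, seed=0):
    """Split (sequence, families) records into train/valid/test/ood lists.

    records: iterable of (sequence, set_of_pfam_ids) pairs (assumed format).
    A record goes to the OOD set if it belongs to any held-out family;
    the remaining records are shuffled and split randomly (IID).
    """
    ood, iid = [], []
    for seq, families in records:
        (ood if families & held_out_families else iid).append(seq)
    rng = random.Random(seed)
    rng.shuffle(iid)
    n_valid = int(len(iid) * valid_frac)
    n_test = int(len(iid) * test_frac)
    return {
        "valid": iid[:n_valid],
        "test": iid[n_valid:n_valid + n_test],
        "train": iid[n_valid + n_test:],
        "ood": ood,
    }
```

As noted in the text, a family-based hold-out of this kind does not by itself enforce any minimum sequence distance between the OOD set and the training set.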
In the standard sequence modeling task, given the length statistics that are reported in the table, we clip single sequences to a maximum length of $L = 1024$, which results in few sequences being truncated significantly.

In the long sequence task, the training and validation sets are obtained by concatenating the sequences, separated by an end-of-sequence token, and grouping the resulting chain into non-overlapping sequences of length $L = 8192$.

C.2 EMPIRICAL BASELINE
Figure 9: Visualization of the estimated empirical distribution for the 20 standard amino acids, colored by their class. Note the consistency with the statistics on the TrEMBL web page.
A random baseline, with uniform probability across all the vocabulary tokens at every position, has accuracy 5% (when including only the 20 standard amino acids) and 4% (when also including the 5 anomalous amino acids (Consortium, 2019)). However, the empirical frequencies of the various amino acids in our dataset may be far from uniform, so we also consider an empirical baseline where the amino acid probabilities are proportional to their empirical frequencies in the training set.

Figure 9 shows the estimated empirical distribution. We use both the standard and anomalous amino acids, and we crop sequences to length 1024 to match the data processing performed for the Transformer models. The figure shows only the 20 standard amino acids, colored by their class, for comparison with the visualization on the TrEMBL web page.

C.3 TABULAR RESULTS
Table 2 contains the results on the single protein sequence modeling task ($L = 1024$). We report accuracy and perplexity as defined in Appendix A:

Model Type  Set Name  Model                    Accuracy  Perplexity
UNI         Test      Empirical Baseline       9.92      17.80
UNI         Test      Transformer              30.80     9.37
UNI         Test      Performer (generalized)  31.58     9.17
UNI         OOD       Empirical Baseline       9.07      17.93
UNI         OOD       Transformer              19.70     13.20
UNI         OOD       Performer (generalized)  18.44     13.63
BID         Test      Transformer              33.32     9.22
BID         Test      Performer (generalized)  36.09     8.36
BID         Test      Performer (softmax)      33.00     9.24
BID         OOD       Transformer              25.07     12.09
BID         OOD       Performer (generalized)  24.10     12.26
BID         OOD       Performer (softmax)      23.48     12.41
Table 2: Results on single protein sequence modeling ($L = 1024$). We note that the empirical baseline results are applicable to both the unidirectional (UNI) and bidirectional (BID) models.

C.4 ATTENTION MATRIX ILLUSTRATION
In this section we illustrate the attention matrices produced by a Performer model. We focus on the bidirectional case and choose one Performer model trained on the standard single-sequence TrEMBL task for over 500K steps. The same analysis can be applied to unidirectional Performers as well.

We note that while the Transformer model instantiates the attention matrix in order to compute the attention output that incorporates the (queries $Q$, keys $K$, values $V$) triplet (see Eq. 1 in the main paper), the FAVOR mechanism returns the attention output directly (see Algorithm 1). To account for this discrepancy, we extract the attention matrices by applying each attention mechanism twice: once on each original $(Q, K, V)$ triple to obtain the attention output, and once on a modified $(Q, K, V^{\circ})$ triple, where $V^{\circ}$ contains one-hot indicators for each position index, to obtain the attention matrix. The choice of $V^{\circ}$ ensures that the dimension of the attention output is equal to the sequence length, and that a non-zero output on a dimension $i$ can only arise from a non-zero attention weight to the $i$-th sequence position. Indeed, in the Transformer case, when comparing the output of this procedure with the instantiated attention matrix, the outputs match.

Attention matrix example.
We start by visualizing the attention matrix for an individual protein sequence. We use the BPT1_BOVIN protein sequence, one of the most extensively studied globular proteins, which contains 100 amino acids. In Figure 10, we show the attention matrices for the first 4 layers. Note that many heads show a diagonal pattern, where each node attends to its neighbors, and some heads show a vertical pattern, where each head attends to the same fixed positions. These patterns are consistent with the patterns found in Transformer models trained on natural language.

Amino acid similarity.
Furthermore, we analyze the amino-acid similarity matrix estimated from the attention matrices produced by the Performer model, as described in Vig et al. (2020). We aggregate the attention matrix across 800 sequences. The resulting similarity matrix is illustrated in Figure 13. Note that the Performer recognizes highly similar amino acid pairs such as (D, E) and (F, Y).

Figure 10: We show the attention matrices for the first 4 layers and all 8 heads (each row is a layer, each column is a head index, each cell contains the attention matrix across the entire BPT1_BOVIN protein sequence). Note that many heads show a diagonal pattern, where each node attends to its neighbors, and some heads show a vertical pattern, where each head attends to the same fixed positions.
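The one-hot value trick used in this section to recover attention matrices from a mechanism that never instantiates them can be sketched in a few lines. The sketch below is a toy, bidirectional example under stated assumptions: `Qf` and `Kf` stand for already-computed non-negative feature maps of queries and keys, and the helper names (`linear_attention`, `one_hot_values`, `extract_attention_matrix`) are hypothetical, not part of any released implementation.

```python
def linear_attention(Qf, Kf, V):
    """Attention output D^{-1} Q'((K')^T V), computed without an L x L matrix.

    Qf, Kf: lists of L feature vectors (length M each), assumed non-negative.
    V: list of L value vectors.
    """
    L, M, dv = len(Kf), len(Kf[0]), len(V[0])
    # (K')^T V has shape M x dv; (K')^T 1 has shape M.
    kv = [[sum(Kf[j][r] * V[j][c] for j in range(L)) for c in range(dv)]
          for r in range(M)]
    ksum = [sum(Kf[j][r] for j in range(L)) for r in range(M)]
    out = []
    for q in Qf:
        norm = sum(q[r] * ksum[r] for r in range(M))  # row normalizer (D)
        out.append([sum(q[r] * kv[r][c] for r in range(M)) / norm
                    for c in range(dv)])
    return out


def one_hot_values(L):
    """V° from the text: one-hot indicator of each position index."""
    return [[1.0 if i == j else 0.0 for j in range(L)] for i in range(L)]


def extract_attention_matrix(Qf, Kf):
    """Second pass with V°: the attention output equals the attention matrix."""
    return linear_attention(Qf, Kf, one_hot_values(len(Kf)))
```

With $V^{\circ}$ equal to the identity, output entry $(i, j)$ is exactly the normalized weight that position $i$ places on position $j$, so each row of the result sums to one. Note that this extraction pass costs quadratic memory in the sequence length, which is acceptable only for visualization on short sequences.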
Figure 11: We illustrate in more detail two attention heads. The sub-figures correspond respectively to: (1) Head 1-2 (second layer, third head), (2) Head 4-1 (fifth layer, second head). Note the block attention in Head 1-2 and the vertical attention (to the start token ('M') and the 85th token ('C')) in Head 4-1.
We highlight the attention patterns by restricting our attention to the first 25 tokens (note that we do not renormalize the attention to these tokens). The illustration is based on Vig (2019) and Vig & Belinkov (2019). Note that, similar to prior work on protein Transformers (Madani et al., 2020), the attention matrices include both local and global patterns.
[Figure 13 panels: two 20 x 20 heatmaps with the amino acids A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y on both axes; color scales from 0.0 to 1.0 and from 0.0 to 0.8.]
Figure 13: Amino acid similarity matrix estimated from attention matrices aggregated across a small subset of sequences, as described in Vig et al. (2020). The sub-figures correspond respectively to: (1) the normalized BLOSUM matrix, (2) the amino acid similarity estimated via a trained Performer model. Note that the Performer recognizes highly similar amino acid pairs such as (D, E) and (F, Y).

D EXTENDED APPROXIMATION RESULTS
D.1 BACKWARDS COMPATIBILITY - ERROR PROPAGATION
Although we mentioned previously (Sec. 4.2) that the Performer with additional finetuning is backwards compatible with the Transformer, we demonstrate below in Fig. 14 that error propagation due to non-attention components of the Transformer is one of the primary reasons that pretrained Transformer weights cannot be immediately used for inference on the corresponding Performer.

Figure 14: Output approximation errors between a vanilla Transformer and a Performer (with orthogonal features) for varying numbers of layers.

D.2 APPROXIMATE SOFTMAX - EXTENDED PROPERTIES
We show the following properties of our softmax approximation, in Fig. 15:
Redrawing: While the benefits of redrawing features were shown in Subsec. 4.3 of the main body of the paper, we also demonstrate its benefits when there are multiple layers with large scale (16x16 TPU-v2) training.
Unidirectional: While we have shown on TrEMBL that the Performer with generalized ReLU attention outperforms softmax, we also show that approximate softmax attention can still be a solid choice, for example on ImageNet64 (U). After 100K steps of training, the Performer-ReLU, Performer-Softmax, and Performer-Softmax (SMREG) variants achieve 3.67, 3.69, and 3.67 BPD respectively.
Instability of Trigonometric Features: We see the full view of the unstable training curve when using Trigonometric softmax.

Figure 15: Best viewed zoomed in. Left: The importance of redrawing features. If redrawing is not used, an "unlucky" set of random features may cause training degradation, shown by the early-stopped curve with Seed 1, while a "lucky" set of random features may cause no issue, shown by the curve with Seed 2. Redrawing allows the training to correct itself, as seen at the black vertical line. Middle: Using the same 8x8 TPU-v2 compute and the same 6-layer standard model, approximate softmax with positive features achieves the same result as generalized ReLU attention. Right: Zoomed out view of the right subfigure of Fig. 5, showing that Trigonometric softmax causes very unstable training behaviors.

D.3 GENERALIZED ATTENTION
We investigated Generalized Attention mechanisms (mentioned in Sec. 2.2) on TrEMBL when $L = 512$ for various kernel functions. This is similar to Tsai et al. (2019), who also experiment with various attention kernels for natural language. Using hyperparameter sweeps across multiple variables in FAVOR, we compared several kernels and also renormalization on/off (Fig. 16 and Fig. 17, where Renormalize corresponds to applying the $D^{-1}$ operator in attention, as for the standard mechanism, though we noticed that disabling it does not necessarily hurt accuracy) to produce the best training configuration for the Performer. We note that the effective batch size slightly affects the rankings (as shown by the difference between 2x2 and 4x4 TPU runs). By default we use the generalized ReLU kernel with the other default hyperparameters shown in Appendix A, as we observed that they are empirically optimal for large batch size runs (i.e. 8x8 or 16x16 TPUs).

Figure 16: To emphasize the highest accuracy runs but also show the NaN issues with certain kernels which caused runs to stop early, we set both x and y axes to be log-scale. We tested kernels defined by different functions $f$ (see Sec. 2.2): sigmoid, exponential, ReLU, absolute, GELU, cosine (original softmax approximation), tanh, and identity. All training runs were performed on 2x2 TPU-v2s, with batch size 128 per device.

Figure 17: We also performed a similar setup as in Fig. 16 for 4x4 TPU-v2s.

E COMPUTATION COSTS - EXTENDED RESULTS
In this subsection, we empirically measure computational costs in terms of wall clock time on forward and backward passes for three scenarios in Fig. 18, 19:

1. Performer, with varying number of layers. We show that our method can scale up to (but is not necessarily limited to) even 20 layers.
2. Attention time complexities when comparing standard attention (from Transformer) and FAVOR (from Performer). Note that the maximum memory size here is not reflective of the maximum memory size in an actual model (shown below), as this benchmark requires computing explicit tensors (causing memory increases) in Jax, while a model does not.
3. Time complexities when comparing the Transformer and Performer models. "X" (OPT) denotes the maximum possible speedup achievable, when attention simply returns the $V$-vector, showing that the Performer is nearly optimal. We see that the maximum possible power of 2 length allowed on a V100 GPU (16GB) is $2^{15} = 32768$ using regular dimensions.

Since some of the computational bottleneck in the Transformer may originate from the extra feed-forward layers (Kitaev et al., 2020), we also benchmark the "Small" version, i.e. (n_heads, n_layers, d_ff, d) = (1, , , ) as well, when the attention component is the dominant source of computation and memory. We remind the reader that the "Regular" version consists of (n_heads, n_layers, d_ff, d) = (8, , , ).

Figure 18: Captions (1), (2) for each 2x2 subfigure mentioned above.

Figure 19: Caption (3) for this 2x2 subfigure mentioned above.

F THEORETICAL RESULTS
We provide here the proofs of all theoretical results presented in the paper.

F.1 PROOF OF LEMMA 1

Proof.
We first deduce that for any $\mathbf{x}, \mathbf{y} \in \mathbb{R}^d$:

$$\mathrm{SM}(\mathbf{x}, \mathbf{y}) = \exp(\mathbf{x}^\top \mathbf{y}) = \exp(-\|\mathbf{x}\|^2/2) \cdot \exp(\|\mathbf{x}+\mathbf{y}\|^2/2) \cdot \exp(-\|\mathbf{y}\|^2/2).$$

Next, let $\mathbf{w} \in \mathbb{R}^d$. We use the fact that

$$(2\pi)^{-d/2} \int \exp(-\|\mathbf{w} - \mathbf{c}\|^2/2)\, d\mathbf{w} = 1$$

for any $\mathbf{c} \in \mathbb{R}^d$ and derive:

$$\begin{aligned}
\exp(\|\mathbf{x}+\mathbf{y}\|^2/2) &= (2\pi)^{-d/2} \exp(\|\mathbf{x}+\mathbf{y}\|^2/2) \int \exp(-\|\mathbf{w} - (\mathbf{x}+\mathbf{y})\|^2/2)\, d\mathbf{w} \\
&= (2\pi)^{-d/2} \int \exp(-\|\mathbf{w}\|^2/2 + \mathbf{w}^\top(\mathbf{x}+\mathbf{y}) - \|\mathbf{x}+\mathbf{y}\|^2/2 + \|\mathbf{x}+\mathbf{y}\|^2/2)\, d\mathbf{w} \\
&= (2\pi)^{-d/2} \int \exp(-\|\mathbf{w}\|^2/2 + \mathbf{w}^\top(\mathbf{x}+\mathbf{y}))\, d\mathbf{w} \\
&= (2\pi)^{-d/2} \int \exp(-\|\mathbf{w}\|^2/2) \cdot \exp(\mathbf{w}^\top \mathbf{x}) \cdot \exp(\mathbf{w}^\top \mathbf{y})\, d\mathbf{w} \\
&= \mathbb{E}_{\omega \sim \mathcal{N}(0, \mathbf{I}_d)}[\exp(\omega^\top \mathbf{x}) \cdot \exp(\omega^\top \mathbf{y})].
\end{aligned}$$

That completes the proof of the first part of the lemma. An identity involving the hyperbolic cosine function is implied by the fact that for every $\mathbf{u} \in \mathbb{R}^d$ and $\omega \sim \mathcal{N}(0, \mathbf{I}_d)$ the following is true:

$$\mathbb{E}[\exp(\omega^\top \mathbf{u})] = \sum_{i=0}^{\infty} \frac{\mathbb{E}[(\omega^\top \mathbf{u})^{2i}]}{(2i)!} = \frac{1}{2} \sum_{i=0}^{\infty} \frac{\mathbb{E}[(\omega^\top \mathbf{u})^{2i}] + \mathbb{E}[(-\omega^\top \mathbf{u})^{2i}]}{(2i)!}. \tag{12}$$

The cancellation of the odd moments $\mathbb{E}[(\omega^\top \mathbf{u})^{2i+1}]$ follows directly from the fact that $\omega$ is taken from an isotropic distribution (i.e. a distribution with pdf constant on each sphere). That completes the proof.

F.2 PROOF OF LEMMA 2

Proof.
Denote: $\mathbf{z} = \mathbf{x} + \mathbf{y}$ and $\Delta = \mathbf{x} - \mathbf{y}$. Note that by using standard trigonometric identities (and the fact that the variance of the sum of independent random variables is the sum of variances of those random variables), we can get the following for $\omega \sim \mathcal{N}(0, \mathbf{I}_d)$:

$$\mathrm{MSE}(\widehat{\mathrm{SM}}^{\mathrm{trig}}_m(\mathbf{x}, \mathbf{y})) = \frac{1}{m} \exp(\|\mathbf{x}\|^2 + \|\mathbf{y}\|^2)\, \mathrm{Var}(\cos(\omega^\top \Delta)). \tag{13}$$

Using the fact that (see Lemma 1 in (Yu et al., 2016); note that in that lemma they use notation $z$ for what we denote as $\|\Delta\|^2$):

$$\mathrm{Var}(\cos(\omega^\top \Delta)) = \frac{1}{2}(1 - \exp(-\|\Delta\|^2))^2, \tag{14}$$

we obtain:

$$\mathrm{MSE}(\widehat{\mathrm{SM}}^{\mathrm{trig}}_m(\mathbf{x}, \mathbf{y})) = \frac{1}{2m} \exp(\|\mathbf{x}\|^2 + \|\mathbf{y}\|^2)(1 - \exp(-\|\Delta\|^2))^2 = \frac{1}{2m} \exp(\|\mathbf{z}\|^2)\, \mathrm{SM}^{-2}(\mathbf{x}, \mathbf{y})(1 - \exp(-\|\Delta\|^2))^2, \tag{15}$$

which completes the first part of the proof. To obtain the formula for $\mathrm{MSE}(\widehat{\mathrm{SM}}^{+}_m(\mathbf{x}, \mathbf{y}))$ notice first that:

$$\mathbb{E}_{\omega \sim \mathcal{N}(0, \mathbf{I}_d)}[\exp(\omega^\top \mathbf{z})] = \exp\left(\frac{\|\mathbf{z}\|^2}{2}\right). \tag{16}$$

The above immediately follows from the fact that positive random feature maps provide unbiased estimation of the softmax-kernel, thus the following is true:

$$\mathrm{SM}(\mathbf{x}, \mathbf{y}) = \exp\left(-\frac{\|\mathbf{x}\|^2 + \|\mathbf{y}\|^2}{2}\right) \mathbb{E}_{\omega \sim \mathcal{N}(0, \mathbf{I}_d)}[\exp(\omega^\top \mathbf{z})]. \tag{17}$$

Therefore we obtain:

$$\begin{aligned}
\mathrm{MSE}(\widehat{\mathrm{SM}}^{+}_m(\mathbf{x}, \mathbf{y})) &= \frac{1}{m} \exp(-(\|\mathbf{x}\|^2 + \|\mathbf{y}\|^2))\, \mathrm{Var}(\exp(\omega^\top \mathbf{z})) \\
&= \frac{1}{m} \exp(-(\|\mathbf{x}\|^2 + \|\mathbf{y}\|^2)) \left(\mathbb{E}[\exp(2\omega^\top \mathbf{z})] - (\mathbb{E}[\exp(\omega^\top \mathbf{z})])^2\right) \\
&= \frac{1}{m} \exp(-(\|\mathbf{x}\|^2 + \|\mathbf{y}\|^2)) (\exp(2\|\mathbf{z}\|^2) - \exp(\|\mathbf{z}\|^2)),
\end{aligned} \tag{18}$$

where the last equality follows from Equation 16. Therefore we have:

$$\begin{aligned}
\mathrm{MSE}(\widehat{\mathrm{SM}}^{+}_m(\mathbf{x}, \mathbf{y})) &= \frac{1}{m} \exp(-(\|\mathbf{x}\|^2 + \|\mathbf{y}\|^2)) \exp(\|\mathbf{z}\|^2)(\exp(\|\mathbf{z}\|^2) - 1) \\
&= \frac{1}{m} \exp(\|\mathbf{z}\|^2)\, \mathrm{SM}^2(\mathbf{x}, \mathbf{y})(1 - \exp(-\|\mathbf{z}\|^2)).
\end{aligned} \tag{19}$$

Finally,

$$\begin{aligned}
\mathrm{MSE}(\widehat{\mathrm{SM}}^{\mathrm{hyp}+}_m(\mathbf{x}, \mathbf{y})) &= \frac{1}{4m} \exp(-(\|\mathbf{x}\|^2 + \|\mathbf{y}\|^2)) \left(\mathrm{Var}(\exp(\omega^\top \mathbf{z})) + \mathrm{Var}(\exp(-\omega^\top \mathbf{z})) + 2\,\mathrm{Cov}(\exp(\omega^\top \mathbf{z}), \exp(-\omega^\top \mathbf{z}))\right) \\
&= \frac{1}{4m} \exp(-(\|\mathbf{x}\|^2 + \|\mathbf{y}\|^2)) \left(2\,\mathrm{Var}(\exp(\omega^\top \mathbf{z})) + 2\,\mathrm{Cov}(\exp(\omega^\top \mathbf{z}), \exp(-\omega^\top \mathbf{z}))\right) \\
&= \frac{1}{2m} \exp(-(\|\mathbf{x}\|^2 + \|\mathbf{y}\|^2)) \left(\mathrm{Var}(\exp(\omega^\top \mathbf{z})) + 1 - (\mathbb{E}[\exp(\omega^\top \mathbf{z})])^2\right) \\
&= \frac{1}{2m} \exp(-(\|\mathbf{x}\|^2 + \|\mathbf{y}\|^2)) (\exp(2\|\mathbf{z}\|^2) - \exp(\|\mathbf{z}\|^2) + 1 - \exp(\|\mathbf{z}\|^2)) \\
&= \frac{1}{2m} \exp(-(\|\mathbf{x}\|^2 + \|\mathbf{y}\|^2)) (\exp(\|\mathbf{z}\|^2) - 1)^2 \\
&= \frac{1}{2}(1 - \exp(-\|\mathbf{z}\|^2))\, \mathrm{MSE}(\widehat{\mathrm{SM}}^{+}_m(\mathbf{x}, \mathbf{y})).
\end{aligned} \tag{20}$$

In the chain of equalities above we used the fact that random variables $\exp(\omega^\top \mathbf{z})$ and $\exp(-\omega^\top \mathbf{z})$ have the same distribution. This is true since $\omega$ and $-\omega$ have the same distribution ($\omega$ is Gaussian). That completes the proof.

F.3 PROOF OF THEOREM 1

Proof.
Let $\mathbf{x}, \mathbf{y} \in \mathbb{R}^d$ be respectively a query/key. Note that from the definition of $\mathrm{SMREG}(\mathbf{x}, \mathbf{y})$ we have for $\mathbf{z} = \mathbf{x} + \mathbf{y}$:

$$\mathrm{SMREG}(\mathbf{x}, \mathbf{y}) = \exp\left(-\frac{\|\mathbf{x}\|^2 + \|\mathbf{y}\|^2}{2}\right) \sum_{k=0}^{\infty} \frac{1}{(2k)!} \|\mathbf{z}\|^{2k} d^k\, \mathbb{E}_{\omega \sim \mathcal{N}(0, \mathbf{I}_d)}\left[\left(\frac{\omega^\top}{\|\omega\|} \mathbf{e}_1\right)^{2k}\right], \tag{21}$$

where $\mathbf{e}_1 = (1, 0, ..., 0)^\top \in \mathbb{R}^d$. To obtain the above we used the fact that $\mathcal{N}(0, \mathbf{I}_d)$ is isotropic (which in particular implies zeroing of the odd terms in the Taylor expansion).

Let us denote: $A(k, d) \overset{\mathrm{def}}{=} \mathbb{E}_{\omega \sim \mathcal{N}(0, \mathbf{I}_d)}\left[\left(\frac{\omega^\top}{\|\omega\|} \mathbf{e}_1\right)^{k}\right]$. It turns out that:

$$A(2k, d) = \frac{(2k-1)!!}{(d+2k-2)(d+2k-4) \cdot ... \cdot d}. \tag{22}$$

The proof of that fact can be found in the supplement of (Choromanski et al., 2018b), yet we provide it below for completeness and the convenience of the Reader:

Lemma 3. Expression $A(2k, d)$ satisfies the following for $k \in \mathbb{N}$:

$$A(2k, d) = \frac{(2k-1)!!}{(d+2k-2)(d+2k-4) \cdot ... \cdot d}. \tag{23}$$

Proof. Note first that for $d \geq 2$ the density function $p_d(\theta)$ of the angle between a vector $\mathbf{r} \in \mathbb{R}^d$ chosen uniformly at random from the unit sphere and $\mathbf{e}_1$ is given by the following formula:

$$p_d(\theta) = \frac{\sin^{d-2}(\theta)}{\int_0^{\pi} \sin^{d-2}(\theta)\, d\theta}. \tag{24}$$

Let us denote: $F(k, d) \overset{\mathrm{def}}{=} \int_0^{\pi} \cos^k(\theta) \sin^d(\theta)\, d\theta$. Using partial integration, we get:

$$\begin{aligned}
\int_0^{\pi} \cos^k(\theta) \sin^d(\theta)\, d\theta &= \int_0^{\pi} \cos^{k-1}(\theta) \sin^d(\theta) (\sin(\theta))'\, d\theta \\
&= \cos^{k-1}(\theta) \sin^{d+1}(\theta) \Big|_0^{\pi} - \int_0^{\pi} \sin(\theta) \left((k-1) \cos^{k-2}(\theta)(-\sin(\theta)) \sin^d(\theta) + d \cos^k(\theta) \sin^{d-1}(\theta)\right) d\theta.
\end{aligned} \tag{25}$$

Thus we conclude that: $F(k, d) = \frac{k-1}{d+1} F(k-2, d+2)$. Therefore we have:

$$F(2k, d) = \frac{(2k-1)!!}{(d+1)(d+3) \cdot ... \cdot (d+2k-1)} \int_0^{\pi} \sin^{d+2k}(\theta)\, d\theta. \tag{26}$$

We again conduct partial integration and get:

$$\int_0^{\pi} \sin^d(\theta)\, d\theta = -\frac{1}{d} \sin^{d-1}(\theta) \cos(\theta) \Big|_0^{\pi} + \frac{d-1}{d} \int_0^{\pi} \sin^{d-2}(\theta)\, d\theta = \frac{d-1}{d} \int_0^{\pi} \sin^{d-2}(\theta)\, d\theta. \tag{27}$$

Therefore we conclude that:

$$A(2k, d) = \frac{1}{\frac{d-3}{d-2} \cdot \frac{d-5}{d-4} \cdot ...} \cdot \frac{(2k-1)!!}{(d-1)(d+1) \cdot ... \cdot (d+2k-3)} \cdot \frac{d+2k-3}{d+2k-2} \cdot \frac{d+2k-5}{d+2k-4} \cdot ... = \frac{(2k-1)!!}{(d+2k-2)(d+2k-4) \cdot ... \cdot d}, \tag{28}$$

which completes the proof.

Applying the above lemma, we get:

$$\begin{aligned}
\mathrm{SMREG}(\mathbf{x}, \mathbf{y}) &= \exp\left(-\frac{\|\mathbf{x}\|^2 + \|\mathbf{y}\|^2}{2}\right) \sum_{k=0}^{\infty} \frac{1}{(2k)!} \|\mathbf{z}\|^{2k} d^k \frac{(2k-1)!!}{(d+2k-2)(d+2k-4) \cdot ... \cdot d} \\
&= \exp\left(-\frac{\|\mathbf{x}\|^2 + \|\mathbf{y}\|^2}{2}\right) \sum_{k=0}^{\infty} \frac{w^k}{k!} f(k, d),
\end{aligned} \tag{29}$$

where $w = \frac{\|\mathbf{z}\|^2}{2}$ and $f(k, d) = \frac{d^k}{(d+2k-2)(d+2k-4) \cdot ... \cdot d}$.

Thus we obtain:

$$\frac{\mathrm{SMREG}(\mathbf{x}, \mathbf{y})}{\mathrm{SM}(\mathbf{x}, \mathbf{y})} = e^{-w} \sum_{k=0}^{\infty} \frac{w^k}{k!} f(k, d). \tag{30}$$

Note first that for $k \geq 1$ we have: $f(k, d) \leq 1$, thus:

$$\mathrm{SMREG}(\mathbf{x}, \mathbf{y}) \leq \mathrm{SM}(\mathbf{x}, \mathbf{y}). \tag{31}$$

We also have for $l = d^{1/3}$:

$$\begin{aligned}
\frac{\mathrm{SMREG}(\mathbf{x}, \mathbf{y})}{\mathrm{SM}(\mathbf{x}, \mathbf{y})} &= e^{-w} \sum_{k=0}^{l} \frac{w^k}{k!} f(k, d) + e^{-w} \sum_{k=l+1}^{\infty} \frac{w^k}{k!} f(k, d) \\
&\geq f(l, d)\, e^{-w} \sum_{k=0}^{l} \frac{w^k}{k!} + e^{-w} \sum_{k=l+1}^{\infty} \frac{w^k}{k!} f(k, d) \\
&\geq f(l, d) \left(1 - e^{-w} \sum_{k=l+1}^{\infty} \frac{w^k}{k!}\right) = f(l, d)(1 - \mathbb{P}[\mathrm{Po}(w) > l]),
\end{aligned} \tag{32}$$

where $\mathrm{Po}(w)$ stands for the random variable of Poisson distribution with parameter $w$.
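Therefore we get for $t = \ln(\frac{l}{w})$:

$$\begin{aligned}
\frac{\mathrm{SMREG}(\mathbf{x}, \mathbf{y})}{\mathrm{SM}(\mathbf{x}, \mathbf{y})} &\geq \left(1 - \frac{2l-2}{d}\right)^{l} (1 - \mathbb{P}[\mathrm{Po}(w) > l]) \\
&\geq \exp\left(l \ln\left(1 - \frac{2l-2}{d}\right)\right)(1 - \mathbb{P}[t\,\mathrm{Po}(w) \geq tl]) \\
&= \exp\left(-l \sum_{i=1}^{\infty} \frac{(2l-2)^i}{i\, d^i}\right)(1 - \mathbb{P}[\exp(t\,\mathrm{Po}(w) - tl) \geq 1]) \\
&\geq \exp\left(-\frac{2}{d^{1/3}} + o\left(\frac{1}{d^{1/3}}\right)\right)(1 - \exp(-tl)\, \mathbb{E}[\exp(t\,\mathrm{Po}(w))]) \\
&= \exp\left(-\frac{2}{d^{1/3}} + o\left(\frac{1}{d^{1/3}}\right)\right)(1 - \exp(-w - l(t-1))),
\end{aligned} \tag{33}$$

where the last equality is implied by the formula for the Laplace Transform of the Poisson random variable:

$$\mathbb{E}[\exp(t\,\mathrm{Po}(w))] = \exp(w(\exp(t) - 1)). \tag{34}$$

Notice that: $w = \frac{\|\mathbf{z}\|^2}{2} = \frac{\ln(\mathrm{SM}(\mathbf{x}, \mathbf{x})) + \ln(\mathrm{SM}(\mathbf{y}, \mathbf{y})) + 2\ln(\mathrm{SM}(\mathbf{x}, \mathbf{y}))}{2} \leq 2\ln(C)$. We conclude that:

$$\frac{\mathrm{SMREG}(\mathbf{x}, \mathbf{y})}{\mathrm{SM}(\mathbf{x}, \mathbf{y})} \geq \left(1 - \frac{2}{d^{1/3}} + o\left(\frac{1}{d^{1/3}}\right)\right)\left(1 - C^{-2}\left(\frac{d^{1/3}}{2e \cdot \ln(C)}\right)^{-d^{1/3}}\right) = 1 - \frac{2}{d^{1/3}} + o\left(\frac{1}{d^{1/3}}\right). \tag{35}$$

That completes the proof.

F.4 PROOFS OF THEOREM 2, THEOREM 3 & BEAUTIFUL FUNCTIONS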
We will provide here much more general theoretical results which will imply Theorem 3 and Theorem 2. We need the following definition:

Definition 1. We say that function $F : \mathbb{R}^d \rightarrow \mathbb{R}$ is beautiful if $F$ can be expressed as:

$$F_{\Omega, g}(\mathbf{z}) = \mathbb{E}_{\omega \sim \Omega}[g(\omega^\top \mathbf{z})], \tag{36}$$

for a probabilistic isotropic distribution $\Omega$, and where $g : \mathbb{R} \rightarrow \mathbb{R}$ is an entire function with non-negative power-series coefficients (i.e. $g(x) = \sum_{i=0}^{\infty} a_i x^i$ for every $x \in \mathbb{R}$ and with $a_i \geq 0$ for $i = 0, 1, ...$). In the formula above we assume that the expectation on the RHS exists.

Interestingly, beautiful functions can be used to define softmax and consequently, Gaussian kernels (both standard and regularized), leading to our PRF mechanism presented in the main body of the paper, as we explain below.
Remark 1. If one takes $\Omega = \mathcal{N}(0, \mathbf{I}_d)$ (note that $\mathcal{N}(0, \mathbf{I}_d)$ is isotropic) and $g : x \rightarrow \exp(x)$ (such $g$ is clearly entire with nonnegative power-series coefficients) then the following is true for $\mathbf{z} = \mathbf{x} + \mathbf{y}$:

$$\mathrm{SM}(\mathbf{x}, \mathbf{y}) = \exp\left(-\frac{\|\mathbf{x}\|^2 + \|\mathbf{y}\|^2}{2}\right) F_{\Omega, g}(\mathbf{z}). \tag{37}$$

Similarly: $\mathrm{SMREG}(\mathbf{x}, \mathbf{y}) = \exp\left(-\frac{\|\mathbf{x}\|^2 + \|\mathbf{y}\|^2}{2}\right) F_{\Omega_{\mathrm{reg}}, g}(\mathbf{z})$, where $\Omega_{\mathrm{reg}}$ stands for the distribution corresponding to the Haar measure on the sphere of radius $\sqrt{d}$ (which is clearly isotropic). Therefore general concentration results for Monte Carlo estimators of beautiful functions immediately imply corresponding results for the (standard and regularized) softmax (and thus also Gaussian) kernel.
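The identity in Remark 1 can be checked numerically with a small Monte Carlo experiment. The following is a minimal sketch using only the Python standard library; the function names and the sample size are assumptions chosen for illustration, and the estimate is random, so it only agrees with the exact kernel value up to sampling error.

```python
import math
import random


def softmax_kernel(x, y):
    """Exact softmax kernel SM(x, y) = exp(x^T y)."""
    return math.exp(sum(a * b for a, b in zip(x, y)))


def prf_softmax_estimate(x, y, m, rng):
    """Monte Carlo estimate of SM(x, y) via Remark 1 with Omega = N(0, I_d), g = exp.

    Averages exp(omega^T z) over m Gaussian samples, z = x + y, then rescales
    by exp(-(|x|^2 + |y|^2) / 2).
    """
    d = len(x)
    z = [a + b for a, b in zip(x, y)]
    scale = math.exp(-(sum(a * a for a in x) + sum(b * b for b in y)) / 2.0)
    total = 0.0
    for _ in range(m):
        omega = [rng.gauss(0.0, 1.0) for _ in range(d)]
        total += math.exp(sum(w * c for w, c in zip(omega, z)))
    return scale * total / m
```

For inputs of small norm, e.g. `prf_softmax_estimate(x, y, 20000, random.Random(0))`, the estimate should be close to `softmax_kernel(x, y)`. Every term in the average is positive, which is the property that distinguishes these features from trigonometric ones.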
We will consider two estimators of the beautiful functions from Definition 1 that directly lead (through Remark 1) to: the PRF-based approximation of the softmax kernel and its enhanced version with orthogonal features. The standard Monte Carlo estimator samples independently $\omega^{\mathrm{iid}}_1, ..., \omega^{\mathrm{iid}}_m \overset{\mathrm{iid}}{\sim} \Omega$, where $m$ stands for the number of samples, and then computes:

$$\widehat{F}^{\mathrm{iid}}_m(\mathbf{z}) \overset{\mathrm{def}}{=} \frac{1}{m} \sum_{i=1}^{m} g((\omega^{\mathrm{iid}}_i)^\top \mathbf{z}). \tag{38}$$

The orthogonal Monte Carlo estimator samples $\omega^{\mathrm{ort}}_1, ..., \omega^{\mathrm{ort}}_m$ ($m \leq d$) in such a way that marginally we have: $\omega^{\mathrm{ort}}_i \sim \Omega$, but $(\omega^{\mathrm{ort}}_i)^\top \omega^{\mathrm{ort}}_j = 0$ for $i \neq j$ (such an orthogonal ensemble can always be created if $\Omega$ is isotropic, as we already mentioned in the main body of the paper). We define:

$$\widehat{F}^{\mathrm{ort}}_m(\mathbf{z}) \overset{\mathrm{def}}{=} \frac{1}{m} \sum_{i=1}^{m} g((\omega^{\mathrm{ort}}_i)^\top \mathbf{z}). \tag{39}$$

F.4.1 ORTHOGONALITY UNIVERSALLY IMPROVES CONCENTRATION
Denote by $M_Z(\theta) = \mathbb{E}[e^{\theta Z}]$ the moment generating function of the random variable $Z$. Note first that estimators of beautiful functions based on the standard Monte Carlo procedure using independent vectors $\omega^{\mathrm{iid}}_i$ guarantee strong concentration bounds, since independent $\omega_i$s provide a way to obtain exponentially small upper bounds on failure probabilities through moment generating functions. We summarize this classic observation, which is a standard application of Markov's Inequality, below.

Lemma 4. Consider an estimator $\widehat{F}^{\mathrm{iid}}_m(\mathbf{z})$ of the beautiful function $F$ evaluated at $\mathbf{z}$. Then the following holds for any $a > F(\mathbf{z})$:

$$\mathbb{P}[\widehat{F}^{\mathrm{iid}}_m(\mathbf{z}) > a] \leq e^{-m \mathcal{L}_X(a)}, \tag{40}$$

where $X = g(\omega^\top \mathbf{z})$, $\omega \sim \Omega$ and $\mathcal{L}_Z$ stands for the Legendre Transform of the random variable $Z$ defined as: $\mathcal{L}_Z(a) = \sup_{\theta > 0} \log\left(\frac{e^{\theta a}}{M_Z(\theta)}\right)$. Furthermore, $\mathcal{L}_X(a) > 0$.

The above result provides us with exponentially small (in Legendre Transform) upper bounds on tail probabilities for the standard estimator. Below we provide our two main theoretical results.

Theorem 5 (orthogonality provides smaller tails). If $F_{\Omega, g}$ is a beautiful function then the following holds for $m \leq d$, $X$ as in Lemma 4 and any $a > F(\mathbf{z})$:

$$\mathbb{P}[\widehat{F}^{\mathrm{ort}}_m(\mathbf{z}) > a] \leq \frac{d}{d+2} e^{-m \mathcal{L}_X(a)}. \tag{41}$$

This result shows that features obtained from the ensembles of pairwise orthogonal random vectors provide exponentially small bounds on tail probabilities and that these bounds are strictly better than for estimators using unstructured features. Furthermore, the result is universal, i.e. holds for any dimensionality $d$, not just asymptotically for $d$ large enough.

We also obtain a similar result regarding mean squared errors (MSEs) of the considered estimators:

Theorem 6. If $F_{\Omega, g}$ is a beautiful function then the following holds for $m \leq d$:

$$\mathrm{MSE}(\widehat{F}^{\mathrm{ort}}_m(\mathbf{z})) \leq \mathrm{MSE}(\widehat{F}^{\mathrm{iid}}_m(\mathbf{z})) - \left(1 - \frac{1}{m}\right) \frac{2}{d+2} F^2_{\Omega, g}(\mathbf{z}). \tag{42}$$

As before, an orthogonal estimator leads to better concentration results and as before, this is the case for any $d > 0$, not only asymptotically for large enough $d$.

Note that from what we have said above, Theorem 2 and Theorem 3 follow immediately from Theorem 6 and Theorem 5 respectively.
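To make the two estimators concrete, the following sketch builds an orthogonal ensemble with Gaussian marginals and evaluates the Monte Carlo average from Equations 38 and 39. It is one possible construction under stated assumptions, not necessarily the one used in the experiments: directions come from Gram-Schmidt orthogonalization of Gaussian vectors, and each direction is rescaled by the norm of an independent Gaussian vector so that every sample is marginally distributed as $\mathcal{N}(0, \mathbf{I}_d)$. All function names are hypothetical.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def gaussian_vector(d, rng):
    return [rng.gauss(0.0, 1.0) for _ in range(d)]


def orthogonal_gaussian_ensemble(m, d, rng):
    """Return m <= d pairwise orthogonal vectors, each marginally N(0, I_d)."""
    assert m <= d
    basis = []
    for _ in range(m):
        v = gaussian_vector(d, rng)
        for b in basis:
            # Gram-Schmidt step: remove the component along each earlier direction.
            proj = dot(v, b)
            v = [vi - proj * bi for vi, bi in zip(v, b)]
        norm = math.sqrt(dot(v, v))
        basis.append([vi / norm for vi in v])
    ensemble = []
    for b in basis:
        # Independent chi-distributed length restores the Gaussian marginal.
        g = gaussian_vector(d, rng)
        length = math.sqrt(dot(g, g))
        ensemble.append([length * bi for bi in b])
    return ensemble


def monte_carlo_estimate(omegas, z, g=math.exp):
    """Average of g(omega_i^T z); the same formula serves both estimators."""
    return sum(g(dot(w, z)) for w in omegas) / len(omegas)
```

Passing `[gaussian_vector(d, rng) for _ in range(m)]` to `monte_carlo_estimate` gives the i.i.d. estimator, while passing `orthogonal_gaussian_ensemble(m, d, rng)` gives the orthogonal one. Comparing their empirical mean squared errors over many repetitions is a simple way to observe the effect described by Theorem 6, although the gap can be small for inputs of small norm.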
Thus in the remainder of this section we will prove Theorem 6 and Theorem 5.

F.4.2 PROOF OF THEOREM 5

Proof.
Note that by the analogous application of Markov's Inequality as in Lemma 4, we get:

$$\mathbb{P}[\widehat{F}^{\mathrm{ort}}_m(\mathbf{z}) > a] \leq \frac{\mathbb{E}[e^{\theta(X^{\mathrm{ort}}_1 + ... + X^{\mathrm{ort}}_m)}]}{e^{\theta m a}}, \tag{43}$$

where we have: $X^{\mathrm{ort}}_i = g((\omega^{\mathrm{ort}}_i)^\top \mathbf{z})$. We see that it suffices to show that for any $\theta > 0$ the following holds: $\mathbb{E}[e^{\theta(X^{\mathrm{ort}}_1 + ... + X^{\mathrm{ort}}_m)}] < \mathbb{E}[e^{\theta(X^{\mathrm{iid}}_1 + ... + X^{\mathrm{iid}}_m)}]$. We have:

$$\begin{aligned}
\mathbb{E}[e^{\theta(X^{\mathrm{ort}}_1 + ... + X^{\mathrm{ort}}_m)}] &= \mathbb{E}\left[\sum_{j=0}^{\infty} \frac{(\theta \sum_{i=1}^{m} X^{\mathrm{ort}}_i)^j}{j!}\right] = \mathbb{E}\left[\sum_{j=0}^{\infty} \frac{\theta^j}{j!} \left(\sum_{i=1}^{m} X^{\mathrm{ort}}_i\right)^j\right] = \sum_{j=0}^{\infty} \frac{\theta^j}{j!} \mathbb{E}\left[\left(\sum_{i=1}^{m} X^{\mathrm{ort}}_i\right)^j\right] \\
&= \sum_{j=0}^{\infty} \frac{\theta^j}{j!} \mathbb{E}\left[\sum_{(j_1, ..., j_m) \in \mathcal{S}_j} c(j_1, ..., j_m)(X^{\mathrm{ort}}_1)^{j_1} \cdot ... \cdot (X^{\mathrm{ort}}_m)^{j_m}\right],
\end{aligned} \tag{44}$$

where $\mathcal{S}_j = \{(j_1, ..., j_m) \in \mathbb{N} \times ... \times \mathbb{N} : j_1, ..., j_m \geq 0,\ j_1 + ... + j_m = j\}$ and for some positive constants $c(j_1, ..., j_m)$.

Thus we have:

$$\mathbb{E}[e^{\theta(X^{\mathrm{ort}}_1 + ... + X^{\mathrm{ort}}_m)}] = \sum_{j=0}^{\infty} \frac{\theta^j}{j!} \sum_{(j_1, ..., j_m) \in \mathcal{S}_j} c(j_1, ..., j_m)\, \mathbb{E}[(X^{\mathrm{ort}}_1)^{j_1} \cdot ... \cdot (X^{\mathrm{ort}}_m)^{j_m}]. \tag{45}$$

Similarly, we get:

$$\mathbb{E}[e^{\theta(X^{\mathrm{iid}}_1 + ... + X^{\mathrm{iid}}_m)}] = \sum_{j=0}^{\infty} \frac{\theta^j}{j!} \sum_{(j_1, ..., j_m) \in \mathcal{S}_j} c(j_1, ..., j_m)\, \mathbb{E}[(X^{\mathrm{iid}}_1)^{j_1} \cdot ... \cdot (X^{\mathrm{iid}}_m)^{j_m}]. \tag{46}$$

Therefore we get:

$$\begin{aligned}
\Delta &= \mathbb{E}[e^{\theta(X^{\mathrm{iid}}_1 + ... + X^{\mathrm{iid}}_m)}] - \mathbb{E}[e^{\theta(X^{\mathrm{ort}}_1 + ... + X^{\mathrm{ort}}_m)}] \\
&= \sum_{j=0}^{\infty} \frac{\theta^j}{j!} \sum_{(j_1, ..., j_m) \in \mathcal{S}_j} c(j_1, ..., j_m) \left(\mathbb{E}[(X^{\mathrm{iid}}_1)^{j_1} \cdot ... \cdot (X^{\mathrm{iid}}_m)^{j_m}] - \mathbb{E}[(X^{\mathrm{ort}}_1)^{j_1} \cdot ... \cdot (X^{\mathrm{ort}}_m)^{j_m}]\right).
\end{aligned} \tag{47}$$

Note first that using the fact that $g$ is entire, we can rewrite each $X^{\mathrm{ort}}_i$ as:

$$X^{\mathrm{ort}}_i = \sum_{s=0}^{\infty} a_s ((\omega^{\mathrm{ort}}_i)^\top \mathbf{z})^s, \tag{48}$$

where $g(x) = \sum_{s=0}^{\infty} a_s x^s$ and $a_0, a_1, ... \geq 0$. Similarly,

$$X^{\mathrm{iid}}_i = \sum_{s=0}^{\infty} a_s ((\omega^{\mathrm{iid}}_i)^\top \mathbf{z})^s. \tag{49}$$

By plugging in the above formulae for $X^{\mathrm{ort}}_i$ and $X^{\mathrm{iid}}_i$ into the formula for $\Delta$ and expanding power-expressions, we obtain:

$$\Delta = \sum_{j=0}^{\infty} \frac{\theta^j}{j!} \sum_{(j_1, ..., j_m) \in \mathcal{S}_j} c(j_1, ..., j_m) \sum_{(d_1, ..., d_m) \in \mathcal{D}(j_1, ..., j_m)} \widehat{\Delta}(d_1, ..., d_m), \tag{50}$$

for some ordered subsets of indices (with potentially repeating entries) $\mathcal{D}(j_1, ..., j_m)$ (an exact formula for those can be given, but we do not need it to complete the proof and, since it is technical, it would unnecessarily complicate the proof, so we skip it) and $\widehat{\Delta}(d_1, ..., d_m)$ defined as:

$$\widehat{\Delta}(d_1, ..., d_m) = \mathbb{E}[((\omega^{\mathrm{iid}}_1)^\top \mathbf{z})^{d_1} \cdot ... \cdot ((\omega^{\mathrm{iid}}_m)^\top \mathbf{z})^{d_m}] - \mathbb{E}[((\omega^{\mathrm{ort}}_1)^\top \mathbf{z})^{d_1} \cdot ... \cdot ((\omega^{\mathrm{ort}}_m)^\top \mathbf{z})^{d_m}]. \tag{51}$$

Our next goal is to rewrite the formula for $\widehat{\Delta}(d_1, ..., d_m)$. Denote:

$$Y = ((\omega^{\mathrm{ort}}_1)^\top \mathbf{z})^{d_1} \cdot ... \cdot ((\omega^{\mathrm{ort}}_m)^\top \mathbf{z})^{d_m}. \tag{52}$$

Observe that $Y$ has the same distribution as $Y'$ defined as:

$$Y' = \left(\mathbf{e}_1^\top \frac{\mathbf{g}}{\|\mathbf{g}\|} \|\mathbf{z}\|\right)^{d_1} \cdot ... \cdot \left(\mathbf{e}_m^\top \frac{\mathbf{g}}{\|\mathbf{g}\|} \|\mathbf{z}\|\right)^{d_m} \cdot (\|\omega^{\mathrm{ort}}_1\|)^{d_1} \cdot ... \cdot (\|\omega^{\mathrm{ort}}_m\|)^{d_m}, \tag{53}$$

where $\mathbf{g}$ is a Gaussian vector taken from the $\mathcal{N}(0, \mathbf{I}_d)$ distribution, independently from: $\|\omega^{\mathrm{ort}}_1\|, ..., \|\omega^{\mathrm{ort}}_m\|$.

This comes from the fact that for a fixed $\mathbf{z}$ one can think about the set: $\frac{\omega^{\mathrm{ort}}_1}{\|\omega^{\mathrm{ort}}_1\|}, ..., \frac{\omega^{\mathrm{ort}}_m}{\|\omega^{\mathrm{ort}}_m\|}$ as a random rotation of the system of $m$ canonical basis vectors: $\mathbf{e}_1, ..., \mathbf{e}_m$. Thus instead of applying a random rotation to: $\mathbf{e}_1, ..., \mathbf{e}_m$, one can equivalently randomly rotate vector $\mathbf{z}$. A randomly rotated vector $\mathbf{z}$ has the same distribution as: $\frac{\mathbf{g}}{\|\mathbf{g}\|} \|\mathbf{z}\|$.

Now note that lengths of vectors $\omega^{\mathrm{ort}}_1, ..., \omega^{\mathrm{ort}}_m$ are chosen independently.

Therefore we obtain:

$$\mathbb{E}[((\omega^{\mathrm{ort}}_1)^\top \mathbf{z})^{d_1} \cdot ... \cdot ((\omega^{\mathrm{ort}}_m)^\top \mathbf{z})^{d_m}] = \mathbb{E}[(\|\omega^{\mathrm{ort}}_1\|)^{d_1}] \cdot ... \cdot \mathbb{E}[(\|\omega^{\mathrm{ort}}_m\|)^{d_m}] \cdot \mathbb{E}[(\mathbf{e}_1^\top \mathbf{v})^{d_1} \cdot ... \cdot (\mathbf{e}_m^\top \mathbf{v})^{d_m}]\, \|\mathbf{z}\|^{d_1 + ... + d_m}, \tag{54}$$

where $\mathbf{v} \sim \frac{\mathbf{g}}{\|\mathbf{g}\|}$.

Denote $\mathbf{g} = (g_1, ..., g_d)^\top$. Thus we obtain:

$$\mathbb{E}[((\omega^{\mathrm{ort}}_1)^\top \mathbf{z})^{d_1} \cdot ... \cdot ((\omega^{\mathrm{ort}}_m)^\top \mathbf{z})^{d_m}] = \mathbb{E}[(\|\omega^{\mathrm{ort}}_1\|)^{d_1}] \cdot ... \cdot \mathbb{E}[(\|\omega^{\mathrm{ort}}_m\|)^{d_m}] \cdot \|\mathbf{z}\|^{d_1 + ... + d_m}\, \mathbb{E}\left[\frac{g_1^{d_1} \cdot ... \cdot g_m^{d_m}}{\sqrt{g_1^2 + ... + g_d^2}^{\,d_1 + ... + d_m}}\right]. \tag{55}$$

Now let us focus on the other expression from the formula for $\widehat{\Delta}(d_1, ..., d_m)$, the one involving the i.i.d. samples. We have:

$$\begin{aligned}
\mathbb{E}[((\omega^{\mathrm{iid}}_1)^\top \mathbf{z})^{d_1} \cdot ... \cdot ((\omega^{\mathrm{iid}}_m)^\top \mathbf{z})^{d_m}] &= \prod_{i=1}^{m} \mathbb{E}[((\omega^{\mathrm{iid}}_i)^\top \mathbf{z})^{d_i}] \\
&= \mathbb{E}[(\|\omega^{\mathrm{iid}}_1\|)^{d_1}] \cdot ... \cdot \mathbb{E}[(\|\omega^{\mathrm{iid}}_m\|)^{d_m}] \cdot \|\mathbf{z}\|^{d_1 + ... + d_m} \cdot \prod_{i=1}^{m} \mathbb{E}\left[\frac{g_i^{d_i}}{\sqrt{g_1^2 + ... + g_d^2}^{\,d_i}}\right],
\end{aligned} \tag{56}$$

where the first equality comes from the fact that different $\omega^{\mathrm{iid}}_i$s are independent and the second one is implied by an analysis analogous to the one conducted above.

We will need the following lemma:

Lemma 5.
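For every $s \in \mathbb{N}_+$ such that $s \leq d$ and every $k_1, ..., k_s \in \mathbb{N}_+$ the following holds:

$$\mathbb{E}\left[\frac{g_1^{k_1} \cdot ... \cdot g_s^{k_s}}{\sqrt{g_1^2 + ... + g_d^2}^{\,k_1 + ... + k_s}}\right] = \frac{\prod_{i=1}^{s} \mathbb{E}[g_i^{k_i}]}{\mathbb{E}\left[\sqrt{g_1^2 + ... + g_d^2}^{\,k_1 + ... + k_s}\right]}. \tag{57}$$

Proof.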
Take $\mathbf{r} = \frac{\mathbf{g}}{\|\mathbf{g}\|} \|\tilde{\mathbf{g}}\|$, where $\tilde{\mathbf{g}}$ is an independent copy of $\mathbf{g}$. Note that $\mathbf{r} \sim \mathbf{g}$. We have:

$$\mathbb{E}[r_1^{k_1}] \cdot ... \cdot \mathbb{E}[r_s^{k_s}] = \mathbb{E}[r_1^{k_1} \cdot ... \cdot r_s^{k_s}] = \mathbb{E}\left[\frac{g_1^{k_1} \cdot ... \cdot g_s^{k_s}}{\sqrt{g_1^2 + ... + g_d^2}^{\,k_1 + ... + k_s}}\right] \cdot \mathbb{E}[\|\tilde{\mathbf{g}}\|^{k_1 + ... + k_s}], \tag{58}$$

where the first equality comes from the independence of different elements of $\mathbf{r} = (r_1, ..., r_d)^\top$ and the second equality is implied by the fact that $\tilde{\mathbf{g}}$ is independent from $\mathbf{g}$.

Therefore we have:

$$\mathbb{E}\left[\frac{g_1^{k_1} \cdot ... \cdot g_s^{k_s}}{\sqrt{g_1^2 + ... + g_d^2}^{\,k_1 + ... + k_s}}\right] = \frac{\mathbb{E}[r_1^{k_1}] \cdot ... \cdot \mathbb{E}[r_s^{k_s}]}{\mathbb{E}[\|\tilde{\mathbf{g}}\|^{k_1 + ... + k_s}]}. \tag{59}$$

That completes the proof since $\mathbf{r} \sim \mathbf{g}$ and $\tilde{\mathbf{g}} \sim \mathbf{g}$.

Note that by Lemma 5, we can rewrite the right expression from the formula for $\widehat{\Delta}(d_1, ..., d_m)$ as:

$$\mathbb{E}[(\|\omega^{\mathrm{ort}}_1\|)^{d_1}] \cdot ... \cdot \mathbb{E}[(\|\omega^{\mathrm{ort}}_m\|)^{d_m}] \cdot \|\mathbf{z}\|^{d_1 + ... + d_m}\, \frac{\prod_{i=1}^{m} \mathbb{E}[g_i^{d_i}]}{\mathbb{E}\left[\sqrt{g_1^2 + ... + g_d^2}^{\,d_1 + ... + d_m}\right]}. \tag{60}$$

The left expression from the formula for $\widehat{\Delta}(d_1, ..., d_m)$ can be rewritten as:

$$L(d_1, ..., d_m) = \mathbb{E}[(\|\omega^{\mathrm{iid}}_1\|)^{d_1}] \cdot ... \cdot \mathbb{E}[(\|\omega^{\mathrm{iid}}_m\|)^{d_m}] \cdot \|\mathbf{z}\|^{d_1 + ... + d_m}\, \frac{\prod_{i=1}^{m} \mathbb{E}[g_i^{d_i}]}{\mathbb{E}\left[\sqrt{g_1^2 + ... + g_d^2}^{\,d_1}\right] \cdot ... \cdot \mathbb{E}\left[\sqrt{g_1^2 + ... + g_d^2}^{\,d_m}\right]}. \tag{61}$$

Since marginal distributions of $\omega^{\mathrm{ort}}_i$ and $\omega^{\mathrm{iid}}_i$ are the same, we can rewrite $\widehat{\Delta}(d_1, ..., d_m)$ as:

$$\widehat{\Delta}(d_1, ..., d_m) = L(d_1, ..., d_m)(1 - \tau(d_1, ..., d_m)), \tag{62}$$

where $\tau(d_1, ..., d_m)$ is defined as:

$$\tau(d_1, ..., d_m) = \frac{\mathbb{E}\left[\sqrt{g_1^2 + ... + g_d^2}^{\,d_1}\right] \cdot ... \cdot \mathbb{E}\left[\sqrt{g_1^2 + ... + g_d^2}^{\,d_m}\right]}{\mathbb{E}\left[\sqrt{g_1^2 + ... + g_d^2}^{\,d_1 + ... + d_m}\right]}. \tag{63}$$

We now need a few observations regarding $\widehat{\Delta}(d_1, ..., d_m)$.
Note first that since odd moments of the Gaussian scalar distribution $\mathcal{N}(0, 1)$ are zero, $\widehat{\Delta}(d_1, ..., d_m)$ is zero if at least one of the $d_i$ is odd. Furthermore, $\widehat{\Delta}(d_1, ..., d_m)$ is trivially zero if all but at most one $d_i$ are zero.

With our new notation, $\Delta$ can be rewritten as:

$$\Delta = \sum_{j=0}^{\infty} \frac{\theta^j}{j!} \sum_{(j_1, ..., j_m) \in \mathcal{S}_j} c(j_1, ..., j_m) \sum_{(d_1, ..., d_m) \in \mathcal{D}(j_1, ..., j_m)} L(d_1, ..., d_m)(1 - \tau(d_1, ..., d_m)). \tag{64}$$

Note also that we have:

$$\mathbb{E}[e^{\theta(X^{\mathrm{iid}}_1 + ... + X^{\mathrm{iid}}_m)}] = \sum_{j=0}^{\infty} \frac{\theta^j}{j!} \sum_{(j_1, ..., j_m) \in \mathcal{S}_j} c(j_1, ..., j_m) \sum_{(d_1, ..., d_m) \in \mathcal{D}(j_1, ..., j_m)} L(d_1, ..., d_m). \tag{65}$$

Therefore (see our observations on $\widehat{\Delta}(d_1, ..., d_m)$) to complete the proof it suffices to show that: $\tau(d_1, ..., d_m) \leq \frac{d}{d+2}$ if at least two: $d_i, d_j$ for $i \neq j$ are nonzero and all $d_i$ are even.

Lemma 6.
The following holds if for some $i \neq j$ we have $d_i, d_j > 0$ and all $d_i$ are even:

$$\tau(d_1, \ldots, d_m) \leq \frac{d}{d+2}. \tag{66}$$

Proof.
Note that $\tau(d_1, \ldots, d_m)$ can be rewritten as:

$$\tau(d_1, \ldots, d_m) = \frac{\prod_{i=1}^m \mu_d(d_i)}{\mu_d\left(\sum_{i=1}^m d_i\right)}, \tag{67}$$

where $\mu_d(j)$ stands for the $j$-th moment of the $\chi$-distribution with $d$ degrees of freedom. Note that $\mu_d(j) = 2^{\frac{j}{2}} \frac{\Gamma(\frac{d+j}{2})}{\Gamma(\frac{d}{2})}$, where $\Gamma$ is the so-called Gamma-function.

Using the fact that $\Gamma(n) = (n-1)!$ and $\Gamma(n + \frac{1}{2}) = \frac{(2n-1)!!}{2^n}\sqrt{\pi}$ for $n \in \mathbb{N}_+$, it is easy to see that for a fixed $d$, the RHS of Equality 67 is maximized when $d_i = d_j = 2$ and $d_k = 0$ for some $i \neq j$ and $k \notin \{i, j\}$. Furthermore, straightforward calculations show that in that case the value of the RHS of Equality 67 is $\frac{d}{d+2}$. That completes the proof of the Lemma, and consequently, the proof of the entire Theorem.

F.4.3 PROOF OF THEOREM

Proof.
We will use the notation from the proof of Theorem 5. Since both estimators $\widehat{F}^{\mathrm{ort}}_m(\mathbf{z})$ and $\widehat{F}^{\mathrm{iid}}_m(\mathbf{z})$ are unbiased, we have $\mathrm{MSE}(\widehat{F}^{\mathrm{ort}}_m(\mathbf{z})) = \mathrm{Var}(\widehat{F}^{\mathrm{ort}}_m(\mathbf{z}))$ and $\mathrm{MSE}(\widehat{F}^{\mathrm{iid}}_m(\mathbf{z})) = \mathrm{Var}(\widehat{F}^{\mathrm{iid}}_m(\mathbf{z}))$. We have:

$$\mathrm{Var}(\widehat{F}^{\mathrm{iid}}_m(\mathbf{z})) = \mathbb{E}\left[(\widehat{F}^{\mathrm{iid}}_m(\mathbf{z}) - \mathbb{E}[\widehat{F}^{\mathrm{iid}}_m(\mathbf{z})])^2\right] = \mathbb{E}\left[(\widehat{F}^{\mathrm{iid}}_m(\mathbf{z}))^2\right] - F^2(\mathbf{z}). \tag{68}$$

Similarly,

$$\mathrm{Var}(\widehat{F}^{\mathrm{ort}}_m(\mathbf{z})) = \mathbb{E}\left[(\widehat{F}^{\mathrm{ort}}_m(\mathbf{z}))^2\right] - F^2(\mathbf{z}). \tag{69}$$

We have:

$$\mathbb{E}\left[(\widehat{F}^{\mathrm{iid}}_m(\mathbf{z}))^2\right] = \frac{1}{m^2} \sum_{i=1}^m \mathbb{E}[(X_i^{\mathrm{iid}})^2] + \frac{1}{m^2} \sum_{i \neq j} \mathbb{E}[X_i^{\mathrm{iid}} X_j^{\mathrm{iid}}]. \tag{70}$$

Similarly, we get:

$$\mathbb{E}\left[(\widehat{F}^{\mathrm{ort}}_m(\mathbf{z}))^2\right] = \frac{1}{m^2} \sum_{i=1}^m \mathbb{E}[(X_i^{\mathrm{ort}})^2] + \frac{1}{m^2} \sum_{i \neq j} \mathbb{E}[X_i^{\mathrm{ort}} X_j^{\mathrm{ort}}]. \tag{71}$$

Therefore, since marginal distributions of $X_i^{\mathrm{iid}}$ and $X_i^{\mathrm{ort}}$ are the same, we have:

$$\begin{aligned}
\mathrm{MSE}(\widehat{F}^{\mathrm{iid}}_m(\mathbf{z})) - \mathrm{MSE}(\widehat{F}^{\mathrm{ort}}_m(\mathbf{z})) &= \binom{m}{2} \cdot 2 \cdot \frac{1}{m^2}\left(\mathbb{E}[X_1^{\mathrm{iid}} X_2^{\mathrm{iid}}] - \mathbb{E}[X_1^{\mathrm{ort}} X_2^{\mathrm{ort}}]\right) \\
&= \left(1 - \frac{1}{m}\right)\left(\mathbb{E}[X_1^{\mathrm{iid}} X_2^{\mathrm{iid}}] - \mathbb{E}[X_1^{\mathrm{ort}} X_2^{\mathrm{ort}}]\right).
\end{aligned} \tag{72}$$

Plugging in the formula for $X_i^{\mathrm{ort}}$ and $X_i^{\mathrm{iid}}$ from Equation 48 and Equation 49, and using our analysis from the proof of Theorem 3, we obtain:

$$\mathrm{MSE}(\widehat{F}^{\mathrm{iid}}_m(\mathbf{z})) - \mathrm{MSE}(\widehat{F}^{\mathrm{ort}}_m(\mathbf{z})) = \left(1 - \frac{1}{m}\right) \sum_{t,u=0}^{\infty} a_t a_u \|\mathbf{z}\|^{t+u}\, \mathbb{E}[\|\omega\|^t]\, \mathbb{E}[\|\omega\|^u] \cdot \frac{\mathbb{E}[r^t]\, \mathbb{E}[r^u]}{\mathbb{E}\left[\sqrt{g_1^2 + \ldots + g_d^2}^{\,t}\right] \mathbb{E}\left[\sqrt{g_1^2 + \ldots + g_d^2}^{\,u}\right]} (1 - \tau(t, u)) \tag{73}$$

for $\omega \sim \Omega$ and $r \sim \mathcal{N}(0, 1)$. Thus, using Lemma 6, we get:

$$\begin{aligned}
\mathrm{MSE}(\widehat{F}^{\mathrm{iid}}_m(\mathbf{z})) - \mathrm{MSE}(\widehat{F}^{\mathrm{ort}}_m(\mathbf{z})) &\geq \left(1 - \frac{1}{m}\right) \frac{2}{d+2} \sum_{t,u=0}^{\infty} a_t a_u \|\mathbf{z}\|^{t+u}\, \mathbb{E}[\|\omega\|^t]\, \mathbb{E}[\|\omega\|^u] \cdot \frac{\mathbb{E}[r^t]\, \mathbb{E}[r^u]}{\mathbb{E}\left[\sqrt{g_1^2 + \ldots + g_d^2}^{\,t}\right] \mathbb{E}\left[\sqrt{g_1^2 + \ldots + g_d^2}^{\,u}\right]} \\
&= \left(1 - \frac{1}{m}\right) \frac{2}{d+2} \left( \sum_{t=0}^{\infty} a_t \|\mathbf{z}\|^{t}\, \mathbb{E}[\|\omega\|^t] \cdot \frac{\mathbb{E}[r^t]}{\mathbb{E}\left[\sqrt{g_1^2 + \ldots + g_d^2}^{\,t}\right]} \right)^2 \\
&= \left(1 - \frac{1}{m}\right) \frac{2}{d+2} F_{,g}(\mathbf{z}).
\end{aligned} \tag{74}$$

That completes the proof.

F.5 PROOF OF THEOREM

[...] $\mathbf{A}$, our algorithm provides strong concentration guarantees. This is also the case for trigonometric random features, yet, as discussed in the main body of the paper, due to attention renormalization and the higher variance of the estimation of small entries of the attention matrix, the trigonometric mechanism is sub-optimal. We show here that $m_{\mathrm{opt}}$, the optimal number of random projections for the trigonometric orthogonal mechanism for accurate estimation of the attention matrix, does not depend on $L$ but only on $d$. In fact, we prove that if we take $m_{\mathrm{opt}} = \Theta(d \log(d))$, then with $O(Ld^2 \log(d))$-time, we can approximate $\mathbf{A}$ up to any precision, regardless of the number of tokens $L$. In order to provide those guarantees, we leverage recent research on the theory of negative dependence for ORFs (Lin et al., 2020).

We prove the more general version of Theorem 4 from the main body of the paper:

Theorem 7 (Uniform convergence for the trigonometric mechanism). Define entries of the attention matrix $\mathbf{A}$ as follows: $A_{i,j} = g(\mathbf{q}_i^\top)\, \mathrm{K}(d^{-\frac{1}{4}} \mathbf{q}_i^\top, d^{-\frac{1}{4}} \mathbf{k}_j^\top)\, h(\mathbf{k}_j^\top)$ for some $g, h: \mathbb{R}^d \to \mathbb{R}$, where $\mathrm{K}$ is a radial basis function (RBF) kernel (Choromanski et al., 2018b) with corresponding spectral distribution $\Omega$ (e.g. the Gaussian kernel, for which $\Omega = \mathcal{N}(0, \mathbf{I}_d)$). Assume that the rows of matrices $\mathbf{Q}$ and $\mathbf{K}$ are taken from a ball $B(R)$ of radius $R$, centered at $0$ (i.e. norms of queries and keys are upper-bounded by $R$). Define $l = R d^{-\frac{1}{4}}$ and take $g^* = \max_{\mathbf{x} \in B(l)} |g(\mathbf{x})|$ and $h^* = \max_{\mathbf{x} \in B(l)} |h(\mathbf{x})|$. Then for any $\epsilon > 0$, $\delta = \frac{\epsilon}{g^* h^*}$ and the number of random projections $m = \Omega\left(\frac{d}{\delta^2} \log\left(\frac{4 \sigma R}{\delta d^{\frac{1}{4}}}\right)\right)$ for $\sigma = \mathbb{E}_{\omega \sim \Omega}[\omega^\top \omega]$ the following holds: $\|\widehat{\mathbf{A}} - \mathbf{A}\| \leq \epsilon$ with any constant probability, where $\widehat{\mathbf{A}}$ approximates the generalized attention matrix via orthogonal trigonometric random features.
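The result holds in particular for regular softmax-attention, for which $\mathrm{K}$ is a Gaussian kernel and $g(\mathbf{x}) = h(\mathbf{x}) = \exp\left(\frac{\|\mathbf{x}\|^2}{2}\right)$. In that case $m_{\mathrm{opt}} = \Omega\left(\frac{d}{\delta^2} \log\left(\frac{4 d^{\frac{3}{4}} R}{\delta}\right)\right)$ since $\sigma = d$.

Proof.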
Let $\mathbf{D}_Q$ be a diagonal matrix with entries of the form $g(\mathbf{q}_i^\top)$ and let $\mathbf{D}_K$ be a diagonal matrix with entries of the form $h(\mathbf{k}_i^\top)$. Denote $\mathbf{B} = [\mathrm{K}(d^{-\frac{1}{4}} \mathbf{q}_i^\top, d^{-\frac{1}{4}} \mathbf{k}_j^\top)]_{i,j} \in \mathbb{R}^{L \times L}$. Denote by $\widehat{\mathbf{A}}$ an approximation of the attention matrix obtained from trigonometric orthogonal random features and by $\widehat{\mathbf{B}}$ an approximation of matrix $\mathbf{B}$ that those random features provide. We rely on Theorem 3 from (Lin et al., 2020). Note that we can apply it in our case, since for RBF kernels the corresponding functions $f_i$ satisfy $f_1(x) = \sin(x)$, $f_2(x) = \cos(x)$ (thus in particular are bounded). Also, it is not hard to observe (see for instance the analysis in Claim 1 from (Rahimi & Recht, 2007)) that we can take $L_f = 1$ (for $L_f$ as in Theorem 3 from (Lin et al., 2020)). Using Theorem 3 from (Lin et al., 2020), we conclude that:

$$\|\widehat{\mathbf{B}} - \mathbf{B}\| \leq \delta \tag{75}$$

with any constant probability as long as $m = \Omega\left(\frac{d}{\delta^2} \log\left(\frac{\sigma \cdot \mathrm{diam}(\mathcal{M})}{\delta}\right)\right)$, where $\sigma = \mathbb{E}[\omega^\top \omega]$ and $\mathrm{diam}(\mathcal{M})$ is the diameter of the smallest ball $\mathcal{M}$ containing all vectors of the form $\mathbf{z} = \frac{\mathbf{Q}_i}{d^{\frac{1}{4}}} - \frac{\mathbf{K}_j}{d^{\frac{1}{4}}}$. Since $\|\mathbf{Q}_i\|, \|\mathbf{K}_j\| \leq R$, we conclude that $\|\mathbf{z}\| \leq \frac{2R}{d^{\frac{1}{4}}}$ and thus one can take $\mathrm{diam}(\mathcal{M}) = \frac{4R}{d^{\frac{1}{4}}}$. We have:

$$\|\widehat{\mathbf{A}} - \mathbf{A}\| = \|\mathbf{D}_Q (\widehat{\mathbf{B}} - \mathbf{B}) \mathbf{D}_K\| \leq \|\mathbf{D}_Q\|\, \|\widehat{\mathbf{B}} - \mathbf{B}\|\, \|\mathbf{D}_K\| \leq \delta g^* h^*. \tag{76}$$

Taking $\delta = \frac{\epsilon}{g^* h^*}$ completes the proof.

F.6 DISCUSSION OF THEOREM

The number $m$ of random projections required to approximate the attention matrix within $\epsilon$ error is a function of the data dimensionality $d$, the parameter $\epsilon$ and the radius $R$ of the ball within which the queries and keys live:

$$m = \Psi(\epsilon, d, R).$$

The dependence on $d$ and $\epsilon$ is fairly easy to understand: with a larger dimensionality $d$ we need more random projections (on the order of magnitude $d \log(d)$) to get an approximation within $\epsilon$ error.
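The role of $m$ can be probed empirically with a small experiment. The sketch below is an illustration only, not the paper's implementation: it builds orthogonal Gaussian projections by Gram-Schmidt, forms trigonometric features $\frac{1}{\sqrt{m}}[\cos(\omega_i^\top \mathbf{x}), \sin(\omega_i^\top \mathbf{x})]$, and reports the largest entrywise error of the resulting estimate of the Gaussian-kernel matrix $\mathbf{B}$. All function names, the choice $d = 8$, the number of tokens, the seed, and the use of the largest entrywise error as the error measure are assumptions made for this example.

```python
import math
import random


def gaussian_vector(d, rng):
    return [rng.gauss(0.0, 1.0) for _ in range(d)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def orthogonal_block(d, rng):
    """d mutually orthogonal vectors in R^d, each marginally N(0, I_d)."""
    basis = []
    for _ in range(d):
        v = gaussian_vector(d, rng)
        for b in basis:  # Gram-Schmidt: remove components along earlier directions
            c = dot(v, b)
            v = [vi - c * bi for vi, bi in zip(v, b)]
        norm = math.sqrt(dot(v, v))
        basis.append([vi / norm for vi in v])
    rows = []
    for b in basis:
        # Rescale each unit direction by the norm of a fresh Gaussian vector.
        g = gaussian_vector(d, rng)
        scale = math.sqrt(dot(g, g))
        rows.append([scale * bi for bi in b])
    return rows


def orthogonal_projections(m, d, rng):
    """Stack independent orthogonal blocks and keep the first m rows."""
    rows = []
    while len(rows) < m:
        rows.extend(orthogonal_block(d, rng))
    return rows[:m]


def trig_features(x, omegas):
    """Map x to (1/sqrt(m)) [cos(w_i . x), ..., sin(w_i . x), ...]."""
    scale = 1.0 / math.sqrt(len(omegas))
    proj = [dot(w, x) for w in omegas]
    return [scale * math.cos(p) for p in proj] + [scale * math.sin(p) for p in proj]


def gaussian_kernel(x, y):
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / 2.0)


def point_on_sphere(d, radius, rng):
    v = gaussian_vector(d, rng)
    norm = math.sqrt(dot(v, v))
    return [radius * vi / norm for vi in v]


def max_kernel_error(m, d=8, num_tokens=12, radius=1.0, seed=0):
    """Largest entrywise error of the feature-based estimate of B."""
    rng = random.Random(seed)
    scale = d ** -0.25  # queries and keys enter the kernel scaled by d^(-1/4)
    queries = [[scale * c for c in point_on_sphere(d, radius, rng)] for _ in range(num_tokens)]
    keys = [[scale * c for c in point_on_sphere(d, radius, rng)] for _ in range(num_tokens)]
    omegas = orthogonal_projections(m, d, rng)
    fq = [trig_features(q, omegas) for q in queries]
    fk = [trig_features(k, omegas) for k in keys]
    return max(
        abs(dot(a, b) - gaussian_kernel(q, k))
        for q, a in zip(queries, fq)
        for k, b in zip(keys, fk)
    )


if __name__ == "__main__":
    for m in (64, 512, 2048):
        print("m =", m, "max entrywise error =", max_kernel_error(m))
```

Running it for increasing $m$ should show the error shrinking, roughly at the usual Monte Carlo rate, although a single seed is noisy. Increasing `radius` at a fixed `m` is a simple way to see the estimate degrade as queries and keys grow longer.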
The dependence on $R$ means that the length of queries and keys cannot grow at a fixed $m$ if we want to retain the quality of the approximation. In particular, this means that FAVOR cannot approximate hard attention on sequences of unlimited length with a fixed $m$