Unlocking Pixels for Reinforcement Learning via Implicit Attention
Krzysztof Choromanski*, Deepali Jain*, Jack Parker-Holder*, Xingyou Song*, Valerii Likhosherstov, Anirban Santara, Aldo Pacchiano, Yunhao Tang, Adrian Weller

Abstract
There has recently been significant interest in training reinforcement learning (RL) agents in vision-based environments. This poses many challenges, such as high dimensionality and the potential for observational overfitting through spurious correlations. A promising approach to solve both of these problems is a self-attention bottleneck, which provides a simple and effective framework for learning high performing policies, even in the presence of distractions. However, due to the poor scalability of attention architectures, these methods do not scale beyond low resolution visual inputs, using large patches (thus small attention matrices). In this paper we make use of new efficient attention algorithms, recently shown to be highly effective for Transformers, and demonstrate that these new techniques can be applied in the RL setting. This allows our attention-based controllers to scale to larger visual inputs and facilitates the use of smaller patches, even individual pixels, improving generalization. In addition, we propose a new efficient algorithm approximating softmax attention with what we call hybrid random features, leveraging the theory of angular kernels. We show theoretically and empirically that hybrid random features is a promising approach when using attention for vision-based RL.
1. Introduction
Reinforcement learning (RL; Sutton and Barto, 1998) considers the problem of an agent learning solely from interactions to maximize reward. Since the introduction of deep neural networks, the field of deep RL has seen tremendous achievements, from games (Silver et al., 2016), to robotics (OpenAI et al., 2019) and even real-world problems (Bellemare et al., 2020).

As RL continues to be tested in more challenging settings, there has been increased interest in learning from vision-based observations (Hafner et al., 2019; Lee et al., 2020; Hafner et al., 2020; Laskin et al., 2020a;b; Kostrikov et al., 2021). This presents several challenges, as image-based observations are not only significantly larger, but also more likely to contain confounding variables, which can lead to overfitting (Song et al., 2020). A promising approach to tackling these challenges is the use of bottlenecks, which force agents to learn from a low-dimensional feature representation. This has been shown to be useful both for improving scalability (Hafner et al., 2019; 2020) and for generalization (Igl et al., 2019). In this paper, we focus on self-attention bottlenecks, using an attention mechanism to select the most important regions of the state space. Recent work showed that a specific form of hard attention combined effectively with neuroevolution to create agents with significantly fewer parameters and strong generalization capabilities (Tang et al., 2020), while also producing interpretable policies. However, the current form of selective attention proposed is severely limited. It makes use of the most prominent softmax attention, popularized by (Vaswani et al., 2017a), which suffers from quadratic complexity in the size of the attention matrix (i.e. the number of patches). This means that models become significantly slower as vision-based observations become higher resolution, and the effectiveness of the bottleneck is reduced by relying on larger patches.

*Equal contribution. Google, University of Oxford, University of Cambridge, UC Berkeley, Columbia University. Correspondence to: Krzysztof Choromanski <[email protected]>.
Figure 1. Left: an observation from the Cheetah-Run task, downsized to a (100 x 100) RGB image. Right: comparison of inference time (bars) vs. rewards (crosses) for the Baseline Attention Agent from (Tang et al., 2020) and our IAP mechanism. Rewards are means over five seeds, training for 100 iterations. Inference times are means over 100 forward passes.

In this paper, we demonstrate how new, scalable attention mechanisms (Choromanski et al., 2021) designed for Transformers can be effectively adapted to the vision-based RL setting.
We call the resulting algorithm Implicit Attention for Pixels (IAP). Notably, using IAP we are able to train agents with self-attention for images with 8x more pixels than (Tang et al., 2020). We are also able to dramatically reduce the patch size, even to just a single pixel. In both cases, inference time is only marginally higher due to the linear scaling of IAP. We show a simple example of the effectiveness of our approach in Figure 1. Here we train an agent for 100 iterations on the Cheetah-Run task from the DM Control Suite (Tassa et al., 2018). The agents are both trained the same way, with the only difference being the use of brute force attention (blue) or IAP efficient attention (orange). Both agents achieve a similar reward, with dramatically different inference times.

In addition, we show that attention row-normalization, which is typically crucial in supervised settings, is not required for training RL policies. We are thus able to introduce a new efficient mechanism, approximating softmax-kernel attention (known to be in general superior to other attention kernels) with what we call hybrid random features, leveraging the theory of angular kernels. We show that our new method is more robust than existing algorithms for approximating softmax-kernel attention when attention normalization is not needed. Our mechanism is effective for RL tasks with as few as 15 random samples, which is in striking contrast to the supervised setting, where usually 200-300 samples are required. That 13x+ reduction has a profound effect on the speed of the method.

To summarize, our key contributions are as follows:
• Practical: To the best of our knowledge, we are the first to use efficient attention mechanisms for RL from pixels. This has two clear benefits: we can scale to larger images than previous works, and we can use more fine-grained patches, which produce more effective self-attention bottlenecks. Both goals can be achieved with an embarrassingly small number of trainable parameters, providing 10x compression over standard CNN-based policies with no loss of quality of the learned controller. In our experiments (Section 5) we demonstrate the strength of this approach by training quadruped robots for obstacle avoidance.

• Theoretical: We introduce hybrid random features, which provably and unbiasedly approximate softmax-kernel attention and better control the variance of the estimation than previous algorithms. We believe this is a significant contribution towards efficient attention for RL and beyond, to the theory of Monte Carlo methods for kernels in machine learning.
2. Related Work
Several approaches to vision in RL have been proposed over the years, tackling three key challenges: high-dimensional input space, partial observability of the actual state from images, and observational overfitting to spurious features (Song et al., 2020). Dimensionality reduction can be obtained with hand-crafted features or with learned representations, typically via ResNet/CNN-based modules (He et al., 2016). Other approaches equip an agent with segmentation techniques and depth maps (Wu et al., 2018). Those methods require training a substantial number of parameters just to process vision, which is usually only a part of the richer heterogeneous agent input that, as in robotics applications, might additionally involve lidar data, tactile sensors and more. Partial observability was addressed by a line of work focusing on designing new compact and expressive neural networks for vision-based controllers, such as (Kulhánek et al., 2019). Common ways to reduce observational overfitting are: data augmentation (Kostrikov et al., 2021; Laskin et al., 2020a;b), causal approaches (Zhang et al., 2021) and bottlenecks (Igl et al., 2019).
Information bottlenecks have been particularly popular in vision-based RL (Hafner et al., 2019; 2020; Lee et al., 2020), backed by theory for improved generalization (Shamir et al., 2010; Tishby and Zaslavsky, 2015).

In this work, we focus on self-attention bottlenecks. These provide a drastic reduction in the number of model parameters compared to standard CNN-based approaches and, furthermore, aid interpretability, which is of particular importance in reinforcement learning. The idea of selecting individual "glimpses" with attention was first proposed by Mnih et al. (2014), who use REINFORCE (Williams, 1992) to learn which patches to use, achieving strong generalization results. Others have presented approaches to differentiate through hard attention (Bengio et al., 2013). This work is inspired by Tang et al. (2020), who proposed to use neuroevolution methods to optimize a hard attention module, circumventing the requirement to backpropagate through it.

Our paper also contributes to the recent line of work on fast attention mechanisms. Since Transformers were shown to produce state-of-the-art results for language modelling tasks (Vaswani et al., 2017a), there has been a series of efforts to reduce the $O(L^2)$ time and space complexity with respect to sequence length (Kitaev et al., 2020; Peng et al., 2021; Wang et al., 2020). This work extends techniques from Performer architectures (Choromanski et al., 2021), which were recently shown to be among the best performing efficient mechanisms (Tay et al., 2021), and is well aligned with recent efforts on linearizing attention, exemplified by Performers and LambdaNetworks (Bello, 2021).

Finally, it also naturally contributes to the theory of Monte Carlo algorithms for scalable kernel methods (Rahimi and Recht, 2007; Lin et al., 2020; Choromanski et al., 2019; 2018; 2017; Yu et al., 2016), proposing new random feature map mechanisms for softmax-kernels and, consequently, for the inherently related Gaussian kernels.

Solving robotics tasks from vision input is an important and well-researched topic (Kalashnikov et al., 2018a; Yahya et al., 2017; Levine et al., 2016; Pan et al., 2019). Our robotic experiments focus on learning legged locomotion and the necessary navigation skills from vision. In prior work, CNNs have been used to process vision input (Pan et al., 2019; Li et al., 2019; Blanc et al., 2005). In this work, we use self-attention for processing image observations and compare our results with CNNs on realistic robotics tasks.
3. Compact Vision with Attention for RL
In this paper, we focus on training policies $\pi: \mathcal{S} \rightarrow \mathcal{A}$ for RL agents, where $\mathcal{S}$ is the set of states and $\mathcal{A}$ is a set of actions. The goal is to maximize the expected reward $\mathbb{E}_{\tau \sim \pi}[R(\tau)]$ obtained by an agent in the given environment, where the expectation is over trajectories $\tau = \{s_0, a_0, r_0, \ldots, s_H, a_H, r_H\}$, for a horizon $H$, and a reward function $r(s_t, a_t, s_{t+1}): \mathcal{S} \times \mathcal{A} \times \mathcal{S} \rightarrow \mathbb{R}$. We consider deterministic policies. A state is either a compact representation of the visual input (RGB(D) image) or its concatenation with other sensors available to an agent (more details in Section 5).

The agents are trained with attention mechanisms, which take the vision input state (or observation in a partially observable setting) and produce a compact representation for subsequent layers of the policy. The mechanism is agnostic to the choice of the training algorithm. Consider an image represented as a collection of $L = a \cdot b$ (potentially intersecting) RGB(D)-patches indexed by $i \in \{0, 1, \ldots, a-1\}$, $j \in \{0, 1, \ldots, b-1\}$ for some $a, b \in \mathbb{N}_{+}$. Denote by $X \in \mathbb{R}^{L \times c}$ a matrix with vectorized patches as rows (i.e. vectors of RGB(D)-values of all pixels in the patch). Let $V \in \mathbb{R}^{L \times d_V}$ be a matrix of (potentially learned) value vectors corresponding to patches, as in the regular attention mechanism (Vaswani et al., 2017b).

For $l < L$, we define the following patch-to-patch attention module, which is a transformation $\mathbb{R}^{L \times d_V} \rightarrow \mathbb{R}^{l \times d_V}$:

$$\mathrm{Att}(V) = I_{l,L}\,\mathrm{P}(r^{\top} A_{\mathrm{K}})\,V, \quad (1)$$

where $I_{l,L}$ is the matrix $I_L$ truncated to its first $l$ rows and:

• $\mathrm{K}: \mathbb{R}^d \times \mathbb{R}^d \rightarrow \mathbb{R}_{+}$ is a kernel admitting the form $\mathrm{K}(u, v) = \mathbb{E}[\phi(u)^{\top}\phi(v)]$ for some (randomized) finite kernel feature map $\phi: \mathbb{R}^d \rightarrow \mathbb{R}^m_{+}$,
• $A_{\mathrm{K}} \in \mathbb{R}^{L \times L}$ is the attention matrix defined as $A_{\mathrm{K}}(i, j) = \mathrm{K}(q_i^{\top}, k_j^{\top})$, where $q_i, k_i$ are the $i$-th rows of matrices $Q = XW_Q$, $K = XW_K$ (queries & keys), and $W_Q, W_K \in \mathbb{R}^{c \times d_{QK}}$ for some $d_{QK} \in \mathbb{N}_{+}$,
• $r \in \mathbb{R}^L$ is a (potentially learnable) vector defining how the signal from the attention matrix should be agglomerated to determine the most critical patches,
• $\mathrm{P}: \mathbb{R}^L \rightarrow \mathrm{Perm}(L)$ is a (potentially learnable) function into the space of permutation matrices in $\{0, 1\}^{L \times L}$.

The above mechanism effectively chooses $l$ patches from the entire coverage and takes their corresponding embeddings from $V$ as the final representation of the image. The attention block defined in Equation 1 is parameterized by two matrices $W_Q, W_K \in \mathbb{R}^{c \times d_{QK}}$, and potentially also by a vector $r \in \mathbb{R}^{a \cdot b}$ and a function $\mathrm{P}: \mathbb{R}^L \rightarrow \mathrm{Perm}(L)$. The output of the attention module is vectorized and concatenated with other sensor data. The resulting vector is then passed to the controller as its input state. Particular instantiations of the above mechanism lead to techniques studied before. For instance, if $\mathrm{K}$ is a softmax-kernel, $r = (1, \ldots, 1)^{\top}$, $\mathrm{P}$ outputs a permutation matrix that sorts the entries of the input to $\mathrm{P}$ from largest to smallest, and the rows of $V$ are the centers of the corresponding patches, one retrieves the method proposed in (Tang et al., 2020), yet with no attention row-normalization.
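To make the bottleneck above concrete, here is a minimal NumPy sketch of the patch-to-patch attention module of Equation 1, instantiated as in the (Tang et al., 2020) variant just described (r the all-ones vector, P sorting scores from largest to smallest, brute-force softmax-kernel scores). The array sizes, the 1/sqrt(d_QK) scaling inside the kernel, and the random inputs are illustrative assumptions, not the exact configuration used in the paper.

```python
import numpy as np

def softmax_kernel_scores(Q, K):
    """Explicit (brute-force) attention matrix A_K(i, j) = exp(q_i^T k_j / sqrt(d_QK))."""
    d = Q.shape[-1]
    return np.exp(Q @ K.T / np.sqrt(d))

def attention_bottleneck(X, W_Q, W_K, V, l):
    """Select the l highest-scoring patches (Eq. 1 with r = all-ones and P = sort-by-score).

    X: (L, c) vectorized patches, V: (L, d_V) patch embeddings (e.g. patch centers).
    Returns an (l, d_V) compact representation of the image.
    """
    Q, K = X @ W_Q, X @ W_K                  # queries and keys, shape (L, d_QK)
    A = softmax_kernel_scores(Q, K)          # (L, L) attention matrix: quadratic in L
    scores = np.ones(A.shape[0]) @ A         # r^T A_K with r = (1, ..., 1)^T, shape (L,)
    top_l = np.argsort(-scores)[:l]          # P sorts patches from largest to smallest score
    return V[top_l]                          # I_{l,L} P(r^T A_K) V: embeddings of the l chosen patches

# Illustrative usage with made-up sizes: 100 patches of 48 raw values each.
rng = np.random.default_rng(0)
L, c, d_qk, d_v, l = 100, 48, 4, 2, 10
X = rng.normal(size=(L, c))
W_Q, W_K = rng.normal(size=(c, d_qk)), rng.normal(size=(c, d_qk))
V = rng.normal(size=(L, d_v))
print(attention_bottleneck(X, W_Q, W_K, V, l).shape)   # (10, 2)
```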
4. Implicit Attention for Pixels (IAP)
Computing attention blocks as defined in Equation 1 is in practice very costly when $L$ is large, since it requires explicit construction of the matrix $A \in \mathbb{R}^{L \times L}$. This means it is not possible to use small-size patches even for a moderate-size input image, while high-resolution images are prohibitive. Standard attention modules are characterized by $\Omega(L^2)$ space and time complexity, where $L$ is the number of patches. We instead propose to leverage $A$ indirectly, by applying techniques introduced in (Choromanski et al., 2021) for the class of Transformers called Performers. We approximate $A$ via (random) finite feature maps given by the mapping $\phi: \mathbb{R}^{d_{QK}} \rightarrow \mathbb{R}^{m}$ for a parameter $m \in \mathbb{N}_{+}$, as:

$$\widehat{A} = Q^{\prime}(K^{\prime})^{\top}, \quad (2)$$

where $Q^{\prime} \in \mathbb{R}^{L \times m}$, $K^{\prime} \in \mathbb{R}^{L \times m}$ are matrices with rows $\phi(q_i^{\top})^{\top}$ and $\phi(k_i^{\top})^{\top}$ respectively. By replacing $A$ with $\widehat{A}$ in Equation 1, we obtain the attention transformation given as:

$$\widehat{\mathrm{Att}}(V) = I_{l,L}\,\mathrm{P}\big((r^{\top}Q^{\prime})(K^{\prime})^{\top}\big)\,V, \quad (3)$$

where the brackets indicate the order of computations. By disentangling $Q^{\prime}$ from $K^{\prime}$, we effectively avoid explicitly calculating attention matrices and compute the input to $\mathrm{P}$ in time and space linear, rather than quadratic, in $L$. The IAP method is schematically presented in Fig. 2.

Figure 2.
Visualization of our Implicit Attention for Pixels (IAP). An input RGB(D) image is represented as a union of (not necessarily disjoint) patches (in principle even individual pixels). Each patch is projected via learned matrices $W_Q$/$W_K$. This is followed by a set of (potentially randomized) projections, which in turn is followed by a nonlinear mapping $f$ defining the attention type. At inference, this process can be further optimized by computing the product of $W_{Q/K}$ with the (random) projection matrix in advance. Tensors $Q^{\prime}$ and $K^{\prime}$, obtained via (random) projections followed by $f$, define an attention matrix which is never explicitly materialized. Instead, $(Q^{\prime})^{\top}$ is multiplied with the vector $r$ and then the result with the tensor $K^{\prime}$. The output is the score vector. The algorithm can in principle use a multi-head mechanism, although we do not apply it in our experiments. Same-color lines indicate axes with the same number of dimensions.

The kernel $\mathrm{K}$ defining the attention type, and consequently the corresponding finite feature map $\phi$ (randomized or deterministic), can be chosen in different ways, see (Choromanski et al., 2021), yet a variant of the form:

$$\mathrm{K}(u, v) = \mathrm{SM}(x, y), \;\; \text{for } x = d_{QK}^{-\frac{1}{4}}u, \; y = d_{QK}^{-\frac{1}{4}}v, \quad (4)$$

or:

$$x = d_{QK}^{\frac{1}{4}}\frac{u}{\|u\|}, \; y = d_{QK}^{\frac{1}{4}}\frac{v}{\|v\|} \quad (5)$$

(same-length input version), with the softmax-kernel $\mathrm{SM}(x, y) \overset{\mathrm{def}}{=} \exp(x^{\top}y)$, in practice often outperforms others. Thus it suffices to estimate $\mathrm{SM}$. Its efficient random feature map $\phi$, from the FAVOR+ mechanism (Choromanski et al., 2021), is of the form:

$$\phi^{m}_{\exp}(z) = \frac{\Lambda}{\sqrt{m}}\big(\exp(\omega_1^{\top}z), \ldots, \exp(\omega_m^{\top}z)\big)^{\top} \quad (6)$$

for $\Lambda = \exp(-\frac{\|z\|^2}{2})$ and a block-orthogonal ensemble of Gaussian vectors $\{\omega_1, \ldots, \omega_m\}$ with marginal distributions $\mathcal{N}(0, I_{d_{QK}})$. This mapping provides an unbiased estimator $\widehat{\mathrm{K}}^{m}_{\exp}(x, y) = \phi^{m}_{\exp}(x)^{\top}\phi^{m}_{\exp}(y)$ of $\mathrm{SM}(x, y)$ and, consequently, an unbiased estimator of the attention matrix $A_{\mathrm{K}}$ for the softmax-kernel $\mathrm{K}$.
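As an illustration of how Equations 2-6 avoid materializing the attention matrix, the sketch below computes the score vector $(r^{\top}Q^{\prime})(K^{\prime})^{\top}$ with positive random features in time and memory linear in $L$. It follows the FAVOR+ construction only in spirit: i.i.d. rather than block-orthogonal Gaussian projections are used for brevity, and all dimensions are made-up values for the example.

```python
import numpy as np

def positive_random_features(Z, omega):
    """phi_exp^m(z) = exp(-||z||^2 / 2) / sqrt(m) * (exp(omega_1^T z), ..., exp(omega_m^T z))  (Eq. 6)."""
    m = omega.shape[1]
    scale = np.exp(-0.5 * np.sum(Z ** 2, axis=-1, keepdims=True)) / np.sqrt(m)
    return scale * np.exp(Z @ omega)          # shape (L, m)

def iap_scores(Q, K, omega):
    """Implicit score vector (r^T Q')(K')^T with r = all-ones, in O(L m) time and memory (Eqs. 2-3)."""
    d = Q.shape[-1]
    Qp = positive_random_features(Q / d ** 0.25, omega)   # Q' in R^{L x m}
    Kp = positive_random_features(K / d ** 0.25, omega)   # K' in R^{L x m}
    rQ = np.ones(Qp.shape[0]) @ Qp                        # r^T Q', shape (m,)
    return rQ @ Kp.T                                      # approximation of r^T A_K, shape (L,)

# Illustrative check against the brute-force scores on random inputs.
rng = np.random.default_rng(0)
L, d, m = 200, 4, 64
Q, K = rng.normal(size=(L, d)), rng.normal(size=(L, d))
omega = rng.normal(size=(d, m))
approx = iap_scores(Q, K, omega)
exact = np.ones(L) @ np.exp(Q @ K.T / np.sqrt(d))
print(np.corrcoef(approx, exact)[0, 1])   # should be close to 1 for moderate m
```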
The most straightforward approach to approximating the softmax-kernel $\mathrm{SM}(x, y)$ is to use trigonometric features and consequently the estimator $\widehat{\mathrm{K}}^{m}_{\mathrm{trig}}(x, y) = \phi^{m}_{\mathrm{trig}}(x)^{\top}\phi^{m}_{\mathrm{trig}}(y)$ for $\phi^{m}_{\mathrm{trig}}$ defined as:

$$\phi^{m}_{\mathrm{trig}}(z) = \frac{\Lambda^{-1}}{\sqrt{m}}\big(\sin(\omega_i^{\top}z), \cos(\omega_i^{\top}z)\big)^{\top}_{i=1,\ldots,m}$$

for iid $\omega_i \sim \mathcal{N}(0, I_d)$. As explained in (Choromanski et al., 2021), for inputs of similar length the estimator $\widehat{\mathrm{K}}^{m}_{\mathrm{trig}}(x, y)$ is characterized by lower variance when the approximated softmax-kernel values are larger (this is best illustrated when $\|x\| = \|y\|$ and the angle $\theta_{x,y}$ between $x$ and $y$ satisfies $\theta_{x,y} = 0$, where the variance is zero) and larger variance when they are smaller. This makes the mechanism unsuitable for approximating attention if the attention matrix needs to be row-normalized (which is the case in the standard supervised setting for Transformers), since the renormalizers might be very poorly approximated if they are given as sums containing many small attention values. On the other hand, the estimator $\widehat{\mathrm{K}}^{m}_{\exp}(x, y)$ has variance going to zero as the approximated values go to zero, since the corresponding mapping $\phi^{m}_{\exp}$ has nonnegative entries.

Since our proposed algorithm does not conduct row-normalization of the attention matrix (we show in Section 5 that we do not need it for RL applications), the question arises whether we can take the best of both worlds. We propose an unbiased hybrid estimator $\widehat{\mathrm{K}}_{\mathrm{hyb}}(x, y)$ of the softmax-kernel attention, given as:

$$\widehat{\mathrm{K}}_{\mathrm{hyb}}(x, y) = \frac{\widehat{\theta}_{x,y}}{\pi}\widehat{\mathrm{K}}^{m}_{\exp}(x, y) + \Big(1 - \frac{\widehat{\theta}_{x,y}}{\pi}\Big)\widehat{\mathrm{K}}^{m}_{\mathrm{trig}}(x, y), \quad (7)$$

where $\widehat{\theta}_{x,y}$ is an unbiased estimator of $\theta_{x,y}$, constructed independently from $\widehat{\mathrm{K}}^{m}_{\exp}(x, y)$ and $\widehat{\mathrm{K}}^{m}_{\mathrm{trig}}(x, y)$, and furthermore the two latter estimators rely on the same set of Gaussian samples $\{\omega_1, \ldots, \omega_m\}$. In addition, we constrain $\widehat{\theta}_{x,y}$ to satisfy $\mathrm{Var}(\widehat{\theta}_{x,y}) = 0$ if $\theta_{x,y} = 0$ or $\theta_{x,y} = \pi$.

The estimator $\widehat{\mathrm{K}}_{\mathrm{hyb}}(x, y)$ becomes $\widehat{\mathrm{K}}_{\exp}(x, y)$ for $\theta_{x,y} = \pi$ and $\widehat{\mathrm{K}}_{\mathrm{trig}}(x, y)$ for $\theta_{x,y} = 0$, which means that its variance approaches zero for both $\theta_{x,y} \rightarrow 0$ and $\theta_{x,y} \rightarrow \pi$ (for inputs of the same $L_2$-norm).
Figure 3.
Mean squared errors for the three unbiased softmax-kernel estimators discussed in the paper (from left to right): $\widehat{\mathrm{K}}^{m}_{\mathrm{trig}}$, $\widehat{\mathrm{K}}^{m}_{\exp}$ and $\widehat{\mathrm{K}}^{m,r}_{\mathrm{hyb}}$ for $m = 10$, $r = 5$ (the values used in our experiments, see Sec. 5). MSEs are given as functions of two variables: the angle $\theta_{x,y}$ between $x$ and $y$, and the inputs' length $R$ (symmetrized along the length axis and with $\theta_{x,y} \in [0, \pi]$). For each plot, we mark in grey its slice for a fixed $R$. Those slices show the key differences between these estimators. The MSE of $\widehat{\mathrm{K}}^{m}_{\mathrm{trig}}$ goes to zero as $\theta_{x,y}$ goes to zero. The MSE of $\widehat{\mathrm{K}}^{m}_{\exp}$ goes to zero as $\theta_{x,y}$ goes to $\pi$. The MSE of the hybrid estimator goes to zero for both $\theta_{x,y} \rightarrow 0$ and $\theta_{x,y} \rightarrow \pi$.

The key observation is that such an estimator, expressed as $\widehat{\mathrm{K}}^{m,r}_{\mathrm{hyb}}(x, y) = \phi^{m,r}_{\mathrm{hyb}}(x)^{\top}\phi^{m,r}_{\mathrm{hyb}}(y)$ for a finite-dimensional mapping $\phi^{m,r}_{\mathrm{hyb}}$, can indeed be constructed. The mapping $\phi^{m,r}_{\mathrm{hyb}}$ is given as:

$$\phi^{m,r}_{\mathrm{hyb}}(z) = \frac{1}{\sqrt{2}}\Psi\big(\phi^{m}_{\mathrm{trig}}(z), \phi^{m}_{\exp}(z), \alpha^{\mathrm{sgn}}_{\mathrm{trig}}(z), \beta^{\mathrm{sgn}}_{\exp}(z)\big), \quad (8)$$

where:

$$\alpha^{\mathrm{sgn}}_{\mathrm{trig}}(z) = \frac{\Lambda^{-1}}{\sqrt{mr}}\big(\sin(\omega_i^{\top}z)\mathrm{sgn}(\xi_j^{\top}z), \cos(\omega_i^{\top}z)\mathrm{sgn}(\xi_j^{\top}z)\big)^{j=1,\ldots,r}_{i=1,\ldots,m}, \qquad \beta^{\mathrm{sgn}}_{\exp}(z) = \frac{\sqrt{-1}\,\Lambda}{\sqrt{mr}}\big(\exp(\omega_i^{\top}z)\mathrm{sgn}(\xi_j^{\top}z)\big)^{j=1,\ldots,r}_{i=1,\ldots,m}, \quad (9)$$

and $\Psi$ stands for the horizontal concatenation operation, $\mathrm{sgn}$ is the sign mapping, and $(\omega_1, \ldots, \omega_m)$ and $(\xi_1, \ldots, \xi_r)$ are two independent ensembles of random Gaussian samples. The following is true:

Theorem 4.1 (MSE of the hybrid estimator). Let $x, y \in \mathbb{R}^d$. Then $\widehat{\mathrm{K}}^{m,r}_{\mathrm{hyb}}(x, y)$ satisfies the formula from Eq. 7 (thus in particular it is unbiased) and, furthermore, its mean squared error (MSE) satisfies:

$$\mathrm{MSE}(\widehat{\mathrm{K}}^{m,r}_{\mathrm{hyb}}(x, y)) = \frac{\theta^2_{x,y}}{\pi^2}\mathrm{MSE}(\widehat{\mathrm{K}}^{m}_{\exp}(x, y)) + \Big(1 - \frac{\theta_{x,y}}{\pi}\Big)^2\mathrm{MSE}(\widehat{\mathrm{K}}^{m}_{\mathrm{trig}}(x, y)) + \frac{\theta_{x,y}}{\pi}\Big(1 - \frac{\theta_{x,y}}{\pi}\Big)\Big(\frac{\mathrm{MSE}(\widehat{\mathrm{K}}^{m}_{\exp}(x, y)) + \mathrm{MSE}(\widehat{\mathrm{K}}^{m}_{\mathrm{trig}}(x, y))}{r} - \frac{2(r-1)}{rm}\mathrm{SM}^2(x, y)\big(1 - \cos(\|x\|^2 - \|y\|^2)\big)\Big), \quad (10)$$

where $\mathrm{MSE}(\widehat{\mathrm{K}}^{m}_{\mathrm{trig}}(x, y)) = \frac{\mathrm{SM}^{-2}(x, y)}{2m}\delta(x+y)\rho^2(x-y)$ and $\mathrm{MSE}(\widehat{\mathrm{K}}^{m}_{\exp}(x, y)) = \frac{\mathrm{SM}^{2}(x, y)}{m}\delta(x+y)\rho(x+y)$ for $\rho(z) = 1 - \exp(-\|z\|^2)$ and $\delta(z) = \exp(\|z\|^2)$.

Figure 4.
Slices of the 3d-plots of MSEs from Fig. 3 for an extended angle axis ($\theta_{x,y} \in (-\infty, \infty)$). We see that the MSE of the hybrid estimator is better bounded than those of $\widehat{\mathrm{K}}_{\mathrm{trig}}$ and $\widehat{\mathrm{K}}_{\exp}$. Furthermore, it vanishes in the places where the other two do, namely for $\theta_{x,y} \in \{0, \pi, -\pi, 2\pi, -2\pi, \ldots\}$.

The estimator $\widehat{\mathrm{K}}_{\mathrm{hyb}}$ is more accurate than both $\widehat{\mathrm{K}}_{\mathrm{trig}}$ and $\widehat{\mathrm{K}}_{\exp}$, since the hybrid feature map mechanism better controls its variance, in particular making the MSE vanish for both corner cases $\theta_{x,y} = 0$ and $\theta_{x,y} = \pi$ (for same-length inputs), see Fig. 3, 4. Furthermore, and critically from the practical point of view, since it can be efficiently expressed as a dot-product of finite-dimensional randomized vectors, it admits the decomposition from Sec. 3. Consequently, it can be directly used to provide an estimation of the attention mechanism from Sec. 4 with space and time complexity linear in the number of patches $L$.

Sketch of the proof:
The full proof is given in the Appendix (Sec. A.3). It relies in particular on: (1) the fact that the angular kernel $\mathrm{K}_{\mathrm{ang}}(x, y) = 1 - \frac{2\theta_{x,y}}{\pi}$ (quantifying the relative importance of the two estimators combined in the hybrid method) can be rewritten as $\mathrm{K}_{\mathrm{ang}}(x, y) = \mathbb{E}[\mathrm{sgn}(x^{\top}\omega)\mathrm{sgn}(y^{\top}\omega)]$ for $\omega \sim \mathcal{N}(0, I_d)$ (see Fig. 5 for an explanation of why this is true), and (2) the composite random feature mechanism for the product of two kernels, each equipped with its own random feature map. The vanishing variance of $\widehat{\theta}_{x,y}$ for $\theta_{x,y} \in \{0, \pi\}$ is implied by the fact that the estimator $\widehat{\mathrm{K}}_{\mathrm{ang}}(x, y)$ based on sgn-features is deterministic for these two corner cases and thus exact.
Figure 5. Visualization of the random feature mechanism for the angular kernel in 3d-space. For the Gaussian vector $\omega$, the expression $\mathrm{sgn}(\omega^{\top}x)\mathrm{sgn}(\omega^{\top}y)$ is negative iff its projection $\omega_{\perp}$ onto the linear span of $\{x, y\}$ lies in one of the two light-blue cones obtained by rotating the green ones by $\frac{\pi}{2}$. Since the distribution of the angle $\alpha$ that $\omega_{\perp}$ forms with one of the coordinate axes $\widehat{x}$ is $\mathrm{Unif}(0, 2\pi)$, that event happens with probability $p = \frac{2\theta}{2\pi} = \frac{\theta}{\pi}$. Thus the expected value of the expression is $(-1)\cdot\frac{\theta}{\pi} + 1\cdot(1 - \frac{\theta}{\pi})$, which is exactly the value of the angular kernel.
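To make the hybrid construction concrete, the following is a minimal sketch of the feature map $\phi^{m,r}_{\mathrm{hyb}}$ from Equations 8-9, together with a Monte Carlo check that the resulting estimator is centered at $\mathrm{SM}(x, y) = \exp(x^{\top}y)$. The i.i.d. Gaussian samples (instead of block-orthogonal ensembles), the complex-valued representation of the last feature block, and the test dimensions are simplifying assumptions of this sketch.

```python
import numpy as np

def hybrid_features(z, omega, xi):
    """phi_hyb^{m,r}(z): trig, positive and sign-modulated feature blocks (Eqs. 8-9).

    The last block carries the sqrt(-1) factor, so the bilinear form phi(x)^T phi(y)
    (without conjugation) reproduces the hybrid estimator of Eq. 7.
    """
    m, r = omega.shape[1], xi.shape[1]
    lam = np.exp(-0.5 * np.dot(z, z))                  # Lambda = exp(-||z||^2 / 2)
    wz, signs = omega.T @ z, np.sign(xi.T @ z)         # shapes (m,) and (r,)
    trig = np.concatenate([np.sin(wz), np.cos(wz)]) / (lam * np.sqrt(m))
    expf = lam * np.exp(wz) / np.sqrt(m)
    alpha = np.outer(np.concatenate([np.sin(wz), np.cos(wz)]), signs).ravel() / (lam * np.sqrt(m * r))
    beta = 1j * lam * np.outer(np.exp(wz), signs).ravel() / np.sqrt(m * r)
    return np.concatenate([trig, expf, alpha, beta]) / np.sqrt(2)

def hybrid_estimate(x, y, rng, d, m, r):
    """One draw of the hybrid softmax-kernel estimator, sharing omega/xi between x and y."""
    omega, xi = rng.normal(size=(d, m)), rng.normal(size=(d, r))
    return np.real(hybrid_features(x, omega, xi) @ hybrid_features(y, omega, xi))

# Monte Carlo sanity check of unbiasedness on small random inputs.
rng = np.random.default_rng(0)
d, m, r, trials = 4, 10, 5, 20000
x, y = 0.5 * rng.normal(size=d), 0.5 * rng.normal(size=d)
est = np.mean([hybrid_estimate(x, y, rng, d, m, r) for _ in range(trials)])
print(est, np.exp(x @ y))   # the two values should be close
```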
5. Experiments
In this section, we seek to test our hypothesis that efficient attention mechanisms can achieve strong accuracy in RL, matching their performance in the context of Transformers (Choromanski et al., 2021). We also aim to show that we can scale to significantly larger visual inputs, and use smaller patches, which would be prohibitively expensive with standard attention algorithms. Finally, we hypothesize that fewer, smaller patches will be particularly effective in preventing observational overfitting in the presence of distractions.

To test our hypotheses, we conduct a series of experiments, beginning with a challenging large-scale vision task with distractions, where attending to the correct regions of the observation is critical. We finish with difficult simulated robotics environments, where an agent must navigate several obstacles. We use two kernel-attention mechanisms for IAP:
the ReLU-based one from (Choromanski et al., 2021) and the hybrid method introduced here. The former applies deterministic kernel features and the latter randomized ones. Controllers are trained with ES methods (Salimans et al., 2017).
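For context, the sketch below shows the kind of antithetic ES update (in the spirit of Salimans et al., 2017) that can be used to train such controllers; the population size, noise scale, learning rate and the toy objective are illustrative assumptions rather than the exact hyperparameters used in these experiments.

```python
import numpy as np

def es_step(theta, objective, rng, pop_size=32, sigma=0.1, lr=0.02):
    """One antithetic evolution-strategies update of the policy parameters theta."""
    eps = rng.normal(size=(pop_size, theta.size))            # perturbation directions
    rewards = np.array([objective(theta + sigma * e) - objective(theta - sigma * e) for e in eps])
    grad = (rewards[:, None] * eps).mean(axis=0) / (2 * sigma)  # ES gradient estimate
    return theta + lr * grad

# Toy illustration: maximize a concave objective (stand-in for episode reward under the policy).
rng = np.random.default_rng(0)
theta = rng.normal(size=8)
objective = lambda p: -np.sum((p - 1.0) ** 2)
for _ in range(200):
    theta = es_step(theta, objective, rng)
print(np.round(theta, 2))   # parameters should approach the optimum at 1.0
```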
We first discuss the question of the sensitivity of our method to the number of random features. There is a trade-off between speed and accuracy: as we reduce the number of random features, the inference time decreases, but accuracy may decline. To test this, we use the default Cheetah-Run environment from the DM Control Suite (Tassa et al., 2018), with observations resized to (100 x 100), similar to the (96 x 96) sizes used for
CarRacing and
DoomTakeCover in (Tang et al., 2020). We use fixed-size patches and select the top-scoring patches. Results are in Fig. 6. Different variants of the number of random features are encoded as pairs (m, r).
Figure 6.
Cheetah-Run ablations. Left: inference time for forward passes with different attention mechanisms. Right: mean reward curves; shaded areas correspond to ± std.

As we see, ReLU is the fastest IAP approach, while there is an increase in inference time as we increase the number of random features. However, all IAP approaches are significantly faster than brute force (brown). In terms of performance, we see the best results for (10, 5), which we hypothesize is due to it trading off accuracy and exploration in an effective manner for this task. Given that (10, 5) also appears to gain most of the speed benefits, we use this setting for our other experiments involving hybrid softmax.

We then apply our method to a modified version of the DM Control Suite termed the
Distracting Control Suite (Stone et al., 2021), where the background of the normal DM Control Suite's observations is replaced with random images and the scene is viewed through random camera angles, as shown in Fig. 12 in the Appendix.
Table 1.
We use the static setting on the medium difficulty benchmark found in (Stone et al., 2021). We include reported results from the paper for SAC and QT-Opt. For IAP, we report the final reward for the fastest convergent method.
Environment          IAP    SAC    QT-Opt
Cheetah-Run                 77     74
Walker-Walk                 24     111
CartPole-Swingup            196    167
Ball-In-Cup Catch           109    62
Reacher-Easy                75     109
By default in this benchmark, the native images are of size (240 x 320), substantially larger than the (96 x 96) used in (Tang et al., 2020), and given that we may also use smaller patch sizes (e.g. size 2 vs the default 7 in (Tang et al., 2020)), this new benchmark leads to a significantly longer maximum sequence length L (19200 vs 529) for the attention component. In addition, given the particularly small stick-like appearances of most of the agents, a higher percentage of image patches will contain irrelevant background observations that can cause observational overfitting (Song et al., 2020), making this task more difficult for vision-based policies.

Our experimental results on the Distracting Control Suite show that more fine-grained patches (lower patch size) with fewer selected patches (lower l) improve performance (Fig. 7). Interestingly, this is contrary to the results found in (Tang et al., 2020), which showed that for CarRacing with YouTube/Noisy backgrounds, decreasing l reduces performance as the agent attends to noisier patches. We hypothesize this could be due to many potential reasons (higher parameter count from ES, different benchmarks, bottleneck effects, etc.), but we leave this investigation to future work.

Figure 7.
We performed a grid-search sweep over patch sizes, embedding dimensions, and the number of selected patches l. We see that, generally, smaller patch sizes with lower l improve performance.

Figure 8.
We see that the IAP-Hybrid method is competitive with or outperforms the IAP-ReLU variant. Both are significantly faster than the brute-force attention approach.
We thus use patch sizes of 2 with l = 10 patches and compare the performance of regular "brute force" softmax, IAP with ReLU features, and IAP with hybrid softmax, in terms of wall-clock time. For the hybrid setting, as discussed in Subsection 5.1, we use the (10, 5)-feature combination, which is significantly fewer than the 200-300 features typically used in the supervised Transformer setting (Choromanski et al., 2021), yet achieves competitive results in the RL setting. Furthermore, we compare our algorithm with standard ConvNets trained with SAC (Haarnoja et al., 2018) and QT-Opt (Kalashnikov et al., 2018b) in Table 1 and find that we are consistently competitive with or outperform those methods.

We use a simulated quadruped robot for our experiments. This robot has 12 degrees of freedom (3 per leg). Our locomotion task is set up in an obstacle course environment. In this environment, the robot starts from the origin on a raised platform and a series of walls lies ahead of it. The robot can observe the environment through a first-person RGB camera view, looking straight ahead. To accomplish the task, it needs to learn to steer in order to avoid collisions with the walls and falling off the edge. The reward function is specified as the capped (v_cap) velocity of the robot along the x direction (see Section A.2).

Policy details and training setup:
We train our IAP policies to solve this robotics task and compare performance against traditional CNN policies. Given the complexity of the task, we use the hierarchical policy structure introduced in (Jain et al., 2019). In this setup, the policy is split into two hierarchical levels: a high level and a low level. The high level processes the camera observations from the environment and outputs a latent command vector which is fed into the low level. The high level also outputs a scalar duration for which its execution is stopped, while the low level runs at every control timestep. The low level is a linear neural network which controls the robot leg movements.
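The control flow of this hierarchy can be sketched as follows; the latent-command size, duration range, observation dimensions and the simple feed-forward stand-ins for the actual IAP (or CNN) high level and PMTG low level are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(0)

LATENT_DIM = 8          # assumed latent command size (placeholder, not from the paper)
MAX_DURATION = 50       # assumed upper bound for the high-level duration, in control timesteps

def high_level(image, params):
    """Maps a camera observation to (duration, latent command); stands in for the IAP or CNN module."""
    features = np.tanh(params["W_feat"] @ image.ravel())
    out = np.clip(params["W_out"] @ features, -1.0, 1.0)        # output in [-1, 1]
    duration = int(1 + (out[0] + 1.0) / 2.0 * (MAX_DURATION - 1))  # first dimension -> duration scalar
    return duration, out[1:]                                       # remaining dimensions -> latent command

def low_level(latent, proprioception, params):
    """Linear low-level controller producing motor commands at every control timestep."""
    return params["W_ll"] @ np.concatenate([latent, proprioception])

# Illustrative rollout skeleton: the high level is only re-queried after its chosen duration elapses.
params = {"W_feat": rng.normal(size=(32, 64 * 64)),
          "W_out": rng.normal(size=(1 + LATENT_DIM, 32)),
          "W_ll": rng.normal(size=(12, LATENT_DIM + 24))}
steps_left, latent = 0, np.zeros(LATENT_DIM)
for t in range(200):
    if steps_left == 0:
        steps_left, latent = high_level(rng.random((64, 64)), params)
    motor_commands = low_level(latent, rng.normal(size=24), params)
    steps_left -= 1
```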
Table 2.
Ablation with number of patches and stride length.
Patch Size    Stride Length    Maximum Reward
1             1
In the CNN variant, the high level contains a CNN that receives the RGB camera input. It has convolutional layers followed by a pooling layer. The output from the pooling layer is flattened and transformed into a low-dimensional feature vector through a fully-connected layer with tanh activation. It is then fed into a fully-connected layer to produce an output clipped between -1 and 1. The first dimension of the output vector corresponds to the HL duration scalar and the rest to the latent command. The duration is calculated by linearly scaling the output to a number of control timesteps. The IAP policy has the same specification, except that the CNN is replaced with an attention module in the high level. For this task, we have used deterministic ReLU features.
Figure 9. Navigating Gibson environments with IAP policies. This navigation environment has realistic visuals which the robot observes with a front depth camera view. Top patches are selected by self-attention. The input depth camera image is shown in the top left corner of each frame. The red area in the camera view corresponds to the selected patches. The robot successfully passes through a narrow gate with the help of vision while navigating the environment. As for the previous figure, the policy is highly interpretable.
Figure 10. Visualization of IAP policies with the smaller patch size (top row) and the larger patch size (bottom row). A series of image frames along the episode are shown. In the top-left corner of each image, the input camera image is attached. The red part of the camera image is the area selected by self-attention. For the smaller patch size, we can see that the policy finely detects the boundaries of the obstacles, which helps in navigation. For the larger patch size, only a single patch is selected, which covers one fourth of the whole camera image. The policy identifies the general walking direction, but fine-grained visual information is lost.

Comparison with CNN:
Training curves for the CNN policy and the IAP policy are shown in Figure 11. We observe similar task performance for both types of policies. However, the CNN policy required roughly an order of magnitude more parameters than the IAP policy.
Figure 11.
Comparison between IAP and CNN policies on locomotion tasks: both methods show similar performance.
Ablation on patch sizes and stride lengths:
We trained the IAP policy with different values of the patch size and stride length (defining the translation from one patch to the next) used to encode the input image into the patches processed by the self-attention module. The comparative performance of different combinations is shown in Table 2. The best maximum episode return is achieved by patch size 1 and stride length 1, the setting corresponding to the largest number of patches. For a qualitative assessment, we have added a visualization of the policies with the two patch sizes in Figure 10.

IAP locomotion policies for photo-realistic Gibson environments:
Finally, we trained interpretable IAP policies from scratch for locomotion and navigation in simulated 3D spaces with realistic visuals from the Gibson dataset (Xia et al., 2018). A visualization of a learned policy is shown in Figure 9. Corresponding videos can be viewed at https://sites.google.com/view/implicitattention.
6. Conclusion
In this paper, we significantly expanded the capabilities of methods using self-attention bottlenecks in RL. We are the first to show that efficient attention mechanisms, which have recently demonstrated impressive results for Transformers, can be used for RL policies, in what we call Implicit Attention for Pixels, or IAP. While IAP can work with existing kernel features, we also proposed a new robust algorithm for estimating softmax-kernels that is of interest on its own, with strong theoretical results. In a series of experiments, we showed that IAP scales to higher-resolution images and emulates much finer-grained attention than what was previously possible, improving generalization in challenging vision-based RL tasks such as quadruped locomotion with obstacles and the recently introduced Distracting Control Suite.

References
Bellemare, M., Candido, S., Castro, P., Gong, J., Machado, M., Moitra, S., Ponda, S., and Wang, Z. (2020). Autonomous navigation of stratospheric balloons using reinforcement learning. Nature, 588:77–82.

Bello, I. (2021). Lambdanetworks: Modeling long-range interactions without attention. CoRR, abs/2102.08602.

Bengio, Y., Léonard, N., and Courville, A. (2013). Estimating or propagating gradients through stochastic neurons for conditional computation.

Blanc, G., Mezouar, Y., and Martinet, P. (2005). Indoor navigation of a wheeled mobile robot along visual routes. In Proceedings of the 2005 IEEE International Conference on Robotics and Automation, pages 3354–3359. IEEE.

Choromanski, K., Likhosherstov, V., Dohan, D., Song, X., Gane, A., Sarlós, T., Hawkins, P., Davis, J., Mohiuddin, A., Kaiser, L., Belanger, D., Colwell, L., and Weller, A. (2021). Rethinking attention with performers. In International Conference on Learning Representations.

Choromanski, K., Rowland, M., Chen, W., and Weller, A. (2019). Unifying orthogonal Monte Carlo methods. In Chaudhuri, K. and Salakhutdinov, R., editors, Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pages 1203–1212. PMLR.

Choromanski, K., Rowland, M., Sarlós, T., Sindhwani, V., Turner, R. E., and Weller, A. (2018). The geometry of random features. In Storkey, A. J. and Pérez-Cruz, F., editors, International Conference on Artificial Intelligence and Statistics, AISTATS 2018, 9-11 April 2018, Playa Blanca, Lanzarote, Canary Islands, Spain, volume 84 of Proceedings of Machine Learning Research, pages 1–9. PMLR.

Choromanski, K. M., Rowland, M., and Weller, A. (2017). The unreasonable effectiveness of structured random orthogonal embeddings. In Guyon, I., von Luxburg, U., Bengio, S., Wallach, H. M., Fergus, R., Vishwanathan, S. V. N., and Garnett, R., editors, Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 219–228.

Coumans, E. (2013). Bullet Physics SDK. https://github.com/bulletphysics/bullet3.

Goemans, M. X. and Williamson, D. P. (2001). Approximation algorithms for MAX-3-CUT and other problems via complex semidefinite programming. In Vitter, J. S., Spirakis, P. G., and Yannakakis, M., editors, Proceedings on 33rd Annual ACM Symposium on Theory of Computing, July 6-8, 2001, Heraklion, Crete, Greece, pages 443–452. ACM.

Haarnoja, T., Zhou, A., Hartikainen, K., Tucker, G., Ha, S., Tan, J., Kumar, V., Zhu, H., Gupta, A., Abbeel, P., and Levine, S. (2018). Soft actor-critic algorithms and applications. CoRR, abs/1812.05905.

Hafner, D., Lillicrap, T., Ba, J., and Norouzi, M. (2020). Dream to control: Learning behaviors by latent imagination. In International Conference on Learning Representations.

Hafner, D., Lillicrap, T., Fischer, I., Villegas, R., Ha, D., Lee, H., and Davidson, J. (2019). Learning latent dynamics for planning from pixels. In Proceedings of the 36th International Conference on Machine Learning, pages 2555–2565.

He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In , pages 770–778. IEEE Computer Society.

Igl, M., Ciosek, K., Li, Y., Tschiatschek, S., Zhang, C., Devlin, S., and Hofmann, K. (2019). Generalization in reinforcement learning with selective noise injection and information bottleneck. In Advances in Neural Information Processing Systems 32.

Iscen, A., Caluwaerts, K., Tan, J., Zhang, T., Coumans, E., Sindhwani, V., and Vanhoucke, V. (2018). Policies modulating trajectory generators. In CoRL, pages 916–926.

Jain, D., Iscen, A., and Caluwaerts, K. (2019). Hierarchical reinforcement learning for quadruped locomotion. IROS, pages 7551–7557.

Kalashnikov, D., Irpan, A., Pastor, P., Ibarz, J., Herzog, A., Jang, E., Quillen, D., Holly, E., Kalakrishnan, M., Vanhoucke, V., et al. (2018a). Qt-opt: Scalable deep reinforcement learning for vision-based robotic manipulation. arXiv preprint arXiv:1806.10293.

Kalashnikov, D., Irpan, A., Pastor, P., Ibarz, J., Herzog, A., Jang, E., Quillen, D., Holly, E., Kalakrishnan, M., Vanhoucke, V., and Levine, S. (2018b). Scalable deep reinforcement learning for vision-based robotic manipulation. In Proceedings of The 2nd Conference on Robot Learning, pages 651–673.

Katz, B., Carlo, J. D., and Kim, S. (2019). Mini Cheetah: A platform for pushing the limits of dynamic quadruped control. In , pages 6295–6301.

Kitaev, N., Kaiser, L., and Levskaya, A. (2020). Reformer: The efficient transformer. In International Conference on Learning Representations.

Kostrikov, I., Yarats, D., and Fergus, R. (2021). Image augmentation is all you need: Regularizing deep reinforcement learning from pixels. In International Conference on Learning Representations.

Kulhánek, J., Derner, E., de Bruin, T., and Babuska, R. (2019). Vision-based navigation using deep reinforcement learning. In , pages 1–8. IEEE.

Laskin, M., Lee, K., Stooke, A., Pinto, L., Abbeel, P., and Srinivas, A. (2020a). Reinforcement learning with augmented data. In Advances in Neural Information Processing Systems 33.

Laskin, M., Srinivas, A., and Abbeel, P. (2020b). CURL: Contrastive unsupervised representations for reinforcement learning. In Proceedings of the 37th International Conference on Machine Learning.

Lee, A. X., Nagabandi, A., Abbeel, P., and Levine, S. (2020). Stochastic latent actor-critic: Deep reinforcement learning with a latent variable model. In Neural Information Processing Systems (NeurIPS).

Levine, S., Finn, C., Darrell, T., and Abbeel, P. (2016). End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334–1373.

Li, C., Xia, F., Martin, R. M., and Savarese, S. (2019). HRL4IN: Hierarchical reinforcement learning for interactive navigation with mobile manipulators. In CoRL.

Lin, H., Chen, H., Choromanski, K. M., Zhang, T., and Laroche, C. (2020). Demystifying orthogonal Monte Carlo and beyond. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H., editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.

Mnih, V., Heess, N., Graves, A., and Kavukcuoglu, K. (2014). Recurrent models of visual attention. In Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N., and Weinberger, K. Q., editors, Advances in Neural Information Processing Systems, volume 27, pages 2204–2212. Curran Associates, Inc.

OpenAI, Akkaya, I., Andrychowicz, M., Chociej, M., Litwin, M., McGrew, B., Petron, A., Paino, A., Plappert, M., Powell, G., Ribas, R., Schneider, J., Tezak, N., Tworek, J., Welinder, P., Weng, L., Yuan, Q., Zaremba, W., and Zhang, L. (2019). Solving Rubik's Cube with a robot hand. CoRR, abs/1910.07113.

Pan, X., Zhang, T., Ichter, B., Faust, A., Tan, J., and Ha, S. (2019). Zero-shot imitation learning from demonstrations for legged robot visual navigation. ArXiv, abs/1909.12971.

Peng, H., Pappas, N., Yogatama, D., Schwartz, R., Smith, N., and Kong, L. (2021). Random feature attention. In International Conference on Learning Representations.

Rahimi, A. and Recht, B. (2007). Random features for large-scale kernel machines. In Platt, J. C., Koller, D., Singer, Y., and Roweis, S. T., editors, Advances in Neural Information Processing Systems 20, Proceedings of the Twenty-First Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 3-6, 2007, pages 1177–1184. Curran Associates, Inc.

Salimans, T., Ho, J., Chen, X., Sidor, S., and Sutskever, I. (2017). Evolution strategies as a scalable alternative to reinforcement learning. CoRR, abs/1703.03864.

Shamir, O., Sabato, S., and Tishby, N. (2010). Learning and generalization with the information bottleneck. Theoretical Computer Science, 411(29):2696–2711.

Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T. P., Leach, M., Kavukcuoglu, K., Graepel, T., and Hassabis, D. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529:484–489.

Song, X., Jiang, Y., Tu, S., Du, Y., and Neyshabur, B. (2020). Observational overfitting in reinforcement learning. In International Conference on Learning Representations.

Stone, A., Ramirez, O., Konolige, K., and Jonschkowski, R. (2021). The distracting control suite: A challenging benchmark for reinforcement learning from pixels.

Sutton, R. S. and Barto, A. G. (1998). Introduction to Reinforcement Learning. MIT Press, Cambridge, MA, USA, 1st edition.

Tang, Y., Nguyen, D., and Ha, D. (2020). Neuroevolution of self-interpretable agents. In Coello, C. A. C., editor, GECCO '20: Genetic and Evolutionary Computation Conference, Cancún, Mexico, July 8-12, 2020, pages 414–424. ACM.

Tassa, Y., Doron, Y., Muldal, A., Erez, T., Li, Y., de Las Casas, D., Budden, D., Abdolmaleki, A., Merel, J., Lefrancq, A., Lillicrap, T. P., and Riedmiller, M. A. (2018). Deepmind control suite. CoRR, abs/1801.00690.

Tay, Y., Dehghani, M., Abnar, S., Shen, Y., Bahri, D., Pham, P., Rao, J., Yang, L., Ruder, S., and Metzler, D. (2021). Long range arena: A benchmark for efficient transformers. In International Conference on Learning Representations.

Tishby, N. and Zaslavsky, N. (2015). Deep learning and the information bottleneck principle. In .

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. (2017a). Attention is all you need. In Advances in Neural Information Processing Systems.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. (2017b). Attention is all you need. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R., editors, Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.

Wang, S., Li, B. Z., Khabsa, M., Fang, H., and Ma, H. (2020). Linformer: Self-attention with linear complexity.

Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach. Learn., 8(3-4):229–256.

Wu, Y., Wu, Y., Gkioxari, G., and Tian, Y. (2018). Building generalizable agents with a realistic and rich 3d environment. In . OpenReview.net.

Xia, F., R. Zamir, A., He, Z., Sax, A., Malik, J., and Savarese, S. (2018). Gibson Env: Real-world perception for embodied agents. In Computer Vision and Pattern Recognition (CVPR), 2018 IEEE Conference on. IEEE.

Yahya, A., Li, A., Kalakrishnan, M., Chebotar, Y., and Levine, S. (2017). Collective robot reinforcement learning with distributed asynchronous guided policy search. In , pages 79–86. IEEE.

Yu, F. X., Suresh, A. T., Choromanski, K. M., Holtmann-Rice, D. N., and Kumar, S. (2016). Orthogonal random features. In Lee, D. D., Sugiyama, M., von Luxburg, U., Guyon, I., and Garnett, R., editors, Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 1975–1983.

Zhang, A., McAllister, R. T., Calandra, R., Gal, Y., and Levine, S. (2021). Invariant representations for reinforcement learning without reconstruction. In International Conference on Learning Representations.

A. APPENDIX: Unlocking Pixels for Reinforcement Learning via Implicit Attention
A.1. Extra Figures
Figure 12.
Examples of Distracting Control Suite (Stone et al., 2021) tasks with distractions in the background that need to be automatically filtered out to learn a successful controller. Image resolutions are substantially larger than for most other vision-based RL benchmarks considered before. Code can be found at https://github.com/google-research/google-research/tree/master/distracting_control.

A.2. Quadruped Locomotion Experiments
We provide here more details regarding the experimental setup for the quadruped locomotion tasks.

Our simulated robot is similar in size, actuator performance, and range of motion to the MIT Mini Cheetah (Katz et al., 2019) and Unitree A1 robots. Robot leg movements are generated using a trajectory generator, based on the Policies Modulating Trajectory Generators (PMTG) architecture, which has shown success at learning diverse primitive behaviors for quadruped robots (Iscen et al., 2018). The latent command from the high level, IMU sensor observations, motor angles, and the current PMTG state are fed to the low-level neural network, which outputs the residual motor commands and PMTG parameters at every timestep.

We use the Unitree A1's URDF description (https://github.com/unitreerobotics), which is available in the PyBullet simulator (Coumans, 2013). The swing and extension of each leg is controlled by a PD position controller.

The reward function is specified as the capped ($v_{\mathrm{cap}}$) velocity of the robot along the x direction:

$$f_{v_{\mathrm{cap}}}(r) = \max(-v_{\mathrm{cap}}, \min(r, v_{\mathrm{cap}})) \quad (11)$$

$$r_{cc}(t) = f_{v_{\mathrm{cap}}}(x(t) - x(t-1)) \quad (12)$$
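A direct transcription of the capped-velocity reward of Equations 11-12 is sketched below; the cap value and the example displacement are arbitrary assumptions chosen purely for illustration.

```python
def capped(r, v_cap):
    """f_{v_cap}(r) = max(-v_cap, min(r, v_cap))  (Eq. 11)."""
    return max(-v_cap, min(r, v_cap))

def reward(x_t, x_prev, v_cap=0.5):
    """r_cc(t): forward displacement along x since the previous step, clipped to [-v_cap, v_cap]  (Eq. 12)."""
    return capped(x_t - x_prev, v_cap)

# Illustrative call: a 0.8 m forward jump in one step is credited only up to the cap.
print(reward(1.8, 1.0))   # 0.5
```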
A.3. Proof of Theorem 4.1

Proof.
We will rely on the formulae proven in (Choromanski et al., 2021):
$$\mathrm{MSE}(\widehat{\mathrm{K}}^{m}_{\mathrm{trig}}(x, y)) = \frac{1}{2m}\exp(\|x+y\|^2)\,\mathrm{SM}^{-2}(x, y)\big(1 - \exp(-\|x-y\|^2)\big)^2, \quad (13)$$

and

$$\mathrm{MSE}(\widehat{\mathrm{K}}^{m}_{\exp}(x, y)) = \frac{1}{m}\exp(\|x+y\|^2)\,\mathrm{SM}^{2}(x, y)\big(1 - \exp(-\|x+y\|^2)\big). \quad (14)$$
We have the following: Var( (cid:98) K hyb ( x , y )) = 14 Var (cid:16) (1 − (cid:98) K ang ( x , y )) (cid:98) K exp ( x , y ) (cid:17) + 14 Var (cid:16) (1 + (cid:98) K ang ( x , y )) (cid:98) K trig ( x , y ) (cid:17) +12 Cov (cid:16) (1 − (cid:98) K ang ( x , y )) (cid:98) K exp ( x , y ) , (1 + (cid:98) K ang ( x , y )) (cid:98) K trig ( x , y ) (cid:17) (20)The following is also true: Var (cid:16) (1 − (cid:98) K ang ( x , y )) (cid:98) K exp ( x , y ) (cid:17) = E (cid:104) (1 − (cid:98) K ang ( x , y )) (cid:98) K ( x , y ) (cid:105) − (cid:16) E (cid:104) (1 − (cid:98) K ang ( x , y )) (cid:98) K exp ( x , y ) (cid:105)(cid:17) = E (cid:104) (1 − (cid:98) K ang ( x , y )) (cid:105) E (cid:104) (cid:98) K ( x , y ) (cid:105) − (cid:16) E [(1 − (cid:98) K ang ( x , y ))] (cid:17) (cid:16) E [ (cid:98) K exp ( x , y )] (cid:17) , (21)where the last equality follows from the fact that (cid:98) K exp ( x , y ) and (cid:98) K ang ( x , y ) are independent. Therefore we have: Var (cid:16) (1 − (cid:98) K ang ( x , y )) (cid:98) K exp ( x , y ) (cid:17) = E (cid:104) (1 − (cid:98) K ang ( x , y )) (cid:105) E (cid:104) (cid:98) K ( x , y ) (cid:105) − θ π SM ( x , y ) (22)Furthermore, since: E [ (cid:98) K ( x , y )] = MSE( (cid:98) K exp ( x , y )) + SM ( x , y ) , (23) nlocking Pixels for Reinforcement Learning via Implicit Attention we obtain the following: Var (cid:16) (1 − (cid:98) K ang ( x , y )) (cid:98) K exp ( x , y ) (cid:17) = E (cid:104) (1 − (cid:98) K ang ( x , y )) (cid:105) (cid:16) MSE( (cid:98) K exp ( x , y )) + SM ( x , y ) (cid:17) − θ π SM ( x , y ) (24)Let us now focus on the expression E (cid:104) (1 − (cid:98) K ang ( x , y )) (cid:105) . We have the following: E (cid:104) (1 − (cid:98) K ang ( x , y )) (cid:105) = E [1 − (cid:98) K ang ( x , y )) + (cid:98) K ang ( x , y )) ] = 4 θπ − E [ (cid:98) K ( x , y )] (25)From the definition of the estimator of the angular kernel, we get: E [ (cid:98) K ( x , y )] = E (cid:20) ( X + ... + X r ) r (cid:21) = 1 r r (cid:88) i =1 E [ X i ] + 2 (cid:88) i