Is Space-Time Attention All You Need for Video Understanding?
Gedas Bertasius Heng Wang Lorenzo Torresani
Abstract
We present a convolution-free approach to video classification built exclusively on self-attention over space and time. Our method, named "TimeSformer," adapts the standard Transformer architecture to video by enabling spatiotemporal feature learning directly from a sequence of frame-level patches. Our experimental study compares different self-attention schemes and suggests that "divided attention," where temporal attention and spatial attention are separately applied within each block, leads to the best video classification accuracy among the design choices considered. Despite the radically different design compared to the prominent paradigm of 3D convolutional architectures for video, TimeSformer achieves state-of-the-art results on several major action recognition benchmarks, including the best reported accuracy on Kinetics-400 and Kinetics-600. Furthermore, our model is faster to train and has higher test-time efficiency compared to competing architectures. Code and pretrained models will be made publicly available.
1. Introduction
Over the last few years, the field of natural language processing (NLP) has been revolutionized by the emergence of methods based on self-attention (Vaswani et al., 2017a). Because of their excellent capabilities at capturing long-range dependencies among words as well as their training scalability, self-attention architectures, such as the Transformer model, represent the current state-of-the-art across a wide range of language tasks, including machine translation (Ott et al., 2018; Chen et al., 2018a), question answering (Devlin et al., 2019; Dai et al., 2019), and autoregressive word generation (Radford et al., 2019; Brown et al., 2020).

Video understanding shares several high-level similarities with NLP. First of all, videos and sentences are both fundamentally sequential. Furthermore, precisely as the meaning of a word can often be understood only by relating it to the other words in the sentence, it may be argued that atomic actions in short-term video segments need to be contextualized with the rest of the video in order to be fully disambiguated.

Facebook AI, Dartmouth College. Correspondence to: Gedas Bertasius.
2. Related Work
Our approach is influenced by recent works that use self-attention for still-image classification, either in combination with the convolution operator or even as a full replacement for it. Within the former class, Non-Local Networks (Wang et al., 2018b) employ a non-local mean that effectively generalizes the self-attention function of Transformers (Vaswani et al., 2017b). Bello et al. (2019) propose a 2D self-attention mechanism that is competitive as a replacement of 2D convolution but gives even stronger results when used to augment convolutional features with self-attention features. Beyond image categorization, Relation Networks (Hu et al., 2018) and DETR (Carion et al., 2020) use self-attention on top of convolutional feature maps for object detection.

Our method is more closely related to image networks leveraging self-attention as a substitute for convolution (Parmar et al., 2018; Ramachandran et al., 2019; Cordonnier et al., 2020; Zhao et al., 2020). Since these works use individual pixels as queries, in order to maintain a manageable computational cost and a small memory consumption, they must restrict the scope of self-attention to local neighborhoods or use global self-attention on heavily downsized versions of the image. Alternative strategies for scalability to full images include sparse key-value sampling (Child et al., 2019) or constraining the self-attention to be calculated along the spatial axes (Ho et al., 2019; Huang et al., 2019; Wang et al., 2020b). A few of the self-attention operators considered in our experiments adopt similar sparse and axial computation, although generalized to the spatiotemporal volume. However, the efficiency of our approach stems mainly from decomposing the video into a sequence of frame-level patches and then feeding linear embeddings of these patches as input token embeddings to a Transformer. This strategy was recently introduced in Vision Transformers (ViT) (Dosovitskiy et al., 2020), which were shown to deliver impressive performance on image categorization. In this work, we build on the ViT design and extend it to video by proposing and empirically comparing several scalable schemes for space-time self-attention over high-resolution and long videos.

While Transformers have been recently used for video generation (Weissenborn et al., 2020), we are not aware of prior video recognition architectures leveraging self-attention as the exclusive building block. However, we note that Transformers have been adopted on top of convolutional feature maps for action localization and recognition (Girdhar et al., 2019), video classification (Wang et al., 2018b; Chen et al., 2018c), and group activity recognition (Gavrilyuk et al., 2020). We also note that there is a wide literature based on the use of text Transformers combined with video CNNs to address various video-language tasks, such as captioning (Zhou et al., 2018), question-answering (Yang et al., 2020), and dialog (Le et al., 2019). Finally, multimodal video-text transformers (Sun et al., 2019; Li et al., 2020a) have also been trained or pretrained in unsupervised fashion by adopting masked-token pretext tasks adapted from the language domain (Devlin et al., 2018; Radford et al., 2018; Raffel et al., 2019).
3. The TimeSformer Model
Input clip.
The TimeSformer takes as input a clip $X \in \mathbb{R}^{H \times W \times 3 \times F}$ consisting of $F$ RGB frames of size $H \times W$ sampled from the original video.
Figure 1.
Illustration of the video self-attention blocks that we investigate in this work. Each attention layer implements self-attention (Vaswani et al., 2017b) on a specified spatiotemporal neighborhood of frame-level patches (see Figure 2 for a visualization of the neighborhoods). We use residual connections to aggregate information from different attention layers within each block. A 1-hidden-layer MLP is applied at the end of each block. The final model is constructed by repeatedly stacking these blocks on top of each other.
Decomposition into patches.
Following the ViT approach (Dosovitskiy et al., 2020), we decompose each frame into $N$ non-overlapping patches, each of size $P \times P$, such that the $N$ patches span the entire frame, i.e., $N = HW/P^2$. We flatten these patches into vectors $\mathbf{x}_{(p,t)} \in \mathbb{R}^{3P^2}$, with $p = 1, \ldots, N$ denoting spatial locations and $t = 1, \ldots, F$ representing an index over frames.

Linear embedding.
We linearly map each patch $\mathbf{x}_{(p,t)}$ into an embedding vector $\mathbf{z}^{(0)}_{(p,t)} \in \mathbb{R}^D$ by means of a learnable matrix $E \in \mathbb{R}^{D \times 3P^2}$:

$$\mathbf{z}^{(0)}_{(p,t)} = E\,\mathbf{x}_{(p,t)} + \mathbf{e}^{pos}_{(p,t)} \quad (1)$$

where $\mathbf{e}^{pos}_{(p,t)} \in \mathbb{R}^D$ represents a learnable positional embedding added to encode the spatiotemporal position of each patch. The resulting sequence of embedding vectors $\mathbf{z}^{(0)}_{(p,t)}$ for $p = 1, \ldots, N$ and $t = 1, \ldots, F$ represents the input to the Transformer, and plays a role similar to the sequences of embedded words/tokens that are fed to text Transformers in NLP. As in the original BERT Transformer (Devlin et al., 2018), we add in the first position of the sequence a special learnable vector $\mathbf{z}^{(0)}_{(0,0)} \in \mathbb{R}^D$ representing the embedding of the classification token.
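For concreteness, the patch decomposition and the linear embedding of Eq. 1 can be sketched in PyTorch as follows. This is a simplified illustration rather than our released implementation; the module name, the frame-major token ordering, and the Base-ViT default sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Decompose each frame into P x P patches and embed them linearly (Eq. 1)."""
    def __init__(self, img_size=224, patch_size=16, num_frames=8, dim=768):
        super().__init__()
        self.P = patch_size
        self.N = (img_size // patch_size) ** 2            # patches per frame
        self.proj = nn.Linear(3 * patch_size ** 2, dim)   # the matrix E of Eq. 1
        # learnable positional embeddings for every (patch, frame) location + CLS token
        self.pos_embed = nn.Parameter(torch.zeros(1, num_frames * self.N + 1, dim))
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, x):                                 # x: (B, 3, F, H, W)
        B, C, F, H, W = x.shape
        P = self.P
        # split H and W into non-overlapping P x P patches, flatten each to 3*P*P values
        x = x.unfold(3, P, P).unfold(4, P, P)             # (B, 3, F, H/P, W/P, P, P)
        x = x.permute(0, 2, 3, 4, 1, 5, 6).reshape(B, F * self.N, 3 * P * P)
        z = self.proj(x)                                  # frame-major tokens: (B, F*N, D)
        cls = self.cls_token.expand(B, -1, -1)
        return torch.cat([cls, z], dim=1) + self.pos_embed  # (B, 1 + F*N, D)

tokens = PatchEmbed()(torch.randn(2, 3, 8, 224, 224))     # -> (2, 1 + 8*196, 768)
```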
Query-Key-Value computation.

Our Transformer consists of $L$ encoding blocks. At each block $\ell$, a query/key/value vector is computed for each patch from the representation $\mathbf{z}^{(\ell-1)}_{(p,t)}$ encoded by the preceding block:

$$\mathbf{q}^{(\ell,a)}_{(p,t)} = W^{(\ell,a)}_Q \, \mathrm{LN}\!\left(\mathbf{z}^{(\ell-1)}_{(p,t)}\right) \in \mathbb{R}^{D_h} \quad (2)$$
$$\mathbf{k}^{(\ell,a)}_{(p,t)} = W^{(\ell,a)}_K \, \mathrm{LN}\!\left(\mathbf{z}^{(\ell-1)}_{(p,t)}\right) \in \mathbb{R}^{D_h} \quad (3)$$
$$\mathbf{v}^{(\ell,a)}_{(p,t)} = W^{(\ell,a)}_V \, \mathrm{LN}\!\left(\mathbf{z}^{(\ell-1)}_{(p,t)}\right) \in \mathbb{R}^{D_h} \quad (4)$$

where $\mathrm{LN}()$ denotes LayerNorm (Ba et al., 2016), $a = 1, \ldots, A$ is an index over multiple attention heads, and $A$ denotes the total number of attention heads. The latent dimensionality for each attention head is set to $D_h = D/A$.
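A minimal sketch of the per-head query/key/value computation of Eqs. 2-4 (all names are our own; for convenience the per-head projections are stacked into single linear layers):

```python
import torch
import torch.nn as nn

D, A = 768, 12
D_h = D // A                          # latent dimensionality per head

norm = nn.LayerNorm(D)                # LN() in Eqs. 2-4
W_q = nn.Linear(D, D, bias=False)     # stacks W_Q^(l,a) for all A heads
W_k = nn.Linear(D, D, bias=False)
W_v = nn.Linear(D, D, bias=False)

z = torch.randn(2, 1 + 8 * 196, D)    # token representations from the previous block
B, S, _ = z.shape
h = norm(z)
# q[b, a, s] is the D_h-dimensional query of head a for token s (same layout for k, v)
q = W_q(h).view(B, S, A, D_h).transpose(1, 2)   # (B, A, S, D_h)
k = W_k(h).view(B, S, A, D_h).transpose(1, 2)
v = W_v(h).view(B, S, A, D_h).transpose(1, 2)
```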
Self-attention computation.

Self-attention weights are computed via dot-product. The self-attention weights $\boldsymbol{\alpha}^{(\ell,a)}_{(p,t)} \in \mathbb{R}^{NF+1}$ for query patch $(p,t)$ are given by:

$$\boldsymbol{\alpha}^{(\ell,a)}_{(p,t)} = \mathrm{SM}\!\left(\frac{\mathbf{q}^{(\ell,a)\top}_{(p,t)}}{\sqrt{D_h}} \cdot \left[\mathbf{k}^{(\ell,a)}_{(0,0)} \;\; \left\{\mathbf{k}^{(\ell,a)}_{(p',t')}\right\}_{\substack{p'=1,\ldots,N \\ t'=1,\ldots,F}}\right]\right) \quad (5)$$

where $\mathrm{SM}$ denotes the softmax activation function. Note that when attention is computed over one dimension only (e.g., spatial-only or temporal-only), the computation is significantly reduced. For example, in the case of spatial attention, only $N + 1$ query-key comparisons are made, using exclusively keys from the same frame as the query:

$$\boldsymbol{\alpha}^{(\ell,a)\,\mathrm{space}}_{(p,t)} = \mathrm{SM}\!\left(\frac{\mathbf{q}^{(\ell,a)\top}_{(p,t)}}{\sqrt{D_h}} \cdot \left[\mathbf{k}^{(\ell,a)}_{(0,0)} \;\; \left\{\mathbf{k}^{(\ell,a)}_{(p',t)}\right\}_{p'=1,\ldots,N}\right]\right). \quad (6)$$
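The following sketch computes the attention weights of Eq. 5 for all queries at once, as well as the spatial-only variant of Eq. 6; it assumes the frame-major token ordering of the earlier sketch and, for brevity, omits the row corresponding to the classification-token query:

```python
import torch

def joint_space_time_attn(q, k):
    """Eq. 5: every patch attends to the CLS token and to all N*F patches."""
    scale = q.shape[-1] ** -0.5
    return (q @ k.transpose(-2, -1) * scale).softmax(dim=-1)   # (B, A, S, S)

def spatial_only_attn(q, k, N, F):
    """Eq. 6: keys restricted to the CLS token plus the query's own frame."""
    B, A, S, D_h = q.shape
    assert S == 1 + N * F
    scale = D_h ** -0.5
    k_cls = k[:, :, :1]                                   # (B, A, 1, D_h)
    # drop the CLS query row for brevity; tokens are ordered frame-major (t, p)
    q_p = q[:, :, 1:].reshape(B, A, F, N, D_h)
    k_p = k[:, :, 1:].reshape(B, A, F, N, D_h)
    logits_p = q_p @ k_p.transpose(-2, -1) * scale        # (B, A, F, N, N)
    logits_cls = (q_p * k_cls.unsqueeze(2)).sum(-1, keepdim=True) * scale
    # N + 1 comparisons per query: the own-frame patches plus the CLS key
    return torch.cat([logits_cls, logits_p], dim=-1).softmax(dim=-1)
```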
Figure 2.

Visualization of the five space-time self-attention schemes studied in this work. Each video clip is viewed as a sequence of frame-level patches with a size of 16 × 16 pixels. For illustration, we denote in blue the query patch and show in non-blue colors its self-attention space-time neighborhood under each scheme. Patches without color are not used for the self-attention computation of the blue patch. Multiple colors within a scheme denote attentions separately applied along different dimensions (e.g., space and time for (T+S)) or over different neighborhoods (e.g., for (L+G)). Note that self-attention is computed for every single patch in the video clip, i.e., every patch serves as a query. We also note that although the attention pattern is shown for only two adjacent frames, it extends in the same fashion to all frames of the clip.

Encoding.
The encoding $\mathbf{z}^{(\ell)}_{(p,t)}$ at block $\ell$ is obtained by first computing the weighted sum of value vectors using self-attention coefficients from each attention head:

$$\mathbf{s}^{(\ell,a)}_{(p,t)} = \alpha^{(\ell,a)}_{(p,t),(0,0)}\,\mathbf{v}^{(\ell,a)}_{(0,0)} + \sum_{p'=1}^{N} \sum_{t'=1}^{F} \alpha^{(\ell,a)}_{(p,t),(p',t')}\,\mathbf{v}^{(\ell,a)}_{(p',t')}. \quad (7)$$

Then, the concatenation of these vectors from all heads is projected and passed through an MLP, using residual connections after each operation:

$$\mathbf{z}'^{(\ell)}_{(p,t)} = W_O \begin{bmatrix} \mathbf{s}^{(\ell,1)}_{(p,t)} \\ \vdots \\ \mathbf{s}^{(\ell,A)}_{(p,t)} \end{bmatrix} + \mathbf{z}^{(\ell-1)}_{(p,t)} \quad (8)$$

$$\mathbf{z}^{(\ell)}_{(p,t)} = \mathrm{MLP}\!\left(\mathrm{LN}\!\left(\mathbf{z}'^{(\ell)}_{(p,t)}\right)\right) + \mathbf{z}'^{(\ell)}_{(p,t)}. \quad (9)$$
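Continuing the sketch, the encoding step of Eqs. 7-9 can be written as below. The 4x MLP expansion with GELU is a common ViT convention assumed here for illustration; it is not prescribed by the equations above.

```python
import torch
import torch.nn as nn

D, A = 768, 12
D_h = D // A

W_o = nn.Linear(D, D)                       # W_O in Eq. 8
norm = nn.LayerNorm(D)
mlp = nn.Sequential(nn.Linear(D, 4 * D), nn.GELU(), nn.Linear(4 * D, D))

def encode(alpha, v, z_prev):
    # alpha: (B, A, S, S) attention weights, v: (B, A, S, D_h), z_prev: (B, S, D)
    s = alpha @ v                           # Eq. 7: weighted sum of value vectors
    s = s.transpose(1, 2).reshape(z_prev.shape[0], -1, A * D_h)   # concatenate heads
    z_mid = W_o(s) + z_prev                 # Eq. 8: projection plus residual
    return mlp(norm(z_mid)) + z_mid         # Eq. 9: MLP plus residual

z_next = encode(torch.softmax(torch.randn(2, A, 5, 5), dim=-1),
                torch.randn(2, A, 5, D_h), torch.randn(2, 5, D))
```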
Classification embedding.

The final clip embedding is obtained from the final block for the classification token:

$$\mathbf{y} = \mathrm{LN}\!\left(\mathbf{z}^{(L)}_{(0,0)}\right) \in \mathbb{R}^D. \quad (10)$$

On top of this representation we append a 1-hidden-layer MLP, which is used to predict the final video classes.
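A sketch of this readout (Eq. 10 followed by the classification MLP); the hidden activation and the number of classes are placeholders:

```python
import torch
import torch.nn as nn

D, num_classes = 768, 400                   # class count is a placeholder
norm = nn.LayerNorm(D)
head = nn.Sequential(nn.Linear(D, D), nn.Tanh(), nn.Linear(D, num_classes))

z_final = torch.randn(2, 1 + 8 * 196, D)    # output of the last block L
y = norm(z_final[:, 0])                     # Eq. 10: LayerNorm of the CLS embedding
logits = head(y)                            # 1-hidden-layer MLP classifier
```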
Space-Time Self-Attention Models.

As already mentioned, we can reduce the computational cost by replacing the spatiotemporal attention of Eq. 5 with spatial attention within each frame only (Eq. 6). However, such a model neglects to capture temporal dependencies across frames. As shown in our experiments, this approach leads to degraded classification accuracy compared to full spatiotemporal attention, especially on benchmarks where strong temporal modeling is necessary.

We propose an alternative, more efficient architecture for spatiotemporal attention, named "Divided Space-Time Attention" (denoted with T+S), where temporal attention and spatial attention are separately applied one after the other. This architecture is compared to that of Space and Joint Space-Time attention in Fig. 1. A visualization of the different attention models on a video example is given in Fig. 2. For Divided Attention, within each block $\ell$, we first compute temporal attention by comparing each patch $(p,t)$ with all the patches at the same spatial location in the other frames:

$$\boldsymbol{\alpha}^{(\ell,a)\,\mathrm{time}}_{(p,t)} = \mathrm{SM}\!\left(\frac{\mathbf{q}^{(\ell,a)\top}_{(p,t)}}{\sqrt{D_h}} \cdot \left[\mathbf{k}^{(\ell,a)}_{(0,0)} \;\; \left\{\mathbf{k}^{(\ell,a)}_{(p,t')}\right\}_{t'=1,\ldots,F}\right]\right). \quad (11)$$
Attention             Params   K400   SSv2
Space                 85.9M    77.6   36.6
Joint Space-Time      85.9M    78.1   58.5
Divided Space-Time    121.4M
Sparse Local Global   121.4M   76.8   56.3
Axial                 156.8M   74.6   56.2
Table 1.
Video-level accuracy for different space-time attention schemes in TimeSformer. We evaluate the models on the validation sets of Kinetics-400 (K400) and Something-Something-V2 (SSv2). We observe that divided space-time attention achieves the best results on both datasets.
The encoding $\mathbf{z}'^{(\ell)\,\mathrm{time}}_{(p,t)}$ resulting from the application of Eq. 8 using temporal attention is then fed back for spatial attention computation instead of being passed to the MLP. In other words, new key/query/value vectors are obtained from $\mathbf{z}'^{(\ell)\,\mathrm{time}}_{(p,t)}$ and spatial attention is then computed using Eq. 6. Finally, the resulting vector $\mathbf{z}'^{(\ell)\,\mathrm{space}}_{(p,t)}$ is passed to the MLP of Eq. 9 to compute the final encoding $\mathbf{z}^{(\ell)}_{(p,t)}$ of the patch at block $\ell$. For the model of divided attention, we learn distinct query/key/value matrices $\{W^{(\ell,a)}_{Q^{\mathrm{time}}}, W^{(\ell,a)}_{K^{\mathrm{time}}}, W^{(\ell,a)}_{V^{\mathrm{time}}}\}$ and $\{W^{(\ell,a)}_{Q^{\mathrm{space}}}, W^{(\ell,a)}_{K^{\mathrm{space}}}, W^{(\ell,a)}_{V^{\mathrm{space}}}\}$ over the time and space dimensions. Note that compared to the $(NF + 1)$ comparisons per patch needed by the joint spatiotemporal attention model of Eq. 5, Divided Attention performs only $(N + F + 2)$ comparisons per patch. Our experiments demonstrate that this space-time factorization is not only more efficient but also leads to improved classification accuracy.

We have also experimented with a "Sparse Local Global" (L+G) attention model and an "Axial" (T+W+H) attention model. Their architectures are illustrated in Fig. 1, while Fig. 2 shows the patches considered for attention by these models. For each patch $(p,t)$, (L+G) first computes a local attention by considering the neighboring $F \times H/2 \times W/2$ patches and then calculates a sparse global attention over the entire clip using a stride of 2 patches along the temporal dimension and also the two spatial dimensions. Thus, it can be viewed as a faster approximation of full spatiotemporal attention using a local-global decomposition and a sparsity pattern, similar to that used in (Child et al., 2019). Finally, "Axial" attention decomposes the attention computation in three distinct steps: over time, width and height. A decomposed attention pattern over the two spatial axes of the image was proposed in (Ho et al., 2019; Huang et al., 2019; Wang et al., 2020b), and our (T+W+H) adds a third dimension (time) for the case of video. All of these models are implemented by learning distinct query/key/value matrices for each attention step.
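To make the Divided Space-Time scheme concrete, the sketch below implements one (T+S) block using standard multi-head attention layers: temporal attention over the F patches sharing a spatial location, a second attention step with distinct parameters over the N patches of each frame, and a single MLP at the end of the block. It is a simplified rendition (the classification token and exact projection details are omitted), not our reference implementation.

```python
import torch
import torch.nn as nn

class DividedSpaceTimeBlock(nn.Module):
    """One (T+S) block: temporal attention, then spatial attention, then the MLP."""
    def __init__(self, dim=768, heads=12, num_frames=8, num_patches=196):
        super().__init__()
        self.F, self.N = num_frames, num_patches
        # distinct attention parameters for the temporal and the spatial step
        self.norm_t = nn.LayerNorm(dim)
        self.attn_t = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_s = nn.LayerNorm(dim)
        self.attn_s = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_m = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, z):                        # z: (B, F*N, D), frame-major ordering
        B, _, D = z.shape
        F, N = self.F, self.N
        # temporal attention (Eq. 11): sequences of length F, one per spatial location
        zt = z.view(B, F, N, D).transpose(1, 2).reshape(B * N, F, D)
        h = self.norm_t(zt)
        zt = zt + self.attn_t(h, h, h, need_weights=False)[0]
        # spatial attention (Eq. 6): sequences of length N, one per frame, with fresh
        # query/key/value projections computed from the temporal output
        zs = zt.view(B, N, F, D).transpose(1, 2).reshape(B * F, N, D)
        h = self.norm_s(zs)
        zs = zs + self.attn_s(h, h, h, need_weights=False)[0]
        z = zs.view(B, F, N, D).reshape(B, F * N, D)
        return z + self.mlp(self.norm_m(z))      # MLP applied once at the end (Eq. 9)

out = DividedSpaceTimeBlock()(torch.randn(2, 8 * 196, 768))
```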
4. Experiments
We now evaluate TimeSformer on four popular action recognition datasets: Kinetics-400 (Carreira & Zisserman, 2017), Kinetics-600 (Carreira et al., 2018), Something-Something-V2 (Goyal et al., 2017), and Diving-48 (Li et al., 2018).
Figure 3.
We compare the video classification cost (in TFLOPs) of Joint Space-Time versus Divided Space-Time attention. We plot the number of TFLOPs as a function of spatial crop size in pixels (left), and the number of input frames (right). As we increase the spatial resolution (left) or the video length (right), our proposed divided space-time attention leads to dramatic computational savings compared to the scheme of joint space-time attention.
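The trend in Figure 3 can be anticipated with a back-of-the-envelope count of query-key comparisons, as in the sketch below (a rough proxy for attention FLOPs, not the exact measurement used in the figure):

```python
# Rough count of query-key comparisons for a clip of F frames at a given crop size with
# 16x16 patches: joint attention grows as (N*F)^2, divided attention as (N*F)*(N+F).
def comparisons(crop, frames, patch=16):
    n = (crop // patch) ** 2                  # patches per frame
    joint = (n * frames) ** 2                 # Eq. 5: every patch attends to every patch
    divided = (n * frames) * (n + frames)     # temporal (F) + spatial (N) keys per patch
    return joint, divided

for crop in (224, 336, 448, 560):
    j, d = comparisons(crop, frames=8)
    print(f"crop {crop}px: joint / divided comparisons = {j / d:.1f}x")
```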
For all our experiments, we adopt the "Base" ViT model architecture (Dosovitskiy et al., 2020) pretrained on ImageNet (Russakovsky et al., 2014). Unless differently specified, we use clips of size 8 × 224 × 224, with frames sampled at a rate of 1/32. The patch size is set to 16 × 16 pixels. During inference, unless otherwise noted, we sample a single temporal clip in the middle of the video. We use 3 spatial crops (top-left, center, bottom-right) so that the entire video clip is spatially covered. The final prediction is obtained by averaging the softmax scores of these predictions.

In Table 1, we present the results for the five proposed space-time attention schemes in TimeSformer using Kinetics-400 (K400) and Something-Something-V2 (SSv2) as benchmarks. From these results we can first notice that TimeSformer with space-only attention (S) performs well on K400. This is an interesting finding. Indeed, Sevilla-Lara et al. (2021) found that, on K400, the use of spatial information is more important than the leveraging of temporal dynamics in order to achieve strong accuracy. Here, we show that it is possible to obtain solid accuracy on K400 without any temporal modeling at all. Note, however, that space-only attention performs extremely poorly on SSv2. This stresses the importance of temporal modeling on this latter dataset. Furthermore, we observe that divided space-time attention achieves the best accuracy on both K400 and SSv2. This makes sense because, compared to joint space-time attention, divided space-time attention has a larger learning capacity (see Table 1), as it contains distinct learning parameters for temporal attention and spatial attention.

In Figure 3, we also compare the computational cost of joint space-time versus divided space-time attention when using higher spatial resolution (left) and longer (right) videos. We note that the scheme of divided space-time attention scales gracefully under both of these settings.
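A minimal sketch of the inference protocol described above (a single temporal clip, three spatial crops, and averaging of softmax scores); `model`, the crop placement, and the pre-resizing of the clip are assumptions of this illustration:

```python
import torch
import torch.nn.functional as F

def predict_clip(model, clip, crop=224):
    """clip: (3, T, H, W) tensor, already resized so that min(H, W) >= crop."""
    _, _, H, W = clip.shape
    # top-left, center and bottom-right crops together cover the whole frame
    offsets = [(0, 0), ((H - crop) // 2, (W - crop) // 2), (H - crop, W - crop)]
    scores = []
    for top, left in offsets:
        view = clip[:, :, top:top + crop, left:left + crop].unsqueeze(0)
        scores.append(F.softmax(model(view), dim=-1))
    return torch.stack(scores).mean(dim=0)   # average the softmax scores of the 3 views
```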
Figure 4.
Clip-level accuracy on Kinetics-400 as a function of spatial crop size in pixels (left), and the number of input frames (right).

In contrast, the scheme of joint space-time attention leads to a dramatically higher cost when resolution or video length is increased. In practice, joint space-time attention causes a GPU memory overflow once the spatial frame resolution reaches 448 pixels or once the number of frames is increased to 32, and thus it is effectively not applicable to large frames or long videos. Thus, despite a larger number of parameters, divided space-time attention is more efficient than joint space-time attention when operating on higher spatial resolution or longer videos. For all subsequent experiments we therefore use a TimeSformer constructed with divided space-time self-attention blocks.

As discussed above, the scalability of our model allows it to operate at higher spatial resolution and on longer videos compared to most 3D CNNs. We note that both of these aspects affect the length of the sequence of tokens fed to the Transformer. Specifically, increasing the spatial resolution results in a higher number of patches (N) per frame. The number of input tokens is also increased when using more frames. To investigate the benefits, we conduct an empirical study where we separately increase the number of tokens along each of these two axes. We report the findings in Figure 4. We see that increasing the spatial resolution (up to a certain point) leads to a substantial boost in performance. Similarly, we observe that increasing the length of the input clip leads to consistent accuracy gains. Due to GPU memory constraints, we are not able to test our model on clips longer than 96 frames. Still, we would like to point out that using clips of 96 frames is a significant departure from current convolutional models, which are typically limited to processing inputs of 8-32 frames.

In their recent work, Dosovitskiy et al. (2020) demonstrated that ViT is most effective when trained on very large-scale datasets. In this section, we investigate whether the same trend holds for our TimeSformer model. First, we attempted to train TimeSformer on video datasets directly, without ImageNet pretraining.
Figure 5.
We study accuracy on Kinetics-400 (K400) and Something-Something-V2 (SSv2) as a function of the number of training videos. On K400, TimeSformer performs best in all cases. On SSv2, which requires more complicated temporal reasoning, TimeSformer outperforms the other models only when using enough training videos. All models are pretrained on ImageNet.

For these experiments, we followed the training-from-scratch protocol of Touvron et al. (2020) and we also evaluated some variants of it. However, the model failed to learn meaningful features. We suspect that a new optimization strategy or perhaps different hyperparameter values may be needed for training TimeSformer from scratch on video data. Thus, for all subsequent studies we continue to use ImageNet pretraining.

To understand the effects of video-data scale on performance, we train our model on different subsets of K400 and SSv2: 25%, 50%, 75%, and 100% of the full datasets. We show these results in Figure 5, where we also compare our method with SlowFast R50 (Feichtenhofer et al., 2019b) and I3D R50 (Carreira & Zisserman, 2017) trained on the same subsets. We note that all 3 architectures are pretrained on ImageNet (Russakovsky et al., 2014). The results of Figure 5 show that, on K400, TimeSformer outperforms the other models for all training subsets. However, we observe a different trend on SSv2, where TimeSformer is the strongest model only when trained on 75% or 100% of the full data. This may be explained by the fact that, compared to K400, SSv2 requires learning more complex temporal patterns, and thus more examples may be needed by TimeSformer to learn those patterns effectively.

We now compare TimeSformer to the state-of-the-art on several popular action recognition datasets. We use three variants of our model:
(1) TimeSformer, the default version of our model, operating on 8 × 224 × 224 video clips; (2) TimeSformer-HR, a high spatial resolution variant that operates on 16 × 448 × 448 video clips; and lastly (3) TimeSformer-L, a long-range configuration of our model that operates on 96 × 224 × 224 video clips with frames sampled at a rate of 1/4.

Kinetics-400.
In Table 2, we report our results on the validation set of K400. In addition to the accuracy metrics, we include inference cost, given in TFLOPs.
Method                                    Top-1   Top-5   TFLOPs
ARTNet (Wang et al., 2018a)               69.2    88.3    6.0
I3D (Carreira & Zisserman, 2017)          71.1    89.3    N/A
R(2+1)D (Tran et al., 2018)               72.0    90.0    17.5
MFNet (Chen et al., 2018b)                72.8    90.4    N/A
Inception-ResNet (Bian et al., 2017)      73.0    90.9    N/A
bLVNet (Fan et al., 2019)                 73.5    91.2    0.84
A²-Net (Chen et al., 2018c)               74.6    91.5    N/A
TSM (Lin et al., 2019)                    74.7    N/A     N/A
S3D-G (Xie et al., 2018)                  74.7    93.4    N/A
Oct-I3D+NL (Chen et al., 2019a)           75.7    N/A     0.84
D3D (Stroud et al., 2020)                 75.9    N/A     N/A
GloRe (Chen et al., 2019b)                76.1    N/A     N/A
I3D+NL (Wang et al., 2018b)               77.7    93.3    10.8
ip-CSN-152 (Tran et al., 2019)            77.8    92.8    3.2
CorrNet (Wang et al., 2020a)              79.2    N/A     6.7
LGD-3D-101 (Qiu et al., 2019)             79.4    94.4    N/A
SlowFast (Feichtenhofer et al., 2019b)    79.8    93.9    7.0
X3D-XXL (Feichtenhofer, 2020)             80.4    94.6    5.8
TimeSformer
TimeSformer-L                             80.7    94.7
Table 2.
Video-level accuracy on Kinetics-400. TimeSformer-L achieves the best reported accuracy.

We note that, whereas most previous methods use multiple temporal clips with several spatial crops during inference, TimeSformer achieves solid accuracy with only 3 views (3 spatial crops), which reduces the inference cost. Our long-range variant, TimeSformer-L, achieves a top-1 accuracy of 80.7%, outperforming all prior methods. Furthermore, our default TimeSformer has the lowest inference cost among recent state-of-the-art models. Yet, it still provides solid accuracy, outperforming many much more costly models.

In Figure 6, we study the effect of using multiple temporal clips during inference (each with a single spatial crop). We plot accuracy as a function of the number of temporal clips used for testing. We compare our model against X3D (Feichtenhofer, 2020) and SlowFast (Feichtenhofer et al., 2019b). X3D and SlowFast require multiple clips to approach their top accuracy. Conversely, our long-range variant, TimeSformer-L, does not benefit at all from using multiple clips. This makes sense since it is able to span about 12 seconds of a Kinetics video with a single clip.

We also note that TimeSformer is much faster to train than state-of-the-art 3D CNNs, even when both types of models are pretrained on ImageNet. For instance, training a SlowFast R50 or an I3D R50 on K400 under comparable settings takes several times longer than training TimeSformer on the same number of GPUs, cutting the training time by a large factor compared to both 3D CNNs.

Kinetics-600.
In Table 3, we also present our results on Kinetics-600. Just like on Kinetics-400, we observe that our method outperforms all prior works on this benchmark.
Figure 6.
Video-level accuracy on Kinetics-400 vs. the number of temporal clips used during inference. TimeSformer-L achieves excellent accuracy using a small number of sampled clips, which leads to strong performance at low inference cost.
Method                                    Top-1   Top-5
I3D-R50+Cell (Wang et al., 2020c)         79.8    94.4
LGD-3D-101 (Qiu et al., 2019)             81.5    95.6
SlowFast (Feichtenhofer et al., 2019b)    81.8    95.1
X3D-XL (Feichtenhofer, 2020)              81.9    95.5
TimeSformer
TimeSformer-HR                            82.4    96.0
TimeSformer-L
Table 3.
Video-level accuracy on Kinetics-600. TimeSformer-HR achieves the best reported accuracy.
Interestingly, we note that in this case, the best performing variant of our model is TimeSformer-HR.
Something-Something-V2.
In Table 4, we also validate our model on SSv2. Our results suggest that TimeSformer achieves lower accuracy than the best models on this dataset. However, considering that our model uses a completely different design, we take these results as suggesting that TimeSformer is a promising approach even for challenging temporally-heavy datasets, such as SSv2.
Diving-48.
Finally, in Table 4, we present our method on another "temporally-heavy" dataset, Diving-48. Due to a recently discovered issue with a previous version of the Diving-48 labels, here we only compare our method with a reproduced SlowFast 16×8 R101 model. Our results show that TimeSformer outperforms SlowFast by a substantial margin.
In this subsection, we evaluate TimeSformer on the task of long-term video modeling using the HowTo100M dataset (Miech et al., 2019). HowTo100M is a large-scale instructional video dataset. It contains around 1M instructional Web videos showing humans performing over 23K different tasks, such as cooking, repairing, knitting, and making arts. The average duration of these videos is several minutes, which is orders of magnitude longer than the duration of videos in standard action recognition benchmarks. Each HowTo100M video has a label indicating the task demonstrated in the video (one out of the 23K classes), which can be used for supervised training. Thus, it is a good benchmark to assess the ability of a model to recognize activities exhibited over very long temporal extents.
Method                                    SSv2    Diving-48∗∗
SlowFast (Feichtenhofer et al., 2019b)    61.7    77.6
TSM (Lin et al., 2019)                    63.4    N/A
STM (Jiang et al., 2019)                  64.2    N/A
MSNet (Kwon et al., 2020)                 64.7    N/A
TEA (Li et al., 2020b)                    65.1    N/A
bLVNet (Fan et al., 2019)                         N/A
TimeSformer
TimeSformer-HR
TimeSformer-L

Table 4. Video-level accuracy on Something-Something-V2 and Diving-48. ∗∗Due to an issue with the Diving-48 labels used in previously published results, we only compare our method with a reproduced SlowFast 16×8 R101 model.

Table 5.
Long-term task classification on HowTo100M. Given a video spanning several minutes, the goal is to predict the long-term task demonstrated in the video (e.g., cooking breakfast, cleaning house, etc.). We evaluate a few variants of SlowFast and TimeSformer on this task. "Single Clip Coverage" denotes the number of seconds spanned by a single clip.

For this evaluation, we consider only categories that have a sufficient number of video examples. This gives a subset of HowTo100M, which we randomly partition into a training and a testing set of videos. We present our results in Table 5. As our baselines, we use two variants of SlowFast R50 that operate on the same number of input frames but with different temporal sampling rates, so that the second variant spans a considerably longer temporal extent per clip than the first. For TimeSformer, we test four variants, all operating on video clips sampled at the same frame rate but containing an increasing number of frames. During inference, for each method, we sample as many non-overlapping temporal clips as needed to cover the full temporal extent of an input video; e.g., if a single clip spans 10 seconds, we would sample 30 clips to cover a 300-second video. Video-level classification is done by averaging the clip predictions.
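A sketch of this video-level evaluation protocol (non-overlapping clips covering the full video, with averaged predictions); `model`, the frame-indexing scheme, and the padding of the last clip are assumptions of this illustration:

```python
import torch
import torch.nn.functional as F

def predict_long_video(model, video, frames_per_clip, stride):
    """video: (3, T, H, W); consecutive non-overlapping clips cover all T frames."""
    T = video.shape[1]
    clip_span = frames_per_clip * stride                 # frames covered by one clip
    scores = []
    for start in range(0, T, clip_span):
        idx = (torch.arange(frames_per_clip) * stride + start).clamp(max=T - 1)
        clip = video[:, idx].unsqueeze(0)                # last clip padded by repetition
        scores.append(F.softmax(model(clip), dim=-1))
    return torch.stack(scores).mean(dim=0)               # video-level prediction
```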
Figure 7. Visualization of space-time attention from the output token to the input space on Something-Something-V2. Our model learns to focus on the relevant parts in the video in order to perform spatiotemporal reasoning.
Figure 8.
Feature visualization with t-SNE (van der Maaten & Hinton, 2008) on Something-Something-V2. Each video is visualized as a point. Videos belonging to the same action category have the same color. The TimeSformer with divided space-time attention learns semantically more separable features than the TimeSformer with space-only attention or ViT (Dosovitskiy et al., 2020).

From the results in Table 5 we first note that, for the same single clip coverage, TimeSformer outperforms the corresponding SlowFast by a large margin. Second, we observe that longer-range TimeSformers do better, and our longest-range variant achieves the best video-level classification accuracy. These results suggest that our model is highly suitable for tasks that require long-term video modeling. Finally, we note that the long-range modeling approach of TimeSformer is orthogonal to long-term schemes designed to operate on top of video backbones (Wu et al., 2019; Girdhar et al., 2017), and thus further gains may come from a combination of them.
5. Conclusion
In this work, we introduced TimeSformer, a fundamentally different approach to video modeling compared to the established paradigm of convolution-based video networks. We showed that it is possible to design an effective and scalable video architecture built exclusively on space-time self-attention. Our method (1) is conceptually simple, (2) achieves state-of-the-art results on major action recognition benchmarks, (3) has low inference cost, and (4) is suitable for long-term video modeling. In the future, we plan to extend our method to other video analysis tasks such as action localization, video captioning and question-answering.
References
Ba, L. J., Kiros, J. R., and Hinton, G. E. Layer normalization. CoRR, 2016.

Bello, I., Zoph, B., Le, Q., Vaswani, A., and Shlens, J. Attention augmented convolutional networks. 2019.

Bertasius, G. and Torresani, L. Classifying, segmenting, and tracking object instances in video with mask propagation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.

Bian, Y., Gan, C., Liu, X., Li, F., Long, X., Li, Y., Qi, H., Zhou, J., Wen, S., and Lin, Y. Revisiting the effectiveness of off-the-shelf temporal modeling approaches for large-scale video classification. arXiv: Computer Vision and Pattern Recognition, 2017.

Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. Language models are few-shot learners. 2020.

Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., and Zagoruyko, S. End-to-end object detection with transformers. In European Conference on Computer Vision (ECCV), 2020.

Carreira, J. and Zisserman, A. Quo vadis, action recognition? A new model and the kinetics dataset. 2017.

Carreira, J., Noland, E., Banki-Horvath, A., Hillier, C., and Zisserman, A. A short note about kinetics-600. CoRR, 2018.

Chen, M. X., Firat, O., Bapna, A., Johnson, M., Macherey, W., Foster, G., Jones, L., Schuster, M., Shazeer, N., Parmar, N., Vaswani, A., Uszkoreit, J., Kaiser, L., Chen, Z., Wu, Y., and Hughes, M. The best of both worlds: Combining recent advances in neural machine translation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, 2018a.

Chen, Y., Kalantidis, Y., Li, J., Yan, S., and Feng, J. Multi-fiber networks for video recognition. ECCV, 2018b.

Chen, Y., Kalantidis, Y., Li, J., Yan, S., and Feng, J. A^2-nets: Double attention networks. In Advances in Neural Information Processing Systems 31, 2018c.

Chen, Y., Fan, H., Xu, B., Yan, Z., Kalantidis, Y., Rohrbach, M., Yan, S., and Feng, J. Drop an octave: Reducing spatial redundancy in convolutional neural networks with octave convolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019a.

Chen, Y., Rohrbach, M., Yan, Z., Shuicheng, Y., Feng, J., and Kalantidis, Y. Graph-based global reasoning networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019b.

Child, R., Gray, S., Radford, A., and Sutskever, I. Generating long sequences with sparse transformers. CoRR, 2019.

Cordonnier, J., Loukas, A., and Jaggi, M. On the relationship between self-attention and convolutional layers. 2020.

Dai, Z., Yang, Z., Yang, Y., Carbonell, J., Le, Q., and Salakhutdinov, R. Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019.

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby, N. An image is worth 16x16 words: Transformers for image recognition at scale. CoRR, 2020.

Fan, Q., Chen, C.-F. R., Kuehne, H., Pistoia, M., and Cox, D. More is less: Learning efficient video representations by big-little network and depthwise temporal aggregation. In Advances in Neural Information Processing Systems, volume 32, 2019.

Feichtenhofer, C. X3D: Expanding architectures for efficient video recognition. CVPR, pp. 200–210, 2020.
Feichtenhofer, C., Fan, H., Malik, J., and He, K. Slowfast networks for video recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019a.

Feichtenhofer, C., Fan, H., Malik, J., and He, K. Slowfast networks for video recognition. 2019b.

Gavrilyuk, K., Sanford, R., Javan, M., and Snoek, C. G. M. Actor-transformers for group activity recognition. 2020.

Girdhar, R., Ramanan, D., Gupta, A., Sivic, J., and Russell, B. ActionVLAD: Learning spatio-temporal aggregation for action classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.

Girdhar, R., Carreira, J., Doersch, C., and Zisserman, A. Video action transformer network. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2019.

Goyal, R., Kahou, S. E., Michalski, V., Materzynska, J., Westphal, S., Kim, H., Haenel, V., Fründ, I., Yianilos, P., Mueller-Freitag, M., Hoppe, F., Thurau, C., Bax, I., and Memisevic, R. The "something something" video database for learning and evaluating visual common sense. CoRR, 2017.

Ho, J., Kalchbrenner, N., Weissenborn, D., and Salimans, T. Axial attention in multidimensional transformers. CoRR, 2019.

Hu, H., Gu, J., Zhang, Z., Dai, J., and Wei, Y. Relation networks for object detection. 2018.

Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., and Liu, W. CCNet: Criss-cross attention for semantic segmentation. 2019.

Jiang, B., Wang, M., Gan, W., Wu, W., and Yan, J. STM: Spatiotemporal and motion encoding for action recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019.

Kwon, H., Kim, M., Kwak, S., and Cho, M. MotionSqueeze: Neural motion feature learning for video understanding. In ECCV, 2020.

Le, H., Sahoo, D., Chen, N., and Hoi, S. Multimodal transformer networks for end-to-end video-grounded dialogue systems. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019.

Li, L., Chen, Y.-C., Cheng, Y., Gan, Z., Yu, L., and Liu, J. HERO: Hierarchical encoder for video+language omni-representation pre-training. arXiv preprint arXiv:2005.00200, 2020a.

Li, Y., Li, Y., and Vasconcelos, N. RESOUND: Towards action recognition without representation bias. In The European Conference on Computer Vision (ECCV), September 2018.

Li, Y., Ji, B., Shi, X., Zhang, J., Kang, B., and Wang, L. TEA: Temporal excitation and aggregation for action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020b.

Lin, J., Gan, C., and Han, S. TSM: Temporal shift module for efficient video understanding. In Proceedings of the IEEE International Conference on Computer Vision, 2019.

Miech, A., Zhukov, D., Alayrac, J.-B., Tapaswi, M., Laptev, I., and Sivic, J. HowTo100M: Learning a text-video embedding by watching hundred million narrated video clips. In ICCV, 2019.

Ott, M., Edunov, S., Grangier, D., and Auli, M. Scaling neural machine translation. In Proceedings of the Third Conference on Machine Translation: Research Papers, 2018.

Parmar, N., Vaswani, A., Uszkoreit, J., Kaiser, L., Shazeer, N., Ku, A., and Tran, D. Image transformer. In Dy, J. G. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, ICML, 2018.

Qiu, Z., Yao, T., Ngo, C.-W., Tian, X., and Mei, T. Learning spatio-temporal representation with local and global diffusion. In CVPR, 2019.

Radford, A., Narasimhan, K., Salimans, T., and Sutskever, I. Improving language understanding by generative pre-training. 2018.

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. Language models are unsupervised multitask learners. 2019.

Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. J. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683, 2019.

Ramachandran, P., Parmar, N., Vaswani, A., Bello, I., Levskaya, A., and Shlens, J. Stand-alone self-attention in vision models. In Advances in Neural Information Processing Systems, pp. 68–80, 2019.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., and Fei-Fei, L. ImageNet large scale visual recognition challenge. arXiv:1409.0575, 2014.

Sevilla-Lara, L., Zha, S., Yan, Z., Goswami, V., Feiszli, M., and Torresani, L. Only time can tell: Discovering temporal data for temporal modeling. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pp. 535–544, January 2021.

Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.

Stroud, J., Ross, D., Sun, C., Deng, J., and Sukthankar, R. D3D: Distilled 3D networks for video action recognition. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), March 2020.

Sun, C., Myers, A., Vondrick, C., Murphy, K., and Schmid, C. VideoBERT: A joint model for video and language representation learning, 2019.

Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. Going deeper with convolutions. In Computer Vision and Pattern Recognition (CVPR), 2015.

Teed, Z. and Deng, J. RAFT: Recurrent all-pairs field transforms for optical flow. In Computer Vision - ECCV 2020 - 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part II, 2020.

Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., and Jégou, H. Training data-efficient image transformers and distillation through attention. arXiv preprint arXiv:2012.12877, 2020.

Tran, D., Wang, H., Torresani, L., Ray, J., LeCun, Y., and Paluri, M. A closer look at spatiotemporal convolutions for action recognition. 2018.

Tran, D., Wang, H., Feiszli, M., and Torresani, L. Video classification with channel-separated convolutional networks. ICCV, pp. 5551–5560, 2019.

van der Maaten, L. and Hinton, G. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, 2017a.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems 30, 2017b.

Wang, H., Tran, D., Torresani, L., and Feiszli, M. Video modeling with correlation networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020a.

Wang, H., Zhu, Y., Green, B., Adam, H., Yuille, A. L., and Chen, L. Axial-DeepLab: Stand-alone axial-attention for panoptic segmentation. In Computer Vision - ECCV 2020 - 16th European Conference, 2020b.

Wang, X., Girshick, R., Gupta, A., and He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018a.

Wang, X., Girshick, R. B., Gupta, A., and He, K. Non-local neural networks. 2018b.

Wang, X., Xiong, X., Neumann, M., Piergiovanni, A. J., Ryoo, M. S., Angelova, A., Kitani, K. M., and Hua, W. AttentionNAS: Spatiotemporal attention cell search for video classification. In Computer Vision - ECCV 2020 - 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part VIII, 2020c.

Weissenborn, D., Täckström, O., and Uszkoreit, J. Scaling autoregressive video models. 2020.

Wu, C.-Y., Feichtenhofer, C., Fan, H., He, K., Krahenbuhl, P., and Girshick, R. Long-term feature banks for detailed video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.

Xie, S., Sun, C., Huang, J., Tu, Z., and Murphy, K. Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part XV, pp. 318–335, 2018. URL https://doi.org/10.1007/978-3-030-01267-0_19.

Yang, Z., Garcia, N., Chu, C., Otani, M., Nakashima, Y., and Takemura, H. BERT representations for video question answering. In The IEEE Winter Conference on Applications of Computer Vision, 2020.

Zhao, H., Jia, J., and Koltun, V. Exploring self-attention for image recognition. 2020.

Zhou, L., Zhou, Y., Corso, J. J., Socher, R., and Xiong, C. End-to-end dense video captioning with masked transformer. In