Is Space-Time Attention All You Need for Video Understanding?
Gedas Bertasius Heng Wang Lorenzo Torresani
Abstract
We present a convolution-free approach to video classification built exclusively on self-attention over space and time. Our method, named "TimeSformer," adapts the standard Transformer architecture to video by enabling spatiotemporal feature learning directly from a sequence of frame-level patches. Our experimental study compares different self-attention schemes and suggests that "divided attention," where temporal attention and spatial attention are separately applied within each block, leads to the best video classification accuracy among the design choices considered. Despite the radically different design compared to the prominent paradigm of 3D convolutional architectures for video, TimeSformer achieves state-of-the-art results on several major action recognition benchmarks, including the best reported accuracy on Kinetics-400 and Kinetics-600. Furthermore, our model is faster to train and has higher test-time efficiency compared to competing architectures. Code and pretrained models will be made publicly available.
1. Introduction
Over the last few years, the field of natural language processing (NLP) has been revolutionized by the emergence of methods based on self-attention (Vaswani et al., 2017a). Because of their excellent capabilities at capturing long-range dependencies among words as well as their training scalability, self-attention architectures, such as the Transformer model, represent the current state-of-the-art across a wide range of language tasks, including machine translation (Ott et al., 2018; Chen et al., 2018a), question answering (Devlin et al., 2019; Dai et al., 2019), and autoregressive word generation (Radford et al., 2019; Brown et al., 2020).

Video understanding shares several high-level similarities with NLP. First of all, videos and sentences are both fundamentally sequential. Furthermore, precisely as the meaning of a word can often be understood only by relating it to the other words in the sentence, it may be argued that atomic actions in short-term video segments need to be contextualized with the rest of the video in order to be fully disambiguated.

Facebook AI, Dartmouth College. Correspondence to: Gedas Bertasius.
2. Related Work
Our approach is influenced by recent works that use self-attention for still-image classification, either in combination with the convolution operator or even as a full replacement for it. Within the former class, Non-Local Networks (Wang et al., 2018b) employ a non-local mean that effectively generalizes the self-attention function of Transformers (Vaswani et al., 2017b). Bello et al. (2019) propose a 2D self-attention mechanism that is competitive as a replacement of 2D convolution but gives even stronger results when used to augment convolutional features with self-attention features. Beyond image categorization, Relation Networks (Hu et al., 2018) and DETR (Carion et al., 2020) use self-attention on top of convolutional feature maps for object detection.

Our method is more closely related to image networks leveraging self-attention as a substitute for convolution (Parmar et al., 2018; Ramachandran et al., 2019; Cordonnier et al., 2020; Zhao et al., 2020). Since these works use individual pixels as queries, in order to maintain a manageable computational cost and a small memory consumption, they must restrict the scope of self-attention to local neighborhoods or use global self-attention on heavily downsized versions of the image. Alternative strategies for scalability to full images include sparse key-value sampling (Child et al., 2019) or constraining the self-attention to be calculated along the spatial axes (Ho et al., 2019; Huang et al., 2019; Wang et al., 2020b). A few of the self-attention operators considered in our experiments adopt similar sparse and axial computation, although generalized to the spatiotemporal volume. However, the efficiency of our approach stems mainly from decomposing the video into a sequence of frame-level patches and then feeding linear embeddings of these patches as input token embeddings to a Transformer. This strategy was recently introduced in Vision Transformers (ViT) (Dosovitskiy et al., 2020), which were shown to deliver impressive performance on image categorization. In this work, we build on the ViT design and extend it to video by proposing and empirically comparing several scalable schemes for space-time self-attention over high-resolution and long videos.

While Transformers have been recently used for video generation (Weissenborn et al., 2020), we are not aware of prior video recognition architectures leveraging self-attention as the exclusive building block. However, we note that Transformers have been adopted on top of convolutional feature maps for action localization and recognition (Girdhar et al., 2019), video classification (Wang et al., 2018b; Chen et al., 2018c), and group activity recognition (Gavrilyuk et al., 2020). We also note that there is a wide literature based on the use of text Transformers combined with video CNNs to address various video-language tasks, such as captioning (Zhou et al., 2018), question-answering (Yang et al., 2020), and dialog (Le et al., 2019). Finally, multimodal video-text transformers (Sun et al., 2019; Li et al., 2020a) have also been trained or pretrained in unsupervised fashion by adopting masked-token pretext tasks adapted from the language domain (Devlin et al., 2018; Radford et al., 2018; Raffel et al., 2019).
3. The TimeSformer Model
Input clip.
The TimeSformer takes as input a clip $X \in \mathbb{R}^{H \times W \times 3 \times F}$ consisting of $F$ RGB frames of size $H \times W$ sampled from the original video.
Figure 1.
Illustration of the video self-attention blocks that we investigate in this work. Each attention layer implements self-attention (Vaswani et al., 2017b) on a specified spatiotemporal neighborhood of frame-level patches (see Figure 2 for a visualization of the neighborhoods). We use residual connections to aggregate information from different attention layers within each block. A 1-hidden-layer MLP is applied at the end of each block. The final model is constructed by repeatedly stacking these blocks on top of each other.
Decomposition into patches.
Following the ViT approach (Dosovitskiy et al., 2020), we decompose each frame into $N$ non-overlapping patches, each of size $P \times P$, such that the $N$ patches span the entire frame, i.e., $N = HW/P^2$. We flatten these patches into vectors $\mathbf{x}_{(p,t)} \in \mathbb{R}^{3P^2}$, with $p = 1, \ldots, N$ denoting spatial locations and $t = 1, \ldots, F$ representing an index over frames.

Linear embedding.
We linearly map each patch $\mathbf{x}_{(p,t)}$ into an embedding vector $\mathbf{z}^{(0)}_{(p,t)} \in \mathbb{R}^D$ by means of a learnable matrix $E \in \mathbb{R}^{D \times 3P^2}$:

$$\mathbf{z}^{(0)}_{(p,t)} = E\,\mathbf{x}_{(p,t)} + \mathbf{e}^{pos}_{(p,t)} \quad (1)$$

where $\mathbf{e}^{pos}_{(p,t)} \in \mathbb{R}^D$ represents a learnable positional embedding added to encode the spatiotemporal position of each patch. The resulting sequence of embedding vectors $\mathbf{z}^{(0)}_{(p,t)}$ for $p = 1, \ldots, N$ and $t = 1, \ldots, F$ represents the input to the Transformer, and plays a role similar to the sequences of embedded words/tokens that are fed to text Transformers in NLP. As in the original BERT Transformer (Devlin et al., 2018), we add in the first position of the sequence a special learnable vector $\mathbf{z}^{(0)}_{(0,0)} \in \mathbb{R}^D$ representing the embedding of the classification token.
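For concreteness, the patch decomposition and the linear embedding of Eq. 1 can be sketched in PyTorch as follows. This is a simplified illustration rather than our released implementation; the module name, the frame-major token ordering, and the Base-ViT default sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Decompose each frame into P x P patches and embed them linearly (Eq. 1)."""
    def __init__(self, img_size=224, patch_size=16, num_frames=8, dim=768):
        super().__init__()
        self.P = patch_size
        self.N = (img_size // patch_size) ** 2            # patches per frame
        self.proj = nn.Linear(3 * patch_size ** 2, dim)   # the matrix E of Eq. 1
        # learnable positional embeddings for every (patch, frame) location + CLS token
        self.pos_embed = nn.Parameter(torch.zeros(1, num_frames * self.N + 1, dim))
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, x):                                 # x: (B, 3, F, H, W)
        B, C, F, H, W = x.shape
        P = self.P
        # split H and W into non-overlapping P x P patches, flatten each to 3*P*P values
        x = x.unfold(3, P, P).unfold(4, P, P)             # (B, 3, F, H/P, W/P, P, P)
        x = x.permute(0, 2, 3, 4, 1, 5, 6).reshape(B, F * self.N, 3 * P * P)
        z = self.proj(x)                                  # frame-major tokens: (B, F*N, D)
        cls = self.cls_token.expand(B, -1, -1)
        return torch.cat([cls, z], dim=1) + self.pos_embed  # (B, 1 + F*N, D)

tokens = PatchEmbed()(torch.randn(2, 3, 8, 224, 224))     # -> (2, 1 + 8*196, 768)
```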
Query-Key-Value computation.

Our Transformer consists of $L$ encoding blocks. At each block $\ell$, a query/key/value vector is computed for each patch from the representation $\mathbf{z}^{(\ell-1)}_{(p,t)}$ encoded by the preceding block:

$$\mathbf{q}^{(\ell,a)}_{(p,t)} = W^{(\ell,a)}_Q \, \mathrm{LN}\!\left(\mathbf{z}^{(\ell-1)}_{(p,t)}\right) \in \mathbb{R}^{D_h} \quad (2)$$
$$\mathbf{k}^{(\ell,a)}_{(p,t)} = W^{(\ell,a)}_K \, \mathrm{LN}\!\left(\mathbf{z}^{(\ell-1)}_{(p,t)}\right) \in \mathbb{R}^{D_h} \quad (3)$$
$$\mathbf{v}^{(\ell,a)}_{(p,t)} = W^{(\ell,a)}_V \, \mathrm{LN}\!\left(\mathbf{z}^{(\ell-1)}_{(p,t)}\right) \in \mathbb{R}^{D_h} \quad (4)$$

where $\mathrm{LN}()$ denotes LayerNorm (Ba et al., 2016), $a = 1, \ldots, A$ is an index over multiple attention heads, and $A$ denotes the total number of attention heads. The latent dimensionality for each attention head is set to $D_h = D/A$.
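A minimal sketch of the per-head query/key/value computation of Eqs. 2-4 (all names are our own; for convenience the per-head projections are stacked into single linear layers):

```python
import torch
import torch.nn as nn

D, A = 768, 12
D_h = D // A                          # latent dimensionality per head

norm = nn.LayerNorm(D)                # LN() in Eqs. 2-4
W_q = nn.Linear(D, D, bias=False)     # stacks W_Q^(l,a) for all A heads
W_k = nn.Linear(D, D, bias=False)
W_v = nn.Linear(D, D, bias=False)

z = torch.randn(2, 1 + 8 * 196, D)    # token representations from the previous block
B, S, _ = z.shape
h = norm(z)
# q[b, a, s] is the D_h-dimensional query of head a for token s (same layout for k, v)
q = W_q(h).view(B, S, A, D_h).transpose(1, 2)   # (B, A, S, D_h)
k = W_k(h).view(B, S, A, D_h).transpose(1, 2)
v = W_v(h).view(B, S, A, D_h).transpose(1, 2)
```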
Self-attention computation.

Self-attention weights are computed via dot-product. The self-attention weights $\boldsymbol{\alpha}^{(\ell,a)}_{(p,t)} \in \mathbb{R}^{NF+1}$ for query patch $(p,t)$ are given by:

$$\boldsymbol{\alpha}^{(\ell,a)}_{(p,t)} = \mathrm{SM}\!\left(\frac{\mathbf{q}^{(\ell,a)\top}_{(p,t)}}{\sqrt{D_h}} \cdot \left[\mathbf{k}^{(\ell,a)}_{(0,0)} \;\; \left\{\mathbf{k}^{(\ell,a)}_{(p',t')}\right\}_{\substack{p'=1,\ldots,N \\ t'=1,\ldots,F}}\right]\right) \quad (5)$$

where $\mathrm{SM}$ denotes the softmax activation function. Note that when attention is computed over one dimension only (e.g., spatial-only or temporal-only), the computation is significantly reduced. For example, in the case of spatial attention, only $N + 1$ query-key comparisons are made, using exclusively keys from the same frame as the query:

$$\boldsymbol{\alpha}^{(\ell,a)\,\mathrm{space}}_{(p,t)} = \mathrm{SM}\!\left(\frac{\mathbf{q}^{(\ell,a)\top}_{(p,t)}}{\sqrt{D_h}} \cdot \left[\mathbf{k}^{(\ell,a)}_{(0,0)} \;\; \left\{\mathbf{k}^{(\ell,a)}_{(p',t)}\right\}_{p'=1,\ldots,N}\right]\right). \quad (6)$$
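The following sketch computes the attention weights of Eq. 5 for all queries at once, as well as the spatial-only variant of Eq. 6; it assumes the frame-major token ordering of the earlier sketch and, for brevity, omits the row corresponding to the classification-token query:

```python
import torch

def joint_space_time_attn(q, k):
    """Eq. 5: every patch attends to the CLS token and to all N*F patches."""
    scale = q.shape[-1] ** -0.5
    return (q @ k.transpose(-2, -1) * scale).softmax(dim=-1)   # (B, A, S, S)

def spatial_only_attn(q, k, N, F):
    """Eq. 6: keys restricted to the CLS token plus the query's own frame."""
    B, A, S, D_h = q.shape
    assert S == 1 + N * F
    scale = D_h ** -0.5
    k_cls = k[:, :, :1]                                   # (B, A, 1, D_h)
    # drop the CLS query row for brevity; tokens are ordered frame-major (t, p)
    q_p = q[:, :, 1:].reshape(B, A, F, N, D_h)
    k_p = k[:, :, 1:].reshape(B, A, F, N, D_h)
    logits_p = q_p @ k_p.transpose(-2, -1) * scale        # (B, A, F, N, N)
    logits_cls = (q_p * k_cls.unsqueeze(2)).sum(-1, keepdim=True) * scale
    # N + 1 comparisons per query: the own-frame patches plus the CLS key
    return torch.cat([logits_cls, logits_p], dim=-1).softmax(dim=-1)
```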
Figure 2.

Visualization of the five space-time self-attention schemes studied in this work. Each video clip is viewed as a sequence of frame-level patches with a size of 16 × 16 pixels. For illustration, we denote in blue the query patch and show in non-blue colors its self-attention space-time neighborhood under each scheme. Patches without color are not used for the self-attention computation of the blue patch. Multiple colors within a scheme denote attentions separately applied along different dimensions (e.g., space and time for (T+S)) or over different neighborhoods (e.g., for (L+G)). Note that self-attention is computed for every single patch in the video clip, i.e., every patch serves as a query. We also note that although the attention pattern is shown for only two adjacent frames, it extends in the same fashion to all frames of the clip.

Encoding.
The encoding $\mathbf{z}^{(\ell)}_{(p,t)}$ at block $\ell$ is obtained by first computing the weighted sum of value vectors using self-attention coefficients from each attention head:

$$\mathbf{s}^{(\ell,a)}_{(p,t)} = \alpha^{(\ell,a)}_{(p,t),(0,0)}\,\mathbf{v}^{(\ell,a)}_{(0,0)} + \sum_{p'=1}^{N} \sum_{t'=1}^{F} \alpha^{(\ell,a)}_{(p,t),(p',t')}\,\mathbf{v}^{(\ell,a)}_{(p',t')}. \quad (7)$$

Then, the concatenation of these vectors from all heads is projected and passed through an MLP, using residual connections after each operation:

$$\mathbf{z}'^{(\ell)}_{(p,t)} = W_O \begin{bmatrix} \mathbf{s}^{(\ell,1)}_{(p,t)} \\ \vdots \\ \mathbf{s}^{(\ell,A)}_{(p,t)} \end{bmatrix} + \mathbf{z}^{(\ell-1)}_{(p,t)} \quad (8)$$

$$\mathbf{z}^{(\ell)}_{(p,t)} = \mathrm{MLP}\!\left(\mathrm{LN}\!\left(\mathbf{z}'^{(\ell)}_{(p,t)}\right)\right) + \mathbf{z}'^{(\ell)}_{(p,t)}. \quad (9)$$
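Continuing the sketch, the encoding step of Eqs. 7-9 can be written as below. The 4x MLP expansion with GELU is a common ViT convention assumed here for illustration; it is not prescribed by the equations above.

```python
import torch
import torch.nn as nn

D, A = 768, 12
D_h = D // A

W_o = nn.Linear(D, D)                       # W_O in Eq. 8
norm = nn.LayerNorm(D)
mlp = nn.Sequential(nn.Linear(D, 4 * D), nn.GELU(), nn.Linear(4 * D, D))

def encode(alpha, v, z_prev):
    # alpha: (B, A, S, S) attention weights, v: (B, A, S, D_h), z_prev: (B, S, D)
    s = alpha @ v                           # Eq. 7: weighted sum of value vectors
    s = s.transpose(1, 2).reshape(z_prev.shape[0], -1, A * D_h)   # concatenate heads
    z_mid = W_o(s) + z_prev                 # Eq. 8: projection plus residual
    return mlp(norm(z_mid)) + z_mid         # Eq. 9: MLP plus residual

z_next = encode(torch.softmax(torch.randn(2, A, 5, 5), dim=-1),
                torch.randn(2, A, 5, D_h), torch.randn(2, 5, D))
```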
Classification embedding.

The final clip embedding is obtained from the final block for the classification token:

$$\mathbf{y} = \mathrm{LN}\!\left(\mathbf{z}^{(L)}_{(0,0)}\right) \in \mathbb{R}^D. \quad (10)$$

On top of this representation we append a 1-hidden-layer MLP, which is used to predict the final video classes.
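A sketch of this readout (Eq. 10 followed by the classification MLP); the hidden activation and the number of classes are placeholders:

```python
import torch
import torch.nn as nn

D, num_classes = 768, 400                   # class count is a placeholder
norm = nn.LayerNorm(D)
head = nn.Sequential(nn.Linear(D, D), nn.Tanh(), nn.Linear(D, num_classes))

z_final = torch.randn(2, 1 + 8 * 196, D)    # output of the last block L
y = norm(z_final[:, 0])                     # Eq. 10: LayerNorm of the CLS embedding
logits = head(y)                            # 1-hidden-layer MLP classifier
```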
Space-Time Self-Attention Models.

As already mentioned, we can reduce the computational cost by replacing the spatiotemporal attention of Eq. 5 with spatial attention within each frame only (Eq. 6). However, such a model neglects to capture temporal dependencies across frames. As shown in our experiments, this approach leads to degraded classification accuracy compared to full spatiotemporal attention, especially on benchmarks where strong temporal modeling is necessary.

We propose an alternative, more efficient architecture for spatiotemporal attention, named "Divided Space-Time Attention" (denoted with T+S), where temporal attention and spatial attention are separately applied one after the other. This architecture is compared to that of Space and Joint Space-Time attention in Fig. 1. A visualization of the different attention models on a video example is given in Fig. 2. For Divided Attention, within each block $\ell$, we first compute temporal attention by comparing each patch $(p,t)$ with all the patches at the same spatial location in the other frames:

$$\boldsymbol{\alpha}^{(\ell,a)\,\mathrm{time}}_{(p,t)} = \mathrm{SM}\!\left(\frac{\mathbf{q}^{(\ell,a)\top}_{(p,t)}}{\sqrt{D_h}} \cdot \left[\mathbf{k}^{(\ell,a)}_{(0,0)} \;\; \left\{\mathbf{k}^{(\ell,a)}_{(p,t')}\right\}_{t'=1,\ldots,F}\right]\right). \quad (11)$$
Attention             Params   K400   SSv2
Space                 85.9M    77.6   36.6
Joint Space-Time      85.9M    78.1   58.5
Divided Space-Time    121.4M
Sparse Local Global   121.4M   76.8   56.3
Axial                 156.8M   74.6   56.2
Table 1.
Video-level accuracy for different space-time attention schemes in TimeSformer. We evaluate the models on the validation sets of Kinetics-400 (K400) and Something-Something-V2 (SSv2). We observe that divided space-time attention achieves the best results on both datasets.
The encoding $\mathbf{z}'^{(\ell)\,\mathrm{time}}_{(p,t)}$ resulting from the application of Eq. 8 using temporal attention is then fed back for spatial attention computation instead of being passed to the MLP. In other words, new key/query/value vectors are obtained from $\mathbf{z}'^{(\ell)\,\mathrm{time}}_{(p,t)}$ and spatial attention is then computed using Eq. 6. Finally, the resulting vector $\mathbf{z}'^{(\ell)\,\mathrm{space}}_{(p,t)}$ is passed to the MLP of Eq. 9 to compute the final encoding $\mathbf{z}^{(\ell)}_{(p,t)}$ of the patch at block $\ell$. For the model of divided attention, we learn distinct query/key/value matrices $\{W^{(\ell,a)}_{Q^{\mathrm{time}}}, W^{(\ell,a)}_{K^{\mathrm{time}}}, W^{(\ell,a)}_{V^{\mathrm{time}}}\}$ and $\{W^{(\ell,a)}_{Q^{\mathrm{space}}}, W^{(\ell,a)}_{K^{\mathrm{space}}}, W^{(\ell,a)}_{V^{\mathrm{space}}}\}$ over the time and space dimensions. Note that compared to the $(NF + 1)$ comparisons per patch needed by the joint spatiotemporal attention model of Eq. 5, Divided Attention performs only $(N + F + 2)$ comparisons per patch. Our experiments demonstrate that this space-time factorization is not only more efficient but also leads to improved classification accuracy.

We have also experimented with a "Sparse Local Global" (L+G) attention model and an "Axial" (T+W+H) attention model. Their architectures are illustrated in Fig. 1, while Fig. 2 shows the patches considered for attention by these models. For each patch $(p,t)$, (L+G) first computes a local attention by considering the neighboring $F \times H/2 \times W/2$ patches and then calculates a sparse global attention over the entire clip using a stride of 2 patches along the temporal dimension and also the two spatial dimensions. Thus, it can be viewed as a faster approximation of full spatiotemporal attention using a local-global decomposition and a sparsity pattern, similar to that used in (Child et al., 2019). Finally, "Axial" attention decomposes the attention computation in three distinct steps: over time, width and height. A decomposed attention pattern over the two spatial axes of the image was proposed in (Ho et al., 2019; Huang et al., 2019; Wang et al., 2020b), and our (T+W+H) adds a third dimension (time) for the case of video. All of these models are implemented by learning distinct query/key/value matrices for each attention step.
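To make the Divided Space-Time scheme concrete, the sketch below implements one (T+S) block using standard multi-head attention layers: temporal attention over the F patches sharing a spatial location, a second attention step with distinct parameters over the N patches of each frame, and a single MLP at the end of the block. It is a simplified rendition (the classification token and exact projection details are omitted), not our reference implementation.

```python
import torch
import torch.nn as nn

class DividedSpaceTimeBlock(nn.Module):
    """One (T+S) block: temporal attention, then spatial attention, then the MLP."""
    def __init__(self, dim=768, heads=12, num_frames=8, num_patches=196):
        super().__init__()
        self.F, self.N = num_frames, num_patches
        # distinct attention parameters for the temporal and the spatial step
        self.norm_t = nn.LayerNorm(dim)
        self.attn_t = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_s = nn.LayerNorm(dim)
        self.attn_s = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_m = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, z):                        # z: (B, F*N, D), frame-major ordering
        B, _, D = z.shape
        F, N = self.F, self.N
        # temporal attention (Eq. 11): sequences of length F, one per spatial location
        zt = z.view(B, F, N, D).transpose(1, 2).reshape(B * N, F, D)
        h = self.norm_t(zt)
        zt = zt + self.attn_t(h, h, h, need_weights=False)[0]
        # spatial attention (Eq. 6): sequences of length N, one per frame, with fresh
        # query/key/value projections computed from the temporal output
        zs = zt.view(B, N, F, D).transpose(1, 2).reshape(B * F, N, D)
        h = self.norm_s(zs)
        zs = zs + self.attn_s(h, h, h, need_weights=False)[0]
        z = zs.view(B, F, N, D).reshape(B, F * N, D)
        return z + self.mlp(self.norm_m(z))      # MLP applied once at the end (Eq. 9)

out = DividedSpaceTimeBlock()(torch.randn(2, 8 * 196, 768))
```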
4. Experiments
We now evaluate TimeSformer on four popular action recognition datasets: Kinetics-400 (Carreira & Zisserman, 2017), Kinetics-600 (Carreira et al., 2018), Something-Something-V2 (Goyal et al., 2017), and Diving-48 (Li et al., 2018).
Figure 3.
We compare the video classification cost (in TFLOPs) of Joint Space-Time versus Divided Space-Time attention. We plot the number of TFLOPs as a function of spatial crop size in pixels (left), and the number of input frames (right). As we increase the spatial resolution (left) or the video length (right), our proposed divided space-time attention leads to dramatic computational savings compared to the scheme of joint space-time attention.
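The trend in Figure 3 can be anticipated with a back-of-the-envelope count of query-key comparisons, as in the sketch below (a rough proxy for attention FLOPs, not the exact measurement used in the figure):

```python
# Rough count of query-key comparisons for a clip of F frames at a given crop size with
# 16x16 patches: joint attention grows as (N*F)^2, divided attention as (N*F)*(N+F).
def comparisons(crop, frames, patch=16):
    n = (crop // patch) ** 2                  # patches per frame
    joint = (n * frames) ** 2                 # Eq. 5: every patch attends to every patch
    divided = (n * frames) * (n + frames)     # temporal (F) + spatial (N) keys per patch
    return joint, divided

for crop in (224, 336, 448, 560):
    j, d = comparisons(crop, frames=8)
    print(f"crop {crop}px: joint / divided comparisons = {j / d:.1f}x")
```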
For all our experiments, we adopt the "Base" ViT model architecture (Dosovitskiy et al., 2020) pretrained on ImageNet (Russakovsky et al., 2014). Unless differently specified, we use clips of size 8 × 224 × 224, with frames sampled at a rate of 1/32. The patch size is set to 16 × 16 pixels. During inference, unless otherwise noted, we sample a single temporal clip in the middle of the video. We use 3 spatial crops (top-left, center, bottom-right) so that the entire video clip is spatially covered. The final prediction is obtained by averaging the softmax scores of these predictions.

In Table 1, we present the results for the five proposed space-time attention schemes in TimeSformer using Kinetics-400 (K400) and Something-Something-V2 (SSv2) as benchmarks. From these results we can first notice that TimeSformer with space-only attention (S) performs well on K400. This is an interesting finding. Indeed, Sevilla-Lara et al. (2021) found that, on K400, the use of spatial information is more important than the leveraging of temporal dynamics in order to achieve strong accuracy. Here, we show that it is possible to obtain solid accuracy on K400 without any temporal modeling at all. Note, however, that space-only attention performs extremely poorly on SSv2. This stresses the importance of temporal modeling on this latter dataset. Furthermore, we observe that divided space-time attention achieves the best accuracy on both K400 and SSv2. This makes sense because, compared to joint space-time attention, divided space-time attention has a larger learning capacity (see Table 1), as it contains distinct learning parameters for temporal attention and spatial attention.

In Figure 3, we also compare the computational cost of joint space-time versus divided space-time attention when using higher spatial resolution (left) and longer (right) videos. We note that the scheme of divided space-time attention scales gracefully under both of these settings.
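A minimal sketch of the inference protocol described above (a single temporal clip, three spatial crops, and averaging of softmax scores); `model`, the crop placement, and the pre-resizing of the clip are assumptions of this illustration:

```python
import torch
import torch.nn.functional as F

def predict_clip(model, clip, crop=224):
    """clip: (3, T, H, W) tensor, already resized so that min(H, W) >= crop."""
    _, _, H, W = clip.shape
    # top-left, center and bottom-right crops together cover the whole frame
    offsets = [(0, 0), ((H - crop) // 2, (W - crop) // 2), (H - crop, W - crop)]
    scores = []
    for top, left in offsets:
        view = clip[:, :, top:top + crop, left:left + crop].unsqueeze(0)
        scores.append(F.softmax(model(view), dim=-1))
    return torch.stack(scores).mean(dim=0)   # average the softmax scores of the 3 views
```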
Figure 4.
Clip-level accuracy on Kinetics-400 as a function of spatial crop size in pixels (left), and the number of input frames (right).

In contrast, the scheme of joint space-time attention leads to a dramatically higher cost when resolution or video length is increased. In practice, joint space-time attention causes a GPU memory overflow once the spatial frame resolution reaches 448 pixels or once the number of frames is increased to 32, and thus it is effectively not applicable to large frames or long videos. Thus, despite a larger number of parameters, divided space-time attention is more efficient than joint space-time attention when operating on higher spatial resolution or longer videos. For all subsequent experiments we therefore use a TimeSformer constructed with divided space-time self-attention blocks.

As discussed above, the scalability of our model allows it to operate at higher spatial resolution and on longer videos compared to most 3D CNNs. We note that both of these aspects affect the length of the sequence of tokens fed to the Transformer. Specifically, increasing the spatial resolution results in a higher number of patches (N) per frame. The number of input tokens is also increased when using more frames. To investigate the benefits, we conduct an empirical study where we separately increase the number of tokens along each of these two axes. We report the findings in Figure 4. We see that increasing the spatial resolution (up to a certain point) leads to a substantial boost in performance. Similarly, we observe that increasing the length of the input clip leads to consistent accuracy gains. Due to GPU memory constraints, we are not able to test our model on clips longer than 96 frames. Still, we would like to point out that using clips of 96 frames is a significant departure from current convolutional models, which are typically limited to processing inputs of 8-32 frames.

In their recent work, Dosovitskiy et al. (2020) demonstrated that ViT is most effective when trained on very large-scale datasets. In this section, we investigate whether the same trend holds for our TimeSformer model. First, we attempted to train TimeSformer on video datasets directly, without ImageNet pretraining.
Figure 5.
We study accuracy on Kinetics-400 (K400) and Something-Something-V2 (SSv2) as a function of the number of training videos. On K400, TimeSformer performs best in all cases. On SSv2, which requires more complicated temporal reasoning, TimeSformer outperforms the other models only when using enough training videos. All models are pretrained on ImageNet.

For these experiments, we followed the training-from-scratch protocol of Touvron et al. (2020) and we also evaluated some variants of it. However, the model failed to learn meaningful features. We suspect that a new optimization strategy or perhaps different hyperparameter values may be needed for training TimeSformer from scratch on video data. Thus, for all subsequent studies we continue to use ImageNet pretraining.

To understand the effects of video-data scale on performance, we train our model on different subsets of K400 and SSv2: 25%, 50%, 75%, and 100% of the full datasets. We show these results in Figure 5, where we also compare our method with SlowFast R50 (Feichtenhofer et al., 2019b) and I3D R50 (Carreira & Zisserman, 2017) trained on the same subsets. We note that all 3 architectures are pretrained on ImageNet (Russakovsky et al., 2014). The results of Figure 5 show that, on K400, TimeSformer outperforms the other models for all training subsets. However, we observe a different trend on SSv2, where TimeSformer is the strongest model only when trained on 75% or 100% of the full data. This may be explained by the fact that, compared to K400, SSv2 requires learning more complex temporal patterns, and thus more examples may be needed by TimeSformer to learn those patterns effectively.

We now compare TimeSformer to the state-of-the-art on several popular action recognition datasets. We use three variants of our model:
(1) TimeSformer, the default version of our model, operating on 8 × 224 × 224 video clips; (2) TimeSformer-HR, a high spatial resolution variant that operates on 16 × 448 × 448 video clips; and lastly (3) TimeSformer-L, a long-range configuration of our model that operates on 96 × 224 × 224 video clips with frames sampled at a rate of 1/4.

Kinetics-400.
In Table 2, we report our results on the validation set of K400. In addition to the accuracy metrics, we include inference cost, given in TFLOPs.
Method                                    Top-1   Top-5   TFLOPs
ARTNet (Wang et al., 2018a)               69.2    88.3    6.0
I3D (Carreira & Zisserman, 2017)          71.1    89.3    N/A
R(2+1)D (Tran et al., 2018)               72.0    90.0    17.5
MFNet (Chen et al., 2018b)                72.8    90.4    N/A
Inception-ResNet (Bian et al., 2017)      73.0    90.9    N/A
bLVNet (Fan et al., 2019)                 73.5    91.2    0.84
A²-Net (Chen et al., 2018c)               74.6    91.5    N/A
TSM (Lin et al., 2019)                    74.7    N/A     N/A
S3D-G (Xie et al., 2018)                  74.7    93.4    N/A
Oct-I3D+NL (Chen et al., 2019a)           75.7    N/A     0.84
D3D (Stroud et al., 2020)                 75.9    N/A     N/A
GloRe (Chen et al., 2019b)                76.1    N/A     N/A
I3D+NL (Wang et al., 2018b)               77.7    93.3    10.8
ip-CSN-152 (Tran et al., 2019)            77.8    92.8    3.2
CorrNet (Wang et al., 2020a)              79.2    N/A     6.7
LGD-3D-101 (Qiu et al., 2019)             79.4    94.4    N/A
SlowFast (Feichtenhofer et al., 2019b)    79.8    93.9    7.0
X3D-XXL (Feichtenhofer, 2020)             80.4    94.6    5.8
TimeSformer
TimeSformer-L                             80.7    94.7
Table 2.
Video-level accuracy on Kinetics-400. TimeSformer-L achieves the best reported accuracy.

We note that, whereas most previous methods use multiple temporal clips with several spatial crops during inference, TimeSformer achieves solid accuracy with only 3 views (3 spatial crops), which reduces the inference cost. Our long-range variant, TimeSformer-L, achieves a top-1 accuracy of 80.7%, outperforming all prior methods. Furthermore, our default TimeSformer has the lowest inference cost among recent state-of-the-art models. Yet, it still provides solid accuracy, outperforming many much more costly models.

In Figure 6, we study the effect of using multiple temporal clips during inference (each with a single spatial crop). We plot accuracy as a function of the number of temporal clips used for testing. We compare our model against X3D (Feichtenhofer, 2020) and SlowFast (Feichtenhofer et al., 2019b). X3D and SlowFast require multiple clips to approach their top accuracy. Conversely, our long-range variant, TimeSformer-L, does not benefit at all from using multiple clips. This makes sense since it is able to span about 12 seconds of a Kinetics video with a single clip.

We also note that TimeSformer is much faster to train than state-of-the-art 3D CNNs, even when both types of models are pretrained on ImageNet. For instance, training a SlowFast R50 or an I3D R50 on K400 under comparable settings takes several times longer than training TimeSformer on the same number of GPUs, cutting the training time by a large factor compared to both 3D CNNs.

Kinetics-600.
In Table 3, we also present our results on Kinetics-600. Just like on Kinetics-400, we observe that our method outperforms all prior works on this benchmark.
Figure 6.
Video-level accuracy on Kinetics-400 vs. the number of temporal clips used during inference. TimeSformer-L achieves excellent accuracy using a small number of sampled clips, which leads to strong performance at low inference cost.
Method                                    Top-1   Top-5
I3D-R50+Cell (Wang et al., 2020c)         79.8    94.4
LGD-3D-101 (Qiu et al., 2019)             81.5    95.6
SlowFast (Feichtenhofer et al., 2019b)    81.8    95.1
X3D-XL (Feichtenhofer, 2020)              81.9    95.5
TimeSformer
TimeSformer-HR                            82.4    96.0
TimeSformer-L
Table 3.
Video-level accuracy on Kinetics-600. TimeSformer-HR achieves the best reported accuracy.
Interestingly, we note that in this case, the best performing variant of our model is TimeSformer-HR.
Something-Something-V2.
In Table 4, we also validate our model on SSv2. Our results suggest that TimeSformer achieves lower accuracy than the best models on this dataset. However, considering that our model uses a completely different design, we take these results as suggesting that TimeSformer is a promising approach even for challenging temporally-heavy datasets, such as SSv2.
Diving-48.
Finally, in Table 4, we present our method on another "temporally-heavy" dataset, Diving-48. Due to a recently discovered issue with a previous version of the Diving-48 labels, here we only compare our method with a reproduced SlowFast 16×8 R101 model. Our results show that TimeSformer outperforms SlowFast by a substantial margin.
In this subsection, we evaluate TimeSformer on the task of long-term video modeling using the HowTo100M dataset (Miech et al., 2019). HowTo100M is a large-scale instructional video dataset. It contains around 1M instructional Web videos showing humans performing over 23K different tasks, such as cooking, repairing, knitting, and making arts. The average duration of these videos is several minutes, which is orders of magnitude longer than the duration of videos in standard action recognition benchmarks. Each HowTo100M video has a label indicating the task demonstrated in the video (one out of the 23K classes), which can be used for supervised training. Thus, it is a good benchmark to assess the ability of a model to recognize activities exhibited over very long temporal extents.
Method                                    SSv2    Diving-48∗∗
SlowFast (Feichtenhofer et al., 2019b)    61.7    77.6
TSM (Lin et al., 2019)                    63.4    N/A
STM (Jiang et al., 2019)                  64.2    N/A
MSNet (Kwon et al., 2020)                 64.7    N/A
TEA (Li et al., 2020b)                    65.1    N/A
bLVNet (Fan et al., 2019)                         N/A
TimeSformer
TimeSformer-HR
TimeSformer-L

Table 4. Video-level accuracy on Something-Something-V2 and Diving-48. ∗∗Due to an issue with the Diving-48 labels used in previously published results, we only compare our method with a reproduced SlowFast 16×8 R101 model.

Table 5.
Long-term task classification on HowTo100M. Given a video spanning several minutes, the goal is to predict the long-term task demonstrated in the video (e.g., cooking breakfast, cleaning house, etc.). We evaluate a few variants of SlowFast and TimeSformer on this task. "Single Clip Coverage" denotes the number of seconds spanned by a single clip.

For this evaluation, we consider only categories that have a sufficient number of video examples. This gives a subset of HowTo100M, which we randomly partition into a training and a testing set of videos. We present our results in Table 5. As our baselines, we use two variants of SlowFast R50 that operate on the same number of input frames but with different temporal sampling rates, so that the second variant spans a considerably longer temporal extent per clip than the first. For TimeSformer, we test four variants, all operating on video clips sampled at the same frame rate but containing an increasing number of frames. During inference, for each method, we sample as many non-overlapping temporal clips as needed to cover the full temporal extent of an input video; e.g., if a single clip spans 10 seconds, we would sample 30 clips to cover a 300-second video. Video-level classification is done by averaging the clip predictions.
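A sketch of this video-level evaluation protocol (non-overlapping clips covering the full video, with averaged predictions); `model`, the frame-indexing scheme, and the padding of the last clip are assumptions of this illustration:

```python
import torch
import torch.nn.functional as F

def predict_long_video(model, video, frames_per_clip, stride):
    """video: (3, T, H, W); consecutive non-overlapping clips cover all T frames."""
    T = video.shape[1]
    clip_span = frames_per_clip * stride                 # frames covered by one clip
    scores = []
    for start in range(0, T, clip_span):
        idx = (torch.arange(frames_per_clip) * stride + start).clamp(max=T - 1)
        clip = video[:, idx].unsqueeze(0)                # last clip padded by repetition
        scores.append(F.softmax(model(clip), dim=-1))
    return torch.stack(scores).mean(dim=0)               # video-level prediction
```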
Figure 7. Visualization of space-time attention from the output token to the input space on Something-Something-V2. Our model learns to focus on the relevant parts in the video in order to perform spatiotemporal reasoning.
Figure 8.
Feature visualization with t-SNE (van der Maaten & Hinton, 2008) on Something-Something-V2. Each video is visualized as a point. Videos belonging to the same action category have the same color. The TimeSformer with divided space-time attention learns semantically more separable features than the TimeSformer with space-only attention or ViT (Dosovitskiy et al., 2020).

From the results in Table 5 we first note that, for the same single clip coverage, TimeSformer outperforms the corresponding SlowFast by a large margin. Second, we observe that longer-range TimeSformers do better, and our longest-range variant achieves the best video-level classification accuracy. These results suggest that our model is highly suitable for tasks that require long-term video modeling. Finally, we note that the long-range modeling approach of TimeSformer is orthogonal to long-term schemes designed to operate on top of video backbones (Wu et al., 2019; Girdhar et al., 2017), and thus further gains may come from a combination of them.
5. Conclusion
In this work, we introduced TimeSformer, a fundamentally different approach to video modeling compared to the established paradigm of convolution-based video networks. We showed that it is possible to design an effective and scalable video architecture built exclusively on space-time self-attention. Our method (1) is conceptually simple, (2) achieves state-of-the-art results on major action recognition benchmarks, (3) has low inference cost, and (4) is suitable for long-term video modeling. In the future, we plan to extend our method to other video analysis tasks such as action localization, video captioning and question-answering.
References
Ba, L. J., Kiros, J. R., and Hinton, G. E. Layer normalization. CoRR, 2016.

Bello, I., Zoph, B., Le, Q., Vaswani, A., and Shlens, J. Attention augmented convolutional networks. 2019.

Bertasius, G. and Torresani, L. Classifying, segmenting, and tracking object instances in video with mask propagation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.

Bian, Y., Gan, C., Liu, X., Li, F., Long, X., Li, Y., Qi, H., Zhou, J., Wen, S., and Lin, Y. Revisiting the effectiveness of off-the-shelf temporal modeling approaches for large-scale video classification. arXiv: Computer Vision and Pattern Recognition, 2017.

Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. Language models are few-shot learners. 2020.

Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., and Zagoruyko, S. End-to-end object detection with transformers. In European Conference on Computer Vision (ECCV), 2020.

Carreira, J. and Zisserman, A. Quo vadis, action recognition? A new model and the kinetics dataset. 2017.

Carreira, J., Noland, E., Banki-Horvath, A., Hillier, C., and Zisserman, A. A short note about kinetics-600. CoRR, 2018.

Chen, M. X., Firat, O., Bapna, A., Johnson, M., Macherey, W., Foster, G., Jones, L., Schuster, M., Shazeer, N., Parmar, N., Vaswani, A., Uszkoreit, J., Kaiser, L., Chen, Z., Wu, Y., and Hughes, M. The best of both worlds: Combining recent advances in neural machine translation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, 2018a.

Chen, Y., Kalantidis, Y., Li, J., Yan, S., and Feng, J. Multi-fiber networks for video recognition. ECCV, 2018b.

Chen, Y., Kalantidis, Y., Li, J., Yan, S., and Feng, J. A^2-nets: Double attention networks. In Advances in Neural Information Processing Systems 31, 2018c.

Chen, Y., Fan, H., Xu, B., Yan, Z., Kalantidis, Y., Rohrbach, M., Yan, S., and Feng, J. Drop an octave: Reducing spatial redundancy in convolutional neural networks with octave convolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019a.

Chen, Y., Rohrbach, M., Yan, Z., Shuicheng, Y., Feng, J., and Kalantidis, Y. Graph-based global reasoning networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019b.

Child, R., Gray, S., Radford, A., and Sutskever, I. Generating long sequences with sparse transformers. CoRR, 2019.

Cordonnier, J., Loukas, A., and Jaggi, M. On the relationship between self-attention and convolutional layers. 2020.

Dai, Z., Yang, Z., Yang, Y., Carbonell, J., Le, Q., and Salakhutdinov, R. Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019.

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby, N. An image is worth 16x16 words: Transformers for image recognition at scale. CoRR, 2020.

Fan, Q., Chen, C.-F. R., Kuehne, H., Pistoia, M., and Cox, D. More is less: Learning efficient video representations by big-little network and depthwise temporal aggregation. In Advances in Neural Information Processing Systems, volume 32, 2019.

Feichtenhofer, C. X3D: Expanding architectures for efficient video recognition. CVPR, pp. 200–210, 2020.
Feichtenhofer, C., Fan, H., Malik, J., and He, K. Slowfast networks for video recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019a.

Feichtenhofer, C., Fan, H., Malik, J., and He, K. Slowfast networks for video recognition. 2019b.

Gavrilyuk, K., Sanford, R., Javan, M., and Snoek, C. G. M. Actor-transformers for group activity recognition. 2020.

Girdhar, R., Ramanan, D., Gupta, A., Sivic, J., and Russell, B. ActionVLAD: Learning spatio-temporal aggregation for action classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.

Girdhar, R., Carreira, J., Doersch, C., and Zisserman, A. Video action transformer network. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2019.

Goyal, R., Kahou, S. E., Michalski, V., Materzynska, J., Westphal, S., Kim, H., Haenel, V., Fründ, I., Yianilos, P., Mueller-Freitag, M., Hoppe, F., Thurau, C., Bax, I., and Memisevic, R. The "something something" video database for learning and evaluating visual common sense. CoRR, 2017.

Ho, J., Kalchbrenner, N., Weissenborn, D., and Salimans, T. Axial attention in multidimensional transformers. CoRR, 2019.

Hu, H., Gu, J., Zhang, Z., Dai, J., and Wei, Y. Relation networks for object detection. 2018.

Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., and Liu, W. CCNet: Criss-cross attention for semantic segmentation. 2019.

Jiang, B., Wang, M., Gan, W., Wu, W., and Yan, J. STM: Spatiotemporal and motion encoding for action recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019.

Kwon, H., Kim, M., Kwak, S., and Cho, M. MotionSqueeze: Neural motion feature learning for video understanding. In ECCV, 2020.

Le, H., Sahoo, D., Chen, N., and Hoi, S. Multimodal transformer networks for end-to-end video-grounded dialogue systems. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019.

Li, L., Chen, Y.-C., Cheng, Y., Gan, Z., Yu, L., and Liu, J. HERO: Hierarchical encoder for video+language omni-representation pre-training. arXiv preprint arXiv:2005.00200, 2020a.

Li, Y., Li, Y., and Vasconcelos, N. RESOUND: Towards action recognition without representation bias. In The European Conference on Computer Vision (ECCV), September 2018.

Li, Y., Ji, B., Shi, X., Zhang, J., Kang, B., and Wang, L. TEA: Temporal excitation and aggregation for action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020b.

Lin, J., Gan, C., and Han, S. TSM: Temporal shift module for efficient video understanding. In Proceedings of the IEEE International Conference on Computer Vision, 2019.

Miech, A., Zhukov, D., Alayrac, J.-B., Tapaswi, M., Laptev, I., and Sivic, J. HowTo100M: Learning a text-video embedding by watching hundred million narrated video clips. In ICCV, 2019.

Ott, M., Edunov, S., Grangier, D., and Auli, M. Scaling neural machine translation. In Proceedings of the Third Conference on Machine Translation: Research Papers, 2018.

Parmar, N., Vaswani, A., Uszkoreit, J., Kaiser, L., Shazeer, N., Ku, A., and Tran, D. Image transformer. In Dy, J. G. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, ICML, 2018.

Qiu, Z., Yao, T., Ngo, C.-W., Tian, X., and Mei, T. Learning spatio-temporal representation with local and global diffusion. In CVPR, 2019.

Radford, A., Narasimhan, K., Salimans, T., and Sutskever, I. Improving language understanding by generative pre-training. 2018.

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. Language models are unsupervised multitask learners. 2019.

Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. J. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683, 2019.

Ramachandran, P., Parmar, N., Vaswani, A., Bello, I., Levskaya, A., and Shlens, J. Stand-alone self-attention in vision models. In Advances in Neural Information Processing Systems, pp. 68–80, 2019.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., and Fei-Fei, L. ImageNet large scale visual recognition challenge. arXiv:1409.0575, 2014.

Sevilla-Lara, L., Zha, S., Yan, Z., Goswami, V., Feiszli, M., and Torresani, L. Only time can tell: Discovering temporal data for temporal modeling. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pp. 535–544, January 2021.

Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.

Stroud, J., Ross, D., Sun, C., Deng, J., and Sukthankar, R. D3D: Distilled 3D networks for video action recognition. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), March 2020.

Sun, C., Myers, A., Vondrick, C., Murphy, K., and Schmid, C. VideoBERT: A joint model for video and language representation learning, 2019.

Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. Going deeper with convolutions. In Computer Vision and Pattern Recognition (CVPR), 2015.

Teed, Z. and Deng, J. RAFT: Recurrent all-pairs field transforms for optical flow. In Computer Vision - ECCV 2020 - 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part II, 2020.

Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., and Jégou, H. Training data-efficient image transformers and distillation through attention. arXiv preprint arXiv:2012.12877, 2020.

Tran, D., Wang, H., Torresani, L., Ray, J., LeCun, Y., and Paluri, M. A closer look at spatiotemporal convolutions for action recognition. 2018.

Tran, D., Wang, H., Feiszli, M., and Torresani, L. Video classification with channel-separated convolutional networks. ICCV, pp. 5551–5560, 2019.

van der Maaten, L. and Hinton, G. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, 2017a.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems 30, 2017b.

Wang, H., Tran, D., Torresani, L., and Feiszli, M. Video modeling with correlation networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020a.

Wang, H., Zhu, Y., Green, B., Adam, H., Yuille, A. L., and Chen, L. Axial-DeepLab: Stand-alone axial-attention for panoptic segmentation. In Computer Vision - ECCV 2020 - 16th European Conference, 2020b.

Wang, X., Girshick, R., Gupta, A., and He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018a.

Wang, X., Girshick, R. B., Gupta, A., and He, K. Non-local neural networks. 2018b.

Wang, X., Xiong, X., Neumann, M., Piergiovanni, A. J., Ryoo, M. S., Angelova, A., Kitani, K. M., and Hua, W. AttentionNAS: Spatiotemporal attention cell search for video classification. In Computer Vision - ECCV 2020 - 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part VIII, 2020c.

Weissenborn, D., Täckström, O., and Uszkoreit, J. Scaling autoregressive video models. 2020.

Wu, C.-Y., Feichtenhofer, C., Fan, H., He, K., Krahenbuhl, P., and Girshick, R. Long-term feature banks for detailed video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.

Xie, S., Sun, C., Huang, J., Tu, Z., and Murphy, K. Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part XV, pp. 318–335, 2018. URL https://doi.org/10.1007/978-3-030-01267-0_19.

Yang, Z., Garcia, N., Chu, C., Otani, M., Nakashima, Y., and Takemura, H. BERT representations for video question answering. In The IEEE Winter Conference on Applications of Computer Vision, 2020.

Zhao, H., Jia, J., and Koltun, V. Exploring self-attention for image recognition. 2020.

Zhou, L., Zhou, Y., Corso, J. J., Socher, R., and Xiong, C. End-to-end dense video captioning with masked transformer. In