A Comprehensive Study of Deep Video Action Recognition
Yi Zhu, Xinyu Li, Chunhui Liu, Mohammadreza Zolfaghari, Yuanjun Xiong, Chongruo Wu, Zhi Zhang, Joseph Tighe, R. Manmatha, Mu Li
Amazon Web Services
{yzaws, xxnl, chunhliu, mozolf, yuanjx, chongrwu, zhiz, tighej, manmatha, mli}@amazon.com

Abstract
Video action recognition is one of the representative tasks for video understanding. Over the last decade, we have witnessed great advancements in video action recognition thanks to the emergence of deep learning. But we have also encountered new challenges, including modeling long-range temporal information in videos, high computation costs, and incomparable results due to dataset and evaluation protocol variances. In this paper, we provide a comprehensive survey of over 200 existing papers on deep learning for video action recognition. We first introduce the 17 video action recognition datasets that influenced the design of models. Then we present video action recognition models in chronological order: starting with early attempts at adapting deep learning, then the two-stream networks, followed by the adoption of 3D convolutional kernels, and finally the recent compute-efficient models. In addition, we benchmark popular methods on several representative datasets and release code for reproducibility. In the end, we discuss open problems and shed light on opportunities for video action recognition to facilitate new research ideas.
1. Introduction
One of the most important tasks in video understanding is to understand human actions. It has many real-world applications, including behavior analysis, video retrieval, human-robot interaction, gaming, and entertainment. Human action understanding involves recognizing, localizing, and predicting human behaviors. The task of recognizing human actions in a video is called video action recognition. In Figure 1, we visualize several video frames with the associated action labels, which are typical human daily activities such as shaking hands and riding a bike.

Over the last decade, there has been growing research interest in video action recognition with the emergence of high-quality large-scale action recognition datasets. We summarize the statistics of popular action recognition datasets in Figure 2.
Figure 1. Visual examples of categories in popular video action datasets.
Figure 2. Statistics of the most popular video action recognition datasets from the past 10 years. The area of a circle represents the scale of each dataset (i.e., the number of videos).

We see that both the number of videos and the number of classes increase rapidly, e.g., from 7K videos over 51 classes in HMDB51 [109] to 8M videos over thousands of classes in YouTube8M [1]. Also, the rate at which new datasets are released is increasing: 3 datasets were released from 2011 to 2015 compared to 13 released from 2016 to 2020.

Thanks to both the availability of large-scale datasets and the rapid progress in deep learning, there has also been rapid growth in deep learning based models for recognizing video actions.
Figure 3. A chronological overview of recent representative work in video action recognition: DeepVideo (Karpathy et al.), Two-Stream Networks (Simonyan et al.), TDD (Wang et al.), C3D (Tran et al.), Fusion (Feichtenhofer et al.), LRCN (Donahue et al.), Beyond-Short-Snippets (Ng et al.), TSN (Wang et al.), ActionVLAD (Girdhar et al.), I3D (Carreira et al.), P3D (Qiu et al.), TRN (Zhou et al.), SlowFast (Feichtenhofer et al.), X3D (Feichtenhofer et al.), TSM (Lin et al.), Non-local (Wang et al.), R2+1D (Tran et al.), R3D (Kataoka et al.), S3D (Xie et al.), V4D (Zhang et al.), TEA (Li et al.), CSN (Tran et al.), TLE (Diba et al.), CoViAR (Wu et al.), AssembleNet (Ryoo et al.), TPN (Yang et al.), LTC (Varol et al.), ECO (Zolfaghari et al.), and Hidden TSN (Zhu et al.).

In Figure 3, we present a chronological overview of recent representative work. DeepVideo [99] is one of the earliest attempts to apply convolutional neural networks to videos. We observe three trends. The first trend, started by the seminal paper on Two-Stream Networks [187], adds a second path to learn the temporal information in a video by training a convolutional neural network on the optical flow stream. Its great success inspired a large number of follow-up papers, such as TDD [214], LRCN [37], Fusion [50], TSN [218], etc. The second trend was the use of 3D convolutional kernels to model video temporal information, as in I3D [14], R3D [74], S3D [239], Non-local [219], SlowFast [45], etc. Finally, the third trend focused on computational efficiency in order to scale to even larger datasets so that the models could be adopted in real applications. Examples include Hidden TSN [278], TSM [128], X3D [44], TVN [161], etc.

Despite the large number of deep learning based models for video action recognition, there is no comprehensive survey dedicated to these models. Previous survey papers either put more effort into hand-crafted features [77, 173] or focus on broader topics such as video captioning [236], video prediction [104], video action detection [261] and zero-shot video action recognition [96]. In this paper:

• We comprehensively review over 200 papers on deep learning for video action recognition. We walk the readers through the recent advancements chronologically and systematically, with popular papers explained in detail.

• We benchmark widely adopted methods on the same set of datasets in terms of both accuracy and efficiency. We also release our implementations for full reproducibility.

• We elaborate on challenges, open problems, and opportunities in this field to facilitate future research.

The rest of the survey is organized as follows. We first describe popular datasets used for benchmarking and existing challenges in section 2. Then we present recent advancements using deep learning for video action recognition in section 3, which is the major contribution of this survey. In section 4, we evaluate widely adopted approaches on standard benchmark datasets, and we provide discussions and future research opportunities in section 5.
2. Datasets and Challenges
Deep learning methods usually improve in accuracy when the volume of the training data grows. In the case of video action recognition, this means we need large-scale annotated datasets to learn effective models.

For the task of video action recognition, datasets are often built by the following process: (1) define an action list, by combining labels from previous action recognition datasets and adding new categories depending on the use case; (2) obtain videos from various sources, such as YouTube and movies, by matching the video title/subtitle to the action list; (3) provide temporal annotations manually to indicate the start and end position of the action; and (4) finally clean up the dataset by de-duplication and filtering out noisy classes/samples. Below we review the most popular large-scale video action recognition datasets in Table 1 and Figure 2.

Model zoo in both PyTorch and MXNet: https://cv.gluon.ai/model_zoo/action_recognition.html

Table 1. A list of popular datasets for video action recognition.
HMDB51 [109] was introduced in 2011. It was collected mainly from movies, with a small proportion from public databases such as the Prelinger archive, YouTube and Google videos. The dataset contains about 7K clips divided into 51 action categories, each containing a minimum of 101 clips. The dataset has three official splits. Most previous papers either report the top-1 classification accuracy on split 1 or the average accuracy over the three splits.

UCF101 [190] was introduced in 2012 and is an extension of the previous UCF50 dataset. It contains 13,320 videos from YouTube spread over 101 categories of human actions. The dataset has three official splits similar to HMDB51, and is evaluated in the same manner.

Sports1M [99] was introduced in 2014 as the first large-scale video action dataset, consisting of more than 1 million YouTube videos annotated with 487 sports classes. The categories are fine-grained, which leads to low inter-class variations. It has an official 10-fold cross-validation split for evaluation.
ActivityNet [40] was originally introduced in 2015 and the ActivityNet family has had several versions since its initial launch. The most recent, ActivityNet 200 (V1.3), contains 200 human daily living actions. It has roughly 10K training, 5K validation, and 5K testing videos. On average there are 137 untrimmed videos per class and 1.41 activity instances per video.

YouTube8M [1] was introduced in 2016 and is by far the largest-scale video dataset; it contains 8 million YouTube videos (500K hours of video in total) annotated with thousands of action classes. Each video is annotated with one or multiple labels by a YouTube video annotation system. The dataset is split into training, validation and test sets in the ratio 70:20:10. The validation set is also extended with human-verified segment annotations to provide temporal localization information.

Charades [186] was introduced in 2016 as a dataset for real-life concurrent action understanding. It contains 9,848 videos with an average length of 30 seconds. The dataset includes 157 multi-label daily indoor activities, performed by 267 different people. It has an official train-validation split with about 8K videos for training and the remaining roughly 1.8K for validation.

Kinetics Family is now the most widely adopted benchmark. Kinetics400 [100] was introduced in 2017 and consists of approximately 240K training and 20K validation videos trimmed to 10 seconds from 400 human action categories. The Kinetics family continues to expand, with Kinetics-600 [12] released in 2018 with 480K videos and Kinetics-700 [13] in 2019 with 650K videos.

Something-Something [69] V1 was introduced in 2017 and V2 was introduced in 2018. This family is another popular benchmark that consists of 174 action classes describing humans performing basic actions with everyday objects. There are around 110K videos in V1 and 220K videos in V2. Note that the Something-Something dataset requires strong temporal modeling because most activities cannot be inferred based on spatial features alone (e.g., opening something, covering something with something).

AVA [70] was introduced in 2017 as the first large-scale spatio-temporal action detection dataset. It contains 430 15-minute video clips with 80 atomic action labels (only 60 labels are used for evaluation). The annotations are provided at each key-frame, which leads to about 211K training, 57K validation and 118K testing samples. The AVA dataset was recently expanded to AVA-Kinetics [117], which adds AVA-style annotations to Kinetics videos and greatly increases the number of training, validation and testing samples.

Moments in Time [142] was introduced in 2018 and is a large-scale dataset designed for event understanding. It contains one million 3-second video clips, annotated with a dictionary of 339 classes. Different from other datasets designed for human action understanding, the Moments in Time dataset involves people, animals, objects and natural phenomena. The dataset was extended to Multi-Moments in Time (M-MiT) [143] in 2019 by increasing the number of videos to 1.02 million, pruning vague classes, and increasing the number of labels per video.
HACS [267] was introduced in 2019 as a new large-scale dataset for the recognition and localization of human actions collected from Web videos. It consists of two kinds of manual annotations. HACS Clips contains 1.55M 2-second clip annotations on 504K videos, and HACS Segments has 140K complete action segments (from action start to end) on 50K videos. The videos are annotated with the same 200 human action classes used in ActivityNet (V1.3) [40].

HVU [34] was released in 2020 for multi-label multi-task video understanding. This dataset has 572K videos annotated with thousands of labels. The official split has 481K, 31K and 65K videos for train, validation, and test respectively. The dataset has six task categories: scene, object, action, event, attribute, and concept. On average, there are a few thousand samples for each label, and the duration of the videos varies from clip to clip.

AViD [165] was introduced in 2020 as a dataset for anonymized action recognition. It contains 410K videos for training and 40K videos for testing. Each video clip duration is between 3 and 15 seconds and in total there are 887 action classes. During data collection, the authors tried to collect data from various countries to deal with data bias. They also removed face identities to protect the privacy of video makers. Therefore, the AViD dataset might not be a proper choice for recognizing face-related actions.

Before we dive into the chronological review of methods, we present several visual examples from the above datasets in Figure 4 to show their different characteristics. In the top two rows, we pick action classes from the UCF101 [190] and Kinetics400 [100] datasets. Interestingly, we find that these actions can sometimes be determined by the context or scene alone. For example, the model can predict the action riding a bike as long as it recognizes a bike in the video frame. The model may also predict the action cricket bowling if it recognizes the cricket pitch. Hence, for these classes, video action recognition may become an object/scene classification problem without the need to reason about motion/temporal information. In the middle two rows, we pick action classes from the Something-Something dataset [69]. This dataset focuses on human-object interaction, thus it is more fine-grained and requires strong temporal modeling. For example, if we only look at the first frame of dropping something and picking something up without looking at the other video frames, it is impossible to tell these two actions apart. In the bottom row, we pick action classes from the Moments in Time dataset [142]. This dataset is different from most video action recognition datasets, and is designed to have large inter-class and intra-class variation representing dynamical events at different levels of abstraction. For example, the action climbing can have different actors (a person or an animal) in different environments (stairs or a tree).
There are several major challenges in developing effective video action recognition algorithms.

In terms of datasets, first, defining the label space for training action recognition models is non-trivial, because human actions are usually composite concepts and the hierarchy of these concepts is not well defined. Second, annotating videos for action recognition is laborious (e.g., annotators need to watch all the video frames) and ambiguous (e.g., it is hard to determine the exact start and end of an action).
Figure 4. Visual examples from popular video action datasets. Top: individual video frames from action classes in UCF101 and Kinetics400. A single frame from these scene-focused datasets often contains enough information to correctly guess the category. Middle: consecutive video frames from classes in Something-Something. The 2nd and 3rd frames are made transparent to indicate the importance of temporal reasoning: we cannot tell these two actions apart by looking at the 1st frame alone. Bottom: individual video frames from classes in Moments in Time. The same action can have different actors in different environments.

Third, some popular benchmark datasets (e.g., the Kinetics family) only release the video links for users to download and not the actual videos, which leads to a situation in which methods are evaluated on different data. This makes it impossible to do fair comparisons between methods and gain insights.

In terms of modeling, first, videos capturing human actions have both strong intra- and inter-class variations. People can perform the same action at different speeds under various viewpoints. Besides, some actions share similar movement patterns that are hard to distinguish. Second, recognizing human actions requires simultaneous understanding of both short-term action-specific motion information and long-range temporal information. We might need a sophisticated model to handle different perspectives rather than a single convolutional neural network. Third, the computational cost is high for both training and inference, hindering both the development and deployment of action recognition models. In the next section, we demonstrate how video action recognition methods developed over the last decade to address the aforementioned challenges.

3. An Odyssey of Using Deep Learning for Video Action Recognition
In this section, we review deep learning based methods for video action recognition from 2014 to the present and introduce the related earlier work in context.
Despite there being some papers using Convolutional Neural Networks (CNNs) for video action recognition [200, 5, 91], hand-crafted features [209, 210, 158, 112], particularly Improved Dense Trajectories (IDT) [210], dominated the video understanding literature before 2015, due to their high accuracy and good robustness. However, hand-crafted features have a heavy computational cost [244], and are hard to scale and deploy.

With the rise of deep learning [107], researchers started to adapt CNNs to video problems. The seminal work DeepVideo [99] proposed to use a single 2D CNN model on each video frame independently and investigated several temporal connectivity patterns to learn spatio-temporal features for video action recognition, such as late fusion, early fusion and slow fusion. Though this model made early progress with ideas that would prove to be useful later, such as a multi-resolution network, its transfer learning performance on UCF101 [190] (65.4%, see Table 2) was substantially lower than that of hand-crafted IDT features. Furthermore, DeepVideo [99] found that a network fed individual video frames performs equally well when the input is changed to a stack of frames. This observation might indicate that the learnt spatio-temporal features did not capture the motion well. It also encouraged people to think about why CNN models did not outperform traditional hand-crafted features in the video domain, unlike in other computer vision tasks [107, 171].

Since video understanding intuitively needs motion information, finding an appropriate way to describe the temporal relationship between frames is essential to improving the performance of CNN-based video action recognition.

Optical flow [79] is an effective motion representation to describe object/scene movement. To be precise, it is the pattern of apparent motion of objects, surfaces, and edges in a visual scene caused by the relative motion between an observer and a scene. We show several visualizations of optical flow in Figure 5. As we can see, optical flow is able to describe the motion pattern of each action accurately. The advantage of using optical flow is that it provides information orthogonal to the RGB image. For example, the two images at the bottom of Figure 5 have cluttered backgrounds. Optical flow can effectively remove the non-moving background and result in a simpler learning problem compared to using the original RGB images as input.
Figure 5. Visualizations of optical flow. We show four image-flow pairs; on the left is the original RGB image and on the right is the optical flow estimated by FlowNet2 [85]. The color of the optical flow indicates the direction of motion, and we follow the color coding scheme of FlowNet2 [85] as shown in the top right.
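To make the pre-computation of optical flow concrete, the sketch below estimates dense flow for consecutive frame pairs with OpenCV's Farneback algorithm and linearly rescales the two motion components to [0, 255], roughly as the temporal-stream input is prepared below. The video path, the clipping bound, and the choice of Farneback (papers more commonly use TV-L1 or FlowNet2 [85]) are illustrative assumptions, not the survey's released pipeline.

```python
import cv2
import numpy as np

def extract_flow_frames(video_path, bound=20.0):
    """Estimate dense optical flow for every consecutive frame pair and
    rescale the x/y components to [0, 255], mimicking how temporal-stream
    inputs are typically stored on disk (video_path is a placeholder)."""
    cap = cv2.VideoCapture(video_path)
    ok, prev = cap.read()
    prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    flow_frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Farneback is a convenient stand-in for stronger estimators.
        flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        # Clip motions to [-bound, bound] and linearly rescale to [0, 255].
        flow = np.clip(flow, -bound, bound)
        flow = ((flow + bound) * (255.0 / (2 * bound))).astype(np.uint8)
        flow_frames.append(flow)  # H x W x 2 (horizontal, vertical)
        prev_gray = gray
    cap.release()
    return flow_frames
```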
In addition, optical flow has been shown to work well on video problems. Traditional hand-crafted features such as IDT [210] also contain optical-flow-like features, such as the Histogram of Optical Flow (HOF) and Motion Boundary Histogram (MBH).

Hence, Simonyan et al. [187] proposed two-stream networks, which include a spatial stream and a temporal stream as shown in Figure 6. This method is related to the two-streams hypothesis [65], according to which the human visual cortex contains two pathways: the ventral stream (which performs object recognition) and the dorsal stream (which recognizes motion). The spatial stream takes raw video frame(s) as input to capture visual appearance information. The temporal stream takes a stack of optical flow images as input to capture motion information between video frames. To be specific, [187] linearly rescaled the horizontal and vertical components of the estimated flow (i.e., motion in the x-direction and y-direction) to a [0, 255] range and compressed them using JPEG. The compressed optical flow images are then concatenated as the input to the temporal stream with a dimension of H × W × 2L, where H, W and L indicate the height, width and the number of video frames. In the end, the final prediction is obtained by averaging the prediction scores from both streams.

By adding the extra temporal stream, a CNN-based approach achieved, for the first time, performance similar to the previous best hand-crafted IDT features on both UCF101 and HMDB51 [109] (see Table 2). [187] makes two important observations. First, motion information is important for video action recognition. Second, it is still challenging for CNNs to learn temporal information directly from raw video frames. Pre-computing optical flow as the motion representation is an effective way for deep learning to reveal its power. Since [187] managed to close the gap between deep learning approaches and traditional hand-crafted features, many follow-up papers on two-stream networks emerged and greatly advanced the development of video action recognition. Here, we divide them into several categories and review them individually.

Figure 6. Workflow of five important papers: two-stream networks [187], temporal segment networks [218], I3D [14], Non-local [219] and SlowFast [45]. Best viewed in color.
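As a minimal illustration of the two-stream design, the sketch below pairs a 2D CNN on RGB frames with a second 2D CNN whose first convolution accepts a stack of 2L flow channels, and averages the class scores (late fusion). The ResNet-18 backbone, flow stack length, and class count are placeholders rather than the configuration used in [187].

```python
import torch
import torch.nn as nn
import torchvision.models as models

class TwoStreamNet(nn.Module):
    """Minimal two-stream sketch: spatial stream on RGB, temporal stream on
    stacked optical flow, fused by averaging class scores (late fusion)."""
    def __init__(self, num_classes=101, flow_length=10):
        super().__init__()
        self.spatial = models.resnet18(num_classes=num_classes)
        self.temporal = models.resnet18(num_classes=num_classes)
        # The temporal stream takes 2*L flow channels (x and y per frame).
        self.temporal.conv1 = nn.Conv2d(2 * flow_length, 64, kernel_size=7,
                                        stride=2, padding=3, bias=False)

    def forward(self, rgb, flow_stack):
        # rgb: (B, 3, H, W); flow_stack: (B, 2*L, H, W)
        s = self.spatial(rgb).softmax(dim=1)
        t = self.temporal(flow_stack).softmax(dim=1)
        return (s + t) / 2  # late fusion by averaging

model = TwoStreamNet()
scores = model(torch.randn(2, 3, 224, 224), torch.randn(2, 20, 224, 224))
```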
Two-stream networks [187] used a relatively shallow network architecture [107]. A natural extension is therefore to use deeper networks. However, Wang et al. [215] found that simply using deeper networks does not yield better results, possibly due to overfitting on the small-sized video datasets [190, 109]. Recall from section 2.1 that the UCF101 and HMDB51 datasets only have thousands of training videos. Hence, Wang et al. [217] introduced a series of good practices, including cross-modality initialization, synchronized batch normalization, corner cropping and multi-scale cropping data augmentation, a large dropout ratio, etc., to prevent deeper networks from overfitting. With these good practices, [217] was able to train a two-stream network with the VGG16 model [188] that outperforms [187] by a large margin on UCF101. These good practices have been widely adopted and are still being used. Later, Temporal Segment Networks (TSN) [218] performed a thorough investigation of network architectures, such as VGG16, ResNet [76] and Inception [198], and demonstrated that deeper networks usually achieve higher recognition accuracy for video action recognition. We describe TSN in more detail in section 3.2.4.
Since there are two streams in a two-stream network, there is a stage that needs to merge the results from both networks to obtain the final prediction. This stage is usually referred to as the spatial-temporal fusion step.

The easiest and most straightforward way is late fusion, which performs a weighted average of the predictions from both streams. Although late fusion is widely adopted [187, 217], many researchers claim that it may not be the optimal way to fuse the information between the spatial appearance stream and the temporal motion stream. They believe that earlier interactions between the two networks could benefit both streams during model learning, and this is termed early fusion.

Fusion [50] is one of the first of several papers investigating the early fusion paradigm, including how to perform spatial fusion (e.g., using operators such as sum, max, bilinear, convolution and concatenation), where to fuse the network (i.e., the network layer where early interactions happen), and how to perform temporal fusion (e.g., using 2D or 3D convolutional fusion in later stages of the network). [50] shows that early fusion is beneficial for both streams to learn richer features and leads to improved performance over late fusion. Following this line of research, Feichtenhofer et al. [46] generalize ResNet [76] to the spatio-temporal domain by introducing residual connections between the two streams. Based on [46], Feichtenhofer et al. [47] further propose a multiplicative gating function for residual networks to learn better spatio-temporal features. Concurrently, [225] adopts a spatio-temporal pyramid to perform hierarchical early fusion between the two streams.
Since a video is essentially a temporal sequence, researchers have explored Recurrent Neural Networks (RNNs) for temporal modeling inside a video, particularly Long Short-Term Memory (LSTM) [78].

LRCN [37] and Beyond-Short-Snippets [253] are the first of several papers that use LSTMs for video action recognition under the two-stream networks setting. They take the feature maps from CNNs as input to a deep LSTM network, and aggregate frame-level CNN features into video-level predictions. Note that they use LSTMs on the two streams separately, and the final results are still obtained by late fusion. However, there is no clear empirical improvement from LSTM models [253] over the two-stream baseline [187]. Following the CNN-LSTM framework, several variants have been proposed, such as bi-directional LSTM [205], CNN-LSTM fusion [56] and a hierarchical multi-granularity LSTM network [118]. [125] described VideoLSTM, which includes a correlation-based spatial attention mechanism and a lightweight motion-based attention mechanism. VideoLSTM not only shows improved results on action recognition, but also demonstrates how the learned attention can be used for action localization by relying on just the action class label. Lattice-LSTM [196] extends LSTM by learning independent hidden state transitions of memory cells for individual spatial locations, so that it can accurately model long-term and complex motions. ShuttleNet [183] is a concurrent work that considers both feedforward and feedback connections in an RNN to learn long-term dependencies. FASTER [272] designed a FAST-GRU to aggregate clip-level features from an expensive backbone and a cheap backbone. This strategy reduces the processing cost of redundant clips and hence accelerates the inference speed.

However, the work mentioned above [37, 253, 125, 196, 183] uses different two-stream networks/backbones, so the differences between the various methods using RNNs remain unclear. Ma et al. [135] build a strong baseline for fair comparison and thoroughly study the effect of learning spatio-temporal features with RNNs. They find that it requires proper care to achieve improved performance, e.g., LSTMs require pre-segmented data to fully exploit the temporal information. RNNs are also intensively studied in video action localization [189] and video question answering [274], but these are beyond the scope of this survey.
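A minimal sketch of the CNN-LSTM recipe used by LRCN-style models is shown below, assuming a ResNet-18 feature extractor and a single-layer LSTM; the dimensions are illustrative, not those of any published model.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class CNNLSTM(nn.Module):
    """LRCN-style sketch: 2D CNN features per frame, aggregated by an LSTM."""
    def __init__(self, num_classes=101, hidden=512):
        super().__init__()
        backbone = models.resnet18()
        self.features = nn.Sequential(*list(backbone.children())[:-1])  # drop fc
        self.lstm = nn.LSTM(input_size=512, hidden_size=hidden, batch_first=True)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, clip):
        # clip: (B, T, 3, H, W)
        b, t = clip.shape[:2]
        feats = self.features(clip.flatten(0, 1)).flatten(1)   # (B*T, 512)
        feats = feats.view(b, t, -1)
        out, _ = self.lstm(feats)
        return self.fc(out[:, -1])  # classify from the last time step

logits = CNNLSTM()(torch.randn(2, 8, 3, 224, 224))
```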
Thanks to optical flow, two-stream networks are able to reason about short-term motion information between frames. However, they still cannot capture long-range temporal information. Motivated by this weakness of two-stream networks, Wang et al. [218] proposed the Temporal Segment Network (TSN) to perform video-level action recognition. Though initially proposed for 2D CNNs, it is simple and generic; thus recent work, using either 2D or 3D CNNs, is still built upon this framework.

To be specific, as shown in Figure 6, TSN first divides a whole video into several segments, distributed uniformly along the temporal dimension. Then TSN randomly selects a single video frame within each segment and forwards them through the network. Here, the network shares weights for input frames from all the segments. In the end, a segmental consensus is performed to aggregate information from the sampled video frames. The segmental consensus can be an operator such as average pooling, max pooling, bilinear encoding, etc. In this sense, TSN is capable of modeling long-range temporal structure because the model sees content from the entire video. In addition, this sparse sampling strategy lowers the training cost over long video sequences while preserving relevant information.

Given TSN's good performance and simplicity, most two-stream methods afterwards became segment-based two-stream networks. Since the segmental consensus is simply a max or average pooling operation, a feature encoding step might generate a global video feature and lead to improved performance, as suggested in traditional approaches [179, 97, 157]. Deep Local Video Feature (DVOF) [114] proposed to treat the deep networks trained on local inputs as feature extractors and to train another encoding function to map the global features into global labels. Temporal Linear Encoding (TLE) [36] appeared concurrently with DVOF, but the encoding layer was embedded in the network so that the whole pipeline could be trained end-to-end. VLAD3 and ActionVLAD [123, 63] also appeared concurrently. They extended the NetVLAD layer [4] to the video domain to perform video-level encoding, instead of using compact bilinear encoding as in [36]. To improve the temporal reasoning ability of TSN, the Temporal Relation Network (TRN) [269] was proposed to learn and reason about temporal dependencies between video frames at multiple time scales. The recent state-of-the-art efficient model TSM [128] is also segment-based. We discuss it in more detail in section 3.4.2.
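The two ingredients of TSN described above, sparse sampling of one frame per uniform segment and a segmental consensus over per-frame predictions from a weight-shared 2D CNN, can be sketched as follows (average consensus and an illustrative backbone; a simplified rendering, not the released implementation).

```python
import torch
import torch.nn as nn
import torchvision.models as models

def sample_segment_indices(num_frames, num_segments=3):
    """Pick one random frame index inside each uniform segment."""
    seg_len = num_frames // num_segments
    return [i * seg_len + torch.randint(seg_len, (1,)).item()
            for i in range(num_segments)]

class TSN(nn.Module):
    """Segment-based recognition: weight-shared 2D CNN + average consensus."""
    def __init__(self, num_classes=101):
        super().__init__()
        self.backbone = models.resnet18(num_classes=num_classes)

    def forward(self, frames):
        # frames: (B, K, 3, H, W), one sampled frame per segment
        b, k = frames.shape[:2]
        logits = self.backbone(frames.flatten(0, 1)).view(b, k, -1)
        return logits.mean(dim=1)  # segmental consensus (average pooling)

video = torch.randn(1, 100, 3, 224, 224)          # a decoded "video"
idx = sample_segment_indices(video.shape[1], 3)
pred = TSN()(video[:, idx])
```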
Two-stream networks are successful because appearance and motion information are two of the most important properties of a video. However, there are other factors that can help video action recognition as well, such as pose, object, audio and depth.

Pose information is closely related to human action. We can recognize most actions by just looking at a pose (skeleton) image without scene context. Although there is previous work on using pose for action recognition [150, 246], P-CNN [23] was one of the first deep learning methods that successfully used pose to improve video action recognition. P-CNN proposed to aggregate motion and appearance information along tracks of human body parts, in a similar spirit to trajectory pooling [214]. [282] extended this pipeline to a chained multi-stream framework that computed and integrated appearance, motion and pose. They introduced a Markov chain model that added these cues successively and obtained promising results on both action recognition and action localization. PoTion [25] was a follow-up work to P-CNN, but introduced a more powerful feature representation that encoded the movement of human semantic keypoints. They first ran a decent human pose estimator and extracted heatmaps for the human joints in each frame. They then obtained the PoTion representation by temporally aggregating these probability maps. PoTion is lightweight and outperforms previous pose representations [23, 282]. In addition, it was shown to be complementary to standard appearance and motion streams, e.g., combining PoTion with I3D [14] achieved state-of-the-art results on UCF101.

Object information is another important cue because most human actions involve human-object interaction. Wu [232] proposed to leverage both object features and scene features to help video action recognition. The object and scene features were extracted from state-of-the-art pre-trained object and scene detectors. Wang et al. [252] took a step further to make the network end-to-end trainable. They introduced a two-stream semantic region based method, replacing a standard spatial stream with a Faster RCNN network [171], to extract semantic information about the object, person and scene.

Audio signals usually come with video and are complementary to the visual information. Wu et al. [233] introduced a multi-stream framework that integrates spatial, short-term motion, long-term temporal and audio cues in videos to digest complementary clues. Recently, Xiao et al. [237] introduced AudioSlowFast following [45], adding another audio pathway to model vision and sound in a unified representation.

In the RGB-D video action recognition field, using depth information is standard practice [59]. However, for vision-based video action recognition (i.e., given only monocular videos), we do not have access to ground-truth depth information as in the RGB-D domain. An early attempt, Depth2Action [280], uses off-the-shelf depth estimators to extract depth information from videos and uses it for action recognition.

Essentially, multi-stream networks are a form of multi-modality learning, using different cues as input signals to help video action recognition. We discuss more on multi-modality learning in section 5.12.

Pre-computing optical flow is computationally intensive and storage demanding, which is not friendly to large-scale training or real-time deployment. A conceptually easy way to understand a video is as a 3D tensor with two spatial dimensions and one time dimension.
Hence, this leads to the usage of 3D CNNs as a processing unit to model the temporal information in a video.

The seminal work on using 3D CNNs for action recognition is [91]. While inspiring, the network was not deep enough to show its potential. Tran et al. [202] extended [91] to a deeper 3D network, termed C3D. C3D follows the modular design of [188] and can be thought of as a 3D version of the VGG16 network. Its performance on standard benchmarks is not satisfactory, but it shows strong generalization capability and can be used as a generic feature extractor for various video tasks [250].

However, 3D networks are hard to optimize. In order to train a 3D convolutional filter well, people need a large-scale dataset with diverse video content and action categories. Fortunately, there exists a dataset, Sports1M [99], which is large enough to support the training of a deep 3D network. However, the training of C3D takes weeks to converge. Despite the popularity of C3D, most users just adopt it as a feature extractor for different use cases instead of modifying/fine-tuning the network. This is partially the reason why two-stream networks based on 2D CNNs dominated the video action recognition domain from 2014 to 2017.

The situation changed when Carreira et al. [14] proposed I3D in 2017. As shown in Figure 6, I3D takes a video clip as input and forwards it through stacked 3D convolutional layers. A video clip is a sequence of video frames, usually 16 or 32 frames. The major contributions of I3D are: 1) it adapts mature image classification architectures for use in 3D CNNs; 2) for model weights, it adopts a method developed for initializing optical flow networks in [217] to inflate the ImageNet pre-trained 2D model weights to their counterparts in the 3D model. Hence, I3D bypasses the dilemma that 3D CNNs have to be trained from scratch. With pre-training on the new large-scale dataset Kinetics400 [100], I3D achieved 95.6% on UCF101 and 74.8% on HMDB51. I3D ended the era in which different methods reported numbers on small-sized datasets such as UCF101 and HMDB51. Publications following I3D needed to report their performance on Kinetics400, or other large-scale benchmark datasets, which pushed video action recognition to the next level. In the next few years, 3D CNNs advanced quickly and became top performers on almost every benchmark dataset. We review the 3D CNN based literature in several categories below.

We want to point out that 3D CNNs are not replacing two-stream networks, and they are not mutually exclusive. They just use different ways to model the temporal relationship in a video. Furthermore, the two-stream approach is a generic framework for video understanding, instead of a specific method. As long as there are two networks, one for spatial appearance modeling using RGB frames and the other for temporal motion modeling using optical flow, the method belongs to the two-stream family. As reported in [14], adding an optical flow stream on top of I3D further improves accuracy on UCF101 and HMDB51; hence, the final I3D model is a combination of 3D CNNs and two-stream networks. However, the contribution of I3D does not lie in the usage of optical flow.
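The inflation trick mentioned above can be written in a few lines: each pre-trained 2D kernel is repeated along the temporal axis and divided by the temporal extent, so that the resulting 3D network initially reproduces the 2D network's response on a static (repeated-frame) video. The layer shape below is only an example.

```python
import torch

def inflate_conv_weight(weight_2d, time_dim=3):
    """Inflate a pre-trained 2D kernel (out, in, h, w) to a 3D kernel
    (out, in, t, h, w) by repeating it along time and dividing by t."""
    w = weight_2d.unsqueeze(2).repeat(1, 1, time_dim, 1, 1)
    return w / time_dim

w2d = torch.randn(64, 3, 7, 7)                  # e.g., the first conv of a 2D ResNet
w3d = inflate_conv_weight(w2d, time_dim=5)      # -> (64, 3, 5, 7, 7)
```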
2D CNNs enjoy the benefit of pre-training brought by large-scale image datasets such as ImageNet [30] and Places205 [270], which cannot be matched even by the largest video datasets available today. On these datasets, numerous efforts have been devoted to the search for 2D CNN architectures that are more accurate and generalize better. Below we describe the efforts to capitalize on these advances for 3D CNNs.

ResNet3D [74] directly takes the 2D ResNet [76] and replaces all the 2D convolutional filters with 3D kernels, with the belief that by using deep 3D CNNs together with large-scale datasets one can exploit the success of 2D CNNs on ImageNet. Motivated by ResNeXt [238], Chen et al. [20] presented a multi-fiber architecture that slices a complex neural network into an ensemble of lightweight networks (fibers); it facilitates information flow between fibers while reducing the computational cost. Inspired by SENet [81], STCNet [33] proposes to integrate channel-wise information inside a 3D block to capture both spatial-channel and temporal-channel correlation information throughout the network.
To reduce the complexity of 3D network training, P3D [169] and R2+1D [204] explore the idea of 3D factorization. To be specific, a 3D kernel (e.g., 3 × 3 × 3) can be factorized into two separate operations, a 2D spatial convolution (e.g., 1 × 3 × 3) and a 1D temporal convolution (e.g., 3 × 1 × 1). The differences between P3D and R2+1D lie in how they arrange the two factorized operations and how they formulate each residual block. Trajectory convolution [268] follows this idea but uses deformable convolution for the temporal component to better cope with motion.

Another way of simplifying 3D CNNs is to mix 2D and 3D convolutions in a single network. MiCTNet [271] integrates 2D and 3D CNNs to generate deeper and more informative feature maps, while reducing training complexity in each round of spatio-temporal fusion. ARTNet [213] introduces an appearance-and-relation network built from a new building block, which consists of a spatial branch using 2D CNNs and a relation branch using 3D CNNs. S3D [239] combines the merits of the approaches mentioned above. It first replaces the 3D convolutions at the bottom of the network with 2D kernels, and finds that this kind of top-heavy network has higher recognition accuracy. Then S3D factorizes the remaining 3D kernels as P3D and R2+1D do, to further reduce the model size and training complexity. A concurrent work named ECO [283] also adopts such a top-heavy network to achieve online video understanding.

In 3D CNNs, long-range temporal connections may be achieved by stacking multiple short temporal convolutions, e.g., 3 × 3 × 3 filters. However, useful temporal information may be lost in the later stages of a deep network, especially for frames far apart.

In order to perform long-range temporal modeling, LTC [206] introduces and evaluates long-term temporal convolutions over a large number of video frames. However, limited by GPU memory, they have to sacrifice input resolution to use more frames. After that, T3D [32] adopted a densely connected structure [83] to keep the original temporal information as complete as possible for the final prediction. Later, Wang et al. [219] introduced a new building block, termed non-local. Non-local is a generic operation similar to self-attention [207], which can be used for many computer vision tasks in a plug-and-play manner. As shown in Figure 6, they use a spacetime non-local module after later residual blocks to capture long-range dependencies in both the spatial and temporal domains, and achieve improved performance over baselines without bells and whistles. Wu et al. [229] proposed a feature bank representation, which embeds information of the entire video into a memory cell, to make context-aware predictions. Recently, V4D [264] proposed video-level 4D CNNs to model the evolution of long-range spatio-temporal representations with 4D convolutions.

In order to further improve the efficiency of 3D CNNs (i.e., in terms of GFLOPs, model parameters and latency), many variants of 3D CNNs have begun to emerge.

Motivated by the development of efficient 2D networks, researchers started to adopt channel-wise separable convolution and extend it to video classification [111, 203]. CSN [203] reveals that it is good practice to factorize 3D convolutions by separating channel interactions and spatio-temporal interactions, and it obtains state-of-the-art performance while being 2 to 3 times faster than the previous best approaches.
These methods are also related to multi-fiber networks [20] as they are all inspired by group convolution.

Recently, Feichtenhofer et al. [45] proposed SlowFast, an efficient network with a slow pathway and a fast pathway. The network design is partially inspired by the biological Parvo- and Magnocellular cells in the primate visual system. As shown in Figure 6, the slow pathway operates at a low frame rate to capture detailed semantic information, while the fast pathway operates at a high temporal resolution to capture rapidly changing motion. In order to incorporate motion information as in two-stream networks, SlowFast adopts a lateral connection to fuse the representations learned by each pathway. Since the fast pathway can be made very lightweight by reducing its channel capacity, the overall efficiency of SlowFast is largely improved. Although SlowFast has two pathways, it is different from the two-stream networks [187], because the two pathways are designed to model different temporal speeds, rather than spatial and temporal modeling. There are several concurrent papers using multiple pathways to balance accuracy and efficiency [43].

Following this line, Feichtenhofer [44] introduced X3D, which progressively expands a 2D image classification architecture along multiple network axes, such as temporal duration, frame rate, spatial resolution, width, bottleneck width, and depth. X3D pushes 3D model modification/factorization to an extreme, and is a family of efficient video networks that meet different requirements of target complexity. In a similar spirit, A3D [276] also leverages multiple network configurations. However, A3D trains these configurations jointly and during inference deploys only one model, which makes the final model more efficient. In the next section, we continue to discuss efficient video modeling, but not based on 3D convolutions.
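To make the factorization ideas from this section concrete, the sketch below shows a (2+1)D block in the spirit of P3D/R2+1D (a 1 × 3 × 3 spatial convolution followed by a 3 × 1 × 1 temporal convolution) and a channel-separated block in the spirit of CSN (a 1 × 1 × 1 convolution for channel interactions followed by a depthwise 3 × 3 × 3 convolution). Kernel sizes and channel widths are illustrative; the published blocks differ in details such as residual connections and normalization.

```python
import torch
import torch.nn as nn

class Conv2Plus1D(nn.Module):
    """P3D/R(2+1)D-style factorization: 1x3x3 spatial conv then 3x1x1 temporal conv."""
    def __init__(self, in_ch, out_ch, mid_ch=None):
        super().__init__()
        # Roughly match the parameter count of a full 3x3x3 convolution.
        mid_ch = mid_ch or (3 * 3 * 3 * in_ch * out_ch) // (3 * 3 * in_ch + 3 * out_ch)
        self.block = nn.Sequential(
            nn.Conv3d(in_ch, mid_ch, (1, 3, 3), padding=(0, 1, 1), bias=False),
            nn.BatchNorm3d(mid_ch),
            nn.ReLU(inplace=True),
            nn.Conv3d(mid_ch, out_ch, (3, 1, 1), padding=(1, 0, 0), bias=False),
        )

    def forward(self, x):          # x: (B, C, T, H, W)
        return self.block(x)

class ChannelSeparated3D(nn.Module):
    """CSN-style factorization: 1x1x1 conv for channel interactions followed by
    a depthwise 3x3x3 conv for spatio-temporal interactions."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.pointwise = nn.Conv3d(in_ch, out_ch, kernel_size=1, bias=False)
        self.depthwise = nn.Conv3d(out_ch, out_ch, kernel_size=3, padding=1,
                                   groups=out_ch, bias=False)

    def forward(self, x):
        return self.depthwise(self.pointwise(x))

x = torch.randn(1, 64, 8, 56, 56)
print(Conv2Plus1D(64, 64)(x).shape, ChannelSeparated3D(64, 64)(x).shape)
```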
With the increase in dataset size and the need for deployment, efficiency becomes an important concern.

If we use methods based on two-stream networks, we need to pre-compute optical flow and store it on local disk. Taking the Kinetics400 dataset as an illustrative example, storing all the optical flow images requires 4.5TB of disk space. Such a huge amount of data makes I/O the tightest bottleneck during training, leading to wasted GPU resources and a longer experiment cycle. In addition, pre-computing optical flow is not cheap, which means none of the two-stream network methods are real-time.

If we use methods based on 3D CNNs, people still find that 3D CNNs are hard to train and challenging to deploy. In terms of training, a standard SlowFast network trained on the Kinetics400 dataset using a high-end 8-GPU machine takes 10 days to complete. Such a long experimental cycle and huge computing cost make video understanding research accessible only to big companies/labs with abundant computing resources. There are several recent attempts to speed up the training of deep video models [230], but these are still expensive compared to most image-based computer vision tasks. In terms of deployment, 3D convolution is not as well supported as 2D convolution on different platforms. Furthermore, 3D CNNs require more video frames as input, which adds additional I/O cost.

Hence, starting from 2018, researchers began to investigate other alternatives to improve accuracy and efficiency at the same time for video action recognition. We review recent efficient video modeling methods in several categories below.
One of the major drawbacks of two-stream networks is their need for optical flow. Pre-computing optical flow is computationally expensive, storage demanding, and not end-to-end trainable for video action recognition. It is appealing if we can find a way to encode motion information without using optical flow, at least during inference time.

[146] and [35] are early attempts at learning to estimate optical flow inside a network for video action recognition. Although these two approaches do not need optical flow during inference, they require optical flow during training in order to train the flow estimation network. Hidden two-stream networks [278] proposed MotionNet to replace the traditional optical flow computation. MotionNet is a lightweight network that learns motion information in an unsupervised manner, and when concatenated with the temporal stream, it is end-to-end trainable. Thus, hidden two-stream CNNs [278] only take raw video frames as input and directly predict action classes without explicitly computing optical flow, regardless of whether it is the training or inference stage. PAN [257] mimics optical flow features by computing the difference between consecutive feature maps. Following this direction, [197, 42, 116, 164] continue to investigate end-to-end trainable CNNs that learn optical-flow-like features from data, deriving such features directly from the definition of optical flow [255]. MARS [26] and D3D [191] used knowledge distillation to combine two-stream networks into a single stream, e.g., by tuning the spatial stream to predict the outputs of the temporal stream. Recently, Kwon et al. [110] introduced the MotionSqueeze module to estimate motion features. The proposed module is end-to-end trainable and can be plugged into any network, similar to [278].
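As a flavor of how motion cues can be approximated without optical flow, the snippet below simply differences consecutive per-frame feature maps, loosely in the spirit of PAN [257]; it illustrates the principle rather than reproducing the paper's exact formulation.

```python
import torch

def feature_difference_motion(feats):
    """feats: (B, T, C, H, W) per-frame feature maps.
    Returns (B, T-1, C, H, W) differences used as a cheap motion cue."""
    return feats[:, 1:] - feats[:, :-1]

motion = feature_difference_motion(torch.randn(2, 8, 64, 56, 56))
```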
A simple and natural choice for modeling the temporal relationship between frames is 3D convolution. However, there are many alternatives for achieving this goal. Here, we review some recent work that performs temporal modeling without 3D convolution.

Lin et al. [128] introduce a new method, termed the temporal shift module (TSM). TSM extends the shift operation [228] to video understanding. It shifts part of the channels along the temporal dimension, thereby facilitating information exchange among neighboring frames. In order to keep the spatial feature learning capacity, the temporal shift module is placed inside the residual branch of a residual block, so all the information in the original activation is still accessible after the temporal shift through the identity mapping. The biggest advantage of TSM is that it can be inserted into a 2D CNN to achieve temporal modeling at zero extra computation and zero extra parameters. Similar to TSM, TIN [182] introduces a temporal interlacing module to model the temporal convolution.

There are several recent 2D CNN approaches that use attention to perform long-term temporal modeling [92, 122, 132, 133]. STM [92] proposes a channel-wise spatio-temporal module to present the spatio-temporal features and a channel-wise motion module to efficiently encode motion features. TEA [122] is similar to STM but, inspired by SENet [81], uses motion features to recalibrate the spatio-temporal features to enhance the motion pattern. Specifically, TEA has two components, motion excitation and multiple temporal aggregation; the first handles short-range motion modeling while the second efficiently enlarges the temporal receptive field for long-range temporal modeling. They are complementary and both lightweight, so TEA achieves results competitive with the previous best approaches while keeping FLOPs as low as many 2D CNNs. Recently, TEINet [132] also adopts attention to enhance temporal modeling. Note that the above attention-based methods are different from non-local [219], because they use channel attention while non-local uses spatial attention.
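The core of TSM is a parameter-free channel shift along time. The sketch below shifts one fraction of the channels forward, another fraction backward, and leaves the rest untouched, with zero padding at the clip boundaries; the commonly cited 1/8 shift ratio is used here as an assumption.

```python
import torch

def temporal_shift(x, fold_div=8):
    """x: (B, T, C, H, W). Shift a fraction of channels along the time axis,
    zero-padding at the clip boundaries (no extra parameters or FLOPs)."""
    b, t, c, h, w = x.shape
    fold = c // fold_div
    out = torch.zeros_like(x)
    out[:, 1:, :fold] = x[:, :-1, :fold]                   # shift forward in time
    out[:, :-1, fold:2 * fold] = x[:, 1:, fold:2 * fold]   # shift backward in time
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]              # untouched channels
    return out

shifted = temporal_shift(torch.randn(2, 8, 64, 56, 56))
```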
In this section, we show several other directions that have been popular for video action recognition over the last decade.
While CNN-based approaches have demonstrated their superiority and gradually replaced traditional hand-crafted methods, the traditional local feature pipeline still has merits that should not be ignored, such as the usage of trajectories.

Inspired by the good performance of trajectory-based methods [210], Wang et al. [214] proposed trajectory-constrained pooling to aggregate deep convolutional features into effective descriptors, which they term TDD. Here, a trajectory is defined as a path tracking down pixels in the temporal dimension. This new video representation shares the merits of both hand-crafted and deep-learned features, and became one of the top performers on both the UCF101 and HMDB51 datasets in 2015. Concurrently, Lan et al. [113] incorporated both Independent Subspace Analysis (ISA) and dense trajectories into the standard two-stream networks, and showed the complementarity between data-independent and data-driven approaches. Instead of treating CNNs as a fixed feature extractor, Zhao et al. [268] proposed trajectory convolution to learn features along the temporal dimension with the help of trajectories.
There is another way to model temporal information inside a video, termed rank pooling (a.k.a. learning-to-rank). The seminal work in this line starts with VideoDarwin [53], which uses a ranking machine to learn the evolution of the appearance over time and returns a ranking function. The ranking function should be able to order the frames of a video temporally, so the parameters of this ranking function are used as a new video representation. VideoDarwin [53] is not a deep learning based method, but achieves comparable performance and efficiency.

To adapt rank pooling to deep learning, Fernando [54] introduces a differentiable rank pooling layer to achieve end-to-end feature learning. Following this direction, Bilen et al. [9] apply rank pooling to the raw image pixels of a video, producing a single RGB image per video, termed a dynamic image. Another concurrent work by Fernando [51] extends rank pooling to hierarchical rank pooling by stacking multiple levels of temporal encoding. Finally, [22] propose a generalization of the original ranking formulation [53] using subspace representations and show that it leads to a significantly better representation of the dynamic evolution of actions, while being computationally cheap.
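The spirit of rank pooling can be conveyed with a drastically simplified stand-in: fit a linear function whose scores respect the temporal order of the frames and use its parameters as the video representation. The least-squares fit below replaces the pairwise ranking objective of VideoDarwin [53] and is only meant to illustrate the idea.

```python
import numpy as np

def rank_pool(features, lam=1.0):
    """features: (T, D) per-frame feature vectors.
    Fit a linear function w whose scores increase with the frame index
    (a least-squares stand-in for the ranking machine) and return w as the
    video-level representation."""
    T, D = features.shape
    targets = np.arange(1, T + 1, dtype=np.float64)   # desired temporal order
    X = features.astype(np.float64)
    # Ridge-regularized least squares: w = argmin ||X w - t||^2 + lam ||w||^2
    w = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ targets)
    return w

video_repr = rank_pool(np.random.randn(30, 128))
```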
Most video action recognition approaches use raw videos (or decoded video frames) as input. However, there are several drawbacks to using raw videos, such as the huge amount of data and the high temporal redundancy. Video compression methods usually store one frame by reusing contents from another frame (i.e., the I-frame) and only storing the difference (i.e., P-frames and B-frames), exploiting the fact that adjacent frames are similar. Here, the I-frame is the original RGB video frame, while the P-frames and B-frames include the motion vector and residual, which are used to store the difference. Motivated by developments in the video compression domain, researchers started to adopt compressed video representations as input to train effective video models.

Since the motion vector has coarse structure and may contain inaccurate movements, Zhang et al. [256] adopted knowledge distillation to help the motion-vector-based temporal stream mimic the optical-flow-based temporal stream.

Table 2. Results of widely adopted methods on three scene-focused datasets. Pre-train indicates which dataset the model is pre-trained on. I: ImageNet, S: Sports1M, K: Kinetics400. NL represents non-local.

Method | Pre-train | Flow | Backbone | Venue | UCF101 | HMDB51 | Kinetics400
DeepVideo [99] | I | - | AlexNet | CVPR 2014 | 65.4 | - | -
Two-stream [187] | I | ✓ | CNN-M | NeurIPS 2014 | 88.0 | 59.4 | -
LRCN [37] | I | ✓ | CNN-M | CVPR 2015 | 82.3 | - | -
TDD [214] | I | ✓ | CNN-M | CVPR 2015 | 90.3 | 63.2 | -
Fusion [50] | I | ✓ | VGG16 | CVPR 2016 | 92.5 | 65.4 | -
TSN [218] | I | ✓ | BN-Inception | ECCV 2016 | 94.0 | 68.5 |
TLE [36] | I | ✓ | BN-Inception | CVPR 2017 | | | -
C3D [202] | S | - | VGG16-like | ICCV 2015 | 82.3 | 56.8 | 59.5
I3D [14] | I,K | - | BN-Inception-like | CVPR 2017 | 95.6 | 74.8 | 71.1
P3D [169] | S | - | ResNet50-like | ICCV 2017 | 88.6 | - | 71.6
ResNet3D [74] | K | - | ResNeXt101-like | CVPR 2018 | 94.5 | 70.2 | 65.1
R2+1D [204] | K | - | ResNet34-like | CVPR 2018 | 96.8 | 74.5 | 72.0
NL I3D [219] | I | - | ResNet101-like | CVPR 2018 | - | - | 77.7
S3D [239] | I,K | - | BN-Inception-like | ECCV 2018 | 96.8 | |
TPN [248] | - | - | ResNet101-like | CVPR 2020 | - | - | 78.9
CIDC [121] | - | - | ResNet50-like | ECCV 2020 | | |
However, their approach required extracting and processing each frame. They obtained comparable recognition accuracy to standard two-stream networks, but were 27 times faster. Wu et al. [231] used a heavyweight CNN for the I-frames and lightweight CNNs for the P-frames. This required that the motion vectors and residuals for each P-frame be referred back to the I-frame by accumulation. DMC-Net [185] is a follow-up work to [231] using an adversarial loss. It adopts a lightweight generator network to help the motion vector capture fine motion details, instead of the knowledge distillation used in [256]. A recent paper, SCSampler [106], also adopts compressed video representations for sampling salient clips; we discuss it in section 3.5.4. As yet, none of the compressed approaches can deal with B-frames due to the added complexity.
Most of the aforementioned deep learning methods treat every video frame/clip equally for the final prediction. However, discriminative actions only happen in a few moments, and most of the other video content is irrelevant or only weakly related to the labeled action category. This paradigm has several drawbacks. First, training with a large proportion of irrelevant video frames may hurt performance. Second, such uniform sampling is not efficient during inference.

Partially inspired by how humans understand a video using just a few glimpses over the entire video [251], many methods have been proposed to sample the most informative video frames/clips, both to improve performance and to make the model more efficient during inference.

KVM [277] is one of the first attempts to propose an end-to-end framework to simultaneously identify key volumes and do action classification. Later, [98] introduced AdaScan, which predicts the importance score of each video frame in an online fashion, which they term adaptive temporal pooling. Both of these methods achieve improved performance, but they still adopt the standard evaluation scheme, which does not translate into efficiency gains during inference. Recent approaches focus more on efficiency [41, 234, 8, 106]. AdaFrame [234] follows [251, 98] but uses a reinforcement learning based approach to search for more informative video clips. Concurrently, [8] uses a teacher-student framework, i.e., a see-it-all teacher is used to train a compute-efficient see-very-little student. They demonstrate that the efficient student network can substantially reduce the inference time and the number of FLOPs with negligible performance drop. Recently, SCSampler [106] trains a lightweight network to sample the most salient video clips based on compressed video representations, and achieves state-of-the-art performance on both the Kinetics400 and Sports1M datasets. They also empirically show that such saliency-based sampling is not only efficient, but also enjoys higher accuracy than using all the video frames.
Visual tempo is a concept describing how fast an action goes. Many action classes have different visual tempos. In most cases, the key to distinguishing them is their visual tempo, as they might share high similarities in visual appearance, such as walking, jogging and running [248]. There are several papers exploring different temporal rates (tempos) for improved temporal modeling [273, 147, 82, 281, 45, 248]. Initial attempts usually capture the video tempo by sampling raw videos at multiple rates and constructing an input-level frame pyramid [273, 147, 281]. Recently, SlowFast [45], as discussed in section 3.3.4, utilizes the characteristics of visual tempo to design a two-pathway network for a better accuracy and efficiency trade-off. CIDC [121] proposed directional temporal modeling along with a local backbone for video temporal modeling. TPN [248] extends tempo modeling to the feature level and shows consistent improvement over previous approaches.

We would like to point out that visual tempo is also widely used in self-supervised video representation learning [6, 247, 16], since it can naturally provide supervision signals to train a deep network. We discuss more details on self-supervised video representation learning in section 5.13.
4. Evaluation and Benchmarking
In this section, we compare popular approaches on benchmark datasets. To be specific, we first introduce standard evaluation schemes in section 4.1. Then we divide common benchmarks into three categories: scene-focused (UCF101, HMDB51 and Kinetics400, in section 4.2), motion-focused (Sth-Sth V1 and V2, in section 4.3) and multi-label (Charades, in section 4.4). In the end, we present a fair comparison among popular methods in terms of both recognition accuracy and efficiency in section 4.5.
4.1. Evaluation scheme

During model training, we usually randomly pick a video frame/clip to form mini-batch samples. However, for evaluation, we need a standardized pipeline in order to perform fair comparisons.

For 2D CNNs, a widely adopted evaluation scheme is to evenly sample 25 frames from each video following [187, 217]. For each frame, we perform ten-crop data augmentation by cropping the 4 corners and 1 center, flipping them horizontally, and averaging the prediction scores (before the softmax operation) over all crops, i.e., we use 250 frames per video for inference.

For 3D CNNs, a widely adopted evaluation scheme, termed the 30-view strategy, is to evenly sample 10 clips from each video following [219]. For each video clip, we perform three-crop data augmentation. To be specific, we scale the shorter spatial side to 256 pixels and take three 256 × 256 crops to cover the spatial dimensions, then average the prediction scores.

However, the evaluation schemes are not fixed. They are evolving and adapting to new network architectures and different datasets. For example, TSM [128] only uses two clips per video for small-sized datasets [190, 109], and performs three-crop data augmentation for each clip despite being a 2D CNN. We will mention any deviations from the standard evaluation pipeline.

In terms of evaluation metric, we report accuracy for single-label action recognition, and mAP (mean average precision) for multi-label action recognition.
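To make the 30-view protocol concrete, here is a minimal sketch; the clip length, crop size and model interface are assumptions, and individual papers differ in these details. It samples 10 clips, takes 3 spatial crops per clip and averages the 30 prediction scores.

```python
import torch

@torch.no_grad()
def thirty_view_inference(video, model, clip_len=32, num_clips=10, crop=256):
    """Sketch of the '30-view' protocol: 10 uniformly spaced clips,
    3 spatial crops per clip, prediction scores averaged over all views.

    video: tensor (T, C, H, W) whose shorter side is already resized to `crop`;
    model: any clip-level classifier that takes (N, C, T, H, W).
    """
    T, C, H, W = video.shape
    starts = torch.linspace(0, max(T - clip_len, 0), num_clips).long()
    views = []
    for s in starts:
        clip = video[int(s):int(s) + clip_len]              # (clip_len, C, H, W)
        # Three crops along the longer spatial axis (left/centre/right or top/centre/bottom).
        long_side = max(H, W)
        for o in (0, (long_side - crop) // 2, long_side - crop):
            if W >= H:
                views.append(clip[:, :, :, o:o + crop])
            else:
                views.append(clip[:, :, o:o + crop, :])
    batch = torch.stack(views).permute(0, 2, 1, 3, 4)       # (30, C, clip_len, crop, crop)
    scores = model(batch)                                   # (30, num_classes)
    return scores.mean(dim=0)                               # averaged prediction
```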
4.2. Scene-focused datasets

Here, we compare recent state-of-the-art approaches on scene-focused datasets: UCF101, HMDB51 and Kinetics400. We call them scene-focused because most action videos in these datasets are short and can be recognized by static scene appearance alone, as shown in Figure 4.

Following the chronology, we first present results for early attempts at using deep learning and for the two-stream networks at the top of Table 2. We make several observations. First, without motion/temporal modeling, the performance of DeepVideo [99] is inferior to all other approaches. Second, it is helpful to transfer knowledge from traditional (non-CNN-based) methods to deep learning. For example, TDD [214] uses trajectory pooling to extract motion-aware CNN features. TLE [36] embeds global feature encoding, an important step in the traditional video action recognition pipeline, into a deep network.

We then compare 3D CNN based approaches in the middle of Table 2. Despite training on a large corpus of videos, C3D [202] performs worse than concurrent work [187, 214, 217], possibly due to difficulties in optimizing 3D kernels. Motivated by this, several papers - I3D [14], P3D [169], R2+1D [204] and S3D [239] - factorize 3D convolution filters into 2D spatial kernels and 1D temporal kernels to ease the training. In addition, I3D introduces an inflation strategy to avoid training from scratch by bootstrapping the 3D model weights from well-trained 2D networks. By using these techniques, they achieve comparable performance to the best two-stream network methods [36] without the need for optical flow. Furthermore, recent 3D models obtain even higher accuracy by using more training samples [203], additional pathways [45], or architecture search [44].

Finally, we show recent efficient models at the bottom of Table 2. We can see that these methods are able to achieve higher recognition accuracy than two-stream networks (top), and comparable performance to 3D CNNs (middle). Since they are 2D CNNs and do not use optical flow, these methods are efficient in terms of both training and inference. Most of them are real-time approaches, and some can do online video action recognition [128]. We believe 2D CNNs plus temporal modeling is a promising direction due to the need for efficiency. Here, temporal modeling could be attention based, flow based or 3D kernel based.

Table 3. Results of widely adopted methods on the Something-Something V1 and V2 datasets. We only report numbers obtained without optical flow. Pre-train indicates which dataset the model is pre-trained on; I: ImageNet and K: Kinetics400. Views means the number of temporal clips multiplied by the number of spatial crops, e.g., 30 means 10 temporal clips with 3 spatial crops per clip.
4.3. Motion-focused datasets

In this section, we compare recent state-of-the-art approaches on the 20BN-Something-Something (Sth-Sth) datasets. We report top-1 accuracy on both V1 and V2. The Sth-Sth datasets focus on humans performing basic actions with daily objects. Different from scene-focused datasets, the background scene in Sth-Sth datasets contributes little to the final action class prediction. In addition, there are classes such as "Pushing something from left to right" and "Pushing something from right to left", which require strong motion reasoning.

Comparing previous work in Table 3, we observe that using longer inputs (e.g., 16 frames) is generally better. Moreover, methods that focus on temporal modeling [128, 122, 92] work better than stacked 3D kernels [14]. For example, TSM [128], TEA [122] and MSNet [110] insert an explicit temporal reasoning module into 2D ResNet backbones and achieve state-of-the-art results. This implies that the Sth-Sth datasets need strong temporal motion reasoning as well as spatial semantics information.
4.4. Multi-label dataset

In this section, we first compare recent state-of-the-art approaches on the Charades dataset [186], and then we list some recent work that uses model assembly or additional object information on Charades.

Comparing previous work in Table 4, we make the following observations. First, 3D models [229, 45] generally perform better than 2D models [186, 231] and 2D models with optical flow input. This indicates that spatio-temporal reasoning is critical for long-term, complex, concurrent action understanding. Second, longer input helps recognition [229], probably because some actions require long-term features to recognize. Third, models with strong backbones that are pre-trained on larger datasets generally have better performance [45]. This is because Charades is a medium-scale dataset that does not contain enough diversity to train a deep model.

Recently, researchers have explored an alternative direction for complex concurrent action recognition by assembling models [177] or providing additional human-object interaction information [90]. These papers significantly outperform previous literature that only fine-tunes a single model on Charades. This demonstrates that exploring spatio-temporal human-object interactions and finding a way to avoid overfitting are the keys to concurrent action understanding.

Method | Extra information | Backbone | Pre-train | Venue | mAP
2D CNN [186] | - | AlexNet | I | ECCV 2016 | 11.2
Two-stream [186] | flow | VGG16 | I | ECCV 2016 | 22.4
ActionVLAD [63] | - | VGG16 | I | CVPR 2017 | 21.0
CoViAR [231] | - | ResNet50-like | - | CVPR 2018 | 21.9
MultiScale TRN [269] | - | BN-Inception-like | I | ECCV 2018 |
I3D [14] | - | BN-Inception-like | K400 | CVPR 2017 | 32.9
STRG [220] | - | ResNet101-NL-like | K400 | ECCV 2018 | 39.7
LFB [229] | - | ResNet101-NL-like | K400 | CVPR 2019 | 42.5
TC [84] | | ResNet101-NL-like | K400 | ICCV 2019 | 41.1
HAF [212] | IDT + flow | BN-Inception-like | K400 | ICCV 2019 | 43.1
SlowFast [45] | - | ResNet-like | K400 | ICCV 2019 | 42.5
SlowFast [45] | - | ResNet-like | K600 | ICCV 2019 |
Action-Genome [90] | person + object | ResNet-like | - | CVPR 2020 |
AssembleNet++ [177] | flow + object | ResNet-like | - | ECCV 2020 | 59.9

Table 4. Charades evaluation using mAP, calculated using the officially provided script. NL: non-local network. Pre-train indicates which dataset the model is pre-trained on; I: ImageNet, K400: Kinetics400 and K600: Kinetics600.
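The mAP values in Table 4 are per-class average precisions averaged over classes. A minimal sketch of this metric is shown below; it is not the officially provided Charades evaluation script, only an illustration of the computation.

```python
import numpy as np
from sklearn.metrics import average_precision_score

def multilabel_map(y_true, y_score):
    """Mean average precision for multi-label action recognition:
    AP is computed independently per class and then averaged.

    y_true: (num_videos, num_classes) binary ground-truth matrix.
    y_score: (num_videos, num_classes) predicted confidence scores.
    """
    aps = []
    for c in range(y_true.shape[1]):
        if y_true[:, c].sum() == 0:       # skip classes with no positive example
            continue
        aps.append(average_precision_score(y_true[:, c], y_score[:, c]))
    return float(np.mean(aps))

# Tiny usage example with random data.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    labels = (rng.random((8, 5)) > 0.7).astype(int)
    scores = rng.random((8, 5))
    print(multilabel_map(labels, scores))
```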
4.5. Speed and accuracy comparison

To deploy a model in real-life applications, we usually need to know whether it meets the speed requirement before we can proceed. In this section, we evaluate the approaches mentioned above in a thorough comparison in terms of (1) number of parameters, (2) FLOPS, (3) latency and (4) frames per second.

We present the results in Table 5. Here, we use the models in the GluonCV video action recognition model zoo, since all these models are trained using the same data, the same data augmentation strategy and the same 30-view evaluation scheme, which makes for a fair comparison. (To reproduce the numbers in Table 5, please visit https://github.com/dmlc/gluon-cv/blob/master/scripts/action-recognition/README.md.) All timings are done on a single Tesla V100 GPU with 105 repeated runs, where the first 5 runs are ignored for warming up. We always use a batch size of 1. In terms of model input, we use the settings suggested in the original papers.

As we can see in Table 5, if we compare latency, 2D models are much faster than all the 3D variants. This is probably why most real-world video applications still adopt frame-wise methods. Second, as mentioned in [170, 259], FLOPS is not strongly correlated with the actual inference time (i.e., latency). Third, in terms of accuracy, most 3D models give similar performance, but pre-training on a larger dataset can significantly boost performance. This indicates the importance of training data and partially suggests that self-supervised pre-training might be a promising way to further improve existing methods.

Table 5. Comparison of both efficiency and accuracy. Top: 2D models; bottom: 3D models. FLOPS means floating point operations per second. FPS indicates how many video frames the model can process per second. Latency is the actual running time to complete one network forward pass given the input. Acc is the top-1 accuracy on the Kinetics400 dataset. The TSN, I3D and I3D-slow families are pretrained on ImageNet; the R2+1D, SlowFast and TPN families are trained from scratch. Note that R2+1D-ResNet152∗ and CSN-ResNet152∗ in Table 5 are pretrained on the large-scale IG65M dataset [60].
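A rough sketch of how such latency and FPS numbers can be measured is shown below, assuming a PyTorch model and the warm-up and batch-size conventions described above; the GluonCV scripts linked earlier are the authoritative reference for the numbers in Table 5.

```python
import time
import torch

@torch.no_grad()
def measure_latency(model, input_shape=(1, 3, 32, 224, 224), runs=105, warmup=5):
    """Rough latency/FPS measurement: batch size 1, a few warm-up iterations,
    then the remaining runs are averaged.

    input_shape follows the (N, C, T, H, W) convention of 3D models.
    """
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device).eval()
    x = torch.randn(input_shape, device=device)

    timings = []
    for i in range(runs):
        if device == "cuda":
            torch.cuda.synchronize()
        start = time.time()
        model(x)
        if device == "cuda":
            torch.cuda.synchronize()            # wait for the GPU to finish
        if i >= warmup:                         # discard warm-up iterations
            timings.append(time.time() - start)

    latency = sum(timings) / len(timings)       # seconds per forward pass
    fps = input_shape[2] / latency              # frames processed per second
    return latency, fps
```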
5. Discussion and Future Work
We have surveyed more than 200 deep learning based methods for video action recognition since 2014. Despite the performance on benchmark datasets plateauing, there are many active and promising directions in this task worth exploring.
5.1. Analysis and insights

More and more methods have been developed to improve video action recognition; at the same time, some papers summarize these methods and provide analysis and insights. Huang et al. [82] perform an explicit analysis of the effect of temporal information for video understanding. They try to answer the question "how important is the motion in the video for recognizing the action". Feichtenhofer et al. [48, 49] provide an excellent visualization of what two-stream models have learned in order to understand how these deep representations work and what they are capturing. Li et al. [124] introduce the concept of the representation bias of a dataset, and find that current datasets are biased towards static representations. Experiments on such biased datasets may lead to erroneous conclusions, which is indeed a big problem that limits the development of video action recognition. Recently, Piergiovanni et al. introduced the AViD [165] dataset to cope with data bias by collecting data from diverse groups of people. These papers provide great insights that help fellow researchers understand the challenges, the open problems, and where the next breakthrough might reside.
5.2. Data augmentation

Numerous data augmentation methods have been proposed in the image recognition domain, such as mixup [258], cutout [31], CutMix [254], AutoAugment [27], Fast AutoAugment [126], etc. However, video action recognition still adopts the basic data augmentation techniques introduced before 2015 [217, 188], namely random resizing, random cropping and random horizontal flipping. Recently, SimCLR [17] and other papers have shown that color jittering and random rotation greatly help representation learning. Hence, an investigation of different data augmentation techniques for video action recognition would be particularly useful. This may change the data pre-processing pipeline for all existing methods.
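One practical subtlety in video augmentation is that the random parameters should be shared across all frames of a clip, otherwise the augmentation itself introduces spurious motion. The sketch below, with illustrative parameter ranges not taken from any specific paper, applies one random crop, one flip decision and one colour-jitter setting to the whole clip.

```python
import torch
import torchvision.transforms.functional as TF

def augment_clip(clip, crop=224, p_flip=0.5):
    """Clip-consistent augmentation: the same random crop, horizontal flip and
    colour jitter are applied to every frame so that motion is not corrupted.

    clip: float tensor (T, C, H, W) in [0, 1], with H and W larger than `crop`.
    """
    T, C, H, W = clip.shape
    # One random crop position shared by all frames.
    top = torch.randint(0, H - crop + 1, (1,)).item()
    left = torch.randint(0, W - crop + 1, (1,)).item()
    clip = clip[:, :, top:top + crop, left:left + crop]

    if torch.rand(1).item() < p_flip:                  # same flip decision for the whole clip
        clip = torch.flip(clip, dims=[3])

    # Same colour jitter factors for the whole clip (per-frame jitter would flicker).
    brightness = 1 + 0.4 * (torch.rand(1).item() - 0.5)
    contrast = 1 + 0.4 * (torch.rand(1).item() - 0.5)
    clip = torch.stack([TF.adjust_contrast(TF.adjust_brightness(f, brightness), contrast)
                        for f in clip])
    return clip
```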
5.3. Video domain adaptation

Domain adaptation (DA) has been studied extensively in recent years to address the domain shift problem. Despite the accuracy on standard datasets getting higher and higher, the generalization capability of current video models across datasets or domains is less explored. There is early work on video domain adaptation [193, 241, 89, 159]. However, this literature focuses on small-scale video DA with only a few overlapping categories, which may not reflect the actual domain discrepancy and may lead to biased conclusions. Chen et al. [15] introduce two larger-scale datasets to investigate video DA and find that aligning temporal dynamics is particularly useful. Pan et al. [152] adopt co-attention to solve the temporal misalignment problem. Very recently, Munro et al. [145] explore a multi-modal self-supervision method for fine-grained video action recognition and show the effectiveness of multi-modality learning in video DA. Shuffle and Attend [95] argues that aligning the features of all sampled clips results in a sub-optimal solution because not all clips include relevant semantics. Therefore, they propose to use an attention mechanism to focus more on informative clips and discard the non-informative ones. In conclusion, video DA is a promising direction, especially for researchers with limited computing resources.
5.4. Neural architecture search

Neural architecture search (NAS) has attracted great interest in recent years and is a promising research direction. However, given its greedy need for computing resources, only a few papers have been published in this area [156, 163, 161, 178]. The TVN family [161], which jointly optimizes parameters and runtime, can achieve accuracy competitive with human-designed contemporary models while running much faster (within 37 to 100 ms on a CPU and 10 ms on a GPU per 1-second video clip). AssembleNet [178] and AssembleNet++ [177] provide a generic approach to learn the connectivity among feature representations across input modalities, and show surprisingly good performance on Charades and other benchmarks. AttentionNAS [222] proposed a solution for spatio-temporal attention cell search. The found cell can be plugged into any network to improve the spatio-temporal features. All of these papers show the potential of NAS for video understanding. Recently, efficient ways of searching architectures have been proposed in the image recognition domain, such as DARTS [130], ProxylessNAS [11], ENAS [160], one-shot NAS [7], etc. It would be interesting to combine efficient 2D CNNs and efficient search algorithms to perform video NAS at a reasonable cost.
5.5. Efficient model development

Despite their accuracy, deep learning based methods remain difficult to deploy for video understanding in real-world applications. There are several major challenges: (1) most methods are developed in offline settings, which means the input is a short video clip rather than a video stream in an online setting; (2) most methods do not meet real-time requirements; (3) 3D convolutions and other non-standard operators are incompatible with non-GPU devices (e.g., edge devices). Hence, the development of efficient network architectures based on 2D convolutions is a promising direction. Approaches proposed in the image classification domain can be easily adapted to video action recognition, e.g., model compression, model quantization, model pruning, distributed training [68, 127], mobile networks [80, 265], mixed precision training, etc. However, more effort is needed for the online setting, since the input to most action recognition applications, such as surveillance monitoring, is a video stream. We may need a new and more comprehensive dataset for benchmarking online video action recognition methods. Lastly, using compressed videos might be desirable because most videos are already compressed, and we have free access to the motion information.
5.6. New datasets

Data is as important as, if not more important than, model development for machine learning. For video action recognition, most datasets are biased towards spatial representations [124], i.e., most actions can be recognized by a single frame inside the video without considering the temporal movement. Hence, a new dataset focused on long-term temporal modeling is required to advance video understanding. Furthermore, most current datasets are collected from YouTube. Due to copyright/privacy issues, the dataset organizer often only releases the YouTube IDs or video links for users to download, not the actual videos. The first problem is that downloading these large-scale datasets can be slow in some regions. In particular, YouTube recently started to block massive downloading from a single IP. Thus, many researchers may not even be able to obtain the dataset to start working in this field. The second problem is that, due to region limitations and privacy issues, some videos are not accessible anymore. For example, the original Kinetics400 dataset has over 300K videos, but at this moment we can only crawl about 280K videos. On average, we lose 5% of the videos every year. It is impossible to do fair comparisons between methods when they are trained and evaluated on different data.

5.7. Video adversarial attack

Adversarial examples have been well studied on image models. [199] first showed that an adversarial sample, computed by adding a small amount of noise to the original image, may lead to a wrong prediction. However, limited work has been done on attacking video models. This task usually considers two settings: a white-box attack [86, 119, 66, 21], where the adversary always has full access to the model, including exact gradients for a given input; or a black-box attack [93, 245, 226], in which the structure and parameters of the model are hidden so that the attacker can only access (input, output) pairs through queries. The recent ME-Sampler [260] leverages motion information directly in generating adversarial videos, and is shown to successfully attack a number of video classification models using many fewer queries. This direction is important because many companies provide APIs for services such as video classification, anomaly detection, shot detection, face detection, etc. In addition, this topic is also related to detecting DeepFake videos. Hence, investigating both attack and defense methods is crucial to keeping these video services safe.
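As a minimal illustration of the white-box setting, the sketch below adapts the classic FGSM attack [66] to a video clip by taking a single signed-gradient step; real video attacks such as [86, 119, 260] are considerably more elaborate, so this is only a conceptual example.

```python
import torch
import torch.nn.functional as F

def fgsm_attack_clip(model, clip, label, epsilon=4 / 255):
    """White-box FGSM sketch adapted to video: one gradient step on the whole clip.

    clip: (1, C, T, H, W) input in [0, 1]; label: ground-truth class index tensor of shape (1,).
    """
    clip = clip.clone().requires_grad_(True)
    loss = F.cross_entropy(model(clip), label)
    loss.backward()
    # Move every pixel of every frame in the direction that increases the loss.
    adv_clip = clip + epsilon * clip.grad.sign()
    return adv_clip.clamp(0, 1).detach()
```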
5.8. Zero-shot action recognition

Zero-shot learning (ZSL) has been trending in the image understanding domain, and has recently been adapted to video action recognition. Its goal is to transfer the learned knowledge to classify previously unseen categories. Because (1) sourcing and annotating data is expensive and (2) the set of possible human actions is huge, zero-shot action recognition is a very useful task for real-world applications. There are many early attempts [242, 88, 243, 137, 168, 57] in this direction. Most of them follow a standard framework, which is to first extract visual features from videos using a pretrained network, and then train a joint model that maps the visual embedding to a semantic embedding space. In this manner, the model can be applied to new classes by finding the test class whose embedding is the nearest neighbor of the model's output. A recent work, URL [279], proposes to learn a universal representation that generalizes across datasets. Following URL [279], [10] presents the first end-to-end ZSL action recognition model. They also establish a new ZSL training and evaluation protocol, and provide an in-depth analysis to further advance this field. Inspired by the success of pre-training followed by zero-shot inference in the NLP domain, we believe ZSL action recognition is a promising research topic.
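A minimal sketch of the standard ZSL recipe described above: a learned mapping projects a pre-extracted video feature into a semantic (e.g., word-embedding) space, and the unseen class with the nearest embedding is predicted. The function and variable names are placeholders, not taken from any particular method.

```python
import torch
import torch.nn.functional as F

def zero_shot_classify(video_feat, mapper, class_embeddings):
    """Map a pre-extracted video feature into the semantic embedding space and
    pick the unseen class whose word embedding is closest.

    video_feat: (D_visual,) feature from a pretrained video network;
    mapper: learned projection (e.g., an nn.Linear or small MLP) into the semantic space;
    class_embeddings: (num_unseen_classes, D_semantic) word embeddings of class names.
    """
    z = mapper(video_feat)                                        # project into semantic space
    sims = F.cosine_similarity(z.unsqueeze(0), class_embeddings)  # similarity to every unseen class
    return sims.argmax().item()                                   # nearest-neighbour class index
```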
5.9. Weakly-supervised video action recognition

Building a high-quality video action recognition dataset [190, 100] usually requires multiple laborious steps: 1) sourcing a large amount of raw videos, typically from the internet; 2) removing videos irrelevant to the categories in the dataset; 3) manually trimming the video segments that contain the actions of interest; and 4) refining the categorical labels. Weakly-supervised action recognition explores how to reduce the cost of curating training data.

The first direction of research [19, 60, 58] aims to reduce the cost of sourcing videos and accurate categorical labeling. These methods design training procedures that use data such as action-related images or partially annotated videos gathered from publicly available sources such as the Internet. This paradigm is therefore also referred to as webly-supervised learning [19, 58]. Recent work on omni-supervised learning [60, 64, 38] also follows this paradigm, but features bootstrapping on unlabelled videos by distilling the models' own inference results.

The second direction aims at removing trimming, the most time-consuming part of annotation. UntrimmedNet [216] proposed a method to learn an action recognition model on untrimmed videos with only categorical labels [149, 172]. This task is also related to weakly-supervised temporal action localization, which aims to automatically generate the temporal span of the actions. Several papers propose to simultaneously [155] or iteratively [184] learn models for these two tasks.
5.10. Fine-grained video action recognition

Popular action recognition datasets, such as UCF101 [190] or Kinetics400 [100], mostly comprise actions happening in various scenes. However, models learned on these datasets could overfit to contextual information irrelevant to the actions [224, 227, 24]. Several datasets have been proposed to study the problem of fine-grained action recognition, which can examine a model's capacity to capture action-specific information. These datasets comprise fine-grained actions in human activities such as cooking [28, 108, 174], working [103] and sports [181, 124]. For example, FineGym [181] is a recent large dataset annotated with different moves and sub-actions in gymnastics videos.
5.11. Egocentric action recognition

Recently, large-scale egocentric action recognition [29, 28] has attracted increasing interest with the emergence of wearable camera devices. Egocentric action recognition requires a fine understanding of hand motion and of the objects being interacted with in a complex environment. A few papers leverage object detection features to provide fine object context and improve egocentric video recognition [136, 223, 229, 180]. Others incorporate spatio-temporal attention [192] or gaze annotations [131] to localize the interacting object and facilitate action recognition. As in third-person action recognition, multi-modal inputs (e.g., optical flow and audio) have been demonstrated to be effective for egocentric action recognition [101].
5.12. Multi-modality

Multi-modal video understanding has attracted increasing attention in recent years [55, 3, 129, 167, 154, 2, 105]. There are two main categories of multi-modal video understanding. The first group of approaches uses multiple modalities such as scene, object, motion, and audio to enrich the video representations. In the second group, the goal is to design a model that utilizes modality information as a supervision signal for pre-training models [195, 138, 249, 62, 2].
Multi-modality for comprehensive video understanding

Learning a robust and comprehensive representation of video is extremely challenging due to the complexity of semantics in videos. Video data often include variations in different forms, including appearance, motion, audio, text and scene [55, 129, 166]. Therefore, utilizing these multi-modal representations is a critical step towards understanding video content more effectively. The multi-modal representation of a video can be approximated by gathering representations of various modalities such as scene, object, audio, motion, appearance and text. Ngiam et al. [148] was an early attempt to suggest using multiple modalities to obtain better features. They utilized videos of lips and their corresponding speech for multi-modal representation learning. Miech et al. [139] proposed a mixture-of-embedding-experts model to combine multiple modalities, including motion, appearance, audio, and face features, and learn the shared embedding space between these modalities and text. Roig et al. [175] combine multiple modalities such as action, scene, object and acoustic event features in a pyramidal structure for action recognition. They show that adding each modality improves the final action recognition accuracy. Both CE [129] and MMT [55] follow a research line similar to [139], where the goal is to combine multiple modalities to obtain a comprehensive representation of video for joint video-text representation learning. Piergiovanni et al. [166] utilized textual data together with video data to learn a joint embedding space. Using this learned joint embedding space, the method is capable of zero-shot action recognition. This line of research is promising due to the availability of strong semantic extraction models and the success of transformers on both vision and language tasks.

Table 6. Comparison of self-supervised video representation learning methods. The top section shows multi-modal video representation learning approaches and the bottom section shows video-only representation learning methods. From left to right, we show the self-supervised training setting, e.g., dataset, modalities, resolution, and architecture. The last two columns show action recognition results on UCF101 and HMDB51 (linear evaluation and fine-tuning), which measure the quality of the self-supervised pre-trained model. HTM: HowTo100M, YT8M: YouTube8M, AS: AudioSet, IG-K: IG-Kinetics, K400: Kinetics400 and K600: Kinetics600.
Multi-modality for self-supervised video representation learning
Most videos contain multiple modalities such as audio or text/captions. These modalities are a great source of supervision for learning video representations [3, 144, 154, 2, 162]. Korbar et al. [105] incorporated the natural synchronization between audio and video as a supervision signal in their contrastive learning objective for self-supervised representation learning. In multi-modal self-supervised representation learning, the dataset plays an important role. VideoBERT [195] collected 310K cooking videos from YouTube; however, this dataset is not publicly available. Similar to BERT, VideoBERT used a "masked language model" training objective and quantized the visual representations into "visual words". Miech et al. [140] introduced the HowTo100M dataset in 2019. This dataset includes 136M clips from 1.22M videos with their corresponding text. They collected the dataset from YouTube with the aim of obtaining instructional videos (how to perform an activity). In total, it covers 23.6K instructional tasks. MIL-NCE [138] used this dataset for self-supervised cross-modal representation learning. They tackled the problem of visually misaligned narrations by considering multiple positive pairs in the contrastive learning objective. ActBERT [275] utilized the HowTo100M dataset for pre-training the model in a self-supervised way. They incorporated visual, action, text and object features for cross-modal representation learning. Recently, AVLnet [176] and MMV [2] considered three modalities, visual, audio and language, for self-supervised representation learning. This research direction is getting increasing attention due to the success of contrastive learning on many vision and language tasks and the access to abundant unlabeled multi-modal video data on platforms such as YouTube, Instagram or Flickr. The top section of Table 6 compares multi-modal self-supervised representation learning methods. We will discuss more work on video-only representation learning in the next section.
5.13. Self-supervised video representation learning

Self-supervised learning has attracted more attention recently as it is able to leverage a large amount of unlabeled data by designing a pretext task that obtains free supervisory signals from the data itself. It first emerged in image representation learning. On images, a first stream of papers aims at designing pretext tasks that complete missing information, such as image coloring [262] and image reordering [153, 61, 263]. A second stream of papers uses instance discrimination [235] as the pretext task and contrastive losses [235, 151] for supervision. They learn visual representations by modeling the visual similarity of object instances without class labels [235, 75, 201, 18, 17].

Self-supervised learning is also viable for videos. Compared with images, videos have another axis, the temporal dimension, which we can use to craft pretext tasks. Information completion tasks for this purpose include predicting the correct order of shuffled frames [141, 52] and video clips [240]. Jing et al. [94] focus on the spatial dimension only by predicting the rotation angles of rotated video clips. Combining temporal and spatial information, several tasks have been introduced to solve space-time cubic puzzles, anticipate future frames [208], forecast long-term motions [134] and predict motion and appearance statistics [211]. RSPNet [16] and visual tempo [247] exploit the relative speed between video clips as a supervision signal.

The added temporal axis also provides flexibility in designing instance discrimination pretexts [67, 167]. Inspired by the decoupling of 3D convolution into spatially and temporally separable convolutions [239], Zhang et al. [266] proposed to decouple video representation learning into two sub-tasks: spatial contrast and temporal contrast. Recently, Han et al. [72] proposed memory-augmented dense predictive coding for self-supervised video representation learning. They split each video into several blocks, and the embedding of a future block is predicted from the combination of condensed representations in memory.

The temporal continuity in videos also inspires researchers to design pretext tasks around correspondence. Wang et al. [221] proposed to learn representations by performing cycle-consistent tracking. Specifically, they track the same object backward and then forward in consecutive video frames, and use the inconsistency between the start and end points as the loss function. TCC [39] is a concurrent paper; instead of tracking local objects, [39] used cycle-consistency to perform frame-wise temporal alignment as a supervision signal. [120] is a follow-up work of [221] that utilizes both object-level and pixel-level correspondence across video frames. Recently, long-range temporal correspondence has been modelled as a random walk on a graph to help learn video representations [87].

We compare video self-supervised representation learning methods in the bottom section of Table 6. A clear trend can be observed: recent papers achieve much better linear evaluation accuracy, and fine-tuning accuracy comparable to supervised pre-training. This shows that self-supervised learning could be a promising direction towards learning better video representations.
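As a concrete example of the contrastive objectives used throughout this line of work, the sketch below computes an InfoNCE-style loss in which two clips from the same video form the positive pair and clips from other videos in the batch act as negatives; this is a generic formulation, not the exact loss of any single paper.

```python
import torch
import torch.nn.functional as F

def info_nce(clip_a_emb, clip_b_emb, temperature=0.07):
    """InfoNCE-style contrastive loss over a batch: two differently augmented
    (or differently sampled) clips from the same video are the positive pair,
    clips from other videos in the batch are the negatives.

    clip_a_emb, clip_b_emb: (N, D) embeddings from the video encoder.
    """
    a = F.normalize(clip_a_emb, dim=1)
    b = F.normalize(clip_b_emb, dim=1)
    logits = a @ b.t() / temperature                        # (N, N) pairwise similarities
    targets = torch.arange(a.shape[0], device=a.device)     # positives lie on the diagonal
    return F.cross_entropy(logits, targets)
```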
6. Conclusion
In this survey, we present a comprehensive review of 200+ recent deep learning based approaches to video action recognition. Although this is not an exhaustive list, we hope the survey serves as an easy-to-follow tutorial for those seeking to enter the field, and an inspiring discussion for those seeking to find new research directions.
Acknowledgement
We would like to thank Peter Gehler, Linchao Zhu and Thomas Brady for constructive feedback and fruitful discussions.
References [1] Sami Abu-El-Haija, Nisarg Kothari, Joonseok Lee, PaulNatsev, George Toderici, Balakrishnan Varadarajan, and udheendra Vijayanarasimhan. YouTube-8M: A Large-Scale Video Classification Benchmark. arXiv preprintarXiv:1609.08675 , 2016.[2] Jean-Baptiste Alayrac, Adri`a Recasens, Rosalia Schneider,Relja Arandjelovi´c, Jason Ramapuram, Jeffrey De Fauw,Lucas Smaira, Sander Dieleman, and Andrew Zisserman.Self-Supervised MultiModal Versatile Networks, 2020.[3] Humam Alwassel, Dhruv Mahajan, Bruno Korbar, LorenzoTorresani, Bernard Ghanem, and Du Tran. Self-Supervised Learning by Cross-Modal Audio-Video Clus-tering. In Advances in Neural Information Processing Sys-tems (NeurIPS) , 2020.[4] Relja Arandjelovi´c, Petr Gronat, Akihiko Torii, TomasPajdla, and Josef Sivic. NetVLAD: CNN Architecturefor Weakly Supervised Place Recognition. In
The IEEEConference on Computer Vision and Pattern Recognition(CVPR) , 2016.[5] Moez Baccouche, Franck Mamalet, Christian Wolf,Christophe Garcia, and Atilla Baskurt. Sequential DeepLearning for Human Action Recognition. In the SecondInternational Conference on Human Behavior Understand-ing , 2011.[6] Sagie Benaim, Ariel Ephrat, Oran Lang, Inbar Mosseri,William T. Freeman, Michael Rubinstein, Michal Irani, andTali Dekel. SpeedNet: Learning the Speediness in Videos.In
The IEEE Conference on Computer Vision and PatternRecognition (CVPR) , 2020.[7] Gabriel Bender, Pieter-Jan Kindermans, Barret Zoph, VijayVasudevan, and Quoc Le. Understanding and SimplifyingOne-Shot Architecture Search. In
The International Con-ference on Machine Learning (ICML) , 2018.[8] Shweta Bhardwaj, Mukundhan Srinivasan, and Mitesh M.Khapra. Efficient Video Classification Using FewerFrames. In
The IEEE Conference on Computer Vision andPattern Recognition (CVPR) , 2019.[9] Hakan Bilen, Basura Fernando, Efstratios Gavves, AndreaVedaldi, and Stephen Gould. Dynamic Image Networks forAction Recognition. In
The IEEE Conference on ComputerVision and Pattern Recognition (CVPR) , 2016.[10] Biagio Brattoli, Joseph Tighe, Fedor Zhdanov, Pietro Per-ona, and Krzysztof Chalupka. Rethinking Zero-Shot VideoClassification: End-to-End Training for Realistic Applica-tions. In
The IEEE Conference on Computer Vision andPattern Recognition (CVPR) , 2020.[11] Han Cai, Ligeng Zhu, and Song Han. ProxylessNAS: Di-rect Neural Architecture Search on Target Task and Hard-ware. In
The International Conference on Learning Repre-sentations (ICLR) , 2019.[12] Joao Carreira, Eric Noland, Andras Banki-Horvath, ChloeHillier, and Andrew Zisserman. A short note about kinetics-600. arXiv preprint arXiv:1808.01340 , 2018.[13] Joao Carreira, Eric Noland, Chloe Hillier, and Andrew Zis-serman. A short note on the kinetics-700 human actiondataset. arXiv preprint arXiv:1907.06987 , 2019.[14] Joao Carreira and Andrew Zisserman. Quo Vadis, Ac-tion Recognition? A New Model and the Kinetics Dataset.In
The IEEE Conference on Computer Vision and PatternRecognition (CVPR) , 2017. [15] Min-Hung Chen, Zsolt Kira, Ghassan AlRegib, JaekwonYoo, Ruxin Chen, and Jian Zheng. Temporal AttentiveAlignment for Large-Scale Video Domain Adaptation. In
The IEEE International Conference on Computer Vision(ICCV) , 2019.[16] Peihao Chen, Deng Huang, Dongliang He, Xiang Long,Runhao Zeng, Shilei Wen, Mingkui Tan, and ChuangGan. RSPNet: Relative Speed Perception for UnsupervisedVideo Representation Learning, 2020.[17] Ting Chen, Simon Kornblith, Mohammad Norouzi, andGeoffrey Hinton. A Simple Framework for ContrastiveLearning of Visual Representations. arXiv preprintarXiv:2002.05709 , 2020.[18] Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He.Improved Baselines with Momentum Contrastive Learning. arXiv preprint arXiv:2003.04297 , 2020.[19] Xinlei Chen and Abhinav Gupta. Webly SupervisedLearning of Convolutional Networks. In
Proceedings ofthe IEEE International Conference on Computer Vision(ICCV) , pages 1431–1439, 2015.[20] Yunpeng Chen, Yannis Kalantidis, Jianshu Li, ShuichengYan, and Jiashi Feng. Multi-Fiber Networks for VideoRecognition. In
The European Conference on ComputerVision (ECCV) , 2018.[21] Zhikai Chen, Lingxi Xie, Shanmin Pang, Yong He, and QiTian. Appending Adversarial Frames for Universal VideoAttack. arXiv preprint arXiv:1912.04538 , 2019.[22] Anoop Cherian, Basura Fernando, Mehrtash Harandi, andStephen Gould. Generalized Rank Pooling for ActivityRecognition. In
The IEEE Conference on Computer Visionand Pattern Recognition (CVPR) , 2017.[23] Guilhem Cheron, Ivan Laptev, and Cordelia Schmid. P-CNN: Pose-based CNN Features for Action Recognition.In
The IEEE International Conference on Computer Vision(ICCV) , 2015.[24] Jinwoo Choi, Chen Gao, Joseph CE Messou, and Jia-BinHuang. Why Can’t I Dance in the Mall? Learning to Mit-igate Scene Bias in Action Recognition. In
Advances inNeural Information Processing Systems (NeurIPS) , pages853–865, 2019.[25] Vasileios Choutas, Philippe Weinzaepfel, J´erˆome Revaud,and Cordelia Schmid. PoTion: Pose MoTion Representa-tion for Action Recognition. In
The IEEE Conference onComputer Vision and Pattern Recognition (CVPR) , 2018.[26] Nieves Crasto, Philippe Weinzaepfel, Karteek Alahari,and Cordelia Schmid. MARS: Motion-Augmented RGBStream for Action Recognition. In
The IEEE Conference onComputer Vision and Pattern Recognition (CVPR) , 2019.[27] Ekin D. Cubuk, Barret Zoph, Dandelion Mane, Vijay Va-sudevan, and Quoc V. Le. AutoAugment: Learning Aug-mentation Strategies From Data. In
The IEEE Confer-ence on Computer Vision and Pattern Recognition (CVPR) ,2019.[28] Dima Damen, Hazel Doughty, Giovanni Maria Farinella,Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Da-vide Moltisanti, Jonathan Munro, Toby Perrett, Will Price,and Michael Wray. The EPIC-KITCHENS Dataset: Col- ection, Challenges and Baselines. IEEE Transactions onPattern Analysis and Machine Intelligence (TPAMI) , 2020.[29] Dima Damen, Hazel Doughty, Giovanni Maria Farinella,Antonino Furnari, Evangelos Kazakos, Jian Ma, DavideMoltisanti, Jonathan Munro, Toby Perrett, Will Price,et al. Rescaling Egocentric Vision. arXiv preprintarXiv:2006.13256 , 2020.[30] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei.ImageNet: A Large-Scale Hierarchical Image Database. In
CVPR , 2009.[31] Terrance DeVries and Graham W Taylor. Improved Reg-ularization of Convolutional Neural Networks with Cutout. arXiv preprint arXiv:1708.04552 , 2017.[32] Ali Diba, Mohsen Fayyaz, Vivek Sharma, Amir HosseinKarami, Mohammad Mahdi Arzani, Rahman Yousefzadeh,and Luc Van Gool. Temporal 3D ConvNets: New Architec-ture and Transfer Learning for Video Classification. arXivpreprint arXiv:1711.08200 , 2017.[33] Ali Diba, Mohsen Fayyaz, Vivek Sharma, M.Mahdi Arzani, Rahman Yousefzadeh, Juergen Gall,and Luc Van Gool. Spatio-Temporal Channel CorrelationNetworks for Action Classification. In
The EuropeanConference on Computer Vision (ECCV) , 2018.[34] Ali Diba, Mohsen Fayyaz, Vivek Sharma, Manohar Paluri,J¨urgen Gall, Rainer Stiefelhagen, and Luc Van Gool. LargeScale Holistic Video Understanding. In
European Confer-ence on Computer Vision , pages 593–610. Springer, 2020.[35] Ali Diba, Ali Mohammad Pazandeh, and Luc Van Gool.Efficient Two-Stream Motion and Appearance 3D CNNsfor Video Classification. arXiv preprint arXiv:1608.08851 ,2016.[36] Ali Diba, Vivek Sharma, and Luc Van Gool. Deep TemporalLinear Encoding Networks. In
The IEEE Conference onComputer Vision and Pattern Recognition (CVPR) , 2017.[37] Jeff Donahue, Lisa Anne Hendricks, Marcus Rohrbach,Subhashini Venugopalan, Sergio Guadarrama, KateSaenko, and Trevor Darrell. Long-term Recurrent Convo-lutional Networks for Visual Recognition and Description.In
The IEEE Conference on Computer Vision and PatternRecognition (CVPR) , 2015.[38] Hao Dong Duan, Yue Zhao, Yuanjun Xiong, Wentao Liu,and Dahu Lin. Omni-sourced Webly-supervised Learningfor Video Recognition. In
European Conference on Com-puter Vision , 2020.[39] Debidatta Dwibedi, Yusuf Aytar, Jonathan Tompson, PierreSermanet, and Andrew Zisserman. Temporal Cycle-Consistency Learning. In
Proceedings of the IEEE Confer-ence on Computer Vision and Pattern Recognition , pages1801–1810, 2019.[40] Bernard Ghanem Fabian Caba Heilbron, Victor Escorciaand Juan Carlos Niebles. ActivityNet: A Large-Scale VideoBenchmark for Human Activity Understanding. In
TheIEEE Conference on Computer Vision and Pattern Recog-nition (CVPR) , 2015.[41] Hehe Fan, Zhongwen Xu, Linchao Zhu, Chenggang Yan,Jianjun Ge, and Yi Yang. Watching a small portion could beas good as watching all: Towards efficient video classifica- tion. In
IJCAI International Joint Conference on ArtificialIntelligence , 2018.[42] Lijie Fan, Wenbing Huang, Chuang Gan, Stefano Ermon,Boqing Gong, and Junzhou Huang. End-to-End Learningof Motion Representation for Video Understanding. In
TheIEEE Conference on Computer Vision and Pattern Recog-nition (CVPR) , 2018.[43] Quanfu Fan, Chun-Fu (Richard) Chen, Hilde Kuehne,Marco Pistoia, and David Cox. More Is Less: LearningEfficient Video Representations by Big-Little Network andDepthwise Temporal Aggregation. In
Advances in NeuralInformation Processing Systems (NeurIPS) , 2019.[44] Christoph Feichtenhofer. X3D: Expanding Architecturesfor Efficient Video Recognition. In
The IEEE Confer-ence on Computer Vision and Pattern Recognition (CVPR) ,2020.[45] Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, andKaiming He. SlowFast Networks for Video Recognition.In
The IEEE International Conference on Computer Vision(ICCV) , 2019.[46] Christoph Feichtenhofer, Axel Pinz, and Richard P.Wildes. Spatiotemporal Residual Networks for Video Ac-tion Recognition. In
Advances in Neural Information Pro-cessing Systems (NeurIPS) , 2016.[47] Christoph Feichtenhofer, Axel Pinz, and Richard P Wildes.Spatiotemporal Multiplier Networks for Video ActionRecognition. In
The IEEE Conference on Computer Visionand Pattern Recognition (CVPR) , 2017.[48] Christoph Feichtenhofer, Axel Pinz, Richard P. Wildes, andAndrew Zisserman. What Have We Learned From DeepRepresentations for Action Recognition? In
The IEEEConference on Computer Vision and Pattern Recognition(CVPR) , 2018.[49] Christoph Feichtenhofer, Axel Pinz, Richard P. Wildes, andAndrew Zisserman. Deep insights into convolutional net-works for video recognition.
International Journal of Com-puter Vision (IJCV) , 2019.[50] Christoph Feichtenhofer, Axel Pinz, and Andrew Zisser-man. Convolutional Two-Stream Network Fusion for VideoAction Recognition. In
The IEEE Conference on ComputerVision and Pattern Recognition (CVPR) , 2016.[51] Basura Fernando, Peter Anderson, Marcus Hutter, andStephen Gould. Discriminative Hierarchical Rank Poolingfor Activity Recognition. In
The IEEE Conference on Com-puter Vision and Pattern Recognition (CVPR) , 2016.[52] Basura Fernando, Hakan Bilen, Efstratios Gavves, andStephen Gould. Self-supervised video representation learn-ing with odd-one-out networks. In
Proceedings of theIEEE conference on computer vision and pattern recogni-tion , pages 3636–3645, 2017.[53] Basura Fernando, Efstratios Gavves, Jose Oramas M., AmirGhodrati, and Tinne Tuytelaars. Modeling Video EvolutionFor Action Recognition. In
The IEEE Conference on Com-puter Vision and Pattern Recognition (CVPR) , 2015.[54] Basura Fernando and Stephen Gould. Learning End-to-endVideo Classification with Rank-Pooling. In
The Interna-tional Conference on Machine Learning (ICML) , 2016.
55] Gabeur et al. Multi-modal Transformer for Video Retrieval. arxiv:2007.10639 , 2020.[56] Harshala Gammulle, Simon Denman, Sridha Sridharan,and Clinton Fookes. Two Stream LSTM: A Deep Fu-sion Framework for Human Action Recognition. In
TheIEEE Winter Conference on Applications of Computer Vi-sion (WACV) , 2017.[57] Chuang Gan, Ming Lin, Yi Yang, Yueting Zhuang, andAlexander G.Hauptmann. Exploring Semantic Inter-ClassRelationships (SIR) for Zero-Shot Action Recognition. In
AAAI , 2015.[58] Chuang Gan, Chen Sun, Lixin Duan, and Boqing Gong.Webly-supervised video recognition by mutually voting forrelevant web images and web video frames. In
EuropeanConference on Computer Vision , pages 849–866. Springer,2016.[59] Nuno C. Garcia, Pietro Morerio, and Vittorio Murino.Modality Distillation with Multiple Stream Networks forAction Recognition. In
The European Conference on Com-puter Vision (ECCV) , 2018.[60] Deepti Ghadiyaram, Matt Feiszli, Du Tran, Xueting Yan,Heng Wang, and D. Mahajan. Large-Scale Weakly-Supervised Pre-Training for Video Action Recognition.
TheIEEE Conference on Computer Vision and Pattern Recog-nition (CVPR) , 2019.[61] Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Un-supervised representation learning by predicting image ro-tations. arXiv preprint arXiv:1803.07728 , 2018.[62] Simon Ging, Mohammadreza Zolfaghari, Hamed Pirsi-avash, and Thomas Brox. Coot: Cooperative hierarchicaltransformer for video-text representation learning. In
Ad-vances in Neural Information Processing Systems , 2020.[63] Rohit Girdhar, Deva Ramanan, Abhinav Gupta, JosefSivic, and Bryan Russell. ActionVLAD: Learning Spatio-Temporal Aggregation for Action Classification. In
TheIEEE Conference on Computer Vision and Pattern Recog-nition (CVPR) , 2017.[64] Rohit Girdhar, Du Tran, Lorenzo Torresani, and Deva Ra-manan. Distinit: Learning video representations withouta single labeled video. In
Proceedings of the IEEE Inter-national Conference on Computer Vision , pages 852–861,2019.[65] M. A. Goodale and A. D. Milner. Separate Visual Pathwaysfor Perception and Action.
Trends in Neurosciences , 1992.[66] Ian Goodfellow, Jonathon Shlens, and Christian Szegedy.Explaining and harnessing adversarial examples. In
Inter-national Conference on Learning Representations , 2015.[67] Daniel Gordon, Kiana Ehsani, Dieter Fox, and Ali Farhadi.Watching the World Go By: Representation Learning fromUnlabeled Videos. arXiv preprint arXiv:2003.07990 , 2020.[68] Priya Goyal, Piotr Doll´ar, Ross Girshick, Pieter Noord-huis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch,Yangqing Jia, and Kaiming He. Accurate, Large Mini-batch SGD: Training ImageNet in 1 Hour. arXiv preprintarXiv:1706.02677 , 2017.[69] Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michal-ski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, MoritzMueller-Freitag, et al. The” Something Something” VideoDatabase for Learning and Evaluating Visual CommonSense. In
The IEEE International Conference on ComputerVision (ICCV) , 2017.[70] Chunhui Gu, Chen Sun, David A. Ross, Carl Von-drick, Caroline Pantofaru, Yeqing Li, Sudheendra Vijaya-narasimhan, George Toderici, Susanna Ricco, Rahul Suk-thankar, Cordelia Schmid, and Jitendra Malik. AVA: AVideo Dataset of Spatio-Temporally Localized Atomic Vi-sual Actions. In
The IEEE Conference on Computer Visionand Pattern Recognition (CVPR) , 2018.[71] Tengda Han, Weidi Xie, and Andrew Zisserman. Video rep-resentation learning by dense predictive coding. In
Work-shop on Large Scale Holistic Video Understanding, ICCV ,2019.[72] Tengda Han, Weidi Xie, and Andrew Zisserman. Memory-augmented dense predictive coding for video representa-tion learning. In
European Conference on Computer Vision ,2020.[73] Tengda Han, Weidi Xie, and Andrew Zisserman. Self-supervised co-training for video representation learning. In
Neurips , 2020.[74] Kensho Hara, Hirokatsu Kataoka, and Yutaka Satoh. CanSpatiotemporal 3D CNNs Retrace the History of 2D CNNsand ImageNet? In
The IEEE Conference on Computer Vi-sion and Pattern Recognition (CVPR) , 2018.[75] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and RossGirshick. Momentum contrast for unsupervised visual rep-resentation learning. arXiv preprint arXiv:1911.05722 ,2019.[76] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun.Deep Residual Learning for Image Recognition. In
TheIEEE Conference on Computer Vision and Pattern Recog-nition (CVPR) , 2016.[77] Samitha Herath, Mehrtash Harandi, and Fatih Porikli. Go-ing Deeper into Action Recognition: A Survey. arXivpreprint arXiv:1605.04988 , 2016.[78] Sepp Hochreiter and J¨urgen Schmidhuber. Long Short-Term Memory.
Neural Computation , 1997.[79] Berthold K.P. Horn and Brian G. Rhunck. Determining Op-tical Flow.
Artificial Intelligence , 1981.[80] Andrew Howard, Mark Sandler, Grace Chu, Liang-ChiehChen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu,Ruoming Pang, Vijay Vasudevan, Quoc V. Le, and HartwigAdam. Searching for MobileNetV3. In
The IEEE Interna-tional Conference on Computer Vision (ICCV) , 2019.[81] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-ExcitationNetworks. In
The IEEE Conference on Computer Visionand Pattern Recognition (CVPR) , 2018.[82] De-An Huang, Vignesh Ramanathan, Dhruv Mahajan,Lorenzo Torresani, Manohar Paluri, Li Fei-Fei, and JuanCarlos Niebles. What Makes a Video a Video: AnalyzingTemporal Information in Video Understanding Models andDatasets. In
The IEEE Conference on Computer Vision andPattern Recognition (CVPR) , 2018.
83] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kil-ian Q. Weinberger. Densely Connected Convolutional Net-works. In
The IEEE Conference on Computer Vision andPattern Recognition (CVPR) , 2017.[84] Noureldien Hussein, Efstratios Gavves, and Arnold WMSmeulders. Timeception for complex action recognition.In
Proceedings of the IEEE Conference on Computer Vi-sion and Pattern Recognition , pages 254–263, 2019.[85] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, andT. Brox. FlowNet 2.0: Evolution of Optical Flow Estima-tion with Deep Networks. In
The IEEE Conference on Com-puter Vision and Pattern Recognition (CVPR) , 2017.[86] Nathan Inkawhich, Matthew Inkawhich, Yiran Chen, andHai Li. Adversarial attacks for optical flow-based actionrecognition classifiers. arXiv preprint arXiv:1811.11875 ,2018.[87] Allan Jabri, Andrew Owens, and Alexei A Efros. Space-time correspondence as a contrastive random walk.
Ad-vances in Neural Information Processing Systems , 2020.[88] Mihir Jain, Jan C van Gemert, Thomas Mensink, andCees GM Snoek. Objects2action: Classifying and Local-izing Actions without Any Video Example. In
ICCV , 2015.[89] Arshad Jamal, Vinay P Namboodiri, Dipti Deodhare, andKS Venkatesh. Deep Domain Adaptation in Action Space.In
The British Machine Vision Conference (BMVC) , 2018.[90] Jingwei Ji, Ranjay Krishna, Li Fei-Fei, and Juan Car-los Niebles. Action genome: Actions as compositionsof spatio-temporal scene graphs. In
Proceedings of theIEEE/CVF Conference on Computer Vision and PatternRecognition , pages 10236–10247, 2020.[91] Shuiwang Ji, Wei Xu, Ming Yang, and Kai Yu. 3D Convo-lutional Neural Networks for Human Action Recognition.
IEEE Transactions on Pattern Analysis and Machine Intel-ligence (PAMI) , 2012.[92] Boyuan Jiang, MengMeng Wang, Weihao Gan, Wei Wu,and Junjie Yan. STM: SpatioTemporal and Motion En-coding for Action Recognition. In
The IEEE InternationalConference on Computer Vision (ICCV) , 2019.[93] Linxi Jiang, Xingjun Ma, Shaoxiang Chen, James Bailey,and Yu-Gang Jiang. Black-box adversarial attacks on videorecognition models. In
Proceedings of the 27th ACM Inter-national Conference on Multimedia , pages 864–872, 2019.[94] Longlong Jing, Xiaodong Yang, Jingen Liu, and YingliTian. Self-supervised spatiotemporal feature learn-ing via video rotation prediction. arXiv preprintarXiv:1811.11387 , 2018.[95] Samuel Schulter Jinwoo Choi, Gaurav Sharma and Jia-BinHuang. Shuffle and Attend: Video Domain Adaptation.In
The European Conference on Computer Vision (ECCV) ,2020.[96] Valter Lu´ıs Estevam Junior, Helio Pedrini, and DavidMenotti. Zero-Shot Action Recognition in Videos: A Sur-vey. arXiv preprint arXiv:1909.06423 , 2019.[97] Herv´e J´egou, Matthijs Douze, Cordelia Schmid, and PatrickP´erez. Aggregating Local Descriptors into a Compact Im-age Representation. In
The IEEE Conference on ComputerVision and Pattern Recognition (CVPR) , 2010. [98] Amlan Kar, Nishant Rai, Karan Sikka, and Gaurav Sharma.AdaScan: Adaptive Scan Pooling in Deep ConvolutionalNeural Networks for Human Action Recognition in Videos.In
The IEEE Conference on Computer Vision and PatternRecognition (CVPR) , 2017.[99] Andrej Karpathy, George Toderici, Sanketh Shetty, ThomasLeung, Rahul Sukthankar, and Li Fei-Fei. Large-ScaleVideo Classification with Convolutional Neural Networks.In
The IEEE Conference on Computer Vision and PatternRecognition (CVPR) , 2014.[100] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang,Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola,Tim Green, Trevor Back, Paul Natsev, Mustafa Suleyman,and Andrew Zisserman. The Kinetics Human Action VideoDataset. arXiv preprint arXiv:1705.06950 , 2017.[101] Evangelos Kazakos, Arsha Nagrani, Andrew Zisserman,and Dima Damen. EPIC-Fusion: Audio-Visual TemporalBinding for Egocentric Action Recognition. In
ICCV , 2019.[102] Dahun Kim, Donghyeon Cho, and In So Kweon. Self-Supervised Video Representation Learning with Space-Time Cubic Puzzles. In
AAAI , 2019.[103] Takuya Kobayashi, Yoshimitsu Aoki, Shogo Shimizu, Kat-suhiro Kusano, and Seiji Okumura. Fine-grained actionrecognition in assembly work scenes by drawing atten-tion to the hands. In , pages 440–446. IEEE, 2019.[104] Yu Kong and Yun Fu. Human Action Recognition and Pre-diction: A Survey. arXiv preprint arXiv:1806.11230 , 2018.[105] Bruno Korbar, Du Tran, and Lorenzo Torresani. Co-operative learning of audio and video models from self-supervised synchronization, 2018.[106] Bruno Korbar, Du Tran, and Lorenzo Torresani. SCSam-pler: Sampling Salient Clips From Video for Efficient Ac-tion Recognition. In
The IEEE International Conference onComputer Vision (ICCV) , 2019.[107] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton.ImageNet Classification with Deep Convolutional NeuralNetworks. In
Advances in Neural Information ProcessingSystems (NeurIPS) , 2012.[108] Hilde Kuehne, Ali Arslan, and Thomas Serre. The languageof actions: Recovering the syntax and semantics of goal-directed human activities. In
Proceedings of the IEEE con-ference on computer vision and pattern recognition , pages780–787, 2014.[109] Hildegard Kuehne, Hueihan Jhuang, Est´ıbaliz Garrote,Tomaso Poggio, and Thomas Serre. HMDB: A Large VideoDatabase for Human Motion Recognition. In
The IEEE In-ternational Conference on Computer Vision (ICCV) , 2011.[110] Heeseung Kwon, Manjin Kim, Suha Kwak, and MinsuCho. Motionsqueeze: Neural motion feature learning forvideo understanding. In
ECCV , 2020.[111] Okan K¨op¨ukl¨u, Neslihan Kose, Ahmet Gunduz, and Ger-hard Rigoll. Resource Efficient 3D Convolutional NeuralNetworks. In
The IEEE International Conference on Com-puter Vision (ICCV) Workshop , 2019.[112] Zhenzhong Lan, Ming Lin, Xuanchong Li, Alexander G.Hauptmann, and Bhiksha Raj. Beyond Gaussian Pyramid: ulti-skip Feature Stacking for Action Recognition. In TheIEEE Conference on Computer Vision and Pattern Recog-nition (CVPR) , 2015.[113] Zhenzhong Lan, Dezhong Yao, Ming Lin, Shoou-I Yu, andAlexander Hauptmann. The Best of Both Worlds: Combin-ing Data-independent and Data-driven Approaches for Ac-tion Recognition. arXiv preprint arXiv:1505.04427 , 2015.[114] Zhenzhong Lan, Yi Zhu, Alexander G. Hauptmann, andShawn Newsam. Deep Local Video Feature for ActionRecognition. In
The IEEE Conference on Computer Visionand Pattern Recognition (CVPR) Workshops , 2017.[115] Hsin-Ying Lee, Jia-Bin Huang, Maneesh Kumar Singh, andMing-Hsuan Yang. Unsupervised Representation Learningby Sorting Sequence, 2017.[116] Myunggi Lee, Seungeui Lee, Sungjoon Son, Gyutae Park,and Nojun Kwak. Motion Feature Network: Fixed MotionFilter for Action Recognition. In
The European Conferenceon Computer Vision (ECCV) , 2018.[117] Ang Li, Meghana Thotakuri, David A Ross, Jo˜ao Car-reira, Alexander Vostrikov, and Andrew Zisserman. Theava-kinetics localized human actions video dataset. arXivpreprint arXiv:2005.00214 , 2020.[118] Qing Li, Zhaofan Qiu, Ting Yao, Tao Mei, Yong Rui, andJiebo Luo. Action Recognition by Learning Deep Multi-Granular Spatio-Temporal Video Representation. In
TheACM International Conference on Multimedia Retrieval(ICMR) , 2016.[119] Shasha Li, Ajaya Neupane, Sujoy Paul, Chengyu Song,Srikanth V Krishnamurthy, Amit K Roy Chowdhury, andAnanthram Swami. Adversarial perturbations againstreal-time video classification systems. arXiv preprintarXiv:1807.00458 , 2018.[120] Xueting Li, Sifei Liu, Shalini De Mello, Xiaolong Wang,Jan Kautz, and Ming-Hsuan Yang. Joint-task self-supervised learning for temporal correspondence. In
Ad-vances in Neural Information Processing Systems , pages317–327, 2019.[121] Xinyu Li, Bing Shuai, and Joseph Tighe. Directional tem-poral modeling for action recognition. In
European Confer-ence on Computer Vision , pages 275–291. Springer, 2020.[122] Yan Li, Bin Ji, Xintian Shi, Jianguo Zhang, Bin Kang, andLimin Wang. TEA: Temporal Excitation and Aggregationfor Action Recognition. In
The IEEE Conference on Com-puter Vision and Pattern Recognition (CVPR) , 2020.[123] Yingwei Li, Weixin Li, Vijay Mahadevan, and Nuno Vas-concelos. VLAD3: Encoding Dynamics of Deep Featuresfor Action Recognition. In
The IEEE Conference on Com-puter Vision and Pattern Recognition (CVPR) , 2016.[124] Yingwei Li, Yi Li, and Nuno Vasconcelos. RESOUND:Towards Action Recognition without Representation Bias.In
The European Conference on Computer Vision (ECCV) ,2018.[125] Zhenyang Li, Kirill Gavrilyuk, Efstratios Gavves, MihirJain, and Cees GM Snoek. VideoLSTM Convolves, At-tends and Flows for Action Recognition.
Computer Visionand Image Understanding (CVIU) , 2018. [126] Sungbin Lim, Ildoo Kim, Taesup Kim, Chiheon Kim, andSungwoong Kim. Fast AutoAugment. In
Advances in Neu-ral Information Processing Systems (NeurIPS) , 2019.[127] Ji Lin, Chuang Gan, and Song Han. Training Kinetics in15 Minutes: Large-scale Distributed Training on Videos.In
Advances in Neural Information Processing Systems(NeurIPS) Workshop , 2019.[128] Ji Lin, Chuang Gan, and Song Han. TSM: Temporal ShiftModule for Efficient Video Understanding. In
The IEEE In-ternational Conference on Computer Vision (ICCV) , 2019.[129] Liu et al. Use What You Have: Video Retrieval using Rep-resentations from Collaborative Experts. arxiv:1907.13487 ,2019.[130] Hanxiao Liu, Karen Simonyan, and Yiming Yang. DARTS:Differentiable Architecture Search. In
The InternationalConference on Learning Representations (ICLR) , 2019.[131] Miao Liu, Siyu Tang, Yin Li, and James Rehg. Forecastinghuman object interaction: Joint prediction of motor atten-tion and egocentric activity. In
ECCV , 2020.[132] Zhaoyang Liu, Donghao Luo, Yabiao Wang, Limin Wang,Ying Tai, Chengjie Wang, Jilin Li, Feiyue Huang, and TongLu. TEINet: Towards an Efficient Architecture for VideoRecognition. In
The Conference on Artificial Intelligence(AAAI) , 2020.[133] Zhaoyang Liu, Limin Wang, Wayne Wu, Chen Qian, andTong Lu. TAM: Temporal Adaptive Module for VideoRecognition. arXiv preprint arXiv:2005.06803 , 2020.[134] Zelun Luo, Boya Peng, De-An Huang, Alexandre Alahi,and Li Fei-Fei. Unsupervised learning of long-term motiondynamics for videos. In
Proceedings of the IEEE Confer-ence on Computer Vision and Pattern Recognition , pages2203–2212, 2017.[135] Chih-Yao Ma, Min-Hung Chen, Zsolt Kira, and GhassanAlRegib. TS-LSTM and Temporal-Inception: ExploitingSpatiotemporal Dynamics for Activity Recognition.
SignalProcessing: Image Communication , 2019.[136] Minghuang Ma, Haoqi Fan, and Kris M Kitani. Goingdeeper into first-person activity recognition. In
CVPR ,2016.[137] Pascal Mettes and Cees G. M. Snoek. Spatial-Aware ObjectEmbeddings for Zero-Shot Localization and Classificationof Actions. In
ICCV , 2017.[138] Antoine Miech, Jean-Baptiste Alayrac, Lucas Smaira, IvanLaptev, Josef Sivic, and Andrew Zisserman. End-to-EndLearning of Visual Representations from Uncurated In-structional Videos. In
CVPR , 2020.[139] Antoine Miech, Ivan Laptev, and Josef Sivic. Learning atext-video embedding from incomplete and heterogeneousdata. arXiv preprint arXiv:1804.02516 , 2018.[140] Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac,Makarand Tapaswi, Ivan Laptev, and Josef Sivic.HowTo100M: Learning a Text-Video Embedding byWatching Hundred Million Narrated Video Clips. In
ICCV ,2019.[141] Ishan Misra, C Lawrence Zitnick, and Martial Hebert.Shuffle and learn: unsupervised learning using temporalorder verification. In
European Conference on Computer Vision, pages 527–544. Springer, 2016.
[142] Mathew Monfort et al. Moments in Time Dataset: One Million Videos for Event Understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2019.
[143] Mathew Monfort, Kandan Ramakrishnan, Alex Andonian, Barry A McNamara, Alex Lascelles, Bowen Pan, Quanfu Fan, Dan Gutfreund, Rogerio Feris, and Aude Oliva. Multi-moments in time: Learning and interpreting models for multi-action video understanding. arXiv preprint arXiv:1911.00232, 2019.
[144] Pedro Morgado, Nuno Vasconcelos, and Ishan Misra. Audio-visual instance discrimination with cross-modal agreement. 2020.
[145] Jonathan Munro and Dima Damen. Multi-Modal Domain Adaptation for Fine-Grained Action Recognition. In
TheIEEE Conference on Computer Vision and Pattern Recog-nition (CVPR) , 2020.[146] Joe Yue-Hei Ng, Jonghyun Choi, Jan Neumann, andLarry S. Davis. ActionFlowNet: Learning Motion Rep-resentation for Action Recognition. In
The IEEE WinterConference on Applications of Computer Vision (WACV) ,2018.[147] Joe Yue-Hei Ng and Larry S. Davis. Temporal Differ-ence Networks for Video Action Recognition. In
TheIEEE Winter Conference on Applications of Computer Vi-sion (WACV) , 2018.[148] Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam,Honglak Lee, and Andrew Y Ng. Multimodal deep learn-ing. In
ICML , 2011.[149] Phuc Nguyen, Ting Liu, Gautam Prasad, and Bohyung Han.Weakly supervised action localization by sparse temporalpooling network. In
Proceedings of the IEEE Conferenceon Computer Vision and Pattern Recognition , pages 6752–6761, 2018.[150] Bruce Xiaohan Nie, Caiming Xiong, and Song-Chun Zhu.Joint Action Recognition and Pose Estimation From Video.In
The IEEE Conference on Computer Vision and PatternRecognition (CVPR) , 2015.[151] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Repre-sentation learning with contrastive predictive coding. arXivpreprint arXiv:1807.03748 , 2018.[152] Boxiao Pan, Zhangjie Cao, Ehsan Adeli, and Juan CarlosNiebles. Adversarial Cross-Domain Action Recognitionwith Co-Attention. In
The Conference on Artificial Intelligence (AAAI), 2020.
[153] Deepak Pathak, Philipp Krähenbühl, Jeff Donahue, Trevor Darrell, and Alexei A. Efros. Context encoders: Feature learning by inpainting. In
CVPR, 2016.
[154] Mandela Patrick, Yuki M. Asano, Ruth Fong, João F. Henriques, Geoffrey Zweig, and Andrea Vedaldi. Multi-modal self-supervision from generalized data transformations. arXiv preprint arXiv:2003.04298, 2020.
[155] Sujoy Paul, Sourya Roy, and Amit K Roy-Chowdhury. W-talc: Weakly-supervised temporal activity localization and classification. In
Proceedings of the European Conferenceon Computer Vision (ECCV) , pages 563–579, 2018. [156] Wei Peng, Xiaopeng Hong, and Guoying Zhao. Video Ac-tion Recognition Via Neural Architecture Searching. In
The IEEE International Conference on Image Processing(ICIP) , 2019.[157] Xiaojiang Peng, Limin Wang, Xingxing Wang, and YuQiao. Bag of Visual Words and Fusion Methods for Ac-tion Recognition: Comprehensive Study and Good Prac-tice. arXiv preprint arXiv:1405.4506 , 2014.[158] Xiaojiang Peng, Changqing Zou, Yu Qiao, and Qiang Peng.Action Recognition with Stacked Fisher Vectors. In
TheEuropean Conference on Computer Vision (ECCV) , 2014.[159] Toby Perrett and Dima Damen. DDLSTM: Dual-DomainLSTM for Cross-Dataset Action Recognition. In
The IEEEConference on Computer Vision and Pattern Recognition(CVPR) , 2019.[160] Hieu Pham, Melody Y. Guan, Barret Zoph, Quoc V. Le, andJeff Dean. Efficient Neural Architecture Search via Param-eter Sharing. In
The International Conference on MachineLearning (ICML) , 2018.[161] AJ Piergiovanni, Anelia Angelova, and Michael S. Ryoo.Tiny Video Networks. arXiv preprint arXiv:1910.06961 ,2019.[162] AJ Piergiovanni, Anelia Angelova, and Michael S Ryoo.Evolving Losses for Unsupervised Video RepresentationLearning. In
Proceedings of the IEEE/CVF Conference onComputer Vision and Pattern Recognition , pages 133–142,2020.[163] AJ Piergiovanni, Anelia Angelova, Alexander Toshev, andMichael S. Ryoo. Evolving Space-Time Neural Architec-tures for Videos. In
The IEEE International Conference onComputer Vision (ICCV) , 2019.[164] AJ Piergiovanni and Michael S. Ryoo. Representation Flowfor Action Recognition. In
The IEEE Conference on Com-puter Vision and Pattern Recognition (CVPR) , 2019.[165] AJ Piergiovanni and Michael S. Ryoo. Avid dataset:Anonymized videos from diverse countries, 2020.[166] AJ Piergiovanni and Michael S. Ryoo. Learning multi-modal representations for unseen activities, 2020.[167] Rui Qian, Tianjian Meng, Boqing Gong, Ming-HsuanYang, Huisheng Wang, Serge Belongie, and Yin Cui. Spa-tiotemporal contrastive video representation learning. arXivpreprint arXiv:2008.03800 , 2020.[168] Jie Qin, Li Liu, Ling Shao, Fumin Shen, Bingbing Ni, Ji-axin Chen, and Yunhong Wang. Zero-Shot Action Recogni-tion with Error-Correcting Output Codes. In
CVPR , 2017.[169] Zhaofan Qiu, Ting Yao, and Tao Mei. Learning Spatio-Temporal Representation with Pseudo-3D Residual Net-works. In
The IEEE International Conference on Computer Vision (ICCV), 2017.
[170] Ilija Radosavovic, Raj Prateek Kosaraju, Ross Girshick, Kaiming He, and Piotr Dollár. Designing Network Design Spaces. In
The IEEE Conference on Computer Vision andPattern Recognition (CVPR) , 2020.[171] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun.Faster R-CNN: Towards Real-Time Object Detection withRegion Proposal Networks. In
Advances in Neural Information Processing Systems (NeurIPS), 2015. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 754–763, 2017.
[173] Itsaso Rodríguez-Moreno, José María Martínez-Otzeta, Basilio Sierra, Igor Rodriguez, and Ekaitz Jauregi. Video Activity Recognition: State-of-the-Art.
Sensors , 2019.[174] Marcus Rohrbach, Anna Rohrbach, Michaela Regneri,Sikandar Amin, Mykhaylo Andriluka, Manfred Pinkal, andBernt Schiele. Recognizing fine-grained and composite ac-tivities using hand-centric features and script data.
International Journal of Computer Vision, pages 1–28, 2015.
[175] C. Roig, M. Sarmiento, D. Varas, I. Masuda, J. C. Riveiro, and E. Bou-Balust. Multi-modal pyramid feature combination for human action recognition. In , pages 3742–3746, 2019.
[176] Andrew Rouditchenko, Angie Boggust, David Harwath, Dhiraj Joshi, Samuel Thomas, Kartik Audhkhasi, Rogerio Feris, Brian Kingsbury, Michael Picheny, Antonio Torralba, and James Glass. Avlnet: Learning audio-visual language representations from instructional videos, 2020.
[177] Michael S Ryoo, AJ Piergiovanni, Juhana Kangaspunta, and Anelia Angelova. Assemblenet++: Assembling modality representations via attention connections. arXiv preprint arXiv:2008.08072, 2020.
[178] Michael S. Ryoo, AJ Piergiovanni, Mingxing Tan, and Anelia Angelova. AssembleNet: Searching for Multi-Stream Neural Connectivity in Video Architectures. In
The International Conference on Learning Representations(ICLR) , 2020.[179] Jorge Sanchez, Florent Perronnin, Thomas Mensink, andJakob Verbeek. Image Classification with the Fisher Vector:Theory and Practice.
International Journal of ComputerVision (IJCV) , 2013.[180] Fadime Sener, Dipika Singhania, and Angela Yao. Tempo-ral aggregate representations for long-range video under-standing. In
European Conference on Computer Vision ,pages 154–171. Springer, 2020.[181] Dian Shao, Yue Zhao, Bo Dai, and Dahua Lin. FineGym:A Hierarchical Video Dataset for Fine-grained Action Un-derstanding. In
The IEEE Conference on Computer Visionand Pattern Recognition (CVPR) , 2020.[182] Hao Shao, Shengju Qian, and Yu Liu. Temporal interlacingnetwork, 2020.[183] Yemin Shi, Yonghong Tian, Yaowei Wang, Wei Zeng, andTiejun Huang. Learning Long-Term Dependencies for Ac-tion Recognition With a Biologically-Inspired Deep Net-work. In
The IEEE International Conference on ComputerVision (ICCV) , 2017.[184] Zheng Shou, Hang Gao, Lei Zhang, Kazuyuki Miyazawa,and Shih-Fu Chang. Autoloc: Weakly-supervised temporalaction localization in untrimmed videos. In
Proceedingsof the European Conference on Computer Vision (ECCV) ,pages 154–171, 2018.[185] Zheng Shou, Xudong Lin, Yannis Kalantidis, Laura Sevilla-Lara, Marcus Rohrbach, Shih-Fu Chang, and Zhicheng Yan. DMC-Net: Generating Discriminative Motion Cuesfor Fast Compressed Video Action Recognition. In
The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[186] Gunnar A. Sigurdsson, Gül Varol, Xiaolong Wang, Ali Farhadi, Ivan Laptev, and Abhinav Gupta. Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding. In
The European Conference on Computer Vi-sion (ECCV) , 2016.[187] Karen Simonyan and Andrew Zisserman. Two-StreamConvolutional Networks for Action Recognition in Videos.In
Advances in Neural Information Processing Systems(NeurIPS) , 2014.[188] Karen Simonyan and Andrew Zisserman. Very Deep Con-volutional Networks for Large-Scale Image Recognition. In
The International Conference on Learning Representations(ICLR) , 2015.[189] Bharat Singh, Tim K. Marks, Michael Jones, Oncel Tuzel,and Ming Shao. A Multi-Stream Bi-Directional RecurrentNeural Network for Fine-Grained Action Detection. In
TheIEEE Conference on Computer Vision and Pattern Recog-nition (CVPR) , 2016.[190] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah.UCF101: A Dataset of 101 Human Actions Classes FromVideos in The Wild. arXiv preprint arXiv:1212.0402 , 2012.[191] Jonathan C. Stroud, David A. Ross, Chen Sun, Jia Deng,and Rahul Sukthankar. D3D: Distilled 3D Networks forVideo Action Recognition. In
The IEEE Winter Conferenceon Applications of Computer Vision (WACV) , 2020.[192] Swathikiran Sudhakaran, Sergio Escalera, and OswaldLanz. Lsta: Long short-term attention for egocentric actionrecognition. In
CVPR , 2019.[193] Waqas Sultani and Imran Saleemi. Human Action Recog-nition across Datasets by Foreground-Weighted HistogramDecomposition. In
The IEEE Conference on Computer Vi-sion and Pattern Recognition (CVPR) , 2014.[194] Chen Sun, Fabien Baradel, Kevin Murphy, and CordeliaSchmid. Learning video representations using contrastivebidirectional transformer, 2019.[195] Chen Sun, Austin Myers, Carl Vondrick, Kevin Mur-phy, and Cordelia Schmid. VideoBERT: A Joint Modelfor Video and Language Representation Learning. In
The IEEE International Conference on Computer Vision(ICCV) , 2019.[196] Lin Sun, Kui Jia, Kevin Chen, Dit-Yan Yeung, Bertram E.Shi, and Silvio Savarese. Lattice Long Short-Term Memoryfor Human Action Recognition. In
The IEEE InternationalConference on Computer Vision (ICCV) , 2017.[197] Shuyang Sun, Zhanghui Kuang, Lu Sheng, Wanli Ouyang,and Wei Zhang. Optical Flow Guided Feature: A Fast andRobust Motion Representation for Video Action Recogni-tion. In
The IEEE Conference on Computer Vision and Pat-tern Recognition (CVPR) , 2018.[198] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet,Scott Reed, Dragomir Anguelov, Dumitru Erhan, VincentVanhoucke, and Andrew Rabinovich. Going Deeper withConvolutions. In
The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[199] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
[200] Graham W. Taylor, Rob Fergus, Yann LeCun, and Christoph Bregler. Convolutional Learning of Spatio-temporal Features. In The European Conference on Computer Vision (ECCV), 2010.
[201] Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive multiview coding. arXiv preprint arXiv:1906.05849, 2019.
[202] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning Spatiotemporal Features with 3D Convolutional Networks. In
The IEEE Inter-national Conference on Computer Vision (ICCV) , 2015.[203] Du Tran, Heng Wang, Lorenzo Torresani, and Matt Feis-zli. Video Classification With Channel-Separated Convolu-tional Networks. In
The IEEE International Conference onComputer Vision (ICCV) , 2019.[204] Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, YannLeCun, and Manohar Paluri. A Closer Look at Spatiotem-poral Convolutions for Action Recognition. In
The IEEEConference on Computer Vision and Pattern Recognition(CVPR) , 2018.[205] A. Ullah, J. Ahmad, K. Muhammad, M. Sajjad, and S. W.Baik. Action Recognition in Video Sequences using DeepBi-Directional LSTM With CNN Features.
IEEE Access, 2017.
[206] Gül Varol, Ivan Laptev, and Cordelia Schmid. Long-term Temporal Convolutions for Action Recognition.
IEEETransactions on Pattern Analysis and Machine Intelligence(PAMI) , 2018.[207] Ashish Vaswani, Noam Shazeer, Niki Parmar, JakobUszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser,and Illia Polosukhin. Attention is All You Need.In
Advances in Neural Information Processing Systems(NeurIPS) , 2017.[208] Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba.Anticipating visual representations from unlabeled video.In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 98–106, 2016.
[209] Heng Wang, Alexander Kläser, Cordelia Schmid, and Liu Cheng-Lin. Action Recognition by Dense Trajectories. In
The IEEE Conference on Computer Vision and PatternRecognition (CVPR) , 2011.[210] Heng Wang and Cordelia Schmid. Action Recognition withImproved Trajectories. In
The IEEE International Confer-ence on Computer Vision (ICCV) , 2013.[211] Jiangliu Wang, Jianbo Jiao, Linchao Bao, Shengfeng He,Yunhui Liu, and Wei Liu. Self-supervised spatio-temporalrepresentation learning for videos by predicting motion andappearance statistics. In
Proceedings of the IEEE Confer-ence on Computer Vision and Pattern Recognition , pages4006–4015, 2019.[212] Lei Wang, Piotr Koniusz, and Du Huynh. Hallucinating idtdescriptors and i3d optical flow features for action recog-nition with cnns. In
Proceedings of the 2019 International Conference on Computer Vision . IEEE, Institute of Electri-cal and Electronics Engineers, 2019.[213] Limin Wang, Wei Li, Wen Li, and Luc Van Gool.Appearance-and-Relation Networks for Video Classifica-tion. In
The IEEE Conference on Computer Vision and Pat-tern Recognition (CVPR) , 2018.[214] Limin Wang, Yu Qiao, and Xiaoou Tang. Action Recogni-tion With Trajectory-Pooled Deep-Convolutional Descrip-tors. In
The IEEE Conference on Computer Vision and Pat-tern Recognition (CVPR) , 2015.[215] Limin Wang, Zhe Wang, Yuanjun Xiong, and Yu Qiao.CUHK and SIAT Submission for THUMOS15 ActionRecognition Challenge.
THUMOS’15 Action RecognitionChallenge , 2015.[216] Limin Wang, Yuanjun Xiong, Dahua Lin, and LucVan Gool. Untrimmednets for weakly supervised actionrecognition and detection. In
Proceedings of the IEEE con-ference on Computer Vision and Pattern Recognition , pages4325–4334, 2017.[217] Limin Wang, Yuanjun Xiong, Zhe Wang, and Yu Qiao.Towards Good Practices for Very Deep Two-Stream Con-vNets. arXiv preprint arXiv:1507.02159 , 2015.[218] Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, DahuaLin, Xiaoou Tang, and Luc Van Gool. Temporal Seg-ment Networks: Towards Good Practices for Deep ActionRecognition. In
The European Conference on ComputerVision (ECCV) , 2016.[219] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaim-ing He. Non-Local Neural Networks. In
The IEEE Confer-ence on Computer Vision and Pattern Recognition (CVPR) ,2018.[220] Xiaolong Wang and Abhinav Gupta. Videos as Space-TimeRegion Graphs. In
The European Conference on ComputerVision (ECCV) , 2018.[221] Xiaolong Wang, Allan Jabri, and Alexei A. Efros. LearningCorrespondence from the Cycle-Consistency of Time. In
CVPR , 2019.[222] Xiaofang Wang, Xuehan Xiong, Maxim Neumann, AJ Pier-giovanni, Michael S. Ryoo, Anelia Angelova, Kris M. Ki-tani, and Wei Hua. Attentionnas: Spatiotemporal attentioncell search for video classification, 2020.[223] Xiaohan Wang, Linchao Zhu, Yu Wu, and Yi Yang. Symbi-otic attention for egocentric action recognition with object-centric alignment.
IEEE Transactions on Pattern Analysisand Machine Intelligence , 2020.[224] Yang Wang and Minh Hoai. Pulling actions out of context:Explicit separation for effective combination. In
Proceed-ings of the IEEE Conference on Computer Vision and Pat-tern Recognition , pages 7044–7053, 2018.[225] Yunbo Wang, Mingsheng Long, Jianmin Wang, andPhilip S. Yu. Spatiotemporal Pyramid Network for VideoAction Recognition. In
The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[226] Zhipeng Wei, Jingjing Chen, Xingxing Wei, Linxi Jiang, Tat-Seng Chua, Fengfeng Zhou, and Yu-Gang Jiang. Heuristic black-box adversarial attacks on video recognition models. arXiv preprint arXiv:1911.09449, 2019. arXiv preprint arXiv:1912.07249, 2019.
[228] Bichen Wu, Alvin Wan, Xiangyu Yue, Peter Jin, Sicheng Zhao, Noah Golmant, Amir Gholaminejad, Joseph Gonzalez, and Kurt Keutzer. Shift: A Zero FLOP, Zero Parameter Alternative to Spatial Convolutions. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[229] Chao-Yuan Wu, Christoph Feichtenhofer, Haoqi Fan, Kaiming He, Philipp Krähenbühl, and Ross Girshick. Long-Term Feature Banks for Detailed Video Understanding. In
The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[230] Chao-Yuan Wu, Ross Girshick, Kaiming He, Christoph Feichtenhofer, and Philipp Krähenbühl. A Multigrid Method for Efficiently Training Video Models. In
The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[231] Chao-Yuan Wu, Manzil Zaheer, Hexiang Hu, R. Manmatha, Alexander J. Smola, and Philipp Krähenbühl. Compressed Video Action Recognition. In
The IEEE Conference onComputer Vision and Pattern Recognition (CVPR) , 2018.[232] Zuxuan Wu, Yanwei Fu, Yu-Gang Jiang, and Leonid Si-gal. Harnessing Object and Scene Semantics for Large-Scale Video Understanding. In
The IEEE Conference onComputer Vision and Pattern Recognition (CVPR) , 2016.[233] Zuxuan Wu, Yu-Gang Jiang, Xi Wang, Hao Ye, and Xi-angyang Xue. Multi-Stream Multi-Class Fusion of DeepNetworks for Video Classification. In
The ACM Confer-ence on Multimedia (MM) , 2016.[234] Zuxuan Wu, Caiming Xiong, Chih-Yao Ma, RichardSocher, and Larry S. Davis. AdaFrame: Adaptive FrameSelection for Fast Video Recognition. In
The IEEE Confer-ence on Computer Vision and Pattern Recognition (CVPR) ,2019.[235] Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin.Unsupervised feature learning via non-parametric instancediscrimination. In
Proceedings of the IEEE Conferenceon Computer Vision and Pattern Recognition , pages 3733–3742, 2018.[236] Zuxuan Wu, Ting Yao, Yanwei Fu, and Yu-Gang Jiang.Deep Learning for Video Classification and Captioning. arXiv preprint arXiv:1609.06782 , 2016.[237] Fanyi Xiao, Yong Jae Lee, Kristen Grauman, JitendraMalik, and Christoph Feichtenhofer. Audiovisual Slow-Fast Networks for Video Recognition. arXiv preprintarXiv:2001.08740 , 2020.[238] Saining Xie, Ross Girshick, Piotr Dollar, Zhuowen Tu, andKaiming He. Aggregated Residual Transformations forDeep Neural Networks. In
The IEEE Conference on Com-puter Vision and Pattern Recognition (CVPR) , 2017.[239] Saining Xie, Chen Sun, Jonathan Huang, Zhuowen Tu, andKevin Murphy. Rethinking Spatiotemporal Feature Learn-ing: Speed-Accuracy Trade-offs in Video Classification.In
The European Conference on Computer Vision (ECCV) ,2018. [240] Dejing Xu, Jun Xiao, Zhou Zhao, Jian Shao, Di Xie, andYueting Zhuang. Self-supervised spatiotemporal learningvia video clip order prediction. 2019.[241] Tiantian Xu, Fan Zhu, Edward K. Wong, and Yi Fang. DualMany-to-One-Encoder-based Transfer Learning for Cross-Dataset Human Action Recognition.
Image and VisionComputing , 2016.[242] Xun Xu, Timothy Hospedales, and Shaogang Gong. Multi-Task Zero-Shot Action Recognition with Prioritised DataAugmentation. In
ECCV , 2016.[243] Xun Xu, Timothy Hospedales, and Shaogang Gong. Trans-ductive Zero-Shot Action Recognition by Word-Vector Em-bedding.
IJCV , 2017.[244] Zhongwen Xu, Yi Yang, and Alexander G. Hauptmann. ADiscriminative CNN Video Representation for Event De-tection. In
The IEEE Conference on Computer Vision andPattern Recognition (CVPR) , 2015.[245] Huanqian Yan, Xingxing Wei, and Bo Li. Sparse black-boxvideo attack with reinforcement learning. arXiv preprintarXiv:2001.03754 , 2020.[246] Sijie Yan, Yuanjun Xiong, and Dahua Lin. Spatial tempo-ral graph convolutional networks for skeleton-based actionrecognition. In
AAAI , 2018.[247] Ceyuan Yang, Yinghao Xu, Bo Dai, and Bolei Zhou. VideoRepresentation Learning with Visual Tempo Consistency.In arXiv preprint arXiv:2006.15489 , 2020.[248] Ceyuan Yang, Yinghao Xu, Jianping Shi, Bo Dai, and BoleiZhou. Temporal Pyramid Network for Action Recognition.In
The IEEE Conference on Computer Vision and PatternRecognition (CVPR) , 2020.[249] Xitong Yang, Xiaodong Yang, Sifei Liu, Deqing Sun, LarryDavis, and Jan Kautz. Hierarchical contrastive motionlearning for video action recognition, 2020.[250] Li Yao, Atousa Torabi, Kyunghyun Cho, Nicolas Ballas,Christopher Pal, Hugo Larochelle, and Aaron Courville.Describing Videos by Exploiting Temporal Structure. In
The IEEE International Conference on Computer Vision(ICCV) , 2015.[251] Serena Yeung, Olga Russakovsky, Greg Mori, and Li Fei-Fei. End-to-end learning of action detection from frameglimpses in videos. In
The IEEE Conference on ComputerVision and Pattern Recognition (CVPR) , 2016.[252] Wang Yifan, Jie Song, Limin Wang, Luc Van Gool, andOtmar Hilliges. Two-Stream SR-CNNs for Action Recog-nition in Videos. In
The British Machine Vision Conference(BMVC) , 2016.[253] Joe Yue-Hei Ng, Matthew Hausknecht, Sudheendra Vi-jayanarasimhan, Oriol Vinyals, Rajat Monga, and GeorgeToderici. Beyond Short Snippets: Deep Networks for VideoClassification. In
The IEEE Conference on Computer Visionand Pattern Recognition (CVPR) , 2015.[254] Sangdoo Yun, Dongyoon Han, Seong Joon Oh, SanghyukChun, Junsuk Choe, and Youngjoon Yoo. CutMix: Regu-larization Strategy to Train Strong Classifiers With Local-izable Features. In
The IEEE International Conference on Computer Vision (ICCV), 2019.
[255] Christian Zach, Thomas Pock, and Horst Bischof. A Duality Based Approach for Realtime TV-L1 Optical Flow. In Joint Pattern Recognition Symposium, 2007.
[256] Bowen Zhang, Limin Wang, Zhe Wang, Yu Qiao, and Hanli Wang. Real-time Action Recognition with Enhanced Motion Vector CNNs. In
The IEEE Conference on ComputerVision and Pattern Recognition (CVPR) , 2016.[257] Can Zhang, Yuexian Zou, Guang Chen, and Lei Gan. Pan:Towards fast action recognition via learning persistence ofappearance, 2020.[258] Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, andDavid Lopez-Paz. Mixup: Beyond Empirical Risk Min-imization. In
The International Conference on LearningRepresentations (ICLR) , 2018.[259] Hang Zhang, Chongruo Wu, Zhongyue Zhang, Yi Zhu, ZhiZhang, Haibin Lin, Yue Sun, Tong He, Jonas Muller, R.Manmatha, Mu Li, and Alexander Smola. ResNeSt: Split-Attention Networks. arXiv preprint arXiv:2004.08955 ,2020.[260] Hu Zhang, Linchao Zhu, Yi Zhu, and Yi Yang. Motion-Excited Sampler: Video Adversarial Attack with SparkedPrior. In
The European Conference on Computer Vision(ECCV) , 2020.[261] Hong-Bo Zhang, Yi-Xiang Zhang, Bineng Zhong, QingLei, Lijie Yang, Ji-Xiang Du, and Duan-Sheng Chen. AComprehensive Survey of Vision-Based Human ActionRecognition Methods.
Sensors , 2018.[262] Richard Zhang, Phillip Isola, and Alexei A Efros. Colorfulimage colorization. In
ECCV , 2016.[263] Richard Zhang, Phillip Isola, and Alexei A. Efros. Split-brain autoencoders: Unsupervised learning by cross-channel prediction. In
The IEEE Conference on ComputerVision and Pattern Recognition (CVPR) , July 2017.[264] Shiwen Zhang, Sheng Guo, Weilin Huang, Matthew R.Scott, and Limin Wang. V4D:4D Convolutional Neu-ral Networks for Video-level Representation Learning. In
The International Conference on Learning Representations(ICLR) , 2020.[265] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun.ShuffleNet: An Extremely Efficient Convolutional NeuralNetwork for Mobile Devices. In
The IEEE Conference onComputer Vision and Pattern Recognition (CVPR) , 2018.[266] Zehua Zhang and David Crandall. Hierarchically decou-pled spatial-temporal contrast for self-supervised video rep-resentation learning, 2020.[267] Hang Zhao, Zhicheng Yan, Lorenzo Torresani, and Anto-nio Torralba. HACS: Human Action Clips and SegmentsDataset for Recognition and Temporal Localization. arXivpreprint arXiv:1712.09374 , 2019.[268] Yue Zhao, Yuanjun Xiong, and Dahua Lin. Trajectory Con-volution for Action Recognition. In
Advances in NeuralInformation Processing Systems (NeurIPS) , 2018.[269] Bolei Zhou, Alex Andonian, Aude Oliva, and Antonio Tor-ralba. Temporal Relational Reasoning in Videos. In
TheEuropean Conference on Computer Vision (ECCV) , 2018.[270] Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva,and Antonio Torralba. Places: A 10 million Image Database for Scene Recognition.
IEEE Transactions on Pattern Anal-ysis and Machine Intelligence (PAMI) , 2017.[271] Yizhou Zhou, Xiaoyan Sun, Zheng-Jun Zha, and WenjunZeng. MiCT: Mixed 3D/2D Convolutional Tube for HumanAction Recognition. In
The IEEE Conference on ComputerVision and Pattern Recognition (CVPR) , 2018.[272] Linchao Zhu, Du Tran, Laura Sevilla-Lara, Yi Yang, MattFeiszli, and Heng Wang. Faster recurrent networks for effi-cient video classification. In
AAAI, 2020.
[273] Linchao Zhu, Zhongwen Xu, and Yi Yang. Bidirectional Multirate Reconstruction for Temporal Modeling in Videos. In
The IEEE Conference on Computer Vision and PatternRecognition (CVPR) , 2017.[274] Linchao Zhu, Zhongwen Xu, Yi Yang, and Alex G. Haupt-mann. Uncovering Temporal Context for Video Ques-tion Answering.
International Journal of Computer Vision(IJCV) , 2017.[275] Linchao Zhu and Yi Yang. ActBERT: Learning Global-Local Video-Text Representations. In
The IEEE Confer-ence on Computer Vision and Pattern Recognition (CVPR) ,2020.[276] Sijie Zhu, Taojiannan Yang, Matias Mendieta, and ChenChen. A3d: Adaptive 3d networks for video action recog-nition, 2020.[277] Wangjiang Zhu, Jie Hu, Gang Sun, Xudong Cao, and YuQiao. A Key Volume Mining Deep Framework for ActionRecognition. In
The IEEE Conference on Computer Visionand Pattern Recognition (CVPR) , 2016.[278] Yi Zhu, Zhenzhong Lan, Shawn Newsam, and Alexan-der G. Hauptmann. Hidden Two-Stream ConvolutionalNetworks for Action Recognition. In
The Asian Conferenceon Computer Vision (ACCV) , 2018.[279] Yi Zhu, Yang Long, Yu Guan, Shawn Newsam, and LingShao. Towards Universal Representation for Unseen ActionRecognition. In
The IEEE Conference on Computer Visionand Pattern Recognition (CVPR) , 2018.[280] Yi Zhu and Shawn Newsam. Depth2Action: Exploring Em-bedded Depth for Large-Scale Action Recognition. In
TheEuropean Conference on Computer Vision (ECCV) Work-shop , 2016.[281] Yi Zhu and Shawn Newsam. Random Temporal Skippingfor Multirate Video Analysis. In
The Asian Conference onComputer Vision (ACCV) , 2018.[282] Mohammadreza Zolfaghari, Gabriel L. Oliveira, NimaSedaghat, and Thomas Brox. Chained Multi-Stream Net-works Exploiting Pose, Motion, and Appearance for ActionClassification and Detection. In
The IEEE InternationalConference on Computer Vision (ICCV) , 2017.[283] Mohammadreza Zolfaghari, Kamaljeet Singh, and ThomasBrox. ECO: Efficient Convolutional Network for OnlineVideo Understanding. In
The European Conference on Computer Vision (ECCV), 2018.