A3D: Adaptive 3D Networks for Video Action Recognition
Sijie Zhu*, Taojiannan Yang*, Matias Mendieta, Chen Chen
Department of Electrical and Computer Engineering, University of North Carolina at Charlotte
{szhu3, tyang30, mmendiet, chen.chen}@uncc.edu
*Equal contribution
Abstract
This paper presents A3D, an adaptive 3D network that can perform inference at a wide range of computational constraints with one-time training. Instead of training multiple models in a grid-search manner, it generates good configurations by trading off between network width and spatio-temporal resolution. Furthermore, the computation cost can be adapted after the model is deployed, to meet variable constraints, for example, on edge devices. Even under the same computational constraints, the performance of our adaptive networks can be significantly boosted over the baseline counterparts by the mutual training along three dimensions. When a multiple-pathway framework, e.g. SlowFast, is adopted, our adaptive method encourages a better trade-off between pathways than manual designs. Extensive experiments on the Kinetics dataset show the effectiveness of the proposed framework. The performance gain is also verified to transfer well between datasets and tasks. Code will be made available.
1. Introduction
Spatio-temporal (3D) networks have achieved excellent performance on video action recognition [36, 3, 7, 6, 38, 5], as well as on its downstream tasks [16, 49, 45]. However, popular 3D networks [36, 3, 44] are extremely computationally demanding, requiring tens or even hundreds of GFLOPs per video clip. Such a large computational cost limits their applicability in real-world scenarios, especially on resource-limited edge devices. A thread of works [7, 6, 5, 42] has been proposed to reduce the computation cost of 3D networks to meet lower computational budgets. These works narrow the gap between 3D networks and their applications to some degree, but they all ignore the crucial fact that the computational budget changes in many real applications. For example, the battery condition of mobile devices imposes constraints on the computational budget of many operations. Similarly, a task may have specific priorities at any given time, requiring a dynamic computational budget throughout its deployment phases. To meet different resource constraints, one needs to deploy several 3D networks on the device and switch among them. This consumes a much larger memory footprint than a single model, and the cost of loading a different model is not negligible.

In this paper, we propose adaptive 3D networks (A3D), where one model can meet a variety of resource constraints. The computational cost of a 3D network is determined by the input size (spatial and temporal) and the network size (width and depth). During training, we randomly sample several spatial-temporal-width configurations. In this way, the network can perform inference with many configurations in real deployment, making it possible to meet a wide range of computational budgets. Besides, distinct network configurations can learn different semantic information. For example, a larger spatial size can capture more fine-grained features, and a longer temporal duration can encode more long-term semantics. Motivated by this finding, in each training iteration, we randomly sample several configurations and train them jointly. Moreover, we propose Spatial-Temporal Distillation (STD) to transfer the knowledge in larger spatial-temporal configurations to the other configurations. This allows smaller configurations to learn fine-grained and long-term representations with less computational cost. As a result, the performance of every configuration is greatly improved.

Our work shares a similar motivation with X3D [6]: both leverage different network configurations for various computational budgets. However, X3D expands a 2D network along different dimensions. In each expansion step, X3D trains 6 models which correspond to 6 dimensions. To obtain a moderate size model (e.g. X3D-L), it requires 10 steps, which means that it needs to train 60 models in the whole process. This makes the expansion very inefficient. In A3D, by contrast, we train different configurations jointly. Therefore, the whole framework is end-to-end, making it simpler and more efficient than X3D. Besides, during inference, X3D has to deploy several differently scaled models to meet dynamic computational budgets, whereas A3D only needs to deploy one model, making it abundantly more memory-friendly on edge devices. Furthermore, X3D trains each configuration independently, which fails to leverage the variety of semantic information contained in different configurations.
In A3D, on the other hand, each configuration can transfer its knowledge to the others through the proposed Spatial-Temporal Distillation; this enables every configuration to learn better representations. Our experimental results reveal that A3D outperforms independently trained models under the same network configuration, even when that configuration is not the best performing one A3D found at that computational budget. Note that A3D can also be applied to the network structures in X3D. We summarize our contributions as follows:
• We are the first to achieve adaptive 3D networks where one model can meet different computational budgets in real application scenarios. For training, the proposed A3D requires just a fraction of the training cost compared with independently training several models. For inference, A3D deploys only one model rather than several independently trained models to cope with dynamic budgets.
• The proposed Spatial-Temporal Distillation (STD) scheme transfers knowledge among different configurations. This allows every configuration to learn multi-scale spatial and temporal information, thereby improving the overall performance.
• On Kinetics-400, A3D outperforms its independently trained counterparts at various computational budgets, especially when the budget is low. The effectiveness of the learned representation is also validated via cross-dataset transfer (Charades dataset [31]) and cross-task transfer (action detection on the AVA dataset [9]).
2. Related Work
Spatio-temporal (3D) Networks.
The basic idea of video recognition architectures stems from 2D image classification models. [36, 3, 11, 44] build 3D networks by extending 2D convolutional filters [32, 34, 12, 43, 15] to 3D filters along the temporal axis; the 3D filters can then learn spatio-temporal representations in a similar way to their 2D counterparts. Later works [44, 26, 38, 7] propose to treat the spatial and temporal domains differently. [44] reveals that a bottom-heavy structure is better than naive 3D structures in both accuracy and speed. [26, 38] propose to split 3D filters into 2D+1D filters, which reduces the heavy computational cost of 3D filters and improves the performance. SlowFast [7] further shows that space and time should not be handled symmetrically, and introduces a two-path structure to deal with slow and fast motion separately. Recently, [25, 40, 24] explore neural architecture search (NAS) techniques to automatically learn spatio-temporal network structures.
Efficient 3D Networks.
3D networks are often very computationally expensive, and many approaches [38, 26, 44, 6, 7, 5, 20, 4, 21, 37] have been proposed to reduce the complexity. [38, 26, 44] share the idea of splitting 3D filters into 2D and 1D filters. [5, 21, 37] leverage the group convolution and channel-wise separable convolution of 2D networks [13, 29] to reduce computational cost. Other approaches such as [4, 20] propose generic, insertable modules to improve temporal representations with minimal computational overhead. [6, 7] improve efficiency by performing trade-offs across several model dimensions. A few methods [42, 23, 19] also explore adaptive frame sampling to reduce computation. However, none of these methods can adapt to dynamic computational budgets, and our approach is complementary to these light-weight structures.
Multi-dimension Networks.
The computational cost and accuracy of a model are determined by both the input size and the network size. In 2D networks, there is a growing interest [46, 48, 2, 35, 10] in achieving better accuracy-efficiency trade-offs by balancing different model dimensions (e.g. image resolution, network width and depth). EfficientNet [35] performs a grid-search on different model dimensions and expands the configuration to larger models. [46, 48, 2] train different configurations jointly so that one model can perform variable execution in the 2D domain. [10] prunes networks from multiple dimensions to achieve better accuracy-complexity trade-offs. X3D [6] is the first work to investigate the effect of different dimensions in spatio-temporal networks. It expands a 2D network step by step into a 3D one. Our method also aims to achieve better model configurations under different budgets. However, we train the various intrinsic configurations jointly and share knowledge between them, creating a time-saving end-to-end framework that enables effective adaptive inference with a single model.
3. Adaptive 3D Network
Standard 3D models are trained at a fixed spatial-temporal-width configuration (e.g. 224-8-1.0×), but the model does not generalize well to other configurations during inference. In A3D, we randomly sample different spatial-temporal-width configurations during training, so the model can run at different configurations during inference. Note that the computation cost of a vanilla 3D convolutional layer is given by

$$K \times K \times C_i \times C_o \times H \times W \times T. \quad (1)$$

Here, K denotes the kernel size, and C_i, C_o are the input and output channels of this layer. H, W, T denote the spatial-temporal size of the output feature map. For a sub-network with network width (channels) coefficient γ_w ∈ [0, 1] and spatial-temporal resolution factors γ_s, γ_t ∈ [0, 1], the computation cost is reduced to

$$K \times K \times \gamma_w C_i \times \gamma_w C_o \times \gamma_s H \times \gamma_s W \times \gamma_t T. \quad (2)$$

The computational cost is now γ_w²γ_s²γ_t times that of the original in Eq. 1. Our goal is to train a 3D network that is executable at a range of resource constraints. For example, we define a maximal reduction coefficient ρ in which γ_w²γ_s²γ_t ∈ [1/ρ, 1]. This allows for an execution range from a ρ-times reduction in computation up to the full network. With this range, one possible configuration set lets γ_w, γ_s ∈ [(1/ρ)^{1/6}, 1] and γ_t ∈ [(1/ρ)^{1/3}, 1], so that each dimension is equally responsible for a ∛ρ-times computation reduction. Further details will be discussed in the following sections.
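To make the cost model concrete, Eq. 1 and Eq. 2 can be evaluated directly. The Python sketch below uses our own helper names (they are not from the released code) and a hypothetical layer to show the γ_w²·γ_s²·γ_t reduction.

```python
def conv3d_flops(k, c_in, c_out, h, w, t):
    """Cost of a vanilla 3D conv layer as in Eq. 1 (bias ignored)."""
    return k * k * c_in * c_out * h * w * t

def scaled_conv3d_flops(k, c_in, c_out, h, w, t, gw=1.0, gs=1.0, gt=1.0):
    """Cost of the same layer inside a sub-network (Eq. 2):
    gw scales both input and output channels, gs scales the two
    spatial sides, gt scales the temporal length."""
    return conv3d_flops(k,
                        int(gw * c_in), int(gw * c_out),
                        int(gs * h), int(gs * w), int(gt * t))

# Example: a hypothetical 3x3 layer with 64->64 channels on an 8x56x56 feature map.
full = conv3d_flops(3, 64, 64, 56, 56, 8)
sub = scaled_conv3d_flops(3, 64, 64, 56, 56, 8, gw=0.5, gs=0.5, gt=0.5)
print(sub / full)  # gw^2 * gs^2 * gt = 0.03125
```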
Figure 1. Class activation maps along spatial and temporal dimensions of the full-network and sub-network.

Figure 2. An overview of the spatial-temporal distillation strategy to facilitate knowledge transfer among different network configurations.

In Sec. 3.1, we first show that different model configurations focus on different semantic information in a video. Then we demonstrate how knowledge is transferred between different sub-networks for single-pathway models. Sec. 3.2 further illustrates the multiple-pathway trade-off when using A3D with SlowFast-inspired networks [7].
Knowledge in Different Configurations.
Since we randomly sample various model configurations in each training iteration, we want these configurations to learn from each other. However, is there any unique knowledge in a given sub-network that is beneficial to transfer to the others? The answer is yes when sub-networks are fed with different spatial-temporal resolutions. Fig. 1 shows the spatial and temporal distributions of network activation following the spirit of [51]; a higher value means a larger contribution to the final logit. Although both the full-network (γ_w = 1) and the sub-network generate the prediction "headbutting", their decisions are based on different areas of the frames. The input of the full-network has 8 frames, and the 2nd and 8th frames contribute the most to the final prediction. Since these two key frames are not sampled in the input of the sub-network, it has to learn other semantic information, forcing a change in both temporal and spatial activation distributions. For example, the activation value of the 5th frame exceeds that of the 3rd frame in the sub-network. The attention areas are also unique between the networks, indicating that a varied set of visual cues is captured.
Mutual Training.
To fully leverage the semantic information captured by different model configurations, we propose the mutual training scheme shown in Fig. 2. The left half of Fig. 2 shows how mutual training works in single-pathway structures. In each training iteration, we randomly sample two sub-networks (by the width factor γ_w) in addition to the full-network. Sub-networks share their parameters with part of the full-network; for example, the sub-network with γ_w = 0.5 shares the first half of the full-network's parameters in each layer. During training, since the full-network has the best learning capacity, it is always fed with the highest spatial-temporal resolution (γ_s = γ_t = 1.0) inputs. Of the two sub-networks, one is sampled with a random width factor and the other always uses the minimal γ_w. The spatial-temporal resolution of a sub-network's input is randomly sampled from discrete sets of γ_s and γ_t values. This allows sub-networks to learn different semantic information, as motivated in Fig. 1. Since the sub-networks share their weights with part of the full-network, the full-network can also benefit from this diverse information. The full-network is trained on the ground-truth labels with the Cross Entropy (CE) loss, while the outputs of the full-network are adopted as soft labels for the sub-networks using a Kullback-Leibler (KL) divergence loss. This forces the full-network and sub-networks to have a high-level semantic consistency, even though they focus on different parts of the input. The overall loss function can be written as

$$\mathcal{L} = \underbrace{\mathrm{CE}\big(F(I_{\gamma_s,\gamma_t=1}, \Theta_{\gamma_w=1}),\, Y\big)}_{\text{Full-network}} + \underbrace{\sum \mathrm{KL}\big(F(I_{\gamma_s,\gamma_t=1}, \Theta_{\gamma_w=1}),\, F(I_{\gamma_s,\gamma_t}, \Theta_{\gamma_w})\big)}_{\text{Sub-networks}}. \quad (3)$$

F denotes the network architecture and I_{γ_s,γ_t} is the input tensor. Θ_{γ_w} denotes the parameters of the sub-network and Y is the one-hot class label.
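The training step defined by Eq. 3 can be sketched as follows. This is a minimal PyTorch-style illustration under our own assumptions: `model.set_width` is a hypothetical hook that activates the first γ_w fraction of channels, `downsample` is a simple resizing helper, and the discrete factor sets are supplied by the caller; none of these names come from the released SlowFast code.

```python
import random
import torch
import torch.nn.functional as F

def downsample(clip, gs, gt):
    """Reduce a (N, C, T, H, W) clip: keep a uniform subset of frames (factor gt)
    and bilinearly resize each frame (factor gs)."""
    n, c, t, h, w = clip.shape
    idx = torch.linspace(0, t - 1, steps=max(1, int(round(t * gt))),
                         device=clip.device).long()
    clip = clip.index_select(2, idx)
    out_hw = (max(1, int(h * gs)), max(1, int(w * gs)))
    clip = F.interpolate(clip.flatten(0, 1), size=out_hw,
                         mode="bilinear", align_corners=False)
    return clip.view(n, c, -1, *out_hw)

def mutual_training_step(model, clip, label, optimizer,
                         width_choices, gs_choices, gt_choices):
    """One A3D iteration (Eq. 3): CE on the full network, KL distillation
    from its soft output to two randomly sampled sub-networks."""
    optimizer.zero_grad()

    # Full network: highest spatial-temporal resolution, ground-truth supervision.
    model.set_width(1.0)
    full_logits = model(clip)
    loss = F.cross_entropy(full_logits, label)
    soft_target = F.softmax(full_logits.detach(), dim=1)  # soft labels for sub-networks

    # Two sub-networks: one random width plus the minimal width.
    for gw in (random.choice(width_choices), min(width_choices)):
        gs, gt = random.choice(gs_choices), random.choice(gt_choices)
        model.set_width(gw)
        sub_logits = model(downsample(clip, gs, gt))
        loss = loss + F.kl_div(F.log_softmax(sub_logits, dim=1),
                               soft_target, reduction="batchmean")

    loss.backward()
    optimizer.step()
    return loss.item()
```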
It may be naive to simply treat time as just another dimension of the input tensor, as slow and fast motions contain different information for identifying an action class. SlowFast [7] shows that a lightweight Fast pathway is a good complement to slow networks. This inspires us to leverage multiple-pathway trade-offs in our A3D framework; the structure is shown in the right half of Fig. 2. Since the Fast pathway is lightweight (only a small fraction of the overall computation), reducing its spatial-temporal resolution or network width is not beneficial for the overall computation-accuracy trade-off. In multiple-pathway A3D, we therefore keep γ_w, γ_s, γ_t = 1 for the Fast pathway, so that it can provide complementary information for the Slow pathway, which uses its own γ_w, γ_s, γ_t ≤ 1. Note that this complementary information concerns not only the temporal resolution but also the spatial resolution. Furthermore, the multiple-pathway A3D enables a better trade-off on network width than manual design.
Adaptive Fusion.
Given fixed temporal resolutions for the two pathways, the fusion in SlowFast [7] is conducted by lateral connections with a time-strided convolution. However, since all three dimensions (γ_w, γ_s, γ_t) of the Slow pathway can change during training, directly applying the time-strided convolution does not work for our framework. Therefore, we design an adaptive fusion block for multiple-pathway A3D.
Figure 3. The adaptive fusion on network channels.
Following SlowFast [7], we denote the feature shape of a standard Slow pathway as {T, S², C}, where S is the spatial resolution and C is the channel number. The feature shape of the adaptive Slow pathway is then {γ_t T, (γ_s S)², γ_w C}. The feature shape of the Fast pathway remains {αT, S², βC} as in [7] (α = 8, β = 1/8). The lateral connection first reduces the Fast feature with a time-strided convolution with 2βC output channels and a temporal stride of α, so the output feature shape of this convolution layer is {T, S², 2βC}. To fuse it with the adaptive Slow pathway, we perform a spatial interpolation and temporal down-sampling to make the output shape {γ_t T, (γ_s S)², 2βC}. The final feature shape after the fusion is then {γ_t T, (γ_s S)², (γ_w + 2β)C}. As shown in Fig. 3, normal convolution layers of the adaptive Slow pathway have γ_w C channels indexed from the left, while the first convolution layer after each fusion has γ_w C + 2βC input channels. The last 2βC channels are always reserved for the output of the Fast pathway, while the first γ_w C channels can vary in each iteration. This operation enforces an exact channel-wise correspondence between the fused features and the parameters of the convolution layers.

After training, A3D models can run at different model configurations. We test the performance of the different configurations on the validation set and choose the best performing one under each resource budget. For example, we can train a single-pathway A3D model with a Slow [7] network backbone, denoted as A3D-Slow-8×8, with ρ = 64, and then test the configurations within this computational range by enumerating six values of γ_w, four values of γ_s, and four values of γ_t, which provides a total of 96 configurations. After that, we choose the best configuration at each computation budget to form the configuration-budget table. One can adjust the values of γ_w, γ_s and γ_t to get a more fine-grained table. Note that this process is only done once, so it is very efficient. For real deployment, we only need to keep the model and the table, and therefore the memory consumption is essentially the same as for a single model. Given a resource constraint, we can adjust the model according to the configuration-budget table. During inference, we perform batch normalization (BN) calibration as proposed in [47], since the BN means and variances are distinct in different configurations. We do not perform any re-training during inference for any configuration.
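The channel bookkeeping of the adaptive fusion block (Fig. 3) can be sketched as below. This is a simplified PyTorch illustration with our own class and argument names (not the official implementation): the Fast feature is reduced by a SlowFast-style time-strided convolution producing 2βC channels, resampled to the adaptive Slow feature's spatial-temporal shape, and concatenated behind the active γ_w·C Slow channels.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveFusion(nn.Module):
    """Lateral fusion Fast -> Slow when the Slow pathway runs at (gw, gs, gt).
    Slow feature: (N, gw*C, gt*T, gs*S, gs*S); Fast feature: (N, beta*C, alpha*T, S, S).
    A sketch under our own naming, not the official SlowFast code."""

    def __init__(self, c_slow, beta=1 / 8, alpha=8):
        super().__init__()
        c_fast = int(beta * c_slow)
        # SlowFast-style time-strided conv: temporal kernel 5, temporal stride alpha,
        # producing 2*beta*C channels to be concatenated with the Slow feature.
        self.lateral = nn.Conv3d(c_fast, 2 * c_fast, kernel_size=(5, 1, 1),
                                 stride=(alpha, 1, 1), padding=(2, 0, 0))

    def forward(self, slow_feat, fast_feat):
        lateral = self.lateral(fast_feat)                 # (N, 2*beta*C, T, S, S)
        # Match the adaptive Slow shape: temporal down-sampling + spatial interpolation.
        lateral = F.interpolate(lateral, size=slow_feat.shape[2:],
                                mode="trilinear", align_corners=False)
        # First gw*C channels vary per iteration; the last 2*beta*C channels
        # are always reserved for the Fast pathway output.
        return torch.cat([slow_feat, lateral], dim=1)     # (N, (gw + 2*beta)*C, ...)
```

The convolution layer that consumes the fused feature then reads its first γ_w·C + 2βC input channels, matching the indexing rule described above.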
4. Experiments
We use SlowFast [7] as our backbone network since it is simple, clean, and has state-of-the-art performance. We conduct experiments on three video datasets following the standard evaluation protocols. We first evaluate our method on Kinetics-400 [18] for action classification. Then we transfer the learned representations to Charades [31] action classification and AVA [9] action detection. Finally, we perform extensive ablation studies to analyze the effect of different components of A3D.
Datasets.
Kinetics-400 [18] is a large-scale action classification dataset. The number of videos currently available for download is smaller than that used in SlowFast [7], and some of the videos have a duration of less than 10s. This leads to an accuracy drop of about 0.6% when we reproduce the baselines (e.g. Slow-8×8) with the officially released code [1].
Charades [31] is a multi-label action classification dataset with longer activity durations; the average activity duration is ∼30 seconds. The dataset is composed of ∼9.8k training videos in 157 action classes.
Training.
For single-pathway structures, we adopt Slow-8×8 [7] as the baseline; for multiple-pathway structures, we adopt SlowFast-4×16 due to the limit of GPU memory. For A3D-Slow networks, we train the model for two dynamic budget ranges, [0.06, 1.0] and [0.016, 1.0]. For the range of [0.016, 1.0], the width factor γ_w is uniformly sampled from a continuous range, while the spatial resolution factor γ_s and the temporal resolution factor γ_t are each sampled from four discrete values (corresponding to four spatial crop sizes and four clip lengths, respectively). For the range of [0.06, 1.0], γ_w is sampled from a narrower range and γ_s, γ_t each take three discrete values. A3D-SlowFast is only trained with the [0.06, 1.0] setting. All the models are based on ResNet-50 (R-50) if not specified. Other training settings are the same as in the official SlowFast code [1]. On the Charades and AVA datasets, we finetune the model trained on Kinetics-400 following the same settings as SlowFast [1].
Kinetics.
In Table 1, we show the classification accuracy and computation cost of the proposed A3D and previous works. We use the Slow network and SlowFast [7] as our baselines. "-R" and "-P" denote our reproduced results with the official code and the numbers in the SlowFast paper, respectively. Given the same implementation environment and dataset, the proposed A3D consistently outperforms its baseline counterpart, and it is executable over a range of computation constraints, denoted as "model name ×ratio" in Table 1: a model with ratio ×r only requires a fraction r of the computation of the full (×1.0) model. Although our reproduced accuracy is lower than the numbers reported in the SlowFast paper [7] due to the dataset issue, our results are still comparable to or better than previous works under the same computation budgets.

Recently, X3D [6] achieved state-of-the-art accuracy with extremely low computation cost (GFLOPs). We believe that using X3D as a baseline in the A3D framework could further improve the accuracy-computation trade-off. However, we are not able to perform such experiments for three reasons: 1) X3D uses 3D depth-wise convolution, which is not well supported by current deep learning platforms and GPU acceleration libraries. Although the GFLOPs of X3D is much smaller than that of Slow networks [7], the training time is actually three times longer than that of Slow networks. 2) The depth-wise convolution requires high memory access cost (MAC) [22], which further limits its usability in practice. 3) X3D leverages the Squeeze-Excitation (SE) block [14] and the Swish activation [27] in the network design, which boost the Kinetics accuracy considerably. It is therefore unsuitable to adopt X3D as a baseline to compare with other clean architectures, e.g. SlowFast [7].
Table 1. Comparison of performance, computation cost (in GFLOPs × views), and parameter size of different methods.

The default testing of SlowFast uses 30 views for each video, while some previous efficient 3D networks (e.g. TSM [20] and CSN [37]) are tested with fewer views. Therefore, we show testing results of A3D and previous works under the corresponding view settings in Fig. 4 for comparison. A3D-Slow-8×8 and A3D-SlowFast-4×16 achieve a better accuracy-computation trade-off than previous methods with one-time training.

Figure 4. Comparison of A3D with state-of-the-art 3D networks.

Training Cost.
Since we sample two sub-networks in each training iteration, A3D consumes more computational cost than a single model. However, we show that the wall-clock time is only slightly longer than that of the full model and that the total time is much shorter than independently training several models. In Table 2 we measure the training cost and time based on Slow-8×8, showing the cost of independently training 6 models of different complexities. The training time of the full model (×1.0) is measured on an 8-GPU machine. Although the total training cost of A3D is higher than that of the single full model, the wall-clock time is only slightly longer (69 vs. 58 mins/epoch) because the sub-networks share the video data of the full-network in memory and do not need to decode video clips from disk again. Therefore, the wall-clock time of training an A3D model is much less than that of training several independent models.

Table 2. Training costs of A3D and independently training several models. ∗ indicates expected values.

Charades.
We finetune the models trained on Kinetics-400 on Charades. For the SlowFast models, we use the pre-trained models reproduced by us for a fair comparison. For the A3D models, we do not perform adaptive training during finetuning; that is, both the SlowFast models and the A3D models follow the same finetuning process on Charades, and the only difference is the pre-trained model. We follow the training settings in the released code [1]. Since we train the model on 4 GPUs, we reduce the batch size and base learning rate by half following the linear scaling rule [8]. All other settings remain unchanged. As can be seen in Table 3, the A3D model outperforms its counterpart (Slow-8×8) by 0.9% without increasing the computational cost. Note that the only difference lies in the pre-trained model, so the improvement demonstrates that our method helps the network learn effective and well-generalized representations that transfer across datasets.

Table 3. Comparison of different models on Charades. All models are based on R-50.
AVA Detection.
Similar to the experiments on Charades, we follow the same training settings as the released SlowFast code [1]. The detector is similar to Faster R-CNN [28] with minimal modifications adopted for video. The region proposals are pre-computed by an off-the-shelf person detector. Experiments are conducted on AVA v2.1. All models (with an R-50 backbone) are trained on a 4-GPU machine for 20 epochs with a batch size of 32. The base learning rate is 0.05 with linear warm-up for the first 5 epochs. The learning rate is reduced by a factor of 10 at the 10th and 15th epochs. Both the SlowFast pre-trained models and the A3D pre-trained models are finetuned following this standard training procedure; the only difference is the pre-trained model. As shown in Table 4, the A3D pre-trained model also outperforms SlowFast and previous methods. Note that only the pre-trained weights differ in these experiments, so the improvements are not marginal and clearly demonstrate the effectiveness of the learned representations.

Table 4. Comparison of different models on AVA v2.1.
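The finetuning schedule described above (base learning rate 0.05, linear warm-up over the first 5 epochs, 10× decays at the 10th and 15th epochs) corresponds to a simple per-epoch rule; the helper below is our own sketch of that schedule, not part of the released code.

```python
def ava_finetune_lr(epoch, base_lr=0.05, warmup_epochs=5, decay_epochs=(10, 15)):
    """Learning rate at a given (0-indexed) epoch for the AVA finetuning recipe."""
    if epoch < warmup_epochs:
        # Linear warm-up from 0 to the base learning rate.
        return base_lr * (epoch + 1) / warmup_epochs
    # Step decay: divide by 10 at each listed epoch.
    drops = sum(1 for e in decay_epochs if epoch >= e)
    return base_lr * (0.1 ** drops)

# e.g. ava_finetune_lr(0) == 0.01, ava_finetune_lr(9) == 0.05, ava_finetune_lr(12) == 0.005
```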
Testing Crop Size.
The default testing crop size of SlowFast is 256, while the training crop size is 224. This is uncommon, because training and testing inputs usually have the same resolution for better generalization. Although this trick improves the performance of Slow-8×8, it also increases the testing computation. When considering the accuracy-computation trade-off, this setting may not be effective, especially for the proposed A3D since it covers multiple spatial-temporal resolutions. In Fig. 5, we show the performance of different configurations with spatial resolutions of 256·γ_s and 224·γ_s. Although testing the full configuration with the 256 crop gives the best accuracy, the 224·γ_s configurations consistently outperform their 256·γ_s counterparts under the same computation constraints. Therefore, we test A3D models with 224-based spatial resolutions in all our experiments.

Figure 5. A3D-Slow-8×8 tested with spatial resolutions of 256·γ_s and 224·γ_s.

Trade-off Dimensions.
Fig. 6 shows the accuracy-computation trade-off curves of A3D-Slow-8×8-[0.016, 1] and Slow networks. "Slow-T×τ" means Slow networks with different temporal resolutions. "-R" and "-P" denote the reproduced results using the official code and the numbers reported in the paper, respectively. We also provide the results of A3D when only reducing the temporal resolution ("-T") to compare with Slow networks under the same configurations. Table 5 further compares different trade-off strategies of A3D given the same computation constraints.

Table 5. Comparison between different trade-off strategies of A3D given the same computation constraints.

Under the same configuration, A3D-Slow outperforms its Slow counterpart. Note that the results of Slow-4×16 and Slow-2×32 are adopted from the paper [7], and the reproduced results should be lower. Since A3D can perform trade-offs on three dimensions, the full A3D-Slow-8×8-[0.016, 1] curve surpasses the temporal-only ("-T") version: by also lowering the spatial resolution and network width, it reaches the same computation budgets with higher accuracy and fewer parameters. In addition, the trade-off curve of A3D brings insights about mutual training versus separated training. Based on a progressive grid-search training, X3D [6] claims that a small network should keep a high temporal resolution, as witnessed by the configuration of X3D-M with only 3.76M parameters. In contrast, even for A3D-Slow-8×8, it is more beneficial to reduce the temporal resolution before reducing the network width. Therefore, conclusions drawn from separated training may not hold for a mutual training framework.

Figure 6. Accuracy-computation trade-off curves of A3D-Slow-8×8-[0.016, 1] and Slow networks. Different line colors show which dimension is reduced at each trade-off step: spatial (γ_s), temporal (γ_t, green), or network width (γ_w, blue); black indicates multiple dimensions in one trade-off step. "-T" means only reducing the temporal resolution.

Figure 7. Comparison of A3D-Slow with different computational ranges.

Computational Range.
To explore the effect of the computational range on performance, we compare A3D-Slow-8×8 trained with the ranges [0.016, 1.0] and [0.06, 1.0] in Fig. 7. A larger lower bound (e.g. 0.06) generally provides a better accuracy-computation trade-off due to the narrower computational range. However, at the smallest FLOPs configuration there is no room to trade off between different dimensions, so the two ranges achieve similar performance there. In the Appendix, we present the detailed configurations of the trade-off list for A3D under these two computational ranges.
Temporal Dimension.
To investigate the effect of the temporal dimension, we train A3D-Slow-8×8 with and without the temporal dimension under the same computational range (around [0.06, 1]). When the temporal dimension is removed (γ_t = 1), the lower bounds of γ_s and γ_w have to be smaller to provide the same computational range. Table 6 shows that removing the temporal dimension in A3D leads to a lower accuracy for both the full-network (×1.0) and the sub-networks.

Table 6. Comparison between A3D-Slow with and without the temporal dimension.
Mutual Training.
A3D benefits from the mutual training of different spatial-temporal resolutions, which is partly similar to multi-resolution training. To further demonstrate the difference, we train a Slow-8×8 model with multi-resolution inputs and compare it with A3D-Slow-8×8 in Table 7.

Table 7. Comparison between A3D-Slow and multi-resolution training on Slow-8×8.

Multiple-pathway Trade-off.
In Fig. 8, we provide the trade-off curves of A3D-SlowFast-4×16 and SlowFast networks (please refer to the Appendix for the detailed trade-off list). Table 8 further shows the detailed configurations under typical constraints. When the temporal resolution of the Slow pathway is reduced, the accuracy of SlowFast drops noticeably, while the accuracy of A3D-SlowFast drops much less thanks to the mutual training. With multiple-dimension trade-offs, A3D further outperforms the temporal-only ("-T") version.

The Fast pathway in A3D-SlowFast is not reduced for trade-offs, so the range [0.06, 1] only applies to the Slow pathway. However, the Fast pathway makes the trade-off on the temporal dimension more beneficial: the performance drop of A3D-SlowFast when γ_t is reduced is much lower than that of A3D-Slow in Fig. 6. As shown in Table 8, a reduced A3D-SlowFast configuration has almost the same accuracy as the reproduced SlowFast-4×16 with much less computation cost and fewer parameters.

Figure 8. Comparison of A3D-SlowFast and SlowFast. Different line colors show different dimensions for reducing computation (same as Fig. 6). "-T" means only reducing the temporal resolution.

Table 8. Comparison between A3D-SlowFast-4×16 and SlowFast under different computation constraints.
5. Conclusion
This paper presents the first adaptive 3D network (A3D), which can achieve adaptive accuracy-efficiency trade-offs at runtime for video action recognition with a single model. It considers network width and input spatio-temporal resolution, and randomly samples different spatial-temporal-width configurations during network training. A spatial-temporal distillation scheme is developed to facilitate knowledge transfer among different configurations. The training paradigm is generic and applicable to any 3D network. Using the same backbone, A3D outperforms the state-of-the-art SlowFast [7] networks under various computation constraints on Kinetics. Extensive evaluations also validate the effectiveness of the learned representations of A3D for cross-dataset and cross-task transfer.
References
[1] Official implementation of SlowFast. https://github.com/facebookresearch/SlowFast .
[2] Han Cai, Chuang Gan, Tianzhe Wang, Zhekai Zhang, and Song Han. Once-for-all: Train one network and specialize it for efficient deployment. arXiv preprint arXiv:1908.09791, 2019.
[3] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299–6308, 2017.
[4] Yunpeng Chen, Haoqi Fan, Bing Xu, Zhicheng Yan, Yannis Kalantidis, Marcus Rohrbach, Shuicheng Yan, and Jiashi Feng. Drop an octave: Reducing spatial redundancy in convolutional neural networks with octave convolution. In
Pro-ceedings of the IEEE International Conference on ComputerVision , pages 3435–3444, 2019.[5] Yunpeng Chen, Yannis Kalantidis, Jianshu Li, ShuichengYan, and Jiashi Feng. Multi-fiber networks for video recogni-tion. In
Proceedings of the european conference on computervision (ECCV) , pages 352–367, 2018.[6] Christoph Feichtenhofer. X3d: Expanding architectures forefficient video recognition. In
Proceedings of the IEEE/CVFConference on Computer Vision and Pattern Recognition ,pages 203–213, 2020.[7] Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, andKaiming He. Slowfast networks for video recognition. In
Proceedings of the IEEE international conference on com-puter vision , pages 6202–6211, 2019.[8] Priya Goyal, Piotr Doll´ar, Ross Girshick, Pieter Noord-huis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch,Yangqing Jia, and Kaiming He. Accurate, large mini-batch sgd: Training imagenet in 1 hour. arXiv preprintarXiv:1706.02677 , 2017.[9] Chunhui Gu, Chen Sun, David A Ross, Carl Vondrick, Car-oline Pantofaru, Yeqing Li, Sudheendra Vijayanarasimhan,George Toderici, Susanna Ricco, Rahul Sukthankar, et al.Ava: A video dataset of spatio-temporally localized atomicvisual actions. In
Proceedings of the IEEE Conferenceon Computer Vision and Pattern Recognition , pages 6047–6056, 2018.[10] Jinyang Guo, Wanli Ouyang, and Dong Xu. Multi-dimensional pruning: A unified framework for model com- pression. In
Proceedings of the IEEE/CVF Conferenceon Computer Vision and Pattern Recognition , pages 1508–1517, 2020.[11] Kensho Hara, Hirokatsu Kataoka, and Yutaka Satoh. Canspatiotemporal 3d cnns retrace the history of 2d cnns and im-agenet? In
Proceedings of the IEEE conference on ComputerVision and Pattern Recognition , pages 6546–6555, 2018.[12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun.Deep residual learning for image recognition. In
Proceed-ings of the IEEE conference on computer vision and patternrecognition , pages 770–778, 2016.[13] Andrew G Howard, Menglong Zhu, Bo Chen, DmitryKalenichenko, Weijun Wang, Tobias Weyand, Marco An-dreetto, and Hartwig Adam. Mobilenets: Efficient convolu-tional neural networks for mobile vision applications. arXivpreprint arXiv:1704.04861 , 2017.[14] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation net-works. In
Proceedings of the IEEE conference on computervision and pattern recognition , pages 7132–7141, 2018.[15] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kil-ian Q Weinberger. Densely connected convolutional net-works. In
Proceedings of the IEEE conference on computervision and pattern recognition , pages 4700–4708, 2017.[16] Jianwen Jiang, Yu Cao, Lin Song, Shiwei Zhang4 YunkaiLi, Ziyao Xu, Qian Wu, Chuang Gan, Chi Zhang, and GangYu. Human centric spatio-temporal action localization. In
ActivityNet Workshop on CVPR , 2018.[17] Jianwen Jiang, Yu Cao, Lin Song, Shiwei Zhang4 YunkaiLi, Ziyao Xu, Qian Wu, Chuang Gan, Chi Zhang, and GangYu. Human centric spatio-temporal action localization. In
ActivityNet Workshop on CVPR , 2018.[18] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang,Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola,Tim Green, Trevor Back, Paul Natsev, et al. The kinetics hu-man action video dataset. arXiv preprint arXiv:1705.06950 ,2017.[19] Bruno Korbar, Du Tran, and Lorenzo Torresani. Scsampler:Sampling salient clips from video for efficient action recog-nition. In
Proceedings of the IEEE International Conferenceon Computer Vision , pages 6232–6242, 2019.[20] Ji Lin, Chuang Gan, and Song Han. Tsm: Temporal shiftmodule for efficient video understanding. In
Proceedingsof the IEEE International Conference on Computer Vision ,pages 7083–7093, 2019.[21] Chenxu Luo and Alan L Yuille. Grouped spatial-temporalaggregation for efficient action recognition. In
Proceedingsof the IEEE International Conference on Computer Vision ,pages 5512–5521, 2019.[22] Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun.Shufflenet v2: Practical guidelines for efficient cnn architec-ture design. In
Proceedings of the European conference oncomputer vision (ECCV) , pages 116–131, 2018.[23] Yue Meng, Chung-Ching Lin, Rameswar Panda, PrasannaSattigeri, Leonid Karlinsky, Aude Oliva, Kate Saenko, andRogerio Feris. Ar-net: Adaptive frame resolution for effi-cient action recognition. arXiv preprint arXiv:2007.15796 ,2020.
[24] Wei Peng, Xiaopeng Hong, and Guoying Zhao. Video action recognition via neural architecture searching. In , pages 11–15. IEEE, 2019. [25] AJ Piergiovanni, Anelia Angelova, Alexander Toshev, and Michael S Ryoo. Evolving space-time neural architectures for videos. In
Proceedings of the IEEE international confer-ence on computer vision , pages 1793–1802, 2019.[26] Zhaofan Qiu, Ting Yao, and Tao Mei. Learning spatio-temporal representation with pseudo-3d residual networks.In proceedings of the IEEE International Conference onComputer Vision , pages 5533–5541, 2017.[27] Prajit Ramachandran, Barret Zoph, and Quoc V Le.Searching for activation functions. arXiv preprintarXiv:1710.05941 , 2017.[28] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun.Faster r-cnn: Towards real-time object detection with regionproposal networks. In
Advances in neural information pro-cessing systems , pages 91–99, 2015.[29] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zh-moginov, and Liang-Chieh Chen. Mobilenetv2: Invertedresiduals and linear bottlenecks. In
Proceedings of theIEEE conference on computer vision and pattern recogni-tion , pages 4510–4520, 2018.[30] Gunnar A Sigurdsson, Santosh Divvala, Ali Farhadi, and Ab-hinav Gupta. Asynchronous temporal fields for action recog-nition. In
Proceedings of the IEEE Conference on ComputerVision and Pattern Recognition , pages 585–594, 2017.[31] Gunnar A Sigurdsson, G¨ul Varol, Xiaolong Wang, AliFarhadi, Ivan Laptev, and Abhinav Gupta. Hollywood inhomes: Crowdsourcing data collection for activity under-standing. In
European Conference on Computer Vision ,pages 510–526. Springer, 2016.[32] Karen Simonyan and Andrew Zisserman. Very deep convo-lutional networks for large-scale image recognition. arXivpreprint arXiv:1409.1556 , 2014.[33] Chen Sun, Abhinav Shrivastava, Carl Vondrick, Kevin Mur-phy, Rahul Sukthankar, and Cordelia Schmid. Actor-centricrelation network. In
Proceedings of the European Confer-ence on Computer Vision (ECCV) , pages 318–334, 2018.[34] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet,Scott Reed, Dragomir Anguelov, Dumitru Erhan, VincentVanhoucke, and Andrew Rabinovich. Going deeper withconvolutions. In
Proceedings of the IEEE conference oncomputer vision and pattern recognition , pages 1–9, 2015.[35] Mingxing Tan and Quoc Le. Efficientnet: Rethinking modelscaling for convolutional neural networks. In
InternationalConference on Machine Learning , pages 6105–6114, 2019.[36] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani,and Manohar Paluri. Learning spatiotemporal features with3d convolutional networks. In
Proceedings of the IEEE inter-national conference on computer vision , pages 4489–4497,2015.[37] Du Tran, Heng Wang, Lorenzo Torresani, and Matt Feis-zli. Video classification with channel-separated convolu-tional networks. In
Proceedings of the IEEE InternationalConference on Computer Vision , pages 5552–5561, 2019. [38] Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, YannLeCun, and Manohar Paluri. A closer look at spatiotemporalconvolutions for action recognition. In
Proceedings of theIEEE conference on Computer Vision and Pattern Recogni-tion , pages 6450–6459, 2018.[39] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaim-ing He. Non-local neural networks. In
Proceedings of theIEEE conference on computer vision and pattern recogni-tion , pages 7794–7803, 2018.[40] Xiaofang Wang, Xuehan Xiong, Maxim Neumann, AJ Pier-giovanni, Michael S Ryoo, Anelia Angelova, Kris M Ki-tani, and Wei Hua. Attentionnas: Spatiotemporal atten-tion cell search for video classification. arXiv preprintarXiv:2007.12034 , 2020.[41] Chao-Yuan Wu, Manzil Zaheer, Hexiang Hu, R Manmatha,Alexander J Smola, and Philipp Kr¨ahenb¨uhl. Compressedvideo action recognition. In
Proceedings of the IEEE Con-ference on Computer Vision and Pattern Recognition , pages6026–6035, 2018.[42] Zuxuan Wu, Caiming Xiong, Chih-Yao Ma, Richard Socher,and Larry S Davis. Adaframe: Adaptive frame selection forfast video recognition. In
Proceedings of the IEEE Con-ference on Computer Vision and Pattern Recognition , pages1278–1287, 2019.[43] Saining Xie, Ross Girshick, Piotr Doll´ar, Zhuowen Tu, andKaiming He. Aggregated residual transformations for deepneural networks. In
Proceedings of the IEEE conference oncomputer vision and pattern recognition , pages 1492–1500,2017.[44] Saining Xie, Chen Sun, Jonathan Huang, Zhuowen Tu, andKevin Murphy. Rethinking spatiotemporal feature learning:Speed-accuracy trade-offs in video classification. In
Pro-ceedings of the European Conference on Computer Vision(ECCV) , pages 305–321, 2018.[45] Mengmeng Xu, Chen Zhao, David S Rojas, Ali Thabet, andBernard Ghanem. G-tad: Sub-graph localization for tempo-ral action detection. In
Proceedings of the IEEE/CVF Con-ference on Computer Vision and Pattern Recognition , pages10156–10165, 2020.[46] Taojiannan Yang, Sijie Zhu, Chen Chen, Shen Yan, MiZhang, and Andrew Willis. Mutualnet: Adaptive convnetvia mutual learning from network width and resolution. In
European Conference on Computer Vision (ECCV) , 2020.[47] Jiahui Yu and Thomas S Huang. Universally slimmable net-works and improved training techniques. In
Proceedingsof the IEEE International Conference on Computer Vision ,pages 1803–1811, 2019.[48] Jiahui Yu, Pengchong Jin, Hanxiao Liu, Gabriel Bender,Pieter-Jan Kindermans, Mingxing Tan, Thomas Huang, Xi-aodan Song, Ruoming Pang, and Quoc Le. Bignas: Scalingup neural architecture search with big single-stage models. arXiv preprint arXiv:2003.11142 , 2020.[49] Peisen Zhao, Lingxi Xie, Chen Ju, Ya Zhang, Yanfeng Wang,and Qi Tian. Bottom-up temporal action localization withmutual regularization. In
European Conference on ComputerVision , pages 539–555. Springer, 2020.[50] Bolei Zhou, Alex Andonian, Aude Oliva, and Antonio Tor-ralba. Temporal relational reasoning in videos. In
Pro- eedings of the European Conference on Computer Vision(ECCV) , pages 803–818, 2018.[51] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva,and Antonio Torralba. Learning deep features for discrimina-tive localization. In Proceedings of the IEEE conference oncomputer vision and pattern recognition , pages 2921–2929,2016.
A. Appendix
A.1. Full Trade-off List
A3D-Slow-8×8-[0.016, 1]. We show the detailed configurations of the trade-off list for A3D-Slow-8×8-[0.016, 1] in Table 9.

Table 9. Full trade-off list of A3D-Slow-8×8-[0.016, 1].

A3D-Slow-8×8-[0.06, 1]. We show the detailed configurations of the trade-off list for A3D-Slow-8×8-[0.06, 1] in Table 10; they correspond to the curve in Fig. 7 of the main paper.

Table 10. Full trade-off list of A3D-Slow-8×8-[0.06, 1].

A3D-SlowFast-4×16-[0.06, 1]. We show the detailed configurations of the trade-off list for A3D-SlowFast-4×16-[0.06, 1] in Table 11. The configurations correspond to the curve in Fig. 8 of the main paper.

Table 11. Full trade-off list of A3D-SlowFast-4×16-[0.06, 1].

A.2. Full Trade-off Table

A3D-Slow-8×8-[0.016, 1]. In Table 12, we show the results of all the configurations tested in our paper for A3D-Slow-8×8-[0.016, 1].

Table 12. Full trade-off table of A3D-Slow-8×8-[0.016, 1].

A3D-Slow-8×8-[0.06, 1]. In Table 13, we show the results of all the configurations tested in our paper for A3D-Slow-8×8-[0.06, 1].

Table 13. Full trade-off table of A3D-Slow-8×8-[0.06, 1].

A3D-SlowFast-4×16-[0.06, 1]. In Table 14, we show the results of all the configurations tested in our paper for A3D-SlowFast-4×16-[0.06, 1].

Table 14. Full trade-off table of A3D-SlowFast-4×16-[0.06, 1].