Collaborative Spatio-temporal Feature Learning for Video Action Recognition
Chao Li, Qiaoyong Zhong, Di Xie, Shiliang Pu
Hikvision Research Institute
{lichao15, zhongqiaoyong, xiedi, pushiliang}@hikvision.com

Abstract
Spatiotemporal feature learning is of central importance for action recognition in videos. Existing deep neural network models either learn spatial and temporal features independently (C2D) or jointly with unconstrained parameters (C3D). In this paper, we propose a novel neural operation which encodes spatiotemporal features collaboratively by imposing a weight-sharing constraint on the learnable parameters. In particular, we perform 2D convolution along three orthogonal views of the volumetric video data, which learns spatial appearance and temporal motion cues respectively. By sharing the convolution kernels of different views, spatial and temporal features are collaboratively learned and thus benefit from each other. The complementary features are subsequently fused by a weighted summation whose coefficients are learned end-to-end. Our approach achieves state-of-the-art performance on large-scale benchmarks and won the 1st place in the Moments in Time Challenge 2018. Moreover, based on the learned coefficients of different views, we are able to quantify the contributions of spatial and temporal features. This analysis sheds light on the interpretability of the model and may also guide the future design of algorithms for video recognition.
1. Introduction
Recently, video action recognition has drawn increasing attention considering its potential in a wide range of applications such as video surveillance, human-computer interaction and social video recommendation. The key to this task lies in joint spatiotemporal feature learning. The spatial feature mainly describes the appearance of the objects involved in an action, as well as the scene configuration, within each frame of the video. Spatial feature learning is analogous to that of still image recognition, and thus easily benefits from the recent advancements brought by deep Convolutional Neural Networks (CNN) [13]. The temporal feature, in contrast, captures motion cues embedded in the evolving frames over time. Two challenges arise here. One is how to learn the temporal feature; the other is how to properly fuse spatial and temporal features.

Figure 1. Visualization of three views of a video, which motivates our design of collaborative spatiotemporal feature learning. Top left: view of H-W. Top right: view of T-H. Bottom: view of T-W.

The first attempt of researchers is to model temporal motion information explicitly and in parallel to spatial information: raw frames and optical flow between adjacent frames are exploited as two input streams of a deep neural network [23, 6]. On the other hand, as a generalization of 2D ConvNets (C2D) for still image recognition, 3D ConvNets (C3D) have been proposed to tackle 3D volumetric video data [24]. In C3D, spatial and temporal features are closely entangled and jointly learned. That is, rather than learning spatial and temporal features separately and fusing them at the top of the network, joint spatiotemporal features are learned by 3D convolutions distributed over the whole network. Considering the excellent feature representation learning capability of CNNs, ideally C3D should achieve the same success on video understanding that C2D does on image recognition. However, the huge number of model parameters and the computational inefficiency limit the effectiveness and practicality of C3D.

In this paper, we propose a novel Collaborative SpatioTemporal (CoST) feature learning operation, which learns spatiotemporal features jointly with a weight-sharing constraint. Given a 3D volumetric video tensor, we flatten it into three sets of 2D images by viewing it from different angles. Then 2D convolution is applied to each set of 2D images. Figure 1 shows the 2D snapshots from three views of an exemplary video clip, where a man is high jumping at the stadium. The view of H-W is the natural view with which human beings are familiar; by scanning the video frame by frame from this view over time T, we are able to understand the video content. Although snapshots from the views involving T (i.e. T-W and T-H) are difficult for human beings to interpret, they contain exactly the same amount of information as the normal H-W view. More importantly, rich motion information is embedded within each frame rather than between frames. Hence 2D convolutions on frames of the T-W and T-H views are able to capture temporal motion cues directly. As shown in Figure 2(c), by fusing the complementary spatial and temporal features of the three views, we are able to learn spatiotemporal features using 2D convolutions rather than 3D convolutions.

Figure 2. Comparison of CoST to common spatiotemporal feature learning architectures. (a) C3D 3×3×3. (b) C3D 3×1×1. (c) The proposed CoST.
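To make the three views concrete, the following minimal sketch (our own PyTorch illustration; the tensor layout, sizes and variable names are assumptions, not code from the paper) shows how the same clip yields H-W, T-W and T-H frames simply by permuting axes.

```python
import torch

# A toy video clip: T frames of H x W pixels with C channels, laid out as (C, T, H, W).
clip = torch.randn(3, 8, 224, 224)  # hypothetical sizes matching the 8-frame input used later

# H-W view: the familiar per-frame images, one slice per time step t.
hw_frames = clip                          # index as clip[:, t] -> (C, H, W)

# T-W view: transpose so that each slice is a (T, W) image, one per row h.
tw_frames = clip.permute(0, 2, 1, 3)      # (C, H, T, W); index [:, h] -> (C, T, W)

# T-H view: each slice is a (T, H) image, one per column w.
th_frames = clip.permute(0, 3, 1, 2)      # (C, W, T, H); index [:, w] -> (C, T, H)

# All three views contain exactly the same voxels, only scanned along different axes.
assert hw_frames.numel() == tw_frames.numel() == th_frames.numel()
```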
Notably, the convolution kernels of different views are shared, for the following reasons. 1) From the visualization of the frames of different views (see Figure 1), their visual appearances are compatible. For example, common spatial patterns such as edges and color blobs also exist in the temporal views (T-H and T-W). Hence, the same set of convolution kernels can be applied to frames of different views. 2) Convolution kernels in C2D networks are inherently redundant without pruning [9, 15, 31]; the redundant kernels can be exploited for temporal feature learning by means of weight sharing. 3) The number of model parameters is greatly reduced, such that the network is easier to train and less prone to overfitting, resulting in better performance. Besides, the success of spatial feature learning on still images (e.g. carefully designed network architectures and pre-trained parameters) can be transferred to the temporal domain with little effort.

The complementary features of different views are fused by a weighted summation. We learn an independent coefficient for each channel in each view, which allows the network to attend to either spatial or temporal features on demand. Moreover, based on the learned coefficients, we are able to quantify the respective contributions of the spatial and temporal domains.

Based on the CoST operation, we build a convolutional neural network. We will henceforth refer to both the operation and the network as CoST; which one is meant should be easy to identify from the context. Compared with C2D, CoST can learn spatiotemporal features jointly, while compared with C3D, CoST is based on 2D rather than 3D convolutions. CoST essentially bridges the gap between C2D and C3D, retaining the benefits from both sides, i.e. the compactness of C2D and the representation capability of C3D. For the task of action recognition in videos, experiments show that CoST achieves superior performance over both C2D and C3D.

The main contributions of this work are summarized as follows:

• We propose CoST, which collaboratively learns spatiotemporal features using 2D convolutions rather than 3D convolutions.
• To the best of our knowledge, this is the first work on quantitative analysis of the importance of spatial and temporal features for video understanding.

• The proposed CoST model outperforms the conventional C3D model and its variants, achieving state-of-the-art performance on large-scale benchmarks.
2. Related Work
In the early stage, hand-crafted representations were well explored for video action recognition. Many feature descriptors for 2D images have been generalized to the 3D spatiotemporal domain, e.g. Space-Time Interest Points (STIP) [14], SIFT-3D [21], Spatiotemporal SIFT [1] and 3D Histogram of Gradient [12]. The most successful hand-crafted representations are dense trajectories [27] and their improved version [28], which extract local features along trajectories guided by optical flow.

Encouraged by the great success of deep learning, especially the CNN model for image understanding, there have been a number of attempts to develop deep learning methods for action classification [33]. The two-stream architecture [23] utilizes visual frames and optical flow between adjacent frames as two separate inputs of the network, and fuses their output classification scores as the final prediction. Many works follow and extend this architecture [5, 6, 34]. LSTM networks have also been employed to capture temporal dynamics and long-range dependencies in videos. In [18, 4] a CNN is used to learn spatial features for each frame, while an LSTM is used to model temporal evolution.

Figure 3. Comparison of various residual units for action recognition in videos: (a) C2D, (b) C3D 3×3×3, (c) C3D 3×1×1, (d) CoST.
More recently, with the increasing computing capability of modern GPUs and the availability of large-scale video datasets, 3D ConvNets (C3D) have drawn more and more attention. In [24] an 11-layer C3D model is designed to jointly learn spatiotemporal features on the Sports-1M dataset [11]. However, the huge computational cost and the dense parameters of C3D make it infeasible to train a very deep model. Qiu et al. [19] proposed Pseudo-3D (P3D), which decomposes a 3D convolution of 3×3×3 into a 2D convolution of 1×3×3 followed by a 1D convolution of 3×1×1. In another work [25], a similar architecture is explored and referred to as (2+1)D. [2] proposed the Inflated 3D ConvNet (I3D), which is exactly C3D whose parameters are initialized by inflating the parameters of a pre-trained C2D model.

The most closely related work to ours is Slicing CNN [22], which also learns features from multiple views for crowd video understanding. However, there are substantial differences between Slicing CNN and the proposed CoST. Slicing CNN learns independent features of the three views via three different network branches, which are merged at the top of the network; aggregation of spatial and temporal features is conducted only once, at the network level. On the contrary, we learn spatiotemporal features collaboratively using a novel CoST operation, and spatiotemporal feature aggregation is conducted layer-wise.
3. Method
In this section, we first review the conventional C2D and C3D architectures, which are implemented as baselines. Then we introduce the proposed CoST. The connections and comparisons between CoST and C2D / C3D are also discussed.
C2D leverages the strong spatial feature representation capability of 2D convolutions, while a simple strategy (e.g. pooling) is utilized for temporal feature aggregation. In this work, we implement C2D as a baseline model. We choose ResNets [8] as our backbone networks, whose residual unit is shown in Figure 3(a). To handle 3D volumetric video data, the vanilla ResNets need to be adapted accordingly. Taking ResNet-50 as an example, its adapted version for video action recognition is illustrated in Table 1; for convenience we will henceforth refer to it as ResNet-50-C2D. Note the differences between ResNet-50-C2D and vanilla ResNet-50. Firstly, all k × k 2D convolutions are adapted to their 3D form, i.e. 1 × k × k. Secondly, a temporal pooling layer is appended after one of the residual blocks to halve the number of frames from 8 to 4. Thirdly, the global average pooling is adapted from 7 × 7 to 4 × 7 × 7 such that spatial and temporal features are aggregated simultaneously. Similarly, we can set up ResNet-101-C2D based on ResNet-101.

Table 1. Architecture of ResNet-50-C2D. Spatial striding is performed on the first residual unit of each block.
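As a rough illustration of this adaptation (a sketch under our own assumptions, not the authors' released code), a pretrained k × k 2D kernel can be reused as a 1 × k × k 3D kernel, and a temporal max pooling can halve the frame count; the exact pooling kernel is not specified in the text, so a 3 × 1 × 1 stride-2 pool is only one plausible choice.

```python
import torch
import torch.nn as nn

def adapt_conv2d_to_c2d(conv2d: nn.Conv2d) -> nn.Conv3d:
    """Wrap a pretrained k x k 2D convolution as a 1 x k x k 3D convolution (C2D style)."""
    kh, kw = conv2d.kernel_size
    conv3d = nn.Conv3d(conv2d.in_channels, conv2d.out_channels,
                       kernel_size=(1, kh, kw),
                       stride=(1,) + conv2d.stride,
                       padding=(0,) + conv2d.padding,
                       bias=conv2d.bias is not None)
    # Copy pretrained weights: (out, in, kh, kw) -> (out, in, 1, kh, kw).
    conv3d.weight.data.copy_(conv2d.weight.data.unsqueeze(2))
    if conv2d.bias is not None:
        conv3d.bias.data.copy_(conv2d.bias.data)
    return conv3d

# Temporal pooling that halves the number of frames, e.g. from 8 to 4 (assumed kernel).
temporal_pool = nn.MaxPool3d(kernel_size=(3, 1, 1), stride=(2, 1, 1), padding=(1, 0, 0))

x = torch.randn(1, 64, 8, 56, 56)                      # (N, C, T, H, W)
conv = adapt_conv2d_to_c2d(nn.Conv2d(64, 64, 3, padding=1))
print(temporal_pool(conv(x)).shape)                    # torch.Size([1, 64, 4, 56, 56])
```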
C3D is a natural generalization of C2D for 3D video data. In C3D, 2D convolutions are converted to 3D by inflating the filters from square to cubic. For example, an h × w 2D filter can be converted into a t × h × w 3D filter by introducing an additional temporal dimension t [5, 2]. In modern deep CNN architectures like ResNets, there are two main types of filters, i.e. 1 × 1 and 3 × 3. As explored in [30], given a residual unit comprised of 1 × 1 and 3 × 3 convolutions, we may either inflate the middle 3 × 3 filter into 3 × 3 × 3 (C3D 3×3×3) as shown in Figure 3(b), or inflate the first 1 × 1 filter into 3 × 1 × 1 (C3D 3×1×1) as shown in Figure 3(c). Experiments in [30] demonstrate that C3D 3×3×3 and C3D 3×1×1 achieve comparable performance, while the latter contains much fewer parameters and is more computationally efficient. Therefore, in our implementation, C3D 3×1×1 is adopted and referred to as C3D for simplicity. Notably, the C3D 3×1×1 model learns spatial and temporal features alternately rather than jointly, which is very similar to the (2+1)D [25] and P3D [19] models. In our implementation, we inflate the first 1 × 1 filter for every two residual units following [30]. However, we leave conv1 unchanged to be 2D (1 × 7 × 7), as opposed to [30].

We now describe the proposed CoST operation in detail. Figure 2 compares it to common spatiotemporal feature aggregating modules. As mentioned above, C3D 3×3×3 utilizes a 3D convolution of 3 × 3 × 3 to extract spatial (along H and W) and temporal (along T) features jointly. In the C3D 3×1×1 configuration, a 1D 3 × 1 × 1 convolution along T is utilized to aggregate temporal features, followed by a 2D 1 × 3 × 3 convolution along H and W for spatial features. In the proposed method, we instead perform 2D 3 × 3 convolutions along three views of the T × H × W volumetric data, i.e. H-W, T-H and T-W, separately. Notably, the parameters of the three-view convolutions are shared, which keeps the number of parameters the same as for a single-view 2D convolution. The three resulting feature maps are subsequently aggregated with a weighted summation, whose weights are also learned during training in an end-to-end manner.

Let x denote the input feature maps of size T × H × W × C, where C is the number of input channels. The three sets of output feature maps from different views are computed by

x_hw = x ⊗ w_1×3×3,   x_tw = x ⊗ w_3×1×3,   x_th = x ⊗ w_3×3×1,   (1)

where ⊗ denotes 3D convolution and w is a set of convolution filters of size 3 × 3 shared among the three views. To apply w to frames of different views, we insert an additional dimension of size 1 at different positions. The resulting variants of w, i.e. w_1×3×3, w_3×1×3 and w_3×3×1, learn features of the H-W, T-W and T-H views respectively. Then, the three sets of feature maps are aggregated with a weighted summation:

y = [α_hw, α_tw, α_th] · [x_hw, x_tw, x_th]^T,   (2)

where α = [α_hw, α_tw, α_th] are coefficients of size C × 3, C being the number of output channels and 3 denoting the three views. To avoid magnitude explosion of the responses aggregated from multiple views, α is normalized with the Softmax function along each row.

Figure 4. Architecture of CoST(a), where the coefficients α are part of the model parameters.

Figure 5. Architecture of CoST(b), where the coefficients α are predicted by the network.

To learn the coefficients α, we propose two architectures, named CoST(a) and CoST(b).
CoST(a). As illustrated in Figure 4, the coefficients α are considered part of the model parameters, and are updated with back-propagation during training. During inference, the coefficients are fixed and the same set of coefficients is applied to every video clip.
CoST(b). The coefficients α are predicted by the network based on the very feature maps by which α will be multiplied. This design is inspired by the recent self-attention mechanism for machine translation [26]. In this case, the coefficients for each sample depend on the sample itself. It can be formulated as

[α_hw, α_tw, α_th] = f([x_hw, x_tw, x_th]).   (3)

The architecture of CoST(b) is illustrated in Figure 5, where the computational block inside the dashed lines represents the function f in Equation (3). Specifically, for each view, we first reduce the feature map from a size of T × H × W × C to 1 × 1 × 1 × C using global max pooling along the T, H and W dimensions. Then, a 1 × 1 × 1 convolution is applied to the pooled features, whose weights are also shared by all three views. This convolution maps features of dimension C back to C, which captures the contextual information among channels. After that, the three sets of features are concatenated and fed into a fully connected (FC) layer. As opposed to the 1 × 1 × 1 convolution, this FC layer is applied to each row of the C × 3 matrix, which captures the contextual information among different views. Finally, we normalize the output with the Softmax function.
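A minimal sketch of the coefficient branch of CoST(b), under our reading of Figure 5 (the class and layer names, and the channel-preserving 1 × 1 × 1 convolution, are assumptions on our part):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoSTbCoefficients(nn.Module):
    """Predict per-channel view coefficients alpha = f([x_hw, x_tw, x_th]), Eq. (3)."""

    def __init__(self, channels: int):
        super().__init__()
        # 1x1x1 convolution shared by the three views, mixing information across channels.
        self.channel_mix = nn.Conv3d(channels, channels, kernel_size=1)
        # FC layer applied per channel across the three views (one row of the C x 3 matrix).
        self.view_mix = nn.Linear(3, 3)

    def forward(self, x_hw, x_tw, x_th):
        feats = []
        for v in (x_hw, x_tw, x_th):                            # each of shape (N, C, T, H, W)
            pooled = F.adaptive_max_pool3d(v, 1)                # global max pool -> (N, C, 1, 1, 1)
            feats.append(self.channel_mix(pooled).flatten(1))   # (N, C)
        stacked = torch.stack(feats, dim=2)                     # (N, C, 3)
        alpha = F.softmax(self.view_mix(stacked), dim=2)        # normalize over the three views
        return alpha                                            # (N, C, 3)

coeff = CoSTbCoefficients(64)
v = torch.randn(2, 64, 8, 56, 56)
print(coeff(v, v, v).shape)     # torch.Size([2, 64, 3])
```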
With an op-timized implementation, the number of multiply-adds canbe reduced from k to k − k + 1 , e.g . from 27 to 19(save ∼ ) for the case of k = 3 .
4. Experiments
To validate the effectiveness of the proposed CoST for the task of action recognition in videos, we perform extensive experiments on two of the largest benchmark datasets, i.e. Moments in Time [17] and Kinetics [2]. Accuracies are measured on the validation set of both datasets in all experiments.
Moments in Time.
The Moments in Time dataset contains 802,245 training videos and 39,900 validation videos from 339 action categories. The videos are trimmed such that their duration is about 3 seconds.
Kinetics.
The Kinetics dataset contains 236,763 training videos and 19,095 validation videos, which are annotated as one of 400 human action categories. Note that the full Kinetics dataset contains somewhat more samples; these numbers only cover the samples we were able to download. The duration of the videos is about 10 seconds.
During training, we first sample 64 continuous frames from a video and then sub-sample one frame out of every 8 frames, resulting in 8 frames in total. Next, image patches with a size of 224 × 224 pixels are randomly cropped from a scaled video whose shorter side is randomly sampled between 256 and 320 pixels. Hence, the network input is of dimension 8 × 224 × 224. In all experiments, our models are initialized from ImageNet [20] pre-trained 2D models. We train the models on an 8-GPU machine. To speed up training, the 8 GPUs are grouped into two workers and the weights are updated asynchronously between the two workers. Each GPU processes a mini-batch of 8 video clips; that is, for each worker 4 GPUs are employed, resulting in a total mini-batch size of 32. We train the models for 600k iterations using the SGD optimizer with momentum. We use a momentum of 0.9 and a weight decay of 0.0001. The learning rate is initialized to 0.005 and reduced by a factor of 10 at 300k and 450k iterations respectively.
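A sketch of this sampling and cropping procedure (our own illustration with made-up helper names; the paper does not publish preprocessing code):

```python
import random
import torch
import torch.nn.functional as F

def sample_training_clip(video: torch.Tensor) -> torch.Tensor:
    """video: (T_total, C, H, W) decoded frames -> training clip of shape (8, C, 224, 224)."""
    # 64 consecutive frames, then keep every 8th one -> 8 frames.
    start = random.randint(0, max(video.shape[0] - 64, 0))
    clip = video[start:start + 64:8]

    # Rescale so the shorter side becomes a random value in [256, 320].
    short = random.randint(256, 320)
    h, w = clip.shape[-2:]
    scale = short / min(h, w)
    clip = F.interpolate(clip, scale_factor=scale, mode='bilinear', align_corners=False)

    # Random 224 x 224 spatial crop.
    h, w = clip.shape[-2:]
    top, left = random.randint(0, h - 224), random.randint(0, w - 224)
    return clip[..., top:top + 224, left:left + 224]

frames = torch.rand(300, 3, 240, 320)      # a hypothetical 10-second, 30 fps video
print(sample_training_clip(frames).shape)  # torch.Size([8, 3, 224, 224])
```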
Dataset     Method     Top-1    Top-5    Average
Moments     CoST(a)    29.3     55.8     42.6
Moments     CoST(b)    –        –        –
Kinetics    CoST(a)    73.6     90.8     82.2
Kinetics    CoST(b)    –        –        –

Table 2. Comparison of CoST(a) and CoST(b) for coefficient learning. Accuracies in %. The backbone network is ResNet-50.
During inference, following [30] we perform spatially fully convolutional inference on videos whose shorter side is rescaled to 256 pixels. For the temporal domain, we sample 10 clips evenly from a full-length video and compute their classification scores individually. The final prediction is the averaged score of all clips.
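The clip-level score averaging can be written as follows (a sketch; `model` and the clip sampler are placeholders, and whether scores are softmax-normalized before averaging is not specified in the text):

```python
import torch

@torch.no_grad()
def predict_video(model, clips):
    """clips: a list of 10 clip tensors of shape (1, C, T, H, W) sampled evenly from the video."""
    # Classify each clip individually, then average the scores over all clips.
    scores = [model(clip) for clip in clips]
    return torch.stack(scores, dim=0).mean(dim=0)
```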
To validate the effectiveness of the individual components of our approach, we perform ablation studies on coefficient learning, the impact of collaborative spatiotemporal feature learning, and the improvements of CoST over C2D and C3D.
We first compare the performance of the two CoST variants for coefficient learning of different views. As shown in Table 2, on both the Moments in Time and Kinetics datasets, coefficients predicted by the network (CoST(b)) outperform those learned as model parameters (CoST(a)). This result verifies the effectiveness of the self-attention mechanism introduced in our model. It also reveals that for different video clips, the importance of spatial and temporal features varies. Henceforth, the CoST(b) architecture is adopted in the following experiments.
To validate the effectiveness of collaborative spatiotemporal feature learning through weight sharing, we compare the results of the CoST(b) network with and without weight sharing. When weight sharing is disabled, the parameters of the three convolutional layers in Figure 5 are learned independently, such that spatiotemporal features are learned in a decoupled manner. As listed in Table 3, with weight sharing among different views, accuracies improve by about 1% on both datasets. This result shows that our analysis of the characteristics of the three spatial and temporal views in Section 1 is reasonable and that their collaborative feature learning is beneficial.

Dataset     Share Weight    Top-1    Top-5    Average
Moments                     29.0     56.1     42.5
Moments     ✓               –        –        –
Kinetics                    73.2     90.2     81.7
Kinetics    ✓               –        –        –

Table 3. Performance improvements brought by weight sharing, using ResNet-50 as the backbone. Accuracies in %.
Backbone      Method    Top-1    Top-5    Average
ResNet-50     C2D       27.9     54.6     41.3
ResNet-50     C3D       29.0     55.3     42.2
ResNet-50     CoST      –        –        –
ResNet-101    C2D       30.0     56.8     43.4
ResNet-101    C3D       30.6     57.7     44.2
ResNet-101    CoST      –        –        –

Table 4. Performance comparison of C2D, C3D and CoST on the validation set of Moments in Time. Accuracies in %.
Backbone      Method    Top-1    Top-5    Average
ResNet-50     C2D       71.5     89.8     80.7
ResNet-50     C3D       73.3     90.4     81.9
ResNet-50     CoST      –        –        –
ResNet-101    C2D       72.9     89.8     81.4
ResNet-101    C3D       74.5     91.1     82.8
ResNet-101    CoST      –        –        –

Table 5. Performance comparison of C2D, C3D and CoST on the validation set of Kinetics. Accuracies in %.
To compare CoST with the C2D and C3D baselines, we train all three networks using the same protocol. Their performance on the Moments in Time and Kinetics datasets is listed in Table 4 and Table 5 respectively. We can see that C3D is far better than C2D, while CoST consistently outperforms C3D by about 1%, which clearly demonstrates the superiority of CoST. Note that the performance of C3D with a ResNet-50 backbone is on par with that of the proposed CoST without weight sharing (see Table 3), which validates the connection between CoST and C3D discussed in Section 3.
Besides the 8-frame model, we also train a model with a higher temporal resolution, i.e. 32 frames. On Moments in Time, the 32 input frames are sampled from the 64 continuous frames mentioned earlier. On Kinetics, we sample 32 frames from a clip of 128 frames, considering that videos in this dataset are longer than those in Moments in Time. The 32-frame model is fine-tuned from the 8-frame model, where the parameters of the BN layers [10] are frozen.

Method          Network        Pre-training    Input Size        Top-1    Top-5
C3D [7]         ResNet-101     None            16 × 112 × 112    62.8     83.9
C3D [7]         ResNeXt-101    None            16 × 112 × 112    65.1     85.7
ARTNet [29]     ResNet-18      None            16 × 112 × 112    69.2     88.3
STC [3]         ResNeXt-101    None            32 × 112 × 112    68.7     88.5
I3D [2]         Inception      ImageNet        64 × 224 × 224    71.1∗    –∗
R(2+1)D [25]    Custom         None            8 × 112 × 112     72.0     90.0
R(2+1)D [25]    Custom         Sports-1M       8 × 112 × 112     74.3     91.4
S3D-G [32]      Inception      ImageNet        64 × 224 × 224    74.7     –
NL I3D [30]     ResNet-101     ImageNet        32 × 224 × 224    76.0     92.1
NL I3D [30]     ResNet-101     ImageNet        128 × 224 × 224   75.5     92.0
CoST            ResNet-101     ImageNet        32 × 224 × 224    –        –

Table 6. Comparison with the state of the art on the validation set of Kinetics. For fair comparison, only results based on the RGB modality are listed. All numbers are single-model results. Accuracies in %. ∗ indicates results on the test set.

Method                            Top-1    Top-5
ResNet-50-Scratch [17]            23.7     46.7
ResNet-50-ImageNet [17]           27.2     51.7
SoundNet-Audio [17] †             7.6      18.0
TSN-Flow [17] †                   15.7     34.7
RGB+Flow+Audio [17] †             30.4     55.9
CoST (ResNet-50, 8 frames)        30.1     57.2
CoST (ResNet-101, 8 frames)       31.5     57.9
CoST (ResNet-101, 32 frames)      32.4     60.0

Table 7. Comparison with the state of the art on the validation set of Moments in Time. Accuracies in %. Methods marked with † exploit additional modalities, e.g. audio and optical flow.
On the Moments in Time dataset, Table 7 shows a comparison of the proposed CoST with existing methods. CoST improves over the ResNet-50 C2D baseline reported in [17] by 2.9% and 5.5% in terms of top-1 and top-5 accuracy respectively, while ResNet-101 based CoST with 32 input frames achieves 32.4% top-1 accuracy and 60.0% top-5 accuracy. Notably, based on the RGB modality only, our model outperforms the ensemble result of multiple modalities (i.e. RGB, optical flow and audio) in [17] by a large margin. With an ensemble of multiple models and modalities, we achieve 52.91% average accuracy on the test set, which won the 1st place in the Moments in Time Challenge 2018.

On the Kinetics dataset, CoST achieves state-of-the-art performance. As shown in Table 6, CoST has a clear advantage over C3D [7] and its variants, e.g. I3D [2], R(2+1)D [25] and S3D-G [32]. Compared with NL I3D [30], which is a strong baseline, CoST is also superior at various temporal resolutions.
By investigating the magnitude of the learned coefficients, we are able to quantify the contribution of the different views. Specifically, for each CoST layer, the mean coefficient of each view is computed on the validation set. The mean coefficient of the H-W view measures the importance of the appearance feature, while those of the T-W and T-H views measure the importance of temporal motion cues.

The overall importance of each view can be measured by averaging the mean coefficients of all CoST layers. On Moments in Time, the mean coefficients of the H-W, T-W and T-H views are 0.67, 0.14 and 0.19 respectively, while on Kinetics they are 0.77, 0.08 and 0.15. Hence, the spatial feature plays a major role on both datasets, and the Moments in Time dataset depends more on the temporal feature to discriminate different actions than Kinetics does.

Figure 8 shows the coefficient distribution among the three views in all CoST layers of the ResNet-50 based CoST. From shallow layers to deep layers, a clear trend is observed on both datasets: the contribution of the spatial feature declines, while that of the temporal feature rises. In other words, the closer to the top of the network, the more important the temporal feature is, suggesting that the model tends to learn temporal features based on high-level spatial features. This also verifies the conclusion in [32] that temporal representation learning on high-level semantic features is more useful than on low-level features.

Figure 8. Distribution of the mean coefficients of the H-W, T-W and T-H views across CoST layers of various depths (x-axis: index of CoST layers), shown separately for Moments in Time and Kinetics.
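The per-view statistics can be gathered as follows (our own post-hoc analysis sketch; it assumes each CoST layer exposes its softmax-normalized coefficients of shape (N, C, 3), ordered as H-W, T-W, T-H):

```python
import torch

def mean_view_coefficients(alpha_per_layer):
    """alpha_per_layer: list over CoST layers of tensors (N, C, 3) collected on the validation set.

    Returns one (3,) tensor per layer and the (3,) overall mean over all layers.
    """
    per_layer = [a.mean(dim=(0, 1)) for a in alpha_per_layer]   # average over samples and channels
    overall = torch.stack(per_layer).mean(dim=0)                # average over layers
    return per_layer, overall
```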
Furthermore, we analyze the importance of spatial and temporal features for each action category on the Moments in Time dataset. We sum up the mean coefficients of the temporal-related views and sort all categories by this value. As shown in Figure 7, for actions such as erupting, storming, overflowing, combusting and landing, temporal motion information is very important. On the contrary, for actions such as baptizing, handcuffing / arresting, interviewing, buying and paying, the temporal feature is less important. These actions can either be easily recognized by appearance, or their temporal evolutions are not very helpful for classification. For example, for buying and interviewing, various motion patterns exist within the same category and may be easily confused between different actions, which makes the motion cues not discriminative.

Figure 7. Left: actions for which the temporal feature matters. Right: actions for which the temporal feature is less important.

In summary, with the proposed CoST, we are able to quantitatively analyze the importance of spatial and temporal features. In particular, we observe that the bottom layers of the network focus more on spatial feature learning, while the top layers attend more to temporal feature aggregation. Besides, some actions are easier to recognize based on the underlying objects and their interactions (e.g. geometric relations) rather than motion cues. This indicates that the current spatiotemporal feature learning approaches may not be optimal, and we expect more efforts on this problem.
5. Discussion
For video analysis, how to encode spatiotemporal features effectively and efficiently is still an open question. In this work, we propose to use weight-shared 2D convolutions for simultaneous spatial and temporal feature encoding. Although we empirically verify that weight sharing brings a performance gain, one big question behind it is whether the temporal dimension T can be treated as a normal spatial dimension (like depth) or not. Intuitively, spatial appearance features and temporal motion cues belong to two different modalities of information. What motivates us to learn them collaboratively is the visualization of the different views shown in Figure 1. Interestingly, our positive results indicate that, at least to some extent, they share similar characteristics and can be jointly learned using a single network with identical architecture and shared convolution kernels. In physics, according to Minkowski spacetime [16], three-dimensional space and one-dimensional time can be unified as a four-dimensional continuum. Our finding might be explained and supported by the spacetime model in the context of feature representation learning.
6. Conclusion
Feature learning from 3D volumetric data is the major challenge for action recognition in videos. In this paper, we propose a novel feature learning operation which learns spatiotemporal features collaboratively from multiple views. It can easily be used as a drop-in replacement for C2D and C3D. Experiments on large-scale benchmarks validate the superiority of the proposed architecture over existing methods. Based on the learned coefficients of different views, we are able to take a peek at the individual contributions of spatial and temporal features for classification. A systematic analysis indicates some promising directions for the design of algorithms, which we leave as future work.

References
[1] M. Al Ghamdi, L. Zhang, and Y. Gotoh. Spatio-temporal SIFT and its application to human action classification. In Computer Vision – ECCV 2012, Workshops and Demonstrations, 2012.
[2] J. Carreira and A. Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In CVPR, 2017.
[3] A. Diba, M. Fayyaz, V. Sharma, M. M. Arzani, R. Yousefzadeh, J. Gall, and L. Van Gool. Spatio-temporal channel correlation networks for action classification. In ECCV, pages 299–315, 2018.
[4] J. Donahue, L. A. Hendricks, M. Rohrbach, S. Venugopalan, S. Guadarrama, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. TPAMI, 39(4):677–691, 2017.
[5] C. Feichtenhofer, A. Pinz, and R. P. Wildes. Spatiotemporal residual networks for video action recognition. In NIPS, pages 3468–3476, 2016.
[6] C. Feichtenhofer, A. Pinz, and R. P. Wildes. Spatiotemporal multiplier networks for video action recognition. In CVPR, pages 7445–7454, 2017.
[7] K. Hara, H. Kataoka, and Y. Satoh. Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet? In CVPR, 2018.
[8] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
[9] Y. He, X. Zhang, and J. Sun. Channel pruning for accelerating very deep neural networks. In ICCV, 2017.
[10] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, pages 448–456, 2015.
[11] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and F. F. Li. Large-scale video classification with convolutional neural networks. In CVPR, pages 1725–1732, 2014.
[12] A. Klaser. A spatiotemporal descriptor based on 3D-gradients. In BMVC, 2010.
[13] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[14] I. Laptev. On space-time interest points. International Journal of Computer Vision, 64(2):107–123, 2005.
[15] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf. Pruning filters for efficient ConvNets. In ICLR, 2017.
[16] H. Minkowski et al. Space and time. The Principle of Relativity, pages 73–91, 1908.
[17] M. Monfort, B. Zhou, S. A. Bargal, T. Yan, A. Andonian, K. Ramakrishnan, L. Brown, Q. Fan, D. Gutfruend, C. Vondrick, et al. Moments in Time dataset: One million videos for event understanding.
[18] J. Y. Ng, M. J. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici. Beyond short snippets: Deep networks for video classification. In CVPR, pages 4694–4702, 2015.
[19] Z. Qiu, T. Yao, and T. Mei. Learning spatio-temporal representation with pseudo-3D residual networks. In ICCV, 2017.
[20] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
[21] P. Scovanner, S. Ali, and M. Shah. A 3-dimensional SIFT descriptor and its application to action recognition. In International Conference on Multimedia, pages 357–360, 2007.
[22] J. Shao, C.-C. Loy, K. Kang, and X. Wang. Slicing convolutional neural network for crowd video understanding. In CVPR, 2016.
[23] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, pages 568–576, 2014.
[24] D. Tran, L. D. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3D convolutional networks. In ICCV, pages 4489–4497, 2015.
[25] D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, and M. Paluri. A closer look at spatiotemporal convolutions for action recognition. In CVPR, 2018.
[26] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In NIPS, pages 5998–6008, 2017.
[27] H. Wang, A. Kläser, C. Schmid, and C. L. Liu. Dense trajectories and motion boundary descriptors for action recognition. IJCV, 103(1):60–79, 2013.
[28] H. Wang and C. Schmid. Action recognition with improved trajectories. In ICCV, pages 3551–3558, 2014.
[29] L. Wang, W. Li, W. Li, and L. Van Gool. Appearance-and-relation networks for video classification. In CVPR, 2018.
[30] X. Wang, R. Girshick, A. Gupta, and K. He. Non-local neural networks. In CVPR, 2018.
[31] D. Xie, J. Xiong, and S. Pu. All you need is beyond a good init: Exploring better solution for training extremely deep convolutional neural networks with orthonormality and modulation. In CVPR, pages 6176–6185, 2017.
[32] S. Xie, C. Sun, J. Huang, Z. Tu, and K. Murphy. Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification. In ECCV, 2018.
[33] Y. Zhao, Y. Xiong, L. Wang, Z. Wu, X. Tang, and D. Lin. Temporal action detection with structured segment networks. In ICCV, pages 2933–2942, 2017.
[34] Y. Zhu, Z. Lan, S. D. Newsam, and A. G. Hauptmann. Hidden two-stream convolutional networks for action recognition. arXiv preprint.