Modeling Spatial-Temporal Clues in a Hybrid Deep Learning Framework for Video Classification
Zuxuan Wu, Xi Wang, Yu-Gang Jiang†, Hao Ye, Xiangyang Xue
School of Computer Science, Shanghai Key Lab of Intelligent Information Processing, Fudan University, Shanghai, China
{zxwu, xwang10, ygj, haoye10, xyxue}@fudan.edu.cn
† Corresponding author.
ABSTRACT
Classifying videos according to content semantics is an important problem with a wide range of applications. In this paper, we propose a hybrid deep learning framework for video classification, which is able to model static spatial information, short-term motion, as well as long-term temporal clues in the videos. Specifically, the spatial and the short-term motion features are extracted separately by two Convolutional Neural Networks (CNN). These two types of CNN-based features are then combined in a regularized feature fusion network for classification, which is able to learn and utilize feature relationships for improved performance. In addition, Long Short Term Memory (LSTM) networks are applied on top of the two features to further model longer-term temporal clues. The main contribution of this work is the hybrid learning framework that can model several important aspects of the video data. We also show that (1) combining the spatial and the short-term motion features in the regularized fusion network is better than direct classification and fusion using the CNN with a softmax layer, and (2) the sequence-based LSTM is highly complementary to the traditional classification strategy without considering the temporal frame orders. Extensive experiments are conducted on two popular and challenging benchmarks, the UCF-101 Human Actions and the Columbia Consumer Videos (CCV). On both benchmarks, our framework achieves to-date the best reported performance: 91.3% on the UCF-101 and 83.5% on the CCV.
Categories and Subject Descriptors
H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing—Indexing methods; I.5.2 [Pattern Recognition]: Design Methodology—Classifier design and evaluation
Keywords
Video Classification, Deep Learning, CNN, LSTM, Fusion.
1. INTRODUCTION
Video classification based on contents like human actions or complex events is a challenging task that has been extensively studied in the research community. Significant progress has been achieved in recent years by designing various features, which are expected to be robust to intra-class variations and discriminative to separate different classes. For example, one can utilize traditional image-based features like the SIFT [26] to capture the static spatial information in videos. In addition to the static frame based visual features, motion is a very important clue for video classification, as most classes containing object movements like the human actions require the motion information to be reliably recognized. For this, a very popular feature is the dense trajectories [44], which tracks densely sampled local frame patches over time and computes several traditional features based on the trajectories.

In contrast to the hand-crafted features, there is a growing trend of learning robust feature representations from raw data with deep neural networks. Among the many existing network structures, Convolutional Neural Networks (CNN) have demonstrated great success on various tasks, including image classification [21, 38, 34], image-based object localization [7], speech recognition [5], etc. For video classification, Ji et al. [13] and Karpathy et al. [18] extended the CNN to work on the temporal dimension by stacking frames over time. Recently, Simonyan et al. [33] proposed a two-stream CNN approach, which uses two CNNs on static frames and optical flows respectively to capture the spatial and the motion information. It focuses only on short-term motion as the optical flows are computed in very short time windows. With this approach, similar or slightly better performance than the hand-crafted features like [44] has been reported.

These existing works, however, are not able to model the long-term temporal clues in the videos. As aforementioned, the two-stream CNN [33] uses stacked optical flows computed in short time windows as inputs, and the order of the optical flows is fully discarded in the learning process (cf. Section 3.1). This is not sufficient for video classification, as many complex contents can be better identified by considering the temporal order of short-term actions. Take the "birthday" event as an example: it usually involves several sequential actions, such as "making a wish", "blowing out candles" and "eating cakes".

To address the above limitation, this paper proposes a hybrid deep learning framework for video classification, which is able to harness not only the spatial and short-term motion features, but also the long-term temporal clues. In order to leverage the temporal information, we adopt a Recurrent Neural Networks (RNN) model called Long Short Term Memory (LSTM), which maps the input sequences to outputs using a sequence of hidden states and incorporates memory units that enable the network to learn when to forget previous hidden states and when to update hidden states with new information.
Figure 1: An overview of the proposed hybrid deep learning framework for video classification. Given an input video, two types of features are extracted using the CNN from spatial frames and short-term stacked motion optical flows respectively. The features are separately fed into two sets of LSTM networks for long-term temporal modeling (Left and Right). In addition, we also employ a regularized feature fusion network to perform video-level feature fusion and classification (Middle). The outputs of the sequence-based LSTM and the video-level feature fusion network are combined to generate the final prediction. See the text for more discussion.

In addition, many approaches fuse multiple features in a very "shallow" manner by either concatenating the features before classification or averaging the predictions of classifiers trained using different features separately. In this work we integrate the spatial and the short-term motion features in a deep neural network with carefully designed regularizations to explore feature correlations. This method can perform video classification within the same network, and further combining its outputs with the predictions of the LSTMs can lead to very competitive classification performance.

Figure 1 gives an overview of the proposed framework. Spatial and short-term motion features are first extracted by the two-stream CNN approach [33], and then input into the LSTM for long-term temporal modeling. Average pooling is adopted to generate video-level spatial and motion features, which are fused by the regularized feature fusion network. After that, outputs of the sequence-based LSTM and the video-level feature fusion network are combined as the final predictions. Notice that, in contrast to the current framework, one may alternatively train a fusion network to combine the frame-level spatial and motion features first and then use a single set of LSTM for temporal modeling. However, in our experiments we have observed worse results using this strategy. The main reason is that learning dimension-wise feature correlations in the fusion network requires strong and reliable supervision, but we only have video-level class labels which are not necessarily always related to the frame semantics. In other words, the imprecise frame-level labels populated from the video annotations are too noisy to learn a good fusion network.

The main contributions of this work are summarized as follows:

• We propose an end-to-end hybrid deep learning framework for video classification, which can model not only the short-term spatial-motion patterns but also the long-term temporal clues, with variable-length video sequences as inputs.

• We adopt the LSTM to model long-term temporal clues on top of both the spatial and the short-term motion features. We show that both features work well with the LSTM, and the LSTM based classifiers are very complementary to the traditional classifiers without considering the temporal frame orders.

• We fuse the spatial and the motion features in a regularized feature fusion network that can explore feature correlations and perform classification. The network is computationally efficient in both training and testing.

• Through an extensive set of experiments, we demonstrate that our proposed framework outperforms several alternative methods with clear margins.
On the well-known UCF-101 and CCV benchmarks, we attain to-date the best performance.

The rest of this paper is organized as follows. Section 2 reviews related works. Section 3 describes the proposed hybrid deep learning framework in detail. Experimental results and comparisons are discussed in Section 4, followed by conclusions in Section 5.

2. RELATED WORKS
Video classification has been a longstanding research topic in multimedia and computer vision. Successful classification systems rely heavily on the extracted video features, and hence most existing works focused on designing robust and discriminative features. Many video representations were motivated by the advances in the image domain, which can be extended to utilize the temporal dimension of the video data. For instance, Laptev [23] extended the 2D Harris corner detector [10] into 3D space to find space-time interest points. Klaser et al. proposed HOG3D by extending the idea of integral images for fast descriptor computation [19]. Wang et al. reported that dense sampling at regular positions in space and time outperforms the detected sparse interest points on video classification tasks [45]. Partly inspired by this finding, they further proposed the dense trajectory features, which densely sample local patches from each frame at different scales and then track them in a dense optical flow field over time [44]. This method has demonstrated very competitive results on major benchmark datasets. In addition, further improvements may be achieved by using advantageous feature encoding methods like the Fisher Vectors [29] or adopting feature normalization strategies, such as RootSift [1] and Power Norm [32]. Note that these spatial-temporal video descriptors only capture local motion patterns within a very short period, and popular descriptor quantization methods like the bag-of-words entirely destroy the temporal order information of the descriptors.

To explore the long-term temporal clues, graphical models have been popularly used, such as hidden Markov models (HMM), Bayesian Networks (BN), Conditional Random Fields (CRF), etc. For instance, Li et al. proposed to replace the hidden states in HMMs with visualizable salient poses estimated by Gaussian Mixture Models [24], and Tang et al. introduced latent variables over video frames to discover the most discriminative states of an event based on a variable duration HMM [39]. Zeng et al. exploited multiple types of domain knowledge to guide the learning of a Dynamic BN for action recognition [51]. Instead of using directed graphical models like the HMM and BN, undirected graphical models have also been adopted. Vail et al. employed the CRF for activity recognition in [41]. Wang et al. proposed a max-margin hidden CRF for action recognition in videos, where a human action is modeled as a global root template and a constellation of several "parts" [46].

Many related works have investigated the fusion of multiple features, which is often effective for improving classification performance. The most straightforward and popular ways are early fusion and late fusion. Generally, early fusion refers to fusion at the feature level, such as feature concatenation or linear combination of kernels of individual features. For example, in [53], Zhang et al. computed nonlinear kernels for each feature separately, and then fused the kernels for model training. The fusion weights can be manually set or automatically estimated by multiple kernel learning (MKL) [2]. For the late fusion methods, independent classifiers are first trained using each feature separately, and outputs of the classifiers are then combined. In [50], Ye et al. proposed a robust late fusion approach to fuse multiple classification outputs by seeking a shared low-rank latent matrix, assuming that noises may exist in the predictions of some classifiers, which can possibly be removed by using the low-rank matrix.
Both early and late fusion fail to explore the correlations shared by the features and hence are not ideal for video classification. In this paper we employ a regularized neural network tailored for feature fusion and classification, which can automatically learn dimension-wise feature correlations. Several studies are related. In [15, 12], the authors proposed to construct an audio-visual joint codebook for video classification, in order to discover and model the audio-visual feature correlations. There are also studies on using neural networks for feature fusion. In [37], the authors employed deep Boltzmann machines to learn a fused representation of images and texts. In [28], a deep denoised auto-encoder was used for cross-modality and shared representation learning. Very recently, Wu et al. [47] presented an approach using regularizations in neural networks to exploit feature and class relationships. The fusion approach in this work differs in the following. First, instead of using the traditional hand-crafted features as inputs, we adopt CNN features trained from both static frames and motion optical flows. Second and very importantly, the formulation in the regularized feature fusion network has a much lower complexity compared with that of [47].

Researchers have attempted to apply the RNN to model the long-term temporal information in videos. Venugopalan et al. [43] proposed to translate videos to textual sentences with the LSTM through transferring knowledge from image description tasks. Ranzato et al. [30] introduced a generative model with the RNN to predict motions in videos. In the context of video classification, Donahue et al. adopted the LSTM to model temporal information [6] and Srivastava et al. designed an encoder-decoder RNN architecture to learn feature representations in an unsupervised manner [36]. To model motion information, both works adopted optical flow "images" between nearby frames as the inputs of the LSTM. In contrast, our approach adopts stacked optical flows. Stacked flows over a short time period can better reflect local motion patterns, which are found to be able to produce better results. In addition, our framework incorporates video-level predictions with the feature fusion network for significantly improved performance, which was not considered in these existing works.

Besides the above discussions of related studies on feature fusion and temporal modeling with the RNN, several representative CNN-based approaches for video classification should also be covered here. The image-based CNN features have recently been directly adopted for video classification, extracted using off-the-shelf models trained on large-scale image datasets like the ImageNet [11, 31, 52]. For instance, Jain et al. [11] performed action recognition using the SVM classifier with such CNN features and achieved top results in the 2014 THUMOS action recognition challenge [16]. A few works have also tried to extend the CNN to exploit the motion information in videos. Ji et al. [13] and Karpathy et al. [18] extended the CNN by stacking visual frames in fixed-size time windows and using spatial-temporal convolutions for video classification. Differently, the two-stream CNN approach by Simonyan et al. [33] applies the CNN separately on visual frames (the spatial stream) and stacked optical flows (the motion stream). This approach has been found to be more effective, and is adopted as the basis of our proposed framework.
However, as discussed in Section 1, all these approaches [13, 18, 33] can only model short-term motion, not the long-term temporal clues.

3. METHODOLOGY
In this section, we describe the key components of the proposed hybrid deep learning framework shown in Figure 1, including the CNN-based spatial and short-term motion features, the long-term LSTM-based temporal modeling, and the video-level regularized feature fusion network.
3.1 CNN-Based Spatial and Short-Term Motion Features

Conventional CNN architectures take images as the inputs and consist of alternating convolutional and pooling layers, which are further topped by a few fully-connected (FC) layers. To extract the spatial and the short-term motion features, we adopt the recent two-stream CNN approach [33]. Instead of using stacked frames in short time windows like [13, 18], this approach decouples the videos into spatial and motion streams modeled by two CNNs separately. Figure 2 gives an overview. The spatial stream is built on sampled individual frames, which is exactly the same as the CNN-based image classification pipeline and is suitable for capturing the static information in videos like scene backgrounds and basic objects. The motion counterpart operates on top of stacked optical flows. Specifically, optical flows (displacement vector fields) are computed between each pair of adjacent frames, and the horizontal and vertical components of the displacement vectors form two optical flow images. Instead of using each individual flow image as the input of the CNN, it was reported that stacked optical flows over a time window are better due to the ability of modeling the short-term motion. In other words, the input of the motion stream CNN is a $2L$-channel stacked optical flow image, where $L$ is the number of frames in the window. The two CNNs produce classification scores separately using a softmax layer and the scores are linearly combined as the final prediction. Like many existing works on visual classification using the CNN features [31], we adopt the output of the first FC layer of the two CNNs as the spatial and the short-term motion features. Note that the authors of the original paper [33] used the name "temporal stream"; we call it "motion stream" as it only captures short-term motion, which is different from the long-term temporal modeling in our proposed framework.

3.2 Long-Term Temporal Modeling with LSTM

During the training process of the spatial and the motion stream CNNs, each sweep through the network takes one visual frame or one stacked optical flow image, and the temporal order of the frames is fully discarded. To model the long-term dynamic information in video sequences, we leverage the LSTM model, which has been successfully applied to speech recognition [8], image captioning [6], etc. LSTM is a type of RNN with controllable memory units and is effective in many long-range sequential modeling tasks without suffering from the "vanishing gradients" effect like traditional RNNs. Generally, LSTM recursively maps the input representations at the current time step to output labels via a sequence of hidden states, and thus the learning process of LSTM is carried out in a sequential manner (from left to right in the two sets of LSTM of Figure 1). Finally, we can obtain a prediction score at each time step with a softmax transformation using the hidden states from the last layer of the LSTM.
Figure 2: The framework of the two-stream CNN. Outputs of the first fully-connected layer in the two CNNs (outlined) are used as the spatial and the short-term motion features for further processing.
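To make the motion-stream input concrete, the following minimal NumPy sketch assembles the $2L$-channel stacked-flow input described above; the array layout and function name are our own illustration, not the authors' implementation.

```python
import numpy as np

def stack_optical_flows(flows, start, L=10):
    """Build a 2L-channel motion-stream input from L consecutive flow fields.

    flows: list of (H, W, 2) arrays, where flows[t][..., 0] and
    flows[t][..., 1] hold the horizontal and vertical displacement
    components between frames t and t+1.
    Returns an (H, W, 2L) array whose channels alternate (dx, dy).
    """
    window = flows[start:start + L]
    assert len(window) == L, "not enough flow fields left in the video"
    # Concatenate the horizontal/vertical components of L flows along depth.
    return np.concatenate(window, axis=2)
```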
Figure 3: The structure of an LSTM unit.

More formally, given a sequence of feature representations $(x_1, x_2, \ldots, x_T)$, an LSTM maps the inputs to an output sequence $(y_1, y_2, \ldots, y_T)$ by computing activations of the units in the network with the following equations recursively from $t = 1$ to $t = T$:

$$i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i),$$
$$f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f),$$
$$c_t = f_t c_{t-1} + i_t \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c),$$
$$o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o),$$
$$h_t = o_t \tanh(c_t),$$

where $x_t$, $h_t$ are the input and hidden vectors with the subscript $t$ denoting the $t$-th time step, $i_t$, $f_t$, $c_t$, $o_t$ are respectively the activation vectors of the input gate, forget gate, memory cell and output gate, $W_{\alpha\beta}$ is the weight matrix between $\alpha$ and $\beta$ (e.g., $W_{xi}$ is the weight matrix from the input $x_t$ to the input gate $i_t$), $b_\alpha$ is the bias term of $\alpha$, and $\sigma$ is the sigmoid function defined as $\sigma(x) = \frac{1}{1 + e^{-x}}$. Figure 3 visualizes the structure of an LSTM unit.

The core idea behind the LSTM model is a built-in memory cell that stores information over time to explore long-range dynamics, with non-linear gate units governing the information flow into and out of the cell. As we can see from the above equations, the current frame $x_t$ and the previous hidden states $h_{t-1}$ are used as inputs of four parts at the $t$-th time step. The memory cell aggregates information from two sources: the previous cell memory unit $c_{t-1}$ multiplied by the activation of the forget gate $f_t$, and the squashed inputs regulated by the input gate's activation $i_t$; the combination of the two enables the LSTM to learn to forget information from previous states or take new information into account. In addition, the output gate $o_t$ controls how much information from the memory cell is passed to the hidden states $h_t$ for the following time step. With the explicitly controllable memory units and different functional gates, LSTM can explore long-range temporal clues with variable-length inputs.

As a neural network, the LSTM model can be easily deepened by stacking the hidden states from a layer $l - 1$ to the layer $l$. Consider a model of $K$ layers: the feature vector $x_t$ at the $t$-th time step is fed into the first layer of the LSTM together with the hidden state $h_{t-1}$ in the same layer obtained from the last time step to produce an updated $h_t$, which is then used as the input of the following layer. Denoting $f_W$ as the mapping function from the inputs to the hidden states, the transition from layer $l - 1$ to layer $l$ can be written as:

$$h_t^l = \begin{cases} f_W(h_t^{l-1}, h_{t-1}^l), & l > 1 \\ f_W(x_t, h_{t-1}^l), & l = 1. \end{cases}$$

In order to obtain the prediction scores for a total of $C$ classes at a time step $t$, the outputs from the last layer of the LSTM are sent to a softmax layer estimating probabilities as:

$$\mathrm{prob}_c = \frac{\exp(w_c^T h_t^K + b_c)}{\sum_{c' \in C} \exp(w_{c'}^T h_t^K + b_{c'})},$$

where $\mathrm{prob}_c$, $w_c$ and $b_c$ are respectively the probability prediction, the corresponding weight vector and the bias term of the $c$-th class. Such an LSTM network can be trained with the Back-Propagation Through Time (BPTT) algorithm [9], which "unrolls" the model into a feed-forward neural network and back-propagates errors to determine the optimal weights.
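For concreteness, a single forward step of an LSTM unit can be written directly from the equations above. This is a minimal NumPy sketch; weight storage and the treatment of the peephole terms as diagonal (vector) weights are our assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step following the equations in Section 3.2.

    x_t: input vector; h_prev, c_prev: previous hidden and cell states.
    W: dict of weight matrices keyed as in the text (e.g., W['xi'] = W_xi);
    b: dict of bias vectors. Peephole weights (W_ci, W_cf, W_co) are
    assumed diagonal and stored as vectors, hence the elementwise products.
    """
    i_t = sigmoid(W['xi'] @ x_t + W['hi'] @ h_prev + W['ci'] * c_prev + b['i'])
    f_t = sigmoid(W['xf'] @ x_t + W['hf'] @ h_prev + W['cf'] * c_prev + b['f'])
    c_t = f_t * c_prev + i_t * np.tanh(W['xc'] @ x_t + W['hc'] @ h_prev + b['c'])
    o_t = sigmoid(W['xo'] @ x_t + W['ho'] @ h_prev + W['co'] * c_t + b['o'])
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t
```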
As shown in Figure 1, we adopt two LSTM models for temporal modeling. With the two-stream CNN pipeline for feature extraction, we have a spatial feature set $(x_1^s, x_2^s, \ldots, x_T^s)$ and a motion feature set $(x_1^m, x_2^m, \ldots, x_T^m)$. The learning process leads to a set of predictions $(y_1^s, y_2^s, \ldots, y_T^s)$ for the spatial part and another set $(y_1^m, y_2^m, \ldots, y_T^m)$ for the motion part. For both LSTM models, we adopt the last-step output $y_T$ as the video-level prediction scores, since the outputs at the last time step are based on the consideration of the entire sequence. We empirically observe that the last-step outputs are better than pooling predictions from all the time steps.

3.3 Regularized Feature Fusion Network

Given both the spatial and the motion features, it is easy to understand that correlations may exist between them since both are computed on the same video (e.g., person-related static visual features and body motions). A good feature fusion method is supposed to be able to take advantage of the correlations, while also maintaining the unique characteristics, to produce a better fused representation. In order to explore this important problem rather than using the simple late fusion as [33], we employ a regularized feature fusion neural network, as shown in the middle part of Figure 1. First, average pooling is adopted to aggregate the frame-level CNN features into video-level representations, which are used as the inputs of the fusion network. The input features are non-linearly mapped to another layer and then fused in a feature fusion layer, where we apply regularizations in the learning of the network weights.

Denote by $N$ the total number of training videos with both the spatial and the motion representations. The $n$-th sample can be written as a 3-tuple $(x_n^s, x_n^m, y_n)$, where $x_n^s = \frac{1}{T}\sum_{t=1}^{T} x_{n,t}^s \in \mathbb{R}^{d_s}$ and $x_n^m = \frac{1}{T}\sum_{t=1}^{T} x_{n,t}^m \in \mathbb{R}^{d_m}$ represent the averaged spatial and motion features respectively, and $y_n$ is the corresponding label of the $n$-th sample.

For the ease of discussion, let us consider a degenerated case first, where only one feature is available. Denote $g(\cdot)$ as the mapping of the neural network from inputs to outputs. The objective of network training is to minimize the following empirical loss:

$$\min_W \sum_{i=1}^{N} \|g(x_i) - y_i\|^2 + \lambda_1 \Phi(W), \qquad (1)$$

where the first term measures the discrepancy between the output $g(x_i)$ and the ground-truth label $y_i$, and the second term is usually a Frobenius norm based regularizer to prevent over-fitting.

We now move on to discuss the case of fusion and prediction with two features. Note that the approach can be easily extended to support more than two input features. Specifically, we use a fusion layer (see Figure 1) to absorb the spatial and motion features into a fused representation. To exploit the correlations of the features, we regularize the fusion process with a structural $\ell_{21}$ norm, which is defined as $\|W\|_{2,1} = \sum_i \sqrt{\sum_j w_{ij}^2}$. Incorporating the $\ell_{21}$ norm into the standard deep neural network formulation, we arrive at the following optimization problem:

$$\min_W L + \lambda_1 \Phi(W) + \lambda_2 \left\| W^E \right\|_{2,1}, \qquad (2)$$

where $L = \sum_{i=1}^{N} \|g(x_i^s, x_i^m) - y_i\|^2$, and $W^E = [W^{E_s}, W^{E_m}] \in \mathbb{R}^{P \times D}$ denotes the concatenated weight matrix of the $E$-th layer (i.e., the last layer of feature abstraction in Figure 1), with $D = d_s + d_m$ and $P$ being the dimension of the fusion layer. Different from the objective in Equation (1), here we have an additional $\ell_{21}$ norm that is used for exploring feature correlations in the $E$-th layer. The term $\|W^E\|_{2,1}$ computes the 2-norm of the weight values across different features in each dimension.
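For reference, the structural norm just defined is a one-liner in NumPy; a sketch, where W stands for the $P \times D$ matrix $W^E$:

```python
import numpy as np

def l21_norm(W):
    # ||W||_{2,1}: sum over rows of each row's 2-norm. Penalizing it
    # encourages row sparsity, so the concatenated features share a
    # common subset of hidden neurons.
    return np.sqrt((W ** 2).sum(axis=1)).sum()
```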
Therefore, the regularization attains its minimum when $W^E$ contains the smallest number of non-zero rows, which correspond to the discriminative information shared by distinct features. That is to say, the $\ell_{21}$ norm encourages the matrix $W^E$ to be row-sparse, which leads to similar zero/nonzero patterns of the columns of the matrix $W^E$. Hence it enforces different features to share a subset of hidden neurons, reflecting the feature correlations.

However, in addition to seeking the correlations shared among features, the unique discriminative information should also be preserved so that the complementary information can be used for improved classification performance. Thus, we add an additional regularizer to Equation (2) as follows:

$$\min_W L + \lambda_1 \Phi(W) + \lambda_2 \left\| W^E \right\|_{2,1} + \lambda_3 \left\| W^E \right\|_{1,1}. \qquad (3)$$

The term $\|W^E\|_{1,1}$ can be regarded as a complement of the $\|W^E\|_{2,1}$ norm. It provides robustness against the $\ell_{21}$ norm sharing incorrect features among different representations, and thus allows different representations to emphasize different hidden neurons.

Although the regularizer terms in Equation (3) are all convex functions, the optimization problem in Equation (3) is nonconvex due to the nonconvexity of the sigmoid function. Below, we discuss the optimization strategy using the gradient descent method in two cases:

1. For the $E$-th layer, our objective function has four valid terms: the empirical loss, the $\ell_2$ regularizer $\Phi(W)$, and two nonsmooth structural regularizers, i.e., the $\ell_{21}$ and $\ell_{11}$ terms. Note that simply using the gradient descent method is not optimal due to the two nonsmooth terms. We propose to optimize the $E$-th layer using a proximal gradient descent method, which splits the objective function into two parts:

$$p = L + \lambda_1 \Phi(W), \quad q = \lambda_2 \left\| W^E \right\|_{2,1} + \lambda_3 \left\| W^E \right\|_{1,1},$$

where $p$ is a smooth function and $q$ is a nonsmooth function. Thus, the update at the $i$-th iteration is formulated as:

$$(W^E)^{(i+1)} = \mathrm{Prox}_q\left((W^E)^{(i)} - \eta \nabla p\left((W^E)^{(i)}\right)\right),$$

where $\mathrm{Prox}$ is a proximal operator defined as:

$$\mathrm{Prox}_q(W) = \arg\min_V \frac{1}{2}\|W - V\|^2 + q(V).$$

The proximal operator on the combination of the $\ell_{21}/\ell_{11}$ norm ball can be computed analytically as:

$$W^E_{r\cdot} = \left(1 - \frac{\lambda_2}{\|U_{r\cdot}\|}\right)_+ U_{r\cdot}, \quad \forall r = 1, \cdots, P, \qquad (4)$$

where $U_{r\cdot} = [\,|V_{r\cdot}| - \lambda_3\,]_+ \cdot \mathrm{sign}(V_{r\cdot})$, and $W_{r\cdot}$, $U_{r\cdot}$, $V_{r\cdot}$ denote the $r$-th rows of the matrices $W$, $U$ and $V$, respectively. Readers may refer to [49] for a detailed proof of a similar analytical solution.

2. For all the other layers, the objective function in Equation (3) only contains the first two valid terms, which are both smooth. Thus, we can directly apply the gradient descent method as in [3]. Denoting $G^l$ as the gradient of $W^l$, the weight matrix of the $l$-th layer is updated as:

$$W^l = W^l - \eta G^l. \qquad (5)$$

The only additional computational cost for training our regularized feature fusion network is the proximal operator in the $E$-th layer. The complexity of the analytical solution in Equation (4) is $O(P \times D)$. Therefore, the proposed proximal gradient descent method can quickly train the network with affordable computational cost. When incorporating more features, our formulation can be computed efficiently with linearly increased cost, while cubic operations are required by the approach of [47] to reach a similar goal.
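The analytical solution in Equation (4) translates directly into code; a minimal NumPy sketch (variable names are ours):

```python
import numpy as np

def prox_l21_l11(V, lam2, lam3):
    """Proximal operator for lam2*||W||_{2,1} + lam3*||W||_{1,1}, per Eq. (4).

    First soft-threshold every entry by lam3 (the l_{1,1} part), then
    group-shrink each row by lam2 (the l_{2,1} part).
    """
    # Entrywise soft-thresholding: U = [|V| - lam3]_+ * sign(V)
    U = np.maximum(np.abs(V) - lam3, 0.0) * np.sign(V)
    # Row-wise group shrinkage; rows whose norm falls below lam2 vanish.
    W = np.zeros_like(U)
    row_norms = np.linalg.norm(U, axis=1)
    nz = row_norms > lam2
    W[nz] = (1.0 - lam2 / row_norms[nz, None]) * U[nz]
    return W
```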
In sum, the above optimization is incorporated into the conventional back-propagation procedure, as described in Algorithm 1.

Algorithm 1: The training procedure of the regularized feature fusion network.
Input: $x_n^s$ and $x_n^m$: the spatial and motion CNN features of the $n$-th video sample; $y_n$: the label of the $n$-th video sample; randomly initialized weight matrices $W$.
for epoch ← 1 to M do
    Get the prediction error with feedforward propagation;
    for l ← L to 1 do
        Evaluate the gradients and update the weight matrices using Equation (5);
        if l == E then
            Evaluate the proximal operator according to Equation (4);
        end
    end
end
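Schematically, Algorithm 1 is ordinary back-propagation with one extra proximal step on the fusion layer. The sketch below assumes a `net` object exposing forward/backward passes and per-layer weights (all names are hypothetical) and reuses the `prox_l21_l11` helper above:

```python
def train_fusion_network(net, data, M, eta, lam2, lam3, E):
    """Schematic training loop for Algorithm 1 (not the authors' code).

    net.backward is assumed to return per-layer gradients of the smooth
    part p = L + lambda1 * Phi(W); E indexes the fusion layer.
    """
    for epoch in range(M):
        for x_s, x_m, y in data:
            grads = net.backward(net.forward(x_s, x_m), y)  # feedforward + BP
            for l, G in grads.items():
                net.W[l] -= eta * G                  # gradient step, Eq. (5)
                if l == E:
                    # Proximal step of Eq. (4); thresholds scale with eta.
                    net.W[l] = prox_l21_l11(net.W[l], eta * lam2, eta * lam3)
    return net
```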
The approach described above has the capability of modeling static spatial, short-term motion and long-term temporal clues, which are all very important in video content analysis. One may have noticed that the proposed hybrid deep learning framework contains several components that are separately trained. Joint training is feasible but not adopted in the current framework for the following reason. The joint training process is more complex, and existing works exploring this aspect indicate that the performance gain is not very significant. In [6], the authors jointly trained the LSTM with a CNN for feature extraction, which only improved the performance on a benchmark dataset from 70.5% to 71.1%. Besides, an advantage of separate training is that the framework is more flexible, where a component can be replaced easily without the need of re-training the entire framework. For instance, more discriminative CNN models like the GoogLeNet [38] and deeper RNN models [4] can be used to replace the CNN and LSTM parts respectively.

In addition, as mentioned in Section 1, there could be alternative frameworks or models with similar capabilities. The main contribution of this work is to show that such a hybrid framework is very suitable for video classification. In addition to showing the effectiveness of the LSTM and the regularized feature fusion network, we also show that the combination of both in the hybrid framework can lead to significant improvements, particularly for long videos that contain rich temporal clues.
4. EXPERIMENTS

4.1 Experimental Setup
We adopt two popular datasets to evaluate the proposedhybrid deep learning framework.
UCF-101 [35]. The UCF-101 dataset is one of the most popular action recognition benchmarks. It consists of 13,320 video clips of 101 human actions (27 hours in total). The 101 classes are divided into five groups: Body-Motion, Human-Human Interactions, Human-Object Interactions, Playing Musical Instruments and Sports. Following [16], we conduct evaluations using 3 train/test splits, which is currently the most popular setting in using this dataset. Results are measured by classification accuracy on each split and we report the mean accuracy over the three splits.
Columbia Consumer Videos (CCV) [17]. The CCV dataset contains 9,317 YouTube videos annotated according to 20 classes, which are mainly events like "basketball", "graduation ceremony", "birthday party" and "parade". We follow the convention defined in [17] to use a training set of 4,659 videos and a test set of 4,658 videos. The performance is evaluated by average precision (AP) for each class, and we report the mean AP (mAP) as the overall measure.
For the feature extraction network structures, we adopt the VGG 19 [34] and the CNN M [33] to extract the spatial and the motion CNN features, respectively. The two networks achieve 7.5% [34] and 13.5% [33] top-5 error rates on the ImageNet ILSVRC-2012 validation set respectively. The spatial CNN is first pre-trained on the ILSVRC-2012 training set with 1.2 million images and then fine-tuned using the video data, which is observed to be better than training from scratch. The motion CNN is trained from scratch as there is no off-the-shelf training set in the required form. In addition, simple data augmentation methods like cropping and flipping are utilized following [33].

The CNN models are trained using mini-batch stochastic gradient descent with the momentum fixed to 0.9. In the fine-tuning case of the spatial CNN, the rate starts from $10^{-3}$ and decreases to $10^{-4}$ after 14K iterations, then to $10^{-5}$ after 20K iterations. This setting is similar to [33], but we start from a smaller rate of $10^{-3}$ instead of $10^{-2}$. For the motion CNN, we set the learning rate to $10^{-2}$ initially, and reduce it to $10^{-3}$ after 100K iterations, then to $10^{-4}$ after 200K iterations. Our implementation is based on the publicly available Caffe toolbox [14] with modifications to support parallel training with multiple GPUs in a server.

For temporal modeling, we adopt two layers in the LSTM for both the spatial and the motion features. Each LSTM has 1,024 hidden units in the bottom layer and 512 hidden units in the other layer. The network weights are learnt using a parallel implementation of the BPTT algorithm with a mini-batch size of 10. In addition, the learning rate and momentum are set to $10^{-4}$ and 0.9 respectively. The training is stopped after 150K iterations for both datasets.

For the regularized feature fusion network, we use four layers of neurons as illustrated in the middle of Figure 1. Specifically, we first use one layer with 200 neurons for the spatial and the motion features to perform feature abstraction separately, and then one layer with 200 neurons for feature fusion with the proposed regularized structural norms. Finally, the fused features are used to build a logistic regression model in the last layer for video classification. We set the learning rate to 0.7 and fix $\lambda_1$ to $3 \times 10^{-4}$ in order to prevent over-fitting. In addition, we tune $\lambda_2$ and $\lambda_3$ in the same range as $\lambda_1$ using cross-validation.
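The step-decay schedules above follow a common divide-by-ten pattern; a small helper sketch, where the boundary values are the ones reconstructed in the text and should be treated as configuration assumptions:

```python
def step_decay_lr(iteration, base_lr, boundaries, factor=0.1):
    """Step-decay learning rate: multiply by `factor` at each boundary.

    Example (motion CNN schedule from the text):
    step_decay_lr(it, 1e-2, [100_000, 200_000]) gives 1e-2, then 1e-3
    after 100K iterations, then 1e-4 after 200K.
    """
    lr = base_lr
    for b in boundaries:
        if iteration >= b:
            lr *= factor
    return lr
```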
To validate the effectiveness of our approach, we compare with the following baseline or alternative methods: (1) Two-stream CNN. Our implementation produces similar overall results with the original work [33]. We also report the results of the individual spatial-stream and motion-stream CNN models, namely Spatial CNN and Motion CNN, respectively; (2) Spatial LSTM, which refers to the LSTM trained with the spatial CNN features; (3) Motion LSTM, the LSTM trained with the motion CNN features; (4) SVM-based Early Fusion (SVM-EF). A χ²-kernel is computed for each video-level CNN feature and then the two kernels are averaged for classification; (5) SVM-based Late Fusion (SVM-LF). Separate SVM classifiers are trained for each video-level CNN feature and the prediction outputs are averaged; (6) Multiple Kernel Learning (SVM-MKL), which combines the two features with the $\ell_p$-norm MKL [20] by fixing $p = 2$; (7) Early Fusion with Neural Networks (NN-EF), which concatenates the two features into a long vector and then uses a neural network for classification; (8) Late Fusion with Neural Networks (NN-LF), which deploys a separate neural network for each feature and then uses the average output scores as the final prediction; (9) Multimodal Deep Boltzmann Machines (M-DBM) [28, 37], where feature fusion is performed using a neural network in a free manner without regularizations; (10) RDNN [47], which also imposes regularizations in a neural network for feature fusion, using a formulation that has a much higher complexity than our approach.

The first three methods are a part of the proposed framework, which are evaluated as baselines to better understand the contribution of each individual component. The last seven methods focus on fusing the spatial and the motion features (outputs of the first fully-connected layer of the CNN models) for improved classification performance. We compare our regularized fusion network with all the seven methods.
4.2 Results of the LSTM

We first evaluate the LSTM to investigate the significance of leveraging the long-term temporal clues for video classification. The results and comparisons are summarized in Table 1. The upper two groups in the table compare the LSTM models with the two-stream CNN, which performs classification by pooling video-level representations without considering the temporal order of the frames. On UCF-101, the Spatial LSTM is better than the spatial-stream CNN, while the result of the Motion LSTM is slightly lower than that of the motion-stream CNN. It is interesting to see that, on the spatial stream, the LSTM is even better than the state-of-the-art CNN, indicating that the temporal information is very important for human action modeling, which is fully discarded in the spatial-stream CNN. Since the mechanism of the LSTM is totally different, these results are fairly appealing because it is potentially very complementary to the video-level classification based on feature pooling.

On the CCV dataset, the LSTM models produce lower performance than the CNN models on both streams. The reasons are two-fold. First, since the average duration of the CCV videos (80 seconds) is around 10 times longer than that of the UCF-101 and the contents in CCV are more complex and noisy, the LSTM might be affected by the noisy video segments that are irrelevant to the major class of a video. Second, some classes like "wedding reception" and "beach" do not contain clear temporal order information (see Figure 4), for which the LSTM can hardly capture helpful clues.

We now assess the performance of combining the LSTM and the CNN models to study whether they are complementary, by fusing the outputs of the two types of models trained on the two streams. Note that the fusion method adopted here is the simple late average fusion, which uses the average prediction scores of different models. More advanced fusion methods will be evaluated in the next subsection.

Table 1: Performance of the CNN and the LSTM models and their combinations on UCF-101 and CCV.

                                   UCF-101   CCV
Spatial CNN                        80.1%     75.0%
Spatial LSTM                       83.3%     43.3%
Motion CNN                         77.5%     58.9%
Motion LSTM                        76.6%     54.7%
CNN + LSTM (Spatial)               84.0%     77.9%
CNN + LSTM (Motion)                81.4%     70.9%
CNN + LSTM (Spatial & Motion)      90.1%     81.7%
Table 2: Comparison of different fusion methods on UCF-101 and CCV.

                                   UCF-101   CCV
Spatial SVM                        78.6%     74.4%
Motion SVM                         78.2%     57.9%
SVM-EF                             86.6%     75.3%
SVM-LF                             85.3%     74.9%
SVM-MKL                            86.8%     75.4%
NN-EF                              86.5%     75.6%
NN-LF                              85.1%     75.2%
M-DBM                              86.9%     75.3%
Two-stream CNN                     86.2%     75.8%
RDNN                               88.1%     75.9%
Non-regularized Fusion Network     87.0%     75.4%
Regularized Fusion Network         88.4%     76.2%
Results are reported in the bottom three rows of Table 1. We observe significant improvements from model fusion on both datasets. On UCF-101, the fusion leads to an absolute performance gain of around 1% compared with the best single model for the spatial stream, and a gain of 4% for the motion stream. On CCV, the improvements are more significant, especially for the motion stream where an absolute gain of 12% is observed. These results confirm the fact that the long-term temporal clues are highly complementary to the spatial and the short-term motion features. In addition, the fusion of all the CNN and the LSTM models trained on the two streams attains the highest performance on both datasets: 90.1% and 81.7% on UCF-101 and CCV respectively, showing that the spatial and the short-term motion features are also very complementary. Therefore, it is important to incorporate all of them into a successful video classification system.
4.3 Results of the Regularized Fusion Network

Next, we compare our regularized feature fusion network with the alternative fusion methods, using both the spatial and the motion CNN features. The results are presented in Table 2, which are divided into four groups. The first group reports the performance of individual features extracted from the first fully-connected layer of the CNN models, classified by SVM classifiers. This is reported to study the gain from the SVM based fusion methods, as shown in the second group of results. The individual feature results using the CNN, i.e., the Spatial CNN and the Motion CNN, are already reported in Table 1. The third group of results in Table 2 are based on the alternative neural network fusion methods. Note that NN-EF and NN-LF take the features from the CNN models and perform fusion and classification using separate neural networks, while the two-stream CNN approach performs classification using the CNN directly with a late score fusion (Figure 2). Finally, the last group contains the results of the proposed fusion network.

As can be seen, the SVM based fusion methods can greatly improve the results on UCF-101. On CCV, the gain is consistent but not very significant, indicating that the short-term motion is more important for modeling the human actions, which have clearer motion patterns and less noise. SVM-MKL is only slightly better than the simple early and late fusion methods, which is consistent with observations in recent works on visual recognition [42].

Our proposed regularized feature fusion network (the last row in Table 2) is consistently better than the alternative neural network based fusion methods shown in the third group of the table. In particular, the gap between our results and those of the M-DBM and the two-stream CNN confirms that using regularizations in the fusion process is helpful. Compared with the RDNN, our formulation produces slightly better results but with a much lower complexity, as discussed earlier.

In addition, as the proposed formulation contains two structural norms, to directly evaluate the contribution of the norms, we also report a baseline using the same network structure without regularization ("Non-regularized Fusion Network" in the table), which is similar to the M-DBM approach in its design but differs slightly in network structure. We see that adding regularizations to the same network improves the accuracy by 1.4% on UCF-101 and 0.8% on CCV.
4.4 Results of the Hybrid Framework

Finally, we discuss the results of the entire hybrid deep learning framework, obtained by further combining results from the temporal LSTM and the regularized fusion network. The prediction scores from these networks are fused linearly with weights estimated by cross-validation. As shown in the last row of Table 3, we achieve very strong performance on both datasets: 91.3% on UCF-101 and 83.5% on CCV. The performance of the hybrid framework is clearly better than that of the Spatial LSTM and the Motion LSTM (in Table 1). Compared with the Regularized Fusion Network (in Table 2), adding the long-term temporal modeling in the hybrid framework improves the results by 2.9% on UCF-101 and 7.3% on CCV, which is fairly significant considering the difficulties of the two datasets. In contrast to the fusion result in the last row of Table 1, the gain of the proposed hybrid framework comes from the use of the regularized fusion, which again verifies the effectiveness of our fusion method.

To better understand the contributions of the key components in the hybrid framework, we further report the per-class performance on CCV in Figure 4. We see that, although the performance of the LSTM is clearly lower, fusing it with the video-level predictions by the regularized fusion network can significantly improve the results for almost all the classes. This is a bit surprising because some classes do not seem to require temporal information to be recognized.
Figure 4: Per-class performance on CCV, using the Spatial and Motion LSTM, the Regularized Fusion Network, and their combination, i.e., the Hybrid Deep Learning Framework.

Figure 5: Two example videos of class "cat" in the CCV dataset with similar temporal clues over time.

After checking some of the videos, we find that there could be helpful clues which can be modeled, even for object-related classes like "cat" and "dog". For instance, as shown in Figure 5, we observe that quite a number of "cat" videos contain only a single cat running around on the floor. The LSTM network may be able to capture this clue, which is helpful for classification.
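The final combination described at the beginning of this subsection, i.e., a linear fusion of prediction scores with cross-validated weights, can be sketched as follows; the weight grid and function names are our own illustration:

```python
import numpy as np

def fuse_scores(score_list, weights):
    """Weighted average of prediction-score matrices (videos x classes)."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return sum(wi * s for wi, s in zip(w, score_list))

def search_weight(scores_a, scores_b, labels, metric):
    """Grid-search the fusion weight for two models on a validation split.

    metric(scores, labels) is an assumed callable, e.g., accuracy or mAP.
    """
    grid = np.linspace(0.0, 1.0, 11)
    return max(grid, key=lambda w: metric(fuse_scores([scores_a, scores_b],
                                                      [w, 1.0 - w]), labels))
```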
Efficiency.
In addition to achieving superior classification performance, our framework also enjoys high computational efficiency. We summarize the average testing time of a UCF-101 video (duration: 8 seconds) as follows. The extraction of the frames and the optical flows takes 3.9 seconds, and computing the CNN-based spatial and short-term motion features requires 9 seconds. Prediction with the LSTM and the regularized fusion network needs 2.8 seconds. All these are evaluated on a single NVIDIA Tesla K40 GPU.
4.5 Comparison with State-of-the-Art Results

In this subsection, we compare with several state-of-the-art results. As shown in Table 3, our hybrid deep learning framework produces the highest performance on both datasets. On the UCF-101, many works with competitive results are based on the dense trajectories [44, 52], while our approach fully relies on the deep learning techniques. Compared with the original result of the two-stream CNN in [33], our framework is better with the additional fusion and temporal modeling functions, although it is built on our implementation of the CNN models, which is slightly worse than that of [33] (our two-stream CNN result is 86.2%). Note that a gain of even just 1% on the widely adopted UCF-101 dataset is generally considered as a significant progress.

Table 3: Comparison with state-of-the-art results. Our approach produces to-date the best reported results on both datasets.

UCF-101                         |  CCV
Donahue et al. [6]      82.9%   |  Xu et al. [48]      60.3%
Srivastava et al. [36]  84.3%   |  Ye et al. [50]      64.0%
Wang et al. [44]        85.9%   |  Jhuo et al. [12]    64.0%
Tran et al. [40]        86.7%   |  Ma et al. [27]      63.4%
Simonyan et al. [33]    88.0%   |  Liu et al. [25]     68.2%
Lan et al. [22]         89.1%   |  Wu et al. [47]      70.6%
Zha et al. [52]         89.6%   |  –
Ours                    91.3%   |  Ours                83.5%
In addition, the recent works in [6, 36] also adopted the LSTM to explore the temporal clues for video classification and reported promising performance. However, our LSTM results are not directly comparable as the input features are extracted by different neural networks. On the CCV dataset, all the recent approaches relied on the joint use of multiple features by developing new fusion methods [48, 50, 12, 27, 25, 47]. Our hybrid deep learning framework is significantly better than all of them.
5. CONCLUSIONS
We have proposed a novel hybrid deep learning framework for video classification, which is able to model static visual features, short-term motion patterns and long-term temporal clues. In the framework, we first extract spatial and motion features with two CNNs trained on static frames and stacked optical flows respectively. The two types of features are used separately as inputs of the LSTM networks for long-term temporal modeling. A regularized fusion network is also deployed to combine the two features at the video level for improved classification. Our hybrid deep learning framework integrating both the LSTM and the regularized fusion network produces very impressive performance on two widely adopted benchmark datasets. The results not only verify the effectiveness of the individual components of the framework, but also demonstrate that the frame-level temporal modeling and the video-level fusion and classification are highly complementary, and a big leap of performance can be attained by combining them.

Although deep learning based approaches have been successful in addressing many problems, effective network architectures are urgently needed for modeling sequential data like the videos. Several researchers have recently explored this direction. However, compared with the progress on image classification, the achieved performance gain on video classification over the traditional hand-crafted features is much less significant. Our work in this paper represents one of the few studies showing very strong results. For future work, further improving the capability of modeling the temporal dimension of videos is of high priority. In addition, audio features, which are known to be useful for video classification, can be easily incorporated into our framework.
6. REFERENCES

[1] R. Arandjelovic and A. Zisserman. Three things everyone should know to improve object retrieval. In CVPR, 2012.
[2] F. R. Bach, G. R. Lanckriet, and M. I. Jordan. Multiple kernel learning, conic duality, and the SMO algorithm. In ICML, 2004.
[3] Y. Bengio. Practical recommendations for gradient-based training of deep architectures. In Neural Networks: Tricks of the Trade. Springer, 2012.
[4] J. Chung, Ç. Gülçehre, K. Cho, and Y. Bengio. Gated feedback recurrent neural networks. CoRR, 2015.
[5] G. E. Dahl, D. Yu, L. Deng, and A. Acero. Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE TASLP, 2012.
[6] J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. CoRR, 2014.
[7] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
[8] A. Graves, A. Mohamed, and G. E. Hinton. Speech recognition with deep recurrent neural networks. In ICASSP, 2013.
[9] A. Graves and J. Schmidhuber. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 2005.
[10] C. Harris and M. J. Stephens. A combined corner and edge detector. In Alvey Vision Conference, 1988.
[11] M. Jain, J. van Gemert, and C. G. M. Snoek. University of Amsterdam at THUMOS challenge 2014. In ECCV THUMOS Challenge Workshop, 2014.
[12] I.-H. Jhuo, G. Ye, S. Gao, D. Liu, Y.-G. Jiang, D. T. Lee, and S.-F. Chang. Discovering joint audio-visual codewords for video event detection. Machine Vision and Applications, 2014.
[13] S. Ji, W. Xu, M. Yang, and K. Yu. 3D convolutional neural networks for human action recognition. In ICML, 2010.
[14] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In ACM Multimedia, 2014.
[15] W. Jiang, C. Cotton, S.-F. Chang, D. Ellis, and A. Loui. Short-term audio-visual atoms for generic video concept classification. In ACM Multimedia, 2009.
[16] Y.-G. Jiang, J. Liu, A. Roshan Zamir, G. Toderici, I. Laptev, M. Shah, and R. Sukthankar. THUMOS challenge: Action recognition with a large number of classes. http://crcv.ucf.edu/THUMOS14/, 2014.
[17] Y.-G. Jiang, G. Ye, S.-F. Chang, D. Ellis, and A. C. Loui. Consumer video understanding: A benchmark database and an evaluation of human and machine performance. In ACM ICMR, 2011.
[18] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, 2014.
[19] A. Klaser, M. Marszałek, and C. Schmid. A spatio-temporal descriptor based on 3D-gradients. In BMVC, 2008.
[20] M. Kloft, U. Brefeld, S. Sonnenburg, and A. Zien. Lp-norm multiple kernel learning. The Journal of Machine Learning Research, 2011.
[21] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[22] Z. Lan, M. Lin, X. Li, A. G. Hauptmann, and B. Raj. Beyond Gaussian pyramid: Multi-skip feature stacking for action recognition. CoRR, 2014.
[23] I. Laptev. On space-time interest points. IJCV, 64(2/3):107–123, 2005.
[24] W. Li, Z. Zhang, and Z. Liu. Expandable data-driven graphical modeling of human actions based on salient postures. IEEE TCSVT, 2008.
[25] D. Liu, K.-T. Lai, G. Ye, M.-S. Chen, and S.-F. Chang. Sample-specific late fusion for visual category recognition. In CVPR, 2013.
[26] D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 2004.
[27] A. J. Ma and P. C. Yuen. Reduced analytic dependency modeling: Robust fusion for visual recognition. IJCV, 2014.
[28] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Ng. Multimodal deep learning. In ICML, 2011.
[29] D. Oneata, J. Verbeek, and C. Schmid. Action and event recognition with Fisher vectors on a compact feature set. In ICCV, 2013.
[30] M. Ranzato, A. Szlam, J. Bruna, M. Mathieu, R. Collobert, and S. Chopra. Video (language) modeling: a baseline for generative models of natural videos. CoRR, 2014.
[31] A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson. CNN features off-the-shelf: An astounding baseline for recognition. CoRR, 2014.
[32] J. Sánchez, F. Perronnin, T. Mensink, and J. Verbeek. Image classification with the Fisher vector: Theory and practice. IJCV, 2013.
[33] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, 2014.
[34] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, 2014.
[35] K. Soomro, A. R. Zamir, and M. Shah. UCF101: A dataset of 101 human action classes from videos in the wild. CoRR, 2012.
[36] N. Srivastava, E. Mansimov, and R. Salakhutdinov. Unsupervised learning of video representations using LSTMs. CoRR, 2015.
[37] N. Srivastava and R. Salakhutdinov. Multimodal learning with deep Boltzmann machines. In NIPS, 2012.
[38] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. CoRR, 2014.
[39] K. Tang, L. Fei-Fei, and D. Koller. Learning latent temporal structure for complex event detection. In CVPR, 2012.
[40] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. C3D: Generic features for video analysis. CoRR, 2014.
[41] D. L. Vail, M. M. Veloso, and J. D. Lafferty. Conditional random fields for activity recognition. In AAMAS, 2007.
[42] A. Vedaldi, V. Gulshan, M. Varma, and A. Zisserman. Multiple kernels for object detection. In ICCV, 2009.
[43] S. Venugopalan, H. Xu, J. Donahue, M. Rohrbach, R. J. Mooney, and K. Saenko. Translating videos to natural language using deep recurrent neural networks. CoRR, 2014.
[44] H. Wang and C. Schmid. Action recognition with improved trajectories. In ICCV, 2013.
[45] H. Wang, M. M. Ullah, A. Klaser, I. Laptev, and C. Schmid. Evaluation of local spatio-temporal features for action recognition. In BMVC, 2009.
[46] Y. Wang and G. Mori. Max-margin hidden conditional random fields for human action recognition. In CVPR, 2009.
[47] Z. Wu, Y.-G. Jiang, J. Wang, J. Pu, and X. Xue. Exploring inter-feature and inter-class relationships with deep neural networks for video classification. In ACM Multimedia, 2014.
[48] Z. Xu, Y. Yang, I. Tsang, N. Sebe, and A. Hauptmann. Feature weighting via optimal thresholding for video analysis. In ICCV, 2013.
[49] H. Yang, M. R. Lyu, and I. King. Efficient online learning for multitask feature selection. ACM SIGKDD, 2013.
[50] G. Ye, D. Liu, I.-H. Jhuo, and S.-F. Chang. Robust late fusion with rank minimization. In CVPR, 2012.
[51] Z. Zeng and Q. Ji. Knowledge based activity recognition with dynamic Bayesian network. In ECCV, 2010.
[52] S. Zha, F. Luisier, W. Andrews, N. Srivastava, and R. Salakhutdinov. Exploiting image-trained CNN architectures for unconstrained video classification. CoRR, 2015.
[53] J. Zhang, M. Marszałek, S. Lazebnik, and C. Schmid. Local features and kernels for classification of texture and object categories: A comprehensive study. IJCV, 2007.