Modeling Spatial-Temporal Clues in a Hybrid Deep Learning Framework for Video Classification
Zuxuan Wu, Xi Wang, Yu-Gang Jiang†, Hao Ye, Xiangyang Xue
School of Computer Science, Shanghai Key Lab of Intelligent Information Processing, Fudan University, Shanghai, China
{zxwu, xwang10, ygj, haoye10, xyxue}@fudan.edu.cn
† Corresponding author.
ABSTRACT
Classifying videos according to content semantics is an important problem with a wide range of applications. In this paper, we propose a hybrid deep learning framework for video classification, which is able to model static spatial information, short-term motion, as well as long-term temporal clues in the videos. Specifically, the spatial and the short-term motion features are extracted separately by two Convolutional Neural Networks (CNN). These two types of CNN-based features are then combined in a regularized feature fusion network for classification, which is able to learn and utilize feature relationships for improved performance. In addition, Long Short Term Memory (LSTM) networks are applied on top of the two features to further model longer-term temporal clues. The main contribution of this work is the hybrid learning framework that can model several important aspects of the video data. We also show that (1) combining the spatial and the short-term motion features in the regularized fusion network is better than direct classification and fusion using the CNN with a softmax layer, and (2) the sequence-based LSTM is highly complementary to the traditional classification strategy without considering the temporal frame orders. Extensive experiments are conducted on two popular and challenging benchmarks, the UCF-101 Human Actions and the Columbia Consumer Videos (CCV). On both benchmarks, our framework achieves to-date the best reported performance: 91.3% on the UCF-101 and 83.5% on the CCV.
Categories and Subject Descriptors
H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing—Indexing methods; I.5.2 [Pattern Recognition]: Design Methodology—Classifier design and evaluation
Keywords
Video Classification, Deep Learning, CNN, LSTM, Fusion.
1. INTRODUCTION
Video classification based on contents like human actions or complex events is a challenging task that has been extensively studied in the research community. Significant progress has been achieved in recent years by designing various features, which are expected to be robust to intra-class variations and discriminative to separate different classes. For example, one can utilize traditional image-based features like the SIFT [26] to capture the static spatial information in videos. In addition to the static frame based visual features, motion is a very important clue for video classification, as most classes containing object movements like the human actions require the motion information to be reliably recognized. For this, a very popular feature is the dense trajectories [44], which tracks densely sampled local frame patches over time and computes several traditional features based on the trajectories.

In contrast to the hand-crafted features, there is a growing trend of learning robust feature representations from raw data with deep neural networks. Among the many existing network structures, Convolutional Neural Networks (CNN) have demonstrated great success on various tasks, including image classification [21, 38, 34], image-based object localization [7], speech recognition [5], etc. For video classification, Ji et al. [13] and Karpathy et al. [18] extended the CNN to work on the temporal dimension by stacking frames over time. Recently, Simonyan et al. [33] proposed a two-stream CNN approach, which uses two CNNs on static frames and optical flows respectively to capture the spatial and the motion information. It focuses only on short-term motion as the optical flows are computed in very short time windows. With this approach, similar or slightly better performance than the hand-crafted features like [44] has been reported.

These existing works, however, are not able to model the long-term temporal clues in the videos. As aforementioned, the two-stream CNN [33] uses stacked optical flows computed in short time windows as inputs, and the order of the optical flows is fully discarded in the learning process (cf. Section 3.1). This is not sufficient for video classification, as many complex contents can be better identified by considering the temporal order of short-term actions. Take the "birthday" event as an example: it usually involves several sequential actions, such as "making a wish", "blowing out candles" and "eating cakes".

To address the above limitation, this paper proposes a hybrid deep learning framework for video classification, which is able to harness not only the spatial and short-term motion features, but also the long-term temporal clues. In order to leverage the temporal information, we adopt a Recurrent Neural Networks (RNN) model called Long Short Term Memory (LSTM), which maps the input sequences to outputs using a sequence of hidden states and incorporates memory units that enable the network to learn when to forget previous hidden states and when to update hidden states with new information.
Figure 1: An overview of the proposed hybrid deep learning framework for video classification. Given an input video, two types of features are extracted using the CNN from spatial frames and short-term stacked motion optical flows respectively. The features are separately fed into two sets of LSTM networks for long-term temporal modeling (Left and Right). In addition, we also employ a regularized feature fusion network to perform video-level feature fusion and classification (Middle). The outputs of the sequence-based LSTM and the video-level feature fusion network are combined to generate the final prediction. See the text for more discussion.

In addition, many approaches fuse multiple features in a very "shallow" manner by either concatenating the features before classification or averaging the predictions of classifiers trained using different features separately. In this work we integrate the spatial and the short-term motion features in a deep neural network with carefully designed regularizations to explore feature correlations. This method can perform video classification within the same network, and further combining its outputs with the predictions of the LSTMs can lead to very competitive classification performance.

Figure 1 gives an overview of the proposed framework. Spatial and short-term motion features are first extracted by the two-stream CNN approach [33], and then input into the LSTM for long-term temporal modeling. Average pooling is adopted to generate video-level spatial and motion features, which are fused by the regularized feature fusion network. After that, outputs of the sequence-based LSTM and the video-level feature fusion network are combined as the final predictions. Notice that, in contrast to the current framework, one may alternatively train a fusion network to combine the frame-level spatial and motion features first and then use a single set of LSTM for temporal modeling. However, in our experiments we have observed worse results using this strategy. The main reason is that learning dimension-wise feature correlations in the fusion network requires strong and reliable supervision, but we only have video-level class labels which are not necessarily always related to the frame semantics. In other words, the imprecise frame-level labels populated from the video annotations are too noisy to learn a good fusion network.

The main contributions of this work are summarized as follows:

• We propose an end-to-end hybrid deep learning framework for video classification, which can model not only the short-term spatial-motion patterns but also the long-term temporal clues, with variable-length video sequences as inputs.

• We adopt the LSTM to model long-term temporal clues on top of both the spatial and the short-term motion features. We show that both features work well with the LSTM, and the LSTM based classifiers are very complementary to the traditional classifiers without considering the temporal frame orders.

• We fuse the spatial and the motion features in a regularized feature fusion network that can explore feature correlations and perform classification. The network is computationally efficient in both training and testing.

• Through an extensive set of experiments, we demonstrate that our proposed framework outperforms several alternative methods with clear margins.
On the well-known UCF-101 and CCV benchmarks, we attain to-date the best performance.

The rest of this paper is organized as follows. Section 2 reviews related works. Section 3 describes the proposed hybrid deep learning framework in detail. Experimental results and comparisons are discussed in Section 4, followed by conclusions in Section 5.

2. RELATED WORKS
Video classification has been a longstanding research topic in multimedia and computer vision. Successful classification systems rely heavily on the extracted video features, and hence most existing works focused on designing robust and discriminative features. Many video representations were motivated by the advances in the image domain, which can be extended to utilize the temporal dimension of the video data. For instance, Laptev [23] extended the 2D Harris corner detector [10] into 3D space to find space-time interest points. Klaser et al. proposed HOG3D by extending the idea of integral images for fast descriptor computation [19]. Wang et al. reported that dense sampling at regular positions in space and time outperforms the detected sparse interest points on video classification tasks [45]. Partly inspired by this finding, they further proposed the dense trajectory features, which densely sample local patches from each frame at different scales and then track them in a dense optical flow field over time [44]. This method has demonstrated very competitive results on major benchmark datasets. In addition, further improvements may be achieved by using advantageous feature encoding methods like the Fisher Vectors [29] or adopting feature normalization strategies, such as RootSift [1] and Power Norm [32]. Note that these spatial-temporal video descriptors only capture local motion patterns within a very short period, and popular descriptor quantization methods like the bag-of-words entirely destroy the temporal order information of the descriptors.

To explore the long-term temporal clues, graphical models have been popularly used, such as hidden Markov models (HMM), Bayesian Networks (BN), Conditional Random Fields (CRF), etc. For instance, Li et al. proposed to replace the hidden states in HMMs with visualizable salient poses estimated by Gaussian Mixture Models [24], and Tang et al. introduced latent variables over video frames to discover the most discriminative states of an event based on a variable duration HMM [39]. Zeng et al. exploited multiple types of domain knowledge to guide the learning of a Dynamic BN for action recognition [51]. Instead of using directed graphical models like the HMM and BN, undirected graphical models have also been adopted. Vail et al. employed the CRF for activity recognition in [41]. Wang et al. proposed a max-margin hidden CRF for action recognition in videos, where a human action is modeled as a global root template and a constellation of several "parts" [46].

Many related works have investigated the fusion of multiple features, which is often effective for improving classification performance. The most straightforward and popular ways are early fusion and late fusion. Generally, early fusion refers to fusion at the feature level, such as feature concatenation or linear combination of kernels of individual features. For example, in [53], Zhang et al. computed nonlinear kernels for each feature separately, and then fused the kernels for model training. The fusion weights can be manually set or automatically estimated by multiple kernel learning (MKL) [2]. For the late fusion methods, independent classifiers are first trained using each feature separately, and outputs of the classifiers are then combined. In [50], Ye et al. proposed a robust late fusion approach to fuse multiple classification outputs by seeking a shared low-rank latent matrix, assuming that noises may exist in the predictions of some classifiers, which can possibly be removed by using the low-rank matrix.
Both early and late fusion fail to explore the correlations shared by the features and hence are not ideal for video classification. In this paper we employ a regularized neural network tailored for feature fusion and classification, which can automatically learn dimension-wise feature correlations. Several studies are related. In [15, 12], the authors proposed to construct an audio-visual joint codebook for video classification, in order to discover and model the audio-visual feature correlations. There are also studies on using neural networks for feature fusion. In [37], the authors employed deep Boltzmann machines to learn a fused representation of images and texts. In [28], a deep denoised auto-encoder was used for cross-modality and shared representation learning. Very recently, Wu et al. [47] presented an approach using regularizations in neural networks to exploit feature and class relationships. The fusion approach in this work differs in the following. First, instead of using the traditional hand-crafted features as inputs, we adopt CNN features trained from both static frames and motion optical flows. Second and very importantly, the formulation in the regularized feature fusion network has a much lower complexity compared with that of [47].

Researchers have attempted to apply the RNN to model the long-term temporal information in videos. Venugopalan et al. [43] proposed to translate videos to textual sentences with the LSTM through transferring knowledge from image description tasks. Ranzato et al. [30] introduced a generative model with the RNN to predict motions in videos. In the context of video classification, Donahue et al. adopted the LSTM to model temporal information [6] and Srivastava et al. designed an encoder-decoder RNN architecture to learn feature representations in an unsupervised manner [36]. To model motion information, both works adopted optical flow "images" between nearby frames as the inputs of the LSTM. In contrast, our approach adopts stacked optical flows. Stacked flows over a short time period can better reflect local motion patterns, which are found to be able to produce better results. In addition, our framework incorporates video-level predictions with the feature fusion network for significantly improved performance, which was not considered in these existing works.

Besides the above discussions of related studies on feature fusion and temporal modeling with the RNN, several representative CNN-based approaches for video classification should also be covered here. The image-based CNN features have recently been directly adopted for video classification, extracted using off-the-shelf models trained on large-scale image datasets like the ImageNet [11, 31, 52]. For instance, Jain et al. [11] performed action recognition using the SVM classifier with such CNN features and achieved top results in the 2014 THUMOS action recognition challenge [16]. A few works have also tried to extend the CNN to exploit the motion information in videos. Ji et al. [13] and Karpathy et al. [18] extended the CNN by stacking visual frames in fixed-size time windows and using spatial-temporal convolutions for video classification. Differently, the two-stream CNN approach by Simonyan et al. [33] applies the CNN separately on visual frames (the spatial stream) and stacked optical flows (the motion stream). This approach has been found to be more effective, and is adopted as the basis of our proposed framework.
However, as discussed in Section 1, all these approaches [13, 18, 33] can only model short-term motion, not the long-term temporal clues.

3. METHODOLOGY
In this section, we describe the key components of the proposed hybrid deep learning framework shown in Figure 1, including the CNN-based spatial and short-term motion features, the long-term LSTM-based temporal modeling, and the video-level regularized feature fusion network.
3.1 CNN-Based Spatial and Short-Term Motion Features

Conventional CNN architectures take images as the inputs and consist of alternating convolutional and pooling layers, which are further topped by a few fully-connected (FC) layers. To extract the spatial and the short-term motion features, we adopt the recent two-stream CNN approach [33]. Instead of using stacked frames in short time windows like [13, 18], this approach decouples the videos into spatial and motion streams modeled by two CNNs separately. Figure 2 gives an overview. The spatial stream is built on sampled individual frames, which is exactly the same as the CNN-based image classification pipeline and is suitable for capturing the static information in videos like scene backgrounds and basic objects. The motion counterpart operates on top of stacked optical flows. Specifically, optical flows (displacement vector fields) are computed between each pair of adjacent frames, and the horizontal and vertical components of the displacement vectors form two optical flow images. Instead of using each individual flow image as the input of the CNN, it was reported that stacked optical flows over a time window are better due to the ability of modeling the short-term motion. In other words, the input of the motion stream CNN is a $2L$-channel stacked optical flow image, where $L$ is the number of frames in the window. The two CNNs produce classification scores separately using a softmax layer and the scores are linearly combined as the final prediction. Like many existing works on visual classification using the CNN features [31], we adopt the output of the first FC layer of the two CNNs as the spatial and the short-term motion features. Note that the authors of the original paper [33] used the name "temporal stream"; we call it "motion stream" as it only captures short-term motion, which is different from the long-term temporal modeling in our proposed framework.

3.2 Long-Term Temporal Modeling with LSTM

During the training process of the spatial and the motion stream CNNs, each sweep through the network takes one visual frame or one stacked optical flow image, and the temporal order of the frames is fully discarded. To model the long-term dynamic information in video sequences, we leverage the LSTM model, which has been successfully applied to speech recognition [8], image captioning [6], etc. LSTM is a type of RNN with controllable memory units and is effective in many long-range sequential modeling tasks without suffering from the "vanishing gradients" effect like traditional RNNs. Generally, LSTM recursively maps the input representations at the current time step to output labels via a sequence of hidden states, and thus the learning process of LSTM is carried out in a sequential manner (from left to right in the two sets of LSTM of Figure 1). Finally, we can obtain a prediction score at each time step with a softmax transformation using the hidden states from the last layer of the LSTM.
Figure 2: The framework of the two-stream CNN. Outputs of the first fully-connected layer in the two CNNs (outlined) are used as the spatial and the short-term motion features for further processing.
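To make the motion-stream input concrete, the following minimal NumPy sketch assembles the $2L$-channel stacked-flow input described above; the array layout and function name are our own illustration, not the authors' implementation.

```python
import numpy as np

def stack_optical_flows(flows, start, L=10):
    """Build a 2L-channel motion-stream input from L consecutive flow fields.

    flows: list of (H, W, 2) arrays, where flows[t][..., 0] and
    flows[t][..., 1] hold the horizontal and vertical displacement
    components between frames t and t+1.
    Returns an (H, W, 2L) array whose channels alternate (dx, dy).
    """
    window = flows[start:start + L]
    assert len(window) == L, "not enough flow fields left in the video"
    # Concatenate the horizontal/vertical components of L flows along depth.
    return np.concatenate(window, axis=2)
```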
Figure 3: The structure of an LSTM unit.

More formally, given a sequence of feature representations $(x_1, x_2, \ldots, x_T)$, an LSTM maps the inputs to an output sequence $(y_1, y_2, \ldots, y_T)$ by computing activations of the units in the network with the following equations recursively from $t = 1$ to $t = T$:

$$i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i),$$
$$f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f),$$
$$c_t = f_t c_{t-1} + i_t \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c),$$
$$o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o),$$
$$h_t = o_t \tanh(c_t),$$

where $x_t$, $h_t$ are the input and hidden vectors with the subscript $t$ denoting the $t$-th time step, $i_t$, $f_t$, $c_t$, $o_t$ are respectively the activation vectors of the input gate, forget gate, memory cell and output gate, $W_{\alpha\beta}$ is the weight matrix between $\alpha$ and $\beta$ (e.g., $W_{xi}$ is the weight matrix from the input $x_t$ to the input gate $i_t$), $b_\alpha$ is the bias term of $\alpha$, and $\sigma$ is the sigmoid function defined as $\sigma(x) = \frac{1}{1 + e^{-x}}$. Figure 3 visualizes the structure of an LSTM unit.

The core idea behind the LSTM model is a built-in memory cell that stores information over time to explore long-range dynamics, with non-linear gate units governing the information flow into and out of the cell. As we can see from the above equations, the current frame $x_t$ and the previous hidden states $h_{t-1}$ are used as inputs of four parts at the $t$-th time step. The memory cell aggregates information from two sources: the previous cell memory unit $c_{t-1}$ multiplied by the activation of the forget gate $f_t$, and the squashed inputs regulated by the input gate's activation $i_t$; the combination of the two enables the LSTM to learn to forget information from previous states or take new information into account. In addition, the output gate $o_t$ controls how much information from the memory cell is passed to the hidden states $h_t$ for the following time step. With the explicitly controllable memory units and different functional gates, LSTM can explore long-range temporal clues with variable-length inputs.

As a neural network, the LSTM model can be easily deepened by stacking the hidden states from a layer $l - 1$ to the layer $l$. Consider a model of $K$ layers: the feature vector $x_t$ at the $t$-th time step is fed into the first layer of the LSTM together with the hidden state $h_{t-1}$ in the same layer obtained from the last time step to produce an updated $h_t$, which is then used as the input of the following layer. Denoting $f_W$ as the mapping function from the inputs to the hidden states, the transition from layer $l - 1$ to layer $l$ can be written as:

$$h_t^l = \begin{cases} f_W(h_t^{l-1}, h_{t-1}^l), & l > 1 \\ f_W(x_t, h_{t-1}^l), & l = 1. \end{cases}$$

In order to obtain the prediction scores for a total of $C$ classes at a time step $t$, the outputs from the last layer of the LSTM are sent to a softmax layer estimating probabilities as:

$$\mathrm{prob}_c = \frac{\exp(w_c^T h_t^K + b_c)}{\sum_{c' \in C} \exp(w_{c'}^T h_t^K + b_{c'})},$$

where $\mathrm{prob}_c$, $w_c$ and $b_c$ are respectively the probability prediction, the corresponding weight vector and the bias term of the $c$-th class. Such an LSTM network can be trained with the Back-Propagation Through Time (BPTT) algorithm [9], which "unrolls" the model into a feed-forward neural network and back-propagates errors to determine the optimal weights.
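For concreteness, a single forward step of an LSTM unit can be written directly from the equations above. This is a minimal NumPy sketch; weight storage and the treatment of the peephole terms as diagonal (vector) weights are our assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step following the equations in Section 3.2.

    x_t: input vector; h_prev, c_prev: previous hidden and cell states.
    W: dict of weight matrices keyed as in the text (e.g., W['xi'] = W_xi);
    b: dict of bias vectors. Peephole weights (W_ci, W_cf, W_co) are
    assumed diagonal and stored as vectors, hence the elementwise products.
    """
    i_t = sigmoid(W['xi'] @ x_t + W['hi'] @ h_prev + W['ci'] * c_prev + b['i'])
    f_t = sigmoid(W['xf'] @ x_t + W['hf'] @ h_prev + W['cf'] * c_prev + b['f'])
    c_t = f_t * c_prev + i_t * np.tanh(W['xc'] @ x_t + W['hc'] @ h_prev + b['c'])
    o_t = sigmoid(W['xo'] @ x_t + W['ho'] @ h_prev + W['co'] * c_t + b['o'])
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t
```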
As shown in Figure 1, we adopt two LSTM models for temporal modeling. With the two-stream CNN pipeline for feature extraction, we have a spatial feature set $(x_1^s, x_2^s, \ldots, x_T^s)$ and a motion feature set $(x_1^m, x_2^m, \ldots, x_T^m)$. The learning process leads to a set of predictions $(y_1^s, y_2^s, \ldots, y_T^s)$ for the spatial part and another set $(y_1^m, y_2^m, \ldots, y_T^m)$ for the motion part. For both LSTM models, we adopt the last-step output $y_T$ as the video-level prediction scores, since the outputs at the last time step are based on the consideration of the entire sequence. We empirically observe that the last-step outputs are better than pooling predictions from all the time steps.

3.3 Regularized Feature Fusion Network

Given both the spatial and the motion features, it is easy to understand that correlations may exist between them since both are computed on the same video (e.g., person-related static visual features and body motions). A good feature fusion method is supposed to be able to take advantage of the correlations, while also maintaining the unique characteristics, to produce a better fused representation. In order to explore this important problem rather than using the simple late fusion as [33], we employ a regularized feature fusion neural network, as shown in the middle part of Figure 1. First, average pooling is adopted to aggregate the frame-level CNN features into video-level representations, which are used as the inputs of the fusion network. The input features are non-linearly mapped to another layer and then fused in a feature fusion layer, where we apply regularizations in the learning of the network weights.

Denote by $N$ the total number of training videos with both the spatial and the motion representations. The $n$-th sample can be written as a 3-tuple $(x_n^s, x_n^m, y_n)$, where $x_n^s = \frac{1}{T}\sum_{t=1}^{T} x_{n,t}^s \in \mathbb{R}^{d_s}$ and $x_n^m = \frac{1}{T}\sum_{t=1}^{T} x_{n,t}^m \in \mathbb{R}^{d_m}$ represent the averaged spatial and motion features respectively, and $y_n$ is the corresponding label of the $n$-th sample.

For the ease of discussion, let us consider a degenerated case first, where only one feature is available. Denote $g(\cdot)$ as the mapping of the neural network from inputs to outputs. The objective of network training is to minimize the following empirical loss:

$$\min_W \sum_{i=1}^{N} \|g(x_i) - y_i\|^2 + \lambda_1 \Phi(W), \qquad (1)$$

where the first term measures the discrepancy between the output $g(x_i)$ and the ground-truth label $y_i$, and the second term is usually a Frobenius norm based regularizer to prevent over-fitting.

We now move on to discuss the case of fusion and prediction with two features. Note that the approach can be easily extended to support more than two input features. Specifically, we use a fusion layer (see Figure 1) to absorb the spatial and motion features into a fused representation. To exploit the correlations of the features, we regularize the fusion process with a structural $\ell_{21}$ norm, which is defined as $\|W\|_{2,1} = \sum_i \sqrt{\sum_j w_{ij}^2}$. Incorporating the $\ell_{21}$ norm into the standard deep neural network formulation, we arrive at the following optimization problem:

$$\min_W L + \lambda_1 \Phi(W) + \lambda_2 \left\| W^E \right\|_{2,1}, \qquad (2)$$

where $L = \sum_{i=1}^{N} \|g(x_i^s, x_i^m) - y_i\|^2$, and $W^E = [W^{E_s}, W^{E_m}] \in \mathbb{R}^{P \times D}$ denotes the concatenated weight matrix of the $E$-th layer (i.e., the last layer of feature abstraction in Figure 1), with $D = d_s + d_m$ and $P$ being the dimension of the fusion layer. Different from the objective in Equation (1), here we have an additional $\ell_{21}$ norm that is used for exploring feature correlations in the $E$-th layer. The term $\|W^E\|_{2,1}$ computes the 2-norm of the weight values across different features in each dimension.
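For reference, the structural norm just defined is a one-liner in NumPy; a sketch, where W stands for the $P \times D$ matrix $W^E$:

```python
import numpy as np

def l21_norm(W):
    # ||W||_{2,1}: sum over rows of each row's 2-norm. Penalizing it
    # encourages row sparsity, so the concatenated features share a
    # common subset of hidden neurons.
    return np.sqrt((W ** 2).sum(axis=1)).sum()
```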
Therefore, the regularization attains its minimum when $W^E$ contains the smallest number of non-zero rows, which correspond to the discriminative information shared by distinct features. That is to say, the $\ell_{21}$ norm encourages the matrix $W^E$ to be row-sparse, which leads to similar zero/nonzero patterns of the columns of the matrix $W^E$. Hence it enforces different features to share a subset of hidden neurons, reflecting the feature correlations.

However, in addition to seeking the correlations shared among features, the unique discriminative information should also be preserved so that the complementary information can be used for improved classification performance. Thus, we add an additional regularizer to Equation (2) as follows:

$$\min_W L + \lambda_1 \Phi(W) + \lambda_2 \left\| W^E \right\|_{2,1} + \lambda_3 \left\| W^E \right\|_{1,1}. \qquad (3)$$

The term $\|W^E\|_{1,1}$ can be regarded as a complement of the $\|W^E\|_{2,1}$ norm. It provides robustness against the $\ell_{21}$ norm sharing incorrect features among different representations, and thus allows different representations to emphasize different hidden neurons.

Although the regularizer terms in Equation (3) are all convex functions, the optimization problem in Equation (3) is nonconvex due to the nonconvexity of the sigmoid function. Below, we discuss the optimization strategy using the gradient descent method in two cases:

1. For the $E$-th layer, our objective function has four valid terms: the empirical loss, the $\ell_2$ regularizer $\Phi(W)$, and two nonsmooth structural regularizers, i.e., the $\ell_{21}$ and $\ell_{11}$ terms. Note that simply using the gradient descent method is not optimal due to the two nonsmooth terms. We propose to optimize the $E$-th layer using a proximal gradient descent method, which splits the objective function into two parts:

$$p = L + \lambda_1 \Phi(W), \quad q = \lambda_2 \left\| W^E \right\|_{2,1} + \lambda_3 \left\| W^E \right\|_{1,1},$$

where $p$ is a smooth function and $q$ is a nonsmooth function. Thus, the update at the $i$-th iteration is formulated as:

$$(W^E)^{(i+1)} = \mathrm{Prox}_q\left((W^E)^{(i)} - \eta \nabla p\left((W^E)^{(i)}\right)\right),$$

where $\mathrm{Prox}$ is a proximal operator defined as:

$$\mathrm{Prox}_q(W) = \arg\min_V \frac{1}{2}\|W - V\|^2 + q(V).$$

The proximal operator on the combination of the $\ell_{21}/\ell_{11}$ norm ball can be computed analytically as:

$$W^E_{r\cdot} = \left(1 - \frac{\lambda_2}{\|U_{r\cdot}\|}\right)_+ U_{r\cdot}, \quad \forall r = 1, \cdots, P, \qquad (4)$$

where $U_{r\cdot} = [\,|V_{r\cdot}| - \lambda_3\,]_+ \cdot \mathrm{sign}(V_{r\cdot})$, and $W_{r\cdot}$, $U_{r\cdot}$, $V_{r\cdot}$ denote the $r$-th rows of the matrices $W$, $U$ and $V$, respectively. Readers may refer to [49] for a detailed proof of a similar analytical solution.

2. For all the other layers, the objective function in Equation (3) only contains the first two valid terms, which are both smooth. Thus, we can directly apply the gradient descent method as in [3]. Denoting $G^l$ as the gradient of $W^l$, the weight matrix of the $l$-th layer is updated as:

$$W^l = W^l - \eta G^l. \qquad (5)$$

The only additional computational cost for training our regularized feature fusion network is the proximal operator in the $E$-th layer. The complexity of the analytical solution in Equation (4) is $O(P \times D)$. Therefore, the proposed proximal gradient descent method can quickly train the network with affordable computational cost. When incorporating more features, our formulation can be computed efficiently with linearly increased cost, while cubic operations are required by the approach of [47] to reach a similar goal.
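The analytical solution in Equation (4) translates directly into code; a minimal NumPy sketch (variable names are ours):

```python
import numpy as np

def prox_l21_l11(V, lam2, lam3):
    """Proximal operator for lam2*||W||_{2,1} + lam3*||W||_{1,1}, per Eq. (4).

    First soft-threshold every entry by lam3 (the l_{1,1} part), then
    group-shrink each row by lam2 (the l_{2,1} part).
    """
    # Entrywise soft-thresholding: U = [|V| - lam3]_+ * sign(V)
    U = np.maximum(np.abs(V) - lam3, 0.0) * np.sign(V)
    # Row-wise group shrinkage; rows whose norm falls below lam2 vanish.
    W = np.zeros_like(U)
    row_norms = np.linalg.norm(U, axis=1)
    nz = row_norms > lam2
    W[nz] = (1.0 - lam2 / row_norms[nz, None]) * U[nz]
    return W
```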
In sum, the above optimization is incorporated into the conventional back-propagation procedure, as described in Algorithm 1.

Algorithm 1: The training procedure of the regularized feature fusion network.
Input: $x_n^s$ and $x_n^m$: the spatial and motion CNN features of the $n$-th video sample; $y_n$: the label of the $n$-th video sample; randomly initialized weight matrices $W$.
for epoch ← 1 to M do
    Get the prediction error with feedforward propagation;
    for l ← L to 1 do
        Evaluate the gradients and update the weight matrices using Equation (5);
        if l == E then
            Evaluate the proximal operator according to Equation (4);
        end
    end
end
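Schematically, Algorithm 1 is ordinary back-propagation with one extra proximal step on the fusion layer. The sketch below assumes a `net` object exposing forward/backward passes and per-layer weights (all names are hypothetical) and reuses the `prox_l21_l11` helper above:

```python
def train_fusion_network(net, data, M, eta, lam2, lam3, E):
    """Schematic training loop for Algorithm 1 (not the authors' code).

    net.backward is assumed to return per-layer gradients of the smooth
    part p = L + lambda1 * Phi(W); E indexes the fusion layer.
    """
    for epoch in range(M):
        for x_s, x_m, y in data:
            grads = net.backward(net.forward(x_s, x_m), y)  # feedforward + BP
            for l, G in grads.items():
                net.W[l] -= eta * G                  # gradient step, Eq. (5)
                if l == E:
                    # Proximal step of Eq. (4); thresholds scale with eta.
                    net.W[l] = prox_l21_l11(net.W[l], eta * lam2, eta * lam3)
    return net
```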
The approach described above has the capability of modeling static spatial, short-term motion and long-term temporal clues, which are all very important in video content analysis. One may have noticed that the proposed hybrid deep learning framework contains several components that are separately trained. Joint training is feasible but not adopted in the current framework for the following reason. The joint training process is more complex, and existing works exploring this aspect indicate that the performance gain is not very significant. In [6], the authors jointly trained the LSTM with a CNN for feature extraction, which only improved the performance on a benchmark dataset from 70.5% to 71.1%. Besides, an advantage of separate training is that the framework is more flexible, where a component can be replaced easily without the need of re-training the entire framework. For instance, more discriminative CNN models like the GoogLeNet [38] and deeper RNN models [4] can be used to replace the CNN and LSTM parts respectively.

In addition, as mentioned in Section 1, there could be alternative frameworks or models with similar capabilities. The main contribution of this work is to show that such a hybrid framework is very suitable for video classification. In addition to showing the effectiveness of the LSTM and the regularized feature fusion network, we also show that the combination of both in the hybrid framework can lead to significant improvements, particularly for long videos that contain rich temporal clues.
4. EXPERIMENTS

4.1 Experimental Setup
We adopt two popular datasets to evaluate the proposedhybrid deep learning framework.
UCF-101 [35]. The UCF-101 dataset is one of the most popular action recognition benchmarks. It consists of 13,320 video clips of 101 human actions (27 hours in total). The 101 classes are divided into five groups: Body-Motion, Human-Human Interactions, Human-Object Interactions, Playing Musical Instruments and Sports. Following [16], we conduct evaluations using 3 train/test splits, which is currently the most popular setting in using this dataset. Results are measured by classification accuracy on each split and we report the mean accuracy over the three splits.
Columbia Consumer Videos (CCV) [17]. The CCV dataset contains 9,317 YouTube videos annotated according to 20 classes, which are mainly events like "basketball", "graduation ceremony", "birthday party" and "parade". We follow the convention defined in [17] to use a training set of 4,659 videos and a test set of 4,658 videos. The performance is evaluated by average precision (AP) for each class, and we report the mean AP (mAP) as the overall measure.
For the feature extraction network structures, we adopt the VGG 19 [34] and the CNN M [33] to extract the spatial and the motion CNN features, respectively. The two networks achieve 7.5% [34] and 13.5% [33] top-5 error rates on the ImageNet ILSVRC-2012 validation set respectively. The spatial CNN is first pre-trained on the ILSVRC-2012 training set with 1.2 million images and then fine-tuned using the video data, which is observed to be better than training from scratch. The motion CNN is trained from scratch as there is no off-the-shelf training set in the required form. In addition, simple data augmentation methods like cropping and flipping are utilized following [33].

The CNN models are trained using mini-batch stochastic gradient descent with the momentum fixed to 0.9. In the fine-tuning case of the spatial CNN, the rate starts from $10^{-3}$ and decreases to $10^{-4}$ after 14K iterations, then to $10^{-5}$ after 20K iterations. This setting is similar to [33], but we start from a smaller rate of $10^{-3}$ instead of $10^{-2}$. For the motion CNN, we set the learning rate to $10^{-2}$ initially, and reduce it to $10^{-3}$ after 100K iterations, then to $10^{-4}$ after 200K iterations. Our implementation is based on the publicly available Caffe toolbox [14] with modifications to support parallel training with multiple GPUs in a server.

For temporal modeling, we adopt two layers in the LSTM for both the spatial and the motion features. Each LSTM has 1,024 hidden units in the bottom layer and 512 hidden units in the other layer. The network weights are learnt using a parallel implementation of the BPTT algorithm with a mini-batch size of 10. In addition, the learning rate and momentum are set to $10^{-4}$ and 0.9 respectively. The training is stopped after 150K iterations for both datasets.

For the regularized feature fusion network, we use four layers of neurons as illustrated in the middle of Figure 1. Specifically, we first use one layer with 200 neurons for the spatial and the motion features to perform feature abstraction separately, and then one layer with 200 neurons for feature fusion with the proposed regularized structural norms. Finally, the fused features are used to build a logistic regression model in the last layer for video classification. We set the learning rate to 0.7 and fix $\lambda_1$ to $3 \times 10^{-4}$ in order to prevent over-fitting. In addition, we tune $\lambda_2$ and $\lambda_3$ in the same range as $\lambda_1$ using cross-validation.
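The step-decay schedules above follow a common divide-by-ten pattern; a small helper sketch, where the boundary values are the ones reconstructed in the text and should be treated as configuration assumptions:

```python
def step_decay_lr(iteration, base_lr, boundaries, factor=0.1):
    """Step-decay learning rate: multiply by `factor` at each boundary.

    Example (motion CNN schedule from the text):
    step_decay_lr(it, 1e-2, [100_000, 200_000]) gives 1e-2, then 1e-3
    after 100K iterations, then 1e-4 after 200K.
    """
    lr = base_lr
    for b in boundaries:
        if iteration >= b:
            lr *= factor
    return lr
```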
To validate the effectiveness of our approach, we compare with the following baseline or alternative methods: (1) Two-stream CNN. Our implementation produces similar overall results with the original work [33]. We also report the results of the individual spatial-stream and motion-stream CNN models, namely Spatial CNN and Motion CNN, respectively; (2) Spatial LSTM, which refers to the LSTM trained with the spatial CNN features; (3) Motion LSTM, the LSTM trained with the motion CNN features; (4) SVM-based Early Fusion (SVM-EF). A χ²-kernel is computed for each video-level CNN feature and then the two kernels are averaged for classification; (5) SVM-based Late Fusion (SVM-LF). Separate SVM classifiers are trained for each video-level CNN feature and the prediction outputs are averaged; (6) Multiple Kernel Learning (SVM-MKL), which combines the two features with the $\ell_p$-norm MKL [20] by fixing $p = 2$; (7) Early Fusion with Neural Networks (NN-EF), which concatenates the two features into a long vector and then uses a neural network for classification; (8) Late Fusion with Neural Networks (NN-LF), which deploys a separate neural network for each feature and then uses the average output scores as the final prediction; (9) Multimodal Deep Boltzmann Machines (M-DBM) [28, 37], where feature fusion is performed using a neural network in a free manner without regularizations; (10) RDNN [47], which also imposes regularizations in a neural network for feature fusion, using a formulation that has a much higher complexity than our approach.

The first three methods are a part of the proposed framework, which are evaluated as baselines to better understand the contribution of each individual component. The last seven methods focus on fusing the spatial and the motion features (outputs of the first fully-connected layer of the CNN models) for improved classification performance. We compare our regularized fusion network with all the seven methods.
4.2 Results of the LSTM

We first evaluate the LSTM to investigate the significance of leveraging the long-term temporal clues for video classification. The results and comparisons are summarized in Table 1. The upper two groups in the table compare the LSTM models with the two-stream CNN, which performs classification by pooling video-level representations without considering the temporal order of the frames. On UCF-101, the Spatial LSTM is better than the spatial-stream CNN, while the result of the Motion LSTM is slightly lower than that of the motion-stream CNN. It is interesting to see that, on the spatial stream, the LSTM is even better than the state-of-the-art CNN, indicating that the temporal information is very important for human action modeling, which is fully discarded in the spatial-stream CNN. Since the mechanism of the LSTM is totally different, these results are fairly appealing because it is potentially very complementary to the video-level classification based on feature pooling.

On the CCV dataset, the LSTM models produce lower performance than the CNN models on both streams. The reasons are two-fold. First, since the average duration of the CCV videos (80 seconds) is around 10 times longer than that of the UCF-101 and the contents in CCV are more complex and noisy, the LSTM might be affected by the noisy video segments that are irrelevant to the major class of a video. Second, some classes like "wedding reception" and "beach" do not contain clear temporal order information (see Figure 4), for which the LSTM can hardly capture helpful clues.

We now assess the performance of combining the LSTM and the CNN models to study whether they are complementary, by fusing the outputs of the two types of models trained on the two streams. Note that the fusion method adopted here is the simple late average fusion, which uses the average prediction scores of different models. More advanced fusion methods will be evaluated in the next subsection.

Table 1: Performance of the CNN and the LSTM models and their combinations on UCF-101 and CCV.

                                   UCF-101   CCV
Spatial CNN                        80.1%     75.0%
Spatial LSTM                       83.3%     43.3%
Motion CNN                         77.5%     58.9%
Motion LSTM                        76.6%     54.7%
CNN + LSTM (Spatial)               84.0%     77.9%
CNN + LSTM (Motion)                81.4%     70.9%
CNN + LSTM (Spatial & Motion)      90.1%     81.7%
Table 2: Comparison of different fusion methods on UCF-101 and CCV.

                                   UCF-101   CCV
Spatial SVM                        78.6%     74.4%
Motion SVM                         78.2%     57.9%
SVM-EF                             86.6%     75.3%
SVM-LF                             85.3%     74.9%
SVM-MKL                            86.8%     75.4%
NN-EF                              86.5%     75.6%
NN-LF                              85.1%     75.2%
M-DBM                              86.9%     75.3%
Two-stream CNN                     86.2%     75.8%
RDNN                               88.1%     75.9%
Non-regularized Fusion Network     87.0%     75.4%
Regularized Fusion Network         88.4%     76.2%
Results are reported in the bottom three rows of Table 1. We observe significant improvements from model fusion on both datasets. On UCF-101, the fusion leads to an absolute performance gain of around 1% compared with the best single model for the spatial stream, and a gain of 4% for the motion stream. On CCV, the improvements are more significant, especially for the motion stream where an absolute gain of 12% is observed. These results confirm the fact that the long-term temporal clues are highly complementary to the spatial and the short-term motion features. In addition, the fusion of all the CNN and the LSTM models trained on the two streams attains the highest performance on both datasets: 90.1% and 81.7% on UCF-101 and CCV respectively, showing that the spatial and the short-term motion features are also very complementary. Therefore, it is important to incorporate all of them into a successful video classification system.
4.3 Results of the Regularized Fusion Network

Next, we compare our regularized feature fusion network with the alternative fusion methods, using both the spatial and the motion CNN features. The results are presented in Table 2, which are divided into four groups. The first group reports the performance of individual features extracted from the first fully-connected layer of the CNN models, classified by SVM classifiers. This is reported to study the gain from the SVM based fusion methods, as shown in the second group of results. The individual feature results using the CNN, i.e., the Spatial CNN and the Motion CNN, are already reported in Table 1. The third group of results in Table 2 are based on the alternative neural network fusion methods. Note that NN-EF and NN-LF take the features from the CNN models and perform fusion and classification using separate neural networks, while the two-stream CNN approach performs classification using the CNN directly with a late score fusion (Figure 2). Finally, the last group contains the results of the proposed fusion network.

As can be seen, the SVM based fusion methods can greatly improve the results on UCF-101. On CCV, the gain is consistent but not very significant, indicating that the short-term motion is more important for modeling the human actions, which have clearer motion patterns and less noise. SVM-MKL is only slightly better than the simple early and late fusion methods, which is consistent with observations in recent works on visual recognition [42].

Our proposed regularized feature fusion network (the last row in Table 2) is consistently better than the alternative neural network based fusion methods shown in the third group of the table. In particular, the gap between our results and those of the M-DBM and the two-stream CNN confirms that using regularizations in the fusion process is helpful. Compared with the RDNN, our formulation produces slightly better results but with a much lower complexity, as discussed earlier.

In addition, as the proposed formulation contains two structural norms, to directly evaluate the contribution of the norms, we also report a baseline using the same network structure without regularization ("Non-regularized Fusion Network" in the table), which is similar to the M-DBM approach in its design but differs slightly in network structure. We see that adding regularizations to the same network improves the accuracy by 1.4% on UCF-101 and 0.8% on CCV.
4.4 Results of the Hybrid Framework

Finally, we discuss the results of the entire hybrid deep learning framework, obtained by further combining results from the temporal LSTM and the regularized fusion network. The prediction scores from these networks are fused linearly with weights estimated by cross-validation. As shown in the last row of Table 3, we achieve very strong performance on both datasets: 91.3% on UCF-101 and 83.5% on CCV. The performance of the hybrid framework is clearly better than that of the Spatial LSTM and the Motion LSTM (in Table 1). Compared with the Regularized Fusion Network (in Table 2), adding the long-term temporal modeling in the hybrid framework improves the results by 2.9% on UCF-101 and 7.3% on CCV, which is fairly significant considering the difficulties of the two datasets. In contrast to the fusion result in the last row of Table 1, the gain of the proposed hybrid framework comes from the use of the regularized fusion, which again verifies the effectiveness of our fusion method.

To better understand the contributions of the key components in the hybrid framework, we further report the per-class performance on CCV in Figure 4. We see that, although the performance of the LSTM is clearly lower, fusing it with the video-level predictions by the regularized fusion network can significantly improve the results for almost all the classes. This is a bit surprising because some classes do not seem to require temporal information to be recognized.
Figure 4: Per-class performance on CCV, using the Spatial and Motion LSTM, the Regularized Fusion Network, and their combination, i.e., the Hybrid Deep Learning Framework.

Figure 5: Two example videos of class "cat" in the CCV dataset with similar temporal clues over time.

After checking some of the videos, we find that there could be helpful clues which can be modeled, even for object-related classes like "cat" and "dog". For instance, as shown in Figure 5, we observe that quite a number of "cat" videos contain only a single cat running around on the floor. The LSTM network may be able to capture this clue, which is helpful for classification.
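The final combination described at the beginning of this subsection, i.e., a linear fusion of prediction scores with cross-validated weights, can be sketched as follows; the weight grid and function names are our own illustration:

```python
import numpy as np

def fuse_scores(score_list, weights):
    """Weighted average of prediction-score matrices (videos x classes)."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return sum(wi * s for wi, s in zip(w, score_list))

def search_weight(scores_a, scores_b, labels, metric):
    """Grid-search the fusion weight for two models on a validation split.

    metric(scores, labels) is an assumed callable, e.g., accuracy or mAP.
    """
    grid = np.linspace(0.0, 1.0, 11)
    return max(grid, key=lambda w: metric(fuse_scores([scores_a, scores_b],
                                                      [w, 1.0 - w]), labels))
```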
Efficiency.
In addition to achieving superior classification performance, our framework also enjoys high computational efficiency. We summarize the average testing time of a UCF-101 video (duration: 8 seconds) as follows. The extraction of the frames and the optical flows takes 3.9 seconds, and computing the CNN-based spatial and short-term motion features requires 9 seconds. Prediction with the LSTM and the regularized fusion network needs 2.8 seconds. All these are evaluated on a single NVIDIA Tesla K40 GPU.
4.5 Comparison with State-of-the-Art Results

In this subsection, we compare with several state-of-the-art results. As shown in Table 3, our hybrid deep learning framework produces the highest performance on both datasets. On the UCF-101, many works with competitive results are based on the dense trajectories [44, 52], while our approach fully relies on the deep learning techniques. Compared with the original result of the two-stream CNN in [33], our framework is better with the additional fusion and temporal modeling functions, although it is built on our implementation of the CNN models, which is slightly worse than that of [33] (our two-stream CNN result is 86.2%). Note that a gain of even just 1% on the widely adopted UCF-101 dataset is generally considered as a significant progress.

Table 3: Comparison with state-of-the-art results. Our approach produces to-date the best reported results on both datasets.

UCF-101                         |  CCV
Donahue et al. [6]      82.9%   |  Xu et al. [48]      60.3%
Srivastava et al. [36]  84.3%   |  Ye et al. [50]      64.0%
Wang et al. [44]        85.9%   |  Jhuo et al. [12]    64.0%
Tran et al. [40]        86.7%   |  Ma et al. [27]      63.4%
Simonyan et al. [33]    88.0%   |  Liu et al. [25]     68.2%
Lan et al. [22]         89.1%   |  Wu et al. [47]      70.6%
Zha et al. [52]         89.6%   |  –
Ours                    91.3%   |  Ours                83.5%
In addition, the recent works in [6, 36] also adopted the LSTM to explore the temporal clues for video classification and reported promising performance. However, our LSTM results are not directly comparable as the input features are extracted by different neural networks. On the CCV dataset, all the recent approaches relied on the joint use of multiple features by developing new fusion methods [48, 50, 12, 27, 25, 47]. Our hybrid deep learning framework is significantly better than all of them.
5. CONCLUSIONS
We have proposed a novel hybrid deep learning framework for video classification, which is able to model static visual features, short-term motion patterns and long-term temporal clues. In the framework, we first extract spatial and motion features with two CNNs trained on static frames and stacked optical flows respectively. The two types of features are used separately as inputs of the LSTM networks for long-term temporal modeling. A regularized fusion network is also deployed to combine the two features at the video level for improved classification. Our hybrid deep learning framework integrating both the LSTM and the regularized fusion network produces very impressive performance on two widely adopted benchmark datasets. The results not only verify the effectiveness of the individual components of the framework, but also demonstrate that the frame-level temporal modeling and the video-level fusion and classification are highly complementary, and a big leap of performance can be attained by combining them.

Although deep learning based approaches have been successful in addressing many problems, effective network architectures are urgently needed for modeling sequential data like the videos. Several researchers have recently explored this direction. However, compared with the progress on image classification, the achieved performance gain on video classification over the traditional hand-crafted features is much less significant. Our work in this paper represents one of the few studies showing very strong results. For future work, further improving the capability of modeling the temporal dimension of videos is of high priority. In addition, audio features, which are known to be useful for video classification, can be easily incorporated into our framework.
6. REFERENCES

[1] R. Arandjelovic and A. Zisserman. Three things everyone should know to improve object retrieval. In CVPR, 2012.
[2] F. R. Bach, G. R. Lanckriet, and M. I. Jordan. Multiple kernel learning, conic duality, and the SMO algorithm. In ICML, 2004.
[3] Y. Bengio. Practical recommendations for gradient-based training of deep architectures. In Neural Networks: Tricks of the Trade. Springer, 2012.
[4] J. Chung, Ç. Gülçehre, K. Cho, and Y. Bengio. Gated feedback recurrent neural networks. CoRR, 2015.
[5] G. E. Dahl, D. Yu, L. Deng, and A. Acero. Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE TASLP, 2012.
[6] J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. CoRR, 2014.
[7] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
[8] A. Graves, A. Mohamed, and G. E. Hinton. Speech recognition with deep recurrent neural networks. In ICASSP, 2013.
[9] A. Graves and J. Schmidhuber. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 2005.
[10] C. Harris and M. J. Stephens. A combined corner and edge detector. In Alvey Vision Conference, 1988.
[11] M. Jain, J. van Gemert, and C. G. M. Snoek. University of Amsterdam at THUMOS challenge 2014. In ECCV THUMOS Challenge Workshop, 2014.
[12] I.-H. Jhuo, G. Ye, S. Gao, D. Liu, Y.-G. Jiang, D. T. Lee, and S.-F. Chang. Discovering joint audio-visual codewords for video event detection. Machine Vision and Applications, 2014.
[13] S. Ji, W. Xu, M. Yang, and K. Yu. 3D convolutional neural networks for human action recognition. In ICML, 2010.
[14] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In ACM Multimedia, 2014.
[15] W. Jiang, C. Cotton, S.-F. Chang, D. Ellis, and A. Loui. Short-term audio-visual atoms for generic video concept classification. In ACM Multimedia, 2009.
[16] Y.-G. Jiang, J. Liu, A. Roshan Zamir, G. Toderici, I. Laptev, M. Shah, and R. Sukthankar. THUMOS challenge: Action recognition with a large number of classes. http://crcv.ucf.edu/THUMOS14/, 2014.
[17] Y.-G. Jiang, G. Ye, S.-F. Chang, D. Ellis, and A. C. Loui. Consumer video understanding: A benchmark database and an evaluation of human and machine performance. In ACM ICMR, 2011.
[18] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, 2014.
[19] A. Klaser, M. Marszałek, and C. Schmid. A spatio-temporal descriptor based on 3D-gradients. In BMVC, 2008.
[20] M. Kloft, U. Brefeld, S. Sonnenburg, and A. Zien. Lp-norm multiple kernel learning. The Journal of Machine Learning Research, 2011.
[21] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[22] Z. Lan, M. Lin, X. Li, A. G. Hauptmann, and B. Raj. Beyond Gaussian pyramid: Multi-skip feature stacking for action recognition. CoRR, 2014.
[23] I. Laptev. On space-time interest points. IJCV, 64(2/3):107–123, 2005.
[24] W. Li, Z. Zhang, and Z. Liu. Expandable data-driven graphical modeling of human actions based on salient postures. IEEE TCSVT, 2008.
[25] D. Liu, K.-T. Lai, G. Ye, M.-S. Chen, and S.-F. Chang. Sample-specific late fusion for visual category recognition. In CVPR, 2013.
[26] D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 2004.
[27] A. J. Ma and P. C. Yuen. Reduced analytic dependency modeling: Robust fusion for visual recognition. IJCV, 2014.
[28] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Ng. Multimodal deep learning. In ICML, 2011.
[29] D. Oneata, J. Verbeek, and C. Schmid. Action and event recognition with Fisher vectors on a compact feature set. In ICCV, 2013.
[30] M. Ranzato, A. Szlam, J. Bruna, M. Mathieu, R. Collobert, and S. Chopra. Video (language) modeling: a baseline for generative models of natural videos. CoRR, 2014.
[31] A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson. CNN features off-the-shelf: An astounding baseline for recognition. CoRR, 2014.
[32] J. Sánchez, F. Perronnin, T. Mensink, and J. Verbeek. Image classification with the Fisher vector: Theory and practice. IJCV, 2013.
[33] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, 2014.
[34] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, 2014.
[35] K. Soomro, A. R. Zamir, and M. Shah. UCF101: A dataset of 101 human action classes from videos in the wild. CoRR, 2012.
[36] N. Srivastava, E. Mansimov, and R. Salakhutdinov. Unsupervised learning of video representations using LSTMs. CoRR, 2015.
[37] N. Srivastava and R. Salakhutdinov. Multimodal learning with deep Boltzmann machines. In NIPS, 2012.
[38] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. CoRR, 2014.
[39] K. Tang, L. Fei-Fei, and D. Koller. Learning latent temporal structure for complex event detection. In CVPR, 2012.
[40] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. C3D: Generic features for video analysis. CoRR, 2014.
[41] D. L. Vail, M. M. Veloso, and J. D. Lafferty. Conditional random fields for activity recognition. In AAMAS, 2007.
[42] A. Vedaldi, V. Gulshan, M. Varma, and A. Zisserman. Multiple kernels for object detection. In ICCV, 2009.
[43] S. Venugopalan, H. Xu, J. Donahue, M. Rohrbach, R. J. Mooney, and K. Saenko. Translating videos to natural language using deep recurrent neural networks. CoRR, 2014.
[44] H. Wang and C. Schmid. Action recognition with improved trajectories. In ICCV, 2013.
[45] H. Wang, M. M. Ullah, A. Klaser, I. Laptev, and C. Schmid. Evaluation of local spatio-temporal features for action recognition. In BMVC, 2009.
[46] Y. Wang and G. Mori. Max-margin hidden conditional random fields for human action recognition. In CVPR, 2009.
[47] Z. Wu, Y.-G. Jiang, J. Wang, J. Pu, and X. Xue. Exploring inter-feature and inter-class relationships with deep neural networks for video classification. In ACM Multimedia, 2014.
[48] Z. Xu, Y. Yang, I. Tsang, N. Sebe, and A. Hauptmann. Feature weighting via optimal thresholding for video analysis. In ICCV, 2013.
[49] H. Yang, M. R. Lyu, and I. King. Efficient online learning for multitask feature selection. ACM SIGKDD, 2013.
[50] G. Ye, D. Liu, I.-H. Jhuo, and S.-F. Chang. Robust late fusion with rank minimization. In CVPR, 2012.
[51] Z. Zeng and Q. Ji. Knowledge based activity recognition with dynamic Bayesian network. In ECCV, 2010.
[52] S. Zha, F. Luisier, W. Andrews, N. Srivastava, and R. Salakhutdinov. Exploiting image-trained CNN architectures for unconstrained video classification. CoRR, 2015.
[53] J. Zhang, M. Marszałek, S. Lazebnik, and C. Schmid. Local features and kernels for classification of texture and object categories: A comprehensive study. IJCV, 2007.