Efficient Two-Stream Motion and Appearance 3D CNNs for Video Classification
Ali Diba, ESAT, KU Leuven, [email protected]
Ali Pazandeh, Sharif UTech, [email protected]
Luc Van Gool, ESAT, KU Leuven and ETH Zurich, [email protected]
Abstract
Video and action classification have advanced rapidly with deep neural networks, in particular with two-stream CNNs that take RGB frames and optical flow as inputs and deliver outstanding performance for video analysis. One shortcoming of these methods is the extraction of motion information, which is done outside the CNN and remains relatively time consuming even on GPUs. End-to-end methods that learn a motion representation inside the network, such as 3D CNNs, can therefore be both faster and accurate. We present novel deep CNNs with 3D architectures that model actions and motion representations efficiently, aiming to be accurate and as fast as real time. Our new networks learn distinctive models that combine deep motion features with the appearance model by learning optical-flow features inside the network.
1. Introduction

Recent efforts on human action recognition have focused on using the spatio-temporal information of a video as efficiently as possible [6, 11]. There are several viewpoints on the problem; treating the video as a 3D volume or as a sequence of 2D frames are the most common ones [9, 11]. Despite the promising results of convolutional networks on most visual recognition tasks, the need for temporal information and the fine-grained nature of the classes set action recognition apart from other recognition tasks: there is still a significant gap between the results of deep methods and those based on handcrafted features. Two main factors explain this: first, the proposed datasets do not contain enough videos, and second, the proposed methods do not handle the temporal information as well as other cues [4]. As can be inferred from the results reported in a number of works, the network that operates on optical flow discriminates action classes better, but it requires a pre-processing step to compute the optical flow for each set of frames, which is time consuming. On the other hand, methods that do not compute the flow and try to handle the temporal information inside the network, such as C3D [9], are less accurate than two-stream approaches.
Figure 1: We use a 3D ConvNet to learn motion from optical flow and extract mid-level features, which are combined with regular C3D features as a new representation for videos. This figure shows our initial idea for combining these features.

We believe that the main cause of this is insufficient training data: the network cannot learn to handle the temporal information of the video, since training has to start from scratch because the proposed architecture differs from the common networks. In conclusion, we need a network with the power of two-stream networks in handling temporal information, together with the speed of C3D and its ability to handle spatial data. To reach this goal we propose a two-stream 3D network that uses 3D convolutions in both the spatial and the temporal stream. It also uses the abstract feature vector of the optical-flow estimation network, obtained from the last layer of the convolutional part of that network. In the following, we discuss related work in Section 2, describe our proposed networks in Section 3, and finally report the results of the proposed networks on the common datasets in Section 4.
2. Related Works

Most recent work on visual recognition tasks, and on human action recognition in particular, relies on convolutional neural networks to outperform earlier methods. Since action recognition datasets contain videos, the proposed methods try to benefit from the temporal information in addition to the spatial information of each frame.
Figure 2: Our second idea: train an end-to-end 3D CNN with two loss functions, to classify the video and to estimate the optical flow. The 3D-ConvNet part is shared between the two tasks.

The two-stream architecture [6] proposed by Simonyan et al. extracts the temporal information by training an AlexNet CNN on the optical flow computed between consecutive frames. Many works extend this two-stream idea from different angles. Wang et al. [12] improved the results by using deeper networks. Gkioxari et al. [5] proposed an action detection method based on the two-stream network. Feichtenhofer et al. [4] extended the two-stream network by implementing different fusion methods at different layers, instead of the late fusion at the score layer of [6]; as they report, and in contrast with most other works, their results reach the state of the art without combining with the IDT method [10]. Donahue et al. [3] handled the temporal information by applying a long short-term memory to features extracted from frames; since the network is not end to end, the results did not improve as much as expected. The handcrafted improved trajectory features proposed by Wang et al. [10] achieve considerable results and can also improve CNN-based results when combined with them; hence most works take advantage of this by concatenating the IDT features with their own. Wang et al. [11] exploit IDT by extracting CNN features locally around the trajectories and then encoding these local features with Fisher vector encoding.

As can be inferred from the reviewed methods, most of them use either optical flow or handcrafted features (IDT) to improve performance. On the one hand both ideas are time consuming, and on the other hand a network exists that can perform better than the handcrafted features or inputs. In the following we describe our proposed networks, which operate on raw images and handle the temporal information, despite using a network architecture that is trained on the ImageNet dataset.
3. Proposed Method

We explain the concepts of our proposed methods in detail, empirically analyze different 3D ConvNet architectures for appearance and motion to obtain better-performing CNNs for action and video classification, and present a training scheme on large-scale video datasets for feature learning.

Inspired by the recently proposed C3D method [9, 8] for learning 3D convolution and deconvolution networks, we design our networks to learn a new feature representation of videos through a novel way of training two-stream 3D CNNs. The proposed method classifies videos based on the learned spatio-temporal networks; it does not need optical flow as input, yet it benefits from the learned optical-flow information embedded in the network. We let the network learn the best aspects of optical flow and appearance together in one network, end to end, using only the RGB frames of the video. We will show that the mid-level motion features extracted from the trained convolutional network are a good replacement for optical flow, at a considerably lower computational cost, for video classification. Learning the motion representation in a way that is also involved in the classification problem, and thus inherits action cues, performs well in terms of both speed and accuracy.
Our first, initial method trains a 3D conv-deconv network to compute optical flow from a sequence of video frames and then combines the mid-level motion features with the RGB 3D-ConvNet features for classification. Figure 1 shows this initial approach of pairing a new motion feature with C3D features. Compared with other two-stream networks that need optical flow as input, this method is incomparably faster; it achieves comparable results, but still leaves room for improvement. In the next parts we propose our new two-stream networks, trained on both the class labels and the flow estimation together, to address speed and accuracy.
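To make the Figure 1 pipeline concrete, the following is a minimal sketch in Python (our choice of language; the paper does not specify an implementation). Here `appearance_c3d` and `motion_c3d` are hypothetical callables standing in for the pre-trained RGB C3D network and the flow-estimation 3D conv-deconv network; only the concatenate-features-then-linear-SVM step is taken from the description above.

```python
import numpy as np
from sklearn.svm import LinearSVC

def fused_descriptor(appearance_c3d, motion_c3d, clip):
    """Describe one clip: appearance C3D features concatenated with the
    mid-level features of the flow-estimation 3D conv-deconv network."""
    app = appearance_c3d(clip)   # e.g. fully connected activations of the RGB C3D
    mot = motion_c3d(clip)       # mid-level (last conv) features of the motion net
    return np.concatenate([app, mot])

def train_video_svm(descriptors, labels):
    """Linear SVM on the fused descriptors, as in the Figure 1 pipeline."""
    clf = LinearSVC()
    clf.fit(np.stack(descriptors), np.asarray(labels))
    return clf
```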
Figure 3: The proposed two-stream end-to-end 3D CNN learning for video classification.

The second idea is to train an end-to-end network consisting of a 3D-ConvNet and a 3D-DeconvNet on both the action class and the motion structure together. The 3D-ConvNet part is shared between the action classification and flow estimation networks; viewed differently, the shared 3D-Conv network is the main network with two loss functions. The first is a softmax loss for action classification, and the second is a 3D-Deconv network followed by a voxel-wise loss. The end-to-end network thus provides a more solid solution than the first method, since the new learned representation is optimized to exploit the appearance and motion models together, obtained from the frames, and it is as fast as the C3D method at test time.
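As an illustration of this shared-backbone idea (Figure 2), the sketch below uses PyTorch, which is an assumption on our part; the layer widths and the mean-squared voxel-wise loss are illustrative stand-ins rather than the exact configuration used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedC3DMultiTask(nn.Module):
    """One shared 3D-Conv encoder with two heads: action classification
    (softmax loss) and optical-flow regression via a 3D-Deconv decoder."""
    def __init__(self, num_classes=101):
        super().__init__()
        self.encoder = nn.Sequential(                      # shared 3D-Conv part
            nn.Conv3d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),
            nn.Conv3d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)))
        self.classifier = nn.Sequential(                   # action-class head
            nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(128, num_classes))
        self.flow_decoder = nn.Sequential(                 # 3D-Deconv flow head (u, v)
            nn.ConvTranspose3d(128, 64, kernel_size=(1, 2, 2), stride=(1, 2, 2)), nn.ReLU(),
            nn.ConvTranspose3d(64, 2, kernel_size=(1, 2, 2), stride=(1, 2, 2)))

    def forward(self, clip):                               # clip: (B, 3, T, H, W)
        feat = self.encoder(clip)
        return self.classifier(feat), self.flow_decoder(feat)

def joint_loss(logits, flow_pred, labels, flow_target, flow_weight=1.0):
    """Softmax (cross-entropy) loss plus a voxel-wise loss on the flow volume."""
    return F.cross_entropy(logits, labels) + flow_weight * F.mse_loss(flow_pred, flow_target)
```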
Our main proposal is to train a 3D two-stream network end to end. Figure 3 shows the details of this idea. The appearance stream is an RGB 3D-ConvNet, and the motion stream is a 3D-ConvNet followed by a 3D-DeconvNet that outputs optical flow in order to learn motion information. As shown in the figure, the softmax loss operates on the concatenation of the last layers of both 3D-ConvNets, which adds the abstract temporal information of the motion stream to the appearance stream and yields a richer representation for categorizing actions. At test time, this method matches the accuracy of single-frame two-stream networks while processing frames about 20 times faster. This single-step training improves the accuracy of 3D CNNs on videos by exploiting new convolutional features that are shared between the motion learning and action classification tasks; the method can also be seen as a multi-task network serving different purposes. Prior work [2] has shown that a network with several sub-tasks can be more efficient and learn a stronger feature representation by considering all tasks together. Our method improves on C3D by learning and using a better motion representation than the knowledge extracted only from sequences of RGB frames without optical flow in training. In this two-stage cascade, the optical-flow information is absorbed during the training phase, and at test time there is no need to compute optical flow, so videos can be classified very fast.
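The following PyTorch-style sketch (again, the framework is our assumption) outlines the two-stream network of Figure 3: two independent 3D-ConvNet encoders, a schematic single-layer deconvolutional flow head on the motion stream, and a softmax classifier on the concatenated last-layer features of both streams. The filter counts follow the architecture details given below; everything else (pooling layout, helper names, feature dimensions) is illustrative.

```python
import torch
import torch.nn as nn

def c3d_encoder():
    """Stand-in 5-layer 3D-ConvNet (64, 128, 256, 256, 256 filters, 3x3x3 kernels);
    the pooling layout here is illustrative, not the exact C3D configuration."""
    chans = [3, 64, 128, 256, 256, 256]
    layers = []
    for cin, cout in zip(chans[:-1], chans[1:]):
        layers += [nn.Conv3d(cin, cout, kernel_size=3, padding=1), nn.ReLU(),
                   nn.MaxPool3d((1, 2, 2))]
    return nn.Sequential(*layers)

class TwoStream3DNet(nn.Module):
    def __init__(self, num_classes=101, feat_dim=2048):
        super().__init__()
        self.appearance = c3d_encoder()     # RGB appearance stream
        self.motion = c3d_encoder()         # motion stream, separate weights
        self.app_fc = nn.Sequential(nn.AdaptiveAvgPool3d(1), nn.Flatten(),
                                    nn.Linear(256, feat_dim), nn.ReLU())
        self.mot_fc = nn.Sequential(nn.AdaptiveAvgPool3d(1), nn.Flatten(),
                                    nn.Linear(256, feat_dim), nn.ReLU())
        # Softmax loss is applied to this classifier over the concatenated features.
        self.classifier = nn.Linear(2 * feat_dim, num_classes)
        # Schematic one-layer 3D-Deconv head that upsamples the motion features
        # into a 2-channel flow volume; it is only supervised during training.
        self.flow_head = nn.ConvTranspose3d(256, 2, kernel_size=(1, 32, 32),
                                            stride=(1, 32, 32))

    def forward(self, clip):                # clip: (B, 3, T, H, W)
        app = self.appearance(clip)
        mot = self.motion(clip)
        fused = torch.cat([self.app_fc(app), self.mot_fc(mot)], dim=1)
        return self.classifier(fused), self.flow_head(mot)
```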
For our proposed methods we use 3D-ConvNet and 3D-DeconvNet architectures inspired by [9, 8]. The 3D-ConvNet has five convolutional layers and two fully connected layers, plus the class layer. The numbers of filters in the convolutional layers are 64, 128, 256, 256 and 256, respectively, and the fully connected layers have 2048 units. As in [9], we use filters of size 3 × 3 × 3.

4. Experiments

We have evaluated our proposed networks on the UCF101 [7] action video dataset and will try other datasets in the future. In this section we first describe the dataset, then report and discuss the results of the experiments.
The UCF101 [7] action dataset has 101 action classes grouped into five main types: human-object interaction, body motion only, human-human interaction, playing musical instruments, and sports. It contains 13,320 video clips with three standard train/test splits; the reported results are the average accuracy over these three splits.

For training the networks, we use the C3D model pre-trained on the Sports-1M dataset for the 3D-ConvNet parts and fine-tune it on UCF101. We trained the 3D-DeconvNet from scratch, using optical flow extracted from the UCF101 frames with Brox's method [1] as ground truth. For evaluation, we train a linear SVM on the features extracted from each network.
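A possible fine-tuning setup, sketched in the same PyTorch style and building on the `TwoStream3DNet` sketch above. The checkpoint file name, learning rates and the data `loader` are placeholders introduced for illustration; only the overall scheme (Sports-1M pre-trained weights for the 3D-Conv parts, the flow head trained from scratch, Brox flow as the regression target) follows the description above.

```python
import torch
import torch.nn.functional as F

model = TwoStream3DNet(num_classes=101)
# Hypothetical checkpoint holding Sports-1M pre-trained C3D encoder weights.
model.appearance.load_state_dict(torch.load("c3d_sports1m.pth"))
model.motion.load_state_dict(torch.load("c3d_sports1m.pth"))

# Smaller learning rate for the pre-trained encoders, larger for the parts
# trained from scratch (rates are assumptions, not taken from the paper).
optimizer = torch.optim.SGD([
    {"params": model.appearance.parameters(), "lr": 1e-4},
    {"params": model.motion.parameters(), "lr": 1e-4},
    {"params": model.flow_head.parameters(), "lr": 1e-3},
    {"params": list(model.app_fc.parameters()) + list(model.mot_fc.parameters())
             + list(model.classifier.parameters()), "lr": 1e-3},
], momentum=0.9, weight_decay=5e-4)

for clips, labels, brox_flow in loader:   # `loader` yields UCF101 clips, labels
    logits, flow = model(clips)           # and pre-computed Brox optical flow
    loss = F.cross_entropy(logits, labels) + F.mse_loss(flow, brox_flow)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```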
In this section we compare the results of the proposed networks with previous methods on two main factors: accuracy and test time.

Table 1: Comparing the average accuracy of our proposed networks with previous works on the 3 splits of the UCF101 dataset.

    Method                     Average Accuracy (%)
    C3D (1 net) [9]            82.3
    C3D (3 nets) [9]           85.2
    Two-stream CNNs [6]        88.0
    Very-Deep [12]             91.4
    TDD [11]                   90.3
    Ours-Initial               85.2
    Ours-Combined net          87.0
    Ours-Two-stream 3D net     90.2

Table 2: Comparing the number of frames processed per second by our proposed methods and related works.

    Method                     Frames per second
    C3D (1 net) [9]            313
    Two-stream CNNs [6]        14.3
    iDT+FV [10]                2.1
    Ours-Initial               210
    Ours-Combined net          300
    Ours-Two-stream 3D net     246

The mean accuracies of the methods are compared in Table 1. The baseline of this work, which is also the appearance stream of our proposed networks, is reported in the first row (C3D); its authors also trained three different nets, which improves their mean accuracy by about 3 percent. The two-stream, very deep two-stream, and trajectory-pooled deep-convolutional descriptor methods are reported in the following rows. The bottom part of the table shows the mean accuracy of our proposed methods on the UCF101 dataset. The initial method, which trains the appearance stream on the action labels and the motion stream on the optical flow separately, outperforms the single-network C3D, which shows that the abstract mid-level features of the motion stream can improve the accuracy. Our second proposed network is trained end to end; with the shared weights trained to learn both the actions and the optical flow simultaneously, we expect a further improvement in accuracy, and the results support this expectation with a gain of about 2 percent. Our third proposed network is also end to end, with more degrees of freedom because it has two separate networks for appearance and motion; classification operates on the concatenated features of both streams. The results show that training the networks without weight sharing outperforms the previous proposed networks.

Table 2 compares the methods in terms of speed, in frames per second. Since our proposed methods work without any optical-flow extraction step, they differ considerably in speed from the other methods. Among our own methods, the second network with shared weights has the highest speed of the three, while the two networks with separate streams are slightly slower.
5. Conclusion

We presented novel convolutional neural networks that embed both appearance and motion for human action videos. Our two-stream 3D network demonstrates an efficient scheme for applying 3D ConvNets and achieves good performance for video classification. We showed the effectiveness of the method in terms of speed and accuracy at run time; it obtains accurate results at speeds well above real time.
References

[1] Thomas Brox, Andrés Bruhn, Nils Papenberg, and Joachim Weickert. High accuracy optical flow estimation based on a theory for warping. In ECCV, 2004.
[2] Jifeng Dai, Kaiming He, and Jian Sun. Instance-aware semantic segmentation via multi-task network cascades. In CVPR, 2016.
[3] Jeffrey Donahue et al. Long-term recurrent convolutional networks for visual recognition and description. In CVPR, 2015.
[4] C. Feichtenhofer, A. Pinz, and A. Zisserman. Convolutional two-stream network fusion for video action recognition. In CVPR, 2016.
[5] G. Gkioxari and J. Malik. Finding action tubes. In CVPR, 2015.
[6] Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, 2014.
[7] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. UCF101: A dataset of 101 human action classes from videos in the wild. arXiv, 2012.
[8] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Deep End2End Voxel2Voxel prediction. In CVPR Workshops, 2016.
[9] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3D convolutional networks. In ICCV, 2015.
[10] Heng Wang and Cordelia Schmid. Action recognition with improved trajectories. In ICCV, 2013.
[11] Limin Wang, Yu Qiao, and Xiaoou Tang. Action recognition with trajectory-pooled deep-convolutional descriptors. In CVPR, 2015.
[12] Limin Wang, Yuanjun Xiong, Zhe Wang, and Yu Qiao. Towards good practices for very deep two-stream convnets. arXiv, 2015.