CDC: Convolutional-De-Convolutional Networks for Precise Temporal Action Localization in Untrimmed Videos

Zheng Shou†, Jonathan Chan†, Alireza Zareian†, Kazuyuki Miyazawa‡, and Shih-Fu Chang†
†Columbia University, New York, NY, USA; ‡Mitsubishi Electric, Japan
{zs2262, jc4659, az2407, sc250}@columbia.edu, [email protected]

Abstract
Temporal action localization is an important yet challenging problem. Given a long, untrimmed video consisting of multiple action instances and complex background content, we need not only to recognize the action categories, but also to localize the start time and end time of each instance. Many state-of-the-art systems use segment-level classifiers to select and rank proposal segments with pre-determined boundaries. However, a desirable model should move beyond the segment level and make dense predictions at a fine granularity in time to determine precise temporal boundaries. To this end, we design a novel Convolutional-De-Convolutional (CDC) network that places CDC filters on top of 3D ConvNets, which have been shown to be effective for abstracting action semantics but reduce the temporal length of the input data. The proposed CDC filter performs the required temporal upsampling and spatial downsampling operations simultaneously to predict actions at the frame-level granularity. It is unique in jointly modeling action semantics in space-time and fine-grained temporal dynamics. We train the CDC network in an end-to-end manner efficiently. Our model not only achieves superior performance in detecting actions in every frame, but also significantly boosts the precision of localizing temporal boundaries. Finally, the CDC network demonstrates very high efficiency, with the ability to process 500 frames per second on a single GPU server. Source code and trained models are available online at https://bitbucket.org/columbiadvmm/cdc .
1. Introduction
Recently, temporal action localization has drawn considerable interest in the computer vision community [25, 15, 39, 66, 26, 68, 54, 47, 43, 74, 9, 18, 36]. This task involves two components: (1) determining whether a video contains specific actions (such as diving, jumping, etc.) and (2) identifying the temporal boundaries (start time and end time) of each action instance.

A typical framework used by many state-of-the-art systems [68, 54, 39, 66, 26] fuses a large set of features and trains classifiers that operate on sliding windows or segment proposals. Recently, an end-to-end deep learning framework called Segment-CNN (S-CNN) [47], based on 3D ConvNets [61], demonstrated superior performance in both efficiency and accuracy on standard benchmarks such as THUMOS'14 [25].
S-CNN consists of a proposal network for generating candidate video segments and a localization network for predicting segment-level scores of action classes. Although the localization network can be optimized to select segments with high overlaps with ground truth action instances, the boundaries of the detected actions are still restricted to the pre-determined boundaries of a fixed set of proposal segments.

Figure 1. Our framework for precise temporal action localization. Given an input raw video, it is fed into our CDC localization network, which consists of 3D ConvNets for semantic abstraction and a novel CDC network for dense score prediction at the frame level. Such fine-granular score sequences are combined with segment proposals to detect action instances with precise boundaries.

As illustrated in Figure 1, our goal is to refine temporal boundaries from proposal segments to precisely localize the boundaries of action instances. This motivates us to move beyond existing practices based on segment-level predictions, and to explicitly focus on the issue of fine-grained, dense predictions in time. To achieve this goal, some existing techniques can be adapted: (1) single-frame classifiers operate on each frame individually; (2) Recurrent Neural Networks (RNN) further take into account temporal dependencies across frames. But both of them fail to explicitly model the spatio-temporal information in raw videos.

3D CNNs [61, 47] have been shown to learn spatio-temporal abstractions of high-level semantics directly from raw videos, but they lose granularity in time, which is important for precise localization, as mentioned above. For example, the layers from conv1a to conv5b in the well-known C3D architecture [61] reduce the temporal length of an input video by a factor of 8. In pixel-level semantic segmentation, de-convolution has proven to be an effective upsampling method in both images [34, 45] and videos [62] for producing output of the same resolution as the input. In our temporal localization problem, the temporal length of the output should be the same as that of the input video, but the spatial size should be reduced to 1x1. Therefore, we not only need to upsample in time but also need to downsample in space. To this end, we propose a novel Convolutional-De-Convolutional (CDC) filter, which performs convolution in space (for semantic abstraction) and de-convolution in time (for frame-level resolution) simultaneously. It is unique in jointly modeling the spatio-temporal interactions between summarizing high-level semantics in space and inferring fine-grained action dynamics in time. On top of 3D ConvNets, we stack multiple CDC layers to form our CDC network, which achieves the aforementioned goal of temporal upsampling and spatial downsampling, and can thereby determine action categories and refine the boundaries of proposal segments to precisely localize action instances.

In summary, this paper makes three novel contributions:

(1) To the best of our knowledge, this is the first work to combine two reverse operations (i.e., convolution and de-convolution) into a joint CDC filter, which simultaneously conducts downsampling in space and upsampling in time to infer both high-level action semantics and temporal dynamics at a fine granularity in time.

(2) We build a CDC network using the proposed CDC filter to specifically address precise temporal action localization.
The CDC network can be efficiently trained end-to-end from raw videos to produce dense scores that are used to predict action instances with precise boundaries.

(3) Our model outperforms state-of-the-art methods in video per-frame action labeling and significantly boosts the precision of temporal action localization over a wide range of detection thresholds.
2. Related work
Action recognition and detection.
Early works mainly focus on simple actions in well-controlled environments and can be found in recent surveys [69, 41, 3]. Recently, researchers have started investigating untrimmed videos in the wild and have designed various features and techniques. We briefly review the following, which are also useful in temporal action localization: frame-level Convolutional Neural Networks (CNN) trained on ImageNet [44], such as AlexNet [29], VGG [51], ResNet [16], etc.; the 3D CNN architecture called C3D [61], trained on a large-scale sports video dataset [27]; the improved Dense Trajectory Feature (iDTF) [64, 65], consisting of HOG, HOF, and MBH features extracted along dense trajectories with camera motion influences eliminated; key frame selection [13]; ConvNets adapted to use motion flow as input [50, 10, 67]; and feature encoding with Fisher Vectors (FV) [40, 38] and VLAD [23, 72].

There are also studies on spatio-temporal action detection, which aim to detect action regions with bounding boxes over consecutive frames. Various methods have been developed from the perspectives of supervoxel merging [20, 55, 56], tracking [70, 42, 63, 53], object detection and linking [28, 14, 76, 42, 63], spatio-temporal segmentation [31, 71], and leveraging still images [21, 59, 22].
Temporal action localization.
Gaidon et al. [11, 12] introduced the problem of temporally localizing actions in untrimmed videos, focusing on limited actions such as "drinking and smoking" [30] and "open door and sit down" [8]. Later, researchers worked on building large-scale datasets consisting of complex action categories, such as THUMOS [25, 15] and MEXaction2 [57, 1, 58], as well as datasets focusing on fine-grained actions [35, 49, 48] or activities of high-level semantics [17]. The typical approach used in most systems [68, 54, 39, 66, 26] is to extract a pool of features, feed them to SVM classifiers, and then apply these classifiers on sliding windows or segment proposals for prediction. In order to design a model specific to temporal localization, Richard and Gall [43] proposed using statistical length and language modeling to represent temporal and contextual structures. Heilbron et al. [18] introduced a sparse learning framework for generating segment proposals of high recall.

Recently, deep learning methods have shown improved performance in localizing action instances. RNNs have been widely used to model temporal state transitions over frames: Escorcia et al. [9] built a temporal action proposal system based on Long Short-Term Memory (LSTM); Yeung et al. [74] used REINFORCE to learn decision policies for an RNN-based agent; Yeung et al. [73] introduced the MultiTHUMOS dataset of multi-label annotations for every frame in THUMOS videos and defined an LSTM network to model multiple input and output connections; Yuan et al. [77] proposed a pyramid of score distribution features at the center of each sliding window to capture motion information over multiple resolutions, and utilized an RNN to improve inter-frame consistency; Sun et al. [60] leveraged web images to train an LSTM model when only video-level annotations are available. In addition, Lea et al. [31] used temporal 1D convolution to capture scene changes while actions were being performed. Although RNNs and temporal 1D convolution can model temporal dependencies among frames and make frame-level predictions, they are usually placed on top of deep ConvNets that take a single frame as input, rather than directly modeling spatio-temporal characteristics in raw videos. Shou et al. [47] proposed an end-to-end Segment-based 3D CNN framework (S-CNN), which outperformed RNN-based methods by capturing spatio-temporal information simultaneously. However, S-CNN lacks the capability to predict at a fine time resolution and to localize precise temporal boundaries of action instances.
De-convolution and semantic segmentation.
Zeiler et al. [79] originally proposed de-convolutional networks for image decomposition, and later Zeiler and Fergus [78] re-purposed the de-convolutional filter to map CNN activations back to the input in order to visualize where the activations come from. Long et al. [34, 45] showed that deep learning based approaches can significantly boost performance in image semantic segmentation. They proposed Fully Convolutional Networks (FCN) to output feature maps of reduced dimensions, and then employed de-convolution for upsampling to make dense, pixel-level predictions. The fully convolutional architecture and learnable upsampling method are efficient and effective, and thus inspired many extensions [37, 19, 33, 4, 32, 80, 5, 6, 75].

Recently, Tran et al. [62] extended de-convolution from 2D to 3D and achieved competitive results on various voxel-level prediction tasks such as video semantic segmentation. This shows that de-convolution is also effective in the video domain and has the potential to be adapted for making dense predictions in time for our temporal action localization task. However, unlike the problem of semantic segmentation, we need to upsample in time but maintain downsampling in space. Instead of stacking a convolutional layer and a de-convolutional layer to conduct upsampling and downsampling separately, our proposed CDC filter learns a joint model that performs these two operations simultaneously, and proves to be more powerful and easier to train.
3. Convolutional-De-Convolutional networks
3.1. From C3D to the CDC network

The C3D architecture, consisting of 3D ConvNets followed by three Fully Connected (FC) layers, has achieved promising results in video analysis tasks such as recognition [61] and localization [47]. Further, Tran et al. [62] experimentally demonstrated the 3D ConvNets, i.e., from conv1a to conv5b, to be effective in summarizing spatio-temporal patterns from raw videos into high-level semantics.

Therefore, we build our CDC network upon C3D. We adopt conv1a to conv5b as the first part of our CDC network. Of the remaining layers in C3D, we keep pool5 to perform max pooling in height and width by a factor of 2, but retain the temporal length. Following conventional settings [61, 47, 62], we set the height and width of the CDC network input to 112x112. Given an input video segment of temporal length L, the output data shape of pool5 is (512, L/8, 4, 4).¹ Now, in order to predict action class scores at the original temporal resolution (frame level), we need to upsample in time (from L/8 back to L) and downsample in space (from 4x4 to 1x1). To this end, we propose the CDC filter and design a CDC network that adapts the FC layers from C3D to perform the required upsampling and downsampling operations. Details are described in Sections 3.2 and 3.3.

¹We denote the shape of data in the networks in the form (number of channels, temporal length, height, width), and the size of a feature map, kernel, stride, or zero padding as (temporal length, height, width).

3.2. CDC filter

In this section, we walk through a concrete example of adapting the FC6 layer in C3D to perform spatial downsampling by a factor of 4x4 and temporal upsampling by a factor of 2. For the sake of clarity, we focus on how a filter operates within one input channel and one output channel.

As explained in [34, 45], an FC layer is a special case of a convolutional layer (when the input data and the kernel have the same size and there is no striding and no padding). So we can transform FC6 into conv6, as shown in Figure 2 (a). Previously, a filter in FC6 took a 4x4 feature map from pool5 as input and output a single value. Now, a filter in conv6 can slide over the L/8 feature maps of size 4x4 stacked in time and output L/8 values in time. The kernel size of conv6 is 4x4=16.

Although conv6 performs spatial downsampling, the temporal length remains unchanged. To upsample in time, as shown in Figure 2 (b), a straightforward solution adds a de-convolutional layer deconv6 after conv6 to double the temporal length while maintaining the spatial size. The kernel size of deconv6 is 2. Therefore, the total number of parameters for this solution (separate conv6 and deconv6) is 4x4+2=18.
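To make the FC-to-conv view concrete, here is a minimal NumPy sketch of conv6 for a single input and output channel (the function name and shapes are illustrative, not the actual Caffe implementation):

import numpy as np

def fc6_as_conv6(pool5, W_fc):
    """Treat one FC6 filter as a convolutional filter (conv6).
    pool5: (T, 4, 4) feature maps stacked in time (one channel, T = L/8);
    W_fc: (4, 4) the FC6 kernel. The filter now slides over time and
    produces one value per input feature map."""
    return np.array([np.sum(W_fc * pool5[t]) for t in range(pool5.shape[0])])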
However, this solution conducts temporal upsampling and spatial downsampling in a separate manner. Instead, we propose the CDC filter CDC6 to jointly perform these two operations. As illustrated in Figure 2 (c), a CDC6 filter consists of two independent convolutional filters (the red one and the green one) operating on the same input 4x4 feature map. Each of these convolutional filters has the same kernel size as the filter in conv6 and separately outputs a single value, so each 4x4 feature map results in 2 outputs in time. As the CDC filter slides over the L/8 feature maps of size 4x4 stacked in time, this input feature volume of temporal length L/8 is upsampled in time to L/4, and its spatial size is reduced to 1x1. Consequently, in space this CDC filter is equivalent to a 2D convolutional filter of kernel size 4x4; in time it has the same effect as a 1D de-convolutional filter of kernel size 2, stride 2, padding 0. The kernel size of such a joint filter in CDC6 is 2x4x4=32, which is larger than that of the separate convolution and de-convolution solution (18).

Therefore, a CDC filter is more powerful for jointly modeling high-level semantics and temporal dynamics: each output in time comes from an independent convolutional kernel dedicated to that output (the red/green node corresponds to the red/green kernel), whereas in the separate convolution and de-convolution solution, different outputs in time share the same high-level semantics (the blue node) produced by a single convolutional kernel (the blue one). Having more parameters, however, makes the CDC filter harder to learn.

Figure 2. Illustration of how a filter in conv6, deconv6, and CDC6 operates on pool5 output feature maps (grey rectangles) stacked in time. In each panel, dashed lines with the same color indicate the same filter sliding over time. Nodes stand for outputs.

To remedy this issue, we propose a method to adapt the pre-trained
FC6 layer in C3D to initialize CDC6. After we convert FC6 to conv6, conv6 and CDC6 have the same number of channels (i.e., 4,096) and thus the same number of filters. Each filter in conv6 can be used to initialize its corresponding filter in CDC6: the filter in conv6 (the blue one) has the same kernel size as each of the two convolutional filters (the red one and the green one) in the
CDC6 filter, and thus can serve as the initialization for them both.

Generally, assume that a CDC filter F of kernel size (k_l, k_h, k_w) takes an input receptive field X of height k_h and width k_w, and produces Y, which consists of k_l successive outputs in time. For the example given in Figure 2 (c), we have k_l = 2, k_h = 4, k_w = 4. Given the indices a ∈ {1, ..., k_h} and b ∈ {1, ..., k_w} in height and width respectively for X, and the index c ∈ {1, ..., k_l} in time for Y, during the forward pass we compute Y by

$$Y[c] = \sum_{a=1}^{k_h} \sum_{b=1}^{k_w} F[c, a, b] \cdot X[a, b]; \qquad (1)$$

during back-propagation, our CDC filter follows the chain rule and propagates gradients from Y to X via

$$\frac{\partial \mathcal{L}}{\partial X[a, b]} = \sum_{c=1}^{k_l} F[c, a, b] \cdot \frac{\partial \mathcal{L}}{\partial Y[c]}. \qquad (2)$$

A CDC filter F can be regarded as coupling a series of convolutional filters in time (each with kernel size k_h in height and k_w in width) that share the input receptive field X, while at the same time F performs 1D de-convolution with kernel size k_l in time. In addition, the cross-channel mechanisms within a CDC layer and the way biases are added to the outputs of the CDC filters follow the conventional strategies used in convolutional and de-convolutional layers.
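As a concrete illustration of Equations (1) and (2), the following NumPy sketch implements a single-channel CDC filter in the non-overlapping case of Figure 2 (c) (temporal kernel k_l with stride k_l). It is a didactic sketch, not the actual Caffe implementation:

import numpy as np

def cdc_forward(X, F):
    """Forward pass of one CDC filter (Eq. 1), single input/output channel.
    X: (T, k_h, k_w) pool5 feature maps stacked in time; F: (k_l, k_h, k_w).
    Returns Y: (T * k_l,) -- upsampled in time, reduced to 1x1 in space."""
    T, k_l = X.shape[0], F.shape[0]
    Y = np.empty(T * k_l)
    for t in range(T):          # slide over input feature maps in time
        for c in range(k_l):    # k_l independent spatial convolutions
            Y[t * k_l + c] = np.sum(F[c] * X[t])
    return Y

def cdc_backward(X, F, dY):
    """Backward pass (Eq. 2): propagate gradients from Y back to X
    (and accumulate the filter gradient) by the chain rule."""
    T, k_l = X.shape[0], F.shape[0]
    dX, dF = np.zeros_like(X), np.zeros_like(F)
    for t in range(T):
        for c in range(k_l):
            g = dY[t * k_l + c]
            dX[t] += g * F[c]   # Eq. 2: sum over the k_l outputs in time
            dF[c] += g * X[t]
    return dX, dF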
3.3. CDC network architecture

In Figure 3, we illustrate our CDC network for labeling every frame of a video. The final output shape of the CDC network is (K+1, L, 1, 1), where K+1 stands for K action categories plus the background class. As described in Section 3.1, from conv1a to pool5, the temporal length of an input segment is reduced from L to L/8. On top of pool5, in order to make per-frame predictions, we adapt the FC layers in C3D as CDC layers to perform the temporal upsampling and spatial downsampling operations. Following previous de-convolution works [62, 34, 45], we upsample in time by a factor of 2 in each CDC layer, gradually increasing the temporal length from L/8 back to L.

Figure 3. Architecture of a typical CDC network. Following the notation of footnote 1, the top row lists the shape of the output data at each layer. (1) A video segment is first fed into the 3D ConvNets, and the temporal length is reduced from L to L/8. (2) CDC6 has kernel size (4, 4, 4), stride (2, 1, 1), and padding (1, 0, 0); it therefore reduces both height and width to 1 while increasing the temporal length from L/8 to L/4. Both CDC7 and CDC8 have kernel size (4, 1, 1), stride (2, 1, 1), and padding (1, 0, 0), and hence each further upsamples in time by a factor of 2, so the temporal length is back to L. (3) A frame-wise softmax layer is added on top of CDC8 to obtain confidence scores for every frame. Each channel stands for one class.

In Section 3.2, we gave an example of adapting FC6 as CDC6 to perform temporal 1D de-convolution of kernel size 2, stride 2, padding 0. For CDC6 in the CDC network, we instead construct a CDC filter with 4 convolutional filters rather than 2, so its temporal kernel size in time increases from 2 to 4. We set the corresponding stride to 2 and padding to 1. Now each 4x4 feature map produces 4 output nodes, and every two consecutive feature maps have 2 nodes overlapping in time. Consequently, the temporal length of the input is still upsampled by CDC6 from L/8 to L/4, but each output node sums contributions from two consecutive input feature maps, allowing temporal dynamics in the input to be taken into account.
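As a quick cross-check of the shapes in Figure 3, the following sketch (purely illustrative) walks through the output shape (channels, time, height, width) at each layer for one input window:

def cdc_shapes(L, K):
    """Data shapes of the CDC network in Figure 3 for one input window
    of L frames and K action classes (plus background)."""
    return [
        ("input",   (3,     L,      112, 112)),
        ("pool5",   (512,   L // 8, 4,   4)),   # 3D ConvNets: time / 8
        ("CDC6",    (4096,  L // 4, 1,   1)),   # space 4x4 -> 1x1, time x2
        ("CDC7",    (4096,  L // 2, 1,   1)),   # time x2
        ("CDC8",    (K + 1, L,      1,   1)),   # time x2, K+1 channels
        ("softmax", (K + 1, L,      1,   1)),   # per-frame class scores
    ]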
Likewise, we can adapt FC7 as CDC7, as indicated in Figure 3. Additionally, we retain the ReLU layers and the Dropout layers (with 0.5 dropout ratio) from C3D, attached to both CDC6 and CDC7. CDC8 corresponds to FC8 but cannot be directly adapted from it, because the classes in FC8 and CDC8 are different. Since each channel stands for one class, CDC8 has K+1 channels. Finally, the CDC8 output is fed into a frame-wise softmax layer, Softmax, to produce per-frame scores.

During each mini-batch with N training segments, for the n-th segment the CDC8 output O_n has the shape (K+1, L, 1, 1). For each frame, performing the conventional softmax operation and computing the softmax loss and gradient are independent of other frames. For the t-th frame, the CDC8 output O_n[t] and the Softmax output P_n[t] are both vectors of K+1 values. Note that for the i-th class,

$$P_n^{(i)}[t] = \frac{e^{O_n^{(i)}[t]}}{\sum_{j=1}^{K+1} e^{O_n^{(j)}[t]}}.$$

The total loss is defined as

$$\mathcal{L} = \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{L} \left( -\log\left( P_n^{(z_n)}[t] \right) \right), \qquad (3)$$

where z_n stands for the ground-truth class label for the n-th segment. The total gradient w.r.t. the output of the i-th channel/class and the t-th frame in CDC8 is the summation over all N training segments of

$$\frac{\partial \mathcal{L}}{\partial O_n^{(i)}[t]} = \begin{cases} \frac{1}{N} \left( P_n^{(z_n)}[t] - 1 \right) & \text{if } i = z_n \\ \frac{1}{N}\, P_n^{(i)}[t] & \text{if } i \neq z_n \end{cases}. \qquad (4)$$
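As an illustration of Equations (3) and (4), here is a minimal NumPy sketch of the frame-wise softmax loss and its gradient; for brevity it assumes one ground-truth label per segment, while training actually uses per-frame labels (the per-frame case is analogous):

import numpy as np

def frame_softmax_loss(O, z):
    """O: (N, K+1, L) CDC8 outputs for N segments; z: (N,) labels.
    Returns the loss of Eq. (3) and the gradient of Eq. (4)."""
    N, _, L = O.shape
    O = O - O.max(axis=1, keepdims=True)        # numerical stability
    P = np.exp(O)
    P /= P.sum(axis=1, keepdims=True)           # per-frame softmax
    loss = -np.log(P[np.arange(N), z, :]).sum() / N
    dO = P / N                                  # Eq. (4): (1/N) * P
    dO[np.arange(N), z, :] -= 1.0 / N           # minus 1/N at the true class
    return loss, dO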
3.4. Training and prediction

Training data construction.

In theory, because both the convolutional filter and the CDC filter slide over the input, they can be applied to inputs of arbitrary size. Therefore, our CDC network can operate on videos of variable lengths. Due to GPU memory limitations, in practice we slide a temporal window of 32 frames without overlap over the video and feed each window individually into the CDC network to obtain dense predictions in time. From the temporal boundary annotations, we know the label of every frame. Frames in the same window can have different labels. To prevent including too many background frames in training, we only keep windows that have at least one frame belonging to an action. Therefore, given a set of training videos, we obtain a training collection of windows with frame-level labels.
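A minimal sketch of this window construction, assuming frame_labels is a per-frame label array for one untrimmed training video (all names are illustrative):

import numpy as np

def make_training_windows(frame_labels, win=32, background=0):
    """Slide non-overlapping 32-frame windows and keep only windows
    containing at least one action frame (the trailing partial window
    is dropped for simplicity)."""
    windows = []
    for start in range(0, len(frame_labels) - win + 1, win):
        labels = frame_labels[start:start + win]
        if np.any(labels != background):
            windows.append((start, labels))
    return windows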
Optimization.
We use stochastic gradient descent to train the CDC network with the aforementioned frame-wise softmax loss. Our implementation is based on Caffe [24] and C3D [61]. The learning rate is set to 0.00001 for all layers except for the CDC8 layer, where the learning rate is 0.0001, since CDC8 is randomly initialized. Following conventional settings [61, 47], we set the momentum to 0.9 and the weight decay to 0.005.

C3D [61] is trained on Sports-1M [27] and can be used to directly initialize conv1a to conv5b. CDC6 and CDC7 are initialized from FC6 and FC7, respectively, using the strategy described in Section 3.2. In addition, since FC8 in C3D and CDC8 in the CDC network have different numbers of channels, we randomly initialize CDC8. With such initialization, our CDC network turns out to be very easy to train and converges quickly, i.e., 4 training epochs (within half a day) on THUMOS'14.
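The initialization strategy of Section 3.2 can be sketched as follows (NumPy; shapes follow the FC6 example, and the function name is illustrative):

import numpy as np

def init_cdc_from_fc(W_fc, k_l):
    """Initialize a CDC layer from a pre-trained FC layer.
    W_fc: (4096, 512*4*4) FC6 weights, viewed as 4096 filters of shape
    (512, 1, 4, 4). Each FC filter initializes all k_l temporal sub-kernels
    of its corresponding CDC filter."""
    n_out = W_fc.shape[0]
    W_conv = W_fc.reshape(n_out, 512, 1, 4, 4)   # FC6 -> conv6
    W_cdc = np.repeat(W_conv, k_l, axis=2)       # conv6 -> CDC6 (k_l copies)
    return W_cdc                                 # (4096, 512, k_l, 4, 4)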
Fine-grained prediction and precise localization.
Duringtesting, after applying the CDC network on the whole video,e can make predictions for every frame of the video.Through thresholding on confidence scores and groupingadjacent frames of the same label, it is possible to cut thevideo into segments and produce localization results. Butthis method is not robust to noise, and designing tempo-ral smoothing strategies turns out to be ad hoc and non-trivial. Recently, researchers developed some efficient seg-ment proposal methods [47, 9] to generate a small set ofcandidate segments of high recall. Utilizing these propos-als for our localization model not only bypasses the chal-lenge of grouping adjacent frames, but also achieves con-siderable speedup during testing, because we only need toapply the CDC network on the proposal segments insteadof the whole video.Since these proposal segments only have coarse bound-aries, we propose using fine-grained predictions from theCDC network to localize precise boundaries. First, to lookat a wider interval, we extend each proposal segment’sboundaries on both sides by the percentage α of the orig-inal segment length. We set α to 1/8 for all experiments.Then, similar to preparing training segments, we slide tem-poral windows without overlap on the test videos. We onlyneed to keep test windows that overlap with at least one ex-tended proposal segment. We feed these windows into ourCDC network and generate per-frame action classes scores.The category of each proposal segment is set to the classwith the maximum average confidence score over all framesin the segment. If a proposal segment does not belong to thebackground class, we keep it and further refine its bound-aries. Given the score sequence of the predicted class inthe segment, we perform Gaussian kernel density estima-tion and obtain its mean µ and standard deviation σ . Start-ing from the boundary frame at each side of the extendedsegment and moving towards its middle, we shrink its tem-poral boundaries until we reach a frame with the confidencescore no lower than µ - σ . Finally, we set the predictionscore of the segment to the average confidence score of thepredicted class over frames in the refined segment of bound-aries. methods mAPSingle-frame CNN [51] 34.7Two-stream CNN [50] 36.2LSTM [7] 39.3MultiLSTM [73] 41.3C3D + LinearInterp 37.0Conv & De-conv 41.7CDC (fix 3D ConvNets) 37.4 CDC 44.4
Table 1. Per-frame labeling mAP on THUMOS’14 .
4. Experiments
We first demonstrate the effectiveness of our model in predicting accurate labels for every frame. Note that this task can accept an input of multiple frames to take temporal information into account. We denote our model as
CDC.

THUMOS'14 [25].

The temporal action localization task in THUMOS Challenge 2014 involves 20 actions. We use 2,755 trimmed training videos and 1,010 untrimmed validation videos (3,007 action instances) to train our model. For testing, we use all 213 test videos (3,358 action instances) that are not entirely background videos.
Evaluation metrics.
Following conventional metrics [73], we treat the per-frame labeling task as a retrieval problem. For each action class, we rank all frames in the test set by their confidence scores for that class and compute Average Precision (AP). Then we average over all classes to obtain the mean AP (mAP).
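A minimal sketch of this retrieval-style metric (non-interpolated AP; names are illustrative):

import numpy as np

def per_frame_map(scores, labels):
    """Per-frame labeling mAP. scores: (F, K) confidence per test frame;
    labels: (F,) ground-truth class indices."""
    aps = []
    for k in range(scores.shape[1]):
        order = np.argsort(-scores[:, k])       # rank frames by confidence
        hits = (labels[order] == k)
        if hits.sum() == 0:
            continue
        precision = np.cumsum(hits) / (np.arange(len(hits)) + 1)
        aps.append((precision * hits).sum() / hits.sum())  # AP for class k
    return float(np.mean(aps))                  # mAP over classes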
Comparisons.
In Table 1, we first compare our CDC network (denoted by CDC) with some state-of-the-art models (results are quoted from [73]): (1) Single-frame CNN: the frame-level 16-layer VGG CNN model [51]; (2) Two-stream CNN: the frame-level two-stream CNN model proposed in [50], which has one stream for pixels and one stream for optical flow; (3) LSTM: the basic per-frame labeling LSTM model with 512 hidden units [7] on top of the VGG CNN FC7 layer; (4) MultiLSTM: an LSTM model developed by Yeung et al. [73] to process multiple input frames together with a temporal attention mechanism and output predictions for multiple frames. Single-frame CNN only takes into account appearance information. Two-stream CNN models appearance and motion information separately. LSTM-based models can capture temporal dependencies across frames but do not model motion explicitly. Our CDC model is based on 3D convolutional layers and CDC layers, which can operate on the spatial and temporal dimensions simultaneously, achieving the best performance.

In addition, we compare CDC with other C3D-based approaches that use different upsampling methods. (1) C3D + LinearInterp: we train a segment-level C3D using the same set of training segments, whose segment-level labels are determined by majority vote; during testing we perform linear interpolation to upsample segment-level predictions to the frame level. (2) Conv & De-conv: CDC7 and CDC8 in our CDC network keep the spatial data shape unchanged and can therefore also be regarded as de-convolutional layers. For
CDC6, we replace it with a convolutional layer conv6 and a separate de-convolutional layer deconv6, as shown in Figure 2 (b). The CDC model outperforms these baselines because the CDC filter can simultaneously model high-level semantics and temporal action dynamics. We also evaluate the CDC network with fixed weights in the 3D ConvNets, fine-tuning only the CDC layers, which results in a minor performance drop. This implies that it is helpful to train CDC networks in an end-to-end manner, so that the 3D ConvNets part can be trained to summarize more discriminative information for the CDC layers to infer more accurate temporal dynamics.

IoU threshold           0.3   0.4   0.5   0.6   0.7
Karaman et al. [26]     0.5   0.3   0.2   0.2   0.1
Wang et al. [66]        14.6  12.1  8.5   4.7   1.5
Heilbron et al. [18]    -     -     13.5  -     -
Escorcia et al. [9]     -     -     13.9  -     -
Oneata et al. [39]      28.8  21.8  15.0  8.5   3.2
Richard and Gall [43]   30.0  23.2  15.2  -     -
Yeung et al. [74]       36.0  26.4  17.1  -     -
Yuan et al. [77]        33.6  26.1  18.8  -     -
S-CNN [47]              36.3  28.7  19.0  10.3  5.3
C3D + LinearInterp      36.0  26.4  19.6  11.1  6.6
Conv & De-conv          38.6  28.2  22.4  12.0  7.5
CDC (fix 3D ConvNets)   36.9  26.2  20.4  11.3  6.8
CDC                     40.1  29.4  23.3  13.1  7.9

Table 2. Temporal action localization mAP on THUMOS'14 as the overlap IoU threshold used in evaluation varies from 0.3 to 0.7. "-" indicates that results are unavailable in the corresponding papers.
Given the per-frame labeling results from the CDC network, we generate proposals, determine the class category, and predict precise boundaries following Section 3.4. Our approach is applicable to any segment proposal method. Here we conduct experiments on THUMOS'14, and thus employ the publicly available proposals generated by the S-CNN proposal network [47], which achieves high recall on THUMOS'14. Finally, we follow [73, 47] to perform standard post-processing steps such as non-maximum suppression.
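For reference, a minimal sketch of greedy temporal non-maximum suppression (the IoU threshold of 0.4 is an assumed value, not taken from the paper):

def temporal_nms(dets, thresh=0.4):
    """dets: list of (start, end, score). Keep the highest-scoring segments,
    suppressing any segment that overlaps an already-kept one too much."""
    def iou(a, b):
        inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
        union = (a[1] - a[0]) + (b[1] - b[0]) - inter
        return inter / union if union > 0 else 0.0
    keep = []
    for d in sorted(dets, key=lambda d: d[2], reverse=True):
        if all(iou(d, k) < thresh for k in keep):
            keep.append(d)
    return keep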
Evaluation metrics.
Localization performance is also evaluated by mAP. Each item in the ranked list is a predicted segment. A prediction is correct when it has the correct category and its temporal overlap IoU with the ground truth is larger than the threshold. Redundant detections of the same ground truth instance are not allowed.
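A minimal sketch of this evaluation rule for one action class (names are illustrative; each prediction may match at most one still-unmatched ground truth instance):

def evaluate_detections(preds, gts, iou_thresh=0.5):
    """preds: [(video, start, end, score)]; gts: {video: [(start, end)]}.
    Returns a true/false flag per prediction, ranked by score, which can be
    fed into the AP computation as in the per-frame case."""
    matched = set()
    flags = []
    for video, s, e, _ in sorted(preds, key=lambda p: -p[3]):
        hit = False
        for j, (gs, ge) in enumerate(gts.get(video, [])):
            if (video, j) in matched:
                continue                      # no redundant detections
            inter = max(0.0, min(e, ge) - max(s, gs))
            union = (e - s) + (ge - gs) - inter
            if union > 0 and inter / union > iou_thresh:
                matched.add((video, j))
                hit = True
                break
        flags.append(hit)
    return flags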
Comparisons.
As shown in Table 2, CDC achieves much better results than all the other state-of-the-art methods, which were reviewed in Section 2. Compared to the proposed CDC model: the typical approach of extracting a set of features to train SVM classifiers and then applying the trained classifiers on sliding windows or segment proposals (Karaman et al. [26], Wang et al. [66], Oneata et al. [39], Escorcia et al. [9]) does not directly address the temporal localization problem. Systems encoding iDTF with FV (Heilbron et al. [18], Richard and Gall [43]) cannot learn spatio-temporal patterns directly from raw videos to make predictions. RNN/LSTM-based methods (Yeung et al. [74], Yuan et al. [77]) are unable to explicitly capture motion information beyond temporal dependencies. S-CNN can effectively capture spatio-temporal patterns from raw videos but lacks the ability to adjust the boundaries of proposal candidates. With the proposed CDC filter, the CDC network can determine confidence scores at a fine granularity, beyond segment-level prediction, and hence precisely localize temporal boundaries. In addition, we employ the per-frame predictions of the other methods in Table 1 (C3D + LinearInterp, Conv & De-conv, CDC with fixed 3D ConvNets) to perform temporal localization based on S-CNN proposal segments. As shown in Table 2, the performance of the CDC network is still better, because more accurate predictions at the same temporal granularity can be used to predict a more accurate label and more precise boundaries for the same input proposal segment. In Figure 4, we illustrate how our model refines the boundaries of a segment proposal to precisely localize an action instance in time.
The necessity of predicting at a fine granularity in time.
In Figure 5, we compare CDC networks predicting action scores at different temporal granularities. When the temporal granularity increases, mAP increases accordingly. This demonstrates the importance of predicting at a fine granularity for achieving precise localization.
Efficiency analysis.
The CDC network is compact and demands little storage, because it can be trained directly from raw videos to make fine-grained predictions in an end-to-end manner, without the need to cache intermediate features. A typical CDC network, such as the example in Figure 3, only requires around 1GB of storage.

Our approach is also fast. Compared with segment-level prediction methods such as the S-CNN localization network [47], CDC has to perform more operations due to the need to make predictions at every frame. Therefore, when proposal segments are long, CDC is less efficient, in exchange for more accurate boundaries. But in the case of short proposal segments, since these proposals usually are densely overlapped, segment-level methods have to process a large number of segments one by one, whereas CDC networks only need to process each frame once and thus avoid redundant computations. On an NVIDIA Titan X GPU with 12GB memory, the speed of a CDC network is around 500 Frames Per Second (FPS), which means it can process a 20s long video clip of 25 FPS within one second.

Figure 4. Visualization of the process of refining temporal boundaries for a proposal segment. The horizontal axis stands for time. From top to bottom: (1) frame-level ground truths for a CliffDiving instance in an input video with some representative frames; (2) a corresponding proposal segment; (3) the proposal segment after extension; (4) the per-frame score of detecting CliffDiving predicted by the CDC network; (5) the predicted action instance after the refinement using CDC.

Figure 5. mAP gradually increases when the temporal granularity of the CDC network prediction increases from x1 (one label for every 8 frames) to x8 (one label per frame). Each point corresponds to a total upscaling factor in time (CDC6 upscaling factor x CDC7 upscaling factor x CDC8 upscaling factor): x1 (x1x1x1), x2 (x1x1x2), x4 (x1x2x2), x8 (x2x2x2). We conduct the evaluation on THUMOS'14 with IoU 0.5.
Temporal activity localization.
Furthermore, we found that our approach is also useful for localizing activities of high-level semantics and complex components. We conduct experiments on the ActivityNet Challenge 2016 dataset [17, 2], which involves 200 activities and contains around 10K training videos (15K instances) and 5K validation videos (7.6K instances). Each video has an average of 1.65 instances with temporal annotations. We train on the training videos and test on the validation videos. Since no activity proposal results of high quality exist, we apply the trained CDC network to the results of the first-place winner [68] in this Challenge to localize more precise boundaries. As shown in Table 3, they achieve high mAP when the IoU in evaluation is set to 0.5, but mAP drops rapidly when the evaluation IoU increases. After using the per-frame predictions of our CDC network to refine the temporal boundaries of their predicted segments, we gain significant improvements, particularly when the evaluation IoU is high (i.e., 0.75). This means that after the refinement, these segments have more precise boundaries and larger overlap with the ground truth instances.

mAP      0.5   0.75  0.95  Average-mAP
before   45.1  4.1   0.0   16.4
after    45.3  26.0  0.2   23.8

Table 3. Temporal localization mAP on ActivityNet Challenge 2016 [2] of Wang and Tao [68] before and after the refinement step using our CDC network. We follow the official metrics used in [2] to evaluate the average mAP.
5. Conclusion and future works
In this paper, we propose a novel CDC filter that simultaneously performs spatial downsampling (for spatio-temporal semantic abstraction) and temporal upsampling (for precise temporal localization), and design a CDC network to predict actions at the frame level. Our model significantly outperforms all other methods in both the per-frame labeling task and the temporal action localization task. Supplementary descriptions of the implementation details and additional experimental results are available in [46].
6. Acknowledgment
The project was supported by Mitsubishi Electric, and also by Award No. 2015-R2-CX-K025, awarded by the National Institute of Justice, Office of Justice Programs, U.S. Department of Justice. The opinions, findings, and conclusions or recommendations expressed in this publication are those of the author(s) and do not necessarily reflect those of the Department of Justice. The Tesla K40 used for this research was donated by the NVIDIA Corporation. We thank the Wei Family Private Foundation for their support of Zheng Shou, and the anonymous reviewers for their valuable comments.
7. Appendix
As mentioned in the paper, traditional approaches use segment-level detection, in which segment proposals are analyzed to predict the action class in each segment. Such approaches are limited by the fixed segment lengths and boundary locations, and are thus inadequate for finding precise action boundaries. Here we proposed a novel model to first predict actions at a fine level and then use such fine-grained score sequences to accurately detect action boundaries. The fine-grained score sequence also offers natural ways to determine the score threshold needed for refining boundaries at the frame level. Also, though not emphasized in the paper, the fine-level score sequence can be used to select precise keyframes or discover sub-actions within an action.

Following the reviewer's suggestion, we also computed the frame-to-frame score gradient using the frame-level detection results. As shown in Figure 6, the frame-level gradient peaks correlate nicely with the action boundaries, confirming the intuition of using the fine-level detection results. Also, as shown in Figure 5 in the paper, when the temporal granularity increases, localization performance increases accordingly. Finally, our motivation is quantitatively justified by the good results on two standard benchmarks, as shown in Section 4.
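A one-line sketch of this frame-to-frame gradient computation (NumPy):

import numpy as np

def boundary_signal(scores):
    """Absolute frame-to-frame score differences, as in Figure 6; peaks of
    this signal correlate with action boundaries. scores: (L,) per-frame
    confidence of the detected class."""
    return np.abs(np.diff(scores))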
Temporal boundary refinement.
Here, we provide details and pseudo-code for the temporal boundary refinement presented in Section 3.4. Algorithm 1 is used to refine the boundaries of each proposal segment. Also, our source code can be found at https://bitbucket.org/columbiadvmm/cdc .

Algorithm 1: Temporal Boundary Refinement. Input: a proposal segment with starting frame index t_s and ending frame index t_e; the percentage parameter α of segment length expansion; the first frame index v_s and the last frame index v_e of the video containing the proposal segment; and the total number of categories K. Output: the refined starting frame index t_s', the refined ending frame index t_e', the predicted category c, and the predicted confidence score s. Below we restate it as Python-style pseudocode; cdc_net, frames, and gaussian_kde_stats stand in for the actual implementation, and P has one column per category.

def refine_boundaries(t_s, t_e, alpha, v_s, v_e, cdc_net):
    # Extend boundaries on both sides by the percentage of the
    # original segment length
    length = t_e - t_s + 1
    t_s2 = max(v_s, t_s - int(alpha * length))
    t_e2 = min(v_e, t_e + int(alpha * length))
    # Feed frames into the CDC network to produce the confidence
    # score matrix P of shape (t_e2 - t_s2 + 1, K)
    P = cdc_net(frames(t_s2, t_e2))
    # Predicted category: maximum average confidence over all frames
    c = P.mean(axis=0).argmax()
    # Score threshold from Gaussian kernel density estimation
    mu, sigma = gaussian_kde_stats(P[:, c])
    beta = mu - sigma
    n = t_e2 - t_s2 + 1
    # Refine the starting time: first frame with score >= beta
    i_s = next(i for i in range(n) if P[i, c] >= beta)
    # Refine the ending time: last frame with score >= beta
    i_e = next(i for i in reversed(range(n)) if P[i, c] >= beta)
    # Prediction score: average confidence of class c over the
    # refined frames
    s = P[i_s:i_e + 1, c].mean()
    return t_s2 + i_s, t_s2 + i_e, c, s

Discussions about the window length used during creating mini-batches.
During mini-batch construction, ideally we would like to set the window length as long as possible, so that when CDC processes each window, it can take more temporal contextual information into account. However, due to the limitation of GPU memory, if the window length is too large, we have to make the number of training samples in each mini-batch very small, which makes the optimization unstable, so that the training procedure cannot converge well. Also, a long window usually contains many more background frames than action frames, so we would need to further handle the data imbalance issue. During experiments, we conducted a grid search over window lengths in {16, 32, 64, 128, 256, 512} and empirically found that setting the window length to 32 frames is a good trade-off on a single NVIDIA Titan X GPU with 12GB memory: (1) we can include sufficient temporal contextual information to achieve good accuracy, and (2) we can set the batch size to 8 to guarantee stable optimization.

Figure 6. We use the frame-level detection scores (3rd row) to compute the absolute frame-to-frame score differences (4th row), which show high correlations with the true action boundaries.
Sensitivity analysis.
When we extend the segment proposal, the percentage α of the original proposal length should not be too small, so that our model can consider a wider interval, and not too large, so as not to include too many irrelevant frames. As shown in Table 4, the system performs stably when α varies within a reasonable range.

Table 4. mAP on THUMOS'14 with the evaluation IoU set to 0.5 when we vary the extension percentage α of the original proposal length from 1/8 to 1/4.

Additional results on ActivityNet.
We expand the comparisons on the ActivityNet validation set to include results provided by additional top performers [54, 52] in ActivityNet Challenge 2016. As shown in Table 5, our method CDC outperforms all other methods. As shown in Table 6, CDC also performs best on the ActivityNet test set.

IoU threshold            0.5   0.75  0.95  Ave-mAP
Singh and Cuzzolin [54]  22.7  10.8  0.3   11.3
Singh [52]               26.0  15.2  2.6   14.6
Wang and Tao [68]        45.1  4.1   0.0   16.4
CDC                      45.3  26.0  0.2   23.8

Table 5. Additional baseline results of temporal localization mAP on ActivityNet Challenge 2016 [2] validation set. The baseline results are kindly provided by the authors of [54, 52, 68].

IoU threshold            0.5   0.75  0.95  Ave-mAP
Singh and Cuzzolin [54]  36.4  11.1  0.1   17.8
Singh [52]               28.7  17.8  2.9   17.7
Wang and Tao [68]        42.5  2.9   0.1   14.6
CDC (train)              43.1  25.6  0.2   22.9
CDC (train+val)          43.0  25.7  0.2   22.9

Table 6. Comparisons of temporal localization mAP on ActivityNet Challenge 2016 [2] test set. The baseline results are quoted from the ActivityNet Challenge 2016 leaderboard [2]. CDC (train) trains the CDC model on the training set only, and CDC (train+val) uses the training set and the validation set together to train the CDC model.
Discussions about other proposal methods.
As shown in Table 7, we evaluate the temporal localization performance of CDC based on other proposal methods on THUMOS'14.

IoU threshold               0.3   0.4   0.5   0.6   0.7
S-CNN [47] w/o CDC          36.3  28.7  19.0  10.3  5.3
ResC3D+S-CNN [47] w/o CDC   40.6  32.6  22.5  12.3  6.4
S-CNN [47]                  40.1  29.4  23.3  13.1  7.9
ResC3D+S-CNN [47]           41.3  30.7  24.7  14.3  8.8

Table 7. Temporal action localization mAP on THUMOS'14 as the overlap IoU threshold used in evaluation varies from 0.3 to 0.7. We evaluate our CDC model based on different proposal methods.

On ActivityNet, the proposals from [68] currently used in Section 4 are a reasonable choice: their recall is 0.681 with 56K proposals when evaluated at IoU = 0.5 on the validation set. We have also considered using other state-of-the-art proposal methods: (1) the ActivityNet challenge provides proposals computed by [18], but they have a low recall of 0.527 on the validation set with 441K proposals, which contain a lot of false alarms; (2) DAPs [9] advocates training the proposal model on THUMOS and then generalizing it to ActivityNet, but due to the lack of training data from ActivityNet, DAPs has a quite low recall of around 0.23 and is not a reasonable proposal candidate; (3) S-CNN [47] is designed for instance-level detection, whereas ground truth annotations in ActivityNet do not distinguish consecutive instances: one ground truth interval can contain multiple activity instances. Also, for activities of high-level semantics, it is ambiguous to define what constitutes an individual activity instance.
Therefore, S-CNN does not suit ActivityNet.

Figure 7. Visualization of the process of refining temporal boundaries for an action proposal segment. The horizontal axis stands for time. From top to bottom: (1) frame-level ground truths for a SoccerPenalty action instance in a test video with some representative frames; (2) a corresponding proposal segment; (3) the proposal segment after extension; (4) the per-frame score of being SoccerPenalty predicted by the CDC network; (5) the precisely predicted action instance after the refinement step using CDC.

Figure 8. Visualization of the process of refining temporal boundaries for an action proposal segment. The horizontal axis stands for time. From top to bottom: (1) frame-level ground truths for a JavelinThrow action instance in a test video with some representative frames; (2) a corresponding proposal segment; (3) the proposal segment after extension; (4) the per-frame score of being JavelinThrow predicted by the CDC network; (5) the precisely predicted action instance after the refinement step using CDC.
Additional discussions about speed.
To avoid confusion, we would like to emphasize that the CDC network itself is end-to-end, while the task of temporal localization is not end-to-end, due to the need to combine with proposals and perform post-processing. Throughout the paper, the speed is also computed for the CDC network itself. Following C3D [61], each input frame has spatial resolution 128 x 171 and is cropped to 112 x 112 as network input (random cropping during training and center cropping during testing). As indicated in Figure 3, each input video of L frames has the shape (3, L, 112, 112). As aforementioned, on a single NVIDIA Titan X GPU with 12GB memory, the speed of a CDC network is around 500 Frames Per Second (FPS), which means it can process a 20s long video clip of 25 FPS within one second.

Additional visualization examples.
As supplementary material to Figure 4, we provide additional examples showing the process of using the Convolutional-De-Convolutional (CDC) model to refine the boundaries of proposal segments and achieve precise temporal action localization on THUMOS'14 [25]. As shown in Figure 7 and Figure 8, the combination of the segment proposal and the CDC frame-level score prediction is powerful. The segment proposal allows for leveraging candidates with coarse boundaries to help handle noisy outliers in dipped score intervals, as shown in Figure 8. The proposed CDC model allows for fine-grained predictions at the frame level to help refine the segment boundaries for precise localization.
References

[1] Mexaction2. http://mexculture.cnam.fr/xwiki/bin/view/Datasets/Mex+action+dataset , 2015. 2
[2] Activitynet challenge 2016. http://activity-net.org/challenges/2016/ , 2016. 8, 10
[3] J. K. Aggarwal and M. S. Ryoo. Human activity analysis: A review. In ACM Computing Surveys, 2011. 2
[4] V. Badrinarayanan, A. Kendall, and R. Cipolla. Segnet: A deep convolutional encoder-decoder architecture for image segmentation.
TPAMI , 2016. 3[5] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, andA. L. Yuille. Semantic image segmentation with deep con-volutional nets and fully connected crfs. In
ICLR , 2015. 3[6] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, andA. L. Yuille. Deeplab: Semantic image segmentation withdeep convolutional nets, atrous convolution, and fully con-nected crfs. 2016. 3[7] J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach,S. Venugopalan, K. Saenko, and T. Darrell. Long-term recur-rent convolutional networks for visual recognition and de-scription. In
CVPR , 2015. 6[8] O. Duchenne, I. Laptev, J. Sivic, F. Bach, and J. Ponce. Auto-matic annotation of human actions in video. In
ICCV , 2007.2[9] V. Escorcia, F. C. Heilbron, J. C. Niebles, and B. Ghanem.Daps: Deep action proposals for action understanding. In
ECCV , 2016. 1, 3, 6, 7, 10[10] C. Feichtenhofer, A. Pinz, and A. Zisserman. Convolutionaltwo-stream network fusion for video action recognition. In
CVPR , 2016. 2[11] A. Gaidon, Z. Harchaoui, and C. Schmid. Actom sequencemodels for efficient action detection. In
CVPR , 2011. 2[12] A. Gaidon, Z. Harchaoui, and C. Schmid. Temporal local-ization of actions with actoms. In
TPAMI , 2013. 2[13] C. Gan, N. Wang, Y. Yang, D.-Y. Yeung, and A. G. Haupt-mann. Devnet: A deep event network for multimedia eventdetection and evidence recounting. In
CVPR , 2015. 2[14] G. Gkioxari and J. Malik. Finding action tubes. In
CVPR ,2015. 2[15] A. Gorban, H. Idrees, Y.-G. Jiang, A. R. Zamir, I. Laptev,M. Shah, and R. Sukthankar. THUMOS challenge: Actionrecognition with a large number of classes. , 2015. 1, 2[16] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learningfor image recognition. In
CVPR , 2016. 2[17] F. C. Heilbron, V. Escorcia, B. Ghanem, and J. C. Niebles.Activitynet: A large-scale video benchmark for human ac-tivity understanding. In
CVPR , 2015. 2, 8[18] F. C. Heilbron, J. C. Niebles, and B. Ghanem. Fast temporalactivity proposals for efficient detection of human actions inuntrimmed videos. In
CVPR , 2016. 1, 2, 7, 10[19] S. Hong, H. Noh, and B. Han. Decoupled deep neural net-work for semi-supervised semantic segmentation. In
NIPS ,2015. 3 [20] M. Jain, J. van Gemert, H. J´egou, P. Bouthemy, andC. Snoek. Action localization with tubelets from motion.In
CVPR , 2014. 2[21] M. Jain, J. van Gemert, T. Mensink, and C. Snoek. Ob-jects2action: Classifying and localizing actions without anyvideo example. In
ICCV , 2015. 2[22] M. Jain, J. van Gemert, and C. Snoek. What do 15,000 objectcategories tell us about classifying and localizing actions? In
CVPR , 2015. 2[23] H. J´egou, M. Douze, C. Schmid, and P. P´erez. Aggregatinglocal descriptors into a compact image representation. In
CVPR , 2010. 2[24] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Gir-shick, S. Guadarrama, and T. Darrell. Caffe: Convolutionalarchitecture for fast feature embedding. In
ACM MM , 2014.5[25] Y.-G. Jiang, J. Liu, A. R. Zamir, G. Toderici, I. Laptev,M. Shah, and R. Sukthankar. THUMOS challenge: Ac-tion recognition with a large number of classes. http://crcv.ucf.edu/THUMOS14/ , 2014. 1, 2, 6, 11[26] S. Karaman, L. Seidenari, and A. D. Bimbo. Fast saliencybased pooling of fisher encoded dense trajectories. In
ECCVTHUMOS Workshop , 2014. 1, 2, 7[27] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar,and L. Fei-Fei. Large-scale video classification with convo-lutional neural networks. In
CVPR , 2014. 2, 5[28] A. Kl¨aser, M. Marszałek, C. Schmid, and A. Zisserman. Hu-man focused action localization in video. In
Trends and Top-ics in Computer Vision , 2012. 2[29] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenetclassification with deep convolutional neural networks. In
NIPS , 2012. 2[30] I. Laptev and P. P´erez. Retrieving actions in movies. In
ICCV , 2007. 2[31] C. Lea, A. Reiter, R. Vidal, and G. D. Hager. Segmentalspatiotemporal cnns for fine-grained action segmentation. In
ECCV , 2016. 2, 3[32] G. Lin, C. Shen, A. van den Hengel, and I. Reid. Efficientpiecewise training of deep structured models for semanticsegmentation. In
CVPR , 2016. 3[33] S. Liu, X. Qi, J. Shi, H. Zhang, and J. Jia. Multi-scale patchaggregation (mpa) for simultaneous detection and segmenta-tion. In
CVPR , 2016. 3[34] J. Long, E. Shelhamer, and T. Darrell. Fully convolutionalnetworks for semantic segmentation. In
CVPR , 2015. 2, 3, 4[35] M. A. M. Rohrbach, S. Amin and B. Schiele. A databasefor fine grained activity detection of cooking activities. In
CVPR, 2012. 2
[36] P. Mettes, J. van Gemert, and C. Snoek. Spot on: Action localization from pointly-supervised proposals. In
ECCV ,2016. 1[37] H. Noh, S. Hong, and B. Han. Learning deconvolution net-work for semantic segmentation. In
ICCV , 2015. 3[38] D. Oneata, J. Verbeek, and C. Schmid. Action and eventrecognition with fisher vectors on a compact feature set. In
ICCV , 2013. 2[39] D. Oneata, J. Verbeek, and C. Schmid. The lear submissionat thumos 2014. In
ECCV THUMOS Workshop , 2014. 1, 2,7[40] F. Perronnin, J. S´anchez, and T. Mensink. Improving thefisher kernel for large-scale image classification. In
ECCV ,2010. 2[41] R. Poppe. A survey on vision-based human action recogni-tion. In
Image and vision computing , 2010. 2[42] M. M. Puscas, E. Sangineto, D. Culibrk, and N. Sebe. Un-supervised tube extraction using transductive learning anddense trajectories. In
ICCV , 2015. 2[43] A. Richard and J. Gall. Temporal action detection using astatistical language model. In
CVPR , 2016. 1, 2, 7[44] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh,S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein,A. C. Berg, and L. Fei-Fei. ImageNet Large Scale VisualRecognition Challenge.
IJCV , 2015. 2[45] E. Shelhamer, J. Long, and T. Darrell. Fully convolutionalnetworks for semantic segmentation.
TPAMI , 2016. 2, 3, 4[46] Z. Shou, J. Chan, A. Zareian, K. Miyazawa, and S.-F. Chang.Cdc: Convolutional-de-convolutional networks for precisetemporal action localization in untrimmed videos. arXivpreprint arXiv:1703.01515 , 2017. 8[47] Z. Shou, D. Wang, and S.-F. Chang. Temporal action local-ization in untrimmed videos via multi-stage cnns. In
CVPR ,2016. 1, 2, 3, 5, 6, 7, 10, 12[48] G. A. Sigurdsson, O. Russakovsky, A. Farhadi, I. Laptev, andA. Gupta. Much ado about time: Exhaustive annotation oftemporal data. In
HCOMP , 2016. 2[49] G. A. Sigurdsson, G. Varol, X. Wang, A. Farhadi, I. Laptev,and A. Gupta. Hollywood in homes: Crowdsourcing datacollection for activity understanding. In
ECCV , 2016. 2[50] K. Simonyan and A. Zisserman. Two-stream convolutionalnetworks for action recognition in videos. In
NIPS , 2014. 2,6[51] K. Simonyan and A. Zisserman. Very deep convolutionalnetworks for large-scale image recognition. In
InternationalConference on Learning Representations , 2015. 2, 6[52] B. Singh. Action detection. In
CVPR ActivityNet Workshop ,2016. 10[53] B. Singh, T. K. Marks, M. Jones, O. Tuzel, and M. Shao. Amulti-stream bi-directional recurrent neural network for fine-grained action detection. In
CVPR , 2016. 2[54] G. Singh and F. Cuzzolin. Untrimmed classification foractivity detection: submission to activitynet challenge. In
CVPR ActivityNet Workshop , 2016. 1, 2, 10[55] K. Soomro, H. Idrees, and M. Shah. Action localization invideos through context walk. In
ICCV , 2015. 2 [56] K. Soomro, H. Idrees, and M. Shah. Predicting the where andwhat of actors and actions through online action localization.In
CVPR , 2016. 2[57] A. Stoian, M. Ferecatu, J. Benois-Pineau, and M. Crucianu.Fast action localization in large scale video archives. In
TCSVT , 2015. 2[58] A. Stoian, M. Ferecatu, J. Benois-Pineau, and M. Crucianu.Scalable action localization with kernel-space hashing. In
ICIP , 2015. 2[59] W. Sultani and M. Shah. What if we do not have multiplevideos of the same action? – video action localization usingweb images. In
CVPR , 2016. 2[60] C. Sun, S. Shetty, R. Sukthankar, and R. Nevatia. Tempo-ral localization of fine-grained actions in videos by domaintransfer from web images. In
ACM MM , 2015. 3[61] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri.Learning spatiotemporal features with 3d convolutional net-works. In
ICCV , 2015. 1, 2, 3, 5, 10[62] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri.Deep end2end voxel2voxel prediction. In
CVPR Workshopon Deep Learning in Computer Vision , 2016. 2, 3, 4[63] J. van Gemert, M. Jain, E. Gati, and C. Snoek. Apt: Actionlocalization proposals from dense trajectories. In
BMVC ,2015. 2[64] H. Wang, A. Kl¨aser, C. Schmid, and C.-L. Liu. ActionRecognition by Dense Trajectories. In
CVPR , 2011. 2[65] H. Wang and C. Schmid. Action recognition with improvedtrajectories. In
ICCV , 2013. 2[66] L. Wang, Y. Qiao, and X. Tang. Action recognition and de-tection by combining motion and appearance features. In
ECCV THUMOS Workshop , 2014. 1, 2, 7[67] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, andL. V. Gool. Temporal segment networks: Towards good prac-tices for deep action recognition. In
ECCV , 2016. 2[68] R. Wang and D. Tao. Uts at activitynet 2016. In
CVPRActivityNet Workshop , 2016. 1, 2, 8, 10[69] D. Weinland, R. Ronfard, and E. Boyer. A survey of vision-based methods for action representation, segmentation andrecognition. In
Computer Vision and Image Understanding ,2011. 2[70] P. Weinzaepfel, Z. Harchaoui, and C. Schmid. Learning totrack for spatio-temporal action localization. In
ICCV , 2015.2[71] C. Xu and J. J. Corso. Actor-action semantic segmentationwith grouping process models. In
CVPR , 2016. 2[72] Z. Xu, Y. Yang, and A. G. Hauptmann. A discriminative cnnvideo representation for event detection. In
CVPR , 2015. 2[73] S. Yeung, O. Russakovsky, N. Jin, M. Andriluka, G. Mori,and L. Fei-Fei. Every moment counts: Dense detailedlabeling of actions in complex videos. arXiv preprintarXiv:1507.05738 , 2015. 3, 6, 7[74] S. Yeung, O. Russakovsky, G. Mori, and L. Fei-Fei. End-to-end learning of action detection from frame glimpses invideos. In
CVPR , 2016. 1, 3, 7[75] F. Yu and V. Koltun. Multi-scale context aggregation by di-lated convolutions. In
ICLR , 2016. 376] G. Yu and J. Yuan. Fast action proposals for human actiondetection and search. In
CVPR , 2015. 2[77] J. Yuan, B. Ni, X. Yang, and A. Kassim. Temporal actionlocalization with pyramid of score distribution features. In
CVPR , 2016. 3, 7[78] M. Zeiler and R. Fergus. Visualizing and understanding con-volutional networks. In
ECCV , 2014. 3[79] M. Zeiler, D. Krishnan, G. W. Taylor, and R. Fergus. Decon-volutional networks. In
CVPR , 2010. 3[80] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet,Z. Su, D. Du, C. Huang, and P. H. S. Torr. Conditional ran-dom fields as recurrent neural networks. In