TTPP: Temporal Transformer with Progressive Prediction for Efficient Action Anticipation
Wen Wang, Xiaojiang Peng, Yanzhou Su, Yu Qiao, Jian Cheng
Abstract—Video action anticipation aims to predict future action categories from observed frames. Current state-of-the-art approaches mainly resort to recurrent neural networks to encode history information into hidden states, and predict future actions from these hidden representations. It is well known that the recurrent pipeline is inefficient at capturing long-term information, which may limit its performance on prediction tasks. To address this problem, this paper proposes a simple yet efficient Temporal Transformer with Progressive Prediction (TTPP) framework, which repurposes a Transformer-style architecture to aggregate observed features, and then leverages a light-weight network to progressively predict future features and actions. Specifically, predicted features along with predicted probabilities are accumulated into the inputs of subsequent predictions. We evaluate our approach on three action datasets, namely TVSeries, THUMOS-14, and TV-Human-Interaction. Additionally, we conduct a comprehensive study of several popular aggregation and prediction strategies. Extensive results show that TTPP not only outperforms the state-of-the-art methods but is also more efficient.
Index Terms —Action anticipation, Transformer, Encoder-Decoder.
I. INTRODUCTION
Human action anticipation, also known as early action prediction, aims to predict future unseen actions and is one of the main topics in video understanding, with wide applications in security, visual surveillance, human-computer interaction, etc. In contrast to the well-studied action recognition task, which infers the action label after observing the entire action execution, action anticipation must predict human actions early, without observing the future action execution. It is a very challenging task because the input videos are temporally incomplete with a wide variety of irrelevant background content, and decisions must be made based on such incomplete action executions. Overall, action anticipation needs to overcome all the difficulties of action recognition and, in addition, capture sufficient historical and contextual information to make future predictions in untrimmed video streams.

Generally, most action anticipation approaches can be divided into two key phases, namely observed information aggregation and future prediction, as shown in Figure 1.
Wen Wang, Yanzhou Su and Jian Cheng are with the School of Information and Communication Engineering, University of Electronic Science and Technology of China, Chengdu, Sichuan, China, 611731. Xiaojiang Peng and Yu Qiao are with the ShenZhen Key Lab of Computer Vision and Pattern Recognition, SIAT-SenseTime Joint Lab, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; SIAT Branch, Shenzhen Institute of Artificial Intelligence and Robotics for Society. This work was done when Wen Wang was an intern at Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences. Corresponding author: Xiaojiang Peng ([email protected])
Fig. 1: The summarized generic flowchart for action anticipation, which mainly consists of observed information aggregation and future prediction.

Early works on action anticipation focus on trimmed action videos, and mainly make efforts to extract discriminative features from partial videos, i.e., observed information aggregation, for early action prediction [1], [28], [30], [42], [57]. In the deep learning era, recent works turn to predicting future actions in practical untrimmed video streams [11], [13], [56], [62], and mainly repurpose sequential models from the natural language processing (NLP) domain, like long short-term memory (LSTM) [15] and gated recurrent neural networks [6]. For instance, Gao et al. [13] propose a Reinforced Encoder-Decoder network, which utilizes an encoder-decoder LSTM network [36], [59] to aggregate historical features and predict future features or actions. Xu et al. [62] propose an LSTM-based temporal recurrent network to predict future features for both online action detection and action anticipation. Though encoder-decoder recurrent networks can be easily transferred from the NLP domain to temporal action anticipation, their inherent sequential nature precludes parallelization within training examples and limits the memory power for longer sequence lengths [52]. Moreover, they are known to bring limited improvements in other action understanding tasks [12], [60].

In this paper, we address these two issues of action anticipation via a simple yet efficient Temporal Transformer with Progressive Prediction (TTPP) framework. TTPP repurposes a Transformer-style module to aggregate observed information and leverages a light-weight network to progressively predict future features and actions. Specifically, TTPP contains a Temporal Transformer Module (TTM) and an elaborately-designed Progressive Prediction Module (PPM). Given historical and current features, the TTM aggregates the historical features based solely on self-attention mechanisms with the current feature as query, which is inspired by the Transformer in machine translation [52]. The aggregated historical feature along with the current feature are then fed into the PPM. The PPM is comprised of an initial prediction block and a shared-parameter progressive prediction block, each of which is built with two fully-connected (FC) layers, a ReLU activation [38] and a layer normalization (LN) [2]. With the output feature of TTM, the initial prediction block of PPM predicts the immediately following clip feature and action probabilities. The progressive prediction block accumulates the former predictions and the output of TTM, and further predicts a few subsequent future features and actions. The whole TTPP model can be jointly trained in an end-to-end manner with supervision from ground-truth future features and action labels. Compared to previous encoder-decoder methods, the benefits of our TTPP are two-fold. First, the temporal Transformer is more efficient than recurrent methods at capturing historical context by self-attention. Second, the progressive prediction module, with skip connections to aggregated historical features, can efficiently deliver temporal information and helps long-term anticipation.

We evaluate our approach on three widely-used action anticipation datasets, namely TVSeries [7], THUMOS-14 [20], and TV-Human-Interaction [39].
Additionally, we conduct a comprehensive study of several popular aggregation and prediction strategies, including temporal convolution, LSTM and single-shot prediction, etc. Extensive results show that TTPP is more efficient than the state-of-the-art methods in both training and inference, and outperforms them by a large margin.

The main contributions of this work can be summarized as follows.
• We propose a simple yet efficient TTPP framework for action anticipation, which leverages a Transformer-style architecture to aggregate information and a light-weight module to predict future actions.
• We elaborately design a progressive prediction module for predicting future features and actions, and achieve state-of-the-art performance on TVSeries, THUMOS-14, and TV-Human-Interaction.
• We conduct a comprehensive study of several popular aggregation and prediction strategies, including aggregation methods such as temporal convolution and Encoder-LSTM, and prediction methods such as Decoder-LSTM and single-shot prediction.

The rest of this paper is organized as follows: We first review related works in Section II. Section III describes the proposed framework, with the TTM to aggregate observed features and the PPM to progressively predict future actions. Afterwards, we show our experimental results on several datasets in Section IV and conclude the paper in Section V.
II. RELATED WORK
Action recognition. Action recognition is an important branch of video-related research and has been extensively studied in the past decades. Existing methods are mainly developed to extract discriminative action features from temporally complete action videos. These methods can be roughly categorized into hand-crafted feature based approaches and deep learning based approaches. Early methods such as Improved Dense Trajectory (IDT) mainly adopt hand-crafted features, such as HOF [32], HOG [32] and MBH [55]. Recent studies demonstrate that action features can be learned by deep learning methods such as convolutional neural networks (CNN) and recurrent neural networks (RNN). The two-stream network [46], [57] learns appearance and motion features from RGB frames and optical flow fields separately. RNNs, such as long short-term memory (LSTM) [15] and gated recurrent units (GRU) [6], have been used to model long-term temporal correlations and motion information in videos, and to generate video representations for action classification. A CNN+LSTM model, which uses a CNN to extract frame features and an LSTM to integrate features over time, has also been used to recognize activities in videos [9]. The C3D [10] network simultaneously captures appearance and motion features using a series of 3D convolutional layers. Recently, I3D [4] networks use two-stream CNNs with inflated 3D convolutions on both dense RGB and optical flow sequences to achieve state-of-the-art performance on the Kinetics dataset [24].
Action anticipation. Many works have been proposed to exploit partially observed videos for early action prediction or future action anticipation. Hoai et al. [17] propose a max-margin framework with structured SVMs to solve this problem. Ryoo et al. [42] develop an early action prediction system by observing evidence from temporally accumulated features. Lan et al. [31] design a coarse-to-fine hierarchical representation to capture discriminative human movement at different levels, and use a max-margin framework for the final prediction. Cao et al. [64] formulate the action prediction problem in a probabilistic framework, which aims to maximize the posterior of the activity given observed frames. In their work, the likelihood is computed by feature reconstruction error using sparse coding; however, it suffers from high computational complexity as the inference is performed on the entire training data. Vondrick et al. [54] present a framework that uses large-scale unlabeled data to predict a rich visual representation of the future, and apply it to anticipating both actions and objects. Kong et al. [29] propose a combined CNN and LSTM along with a memory module to record "hard-to-predict" samples, and benchmark their results on the UCF101 [49] and Sports-1M [23] datasets. Gao et al. [13] propose a Reinforced Encoder-Decoder (RED) network for action anticipation, which uses reinforcement learning to encourage the model to make correct anticipations as early as possible. Ke et al. [27] propose an attended temporal feature, which uses multi-scale temporal convolutions to process the time-conditioned observation. In this work, we focus only on recent results on anticipation of action labels; more details can be found in [13] and [62].
Online action detection. Online action detection is usually solved as an online per-frame labelling task on streaming videos, which requires correctly classifying every frame without accessing future frames. De Geest et al. [7] first introduce the problem with a realistic dataset, i.e., TVSeries, and benchmark existing models. They show that a simple LSTM approach is not sufficient for online action detection, and is even worse than the traditional pipeline of improved trajectories, Fisher vectors and SVMs. Their later work [8] introduces a two-stream feedback network, where one stream processes the input and the other models the temporal relations. Gao et al. [13] propose a Reinforced Encoder-Decoder network for action anticipation and treat online action detection as a special case of their framework. Xu et al. [62] propose the Temporal Recurrent Network (TRN) to model the temporal context by simultaneously performing online action detection and anticipation. Besides, Shou et al. [45] address online detection of action start (ODAS) by encouraging a classification network to learn the representation of action start windows.

Fig. 2: The flowchart of our TTPP method for action anticipation. Given a continuous video sequence, an encoder network is first used to map each video clip to clip features, then a Temporal Transformer Module is proposed to aggregate observed clip features, and finally a Progressive Prediction Module is designed for future action anticipation. Note that the future information in the red dashed box is only used in the training stage and the classifier is applied to each prediction.
Attention for video understanding. The attention mechanism, which directly models long-term interactions with self-attention, has led to state-of-the-art models for action understanding tasks such as video-based and skeleton-based action recognition [14], [34], [35], [44], [48]. Our work is related to the recent Video Action Transformer Network [14], which uses the Transformer architecture as the "head" of a detection framework. Specifically, it uses the ROI-pooled I3D feature of a target region as query and aggregates contextual information from the spatio-temporal feature maps of an input video clip. Our work differs from it in the following aspects: (1) Our problem is different from spatio-temporal action detection. To the best of our knowledge, we are the first to use a Transformer architecture for action anticipation. (2) We have task-specific considerations. For instance, our Transformer unit takes the current frame feature as query and the historical frame features as memory. (3) We elaborately design a light-weight progressive prediction module for efficient action anticipation.

III. OUR APPROACH
In this section, we present our temporal Transformer with progressive prediction for the action anticipation task. We propose two modules: a temporal Transformer module (TTM) to aggregate observed information and a progressive prediction module (PPM) to anticipate future actions.
A. Problem Formulation
The action anticipation task aims to predict the action class $y$ for each frame in the future from an observed action video $V$. More formally, let $V_L = [I_1, I_2, \ldots, I_L]$ be a video with $L$ frames. Given the first $t$ frames $V_t = [I_1, I_2, \ldots, I_t]$, the task is to predict the actions happening from frame $t+1$ to $L$. That is, we aim to assign action labels $y_{t+1}^{L} = [y_{t+1}, y_{t+2}, \ldots, y_L]$ to each of the unobserved frames.

B. Overall Framework
Two crucial issues of action anticipation are i) how to aggregate observed information and ii) how to predict future actions. We address these two issues with a simple yet efficient framework, termed Temporal Transformer with Progressive Prediction (TTPP). As illustrated in Figure 2, a long video is first segmented into multiple non-overlapping chunks $[I'_1, I'_2, \ldots, I'_t]$, with each chunk containing an equal number of consecutive frames. Then, an encoder network $g_{enc}$ maps each video chunk into a representation $f_t = g_{enc}(I'_t)$. More details on video pre-processing and feature extraction are presented in Section IV-C. Subsequently, a Temporal Transformer Module (TTM), $g_{ttm}$, temporally aggregates $t$ consecutive chunk representations into a historical representation $S_t = g_{ttm}(f_1, f_2, \ldots, f_t)$. Finally, a Progressive Prediction Module (PPM) progressively predicts future features and actions. The PPM is comprised of an initial prediction block, $g^0_{pred}$, and a shared-parameter progressive prediction block, $g_{pred}$. $g^0_{pred}$ takes $S_t$ as input and predicts the immediately following clip feature and action probability. $g_{pred}$ accumulates the former predictions and $S_t$, and further predicts a few subsequent future features and actions.

C. Temporal Transformer Module (TTM)
Transformer revisited. The Transformer was originally proposed to replace traditional recurrent models for machine translation [52]. Its core idea is to model the correlation between contextual signals by an attention mechanism. Specifically, it encodes the input sequence into a higher-level representation by modeling the relationship between queries ($Q$) and memory (keys ($K$) and values ($V$)) with

$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left(\frac{QK^T}{\sqrt{d_m}}\right)V, \quad (1)$$

where $Q \in \mathbb{R}^{L_q \times d_m}$, $K \in \mathbb{R}^{L_k \times d_m}$ and $V \in \mathbb{R}^{L_k \times d_v}$. This architecture becomes "self-attention" with $Q = K = V = \{f_1, f_2, \cdots, f_T\}$, which is also known as the non-local network [58]. A self-attention module maps the sequence to a higher-level representation like RNNs, but without recurrence.

Temporal Transformer. To efficiently aggregate observed information, our TTPP framework resorts to a Transformer-style architecture, termed the Temporal Transformer Module (TTM). The TTM takes as input the video chunk features and maps them into a query feature and memory features. For online action anticipation, considering that the last observed feature $f_t$ is likely the most relevant one to the future actions, we use $f_t$ as the query of TTM. The memory of TTM is intuitively set to the historical features $[f_1, f_2, \ldots, f_{t-1}]$. Formally, the query and memory are

$$Q = f_t, \quad K = V = [f_1, f_2, \ldots, f_{t-1}]. \quad (2)$$

Since temporal information is lost in the attention operation, we add the positional encoding [52] to the input representations. Given sequence features $f_{in} = [f_1, f_2, \ldots, f_T] \in \mathbb{R}^{T \times d_m}$, the $i$-th value of the positional vector at temporal position $pos$ is defined as

$$PE_{(pos, i)} = \begin{cases} \sin(pos / 10000^{i/d_m}) & \text{if } i \text{ is even} \\ \cos(pos / 10000^{i/d_m}) & \text{otherwise} \end{cases}. \quad (3)$$

The original feature vector $f_{pos}$ is then updated by $f_{pos} = f_{pos} + PE(pos, :)$, which provides information about the temporal position of each clip feature.
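To make Eq. (3) concrete, the following is a minimal sketch of the sinusoidal positional encoding, written in PyTorch (which the paper uses for its experiments); the function name and tensor layout are our own illustrative choices, not the authors' released code.

```python
import torch

def positional_encoding(T: int, d_m: int) -> torch.Tensor:
    """Sinusoidal positional encoding of Eq. (3): sine for even feature
    indices, cosine for odd ones; the result is added to chunk features."""
    pos = torch.arange(T, dtype=torch.float32).unsqueeze(1)   # (T, 1)
    i = torch.arange(d_m, dtype=torch.float32).unsqueeze(0)   # (1, d_m)
    angle = pos / torch.pow(10000.0, i / d_m)                 # (T, d_m)
    even = (torch.arange(d_m) % 2 == 0).unsqueeze(0)          # even-index mask
    return torch.where(even, torch.sin(angle), torch.cos(angle))
```

A chunk-feature sequence f of shape (T, d_m) is then encoded as f + positional_encoding(T, d_m).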
To model complicated action videos, our TTM further leverages the multi-head attention mechanism:

$$A_t = \mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(h_1, \ldots, h_n)W^O, \quad \text{where } h_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V), \quad (4)$$

where $n$ is the number of attention heads, $W_i^Q \in \mathbb{R}^{d_m \times d_q}$, $W_i^K \in \mathbb{R}^{d_m \times d_k}$ and $W_i^V \in \mathbb{R}^{d_m \times d_v}$ are the linear projection parameters of the $i$-th attention head, and $W^O \in \mathbb{R}^{nd_v \times d_m}$ is the projection matrix that reduces the dimension of the concatenated attention vector. For each head, we use $d_k = d_q = d_v = d_m / n$. Considering the importance of $f_t$ for anticipation, we view $A_t$ as extra information and add it to the original query feature via a shortcut connection. The final output feature of TTM is $S_t = A_t + f_t$, with dimension $d_m$.
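The TTM computation of Eqs. (2) and (4) can be sketched in a few lines of PyTorch; the class name, the use of nn.MultiheadAttention, and the default dimensions are our assumptions for illustration rather than the authors' released implementation.

```python
import torch
import torch.nn as nn

class TemporalTransformerModule(nn.Module):
    """TTM sketch: the current chunk feature f_t is the query, earlier
    chunk features are the memory (keys and values), as in Eq. (2)."""
    def __init__(self, d_m: int = 4096, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_m, n_heads, batch_first=True)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, T, d_m), positional encoding already added
        q = feats[:, -1:, :]             # query: current feature f_t
        mem = feats[:, :-1, :]           # memory: f_1 .. f_{t-1}
        a_t, _ = self.attn(q, mem, mem)  # multi-head attention, Eq. (4)
        return (a_t + q).squeeze(1)      # shortcut connection: S_t = A_t + f_t
```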
Fig. 3: The illustration of our PPM. It consists of an initial prediction block highlighted in blue and a shared-parameter progressive prediction block highlighted in yellow, where each block is built with two fully-connected layers followed by a ReLU activation. In addition, we use layer normalization and dropout to improve regularization.

D. Progressive Prediction Module (PPM)
Partially inspired by WaveNet [51], we design a Progressive Prediction Module (PPM) to better exploit the aggregated historical knowledge for future prediction. As illustrated in Figure 3, the PPM is comprised of an initial prediction block and a shared-parameter progressive prediction block, where each block is built with two fully-connected (FC) layers, a ReLU activation [38] and a layer normalization (LN) [2].

Assume we predict $l$ steps into the future, from time $t+1$ to $t+l$. At the first time step $t+1$, the initial prediction block takes as input the aggregated historical representation $S_t \in \mathbb{R}^{d_m}$ and predicts the feature $f'_{t+1} \in \mathbb{R}^{d_m}$ and action probability $p'_{t+1} \in \mathbb{R}^C$. Formally, this block computes

$$p_t = \mathrm{Softmax}(W_c f_t), \quad (5)$$
$$f'_{t+1} = g^0_{pred}(S_t \oplus f_t \oplus p_t), \quad (6)$$
$$p'_{t+1} = \mathrm{Softmax}(W_c f'_{t+1}), \quad (7)$$

where $W_c$ is the multi-class ($C$ action classes) action classifier. At each later time step $t+i$ ($i > 1$), the previously predicted embedding $f'_{t+i-1}$ and action probability $p'_{t+i-1}$ are first concatenated channel-wise with $S_t$, and then fed into the progressive prediction block. Formally, this block is defined as

$$f'_{t+i} = g_{pred}(S_t \oplus f'_{t+i-1} \oplus p'_{t+i-1}), \quad (8)$$
$$p'_{t+i} = \mathrm{Softmax}(W_c f'_{t+i}), \quad (9)$$

where '$\oplus$' denotes the concatenation operation. Due to the concatenation, the input dimension of the progressive prediction block is $2d_m + C$. For both blocks, we use two fully-connected (FC) layers, with the first FC reducing the input dimension to $d_m$ and the second FC generating an output vector of dimension $d_m$. It is worth noting that the different steps of the progressive prediction block share parameters; thus, the whole PPM is a light-weight network.
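A minimal PyTorch sketch of the PPM (Eqs. (5)-(9)) is given below; the exact ordering of ReLU, LayerNorm and dropout inside each block and the dropout rate are our assumptions based on Figure 3, not a verified reproduction.

```python
import torch
import torch.nn as nn

class ProgressivePredictionModule(nn.Module):
    """PPM sketch: an initial block g_pred^0 and a progressive block
    g_pred whose parameters are shared across all future steps."""
    def __init__(self, d_m: int, n_classes: int, steps: int):
        super().__init__()
        def block():  # two FC layers + ReLU + LayerNorm (+ dropout)
            return nn.Sequential(
                nn.Linear(2 * d_m + n_classes, d_m), nn.ReLU(),
                nn.LayerNorm(d_m), nn.Dropout(0.5),  # dropout rate assumed
                nn.Linear(d_m, d_m))
        self.g_pred0 = block()                       # initial prediction block
        self.g_pred = block()                        # shared progressive block
        self.classifier = nn.Linear(d_m, n_classes)  # W_c
        self.steps = steps

    def forward(self, s_t, f_t):
        p = torch.softmax(self.classifier(f_t), -1)       # Eq. (5)
        f = self.g_pred0(torch.cat([s_t, f_t, p], -1))    # Eq. (6)
        preds = []
        for _ in range(self.steps):
            p = torch.softmax(self.classifier(f), -1)     # Eqs. (7), (9)
            preds.append((f, p))
            f = self.g_pred(torch.cat([s_t, f, p], -1))   # Eq. (8)
        return preds  # l pairs of (predicted feature, probability)
```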
Training. Our TTPP framework is trained in an end-to-end manner with supervision on the PPM module. Specifically, we use two types of loss functions, namely a feature reconstruction loss $L_r$ and a classification loss $L_c$. $L_r$ is the mean squared error (MSE) between predicted features and ground-truth features,

$$L_r = \sum_{i=1}^{l} \| f'_{t+i} - f_{t+i} \|^2. \quad (10)$$

$L_c$ is the sum of the cross-entropy loss (CE) over all prediction steps,

$$L_c = -\sum_{i=1}^{l} \sum_{j=1}^{C} y_{(t+i, j)} \log p'_{(t+i, j)}, \quad (11)$$

where $y_{(t+i, :)}$ is the one-hot ground-truth vector at time $t+i$. The total loss is

$$L = L_c + \lambda L_r, \quad (12)$$

where $\lambda$ is a trade-off weight for the feature reconstruction loss. We experimentally find that the final performance is not sensitive to the value of this weight, so we set $\lambda = 1$ for simplicity in our experiments.
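Under these definitions, the training objective can be sketched as follows; the function signature is ours, and the predicted probabilities are assumed to be the softmax outputs of the PPM.

```python
import torch
import torch.nn.functional as F

def ttpp_loss(preds, gt_feats, gt_labels, lam=1.0):
    """Joint loss of Eq. (12): cross-entropy over all l steps (Eq. 11)
    plus MSE feature reconstruction (Eq. 10), weighted by lambda."""
    l_r = sum(F.mse_loss(f, gt_f) for (f, _), gt_f in zip(preds, gt_feats))
    l_c = sum(F.nll_loss(torch.log(p + 1e-8), y)   # CE on softmax outputs
              for (_, p), y in zip(preds, gt_labels))
    return l_c + lam * l_r
```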
IV. EXPERIMENTS

The proposed method was evaluated on three datasets, i.e., TVSeries [7], THUMOS-14 [20] and TV-Human-Interaction [39]. We choose these datasets because they include videos from diverse perspectives and applications: TVSeries was recorded from television and contains a variety of everyday activities, THUMOS-14 is a popular dataset of sports-related actions, and TV-Human-Interaction contains human interactions collected from TV shows. In this section, we report experimental results and detailed analysis.

A. Datasets
TVSeries [7] was originally proposed for online action detection and consists of 27 episodes of six popular TV series, namely Breaking Bad, How I Met Your Mother, Mad Men, Modern Family, Sons of Anarchy, and Twenty-four, totaling about 16 hours of video. The dataset is temporally annotated at the frame level with 30 realistic, everyday actions (e.g., pick up, open door, drink). It is challenging due to its diverse actions, multiple actors, unconstrained viewpoints, heavy occlusions, and a large proportion of non-action frames.

THUMOS-14 [20] is a popular benchmark for temporal action detection. It contains over 20 hours of sports videos annotated with 20 actions. The training set (i.e., UCF101 [49]) contains only trimmed videos that cannot be used to train temporal action detection models. Following prior works [13], [62], we train our model on the validation set (about 3K action instances in untrimmed videos) and evaluate on the test set (about 3.3K action instances in untrimmed videos).

TV-Human-Interaction (TV-HI) [39]. We also evaluate our method on TV-Human-Interaction, which is also used in [13]. The dataset contains 300 trimmed video clips extracted from different TV shows. It is annotated with four interaction classes, namely hand shake, high five, hug, and kissing. It also contains a negative class of 100 videos that have none of the listed interactions. We use the suggested experimental setup of two train/test splits.

B. Evaluation Protocols
For each action class on TVSeries, we use the per-frame calibrated average precision (cAP) proposed in [7]:

$$\mathrm{cAP} = \frac{\sum_k \mathrm{cPrec}(k) \times I(k)}{P}, \quad (13)$$

where the calibrated precision is $\mathrm{cPrec} = \frac{TP}{TP + FP/w}$, $I(k)$ is an indicator function that is equal to 1 if the cut-off frame $k$ is a true positive, $P$ denotes the total number of true positives, and $w$ is the ratio between negative and positive frames. The mean cAP over all classes is reported as the final performance. The advantage of cAP is that it is fair under class imbalance. For THUMOS-14, we report per-frame mean Average Precision (mAP). For TV-Human-Interaction, we report classification accuracy (ACC).
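For reference, Eq. (13) can be computed per class as in the sketch below (ours, assuming per-frame confidence scores and binary labels as NumPy arrays).

```python
import numpy as np

def calibrated_ap(scores: np.ndarray, labels: np.ndarray) -> float:
    """Per-frame calibrated AP, Eq. (13): false positives are
    down-weighted by w, the negative-to-positive frame ratio."""
    order = np.argsort(-scores)              # rank frames by confidence
    pos = labels[order].astype(bool)
    w = (~pos).sum() / max(pos.sum(), 1)     # negative/positive ratio
    tp = np.cumsum(pos)
    fp = np.cumsum(~pos)
    cprec = tp / (tp + fp / w)               # calibrated precision per cut-off
    return float((cprec * pos).sum() / max(pos.sum(), 1))
```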
C. Implementation Details

To make fair comparisons with state-of-the-art methods [7], [13], [62], we follow their experimental settings on each dataset.
Chunk-level feature extraction. We extract frames from all videos at a fixed frame rate and group them into non-overlapping chunks, each spanning 0.25 second (one anticipation step). We use three different feature extractors as the visual encoder $g_{enc}$: a VGG-16 [47] network pre-trained on UCF101 [49], a two-stream (TS) [61] network pre-trained on ActivityNet-1.3 [3], and an inflated 3D ConvNet (I3D) [5] pre-trained on Kinetics [25]. VGG-16 features (4096-D) are extracted at the fc6 layer for the central frame of each chunk. For the two-stream features, the appearance feature is extracted on the central frame of each chunk as the output of the Flatten_673 layer in ResNet-200 [16], and the motion feature is extracted on the optical flow frames of each chunk as the output of the global pool layer of a pre-trained BN-Inception model [19]. The motion and appearance features are then concatenated into a TS feature (4096-D) for each chunk. Different from prior works [13], [62], we also use recent I3D features. The I3D model is originally trained with 64-frame video snippets, and thus may not be ideal for per-frame action anticipation. Nevertheless, we input the frames of each chunk to I3D and extract the output (1024-D) of the last global average pooling layer as the I3D-based feature.

https://github.com/yjxiong/anet2016-cuhk
https://github.com/piergiaj/pytorch-i3d
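As an illustration of how the extracted chunk features are consumed, the sketch below (our own, with hypothetical tensor shapes) slices a video's chunk features into observation windows and their future targets for training.

```python
import torch

def make_training_samples(feats, labels, obs_len=8, pred_len=8):
    """Slice chunk features (N, d_m) and per-chunk labels (N,) into
    (observed window, future features, future labels) triplets: the
    window ending at chunk t supervises the l = pred_len future steps."""
    samples = []
    for t in range(obs_len, feats.size(0) - pred_len + 1):
        obs = feats[t - obs_len:t]       # f_{t-7}, ..., f_t
        fut_f = feats[t:t + pred_len]    # ground-truth future features
        fut_y = labels[t:t + pred_len]   # ground-truth future actions
        samples.append((obs, fut_f, fut_y))
    return samples
```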
Hyperparameter setting. We implement our proposed method in PyTorch and perform all experiments on a system with Nvidia TITAN X graphics cards. We use the SGD optimizer with fixed learning rate and momentum. The input sequence length is set to 8 by default, corresponding to 2 seconds. We use a single-layer multi-head setting for our TTM, and the number of heads is set to 4 by default.

                    Time predicted into the future (seconds)
Method    Inputs  0.25s  0.5s  0.75s  1.0s  1.25s  1.5s  1.75s  2.0s  Avg
ED [13]   VGG     71.0   70.6  69.9   68.8  68.0   67.4  67.0   66.7  68.7
RED [13]  VGG     71.2   71.0  70.6   70.2  69.2   68.5  67.5   66.8  69.4
Ours      VGG
ED [13]   TS      78.5   78.0  76.3   74.6  73.7   72.7  71.7   71.0  74.5
RED [13]  TS      79.2   78.7  77.1   75.5  74.2   73.0  72.0   71.2  75.1
TRN [62]  TS      79.9   78.4  77.1   75.9  74.9   73.9  73.0   72.3  75.7
Ours      TS

TABLE I: Comparison with the state-of-the-art methods on TVSeries in terms of mean cAP (%).

                    Time predicted into the future (seconds)
Method    Inputs  0.25s  0.5s  0.75s  1.0s  1.25s  1.5s  1.75s  2.0s  Avg
ED [13]   TS      43.8   40.9  38.7   36.8  34.6   33.9  32.5   31.6  36.6
RED [13]  TS      45.3   42.1  39.6   37.5  35.8   34.4  33.2   32.1  37.5
TRN [62]  TS      45.1   42.4  40.7   39.1  37.7   36.4  35.3   34.3  38.9
Ours      TS
Ours      I3D     46.8   45.5  44.6   43.6  41.9   41.1  40.4   38.7  42.8

TABLE II: Comparison with the state-of-the-art methods on THUMOS-14 in terms of mAP (%).

D. Popular Baselines
Here we present several advanced baselines for temporal information aggregation and future prediction.
Temporal convolution (i.e., Conv1D) aggregates temporal features with 1-D convolution operations along the temporal axis. We apply Conv1D layers on two-stream features for this baseline.

LSTM takes sequence features as input and recurrently updates its hidden states over time. The Encoder-LSTM summarizes historical information into its final hidden state for information aggregation. The Decoder-LSTM recurrently decodes information into hidden states as predicted features. We use a single-layer LSTM architecture for these baselines.

Single-shot prediction (SSP). We implement a single-shot prediction method similar to [13], [54]. Given the aggregated historical feature, this method uses two FC layers to anticipate the single future feature at $T_a$, where $T_a \in \{t+1, t+2, \ldots, t+l\}$. This prediction method is equivalent to our PPM without the progressive process; see the sketch below.
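For concreteness, the aggregation and prediction baselines can be sketched as follows; the layer sizes and the Conv1D kernel size are our assumptions, since the original settings are not fully specified here.

```python
import torch
import torch.nn as nn

class Conv1DAggregator(nn.Module):
    """Temporal-convolution baseline: 1-D convolution over the time
    axis of the chunk features, followed by average pooling."""
    def __init__(self, d_m: int = 4096, kernel: int = 3):
        super().__init__()
        self.conv = nn.Conv1d(d_m, d_m, kernel, padding=kernel // 2)

    def forward(self, feats):                  # feats: (B, T, d_m)
        x = self.conv(feats.transpose(1, 2))   # convolve along time
        return x.mean(dim=2)                   # aggregated representation

class SingleShotPredictor(nn.Module):
    """SSP baseline: two FC layers map the aggregated feature directly
    to the single future feature at anticipation step T_a."""
    def __init__(self, d_m: int = 4096):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(d_m, d_m), nn.ReLU(), nn.Linear(d_m, d_m))

    def forward(self, s_t):                    # s_t: (B, d_m)
        return self.fc(s_t)
```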
E. Comparison with State of the Art

We compare our proposed TTPP method to several state-of-the-art methods on TVSeries, THUMOS-14, and TV-HI. The results are presented in Table I, Table II, and Table III, respectively. Our method consistently outperforms all these methods at every predicted step, and with two-stream features it achieves the best mean cAP, mAP, and ACC on the three datasets.

On both TVSeries and THUMOS-14, the improvements over other methods are more evident for long-term predictions: with two-stream features, our TTPP outperforms ED (Encoder-Decoder LSTM) [13] by a larger margin at $T_a = 2.0$s than at $T_a = 0.25$s on both datasets. With VGG features, our method also improves over the Reinforced ED in average cAP on TVSeries. Since the VGG and TS features are relatively old, we additionally test I3D features, which set a new state of the art on THUMOS-14 with 42.8% average mAP over time.

Method    Vondrick et al. [54]  RED-VGG [13]  RED-TS [13]  Ours-TS
ACC (%)   43.6                  47.5          50.2

TABLE III: Anticipation results on TV-Human-Interaction at $T_a = 1$s in terms of ACC (%).

(a) TVSeries

                 Time predicted into the future (seconds)
Method       0.25s  0.5s  0.75s  1.0s  1.25s  1.5s  1.75s  2.0s  Avg
Conv1D       41.8   40.9  39.6   38.1  37.2   36.7  35.9   35.5  38.2
LSTM         41.6   40.5  39.1   37.9  35.6   34.9  34.3   33.3  37.0
TTM (Ours)   45.9   43.7  42.4   41.0  39.9   39.4  37.9   37.3  40.9
(b) THUMOS-14

TABLE IV: Evaluation of temporal aggregation methods with two-stream features on TVSeries and THUMOS-14. For fair comparison, PPM is used for future prediction. "−" denotes that the aggregated feature is directly used for prediction without the shortcut connection to the current feature.

F. Ablation Study of TTM and PPM
To further investigate the effectiveness of the proposed TTPP, we conduct extensive evaluations of both TTM and PPM by comparing them to recent temporal aggregation and prediction methods, respectively.

For temporal aggregation, we compare our TTM to Conv1D and Encoder-LSTM on both THUMOS-14 and TVSeries, with PPM as the prediction phase. Since we use a shortcut connection in TTM to highlight the current frame information, we also evaluate the benefit of this design for all the aggregation methods. The results are shown in Table IV. Several observations can be made. First, the proposed TTM is superior to both Conv1D and LSTM regardless of the shortcut connection; on THUMOS-14, with the shortcut connection, TTM outperforms Conv1D and LSTM by 2.7 and 3.9 points in average mAP, respectively, and similar gains hold on TVSeries. Second, the shortcut connection to the current feature significantly improves all methods on TVSeries; for instance, our TTM degrades markedly after removing the shortcut connection, which demonstrates the importance of the current feature and the superiority of our design. Last but not least, the improvements of our TTM over the other methods are similar across time steps, which suggests that TTM provides better aggregated features via attention than the others.

For future prediction, we compare our PPM to Decoder-LSTM and SSP, with either Encoder-LSTM or TTM as the aggregation method. The results are presented in Table V. Several findings emerge. First, with both aggregation methods, our PPM consistently outperforms Decoder-LSTM and SSP on both datasets, which shows the effectiveness of PPM. Second, our PPM obtains larger improvements with our aggregation method TTM than with Encoder-LSTM; for instance, TTM-PPM outperforms TTM-LSTM by a larger margin than LSTM-PPM outperforms LSTM-LSTM. Third, with both aggregation methods, our PPM is significantly superior to SSP on both datasets, especially at long-term prediction steps, which demonstrates that the progressive design of our PPM is important.

TABLE V: Evaluation of future prediction methods with two-stream features on TVSeries and THUMOS-14. Both TTM and LSTM are evaluated for temporal aggregation. (A-P)* denotes the temporal aggregation and future prediction methods, respectively.

G. Importance of Feature Prediction
To evaluate the influence of feature prediction on the final action anticipation, we remove the predicted features (w/o FP) by using only the concatenation of the action probability and the aggregated historical representation in the PPM. The results are shown in Figure 4. Without the predicted features, the performance of the model (w/o FP) degrades dramatically. This indicates that relying only on the action probability to predict future actions is not sufficient: the predicted feature representations are always related to the action itself and thus provide useful information.

Fig. 4: Evaluation of feature prediction for action anticipation on TVSeries (cAP %) and THUMOS-14 (mAP %) with two-stream features.

Fig. 5: Evaluation of the trade-off weight λ on TVSeries (cAP %) and THUMOS-14 (mAP %) with two-stream features.

H. Evaluation of Sequence Length and Parameters
In the above experiments, we use by default a fixed historical length of 8 for aggregation, 4 parallel heads, and a trade-off loss weight λ = 1.0 for training. To investigate their impact on the proposed TTPP framework, we evaluate them on both THUMOS-14 and TVSeries.

The impact of λ. λ is the weight of the feature reconstruction loss during training. Figure 5 shows the results for varied λ on THUMOS-14 and TVSeries. Removing the feature reconstruction loss, i.e., λ = 0, degrades performance dramatically on both datasets, which confirms the necessity of feature prediction. Increasing the weight from 0 to 1 improves performance, after which performance saturates or is slightly hurt. A possible explanation is that overemphasizing feature reconstruction can hurt the discriminative power of the predicted features.
Fig. 6: Evaluation of input sequence length for temporal aggregation on TVSeries (cAP %) and THUMOS-14 (mAP %) with two-stream features.
Number of heads n. We also study performance variations with the number of attention heads in the temporal Transformer. The average prediction performance of our TTPP network with n ∈ {1, 2, 4, 8, 16} is shown in Table VI. The results indicate that our method is not sensitive to n: the largest performance variation is small on both TVSeries and THUMOS-14. On both datasets, we achieve the best performance with n = 4.

Input sequence length. The length of the observed sequence determines how much historical information can be used. Figure 6 illustrates the evaluation results on THUMOS-14 and TVSeries. On both datasets, we achieve the best performance with length 8. Decreasing the sequence length leads to insufficient context information, while increasing it introduces massive background information; both are inferior to the default length.

Number of Heads  n=1   n=2   n=4   n=8   n=16
TVSeries         77.1  77.7

TABLE VI: Comparison between different numbers of heads on TVSeries and THUMOS-14 with two-stream features.
I. Efficiency and Visualization
Table VII reports a comparison of the parameters, memory footprint, inference time and performance of different models on the TVSeries dataset. Compared to the popular Encoder-Decoder LSTM model, our TTPP has far fewer parameters (101M vs. 277M), a smaller memory footprint and a lower inference time, while achieving 3.4 points higher cAP. The efficiency of the proposed TTPP owes to both the Transformer architecture for sequence modeling and the efficient progressive prediction module.

Figure 7 shows some examples of attention weights and action anticipation on TVSeries, THUMOS-14 and TV-Human-Interaction. We find that frames near the current frame usually receive higher weights than distant frames, since the current frame feature is used as the query. On TVSeries and THUMOS-14, multiple action instances and confusing background frames exist in the videos, which inevitably leads to some incorrect anticipations.

Model    Parameters (M)  Memory (M)  Inference (s)  cAP (%)
ED [13]  277             6560        212            74.5
TTPP     101             3675        145            77.9

TABLE VII: Comparison of parameters, memory footprint and inference time on the TVSeries dataset with two-stream features.

Fig. 7: Visualization of attention weights and action anticipation on TVSeries (1st row), THUMOS-14 (2nd row), and TV-Human-Interaction (3rd row). Incorrect anticipation results are shown in red.

V. CONCLUSION
In this paper, we propose a novel deep framework to boost action anticipation, the Temporal Transformer with Progressive Prediction, where a TTM is used to aggregate observed information and a PPM to progressively predict future features and actions. Experimental results on TVSeries, THUMOS-14, and TV-Human-Interaction demonstrate that our framework significantly outperforms the state-of-the-art methods. Extensive ablation studies are conducted to show the effectiveness of each module of our method.

VI. ACKNOWLEDGEMENTS
This work is partially supported by the National Key Research and Development Program of China (No. 2016YFC1400704), the National Natural Science Foundation of China (U1813218, U1713208, 61671125), the Shenzhen Basic Research Program (JCYJ20170818164704758, CXB201104220032A), the Joint Lab of CAS-HK, the Shenzhen Institute of Artificial Intelligence and Robotics for Society, and the Sichuan Province Key Research and Development Plan (2019YFS0427).
REFERENCES

[1] Mohammad Sadegh Aliakbarian, Fatemehsadat Saleh, Mathieu Salzmann, Basura Fernando, and Lars Andersson. Encouraging LSTMs to anticipate actions very early. In ICCV, 2017.
[2] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. 2016.
[3] Fabian Caba, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. ActivityNet: A large-scale video benchmark for human activity understanding. In CVPR, 2015.
[4] João Carreira and Andrew Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. CoRR, abs/1705.07750, 2017.
[5] João Carreira and Andrew Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In CVPR, 2017.
[6] Junyoung Chung, Çaglar Gülçehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. CoRR, abs/1412.3555, 2014.
[7] Roeland De Geest, Efstratios Gavves, Amir Ghodrati, Zhenyang Li, Cees Snoek, and Tinne Tuytelaars. Online action detection. In CVPR, 2016.
[8] Roeland De Geest and Tinne Tuytelaars. Modeling temporal structure with LSTM for online action detection. In WACV, pages 1549–1557, 2018.
[9] Jeff Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. Long-term recurrent convolutional networks for visual recognition and description. CoRR, abs/1411.4389, 2014.
[10] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3D convolutional networks. In ICCV, 2015.
[11] Antonino Furnari and Giovanni Maria Farinella. What would you expect? Anticipating egocentric actions with rolling-unrolling LSTMs and modality attention. In ICCV, 2019.
[12] Jiyang Gao and Ram Nevatia. Revisiting temporal modeling for video-based person ReID. In BMVC, 2018.
[13] Jiyang Gao, Zhenheng Yang, and Ram Nevatia. RED: Reinforced encoder-decoder networks for action anticipation. In BMVC, 2017.
[14] Rohit Girdhar, João Carreira, Carl Doersch, and Andrew Zisserman. Video action transformer network. In CVPR, 2019.
[15] Alex Graves. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.
[17] Minh Hoai and Fernando De La Torre. Max-margin early event detectors. In CVPR, 2012.
[18] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[19] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. 2015.
[20] Y.-G. Jiang, J. Liu, A. Roshan Zamir, G. Toderici, I. Laptev, M. Shah, and R. Sukthankar. THUMOS challenge: Action recognition with a large number of classes. 2014.
[21] Jing Xu, Rui Zhao, Feng Zhu, Huaming Wang, and Wanli Ouyang. Attention-aware compositional network for person re-identification. In CVPR, 2018.
[22] Rafal Józefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. Exploring the limits of language modeling. CoRR, abs/1602.02410, 2016.
[23] Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, and Fei Fei Li. Large-scale video classification with convolutional neural networks. In CVPR, 2014.
[24] Will Kay, João Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, Mustafa Suleyman, and Andrew Zisserman. The Kinetics human action video dataset. CoRR, abs/1705.06950, 2017.
[25] Will Kay, João Carreira, Karen Simonyan, Brian Zhang, and Andrew Zisserman. The Kinetics human action video dataset. 2017.
[26] Qiuhong Ke, Mohammed Bennamoun, Senjian An, Farid Boussaid, and Ferdous Sohel. Human interaction prediction using deep temporal features. In ECCV, 2016.
[27] Qiuhong Ke, Mario Fritz, and Bernt Schiele. Time-conditioned action anticipation in one shot. In CVPR, 2019.
[28] Y. Kong, Z. Tao, and Y. Fu. Deep sequential context networks for action prediction. In CVPR, 2017.
[29] Yu Kong, Shangqian Gao, Bin Sun, and Yun Fu. Action prediction from videos via memorizing hard-to-predict samples. In AAAI, 2018.
[30] Yu Kong, Dmitry Kit, and Yun Fu. A discriminative model with multiple temporal scales for action prediction. In ECCV, 2014.
[31] Tian Lan, Tsung-Chuan Chen, and Silvio Savarese. A hierarchical representation for future action prediction. In ECCV, pages 689–704, 2014.
[32] Ivan Laptev, Marcin Marszalek, Cordelia Schmid, and Benjamin Rozenfeld. Learning realistic human actions from movies. In CVPR, 2008.
[33] Kang Li and Yun Fu. Prediction of human activity by discovering temporal sequence patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(8):1644–1657, 2014.
[34] Jun Liu, Gang Wang, Ling Yu Duan, Kamila Abdiyeva, and Alex C. Kot. Skeleton based human action recognition with global context-aware attention LSTM networks. IEEE Transactions on Image Processing, PP(99):1–1, 2018.
[35] Jun Liu, Gang Wang, Ping Hu, Ling Yu Duan, and Alex C. Kot. Global context-aware attention LSTM networks for 3D action recognition. In CVPR, 2017.
[36] Minh Thang Luong, Hieu Pham, and Christopher D. Manning. Effective approaches to attention-based neural machine translation. Computer Science, 2015.
[37] Tahmida Mahmud, Mahmudul Hasan, and Amit K. Roy-Chowdhury. Joint prediction of activity labels and starting times in untrimmed videos. In CVPR, 2017.
[38] Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted Boltzmann machines. In ICML, 2010.
[39] Alonso Patron, Marcin Marszalek, Andrew Zisserman, and Ian Reid. High five: Recognising human interactions in TV shows. In BMVC, 2010.
[40] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6):1137–1149, 2017.
[41] Cristian Rodriguez, Basura Fernando, and Hongdong Li. Action anticipation by predicting future dynamic images.
[42] M. S. Ryoo. Human activity prediction: Early recognition of ongoing activities from streaming videos. In CVPR, 2012.
[43] Shikhar Sharma, Ryan Kiros, and Ruslan Salakhutdinov. Action recognition using visual attention. In ICLR, 2015.
[44] Shikhar Sharma, Ryan Kiros, and Ruslan Salakhutdinov. Action recognition using visual attention. Computer Science, 2017.
[45] Zheng Shou, Junting Pan, Jonathan Chan, Kazuyuki Miyazawa, Hassan Mansour, Anthony Vetro, Xavier Giro-i-Nieto, and Shih Fu Chang. Online detection of action start in untrimmed, streaming videos. In CVPR, 2018.
[46] Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, 2014.
[47] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. Computer Science, 2014.
[48] Sijie Song, Cuiling Lan, Junliang Xing, Wenjun Zeng, and Jiaying Liu. An end-to-end spatio-temporal attention model for human action recognition from skeleton data. In AAAI, 2016.
[49] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. UCF101: A dataset of 101 human action classes from videos in the wild. Computer Science, 2012.
[50] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.
[51] Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W. Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. CoRR, abs/1609.03499, 2016.
[52] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. CoRR, abs/1706.03762, 2017.
[53] Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. Anticipating the future by watching unlabeled video. CoRR, abs/1504.08023, 2015.
[54] Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. Anticipating visual representations from unlabeled video. In CVPR, 2016.
[55] Heng Wang, Alexander Kläser, Cordelia Schmid, and Cheng Lin Liu. Dense trajectories and motion boundary descriptors for action recognition. International Journal of Computer Vision, 103(1):60–79, 2013.
[56] Hongsong Wang and Jiashi Feng. Delving into 3D action anticipation from streaming videos. 2019.
[57] Limin Wang, Yuanjun Xiong, Zhe Wang, and Yu Qiao. Towards good practices for very deep two-stream convnets. Computer Science, 2015.
[58] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In CVPR, 2018.
[59] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, and Klaus Macherey. Google's neural machine translation system: Bridging the gap between human and machine translation. 2016.
[60] Zuxuan Wu, Xi Wang, Yu-Gang Jiang, Hao Ye, and Xiangyang Xue. Modeling spatial-temporal clues in a hybrid deep learning framework for video classification. CoRR, abs/1504.01561, 2015.
[61] Yuanjun Xiong, Limin Wang, Zhe Wang, Bowen Zhang, Hang Song, Wei Li, Dahua Lin, Yu Qiao, Luc Van Gool, and Xiaoou Tang. CUHK and ETHZ and SIAT submission to ActivityNet challenge 2016.
[62] Mingze Xu, Mingfei Gao, Yi-Ting Chen, Larry S. Davis, and David J. Crandall. Temporal recurrent networks for online action detection. In ICCV, 2019.
[63] Fan Yang, Ke Yan, Shijian Lu, Huizhu Jia, Xiaodong Xie, and Wen Gao. Attention driven person re-identification. Pattern Recognition, 2018.
[64] Yu Cao, Daniel Barrett, Andrei Barbu, Siddharth Narayanaswamy, and Song Wang. Recognize human activities from partially observed videos. In CVPR, 2013.