Streamlined Dense Video Captioning
Jonghwan Mun∗ (POSTECH), Linjie Yang (ByteDance AI Lab), Zhou Ren (Wormpex AI Research), Ning Xu (Amazon Go), Bohyung Han (Seoul National University)

∗ This work was done during an internship program at Snap Research.

Abstract
Dense video captioning is an extremely challenging task since accurate and coherent description of events in a video requires holistic understanding of video contents as well as contextual reasoning of individual events. Most existing approaches handle this problem by first detecting event proposals from a video and then captioning on a subset of the proposals. As a result, the generated sentences are prone to be redundant or inconsistent since they fail to consider temporal dependency between events. To tackle this challenge, we propose a novel dense video captioning framework, which explicitly models temporal dependency across events in a video and leverages visual and linguistic context from prior events for coherent storytelling. This objective is achieved by 1) integrating an event sequence generation network to select a sequence of event proposals adaptively, and 2) feeding the sequence of event proposals to our sequential video captioning network, which is trained by reinforcement learning with two-level rewards—at both event and episode levels—for better context modeling. The proposed technique achieves outstanding performance on the ActivityNet Captions dataset in most metrics.
1. Introduction
Understanding video contents is an important topic in computer vision. Through the introduction of large-scale datasets [9, 31] and the recent advances in deep learning, research on video content understanding is no longer limited to activity classification or detection and addresses more complex tasks including video caption generation [1, 4, 13, 14, 15, 22, 23, 26, 28, 30, 33, 35, 36]. Video captions are effective for holistic video description. However, since videos usually contain multiple interdependent events in the context of a video-level story (i.e., an episode), a single sentence may not be sufficient to describe a video. Consequently, the dense video captioning task [8] has been introduced and has been gaining popularity recently. This task is conceptually more complex than simple video captioning since it requires detecting individual events in a video and understanding their context. Fig. 1 presents an example of dense video captioning for a busking episode, which is composed of four ordered events. Despite the complexity of the problem, most existing methods [8, 10, 27, 37] are limited to describing an event using two subtasks—event detection and event description—in which an event proposal network is in charge of detecting events and a captioning network generates captions for the selected proposals independently.

Figure 1. An example of dense video captioning about a busking episode, which is composed of four interdependent events.

We propose a novel framework for dense video captioning, which considers the temporal dependency of the events. Contrary to existing approaches shown in Fig. 2(a), our algorithm detects event sequences from videos and generates captions sequentially, where each caption is conditioned on prior events and captions as illustrated in Fig. 2(b). Our algorithm has the following procedure. First, given a video, we obtain a set of candidate event proposals from an event proposal network. Then, an event sequence generation network selects a series of ordered events adaptively from the event proposal candidates.
Finally, we generate captions for the selected event proposals using a sequential captioning network. The captioning network is trained via reinforcement learning using both event- and episode-level rewards; the event-level reward allows the model to capture specific content in each event precisely, while the episode-level reward drives all generated captions to form a coherent story.

Figure 2. Comparison between the existing approaches and ours for dense video captioning. Our algorithm generates captions for events sequentially conditioned on the prior ones by detecting an event sequence in a video.

The main contributions of the proposed approach are summarized as follows:
• We propose a novel framework of detecting event sequences for dense video captioning. The proposed event sequence generation network allows the captioning network to model temporal dependency between events and generate a set of coherent captions to describe an episode in a video.
• We present reinforcement learning with two-level rewards, at the episode and event levels, which drives the captioning model to boost coherence across generated captions and the quality of the description for each event.
• The proposed algorithm achieves state-of-the-art performance on the ActivityNet Captions dataset with large margins compared to the methods based on the existing framework.
The rest of the paper is organized as follows. We first discuss related work in Section 2. The proposed method and its training scheme are described in Sections 3 and 4 in detail, respectively. We present experimental results in Section 5 and conclude this paper in Section 6.
2. Related Work
Recent video captioning techniques often adopt the encoder-decoder framework inspired by its success in image captioning [11, 16, 17, 25, 32]. Basic algorithms [22, 23] encode a video using Convolutional Neural Networks (CNNs) or Recurrent Neural Networks (RNNs), and decode the representation into a natural sentence using RNNs. Various techniques have since been proposed to enhance the quality of generated captions by integrating temporal attention [33], a joint embedding space of sentences and videos [14], hierarchical recurrent encoders [1, 13], attribute-augmented decoders [4, 15, 36], multimodal memory [28], and a reconstruction loss [26]. Despite their impressive performance, these methods are limited to describing a video with a single sentence and can be applied only to a short video containing a single event. Thus, Yu et al. [35] propose a hierarchical recurrent neural network to generate a paragraph for a long video, while Xiong et al. [30] introduce a paragraph generation method based on event proposals, where an event selection module determines which proposals need to be utilized for caption generation in a progressive way. Contrary to these tasks, which simply generate a sentence or paragraph for an input video, dense video captioning requires localizing and describing events at the same time.
Recent dense video captioning techniques typically attempt to solve the problem using two subtasks—event detection and caption generation [8, 10, 27, 37]; an event proposal network finds a set of candidate proposals and a captioning network is employed to generate a caption for each proposal independently. The performance of these methods is affected by the manual thresholding strategies used to select the final event proposals for caption generation. Within this framework, Krishna et al. [8] adopt a multi-scale action proposal network [3], and introduce a captioning network that exploits visual context from past and future events with an attention mechanism. In [27], a bidirectional RNN is employed to improve the quality of event proposals, and a context gating mechanism in caption generation is proposed to adaptively control the contribution of surrounding events. Li et al. [10] incorporate temporal coordinate and descriptiveness regression for precise localization of event proposals, and adopt an attribute-augmented captioning network [34]. Zhou et al. [37] utilize self-attention [20] for the event proposal and captioning networks, and propose a masking network that converts the event proposals into differentiable masks, enabling end-to-end learning of the two networks.

In contrast to the prior works, our algorithm identifies a small set of representative event proposals (i.e., event sequences) for sequential caption generation, which enables us to generate coherent and comprehensive captions by exploiting both visual and linguistic context across the selected events. Note that the existing works fail to take advantage of linguistic context since their captioning networks are applied to event proposals independently.
3. Our Framework
This section describes our main idea and the deep neural network architecture of our algorithm in detail.
Figure 3. Overall framework of the proposed algorithm. Given an input video, our algorithm first extracts a set of candidate event proposals (p_1, p_2, p_3, p_4, p_5) using the Event Proposal Network (Section 3.2). From the candidate set, the Event Sequence Generation Network detects an event sequence (ê_1 → ê_2 → ê_3) by selecting one of the candidate event proposals at each step (Section 3.3). Finally, the Sequential Captioning Network takes the detected event sequence and sequentially generates captions (d̂_1, d̂_2, d̂_3) conditioned on preceding events (Section 3.4). The three models are trained in a supervised manner (Section 4.1), and then the Sequential Captioning Network is additionally optimized with reinforcement learning using two-level rewards (Section 4.2).

Let a video V contain a set of events E = {e_1, ..., e_N} with corresponding descriptions D = {d_1, ..., d_N}, where the N events are temporally localized by their starting and ending time stamps. Existing methods [8, 10, 27, 37] typically divide the whole problem into two steps: event detection followed by description of the detected events. These algorithms train models by minimizing the sum of negative log-likelihoods of event and caption pairs as follows:

L = \sum_{n=1}^{N} -\log p(d_n, e_n \mid V) = \sum_{n=1}^{N} -\log \big[ p(e_n \mid V)\, p(d_n \mid e_n, V) \big].   (1)

However, events in a video have temporal dependency and should form a story about a single topic. Therefore, it is critical to identify an ordered list of events that describes a coherent story corresponding to the episode, i.e., the composition of the events. With this in consideration, we formulate dense video captioning as detection of an event sequence followed by sequential caption generation:

L = -\log p(E, D \mid V) = -\log \Big[ p(E \mid V) \prod_{n=1}^{N} p(d_n \mid d_1, \ldots, d_{n-1}, E, V) \Big].   (2)

The overall framework of the proposed algorithm is illustrated in Fig. 3. For a given video, a set of candidate event proposals is generated by the event proposal network. Then, our event sequence generation network produces a series of events by selecting one of the candidate event proposals at each step, where the selected proposals correspond to the events comprising an episode in the video. Finally, we generate captions from the selected proposals using the proposed sequential captioning network, where each caption is generated conditioned on preceding proposals and their captions. The captioning network is trained via reinforcement learning using event- and episode-level rewards.

3.2. Event Proposal Network (EPN)

EPN plays a key role in selecting event candidates. We adopt Single-Stream Temporal action proposals (SST) [2] due to its good performance and efficiency in finding semantically meaningful temporal regions via a single scan of a video. SST divides an input video into a set of non-overlapping segments with a fixed length (e.g., 16 frames), where the representation of each segment is given by a 3D convolution (C3D) network [19]. By treating each segment as the ending point of an event proposal, SST identifies its matching starting points among the k preceding segments, which are represented by a k-dimensional output vector from a Gated Recurrent Unit (GRU) at each time step.
After extracting the top 1,000 event proposals, we obtain M candidate proposals, P = {p_1, ..., p_M}, by eliminating highly overlapping ones using non-maximum suppression. Note that EPN provides a representation of each proposal p ∈ P, which is a concatenation of the two SST hidden states at the starting and ending segments. This visual representation, denoted by Vis(p), is utilized by the other two networks.

3.3. Event Sequence Generation Network (ESGN)

Given the set of candidate event proposals, ESGN selects a series of events that are highly correlated and make up an episode for the video. To this end, we employ a Pointer Network (PtrNet) [24], which is designed to produce a distribution over an input set using a recurrent neural network equipped with an attention module. PtrNet is well suited for selecting an ordered subset of proposals and thus for generating coherent captions that respect their temporal dependency.

As shown in Fig. 3, we first encode the set of candidate proposals P by feeding the proposals to an encoder RNN in increasing order of their starting times, and initialize the first hidden state of PtrNet with the encoded representation to guide proposal selection. At each time step of PtrNet, we compute likelihoods a_t over the candidate event proposals and select the proposal with the highest likelihood out of all available proposals. The procedure is repeated until PtrNet selects the END event proposal, p_end, a special proposal indicating the end of an event sequence. The whole process is summarized as follows:

h_0^{ptr} = \mathrm{RNN}_{\mathrm{enc}}(\mathrm{Vis}(p_1), \ldots, \mathrm{Vis}(p_M)),   (3)
h_t^{ptr} = \mathrm{RNN}_{\mathrm{ptr}}(u(\hat{e}_{t-1}), h_{t-1}^{ptr}),   (4)
a_t = \mathrm{ATT}(h_t^{ptr}, u(p_0), \ldots, u(p_M)),   (5)

where h_t^{ptr} is the hidden state of PtrNet at step t, ATT(·) is an attention function computing confidence scores over proposals, and the representation of proposal p in PtrNet, u(p) = [Loc(p); Vis(p)], is given by the visual information Vis(p) as well as the location information Loc(p). Also, ê_t is the event proposal selected at time step t, which is given by

\hat{e}_t = p_{j^*} \quad \text{with} \quad j^* = \arg\max_{j \in \{0, \ldots, M\}} a_t^j,   (6)

where p_0 corresponds to p_end. Note that the location feature Loc(p) is a binary mask vector whose elements are set to 1 within the temporal interval of the proposal and 0 otherwise. This is useful for identifying and disregarding proposals that overlap strongly with previous selections.

Our ESGN has clear benefits for dense video captioning. Specifically, it determines the number and order of events adaptively, which facilitates compact, comprehensive and context-aware caption generation. Noticeably, existing approaches keep a large number of detected events due to manual thresholding, whereas ESGN detects only 2.85 events on average, which is close to the average number of events per video in the ActivityNet Captions dataset, 3.65. Although sorting event proposals is an ill-defined problem, due to their two time stamps (starting and ending points), ESGN naturally learns the number and order of proposals based on the semantics and context of individual videos in a data-driven manner.
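To make Eqs. (3)-(6) concrete, the following is a minimal PyTorch-style sketch of the selection loop. It is not the authors' released implementation; the class and variable names (PointerSelector, max_steps) and the particular additive attention form are our assumptions.

```python
import torch
import torch.nn as nn

class PointerSelector(nn.Module):
    """Minimal sketch of ESGN's proposal selection (Eqs. (3)-(6))."""

    def __init__(self, prop_dim, hidden=512):
        super().__init__()
        self.enc = nn.GRU(prop_dim, hidden, batch_first=True)  # RNN_enc, Eq. (3)
        self.ptr = nn.LSTMCell(prop_dim, hidden)                # RNN_ptr, Eq. (4)
        # additive attention ATT(h, u_0, ..., u_M), Eq. (5)
        self.w_h = nn.Linear(hidden, hidden)
        self.w_u = nn.Linear(prop_dim, hidden)
        self.v = nn.Linear(hidden, 1)

    def forward(self, u, max_steps=10):
        # u: (1, M+1, prop_dim) proposal features u(p) = [Loc(p); Vis(p)],
        #    with index 0 reserved for the special END proposal p_end
        _, h0 = self.enc(u)                                   # encode candidates
        h, c = h0.squeeze(0), torch.zeros_like(h0.squeeze(0))
        prev = u[:, 0]                                        # dummy "previous selection"
        selected = []
        for _ in range(max_steps):
            h, c = self.ptr(prev, (h, c))                     # update pointer state
            scores = self.v(torch.tanh(self.w_h(h).unsqueeze(1)
                                       + self.w_u(u))).squeeze(-1)
            a = torch.softmax(scores, dim=-1)                 # likelihoods a_t over proposals
            j = a.argmax(dim=-1).item()                       # greedy choice, Eq. (6)
            if j == 0:                                        # END proposal terminates the sequence
                break
            selected.append(j)
            prev = u[:, j]                                    # feed the chosen proposal back in
        return selected                                       # ordered indices of chosen proposals
```

In a full implementation one would also mask out proposals that have already been selected (the binary location feature Loc(p) is what lets the model learn to ignore such overlaps), and during training the likelihoods a_t would be supervised with Eq. (10) rather than decoded greedily.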
3.4. Sequential Captioning Network (SCN)

SCN employs a hierarchical recurrent neural network to generate coherent captions from the detected event sequence Ê = {ê_1, ..., ê_{N_s}}, where N_s (≤ M) is the number of selected events. As shown in Fig. 3, SCN consists of two RNNs—an episode RNN and an event RNN—denoted by RNN_E and RNN_e, respectively. The episode RNN takes the proposals in the detected event sequence one by one and implicitly models the state of the episode, while the event RNN generates the words of a caption sequentially for each event proposal conditioned on the implicit representation of the episode, i.e., on the current context of the episode. Formally, the caption generation process for the t-th event proposal ê_t in the detected event sequence is given by

r_t = \mathrm{RNN}_E(\mathrm{Vis}(\hat{e}_t), g_{t-1}, r_{t-1}),   (7)
g_t = \mathrm{RNN}_e^{*}(\mathrm{C3D}(\hat{e}_t), \mathrm{Vis}(\hat{e}_t), r_t),   (8)

where r_t is an episodic feature after the t-th event proposal, and g_t is the generated caption feature given by the last hidden state of the unrolled (denoted by *) event RNN. C3D(ê_t) denotes the set of C3D features of all segments lying in the temporal interval of the t-th event proposal. The episode RNN provides the current episodic feature so that the event RNN generates context-aware captions, which are in turn fed back into the episode RNN.

Although both networks can conceptually be implemented with any RNNs, we adopt a single-layer Long Short-Term Memory (LSTM) with a 512-dimensional hidden state as the episode RNN, and the captioning network with temporal dynamic attention and context gating (TDA-CG) presented in [27] as the event RNN. TDA-CG generates words from a feature computed by gating a visual feature Vis(e) and an attended feature obtained from the segment feature descriptors C3D(e).

Note that the sequential caption generation scheme enables us to exploit both visual context (i.e., how other events look) and linguistic context (i.e., how other events are described) across events, and allows us to generate each caption in an explicit context. Although existing methods [8, 27] also utilize context for caption generation, they are limited to visual context and model no linguistic dependency due to the architectural constraints of their independent caption generation scheme, which can result in inconsistent and redundant captions.
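A minimal sketch of the interaction in Eqs. (7)-(8) follows. Only the wiring of the two RNNs is intended to be faithful; the event decoder is replaced by a trivial stand-in, and the class names and dimensions are our assumptions rather than the paper's implementation (which uses an LSTM episode RNN and the TDA-CG decoder).

```python
import torch
import torch.nn as nn

class MeanPoolDecoder(nn.Module):
    """Trivial stand-in for the TDA-CG event RNN; only here to make the sketch runnable."""
    def __init__(self, c3d_dim, vis_dim, hidden, cap_dim):
        super().__init__()
        self.proj = nn.Linear(c3d_dim + vis_dim + hidden, cap_dim)

    def forward(self, segs, vis, r):
        pooled = segs.mean(dim=0, keepdim=True)              # (1, c3d_dim)
        return torch.tanh(self.proj(torch.cat([pooled, vis.unsqueeze(0), r], dim=-1)))

class SequentialCaptioner(nn.Module):
    """Sketch of the episode-RNN / event-RNN loop of SCN."""
    def __init__(self, event_decoder, vis_dim, cap_dim, hidden=512):
        super().__init__()
        self.episode_rnn = nn.LSTMCell(vis_dim + cap_dim, hidden)  # RNN_E
        self.event_decoder = event_decoder                          # RNN_e (unrolled)
        self.hidden, self.cap_dim = hidden, cap_dim

    def forward(self, vis_feats, seg_feats):
        # vis_feats: list of (vis_dim,) proposal features Vis(e_t)
        # seg_feats: list of (T_t, c3d_dim) segment features C3D(e_t)
        r = torch.zeros(1, self.hidden)       # episode state
        c = torch.zeros(1, self.hidden)
        g = torch.zeros(1, self.cap_dim)      # caption feature of the previous event
        caption_feats = []
        for vis, segs in zip(vis_feats, seg_feats):
            # Eq. (7): update the episode state from the event and the previous caption
            r, c = self.episode_rnn(torch.cat([vis.unsqueeze(0), g], dim=-1), (r, c))
            # Eq. (8): describe the event conditioned on the current episode state
            g = self.event_decoder(segs, vis, r)
            caption_feats.append(g)
        return caption_feats
```

The feedback loop is the key design point: the caption feature g_t produced for one event enters the episode RNN before the next event is described, which is what allows later captions to refer back to earlier ones.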
4. Training
We first learn the event proposal network and fix its parameters during training of the other two networks. We train the event sequence generation network and the sequential captioning network in a supervised manner, and further optimize the captioning network based on reinforcement learning with two-level rewards at the event and episode levels.

4.1. Supervised Learning
Event Proposal Network
Let c_t^k be the confidence of the k-th event proposal ending at time step t in EPN, which is SST [2] in our algorithm. Denote the ground-truth label of the proposal by y_t^k, which is set to 1 if the event proposal has a temporal Intersection-over-Union (tIoU) with a ground-truth event larger than 0.5, and 0 otherwise. Then, for a given video V and ground-truth labels Y, we train EPN by minimizing the following weighted binary cross entropy loss:

L_{\mathrm{EPN}}(V, Y) = -\sum_{t=1}^{T_c} \sum_{k=1}^{K} \big[ y_t^k \log c_t^k + (1 - y_t^k) \log(1 - c_t^k) \big],   (9)

where Y = {y_t^k | 1 ≤ t ≤ T_c, 1 ≤ k ≤ K}, K is the number of proposals ending at each segment, and T_c is the number of segments in the video.
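As an illustration of Eq. (9), the sketch below builds the binary targets from tIoU overlaps and evaluates the loss. It is a simplified sketch under our own naming (tiou, epn_targets_and_loss) and data layout, not the actual training code, and it omits any class-balancing weights.

```python
import torch
import torch.nn.functional as F

def tiou(a, b):
    """Temporal IoU of two intervals a = (start, end), b = (start, end)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def epn_targets_and_loss(conf, proposals, gt_events, thr=0.5):
    """Sketch of the EPN objective in Eq. (9).

    conf:      (T_c, K) tensor of proposal confidences c_t^k in (0, 1) from SST
    proposals: nested list; proposals[t][k] = (start, end) of the k-th anchor
               ending at segment t
    gt_events: list of ground-truth (start, end) intervals
    """
    y = torch.zeros_like(conf)
    for t in range(conf.size(0)):
        for k in range(conf.size(1)):
            # label 1 if the anchor overlaps any ground-truth event by more than thr
            if any(tiou(proposals[t][k], e) > thr for e in gt_events):
                y[t, k] = 1.0
    # element-wise binary cross entropy summed over all segments and anchors
    return F.binary_cross_entropy(conf, y, reduction='sum')
```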
Event Sequence Generation Network

For a video with ground-truth event sequence E = {e_1, ..., e_N} and a set of candidate event proposals P = {p_1, ..., p_M}, the goal of ESGN is to select, for each ground-truth event e_n, a proposal that overlaps with it strongly, which is achieved by minimizing the following sum of binary cross entropy losses:

L_{\mathrm{ESGN}}(V, P, E) = -\sum_{n=1}^{N} \sum_{m=1}^{M} \big[ \mathrm{tIoU}(p_m, e_n) \log a_n^m + (1 - \mathrm{tIoU}(p_m, e_n)) \log(1 - a_n^m) \big],   (10)

where tIoU(·,·) is the temporal Intersection-over-Union between two proposals, and a_n^m is the likelihood that the m-th event proposal is selected as the n-th event.
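Eq. (10) translates almost directly into code; the sketch below assumes the pointer-network likelihoods and the overlaps are available as dense (N, M) tensors, and the function name esgn_loss is ours.

```python
import torch

def esgn_loss(likelihoods, tiou_matrix, eps=1e-8):
    """Sketch of the ESGN objective in Eq. (10).

    likelihoods: (N, M) tensor, a_n^m = probability that proposal m is picked
                 as the n-th event (the attention output of the pointer network)
    tiou_matrix: (N, M) tensor of tIoU(p_m, e_n) values used as soft targets
    """
    a = likelihoods.clamp(eps, 1.0 - eps)     # avoid log(0)
    bce = -(tiou_matrix * a.log() + (1.0 - tiou_matrix) * (1.0 - a).log())
    return bce.sum()
```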
Sequential Captioning Network

We utilize the ground-truth event sequence E and its descriptions D to learn SCN via the teacher forcing technique [29]. Specifically, to learn the two RNNs in SCN, we provide the episode RNN and the event RNN with ground-truth events and captions as their inputs, respectively. Then, the captioning network is trained by minimizing the negative log-likelihood over the words of the ground-truth captions:

L_{\mathrm{SCN}}(V, E, D) = -\sum_{n=1}^{N} \log p(d_n \mid e_n) = -\sum_{n=1}^{N} \sum_{t=1}^{T_{d_n}} \log p(w_n^t \mid w_n^1, \ldots, w_n^{t-1}, e_n),   (11)

where p(·) denotes the predictive distribution over the word vocabulary from the event RNN, and w_n^t and T_{d_n} denote the t-th ground-truth word and the length of the ground-truth description for the n-th event, respectively.

4.2. Reinforcement Learning

Inspired by the success of reinforcement learning in image captioning [16, 17], we further employ it to optimize SCN. Similar to the self-critical sequence training [17] approach, the objective of learning our captioning network is revised to minimize the negative expected reward of sampled captions. The loss is formally given by

L_{\mathrm{SCN}}^{\mathrm{RL}}(V, \hat{E}, \hat{D}) = -\sum_{n=1}^{N_s} \mathbb{E}_{\hat{d}_n} \big[ R(\hat{d}_n) \big],   (12)

where D̂ = {d̂_1, ..., d̂_{N_s}} is a set of descriptions sampled for the event sequence Ê with N_s events detected by ESGN, and R(d̂) is the reward for an individual sampled description d̂. The expected gradient on the sample set D̂ is given by

\nabla L_{\mathrm{SCN}}^{\mathrm{RL}}(V, \hat{E}, \hat{D}) = -\sum_{n=1}^{N_s} \mathbb{E}_{\hat{d}_n} \big[ R(\hat{d}_n) \nabla \log p(\hat{d}_n) \big] \approx -\sum_{n=1}^{N_s} R(\hat{d}_n) \nabla \log p(\hat{d}_n).   (13)

We adopt a reward function with two levels: episode and event. This encourages the model to generate coherent captions that reflect the overall context of the video, while facilitating the choice of better word candidates for individual events depending on the context. Also, motivated by [6, 16, 17], we use the rewards obtained from the captions generated with ground-truth proposals as baselines, which helps reduce the variance of the gradient estimate. This drives the model to generate captions that are at least as competitive as the ones generated from ground-truth proposals, even though the intervals of detected event proposals are not exactly aligned with those of the ground-truth proposals. Specifically, for a sampled event sequence Ê, we find a reference event sequence Ẽ = {ẽ_1, ..., ẽ_{N_s}} and its descriptions D̃ = {d̃_1, ..., d̃_{N_s}}, where the reference event ẽ_n is the ground-truth proposal with the highest overlap with the sampled event ê_n. Then, the reward for the n-th sampled description d̂_n is given by

R(\hat{d}_n) = \big[ f(\hat{d}_n, \tilde{d}_n) - f(\check{d}_n, \tilde{d}_n) \big] + \big[ f(\hat{D}, \tilde{D}) - f(\check{D}, \tilde{D}) \big],   (14)

where f(·,·) returns a similarity score between two captions or two sets of captions, and Ď = {ď_1, ..., ď_{N_s}} denotes the descriptions generated from the reference event sequence. Both terms in Eq. (14) encourage our model to increase the probability of sampled descriptions whose scores are higher than those of the captions generated from the ground-truth event proposals. Note that the first and second terms are computed on the current event and on the episode, respectively. We use two popular captioning metrics, METEOR and CIDEr, to define f(·,·).
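The sketch below spells out how Eqs. (13)-(14) fit together. It is a schematic rendering under our own function names (two_level_reward, reinforce_loss); the caption-similarity metrics (METEOR and CIDEr in the paper) are injected as callables rather than implemented here.

```python
def two_level_reward(sampled_caps, baseline_caps, ref_caps, f_event, f_episode):
    """Compute R(d_hat_n) of Eq. (14) for every sampled caption.

    sampled_caps:  captions sampled from the detected event sequence (d_hat)
    baseline_caps: captions decoded from the matched ground-truth proposals,
                   used as the self-critical baseline (d_check)
    ref_caps:      ground-truth captions of the matched events (d_tilde)
    f_event, f_episode: caption-similarity functions (e.g. METEOR / CIDEr wrappers)
    """
    # episode-level term: compare whole caption sets against the references
    episode_term = f_episode(sampled_caps, ref_caps) - f_episode(baseline_caps, ref_caps)
    rewards = []
    for d_hat, d_check, d_tilde in zip(sampled_caps, baseline_caps, ref_caps):
        # event-level term: compare each sampled caption with its own reference
        event_term = f_event(d_hat, d_tilde) - f_event(d_check, d_tilde)
        rewards.append(event_term + episode_term)
    return rewards

def reinforce_loss(log_probs, rewards):
    """Policy-gradient surrogate of Eq. (13): -sum_n R(d_hat_n) * log p(d_hat_n)."""
    return -sum(r * lp for r, lp in zip(rewards, log_probs))
```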
Table 1. Event detection performance, including recall and precision at four tIoU thresholds (@tIoU), on the ActivityNet Captions validation set. The bold-faced numbers mean the best performance for each metric.

Method   | Recall @0.3 | @0.5  | @0.7  | @0.9 | Average | Precision @0.3 | @0.5  | @0.7  | @0.9 | Average
MFT [30] | 46.18       | 29.76 | 15.54 | 5.77 | 24.31   | 86.34          | 68.79 | 38.30 |      |

Table 2. Dense video captioning results, including Bleu@N (B@N), CIDEr (C) and METEOR (M), for our model and other state-of-the-art methods on the ActivityNet Captions validation set. We report performances obtained from both ground-truth (GT) proposals and learned proposals. An asterisk (∗) stands for methods re-evaluated using the newer evaluation tool, and a star (⋆) indicates methods exploiting additional modalities (e.g., optical flow and attributes) for video representation. The bold-faced numbers mean the best for each metric.

Method  | GT: B@1 | B@2  | B@3  | B@4  | C     | M    | Learned: B@1 | B@2  | B@3  | B@4  | C     | M
DCE [8] | 18.13   | 8.43 | 4.09 | 1.60 | 25.12 | 8.88 | 10.81        | 4.57 | 1.90 | 0.71 | 12.43 | 5.69
…
5. Experiments
We evaluate the proposed algorithm on the ActivityNet Captions dataset [8], which contains 20k YouTube videos with an average length of 120 seconds. The dataset consists of 10,024, 4,926 and 5,044 videos for the training, validation and test splits, respectively. The videos have 3.65 temporally localized events and descriptions on average, and the average length of the descriptions is 13.48 words.
We use the performance evaluation tool provided by the 2018 ActivityNet Captions Challenge, which measures the capability to localize and describe events.¹ For evaluation, we measure recall and precision of event proposal detection, and METEOR, CIDEr and BLEU for dense video captioning. The scores are summarized by averaging over the tIoU thresholds 0.3, 0.5, 0.7 and 0.9, given the identified proposals and generated captions. We use METEOR as the primary metric for comparison, since it is known to correlate better with human judgment than the other metrics when only a small number of reference descriptions are available [21].²

¹ https://github.com/ranjaykrishna/densevid_eval
² On 11/02/2017, the official evaluation tool fixed a critical issue: previously, only one out of multiple incorrect predictions for each video was counted, which leads to performance overestimation in [27, 37]. Thus, we received raw results from the authors and report the scores measured by the new metric.
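For intuition, the snippet below sketches the kind of tIoU-thresholded matching and threshold averaging this protocol performs. It is a simplification written for illustration only (the official tool linked above is the reference implementation), and score_fn stands in for a metric such as METEOR.

```python
def averaged_dense_captioning_score(predictions, references, score_fn,
                                    thresholds=(0.3, 0.5, 0.7, 0.9)):
    """Simplified sketch: average a caption metric over tIoU thresholds for one video.

    predictions: list of (interval, caption) pairs produced by the model
    references:  list of (interval, caption) ground-truth pairs
    score_fn:    caption similarity function, e.g. a METEOR wrapper
    """
    def tiou(a, b):
        inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
        union = (a[1] - a[0]) + (b[1] - b[0]) - inter
        return inter / union if union > 0 else 0.0

    per_threshold = []
    for thr in thresholds:
        scores = []
        for p_int, p_cap in predictions:
            # score a prediction against every reference it overlaps at this threshold
            matched = [score_fn(p_cap, r_cap)
                       for r_int, r_cap in references if tiou(p_int, r_int) >= thr]
            scores.append(sum(matched) / len(matched) if matched else 0.0)
        per_threshold.append(sum(scores) / len(scores) if scores else 0.0)
    return sum(per_threshold) / len(per_threshold)
```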
Table 3. Results on the ActivityNet Captions evaluation server.

Method          | Audio | Flow | Visual | Ensemble | METEOR
RUC+CMU         | √     | √    | √      | yes      | 8.53
YH Technologies |       | √    | √      | no       | 8.13
Shandong Univ.  |       | √    | √      | yes      | 8.11
SDVC (ours)     |       |      | √      | no       | 8.19

For EPN, we use a two-layer GRU with 512-dimensional hidden states and generate 128 proposals at each ending segment, which makes the dimensionality of c_t in Eq. (9) equal to 128. In our implementation, EPN based on SST takes the whole span of a video as its input during training; this allows the network to consider all ground-truth proposals, whereas the original SST [2] is trained with densely sampled clips given by a sliding-window method.

For ESGN, we adopt a single-layer GRU and a single-layer LSTM as the encoder RNN and RNN_ptr, respectively, where the dimensions of the hidden states are both 512. We represent the location feature of proposals, denoted by Loc(·), with a 100-dimensional vector. When learning SCN with reinforcement learning, we sample 100 event sequences for each video and generate one caption for each event in the event sequence with greedy decoding. In all experiments, we use Adam [7] with a mini-batch size of one video and a learning rate of 0.0005.

We compare the proposed Streamlined Dense Video Captioning (SDVC) algorithm with several existing state-of-the-art methods, including DCE [8], DVC [10], Masked Transformer [37] and TDA-CG [27].
We additionally report the results of MFT [30], which was originally proposed for video paragraph generation but whose event selection module is also able to generate an event sequence from the candidate event proposals; it chooses between selecting each proposal for caption generation and skipping it, and thus constructs an event sequence implicitly. For MFT, we compare performance in both event detection and dense captioning.

Table 1 presents the event detection performance of ESGN and MFT on the ActivityNet Captions validation set. ESGN outperforms the progressive event selection module in MFT at most tIoUs with large margins, especially in recall. This validates the effectiveness of our proposed event sequence selection algorithm.

Table 2 shows the performance of dense video captioning algorithms evaluated on the ActivityNet Captions validation set. We measure the scores with both ground-truth proposals and learned ones, where the number of predicted proposals differs across algorithms; DCE, DVC, Masked Transformer and TDA-CG use 1,000, 1,000, 226.78 and 97.61 proposals on average, respectively, while the average number of proposals in SDVC is only 2.85. According to Table 2, SDVC improves the quality of captions significantly compared to all other methods. Masked Transformer achieves performance comparable to ours using ground-truth proposals, but does not work well with learned proposals. Note that it uses optical flow features in addition to visual features, while SDVC is trained on visual features only. Since the motion information from optical flow consistently improves performance in other video understanding tasks [12, 18], incorporating motion information into our model may lead to additional gains. MFT has the highest METEOR score among the existing methods, which is partly because MFT considers temporal dependency across captions.

Table 3 shows the test split results from the evaluation server. SDVC achieves competitive performance based only on basic visual features, while the other methods exploit additional modalities (e.g., audio and optical flow) to represent videos and/or ensemble models to boost accuracy, as described in [5].

Table 4. Ablation results in terms of recall, precision and METEOR averaged over the four tIoU thresholds 0.3, 0.5, 0.7 and 0.9 on the ActivityNet Captions validation set, together with the average number of proposals. The bold-faced number means the best performance.

Method             | EPN | ESGN | eventRNN | episodeRNN | RL | Avg. proposals | Recall | Precision | METEOR
EPN-Ind            | √   |      | √        |            |    | 77.99          |        |           |
ESGN-Ind           |     | √    | √        |            |    | 2.85           |        |           |
ESGN-SCN           |     | √    | √        | √          |    | 2.85           |        |           |
ESGN-SCN-RL (SDVC) |     | √    | √        | √          | √  | 2.85           |        |           |
We perform several ablation studies on the ActivityNet Captions validation set to investigate the contributions of individual components of our algorithm.

Table 5. Performance comparison varying the reward levels (event-level and episode-level) used in reinforcement learning, measured in METEOR on the ActivityNet Captions dataset.

In this experiment, we train the following four variants of our model: 1) EPN-Ind: generating captions independently from all candidate event proposals, which is a baseline similar to most existing frameworks; 2)
ESGN-Ind: generating captions independently, using the event RNN only, from the events in the event sequence identified by our ESGN; 3) ESGN-SCN: generating captions sequentially using our hierarchical RNN from the detected event sequence; and 4) ESGN-SCN-RL: our full model (SDVC), which uses reinforcement learning to further optimize the captioning network.

Table 4 summarizes the results of this ablation study, and we make the following observations. First, the approach based on ESGN (ESGN-Ind) is more effective than the baseline that simply relies on all event proposals (EPN-Ind). ESGN also reduces the number of candidate proposals significantly, from 77.99 to 2.85 on average, with a substantial increase in METEOR score, which indicates that ESGN successfully identifies event sequences among the candidate event proposals. Second, context modeling through the hierarchical structure (i.e., event RNN + episode RNN) of the captioning network (ESGN-SCN) enhances performance compared to independent caption generation without context (ESGN-Ind). Finally, ESGN-SCN-RL successfully integrates reinforcement learning to further improve the quality of the generated captions.

We also analyze the impact of the two reward levels—event and episode—used for reinforcement learning. The results are presented in Table 5, which clearly demonstrates the effectiveness of training with rewards from both levels.
Fig. 4 illustrates qualitative results, where the detected event sequences and generated captions are presented together. We compare the captions generated by our model (SDVC), which generates captions sequentially, with the model (ESGN-Ind) that generates descriptions independently from the detected event sequences.

Figure 4. Qualitative results on the ActivityNet Captions dataset. The arrows represent ground-truth events (red) and events in the predicted event sequence from our event sequence generation network (blue) for the input videos. Note that the events in the event sequence are selected in the order of their indices. For the predicted events, we show the captions generated independently (ESGN-Ind) and sequentially (SDVC). More consistent captions are obtained by our sequential captioning network, where words for comparison are marked in bold-faced black.

Note that the proposed ESGN effectively identifies event sequences for the input videos, and our sequential caption generation strategy facilitates describing events more coherently by exploiting both visual and linguistic context. For instance, in the first example in Fig. 4, SDVC captures the linguistic context ('two men' in e_1 is referred to as 'they' in both e_2 and e_3) as well as the temporal dependency between events (the expression 'continue' in e_3), while ESGN-Ind just recognizes and describes e_2 and e_3 as independently occurring events.
6. Conclusion
We presented a novel framework for dense video captioning, which considers visual and linguistic context for coherent caption generation by explicitly modeling the temporal dependency across events in a video. Specifically, we introduced the event sequence generation network to detect a series of event proposals adaptively. Given the detected event sequence, a sequence of captions is generated by conditioning on preceding events in our sequential captioning network. We trained the captioning network in a supervised manner and further optimized it via reinforcement learning with two-level rewards for better context modeling. Our algorithm achieved state-of-the-art accuracy on the ActivityNet Captions dataset in terms of METEOR.
Acknowledgments
This work was partly supported by Snap Inc., the Korean ICT R&D program of the MSIP/IITP grant [2016-0-00563, 2017-0-01780], and SNU ASRI.

References

[1] Lorenzo Baraldi, Costantino Grana, and Rita Cucchiara. Hierarchical Boundary-Aware Neural Encoder for Video Captioning. In CVPR, 2017.
[2] Shyamal Buch, Victor Escorcia, Chuanqi Shen, Bernard Ghanem, and Juan Carlos Niebles. SST: Single-Stream Temporal Action Proposals. In CVPR, 2017.
[3] Victor Escorcia, Fabian Caba Heilbron, Juan Carlos Niebles, and Bernard Ghanem. DAPs: Deep Action Proposals for Action Understanding. In ECCV, 2016.
[4] Zhe Gan, Chuang Gan, Xiaodong He, Yunchen Pu, Kenneth Tran, Jianfeng Gao, Lawrence Carin, and Li Deng. Semantic Compositional Networks for Visual Captioning. In CVPR, 2017.
[5] Bernard Ghanem, Juan Carlos Niebles, Cees Snoek, Fabian Caba Heilbron, Humam Alwassel, Victor Escorcia, Ranjay Khrisna, Shyamal Buch, and Cuong Duc Dao. The ActivityNet Large-Scale Activity Recognition Challenge 2018 Summary. arXiv preprint arXiv:1808.03766, 2018.
[6] Jiuxiang Gu, Jianfei Cai, Gang Wang, and Tsuhan Chen. Stack-Captioning: Coarse-to-Fine Learning for Image Captioning. In AAAI, 2018.
[7] Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. In ICLR, 2015.
[8] Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, and Juan Carlos Niebles. Dense-Captioning Events in Videos. In ICCV, 2017.
[9] Yuncheng Li, Yale Song, Liangliang Cao, Joel Tetreault, Larry Goldberg, Alejandro Jaimes, and Jiebo Luo. TGIF: A New Dataset and Benchmark on Animated GIF Description. In CVPR, 2016.
[10] Yehao Li, Ting Yao, Yingwei Pan, Hongyang Chao, and Tao Mei. Jointly Localizing and Describing Events for Dense Video Captioning. In CVPR, 2018.
[11] Jonghwan Mun, Minsu Cho, and Bohyung Han. Text-Guided Attention Model for Image Captioning. In AAAI, 2017.
[12] Phuc Nguyen, Ting Liu, Gautam Prasad, and Bohyung Han. Weakly Supervised Action Localization by Sparse Temporal Pooling Network. In CVPR, 2018.
[13] Pingbo Pan, Zhongwen Xu, Yi Yang, Fei Wu, and Yueting Zhuang. Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning. In CVPR, 2016.
[14] Yingwei Pan, Tao Mei, Ting Yao, Houqiang Li, and Yong Rui. Jointly Modeling Embedding and Translation to Bridge Video and Language. In CVPR, 2016.
[15] Yingwei Pan, Ting Yao, Houqiang Li, and Tao Mei. Video Captioning with Transferred Semantic Attributes. In CVPR, 2017.
[16] Zhou Ren, Xiaoyu Wang, Ning Zhang, Xutao Lv, and Li-Jia Li. Deep Reinforcement Learning-Based Image Captioning with Embedding Reward. In CVPR, 2017.
[17] Steven J. Rennie, Etienne Marcheret, Youssef Mroueh, Jarret Ross, and Vaibhava Goel. Self-Critical Sequence Training for Image Captioning. In CVPR, 2017.
[18] Karen Simonyan and Andrew Zisserman. Two-Stream Convolutional Networks for Action Recognition in Videos. In NIPS, 2014.
[19] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning Spatiotemporal Features with 3D Convolutional Networks. In ICCV, 2015.
[20] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention Is All You Need. In NIPS, 2017.
[21] Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. CIDEr: Consensus-Based Image Description Evaluation. In CVPR, 2015.
[22] Subhashini Venugopalan, Marcus Rohrbach, Jeffrey Donahue, Raymond Mooney, Trevor Darrell, and Kate Saenko. Sequence to Sequence - Video to Text. In ICCV, 2015.
[23] Subhashini Venugopalan, Huijuan Xu, Jeff Donahue, Marcus Rohrbach, Raymond Mooney, and Kate Saenko. Translating Videos to Natural Language using Deep Recurrent Neural Networks. In NAACL-HLT, 2015.
[24] Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. Pointer Networks. In NIPS, 2015.
[25] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and Tell: A Neural Image Caption Generator. In CVPR, 2015.
[26] Bairui Wang, Lin Ma, Wei Zhang, and Wei Liu. Reconstruction Network for Video Captioning. In CVPR, 2018.
[27] Jingwen Wang, Wenhao Jiang, Lin Ma, Wei Liu, and Yong Xu. Bidirectional Attentive Fusion with Context Gating for Dense Video Captioning. In CVPR, 2018.
[28] Junbo Wang, Wei Wang, Yan Huang, Liang Wang, and Tieniu Tan. M3: Multimodal Memory Modelling for Video Captioning. In CVPR, 2018.
[29] Ronald J. Williams and David Zipser. A Learning Algorithm for Continually Running Fully Recurrent Neural Networks. Neural Computation, 1(2):270–280, 1989.
[30] Yilei Xiong, Bo Dai, and Dahua Lin. Move Forward and Tell: A Progressive Generator of Video Descriptions. In ECCV, 2018.
[31] Jun Xu, Tao Mei, Ting Yao, and Yong Rui. MSR-VTT: A Large Video Description Dataset for Bridging Video and Language. In CVPR, 2016.
[32] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, and Yoshua Bengio. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. In ICML, 2015.
[33] Li Yao, Atousa Torabi, Kyunghyun Cho, Nicolas Ballas, Christopher Pal, Hugo Larochelle, and Aaron Courville. Describing Videos by Exploiting Temporal Structure. In ICCV, 2015.
[34] Ting Yao, Yingwei Pan, Yehao Li, Zhaofan Qiu, and Tao Mei. Boosting Image Captioning with Attributes. In ICCV, 2017.
[35] Haonan Yu, Jiang Wang, Zhiheng Huang, Yi Yang, and Wei Xu. Video Paragraph Captioning using Hierarchical Recurrent Neural Networks. In CVPR, 2016.
[36] Youngjae Yu, Hyungjin Ko, Jongwook Choi, and Gunhee Kim. End-to-End Concept Word Detection for Video Captioning, Retrieval, and Question Answering. In CVPR, 2017.
[37] Luowei Zhou, Yingbo Zhou, Jason J. Corso, Richard Socher, and Caiming Xiong. End-to-End Dense Video Captioning with Masked Transformer. In CVPR, 2018.
A. Details of Event RNN