Semantically Sensible Video Captioning (SSVC)

Md. Mushfiqur Rahman, Thasin Abedin, Khondokar S. S. Prottoy, Ayana Moshruba, and Fazlul Hasan Siddiqui
Islamic University of Technology, Gazipur, Bangladesh
Dhaka University of Engineering and Technology, Gazipur, Bangladesh

Corresponding author: Md. Mushfiqur Rahman
Email address: mushfi[email protected]
ABSTRACT
Video captioning, i.e. the task of generating captions from video sequences, creates a bridge between the Natural Language Processing and Computer Vision domains of computer science. Generating a semantically accurate description of a video is an arduous task. Considering the complexity of the problem, the results obtained in recent research are quite outstanding, but there is still plenty of scope for improvement. This paper addresses this scope and proposes a novel solution. Most video captioning models comprise two sequential/recurrent layers - one as a video-to-context encoder and the other as a context-to-caption decoder. This paper proposes a novel architecture, namely Semantically Sensible Video Captioning (SSVC), which modifies the context generation mechanism by using two novel approaches - “stacked attention” and “spatial hard pull”. For evaluating the proposed architecture, along with the BLEU (Papineni et al., 2002) scoring metric for quantitative analysis, we have used a human evaluation metric for qualitative analysis. This paper refers to this proposed human evaluation metric as the Semantic Sensibility (SS) scoring metric. The SS score overcomes the shortcomings of common automated scoring metrics. This paper reports that the use of the aforementioned novelties improves the performance of state-of-the-art architectures.
INTRODUCTION
After the success of image captioning in recent times, researchers have been interested in exploring the scope of video captioning. Video captioning is the process of describing a video with a meaningful caption using Natural Language Processing. The core mechanism of video captioning is based on the sequence-to-sequence architecture (Gers et al., 2000). In video captioning models, the encoder encodes the visual stream and the decoder generates the caption. Such models are capable of retaining both the spatial and the temporal information that is essential for generating semantically correct video captions. This requires the video to be split up into a sequence of frames. The model uses these frames as input and generates a series of meaningful words in the form of a caption as output.

Video captioning has many applications, for example, interaction between humans and machines, aid for people with visual impairments, video indexing, information retrieval, fast video retrieval, etc. Unlike image captioning, where only spatial information is required to generate captions, video captioning requires a mechanism that combines spatial information with temporal information to store both the higher-level and the lower-level features needed to generate semantically sensible captions. Even though progress has been rapid, there is scope to work on the existing complexities of the video captioning task. One of the main challenges is the ability to extract high-level features from videos to generate a more meaningful caption, for which we come up with a solution.

In this paper we propose a novel architecture that is based on the work of Venugopalan et al. (2015). It uses the combination of two novel methods - a variation of dual attention (Nam et al., 2017), namely Stacked Attention, and a novel method, namely Spatial Hard Pull. On the encoding side, we use a stacked sequential encoder having two bi-directional LSTM layers. The stacked attention network sets priority to the objects in the video layer by layer.
To overcome the redundancy of similar information being lost in the LSTM layers, we propose the novel method Spatial Hard Pull. For the decoding side, we employ a sequential decoder with a single-layer LSTM and a fully connected layer to generate a word from a given context produced by the encoder.

Most text generation architectures use BLEU (Papineni et al., 2002) as the scoring metric. But due to its inability to consider recall, a few variations, including ROUGE (Lin, 2004), METEOR (Banerjee and Lavie, 2005), etc., were introduced. Though these automatic scoring metrics have been modified in different ways to give more meaningful results, they have their shortcomings (Kilickaya et al., 2016; Aafaq et al., 2019b). On top of that, no scoring metric designed solely for the purpose of video captioning is available, to the best of our knowledge. Some relevant works (Xu et al., 2017; Pei et al., 2019) have used human evaluation. To get a better understanding of the captioning capability of our model, we perform qualitative analysis based on human evaluation and propose our own metric for video captioning, the “Semantic Sensibility Score” or “SS score” in short.

Figure 1. The Video Captioning Task (example caption: “A man is playing a guitar.”)
RELATED WORKS
For the past few decades, much work has been conducted on analysing videos to extract different forms of information, such as sports-feature summary (Shih, 2017; Ekin et al., 2003; Ekin and Tekalp, 2003; Li and Sezan, 2001), medical video analysis (Quellec et al., 2017), video finger-print (Oostveen et al., 2002) and other high-level features (Chang et al., 2005; Divakaran et al., 2003; Kantorov and Laptev, 2014). These high-level feature extraction mechanisms heavily relied on analyzing each frame separately and therefore could not retain the sequential information. Only when the use of memory-retaining cells like the LSTM (Gers et al., 2000) became computationally possible were models capable of storing meaningful temporal information for complex tasks like caption generation (Venugopalan et al., 2015). Previously, caption generation was mostly treated with template-based learning approaches (Kojima et al., 2002; Xu et al., 2015) or other adaptations of the statistical machine translation approach (Rohrbach et al., 2013).
Sequence-to-sequence architecture for video captioning
A video is a sequence of frames and the output of a video captioning model is a sequence of words. So, video captioning can be classified as a sequence-to-sequence (seq2seq) task. Sutskever et al. (2014) introduce the seq2seq architecture, where the encoder encodes an input sentence and the decoder generates a translated sentence. After the remarkable results of the seq2seq architecture in different seq2seq tasks (Shao et al., 2017; Weiss et al., 2017), it is only intuitive to leverage this architecture in video captioning works like Venugopalan et al. (2015). In recent years, different variations of the base seq2seq architecture have been widely used, e.g. hierarchical approaches (Baraldi et al., 2017; Wang et al., 2018a; Shih, 2017), variations of GANs (Yang et al., 2018), boundary-aware encoder approaches (Shih, 2017; Baraldi et al., 2017), etc.
Attention in sequence-to-sequence tasks
In earlier seq2seq literature (Venugopalan et al., 2015; Pan et al., 2017; Sutskever et al., 2014), the decoder cells generate the next word from the context of the preceding word and the fixed output of the encoder. As a result, the overall context of the encoded information often got lost and the generated output became highly dependent on the last hidden cell state. The introduction of the attention mechanism (Vaswani et al., 2017) paved the way to solve this problem. The attention mechanism enables the model to store the context from the start to the end of the sequence. This allows the model to focus on certain input sequences at each stage of output sequence generation (Bahdanau et al., 2014; Luong et al., 2015). Luong et al. (2015) proposed a combined global-local attention mechanism for translation models. In the global attention scheme, the whole input is given attention at a time, while the local attention scheme attends to a part of the input at a given time. Work on video captioning was enhanced by these ideas. Bin et al. (2018) describe a bidirectional LSTM model with attention for producing a better global contextual representation as well as enhancing the longevity of all the contexts to be recognized. Gao et al. (2017) build a hierarchical decoder with a fused GRU; their network combines a semantic-information-based hierarchical GRU, a semantic-temporal-attention-based GRU, and a multi-modal decoder. Ballas et al. (2016) proposed to leverage the frame spatial topology by introducing an approach to learn spatio-temporal features in videos from intermediate visual representations using GRUs. Similarly, several other variations of attention exist, including multi-faceted attention (Long et al., 2018), multi-context fusion attention (Wang et al., 2018a), etc. All these papers use one attention at a time, which limits the information available to the respective models. Nam et al. (2017) introduce a mechanism to use multiple attentions; with their dual attention mechanism, they retain visual and textual information simultaneously. Zhang et al. (2020) achieved commendable scores by proposing an object relational graph (ORG) based encoder that captures more detailed interaction features, and designed a teacher-recommended learning (TRL) method to integrate abundant linguistic knowledge into the captioning model.
METHODOLOGY
This paper proposes a novel architecture that uses a combination of stacked attention and spatial hard pull on top of a base video-to-text architecture to generate captions from video sequences. This paper refers to this architecture as Semantically Sensible Video Captioning (SSVC).
Data Pre-processing and Representation
The primary input of the model is a video sequence. The data pre-processor converts it into a usable format before passing it to the actual model. The primary output of the model is a sequence of words. The words are stacked to generate the required caption.
Visual Feature Extraction
A video is nothing but a sequence of frames. Each frame is a 2D image with n channels. In sequential architectures, either the frames are directly passed into ConvLSTM (Xingjian et al., 2015) layer(s), or the frames are individually passed through a convolutional block and then passed into LSTM (Gers et al., 2000) layer(s). Due to our computational limitations, our model uses the latter option. Like “Sequence to Sequence - Video to Text” (Venugopalan et al., 2015), our model uses a pre-trained VGG16 model (Simonyan and Zisserman, 2014) and extracts the fc7 layer’s output. This CNN block converts each (224 × 224 × 3) shaped frame into a (1 × 4096) shaped vector. These vectors are the primary inputs of our model.
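As an illustration of this step, the sketch below extracts the per-frame VGG16 features with TensorFlow/Keras. The layer name "fc2" (the 4096-unit fully connected layer that corresponds to VGG's fc7 in Keras' naming) and the exact preprocessing call are assumptions about how such a pipeline could be wired, not the authors' released code.

```python
# Hedged sketch: per-frame feature extraction with a pre-trained VGG16.
import numpy as np
import tensorflow as tf

vgg = tf.keras.applications.VGG16(weights="imagenet", include_top=True)
# "fc2" is Keras' name for the second 4096-unit FC layer (VGG's fc7).
feature_extractor = tf.keras.Model(inputs=vgg.input,
                                   outputs=vgg.get_layer("fc2").output)

def encode_frames(frames):
    """frames: array of shape (num_frames, 224, 224, 3), pixel values in [0, 255]."""
    x = tf.keras.applications.vgg16.preprocess_input(
        np.asarray(frames, dtype="float32"))
    return feature_extractor.predict(x, verbose=0)   # (num_frames, 4096)
```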
Textual Feature Representation
Each video sequence has multiple corresponding captions and each caption has a variable number of words. In our model, to create captions of equal length, all the captions are padded with “pad” markers. A “start” marker and an “end” marker mark the start and the end of each caption. The entire text data is tokenized, and each word is represented by a one-hot vector of shape (1 × unique word count). So, a caption with m words is represented by a matrix of shape (m × 1 × unique word count). Instead of using these one-hot vectors directly, our model embeds each word into a vector of shape (1 × embedding dimension) with a pre-trained embedding layer. The embedded vectors are placed in the vector space so that their mutual distances reflect the semantic relationships of the corresponding words.
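A possible realisation of this text pipeline is sketched below: captions are tokenized and padded, and a frozen embedding layer is initialised from the pre-trained glove.6B.100d vectors. The marker words, the file name, and the tokenizer choice are illustrative assumptions rather than the authors' exact preprocessing.

```python
# Hedged sketch of caption tokenization and a frozen GloVe embedding layer.
import numpy as np
import tensorflow as tf

captions = ["start a man is playing a guitar end"]                 # toy caption with markers
tokenizer = tf.keras.preprocessing.text.Tokenizer(oov_token="unk")
tokenizer.fit_on_texts(captions)
padded = tf.keras.preprocessing.sequence.pad_sequences(
    tokenizer.texts_to_sequences(captions), padding="post")        # the "pad" role is played by zeros

embedding_dim = 100
vocab_size = len(tokenizer.word_index) + 1
embedding_matrix = np.zeros((vocab_size, embedding_dim), dtype="float32")
with open("glove.6B.100d.txt", encoding="utf-8") as f:             # pre-trained GloVe vectors
    for line in f:
        word, *vector = line.split()
        if word in tokenizer.word_index:
            embedding_matrix[tokenizer.word_index[word]] = np.asarray(vector, dtype="float32")

embedding_layer = tf.keras.layers.Embedding(
    vocab_size, embedding_dim, weights=[embedding_matrix], trainable=False)
```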
Base architecture
Like most sequence-to-sequence models, our base architecture consists of a sequential encoder and a sequential decoder. The encoder converts the sequential input vectors into contexts and the decoder converts those contexts into captions. This work proposes an encoder with two LSTM layers and stacked attention. The mechanism that stacks attention and the mechanism that pulls spatial information from the input vectors are the two novel concepts in this paper and are discussed in detail in later sections.
Figure 2. Proposed model with Stacked Attention and Spatial Hard Pull
Multi-layered Sequential Encoder
The proposed method uses a time-distributed fully connected layer followed by two consecutive bi-directional LSTM layers. The fully connected layer works on each frame separately, and its output then moves to the LSTM layers. In sequence-to-sequence literature, it is common to use a stacked LSTM for the encoder. Our intuition is that the two layers capture separate information from the video sequence, and Fig. 4 shows that having two layers ensures optimum performance. The output of the encoder is converted into a context. In relevant literature, this context is mostly generated using a single attention layer. This is where this paper proposes a novel concept: with the mechanism described in later sections, our model generates a spatio-temporal context.
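A minimal sketch of this encoder is given below, assuming the sizes reported later in the paper (15 frames per clip, 4096-d frame features, 256 LSTM units per layer); the width of the time-distributed FC layer is an illustrative assumption.

```python
# Hedged sketch of the multi-layered sequential encoder.
import tensorflow as tf

frame_features = tf.keras.Input(shape=(15, 4096))             # 15 frames x VGG16 fc7 features
x = tf.keras.layers.TimeDistributed(
    tf.keras.layers.Dense(512, activation="relu"))(frame_features)
enc_1 = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(256, return_sequences=True))(x)      # first encoder layer
enc_2 = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(256, return_sequences=True))(enc_1)  # second encoder layer
# enc_1 and enc_2 feed the stacked attention; x feeds the spatial hard pull.
encoder = tf.keras.Model(frame_features, [x, enc_1, enc_2], name="ssvc_encoder")
```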
Single-layered Sequential Decoder
The proposed decoder uses a single-layer LSTM followed by a fully connected layer to generate a word from a given context. In relevant literature, many models have used stacked decoders, and most of these papers suggest that each layer of the decoder handles separate information. Our model uses a single layer, as our experimental results show that a stacked decoder does not improve the result much for our architecture. Therefore, instead of stacking decoder layers, we increased the number of decoder cells. Specifically, we have used twice as many cells in the decoder as in the encoder, which has shown the optimum output during experimentation.
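The corresponding decoder step could look like the sketch below: a single 512-unit LSTM (twice the encoder's 256 units) followed by a softmax FC layer over the vocabulary. The sizes and the way the context is combined with the word embedding at each step are simplifying assumptions.

```python
# Hedged sketch of one step of the single-layered sequential decoder.
import tensorflow as tf

vocab_size, embedding_dim, context_dim = 10000, 100, 1024     # illustrative sizes

prev_word = tf.keras.Input(shape=(1,))                         # teacher-forced or previously generated word id
context = tf.keras.Input(shape=(1, context_dim))               # context produced by the encoder for this step
word_emb = tf.keras.layers.Embedding(vocab_size, embedding_dim)(prev_word)
step_input = tf.keras.layers.Concatenate(axis=-1)([word_emb, context])
dec_out, state_h, state_c = tf.keras.layers.LSTM(512, return_state=True)(step_input)
next_word_probs = tf.keras.layers.Dense(vocab_size, activation="softmax")(dec_out)
decoder_step_model = tf.keras.Model([prev_word, context],
                                    [next_word_probs, state_h, state_c])
```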
Training and Inference Behaviour
To mark the start of a caption and to distinguish this marker from the real caption, a “start” token is used at the beginning. The decoder uses this token as a reference to generate the first true word of the caption. Fig. 2 represents this as “first word”. During inference, each subsequent word is generated with the previously generated word as reference. The sequentially generated words together form the desired caption. The loop terminates upon receiving the “end” marker.

During training, if each iteration in the generation loop used the previously generated word, one wrong generation could derail the entire remaining caption and make the error calculation vulnerable. To overcome this, like most seq-to-seq papers, we use the teacher forcing mechanism (Lamb et al., 2016). This method uses words from the original caption as references for generating the next words during the training loop. Therefore, the generation of each word is independent of previously generated words. Fig. 2 illustrates this difference between training-time and inference-time behaviour. During training, the “Teacher Forced Word” is the word from the reference caption for that iteration.
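The difference between the two modes can be summarised by the small training helper below. It assumes a hypothetical decoder_step function (e.g. wrapping the decoder sketched earlier) that maps a word id and a step context to softmax probabilities; the key point is that the reference word, not the prediction, is fed back at each step.

```python
# Hedged sketch of teacher-forced training for one caption.
import tensorflow as tf

def teacher_forced_loss(decoder_step, step_contexts, target_ids, start_id):
    """step_contexts: per-step contexts; target_ids: 1-D tensor of reference word ids."""
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()
    loss = 0.0
    prev_word = tf.constant([[start_id]])                       # "start" token
    for t in range(target_ids.shape[0]):
        probs = decoder_step(prev_word, step_contexts[t])        # predict word t
        loss += loss_fn(target_ids[t:t + 1], probs)
        prev_word = tf.reshape(target_ids[t], (1, 1))            # feed the reference word, not the prediction
    return loss / float(target_ids.shape[0])
```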
Figure 3. Diagram of Stacked Attention: (A) single-layer attention; (B) stacking attention for n layers
Proposed Context Generation architecture
This paper proposes two novel methods. The effectiveness of these methods makes this work stand out from other works in this field.
Stacked Attention
Attention creates an importance map over the individual vectors of a sequence of vectors. In text-to-text, i.e. translation models, this mapping provides valuable information about which word or phrase on the input side has a higher correlation with which words and phrases in the output. However, in video captioning, attention plays a different role. For a particular word, instead of determining which frame or frames (from the original video) to put more emphasis on, the stacked attention emphasizes objects. This paper uses a stacked LSTM and, like other relevant literature (Venugopalan et al., 2015; Song et al., 2017), reports that separate layers carry separate information. So, if each layer has separate information, it is only intuitive to generate separate attention for each layer. Our architecture stacks the separately generated attentions and connects them with a fully connected layer with tanh activation. The output of this layer determines whether to put more emphasis on the object or the action.

$f_{attn}([h, ss]) = a_s\left(W_a^{*}\,\tanh(W[h, ss] + b) + b^{*}\right)$  (1)

$c_{attn} = \mathrm{dot}\left(h, f_{attn}([h, ss])\right)$  (2)

$c_{st} = a_{relu}\left(W_{st}\,[c_{attn}^{1}, c_{attn}^{2}, \ldots, c_{attn}^{n}] + b_{st}\right)$  (3)

where
h = encoder output for one layer,
ss = decoder state, repeated to match h’s dimension,
$a_s(x) = \exp(x - \max(x)) / \sum \exp(x - \max(x))$ (softmax activation),
$a_{relu}$ = ReLU activation,
n = number of attention layers to be stacked,
$c_{attn}$ = context from a single attention layer,
$c_{st}$ = stacked context for the n encoder layers.

Eq. 1 is the attention function. Eq. 2 uses the output of this function to generate the attention context for one layer. Eq. 3 combines the attention contexts of several layers to generate the desired spatio-temporal context. The paper also refers to this context as the “stacked context”. Fig. 3 corresponds with these equations. In SSVC, we have particularly used n = 2, where n is the number of attention layers in the stacked attention.

The stacked attention mechanism generates the spatio-temporal context for the input video sequence. All the low-level context required to generate the next word is available through this novel context generation mechanism.
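A sketch of how Eqs. (1)-(3) can be realised as a layer is shown below: one additive attention per encoder layer, the per-layer contexts concatenated, and a ReLU FC layer mixing them into the spatio-temporal context. Layer sizes and the exact weight sharing are illustrative assumptions.

```python
# Hedged sketch of the stacked attention of Eqs. (1)-(3).
import tensorflow as tf

class StackedAttention(tf.keras.layers.Layer):
    def __init__(self, attn_units=256, context_units=512):
        super().__init__()
        self.score_dense = tf.keras.layers.Dense(attn_units, activation="tanh")  # W, b in Eq. (1)
        self.score_out = tf.keras.layers.Dense(1)                                # W*_a, b* in Eq. (1)
        self.mix = tf.keras.layers.Dense(context_units, activation="relu")       # W_st, b_st in Eq. (3)

    def call(self, encoder_outputs, decoder_state):
        # encoder_outputs: list of (batch, T, H) tensors, one per encoder layer
        # decoder_state:   (batch, H_dec) previous decoder state
        contexts = []
        for h in encoder_outputs:
            ss = tf.ones_like(h[:, :, :1]) * decoder_state[:, None, :]   # repeat state over time (ss)
            scores = self.score_out(self.score_dense(tf.concat([h, ss], axis=-1)))
            weights = tf.nn.softmax(scores, axis=1)                      # a_s in Eq. (1)
            contexts.append(tf.reduce_sum(weights * h, axis=1))          # dot(h, f_attn), Eq. (2)
        return self.mix(tf.concat(contexts, axis=-1))                    # c_st, Eq. (3)

# e.g. c_st = StackedAttention()([enc_1, enc_2], previous_decoder_state) for n = 2
```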
Spatial Hard Pull
Amaresh and Chitrakala (2019) mention that most successful image and video captioning models mainly learn to map low-level visual features to sentences and do not focus on high-level semantic video concepts such as actions and objects. By low-level features, they mean object shapes and their existence in the video. High-level features refer to proper object classification with position in the video and the context in which the object appears in the video. On the other hand, our analysis of previous architectures shows that almost identical information is often found in nearby frames of a video. However, passing the frames through an LSTM layer does not help to extract any valuable information from this almost identical information. So, we have devised a method to hard-pull the output of the time-distributed layer and use it to add high-level visual information to the context. This method enables us to extract meaningful high-level features, like objects and their relative positions in the individual frames.

This method extracts information from all frames simultaneously and does not consider sequential information. As the layer pulls spatial information from sparsely located frames, this paper names it the “Spatial Hard Pull” layer. It can be compared to a skip connection; but unlike other skip connections, it skips a recurrent layer and directly contributes to the context. The number of output units of the fully connected (FC) layer of this spatial-hard-pull layer determines how much effect the sparse layer has on the context.
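The sketch below shows one way such a layer could be written. How SSVC picks the sparsely located frames is described above only at a high level, so the strided frame sampling, the stride value, and the default of 45 FC units are assumptions for illustration.

```python
# Hedged sketch of a Spatial Hard Pull layer.
import tensorflow as tf

class SpatialHardPull(tf.keras.layers.Layer):
    def __init__(self, units=45, frame_stride=5):
        super().__init__()
        self.frame_stride = frame_stride                        # pull every k-th frame (assumed selection rule)
        self.flatten = tf.keras.layers.Flatten()
        self.project = tf.keras.layers.Dense(units, activation="relu")

    def call(self, time_distributed_features):
        # time_distributed_features: (batch, T, F) output of the time-distributed FC layer, T static (e.g. 15)
        sparse = time_distributed_features[:, ::self.frame_stride, :]   # sparsely located frames
        return self.project(self.flatten(sparse))                       # visual context, skips the LSTMs

# The visual context is then concatenated with the stacked-attention context, e.g.
# full_context = tf.concat([stacked_context, spatial_hard_pull_output], axis=-1)
```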
PROPOSED SCORING METRIC
No automatic scoring metric has been designed yet for the sole purpose of video captioning. The existing metrics that have been built for other purposes, like neural machine translation, image captioning, etc., are used for evaluating video captioning models. For quantitative analysis, we use the BLEU scoring metric (Papineni et al., 2002). Although these metrics serve similar purposes, according to Aafaq et al. (2019b), they fall short in generating “meaningful” scores for video captioning.

BLEU is a precision metric. It is mainly designed to evaluate text at a corpus level. However, Post (2018), Callison-Burch et al. (2006), and Graham (2015) demonstrate the inefficiency of the BLEU scoring metric in generating a meaningful score. A video may have multiple contexts, so machines face difficulty in accurately measuring the merit of the generated captions as there is no single right answer. Therefore, for video captioning, it is more challenging to generate meaningful scores that reflect the captioning capability of the model. As a result, human evaluation is an important part of judging the effectiveness of a captioning model. In fact, Figs. 6, 7, 8, 9, and 10 show that a higher BLEU score is not necessarily a good reflection of the captioning capability. On the other hand, our proposed human evaluation method gives a better reflection of the model’s performance compared to the BLEU scores.
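For reference, the snippet below shows a common way to obtain the four BLEU variants with NLTK; the paper's own evaluation script is not reproduced here, and the example sentences are made up.

```python
# Example: corpus-level BLEU-1..4 with NLTK (illustrative data).
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

references = [[["a", "woman", "is", "cutting", "a", "pork", "chop"]]]   # per sample: list of reference captions
hypotheses = [["a", "woman", "is", "cutting", "a", "piece", "of", "meat"]]

smooth = SmoothingFunction().method1
for n in range(1, 5):
    weights = tuple(1.0 / n for _ in range(n))                  # uniform n-gram weights for BLEU-n
    score = corpus_bleu(references, hypotheses, weights=weights, smoothing_function=smooth)
    print(f"BLEU{n}: {score:.4f}")
```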
Semantic Sensibility (SS) Score Evaluation
To get a better understanding of the captioning capability of our model, we perform qualitative analysis based on human evaluation, similar to Graham et al. (2018), Xu et al. (2017), and Pei et al. (2019). We propose a human evaluation metric, namely the “Semantic Sensibility” score, for video captioning. It evaluates sentences at a contextual level with respect to the videos, based on both recall and precision. It takes 3 factors into consideration: the grammatical structure of the predicted sentence, the detection of the most important element (subject or object) in the video, and whether the caption gives an exact or synonymous analogy to the action of the video to describe the overall context.

It is to be noted that for the latter two factors, we take both the recall and the precision values into consideration, according to their general definitions. For recall, we check whether the prominent elements and actions of the video sample are captured by the predicted caption; for precision, we check whether the elements and actions mentioned in the predicted caption are actually present in the video. Following such comparisons, each variable is assigned a boolean value of 1 or 0 based on human judgement. The significance of the variables and how to assign their values are elaborated below.

S grammar

$S_{grammar} = \begin{cases} 1, & \text{if grammatically correct} \\ 0, & \text{otherwise} \end{cases}$  (4)

$S_{grammar}$ evaluates the grammatical correctness of the generated caption without considering the video.

S element

$S_{element} = \frac{1}{R}\sum_{i=1}^{R} S_{element_i}^{recall} + \frac{1}{P}\sum_{i=1}^{P} S_{element_i}^{precision}$  (5)

where R = number of prominent objects in the video and P = number of prominent objects in the caption.

As $S_{action}$ evaluates the action-similarity between the predicted caption and its corresponding video, $S_{element}$ evaluates the object-similarity. For each object in the caption, the corresponding $S_{element}^{precision}$ receives a boolean score, and for the major objects in the video, the corresponding $S_{element}^{recall}$ receives a boolean score. The average recall and the average precision are combined to get $S_{element}$.

S action

$S_{action} = S_{action}^{recall} + S_{action}^{precision}$  (6)

$S_{action}$ evaluates the ability to describe the action-similarity between the predicted caption and its corresponding video. $S_{action}^{recall}$ and $S_{action}^{precision}$ separately receive a boolean score (1 for correct, 0 for incorrect) for action recall and action precision respectively. By action recall, we determine if the generated caption has successfully captured the most prominent action of the video segment. Similarly, by action precision, we determine if the action mentioned in the generated caption is present in the video or not.

SS score calculation
Combining equations Eq. 4, Eq. 5 and Eq. 6, the equation for the SS Score can be obtained.
$SS\,score = \frac{1}{N}\sum_{n=1}^{N}\left(S_{grammar} \times S_{element} + S_{action}\right)$  (7)

where N is the number of evaluated video-caption pairs.
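A small helper that aggregates per-video human judgements into the SS score is sketched below. It follows Eqs. (4)-(7) exactly as reconstructed above, so the normalisation may differ from the authors' original formulation; the dictionary field names are illustrative.

```python
# Hedged sketch: aggregating human judgements into the SS score per Eqs. (4)-(7).
def ss_score(annotations):
    """annotations: one dict per evaluated video, e.g.
    {"grammar": 1, "element_recall": [1, 0], "element_precision": [1],
     "action_recall": 1, "action_precision": 0}"""
    def avg(values):
        return sum(values) / len(values) if values else 0.0
    total = 0.0
    for a in annotations:
        s_element = avg(a["element_recall"]) + avg(a["element_precision"])   # Eq. (5)
        s_action = a["action_recall"] + a["action_precision"]                # Eq. (6)
        total += a["grammar"] * s_element + s_action                         # per-video term of Eq. (7)
    return total / len(annotations)                                          # Eq. (7)
```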
RESULTS

Dataset And Experimental Setup
Our experiments are primarily centered around comparing our novel model with different commonly used architectures for video captioning that provide state-of-the-art results, such as simple attention (Gao et al., 2017; Wu et al., 2018), modifications of the attention mechanism (Yang et al., 2018; Yan et al., 2019; Zhang et al., 2019), and variations of visual feature extraction techniques (Aafaq et al., 2019a; Wang et al., 2018b). We conducted the experiments under an identical computational environment - Framework: TensorFlow 2.0; Platform: Google Cloud Platform with a virtual machine having an 8-core processor and 30 GB RAM; GPU: none. We used the Microsoft Research Video Description (MSVD) dataset (Chen and Dolan, 2011). It contains 1970 video snippets together with 40 English captions (Chen et al., 2020) for each video. We split the entire dataset into training, validation, and test sets with 1200, 100, and 670 snippets respectively, following previous works (Venugopalan et al., 2015; Pan et al., 2016). To create a data sequence, frames from a video are taken with a fixed temporal distance; we used 15 frames for each data sequence. For the pre-trained embedding layer, we used ‘glove.6B.100d’ (Pennington et al., 2014). Due to the lack of a GPU, we used 256 LSTM units in each encoder layer and 512 LSTM units in our decoder network and trained each experimental model for 40 epochs. To analyse the importance of the Spatial Hard Pull layer, we also tuned the Spatial Hard Pull FC units from 0 to 45 and 60 successively.

However, due to limitations such as the lack of sufficient computational resources, rigidity in our data pre-processing caused by memory constraints, and the inability to train on a bigger dataset, we could not compare our novel model against global benchmarks. On top of that, most of the benchmark models use multiple features as input, whereas we only use a single 2D-CNN feature as input because we wanted to make an extensive study of the capability of 2D CNNs for video captioning. So, we implemented some of the fundamental concepts used in most state-of-the-art works in our experimental setup and compared the results. Thus a proper qualitative and quantitative analysis of our model against other existing works was achieved to show that our novelties improve the performance of state-of-the-art methods.

Our proposed architecture, SSVC, with 45 hard-pull units and 2-layer stacked attention, gives BLEU scores of BLEU1: 0.7072, BLEU2: 0.5193, BLEU3: 0.3961, and BLEU4: 0.1886 after 40 epochs of training with the best combination of hyper-parameters. For generating the SS score, we considered the first 50 random videos from the test set. We obtained an SS score of 0.34 for the SSVC model.
Ablation Study of Stacked Attention
Figure 4. Comparing Stacked Attention with variations in the encoder attention architecture: BLEU1-BLEU4 and SS scores (SSVC) over training epochs for a single LSTM with attention, a dual LSTM with single attention, a dual LSTM with stacked attention, and a triple LSTM with stacked attention
• No attention: Many previous works (Long et al., 2018; Nam et al., 2017) mention that captioning models perform better with some form of attention mechanism. Thus, in this paper, we avoid comparing the use of attention against no attention.

• Non-stacked (single) attention: In relevant literature, though the use of attention is very common, the use of stacked attention is quite infrequent. Nam et al. (2017) have shown the use of stacked (dual) attention and the performance improvements that are possible through it. In Fig. 4, the comparison between single attention and stacked attention indicates that dual attention has a clear edge over single attention.

• Triple attention: Since dual attention improves performance in comparison to single attention, it is natural to build a triple attention and check its performance. Fig. 4 shows that triple attention under-performs in comparison to all the other variants.

Considering our limitations, our stacked attention gives satisfactory results on both the BLEU and the SS score in comparison to the commonly used attention methods under a similar experimental setup. The graphs in Fig. 4 suggest that our stacked attention improves the results of existing methods due to improved overall temporal information. Moreover, we can clearly see that the 2-layer LSTM encoder performs much better than the single- or triple-layer encoder. Combining these two facts, we conclude that our dual-layer encoder LSTM with stacked attention has the capability to improve corresponding architectures.
Figure 5. Evaluating model performance with varied hard-pull units: BLEU1-BLEU4 and SS scores (SSVC) over training epochs for no SHP, SHP with 45 units, and SHP with 60 units
Ablation Study of Spatial Hard Pull
To boost the captioning capability, many state-of-the-art works (Pan et al., 2017) emphasized the importance of retrieving additional visual information by some method. We implemented the same fundamental idea in our model with the Spatial Hard Pull. To depict the effectiveness of our Spatial Hard Pull (SHP), we conducted experiments keeping our stacked attention constant and changing the SHP FC units to 0, 45, and 60 units successively. Fig. 5 shows that as the number of SHP FC units is increased from 0 to 45, both the BLEU and the SS score improve, and they gradually fall again from 45 to 60. The improvement in the early stages indicates that the SHP layer is indeed improving the model. The fall of the scores in the later stages occurs because the model starts to show high variance. Hence, it is evident from this analysis that our approach of using an SHP layer yields satisfactory results compared to not using any SHP layer.
DISCUSSION
By thorough experimentation in a fixed experimental setting, we analysed the spatio-temporal behaviour of a video captioning model. After observing that a single-layer encoder LSTM causes more repetitive predictions, we used double- and triple-layer LSTM encoders to encode the visual information into a better sequence. Hence, we were able to propose our novel stacked attention mechanism with a double encoder layer that performs better than the rest. The intuition behind this mechanism is that, as our model separately gives attention to each encoder layer, it generates a better overall temporal context with which to decode the video sequence and decide whether to give more priority to the object or the action. The addition of the Spatial Hard Pull to this model bolsters its ability to identify and map high-level semantic visual information. Moreover, the results also indicate that the addition of excess SHP units drastically affects the performance of the model. Hence, a balance is to be maintained while increasing the SHP units so that the model does not over-fit. As a result, both of these key components greatly contributed to improving the overall final performance of our novel architecture, which is based upon the existing fundamental concepts of state-of-the-art models.

Although the model performed well in the qualitative and quantitative analysis, our proposed SS score gives a more meaningful method to analyse video captioning models.
Figure 6. Completely accurate captions.
(A) SSVC: “a woman is cutting a piece of meat”; GT: “a woman is cutting into the fatty areas of a pork chop”
(B) SSVC: “a person is slicing a tomato”; GT: “someone wearing blue rubber gloves is slicing a tomato with a large knife”
(C) SSVC: “a man is mixing ingredients in a bowl”; GT: “chicken is being season” (BLEU3: 0.61)
In Fig. 6A, 6B, and 6C, the model performs well and both the SS score and BLEU evaluate the captions appropriately.

Figure 7. Captions with accurate actions.
(A) SSVC: “a woman is slicing an onion”; GT: “someone is cutting red bell peppers into thin slices” (BLEU3: 0.63)
(B) SSVC: “a man is cutting a piece of meat”; GT: “a man is peeling a carrot” (BLEU3: 0.0)
(C) SSVC: “a man is dancing”; GT: “four girls are dancing onstage” (BLEU3: 0.0, BLEU4: 0.0)
In Fig. 7A and 7C, our model is able to extract only the action correctly and gets a decent score on both SS and BLEU. In Fig. 7B, SSVC gives a bad prediction with respect to the context of the video, but BLEU4 scores it highly.

Figure 8. Captions with accurate objects.
(A) SSVC: “a man is jumping”; GT: “a guy typing on a computer”
(B) SSVC: “a kitten is playing with a toy”; GT: “a cat in a cage is angrily meowing at something” (BLEU3: 0.0, BLEU4: 0.0)
(C) SSVC: “a man is eating spaghetti”; GT: “a man pours cooked pasta from a plastic container into a bowl” (BLEU4: 0.0)
In Fig. 8A, 8B, and 8C, the generated caption is completely wrong in the case of the actions, but BLEU1 gives a very high score. On the contrary, the SS score heavily penalizes them.

Figure 9. Partially accurate captions.
(A) SSVC: “a man is driving a car”; GT: “a car drives backwards while trying to escape the police”
(B) SSVC: “a baby is crawling”; GT: “a small baby is dancing” (BLEU4: 0.0)
In Fig. 9A, a car is driving away while the police chase it; our SSVC model only predicts the driving part. In Fig. 9B, a girl is at first dancing by herself and later starts crawling on the floor; our SSVC model manages to predict the crawling part, and interestingly, the crawling part does not even exist in the ground truth. Thus the generated captions only partially capture the original idea. However, BLEU evaluates them with very high scores, whereas the SS score evaluates them accordingly.

Figure 10. Inaccurate captions.
(A) SSVC: “a man is writing”; GT: “someone is burning the tops of two cameras with a torch”
(B) SSVC: “a woman is pouring milk into a bowl”; GT: “a powdered substance is being shifted into a pan”
(C) SSVC: “a man is mixing a pot of water”; GT: “a fish is being fried in a pan”
In Fig. 10A, 10B, and 10C, the generated caption is completely wrong, yet BLEU1 gives a very high score where the SS score gives a straight zero. So BLEU performs poorly here.

The automatic metrics, although useful, cannot interpret the videos correctly. In our experimental results, we can see a steep rise in the BLEU score in Fig. 4 and Fig. 5 at early epochs even though the predicted captions are not up to the mark. This suggests the limitations of the BLEU score in judging the captions properly with a meaningful score. The SS score takes these limitations into account and captures the semantic relationship between the context of the videos and the generated language, portraying a model's ability to interpret a video into language in its truest sense. Hence, we can safely evaluate the captioning capability of our Stacked Attention with Spatial Hard Pull mechanism and better understand the acceptability of the performance of our novel model.
CONCLUSION AND FUTURE WORK
Video captioning is a complex task. Most state-of-the-art models are only able to extract lower-level features and ignore the higher-level features completely. This paper shows how stacking the attention layer for a multi-layer encoder makes a more semantically accurate context. Complementing it, the Spatial Hard Pull introduced in this paper is able to capture the higher-level features with greater efficiency.

Due to our computational limitations, our experiments use custom pre-processing and a constrained training environment. We also use a single feature as input, unlike most state-of-the-art models. Therefore, the scores we obtained in our experiments are not comparable to global benchmarks. In the future, we hope to perform similar experiments with industry-standard pre-processing and multiple features as input.

The paper also introduces the novel SS score. This deterministic scoring metric has shown great quality in calculating the semantic sensibility of a generated video caption. But since it is a human evaluation metric, it relies heavily on human understanding, and thus a lot of manual work has to be put behind it. As this metric is deterministic, however, we are confident that an automated algorithm to compute the same score can be obtained. For the grammar score, we can use the rule-based style and grammar checker technique of Naber et al. (2003). We plan on designing a method to automatically compute the remaining precision- and recall-based scores as well. This way, the entire SS score metric can be computed automatically.
REFERENCES
Aafaq, N., Akhtar, N., Liu, W., Gilani, S. Z., and Mian, A. (2019a). Spatio-temporal dynamics and semantic attribute enriched visual encoding for video captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 12487–12496.
Aafaq, N., Mian, A., Liu, W., Gilani, S. Z., and Shah, M. (2019b). Video description: A survey of methods, datasets, and evaluation metrics. ACM Computing Surveys (CSUR), 52(6):1–37.
Amaresh, M. and Chitrakala, S. (2019). Video captioning using deep learning: An overview of methods, datasets and metrics, pages 0656–0661. IEEE.
Bahdanau, D., Cho, K., and Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
Ballas, N., Yao, L., Pal, C., and Courville, A. C. (2016). Delving deeper into convolutional networks for learning video representations. In Bengio, Y. and LeCun, Y., editors, 4th International Conference on Learning Representations (ICLR).
Banerjee, S. and Lavie, A. (2005). METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72.
Baraldi, L., Grana, C., and Cucchiara, R. (2017). Hierarchical boundary-aware neural encoder for video captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1657–1666.
Bin, Y., Yang, Y., Shen, F., Xie, N., Shen, H. T., and Li, X. (2018). Describing video with attention-based bidirectional LSTM. IEEE Transactions on Cybernetics, 49(7):2631–2641.
Callison-Burch, C., Osborne, M., and Koehn, P. (2006). Re-evaluating the role of BLEU in machine translation research. In 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL).
Chang, S.-F., Hsu, W. H., Kennedy, L. S., Xie, L., Yanagawa, A., Zavesky, E., and Zhang, D.-Q. (2005). Columbia University TRECVID-2005 video search and high-level feature extraction. In TRECVID, pages 0–06.
Chen, D. and Dolan, W. B. (2011). Collecting highly parallel data for paraphrase evaluation. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 190–200.
Chen, H., Li, J., and Hu, X. (2020). Delving deeper into the decoder for video captioning. arXiv preprint arXiv:2001.05614.
Divakaran, A., Sun, H., and Ito, H. (2003). Methods of feature extraction of video sequences. US Patent 6,618,507.
Ekin, A., Tekalp, A. M., and Mehrotra, R. (2003). Automatic soccer video analysis and summarization. IEEE Transactions on Image Processing, 12(7):796–807.
Ekin, A. and Tekalp, M. (2003). Generic play-break event detection for summarization and hierarchical sports video analysis. Volume 1, pages I–169. IEEE.
Gao, L., Guo, Z., Zhang, H., Xu, X., and Shen, H. T. (2017). Video captioning with attention-based LSTM and semantic consistency. IEEE Transactions on Multimedia, 19(9):2045–2055.
Gers, F. A., Schmidhuber, J. A., and Cummins, F. A. (2000). Learning to forget: Continual prediction with LSTM. Neural Computation, 12(10):2451–2471.
Graham, Y. (2015). Re-evaluating automatic summarization with BLEU and 192 shades of ROUGE. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 128–137.
Graham, Y., Awad, G., and Smeaton, A. (2018). Evaluation of automatic video captioning using direct assessment. PLOS ONE, 13(9):e0202789.
Kantorov, V. and Laptev, I. (2014). Efficient feature extraction, encoding and classification for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2593–2600.
Kilickaya, M., Erdem, A., Ikizler-Cinbis, N., and Erdem, E. (2016). Re-evaluating automatic metrics for image captioning. arXiv preprint arXiv:1612.07600.
Kojima, A., Tamura, T., and Fukunaga, K. (2002). Natural language description of human activities from video images based on concept hierarchy of actions. International Journal of Computer Vision, 50(2):171–184.
Lamb, A. M., Goyal, A. G. A. P., Zhang, Y., Zhang, S., Courville, A. C., and Bengio, Y. (2016). Professor forcing: A new algorithm for training recurrent networks. In Advances in Neural Information Processing Systems, pages 4601–4609.
Li, B. and Sezan, M. I. (2001). Event detection and summarization in sports video. In Proceedings of the IEEE Workshop on Content-Based Access of Image and Video Libraries (CBAIVL 2001), pages 132–138. IEEE.
Lin, C.-Y. (2004). ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81.
Long, X., Gan, C., and de Melo, G. (2018). Video captioning with multi-faceted attention. Transactions of the Association for Computational Linguistics, 6:173–184.
Luong, M.-T., Pham, H., and Manning, C. D. (2015). Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025.
Naber, D. et al. (2003). A rule-based style and grammar checker.
Nam, H., Ha, J.-W., and Kim, J. (2017). Dual attention networks for multimodal reasoning and matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 299–307.
Oostveen, J., Kalker, T., and Haitsma, J. (2002). Feature extraction and a database strategy for video fingerprinting. In International Conference on Advances in Visual Information Systems, pages 117–128. Springer.
Pan, Y., Mei, T., Yao, T., Li, H., and Rui, Y. (2016). Jointly modeling embedding and translation to bridge video and language. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4594–4602.
Pan, Y., Yao, T., Li, H., and Mei, T. (2017). Video captioning with transferred semantic attributes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6504–6512.
Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. (2002). BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318.
Pei, W., Zhang, J., Wang, X., Ke, L., Shen, X., and Tai, Y.-W. (2019). Memory-attended recurrent network for video captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8347–8356.
Pennington, J., Socher, R., and Manning, C. D. (2014). GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.
Post, M. (2018). A call for clarity in reporting BLEU scores. arXiv preprint arXiv:1804.08771.
Quellec, G., Cazuguel, G., Cochener, B., and Lamard, M. (2017). Multiple-instance learning for medical image and video analysis. IEEE Reviews in Biomedical Engineering, 10:213–234.
Rohrbach, M., Qiu, W., Titov, I., Thater, S., Pinkal, M., and Schiele, B. (2013). Translating video content to natural language descriptions. In Proceedings of the IEEE International Conference on Computer Vision, pages 433–440.
Shao, L., Gouws, S., Britz, D., Goldie, A., Strope, B., and Kurzweil, R. (2017). Generating high-quality and informative conversation responses with sequence-to-sequence models. arXiv preprint arXiv:1701.03185.
Shih, H.-C. (2017). A survey of content-aware video analysis for sports. IEEE Transactions on Circuits and Systems for Video Technology, 28(5):1212–1231.
Simonyan, K. and Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
Song, J., Guo, Z., Gao, L., Liu, W., Zhang, D., and Shen, H. T. (2017). Hierarchical LSTM with adjusted temporal attention for video captioning. arXiv preprint arXiv:1706.01231.
Sutskever, I., Vinyals, O., and Le, Q. V. (2014). Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems.
Venugopalan, S., Rohrbach, M., Donahue, J., Mooney, R., Darrell, T., and Saenko, K. (2015). Sequence to sequence - video to text. In Proceedings of the IEEE International Conference on Computer Vision, pages 4534–4542.
Wang, B., Ma, L., Zhang, W., and Liu, W. (2018a). Reconstruction network for video captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7622–7631.
Wang, J., Wang, W., Huang, Y., Wang, L., and Tan, T. (2018b). M3: Multimodal memory modelling for video captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7512–7520.
Weiss, R. J., Chorowski, J., Jaitly, N., Wu, Y., and Chen, Z. (2017). Sequence-to-sequence models can directly translate foreign speech. arXiv preprint arXiv:1703.08581.
Wu, X., Li, G., Cao, Q., Ji, Q., and Lin, L. (2018). Interpretable video captioning via trajectory structured localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6829–6837.
Xingjian, S., Chen, Z., Wang, H., Yeung, D.-Y., Wong, W.-K., and Woo, W.-c. (2015). Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In Advances in Neural Information Processing Systems, pages 802–810.
Xu, J., Yao, T., Zhang, Y., and Mei, T. (2017). Learning multimodal attention LSTM networks for video captioning. In Proceedings of the 25th ACM International Conference on Multimedia, pages 537–545.
Xu, R., Xiong, C., Chen, W., and Corso, J. J. (2015). Jointly modeling deep video and compositional text to bridge vision and language in a unified framework. In Twenty-Ninth AAAI Conference on Artificial Intelligence.
Yan, C., Tu, Y., Wang, X., Zhang, Y., Hao, X., Zhang, Y., and Dai, Q. (2019). STAT: Spatial-temporal attention mechanism for video captioning. IEEE Transactions on Multimedia, 22(1):229–241.
Yang, Y., Zhou, J., Ai, J., Bin, Y., Hanjalic, A., Shen, H. T., and Ji, Y. (2018). Video captioning by adversarial LSTM. IEEE Transactions on Image Processing, 27(11):5600–5611.
Zhang, Z., Shi, Y., Yuan, C., Li, B., Wang, P., Hu, W., and Zha, Z. J. (2020). Object relational graph with teacher-recommended learning for video captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13275–13285.
Zhang, Z., Xu, D., Ouyang, W., and Tan, C. (2019). Show, tell and summarize: Dense video captioning using visual cue aided sentence summarization. IEEE Transactions on Circuits and Systems for Video Technology.