Annotation Cleaning for the MSR-Video to Text Dataset
Haoran Chen, Jianmin Li, Simone Frintrop, and Xiaolin Hu
Abstract—The video captioning task is to automatically describe the content of a video in natural language. Many methods have been proposed for this task. A large dataset called MSR Video to Text (MSR-VTT) is often used as the benchmark dataset for testing the performance of these methods. However, we found that the human annotations, i.e., the descriptions of video contents in the dataset, are quite noisy: for example, there are many duplicate captions and many captions contain grammatical problems. These problems may pose difficulties to video captioning models during learning. We cleaned the MSR-VTT annotations by removing these problems, then tested several typical video captioning models on the cleaned dataset. Experimental results showed that data cleaning boosted the performance of the models as measured by popular quantitative metrics. We also recruited subjects to evaluate the results of a model trained on the original and cleaned datasets. The human behavior experiment demonstrated that, trained on the cleaned dataset, the model generated captions that were more coherent and more relevant to the contents of the video clips. The cleaned dataset is publicly available.
Index Terms—MSR-VTT dataset, data cleaning, data analysis, video captioning.
1 INTRODUCTION

The goal of the video captioning task is to summarize the content of a video clip in a single sentence; it is an extension of the image captioning task [1], [2], [3], [4]. To accomplish this task, one must use both computer vision (CV) and natural language processing (NLP) techniques. A benchmark dataset, called MSR-Video to Text 1.0 (MSR-VTT v1) [5], was released in 2016. It contains 10,000 video clips, and each clip is described by 20 captions, which are supposed to be different, given by human annotators. The dataset has become popular in the field of video captioning: as of February 8th, 2021, the work [5] had been cited 501 times according to Google Scholar.
However, with a quick look, one can find many duplicate annotations, spelling mistakes and syntax errors in the annotations (Figs. 1, 2). It is unknown exactly how many mistakes there are in the dataset and whether or how these mistakes influence the performance of video captioning models.
We quantitatively analyzed the annotations in the MSR-VTT dataset and identified four main types of problems. First, for some of the video clips, thousands of annotations have duplicates. Second, thousands of special characters, such as "+", "-", ".", "/", ":", exist in the annotations. Third, thousands of spelling mistakes exist in the annotations. Fourth, hundreds of sentences are redundant or incomplete. We developed techniques to clean the annotations by solving these problems. Our experiments demonstrated that existing models, trained on the cleaned training set, performed better than the same models trained on the original training set. A human evaluation study also showed that a state-of-the-art model trained on the cleaned training set generated better captions, in terms of semantic relevance and sentence coherence, than the same model trained on the original training set. The cleaned dataset is available to the public¹.

• Haoran Chen, Jianmin Li and Xiaolin Hu are with the Institute for Artificial Intelligence, the State Key Laboratory of Intelligent Technology and Systems, Beijing National Research Center for Information Science and Technology, and the Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China. E-mail: [email protected], {lijianmin, xlhu}@tsinghua.edu.cn
• Simone Frintrop is with the Department of Informatics, University of Hamburg, Hamburg, Germany. E-mail: [email protected]
Manuscript received February 12, 2021.
1. A man is throwing a football at a target. × 2
2. A man throws an American football at an aiming board. × 3
3. Kids throws football at target. × 4
4. Man throwing football to target in slow motion. × 2
5. People are playing sports. × 2
6. Someone is throwing a football at a target. × 2
Figure 1. An example video clip (No. 4290, starting from 0) with duplicate annotations. ×t denotes repeating t times.
1. https://github.com/WingsBrokenAngel/MSR-VTT-DataCleaning
Video id: video7021, sentence id: 148018
Caption: this is vidio from a baseball game
Video id: video6751, sentence id: 124320
Caption: usa president barack obama talking standing on the dias
Video id: video9852, sentence id: 187440
Caption: in a restarunt all cups and some else vessels are felldown from the desk and brokens
Figure 2. Three examples in the MSR-VTT dataset. The words in blue and red denote grammatical mistakes and spelling mistakes, respectively.
2 RELATED WORK
Two datasets that are not limited to a specific domain, MSVD (also called YouTube2Text) and MSR-VTT, are widely used as benchmarks in recent video captioning research. MSVD was published in 2013 [6]. It contains 1,970 video clips and roughly 80,000 captions; each video clip is paired with about 40 captions. MSR-VTT v1 was published in 2016 [5]. It contains 10,000 video clips and 200,000 captions; each video clip is paired with 20 captions. The MSR-VTT v2 dataset was proposed in the second Video Captioning Competition, using the MSR-VTT v1 dataset as the training and validation sets and 3,000 additional video clips as the test set. However, the annotations of the test set are not open to the public.
2. http://ms-multimedia-challenge.com/2017/challenge

Many models have been proposed for video captioning [7], [8], [9], [10], [11], [12], [13], [14], [15], [16], [17], [18], [19]. In SCN [7], semantic concepts are detected from the video and the probability distribution of each tag is integrated into the parameters of a recurrent unit. Video captioning is improved by sharing knowledge with two related tasks on the encoder and the decoder of a sequence-to-sequence model [8]. In CIDEnt-RL [9], reinforcement learning for video captioning is enhanced with a mixed-loss function and a CIDEr-entailment reward. In HATT [10], multiple modalities are fused by hierarchical attention, which helps to improve the model performance. In ECO [11], video features produced by an Efficient Convolutional Network are fed into a video captioning model, which boosts the quality of the generated captions. In GRU-EVE [12], the Short Fourier Transform is applied to video features and high-level semantics are derived from an object detector in order to generate captions rich in semantics. In MARN [13], a memory structure captures the comprehensive visual information associated with a word across the whole training set. In SibNet [14], the encoder employs a sibling (dual-branch) architecture to encode video clips. HACA [15] fuses the global and local temporal dynamics of a video clip and generates an accurate description with knowledge from different modalities. In TAMoE [16], different expert modules are trained to provide knowledge for describing out-of-domain video clips. SAM-SS [17] is trained in a self-teaching manner with meaningful semantic features to reduce the gap between the training and test phases. In POS_RL [18], different types of representations are encoded and fused by a cross-gating block, and captions are generated with part-of-speech information. In VNS-GRU [19], the "absolute equalitarianism" of the training process is alleviated by professional learning, and a comprehensive selection method is used to choose the best checkpoint for the final test.
3 ANALYSIS AND CLEANING OF THE MSR-VTT DATASET
Since MSR-VTT v2 uses MSR-VTT v1 for training and validation, and the annotations of the test set of MSR-VTT v2 are not open to the public, we performed our analysis on MSR-VTT v1. The MSR-VTT v1 dataset contains 10,000 video clips: the training set has 6,513 video clips, the validation set has 497 video clips and the test set has 2,990 video clips. All clips are categorized into 20 classes with diverse contents and sceneries. A total of 0.2 million human annotations were collected to describe those video clips. The training/validation/test sets have 130,260/9,940/59,800 annotations, respectively. The vocabulary sizes of the training/validation/test sets are 23,666/5,993/16,001, respectively.
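Statistics of this kind are simple counts over (split, caption) pairs. The sketch below illustrates the computation on toy data; the variable names, the input format and the lower-casing choice are ours rather than taken from the official annotation files.

```python
from collections import defaultdict

def split_statistics(pairs):
    """Count annotations and vocabulary size per split from (split, caption) pairs."""
    counts = defaultdict(int)
    vocab = defaultdict(set)
    for split, caption in pairs:
        counts[split] += 1
        vocab[split].update(caption.lower().split())
    return {s: (counts[s], len(vocab[s])) for s in counts}

# Toy example; on the real MSR-VTT splits this yields 130,260/9,940/59,800
# annotations and vocabularies of 23,666/5,993/16,001 words.
print(split_statistics([("train", "a man is singing"),
                        ("train", "a man plays guitar"),
                        ("test", "a dog runs")]))
# e.g. {'train': (2, 6), 'test': (1, 3)}
```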
3.1 Special Characters

There are 60 different characters in the dataset: the digits 0-9, the letters a-z, and 24 special characters listed in Table 1 (the space character is not counted). Generally speaking, these special characters are not used to train a model. We intended to remove the special characters while preserving the information in the annotations. We processed the special characters as follows (a minimal code sketch of these rules is given after Table 1):
1) The characters ">", "[", "]", "(", ")" and "\" were removed from the sentences, where "[", "]", "(" and ")" were removed only when they were not in pairs.
2) The contents between bracket pairs "()" and "[]" were removed.
3) The special characters "-", "|", "`", "@", "_", "'" and "/" were replaced with spaces.
4) Characters from other languages were replaced by the most similar English characters; for example, "é" was replaced by "e" and "в" by "b" in "вeautiful".
5) "&" between two different words was substituted by "and".
In total, 7,247 sentences, which account for 3.6% of all sentences, were modified.

Table 1
Special characters in the MSR-VTT dataset (24 in total; space excluded), including: + - . / : > @ [ ] ( ) \ | ` ' & é в
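To make the procedure concrete, the following is a minimal Python sketch of the five rules above. It is an illustration only: the function name, the regular expressions and the rule ordering are our own assumptions rather than the released cleaning code.

```python
import re

def clean_special_characters(caption: str) -> str:
    """Apply the five special-character rules of Section 3.1 to one caption."""
    # Rule 2: drop the contents of matched "()" and "[]" pairs, brackets included.
    caption = re.sub(r"\([^()]*\)|\[[^\[\]]*\]", " ", caption)
    # Rule 1: remove ">", "\" and any unpaired brackets that remain.
    caption = re.sub(r"[>\\()\[\]]", "", caption)
    # Rule 5: "&" between two words becomes "and".
    caption = re.sub(r"(?<=\w)\s*&\s*(?=\w)", " and ", caption)
    # Rule 3: replace separator-like characters with spaces.
    caption = re.sub(r"[-|`@_'/]", " ", caption)
    # Rule 4: map a few non-English characters to similar English letters.
    caption = caption.replace("é", "e").replace("в", "b")
    # Collapse the extra whitespace introduced by the substitutions.
    return re.sub(r"\s+", " ", caption).strip()

print(clean_special_characters("a man [in red] plays rock&roll on a guitar/stage"))
# -> "a man plays rock and roll on a guitar stage"
```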
3.2 Spelling Mistakes

Many spelling mistakes were found in the annotations during a manual check. Tokenization is the process of demarcating a string of an input sentence into a list of words. After tokenizing each sentence, we used Hunspell, a popular spell-checking software, to detect spelling errors.
3. Available at https://hunspell.github.io
Before running the spelling checks, we added 784 new words to the Hunspell vocabulary. We chose these words manually by four criteria:
1) popular word abbreviations, e.g., F1, WWF, RPG;
2) widely used specific terms, e.g., Minecraft, Spongebob, Legos;
3) new words that are popular on the Internet, e.g., gameplay, spiderman, talkshow;
4) names of persons, e.g., Mariah, Fallon, Avril.
After that, we found spelling mistakes in 19,038 of the 200,000 annotations; 21,826 words were flagged as possibly misspelled by Hunspell. We corrected those candidates in the following steps:
1) Substituted British English spellings with the corresponding American English spellings, e.g., colour → color, travelling → traveling, programme → program, practising → practicing, theatre → theater. There were 61 such pairs.
2) Split unusual words that were created by concatenating two different words, e.g., rockclimbing → rock climbing, blowdrying → blow drying, swordfighting → sword fighting, screencaster → screen caster, rollercoaster → roller coaster. In total, 34 distinct words were found.
3) Corrected words that truly contain spelling mistakes, e.g., discusing → discussing, explaning → explaining, coversation → conversation, vedio → video, diffrent → different.
In total, 32,056 words were substituted, split or corrected in these three steps.

3.3 Duplicate Annotations

We found duplicate sentences for many video clips (Fig. 1). For each video clip, duplicates were removed. The similarity between two sentences a and b was defined as

s_{a,b} = 0.5 (μ(a,b)/ι(a) + μ(a,b)/ι(b)),   (1)

where ι(x) denotes the number of words in the sentence x = {x_1, x_2, ...} and μ(a,b) denotes the number of words in the longest common subsequence of a and b. μ(a,b) is defined as

μ(a,b) = max ι(c)   (2)
s.t. c ∈ a, c ∈ b,   (3)

where c ∈ x means that c is a subsequence of x. Two words w_1 and w_2 were regarded as the same word if the Levenshtein distance [20] between them was less than or equal to a threshold ē. Two sentences were regarded as duplicates if s_{a,b} > s̄, where s̄ is the similarity threshold. With proper values of ē and s̄, we could find duplicated sentences that differ only slightly. For example, in the second pair of sentences in Table 2, the character "m" is missing in the word "woan" and the second sentence has only one additional word, "young"; the two sentences are almost the same in terms of meaning. A minimal implementation of this similarity is sketched at the end of this subsection.

Table 2
Examples of duplicate pairs and their similarity values, computed with ē = 0, 1, 2. Each pair of sentences describes the same video clip.
  Sentences                                                Similarity (ē = 0, 1, 2)
  a woman is walking down the aisle in a wedding           0.86, 0.96, 0.96
  a woman is walking down the isle in a wedding dress
  a man is talking to a woan                               0.80, 0.94, 0.94
  a young man is talking to a woman
  a woman is singing on a music video                      0.83, 0.94, 0.94
  a young woman is singing in a music video

After duplicate removal, 183,856 annotations remained in the dataset, with 119,625 in the training set, 9,126 in the validation set and 55,105 in the test set. Each clip has at least 9 annotations, at most 20, and 18.4 on average.
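The following is a minimal Python sketch of the similarity defined in Eqs. (1)-(3). The function names, the dynamic-programming details and the whitespace tokenization are our own choices; the released cleaning code may differ, but the sketch reproduces the first similarity value in Table 2.

```python
def edit_distance(w1: str, w2: str) -> int:
    """Levenshtein distance between two words (standard dynamic programming)."""
    prev = list(range(len(w2) + 1))
    for i, c1 in enumerate(w1, 1):
        cur = [i]
        for j, c2 in enumerate(w2, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (c1 != c2)))
        prev = cur
    return prev[-1]

def similarity(a: str, b: str, e_bar: int = 0) -> float:
    """Eq. (1): mean LCS coverage of the two sentences, where two words match
    if their edit distance is at most e_bar."""
    wa, wb = a.split(), b.split()
    # Word-level longest common subsequence, Eqs. (2)-(3), with tolerant matching.
    dp = [[0] * (len(wb) + 1) for _ in range(len(wa) + 1)]
    for i, u in enumerate(wa, 1):
        for j, v in enumerate(wb, 1):
            if edit_distance(u, v) <= e_bar:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    mu = dp[-1][-1]
    return 0.5 * (mu / len(wa) + mu / len(wb))

s1 = "a woman is walking down the aisle in a wedding"
s2 = "a woman is walking down the isle in a wedding dress"
print(round(similarity(s1, s2, e_bar=0), 2))  # ≈ 0.86, cf. the first row of Table 2
# A pair is treated as a duplicate when similarity(...) > s_bar (e.g., 0.85).
```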
3.4 Redundant and Incomplete Annotations

In the task of video captioning, each annotation is expected to contain a single sentence. However, many annotations in the dataset consist of multiple sentences (Fig. 3).
1. A women in a dress talks about data scientist she tells how they are problem solvers and well educated she starts asking how you can stand out among other data scientist.
2. A video game is displayed on the screen and in this game a man riding a motorcycle hits a car then we see a webpage with cars with a man speaking as a voice over.
Figure 3. Examples of redundant annotations in the MSR-VTT dataset. Caption 1 can be divided into three sentences, and Caption 2 can be divided into two or three sentences.

The first annotation in Fig. 3 can be split into three complete sentences: "A women in a dress talks about data scientist.", "She tells how they are problem solvers and well educated.", "She starts asking how you can stand out among other data scientist." Such annotations cause two potential problems. First, models trained on them may output grammatically problematic sentences, because the annotations are syntactically incorrect. Second, such annotations in the test set are no longer reliable ground truth, so the metrics computed with them are not reliable either.
To solve these two problems, one needs to manually separate each such annotation into several complete sentences and then merge them into a single sentence. Because there are too many annotations in the training and validation sets, we did this only for the test set. For the training and validation sets, we truncated the sentences longer than l_a + 2σ words, where l_a and σ denote the average sentence length and its standard deviation, respectively.
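The truncation rule is simple enough to sketch directly. The snippet below is illustrative only: it assumes whitespace tokenization and sentence lengths measured in words, which the released cleaning code may handle differently.

```python
import statistics

def truncate_long_annotations(annotations):
    """Truncate annotations longer than l_a + 2*sigma words (Section 3.4)."""
    lengths = [len(s.split()) for s in annotations]
    limit = int(statistics.mean(lengths) + 2 * statistics.stdev(lengths))
    return [" ".join(s.split()[:limit]) for s in annotations]

captions = ["a man is singing"] * 20 + ["one very long run on annotation " * 8]
# The 48-word outlier is cut back to the l_a + 2*sigma limit (25 words here).
print(len(truncate_long_annotations(captions)[-1].split()))
```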
4 EXPERIMENTS

We conducted experiments on the original and cleaned MSR-VTT datasets with several existing video captioning models: SCN [7], ECO [11], SAM-SS [17] and VNS-GRU [19]. They were trained for 30, 30, 50 and 80 epochs, respectively.
Table 3
Influence of the edit distance threshold ē on the remaining annotation count and the performance of the model VNS-GRU. We set s̄ = 0.85. All metric values are presented in percentage. SC represents the number of remaining sentences in the dataset. Columns: ē, SC, B4, C, M, R, O.

The models were evaluated on the validation set at the end of each epoch. The first two models used the early stopping strategy with the cross-entropy loss as the indicator. The last two models used the Comprehensive Selection Method to select a checkpoint for testing [19]. For the sake of fair comparison, the experimental settings were the same as in the original papers. The two hyperparameters ē and s̄ (see Section 3.3) were set to 0 and 0.85 in our experiments, unless otherwise stated.
We adopted BLEU, CIDEr, METEOR and ROUGE-L as objective metrics for evaluating the results of the models. BLEU is a quick and easy-to-calculate metric, originally used for evaluating the performance of machine translation models [21]. CIDEr is a metric that captures human consensus [22]. METEOR is a metric that involves precision, recall and order correlation, based on unigram matches [23]. ROUGE-L is a metric that determines the quality of a summary by finding the longest common subsequence [24]. Besides these individual metrics, we used a score that combines all of them [17]:

O_i = (B_i/B_b + C_i/C_b + M_i/M_b + R_i/R_b)/4,   (4)

where the subscript i denotes model i and the subscript b denotes the best score of the corresponding metric over the group of models being compared. B4, C, M, R and O denote BLEU-4, CIDEr, METEOR, ROUGE-L and the overall score (4), respectively (a small computational sketch is given below).
In the step of removing duplicated annotations, there are two hyperparameters: the edit distance threshold ē and the similarity threshold s̄. We investigated the sensitivity of the output of this step to these hyperparameters. As shown in Table 3, the remaining sentence count decreases as ē increases; the performance of the model VNS-GRU was best when ē = 0. As shown in Table 4, the remaining sentence count increases with s̄; the performance of the model VNS-GRU was best when s̄ = 0.85. Table 2 shows that, with the method described in Section 3.3, we can find sentences that are semantically similar but differ in only one or two words.
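For reference, the overall score in Eq. (4) can be computed directly from the four metric values. The sketch below uses hypothetical numbers, not values from our tables, and the dictionary-based interface is our own choice.

```python
def overall_score(metrics, best):
    """Eq. (4): average of each metric normalized by the best value
    achieved by any model in the comparison group."""
    return sum(metrics[k] / best[k] for k in ("B4", "C", "M", "R")) / 4

# Hypothetical example: one model's scores and the per-metric best scores.
best = {"B4": 45.0, "C": 53.0, "M": 30.0, "R": 64.0}
print(round(overall_score({"B4": 43.0, "C": 50.0, "M": 29.0, "R": 62.0}, best), 4))
```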
Table 4
Influence of the similarity threshold s̄ on the remaining annotation count and the performance of the model VNS-GRU. We set ē = 0. SC represents the number of remaining sentences in the dataset. Columns: s̄, SC, B4, C, M, R, O.

Table 5
Results on the original/cleaned MSR-VTT dataset.
Model      B4    C     M     R     O
SCN [7]    42.1  48.3  28.7  61.6  0.9148

In Table 5, a model name without any superscript indicates that the model was trained on the original training set and the metrics were calculated on the original test set. A model name with a superscript indicates that the model was trained on the cleaned training set and the metrics were calculated on the original test set (superscript a) or the cleaned test set (superscript b). We made several observations. First, the models trained on the cleaned training set achieved higher metric scores than the models trained on the original training set, even though the metrics were calculated on the original test set. For instance, VNS-GRU^a [19] improves over VNS-GRU by 1.6% on BLEU-4, by 2.1% on CIDEr, by 0.9% on METEOR and by 1.1% on ROUGE-L. Second, the models trained on the cleaned training set and tested on the cleaned test set achieved higher metric scores than the models trained on the original training set and tested on the original test set. For instance, VNS-GRU^b [19] improves over VNS-GRU by 1.3% on BLEU-4, by 0.7% on METEOR and by 0.9% on ROUGE-L. Third, the scores of VNS-GRU^b were slightly lower than the scores of VNS-GRU^a. We attribute this to the increased annotation diversity in the cleaned test set.
We plot the BLEU-4, CIDEr, METEOR and ROUGE-L scores of popular video captioning models proposed in recent years in Fig. 4. One of the earliest models on the MSR-VTT dataset, VideoLAB, from the ACM Multimedia MSR-VTT Challenge 2016 [25], was used as the baseline, and all other models were compared with it. The relative changes of the other models, in percentage, are shown on the right vertical axes in Fig. 4.

Table 6
Results on the original test set. The model was trained on the training set with the data cleaning steps I, II, III and IV applied one by one.
I    II   III  IV    B4   C    M    R    O
×    ×    ×    ×
√    ×    ×    ×
√    √    ×    ×
√    √    √    ×
√    √    √    √
Table 7
Results on the cleaned test set. The model was trained on the training set with the data cleaning steps I, II, III and IV applied one by one.
I    II   III  IV    B4   C    M    R    O
×    ×    ×    ×
√    ×    ×    ×
√    √    ×    ×
√    √    √    ×
√    √    √    √

By training on the cleaned training set, one of the state-of-the-art models, VNS-GRU, was improved from 15.9% to 19.9% on BLEU-4, from 20.2% to 24.7% on CIDEr, from 7.9% to 10.8% on METEOR and from 4.6% to 6.1% on ROUGE-L, compared with the results obtained by the same model trained on the original training set. From the figure, it is seen that the relative improvements brought by annotation cleaning were non-negligible.
To analyze the utility of each step in data cleaning, we compared in Tables 6 and 7 the performance of the model VNS-GRU [19] on the original and cleaned test sets when trained on the training set cleaned by Step I (Section 3.1), Step II (Section 3.2), Step III (Section 3.3) and Step IV (Section 3.4), applied cumulatively. As shown in Tables 6 and 7, Step I brought improvements in all the metrics, since it reduced the number of irregular words and phrases containing special characters. After Step II, the four metrics remained similar to those after Step I when measured on the original test set (Table 6), but improved when measured on the cleaned test set (Table 7). After Step III, all metrics except METEOR increased in both cases; the METEOR value slightly decreased when measured on the cleaned test set (Table 7). After the last step, almost all metrics were further improved, except BLEU-4. Focusing on the performance measured on the cleaned test set (Table 7), the overall score improved after each step. These results suggest that all steps are necessary for cleaning the annotations.
5 HUMAN EVALUATION
It is well known that metrics such as BLEU-4, CIDEr, METEOR and ROUGE-L do not fully reflect the quality of video captioning results. We therefore conducted a human evaluation study. We recruited 17 subjects (11 male and 6 female, ages between 20 and 35) with normal or corrected-to-normal vision for this experiment.
Figure 4. The performance of typical models on the MSR-VTT dataset from 2016 to 2020. Each panel shows one metric (BLEU-4, CIDEr, METEOR or ROUGE-L) on its left axis and the relative change in percentage on its right axis. The models include VideoLAB, Aalto, v2t_navigator, MTVC [8], CIDEnt-RL [9], SibNet [14], HACA [15], TAMoE [16], SAM-SS [17], POS_RL [18] and VNS-GRU [19]; VNS-GRU* denotes VNS-GRU trained on the cleaned training set. The first three models are from the ACM Multimedia MSR-VTT Challenge 2016 [25]. VideoLAB was used as the baseline (0% change).
Caption A: a group of people are singing and playing instruments
Caption B: a man is singing
Question. In terms of relevance and coherence, which caption is better?
A. Caption A    B. Caption B    C. Indistinguishable
Figure 5. An example question in the human evaluation experiment. Captions A and B were generated by VNS-GRU or VNS-GRU*.

The subjects were mainly from Tsinghua University, Beijing, China, and all had at least college-level English. This study was approved by the Department of Psychology Ethics Committee, Tsinghua University, Beijing, China.
The subjects watched video clips from the MSR-VTT dataset and compared the results of VNS-GRU trained on the original and the cleaned annotations of the dataset (Figure 5). The subjects were instructed to compare the results based on two criteria:
1) relevance, the match between the contents of the video clip and the caption;
2) coherence, the language fluency and grammatical correctness of the caption.
For each video clip, there were three options: (A) Caption A is better; (B) Caption B is better; and (C) Indistinguishable. The two captions were generated by VNS-GRU and VNS-GRU*, which were trained on the original and the cleaned annotations of the dataset, respectively. The subjects had to choose exactly one of the three options. A total of 30 video clips were randomly sampled from the test set and presented to all subjects in a fixed order. Every subject completed the experiment within half an hour.
We recorded the number of votes for VNS-GRU, VNS-GRU* and Indistinguishable for every subject and calculated the average over all subjects (Figure 6). On average, the subjects voted "VNS-GRU* is better" for 11.8 video clips and "VNS-GRU is better" for 10.1 video clips. A one-sided Student's t-test indicated that VNS-GRU* performed better than VNS-GRU (p = 0.02, n = 17). On average, for 8.1 videos the subjects could not distinguish the quality of the results. These results suggest that, judged by subjective human evaluation, annotation cleaning improves the quality of the captions generated by video captioning models.

Figure 6. Human evaluation results. "VNS-GRU*", "VNS-GRU" and "Indistinguishable" denote the numbers of videos for which the subjects voted "VNS-GRU* is better than VNS-GRU", "VNS-GRU is better than VNS-GRU*" and "They are indistinguishable", respectively. Error bars are standard deviations. The p-value between "VNS-GRU*" and "VNS-GRU" is 0.02.
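The one-sided test reported above can be reproduced from the per-subject vote counts. The sketch below uses SciPy's paired t-test (the `alternative` argument requires SciPy 1.6 or later) with hypothetical vote counts: the individual counts were not published, only their means of 11.8 and 10.1, so the printed p-value is illustrative only.

```python
from scipy import stats

# Hypothetical per-subject counts (n = 17): how many of the 30 clips each
# subject preferred from VNS-GRU* and from VNS-GRU, respectively.
votes_star = [12, 11, 13, 10, 12, 11, 13, 12, 11, 12, 13, 11, 12, 10, 13, 12, 11]
votes_orig = [10, 11,  9, 12, 10, 11,  9, 10, 11, 10,  9, 11, 10, 12,  9, 10, 11]

# One-sided paired t-test: is VNS-GRU* preferred more often than VNS-GRU?
t_stat, p_value = stats.ttest_rel(votes_star, votes_orig, alternative="greater")
print(f"t = {t_stat:.2f}, one-sided p = {p_value:.3f}")
```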
6 CONCLUSION

The MSR-VTT dataset is widely used in the video captioning area. We found many problems in its annotations, many of them obvious mistakes, and we inspected the influence of these problems on the results of video captioning models.
In four steps, we removed or corrected these problems and compared the results of several popular video captioning models. The models trained on the cleaned dataset generated better captions than the models trained on the original dataset, as measured by both objective metrics and subjective evaluations. In particular, trained on the cleaned dataset, our previous model VNS-GRU achieved new state-of-the-art results on this dataset. We recommend using this cleaned dataset for developing video captioning models.

ACKNOWLEDGMENTS
The authors would like to thank Han Liu and Huiran Yu for insightful discussions. This work was supported by the National Natural Science Foundation of China under Grant 62061136001, Grant U19B2034 and Grant 61620106010, and by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – TRR 169/A6.

REFERENCES
[1] K. Cho, B. van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, "Learning phrase representations using RNN encoder–decoder for statistical machine translation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1724–1734.
[2] S. J. Rennie, E. Marcheret, Y. Mroueh, J. Ross, and V. Goel, "Self-critical sequence training for image captioning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 7008–7024.
[3] N. Yu, X. Hu, B. Song, J. Yang, and J. Zhang, "Topic-oriented image captioning based on order-embedding," IEEE Transactions on Image Processing, vol. PP, pp. 1–1, 2018.
[4] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang, "Bottom-up and top-down attention for image captioning and visual question answering," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6077–6086.
[5] J. Xu, T. Mei, T. Yao, and Y. Rui, "MSR-VTT: A large video description dataset for bridging video and language," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 5288–5296.
[6] S. Guadarrama, N. Krishnamoorthy, G. Malkarnenkar, S. Venugopalan, R. Mooney, T. Darrell, and K. Saenko, "YouTube2Text: Recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition," in Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 2712–2719.
[7] Z. Gan, C. Gan, X. He, Y. Pu, K. Tran, J. Gao, L. Carin, and L. Deng, "Semantic compositional networks for visual captioning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5630–5639.
[8] R. Pasunuru and M. Bansal, "Multi-task video captioning with video and entailment generation," in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, R. Barzilay and M. Kan, Eds., 2017, pp. 1273–1283.
[9] ——, "Reinforced video captioning with entailment rewards," in Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017, pp. 979–985.
[10] C. Wu, Y. Wei, X. Chu, W. Sun, F. Su, and L. Wang, "Hierarchical attention-based multimodal fusion for video captioning," Neurocomputing, vol. 315, pp. 362–370, 2018.
[11] M. Zolfaghari, K. Singh, and T. Brox, "ECO: Efficient convolutional network for online video understanding," in ECCV, 2018, pp. 713–730.
[12] B. Wang, L. Ma, W. Zhang, W. Jiang, J. Wang, and W. Liu, "Controllable video captioning with POS sequence guidance based on gated fusion network," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 2641–2650.
[13] W. Pei, J. Zhang, X. Wang, L. Ke, X. Shen, and Y.-W. Tai, "Memory-attended recurrent network for video captioning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 8347–8356.
[14] S. Liu, Z. Ren, and J. Yuan, "SibNet: Sibling convolutional encoder for video captioning," in Proceedings of the 26th ACM International Conference on Multimedia, 2018, pp. 1425–1434.
[15] X. Wang, Y. Wang, and W. Y. Wang, "Watch, listen, and describe: Globally and locally aligned cross-modal attentions for video captioning," in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2018, pp. 795–801.
[16] X. Wang, J. Wu, D. Zhang, Y. Su, and W. Y. Wang, "Learning to compose topic-aware mixture of experts for zero-shot video captioning," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, no. 01, 2019, pp. 8965–8972.
[17] H. Chen, K. Lin, A. Maye, J. Li, and X. Hu, "A semantics-assisted video captioning model trained with scheduled sampling," Frontiers in Robotics and AI, vol. 7, p. 129, 2020.
[18] B. Wang, L. Ma, W. Zhang, W. Jiang, J. Wang, and W. Liu, "Controllable video captioning with POS sequence guidance based on gated fusion network," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 2641–2650.
[19] H. Chen, J. Li, and X. Hu, "Delving deeper into the decoder for video captioning," in ECAI 2020 – 24th European Conference on Artificial Intelligence, ser. Frontiers in Artificial Intelligence and Applications, vol. 325, 2020, pp. 1079–1086.
[20] V. I. Levenshtein, "Binary codes capable of correcting deletions, insertions, and reversals," in Soviet Physics Doklady, vol. 10, no. 8, 1966, pp. 707–710.
[21] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, "BLEU: A method for automatic evaluation of machine translation," in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 2002, pp. 311–318.
[22] R. Vedantam, C. L. Zitnick, and D. Parikh, "CIDEr: Consensus-based image description evaluation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 4566–4575.
[23] S. Banerjee and A. Lavie, "METEOR: An automatic metric for MT evaluation with improved correlation with human judgments," in Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, 2005, pp. 65–72.
[24] C.-Y. Lin, "ROUGE: A package for automatic evaluation of summaries," in Text Summarization Branches Out, 2004, pp. 74–81.
[25] J. Xu, T. Mei, T. Yao, and Y. Rui. (2016) The 1st video to language challenge.