End-to-End Video Question-Answer Generation with Generator-Pretester Network
Hung-Ting Su†, Chen-Hsi Chang†, Po-Wei Shen†, Yu-Siang Wang‡, Ya-Liang Chang†, Yu-Cheng Chang†, Pu-Jen Cheng† and Winston H. Hsu†
†National Taiwan University, ‡University of Toronto
Abstract—We study a novel task, Video Question-Answer Generation (VQAG), for the challenging Video Question Answering (Video QA) task in multimedia. Due to expensive data annotation costs, many widely used, large-scale Video QA datasets such as Video-QA, MSVD-QA and MSRVTT-QA are automatically annotated using Caption Question Generation (CapQG), which takes captions as input instead of the video itself. As captions neither fully represent a video, nor are they always practically available, it is crucial to generate question-answer pairs based on a video via Video Question-Answer Generation (VQAG). Existing video-to-text (V2T) approaches, despite taking a video as the input, only generate a question alone. In this work, we propose a novel model, the Generator-Pretester Network, that focuses on two components: (1) the Joint Question-Answer Generator (JQAG), which generates a question with its corresponding answer to allow Video Question "Answering" training; and (2) the Pretester (PT), which verifies a generated question by trying to answer it and checks the pretested answer against both the model's proposed answer and the ground truth answer. We evaluate our system on the only two available large-scale human-annotated Video QA datasets and achieve state-of-the-art question generation performance. Furthermore, using only our generated QA pairs on the Video QA task, we can surpass some supervised baselines. As a pre-training strategy, we outperform both CapQG and transfer learning approaches when employing semi-supervised (20%) or fully supervised learning with annotated data. These experimental results suggest novel perspectives for Video QA training.
Index Terms—video question answering, video question generation, pretester network.
I. INTRODUCTION
Video Question Answering (Video QA; Table I lists the abbreviations used in this paper), which aims to answer a natural language question according to a video clip, is an important task in multimedia understanding. Modern Video QA systems [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13] require sufficient video-question-answer triples to train. As shown in Figure 1, many Video QA datasets are labeled with caption question generation (CapQG), where question-answer pairs are created using a text question generation system according to the captions of a video. Three of the most widely used benchmarks, Video-QA [2], MSVD-QA, and MSRVTT-QA [1], are massively labeled in this way and provide more than 50,000 question-answer pairs. Two problems remain unsolved for applying CapQG to real-world applications. First, captions are difficult to obtain in practice. Second, these datasets assume captions can represent a video clip, but as the old saying goes,
A picture is worth a thousand words. Even a single frame has abundant visual information about various objects and the interactions between them, and this information increases with the video length. An intuitive way to approach videos without captions is to apply video captioning systems. However, video captioning is still an unsolved task that requires a deep and fine-grained understanding of a video. Although modern systems can generate "correct" captions in terms of automatic evaluation metrics such as BLEU, the quality of machine-generated captions is still not applicable for CapQG due to factual errors, where a few erroneous words can give a caption a totally different meaning, as shown in Figure 2. Moreover, correct captions can be inferior because of general problems such as repetition or redundancy [14]. Existing Video QA datasets generated with human-labeled descriptions still suffer from information loss during captioning. Although human annotators can write high-quality captions for a video clip, multimedia content is too abundant to describe with sentences. For example, Figure 1 shows a video and human-written captions. Various details, such as detailed actions and the attributes of the person including the gender, clothes, and facial expressions, are not represented in the captions.

Video-to-text (V2T), a generalized video captioning framework, could be another way to approach the issue by replacing the output captions with questions. Although a video carries more complete information than captions as the input, the major drawback of V2T is the absence of the answer. A recent work, SRCMSA [15], adopted V2T to generate questions alone. Although the transformer-based SRCMSA outperforms various video-to-text baselines by proposing a stronger video encoder, video-question pairs do not enable video question "answering" training. Moreover, applying V2T to the QG task also suffers from factual error, repetition, and redundancy problems. One of the main reasons for these problems is training with perplexity loss only. Perplexity loss is a double-edged blade for question generation, as it captures question patterns but does not focus on specific details. More specifically, vital cues for question generation, such as an object, an attribute, or a relation, may weigh only a few or even a single word and be neglected.

In this work, we propose a novel yet effective Generator-Pretester Network (GPN) (Figure 3) which jointly generates question-answer pairs and verifies the generated question by pretesting it. The GPN model has two technical components to tackle the above challenges. (1) The Joint Question-Answer Generator (JQAG), which comprehends the video and provides the answer proposal according to the video and the desired question type. The answer proposal and the video are then used to generate a question-answer pair in an end-to-end manner.
Fig. 1. Video Question-Answer Generation (VQAG). Conventional Caption Question Generation (CapQG, Left column) [1], [2] outputs a question from captions together with an explicitly generated answer. CapQG uses captions which neither fully represent the video, nor are they always practically available. We propose a novel VQAG (Right column) that automatically generates question-answer pairs according to a video clip for Video Question Answering (Video QA) training in an end-to-end manner. [Best viewed in color.]
Fig. 2. When human annotated captions (Left) are unavailable, applying machine generated captions (Right) to question generation systems incurs captioning error and results in unanswerable generated questions. Our proposed VQAG directly generates questions from a video and tries to reduce the above information loss. [Best viewed in color.]

TABLE I
List of Abbreviations

APV        Answer Proposal Vector
ASV        Answer Sheet Vector
Anet-QA    ActivityNet-QA [16]
CMSA       Cross-Modal Self Attention
CapQG      Caption Question Generation
GPN        Generator-Pretester Network
JQAG       Joint Question-Answer Generator
KLdiv      KL Divergence
PT         Pretester
Q.Ctrl     Question Controller
QA         Question Answering
SA         Self Attention
SRCMSA     Semantic Rich Cross-Modal Self Attention [15]
TCL        Target Consistency Loss
V2T        Video-to-text
VQAG       Video Question-Answer Generation
VQG        Visual Question Generation
VideoQA    Video Question Answering
(2) The Pretester (PT), which pretests the generated question and verifies the answer against the answer proposal and the answer ground truth. Traditional V2T approaches suffer from factual errors because they optimize the model by checking the generated question word by word, which encourages the model to neglect essential facts due to their low word frequency. Our Pretester tackles the problem in a new way by trying to answer the generated question. Many modern human-annotated QA datasets, including text QA [17], [18] and Video QA [16], perform additional manual verification after question labeling: another set of crowd-workers is asked to answer the questions according to an input video or passage. Inspired by this concept, our Pretester acts as an agent which pretests a generated question and checks the answer. Specifically, the Pretester first attempts to answer the question by projecting the question embedding into the answer space and produces an Answer Sheet Vector (ASV). Next, a Target Consistency Loss (TCL) is applied to verify the ASV against the ground truth answer and the answer proposal.

We evaluate our GPN model on two large-scale, human-annotated datasets, ActivityNet-QA [16] and TVQA [19], and achieve state-of-the-art performance. On the ActivityNet-QA dataset, we significantly outperform previous state-of-the-art V2T and CapQG models with merely 20% of the training data. Our experimental results also confirm that conventional CapQG approaches suffer from information loss during machine video captioning: we observe a large question generation performance drop when training with machine-generated captions instead of human-labeled ones, in spite of using a modern captioning model. We also apply our generated questions to the Video Question Answering task and provide new perspectives on Video QA training. We reach 29.5% accuracy training with only VQAG-produced data, which outperforms some supervised baseline training and indicates that the knowledge obtained from generated data is able to train Video QA models. Also, we compare the fine-tuning performance with different pre-training strategies, including CapQG and transfer learning. Our approach achieves the best QA performance in both semi-supervised and fully supervised scenarios. To pave a novel path for future researchers, we carefully analyze the quality of generated questions for Video QA training.

Our contributions are listed below: (1)
We propose a novel task, Video Question-Answer Generation, which jointly generates question-answer pairs from videos. (2)
We propose an end-to-end Generator-Pretester Network which jointly generates a question-answer pair and verifies the generated question by pretesting it. (3)
Our proposed model reaches a new state-of-the-art question generation performance on two large-scale, human-annotated Video QA datasets. (4)
We apply generated question-answer pairs to Video QA applications and demonstrate the opportunity of Video QA training with VQAG-generated data.

II. RELATED WORK
Video Question Answering (Video QA) has been a crucial yet challenging task in multimedia. While deep learning has enabled representation learning for multimedia features such as image, video, and text and achieved remarkable performance on various tasks, one of the biggest barriers is the need for abundant training data. The annotation process for Video QA data requires video comprehension and question-answer labeling, and is much more time-consuming compared to classification tasks such as action recognition. As a result, most large-scale Video QA datasets are automatically generated. For example, Jang et al. proposed the TGIF-QA [20] dataset by using templates to generate 165,000 question-answer pairs for animated GIFs. Meanwhile, many studies apply caption question generation (CapQG), which utilizes human-labeled captions and text question generation systems, to produce sufficient question-answer pairs from existing datasets or available web videos. Therefore, the availability and the quality of generated questions strongly rely on captions. Zeng et al. proposed the Video-QA [2] dataset by harvesting videos with human-written descriptions from the web and obtained 170,000 question-answer pairs for Video QA training with CapQG. Xu et al. extended two video captioning datasets, MSVD [21] and MSRVTT [22], with CapQG and generated MSVD-QA with 50,000 question-answer pairs and MSRVTT-QA with 240,000 question-answer pairs. Recently, several human-labeled Video QA datasets were released with a lot of crowd-sourcing effort. Lei et al. proposed a human-annotated dataset, TVQA [19], with 150,000 multi-choice question-answer pairs. Yu et al. extended the ActivityNet [23] dataset to the open-ended ActivityNet-QA [16] by manually labeling 58,000 question-answer pairs. Zadeh et al. proposed the Social-IQ [24] dataset to evaluate socially intelligent techniques, with a year-long human annotation period. Yi et al. tackled the more challenging problem of causal and collision reasoning with a synthetic dataset [25]. As CapQG depends on captions and human annotation is expensive, it is necessary to automatically generate massive numbers of questions directly from videos.

Apart from text generation, some works apply question generation using visual signals. Image Question Generation (also known as Visual Question Generation (VQG)) models [26], [27], [28] take an image and generate corresponding questions. While VQG models can generate questions for image QA, using VQG-generated questions for Video QA loses vital temporal information. Video-to-text (V2T) generation adopts captioning models to generate questions. Video captioning [29], [30], [31], [32], [33], [34], [35], [36], [37], [38], [39], [40], [41], [42], [43], [44], [45], [46], [47], [48] is also an active research area in multimedia. A recent work, SRCMSA [15], adopts a video-to-text generation model to generate questions alone for Video QA training. SRCMSA proposed a transformer-based video-to-text model and demonstrated its effectiveness on the TVQA [19] dataset. However, SRCMSA does not consider the corresponding answer; video-question pairs alone do not enable Video QA training, as models cannot be optimized without answers. Moreover, V2T models trained with perplexity loss generally suffer from factual errors, as perplexity loss prioritizes repeating patterns rather than relatively infrequent words such as objects, attributes, or relations. Hence, our proposed model tackles the above challenges by jointly generating a question-answer pair and verifying the question by pretesting it.
III. GENERATOR-PRETESTER NETWORK (GPN)

As shown in Figure 3, our model is composed of two novel components. (1) The Joint QA Generator jointly generates a question-answer pair according to a video clip and the question controller, by estimating an Answer Proposal Vector (APV) and feeding the APV to generate a question and an answer. (2) The Pretester tries to answer the generated question and checks the answer against the answer proposal and the answer ground truth. Our model takes a video and the question controller as input and generates question-answer pairs accordingly. The question controller represents a desired question type, such as spatial relation, and can be obtained during dataset annotation. Precisely, given a video V and a question-answer type T_i, we generate a question-answer pair Q_i, A_i. Compared to a previous work [15], which generates a single question based on a video clip, we are able to generate multiple question-answer pairs for Video QA. This is essential for Video QA training, as the model learns not only to comprehend but also to answer the question.

The input video can be represented in various ways. In our work, we follow SRCMSA [15] for a fair comparison. The inputs of our model are the following features extracted from n frames: (1) CNN-extracted visual features V_F = {V_{F_1}, V_{F_2}, ..., V_{F_n}}, (2) object features V_O = {V_{O_1}, V_{O_2}, ..., V_{O_n}}, and (3) the Question Controller C, which determines the type of generated questions. The model outputs (1) a question Q = {Q_1, Q_2, ..., Q_m}, Q_i \in R^{|V|}, where m is the word length of the question and |V| is the vocabulary size, and (2) an answer A \in R^{|A|}, where |A| is the size of the answer space. In our experiments, we extract 20 video frames per video following existing Video QA settings [1], [16]. Then, we obtain CNN features from the Pool5 layer of ResNet101 [49] and detect object features with a Faster R-CNN [50] pretrained on Visual Genome [51]. Afterward, we mean-pool the object representations in each frame to obtain object embeddings.

The model optimizes the trainable parameters as follows:

Q, A = f(\theta_E, \theta_{QG}, \theta_A, \theta_{AS}, V_F, V_O, C),   (1)

where \theta_E, \theta_{QG}, \theta_A, \theta_{AS} are the parameters of the Video Encoder, Question Generator, Answering Layer, and Answer Selector.
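For concreteness, the snippet below sketches how the per-frame features described above could be extracted with PyTorch and torchvision. It is a minimal illustration rather than the authors' released pipeline: the ResNet-101 backbone is standard, but the Faster R-CNN pretrained on Visual Genome is not bundled with torchvision, so object features are assumed to come from an external detector and are only mean-pooled here.

```python
import torch
import torchvision.models as models

def extract_frame_features(frames: torch.Tensor) -> torch.Tensor:
    """frames: (n, 3, 224, 224) tensor of n sampled video frames.
    Returns pool5 features V_F of shape (n, 2048) from ResNet-101."""
    resnet = models.resnet101(weights="IMAGENET1K_V1")
    backbone = torch.nn.Sequential(*list(resnet.children())[:-1])  # keep everything up to the final pooling
    backbone.eval()
    with torch.no_grad():
        feats = backbone(frames)                 # (n, 2048, 1, 1)
    return feats.flatten(1)

def pool_object_features(object_feats_per_frame):
    """object_feats_per_frame: list of n tensors, each (k_i, d_O), produced by an
    external object detector. Mean-pool the detections within each frame to get V_O."""
    return torch.stack([f.mean(dim=0) for f in object_feats_per_frame])
```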
A. Joint Question-Answer Generator (JQAG)

1) Video Encoder: The Video Encoder encodes the video features and the question controller to generate a question-answer pair.
Fig. 3. Generator-Pretester Network (GPN) model, which generates massive question-answer pairs with two novel components. (1) The Joint Question-Answer Generator (JQAG, Section III-A) comprehends the video and jointly generates a question-answer pair: JQAG first infers an Answer Proposal Vector and then generates a question-answer pair. (2) The Pretester (PT, Section III-B) pretests the generated question by trying to answer it and checks the answer against both the answer proposal and the answer ground truth. (cf. Section III) [Best viewed in color.]

We adopt a transformer-based Cross-Modal Self-Attention (CMSA) Encoder [15]. The video encoder first fuses the frame feature V_F and the object feature V_O with a projection W_{proj} for each frame:

V_{S_i} = W_{proj}(V_{F_i}) \odot V_{O_i},   (2)

where V_{S_i} is the fused representation of frame i. The original CMSA encoder does not consider the question type. However, different types of questions require different signals in the video. For instance, a question about a spatial relationship relies on certain objects, while a question about the scene needs cues from the whole frame. Observing this, we introduce the Question Controller C, which represents the question type with a 256-dimensional embedding applied to each frame:

V_{src_i} = V_{S_i} \odot C   (3)

C is obtained from a question controller matrix C_M \in R^{256 \times T}, where T represents the number of question types. V_{src} is then fed into the Self-Attention Layers:

V_{i+1} = SA(W_{Q_i} V_i, W_{K_i} V_i, W_{V_i} V_i),   (4)

where V_i is the i-th layer output and SA refers to multi-head self-attention. We initialize the first layer with PE(V_{src}), where PE is the positional encoding introduced in previous work [52]. Finally, we mean-pool the last-layer output to obtain the clip-level representation and feed it into a two-layer projection:

V_{final} = W_{P_2}(ReLU(W_{P_1} meanpool(V_{|L|})))   (5)

to obtain the video embedding for answer and question generation.
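To make Eqs. (2)-(5) concrete, the following is a minimal PyTorch sketch of the encoder. The hidden size, the number of layers and heads, the number of question types, and the use of nn.TransformerEncoder for the self-attention stack are assumptions for illustration; object features are assumed to already match the encoder width, and the positional encoding is omitted for brevity.

```python
import torch
import torch.nn as nn

class VideoEncoderSketch(nn.Module):
    def __init__(self, d_frame=2048, d=256, n_types=9, n_layers=2, n_heads=4):
        super().__init__()
        self.proj = nn.Linear(d_frame, d)                  # W_proj in Eq. (2)
        self.controller = nn.Embedding(n_types, d)         # question controller matrix C_M
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=n_heads, batch_first=True)
        self.sa_layers = nn.TransformerEncoder(layer, num_layers=n_layers)     # Eq. (4)
        self.out = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))  # Eq. (5)

    def forward(self, v_frame, v_obj, q_type):
        # v_frame: (B, n, d_frame) frame features; v_obj: (B, n, d) object features;
        # q_type: (B,) question-type indices.
        v_s = self.proj(v_frame) * v_obj            # Eq. (2): fuse frame and object cues
        c = self.controller(q_type).unsqueeze(1)    # (B, 1, d) question-type embedding
        v_src = v_s * c                             # Eq. (3): condition every frame on the type
        v_last = self.sa_layers(v_src)              # stacked self-attention layers
        return self.out(v_last.mean(dim=1))         # Eq. (5): mean-pool, then two-layer projection
```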
2) Answer Selector:
The Answer Selector aims to produce the answer proposal vector according to the video. Video QG with an answer allows Video QA training based on generated question-answer pairs. The Answer Selector takes the video embedding V_{final} and feeds it through a network with two trainable linear layers W_{AG_1} and W_{AG_2}:

A_P = softmax(W_{AG_2}(ReLU(W_{AG_1} V_{final})))   (6)

The answer proposal vector A_P represents the distribution of answer likelihood over the candidate answers. During the inference stage, the answer is generated by applying A = argmax(A_P). During the training stage, A_P is optimized with a Cross-Entropy Loss:

L_{ap} = CrossEntropy(A_P, A_{tgt}),   (7)

where A_{tgt} denotes the ground truth answer.
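A minimal sketch of the Answer Selector (Eqs. (6)-(7)) follows; the hidden size and the answer-vocabulary size are placeholders. As is usual in PyTorch, the cross-entropy of Eq. (7) is computed from the logits rather than from the softmax output.

```python
import torch.nn as nn
import torch.nn.functional as F

class AnswerSelectorSketch(nn.Module):
    def __init__(self, d=256, n_answers=1000):
        super().__init__()
        self.fc1 = nn.Linear(d, d)            # first trainable linear layer W_AG
        self.fc2 = nn.Linear(d, n_answers)    # second trainable linear layer W_AG

    def forward(self, v_final):
        return self.fc2(F.relu(self.fc1(v_final)))   # answer logits

# Usage sketch:
#   logits = selector(v_final)
#   a_p = F.softmax(logits, dim=-1)             # Answer Proposal Vector, Eq. (6)
#   loss_ap = F.cross_entropy(logits, a_tgt)    # Eq. (7), a_tgt: ground-truth answer indices
#   answer = a_p.argmax(dim=-1)                 # inference-time answer
```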
3) Question Generator:
The Question Generator takes the video embedding V_{final} and the answer proposal A_P, and outputs the question according to them. We utilize the LSTM decoder with two hidden layers and an output layer used in previous work [15]. In addition, for the Video QG task, question composition is less complex than the input video, so we reduce the number of parameters. For example, Anet questions are mostly fewer than 10 words on average. Also, TVQA questions, while longer, follow a strict pattern of "Something before/after/then Something." Hence, compared to general video-to-text tasks, question generation follows question patterns and is therefore much simpler.

The question is generated word by word according to the previous hidden state:

Q_{i+1}, H_{i+1} = LSTM(Q_i, H_i),   (8)

where Q_i is the i-th word and H_i is the i-th hidden state. We initialize the hidden states of the LSTM with the video embedding and the answer proposal:

H_0 = V_{final} \odot W_{AE}(A_P),   (9)

where H_0 is the initial hidden state and W_{AE} embeds the generated answer proposal. This constrains the question generator with the input video and the model-proposed answer. The loss of the question generator is obtained from the perplexity of the generated question Q against the ground truth question Q_{tgt}:

L_{qg} = Perplexity(Q, Q_{tgt})   (10)

Finally, we keep the last hidden state H_{|Q|} as the question embedding Q_{emb}.
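The sketch below illustrates how the answer proposal conditions the decoder (Eqs. (8)-(10)). The single-layer LSTM and the layer sizes are simplifications of the two-hidden-layer decoder used in the paper, following [15].

```python
import torch
import torch.nn as nn

class QuestionDecoderSketch(nn.Module):
    def __init__(self, vocab_size, n_answers, d=256):
        super().__init__()
        self.answer_embed = nn.Linear(n_answers, d)   # W_AE: embeds the answer proposal A_P
        self.word_embed = nn.Embedding(vocab_size, d)
        self.lstm = nn.LSTM(d, d, batch_first=True)
        self.out = nn.Linear(d, vocab_size)

    def forward(self, v_final, a_p, question_in):
        # Eq. (9): initial hidden state = video embedding (element-wise) answer-proposal embedding
        h0 = (v_final * self.answer_embed(a_p)).unsqueeze(0)   # (1, B, d)
        c0 = torch.zeros_like(h0)
        emb = self.word_embed(question_in)                     # teacher-forced input tokens
        hidden_seq, (h_n, _) = self.lstm(emb, (h0, c0))
        logits = self.out(hidden_seq)                          # per-step word distributions, Eq. (8)
        q_emb = h_n.squeeze(0)                                 # last hidden state serves as Q_emb
        return logits, q_emb

# The question-generation loss L_qg of Eq. (10) is the token-level cross-entropy
# (i.e., perplexity) of `logits` against the ground-truth question tokens.
```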
B. Pretester (PT)

The Pretester (PT) pretests the generated question Q_{emb} and generates an Answer Sheet Vector A_Q, and then optimizes the consistency between A_Q, the answer proposal A_P, and the ground truth answer A_{tgt}. First, the Answering Layer acts as an agent to answer the question and generates the Answer Sheet Vector A_Q:

A_Q = softmax(W_{AL_2}(ReLU(W_{AL_1} Q_{emb})))   (11)

Then, we apply our Target Consistency Loss (TCL). A_Q is optimized by (1) the ground truth answer with a Cross-Entropy Loss:

L_{ans} = CrossEntropy(A_Q, A_{tgt})   (12)

and (2) the Answer Proposal Vector by minimizing the KL divergence (KLdiv):

L_c = KLdiv(A_Q, A_P)   (13)

Finally, the Target Consistency Loss (TCL) is obtained by

L_{tc} = \lambda_c L_c + \lambda_a L_{ans},   (14)

where \lambda_c and \lambda_a are hyper-parameters and \lambda_c + \lambda_a = 1. The final loss is the sum of the TCL, the perplexity loss, and the answer loss:

L_{total} = L_{tc} + L_{qg} + L_{ap}   (15)
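The Pretester and the Target Consistency Loss of Eqs. (11)-(15) can be sketched as follows. Layer sizes are placeholders, and the default lambda split is only illustrative: the paper constrains lambda_c + lambda_a = 1 and tunes the split (Section IV-C).

```python
import torch.nn as nn
import torch.nn.functional as F

class PretesterSketch(nn.Module):
    def __init__(self, d=256, n_answers=1000):
        super().__init__()
        self.fc1 = nn.Linear(d, d)            # first Answering Layer projection W_AL
        self.fc2 = nn.Linear(d, n_answers)    # second Answering Layer projection W_AL

    def forward(self, q_emb):
        return self.fc2(F.relu(self.fc1(q_emb)))   # Answer Sheet logits; softmax gives A_Q, Eq. (11)

def target_consistency_loss(asv_logits, a_p, a_tgt, lambda_c=0.5, lambda_a=0.5):
    """Eqs. (12)-(14): verify the Answer Sheet Vector against both the ground-truth
    answer (cross-entropy) and the answer proposal (KL divergence).
    The 0.5/0.5 defaults are placeholders, not the paper's tuned values."""
    l_ans = F.cross_entropy(asv_logits, a_tgt)                                     # Eq. (12)
    l_c = F.kl_div(F.log_softmax(asv_logits, dim=-1), a_p, reduction="batchmean")  # Eq. (13)
    return lambda_c * l_c + lambda_a * l_ans                                       # Eq. (14)

# Final objective, Eq. (15): l_total = l_tc + l_qg + l_ap
```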
IV. QUESTION GENERATION EXPERIMENTS

In this section, we conduct experiments on Video Question Generation and analyze the results in terms of quantity and quality. The results demonstrate the effectiveness of our proposed model. Furthermore, we carefully analyze the generated questions to chart a path for future VQAG research. The code is available at https://github.com/htsucml/VQAG.
TABLE II
Data split. For ActivityNet-QA (Anet-QA) [16], we split the original training set into QG-train/QG-val/QG-test for the QG experiments in Section IV. The QG-test set is also used for pre-training in our QA experiments in Section V. For TVQA, we follow the setting of [15].

           QG-train   QG-val   QG-test   QA-val   QA-test
Anet-QA    14,400     2,880    14,400    18,000   8,000
TVQA       122,093    15,253   7,623     -        -
A. Experimental Setup

1) Data:
We use two recent human-annotated datasets, ActivityNet-QA (Anet-QA) [16] and TVQA [19], for the Video QG experiments. The statistics of the datasets are shown in Table II. The main dataset for the Video QG and QA experiments is Anet-QA, which is open-ended, whereas TVQA is a multi-choice dataset that involves distractors. We also evaluate Video QG results on the TVQA dataset. To evaluate both Video Question Generation and Video Question Answering (Section V), we split the original Anet-QA training set into QG-train, QG-val, and QG-test sets. For the TVQA dataset, we follow the setting of a previous work [15].
2) Evaluation Metrics:
Following previous approaches, we evaluate Video QG with BLEU [53] (B), BLEU-4 (B4), ROUGE [54] (R), CIDEr [55] (C), and METEOR [56] (M) scores. BLEU is a word-level precision metric, and BLEU-4 is the BLEU score for 4-grams. ROUGE is a word-level recall measurement. CIDEr considers word frequency and penalizes frequent words by TF-IDF weighting. METEOR takes synonyms into consideration by utilizing WordNet [57].
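Purely as an illustration (the paper's exact scoring scripts are not specified here), sentence-level BLEU and BLEU-4 can be computed with NLTK as follows:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "what is the gender of the person in white ?".split()
hypothesis = "what is the gender of the person in the video ?".split()
smooth = SmoothingFunction().method1  # avoids zero scores when some n-grams are missing

bleu1 = sentence_bleu([reference], hypothesis, weights=(1, 0, 0, 0), smoothing_function=smooth)
bleu4 = sentence_bleu([reference], hypothesis, weights=(0.25, 0.25, 0.25, 0.25), smoothing_function=smooth)
print(f"BLEU-1: {bleu1:.3f}  BLEU-4: {bleu4:.3f}")
```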
3) Implementation Detail:
We implement our model and the baselines with OpenNMT [58] built on PyTorch [59]. For a fair comparison, we add the question controller to all baselines. For V2TQG, we modify the CMSA encoder (see Section III-A) and add the question controller, following the same approach as our method. For CapQG and CapQG-oracle, we add all question types as tokens in the vocabulary and concatenate the question with the corresponding question-type token. We set the TCL weights \lambda_c and \lambda_a such that \lambda_c + \lambda_a = 1 (see the ablation in Section IV-C). We use ADAM [60] as our optimizer; the learning rate is initialized separately for Anet-QA and TVQA. We train the model for 200,000 steps for the fully supervised experiments and 40,000 steps for the semi-supervised (20%) experiments. The model is validated every 5,000 steps. For a fair comparison, we use the same hyper-parameters for the V2TQG baseline. We use the default parameters for the CapQG and CapQG-oracle baselines.

B. Question Generation Performance

1) ActivityNet-QA Results:
Table III examines the Question Generation performance on ActivityNet-QA [16]. We compare our method with both Caption QG (CapQG) and video-to-text QG (V2TQG). For CapQG, we generate captions with a competitive dense captioning model, Masked Transformer [61], and then apply a state-of-the-art text QG model, UniLM [62], for question generation with the generated captions and ground truth answers.
TABLE III
QG results on the ActivityNet-QA dataset. Our model outperforms the CapQG and V2TQG baselines. B: BLEU, B4: BLEU-4, R: ROUGE, C: CIDEr, M: METEOR. *CapQG-oracle is a very strong baseline which uses ground truth captions, which are not always available practically. See Section IV-B1 for detailed discussion.

                        B       B4      R       C      M
CapQG (100%)            67.63   40.64   62.39   8.56   29.75
CapQG-oracle* (100%)    71.17   44.58   63.76
GPN (ours) (100%)       72.38   46.00   66.22

TABLE IV
Video QG results on TVQA. †: obtained from [15]. For a fair comparison, we implement SRCMSA and run the experiments following their setting with the same random seed. Our model significantly outperforms all baselines, including the previous state of the art, by a large margin. See Section IV-B2 for detailed discussion.

                  B       B4      R       C       M
S2VT† [64]        57.80   7.58    36.25   6.39    14.83
OBJ2TEXT† [65]    61.78   10.44   38.49   6.42    15.33
IMGD† [66]        61.08   9.59    37.78   7.29    15.21
SRCMSA [15]       66.11   12.17   42.02   28.81   18.82
GPN (ours)        68.14   13.63   42.95   30.87   19.53
We sort the generated captions by start time and remove duplicated captions. We also compare with the oracle case where captions are human-annotated (CapQG-oracle). For V2TQG, we compare with the modern SRCMSA [15] model. Our model reaches a new state of the art and outperforms all competitive baselines by a large margin, as shown in the second block, including the very strong CapQG-oracle, which uses human-labeled captions and a BERT [63]-based question generation model. Our full model trained with merely 20% of the data (second row) significantly surpasses both CapQG and V2TQG trained with 100% of the data in all metrics except CIDEr. As GPN does not require abundant data to work well, it is applicable in practice where annotated data is available but expensive. Comparing the baseline models CapQG and CapQG-oracle in the first block of Table III, we see a drop of more than 3 points in BLEU, roughly 1.5 points in ROUGE and METEOR, and about 1 point in CIDEr when using generated captions, even with a modern captioning system. This demonstrates that CapQG suffers information loss in practice where human-labeled text descriptions are unavailable.
2) TVQA Results:
Table IV shows the performance comparison with S2VT [64], IMGD [66], OBJ2TEXT [65], and SRCMSA [15] on the TVQA [19] dataset. GPN outperforms the state-of-the-art results across all metrics and attains a METEOR score of 19.5. Unlike ActivityNet-QA, the TVQA dataset includes both video and subtitle inputs. The questions are also longer compared to ActivityNet-QA. The remarkable results demonstrate the robustness of our proposed method.
C. Ablation Study
The ablation study is shown in Table V. Removing the Pretester module (Section III-B) results in a performance drop on all metrics, especially the CIDEr and METEOR scores.
TABLE V
Ablation study on the ActivityNet-QA QG-train set, comparing against the full model with the best hyper-parameters (first block). The second block examines input features and components and shows that all of them contribute to the performance. The third block tests different hyper-parameters of the Pretester module (Section III-B) and demonstrates the best performance when using both the Answer Proposal and the Answer Ground Truth as labels. Q.Ctrl: Question Controller. (cf. Section IV-C)

                  B       B4      R       C      M       ΔM
Full model
No Pretester      69.16   42.25   63.03   7.05   30.94   1.18%
No Object         69.15   42.31   63.03   7.05   30.94   1.18%
No Frame          69.06   42.29   63.11   6.73   31.02   0.93%
No Q.Ctrl         69.38   42.73
\lambda_a = 1.0
Whereas BLEU and ROUGE measure word-level precision and recall, CIDEr considers word frequency and penalizes repeating words, and METEOR takes synonyms into account. Hence, this reveals that the Pretester boosts performance by prioritizing the essential cues needed to answer, rather than simply matching words and observed patterns.

The question controller (Q.Ctrl) improves the performance on all metrics except ROUGE, especially METEOR, with a gain of more than 2%. Without the question controller (w/o Q.Ctrl), the Video QAG model learns a one-to-many mapping with multiple questions associated with a video, resulting in higher matching scores on frequent patterns at the price of rare patterns and diversity. Therefore, the w/o Q.Ctrl model obtains comparable BLEU, ROUGE, and CIDEr scores, while the performance drops on metrics that consider synonyms, such as METEOR. On the other hand, with the aid of the question controller, the Video QAG model needs to perform a one-to-one mapping and is forced to capture the semantics instead of repetitive patterns, resulting in a more significant METEOR score gap. In addition, the Question Controller enables the model to control the desired question type during the inference stage.

Removing video or object features also leads to a performance drop, as both video frames and objects provide cues for VQAG. Video frames provide complete but raw and sparse visual cues, while objects allow the VQAG model to capture word semantics according to object co-occurrence. Some questions, such as scene-related (day/night) question-answer pairs, require frame features to generate. Meanwhile, some questions rely on the semantics of only certain objects. Questions involving spatial or temporal relationships entail both video frames and visual concepts to properly pinpoint the essential object semantics and comprehend the relationship between them.

The third block of Table V analyzes the performance depending on the parameters \lambda_A and \lambda_C for TCL in the Pretester module (Section III-B), which control the losses obtained from the ground truth answer and the model-generated answer distribution. We reach the best performance when both losses are used together; performance drops when either loss is removed, revealing that both signals are essential and collaborate well.

D. Case Study
Figure 4 illustrates several generated questions from SRCMSA (Left), our model without the Pretester module (Middle), and our full model with the Pretester (Right). For questions (1) and (2), while SRCMSA is able to generate questions according to the video, the absence of answers is the biggest gap to Video QA training. Additionally, SRCMSA generates questions with the redundant relation "black". While these questions are not literally wrong, redundant words may distract Video QA models into paying attention to useless details. The middle block exhibits the results of our model with the Pretester removed. It still successfully outputs the answer, but with an incorrect relation in each question. Without the Pretester, the QAG model fails to focus on a key object (the person in this case), is distracted by the grass (colored green) in the video, and generates wrong questions. In the right block, our model guided by the Pretester generates precise questions without redundancy. For question (3), each model successfully generates a question, while both variants of our model output a wrong answer, 2 (it should be 1). This may be caused by the only man wearing clothes of a different color and by frequent lens rotation as he switches between inside and outside of the view. The above results indicate that, while we significantly outperform previous state-of-the-art models, Video QAG remains challenging due to the sparsity, the diversity, and the temporal dependencies of video features.

Figure 5 demonstrates generated questions and perplexity scores. We compare our models with and without the Pretester. Our model without the Pretester generates an unanswerable, wrong question with non-existent facts ("in the blue jacket", "hit the ball"). Our full GPN equipped with the Pretester generates an answerable question despite a higher perplexity. This reveals that the Pretester encourages the generator to produce an answerable question by ignoring frequent but wrong words during decoding.
E. Error Analysis
We achieve state-of-the-art performance for the question generation task. Additionally, we investigate the errors of generated questions and classify them into three categories: QA Mismatch Error, Question Error, and Answer Error, as shown in Figure 6.
Question-Answer (QA) Mismatch Error (Left block) is where a question and an answer are both represented in the video but do not match properly. For instance, the answer of the first question-answer pair should be "1" instead of "3", as there is only one person in blue pants; otherwise, we would have to change the question to "how many people are there in the video" to match the answer "3". Similarly, for the second one, we can either adjust the answer to "court" or modify the question by replacing "in front of" with "behind". QA Mismatch Error indicates that the VQAG model successfully comprehends a video clip and captures essential segments but fails to pair them. QA pairs with this type of error are usually seen in videos with various objects. This suggests that the Pretester is a promising direction worth pursuing, as it directly measures the question by trying to answer it.
Question Error (Middle block) represents invalid questions which are undesirable given the video clip, regardless of the corresponding answer. Previous work [15] has mentioned several types of question errors, including unanswerable, redundant, and general questions. The most severe type is the unanswerable question, as it may mislead the Video QA model. We observed that many question errors are caused by a single word with the wrong attribute or action. In other words, these unanswerable questions are actually "mostly correct" in terms of automatic evaluation such as BLEU or ROUGE scores. This suggests that future research should consider weighted or semantic metrics such as CIDEr and METEOR, and carefully analyze qualities beyond scores.
Answer Error (Right block) is defined as an answer that is invalid according to the video clip, regardless of the question. It is worth mentioning that most Answer Errors occur together with Question Errors, or are paired with more general questions such as "what is the name of the game?". This phenomenon reveals that Answer Errors usually appear when the model fails to comprehend a video clip. In this situation, a question is either wrong or "safely guessed" as a frequent and general question from the distribution learned by language modeling. This reveals that the model fails to understand the input video and suggests that the video encoder needs more fine-grained comprehension.
V. APPLICATION IN QUESTION ANSWERING
In this section, we apply our generated questions to the downstream open-ended Video QA task. We demonstrate a remarkable performance gain with generated questions alone, show improvements in both semi-supervised and fully supervised scenarios, and point out promising directions for Video QA training.
A. Setup
For the open-ended Video QA experiments, we evaluate the accuracy for each question type as defined by [16]. We report Y/N and counting questions separately, as the answer spaces for these two types are much smaller. To simulate a practical scenario where training data is expensive but available, we train the CapQG and VQAG models with 20% of the QA-train set. We use a modern Video QA model, HME [6]. The word embedding layer for a question is initialized with 300-dimensional GloVe [70] vectors. In the pre-training stage, we use the QA-train set (see Table II). When fine-tuning, we use the QG-train set, as those QA pairs are resources available for both QG and QA training. For every experiment, all layers except for the final classification layer are initialized with pre-trained weights; if a word embedding is not found in the pre-trained checkpoint, we initialize that word with 300-dimensional GloVe [70].
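A minimal sketch (with hypothetical parameter names) of this initialization scheme is given below: every pre-trained weight is copied except the final classification layer and any tensor whose shape no longer matches (such as an enlarged word-embedding matrix, whose new rows would instead be filled from GloVe).

```python
import torch

def init_from_pretrained(model, ckpt_path, classifier_prefix="classifier."):
    """Load a pre-trained checkpoint (assumed to be a plain state_dict) into `model`,
    skipping the final classification layer and any shape-mismatched tensors."""
    pretrained = torch.load(ckpt_path, map_location="cpu")
    own = model.state_dict()
    kept = {k: v for k, v in pretrained.items()
            if not k.startswith(classifier_prefix)
            and k in own and own[k].shape == v.shape}
    own.update(kept)
    model.load_state_dict(own)
    return model
```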
B. Question Answering Performance
Table VI shows Video QA accuracy on the ActivityNet-QA dataset with baselines including E-VQA [67], E-MN [68], E-SA [69], and HME [6]. We also apply two different pre-training strategies: (1) pre-training with the MSVD-QA [1] dataset (MSVD-trans) and (2) pre-training with CapQG-oracle generated questions (CapQG-oracle).
Fig. 4 panel contents.
SRCMSA (Left): Q1: what color is the hair of the person in black clothes? A: N/A. Q2: what is the gender of the person in black clothes? A: N/A. Q3: how many people are there in the video? A: N/A.
Ours (w/o Pretester) (Middle): Q1: what color is the hair of the person in green clothes? A: black. Q2: what is the gender of the person in green clothes? A: male. Q3: how many people are there in the video? A: 2 (Correct: 1).
Ours (w/ Pretester) (Right): Q1: what color is the hair of the person in the video? A: black. Q2: what is the gender of the person in the video? A: male. Q3: how many people are there in the video? A: 2 (Correct: 1).

Fig. 4. Case study. SRCMSA [15], a V2TQG model, only generates questions alone. Our model without the Pretester module (Middle) can generate question-answer pairs but fails to capture correct details (the color of clothes). Our full model with the Pretester (Right) generates correct questions for Q1 and Q2. For Q3, both variants of our model fail to generate the correct answer, which may be caused by the person wearing clothes of a different color and by frequent camera rolling in the clip. (cf. Section IV-D) [Best viewed in color.]
Fig. 5 panel contents.
w/o Pretester (Top): Generated Answer: Cheers. Generated Question: what happened to the person in the blue jacket before they hit the ball? (Perplexity = 14.36)
w/ Pretester (Bottom): Generated Answer: Cheers. Pretester Response: Cheers. Generated Question: what happened to the athletes after the end of the game? (Perplexity = 16.41)

Fig. 5. (Top) With the Pretester removed, the model generates a wrong question. (Bottom) With the Pretester equipped, the generator is encouraged to output a correct question in spite of a higher perplexity loss. (cf. Section IV-D)
Fig. 6 panel contents.
Question Answer Mismatch (Left): Q: how many people are there in the blue pants in the video? A: 3. Q: what is in front of the player in blue pants? A: tree.
Question Error (Middle): Q: what is in front of the player in green pants? A: playground. Q: what happened to the person in blue pants after he played golf? A: retracting pole.
Answer Error (Right): Q: what is the name of the game? A: golf. Q: what is behind the person who appears at the beginning of the video? A: gym.

Fig. 6. Error Analysis. We examine the generated questions and analyze the errors that occur. Left: QA Mismatch, where the question and the answer are both in the video clip but do not match. Middle: Question Error, where the question is invalid according to the video. Right: Answer Error, where the answer is invalid according to the video. (cf. Section IV-E) [Best viewed in color.]
TABLE VI
Video QA accuracy on the ActivityNet-QA QA-test set with QG-generated data. †: obtained from [16] and trained with the full training set (about twice the size of the QA-train set, see Table II). Training the HME model with our generated data reaches remarkable accuracy, even outperforming the E-VQA and E-MN models trained with the full training set, without witnessing a single human-labeled pair. In the semi-supervised and fully supervised scenarios, we also outperform all pre-training baselines. (cf. Section V-B)

                                 Motion  Spatial  Temporal  Y/N   Counting  Free  All
Unsupervised
  Random                         0.1     0.1      0.1       0.1   0.1       0.1   0.1
  Question Type Prior            0.5     0.5      0.5       50.0  10.0      2.6   18.7
  GPN (ours) + HME [6]
20% Supervised
  HME [6]                        2.5     5.6      3.5       60.2  42.0      41.0  33.3
  MSVD-trans + HME [6]           8.5     8.1      3.1       57.3  40.8      40.2  33.4
  CapQG-oracle [62] + HME [6]    4.8     6.6      2.6       58.5  39.8      25.2  33.6
  GPN (ours) + HME [6]
100% Supervised
  E-VQA† [67]                    2.5     6.6      1.4       -     -         34.4  25.1
  E-MN† [68]                     3.0     8.1      1.6       -     -         36.9  27.1
  E-SA† [69]                     12.5    14.4     2.5       -     -         41.2  31.8
  HME [6]                        14.3    12.2     7.0       62.7  44.1      46.6  39.4
  MSVD-trans + HME [6]           16.1    13.5     7.9       60.6  38.5      44.7  38.2
  CapQG-oracle [62] + HME [6]    7.5     10.2     7.0       63.0  43.8      29.4  37.7
  GPN (ours) + HME [6]

TABLE VII
Anet-QA performance with different scales of VQAG-generated questions using HME [6]. ft: fine-tuning with annotated data. Our generated questions consistently enhance QA performance. (cf. Section V-C)

VQAG Data     No Pretrain  20%   40%   60%   80%   100%
w/o ft        0.1          5.7   11.6  25.0  27.1  29.5
w/ 20% ft     33.3         36.0  36.3  34.4  34.1  34.3

As shown in the first block, without observing any human-annotated pairs, the model trained with our generated data achieves 29.5% accuracy, which even outperforms E-VQA and E-MN trained with tens of thousands of human-labeled QA pairs. Notably, on spatial and counting questions we even outperform the supervised training accuracy. However, our model only marginally improves on Y/N questions, which may result from the limited information provided by the answer during Video QG training. To improve Y/N question generation, we suggest integrating our model with a reasoning model in the future.

The second and third blocks of Table VI present the fine-tuning results with 20% and 100% of the QA-train set. Our pre-training approach outperforms both MSVD-transfer and CapQG-oracle, especially when fine-tuning with the full QA-train set, where MSVD-transfer and CapQG-oracle degrade the performance. This indicates that generated questions enhance QA performance despite containing noise compared with human-annotated ones. The performance drop of MSVD-trans in the 100% supervised setting reveals that the domain gap between videos matters for Video QA tasks. Meanwhile, CapQG-oracle suffers a 17% accuracy drop on free-type questions, which may involve details that are not covered in the text descriptions. Compared with these two pre-training strategies, our approach directly generates questions from videos in the same domain, minimizing information loss and the domain gap.
C. Pre-training Data Analysis
Table VII presents the impact of generated questions at different scales. We evaluate different scales of VQAG-generated questions in two scenarios: (1) w/o fine-tuning (training QA with QG-generated data only, 14,400 examples in total) and (2) w/ 20% fine-tuning (pre-training QA with QG-generated data, then fine-tuning with 20% of the human-annotated data). As shown in the table, pre-training with our generated questions consistently enhances Video QA performance regardless of the scale. Without fine-tuning on human-annotated data, QA performance is roughly correlated with the scale of generated questions up to 60%, demonstrating that the Video QA system can benefit from the different patterns in more generated questions. With fine-tuning, Video QA performance improves the most at 40% (a 3.0-point gain in QA accuracy). With more generated questions adopted, we get roughly a 1.0-point QA accuracy gain, revealing the gap between QG-generated data and human-annotated data; the noise in generated data might influence the QA model. Note that, in spite of the potential influence of this noise, pre-training with our generated questions continuously improves the Video QA performance (from 0.8 to 3.0 points). We suggest that future research tackle this challenge to enable more benefit from generated questions.
VI. CONCLUSION
In this paper, we introduce a novel task, Video Question-Answer Generation, to automatically generate question-answer pairs for Video Question Answering training. We propose a novel Generator-Pretester Network that jointly generates question-answer pairs end-to-end and verifies the generated question by trying to answer it. We demonstrate the efficacy of the proposed modules on the Video Question Generation task and achieve state-of-the-art performance on two large-scale, human-annotated datasets. The extensive Video QA results exhibit the performance boost from our generated questions, reaching 29.5% accuracy without witnessing any annotated data and also enhancing performance in both semi-supervised and fully supervised scenarios. To guide future research, we also provide a detailed analysis of generation errors and their impact on Video QA tasks. We encourage future work to extend our research based on our results and analysis. For example, to resolve the unanswerable questions discussed in Section IV-E, one might want to obtain more fine-grained object-level cues instead of applying mean-pooling to all objects in a frame. While achieving remarkable performance, we believe that we could further enhance Video QAG capability by utilizing more fine-grained and semantically sensitive representations, such as object-level attention or contextual embeddings. We are excited and optimistic to introduce a new perspective for targeting the challenging and practical Video QA task with cheap data.

ACKNOWLEDGEMENT
This work was supported in part by the Ministry of Science and Technology, Taiwan, under Grant MOST 109-2634-F-002-032. We benefit from the NVIDIA DGX-1 AI Supercomputer and are grateful to the National Center for High-performance Computing.
REFERENCES

[1] D. Xu, Z. Zhao, J. Xiao, F. Wu, H. Zhang, X. He, and Y. Zhuang, "Video question answering via gradually refined attention over appearance and motion," in ACM Multimedia, 2017.
[2] K.-H. Zeng, T.-H. Chen, C.-Y. Chuang, Y.-H. Liao, J. C. Niebles, and M. Sun, "Leveraging video descriptions to learn video question answering," in AAAI, 2017.
[3] Z. Zhao, Q. Yang, D. Cai, X. He, and Y. Zhuang, "Video question answering via hierarchical spatio-temporal attention networks," in IJCAI, 2017.
[4] Z. Zhao, Z. Zhang, S. Xiao, Z. Yu, J. Yu, D. Cai, F. Wu, and Y. Zhuang, "Open-ended long-form video question answering via adaptive hierarchical reinforced networks," in IJCAI, 2018.
[5] Z. Zhao, J. Lin, X. Jiang, D. Cai, X. He, and Y. Zhuang, "Video question answering via hierarchical dual-level attention network learning," in ACM Multimedia, 2017.
[6] C. Fan, X. Zhang, S. Zhang, W. Wang, C. Zhang, and H. Huang, "Heterogeneous memory enhanced multimodal attention model for video question answering," in CVPR, 2019.
[7] W. Jin, Z. Zhao, M. Gu, J. Yu, J. Xiao, and Y. Zhuang, "Multi-interaction network with object relation for video question answering," in ACM Multimedia, 2019.
[8] X. Li, L. Gao, X. Wang, W. Liu, X. Xu, H. T. Shen, and J. Song, "Learnable aggregating net with diversity learning for video question answering," in ACM Multimedia, 2019.
[9] T. Yang, Z.-J. Zha, H. Xie, M. Wang, and H. Zhang, "Question-aware tube-switch network for video question answering," in ACM Multimedia, 2019.
[10] J. Gao, R. Ge, K. Chen, and R. Nevatia, "Motion-appearance co-memory networks for video question answering," in CVPR, 2018.
[11] W. Zhang, S. Tang, Y. Cao, S. Pu, F. Wu, and Y. Zhuang, "Frame augmented alternating attention network for video question answering," IEEE Transactions on Multimedia, vol. 22, no. 4, pp. 1032-1041, 2020.
[12] Y. Han, B. Wang, R. Hong, and F. Wu, "Movie question answering via textual memory and plot graph," IEEE Transactions on Circuits and Systems for Video Technology, vol. 30, no. 3, pp. 875-887, 2020.
[13] X. Li, J. Song, L. Gao, X. Liu, W. Huang, X. He, and C. Gan, "Beyond RNNs: Positional self-attention with co-attention for video question answering," in AAAI, 2019.
[14] Y. Xiong, B. Dai, and D. Lin, "Move forward and tell: A progressive generator of video descriptions," in ECCV, 2018.
[15] Y.-S. Wang, H.-T. Su, C.-H. Chang, Z.-Y. Liu, and W. Hsu, "Video question generation via cross-modal self-attention networks learning," in ICASSP, 2020.
[16] Z. Yu, D. Xu, J. Yu, T. Yu, Z. Zhao, Y. Zhuang, and D. Tao, "ActivityNet-QA: A dataset for understanding complex web videos via question answering," in AAAI, 2019.
[17] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang, "SQuAD: 100,000+ questions for machine comprehension of text," in EMNLP, 2016.
[18] A. Trischler, T. Wang, X. Yuan, J. Harris, A. Sordoni, P. Bachman, and K. Suleman, "NewsQA: A machine comprehension dataset," in Rep4NLP, 2017.
[19] J. Lei, L. Yu, M. Bansal, and T. Berg, "TVQA: Localized, compositional video question answering," in EMNLP, 2018.
[20] Y. Jang, Y. Song, Y. Yu, Y. Kim, and G. Kim, "TGIF-QA: Toward spatio-temporal reasoning in visual question answering," in CVPR, 2017.
[21] D. Chen and W. Dolan, "Collecting highly parallel data for paraphrase evaluation," in ACL, 2011.
[22] J. Xu, T. Mei, T. Yao, and Y. Rui, "MSR-VTT: A large video description dataset for bridging video and language," in CVPR, 2016.
[23] F. Caba Heilbron, V. Escorcia, B. Ghanem, and J. C. Niebles, "ActivityNet: A large-scale video benchmark for human activity understanding," in CVPR, 2015.
[24] A. Zadeh, M. Chan, P. P. Liang, E. Tong, and L.-P. Morency, "Social-IQ: A question answering benchmark for artificial social intelligence," in CVPR, 2019.
[25] K. Yi, C. Gan, Y. Li, P. Kohli, J. Wu, A. Torralba, and J. B. Tenenbaum, "CLEVRER: Collision events for video representation and reasoning," in ICLR, 2020.
[26] F. Liu, T. Xiang, T. M. Hospedales, W. Yang, and C. Sun, "iVQA: Inverse visual question answering," in CVPR, 2018.
[27] Y. Li, N. Duan, B. Zhou, X. Chu, W. Ouyang, X. Wang, and M. Zhou, "Visual question generation as dual task of visual question answering," in CVPR, 2018.
[28] R. Krishna, M. Bernstein, and L. Fei-Fei, "Information maximizing visual question generation," in CVPR, 2019.
[29] X. Shi, J. Cai, S. Joty, and J. Gu, "Watch it twice: Video captioning with a refocused video encoder," in ACM Multimedia, 2019.
[30] Y. Hu, Z. Chen, Z.-J. Zha, and F. Wu, "Hierarchical global-local temporal modeling for video captioning," in ACM Multimedia, 2019.
[31] Y. Zhu and S. Jiang, "Attention-based densely connected LSTM for video captioning," in ACM Multimedia, 2019.
[32] E. Barati and X. Chen, "Critic-based attention network for event-based video captioning," in ACM Multimedia, 2019.
[33] J. Wang, W. Wang, Y. Huang, L. Wang, and T. Tan, "Hierarchical memory modelling for video captioning," in ACM Multimedia, 2018.
[34] S. Liu, Z. Ren, and J. Yuan, "SibNet: Sibling convolutional encoder for video captioning," in ACM Multimedia, 2018.
[35] H. Wang, Y. Xu, and Y. Han, "Spotting and aggregating salient regions for video captioning," in ACM Multimedia, 2018.
[36] Z. Yang, Y. Han, and Z. Wang, "Catching the temporal regions-of-interest for video captioning," in ACM Multimedia, 2017.
[37] J. Xu, T. Yao, Y. Zhang, and T. Mei, "Learning multimodal attention LSTM networks for video captioning," in ACM Multimedia, 2017.
[38] Z. Yang, Y. Xu, H. Wang, B. Wang, and Y. Han, "Multirate multimodal video captioning," in ACM Multimedia, 2017.
[39] Q. Jin, S. Chen, J. Chen, and A. Hauptmann, "Knowing yourself: Improving video caption via in-depth recap," in ACM Multimedia, 2017.
[40] S. Chen, J. Chen, Q. Jin, and A. Hauptmann, "Video captioning with guidance of multimodal latent topics," in ACM Multimedia, 2017.
[41] P. Tang, H. Wang, H. Wang, and K. Xu, "Richer semantic visual and language representation for video captioning," in ACM Multimedia, 2017.
[42] C. Yan, Y. Tu, X. Wang, Y. Zhang, X. Hao, Y. Zhang, and Q. Dai, "STAT: Spatial-temporal attention mechanism for video captioning," IEEE Transactions on Multimedia, vol. 22, no. 1, pp. 229-241, 2020.
[43] N. Xu, A. Liu, Y. Wong, Y. Zhang, W. Nie, Y. Su, and M. Kankanhalli, "Dual-stream recurrent neural network for video captioning," IEEE Transactions on Circuits and Systems for Video Technology, vol. 29, no. 8, pp. 2482-2493, 2019.
[44] Y. Pan, T. Mei, T. Yao, H. Li, and Y. Rui, "Jointly modeling embedding and translation to bridge video and language," in CVPR, 2016.
[45] Y. Pan, T. Yao, H. Li, and T. Mei, "Video captioning with transferred semantic attributes," in CVPR, 2017.
[46] Z. Gan, C. Gan, X. He, Y. Pu, K. Tran, J. Gao, L. Carin, and L. Deng, "Semantic compositional networks for visual captioning," in CVPR, 2017.
[47] C. Gan, Z. Gan, X. He, J. Gao, and L. Deng, "StyleNet: Generating attractive visual captions with styles," in CVPR, 2017.
[48] X. Long, C. Gan, and G. de Melo, "Video captioning with multi-faceted attention," Transactions of the Association for Computational Linguistics, vol. 6, pp. 173-184, 2018.
[49] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in CVPR, 2016.
[50] S. Ren, K. He, R. B. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 6, pp. 1137-1149, 2017.
[51] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma et al., "Visual Genome: Connecting language and vision using crowdsourced dense image annotations," International Journal of Computer Vision, 2017.
[52] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," in NIPS, 2017.
[53] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, "BLEU: A method for automatic evaluation of machine translation," in ACL, 2002.
[54] C.-Y. Lin, "ROUGE: A package for automatic evaluation of summaries," in Text Summarization Branches Out, 2004.
[55] R. Vedantam, C. Lawrence Zitnick, and D. Parikh, "CIDEr: Consensus-based image description evaluation," in CVPR, 2015.
[56] S. Banerjee and A. Lavie, "METEOR: An automatic metric for MT evaluation with improved correlation with human judgments," in Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, 2005.
[57] T. Pedersen, S. Patwardhan, and J. Michelizzi, "WordNet::Similarity: Measuring the relatedness of concepts," in NAACL, 2004.
[58] G. Klein, Y. Kim, Y. Deng, J. Senellart, and A. M. Rush, "OpenNMT: Open-source toolkit for neural machine translation," in ACL, 2017.
[59] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, "Automatic differentiation in PyTorch," 2017.
[60] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in ICLR, 2015.
[61] L. Zhou, Y. Zhou, J. J. Corso, R. Socher, and C. Xiong, "End-to-end dense video captioning with masked transformer," in CVPR, 2018.
[62] L. Dong, N. Yang, W. Wang, F. Wei, X. Liu, Y. Wang, J. Gao, M. Zhou, and H.-W. Hon, "Unified language model pre-training for natural language understanding and generation," in NIPS, 2019.
[63] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," in NAACL-HLT, 2019.
[64] S. Venugopalan, M. Rohrbach, J. Donahue, R. Mooney, T. Darrell, and K. Saenko, "Sequence to sequence - video to text," in ICCV, 2015.
[65] X. Yin and V. Ordonez, "Obj2Text: Generating visually descriptive language from object layouts," in EMNLP, 2017.
[66] I. Calixto and Q. Liu, "Incorporating global visual features into attention-based neural machine translation," in EMNLP, 2017.
[67] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh, "VQA: Visual question answering," in ICCV, 2015.
[68] S. Sukhbaatar, A. Szlam, J. Weston, and R. Fergus, "End-to-end memory networks," in NIPS, 2015.
[69] L. Yao, A. Torabi, K. Cho, N. Ballas, C. Pal, H. Larochelle, and A. Courville, "Describing videos by exploiting temporal structure," in ICCV, 2015.
[70] J. Pennington, R. Socher, and C. Manning, "GloVe: Global vectors for word representation," in EMNLP, 2014.