DVD: A Diagnostic Dataset for Multi-step Reasoning in Video Grounded Dialogue
Hung Le ‡§∗, Chinnadhurai Sankar †, Seungwhan Moon †, Ahmad Beirami †, Alborz Geramifard †, Satwik Kottur †
† Facebook, {chinnadhurai, shanemoon, beirami, alborzg, skottur}@fb.com
‡ Singapore Management University
§ Institute for Infocomm Research, A*STAR, [email protected]
∗ Work done when HL was a research intern at Facebook.
Abstract
A video-grounded dialogue system is required to understand both dialogue, which contains semantic dependencies from turn to turn, and video, which contains visual cues of spatial and temporal scene variations. Building such dialogue systems is a challenging problem involving complex multimodal and temporal inputs, and studying them independently is hard with existing datasets. Existing benchmarks do not have enough annotations to help analyze dialogue systems and understand their linguistic and visual reasoning capabilities and limitations in isolation. These benchmarks are also not explicitly designed to minimize biases that models can exploit without actual reasoning. To address these limitations, in this paper, we present a diagnostic dataset that can test a range of reasoning abilities on videos and dialogues. The dataset is designed to contain minimal biases and has detailed annotations for the different types of reasoning each question requires, including cross-turn video interval tracking and dialogue object tracking. We use our dataset to analyze several dialogue system approaches, providing interesting insights into their abilities and limitations. In total, the dataset contains 10 instances of 10-round dialogues for each of ∼11k synthetic videos, resulting in more than 100k dialogues and 1M question-answer pairs. Our code and dataset will be made public.

Introduction

Visual question answering (VQA) is a popular line of research that aims to develop intelligent systems that can reason and answer questions about visual information. Earlier datasets have been introduced to study this problem, focusing on images as the visual input (Antol et al., 2015; Gao et al., 2015; Malinowski and Fritz, 2014; Zhu et al., 2016). Recently, many question answering benchmarks have been proposed to extend the visual information from the image to the video domain (Jang et al., 2017; Lei et al., 2018; Zadeh et al., 2019). While image QA problems require a system to learn cross-modality interaction, video QA problems go beyond and capture visual information with temporal variance. Correctly answering questions about the content of videos requires different types of perceptual abilities, such as recognizing moving objects, including their locations and actions.

As an orthogonal extension from VQA problems, another line of research investigates image/video QA in a dialogue setting (Das et al., 2017; Seo et al., 2017; De Vries et al., 2017; Chattopadhyay et al., 2017; Alamri et al., 2019). In this problem, questions about a given video or image are positioned in a multi-turn dialogue. In each dialogue turn, a question usually exhibits different types of cross-turn relations to questions in prior dialogue turns, such as object co-references and topic alignment. In this work, we investigate the problem of multi-turn video question answering (QA), or video-grounded dialogue.

Numerous approaches to video-grounded dialogue have shown remarkable performance in building intelligent multimodal systems (Hori et al., 2019; Schwartz et al., 2019; Le et al., 2019; Li et al., 2020; Le et al., 2020). However, most of the methods exhibit marginal performance gains, and our ability to understand their limitations is impeded by the complexity of the task. Existing video-grounded dialogue benchmarks are not designed with enough information to determine whether current approaches are capable of sophisticated reasoning and not just exploiting biases (Agrawal et al., 2016; Goyal et al., 2017; Qi et al., 2020).
Q1: until the end of the cube's rotation, what types of actions does the big thing undertake the most? A1: flying
Q2: during the same time period, how many sliding objects are there? A2: 2
Q3: among them, there is a ball. during the whole video, what type of action does it undertake second? A3: no action
Q4: how about up until now? A4: sliding
Q5: during the red thing's last slide, how many things are behind the earlier mentioned large object? A5: 2
Q6: how about left of it? A6: 0
(The figure also depicts the timelines of two example objects, Object A and Object B, whose slide, fly, and rotate actions span time points t0 through T, together with the end-of-video markers EOV1 and EOV2 of the two video inputs V1 and V2.)
Figure 1:
Example video-grounded dialogue in DVD:
We demonstrate sample questions that test aspects of visual and linguistic reasoning, such as action recognition, temporal reasoning, spatial reasoning, cross-turn video interval tracking, and dialogue object tracking. Qi/Ai: question/answer of turn i. EOVj: end of video input Vj.

Multiple factors can affect the performance of a system. For instance, a dialogue agent answers incorrectly if it is unable to decode the dialogue context and derive contextualized objects, such as "the earlier mentioned large object" and "it" in Questions 5 and 6 of Figure 1. Moreover, a video-grounded dialogue does not just focus on a specific timestamp or video segment. Instead, different video segments are mentioned from turn to turn and a system is required to locate the right temporal information (see Figure 1).

To address the limitations of existing benchmarks and analyze dialogue systems efficiently, we propose DVD, a Diagnostic Dataset for Video-grounded Dialogues. In total, from about 11k synthetic videos, we built a benchmark containing more than 100k dialogues and over 1M automatically-generated question-answer pairs, a large portion of which are unique. We built our benchmark on top of a challenging video dataset, CATER (Girdhar and Ramanan, 2020). The CATER dataset contains videos of multiple objects arranged in a 3D environment with high variance of object appearance, locations, and actions. Moreover, the information in each video is not affected by external information sources, such as commonsense knowledge, making it ideal for learning visual reasoning in dialogues. From scene graphs and object action annotations of CATER videos, we simulate questions and their functional programs (Johnson et al., 2017) in a multi-turn setting. As can be seen in Figure 1, at each dialogue turn, a question is generated to test the model's ability to perform different types of reasoning on videos, such as action recognition and spatio-temporal reasoning. Across turns, questions are designed to be related to one another through different types of semantic relationships. Specifically, we propose two sub-tasks relevant to video-grounded dialogue problems: video interval tracking (VIT) and dialogue object tracking (DOT).

The VIT task requires a dialogue system to identify which video segment each dialogue turn is referring to. In each turn, the question is generated such that the corresponding video segment is either independent of or related to another segment in prior dialogue turns. This task is an extension of prior research in temporal localization through text, or text-to-clip (Anne Hendricks et al., 2017), but is designed in a multi-turn setting. DOT has the nature of dialogue state tracking (DST), which requires a system to track information slots in task-oriented dialogues (Mrkšić et al., 2017). While in task-oriented dialogues the tracked slots are used to create API queries to entity databases, tracked objects in DOT are used to resolve object references and locate the visual objects in videos.

The DVD benchmark allows us to train and analyze methods by their reasoning capabilities. For instance, we found that current dialogue models struggle on question types requiring both video temporal and spatial localization. Existing approaches do not explicitly track dialogue objects and video segments properly and struggle to learn different semantic dependencies. These observations point us to potential avenues for future research on video-grounded dialogue systems.
Benchmarks | Diagnostic benchmark | SR | TR | DOT | VIT
Image/video QA, embodied QA:
VQA (Antol et al., 2015), Visual7W (Zhu et al., 2016) | ✗ | ✓ | ✗ | ✗ | ✗
TGIF-QA (Jang et al., 2017), TV-QA (Lei et al., 2018) | ✗ | ✓ | ✓ | ✗ | ✗
IQA (Gordon et al., 2018), EQA (Wijmans et al., 2019) | ✗ | ✓ | ✓ | ✗ | ✗
Image/video grounded dialogues, navigation dialogues:
VisDial (Das et al., 2017), GuessWhat (De Vries et al., 2017) | ✗ | ✓ | ✗ | ✓ | ✗
AVSD (Hori et al., 2019), CVDN (Thomason et al., 2019) | ✗ | ✓ | ✓ | ✓ | ✓
Synthetic image/video QA:
SHAPE (Andreas et al., 2016), CLEVR (Johnson et al., 2017) | ✓ | ✓ | ✗ | ✗ | ✗
SVQA (Song et al., 2018), CLEVRER (Yi* et al., 2020) | ✓ | ✓ | ✓ | ✗ | ✗
Synthetic dialogues:
bAbI (Bordes et al., 2017) | ✓ | ✗ | ✗ | ✓ | ✗
MNIST Dialog (Seo et al., 2017), CLEVR-Dialog (Kottur et al., 2019) | ✓ | ✓ | ✗ | ✓ | ✗
DVD (Ours) | ✓ | ✓ | ✓ | ✓ | ✓
Table 1:
Comparison to related benchmarks:
Compared to existing datasets for vision-language understanding, DVD is the first diagnostic benchmark designed for both spatial reasoning (SR) and temporal reasoning (TR) and explicitly requiring dialogue object tracking (DOT) and video interval tracking (VIT) in a multi-turn setting.
Related Work

Table 1 presents a comparison between DVD and relevant benchmarks. We position our dataset relative to existing datasets from four angles: 1) vision-linguistic, 2) visually-grounded, 3) diagnostic, and 4) multi-step reasoning.
1) Vision-linguistic.
Numerous benchmarks for vision-linguistic understanding have been introduced, including datasets for image and video captioning (Farhadi et al., 2010; Lin et al., 2014; Rohrbach et al., 2015), phrase grounding or visual object reference (Kazemzadeh et al., 2014; Plummer et al., 2015), scene graph learning (Krishna et al., 2017), and text-to-clip (Anne Hendricks et al., 2017). Our benchmark, DVD, is more related to benchmarks for visual question answering, in which a visual input such as an image or a video is given and a system is required to answer natural-language questions about the visual input (Antol et al., 2015; Zhu et al., 2016; Jang et al., 2017; Lei et al., 2018). Another related line of research studies navigation systems in a dynamic environment (Gordon et al., 2018; Wijmans et al., 2019). In this task, systems are required to understand and follow an instruction to navigate in a simulated physical space. Compared to these prior benchmarks, one major difference of DVD is the extension from single-turn interaction to a multi-turn setting.
2) Visually-grounded.
In this line of research, a system is required to obtain information from a given visual input to answer questions over multiple rounds (De Vries et al., 2017; Das et al., 2017; Chattopadhyay et al., 2017; Hori et al., 2019; Thomason et al., 2019). To decode a question in the current turn, a system has to understand the dialogue context and resolve different semantic dependencies. However, due to the complexity of the tasks, involving cross-modality and cross-turn information, prior benchmarks are often subject to biases that models can exploit without actual reasoning (Qi et al., 2020). Specifically, in a video-grounded dialogue, each question turn often focuses on different parts of the video. One limitation of current benchmarks such as AVSD (Hori et al., 2019) is the lack of information to evaluate how a model localizes relevant video intervals from turn to turn. In this work, we introduce a diagnostic benchmark that explicitly requires a system to track dialogue objects and video intervals.
3) Diagnostic.
Seo et al. (2017); Kottur et al. (2019) used synthetic images to develop diagnostic image-grounded dialogues. Compared to these benchmarks, one major difference of DVD is the extension to the video domain and the injection of diverse linguistic dependencies between turns. Specifically, in DVD, we incorporate 3 types of relationships: temporal relation, object reference, and topic transfer (see Section 3). This design choice results in synthetic dialogues with richer language and higher linguistic variance in questions.
Figure 2:
Left: examples of video intervals, defined as continuous time periods between two events. Atomic intervals are non-overlapping time periods and all remaining periods are compositional intervals.
Right: projection of objects on the ground plane. Considering the "left" relationship, "A1 is left of B2" and "A2 is left of B4".

As shown in Table 2, compared to similar benchmarks, DVD contains a higher number of unique questions over a very large-scale dataset. DVD is also motivated by dialogue state tracking (DST) in task-oriented dialogue systems (Mrkšić et al., 2017; Bordes et al., 2017). In these systems, a DST model is required to detect all domain-specific information slots in dialogues. Instead of information slots, in DVD, for each dialogue turn, we introduce a dialogue object state, defined as all detected objects and their attributes mentioned in the dialogue context. DVD is the first diagnostic multimodal dialogue benchmark that provides such detailed annotations of dialogue states.
4) Multi-step reasoning.
In multi-step reasoning, a question is represented as sub-sequences called functional programs. Earlier efforts (Andreas et al., 2016; Johnson et al., 2017) proposed to use synthetic images and design questions that are expressed as elementary operation programs. More related to our work, Song et al. (2018); Yi* et al. (2020) extended the prior benchmarks to the video domain with questions focusing on the temporal variance of video frames. A major difference between our work and these benchmarks is the extension of functional programs to a dialogue task with dialogue context-based operations, such as object tracking and interval tracking. This extension brings a step toward more transparent dialogue systems capable of performing complex reasoning.
Our benchmark provides a dataset that can be used to conduct rich diagnostics to better understand the reasoning capabilities of dialogue systems. We utilize CATER, a challenging video action recognition benchmark (Girdhar and Ramanan, 2020), to automatically generate dialogues. The videos have associated ground-truth object attributes, locations, and action time intervals. The generated dialogues contain questions that have associated functional program structures. In this benchmark, we focus on several types of visual and linguistic operations, such as temporal localization, object tracking, and spatio-temporal reasoning. Figures 2, 3, and 4 give a brief overview of the main components of the benchmark, which we describe in detail below.
Objects.
Objects are identified by their attributes, including object shapes (cube, sphere, cylinder, cone, and snitch), sizes (small, large, and medium), materials (rubber and metal), and colors. One unique characteristic of CATER objects is that each object can move multiple times in a single video. We define four types of object actions: "flying", "rotating", "sliding", and "no action" (stationary).
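To make the object and action annotations concrete, here is a minimal sketch of how one CATER object could be represented; the field names and the tuple-based action format are illustrative assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Action label set from the paper: flying, rotating, sliding, no action.
ACTIONS = {"flying", "rotating", "sliding", "no action"}

@dataclass
class ObjectAnnotation:
    """One CATER object with its attributes and timed actions (hypothetical schema)."""
    shape: str                      # cube, sphere, cylinder, cone, snitch
    size: str                       # small, medium, large
    material: str                   # rubber, metal
    color: str
    # Each action is (action_name, start_frame, end_frame); an object may move several times.
    actions: List[Tuple[str, int, int]] = field(default_factory=list)

# Example object that slides twice and flies once during the video.
obj = ObjectAnnotation(
    shape="cone", size="small", material="rubber", color="gray",
    actions=[("sliding", 10, 60), ("flying", 120, 180), ("sliding", 200, 240)],
)
```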
Video intervals.

Video intervals are continuous video frames that can be defined by two events, each of which can be the start or end of an object's action or the start or end of the whole video. We formulate two types of video intervals:
Atomic intervals.
Atomic intervals are intervals in which each object can have at most one action and can be in only one of two states: in motion or stationary. To find atomic intervals, we simply collate the start and end times of all object actions in a video and sort them chronologically. By definition, all non-overlapping time intervals are considered atomic. For example, in Figure 2, all time ranges (t0, t1), (t1, t2), ..., (t6, T) are atomic. This constraint allows us to identify the relative spatial relationships ("left", "right", "behind", and "in front") between any two objects during an atomic interval by using their coordinates at the start and end of the interval. Note that we can apply this rule since all object actions in the CATER universe are projected as a straight line ("flying", "sliding") or a single point ("rotating", "no action") on the ground plane. Practically, we decide to focus on spatial reasoning only when one of the objects is stationary. We use this object as a "base" to compute the relative positions of the remaining objects. In Figure 2, we demonstrate different scenarios of the "left" spatial relation between objects A and B.
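The interval construction and the stationary-base spatial check described above can be sketched as follows; this is a simplified illustration that assumes frame-indexed action spans and 2D ground-plane coordinates, not the actual generation code.

```python
from typing import List, Tuple

def atomic_intervals(action_spans: List[Tuple[int, int]], video_end: int) -> List[Tuple[int, int]]:
    """Collate the start/end times of all object actions, sort them, and return the
    non-overlapping time ranges between consecutive boundaries (atomic intervals)."""
    boundaries = {0, video_end}
    for start, end in action_spans:
        boundaries.update((start, end))
    ordered = sorted(boundaries)
    return [(s, e) for s, e in zip(ordered, ordered[1:]) if e > s]

def is_left_of(base_xy: Tuple[float, float], other_xy: Tuple[float, float]) -> bool:
    """Hypothetical 'left' test on ground-plane coordinates, using a stationary
    object as the base; the sign convention here is illustrative only."""
    return other_xy[0] < base_xy[0]

# Example: two actions (10, 60) and (120, 180) in a 300-frame video yield
# atomic intervals (0,10), (10,60), (60,120), (120,180), (180,300).
print(atomic_intervals([(10, 60), (120, 180)], video_end=300))
```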
Top panel (catalog of basic functions): Filter <size / color / material / shape / action>, Unique, Count, Exist, Relate, Query Action, Action by Frequency, Same Action Set, Same Action Sequence, Find Interval, Relate Interval, Union.
Bottom panel (example atomic interval question): a program involving Filter size (small), Filter shape (cone), Unique, Find Interval (flight), Relate Interval (during), Filter color (brown), Filter material (metal), Relate (front), and Count.
Bottom panel (example compositional interval question): "after the gray rubber thing's first flight and before its second slide, is there any other object with the same set of activities performed by the block?", with a program involving Filter color (gray), Filter material (rubber), Unique, Find Interval (first flight), Relate Interval (after), Find Interval (second slide), Relate Interval (before), Union, Filter shape (block), Same Action Set, and Exist.
Figure 3:
Functional program operations and example reasoning structures:
Top: Catalog of basic functions used to build questions on videos. Bottom: Examples of questions and their associated functional programs.
Compositional intervals.
Compositional intervals are all other intervals that are not atomic. In these intervals, an object can have more than one action and be in more than one state. Therefore, its movement projections are not linear and we do not identify spatial relations in these cases. Instead, we focus on information about action order and action frequency and incorporate it into questions.
Question representation.
Following CLEVR, we associate each question with a functional program that can be executed on video scenes and dialogue context. As shown in Figure 3, we adopt the question families and templates from CLEVR with several extensions to the video domain. We introduce additional symbolic operations to retrieve information along the temporal dimension of the video. Overall, we utilize a set of question templates. In Figure 3, we illustrate two sample questions with their corresponding reasoning structures. One question incorporates spatio-temporal reasoning in an atomic interval, while the other involves an action-set query over a compositional interval. Figures 6-(a) and 6-(b) show the percentages of questions by question type and video interval type. The full detail of each reasoning function and more example questions can be found in the supplementary material.
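As a rough illustration of how such CLEVR-style programs execute, the sketch below interprets a tiny program as a chain of elementary operations over a toy scene; the operation names and scene format are simplified stand-ins for the actual DVD operations.

```python
# Minimal functional-program interpreter over a toy scene (illustrative only).
scene = [
    {"shape": "cube", "color": "gray", "material": "rubber", "size": "large"},
    {"shape": "cone", "color": "brown", "material": "metal", "size": "small"},
    {"shape": "cone", "color": "gray", "material": "metal", "size": "small"},
]

OPS = {
    "filter_color": lambda objs, arg: [o for o in objs if o["color"] == arg],
    "filter_shape": lambda objs, arg: [o for o in objs if o["shape"] == arg],
    "count": lambda objs, arg: len(objs),
    "exist": lambda objs, arg: len(objs) > 0,
}

def execute(program, scene):
    """Run a list of (operation, argument) steps, threading the intermediate result."""
    state = scene
    for op, arg in program:
        state = OPS[op](state, arg)
    return state

# "how many gray cones are there?" -> filter_color(gray) -> filter_shape(cone) -> count
print(execute([("filter_color", "gray"), ("filter_shape", "cone"), ("count", None)], scene))  # 1
```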
Dialogue generation.

We generated dialogues with a fixed length of 10 turns. In each turn, we adopted a depth-first search (DFS) approach, as similarly used in CLEVR (Johnson et al., 2017), to instantiate questions by sequentially executing functional programs. To generate linguistic dependencies between dialogue turns, at each turn, we randomly sample and incorporate up to three of the relationship types listed below. Figures 4 and 5 present examples of a dialogue and of questions with these semantic relations. Figure 6-(d) shows the dialogue distributions by these relations.
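A highly simplified view of this per-turn sampling loop is sketched below; the template bank and the sampling ranges are illustrative assumptions rather than the released generator.

```python
import random

RELATION_TYPES = ["temporal_relation", "object_reference", "topic_transfer"]

# Tiny illustrative template bank keyed by whether the turn carries cross-turn relations.
TEMPLATES = {
    False: ["how many <size> <shape>s are there?"],
    True: [
        "during the same time period, how many <size> <shape>s are there?",
        "what about its <attribute>?",
    ],
}

def generate_dialogue_skeleton(num_turns=10, max_relations=3, seed=0):
    """Sketch of the per-turn sampling loop: the first turn is standalone, and every
    later turn samples up to `max_relations` cross-turn relation types before a
    question template is chosen. Program instantiation/execution is omitted."""
    random.seed(seed)
    turns = []
    for turn in range(num_turns):
        n_rel = 0 if turn == 0 else random.randint(0, max_relations)
        relations = random.sample(RELATION_TYPES, k=n_rel)
        template = random.choice(TEMPLATES[bool(relations)])
        turns.append({"turn": turn + 1, "relations": relations, "template": template})
    return turns

for t in generate_dialogue_skeleton(num_turns=3):
    print(t)
```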
Type I: Video Temporal Relation (TR).

This relation tests a system's ability to localize video intervals in relation to past dialogue turns. We randomly select one of three types of relation: "during", "before", and "after". The "during" relation reuses the same time interval as the last dialogue turn, e.g., Q2 in Figure 1, while the "before" and "after" relations simulate a dialogue flow with references to earlier and subsequent video segments.

Type II: Dialogue Object Reference (OR).
We incorporate object references into a question template by replacing the original object phrase, such as "the large rubber cone", with pronouns, such as "it" or "them". Additionally, we simulate long-term memory reasoning by injecting unique objects mentioned earlier in the dialogue history. We simulate this behavior by maintaining a dialogue object state at each turn. To choose an object for reference, we randomly sample a dialogue turn from the dialogue context and sample an object introduced in this turn. This object is used to replace the original phrase in the question template, using attributes recorded in the dialogue object state. For example, in question Q3 in Figure 4, "the earlier mentioned small thing" is identified from the object originally introduced in the 1st turn. In Figure 6-(c), we show the question distribution by the turn distance of object references, with the maximum distance bounded by the dialogue length.
Q1: before the large thing's first flight, what color is the average thing that is in front of the small thing? A1: yellow
Q2: what about its material? A2: rubber
Q3: during the earlier mentioned small thing's first slide, what shape is the stationary thing to the right of the aforementioned average object? A3: cube
Q4: during the same time period, how many average cyan shiny things are behind the gray object? A4: 1
Q5: how about to the left of it? A5: 0
Q6: throughout the whole video, does the earlier cube object fly more frequently than the earlier mentioned average object slides? A6: True
Q7: what about up until now? A7: False
(For each turn, the figure also lists the accumulated dialogue object state, e.g., {obj1: size=large}, {obj2: size=average, color=yellow, material=rubber}, {obj3: size=small}, ..., the cross-turn relations used (TR, OR, TT), and the video input, which grows from an initial cutoff toward the full video as the dialogue progresses.)
Figure 4:
Dialogue generation:
In each dialogue turn, we generate questions with randomly sampled cross-turn dependencies: temporal relation (TR), object reference (OR), and topic transfers (TT), including attribute (A), spatial (S), and temporal (T) transfer. In each turn, the annotation of the dialogue object state is obtained, including all objects and their attributes mentioned up to the last dialogue turn.
Example question with 'before' TR and long-term OR semantics: "before this time period, how many other things with the same set of activities performed by the aforementioned yellow thing?" Its program involves Track Objects, Filter color (yellow), Unique, Same Action Set, and Count, with the interval resolved through Track Interval and Relate Interval (before).
Example question with 'after' TR and short-term OR semantics: "among them, there is a red thing. after this time window, what type of action does it undertake last?" Its program involves Refer (them), Track Objects, Filter color (red), Unique, and Action by Order (last), with the interval resolved through Track Interval and Relate Interval (after).
Figure 5:
Top:
Catalog of program functions over dialogue context.
Bottom:
Examples of questions positioned in dialogue and their associated functional programs.

Type III: Topic Transfer (TT).
This relation tests the model's ability to perform short-term memory reasoning from the last dialogue turn to the current turn. We introduce three types of topic transfers: attribute (A), spatial (S), and temporal (T) transfer. Attribute and spatial transfers reuse the same question from the prior dialogue turn except with an updated attribute field or spatial relation (e.g., Q2 and Q5 in Figure 4). In temporal transfer, we introduce a unique setting of situated dialogue. At the first dialogue turn, we shorten a CATER input video by a cutoff point, e.g., $T_1$. At each dialogue turn, for a fixed fraction of the time, we update the current video input with a new cutoff point later than the previous one, i.e., $T_{i+1} \gg T_i$. There is no further video update when the cutoff is the end of the original CATER video, i.e., $T_{i+1} = T$. For instance, in Figure 1, at Q4, we reuse the same question from Q3 but with video content extended from the previous input video. This design tests dialogue systems in a dynamic environment with a continuous visual stream, as similarly adopted in navigation systems (Thomason et al., 2019).

Dialogue filters.
In addition to ill-posed and degenerate questions (Johnson et al., 2017), at each turn, we remove any question that becomes redundant when positioned in the dialogue context. For instance, the question "how many red rubber objects are there?" is removed if, in a prior dialogue turn, the question is "how many red objects are there?" and the answer is "1". To do this, we perform a check at every dialogue turn to determine whether the involved objects and their attributes are already recorded in the dialogue object state. Furthermore, we also remove any dialogue that does not satisfy a certain complexity level, defined by its total number of semantic dependencies. Finally, for each question type, we simulate an approximately uniform distribution of answer values, minimizing bias resulting from the question-type data distribution.

We present the overall statistics of DVD in comparison with related benchmarks in Table 2. For more details of our benchmark and dialogue examples, please refer to the supplementary material.
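The redundancy check can be viewed as a containment test against the dialogue object state; the toy sketch below makes that concrete, with the state format (object id to attribute dict) assumed purely for illustration.

```python
def is_redundant(question_constraints, object_state):
    """Return True if a counting/existence question adds no new information:
    some previously identified object already satisfies all of the question's
    attribute constraints, so the answer is derivable from the dialogue context."""
    q = set(question_constraints.items())
    for known_attrs in object_state.values():
        if q.issubset(set(known_attrs.items())):
            return True
    return False

# Dialogue object state after "how many red objects are there? -> 1"
state = {"obj1": {"color": "red", "material": "rubber"}}

# "how many red rubber objects are there?" is now redundant.
print(is_redundant({"color": "red", "material": "rubber"}, state))  # True
print(is_redundant({"color": "cyan"}, state))                       # False
```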
Panel (a), question distribution by question type: Compare 22%, Action Query 20%, Object Count 19%, Object Exist 19%, Compare Interval 12%, Action Count 4%, Attribute Query 3%. Panel (b), question distribution by video interval: Compositional 69%, Atomic 28%, None 3%. Panel (c), number of questions by turn distance of object references. Panel (d), dialogue distribution by the number of semantic relations (all, TR, OR, TT) per dialogue.
Figure 6:
Data analysis of DVD.
Questions and dialogues in the DVD benchmark are simulated with various types of reasoning requirements, including TR (temporal relation), OR (object references), and TT (topic transfer).
Table 2:
Statistics for DVD: Overall, DVD has a large number of dialogues with a substantial number of unique questions.
The video-grounded dialogue task in DVD is defined as a turn-based retrieval task over multiple-choice candidate answers. At each dialogue turn $i$ ($i = 1, 2, \ldots, 10$), the corresponding video input $\mathcal{V}_i$, the ground-truth dialogue context, consisting of the question and answer pairs up to the last dialogue turn, $\mathcal{C}_i = \{(Q_k, A_k)\}_{k=1}^{i-1}$, and the question of the current turn $Q_i$ are provided. The system is given a set of candidate answers $\mathcal{A}$, predefined as all possible answer values for all question types, with $|\mathcal{A}| = 40$ in DVD. The model is required to select one correct answer out of the candidate list and is evaluated with the accuracy metric against the ground-truth answer. For a dialogue system parameterized by $\theta$, the objective is:

$$\hat{A}_i = \arg\max_{A_i \in \mathcal{A}} P(A_i \mid \mathcal{V}_i, Q_i, \mathcal{C}_i; \theta)$$
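In code, this retrieval formulation amounts to scoring every candidate answer and taking the argmax; a minimal sketch with a stand-in scoring function is shown below.

```python
from typing import Callable, List

def predict_answer(score: Callable[[str], float], candidates: List[str]) -> str:
    """Select the candidate answer with the highest model score, i.e.,
    argmax over A in the candidate set of P(A | V_i, Q_i, C_i; theta)."""
    return max(candidates, key=score)

def accuracy(predictions: List[str], ground_truth: List[str]) -> float:
    """Turn-level accuracy against the ground-truth answers."""
    correct = sum(p == g for p, g in zip(predictions, ground_truth))
    return correct / len(ground_truth)

# Toy usage with a hand-written scorer over a tiny candidate set.
candidates = ["0", "1", "2", "flying", "sliding"]
toy_scores = {"0": 0.1, "1": 0.2, "2": 0.5, "flying": 0.15, "sliding": 0.05}
print(predict_answer(lambda a: toy_scores[a], candidates))  # "2"
print(accuracy(["2", "flying"], ["2", "sliding"]))          # 0.5
```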
We reproduce a representative set of existing methods in video-grounded dialogue QA (Jang et al., 2017; Lei et al., 2018; Hori et al., 2019; Schwartz et al., 2019; Le et al., 2019; Li et al., 2020):

Answer Prior.

Each answer option is encoded using a token-level LSTM (Hochreiter and Schmidhuber, 1997) and scored by a multi-layer perceptron (MLP). This model is trained to select the most popular answer options from the training set without looking at either videos or dialogues.
Q-type.
This baseline selects a random answer (Q-type Random) or the most popular answer (Q-type Frequency) for each question type. The ground-truth question type is given in this baseline.
Q-retrieval.
At test time, for each question, this model simply computes the cosine similarity to all questions in the training set based on TF-IDF features. The answer to the most similar question is directly chosen as the predicted answer.
RNN(D).
Dialogue D is processed with learned word embeddings and encoded by a token-level LSTM. The final hidden state is passed to an MLP with softmax scores to predict a distribution over answer candidates. We experiment with different combinations of dialogue inputs, including the question Q and the dialogue context C, to test question- and dialogue-conditional biases.

HRNN(D).
As above, but when using dialogue context, this model uses a hierarchical architecture with 2 LSTMs to encode the dialogue as turn-level and token-level sequences (Serban et al., 2016). The final hidden state of the turn-level LSTM is input to the MLP.
HRNN(D)+CNN(V)/TA(V).
In video-grounded dialogue systems, a video V is typically represented with features from a pretrained 3D CNN. A video is separated into shorter segments/clips, each of which is passed through a CNN model, resulting in temporal-variant features. To aggregate video features, we experiment with 2 approaches: 1) CNN, simply using the CNN video features averaged along the temporal steps, and 2) TA, using an attention mechanism to select the relevant temporal steps, as similarly adopted by Hori et al. (2019). The prior text-only systems are integrated with these aggregation methods by concatenating the final text and video representations before passing them to the MLP.
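The two aggregation strategies can be contrasted in a few lines of PyTorch; this is a schematic re-implementation of mean pooling versus query-conditioned temporal attention, not the authors' exact architecture.

```python
import torch
import torch.nn.functional as F

def mean_pool(video_feats):
    """CNN(V): average the clip-level features over the temporal dimension.
    video_feats: (num_clips, feat_dim) -> (feat_dim,)"""
    return video_feats.mean(dim=0)

def temporal_attention(video_feats, query, w):
    """TA(V): weight clips by their relevance to a text query vector.
    video_feats: (num_clips, feat_dim), query: (query_dim,), w: (query_dim, feat_dim)."""
    scores = video_feats @ (w.t() @ query)        # (num_clips,) dot-product relevance
    weights = F.softmax(scores, dim=0)            # attention distribution over clips
    return weights @ video_feats                  # (feat_dim,) attended video summary

# Toy shapes: 8 clips of 2048-d features, 256-d text query.
clips, q = torch.randn(8, 2048), torch.randn(256)
w = torch.randn(256, 2048)
print(mean_pool(clips).shape, temporal_attention(clips, q, w).shape)
```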
TF(D+V).

Similar to the works of Le et al. (2019); Schwartz et al. (2019); Li et al. (2020), this model adopts deep attention networks to model cross-modal interactions. We concatenate all text and visual input components into a single sequence and pass it to a Transformer encoder (Vaswani et al., 2017). A special "[CLS]" token is used in the first position of the sequence to aggregate information through all attention rounds. Its final representation is then passed to an MLP to predict answers.
We used a 3D version of ResNet-101 (Hara et al., 2018) pretrained on the Kinetics dataset (Kay et al., 2017). We extracted all video features from the final average pooling layer, giving 2048-dimensional features which are not fine-tuned. The videos were resized prior to feature extraction, and video clips were sampled with a size of 16 frames and a stride of 4 frames. All LSTM networks used 2 recurrent layers, and all MLP networks used ReLU activation with dropout (Srivastava et al., 2014) and one hidden layer. All models were optimized with a cross-entropy loss between ground-truth and predicted answers using the Adam optimizer (Kingma and Ba, 2015). We tuned model hyper-parameters on the validation set and selected the best models by accuracy to evaluate on the test set.

As shown in Table 3, we observe that "blind" systems, which only have access to answers or questions, achieve low results of 16% to 42% accuracy. In the RNN(Q) model, we note that the performance per question type is very close to that of the Q-type(Freq) model, with a very marginal performance increase. For question types with only binary answer options, such as "compare int." questions, the performance of RNN(Q) is not far above random guessing.
Dialogue systems.
When a "blind" model has access to the full dialogue history, the performance increases by about 5.5 points, to 47.75%. This increment shows that dialogue context contains useful information for a dialogue system to infer answers. However, the performance increase is not significant, and it most likely comes from less challenging question turns injected with short-term memory reasoning. We note that, on average, several question turns per dialogue involve a topic transfer (see Figure 6). In such cases, a model can randomly make a good guess by just reusing the answer of the last question turn.

Model | Answer Prior | Q-type (Random) | Q-type (Freq) | Q-retrieval (TF-IDF) | RNN(Q) | HRNN(C+Q) | HRNN(C+Q)+CNN(V) | HRNN(C+Q)+TA(V) | TF(C+Q+V) | Human
Accuracy | 16.07% | 28.00% | 37.02% | 30.58% | 42.22% | 47.75% | 51.66% | 52.82% | 54.23% | 89.30%
Action count | 0.00% | 11.28% | 21.05% | 17.29% | 18.05% | 27.07% | 31.58% | 36.09% | 41.35% | 87.50%
Action query | 0.00% | 16.16% | 30.28% | 24.65% | 32.91% | 39.79% | 46.05% | 49.87% | 50.97% | 88.10%
Attr. query | 0.00% | 27.63% | 39.24% | 30.20% | 44.17% | 46.55% | 47.33% | 48.85% | 49.00% | 98.00%
Compare action | 26.60% | 29.83% | 36.93% | 29.55% | 38.38% | 46.03% | 50.31% | 50.22% | 54.64% | 84.21%
Compare int. | 48.88% | 52.35% | 50.04% | 43.09% | 57.48% | 56.25% | 62.33% | 65.65% | 67.53% | 88.46%
Obj. count | 0.00% | 8.86% | 21.85% | 15.68% | 27.34% | 40.40% | 42.48% | 42.98% | 43.47% | 90.57%
Obj. exist | 46.43% | 50.61% | 53.57% | 50.77% | 66.68% | 67.27% | 69.94% | 70.44% | 70.48% | 92.31%
Atomic | 19.25% | 33.10% | 42.86% | 45.56% | 63.59% | 62.46% | 64.98% | 65.42% | 66.55% | 83.33%
Atomic (spatial) | 17.40% | 28.03% | 35.64% | 27.30% | 39.41% | 46.04% | 47.86% | 48.32% | 47.76% | 93.88%
Compositional | 17.97% | 27.21% | 35.98% | 30.32% | 40.85% | 46.57% | 51.40% | 53.19% | 55.83% | 87.12%
None | 0.00% | 29.01% | 42.22% | 29.59% | 42.10% | 48.56% | 49.38% | 52.67% | 51.85% | 99.10%
Transfer (attr.) | 0.00% | 28.60% | 44.36% | 26.02% | 50.08% | 61.23% | 59.61% | 63.64% | 63.18% | 100.00%
Transfer (spatial) | 0.00% | 44.08% | 44.90% | 32.30% | 29.35% | 46.98% | 49.18% | 50.00% | 48.07% | 90.48%
Transfer (temporal) | 34.93% | 37.26% | 32.18% | 4.27% | 30.99% | 53.62% | 62.01% | 65.58% | 66.33% | 79.83%

Table 3:
Experiment results on the DVD test split: Models are evaluated for overall accuracy as well as accuracy per question type and per question spatio-temporal complexity. In addition, they are evaluated by the transferability metric on question turns with topic transfers, including attribute, spatial, and temporal transfers.

Transferability.
To clarify this performance increase, we investigate a new metric, called transferability. When a system is presented with a question turn with a topic transfer, it should learn to derive a new answer in relation to the answer of the last dialogue turn. If the last answer is right, a consistent dialogue system should likely be able to answer the current question turn correctly. For instance, given a question-answer pair "what is the color of the sliding cube? red", we can infer the answer to a transferred question "what about its material?" based on the same visual object. We gather questions that precede questions containing a topic transfer and call this set Q_prior. For each question q_prior that the model answered correctly, we measure the accuracy on the corresponding transferred question q_tt and average the scores across all of Q_prior. From Table 3, we notice a clear performance gain from RNN(Q) to HRNN(C+Q) in the transferability metric, with the largest gain on temporal topic transfers. However, this gain is still far from an ideal system, as the transferability of most baselines is not far above chance. A chance-based system can achieve non-trivial transferability by just recycling answers from prior dialogue turns.
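A small sketch of how such a transferability score could be computed from per-turn predictions follows; the record format is an assumption made for illustration.

```python
def transferability(turns):
    """Average accuracy on topic-transfer turns, restricted to cases where the model
    answered the immediately preceding (prior) turn correctly.
    turns: list of dicts with keys 'is_transfer', 'pred', 'gold'."""
    scores = []
    for prev, cur in zip(turns, turns[1:]):
        if cur["is_transfer"] and prev["pred"] == prev["gold"]:
            scores.append(float(cur["pred"] == cur["gold"]))
    return sum(scores) / len(scores) if scores else 0.0

dialogue = [
    {"is_transfer": False, "pred": "red",    "gold": "red"},
    {"is_transfer": True,  "pred": "rubber", "gold": "rubber"},  # counted, correct
    {"is_transfer": True,  "pred": "2",      "gold": "1"},       # counted, wrong
]
print(transferability(dialogue))  # 0.5
```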
Video-grounded dialogue systems.
When a system is presented with the visual input, we observe that model performance increases from 47.75% up to 54.23%. The driver of the performance gain can be investigated through the spatio-temporal dynamics of questions. Compared to the HRNN(C+Q)+CNN(V) model, using attention improves accuracy on compositional-interval questions by a few points, pushing up the overall accuracy. Attention methods such as dot-product attention have been used to select relevant information along the video temporal dimension. However, even the best performing system, TF(C+Q+V), which uses a deep attention design to reason over multimodal context, performs far below human level. All existing baselines are insufficient to tackle questions involving both spatial and temporal dependencies, as shown by the very marginal gains on the atomic (spatial) results. Specifically, we note that the challenging question categories with very marginal improvements are "action count", "attr. query", "atomic (spatial)", and "transfer (spatial)". These questions require strong reasoning ability over both objects' appearances and their movements.
Dialogue object tracking.
To further diagnose a dialogue system, we aim to study its long-term memory reasoning ability to track objects and their attributes mentioned in the dialogue context. Unlike topic transfer relations, evaluating a system's ability to learn from object references is much harder. Inspired by work on dialogue state tracking in task-oriented dialogues (Bordes et al., 2017), we propose to use tracking accuracy metrics in video-grounded dialogue systems. An ideal system should be able to track and update a dialogue state S, including all mentioned objects o_m and their attributes (sizes, colors, materials, and shapes), on a turn-by-turn basis. We define two tracking metrics: joint accuracy, measuring the accuracy of predicting all attributes in the dialogue state as a set, and slot accuracy, measuring the accuracy of predicted attributes individually. The introduction of these evaluation metrics necessitates a new learning task, dialogue object tracking (DOT) in video-grounded dialogue systems, to better understand current systems' long-term reasoning ability (more details in the supplementary material).
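Borrowing the usual DST conventions, joint and slot accuracy over predicted dialogue object states could be computed as sketched below; the state representation (object id to attribute dict) is an illustrative assumption.

```python
def dot_metrics(pred_states, gold_states):
    """Joint accuracy: a turn counts only if its whole predicted object state matches.
    Slot accuracy: each (object, attribute) slot is scored individually.
    Each state is a dict: object_id -> {attribute_name: value}."""
    joint_hits, slot_hits, slot_total = 0, 0, 0
    for pred, gold in zip(pred_states, gold_states):
        joint_hits += int(pred == gold)
        for obj_id, attrs in gold.items():
            for attr, value in attrs.items():
                slot_total += 1
                slot_hits += int(pred.get(obj_id, {}).get(attr) == value)
    return joint_hits / len(gold_states), slot_hits / slot_total

gold = [{"obj1": {"size": "large"}, "obj2": {"color": "yellow", "material": "rubber"}}]
pred = [{"obj1": {"size": "large"}, "obj2": {"color": "yellow", "material": "metal"}}]
print(dot_metrics(pred, gold))  # (0.0, 0.666...)
```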
Video interval tracking.

Another aspect of dialogue systems that we want to diagnose is their ability to localize video segments in a multi-turn setting. Each question turn often focuses on different parts of the video as the dialogue extends over time, so it is important to measure how well a system can localize the right segments of the video from turn to turn. We define grounding, a metric measuring the attention scores of attention-based models on the video parts to which the questions or answers refer. A higher grounding score indicates higher confidence that the model grounds its reasoning in the video. Similar to DOT, we define a new learning task for video interval tracking (VIT), similar in nature to text-to-clip tasks (Anne Hendricks et al., 2017) (please refer to the supplementary material for more details).
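One simple way to realize such a grounding score is to measure how much attention mass falls inside the referenced interval; the sketch below assumes clip-level attention weights and a frame-range annotation, which are simplifications of the actual metric.

```python
def grounding_score(attention, gt_interval, clip_boundaries):
    """Fraction of attention mass on clips overlapping the ground-truth interval.
    attention: per-clip weights summing to 1; clip_boundaries: (start, end) per clip."""
    gt_start, gt_end = gt_interval
    mass = 0.0
    for weight, (start, end) in zip(attention, clip_boundaries):
        if start < gt_end and end > gt_start:   # clip overlaps the referenced interval
            mass += weight
    return mass

clips = [(0, 16), (16, 32), (32, 48), (48, 64)]
attn = [0.05, 0.60, 0.30, 0.05]
print(grounding_score(attn, gt_interval=(16, 48), clip_boundaries=clips))  # 0.9
```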
Conclusion

In this paper, we introduced DVD, a novel diagnostic benchmark to study the reasoning capabilities of video-grounded dialogue systems. We described the dataset generation process, provided baseline experiments, and defined new evaluation metrics to analyze model abilities and limitations. We believe the benchmark can lead to interesting insights for designing better dialogue systems capable of complex visual and linguistic reasoning.
References
Aishwarya Agrawal, Dhruv Batra, and Devi Parikh. 2016. Analyzing the behavior of visual question answering models. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1955–1960, Austin, Texas. Association for Computational Linguistics.
Huda Alamri, Vincent Cartillier, Abhishek Das, Jue Wang, Anoop Cherian, Irfan Essa, Dhruv Batra, Tim K. Marks, Chiori Hori, Peter Anderson, et al. 2019. Audio visual scene-aware dialog. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7558–7567.
Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. 2016. Neural module networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 39–48.
Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, and Bryan Russell. 2017. Localizing moments in video with natural language. In Proceedings of the IEEE International Conference on Computer Vision, pages 5803–5812.
Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pages 2425–2433.
Antoine Bordes, Y-Lan Boureau, and Jason Weston. 2017. Learning end-to-end goal-oriented dialog. In International Conference on Learning Representations. OpenReview.net.
Prithvijit Chattopadhyay, Deshraj Yadav, Viraj Prabhu, Arjun Chandrasekaran, Abhishek Das, Stefan Lee, Dhruv Batra, and Devi Parikh. 2017. Evaluating visual conversational agents via cooperative human-AI games. In Proceedings of the Fifth AAAI Conference on Human Computation and Crowdsourcing (HCOMP).
Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José M.F. Moura, Devi Parikh, and Dhruv Batra. 2017. Visual Dialog. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 326–335.
Harm De Vries, Florian Strub, Sarath Chandar, Olivier Pietquin, Hugo Larochelle, and Aaron Courville. 2017. GuessWhat?! Visual object discovery through multi-modal dialogue. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5503–5512.
Ali Farhadi, Mohsen Hejrati, Mohammad Amin Sadeghi, Peter Young, Cyrus Rashtchian, Julia Hockenmaier, and David Forsyth. 2010. Every picture tells a story: Generating sentences from images. In European Conference on Computer Vision, pages 15–29. Springer.
Haoyuan Gao, Junhua Mao, Jie Zhou, Zhiheng Huang, Lei Wang, and Wei Xu. 2015. Are you talking to a machine? Dataset and methods for multilingual image question. Advances in Neural Information Processing Systems, 28:2296–2304.
Rohit Girdhar and Deva Ramanan. 2020. CATER: A diagnostic dataset for compositional actions and temporal reasoning. In International Conference on Learning Representations.
Daniel Gordon, Aniruddha Kembhavi, Mohammad Rastegari, Joseph Redmon, Dieter Fox, and Ali Farhadi. 2018. IQA: Visual question answering in interactive environments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4089–4098.
Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. 2017. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6904–6913.
Kensho Hara, Hirokatsu Kataoka, and Yutaka Satoh. 2018. Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6546–6555.
Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.
C. Hori, H. Alamri, J. Wang, G. Wichern, T. Hori, A. Cherian, T. K. Marks, V. Cartillier, R. G. Lopes, A. Das, I. Essa, D. Batra, and D. Parikh. 2019. End-to-end audio visual scene-aware dialog using multimodal attention-based video features. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2352–2356.
Yunseok Jang, Yale Song, Youngjae Yu, Youngjin Kim, and Gunhee Kim. 2017. TGIF-QA: Toward spatio-temporal reasoning in visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2758–2766.
Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, and Ross Girshick. 2017. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2901–2910.
Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. 2017. The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950.
Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. 2014. ReferItGame: Referring to objects in photographs of natural scenes. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 787–798.
Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations.
Satwik Kottur, José M. F. Moura, Devi Parikh, Dhruv Batra, and Marcus Rohrbach. 2019. CLEVR-Dialog: A diagnostic dataset for multi-round reasoning in visual dialog. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 582–595, Minneapolis, Minnesota. Association for Computational Linguistics.
Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, et al. 2017. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1):32–73.
Hung Le, Doyen Sahoo, Nancy Chen, and Steven Hoi. 2019. Multimodal transformer networks for end-to-end video-grounded dialogue systems. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5612–5623, Florence, Italy. Association for Computational Linguistics.
Hung Le, Doyen Sahoo, Nancy Chen, and Steven C.H. Hoi. 2020. BiST: Bi-directional spatio-temporal reasoning for video-grounded dialogues. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1846–1859, Online. Association for Computational Linguistics.
Jie Lei, Licheng Yu, Mohit Bansal, and Tamara Berg. 2018. TVQA: Localized, compositional video question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1369–1379, Brussels, Belgium. Association for Computational Linguistics.
Zekang Li, Zongjia Li, Jinchao Zhang, Yang Feng, Cheng Niu, and Jie Zhou. 2020. Bridging text and video: A universal multimodal transformer for video-audio scene-aware dialog. DSTC Workshop @ AAAI.
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer.
Mateusz Malinowski and Mario Fritz. 2014. A multi-world approach to question answering about real-world scenes based on uncertain input. Advances in Neural Information Processing Systems, 27:1682–1690.
Nikola Mrkšić, Diarmuid Ó Séaghdha, Tsung-Hsien Wen, Blaise Thomson, and Steve Young. 2017. Neural belief tracker: Data-driven dialogue state tracking. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1777–1788, Vancouver, Canada. Association for Computational Linguistics.
Bryan A. Plummer, Liwei Wang, Chris M. Cervantes, Juan C. Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. 2015. Flickr30k Entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In Proceedings of the IEEE International Conference on Computer Vision, pages 2641–2649.
Jiaxin Qi, Yulei Niu, Jianqiang Huang, and Hanwang Zhang. 2020. Two causal principles for improving visual dialog. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10860–10869.
Anna Rohrbach, Marcus Rohrbach, Niket Tandon, and Bernt Schiele. 2015. A dataset for movie description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3202–3212.
Idan Schwartz, Seunghak Yu, Tamir Hazan, and Alexander G. Schwing. 2019. Factor graph attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2039–2048.
Paul Hongsuck Seo, Andreas Lehrmann, Bohyung Han, and Leonid Sigal. 2017. Visual reference resolution using attention memory for visual dialog. In Advances in Neural Information Processing Systems, pages 3719–3729.
Iulian V. Serban, Alessandro Sordoni, Yoshua Bengio, Aaron Courville, and Joelle Pineau. 2016. Building end-to-end dialogue systems using generative hierarchical neural network models. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI'16, pages 3776–3783. AAAI Press.
Xiaomeng Song, Yucheng Shi, Xin Chen, and Yahong Han. 2018. Explore multi-step reasoning in video question answering. In Proceedings of the 26th ACM International Conference on Multimedia, pages 239–247.
Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958.
Jesse Thomason, Michael Murray, Maya Cakmak, and Luke Zettlemoyer. 2019. Vision-and-dialog navigation. In Conference on Robot Learning (CoRL).
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems, 30:5998–6008.
Erik Wijmans, Samyak Datta, Oleksandr Maksymets, Abhishek Das, Georgia Gkioxari, Stefan Lee, Irfan Essa, Devi Parikh, and Dhruv Batra. 2019. Embodied question answering in photorealistic environments with point cloud perception. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Kexin Yi*, Chuang Gan*, Yunzhu Li, Pushmeet Kohli, Jiajun Wu, Antonio Torralba, and Joshua B. Tenenbaum. 2020. CLEVRER: Collision events for video representation and reasoning. In International Conference on Learning Representations.
Amir Zadeh, Michael Chan, Paul Pu Liang, Edmund Tong, and Louis-Philippe Morency. 2019. Social-IQ: A question answering benchmark for artificial social intelligence. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8807–8817.
Yuke Zhu, Oliver Groth, Michael Bernstein, and Li Fei-Fei. 2016. Visual7W: Grounded question answering in images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4995–5004.
Additional Information of DVD
A.1 Question and dialogue size
Figure 7:
Distribution of dialogues and questions:
Questions and dialogues in DVD are well distributed by program size and text length. The dotted line indicates the position of the overall average.
A.2 Answer distribution
A.3 Video distribution
Table 4:
Average size per question type:
Query-related question types such as "attr query" and "action query" tend to have smaller program sizes. Questions requiring comparison, such as "compare int" and "compare action", tend to have larger program sizes.
Figure 8:
Distribution of ground-truth answers:
We report the distribution of answer candidates per question type, including binary answers (first row), numerical answers (second row), and answers of object attributes and actions (third and last rows). In general, the answer options are well balanced to minimize the impact of answer-conditioned bias on model performance.

Figure 9:
Distribution of turn positions where video input is updated:
To simulate situated dialogues, we update the video input from the prior dialogue turn with an additional subsequent segment.
Figure 10:
Distribution of dialogue turn positions where video segments are involved:
Our approach to simulating situated dialogues results in more balanced usage of video segments throughout the entire dialogue. In DVD, earlier video segments tend to be involved in the earlier turns of the dialogue. Likewise, later video segments tend to be mentioned more in the later turns of the dialogue.

Figure 11:
Number of active visual objects per dialogue turn:
The average number of active objects ranges from 2.5 objects in the 1st turn to 5 objects in the 10th turn.