[PDF] DVD: A Diagnostic Dataset for Multi-step Reasoning in Video Grounded Dialogue

Abstract

Full PDF

DDVD: A Diagnostic Dataset for Multi-step Reasoningin Video Grounded Dialogue

Hung Le ‡§∗ , Chinnadhurai Sankar † , Seungwhan Moon † , Ahmad Beirami † ,Alborz Geramifard † , Satwik Kottur †† Facebook { chinnadhurai, shanemoon, beirami, alborzg, skottur } @fb.com ‡ Singapore Management University § Institute for Infocomm Research, A*STAR [email protected]

Abstract

A video-grounded dialogue system is requiredto understand both dialogue, which containssemantic dependencies from turn to turn, andvideo, which contains visual cues of spatialand temporal scene variations. Building suchdialogue systems is a challenging problem in-volving complex multimodal and temporal in-puts, and studying them independently is hardwith existing datasets. Existing benchmarksdo not have enough annotations to help ana-lyze dialogue systems and understand their lin-guistic and visual reasoning capability and lim-itations in isolation. These benchmarks arealso not explicitly designed to minimise biasesthat models can exploit without actual reason-ing. To address these limitations, in this paper,we present a diagnostic dataset that can test arange of reasoning abilities on videos and di-alogues. The dataset is designed to containminimal biases and has detailed annotationsfor the different types of reasoning each ques-tion requires, including cross-turn video inter-val tracking and dialogue object tracking. Weuse our dataset to analyze several dialogue sys-tem approaches, providing interesting insightsinto their abilities and limitations. In total, thedataset contains instances of -round dia-logues for each of ∼ k synthetic videos, re-sulting in more than k dialogues and M question-answer pairs. Our code and datasetwill be made public. Visual question answering (VQA) is a popular lineof research that aims to develop intelligent systemsthat can reason and answer questions about visualinformation. Earlier datasets have been introducedto study this problem, focusing on images as the vi-sual input (Antol et al., 2015; Gao et al., 2015; Ma-linowski and Fritz, 2014; Zhu et al., 2016) Recently, ∗ Work done when HL was a research intern at Facebook many question answering benchmarks have beenproposed to extend the visual information fromthe image to video domain (Jang et al., 2017; Leiet al., 2018; Zadeh et al., 2019). While image QAproblems require a system to learn cross-modalityinteraction, video QA problems go beyond andcapture visual information with temporal variance.Correctly answering questions about the content ofvideos requires different types of perceptual abili-ties, such as recognizing moving objects, includingtheir locations and actions.As an orthogonal extension from VQA problems,another line of research investigates image/videoQA in a dialogue setting (Das et al., 2017; Seoet al., 2017; De Vries et al., 2017; Chattopadhyayet al., 2017; Alamri et al., 2019). In this problem,questions about a given video or image are posi-tioned in a multi-turn dialogue. In each dialogueturn, a question usually exhibits different types ofcross-turn relations to other questions in prior dia-logue turns, such as object co-references and topicalignment. In this work, we investigate the problemof multi-turn video question answering (QA), orvideo-grounded dialogue.Numerous approaches to video-grounded dia-logue have showed remarkable performance inbuilding intelligent multimodal systems (Hori et al.,2019; Schwartz et al., 2019; Le et al., 2019; Liet al., 2020; Le et al., 2020). However, most ofthe methods exhibit marginal performance gain,and our ability to understand their limitations is im-peded by the complexity of the task. Existing video-grounded dialogue benchmarks are not designedwith enough information to determine whether cur-rent approaches are capable of sophisticated reason-ing and not just exploiting biases (Agrawal et al.,2016; Goyal et al., 2017; Qi et al., 2020). Multiplefactors can affect the performance of a system. Forinstance, a dialogue agent answers incorrectly ifit is unable to decode dialogue context and derive a r X i v : . [ c s . A I] J a n : until the end of the cube 's rotation , what types of actions does the big thing undertake the most ? A1 : flying Q2 : during the same time period , how many sliding objects are there ? A2 : 2 Q3 : among them , there is a ball . during the whole video , what type of action does it undertake second ? A3 : no action Q5 : during the red thing 's last slide , how many things are behind the earlier mentioned large object ? A5 : 2 Q6 : how about left of it ? A6 : 0 t1 t2 t3 t4 t0 TObject A: t1 t3 Object B: t2 t4 t5 t6t5 t6 slides flies rotates Q4 : how about up until now ? A4 : slidingrotatingsliding flying flyingsliding slidingslidingflyingflying slidingflying sliding flying EOV2EOV1 EOV2EOV2 V1 EOV2EOV1EOV1EOV1 EOV2EOV1 sliding sliding

Figure 1:

Example video-grounded dialogue in DVD:

We demonstrate sample questions that test aspects ofvisual and linguistic reasoning, such as action recognition, temporal reasoning, spatial reasoning, cross-turn videointerval tracking, and dialogue object tracking. Q i / A i : question/answer of turn i . EOV j : end of video input V j . contextualized objects, such as “the earlier men-tioned large object” and “it” in Questions and of Figure 1. Moreover, a video-grounded dia-logue does not just focus on a speciﬁc timestampor video segment. Instead, different video segmentsare mentioned from turn to turn and a system is re-quired to locate the right temporal information (SeeFigure 1).To address the limitations of existing bench-marks and analyze dialogue systems efﬁciently,we propose DVD , a D iagnostic Dataset for V ideo-grounded D ialogues . In total, from about k synthetic videos, we built a benchmark containing k dialogues, over M automatically-generatedquestion-answer pairs, of which more than k are unique. We built our benchmark on top of achallenging video dataset, CATER (Girdhar andRamanan, 2020). CATER dataset contains videosof multiple objects arranged in a 3D environmentwith high variance of object appearance, loca-tions, and actions. Moreover, the information ineach video is not affected by external informationsources, such as commonsense knowledge, makingit ideal for learning visual reasoning in dialogues.From scene graphs and object action annotationof CATER videos, we simulate questions and theirfunctional programs (Johnson et al., 2017) in amulti-turn setting. As can be seen in Figure 1,at each dialogue turn, a question is generated totest the model ability to perform different types of reasoning on videos, such as action recogni-tion and spatio-temporal reasoning. Across turns,questions are designed to be related to one anotherthrough different types of semantic relationships.Speciﬁcally, we proposed two sub-tasks relevantto video-grounded dialogue problems, video inter-val tracking ( VIT ) , and dialogue object tracking( DOT ) .The VIT task requires a dialogue system to iden-tify which video segment each dialogue turn isreferring to. In each turn, the question is gener-ated such that the corresponding video segment iseither independent or related to another segmentin prior dialogue turns. This task is an extensionof prior research in temporal localization throughtext, or text-to-clip (Anne Hendricks et al., 2017)but is designed in a multi-turn setting. DOT hasthe dialogue state tracking (DST) nature, whichrequires a system to track information slots in task-oriented dialogues (Mrkˇsi´c et al., 2017). While intask-oriented dialogues, the tracked slots are usedto create API queries to entity databases, trackedobjects in DOT is used to solve object referencesand locate the visual objects from videos.The DVD benchmark allows us to train and ana-lyze methods by their reasoning capabilities. Forinstance, we found that current dialogue modelsstruggle on question types requiring both video tem-poral and spatial localization. Existing approachesdo not explicitly track dialogue objects and video enchmarks Diagnosticbenchmark Visual reasoning Language reasoningSR TR DOT VIT Image/video QA, embodied QA

VQA (Antol et al., 2015), Visual7W (Zhu et al., 2016) (cid:55) (cid:51) (cid:55) (cid:55) (cid:55)

TGIF-QA (Jang et al., 2017), TV-QA (Lei et al., 2018) (cid:55) (cid:51) (cid:51) (cid:55) (cid:55)

IQA (Gordon et al., 2018), EQA (Wijmans et al., 2019) (cid:55) (cid:51) (cid:51) (cid:55) (cid:55)

Image/video grounded dialogues, navigation dialogues

VisDial (Das et al., 2017), GuessWhat (De Vries et al., 2017) (cid:55) (cid:51) (cid:55) (cid:51) (cid:55)

AVSD (Hori et al., 2019), CVDN (Thomason et al., 2019) (cid:55) (cid:51) (cid:51) (cid:51) (cid:51)

Synthetic image/video QA

SHAPE (Andreas et al., 2016), CLEVR (Johnson et al., 2017) (cid:51) (cid:51) (cid:55) (cid:55) (cid:55)

SVQA (Song et al., 2018), CLEVRER (Yi* et al., 2020) (cid:51) (cid:51) (cid:51) (cid:55) (cid:55)

Synthetic dialogues bAbI (Bordes et al., 2017) (cid:51) (cid:55) (cid:55) (cid:51) (cid:55)

MNIST Dialog (Seo et al., 2017), CLEVR-Dialog (Kottur et al., 2019) (cid:51) (cid:51) (cid:55) (cid:51) (cid:55)

DVD (Ours) (cid:51) (cid:51) (cid:51) (cid:51) (cid:51)

Table 1:

Comparison to related benchmarks:

Compared to existing datasets for vision-language understanding,DVD is the ﬁrst diagnostic benchmark designed for both spatial reasoning (SR) and temporal reasoning (TR) andexplicit requiring dialogue object tracking (DOT) and video interval tracking (VIT) in a multi-turn setting. segments properly and struggle to learn differentsemantic dependencies. These observations pointus to potential avenues for future research of video-grounded dialogue systems.

Table 1 presents a comparison between DVD andrelevant benchmarks. We position our dataset com-pared to existing datasets from four angles: visual-linguistic, 2) visually-grounded, 3) diagnostic, and4) multi-step reasoning.

1) Vision-linguistic.

Numerical benchmarks forvision-linguistic understanding has been intro-duced, including datasets for image and videocaptioning (Farhadi et al., 2010; Lin et al., 2014;Rohrbach et al., 2015), phrase grounding or visualobject reference (Kazemzadeh et al., 2014; Plum-mer et al., 2015), scene graph learning (Krishnaet al., 2017), and text-to-clip (Anne Hendricks et al.,2017). Our benchmark, DVD, is more related tobenchmarks for visual question answering in whicha visual input such as image or video is given anda system is required to answer natural-languagequestions about the visual input (Antol et al., 2015;Zhu et al., 2016; Jang et al., 2017; Lei et al., 2018).Another related line of research is the research ofnavigation systems in a dynamic environment (Gor-don et al., 2018; Wijmans et al., 2019). In thistask, systems are required to understand and followan instruction to navigate in a simulated physicalspace. Compared to the prior benchmarks, onemajor difference of DVD is the extension of single-turn interaction to a multi-turn setting.

2) Visually-grounded.

In this line of research, asystem is required to obtain information from agiven visual input to answer questions over mul-tiple rounds (De Vries et al., 2017; Das et al.,2017; Chattopadhyay et al., 2017; Hori et al., 2019;Thomason et al., 2019). To decode a question ofthe current turns, a system has to understand thedialogue context and resolve different semantic de-pendencies. However, due to the complexity ofthe tasks, involving cross-modality and cross-turninformation, prior benchmarks are often subjectto bias that models can just exploit without ac-tual reasoning (Qi et al., 2020). Speciﬁcally, in avideo-grounded dialogue, each question turn oftenfocuses on different parts of the video. One limi-tation of current benchmarks such as AVSD (Horiet al., 2019) is the lack of information to evaluatehow a model localizes relevant video intervals fromturn to turn. In this work, we introduce a diagnosticbenchmark that explicitly requires a system to trackdialogue objects and video intervals.

3) Diagnostic.

Seo et al. (2017); Kottur et al.(2019) used synthetic images to develop diagnos-tic image-grounded dialogues. Compared to thesebenchmarks, one major difference of DVD is theextension to the video domain and the injectionof diverse linguistic dependencies between turns.Speciﬁcally, in DVD, we incorporated 3 types ofrelationships: temporal relation, object reference,and topic transfer (See Section 3). This designchoice results in synthetic dialogues with richer lan-guage and higher linguistic variance in questions.As shown in Table 2, compared to similar bench- t0 TObject A: t1 t3

Object B: t2 t4 t5 t6t5 t6 slides flies rotates

Formulation of Video Intervals start end

Example spatial relations

Figure 2:

Left : examples of video intervals, deﬁned as continuous time period between two events. Atomicintervals are non-overlapping time periods and all remaining periods are compositional intervals.

Right : projectionof objects on the ground plane. Considering the “left” relationship, “A1 is left of B2” and “A2 is left of B4”. marks DVD contains a higher number of uniquequestions over a very large scale dataset. DVD isalso motivated by dialogue state tracking (DST)of task-oriented dialogue systems (Mrkˇsi´c et al.,2017; Bordes et al., 2017). In these systems, aDST model is required to tightly detect all domain-speciﬁc information slots in dialogues. Instead ofinformation slots, in DVD, for each dialogue turn,we introduce a dialogue object state, deﬁned as anydetected objects and their attributes mentioned inthe dialogue context. DVD is the ﬁrst diagnosticmultimodal dialogue benchmark that provides suchdetailed annotations of dialogue states.

4) Multi-step reasoning.

In multi-step reason-ing, a question is represented as sub-sequencescalled functional programs. Earlier efforts (An-dreas et al., 2016; Johnson et al., 2017) proposedto use synthetic images and design questions thatare expressed as elementary operation programs.More related to our work, Song et al. (2018); Yi*et al. (2020) extended the prior benchmarks to thevideo domain with questions focusing on the tem-poral variance of video frames. A major differencebetween our work and these benchmarks is the ex-tension of functional programs to a dialogue taskwith dialogue context-based operations, such asobject tracking and interval tracking. This exten-sion brings a step toward more transparent dialoguesystems capable of performing complex reasoning.

Our benchmark provides a dataset that can be usedto conduct rich diagnostics to better understandthe reasoning capabilities of dialogue systems. Weutilize CATER, a challenging video action recogni-tion benchmark (Girdhar and Ramanan, 2020), toautomatically generate dialogues. The videos haveassociated ground-truth object attributes, locations,and action time intervals. The generated dialogues contain questions that have an associated functionalprogram structures. In this benchmark, we focuson several types of visual and linguistic operations,such as temporal localization, object tracking, andspatio-temporal reasoning. Figure 2, 3, and 4 givea brief overview of the main components of thebenchmark, which we describe in detail below.

Objects.

Objects are identiﬁed by their attributes,including object shapes (cube, sphere, cylinder,cone, and snitch), sizes (small, large, and medium),materials (rubber and metal), and colors. Oneunique characteristic of CATER objects is that eachobject can move multiple times in a single video.We deﬁne types of object actions: “ﬂying”, “ro-tating”, “sliding”, and “no action” (stationary). Video intervals.

Video intervals are continuousvideo frames that can be deﬁned by two events,each of which can be the start or end of an object’saction or the start or end of the whole video. Weformulate two types of video intervals:

Atomic intervals.

Atomic intervals are intervalsthat each object can have at most one action andthey can be in only one of the two states: in mo-tion or stationary. To ﬁnd atomic intervals, wesimply collate the start and end time of all objectactions in a video and sort them chronologically.By deﬁnition, all non-overlapping time intervalsare considered atomic. For example, in Figure 2, alltime ranges ( t , t ) , ( t , t ) , ..., ( t , T ) ) are atomic.This constraint allows us to identify the relative spa-tial relationships (‘left”, “right”, “behind”, and “infront”) between any two objects during an atomicinterval by using their coordinates at the start andend of the interval. Note that we can apply this rulesince all object actions in the CATER universe areprojected as a straight line (“ﬂying”,“sliding”) or asingle point (“rotating”, “no action”) on the groundplane. Practically, we decide to focus on spatial uery Action