An Empirical Study on the Generalization Power of Neural Representations Learned via Visual Guessing Games
Alessandro Suglia, Yonatan Bisk, Ioannis Konstas, Antonio Vergari, Emanuele Bastianelli, Andrea Vanzo, Oliver Lemon
Heriot-Watt University, Edinburgh, UK
Carnegie Mellon University, Pittsburgh, USA
University of California, Los Angeles, USA
{as247,i.konstas,e.bastianelli,a.vanzo,o.lemon}@hw.ac.uk
[email protected], [email protected]

Abstract
Guessing games are a prototypical instance of the "learning by interacting" paradigm. This work investigates how well an artificial agent can benefit from playing guessing games when later asked to perform on novel NLP downstream tasks such as Visual Question Answering (VQA). We propose two ways to exploit playing guessing games: 1) a supervised learning scenario in which the agent learns to mimic successful guessing games and 2) a novel way for an agent to play by itself, called Self-play via Iterated Experience Learning (SPIEL). We evaluate the ability of both procedures to generalise: an in-domain evaluation shows an increased accuracy (+7.) compared with competitors on the evaluation suite CompGuessWhat?!; a transfer evaluation shows improved performance for VQA on the TDIUC dataset in terms of harmonic average accuracy (+5.) thanks to more fine-grained object representations learned via SPIEL.

1 Introduction

Learning a language requires interacting with both the environment and other agents (Bisk et al., 2020). Language games represent one common example of this (Wittgenstein et al., 1953), as seen by the important role of play in L1 child language acquisition (Hainey et al., 2016) as well as for L2 learners (Godwin-Jones, 2014).

Among the language games defined in the literature (Steels, 2015), guessing games represent the first step in a curriculum for language learning. For example, in GuessWhat?! (de Vries et al., 2017), two agents interact with each other: a Questioner generates questions aimed at finding a hidden object in the scene, and an Oracle, aware of the target object, answers the questions, supporting the Questioner in playing the game. Different from other language games (Das et al., 2017), guessing games have a specific goal, which represents a clear incentive for learning. In addition, they require that the Questioner masters both natural language generation and understanding, with a focus on object categories and attributes. For humans, concepts learned in this way are generic and generalisable to new tasks and domains where grounded reasoning is important (Hampton, 1979). However, how well can AI agents generalise with concepts acquired from visual guessing games?
The literature has not explored whether representations built from self-play are transferable, focusing instead on large-scale self-supervised learning. For instance, large-scale image captioning datasets have been used to train multi-modal Transformers (Lu et al., 2019; Li et al., 2019; Tan and Bansal, 2019; Chen et al., 2019). Multi-task learning (Lu et al., 2020) has been used to leverage the diversity of training signals provided by combining datasets, but only for discriminative tasks. While some dialogue work (Cogswell et al., 2020) aims to bootstrap a conversing agent from VQA datasets, most work on GuessWhat?! (de Vries et al., 2017; Shekhar et al., 2019; Strub et al., 2017) has designed bespoke models for the task, ignoring the utility of this dataset for other Vision+Language tasks.

We propose self-play as a mechanism for learning general grounded representations. We seed our approach with the
GuessWhat?! corpus of questions and objects, and demonstrate how to generalise to other downstream tasks. We propose two different strategies to exploit these data. First, a supervised learning phase is undertaken to learn a Questioner and an Oracle model able to play guessing games. Second, the trained agents can be used to play guessing games on images, requiring only object annotations as supervision. We show that an agent trained on GuessWhat?! dialogues can use self-play to adapt to new and harder tasks. Specifically, we investigate models' generalisation performance and the quality of the learned representations on the CompGuessWhat?! benchmark (Suglia et al., 2020), a more extensive evaluation suite for GuessWhat?!. Furthermore, we study how the learned representations help solve VQA on the TDIUC dataset (Kafle and Kanan, 2017). We show overall comparable performance with state-of-the-art models and improvements for specific question types that require object attribute information to be answered correctly.

2 Method

Our proposed transfer/fine-tuning procedure requires a training set of guessing games D_g from which we learn a Questioner Q and an Oracle O via supervised learning. Given a set of images I, it is possible to use the trained models Q and O to run the self-play procedure for n epochs, obtaining the model Q_n. Finally, given a downstream task t and an associated dataset D_t based on images from I, we use Q_n's parameters as initialisation for the training procedure on D_t.

To apply this procedure, both the Questioner and the Oracle require a multi-modal encoder Γ able to generate d-dimensional representations for the textual tokens h_t and for the objects h_o, as well as to fuse the visual and textual modalities in a representation of the current context h_c. After the self-play procedure, only the encoder Γ of the model Q_n is used in the fine-tuning process on the downstream task t using the dataset D_t. It is important to underline that the presented self-play procedure does not depend on a specific implementation of the multi-modal encoder Γ. A possible implementation is presented in Section 2.4 and is used in the experimental evaluation of this paper.
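In code, the overall procedure can be summarised as follows. This is a minimal sketch of the three-stage design described above; the helper callables (train_supervised, spiel, fine_tune) and the shared_encoder attribute are hypothetical names chosen for illustration, not the authors' released code.

    def transfer_via_guessing_games(D_g, images, D_t, n_epochs,
                                    train_supervised, spiel, fine_tune):
        """D_g: gold guessing games; images: the set I with object boxes;
        D_t: downstream dataset (e.g. TDIUC); the three callables implement
        supervised learning, self-play, and downstream fine-tuning."""
        Q, O = train_supervised(D_g)           # 1) learn Questioner and Oracle
        Q_n = spiel(Q, O, images, n_epochs)    # 2) self-play for n epochs
        encoder = Q_n.shared_encoder           # the multi-modal encoder Γ
        return fine_tune(encoder, D_t)         # 3) initialise downstream model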
2.1 Oracle

The Oracle task is cast as a Visual Question Answering (VQA) task conditioned on the image I, the current question q, and the target object ô. We follow common practice in vocabulary-based VQA (Antol et al., 2015) and treat the problem as a multi-class classification task over the classes {Yes, No, N/A}. We use h_c as input to a multi-layer feedforward neural network to obtain a probability distribution over the label set.

2.2 Questioner

The Questioner must play two roles: question generation and target object prediction (de Vries et al., 2017). It is beneficial to jointly learn the two tasks because the representations learned by each task are complementary. In addition, they better encode attributes, which favours better generalisation to unseen object categories (Suglia et al., 2020).

To solve the two specific tasks in a multi-task fashion, we design two different heads on top of the shared encoder Γ: 1) the guesser head, which produces a probability distribution over every object o_i using the encoded representations h_{o_i} passed through an MLP; 2) the generator head, a multi-modal decoder, also implemented as an MLP, which predicts a probability distribution over the vocabulary V given the context representation generated by Γ. We include two losses in our model: 1) the negative log-likelihood of the probability associated by the guesser head with the target object ô (Shekhar et al., 2019); 2) a sequence-to-sequence cross-entropy loss (Sutskever et al., 2014) for the generated question tokens (see the sketch at the end of this subsection). Unlike previous work that trains a separate module to learn to stop (Shekhar et al., 2018), we add a special token [STOP] to the input data, so that the model learns when to stop more efficiently, as part of the question generation task.

Training an agent to solve tasks of different complexity and size is challenging. The procedure presented in (Shekhar et al., 2019) alternates between tasks, updating the hardest task more often. For this technique, finding the right schedule is cumbersome and requires fine-tuning. We rely on a more systematic training procedure based on random dataset-proportional batch sampling, inspired by (Sanh et al., 2019). This is a hard-parameter-sharing multi-task training procedure that avoids interference between tasks and favours more stable training, which mitigates catastrophic forgetting (French, 1999).
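The two heads and the joint loss can be sketched as follows; this is a simplified PyTorch rendering of the design described above, with layer sizes and names chosen for illustration rather than taken from the released implementation.

    import torch.nn as nn
    import torch.nn.functional as F

    class QuestionerHeads(nn.Module):
        """Guesser and generator heads on top of the shared encoder Γ."""
        def __init__(self, d_model: int, vocab_size: int):
            super().__init__()
            self.guesser = nn.Sequential(       # scores each object o_i
                nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, 1))
            self.generator = nn.Sequential(     # predicts tokens over V
                nn.Linear(d_model, d_model), nn.ReLU(),
                nn.Linear(d_model, vocab_size))

        def forward(self, h_objects, h_context):
            # h_objects: (batch, num_objects, d); h_context: (batch, T, d)
            obj_logits = self.guesser(h_objects).squeeze(-1)
            tok_logits = self.generator(h_context)
            return obj_logits, tok_logits

    def multitask_loss(obj_logits, target_obj, tok_logits, target_toks, pad_id):
        # 1) negative log-likelihood of the target object ô (guesser head)
        guess_loss = F.cross_entropy(obj_logits, target_obj)
        # 2) seq2seq cross-entropy over question tokens, including [STOP]
        gen_loss = F.cross_entropy(tok_logits.flatten(0, 1),
                                   target_toks.flatten(), ignore_index=pad_id)
        return guess_loss + gen_loss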
2.3 Self-Play via Iterated Experience Learning (SPIEL)

Inspired by iterated learning (Kirby et al., 2014), we design a process by which the Questioner learns from games previously generated by other instances of the Questioner agent. We call our training procedure Self-play via Iterated Experience Learning (SPIEL).

In SPIEL, described in Algorithm 1, we assume access to a set of images I and the bounding boxes O_I of the objects therein (object annotations are intended as either gold bounding boxes or bounding boxes predicted by an object detector).

Figure 1: We use the single-stream VLP (Unified Encoder-Decoder for Vision-Language Pre-training) model as a backbone multi-modal encoder for our task. The visual feature tokens (marked in red) are the FastRCNN features associated with the objects in the image; the history tokens (marked in blue) and the tokens to be generated (marked in yellow) are given as input to the model. A Guesser head uses the learned contextual object representations to generate a probability distribution over the objects P(o_i | h_{o_i}), whereas the Generator head is used to incrementally predict the masked tokens.
Algorithm 1 SPIEL: Self-Play via Iterated Experience Learning

    procedure SELFPLAY(Q, O, I, n)
        D_q ← READGOLDGAMES()
        E_g ← []                              ▷ initialise the experience buffer
        for e ← 1 .. n do
            ▷ Interactive phase
            Q ← Q_e                           ▷ load the latest weights
            G_e ← GENERATEGAMES(I)
            G_e ← PLAYGAMES(Q, O, G_e)
            APPEND(E_g, G_e)
            ▷ Transmission phase
            D_g^e ← []
            for i ← 1 .. LEN(E_g) do          ▷ priority to the latest games
                g ← E_g[i]
                if ISVALIDGAME(g) then
                    APPEND(D_g^e, g)
                if LEN(D_g^e) == LEN(D_q) then
                    break
            ▷ Learning phase
            Q_{e+1} ← TRAIN(Q, D_q, D_g^e)

In every gameplay, there is a Questioner Q and an Oracle O, initialised, respectively, with the Questioner and Oracle agents trained with Supervised Learning using gold successful dialogues. We consider every iteration e of the algorithm as a self-play epoch. In a single self-play epoch, we alternate three phases:

Interactive phase: the agents play guessing games with novel combinations of image and target object. A generated dialogue is successful if the predicted target object is equal to the target object. Every played dialogue is stored in an experience buffer E_g.

Transmission phase: in this phase, the datasets for the Questioner's multi-task learning procedure are created. The generator head dataset D_q is fixed in advance, while the dataset for the guesser head D_g^e is created from the experience buffer E_g by selecting the unique and valid dialogues. The Oracle is fixed during this learning procedure.
Learning phase: the same multi-task learning procedure used in the supervised learning phase is used to fine-tune the Questioner parameters using the datasets D_g^e and D_q collected for the current epoch e.

This procedure is repeated n times, or until a halting condition is reached (e.g., early stopping based on a validation metric). See Appendix A.1 for implementation details. At the end of the SPIEL procedure, we obtain the model Q_n, whose parameters can be reused in other tasks. In particular, we use the parameters of Q_n's shared encoder Γ as initialisation for the fine-tuning on the downstream task t using the dataset D_t.
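The SPIEL loop of Algorithm 1 can be rendered in Python roughly as follows. This is a sketch, not the released code: play_game, train, and the dialogue objects are assumed interfaces, and the uniqueness/validity checks follow the rules detailed in Appendix A.1.2.

    import hashlib
    import random

    def spiel(Q, O, images, D_q, n_epochs, play_game, train):
        # images: iterable of (image, object_boxes) pairs, i.e. I with O_I.
        experience = []                          # the buffer E_g
        for epoch in range(n_epochs):
            # Interactive phase: novel (image, target object) combinations.
            games = [(img, random.choice(boxes)) for img, boxes in images]
            # Prepend so that the latest games have priority in the buffer.
            experience = [play_game(Q, O, g) for g in games] + experience
            # Transmission phase: unique, valid dialogues, capped to |D_q|.
            seen, D_g = set(), []
            for d in experience:
                key = hashlib.sha256(" ".join(d.tokens).encode()).hexdigest()
                if key not in seen and d.is_valid():   # no repeated questions
                    seen.add(key)
                    D_g.append(d)
                if len(D_g) == len(D_q):
                    break
            # Learning phase: multi-task update of the Questioner only.
            Q = train(Q, D_q, D_g)
        return Q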
2.4 Multi-modal encoder

We implement the shared multi-modal encoder Γ using VLP (Zhou et al., 2020), a single-stream multi-modal Transformer for captioning, depicted in Figure 1. During the GuessWhat?! fine-tuning, we extend VLP by including the dialogue context in the input, together with the features associated with the objects in the image. We learn two new segment ids to represent the question/answer exchanges in the dialogue, as described in (Wolf et al., 2019). The question is generated by incrementally replacing [MASK] tokens until the end of sequence is generated. See Appendix A.2 for more details.

SPIEL training is run on a set of images I from the GuessWhat?! and TDIUC datasets with corresponding object annotations. We make sure that GuessWhat?! test images are not contained in I. This is not an issue for TDIUC test images, because the downstream task annotations (QA pairs) are not used by the model during this phase. Once the model has been trained with SPIEL, we use the parameters of the shared encoder Γ as a backbone for a VQA model that is fine-tuned on the TDIUC dataset.

3 Experimental Evaluation

To assess the generality of our learned representations, we include two evaluation paradigms: 1) in-domain evaluation and 2) transfer evaluation. We evaluate several variants of our model: 1) VLP-SL: a VLP-based model trained on GuessWhat?! data using multi-task learning; 2) SPIEL-gs: the VLP-SL model fine-tuned with our SPIEL procedure, where the generator head uses only gold successful games (gs); 3) SPIEL-gm: the same as 2), but both successful and failed gold games (gm) are used by the generator head. In both SPIEL variants, the guesser head is trained using both failed and successful generated games, because it is important for the guesser head to be exposed to both types of signal to learn a more robust policy. We investigate the two variants SPIEL-gs and SPIEL-gm to get more insight into the effect that successful and failed games have on the generator head's ability to produce effective dialogues.
3.1 In-domain evaluation

We use the CompGuessWhat?! evaluation suite (Suglia et al., 2020) to assess the ability of the Questioner to play guessing games and to learn visually grounded representations in the process. It complements an evaluation based only on gameplay accuracy (de Vries et al., 2017) with two auxiliary tasks on the target object: 1) attribute prediction, expressed in terms of abstract attributes (A), situated attributes (SO), abstract+situated attributes (AS), and location attributes (L); 2) zero-shot gameplay, with near-domain accuracy (ND) and out-of-domain accuracy (OD).

Table 1 shows the comparison with previous state-of-the-art models on this benchmark, such as de Vries et al. (2017) (DV-*) and Shekhar et al. (2019) (GDSE-*). VLP-SL has a greater advantage in terms of representation power compared to previous models. This is reflected in all the tasks of the CompGuessWhat?! evaluation. In particular, we see better performance even for the zero-shot gameplay (ND: +5., OD: +15.). This is because VLP associates with every object a vector of probabilities that represents a distribution over the VisualGenome object classes. This helps VLP cope with the issue of unseen objects and helps the model generalise. Learning to play is key to gameplay performance, leading to an increase of +4. over VLP-SL and +7. over GDSE-CL. In this setup, the difference between the versions SPIEL-gs and SPIEL-gm is very minimal (.). However, when analysed in more detail, we can see that training the Questioner with gold successful data only improves attribute prediction, while using mixed data improves overall generalisation in the zero-shot evaluation.

Table 1: F1 scores for attribute prediction and accuracies for zero-shot evaluation on CompGuessWhat?!. Models compared: Random, DV-SL, DV-RL, GDSE-SL, GDSE-CL, VLP-SL, SPIEL-gs, and SPIEL-gm; metrics: gameplay accuracy (Acc.), attribute prediction (A, SO, AS, L), and zero-shot gameplay (ND, OD).

3.2 Transfer evaluation

For the transfer evaluation, we use the VQA dataset TDIUC (Kafle and Kanan, 2017). It provides a finer-grained way to assess the quality of the representations learned by our guessing-game transfer technique in terms of several question types, including object categories and their attributes. Specifically, we were interested in improving on the following question types: 1) positional reasoning; 2) counting; 3) object presence; 4) utility/affordances; 5) attribute; 6) color; and 7) object recognition. TDIUC is evaluated using the arithmetic mean accuracy per question type (A-MPT), as well as the harmonic mean (H-MPT), which better captures the skewed question-type distribution. In Table 2, we report a comparison between
variants trained on guessing games data (VLP-SL and SPIEL-*), the original model VLP trained on Conceptual Captions (VLP+CC), and other state-of-the-art models specifically designed for the VQA task, such as MUREL (Cadene et al., 2019), RAU (Noh and Han, 2016), NMN (Andreas et al., 2016), and MCB-* (Fukui et al., 2016). The full set of results is available in the Appendix, Table 4. Among them, MUREL achieves the best scores across the board, due to a custom iterative reasoning mechanism and a non-linear fusion module. However, all our models have a more balanced overall performance, which results in better harmonic means (H-MPT, +5 points over MUREL).
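The two TDIUC aggregates differ in how they treat weak question types; the toy example below (with made-up accuracies, purely for illustration) shows why a balanced model can match a skewed one on the arithmetic mean A-MPT while clearly winning on the harmonic mean H-MPT.

    def a_mpt(accs):      # arithmetic mean per question type
        return sum(accs) / len(accs)

    def h_mpt(accs):      # harmonic mean per question type
        return len(accs) / sum(1.0 / a for a in accs)

    skewed = [95.0, 90.0, 10.0]      # one very weak question type
    balanced = [70.0, 65.0, 60.0]
    print(a_mpt(skewed), h_mpt(skewed))        # 65.0  ~24.7
    print(a_mpt(balanced), h_mpt(balanced))    # 65.0  ~64.7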
Specifically, this improvement is favoured by an increase in accuracy on the Utility/Affordances question type (+20.). As shown by the attribute prediction in CompGuessWhat?! and depicted in Figure 2 (c), our models learn better representations than competitors, specifically for abstract attributes, among which are object affordances. In particular, we can see how the model is able to understand that certain objects can contain things (e.g., "the one with the soup in it?"), that objects have specific functions (e.g., "are the contents of the plate edible?"), and that they have specific properties (e.g., "a spoon is made of wood").

Figure 2: We show the ability of the model to play guessing games with the bowl as target object (highlighted in red). Given the generated dialogue, we use the probing classifier trained for CompGuessWhat?! to predict the bowl's attributes. Predictions on TDIUC questions associated with the current image are reported as well. Generated dialogue: is it food? no / is it a spoon? no / is it a cup? no / is it a bowl? yes / left picture? yes / the one on the soup? no / the one with the soup in it? yes. TDIUC predictions: "what is the spoon made of?" (GOLD: wood; VLP+CC: plastic; VLP+SP+gm: wood); "what is the water glass made of?" (GOLD: glass; VLP+CC: plastic; VLP+SP+gm: glass); "Are the contents of the plate edible?" (GOLD: yes; VLP+CC: beer; VLP+SP+gm: yes). (c) Situated attributes predicted with highest confidence: home, bowl_used_to_scoop, kitchen_utentils, bowl_can_be_carried, center.

The effectiveness of the proposed fine-tuning procedure is confirmed by the improved performance across all the question types compared to our baseline VLP+CC. Models such as MUREL and MCB-*, equipped with specific VQA modules, have an advantage on specific question types (e.g., positional reasoning) compared to VLP, which relies only on BERT self-attention layers (Devlin et al., 2019). In addition, when comparing the two SPIEL variants, a trend similar to the in-domain evaluation can be observed. In particular, SPIEL-gm benefits from being exposed to more language data, coming from both successful and failed guessing games.

Table 2: Results for the transfer evaluation on TDIUC. The models are divided into two categories: (top) models specifically designed for VQA and (bottom) our VLP-based implementations. We report only the question types that we believe benefit from the guessing-games fine-tuning procedure; for the full set of results, please refer to Appendix, Table 4.

Model      Position  Count  Presence  Afford.  Attr.  Color  Recog.  A-MPT  H-MPT
RAU        35.26     48.43  94.38     31.58    56.49  66.86  86.11   67.81  59.00
NMN        27.92     49.21  92.50     25.15    47.66  54.91  82.02   62.59  51.87
MCB-A      55.40     51.01  93.64     35.09    56.72  68.54  85.54   67.90  –
MCB        33.34     50.29  91.84     33.92    53.24  56.93  84.63   65.75  58.03
MUREL      41.19     61.78  95.75     21.43    58.19  74.43  89.41   71.20  59.30
VLP+CC     36.93     55.28  94.65     30.99    55.42  67.33  85.76   68.81  60.14
VLP-SL     39.04     57.61  94.79     42.11    54.29  69.01  86.07   70.53  63.95
SPIEL-gs   40.94     57.53  94.76     36.26    56.87  69.20  86.33   70.89  64.31
SPIEL-gm   –         –      –         –        –      –      –       –      –
4 Conclusion

In this work, we verified that representations learned while playing guessing games can be transferred to other downstream tasks such as VQA. We presented two ways of learning from guessing games data, namely multi-task learning and SPIEL. Models using SPIEL performed better both on the in-domain evaluation on CompGuessWhat?! and on the transfer task TDIUC. Our self-play procedure was able to learn useful and finer-grained object representations such as object affordances, thus demonstrating that learning to guess helps learning to ground.

The current study showed how we can apply the SPIEL training procedure to a VQA dataset such as TDIUC. We believe that this work can be extended to other datasets, because the SPIEL procedure only requires a set of images and associated object bounding boxes. These could be either gold or generated by a trained object detector, therefore classifying guessing games as a holistic self-training procedure for multi-modal datasets.

References
Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. 2016. Neural module networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 39–48.
Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pages 2425–2433.
Yonatan Bisk, Ari Holtzman, Jesse Thomason, Jacob Andreas, Yoshua Bengio, Joyce Chai, Mirella Lapata, Angeliki Lazaridou, Jonathan May, Aleksandr Nisnevich, et al. 2020. Experience grounds language. arXiv preprint arXiv:2004.10151.
Remi Cadene, Hedi Ben-Younes, Matthieu Cord, and Nicolas Thome. 2019. MUREL: Multimodal relational reasoning for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1989–1998.
Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. 2019. UNITER: Learning universal image-text representations. arXiv preprint arXiv:1909.11740.
Michael Cogswell, Jiasen Lu, Rishabh Jain, Stefan Lee, Devi Parikh, and Dhruv Batra. 2020. Dialog without dialog data: Learning visual dialog agents from VQA data. arXiv preprint arXiv:2007.12750.
Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José MF Moura, Devi Parikh, and Dhruv Batra. 2017. Visual dialog. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 326–335.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.
Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2019. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems, pages 13063–13075.
Robert M French. 1999. Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences, 3(4):128–135.
Akira Fukui, Dong Huk Park, Daylen Yang, Anna Rohrbach, Trevor Darrell, and Marcus Rohrbach. 2016. Multimodal compact bilinear pooling for visual question answering and visual grounding. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 457–468.
Robert Godwin-Jones. 2014. Games in language learning: Opportunities and challenges. Language Learning & Technology, 18(2):9–19.
Thomas Hainey, Thomas M Connolly, Elizabeth A Boyle, Amanda Wilson, and Aisya Razak. 2016. A systematic literature review of games-based learning empirical evidence in primary education. Computers & Education, 102:202–223.
James A Hampton. 1979. Polymorphous concepts in semantic memory. Journal of Verbal Learning and Verbal Behavior, 18(4):441–461.
Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2019. The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751.
Kushal Kafle and Christopher Kanan. 2017. An analysis of visual question answering algorithms. In Proceedings of the IEEE International Conference on Computer Vision, pages 1965–1973.
Simon Kirby, Tom Griffiths, and Kenny Smith. 2014. Iterated learning and the evolution of language. Current Opinion in Neurobiology, 28:108–114.
Jason Lee, Kyunghyun Cho, and Douwe Kiela. 2019. Countering language drift via visual grounding. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4376–4386.
Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. 2019. VisualBERT: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557.
Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Advances in Neural Information Processing Systems, pages 13–23.
Jiasen Lu, Vedanuj Goswami, Marcus Rohrbach, Devi Parikh, and Stefan Lee. 2020. 12-in-1: Multi-task vision and language representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10437–10446.
Hyeonwoo Noh and Bohyung Han. 2016. Training recurrent answering units with joint loss minimization for VQA. arXiv preprint arXiv:1606.03647.
Victor Sanh, Thomas Wolf, and Sebastian Ruder. 2019. A hierarchical multi-task approach for learning embeddings from semantic tasks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6949–6956.
Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. 2015. Prioritized experience replay. arXiv preprint arXiv:1511.05952.
Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. 2018. Conceptual Captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565.
Ravi Shekhar, Tim Baumgärtner, Aashish Venkatesh, Elia Bruni, Raffaella Bernardi, and Raquel Fernández. 2018. Ask no more: Deciding when to guess in referential visual dialogue. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1218–1233.
Ravi Shekhar, Aashish Venkatesh, Tim Baumgärtner, Elia Bruni, Barbara Plank, Raffaella Bernardi, and Raquel Fernández. 2019. Beyond task success: A closer look at jointly learning to see, ask, and guess-what. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2578–2587.
Luc Steels. 2015. The Talking Heads Experiment: Origins of Words and Meanings, volume 1. Language Science Press.
Florian Strub, Harm De Vries, Jeremie Mary, Bilal Piot, Aaron Courville, and Olivier Pietquin. 2017. End-to-end optimization of goal-driven and visually grounded dialogue systems. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, pages 2765–2771.
Alessandro Suglia, Ioannis Konstas, Andrea Vanzo, Emanuele Bastianelli, Desmond Elliott, Stella Frank, and Oliver Lemon. 2020. CompGuessWhat?!: A multi-task evaluation framework for grounded language learning. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7625–7641, Online. Association for Computational Linguistics.
Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.
Hao Tan and Mohit Bansal. 2019. LXMERT: Learning cross-modality encoder representations from transformers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5103–5114.
Harm de Vries, Florian Strub, Sarath Chandar, Olivier Pietquin, Hugo Larochelle, and Aaron C. Courville. 2017. GuessWhat?! Visual object discovery through multi-modal dialogue. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4466–4475. IEEE Computer Society.
Ludwig Wittgenstein, Gertrude Elizabeth Margaret Anscombe, and Rush Rhees. 1953. Philosophische Untersuchungen (Philosophical Investigations).
Thomas Wolf, Victor Sanh, Julien Chaumond, and Clement Delangue. 2019. TransferTransfo: A transfer learning approach for neural network based conversational agents. arXiv preprint arXiv:1901.08149.
Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
Luowei Zhou, Hamid Palangi, Lei Zhang, Houdong Hu, Jason J Corso, and Jianfeng Gao. 2020. Unified vision-language pre-training for image captioning and VQA. In AAAI, pages 13041–13049.
Bohan Zhuang, Qi Wu, Chunhua Shen, Ian Reid, and Anton Van Den Hengel. 2018. Parallel attention: A unified framework for visual object discovery through dialogs and queries. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4252–4261.
A Appendices
A.1 Self-Play via Iterated Experience Learning (SPIEL)

Learning to replicate gold dialogues is not enough to play successfully. High performance in gameplay can be achieved only when the agents start playing the game and are exposed to their own mistakes. Reinforcement Learning (Strub et al., 2017) and Collaborative Learning (Shekhar et al., 2019) are possible approaches to tackle this problem.

Inspired by iterated learning (Kirby et al., 2014), we design a process by which "the gameplay arises in one instance of the questioner through induction on the basis of observations of gameplay in other questioner agents who acquired that gameplay capability in the same way". Therefore, we call our procedure
Self-play via Iterated Experience Learning (SPIEL).

In this setup, we assume we have access to a set of images I, and for each image I we have object bounding boxes O_I. The SPIEL training procedure, shown in Algorithm 1, can be described as follows. We assume that there is a Questioner agent Q and an Oracle agent O. At the beginning of the procedure, they are initialised with agents trained with Supervised Learning using gold successful dialogues. We consider every iteration e of the algorithm as a self-play epoch. In a single self-play epoch, we alternate three phases: 1) interactive phase: the agents play guessing games with novel combinations of image and target object; 2) transmission phase: the Questioner creates new datasets from the dialogues generated over the epochs; 3) learning phase: multi-task learning is used to fine-tune the Questioner parameters using the datasets collected for the current epoch.

A.1.1 Interactive phase
We start the interactive phase by first sampling a set of reference games G_e, which consists of pairs (I, ô) where I ∈ I and ô is a target object sampled at random from the object annotations O_I. The agents Q_e and O play the games G_e and accumulate the generated experiences. During this phase, the Questioner agent uses the most recent weights, generated at epoch e − 1. It generates questions by nucleus sampling (Holtzman et al., 2019) from the probability distribution over the vocabulary learned by the generator head. When the [STOP] token is sampled, the guesser head, conditioned on the dialogue generated so far, selects the object õ with the highest probability. A game is successful if the predicted object õ is equal to the target object ô.
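For reference, nucleus (top-p) sampling keeps only the smallest set of highest-probability tokens whose cumulative mass reaches p, and renormalises before sampling. A generic PyTorch sketch follows; it is not the authors' decoding code, and the 0.9 default is illustrative.

    import torch

    def nucleus_sample(logits: torch.Tensor, p: float = 0.9) -> int:
        """Sample one token id from 1-D logits with top-p filtering."""
        probs = torch.softmax(logits, dim=-1)
        sorted_probs, sorted_ids = torch.sort(probs, descending=True)
        cumulative = torch.cumsum(sorted_probs, dim=-1)
        # Smallest prefix whose cumulative mass reaches p (always >= 1 token).
        cutoff = int(torch.searchsorted(cumulative, p).item()) + 1
        kept = sorted_probs[:cutoff] / sorted_probs[:cutoff].sum()
        choice = torch.multinomial(kept, num_samples=1)
        return int(sorted_ids[choice].item())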
A.1.2 Transmission phase

For every epoch e, in the transmission phase, we create the datasets D_q and D_g^e for the questioner and guesser heads, respectively, used in the learning phase for the Questioner parameter update. The Oracle is fixed during this learning procedure.

Questioner experience buffer. To make sure that the Questioner does not experience language drift (Lee et al., 2019), we consider a fixed dataset D_q composed of dialogues generated by humans, contained in the GuessWhat?! training data. The shared encoder Γ benefits from this data too, because it is still exposed to human-generated language, which guarantees better generalisation.

Guesser experience buffer. The Guesser should learn from its own mistakes; therefore we use generated dialogues for the model updates (de Vries et al., 2017; Shekhar et al., 2019). Inspired by Prioritised Experience Replay (Schaul et al., 2015), we create the experience buffer for the guesser E_g^e by accumulating all the unique and valid dialogues generated until epoch e. We consider a dialogue unique if D_g^e does not contain another dialogue with the same encoding, where the encoding of a dialogue is the SHA-256 hash associated with its sequence of tokens. In addition, we consider a dialogue valid if it does not contain repeated questions. We cap the number of dialogues in D_g^e so that it matches the number of experiences in D_q. This is done so that, during the multi-task training procedure, there is an equal number of dialogues for each task from which the agent will learn.
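The uniqueness and validity checks can be written as two small helpers; a sketch that assumes a dialogue is represented as its token sequence and as a list of (question, answer) turns, respectively.

    import hashlib

    def dialogue_key(tokens):
        """Encoding of a dialogue: the SHA-256 hash of its token sequence."""
        return hashlib.sha256(" ".join(tokens).encode("utf-8")).hexdigest()

    def is_valid(turns):
        """A dialogue is valid if it contains no repeated questions."""
        questions = [question for question, _answer in turns]
        return len(questions) == len(set(questions))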
A.1.3 Learning phase

In this phase, we use the same multi-task training procedure that was used during the supervised learning phase. We update the Questioner parameters using the dialogues collected in D_q and D_g^e. The updated parameters resulting from this step will be used for the self-play epoch e + 1.
A.2 VLP implementation

A.2.1 Multi-modal encoder

To implement the agents in our guessing games, we rely on VLP, a single-stream multi-modal model (Zhou et al., 2020) that jointly learns visual and language representations using the Conceptual Captions (CC) dataset (Sharma et al., 2018). The input starts with a classification token ([CLS]), followed by a series of K visual tokens; a separation token ([SEP]) divides the dialogue sequence from the visual tokens and from the sequence of tokens to be generated. In a guessing game, we represent the reference image I as a set of image regions {r_1, r_2, ..., r_K} extracted from an off-the-shelf object detector. Following (Zhou et al., 2020), each region r_i is represented by a linear transformation of a feature vector f ∈ R^{d_n}, region class probabilities c ∈ R^{d_c}, and region geometric information g ∈ R^{d_o}, where d_o = 5 consists of four values for the top-left and bottom-right corner coordinates of the region bounding box (normalised between 0 and 1) and one value for its relative area (i.e., the ratio of the bounding box area to the image area, also between 0 and 1). The Questioner model uses at most 36 predicted bounding boxes from FastRCNN, while the Guesser uses features generated by FastRCNN for gold bounding boxes. We use a specific segment id s_v for every region.

For the language part, we use WordPiece embeddings (Wu et al., 2016). In particular, we flatten the turns of the dialogue context as a sequence of tokens. However, to allow the model to differentiate between question and answer tokens, following (Wolf et al., 2019), we rely on novel segment ids (s_u, s_a). The VLP hidden state of the [CLS] token is used as the context representation h_c.
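For instance, the five-dimensional geometric vector g can be computed as below; this is a direct transcription of the definition above, with variable names of our choosing.

    def region_geometry(x1, y1, x2, y2, img_w, img_h):
        """g in R^5: normalised top-left/bottom-right corners + relative area."""
        rel_area = ((x2 - x1) * (y2 - y1)) / (img_w * img_h)
        return [x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h, rel_area]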
A.2.2 Oracle design

The implementation of the Oracle follows the one presented in the original VLP paper to solve the VQA task (Zhou et al., 2020). In particular, the model predicts a probability distribution over the possible answers by using a multi-layer feedforward neural network that receives as input the element-wise product between the hidden state associated with the [CLS] token and the hidden state associated with the target object. The model is optimised by minimising the cross-entropy loss, using as training dataset the question/answer pairs in the successful GuessWhat?! training dialogues.
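A simplified PyTorch rendering of this answer head follows; the layer sizes are illustrative and not taken from the released implementation.

    import torch
    import torch.nn as nn

    class OracleHead(nn.Module):
        """MLP over the fused [CLS] and target-object hidden states."""
        def __init__(self, d_model: int, n_answers: int = 3):  # {Yes, No, N/A}
            super().__init__()
            self.mlp = nn.Sequential(
                nn.Linear(d_model, d_model), nn.ReLU(),
                nn.Linear(d_model, n_answers))

        def forward(self, h_cls: torch.Tensor, h_target: torch.Tensor):
            # Element-wise product fuses the two hidden states.
            return self.mlp(h_cls * h_target)   # answer logits

    # Optimised with cross-entropy on Q/A pairs from successful dialogues:
    # loss = nn.CrossEntropyLoss()(logits, answer_labels)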
A.2.3 Questioner design

We rely on VLP's ability to generate captions for the question generation task. In particular, we provide as input to the model: 1) predicted FastRCNN visual features, following (Zhou et al., 2020); 2) the dialogue generated so far, as a flattened sequence of tokens; 3) the question to be generated. We use another segment id s_q to allow the model to differentiate between the input and the tokens to be generated. Following (Dong et al., 2019), we make sure that the attention for the tokens of the question to be generated is masked so that the token at timestep t is not allowed to attend to future tokens (seq2seq attention mask). For this specific model, we use the masked language modelling objective (Devlin et al., 2019), casting the task as multi-modal masked language modelling.
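The seq2seq attention mask can be constructed as follows; a sketch of the UniLM-style masking described above, where 1 marks an allowed attention edge.

    import torch

    def seq2seq_attention_mask(n_context: int, n_generate: int) -> torch.Tensor:
        """Context attends bidirectionally to itself; question tokens attend
        to the context and only to earlier (not future) question tokens."""
        total = n_context + n_generate
        mask = torch.zeros(total, total)
        mask[:, :n_context] = 1.0                      # everyone sees context
        mask[n_context:, n_context:] = torch.tril(
            torch.ones(n_generate, n_generate))        # causal over generation
        return mask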
A.3 GuessWhat?! evaluation

Oracle evaluation. We report a test accuracy for the Oracle of 82.22%. The baseline model used by all the other approaches is that of de Vries et al. (2017).
Guesser evaluation. We report in Table 3 the accuracy of the guesser in predicting the target object when gold dialogues are given as input. We compare this model with several baselines reported in (de Vries et al., 2017) (first block), more sophisticated methods such as ParallelAttention (Zhuang et al., 2018) and GDSE-* (Shekhar et al., 2019) (second block), as well as other Transformer-based models such as VILBERT (Lu et al., 2020) (third block).

Model              Accuracy
Human              90.80%
Random             17.10%
LSTM               61.30%
HRED               61.00%
LSTM+VGG           60.50%
HRED+VGG           60.40%
ParallelAttention  63.40%
GDSE-SL            62.96%
GDSE-CL            59.79%
VILBERT            65.69%
VLP-SL             69.30%
SPIEL-gs           –
SPIEL-gm           71.70%

Table 3: Results for the guesser accuracy evaluation on gold dialogues.
Table 4: Full results for the transfer evaluation on TDIUC (accuracy per question type, plus overall accuracy, A-MPT, and H-MPT).

Model     Positional  Counting  Presence  Affordances  Attribute  Color  Recognition  Scene  Absurd  Sentiment  Activity  Sport  Accuracy  A-MPT  H-MPT
MUREL     41.19       61.78     95.75     21.43        58.19      74.43  89.41        96.11  99.80   60.65      63.83     96.20  88.20     71.20  59.30
RAU       35.26       48.43     94.38     31.58        56.49      66.86  86.11        93.96  96.08   60.09      51.60     93.47  84.26     67.81  59.00
NMN       27.92       49.21     92.50     25.15        47.66      54.91  82.02        91.88  87.51   58.02      44.26     89.99  79.56     62.59  51.87
MCB-A     55.40       51.01     93.64     35.09        56.72      68.54  85.54        93.06  84.82   66.25      52.35     92.77  81.86     67.90  –
MCB       33.34       50.29     91.84     33.92        53.24      56.93  84.63        92.04  83.44   65.46      51.42     92.47  79.20     65.75  58.03
VLP-CC    36.93       55.28     94.65     30.99        55.42      67.33  85.76        92.98  98.34   62.62      51.34     94.11  85.60     68.81  60.14
VLP-SL    39.04       57.61     94.79     42.11        54.29      69.01  86.07        93.39  97.54   65.77      52.39     94.34  85.98     70.53  63.95
SPIEL-gs  40.94       57.53     94.76     36.26        56.87      69.20  86.33        93.97  97.48   62.30      54.44     94.62  –         70.89  64.31