Improving Context Modelling in Multimodal Dialogue Generation

Shubham Agarwal,∗ Ondřej Dušek, Ioannis Konstas and Verena Rieser
The Interaction Lab, Department of Computer Science, Heriot-Watt University, Edinburgh, UK
∗Adeptmind Scholar, Adeptmind Inc., Toronto, Canada
{sa201, o.dusek, i.konstas, v.t.rieser}@hw.ac.uk

Abstract
In this work, we investigate the task of textual response generation in a multimodal task-oriented dialogue system. Our work is based on the recently released Multimodal Dialogue (MMD) dataset (Saha et al., 2017) in the fashion domain. We introduce a multimodal extension to the Hierarchical Recurrent Encoder-Decoder (HRED) model and show that this extension outperforms strong baselines in terms of text-based similarity metrics. We also showcase the shortcomings of current vision and language models by performing an error analysis on our system's output.
Introduction

This work aims to learn strategies for textual response generation in a multimodal conversation directly from data. Conversational AI has great potential for online retail: it greatly enhances user experience and in turn directly affects user retention (Chai et al., 2000), especially if the interaction is multimodal in nature. So far, most conversational agents are unimodal, ranging from open-domain conversation (Ram et al., 2018; Papaioannou et al., 2017; Fang et al., 2017) to task-oriented dialogue systems (Rieser and Lemon, 2010, 2011; Young et al., 2013; Singh et al., 2000; Wen et al., 2016). While recent progress in deep learning has unified research at the intersection of vision and language, the availability of open-source multimodal dialogue datasets still remains a bottleneck.

This research makes use of the recently released Multimodal Dialogue (MMD) dataset (Saha et al., 2017), which contains multiple dialogue sessions in the fashion domain. The MMD dataset provides an interesting new challenge, combining recent efforts on task-oriented dialogue systems with visually grounded dialogue. In contrast to simple QA tasks in visually grounded dialogue, e.g. VQA (Antol et al., 2015), it contains conversations with a clear end-goal. However, in contrast to previous slot-filling dialogue systems, e.g. (Rieser and Lemon, 2011; Young et al., 2013), it heavily relies on the extra visual modality to drive the conversation forward (see Figure 1).

In the following, we propose a fully data-driven response generation model for this task. Our model is able to ground the system's textual response in both language and images by learning the semantic correspondence between them while modelling long-term dialogue context.

Figure 1: Example of a user-agent interaction in the fashion domain. In this work, we are interested in textual response generation for a user query. Both the user query and the agent response can be multimodal in nature.
Model

Our model is an extension of the recently introduced Hierarchical Recurrent Encoder-Decoder (HRED) architecture (Serban et al., 2016, 2017; Lu et al., 2016). In contrast to standard sequence-to-sequence models (Cho et al., 2014; Sutskever et al., 2014; Bahdanau et al., 2015), HREDs model the dialogue context by introducing a context Recurrent Neural Network (RNN) over the encoder RNN, thus forming a hierarchical encoder.

We build on top of the HRED architecture to include multimodality over multiple images. A simple HRED consists of three RNN modules: encoder, context and decoder. In our multimodal HRED, we combine the output representations from the utterance encoder with concatenated multiple image representations and pass them as input to the context encoder (see Figure 2). A dialogue is modelled as a sequence of utterances (turns), which in turn are modelled as sequences of words and images. Formally, a dialogue is generated according to:

P_\theta(t_1, \ldots, t_N) = \prod_{n=1}^{N} P_\theta(t_n \mid t_{1:n-1})

where t_n denotes the n-th turn of the dialogue.

Figure 2: The Multimodal HRED architecture consists of four modules: utterance encoder, image encoder, context encoder and decoder. While Saha et al. (2017) 'roll out' images to encode only one image per context, we concatenate all the 'local' representations to form a 'global' image representation per turn. Next, we concatenate the encoded text representation, and finally everything is fed to the context encoder.

Dataset

The MMD dataset (Saha et al., 2017) consists of 100k/11k/11k train/validation/test chat sessions comprising 3.5M context-response pairs for the model. Each session contains an average of 40 dialogue turns (an average of 8 words per textual response and 4 images per image response). The data contains complex user queries which pose new challenges for multimodal, task-based dialogue, such as quantitative inference (sorting, counting and filtering): "Show me more images of the 3rd product in some different directions"; inference using domain knowledge and long-term context: "Will the 5th result go well with a large sized messenger bag?"; inference over an aggregate of images: "List more in the upper material of the 5th image and style as the 3rd and the 5th"; and co-reference resolution. Note that we started from the raw transcripts of the dialogue sessions to create our own version of the dataset for the model. This is done since the original authors consider each image as a different context, while we consider all the images in a single turn as one concatenated context (cf. Figure 3).

Figure 3: Example contexts for a given system utterance; note the difference of our approach from Saha et al. (2017) when extracting the training data from the original chat logs. For simplicity, in this illustration we consider a context size of 2 previous utterances. '|' separates turns within a given context. We concatenate the representation vectors of all images in one turn of a dialogue to form the image context; if there is no image in the utterance, we use a zero vector instead. In this work, we focus only on the textual response of the agent.
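To make the per-turn combination concrete, the following is a minimal PyTorch sketch of the multimodal context step described above, assuming 512-dimensional GRUs and pre-extracted 4096-dimensional image features. The module name, the tanh-activated linear projection of the image context, and the unidirectional utterance encoder are illustrative simplifications, not a reproduction of our released code (which uses a bidirectional encoder with tied embeddings).

import torch
import torch.nn as nn

class MultimodalContextStep(nn.Module):
    """One context step of a multimodal HRED: encode the words of a turn,
    combine them with the concatenated image features of that turn, and
    advance the context RNN by one step (sketch, not the original code)."""

    def __init__(self, hidden_size=512, image_dim=4096, images_per_turn=5):
        super().__init__()
        # utterance encoder over word embeddings (unidirectional here for brevity)
        self.utterance_encoder = nn.GRU(hidden_size, hidden_size, batch_first=True)
        # project the 'global' per-turn image context (concatenated feature vectors,
        # zeros for turns without images) down to the hidden size -- an assumption
        self.image_proj = nn.Linear(image_dim * images_per_turn, hidden_size)
        # context RNN consumes [text encoding ; image encoding] for each turn
        self.context_rnn = nn.GRU(hidden_size * 2, hidden_size, batch_first=True)

    def forward(self, turn_word_embeddings, turn_image_feats, context_state=None):
        # turn_word_embeddings: (batch, seq_len, hidden) embedded words of one turn
        # turn_image_feats:     (batch, images_per_turn * image_dim)
        _, text_enc = self.utterance_encoder(turn_word_embeddings)      # (1, batch, hidden)
        img_enc = torch.tanh(self.image_proj(turn_image_feats))         # (batch, hidden)
        turn_repr = torch.cat([text_enc.squeeze(0), img_enc], dim=-1)   # (batch, 2*hidden)
        _, context_state = self.context_rnn(turn_repr.unsqueeze(1), context_state)
        return context_state  # conditions the decoder for the next system response

Running this step over every turn in the context window yields the context state that conditions the (attentional) decoder generating the textual response.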
Experimental Setup

We use the PyTorch framework (Paszke et al., 2017; https://pytorch.org/) for our implementation. We use 512 as the word embedding size as well as the hidden dimension for all RNNs, which use GRU cells (Cho et al., 2014), with tied embeddings for the (bi-directional) encoder and decoder. The decoder uses a Luong-style attention mechanism (Luong et al., 2015) with input feeding. We train our model with the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 0.0004, clipping the gradient norm at 5. We perform early stopping by monitoring validation loss. For image representations, we use the FC6-layer representations of VGG-19 (Simonyan and Zisserman, 2014), pre-trained on ImageNet. In the future, we plan to exploit state-of-the-art frameworks such as ResNet or DenseNet and fine-tune the image encoder jointly during the training of the model.

We report sentence-level BLEU-4 (Papineni et al., 2002), METEOR (Lavie and Agarwal, 2007) and ROUGE-L (Lin and Och, 2004) using the evaluation scripts provided by Sharma et al. (2017). Our code is freely available at: https://github.com/shubhamagarwal92/mmd

Results and Analysis

We compare our results against Saha et al. (2017) by using their code and data-generation scripts (https://github.com/amritasaha1812/MMD_Code). Note that the results reported in their paper are on a different version of the corpus and hence not directly comparable.

Model                  Cxt   BLEU-4   METEOR   ROUGE-L
Saha et al. M-HRED*     2    0.3767   0.2847   0.6235
T-HRED                  2    0.4292   0.3269   0.6692
M-HRED                  2    0.4308   0.3288   0.6700
T-HRED-attn             2    0.4331   0.3298   0.6710
M-HRED-attn             2    0.4345   0.3315   0.6712
T-HRED-attn             5    0.4442
M-HRED-attn             5

Table 1: Sentence-level BLEU-4, METEOR and ROUGE-L results for the response generation task on the MMD corpus. "Cxt" is the context size considered by the model. Our best-performing model is M-HRED-attn over a context of 5 turns. *Saha et al. has been trained on a different version of the dataset.

Table 1 provides results for different configurations of our model ("T" stands for text-only in the encoder, "M" for multimodal, and "attn" for using attention in the decoder). We experimented with different context sizes and found that output quality improves with increased context size: models with a 5-turn context perform better than those with a 2-turn context, confirming the observations of Serban et al. (2016, 2017). Using a pairwise bootstrap resampling test (Koehn, 2004), we confirmed that the difference between M-HRED-attn (5) and M-HRED-attn (2) is statistically significant at the 95% confidence level. Using attention clearly helps: even T-HRED-attn outperforms M-HRED (without attention) for the same context size. We also tested whether multimodal input has an impact on the generated outputs; however, there was only a slight increase in BLEU score (M-HRED-attn vs. T-HRED-attn).

To summarize, our best-performing model (M-HRED-attn) outperforms the model of Saha et al. by 7 BLEU points. This can be primarily attributed to the way we created the input for our model from the raw chat logs, as well as to incorporating more information during decoding via attention.
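For reference, the following is a minimal sketch of the paired bootstrap resampling test (Koehn, 2004) used for the significance testing above; the function name and the metric callback are illustrative and not taken from our released code.

import random

def paired_bootstrap(refs, hyps_a, hyps_b, metric, n_samples=1000, seed=0):
    # refs:   list of gold responses
    # hyps_a: outputs of system A (e.g. context size 5), aligned with refs
    # hyps_b: outputs of system B (e.g. context size 2), aligned with refs
    # metric: callable(refs, hyps) -> corpus-level score such as BLEU-4
    assert len(refs) == len(hyps_a) == len(hyps_b)
    rng = random.Random(seed)
    n, wins = len(refs), 0
    for _ in range(n_samples):
        idx = [rng.randrange(n) for _ in range(n)]   # resample the test set with replacement
        sample_refs = [refs[i] for i in idx]
        if metric(sample_refs, [hyps_a[i] for i in idx]) > \
           metric(sample_refs, [hyps_b[i] for i in idx]):
            wins += 1
    return wins / n_samples   # A is better at the 95% level if this exceeds 0.95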
Figure 4: Examples of predictions using M-HRED-attn (5). Recall that we are focusing on generating textual responses. Our model's predictions are shown in blue, the true gold targets in red. We show only the previous user utterance for brevity's sake.

Figure 4 provides example output utterances of M-HRED-attn with a context size of 5. Our model is able to accurately map the response to previous textual context turns, as shown in (a) and (c). In (c), it is able to capture that the user is asking about the style in the 1st and 2nd image. (d) shows an example where our model is able to infer from visual features that the corresponding product is 'jeans', while in (b) it is not able to model fine-grained details such as the 'casual fit' style and resorts to 'woven' instead.

Conclusion

In this research, we address the novel task of response generation in search-based multimodal dialogue by learning from the recently released Multimodal Dialogue (MMD) dataset (Saha et al., 2017). We introduce a novel extension to the Hierarchical Recurrent Encoder-Decoder (HRED) model (Serban et al., 2016) and show that our implementation significantly outperforms the model of Saha et al. (2017) by modelling the full multimodal context; the difference is statistically significant at the 95% confidence level according to the pairwise bootstrap resampling test (Koehn, 2004). Contrary to their results, our generation outputs improved by adding attention and increasing context size. However, we also show that multimodal HRED does not improve significantly over text-only HRED, similar to observations by Agrawal et al. (2016) and Qian et al. (2018). Our model learns to handle textual correspondence between the questions and answers while mostly ignoring the visual context. This indicates that we need better visual models to encode the image representations when we have multiple similar-looking images, e.g., black hats in Figure 3. We believe that the results should improve with a jointly trained or fine-tuned CNN for generating the image representations, which we plan to implement in future work.

Acknowledgments

This research received funding from Adeptmind Inc., Toronto, Canada and the MaDrIgAL EPSRC project (EP/N017536/1). The Titan Xp used for this work was donated by the NVIDIA Corporation.

References

Aishwarya Agrawal, Dhruv Batra, and Devi Parikh. 2016. Analyzing the behavior of visual question answering models. In Proceedings of EMNLP.

Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual question answering. In Proceedings of ICCV, pages 2425-2433.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of ICLR.

Joyce Yue Chai, Nanda Kambhatla, and Wlodek Zadrozny. 2000. Natural language sales assistant - a web-based dialog system for online sales. In Proceedings of AAAI.

Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of EMNLP.

Hao Fang, Hao Cheng, Elizabeth Clark, Ariel Holtzman, Maarten Sap, Mari Ostendorf, Yejin Choi, and Noah A. Smith. 2017. Sounding Board - University of Washington's Alexa Prize submission. Alexa Prize Proceedings.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. CoRR abs/1412.6980.

Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In Proceedings of EMNLP.

Alon Lavie and Abhaya Agarwal. 2007. METEOR: An automatic metric for MT evaluation with high levels of correlation with human judgments. In Proceedings of the 2nd Workshop on Statistical Machine Translation, pages 228-231.

Chin-Yew Lin and Franz Josef Och. 2004. Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram statistics. In Proceedings of ACL, pages 605-612.
Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. 2016. Hierarchical question-image co-attention for visual question answering. In Proceedings of NIPS, pages 289-297.

Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of EMNLP.

Ioannis Papaioannou, Amanda Cercas Curry, Jose L. Part, Igor Shalyminov, Xinnuo Xu, Yanchao Yu, Ondřej Dušek, Verena Rieser, and Oliver Lemon. 2017. Alana: Social dialogue using an ensemble model and a ranker trained on user feedback. Alexa Prize Proceedings.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of ACL, pages 311-318.

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. In NIPS-W.

Xin Qian, Ziyi Zhong, and Jieli Zhou. 2018. Multimodal machine translation with reinforcement learning. CoRR abs/1805.02356.

Ashwin Ram, Rohit Prasad, Chandra Khatri, Anu Venkatesh, Raefer Gabriel, Qing Liu, Jeff Nunn, Behnam Hedayatnia, Ming Cheng, Ashish Nagar, et al. 2018. Conversational AI: The science behind the Alexa Prize. CoRR abs/1801.03604.

Verena Rieser and Oliver Lemon. 2010. Natural language generation as planning under uncertainty for spoken dialogue systems. In Empirical Methods in Natural Language Generation, pages 105-120. Springer.

Verena Rieser and Oliver Lemon. 2011. Reinforcement Learning for Adaptive Dialogue Systems: A Data-driven Methodology for Dialogue Management and Natural Language Generation. Springer.

Amrita Saha, Mitesh Khapra, and Karthik Sankaranarayanan. 2017. Multimodal dialogs (MMD): A large-scale dataset for studying multimodal domain-aware conversations. CoRR abs/1704.00200.

Iulian Vlad Serban, Alessandro Sordoni, Yoshua Bengio, Aaron C. Courville, and Joelle Pineau. 2016. Building end-to-end dialogue systems using generative hierarchical neural network models. In Proceedings of AAAI.

Iulian Vlad Serban, Alessandro Sordoni, Ryan Lowe, Laurent Charlin, Joelle Pineau, Aaron C. Courville, and Yoshua Bengio. 2017. A hierarchical latent variable encoder-decoder model for generating dialogues. In Proceedings of AAAI, pages 3295-3301.

Shikhar Sharma, Layla El Asri, Hannes Schulz, and Jeremie Zumer. 2017. Relevance of unsupervised metrics in task-oriented dialogue for evaluating natural language generation. CoRR abs/1706.09799.

Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. CoRR abs/1409.1556.

Satinder P. Singh, Michael J. Kearns, Diane J. Litman, and Marilyn A. Walker. 2000. Reinforcement learning for spoken dialogue systems. In Proceedings of NIPS, pages 956-962.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Proceedings of NIPS, pages 3104-3112.

Tsung-Hsien Wen, David Vandyke, Nikola Mrksic, Milica Gasic, Lina M. Rojas-Barahona, Pei-Hao Su, Stefan Ultes, and Steve Young. 2016. A network-based end-to-end trainable task-oriented dialogue system. CoRR abs/1604.04562.

Steve Young, Milica Gašić, Blaise Thomson, and Jason D. Williams. 2013. POMDP-based statistical spoken dialog systems: A review.