The Dialogue Dodecathlon: Open-Domain Knowledge and Image Grounded Conversational Agents
Kurt Shuster, Da Ju, Stephen Roller, Emily Dinan, Y-Lan Boureau, Jason Weston
Facebook AI Research
{kshuster,daju,roller,edinan,ylan,jase}@fb.com

Abstract
We introduce dodecaDialogue: a set of 12 tasks that measures if a conversational agent can communicate engagingly with personality and empathy, ask questions, answer questions by utilizing knowledge resources, discuss topics and situations, and perceive and converse about images. By multi-tasking on such a broad large-scale set of data, we hope to both move towards and measure progress in producing a single unified agent that can perceive, reason and converse with humans in an open-domain setting. We show that such multi-tasking improves over a BERT pre-trained baseline, largely due to multi-tasking with very large dialogue datasets in a similar domain, and that the multi-tasking in general provides gains to both text and image-based tasks using several metrics in both the fine-tune and task transfer settings. We obtain state-of-the-art results on many of the tasks, providing a strong baseline for this challenge.
Introduction

One of the goals of AI is to build a seeing, talking agent that can discuss, reason, empathize, and provide advice – in short, a system that can perform natural communication displaying many of the properties expected when speaking to a human partner. Ideally, it should be able to be knowledgeable and personable, expert and engaging, serious or humorous – depending on the situation. It should be capable of answering questions, asking questions, responding to statements, having its own persona, and grounding the dialogue with external information and images.

While no single task exists that can train an agent or measure its ability on all of these axes at once, a number of distinct large-scale datasets targeting subsets of these skills have recently become available. We thus assemble these disparate tasks to form a single challenge: dodecaDialogue, consisting of 12 subtasks. Each contains both training data to build the skills we desire for our agent, and validation and test sets to measure our agent's ability at that skill. The overall goal is a single agent that can display all these skills. As some of the subtasks have very large datasets, e.g. 2.2 billion utterances, they can possibly help the agent with other skills too.

We thus build a model capable of training and multi-tasking on all these sources. We employ a transformer-based architecture (Vaswani et al., 2017) which accepts an image, external textual information and dialogue history as input, and generates a response for a given dialogue turn. Practically, by pre-training on the largest of the subtasks and then multi-tasking on all of them, we can obtain state-of-the-art results compared to existing independently reported performance on all 10 of the 12 subtasks that have previous comparable results. We hence set a strong baseline for this challenge. While many existing approaches use large-scale pre-training on general text corpora, we show that using dialogue datasets instead, which are more closely linked to the desired agent's goals, is a strong alternative.

However, many challenges remain. While multi-tasking performs well and has clear benefits, as shown in other works (Liu et al., 2015; Raffel et al., 2019), when compared to fine-tuning of the same system we typically obtain small losses. Zero-shot transfer to left-out tasks is also demanding for current approaches. We analyze these aspects, along with our model's ability to ground on external knowledge and images in conjunction with the dialogue context, the impact of decoding algorithms, and the weighting of tasks during multi-tasking, as well as cross-task transfer ability, in order to shed light and make progress on this challenging topic.
Table 1: The 12 dodecaDialogue subtasks, their sizes (number of train, valid, test utterances), and average number of turns and response length (words). [The table also marks which skills each subtask covers: ask questions, answer questions, respond to statements, persona grounding, knowledge grounding, situation grounding and image grounding; the per-task entries are not recoverable here.]

dodecaDialogue Task
The dodecaDialogue task is intended to assemble important aspects of an engaging conversational agent into a single collection, where each subtask covers some of those goals. Such an agent should be able to get to know you when you first talk to it (ConvAI2), discuss everyday topics (DailyDialog, pushshift.io Reddit, Twitter, Cornell Movie), speak knowledgeably at depth (Wizard of Wikipedia, Ubuntu) and answer questions on such topics (ELI5). It must be able to handle situated conversations and demonstrate empathy (Empathetic Dialog, LIGHT). It can also discuss images, as this is a vital part of human connection (Image Chat, IGC). We note that all of the provided subtasks are in English.

The overall statistics of the subtasks are given in Table 1. We now discuss each in turn.
ConvAI2
ConvAI2 is a dataset used at the NeurIPS 2018 competition of the same name, and is based on PersonaChat (Zhang et al., 2018; Dinan et al., 2020). The training data involves paired crowdworkers having a conversation where they get to know each other, in which each is given a role to play based on sentences describing their persona, which were also separately crowdsourced (while they cannot see their partner's persona). It thus involves asking and answering questions, responding in kind, and getting to know the other speaker and engaging them in friendly conversation – useful skills for an open-domain conversational agent.
DailyDialog
Li et al. (2017) built a dialogue dataset intended to reflect conversations occurring in daily life. It covers ten categories ranging from holidays to financial topics, rather than focusing on one domain. Compared to ConvAI2, these conversations seem more in keeping with partners who already know each other, and want to discuss typical life details – again useful skills for a conversational agent. The dataset is also annotated with topic, emotion and utterance acts, but here we ignore these annotations and learn only from the utterances in the dialogue turns.
Wizard of Wikipedia
This task involves discussing a given topic in depth, where the goal is to both engage the partner as well as display expert knowledge (Dinan et al., 2019). The training set consists of 1247 topics and a retrieval system over Wikipedia from which the dialogues were grounded during the human–human crowdsourced conversations. The topics were also crowdsourced and range from e-books to toga parties to showers. A model can thus learn to also perform similar retrieval and grounding at test time to potentially discuss any topic if it can generalize. We use the gold knowledge version of the task. We see this skill as a core component of an agent being able to not just chitchat, but actually engage a user in discussing real information about the world, e.g. by retrieving over documents from the internet.
Empathetic Dialogues
Rashkin et al. (2019) constructed a dataset of crowdworker conversations grounded in an emotional situation. In each dialogue, one speaker describes a personal situation and the other plays a "listener" role, displaying empathy during the discussion. The dataset contains descriptions of the situations being discussed with an attached emotion label, but these are not used here. Trained models are measured playing the part of the empathetic listener, an important feature of an agent to which humans wish to speak.
Cornell Movie
Danescu-Niculescu-Mizil and Lee (2011) constructed a corpus containing a collection of fictional conversations from movie scripts, thus covering a large diversity of topics and emotional states.
LIGHT
LIGHT (Urbanek et al., 2019) involves situated interactions between characters in a text adventure game. Similar to ConvAI2, personas for each character are given, with the training set including conversations between crowdworkers playing those roles. Different from ConvAI2, included are emotes and actions grounded within the game world (e.g. picking up and giving objects). As such, it measures the ability of a conversational agent to ground its discussion in a dynamic environment.
ELI5
ELI5 (Fan et al., 2019) involves long-form question answering grounded on multiple retrieved documents in order to answer common questions which people ask on the popular ELI5 subreddit. As such, the answers are in a conversational form applicable to a dialogue agent.
Ubuntu
Lowe et al. (2015) built a dataset that involves in-depth discussions about solving Ubuntu problems. This studies the ability of an agent on a very focused single topic, and is also a standard benchmark in the field.
Twitter

We use a variant of Twitter discussions (text-only), which have been used in many existing studies, e.g. Sordoni et al. (2015); See et al. (2019). This data naturally involves everyday discussions about topics that people care about. The public forum makes them different from the more personal discussions of some of the other tasks. This is the second largest dataset in the collection, and we thus measure in experiments its ability to help performance on other tasks.

pushshift.io Reddit
We use a variant of Reddit discussions (text-only), which has also been used in several existing studies, see e.g. Yang et al. (2018); Mazaré et al. (2018); Keskar et al. (2019). Following Humeau et al. (2019), we use a previously existing Reddit dataset extracted and obtained by a third party and made available on pushshift.io, training to generate a comment conditioned on the full thread leading up to the comment, spanning 2200M training examples. This is the largest dataset in the collection – much larger than the others. The subreddits cover a vast range of topics, and hence this is a strong candidate for helping improve performance on other tasks via pre-training and multi-tasking. Note this dataset does not overlap with ELI5.
Image Chat
Shuster et al. (2018) collected a crowdsourced dataset of human–human conversations about an image with a given personality, where the goal is to engage the other speaker. As such, it covers natural conversational responses, including displays of emotion and humor.
Image Grounded Conversations (IGC)
IGC (Mostafazadeh et al., 2017) similarly involves two speakers discussing an image, here focusing on questions and responses. It only includes a validation and test set, and so we converted most of the validation set to form a small training set.
Metrics

For all tasks, we use the following metrics: perplexity (PPL), BLEU, ROUGE-1, -2 and -L, and F1. We also pick the metric most used in the literature as that subtask's 'Score' to compare to existing work.
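The F1 score used here is word-overlap F1 between the generated and gold responses; a minimal sketch follows. Whitespace tokenization is an assumption on our part – evaluation code typically applies extra normalization (punctuation stripping, stopword handling) before counting overlap:

```python
from collections import Counter

def unigram_f1(prediction: str, reference: str) -> float:
    """Word-overlap F1 between a generated response and the gold response.
    Tokenization is plain lower-cased whitespace splitting (an assumption;
    real evaluation scripts often normalize further)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)  # multiset overlap
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Perplexity, by contrast, is computed directly from the model's token-level negative log-likelihood and depends on the dictionary used, which is why cross-paper perplexity comparisons require care.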
Multi-tasking
As we are interested in building a single conversational agent, we measure the ability of multi-tasked models that can perform all twelve tasks at once.
Single-Task Fine-tuning
We can still compare such multi-tasked models to single-task fine-tuned baselines to assess if we have gained or lost performance. Like other works (Liu et al., 2015; Raffel et al., 2019), we also consider a multi-task followed by fine-tune setup in order to see if this produces better models. The latter tests if multi-tasking still proves useful in the single-task setting.
Zero-shot Transfer
Finally, we consider a leave-one-out zero-shot setting whereby training is constrained to be on all the training data except for the task being evaluated. This evaluates the performance on truly new unseen tasks, an important behavior given there are always new tasks.
Related Work
Where possible, we have tried to track the best existing results for each task and provided a comparison in our final results table.

As ConvAI2 was a competition, a number of competitors built strong models on it. The best results were obtained by large pre-trained transformers (Dinan et al., 2020). In particular, Wolf et al. (2019b) pre-trained via the method of Radford et al. (2018) using the BooksCorpus dataset, resulting in the best perplexities and F1 scores. Since then, results have gotten even better with the advent of better and larger pre-training (Lewis et al., 2019), which we compare to here; the same work also reports strong results on ELI5.

He et al. (2019) recently obtained strong results on the DailyDialog and Cornell Movie tasks in terms of perplexity by pre-training on 10% of CCNEWS (Bakhtin et al., 2019), thus using 100 million sentences (2.7 billion words), and then fine-tuning a transformer-based model with a multi-task strategy.

Overall, large pre-trained transformers indeed provide strong existing results on many of the tasks. Several large language modeling projects have been undertaken in order to show prowess in multi-tasking ability (Radford et al., 2019; Keskar et al., 2019), and transformer-based approaches have been adapted to language and vision tasks as well (Lu et al., 2019; Tan and Bansal, 2019; Li et al., 2019a; Shuster et al., 2018). As well as citing the relevant papers' results where possible in the experiments section, we also train a BERT-based (Devlin et al., 2019) generative model as an additional baseline.
In the interests of feasibility, there are tasks we did not include in dodecaDialogue. For example, there are additional knowledge tasks (Qin et al., 2019; Moghe et al., 2018; Gopalakrishnan et al., 2019) and image-based datasets (Das et al., 2017) one could use. There are also a large number of QA tasks we did not include, e.g. Rajpurkar et al. (2016); Choi et al. (2018). In general, our choices were made based on tasks that after training might produce an engaging dialogue agent that humans naturally would want to talk to – which means either natural datasets or crowdsourced datasets where crowdworkers were encouraged to engage one another. As computational resources and ambitions scale, it would be interesting to add more tasks as well, while retaining the twelve we have chosen here in order to continue to evaluate their success, whilst extending the scope of the entire system.

All the subtasks in the collection we use here already exist. Other research projects have also built such collection-based tasks before. In particular, the NLP decathlon (McCann et al., 2018), from which the name of this paper is inspired, collects together a diverse set of NLP tasks – from sentiment detection to parsing. Talmor and Berant (2019) collect a set of 10 QA datasets and build MULTIQA. Recently, Raffel et al. (2019) also similarly multi-tasked a large set of NLP tasks, on an even bigger scale. Our work differs from these in that it is focused on dialogue tasks which naturally group together to form a conversational agent.
BERT baseline.
We implement a generative baseline using BERT by adapting the model with a standard auto-regressive loss. We concatenate both the context and current generation and provide these as input to the model, using BERT's sentence embeddings to distinguish the roles in the network. Although BERT is trained to predict masked tokens, we find that fine-tuning can easily adjust its behavior to predicting the next token. Our BERT baseline is roughly equivalent to the model of Wolf et al. (2019b), but does not have a classification loss term. The implementation relies on HuggingFace Transformers (Wolf et al., 2019a). We thus fine-tune this model for each of our tasks, except Image Chat and IGC, which require images as input.
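The adaptation just described – concatenating context and response, using segment (sentence) embeddings to distinguish the two, and training with a next-token loss – can be sketched in PyTorch as below. The helper names and the explicit loss mask are our own illustrative assumptions, not the paper's code:

```python
import torch
import torch.nn.functional as F

def build_lm_batch(context_ids, response_ids):
    """Concatenate dialogue context and response into one sequence.
    Token-type ids play the role of BERT's sentence embeddings,
    distinguishing context (0) from response (1); the loss mask
    restricts training to response positions."""
    input_ids = torch.tensor([context_ids + response_ids])
    token_type_ids = torch.tensor([[0] * len(context_ids) + [1] * len(response_ids)])
    loss_mask = torch.tensor([[0] * len(context_ids) + [1] * len(response_ids)])
    return input_ids, token_type_ids, loss_mask

def next_token_nll(logits, input_ids, loss_mask):
    """Auto-regressive NLL: predict token t+1 from positions up to t,
    averaged over the masked (response) positions only."""
    nll = F.cross_entropy(
        logits[:, :-1].transpose(1, 2),  # (batch, vocab, seq_len - 1)
        input_ids[:, 1:],                # shifted targets
        reduction="none",
    )
    mask = loss_mask[:, 1:].float()
    return (nll * mask).sum() / mask.sum().clamp(min=1)
```

Feeding `input_ids` and `token_type_ids` through a BERT encoder and applying this loss to its output logits recovers the generative fine-tuning behavior described above.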
Image+Seq2Seq.
We use a modification of a transformer Seq2Seq architecture (Vaswani et al., 2017), additionally adding image features to the encoder. Our model is an 8-layer encoder, 8-layer decoder with 512-dimensional embeddings and 16 attention heads, and is based on the ParlAI implementation (Miller et al., 2017). We use BPE following Humeau et al. (2019) via lower-cased Wikipedia, Toronto Books, and Open Subtitles with 30k merges, giving 54,940 terms. Reported perplexities are computed with this dictionary. For image features, we use the pre-trained image features from the ResNeXt-IG-3.5B model, a ResNeXt 32x48d architecture (Xie et al., 2017) trained on 3.5 billion Instagram images following the procedure described by Mahajan et al. (2018). This model was previously used successfully for the Image Chat task in Shuster et al. (2018). The final encoding from the ResNeXt model is a vector of size 2048; we then use a linear layer to project it into the same size as the text encoding, add it as an extra token at the end of the transformer's encoder output, and feed them all into the decoder. During fine-tuning we train the text transformer, but leave the image encoding fixed, apart from fine-tuning the linear projection. The text transformer is fine-tuned with a standard auto-regressive negative log-likelihood (NLL) loss, following usual sequence-to-sequence training schemes. Our best models are available at parl.ai/projects/dodecadialogue.

Table 2: Validation perplexity for the dodecaDialogue tasks in various settings (BERT-based; single task from scratch; single task with fastText init; Twitter + single task; Reddit only; Reddit + single task; MT all tasks + FT single task; all tasks MT; leave-one-out zero-shot). [Only a fragment of the ConvAI2 row survives here: 19.4, 43.3, 38.9, 28.7, 18.3, 11.4.]
Task Training

We employ the ParlAI framework (Miller et al., 2017) for training on single tasks and for multi-tasking, as many of the tasks are already implemented there, along with a (multi-task) training and evaluation framework for such models.
Pre-training
As pushshift.io Reddit and (to some extent) Twitter are much larger than our other tasks, we try pre-training the Seq2Seq module of our Image+Seq2Seq networks with those datasets, before multi-tasking on all of the tasks, or before evaluating single-task fine-tuning.

For Reddit, the model was trained to generate a comment conditioned on the full thread leading up to the comment. Comments containing URLs or that were under 5 characters in length were removed from the corpus, as were all child comments. Comments were truncated to 1024 BPE tokens. The model was trained with a batch size of 3072 sequences for approximately 3M updates using a learning rate of 5e-4 and an inverse square root scheduler. This took approximately two weeks using 64 NVIDIA V100s. We note that our transformer pre-training only includes text, while our image encoder was pre-trained separately in previous work (Mahajan et al., 2018). Learning how to combine these sources occurs during fine-tuning. It is important to note that, while compute-heavy, pre-training was conducted exactly once, and all of the subsequent fine-tuning is significantly faster to run.

Table 3: Transfer performance of various multi-task models (validation perplexity). [Columns: ConvAI2, Wiz. of Wikipedia, Empathetic Dialog; the Reddit row reads 18.3, 15.3, 14.4; the remaining rows, including Reddit+ConvAI2, are not recoverable here.]

Table 4: The impact of knowledge and image grounding in dodecaDialogue (validation perplexity). [Without/With columns; only fragmentary entries survive: Wiz. of Wikipedia 16.8, ELI5 21.3, Image Chat 19.5.]
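The comment filtering used for pushshift.io Reddit pre-training (drop comments containing URLs or under 5 characters, truncate to 1024 BPE tokens) can be sketched as below; `bpe_encode` is a hypothetical stand-in for the real BPE tokenizer, and the URL check is a simplified assumption:

```python
def preprocess_comment(text, bpe_encode, max_tokens=1024, min_chars=5):
    """Sketch of the pre-training filter described above: drop comments
    containing URLs or shorter than `min_chars`, then truncate the
    tokenized comment to `max_tokens`. Returns None for dropped comments."""
    if len(text) < min_chars:
        return None
    if "http" in text or "www." in text:  # crude URL heuristic (assumption)
        return None
    return bpe_encode(text)[:max_tokens]
```

In the real pipeline, child comments are also removed and the surviving thread is used as the generation context.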
Transfer Performance between Tasks
We first perform a preliminary study on a subset of the tasks – Reddit, ConvAI2, Wizard of Wikipedia and Empathetic Dialogues – and report the transfer ability of training on some of them and testing on all of them (using the validation set), reporting perplexity. The results are reported in Table 3. They show that training on pushshift.io Reddit alone, a huge dataset, is effective at transfer to other tasks, but never as effective as fine-tuning on the task itself. Moreover, fine-tuning on most of the smaller tasks actually provides improvements over pushshift.io Reddit training alone at transfer, likely because the three tasks selected are more similar to each other than to pushshift.io Reddit. Finally, training on all four tasks is the most effective strategy averaged over all tasks compared to any other single model, although this does not beat switching between different fine-tuned models on a per-task basis.
Comparison of Pre-training + Fine-tuning strategies
Across all 12 tasks, we compare several pre-training strategies: using BERT, no pre-training at all, only initializing via fastText (Joulin et al., 2017), and using Twitter and pushshift.io Reddit pre-training with our Image+Seq2Seq architecture. For each variant we tune the learning rate, layers, number of heads and embedding size, with less pre-training typically requiring smaller-capacity models. We then fine-tune only on a single task in these experiments, and report perplexity for that task alone, over all 12 tasks. The results are given in Table 2, reporting results on the validation set (we choose not to use the test set here: as we report so many numbers, we do not want to overuse it).

The results show a clear reduction in perplexity with more pre-training, as expected. This is most easily seen by the dodecaScore (last row), the mean perplexity over all 12 tasks, which decreases from 49.5 (from-scratch models) down to 17.1 with pushshift.io Reddit pre-training. FastText (45.7) and Twitter (35.6) initializations help, but nowhere near as much. BERT fares better, but is still clearly worse than pushshift.io Reddit pre-training. The hypothesis here is that pushshift.io Reddit yields much more effective transfer because it is a dialogue task like our others, whereas non-dialogue corpora such as Wikipedia are not. This was previously observed for retrieval models in Humeau et al. (2019). Note that we do not report results for the image dialogue tasks for BERT, as that architecture does not deal with images.

Finally, as pushshift.io Reddit is so effective, we also compare to pushshift.io Reddit training only, with no fine-tuning at all across all tasks, similar to our initial study in Table 3. The performance is impressive, with some tasks yielding lower perplexity than BERT pre-training + single-task fine-tuning. However, it still lags significantly behind fine-tuning applied after pushshift.io Reddit pre-training.

Table 5: Validation perplexity on select dodecaDialogue tasks comparing relative weights of tasks during multi-tasking, followed by fine-tuning (row below). The relative task weight is the ratio of examples from that task compared to others presented during multi-tasking; ∞ indicates single-task training. [Recoverable fragment: the Cornell row reads 21.9, 21.5, 20.6, 20.1, 19.9 across increasing weights; the fine-tuned rows are not recoverable here.]

Table 6: Impact of the decoding strategy on select tasks, reporting validation F1 score for the All Tasks MT model (beam sizes 1, 2, 3, 5; n-gram blocking with N = 3; nucleus sampling with parameter p). N-gram blocking is reported for the best beam size. [The numeric entries are not recoverable here.]
Image and Knowledge Grounding
Some of our tasks involve grounding on knowledge or images. To show such grounding helps, we report results with and without grounding on those tasks in Table 4, reporting perplexity. Particularly for Wizard of Wikipedia (knowledge) and Image Chat (images), such grounding has a clear effect.
Multi-Task Results
Next, we perform multi-task training across all tasks, which is our ultimate goal in order to obtain an open-domain conversational agent. We optimize over the same set of hyperparameters as before, including multi-tasking weights for tasks, where one samples during training with differing probabilities, and we choose the best model by performing early stopping on the average performance across all tasks. In this way, we treat all 12 tasks as a single task, and thus during test time it is the model's responsibility to understand how to respond from the context (image/dialogue) itself.

In the end we did not obtain clear improvements beyond pre-training with pushshift.io Reddit and then equally sampling from all tasks. We report that final model's validation performance in terms of perplexity in Table 2 (second-to-last column, "All Tasks MT"). It achieves a dodecaScore of 19.1, superior to all pre-train/fine-tune approaches except pushshift.io Reddit pre-training followed by fine-tuning, and is also superior to a single pushshift.io Reddit model. However, comparing across tasks, while most are close to the corresponding best fine-tuned model, many are just slightly worse. This is an expected result and is often reported in multi-task systems (Raffel et al., 2019). We look upon this result as both positive – we can obtain a single model doing well on all tasks, which a fine-tuned model cannot – whilst also remaining a challenge to the community: can one find architectures that leverage multi-tasking even better?

Table 7: Test performance for various metrics on the dodecaDialogue tasks comparing our multi-task and multi-task + fine-tuned methods to existing approaches (cited). Dashes mean the metric was not provided; ∗ was reported on validation only. Score is defined on a per-task basis in the metric column. [Only fragments of the table survive here, e.g. the ConvAI2 row: (Lewis et al., 2019) ∗11.9 PPL / ∗20.7 F1 versus our 11.1 / 21.6.]
Multi-Task followed by Fine-Tuning
As also performed in Liu et al. (2015); Raffel et al. (2019), we can try to train in a multi-task manner on all tasks before fine-tuning on a single task, building a separate model performing this procedure for each task, in an attempt to improve single-task results further. Using this approach, one is free to perform hyperparameter search differently for each task. Here, we found that applying relative task up-weighting during multi-task training made a clear difference to the final quality of the fine-tuned target task model; see Table 5. Generally, better results come from assigning most of the multi-task weight towards the task itself to be fine-tuned. Using such an approach we can get marginally better results than fine-tuning alone, although the differences are generally small. The final best models per task are shown compared to other approaches in Table 2 (third-to-last column, "MT All Tasks + FT Single Task"). The final validation dodecaScore is 16.8, only slightly below 17.1 for fine-tuning.
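The relative task weighting used during multi-task training – presenting examples from a task in proportion to its weight – can be sketched as a simple weighted sampler. The function names are illustrative assumptions, and ParlAI's actual multi-task scheduler may differ:

```python
import random

def make_task_sampler(relative_weights, seed=0):
    """Return a function that picks which task the next training example
    is drawn from, with probability proportional to its relative weight.
    E.g. {"cornell": 10, "convai2": 1} presents ten Cornell examples
    for every ConvAI2 example, on average."""
    rng = random.Random(seed)
    tasks = sorted(relative_weights)
    weights = [relative_weights[t] for t in tasks]

    def sample_task():
        return rng.choices(tasks, weights=weights, k=1)[0]

    return sample_task
```

Under this scheme, single-task training corresponds to the ∞ weight in Table 5 (all mass on one task), and equal sampling corresponds to uniform weights.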
Decoding Strategies
So far, we have only been measuring perplexity, but we are actually interested in generation, which requires us to decode. We consider several standard approaches: greedy decoding; beam search (with beam size, and minimum and maximum output length hyperparameters); beam search with beam blocking (blocking n-grams, we use n = 3) (Paulus et al., 2018); and nucleus sampling (with parameter p) (Holtzman et al., 2019). We show the effect of these choices in Table 6 for ConvAI2 and Wizard of Wikipedia (WoW).
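Of these strategies, nucleus sampling is the least standard; a minimal single-step sketch for a 1-D logits vector follows (this is a generic implementation of Holtzman et al.'s method, not the codebase's exact decoder):

```python
import torch
import torch.nn.functional as F

def nucleus_sample(logits, p=0.9):
    """Top-p (nucleus) sampling: sample from the smallest set of tokens
    whose cumulative probability exceeds p, after renormalizing."""
    probs = F.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Drop tokens once the mass *before* them already exceeds p; the
    # single most probable token is therefore always kept.
    sorted_probs[cumulative - sorted_probs > p] = 0.0
    sorted_probs = sorted_probs / sorted_probs.sum()
    choice = torch.multinomial(sorted_probs, 1)
    return sorted_idx[choice]
```

Smaller p makes generation more conservative (approaching greedy decoding), while p close to 1 approaches sampling from the full distribution.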
Final Systems

The final test performance for our best multi-task and fine-tuned (via multi-task followed by fine-tuning) systems is reported in Table 7 (right), with more detailed results – all decoding-based metrics, and validation as well as test performance – in Appendix A. Here, for the multi-task model we have fine-tuned the decoding hyperparameters per task (the length parameters are important for ELI5). For results with a single set of decoding hyperparameters, see also Appendix A. We generally find across all metrics a similar story as before when comparing fine-tuning with multi-tasking: multi-tasking is successful, but the challenge is still to do better.
Comparison to Existing Systems
We compare to existing state-of-the-art results previously published for each task. Results are given in Table 7. As existing works report different metrics per task, we report perplexity where possible (but note that it may be computed with a different dictionary), and choose the sequence decoding-based metric that is commonly reported per task (listed in column 'Metric'), where the 'Score' column reports its value. We compare these to our best fine-tuned and multi-tasked models. Our multi-task model outperforms all available existing results, with 2 of the 12 tasks having no previous result. It is only surpassed by our fine-tuned model, which also outperforms all available existing results. Overall, our methods set a strong challenge to future approaches.
Human Evaluation
In addition to automatic metrics, we perform human evaluation on two of the tasks to assess the abilities of our All Tasks MT conversational agent: the knowledge grounding task Wizard of Wikipedia (WoW) and the image grounding task Image Chat. We follow the same evaluation protocols as in Dinan et al. (2019); Shuster et al. (2018), comparing our method to the existing approaches referenced in Table 7. This involves collecting 100 human–bot conversations for WoW using crowdworkers, involving 8–10 turns each, across seen topics (seen in the training set) and unseen topics, and 500 image-based responses for Image Chat. A separate set of crowdworkers are then used to compare models pairwise following the ACUTE-Eval procedure of Li et al. (2019b), where they are asked to choose which is "the more engaging response" for Image Chat (1500 trials) and "Who would you prefer to talk to for a long conversation?" for WoW (400 trials).

The results, given in Figure 1, show our method outperforming the existing state-of-the-art generative models on all three comparisons: Image Chat, WoW seen topics and WoW unseen topics. All three results are statistically significant (binomial test). Additional details and results breakdown are given in Appendix Section B.

Figure 1: Human evaluations on Image Chat and Wizard of Wikipedia (WoW), comparing existing state-of-the-art models with our All Tasks MT conversational agent. Engagingness win rates are statistically significant in all three matchups (binomial test).

Example Outputs

We show some example outputs of our multi-task model for some of the tasks in Appendix C. Our model is able to leverage images, knowledge, and given personality attributes to produce engaging dialogue with a large amount of variety, depending on the situation.

Leave-One-Out Zero-Shot Performance
Last, but not least, we evaluate the performance of a multi-task model at zero-shot transfer to a new dialogue task. This is performed by training on all but one of the tasks and reporting performance on the left-out one, repeating this experiment for all tasks. Our best performing models in that regard are reported in Table 2 (last column). First, it is reassuring that the overall scores are reasonable, outperforming a pushshift.io Reddit-only model on every task except pushshift.io Reddit itself. This means that multi-tasking across many tasks helps transfer learning. However, the gap between zero-shot performance and multi-task or fine-tuning performance means there is still a significant challenge in improving these results. Finally, we believe that reporting results in this regime in addition to multi-tasking results may help avoid the temptation to "cheat" at multi-tasking by trying to detect the task and then apply a separate fine-tuned classifier, as presumably that approach will not truly leverage reasoning and skills between tasks, which transfer may help measure.
Conclusion

We have introduced the dodecaDialogue task, and provide strong baseline results leveraging multi-modal Image+Seq2Seq transformers trained across all tasks. The goal of introducing this task is not just as another challenge dataset, but to further motivate building and evaluating conversational agents capable of multiple skills – one of the core goals of AI. We believe current systems are closer to that goal than ever before – but we also still have a long way to go.

Recently reported results show systems can be reasonably competitive compared to humans in particular domains for short conversations (Li et al., 2019b; Shuster et al., 2018). This work tries to bridge the gap to avoid agents with niche skills, to move towards evaluating an open-domain set of skills. Still, despite leveraging 12 tasks, there are many skills not included in our set. For example, longer conversations involving memory (Moon et al., 2019), or mixing open-domain conversation with task-oriented goals. Future work should consider adding these tasks to the ones used here, while continuing the quest for improved models.
References
Anton Bakhtin, Sam Gross, Myle Ott, Yuntian Deng, Marc'Aurelio Ranzato, and Arthur Szlam. 2019. Real or fake? Learning to discriminate machine from human generated text. arXiv preprint arXiv:1906.03351.
Eunsol Choi, He He, Mohit Iyyer, Mark Yatskar, Wen-tau Yih, Yejin Choi, Percy Liang, and Luke Zettlemoyer. 2018. QuAC: Question answering in context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2174–2184, Brussels, Belgium. Association for Computational Linguistics.
Cristian Danescu-Niculescu-Mizil and Lillian Lee. 2011. Chameleons in imagined conversations: A new approach to understanding coordination of linguistic style in dialogs. In Proceedings of the 2nd Workshop on Cognitive Modeling and Computational Linguistics, Portland, Oregon, USA. Association for Computational Linguistics.
Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José M.F. Moura, Devi Parikh, and Dhruv Batra. 2017. Visual dialog. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 326–335.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
Emily Dinan, Varvara Logacheva, Valentin Malykh, Alexander Miller, Kurt Shuster, Jack Urbanek, Douwe Kiela, Arthur Szlam, Iulian Serban, Ryan Lowe, Shrimai Prabhumoye, Alan W. Black, Alexander Rudnicky, Jason Williams, Joelle Pineau, Mikhail Burtsev, and Jason Weston. 2020. The second conversational intelligence challenge (ConvAI2). In The NeurIPS '18 Competition, pages 187–208, Cham. Springer International Publishing.
Emily Dinan, Stephen Roller, Kurt Shuster, Angela Fan, Michael Auli, and Jason Weston. 2019. Wizard of Wikipedia: Knowledge-powered conversational agents. In Proceedings of the International Conference on Learning Representations.
Angela Fan, Yacine Jernite, Ethan Perez, David Grangier, Jason Weston, and Michael Auli. 2019. ELI5: Long form question answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3558–3567, Florence, Italy. Association for Computational Linguistics.
Karthik Gopalakrishnan, Behnam Hedayatnia, Qinlang Chen, Anna Gottardi, Sanjeev Kwatra, Anu Venkatesh, Raefer Gabriel, and Dilek Hakkani-Tür. 2019. Topical-Chat: Towards knowledge-grounded open-domain conversations. In Proc. Interspeech 2019, pages 1891–1895.
Tianxing He, Jun Liu, Kyunghyun Cho, Myle Ott, Bing Liu, James Glass, and Fuchun Peng. 2019. Mix-review: Alleviate forgetting in the pretrain-finetune framework for neural language generation models. arXiv preprint arXiv:1910.07117.
Ari Holtzman, Jan Buys, Maxwell Forbes, and Yejin Choi. 2019. The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751.
Samuel Humeau, Kurt Shuster, Marie-Anne Lachaux, and Jason Weston. 2019. Poly-encoders: Transformer architectures and pre-training strategies for fast and accurate multi-sentence scoring. arXiv preprint arXiv:1905.01969.
Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2017. Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 427–431, Valencia, Spain. Association for Computational Linguistics.
Nitish Shirish Keskar, Bryan McCann, Lav R. Varshney, Caiming Xiong, and Richard Socher. 2019. CTRL: A conditional transformer language model for controllable generation. arXiv preprint arXiv:1909.05858.
Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461.
Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. 2019a. VisualBERT: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557.
Margaret Li, Jason Weston, and Stephen Roller. 2019b. ACUTE-EVAL: Improved dialogue evaluation with optimized questions and multi-turn comparisons. In Proceedings of the NeurIPS Workshop on Conversational AI.
Yanran Li, Hui Su, Xiaoyu Shen, Wenjie Li, Ziqiang Cao, and Shuzi Niu. 2017. DailyDialog: A manually labelled multi-turn dialogue dataset. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 986–995, Taipei, Taiwan. Asian Federation of Natural Language Processing.
Xiaodong Liu, Jianfeng Gao, Xiaodong He, Li Deng, Kevin Duh, and Ye-yi Wang. 2015. Representation learning using multi-task deep neural networks for semantic classification and information retrieval. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 912–921, Denver, Colorado. Association for Computational Linguistics.
Ryan Lowe, Nissan Pow, Iulian Serban, and Joelle Pineau. 2015. The Ubuntu dialogue corpus: A large dataset for research in unstructured multi-turn dialogue systems. In Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 285–294, Prague, Czech Republic. Association for Computational Linguistics.
Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. arXiv preprint arXiv:1908.02265.
Yi Luan, Yangfeng Ji, and Mari Ostendorf. 2016. LSTM based conversation models. arXiv preprint arXiv:1603.09457.
Dhruv Mahajan, Ross Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, and Laurens van der Maaten. 2018. Exploring the limits of weakly supervised pretraining. In Proceedings of the European Conference on Computer Vision, pages 185–201, Cham. Springer International Publishing.
Pierre-Emmanuel Mazaré, Samuel Humeau, Martin Raison, and Antoine Bordes. 2018. Training millions of personalized dialogue agents. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2775–2779, Brussels, Belgium. Association for Computational Linguistics.
Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. 2018. The natural language decathlon: Multitask learning as question answering. arXiv preprint arXiv:1806.08730.
Alexander Miller, Will Feng, Dhruv Batra, Antoine Bordes, Adam Fisch, Jiasen Lu, Devi Parikh, and Jason Weston. 2017. ParlAI: A dialog research software platform. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 79–84, Copenhagen, Denmark. Association for Computational Linguistics.
Nikita Moghe, Siddhartha Arora, Suman Banerjee, and Mitesh M. Khapra. 2018. Towards exploiting background knowledge for building conversation systems. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2322–2332, Brussels, Belgium. Association for Computational Linguistics.
Seungwhan Moon, Pararth Shah, Rajen Subba, and Anuj Kumar. 2019. Memory grounded conversational reasoning. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): System Demonstrations, pages 145–150, Hong Kong, China. Association for Computational Linguistics.
Nasrin Mostafazadeh, Chris Brockett, Bill Dolan, Michel Galley, Jianfeng Gao, Georgios Spithourakis, and Lucy Vanderwende. 2017. Image-grounded conversations: Multimodal context for natural question and response generation. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 462–472, Taipei, Taiwan. Asian Federation of Natural Language Processing.
Romain Paulus, Caiming Xiong, and Richard Socher. 2018. A deep reinforced model for abstractive summarization. In Proceedings of the International Conference on Learning Representations.
Lianhui Qin, Michel Galley, Chris Brockett, Xiaodong Liu, Xiang Gao, Bill Dolan, Yejin Choi, and Jianfeng Gao. 2019. Conversing by reading: Contentful neural conversation with on-demand machine reading. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5427–5436, Florence, Italy. Association for Computational Linguistics.
Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training.
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8).
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683.
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas. Association for Computational Linguistics.
Hannah Rashkin, Eric Michael Smith, Margaret Li, and Y-Lan Boureau. 2019. Towards empathetic open-domain conversation models: A new benchmark and dataset. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5370–5381, Florence, Italy. Association for Computational Linguistics.
Abigail See, Stephen Roller, Douwe Kiela, and Jason Weston. 2019. What makes a good conversation? How controllable attributes affect human judgments. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1702–1723, Minneapolis, Minnesota. Association for Computational Linguistics.
Kurt Shuster, Samuel Humeau, Antoine Bordes, and Jason Weston. 2018. Engaging Image Chat: Modeling personality in grounded dialogue. arXiv preprint arXiv:1811.00945.
Alessandro Sordoni, Michel Galley, Michael Auli, Chris Brockett, Yangfeng Ji, Margaret Mitchell, Jian-Yun Nie, Jianfeng Gao, and Bill Dolan. 2015. A neural network approach to context-sensitive generation of conversational responses. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 196–205, Denver, Colorado. Association for Computational Linguistics.
Alon Talmor and Jonathan Berant. 2019. MultiQA: An empirical investigation of generalization and transfer in reading comprehension. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4911–4921, Florence, Italy. Association for Computational Linguistics.
Hao Tan and Mohit Bansal. 2019. LXMERT: Learning cross-modality encoder representations from transformers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5099–5110, Hong Kong, China. Association for Computational Linguistics.
Jack Urbanek, Angela Fan, Siddharth Karamcheti, Saachi Jain, Samuel Humeau, Emily Dinan, Tim Rocktäschel, Douwe Kiela, Arthur Szlam, and Jason Weston. 2019. Learning to speak and act in a fantasy text adventure game. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 673–683, Hong Kong, China. Association for Computational Linguistics.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019a. HuggingFace's Transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771.
Thomas Wolf, Victor Sanh, Julien Chaumond, and Clement Delangue. 2019b. TransferTransfo: A transfer learning approach for neural network based conversational agents. arXiv preprint arXiv:1901.08149.
Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. 2017. Aggregated residual transformations for deep neural networks. In Computer Vision and Pattern Recognition (CVPR).
Yinfei Yang, Steve Yuan, Daniel Cer, Sheng-yi Kong, Noah Constant, Petr Pilar, Heming Ge, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. 2018. Learning semantic textual similarity from conversations. In Proceedings of The Third Workshop on Representation Learning for NLP, pages 164–174, Melbourne, Australia. Association for Computational Linguistics.
Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. 2018. Personalizing dialogue agents: I have a dog, do you have pets too? In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2204–2213, Melbourne, Australia. Association for Computational Linguistics.
A Additional Results
Task                   | MT + FT: PPL / BLEU-4 / ROUGE-1 / ROUGE-2 / ROUGE-L / F1 | All Tasks MT: PPL / BLEU-4 / ROUGE-1 / ROUGE-2 / ROUGE-L / F1
ConvAI2                | 11.1 / 6.6 / 37.0 / 11.6 / 31.8 / 21.6 | 10.8 / 5.5 / 39.4 / 12.5 / 33.7 / 21.7
DailyDialog            | 10.4 / 4.0 / 35.6 / 10.0 / 30.8 / 18.2 | 12.0 / 2.9 / 33.9 / 8.7 / 29.2 / 16.2
Wiz. of Wikipedia      | 8.3 / 21.5 / 55.3 / 28.4 / 44.9 / 38.4 | 8.4 / 21.0 / 53.2 / 28.0 / 45.4 / 38.4
Empathetic Dialog      | 11.4 / 3.5 / 38.0 / 9.5 / 32.3 / 19.5 | 11.5 / 3.7 / 37.2 / 8.9 / 31.4 / 19.3
Cornell Movie          | 20.2 / 2.5 / 29.5 / 6.7 / 25.7 / 12.4 | 22.2 / 2.1 / 29.1 / 6.5 / 25.6 / 11.9
LIGHT                  | 18.9 / 2.6 / 30.8 / 5.8 / 24.8 / 16.2 | 19.3 / 2.4 / 30.5 / 5.6 / 24.6 / 16.1
ELI5                   | 21.0 / 3.7 / 38.6 / 7.2 / 22.1 / 23.1 | 24.9 / 3.2 / 35.2 / 6.3 / 20.5 / 21.3
Ubuntu                 | 17.1 / 2.5 / 27.0 / 5.0 / 22.8 / 12.7 | 23.1 / 3.7 / 26.0 / 4.3 / 22.0 / 12.1
Twitter                | 30.7 / 3.2 / 16.5 / 3.3 / 14.3 / 9.9  | 38.2 / 2.6 / 19.4 / 3.3 / 16.5 / 9.8
pushshift.io Reddit    | 25.6 / 2.1 / 24.1 / 4.5 / 18.7 / 13.6 | 27.8 / 1.6 / 23.4 / 4.2 / 18.1 / 13.5
Image Chat             | 18.8 / 2.4 / 30.1 / 5.7 / 26.0 / 13.0 | 22.3 / 2.1 / 28.4 / 4.9 / 24.6 / 12.9
IGC                    | 11.9 / 8.6 / 65.0 / 34.1 / 60.5 / 38.4 | 12.0 / 8.0 / 61.3 / 28.3 / 56.8 / 41.4
dodecaScore            | 17.1 / 5.3 / 35.6 / 11.0 / 29.6 / 19.8 | 19.4 / 4.9 / 34.8 / 10.1 / 29.0 / 19.6
Table 8: Test performance for various metrics on the dodecaDialogue tasks, comparing our multi-task and multi-task + fine-tuned methods.
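The F1 column in these tables measures word overlap between the model response and the gold response. A common way to compute such a unigram F1 (shown here with naive whitespace tokenization, which may differ from the exact tokenizer used in the paper) is:

```python
from collections import Counter


def f1_score(prediction, reference):
    """Unigram F1 between a model response and the gold response.
    Whitespace tokenization is a simplification for illustration."""
    pred, ref = prediction.split(), reference.split()
    common = Counter(pred) & Counter(ref)  # multiset intersection
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


f1_score("i love the rolling stones", "i love the beatles")  # ≈ 0.667
```

Using a multiset intersection (rather than a set) correctly credits repeated tokens only as often as they appear in both responses.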
Task                   | MT + FT: PPL / BLEU-4 / ROUGE-1 / ROUGE-2 / ROUGE-L / F1 | All Tasks MT: PPL / BLEU-4 / ROUGE-1 / ROUGE-2 / ROUGE-L / F1
ConvAI2                | 11.2 / 5.7 / 36.7 / 10.9 / 31.6 / 21.1 | 11.3 / 5.3 / 38.7 / 11.6 / 32.9 / 21.3
DailyDialog            | 10.2 / 4.4 / 36.8 / 10.7 / 32.0 / 18.8 | 11.8 / 3.1 / 34.8 / 9.3 / 30.2 / 17.1
Wiz. of Wikipedia      | 8.5 / 20.8 / 54.9 / 28.0 / 44.8 / 37.9 | 8.7 / 20.2 / 55.2 / 28.2 / 45.0 / 37.9
Empathetic Dialog      | 11.1 / 3.6 / 38.6 / 9.8 / 32.7 / 19.7 | 11.2 / 3.5 / 37.5 / 9.1 / 31.8 / 19.3
Cornell Movie          | 19.8 / 2.5 / 29.3 / 6.7 / 25.6 / 12.3 | 21.9 / 2.1 / 29.0 / 6.5 / 25.6 / 11.8
LIGHT                  | 18.7 / 2.6 / 31.2 / 6.2 / 25.2 / 16.5 | 19.0 / 2.5 / 30.9 / 6.1 / 25.0 / 16.4
ELI5                   | 21.1 / 3.7 / 38.7 / 7.3 / 22.1 / 23.2 | 25.0 / 3.2 / 35.3 / 6.3 / 20.6 / 21.2
Ubuntu                 | 17.2 / 2.4 / 27.1 / 5.0 / 22.9 / 12.8 | 23.3 / 3.5 / 26.4 / 4.6 / 22.3 / 12.2
Twitter                | 29.8 / 3.2 / 16.7 / 3.5 / 14.5 / 10.1 | 37.0 / 2.6 / 19.7 / 3.6 / 16.8 / 9.9
pushshift.io Reddit    | 25.8 / 2.2 / 24.2 / 4.5 / 18.7 / 13.4 | 28.0 / 1.7 / 23.4 / 4.1 / 18.2 / 13.3
Image Chat             | 18.3 / 2.4 / 30.7 / 6.2 / 26.3 / 14.3 | 21.8 / 2.1 / 28.6 / 5.3 / 24.7 / 13.1
IGC                    | 10.0 / 10.6 / 67.9 / 38.2 / 64.5 / 45.1 | 10.2 / 11.0 / 66.3 / 34.8 / 61.4 / 45.3
dodecaScore            | 16.8 / 5.3 / 36.1 / 11.4 / 30.1 / 20.4 | 19.1 / 5.1 / 35.5 / 10.8 / 29.5 / 19.9
Table 9: Validation performance for various metrics on the dodecaDialogue tasks, comparing our multi-task and multi-task + fine-tuned methods.
Task                   | PPL / BLEU-4 / ROUGE-1 / ROUGE-2 / ROUGE-L / F1
ConvAI2                | 11.3 / 5.6 / 22.2 / 7.0 / 20.4 / 21.3
DailyDialog            | 11.8 / 4.8 / 18.9 / 5.6 / 17.6 / 16.6
Wiz. of Wikipedia      | 8.7 / 19.7 / 40.9 / 22.6 / 36.9 / 37.7
Empathetic Dialog      | 11.2 / 4.8 / 20.9 / 5.6 / 19.0 / 19.3
Cornell Movie          | 21.9 / 3.3 / 14.2 / 3.2 / 13.4 / 11.3
LIGHT                  | 19.0 / 2.9 / 17.0 / 3.4 / 15.0 / 16.2
ELI5                   | 25.0 / 1.6 / 14.2 / 2.6 / 9.6 / 16.2
Ubuntu                 | 23.3 / 2.3 / 12.5 / 1.9 / 11.6 / 11.2
Twitter                | 37.0 / 2.3 / 9.5 / 1.7 / 8.7 / 8.9
pushshift.io Reddit    | 28.0 / 1.8 / 12.1 / 2.2 / 10.4 / 11.3
Image Chat (all turns) | 21.8 / 2.1 / 14.7 / 2.5 / 13.6 / 13.1
IGC                    | 10.2 / 5.5 / 50.7 / 25.3 / 49.1 / 36.0
dodecaScore            | 19.1 / 4.7 / 20.7 / 7.0 / 18.8 / 18.3
Table 10: All Tasks Multi-Tasking (MT) validation performance for various metrics on the dodecaDialogue tasks with one set of decoding parameters: a beam size of 3, minimum response length of 10, and blocking repeated tri-grams.
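The repeated tri-gram blocking used at decoding time can be implemented as a simple check before extending a hypothesis: a candidate token is disallowed if appending it would create an n-gram that already occurs in the hypothesis. A minimal sketch, independent of any particular beam search implementation:

```python
def blocks_repeat_ngram(tokens, candidate, n=3):
    """Return True if appending `candidate` to the hypothesis `tokens`
    would repeat an n-gram (tri-gram by default) already present in it."""
    if len(tokens) < n - 1:
        return False  # not enough context to form a full n-gram yet
    new_ngram = tuple(tokens[-(n - 1):] + [candidate])
    seen = {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    return new_ngram in seen
```

In beam search, a candidate flagged by this check is typically given a score of negative infinity so it is never selected, which eliminates the repetitive loops that sequence-to-sequence decoders are prone to.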
Task                   | BLEU-4: Score / Beam / Min L / Max L / N-gram Block | ROUGE-L: Score / Beam / Min L / Max L / N-gram Block | F1: Score / Beam / Min L / Max L / N-gram Block
ConvAI2                | 5.7 / 10 / 10 / 128 / 3   | 31.6 / 10 / 50 / 128 / 3  | 21.1 / 3 / 10 / 128 / 3
DailyDialog            | 4.4 / 10 / 5 / 128 / 3    | 32.0 / 3 / 50 / 128 / 3   | 18.8 / 5 / 10 / 128 / 3
Wiz. of Wikipedia      | 20.8 / 10 / 5 / 128 / 0   | 44.8 / 10 / 50 / 128 / 3  | 37.9 / 10 / 10 / 128 / 3
Empathetic Dialog      | 3.6 / 10 / 5 / 128 / 3    | 32.7 / 5 / 50 / 128 / 3   | 19.7 / 5 / 10 / 128 / 3
Cornell Movie          | 2.5 / 10 / 5 / 128 / 3    | 25.6 / 10 / 50 / 128 / 3  | 12.3 / 10 / 20 / 128 / 3
LIGHT                  | 2.6 / 3 / 5 / 128 / 3     | 25.2 / 5 / 50 / 128 / 3   | 16.5 / 5 / 20 / 128 / 3
ELI5                   | 3.7 / 10 / 200 / 256 / 3  | 22.1 / 5 / 200 / 256 / 3  | 23.2 / 10 / 200 / 256 / 3
Ubuntu                 | 2.4 / 10 / 5 / 128 / 0    | 22.9 / 10 / 40 / 128 / 3  | 12.8 / 2 / 10 / 128 / 3
Twitter                | 3.2 / 10 / 20 / 128 / 3   | 14.5 / 5 / 50 / 128 / 3   | 10.1 / 10 / 20 / 128 / 3
pushshift.io Reddit    | 2.2 / 10 / 10 / 128 / 0   | 18.7 / 5 / 50 / 128 / 3   | 13.4 / 5 / 50 / 128 / 3
Image Chat (all turns) | 2.4 / 10 / 5 / 128 / 3    | 26.4 / 3 / 50 / 128 / 3   | 14.3 / 5 / 1 / 128 / 3
IGC                    | 10.6 / 10 / 5 / 128 / 3   | 64.5 / 3 / 50 / 128 / 3   | 45.1 / 10 / 5 / 128 / 3
Table 11: Best decoding parameters for each task, based on metric. Scores are from the best performing task-specific multi-task + fine-tuned model on validation sets. "Min L" and "Max L" refer to the minimum and maximum decoding length, where "L" is the number of tokens.
B Human Evaluation Further Details
We provide additional results from our human evaluations described in Section 5. In Figure 1, we compare our All Tasks MT Image+Seq2Seq model to existing baselines from both tasks; to produce those outputs, we used beam search with a beam size of 10 and tri-gram blocking. As with our experiments regarding automatic metrics, we additionally explored nucleus sampling, with parameter p = 0. , and compared to both the baseline models as well as human outputs. In Tables 12, 13, and 14, we show the full results of comparing various models both to each other and also to humans.
When collecting the model-human chats for Wizard of Wikipedia, we additionally asked the humans for a rating from 1-5 at the end of each conversation, to indicate the quality of the model's responses; we compare these Likert ratings to those of Dinan et al. (2019), which followed the same protocol, in Table 15. The findings are similar to the pairwise ACUTE-Eval results in the main paper.

Loser (rows) \ Winner (columns) | (Shuster et al., 2018) | Image+Seq2Seq Nucleus | Image+Seq2Seq Beam | Human
(Shuster et al., 2018)          | –    | 50.8 | ∗    | ∗
Image+Seq2Seq Nucleus           | 49.2 | –    | 52.1 | ∗
Image+Seq2Seq Beam              | ∗    | ∗    | –    |
Human                           | ∗    | ∗    | ∗    | –

Table 12: Human evaluations on Image Chat, comparing various decoding schemes for our Image+Seq2Seq model trained on all tasks MT, as well as comparisons with human outputs. Scores with ∗ are statistically significant (binomial test, p < . ).

Loser (rows) \ Winner (columns) | (Dinan et al., 2019) | Image+Seq2Seq Nucleus | Image+Seq2Seq Beam | Human
(Dinan et al., 2019)            | –    | 59.1 | 62.1 | 71.9
Image+Seq2Seq Nucleus           | 40.1 | –    | –    | 70.4
Image+Seq2Seq Beam              | 37.9 | –    | –    | 60.0
Human                           | 28.1 | 29.6 | 40.0 | –
Table 13: Human evaluations on Wizard of Wikipedia (seen) test set, comparing various decoding schemes for our Image+Seq2Seq model trained on all tasks MT, as well as comparisons with human outputs, using ACUTE-Eval. All scores are statistically significant (binomial test, p < . ).

Loser (rows) \ Winner (columns) | (Dinan et al., 2019) | Image+Seq2Seq Nucleus | Image+Seq2Seq Beam | Human
(Dinan et al., 2019)            | –    | 62.3 | 64.1 | 75.8
Image+Seq2Seq Nucleus           | 37.7 | –    | –    | 72.8
Image+Seq2Seq Beam              | 35.9 | –    | –    | 60.5
Human                           | 24.2 | 27.2 | 39.5 | –

Table 14: Human evaluations on Wizard of Wikipedia (unseen) test set, comparing various decoding schemes for our Image+Seq2Seq model trained on all tasks MT, as well as comparisons with human outputs, using ACUTE-Eval. All scores are statistically significant (binomial test, p < . ).

Method                       | Decode Method | Seen        | Unseen
Human Performance            | –             | 4.13 (1.08) | 4.34 (0.98)
(Dinan et al., 2019)         | Beam          | 2.92 (1.33) | 2.93 (1.30)
Image+Seq2Seq (All Tasks MT) | Beam          | 3.42 (1.10) | 3.33 (1.09)
Image+Seq2Seq (All Tasks MT) | Nucleus       | 2.95 (1.08) | 3.43 (1.05)
Table 15: Human evaluations on Wizard of Wikipedia, comparing various decoding schemes for our Image+Seq2Seq model trained on all tasks MT, as well as comparisons with human outputs, in terms of Likert scores. Ratings are reported as mean (stddev).
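Nucleus (top-p) sampling, one of the decoding schemes compared in these evaluations, keeps the smallest set of highest-probability tokens whose cumulative mass reaches p, renormalizes, and samples from that set (Holtzman et al., 2019). A minimal single-step sketch; the probability table and the value of p here are illustrative only:

```python
import random


def nucleus_sample(probs, p=0.9, rng=None):
    """Sample one token from the nucleus (top-p set) of a distribution.
    `probs` maps token -> probability; p=0.9 is an illustrative default."""
    rng = rng or random.Random(0)
    items = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    nucleus, total = [], 0.0
    for tok, pr in items:
        nucleus.append((tok, pr))
        total += pr
        if total >= p:
            break  # smallest prefix whose mass reaches p
    toks, weights = zip(*nucleus)
    # random.choices renormalizes the weights implicitly
    return rng.choices(toks, weights=weights, k=1)[0]
```

Truncating the tail of the distribution in this way avoids the low-probability degenerate continuations that pure sampling produces, while retaining more diversity than beam search.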
C All Tasks Multi-Task Model Examples
Context ConvAI2
Personas:
i love to drink wine and dance in the moonlight.
i am very strong for my age.
i'm 100 years old.
i feel like i might live forever.

Speaker A: hi , how are you doing today ?
Speaker B: good ! you ! celebrating with fellow centenarians
Speaker A: nice . i'm playing some card games with my family .
Speaker B: that sounds like fun . i like wine and dancing too !
Speaker A: same . i try to get a small workout in a three mile walk for me is key .
Model: that's great. i'm very strong for my age.

Context Wizard of Wikipedia
Chosen Topic:
The Rolling Stones
Knowledge: no passages used

Speaker A: I love the Rolling Stones. Do you? They were formed in 1962.
Speaker B: I love them! Who was in the band at the start?
Speaker A: The first lineup was Brian Jones, Mick, Keith, Bill Wyman, Charlie Watts and Ian Steward.
Speaker B: Was Mick ever replaced?
Model: i'm not sure, but i do know that mick jagger was the lead vocalist.

Context Wizard of Wikipedia
Chosen Topic:
Dog
Knowledge:
The dog was the first species to be domesticated and has been selectively bred over millennia for various behaviors, sensory capabilities, and physical attributes.

Speaker A: Hi, do you like dogs?
Speaker B: I do! Actually, my favourite is the Basenji breed! It is a hunting dog.
Speaker A: Oh, those are cool. I have seen them on the Westminster Dog Show, but I've never met one in person. I have a Havanese.
Speaker B: Wow, are you Cuban? I know that the Havanese are the national dog of cuba!
Speaker A: No, not Cuban, but yes they are the national dog of Cuba! Apparently all the Havanese in the US are descended from a handful that came here from Cuba during the revolution!
Model: that's so cool! the dog was the first species to be domesticated!

Context Wizard of Wikipedia
Chosen Topic:
Orc
Knowledge:
While the overall concept of orcs draws on a variety of pre-existing mythology, the main conception of the creatures stems from the fantasy writings of J.R.R. Tolkien, in particular "The Lord of the Rings".

Speaker A: Orcs are cool fictional humanoid beings
Speaker B: Yeah I've seen them in a lot of things like Magic and DnD.
Speaker A: While the concept remains the same it draws on many preexisting mythology
Speaker B: That's really interesting! Orcs are some of my favorite fictional characters honestly.
Model: the main conception of the creatures stems from the fantasy writings of the lord of the rings

Context Empathetic Dialogues
Situation
I received concert tickets for Christmas.
Emotion excited

Speaker A: I couldn't wait to go to the concert.
Speaker B: Which Concert?
Speaker A: The U2 concert. Tickets were really expensive and I never thought we would be able to go, but somehow we did!!!
Model: that's awesome. i've never been to a u2 concert.

Context Empathetic Dialogues
Situation
I went to buy a 42 inch tv at walmart, the billboard says 299 usd, it was at 399 usd after taxes.
Emotion disappointed

Speaker A: So yeah i drove 2 hours to get to wallmart. with 299 in my wallet as advertised, and the real price of the tv was 399 usd. what a bummer
Speaker B: Thats horrible.
Speaker A: Yeah i had my hopes reall high and then i left empty handed, such a terrible feeling!
Model: i know what you mean. i would have been so mad.

Context Empathetic Dialogues
Situation
All my friends live in a different country
Emotion lonely

Speaker A: hi, i feel so lonely sometimes because all my friends live in a different country
Speaker B: Oh, I'm sure you are lonely. Maybe you can join some kind of club that lets you meet new friends?
Speaker A: i was thinking about it! I wanted to join a group for local moms
Model: that's a great idea! i'm sure you'll meet some great people there.

Context Cornell Movie
No Context

Speaker A: Why'd you let him get to you?
Speaker B: Who?
Speaker A: Dorsey.
Speaker B: I hate him.
Speaker A: I know. It'd have to be a pretty big deal to get you to mainline tequila. You don't seem like the type.
Model: i'm not above all that. i just don't think i can be "laid back" like everyone else.

Context LIGHT
setting name: Castle Maids' Room, Inside Castle
setting desc: Inside the maids room is dark and grey. It is much different from the rest of the Castle which has grand walls and floors and paintings from famous painters around the wall. The maid holds many secrets in this room. One which is a secret door that holds paintings she should not have.
partner name: painter
self name: worker
self persona: I am a worker from a village up the river. I tie the logs our woodsmen fell into rafts, and steer them down to the next town for milling. I am hardworking and patient. I enjoy fishing off of the edge of the raft during quiet times on the raft.

Speaker A: Hello
Speaker B: Good Evening. You must be the painter of this room.
Speaker A: Yes sir! do you have a preferred color? (self act: get brush)
Speaker B: I am just a lowly worker. I think the red paint looks good. I can assist you with painting.
Speaker A: I will love that. Tell me, how long have you been here? (partner act: give brush to worker; self act: drop paint brushes)
Model: i've been here for a long time. i have been working here since i was a child.

Context Daily Dialog
Topic ordinary life
Emotion no emotion

Speaker A: So how did I do on my driving test ?
Speaker B: Do you want the truth ?
Speaker A: Of course , I do .
Speaker B: Well , you really didn't do all that well .
Speaker A: How bad did I do ?
Model: don't worry, i'll give you another try.

Context pushshift.io Reddit
Subreddit beer

Speaker A: In Santa Cruz? Be SURE to hit up Sante Adairius Rustic Ales in Capitola! If you like farmhouse, funky, sour ales, they are real tough to beat. Their IPAs and porters are also amazing. Another great new brewery is Discretion, in Capitola, and I am also a fan of Santa Cruz Mountain Brewing and Seabright Brewing for affordability's sake ($5 pitchers at Seabright on Tuesday/Thursday I believe?) but some beer snobs would turn their noses up at those places. Also, burger in Santa Cruz and Aptos always has a great taplist, as well as Aptos BBQ in Aptos.
Model: thanks for the suggestions! i'll have to check out sante adairius rustic ales. i've heard good things about seabright brewing, but haven't had a chance to try it yet. do you know much about it?

Context Twitter
No Context

Speaker A: reading now . this is absolutely fantastic .
Model: thank you so much. i'm glad you like it. i hope you enjoy it as much as i do.

Context Ubuntu
No Context

Speaker A: i set up my hd such that i have to type a passphrase to access it at boot. how can i remove that passwrd, and just boot up normal. i did this at install, it works fine, just tired of having reboots where i need to be at terminal to type passwd in. help?.
Speaker B: backup your data, and re-install without encryption "might" be the easiest method.
Model: i'll try that, but i'm not sure how to do it without encryption..

Context ELI5
Knowledge: < P > And it does nt come without questioning and critical thought . What we feel entitled to is a better world . And the thing is , we can see it , clearer and with broader scope than anyone before us , because all we know is the entire world , in real time . < P > Page 1 of 2 1 2 Next > Spycke83 PSLS Level : Bronze Messages : 52 Likes Received : 20 PSN : Spycke83 I'm wondering what you guys think of this newish trend of gamers feeling they are entitled certain things from a game . . .

Speaker A: Why do recent generations feel so entitled , like everyone owes them ? –T–
Model: it's not that they feel entitled, but that they don't have enough money to pay for it.

Context Image Chat
Image
Personality Emotional

Speaker A: Airplanes are scary to get on, you never know if it will crash or not.
Speaker B: But these are professional pilots though.
Model: i can't imagine being a professional pilot.

Context Image Chat
Image
Personality Playful

Speaker A: Great for night fishing.
Speaker B: Meh, I doubt any fish would bite anyway.
Model: i m sure there are plenty of fish that would bite.

Context Image Chat (First Turn)
Image
Personality Curious

Model: i wonder if this is a lake?

Context IGC
Image
Speaker A: I visited this forest on a walk.
Speaker B: Was this tree special?
Model: yes, it's one of the oldest trees in the world.

Context IGC