Sketch-Fill-A-R: A Persona-Grounded Chit-Chat Generation Framework
Michael Shum, Stephan Zheng, Wojciech Kryściński, Caiming Xiong, Richard Socher
MIT, Salesforce Research
[email protected], {stephan.zheng,kryscinski,cxiong,rsocher}@salesforce.com

∗ Work done as an intern at Salesforce Research.

Abstract
Human-like chit-chat conversation requires agents to generate responses that are fluent, engaging and consistent. We propose Sketch-Fill-A-R, a framework that uses a persona-memory to generate chit-chat responses in three phases. First, it generates dynamic sketch responses with open slots. Second, it generates candidate responses by filling slots with parts of its stored persona traits. Lastly, it ranks and selects the final response via a language model score. Sketch-Fill-A-R outperforms a state-of-the-art baseline both quantitatively (10-point lower perplexity) and qualitatively (preferred by 55% heads-up in single-turn and 20% higher in consistency in multi-turn user studies) on the Persona-Chat dataset. Finally, we extensively analyze Sketch-Fill-A-R's responses and human feedback, and show it is more consistent and engaging by using more relevant responses and questions.
Introduction

Chit-chat is a rich domain that challenges machine learning models to express fluent natural language and to successfully interact with other agents. Chit-chat stands in contrast to goal-oriented dialogue, such as when a customer has the explicit goal of booking a flight ticket. When agents communicate, they each have internal state (e.g., their knowledge, intent) and typically have limited knowledge of the state of other agents (Chen et al., 2017). As a result, human-like chit-chat requires agents to be fluent, engaging and consistent with what has been said and their persona (Zhang et al., 2018).

These requirements make learning generative chit-chat models a complex task. First, given an existing conversation history, there may be a large number of valid responses (Vinyals and Le, 2015). Hence, supervised learning of chit-chat models that cover a large number of topics and styles requires a significant amount of data (Zhou et al., 2018). Second, as conversations progress and more opportunities for contradiction arise, maintaining consistency becomes more difficult (Serban et al., 2016, 2017). Third, engaging chit-chat responses follow conversational structures that are not captured well by perplexity (Dinan et al., 2019). Indeed, our human user studies show that both consistency and engagingness are only weakly correlated with perplexity, and fluency is not at all.

[Figure 1: Chit-chat generation with Sketch-Fill-A-R. The bot's persona traits are: 1. i love to drink wine and dance in the moonlight; 2. i am very strong for my age; 3. i'm 100 years old; 4. i feel like i might live forever. Given the conversation, the bot produces the sketch response "that is great ! i love to @persona and @persona ." and selects the filled response with the lowest language model perplexity: "that is great ! i love to drink and dance . [LM-score = 1.24]".]

We propose Sketch-Fill-A-R, a dialogue agent framework that can learn to generate fluent, consistent and engaging chit-chat responses. Our key motivation is the hypothesis that human-like chit-chat responses often 1) follow common conversational patterns with insertions of agent-specific traits, and 2) condition explicitly on those persona traits.

Sketch-Fill-A-R decomposes response generation into three phases: sketching, filling and ranking, see Figure 1. First, Sketch-Fill-A-R dynamically generates a sketch response with slots, which enables it to learn response patterns that are compatible with many specific persona traits. Second, it generates candidate responses by filling in slots with words stored in memory. This enables Sketch-Fill-A-R's responses to adhere to its persona. Third, the candidate responses are ranked by perplexity under a pre-trained language model (LM), which encourages the final response (with lowest LM perplexity) to be fluent.

In sum, our contributions are as follows:

• We describe Sketch-Fill-A-R and how its multi-phase generation process encourages fluency, consistency and engagingness.

• We show that Sketch-Fill-A-R significantly improves hold-out perplexity by ∼10 points on the Persona-Chat dataset over state-of-the-art baselines.
• We show Sketch-Fill-A-R is rated higher on conversational metrics and preferred over baselines in single- and multi-turn user studies.

• We extensively analyze Sketch-Fill-A-R's response statistics and human feedback, and show that it is more consistent by using a narrower set of responses, and more engaging by asking more questions than baselines.
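The three-phase decomposition (sketch, fill, rank) can be illustrated with a toy end-to-end sketch. The sketch generator, slot filler, and unigram LM below are hypothetical stand-ins; the real system uses an LSTM sketch decoder, a persona-memory, and a pretrained GPT ranker.

```python
import math
from itertools import permutations

def generate_sketches(history):
    """Phase 1: dynamic sketch responses with open @persona slots (hard-coded here)."""
    return ["that is great ! i love to @persona and @persona .",
            "cool ! i am a @persona @persona ."]

def fill_slots(sketch, rare_words):
    """Phase 2: fill @persona slots with rare words taken from the persona traits."""
    filled = []
    for w1, w2 in permutations(rare_words, 2):
        filled.append(sketch.replace("@persona", w1, 1).replace("@persona", w2, 1))
    return filled

def lm_perplexity(response, unigram_logprob):
    """Phase 3: perplexity under a toy unigram language model (stand-in for GPT)."""
    tokens = response.split()
    nll = -sum(unigram_logprob.get(t, math.log(1e-4)) for t in tokens) / len(tokens)
    return math.exp(nll)

rare_words = ["drink", "dance"]  # rare words from trait 1 in Figure 1
lm = {w: math.log(p) for w, p in
      {"i": 0.1, "love": 0.05, "to": 0.1, "drink": 0.02, "dance": 0.02}.items()}
candidates = [c for s in generate_sketches([]) for c in fill_slots(s, rare_words)]
best = min(candidates, key=lambda c: lm_perplexity(c, lm))  # lowest perplexity wins
```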
Related Work

Chit-chat Dialogue
Dialogue agents such as Amazon Alexa, Apple Siri, and Google Home are commonplace today, and are mainly task-oriented: they help users achieve specific tasks. On the other hand, Microsoft XiaoIce (Zhou et al., 2018) is an example of an undirected chit-chat dialogue agent. Historically, task-oriented dialogue systems are composed of components such as dialogue state tracking and natural language generation (Jurafsky and Martin, 2009). Even now, the natural language generation component often uses hand-crafted templates and rules defined by domain experts that are filled via heuristics (Gao et al., 2019). More recently, task-oriented dialogue systems have been trained end-to-end (Bordes et al., 2016), but these systems have specific user intents they aim to fulfill, and so represent a more constrained task. Early conversational dialogue systems such as ELIZA (Weizenbaum, 1966) and Alice (Wallace, 2009) were also based on hand-crafted rules and thus brittle. To alleviate this rigidity, more recent neural seq2seq models (Sutskever et al., 2014) are trained end-to-end (Vinyals and Le, 2015; Sordoni et al., 2015; Serban et al., 2017; Li et al., 2016). To help guide conversation, (Ghazvininejad et al., 2018; Dinan et al., 2018) incorporated knowledge-grounded datasets, while (Zhang et al., 2018) created the
Persona-Chat dataset used in this work. Sketch-Fill-A-R dynamically generates slot sketches and bears resemblance to (Wu et al., 2019), which assumed data are structured domain-specific triplets and contexts follow templates. However, Sketch-Fill-A-R does not assume the personas and responses have rigid syntactic structure, and introduces a ranking procedure. Converse to our sketch-and-fill procedure, (Qian et al., 2017) train a model to select a persona trait and decode around the trait. Finally, (Welleck et al., 2018) also re-rank by scoring utterances with Natural Language Inference to improve consistency.
Neural Sequence Models
Sketch-Fill-A-R extends a neural encoder-decoder structure (Sutskever et al., 2014) and is agnostic to the chosen form of encoder-decoder. We use recurrent models and attention (Bahdanau et al., 2014), which auto-regressively embed and generate sequences. Hence, our framework is general and is compatible with non-recurrent encoders and decoders, such as Transformer networks with non-recurrent self-attention (Vaswani et al., 2017; Devlin et al., 2018). Sketch-Fill-A-R uses a simple memory module to store words from personas, which act as context for generation. Weston et al. (2014) and Sukhbaatar et al. (2015) introduced learned Key-Value Memory Networks, while Kumar et al. (2016) introduced Dynamic Memory Networks for question-answering via an iterative attention over memory. Also, Sketch-Fill-A-R decodes responses using a re-ranking strategy based on language model scores, which complements strategies in (Kulikov et al., 2018).
Method

Our key motivation is to generate human-like chit-chat responses that are conditioned on persona-relevant information. Sketch-Fill-A-R generates chit-chat using a persona-memory to dynamically generate sketches that capture conversational patterns, and inserting persona-relevant information. To set notation: capitals
W, V, . . . denote matrices, i, j, k are vector-matrix indices, and x, y, . . . denote vectors. The model input at time t is x_t and the output at time u is y_u. We denote the conversation by x^c_t and persona trait words by x^p_t.

[Figure 2: Sketch-Fill-A-R generates sketch responses in 4 steps. First, it encodes the conversation history and the persona traits (e.g., "i am a stunt double as my second job", "i read twenty books a year") into hidden states. It stores word-embeddings for selected rare words from persona traits in a persona-memory. The final encoder hidden state produces a read-out vector h^mem. Lastly, the decoder outputs a sketch response with @persona slots (e.g., "that is great ! i am a @persona @persona .") using h^mem, the encoder hidden states, and attention over personas and conversation.]

Both
[Figure 3: Sketch-Fill-A-R inference strategy. During inference, Sketch-Fill-A-R uses beam search to generate B sketch responses (step 1; depicted with B = 2). In step 2, each beam with @persona slots produces candidate responses (e.g., "that is great ! i am a stunt double ." vs. "that is great ! i am a twenty double .") by filling it with rare words selected from the persona traits. Finally, a language model scores each candidate and Sketch-Fill-A-R returns the lowest-perplexity candidate.]

input and output words x_t, y_u ∈ {0, 1}^V are 1-hot vectors, where V denotes the vocabulary size. The vocabulary contains all unique words, punctuation and special symbols (e.g., EOS, @persona). x_{1:T} denotes a sequence (x_1, . . . , x_T).

Formally, we aim to learn a response generation model that predicts words y_u using a probability distribution P(y_{1:U} | x_{1:T}; θ) over sequences of T words and N persona traits with R rare words. Here U is the output sequence length and θ are the model weights. We mainly focus on deep neural networks, a model class that has recently seen great success in language generation tasks (Sutskever et al., 2014; Bahdanau et al., 2014).

Sketch-Fill-A-R composes several components to generate sketch responses:

• An encoder h^e_{1:T} = Enc(x_{1:T}; θ) that computes hidden representations of the input.

• A memory module h^mem = Mem(x_{1:R}; θ) that stores all rare words from persona traits (constructed by removing stop words).

• A language model P_LM(x_{t+1} | x_{1:t}; θ) that computes a distribution over next words.

• A sketch decoder h^d_{1:U} = Dec(h^e_{1:T}, h^mem; θ) that synthesizes both the encoded input and memory readouts, and predicts the next word ŷ_u in the sketch response.

We instantiate both encoder and decoder using recurrent neural networks. In this work, we use LSTMs (Hochreiter and Schmidhuber, 1997), although other choices are possible (Elman, 1990; Cho et al., 2014).
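The memory module's read-out, an attention over stored rare-word embeddings queried by the final conversation-encoder state, can be sketched in pure Python. The toy sizes and the random stand-in vectors below are assumptions for illustration, not the authors' implementation or learned parameters.

```python
import math
import random

random.seed(0)
d_hid, R = 8, 5  # toy hidden size and rare-word count (assumed)

# Stand-ins for learned quantities: embedded rare words and the final
# conversation-encoder state that serves as the read-out query.
rare_word_embs = [[random.gauss(0, 1) for _ in range(d_hid)] for _ in range(R)]
h_conv_T = [random.gauss(0, 1) for _ in range(d_hid)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Attention weights over rare words: softmax of query-embedding scores
scores = [dot(h_conv_T, e) for e in rare_word_embs]
m = max(scores)
exps = [math.exp(s - m) for s in scores]
p = [e / sum(exps) for e in exps]

# Read-out o is the attention-weighted sum of rare-word embeddings,
# and the memory vector adds it to the query state
o = [sum(p[r] * rare_word_embs[r][i] for r in range(R)) for i in range(d_hid)]
h_mem = [h + oi for h, oi in zip(h_conv_T, o)]
```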
The encoder computes hidden states h_{1:T} ∈ R^{d_hid} auto-regressively:

h^e_{t+1} = LSTM(e(x_t), h^e_t; θ),  (1)

where e(x_t) are word-embedding representations of the raw input tokens x_t. As such, Sketch-Fill-A-R encodes both the conversation history x^c_{1:T} and the individual persona traits x^p_{1:T} into hidden states h^conv_{1:T} and h^pers_{1:T}. We denote the final hidden states for all N personas as h^pers_{1:N}.

Memory Module

Sketch-Fill-A-R selects a subset of rare words x^p_r from the persona traits by removing stop-words, punctuation, and other symbols. After encoding the input dialogue, Sketch-Fill-A-R does a memory readout using the final conversation encoder hidden state h^conv_T as a query:

h^mem = h^conv_T + o,  (2)
o = Σ_r p_r x^p_r C,  (3)
p_r = σ((h^conv_T)^⊤ x^p_r C),  (4)

where r is a vector index over the rare-word memory, σ is a softmax activation function creating attention weights p_r, and C is a trainable embedding matrix with C ∈ R^{V × d_hid}.

Attention Decoder
The decoder is an LSTM which recursively computes hidden states h^d_u that are mapped into a distribution over output words:

h^d_{u+1} = LSTM(y_u, h^d_u; θ),  (5)
h^d_0 = f(W_d [h^e_T, h^mem] + b_d),  (6)
P(y_{u+1} | x_{1:T}, y_{1:u}) = σ(c_u W_emb).  (7)

At decoding time u + 1 the decoder computes the next hidden state h^d_{u+1} using the previously predicted word y_u and decoder hidden state h^d_u, in addition to attention over the context of the response, i.e., previous utterances and the agent's persona traits. W_d ∈ R^{2·d_hid × d_hid} projects [h^e_T, h^mem] down to the initial hidden state of the decoder, and W_emb ∈ R^{d_hid × V} is the transpose of the encoding embedding matrix. The decoding context c_u augments the decoder hidden state h^d_u with attention vectors c^conv_u over encoded conversation hidden states h^conv_{0:T} and c^pers_u over encoded persona hidden states h^pers_{0:N}:

c_u = f(W_ac [h^d_u, c^conv_u, c^pers_u] + b_ac),  (8)
c^conv_u = ⟨σ(⟨W_a h^d_u + b_a, h^conv_{0:T}⟩), h^conv_{0:T}⟩,  (9)
c^pers_u = ⟨w_{u,n}, h^pers_{0:N}⟩,  (10)
w_{u,n} = σ(⟨W_a h^d_u + b_a, h^pers_{0:N}⟩),  (11)

where f is a tanh, W_ac ∈ R^{3·d_hid × d_hid}, W_a ∈ R^{d_hid × d_hid} and σ is the softmax activation function. In Equations 9 and 11 the softmax is over the encoder time dimension and ⟨·, ·⟩ is an inner product.

Sketch-Fill-A-R trains the sketch-decoder outputs (Equation 7) by minimizing cross-entropy loss
[Table 1: parameter counts and validation perplexities. KVMemNet: 46.3M parameters; Sketch-Fill: 26.6M; Sketch-Fill-R: 26.6M; Sketch-Fill-A: 26.9M; Sketch-Fill-A-R: 26.9M. Perplexity values lost in extraction.]

Table 1: Sketch-Fill-A-R achieves significantly lower out-of-sample perplexity than KVMemNet. Perplexity for Sketch-Fill-A-R is measured over the sketch template responses. The number of persona tag occurrences is very small, constituting 8% of the total words. See Appendix for more information.
Sequence size    KVMemNet   Sketch-Fill-A-R (ours)
Unigram          5.39%      1.72%
Bigram           32.65%     7.32%
Trigram          54.95%     13.97%
Full responses   70.16%     50.60%

Table 2: Percentage of novel n-grams and full responses generated by the KVMemNet and Sketch-Fill-A-R models, computed on the full validation set.

with ground truths y*_u. However, during inference, Sketch-Fill-A-R uses an iterative generate-and-score approach to produce the final response:

1. Perform beam search with beam size B to generate B sketch responses {ŷ^b_{1:U_b}}_{b=1,...,B} that may contain @persona tags.

2. For each sketch with tags, select the persona i* with the highest attention weight w_{u*,i*} from the first sketch tag location u*, and construct B′ candidate responses by filling each @persona slot with words selected from i*.

3. Compute the perplexity s_b of all B′ candidate responses using a pre-trained language model:

s_b = exp( (1/T_b) Σ_{u=0}^{T_b} −log P_LM(y^b_u | y^b_{0:u−1}) ).
4. Select the final response b* = argmin_b s_b, i.e., the candidate with the lowest LM-perplexity score.

For models that do not use reranking to fill slots, we follow the methodology of (Wu et al., 2019) in using a global-to-local memory pointer network to fill slots. For details, see the Appendix.

Experiments

To validate Sketch-Fill-A-R, we first show that it achieves better supervised learning performance than baselines on a chit-chat dialogue dataset.

Persona-Chat Dataset

We trained Sketch-Fill-A-R to generate single-turn agent responses on the
Persona-Chat dataset (Zhang et al., 2018), which contains 10,907 dialogues. Here, a dialogue consists of multiple turns: a single turn contains the utterance of a single agent. We processed this dataset into training examples that each consist of the conversation history x^c_t, the set of persona traits x^p_t of the model, and the ground truth sketch response y_u. This process yielded 131,438 training examples. Rare words were identified by removing all punctuation and stop words from the set of persona traits (see Appendix for more information). Ground truth sketch responses were then constructed by replacing all rare-word instances in ground truth responses with @persona tags.

Language Model Pre-training

Sketch-Fill-A-R uses a Transformer-based GPT network (Radford et al., 2018) pre-trained on the Books text corpus (Zhu et al., 2015) to rank candidate responses with filled @persona slots according to their LM-perplexity scores. For model details, see the Appendix.
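The ranking step reduces to computing s_b = exp((1/T_b) Σ_u −log P_LM(y^b_u | y^b_<u)) for each filled candidate and keeping the minimum. The per-token log-probabilities below are made-up stand-ins for what a pretrained LM would return, chosen only to illustrate that an implausible filling ("twenty double") receives lower token probabilities.

```python
import math

def lm_score(token_logprobs):
    """Perplexity s_b = exp((1/T_b) * sum_u -log P_LM(y_u | y_<u))."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Hypothetical per-token log-probabilities for two filled candidates from
# Figure 3 (values invented for illustration, not real GPT outputs).
candidates = {
    "that is great ! i am a stunt double .":
        [-1.2, -0.8, -2.1, -1.5, -0.9, -1.1, -1.4, -2.3, -2.0, -1.0],
    "that is great ! i am a twenty double .":
        [-1.2, -0.8, -2.1, -1.5, -0.9, -1.1, -1.4, -5.6, -4.8, -1.0],
}
scores = {c: lm_score(lp) for c, lp in candidates.items()}
best = min(scores, key=scores.get)  # the fluent filling gets lower perplexity
```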
Experimental Setup
We compared 4 variations of Sketch-Fill-A-R with a strong baseline:

• Key-Value Memory Network (KVMemNet) (Zhang et al., 2018),
• Sketch-Fill (SF),
• Sketch-Fill-A: SF + attention,
• Sketch-Fill-R: SF + reranking,
• Sketch-Fill-A-R: SF + attention + reranking.

(Zhang et al., 2018) showed not only that models trained on Persona-Chat outperform models trained on other dialogue datasets (movies, Twitter) in engagingness, but also that KVMemNet outperforms vanilla Seq2Seq on Persona-Chat. As a result, we omit comparison with vanilla Seq2Seq. Further, KVMemNet is the strongest of the few public baselines available to compare against on chit-chat with personas. (A number of chit-chat models posted results in the ConvAI2 competition. However, we could not reproduce these, as all competitive methods rely on extensive pre-training with large models, or do not have code or trained models available.)

All Sketch-Fill-A-R models use language model reranking (see Section 3.2). All input tokens x^c_t, x^p_t were first encoded using 300-dimensional GloVe word embeddings e(x_t) (Pennington et al., 2014). All models were trained by minimizing the loss on the ground truth sketch response y*_{1:U}:

min_θ − Σ_{u=0}^{U} ⟨y*_u, log P(y_u | x_{1:T}, y_{0:u−1}; θ)⟩.  (12)

For training details, see the Appendix. The results are shown in Table 1. Sketch-Fill models outperform KVMemNet on validation perplexity, while using significantly fewer weights than KVMemNet. This suggests the structure of Sketch-Fill models fits well with chit-chat dialogue.

User Studies

Although Sketch-Fill models perform well quantitatively, a crucial test is to evaluate how well they perform when judged by human users on conversational quality, which is not explicitly captured by perplexity. We performed single- and multi-turn dialogue user studies to assess the quality of Sketch-Fill-A-R, rated along several dimensions:

• Fluency: whether responses are grammatically correct and sound natural.
• Consistency: whether responses do not contradict the previous conversation.

• Engagingness: how well responses fit the previous conversation and how likely the conversation would continue.

Our definition of engagingness includes relevance, defined in pragmatics and relevance theory (Wilson and Sperber, 2002; Grice, 1991) as a statement leading to positive cognitive effect. However, an engaging statement may be ironic (Sperber and Wilson, 1981), humorous, or otherwise specific to individuals.

We also explore which qualities of Sketch-Fill-A-R's outputs are correlated with human ratings and perplexity scores. Our results suggest that:

• Conditioning on persona-memory provides more consistency.

• Sketch-Fill-A-R poses more questions, which correlates with higher engagingness.

• Responses need to be fluent in order to be consistent or engaging. In addition, more consistent responses are more likely to be engaging.

• Perplexity is not correlated with high-quality responses.
[Table 3: single-turn user study ratings (consistency, engagingness, fluency) for four heads-up experiments: Test I KVMemNet vs. Sketch-Fill, Test II KVMemNet vs. Sketch-Fill-A, Test III KVMemNet vs. Sketch-Fill-R, Test IV KVMemNet vs. Sketch-Fill-A-R; numeric cells lost in extraction.]

Table 3: User study ratings of single-turn responses (score range 1 (lowest) to 5 (highest)). Each experiment showed generated responses from a Sketch-Fill-A-R variation and KVMemNet on 100 conversations to 5 human raters. Each row shows ratings from a single heads-up experiment. Sketch-Fill variants with reranking show a small gain over KVMemNet on all qualitative metrics, but the variance in the ratings is high. Sketch-Fill variants without reranking perform much worse, due to their responses not being fluent, despite achieving low perplexity (see Table 1).
[Table 4: human user A/B preference counts for KVMemNet vs. Sketch-Fill and KVMemNet vs. Sketch-Fill-A-R; most counts lost in extraction (one surviving count: 232).]

Table 4: Human user A/B-preferences on 100 conversations, each shown to 5 users. Two Sketch-Fill-A-R variations are preferred over KVMemNet.
              Fluency  Consistency  Engagingness  Perplexity
Fluency       1
Consistency   -        1            0.67          -0.20
Engagingness  -        -            1             -0.15
Perplexity    -        -            -             1

(The remaining Fluency-row values were lost in extraction.)

Table 5: Pearson's correlation ρ between human ratings and perplexity of user study examples. For visual KDE plots of the data, see the Appendix.

The studies were completed on 100 random examples sampled from the validation set, where each example was rated by 5 judges. Judges hired for the study came from English-speaking countries. As a calibration step, they were shown examples of good and bad responses in all of the measured dimensions before proceeding with the study.

The study was executed in two settings: fine-grained, where the judges were asked to rate the responses on a scale from 1 (lowest) to 5 (highest) for each of the mentioned dimensions, and binary, where they were asked to choose the response that would best fit the conversation.

The results of the fine-grained survey are presented in Table 3, where each row corresponds to a separate heads-up experiment in which the KVMemNet model was paired with one of the versions of Sketch-Fill-A-R. The study showed small gains on all metrics for all Sketch-Fill-A-R variations; however, the variance of results was high. We believe that this artifact could be caused by a number of factors, including subjective preferences of raters and potential ambiguities in the experiment descriptions. We notice that Sketch-Fill and Sketch-Fill-A reach lower perplexity values than KVMemNet, but comparatively have lower evaluations across the board. Conversely, ranking models like Sketch-Fill-R and Sketch-Fill-A-R have higher scores on all metrics. We observe that the difference is due to the ranker giving more fluent outputs via better selection of persona words to use.

Table 4 shows the results of the human study in the binary setting. In these experiments the base and attention-augmented versions of Sketch-Fill-A-R outperformed KVMemNet by a clear margin.

The following subsections present an in-depth analysis of the human study. The analysis focuses on the Sketch-Fill-A-R model, since it yielded the best perplexity and user study results.
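The correlations in Table 5 are standard Pearson coefficients, which can be computed as below. The rating tuples here are hypothetical illustrations, not the study's data.

```python
import math

def pearson(xs, ys):
    """Sample Pearson correlation coefficient rho between two rating lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical (consistency, engagingness) ratings from five judges,
# invented only to illustrate the computation.
consistency = [4, 3, 5, 2, 4]
engagingness = [4, 3, 4, 2, 5]
rho = pearson(consistency, engagingness)
```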
Correlation between ratings

To study and better understand the reasoning behind the ratings assigned by annotators, we look at the correlation between the different dimensions in which responses were scored. Figure 5 shows Kernel-Density-Estimation plots of the data points and associated Pearson correlation coefficients ρ. The data show weak to moderate correlations between fluency and consistency, and between fluency and engagingness, respectively. The data show a ρ value of 0.67 between engagingness and consistency ratings, suggesting strong correlation between those dimensions. See the Appendix for more detailed information. The numbers were obtained on human ratings of the Sketch-Fill-A-R model, but comparable numbers were also obtained for the KVMemNet model. The mentioned results follow intuition, as fluency of a response is a notion that can be easily defined and identified. On the other hand, consistency and engagingness are ambiguous, and possibly partially overlapping, concepts.

To associate quantitative metrics from Table 1 with human ratings, we computed the correlation between perplexity values from the sketch decoder of the Sketch-Fill-A-R model and human scores across different dimensions. The study showed no

Model Trait 1: I married a super model from Italy
Model Trait 2: I've zero family that I'm close to
Model Trait 3: My name is George
Model Trait 4: I'm a bee farmer
Model Trait 5: My favorite food is papaya

Human: hi whats up
Model: do you like papaya ?
Human: eh, i'm more into mango
Model: what is your favorite papaya ?
Human: probably just the normal ones.
Human: what do you like to do?
Model: i am a bee farmer .
Human: woah , that's cool
Model: how are you ?
Human: i'm doing well ,
Human: just trying to get some work done
Model: do you have any hobbies ?
Human: i like to play basketball , what about you?
Model: i love papaya food .
Human: what is your name?
Model: george . what is your favorite name ?
Table 6: Example multi-turn dialogue between a human and Sketch-Fill-A-R. Model persona traits are shown above; the human was not given specific persona traits.
Model Trait 1: I work for our local supermarketModel Trait 2: My favorite band is The WhoModel Trait 3: I have never been out of OhioModel Trait 4: My favorite food is pizza with black olives
Human: what are you doing today?
Model: i am ok what do you do for fun ?
Human: I like biking. Do you bike?
Model: cool ! i am watching a movie with my little girl
Human: You have a daughter? How old is she?
Model: oh that is cool . i do not have any myself .
Human: Oh. do you have a girlfriend then?
Model: she is an ex singer from the early 2000s ,
Model: mostly does pop songs and the like
Human: What kind of movies do you like?
Model: little girl and my mother lives with us
Model: she's a concert pianist, she in between jobs
Human: who is this little girl?
Model: she also likes karate
Model: she is why i got a brown belt !
Human: what is her name?
Model: she had a green thumb and loved flowers
Table 7: Example multi-turn dialogue between a human and KVMemNet. Model persona traits are shown above; the human was not given specific persona traits.
[Table 8: multi-turn user study ratings of fluency, consistency, and engagement for KVMemNet and Sketch-Fill-A-R; most numeric values lost in extraction, but consistency was 2.15 for KVMemNet vs. 3.72 for Sketch-Fill-A-R (see text).]

Table 8: Multi-turn user study ratings (score range 1 (lowest) to 5 (highest)). We collected conversations with 20 turns between human raters and models. KVMemNet is more fluent, but Sketch-Fill-A-R is more engaging and significantly more consistent.

Human: hi there . how are you
Model: hi good just writing some music and you
Human: i am well . just took my dog for a walk
KVMemNet: sorry , i trying to become the next big thing in music , i sing
Sketch-Fill-A-R: what kind of music do you like to do ?
Table 9: An example single-turn response for KVMemNet and Sketch-Fill-A-R given the same context.

correlation for fluency, and weak correlations for consistency (ρ = −0.20) and engagingness (ρ = −0.15).

Model vocabulary analysis
To assess the diversity of responses generated by the models, we calculated the percentage of unique n-grams and full responses present in the model outputs. Table 2 presents these values for KVMemNet and Sketch-Fill-A-R computed on the full validation set. The numbers show that the KVMemNet model clearly outperforms our model in terms of generating diverse and unique outputs, by a factor of 3-4x. However, we hypothesize that this additional diversity may lead to lower engagingness scores.

Consistency over time
In order to evaluate the models' capacity to stay consistent with their previous statements, and thus implicitly their ability to utilize information present in the chat history, we compared how the consistency rating changed as the number of lines of the conversation increased. Figure 4 visualizes this metric for both our model and KVMemNet. For both models, consistency decreases as the chat history gets longer, indicating that the models have problems keeping track of their previous statements. When analyzing the linear trend we noticed that the decrease in performance is slower for the Sketch-Fill-A-R model. We hypothesize that this effect can be partially caused by the high diversity of sequences generated by the KVMemNet, which in turn affects the model's ability to generate consistent conversation.
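One way to compute a novel n-gram percentage like the one reported in Table 2 is to count the n-grams that are unique across all generated responses; the paper's exact counting protocol may differ, so treat this as an illustrative diversity measure.

```python
def distinct_ngram_pct(responses, n):
    """Percentage of n-gram occurrences (over all responses) that are unique.
    A simple diversity measure; not necessarily the paper's exact protocol."""
    total, seen = 0, set()
    for r in responses:
        toks = r.split()
        for i in range(len(toks) - n + 1):
            seen.add(tuple(toks[i:i + n]))
            total += 1
    return 100.0 * len(seen) / total if total else 0.0

# Toy generated outputs: 12 unigram occurrences, 5 distinct unigrams
outputs = ["i love pizza .", "i love pasta .", "i love pizza ."]
uni = distinct_ngram_pct(outputs, 1)
```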
Effect of question responses
We hypothesize that for a conversation to be engaging, responses in chit-chat dialogue should be a mix of statements, where the model shares its persona information, and questions, where the model inquires about certain traits and information of the other agent. To confirm this intuition, we evaluated the effect that the presence of a question in the response has on the ratings coming from the judges. The results are presented in Figure 4c. The study showed that there is a strong correlation between the model asking a question and the users rating the response as
Figure 4: Impact of conversation length on the consistency of outputs generated by KVMemNet (left) and Sketch-Fill-A-R (middle). As conversation length increases (more dialogue turns) both models become less consistent, but KVMemNet degrades faster than Sketch-Fill-A-R. Right: impact of a response containing a question on human ratings. Responses including questions tend to receive higher human ratings.

more engaging. Asking questions has a small, but positive, influence on engagingness and fluency. To further analyze this aspect, we measured the frequency of questions in the set of 100 responses coming from the Sketch-Fill-A-R and KVMemNet models. We found that our model produced 49 question responses, of which 25 had both a statement and a question. In the same setting KVMemNet produced 15 questions, of which only 1 contained a statement and a question. This insight could explain the gains on the engagingness ratings found by our human study.
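The question-frequency counts above can be reproduced with simple heuristics; the statement-plus-question detector below is a rough assumption of ours, not the paper's annotation procedure.

```python
import re

def is_question(response):
    """A response counts as a question if it contains a question mark."""
    return "?" in response

def has_statement_and_question(response):
    """Crude check for a response mixing a statement with a question,
    splitting on terminal punctuation followed by whitespace."""
    parts = [p for p in re.split(r"(?<=[.!?])\s+", response.strip()) if p]
    has_q = any(p.endswith("?") for p in parts)
    has_s = any(not p.endswith("?") for p in parts)
    return has_q and has_s

responses = ["do you like papaya ?",
             "i am a bee farmer .",
             "that is great ! what about you ?"]
n_questions = sum(is_question(r) for r in responses)
n_mixed = sum(has_statement_and_question(r) for r in responses)
```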
Multi-Turn Evaluation

To evaluate both models in the more challenging multi-turn setting, we collected conversations that lasted 20 turns between each model and human users. Users were asked to score their conversations with the models on a scale from 1 (lowest) to 5 (highest) across the same dimensions as in the single-turn experiments. Table 8 shows the human ratings for both Sketch-Fill-A-R and KVMemNet. Both were judged as less fluent than in the single-turn case. This is likely due to the models having to respond to a range of conversation histories unseen during training.

Notably, Sketch-Fill-A-R outperformed KVMemNet on consistency by a significantly larger margin (3.72 vs 2.15) than in the single-turn setting. This suggests that Sketch-Fill-A-R benefits from conditioning response generation on its persona-memory and so adheres more closely to responses that are compatible with its persona. Further, Sketch-Fill-A-R is more engaging. This suggests that in the multi-turn setting there is also a positive correlation between engagingness and consistency, as in the single-turn case (see Appendix): consistent models can be more engaging as well.

Table 7 shows an example of KVMemNet's inconsistency. While every model utterance is fluent individually, KVMemNet noticeably contradicts itself in the context of previous utterances and frequently ignores the human responses (e.g., "i do not have any myself" after "my little girl"). We believe the lack of structure inherent in models built on vanilla Seq2Seq makes KVMemNet prone to this mistake. Table 6 shows Sketch-Fill-A-R conducts a more engaging conversation, with pertinent responses and questions. However, this structure can restrict Sketch-Fill-A-R, as sketches may be filled with incorrect persona traits (e.g., "i love papaya food ."). See the Appendix for more examples.

In our study we have identified several paths for future work.
First, our results show that perplexity does not strongly correlate with human judgment of response quality. Developing an automated metric that correlates well with human judgment is crucial, as human evaluation is expensive, time-consuming, and prone to inconsistencies. Second, despite outperforming other models on consistency and engagement in the multi-turn dialogue setting, our model has not reached human-like fluency. In order to demonstrate higher-level complex traits such as empathy, models must first master these lower-level abilities. Finally, correct usage of rare words and proper nouns leads to higher human scores. Existing models are unable to deal with out-of-vocabulary tokens and rare words gracefully, and incorporating commonsense via external knowledge bases or other methods will be useful.

Ethical Implications
During experiments, we identified a number of ethical implications for future work. The Persona-Chat dataset was noted by some raters to contain potentially inappropriate statements (e.g., "my wife spends all my money") and is based in US culture (e.g., food, music, cars, names). It also lacked content for failing gracefully when an appropriate response was unavailable (e.g., "I'm sorry, I don't understand," "I don't know"). As such, learned model responses were occasionally insensitive and confusing to human users.
References
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Antoine Bordes, Y-Lan Boureau, and Jason Weston. 2016. Learning end-to-end goal-oriented dialog. arXiv preprint arXiv:1605.07683.

Hongshen Chen, Xiaorui Liu, Dawei Yin, and Jiliang Tang. 2017. A survey on dialogue systems: Recent advances and new frontiers. ACM SIGKDD Explorations Newsletter, 19(2):25–35.

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Emily Dinan, Varvara Logacheva, Valentin Malykh, Alexander Miller, Kurt Shuster, Jack Urbanek, Douwe Kiela, Arthur Szlam, Iulian Serban, Ryan Lowe, et al. 2019. The second conversational intelligence challenge (ConvAI2). arXiv preprint arXiv:1902.00098.

Emily Dinan, Stephen Roller, Kurt Shuster, Angela Fan, Michael Auli, and Jason Weston. 2018. Wizard of Wikipedia: Knowledge-powered conversational agents. arXiv preprint arXiv:1811.01241.

Jeffrey L. Elman. 1990. Finding structure in time. Cognitive Science, 14(2):179–211.

Jianfeng Gao, Michel Galley, Lihong Li, et al. 2019. Neural approaches to conversational AI. Foundations and Trends in Information Retrieval, 13(2-3):127–298.

Marjan Ghazvininejad, Chris Brockett, Ming-Wei Chang, Bill Dolan, Jianfeng Gao, Wen-tau Yih, and Michel Galley. 2018. A knowledge-grounded neural conversation model. In Thirty-Second AAAI Conference on Artificial Intelligence.

H. Paul Grice. 1991. Studies in the Way of Words. Harvard University Press.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Dan Jurafsky and James H. Martin. 2009. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, 2nd Edition. Prentice Hall series in artificial intelligence. Prentice Hall, Pearson Education International.

Ilya Kulikov, Alexander H. Miller, Kyunghyun Cho, and Jason Weston. 2018. Importance of a search strategy in neural dialogue modelling. arXiv preprint arXiv:1811.00907.

Ankit Kumar, Ozan Irsoy, Peter Ondruska, Mohit Iyyer, James Bradbury, Ishaan Gulrajani, Victor Zhong, Romain Paulus, and Richard Socher. 2016. Ask me anything: Dynamic memory networks for natural language processing. In International Conference on Machine Learning, pages 1378–1387.

Jiwei Li, Michel Galley, Chris Brockett, Georgios P. Spithourakis, Jianfeng Gao, and Bill Dolan. 2016. A persona-based neural conversation model. arXiv preprint arXiv:1603.06155.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Qiao Qian, Minlie Huang, Haizhou Zhao, Jingfang Xu, and Xiaoyan Zhu. 2017. Assigning personality/identity to a chatting machine for coherent conversation generation.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. OpenAI.

Iulian Vlad Serban, Ryan Lowe, Laurent Charlin, and Joelle Pineau. 2016. Generative deep neural networks for dialogue: A short review. arXiv preprint arXiv:1611.06216.

Iulian Vlad Serban, Alessandro Sordoni, Ryan Lowe, Laurent Charlin, Joelle Pineau, Aaron Courville, and Yoshua Bengio. 2017. A hierarchical latent variable encoder-decoder model for generating dialogues. In Thirty-First AAAI Conference on Artificial Intelligence.

Alessandro Sordoni, Michel Galley, Michael Auli, Chris Brockett, Yangfeng Ji, Margaret Mitchell, Jian-Yun Nie, Jianfeng Gao, and Bill Dolan. 2015. A neural network approach to context-sensitive generation of conversational responses. arXiv preprint arXiv:1506.06714.

Dan Sperber and Deirdre Wilson. 1981. Irony and the use-mention distinction. Philosophy, 3:143–184.

Sainbayar Sukhbaatar, Jason Weston, Rob Fergus, et al. 2015. End-to-end memory networks. In Advances in Neural Information Processing Systems, pages 2440–2448.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Oriol Vinyals and Quoc Le. 2015. A neural conversational model. arXiv preprint arXiv:1506.05869.

Richard S. Wallace. 2009. The anatomy of ALICE. In Parsing the Turing Test, pages 181–210. Springer.

Joseph Weizenbaum et al. 1966. ELIZA—a computer program for the study of natural language communication between man and machine. Communications of the ACM, 9(1):36–45.

Sean Welleck, Jason Weston, Arthur Szlam, and Kyunghyun Cho. 2018. Dialogue natural language inference. arXiv preprint arXiv:1811.00671.

Jason Weston, Sumit Chopra, and Antoine Bordes. 2014. Memory networks. arXiv preprint arXiv:1410.3916.

Deirdre Wilson and Dan Sperber. 2002. Relevance theory.

Chien-Sheng Wu, Richard Socher, and Caiming Xiong. 2019. Global-to-local memory pointer networks for task-oriented dialogue. In International Conference on Learning Representations.

Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. 2018. Personalizing dialogue agents: I have a dog, do you have pets too? arXiv preprint arXiv:1801.07243.

Li Zhou, Jianfeng Gao, Di Li, and Heung-Yeung Shum. 2018. The design and implementation of XiaoIce, an empathetic social chatbot. arXiv preprint arXiv:1812.08989.

Yukun Zhu, Ryan Kiros, Richard Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. arXiv preprint arXiv:1506.06724.