Multi-turn Dialogue Response Generation in an Adversarial Learning Framework
Oluwatobi Olabiyi, Alan Salimov, Anish Khazane, Erik T. Mueller
Oluwatobi Olabiyi
Capital One Conversation Research, Vienna, VA
[email protected]

Alan Salimov
Capital One Conversation Research, San Francisco, CA
[email protected]

Anish Khazane
Capital One Conversation Research, San Francisco, CA
[email protected]

Erik T. Mueller
Capital One Conversation Research, Vienna, VA
[email protected]
Abstract
We propose an adversarial learning approach for generating multi-turn dialogue responses. Our proposed framework, hredGAN, is based on conditional generative adversarial networks (GANs). The GAN's generator is a modified hierarchical recurrent encoder-decoder network (HRED) and the discriminator is a word-level bidirectional RNN that shares context and word embeddings with the generator. During inference, noise samples conditioned on the dialogue history are used to perturb the generator's latent space to generate several possible responses. The final response is the one ranked best by the discriminator. hredGAN shows improved performance over existing methods: (1) it generalizes better than networks trained using only the log-likelihood criterion, and (2) it generates longer, more informative, and more diverse responses with high utterance and topic relevance even with limited training data. This improvement is demonstrated on the Movie Triples and Ubuntu dialogue datasets using both automatic and human evaluations.
Introduction

Recent advances in deep neural network architectures have enabled tremendous success on a number of difficult machine learning problems. While these results are impressive, producing a deployable neural network–based model that can engage in open-domain conversation still remains elusive. A dialogue system needs to be able to generate meaningful and diverse responses that are simultaneously coherent with the input utterance and the overall dialogue topic. Unfortunately, earlier conversation models trained with naturalistic dialogue data suffered greatly from limited contextual information (Sutskever et al., 2014; Vinyals and Le, 2015) and lack of diversity (Li et al., 2016a). These problems often lead to generic and safe responses to a variety of input utterances.

Serban et al. (2016) and Xing et al. (2017) proposed the Hierarchical Recurrent Encoder-Decoder (HRED) network to capture long temporal dependencies in multi-turn conversations, addressing the limited contextual information, but the diversity problem remained. In contrast, some HRED variants such as variational (Serban et al., 2017b) and multi-resolution (Serban et al., 2017a) HREDs attempt to alleviate the diversity problem by injecting noise at the utterance level and by extracting additional context to condition the generator on. While these approaches achieve a certain measure of success over the basic HRED, the generated responses are still mostly generic, since they do not control the generator's output: the output conditional distribution is not calibrated. Li et al. (2016a), on the other hand, consider a diversity-promoting training objective, but their model is for single-turn conversations and cannot be trained end-to-end.

The generative adversarial network (GAN) (Goodfellow et al., 2014) seems to be an appropriate solution to the diversity problem. GAN matches data from two different distributions by introducing an adversarial game between a generator and a discriminator.
We explore hredGAN: conditional GANs for multi-turn dialogue models with an HRED generator and discriminator. hredGAN combines ideas from both generative and retrieval-based multi-turn dialogue systems to improve their individual performances. This is achieved by sharing the context and word embeddings between the generator and the discriminator, allowing for joint end-to-end training using back-propagation. To the best of our knowledge, no existing work has applied conditional GANs to multi-turn dialogue models, and especially not with HRED generators and discriminators. We demonstrate the effectiveness of hredGAN over the VHRED for dialogue modeling with evaluations on the Movie Triples and Ubuntu technical support datasets.

Related Work

Our work is related to end-to-end neural network–based open-domain dialogue models. Most neural dialogue models use transduction frameworks adapted from neural machine translation (Sutskever et al., 2014; Bahdanau et al., 2015). These Seq2Seq networks are trained end-to-end with MLE criteria using large corpora of human-to-human conversation data. Others use a GAN's discriminator as a reward function in a reinforcement learning framework (Yu et al., 2017) and in conjunction with MLE (Li et al., 2017; Che et al., 2017). Zhang et al. (2017) explored the idea of GAN with a feature matching criterion. Xu et al. (2017) and Zhang et al. (2018) employed GAN with an approximate embedding layer and with adversarial information maximization, respectively, to improve Seq2Seq's diversity performance.

Still, Seq2Seq models are limited in their ability to capture long temporal dependencies in multi-turn conversation. Although Li et al. (2016b) attempted to optimize a pair of Seq2Seq models for multi-turn dialogue, the multi-turn objective is only applied at inference and not used for actual model training. Hence the introduction of HRED models (Serban et al., 2016, 2017a,b; Xing et al., 2017) for modeling dialogue response in multi-turn conversations. However, these HRED models suffer from lack of diversity since they are only trained with MLE criteria. On the other hand, adversarial systems have been used for evaluating open-domain dialogue models (Bruni and Fernández, 2018; Kannan and Vinyals, 2017). Our work, hredGAN, is closest to the combination of HRED generation models (Serban et al., 2016) and adversarial evaluation (Kannan and Vinyals, 2017).
Model

Consider a dialogue consisting of a sequence of N utterances, x = (x_1, x_2, ..., x_N), where each utterance x_i = (x_i^1, x_i^2, ..., x_i^{M_i}) contains a variable-length sequence of M_i word tokens such that x_i^j ∈ V for vocabulary V. At any time step i, the dialogue history is given by x_i = (x_1, x_2, ..., x_i). The dialogue response generation task can be defined as follows: given a dialogue history x_i, generate a response y_i = (y_i^1, y_i^2, ..., y_i^{T_i}), where T_i is the number of generated tokens. We also want the distribution of the generated response, P(y_i), to be indistinguishable from that of the ground truth, P(x_{i+1}), with T_i = M_{i+1}.

A conditional GAN learns a mapping from an observed dialogue history, x_i, and a sequence of random noise vectors, z_i, to a sequence of output tokens, y_i, G : {x_i, z_i} → y_i. The generator G is trained to produce output sequences that cannot be distinguished from the ground truth sequence by an adversarially trained discriminator D that is trained to do well at detecting the generator's fakes. The distribution of the generator output sequence can be factored by the product rule:

$$P(y_i \mid x_i) = P(y_i^1) \prod_{j=2}^{T_i} P\big(y_i^j \mid y_i^1, \ldots, y_i^{j-1}, x_i\big) \qquad (1)$$

$$P\big(y_i^j \mid y_i^1, \ldots, y_i^{j-1}, x_i\big) = P_{\theta_G}\big(y_i^{1:j-1}, x_i\big) \qquad (2)$$

where y_i^{1:j-1} = (y_i^1, ..., y_i^{j-1}) and θ_G are the parameters of the generator model. P_{θ_G}(y_i^{1:j-1}, x_i) is an autoregressive generative model where the probability of the current token depends on the past generated sequence.
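To make the factorization in Eqs. (1)–(2) concrete, the sketch below (pure Python, with toy per-token conditionals standing in for the decoder's outputs; the function name is ours) computes a response probability as a product of conditionals via a sum of logs:

```python
import math

def sequence_log_prob(token_conditionals):
    """Autoregressive factorization of Eq. (1): the log-probability of a
    response is the sum of the logs of the per-token conditionals
    P(y_i^j | y_i^1..y_i^{j-1}, x_i) already evaluated by the decoder."""
    return sum(math.log(p) for p in token_conditionals)

# Toy response whose three token conditionals are 0.5, 0.25, and 0.8:
log_p = sequence_log_prob([0.5, 0.25, 0.8])
seq_prob = math.exp(log_p)  # equals 0.5 * 0.25 * 0.8 = 0.1
```

Summing in log space rather than multiplying raw probabilities avoids numerical underflow for long responses.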
Training the generator G with the log-likelihood criterion is unstable in practice, and therefore the past generated sequence is substituted with the ground truth, a method known as teacher forcing (Williams and Zipser, 1989), i.e.,

$$P\big(y_i^j \mid y_i^1, \ldots, y_i^{j-1}, x_i\big) \approx P_{\theta_G}\big(x_{i+1}^{1:j-1}, x_i\big) \qquad (3)$$

Using (3) in relation to the GAN, we define our fake sample as the teacher-forcing output with some input noise z_i,

$$y_i^j \sim P_{\theta_G}\big(x_{i+1}^{1:j-1}, x_i, z_i\big) \qquad (4)$$

and the corresponding real sample as the ground truth x_{i+1}^j. With the GAN objective, we can match the noise distribution, P(z_i), to the distribution of the ground truth response, P(x_{i+1} | x_i). Varying the noise input then allows us to generate diverse responses to the same dialogue history. Furthermore, the discriminator, since it is calibrated, is used during inference to rank the generated responses, providing a means of controlling the generator output.

The objective of a conditional GAN can be expressed as

$$\mathcal{L}_{cGAN}(G, D) = \mathbb{E}_{x_i, x_{i+1}}[\log D(x_{i+1}, x_i)] + \mathbb{E}_{x_i, z_i}[\log(1 - D(G(x_i, z_i), x_i))] \qquad (5)$$

where G tries to minimize this objective against an adversarial D that tries to maximize it:

$$G^*, D^* = \arg\min_G \max_D \mathcal{L}_{cGAN}(G, D). \qquad (6)$$

Previous approaches have shown that it is beneficial to mix the GAN objective with a more traditional loss such as cross-entropy loss (Lamb et al., 2016; Li et al., 2017). The discriminator's job remains unchanged, but the generator is tasked not only to fool the discriminator but also to be near the ground truth x_{i+1} in the cross-entropy sense:

$$\mathcal{L}_{MLE}(G) = \mathbb{E}_{x_i, x_{i+1}, z_i}\big[-\log P_{\theta_G}\big(x_{i+1}, x_i, z_i\big)\big]. \qquad (7)$$

Our final objective is

$$G^*, D^* = \arg\min_G \max_D \big(\lambda_G \mathcal{L}_{cGAN}(G, D) + \lambda_M \mathcal{L}_{MLE}(G)\big). \qquad (8)$$
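The mixed objective of Eqs. (5)–(8) can be illustrated with scalar stand-ins for the discriminator outputs and the MLE negative log-likelihood; this is a minimal sketch (function names are ours, and the generator term uses the common non-saturating form −log D(fake) rather than log(1 − D(fake))):

```python
import math

def cgan_objective(d_real, d_fake):
    """Per-sample value of the cGAN objective in Eq. (5):
    log D(x_{i+1}, x_i) + log(1 - D(G(x_i, z_i), x_i)).
    The discriminator ascends this quantity; the generator descends it."""
    return math.log(d_real) + math.log(1.0 - d_fake)

def generator_loss(d_fake, mle_nll, lam_g=1.0, lam_m=1.0):
    """Generator side of the mixed objective in Eq. (8): a weighted sum of
    an adversarial term (non-saturating -log D(fake)) and the
    cross-entropy/MLE term of Eq. (7)."""
    return lam_g * -math.log(d_fake) + lam_m * mle_nll
```

With λ_G = λ_M = 1 (the setting used for training below), a fake that fools the discriminator (d_fake near 1) contributes almost nothing beyond the MLE term, while an easily detected fake dominates the generator loss.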
It is worth mentioning that, without z_i, the net could still learn a mapping from x_i to y_i, but it would produce deterministic outputs and fail to match any distribution other than a delta function (Isola et al., 2017). This is one key area where our work differs from Lamb et al.'s and Li et al.'s. The schematic of the proposed hredGAN is depicted in Figure 1.

Generator

We adopted an HRED dialogue generator similar to Serban et al. (2016, 2017a,b) and Xing et al. (2017). The HRED contains three recurrent structures: the encoder (eRNN), context (cRNN), and decoder (dRNN) RNNs. The conditional probability modeled by the HRED per output word token is given by

$$P_{\theta_G}\big(y_i^j \mid x_{i+1}^{1:j-1}, x_i\big) = dRNN\big(E(x_{i+1}^{j-1}), h_i^{j-1}, \mathbf{h}_i\big) \qquad (9)$$

where E(·) is the embedding lookup, **h**_i = cRNN(eRNN(E(x_i)), **h**_{i−1}), eRNN(·) maps a sequence of input symbols into a fixed-length vector, and h and **h** are the hidden states of the decoder and context RNNs, respectively.

In the multi-resolution HRED (Serban et al., 2017a), high-level tokens are extracted and processed by another RNN to improve performance. We circumvent the need for this extra processing by allowing the decoder to attend to different parts of the input utterance during response generation (Bahdanau et al., 2015; Luong et al., 2015). We introduce a local attention into (9) and encode the attention memory differently from the context through an attention encoder RNN (aRNN), yielding:

$$P_{\theta_G}\big(y_i^j \mid x_{i+1}^{1:j-1}, x_i\big) = dRNN\big(E(x_{i+1}^{j-1}), h_i^{j-1}, a_i^j, \mathbf{h}_i\big) \qquad (10)$$

where $a_i^j = \sum_{m=1}^{M_i} \frac{\exp(\alpha_m)}{\sum_{m'=1}^{M_i} \exp(\alpha_{m'})} {h'}_i^m$, ${h'}_i^m = aRNN(E(x_i^m), {h'}_i^{m-1})$, h′ is the hidden state of the attention RNN, and α_m is either a logit projection of (h_i^{j−1}, h′_i^m) in the case of Bahdanau et al. (2015) or (h_i^{j−1})^T · h′_i^m in the case of Luong et al. (2015). The modified HRED architecture is shown in Figure 2.

Noise Injection:
We inject Gaussian noise at the input of the decoder RNN. Noise samples can be injected at the utterance or word level. With noise injection, the conditional probability of the decoder output becomes

$$P_{\theta_G}\big(y_i^j \mid x_{i+1}^{1:j-1}, z_i^j, x_i\big) = dRNN\big(E(x_{i+1}^{j-1}), h_i^{j-1}, a_i^j, z_i^j, \mathbf{h}_i\big) \qquad (11)$$

where z_i^j ∼ N_i(0, I) for utterance-level noise and z_i^j ∼ N_i^j(0, I) for word-level noise.

Figure 1: Left: The hredGAN architecture. The generator makes predictions conditioned on the dialogue history, **h**_i, attention, a_i^j, noise sample, z_i^j, and ground truth, x_{i+1}^{j−1}. Right: RNN-based discriminator that discriminates bidirectionally at the word level.

Figure 2: The HRED generator with local attention. The attention RNN ensures local relevance while the context RNN ensures global relevance. Their states are combined to initialize the decoder RNN and the discriminator BiRNN.

Discriminator

The discriminator shares context and word embeddings with the generator and can discriminate at the word level (Lamb et al., 2016). The word-level discrimination is achieved through a bidirectional RNN and is able to capture both syntactic and conceptual differences between the generator output and the ground truth. The aggregate classification of an input sequence χ can be factored over word-level discrimination and expressed as

$$D(x_i, \chi) = D(\mathbf{h}_i, \chi) = \bigg[\prod_{j=1}^{J} D_{RNN}\big(\mathbf{h}_i, E(\chi^j)\big)\bigg]^{1/J} \qquad (12)$$

where D_RNN(·) is the word discriminator RNN, **h**_i is an encoded vector of the dialogue history x_i obtained from the generator's cRNN(·) output, and χ^j is the jth word or token of the input sequence χ; χ = y_i and J = T_i for the generator's decoder output, and χ = x_{i+1} and J = M_{i+1} for the ground truth. The discriminator architecture is depicted in Figure 1.

Inference

In this section, we describe the generation process during inference. The generation objective can be mathematically described as

$$y_i^* = \arg\max_{l} \big\{ P(y_{i,l} \mid x_i) + D^*(x_i, y_{i,l}) \big\}_{l=1}^{L} \qquad (13)$$

where y_{i,l} = G*(x_i, z_{i,l}), z_{i,l} is the lth noise sample at dialogue step i, and L is the number of response samples. Equation (13) shows that our inference objective is the same as the training objective (8), combining both the MLE and adversarial criteria. This is in contrast to existing work, where the discriminator is usually discarded during inference.

The inference described by (13) is intractable due to the enormous search space of y_{i,l}. Therefore, we turn to an approximate solution where we use greedy decoding (MLE) on the first part of the objective function to generate L lists of responses based on noise samples {z_{i,l}}, l = 1..L. In order to facilitate the exploration of the generator's latent space, we sample a modified noise distribution, z_{i,l}^j ∼ N_{i,l}(0, αI) or z_{i,l}^j ∼ N_{i,l}^j(0, αI),

Algorithm 1: Adversarial Learning of hredGAN
Require: A generator G with parameters θ_G.
Require: A discriminator D with parameters θ_D.
for number of training iterations do
    Initialize cRNN to zero state, **h**_0.
    Sample a mini-batch of conversations, x = {x_i}, i = 1..N, x_i = (x_1, x_2, ..., x_i), with N utterances; utterance mini-batch i contains M_i word tokens.
    for i = 1 to N − 1 do
        Update the context state: **h**_i = cRNN(eRNN(E(x_i)), **h**_{i−1}).
        Compute the generator output using (11): P_θG(y_i | ·, z_i, x_i) = {P_θG(y_i^j | x_{i+1}^{1:j−1}, z_i^j, x_i)}, j = 1..M_{i+1}.
        Sample a corresponding mini-batch of utterances: y_i ∼ P_θG(y_i | ·, z_i, x_i).
    end for
    Compute the discriminator accuracy D_acc over the N − 1 utterances {y_i} and {x_{i+1}}.
    if D_acc < acc_D_th then
        Update θ_D with the gradient of the discriminator loss:
        Σ_i [∇_θD log D(**h**_i, x_{i+1}) + ∇_θD log(1 − D(**h**_i, y_i))].
    end if
    if D_acc < acc_G_th then
        Update θ_G with the generator's MLE loss only: Σ_i [∇_θG log P_θG(y_i | ·, z_i, x_i)].
    else
        Update θ_G with both adversarial and MLE losses:
        Σ_i [λ_G ∇_θG log D(**h**_i, y_i) + λ_M ∇_θG log P_θG(y_i | ·, z_i, x_i)].
    end if
end for

where α > 1 is the exploration factor that increases the noise variance. We then rank the L lists using the discriminator scores, {D*(x_i, y_{i,l})}, l = 1..L. The response with the highest discriminator ranking is the optimum response for the dialogue context.

Training

We trained both the generator and the discriminator simultaneously, as highlighted in Algorithm 1, with λ_G = λ_M = 1. GAN training is prone to instability due to competition between the generator and the discriminator. Therefore, parameter updates are conditioned on the discriminator performance (Lamb et al., 2016).

The generator consists of four RNNs with different parameters: aRNN, eRNN, cRNN, and dRNN. aRNN and eRNN are both bidirectional, while cRNN and dRNN are unidirectional. Each RNN has 3 layers, and the hidden state size is 512. The dRNN and aRNN are connected using an additive attention mechanism (Bahdanau et al., 2015). The discriminator shares aRNN, eRNN, and cRNN with the generator. D_RNN is a stacked bidirectional RNN with 3 layers and a hidden state size of 512. The cRNN states are used to initialize the states of D_RNN. The outputs of both the forward and backward cells for each word are concatenated and passed to a fully-connected layer with binary output. The output is the probability that the word is from the ground truth given the past and future words of the sequence.
Others: All RNNs used are gated recurrent unit (GRU) cells (Cho et al., 2014). The word embedding size is 512 and is shared between the generator and the discriminator. The initial learning rate is . with a decay rate factor of . , applied when the adversarial loss has increased over two iterations. We use a batch size of 64 and clip gradients around . . As in Lamb et al. (2016), we find acc_D_th = 0. and acc_G_th = 0. to suffice. All parameters are initialized with Xavier uniform random initialization (Glorot and Bengio, 2010). The vocabulary size V is 50,000. Due to the large vocabulary size, we use sampled softmax loss (Jean et al., 2015) for the MLE loss to expedite the training process. However, we use full softmax for evaluation. The model is trained end-to-end using the stochastic gradient descent algorithm.

Experiments

We consider the task of generating dialogue responses conditioned on the dialogue history and the current input utterance. We compare the proposed hredGAN model against some alternatives on publicly available datasets.

Movie Triples Corpus (MTC) dataset (Serban et al., 2016). This dataset was derived from the
Movie-DiC dataset by Banchs (2012). Although this dataset spans a wide range of topics with few spelling mistakes, its small size of only about 240,000 dialogue triples makes it difficult to train a dialogue model, as pointed out by Serban et al. (2016). We thought that this scenario would really benefit from the proposed adversarial generation.
Ubuntu Dialogue Corpus (UDC) dataset (Serban et al., 2017b). This dataset was extracted from the Ubuntu Relay Chat Channel. Although the topics in the dataset are not as diverse as in the MTC, the dataset is very large, containing about 1.85 million conversations with an average of 5 utterances per conversation.

We split both MTC and UDC into training, validation, and test sets, using 90%, 5%, and 5% proportions, respectively. We performed minimal preprocessing of the datasets by replacing all words except the top 50,000 most frequent words by an UNK symbol.
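The preprocessing step above (keep the 50,000 most frequent words, map everything else to UNK) can be sketched as follows; function names are ours:

```python
from collections import Counter

def build_vocab(corpus_tokens, max_size=50000):
    """Return the set of the `max_size` most frequent tokens in the corpus."""
    return {w for w, _ in Counter(corpus_tokens).most_common(max_size)}

def replace_rare(tokens, vocab, unk="UNK"):
    """Map any token outside the vocabulary to the UNK symbol."""
    return [t if t in vocab else unk for t in tokens]
```

For example, with a toy corpus and `max_size=2`, only the two most frequent words survive and all other tokens become UNK.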
Evaluation

Accurate evaluation of dialogue models is still an open challenge. In this paper, we employ both automatic and human evaluations.
Automatic Evaluation: We employed some of the automatic evaluation metrics used in probabilistic language and dialogue models and in statistical machine translation. Although these metrics may not correlate well with human judgment of dialogue responses (Liu et al., 2016), they provide a good baseline for comparing dialogue model performance.
Perplexity - For a model with parameters θ, we define perplexity as:

$$\exp\bigg[-\frac{1}{N_W}\sum_{k=1}^{K}\log P_{\theta}\big(y_1, y_2, \ldots, y_{N_k}\big)\bigg] \qquad (14)$$

where K is the number of conversations in the dataset, N_k is the number of utterances in conversation k, and N_W is the total number of word tokens in the entire dataset. The lower the perplexity, the better. The perplexity measures the likelihood of generating the ground truth given the model parameters. While a generative model can generate a diversity of responses, it should still assign a high probability to the ground truth utterance.

BLEU - The BLEU score (Papineni et al., 2002) provides a measure of overlap between the generated response (candidate) and the ground truth (reference) using a modified n-gram precision. According to Liu et al. (2016), the BLEU-2 score is fairly correlated with human judgment for non-technical dialogue (such as MTC).
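As a worked example, Eq. (14) can be computed directly from per-conversation log-probabilities; this is a minimal sketch with names of our choosing:

```python
import math

def corpus_perplexity(conversation_log_probs, total_word_tokens):
    """Eq. (14): exp(-(1/N_W) * sum_k log P_theta(conversation_k)), where
    `conversation_log_probs` holds one total log-probability per conversation
    and `total_word_tokens` is N_W, the word-token count of the dataset."""
    return math.exp(-sum(conversation_log_probs) / total_word_tokens)
```

For instance, a single two-word conversation assigned probability 0.25 gives a per-word perplexity of 2.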
ROUGE - The ROUGE score (Lin, 2004) is similar to BLEU but is recall-oriented instead. It is used for automatic evaluation of text summarization and machine translation. To complement the BLEU score, we use ROUGE-N with N = 2 for our evaluation.

Distinct n-gram - This is the fraction of unique n-grams in the generated responses, and it provides a measure of diversity. Models with a higher number of distinct n-grams tend to produce more diverse responses (Li et al., 2016a). For our evaluation, we use 1- and 2-grams.

Normalized Average Sequence Length (NASL) - This measures the average number of words in model-generated responses normalized by the average number of words in the ground truth.
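The two diversity/length metrics above can be sketched as follows (responses are token lists; function names are ours):

```python
def distinct_n(responses, n):
    """Fraction of unique n-grams over all n-grams in the generated responses."""
    ngrams = [tuple(toks[i:i + n])
              for toks in responses
              for i in range(len(toks) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

def nasl(generated, references):
    """Average generated length divided by average ground-truth length."""
    gen_avg = sum(len(r) for r in generated) / len(generated)
    ref_avg = sum(len(r) for r in references) / len(references)
    return gen_avg / ref_avg
```

A model that repeats the same response scores low on distinct-n, and a NASL well below 1 signals systematically short responses relative to the ground truth.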
Human Evaluation: For human evaluation, we follow a setup similar to Li et al. (2016a), employing crowd-sourced judges to evaluate a random selection of 200 samples. We presented both the multi-turn context and the generated responses from the models to 3 judges and asked them to rank the general response quality in terms of relevance and informativeness. For N models, the model with the lowest quality is assigned a score of 0 and the highest is assigned a score of N−1. Ties are not allowed. The scores are normalized between 0 and 1 and averaged over the total number of samples and judges. For each model, we also estimated the per-sample score variance between judges and then averaged over the number of samples, i.e., the sum of variances divided by the square of the number of samples (assuming sample independence). The square root of the result is reported as the standard error of the human judgment for the model.

Baselines: We compare the performance of our model to (V)HRED (Serban et al., 2016, 2017b), since they are the closest to our approach in implementation and are the current state of the art in open-domain dialogue models. HRED is very similar to our proposed generator, but without the input utterance attention and noise samples. VHRED introduces a latent variable to the HRED between the cRNN and the dRNN and was trained using the variational lower bound on the log-likelihood. VHRED can generate multiple responses per context like hredGAN, but it has no specific criterion for selecting the best response.

The HRED and VHRED models are both trained using the Theano-based implementation obtained from https://github.com/julianser/hed-dlg-truncated. The training and validation sets used for the UDC and MTC datasets were obtained directly from the authors of (V)HRED. For model comparison, we use a test set that is disjoint from the training and validation sets. UDC was obtained from , and the link to MTC was obtained privately.
Table 1: Generator performance evaluation on MTC and UDC. Columns: teacher-forcing Perplexity and −log D(G(·)); autoregressive BLEU-2, ROUGE-2, DISTINCT-1/2, and NASL; and human evaluation scores for HRED, VHRED, hredGAN_u, and hredGAN_w.
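The human-evaluation score normalization and standard error described above can be sketched as follows; the function name is ours, and the use of population variance for the between-judge variance is our assumption:

```python
import statistics

def human_eval_summary(rank_matrix, n_models):
    """`rank_matrix[s][j]` is the rank (0..n_models-1) that judge j gave this
    model on sample s. Normalize ranks to [0, 1], average over samples and
    judges, and report the standard error as
    sqrt(sum of per-sample between-judge variances / S^2)."""
    norm = [[r / (n_models - 1) for r in judges] for judges in rank_matrix]
    mean = sum(sum(js) / len(js) for js in norm) / len(norm)
    se = (sum(statistics.pvariance(js) for js in norm) / len(norm) ** 2) ** 0.5
    return mean, se
```

When all judges agree on every sample, the between-judge variances are zero and the reported standard error is zero.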
Results

We have two variants of hredGAN based on the noise injection approach: hredGAN with utterance-level noise injection (hredGAN_u) and with word-level noise injection (hredGAN_w). We compare the performance of these two variants with the HRED and VHRED models.

Perplexity: The average per-word perplexity of all four models on the MTC and UDC datasets (validation/test) is reported in the first column of Table 1. The table indicates that both variants of the hredGAN model perform better than the HRED and VHRED models in terms of the perplexity measure. However, using the adversarial loss criterion (Eq. (8)), the hredGAN_u model performs better on MTC and worse on UDC. Note that, for this experiment, we ran all models in teacher-forcing mode.
Generation Hyperparameter: For adversarial generation, we perform a linear search for α between 1 and 20 at increments of 1 using Eq. (13), with sample size L = 64, on the validation sets with the models run in autoregression. The optimum values of α for hredGAN_u and hredGAN_w for UDC are . and . , respectively. The values for MTC are not convex, probably due to the small size of the dataset, so we use the same α values as for UDC. We note, however, that for both datasets, any integer value between 3 and 10 (inclusive) works well in practice.

Quantitative Generator Performance: We run autoregressive inference for all the models (using the optimum α values for the hredGAN models and selecting the best of L = 64 responses using the discriminator) with dialogue contexts from a unique test set. We also compute the average BLEU-2, ROUGE-2(f1), Distinct(1/2), and normalized average sequence length (NASL) scores for each model and summarize the results in the middle of Table 1. Distinct(1/2) largely agrees with the perplexity score. Most scores, like the perplexity, indicate that the hredGAN models perform better than (V)HRED on both datasets. However, on UDC ROUGE and MTC BLEU, VHRED scores slightly better than hredGAN_u but still worse than hredGAN_w. A good dialogue model should find the right balance between precision (BLEU) and diversity. We strongly believe that our adversarial approach is better suited to solving this problem.

MTC
Context 0: perhaps <person> had a word with the man upstairs .
Context 1: a word ? i 'm sure by now he 's engineered a hostile takeover .
Response 0 (D = 0.996): <person> , i know what you 're saying , <person> , that 's not what i 'm saying .
Response 1 (D = 0.991): <person> , i know . i was just about to help the guy .
Response 2 (D = 0.315): <person> , i 'm sorry .
Response 3 (D = 0.203): <person> , i 'm a little out .
Context 0: says he wanted food . <person> . he wanted the gold .
Context 1: how 's he going to want the gold ? he couldn 't even know we had it .
Response 0 (D = 0.998): <person> , i know . but it 's not him , it 's the only way he 's got it all figured
Response 1 (D = 0.981): <person> , i know . but i have to tell you . these things are really stupid and you think i was wrong ?
Response 2 (D = 0.690): <person> , i 'm sure he did .
Response 3 (D = 0.314): <person> , i 'm not sure .

UDC
Context 0: The netboot one is suppose to download packages from the net.
Context 1: like the ones to be installed? or the installed to be run?
Response 0 (D = 0.993): you don 't need to install the whole system , just install the ubuntu installer
Response 1 (D = 0.952): you can install the ubuntu installer from the ubuntu menu
Response 2 (D = 0.749): I 'm not sure , I don 't know .
Response 3 (D = 0.184): you can
Response 4 (D = 0.003): you a . 1 . . . . . . . . . . . . . . . . . . . . .
Context 0: DJones: update manager won't detect 12.04.1 as a new version if you are already running 12.04, because 12.04.1 = 12.04 + lots of package updates
Response 0 (D = 0.991): did you try a clean install of the latest version ?
Response 1 (D = 0.981): try installing the latest UNK and see if it works
Response 2 (D = 0.615): I 'm not sure you have a problem . . .
Response 3 (D = 0.191): try sudo apt-get remove the package that is not installed
Response 4 (D = 0.002): try the UNK . 1 . 1 . 1 . 1 . 1 . UNK . deb

Table 2: Example of discriminator ranking with hredGAN_w. Each candidate response is shown with its discriminator score D(G(·)).

Movie
Context 0: perhaps <person> had a word with the man upstairs .
HRED: i 'm sorry , <person> .
VHRED: <person> ' t do it .
hredGAN_w: <person> , i 'm not gon na be okay . i just don 't think it 's a good idea .
hredGAN_u: <person> , i cause it was all the way to you .
Context 1: a word ? i 'm sure by now he 's engineered a hostile takeover .
HRED: i 'm sorry , <person> .
VHRED: <person> ' t do it .
hredGAN_w: <person> , i know what you 're saying , <person> , that 's not what i 'm saying .
hredGAN_u: <person> , i have to tell you i 'm not the same .
Context 0: <person> 's .
HRED: i 'm sorry , <person> .
VHRED: <person> ' t do it .
hredGAN_w: <person> , i 'm sure it 's the only thing i ever imagined .
hredGAN_u: <person> , i 'll be right there .
Context 1: can i ask you something about her ?
HRED: i don 't know .
VHRED: you 're not going to be here .
hredGAN_w: i don 't know . i think i 've heard of it .
hredGAN_u: i 'm sorry , i can 't .

Ubuntu
Context 0: this is Dacs i am in i used 'sudo statx and it let me in
HRED: what 's the problem ?
VHRED: I don't know what you want to do . if you want to do it , you need to do it to do it in the terminal ?
hredGAN_w: you don 't have to do anything , just type in the terminal and it should be to find the right device
hredGAN_u: you can have a look at the output of the command . . .
Context 1: good deal... cat /etc/X11/default-display-manager
HRED: what 's the problem ?
VHRED: do you know what you want to do ?
hredGAN_w: I 'm trying to figure a command that I can find to find out the file that I can find in the file"
hredGAN_u: I don 't see the point , but I 'm not sure how to do that .
Context 2: /usr/sbin/lightdm http://paste.ubuntu.com/1286224/ <—- my /etc/X11/xorg.conf
HRED: what 's the problem ?
VHRED: is there a way to do that in the terminal ?
hredGAN_w: did you just type \" sudo mount -a \" ?
hredGAN_u: i have no idea , i just installed ubuntu and i have no idea how to do that

Table 3: Sample responses of HRED, VHRED, and hredGAN.
As the hredGAN generators explore diversity, the discriminator ranking gives hredGAN an edge over (V)HRED because it helps detect responses that are out of context or violate natural language structure (Table 2). Also, the ROUGE(f1) performance indicates that hredGAN_w strikes a better balance between precision (BLEU) and diversity than the rest of the models. This is also obvious from the quality of the generated responses.
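The discriminator ranking used at inference, reduced to its selection step from Eq. (13), can be sketched as follows (the candidate/score pairing is supplied by the caller; the function name is ours):

```python
def select_best_response(candidates, disc_scores):
    """Among L candidate responses generated from different noise samples,
    return the one the discriminator scores highest (the ranking step used
    at inference)."""
    best = max(range(len(candidates)), key=lambda l: disc_scores[l])
    return candidates[best]
```

Applied to the examples in Table 2, this step filters out generic or degenerate candidates (e.g., "I 'm not sure , I don 't know .") in favor of longer, more informative ones.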
Qualitative Generator Performance: The results of the human evaluation are reported in the last column of Table 1. The human evaluation agrees largely with the automatic evaluation. hredGAN_w performs best on both datasets, although the gap is larger on MTC than on UDC. This implies that the improvement of HRED with adversarial generation is better than with variational generation (VHRED). In addition, the actual samples from the generator outputs in Table 3 show that hredGAN, especially hredGAN_w, performs better than (V)HRED. While other models produce short and generic utterances, hredGAN_w mostly yields informative responses. For example, in the first dialogue in Table 3, when the speaker is sarcastic about "the man upstairs", hredGAN_w responds with the most coherent utterance with respect to the dialogue history. We see similar behavior across other samples. We also note that although hredGAN_u's responses are the longest on Ubuntu (in line with the NASL score), the responses are less informative than hredGAN_w's, resulting in a lower human evaluation score. We reckon this might be due to a mismatch between utterance-level noise and word-level discrimination, or to a lack of capacity to capture the data distribution using a single noise distribution. We hope to investigate this further in the future.
Discriminator Performance: Although only hredGAN uses a discriminator, the observed discriminator behavior is interesting. We observe that the discriminator score is generally reasonable, with longer, more informative, and more persona-related responses receiving higher scores, as shown in Table 2. It is worth noting that this behavior, although similar to that of a human judge, is learned without supervision. Moreover, the discriminator seems to have learned to assign an average score to more frequent or generic responses such as "I don't know," "I'm not sure," and so on, and a high score to rarer answers. That is why we sample a modified noise distribution during inference, so that the generator can produce rarer utterances that will be scored highly by the discriminator.
In this paper, we have introduced an adversarial learning approach that addresses response diversity and control of generator outputs, using an HRED-derived generator and discriminator. The proposed system outperforms existing state-of-the-art (V)HRED models for generating responses in multi-turn dialogue with respect to both automatic and human evaluations. The performance improvement of adversarial generation (hredGAN) over variational generation (VHRED) comes from the combination of adversarial training and adversarial inference, which helps to address the lack of diversity and contextual relevance in maximum-likelihood-based generative dialogue models. Our analysis also indicates that word-level noise injection performs better in general.

References
D. Bahdanau, K. Cho, and Y. Bengio. 2015. Neuralmachine translation by jointly learning to align andtranslate. In
Proceedings of International Confer-ence of Learning Representation (ICLR 2015) .R. E. Banchs. 2012. Movie-dic: A movie dialogue cor-pus for research and development. In
Proceedingsof the 50th Annual Meeting of the Association forComputational Linguistics , pages 203–207.E. Bruni and R. Fernndez. 2018. Adversarial evalua-tion for open-domain dialogue generation. In
Pro-ceedings of the 18th Annual SIGdial Meeting .T. Che, Y. Li, R. Zhang, R. D. Hjelm, W. Li, Y. Song,and Y. Bengio. 2017. Maximum-likelihood aug-mented discrete generative adversarial networks. In arXiv preprint arXiv:1702.07983 .K. Cho, B. Merrienboer, C. Gulcehre, D. Bahdanau,F. Bougares, H. Schwenk, and Y. Bengio. 2014.Learning phrase representations using rnn encoder-decoder for statistical machine translation. In
Pro-ceedings of International Conference of LearningRepresentation (ICLR 2015) , pages 1724–1734.X. Glorot and Y. Bengio. 2010. Understanding the dif-ficulty of training deep feedforward neural networks.In
International conference on artificial intelligenceand statistics .I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu,D. Warde-Farley, S. Ozair, A. Courville, and Y. Ben-gio. 2014. Generative adversarial nets. In
Proceed-ings of Advances in Neural Information ProcessingSystems (NIPS 2014) .P. Isola, J. Y. Zhu, T. Zhou, and A. A. Efros. 2017.Image-to-image translation with conditional adver-sarial networks. In
Conference on Computer Visionand Pattern Recognition (CVPR, 2017) .S. Jean, K. Cho, R. Memisevic, and Y. Bengio.2015. On using very large target vocabularyfor neural machine translation. In arXiv preprintarXiv:1412.2007 .A. Kannan and O. Vinyals. 2017. Adversarial eval-uation of dialogue models. In arXiv preprintarXiv:1701.08198v1 .A. Lamb, A. Goyah, Y. Zhang, S. Zhang, A. Courville,and Y. Bengio. 2016. Professor forcing: A new al-gorithm for training recurrent networks. In
Proceed-ings of Advances in Neural Information ProcessingSystems (NIPS 2016) .J. Li, M. Galley, C. Brockett, J. Gao, and B. Dolan.2016a. A diversity-promoting objective functionfor neural conversation models. In
Proceedings ofNAACL-HLT .J. Li, W. Monroe, A. Ritter, M. Galley, J. Gao, andD. Jurafsky. 2016b. Deep reinforcement learn-ing for dialogue generation. In arXiv preprintarXiv:arXiv:arXiv:1606.01541v4 . J. Li, W. Monroe, T. Shi, A. Ritter, and D. Jurafsky.2017. Adversarial learning for neural dialogue gen-eration. In arXiv preprint arXiv:1701.06547 .C. Y. Lin. 2014. Rouge: a package for automatic evalu-ation of summaries. In
Proceedings of the Workshopon Text Summarization Branches Out .C. Liu, R. Lowe, I. V. Serban, M. Noseworthy, L. Char-lin, and J. Pineau. 2016. How not to evaluate yourdialogue system: An empirical study of unsuper-vised evaluation metrics for dialogue response gen-eration. In
Proceedings of EMNLP , pages 2122–2132.M. T. Luong, I. Sutskever, Q. V. Le, O. Vinyals, andW. Zaremba. 2015. Addressing the rare word prob-lem in neural machine translation. In
Proceedingsof the 53rd Annual Meeting of the Association forComputational Linguistics .K. Papineni, S. Roukos, T. Ward, and W. Zhu. 2002.Bleu: A method for automatic evalution of machinetranslation. In
Proceedings of the 40th Annual Meet-ing of the Association for Computational Linguis-tics , pages 311–318.I. Serban, A. Sordoni, Y. Bengio, A. Courville, andJ. Pineau. 2016. Building end-to-end dialogue sys-tems using generative hierarchical neural networkmodels. In
Proceedings of The Thirtieth AAAI Con-ference on Artificial Intelligence (AAAI 2016) , pages3776–3784.I. V. Serban, T. Klinger, G. Tesauro, K. Talamadupula,B. Zhou, Y. Bengio, and A. Courville. 2017a. Mul-tiresolution recurrent neural networks: An applica-tion to dialogue response generation. In
Proceed-ings of The Thirty-first AAAI Conference on Artifi-cial Intelligence (AAAI 2017) .I. V. Serban, A. Sordoni, R. Lowe, L. Charlin,J. Pineau, A. Courville, and Y. Bengio. 2017b. Ahierarchical latent variable encoder-decoder modelfor generating dialogue. In
Proceedings of TheThirty-first AAAI Conference on Artificial Intelli-gence (AAAI 2017) .I. Sutskever, O. Vinyals, and Q. Le. 2014. Sequenceto sequence learning with neural networks. In
Pro-ceedings of Advances in Neural Information Pro-cessing Systems (NIPS) , pages 3104–3112.O. Vinyals and Q. Le. 2015. A neural conversationalmodel. In
Proceedings of ICML Deep LearningWorkshop .R. J. Williams and D. Zipser. 1989. A learning algo-rithm for continually running fully recurrent neuralnetworks.
Neural computation , 1(2):270–280.C. Xing, W. Wu, Y. Wu, M. Zhou, Y. Huang, andW. Ma. 2017. Hierarchical recurrent attention net-work for response generation. In arXiv preprintarXiv:1701.07149 .. Xu, B. Liu, B. Wang, S. Chengjie, X. Wang,Z. Wang, and C. Qi. 2017. Neural response gener-ation via gan with an approximate embedding layer.In
EMNLP .L. Yu, W. Zhang, J. Wang, and Y. Yu. 2017. Seqgan:sequence generative adversarial nets with policy gra-dient. In
Proceedings of The Thirty-first AAAI Con-ference on Artificial Intelligence (AAAI 2017) .Y. Zhang, M. Galley, J. Gao, Z. Gan, X. Li, C. Brock-ett, and B. Dolan. 2018. Generating informativeand diverse conversational responses via adversar-ial information maximization. In arXiv preprintarXiv:arXiv:1809.05972v5 .Y. Zhang, Z. Gan, K. Fan, Z. Chen, R. Henao,D. Shen, and L. Carin. 2017. Adversarial featurematching for text generation. In arXiv preprintarXiv:1706.03850 . A Ablation Experiments
Before settling on the adversarial learning framework for multi-turn dialogue proposed above, we carried out several ablation experiments.
A.1 Generator:
We consider two main factors here: the addition of an attention memory and the injection of Gaussian noise into the generator input.
A.1.1 Addition of Attention Memory
First, we noted that adding an attention memory to the HRED generator improved the test set perplexity by more than 12 and 25 points on the MTC and UDC respectively, as shown in Table 4. The addition of attention also shows strong performance in autoregressive inference across multiple metrics, as well as an observed improvement in response quality. Hence the decision to use the modified HRED generator.
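To illustrate the role of the attention memory, the sketch below computes an attention-weighted context vector over the utterance encoder's hidden states. It uses simple dot-product scoring for brevity; the scoring function in the actual generator may be parameterized differently (e.g., additive, Bahdanau-style), so treat this as an assumption rather than the paper's exact mechanism.

```python
import math

def attention_context(decoder_state, encoder_states):
    """Minimal dot-product attention sketch.

    decoder_state:  list of d floats    current decoder hidden state
    encoder_states: list of T lists     utterance encoder hidden states (T x d)
    returns:        list of d floats    attention-weighted context vector
    """
    # Alignment scores: dot product of the decoder state with each encoder state.
    scores = [sum(d * e for d, e in zip(decoder_state, enc))
              for enc in encoder_states]
    # Softmax over the T scores (shift by the max for numerical stability).
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context vector: weighted sum of the encoder states.
    dim = len(decoder_state)
    return [sum(weights[t] * encoder_states[t][i]
                for t in range(len(encoder_states)))
            for i in range(dim)]
```

The resulting context vector is concatenated with the decoder input at each step, letting the decoder attend to the most relevant positions in the dialogue history rather than relying on a single fixed context summary.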
A.1.2 Injection of Noise
Before injecting noise into the generator, we first trained hredGAN without noise. The result is also reported in Table 4. We observe accelerated generator training but no appreciable improvement in performance. It seems the discrimination task is very easy since there is no stochasticity in the generator output; therefore, the adversarial feedback does not meaningfully impact the generator weight update. Finally, we also notice that even with noise injection, there is no appreciable improvement in autoregressive performance if we sample with L = 1, even though the perplexity is higher. However, as we increase L, producing L responses per turn, the discriminator's adversarial selection gives better performance, as reported in Table 1. Therefore, we conclude that the combination of adversarial training and adversarial inference helps to address the lack of diversity and contextual relevance observed in the generated responses.

A.2 Discriminator:
Before deciding on word-level discrimination, we experimented with utterance-level discrimination. The utterance-level discriminator trains very quickly, but it leads to mostly generic responses from the generator. We also note that utterance-level discriminator scores are mostly extreme (i.e., either low or high). Since we used a convolutional neural network discriminator (Yu et al., 2017) in these experiments, we hope to investigate this further with other architectures.
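The contrast between the two granularities can be illustrated with a toy scoring sketch (the numbers are hypothetical, not from a trained model): an utterance-level discriminator emits a single, often saturated judgment per response, while word-level discrimination emits one score per token, so a partially good response still yields a graded training signal.

```python
def utterance_level_score(per_word_quality):
    """Toy stand-in for a saturated utterance-level (CNN) discriminator:
    one thresholded judgment for the whole response, mimicking the
    extreme (near 0 or 1) scores we observed."""
    avg = sum(per_word_quality) / len(per_word_quality)
    return 1.0 if avg > 0.5 else 0.0

def word_level_score(per_word_quality):
    """Word-level discrimination aggregated by averaging per-token
    scores: partially good responses receive partial credit."""
    return sum(per_word_quality) / len(per_word_quality)

# A response whose first half is on-topic and whose second half drifts:
mixed = [0.9, 0.8, 0.2, 0.4]
```

Here `utterance_level_score(mixed)` collapses to 1.0 while `word_level_score(mixed)` is 0.575, preserving a graded, per-token signal for the generator update.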
A.3 Adversarial Training:
Lastly, we also tried a basic policy gradient approach (Li et al., 2017), where the word-level discriminator score is used as a reward for each generated word token, but this leads to training instability. This is probably due to the instability of Monte Carlo sampling over a large vocabulary. We believe this might improve with other sampling methods, such as importance sampling, and hope to investigate this further in the future.

Model                 Teacher Forcing   Autoregression
                      Perplexity        BLEU-2   ROUGE-2   DISTINCT-1/2    NASL

MTC
HRED                  31.92/36.00       0.0474   0.0384    0.0026/0.0056   0.535
HRED+Attn             18.70/19.02       0.0425   0.2239    0.0397/0.1567   0.527
hredGAN (no noise)    18.93/19.19       0.0355   0.1839    0.0272/0.0978   0.471

UDC
HRED                  69.39/86.40       0.0177   0.0483    0.0203/0.0466   0.892
HRED+Attn             43.43/43.92       0.0140   0.0720    0.0473/0.1262   0.760
hredGAN (no noise)    43.48/44.04       0.0123   0.0827    0.0398/0.1147   0.908

Table 4: Generator performance: HRED, HRED+Attn, and hredGAN without noise.

Context 0: < person > i ' ve raised my rate . i might even cut my sessions to forty minutes .
HRED: i ' m sorry , < person > .
VHRED: < person > , < person > , < person > , < person > .
hredGAN_w: < person > , i ' m sorry , but i ' m not sure if you ' re really a very good guy .
hredGAN_u: i ' ll never see them again , < person > .

Context 0: i ' m going to handle this exactly the way your father would have handled this . i ' ve got it solved .
Context 1 you ’ ve got it solved ? where is she ?
HRED i don ’ t know .
VHRED < person > , , i ’ m sorry. hredGAN w to the end of a beautiful woman who will be very happy . hredGAN u i ’ m sorry , i just didn ’ t know – Context 0 of course not ! ¡person¿ ’ s like a daughter to me . her father and itaught at university together . a good man , but he never listened .
HRED i ’ m sorry , < person > . VHRED < person > , mr . < person > . hredGAN w < person > , i know it ’ s been all six years since i told him aboutyou . hredGAN u i ’ m sure you didn ’ t , but i would have done a little more . Context 0 of course not ! ¡person¿ ’ s like a daughter to me . her father and itaught at university together . a good man , but he never listened .
Context 1 neither did i .
HRED i ’ m sorry , < person > . VHRED < person > , you ’ re a < unk > . hredGAN w < person > , i know you ’ re think about it . think it ’ ll be time , right? hredGAN u i ’ ll be the first one . if it did you , you ’ ll be it . Context 0 come on , < person > , we could have a great week up in vermont . Context 1 i see what you mean . but wednesday we ’ ve got to be back in newyork .
HRED < person > . VHRED < person > , i don ’ t know what you ’ re going to do . hredGAN w < person > , i know . i ’ m kind of busy here . hredGAN u i ’ m sorry , i can ’ t . Context 0 isabel ?
Context 1 hi .
HRED that ’ s what i ’ m talking about .
VHRED : you ’ re not going to be a < unk > , < person > . hredGAN w < person > ? hredGAN u what do you want to know ? Context 0 you understand mr . < person > is constantly barraged with nui-sance litigation and irs audits . Context 1 of course .
HRED that ’ s right , < person > . VHRED : < person > , you ’ re going to be a < unk > of the < unk > . hredGAN w < person > , i ’ m sure , but you ’ re probably still in to the new < person > ’ s house , huh ? hredGAN u you ’ re gon na be a little boy , and i ’ m not gon na get out ! Context 0 < person > ’ t go into my room , she ’ s asleep . Context 1 < person > , but the nanny , mrs . < person > , she wants to see her. HRED i ’ m sorry , < person > . VHRED : < person > , mr . < person > . hredGAN w < person > , i know . but she ’ s never seen me anyway . hredGAN u i ’ m sorry i didn ’ t know what to say . Context 0 what was that for ?
Context 1 i ’ d forgotten how exciting it is working together .
HRED i ’ m sorry , < person > . VHRED : < person > , < person > . i don ’ t know . hredGAN w i ’ m sorry , < person > . so many people are so interested , but you’ re a very nice guy . hredGAN u i ’ ve always been so sure to < person > . Context 0 a teacher ’ s .
Context 1 oh my god .
HRED i ’ m sorry , < person > . VHRED : < person > , < person > ’ t worry about it . i don ’ t know what you’ re doing . hredGAN w i ’ m sorry . i didn ’ t mean to . i didn ’ t mean to hurt you . i justwanted you hredGAN u < person > , i don ’ t want to be your friend . Context 0 i took the initiative .
Context 1 you haven ’ t been given an assignment in this case .
HRED < person > . VHRED : < person > , < person > , < person > , < person > . i ’ m not goingto get out of the < unk > . hredGAN w i ’ m not sure . < continued utterance > i ’ m not gon na do you thisshit anyway , < person > . hredGAN u < person > , i guess i didn ’ t . Table 5: Additional sample responses of HRED,VHRED and hredGAN on Movie.
Context 0: what's the command you are running to chroot and the error message you get in return?
Context 1: i mountded the device sudo mount /dev/sda1 /mnt - that works - then sudo chroot /mnt – error is no such file or command
HRED: what ' s the problem ?
VHRED: is there a way to do that ? what do you mean ?
hredGAN_w: I have no idea , I just did a sudo fdisk and it said it was mounted
hredGAN_u: I ' m trying to get the latest version of ubuntu , but I ' m getting a lot of problems .

Context 0: Hmm, have you checked that ICMP is being allowed all the way through and that the server responds to other ICMP requests?
Context 1: now i can ping it (i had to enable the lzo compression on client) i cannot reach the remote subnet adding a rule in the firewall forward the lan to vpn
HRED: what ' s the problem ?
VHRED: I don't know what you want to do
hredGAN_w: I have no idea , I ' m just trying to get the ip address from the router to the router"
hredGAN_u: I ' m not sure if that ' s the case , but I ' m not sure if it ' s a good idea to have a look at the output .

Context 0: The netboot one is suppose to download packages from the net.
HRED: I don't know what you want to do .
VHRED: how do I do that ?
hredGAN_w: I ' m not sure of a command to do so I can just install the ubuntu . iso and burn a new iso
hredGAN_u: you can ' t have a look at the man page . . .

Context 0: The netboot one is suppose to download packages from the net.
Context 1: like the ones to be installed? or the installed to be run?
HRED: I don't know what the problem is .
VHRED: is there a way to install the package manager ?
hredGAN_w: you don ' t need to install the whole system , just install the ubuntu installer
hredGAN_u: I ' m trying to install a UNK version of ubuntu , but I ' m trying to install it on my laptop .

Context 0: You trying to mount it open with the password?
HRED: I have to do that , but I have a problem with the same problem .
VHRED: is there a way to do ?
hredGAN_w: yes , I ' m not sure if it ' s mounted or not
hredGAN_u: I ' m trying to install the latest version of Ubuntu , but I ' m running on a UNK .

Context 0: are you using lvm for your root partition?
HRED: yes , but I have a problem with a new user , but I don't want to do that .
VHRED: what is the best way to do that ? what is that ?
hredGAN_w: yes , I ' m on a live system , I ' m trying to use the partition to boot from windows
hredGAN_u: I ' m not sure if it ' s a good idea to have a backup of the system . . .

Context 0: are you using lvm for your root partition?
Context 1: i have the same config working on a 10.04 box. It appears in 12.04 lvm is on upstart and multipath is not. i do use lvm for my root folder nifty, is there any way to work around it? have you heard of anyone posting a bug about this to have it looked at? darn.... well i hope they can fix this eventually i hope they keep initrd around and move everything to upstart. Upstart is not easy to setup for programs that don't come packaged for upstart.
HRED: I don't know what the problem is , but I don't know what the problem is . ?
VHRED: is there a way to do that ? or **unknown** ?
hredGAN_w: you can also try the live cd and see if you can get it back up the live"
hredGAN_u: I ' m trying to get my wireless working on my laptop , but I ' m getting to the point that I can be able to use the wireless card to get the wireless working .

Context 0: are you using lvm for your root partition?
Context 1: i have the same config working on a 10.04 box. It appears in 12.04 lvm is on upstart and multipath is not. i do use lvm for my root folder nifty, is there any way to work around it? have you heard of anyone posting a bug about this to have it looked at? darn.... well i hope they can fix this eventually i hope they keep initrd around and move everything to upstart. Upstart is not easy to setup for programs that don't come packaged for upstart.
Context 2: yes, because i cant supply it at boot time, i want the truecrypt drive to come up by itself without manual intervention
HRED: what is the problem ?
VHRED: what do you mean ?
hredGAN_w: you can also mount a new one and put the mount command to the mount point"
hredGAN_u: