NEXUS Network: Connecting the Preceding and the Following in Dialogue Generation
Hui Su∗, Xiaoyu Shen∗, Wenjie Li and Dietrich Klakow
Pattern Recognition Center, Wechat, Tencent, China
Max Planck Institute for Informatics, Saarland Informatics Campus, Germany
Spoken Language Systems (LSV), Saarland University, Germany
The Hong Kong Polytechnic University, Hong Kong
[email protected], [email protected]
Abstract
Sequence-to-Sequence (seq2seq) models have become overwhelmingly popular in building end-to-end trainable dialogue systems. Though highly efficient in learning the backbone of human-computer communications, they suffer from the problem of strongly favoring short generic responses. In this paper, we argue that a good response should smoothly connect both the preceding dialogue history and the following conversations. We strengthen this connection through mutual information maximization. To sidestep the non-differentiability of discrete natural language tokens, we introduce an auxiliary continuous code space and map such code space to a learnable prior distribution for generation purposes. Experiments on two dialogue datasets validate the effectiveness of our model, where the generated responses are closely related to the dialogue context and lead to more interactive conversations.
Introduction

With the availability of massive online conversational data, there has been a surge of interest in building open-domain chatbots with data-driven approaches. Recently, the neural network based sequence-to-sequence (seq2seq) framework (Sutskever et al., 2014; Cho et al., 2014) has been widely adopted. In such a model, the encoder, which is typically a recurrent neural network (RNN), maps the source tokens into a fixed-sized continuous vector, based on which the decoder estimates the probabilities on the target side word by word. The whole model can be efficiently trained by maximum likelihood estimation (MLE) and has demonstrated state-of-the-art performance in various domains. However, this architecture is not suitable for modeling dialogues. Recent research has found that while the seq2seq model generates syntactically well-formed responses, they are prone to being off-context, short, and generic (e.g., "I don't know" or "I am not sure") (Li et al., 2016a; Serban et al., 2016). The reason lies in the one-to-many alignments in human conversations, where one dialogue context is open to multiple potential responses. When optimizing with the MLE objective, the model tends to have a strong bias towards safe responses, as they can be literally paired with arbitrary dialogue context without semantic or grammatical contradictions. These safe responses break the dialogue flow without bringing any useful information, and people easily lose interest in continuing the conversation. In this paper, we propose NEXUS Network, which aims at producing more on-topic responses to maintain an interactive conversation flow.

∗ Indicates equal contribution. X. Shen focuses on algorithm and H. Su is responsible for experiments.

A: Do you know the movie Star Wars?
B: Only a bit. You can tell me about it!
A: Of course! This is about ...
Figure 1: A conversation in real life
Our assumption is that a good response should serve as a "nexus": connecting and being informative to both the preceding dialogue context and the follow-up conversations. For example, in Figure 1, the response from B is a smooth connection, where the first half indicates the preceding context is a "Do you know" question and the second half informs that the follow-up would be an introduction about Star Wars. We establish this connection by maximizing the mutual information (MMI) of the current utterance with both the past and future contexts. In this way, generic responses can be largely discouraged, as they contain no valuable information and thus have only weak correlations with the surrounding context. To enable efficient training, two challenges exist.

The first challenge comes from the discrete nature of language tokens, hindering efficient gradient descent. One strategy is to estimate the gradient by methods like Gumbel-Softmax (Maddison et al., 2017; Jang et al., 2017) or the REINFORCE algorithm (Williams, 1992), which has been applied in many NLP tasks (He et al., 2016; Shetty et al., 2017; Gu et al., 2018; Paulus et al., 2018), but the trade-off between bias and variance of the estimated gradient is hard to reconcile. The resulting model usually relies strongly on sensitive hyper-parameter tuning, careful pre-training and task-specific tricks. Li et al. (2016a); Wang et al. (2017) avoid this non-differentiability problem by learning a separate backward model to rerank candidate responses in the testing phase while still adhering to the MLE objective for training. However, the candidate set normally suffers from low diversity, and a huge sample size is needed for good performance (Li et al., 2016b).

The second challenge relates to the unknown future context in the testing phase. In our framework, both the history and future context need to be explicitly observed in order to compute the mutual information.
When applying it to generation tasks where only the history context is given, there is no way to explicitly take into account the future information. Therefore, reranking-based models do not apply here. Li et al. (2016c) address future information by policy learning, but the model suffers from high variance due to the enormous sequential search space. Serban et al. (2017); Zhao et al. (2017); Shen et al. (2017) adopt the variational inference strategy to reduce the training variance by optimizing over latent continuous variables. However, they all stick to the original MLE objective and no connection with the surrounding context is considered.

In this work, we address both challenges by introducing an auxiliary continuous code space which is learned from the whole dialogue flow. At each time step, instead of directly optimizing discrete utterances, the current, past and future utterances are all trained to maximize the mutual information with this code space. Furthermore, a learnable prior distribution is simultaneously optimized to predict the corresponding code space, enabling efficient sampling in the testing phase without getting access to the ground-truth future conversation. Extensive experiments have been conducted to validate the superiority of our framework. The generated responses clearly demonstrate better performance with respect to both coherence and diversity.

Let u_i be the i-th utterance within a dialogue flow. The dialogue history H_{i-1} contains all the preceding context u_1, u_2, ..., u_{i-1}, and F_{i+1} denotes the future conversations u_{i+1}, ..., u_T. The objective of our model is to find the decoding probability p_θ(u_i | H_{i-1}, F_{i+1}) that maximizes the mutual information I(H_{i-1}, u_i) and I(u_i, F_{i+1}). Formally, the objective is:

  max_θ  λ_1 I(H_{i-1}, u_i) + λ_2 I(u_i, F_{i+1}),   u_i ~ p_θ(u_i | H_{i-1}, F_{i+1})    (1)

λ_1 and λ_2 adjust the relative weights.
Mutual information is defined over p_θ(u_i | H_{i-1}, F_{i+1}) and the empirical distribution p(H_{i-1}, F_{i+1}). For now we assume the future context F_{i+1} is known to us when training the decoding probability; we will address the unknown-future problem later.

Directly optimizing this objective is unfortunately infeasible because the exact computation of mutual information is intractable, and backpropagating through sampled discrete sequences is notoriously difficult. The discontinuity prevents the direct application of the reparameterization trick (Kingma and Welling, 2014). Low-variance relaxations like Gumbel-Softmax (Jang et al., 2017), semantic hashing (Kaiser et al., 2018) or vector quantization (van den Oord et al., 2017) lead to biased gradient estimations, which accumulate as the sequence becomes longer. Monte Carlo simulation is unbiased but suffers from high variance. Designing a reasonable control variate for variance reduction is an extremely tricky task (Mnih and Gregor, 2014; Tucker et al., 2017). For this sake, we propose replacing u_i with a continuous code space c learned from the whole dialogue flow.

Figure 2: Framework of NEXUS Networks. The full line indicates the generative model that generates the continuous code and corresponding responses. The dashed line indicates the inference model, where the posterior code is trained to infer the history, current and future utterances. Both parts are simultaneously trained by gradient descent.

We define the continuous code space c to follow a Gaussian probability distribution with a diagonal covariance matrix conditioning on the whole dialogue:

  c ~ p_φ(c | H_{i-1}, F_i) = N(μ, σ²I | H_{i-1}, F_i)    (2)

The dialogue history H_{i-1} is encoded into a vector H̃_{i-1} by a forward hierarchical GRU model E_f as in (Serban et al., 2016). The future conversation, including the current utterance, is encoded into F̃_i by a backward hierarchical GRU E_b.
H̃_{i-1} and F̃_i are concatenated, and a multi-layer perceptron is built on top of them to estimate the Gaussian mean and covariance parameters. The code space is trained to infer the encoded history H̃_{i-1} and future F̃_{i+1}. The full optimizing objective is:

  L^(c) = max_φ E_{p_φ(H_{i-1}, F_i, c)} [λ_1 log p_φ(H̃_{i-1} | c) + λ_2 log p_φ(F̃_{i+1} | c)]
  p_φ(H_{i-1}, F_i, c) = p(H_{i-1}, F_i) p_φ(c | H_{i-1}, F_i)
  p_φ(H̃_{i-1} | c) = N(μ, σ²I | c),  p_φ(F̃_{i+1} | c) = N(μ, σ²I | c)    (3)

where H̃_{i-1} and F̃_{i+1} are also assumed to be Gaussian distributed given c, with mean and covariance estimated from multi-layer perceptrons. We infer the encoded vectors instead of the original sequences for three reasons. Firstly, inferring dense vectors is parallelizable and computationally much cheaper than autoregressive decoding, especially since the context sequences can be unlimitedly long. Secondly, sequence vectors can capture more holistic semantic-level similarity than individual tokens. Lastly, it can also help alleviate the posterior collapsing issue (Bowman et al., 2016) when training variational inference models on text (Chen et al., 2017; Shen et al., 2018), which we will use later. It can be shown that the above objective maximizes a lower bound of λ_1 I(H_{i-1}, c) + λ_2 I(c, F_{i+1}), given the conditional probability p_φ(c | H_{i-1}, F_i). The proof is a direct extension of the derivation in (Chen et al., 2016), followed by the Data Processing Inequality (Beaudry and Renner, 2012): the encoding function can only reduce the mutual information. As the sampling process contains only continuous Gaussian variables, the above objective can be trained through the reparameterization trick (Kingma and Welling, 2014), which is a low-variance, unbiased gradient estimator (Burda et al., 2015). After training, samples from p_φ(c | H_{i-1}, F_i) hold high mutual information with both the history and future context.
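To make the sampling step concrete, the reparameterization trick used above can be sketched in NumPy. This is an illustrative toy, not our implementation: the MLP below stands in for the 3-layer perceptrons in the paper, and all dimensions and parameter values are arbitrary stand-ins for the hierarchical GRU encodings.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, w1, b1, w2, b2):
    # small perceptron with tanh activation (the paper uses 3-layer MLPs)
    h = np.tanh(x @ w1 + b1)
    return h @ w2 + b2

def sample_code(h_enc, f_enc, params):
    """Sample c ~ N(mu, sigma^2 I) conditioned on the concatenated
    history/future encodings, via c = mu + sigma * eps (reparameterization)."""
    x = np.concatenate([h_enc, f_enc])
    out = mlp(x, *params)                 # predicts [mu, log sigma^2]
    mu, logvar = np.split(out, 2)
    eps = rng.standard_normal(mu.shape)   # noise is sampled outside the graph
    return mu + np.exp(0.5 * logvar) * eps

# toy dimensions: encodings of size 8, code of size 4 (the paper uses 100)
d_in, d_h, d_c = 16, 12, 4
params = (rng.standard_normal((d_in, d_h)) * 0.1, np.zeros(d_h),
          rng.standard_normal((d_h, 2 * d_c)) * 0.1, np.zeros(2 * d_c))
c = sample_code(rng.standard_normal(8), rng.standard_normal(8), params)
assert c.shape == (4,)
```

Because the randomness enters only through eps, gradients flow deterministically through mu and logvar, which is what makes the estimator low-variance and unbiased.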
The next step is then transferring the continuous code space to reasonable discrete natural language utterances. Our decoder transfers the code space c into the ground-truth utterance u_i by defining the probability distribution p(u_i | H_{i-1}, c), which is implemented as a GRU decoder going through u_i word by word to estimate the output probability. The encoded history H̃_{i-1} and code space c are concatenated as an extra input at each time step. The loss function for the decoder is then:

  L^(d) = max_φ E_{p_φ(H_{i-1}, F_i, c)} log p_φ(u_i | H_{i-1}, c)
  p_φ(H_{i-1}, F_i, c) = p(H_{i-1}, F_i) p_φ(c | H_{i-1}, F_i)    (4)

which can be proved to be a lower bound of the conditional mutual information I(u_i, c | H_{i-1}). By maximizing the conditional mutual information, c is trained to maintain as much information about the target sequence u_i as possible.

Combining Eq. 3 and 4, our model so far can be viewed as optimizing a lower bound of the following objective:

  max_φ  λ_1 I(H_{i-1}, c) + λ_2 I(c, F_{i+1}) + I(u_i, c | H_{i-1}),   c ~ p_φ(c | H_{i-1}, F_i)    (5)

Compared with the original motivation in Eq. 1, we sidestep the non-differentiability problem by replacing u_i with a continuous code space c, then forcing u_i to contain the same information as maintained in c by additionally maximizing the mutual information between them.

Nonetheless, Eq. 5 and Eq. 1 might lead to different optimums, as mutual information does not satisfy the transitive law. In the extreme case, different dimensions of c could individually maintain information about the history, current and future conversations while the conversations themselves share no dependency relation. To avoid this issue, we restrict the dimension of c to be smaller than that of the encoded vectors. In this case, optimizing Eq. 5 will favor utterances having stronger correlations with the surrounding context, to achieve a higher total mutual information.

The last problem is the sampling mechanism of c in Eq.
2, which conditions on the ground-truth future conversation. In the testing phase, when we have no access to it, we cannot perform the decoding process as in Eq. 4. To allow for decoding with only the history context, we need to learn an appropriate prior distribution p_θ(c | H_{i-1}) for c. In the ideal case, we would like

  p_θ(c | H_{i-1}) = Σ_{F_i} p(F_i | H_{i-1}) p_φ(c | H_{i-1}, F_i) = p_φ(c | H_{i-1})    (6)

However, p_φ(c | H_{i-1}) is intractable as it integrates over all possible future conversations. We apply variational inference on c to maximize the variational lower bound (Jordan et al., 1999):

  L^(p) = max_{θ,φ} E_{p_φ(c | H_{i-1}, F_i)} log p_θ(F̃_i | H_{i-1}, c) − KL(p_φ(c | H_{i-1}, F_i) || p_θ(c | H_{i-1}))
  p_θ(F̃_i | H_{i-1}, c) ~ N(μ, σ²I | H_{i-1}, c)
  p_θ(c | H_{i-1}) ~ N(μ, σ²I | H_{i-1})    (7)

It can be reformulated as maximizing:

  −E_{p_φ(c | H_{i-1})} KL(p_φ(F̃_i | H_{i-1}, c) || p_θ(F̃_i | H_{i-1}, c)) − KL(p_φ(c | H_{i-1}) || p_θ(c | H_{i-1}))    (8)

We can see it implicitly matches p_φ(c | H_{i-1}) to a tractable Gaussian distribution p_θ(c | H_{i-1}) by minimizing the KL divergence between them. It also functions as a regularizer to prevent overfitting when learning p_φ(c | H_{i-1}, F_i). In the testing phase, we can sample c from the learned prior distribution p_θ(c | H_{i-1}), then generate a response based on it.

To sum up, the total objective function of our model is:

  L = L^(c) + L^(d) + L^(p)    (9)

Weighting can be added to the individual loss functions for better performance, but we find it enough to maintain equal weights and avoid extra hyper-parameters. All the parameters are simultaneously updated by gradient descent except for the encoders E_f and E_b, which only accept gradients from L^(d), since otherwise the model can easily learn to encode no information for a lower reconstruction loss in L^(c) and L^(p). An overview of our training procedure is depicted in Fig. 2.

MMI decoding
The MMI decoder was proposed by Li et al. (2016a) and further extended in Wang et al. (2017). The basic idea is the same as in our model: maximizing the mutual information with the dialogue context. However, the MMI principle is applied only at the testing phase rather than the training phase. As a result, it can only be used to evaluate the quality of a generation by estimating its mutual information with the context. To apply it in a generative task, we have to first sample some candidate responses with the seq2seq model, then rerank them by accounting for the MMI score. Our model differs in that we directly estimate the decoding probability, so no post-sampling reranking is needed. Moreover, we further include the future context to strengthen the connecting role of the current utterances.
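The sample-then-rerank procedure can be illustrated with a small sketch. Here `log_p_fwd` and `log_p_bwd` are placeholders for the forward seq2seq model and the separately trained backward model; the toy scorer below is only a word-overlap heuristic standing in for real model likelihoods.

```python
import math

def mmi_rerank(context, candidates, log_p_fwd, log_p_bwd, lam=0.5):
    """Rerank seq2seq samples by log p(r|c) + lam * log p(c|r),
    the bidirectional-MMI criterion used at test time."""
    scored = [(log_p_fwd(context, r) + lam * log_p_bwd(r, context), r)
              for r in candidates]
    scored.sort(reverse=True)
    return [r for _, r in scored]

# toy stand-in: score by word overlap with a mild length penalty
# (NOT a trained model; for illustration only)
def toy_lp(src, tgt):
    overlap = len(set(src.split()) & set(tgt.split()))
    return math.log(1 + overlap) - 0.01 * len(tgt.split())

ranked = mmi_rerank("do you know star wars",
                    ["i am not sure", "yes star wars is a great movie"],
                    toy_lp, toy_lp)
# the generic "i am not sure" carries no context information and is demoted
```

The key point the sketch makes is that the backward term log p(c|r) is what penalizes generic responses: a safe reply gives the backward model no evidence about the context.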
Conditional Variational Autoencoder
The idea of learning an appropriate prior distribution in Eq. 7 is essentially a conditional variational autoencoder (Sohn et al., 2015), where the accumulated posterior distribution is trained to stay close to a prior distribution. It has also been applied in dialogue generation (Serban et al., 2017; Zhao et al., 2017). However, all the above methods stick to the MLE objective function and do not optimize with respect to the mutual information. As we will show in the experiments, they fail to learn the correlation between the utterance and its surrounding context. The generation diversity of these models comes more from the sampling randomness of the prior distribution than from a correct understanding of context correlation. Moreover, they suffer from the posterior collapsing problem (Bowman et al., 2016) and require special tricks like KL-annealing, BOW loss or word drop-out (Shen et al., 2018). Our model does not have such problems.
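For reference, the KL term that matches the posterior code to the prior in Eq. 7 has a closed form when both distributions are diagonal Gaussians. A sketch (parameterized by log-variances, a common implementation choice; the function name is ours):

```python
import numpy as np

def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    """Closed-form KL( N(mu_q, diag sig_q^2) || N(mu_p, diag sig_p^2) ),
    the penalty keeping the posterior code near the learned prior."""
    var_q, var_p = np.exp(logvar_q), np.exp(logvar_p)
    return 0.5 * np.sum(logvar_p - logvar_q
                        + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

z = np.zeros(4)
# identical distributions give zero divergence
assert abs(kl_diag_gaussians(z, z, z, z)) < 1e-12
# shifting the posterior mean by 1 with unit variances costs 0.5 per dimension
assert abs(kl_diag_gaussians(z + 1.0, z, z, z) - 2.0) < 1e-9
```

Because this term is analytic, no sampling is needed for the regularizer itself; only the reconstruction term in Eq. 7 requires reparameterized samples.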
Deep Reinforcement Learning Dialogue Generation

Li et al. (2016c) first considered future success in dialogue generation and applied deep reinforcement learning to encourage more interactive conversations. However, the reward functions are intuitively hand-crafted. The relative weight for each reward needs to be carefully tuned, and the training stage is unstable due to the huge search space. In contrast, our model maximizes the mutual information in the continuous space and trains the prior distribution through the reparameterization trick. As a result, our model can be more easily trained with a lower variance. Throughout our experiments, the training process of NEXUS network is rather stable and much less data-hungry. The MMI objective of our model is theoretically more sound, and no manually-defined rules need to be specified.
We run experiments on the DailyDialog (Li et al., 2017b) and Twitter corpus (Ritter et al., 2011). DailyDialog contains 13,118 daily conversations under ten different topics. This dataset is crawled from various websites for English learners to practice English in daily life; it is high-quality and less noisy, but relatively small. In contrast, the Twitter corpus is significantly larger but contains more noise. We obtain the dataset as used in Serban et al. (2017) and filter out tweets that have already been deleted, resulting in about 750,000 multi-turn dialogues. The contents have more informal, colloquial expressions, which makes the generation task harder. These two datasets are randomly separated into training/validation/test sets with the ratio of 10:1:1.

In order to keep our model comparable with the state-of-the-art, we keep most parameter values the same as in (Serban et al., 2017). We build our vocabulary dictionary based on the most frequent 20,000 words for both corpora and map other words to a UNK token. The dimensionality of the code space c is 100. We use a learning rate of 0.001 for DailyDialog and 0.0002 for the Twitter corpus. The batch size is fixed to 128. The word vector dimension is 300 and is initialized with the public Word2Vec (Mikolov et al., 2013) embeddings trained on the Google News Corpus. The probability estimators for the Gaussian distributions are implemented as 3-layer perceptrons with the hyperbolic tangent activation function. As mentioned above, when training NEXUS models, we block the gradients from L^(c) and L^(p) with respect to E_f and E_b to encourage more meaningful encodings. The UNK token is prevented from being generated in the test phase. We implemented all the models with the open-source Python library PyTorch (Paszke et al., 2017) and optimized using the Adam optimizer (Kingma and Ba, 2015).

We conduct extensive experiments to compare our model against several representative baselines.
Seq2Seq: Following the same implementation as in (Vinyals and Le, 2015), the seq2seq model serves as a baseline. We try both greedy decoding and beam search (Graves, 2012), with the beam size set to 5 when testing.

MMI: We implemented the bidirectional-MMI decoder as in Li et al. (2016a), which showed better performance than the anti-LM model. The hyperparameter λ is set to 0.5 as suggested. 200 candidates per context are sampled for re-ranking.

VHRED: The VHRED model is essentially a conditional variational autoencoder with hierarchical encoders (Serban et al., 2017; Zhao et al., 2017). To alleviate the posterior collapsing problem, we apply the KL-annealing trick and early stopping, with the step set to 12,000 for DailyDialog and 75,000 for the Twitter corpus.

RL: Deep reinforcement learning chatbot as in (Li et al., 2016c). We use all three reward functions mentioned in the paper and keep the relative weights the same as in the original paper. The policy network is initialized with the above-mentioned MMI model.

NEXUS-H: NEXUS network maximizing mutual information only with the history (λ_2 = 0).

NEXUS-F: NEXUS network maximizing mutual information only with the future (λ_1 = 0).

NEXUS: NEXUS network maximizing mutual information with both the history and future.

NEXUS-H and NEXUS-F are implemented to help us better analyze the effects of the different components in our model. The hyperparameters λ_1 and λ_2 in NEXUS are set to 0.5 and 1 respectively, as we find the history vector is consistently easier to reconstruct than the future vector (A.6).

We conducted three embedding-based evaluations (average, greedy and extrema) (Liu et al., 2016), which map responses into vector space and compute the cosine similarity (Rus and Lintean, 2012). The embedding-based metrics can to a large extent capture the semantic-level similarity between generated responses and the ground truth. We represent words using Word2Vec embeddings trained on the Google News Corpus.

Model     DailyDialog                    Twitter
          Average  Greedy  Extreme       Average  Greedy  Extreme
Greedy    0.443    0.376   0.328         0.510    0.341   0.356
Beam      0.437    0.350   0.369         0.505    0.345   0.352
MMI       0.457    0.371   0.371         0.518    0.353   0.365
RL        0.405    0.329   0.305         0.460    0.349   0.323
VHRED
NEXUS-H
NEXUS-F
NEXUS
Table 1: Results of embedding-based metrics. * indicates a statistically significant difference (p < . ) from the best baselines. The same mark is used in Table 2.
We also measure the uncertainty of the score by assuming each data point is independently Gaussian distributed. The standard deviation yields the confidence interval (Barany et al., 2007). Table 1 reports the embedding scores on both datasets. NEXUS network significantly outperforms the best baseline model in most cases. Notably, NEXUS can absorb the advantages of both NEXUS-H and NEXUS-F. The history and future information seem to help the model from different perspectives. Taking both of them into account does not create a conflict, and the combination leads to an overall improvement. RL performs rather poorly on this metric, which is understandable as it does not target the ground-truth responses during training (Li et al., 2016c).
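The three embedding-based metrics can be sketched as follows. This is one plausible formulation under our reading of Liu et al. (2016); implementations differ in details such as averaging the greedy match over both directions, and the helper names are ours.

```python
import numpy as np

def _cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def embedding_metrics(hyp, ref, emb):
    """Average, greedy and extrema embedding similarity between a
    generated token list `hyp` and a reference `ref`, given word
    vectors `emb` (dict: word -> np.ndarray)."""
    H = np.array([emb[w] for w in hyp if w in emb])
    R = np.array([emb[w] for w in ref if w in emb])
    # average: cosine between mean word vectors
    average = _cos(H.mean(0), R.mean(0))
    # greedy: each hyp word matched to its most similar ref word
    # (one direction shown for brevity)
    sims = np.array([[_cos(h, r) for r in R] for h in H])
    greedy = float(sims.max(axis=1).mean())
    # extrema: keep the most extreme value per dimension across words
    def extrema(M):
        mx, mn = M.max(0), M.min(0)
        return np.where(np.abs(mx) >= np.abs(mn), mx, mn)
    return average, greedy, _cos(extrema(H), extrema(R))
```

On identical hypothesis and reference all three scores are (numerically close to) 1, which is a convenient sanity check when wiring the metric up.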
BLEU Score
BLEU is a popular metric that measures the geometric mean of the modified n-gram precision with a length penalty (Papineni et al., 2002). Table 2 reports the BLEU 1-3 scores. Compared with the embedding-based metrics, the BLEU score quantifies the word overlap between generated responses and the ground truth. One challenge of evaluating dialogue generation by BLEU score is the difficulty of accessing multiple references for the one-to-many alignment relation. Following Sordoni et al. (2015); Zhao et al. (2017); Shen et al. (2018), for each context, 10 more candidate references are acquired by using information retrieval methods (see Appendix A.4 for more details). All candidates are then passed to human annotators to filter unsuitable ones, resulting in 6.74 and 5.13 references for the DailyDialog and Twitter datasets respectively. The human annotation is costly, so we evaluate on 1000 sampled test cases for each dataset. As the BLEU score is not the simple mean of individual sentence scores, we compute the significance interval by bootstrap resampling (Koehn, 2004; Riezler and Maxwell, 2005). As can be seen, NEXUS network achieves best or near-best performance with only greedy decoders. NEXUS-H generally outperforms NEXUS-F, as the connection with the future context is not explicitly addressed by the BLEU score metric. MMI and VHRED bring minor improvements over the seq2seq model. Even when evaluated on multiple references, RL still performs worse than most models.

Model     DailyDialog                    Twitter
          BLEU-1  BLEU-2  BLEU-3         BLEU-1  BLEU-2  BLEU-3
Greedy    0.394   0.245   0.157          0.340   0.203   0.116
Beam      0.386   0.251   0.163          0.338   0.205   0.112
MMI       0.407   0.269   0.172          0.347   0.208   0.118
RL        0.298   0.186   0.075          0.314   0.199   0.103
VHRED     0.395
NEXUS-H
NEXUS-F
NEXUS     0.424*
Table 2: Results of BLEU score. It is computed based on the smoothed BLEU algorithm (Lin and Och, 2004). The p-value interval is computed based on the altered bootstrap resampling algorithm (Riezler and Maxwell, 2005).
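The paired bootstrap resampling behind such significance estimates can be sketched as follows. For brevity this toy averages per-sentence scores; the actual corpus-level BLEU recomputes the statistic on each resample rather than averaging, and the function name is ours.

```python
import random

def bootstrap_win_rate(scores_a, scores_b, n_boot=1000, seed=0):
    """Paired bootstrap: fraction of resamples (drawn with replacement
    over the same test items) in which system A's total score beats B's.
    A win rate near 1.0 suggests a significant difference."""
    rng = random.Random(seed)
    n, wins = len(scores_a), 0
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]   # resample test items
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins += 1
    return wins / n_boot
```

Because the resampling is paired (the same item indices are used for both systems), per-item difficulty cancels out, which makes the comparison far more sensitive than resampling each system independently.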
Connecting the preceding

We define two metrics to evaluate the model's capability of "connecting the preceding context": AdverSuc and Neg-PMI. AdverSuc measures the coherence of generated responses with the provided context by learning an adversarial discriminator (Li et al., 2017a) on the same corpus to distinguish coherent responses from randomly sampled ones. We encode the context and response separately with two different LSTM neural networks and output a binary signal indicating coherent or not. The AdverSuc value is reported as the success rate at which the model fools the classifier into believing its false generations are coherent (p(generated = coherent) > . ). Neg-PMI measures the negative pointwise mutual information value −log [p(c | r)/p(c)] between the generated response r and the dialogue context c. p(c | r) is estimated by training a separate backward seq2seq model. As p(c) is a constant, we ignore it and only report the value of −log p(c | r). A good model should achieve a higher AdverSuc and a lower Neg-PMI.

We apply the same architecture as in Lu et al. (2017). In our experiment, the discriminator performs reasonably well in the 4 scenarios outlined in Li et al. (2017a) and thus can be used as a fair evaluation metric.

The results are listed in Table 3. We can see there is still a big gap between ground-truth and synthesized responses. As expected, NEXUS-H leads to the most significant improvement. The MMI model also performs remarkably well, but it requires post-reranking, so the sampling process is much slower. VHRED and NEXUS-F do not help much here, sometimes even slightly degrading the performance. We also tried removing the history context when computing the posterior distribution in VHRED; the resulting model has similar performance on all metrics, which suggests VHRED itself cannot actually learn the correlation pattern with the preceding context. Surprisingly, though RL explicitly sets the coherence score as a reward function, its performance is far from satisfying. We assume RL requires much more data than the other models to learn an appropriate policy, and its training process suffers from a higher variance. The result is thus hard to guarantee.

Connecting the following
We measure the model's capability of "connecting the following context" from two perspectives: the number of simulated turns and the diversity of generated responses. We apply all models to generate multiple turns until a generic response is reached. The set of generic responses is manually examined to include all utterances providing only passive dull replies. The number of generated turns can reflect how long a model can maintain an interactive conversation. The results are reflected in Table 3. As in (Li et al., 2016a), we measure the diversity by the percentage of distinct unigrams (Distinct-1) and bigrams (Distinct-2) in all generated responses. Intuitively, a higher score on these three metrics implies a more interactive generation system that can better connect the future context.

Table 3: Coherence, diversity and human evaluations. Left: DailyDialog results, right: Twitter results.

We use a simple rule-matching method to detect generic responses (see Appendix A.5). We manually inspect it on a validation subset and find the accuracy is more than 90%. Similar methods are adopted in (Li et al., 2016c).

Again, NEXUS network dominates most fields. NEXUS-F brings more impact than NEXUS-H, as it explicitly encourages more interactive turns. Most seq2seq models fail to provide an informative response in the first turn. The MMI decoder does not change much, possibly because the sampling space is not large enough; a more diverse sampling mechanism (Vijayakumar et al., 2018) might help. NEXUS network can effectively continue the conversation for 2.8 turns on DailyDialog and 2.5 turns on Twitter, which is closest to the ground truth (4.8 and 4.0 turns respectively). It also achieves the best diversity score on both datasets. It is worth mentioning that NEXUS-H also improves over the baselines, though not as significantly as NEXUS-F, so NEXUS is not a trade-off but more like an enhanced version of NEXUS-H and NEXUS-F.

In summary, NEXUS network clearly generates higher-quality responses in both coherence and diversity, even on a rather small dataset like DailyDialog. NEXUS-H contributes more to the coherence and NEXUS-F more to the diversity.
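Distinct-1 and Distinct-2 are straightforward to compute; a sketch of the standard definition (ratio of unique n-grams to total n-grams over all generated responses):

```python
def distinct_n(responses, n):
    """Distinct-n (Li et al., 2016a): unique n-grams divided by total
    n-grams across all generated responses."""
    grams = [tuple(toks[i:i + n])
             for r in responses
             for toks in [r.split()]
             for i in range(len(toks) - n + 1)]
    return len(set(grams)) / max(len(grams), 1)

# a model that repeats itself is penalized directly
assert distinct_n(["i do not know", "i do not know"], 1) == 4 / 8
assert distinct_n(["a b c"], 2) == 1.0
```

Note the metric is computed over the pooled set of responses, so a model that answers every context with the same safe reply scores low even if each reply is individually fluent.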
We also employed crowdsourced judges to provide evaluations for a random sample of 500 items in the DailyDialog test set. Participants are asked to assign a binary score to each context-response pair from three perspectives: whether the response coincides with its preceding context (Pri), whether the response is interesting enough for people to continue (Post) and whether the response itself is a fluent natural sentence (Flu). Each sample gets one point if judged as yes and zero otherwise. Each pair is judged by three participants and the score supported by most people is adopted. We also evaluated the inter-annotator consistency by Fleiss' kappa score (Fleiss, 1971) and obtained scores of 0.452 for Pri, 0.459 for Post (moderate agreement) and 0.621 for Flu (substantial agreement), which implies most context-response pairs reach a consensus on the evaluation task. We compute the average human score for each model. Unlike metric-based scores, the human evaluation is conducted only on the DailyDialog corpus, as it contains less noise and can be more fairly evaluated by human judges. Table 3 shows the result in the last three columns. As can be seen, the Pri and Post human scores are highly correlated with the automatic evaluation metrics for coherence and diversity.
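The Fleiss' kappa used for inter-annotator consistency follows directly from its definition; a sketch (the function name and table layout are ours):

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for a table where ratings[i][j] = number of
    annotators assigning item i to category j (here: 3 annotators,
    binary yes/no judgments)."""
    N = len(ratings)                      # number of items
    n = sum(ratings[0])                   # annotators per item
    k = len(ratings[0])                   # number of categories
    # marginal category proportions
    p_j = [sum(row[j] for row in ratings) / (N * n) for j in range(k)]
    # per-item observed agreement
    P_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in ratings]
    P_bar = sum(P_i) / N                  # mean observed agreement
    P_e = sum(p * p for p in p_j)         # chance agreement
    return (P_bar - P_e) / (1 - P_e)

# perfect agreement across mixed categories yields kappa = 1
assert abs(fleiss_kappa([[3, 0], [0, 3]]) - 1.0) < 1e-9
```

Values around 0.4-0.6 count as moderate agreement and above 0.6 as substantial on the conventional interpretation scale, matching how the scores above are described.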
Context: Hi, Alice, what are you going to do this Sunday? eou
  MMI: Oh, that's great!
  VHRED: Well, it seems that I'm going to be late.
  RL: How long he it be?
  NEXUS: Nothing really, what's up?

Context: Did you watch the Oscars on Sunday night? eou
  MMI: I have no idea.
  VHRED: That's not too bad.
  RL: It's my mum's birthday on Sunday.
  NEXUS: No, nothing much. Could you help me?

Context: Let's go for a drink! eou
  MMI: Thanks.
  VHRED: all right. let me see you then.
  RL: Thank you inviting me.
  NEXUS: Sounds good. Where are we going?

Context: You won't want to miss our webinar tomorrow! eou
  MMI: lol!
  VHRED: That's terrific!
  RL: Sure, where where?
  NEXUS: Thanks for your invitation! I'm free!

Context: How I read the article? eou
  MMI: there!!!
  VHRED: What's good with it?
  RL: I dont know what to do with it!
  NEXUS: Maybe the force is yourself!

Table 4: Examples of context-response pairs. eou denotes end-of-utterance. The first three rows are from DailyDialog and the last two rows are from Twitter.
Table 4 presents sampled responses from the MMI, VHRED, RL and NEXUS models. We see NEXUS network does generate more interactive outputs than the other three. Though reranked by the bidirectional language model, the MMI decoder still produces quite a few generic responses. VHRED's utterances are more diverse, but it only cares about answering the immediate query and makes no effort to bring up further topics. Moreover, it also generates more inappropriate responses than the others. RL provides diverse responses, but they are sometimes not fluent or coherent enough. We do observe that NEXUS sometimes generates over-complex questions which are not very natural, as in the second example. But in most cases, it outperforms the others.
In this paper, we propose the "NEXUS Network" to enable more interactive human-computer conversations. The main goal of our model is to strengthen the "nexus" role of the current utterance, connecting both the preceding and the following dialogue context. We compare our model with MMI, reinforcement learning and CVAE-based models. Experiments show that the NEXUS network consistently produces higher-quality responses. The model is easy to train, requires no special tricks and demonstrates remarkable generalization capability even on a very small dataset. Our model can be viewed as combining the objectives of MMI and CVAE and is compatible with current improving techniques. For example, mutual information can be maximized under a tighter bound using the Donsker-Varadhan or f-divergence representation (Donsker and Varadhan, 1983; Nowozin et al., 2016; Belghazi et al., 2018). Extending the code space distribution beyond the Gaussian via importance weighted autoencoders (Burda et al., 2015), inverse autoregressive flow (Kingma et al., 2016) or the VampPrior (Tomczak and Welling, 2018) should also help the performance.
Acknowledgments
We thank all anonymous reviewers, Gerhard Weikum, Jie Zhou, Cheng Niu and the dialogue system team of Wechat AI for valuable comments. Xiaoyu Shen is supported by an IMPRS-CS fellowship. This work is partially funded by the DFG collaborative research center SFB 1102 and the Research Grants Council of Hong Kong (PolyU 152036/17E, 152040/18E).

References
Imre Barany, Van Vu, et al. 2007. Central limit theorems for Gaussian polytopes. The Annals of Probability, 35(4):1593–1621.

Normand J Beaudry and Renato Renner. 2012. An intuitive proof of the data processing inequality. Quantum Information & Computation, 12(5-6):432–441.

Mohamed Ishmael Belghazi, Aristide Baratin, Sai Rajeshwar, Sherjil Ozair, Yoshua Bengio, Devon Hjelm, and Aaron Courville. 2018. Mutual information neural estimation. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 531–540, Stockholmsmässan, Stockholm, Sweden. PMLR.

Samuel R Bowman, Luke Vilnis, Oriol Vinyals, Andrew Dai, Rafal Jozefowicz, and Samy Bengio. 2016. Generating sentences from a continuous space. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pages 10–21.

Yuri Burda, Roger B. Grosse, and Ruslan Salakhutdinov. 2015. Importance weighted autoencoders. CoRR, abs/1509.00519.

Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. 2016. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2172–2180.

Xi Chen, Diederik P Kingma, Tim Salimans, Yan Duan, Prafulla Dhariwal, John Schulman, Ilya Sutskever, and Pieter Abbeel. 2017. Variational lossy autoencoder. ICLR.

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734.

Monroe D Donsker and SR Srinivasa Varadhan. 1983. Asymptotic evaluation of certain Markov process expectations for large time. IV. Communications on Pure and Applied Mathematics, 36(2):183–212.

Joseph L Fleiss. 1971. Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5):378.

Alex Graves. 2012. Sequence transduction with recurrent neural networks. CoRR, abs/1211.3711.

Jiatao Gu, Daniel Jiwoong Im, and Victor OK Li. 2018. Neural machine translation with gumbel-greedy decoding. AAAI, pages 5125–5132.

Di He, Yingce Xia, Tao Qin, Liwei Wang, Nenghai Yu, Tieyan Liu, and Wei-Ying Ma. 2016. Dual learning for machine translation. In Advances in Neural Information Processing Systems, pages 820–828.

Eric Jang, Shixiang Gu, and Ben Poole. 2017. Categorical reparameterization with gumbel-softmax. ICLR.

Michael I Jordan, Zoubin Ghahramani, Tommi S Jaakkola, and Lawrence K Saul. 1999. An introduction to variational methods for graphical models. Machine Learning, 37(2):183–233.

Lukasz Kaiser, Samy Bengio, Aurko Roy, Ashish Vaswani, Niki Parmar, Jakob Uszkoreit, and Noam Shazeer. 2018. Fast decoding in sequence models using discrete latent variables. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 2390–2399, Stockholmsmässan, Stockholm, Sweden. PMLR.

Diederik Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. ICLR.

Diederik P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. 2016. Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, pages 4743–4751.

Diederik P Kingma and Max Welling. 2014. Auto-encoding variational bayes. ICLR.

Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing.

Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2016a. A diversity-promoting objective function for neural conversation models. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 110–119.

Jiwei Li, Will Monroe, and Dan Jurafsky. 2016b. A simple, fast diverse decoding algorithm for neural generation. CoRR, abs/1611.08562.

Jiwei Li, Will Monroe, Alan Ritter, Dan Jurafsky, Michel Galley, and Jianfeng Gao. 2016c. Deep reinforcement learning for dialogue generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1192–1202.

Jiwei Li, Will Monroe, Tianlin Shi, Sébastien Jean, Alan Ritter, and Dan Jurafsky. 2017a. Adversarial learning for neural dialogue generation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2157–2169.

Yanran Li, Hui Su, Xiaoyu Shen, Wenjie Li, Ziqiang Cao, and Shuzi Niu. 2017b. DailyDialog: A manually labelled multi-turn dialogue dataset. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), volume 1, pages 986–995.

Chin-Yew Lin and Franz Josef Och. 2004. ORANGE: a method for evaluating automatic evaluation metrics for machine translation. In Proceedings of the 20th International Conference on Computational Linguistics, page 501. Association for Computational Linguistics.

Chia-Wei Liu, Ryan Lowe, Iulian Serban, Mike Noseworthy, Laurent Charlin, and Joelle Pineau. 2016. How not to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2122–2132.

Yichao Lu, Phillip Keung, Shaonan Zhang, Jason Sun, and Vikas Bhardwaj. 2017. A practical approach to dialogue response generation in closed domains. CoRR, abs/1703.09439.

Chris J Maddison, Andriy Mnih, and Yee Whye Teh. 2017. The concrete distribution: A continuous relaxation of discrete random variables. ICLR.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. ICLR Workshop.

Andriy Mnih and Karol Gregor. 2014. Neural variational inference and learning in belief networks. In International Conference on Machine Learning, pages 1791–1799.

Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. 2016. f-GAN: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems, pages 271–279.

Aaron van den Oord, Oriol Vinyals, et al. 2017. Neural discrete representation learning. In Advances in Neural Information Processing Systems, pages 6306–6315.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 311–318. Association for Computational Linguistics.

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. In NIPS Workshop.

Romain Paulus, Caiming Xiong, and Richard Socher. 2018. A deep reinforced model for abstractive summarization. ICLR.

Stefan Riezler and John T Maxwell. 2005. On some pitfalls in automatic evaluation and significance testing for MT. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 57–64.

Alan Ritter, Colin Cherry, and William B Dolan. 2011. Data-driven response generation in social media. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 583–593. Association for Computational Linguistics.

Vasile Rus and Mihai Lintean. 2012. A comparison of greedy and optimal assessment of natural language student input using word-to-word similarity metrics. In Proceedings of the Seventh Workshop on Building Educational Applications Using NLP, pages 157–162. Association for Computational Linguistics.

Iulian V Serban, Alessandro Sordoni, Yoshua Bengio, Aaron Courville, and Joelle Pineau. 2016. Building end-to-end dialogue systems using generative hierarchical neural network models. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, pages 3776–3783. AAAI Press.

Iulian Vlad Serban, Alessandro Sordoni, Ryan Lowe, Laurent Charlin, Joelle Pineau, Aaron Courville, and Yoshua Bengio. 2017. A hierarchical latent variable encoder-decoder model for generating dialogues. In Thirty-First AAAI Conference on Artificial Intelligence, pages 3295–3301.

Xiaoyu Shen, Hui Su, Yanran Li, Wenjie Li, Shuzi Niu, Yang Zhao, Akiko Aizawa, and Guoping Long. 2017. A conditional variational framework for dialog generation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), volume 2, pages 504–509.

Xiaoyu Shen, Hui Su, Shuzi Niu, and Vera Demberg. 2018. Improving variational encoder-decoders in dialogue generation. AAAI, pages 5456–5463.

Rakshith Shetty, Marcus Rohrbach, Lisa Anne Hendricks, Mario Fritz, and Bernt Schiele. 2017. Speaking the same language: Matching machine to human captions by adversarial training. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).

Kihyuk Sohn, Honglak Lee, and Xinchen Yan. 2015. Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pages 3483–3491.

Alessandro Sordoni, Michel Galley, Michael Auli, Chris Brockett, Yangfeng Ji, Margaret Mitchell, Jian-Yun Nie, Jianfeng Gao, and Bill Dolan. 2015. A neural network approach to context-sensitive generation of conversational responses. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 196–205.

Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.

Jakub Tomczak and Max Welling. 2018. VAE with a VampPrior. In International Conference on Artificial Intelligence and Statistics, pages 1214–1223.

George Tucker, Andriy Mnih, Chris J Maddison, John Lawson, and Jascha Sohl-Dickstein. 2017. REBAR: Low-variance, unbiased gradient estimates for discrete latent variable models. In Advances in Neural Information Processing Systems, pages 2627–2636.

Ashwin K Vijayakumar, Michael Cogswell, Ramprasaath R Selvaraju, Qing Sun, Stefan Lee, David J Crandall, and Dhruv Batra. 2018. Diverse beam search for improved description of complex scenes. AAAI, pages 7371–7379.

Oriol Vinyals and Quoc V. Le. 2015. A neural conversational model. CoRR, abs/1506.05869.

Di Wang, Nebojsa Jojic, Chris Brockett, and Eric Nyberg. 2017. Steering output style and topic in neural response generation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2140–2150.

Ronald J Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. In Reinforcement Learning, pages 5–32. Springer.

Tiancheng Zhao, Ran Zhao, and Maxine Eskenazi. 2017. Learning discourse-level diversity for neural dialog models using conditional variational autoencoders. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 654–664.
Supplementary Material
A.1 Proof of Eq. 3

\begin{align*}
\lambda_1 I(H, c) + \lambda_2 I(c, F)
&\geq \lambda_1 I(\tilde{H}, c) + \lambda_2 I(c, \tilde{F}) \\
&= \lambda_1 \mathbb{E}_{p_\phi(\tilde{H}, c)} \log \frac{p_\phi(\tilde{H} \mid c)}{p(\tilde{H})}
 + \lambda_2 \mathbb{E}_{p_\phi(c, \tilde{F})} \log \frac{p_\phi(\tilde{F} \mid c)}{p(\tilde{F})} \\
&= \lambda_1 \mathbb{E}_{p_\phi(\tilde{H}, c)} \log p_\phi(\tilde{H} \mid c) + \lambda_1 H(\tilde{H})
 + \lambda_2 \mathbb{E}_{p_\phi(c, \tilde{F})} \log p_\phi(\tilde{F} \mid c) + \lambda_2 H(\tilde{F}) \\
&\geq \lambda_1 \mathbb{E}_{p_\phi(\tilde{H}, c)} \log p_\phi(\tilde{H} \mid c)
 + \lambda_2 \mathbb{E}_{p_\phi(c, \tilde{F})} \log p_\phi(\tilde{F} \mid c) \\
&= \lambda_1 \mathbb{E}_{p_\phi(\tilde{H}, c)} \log p_\gamma(\tilde{H} \mid c)
 + \lambda_1 \mathrm{KL}\big(p_\phi(\tilde{H} \mid c) \,\|\, p_\gamma(\tilde{H} \mid c)\big) \\
&\quad + \lambda_2 \mathbb{E}_{p_\phi(c, \tilde{F})} \log p_\gamma(\tilde{F} \mid c)
 + \lambda_2 \mathrm{KL}\big(p_\phi(\tilde{F} \mid c) \,\|\, p_\gamma(\tilde{F} \mid c)\big) \\
&\geq \lambda_1 \mathbb{E}_{p_\phi(\tilde{H}, c)} \log p_\gamma(\tilde{H} \mid c)
 + \lambda_2 \mathbb{E}_{p_\phi(c, \tilde{F})} \log p_\gamma(\tilde{F} \mid c) \\
&= \mathbb{E}_{p_\phi(\tilde{H}, u_i, \tilde{F}, c)}
 \big[\lambda_1 \log p_\gamma(\tilde{H} \mid c) + \lambda_2 \log p_\gamma(\tilde{F} \mid c)\big]
\end{align*}

A.2 Proof of Eq. 4

\begin{align*}
I(u_i, c \mid H)
&= \mathbb{E}_{p(H)} \mathbb{E}_{p_\phi(u_i, c \mid H)} \log \frac{p_\phi(u_i \mid H, c)}{p(u_i \mid H)} \\
&= \mathbb{E}_{p(H)} \mathbb{E}_{p_\phi(u_i, c \mid H)} \log p_\phi(u_i \mid H, c) + H(u_i \mid H) \\
&\geq \mathbb{E}_{p(H, u_i, F)} \mathbb{E}_{p_\phi(c \mid H, u_i, F)} \log p_\phi(u_i \mid H, c) \\
&= \mathbb{E}_{p(H, u_i, F)} \mathbb{E}_{p_\phi(c \mid H, u_i, F)} \log p_\gamma(u_i \mid H, c)
 + \mathbb{E}_{p_\phi(H, c, F)} \mathrm{KL}\big(p_\phi(u_i \mid H, c) \,\|\, p_\gamma(u_i \mid H, c)\big) \\
&\geq \mathbb{E}_{p(H, u_i, F)} \mathbb{E}_{p_\phi(c \mid H, u_i, F)} \log p_\gamma(u_i \mid H, c)
\end{align*}

A.3 Derivation of Eq. 8

\begin{align*}
&\mathbb{E}_{p(\tilde{u}_i F \mid \tilde{H})}
 \Big[\mathbb{E}_{p_\phi(c \mid \tilde{H} \tilde{u}_i F)} \log p_\phi(\tilde{u}_i F \mid c)
 - \mathrm{KL}\big(p_\phi(c \mid \tilde{H} \tilde{u}_i F) \,\|\, p_\theta(c \mid \tilde{H})\big)\Big] \\
&= \mathbb{E}_{p(\tilde{u}_i F \mid \tilde{H})}
 \mathbb{E}_{p_\phi(c \mid \tilde{H} \tilde{u}_i F)}
 \log \frac{p_\phi(\tilde{u}_i F \mid c)\, p_\theta(c \mid \tilde{H})}{p_\phi(c \mid \tilde{H} \tilde{u}_i F)} \\
&= \mathbb{E}_{p(\tilde{u}_i F \mid \tilde{H})}
 \mathbb{E}_{p_\phi(c \mid \tilde{H} \tilde{u}_i F)}
 \log \frac{p_\phi(\tilde{u}_i F \mid c)\, p_\theta(c \mid \tilde{H})\, p(\tilde{u}_i F \mid \tilde{H})}
 {p_\phi(\tilde{u}_i F \mid \tilde{H}, c)\, p_\phi(c \mid \tilde{H})} \\
&= -\mathbb{E}_{p_\phi(c \mid \tilde{H})}
 \mathrm{KL}\big(p_\phi(\tilde{u}_i F \mid \tilde{H}, c) \,\|\, p_\phi(\tilde{u}_i F \mid c)\big)
 - \mathrm{KL}\big(p_\phi(c \mid \tilde{H}) \,\|\, p_\theta(c \mid \tilde{H})\big)
 - H(\tilde{u}_i F \mid \tilde{H})
\end{align*}

where the third equality applies Bayes' rule to $p_\phi(c \mid \tilde{H} \tilde{u}_i F)$.

A.4 Information Retrieval Technique for Multiple References
We collected multiple reference responses for each dialogue context in the test set using information retrieval techniques. References are retrieved based on their similarity with the provided context; responses to the retrieved utterances are used as references. The process of retrieving similar contexts is as follows. First, we select 1000 candidate utterances using the tf-idf score. These candidates are then mapped to a vector space by summing their contained word vectors. After that, they are reranked based on the average of the cosine similarity, Jaccard distance and Euclidean distance with the ground-truth context. The top 10 retrieved responses are passed to human annotators to judge their appropriateness.
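The retrieve-and-rerank procedure above can be sketched as follows. This is our own illustrative implementation (the function names, the exact tf-idf weighting and the distance-to-similarity conversion are assumptions; the paper operates on 1000 candidates with pretrained word vectors, while this toy version uses sparse bag-of-words vectors):

```python
import math
from collections import Counter

def tfidf_vectors(texts):
    """Sparse bag-of-words tf-idf vectors for a small corpus (illustrative)."""
    docs = [t.lower().split() for t in texts]
    df = Counter(w for d in docs for w in set(d))
    n = len(docs)
    vecs = []
    for d in docs:
        tf = Counter(d)
        vecs.append({w: tf[w] * math.log((1 + n) / (1 + df[w])) for w in tf})
    return vecs

def cosine(a, b):
    dot = sum(a[w] * b.get(w, 0.0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def jaccard(a_text, b_text):
    sa, sb = set(a_text.lower().split()), set(b_text.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def euclidean_sim(a, b):
    # Convert Euclidean distance into a similarity in (0, 1].
    words = set(a) | set(b)
    dist = math.sqrt(sum((a.get(w, 0.0) - b.get(w, 0.0)) ** 2 for w in words))
    return 1.0 / (1.0 + dist)

def retrieve_references(query, candidates, top_k=2):
    """Rerank candidate contexts by the average of three similarity signals."""
    vecs = tfidf_vectors([query] + candidates)
    q, cand_vecs = vecs[0], vecs[1:]
    scored = []
    for text, v in zip(candidates, cand_vecs):
        score = (cosine(q, v) + jaccard(query, text) + euclidean_sim(q, v)) / 3.0
        scored.append((score, text))
    scored.sort(reverse=True)
    return [t for _, t in scored[:top_k]]
```

In the actual pipeline the responses paired with the top-ranked contexts, not the contexts themselves, would then be collected as references.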
A.5 Phrases that count as forming dull responses
1) i know
2) no eou (yes eou)
3) no problem
4) lol
5) thanks eou
6) don't know
7) don't think
8) what ?
9) of course
10) wtf

Utterances matching one of these phrases are treated as dull responses.
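A filter based on this phrase list can be sketched as below. We assume "matching" means case-insensitive substring containment, which is one plausible reading; the paper does not specify the exact matching rule:

```python
# Phrase list from A.5 ("eou" is the end-of-utterance token).
DULL_PHRASES = [
    "i know", "no eou", "yes eou", "no problem", "lol",
    "thanks eou", "don't know", "don't think", "what ?",
    "of course", "wtf",
]

def is_dull(response: str) -> bool:
    """Flag a response as dull if it contains any listed phrase."""
    text = response.lower()
    return any(phrase in text for phrase in DULL_PHRASES)
```

Such a filter gives a cheap automatic proxy for the fraction of generic responses a model produces.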
A.6 Effect of hyperparameter ratio λ1/λ2

[Figure 3: Effect of the hyperparameter ratio λ1/λ2 on the two datasets.]

Figure 3 visualizes the effects of the hyperparameters λ1 and λ2