Learning to Select Context in a Hierarchical and Global Perspective for Open-domain Dialogue Generation
Lei Shen∗, Haolan Zhan∗, Xin Shen, Yang Feng
IIP, Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Australian National University
IIP stands for Key Laboratory of Intelligent Information Processing. ∗Equal contribution. Corresponding author: [email protected]
ABSTRACT
Open-domain multi-turn conversations mainly have three features: hierarchical semantic structure, redundant information, and long-term dependency. Grounded on these, selecting relevant context becomes a challenging step for multi-turn dialogue generation. However, existing methods cannot differentiate useful words and utterances that are distant from a response. Besides, previous work performs context selection based only on a state in the decoder, which lacks global guidance and may lead to a focus on irrelevant or unnecessary information. In this paper, we propose a novel model with a hierarchical self-attention mechanism and distant supervision to not only detect relevant words and utterances at short and long distances, but also discern related information globally when decoding. Experimental results of both automatic and human evaluations on two public datasets show that our model significantly outperforms other baselines in terms of fluency, coherence, and informativeness.
Index Terms — Open-domain Dialogue Generation, Context Selection, Hierarchical and Global Perspective
1. INTRODUCTION
Open-domain multi-turn dialogue generation has gained increasing attention in recent years, as it is more consistent with real scenarios and aims to produce customized responses. In general, an open-domain multi-turn conversation has the following features: (1) The context (including the query and previous utterances in our paper) is in a hierarchical structure, which means it consists of several utterances, and each utterance contains several words. (2) In most cases, many contents of the context are redundant and irrelevant to the response. (3) Some related information (utterances or words) and the response are in a long-term dependency relation.
Therefore, Context Selection, i.e., detecting the relevant context on which to generate a more coherent and informative response, is a key point in multi-turn dialogue generation. Based on feature (1), the hierarchical recurrent encoder-decoder network (HRED) [1] was proposed. It encodes each utterance and the whole context at two levels, and is widely applied in other methods for multi-turn dialogue generation. Then, hierarchical recurrent attention [2], explicit weighting [3, 4], memory networks [5], and the self-attention mechanism [6] have been introduced to address features (2) and (3), respectively. However, little existing work covers all these features simultaneously to fulfill both the context selection and response generation tasks.
When it comes to context selection, existing methods fall into two categories: (1) detecting related utterances by measuring the similarity between the query and each previous utterance [3, 4]; (2) applying the attention mechanism from a local perspective, i.e., based solely on the current state of the decoder trained with the Maximum Likelihood Estimation (MLE) loss [4, 6]. The similarity measurement in the former cannot select word-level context, while the local guidance in the latter may make the model choose deviated context and produce an inappropriate response [7, 8, 9].
To tackle the above-mentioned problems, we propose HiSA-GDS, a modified Transformer model with Hierarchical Self-Attention and Globally Distant Supervision. To the best of our knowledge, this is the first time these two modules have been designed for open-domain dialogue generation. Specifically, we use a Transformer encoder to encode each utterance in the context. During training, the response is first processed by a masked self-attention layer, and then a word-word attention aggregates related word information from each utterance individually. After that, we conduct utterance-level self-attention to obtain context-sensitive representations of the aggregated information from the last layer. Then, we calculate attention weights between the utterance-level outputs of the previous layer and the masked response representation. Finally, we generate the corresponding response based on the fusion of the selected information at both word and utterance levels. Besides, to provide global guidance for decoding, we introduce a distant supervision module that utilizes the similarity score between the response and each contextual utterance, measured by a pre-trained sentence-embedding model. All parameters are learned based on the globally distant supervision and the local MLE in an end-to-end framework.

Experimental results on two public datasets, along with further discussions, show that HiSA-GDS significantly outperforms other baselines and is capable of generating more fluent, coherent, and informative responses.

Fig. 1. Architecture of HiSA-GDS. The white dashed box is the Transformer encoder, while the gray one is the modified Transformer decoder. Residual connections and layer normalization are omitted for brevity. "WPE" and "UPE" represent word position encoding and utterance position encoding. The upper right corner shows the globally distant supervision, which is only introduced to the N-th layer of the decoder.
2. APPROACH
The input is a context containing $n$ utterances $\{X_i\}_{i=1}^{n}$, and each utterance is defined as $X_i = \{x_{i,1}, \ldots, x_{i,|X_i|}\}$, where $|X_i|$ is the length of the $i$-th utterance and $x_{i,m}$ is the $m$-th word of $X_i$. Our goal is to select relevant context consisting of utterances and words, and then generate a response $Y = \{y_1, y_2, \ldots, y_{|Y|}\}$ by utilizing the related information, where $|Y|$ is the length of the response $Y$.

We consider each utterance independently. Given an utterance $X_i$, the input representation of word $x_{i,j}$ is the sum of its word embedding and position encoding: $I(x_{i,j}) = \mathrm{WE}(x_{i,j}) + \mathrm{WPE}(x_{i,j})$, where $\mathrm{WE}(x_{i,j})$ and $\mathrm{WPE}(x_{i,j})$ represent the word embedding and word position embedding, respectively. The input embedding is then fed into a Transformer encoder with $N$ layers. The final encoding of $X_i$ is the output of the $N$-th layer, $E_i^{(N)}$. Please refer to [10] for more details.

The decoder also contains $N$ layers, and each layer is composed of five sub-layers. The first sub-layer is a masked self-attention, which is defined as:

$M_t^{(l)} = \mathrm{MHA}(D_t^{(l-1)}, D_t^{(l-1)}, D_t^{(l-1)})$,   (1)

where $\mathrm{MHA}$ is the multi-head attention function, $D_t^{(l-1)}$ denotes the input representation of the $l$-th layer, and $M_t^{(l)}$ denotes the output of masked self-attention at the $l$-th layer. $D_t^{(0)}$ is the concatenation of all words before time step $t$ in the response, where each word is also represented as the sum of its word embedding and position encoding.

The second sub-layer is a word-word attention that summarizes word-level response-related information from each utterance $X_i$ into a vector at a specific decoding time:

$U_{t,i}^{(l)} = \mathrm{MHA}(f_w(M_t^{(l)}), E_i^{(N)}, E_i^{(N)})$,   (2)

where $f_w$ is a linear transformation.

The third sub-layer is an utterance-level self-attention. Inspired by Zhang et al. [6], we also utilize the self-attention mechanism to capture the long-term dependency of utterance-level information. Similar to word position encoding, we add an utterance position encoding (UPE) to $U_{t,i}^{(l)}$, and denote the sum as $\tilde{U}_{t,i}^{(l)}$. The output of this sub-layer is calculated as:

$H_t^{(l)} = \mathrm{MHA}(\tilde{U}_t^{(l)}, \tilde{U}_t^{(l)}, \tilde{U}_t^{(l)})$,   (3)

where $\tilde{U}_t^{(l)} = [\tilde{U}_{t,1}^{(l)}, \tilde{U}_{t,2}^{(l)}, \ldots, \tilde{U}_{t,n}^{(l)}]$.

Then, the fourth sub-layer is a word-utterance attention layer that finds utterance-level relevant information:

$C_t^{(l)} = f_l(\mathrm{MHA}(f_u(M_t^{(l)}), H_t^{(l)}, H_t^{(l)}))$,   (4)

where $f_l$ and $f_u$ are linear transformations, and $f_l$ is used to change the output dimension. The last sub-layer is a feed-forward neural network (FFN):

$F_t^{(l)} = \mathrm{FFN}(C_t^{(l)})$.   (5)

Each of the above-mentioned sub-layers is followed by layer normalization and a residual connection. Finally, we use a fusion gate to regulate the relevant information at the word level ($U_{t,n}^{(l)}$) and the utterance level ($F_t^{(l)}$):

$\lambda_t = \sigma(W_g[U_{t,n}^{(l)}, F_t^{(l)}])$,   (6)
$D_t^{(l)} = \lambda_t * F_t^{(l)} + (1 - \lambda_t) * U_{t,n}^{(l)}$,   (7)

where $W_g$ is a parameter matrix, $\sigma$ is the sigmoid activation function, and $*$ denotes the element-wise product.
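To make the layer structure concrete, the following is a minimal PyTorch sketch of one decoder layer following Equations (1)–(7). It is a sketch under our own naming conventions (e.g., `HiSADecoderLayer`), not the authors' released code; residual connections and layer normalization are omitted, as in Fig. 1.

```python
import torch
import torch.nn as nn

class HiSADecoderLayer(nn.Module):
    """Sketch of one HiSA decoder layer, Eqs. (1)-(7). Names illustrative."""

    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        mha = lambda: nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.masked_self_attn = mha()   # Eq. (1)
        self.word_word_attn = mha()     # Eq. (2)
        self.utt_self_attn = mha()      # Eq. (3)
        self.word_utt_attn = mha()      # Eq. (4)
        self.f_w = nn.Linear(d_model, d_model)
        self.f_u = nn.Linear(d_model, d_model)
        self.f_l = nn.Linear(d_model, d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))  # Eq. (5)
        self.W_g = nn.Linear(2 * d_model, d_model)  # fusion gate, Eq. (6)

    def forward(self, D_prev, enc_utts, upe, causal_mask):
        # D_prev: (B, T, d) decoder input; enc_utts: list of n encoder
        # outputs E_i^(N), each (B, |X_i|, d); upe: (n, d) utterance
        # position encodings; causal_mask: (T, T) mask for Eq. (1).
        M, _ = self.masked_self_attn(D_prev, D_prev, D_prev,
                                     attn_mask=causal_mask)          # Eq. (1)
        q = self.f_w(M)
        U = torch.stack([self.word_word_attn(q, E, E)[0]
                         for E in enc_utts], dim=2)                  # Eq. (2)
        B, T, n, d = U.shape
        U_tilde = (U + upe).reshape(B * T, n, d)                     # add UPE
        H, _ = self.utt_self_attn(U_tilde, U_tilde, U_tilde)         # Eq. (3)
        q_u = self.f_u(M).reshape(B * T, 1, d)
        C = self.f_l(self.word_utt_attn(q_u, H, H)[0]).reshape(B, T, d)  # Eq. (4)
        F_t = self.ffn(C)                                            # Eq. (5)
        U_n = U[:, :, -1, :]                                         # U_{t,n}
        lam = torch.sigmoid(self.W_g(torch.cat([U_n, F_t], dim=-1)))  # Eq. (6)
        return lam * F_t + (1 - lam) * U_n                           # Eq. (7)
```

In this reading, the gate $\lambda_t$ decides, per decoding step, how much utterance-level evidence $F_t^{(l)}$ versus word-level evidence $U_{t,n}^{(l)}$ flows into the next layer.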
Previous attention-based models achieve context selection from a local perspective, i.e., they try to generate one token at a time based solely on the current decoding state, which may attend to deviated context and mislead further generation. Besides, we do not have manual annotations to provide direct signals for selection. To address these problems, we design a globally distant supervision module to help determine relevant information, which provides global guidance for the response generation process. Firstly, we apply a high-quality pre-trained sentence-embedding model to encode each contextual utterance $X_i$ and the response $Y$ into vectors, denoted as $\mathbf{x}_i$ and $\mathbf{y}$. Then, we use the dot product to measure the semantic relevance between $\mathbf{x}_i$ and $\mathbf{y}$ [11], and compute the selection probability as follows:

$P(x = x_i \mid y) = \dfrac{\exp(\mathbf{x}_i \cdot \mathbf{y})}{\sum_{j=1}^{n} \exp(\mathbf{x}_j \cdot \mathbf{y})}$.   (8)

We utilize three loss functions in our training process. The first one is the MLE loss, which is defined as:

$\mathcal{L}_{\mathrm{MLE}}(\theta) = -\dfrac{1}{|Y|} \sum_{t=1}^{|Y|} \log p(y_t \mid y_{<t}, X)$.   (9)
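The sketch below illustrates how the distant-supervision targets of Eq. (8) and the MLE loss of Eq. (9) can be computed, assuming sentence embeddings are pre-computed by the pre-trained encoder (InferSent or Familia, see Section 3). All function names are our own; the `gds_loss` term is a plausible stand-in (a KL alignment between the decoder's utterance-level attention and the Eq. (8) targets), not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def distant_supervision_targets(ctx_emb: torch.Tensor,
                                resp_emb: torch.Tensor) -> torch.Tensor:
    """Eq. (8): selection probability P(x = x_i | y) from the dot product
    between each utterance embedding x_i and the response embedding y.
    ctx_emb: (n, d); resp_emb: (d,)."""
    return F.softmax(ctx_emb @ resp_emb, dim=0)

def mle_loss(logits: torch.Tensor, targets: torch.Tensor,
             pad_id: int = 0) -> torch.Tensor:
    """Eq. (9): length-normalized negative log-likelihood of the gold
    response. logits: (|Y|, V); targets: (|Y|,) token ids."""
    return F.cross_entropy(logits, targets, ignore_index=pad_id)

def gds_loss(utt_attn: torch.Tensor, ds_targets: torch.Tensor) -> torch.Tensor:
    """Hypothetical global-guidance term: pull the N-th layer's
    utterance-level attention toward the Eq. (8) targets via KL."""
    return F.kl_div(utt_attn.clamp_min(1e-10).log(), ds_targets,
                    reduction='sum')
```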
3. EXPERIMENT SETTINGS

Datasets: We evaluate the performance on two public datasets: the Ubuntu Dialogue Corpus [14] (Ubuntu) and the JD Customer Service Corpus [15] (JDDC).

Baselines: (1) Seq2Seq with attention mechanism (S2SA) [16], for which we concatenate all context utterances into a long sequence; (2) Hierarchical Recurrent Encoder-Decoder (HRED) [1]; (3) Variational HRED (VHRED) [17] with word drop and KL annealing, where the word drop ratio equals 0.25; (4) Static Attention based Decoding Network (Static) [4]; (5) Hierarchical Recurrent Attention Network (HRAN) [18]; (6) Transformer [10], for which we concatenate all context utterances into a long sequence; (7) Relevant Contexts Detection with Self-Attention Model (ReCoSa) [6]. They all focus on multi-turn conversations, and ReCoSa is a state-of-the-art model on both Ubuntu and JDDC. For the ablation study, HiSA is our model without the globally distant supervision.

Hyper-parameters: The utterance padding length is set to 30, and the maximum conversation length is 10. The hidden size of the encoder and decoder is 512, and the number of layers is 4 for the encoder and 2 for the decoder. The number of heads in multi-head attention is set to 8. The high-quality pre-trained sentence-embedding model we use is InferSent [19] for Ubuntu and Familia [20] for JDDC; both are pre-trained on large-scale datasets in English and Chinese, respectively, and perform well on our datasets. For optimization, we use Adam [21] with a learning rate of 0.0001 and gradient clipping. The hyper-parameters in Equation 12 are set to 1.

Performance Measures: For automatic evaluation, we use 4 groups of metrics: (1) BLEU-2 [22]; (2) embedding-based metrics (Average, Greedy, and Extrema) [17]; (3) Coherence [23], which evaluates the semantic coherence between the context and response; (4) Distinct-2 [24]. For human evaluation, we utilize side-by-side human comparison. We invite 7 postgraduate students as annotators. We show each annotator a context with two generated responses, one from HiSA-GDS and the other from a baseline model, without revealing their order. Then we ask the annotators to judge which one wins based on fluency, coherence, and informativeness. Please refer to [18] for more details. Agreement among the annotators is calculated using Fleiss' kappa.
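For concreteness, here is a minimal sketch of Distinct-2 [24], the ratio of distinct bigrams to all bigrams over the generated responses. This follows the standard definition of the metric rather than any code from the paper.

```python
def distinct_2(responses):
    """Distinct-2: number of distinct bigrams divided by the total
    number of bigrams across all generated responses."""
    total, unique = 0, set()
    for tokens in responses:            # each response is a list of tokens
        for bigram in zip(tokens, tokens[1:]):
            total += 1
            unique.add(bigram)
    return len(unique) / max(total, 1)

# e.g., distinct_2([["i", "am", "fine"], ["i", "am", "here"]]) == 0.75
```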
4. RESULTS AND DISCUSSION

Automatic Evaluation Results: As shown in Table 1, our model significantly outperforms all baselines on both Ubuntu and JDDC (significance tests).

Model      |                 Ubuntu                   |                  JDDC
           | B-2   D-2    Avg    Ext    Gre    Coh    | B-2   D-2    Avg    Ext    Gre    Coh
S2SA [16]  | 0.896 6.104  46.323 28.851 39.209 48.117 | 4.233 3.609  53.901 36.493 37.578 46.176
HRED [1]   | 3.853 6.661  57.972 34.007 41.462 63.173 | 9.405 11.762 63.191 46.714 43.295 57.183
VHRED [17] | 3.677 8.098  57.251 32.024 41.808 61.464 | 6.367 15.184 62.436 43.337 41.787 63.924
Static [4] | 1.581 3.586  51.055 36.193 –      –      | –     –      –      –      –      –
HiSA-GDS   | 7.351 10.934 68.283 41.468 –      –      | –     –      –      –      –      –

Table 1. Automatic evaluation results on Ubuntu and JDDC (%). The metrics BLEU-2, Distinct-2, Average, Extrema, Greedy, and Coherence are abbreviated as B-2, D-2, Avg, Ext, Gre, and Coh, respectively.

Human Evaluation Results: These results are shown in Table 2. We observe that HiSA-GDS outperforms all baseline models on both Ubuntu and JDDC. Specifically, the percentage of "win" is always larger than that of "loss". Take the Ubuntu dataset as an example: compared with VHRED, Static, and Transformer, HiSA-GDS achieves preference gains of 48%, 51%, and 44%, respectively. We inspect the responses generated by our model that "win" and find that they are more relevant to the contextual utterances. The kappa scores indicate that the annotators reach a "moderate agreement" in their judgements.

Dataset | Model            | Win  Loss  Tie  | kappa
Ubuntu  | S2SA [16]        | 58%  12%   30%  | 0.468
        | HRED [1]         | 46%  19%   35%  | 0.531
        | VHRED [17]       | 48%  20%   32%  | 0.493
        | Static [4]       | 51%  17%   32%  | 0.596
        | HRAN [18]        | 42%  9%    49%  | 0.424
        | Transformer [10] | 44%  19%   37%  | 0.474
        | ReCoSa [6]       | 40%  6%    54%  | 0.528
JDDC    | S2SA [16]        | 53%  24%   23%  | 0.547
        | HRED [1]         | 56%  16%   34%  | 0.468
        | VHRED [17]       | 52%  19%   29%  | 0.453
        | Static [4]       | 48%  11%   41%  | 0.518
        | HRAN [18]        | 50%  22%   28%  | 0.495
        | Transformer [10] | 51%  29%   20%  | 0.447
        | ReCoSa [6]       | 45%  27%   28%  | 0.461

Table 2. Human evaluation between HiSA-GDS and the baselines on Ubuntu and JDDC.

Discussion of Hierarchical Self-Attention: To validate the effectiveness of the hierarchical self-attention mechanism, we present the heatmap of an example in Figure 2. In this example, there are seven contextual utterances; for each utterance, the importance of each word is indicated by the depth of the blue color on the right part. Besides, we also show an utterance-level attention visualization on the left part: an utterance is more important when the red color is lighter. For example, the third and seventh utterances, i.e., $X_3$ and $X_7$, are more important than the others. The importance of a word (horizontal heatmap on the right of $X_1$ to $X_7$) or an utterance (vertical heatmap on the left of $X_1$ to $X_7$) is calculated as the average value over the different heads. From the word-level visualization, we find that words such as "订单 (order)", "今天 (today)", and "送货 (deliver)" are selected as more relevant. Overall, the results are in accordance with human judgement and achieve the goal of our proposed model.

Fig. 2. Left: utterance-level multi-head attention visualization of HiSA-GDS in the word-utterance attention layer; 0 to 7 are the indices of the heads. Right: word-level attention visualization in the word-word attention layer. The importance of a word (horizontal blue heatmap) or an utterance (vertical red heatmap) is calculated as the average value over all heads.

Discussion of GDS: Since GDS is only utilized during training, we calculate the relevance score between each contextual utterance and the ground-truth response. After applying Familia [20] over the entire conversation, the relevance scores are 0.1502, 0.1388, 0.1602, 0.1548, 0.0979, 0.1343, and 0.1638 for $X_1$ to $X_7$, which is consistent with human intuition. Besides, inspired by Zhang et al. [6], we randomly sample 300 context-response pairs from JDDC. Three annotators, all postgraduate students, are invited to label each context: a contextual utterance is labeled 1 if it is related to the response. The kappa value is 0.568, which indicates moderate consistency among the annotators. We then pick out the samples that are labeled identically by at least two annotators, and calculate the kappa value between the human judgements and the outputs of Familia [20] on these cases. The value of 0.863 reflects "substantial agreement" between them.

5. CONCLUSION

In this paper, we propose a novel model for open-domain dialogue generation, HiSA-GDS, which conducts context selection from a hierarchical and global perspective. The hierarchical self-attention is introduced to capture relevant context at both word and utterance levels. We also design a globally distant supervision module to guide response generation during decoding. Experiments show that HiSA-GDS can generate more fluent, coherent, and informative responses.

6. REFERENCES

[1] Iulian V. Serban, Alessandro Sordoni, Yoshua Bengio, Aaron Courville, and Joelle Pineau, "Building end-to-end dialogue systems using generative hierarchical neural network models," in AAAI, 2016, pp. 3776–3783.
[2] Jian Song, Kailai Zhang, Xuesi Zhou, and Ji Wu, "HKA: A hierarchical knowledge attention mechanism for multi-turn dialogue system," in ICASSP, 2020, pp. 3512–3516.
[3] Zhiliang Tian, Rui Yan, Lili Mou, Yiping Song, Yansong Feng, and Dongyan Zhao, "How to make context more useful? An empirical study on context-aware neural conversational models," in ACL, 2017, pp. 231–236.
[4] Weinan Zhang, Yiming Cui, Yifa Wang, Qingfu Zhu, Lingzhi Li, Lianqiang Zhou, and Ting Liu, "Context-sensitive generation of open-domain conversational responses," in COLING, 2018, pp. 2437–2447.
[5] Hongshen Chen, Zhaochun Ren, Jiliang Tang, Yihong Eric Zhao, and Dawei Yin, "Hierarchical variational memory network for dialogue generation," in WWW, 2018, pp. 1653–1662.
[6] Hainan Zhang, Yanyan Lan, Liang Pang, Jiafeng Guo, and Xueqi Cheng, "ReCoSa: Detecting the relevant contexts with self-attention for multi-turn dialogue generation," in ACL, 2019, pp. 3721–3730.
[7] Lei Shen, Yang Feng, and Haolan Zhan, "Modeling semantic relationship in multi-turn conversations with hierarchical latent variables," in ACL, 2019, pp. 5497–5502.
[8] Lei Shen and Yang Feng, "CDL: Curriculum dual learning for emotion-controllable response generation," in ACL, 2020, pp. 556–566.
[9] Lei Shen, Xiaoyu Guo, and Meng Chen, "Compose like humans: Jointly improving the coherence and novelty for modern Chinese poetry generation," in IJCNN, 2020, pp. 1–8.
[10] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin, "Attention is all you need," in NIPS, 2017, pp. 5998–6008.
[11] Rongzhong Lian, Min Xie, Fan Wang, Jinhua Peng, and Hua Wu, "Learning to select knowledge for response generation in dialog systems," in IJCAI, 2019, pp. 5081–5087.
[12] Pengjie Ren, Zhumin Chen, Christof Monz, Jun Ma, and Maarten de Rijke, "Thinking globally, acting locally: Distantly supervised global-to-local knowledge selection for background based conversation," in AAAI, 2020, pp. 8697–8704.
[13] Haolan Zhan, Hainan Zhang, Hongshen Chen, Lei Shen, Yanyan Lan, Zhuoye Ding, and Dawei Yin, "User-inspired posterior network for recommendation reason generation," in SIGIR, 2020, pp. 1937–1940.
[14] Ryan Lowe, Nissan Pow, Iulian V. Serban, and Joelle Pineau, "The Ubuntu Dialogue Corpus: A large dataset for research in unstructured multi-turn dialogue systems," in SIGDIAL, 2015, pp. 285–294.
[15] Meng Chen, Ruixue Liu, Lei Shen, Shaozu Yuan, Jingyan Zhou, Youzheng Wu, Xiaodong He, and Bowen Zhou, "The JDDC corpus: A large-scale multi-turn Chinese dialogue dataset for e-commerce customer service," in LREC, 2020, pp. 459–466.
[16] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le, "Sequence to sequence learning with neural networks," in NIPS, 2014, pp. 3104–3112.
[17] Iulian Vlad Serban, Alessandro Sordoni, Ryan Lowe, Laurent Charlin, Joelle Pineau, Aaron Courville, and Yoshua Bengio, "A hierarchical latent variable encoder-decoder model for generating dialogues," in AAAI, 2017, pp. 3295–3301.
[18] Chen Xing, Yu Wu, Wei Wu, Yalou Huang, and Ming Zhou, "Hierarchical recurrent attention network for response generation," in AAAI, 2018, pp. 5610–5617.
[19] Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes, "Supervised learning of universal sentence representations from natural language inference data," in EMNLP, 2017, pp. 670–680.
[20] Di Jiang, Yuanfeng Song, Rongzhong Lian, Siqi Bao, Jinhua Peng, Huang He, and Hua Wu, "Familia: A configurable topic modeling framework for industrial text engineering," arXiv preprint arXiv:1808.03733, 2018.
[21] Diederik P. Kingma and Jimmy Ba, "Adam: A method for stochastic optimization," in ICLR, 2015.
[22] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu, "BLEU: A method for automatic evaluation of machine translation," in ACL, 2002, pp. 311–318.
[23] Xinnuo Xu, Ondřej Dušek, Ioannis Konstas, and Verena Rieser, "Better conversations by modeling, filtering, and optimizing for coherence and diversity," in EMNLP, 2018, pp. 3981–3991.
[24] Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan, "A diversity-promoting objective function for neural conversation models," in NAACL-HLT, 2016, pp. 110–119.