Data Augmentation for Neural Online Chat Response Selection
Wenchao Du
Language Technology Institute, Carnegie Mellon University, Pittsburgh, PA 15213
[email protected]

Alan W Black
Language Technology Institute, Carnegie Mellon University, Pittsburgh, PA 15213
[email protected]
Abstract
Data augmentation seeks to manipulate the available data for training to improve the generalization ability of models. We investigate two data augmentation proxies, permutation and flipping, for the neural dialog response selection task on various models over multiple datasets, including both Chinese and English. Different from standard data augmentation techniques, our method combines the original and synthesized data for prediction. Empirical results show that our approach can gain 1 to 3 recall-at-1 points over baseline models in both full-scale and small-scale settings.
Introduction

Building machines that are capable of conversing like humans is one of the primary goals of artificial intelligence. Extensive manual labor is typically required by traditional rule-based systems, limiting the scalability of such systems across multiple domains. With the success of machine learning, the quest of building data-driven dialog systems has come into focus over the past few years (Ritter et al., 2011). Existing approaches in this area can be categorized into generation-based methods and retrieval-based methods. While generation-based methods are still far from reliably generating informative responses, retrieval-based methods have the advantage of fluency and groundedness, since they select responses from existing data. We concentrate on retrieval-based methods in this paper, though we believe the proposed techniques could also improve generation-based models.

While current state-of-the-art results for dialog models are achieved by deep learning approaches, the performance of neural models largely depends on the amount of training data. However, acquiring conversational data can be difficult at times. On the other hand, even with thousands of data points, it is unclear whether these models can optimally benefit from them. Therefore, data augmentation and its efficient use become an important problem. Our main contribution is investigating new ways to manipulate chat data and neural model architectures to improve performance. To our knowledge, we are the first to evaluate data augmentation on different types of neural conversation models over multiple domains and languages.
Data Transformations

Recent studies (Adi et al., 2016; Khandelwal et al., 2018) have shown that recurrent neural networks (RNNs), especially long short-term memory networks (LSTMs), are sensitive to word order when encoding contextual information. However, for the response selection task, it is so far unclear to what extent word order is important. This question is complicated by the following language phenomena we observed in existing chat data:

1. Broken continuity. Simultaneous conversations happen very often in multi-party dialogs (Elsner and Charniak, 2008), resulting in some utterances not responding to their immediately preceding ones. Even in conversations between only two people, continuity may still break because one person switches topic before the other responds. See Table 1 for examples.

2. Mixed turn-taking behavior. People can give multiple utterances before the other responds. Usually, these consecutive messages from the same person form arguments that are in parallel (by "argument" we mean text spans that form discourse relations with each other),

Example 1:
Old      I dont run graphical ubuntu, I run ubuntu server.
Kuja     Haha sucker.
Taru     ?
Burner   you can use "ps ax" and "kill (PID

Example 2:
Customer  在 (there) 吗 (?)
Customer  看看 (look at) 此 (this) 款 (one)
Agent     在 的 (I'm here) 亲 (dear)
Agent     亲 (dear) , 请 (please) 发 (send) 链接 (link)

Table 1: Example chat snippets for broken continuity. The first example is from (Lowe et al., 2015). Burner's message is responding to Old, and Kuja's last message is replying to Taru. The second example is from Taobao, where the third message is responding to the first message, and the fourth message to the second message.
Example 1:
Customer A  这 (this) 款 (one) 我 (I) 穿 (wear) 什么 (what) 码 (size)
Customer A  160 高 (tall) , 斤 (0.5kg) 重 (heavy)
Agent       亲 (dear) 如果 (if) 喜欢 (like) 宽松 (loose) 点的 就 (then) 可以 (can) 选 (choose) L 哦

Example 2:
Customer B  158cm
Customer B  63kg
Customer B  穿 (wear) 什么 (what) 码 (size) 的 合适 (fit)
Agent       亲 (dear) 根据 (based on) 亲的 (your) 数据 (data) , 建议 (suggest) 穿 (wear) L 码 (size)

Table 2: Example chat snippets for mixed turn-taking from Taobao. The question asking for a recommendation and its relevant information (height and weight) can be communicated through different numbers of utterances in arbitrary order.
Example:
Wizard    Sorry, I cannot find any trips leaving from Gotham City. Could you suggest another nearby departure city?
Customer  Would any packages to Mos Eisley be available, if I increase my budget to $2500?
Wizard    There are no trips available to Mos Eisley.

Table 3: Example chat snippet from Frames. The first message has two sentences. The second message is a conditional complex sentence.
Example 2 of Table 1 after Permutation:
Customer  在 (there) 吗 (?)
Agent     在 的 (I'm here) 亲 (dear)
Customer  看看 (look at) 此 (this) 款 (one)
Agent     亲 (dear) , 请 (please) 发 (send) 链接 (link)

Example 1 of Table 2 after Permutation:
Customer A  160 高 (tall) , 斤 (0.5kg) 重 (heavy)
Customer A  这 (this) 款 (one) 我 (I) 穿 (wear) 什么 (what) 码 (size)
Agent       亲 (dear) 如果 (if) 喜欢 (like) 宽松 (loose) 点的 就 (then) 可以 (can) 选 (choose) L 哦

Example of Table 3 after Flipping:
Wizard    Could you suggest another nearby departure city? Sorry, I cannot find any trips leaving from Gotham City.
Customer  if I increase my budget to $2500, Would any packages to Mos Eisley be available?
Wizard    There are no trips available to Mos Eisley.
Table 4: Results of the proposed transformations on the previous examples. In the first and second examples, the two messages right before the last agent's response are permuted. In the third example, the first message is flipped, splitting at the period; the second message is split at the comma and flipped.

and their orderings are not that important. We found this to be very common in online live chats. See Table 2 for examples.

3. Long utterances. Some utterances contain multiple sentences. Some are single compound sentences with multiple clauses. See Table 3 for examples.

To summarize, the critical information for responding, which can be a single word, a phrase, or a full sentence, may have varying relative positions in the context. Therefore, we hypothesize that there exist alternative orderings of utterances and intra-utterance arguments in chat data that can help select responses, given recurrent neural models' sensitivity to word order. In this paper, our main goal is to seek improvement by creating variations in the ordering of utterances and arguments. We aim for generic methods, bypassing the need for discourse and syntactic parsing as an intermediate step. Given that online chats are typically noisy, with spelling errors and ungrammaticality, a relative lack of precision may actually help. We therefore propose the following ways to manipulate chat data:
Permutation simply reverses the order of any two messages in the context. This may help recover continuity or create alternative orderings of parallel arguments.
Flipping breaks an utterance into two parts and concatenates them in reversed order. The break point is the punctuation mark closest to the middle of the utterance, if there is any; otherwise, we break the utterance at its middle. As illustrated in Table 4, the proposed transformations change neither the implication of the contexts nor the appropriateness of the responses.
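The two transformations can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code: the helper names `permute` and `flip` are our own, a context is assumed to be a list of utterance strings, and utterances are assumed to be whitespace-tokenized.

```python
import re

def permute(context, i, j):
    """Return a copy of the context (a list of utterance strings)
    with the messages at positions i and j swapped."""
    out = list(context)
    out[i], out[j] = out[j], out[i]
    return out

def flip(utterance):
    """Split an utterance at the punctuation mark closest to its
    middle (falling back to the midpoint when no punctuation is
    present) and concatenate the two parts in reversed order."""
    tokens = utterance.split()
    if len(tokens) < 2:
        return utterance
    # Candidate break points: positions after tokens ending in
    # punctuation (the final token is excluded so both halves are
    # non-empty); both ASCII and full-width Chinese marks are covered.
    punct = [k + 1 for k, tok in enumerate(tokens[:-1])
             if re.search(r"[,.;?!，。；？！]$", tok)]
    mid = len(tokens) // 2
    split = min(punct, key=lambda k: abs(k - mid)) if punct else mid
    return " ".join(tokens[split:] + tokens[:split])
```

For instance, `flip("Sorry, I cannot find any trips.")` yields `"I cannot find any trips. Sorry,"`, mirroring the flipped wizard message in Table 4.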
Datasets

We describe the four datasets that we use to evaluate our proposed methods:
Taobao chat log was collected by a vendor of pajamas between 2013 and 2015. The conversations took place on Taobao, one of the largest Chinese e-commerce websites. The website allows two-way conversations between customers and agents in individual sessions.
Ubuntu dialog corpus (Lowe et al., 2015) is the first large dataset of online chats made available. It contains multi-party chat logs from the Ubuntu chat room, where people help each other solve technical problems related to Ubuntu.
Douban conversation corpus is a collection of web forum post discussions from Douban, a Chinese internet community (Wu et al., 2016). It covers a wide range of topics, and is hence open-domain in nature.
Frames dataset was collected by (Asri et al., 2017) in a wizard-of-oz setting. The chats are about booking flights. The wizard has access to a database for answering domain-specific questions. Unlike the datasets mentioned above, the conversations of Frames are highly controlled, so the language is clean and the chats have perfect turn exchanges.
Models

We first give a high-level abstraction of the neural models we investigate. Given a context and candidate responses, the models score each candidate, and the one with the highest score is selected. The models are trained by maximizing the likelihood of the labels. To build the training data, one negative example is sampled from the corpus for each pair of context and true response. We group the models into the following two categories:
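This training-set construction can be sketched as follows. It is a sketch under our own assumptions, not the authors' code: the corpus is taken to be a list of `(context, response)` pairs, and the function name and labeling convention are illustrative.

```python
import random

def build_training_pairs(corpus, seed=0):
    """For each (context, true_response) pair, emit one positive
    example (label 1) and one negative example (label 0) whose
    response is drawn at random from a different pair in the corpus."""
    rng = random.Random(seed)
    examples = []
    for idx, (context, response) in enumerate(corpus):
        examples.append((context, response, 1))
        # Draw an index over the other len(corpus) - 1 pairs, then
        # shift it past idx so the true response is never re-sampled.
        neg = rng.randrange(len(corpus) - 1)
        if neg >= idx:
            neg += 1
        examples.append((context, corpus[neg][1], 0))
    return examples
```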
Dual-Encoder Model (DE)
As first proposed in (Lowe et al., 2015), DE models encode context m and response r into v(m) ∈ R^l and v(r) ∈ R^m, respectively. Then

P(r | m) = σ(v(m)^T M v(r))

where σ is the sigmoid function and M ∈ R^{l×m}. In this paper, the response encoder is an LSTM. We consider two choices of context encoder: one is a word-level LSTM encoder only (LSTM-DE), which takes the concatenated messages as input. The other is a hierarchical recurrent encoder (HRE-DE). For HRE, we encode each message with a word-level LSTM encoder, and then feed the last hidden states from the word-level encoder to an utterance-level encoder, which is also an LSTM. We concatenate the last hidden state of the utterance-level encoder to that of the word-level encoder over the concatenated messages as the final context encoding. Note that HRE-DE is a simplified version of the model in (Zhou et al., 2016).

Sequential Matching Network (SMN)
Unlike DE models, SMN finds the affinity between context messages and responses as a first step.

Corpus   Language   Medium      Style        Domain   Size (Train)   Vocabulary
Ubuntu   English    Chat Room   Noisy        Task     1M             400k
Taobao   Chinese    Chat Room   Noisy        Task     0.9M           90k
Douban   Chinese    Web Forum   Noisy        Open     1M             300k
Frames   English    Chat Room   Controlled   Task     11k            9k
Table 5: Comparison of the four dialog corpora (Wu et al., 2016).

Given messages m_k, where k = 1, ..., n, and response r, SMN first extracts a feature u(m_k, r) ∈ R^p of how related the two utterances are, and then accumulates these features with an RNN:

v(m, r) = RNN(u(m_k, r)), k = 1, ..., n
P(r | m) = σ(w^T v(m, r))

where v(m, r), w ∈ R^q.

Let π_i be the applicable transformations, including the identity. For context m and response r, let m_i = π_i(m) and r_j = π_j(r). For DE models, we use the same encoders for m and r to encode m_i and r_j. Then we combine the encodings and predict by

P(r | m) = σ( Σ_{i,j} v(m_i)^T M_ij v(r_j) )

where M_ij ∈ R^{l×m}. Similarly, for SMN, the predicted score is

P(r | m) = σ( Σ_{i,j} w_ij^T v(m_i, r_j) )

where w_ij ∈ R^q. Note that this score function allows augmentations to be applied at test time for prediction. Additionally, we inject the squared distance between the encodings of the original data and those of the transformed data, in order to force the models to learn similar representations for them. We assume that the transformations do not drastically change the meanings of contexts and responses, even though they are not exactly label-preserving. Empirically, we found that adding this regularization term helps. The training loss for DE models becomes

Σ_{(m,r)} ( −log P(r | m) + t ( Σ_i ||v(m_i) − v(m)||² + Σ_j ||v(r_j) − v(r)||² ) )

and the one for SMN becomes

Σ_{(m,r)} ( −log P(r | m) + t Σ_{i,j} ||v(m_i, r_j) − v(m, r)||² )

where t is a hyper-parameter tuned on the validation set.

Experiments

We evaluate our method on the datasets described above. For the Ubuntu dataset, we use the version shared by (Xu et al., 2016). For Douban, we discard the test set provided by the authors, since its responses are not from the same domain, and re-split the training set. Negative responses are randomly sampled.
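The combined DE score and the distance-regularized training loss defined above can be sketched in NumPy. This is a sketch, not the released implementation: encodings are taken as precomputed vectors, index 0 is assumed to be the identity transformation, and all function names are our own.

```python
import numpy as np

def augmented_de_score(ctx_encs, resp_encs, Ms):
    """P(r | m) = sigmoid( sum_{i,j} v(m_i)^T M_ij v(r_j) ), where
    ctx_encs[i] = v(pi_i(m)), resp_encs[j] = v(pi_j(r)), and
    Ms[i][j] is the bilinear map for the (i, j) transformation pair."""
    logit = sum(ctx_encs[i] @ Ms[i][j] @ resp_encs[j]
                for i in range(len(ctx_encs))
                for j in range(len(resp_encs)))
    return 1.0 / (1.0 + np.exp(-logit))

def de_loss(ctx_encs, resp_encs, Ms, label, t=0.1):
    """Per-example negative log-likelihood plus t times the squared
    distances between each transformed encoding and the original
    (identity) encoding at index 0."""
    p = augmented_de_score(ctx_encs, resp_encs, Ms)
    nll = -np.log(p if label == 1 else 1.0 - p)
    reg = (sum(np.sum((v - ctx_encs[0]) ** 2) for v in ctx_encs[1:])
           + sum(np.sum((v - resp_encs[0]) ** 2) for v in resp_encs[1:]))
    return nll + t * reg
```

With a single identity transformation on each side, `augmented_de_score` reduces to the basic DE score σ(v(m)^T M v(r)).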
For Frames, we select negative responses from those that have different slot types and values from the true responses. We also conduct an experiment with a smaller amount of training data on the three large datasets, Ubuntu, Douban, and Taobao, in which a subset of the training set is randomly selected for training. Following previous work (Lowe et al., 2015), we evaluate model performance with recall-at-1.

We experiment with two types of permutation: the first permutes the last and the penultimate messages in the context, and the second permutes the penultimate message with the third-to-last. We only apply the first type of permutation to SMN, since SMN seems to be insensitive to permutation. We flip all messages in contexts and responses for SMN, and only flip context messages for DE models.

We initialize word embeddings using the results of word2vec (Mikolov et al., 2013) trained on the whole corpus. The size of the word embeddings is 300 for LSTM-DE and HRE-DE, and 200 for SMN. For LSTM-DE and HRE-DE, each LSTM layer has a hidden size of 300. We use the same hyper-parameters for SMN as in (Wu et al., 2016). All models are trained with the Adam optimizer, and we use early stopping to choose parameters. For experiments on small training sets (including Frames), we additionally apply dropout (Srivastava et al., 2014) to all recurrent layers. As a side note, we find that dropout does not affect the results in any significant way in the full-scale setting.

Model            Ubuntu            Taobao            Douban            Frames
                 full     small    full     small    full     small
HRE-DE           0.6729   0.3654   0.8728   0.5085   0.6443   0.3350   0.4436
+ permutation 1  0.6817   0.3650   0.8732   0.5053   0.6401   0.3423   0.4339
+ permutation 2  0.6786
+ flipping
SMN              0.7050   0.4771   0.8194   0.5312   0.6700   0.4662   0.4055
+ permutation 1  0.7066   0.4749   0.8171   0.5302   0.6747   0.4669   0.4023
+ flipping

Table 6: Numbers on recall-at-1; for each dataset, the paired columns are the full-scale and small-scale settings. Best results for each dataset and each model are highlighted.

Table 6 shows the performance of LSTM-DE, HRE-DE, and SMN on the four datasets under different types of augmentation. For each full-scale dataset, nearly all models gain around 1 to 3 points with one of the proposed data augmentation methods. Permutation works best for LSTM-DE, less so for HRE-DE, and has almost no effect on SMN. This is probably because HRE-DE and SMN have an utterance-level recurrent component, which makes them better at capturing long-range dependencies. Permutation 1 does not improve results on the Frames dataset for any model. This might be because Frames has perfect turn-taking, and wizards' responses mostly address their immediately preceding messages, so moving away the last message in the context does not help. In the small-scale setting, LSTM-DE with data augmentation outperforms HRE-DE on some of the datasets. SMN gains even more with flipping than in the full-scale setting.
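Recall-at-1 itself is simple to compute. In this sketch we assume (our own convention, not necessarily that of the corpus evaluation scripts) that each test instance is a list of candidate scores with the true response at index 0.

```python
def recall_at_1(score_lists):
    """Fraction of instances whose true response (index 0 of each
    score list) receives the highest score; ties count as hits."""
    hits = sum(1 for scores in score_lists if scores[0] == max(scores))
    return hits / len(score_lists)
```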
Related Work

Data augmentation has been widely adopted in computer vision and speech recognition (Krizhevsky et al., 2012; Ko et al., 2015). In image processing, label-preserving transformations such as tilting and flipping are used, but in NLP, finding such transformations that exactly preserve meaning is difficult. Language data is discrete in nature, and minor perturbations may change the meaning. The most commonly used techniques include word substitution (Fadaee et al., 2017) and paraphrasing (Dong et al., 2017). These methods may require heavy external resources, which can be difficult to apply across multiple languages and domains.

Recently, there has been a surge of interest in adversarial training (Goodfellow et al., 2014). For text data, one class of methods generates adversarial examples by moving word embeddings along the opposite direction of the gradient of the loss function (Wu et al., 2017; Yasunaga et al., 2017), hence a small perturbation in the continuous space of word vectors. Another class of methods aims to create genuinely new examples. (Li et al., 2017) add syntactic and semantic variations to training data based on grammar rules and a thesaurus. (Xie et al., 2017) add noise to data by blanking out or substituting words for language modeling. (Yang et al., 2017) adopt a seq2seq model (Sutskever et al., 2014) to generate questions based on paragraphs and answers in their generative adversarial framework. One main difference between these methods and our approach is that, while adversarial training only manipulates training data, we additionally apply transformations to data at test time to help prediction. This is closer to (Dong et al., 2017) in spirit.
Conclusion
We proposed a general method to improve dialog response selection through manipulating existing data, which can be applied to different models. Our results show that for both open-domain and task-oriented dialogs, and for both English and Chinese, at least one of the proposed augmentation methods is effective, and they rarely hurt performance. We deliberately chose a diverse set of domains and models to test on, in order to understand the contribution of data augmentation. Thus, even when working on new datasets and new models, data augmentation seems to be a valuable addition that will likely improve results. Being more specific about when augmentation works is harder. One future research direction would be to apply data transformations situationally, based on the discourse structure of dialogs. In our experiments, we tried combining permutation and flipping but found no advantage over using only one type of transformation. We believe a more sophisticated method of combination could further improve the results, and leave it to future work.
References
Yossi Adi, Einat Kermany, Yonatan Belinkov, Ofer Lavi, and Yoav Goldberg. 2016. Fine-grained analysis of sentence embeddings using auxiliary prediction tasks. arXiv preprint arXiv:1608.04207.

Layla El Asri, Hannes Schulz, Shikhar Sharma, Jeremie Zumer, Justin Harris, Emery Fine, Rahul Mehrotra, and Kaheer Suleman. 2017. Frames: A corpus for adding memory to goal-oriented dialogue systems. arXiv preprint arXiv:1704.00057.

Li Dong, Jonathan Mallinson, Siva Reddy, and Mirella Lapata. 2017. Learning to paraphrase for question answering. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 875–886.

Micha Elsner and Eugene Charniak. 2008. You talking to me? A corpus and algorithm for conversation disentanglement. In Proceedings of ACL-08: HLT, pages 834–842.

Marzieh Fadaee, Arianna Bisazza, and Christof Monz. 2017. Data augmentation for low-resource neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 567–573.

Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. 2014. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572.

Urvashi Khandelwal, He He, Peng Qi, and Dan Jurafsky. 2018. Sharp nearby, fuzzy far away: How neural language models use context. arXiv preprint arXiv:1805.04623.

Tom Ko, Vijayaditya Peddinti, Daniel Povey, and Sanjeev Khudanpur. 2015. Audio augmentation for speech recognition. In Sixteenth Annual Conference of the International Speech Communication Association.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105.

Yitong Li, Trevor Cohn, and Timothy Baldwin. 2017. Robust training under linguistic adversity. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 21–27.

Ryan Lowe, Nissan Pow, Iulian Serban, and Joelle Pineau. 2015. The Ubuntu Dialogue Corpus: A large dataset for research in unstructured multi-turn dialogue systems. arXiv preprint arXiv:1506.08909.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

Alan Ritter, Colin Cherry, and William B. Dolan. 2011. Data-driven response generation in social media. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 583–593. Association for Computational Linguistics.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.

Yi Wu, David Bamman, and Stuart Russell. 2017. Adversarial training for relation extraction. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1778–1783.

Yu Wu, Wei Wu, Chen Xing, Ming Zhou, and Zhoujun Li. 2016. Sequential matching network: A new architecture for multi-turn response selection in retrieval-based chatbots. arXiv preprint arXiv:1612.01627.

Ziang Xie, Sida I. Wang, Jiwei Li, Daniel Lévy, Aiming Nie, Dan Jurafsky, and Andrew Y. Ng. 2017. Data noising as smoothing in neural network language models. arXiv preprint arXiv:1703.02573.

Zhen Xu, Bingquan Liu, Baoxun Wang, Chengjie Sun, and Xiaolong Wang. 2016. Incorporating loose-structured knowledge into LSTM with recall gate for conversation modeling. arXiv preprint arXiv:1605.05110.

Zhilin Yang, Junjie Hu, Ruslan Salakhutdinov, and William W. Cohen. 2017. Semi-supervised QA with generative domain-adaptive nets. arXiv preprint arXiv:1702.02206.

Michihiro Yasunaga, Jungo Kasai, and Dragomir Radev. 2017. Robust multilingual part-of-speech tagging via adversarial training. arXiv preprint arXiv:1711.04903.

Xiangyang Zhou, Daxiang Dong, Hua Wu, Shiqi Zhao, Dianhai Yu, Hao Tian, Xuan Liu, and Rui Yan. 2016. Multi-view response selection for human-computer conversation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing.