Does Dialog Length matter for Next Response Selection task? An Empirical Study
Jatin Ganhotra
IBM Research [email protected]
Sachindra Joshi
IBM Research [email protected]
Abstract
In the last few years, the release of BERT, a multilingual transformer-based model, has taken the NLP community by storm. BERT-based models have achieved state-of-the-art results on various NLP tasks, including dialog tasks. One limitation of BERT is its inability to handle long text sequences: by default, BERT has a maximum wordpiece token sequence length of 512. Recently, there has been renewed interest in tackling this limitation with the addition of new self-attention based architectures. However, there has been little to no research on the impact of this limitation with respect to dialog tasks. Dialog tasks are inherently different from other NLP tasks due to: a) the presence of multiple utterances from multiple speakers, which may be interlinked with each other across different turns, and b) the longer length of dialogs. In this work, we empirically evaluate the impact of dialog length on the performance of the BERT model for the next response selection dialog task on four publicly available and one internal multi-turn dialog datasets. We observe that long dialogs have little impact on performance, and even the simplest approach of truncating the input works really well.
BERT is a bidirectional model based on the transformer architecture (Devlin et al., 2019) and is pre-trained on two unsupervised tasks: masked language modeling (MLM) and next sentence prediction (NSP). The pre-trained BERT model can later be fine-tuned on different downstream tasks such as sentiment classification, question answering, natural language inference, etc. Recently, BERT-based models have been used for dialog tasks, where they have shown great promise (Whang et al., 2020; Gu et al., 2020).

However, there are still open questions and concerns with respect to using these models for dialog tasks. Various changes have been proposed for adapting BERT-based models to dialog tasks, e.g. Gu et al. (2020) propose the Speaker-Aware BERT (SA-BERT) model to incorporate the notion of different speakers associated with various utterances in the dialog. The SA-BERT model performs substantially better than the default BERT model. Sankar et al. (2019) explore how transformer-based seq2seq models use the available dialog history by studying their sensitivity to artificially introduced unnatural changes or perturbations to their context at test time. They observe that transformer-based models are rarely sensitive to most perturbations, such as missing or reordered utterances, shuffled words, etc.

One of the key questions that has not received much attention from the community is the impact of dialog length on the performance of BERT-based models. In a dialog (especially a goal-oriented dialog), the various utterances are interlinked with each other, and a common belief state is shared and updated as the dialog progresses. Traditionally, BERT-based models handle the input as a long concatenated sequence and insert special symbols like [SEP] tokens to mark the beginning and end of each individual utterance. However, Ganhotra et al. (2020) show that real-world dialogs can be much longer than the BERT input max_seq_length of 512 tokens, and hence the entire dialog context cannot be fed directly to the BERT model.

By default, BERT has a maximum wordpiece token sequence length of 512. To address this input limitation, many different approaches have been proposed. These approaches fall broadly into two categories: a) keep the BERT architecture the same, and b) replace the self-attention mechanism with new attention-based approaches. We discuss these approaches in more detail in Section 2.

For classification tasks (text/news/document), the key information of an article is usually at the beginning and the end. The same is true for dialog tasks as well, especially goal-oriented dialog. If we look at the anatomy of a dialog, the initial utterances set the overall scope of the conversation, i.e. the topic being discussed, the concern being raised in technical support, or the primary intent of the request in a customer-care setting. The next set of utterances involve back-and-forth between the speakers to finalize the core specific details, and the last few utterances conclude the conversation, usually with a summary and/or confirmation of task completion. However, for a long conversation it is hard to conclusively say that the key information is at both the beginning and the end. It is possible that, as the conversation progressed, the topic may have shifted away from the original one.
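To make the input-length constraint concrete: a dialog is typically concatenated into a single sequence, with [SEP] tokens marking utterance boundaries, and once the wordpiece count exceeds 512 the full context no longer fits. The following is a minimal sketch using the HuggingFace transformers tokenizer; it is illustrative only (not the exact preprocessing pipeline used in the experiments), and the dialog below is a made-up example rather than a sample from any of the datasets.

# Sketch: concatenate a multi-turn dialog into one sequence, separating
# utterances with the [SEP] token, and check the wordpiece length against
# BERT's 512-token limit. The dialog text is made up for illustration.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

dialog = [
    "my laptop won't boot after the latest update",
    "which operating system are you running?",
    "ubuntu 18.04, it hangs on the splash screen",
]

# Join utterances with the [SEP] token so utterance boundaries are explicit.
context = f" {tokenizer.sep_token} ".join(dialog)

# encode() adds [CLS] at the start and a final [SEP] automatically.
ids = tokenizer.encode(context)
print(len(ids), len(ids) > 512)  # token count vs. the max_seq_length limit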
In this work, we take an empirical approach to understand the impact of dialog length on BERT performance for the next response selection task. We experiment with four publicly available and one internal multi-turn dialog datasets, and share our findings in Section 3.

BERT has a maximum wordpiece token sequence length of 512. The magic number 512 was selected to reach a balance between performance and memory usage, as the self-attention mechanism scales quadratically with the sequence length. Thus, for longer input sequences with more than 512 wordpiece tokens, which happens often in real-world dialog datasets (Section 3.1), the entire dialog might not fit within the limit. Several approaches have been proposed to circumvent or tackle the max_seq_length (T) limit of 512. These approaches can be categorized into two groups: a) keep the BERT architecture the same, and b) replace self-attention with a different attention-based approach. We discuss both of these approaches below.

There are four approaches in the first group which try to circumvent the BERT max_seq_length limit: truncation, hierarchical methods, strides, and combining predictions from input subsets.
There are different methods to truncate the input text sequence to fit the BERT max_seq_length:

1. head-only: keep the first 512 tokens
2. tail-only: keep the last 512 tokens
3. hybrid: select the first n and the last (T − n) tokens
4. longest-first: truncate token by token, removing a token from the longest sequence in the pair of input sequences
5. speaker-aware disentanglement: use a heuristic to select a subset of utterances from dialogs with more than 2 speakers (Gu et al., 2020), e.g. DSTC 8-Track 2 (Kim et al., 2019)

For the hybrid and longest-first approaches, n is selected based on the task and dataset, e.g. Sun et al. (2019) select the first 128 and the last 382 tokens. (The first three strategies are sketched in code below, after the remaining approaches in this group.)

In hierarchical methods, the input text (of length L) is divided into k = ⌈L/512⌉ fractions, which are fed into BERT to obtain representations for the k fractions. The hidden state of the [CLS] token is taken as the representation of each fraction. The final representation of the input is obtained by mean pooling, max pooling, or self-attention over the representations of all the fractions. However, this may not give a good representation of the entire input text, and the larger the number of fractions, the more diluted the representation of the entire input becomes. Sun et al. (2019) observe that hybrid truncation performs better than the hierarchical approaches and achieves the best performance.

In the approach of combining predictions, the long input is simply split into sub-inputs, and a prediction is obtained for each sub-input. The predictions for all sub-inputs are then combined to get an overall prediction. Yang et al. (2019) use this approach for open-ended question answering based on Wikipedia articles. They segment documents into paragraphs or sentences and then score only these smaller pieces. The final softmax layer over different answer spans is removed to allow comparison and aggregation of results from different segments.
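A minimal sketch of the head-only, tail-only and hybrid truncation strategies listed above, operating on a flat list of wordpiece tokens. This is illustrative only; real inputs also need to reserve positions for the [CLS] and [SEP] special tokens, which is ignored here for brevity.

# Simplified sketches of the truncation strategies from Section 2.1.1.
def head_only(tokens, max_len=512):
    # Keep the first max_len tokens.
    return tokens[:max_len]

def tail_only(tokens, max_len=512):
    # Keep the last max_len tokens (the strategy used in our experiments).
    return tokens[-max_len:]

def hybrid(tokens, n=128, max_len=512):
    # Keep the first n and the last (max_len - n) tokens when the input
    # exceeds max_len; otherwise return it unchanged.
    if len(tokens) <= max_len:
        return tokens
    return tokens[:n] + tokens[-(max_len - n):]

# Small usage check on a stand-in wordpiece sequence of length 700.
tokens = [f"tok{i}" for i in range(700)]
assert len(tail_only(tokens)) == 512
assert hybrid(tokens)[0] == "tok0" and hybrid(tokens)[-1] == "tok699"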
Recently, many new architectures have been proposed to address the challenge of scaling input length in the standard Transformer. Transformer-XL (Dai et al., 2019) introduces a segment-level recurrence mechanism and a relative positional encoding scheme. It reuses the hidden states of previous segments as memory for the current segment, to build a recurrent connection between segments.

https://huggingface.co/transformers/preprocessing.html

Table 1: Statistics of the datasets. For each dataset, including the internal Tech-Support dataset, we provide details on the total number of train, valid and test pairs and the positive:negative sample ratio. The column '> limit' refers to the count and (%age) of samples where the input length is > 512.
Longformer (Beltagy et al., 2020) replaces the standard quadratic self-attention with an attention mechanism that scales linearly with sequence length. It combines a windowed local-context self-attention with an end-task motivated global attention to encode inductive bias about the task. Reformer (Kitaev et al., 2020) proposes locality-sensitive hashing to reduce the sequence-length complexity and approximate the costly softmax computation in the full dot-product attention. Similar to Reformer, Performer (Choromanski et al., 2020) estimates the softmax attention by using a Fast Attention Via positive Orthogonal Random features approach. Big Bird (Zaheer et al., 2020) proposes a sparse attention mechanism that reduces the quadratic self-attention dependency to linear. Ainslie et al. (2020) introduce a global-local attention mechanism between global tokens and regular input tokens. They combine the ideas of Longformer and Randomized Attention, which reduces the quadratic dependency on the sequence length to linear but requires additional layers for training.
From the two groups mentioned above, these approaches are not often used for various NLP tasks today, as the number of articles longer than max_seq_length is very small (see Table 1 in Sun et al. (2019)). Further, we do not use any of the new architectures from Group b); instead, we use the 'tail-only' truncation strategy for our experiments to evaluate the impact of dialog length on performance. We choose the 'tail-only' strategy because, as the dialog progresses over multiple turns, the recent utterances are likely to capture the most important information required for the end task of recommending the next response.
The next response selection task is defined as follows: given a dialog dataset D, an example is denoted as a triplet ⟨c, r, y⟩, where c = {u_1, u_2, ..., u_n} represents the dialog context with n utterances, r is a response candidate, and y ∈ {0, 1} denotes a label. When r is the correct response for c, then y = 1; else y = 0.

We experiment with four public and one internal multi-turn response selection dialog datasets across different domains (system troubleshooting, social network, e-commerce and technical support): Ubuntu Dialogue Corpus V1 (Lowe et al., 2015), Ubuntu Dialogue Corpus V2 (Lowe et al., 2017), Douban Conversation Corpus (Wu et al., 2017), E-commerce Dialogue Corpus (Zhang et al., 2018) and our internal Tech-Support dataset. Statistics of the datasets are provided in Table 1. The Tech-Support dataset was generated following Lowe et al. (2015) from real-world conversations between users and human agents on technical support.

We observe that the percentage of long dialogs in publicly available benchmarks is very small (about 5%), while it is much higher (about 40%) for our internal Tech-Support dataset. This shows that existing public benchmarks for the next response selection task are not a good representative of real-world dialogs, in the context of overall dialog length.

Table 2: Evaluation results of BERT on all multi-turn next response selection datasets. The column len refers to the dialog length; for the rows with len > 512, dialogs were truncated from the beginning, i.e. tail-only truncation.

Douban Corpus
len     count         MAP    MRR    R10@1  R10@2  R10@5
all     667 (100%)    0.438  0.477  0.267  0.451  0.78
< 512   633 (94.9%)   0.441  0.48   0.272  0.455  0.784
> 512   36 (5.4%)     0.391  0.444  0.222  0.417  0.722

E-commerce Corpus
len     count         MRR    R10@1  R10@2  R10@5
all     969 (100%)    0.533  0.327  0.524  0.829
< 512   967 (99.8%)   0.533  0.327  0.524  0.828
> 512   2 (0.2%)      0.625  0.5    0.5    1

Tech-Support
len     count           MRR    R10@1  R10@2  R10@5
all     13536 (100%)    0.93   0.885  0.947  0.992
< 512   8015 (59.2%)    0.929  0.883  0.948  0.993
> 512   5521 (40.8%)    0.931  0.888  0.945  0.991
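To make the ⟨c, r, y⟩ example format concrete, the sketch below shows one way a single example could be encoded as a BERT sentence pair (context, response) with tail-only truncation of the context. It is an illustrative sketch using the HuggingFace transformers tokenizer, not necessarily the exact preprocessing used for the experiments, and the utterances and response are made up.

# Sketch of one next-response-selection example <c, r, y>, encoded as a
# BERT sentence pair (context, response). Truncation keeps the most recent
# context tokens, i.e. the tail-only strategy.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

context_utterances = [  # c: dialog context (made-up example)
    "hi, my vpn client keeps disconnecting",
    "which client version are you on?",
    "the latest one, reinstalled it yesterday",
]
response = "please share the client logs so we can check the error"  # r
label = 1  # y = 1: r is the correct next response for c

# Join the context utterances with [SEP] markers between them.
context = f" {tokenizer.sep_token} ".join(context_utterances)

# truncation_side="left" drops tokens from the beginning of the context,
# which corresponds to tail-only truncation of the dialog history.
tokenizer.truncation_side = "left"
encoded = tokenizer(
    context,
    response,
    truncation="only_first",  # truncate the context, not the response
    max_length=512,
)
assert len(encoded["input_ids"]) <= 512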
For evaluation, we use Rn@k, where the model is asked to select the k best-matched responses from n available candidates, and the selection is correct if the correct response is among these k responses. In addition to Rn@k, we also use Mean Reciprocal Rank (MRR) and Mean Average Precision (MAP). MRR captures the rank of the first relevant item, while MAP captures whether all of the relevant items are ranked highly, which is important for the Douban corpus, as there are multiple correct candidates for a dialog context in the test set.

We use the uncased BERT-base model for our experiments (https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip). For fine-tuning, all hyper-parameters of the original model were followed (Devlin et al., 2019). We divide the test set for each dataset into two groups based on dialog length: a) < 512 and b) > 512. We compute the evaluation metrics for both groups in addition to the overall test set. The results are provided in Table 2. For the second group, where the input length is > 512, we use the 'tail-only' truncation strategy explained in Section 2.1.1, because as the conversation progresses, the most recent utterances become more important for the next response selection task.

We observe that the performance of the BERT model on the second group (> 512) is comparable to the overall test set performance across all the datasets, including our internal Tech-Support dataset, in which 40% of the test samples are longer than the BERT max_seq_length of 512. This implies that a pre-trained BERT model deployed in production for next response selection would not suffer in performance if the input were longer than the BERT limit, and the most obvious strategy (tail-only), using the most recent utterances from the dialog, works really well.

With the rising popularity of BERT-based models in the NLP community, we empirically evaluate the impact of dialog length on the performance of BERT in the context of the next response selection dialog task. We observe that existing public benchmarks for the next response selection task are not a good representative of real-world dialogs, in the context of overall dialog length. To our surprise, we notice that there is no performance drop for longer dialogs and even the simplest 'tail-only' truncation approach performs really well.

References
Joshua Ainslie, Santiago Ontanon, Chris Alberti, Vaclav Cvicek, Zachary Fisher, Philip Pham, Anirudh Ravula, Sumit Sanghai, Qifan Wang, and Li Yang. 2020. ETC: Encoding long and structured inputs in transformers. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 268–284, Online. Association for Computational Linguistics.

Iz Beltagy, Matthew E Peters, and Arman Cohan. 2020. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150.

Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, et al. 2020. Rethinking attention with performers. arXiv preprint arXiv:2009.14794.

Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G Carbonell, Quoc Le, and Ruslan Salakhutdinov. 2019. Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2978–2988.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Jatin Ganhotra, Haggai Roitman, Doron Cohen, Nathaniel Mills, Chulaka Gunasekara, Yosi Mass, Sachindra Joshi, Luis Lastras, and David Konopnicki. 2020. Conversational document prediction to assist customer care agents. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 349–356, Online. Association for Computational Linguistics.

Jia-Chen Gu, Tianda Li, Quan Liu, Zhen-Hua Ling, Zhiming Su, Si Wei, and Xiaodan Zhu. 2020. Speaker-aware BERT for multi-turn response selection in retrieval-based chatbots. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, pages 2041–2044.

Seokhwan Kim, Michel Galley, Chulaka Gunasekara, Sungjin Lee, Adam Atkinson, Baolin Peng, Hannes Schulz, Jianfeng Gao, Jinchao Li, Mahmoud Adada, et al. 2019. The eighth dialog system technology challenge. arXiv preprint arXiv:1911.06394.

Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya. 2020. Reformer: The efficient transformer. In International Conference on Learning Representations.

Ryan Lowe, Nissan Pow, Iulian Serban, and Joelle Pineau. 2015. The Ubuntu dialogue corpus: A large dataset for research in unstructured multi-turn dialogue systems. In Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 285–294, Prague, Czech Republic. Association for Computational Linguistics.

Ryan Lowe, Nissan Pow, Iulian Vlad Serban, Laurent Charlin, Chia-Wei Liu, and Joelle Pineau. 2017. Training end-to-end dialogue systems with the Ubuntu dialogue corpus. Dialogue & Discourse, 8(1):31–65.

Chinnadhurai Sankar, Sandeep Subramanian, Chris Pal, Sarath Chandar, and Yoshua Bengio. 2019. Do neural dialog systems use the conversation history effectively? An empirical study. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 32–37, Florence, Italy. Association for Computational Linguistics.

Chi Sun, Xipeng Qiu, Yige Xu, and Xuanjing Huang. 2019. How to fine-tune BERT for text classification? In China National Conference on Chinese Computational Linguistics, pages 194–206. Springer.

Taesun Whang, Dongyub Lee, Chanhee Lee, Kisu Yang, Dongsuk Oh, and Heuiseok Lim. 2020. An effective domain adaptive post-training method for BERT in response selection. Proc. Interspeech 2020, pages 1585–1589.

Yu Wu, Wei Wu, Chen Xing, Ming Zhou, and Zhoujun Li. 2017. Sequential matching network: A new architecture for multi-turn response selection in retrieval-based chatbots. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 496–505, Vancouver, Canada. Association for Computational Linguistics.

Wei Yang, Yuqing Xie, Aileen Lin, Xingyu Li, Luchen Tan, Kun Xiong, Ming Li, and Jimmy Lin. 2019. End-to-end open-domain question answering with BERTserini. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 72–77, Minneapolis, Minnesota. Association for Computational Linguistics.

Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, et al. 2020. Big bird: Transformers for longer sequences. arXiv preprint arXiv:2007.14062.

Zhuosheng Zhang, Jiangtong Li, Pengfei Zhu, Hai Zhao, and Gongshen Liu. 2018. Modeling multi-turn conversation with deep utterance aggregation. In