A Simple but Effective BERT Model for Dialog State Tracking on Resource-Limited Systems
Tuan Manh Lai †⋆, Quan Hung Tran †, Trung Bui †, Daisuke Kihara ⋆
⋆ Purdue University, West Lafayette, IN    † Adobe Research, San Jose, CA
ABSTRACT
In a task-oriented dialog system, the goal of dialog state tracking (DST) is to monitor the state of the conversation from the dialog history. Recently, many deep learning based methods have been proposed for the task. Despite their impressive performance, current neural architectures for DST are typically heavily-engineered and conceptually complex, making it difficult to implement, debug, and maintain them in a production setting. In this work, we propose a simple but effective DST model based on BERT. In addition to its simplicity, our approach also has a number of other advantages: (a) the number of parameters does not grow with the ontology size, and (b) the model can operate in situations where the domain ontology may change dynamically. Experimental results demonstrate that our BERT-based model outperforms previous methods by a large margin, achieving new state-of-the-art results on the standard WoZ 2.0 dataset. Finally, to make the model small and fast enough for resource-restricted systems, we apply the knowledge distillation method to compress it. The final compressed model achieves comparable results with the original model while being 8x smaller and 7x faster.

Index Terms — Task-Oriented Dialog Systems, Dialog State Tracking, BERT, Knowledge Distillation
1. INTRODUCTION
Task-oriented dialog systems have attracted more and more attention in recent years, because they allow for natural interactions with users to help them achieve simple tasks such as flight booking or restaurant reservation. Dialog state tracking (DST) is an important component of task-oriented dialog systems [1]. Its purpose is to keep track of the state of the conversation from past user inputs and system outputs. Based on this estimated dialog state, the dialog system then plans the next action and responds to the user. In a slot-based dialog system, a state in DST is often expressed as a set of slot-value pairs. The set of slots and their possible values are typically domain-specific, defined in a domain ontology.

With the renaissance of deep learning, many neural network based approaches have been proposed for the task of DST [2, 3, 4, 5, 6, 7, 8, 9]. These methods achieve highly competitive performance on standard DST datasets such as DSTC-2 [10] or WoZ 2.0 [11]. (We did not use the DSTC-2 dataset because its clean version is no longer accessible at http://mi.eng.cam.ac.uk/~nm480/dstc2-clean.zip; in addition, it has the exact same ontology as the WoZ 2.0 dataset.) However, most of these methods still have some limitations. First, many approaches require training a separate model for each slot type in the domain ontology [2, 4, 7]. The number of parameters is therefore proportional to the number of slot types, making the scalability of these approaches a significant issue. Second, some methods only operate on a fixed domain ontology [3, 4]. The slot types and possible values need to be defined in advance and must not change dynamically. Finally, state-of-the-art neural architectures for DST are typically heavily-engineered and conceptually complex [4, 5, 6, 7]. Each of these models consists of a number of different kinds of sub-components. In general, complicated deep learning models are difficult to implement, debug, and maintain in a production setting.

Recently, several pretrained language models, such as ELMo [12] and BERT [13], were used to achieve state-of-the-art results on many NLP tasks. In this paper, we show that by finetuning a pretrained BERT model, we can build a conceptually simple but effective model for DST. Given a dialog context and a candidate slot-value pair, the model outputs a score indicating the relevance of the candidate. Because the model shares parameters across all slot types, the number of parameters does not grow with the ontology size. Furthermore, because each candidate slot-value pair is simply treated as a sequence of words, the model can be directly applied to new types of slot-value pairs not seen during training. We do not need to retrain the model every time the domain ontology changes. Empirical results show that our proposed model outperforms prior work on the standard WoZ 2.0 dataset. However, a drawback of the model is that it is too large for resource-limited systems such as mobile devices. To make the model less computationally demanding, we propose a compression strategy based on the knowledge distillation framework [14]. Our final compressed model achieves results comparable to that of the full model, but it is around 8 times smaller and performs inference about 7 times faster on a single CPU.
2. METHOD

2.1. BERT
BERT is a powerful language representation model pretrained on vast amounts of unlabeled text corpora [13]. It consists of multiple layers of Transformer [15], a self-attention based architecture. The base version of BERT consists of 12 Transformer layers, each with a hidden size of 768 units and 12 self-attention heads. The input to BERT is a sequence of tokens (words or pieces of words); the output is a sequence of vectors, one for each input token. The input representation of BERT is flexible enough that it can unambiguously represent both a single text sentence and a pair of text sentences in one token sequence. The first token of every input sequence is always a special classification token, [CLS]. The output vector corresponding to this token is typically used as the aggregate representation of the original input. For single-sentence tasks (e.g., sentiment classification), this [CLS] token is followed by the actual tokens of the input text and a special separator token, [SEP]. For sentence-pair tasks (e.g., entailment classification), the tokens of the two input texts are separated by another [SEP] token. This input sequence also ends with the [SEP] token. Figure 1 demonstrates the input representation of BERT.

Fig. 1. BERT input format.

During pretraining, BERT was trained using two self-supervised tasks: masked language modeling (masked LM) and next sentence prediction (NSP). In masked LM, some of the tokens in the input sequence are randomly selected and replaced with a special token [MASK], and the objective is to predict the original vocabulary ids of the masked tokens. In NSP, BERT needs to predict whether two input segments follow each other in the original text. Positive examples are created by taking consecutive sentences from the text corpus, whereas negative examples are created by picking segments from different documents. After the pretraining stage, BERT can be applied to various downstream tasks such as question answering and language inference, without substantial task-specific architecture modifications.

2.2. Applying BERT to DST

Figure 2 shows our proposed application of BERT to DST. At a high level, given a dialog context and a candidate slot-value pair, our model outputs a score indicating the relevance of the candidate. In other words, the approach is similar to a sentence pair classification task. The first input corresponds to the dialog context: it consists of the system utterance from the previous turn and the user utterance from the current turn, separated by a [SEP] token. The second input is the candidate slot-value pair, which we simply represent as a sequence of tokens (words or pieces of words). The two input segments are concatenated into one single token sequence and passed to BERT to obtain the output vectors (h_1, h_2, ..., h_M), where M denotes the total number of input tokens (including special tokens such as [CLS] and [SEP]).

Fig. 2. Architecture of our BERT-based model for DST.

Based on the output vector corresponding to the first special token [CLS] (i.e., h_1), the probability of the candidate slot-value pair being relevant is

y = σ(W h_1 + b) ∈ ℝ,    (1)

where the transformation matrix W and the bias term b are model parameters, and σ denotes the sigmoid function, which squashes the score to a probability between 0 and 1.

At each turn, the proposed BERT-based model is used to estimate the probability score of every candidate slot-value pair. After that, only pairs with predicted probability of at least 0.5 are chosen as the final prediction for the turn.
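For concreteness, the following is a minimal sketch of this scoring step using the HuggingFace transformers library. It is our illustration rather than the authors' released code; in particular, the exact textual rendering of the candidate pair (here "slot = value") is an assumption.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
scorer = torch.nn.Linear(bert.config.hidden_size, 1)  # W and b from Eq. (1)

def score_candidate(system_utt: str, user_utt: str, slot: str, value: str) -> float:
    """Probability that the candidate (slot, value) pair is relevant to this turn."""
    # Segment A: previous system utterance [SEP] current user utterance.
    context = f"{system_utt} [SEP] {user_utt}"
    # Segment B: the candidate slot-value pair as plain tokens (assumed rendering).
    candidate = f"{slot} = {value}"
    inputs = tokenizer(context, candidate, return_tensors="pt")
    h_cls = bert(**inputs).last_hidden_state[:, 0]  # output vector of [CLS]
    return torch.sigmoid(scorer(h_cls)).item()      # Eq. (1)

# Candidates scoring at least 0.5 are kept as the turn's predictions.
p = score_candidate("What kind of food would you like?",
                    "Something Chinese, please.", "food", "chinese")
```

Because the candidate is just a token sequence, nothing in this sketch depends on the ontology size, which is what allows unseen slot-value pairs to be scored without retraining.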
To obtain the dialog state at the current turn, we use the newly predicted slot-value pairs to update the corresponding values in the state of the previous turn. For example, suppose the user asks for a chinese restaurant (food = chinese) during the current turn. If the dialog state has no existing food specification, we add food = chinese to the dialog state. If food = korean had been specified before, we replace it with food = chinese.

Compared to previous works [4, 5, 6, 7], our model is conceptually simpler. For example, in the GLAD model [4], there are two scoring modules: the utterance scorer and the action scorer. Intuitively, the utterance scorer determines whether the current user utterance mentions the candidate slot-value pair, while the action scorer determines the degree to which the slot-value pair was expressed by the previous system actions. In our proposed approach, we simply use a single BERT model to examine the information from all sources at the same time.

2.3. Model Compression

BERT is a powerful language representation model because it was pretrained on large text corpora (Wikipedia and BooksCorpus). However, the original pretrained BERT models are computationally expensive and have a huge number of parameters. For example, the base version of BERT consists of about 110M parameters. Therefore, if we directly integrate an existing BERT model into our DST model, it will be difficult to deploy the final model in resource-limited systems such as mobile devices. In this part, we describe our approach to compress BERT into a smaller model.

Over the years, many model compression methods have been proposed [16, 17, 18, 14]. In this work, we propose a strategy for compressing BERT based on the knowledge distillation framework [14]. Knowledge distillation (KD) aims at transferring knowledge acquired in one model (a teacher) to another model (a student) that is typically smaller. We generally assume that the teacher has previously been trained, and that we are estimating parameters for the student. KD suggests training by matching the student's predictions to the teacher's predictions. In other words, we train the student to mimic the teacher's output activations on individual data examples.

We choose the pretrained base version of BERT as the teacher model. Our student model has the same general architecture as BERT but is much smaller than the teacher model (Table 1). In the student model, the number of Transformer layers is 8, each with a hidden size of 256 units and 8 self-attention heads. The feedforward/filter size is 1024. Overall, our student model has 14M parameters in total; it is 8x smaller and 7x faster on inference than the original teacher model.

We first extract sentences from the BooksCorpus [19], a large-scale corpus consisting of about 11,000 books and nearly 1 billion words. For each sentence, we use the WordPiece tokenizer [20, 13] to tokenize the sentence into a sequence of tokens. Similar to the pretraining phase of BERT, we mask 15% of the tokens in the sequence at random. After that, we define the cross-entropy loss for each token as follows:

L(a^T, a^S) = H( softmax(a^T / τ), softmax(a^S / τ) ),    (2)

where H refers to the cross-entropy function, a^T is the teacher model's pre-softmax logit for the current token, and a^S is the student model's pre-softmax logit. Finally, τ is the temperature hyperparameter [14]; in this work, we set τ to 10. Intuitively, L(a^T, a^S) will be small if the student's prediction for the current token is similar to that of the teacher model.
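As an illustration, Eq. (2) can be written directly in a few lines of PyTorch. This is a sketch of the per-token loss only, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def per_token_distillation_loss(teacher_logits: torch.Tensor,
                                student_logits: torch.Tensor,
                                tau: float = 10.0) -> torch.Tensor:
    """Cross-entropy H(softmax(a_T / tau), softmax(a_S / tau)) from Eq. (2)."""
    p_teacher = F.softmax(teacher_logits / tau, dim=-1)        # soft targets
    log_p_student = F.log_softmax(student_logits / tau, dim=-1)
    return -(p_teacher * log_p_student).sum(dim=-1)            # one loss per token

# The sentence-level loss is the sum of these values over all token positions:
# loss = per_token_distillation_loss(a_T, a_S).sum()
```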
The distillation loss for the entire sentence is simply defined as the sum of the cross-entropy losses of all the tokens. To summarize, during the KD process, we use this distillation loss to train our student model from scratch, using the teacher model's logits on unlabeled examples extracted from the BooksCorpus. After that, we can integrate the distilled student BERT model into our DST model (Figure 2) and use the final model for monitoring the state of the conversation.

Different from the very first work on exploring knowledge distillation for BERT [21], our approach does not use any data augmentation heuristic. We only extract unlabeled sentences from the BooksCorpus to build training examples for distillation. Our work is similar in spirit to DistilBERT [22], which also uses the original BERT as the teacher and a large-scale unlabeled text corpus as the basic learning material. However, as shown in Table 1, the DistilBERT model is about 5 times larger than our student model. Recently, at WWDC 2019, Apple presented a BERT-based on-device model for question answering [23]. Instead of using knowledge distillation, Apple used the mixed precision training technique [24] to build their model. From Table 1, we see that Apple's model is much larger than our student model, as it has 8x more parameters and requires 4x more storage space. This implies that our student model is small enough to be deployed on mobile systems. To the best of our knowledge, we are the first to explore the use of knowledge distillation to compress neural networks for DST.

3. EXPERIMENTS AND RESULTS

Model | Number of parameters | Storage size
Teacher model (N = 12, d = 768, d_ff = 3072, h = 12) | 110M | 440 MB
Student model (N = 8, d = 256, d_ff = 1024, h = 8) | 14M | 55 MB
Apple Core ML BERT [23] | 110M | 220 MB
DistilBERT [22] (N = 6, d = 768, d_ff = 3072, h = 12) | 66M | 270 MB
GLAD [4] | 17M | —

Table 1. Approximate size of different models. For models based on BERT, N is the number of Transformer layers, d is the hidden size, d_ff is the feedforward size, and h is the number of self-attention heads.

To evaluate the effectiveness of our proposed approach, we use the standard WoZ 2.0 dataset. The dataset consists of user conversations with dialog systems designed to help users find suitable restaurants. The ontology contains three informable slots: food, price, and area. In a typical conversation, a user first searches for restaurants by specifying values for some of these slots. As the dialog progresses, the dialog system may ask the user for more information about these slots, and the user answers these questions. The user's goal may also change during the dialog. For example, the user may want an 'expensive' restaurant initially but change to wanting a 'moderately priced' restaurant by the end. Once the system suggests a restaurant that matches the user's criteria, the user can also ask about the values of up to eight requestable slots (phone, address, ...). The dataset has 600/200/400 dialogs for the train/dev/test split. Similar to previous work, we focus on two key evaluation metrics introduced in [10]: joint goal accuracy and turn request accuracy.

Model | Joint goal | Turn request
Full BERT-based model | 90.5 |
Distilled BERT-based model | |
StateNet [5] | 88.9 | —
GCE [6] | 88.5 | 97.4
GLAD [4] | 88.1 | 97.1
BERT-DST PS [9] | 87.7 | —
Neural Belief Tracker - CNN [2] | 84.2 | 91.6
Neural Belief Tracker - DNN [2] | 84.4 | 91.2
Table 2. Test accuracies on the WoZ 2.0 restaurant reservation dataset.

Table 2 shows the test accuracies of different models on the WoZ 2.0 dataset. Our full BERT-based model uses the base version of BERT (i.e., the teacher model in the knowledge distillation process), whereas the distilled BERT-based model uses the compressed student model. Both our full model and the compressed model outperform previous methods by a considerable margin. Even though our compressed model is 8x smaller and 7x faster than the full model (Table 1), it still achieves almost the same results as the full model; in fact, the smaller model has a slightly higher turn request accuracy. From Table 1, we also see that our compressed model has fewer parameters than GLAD, a DST model that is not based on BERT. This demonstrates the effectiveness of our proposed knowledge distillation approach.

Note that BERT-DST PS [9] is a recent work that also utilizes BERT for DST. However, that work focuses only on situations where the target slot value (if any) appears as a word segment in the dialog context. According to Table 2, our models outperform BERT-DST PS on the WoZ dataset. Furthermore, BERT-DST PS uses the original version of BERT, making it large and slow.
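For reference, the joint goal metric reported above is conventionally computed as the fraction of turns whose accumulated predicted state exactly matches the gold state. The sketch below follows that standard definition from [10]; the helper name and state encoding are ours, not code from the paper.

```python
def joint_goal_accuracy(pred_states, gold_states):
    """Fraction of turns whose predicted state matches the gold state exactly.

    Each state is a dict of slot-value pairs accumulated up to that turn,
    e.g. {"food": "chinese", "area": "north"}.
    """
    matches = sum(pred == gold for pred, gold in zip(pred_states, gold_states))
    return matches / len(gold_states)
```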
Model | Inference time on CPU (secs) | Inference time on GPU (secs)
Full BERT-based model | 1.465 | 0.113
Distilled BERT-based model | 0.205 | 0.024
DistilBERT [22] | 0.579 | 0.043

Table 3. Inference time of different models on CPU and GPU. We measured the average time it takes for each model to process one dialog turn (in seconds).
Table 1 shows that our distilled student model is much smaller than many other BERT-based models in previous work. Table 3 shows the inference speed of our models on CPU (Intel Core i7-8700K @ 3.70GHz) and on GPU (GeForce GTX 1080). On average, on CPU, our compressed model performs inference 7x faster than our full model and 3x faster than DistilBERT. On GPU, our compressed model is about 5x faster than the full model and 2x faster than DistilBERT.
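Per-turn latency figures like those in Table 3 can be obtained by timing the forward pass over many turns and averaging. Below is a minimal sketch of such a measurement; the model and inputs are placeholders, and this is our illustration of the methodology rather than the authors' benchmark script.

```python
import time
import torch

def avg_turn_latency(model, inputs, n_runs: int = 100) -> float:
    """Average seconds per forward pass (i.e., per scored dialog turn)."""
    model.eval()
    with torch.no_grad():
        for _ in range(10):                  # warm-up runs, excluded from timing
            model(**inputs)
        if torch.cuda.is_available():
            torch.cuda.synchronize()         # flush queued GPU work before timing
        start = time.perf_counter()
        for _ in range(n_runs):
            model(**inputs)
        if torch.cuda.is_available():
            torch.cuda.synchronize()         # ensure all GPU work is finished
    return (time.perf_counter() - start) / n_runs
```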
4. CONCLUSION
In this paper, we propose a simple but effective model based on BERT for the task of dialog state tracking. Because the original version of BERT is large, we apply the knowledge distillation method to compress our model. Our compressed model achieves state-of-the-art performance on the WoZ 2.0 dataset while being 8x smaller and 7x faster on inference than the original model. In future work, we will experiment on larger-scale datasets such as the MultiWOZ dataset [25].

5. REFERENCES

[1] Steve J. Young, Milica Gasic, Blaise Thomson, and Jason D. Williams, "POMDP-based statistical spoken dialog systems: A review," Proceedings of the IEEE, vol. 101, pp. 1160–1179, 2013.
[2] Nikola Mrksic, Diarmuid Ó Séaghdha, Tsung-Hsien Wen, Blaise Thomson, and Steve J. Young, "Neural belief tracker: Data-driven dialogue state tracking," in ACL, 2016.
[3] Bing Liu and Ian Lane, "An end-to-end trainable neural network model with belief tracking for task-oriented dialog," ArXiv, vol. abs/1708.05956, 2017.
[4] Victor Zhong, Caiming Xiong, and Richard Socher, "Global-locally self-attentive encoder for dialogue state tracking," in ACL, Melbourne, Australia, 2018, pp. 1458–1467, Association for Computational Linguistics.
[5] Liliang Ren, Kaige Xie, Lu Chen, and Kai Yu, "Towards universal dialogue state tracking," in EMNLP, 2018.
[6] Elnaz Nouri and Ehsan Hosseini-Asl, "Toward scalable neural dialogue state tracking model," arXiv preprint arXiv:1812.00899, 2018.
[7] Nikola Mrkšić and Ivan Vulić, "Fully statistical neural belief tracking," in ACL, 2018.
[8] Mandy Korpusik and James R. Glass, "Dialogue state tracking with convolutional semantic taggers," in ICASSP, 2019, pp. 7220–7224.
[9] Guan-Lin Chao and Ian Lane, "BERT-DST: Scalable end-to-end dialogue state tracking with bidirectional encoder representations from transformer," ArXiv, vol. abs/1907.03040, 2019.
[10] Matthew Henderson, Blaise Thomson, and Jason D. Williams, "The second dialog state tracking challenge," in SIGDIAL Conference, 2014.
[11] Tsung-Hsien Wen, Milica Gasic, Nikola Mrksic, Lina Maria Rojas-Barahona, Pei-Hao Su, Stefan Ultes, David Vandyke, and Steve J. Young, "A network-based end-to-end trainable task-oriented dialogue system," in EACL, 2016.
[12] Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer, "Deep contextualized word representations," in NAACL-HLT, 2018.
[13] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," in NAACL-HLT, 2019.
[14] Geoffrey E. Hinton, Oriol Vinyals, and Jeffrey Dean, "Distilling the knowledge in a neural network," ArXiv, vol. abs/1503.02531, 2015.
[15] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin, "Attention is all you need," in NIPS, 2017.
[16] Jimmy Ba and Rich Caruana, "Do deep nets really need to be deep?," in NIPS, 2013.
[17] Song Han, Huizi Mao, and William J. Dally, "Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding," CoRR, vol. abs/1510.00149, 2015.
[18] Song Han, Jeff Pool, John Tran, and William J. Dally, "Learning both weights and connections for efficient neural networks," in NIPS, 2015.
[19] Yukun Zhu, Ryan Kiros, Richard S. Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler, "Aligning books and movies: Towards story-like visual explanations by watching movies and reading books," in ICCV, 2015, pp. 19–27.
[20] Mike Schuster and Kaisuke Nakajima, "Japanese and Korean voice search," in ICASSP, 2012, pp. 5149–5152.
[21] Raphael Tang, Yao Lu, Linqing Liu, Lili Mou, Olga Vechtomova, and Jimmy Lin, "Distilling task-specific knowledge from BERT into simple neural networks," ArXiv, vol. abs/1903.12136, 2019.
[22] Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf, "DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter," ArXiv, vol. abs/1910.01108, 2019.
[23] "Core ML models," https://developer.apple.com/machine-learning/models/, Accessed: 2019-10-20.
[24] Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Frederick Diamos, Erich Elsen, David García, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, and Hao Wu, "Mixed precision training," ArXiv, vol. abs/1710.03740, 2017.
[25] Pawel Budzianowski, Tsung-Hsien Wen, Bo-Hsiang Tseng, Iñigo Casanueva, Stefan Ultes, Osman Ramadan, and Milica Gasic, "MultiWOZ - a large-scale multi-domain Wizard-of-Oz dataset for task-oriented dialogue modelling," in EMNLP, 2018.