JOINT INTENT DETECTION AND SLOT FILLING BASED ON CONTINUAL LEARNING MODEL

Yanfei Hui, Jianzong Wang*, Ning Cheng, Fengying Yu, Tianbo Wu, Jing Xiao

Ping An Technology (Shenzhen) Co., Ltd.

*Corresponding author: Jianzong Wang, [email protected]
ABSTRACT
Slot filling and intent detection have become a significant theme in the field of natural language understanding. Even though slot filling is intensively associated with intent detection, the two tasks require encoded information with different characteristics, a fact that most existing approaches do not fully account for. In addition, effectively balancing the accuracy of the two tasks is an inevitable problem for joint learning models. In this paper, a Continual Learning Interrelated Model (CLIM) is proposed to consider semantic information with different characteristics and to balance the accuracy of intent detection and slot filling effectively. The experimental results show that CLIM achieves state-of-the-art performance on slot filling and intent detection on ATIS and Snips.
Index Terms — Slot Filling, Intent Detection, Continual Learning
1. INTRODUCTION
Natural Language Understanding (NLU) typically includes the intent detection and slot filling tasks, aiming to identify the intent and extract semantic constituents from an utterance. Intent detection is essentially a classification problem, and therefore popular classifiers like Linear Regression (LR), K-Nearest Neighbor (KNN), Support Vector Machines (SVMs) and deep neural network methods can be applied [1, 2]. Slot filling should be considered a sequence labeling task that tags the input word sequence x = (x_1, x_2, ..., x_T) with the slot label sequence y = (y_1, y_2, ..., y_T). It should be noted that, unlike machine translation and speech recognition, the alignment is explicit in slot filling, so alignment-based models typically work well [3, 4]. Popular approaches to solving sequence labeling problems include Conditional Random Fields (CRFs) and Recurrent Neural Network (RNN) based approaches [5]. In particular, Gated Recurrent Unit (GRU) and Long Short-Term Memory (LSTM) models have achieved good performance.

Due to the accumulation of errors, pipeline methods usually fail to achieve satisfactory performance. Accordingly, some works suggested using one joint model for slot filling and intent detection to improve the performance via mutual enhancement between the two tasks. Recently, several joint learning methods for intent detection and slot filling were proposed to improve the performance over independent models [6, 7]. It has been shown that the attention mechanism [8] can improve a model's performance by handling long-range dependencies better than plain RNNs. So far, attention-based joint learning methods have achieved the state-of-the-art performance for joint intent detection and slot filling [9, 10].

However, prior work seems to lose sight of the fact that, although slot filling and intent detection are strongly correlated, the two tasks need encoded information with different characteristics. Slot filling needs more fine-grained encoding information and involves an alignment between slots and words, while intent detection only focuses on the overall semantic information and is not sensitive to word location. Apart from that, there is a "seesaw" phenomenon in the later stage of training a joint model: when the accuracy of one task increases, the accuracy of the other task often declines.

Fig. 1. Multitask joint training process

As shown in Figure 1a, when using a joint model to solve the two tasks at the same time, an increase in task 1 accuracy leads to a deterioration of task 2 performance, and vice versa. In this paper, CLIM is proposed to further enrich the encoded information and take the cross-impact between the two tasks into account. We introduce the idea of continual learning into the training process, which makes the model perform satisfactorily on both tasks, as shown in Figure 1b.

In summary, the contributions of this paper are as follows:

• Applying the idea of continual learning to solve the "precision seesaw" phenomenon in the process of multi-task training.

• Proposing a variant dual-encoder model structure that enriches the semantic and location information by adopting different encoding forms.

2. PROPOSED APPROACH

2.1. Variant Dual-encoder Interrelated Model

We make a novel attempt to bridge the gap between the word-level slot model and the sentence-level intent model via a dual-encoder model architecture, as shown in Figure 2.
Fig. 2. Variant Dual-encoder Interrelated Model
As mentioned above, CLIM applies two encoders (an RNN and a Transformer) to encode the sentence separately. As shown in Figure 2, there are two encoders in the Encoder Block. The one on the left is a Bi-directional Long Short-Term Memory (BiLSTM), which encodes the input information sequentially. The additional encoder on the right is a bilayer transformer block [11]. Transformer models constitute the current state-of-the-art in machine translation and are becoming increasingly popular in other domains as well [12, 13, 14], such as language modelling.

The BiLSTM reads the input sentence (x_1, x_2, ..., x_n) forward and backward and generates two sequences of hidden states h^f_t and h^b_t. A concatenation of h^f_t and h^b_t forms the final BiLSTM state h_t = concat[h^f_t, h^b_t] at time step t. On the other side, the hidden states generated by the two transformer blocks are joined together by a residual mechanism. In order to give the model a faster inference speed, CLIM directly takes the output of the LSTM as the input of the double-layer transformer block, so as to obtain encoding information of a different granularity. This encoder thus also yields a hidden state sequence (h'_1, h'_2, ..., h'_n):

h'_i = tanh(h^(1)_i + h^(2)_i)

where h^(1)_i denotes the output of the first transformer block and h^(2)_i the output of the second transformer block at time step i. The i-th context vector c^sf_i in the slot attention matrix and c^id_i in the intent attention matrix are computed as weighted sums of the encoders' hidden states:

c^sf_i = Σ_{j=1}^{T} α_{i,j} h_j,   c^id_i = Σ_{j=1}^{T} α'_{i,j} h'_j

The attention weights α are obtained in the same way as introduced in [9], and the superscripts 'sf' and 'id' denote slot filling and intent detection, respectively. More formally, h_j, h'_j ∈ R^{d×s} are the j-th hidden states obtained by the two encoders.

Then, in the Interaction Block of CLIM, the two attention matrices are used for information interaction by exchanging them: c^sf_i, c^id_i = c^id_i, c^sf_i. The interaction mechanism can be defined in a variety of ways; we found that a straightforward implementation using matrix exchange already achieves good results. In the Interaction Block, the two kinds of hidden state vectors with different granularity are joined together and then sent to the Decoder Block to complete the slot filling and intent detection tasks. The hidden state vectors h^sf and h^id that are eventually fed to the Decoder Block are obtained by concatenation: h^sf_i = concat[h_i, c^sf_i] and h^id_i = concat[h'_i, c^id_i]. The final hidden state used for intent detection is then obtained by global average pooling. As shown in Figure 2, the Decoder Block includes a decoder and a classifier for slot filling and intent detection, respectively. Next we introduce the details of the decoder and the classifier.

For the classification task of intent detection, CLIM obtains the final intent category distribution by fusing the hidden states h^id_i. The classifier obtains the final intent as follows:

y^intent = softmax(W^intent · GAP(h^id_i))    (1)
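To make the data flow of the Encoder and Interaction Blocks concrete, the following is a minimal PyTorch sketch of one plausible implementation. It is not the authors' released code: the class and variable names, the use of plain dot-product self-attention in place of the attention of [9], and all dimension choices are our own illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualEncoderInterrelated(nn.Module):
    """Illustrative sketch of the CLIM Encoder/Interaction/intent-classifier path."""

    def __init__(self, vocab_size, emb_dim=1024, hidden=200, n_intents=21, n_heads=8):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        # Left encoder: BiLSTM over the word embeddings.
        self.bilstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        d = 2 * hidden  # size of one BiLSTM state h_t
        # Right encoder: two transformer layers fed with the BiLSTM output (the B-T(V) wiring).
        self.trans1 = nn.TransformerEncoderLayer(d_model=d, nhead=n_heads, batch_first=True)
        self.trans2 = nn.TransformerEncoderLayer(d_model=d, nhead=n_heads, batch_first=True)
        # Intent classifier over concat[h'_i, c^id_i] after global average pooling, Eq. (1).
        self.intent_head = nn.Linear(2 * d, n_intents)

    @staticmethod
    def _self_attention(states):
        # Plain dot-product self-attention as a stand-in for the attention weights of [9].
        scores = torch.matmul(states, states.transpose(1, 2))   # (B, T, T)
        alpha = F.softmax(scores, dim=-1)
        return torch.matmul(alpha, states)                       # context vectors c_i

    def forward(self, tokens):
        x = self.embedding(tokens)            # (B, T, emb_dim)
        h, _ = self.bilstm(x)                 # (B, T, 2*hidden), BiLSTM states h_t
        t1 = self.trans1(h)                   # output of the first transformer block
        t2 = self.trans2(t1)                  # output of the second transformer block
        h_prime = torch.tanh(t1 + t2)         # residual combination, h'_i
        c_sf = self._self_attention(h)        # slot attention matrix over h
        c_id = self._self_attention(h_prime)  # intent attention matrix over h'
        c_sf, c_id = c_id, c_sf               # Interaction Block: exchange the two matrices
        h_sf = torch.cat([h, c_sf], dim=-1)        # h^sf_i, later compressed and decoded for slots
        h_id = torch.cat([h_prime, c_id], dim=-1)  # h^id_i
        intent_logits = self.intent_head(h_id.mean(dim=1))  # GAP + linear; softmax gives Eq. (1)
        return h_sf, h_id, intent_logits
```

Under this reading, the slot-side representation h^sf keeps the position-sensitive BiLSTM states, while the intent side works from the coarser transformer view, matching the motivation given in Section 1.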
Besides, CLIM compresses the input information of the decoder, which removes information noise that is only related to the training set and thus improves the generalization ability of the model. The compression process is realized with a one-dimensional convolution:

s_i = conv1d(concat[h^sf_i, h^id])    (2)

where s_i is the input to the slot filling decoder. Finally, the compressed dense representation is used for sequence annotation.

2.2. Dynamic Parameter Generation

Several neural networks can be used to implement Dynamic Parameter Generation (DPG), e.g., a convolutional neural network (CNN), an RNN or a multilayer perceptron (MLP). The objective of this paper is not to explore all possible implementations; we found that a straightforward implementation using an MLP already achieves good results. Formally, DPG can be written as:

p_i = f(z_i; µ) = σ(w_D z_i + b)    (3)

where σ is the activation function, and w_D and b are the parameters of DPG, collectively denoted by µ.
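A hedged sketch of Eq. (2) and Eq. (3) follows. The paper does not specify the convolution kernel size, the compressed dimension, what exactly z_i denotes, or how the sentence-level h^id is combined with each h^sf_i; the choices below (kernel size 1, broadcasting the pooled intent vector to every time step, feeding the compressed s_i into the DPG) are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class CompressAndDPG(nn.Module):
    """Illustrative sketch of Eq. (2) (Conv1d compression) and Eq. (3) (MLP-based DPG)."""

    def __init__(self, slot_dim=800, intent_dim=800, compressed_dim=256, dpg_out=256):
        super().__init__()
        # Eq. (2): one-dimensional convolution over the concatenated features
        # (kernel size 1 is an assumption; the paper does not state it).
        self.compress = nn.Conv1d(slot_dim + intent_dim, compressed_dim, kernel_size=1)
        # Eq. (3): single-layer MLP with sigmoid activation generating the parameters p_i.
        self.dpg = nn.Linear(compressed_dim, dpg_out)

    def forward(self, h_sf, h_id_sentence):
        # h_sf: (B, T, slot_dim); h_id_sentence: (B, intent_dim), the pooled intent vector.
        T = h_sf.size(1)
        h_id_rep = h_id_sentence.unsqueeze(1).expand(-1, T, -1)  # broadcast h^id to every step
        z = torch.cat([h_sf, h_id_rep], dim=-1)                  # concat[h^sf_i, h^id]
        s = self.compress(z.transpose(1, 2)).transpose(1, 2)     # s_i of Eq. (2)
        p = torch.sigmoid(self.dpg(s))                           # p_i = sigma(w_D z_i + b), Eq. (3)
        return s, p
```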
3. EXPERIMENTS AND RESULTS

3.1. Data
The ATIS (Airline Travel Information Systems) data set contains audio recordings of people making flight reservations. As described in Table 1, the ATIS data set contains 4978 training and 893 test spoken utterances. We also use Snips, which is collected from the Snips personal voice assistant; this data set contains 13084 training and 700 test utterances. We evaluate model performance on slot filling using the F1 score and performance on intent detection using the classification error rate.
Table 1. Dataset statistics

Dataset                    ATIS     Snips
Vocabulary size            722      11241
Average sentence length    11.28    9.05
Intents                    21       7
Slots                      120      72
Training samples           4478     13084
Validation samples         500      700
Test samples               893      700
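For reference, the two metrics mentioned above can be computed with standard tooling; the snippet below is a minimal sketch using the seqeval and scikit-learn packages (our choice of tools, not necessarily what the authors used) on a toy ATIS-style example.

```python
from seqeval.metrics import f1_score           # span-level F1 over BIO-tagged slot sequences
from sklearn.metrics import accuracy_score     # intent classification accuracy

# Toy ATIS-style example: gold vs. predicted slot tags and intents for one utterance.
slot_gold = [["O", "B-fromloc.city_name", "O", "B-toloc.city_name"]]
slot_pred = [["O", "B-fromloc.city_name", "O", "B-toloc.city_name"]]
intent_gold = ["atis_flight"]
intent_pred = ["atis_flight"]

slot_f1 = f1_score(slot_gold, slot_pred)               # F1 on exact slot spans
intent_acc = accuracy_score(intent_gold, intent_pred)  # intent accuracy
intent_err = 1.0 - intent_acc                          # error rate, as reported in the paper
print(f"slot F1 = {slot_f1:.4f}, intent error rate = {intent_err:.4f}")
```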
An LSTM cell is used as the basic RNN unit in the experiments. Given the size of the data sets, our proposed model sets the number of units in the LSTM cell to 200. Word embeddings of size 1024 are pre-trained and fine-tuned during mini-batch training with a batch size of 20. A dropout rate of 0.5 is applied after the word embedding layer and between the fully connected layers, and our proposed model uses the Adam optimization method. In the process of model training, after a certain number of epochs, we fix the weights of the word embedding layer so that its parameters are no longer updated.
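The setup above can be summarized as a small configuration and training-loop sketch, reusing the DualEncoderInterrelated module from the sketch in Section 2.1. The learning rate, the total number of epochs, and the epoch at which the embedding layer is frozen are not reported in the paper and are marked as placeholders.

```python
import torch

# Hyper-parameters stated above; learning rate, epoch count and freeze epoch are placeholders.
config = {
    "lstm_units": 200,               # hidden units per LSTM direction
    "embedding_dim": 1024,           # pre-trained word embeddings, fine-tuned during training
    "batch_size": 20,
    "dropout": 0.5,                  # after the embedding layer and between fully connected layers
    "learning_rate": 1e-3,           # placeholder, not reported in the paper
    "epochs": 30,                    # placeholder, not reported in the paper
    "freeze_embeddings_after": 10,   # placeholder epoch at which embeddings stop being updated
}

# Reuses the DualEncoderInterrelated sketch from Section 2.1 (ATIS vocabulary size from Table 1).
model = DualEncoderInterrelated(vocab_size=722,
                                emb_dim=config["embedding_dim"],
                                hidden=config["lstm_units"])
optimizer = torch.optim.Adam(model.parameters(), lr=config["learning_rate"])

for epoch in range(config["epochs"]):
    if epoch == config["freeze_embeddings_after"]:
        # Fix the word-embedding weights so they are no longer updated, as described above.
        model.embedding.weight.requires_grad = False
    # ... standard mini-batch updates of the joint slot-filling / intent-detection loss ...
```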
Table 2 shows the performance of our joint training model on intent detection and slot filling compared to previously reported results, where the compared baselines include the state-of-the-art sequence-based joint model using bidirectional LSTM [7] and the slot-gated model [16]. As shown in this table, our proposed model CLIM achieves a % F1 score on the slot filling task and % intent classification accuracy on the ATIS dataset. Comparing with other models, CLIM matches the best reported accuracy on the intent detection task and achieves state-of-the-art results on the slot filling task, even better than BERT, the pre-trained model with a large number of parameters [10]. Moreover, on the Snips dataset, although the F1 score improves only slightly on the slot filling task, our proposed model makes significant progress on intent detection accuracy: CLIM reduces the error rate of intent detection from % to %, a % reduction in error rate.

It is found that if the traditional continual learning method is used to train the model, the intent detection task is learned first and then the slot filling task, or vice versa [19]. However, through the previous experiments we found that solving the two tasks at the same time with the joint model performs better than training on one task alone, so we modify the continual learning method accordingly. Figure 3 and Figure 4 show the independent/joint/continual learning performance on intent detection and slot filling. We find that CLIM loses 0.06% for slot filling on ATIS but beats joint learning by 0.05% on Snips. We can observe that introducing more parameters does not always lead to better performance.

Table 2. Performance of different joint models on the two datasets.
Model                                       ATIS                      SNIPS
                                            Slot (F1)  Intent (Acc)   Slot (F1)  Intent (Acc)
Atten. enc-dec NN with aligned inputs [9]   95.87      98.43          -          -
Atten.-BiRNN [9]                            95.98      98.21          -          -
Enc-dec focus [15]                          95.79      -              -          -
Slot-Gated [16]                             95.2       94.1           88.8       97.0
CAPSULE-NLU [17]                            95.2       95.0           91.8       97.3
SF-ID network (with CRF) [18]               95.75      97.76          91.43      97.43
Joint BERT [10]                             96.1       97.5           97.0       98.6
CLIM

Fig. 3. Performance of three methods on ATIS

Fig. 4. Performance of three methods on Snips

The different performance of continual learning on the two datasets should be due to their different vocabulary sizes and numbers of slot types. Meanwhile, CLIM beats joint learning by 0.16% for intent detection on Snips and achieves the same result on ATIS. This makes sense, since the classifier can use features from the LSTM and the Transformer simultaneously, and the different model structures complement each other for classification. In particular, our architecture performs best in both cases. In fact, DPG does not introduce a new model structure or complicated operations, and the number of parameters stays almost the same.

To further evaluate the advantages of the variant dual-encoder architecture for joint learning, we tested different dual-encoder structures. Note that all variants are based on joint learning with the information compression mechanism (a wiring sketch is given after the list):

• BiLSTM-BiLSTM [B-B], where both encoders are bidirectional LSTMs.

• BiLSTM-Transformer [B-T], where the left encoder is a bidirectional LSTM and the other encoder is a double-layer transformer with a residual connection.

• BiLSTM-Transformer(V) [B-T(V)], where the input of the transformer is no longer the word vectors but the encoding result of the bidirectional LSTM. This is the structure adopted by the CLIM model in this paper.
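The sketch below is illustrative only (layer types as in the Section 2.1 sketch; single transformer layers stand in for the double-layer block) and contrasts how the second encoder is fed in B-T versus B-T(V).

```python
import torch.nn as nn

emb_dim, hidden = 1024, 200
d = 2 * hidden

bilstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)

# B-T: the transformer encoder reads the word embeddings directly (d_model = emb_dim).
transformer_bt = nn.TransformerEncoderLayer(d_model=emb_dim, nhead=8, batch_first=True)

# B-T(V), the CLIM wiring: the transformer reads the BiLSTM output instead of the embeddings.
transformer_btv = nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True)

def encode_bt(x):                     # x: (B, T, emb_dim) word embeddings
    h, _ = bilstm(x)                  # BiLSTM view of the sentence
    h_prime = transformer_bt(x)       # transformer view built from the raw embeddings
    return h, h_prime

def encode_btv(x):
    h, _ = bilstm(x)
    h_prime = transformer_btv(h)      # transformer refines the BiLSTM states (variant wiring)
    return h, h_prime

# B-B would simply replace the transformer with a second BiLSTM over the embeddings.
```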
Fig. 5. Performance of different encoders on the two datasets.

Figure 5 shows the joint learning performance of our model on the ATIS and Snips data sets when changing the structure of the dual-encoder one variant at a time. We found that with BiLSTM-BiLSTM the F1 score of slot filling on the ATIS dataset decreases slightly, but the model achieves the best accuracy on the intent detection task; more importantly, it achieves state-of-the-art results for both tasks on the Snips dataset at the same time. BiLSTM-Transformer does not perform well on the ATIS dataset, and although it improves intent detection accuracy on the Snips dataset, its F1 score on the slot filling task drops considerably; with only a small number of stacked transformer layers, it is difficult to reach an encoding ability as powerful as BERT's. BiLSTM-Transformer(V) achieves state-of-the-art results for both tasks on the ATIS dataset at the same time. Although the CLIM model yields better results than the other models on the Snips dataset, its improvement in F1 score and accuracy is not as large as that of BiLSTM-BiLSTM. As shown in Figure 5, the model with BiLSTM-BiLSTM performs better than CLIM on the Snips dataset, which is due to the shorter sentences of Snips compared with ATIS. As the results show, the variant dual-encoder model structure contributes to both the slot filling and the intent classification task.
4. CONCLUSION
In this paper, we proposed a continual learning model architecture to address the problem of accuracy imbalance in multitask learning. Using a joint model for the two NLU tasks simplifies the training process, as only one model needs to be trained and deployed. Our proposed model achieves state-of-the-art performance on the ATIS and SNIPS benchmarks.
5. ACKNOWLEDGEMENTS
This paper is supported by the National Key Research and Development Program of China under grants No. 2017YFB1401202, No. 2018YFB1003500, and No. 2018YFB0204400. The corresponding author is Jianzong Wang from Ping An Technology (Shenzhen) Co., Ltd.

6. REFERENCES

[1] Ruhi Sarikaya, Geoffrey E. Hinton, and Bhuvana Ramabhadran, "Deep belief nets for natural language call-routing," pp. 5680–5683, 2011.

[2] Siwei Lai, Liheng Xu, Kang Liu, and Jun Zhao, "Recurrent convolutional neural networks for text classification," in American Association for Artificial Intelligence (AAAI), 2015.

[3] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le, "Sequence to sequence learning with neural networks," in Conference and Workshop on Neural Information Processing Systems (NIPS), 2014.

[4] William Chan, Navdeep Jaitly, Quoc V. Le, and Oriol Vinyals, "Listen, attend and spell: A neural network for large vocabulary conversational speech recognition," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4960–4964, 2016.

[5] Tomáš Mikolov, Stefan Kombrink, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur, "Extensions of recurrent neural network language model," pp. 5528–5531, 2011.

[6] Daniel Guo, Gökhan Tür, Wen-tau Yih, and Geoffrey Zweig, "Joint semantic utterance classification and slot filling with recursive neural networks," pp. 554–559, 2014.

[7] Dilek Zeynep Hakkani-Tür, Gökhan Tür, Asli Çelikyilmaz, Yun-Nung Chen, Jianfeng Gao, Li Deng, and Ye-Yi Wang, "Multi-domain joint semantic frame parsing using bi-directional RNN-LSTM," in Conference of the International Speech Communication Association (Interspeech), 2016.

[8] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio, "Neural machine translation by jointly learning to align and translate," Computing Research Repository (CoRR), vol. abs/1409.0473, 2014.

[9] Bing Liu and Ian Lane, "Attention-based recurrent neural network models for joint intent detection and slot filling," Conference of the International Speech Communication Association (Interspeech), pp. 685–689, 2016.

[10] Qian Chen, Zhu Zhuo, and Wen Wang, "BERT for joint intent classification and slot filling," ArXiv, vol. abs/1902.10909, 2019.

[11] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin, "Attention is all you need," in Conference and Workshop on Neural Information Processing Systems (NIPS), 2017.

[12] Naihan Li, Shujie Liu, Yanqing Liu, Sheng Zhao, and Peng Shi, "Neural speech synthesis with transformer network," in American Association for Artificial Intelligence (AAAI), 2018.

[13] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," in Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), 2018.

[14] John Wieting, Graham Neubig, and Taylor Berg-Kirkpatrick, "A bilingual generative transformer for semantic sentence embedding," Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1581–1594, 2020.

[15] Su Zhu and Kai Yu, "Encoder-decoder with focus-mechanism for sequence labelling based spoken language understanding," in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017.

[16] Chih-Wen Goo, Guang Gao, Yun-Kai Hsu, Chih-Li Huo, Tsung-Chieh Chen, Keng-Wei Hsu, and Yun-Nung Chen, "Slot-gated modeling for joint slot filling and intent prediction," in Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), 2018.

[17] Chenwei Zhang, Yaliang Li, Nan Du, Wei Fan, and Philip S. Yu, "Joint slot filling and intent detection via capsule neural networks," in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), 2019.

[18] Haihong E, Peiqing Niu, Zhongfu Chen, and Meina Song, "A novel bi-directional interrelated model for joint intent detection and slot filling," in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), 2019.

[19] Wenpeng Hu, Zhou Lin, Bing Liu, Chongyang Tao, Zhengwei Tao, Jinwen Ma, Dongyan Zhao, and Rui Yan, "Overcoming catastrophic forgetting for continual learning via model adaptation."