VirAAL: Virtual Adversarial Active Learning For NLU
Gregory Senay∗, Badr Youbi Idrissi∗†, Marine Haziza
Augustus Intelligence / xBrain
[email protected], [email protected], [email protected]
Abstract
This paper presents VirAAL, an Active Learning framework based on Adversarial Training. VirAAL aims to reduce the effort of annotation in Natural Language Understanding (NLU). VirAAL is based on Virtual Adversarial Training (VAT), a semi-supervised approach that regularizes the model through Local Distributional Smoothness. With that, adversarial perturbations are added to the inputs, making the posterior distribution more consistent. Therefore, entropy-based Active Learning becomes robust by querying more informative samples without requiring additional components. The first set of experiments studies the impact of VAT on NLU tasks (joint or not) within low labeled data regimes. The second set shows the effect of VirAAL in an Active Learning (AL) process. Results demonstrate that VAT is robust even in multitask training, where the adversarial noise is computed from multiple loss functions. Substantial improvements are observed with entropy-based AL with VirAAL for querying data to annotate. VirAAL is an inexpensive method in terms of AL computation with a positive impact on data sampling. Furthermore, VirAAL decreases annotations in AL by up to 80%.

Index Terms: Spoken Language Understanding, Virtual Adversarial Training, Low Data Regime, Active Learning
1. Introduction
Data annotation is time-consuming, expensive, and often requires experts, or at least a good understanding of the data, to reach a qualitative annotation. Scalable, fast, and cheap methods exist, such as crowdsourcing solutions like Amazon Mechanical Turk (AMT). However, using AMT or third parties is impossible when data privacy matters and annotation must be done internally. This is especially true in specific domains such as banking, insurance, or medicine. Nevertheless, there are more and more available services with user interfaces, like spoken dialog systems. These systems must be trained on large datasets in order to achieve acceptable interactions with users. They rely on Speech or Natural Language Understanding (SLU, NLU) [1]. In addition to user intents, annotations can include slot and optional domain information, which consequently makes multi-task annotation complex, time-consuming, and repetitive.

Dialog systems can collect massive amounts of user data, but this data can rarely be used directly and very often is impossible to annotate. One way to take advantage of these amounts of data is to use semi-supervised approaches [2, 3, 4]. These methods have yielded consistently reliable results in text classification [5, 6, 7] for datasets like IMDB, Rotten Tomatoes, DBpedia or RCV1. However, these studies usually focus on large labeled datasets.

∗ Authors made equivalent contributions to this work.
† Work done during internship at xBrain.
Submitted to INTERSPEECH 2020.
On the other hand, Active Learning aims to minimize the number of manual annotations required to reach an acceptable performance level. Active learners query unlabeled samples to be annotated [8]. These queries are designed to extract informative or diverse samples; popular methods include uncertainty-based queries. Furthermore, many studies [9, 10] show that random query selection is a robust and consistent baseline. The efficient methods are usually too complex [9, 10, 11, 12] to be easily integrated into the annotation process, making them time-consuming and expensive to deploy effectively. Additionally, these methods that sample data to be annotated do not contribute directly to the training.

Among all these methods, Virtual Adversarial Training (VAT) [5, 13], a semi-supervised approach, has proven its robustness in a large variety of contexts like computer vision, text classification or speech [14, 5, 6, 15]. VAT regularizes the training by adding adversarial examples and avoids overfitting. It could be interesting to evaluate its impact on NLU in a low data regime. Nevertheless, recent advances in NLU [16, 17, 18] rely on joint optimization of intent detection and slot filling, while VAT has been designed for single tasks and mainly evaluated on large datasets. Besides, if VAT can correctly regularize the model even in a low data regime, the posterior distribution of the model becomes more reliable, which would make it possible to better query unlabeled samples using the improved model confidence. This would occur without additional complex Active Learning methods, and would increase the generalization of the NLU model with less data while reducing the human annotation effort by querying more informative data.

This paper tries to answer these questions by evaluating the effectiveness of VAT in low data regimes in Spoken Language Understanding and its impact on the posterior distribution in an Active Learning paradigm.
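To make the mechanism behind VAT concrete, here is a minimal NumPy sketch of the virtual adversarial perturbation and the Local Distributional Smoothness (LDS) term for a toy linear softmax classifier. The function names and the closed-form gradient apply only to this toy model; they are an illustration of the technique, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def vat_perturbation(W, x, eps=1.0, xi=1e-6):
    """One-step power-iteration estimate of the virtual adversarial
    direction for a linear softmax model p(y | x) = softmax(W x).
    For this model, grad_r KL(p(.|x) || p(.|x+r)) = W^T (p_r - p_0)."""
    p0 = softmax(W @ x)                    # current posterior, treated as constant
    d = rng.normal(size=x.shape)
    d /= np.linalg.norm(d)                 # random unit probe direction
    g = W.T @ (softmax(W @ (x + xi * d)) - p0)   # gradient at the probe point
    return eps * g / (np.linalg.norm(g) + 1e-12)

def lds(W, x, r_adv):
    """Local distributional smoothness: KL(p(.|x) || p(.|x + r_adv)),
    the virtual adversarial loss added to the supervised objective."""
    p0, p1 = softmax(W @ x), softmax(W @ (x + r_adv))
    return float(np.sum(p0 * (np.log(p0) - np.log(p1))))
```

In VirAAL the same KL term is computed on word embeddings and backpropagated through the joint-NLU network; the toy model only makes the power-iteration step explicit, and the LDS term needs no labels, which is what makes VAT usable on the unlabeled pool.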
This framework is called VirAAL, for Virtual Adversarial Active Learning. VirAAL combines VAT with uncertainty-based Active Learning in an attempt to increase sample efficiency even more. Section 2 presents VirAAL, its components and its methods for NLU: a joint-NLU model, Virtual Adversarial Training and the Active Learning procedure. Section 3 shows the protocol, experiments and results for low data regimes and Active Learning. The paper ends with a conclusion and some perspectives.
2. VirAAL
Virtual Adversarial Active Learning (VirAAL) relies on three components: a joint-NLU neural network model for joint intent detection and slot filling, a semi-supervised training method based on Virtual Adversarial Training, and entropy-based querying functions for Active Learning. The sections below explain the model, the methods and the motivations. Figure 1 illustrates VirAAL with the adversarial noise on inputs and the entropy-based Active Learning criteria.
Figure 1: VirAAL: an Attention-based Recurrent Neural Network model for joint-NLU (intent detection and slot filling) [16], where adversarial noise r_t is added to the embeddings w̄_t. Active Learning querying functions are based on entropy.

In recent years, joint training has been shown to accomplish state-of-the-art results in NLU [17, 18, 19, 20]. The model architecture used in VirAAL is an Attention-based Recurrent Neural Network for joint intent detection and slot filling [16] with aligned inputs. This model is relatively light and fast to train for iterative human-in-the-loop training, while providing good results in NLU. Let us denote w = {w_t | t = 1, ..., T} an input sequence of words, where T is the sequence length, and s = {s_t | t = 1, ..., T} the target sequence of slots. w̄ and s̄ are the corresponding embeddings of w and s. i ∈ I is the target intent. The superscript x^(k) denotes the k-th element of K = K_l + K_u, where K_l is the labeled set and K_u the unlabeled set.

The model has a bidirectional-LSTM [21] sentence encoder taking the word embeddings w̄ as input and generating a hidden representation h = {h_t | t = 1, ..., T}. The intent attention decoder takes a context vector c_intent that is a weighted sum of the encoder hidden states h, with the last real word hidden state h_T as the attention key. The slot filling decoder is a unidirectional LSTM that takes as input the aligned encoder output h, a context vector c_t, and the last predicted slot embedding s̄_{t-1}; c_t is an attention vector where the key is s̄_{t-1}. The decoders have a softmax layer to output the estimated probability distributions of the targeted intent p_int(i | w̄; θ) and slots p_slot(s_t | w̄, s̄_{t-1}; θ).

3. Experiments

Experiments are conducted in low data regimes: only a small set of labeled data is available. Two sets of experiments are presented.
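Given the posterior distributions p_int and p_slot above, the entropy-based querying criteria can be sketched as follows. This is a minimal illustration with hypothetical function names; the way the paper combines the two criteria for joint training is not fully specified here, so treat `conf_slot` (average of per-token slot entropies, as stated later in the results discussion) and the selection rule as assumptions:

```python
import numpy as np

def entropy(p, axis=-1, eps=1e-12):
    """Shannon entropy of a (batch of) probability distribution(s)."""
    p = np.clip(p, eps, 1.0)
    return -np.sum(p * np.log(p), axis=axis)

def conf_int(p_intent):
    """Intent criterion: entropy of the intent posterior p_int(i | w)."""
    return float(entropy(p_intent))

def conf_slot(p_slots):
    """Slot criterion: average of the per-token slot entropies;
    p_slots has shape (T, n_slot_labels)."""
    return float(entropy(p_slots, axis=-1).mean())

def query(scores, k):
    """Indices of the k least confident (highest-entropy) samples."""
    return np.argsort(np.asarray(scores))[::-1][:k]
```

Because VAT smooths the posterior distribution, these entropies become more trustworthy uncertainty estimates, which is the core idea of VirAAL: no extra Active Learning machinery beyond a sort over scalar scores.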
The first set of experiments studies the impact of VAT on intent detection only, slot filling only, and then joint training. The Active Learning experiments then aim to show the impact of VAT on the Active Learning framework VirAAL, where the annotator is supposed to label two sets of data. At first, the initial set is built with a random data sample. This set is then labeled, followed by a first training with and without VAT. Experiments present the final training performed with an additional set selected with one of two criteria: Random or Entropy. This second set's size is equal to the first, doubling the labeled training data. It is notable that this paper does not focus on the initial sampling selection.

Experiments are performed on two Spoken Language Understanding datasets: ATIS and SNIPS. The ATIS (Airline Travel Information Systems) dataset [30] contains recordings for reserving flights and is widely used in SLU research. This work follows the data split used in many papers [16, 31, 32, 33, 17]. The training set contains a total of 4,978 utterances and is composed of 127 slots and 18 different intents. A development set of 500 utterances is extracted from it, and the test set is composed of 893 utterances. The SNIPS dataset [34] is a voice assistant dataset and contains data annotated in 7 intents (Music, Book, Weather...) and 73 slots. SNIPS is composed of a training set of 13,084 utterances, and development and test sets with 700 utterances each.

In the first experiment, ATIS is evaluated at regularly spaced proportions of labeled data; the smallest regime corresponds to 230 labeled utterances. SNIPS is evaluated at regularly spaced proportions as well; in SNIPS, 1% of labeled data corresponds to 131 labeled utterances.
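The two-round annotation protocol described above (random initial set covering every intent, then a second, equally sized set queried by a criterion) could be sketched as follows. All function names are hypothetical and the coverage-first ordering of the initial set is an assumption consistent with, but not identical to, the paper's sampler:

```python
import numpy as np

def initial_sample(intents, n, rng):
    """Random initial set containing at least one utterance per intent,
    regardless of slot-type coverage."""
    idx = [int(rng.choice(np.where(intents == c)[0])) for c in np.unique(intents)]
    rest = [int(i) for i in rng.permutation(len(intents)) if i not in idx]
    return (idx + rest)[:max(n, len(idx))]

def second_round(scores, labeled, n):
    """Query the n highest-scoring (most informative) samples outside the
    labeled set; with n equal to the initial set size, labels are doubled."""
    ranked = np.argsort(np.asarray(scores))[::-1]
    return [int(i) for i in ranked if i not in labeled][:n]
```

With the Random criterion, `scores` would simply be random values; with the Entropy criterion, they would be the posterior entropies of the model trained on the initial set.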
This experiment compares trainings using 6 different losses: [joint; int; slot; vat+joint; vat+int; vat+slot]. In the Active Learning experiments, the different criteria are evaluated for different set sizes: [10, ...]% for ATIS and [2, ...]% for SNIPS. In Figure 3, X% corresponds to an initial training set of X% randomly annotated data; then, an additional X% is selected by one of the methods. The randomly sampled set contains at least one utterance of each intent, to be similar to real-world scenarios, regardless of the slot type coverage: some slot types can be missing. Additionally, in Active Learning, the initial random set is the same for all the different methods.

Table 1 presents the hyper-parameters used in all experiments. It should be noted that different batch sizes are used between non-VAT and VAT trainings for the ATIS experiments. The larger batch size in VAT aims to include both labeled and unlabeled data in each mini-batch, to optimize all losses simultaneously. Following previous works in joint-NLU [16], a batch size of 16 is used for non-VAT trainings. All SNIPS and VAT trainings use a batch size of 64. Non-VAT trainings have been tested on smaller batch sizes, showing no improvement or even worse performance. Evaluation metrics are accuracy for intent detection and token-level micro-average F1 score for slot filling. Furthermore, all experiments report means and standard deviations over 8 runs. A validation set proportional to the data regimes has been used to ensure no overfitting. Additionally, for reproducibility, the code is available and experiments use FastText embeddings.

Table 1: Training hyper-parameters used in all experiments.

  Parameter             Value   Parameter       Value
  Embedding size        300     Epochs          100/64
  LSTM hidden size      128     Batch sizes     16/64
  LSTM layers           1       Optimizer       Adam
  Slot embedding size   128     Learning rate   0.001

Figure 2 shows consistent improvements of VAT trainings (in orange) across the different labeled data regimes against non-VAT training (in blue).
The biggest improvements occur in the low data regimes (less than 30%) for all training types: intent detection, slot filling and joint. This confirms the hypothesis that VAT, even used in a low labeled data regime, is able to better regularize the model using the unlabeled data. This suggests it is due to the smoothing of the decision boundary by VAT, which propagates the label signal from labeled samples to the neighboring unlabeled samples. This is even more apparent in low data regimes where the labeled signal is sparse. This, in itself, is a form of Active Learning, as [14] pointed out. Additionally, VAT joint training outperforms the baseline, which shows that regularizing with adversarial noise from two different signals (VAT slot loss and VAT intent loss) is still efficient, and suggests it could be used in other multi-task trainings. More interestingly, VAT joint training in the 60% and 10% regimes achieves intent accuracies very similar to the 90% (+0.2%) and 50% (-0.4%) regimes without VAT, respectively in the ATIS and SNIPS datasets.

Figure 2: Test scores of NLU trainings. The left column is ATIS, the right column is SNIPS. The first row shows the intent accuracy and the second row the slot filling F1 score. The model is either trained with intent (int) only, slot filling (slot) only, or joint training (joint). The training methods are cross-entropy (ce) or virtual adversarial loss (vat).

The code is available at https://github.com/xbraininc/viraal/.

Figure 3 shows even greater improvements with Active Learning (AL). The largest gains are obtained in intent detection, with or without joint training. In SNIPS, only 10% of labeled data with VirAAL (vat, ent) is enough to reach higher scores than the 100% regime without VAT and AL (lines in red). This can reduce the annotation effort by up to 80%.
More precisely, intent accuracy reaches 97.89% and slot F1 88.95 with VirAAL, versus 96.86% and 87.40 F1, in joint training. A similar improvement is shown in ATIS with a 30% regime, with +0.62% and +1.54 F1 of absolute difference respectively. In intent-only training on SNIPS, 2% of labels suffices for the intent accuracy to overtake the 100% regime: 96.43% versus 96.05%. Furthermore, VirAAL for intent detection always outperforms the random baseline (vat, random) and AL without VAT (ce, ent). VirAAL outperforms random in almost all data regimes and for both datasets, contrary to entropy-based AL without VAT, which yields worse scores than random in many regimes. Nevertheless, VirAAL in joint training shows mixed results, with slight improvements and a behavior similar to "joint ce, ent". This is probably due to the heterogeneous natures of the intent and slot scores for AL: the slot confidence is an average of the slot entropies. In that respect, it would be interesting to use an AL criterion in a common latent space.

Figure 3: VirAAL (vat, ent) results on the ATIS (left) and SNIPS (right) datasets obtained on low data regimes for the 3 different trainings: intent, slot filling and joint. The Active Learning criteria are: random, Conf_int, Conf_slot and Conf for joint training.

4. Conclusion

This work first demonstrates that Virtual Adversarial Training is a consistent method for training NLU models: intent detection, slot filling and joint training. This is a very effective way to reduce the amount of annotations needed to obtain accurate NLU components.
Experiments show that adversarial noise computed from multiple loss functions effectively regularizes the model, even though VAT was not designed for this purpose. The proposed Virtual Adversarial Active Learning framework, VirAAL, shows even better improvement using entropy-based Active Learning combined with VAT. This is an inexpensive method in terms of computation for efficiently querying samples to annotate, thanks to the smoothness of the posterior distribution. Additionally, VirAAL leads to even further improvements for both intent detection and slot filling and can reduce the labeling effort by up to 80%. Finally, the VirAAL results on joint training suggest that querying samples from a common latent space, rather than combining entropies, could improve entropy-based Active Learning. This could be a future research direction.

5. References

[1] R. De Mori, F. Bechet, D. Hakkani-Tür, M. McTear, G. Riccardi, and G. Tur, "Spoken language understanding," IEEE Signal Processing Magazine, vol. 25, no. 3, pp. 50–58, 2008.
[2] X. J. Zhu, "Semi-supervised learning literature survey," University of Wisconsin-Madison, Department of Computer Sciences, Tech. Rep., 2005.
[3] O. Chapelle, B. Schölkopf, and A. Zien, "Semi-supervised learning (Chapelle, O. et al., Eds.; 2006) [Book reviews]," IEEE Transactions on Neural Networks, vol. 20, no. 3, pp. 542–542, 2009.
[4] X. Zhu and A. B. Goldberg, "Introduction to semi-supervised learning," Synthesis Lectures on Artificial Intelligence and Machine Learning, vol. 3, no. 1, pp. 1–130, 2009.
[5] T. Miyato, A. M. Dai, and I. J. Goodfellow, "Virtual adversarial training for semi-supervised text classification," CoRR, vol. abs/1605.07725, 2016.
[6] D. Sachan, M. Zaheer, and R. Salakhutdinov, "Revisiting LSTM networks for semi-supervised text classification via mixed objective function," Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 6940–6948, 2019.
[7] D. Berthelot, N. Carlini, I. Goodfellow, N. Papernot, A. Oliver, and C. A. Raffel, "MixMatch: A holistic approach to semi-supervised learning," in Advances in Neural Information Processing Systems 32, 2019, pp. 5049–5059.
[8] B. Settles, "Active learning literature survey," 2009.
[9] M. Ducoffe and F. Precioso, "Adversarial active learning for deep networks: a margin based approach," ArXiv, vol. abs/1802.09841, 2018.
[10] D. Gissin and S. Shalev-Shwartz, "Discriminative active learning," ArXiv, vol. abs/1907.06347, 2018.
[11] O. Sener and S. Savarese, "Active learning for convolutional neural networks: A core-set approach," in International Conference on Learning Representations, 2018.
[12] Y. Gal, R. Islam, and Z. Ghahramani, "Deep Bayesian active learning with image data," in Proceedings of the 34th International Conference on Machine Learning, 2017, pp. 1183–1192.
[13] T. Miyato, S. Maeda, M. Koyama, K. Nakae, and S. Ishii, "Distributional smoothing by virtual adversarial examples," in International Conference on Learning Representations, 2016.
[14] T. Miyato, S. Maeda, M. Koyama, and S. Ishii, "Virtual adversarial training: A regularization method for supervised and semi-supervised learning," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 8, pp. 1979–1993, 2019.
[15] Y. Shinohara, "Adversarial multi-task learning of deep neural networks for robust speech recognition," in Interspeech 2016, 2016, pp. 2369–2372.
[16] B. Liu and I. Lane, "Attention-based recurrent neural network models for joint intent detection and slot filling," in Interspeech 2016, 2016, pp. 685–689.
[17] C.-W. Goo, G. Gao, Y.-K. Hsu, C.-L. Huo, T.-C. Chen, K.-W. Hsu, and Y.-N. Chen, "Slot-gated modeling for joint slot filling and intent prediction," in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), New Orleans, Louisiana, 2018, pp. 753–757.
[18] D. Hakkani-Tür, G. Tur, A. Celikyilmaz, Y.-N. V. Chen, J. Gao, L. Deng, and Y.-Y. Wang, "Multi-domain joint semantic frame parsing using bi-directional RNN-LSTM," in Proceedings of the 17th Annual Meeting of the International Speech Communication Association (INTERSPEECH 2016), ISCA, 2016.
[19] X. Yang, Y.-N. Chen, D. Hakkani-Tür, P. Crook, X. Li, J. Gao, and L. Deng, "End-to-end joint learning of natural language understanding and dialogue manager," in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 5690–5694.
[20] Q. Chen, Z. Zhuo, and W. Wang, "BERT for joint intent classification and slot filling," 2019.
[21] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[22] I. J. Goodfellow, J. Shlens, and C. Szegedy, "Explaining and harnessing adversarial examples," in International Conference on Learning Representations, 2015.
[23] N. Carlini, A. Athalye, N. Papernot, W. Brendel, J. Rauber, D. Tsipras, I. J. Goodfellow, A. Madry, and A. Kurakin, "On evaluating adversarial robustness," ArXiv, vol. abs/1902.06705, 2019.
[24] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu, "Towards deep learning models resistant to adversarial attacks," in International Conference on Learning Representations, 2018.
[25] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger, "On calibration of modern neural networks," in Proceedings of the 34th International Conference on Machine Learning, PMLR, Sydney, Australia, 2017, pp. 1321–1330.
[26] B. Lakshminarayanan, A. Pritzel, and C. Blundell, "Simple and scalable predictive uncertainty estimation using deep ensembles," in Advances in Neural Information Processing Systems 30, 2017, pp. 6402–6413.
[27] A. Kirsch, J. R. van Amersfoort, and Y. Gal, "BatchBALD: Efficient and diverse batch acquisition for deep Bayesian active learning," in NeurIPS, 2019.
[28] Y. Shen, H. Yun, Z. C. Lipton, Y. Kronrod, and A. Anandkumar, "Deep active learning for named entity recognition," in International Conference on Learning Representations, 2018.
[29] A. Siddhant and Z. C. Lipton, "Deep Bayesian active learning for natural language processing: Results of a large-scale empirical study," in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 2018, pp. 2904–2909.
[30] P. J. Price, "Evaluation of spoken language systems: The ATIS domain," in Proceedings of the Workshop on Speech and Natural Language (HLT '90), Stroudsburg, PA, USA, 1990, pp. 91–95.
[31] G. Mesnil, Y. Dauphin, K. Yao, Y. Bengio, L. Deng, D. Hakkani-Tür, X. He, L. Heck, G. Tur, D. Yu, and G. Zweig, "Using recurrent neural networks for slot filling in spoken language understanding," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 3, pp. 530–539, 2015.
[32] G. Tur, D. Hakkani-Tür, and L. Heck, "What is left to be understood in ATIS?" in 2010 IEEE Spoken Language Technology Workshop (SLT), 2010, pp. 19–24.
[33] P. Xu and R. Sarikaya, "Convolutional neural network based triangular CRF for joint intent detection and slot filling," in 2013 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), 2013, pp. 78–83.
[34] A. Coucke, A. Saade, A. Ball, T. Bluche, A. Caulier, D. Leroy, C. Doumouro, T. Gisselbrecht, F. Caltagirone, T. Lavril, M. Primet, and J. Dureau, "Snips voice platform: an embedded spoken language understanding system for private-by-design voice interfaces," 2018.