Vector Projection Network for Few-shot Slot Tagging in Natural Language Understanding
Su Zhu, Ruisheng Cao, Lu Chen and Kai Yu∗
MoE Key Lab of Artificial Intelligence, SpeechLab,
Department of Computer Science and Engineering,
Shanghai Jiao Tong University, Shanghai, China
{paul2204, 211314, chenlusz, kai.yu}@sjtu.edu.cn

Abstract
Few-shot slot tagging becomes appealing for rapid domain transfer and adaptation, motivated by the tremendous development of conversational dialogue systems. In this paper, we propose a vector projection network for few-shot slot tagging, which exploits projections of contextual word embeddings on each target label vector as word-label similarities. Essentially, this approach is equivalent to a normalized linear model with an adaptive bias. A contrastive experiment demonstrates that our proposed vector projection based similarity metric can significantly surpass other variants. Specifically, in the five-shot setting on the SNIPS and NER benchmarks, our method outperforms the strongest few-shot learning baseline on F1 score. Our code will be released at https://github.com/sz128/few_shot_slot_tagging_and_NER.

Introduction

Natural language understanding (NLU) is a key component of conversational dialogue systems, converting a user's utterances into the corresponding semantic representations (Wang et al., 2005) for a certain narrow domain (e.g., booking a hotel, searching for a flight). As a core task in NLU, slot tagging is usually formulated as a sequence labeling problem (Mesnil et al., 2015; Sarikaya et al., 2016; Liu and Lane, 2016).

Recently, motivated by commercial applications like Amazon Alexa, Apple Siri, Google Assistant, and Microsoft Cortana, great interest has been attached to rapid domain transfer and adaptation with only a few samples (Bapna et al., 2017). Few-shot learning approaches (Fei-Fei et al., 2006; Vinyals et al., 2016) become appealing in this scenario (Fritzler et al., 2019; Geng et al., 2019; Hou et al., 2020), where a model trained on existing domains is transferred to a new domain given only a handful of labeled examples (the support set). This scenario has been successfully adopted in the slot tagging task by considering both the word-label similarity and the temporal dependency of target labels (Hou et al., 2020).

∗ The corresponding author is Kai Yu.
Nonetheless, it is still a challenge to devise appropriate word-label similarity metrics with good generalization capability.

In this work, a vector projection network is proposed for the few-shot slot tagging task in NLU. To eliminate the impact of unrelated label vectors with large norms, we exploit projections of contextual word embeddings on each normalized label vector as the word-label similarity. Moreover, half the norm of each label vector is utilized as a threshold, which can help reduce false positive errors. One-shot and five-shot experiments on slot tagging and named entity recognition (NER) (Hou et al., 2020) tasks show that our method can outperform various few-shot learning baselines, enhance existing advanced methods like TapNet (Yoon et al., 2019; Hou et al., 2020) and the prototypical network (Snell et al., 2017; Fritzler et al., 2019), and achieve state-of-the-art performance.

Our contributions are summarized as follows:

• We propose a vector projection network for the few-shot slot tagging task that utilizes projections of contextual word embeddings on each normalized label vector as the word-label similarity.

• We conduct extensive experiments to compare our method with different similarity metrics (e.g., dot product, cosine similarity, squared Euclidean distance). Experimental results demonstrate that our method can significantly outperform the others.

Figure 1: A data sample in domain GetWeather.
Related Work

One prominent methodology for few-shot learning in the image classification field mainly focuses on metric learning (Vinyals et al., 2016; Snell et al., 2017; Sung et al., 2018; Oreshkin et al., 2018; Yoon et al., 2019). Metric learning based methods aim to learn an effective distance metric, which can be much simpler and more efficient than other meta-learning algorithms (Munkhdalai and Yu, 2017; Mishra et al., 2018; Finn et al., 2017).

As for few-shot learning in the natural language processing community, researchers have paid more attention to classification tasks, such as text classification (Yan et al., 2018; Yu et al., 2018; Sun et al., 2019; Geng et al., 2019). Recently, few-shot learning for the slot tagging task has become popular and appealing. Fritzler et al. (2019) explored few-shot NER with the prototypical network. Hou et al. (2020) exploited TapNet and label dependency transferring for both slot tagging and NER tasks. Compared to these methods, our model can achieve better performance in new domains by utilizing vector projections as word-label similarities.
Problem Definition

We denote each sentence $x = (x_1, \cdots, x_{|x|})$ as a word sequence, and define its label sequence as $y = (y_1, \cdots, y_{|x|})$. An example for slot tagging in domain GetWeather is provided in Fig. 1. Each domain $D$ includes a set of $(x, y)$ pairs, i.e., $D = \{(x^{(i)}, y^{(i)})\}_{i=1}^{|D|}$.

In the few-shot scenario, the slot tagging model is trained on several source domains $\{D_1, D_2, \cdots, D_M\}$, and then directly evaluated on an unseen target domain $D_t$ which only contains a few labeled samples (the support set). The support set, $S = \{(x^{(i)}, y^{(i)})\}_{i=1}^{|S|}$, usually includes K examples (K-shot) for each of N labels (N-way). Thus, the few-shot slot tagging task is to find the best label sequence $y^*$ given an input query $x$ in the target domain $D_t$ and its corresponding support set $S$:

$$y^* = \arg\max_{y} p_\theta(y \mid x, S) \quad (1)$$

where $\theta$ refers to the parameters of the slot tagging model, and the $(x, y)$ pair and the support set are from the target domain, i.e., $(x, y) \sim D_t$ and $S \sim D_t$. The few-shot slot tagging model is trained on the source domains to maximize the log-likelihood of labels conditioned on the support set:

$$\theta^* = \arg\max_{\theta} \sum_{m=1}^{M} \sum_{(x, y) \sim D_m,\, S \sim D_m} \log p_\theta(y \mid x, S)$$

Vector Projection Network

In this section, we will introduce our model for the few-shot slot tagging task.
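The episode-based setup above (a support set plus a batch of queries drawn from one domain) can be sketched as a small data structure. This is an illustrative sketch, not the paper's released code; the `Episode` class and the toy GetWeather tags are our own assumptions.

```python
# Sketch of the episode-based few-shot setup: each episode pairs a K-shot
# support set with a batch of query sentences from the same (target) domain.
from dataclasses import dataclass
from typing import List, Tuple

Sentence = List[str]   # x = (x_1, ..., x_|x|)
LabelSeq = List[str]   # y = (y_1, ..., y_|x|), BIO slot tags

@dataclass
class Episode:
    support: List[Tuple[Sentence, LabelSeq]]  # about K examples per label (K-shot)
    query:   List[Tuple[Sentence, LabelSeq]]  # batch evaluated against the support set

# A toy 1-shot episode loosely following the GetWeather domain of Figure 1:
episode = Episode(
    support=[(["will", "it", "rain", "in", "Paris"],
              ["O", "O", "B-condition", "O", "B-city"])],
    query=[(["weather", "in", "London"],
            ["O", "O", "B-city"])],
)
```

At training time, episodes are sampled from the source domains; at test time, the support set comes from the unseen target domain.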
CRF Framework

A linear-chain Conditional Random Field (CRF) (Sutton et al., 2012) considers the correlations between labels in neighborhoods and jointly decodes the most likely label sequence given the input sentence (Yao et al., 2014; Ma and Hovy, 2016). The posterior probability of a label sequence $y$ is computed via:

$$\psi_\theta(y, x, S) = \sum_{i=1}^{|x|} \left( f_T(y_{i-1}, y_i) + f_E(y_i, x, S) \right)$$
$$p_\theta(y \mid x, S) = \frac{\exp(\psi_\theta(y, x, S))}{\sum_{y'} \exp(\psi_\theta(y', x, S))}$$

where $f_T(y_{i-1}, y_i)$ is the transition score and $f_E(y_i, x, S)$ is the emission score at the $i$-th step.

The transition score captures temporal dependencies of labels in consecutive time steps, and is a learnable scalar for each label pair. To share the underlying factors of transition between different domains, we adopt the Collapsed Dependency Transfer (CDT) mechanism (Hou et al., 2020).

The emission scorer independently assigns each word a score with respect to each label $y_i$, which is defined as a word-label similarity function:

$$f_E(y_i, x, S) = \mathrm{SIM}(E(x)_i, c_{y_i}) \quad (2)$$

where $E$ is a contextual word embedding function, e.g., a BLSTM (Graves, 2012) or a Transformer (Vaswani et al., 2017), and $c_{y_i}$ is the label embedding of $y_i$, which is extracted from the support set $S$. In this paper, we adopt a pre-trained BERT model (Devlin et al., 2019) as $E$.

Various models have been proposed to extract the label embedding $c_{y_i}$ from $S$, such as the matching network (Vinyals et al., 2016), the prototypical network (Snell et al., 2017) and TapNet (Yoon et al., 2019). Taking the prototypical network as an example, each prototype (label embedding) is defined as the mean vector of the embedded supporting points belonging to it:

$$c_{y_i} = \frac{1}{N_{y_i}} \sum_{j=1}^{|S|} \sum_{k=1}^{|x^{(j)}|} \mathbb{I}\{y_k^{(j)} = y_i\}\, E(x^{(j)})_k \quad (3)$$

where $N_{y_i} = \sum_{j=1}^{|S|} \sum_{k=1}^{|x^{(j)}|} \mathbb{I}\{y_k^{(j)} = y_i\}$ is the number of words labeled with $y_i$ in the support set.
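As a minimal illustration of the prototype computation in Eq. 3, each label's prototype is the mean of the support-set word embeddings carrying that label. The function name and the toy 2-dimensional vectors (standing in for BERT outputs $E(x)_i$) are ours, not from the released code.

```python
# Sketch of prototype extraction (Eq. 3): average the embeddings of all
# support-set words that share a label to get that label's prototype c_{y_i}.
import numpy as np

def prototypes(support_embs, support_labels):
    """support_embs: list of [len_j, d] arrays; support_labels: list of tag lists."""
    sums, counts = {}, {}
    for embs, tags in zip(support_embs, support_labels):
        for vec, tag in zip(embs, tags):
            sums[tag] = sums.get(tag, np.zeros_like(vec)) + vec
            counts[tag] = counts.get(tag, 0) + 1
    # Divide each label's sum by N_{y_i}, the number of words with that label.
    return {tag: sums[tag] / counts[tag] for tag in sums}

embs = [np.array([[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]])]
tags = [["O", "O", "B-city"]]
protos = prototypes(embs, tags)
# protos["O"] is the mean of [1, 0] and [3, 0], i.e. [2, 0].
```

The resulting prototypes are then scored against query-word embeddings by the similarity function of Eq. 2.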
Vector Projection Similarity

For the word-label similarity function, we propose to exploit vector projections of word embeddings $x_i$ on each normalized label vector $c_k$:

$$\mathrm{SIM}(x_i, c_k) = \frac{x_i^\top c_k}{\|c_k\|} \quad (4)$$

Different from the dot product used by Hou et al. (2020), this helps eliminate the impact of $c_k$'s norm, avoiding the circumstance where the norm of $c_k$ is large enough to dominate the similarity metric. In order to reduce false positive errors, half the norm of each label vector is utilized as an adaptive bias term:

$$\mathrm{SIM}(x_i, c_k) = \frac{x_i^\top c_k}{\|c_k\|} - \frac{\|c_k\|}{2} \quad (5)$$

A simple interpretation of the above vector projection network is that it learns a distinct linear classifier for each label. We can rewrite the above formulas as a linear model:

$$\mathrm{SIM}(x_i, c_k) = x_i^\top w_k + b_k \quad (6)$$

where $w_k = \frac{c_k}{\|c_k\|}$ and $b_k = -\frac{\|c_k\|}{2}$. The weights are normalized as $\|w_k\| = 1$ to improve the generalization capability of the few-shot model. Experimental results indicate that vector projection is an effective choice compared to the dot product, cosine similarity, squared Euclidean distance, etc.

Experiments

Settings. We evaluate the proposed method following the data split provided by Hou et al. (2020) on the SNIPS (Coucke et al., 2018) and NER datasets. It is in the episode data setting (Vinyals et al., 2016), where each episode contains a support set (1-shot or 5-shot) and a batch of labeled samples. For slot tagging, the SNIPS dataset consists of 7 domains with different label sets: Weather (We), Music (Mu), PlayList (Pl), Book (Bo), Search Screen (Se), Restaurant (Re) and Creative Work (Cr). For NER, 4 different datasets are utilized to act as different domains: CoNLL-2003 (News) (Sang and De Meulder, 2003), GUM (Wiki) (Zeldes, 2017), WNUT-2017 (Social) (Derczynski et al., 2017) and OntoNotes (Mixed) (Pradhan et al., 2013). More details of the data split are shown in Appendix A. For each dataset, we follow Hou et al.
(2020) to select one target domain for evaluation, one domain for validation, and utilize the remaining domains as source domains for training. We report the average F1 score at the episode level. For each experiment, we run it ten times with different random seeds. The training details are illustrated in Appendix B.

Baselines. SimBERT: For each word $x_i$, SimBERT finds the most similar word $x'_k$ in the support set and assigns the label of $x'_k$ to $x_i$, according to the cosine similarity of word embeddings from a fixed BERT. TransferBERT:
A trainable linear classifier is applied on a shared BERT to predict labels for each domain. Before evaluation, it is fine-tuned on the support set of the target domain.
L-WPZ(ProtoNet)+CDT+PWE:
WPZ is a few-shot sequence labeling model (Fritzler et al., 2019) that regards sequence labeling as classification of each word. It pre-trains a prototypical network (Snell et al., 2017) on source domains, and utilizes it to do word-level classification on target domains without fine-tuning. It is enhanced with BERT, the Collapsed Dependency Transfer (CDT) and Pair-Wise Embedding (PWE) mechanisms by Hou et al. (2020).
L-TapNet+CDT+PWE:
The previous state-of-the-art method for few-shot slot tagging (Hou et al., 2020), which incorporates TapNet (Yoon et al., 2019) with BERT, CDT and PWE. The few-shot data are available at https://atmahou.github.io/attachments/ACL2020data.zip.

Model                        We     Mu     Pl     Bo     Se     Re     Cr     Avg.
1-shot
SimBERT                      36.10  37.08  35.11  68.09  41.61  42.82  23.91  40.67
TransferBERT                 55.82  38.01  45.65  31.63  21.96  41.79  38.53  39.06
L-WPZ(ProtoNet)+CDT+PWE      71.23  47.38  59.57  81.98  69.83  66.52  62.84  65.62
L-TapNet+CDT+PWE             71.53  60.56  66.27
L-ProtoNet+CDT+VP (ours)     73.19  58.62  68.26  83.54  77.88
5-shot
SimBERT                      53.46  54.13  42.81  75.54  57.10  55.30  32.38  52.96
TransferBERT                 59.41  42.00  46.07  20.74  28.20  67.75  58.61  46.11
L-WPZ(ProtoNet)+CDT+PWE      74.68  56.73  52.20  78.79  80.61  69.59  67.46  68.58
L-TapNet+CDT+PWE             71.64  67.16  75.88  84.38  82.58  70.05  73.41  75.01
L-TapNet+CDT+VP (ours)       78.25  67.79  70.66  86.17  75.80  78.51  75.93  76.16
ProtoNet+CDT+VP (ours)       79.88  67.77  78.08  87.68
Table 1: F1 scores on few-shot slot tagging of SNIPS. Results with standard deviations are shown in Appendix C.2.

Model                        1-shot                               5-shot
                             News   Wiki   Social Mixed  Avg.    News   Wiki   Social Mixed  Avg.
SimBERT                      19.22  6.91   5.18   13.99  11.32   32.01  10.63  8.20   21.14  18.00
TransferBERT                 4.75   0.57   2.71   3.46   2.87    15.36  3.62   11.08  35.49  16.39
L-TapNet+CDT+PWE             44.30
L-ProtoNet+CDT+VPB (ours)    43.47  10.95  28.43
Table 2: F1 scores on few-shot slot tagging of NER. Results with standard deviations are shown in Appendix C.2.

SIM(x, c)                          SNIPS              NER
                                   1-shot   5-shot    1-shot   5-shot
x⊤c / ‖c‖
x⊤c / ‖c‖ − ‖c‖/2
x⊤c
x⊤c / ‖x‖
x⊤c / (‖x‖ ‖c‖)
λ x⊤c
−‖x − c‖²

Table 3: Comparison among different similarity functions. Results are average F1-scores of all domains.
We borrow the results of these baselines from Hou et al. (2020). "L-" means label-enhanced prototypes are applied by using label name embeddings.

Results. Table 1 and Table 2 show results on both 1-shot and 5-shot slot tagging of the SNIPS and NER datasets, respectively. Our method can significantly outperform all baselines, including the previous state-of-the-art model. Moreover, the previous state-of-the-art model heavily relies on PWE, which concatenates an input sentence with each sample in the support set and then feeds them into BERT to get pair-wise embeddings. By comparing "L-TapNet+CDT+PWE" with "L-TapNet+CDT+VP", we can find that our proposed Vector Projection (VP) can achieve better performance as well as higher efficiency. If we incorporate the negative half norm of each label vector as a bias (VPB), the F1 score on 5-shot slot tagging is dramatically improved. We speculate that 5-shot slot tagging involves multiple support points for each label, so false positive errors could occur more frequently if there were no threshold when predicting each label. We also find that label name embeddings ("L-") help less in our methods.

For the word-label similarity function $\mathrm{SIM}(x, c)$, we also conduct contrastive experiments between our proposed vector projection and other variants, including the dot product ($x^\top c$), the projection of the label vector on the word embedding ($\frac{x^\top c}{\|x\|}$), cosine similarity ($\frac{x^\top c}{\|x\| \|c\|}$), squared Euclidean distance ($-\|x - c\|^2$), and even a trainable scaling factor ($\lambda x^\top c$) (Oreshkin et al., 2018). The results in Table 3 show that our methods can significantly outperform these alternative metrics. We also notice that the squared Euclidean distance can achieve competitive results in the 5-shot setting. Mathematically,

$$-\|x - c\|^2 = -x^\top x + 2x^\top c - c^\top c \simeq 2x^\top c - c^\top c$$

where $-x^\top x$ is constant with respect to each label and thus omitted. This further consolidates our assumption that $c^\top c$ can function as a bias term to alleviate false positive errors.

Model                 SNIPS 1-shot          SNIPS 5-shot         NER 1-shot             NER 5-shot
                      O-X    X-O   X-X      O-X   X-O   X-X      O-X    X-O    X-X      O-X    X-O   X-X
ProtoNet+CDT          10815  3552  17440    4802  1377  6532     58498  9890   35991    19344  1505  9091
ProtoNet+CDT+VP       4400   3409  10638    2177  1214  3610     13075  29183  13893    5217   6283  3595
ProtoNet+CDT+VPB      4118   3818  10959    1762  1076  3343     11976  26851  16032    2388   6617  3280

Table 4: Error analysis of slot tagging for different error patterns. Numbers are summed over all domains.

Figure 2: Definition of three error types of slot tagging, which are "O-X", "X-O" and "X-X". "C" means correct predictions.

Effect of Vector Projection.
We claimed that vector projection could help reduce false positive errors. As illustrated in Figure 2, we classify all wrong predictions of slot tagging into three error types (i.e., "O-X", "X-O" and "X-X"), where "O" means no slot and "X" means a slot tag beginning with 'B' or 'I'. The error analysis for these three error types is shown in Table 4. We can find that our methods significantly reduce wrong predictions of all three types on the SNIPS dataset. On the NER dataset, our methods achieve a remarkable reduction in "O-X" and "X-X" errors, while leading to an increase in "X-O" errors. However, the total number of these three errors is also reduced by our methods on the NER dataset.
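The VP and VPB similarities whose error counts are compared above can be sketched in a few lines. This is an illustrative sketch with toy vectors; the function names `sim_vp` and `sim_vpb` are ours.

```python
# Sketch of the two projection-based similarities: VP projects the word
# embedding onto the normalized label vector (Eq. 4), and VPB additionally
# subtracts half the label vector's norm as an adaptive bias (Eq. 5).
import numpy as np

def sim_vp(x, c):
    return x @ c / np.linalg.norm(c)

def sim_vpb(x, c):
    return x @ c / np.linalg.norm(c) - np.linalg.norm(c) / 2.0

x = np.array([1.0, 1.0])   # contextual word embedding
c = np.array([0.0, 4.0])   # label vector with a large norm

# The plain dot product lets the large-norm label dominate (x·c = 4.0),
# while VP removes that scale (projection = 1.0) and VPB further applies
# the threshold (1.0 - 2.0 = -1.0), suppressing a false positive.
```

This illustrates why VPB in particular cuts down false positive ("O-X"-style) errors: a label vector's own norm acts as a per-label decision threshold.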
Fine-tuning on the Support Set.
Apart from few-shot slot tagging focusing on model transfer instead of fine-tuning, we also analyze the effect of continuing to fine-tune our models on the support set in Appendix C.1.
Conclusion

In this paper, we propose a vector projection network for the few-shot slot tagging task, which can be interpreted as a normalized linear model with an adaptive bias. Experimental results demonstrate that our method significantly outperforms the strongest few-shot learning baseline on the SNIPS and NER datasets in both 1-shot and 5-shot settings. Furthermore, our proposed vector projection based similarity metric remarkably surpasses other variants. For future work, we would like to add a learnable scale factor for the bias in Eq. 6.
References
Ankur Bapna, G¨okhan T¨ur, Dilek Hakkani-T¨ur, andLarry P. Heck. 2017. Towards zero-shot framesemantic parsing for domain scaling. In
INTER-SPEECH , pages 2476–2480.Alice Coucke, Alaa Saade, Adrien Ball, Th´eodoreBluche, Alexandre Caulier, David Leroy, Cl´ementDoumouro, Thibault Gisselbrecht, Francesco Calt-agirone, Thibaut Lavril, Ma¨el Primet, and JosephDureau. 2018. Snips Voice Platform: an embed-ded Spoken Language Understanding system forprivate-by-design voice interfaces. arXiv preprintarXiv:1805.10190 .Leon Derczynski, Eric Nichols, Marieke van Erp, andNut Limsopatham. 2017. Results of the WNUT2017shared task on novel and emerging entity recogni-tion. In
Proceedings of the 3rd Workshop on NoisyUser-generated Text , pages 140–147.Jacob Devlin, Ming-Wei Chang, Kenton Lee, andKristina Toutanova. 2019. BERT: Pre-training ofdeep bidirectional transformers for language under-standing. In
Proceedings of the 2019 Conference ofthe North American Chapter of the Association forComputational Linguistics: Human Language Tech-nologies , pages 4171–4186.Li Fei-Fei, Rob Fergus, and Pietro Perona. 2006. One-shot learning of object categories.
IEEE transac-tions on pattern analysis and machine intelligence ,28(4):594–611.helsea Finn, Pieter Abbeel, and Sergey Levine. 2017.Model-agnostic meta-learning for fast adaptation ofdeep networks. In
Proceedings of the 34th Inter-national Conference on Machine Learning , pages1126–1135. JMLR. org.Alexander Fritzler, Varvara Logacheva, and MaksimKretov. 2019. Few-shot classification in named en-tity recognition task. In
Proceedings of the 34thACM/SIGAPP Symposium on Applied Computing ,pages 993–1000.Ruiying Geng, Binhua Li, Yongbin Li, Xiaodan Zhu,Ping Jian, and Jian Sun. 2019. Induction networksfor few-shot text classification. In
Proceedings ofthe 2019 Conference on Empirical Methods in Nat-ural Language Processing and the 9th InternationalJoint Conference on Natural Language Processing(EMNLP-IJCNLP) , pages 3895–3904.Alex Graves. 2012. Supervised sequence labelling. In
Supervised sequence labelling with recurrent neuralnetworks , pages 5–13. Springer.Yutai Hou, Wanxiang Che, Yongkui Lai, Zhihan Zhou,Yijia Liu, Han Liu, and Ting Liu. 2020. Few-shotslot tagging with collapsed dependency transfer andlabel-enhanced task-adaptive projection network. In
ACL .Diederik P Kingma and Jimmy Ba. 2014. Adam: Amethod for stochastic optimization. arXiv preprintarXiv:1412.6980 .Bing Liu and Ian Lane. 2016. Attention-based recur-rent neural network models for joint intent detec-tion and slot filling. In , pages 685–689.Xuezhe Ma and Eduard Hovy. 2016. End-to-endsequence labeling via bi-directional LSTM-CNNs-CRF. In the 54th Annual Meeting of the Associationfor Computational Linguistics , pages 1064–1074.Gr´egoire Mesnil, Yann Dauphin, Kaisheng Yao,Yoshua Bengio, Li Deng, Dilek Hakkani-Tur, Xi-aodong He, Larry Heck, Gokhan Tur, Dong Yu, et al.2015. Using recurrent neural networks for slot fill-ing in spoken language understanding.
IEEE/ACMTransactions on Audio, Speech, and Language Pro-cessing , 23(3):530–539.Nikhil Mishra, Mostafa Rohaninejad, Xi Chen, andPieter Abbeel. 2018. A simple neural attentive meta-learner. In
International Conference on LearningRepresentations .Tsendsuren Munkhdalai and Hong Yu. 2017. Metanetworks. In
Proceedings of the 34th InternationalConference on Machine Learning , pages 2554–2563.JMLR. org.Boris Oreshkin, Pau Rodr´ıguez L´opez, and AlexandreLacoste. 2018. TADAM: Task dependent adaptivemetric for improved few-shot learning. In
Advances in Neural Information Processing Systems , pages721–731.Sameer Pradhan, Alessandro Moschitti, Nianwen Xue,Hwee Tou Ng, Anders Bj¨orkelund, Olga Uryupina,Yuchen Zhang, and Zhi Zhong. 2013. Towards ro-bust linguistic analysis using ontonotes. In
Pro-ceedings of the Seventeenth Conference on Computa-tional Natural Language Learning , pages 143–152.Erik Tjong Kim Sang and Fien De Meulder. 2003.Introduction to the CoNLL-2003 shared task:Language-independent named entity recognition. In
Proceedings of the Seventh Conference on Natu-ral Language Learning at HLT-NAACL 2003 , pages142–147.Ruhi Sarikaya, Paul A Crook, Alex Marin, MinwooJeong, Jean-Philippe Robichaud, Asli Celikyilmaz,Young-Bum Kim, Alexandre Rochette, Omar ZiaKhan, Xiaohu Liu, et al. 2016. An overview of end-to-end language understanding and dialog manage-ment for personal digital assistants. In , pages391–397.Jake Snell, Kevin Swersky, and Richard Zemel. 2017.Prototypical networks for few-shot learning. In
Ad-vances in neural information processing systems ,pages 4077–4087.Shengli Sun, Qingfeng Sun, Kevin Zhou, and TengchaoLv. 2019. Hierarchical attention prototypical net-works for few-shot text classification. In
Proceed-ings of the 2019 Conference on Empirical Methodsin Natural Language Processing and the 9th Inter-national Joint Conference on Natural Language Pro-cessing (EMNLP-IJCNLP) , pages 476–485.Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang,Philip HS Torr, and Timothy M Hospedales. 2018.Learning to compare: Relation network for few-shotlearning. In
Proceedings of the IEEE Conferenceon Computer Vision and Pattern Recognition , pages1199–1208.Charles Sutton, Andrew McCallum, et al. 2012. Anintroduction to conditional random fields.
Founda-tions and Trends R (cid:13) in Machine Learning , 4(4):267–373.Ashish Vaswani, Noam Shazeer, Niki Parmar, JakobUszkoreit, Llion Jones, Aidan N Gomez, ŁukaszKaiser, and Illia Polosukhin. 2017. Attention is allyou need. In Advances in neural information pro-cessing systems , pages 5998–6008.Oriol Vinyals, Charles Blundell, Timothy Lillicrap,Daan Wierstra, et al. 2016. Matching networks forone shot learning. In
Advances in neural informa-tion processing systems , pages 3630–3638.Ye-Yi Wang, Li Deng, and Alex Acero. 2005. Spokenlanguage understanding–an introduction to the statis-tical framework.
IEEE Signal Processing Magazine ,22(5):16–31.eiming Yan, Yuhui Zheng, and Jie Cao. 2018. Few-shot learning for short text classification.
Multime-dia Tools and Applications , 77(22):29799–29810.Kaisheng Yao, Baolin Peng, Geoffrey Zweig, DongYu, Xiaolong Li, and Feng Gao. 2014. Recurrentconditional random field for language understanding.In , pages4077–4081.Sung Whan Yoon, Jun Seo, and Jaekyun Moon. 2019.TapNet: Neural network augmented with task-adaptive projection for few-shot learning. arXivpreprint arXiv:1905.06549 .Mo Yu, Xiaoxiao Guo, Jinfeng Yi, Shiyu Chang, SaloniPotdar, Yu Cheng, Gerald Tesauro, Haoyu Wang,and Bowen Zhou. 2018. Diverse few-shot textclassification with multiple metrics. arXiv preprintarXiv:1805.07513 .Amir Zeldes. 2017. The GUM corpus: creating mul-tilayer resources in the classroom.
Language Re-sources and Evaluation , 51(3):581–612.
A Details of the Datasets
The data split provided by Hou et al. (2020) is applied to the SNIPS and NER datasets. Statistics of the original datasets are provided in Table 5, which lists the number of samples and the number of labels for each domain.

Task          Dataset    Domain   #Samples  #Labels
Slot Tagging  SNIPS      We       2100      17
                         Mu       2100      18
                         Pl       2042      10
                         Bo       2056      12
                         Se       2059      15
                         Re       2073      28
                         Cr       2054      5
NER           CoNLL      News     20679     9
              GUM        Wiki     3493      23
              WNUT       Social   5657      13
              OntoNotes  Mixed    159615    37

Table 5: Statistics of the original datasets.
Hou et al. (2020) reorganized the datasets for few-shot slot tagging and NER in the episode data setting (Vinyals et al., 2016), where each episode contains a support set (1-shot or 5-shot) and a batch of labeled samples. The 1-shot and 5-shot scenarios mean that each label of a domain appears about 1 and 5 times in the support set, respectively. Overviews of the few-shot data splits on SNIPS and NER are shown in Table 6 and Table 7, respectively. For SNIPS, each domain consists of 100 episodes. For NER, each domain contains 200 episodes in the 1-shot scenario and 100 episodes in the 5-shot scenario.
Domain    1-shot               5-shot
          Avg. |S|   Sample    Avg. |S|   Sample
We
Mu
Pl
Bo
Se
Re
Cr

Table 6: Overview of few-shot slot tagging data from SNIPS. "Avg. |S|" refers to the average support set size of each domain, and "Sample" indicates the number of labelled samples in the batches of all episodes.

Domain    1-shot               5-shot
          Avg. |S|   Sample    Avg. |S|   Sample
News
Wiki
Social
Mixed

Table 7: Overview of few-shot data for the NER experiments.
B Training Details
In all the experiments, we use the uncased BERT-Base model (Devlin et al., 2019) as $E$ to extract contextual word embeddings. The models are trained using ADAM (Kingma and Ba, 2014) with a learning rate of 1e-5 and updated after each episode. We fine-tune BERT with layer-wise learning rate decay (rate 0.9), i.e., the parameters of the $l$-th layer get an adaptive learning rate $1\mathrm{e}{-5} \times 0.9^{(L-l)}$, where $L$ is the total number of layers in BERT. The CRF transition parameters are initialized as zeros, and a larger learning rate of 1e-3 is applied to them.

For each dataset, we follow Hou et al. (2020) to select one target domain for evaluation, one domain for validation, and utilize the remaining domains as source domains for training. The models are trained for five iterations, and we save the parameters with the best F1 score on the validation domain. We use the average F1 score at the episode level, and the F1-score is calculated using the CoNLL evaluation script. For each experiment, we run it ten times with different random seeds.

We run our models on GeForce GTX 2080 Ti graphics cards; the average training time and the number of parameters of each model are provided in Table 8.

Method             Time per Batch (SNIPS)  Time per Batch (NER)  #Params
L-TapNet+CDT+VP    224ms                   273ms                 110M
ProtoNet+CDT+VP    176ms                   223ms                 110M
ProtoNet+CDT+VPB   184ms                   240ms                 110M
Table 8: Runtime and model size of our methods.
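The layer-wise learning rate decay described above can be sketched as follows. The function name is illustrative; in practice these per-layer rates would be assigned to the corresponding BERT parameter groups in the optimizer.

```python
# Sketch of layer-wise learning rate decay: layer l of an L-layer encoder
# gets lr = base_lr * decay^(L - l), so lower layers are updated more gently.
def layerwise_lrs(base_lr=1e-5, decay=0.9, num_layers=12):
    return {l: base_lr * decay ** (num_layers - l)
            for l in range(1, num_layers + 1)}

lrs = layerwise_lrs()
# The top layer (l = 12) keeps the base rate 1e-5; layer 1 is scaled by 0.9^11.
```

With BERT-Base this gives 12 distinct rates, decaying geometrically from the top layer down to the embedding-adjacent layers.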
C Additional Analyses and Results
C.1 Fine-tuning on the Support Set
Almost all few-shot slot tagging methods choose not to keep fine-tuning on the support set, for efficiency. Here we want to know how performance changes if our methods are fine-tuned on the support set. Concretely, a pre-trained model is fine-tuned on the support set of one episode and then evaluated on the data batch of that episode. Since different episodes are independent, the model is reinitialized to the pre-trained one before the next episode. We fine-tune the "ProtoNet+CDT+VP" model for a few steps using the same hyper-parameters as in training. As illustrated in Table 9, we can find that fine-tuning on the support set yields further improvements.

Fine-tune step    SNIPS               NER
                  1-shot   5-shot     1-shot   5-shot
Table 9: Results are averaged F1-scores of all domains. The backbone method is "ProtoNet+CDT+VP".
C.2 Results with Standard Deviations
Tables 10, 11, 12 and 13 show the complete results with standard deviations on SNIPS and NER.

Table 10: F1 scores on 1-shot slot tagging of the SNIPS dataset. * indicates a result borrowed from Hou et al. (2020).

Table 11: F1 scores on 5-shot slot tagging of the SNIPS dataset. * indicates a result borrowed from Hou et al. (2020).

Table 12: F1 scores on 1-shot slot tagging of the NER dataset. * indicates a result borrowed from Hou et al. (2020).

Table 13: F1 scores on 5-shot slot tagging of the NER dataset. * indicates a result borrowed from Hou et al. (2020).