FewJoint: A Few-shot Learning Benchmark for Joint Language Understanding
Yutai Hou, Jiafeng Mao, Yongkui Lai, Cheng Chen, Wanxiang Che, Zhigang Chen, Ting Liu
Research Center for Social Computing and Information Retrieval, Harbin Institute of Technology
State Key Laboratory of Cognitive Intelligence, Hefei, China
{ythou, jfmao, klai, car, tliu}@ir.hit.edu.cn, [email protected], [email protected]

Abstract
Few-shot learning (FSL) is one of the key future steps in machine learning and has raised a lot of attention. However, in contrast to the rapid development in other domains, such as Computer Vision, the progress of FSL in Natural Language Processing (NLP) is much slower. One of the key reasons for this is the lack of public benchmarks. NLP FSL research always reports new results on its own constructed few-shot datasets, which is pretty inefficient for results comparison and thus impedes cumulative progress. In this paper, we present FewJoint, a novel Few-Shot Learning benchmark for NLP. Different from most NLP FSL research that only focuses on simple N-classification problems, our benchmark introduces few-shot joint dialogue language understanding, which additionally covers the structure prediction and multi-task reliance problems. This allows our benchmark to reflect the real-world NLP complexity beyond simple N-classification. Our benchmark is used in the few-shot learning contest of SMP2020-ECDT task-1. We also provide a compatible FSL platform to ease experiment set-up.

∗ Corresponding author. The Eighth China National Conference on Social Media Processing. Link: https://smp2020.aconf.cn/smp.html. The dataset and platform are available at https://github.com/AtmaHou/MetaDialog.

Deep learning has achieved significant successes, but these successes heavily rely on massive annotated data. Few-Shot Learning (FSL) is one of the keys to breaking such shackles, and commits to learning new tasks with only a few examples (usually only one or two per category) (Miller et al., 2000; Fei-Fei et al., 2006; Lake et al., 2015; Vinyals et al., 2016). FSL has made impressive progress in many areas, such as computer vision (CV) (Vinyals et al., 2016; Snell et al., 2017; Yoon et al., 2019). But the progress of FSL in natural language processing (NLP) is much slower.
One of the primary constraints is the lack of a unified benchmark for few-shot NLP, so new methods cannot be easily compared and iteratively improved.

Existing few-shot NLP research mainly focuses on simple N-classification problems, such as text classification (Sun et al., 2019; Geng et al., 2019; Yan et al., 2018; Yu et al., 2018; Bao et al., 2020; Vlasov et al., 2018) and entity relation classification (Lv et al., 2019; Gao et al., 2019a; Ye and Ling, 2019). However, on one hand, these works often report results on their own constructed few-shot data, which is pretty inefficient for results comparison and thus hinders cumulative progress. On the other hand, these simple N-classification problems cannot reflect the complexity of real-world NLP tasks. NLP tasks often face the challenges of structure prediction problems, such as sequence labeling (Lafferty et al., 2001; Hou et al., 2020) and parsing (Klein and Manning, 2003). More importantly, different NLP tasks are often deeply related to each other, i.e., multi-task problems (Worsham and Kalita, 2020). One typical scenario of complex NLP is the Dialogue Language Understanding problem, which includes two sub-tasks: Intent Detection (text classification) and Slot Tagging (sequence labeling). As a multi-task problem, these two sub-tasks are proved to strongly promote and depend on each other (Chen et al., 2019; Goo et al., 2018).

One of the main obstacles in constructing an NLP FSL benchmark comes from the special evaluation paradigm of FSL. Few-shot models are usually first pre-trained on data-rich domains (to learn general prior experience) and then tested on unseen few-shot domains. Thus, FSL evaluations always need a lot of different domains to conquer the result-randomness from domain selection and limited learning shots. But it is often hard to gather enough domains for NLP tasks. To solve this, existing works (Bao et al., 2020; Gao et al., 2019b) construct fake domains from a single dataset.
They split all labels into training labels and testing labels. Then, they construct fake pre-training and testing domains with training and testing labels respectively, so that testing labels are unseen during pre-training. Such simulation can yield plenty of related domains, but lacks reality and only works when the label set is large. Actually, splitting labels is impractical for many real-world NLP problems. For example, in Named Entity Recognition, the label sets are often too small to split (usually 3 or 5 labels).

In this paper, we present FewJoint, a novel FSL benchmark for joint multi-task learning, to promote FSL research in the NLP area. To reflect real-world NLP complexities beyond simple N-classification, we adopt a sophisticated and important NLP problem for the benchmark: Task-oriented Dialogue Language Understanding. Task-oriented Dialogue is a rising research area that develops dialogue systems to help users achieve goals, such as booking tickets. Language Understanding is a fundamental module of Task-oriented Dialogue that extracts semantic frames from user utterances (see Figure 1). It contains two sub-tasks: Intent Detection and Slot Tagging. With the Slot Tagging task, our benchmark covers one of the most common structure prediction problems: sequence labeling. Besides, thanks to the natural dependency between Intent Detection and Slot Tagging, our benchmark can embody the multi-task challenge of NLP problems. To conquer randomness and make an adequate evaluation, we include 59 different dialogue domains from a real industrial API, which is a considerable domain amount compared to all existing few-shot and dialogue data. We also provide a Few-shot Learning platform to ease experiment set-up and comparison.

In summary, our contribution is three-fold: (1) We present a novel Few-shot Learning benchmark with 59 real-world domains, which allows evaluating few-shot models without constructing fake domains.
(2) We propose to reflect real-world NLP complexities by covering the structure prediction problem and the multi-task learning problem. (3) We provide a Few-shot Learning platform to ease comparison and implementation of few-shot methods.

Before introducing the dataset, we present the definition of the few-shot language understanding problem here.

Starting from the notions, we define an utterance $x = (x_1, x_2, \ldots, x_n)$ as a sequence of words and define the corresponding semantic frame as $y = (c, s)$, where $c$ is the intent label of the utterance and $s = (s_1, s_2, \ldots, s_n)$ is the slot label sequence of the utterance. A domain $\mathcal{D} = \{(x^{(i)}, y^{(i)})\}_{i=1}^{N_\mathcal{D}}$ is a set of $(x, y)$ pairs. For each domain, there is a corresponding domain-specific label set $L_\mathcal{D}$. To simplify the description and ease understanding, we combine the label set definitions of intent and slot, and assume that the number of labels $N$ is the same for all domains.

In few-shot learning scenarios, models are usually first trained on a set of source domains $\{\mathcal{D}_1, \mathcal{D}_2, \ldots\}$, then evaluated on another set of unseen target domains $\{\mathcal{D}'_1, \mathcal{D}'_2, \ldots\}$. A target domain $\mathcal{D}'_j$ only contains a few labeled examples, which is called the support set $S = \{(x^{(i)}, y^{(i)})\}_{i=1}^{N_S}$. $S$ usually includes $K$ examples (K-shot) for each of $N$ labels (N-way). Figure 1 shows an example of the training and testing process of 1-shot dialogue understanding.

The K-shot dialogue understanding task is then defined as follows: given an input query utterance $x = (x_1, x_2, \ldots, x_n)$ and a K-shot support set $S$ as references, find the most appropriate semantic frame $y^*$ of $x$:
$$y^* = (y_1, y_2, \ldots, y_n) = \arg\max_y \; p(y \mid x, S).$$

In this section, we introduce the construction process of the FewJoint dataset, which generally contains two steps. Firstly, we collect and annotate a complete dialogue understanding corpus of 59 domains (Section 3.1 and Section 3.2).
Then, we split the corpus into training domains and unseen few-shot domains. We also sample support and query sets to simulate few-shot scenarios (Section 3.3).
Figure 1: Training and testing examples of the few-shot dialogue language understanding benchmark. We train the model on a set of source domains, and test it on an unseen domain with only a support set. [Train Task 1 (Weather). Support set: QueryWeather: 查询 [景德镇]City 的天气 ("query the weather of [Jingdezhen]City"); QueryTemperature: [北京]City [明天]Date 的温度 ("the temperature of [Beijing]City [tomorrow]Date"). Query set: 明天哈尔滨天气怎么样 ("how is the weather in Harbin tomorrow"); 今天多少度 ("how many degrees is it today"). Train Task 2 (Ticket). Support set: FindFlightTicket: 从 [上海]FromCity 到 [广州]ToCity 的机票 ("flight tickets from [Shanghai]FromCity to [Guangzhou]ToCity"); FindTrainTicket: 查询 [今天]Date 到 [沈阳]ToCity 的火车票 ("search train tickets to [Shenyang]ToCity for [today]Date"). Query set: 查去深圳的火车 ("search trains to Shenzhen"); 明天有去北京的飞机么 ("are there flights to Beijing tomorrow"). Test Task (Multimedia). Support set: PlayMusic: 播放 [周杰伦]Artist 的 [珊瑚海]Music ("play [Coral Sea]Music by [Jay Chou]Artist"); PlayMovie: 我想看 [周杰伦]Artist 的 [大灌篮]Movie ("I want to watch [Kung Fu Dunk]Movie by [Jay Chou]Artist"). Query set: 播放阿凡达 ("play Avatar"); 我想听王菲的人间 ("I want to listen to Faye Wong's Ren Jian").]

We collect dialogue utterances of real dialogue domains from the AIUI open dialogue platform of iFlytek (http://aiui.xfyun.cn/index-aiui). Before utterance collection, we select popular domains based on the frequency of API calls, such as "Search information of corona-virus". We ignore the domains that have no intent or slot schema to ensure joint learning. For schema definition, we leverage the semantic frames and domains defined by AIUI, and also refine parts of the domains to remove ambiguous labels. Finally, we gather 59 different dialogue domains together with semantic frame definitions.

Given the semantic frame definitions, we collect user utterances from two sources: (1) real user utterances; (2) utterances written by workers. For source (1), we sample existing user utterances from the AIUI platform and remove the sensitive information. For source (2), four workers were asked to impersonate users of dialogue agents and write query utterances for specific domains, such as querying the weather. The average ratio of utterances between sources (1) and (2) is about 3:7.
After collecting raw user utterances, we label each utterance with both intent (sentence-level) and slot labels (token-level). The support sets in Figure 1 show examples of utterances with intent and slot annotations.

The annotating process consists of two steps. Firstly, we obtain a rough annotation for each utterance by predicting its semantic frame with the testing tools of the AIUI platform. Then, human workers validate each roughly annotated utterance and re-annotate the inappropriate ones. The data was divided equally into four parts and then annotated by four workers respectively. The four workers who participated in utterance writing are also responsible for this part.

After annotating, we perform data re-checking. Another three workers each checked all the data, and incorrect data were re-annotated.
Till now, we have collected the annotated dialogue corpus. To test the learning performance of few-shot learning models, we need to reconstruct the data into the Few-Shot Learning (FSL) setting. In the FSL setting, models are first pre-trained on data-rich domains and then tested on unseen domains with only a few-shot support set. Figure 1 shows an example of the few-shot learning setting.
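This setting can be summarized as plain data structures. The following is a sketch only: the container and function names are illustrative assumptions, not the released data format of the benchmark.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Example:
    """One utterance x = (x_1, ..., x_n) with its semantic frame y = (c, s)."""
    words: List[str]   # token sequence x
    intent: str        # sentence-level intent label c
    slots: List[str]   # token-level slot labels s, one per word

@dataclass
class Domain:
    """A domain D: a named set of (x, y) pairs."""
    name: str
    examples: List[Example]

def evaluate_on_target(model, support: List[Example],
                       queries: List[Example]) -> List[Tuple[str, List[str]]]:
    """FSL evaluation on one unseen target domain: the model sees only the
    small support set S, then predicts a frame for every query utterance."""
    return [model.predict(q.words, support) for q in queries]
```

The point of the sketch is that a target domain contributes only its support set at test time; everything else the model knows must come from pre-training on the source domains.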
Few-shot Data Construction.
To achieve the few-shot problem setting described in Section 2, we reserve some domains as few-shot testing domains, which are unseen during training. Specifically, we first split the 59 domains into three parts with no intersection: train, dev and test. Then, on each dev or test domain, we construct a K-shot support set and use the other data as the query set. Thus, FewJoint can simulate few-shot scenarios on the unseen testing domains: let the models predict the labels of the query samples with only a few support examples. Table 1 shows the details of the domain division.

Reconstructing Testing Domains
We reconstruct each test/dev domain into two parts: a few-shot support set and a query set. Here, a K-shot support set is manually constructed by the following principles:
• Ensure each class (intent and slot) appears at least K times, while keeping the support set as small as possible.
• Avoid duplication between the support set and the query set.
• Encourage diversity of both expressions and slot values in the support set.

Reconstructing Training Domains
The training set consists of 45 training domains, which provide prior experience to help quick learning on unseen domains. For few-shot learning, there are two popular strategies for learning such prior experience, and their data formats are very different. In our benchmark, we provide two kinds of training set format to support these two learning strategies:
(1) Learn the feature encoding layer on all training data, which simply requires combining all train-domain utterances into a single pre-training set.
(2) Learn the ability of learning quickly when given only a few examples, i.e., meta learning. This requires reconstructing the training set into a series of few-shot episodes (i.e., support set and query set pairs).
Strategy (1) does not require special data processing. To support strategy (2), we need to sample query and support sets to construct few-shot learning episodes within the training domains. Here, we adopt the Minimum-including Algorithm (Hou et al., 2020) to achieve automatic sampling of plentiful few-shot episodes.
The Minimum-including Algorithm helps to sample support sets for sequence labeling problems and multi-label problems, where a single instance may be associated with multiple labels. In these problem settings, the normal N-way K-shot support set definition is inapplicable, because different labels randomly co-occur in one sentence, and we cannot guarantee that each label appears exactly K times. Take the dialogue understanding problem as an example: each utterance instance is often associated with multiple labels, including one intent and multiple slots. For example, in the 1-shot support set of Figure 1, the slot "FromCity" appears twice to ensure that all labels appear at least once. Thus, the Minimum-including Algorithm approximately builds a K-shot support set S following two criteria: (1) all labels within the domain appear at least K times in the support set S; (2) at least one label will appear less than K times in S if any (x, y) pair is removed from it. Algorithm 1 shows the detailed process.

(Benchmark users are free to reconstruct the training set into any format.)

Algorithm 1: Minimum-including
Input: K, domain D, label set L_D
1: Initialize support set S = {}, Count_{l_j} = 0 (for all l_j in L_D)
2: for l in L_D do
     while Count_l < K do
       From D \ S, randomly sample an (x^(i), y^(i)) pair such that y^(i) includes l
       Add (x^(i), y^(i)) to S
       Update all Count_{l_j} (for all l_j in L_D)
3: for each (x^(i), y^(i)) in S do
     Remove (x^(i), y^(i)) from S
     Update all Count_{l_j} (for all l_j in L_D)
     if any Count_{l_j} < K then
       Put (x^(i), y^(i)) back into S
       Update all Count_{l_j} (for all l_j in L_D)
4: Return S

This section presents detailed statistics for the constructed dataset. The statistics of the annotated raw data are shown in Table 2. There are 6,694 utterances included in the corpus, and the average length of an utterance is 9.9 (number of Chinese characters). As mentioned before, we collect data for 59 real dialogue domains. Among them, we reserve 14 domains as unseen few-shot domains for evaluation and use all the other 45 domains as training domains. For evaluation, we select 9 domains as the test set, and use 5 for development. Overall, our dataset contains 143 different intents and 205 different slots.
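As a concrete sketch, the Minimum-including Algorithm described above might be implemented as follows. The data format (pairs of a token sequence and an (intent, slots) frame) and the label-extraction helper are illustrative assumptions; the released platform's sampler may differ in details.

```python
import random
from collections import Counter

def minimum_including(domain, labels, k, seed=0):
    """Approximately build a K-shot support set S for multi-label data.

    `domain` is a list of (x, y) pairs, where y = (intent, slot_sequence).
    Assumes every label in `labels` occurs at least k times in `domain`.
    Result: (1) every label occurs >= k times in S; (2) removing any pair
    from S would break criterion (1) for some label.
    """
    rng = random.Random(seed)

    def labels_of(pair):
        # Labels attached to one instance: its intent plus its non-O slots.
        _, (intent, slots) = pair
        return {intent} | {s for s in slots if s != "O"}

    def counts(S):
        c = Counter()
        for pair in S:
            c.update(labels_of(pair))
        return c

    S = []
    # Steps 1-2 of Algorithm 1: greedily sample pairs until every label
    # appears at least k times.
    for lab in labels:
        while counts(S)[lab] < k:
            candidates = [p for p in domain if p not in S and lab in labels_of(p)]
            S.append(rng.choice(candidates))
    # Step 3: drop redundant pairs; put a pair back if its removal
    # drops any label below k occurrences.
    for pair in list(S):
        S.remove(pair)
        if any(counts(S)[lab] < k for lab in labels):
            S.append(pair)
    return S
```

Because removals in the second pass only ever decrease the counts, any pair that was kept there stays necessary afterwards, which is what yields criterion (2).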
Train Domains (45): queryCapital, app, epg, petrolPrice, dream, animalCries, historyToday, translation, sentenceMaking, carNumber, poetry, familyNames, match, clock, weightScaler, cityOfPro, airControl, website, stock, riddle, map, cookbook, music, calendar, crossTalk, wordsMeaning, new, health, home, video, telephone, weather, tvchannel, lottery, stroke, radio, contacts, bus, message, train, novel, email, cinemas, flight, childClassics
Dev Domains (5): wordFinding, garbageClassify, holiday, joke, temperature
Test Domains (9): idiomsDict, timesTable, virusSearch, captialInfo, constellation, drama, length, story, chineseZodiac

Table 1: The domains of the FewJoint benchmark.
Item: Count
Total number of utterances: 6,694
Average length of utterance: 9.9
Total number of domains: 59
Number of domains (train): 45
Number of domains (dev): 5
Number of domains (test): 9
Total number of intents: 143
Average intents per domain: 2.42
Total number of slots: 205
Average slots per domain: 3.47

Table 2: Statistics of raw data.
Table 3 shows the statistics of the constructed few-shot data. The main setting (used in the SMP2020 contest) is 3-shot language understanding, and we also provide 1-, 5-, and 10-shot settings for extensive evaluation. Support set size and query set size information for the different shot settings is included in Table 3. Besides, we also present the number of occurrences of each intent and slot in the support set, which satisfies our construction requirements for shots.

Following the setting of the SMP2020 contest, we provide baseline results for the 3-shot setting. During the experiments, we transfer the learned knowledge from source domains (training) to unseen target domains (testing) containing only a 3-shot support set. Three baseline models are evaluated:
SepProto, JointProto and JointProto + Finetune. To conduct a robust evaluation under the few-shot setting, we validate the models on different domains and take the average score as the final result. (The Evaluation of Chinese Human-Computer Dialogue Technology, SMP2020-ECDT task-1. Link: https://smp2020.aconf.cn/smp.html)
Setting          Support Sent.   Query Sent.   Ave. Intent in S   Ave. Slot in S
1-shot   dev     22              556
         test    55              1,068
3-shot   dev     66              511
         test    147             1,061
5-shot   dev     108             470
         test    233             969
10-shot  dev     192             389
         test    476             731

Table 3: Statistics of the few-shot benchmark.
There are three main metrics for evaluation: Intent Accuracy, Slot F1-score, and Sentence Accuracy. Specifically, we calculate the Slot F1-score on query samples with the conlleval script. For Sentence Accuracy, we consider a sentence correct only when all its slots and its intent are correct. All models are evaluated on the same support-query pairs for fairness.

To control the nondeterminism of neural network training (Reimers and Gurevych, 2017), we report the average score over 5 random seeds for all results.
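The strict sentence-level criterion can be sketched as follows (intent accuracy and the conlleval-style slot F1 are computed separately; only Sentence Accuracy is shown here):

```python
def sentence_accuracy(preds, golds):
    """Fraction of query sentences whose predicted frame matches the gold
    frame exactly: the intent and every slot label must all be correct.
    `preds` and `golds` are equal-length lists of (intent, slots) pairs."""
    assert len(preds) == len(golds)
    if not golds:
        return 0.0
    correct = sum(
        1
        for (p_int, p_slots), (g_int, g_slots) in zip(preds, golds)
        if p_int == g_int and list(p_slots) == list(g_slots)
    )
    return correct / len(golds)
```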
Implementation: For sentence embedding, we average the token embeddings provided by a pretrained language model; we use uncased BERT-Base (Devlin et al., 2019) here. Also, we adopt the embedding tricks of Pair-Wise Embedding (Hou et al., 2020) and Gradual Unfreezing (Howard and Ruder, 2018). We use ADAM (Kingma and Ba, 2015) to train the models with batch size 4. The learning rate is set to 1e-5 for the baseline models. During the experiments, all baseline models are implemented with the provided few-shot platform.

Here, we evaluate the two types of few-shot learning strategies on 3-shot dialogue language understanding: (1) non-fine-tune based methods (SepProto, JointProto) and (2) fine-tune based methods (JointProto + Finetune). All the baseline models are provided by our few-shot learning platform:
SepProto is a similarity-metric based few-shot learning model with a prototypical network (Snell et al., 2017). It first averages the embeddings of each label's support examples as prototypes, and compares the similarity distance between the query instance and each prototype under a certain distance metric. Then it classifies an item to its closest label. During the experiment, it is pre-trained on source domains and then directly works on target domains without fine-tuning.
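The prototype computation and nearest-prototype classification at the heart of this model can be sketched with NumPy. The BERT encoder is abstracted away, and a squared-Euclidean metric is assumed here as one common choice of distance:

```python
import numpy as np

def build_prototypes(support_emb, support_labels):
    """Average each label's support embeddings into a single prototype.
    support_emb: (N, d) array of embeddings; support_labels: N labels."""
    return {
        lab: support_emb[
            [i for i, l in enumerate(support_labels) if l == lab]
        ].mean(axis=0)
        for lab in set(support_labels)
    }

def classify(query_emb, prototypes):
    """Assign the query embedding to the label of its closest prototype."""
    return min(prototypes, key=lambda lab: np.sum((query_emb - prototypes[lab]) ** 2))
```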
JointProto is also a prototypical model, similar to SepProto. The difference is two-fold: (1) we adopt the logits-dependency trick proposed by Goo et al. (2018) to achieve joint learning of intent detection and slot filling; (2) intent detection and slot filling share the same BERT embedding. Similar to SepProto, it is pre-trained on source domains and then directly works on target domains without fine-tuning.
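A minimal sketch of the logits-dependency idea: the slot classifier receives the intent distribution as extra evidence. The exact slot-gate formulation of Goo et al. (2018) is more elaborate; the projection matrix `W` below is an illustrative stand-in for a learned parameter.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def slot_logits_with_intent(slot_logits, intent_logits, W):
    """Add an intent-conditioned bias to every token's slot logits:
    slot'_t = slot_t + W @ p(intent), broadcast over the T tokens.
    slot_logits: (T, num_slots); intent_logits: (num_intents,);
    W: (num_slots, num_intents), learned jointly in a real model."""
    return slot_logits + W @ softmax(intent_logits)
```

With `W` zeroed the slot predictions are unchanged, which makes the dependency easy to ablate; any non-zero `W` lets the intent distribution shift the slot scores.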
JointProto + Finetune is a joint language understanding model identical to JointProto, except that it is further fine-tuned on the support set of the target domains.

The evaluation results are shown in Table 4. As the results show, JointProto outperforms SepProto on both intent detection and slot filling, which indicates that additional information from joint-learning tasks can improve performance. For Sentence Acc., JointProto significantly outperforms SepProto by 7.25 points. This demonstrates the overall superiority of joint few-shot language understanding compared to vanilla few-shot methods. The improvements mainly come from two aspects: (1) the other task's logits serve as additional evidence for prediction; (2) embedding sharing introduces additional supervisory signals to achieve better BERT fine-tuning.

When comparing JointProto + Finetune and JointProto, huge improvements are witnessed on Sentence Acc. and Slot F1. This shows that in the 3-shot setting, fine-tuning can further boost few-shot language understanding performance by providing domain-specific knowledge.

The platform is available at https://github.com/AtmaHou/MetaDialog.

Models                  Intent Acc.   Slot F1   Sentence Acc.
SepProto
JointProto              78.46
JointProto + Finetune

Table 4: Main results of baselines.
In this paper, we present a novel few-shot learning benchmark for NLP tasks, which is the first few-shot NLP benchmark for joint multi-task learning. Compared to existing few-shot learning data, our benchmark better reflects real-world NLP complexities by covering the structure prediction problem and the multi-task learning problem. Also, our benchmark consists of 59 real dialogue domains. This allows evaluating few-shot models without constructing fake domains like existing works.
References
Yujia Bao, Menghua Wu, Shiyu Chang, and Regina Barzilay. 2020. Few-shot text classification with distributional signatures. In Proc. of the ICLR. OpenReview.net.

Qian Chen, Zhu Zhuo, and Wen Wang. 2019. BERT for joint intent classification and slot filling. arXiv preprint arXiv:1902.10909.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proc. of the NAACL-HLT, Volume 1 (Long and Short Papers), pages 4171–4186.

Li Fei-Fei, Rob Fergus, and Pietro Perona. 2006. One-shot learning of object categories. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(4):594–611.

Tianyu Gao, Xu Han, Ruobing Xie, Zhiyuan Liu, Fen Lin, Leyu Lin, and Maosong Sun. 2019a. Neural snowball for few-shot relation learning. arXiv preprint arXiv:1908.11007.

Tianyu Gao, Xu Han, Hao Zhu, Zhiyuan Liu, Peng Li, Maosong Sun, and Jie Zhou. 2019b. FewRel 2.0: Towards more challenging few-shot relation classification. arXiv preprint arXiv:1910.07124.

Ruiying Geng, Binhua Li, Yongbin Li, Xiaodan Zhu, Ping Jian, and Jian Sun. 2019. Induction networks for few-shot text classification. In Proc. of the EMNLP-IJCNLP, pages 3895–3904.

Chih-Wen Goo, Guang Gao, Yun-Kai Hsu, Chih-Li Huo, Tsung-Chieh Chen, Keng-Wei Hsu, and Yun-Nung Chen. 2018. Slot-gated modeling for joint slot filling and intent prediction. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 753–757.

Yutai Hou, Wanxiang Che, Yongkui Lai, Zhihan Zhou, Yijia Liu, Han Liu, and Ting Liu. 2020. Few-shot slot tagging with collapsed dependency transfer and label-enhanced task-adaptive projection network. In Proc. of the ACL.

Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In Proc. of the ACL, Volume 1: Long Papers, pages 328–339.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Proc. of the ICLR.

Dan Klein and Christopher D. Manning. 2003. Accurate unlexicalized parsing. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pages 423–430.

John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. of the ICML, pages 282–289, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.

Brenden M. Lake, Ruslan Salakhutdinov, and Joshua B. Tenenbaum. 2015. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338.

Xin Lv, Yuxian Gu, Xu Han, Lei Hou, Juanzi Li, and Zhiyuan Liu. 2019. Adapting meta knowledge graph information for multi-hop reasoning over few-shot relations. arXiv preprint arXiv:1908.11513.

Erik G. Miller, Nicholas E. Matsakis, and Paul A. Viola. 2000. Learning from one example through shared densities on transforms. In Proc. of the CVPR, volume 1, pages 464–471. IEEE.

Nils Reimers and Iryna Gurevych. 2017. Reporting score distributions makes a difference: Performance study of LSTM-networks for sequence tagging. In Proc. of the EMNLP.

Jake Snell, Kevin Swersky, and Richard Zemel. 2017. Prototypical networks for few-shot learning. In NIPS, pages 4077–4087.

Shengli Sun, Qingfeng Sun, Kevin Zhou, and Tengchao Lv. 2019. Hierarchical attention prototypical networks for few-shot text classification. In Proc. of the EMNLP-IJCNLP, pages 476–485.

Oriol Vinyals, Charles Blundell, Tim Lillicrap, Koray Kavukcuoglu, and Daan Wierstra. 2016. Matching networks for one shot learning. In NIPS, pages 3630–3638.

Vladimir Vlasov, Akela Drissner-Schmid, and Alan Nichol. 2018. Few-shot generalization across dialogue tasks. arXiv preprint arXiv:1811.11707.

Joseph Worsham and Jugal Kalita. 2020. Multi-task learning for natural language processing in the 2020s: Where are we going? Pattern Recognition Letters.

Leiming Yan, Yuhui Zheng, and Jie Cao. 2018. Few-shot learning for short text classification. Multimedia Tools and Applications, pages 1–12.

Zhi-Xiu Ye and Zhen-Hua Ling. 2019. Multi-level matching and aggregation network for few-shot relation classification. In Proc. of the ACL, Volume 1: Long Papers, pages 2872–2881.

Sung Whan Yoon, Jun Seo, and Jaekyun Moon. 2019. TapNet: Neural network augmented with task-adaptive projection for few-shot learning. In Proc. of the ICML, pages 7115–7123.

Mo Yu, Xiaoxiao Guo, Jinfeng Yi, Shiyu Chang, Saloni Potdar, Yu Cheng, Gerald Tesauro, Haoyu Wang, and Bowen Zhou. 2018. Diverse few-shot text classification with multiple metrics. In Proc. of the NAACL-HLT.