A Neural Few-Shot Text Classification Reality Check
Thomas Dopierre, Christophe Gravier, and Wilfried Logerais
Laboratoire Hubert Curien, UMR CNRS 5516, Université Jean Monnet, Saint-Étienne, France
Meetic, Paris, France
[email protected]
{t.dopierre,w.logerais}@meetic-corp.com
Abstract
Modern classification models tend to struggle when the amount of annotated data is scarce. To overcome this issue, several neural few-shot classification models have emerged, yielding significant progress over time, both in Computer Vision and Natural Language Processing. In the latter, such models used to rely on fixed word embeddings before the advent of transformers. Additionally, some models used in Computer Vision have yet to be tested in NLP applications. In this paper, we compare all these models: first, we adapt those made in the field of image processing to NLP, and second, we provide them access to transformers. We then test these models, equipped with the same transformer-based encoder, on the intent detection task, known for having a large number of classes. Our results reveal that while methods perform almost equally on the ARSC dataset, this is not the case for the intent detection task, where the most recent and supposedly best competitors perform worse than older and simpler ones (while all are given access to transformers). We also show that a simple baseline is surprisingly strong. All the newly developed models, as well as the evaluation framework, are made publicly available (https://github.com/tdopierre/FewShotText).

1 Introduction

Text classification often requires a large number of mappings between texts and target classes, so it is challenging to build few-shot text classification models (Geng et al., 2019). With the recent advances of transformer-based models (Devlin et al., 2018; Wolf et al., 2019) along with their fine-tuning techniques (Sun et al., 2019), text classification has significantly improved. In few-shot settings, methods based on these extracted text representations have historically relied on semi-supervision, especially pseudo-labeling (Blum and Mitchell, 1998; Mihalcea, 2004; Zhou and Li, 2005), which aims at propagating known labels to unlabeled data points in the representational space. Such methods depend on the amount of collected unlabeled data, which can also be costly to obtain (Charoenphakdee et al., 2019), and also suffer from the infamous pipeline effect in NLP (Tenney et al., 2019), as cascade processing tends to make errors accumulate. In order to address the hindrance of collecting unlabeled data, modern approaches include unsupervised data augmentation techniques (Xie et al., 2019). These consist of generating samples through well-established text augmentation techniques from Neural Machine Translation, such as backtranslation (Sennrich et al., 2015; Edunov et al., 2018), and then using a consistency loss, training the classifier to assign the same prediction to all variations of the same sample text. While the collection of new pseudo-labels can therefore be overcome by manipulating the dataset (especially using data augmentation techniques), the pipelining error accumulation effect instead calls for new neural architectures supporting scarcity of labeled data in an end-to-end fashion. Such end-to-end neural architectures for few-shot classification originated in image processing; they include Matching Networks (Vinyals et al., 2016), Prototypical Networks (Snell et al., 2017) plus a follow-up known as Prototypical Networks++ (Ren et al., 2018), and Relation Networks (Sung et al., 2018). Finally, Induction Networks (Geng et al., 2019) is a meta-learning based method dedicated to few-shot text classification, supposedly the state of the art.
Since our contribution considers this family of models, we further detail them in Section 2. Nonetheless, it is important to stress that most of these neural architectures were originally devised to integrate image feature extractors. Although both text and images rely on feature extractors, a paragraph or sentence of a few words hardly conveys as much information as a full-fledged three-channel image, which intrinsically contains tens of thousands of numerical values. It is therefore of the utmost practical interest to validate whether what works best for end-to-end few-shot image classification is also what works best for end-to-end few-shot text classification. Moreover, when applying these end-to-end few-shot models to text, two main system components are in action: the text feature extractor itself, and the downstream part of the neural network that provides a learning strategy over few shots. If we want to compare these systems, we need to plug the same feature extractor (hopefully the best one, which is currently transformer-based) into each end-to-end model. For the time being, the literature on end-to-end few-shot text classification compares the aforementioned techniques using a different text extractor for each system, namely the one available when the technique was introduced, and these text encoders vary greatly (Section 3.3). From that point of view, it is hardly possible to conclude whether the improvement over time in few-shot text classification is due to new few-shot learning techniques or plainly to the significant advances made by text feature extractors. The same applies to vector metrics: one method can use the cosine similarity and another the euclidean distance, and that choice alone can impact conclusions on which method is the state of the art, although the difference could stem from the metric at work alone. Ultimately, experimental setups are usually restricted to one dataset, and evaluation schemes are heterogeneous among papers (Yu et al., 2018a). Our contributions are the following:

• We revise different end-to-end neural architectures for few-shot text classification using the same transformer-based feature extractor,
• We investigate how these re-implemented state-of-the-art solutions compete with very simple baselines found to be yet very competitive for few-shot classification in the field of image processing,
• We introduce an evaluation framework based on a number of intent detection datasets which is significantly bigger than what is usually used as evaluation in seminal papers transposing each of these architectures from image to text classification,
• The entire framework used in this paper, including all the re-implemented methods plugged with up-to-date transformers, is provided as an open-source repository for further research.

In a nutshell, we will demonstrate that providing a transformer-based encoder to a previously obsolete few-shot technique makes it the state of the art again, that standard baselines are surprisingly strong, and that Induction Networks, while performing well for binary sentiment classification, struggle to perform correctly in the most common setups of few-shot text classification.

2 Few-Shot Methods

In this section, we describe the few-shot learning methods. In the following, sentence vectors derived from the sentence encoder are denoted $v$. $V^s$, $V^q$, and $V^u$ represent vectors for support, query, and unlabeled points, respectively. The number of shots is denoted $K$, and the number of classes per episode is denoted $C$. The $k$-th support vector of class $c$ is denoted $v^s_{c,k}$. In the equations, $s^q_{i,j}$ (resp.
$s^q_{i,c}$) will denote the similarity between the $i$-th query vector and the $j$-th support vector (resp. the $c$-th class). Similarly, $s^u_{i,j}$ represents the similarity between the $i$-th unlabeled sample and the $j$-th support vector. When needed, the number of unlabeled data points is denoted $U$. For each method relying on a given similarity or distance metric, we devise two experiments, using either the cosine similarity or the euclidean distance. Those additional experiments are crucial, as they allow us to compare methods directly, without introducing a metric choice bias. Architectures of the different few-shot approaches are illustrated in Figure 1. They are each detailed in Section 2.2 and onwards, yet we first introduce the common building blocks among all methods in what follows.

2.1 Common Building Blocks

Class average. Matching, Prototypical, and Relation Networks all contain a class average block. This step is used to directly compare a query point to a given class in order to make a prediction for this query point. In Prototypical Networks, this step averages the embeddings of support points for each class (yielding class prototypes), which are then compared to query points to output class probabilities, while in Matching Networks this block instead averages similarity scores class-wise.

[Figure 1 panels: (a) Prototypical Network, with the optional proto++ refinement step; the original Prototypical Networks use the euclidean distance as distance metric. (b) Relation Network. (c) Matching Network; the original Matching Networks use the cosine similarity as distance metric. (d) Induction Network.]
Figure 1: Few-shot classification methods and variants used in our experiments.

On the contrary, in Induction Networks, this step is absent: support points are converted into prototypes using an Induction Layer, which aims at finding a better way to aggregate such knowledge than using the average (Section 2.5).
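To make the class average block concrete, here is a minimal PyTorch sketch (our own illustration, not taken from the authors' repository; tensor shapes and function names are our assumptions) showing the two flavours described above: averaging support embeddings into prototypes, and averaging query-to-support similarity scores class-wise.

```python
import torch

def class_prototypes(support: torch.Tensor) -> torch.Tensor:
    """Average support embeddings class-wise, as in Prototypical Networks.

    support: tensor of shape [C, K, d] (C classes, K shots, dimension d).
    Returns prototypes of shape [C, d], one representative per class.
    """
    return support.mean(dim=1)

def classwise_similarity_average(sim: torch.Tensor, C: int, K: int) -> torch.Tensor:
    """Average query-to-support similarities class-wise, as in Matching Networks.

    sim: tensor of shape [Q, C * K], similarity of each query to every
    support point (support points grouped by class, then by shot).
    Returns averaged scores of shape [Q, C], one score per class.
    """
    return sim.view(-1, C, K).mean(dim=-1)
```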
Loss
Matching and Induction Networks both use the mean squared error (MSE) loss in their original formulations. The other methods use cross-entropy (CE). We implemented both losses for Matching and Induction Networks, and they lead to very similar results (sometimes slightly better with CE). We therefore report results for all models using CE, due to space limitations; both losses are available in the publicly released source code. The cosine similarity being bounded, it would not make sense to directly apply such a loss to raw cosine similarities. To overcome this issue, we multiply the cosine similarities by a constant factor, allowing them to reach more extreme values, hence ensuring that the probabilities obtained by softmax are sparse enough.

2.2 Matching Networks

Introduced by Vinyals et al. (2016), Matching Networks (Figure 1c) rely on the comparison between query and support vectors, using the cosine similarity in the seminal paper. After the similarities between a query point and all support points are computed, they are averaged for each class. The predicted label for a given query point is the one with the highest average cosine similarity. In our notation framework, this process is summed up in Equation 1.

$$s^{q,\text{matching}}_{i,c} = K^{-1} \sum_{k=1}^{K} \frac{(v^q_i)^T v^s_{c,k}}{\|v^q_i\| \, \|v^s_{c,k}\|} \qquad (1)$$

2.3 Prototypical Networks

Prototypical Networks (Figure 1a) were introduced by Snell et al. (2017) as an extension of Matching Networks. After obtaining support vectors from the encoder, a class-wise average operation is applied, as in Equation 2. This results in $C$ prototypes, denoted $\{p_c,\, c \in [[1, C]]\}$, each one being the representative of a class. Then, a distance metric compares all query points to all prototypes. For each query point, the predicted class is the one for which this distance is the smallest. In the original Prototypical Networks, the euclidean distance was used, as in Equation 3. We also add a cosine similarity-based variant in our experiments, in order to measure the impact of selecting another distance metric.

$$p_c = K^{-1} \sum_{k=1}^{K} v^s_{c,k} \qquad (2)$$

$$s^{q,\text{proto}}_{i,c} = \frac{\exp\left(-\|v^q_i - p_c\|\right)}{\sum_{c'=1}^{C} \exp\left(-\|v^q_i - p_{c'}\|\right)} \qquad (3)$$

An extension to Prototypical Networks was proposed by Ren et al. (2018), where unlabeled data points are used along with support and query points. After computing each class's prototype, a soft k-means step is applied to further refine those prototypes using the unlabeled data points. The refined prototypes, denoted $\tilde{p}_c$, are derived using Equation 4. This additional step aims at correcting the support point selection bias and making the method more robust.

$$\tilde{p}_c = \frac{\sum_{k=1}^{K} v^s_{c,k} + \sum_{i=1}^{U} v^u_i \, s^{u,\text{proto}}_{i,c}}{K + \sum_{i} s^{u,\text{proto}}_{i,c}} \qquad (4)$$
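The following sketch (our own illustration under stated assumptions, not the authors' code; the cosine scaling factor value is an assumption) puts Equations 2-4 together: prototypes from Equation 2, euclidean or scaled-cosine scores fed to a cross-entropy loss, and the optional soft k-means refinement of Equation 4.

```python
import torch
import torch.nn.functional as F

def proto_logits(query, prototypes, metric="euclidean", scale=5.0):
    """query: [Q, d], prototypes: [C, d] -> logits of shape [Q, C].

    With the euclidean metric, logits are negative distances, so that
    softmax over them matches Eq. 3. With the cosine metric, similarities
    are multiplied by a constant factor before softmax (scale value is
    an assumption, see the Loss paragraph above).
    """
    if metric == "euclidean":
        return -torch.cdist(query, prototypes)
    return scale * F.normalize(query) @ F.normalize(prototypes).t()

def refine_prototypes(prototypes, support, unlabeled):
    """Soft k-means refinement of Eq. 4 (proto++).

    support: [C, K, d], unlabeled: [U, d]. The soft assignments of the
    unlabeled points are the softmax scores of Eq. 3.
    """
    s = F.softmax(proto_logits(unlabeled, prototypes), dim=-1)   # [U, C]
    num = support.sum(dim=1) + s.t() @ unlabeled                 # [C, d]
    den = support.size(1) + s.sum(dim=0, keepdim=True).t()       # [C, 1]
    return num / den

# Training step: cross-entropy between query logits and episode labels.
# prototypes = class_prototypes(support)  # Eq. 2, see the sketch above
# loss = F.cross_entropy(proto_logits(query, prototypes), labels)
```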
2.4 Relation Networks

Relation Networks (Sung et al., 2018) challenge the idea of using a pre-defined metric. The Relation Module takes as input a query vector $v^q_i \in \mathbb{R}^d$ and the prototype of a class $p_c \in \mathbb{R}^d$, the latter being obtained the same way as in Prototypical Networks (Equation 2). The idea is to use a relation module modeling the relationship between those two vectors, yielding a similarity score $s_{i,c} \in (0, 1)$. Instead of using a pre-defined distance metric like the euclidean or the cosine one, this approach allows such networks to learn the metric by themselves. Two different relation module architectures exist.

base. The base relation module concatenates $v^q_i$ and $p_c$, and applies a small feed-forward neural network composed of two linear layers, with a ReLU activation function in between. The formula for this relation module is described in Equation 5, where $C(\cdot,\cdot)$ denotes the concatenation operator, $f(\cdot)$ denotes the ReLU activation function, and $w$, $M_1$, $M_2$ are learnable parameters.

$$s^{q,\text{rel-base}}_{i,c} = \left\langle w,\; M_2\!\left(f\!\left(M_1\!\left(C(v^q_i, p_c)\right)\right)\right) \right\rangle \qquad (5)$$
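A minimal sketch of the base relation module follows (our own illustration; the hidden size is an assumption, and the final dot product with $w$ is implemented as a bias-free linear layer, with a sigmoid mapping the score into $(0,1)$ as stated in the text, although Equation 5 does not show it explicitly):

```python
import torch
import torch.nn as nn

class BaseRelationModule(nn.Module):
    """Base relation module of Eq. 5: concat -> M1 -> ReLU -> M2 -> <w, .>."""

    def __init__(self, d: int, hidden_dim: int = 128):  # hidden_dim assumed
        super().__init__()
        self.m1 = nn.Linear(2 * d, hidden_dim)          # M1 on C(v_q, p_c)
        self.m2 = nn.Linear(hidden_dim, hidden_dim)     # M2
        self.w = nn.Linear(hidden_dim, 1, bias=False)   # <w, .>

    def forward(self, v_q: torch.Tensor, p_c: torch.Tensor) -> torch.Tensor:
        """v_q: [Q, d] queries, p_c: [Q, d] matched prototypes -> [Q] scores."""
        x = torch.cat([v_q, p_c], dim=-1)               # concatenation C(., .)
        x = self.m2(torch.relu(self.m1(x)))
        return torch.sigmoid(self.w(x)).squeeze(-1)     # score in (0, 1)
```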
NTL. Introduced by Socher et al. (2013), the Neural Tensor Layer relation module uses intermediate learnable matrices $M_t \in \mathbb{R}^{d \times d}$ to model the relation between support vectors and prototypes. The similarity score for this relation module is obtained using Equation 6, where $w$ is a learnable parameter. Following the work of Geng et al. (2019), we fix the number $h$ of intermediate matrices to 100 in all our experiments.

$$s^{q,\text{rel-ntl}}_{i,c} = \left\langle w, z^{q,\text{rel-ntl}}_{i,c} \right\rangle, \quad w \in \mathbb{R}^h \qquad (6)$$

$$z^{q,\text{rel-ntl}}_{i,c,t} = f\!\left((v^q_i)^T M_t \, p_c\right), \quad t \in [[1, h]] \qquad (7)$$

2.5 Induction Networks

Induction Networks (Geng et al., 2019) aim at finding a general representation of each class in the support set to compare to new queries. They are composed of both an induction module and a relation module. The main motivation for such networks is that representing a class by the average vector of its data points (as done in Prototypical and Relation Networks) is too restrictive. The first part, the induction module, leverages a dynamic routing algorithm (Sabour et al., 2017). In their contribution, Geng et al. (2019) show that their method can better induce (hence the name) and generalize class-wise representations. For the second part, an NTL relation module is used: this is the same as the one introduced earlier (Section 2.4). Such networks are illustrated in Figure 1d. As in Geng et al. (2019), we fix the number of routing iterations to 3 and the number of matrices in the NTL to 100.
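A sketch of the dynamic-routing induction module is given below, following the descriptions in Sabour et al. (2017) and Geng et al. (2019); the shared linear transform and routing update mirror their accounts, but other details (names, shapes) are our own and may differ from the authors' implementation.

```python
import torch
import torch.nn as nn

def squash(x: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Squashing non-linearity from Sabour et al. (2017)."""
    n2 = (x ** 2).sum(dim=-1, keepdim=True)
    return (n2 / (1.0 + n2)) * x / (n2.sqrt() + eps)

class InductionModule(nn.Module):
    """Induce a class representation from K support vectors by dynamic routing."""

    def __init__(self, d: int, n_iter: int = 3):
        super().__init__()
        self.transform = nn.Linear(d, d)  # shared transformation of supports
        self.n_iter = n_iter              # 3 routing iterations, as in the paper

    def forward(self, support: torch.Tensor) -> torch.Tensor:
        """support: [C, K, d] -> induced class representations [C, d]."""
        e_hat = squash(self.transform(support))                     # [C, K, d]
        b = torch.zeros(support.shape[:2], device=support.device)   # routing logits
        for _ in range(self.n_iter):
            d_ = torch.softmax(b, dim=-1)                           # coupling coeffs
            c = squash((d_.unsqueeze(-1) * e_hat).sum(dim=1))       # [C, d]
            b = b + (e_hat * c.unsqueeze(1)).sum(dim=-1)            # agreement update
        return c
```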
2.6 Baselines

Few-shot learning algorithms are designed to overcome the data scarcity problem. With the tremendous shift in the architecture of sentence encoders brought by transformers, control baselines are needed to validate their ability to learn from few samples. For this reason, we include as a first baseline a traditional classifier, as described by Equation 8, added on top of BERT. Both $W$ and $b$ are learnable parameters, fine-tuned on the support vectors $V^s$. In our experiments, this method will henceforth be referred to as Baseline.

$$s^{q,\text{baseline}}_{i,j} = \left(W v^q_i + b\right)_j, \quad W \in \mathbb{R}^{C \times d} \qquad (8)$$
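A minimal sketch of this Baseline head, fit on the support set (optimizer choice, learning rate, and step count are our assumptions for illustration, not the paper's values):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fit_baseline(support: torch.Tensor, labels: torch.Tensor,
                 n_classes: int, steps: int = 100, lr: float = 1e-3):
    """Fit the linear head of Eq. 8 on encoded support sentences.

    support: [N, d] support embeddings, labels: [N] class ids in [0, C).
    """
    head = nn.Linear(support.size(1), n_classes)   # W and b of Eq. 8
    opt = torch.optim.Adam(head.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.cross_entropy(head(support), labels)
        loss.backward()
        opt.step()
    return head  # logits for a query batch: head(query)
```

The variant introduced next differs only in how scores are computed from the same learnable weights: cosine similarities to the columns of $W$ instead of raw logits.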
In addition to this Baseline model, we also implement a variant of it, which will henceforth be referred to as Baseline++.
In this second baseline, the classifier design differs as follows: it measures similarities to a learnable vector instead of transforming vectors into logits using a linear layer. The matrix $W$ used in the Baseline model can be written as $[w_1, \ldots, w_C]$, where each $w_k \in \mathbb{R}^d$ is a weight vector corresponding to the $k$-th class. To measure the similarity between class $j$ and a query vector $v^q_i$, we compute the similarity scores of Equation 9. After all scores $s_{\cdot,\cdot}$ are computed, we obtain a probability vector through normalization using the softmax function.

$$s^{q,\text{baseline++}}_{i,j} = \frac{w_j^T v^q_i}{\|w_j\| \, \|v^q_i\|} \qquad (9)$$

As in Prototypical Networks, the derived vectors $[w_1, \ldots, w_C]$ can be interpreted as class prototypes. For both baselines, at each training episode, the weights $W$ and $b$ are initialized, and the whole model is fine-tuned for a few iterations using the support samples. This is important in practice, as it teaches the sentence encoder (a transformer, see Section 3.3) how to produce good enough embeddings for the downstream classifier to learn efficiently. At test time, the same process is used, this time with test labels, except that we freeze the encoder's weights and only fine-tune the classifier part. The baseline architectures are represented in Figure 2.

[Figure 2 panels: (a) Baseline network, with a regular classifier; (b) Baseline++ network, with a distance-based classifier using either the cosine or the euclidean distance.]

Figure 2: Few-shot classification baselines used in our experiments.

3 Experiments

3.1 Evaluation Setting

Introduced by Vinyals et al. (2016), few-shot classification corresponds to the case where a classifier must adapt to new classes, denoted here as $\mathcal{C}_{test}$, unseen during training, given only a few labeled examples of these new classes. To this end, the approaches assume that a task-significant set of classes, noted $\mathcal{C}_{train}$, is available during training, along with an accordingly task-significant number of labeled data points for each class $c^{train}_i \in \mathcal{C}_{train}$. For each training episode, $C$ classes are sampled from $\mathcal{C}_{train}$, with $C \ll |\mathcal{C}_{train}|$. Then, $K$ support examples and $Q$ query examples are randomly drawn for each of these classes. The model is then iteratively trained using both query and support points.

At testing time, the same sampling strategy is used, this time drawing classes from $\mathcal{C}_{test}$, with $\mathcal{C}_{test} \cap \mathcal{C}_{train} = \emptyset$. The model is then evaluated on its ability to predict labels for the $Q$ query samples, using the $K$ support samples (unless otherwise stated, the $C$, $Q$, and $K$ values are the same at testing and training time). This training procedure is called $C$-way $K$-shot classification. In all our experiments, we used $K = Q = 5$. Concerning the value of $C$, it is fixed to 2 for ARSC, as this dataset is already composed of binary classification tasks. Regarding the intent detection datasets we introduce later (Section 3.2), in order to see the shift between ARSC binary tasks and the more common 5-way evaluation (Geng et al., 2019; Ren et al., 2018), we measured the performance of the different models with $C$ ranging from 2 to 5.
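The episode sampler just described can be sketched as follows (representing the dataset as a label-to-texts mapping is our choice, not the paper's):

```python
import random
from typing import Dict, List, Tuple

def sample_episode(data: Dict[str, List[str]], C: int, K: int, Q: int
                   ) -> Tuple[list, list]:
    """Draw one C-way K-shot episode with Q queries per class.

    data maps each class label to its texts; classes come from C_train
    at training time and from the disjoint C_test at test time.
    """
    classes = random.sample(sorted(data), C)
    support, query = [], []
    for label_id, label in enumerate(classes):
        texts = random.sample(data[label], K + Q)
        support += [(t, label_id) for t in texts[:K]]
        query += [(t, label_id) for t in texts[K:]]
    return support, query

# Example: a 5-way 5-shot episode with 5 queries per class (K = Q = 5).
# support, query = sample_episode(train_data, C=5, K=5, Q=5)
```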
3.2 Datasets

In this section, we describe the datasets used in our evaluation framework. The first one is a popular sentiment classification dataset, while the others are intent detection datasets. All datasets are public and in English.

ARSC
The Amazon Review Sentiment Classification dataset (Blitzer et al., 2007) is composed of product reviews drawn from multiple product categories. Each review belongs to one of these domains and carries a grade ranging from 1 to 5 stars. The usual setup (Yu et al., 2018b; Geng et al., 2019) to evaluate few-shot classification with this dataset is as follows: for each of $p$ product categories and $t$ score thresholds, $E_{ARSC} = p \times t$ binary classification evaluation tasks are created. In each of these tasks, a competitor model must learn to separate negative ($< t$) and positive ($\geq t$) reviews. To build our test tasks, we consider the same product categories as previous works (Yu et al., 2018b; Geng et al., 2019), namely Books, DVD, Electronics, and Kitchen, with $t = 3$ score thresholds per category, hence 12 binary classification test tasks in our benchmark for this dataset.

Each of these twelve evaluation tasks comes with a number of support test samples ($K = Q = 5$ as stated previously). Nonetheless, in Yu et al. (2018b) the same samples per testing class are fixed for all experiments (these labeled samples are available at https://github.com/Gorov/DiverseFewShot_Amazon), which leads to a significant selection bias towards the randomly selected samples used throughout the evaluation. In order to get more consistent results, we ran additional experimental runs, each of them selecting new support samples at random. In the ARSC results table (Table 1), this corresponds to the last column (BERT + Sample shots).
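For illustration, building the $p \times t$ binary ARSC tasks by thresholding star ratings can be sketched as follows (the data layout is our assumption, and the exact threshold values are not restated here, so they are passed in as a parameter):

```python
from typing import Dict, List, Tuple

def build_arsc_tasks(reviews: Dict[str, List[Tuple[str, int]]],
                     thresholds: List[int]) -> Dict[str, dict]:
    """Create one binary task per (category, threshold) pair.

    reviews maps each product category to (text, stars) pairs; a review
    is negative if stars < t and positive otherwise.
    """
    tasks = {}
    for category, items in reviews.items():
        for t in thresholds:
            tasks[f"{category}_t{t}"] = {
                "neg": [text for text, stars in items if stars < t],
                "pos": [text for text, stars in items if stars >= t],
            }
    return tasks
```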
OOS

The Out Of Scope dataset (Larson et al., 2019; https://github.com/clinc/oos-eval) is an intent detection dataset containing 150 equally-distributed classes. While initially used for out-of-scope prediction, it was also motivated by a high number of classes, a low number of examples per class, and its chatbot-like style. In our experiments, we discard the out-of-scope class, keeping the remaining 150 classes to work with.
Liu

Introduced by Liu et al. (2019), this intent detection dataset was collected on the Amazon Mechanical Turk platform, where workers were given an intent and had to formulate queries for this intent in their own words. It is highly imbalanced: the most common class (query) holds thousands of samples, while the least common one (volume other) holds only a few.
TREC28

TREC (https://trec.nist.gov/data/qa.html) is an open-domain fact-based dataset for question classification. We use the 50-label fine-grained version of the dataset, but remove the labels with too few samples. This filtering process yields a dataset with 28 classes, with a varying number of samples per class.

3.3 Sentence Encoder

In previous works comparing few-shot text classification methods, sentence encoders were not always the same. For example, Yu et al. (2018b) use a CNN on top of word embeddings, while Geng et al. (2019) use a Bi-LSTM. Those differences make results hard to compare, since the methods do not use the same procedure to convert sentences into vectors. In our experiments, in order to reduce this selection bias, and since it is now the state of the art in many applications, we use a BERT encoder (Devlin et al., 2018), with models from the Hugging Face team (Wolf et al., 2019).

For each dataset, instead of using an off-the-shelf pre-trained model, we fine-tune it on the masked language modeling task, as this greatly improves the quality of embeddings (Sun et al., 2019; Xie et al., 2019). This fine-tuned transformer is then used as input for all few-shot models.
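A minimal Hugging Face sketch of this masked-language-modeling fine-tuning step follows (the model name, masking probability, and training hyper-parameters are our assumptions, not the paper's values):

```python
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

def finetune_mlm(texts, model_name="bert-base-cased", out_dir="mlm-model"):
    """Fine-tune a BERT encoder on MLM over a dataset's raw texts."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForMaskedLM.from_pretrained(model_name)
    encodings = tokenizer(texts, truncation=True, max_length=128)
    dataset = [{"input_ids": ids} for ids in encodings["input_ids"]]
    # The collator randomly masks tokens to build the MLM objective.
    collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)
    args = TrainingArguments(output_dir=out_dir, num_train_epochs=3,
                             per_device_train_batch_size=16)
    Trainer(model=model, args=args, train_dataset=dataset,
            data_collator=collator).train()
    model.save_pretrained(out_dir)      # encoder later reused as input
    tokenizer.save_pretrained(out_dir)  # for all few-shot methods
```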
4 Results

We report results for the ARSC dataset in Table 1, and results for the intent detection tasks in Table 3.
Few-shot learning methods were originally used to overcome data scarcity. In such situations, training a classifier on top of a small dataset (in our case, 5 samples per class) can be hard. However, our experiments on ARSC show that the Baseline and Baseline++ models, plain and simple classifiers, get surprisingly close to state-of-the-art results. Table 2 provides four correct and four incorrect classification examples for the Baseline model. While it fails to predict the correct label for some shots, it is also able to correctly classify sentences such as "What do I take home?" among the test classes of the OOS dataset. On the ARSC dataset, it is also important to note that the Baseline++ model is significantly better than the Baseline, and is even on par with all other architectures except Prototypical Networks.

The mean accuracy difference between the last and the second-to-last columns of Table 1 accounts for the effect of randomly selecting new support samples at each iteration (last column), as opposed to picking the same fixed pool of support samples as done previously (second-to-last column). We can see that this difference alone is in the range of the increments brought by each model over time (baselines aside). This huge gap shows the importance of using evaluation practices like cross-validation, instead of evaluating only one run over a fixed set of shots.

One of the main contributions of our paper is to compare few-shot learning methods with the lowest bias possible (see Section 3.3). On the ARSC dataset, using transformers drastically changes the performance of all methods.

Model | Metric | Relation module | Original encoder† | BERT as encoder (↗ or ↘ w.r.t. original encoder) | BERT + Sample shots
Matching Network (Vinyals et al., 2016) | euclid. | N/A | – | … (↗) | 83.3
Prototypical Network (Snell et al., 2017) | euclid. | N/A | 68.2 | 80.0 (↗) | 82.6
Induction Network (Geng et al., 2019) | N/A | ntl | 85.6 | 79.3 (↘) | 80.3
[cosine variants, Relation Network, and baseline rows were not recoverable from the extraction]

Table 1: Mean accuracy on the 12 ARSC binary classification test tasks. In column †, results are reproduced from the Induction Networks seminal paper (Geng et al., 2019) where applicable; a dash (–) means that results for that encoder/metric pair were not reported, and some models were only tested on computer vision tasks (first time applied to text in our contribution). The BERT column is our implementation using the same 5 shots as the first column, but with a BERT encoder for all methods. The last column also uses BERT, but results are averaged over five runs, sampling different shots for each run. In the Configuration columns, N/A means that the configuration criterion does not apply to the model.
Correct classification examples
S: Do I have enough in my boa account for a new pair of skis?  P: balance  T: balance
S: What's 15% of 68?  P: calculator  T: calculator
S: I need to know the nearest bank's location.  P: directions  T: directions
S: What do I take home?  P: income  T: income

Incorrect classification examples
S: On Tuesday you are supposed to have a meeting.  P: meeting schedule  T: calendar
S: What are my insurance rewards?  P: insurance  T: redeem rewards
S: How much farther is Orlando from my location?  P: current location  T: distance
S: Stop talking please.  P: change speed  T: cancel

Table 2: Examples of OOS queries correctly and incorrectly predicted by the Baseline method using 5 shots. S (resp. P, T) is the sentence (resp. prediction and true label).
When feeding the same transformer-based encoder to all few-shot methods, Prototypical Networks come out on top, whereas metric learning approaches (Induction and Relation Networks) tend to struggle, almost falling to the level of Matching Networks. Such metric learning approaches rely on various weight matrices and parameters, while more traditional approaches (Matching, Proto) use no parameters apart from the encoding step. This hints that the upstream transformer does most of the learning and models the embedding space well enough that no additional metric learning is needed. The massive increase in embedding quality brought by the BERT encoder makes Prototypical Networks reclaim the state-of-the-art position.
When Geng et al. (2019) introduced Induction Networks, both the ARSC dataset and a private, publicly unavailable intent detection dataset were used for evaluation. Our experiments with this method on the ARSC dataset confirm those results within an acceptable range, even when trying to get more consistent results using multiple random seeds. Nonetheless, the performance of this method is underwhelming on all three intent detection datasets, even when matching the binary classification scenario using $C = 2$. This poor performance was observed on both the test set and the train set, ruling out overfitting as an explanation. Such a big performance gap between sentiment and intent classification tasks shows that Induction Networks, while suited for the former, are not directly applicable to every type of task.
Method | Metric | Liu (C = 2/3/4/5) | OOS (C = 2/3/4/5) | TREC28 (C = 2/3/4/5)
Matching | euclid. | 96.6 / 93.7 / 91.1 / 89.1 | 99.2 / 98.7 / 98.1 / 97.7 | 89.4 / 81.6 / 76.6 / 69.6
Matching | cosine | 93.3 / 87.9 / 84.8 / 81.0 | 96.8 / 95.8 / 95.1 / 94.7 | 81.6 / 75.4 / 68.5 / 63.5
Proto | euclid. | 97.4 / 95.3 / 93.4 / 91.8 | … | …
Proto | cosine | 94.6 / 90.4 / 88.5 / 85.6 | 97.6 / 97.3 / 96.9 / 96.5 | 85.6 / 79.1 / 74.5 / 71.3
[Proto++, Relation, Induction, and baseline rows were not recoverable from the extraction]
Table 3: Mean accuracy of $C$-way 5-shot intent detection, with $C$ ranging between 2 and 5. Each reported value is the average over five runs with different random seeds. For each column, the best method is highlighted in bold.

Prototypical Networks were originally designed to improve on Matching Networks. The two differences between them are the placement of the class average step and the choice of the metric (cosine for Matching, euclidean for Prototypical). Our results show that the metric choice yields a big gap in performance for both methods, this gap being larger than the gap caused by the model design. This hints that when using a pre-defined metric (excluding the case of metric learning), choosing the right metric is of paramount importance. Moreover, while
Matching Networks were designed to use the cosine distance, we found here that they perform significantly better when equipped with the euclidean distance (on all datasets, for every number of test classes).
Overall, Prototypical Networks come out on top on every intent detection dataset. More importantly, their gap over other competing approaches widens as the number of classes increases. This result matters because, in practice, the number of classes is likely to be higher than what is used in the literature. The extended variant, proto++, obtains mixed results. While this shows that using unlabeled data can have some benefits, we also observe that the way proto++ integrates this external knowledge is perfectible. Ultimately, note that our results do not mirror Computer Vision results. Since few-shot learning methods operate on top of embeddings, one could hypothesize that they can be applied to any embeddings, regardless of the field. However, while Relation Networks, for example, performed well on Computer Vision classification tasks (the tasks for which they were originally designed) as well as on text classification back when transformers did not exist, this is no longer the case. The takeaway is that all methods are very sensitive to the feature extractor used in prior steps.
5 Conclusion

We provided a fair comparison of end-to-end neural few-shot text classification methods introduced over the last few years. When they are all equipped with a transformer-based text encoder, we show that Prototypical Networks become the state of the art again. We also found that a traditional classifier trained on few shots yields very competitive results, especially when the given shots are re-sampled at each iteration. Ultimately, we also demonstrated the significant impact of the vector metric, illustrated by Matching Networks strongly improving when only replacing the cosine similarity by the euclidean distance. The complete source code, with the re-implementation of all the tested methods and the evaluation framework used in this study, is publicly available (https://github.com/tdopierre/FewShotText); we hope that it will help the community build upon consistent comparative experiments.

References

John Blitzer, Mark Dredze, and Fernando Pereira. 2007. Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 440–447.

Avrim Blum and Tom Mitchell. 1998. Combining labeled and unlabeled data with co-training. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pages 92–100.

Nontawat Charoenphakdee, Jongyeong Lee, Yiping Jin, Dittaya Wanvarie, and Masashi Sugiyama. 2019. Learning only from relevant keywords and unlabeled documents. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Sergey Edunov, Myle Ott, Michael Auli, and David Grangier. 2018. Understanding back-translation at scale. arXiv preprint arXiv:1808.09381.

Ruiying Geng, Binhua Li, Yongbin Li, Xiaodan Zhu, Ping Jian, and Jian Sun. 2019. Induction networks for few-shot text classification. arXiv preprint arXiv:1902.10482.

Stefan Larson, Anish Mahendran, Joseph J. Peper, Christopher Clarke, Andrew Lee, Parker Hill, Jonathan K. Kummerfeld, Kevin Leach, Michael A. Laurenzano, Lingjia Tang, et al. 2019. An evaluation dataset for intent classification and out-of-scope prediction. arXiv preprint arXiv:1909.02027.

Xingkun Liu, Arash Eshghi, Pawel Swietojanski, and Verena Rieser. 2019. Benchmarking natural language understanding services for building conversational agents. arXiv preprint arXiv:1903.05566.

Rada Mihalcea. 2004. Co-training and self-training for word sense disambiguation. In Proceedings of the 8th Conference on Computational Natural Language Learning (CoNLL-2004) at HLT-NAACL 2004.

Mengye Ren, Eleni Triantafillou, Sachin Ravi, Jake Snell, Kevin Swersky, Joshua B. Tenenbaum, Hugo Larochelle, and Richard S. Zemel. 2018. Meta-learning for semi-supervised few-shot classification. arXiv preprint arXiv:1803.00676.

Sara Sabour, Nicholas Frosst, and Geoffrey E. Hinton. 2017. Dynamic routing between capsules. In Advances in Neural Information Processing Systems, pages 3856–3866.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Improving neural machine translation models with monolingual data. arXiv preprint arXiv:1511.06709.

Jake Snell, Kevin Swersky, and Richard Zemel. 2017. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pages 4077–4087.

Richard Socher, Danqi Chen, Christopher D. Manning, and Andrew Ng. 2013. Reasoning with neural tensor networks for knowledge base completion. In Advances in Neural Information Processing Systems, pages 926–934.

Chi Sun, Xipeng Qiu, Yige Xu, and Xuanjing Huang. 2019. How to fine-tune BERT for text classification? In China National Conference on Chinese Computational Linguistics, pages 194–206. Springer.

Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip H.S. Torr, and Timothy M. Hospedales. 2018. Learning to compare: Relation network for few-shot learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. BERT rediscovers the classical NLP pipeline. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4593–4601. Association for Computational Linguistics.

Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. 2016. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pages 3630–3638.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. ArXiv, abs/1910.03771.

Qizhe Xie, Zihang Dai, Eduard Hovy, Minh-Thang Luong, and Quoc V. Le. 2019. Unsupervised data augmentation for consistency training. arXiv preprint arXiv:1904.12848.

Mo Yu, Xiaoxiao Guo, Jinfeng Yi, Shiyu Chang, Saloni Potdar, Yu Cheng, Gerald Tesauro, Haoyu Wang, and Bowen Zhou. 2018a. Diverse few-shot text classification with multiple metrics. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1, New Orleans, Louisiana. Association for Computational Linguistics.

Mo Yu, Xiaoxiao Guo, Jinfeng Yi, Shiyu Chang, Saloni Potdar, Yu Cheng, Gerald Tesauro, Haoyu Wang, and Bowen Zhou. 2018b. Diverse few-shot text classification with multiple metrics. arXiv preprint arXiv:1805.07513.

Zhi-Hua Zhou and Ming Li. 2005. Tri-training: exploiting unlabeled data using three classifiers. IEEE Transactions on Knowledge and Data Engineering.