Ask2Transformers: Zero-Shot Domain Labelling with Pre-trained Language Models
Oscar Sainz and German Rigau
HiTZ Center - Ixa Group, University of the Basque Country (UPV/EHU)
{oscar.sainz, german.rigau}@ehu.eus

Abstract
In this paper we present a system that exploits different pre-trained Language Models for assigning domain labels to WordNet synsets without any kind of supervision. Furthermore, the system is not restricted to a particular set of domain labels. We exploit the knowledge encoded within different off-the-shelf pre-trained Language Models and task formulations to infer the domain label of a particular WordNet definition. The proposed zero-shot system achieves a new state of the art on the English dataset used in the evaluation.
1 Introduction

The whole Natural Language Processing (NLP) research area has been accelerated by the advent of unsupervised pre-trained Language Models. First with ELMo (Peters et al., 2018) and then with BERT (Devlin et al., 2019), the paradigm of fine-tuning a pre-trained Language Model on a particular NLP task became the new standard approach, replacing the more traditional knowledge-based and fully supervised approaches. Currently, as the size of the corpora and models increases, the research community has observed that the Transfer Learning approach has the capacity to work with little or no fine-tuning. Some examples of the strength of this approach are GPT-2 (Radford et al., 2019) or, more recently, GPT-3 (Brown et al., 2020), which shows the ability of these huge pre-trained Language Models to solve tasks for which they have not even been trained.

Recently, with the arrival of GPT-3, new ways to perform zero- and few-shot approaches have been discovered. These approaches propose the inclusion of a small number of supervised examples in the input as a hint for the model. The model then, just by looking at a small set of examples, is able to successfully complete the task at hand. Brown et al. (2020) report that they solve a wide range of NLP tasks just following this approach. However, this approach only looks appropriate when the model is large enough.

In this paper we exploit the domain knowledge already encoded within existing pre-trained Language Models to enrich WordNet (Miller, 1998) synsets and glosses with domain labels. We explore and evaluate different pre-trained Language Models and pattern objectives. For instance, consider the example shown in Table 1. Given a WordNet definition such as the one of <hospital, infirmary> and the knowledge encoded in a pre-trained Language Model, the task is to assess which is its most suitable domain label.
Thus, we create an appropriate pattern in natural language adapted to the objective of the Language Model. In the example, we use a Language Model fine-tuned on a general task such as Natural Language Inference (NLI) (Bowman et al., 2015). The NLI objective is to train a model able to classify the relation between two sentences as entailment, contradiction or neutral. Having four domains such as medicine, biology, business and culture, our system performs four queries to the model, each one with one of the four domains. Each query takes as a first sentence the WordNet definition and as a second sentence The domain of the sentence is about [domain-label].
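The querying scheme above can be sketched in a few lines. The following is a minimal illustration, not the actual system: the per-label entailment scores are hard-coded placeholders standing in for the outputs of an NLI model such as roberta-large-mnli, and only the query construction and the softmax normalization over labels are shown.

```python
import math

def build_queries(definition, labels):
    """Build one (premise, hypothesis) pair per candidate domain label."""
    template = "The domain of the sentence is about {}"
    return [(definition, template.format(label)) for label in labels]

def rank_labels(labels, entailment_scores):
    """Softmax over the per-label entailment scores and rank the labels."""
    exps = [math.exp(s) for s in entailment_scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    return sorted(zip(labels, probs), key=lambda p: p[1], reverse=True)

definition = "a health facility where patients receive treatment"
labels = ["medicine", "biology", "business", "culture"]
queries = build_queries(definition, labels)
# Placeholder entailment scores, one per label; NOT real model outputs.
scores = [3.1, 0.8, 0.1, -0.4]
ranking = rank_labels(labels, scores)
print(ranking[0][0])  # the highest-scoring domain label
```

In a real run, each (premise, hypothesis) pair would be fed to the NLI model and the entailment logit collected before the softmax over labels.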
As expected, the most suitable domain label in this example is medicine, with a confidence of 0.77. As shown, an off-the-shelf Language Model which has been fine-tuned on a general NLI task is able to infer the most appropriate domain label for the WordNet definition without any further training. Also note that the approach can use any given set of domain labels.

Interestingly, without any training on the task at hand, the proposed zero-shot system obtains an F1 score of 92.4% on the English dataset used in the evaluation. All the implementation code along with the experiments is freely available at https://github.com/osainz59/Ask2Transformers.

After this short introduction, the next section presents previous work on domain labelling of WordNet. Section 3 presents our approach, Section 4 the experimental setup and Section 5 the results from our experiments. Finally, Section 6 revises the main conclusions and the future work.

2 Related Work

Building large and rich lexical knowledge bases is a very costly effort which involves large research groups for long periods of development. Starting from version 3.0, Princeton WordNet has associated topic information with a subset of its synsets. This topic labelling is achieved through pointers from a source synset to a target synset representing the topic. WordNet uses 440 topics, and the most frequent one is <law, jurisprudence>.

In order to reduce the manual effort required, a few semi-automatic and fully automatic methods have been applied for associating domain labels with synsets. For instance, WordNet Domains (WND, http://wndomains.fbk.eu/) is a lexical resource where synsets have been semi-automatically annotated with one or more domain labels from a set of 165 hierarchically organized domains (Magnini, 2000; Bentivogli et al., 2004). The uses of WND include the possibility to reduce the polysemy degree of the words, grouping those senses that belong to the same domain (Magnini et al., 2002).
But the semi-automatic method used to develop this resource was far from being perfect. For instance, the noun synset <diver, frogman, underwater diver>, defined as someone who works underwater, has domain history because it inherits from its hypernym <explorer, adventurer>, also labelled with history. Moreover, many synsets have been labelled as factotum, meaning that the synset cannot be labelled with a particular domain. WND also provides mappings to WordNet Topics and to Wikipedia categories.

eXtended WordNet Domains (XWND, https://adimen.si.ehu.es/web/XWND) (Gonzalez-Agirre et al., 2012; González et al., 2012) applied a graph-based method to propagate the WND labels through the WordNet structure.

Domain information is also available in other lexical resources. For instance, IATE (http://iate.europa.eu/) is a European Union inter-institutional terminology database. The domain labels of IATE are based on the Eurovoc thesaurus (https://op.europa.eu/en/web/eu-vocabularies/th-dataset/-/resource/dataset/eurovoc) and were introduced manually.

More recently, BabelDomains (http://lcl.uniroma1.it/babeldomains/) (Camacho-Collados and Navigli, 2017) proposed an automatic method that propagates the knowledge categories from Wikipedia to WordNet by exploiting both distributional and graph-based clues. As domains of knowledge, BabelDomains opted for the domains from the Wikipedia featured articles page (https://en.wikipedia.org/wiki/Wikipedia:Featured_articles). This page contains a set of thirty-two domains of knowledge. When labelling WordNet synsets with these domains, BabelDomains reports a precision of 81.7, a recall of 68.7 and an F1 score of 74.6. Unfortunately, as these numbers suggest, not all WordNet synsets have been labelled with a domain. For instance, the synset <hospital, infirmary>, with the gloss definition a health facility where patients receive treatment, has no BabelDomains label assigned.

It is worth noting that all these methods depart from a particular set of domain labels (or categories) manually assigned to a set of WordNet synsets (or Wikipedia pages).
Then, these labels are propagated through the WordNet structure following automatic or semi-automatic methods. In contrast, our zero-shot method does not require an initial manual annotation. Furthermore, it is not designed for a particular set of domain labels. That is, it can be applied to label from scratch any dictionary or lexical knowledge base (or wordnet) with distinct sets of domain labels.

3 Approach

Recent studies such as the one of GPT-3 (Brown et al., 2020) show that when increasing the size of the model, the capacity to solve different tasks with just a few positive examples also increases (few-shot learning). However, very large Language Models also have important hardware requirements (i.e. large-RAM GPUs). Thus, we decided to keep the size of the models manageable, with small hardware requirements.

Definition: hospital: a health facility where patients receive treatment.
Pattern: The domain of the sentence is about [domain-label]

  medicine  0.77
  biology   0.08
  business  0.04
  culture   0.02

Table 1: An example of domain labelling.

The task we focus on is the domain labelling of WordNet glosses. This task consists in the following: given a WordNet gloss g, predict the corresponding domain d of the WordNet concept defined. In this paper, the domains are taken from BabelDomains (Camacho-Collados and Navigli, 2017). Supervised domain labelling can be solved as any other multiclass problem, where the output of the model is a class probability distribution. In our zero-shot experiments we did not modify any of the pre-trained models. We just reformulate the domain labelling task to match the LMs' training objective.

3.1 Masked Language Modeling

Masked Language Modeling (MLM) is a pre-training objective followed by models such as BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019).
This objective works as follows. Given a sequence of tokens s = [t_1, t_2, ..., t_n], the sequence is first perturbed by replacing some of the tokens t_i with a special token [MASK]. Then, the model is trained to recover the original sequence s given the modified sequence ŝ. This denoising objective can be seen as a contextualized evolution of the previous CBOW objective from word2vec (Mikolov et al., 2013).

For domain labelling, we build the input for the model following this pattern:

s: Context: [context] Topic: [MASK]

where we introduce the input sentence replacing the [context] tag. Then, we let the model predict the most probable token for the [MASK] tag. For instance, given the biological definition of cell, the model returns the following topics: Biology, evolution, life, etc. This approach has been used to explore the knowledge of the model without any predefined set of domain labels in Section 5.7.

3.2 Next Sentence Prediction

Along with MLM, Next Sentence Prediction (NSP) is a training objective used by the BERT models. Given a pair of sentences s_1 and s_2, this objective predicts whether s_1 is followed by s_2 or not.

To adapt the BERT objective to the domain labelling task, we propose the following strategy, inspired by the work of Yin et al. (2019). We use the following input pattern:

s_1: [context]
s_2: Domain or topic about [domain-label]

where s_1 encodes a WordNet gloss as a context and s_2 is formed by a template and a domain label. In order to make the classification, we run the model as many times as there are domain labels and then apply a softmax over the positive-class outputs. We hypothesize that, no matter whether any of the s_2 candidates can really follow the given s_1, the most probable one should be the s_2 formed by the correct label. For instance, recall the hospital example shown in Table 1.

3.3 Natural Language Inference

In this case, we use a pre-trained LM that has been fine-tuned on a general inference task, Natural Language Inference (Williams et al., 2018a).
Given two sentences in the form of a premise s_1 and a hypothesis s_2, the NLI task consists in predicting whether s_1 entails or contradicts s_2, or whether the relation between both is neutral. We also use the input pattern shown in the previous NSP approach to adapt the NLI models to the domain labelling task. In this case, we just use the predictions of the entailment class. The predictions of the contradiction and neutral classes are not used. As in the previous case, no matter whether any of the s_2 hypotheses is entailed by the premise s_1 or not, the most probable entailment should correspond to the correct domain label. For example, consider again the example presented in Table 1.

4 Experimental Setup

This section describes our experimental setup. We introduce the pre-trained Language Models and the dataset used. For the Language Models, we have tested BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019) and BART (Wang et al., 2019). For the dataset, we have used the one released by Camacho-Collados et al. (2016) based on WordNet.
All the Language Models have been obtained from the Huggingface Transformers library (Wolf et al., 2019).
MLM

For this objective we have used the roberta-large and roberta-base checkpoints. These models have obtained state-of-the-art results on many NLP tasks and benchmarks.
NSP

For this objective we use the BERT models, as they are the only ones trained on that objective. For the sake of comparing the performance of more than one model per objective, we have selected the bert-large-uncased and bert-base-uncased checkpoints. They only differ in the size of the Language Model.
NLI

For this objective we used a RoBERTa-based checkpoint, roberta-large-mnli, which has been fine-tuned on MultiNLI (Williams et al., 2018b). We also include bart-large-mnli to test a generative model.
We evaluate our approaches on a dataset derived from WordNet which has been annotated with BabelDomains labels (Camacho-Collados et al., 2016). This dataset consists of synsets manually annotated with their corresponding BabelDomains label. The distribution of domain labels in the dataset is shown in Figure 1. Note that the dataset is quite unbalanced. In fact, some important domains such as Transport and travel or Food and drink have not a single labelled example. As our system is unsupervised, we use the whole dataset for testing.
5 Results

This section presents a quantitative and a qualitative evaluation.

Figure 1: Distribution of domains in the WordNet dataset.

Method                     Top-1  Top-3  Top-5
MNLI (roberta-large-mnli)
MNLI (bart-large-mnli)     61.81  79.85  87.59
NSP (bert-large-uncased)    2.07   8.57  16.49
NSP (bert-base-uncased)     2.85  10.32  16.88

Table 2: Top-K accuracy of the different approaches.

On the one hand, the quantitative evaluation has been done incrementally in order to obtain the best-performing system. First, we have evaluated the different alternative models using the same objective pattern. Then, once the best approach was selected, we have explored alternative patterns using the best model. When the best-performing pattern was discovered, we have focused on finding a better label representation. Finally, we have compared our best system against the previous state-of-the-art methods. On the other hand, as one of our approaches is based on a generative objective (MLM), the applied restrictions may not show the real performance of the method, so we decided to at least carry out a small qualitative review of that approach.
Table 2 shows the Top-1, Top-3 and Top-5 accuracy of each system when using the same objective pattern. To better understand the behaviour of the systems, we also present in Figure 2 the Top-K accuracy curve comparing all the approaches and a random baseline.

Figure 2: Top-K accuracy curve of the different approaches and a random classifier baseline.

As expected, the systems that follow the same approach perform similarly and share a similar curve. The best-performing system is the MNLI-based roberta-large-mnli, followed by the bart-large-mnli checkpoint. We observe a large difference between the different models. For instance, the models fine-tuned on the NLI task perform much better than those pre-trained on the general NSP task. The NSP approaches perform only slightly better than the random classifier, which can be a signal that this objective is not appropriate for the task.
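The Top-K accuracy used throughout this comparison can be computed as follows. This is a small self-contained sketch; the per-instance rankings and gold labels below are toy values, not our actual system outputs.

```python
def top_k_accuracy(ranked_predictions, gold_labels, k):
    """Fraction of examples whose gold label appears among the top-k ranked predictions."""
    hits = sum(label in preds[:k]
               for preds, label in zip(ranked_predictions, gold_labels))
    return hits / len(gold_labels)

# Toy example: per-instance label rankings (most probable first) and gold labels.
ranked = [["medicine", "biology", "culture"],
          ["sport", "games", "media"],
          ["music", "media", "art"]]
gold = ["medicine", "games", "art"]
print(top_k_accuracy(ranked, gold, 1))  # only the first example is correct at Top-1
print(top_k_accuracy(ranked, gold, 2))  # two gold labels fall within the Top-2
```

Top-1 accuracy is the usual accuracy; larger K values show how often the correct label is at least near the top of the ranking.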
Once the pre-trained Language Model was selected, we evaluated different input patterns for the roberta-large-mnli checkpoint. As mentioned before, the MNLI approaches follow the same structure as NSP, where s_1 is the gloss of the synset and s_2 the sequence formed by a textual template plus the label.

Table 3 shows the results obtained by testing different textual patterns. Very short patterns obtain low results. The best-performing textual template is The domain of the sentence is about [label].

Input pattern                                         Top-1  Top-3  Top-5
Topic: [label]                                        59.61  69.48  74.02
Domain: [label]                                       58.50  67.40  72.27
Theme: [label]                                        59.67  73.96  81.36
Subject: [label]                                      60.58  69.74  74.35
Is about [label]                                      73.37  87.72  91.94
Topic or domain about [label]                         78.44  87.46  89.74
The topic of the sentence is about [label]            80.71  92.92  95.77
The domain of the sentence is about [label]
The topic or domain of the sentence is about [label]  76.62  88.63  91.23

Table 3: Some of the explored input patterns for the MNLI approach and their Top-1, Top-3 and Top-5 accuracy.

As important as the input pattern is the set of domain labels used. Actually, BabelDomains uses labels that refer to one or several specific domains, for instance Art, architecture and archaeology. Although these coarse-grained labels can be useful for clustering closely related domains, we also implemented a two-step labelling procedure that takes those specific domains into account. First, we run the system over a set of specific domains or descriptors. Second, we apply a function that maps the descriptors to the original BabelDomains.

Descriptors

The descriptors defined in this work are quite simple. Given a composed domain label such as Art, architecture and archaeology, we define the set of descriptors as each of the components of the label, in this case Art, Architecture and Archaeology. In the case of labels that consist of a single domain, the descriptors are just the labels. For example, in the case of Music the descriptor is also Music.

Mapping function

The mapping function that we use in this work consists in taking the maximum score among the descriptors as the score of the original domain label, i.e. l_i = max(d_i1, d_i2, ..., d_in).

The inference time increases linearly with the number of labels. That is, for each example we need to test all the different domain labels. To speed up the labelling process, we automatically annotate the rest of the WordNet glosses (around 79,000 glosses) using our best zero-shot approach. Then, we use that automatically annotated dataset to train a much smaller Language Model for the task, for instance to label new definitions or new lexicons. We have fine-tuned two different models: the first one is based on DistilBERT (Sanh et al., 2019), which is 5 times smaller than roberta-large-mnli, and the second one is an XLM-RoBERTa (Conneau et al., 2020) base model, which is 2 times smaller and is trained in a multilingual fashion. We call them A2T_FT-small and A2T_FT-xlingual respectively. The first one achieves a x425 faster inference (5 times smaller and 85 times fewer inferences), while the second one achieves a speed boost of x170.

In order to know how good our final approach is, we compare our new systems with the previous ones. The results are reported in Table 4 in terms of Precision, Recall and F1 for comparison purposes. We also include the results of two previous state-of-the-art systems. As we can see, the new systems based on pre-trained Language Models obtain much better performance (from a previous best result with an F1 of 74.6 to the new one of 82.10). We also obtain a small improvement when establishing a threshold to decide whether a prediction is taken into consideration or not: our system performs slightly better with a confidence score greater than 5% (A2T (>0.05)). Figure 3 reports the Precision/Recall trade-off of the A2T system. As mentioned before, labels composed of multiple domains can make the prediction harder for the zero-shot system. As a result, a simple system using the label descriptors boosts the performance, reaching the final F1 score (A2T + descriptors). Finally, we also include the results of both fine-tuned student versions, which still obtain very competitive results while drastically reducing the inference time of the original models.

Method             Precision  Recall  F1
Distributional     84.0       59.8    69.9
BabelDomains       81.7       68.7    74.6
A2T                81.62      81.62   81.62
A2T (>0.05)
A2T + descriptors
A2T_FT-small
A2T_FT-xlingual

Table 4: Micro-averaged precision, recall and F1 for each of the systems. The Distributional (Camacho-Collados et al., 2016) and BabelDomains (Camacho-Collados and Navigli, 2017) measures are the ones reported by the respective authors.
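The two-step descriptor procedure can be sketched as follows: a composed label is split into single-domain descriptors, each descriptor is scored independently, and the label score is the maximum over its descriptors, l_i = max(d_i1, ..., d_in). In this illustrative sketch the splitting heuristic and the descriptor scores are assumptions standing in for the zero-shot NLI model.

```python
def descriptors_for(label):
    """Split a composed domain label into single-domain descriptors.

    Simple assumed heuristic: split on ", " and " and ", then capitalize.
    "Art, architecture and archaeology" -> ["Art", "Architecture", "Archaeology"]
    """
    parts = label.replace(" and ", ", ").split(", ")
    return [p.strip().capitalize() for p in parts if p.strip()]

def label_scores(labels, score_descriptor):
    """Score every descriptor and map back by taking the max over each label's descriptors."""
    return {label: max(score_descriptor(d) for d in descriptors_for(label))
            for label in labels}

# Placeholder descriptor scores standing in for the zero-shot model's outputs.
toy_scores = {"Art": 0.10, "Architecture": 0.65, "Archaeology": 0.05, "Music": 0.20}
labels = ["Art, architecture and archaeology", "Music"]
scores = label_scores(labels, toy_scores.get)
print(max(scores, key=scores.get))
```

With real scores, the winning BabelDomains label is the one whose best descriptor beats every other label's best descriptor.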
Figure 3: Precision/Recall trade-off of the A2T system. Annotations indicate the probability thresholds.

Figure 4 presents the confusion matrix of our best system. The matrix is row-wise normalized due to the imbalance of the dataset label distribution.

Figure 4: Row-wise normalized confusion matrix of the A2T + descriptors system.

Looking at the figure, there are 4 classes that are misleading. The "Animals" domain is confused with the related domains "Biology" and "Food and drink". For instance, this is the case of the synset <diet>, with the definition the usual food and drink consumed by an organism (person or animal), which is labelled by our system as "Food and drink". The "Games and video games" domain is confused with the related domain "Sport and recreation". For example, the sense referring to game: a single play of a sport or other contest; "the game lasted two hours" is labelled by our system as "Sport and recreation". The third one, "Heraldry, honors and vexillology", is also confused with a very close domain, "Royalty and nobility". Obviously, closely related domains can be very difficult to distinguish even for humans. For example, the sense <audio cd, audio compact disc>, annotated in the gold standard as "Music", is labelled by our system as "Media". Finally, sometimes the "History" domain is confused with "Food and drink". A curious example of this case is the sense referring to the historical event <Boston tea party>, which is labelled as "Food and drink".

Synset           cell       phase space            rounding error  wipeout
Label            Biology    Physics and astronomy  Mathematics     Sports and Recreation
Top predictions  Biology    EOS                    rounding        sports
                 EOS        physics                EOS             EOS
                 biology    Physics                math            sport
                 evolution  geometry               taxes           accident
                 life       relativity             Math            Sports

Table 5: Top predictions of the MLM approach using the roberta-large checkpoint.

Table 5 shows some of the top predictions obtained by the Masked Language Model (MLM) approach and the real label for 4 different synsets. In this case, the system is guessing its best predicted domain. That is, the system is not restricted to selecting the best label from a pre-defined set of domain labels. Instead, the system is free to return the word that best fits the masked term. We can see in the table that the predictions of the model are close to the correct label, although not always equal, sometimes just because of a different casing. They can also be seen as fine-grained domains or domain keywords of the real domain.
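The unrestricted probing above can be sketched as follows. This is an illustrative sketch only: the token distribution is a hard-coded placeholder, whereas the real system obtains it from a masked language model such as roberta-large by predicting the token at the masked position of the pattern.

```python
def mlm_pattern(context, mask_token="<mask>"):
    """Build the MLM input following the 'Context: [context] Topic: [MASK]' pattern."""
    return f"Context: {context} Topic: {mask_token}"

def top_k_tokens(token_probs, k):
    """Return the k most probable filler tokens for the masked position."""
    ranked = sorted(token_probs.items(), key=lambda kv: kv[1], reverse=True)
    return [token for token, _ in ranked[:k]]

gloss = "the basic structural and functional unit of all organisms"
query = mlm_pattern(gloss)
# Placeholder token probabilities, standing in for the model's output distribution.
probs = {"Biology": 0.41, "biology": 0.17, "evolution": 0.09,
         "life": 0.06, "history": 0.02}
print(query)
print(top_k_tokens(probs, 3))
```

Because the mask can be filled by any vocabulary token, the predictions behave like open-vocabulary domain keywords rather than labels from a fixed inventory.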
6 Conclusions and Future Work

In this paper we have explored several approaches for domain labelling of WordNet glosses by exploiting pre-trained LMs in a zero-shot manner. We have presented a simple approach that achieves a new state of the art on the BabelDomains dataset. Even though we have focused on domain labelling of WordNet glosses, our method seems to be robust enough to be adapted to tasks such as Sentiment Analysis or other types of text classification. In particular, we think that the approach can be very useful when no annotated data is available.

For the future, we have considered three main objectives. First, we plan to apply this approach to other sources of domain information, such as WordNet topics and WordNet Domains. We will also explore how to deal with definitions with generic domains (with no BabelDomains labels or with the WordNet Domains factotum label). Second, we aim to explore the cross-lingual capabilities of pre-trained Language Models for domain labelling of non-English wordnets and other lexical resources. Finally, we also plan to explore the utility of these findings in the Word Sense Disambiguation task.
Acknowledgments
This work has been funded by the Spanish Ministry of Science, Innovation and Universities under the project DeepReading (RTI2018-096846-B-C21) (MCIU/AEI/FEDER, UE) and by the BBVA Big Data 2018 "BigKnowledge for Text Mining (BigKnowledge)" project. We also acknowledge the support of the Nvidia Corporation with the donation of a GTX Titan X GPU used for this research.

References
Luisa Bentivogli, Pamela Forner, Bernardo Magnini, and Emanuele Pianta. 2004. Revising the WordNet Domains hierarchy: Semantics, coverage and balancing. In Proceedings of the Workshop on Multilingual Linguistic Resources, pages 94-101.

Samuel Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632-642.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. arXiv preprint arXiv:2005.14165.

Jose Camacho-Collados and Roberto Navigli. 2017. BabelDomains: Large-scale domain labeling of lexical resources. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 223-228, Valencia, Spain. Association for Computational Linguistics.

José Camacho-Collados, Mohammad Taher Pilehvar, and Roberto Navigli. 2016. NASARI: Integrating explicit knowledge and corpus statistics for a multilingual representation of concepts and entities. Artificial Intelligence, 240:36-64.

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440-8451, Online. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171-4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Aitor González, German Rigau, and Mauro Castillo. 2012. A graph-based method to improve WordNet Domains. In International Conference on Intelligent Text Processing and Computational Linguistics, pages 17-28. Springer.

Aitor Gonzalez-Agirre, Mauro Castillo, and German Rigau. 2012. A proposal for improving WordNet Domains. In LREC, pages 3457-3462.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Bernardo Magnini and G. Cavaglià. 2000. Integrating subject field codes into WordNet. In Proceedings of LREC-2000, 2nd International Conference on Language Resources and Evaluation, pages 1413-1418.

Bernardo Magnini, Carlo Strapparava, Giovanni Pezzulo, and Alfio Gliozzo. 2002. The role of domain information in word sense disambiguation. Natural Language Engineering, 8(4):359-373.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

George A. Miller. 1998. WordNet: An electronic lexical database. MIT Press.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227-2237, New Orleans, Louisiana. Association for Computational Linguistics.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9.

Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.

Liang Wang, Wei Zhao, Ruoyu Jia, Sujian Li, and Jingming Liu. 2019. Denoising based sequence-to-sequence pre-training for text generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4003-4015, Hong Kong, China. Association for Computational Linguistics.

Adina Williams, Nikita Nangia, and Samuel Bowman. 2018a. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112-1122. Association for Computational Linguistics.

Adina Williams, Nikita Nangia, and Samuel Bowman. 2018b. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112-1122.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. ArXiv, abs/1910.03771.

Wenpeng Yin, Jamaal Hay, and Dan Roth. 2019. Benchmarking zero-shot text classification: Datasets, evaluation and entailment approach. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP).