Does He Wink or Does He Nod? A Challenging Benchmark for Evaluating Word Understanding of Language Models
Lutfi Kerem Senel and Hinrich Schütze
Center for Information and Language Processing (CIS), LMU Munich, Germany
[email protected]
Abstract
Recent progress in pretraining language models on large corpora has resulted in large performance gains on many NLP tasks. These large models acquire linguistic knowledge during pretraining, which helps to improve performance on downstream tasks via fine-tuning. To assess what kind of knowledge is acquired, language models are commonly probed by querying them with 'fill in the blank' style cloze questions. Existing probing datasets mainly focus on knowledge about relations between words and entities. We introduce WDLMPro (Word Definition Language Model Probing) to evaluate word understanding directly using dictionary definitions of words. In our experiments, three popular pretrained language models struggle to match words and their definitions. This indicates that they understand many words poorly and that our new probing task is a difficult challenge that could help guide research on LMs in the future.
Natural language processing (NLP) has advanced drastically in the last decade with the design of larger and more sophisticated models, the availability of larger corpora and increasing computational power. Pretrained word embeddings (Mikolov et al., 2013; Pennington et al., 2014) popularized the use of distributed word representations, which became a fundamental building block for NLP systems. Peters et al. (2018a) introduced LSTM-based deep contextual representations and obtained large performance gains by fine-tuning on tasks after unsupervised pretraining (Radford et al., 2018; Howard and Ruder, 2018). More recently, the attention-based transformer architecture was shown to use context more effectively (Vaswani et al., 2017) and several subsequent models achieved state-of-the-art results on many NLP tasks by combining the transformer architecture with unsupervised pretraining and task-specific fine-tuning (Devlin et al., 2019; Liu et al., 2019). Radford et al. (2019) showed that language models can be applied to a variety of tasks without task-specific fine-tuning. This was demonstrated on a much larger scale by Brown et al. (2020).

Deep models improve performance. However, what they actually learn about language and word meaning is still to a large extent unclear due to their uninterpretable nature. For static word embeddings, researchers used word similarity (Hill et al., 2015) and word analogy (Gladkova et al., 2016) tests to shed light on what information is captured in these dense vector spaces. For language models, a great amount of linguistic knowledge is stored in the model parameters (Peters et al., 2018b). Several studies proposed using 'fill in the blank' type cloze statements to test knowledge learned by these models during unsupervised pretraining. Petroni et al. (2019) proposed the LAMA (LAnguage Model Analysis) probe to test the factual and common sense knowledge stored in language models.
Similarly, Schick and Schütze (2020) introduced WNLaMPro (WordNet Language Model Probing) to assess the ability of language models to understand words based on their frequency. In WNLaMPro, cloze-style questions are generated based on antonym, hypernym and cohyponym relations among words extracted from WordNet.

The existing probing datasets mainly focus on investigating knowledge about relations between words or entities. However, a more direct way of testing whether a language model understands the meaning of a word is to use its dictionary definition. If a pretrained language model truly understands the meaning of a word, then it should be able to match it with its dictionary definition. Based on this motivation, we introduce the Word Definition Language Model Probing (WDLMPro) dataset; it is a challenging benchmark for testing NLP models for their ability to understand words. WDLMPro is essentially a set of thousands of synset groups; each synset group consists of a target word (with its definition) and its taxonomic sisters (with their definitions). Using taxonomic sisters, rather than random word groups, makes the task more challenging for statistical models that are based on the distributional hypothesis, since these words have similar distributional characteristics (Lenci, 2008). We evaluate two masked language models, BERT and RoBERTa, and the auto-regressive model GPT-2 on WDLMPro using two different probing tests: (i) match definition to word (D2W) and (ii) match word to definition (W2D). We also provide a baseline using static fastText embeddings (Mikolov et al., 2018). We find that all three language models perform clearly better than the baseline. Nevertheless, they have great difficulty matching words and their definitions, implying a poor understanding of word meaning. This is an important result that could help guide research on LMs in the future.

synset                    definition
a cappella singing.n.01   singing without instrumental accompaniment
caroling.n.01             singing joyful religious songs (especially at Christmas)
crooning.n.01             singing in a soft low tone
singalong.n.01            informal group singing of popular songs
bel canto.n.01            a style of operatic singing

Table 1: Five candidates from G(t) for t = a cappella singing.n.01 and their definitions.

Table 2: WDLMPro statistics (Noun and Verb columns; average and min/max group sizes; numeric values lost in extraction).

In this section, we introduce WDLMPro (Word Definition Language Model Probing), a dataset to test how well NLP models can match nouns and verbs with their definitions. We view this as a test of how well the models understand lexical meaning.
WordNet (Miller, 1995) is the basis for constructing WDLMPro. A WordNet synset contains a set of synonyms along with a short definition of the synset. Different senses of polysemous words are represented in different synsets, providing disambiguation. WordNet connects synsets with each other via semantic relations.

Based on a target synset t and the semantic relation hyponymy (<), we construct a synset group G for the target as follows:

G(t) = { x | ∃y : t < y ∧ x < y }

that is, G(t) contains all synsets that are "sister hyponyms" to t with respect to a hypernym of t. G(t), along with the definitions of the synsets in G(t), will be used to set up the WDLMPro tasks that require matching of words and definitions. We discard groups G(t) that have a size of less than 5.

In this study, we focus on nouns and verbs, i.e., we create synset groups G for the nouns and verbs in WordNet. Table 1 displays five members from G(t) and their definitions for the target a cappella singing.n.01 (see appx. for the target beckon.v.01). Table 2 shows statistics of the dataset.

We define two probing tests that are converses of each other:
• Match definition to word (D2W). Given a definition and a set of words, the task is to find the word that the definition defines.
• Match word to definition (W2D). Given a word and a set of definitions, the task is to find the definition that defines the word.

Each synset group G(t) gives rise to one instance of D2W by providing the definition of t and all words in G(t). The word from G(t) that matches the definition then has to be identified. (Note that t is a member of G(t).) Similarly, each synset group G(t) gives rise to one instance of W2D by providing t and the definitions of all words in G(t). The correct definition of t then has to be identified among all definition candidates. Note that WordNet definitions by construction do not contain the word they define.

Table 3: Cloze-style query patterns for masked and autoregressive language models (separate Noun and Verb patterns; the pattern strings themselves were lost in extraction).
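The group construction described above can be sketched in a few lines. The snippet below is a toy illustration only: it uses a small hand-built child-to-hypernym map in place of the real WordNet graph, so the helper names and the extra synset `yodeling.n.01` are assumptions, not part of the dataset.

```python
# Toy sketch of the WDLMPro synset-group construction.
# HYPERNYMS maps each child synset to the set of its direct hypernyms;
# in the real dataset this information comes from WordNet.
HYPERNYMS = {
    "a cappella singing.n.01": {"singing.n.01"},
    "caroling.n.01": {"singing.n.01"},
    "crooning.n.01": {"singing.n.01"},
    "singalong.n.01": {"singing.n.01"},
    "bel canto.n.01": {"singing.n.01"},
    "yodeling.n.01": {"singing.n.01"},  # made-up extra sister
}

def group(target, hypernyms, min_size=5):
    """G(t) = all synsets x that share a hypernym y with t (t included).
    Groups smaller than min_size are discarded, as in the paper."""
    parents = hypernyms.get(target, set())
    g = {x for x, ps in hypernyms.items() if ps & parents}
    return g if len(g) >= min_size else None
```

Note that the target itself always falls into its own group, matching the paper's remark that t is a member of G(t).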
In principle, any NLP model can be tested on D2W and W2D. In this paper, we are particularly interested in testing language models. To this end, we convert the data to a format that is suitable for language models, i.e., to cloze-style questions as shown in Table 3. The basic quantity that allows us to assess the compatibility of a word t and a definition is the probability of t being generated for "___" when the definition is substituted for <DEF>. More precisely, we compute the probability that the string representation of t is generated. We will denote the string representation of synset t by t. We obtain the string representation by removing the word type and sense information from the name of the synset and replacing underscores with white space. For example, synset warm_up.v.04 is represented by the string "warm up".

Table 3 shows that we define different templates for masked and autoregressive language models. For the masked language models, we average the prediction scores across patterns before ranking the candidates.

For a masked language model (MLM) M, the probability of a candidate c ∈ G(t) on W2D is calculated as:

P^W2D_M(c | t) = ∏_{i=1}^{|t|} P(t_i | Q(c, |t|))

where t = [t_1, t_2, ..., t_{|t|}] is the tokenization produced by M, and Q(c, |t|) is the input query created from one of the patterns (Table 3) with "___" replaced with |t| consecutive mask tokens. For an autoregressive language model (ALM) A, we decompose the probability in the standard way:

P^W2D_A(c | t) = ∏_{i=1}^{|t|} P(t_i | Q(c), t_1, ..., t_{i-1})

For D2W, we need to compare, given a definition, the probabilities of different candidate words that are generally of different lengths. To ensure a fair comparison, we follow Xiong et al. (2020).
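The W2D score above can be sketched with a toy stand-in for the model's mask-filling distribution. Everything below is illustrative: `token_prob`, the query string, and all probability values are made up, not taken from any real pretrained model.

```python
import math

def token_prob(token, query):
    # Hypothetical lookup table standing in for a pretrained MLM's
    # mask-filling distribution; all numbers are invented.
    table = {
        ("warm", "to <MASK> <MASK> is to exercise lightly"): 0.6,
        ("up", "to <MASK> <MASK> is to exercise lightly"): 0.5,
    }
    return table.get((token, query), 0.01)  # small floor for unseen tokens

def w2d_score(target_tokens, query):
    """P_W2D(c | t) = product over i of P(t_i | Q(c, |t|)).
    Computed in log space for numerical stability, then exponentiated."""
    return math.exp(sum(math.log(token_prob(tok, query)) for tok in target_tokens))
```

A candidate definition whose query assigns high probability to every token of the target string (here "warm" and "up") receives a higher score than one that does not, which is exactly how candidates are ranked.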
For MLMs, we match the number of mask tokens in an input query to the token count of each candidate. The final score is the average log-probability of the masked tokens:

P^D2W_M(c | t) = (1/|c|) ∑_{i=1}^{|c|} log P(c_i | Q(t, |c|))

For ALMs, we use the probability of the first token:

P^D2W_A(c | t) = P(c_1 | Q(t))

Considering further tokens does not make sense since they are often easily predictable from the first token.

We apply our probing test to two different pretrained MLMs (BERT and RoBERTa) and one ALM (GPT-2). To investigate the effect of model size on performance, we experiment with both base and large versions of BERT and RoBERTa along with all four sizes of GPT-2 (small, medium, large, xl). For RoBERTa, we capitalize the first letter of the candidate noun since pretrained RoBERTa models are case sensitive and expect a capital letter at the beginning of a sentence. (Not using capitalization resulted in poor performance for single-token target words on D2W.)

In addition to the deep contextual language models, we also provide fastText static word embeddings (Mikolov et al., 2018) as a baseline. (We use the crawl-300d-2M-subword model from https://fasttext.cc/docs/en/english-vectors.html.) For fastText embeddings, we tokenize the candidates and their definitions using the NLTK tokenizer and represent them with their average vector. We rank candidates based on their cosine similarity to the target embedding. (A reviewer suggests that it would also be interesting to investigate the performance of supervised approaches, e.g., ranking models. Our main focus here is the lexical knowledge acquired in pretraining, so we leave this for future work.)
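The static-embedding baseline can be sketched as follows. The 3-dimensional vectors below are made-up stand-ins for the real 300-dimensional fastText vectors, and the token lists are invented; only the averaging-plus-cosine ranking scheme comes from the paper.

```python
import math

# Made-up word vectors standing in for pretrained fastText embeddings.
VECS = {
    "singing": [1.0, 0.0, 0.0],
    "soft": [0.0, 1.0, 0.0],
    "loud": [0.0, 0.0, 1.0],
    "crooning": [0.7, 0.7, 0.0],
}

def avg_vec(tokens):
    """Represent a tokenized definition by the average of its word vectors."""
    n = len(tokens)
    return [sum(dim) / n for dim in zip(*(VECS[t] for t in tokens))]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Rank candidate words against a (toy) tokenized definition.
definition = avg_vec(["singing", "soft"])  # stand-in for "singing in a soft low tone"
ranking = sorted(["crooning", "loud"], key=lambda w: -cosine(VECS[w], definition))
```

With real fastText vectors, the definitions would also be averaged over all their NLTK tokens in the same way before ranking by cosine similarity.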
Measures

We use two measures: precision at 1 (P@1) and a rank score (RS), both based on a ranked results list, either of words or of definitions. P@1 is the percentage of top-ranked items that are correct. We define RS as follows:

RS(L, k) = (L − k) / (L − 1)

where L = |G(t)| is the number of candidates and k is the rank of the correct item, 1 ≤ k ≤ L. Table 2 shows that the size of G(t) is highly variable; in contrast to P@1, RS is less affected by this and the random baseline (cf. Tables 4 and 5) is always 0.5.

Tables 4 and 5 present W2D and D2W results for BERT, RoBERTa and GPT-2 along with fastText and random baselines. Language models perform clearly better than both baselines. Larger models generally perform better than smaller ones and RoBERTa consistently outperforms BERT. This might be an indication of a correlation between performance on WDLMPro and downstream performance. However, further investigation is necessary to show the correlation more clearly. For W2D, the best performance is achieved by GPT-2 xl for nouns (47.3 P@1, 0.81 RS) and by RoBERTa large for verbs (50.8 P@1, 0.84 RS). Performance on D2W is much lower than on W2D for all models. For nouns, RoBERTa large and GPT-2 xl perform similarly (28.8 and 29.8 P@1, 0.70 and 0.73 RS) while RoBERTa large achieves the best results for verbs (38.6 P@1, 0.80 RS). The poorer performance on D2W compared to W2D might be due to language models being better able to distinguish different definitions than individual words, since definitions are more informative than individual words.
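The rank score defined in the Measures section is a one-line function; the sketch below (names are ours, not the paper's) makes its normalization explicit.

```python
def rank_score(num_candidates, correct_rank):
    """RS(L, k) = (L - k) / (L - 1): 1.0 when the correct item is ranked
    first, 0.0 when it is ranked last, and 0.5 in expectation for a
    uniformly random ranker, regardless of the group size L."""
    assert 1 <= correct_rank <= num_candidates and num_candidates > 1
    return (num_candidates - correct_rank) / (num_candidates - 1)
```

Because the expectation of RS under random ranking is 0.5 for any L, RS is comparable across synset groups of very different sizes, unlike P@1.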
Overall, GPT-2 models perform better than masked language models (with the exception of RoBERTa large for verbs), despite using a single pattern as opposed to the multiple patterns used by the masked language models. This might indicate that the ALM objective is better at learning word meaning than the MLM objective.

To investigate the effect of frequency, we stratify words into rare (fewer than 10 occurrences), medium (10 to 99 occurrences) and frequent (100 or more occurrences), based on occurrences in WWC (Westbury Wikipedia Corpus, Shaoul (2010)). (Targets that have more than 3 tokens, based on NLTK tokenization, are taken as rare without counting.)
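The frequency stratification just described can be written as a small helper; the function name and signature are ours, but the thresholds and the more-than-3-tokens rule come directly from the paper.

```python
def freq_bucket(count, n_tokens=1):
    """Frequency stratum of a target word based on its WWC occurrence count.
    Targets with more than 3 NLTK tokens are treated as rare without
    counting, as stated in the paper."""
    if n_tokens > 3:
        return "rare"
    if count < 10:
        return "rare"
    if count < 100:
        return "medium"
    return "frequent"
```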
Table 4: P@1 and rank score (RS) on W2D (rows: BERT base/large, RoBERTa base/large, GPT-2 s/m/l/xl; columns: Noun and Verb P@1/RS; numeric values lost in extraction).
Table 5: P@1 and rank score (RS) on D2W (rows: BERT base/large, RoBERTa base/large, GPT-2 s/m/l/xl; columns: Noun and Verb P@1/RS; numeric values lost in extraction).

We use WWC frequency as a substitute for frequency in the models' training corpora. We focus on nouns since most verbs in our dataset are relatively frequent. Table 7 shows that, for W2D, all models have a poor understanding of the meaning of rare and medium-frequency words. (See appx. for D2W results.) Even for frequent words, P@1 is never above 55.

We additionally break down the results based on the depth of the synsets in the WordNet hierarchy. Specifically, we investigate the performance of the GPT-2 xl model on W2D for WordNet nouns, where we take the depth of a synset group as the length of the shortest path from the target synset to the root synset (i.e., entity.n.01). Table 6 shows that performance drops steadily as we go deeper in the hierarchy. Lower levels of the WordNet hierarchy contain many scientific terms and names of (sub)species such as types of cattle (e.g., cattalo, hereford, galloway). These results suggest that even very large LMs lack the knowledge necessary to distinguish these terms.

Table 6: RS and P@1 results for GPT-2 xl on W2D for nouns from different depths of the WordNet hierarchy (numeric values lost in extraction).

Table 7: P@1 scores on W2D for nouns of different frequency ranges (rows: BERT base/large, RoBERTa base/large, GPT-2 s/m/l/xl; columns: rare, medium, frequent, all; numeric values lost in extraction).
Analysis.
The correct definition of the medium-frequency verb 'beckon' is 'signal with the hands or nod'. GPT-2 xl predicts 'signal by winking'. The correct definition of the frequent noun 'roleplaying' is 'acting a particular role (as in psychotherapy)'; GPT-2 xl predicts 'acting the part of a character on stage'. So GPT-2 xl understands that beckoning is signaling and that roleplaying is acting, but it has not learned to distinguish between different types of signaling and acting. This points to an important future goal for LMs: they should be developed to gain an understanding of words that goes beyond the current superficial state of the art.

Human performance on WDLMPro.
It is beyond the scope of this paper to evaluate human performance on the entirety of WDLMPro. However, we provide a comparison with human performance on a small subset to give an intuition about the difficulty of the task. For each of the two tasks, 20 synset groups that have a maximum of 10 candidates are randomly sampled from WDLMPro. Then two native English speakers are asked to rank the candidates. Table 8 displays the average performance of the human participants and the language models on this subset. For both tasks, the performance of the best model is comparable to the
average human performance.

Table 8: LM and human performance on 20 randomly sampled synset groups from WDLMPro (rows: BERT base/large, RoBERTa base/large, GPT-2 s/m/l/xl; columns: W2D and D2W P@1/RS; numeric values lost in extraction).

Human performance is the upper bound for many NLP tasks. We believe that this is not the case for WDLMPro: arguably, we should aim for models with an excellent understanding of the meanings of words even if it is better than average human understanding. Knowledge-based tasks are an analogous case: we should strive for models that know as many facts as possible even if that performance is above average human performance.
We introduced WDLMPro, a probing test that helps analyze how well a model understands word meaning. WDLMPro is complementary to existing probing tests that are about relations between words or entities. We evaluated three popular pretrained language models on the W2D (word to definition) and D2W (definition to word) tasks. Our findings show that, despite their remarkable performance on many downstream tasks, these models struggle to match a word and its true definition, suggesting an insufficient understanding of word meaning. The relatively poor performance of these powerful models on WDLMPro can be seen as evidence for the limitations of purely distributional systems and the need for incorporating external knowledge. WDLMPro provides an important evaluation benchmark, encouraging the design and training of models with precise word understanding.
Acknowledgements.
We thank Denis Peskov and Sander Schulhoff for helping out with the human evaluation and the anonymous reviewers for their insightful comments and suggestions. This work was funded by the European Research Council (ERC).

References
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. arXiv preprint arXiv:2005.14165.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Anna Gladkova, Aleksandr Drozd, and Satoshi Matsuoka. 2016. Analogy-based detection of morphological and semantic relations with word embeddings: what works and what doesn't. In Proceedings of the NAACL Student Research Workshop, pages 8–15, San Diego, California. Association for Computational Linguistics.

Felix Hill, Roi Reichart, and Anna Korhonen. 2015. SimLex-999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics, 41(4):665–695.

Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 328–339, Melbourne, Australia. Association for Computational Linguistics.

Alessandro Lenci. 2008. Distributional semantics in linguistic and cognitive research. Italian Journal of Linguistics, 20(1):1–31.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Tomas Mikolov, Greg Corrado, Kai Chen, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. In Proceedings of the International Conference on Learning Representations (ICLR), pages 1–12.

Tomas Mikolov, Edouard Grave, Piotr Bojanowski, Christian Puhrsch, and Armand Joulin. 2018. Advances in pre-training distributed word representations. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018).

George A. Miller. 1995. WordNet: a lexical database for English. Communications of the ACM, 38(11):39–41.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018a. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics.

Matthew Peters, Mark Neumann, Luke Zettlemoyer, and Wen-tau Yih. 2018b. Dissecting contextual word embeddings: Architecture and representation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1499–1509, Brussels, Belgium. Association for Computational Linguistics.

Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander Miller. 2019. Language models as knowledge bases? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2463–2473, Hong Kong, China. Association for Computational Linguistics.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. Technical report.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training.

Timo Schick and Hinrich Schütze. 2020. Rare words: A major problem for contextualized embeddings and how to fix it by attentive mimicking. Proceedings of the AAAI Conference on Artificial Intelligence, 34:8766–8774.

Cyrus Shaoul and Chris Westbury. 2010. The Westbury Lab Wikipedia corpus.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Wenhan Xiong, Jingfei Du, William Yang Wang, and Veselin Stoyanov. 2020. Pretrained encyclopedia: Weakly supervised knowledge-pretrained language model. In International Conference on Learning Representations.

Appendix

synset         definition
beckon.v.01    signal with the hands or nod
applaud.v.01   clap one's hands or shout after performances to indicate approval
bow.v.01       bend one's knee or body, or lower one's head
shrug.v.01     raise one's shoulders to indicate indifference or resignation
exsert.v.01    thrust or extend out
wink.v.01      signal by winking
nod.v.01       express or signify by nodding
Table 9: Seven candidates of G(t) for t = beckon.v.01 and their definitions.

P@1 scores on D2W for nouns of different frequency ranges (the BERT and RoBERTa rows were lost in extraction):

Model      rare   medium   frequent   all
GPT-2 xl   19.3   24.8     36.3       29.8
Random     6.7    7.1      8.3        7.6