Does He Wink or Does He Nod? A Challenging Benchmark for Evaluating Word Understanding of Language Models
Lutfi Kerem Senel and Hinrich Schütze
Center for Information and Language Processing (CIS), LMU Munich, Germany
[email protected]
Abstract
Recent progress in pretraining language models on large corpora has resulted in large performance gains on many NLP tasks. These large models acquire linguistic knowledge during pretraining, which helps to improve performance on downstream tasks via fine-tuning. To assess what kind of knowledge is acquired, language models are commonly probed by querying them with 'fill in the blank' style cloze questions. Existing probing datasets mainly focus on knowledge about relations between words and entities. We introduce WDLMPro (Word Definition Language Model Probing) to evaluate word understanding directly using dictionary definitions of words. In our experiments, three popular pretrained language models struggle to match words and their definitions. This indicates that they understand many words poorly and that our new probing task is a difficult challenge that could help guide research on LMs in the future.
Natural language processing (NLP) has advanced drastically in the last decade with the design of larger and more sophisticated models, the availability of larger corpora and increasing computational power. Pretrained word embeddings (Mikolov et al., 2013; Pennington et al., 2014) popularized the use of distributed word representations, which became a fundamental building block for NLP systems. Peters et al. (2018a) introduced LSTM-based deep contextual representations and obtained large performance gains by fine-tuning on tasks after unsupervised pretraining (Radford et al., 2018; Howard and Ruder, 2018). More recently, the attention-based transformer architecture was shown to use context more effectively (Vaswani et al., 2017) and several subsequent models achieved state-of-the-art results on many NLP tasks by combining the transformer architecture with unsupervised pretraining and task-specific fine-tuning (Devlin et al., 2019; Liu et al., 2019). Radford et al. (2019) showed that language models can be applied to a variety of tasks without task-specific fine-tuning. This was demonstrated on a much larger scale by Brown et al. (2020).

Deep models improve performance. However, what they actually learn about language and word meaning is still to a large extent unclear due to their uninterpretable nature. For static word embeddings, researchers used word similarity (Hill et al., 2015) and word analogy (Gladkova et al., 2016) tests to shed light on what information is captured in these dense vector spaces. For language models, a great amount of linguistic knowledge is stored in the model parameters (Peters et al., 2018b). Several studies proposed using 'fill in the blank' type cloze statements to test knowledge learned by these models during unsupervised pretraining. Petroni et al. (2019) proposed the LAMA (LAnguage Model Analysis) probe to test the factual and common sense knowledge stored in language models.
Similarly, Schick and Schütze (2020) introduced WNLaMPro (WordNet Language Model Probing) to assess the ability of language models to understand words based on their frequency. In WNLaMPro, cloze-style questions are generated based on antonym, hypernym and cohyponym relations among words extracted from WordNet.

The existing probing datasets mainly focus on investigating knowledge about relations between words or entities. However, a more direct way of testing whether a language model understands the meaning of a word is to use its dictionary definition. If a pretrained language model truly understands the meaning of a word, then it should be able to match it with its dictionary definition. Based on this motivation, we introduce the Word Definition Language Model Probing (WDLMPro) dataset; it is a challenging benchmark for testing NLP models for their ability to understand words. WDLMPro is essentially a set of thousands of synset groups; each synset group consists of a target word (with its definition) and its taxonomic sisters (with their definitions). Using taxonomic sisters, rather than random word groups, makes the task more challenging for statistical models that are based on the distributional hypothesis, since these words have similar distributional characteristics (Lenci, 2008). We evaluate two masked language models, BERT and RoBERTa, and the auto-regressive model GPT-2 on WDLMPro using two different probing tests: (i) match definition to word (D2W) and (ii) match word to definition (W2D). We also provide a baseline using static fastText embeddings (Mikolov et al., 2018). We find that all three language models perform clearly better than the baseline. Nevertheless, they have great difficulty matching words and their definitions, implying a poor understanding of word meaning. This is an important result that could help guide research on LMs in the future.

synset                    definition
a cappella singing.n.01   singing without instrumental accompaniment
caroling.n.01             singing joyful religious songs (especially at Christmas)
crooning.n.01             singing in a soft low tone
singalong.n.01            informal group singing of popular songs
bel canto.n.01            a style of operatic singing

Table 1: Five candidates from G(t) for t = a cappella singing.n.01 and their definitions.

Table 2: WDLMPro statistics (Noun and Verb columns; average and min/max group sizes; numeric values lost in extraction).

In this section, we introduce WDLMPro (Word Definition Language Model Probing), a dataset to test how well NLP models can match nouns and verbs with their definitions. We view this as a test of how well the models understand lexical meaning.
WordNet (Miller, 1995) is the basis for constructing WDLMPro. A WordNet synset contains a set of synonyms along with a short definition of the synset. Different senses of polysemous words are represented in different synsets, providing disambiguation. WordNet connects synsets with each other via semantic relations.

Based on a target synset t and the semantic relation hyponymy (<), we construct a synset group G for the target as follows:

G(t) = { x | ∃y : t < y ∧ x < y }

that is, G(t) contains all synsets that are "sister hyponyms" to t with respect to a hypernym of t. G(t), along with the definitions of the synsets in G(t), will be used to set up the WDLMPro tasks that require matching of words and definitions. We discard groups G(t) that have a size of less than 5.

In this study, we focus on nouns and verbs, i.e., we create synset groups G for the nouns and verbs in WordNet. Table 1 displays five members from G(t) and their definitions for the target a cappella singing.n.01 (see appx. for the target beckon.v.01). Table 2 shows statistics of the dataset.

We define two probing tests that are converses of each other:
• Match definition to word (D2W). Given a definition and a set of words, the task is to find the word that the definition defines.
• Match word to definition (W2D). Given a word and a set of definitions, the task is to find the definition that defines the word.

Each synset group G(t) gives rise to one instance of D2W by providing the definition of t and all words in G(t). The word from G(t) that matches the definition then has to be identified. (Note that t is a member of G(t).) Similarly, each synset group G(t) gives rise to one instance of W2D by providing t and the definitions of all words in G(t). The correct definition of t then has to be identified among all definition candidates. Note that WordNet definitions by construction do not contain the word they define.

Table 3: Cloze-style query patterns for masked and autoregressive language models (separate Noun and Verb patterns; the pattern strings themselves were lost in extraction).
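The group construction described above can be sketched in a few lines. The snippet below is a toy illustration only: it uses a small hand-built child-to-hypernym map in place of the real WordNet graph, so the helper names and the extra synset `yodeling.n.01` are assumptions, not part of the dataset.

```python
# Toy sketch of the WDLMPro synset-group construction.
# HYPERNYMS maps each child synset to the set of its direct hypernyms;
# in the real dataset this information comes from WordNet.
HYPERNYMS = {
    "a cappella singing.n.01": {"singing.n.01"},
    "caroling.n.01": {"singing.n.01"},
    "crooning.n.01": {"singing.n.01"},
    "singalong.n.01": {"singing.n.01"},
    "bel canto.n.01": {"singing.n.01"},
    "yodeling.n.01": {"singing.n.01"},  # made-up extra sister
}

def group(target, hypernyms, min_size=5):
    """G(t) = all synsets x that share a hypernym y with t (t included).
    Groups smaller than min_size are discarded, as in the paper."""
    parents = hypernyms.get(target, set())
    g = {x for x, ps in hypernyms.items() if ps & parents}
    return g if len(g) >= min_size else None
```

Note that the target itself always falls into its own group, matching the paper's remark that t is a member of G(t).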
In principle, any NLP model can be tested on D2W and W2D. In this paper, we are particularly interested in testing language models. To this end, we convert the data to a format that is suitable for language models, i.e., to cloze-style questions as shown in Table 3. The basic quantity that allows us to assess the compatibility of a word t and a definition is the probability of t being generated for "___" when the definition is substituted for <DEF>. More precisely, we compute the probability that the string representation of t is generated. We will denote the string representation of synset t by t. We obtain the string representation by removing the word type and sense information from the name of the synset and replacing underscores with white space. For example, synset warm_up.v.04 is represented by the string "warm up".

Table 3 shows that we define different templates for masked and autoregressive language models. For the masked language models, we average the prediction scores across patterns before ranking the candidates.

For a masked language model (MLM) M, the probability of a candidate c ∈ G(t) on W2D is calculated as:

P^W2D_M(c | t) = ∏_{i=1}^{|t|} P(t_i | Q(c, |t|))

where t = [t_1, t_2, ..., t_{|t|}] is the tokenization produced by M, and Q(c, |t|) is the input query created from one of the patterns (Table 3) with "___" replaced with |t| consecutive mask tokens. For an autoregressive language model (ALM) A, we decompose the probability in the standard way:

P^W2D_A(c | t) = ∏_{i=1}^{|t|} P(t_i | Q(c), t_1, ..., t_{i-1})

For D2W, we need to compare, given a definition, the probabilities of different candidate words that are generally of different lengths. To ensure a fair comparison, we follow Xiong et al. (2020).
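The W2D score above can be sketched with a toy stand-in for the model's mask-filling distribution. Everything below is illustrative: `token_prob`, the query string, and all probability values are made up, not taken from any real pretrained model.

```python
import math

def token_prob(token, query):
    # Hypothetical lookup table standing in for a pretrained MLM's
    # mask-filling distribution; all numbers are invented.
    table = {
        ("warm", "to <MASK> <MASK> is to exercise lightly"): 0.6,
        ("up", "to <MASK> <MASK> is to exercise lightly"): 0.5,
    }
    return table.get((token, query), 0.01)  # small floor for unseen tokens

def w2d_score(target_tokens, query):
    """P_W2D(c | t) = product over i of P(t_i | Q(c, |t|)).
    Computed in log space for numerical stability, then exponentiated."""
    return math.exp(sum(math.log(token_prob(tok, query)) for tok in target_tokens))
```

A candidate definition whose query assigns high probability to every token of the target string (here "warm" and "up") receives a higher score than one that does not, which is exactly how candidates are ranked.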
For MLMs, we match the number of mask tokens in an input query to the token count of each candidate. The final score is the average log-probability of the masked tokens:

P^D2W_M(c | t) = (1/|c|) ∑_{i=1}^{|c|} log P(c_i | Q(t, |c|))

For ALMs, we use the probability of the first token:

P^D2W_A(c | t) = P(c_1 | Q(t))

Considering further tokens does not make sense since they are often easily predictable from the first token.

We apply our probing test to two different pretrained MLMs (BERT and RoBERTa) and one ALM (GPT-2). To investigate the effect of model size on performance, we experiment with both base and large versions of BERT and RoBERTa along with all four sizes of GPT-2 (small, medium, large, xl). For RoBERTa, we capitalize the first letter of the candidate noun since pretrained RoBERTa models are case sensitive and expect a capital letter at the beginning of a sentence. (Not using capitalization resulted in poor performance for single-token target words on D2W.)

In addition to the deep contextual language models, we also provide fastText static word embeddings (Mikolov et al., 2018) as a baseline. (We use the crawl-300d-2M-subword model from https://fasttext.cc/docs/en/english-vectors.html.) For fastText embeddings, we tokenize the candidates and their definitions using the NLTK tokenizer and represent them with their average vector. We rank candidates based on their cosine similarity to the target embedding. (A reviewer suggests that it would also be interesting to investigate the performance of supervised approaches, e.g., ranking models. Our main focus here is the lexical knowledge acquired in pretraining, so we leave this for future work.)
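The static-embedding baseline can be sketched as follows. The 3-dimensional vectors below are made-up stand-ins for the real 300-dimensional fastText vectors, and the token lists are invented; only the averaging-plus-cosine ranking scheme comes from the paper.

```python
import math

# Made-up word vectors standing in for pretrained fastText embeddings.
VECS = {
    "singing": [1.0, 0.0, 0.0],
    "soft": [0.0, 1.0, 0.0],
    "loud": [0.0, 0.0, 1.0],
    "crooning": [0.7, 0.7, 0.0],
}

def avg_vec(tokens):
    """Represent a tokenized definition by the average of its word vectors."""
    n = len(tokens)
    return [sum(dim) / n for dim in zip(*(VECS[t] for t in tokens))]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Rank candidate words against a (toy) tokenized definition.
definition = avg_vec(["singing", "soft"])  # stand-in for "singing in a soft low tone"
ranking = sorted(["crooning", "loud"], key=lambda w: -cosine(VECS[w], definition))
```

With real fastText vectors, the definitions would also be averaged over all their NLTK tokens in the same way before ranking by cosine similarity.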
Measures

We use two measures: precision at 1 (P@1) and a rank score (RS), both based on a ranked results list, either of words or of definitions. P@1 is the percentage of top-ranked items that are correct. We define RS as follows:

RS(L, k) = (L − k) / (L − 1)

where L = |G(t)| is the number of candidates and k is the rank of the correct item, 1 ≤ k ≤ L. Table 2 shows that the size of G(t) is highly variable; in contrast to P@1, RS is less affected by this and the random baseline (cf. Tables 4 and 5) is always 0.5.

Tables 4 and 5 present W2D and D2W results for BERT, RoBERTa and GPT-2 along with fastText and random baselines. Language models perform clearly better than both baselines. Larger models generally perform better than smaller ones and RoBERTa consistently outperforms BERT. This might be an indication of a correlation between performance on WDLMPro and downstream performance. However, further investigation is necessary to show the correlation more clearly. For W2D, the best performance is achieved by GPT-2 xl for nouns (47.3 P@1, 0.81 RS) and by RoBERTa large for verbs (50.8 P@1, 0.84 RS). Performance on D2W is much lower than on W2D for all models. For nouns, RoBERTa large and GPT-2 xl perform similarly (28.8 and 29.8 P@1, 0.70 and 0.73 RS) while RoBERTa large achieves the best results for verbs (38.6 P@1, 0.80 RS). The poorer performance on D2W compared to W2D might be due to language models being better able to distinguish different definitions than individual words, since definitions are more informative than individual words.
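The rank score defined in the Measures section is a one-line function; the sketch below (names are ours, not the paper's) makes its normalization explicit.

```python
def rank_score(num_candidates, correct_rank):
    """RS(L, k) = (L - k) / (L - 1): 1.0 when the correct item is ranked
    first, 0.0 when it is ranked last, and 0.5 in expectation for a
    uniformly random ranker, regardless of the group size L."""
    assert 1 <= correct_rank <= num_candidates and num_candidates > 1
    return (num_candidates - correct_rank) / (num_candidates - 1)
```

Because the expectation of RS under random ranking is 0.5 for any L, RS is comparable across synset groups of very different sizes, unlike P@1.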
Overall, GPT-2 models perform better than masked language models (with the exception of RoBERTa large for verbs), despite using a single pattern as opposed to the multiple patterns used by the masked language models. This might indicate that the ALM objective is better at learning word meaning than the MLM objective.

To investigate the effect of frequency, we stratify words into rare (fewer than 10 occurrences), medium (10 to 99 occurrences) and frequent (100 or more occurrences), based on occurrences in WWC (Westbury Wikipedia Corpus, Shaoul (2010)). (Targets that have more than 3 tokens, based on NLTK tokenization, are taken as rare without counting.)
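The frequency stratification just described can be written as a small helper; the function name and signature are ours, but the thresholds and the more-than-3-tokens rule come directly from the paper.

```python
def freq_bucket(count, n_tokens=1):
    """Frequency stratum of a target word based on its WWC occurrence count.
    Targets with more than 3 NLTK tokens are treated as rare without
    counting, as stated in the paper."""
    if n_tokens > 3:
        return "rare"
    if count < 10:
        return "rare"
    if count < 100:
        return "medium"
    return "frequent"
```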
Table 4: P@1 and rank score (RS) on W2D (rows: BERT base/large, RoBERTa base/large, GPT-2 s/m/l/xl; columns: Noun and Verb P@1/RS; numeric values lost in extraction).
Table 5: P@1 and rank score (RS) on D2W (rows: BERT base/large, RoBERTa base/large, GPT-2 s/m/l/xl; columns: Noun and Verb P@1/RS; numeric values lost in extraction).

We use WWC frequency as a substitute for frequency in the models' training corpora. We focus on nouns since most verbs in our dataset are relatively frequent. Table 7 shows that, for W2D, all models have a poor understanding of the meaning of rare and medium-frequency words. (See appx. for D2W results.) Even for frequent words, P@1 is never above 55.

We additionally break down the results based on the depth of the synsets in the WordNet hierarchy. Specifically, we investigate the performance of the GPT-2 xl model on W2D for WordNet nouns, where we take the depth of a synset group as the length of the shortest path from the target synset to the root synset (i.e., entity.n.01). Table 6 shows that performance drops steadily as we go deeper in the hierarchy. Lower levels of the WordNet hierarchy contain many scientific terms and names of (sub)species such as types of cattle (e.g., cattalo, hereford, galloway). These results suggest that even very large LMs lack the knowledge necessary to distinguish these terms.

Table 6: RS and P@1 results for GPT-2 xl on W2D for nouns from different depths of the WordNet hierarchy (numeric values lost in extraction).

Table 7: P@1 scores on W2D for nouns of different frequency ranges (rows: BERT base/large, RoBERTa base/large, GPT-2 s/m/l/xl; columns: rare, medium, frequent, all; numeric values lost in extraction).
Analysis.
The correct definition of the medium-frequency verb 'beckon' is 'signal with the hands or nod'. GPT-2 xl predicts 'signal by winking'. The correct definition of the frequent noun 'roleplaying' is 'acting a particular role (as in psychotherapy)'; GPT-2 xl predicts 'acting the part of a character on stage'. So GPT-2 xl understands that beckoning is signaling and that roleplaying is acting, but it has not learned to distinguish between different types of signaling and acting. This points to an important future goal for LMs: they should be developed to gain an understanding of words that goes beyond the current superficial state of the art.

Human performance on WDLMPro.
It is beyond the scope of this paper to evaluate human performance on the entirety of WDLMPro. However, we provide a comparison with human performance on a small subset to give an intuition about the difficulty of the task. For each of the two tasks, 20 synset groups that have a maximum of 10 candidates are randomly sampled from WDLMPro. Then two native English speakers are asked to rank the candidates. Table 8 displays the average performance of the human participants and the language models on this subset. For both tasks, the performance of the best model is comparable to the
average human performance.

Table 8: LM and human performance on 20 randomly sampled synset groups from WDLMPro (rows: BERT base/large, RoBERTa base/large, GPT-2 s/m/l/xl; columns: W2D and D2W P@1/RS; numeric values lost in extraction).

Human performance is the upper bound for many NLP tasks. We believe that this is not the case for WDLMPro: arguably, we should aim for models with an excellent understanding of the meanings of words even if it is better than average human understanding. Knowledge-based tasks are an analogous case: we should strive for models that know as many facts as possible even if that performance is above average human performance.
We introduced WDLMPro, a probing test that helps analyze how well a model understands word meaning. WDLMPro is complementary to existing probing tests that are about relations between words or entities. We evaluated three popular pretrained language models on the W2D (word to definition) and D2W (definition to word) tasks. Our findings show that, despite their remarkable performance on many downstream tasks, these models struggle to match a word and its true definition, suggesting an insufficient understanding of word meaning. The relatively poor performance of these powerful models on WDLMPro can be seen as evidence for the limitations of purely distributional systems and the need for incorporating external knowledge. WDLMPro provides an important evaluation benchmark, encouraging the design and training of models with precise word understanding.
Acknowledgements.
We thank Denis Peskov and Sander Schulhoff for helping out with the human evaluation and the anonymous reviewers for their insightful comments and suggestions. This work was funded by the European Research Council (ERC).

References
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. arXiv preprint arXiv:2005.14165.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Anna Gladkova, Aleksandr Drozd, and Satoshi Matsuoka. 2016. Analogy-based detection of morphological and semantic relations with word embeddings: what works and what doesn't. In Proceedings of the NAACL Student Research Workshop, pages 8–15, San Diego, California. Association for Computational Linguistics.

Felix Hill, Roi Reichart, and Anna Korhonen. 2015. SimLex-999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics, 41(4):665–695.

Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 328–339, Melbourne, Australia. Association for Computational Linguistics.

Alessandro Lenci. 2008. Distributional semantics in linguistic and cognitive research. Italian Journal of Linguistics, 20(1):1–31.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Tomas Mikolov, Greg Corrado, Kai Chen, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. In Proceedings of the International Conference on Learning Representations (ICLR), pages 1–12.

Tomas Mikolov, Edouard Grave, Piotr Bojanowski, Christian Puhrsch, and Armand Joulin. 2018. Advances in pre-training distributed word representations. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018).

George A. Miller. 1995. WordNet: a lexical database for English. Communications of the ACM, 38(11):39–41.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018a. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics.

Matthew Peters, Mark Neumann, Luke Zettlemoyer, and Wen-tau Yih. 2018b. Dissecting contextual word embeddings: Architecture and representation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1499–1509, Brussels, Belgium. Association for Computational Linguistics.

Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander Miller. 2019. Language models as knowledge bases? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2463–2473, Hong Kong, China. Association for Computational Linguistics.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. Technical report.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training.

Timo Schick and Hinrich Schütze. 2020. Rare words: A major problem for contextualized embeddings and how to fix it by attentive mimicking. Proceedings of the AAAI Conference on Artificial Intelligence, 34:8766–8774.

Cyrus Shaoul and Chris Westbury. 2010. The Westbury Lab Wikipedia corpus.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Wenhan Xiong, Jingfei Du, William Yang Wang, and Veselin Stoyanov. 2020. Pretrained encyclopedia: Weakly supervised knowledge-pretrained language model. In International Conference on Learning Representations.

Appendix

synset         definition
beckon.v.01    signal with the hands or nod
applaud.v.01   clap one's hands or shout after performances to indicate approval
bow.v.01       bend one's knee or body, or lower one's head
shrug.v.01     raise one's shoulders to indicate indifference or resignation
exsert.v.01    thrust or extend out
wink.v.01      signal by winking
nod.v.01       express or signify by nodding
Table 9: Seven candidates of G(t) for t = beckon.v.01 and their definitions.

P@1 scores on D2W for nouns of different frequency ranges (the BERT and RoBERTa rows were lost in extraction):

Model      rare   medium   frequent   all
GPT-2 xl   19.3   24.8     36.3       29.8
Random     6.7    7.1      8.3        7.6