CLiMP: A Benchmark for Chinese Language Model Evaluation

Beilei Xiang, Changbing Yang, Yu Li, Alex Warstadt, and Katharina Kann
University of Colorado Boulder, New York University
{beilei.xiang, changbing.yang, yuli9309}@colorado.edu

Abstract
Linguistically informed analyses of language models (LMs) contribute to the understanding and improvement of these models. Here, we introduce the corpus of Chinese linguistic minimal pairs (CLiMP), which can be used to investigate what knowledge Chinese LMs acquire. CLiMP consists of sets of 1,000 minimal pairs (MPs) for 16 syntactic contrasts in Mandarin, covering 9 major Mandarin linguistic phenomena. The MPs are semi-automatically generated, and human agreement with the labels in CLiMP is 95.8%. We evaluate 11 different LMs on CLiMP, covering n-grams, LSTMs, and Chinese BERT. We find that classifier–noun agreement and verb complement selection are the phenomena that models generally perform best at. However, models struggle the most with the bǎ construction, binding, and filler-gap dependencies. Overall, Chinese BERT achieves an 81.8% average accuracy, while the performances of LSTMs and 5-grams are only moderately above chance level.

Introduction

Language models (LMs) are crucial parts of natural language processing (NLP) systems for a large variety of tasks, including summarization, machine translation, and dialog generation. More recently, they have become popular in the form of pretrained models, which are then fine-tuned on downstream tasks and often obtain state-of-the-art performance (Peters et al., 2018; Devlin et al., 2019; Conneau et al., 2020). However, which linguistic phenomena language models can or cannot learn is still poorly understood for many languages. (Throughout this paper, we adopt a broad definition of LMs, which includes language representation models that have been trained on a masked language modeling objective.) Resources for the syntactic evaluation of LMs, such as BLiMP (Warstadt et al., 2020), have focused mainly on English, and non-English resources currently only cover a small set of phenomena (Mueller et al., 2020; Gulordava et al., 2018; Ravfogel et al., 2018).
In order to spur the analysis and subsequent improvement of LMs in Chinese, we introduce the corpus of Chinese linguistic minimal pairs (CLiMP), which can be used to evaluate LMs' knowledge of Chinese grammar. CLiMP consists of 16 individual datasets that are semi-automatically generated from grammar templates. Each set, or paradigm, contains 1,000 minimal pairs (MPs). Together, they cover 9 core linguistic phenomena in Chinese. Human agreement on this corpus is 95.8%, confirming that CLiMP represents robust contrasts in Chinese grammar. High performance on CLiMP thus implies high correlation with human acceptability judgments across these phenomena.

We use CLiMP to study Chinese BERT (Devlin et al., 2019), 6 LSTM LMs, and 4 5-gram LMs. We evaluate for each MP whether the LM assigns a higher probability to the grammatical or the ungrammatical sentence. Our results show that Chinese BERT is closest to human performance, achieving an 81.8% accuracy on average over all phenomena, while the performances of LSTMs and 5-grams, regardless of the training data size, are only moderately above chance level. Classifier–noun agreement and verb complement selection are the phenomena that models generally perform best at, suggesting that Chinese LMs are better at acquiring knowledge of local selectional restrictions. The bǎ construction, binding, and filler-gap dependencies are the phenomena models have the most difficulties with. This indicates that they struggle to learn hierarchical syntax and to identify long-distance dependencies. (Chinese BERT: https://github.com/google-research/bert/blob/master/multilingual.md)

Related Work
LMs assign probabilities to sequences of words (Jurafsky and Martin, 2009). Recently, they have become commonly used as pretrained models, which can be fine-tuned for downstream NLP tasks (Peters et al., 2018; Devlin et al., 2019; Conneau et al., 2020). Strictly speaking, LMs compute the probabilities of words based only on past context. BERT (Devlin et al., 2019), however, is trained using a masked language modeling objective: it predicts words based on past and future tokens. Wang and Cho (2019) show that BERT is a Markov random field language model that can assign sentences a pseudo-log-likelihood score, which is computed by summing the conditional log probabilities of all tokens in the sentence, as well as generate text. Shin et al. (2019) and Salazar et al. (2020) apply pseudo-log-likelihood scores to sentence ranking and LM evaluation.
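As an illustration, pseudo-log-likelihood scoring amounts to masking each position in turn and summing the conditional log-probabilities of the true tokens. The sketch below is a minimal, model-agnostic version; `toy_masked_logprob` and all names here are hypothetical stand-ins for a real masked LM such as BERT, not the paper's implementation.

```python
import math

def pseudo_log_likelihood(tokens, masked_logprob):
    """Sum the conditional log-probability of each token with that position
    masked, in the spirit of Wang and Cho (2019) / Salazar et al. (2020)."""
    total = 0.0
    for i, tok in enumerate(tokens):
        # Replace position i with a mask symbol and score the true token there.
        context = tokens[:i] + ["[MASK]"] + tokens[i + 1:]
        total += masked_logprob(context, i, tok)
    return total

def toy_masked_logprob(context, i, token):
    # Toy "masked LM": a fixed unigram distribution, just to make the
    # scoring loop concrete and runnable without a pretrained model.
    vocab = {"我": 0.5, "他": 0.3, "走": 0.15, "跑": 0.05}
    return math.log(vocab.get(token, 0.01))

score = pseudo_log_likelihood(["我", "走"], toy_masked_logprob)
```

With a real masked LM, `masked_logprob` would run one forward pass per masked position and read the softmax probability of the original token.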
Numerous methods exist for probing syntactic knowledge of neural network models in English (Hewitt and Manning, 2019; Tenney et al., 2019), and a growing body of work evaluates the syntactic knowledge of neural models by testing whether they can judge the grammatical acceptability of sentences. One common version of this task uses MPs to evaluate LMs' linguistic knowledge (Linzen et al., 2016; Marvin and Linzen, 2018; Warstadt et al., 2020; Wilcox et al., 2018).

A MP is a pair of sentences that only differ in acceptability due to a single edit, as in (1) and (2). Native speakers can be asked to choose which sentence in each pair sounds more grammatical. Semi-automatically generating MPs can yield a larger set of controlled sentences, providing sufficient data for model evaluation (Linzen et al., 2016; Marvin and Linzen, 2018; Ettinger et al., 2018).

(1) 王鑫 把 自行车 扔 了。
    Wángxīn bǎ zìxíngchē rēng le
    SUBJ. BA. OBJ. V. PST.
    "Xin Wang threw away a bike."

(2) 王鑫 被 自行车 扔 了。
    Wángxīn bèi zìxíngchē rēng le
    SUBJ. PASS. OBJ. V. PST.
    "Xin Wang was thrown away by a bike."

It is possible to model acceptability in a totally unsupervised way using LMs. The model assigns a probability to each sentence in a MP; the one with the higher score is predicted as correct, and the model's predictions can be evaluated against human judgments (Marvin and Linzen, 2018; Warstadt et al., 2020). Supervised approaches are also possible (Warstadt et al., 2019), but can be less informative about LMs' linguistic knowledge acquisition due to the bias introduced by training on acceptability judgment labels. Some prior work evaluates the linguistic knowledge of different non-English models (Ravfogel et al., 2018; Gulordava et al., 2018; Mueller et al., 2020).
However, these efforts focus mainly on subject–verb agreement, which is absent in Chinese, and the knowledge of Chinese LMs has not yet been explicitly studied. Finally, the linguistic abilities of English BERT have been investigated in a large body of prior work, e.g., Clark et al. (2019); Vig (2019); Hewitt and Manning (2019). We refer the reader to Rogers et al. (2021) for an overview.
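The unsupervised forced-choice evaluation described above reduces to a simple comparison per pair. A minimal sketch (the scores below are hypothetical log-probabilities, not results from the paper):

```python
def minimal_pair_accuracy(scored_pairs):
    """Fraction of minimal pairs where the LM assigns a strictly higher
    score to the grammatical sentence. Each item is a tuple
    (score_grammatical, score_ungrammatical)."""
    correct = sum(1 for good, bad in scored_pairs if good > bad)
    return correct / len(scored_pairs)

# Hypothetical log-probabilities for three minimal pairs:
# the model gets the first two right and the third wrong.
pairs = [(-12.3, -15.1), (-9.8, -9.9), (-20.0, -18.5)]
acc = minimal_pair_accuracy(pairs)
```

Because only the relative order within each pair matters, any monotonic sentence score (raw log-likelihood, pseudo-log-likelihood) can be plugged in unchanged.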
Our main contribution is CLiMP, a corpus of Chinese MPs designed to evaluate Chinese LMs. CLiMP consists of 1,000 MPs for each of 16 grammatical contrasts, covering 9 major Chinese linguistic phenomena. Example MPs for each phenomenon are shown in Table 1.
We generate data from grammar templates for every paradigm we incorporate. Our templates set lexical, syntactic, and semantic constraints for each paradigm, aiming at building robust contrasts and keeping the sentence length the same within each MP. We then build an annotated vocabulary and generate sentences by sampling words from it. (1) and (2) show an MP together with the template used to create it. (The template example is only for demonstrative purposes; more information is encoded for the actual data generation.)

We translate Warstadt et al.'s (2020) English vocabulary, containing 3,000 English words with morphological, syntactic, and semantic annotations. We add words and features specific to Chinese linguistic phenomena to our vocabulary, including classifiers, verb complements, action verbs, and coverbs. Our final vocabulary contains 3,456 words and 84 features.

We show the frequency of words in CLiMP's vocabulary in the Chinese Internet Corpus in Figure 1; 1,055 of the words in CLiMP are within the 5,000 most frequent words in the Chinese Internet Corpus.

[Figure 1: Comparison of word frequencies in CLiMP and the Chinese Internet Corpus.]
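The template-based generation described above can be sketched as follows. This is a deliberately tiny, hypothetical vocabulary and a single bǎ/bèi template in the style of examples (1) and (2); the real CLiMP vocabulary has 3,456 words and 84 features, and its templates encode richer constraints.

```python
import random

# Hypothetical mini-vocabulary with one relevant annotation: verbs
# listed under "ba_verb" are transitive verbs compatible with bǎ.
VOCAB = {
    "name":    ["王鑫", "李文清"],
    "ba_verb": ["扔", "卖"],
    "object":  ["自行车", "推车"],
}

def generate_ba_pair(rng):
    """Instantiate the bǎ-construction template:
       acceptable   = NAME 把 OBJECT VERB 了。
       unacceptable = NAME 被 OBJECT VERB 了。 (passive marker swapped in)
    Both sentences share all sampled words, so they have identical length
    and differ in exactly one character."""
    subj = rng.choice(VOCAB["name"])
    obj = rng.choice(VOCAB["object"])
    verb = rng.choice(VOCAB["ba_verb"])
    good = f"{subj}把{obj}{verb}了。"
    bad = f"{subj}被{obj}{verb}了。"
    return good, bad

good, bad = generate_ba_pair(random.Random(0))
```

Repeating the sampling 1,000 times (with duplicate filtering) would yield one paradigm-sized dataset.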
CLiMP covers 9 major linguistic phenomena in Mandarin Chinese, cf. Table 1. They are picked from a comprehensive Chinese grammar book by Po-Ching and Rimmington (2015). Following Po-Ching and Rimmington's discussion, we now explain the phenomena not present in English.

The bǎ construction is an SOV construction involving the particle bǎ, which precedes the object and moves the object to a position before the main verb. It is only grammatical with a subset of transitive verbs.
Coverbs are verb-like items that precede the main verb in a serial verb construction. They almost invariably have to be used in conjunction with other verbs in a sentence. They share some properties with prepositions, but are not syntactically interchangeable with them.
Classifiers obligatorily appear with nouns when those are modified by numerals or adjectives. Mandarin has dozens of classifiers, and nouns select the classifier they combine with.
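Because each noun selects its classifier, classifier–noun MPs can be generated by swapping in a mismatched classifier. A minimal sketch, with a hypothetical three-noun mapping (the real CLiMP vocabulary annotates classifier selection for all nouns):

```python
# Hypothetical noun -> selected-classifier mapping.
CLASSIFIER = {"画廊": "家", "卡车": "辆", "文章": "篇"}
ALL_CLASSIFIERS = sorted(set(CLASSIFIER.values()) | {"段", "只"})

def classifier_pair(noun):
    """Return an (acceptable, unacceptable) pair of the form '一 CL noun':
    the first uses the noun's selected classifier, the second deliberately
    substitutes a classifier the noun does not select."""
    good_cl = CLASSIFIER[noun]
    bad_cl = next(c for c in ALL_CLASSIFIERS if c != good_cl)
    return f"一{good_cl}{noun}", f"一{bad_cl}{noun}"

good, bad = classifier_pair("画廊")
```

As in the bǎ paradigm, both sentences have the same length and differ in a single character, so the contrast isolates classifier selection.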
Verb complements follow a verb, often expressing a result or manner of an event. Not all verbs can be used with all complements, making certain combinations ungrammatical.
NP head finality is present in Mandarin noun phrases: the relative clause precedes the noun it modifies.
To verify whether the MPs in our dataset show clear contrasts, we conduct two rounds of human validation with 22 annotators. They are all native speakers of Chinese, 14 female and 8 male, whose ages range from 20 to 48. All of them have at least a high school degree. (Chinese Internet Corpus frequency list: http://corpus.leeds.ac.uk/frqc/internet-zh.num)

In our first human validation, each annotator is assigned a subset (100 MPs) of a paradigm. We let them perform the same forced-choice task as our models: decide for each MP which sentence seems more acceptable. We discard one paradigm, the coverb-direction paradigm, after this validation, because its human validation accuracy is too low. The average human agreement for the remaining paradigms is 95.8%.

In the second human validation, we sample 15 MPs from each of the remaining paradigms, resulting in a dataset consisting of 240 MPs. 16 annotators complete the same forced-choice task on this dataset. We count a MP as valid if more than half of the annotators agree with its label. The human agreement on this dataset is 97.1%, showing that our data creation results in valid examples.

BLiMP consists of 67 datasets, each containing 1,000 MPs, organized by phenomenon into 12 categories. CLiMP only contains 16 datasets, due to Mandarin Chinese's comparative lack of inflection. Three phenomena are covered by both corpora: anaphor agreement, binding, and filler-gap dependencies. The corresponding human agreement accuracies in CLiMP are 94.5%, 99.0%, and 100.0%, respectively (cf. Table 2). The overall human agreement for BLiMP is lower than the 95.8% we observe for CLiMP.

We use accuracy for evaluation. A MP in CLiMP is classified correctly if a LM assigns a higher probability to the grammatical sentence than to the ungrammatical one. We evaluate statistical and neural LMs, including masked LMs. Corpora containing 0.4M, 2M, and 21.5M sentences are used to explore the effect of training data size. We also investigate the effect of different tokenizations.

Chinese BERT
BERT (Devlin et al., 2019) is a transformer-based neural model (Vaswani et al., 2017). Here, we evaluate Chinese BERT. (We use character tokenization and word tokenization, https://github.com/fxsjy/jieba.) This model has 12 layers, 768 hidden units, 12 attention heads, and 110M parameters; its training dataset contains 25M sentences. We assign probabilities to sentences with this model by masking the words in a sentence one by one, computing the probability of each masked word, and, finally, multiplying the probabilities of all words (Wang and Cho, 2019; Salazar et al., 2020).

Table 1: Nine Chinese linguistic phenomena covered by CLiMP with acceptable and unacceptable sentence examples. Each example shows the sentence, a gloss, and an English translation. N is the number of paradigms (each with 1,000 examples) per phenomenon.

Anaphor agreement (N=1)
  Acceptable:   王玉珍 震惊了 她自己。 Jane.F shock-PST herself. 'Jane shocked herself.'
  Unacceptable: 王玉珍 震惊了 他自己。 Jane.F shock-PST himself. 'Jane shocked himself.'

Binding (N=1)
  Acceptable:   杨颖 治疗 吴宇涛 之后 佩服过 她自己。 Yang.F cure Wu.M after admire-PST herself. 'Yang admired herself after she cured Wu.'
  Unacceptable: 杨颖 治疗 吴宇涛 之后 佩服过 他自己。 Yang.F cure Wu.M after admire-PST himself. 'Yang admired himself after she cured Wu.'

bǎ construction (N=1)
  Acceptable:   王鑫 把 自行车 扔 了。 Wong.M BA bike throw PST. 'Wong threw away the bike.'
  Unacceptable: 王鑫 被 自行车 扔 了。 Wong.M PASS bike throw PST. 'Wong was thrown away by the bike.'

Coverb (N=3)
  Acceptable:   李文清 乘 卡车 到达了 咖啡店。 Lee.M ride truck arrive-PST coffee shop. 'Lee went to the coffee shop by truck.'
  Unacceptable: 李文清 于 卡车 到达了 咖啡店。 Lee.M at truck arrive-PST coffee shop. 'Lee went to the coffee shop at truck.'

NP head finality (N=1)
  Acceptable:   王梦 正在 卖 张红梅 清洗过的 推车。 Wong.F PROG sell May.F clean-PRF-ADJ trolley. 'Wong is selling the trolley that May has cleaned.'
  Unacceptable: 王梦 正在 卖 推车 张红梅 清洗过的。 Wong.F PROG sell trolley May.F clean-PRF-ADJ. (intended: 'Wong is selling the trolley that May has cleaned.')

Classifier (N=2)
  Acceptable:   张杰 正在 穿过 一家 艺术画廊。 Jay.M PROG pass one CL:INSTITUTION art gallery. 'Jay is passing through an art gallery.'
  Unacceptable: 张杰 正在 穿过 一段 艺术画廊。 Jay.M PROG pass one CL:LENGTH art gallery. (intended: 'Jay is passing through an art gallery.')

Filler gap (N=1)
  Acceptable:   图书馆，我 开车 去过 这个地方。 The library, I drive to-PRF this place. 'The library, I have driven to this place.'
  Unacceptable: 图书馆，我 开车 去过 博物馆。 The library, I drive to-PRF the museum. 'The library, I have driven to the museum.'

Passive (N=1)
  Acceptable:   这些 患者 被 转移了。 These patient PASS transfer-PST. 'These patients were transferred.'
  Unacceptable: 这些 患者 被 下降了。 These patient PASS fall-PST. 'These patients were fell.'

Verb complement (N=5)
  Acceptable:   王慧 的 文章 吓坏了 包曼玉。 Wong.F POSS article frighten badly PST Bao.F. 'Wong's article frightened Bao badly.'
  Unacceptable: 王慧 的 文章 吓开了 包曼玉。 Wong.F POSS article frighten openly PST Bao.F. 'Wong's article frightened Bao openly.'

LSTM LMs
We further evaluate 6 LSTM (Hochreiter and Schmidhuber, 1997) LMs. These models have 2 layers and 200 hidden units. We train them using PyTorch's word-level language model example code (https://github.com/pytorch/examples/tree/master/word_language_model) on 3 differently sized Chinese Wikipedia corpora: 0.4M, 2M, and 21.5M sentences. We further compare word-level and character-level models (cf. Table 2). For evaluation, we employ code adapted by Warstadt et al. (2020) from Gulordava et al. (2018) (https://github.com/sheng-fu/colorlessgreenRNNs); sentence probabilities under BERT are computed with https://github.com/xu-song/bert-as-language-model.

n-gram LMs Finally, we experiment with 4 different 5-gram LMs, which have been trained on 0.4M and 2M sentences from Chinese Wikipedia. For each corpus size, we train one word-based and one character-based LM. These models are implemented using KenLM (https://kheafield.com/code/kenlm/). All results are shown in Table 2.
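To make the n-gram baseline concrete, here is a self-contained character-level n-gram LM with add-one smoothing. This is a simplified, hypothetical stand-in for the KenLM models used in the paper (KenLM uses modified Kneser–Ney smoothing, not add-one), shown only to illustrate how such a model scores a minimal pair.

```python
import math
from collections import Counter

class CharNGramLM:
    """Add-one-smoothed character n-gram language model."""

    def __init__(self, n, corpus):
        self.n = n
        self.ngrams = Counter()    # counts of full n-grams
        self.contexts = Counter()  # counts of (n-1)-character contexts
        self.vocab = set()
        for sent in corpus:
            chars = ["<s>"] * (n - 1) + list(sent) + ["</s>"]
            self.vocab.update(chars)
            for i in range(n - 1, len(chars)):
                ctx = tuple(chars[i - n + 1:i])
                self.ngrams[ctx + (chars[i],)] += 1
                self.contexts[ctx] += 1

    def logprob(self, sent):
        chars = ["<s>"] * (self.n - 1) + list(sent) + ["</s>"]
        total = 0.0
        V = len(self.vocab)
        for i in range(self.n - 1, len(chars)):
            ctx = tuple(chars[i - self.n + 1:i])
            num = self.ngrams[ctx + (chars[i],)] + 1   # add-one smoothing
            den = self.contexts[ctx] + V
            total += math.log(num / den)
        return total

# Toy training corpus; in the paper the corpora are 0.4M-2M sentences.
lm = CharNGramLM(2, ["我走了", "他走了", "我跑了"])
# Forced choice: the attested word order should outscore a scrambled one.
better = lm.logprob("我走了") > lm.logprob("走我了")
```

Scoring both sentences of every MP with `logprob` and applying the forced-choice rule yields the accuracies reported in Table 2.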
Phenomenon-specific Results
Our LMs perform best on classifier–noun agreement and verb complement selection: Chinese BERT's accuracy is only slightly lower than that of humans on these two phenomena. LSTMs and 5-grams remain well behind humans, but still perform better on these phenomena than on others in CLiMP. This indicates that Chinese LMs acquire local selection knowledge better than the linguistic knowledge needed to master other phenomena.

Our LMs struggle most with the bǎ construction, binding, and filler-gap dependencies. All models perform close to chance level for binding, suggesting that they lack the hierarchical knowledge necessary to correctly resolve the structural relationship between a reflexive and its binder. Similarly, most models perform near chance on filler-gap dependencies. This suggests that they do not robustly represent long-distance dependencies. (A caveat applies: because Mandarin lacks wh-movement, we test filler-gap dependencies using a topicalization construction more common in speech, and less likely to appear in the training corpora.)

Table 2: Percentage accuracy of all humans and models on CLiMP. Random guessing yields an accuracy of 50%. Numbers in model names (21.5, 2, 0.4) refer to the number of sentences (in millions) in the training corpus.

Model             | Overall Clsfr. V.Cp. Hd.Fi. bǎ    Coverb Ana.Agr. Pass. Bind. Fi.Gap
Human             | 95.8    99.7   96.0  100.0  85.0  92.5   94.5     91.0  99.0  100.0
Chinese BERT      | 81.8    92.9   …
5-gram-2M-word    | 59.0    70.1   71.0  55.2   15.6  39.2   67.7     …
5-gram-2M-char    | 65.7    70.6   …
5-gram-0.4M-word  | 55.9    66.4   69.5  46.3   6.0   37.0   69.1     …
5-gram-0.4M-char  | 60.0    …
On the head-final construction, Chinese BERT performs surprisingly poorly as compared to the other models, falling below the LSTMs' average accuracy. The coverb construction, in contrast, is easy for Chinese BERT: it clearly outperforms all other models there.

Model-specific Results
Comparing across mod-els, Chinese BERT achieves by far the highest over-all accuracy with 81.8%. Our different LSTMsall perform worse, but obtain surprisingly similarscores: from . to . . The performancesof our -grams range from . to . . Keep-ing tokenization and corpus size constant, three outof four -grams are outperformed by LSTMs. Thus,we overall find that neural models have advantagesas compared to statistical models.Comparing among the LSTMs, we find similarlyto Hu et al. (2020) that the corpus size does nothave much influence on the overall performance,with the caveat that these models perform closeto chance. In contrast, a larger corpus size doesresult in a better performance in -grams. Wealso compare the effect of different tokenizations:Character-based -grams demonstrate better per-formance than word-based ones. For LSTMs, how-ever, using characters only results in a better per-formance for our smallest corpus size (0.4M).Compared to English LMs (Warstadt et al.,2020), the human–model gap is much bigger forChinese models. While neither models nor datasetsare directly comparable between our and previouswork, this still suggests that more analyses anddevelopments are needed for non-English models. tion more common in speech, and less likely to appear in thetraining corpora. We introduced CLiMP, a suite of diagnostic testsets aimed at evaluating which syntactic phenom-ena Chinese LMs learn, and used it to evaluate11 different models. All LMs appeared to havelearned local selectional restrictions, but struggledwith argument structure alternations, hierarchicalstructure, and long-distance dependencies. Chi-nese BERT performed best on CLiMP overall.However, it obtained a 14% lower accuracy thanhumans, suggesting there is still much room forimprovement. We hope that CLiMP will serveas a linguistically informed resource for bench-marking and analyzing future progress on Chi-nese LMs. CLiMP is available at https://nala-cub.github.io/resources . 
We would like to thank the students from CU Boulder's CSCI/LING 5832 in Spring 2020 for their feedback on this research. We are also grateful for the feedback of the anonymous reviewers.
References
Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D. Manning. 2019. What does BERT look at? An analysis of BERT's attention. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 276–286, Florence, Italy. Association for Computational Linguistics.

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440–8451, Online. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Allyson Ettinger, Ahmed Elgohary, Colin Phillips, and Philip Resnik. 2018. Assessing composition in sentence vector representations. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1790–1801, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

Kristina Gulordava, Piotr Bojanowski, Edouard Grave, Tal Linzen, and Marco Baroni. 2018. Colorless green recurrent networks dream hierarchically. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1195–1205, New Orleans, Louisiana. Association for Computational Linguistics.

John Hewitt and Christopher D. Manning. 2019. A structural probe for finding syntax in word representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4129–4138, Minneapolis, Minnesota. Association for Computational Linguistics.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Jennifer Hu, Jon Gauthier, Peng Qian, Ethan Wilcox, and Roger Levy. 2020. A systematic assessment of syntactic generalization in neural language models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1725–1744, Online. Association for Computational Linguistics.

Dan Jurafsky and James H. Martin. 2009. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Pearson Prentice Hall, Upper Saddle River, NJ.

Tal Linzen, Emmanuel Dupoux, and Yoav Goldberg. 2016. Assessing the ability of LSTMs to learn syntax-sensitive dependencies. Transactions of the Association for Computational Linguistics, 4:521–535.

Rebecca Marvin and Tal Linzen. 2018. Targeted syntactic evaluation of language models. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1192–1202, Brussels, Belgium. Association for Computational Linguistics.

Aaron Mueller, Garrett Nicolai, Panayiota Petrou-Zeniou, Natalia Talmina, and Tal Linzen. 2020. Cross-linguistic syntactic evaluation of word prediction models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5523–5539, Online. Association for Computational Linguistics.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics.

Yip Po-Ching and Don Rimmington. 2015. Chinese: A Comprehensive Grammar. Routledge.

Shauli Ravfogel, Yoav Goldberg, and Francis Tyers. 2018. Can LSTM learn to capture agreement? The case of Basque. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 98–107, Brussels, Belgium. Association for Computational Linguistics.

Anna Rogers, Olga Kovaleva, and Anna Rumshisky. 2021. A primer in BERTology: What we know about how BERT works. Transactions of the Association for Computational Linguistics, 8:842–866.

Julian Salazar, Davis Liang, Toan Q. Nguyen, and Katrin Kirchhoff. 2020. Masked language model scoring. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2699–2712, Online. Association for Computational Linguistics.

Joonbo Shin, Yoonhyung Lee, and Kyomin Jung. 2019. Effective sentence scoring method using BERT for speech recognition. In Asian Conference on Machine Learning, pages 1081–1093.

Ian Tenney, Patrick Xia, Berlin Chen, Alex Wang, Adam Poliak, R. Thomas McCoy, Najoung Kim, Benjamin Van Durme, Sam Bowman, Dipanjan Das, and Ellie Pavlick. 2019. What do you learn from context? Probing for sentence structure in contextualized word representations. In International Conference on Learning Representations.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30, pages 5998–6008. Curran Associates, Inc.

Jesse Vig. 2019. A multiscale visualization of attention in the transformer model. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 37–42, Florence, Italy. Association for Computational Linguistics.

Alex Wang and Kyunghyun Cho. 2019. BERT has a mouth, and it must speak: BERT as a Markov random field language model. In Proceedings of the Workshop on Methods for Optimizing and Evaluating Neural Language Generation, pages 30–36, Minneapolis, Minnesota. Association for Computational Linguistics.

Alex Warstadt, Alicia Parrish, Haokun Liu, Anhad Mohananey, Wei Peng, Sheng-Fu Wang, and Samuel Bowman. 2020. BLiMP: The benchmark of linguistic minimal pairs for English. Transactions of the Association for Computational Linguistics, 8:377–392.

Alex Warstadt, Amanpreet Singh, and Samuel R. Bowman. 2019. Neural network acceptability judgments. Transactions of the Association for Computational Linguistics, 7:625–641.

Ethan Wilcox, Roger Levy, Takashi Morita, and Richard Futrell. 2018. What do RNN language models learn about filler–gap dependencies? In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, Brussels, Belgium. Association for Computational Linguistics.