Evaluating Contextualized Language Models for Hungarian
Judit Ács, Dániel Lévai, Dávid Márk Nemeskey, András Kornai
Department of Automation and Applied Informatics, Budapest University of Technology and Economics; Institute for Computer Science and Control; Alfréd Rényi Institute of Mathematics
Abstract.
We present an extended comparison of contextualized language models for Hungarian. We compare huBERT, a Hungarian model, against 4 multilingual models including the multilingual BERT model. We evaluate these models through three tasks: morphological probing, POS tagging and NER. We find that huBERT works better than the other models, often by a large margin, particularly near the global optimum (typically at the middle layers). We also find that huBERT tends to generate fewer subwords per word and that using the last subword for token-level tasks is generally a better choice than using the first one.
Keywords: huBERT, BERT, evaluation
Introduction

Contextualized language models such as BERT (Devlin et al., 2019) drastically improved the state of the art for a multitude of natural language processing applications. Devlin et al. (2019) originally released 4 English and 2 multilingual pretrained versions of BERT (mBERT for short) that support over 100 languages including Hungarian. BERT was quickly followed by other large pretrained Transformer (Vaswani et al., 2017) based models such as RoBERTa (Liu et al., 2019b) and multilingual models with Hungarian support such as XLM-RoBERTa (Conneau et al., 2019). Huggingface released the Transformers library (Wolf et al., 2020), a PyTorch implementation of Transformer-based language models, along with a repository for pretrained models from community contributions (https://huggingface.co/models). This list now contains over 1000 entries, many of which are domain- or language-specific models.

Despite the wealth of multilingual and language-specific models, most evaluation methods are limited to English, especially for the early models. Devlin et al. (2019) showed that the original mBERT outperformed existing models on the XNLI dataset (Conneau et al., 2018b). mBERT was further evaluated by Wu and Dredze (2019) for 5 tasks in 39 languages, which they later expanded to over 50 languages for part-of-speech tagging, named entity recognition and dependency parsing (Wu and Dredze, 2020).

Nemeskey (2020) released the first BERT model for Hungarian, named huBERT, trained on Webcorpus 2.0 (Nemeskey, 2020, ch. 4). It uses the same architecture as BERT base, with 12 Transformer layers of 12 heads and 768 hidden dimensions each, for a total of 110M parameters. huBERT has a WordPiece vocabulary of 30k subwords.

In this paper we focus on evaluation for the Hungarian language. We compare huBERT against multilingual models using three tasks: morphological probing, POS tagging and NER. We show that huBERT outperforms all multilingual models, particularly in the lower layers, and often by a large margin. We also show that the subword tokens generated by huBERT's tokenizer are closer to Hungarian morphemes than the ones generated by the other models.

Approach

We evaluate the models through three tasks: morphological probing, POS tagging and NER. Hungarian has a rich inflectional morphology and largely free word order, and morphology plays a key role in parsing Hungarian sentences. We picked two token-level tasks, POS tagging and NER, for assessing the sentence-level behavior of the models. POS tagging is a common subtask of downstream NLP applications such as dependency parsing, named entity recognition and building knowledge graphs. Named entity recognition is indispensable for various high-level semantic applications.
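Throughout the paper, the pretrained models are accessed through the Transformers library mentioned above. The following is a minimal illustrative sketch, not the authors' released code: the model identifier is huBERT's id as given in the experimental setup, the example sentence and the use of the Auto* loader classes are our own choices.

```python
# Illustrative sketch: loading huBERT through Huggingface Transformers and
# extracting the hidden states of every layer for one sentence.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "SZTAKI-HLT/hubert-base-cc"  # huBERT's identifier in the Transformers hub

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID, output_hidden_states=True)
model.eval()

sentence = "A képviselők elfogadták a javaslatot."  # example sentence of our own
encoded = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**encoded)

# For a 12-layer BERT-base model this tuple has 13 elements: the embedding
# layer plus one entry per Transformer layer, each of shape
# (batch_size, num_subwords, 768).
print(len(outputs.hidden_states), outputs.hidden_states[-1].shape)
```

The multilingual models described later can be loaded the same way by swapping in their string identifiers.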
Morphological probing

Probing is a popular evaluation method for black-box models. Our approach is illustrated in Figure 1. The input of a probing classifier is a sentence and a target position (a token in the sentence). We feed the sentence to the contextualized model and extract the representation corresponding to the target token, using either a single Transformer layer of the model or the weighted average of all layers with learned weights. We train a small classifier on top of this representation that predicts a morphological tag. We expose the classifier to a limited amount of training data (2000 training and 200 validation instances). If the classifier performs well on unseen data, we conclude that the representation encodes the morphological information in question. We generate the data from the automatically tagged Webcorpus 2.0. The target words have no overlap between train, validation and test, and we limit class imbalance to 3-to-1, which required filtering out some rare values. The list of tasks we were able to generate is summarized in Table 1.

Fig. 1: Probing architecture. The input is tokenized into subwords and a weighted average of the mBERT layers, taken on the last subword of the target word, is used for classification by an MLP. Only the MLP parameters and the layer weights w_i are trained.
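A minimal sketch of such a probing classifier is given below. This is our own illustration rather than the authors' released code: the 50-neuron hidden layer and the 0.2 dropout follow the hyperparameters reported in the experimental setup, while the softmax normalization of the layer weights and the exact module layout are our assumptions.

```python
# Sketch of the probing classifier: a learned weighted average of all layers
# is taken at the target subword position and classified by a small MLP.
# Only the layer weights and the MLP are trained; the language model is frozen.
import torch
import torch.nn as nn

class ProbingClassifier(nn.Module):
    def __init__(self, num_layers: int = 13, hidden_dim: int = 768, num_labels: int = 2):
        super().__init__()
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))  # one weight per layer
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, 50),   # 50 hidden neurons, as in the experimental setup
            nn.ReLU(),
            nn.Dropout(0.2),             # 0.2 dropout, as in the experimental setup
            nn.Linear(50, num_labels),
        )

    def forward(self, hidden_states, target_index):
        # hidden_states: tuple of (batch, seq_len, hidden_dim) tensors, one per layer.
        stacked = torch.stack(hidden_states)                 # (layers, batch, seq, dim)
        weights = torch.softmax(self.layer_weights, dim=0)   # normalized layer weights
        mixed = (weights[:, None, None, None] * stacked).sum(dim=0)  # (batch, seq, dim)
        # Select the representation of the target subword (e.g. the last subword
        # of the target word) for every sentence in the batch.
        batch_idx = torch.arange(mixed.size(0), device=mixed.device)
        return self.mlp(mixed[batch_idx, target_index])
```

Trained on top of a frozen encoder, such a module stays within the 40k to 60k trained parameters quoted in the experimental setup.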
POS tagging and NER

Our setup for the two sequence tagging tasks is similar to that of the morphological probes, except that we train a shared classifier on top of all token representations. Since multiple subwords may correspond to a single token (see the subword tokenization subsection for more details), we need to aggregate them in some manner: we pick either the first or the last subword. We also experimented with other pooling methods such as elementwise max and sum, but they did not make a significant difference.

We use two datasets for POS tagging. One is the Szeged Universal Dependencies Treebank (Farkas et al., 2012; Nivre et al., 2018) consisting of 910 train, 441 validation, and 449 test sentences. Our second dataset is a subsample of Webcorpus 2.0 tagged with emtsv (Indig et al., 2019) with 10,000 train, 2000 validation, and 2000 test sentences. Our architecture for NER is identical to the POS tagging setup. We train it on the Szeged NER corpus consisting of 8172 train, 503 validation, and 900 test sentences.
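The first/last subword pooling described above can be implemented with the word-to-subword alignment that the fast Huggingface tokenizers expose. The snippet below is our own sketch of this bookkeeping, not the authors' code; the example words are ours.

```python
# Sketch: find the first and last subword position of every word, so that a
# token-level classifier can be fed either the first- or the last-subword vector.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("SZTAKI-HLT/hubert-base-cc")

words = ["A", "képviselőkkel", "tárgyaltak", "."]   # our own example sentence
encoded = tokenizer(words, is_split_into_words=True, return_tensors="pt")
word_ids = encoded.word_ids()  # maps each subword position to its word (None = special token)

first_subword, last_subword = {}, {}
for position, word_id in enumerate(word_ids):
    if word_id is None:                      # skip [CLS] and [SEP]
        continue
    first_subword.setdefault(word_id, position)
    last_subword[word_id] = position

# first_subword[i] / last_subword[i] are the positions whose hidden vectors
# represent the i-th word under first- and last-subword pooling, respectively.
print(first_subword)
print(last_subword)
```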
Morph tag        POS
Case             NOUN
Degree           ADJ
Mood             VERB
Number[psor]     NOUN
Number           ADJ
Number           NOUN
Number           VERB
Person[psor]     NOUN
Person           VERB
Tense            VERB
VerbForm         VERB

Table 1. List of morphological probing tasks.
Experimental setup

We train all classifiers with identical hyperparameters. The classifiers have one hidden layer with 50 neurons and ReLU activation. The input and the output layers are determined by the choice of language model and the number of target labels. This results in 40k to 60k trained parameters, far fewer than the number of parameters in any of the language models. All models are trained using the Adam optimizer (Kingma and Ba, 2014). We use 0.2 dropout for regularization and early stopping based on the development set.

We evaluate 5 models.

huBERT, the Hungarian BERT, is a BERT-base model with 12 Transformer layers, 12 attention heads, a hidden dimension of 768, and a total of 110 million parameters. It was trained on Webcorpus 2.0 (Nemeskey, 2020), a 9-billion-token corpus compiled from the Hungarian subset of Common Crawl (https://commoncrawl.org/). Its string identifier in Huggingface Transformers is SZTAKI-HLT/hubert-base-cc.

mBERT is the cased version of the multilingual BERT. It is a BERT-base model with an architecture identical to huBERT. It was trained on the Wikipedias of the 104 largest Wikipedia languages. Its string id is bert-base-multilingual-cased.

XLM-RoBERTa is the multilingual version of RoBERTa. Architecturally, it is identical to BERT; the only difference lies in the training regimen. XLM-RoBERTa was trained on 2TB of Common Crawl data, and it supports 100 languages. Its string id is xlm-roberta-base.

XLM-MLM-100 is a larger variant of XLM-RoBERTa with 16 instead of 12 layers. Its string id is xlm-mlm-100-1280.

distil-mBERT is a distilled version of mBERT. It cuts the parameter budget and inference time by roughly 40% while retaining 97% of the teacher model's NLU capabilities. Its string id is distilbert-base-multilingual-cased.

Subword tokenization

Subword tokenization is a key component in achieving good performance on morphologically rich languages. Out of the 5 models we compare, huBERT, mBERT and DistilBERT use the WordPiece algorithm (Schuster and Nakajima, 2012), while XLM-RoBERTa and XLM-MLM-100 use the SentencePiece algorithm (Kudo and Richardson, 2018). The two types of tokenizers are algorithmically very similar; the differences between them depend mainly on the vocabulary size per language. The multilingual models cover about 100 languages, and the per-language vocabularies are (not linearly) proportional to the amount of training data available per language. Since huBERT is trained on monolingual data, it can retain less frequent subwords in its vocabulary, while mBERT, XLM-RoBERTa and XLM-MLM-100, being multilingual models, have token information from many languages, so we anticipate that huBERT is more faithful to Hungarian morphology. DistilBERT uses the tokenizer of mBERT, thus it is not included in this subsection.

                     huBERT   mBERT   RoBERTa   MLM-100   emtsv
Languages                 1     104       100       100       1
Vocabulary size         32k    120k      250k      200k       –
Entropy of first WP    8.99    6.64      6.33      7.56    8.26
Entropy of last WP     6.82    6.38      5.60      6.89    5.14
More than one WP      94.9%   96.9%     96.5%     97.0%   95.8%
Length in WP            2.8

Table 2.
Measures on the train data of the POS tasks. The length of the first and last WP is calculated in characters, while the word length is calculated in WPs. DistilBERT data is identical to mBERT.

As shown in Table 2, there is a gap between the Hungarian and the multilingual models in almost every measure. mBERT's shared vocabulary consists of only 120k subwords for all 100 languages, while huBERT's vocabulary contains 32k items and is exclusively Hungarian. Given the very limited inventory of mBERT, only the most frequent Hungarian words are represented as a single token, while longer Hungarian words are segmented, often very poorly. The average number of subwords a word is tokenized into is 2.77 in the case of huBERT, while all the other models have a significantly higher mean length. This does not pose a problem in itself, since the tokenizers work with a given dictionary size and frequent words need not be segmented into subwords. But for words built from rarer subwords, the limits of the multilingual models' small per-language vocabulary become apparent, as in the following example: szállítójárművekkel 'with transport vehicles' is tokenized as szállító-jármű-vek-kel 'transport-vehicle-PL-INS' by huBERT, but as sz-ál-lí-tó-já-rm-ű-vek-kel by mBERT, which found the affixes correctly (since affixes are high in frequency) but did not find the root 'transport vehicle'.

Fig. 2: Distribution of length in subwords vs. log frequency rank. The count of words for one subword length is proportional to the size of the respective violin.

Distributionally, huBERT shows a stronger Zipfian pattern than any other model, as shown in Figure 2. Frequency and subword length are in a linear relationship for the huBERT model, while for the other models the subword lengths do not seem to be correlated with the log frequency rank. The area of the violins also shows that words typically consist of more than 3 subwords for the multilingual models, whereas huBERT typically segments words into one or two subwords.
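The segmentation differences discussed above are easy to inspect directly. The sketch below is our own illustration: it tokenizes the example word with huBERT's tokenizer and two of the multilingual tokenizers via their Transformers identifiers; the exact output depends on the tokenizers as distributed.

```python
# Sketch: compare how the tokenizers discussed above segment a Hungarian word.
from transformers import AutoTokenizer

MODEL_IDS = [
    "SZTAKI-HLT/hubert-base-cc",        # huBERT (WordPiece)
    "bert-base-multilingual-cased",     # mBERT (WordPiece)
    "xlm-roberta-base",                 # XLM-RoBERTa (SentencePiece)
]

word = "szállítójárművekkel"   # 'with transport vehicles'
for model_id in MODEL_IDS:
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    pieces = tokenizer.tokenize(word)
    print(f"{model_id}: {len(pieces)} subwords {pieces}")
```

For huBERT the expected segmentation is the morpheme-like one quoted above, while mBERT falls back to much shorter, mostly meaningless pieces.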
Results

We find that huBERT outperforms all models in all tasks, often by a large margin, particularly in the lower layers. As for the choice of subword pooling (first or last) and the choice of layer, we note some trends in the following subsections.

Morphology

The last subword is always better than the first subword, except in a few cases for the adjective degree task. This is not surprising because the superlative is marked with a circumfix and is differentiated from the comparative by a prefix. The rest of the results in this subsection all use the last subword.

huBERT is better than all models in the morphological tasks, especially in the lower layers, as shown in Figure 3. However, this tendency starts at the second layer; the first layer does not usually outperform the other models. In some morphological tasks huBERT systematically outperforms the other models: these are mostly the simpler noun- and adjective-based probes. In the possessor tasks (tagged [psor] in Figure 3) the XLM models are comparable to huBERT, while mBERT and distil-mBERT generally perform worse than huBERT. In the verb tasks XLM-RoBERTa achieves accuracy similar to huBERT in the higher layers, while in the lower layers huBERT tends to have higher accuracy.

huBERT is also better than all models in all tasks when we use the weighted average of all layers, as illustrated by Figure 4. The only exceptions are the adjective degree and the possessor tasks. A possible explanation for the surprising effectiveness of XLM-MLM-100 is its higher layer count.
POS tagging

Figure 5 shows the accuracy of the different models on the gold-standard Szeged UD and on the silver-standard data created with emtsv. Last subword pooling always performs better than first subword pooling. As in the morphology tasks, the XLM models perform only a little worse than huBERT. mBERT is very close in performance to huBERT, unlike in the morphological tasks, while distil-mBERT performs the worst, possibly due to its far lower parameter count.

We next examine the behavior of the layers by relative position (we only do this on the smaller Szeged dataset due to resource limitations). The embedding layer is a static mapping of subwords to an embedding space with a simple positional encoding added; contextual information is not available until the first layer. The highest layer is generally used as the input for downstream tasks. We also plot the performance of the middle layer. As Figure 6 shows, the embedding layer is the worst for each model and, somewhat surprisingly, adding one contextual layer leads to only a small improvement. The middle layer is actually better than the highest layer, which confirms the finding of Tenney et al. (2019a) that BERT rediscovers the NLP pipeline along its layers, with POS tagging being a mid-level task. As for the choice of subword, the last one is generally better, but the gap shrinks as we go higher in the layers.
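The four layer positions compared in Figure 6 can be read directly from the hidden_states tuple returned by the encoder. The snippet below is our own sketch of this selection; the example sentence is ours.

```python
# Sketch: pick the embedding, first, middle and last layers of a BERT-base model.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "SZTAKI-HLT/hubert-base-cc"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID, output_hidden_states=True)

encoded = tokenizer("Rövid példamondat.", return_tensors="pt")   # our own example
with torch.no_grad():
    hidden_states = model(**encoded).hidden_states   # 13 tensors: embeddings + 12 layers

layers = {
    "embedding": hidden_states[0],                     # static subword + position embeddings
    "first": hidden_states[1],                         # after one Transformer layer
    "middle": hidden_states[len(hidden_states) // 2],  # layer 6 of 12
    "last": hidden_states[-1],                         # the usual downstream input
}
for name, tensor in layers.items():
    print(name, tuple(tensor.shape))
```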
Fig. 3: The layerwise accuracy of morphological probes using the last subword. Shaded areas represent confidence intervals over 3 runs.

Fig. 4: Probing accuracy using the weighted sum of all layers.
Fig. 5: POS tag accuracy on Szeged UD and on the Webcorpus 2.0 sample.
Fig. 6: Szeged POS at 4 layers: embedding layer, first Transformer layer, middle layer, and highest layer.
Fig. 7: NER F score at the lowest, middle and highest layers.

NER

In the NER task (Figure 7), all of the models perform very similarly in the higher layers, except for distil-mBERT, which has nearly 3 times the error of the best model, huBERT. The closer we get to the global optimum, the clearer huBERT's superiority becomes. Far away from the optimum, when we use only the embedding layer, the first subword is better than the last, but the closer we get to the optimum (middle and last layer), the clearer the superiority of the last subword choice becomes.

Related work

Probing is a popular method for exploring black-box models. Shi et al. (2016) were perhaps the first to apply probing classifiers to the syntactic knowledge of neural machine translation models. Belinkov et al. (2017) probed NMT models for morphology. This work was followed by a large number of similar probing papers (Belinkov et al., 2017; Adi et al., 2017; Hewitt and Manning, 2019; Liu et al., 2019a; Tenney et al., 2019b; Warstadt et al., 2019; Conneau et al., 2018a; Hupkes and Zuidema, 2018). Despite the popularity of probing classifiers, they have theoretical limitations as knowledge extractors (Voita and Titov, 2020), and the low quality of silver data can also limit the applicability of important probing techniques such as canonical correlation analysis (Singh et al., 2019).

Multilingual BERT has been applied to a variety of multilingual tasks such as dependency parsing (Kondratyuk and Straka, 2019) or constituency parsing (Kitaev et al., 2019). mBERT's multilingual capabilities have been explored for NER, POS tagging and dependency parsing in dozens of languages by Wu and Dredze (2019) and Wu and Dredze (2020). The surprisingly effective multilinguality of mBERT was further explored by Dufter and Schütze (2020).
Conclusion

We presented a comparison of contextualized language models for Hungarian. We evaluated huBERT against 4 multilingual models across three tasks: morphological probing, POS tagging and NER. We found that huBERT is almost always better at all tasks, especially in the layers where the optima are reached. We also found that the subword tokenizer of huBERT matches Hungarian morphological segmentation much more faithfully than those of the multilingual models. We also showed that the choice of subword matters: the last subword is much better for all three kinds of tasks, except for cases where discontinuous morphology is involved, as in circumfixes and infixed plural possessives (Antal, 1963; Mel'cuk, 1972). Our data, code and the full result tables are available at https://github.com/juditacs/hubert_eval.

Acknowledgements
This work was partially supported by the National Research, Development and Innovation Office (NKFIH) grant "Deep Learning of Morphological Structure", by National Excellence Programme 2018-1.2.1-NKP-00008: "Exploring the Mathematical Foundations of Artificial Intelligence", and by the Ministry of Innovation and the National Research, Development and Innovation Office within the framework of the Artificial Intelligence National Laboratory Programme. Lévai was supported by the NRDI Forefront Research Excellence Program KKP_20 Nr. 133921 and the Hungarian National Excellence Grant 2018-1.2.1-NKP-00008.
References
Adi, Y., Kermany, E., Belinkov, Y., Lavi, O., Goldberg, Y.: Fine-grained analysis of sentence embeddings using auxiliary prediction tasks. In: Proceedings of International Conference on Learning Representations (2017)
Antal, L.: The possessive form of the Hungarian noun. Linguistics 3, 50–61 (1963)
Belinkov, Y., Durrani, N., Dalvi, F., Sajjad, H., Glass, J.: What do neural machine translation models learn about morphology? In: Proc. of ACL (2017)
Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzmán, F., Grave, E., Ott, M., Zettlemoyer, L., Stoyanov, V.: Unsupervised cross-lingual representation learning at scale (2019)
Conneau, A., Kruszewski, G., Lample, G., Barrault, L., Baroni, M.: What you can cram into a single $&!#* vector: Probing sentence embeddings for linguistic properties. In: Proc. of ACL (2018a), http://aclweb.org/anthology/P18-1198
Conneau, A., Rinott, R., Lample, G., Williams, A., Bowman, S.R., Schwenk, H., Stoyanov, V.: XNLI: Evaluating cross-lingual sentence representations. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics (2018b)
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). pp. 4171–4186. Association for Computational Linguistics, Minneapolis, Minnesota (2019)
Dufter, P., Schütze, H.: Identifying elements essential for BERT's multilinguality. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). pp. 4423–4437. Association for Computational Linguistics, Online (2020)
Farkas, R., Vincze, V., Schmid, H.: Dependency parsing of Hungarian: Baseline results and challenges. In: Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics. pp. 55–65. EACL '12, Association for Computational Linguistics, Stroudsburg, PA, USA (2012), http://dl.acm.org/citation.cfm?id=2380816.2380826
Hewitt, J., Manning, C.D.: A structural probe for finding syntax in word representations. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). pp. 4129–4138 (2019)
Hupkes, D., Zuidema, W.: Visualisation and 'diagnostic classifiers' reveal how recurrent and recursive neural networks process hierarchical structure. In: Proc. of IJCAI (2018), https://doi.org/10.24963/ijcai.2018/796
Indig, B., Sass, B., Simon, E., Mittelholcz, I., Kundráth, P., Vadász, N.: emtsv – Egy formátum mind felett [emtsv – One format to rule them all]. In: Berend, G., Gosztolya, G., Vincze, V. (eds.) XV. Magyar Számítógépes Nyelvészeti Konferencia (MSZNY 2019). pp. 235–247. Szegedi Tudományegyetem Informatikai Tanszékcsoport (2019)
Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014), https://arxiv.org/abs/1412.6980
Kitaev, N., Cao, S., Klein, D.: Multilingual constituency parsing with self-attention and pre-training. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. pp. 3499–3505. Association for Computational Linguistics, Florence, Italy (2019)
Kondratyuk, D., Straka, M.: 75 languages, 1 model: Parsing universal dependencies universally. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). pp. 2779–2795. Association for Computational Linguistics, Hong Kong, China (2019)
Kudo, T., Richardson, J.: SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. pp. 66–71. Association for Computational Linguistics, Brussels, Belgium (2018)
Liu, N.F., Gardner, M., Belinkov, Y., Peters, M.E., Smith, N.A.: Linguistic knowledge and transferability of contextual representations. pp. 1073–1094 (2019a)
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A robustly optimized BERT pretraining approach (2019b)
Mel'cuk, I.A.: On the possessive forms of the Hungarian noun. In: Kiefer, F., Rouwet, N. (eds.) Generative grammar in Europe, pp. 315–332. Reidel, Dordrecht (1972)
Nemeskey, D.M.: Natural Language Processing Methods for Language Modeling. Ph.D. thesis, Eötvös Loránd University (2020)
Nivre, J., Abrams, M., Agić, Ž., et al.: Universal Dependencies 2.3 (2018), http://hdl.handle.net/11234/1-2895, LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University
Schuster, M., Nakajima, K.: Japanese and Korean voice search. In: 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 5149–5152. IEEE (2012)
Shi, X., Padhi, I., Knight, K.: Does string-based neural MT learn source syntax? In: Proc. of EMNLP (2016)
Singh, J., McCann, B., Socher, R., Xiong, C.: BERT is not an interlingua and the bias of tokenization. In: Proceedings of the 2nd Workshop on Deep Learning Approaches for Low-Resource NLP (DeepLo 2019). pp. 47–55. Association for Computational Linguistics, Hong Kong, China (2019)
Tenney, I., Das, D., Pavlick, E.: BERT rediscovers the classical NLP pipeline. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. pp. 4593–4601. Association for Computational Linguistics, Florence, Italy (2019a)
Tenney, I., Xia, P., Chen, B., Wang, A., Poliak, A., McCoy, R.T., Kim, N., Durme, B.V., Bowman, S., Das, D., Pavlick, E.: What do you learn from context? Probing for sentence structure in contextualized word representations. In: Proc. of ICLR (2019b), https://openreview.net/forum?id=SJzSgnRcKX
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R. (eds.) Advances in Neural Information Processing Systems 30, pp. 5998–6008. Curran Associates, Inc. (2017), http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf
Voita, E., Titov, I.: Information-theoretic probing with minimum description length. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). pp. 183–196. Association for Computational Linguistics, Online (2020)
Warstadt, A., Cao, Y., Grosu, I., Peng, W., Blix, H., Nie, Y., Alsop, A., Bordia, S., Liu, H., Parrish, A., Wang, S.F., Phang, J., Mohananey, A., Htut, P.M., Jeretic, P., Bowman, S.R.: Investigating BERT's knowledge of language: Five analysis methods with NPIs. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). pp. 2877–2887. Association for Computational Linguistics, Hong Kong, China (2019)
Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., Davison, J., Shleifer, S., von Platen, P., Ma, C., Jernite, Y., Plu, J., Xu, C., Le Scao, T., Gugger, S., Drame, M., Lhoest, Q., Rush, A.: Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. pp. 38–45. Association for Computational Linguistics, Online (2020)
Wu, S., Dredze, M.: Beto, bentz, becas: The surprising cross-lingual effectiveness of BERT. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). pp. 833–844. Association for Computational Linguistics, Hong Kong, China (2019)