[PDF] Positional Artefacts Propagate Through Masked Language Model Embeddings

Abstract

In this work, we demonstrate that the contextualized word vectors derived from pretrained masked language model-based encoders share a common, perhaps undesirable pattern across layers. Namely, we find cases of persistent outlier neurons within BERT and RoBERTa's hidden state vectors that consistently bear the smallest or largest values in said vectors. In an attempt to investigate the source of this information, we introduce a neuron-level analysis method, which reveals that the outliers are closely related to information captured by positional embeddings. We also pre-train the RoBERTa-base models from scratch and find that the outliers disappear without using positional embeddings. These outliers, we find, are the major cause of anisotropy of encoders' raw vector spaces, and clipping them leads to increased similarity across vectors. We demonstrate this in practice by showing that clipped vectors can more accurately distinguish word senses, as well as lead to better sentence embeddings when mean pooling. In three supervised tasks, we find that clipping does not affect the performance.

Full PDF

CCatch the ”Tails” of BERT

Ziyang Luo

Department of Linguistics and Philology, Uppsala University

[email protected]

Abstract

Recently, contextualized word embeddingsoutperform static word embeddings on manyNLP tasks. However, we still do not knowmuch about the mechanism inside these rep-resentations. Do they have any common pat-terns? If so, where do these patterns comefrom? We ﬁnd that almost all the contextu-alized word vectors of BERT and RoBERTahave a common pattern. For BERT, the th element is always the smallest. For RoBERTa,the th element is always the largest, andthe th element is the smallest. We callthem ”tails” of models. We introduce a newneuron-level method to analyze where these”tails” come from. We ﬁnd that these ”tails”are closely related to the positional informa-tion. We also investigate what will happenif we ”cutting the tails” (zero-out). Our re-sults show that ”tails” are the major cause ofanisotropy of vector space. After ”cutting thetails”, a word’s different vectors are more sim-ilar to each other. The internal representationshave a better ability to distinguish a word’s dif-ferent senses with the word-in-context (WiC)dataset. The performance on the word sensedisambiguation task is better for BERT and un-changed for RoBERTa. We can also better in-duce phrase grammar from the vector space.These suggest that ”tails” are less related to thesense and syntax information in vectors. Theseﬁndings provide insights into the inner work-ings of contextualized word vectors. In many deep learning applications of NLP, themost important work is to represent a word as a vec-tor in a low-dimensional continuous space. Tradi-tionally, we use static word vectors, like Word2Vec(Mikolov et al., 2013), GloVe (Pennington et al.,2014), and FastText (Bojanowski et al., 2017).However, static word vectors can not solve thepolysemous problem. All senses of a word share the same representation. In recent years, deep lan-guage models, like BERT (Devlin et al., 2019) andRoBERTa (Liu et al., 2019b), achieve great successin many NLP tasks. They introduce a new kind ofword vectors, called contextualized word vectors.The representation is based on the context wherethe target word appears. Since these vectors aresensitive to context, they can better solve the pol-ysemous problem. Replacing static embeddingswith contextualized embeddings can beneﬁt a widerange of NLP tasks, including constituency parsing(Kitaev and Klein, 2018), coreference resolution(Joshi et al., 2019) and machine translation (Liuet al., 2020).Though contextualized word vectors in pre-trained language models achieve success, they arestill ”black boxes”. We poorly understand themechanism inside these vectors. Some previousworks use linear probes to investigate the linguis-tic properties of these vectors. Hewitt and Man-ning (2019) introduce a structure probe to evalu-ate whether syntax trees are embedded in the vec-tor space. Tenney et al. (2019) introduce an edgeprobe framework to investigate syntax in the vec-tors. We focus on investigating the common patternof vectors in BERT and RoBERTa. We also ana-lyze where this pattern comes from and how thispattern affects the geometry, the sense information,and the phrase grammar in the vector space. Weuse BERT-base-cased and RoBERTa-base modelsfrom Hugging Face’s Transformers Library (Wolfet al., 2020) in our works.The contributions of this study are as follows:1. All the contextualized embeddings share acommon pattern. For BERT, the th ele-ment is always the smallest element for alltokens in all non-input layers. For RoBERTa,the th element is always the largest in all https://huggingface.co/models a r X i v : . [ c s . C L ] N ov ectors and the th element is always thesmallest except the [CLS] token. We call them”tails” of models.2. We introduce a new neuron-level analysismethod to analyze where the ”tails” comefrom. We show that ”tails” are closely relatedto positional information. For BERT, its ”tail”is related to the ﬁrst position. For RoBERTa,its ”tails” are related to all positions exceptthe ﬁrst and second positions.3. We investigate what will happen if we ”cut thetails” (zero-out). Ethayarajh (2019) show thatthe geometry of contextualized word vectorspace is anisotropy. This means that the vec-tors occupy a narrow cone. We show that themajor cause of this phenomenon is the ”tails”of vectors. After ”cutting the tails”, the ge-ometry of vector space becomes directionallyuniform (isotropy). The vectors of the sameword are also more similar to each other.4. We also show that after ”cutting the tails”,the representations can better distinguisha word’s different senses of the word-in-context (WiC) dataset (Pilehvar and Camacho-Collados, 2019). The performance on theword sense disambiguation task is better forBERT and unchanged for RoBERTa. We canalso better induce phrase grammar from thevector space. These suggest that ”tails” areless related to sense and phrase grammar.These ﬁndings help us better understand the mech-anism inside a contextualized word vector. We randomly sample 1000 sentences from SST-2training set (Socher et al., 2013). Then we computeall the internal representations in all non-inputlayers of BERT and RoBERTa of these sentences.We get 209253 samples of BERT and 217044samples of RoBERTa. We perform an analysis ofcommon patterns of these internal representations.

BERT

We ﬁnd that the minimum element of97.16% contextualized word vectors is the th element. Example of this pattern is shown inFigure 1. A neuron is one dimension in a vector.

Figure 1: BERT Layer-5 internal representation of”phenomenal” in sentence ”is phenomenal , especiallythe women”.Figure 2: RoBERTa Layer-5 internal representation of”phenomenal” in sentence ”is phenomenal , especiallythe women”.

RoBERTa

We ﬁnd that the maximum elementof all contextualized word vectors is the th element. The minimum element of 90.29% vectorsis the th element. Example of these patterns areshown in Figure 2.Our results show that the internal word vectorsof BERT and RoBERTa have ”tails”. To the bestof our knowledge, we are the ﬁrst to ﬁnd suchcommon patterns in these models. More examplescan be seen in Appendix A. We further analyze the reasons why BERT andRoBERTa have ”tails”. We ﬁnd that the same ”tails”also exist in the positional embeddings.Figure 3 shows that the ﬁrst position’s positionalembedding of BERT has two long ”tails”. Theminimum element is the th element. This pat-tern does not exist in other positions’ positionalembeddings. Figure 4 shows that the third posi-ion’s positional embedding of RoBERTa has fourlong ”tails”. The th and th elements are in-cluded. We also ﬁnd that from the third position tothe ﬁnal position, the maximum element of . positional embeddings is the th element. Moreexamples of RoBERTa positional embeddings canbe found in Appendix B. Figure 3: The ﬁrst position’s positional embedding ofBERT-base.Figure 4: The third position’s positional embedding ofRoBERTa-base.

We give a possible explanation that the ”tails”in the contextualized word vectors are relatedto positional information. For BERT, the th ”tail” comes from the ﬁrst positional embedding.For RoBERTa, the th and th ”tails” comefrom all positional embeddings except the ﬁrst andsecond. Models parameters analysis

We further ana-lyze the parameters of BERT and RoBERTa. Weﬁnd some patterns in the parameters of Layer Nor-malization (LN, Ba et al. (2016)). LN has twolearnable parameters, gain and bias. Both of themare two 768-dimension vectors. They are designedfor afﬁne transformation on normalized vectors to improve the express ability. Every layer of BERTand RoBERTa uses two LNs.For BERT, the th element of gain is alwaysin the top-6 maximum values lists for the ﬁrst tenlayers’ ﬁrst LN. Especially, it is maximum in theﬁrst three layers.For RoBERTa, the th and th elements ofgain are in the top-5 maximum values list for theﬁrst layer’s ﬁrst LN.After the normalization of LN, the positionalinformation is lost. To reconstruct this informa-tion, LN uses gain to add positional informationto the vectors. We notice that in BERT, only theﬁrst position’s positional embedding has ”tails”.However, all contextualized vectors get these tailsafter the ﬁrst LN in the ﬁrst layer. We believe thatLN wants to use gain to add the ﬁrst position’spositional information into the ﬁrst vector. How-ever, all vectors share the same gain. This makesthat all vectors gain the ﬁrst position’s positionalinformation. Therefore, we make the followingassumptions: the ”tails” of BERT and RoBERTaare related to the positional information.To test our assumption, we introduce a newmethod to analyze an individual element in avector. Neuron-level analysis

The same as Durrani et al.(2020), we call each element in the vector as a neu-ron. We carry a neuron-level analysis on contextu-alized word vectors. First, we train a linear probewithout bias to predict the position of a contextu-alized vector in a sentence. Durrani et al. (2020)use the weights of the classiﬁer as a proxy to selectthe most relevant neurons. Their method is basedon the assumption that the larger the absolute valueof the weight, the more important the correspond-ing neuron. However, this method disregards themagnitudes of the values of neurons. A neuronwith a large weight does not mean to have a highcontribution to the ﬁnal classiﬁcation result. Forexample, if the value of a neuron is close to zero,a large weight also leads to a small contribution.Therefore, We deﬁne the contribution of the i th neuron as: c ( i ) = abs ( w i ∗ v i ) , (1)for i = 1 , , , ..., n . w i represents the i th weight and v i represents the i th neuron in thecontextualized word vector. We call C =[ c (1) , c (2) , ..., c ( n )] as a contribution vector. If a igure 5: Accuracy of position prediction.Figure 6: The average contribution of BERT’s ”tail” onposition prediction. neuron has a high contribution, this means that thisneuron is highly relevant to the ﬁnal classiﬁcationresult.To train a linear probe, we randomly samplesentences in SST-2 training set, 5000 sentences forthe training set, 1000 sentences for the validationset, and 1000 sentences for the test set. Weseparately train a linear probe for each layer.Training is running with batch size 32 and stoppedafter 10 epochs. We use a categorical cross-entropyloss, optimized by Adam algorithm (Kingma andBa, 2017). We compute the contribution of eachneuron on the test set. Neuron-level analysis results

Figure 5 showsthat the lower layers contain more positional in-formation than the upper layers. The higher layersof RoBERTa contain more position informationthan BERT’s.We use all the contextualized word embeddingsfrom the test set to compute the average contribu-tion of ”tails” on position prediction.For BERT, Figure 6 shows that BERT’s ”tail”has a higher contribution on predicting the ﬁrst position than the other positions for all layers ex-cept layer 5 and 6. We also ﬁnd that the th neuron have the maximum contribution to predict-ing the ﬁrst position of . contribution vec-tors of the ﬁrst layer, . of the second layer, . of the third layer, . of the fourthlayer, . of the sixth layer, . of the sev-enth layer, . of the eighth layer, . ofthe ninth layer, . of the tenth layer, . of the eleventh layer, and . of the twelfthlayer. These results show that the th ”tails” inall layers (except layer 5 and 6) are related to theﬁrst positional information.For RoBERTa, Figure 7 shows that the th tailhas a high contribution on the ﬁrst position predic-tion in the ﬁrst three layers and the last 4 layers(except layer 12). It also has a high contribution tothe second position prediction except the ﬁrst threelayers. The th tail has a high contribution on theﬁrst position prediction. This contribution becomeshigher in the upper layer. Its contribution to thesecond position prediction is also higher than otherpositions prediction. These results show that the th ”tails” are related to the ﬁrst and the secondpositional information. The th ”tails” are highlyrelated to the ﬁrst positional information. In Section 3, we show that the ”tails” are relatedto positional information. We zero out the ”tails”in this section. We call this ”cutting the tails”. Weinvestigate how this process affects contextualizedword vectors.BERT and RoBERTa use byte-pair tokenization(Sennrich et al., 2016). This means that some wordsare tokenized into subword units. Each word isrepresented by an average of the representations ofits subword units.

Ethayarajh (2019) show that contex-tualized word vectors are anisotropic in all non-input layers. This means that the average cosinesimilarity between uniformly randomly sampledwords are close to 1. We randomly samples 2000sentences from SST-2 training set and create 1000sentence-pairs. Then we randomly select a word ineach sentence. Every pair of sentences correspondsto a pair of words. We compute the cosine similar-ity between these two words to measure anisotropyof contextualized word vectors. igure 7: The average contribution of RoBERTa’s ”tails” on position prediction.Figure 8: Left: anisotropy measurement of contextualized word vectors in BERT and RoBERTa before and after”cutting the tails”. Right: self-similarity measurement of BERT and RoBERTa before and after ”cutting the tails”.

In the left ﬁgure of Figure 8, we can ﬁnd that con-textualized representations of BERT and RoBERTaare more ansiotropic in higher layer. Especially forRoBERTa, the average cosine similarity betweenrandom words are larger than 0.5 after the ﬁrstnon-input layer. This implies that the internal rep-resentations in BERT and RoBERTa occupy in anarrow cone in the vector space.We believe that the ”tails” are the major causeof anisotropy. To verify our hypothesis, we ”cutthe tails” of BERT and RoBERTa. This means thatwe set the th element of contextualized vectorsin BERT to be zero. The th and th elementsof vectors in RoBERTa are also set to be zero. Theleft ﬁgure in Figure 8 shows that after ”cuttingthe tails”, their vector spaces become much moreisotropic. Self-similarity

The self-similarity measurement(Ethayarajh, 2019) is used to compare the similar-ity of a word’s internal representations in differentcontexts. Given a word w , there are n different sen-tences s , s , ..., s n which contain such word. Weuse f il ( w ) to represent the internal representationof w in sentence s i in the l th layer. The averageself-similarity measurement of w in the l th layer is deﬁned as Self Sim l ( w ) = (cid:80) ni =1 (cid:80) nj = i +1 cos (cid:16) f il ( w ) , f jl ( w ) (cid:17) n ( n − (2)We sample 1000 different words from SST-2 train-ing set. All of them appear at least in 10 dif-ferent sentences. We use them to compute theself-similarity of BERT and RoBERTa before andafter ”cutting the tails”. To adjust for the effectof anisotropy, We subtract the self-similarity witheach layer’s anisotropy measurement.The right ﬁgure in Figure 8 shows that the inter-nal representations of a word become less similarto each other in the higher layer. However, after”cutting the tails”, the self-similarity increases. Thevectors become closer to each other. This impliesthat ”cutting the tails” can make a word’s contextu-alized vectors more similar to each other. Suppose that we have a target word w . It appearsin two sentences. w has the same sense in thesetwo sentences. However, its contextualized wordvectors are different because of different contexts.From Section 4.1, ”cutting the tails” can make aword’s contextualized vectors more similar. Wemake an assumption that ”cutting the tails” can igure 9: The difference of average cosine similaritybetween samples with true label and samples with falselabel in WiC dataset. increase vector’s ability to represent word sense. Word-in-context

To verify our assumption, weﬁrst analyze the vectors with the word-in-context(WiC) dataset (Pilehvar and Camacho-Collados,2019). Depending on different context, a word canrefer to multiple meanings. The WiC dataset isdesigned to identify the meaning of words in dif-ferent contexts. It is a binary classiﬁcation task.Given a target word and two sentences which con-tain this word, our models are required to determinewhether the word has the same meaning in thesetwo sentences.To analyze the inﬂuences of ”tails”, we ﬁrst com-pute the average cosine similarity of all the sampleswith true label and the similarity of all the sampleswith false label in dataset. Then we compute thedifference between these two cosine similarity val-ues. In Figure 9, we can ﬁnd that after ”cuttingthe tails”, the differences between samples withtrue and false label become larger, especially in thehigher layer.Second, we use the target word’s internal repre-sentations in two sentences to compute the cosinesimilarity. If the value is greater than threshold, weassign the true label. Otherwise, it is false label.We use this method to compare the performancebefore and after ”cutting the tails”. Since we do notrequire any training, we directly test our modelson WiC training dataset. We also compare 9 dif-ferent thresholds from 0.1 to 0.9. We use a simplebaseline model to assign all the samples with truelabel.Table 1 shows that after ”cutting the tails”,the accuracy scores of BERT and RoBERTaincrease about . These results show that Model Layer Threshold AccuracyBaselines - - 50.0%BERT 8 0.7 67.5%RoBERTa 10 0.9 69.0%BERT-cut 11 0.4 68.5%RoBERTa-cut 11 0.6 Table 1: The best accuracy scores on WiC dataset

Model Layer F1-scoreBaselines - 52.9%BERT 10 63.6%RoBERTa 10

BERT-cut 11 63.9%RoBERTa-cut 9

Table 2: The best F1 scores for WSD task. the internal representations of a word can bet-ter represent its meaning by only ”cutting the tails”.

Word sense disambiguation

We further analyzethe vectors with word sense disambiguation task.We follow the same method in ELMo’s paper (Pe-ters et al., 2018). We use BERT and RoBERTa tocompute representations of all words in the train-ing dataset, SemCor 3.0 (Miller et al., 1994) andthen compute the mean of representations for eachsense. To classify a new word, we use the word’scontextualized vector to ﬁnd the nearest neighborsense from the training dataset. If the target wordis not included in the training set, we default to themost commonly used sense in the training set. Thetesting data (3,669 sense types) and the evaluationframework are described in the paper (Raganatoet al., 2017). For the baseline model, we assign atarget lemma’s most frequent sense for it.Table 2 shows that after ”cutting the tails”, theF1 score of BERT become higher and the score ofRoBERTa is unchanged. These means that ”tails”are less related to word sense. Removing them canimprove the sense representation ability of vectorsin BERT.

Kim et al. (2020) introduces a method to induceconstituency parsing trees from contextualizedword vectors without training. We analyze how”cutting the tails” affects the phrase grammar abil-ity. We follow the method described in Kim et al.(2020). For a given sentence, we use the contex-tualized word vector for each word to computeodel Layer S-F1Baselines - 39.46BERT 9 41.39RoBERTa 11 41.81BERT-cut 9 41.41RoBERTa-cut 11

Table 3: The best S-F1 scores for unsupervised con-stituency parsing. the syntactic distance between every two adjacentwords. Then we use the algorithm in Shen et al.(2018) to extract the constituency parsing tree foreach sentence. We also add the right-skewness biasto every sentence.We use the WSJ Penn Treebank (PTB, Marcuset al. (1993)) here. We use the standard split ofthe dataset-2-21 for training, 22 for validation, and23 for test. We only use the test set in our ex-periments. We use sentence-level F1 (S-F1) scoreto evaluate our models. Our baseline model is aright-branching tree model.Table 3 shows that ”cutting the tails” can im-prove the phrase grammar ability of contextualizedword vectors.

Deep pre-trained language models

Pre-trainedlanguage models (Peters et al., 2018; Devlin et al.,2019; Liu et al., 2019b; Radford et al., 2019;Yang et al., 2019; Brown et al., 2020), particularlythose using Transformer architecture (Vaswaniet al., 2017), have great performance in manyNLP downstream tasks. However, they are still”black boxes”. We do not understand What theyhave learned, and why and when they perform well.

Interpreting deep pre-trained language mod-els

There have been many works performing anal-ysis to better understand what deep pre-trained lan-guage models learn. One line of work investigatesself-attention mechanism (Raganato and Tiede-mann, 2018; Vig, 2019; Mareˇcek and Rosa, 2018;Voita et al., 2019; Clark et al., 2019; Kobayashiet al., 2020). For example, Clark et al. (2019)shows that a large amount of attention focuses on[SEP] token. However, Kobayashi et al. (2020)proposes a norm-based analysis method and showthat BERT pays poor attention to special tokens.Another line of work investigates the internal rep- resentations with different probing classiﬁers (Ten-ney et al., 2019; Liu et al., 2019a; Lin et al., 2019;Hewitt and Manning, 2019; Zhao et al., 2020).Commonly, a probing classiﬁer is a linear clas-siﬁer that takes the internal representations as inputand is trained with a supervised task. For example,Hewitt and Manning (2019) introduce a structuralprobe to evaluate whether a syntax tree is encodedin the vector space. Their results provide evidencethat entire syntax trees are embedded in the space.Similar to our work, Ethayarajh (2019) investi-gate how contextual the contextualized word vec-tors are. They show that the contextualized vectorsof all words are not isotropic in all non-input lay-ers. A word’s different vectors are less similar toeach other in the higher layers. However, they donot perform an analysis of why the contextualizedword vectors have these properties. Dalvi et al.(2018) introduces a neuron-level analysis method.Durrani et al. (2020) use this method to analyzeindividual neurons in contextualized word vector.They train a linear probe to predict morphology,syntax and semantics information in a vector. Thenthey use the weights of the classiﬁer as a proxy toselect the most relevant neurons. They are based onthe assumption that the larger the absolute value ofthe weight, the more important the correspondingneuron. However, this method disregards the mag-nitudes of the values of neurons. A neuron with alarge weight does not lead to a high contribution tothe classiﬁcation result. For example, if the valueof a neuron is close to zero, a large weight alsoleads to a small contribution. Then this neuron isless related to the result.

In this paper, we investigate the common patternsof contextualized word representations of BERTand RoBERTa. For BERT, the th element is al-ways the smallest. For RoBERTa, the th elementis the smallest and the th element is the largest.We introduce a new neuron-level analysismethod to analyze where these ”tails” come from.We ﬁnd that they are related to the positional infor-mation.To further analyze the information inside ”tails”,we ”cut the tails” and investigate what will happen.We ﬁnd that they are the major cause of anisotropyin the vector space. ”Cutting the tails” can makethe vector space more directionally uniform andmake a word’s different representations moreimilar to each other. We also ﬁnd that ”tails”are less related to sense and phrase grammar.These insights help explain the mechanism insidecontextualized word vectors. Future works

Our ﬁndings offer some possibledirections for future works. In our works, we showthat ”tails” are related to positional information.We can analyze whether other deep pre-trained lan-guage models, like GPT-2 (Radford et al., 2019)and XLNet (Yang et al., 2019) have similar pat-terns. We also can use our new neuron-level analy-sis method to analyze other information in contex-tualized word vectors.

References

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hin-ton. 2016. Layer normalization.Piotr Bojanowski, Edouard Grave, Armand Joulin, andTomas Mikolov. 2017. Enriching word vectors withsubword information.

Transactions of the Associa-tion for Computational Linguistics , 5:135–146.Tom B. Brown, Benjamin Mann, Nick Ryder, MelanieSubbiah, Jared Kaplan, Prafulla Dhariwal, ArvindNeelakantan, Pranav Shyam, Girish Sastry, AmandaAskell, Sandhini Agarwal, Ariel Herbert-Voss,Gretchen Krueger, Tom Henighan, Rewon Child,Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu,Clemens Winter, Christopher Hesse, Mark Chen,Eric Sigler, Mateusz Litwin, Scott Gray, BenjaminChess, Jack Clark, Christopher Berner, Sam Mc-Candlish, Alec Radford, Ilya Sutskever, and DarioAmodei. 2020. Language models are few-shot learn-ers.Kevin Clark, Urvashi Khandelwal, Omer Levy, andChristopher D. Manning. 2019. What does BERTlook at? an analysis of BERT’s attention. In

Pro-ceedings of the 2019 ACL Workshop BlackboxNLP:Analyzing and Interpreting Neural Networks forNLP , pages 276–286, Florence, Italy. Associationfor Computational Linguistics.Fahim Dalvi, Nadir Durrani, Hassan Sajjad, YonatanBelinkov, Anthony Bau, and James Glass. 2018.What is one grain of sand in the desert? analyzingindividual neurons in deep nlp models.Jacob Devlin, Ming-Wei Chang, Kenton Lee, andKristina Toutanova. 2019. BERT: Pre-training ofdeep bidirectional transformers for language under-standing. In

Proceedings of the 2019 Conferenceof the North American Chapter of the Associationfor Computational Linguistics: Human LanguageTechnologies, Volume 1 (Long and Short Papers) ,pages 4171–4186, Minneapolis, Minnesota. Associ-ation for Computational Linguistics. Nadir Durrani, Hassan Sajjad, Fahim Dalvi, andYonatan Belinkov. 2020. Analyzing individual neu-rons in pre-trained language models. In

Proceed-ings of the 2020 Conference on Empirical Methodsin Natural Language Processing (EMNLP) , pages4865–4880, Online. Association for ComputationalLinguistics.Kawin Ethayarajh. 2019. How contextual are contex-tualized word representations? comparing the geom-etry of BERT, ELMo, and GPT-2 embeddings. In

Proceedings of the 2019 Conference on EmpiricalMethods in Natural Language Processing and the9th International Joint Conference on Natural Lan-guage Processing (EMNLP-IJCNLP) , pages 55–65,Hong Kong, China. Association for ComputationalLinguistics.John Hewitt and Christopher D. Manning. 2019. Astructural probe for ﬁnding syntax in word repre-sentations. In

Proceedings ofthe 2019 Conference on Empirical Methods in Nat-ural Language Processing and the 9th InternationalJoint Conference on Natural Language Processing(EMNLP-IJCNLP) , pages 5803–5808, Hong Kong,China. Association for Computational Linguistics.Taeuk Kim, Jihun Choi, Daniel Edmiston, and Sanggoo Lee. 2020. Are pre-trained language modelsaware of phrases? simple but strong baselines forgrammar induction.Diederik P. Kingma and Jimmy Ba. 2017. Adam: Amethod for stochastic optimization.Nikita Kitaev and Dan Klein. 2018. Constituency pars-ing with a self-attentive encoder. In

Proceedingsof the 56th Annual Meeting of the Association forComputational Linguistics (Volume 1: Long Papers) ,pages 2676–2686, Melbourne, Australia. Associa-tion for Computational Linguistics.Goro Kobayashi, Tatsuki Kuribayashi, Sho Yokoi, andKentaro Inui. 2020. Attention is not only a weight:Analyzing transformers with vector norms. In

Proceedings of the 2020 Conference on EmpiricalMethods in Natural Language Processing (EMNLP) ,pages 7057–7075, Online. Association for Computa-tional Linguistics.Yongjie Lin, Yi Chern Tan, and Robert Frank. 2019.Open sesame: Getting inside bert’s linguistic knowl-edge.Nelson F. Liu, Matt Gardner, Yonatan Belinkov,Matthew E. Peters, and Noah A. Smith. 2019a. Lin-guistic knowledge and transferability of contextualepresentations. In

Proceedings of the 2019 Confer-ence of the North American Chapter of the Associ-ation for Computational Linguistics: Human Lan-guage Technologies, Volume 1 (Long and Short Pa-pers) , pages 1073–1094, Minneapolis, Minnesota.Association for Computational Linguistics.Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, SergeyEdunov, Marjan Ghazvininejad, Mike Lewis, andLuke Zettlemoyer. 2020. Multilingual denoisingpre-training for neural machine translation.Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Man-dar Joshi, Danqi Chen, Omer Levy, Mike Lewis,Luke Zettlemoyer, and Veselin Stoyanov. 2019b.Roberta: A robustly optimized bert pretraining ap-proach.Mitchell P. Marcus, Beatrice Santorini, and Mary AnnMarcinkiewicz. 1993. Building a large annotatedcorpus of English: The Penn Treebank.

Computa-tional Linguistics , 19(2):313–330.David Mareˇcek and Rudolf Rosa. 2018. Extracting syn-tactic trees from transformer encoder self-attentions.In

Proceedings of the 2018 EMNLP Workshop Black-boxNLP: Analyzing and Interpreting Neural Net-works for NLP , pages 347–349, Brussels, Belgium.Association for Computational Linguistics.Tomas Mikolov, Kai Chen, Greg Corrado, and JeffreyDean. 2013. Efﬁcient estimation of word represen-tations in vector space.George A. Miller, Martin Chodorow, Shari Landes,Claudia Leacock, and Robert G. Thomas. 1994. Us-ing a semantic concordance for sense identiﬁcation.In

Human Language Technology: Proceedings of aWorkshop held at Plainsboro, New Jersey, March 8-11, 1994 .Jeffrey Pennington, Richard Socher, and ChristopherManning. 2014. GloVe: Global vectors for wordrepresentation. In

Proceedings of the 2014 Confer-ence on Empirical Methods in Natural LanguageProcessing (EMNLP) , pages 1532–1543, Doha,Qatar. Association for Computational Linguistics.Matthew Peters, Mark Neumann, Mohit Iyyer, MattGardner, Christopher Clark, Kenton Lee, and LukeZettlemoyer. 2018. Deep contextualized word rep-resentations. In

Proceedings of the 2018 Confer-ence of the North American Chapter of the Associ-ation for Computational Linguistics: Human Lan-guage Technologies, Volume 1 (Long Papers) , pages2227–2237, New Orleans, Louisiana. Associationfor Computational Linguistics.Mohammad Taher Pilehvar and Jose Camacho-Collados. 2019. WiC: the word-in-context datasetfor evaluating context-sensitive meaning represen-tations. In

Proceedings of the 2019 Conferenceof the North American Chapter of the Associationfor Computational Linguistics: Human LanguageTechnologies, Volume 1 (Long and Short Papers) , pages 1267–1273, Minneapolis, Minnesota. Associ-ation for Computational Linguistics.Alec Radford, Jeff Wu, Rewon Child, David Luan,Dario Amodei, and Ilya Sutskever. 2019. Languagemodels are unsupervised multitask learners.Alessandro Raganato, Jose Camacho-Collados, andRoberto Navigli. 2017. Word sense disambiguation:A uniﬁed evaluation framework and empirical com-parison. In

Proceedings of the 15th Conference ofthe European Chapter of the Association for Compu-tational Linguistics: Volume 1, Long Papers , pages99–110, Valencia, Spain. Association for Computa-tional Linguistics.Alessandro Raganato and J¨org Tiedemann. 2018. Ananalysis of encoder representations in transformer-based machine translation. In

Proceedings of the2018 EMNLP Workshop BlackboxNLP: Analyzingand Interpreting Neural Networks for NLP , pages287–297, Brussels, Belgium. Association for Com-putational Linguistics.Rico Sennrich, Barry Haddow, and Alexandra Birch.2016. Neural machine translation of rare wordswith subword units. In

Proceedings of the 54th An-nual Meeting of the Association for ComputationalLinguistics (Volume 1: Long Papers) , pages 1715–1725, Berlin, Germany. Association for Computa-tional Linguistics.Yikang Shen, Zhouhan Lin, Athul Paul Jacob, Alessan-dro Sordoni, Aaron Courville, and Yoshua Bengio.2018. Straight to the tree: Constituency parsingwith neural syntactic distance. In

Proceedings of the56th Annual Meeting of the Association for Compu-tational Linguistics (Volume 1: Long Papers) , pages1171–1180, Melbourne, Australia. Association forComputational Linguistics.Richard Socher, Alex Perelygin, Jean Wu, JasonChuang, Christopher D. Manning, Andrew Ng, andChristopher Potts. 2013. Recursive deep modelsfor semantic compositionality over a sentiment tree-bank. In

Proceedings of the 2013 Conference onEmpirical Methods in Natural Language Processing ,pages 1631–1642, Seattle, Washington, USA. Asso-ciation for Computational Linguistics.Ian Tenney, Patrick Xia, Berlin Chen, Alex Wang,Adam Poliak, R Thomas McCoy, Najoung Kim,Benjamin Van Durme, Sam Bowman, Dipanjan Das,and Ellie Pavlick. 2019. What do you learn fromcontext? probing for sentence structure in contextu-alized word representations. In

International Con-ference on Learning Representations .Ashish Vaswani, Noam Shazeer, Niki Parmar, JakobUszkoreit, Llion Jones, Aidan N Gomez, Ł ukaszKaiser, and Illia Polosukhin. 2017. Attention is allyou need. In

Advances in Neural Information Pro-cessing Systems , volume 30, pages 5998–6008. Cur-ran Associates, Inc.esse Vig. 2019. Visualizing attention in transformer-based language representation models.Elena Voita, David Talbot, Fedor Moiseev, Rico Sen-nrich, and Ivan Titov. 2019. Analyzing multi-headself-attention: Specialized heads do the heavy lift-ing, the rest can be pruned. In

Proceedings of the57th Annual Meeting of the Association for Com-putational Linguistics , pages 5797–5808, Florence,Italy. Association for Computational Linguistics.Thomas Wolf, Lysandre Debut, Victor Sanh, JulienChaumond, Clement Delangue, Anthony Moi, Pier-ric Cistac, Tim Rault, R´emi Louf, Morgan Funtow-icz, Joe Davison, Sam Shleifer, Patrick von Platen,Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu,Teven Le Scao, Sylvain Gugger, Mariama Drame,Quentin Lhoest, and Alexander M. Rush. 2020.Huggingface’s transformers: State-of-the-art naturallanguage processing.Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Car-bonell, Russ R Salakhutdinov, and Quoc V Le. 2019.Xlnet: Generalized autoregressive pretraining forlanguage understanding. In

Advances in NeuralInformation Processing Systems , volume 32, pages5753–5763. Curran Associates, Inc.Mengjie Zhao, Philipp Dufter, YadollahYaghoobzadeh, and Hinrich Sch¨utze. 2020. Quanti-fying the contextualization of word representationswith semantic class probing.

A Examples of Tails

We randomly sample some sentences from SST-2training set and we also write some sentences byourselves. The followings ﬁgrues are the examplesof ”tails” in BERT and RoBERTa.

A.1 BERT

Figure 10: BERT Layer-1 internal representation of”school” in sentence ”I want to go to school.”. Figure 11: BERT Layer-2 internal representation of”his” in sentence ”in his charming 2000 debut shanghainoon”.Figure 12: BERT Layer-3 internal representation of”like” in sentence ”I like playing basketball.”.Figure 13: BERT Layer-4 internal representation of”models” in sentence ”Language models are unsuper-vised multitask learners.”.igure 14: BERT Layer-6 internal representation of”school” in sentence ”I want to go to school.”.Figure 15: BERT Layer-7 internal representation of”doing” in sentence ”What are you doing?”.Figure 16: BERT Layer-8 internal representation of”doing” in sentence ”What are you doing?”. Figure 17: BERT Layer-9 internal representation of ”?”in sentence ”What are you doing?”.Figure 18: BERT Layer-10 internal representation of”London” in sentence ”London is the capital of UK.”.Figure 19: BERT Layer-11 internal representation of”London” in sentence ”London is the capital of UK.”.igure 20: BERT Layer-12 internal representation of”UK” in sentence ”London is the capital of UK.”.

A.2 RoBERTa

Figure 21: RoBERTa Layer-1 internal representation of”London” in sentence ”London is the capital of UK.”.Figure 22: RoBERTa Layer-2 internal representation of”London” in sentence ”London is the capital of UK.”. Figure 23: RoBERTa Layer-3 internal representationof ”semantic” in sentence ”Quantifying the contextu-alization of word representations with semantic classprobing.”.Figure 24: RoBERTa Layer-4 internal representation of”time” in sentence ”It is time to go to bed.”.Figure 25: RoBERTa Layer-6 internal representation of”computer” in sentence ”He bought a new computer.”.igure 26: RoBERTa Layer-7 internal representationof ”weather” in sentence ”The good seaman is knownin bad weather.”.Figure 27: RoBERTa Layer-8 internal representationof ”Five” in sentence ”One Two Three Four Five SixSeven Eight Nine Ten”.Figure 28: RoBERTa Layer-9 internal representationof ”want” in sentence ”Do you want to have lunch withme?”. Figure 29: RoBERTa Layer-10 internal representationof ”London” in sentence ”London is the capital ofUK.”.Figure 30: RoBERTa Layer-11 internal representationof ”Who” in sentence ”Who are you?”.Figure 31: RoBERTa Layer-12 internal representationof ”meet” in sentence ”Nice to meet you.”.

Examples of RoBERTa positionembeddings

Figure 32: The fourth position’s position embedding ofRoBERTa-baseFigure 33: The ﬁfth position’s position embedding ofRoBERTa-baseFigure 34: The thth