Have Attention Heads in BERT Learned Constituency Grammar?
Ziyang Luo
Uppsala University
Abstract
With the success of pre-trained language models in recent years, more and more researchers focus on opening the “black box” of these models. Following this interest, we carry out a qualitative and quantitative analysis of constituency grammar in the attention heads of BERT and RoBERTa. We employ the syntactic distance method to extract implicit constituency grammar from the attention weights of each head. Our results show that there exist heads that can induce some grammar types much better than baselines, suggesting that some heads act as a proxy for constituency grammar. We also analyze how the attention heads’ constituency grammar inducing (CGI) ability changes after fine-tuning on two kinds of tasks: sentence meaning similarity (SMS) tasks and natural language inference (NLI) tasks. Our results suggest that SMS tasks decrease the average CGI ability of the upper layers, while NLI tasks increase it. Lastly, we investigate the connections between CGI ability and natural language understanding ability on the QQP and MNLI tasks.
1 Introduction

Recently, pre-trained language models have achieved great success in many natural language processing tasks (Devlin et al., 2019; Yang et al., 2019), including sentiment analysis (Liu et al., 2019), question answering (Lan et al., 2020), and constituency parsing (Zhang et al., 2020), to name a few. Although these models have become more and more popular in NLP, they are still “black boxes”: what they have learned, and why and when they perform well, remains unknown. To open these “black boxes”, researchers have used many methods to analyze the linguistic knowledge that these models encode (Goldberg, 2019; Clark et al., 2019; Hewitt and Manning, 2019; Kim et al., 2020).

Pre-trained language models use a self-attention mechanism in each layer to compute the internal representation of each token. In this work, we investigate the hypothesis that some attention heads in pre-trained language models have learned constituency grammar. We use an unsupervised constituency parsing method to extract constituency trees from each attention head of BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019) before and after fine-tuning. This method computes the syntactic distance between every two adjacent words and generates a constituency parse tree recursively. We analyze the extracted constituency parse trees to investigate whether specific attention heads induce constituency grammar better than baselines, and which types of constituents they learn best.

In prior work, Kim et al. (2020) show that some layers of pre-trained language models exhibit syntactic structure akin to constituency grammar to some degree. However, they do not analyze how fine-tuning affects the models. We first follow their method to extract constituency grammar from BERT and RoBERTa. Then we use the same approach to analyze BERT and RoBERTa after fine-tuning. To the best of our knowledge, we are the first to investigate how fine-tuning affects the constituency grammar inducing (CGI) ability of attention heads. We fine-tune the models on two types of GLUE natural language understanding (NLU) tasks (Williams et al., 2018; Wang et al., 2018). The first type is the sentence meaning similarity (SMS) task, for which we fine-tune on two datasets, QQP and STS-B (Cer et al., 2017). The second type is the natural language inference (NLI) task, for which we fine-tune on two datasets, MNLI (Williams et al., 2018) and QNLI (Rajpurkar et al., 2016; Wang et al., 2018). Lastly, we investigate the relations between the CGI ability of attention heads and natural language understanding ability on the QQP and MNLI tasks.

The findings of our study are as follows:

1. Attention heads in the higher layers of BERT and the middle layers of RoBERTa have better constituency grammar inducing (CGI) ability. Some heads act as a proxy for some constituent types, but no head appears to fully learn constituency grammar.

2. The sentence meaning similarity task decreases the average CGI ability in the higher layers, while the natural language inference task increases it.

3. On the QQP and MNLI tasks, attention heads with better CGI ability are more important for BERT. However, this relation is different in RoBERTa.

2 Related Work

Many works have proposed methods to induce constituency grammar and extract constituency trees from the attention heads of Transformer-based models. Mareček and Rosa (2018) aggregate all the attention distributions through the layers to obtain an attention weight matrix.
They extract binary constituency trees and undirected dependency trees from this matrix. Kim et al. (2020) use the attention distributions and internal vector representations to compute the syntactic distance (Shen et al., 2018) between every two adjacent words, and draw constituency trees from raw sentences without any training.

Additionally, researchers have investigated how fine-tuning affects the syntactic knowledge that BERT learns. Kovaleva et al. (2019) use a subset of the GLUE tasks (Wang et al., 2018) to fine-tune the BERT-base model. They find that fine-tuning does not change the self-attention patterns, and that the attention heads of the last two layers undergo the largest changes after fine-tuning. Htut et al. (2019) investigate whether fine-tuning affects the dependency syntax in BERT’s attention, and find that fine-tuning does not have a great effect on the dependency syntax inducing ability of attention heads. Zhao and Bethard (2020) investigate negation scope in BERT’s and RoBERTa’s attention heads before and after fine-tuning; they find that after fine-tuning, attention heads are on average more sensitive to negation.

While there is some prior work analyzing attention heads in BERT, we believe we are the first to analyze the constituency grammar learned by fine-tuned BERT and RoBERTa models.
3 Background and Method

The Transformer (Vaswani et al., 2017) is a neural network model based on the self-attention mechanism. It contains multiple layers, and each layer contains multiple attention heads. Each attention head takes a sequence of input vectors h = [h_1, ..., h_n] corresponding to the n tokens. An attention head transforms each vector h_i into query q_i, key k_i, and value v_i vectors, then computes the output o_i as a weighted sum of the value vectors:

a_{ij} = exp(q_i^T k_j) / \sum_{t=1}^{n} exp(q_i^T k_t)    (1)

o_i = \sum_{j=1}^{n} a_{ij} v_j    (2)

The attention weight distribution of each token can be viewed as the “importance” of the other tokens in the sentence to the current token.

BERT is a Transformer-based pre-trained language model. It is pre-trained on BooksCorpus (Zhu et al., 2015) and English Wikipedia with a masked language model (MLM) objective and a next sentence prediction (NSP) objective. RoBERTa is a modified version of BERT: it removes the NSP pre-training objective and trains with much larger mini-batches and learning rates. We use the uncased base size of BERT and the base size of RoBERTa, which have 12 layers with 12 attention heads per layer. Our models are downloaded from Hugging Face’s Transformers library (Wolf et al., 2020) (https://huggingface.co/models).

We aim to analyze constituency grammar in attention heads, so we use a method to extract constituency parse trees from attention distributions. This method operates on the attention weight matrix W ∈ (0, 1)^{T×T} for every head at a given layer, where T is the number of tokens in the sentence.

Method: Syntactic Distance to Constituency Tree. To extract complete valid constituency parse trees from the attention weights of a given layer and head, we follow the method of Kim et al. (2020) and treat every row of the attention weight matrix as the attention distribution of one token in the sentence. As in Kim et al. (2020), we compute the syntactic distance vector d = [d_1, d_2, ..., d_{n-1}] for a given sentence w_1, ..., w_n, where d_i is the syntactic distance between w_i and w_{i+1}. Each d_i is defined as

d_i = f(g(w_i), g(w_{i+1})),    (3)

where f(·,·) is a distance measure function and g(·) is a feature extractor function. We use the Jensen-Shannon function to measure the distance between attention distributions; Appendix A gives a brief introduction to this function. g(w_i) is the i-th row of the attention matrix W.

To introduce the right-skewness bias of English constituency trees, we follow Kim et al. (2020) and add a linear bias term to every d_i:

\hat{d}_i = d_i + λ · Mean(d) × (1 − (i−1)/(m−1)),    (4)

where m = n − 1 and λ is a fixed scaling hyperparameter.

After computing the syntactic distances, we use the algorithm introduced by Shen et al. (2018) to obtain the constituency tree; Appendix B describes this algorithm.

Constituency parsing is a word-level task, but BERT uses byte-pair tokenization (Sennrich et al., 2016), which means that some words are tokenized into subword units. We therefore need to convert the token-to-token attention matrix into a word-to-word attention matrix: we merge the subword units of each word and average the attention distributions over the corresponding rows and columns. We use two baselines in our experiments: left-branching and right-branching trees.
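To make the procedure concrete, here is a minimal NumPy/SciPy sketch of Eqs. (3)-(4). It assumes W is already a word-to-word attention matrix (subwords merged as described above); the function name and the default λ are illustrative placeholders, not the paper’s code. The resulting distances would then be fed to the tree-building algorithm of Appendix B.

import numpy as np
from scipy.spatial.distance import jensenshannon

def syntactic_distances(W, lam=1.0):
    """Biased syntactic distances (Eqs. 3-4) from a word-level
    attention matrix W of shape (n, n), one row per word."""
    n = W.shape[0]
    # Eq. (3): f = Jensen-Shannon distance, g = row of W.
    d = np.array([jensenshannon(W[i], W[i + 1]) for i in range(n - 1)])
    # Eq. (4): linear right-skewness bias, strongest at the left edge.
    m = n - 1
    if m > 1:
        i = np.arange(1, m + 1)
        d = d + lam * d.mean() * (1.0 - (i - 1) / (m - 1))
    return d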
4 Experiments

In our experiments, we use the unsupervised constituency parsing method described above to induce constituency grammar on the WSJ Penn Treebank (PTB; Marcus et al., 1993) without any training. We use the standard test split (section 23). We use the sentence-level F1 (S-F1) score to evaluate our models. In addition, we report label recall scores for six main phrase categories: SBAR, NP, VP, PP, ADJP, and ADVP.
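The S-F1 metric can be made concrete with a small helper. The sketch below scores one sentence as unlabeled F1 over constituent spans and averages over the corpus; the function names are ours, and details such as the treatment of trivial whole-sentence spans may differ from the evaluation actually used.

def sentence_f1(pred_spans, gold_spans):
    """Unlabeled per-sentence F1 between two sets of (start, end) spans."""
    pred, gold = set(pred_spans), set(gold_spans)
    if not pred or not gold:
        return 0.0
    overlap = len(pred & gold)
    p = overlap / len(pred)
    r = overlap / len(gold)
    return 2 * p * r / (p + r) if p + r > 0 else 0.0

# Corpus-level S-F1 is the mean of per-sentence scores:
# s_f1 = sum(sentence_f1(p, g) for p, g in zip(preds, golds)) / len(golds)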
Figure 1: Average constituency parsing S-F1 score of each layer in BERT and RoBERTa.
4.1 CGI Ability before Fine-tuning

In this part, our goal is to understand how constituency grammar is captured by the different attention heads in BERT and RoBERTa before fine-tuning. First, we investigate the common patterns of the attention heads’ constituency grammar inducing (CGI) ability in BERT and RoBERTa. From Figure 1, we find that the CGI ability of the higher layers of BERT is better than that of the lower layers, whereas in RoBERTa the middle layers are better than the other layers. In Appendix C, two heatmaps of every head’s S-F1 score in BERT and RoBERTa show the same patterns.

Table 1 reports the S-F1 scores of the best attention heads of BERT and RoBERTa, together with the best recall for each phrase type. We observe that the S-F1 scores of BERT and RoBERTa are only slightly better than the right-branching baseline, which implies that the attention heads in BERT and RoBERTa do not appear to fully learn constituency grammar. However, they outperform the baselines by a large margin for noun phrases (NP), prepositional phrases (PP), adjective phrases (ADJP), and adverb phrases (ADVP). This implies that the attention heads in BERT and RoBERTa learn only a part of constituency grammar.
4.2 Effects of Fine-tuning on CGI Ability

In this part, we fine-tune BERT and RoBERTa on four downstream tasks: QQP, STS-B, QNLI, and MNLI. These tasks can be divided into two types. The first type is the sentence meaning similarity (SMS) task, including QQP and STS-B; it requires models to determine whether two sentences have the same meaning. The second type is the natural language inference (NLI) task, including QNLI and MNLI; it requires models to determine whether the first sentence entails the second. We want to analyze how these two kinds of downstream tasks affect the constituency grammar inducing (CGI) ability of attention heads in BERT and RoBERTa.

Models                  S-F1   SBAR     NP      VP      PP     ADJP    ADVP
Baselines
Left-branching Trees    8.73   5.46%  11.33%   0.82%   5.02%   2.46%   8.04%
Right-branching Trees  39.46  68.76%  24.89%  71.76%  42.43%  27.65%  38.11%
Pre-trained LMs
BERT                   39.47  67.32%  46.48%  68.82%  57.26%  46.39%  65.03%
BERT-QQP               39.97  67.32%  45.39%  68.79%  50.71%  45.01%  61.54%
BERT-STS-B             39.48  67.32%  44.16%  68.82%  56.68%  48.39%  57.69%
BERT-QNLI              39.74  67.32%  50.96%  68.81%  65.38%  46.08%  63.29%
BERT-MNLI              39.66  67.32%  44.89%  68.75%  62.81%  49.16%  64.69%
RoBERTa                39.60  67.43%  47.92%  69.35%  56.53%  49.00%  66.43%
RoBERTa-QQP            39.41  66.70%  43.02%  69.45%  51.06%  43.16%  60.84%
RoBERTa-STS-B          40.36  66.76%  46.82%  69.50%  54.91%  46.54%  64.34%
RoBERTa-QNLI           43.95  66.76%  52.51%  69.48%  58.30%  48.39%  69.23%
RoBERTa-MNLI           40.41  66.76%  47.97%  69.42%  57.50%  47.77%  68.88%

Table 1: Highest constituency parsing scores of all models. Blue scores are lower after fine-tuning; red scores are higher after fine-tuning.

Figure 2: Changes of the average S-F1 score of each layer in BERT after fine-tuning.

Figure 3: Changes of the average S-F1 score of each layer in RoBERTa after fine-tuning.

Figures 2 and 3 show that the four tasks do not have much influence on the lower layers of BERT and RoBERTa. For the higher layers, fine-tuning with NLI tasks increases the average CGI ability of attention heads in BERT and RoBERTa, while fine-tuning with SMS tasks harms it.

Table 1 shows that fine-tuning increases the highest constituency parsing score of every model except RoBERTa-QQP. However, fine-tuning with SMS tasks decreases the ability of attention heads to induce NP, PP, ADJP, and ADVP. For BERT, NLI tasks increase the ability of attention heads to induce NP and PP. For RoBERTa, NLI tasks increase the ability to induce NP, VP, PP, and ADVP.
4.3 CGI Ability and NLU Ability

In this part, we analyze the relations between constituency grammar inducing (CGI) ability and natural language understanding (NLU) ability on the QQP and MNLI tasks. We use the performance of BERT and RoBERTa to evaluate their NLU ability. We report scores on the validation data rather than the test data, so the results differ from the original BERT paper.

First, we sort all attention heads in each layer by their S-F1 scores before fine-tuning. Then we use the method of Michel et al. (2019) to mask the top-k/bottom-k (k = 1, 2, ...) attention heads in each layer and compute the accuracy on the two downstream tasks, QQP and MNLI.

Figure 4: QQP dev and MNLI dev-matched accuracy after masking the top-k/bottom-k attention heads in each layer of BERT-QQP and BERT-MNLI.

Figure 5: QQP dev and MNLI dev-matched accuracy after masking the top-k/bottom-k attention heads in each layer of RoBERTa-QQP and RoBERTa-MNLI.

Figure 4 shows that downstream task accuracy decreases more quickly when the top-k attention heads in BERT are masked. Especially for the QQP task, after masking the bottom-7 attention heads in all layers, accuracy is still higher than 80%, which is more than 10 points higher than after masking the top-7 attention heads.

Figure 5 shows that masking RoBERTa yields different results from BERT. For the QQP task, when k is smaller than or equal to 6, masking the bottom-k attention heads in all layers decreases accuracy faster; for the MNLI task, the same holds when k is 1 or 2. When k is larger than 6 on the QQP task and larger than 2 on the MNLI task, masking the top-k heads decreases accuracy faster.

For BERT, these results show that attention heads with better CGI ability are more important for the model’s NLU ability on these two tasks. For RoBERTa, the connection between CGI ability and NLU ability is not as strong as for BERT: on the MNLI task, we can still see that heads with better CGI ability are more important, but on the QQP task the better heads are not as important.
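As a concrete illustration of this masking protocol, the sketch below builds a head mask for a Hugging Face BERT model, whose forward pass accepts a head_mask of shape (num_layers, num_heads) in which 0 disables a head. The ranked_heads structure and the checkpoint path are our own illustrative assumptions, not the paper’s released code.

import torch

def build_head_mask(ranked_heads, k, mask_top=True, n_layers=12, n_heads=12):
    """ranked_heads[l]: head indices of layer l sorted by S-F1, best first.
    Returns a (n_layers, n_heads) mask with the top-k or bottom-k heads
    of every layer set to 0 (disabled)."""
    mask = torch.ones(n_layers, n_heads)
    for layer, heads in enumerate(ranked_heads):
        targets = heads[:k] if mask_top else heads[-k:]
        mask[layer, list(targets)] = 0.0
    return mask

# Illustrative usage with a (hypothetical) QQP-fine-tuned checkpoint:
# from transformers import BertForSequenceClassification
# model = BertForSequenceClassification.from_pretrained("path/to/bert-qqp")
# logits = model(input_ids, attention_mask=attention_mask,
#                head_mask=build_head_mask(ranked_heads, k=3)).logits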
5 Discussion

The experiments detailed in the previous sections indicate that the attention heads in BERT and RoBERTa do not fully learn constituency grammar. Even after fine-tuning on downstream tasks, the best constituency parsing score does not change much. Our results are similar to those of Htut et al. (2019), who point out that attention heads likewise do not fully learn dependency syntax and that fine-tuning does not change this. This raises an interesting question: do attention heads simply not contain syntactic (constituency or dependency) information? If so, where does BERT encode this information? And is syntactic information unimportant for BERT’s language understanding? Our simple experiment in §4.3 shows that the attention heads with better constituency grammar inducing ability are not important for RoBERTa on the QQP task. Glavaš and Vulić (2020) also point out that leveraging explicit formalized syntactic structures provides zero to negligible impact on NLU tasks. The relations between syntax and BERT’s NLU ability still need further analysis.

6 Conclusion

In this work, we investigate whether the attention heads in BERT and RoBERTa have learned constituency grammar, before and after fine-tuning. We use a method that extracts constituency parse trees without any training, and observe that the upper layers of BERT and the middle layers of RoBERTa show better constituency grammar inducing (CGI) ability. Certain attention heads induce specific phrase types well, but none of the heads show strong overall CGI ability. Furthermore, we observe that fine-tuning with SMS tasks decreases the average CGI ability of the upper layers, while NLI tasks increase it. Lastly, we mask heads based on their parsing S-F1 scores and show that attention heads with better CGI ability are more important for BERT on the QQP and MNLI tasks; for RoBERTa, better heads are not as important on the QQP task.

One direction for future research would be to further study the relations between downstream tasks and the CGI ability of attention heads, and to explain why different tasks have different effects.
Acknowledgments
This project grew out of a master course project for the Fall 2020 Uppsala University course 5LN714, Language Technology: Research and Development. We would like to thank Sara Stymne and Ali Basirat for their great suggestions, and the anonymous reviewers for their excellent feedback.
References
Daniel Cer, Mona Diab, Eneko Agirre, Iñigo Lopez-Gazpio, and Lucia Specia. 2017. SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 1–14, Vancouver, Canada. Association for Computational Linguistics.

Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D. Manning. 2019. What does BERT look at? An analysis of BERT’s attention. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 276–286, Florence, Italy. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Goran Glavaš and Ivan Vulić. 2020. Is supervised syntactic parsing beneficial for language understanding? An empirical investigation.

Yoav Goldberg. 2019. Assessing BERT’s syntactic abilities.

John Hewitt and Christopher D. Manning. 2019. A structural probe for finding syntax in word representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4129–4138, Minneapolis, Minnesota. Association for Computational Linguistics.

Phu Mon Htut, Jason Phang, Shikha Bordia, and Samuel R. Bowman. 2019. Do attention heads in BERT track syntactic dependencies?

Taeuk Kim, Jihun Choi, Daniel Edmiston, and Sang-goo Lee. 2020. Are pre-trained language models aware of phrases? Simple but strong baselines for grammar induction.

Olga Kovaleva, Alexey Romanov, Anna Rogers, and Anna Rumshisky. 2019. Revealing the dark secrets of BERT. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4365–4374, Hong Kong, China. Association for Computational Linguistics.

Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2020. ALBERT: A lite BERT for self-supervised learning of language representations. In International Conference on Learning Representations.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach.

Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.

David Mareček and Rudolf Rosa. 2018. Extracting syntactic trees from transformer encoder self-attentions. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 347–349, Brussels, Belgium. Association for Computational Linguistics.

Paul Michel, Omer Levy, and Graham Neubig. 2019. Are sixteen heads really better than one? In Advances in Neural Information Processing Systems 32, pages 14014–14024. Curran Associates, Inc.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas. Association for Computational Linguistics.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.

Yikang Shen, Zhouhan Lin, Athul Paul Jacob, Alessandro Sordoni, Aaron Courville, and Yoshua Bengio. 2018. Straight to the tree: Constituency parsing with neural syntactic distance. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1171–1180, Melbourne, Australia. Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355, Brussels, Belgium. Association for Computational Linguistics.

Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122. Association for Computational Linguistics.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. HuggingFace’s Transformers: State-of-the-art natural language processing.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R. Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems 32, pages 5753–5763. Curran Associates, Inc.

Yu Zhang, Houquan Zhou, and Zhenghua Li. 2020. Fast and accurate neural CRF constituency parsing. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-20, pages 4046–4053. International Joint Conferences on Artificial Intelligence Organization. Main track.

Yiyun Zhao and Steven Bethard. 2020. How does BERT’s attention change when you fine-tune? An analysis methodology and a case study in negation scope. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4729–4747, Online. Association for Computational Linguistics.

Yukun Zhu, Ryan Kiros, Richard Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books.
A Jensen-Shannon Distance Measure Function
The Jensen-Shannon function measures the distance between two distributions. Suppose that we have two distributions P and Q; the Jensen-Shannon distance is defined as

JSD(P || Q) = ( (D_KL(P || M) + D_KL(Q || M)) / 2 )^{1/2},    (5)

where M = (P + Q)/2 and D_KL(A || B) = \sum_w A(w) log(A(w)/B(w)).
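For reference, Eq. (5) can be transcribed directly in NumPy. This is a sketch: the small epsilon for numerical stability is our own addition, not part of the definition, and the result is equivalent in spirit to scipy.spatial.distance.jensenshannon.

import numpy as np

def js_distance(p, q, eps=1e-12):
    """Jensen-Shannon distance between two discrete distributions (Eq. 5)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = (p + q) / 2
    kl = lambda a, b: np.sum(a * np.log((a + eps) / (b + eps)))
    return np.sqrt((kl(p, m) + kl(q, m)) / 2)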
B Syntactic Distances to Constituency Trees Algorithm

Algorithm 1: Syntactic Distances to Constituency Trees (Shen et al., 2018)

Input: S = [w_1, w_2, ..., w_n], a sentence with n words; d = [d_1, d_2, ..., d_{n-1}], a sequence of distances between every two adjacent words.

function TREE(S, d)
    if d is empty then
        node ← Leaf(S[0])
    else
        i ← argmax_i(d)
        lchild ← TREE(S_{≤i}, d_{<i})
        rchild ← TREE(S_{>i}, d_{>i})
        node ← Node(lchild, rchild)
    end if
    return node
end function
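A runnable Python transcription of Algorithm 1 (variable names are ours) could look as follows; it returns a nested-list binary tree.

def tree(words, dists):
    """Recursively split a sentence at the largest syntactic distance
    (Shen et al., 2018). Returns a nested-list binary tree."""
    if len(words) == 1:          # `dists` is empty: leaf node
        return words[0]
    i = max(range(len(dists)), key=dists.__getitem__)
    left = tree(words[: i + 1], dists[:i])
    right = tree(words[i + 1 :], dists[i + 1 :])
    return [left, right]

# Example: tree(["the", "cat", "sat"], [0.2, 0.9]) -> [["the", "cat"], "sat"]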
C BERT and RoBERTa Heatmaps