Have Attention Heads in BERT Learned Constituency Grammar?
Ziyang Luo
Uppsala University
Abstract
With the success of pre-trained language models in recent years, more and more researchers focus on opening the “black box” of these models. Following this interest, we carry out a qualitative and quantitative analysis of constituency grammar in the attention heads of BERT and RoBERTa. We employ the syntactic distance method to extract implicit constituency grammar from the attention weights of each head. Our results show that there exist heads that can induce some grammar types much better than baselines, suggesting that some heads act as a proxy for constituency grammar. We also analyze how the attention heads’ constituency grammar inducing (CGI) ability changes after fine-tuning on two kinds of tasks: sentence meaning similarity (SMS) tasks and natural language inference (NLI) tasks. Our results suggest that SMS tasks decrease the average CGI ability of the upper layers, while NLI tasks increase it. Lastly, we investigate the connections between CGI ability and natural language understanding ability on the QQP and MNLI tasks.
1 Introduction

Recently, pre-trained language models have achieved great success in many natural language processing tasks (Devlin et al., 2019; Yang et al., 2019), including sentiment analysis (Liu et al., 2019), question answering (Lan et al., 2020), and constituency parsing (Zhang et al., 2020), to name a few. Although these models have become more and more popular in NLP, they are still “black boxes”: what they have learned, and why and when they perform well, remains unknown. To open these “black boxes”, researchers have used many methods to analyze the linguistic knowledge that these models encode (Goldberg, 2019; Clark et al., 2019; Hewitt and Manning, 2019; Kim et al., 2020).

Pre-trained language models use a self-attention mechanism in each layer to compute the internal representation of each token. In this work, we investigate the hypothesis that some attention heads in pre-trained language models have learned constituency grammar. We use an unsupervised constituency parsing method to extract constituency trees from each attention head of BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019) before and after fine-tuning. This method computes the syntactic distance between every two adjacent words and generates a constituency parse tree recursively. We analyze the extracted constituency parse trees to investigate whether specific attention heads induce constituency grammar better than baselines, and which types of constituents they learn best.

In prior work, Kim et al. (2020) show that some layers of pre-trained language models exhibit syntactic structure akin to constituency grammar to some degree. However, they do not analyze how fine-tuning affects the models. We first follow their method to extract constituency grammar from BERT and RoBERTa. Then we use the same approach to analyze BERT and RoBERTa after fine-tuning. To the best of our knowledge, we are the first to investigate how fine-tuning affects the constituency grammar inducing (CGI) ability of attention heads. We fine-tune the models on two types of GLUE natural language understanding (NLU) tasks (Williams et al., 2018; Wang et al., 2018). The first type is the sentence meaning similarity (SMS) task, for which we fine-tune on two datasets, QQP and STS-B (Cer et al., 2017). The second type is the natural language inference (NLI) task, for which we fine-tune on two datasets, MNLI (Williams et al., 2018) and QNLI (Rajpurkar et al., 2016; Wang et al., 2018). Lastly, we investigate the relations between the CGI ability of attention heads and natural language understanding ability on the QQP and MNLI tasks.

The findings of our study are as follows:

1. Attention heads in the higher layers of BERT and the middle layers of RoBERTa have better constituency grammar inducing (CGI) ability. Some heads act as a proxy for some constituent types, but no head appears to fully learn constituency grammar.

2. The sentence meaning similarity task decreases the average CGI ability in the higher layers, while the natural language inference task increases it.

3. On the QQP and MNLI tasks, attention heads with better CGI ability are more important for BERT. However, this relation is different in RoBERTa.

2 Related Work

Many works have proposed methods to induce constituency grammar and extract constituency trees from the attention heads of Transformer-based models. Mareček and Rosa (2018) aggregate all the attention distributions through the layers to obtain an attention weight matrix.
They extract binary constituency trees and undirected dependency trees from this matrix. Kim et al. (2020) use the attention distributions and internal vector representations to compute the syntactic distance (Shen et al., 2018) between every two adjacent words, and draw constituency trees from raw sentences without any training.

Additionally, researchers have investigated how fine-tuning affects the syntactic knowledge that BERT learns. Kovaleva et al. (2019) use a subset of the GLUE tasks (Wang et al., 2018) to fine-tune the BERT-base model. They find that fine-tuning does not change the self-attention patterns, and that the attention heads of the last two layers undergo the largest changes after fine-tuning. Htut et al. (2019) investigate whether fine-tuning affects the dependency syntax in BERT’s attention, and find that fine-tuning does not have a great effect on the dependency syntax inducing ability of attention heads. Zhao and Bethard (2020) investigate negation scope in BERT’s and RoBERTa’s attention heads before and after fine-tuning; they find that after fine-tuning, attention heads are on average more sensitive to negation.

While there is some prior work analyzing attention heads in BERT, we believe we are the first to analyze the constituency grammar learned by fine-tuned BERT and RoBERTa models.
3 Background and Method

The Transformer (Vaswani et al., 2017) is a neural network model based on the self-attention mechanism. It contains multiple layers, and each layer contains multiple attention heads. Each attention head takes a sequence of input vectors h = [h_1, ..., h_n] corresponding to the n tokens. An attention head transforms each vector h_i into query q_i, key k_i, and value v_i vectors, then computes the output o_i as a weighted sum of the value vectors:

a_{ij} = exp(q_i^T k_j) / \sum_{t=1}^{n} exp(q_i^T k_t)    (1)

o_i = \sum_{j=1}^{n} a_{ij} v_j    (2)

The attention weight distribution of each token can be viewed as the “importance” of the other tokens in the sentence to the current token.

BERT is a Transformer-based pre-trained language model. It is pre-trained on BooksCorpus (Zhu et al., 2015) and English Wikipedia with a masked language model (MLM) objective and a next sentence prediction (NSP) objective. RoBERTa is a modified version of BERT: it removes the NSP pre-training objective and trains with much larger mini-batches and learning rates. We use the uncased base size of BERT and the base size of RoBERTa, which have 12 layers with 12 attention heads per layer. Our models are downloaded from Hugging Face’s Transformers library (Wolf et al., 2020) (https://huggingface.co/models).

We aim to analyze constituency grammar in attention heads, so we use a method to extract constituency parse trees from attention distributions. This method operates on the attention weight matrix W ∈ (0, 1)^{T×T} for every head at a given layer, where T is the number of tokens in the sentence.

Method: Syntactic Distance to Constituency Tree. To extract complete valid constituency parse trees from the attention weights of a given layer and head, we follow the method of Kim et al. (2020) and treat every row of the attention weight matrix as the attention distribution of one token in the sentence. As in Kim et al. (2020), we compute the syntactic distance vector d = [d_1, d_2, ..., d_{n-1}] for a given sentence w_1, ..., w_n, where d_i is the syntactic distance between w_i and w_{i+1}. Each d_i is defined as

d_i = f(g(w_i), g(w_{i+1})),    (3)

where f(·,·) is a distance measure function and g(·) is a feature extractor function. We use the Jensen-Shannon function to measure the distance between attention distributions; Appendix A gives a brief introduction to this function. g(w_i) is the i-th row of the attention matrix W.

To introduce the right-skewness bias of English constituency trees, we follow Kim et al. (2020) and add a linear bias term to every d_i:

\hat{d}_i = d_i + λ · Mean(d) × (1 − (i−1)/(m−1)),    (4)

where m = n − 1 and λ is a fixed scaling hyperparameter.

After computing the syntactic distances, we use the algorithm introduced by Shen et al. (2018) to obtain the constituency tree; Appendix B describes this algorithm.

Constituency parsing is a word-level task, but BERT uses byte-pair tokenization (Sennrich et al., 2016), which means that some words are tokenized into subword units. We therefore need to convert the token-to-token attention matrix into a word-to-word attention matrix: we merge the subword units of each word and average the attention distributions over the corresponding rows and columns. We use two baselines in our experiments: left-branching and right-branching trees.
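To make the procedure concrete, here is a minimal NumPy/SciPy sketch of Eqs. (3)-(4). It assumes W is already a word-to-word attention matrix (subwords merged as described above); the function name and the default λ are illustrative placeholders, not the paper’s code. The resulting distances would then be fed to the tree-building algorithm of Appendix B.

import numpy as np
from scipy.spatial.distance import jensenshannon

def syntactic_distances(W, lam=1.0):
    """Biased syntactic distances (Eqs. 3-4) from a word-level
    attention matrix W of shape (n, n), one row per word."""
    n = W.shape[0]
    # Eq. (3): f = Jensen-Shannon distance, g = row of W.
    d = np.array([jensenshannon(W[i], W[i + 1]) for i in range(n - 1)])
    # Eq. (4): linear right-skewness bias, strongest at the left edge.
    m = n - 1
    if m > 1:
        i = np.arange(1, m + 1)
        d = d + lam * d.mean() * (1.0 - (i - 1) / (m - 1))
    return d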
4 Experiments

In our experiments, we use the unsupervised constituency parsing method described above to induce constituency grammar on the WSJ Penn Treebank (PTB; Marcus et al., 1993) without any training. We use the standard test split (section 23). We use the sentence-level F1 (S-F1) score to evaluate our models. In addition, we report label recall scores for six main phrase categories: SBAR, NP, VP, PP, ADJP, and ADVP.
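The S-F1 metric can be made concrete with a small helper. The sketch below scores one sentence as unlabeled F1 over constituent spans and averages over the corpus; the function names are ours, and details such as the treatment of trivial whole-sentence spans may differ from the evaluation actually used.

def sentence_f1(pred_spans, gold_spans):
    """Unlabeled per-sentence F1 between two sets of (start, end) spans."""
    pred, gold = set(pred_spans), set(gold_spans)
    if not pred or not gold:
        return 0.0
    overlap = len(pred & gold)
    p = overlap / len(pred)
    r = overlap / len(gold)
    return 2 * p * r / (p + r) if p + r > 0 else 0.0

# Corpus-level S-F1 is the mean of per-sentence scores:
# s_f1 = sum(sentence_f1(p, g) for p, g in zip(preds, golds)) / len(golds)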
Figure 1: Average constituency parsing S-F1 score of each layer in BERT and RoBERTa.
4.1 CGI Ability before Fine-tuning

In this part, our goal is to understand how constituency grammar is captured by the different attention heads in BERT and RoBERTa before fine-tuning. First, we investigate the common patterns of the attention heads’ constituency grammar inducing (CGI) ability in BERT and RoBERTa. From Figure 1, we find that the CGI ability of the higher layers of BERT is better than that of the lower layers, whereas in RoBERTa the middle layers are better than the other layers. In Appendix C, two heatmaps of every head’s S-F1 score in BERT and RoBERTa show the same patterns.

Table 1 reports the S-F1 scores of the best attention heads of BERT and RoBERTa, together with the best recall for each phrase type. We observe that the S-F1 scores of BERT and RoBERTa are only slightly better than the right-branching baseline, which implies that the attention heads in BERT and RoBERTa do not appear to fully learn constituency grammar. However, they outperform the baselines by a large margin for noun phrases (NP), prepositional phrases (PP), adjective phrases (ADJP), and adverb phrases (ADVP). This implies that the attention heads in BERT and RoBERTa learn only a part of constituency grammar.
4.2 Effects of Fine-tuning on CGI Ability

In this part, we fine-tune BERT and RoBERTa on four downstream tasks: QQP, STS-B, QNLI, and MNLI. These tasks can be divided into two types. The first type is the sentence meaning similarity (SMS) task, including QQP and STS-B; it requires models to determine whether two sentences have the same meaning. The second type is the natural language inference (NLI) task, including QNLI and MNLI; it requires models to determine whether the first sentence entails the second. We want to analyze how these two kinds of downstream tasks affect the constituency grammar inducing (CGI) ability of attention heads in BERT and RoBERTa.

Models                  S-F1   SBAR     NP      VP      PP     ADJP    ADVP
Baselines
Left-branching Trees    8.73   5.46%  11.33%   0.82%   5.02%   2.46%   8.04%
Right-branching Trees  39.46  68.76%  24.89%  71.76%  42.43%  27.65%  38.11%
Pre-trained LMs
BERT                   39.47  67.32%  46.48%  68.82%  57.26%  46.39%  65.03%
BERT-QQP               39.97  67.32%  45.39%  68.79%  50.71%  45.01%  61.54%
BERT-STS-B             39.48  67.32%  44.16%  68.82%  56.68%  48.39%  57.69%
BERT-QNLI              39.74  67.32%  50.96%  68.81%  65.38%  46.08%  63.29%
BERT-MNLI              39.66  67.32%  44.89%  68.75%  62.81%  49.16%  64.69%
RoBERTa                39.60  67.43%  47.92%  69.35%  56.53%  49.00%  66.43%
RoBERTa-QQP            39.41  66.70%  43.02%  69.45%  51.06%  43.16%  60.84%
RoBERTa-STS-B          40.36  66.76%  46.82%  69.50%  54.91%  46.54%  64.34%
RoBERTa-QNLI           43.95  66.76%  52.51%  69.48%  58.30%  48.39%  69.23%
RoBERTa-MNLI           40.41  66.76%  47.97%  69.42%  57.50%  47.77%  68.88%

Table 1: Highest constituency parsing scores of all models. Blue scores are lower after fine-tuning; red scores are higher after fine-tuning.

Figure 2: Changes of the average S-F1 score of each layer in BERT after fine-tuning.

Figure 3: Changes of the average S-F1 score of each layer in RoBERTa after fine-tuning.

Figures 2 and 3 show that the four tasks do not have much influence on the lower layers of BERT and RoBERTa. For the higher layers, fine-tuning with NLI tasks increases the average CGI ability of attention heads in BERT and RoBERTa, while fine-tuning with SMS tasks harms it.

Table 1 shows that fine-tuning increases the highest constituency parsing score of every model except RoBERTa-QQP. However, fine-tuning with SMS tasks decreases the ability of attention heads to induce NP, PP, ADJP, and ADVP. For BERT, NLI tasks increase the ability of attention heads to induce NP and PP. For RoBERTa, NLI tasks increase the ability to induce NP, VP, PP, and ADVP.
4.3 CGI Ability and NLU Ability

In this part, we analyze the relations between constituency grammar inducing (CGI) ability and natural language understanding (NLU) ability on the QQP and MNLI tasks. We use the performance of BERT and RoBERTa to evaluate their NLU ability. We report scores on the validation data rather than the test data, so the results differ from the original BERT paper.

First, we sort all attention heads in each layer by their S-F1 scores before fine-tuning. Then we use the method of Michel et al. (2019) to mask the top-k/bottom-k (k = 1, 2, ...) attention heads in each layer and compute the accuracy on the two downstream tasks, QQP and MNLI.

Figure 4: QQP dev and MNLI dev-matched accuracy after masking the top-k/bottom-k attention heads in each layer of BERT-QQP and BERT-MNLI.

Figure 5: QQP dev and MNLI dev-matched accuracy after masking the top-k/bottom-k attention heads in each layer of RoBERTa-QQP and RoBERTa-MNLI.

Figure 4 shows that downstream task accuracy decreases more quickly when the top-k attention heads in BERT are masked. Especially for the QQP task, after masking the bottom-7 attention heads in all layers, accuracy is still higher than 80%, which is more than 10 points higher than after masking the top-7 attention heads.

Figure 5 shows that masking RoBERTa yields different results from BERT. For the QQP task, when k is smaller than or equal to 6, masking the bottom-k attention heads in all layers decreases accuracy faster; for the MNLI task, the same holds when k is 1 or 2. When k is larger than 6 on the QQP task and larger than 2 on the MNLI task, masking the top-k heads decreases accuracy faster.

For BERT, these results show that attention heads with better CGI ability are more important for the model’s NLU ability on these two tasks. For RoBERTa, the connection between CGI ability and NLU ability is not as strong as for BERT: on the MNLI task, we can still see that heads with better CGI ability are more important, but on the QQP task the better heads are not as important.
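As a concrete illustration of this masking protocol, the sketch below builds a head mask for a Hugging Face BERT model, whose forward pass accepts a head_mask of shape (num_layers, num_heads) in which 0 disables a head. The ranked_heads structure and the checkpoint path are our own illustrative assumptions, not the paper’s released code.

import torch

def build_head_mask(ranked_heads, k, mask_top=True, n_layers=12, n_heads=12):
    """ranked_heads[l]: head indices of layer l sorted by S-F1, best first.
    Returns a (n_layers, n_heads) mask with the top-k or bottom-k heads
    of every layer set to 0 (disabled)."""
    mask = torch.ones(n_layers, n_heads)
    for layer, heads in enumerate(ranked_heads):
        targets = heads[:k] if mask_top else heads[-k:]
        mask[layer, list(targets)] = 0.0
    return mask

# Illustrative usage with a (hypothetical) QQP-fine-tuned checkpoint:
# from transformers import BertForSequenceClassification
# model = BertForSequenceClassification.from_pretrained("path/to/bert-qqp")
# logits = model(input_ids, attention_mask=attention_mask,
#                head_mask=build_head_mask(ranked_heads, k=3)).logits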
5 Discussion

The experiments detailed in the previous sections indicate that the attention heads in BERT and RoBERTa do not fully learn constituency grammar. Even after fine-tuning on downstream tasks, the best constituency parsing score does not change much. Our results are similar to those of Htut et al. (2019), who point out that attention heads likewise do not fully learn dependency syntax and that fine-tuning does not change this. This raises an interesting question: do attention heads simply not contain syntactic (constituency or dependency) information? If so, where does BERT encode this information? And is syntactic information unimportant for BERT’s language understanding? Our simple experiment in §4.3 shows that the attention heads with better constituency grammar inducing ability are not important for RoBERTa on the QQP task. Glavaš and Vulić (2020) also point out that leveraging explicit formalized syntactic structures provides zero to negligible impact on NLU tasks. The relations between syntax and BERT’s NLU ability still need further analysis.

6 Conclusion

In this work, we investigate whether the attention heads in BERT and RoBERTa have learned constituency grammar, before and after fine-tuning. We use a method that extracts constituency parse trees without any training, and observe that the upper layers of BERT and the middle layers of RoBERTa show better constituency grammar inducing (CGI) ability. Certain attention heads induce specific phrase types well, but none of the heads show strong overall CGI ability. Furthermore, we observe that fine-tuning with SMS tasks decreases the average CGI ability of the upper layers, while NLI tasks increase it. Lastly, we mask heads based on their parsing S-F1 scores and show that attention heads with better CGI ability are more important for BERT on the QQP and MNLI tasks; for RoBERTa, better heads are not as important on the QQP task.

One direction for future research would be to further study the relations between downstream tasks and the CGI ability of attention heads, and to explain why different tasks have different effects.
Acknowledgments
This project grew out of a master course project for the Fall 2020 Uppsala University course 5LN714, Language Technology: Research and Development. We would like to thank Sara Stymne and Ali Basirat for their great suggestions, and the anonymous reviewers for their excellent feedback.
References
Daniel Cer, Mona Diab, Eneko Agirre, Iñigo Lopez-Gazpio, and Lucia Specia. 2017. SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 1–14, Vancouver, Canada. Association for Computational Linguistics.

Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D. Manning. 2019. What does BERT look at? An analysis of BERT’s attention. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 276–286, Florence, Italy. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Goran Glavaš and Ivan Vulić. 2020. Is supervised syntactic parsing beneficial for language understanding? An empirical investigation.

Yoav Goldberg. 2019. Assessing BERT’s syntactic abilities.

John Hewitt and Christopher D. Manning. 2019. A structural probe for finding syntax in word representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4129–4138, Minneapolis, Minnesota. Association for Computational Linguistics.

Phu Mon Htut, Jason Phang, Shikha Bordia, and Samuel R. Bowman. 2019. Do attention heads in BERT track syntactic dependencies?

Taeuk Kim, Jihun Choi, Daniel Edmiston, and Sang-goo Lee. 2020. Are pre-trained language models aware of phrases? Simple but strong baselines for grammar induction.

Olga Kovaleva, Alexey Romanov, Anna Rogers, and Anna Rumshisky. 2019. Revealing the dark secrets of BERT. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4365–4374, Hong Kong, China. Association for Computational Linguistics.

Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2020. ALBERT: A lite BERT for self-supervised learning of language representations. In International Conference on Learning Representations.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach.

Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.

David Mareček and Rudolf Rosa. 2018. Extracting syntactic trees from transformer encoder self-attentions. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 347–349, Brussels, Belgium. Association for Computational Linguistics.

Paul Michel, Omer Levy, and Graham Neubig. 2019. Are sixteen heads really better than one? In Advances in Neural Information Processing Systems 32, pages 14014–14024. Curran Associates, Inc.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas. Association for Computational Linguistics.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.

Yikang Shen, Zhouhan Lin, Athul Paul Jacob, Alessandro Sordoni, Aaron Courville, and Yoshua Bengio. 2018. Straight to the tree: Constituency parsing with neural syntactic distance. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1171–1180, Melbourne, Australia. Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355, Brussels, Belgium. Association for Computational Linguistics.

Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122. Association for Computational Linguistics.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. HuggingFace’s Transformers: State-of-the-art natural language processing.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R. Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems 32, pages 5753–5763. Curran Associates, Inc.

Yu Zhang, Houquan Zhou, and Zhenghua Li. 2020. Fast and accurate neural CRF constituency parsing. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-20, pages 4046–4053. International Joint Conferences on Artificial Intelligence Organization. Main track.

Yiyun Zhao and Steven Bethard. 2020. How does BERT’s attention change when you fine-tune? An analysis methodology and a case study in negation scope. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4729–4747, Online. Association for Computational Linguistics.

Yukun Zhu, Ryan Kiros, Richard Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books.
A Jensen-Shannon Distance Measure Function
The Jensen-Shannon function measures the distance between two distributions. Suppose that we have two distributions P and Q; the Jensen-Shannon distance is defined as

JSD(P || Q) = ( (D_KL(P || M) + D_KL(Q || M)) / 2 )^{1/2},    (5)

where M = (P + Q)/2 and D_KL(A || B) = \sum_w A(w) log(A(w)/B(w)).
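For reference, Eq. (5) can be transcribed directly in NumPy. This is a sketch: the small epsilon for numerical stability is our own addition, not part of the definition, and the result is equivalent in spirit to scipy.spatial.distance.jensenshannon.

import numpy as np

def js_distance(p, q, eps=1e-12):
    """Jensen-Shannon distance between two discrete distributions (Eq. 5)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = (p + q) / 2
    kl = lambda a, b: np.sum(a * np.log((a + eps) / (b + eps)))
    return np.sqrt((kl(p, m) + kl(q, m)) / 2)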
B Syntactic Distances to Constituency Trees Algorithm

Algorithm 1: Syntactic Distances to Constituency Trees (Shen et al., 2018)

Input: S = [w_1, w_2, ..., w_n], a sentence with n words; d = [d_1, d_2, ..., d_{n-1}], a sequence of distances between every two adjacent words.

function TREE(S, d)
    if d is empty then
        node ← Leaf(S[0])
    else
        i ← argmax_i(d)
        lchild ← TREE(S_{≤i}, d_{<i})
        rchild ← TREE(S_{>i}, d_{>i})
        node ← Node(lchild, rchild)
    end if
    return node
end function
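A runnable Python transcription of Algorithm 1 (variable names are ours) could look as follows; it returns a nested-list binary tree.

def tree(words, dists):
    """Recursively split a sentence at the largest syntactic distance
    (Shen et al., 2018). Returns a nested-list binary tree."""
    if len(words) == 1:          # `dists` is empty: leaf node
        return words[0]
    i = max(range(len(dists)), key=dists.__getitem__)
    left = tree(words[: i + 1], dists[:i])
    right = tree(words[i + 1 :], dists[i + 1 :])
    return [left, right]

# Example: tree(["the", "cat", "sat"], [0.2, 0.9]) -> [["the", "cat"], "sat"]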
C BERT and RoBERTa Heatmaps