Attention Can Reflect Syntactic Structure (If You Let It)
Vinit Ravishankar, Artur Kulmizev, Mostafa Abdou, Anders Søgaard, Joakim Nivre
Vinit Ravishankar†∗  Artur Kulmizev‡∗  Mostafa Abdou§  Anders Søgaard§  Joakim Nivre‡
† Language Technology Group, Department of Informatics, University of Oslo
‡ Department of Linguistics and Philology, Uppsala University
§ Department of Computer Science, University of Copenhagen
† [email protected]  ‡ {artur.kulmizev,joakim.nivre}@lingfil.uu.se  § {abdou,soegaard}@di.ku.dk
∗ Equal contribution. Order was decided by a coin toss.

Abstract
Since the popularization of the Transformer as a general-purpose feature encoder for NLP, many studies have attempted to decode linguistic structure from its novel multi-head attention mechanism. However, much of such work focused almost exclusively on English, a language with rigid word order and a lack of inflectional morphology. In this study, we present decoding experiments for multilingual BERT across 18 languages in order to test the generalizability of the claim that dependency syntax is reflected in attention patterns. We show that full trees can be decoded above baseline accuracy from single attention heads, and that individual relations are often tracked by the same heads across languages. Furthermore, in an attempt to address recent debates about the status of attention as an explanatory mechanism, we experiment with fine-tuning mBERT on a supervised parsing objective while freezing different series of parameters. Interestingly, in steering the objective to learn explicit linguistic structure, we find much of the same structure represented in the resulting attention patterns, with interesting differences with respect to which parameters are frozen.
In recent years, the attention mechanism proposed by Bahdanau et al. (2014) has become an indispensable component of many NLP systems. Its widespread adoption was, in part, heralded by the introduction of the Transformer architecture (Vaswani et al., 2017a), which constrains a soft alignment to be learned across discrete states in the input (self-attention), rather than across input and output (e.g., Xu et al., 2015; Rocktäschel et al., 2015). The Transformer has, by now, supplanted the popular LSTM (Hochreiter and Schmidhuber, 1997) [...] seq2seq machine translation, showing that the attention learned by their models reflects expected cross-lingual idiosyncrasies between English and French, e.g., concerning word order. With self-attentive Transformers, interpretation becomes slightly more difficult, as attention is distributed across words within the input itself. This is further compounded by the use of multiple layers and heads, each combination of which yields its own alignment, representing a different (possibly redundant) view of the data. Given the similarity of such attention matrices to the score matrices employed in arc-factored dependency parsing (McDonald et al., 2005a,b), a salient question concerning interpretability becomes: Can we expect some combination of these parameters to capture linguistic structure in the form of a dependency tree, especially if the model performs well on NLP tasks? If not, can we relax the expectation and examine the extent to which subcomponents of the linguistic structure, such as subject-verb relations, are represented? This prospect was first posed by Raganato et al. (2018) for MT encoders, and later explored by Clark et al. (2019) for BERT. Ultimately, the consensus of these and other studies (Voita et al., 2019; Htut et al., 2019; Limisiewicz et al., 2020) was that, while there appears to exist no "generalist" head responsible for extracting full dependency structures, standalone heads often specialize in capturing individual grammatical relations.

Unfortunately, most of such studies focused their experiments entirely on English, which is typologically favored to succeed in such scenarios due to its rigid word order and lack of inflectional morphology. It remains to be seen whether the attention patterns of such models can capture structural features across typologically diverse languages, or if the reported experiments on English are a misrepresentation of local positional heuristics as such. Furthermore, though previous work has investigated how attention patterns might change after fine-tuning on different tasks (Htut et al., 2019), a recent debate about attention as an explanatory mechanism (Jain and Wallace, 2019; Wiegreffe and Pinter, 2019) has cast the entire enterprise in doubt. Indeed, it remains to be seen whether fine-tuning on an explicit structured prediction task, e.g. dependency parsing, can force attention to represent the structure being learned, or if the patterns observed in pretrained models are not altered in any meaningful way.

To address these issues, we investigate the prospect of extracting linguistic structure from the attention weights of multilingual Transformer-based language models. In light of the surveyed literature, our research questions are as follows:

1. Can we decode dependency trees for some languages better than others?
2. Do the same layer-head combinations track the same relations across languages?
3. How do attention patterns change after fine-tuning with explicit syntactic annotation?
4. Which components of the model are involved in these changes?

In answering these questions, we believe we can shed further light on the (cross-)linguistic properties of Transformer-based language models, as well as address the question of attention patterns being a reliable representation of linguistic structure.

Transformers
The focus of the present study is mBERT, a multilingual variant of the exceedingly popular language model (Devlin et al., 2019). BERT is built upon the Transformer architecture (Vaswani et al., 2017b), which is a self-attention-based encoder-decoder model (though only the encoder is relevant to our purposes). A Transformer takes a sequence of vectors x = [x_1, x_2, ..., x_n] as input and applies a positional encoding to them, in order to retain the order of words in a sentence. These inputs are then transformed into query (Q), key (K), and value (V) vectors via three separate linear transformations and passed to an attention mechanism. A single attention head computes scaled dot-product attention between K and Q, outputting a weighted sum of V:

    Attention(Q, K, V) = softmax(QK^T / √d_k) V    (1)

For multi-head attention (MHA), the same process is repeated for k heads, allowing the model to jointly attend to information from different representation subspaces at different positions (Vaswani et al., 2017b). Ultimately, the output of all heads is concatenated and passed through a linear projection W^O:

    H_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)    (2)
    MHA(Q, K, V) = concat(H_1, H_2, ..., H_k) W^O    (3)

Every layer also consists of a feed-forward network (FFN), consisting of two dense layers with ReLU activation functions. For each layer, therefore, the output of MHA is passed through a LayerNorm with residual connections, passed through the FFN, and then through another LayerNorm with residual connections.
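For concreteness, the following is a minimal NumPy sketch of Eqs. (1)-(3) for a single layer, omitting positional encodings, masking, and the FFN/LayerNorm blocks; the dimensions and head count are illustrative rather than mBERT's actual configuration, and all names are our own.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax over the last axis
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Eq. (1): scaled dot-product attention
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # (n, n) alignment matrix
    weights = softmax(scores, axis=-1)    # each row sums to 1
    return weights @ V, weights

def multi_head_attention(X, W_Q, W_K, W_V, W_O):
    # Eqs. (2)-(3): one projection triple (W_Q[i], W_K[i], W_V[i]) per head
    heads, attn_maps = [], []
    for Wq, Wk, Wv in zip(W_Q, W_K, W_V):
        H_i, A_i = attention(X @ Wq, X @ Wk, X @ Wv)
        heads.append(H_i)
        attn_maps.append(A_i)             # per-head attention matrices (what we later decode)
    return np.concatenate(heads, axis=-1) @ W_O, attn_maps

# toy example: 5 tokens, model dim 16, 4 heads of size 4
rng = np.random.default_rng(0)
n, d, h, d_h = 5, 16, 4, 4
X = rng.normal(size=(n, d))
W_Q, W_K, W_V = (rng.normal(size=(h, d, d_h)) for _ in range(3))
W_O = rng.normal(size=(h * d_h, d))
out, attn_maps = multi_head_attention(X, W_Q, W_K, W_V, W_O)
print(out.shape, attn_maps[0].shape)      # (5, 16) (5, 5)
```

The per-head matrices collected in attn_maps are the objects analyzed throughout the rest of this paper.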
Searching for structure
Often, the line of inquiry regarding interpretability in NLP has been concerned with extracting and analyzing linguistic information from neural network models of language (Belinkov and Glass, 2019). Recently, such investigations have targeted Transformer models (Hewitt and Manning, 2019; Rosa and Mareček, 2019; Tenney et al., 2019), at least in part because the self-attention mechanism employed by these models offers a possible window into their inner workings. With large-scale machine translation and language models being openly distributed for experimentation, several researchers have wondered if self-attention is capable of representing syntactic structure, despite not being trained with any overt parsing objective.

In pursuit of this question, Raganato et al. (2018) applied a maximum-spanning-tree algorithm over the attention weights of several trained MT models, comparing them with gold trees from Universal Dependencies (Nivre et al., 2016, 2020). They found that, while the accuracy was not comparable to that of a supervised parser, it was nonetheless higher than several strong baselines, implying that some structure was consistently represented. Clark et al. (2019) corroborated the same findings for BERT when decoding full trees, but observed that individual dependency relations were often tracked by specialized heads and were decodable with much higher accuracy than some fixed-offset baselines. Concurrently, Voita et al. (2019) made a similar observation about heads specializing in specific dependency relations, proposing a coarse taxonomy of head attention functions: positional, where heads attend to adjacent tokens; syntactic, where heads attend to specific syntactic relations; and rare words, where heads point to the least frequent tokens in the sentence. Htut et al. (2019) followed Raganato et al. (2018) in decoding dependency trees from BERT-based models, finding that fine-tuning on two classification tasks did not produce syntactically plausible attention patterns. Lastly, Limisiewicz et al. (2020) modified UD annotation to better represent attention patterns and introduced a supervised head-ensembling method for consolidating shared syntactic information across heads.
Does attention have explanatory value?
Though many studies have yielded insight about how attention behaves in a variety of models, the question of whether it can be seen as a "faithful" explanation of model predictions has been subject to much recent debate. For example, Jain and Wallace (2019) present compelling arguments that attention does not offer a faithful explanation of predictions. Primarily, they demonstrate that there is little correlation between standard feature importance measures and attention weights. Furthermore, they contend that there exist counterfactual attention distributions, which are substantially different from learned attention weights but that do not alter a model's predictions. Using a similar methodology, Serrano and Smith (2019) corroborate that attention does not provide an adequate account of an input component's importance.

In response to these findings, Wiegreffe and Pinter (2019) question the assumptions underlying such claims. Attention, they argue, is not a primitive, i.e., it cannot be detached from the rest of a model's components as is done in the experiments of Jain and Wallace (2019). They propose a set of four analyses to test whether a given model's attention mechanism can provide meaningful explanation and demonstrate that the alternative attention distributions found via adversarial training methods do, in fact, perform poorly compared to standard attention mechanisms. On a theoretical level, they argue that, although attention weights do not give an exclusive "faithful" explanation, they do provide a meaningful plausible explanation.

This discussion is relevant to our study because it remains unclear whether or not attending to syntactic structure serves, in practice, as a plausible explanation for model behavior, or whether or not it is even capable of serving as such. Indeed, the studies of Raganato et al. (2018) and Clark et al. (2019) relate a convincing but incomplete picture: tree decoding accuracy just marginally exceeds baselines, and various relations tend to be tracked across varying heads and layers. Thus, our fine-tuning experiments (detailed in the following section) serve to enable an "easy" setting wherein we explicitly inform our models of the same structure that we are trying to extract. We posit that, if, after fine-tuning, syntactic structures were still not decodable from the attention weights, one could safely conclude that these structures are being stored via a non-transparent mechanism that may not even involve attention weights. Such an insight would allow us to conclude that attention weights cannot provide even a plausible explanation for models relying on syntax.
To examine the extent to which we can decode dependency trees from attention patterns, we run a tree decoding algorithm over mBERT's attention heads, before and after fine-tuning via a parsing objective. We surmise that doing so will enable us to determine if attention can be interpreted as a reliable mechanism for capturing linguistic structure.
We employ mBERT in our experiments, which has been shown to perform well across a variety of NLP tasks (Hu et al., 2020; Kondratyuk and Straka, 2019a) and capture aspects of syntactic structure cross-lingually (Pires et al., 2019; Chi et al., 2020). mBERT features 12 layers with 768 hidden units and 12 attention heads, with a joint WordPiece sub-word vocabulary across languages. The model was trained on the concatenation of WikiDumps for the top 104 languages with the largest Wikipedias (https://github.com/google-research/bert), where principled sampling was employed to enforce a balance between high- and low-resource languages.

For decoding dependency trees, we follow Raganato et al. (2018) in applying the Chu-Liu-Edmonds maximum spanning tree algorithm (Chu, 1965) to every layer/head combination available in mBERT (12 × 12 = 144 in total). In order for the matrices to correspond to gold treebank tokenization, we remove the cells corresponding to the BERT delimiter tokens ([CLS] and [SEP]). In addition to this, we sum the columns and average the rows corresponding to the constituent subwords of gold tokens, respectively (Clark et al., 2019). Lastly, since attention patterns across heads may differ in whether they represent heads attending to their dependents or vice versa, we take our input to be the element-wise product of a given attention matrix and its transpose (A ∘ A^T). We liken this to the joint probability of a head attending to its dependent and a dependent attending to its head, similarly to Limisiewicz et al. (2020). Per this point, we also follow Htut et al. (2019) in evaluating the decoded trees via Undirected Unlabeled Attachment Score (UUAS), the percentage of undirected edges recovered correctly. Since we discount directionality, this is effectively a less strict measure than UAS, but one that has a long tradition in unsupervised dependency parsing since Klein and Manning (2004).
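The decoding step can be sketched as follows for a single head, assuming the attention matrix has already been reduced to gold tokenization. Because the scores are symmetrized via A ∘ A^T and UUAS ignores direction, the sketch substitutes a simple undirected maximum spanning tree (Prim's algorithm) for the Chu-Liu-Edmonds arborescence used in our experiments; all helper names are our own.

```python
import numpy as np

def max_spanning_tree(sym):
    """Prim's algorithm for a maximum spanning tree over a dense symmetric score matrix."""
    n = sym.shape[0]
    in_tree, edges = {0}, set()
    while len(in_tree) < n:
        best = max(((sym[i, j], i, j) for i in in_tree for j in range(n) if j not in in_tree),
                   key=lambda t: t[0])
        edges.add(tuple(sorted((best[1], best[2]))))
        in_tree.add(best[2])
    return edges

def decode_edges(attn):
    sym = attn * attn.T            # A ∘ A^T: head attends to dependent and vice versa
    np.fill_diagonal(sym, 0.0)
    return max_spanning_tree(sym)

def uuas(pred_edges, gold_heads):
    """gold_heads are 1-based UD head indices (0 = root); UUAS ignores edge direction."""
    gold = {tuple(sorted((dep, head - 1))) for dep, head in enumerate(gold_heads) if head != 0}
    return len(pred_edges & gold) / len(gold)

def adjacent_baseline(n_tokens):
    # the adjacent-branching baseline: link every pair of neighbouring words
    return {(i, i + 1) for i in range(n_tokens - 1)}

# toy usage with a random stand-in for one head's (already subword-merged) attention matrix
rng = np.random.default_rng(0)
attn = rng.random((5, 5))
gold_heads = [2, 0, 2, 5, 2]       # 1-based UD heads; token 2 is the root
print(uuas(decode_edges(attn), gold_heads))
print(uuas(adjacent_baseline(5), gold_heads))
```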
For our data, we employ the Parallel Universal Dependencies (PUD) treebanks, as collected in UD v2.4 (Nivre et al., 2019). PUD was first released as part of the CoNLL 2017 shared task (Zeman et al., 2018), containing 1000 parallel sentences, which were (professionally) translated from English, German, French, Italian, and Spanish to 14 other languages. The sentences are taken from two domains, news and wikipedia, the latter implying some overlap with mBERT's training data (though we did not investigate this). We include all PUD treebanks except Thai, since Thai is the only treebank that does not have a non-PUD treebank available in UD, which we need for our fine-tuning experiments.

In addition to exploring pretrained mBERT's attention weights, we are also interested in how attention might be guided by a training objective that learns the exact tree structure we aim to decode. To this end, we employ the graph-based decoding algorithm of the biaffine parser introduced by Dozat and Manning (2016). We replace the standard BiLSTM encoder for this parser with the entire mBERT network, which we fine-tune with the parsing loss. The full parser decoder consists of four dense layers, two for head/child representations for dependency arcs (dim. 500) and two for head/child representations for dependency labels (dim. 100). These are transformed into the label space via a bilinear transform.
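As a rough illustration (not our exact implementation), a Dozat-and-Manning-style biaffine arc scorer over encoder states could be written as below; dimensions follow the description above, and the class name is invented for this sketch. The label scorer follows the same pattern with 100-dimensional representations and a bilinear output per label.

```python
import torch
import torch.nn as nn

class BiaffineArcScorer(nn.Module):
    """Scores every (dependent, head) pair from encoder states, in the spirit of Dozat & Manning (2016)."""
    def __init__(self, enc_dim=768, arc_dim=500):
        super().__init__()
        self.head_mlp = nn.Sequential(nn.Linear(enc_dim, arc_dim), nn.ReLU())
        self.dep_mlp = nn.Sequential(nn.Linear(enc_dim, arc_dim), nn.ReLU())
        # biaffine weight plus a bias term applied to the head representation
        self.W = nn.Parameter(torch.empty(arc_dim, arc_dim))
        self.b = nn.Parameter(torch.zeros(arc_dim))
        nn.init.xavier_uniform_(self.W)

    def forward(self, enc):              # enc: (batch, n, enc_dim), e.g. mBERT's top layer
        h = self.head_mlp(enc)           # candidate-head representations   (batch, n, arc_dim)
        d = self.dep_mlp(enc)            # candidate-dependent representations
        # score[b, i, j] = d_i^T W h_j + b^T h_j : dependent i choosing head j
        scores = torch.einsum('bid,de,bje->bij', d, self.W, h)
        scores = scores + torch.einsum('e,bje->bj', self.b, h).unsqueeze(1)
        return scores                    # train with cross-entropy over each row's head choice

scorer = BiaffineArcScorer()
scores = scorer(torch.randn(2, 7, 768))  # a batch of two 7-token sentences
print(scores.shape)                      # torch.Size([2, 7, 7])
```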
After training the parser, we can decode the fine-tuned mBERT parameters in the same fashion as described in Section 3.2. We surmise that, if attention heads are capable of tracking hierarchical relations between words in any capacity, it is precisely in this setting that this ability would be attested. In addition to this, we are interested in what individual components of the mBERT network are capable of steering attention patterns towards syntactic structure. We believe that addressing this question will help us not only in interpreting decisions made by BERT-based neural parsers, but also in developing syntax-aware models in general (Strubell et al., 2018; Swayamdipta et al., 2018). As such, beyond fine-tuning all parameters of the mBERT network (our basic setting), we perform a series of ablation experiments wherein we update only one set of parameters per training cycle, e.g. the Query weights W_i^Q, and leave everything else frozen. This gives us a set of 6 models, which are described below; a sketch of the freezing logic follows the list. For each model, all non-BERT parser components are always left unfrozen.

• KEY: only the K components of the transformer are unfrozen; these are the representations of tokens that are paying attention to other tokens.
• QUERY: only the Q components are unfrozen; these, conversely, are the representations of tokens being paid attention to.
• KQ: both keys and queries are unfrozen.
• VALUE: semantic value vectors per token (V) are unfrozen; they are composed after being weighted with attention scores obtained from the K/Q matrices.
• DENSE: the dense feed-forward networks in the attention mechanism; all three per layer are unfrozen.
• NONE: the basic setting with nothing frozen; all parameters are updated with the parsing loss.
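To illustrate the ablation mechanics, the following sketch freezes everything except one component. It assumes a HuggingFace-style BertModel whose per-layer parameters are registered under names containing attention.self.query / .key / .value and *.dense; the substring mapping is our own approximation of the settings above, not our exact training code.

```python
from transformers import BertModel

# substrings of parameter names to leave trainable in each ablation setting
SETTINGS = {
    "KEY":   ["attention.self.key."],
    "QUERY": ["attention.self.query."],
    "KQ":    ["attention.self.key.", "attention.self.query."],
    "VALUE": ["attention.self.value."],
    "DENSE": ["attention.output.dense.", "intermediate.dense.", "output.dense."],
    "NONE":  [""],   # the empty substring matches every name, so nothing is frozen
}

def freeze_except(model, setting):
    """Freeze all mBERT parameters except the chosen component; parser layers stay trainable."""
    keep = SETTINGS[setting]
    for name, param in model.named_parameters():
        param.requires_grad = any(k in name for k in keep)

model = BertModel.from_pretrained("bert-base-multilingual-cased")
freeze_except(model, "KEY")
trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(len(trainable), trainable[:2])
```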
          AR     CS     DE     EN     ES     FI     FR     HI     ID     IT     JA     KO     PL     PT     RU     SV     TR     ZH
BASELINE  50     40     36     36     40     42     40     46     47     40     43     55     45     41     42     39     52     41
PRE       53     53     49     47     50     48     41     48     50     41     –      64     52     50     51     51     55     42
          7-6    10-8   10-8   10-8   9-5    10-8   2-3    2-3    9-5    6-4    2-3    9-2    10-8   9-5    10-8   10-8   3-8    2-3
NONE      76     78     76     71     77     66     45     72     75     58     42     64     75     76     75     74     55     38
          11-10  11-10  11-10  10-11  10-11  10-11  11-10  11-10  11-10  11-10  11-10  11-10  11-10  11-10  10-8   10-8   3-8    2-3
KEY       62     64     58     53     59     56     41     54     59     47     44     62     64     58     61     59     55     41
          10-8   10-8   11-12  10-8   11-12  10-8   7-12   10-8   10-8   9-2    2-3    10-8   10-8   11-12  10-8   12-10  3-12   2-3
QUERY     69     74     70     66     73     63     42     62     67     54     –      65     72     70     70     68     56     42
          11-4   10-8   11-4   11-4   11-4   10-8   11-4   11-4   11-4   11-4   2-3    10-8   11-4   11-4   10-8   11-4   10-8   2-3
KQ        71     76     70     65     74     62     43     64     69     55     44     64     73     73     69     69     55     41
          11-4   11-4   11-4   11-4   11-4   11-4   10-11  11-4   11-4   11-4   2-3    11-4   11-4   11-4   11-4   11-4   11-4   2-3
VALUE     75     72     72     64     76     59     –      63     73     55     45     66     73     74     69     65     57     42
DENSE     68     71     65     60     67     61     42     65     66     49     44     64     70     64     67     64     55     40
          11-10  11-10  11-10  10-8   12-10  11-10  10-8   11-10  11-10  9-5    3-12   11-10  11-10  12-5   11-10  11-10  11-10  3-12
Table 1: Adjacent-branching baseline and maximum UUAS decoding accuracy per PUD treebank, expressed as best score and best layer/head combination for UUAS decoding. PRE refers to the basic mBERT model before fine-tuning, while all cells below correspond to different fine-tuned models described in Section 3.4. Best score indicated in bold.

We fine-tune each of these models on a concatenation of all PUD treebanks for 20 epochs, which effectively makes our model multilingual. We do so in order to 1) control for domain and annotation confounds, since all PUD sentences are parallel and are natively annotated (unlike converted UD treebanks, for instance); 2) increase the number of training samples for fine-tuning, as each PUD treebank features only 1000 sentences; and 3) induce a better parser through multilinguality, as in Kondratyuk and Straka (2019b). Furthermore, in order to gauge the overall performance of our parser across all ablated settings, we evaluate on the test set of the largest non-PUD treebank available for each language, since PUD only features test partitions. When training, we employ a combined dense/sparse Adam optimiser, at a learning rate of ∗ − . We rescale gradients to have a maximum norm of 5.
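A minimal sketch of the corresponding training step is given below. It uses plain Adam with an arbitrary placeholder learning rate in place of the combined dense/sparse variant, and stand-in modules instead of the real parser and PUD batches, but applies the stated gradient rescaling to a maximum norm of 5.

```python
import torch
import torch.nn as nn

# stand-ins for the fine-tuned parser and PUD batches; names and shapes are illustrative only
parser = nn.Linear(768, 1)
train_loader = [(torch.randn(8, 768), torch.randn(8, 1)) for _ in range(4)]

optimizer = torch.optim.Adam(
    (p for p in parser.parameters() if p.requires_grad),   # respects the freezing above
    lr=1e-5,                                               # placeholder learning rate
)

for epoch in range(20):                                    # fine-tuned for 20 epochs
    for x, y in train_loader:
        optimizer.zero_grad()
        loss = nn.functional.mse_loss(parser(x), y)        # stands in for the parsing loss
        loss.backward()
        torch.nn.utils.clip_grad_norm_(parser.parameters(), max_norm=5.0)  # "maximum norm of 5"
        optimizer.step()
```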
Figure 1: UUAS of MST decoding per layer and head, across languages. Heads (y-axis) are sorted by accuracy for easier visualization.

The second row of Table 1 (PRE) depicts the UUAS after running our decoding algorithm over mBERT attention matrices, per language. We see a familiar pattern to that in Clark et al. (2019) among others, namely that attention patterns extracted directly from mBERT appear to be incapable of decoding dependency trees beyond a threshold of 50–60% UUAS accuracy. However, we also note that, in all languages, the attention-decoding algorithm outperforms a BASELINE (row 1) that draws an (undirected) edge between any two adjacent words in linear order, which implies that some non-linear structures are captured with regularity. Indeed, head 8 in layer 10 appears to be particularly strong in this regard, returning the highest UUAS for 7 languages. Interestingly, the accuracy patterns across layers depicted in Figure 1 tend to follow an identical trend for all languages, with nearly all heads in layer 7 returning high within-language accuracies.

It appears that attention for some languages (Arabic, Czech, Korean, Turkish) is comparatively easier to decode than others (French, Italian, Japanese, Chinese). A possible explanation for this result is that dependency relations between content words, which are favored by the UD annotation, are more likely to be adjacent in the morphologically rich languages of the first group (without intervening function words). This assumption seems to be corroborated by the high baseline scores for Arabic, Korean and Turkish (but not Czech). Conversely, the low baseline scores and the likewise low decoding accuracies for the latter four languages are difficult to characterize. Indeed, we could not identify what factors (typological, annotation, tokenization or otherwise) would set French and Italian apart from the remaining languages in terms of score. However, we hypothesize that the tokenization and our treatment of subword tokens play a part in attempting to decode attention from Chinese and Japanese representations. Per the mBERT documentation (https://github.com/google-research/bert/blob/master/multilingual.md), Chinese and Japanese Kanji character spans within the CJK Unicode range are character-tokenized. This lies in contrast with all other languages (Korean Hangul and Japanese Hiragana and Katakana included), which rely on whitespace and WordPiece (Wu et al., 2016). It is thus possible that the attention distributions for these two languages (at least where CJK characters are relevant) are devoted to composing words, rather than structural relations, which will distort the attention matrices that we compute to correspond with gold tokenization (e.g. by maxing rows and averaging columns).
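The subword-to-token reduction itself can be illustrated with a small helper, following the recipe cited in Section 3.2 (sum the columns belonging to a gold token, average its rows); the function and the word_ids mapping are our own names for this sketch.

```python
import numpy as np

def merge_subwords(attn, word_ids):
    """Collapse a subword-level attention matrix onto gold tokens:
    attention *to* a word's pieces (columns) is summed, attention *from* them (rows) is averaged."""
    n_words = max(word_ids) + 1
    # sum columns belonging to the same gold token
    cols = np.zeros((attn.shape[0], n_words))
    for j, w in enumerate(word_ids):
        cols[:, w] += attn[:, j]
    # average rows belonging to the same gold token
    merged = np.zeros((n_words, n_words))
    for w in range(n_words):
        rows = [i for i, x in enumerate(word_ids) if x == w]
        merged[w] = cols[rows].mean(axis=0)
    return merged

# toy: 4 subwords forming 3 gold tokens, e.g. ["new", "yo", "##rk", "is"] -> word_ids [0, 1, 1, 2]
attn = np.full((4, 4), 0.25)
print(merge_subwords(attn, [0, 1, 1, 2]))   # rows of the merged matrix still sum to 1
```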
Figure 2: Left: UUAS per relation across languages (best layer/head combination indicated in cell). Right: Best UUAS as a function of best positional baseline (derived from the treebank), selected relations.

Relation analysis

We can disambiguate what sort of structures are captured with regularity by looking at the UUAS returned per dependency relation. Figure 2 (left) shows that adjectival modifiers (amod, mean UUAS = ± ) and determiners (det, ± ) are among the easiest relations to decode across languages. Indeed, words that are connected by these relations are often adjacent to each other and may be simple to decode if a head is primarily concerned with tracking linear order. To verify the extent to which this might be happening, we plot the aforementioned decoding accuracy as a function of select relations' positional baselines in Figure 2 (right). The positional baselines, in this case, are calculated by picking the most frequent offset at which a dependent occurs with respect to its head, e.g., −1 for det in English, meaning one position to the left of the head. Interestingly, while we observe significant variation across the positional baselines for amod and det, the decoding accuracy remains quite high.

In slight contrast to this, the core subject (nsubj, ± SD) and object (obj, ± ) relations prove to be more difficult to decode. Unlike the aforementioned relations, nsubj and obj are much more sensitive to the word order properties of the language at hand. For example, while a language like English, with Subject-Verb-Object (SVO) order, might have the subject frequently appear to the left of the verb, an SOV language like Hindi might have it several positions further away, with an object and its potential modifiers intervening. Indeed, the best positional baseline for English nsubj is 39 UUAS, while it is only 10 for Hindi. Despite this variation, the relation seems to be tracked with some regularity by the same head (layer 3, head 9), returning 60 UUAS for English and 52 for Hindi. The same can largely be said for obj, where the positional baselines return ± . In this latter case, however, the heads tend to be much differently distributed across languages. Finally, the results for the obj relation provide some support for our earlier explanation concerning morphologically rich languages, as Arabic, Czech, Korean and Turkish all have among the highest accuracies (as well as positional baselines).
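The positional baseline can be computed directly from treebank arcs; the small helper below is our own sketch, with arcs given as (dependent, head, relation) position triples.

```python
from collections import Counter, defaultdict

def positional_baselines(sentences):
    """Modal signed offset (dependent - head) per relation, and how often that offset is correct."""
    offsets = defaultdict(Counter)
    for arcs in sentences:
        for dep, head, rel in arcs:        # 0-based token positions, root arcs excluded
            offsets[rel][dep - head] += 1
    baselines = {}
    for rel, counts in offsets.items():
        offset, hits = counts.most_common(1)[0]
        baselines[rel] = (offset, hits / sum(counts.values()))
    return baselines

# toy example: two sentences given as (dependent, head, relation) arcs
sents = [
    [(0, 1, "det"), (1, 2, "nsubj"), (3, 2, "obj")],
    [(0, 1, "det"), (2, 3, "amod")],
]
print(positional_baselines(sents))
# {'det': (-1, 1.0), 'nsubj': (-1, 1.0), 'obj': (1, 1.0), 'amod': (-1, 1.0)}
```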
Next, we investigate the effect fine-tuning has on UUAS decoding. Row 3 in Table 1 (NONE) indicates that fine-tuning does result in large improvements to UUAS decoding across most languages, often by margins as high as ∼ . This shows that with an explicit parsing objective, attention heads are capable of serving as explanatory mechanisms for syntax; syntactic structure can be made to be transparently stored in the heads, in a manner that does not require additional probe fitting or a parameterized transformation to extract.

Given that we do manage to decode reasonable syntactic trees, we can then refine our question: what components are capable of learning these trees? One obvious candidate is the key/query component pair, given that attention weights are a scaled softmax of a composition of the two. Figure 3 (top) shows the difference between pretrained UUAS and fine-tuned UUAS per layer, across models and languages. Interestingly, the best parsing accuracies do not appear to vary much depending on what component is frozen. We do see a clear trend, however, in that decoding the attention patterns of the fine-tuned model typically yields better UUAS than the pretrained model, particularly in the highest layers. Indeed, the lowest layer at which fine-tuning appears to improve decoding is layer 7. This implies that, regardless of which component remains frozen, the parameters facing any sort of significant and positive update tend to be those appearing towards the higher end of the network, closer to the output.

For the frozen components, the best improvements in UUAS are seen at the final layer in VALUE, which is also the only model that shows consistent improvement, as well as the highest average improvement in mean scores for the last few layers. Perhaps most interestingly, the mean UUAS (Figure 3, bottom) for our "attentive" components (keys, queries, and their combination) does not appear to have improved by much after fine-tuning; here the inner average is over all heads and the outer over all languages. In contrast, the maximum does show considerable improvement; this seems to imply that although all components appear to be more or less equally capable of learning decodable heads, the attentive components, when fine-tuned, appear to sharpen fewer heads.

Note that the only difference between keys and queries in an attention mechanism is that keys are transposed to index attention from/to appropriately. Surprisingly, KEY and QUERY appear to act somewhat differently, with QUERY being almost uniformly better than KEY with the best heads, whilst KEY is slightly better with averages, implying distinctions in how both store information. Furthermore, allowing both keys and queries to update seems to result in an interesting contradiction: the ultimate layer, which has reasonable maximums and averages for both KEY and QUERY, now seems to show a UUAS drop almost uniformly. This is also true for the completely unfrozen encoder.

Figure 3: (Top) best scores across all heads, per language; (bottom) mean scores across all heads, per language. The languages (hidden from the x-axis for brevity) are, in order, ar, cs, de, en, es, fi, fr, hi, id, it, ja, ko, pl, pt, ru, sv, tr, zh.
Supervised Parsing
In addition to decoding trees from attention matrices, we also measure supervised UAS/LAS on a held-out test set. Note that the test set in our scenario is from the actual, non-parallel language treebank; as such, we left Korean out of this comparison due to annotation differences. Based on Figure 4, it is apparent that all settings result in generally the same UAS. This is somewhat expected; Lauscher et al. (2020) see better results on parsing with the entire encoder frozen, implying that the task is easy enough for a biaffine parser to learn, given frozen mBERT representations (note, however, that due to training on concatenated PUD sets, our results are not directly comparable).
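For reference, attachment scores can be computed as below from predicted and gold head indices and relation labels; this follows the standard UAS/LAS definitions rather than any evaluation code from our experiments.

```python
def attachment_scores(pred_heads, pred_rels, gold_heads, gold_rels):
    """UAS: fraction of tokens with the correct head; LAS additionally requires the correct label."""
    assert len(pred_heads) == len(gold_heads)
    total = len(gold_heads)
    uas_hits = sum(p == g for p, g in zip(pred_heads, gold_heads))
    las_hits = sum(p == g and pr == gr
                   for p, g, pr, gr in zip(pred_heads, gold_heads, pred_rels, gold_rels))
    return uas_hits / total, las_hits / total

# toy example over a 4-token sentence (head 0 marks the root)
uas, las = attachment_scores([2, 0, 2, 2], ["det", "root", "nsubj", "obj"],
                             [2, 0, 2, 1], ["det", "root", "nsubj", "obl"])
print(uas, las)   # 0.75 0.75
```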
Figure 4: Mean UAS and LAS when evaluating different models on language-specific treebanks (Korean excluded due to annotation differences). MBERT refers to models where the entire mBERT network is frozen as input to the parser.

The LAS distinction is, however, rather interesting: there is a marked difference between how important the dense layers are, as opposed to the attentive components. This is likely not reflected in our UUAS probe as, strictly speaking, labelling arcs is not equivalent to searching for structure in sentences, but more akin to classifying pre-identified structures. We also note that DENSE appears to be better than NONE on average, implying that non-dense components might actually be hurting labelling capacity.

In brief, consolidating the two sets of results above, we can draw three interesting conclusions about the components:

1. Value vectors are best aligned with syntactic dependencies; this is reflected both in the best head at the upper layers, and the average score across all heads.
2. Dense layers appear to have moderate informative capacity, but appear to have the best learning capacity for the task of arc labelling.
3. Perhaps most surprisingly, Key and Query vectors do not appear to make any outstanding contributions, save for sharpening a smaller subset of heads.

Our last result is especially surprising for UUAS decoding. Keys and queries, fundamentally, combine to form the attention weight matrix, which is precisely what we use to decode trees. One would expect that allowing these components to learn from labelled syntax would result in the best improvements to decoding, but all three have surprisingly negligible mean improvements. This indicates that we need to further improve our understanding of how attentive structure and weighting really works.
Cross-linguistic observations
We notice no clear cross-linguistic trends here across different component sets; however, certain languages do stand out as being particularly hard to decode from the fine-tuned parser. These include Japanese, Korean, Chinese, French and Turkish. For the first three, we hypothesize that tokenization clashes with BERT's internal representations may play a role. Indeed, as we hypothesized in Section 3.2, it could be the case that the composition of CJK characters into gold tokens for Chinese and Japanese may degrade the representations (and their corresponding attention) therein. Furthermore, for Japanese and Korean specifically, it has been observed that tokenization strategies employed by different treebanks could drastically influence the conclusions one may draw about their inherent hierarchical structure (Kulmizev et al., 2020). Turkish and French are admittedly more difficult to diagnose. Note, however, that we fine-tuned our model on a concatenation of all PUD treebanks. As such, any deviation from PUD's annotation norms is therefore likely to be heavily penalised, by virtue of signal from other languages drowning out these differences.
In this study, we revisited the prospect of decoding dependency trees from the self-attention patterns of Transformer-based language models. We elected to extend our experiments to 18 languages in order to gain better insight into how tree decoding accuracy might be affected in the face of (modest) typological diversity. Surprisingly, across all languages, we were able to decode dependency trees from attention patterns more accurately than an adjacent-linking baseline, implying that some structure was indeed being tracked by the mechanism. In looking at specific relation types, we corroborated previous studies in showing that particular layer-head combinations tracked the same relation with regularity across languages, despite typological differences concerning word order, etc.

In investigating the extent to which attention can be guided to properly capture structural relations between input words, we fine-tuned mBERT as input to a dependency parser. This, we found, yielded large improvements over the pretrained attention patterns in terms of decoding accuracy, demonstrating that the attention mechanism was learning to represent the structural objective of the parser. In addition to fine-tuning the entire mBERT network, we conducted a series of experiments wherein we updated only select components of the model and left the remainder frozen. Most surprisingly, we observed that the Transformer parameters designed for composing the attention matrix, K and Q, were only modestly capable of guiding the attention towards resembling the dependency structure. In contrast, it was the Value (V) parameters, which are used for computing a weighted sum over the KQ-produced attention, that yielded the most faithful representations of the linguistic structure via attention.

Though prior work (Kovaleva et al., 2019; Zhao and Bethard, 2020) seems to indicate that there is a lack of a substantial change in attention patterns after fine-tuning on syntax- and semantics-oriented classification tasks, the opposite effect has been observed with fine-tuning on negation scope resolution, where a more explanatory attention mechanism can be induced (Htut et al., 2019). Our results are similar to the latter, and we demonstrate that, given explicit syntactic annotation, attention weights do end up storing more transparently decodable structure. It is, however, still unclear which sets of Transformer parameters are best suited for learning this information and storing it in the form of attention.

Acknowledgements
Our experiments were run on resources provided by UNINETT Sigma2, the National Infrastructure for High Performance Computing and Data Storage in Norway, under the NeIC-NLPL umbrella. Mostafa and Anders were funded by a Google Focused Research Award. We would like to thank Daniel Dakota and Ali Basirat for some fruitful discussions and the anonymous reviewers for their excellent feedback.
References
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
Yonatan Belinkov and James Glass. 2019. Analysis methods in neural language processing: A survey. Transactions of the Association for Computational Linguistics, 7:49–72.
Ethan A. Chi, John Hewitt, and Christopher D. Manning. 2020. Finding universal grammatical relations in multilingual BERT. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5564–5577, Online. Association for Computational Linguistics.
Yoeng-Jin Chu. 1965. On the shortest arborescence of a directed graph. Scientia Sinica, 14:1396–1400.
Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D. Manning. 2019. What does BERT look at? An analysis of BERT's attention. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 276–286, Florence, Italy. Association for Computational Linguistics.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.
Timothy Dozat and Christopher D. Manning. 2016. Deep biaffine attention for neural dependency parsing. arXiv preprint arXiv:1611.01734.
John Hewitt and Christopher D. Manning. 2019. A structural probe for finding syntax in word representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4129–4138.
Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.
Phu Mon Htut, Jason Phang, Shikha Bordia, and Samuel R. Bowman. 2019. Do attention heads in BERT track syntactic dependencies? arXiv preprint arXiv:1911.12246.
Junjie Hu, Sebastian Ruder, Aditya Siddhant, Graham Neubig, Orhan Firat, and Melvin Johnson. 2020. XTREME: A massively multilingual multi-task benchmark for evaluating cross-lingual generalization. arXiv preprint arXiv:2003.11080.
Sarthak Jain and Byron C. Wallace. 2019. Attention is not explanation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3543–3556.
Dan Klein and Christopher D. Manning. 2004. Corpus-based induction of syntactic structure: Models of dependency and constituency. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL), pages 479–486.
Dan Kondratyuk and Milan Straka. 2019a. 75 languages, 1 model: Parsing Universal Dependencies universally. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2779–2795.
Dan Kondratyuk and Milan Straka. 2019b. 75 languages, 1 model: Parsing Universal Dependencies universally. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2779–2795, Hong Kong, China. Association for Computational Linguistics.
Olga Kovaleva, Alexey Romanov, Anna Rogers, and Anna Rumshisky. 2019. Revealing the dark secrets of BERT. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4364–4373, Hong Kong, China. Association for Computational Linguistics.
Artur Kulmizev, Vinit Ravishankar, Mostafa Abdou, and Joakim Nivre. 2020. Do neural language models show preferences for syntactic formalisms? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4077–4091, Online. Association for Computational Linguistics.
Anne Lauscher, Vinit Vulić, and Goran Glavaš authors corrected: Anne Lauscher, Vinit Ravishankar, Ivan Vulić, and Goran Glavaš. 2020. From zero to hero: On the limitations of zero-shot cross-lingual transfer with multilingual Transformers. arXiv preprint arXiv:2005.00633.
Tomasz Limisiewicz, Rudolf Rosa, and David Mareček. 2020. Universal Dependencies according to BERT: Both more specific and more general. arXiv preprint arXiv:2004.14620.
Ryan McDonald, Koby Crammer, and Fernando Pereira. 2005a. Online large-margin training of dependency parsers. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL), pages 91–98.
Ryan McDonald, Fernando Pereira, Kiril Ribarov, and Jan Hajič. 2005b. Non-projective dependency parsing using spanning tree algorithms. In Proceedings of the Human Language Technology Conference and the Conference on Empirical Methods in Natural Language Processing (HLT/EMNLP), pages 523–530.
Joakim Nivre, Mitchell Abrams, Željko Agić, et al. 2019. Universal Dependencies 2.4. LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University.
Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajič, Christopher D. Manning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, Reut Tsarfaty, and Dan Zeman. 2016. Universal Dependencies v1: A multilingual treebank collection. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC).
Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Jan Hajič, Christopher D. Manning, Sampo Pyysalo, Sebastian Schuster, Francis Tyers, and Dan Zeman. 2020. Universal Dependencies v2: An evergrowing multilingual treebank collection. In Proceedings of the 12th International Conference on Language Resources and Evaluation (LREC).
Telmo Pires, Eva Schlinger, and Dan Garrette. 2019. How multilingual is multilingual BERT? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4996–5001.
Alessandro Raganato, Jörg Tiedemann, et al. 2018. An analysis of encoder representations in Transformer-based machine translation. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP. The Association for Computational Linguistics.
Tim Rocktäschel, Edward Grefenstette, Karl Moritz Hermann, Tomáš Kočiský, and Phil Blunsom. 2015. Reasoning about entailment with neural attention. arXiv preprint arXiv:1509.06664.
Rudolf Rosa and David Mareček. 2019. Inducing syntactic trees from BERT representations. arXiv preprint arXiv:1906.11511.
Sofia Serrano and Noah A. Smith. 2019. Is attention interpretable? arXiv preprint arXiv:1906.03731.
Emma Strubell, Patrick Verga, Daniel Andor, David Weiss, and Andrew McCallum. 2018. Linguistically-informed self-attention for semantic role labeling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 5027–5038, Brussels, Belgium. Association for Computational Linguistics.
Swabha Swayamdipta, Sam Thomson, Kenton Lee, Luke Zettlemoyer, Chris Dyer, and Noah A. Smith. 2018. Syntactic scaffolds for semantic structures. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3772–3782, Brussels, Belgium. Association for Computational Linguistics.
Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. BERT rediscovers the classical NLP pipeline. arXiv preprint arXiv:1905.05950.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017a. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017b. Attention is all you need. arXiv preprint arXiv:1706.03762.
Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. 2019. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. arXiv preprint arXiv:1905.09418.
Sarah Wiegreffe and Yuval Pinter. 2019. Attention is not not explanation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 11–20.
Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning, pages 2048–2057.
Daniel Zeman, Jan Hajič, Martin Popel, Martin Potthast, Milan Straka, Filip Ginter, Joakim Nivre, and Slav Petrov. 2018. CoNLL 2018 shared task: Multilingual parsing from raw text to universal dependencies. In