Retrofitting Structure-aware Transformer Language Model for End Tasks
Hao Fei, Yafeng Ren∗ and Donghong Ji
1. Department of Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, School of Cyber Science and Engineering, Wuhan University, China
2. Guangdong University of Foreign Studies, China
{hao.fei, renyafeng, dhji}@whu.edu.cn

Abstract
We consider retrofitting a structure-aware Transformer language model for facilitating end tasks by proposing to exploit syntactic distance to encode both the phrasal constituency and the dependency connection into the language model. A middle-layer structural learning strategy is leveraged for structure integration, accomplished alongside the main semantic task training under a multi-task learning scheme. Experimental results show that the retrofitted structure-aware Transformer language model achieves improved perplexity, meanwhile inducing accurate syntactic phrases. By performing structure-aware fine-tuning, our model achieves significant improvements for both semantic- and syntactic-dependent tasks.
Natural language models (LM) can generate fluent text and encode factual knowledge (Mikolov et al., 2013; Pennington et al., 2014; Merity et al., 2017). Recently, pre-trained contextualized language models have given remarkable improvements on various NLP tasks (Peters et al., 2018; Radford et al., 2018; Howard and Ruder, 2018; Yang et al., 2019; Devlin et al., 2019; Dai et al., 2019). Among such methods, the Transformer-based (Vaswani et al., 2017) BERT has become the most popular encoder for obtaining state-of-the-art NLP task performance. It has been shown (Conneau et al., 2018; Tenney et al., 2019) that, besides rich semantic information, implicit language structure knowledge can be captured by a deep BERT (Vig and Belinkov, 2019; Jawahar et al., 2019; Goldberg, 2019). However, such structure features learnt via the vanilla Transformer LM are insufficient for NLP tasks that heavily rely on syntactic or linguistic knowledge (Hao et al., 2019). Some efforts have been devoted to improving the structure learning ability of Transformer LMs by installing novel syntax-attention mechanisms (Ahmed et al., 2019; Wang et al., 2019). Nevertheless, several limitations can be observed.

∗ Corresponding author.

Figure 1: Full-layer multi-task learning for structural training (left), and the middle-layer training for deep structure-aware Transformer LM (right).

First, according to recent findings from probing tasks (Conneau et al., 2018; Tenney et al., 2019; Goldberg, 2019), syntactic structure representations are best retained right at the middle layers (Vig and Belinkov, 2019; Jawahar et al., 2019). Nevertheless, existing tree Transformers employ traditional full-scale training over the whole deep Transformer architecture (as shown in Figure 1(a)), consequently weakening the upper-layer semantic learning that can be crucial for end tasks. Second, these tree Transformer methods encode either standalone constituency or dependency structure, while different tasks can depend on varying types of structural knowledge. The constituent and dependency representations of syntactic structure share underlying linguistic characteristics: the former focuses on disclosing phrasal continuity, while the latter aims at indicating dependency relations among elements. For example, semantic parsing tasks depend more on dependency features (Rabinovich et al., 2017; Xia et al., 2019), while constituency information is much needed for sentiment classification (Socher et al., 2013).

In this paper, we aim to retrofit a structure-aware Transformer LM for facilitating end tasks. • On the one hand, we propose a structure learning module for the Transformer LM, exploiting syntactic distance as the measurement for encoding both the phrasal constituency and the dependency connection. • On the other hand, as illustrated in Figure 1, to better coordinate structural learning and semantic learning, we employ a middle-layer structural training strategy to integrate syntactic structures into the main language modeling task under a multi-task scheme, which encourages the induction of structural information to take place at the most suitable layers. • Last but not least, we consider performing structure-aware fine-tuning with end-task training, allowing the learned syntactic knowledge to accord most with the end task's needs.

We conduct experiments on language modeling and a wide range of NLP tasks.
Results show that the structure-aware Transformer retrofitted via our proposed middle-layer training strategy achieves better language perplexity, meanwhile inducing high-quality syntactic phrases. Besides, the LM after structure-aware fine-tuning can give significantly improved performance for various end tasks, including semantic-dependent and syntactic-dependent tasks. We also find that supervised structured pre-training brings more benefits to syntactic-dependent tasks, while the unsupervised LM pre-training brings more benefits to semantic-dependent tasks. Further experimental results on unsupervised structure induction demonstrate that different NLP tasks rely on varying types of structure knowledge as well as distinct granularities of phrases, and our retrofitting method can help to induce structural phrases that are most adapted to the needs of end tasks.
Contextual language modeling.
Contextual language models pre-trained on a large-scale corpus have witnessed significant advances (Peters et al., 2018; Radford et al., 2018; Howard and Ruder, 2018; Yang et al., 2019; Devlin et al., 2019; Dai et al., 2019). In contrast to traditional static and context-independent word embeddings, contextual language models can strengthen word representations by dynamically encoding the contextual sentences for each word during pre-training. By further fine-tuning with end tasks, the contextualized word representations from language models can help to give the most task-related, context-sensitive features (Peters et al., 2018). In this work, we follow the line of Transformer-based (Vaswani et al., 2017) LMs (e.g., BERT), considering their prominence.
Structure induction.
The idea of introducing tree structures into deep models for structure-aware language modeling has long been explored by supervised structure learning, which generally relies on annotated parse trees during training and maximizes the joint likelihood of sentence-tree pairs (Socher et al., 2010, 2013; Tai et al., 2015; Yazdani and Henderson, 2015; Dyer et al., 2016; Alvarez-Melis and Jaakkola, 2017; Aharoni and Goldberg, 2017; Eriguchi et al., 2017; Wang et al., 2018; Gū et al., 2018).

There has also been much attention paid to the unsupervised grammar induction task (Williams et al., 2017; Shen et al., 2018a,b; Kuncoro et al., 2018; Kim et al., 2019a; Luo et al., 2019; Drozdov et al., 2019; Kim et al., 2019b). For example, PRPN (Shen et al., 2018a) computes the syntactic distance of word pairs. ON-LSTM (Shen et al., 2018b) allows hidden neurons to learn long-term or short-term information via a gate mechanism. URNNG (Kim et al., 2019b) applies amortized variational inference, encouraging the decoder to generate reasonable tree structures. DIORA (Drozdov et al., 2019) uses inside-outside dynamic programming to compose latent representations from all possible binary trees. PCFG (Kim et al., 2019a) achieves grammar induction with a probabilistic context-free grammar. Unlike these recurrent-network-based structure-aware LMs, our work focuses on structure learning for a deep Transformer LM.
Structure-aware Transformer language model.
Some efforts have been made to analyze Transformer-based pre-trained language models (e.g., BERT) by visualizing the attention (Vig and Belinkov, 2019; Kovaleva et al., 2019; Hao et al., 2019) or through probing tasks (Jawahar et al., 2019; Goldberg, 2019). They find that latent language structure knowledge is best retained at the middle layers of BERT (Vig and Belinkov, 2019; Jawahar et al., 2019; Goldberg, 2019). Ahmed et al. (2019) employ a decomposable attention mechanism to recursively learn the tree structure for the Transformer. Wang et al. (2019) integrate tree structures into the Transformer via constituency attention. However, these Transformer LMs suffer from full-scale structural training and a monotonous type of structure, limiting the performance of structure-aware LMs for end tasks. Our work is partially inspired by Shen et al. (2018a) and Luo et al. (2019) in employing syntax distance measurements, while their works focus on syntax learning with recurrent LMs.

Figure 2: Overall framework of the retrofitted structure-aware Transformer language model.
The proposed structure-aware Transformer language model mainly consists of two components: the Transformer encoders and the structure learning module, which are illustrated in Figure 2.
The language model is built on N-layer Transformer blocks. One Transformer layer applies multi-head self-attention in combination with a feedforward network, layer normalization and residual connections. Specifically, the attention weights are computed in parallel via:

E = softmax(QK^⊤ / √d) V    (1)

where Q (query), K (key) and V (value) in the multi-head setting are computed from the input x = {x_1, ..., x_n}, processed t times (once per head).

Given an input sentence x, the output contextual representation of the l-th Transformer block can be formulated as:

{h^l_1, ..., h^l_n} = Trm({x_1, ..., x_n}) = η(Φ(η(E^l)) + E^l)    (2)

where η is the layer normalization operation and Φ is a feedforward network. In this work, the output contextual representations h^l = {h^l_1, ..., h^l_n} of the middle layers are used to learn the structure y_struc, and the one at the final layer is used for the language modeling or end task training y_task.

Figure 3: Simultaneously measuring dependency relations (1) and phrasal constituency (3) based on the example sentence (2) "James remembered the story of the party" by employing syntax distance (4).

The structure learning module is responsible for unsupervisedly generating phrases, providing structure-aware language modeling to the host LM.
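To make the layer-wise computation above concrete, the following is a minimal PyTorch sketch of Eq. (1)-(2). It is an illustrative reading of the described encoder rather than the authors' released implementation, and all class and variable names are ours; it keeps every layer's hidden states so that the middle layers can later feed the structure learning module while the top layer serves the MLM or end task.

```python
# Minimal sketch (illustrative, not the authors' code) of the Transformer encoder in
# Eq. (1)-(2), exposing every layer's hidden states.
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model=768, n_heads=12, d_ff=3072):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)   # eta in Eq. (2)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        e, _ = self.attn(x, x, x)             # E = softmax(QK^T / sqrt(d)) V, Eq. (1)
        x = self.norm1(x + e)                 # residual connection + layer norm
        return self.norm2(x + self.ffn(x))    # eta(Phi(eta(E)) + E), Eq. (2)

class Encoder(nn.Module):
    def __init__(self, n_layers=12, d_model=768):
        super().__init__()
        self.layers = nn.ModuleList(TransformerBlock(d_model) for _ in range(n_layers))

    def forward(self, x):
        states = []                           # h^1 ... h^N, one tensor per layer
        for layer in self.layers:
            x = layer(x)
            states.append(x)
        return states   # middle layers -> structure learning (y_struc); top layer -> MLM / end task (y_task)
```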
Syntactic context.
We extract the context representations from the Transformer middle layers for the subsequent syntax learning. We optimize the structure-aware Transformer LM by forcing the structure knowledge injection to focus on three middle layers: the (l−1)-th, l-th, and (l+1)-th. Note that although we only attach structural attending to the selected layers, structure learning can also enhance lower layers via back-propagation.

Specifically, we take the first of the chosen three layers as the word context C^Ψ = h^{l−1}. For the phrasal context C^Ω = {c^Ω_1, ..., c^Ω_n}, we make use of the contextual representations from the three chosen layers by a weighted sum:

C^Ω = α_{l−1}·h^{l−1} + α_l·h^l + α_{l+1}·h^{l+1}    (3)

where α_{l−1}, α_l and α_{l+1} are sum-to-one trainable coefficients. Rich syntactic representations are expected to be captured in C^Ω by the LM.

Structure measuring. In this study, we reach the goal of measuring syntax by employing the syntax distance. The general concept of the syntax distance d_i can be reckoned as a metric (i.e., distance) from a certain word x_i to the root node within the dependency tree (Shen et al., 2018a). For instance, in Figure 3 the head word 'remembered' x_i and its dependent word 'James' x_j follow d_i < d_j. In this work, to maintain both the dependency and the phrasal constituents simultaneously, we add additional constraints on words and phrases. Given two words x_i and x_j (1 ≤ i < j ≤ n) in one phrase, we define d_i < d_j. This can be demonstrated by the word pair 'the' and 'story'. If they are in different phrases, e.g., S_u and S_v, the corresponding inner-phrasal head words follow d_i (in S_u) > d_j (in S_v), e.g., 'story' and 'party'.

In the structure learning module, we first compute the syntactic distances d = {d_1, ..., d_n} for each word based on the word context via a convolutional network:

{d_1, ..., d_n} = Φ(CNN({c^Ψ_1, ..., c^Ψ_n}))    (4)

where d_i is a scalar, and Φ is for linearization. With such syntactic distances, we expect both the dependency and the constituency syntax to be well captured in the LM.
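A hedged sketch of Eq. (3)-(4) follows: mixing the three chosen middle layers into the phrase context C^Ω and mapping the word context C^Ψ to scalar syntactic distances with a convolution. Normalizing the mixing coefficients by their sum is one simple way to keep them sum-to-one; the exact parameterization used by the authors is not specified here, and all names are illustrative.

```python
# Sketch of the phrase context C^Omega (Eq. 3) and the syntactic distances d_1..d_n (Eq. 4);
# an illustrative reading, assuming the chosen middle layers are l-1, l and l+1.
import torch
import torch.nn as nn

class StructureLearningModule(nn.Module):
    def __init__(self, d_model=768, kernel=3):
        super().__init__()
        # initial mixing coefficients 0.35 / 0.40 / 0.25 as reported in the paper
        self.alpha = nn.Parameter(torch.tensor([0.35, 0.40, 0.25]))
        self.cnn = nn.Conv1d(d_model, d_model, kernel, padding=kernel // 2)
        self.linear = nn.Linear(d_model, 1)   # Phi: linearization to a scalar distance per word

    def forward(self, h_prev, h_mid, h_next):
        # word context C^Psi = h^{l-1}; phrase context C^Omega = weighted sum of the three layers, Eq. (3)
        alpha = self.alpha / self.alpha.sum()            # keep the coefficients sum-to-one
        c_psi = h_prev
        c_omega = alpha[0] * h_prev + alpha[1] * h_mid + alpha[2] * h_next
        # syntactic distances via a convolution over the word context, Eq. (4); shape (batch, n)
        d = self.linear(self.cnn(c_psi.transpose(1, 2)).transpose(1, 2)).squeeze(-1)
        return c_psi, c_omega, d
```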
Syntactic phrase generating. Considering the word x_i opening an induced phrase S_m = [x_i, ..., x_{i+w}] in a sentence, where w is the phrase width, we need to decide the probability p*(x_j) that a word x_j (j = i + w + 1), i.e., the first word outside phrase S_m, belongs to S_m:

p*(x_j) = ∏_{k=i}^{i+w} sigmoid(d_j − d_k).    (5)

We set the initial width w = 1. If p*(x_j) is above the window threshold λ, x_j is considered inside the phrase; otherwise, the phrase S_m is closed and a new one restarts at x_j. We incrementally conduct this phrasal searching procedure to segment all the phrases in a sentence. Given an induced phrase S_m = [x_i, ..., x_{i+w}], we obtain its embedding s_m via a phrasal attention:

u_i = softmax(d_i · p*(x_i))    (6)
s_m = Σ_{i}^{i+w} u_i · c^Ψ_i    (7)

Note that we cannot explicitly define the granularity (width) of every phrase in the constituency tree; instead, it is decided by the structure learning module heuristically.
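Below is a minimal sketch of the segmentation and phrasal-attention steps in Eq. (5)-(7). The greedy left-to-right reading of the procedure, the default threshold value, and the function names are our assumptions.

```python
# Sketch of the incremental phrase segmentation (Eq. 5) and phrasal attention (Eq. 6-7);
# a greedy reading of the procedure with illustrative names and default values.
import torch

def segment_phrases(d, lam=0.5):
    """d: (n,) syntactic distances for one sentence; returns a list of [start, end) spans."""
    n = d.size(0)
    spans, start = [], 0
    for j in range(1, n):
        # probability that word j still belongs to the currently open phrase, Eq. (5)
        p = torch.sigmoid(d[j] - d[start:j]).prod()
        if p <= lam:                 # below the window threshold: close the phrase, restart at x_j
            spans.append((start, j))
            start = j
    spans.append((start, n))
    return spans

def phrase_embeddings(d, c_psi, spans):
    """c_psi: (n, dim) word contexts; returns one embedding s_m per induced phrase, Eq. (6)-(7)."""
    embeds = []
    for i, j in spans:
        p_in = torch.stack([torch.sigmoid(d[k] - d[i:j]).prod() for k in range(i, j)])
        u = torch.softmax(d[i:j] * p_in, dim=0)                  # phrasal attention weights, Eq. (6)
        embeds.append((u.unsqueeze(-1) * c_psi[i:j]).sum(0))     # s_m, Eq. (7)
    return torch.stack(embeds)
```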
Multi-task training for language modeling and structure induction.
Different from traditional language models, a Transformer-based LM employs masked language modeling (MLM), which can capture larger contexts. Likewise, we predict a masked word using the corresponding context representation at the top layer:

p_W(y_i | x) = softmax(c_i | x)    (8)
L_W = Σ_{i}^{k} log p_W(y_i | x)    (9)

On the other hand, the purpose of unsupervised syntactic induction is to encourage the model to induce the s_m that is most likely entailed by the phrasal context c^Ω_i. The underlying logic is that, if the initial Transformer LM captures linguistic syntax knowledge, then after iterations of learning with the structure learning module the induced structure can be greatly amplified and enhanced (Luo et al., 2019). We thus define the following probability:

p_G(s_m | c^Ω_i) = 1 / (1 + exp(−s_m^⊤ · c^Ω_i))    (10)

Additionally, to enhance the syntax learning, we employ negative sampling:

L_Neg = (1/n) Σ_{j}^{n} p_G(ŝ_j | c^Ω_i)    (11)

where ŝ is a randomly selected negative phrase. The final objective for structure learning is:

L_G = Σ_{i}^{K} ( Σ_{m}^{M} (1 − p_G(s_m | c^Ω_i)) + L_Neg )    (12)

We employ multi-task learning to simultaneously train our LM for both word prediction and structure induction. Thus, the overall target is to minimize the following multi-task loss:

L_pre = L_W + γ_pre · L_G    (13)

where γ_pre is a regulating coefficient.
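A sketch of the multi-task pre-training objective in Eq. (8)-(13) follows: a standard MLM cross-entropy term plus the unsupervised structure loss with negative sampling. Tensor shapes, how negative phrases are sampled, and the pairing of phrases with their contexts are illustrative assumptions.

```python
# Sketch of L_pre = L_W + gamma_pre * L_G (Eq. 8-13); names and shapes are illustrative.
import torch
import torch.nn.functional as F

def structure_loss(phrase_embeds, phrase_ctx, neg_embeds):
    """phrase_embeds: (M, dim) induced phrases s_m; phrase_ctx: (M, dim) matching C^Omega contexts;
    neg_embeds: (M, n_neg, dim) randomly sampled negative phrases."""
    pos = torch.sigmoid((phrase_embeds * phrase_ctx).sum(-1))                        # p_G(s_m | c), Eq. (10)
    l_neg = torch.sigmoid((neg_embeds * phrase_ctx.unsqueeze(1)).sum(-1)).mean(-1)   # L_Neg, Eq. (11)
    return ((1.0 - pos) + l_neg).sum()                                               # L_G, Eq. (12)

def pretrain_loss(mlm_logits, mlm_labels, phrase_embeds, phrase_ctx, neg_embeds, gamma_pre=0.5):
    l_w = F.cross_entropy(mlm_logits, mlm_labels, ignore_index=-100)                 # L_W, Eq. (8)-(9)
    l_g = structure_loss(phrase_embeds, phrase_ctx, neg_embeds)
    return l_w + gamma_pre * l_g                                                     # L_pre, Eq. (13)
```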
Supervised syntax injection. Our default structure-aware LM unsupervisedly induces syntax at the pre-training stage, as elaborated above. Alternatively, in Eq. (7), if we leverage the gold (or a priori) syntax distance information for phrases, we can achieve supervised structure injection.
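As a concrete reading of the supervised injection, one way to obtain a priori syntax distances from a (gold or auto-)parsed dependency tree is to use each token's depth to the root, following the notion of syntactic distance in Shen et al. (2018a). The exact conversion used in this work is not spelled out here, so the sketch below is an assumption.

```python
# Sketch (assumption) of deriving supervised syntax distances from a dependency parse:
# distance = depth of the token below the root, so heads get smaller distances than dependents.
def distances_from_heads(heads):
    """heads[i] is the 0-based index of the head of token i, or -1 for the root."""
    def depth(i):
        d = 0
        while heads[i] != -1:
            i = heads[i]
            d += 1
        return d
    return [float(depth(i)) for i in range(len(heads))]

# e.g. "James remembered the story of the party":
# heads = [1, -1, 3, 1, 6, 6, 3] -> 'remembered' gets distance 0, its dependents 1, and so on.
```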
Unsupervised structure fine-tuning. We aim to improve the learnt structural information to better facilitate the end tasks. Therefore, during the fine-tuning stage of end tasks, we further make the structure learning module trainable:

L_fine = L_task + γ_fine · L_G    (14)

where L_task refers to the loss function of the end task, and γ_fine is a regulating coefficient. Note that to achieve the best structural fine-tuning, supervised structure injection is unnecessary, and we do not allow supervised structure aggregation at the fine-tuning stage.

Our approach is model-agnostic, as we realize the syntax induction via a standalone structure learning module that is disentangled from the host LM. Thus the method can be applied to various Transformer-based LM architectures.

We employ the same architecture as the BERT base model, which is a 12-layer Transformer with 12 attention heads and a 768-dimensional hidden size. To enrich our experiments, we also consider the Google pre-trained weights (https://github.com/google-research/bert) as the initialization. We use Adam as our optimizer with an initial learning rate in [8e-6, 1e-5, 2e-5, 3e-5], and an L2 weight decay of 0.01. The batch size is selected in [16, 24, 32]. We set the initial values of the coefficients α_{l−1}, α_l and α_{l+1} to 0.35, 0.4 and 0.25, respectively. The pre-training coefficient γ_pre is set to 0.5, and the fine-tuning coefficient γ_fine to 0.23. These values give the best effects in our development experiments. Our implementation is based on the PyTorch library (https://pytorch.org/).

Besides, for supervised structure learning in our experiments, we use the state-of-the-art BiAffine dependency parser (Dozat and Manning, 2017) to parse sentences for all the relevant datasets, and use the Self-Attentive parser (Kitaev and Klein, 2018) to obtain the constituency structure. Trained on the English Penn Treebank (PTB) corpus (Marcus et al., 1993), the dependency parser has 95.2% UAS and 93.4% LAS, and the constituency parser has a 92.6% F1 score. With the auto-parsed annotations, we can calculate the syntax distances (substituting the ones in Eq. 4) and obtain the corresponding phrasal embeddings (in Eq. 7).
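For reference, the settings reported above and the fine-tuning objective of Eq. (14) can be written down as the short sketch below; the dictionary keys and the helper function are illustrative, and only the values come from this section.

```python
# Sketch summarizing the reported hyper-parameters and the structure-aware
# fine-tuning objective in Eq. (14); names are illustrative.
FINETUNE_CONFIG = {
    "learning_rate": [8e-6, 1e-5, 2e-5, 3e-5],   # searched range
    "weight_decay": 0.01,
    "batch_size": [16, 24, 32],                  # searched range
    "alpha_init": (0.35, 0.40, 0.25),            # layer-mixing coefficients in Eq. (3)
    "gamma_pre": 0.5,
    "gamma_fine": 0.23,
}

def finetune_loss(task_loss, struct_loss, gamma_fine=FINETUNE_CONFIG["gamma_fine"]):
    # the structure learning module stays trainable at fine-tuning time, Eq. (14)
    return task_loss + gamma_fine * struct_loss
```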
Figure 4: Development experiments on syntactic probing tasks at varying Transformer layers: (a) constituency phrase parsing; (b) dependency alignment.
Figure 5: Constituency parsing under different λ.

We first validate at which layer depth the structure-aware Transformer LM achieves the best performance when integrating our retrofitting method. We thus design probing experiments, in which we consider the following two syntactic tasks. 1) Constituency phrase parsing seeks to generate grammar phrases based on the PTB dataset and evaluates whether the induced constituent spans also exist in the gold Treebank dataset. 2) Dependency alignment aims to compute the proportion of Transformer attention connecting tokens in a dependency relation (Vig and Belinkov, 2019):

Score = Σ_{x∈X} Σ_i Σ_j α_{i,j}(x) · dep(x_i, x_j) / Σ_{x∈X} Σ_i Σ_j α_{i,j}(x)    (15)

where α_{i,j}(x) is the attention weight, and dep(x_i, x_j) is an indicator function (1 if x_i and x_j are in a dependency relation and 0 otherwise). The experiments are based on English Wikipedia, following Vig and Belinkov (2019).

As shown in Figure 4, the results on both unsupervised and supervised phrase parsing are best at layer 6. The attention also aligns with dependency relations most strongly in the middle layers (5-6), consistent with findings from previous work (Tenney et al., 2019; Vig and Belinkov, 2019). Both probing tasks indicate that our proposed middle-layer structure training is practical. We thus inject the structure in the structure learning module at the 6-th layer (l = 6).
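As a concrete reading of Eq. (15), the per-sentence alignment score can be computed as below; the tensor layout and the function name are illustrative.

```python
# Sketch of the dependency-alignment score in Eq. (15) for one attention head of one sentence.
import torch

def dependency_alignment(attn, dep):
    """attn: (n, n) attention weights alpha_{i,j}; dep: (n, n) 0/1 dependency-arc indicator."""
    return (attn * dep).sum() / attn.sum()   # proportion of attention mass on dependency arcs

# corpus-level score: sum the numerators and denominators over all sentences before dividing.
```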
                        Syntactic                              Semantic
System                  TreeDepth  TopConst  Tense   SOMO     NER     SST     Rel     SRL      Avg.
• w/o initial weights:
Trm                     25.31      40.32     61.06   50.11    89.22   86.21   84.70   88.30    65.65
RvTrm                   29.52      45.01     63.83   51.42    89.98   86.66   85.02   88.94    67.55
Tree+Trm                30.37      46.58     65.83   53.08    90.62   87.25   84.97   88.70    68.43
PI+TrmXL                31.28      47.06     63.78   52.36    90.34   87.09   85.22   89.02    68.27
Ours+Trm (usp.)
Ours+Trm (sp.)
Ours+Trm (syn-embed.)
• Initial weights:
BERT                    38.61      79.37     90.61   65.31    92.40   93.50   89.25   92.20    80.16
Ours+BERT (usp.)        45.82      88.64     94.68   67.84    94.28   94.67   90.41   93.12    83.68

Table 1: Structure-aware Transformer LM for end tasks.
System             Const.   Ppl.
PRPN               42.8     -
On-LSTM            49.4     -
URNNG              52.4     -
DIORA              56.2     -
PCFG               60.1     -
Trm                22.7     78.6
RvTrm              47.0     50.3
Tree+Trm           52.0     45.7
PI+TrmXL           56.2     43.4
Ours+Trm (usp.)    60.3     37.0
Ours+Trm (sp.)     68.8     29.2
Ours+BERT (usp.)   65.2     16.2

Table 2: Performance on constituency parsing and language modeling.
Phrase generation threshold.
We introduce a hyper-parameter λ as a threshold to decide whether a word belongs to a given phrase during the phrasal generation step. We explore the best λ value based on the same parsing tasks. As shown in Figure 5, with λ = 0. for unsupervised induction and λ = 0. for supervised induction, the induced phrasal quality is the highest. We therefore keep these λ values for all the remaining experiments.

We evaluate the effectiveness of our proposed retrofitted structure-aware LM after pre-training. We first compare the performance on language modeling. (Since our Transformer can see its subsequent words bidirectionally, we measure the perplexity on masked words, and we thus avoid directly comparing with recurrent-based LMs.) From the results shown in Table 2, our retrofitted Transformer yields better language perplexity in both the unsupervised (37.0) and the supervised (29.2) manner. This proves that our middle-layer structure training strategy can effectively relieve the negative mutual influence of structure learning on semantic learning, while inducing high-quality structural phrases. We can also conclude that language models with more successful structural knowledge can better help to encode effective intrinsic language patterns, which is consistent with prior studies (Kim et al., 2019b; Wang et al., 2019; Drozdov et al., 2019).

We also compare the constituency parsing with state-of-the-art structure-aware models, including
1) Recurrent-based models described in §
2) Transformer-based methods: Tree+Trm (Wang et al., 2019), RvTrm (Ahmed et al., 2019), PI+TrmXL (Luo et al., 2019), and the BERT model initialized with rich weights. As shown in Table 2, all the structure-aware models give good parsing results compared with the non-structured models. Our retrofitted Transformer LM gives the best performance (60.3% F1) in unsupervised induction. Combined with the supervised auto-labeled parses, it gives the highest F1 score (68.8%).
We validate the effectiveness of our method for end tasks with structure-aware fine-tuning. All systems are first pre-trained for structure learning, and then fine-tuned with end task training. The evaluation is performed on eight tasks, involving syntactic tasks and semantic tasks.

Figure 6: Visualization of attention heads (heatmap) and the corresponding syntax distances (bar chart) for examples from (a) SST, (b) Rel and (c) SRL.
TreeDepth predicts the depth of the syntactic tree, TopConst tests the sequence of top-level constituents in the syntax tree, Tense detects the tense of the main-clause verb, and SOMO checks the sensitivity to random replacement of words; these are standard probing tasks. We follow the same datasets and settings as previous work (Conneau et al., 2018; Jawahar et al., 2019). We also evaluate the semantic tasks, including 1) NER, named entity recognition on CoNLL03 (Tjong Kim Sang and De Meulder, 2003), 2) SST, binary sentiment classification on the Stanford sentiment treebank (Socher et al., 2013), 3) Rel, relation classification on SemEval10 (Hendrickx et al., 2010), and 4) SRL, semantic role labeling on the CoNLL09 WSJ data (Hajič et al., 2009). The performance is reported by the F1 score.

The results are summarized in Table 1. First, we find that structure-aware LMs bring improved performance on all the tasks, compared with the vanilla Transformer encoder. Second, the Transformer with our structure-aware fine-tuning achieves better results (70.74% on average) on all the end tasks, compared with the baseline tree Transformer LMs. This proves that our proposed middle-layer strategy benefits the structural fine-tuning most, compared with the full-layer structure training of the baselines. Third, with supervised structure learning, significant improvements can be found across all tasks.

For the supervised setting, we replace the supervised syntax fusion in the structure learning module with the auto-labeled syntactic dependency embedding and concatenate it with the other input embeddings.
System             Mean   Median
RvTrm              0.68   0.69
Tree+Trm           0.60   0.64
PI+TrmXL           0.54   0.58
Ours+Trm (usp.)    0.50   0.52
Ours+Trm (sp.)     0.32   0.37
Table 3: Fine-grained parsing.

The results are not as prominent as with the supervised syntax fusion, which reflects the advantage of our proposed structure learning module. Besides, based on the task improvements from the Transformer retrofitted by our method, we can further infer that the supervised structure benefits syntactic-dependent tasks more, while the unsupervised structure benefits semantic-dependent tasks the most. Finally, the BERT model integrated with our method gives improved effects, although we note that a direct comparison with the BERT model is not fair, because its large number of well pre-trained parameters brings overwhelming advantages.

We take a further step, evaluating the fine-grained quality of phrasal structure induction after pre-training. Instead of checking whether the induced constituent spans are identical to the gold counterparts, we now measure the deviation

PhrDev(ŷ, y) = √( (1/N) Σ_i [Δ(ŷ_i, y_i) − Δ̄]² ),

where Δ(ŷ_i, y_i) is the phrasal editing distance between the induced phrase length and the gold length within a sentence, and Δ̄ is the averaged editing distance. If all the predicted phrases are the same as the ground truth, or all differ from it equally, PhrDev(ŷ, y) = 0, which means that the phrases are induced with maximum consistency, and vice versa. We report the statistics over all sentences in Table 3. Our method can unsupervisedly generate structural phrases of higher quality, while the best injection of constituency knowledge into the LM is achieved in the supervised manner.

To interpret the fine-tuned structures, we empirically visualize the Transformer attention heads from the chosen l-th layer, and the syntax distances of the sentence. We exhibit three examples from SST, Rel and
SRL, respectively, as shown in Figure 6. Overall, our method helps to induce clear structures of both dependency and constituency. Interestingly, different types of tasks rely on different granularities of phrases. Comparing the heat maps and syntax distances with each other, the induced phrasal constituency on SST is longer than that on SRL. This is because the sentiment classification task demands more phrasal composition features, while the SRL task requires more fine-grained phrases. In addition, we find that the syntax distances on SRL and Rel are higher in variance, compared with those on SST. Intuitively, a larger deviation of syntax distances in a sentence indicates more demand for the interdependent information between elements, while a smaller deviation points to phrasal constituency. This reveals that SRL and Rel rely more on the dependency syntax, while SST is more relevant to constituents, which is consistent with previous studies (Socher et al., 2013; Rabinovich et al., 2017; Xia et al., 2019; Fei et al., 2020).
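Returning to the fine-grained metric reported in Table 3, a minimal sketch of PhrDev as defined above follows, read as the standard deviation of the per-phrase editing distances within a sentence; this reading, including the squared term under the square root, is our assumption, and the helper name is illustrative.

```python
# Sketch of the PhrDev metric used for Table 3: standard deviation of the editing
# distances between induced and gold phrase lengths within one sentence.
import math

def phr_dev(deltas):
    """deltas: per-phrase editing distances Delta(y_hat_i, y_i) for one sentence."""
    mean = sum(deltas) / len(deltas)
    return math.sqrt(sum((d - mean) ** 2 for d in deltas) / len(deltas))

# identical deviations (all phrases equally right or equally wrong) give PhrDev = 0,
# i.e. maximally consistent induction; larger values mean less consistent phrases.
```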
Distributions of heterogeneous syntax for different tasks.
Based on the above analysis, we further analyze the distributions of dependency and constituency structures after fine-tuning for the different tasks. Technically, we calculate the mean absolute difference of syntax distances between the elements x_i and the sub-root node x_r in a sentence:

Diff = (1/N) Σ_{i}^{N} |d_i − d_r|.

We then linearly normalize the values into [0, 1] over all the sentences in the corpus of each task, and report the statistics, as plotted in Figure 7.

Figure 7: Distributions of dependency and constituency syntax in different tasks (TreeDepth, TopConst, Tense, SOMO, NER, SST, Rel, SRL). Blue color indicates the predominance of dependency, red of constituency.
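A small sketch of the Diff statistic plotted in Figure 7 (before per-corpus normalization); the sub-root index argument and the function name are illustrative.

```python
# Sketch of the per-sentence Diff statistic: mean absolute difference between each word's
# syntax distance and that of the sub-root word.
def diff_score(distances, root_index):
    n = len(distances)
    return sum(abs(d - distances[root_index]) for d in distances) / n

# larger (normalized) Diff -> the task leans on dependency syntax;
# smaller Diff -> it leans on phrasal constituency.
```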
                   SST                      SRL
               Ours+Trm    Tree+Trm     Ours+Trm    Tree+Trm
NP
VP
PP
ADJP
ADVP
Table 4: Proportion of each type of induced phrase.

Intuitively, the larger the value, the more the task depends on dependency syntax; otherwise, it leans on constituency structure. Overall, the distributions of dependency structures and phrasal constituents in the fine-tuned LM vary among different tasks, verifying that different tasks depend on distinct types of structural knowledge. For example,
TreeDepth , Rel and
SRL are supported most by the dependency structure, while
TopConst and
SST benefit from constituency the most.
SOMO and
NER can gain from both types.
Phrase types.
Finally, we explore the diversity of phrasal syntax required by two representative end tasks, SST and SRL. We first look into the statistical proportions of the different types of induced phrases; five main types are considered: noun phrase (NP), verb phrase (VP), prepositional phrase (PP), adjective phrase (ADJP) and adverb phrase (ADVP). As shown in Table 4, our method tends to induce more task-relevant phrases, and the lengths of the induced phrases vary with the task. Concretely, the fine-tuned structure-aware Transformer helps to generate more (and longer) NPs for the SST task, and yields roughly equal numbers of NPs and VPs, with shorter phrases, for the SRL task. This evidently gives rise to the better task performance. In contrast, the syntax phrases induced by the Tree+Trm model remain unvarying for the SST (3.22) and SRL (3.36) tasks.

Conclusion
We presented a retrofitting method for structure-aware Transformer-based language models. We adopted the syntax distance to encode both the constituency and the dependency structure. To relieve the conflict between structure learning and semantic learning in the Transformer LM, we proposed a middle-layer structure learning strategy under a multi-task scheme. Results showed that the structure-aware Transformer retrofitted via our proposed method achieved better language perplexity while inducing high-quality syntactic phrases. Furthermore, our LM after structure-aware fine-tuning gave significantly improved performance for both semantic-dependent and syntactic-dependent tasks, also yielding the most task-related and interpretable syntactic structures.
We thank the anonymous reviewers for their valuable and detailed comments. This work is supported by the National Natural Science Foundation of China (No. 61772378, No. 61702121), the National Key Research and Development Program of China (No. 2017YFC1200500), the Research Foundation of the Ministry of Education of China (No. 18JZD015), the Major Projects of the National Social Science Foundation of China (No. 11&ZD189), the Key Project of the State Language Commission of China (No. ZDI135-112) and the Guangdong Basic and Applied Basic Research Foundation of China (No. 2020A151501705).
References
Roee Aharoni and Yoav Goldberg. 2017. Towards string-to-tree neural machine translation. CoRR, abs/1704.04743.

Mahtab Ahmed, Muhammad Rifayat Samee, and Robert E. Mercer. 2019. You only need attention to traverse trees. In Proceedings of the ACL, pages 316–322.

David Alvarez-Melis and Tommi S. Jaakkola. 2017. Tree-structured decoding with doubly-recurrent neural networks. In Proceedings of the ICLR.

Alexis Conneau, German Kruszewski, Guillaume Lample, Loïc Barrault, and Marco Baroni. 2018. What you can cram into a single $&!#* vector: Probing sentence embeddings for linguistic properties. In Proceedings of the ACL, pages 2126–2136.

Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc Le, and Ruslan Salakhutdinov. 2019. Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of the ACL, pages 2978–2988.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the NAACL, pages 4171–4186.

Timothy Dozat and Christopher D. Manning. 2017. Deep biaffine attention for neural dependency parsing. In Proceedings of the ICLR.

Andrew Drozdov, Patrick Verga, Mohit Yadav, Mohit Iyyer, and Andrew McCallum. 2019. Unsupervised latent tree induction with deep inside-outside recursive autoencoders. CoRR, abs/1904.02142.

Chris Dyer, Adhiguna Kuncoro, Miguel Ballesteros, and Noah A. Smith. 2016. Recurrent neural network grammars. In Proceedings of the NAACL, pages 199–209.

Akiko Eriguchi, Yoshimasa Tsuruoka, and Kyunghyun Cho. 2017. Learning to parse and translate improves neural machine translation. In Proceedings of the ACL, pages 72–78.

Hao Fei, Meishan Zhang, Fei Li, and Donghong Ji. 2020. Cross-lingual semantic role labeling with model transfer. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 28:2427–2437.

Yoav Goldberg. 2019. Assessing BERT's syntactic abilities. CoRR, abs/1901.05287.

Jetic Gū, Hassan S. Shavarani, and Anoop Sarkar. 2018. Top-down tree structured decoding with syntactic connections for neural machine translation and parsing. In Proceedings of the EMNLP, pages 401–413.

Jan Hajič, Massimiliano Ciaramita, Richard Johansson, Daisuke Kawahara, Maria Antònia Martí, Lluís Màrquez, Adam Meyers, Joakim Nivre, Sebastian Padó, Jan Štěpánek, Pavel Straňák, Mihai Surdeanu, Nianwen Xue, and Yi Zhang. 2009. The CoNLL-2009 shared task: Syntactic and semantic dependencies in multiple languages. In Proceedings of the CoNLL, pages 1–18.

Yaru Hao, Li Dong, Furu Wei, and Ke Xu. 2019. Visualizing and understanding the effectiveness of BERT. In Proceedings of the EMNLP, pages 4141–4150.

Iris Hendrickx, Su Nam Kim, Zornitsa Kozareva, Preslav Nakov, Diarmuid Ó Séaghdha, Sebastian Padó, Marco Pennacchiotti, Lorenza Romano, and Stan Szpakowicz. 2010. SemEval-2010 task 8: Multi-way classification of semantic relations between pairs of nominals. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 33–38.

Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146.

Ganesh Jawahar, Benoît Sagot, and Djamé Seddah. 2019. What does BERT learn about the structure of language? In Proceedings of the ACL, pages 3651–3657.

Yoon Kim, Chris Dyer, and Alexander Rush. 2019a. Compound probabilistic context-free grammars for grammar induction. In Proceedings of the ACL, pages 2369–2385.

Yoon Kim, Alexander Rush, Lei Yu, Adhiguna Kuncoro, Chris Dyer, and Gábor Melis. 2019b. Unsupervised recurrent neural network grammars. In Proceedings of the NAACL, pages 1105–1117.

Nikita Kitaev and Dan Klein. 2018. Constituency parsing with a self-attentive encoder. In Proceedings of the ACL, pages 2676–2686.

Olga Kovaleva, Alexey Romanov, Anna Rogers, and Anna Rumshisky. 2019. Revealing the dark secrets of BERT. CoRR, abs/1908.08593.

Adhiguna Kuncoro, Chris Dyer, John Hale, Dani Yogatama, Stephen Clark, and Phil Blunsom. 2018. LSTMs can learn syntax-sensitive dependencies well, but modeling structure makes them better. In Proceedings of the ACL, pages 1426–1436.

Hongyin Luo, Lan Jiang, Yonatan Belinkov, and James Glass. 2019. Improving neural language models by segmenting, attending, and predicting the future. In Proceedings of the ACL, pages 1483–1493.

Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2017. Pointer sentinel mixture models. In Proceedings of the ICLR.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In Proceedings of the NIPS, pages 3111–3119.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the EMNLP, pages 1532–1543.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the NAACL, pages 2227–2237.

Maxim Rabinovich, Mitchell Stern, and Dan Klein. 2017. Abstract syntax networks for code generation and semantic parsing. In Proceedings of the ACL, pages 1139–1149.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. Technical Report.

Yikang Shen, Zhouhan Lin, Chin-Wei Huang, and Aaron C. Courville. 2018a. Neural language modeling by jointly learning syntax and lexicon. In Proceedings of the ICLR.

Yikang Shen, Shawn Tan, Alessandro Sordoni, and Aaron C. Courville. 2018b. Ordered neurons: Integrating tree structures into recurrent neural networks. CoRR, abs/1810.09536.

Richard Socher, Christopher D. Manning, and Andrew Y. Ng. 2010. Learning continuous phrase representations and syntactic parsing with recursive neural networks. In Proceedings of the NIPS-2010 Deep Learning and Unsupervised Feature Learning Workshop.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the EMNLP, pages 1631–1642.

Kai Sheng Tai, Richard Socher, and Christopher D. Manning. 2015. Improved semantic representations from tree-structured long short-term memory networks. In Proceedings of the ACL, pages 1556–1566.

Ian Tenney, Patrick Xia, Berlin Chen, Alex Wang, Adam Poliak, R. Thomas McCoy, Najoung Kim, Benjamin Van Durme, Samuel R. Bowman, Dipanjan Das, and Ellie Pavlick. 2019. What do you learn from context? Probing for sentence structure in contextualized word representations. In Proceedings of the ICLR.

Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the CoNLL, pages 142–147.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Jesse Vig and Yonatan Belinkov. 2019. Analyzing the structure of attention in a transformer language model. CoRR, abs/1906.04284.

Xinyi Wang, Hieu Pham, Pengcheng Yin, and Graham Neubig. 2018. A tree-based decoder for neural machine translation. In Proceedings of the EMNLP, pages 4772–4777.

Yaushian Wang, Hung-Yi Lee, and Yun-Nung Chen. 2019. Tree Transformer: Integrating tree structures into self-attention. In Proceedings of the EMNLP, pages 1061–1070.

Adina Williams, Andrew Drozdov, and Samuel R. Bowman. 2017. Learning to parse from a semantic objective: It works. Is it syntax? CoRR, abs/1709.01121.

Qingrong Xia, Zhenghua Li, Min Zhang, Meishan Zhang, Guohong Fu, Rui Wang, and Luo Si. 2019. Syntax-aware neural semantic role labeling. In Proceedings of the AAAI, pages 7305–7313.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. CoRR, abs/1906.08237.

Majid Yazdani and James Henderson. 2015. Incremental recurrent neural network dependency parser with search-based discriminative training. In