Mimic and Conquer: Heterogeneous Tree Structure Distillation for Syntactic NLP
Hao Fei, Yafeng Ren and Donghong Ji∗
1. Department of Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, School of Cyber Science and Engineering, Wuhan University, China
2. Guangdong University of Foreign Studies, China
{hao.fei, renyafeng, dhji}@whu.edu.cn
∗ Corresponding author.

Abstract
Syntax has been shown useful for various NLP tasks, while existing work mostly encodes a single syntactic tree with one hierarchical neural network. In this paper, we investigate a simple and effective method, Knowledge Distillation, to integrate heterogeneous structure knowledge into a unified sequential LSTM encoder. Experimental results on four typical syntax-dependent tasks show that our method outperforms tree encoders by effectively integrating rich heterogeneous structural syntax while reducing error propagation, and also outperforms ensemble methods in terms of both efficiency and accuracy.
Integrating syntactic information into neural networks has received increasing attention in natural language processing (NLP), and has been used for a wide range of end tasks, such as sentiment analysis (SA) (Nguyen and Shirai, 2015; Teng and Zhang, 2017; Looks et al., 2017; Zhang and Zhang, 2019), neural machine translation (NMT) (Cho et al., 2014; Garmash and Monz, 2015; Gū et al., 2018), language modeling (Yazdani and Henderson, 2015; Zhang et al., 2016; Zhou et al., 2017), semantic role labeling (SRL) (Marcheggiani and Titov, 2017; Strubell et al., 2018; Fei et al., 2020c), natural language inference (NLI) (Tai et al., 2015a; Liu et al., 2018) and text classification (Chen et al., 2015; Zhang et al., 2018b). Despite the usefulness of structure knowledge, most existing models use only a single syntactic tree, such as a constituency or a dependency tree.

Constituent and dependency representations of syntactic structure share underlying linguistic and computational characteristics, while they also differ in various aspects.
Figure 1: An example sentence, "John visited his brother at school last week", illustrating the mutual benefit of constituency and dependency tree structures: (1) the constituency tree structure, (2) the semantic role labels (A0, visit.01, A1, AM-LOC, AM-TMP), (3) the example sentence, (4) the dependency tree structure.
For example, the former focuses on revealing the continuity of phrases, while the latter is more effective in representing the dependencies among elements. By integrating the two representations from heterogeneous trees, their mutual benefit has been explored for joint parsing tasks (Collins, 1997; Charniak and Johnson, 2005; Farkas et al., 2011; Yoshikawa et al., 2017; Zhou and Zhao, 2019). Intuitively, the complementary advantages of heterogeneous trees can facilitate a range of NLP tasks, especially syntax-dependent ones such as SRL and NLI. Take the sentence in Figure 1, which shows an example from the SRL task (we consider span-based SRL, which aims to annotate the phrasal span of all semantic arguments). In this case, the dependency links can locate the relations between arguments and predicates more efficiently, while the constituency structure can aggregate the phrasal spans of arguments and guide the global path to the predicate. Integrating the features of the two structures can better guide the model to focus on the most suitable phrasal granularity (as circled by the dotted box in Figure 1), and also ensure the route consistency between the semantic objective pairs.

In this paper, we investigate the Knowledge Distillation (KD) method, which has been shown effective for knowledge ensembling (Hinton et al., 2015; Kim and Rush, 2016; Furlanello et al., 2018), for heterogeneous structure integration. Specifically, we employ a sequential LSTM as the student for distilling heterogeneous syntactic structures from various teacher tree encoders, such as GCN (Kipf and Welling, 2017) and TreeLSTM (Tai et al., 2015a). We consider output distillation, syntactic feature injection and semantic learning. In addition, we introduce an alternative structure injection strategy to enhance the learning of heterogeneous syntactic representations within the shared sequential model. The distilled structure-aware student model can make inference from sequential word inputs alone, reducing the error accumulation from external parse tree annotations.

We conduct extensive experiments on a wide range of syntax-dependent tasks, including semantic role labeling, relation classification, natural language inference and sentiment classification. Results show that the distilled student outperforms tree encoders, verifying the advantage of integrating heterogeneous structures. The proposed method also outperforms existing ensemble methods and strong baseline systems, demonstrating its high effectiveness for structure information integration.

Previous work shows that integrating syntactic structure knowledge can improve the performance of NLP tasks (Socher et al., 2013; Cho et al., 2014; Nguyen and Shirai, 2015; Looks et al., 2017; Liu et al., 2018; Zhang and Zhang, 2019; Fei et al., 2020b). Generally, these methods inject either a standalone constituency tree or a dependency tree via tree encoders such as TreeLSTM (Socher et al., 2013; Tai et al., 2015a) or GCN (Kipf and Welling, 2017).
Based on the assumption that the dependency and constituency representations can be disentangled and coexist in one shared model, existing efforts have been devoted to joint constituent and dependency parsing, verifying the mutual benefit of these heterogeneous structures (Collins, 1997; Charniak, 2000; Charniak and Johnson, 2005; Farkas et al., 2011; Ren et al., 2013; Yoshikawa et al., 2017; Strzyz et al., 2019; Kato and Matsubara, 2019; Zhou and Zhao, 2019).
Figure 2: Overall framework of the proposed model. The student LSTM takes sequential word inputs, while the dependency and constituency teachers (GCN and TreeLSTM) take dependency and constituency tree inputs; ① and ② denote output distillation, ③ and ④ denote feature distillation.

However, little attention has been paid to facilitating syntax-dependent tasks by integrating heterogeneous syntactic trees. Although such integration can be achieved via widely employed approaches, such as ensemble learning (Wolpert, 1992; Ju et al., 2019) and multi-task training (Liu et al., 2016; Chen et al., 2018; Fei et al., 2020a), these approaches usually suffer from low efficiency and high computational complexity.
Our work is related to knowledge distillation techniques. It has been shown that KD is effective and scalable for knowledge ensembling (Hinton et al., 2015; Furlanello et al., 2018), and existing methods fall into two categories: 1) output distillation, which uses a teacher model's output logits as the training objective of a student model (Kim and Rush, 2016; Vyas and Carpuat, 2019; Clark et al., 2019); and 2) feature distillation, which allows a student to learn from a teacher's intermediate feature representations (Zagoruyko and Komodakis, 2017; Sun et al., 2019). In this paper, we enhance the distillation of heterogeneous structures via both output and feature distillation, employing a sequential LSTM as the student. Our work is also closely related to Kuncoro et al. (2019), who distill syntactic structure knowledge into a student LSTM model. The difference lies in that they focus on transferring tree knowledge from a syntax-aware language model to achieve scalable unsupervised syntax induction, while we aim at integrating heterogeneous syntax for improving downstream tasks.
As shown in Figure 2, the overall architecture consists of a sequential LSTM (Hochreiter and Schmidhuber, 1997) student and several tree teachers for dependency and constituency structures.

Tree Encoder Teachers
Different tree models can encode the same tree structure, resulting in different heterogeneous tree representations. Following previous work (Tai et al., 2015b; Marcheggiani and Titov, 2017; Zhang and Zhang, 2019), we consider encoding dependency trees by a Child-Sum TreeLSTM and constituency trees by an N-ary TreeLSTM. We also employ GCN to encode dependency and constituency structures separately. We employ bidirectional tree encoders to fully capture the structural information interaction. Formally, we denote $X = \{x_1, \cdots, x_n\}$ as an input sentence, $X^{dep} = \{x^{dep}_1, \cdots, x^{dep}_n\}$ as the dependency tree and $X^{con} = \{x^{con}_1, \cdots, x^{con}_n\}$ as the constituency tree.

Encoding dependency structure.
We first use the standard Child-Sum TreeLSTM to encode the dependency structure, where each node j in the tree takes as input the embedding vector $x^{dep}_j$ corresponding to the head word. The conventional bottom-up computation is:

  \tilde{h}_j = \sum_{k \in C(j)} h_k
  i_j = \sigma(W^{(i)} x^{dep}_j + U^{(i)} \tilde{h}_j + b^{(i)})
  f_{jk} = \sigma(W^{(f)} x^{dep}_j + U^{(f)} h_k + b^{(f)})
  o_j = \sigma(W^{(o)} x^{dep}_j + U^{(o)} \tilde{h}_j + b^{(o)})
  u_j = \tanh(W^{(u)} x^{dep}_j + U^{(u)} \tilde{h}_j + b^{(u)})
  c_j = i_j \odot u_j + \sum_{k \in C(j)} f_{jk} \odot c_k
  h_j = o_j \odot \tanh(c_j)        (1)

where W, U and b are parameters, C(j) refers to the set of child nodes of j, and $h_j$, $i_j$, $o_j$ and $c_j$ are the hidden state, input gate, output gate and memory cell of node j, respectively. $f_{jk}$ is a forget gate for each child k of j, $\sigma(\cdot)$ is an activation function and $\odot$ is element-wise multiplication. The top-down TreeLSTM has the same transition equations as the bottom-up TreeLSTM, except that the direction and the number of dependent nodes differ. We concatenate the tree representations of the two directions for each node: $h^{bi}_j = [h^{\uparrow}_j; h^{\downarrow}_j]$.

Compared with TreeLSTM, GCN is more computationally efficient, performing the tree propagation for the nodes in parallel with O(1) complexity. Consider the constructed dependency graph G = (V, E), where V is the set of nodes and E is the set of bidirectional edges between heads and dependents. GCN can be viewed as a hierarchical node encoder, representing node j at the l-th layer as follows:

  g^l_i = \sigma(W^l h^l_i + b^l)        (2)
  h^l_j = \mathrm{ReLU}\big(\sum_{i \in N(j)} x^l_i \odot g^l_i\big)        (3)

where N(j) denotes the neighbors of node j and ReLU is a non-linear activation function. For dependency encoding by TreeLSTM or GCN, we make use of all the node representations $R^{dep} = [r^{dep}_1, \cdots, r^{dep}_n]$ within the whole tree structure for the subsequent distillation.
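To make the bottom-up update in Eq. (1) concrete, the following is a minimal PyTorch sketch of a Child-Sum TreeLSTM cell. The class name, tensor shapes and the packing of the gate parameters into two linear layers are our own illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class ChildSumTreeLSTMCell(nn.Module):
    """Sketch of the bottom-up Child-Sum TreeLSTM update of Eq. (1)."""
    def __init__(self, input_dim: int, hidden_dim: int):
        super().__init__()
        # W(*) act on the head-word embedding; U(*) act on child hidden states
        self.w = nn.Linear(input_dim, 4 * hidden_dim)          # i, o, u and the input part of f
        self.u_iou = nn.Linear(hidden_dim, 3 * hidden_dim, bias=False)
        self.u_f = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.hidden_dim = hidden_dim

    def forward(self, x_j, child_h, child_c):
        """x_j: (input_dim,); child_h, child_c: (num_children, hidden_dim)."""
        h_tilde = child_h.sum(dim=0)                            # \tilde{h}_j = sum_k h_k
        w_i, w_o, w_u, w_f = self.w(x_j).split(self.hidden_dim)
        u_i, u_o, u_u = self.u_iou(h_tilde).split(self.hidden_dim)
        i = torch.sigmoid(w_i + u_i)                            # input gate
        o = torch.sigmoid(w_o + u_o)                            # output gate
        u = torch.tanh(w_u + u_u)                               # candidate cell
        f = torch.sigmoid(w_f.unsqueeze(0) + self.u_f(child_h)) # one forget gate per child
        c = i * u + (f * child_c).sum(dim=0)                    # memory cell c_j
        h = o * torch.tanh(c)                                   # hidden state h_j
        return h, c
```

In use, the cell would be applied to the nodes of a dependency tree in topological order (leaves first), and a mirrored top-down pass would supply the second direction whose states are concatenated as described above.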
Encoding constituency structure. We employ the N-ary TreeLSTM to encode the constituent tree:

  i_j = \sigma(W^{(i)} x^{con}_j + \sum_{q=1}^{N} U^{(i)}_q h_{jq} + b^{(i)})
  f_{jk} = \sigma(W^{(f)} x^{con}_j + \sum_{q=1}^{N} U^{(f)}_{kq} h_{jq} + b^{(f)})
  o_j = \sigma(W^{(o)} x^{con}_j + \sum_{q=1}^{N} U^{(o)}_q h_{jq} + b^{(o)})
  u_j = \tanh(W^{(u)} x^{con}_j + \sum_{q=1}^{N} U^{(u)}_q h_{jq} + b^{(u)})
  c_j = i_j \odot u_j + \sum_{q=1}^{N} f_{jq} \odot c_{jq}
  h_j = o_j \odot \tanh(c_j)        (4)

where q indexes the branches of node j. Slightly different from the Child-Sum TreeLSTM, the separate parameter matrices for each child k allow the model to learn more fine-grained and order-sensitive children information. We also concatenate the bottom-up and top-down directions of each node as the final representation.

Similarly, GCN is used to encode the constituent graph G = (V, E) via Eq. (2) and (3). Note that the node set V contains both words and constituent labels. For constituency encoding by both TreeLSTM and GCN, we take the representations of the terminal nodes in the structure as the corresponding word representations $R^{con} = [r^{con}_1, \cdots, r^{con}_n]$.
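The gated aggregation of Eqs. (2)-(3) can likewise be sketched as a single GCN layer over either tree; the adjacency-matrix formulation and variable names below are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class SyntacticGCNLayer(nn.Module):
    """Rough sketch of one gated GCN layer in the spirit of Eqs. (2)-(3)."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(dim, dim)      # g_i = sigma(W^l h_i^l + b^l)

    def forward(self, x, h, adj):
        """x: (n, dim) node inputs; h: (n, dim) previous-layer node states;
        adj: (n, n) 0/1 matrix with adj[j, i] = 1 iff node i is a neighbour of node j
        (edges are added in both directions between heads and dependents)."""
        g = torch.sigmoid(self.gate(h))      # per-node gates
        return torch.relu(adj @ (x * g))     # h_j = ReLU(sum_{i in N(j)} x_i ⊙ g_i)
```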
Sequential models have been proven effective for encoding syntactic tree information (Shen et al., 2018; Kuncoro et al., 2019). We therefore set the goal of KD as simultaneously distilling heterogeneous structures from the tree encoder teachers into an LSTM student model. We denote $\Gamma(dep) = \{\gamma(\mathrm{TreeLSTM}), \gamma(\mathrm{GCN})\}$ as the dependency teachers, $\Gamma(con) = \{\gamma(\mathrm{TreeLSTM}), \gamma(\mathrm{GCN})\}$ as the constituency teachers, and $\Gamma(all) = \Gamma(dep) \cup \Gamma(con)$ as the overall teachers. The objective of the student model can be decomposed into three terms: an output distillation target, a semantic target, and a syntactic target.

Output distillation. The output logits serve as soft targets, providing richer supervision than the hard one-hot gold labels (Hinton et al., 2015). Given an input sentence X with the (one-hot) gold label Y, the output logits of the teachers are $P^t_{\Gamma(all)}$ and the output logits of the student are $P^s$. The output distillation is:

  \mathcal{L}_{output} = H\big([\alpha Y + (1-\alpha) P^t_{\Gamma(all)}],\ P^s\big)        (5)

where H(·,·) denotes the cross-entropy. α is a coupling factor which increases from 0 to 1 during training, namely teacher annealing (Clark et al., 2019).
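A hedged sketch of the mixed soft target in Eq. (5); the function name, the assumption that the teachers' predictions have already been averaged into one distribution, and the linear annealing of α are illustrative choices of ours.

```python
import torch
import torch.nn.functional as F

def output_distillation_loss(student_logits, teacher_probs, gold_onehot, alpha):
    """Eq. (5): cross-entropy against a mix of gold labels and teacher soft predictions.

    student_logits: (batch, classes) raw student scores P^s
    teacher_probs:  (batch, classes) aggregated teacher distribution P^t_{Gamma(all)}
    gold_onehot:    (batch, classes) one-hot gold labels Y
    alpha:          coupling factor annealed from 0 to 1 over training (teacher annealing)
    """
    target = alpha * gold_onehot + (1.0 - alpha) * teacher_probs   # soft target
    log_p_s = F.log_softmax(student_logits, dim=-1)
    return -(target * log_p_s).sum(dim=-1).mean()                  # H(target, P^s)
```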
Syntactic tree feature distillation. In order to capture rich syntactic tree features, we allow the student to directly learn from the teachers' hidden feature representations. Specifically, we denote the hidden representations of the student LSTM as $R^s = [r^s_1, \cdots, r^s_n]$, and we expect $R^s$ to be able to predict the output of $R^{dep}$ or $R^{con}$ from the syntax-aware teachers. Thus the target is to optimize the following regression loss:

  \mathcal{L}^{(A)}_{dep} = \frac{1}{2} \sum_{j=1}^{n} \| f^t_{\Gamma(dep)}(r^{dep}_j) - f^s(r^s_j) \|        (6)
  \mathcal{L}^{(A)}_{con} = \frac{1}{2} \sum_{j=1}^{n} \| f^t_{\Gamma(con)}(r^{con}_j) - f^s(r^s_j) \|        (7)
  \mathcal{L}^{(A)}_{syn} = \eta \mathcal{L}^{(A)}_{dep} + (1-\eta) \mathcal{L}^{(A)}_{con}        (8)

where η ∈ [0, 1] is a factor coordinating the dependency and constituency structure encoding, $f^t_{\Gamma(dep)}(\cdot)$, $f^t_{\Gamma(con)}(\cdot)$ and $f^s(\cdot)$ are feedforward layers for calculating the corresponding score vectors, and j is the word index.
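A minimal sketch of the feature mimic of Eqs. (6)-(8); the use of plain linear projections for $f^t$ and $f^s$, the squared L2 regression, and the dimensions are our assumptions for illustration.

```python
import torch
import torch.nn as nn

class FeatureDistillLoss(nn.Module):
    """Eqs. (6)-(8): the student representations regress onto projected teacher features."""
    def __init__(self, student_dim, teacher_dim, proj_dim, eta=0.5):
        super().__init__()
        self.f_s = nn.Linear(student_dim, proj_dim)     # f^s(.)
        self.f_dep = nn.Linear(teacher_dim, proj_dim)   # f^t_{Gamma(dep)}(.)
        self.f_con = nn.Linear(teacher_dim, proj_dim)   # f^t_{Gamma(con)}(.)
        self.eta = eta

    def forward(self, r_student, r_dep, r_con):
        """All inputs: (n, dim) word-level representations of one sentence."""
        proj_s = self.f_s(r_student)
        loss_dep = 0.5 * (self.f_dep(r_dep) - proj_s).pow(2).sum(-1).sum()   # Eq. (6)
        loss_con = 0.5 * (self.f_con(r_con) - proj_s).pow(2).sum(-1).sum()   # Eq. (7)
        return self.eta * loss_dep + (1.0 - self.eta) * loss_con             # Eq. (8)
```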
Semantic learning. We randomly mask a target input word $Q_j$ and let the LSTM predict the word based on the hidden representation of its prior words. We thus pose the following language-modeling objective:

  \mathcal{L}_{sem} = \sum_{j=1}^{M} H\big(Q_j,\ P^s_j \mid X_{[1,\cdots,j-1]}\big)        (9)

by which the LSTM can additionally improve its ability of semantic learning.
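A small sketch of the masked-word objective in Eq. (9). The vocabulary projection layer and the convention that position j is predicted from the LSTM state at j-1 are assumptions introduced here for illustration.

```python
import torch
import torch.nn.functional as F

def semantic_mask_loss(lstm_hidden, vocab_proj, masked_positions, masked_word_ids):
    """Eq. (9): predict each masked word Q_j from the prefix representation.

    lstm_hidden:      (n, hidden) unidirectional LSTM states (state j-1 only sees x_{1..j-1})
    vocab_proj:       an nn.Linear(hidden, vocab_size) classifier (our assumption)
    masked_positions: list of masked indices j (>= 1)
    masked_word_ids:  gold vocabulary ids Q_j of the masked words
    """
    losses = []
    for j, gold in zip(masked_positions, masked_word_ids):
        logits = vocab_proj(lstm_hidden[j - 1])
        target = torch.tensor([gold], device=logits.device)
        losses.append(F.cross_entropy(logits.unsqueeze(0), target))
    return torch.stack(losses).mean()
```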
We consider further enhancing the tree injection by encouraging the student to mimic the dependency and constituency tree induction of the teachers.

Dependency injection.
We force the student to predict the distributions of dependency arcs and labels based on its hidden representations, matching the corresponding distributions of the teachers:

  \mathcal{L}^{(B)}_{dep} = \sum_{j}^{n} \sum_{i}^{n} H\big(P^t_{\Gamma(dep)}(r_j \mid x_i),\ P^s(r_j \mid x_i)\big) + \sum_{j}^{n} \sum_{i}^{n} \sum_{k}^{L} H\big(P^t_{\Gamma(dep)}(l_k \mid r_j, x_i),\ P^s(l_k \mid r_j, x_i)\big)        (10)

where $P^t_{\Gamma(dep)}(r_j \mid x_i)$ is the arc probability of the parent node $r_j$ for $x_i$ in the dependency teacher, and $P^t_{\Gamma(dep)}(l_k \mid r_j, x_i)$ is the probability of the label $l_k$ for the arc $(r_j, x_i)$ in the teacher.
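A rough sketch of the arc and label terms of Eq. (10); the tensor shapes (a head distribution per word and a label distribution per candidate arc) are our illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def dependency_injection_loss(student_arc_logits, student_label_logits,
                              teacher_arc_probs, teacher_label_probs):
    """Eq. (10): the student mimics the teacher's arc and label distributions.

    arc tensors:   (n, n)    -- P(r_j | x_i), a distribution over heads for each word x_i
    label tensors: (n, n, L) -- P(l_k | r_j, x_i) for each candidate arc
    """
    arc_ce = -(teacher_arc_probs * F.log_softmax(student_arc_logits, dim=-1)).sum()
    label_ce = -(teacher_label_probs * F.log_softmax(student_label_logits, dim=-1)).sum()
    return arc_ce + label_ce
```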
Constituency injection. Similarly, to enhance the constituency injection, we mimic the distribution of each span (i, j) with label l in the teachers. Following Zhou et al. (2019), we adopt a feedforward layer as the span scorer:

  \mathrm{Scr}(t) = \sum_{(i,j) \in t} \sum_{k} f(i, j, l)        (11)

We use the CYK algorithm (Cocke, 1970; Younger, 1975; Kasami, 1965) to search for the highest-scoring tree T* in the teachers, and all possible trees T in the student. Then we optimize the following hinge loss between the structures in the student and the teachers:

  \mathcal{L}^{(B)}_{con} = \max\big(0,\ \max_{t \in T}(\mathrm{Scr}(t) + \Delta(t, T^*)) - \mathrm{Scr}(T^*)\big)        (12)

where Δ is the Hamming distance. The above syntax losses in Eq. (10) and (12) can substitute the ones in Eq. (6) and (7), respectively. The overall objective of the structure injection is:

  \mathcal{L}^{(B)}_{syn} = \eta \mathcal{L}^{(B)}_{dep} + (1-\eta) \mathcal{L}^{(B)}_{con}        (13)

Regularization. Under the independence assumption, the syntax feature distillations aim to learn representations that are as diversified and private to each heterogeneous structure as possible. In practice, there should be a latent shared structure in the parameter space, and the separate distillations would squeeze such shared features, weakening the expressiveness of the learnt representations. To avoid this, we additionally impose a regularization on Eq. (6), (7), (10) and (12):

  \mathcal{L}_{reg} = \zeta \|\Theta\|_2        (14)

where Θ denotes the overall parameters of the student.

As summarized in Algorithm 1, during the first $G_1$ iterations we alternate, every $G_2$ iterations, between processing the dependency or the constituency injection from a tree teacher (lines 13-17), to keep the training stable. After $G_1$ training iterations, we optimize the overall loss (line 20):

  \mathcal{L}_{all} = \mathcal{L}_{output} + \lambda_1 \mathcal{L}_{syn} + \lambda_2 \mathcal{L}_{sem}        (15)

where $\lambda_1$ and $\lambda_2$ are coefficients regulating the corresponding objectives. $\mathcal{L}_{syn}$ can be either $\mathcal{L}^{(A)}_{syn}$ (Eq. 8) or $\mathcal{L}^{(B)}_{syn}$ (Eq. 13); that is, the syntax sources come simultaneously from the two tree structures (dependency and constituency) at one time. During inference, the well-trained student makes predictions alone with only sequential word input.
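Before turning to the training procedure, here is a rough sketch of how the pieces combine into the objective of Eqs. (13)-(15). Treating the regularizer of Eq. (14) as an additive term, using a squared L2 penalty, and the default coefficients (which simply mirror the values reported in the settings below) are our own reading, not a statement of the authors' exact implementation.

```python
def overall_objective(loss_output, loss_dep, loss_con, loss_sem, student_params,
                      eta=0.5, lambda1=0.6, lambda2=0.2, zeta=0.2):
    """Combine the component losses into L_all (Eq. 15) with the syntax mix of Eq. (13)/(8)
    and an L2-style regularizer in the spirit of Eq. (14)."""
    loss_syn = eta * loss_dep + (1.0 - eta) * loss_con                 # Eq. (13) or Eq. (8)
    loss_reg = zeta * sum(p.pow(2).sum() for p in student_params)      # squared-L2 stand-in for Eq. (14)
    return loss_output + lambda1 * loss_syn + lambda2 * loss_sem + loss_reg
```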
Algorithm 1: Distill heterogeneous trees.
Input: Training set $(X, X^{dep}, X^{con}, Y)$; total iterations T; syntax turn gaps $G_1$, $G_2$; syntax flag F = True.
Output: Student model.
1:  while t < T do
2:    if t ≤ G_1 then
3:      if t mod G_2 == 0 then
4:        F ← !F
5:      end
6:      P^s ← Student(X)
7:      optimize L_sem in Eq. (9)
8:      for γ(model) ∈ Γ(all) do
9:        P^t_{Γ(dep)}, r^{dep}_j ← γ(model)(X^{dep})
10:       P^t_{Γ(con)}, r^{con}_j ← γ(model)(X^{con})
11:       P^t_{Γ(all)} = P^t_{Γ(dep)} ∪ P^t_{Γ(con)}
12:       optimize L_output in Eq. (5)
13:       if F then
14:         optimize L_dep in Eq. (6) or (10)    // dependency learning
15:       else
16:         optimize L_con in Eq. (7) or (12)    // constituency learning
17:       end
18:     end
19:   else
20:     optimize L_all in Eq. (15)
21:   end
22: end

We use a 3-layer BiLSTM as our student, and a 2-layer architecture for all tree teachers. The default word embeddings are initialized randomly, and the dimension of word embeddings is set to 300. The hidden size is set to 350 in the student LSTM and 300 in the teacher models. We adopt the Adam optimizer with an initial learning rate of 1e-5. We use mini-batches of 32 within a total of 10k (T) iterations with early stopping, and apply a 0.4 dropout ratio to all embeddings. We set the coefficients $\lambda_1$, $\lambda_2$, ζ and η to 0.6, 0.2, 0.2 and 0.5, respectively. The training iteration thresholds $G_1$ and $G_2$ are set to 300 and 128, respectively. These values achieve the best performance in the development experiments.
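The alternating schedule of Algorithm 1 can be sketched as the loop below. The loss callables, the data-loader format and the optimizer handling are placeholders supplied by the caller; this is an assumed outline of the schedule, not the authors' training code.

```python
from itertools import cycle

def train_student(student, teachers, data_loader, optimizer, losses,
                  total_iters=10_000, g1=300, g2=128):
    """Sketch of Algorithm 1. `losses` is a dict of callables:
    losses["sem"], losses["output"], losses["dep"], losses["con"], losses["all"]
    returning the objectives of Eqs. (9), (5), (6)/(10), (7)/(12) and (15)."""
    syntax_flag = True                              # F: True = dependency turn
    for t, batch in enumerate(cycle(data_loader), start=1):
        if t > total_iters:
            break
        if t <= g1:                                 # warm-up stage with alternation
            if t % g2 == 0:
                syntax_flag = not syntax_flag       # F <- !F every G2 iterations (lines 3-5)
            loss = losses["sem"](student, batch)                      # Eq. (9)
            for teacher in teachers:                                  # the four tree teachers
                loss = loss + losses["output"](student, teacher, batch)   # Eq. (5)
                key = "dep" if syntax_flag else "con"                 # lines 13-17
                loss = loss + losses[key](student, teacher, batch)
        else:
            loss = losses["all"](student, teachers, batch)            # line 20, Eq. (15)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```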
Baseline systems. We compare the following baselines: 1) sequential encoders: LSTM, attention-based LSTM (ATT-LSTM), Transformer (Vaswani et al., 2017) and sentence-state LSTM (S-LSTM) (Zhang et al., 2018a); 2) the tree encoders introduced in §2; 3) ensemble models: ensemble learning (EnSem) (Wolpert, 1992; Ju et al., 2019), the multi-task method (MTL) (Liu et al., 2016; Chen et al., 2018), adversarial training (AdvT) (Liu et al., 2017) and the tree communication model (TCM) (Zhang and Zhang, 2019). For EnSem, we simply concatenate the output representations of the tree encoders. For MTL, we use an underlying shared LSTM for parameter sharing across the tree encoders. For AdvT, we adopt the shared-private architecture (Liu et al., 2017) based on MTL. Following Zhang and Zhang (2019), for TCM, we initialize GCN for TreeLSTM to encode dependency and constituency trees respectively, and finally concatenate the output representations. Note that all the models in the Tree Ensemble group take the same four tree encoders used as our distillation teachers as their meta-encoders. 4) Other baselines: ESIM (Chen et al., 2017), local-global pattern based self-attention networks (LG-SANs) (Xu et al., 2019) and BERT.
Tasks and evaluation.
The experiments are conducted on four representative syntax-dependent tasks: 1) Rel, relation classification on SemEval-2010 (Hendrickx et al., 2010); 2) NLI, sentence-pair classification on the Stanford NLI corpus (Bowman et al., 2015); 3) SST, binary sentiment classification on the Stanford Sentiment Treebank (Socher et al., 2013); 4) SRL, semantic role labeling on the CoNLL-2012 OntoNotes data (Pradhan et al., 2013). For NLI, we take the element-wise product, subtraction, addition and concatenation of the two separate sentence representations as a whole. We mainly adopt the F1 score to evaluate the performance of the different models. The data splits follow previous work.
Tree annotations and resources.
The OntoNotes data offers gold annotations of the dependency and constituency structure. For the remaining datasets, we parse sentences via the state-of-the-art BiAffine dependency parser (Dozat and Manning, 2017) and the Self-Attentive constituency parser (Kitaev and Klein, 2018). The parsers are trained on the PTB (https://catalog.ldc.upenn.edu/LDC99T42). The dependency parser achieves 93.4% LAS, and the constituency parser achieves a 92.6% F1 score. Besides, we evaluate different contextualized word representations, such as ELMo (https://allennlp.org/elmo) and BERT (https://github.com/google-research/bert).

                        Rel    NLI    SST    SRL
• Sequential Encoder
LSTM                    80.5   79.6   82.3   76.6
ATT-LSTM                82.3   81.5   84.2   78.2
Transformer             84.7   84.2   85.0   80.5
S-LSTM                  85.0   84.8   86.2   82.0
• Standalone Tree Model
TreeLSTM+dep.           85.2   86.0   86.4   82.5
GCN+dep.                85.9   85.8   86.1   83.3
TreeLSTM+con.           85.0   86.8   87.6   82.2
GCN+con.                84.8   86.3   86.8   81.8
Avg.
• Tree Ensemble
EnSem                   85.5   87.0   86.0   81.4
MTL                     84.9   88.3   87.2   83.7
AdvT                    86.4   87.6   85.2   82.1
TCM                     85.7   88.8   88.4   83.0
Avg.
• Distilled Student
Best                    ∗      ∗      ∗      ∗
• Others
ESIM                    -      88.9   -      -
LG-SANs                 85.6   86.5   87.3   81.2
BERT                    91.3   92.1   94.4   86.0

Table 1: Main results on various end tasks. ∗ indicates p ≤

Experimental results of the different models are shown in Table 1, where several observations can be made. First, tree models encoding syntactic knowledge facilitate syntax-dependent tasks, outperforming sequential models by a substantial margin. Second, different tree encoders integrated with varying syntactic tree structures make different contributions to the tasks. For example, GCN with the dependency structure gives the best result for Rel, while TreeLSTM with the constituency tree achieves the best performance for
SST. Third, when integrating heterogeneous tree structures via tree ensemble methods, competitive performance can be obtained, showing the importance of integrating heterogeneous tree information. Finally, our distilled student model significantly outperforms all the baseline systems, demonstrating its high effectiveness on the integration of heterogeneous structure information.

Ablation results.
We ablate each part of our distilling method in Table 2. First, we find that the enhanced structure injection strategy ($\mathcal{L}^{(B)}_{syn}$) helps achieve the best results for all the tasks, compared with the latent syntax feature mimic ($\mathcal{L}^{(A)}_{syn}$). By ablating each distilling objective, we learn that the syntax tree distillation ($\mathcal{L}_{syn}$) is the kernel of our knowledge distillation for these syntax-dependent tasks, compared with semantic feature learning ($\mathcal{L}_{sem}$). Besides, both the introduced teacher annealing factor α and the regularization $\mathcal{L}_{reg}$ benefit the task performance. Finally, we explore recent contextualized word representations, including ELMo (Peters et al., 2018) and BERT (Devlin et al., 2019). Surprisingly, our distilled student model receives substantial performance improvements on all tasks. However, when removing the proposed syntax distillation from BERT, the performance drops, as shown in Table 1 (the vanilla BERT). (Note that a direct comparison with BERT is unfair, because the large number of pre-trained parameters can bring an overwhelming improvement.)

                                        Rel    NLI    SST    SRL
• Syntax Injection Strategy
+ L^(A)_syn
+ L^(B)_syn
• Distilling Objective (with L^(B)_syn)
w/o L_sem
w/o L_syn
w/o L_reg
w/o Tea.Anl.
• Contextualized Semantics (with L^(B)_syn)
+ELMo                                   90.6   91.6   92.4   85.1
+BERT

Table 2: Ablation results on the distilled student. 'Tea.Anl.' refers to teacher annealing. In 'Semantics', we replace semantic learning L_sem with pre-trained contextualized word representations.

                 Constituency   Dependency
TreeLSTM+dep.    28.31          73.92
GCN+dep.         19.11
TreeLSTM+con.
w/o L_reg

Table 3: Probing the upper bound of constituent and dependency syntactic structure.

We explore to what extent the distilled student manages to capture heterogeneous tree structure information. Following previous work (Conneau et al., 2018), we consider employing two syntactic probing tasks, including 1)
Constituent labeling, which assigns a non-terminal label to text spans within the phrase structure (e.g., Verb, Noun, etc.), and 2) Dependency labeling, which predicts the
Figure 3: Heterogeneous syntax distribution over the four tasks (Rel, NLI, SST, SRL). The predominance of dependency syntax is above 0.5, otherwise constituency.
Figure 4: Results (F1, %) under varying ratios of the training set for Rel, NLI, SST and SRL, comparing TreeLSTM+dep., GCN+dep., TreeLSTM+con., GCN+con., TCM and Student-Best.

relationship (edge) between two tokens (e.g., subject-object, etc.). We take the last-layer output representation as the probing objective. We compare the student model with the four teacher tree encoders separately, based on the SRL task. As shown in Table 3, the student LSTM gives slightly lower scores than the best tree models (i.e.,
GCN+dep. for dependency labeling,
TreeLSTM+con. for constituency labeling), showing its effectiveness in capturing syntax. Besides, we find that the regularization $\mathcal{L}_{reg}$ plays a key role in improving the expressive capability of the learnt representations.

Distributions of heterogeneous syntax in different tasks.
We also compare the distributions of dependency and constituency structures in different tasks after fine-tuning. Technically, for each example in the test set, we measure how much the performance drops when the student LSTM is trained only under standalone dependency or constituency injection (TreeLSTM or GCN), by controlling η = 0 or 1. Intuitively, the more the results drop, the more the model benefits from the corresponding syntax. For each task, we collect the sensitivity values, linearly normalize them into [0, 1] over all examples, and compute statistics. As plotted in Figure 3, the distributions of dependency and constituency syntax vary among tasks, verifying that different tasks depend on distinct types of structural knowledge, while integrating them together gives the best effect. For example, dependency structures support Rel and SRL, while NLI and SST benefit from constituency the most.
Figure 4 shows the performance of different models under varying ratios of the full training set. We find that the performance decreases with the reduction of the training data for all methods, while our distilled student achieves better results compared with most of the baselines. The underlying reasons are two-fold. First, the heterogeneous syntactic features provide strong representations that support better predictions. Second, the distilled student takes only sequential inputs, avoiding noise from parsed inputs to some extent. We also see that TreeLSTM/GCN+dep. can counteract the data reduction ( ≤ ) on the Rel and SRL tasks, showing that they rely more on dependency structures, while NLI and SST depend on constituency structures. In addition, the student starts underperforming the best model on the small data ( ≤ ).

Reducing error accumulation of tree annotation.
We investigate the effect of reducing noise from tree annotations. We compare the performance under different syntax sources. Table 4 shows the results on SRL. With only word inputs, our model still outperforms the baselines that take the gold syntax annotations. This partially shows that, without parsed tree annotations, the student model can avoid noise and error propagation. When we add the gold annotation as an additional signal, the performance can be further improved.
Efficiency study.
As shown in Figure 5, the student model has fewer parameters while maintaining faster decoding speed compared with the other ensemble models. Our sequential model is about 3 times smaller than AdvT and nearly 4 times faster than the tree ensemble methods. This observation coincides with previous studies (Kim and Rush, 2016; Sun et al., 2019; Clark et al., 2019).
System           Auto-Syn   Gold-Syn   w/o Syn
TreeLSTM+dep.    80.6       82.5       -
GCN+dep.         81.1       83.3       -
TreeLSTM+con.    79.6       82.2       -
GCN+con.         79.8       81.8       -
EnSem            80.5       81.4       -
MTL              81.2       83.7       -
AdvT             81.0       82.1       -
TCM              82.4       83.0       -
Student-Full     -          †

Table 4: Performance of different systems with automatically-parsed/gold syntax, and without syntax annotations. † indicates that we concatenate the additional gold syntactic labels with the other input features.

Figure 5: Comparisons on parameter scale (m) and decoding speed (sent/sec) for TreeLSTM+dep., TreeLSTM+const., EnSem, MTL, AdvT, TCM, GCN+dep., GCN+const. and the Student.
The enhanced structure injection objectives (Eq. (10) and (12)) enable the student LSTM to induce tree structures unsupervisedly at test time. To understand how the distilled model promotes the mutual learning of heterogeneous structures, we visualize the induced trees on a test example from SRL. As shown in Figure 6, the discovered dependency structures accurately match the gold tree, and the constituents are highly correlated with the gold ones. Besides, the edges indicating the two elements are augmented by the learning of each other, which in return enhances the recognition of the spans of the elements (yellow dotted boxes). For example, the constituent and dependency paths (green lines) linking the two minimal target spans, the Focus Today program and by Wang Shilin, are enhanced and echo each other via the core predicate. This reveals that our method can offer a deeper latent interaction between heterogeneous tree structures.
Figure 6: An SRL case for the sentence "Coming up is the Focus Today program hosted by Wang Shilin", showing (a) the gold structures and (b) the discovered structures. Here hosted is the predicate, the Focus Today program is A0, and by Wang Shilin is A1. Bold green lines indicate the edges with higher scores.

Conclusion

We investigated knowledge distillation for heterogeneous tree structure integration to facilitate NLP tasks, distilling syntactic knowledge into a sequential input encoder via both output-level and feature-level distillation. Results on four representative syntax-dependent tasks showed that the distilled student outperformed all standalone tree models as well as the commonly used ensemble methods, indicating the effectiveness of the proposed method. Further analysis demonstrated that our method enjoys high robustness and efficiency.
Acknowledgments
This work is supported by the National Natural Science Foundation of China (No. 61772378, No. 61702121), the National Key Research and Development Program of China (No. 2017YFC1200500), the Research Foundation of the Ministry of Education of China (No. 18JZD015), the Major Projects of the National Social Science Foundation of China (No. 11&ZD189), the Key Project of the State Language Commission of China (No. ZDI135-112) and the Guangdong Basic and Applied Basic Research Foundation of China (No. 2020A151501705).
References
Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of EMNLP, pages 632–642.
Eugene Charniak. 2000. A maximum-entropy-inspired parser. In Proceedings of NAACL.
Eugene Charniak and Mark Johnson. 2005. Coarse-to-fine n-best parsing and MaxEnt discriminative reranking. In Proceedings of ACL, pages 173–180.
Junkun Chen, Xipeng Qiu, Pengfei Liu, and Xuanjing Huang. 2018. Meta multi-task learning for sequence modeling. In Proceedings of AAAI, pages 5070–5077.
Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Si Wei, Hui Jiang, and Diana Inkpen. 2017. Enhanced LSTM for natural language inference. In Proceedings of ACL, pages 1657–1668.
Xinchi Chen, Xipeng Qiu, Chenxi Zhu, Shiyu Wu, and Xuanjing Huang. 2015. Sentence modeling with gated recursive neural network. In Proceedings of EMNLP, pages 793–798.
Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder–decoder approaches. In Proceedings of the Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 103–111.
Kevin Clark, Minh-Thang Luong, Urvashi Khandelwal, Christopher D. Manning, and Quoc V. Le. 2019. BAM! born-again multi-task networks for natural language understanding. In Proceedings of ACL, pages 5931–5937.
John Cocke. 1970. Programming languages and their compilers: preliminary notes.
Michael Collins. 1997. Three generative, lexicalised models for statistical parsing. In Proceedings of ACL, pages 16–23.
Alexis Conneau, German Kruszewski, Guillaume Lample, Loïc Barrault, and Marco Baroni. 2018. What you can cram into a single $&!#* vector: Probing sentence embeddings for linguistic properties. In Proceedings of ACL, pages 2126–2136.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL, pages 4171–4186.
Timothy Dozat and Christopher D. Manning. 2017. Deep biaffine attention for neural dependency parsing. In Proceedings of ICLR.
Richárd Farkas, Bernd Bohnet, and Helmut Schmid. 2011. Features for phrase-structure reranking from dependency parses. In Proceedings of ICPT, pages 209–214.
Hao Fei, Yafeng Ren, and Donghong Ji. 2020a. Dispatched attention with multi-task learning for nested mention recognition. Information Science, 513:241–251.
Hao Fei, Yafeng Ren, and Donghong Ji. 2020b. A tree-based neural network model for biomedical event trigger detection. Information Science, 512:175–185.
Hao Fei, Meishan Zhang, Fei Li, and Donghong Ji. 2020c. Cross-lingual semantic role labeling with model transfer. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 28:2427–2437.
Tommaso Furlanello, Zachary Chase Lipton, Michael Tschannen, Laurent Itti, and Anima Anandkumar. 2018. Born-again neural networks. In Proceedings of ICML, pages 1602–1611.
Ekaterina Garmash and Christof Monz. 2015. Bilingual structured language models for statistical machine translation. In Proceedings of EMNLP, pages 2398–2408.
Jetic Gū, Hassan S. Shavarani, and Anoop Sarkar. 2018. Top-down tree structured decoding with syntactic connections for neural machine translation and parsing. In Proceedings of EMNLP, pages 401–413.
Iris Hendrickx, Su Nam Kim, Zornitsa Kozareva, Preslav Nakov, Diarmuid Ó Séaghdha, Sebastian Padó, Marco Pennacchiotti, Lorenza Romano, and Stan Szpakowicz. 2010. SemEval-2010 task 8: Multi-way classification of semantic relations between pairs of nominals. In Proceedings of SemEval, pages 33–38.
Geoffrey E. Hinton, Oriol Vinyals, and Jeffrey Dean. 2015. Distilling the knowledge in a neural network. CoRR, abs/1503.02531.
Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.
Ying Ju, Fubang Zhao, Shijie Chen, Bowen Zheng, Xuefeng Yang, and Yunfeng Liu. 2019. Technical report on conversational question answering. CoRR, abs/1909.10772.
Tadao Kasami. 1965. An efficient recognition and syntax analysis algorithm for context-free languages. Technical Report, Air Force Cambridge Research Lab.
Yoshihide Kato and Shigeki Matsubara. 2019. PTB graph parsing with tree approximation. In Proceedings of ACL, pages 5344–5349.
Yoon Kim and Alexander M. Rush. 2016. Sequence-level knowledge distillation. In Proceedings of EMNLP, pages 1317–1327.
Thomas N. Kipf and Max Welling. 2017. Semi-supervised classification with graph convolutional networks. In Proceedings of ICLR.
Nikita Kitaev and Dan Klein. 2018. Constituency parsing with a self-attentive encoder. In Proceedings of ACL, pages 2676–2686.
Adhiguna Kuncoro, Chris Dyer, Laura Rimell, Stephen Clark, and Phil Blunsom. 2019. Scalable syntax-aware language models using knowledge distillation. In Proceedings of ACL, pages 3472–3484.
Pengfei Liu, Xipeng Qiu, and Xuanjing Huang. 2016. Recurrent neural network for text classification with multi-task learning. In Proceedings of IJCAI, pages 2873–2879.
Pengfei Liu, Xipeng Qiu, and Xuanjing Huang. 2017. Adversarial multi-task learning for text classification. In Proceedings of ACL, pages 1–10.
Yang Liu, Matt Gardner, and Mirella Lapata. 2018. Structured alignment networks for matching sentences. In Proceedings of EMNLP, pages 1554–1564.
Moshe Looks, Marcello Herreshoff, DeLesley Hutchins, and Peter Norvig. 2017. Deep learning with dynamic computation graphs. In Proceedings of ICLR.
Diego Marcheggiani and Ivan Titov. 2017. Encoding sentences with graph convolutional networks for semantic role labeling. In Proceedings of EMNLP, pages 1506–1515.
Thien Hai Nguyen and Kiyoaki Shirai. 2015. PhraseRNN: Phrase recursive neural network for aspect-based sentiment analysis. In Proceedings of EMNLP, pages 2509–2514.
Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of NAACL, pages 2227–2237.
Sameer Pradhan, Alessandro Moschitti, Nianwen Xue, Hwee Tou Ng, Anders Björkelund, Olga Uryupina, Yuchen Zhang, and Zhi Zhong. 2013. Towards robust linguistic analysis using OntoNotes. In Proceedings of CoNLL, pages 143–152.
Xiaona Ren, Xiao Chen, and Chunyu Kit. 2013. Combine constituent and dependency parsing via reranking. In Proceedings of IJCAI, pages 2155–2161.
Yikang Shen, Shawn Tan, Alessandro Sordoni, and Aaron C. Courville. 2018. Ordered neurons: Integrating tree structures into recurrent neural networks. CoRR, abs/1810.09536.
Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of EMNLP, pages 1631–1642.
Emma Strubell, Patrick Verga, Daniel Andor, David Weiss, and Andrew McCallum. 2018. Linguistically-informed self-attention for semantic role labeling. In Proceedings of EMNLP, pages 5027–5038.
Michalina Strzyz, David Vilares, and Carlos Gómez-Rodríguez. 2019. Sequence labeling parsing by learning across representations. In Proceedings of ACL, pages 5350–5357.
Siqi Sun, Yu Cheng, Zhe Gan, and Jingjing Liu. 2019. Patient knowledge distillation for BERT model compression. In Proceedings of EMNLP, pages 4322–4331.
Kai Sheng Tai, Richard Socher, and Christopher D. Manning. 2015a. Improved semantic representations from tree-structured long short-term memory networks. In Proceedings of ACL, pages 1556–1566.
Kai Sheng Tai, Richard Socher, and Christopher D. Manning. 2015b. Improved semantic representations from tree-structured long short-term memory networks. arXiv preprint arXiv:1503.00075.
Zhiyang Teng and Yue Zhang. 2017. Head-lexicalized bidirectional tree LSTMs. TACL, 5:163–177.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of NIPS, pages 5998–6008.
Yogarshi Vyas and Marine Carpuat. 2019. Weakly supervised cross-lingual semantic relation classification via knowledge distillation. In Proceedings of EMNLP, pages 5284–5295.
David H. Wolpert. 1992. Stacked generalization. Neural Networks, 5(2):241–259.
Mingzhou Xu, Derek F. Wong, Baosong Yang, Yue Zhang, and Lidia S. Chao. 2019. Leveraging local and global patterns for self-attention networks. In Proceedings of ACL, pages 3069–3075.
Majid Yazdani and James Henderson. 2015. Incremental recurrent neural network dependency parser with search-based discriminative training. In Proceedings of CoNLL, pages 142–152.
Masashi Yoshikawa, Hiroshi Noji, and Yuji Matsumoto. 2017. A* CCG parsing with a supertag and dependency factored model. In Proceedings of ACL, pages 277–287.
Daniel H. Younger. 1975. Recognition and parsing of context-free languages in time n3. Information and Control, 10(2):189–208.
Sergey Zagoruyko and Nikos Komodakis. 2017. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. In Proceedings of ICLR.
Xingxing Zhang, Liang Lu, and Mirella Lapata. 2016. Top-down tree long short-term memory networks. In Proceedings of NAACL, pages 310–320.
Yuan Zhang and Yue Zhang. 2019. Tree communication models for sentiment analysis. In Proceedings of ACL, pages 3518–3527.
Yue Zhang, Qi Liu, and Linfeng Song. 2018a. Sentence-state LSTM for text representation. In Proceedings of ACL, pages 317–327.
Yuhao Zhang, Peng Qi, and Christopher D. Manning. 2018b. Graph convolution over pruned dependency trees improves relation extraction. In Proceedings of EMNLP, pages 2205–2215.
Ganbin Zhou, Ping Luo, Rongyu Cao, Yijun Xiao, Fen Lin, Bo Chen, and Qing He. 2017. Tree-structured neural machine for linguistics-aware sentence generation. CoRR, abs/1705.00321.
Junru Zhou and Hai Zhao. 2019. Head-driven phrase structure grammar parsing on Penn treebank. In