Entity Structure Within and Throughout: Modeling Mention Dependencies for Document-Level Relation Extraction
Benfeng Xu, Quan Wang, Yajuan Lyu, Yong Zhu, Zhendong Mao
School of Information Science and Technology, University of Science and Technology of China, Hefei, China
Baidu Inc., Beijing
[email protected], {wangquan05, lvyajuan, zhuyong}@baidu.com, [email protected]

Abstract
Entities, as the essential elements in relation extraction tasks, exhibit certain structure. In this work, we formulate such structure as distinctive dependencies between mention pairs. We then propose SSAN, which incorporates these structural dependencies within the standard self-attention mechanism and throughout the overall encoding stage. Specifically, we design two alternative transformation modules inside each self-attention building block to produce attentive biases so as to adaptively regularize its attention flow. Our experiments demonstrate the usefulness of the proposed entity structure and the effectiveness of SSAN. It significantly outperforms competitive baselines, achieving new state-of-the-art results on three popular document-level relation extraction datasets. We further provide ablation and visualization to show how the entity structure guides the model for better relation extraction. Our code is publicly available.
Introduction

Relation extraction aims at discovering relational facts from raw texts as structured knowledge. It is of great importance to many real-world applications such as knowledge base construction, question answering, and biomedical text analysis. Although early studies mainly limited this problem to an intra-sentence, single entity pair setting, many recent works have made efforts to extend it to document-level texts (Li et al. 2016a; Yao et al. 2019), making it a more practical but also more challenging task.

Document-level texts entail a large quantity of entities defined over multiple mentions, which naturally exhibit meaningful dependencies in between. Figure 1 gives an example from the recently proposed document-level relation extraction dataset DocRED (Yao et al. 2019), which illustrates several mention dependencies: 1)
Coming Down Again and the Rolling Stones, which both reside in the 1st sentence, are closely related, so we can identify R1: Performer (blue link) based on their local context; 2) Coming Down Again from the 1st sentence, It from the 2nd sentence, and The song from the 5th sentence refer to the same entity (red link), so it is necessary to consider and reason with them together; 3) the Rolling Stones from the 1st sentence and Mick Jagger from the 2nd sentence, though they display no direct connection, can be associated via two coreferential mentions, Coming Down Again and It, which is essential to predict the target relation R2: Member of (green link) between the two entities. A similar dependency also exists between the Rolling Stones and Nicky Hopkins, which helps identify R3: Member of between them. Intuitively, such dependencies indicate rich interactions among entity mentions, and thereby provide informative priors for relation extraction.

Figure 1: An example excerpted from DocRED. Different mention dependencies are distinguished by colored edges, with the target relations listed below.

* Work done while the first author was an intern at Baidu Inc.
† Our code is available at https://github.com/PaddlePaddle/Research/tree/master/KG/AAAI2021_SSAN and https://github.com/BenfengXu/SSAN.

Many previous works have tried to exploit such entity structure, in particular the coreference dependency. For example, it is a commonly used trick to simply encode coreferential information as extra features and integrate them into the initial input word embeddings. Verga, Strubell, and McCallum (2018) propose an adapted version of multi-instance learning to aggregate the predictions from coreferential mentions. Others directly apply average pooling to the representations of coreferential mentions (Yao et al. 2019). In summary, these heuristic techniques only use entity dependencies as complementary evidence in the pre- or post-processing stage, and thus bear limited modeling ability. Besides, most of them fail to include other meaningful dependencies in addition to coreference.

More recently, graph-based methods have shown great advantage in modeling entity structure (Sahu et al. 2019; Christopoulou, Miwa, and Ananiadou 2019; Nan et al. 2020). Typically, these methods rely on a general-purpose encoder, usually an LSTM, to first obtain contextual representations of an input document. They then introduce entity structure by constructing a delicately designed graph, over which entity representations are updated through propagation. This kind of approach, however, isolates the context reasoning stage from the structure reasoning stage due to the heterogeneity between the encoding network and the graph network, which means the contextual representations cannot benefit from structure guidance in the first place.

Instead, we argue that structural dependencies should be incorporated within the encoding network and throughout the overall system. To this end, we first formulate the aforementioned entity structure under a unified framework, where we define various mention dependencies that cover the interactions in between. We then propose SSAN (Structured Self-Attention Network), which is equipped with a novel extension of the self-attention mechanism (Vaswani et al. 2017) to effectively model these dependencies within its building blocks and through all network layers bottom-to-up. Note that although this paper focuses only on entity structure for document-level relation extraction, the method developed here is readily applicable to all kinds of Transformer-based pretrained language models to incorporate arbitrary structural dependencies.

To demonstrate the effectiveness of the proposed approach, we conduct comprehensive experiments on DocRED (Yao et al. 2019), a recently proposed entity-rich document-level relation extraction dataset, as well as two biomedical domain datasets, namely CDR (Li et al. 2016a) and GDA (Wu et al. 2019). On all three datasets, we observe consistent and substantial improvements over competitive baselines, and establish the new state-of-the-art. Our contributions can be summarized as follows:
• We summarize various kinds of mention dependencies exhibited in document-level texts into a unified framework.
By explicitly incorporating such structure within and throughout the encoding network, we are able to perform context reasoning and structure reasoning simultaneously and interactively, which brings substantially improved performance on relation extraction tasks.
• We propose SSAN, which extends the standard self-attention mechanism with structural guidance.
• We achieve new state-of-the-art results on three document-level relation extraction datasets.
Approach

This section elaborates on our approach. We first formalize entity structure in section 2.1, then detail the proposed SSAN model in sections 2.2 and 2.3, and finally introduce its application to document-level relation extraction in section 2.4.
Entity structure describes the distribution of entity instances over texts and the dependencies among them. In the specific scenario of document-level texts, we consider the following two structures.
• Co-occurrence structure: Whether or not two mentions reside in the same sentence.
• Coreference structure: Whether or not two mentions refer to the same entity.
Both structures can be described as True or False. For the co-occurrence structure, we segment documents into sentences and take them as the minimum units that exhibit mention interactions. True or False thus distinguishes intra-sentential interactions, which depend on local context, from inter-sentential ones that require cross-sentence reasoning. We denote them as intra and inter respectively. For the coreference structure, True indicates that two mentions refer to the same entity and thus should be investigated and reasoned with together, while False implies a pair of distinct entities that are possibly related under certain predicates. We denote them as coref and relate respectively. These two structures are mutually orthogonal, resulting in four distinctive and undirected dependencies, as shown in Table 1.

                            Coreference
                      True            False
Co-occurrence  True   intra+coref     intra+relate
               False  inter+coref     inter+relate

Table 1: The formulation of entity structure.

Besides the dependencies between entity mentions, we further consider another type of dependency, between entity mentions and their intra-sentential non-entity (NE) words. We denote it as intraNE. For other inter-sentential non-entity words, we assume there is no crucial dependency and categorize it as NA. The overall structure is thus formulated as an entity-centric adjacency matrix with all its elements taken from a finite dependency set: {intra+coref, inter+coref, intra+relate, inter+relate, intraNE, NA} (see Figure 2).
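To make the formulation concrete, the token-level dependency matrix can be assembled from mention annotations roughly as follows. This is a minimal sketch under our own conventions (the function and variable names are illustrative, and pairs of two non-entity tokens are simply left as NA), not the released preprocessing code:

```python
# Dependency labels from the formulation above.
INTRA_COREF, INTER_COREF = "intra+coref", "inter+coref"
INTRA_RELATE, INTER_RELATE = "intra+relate", "inter+relate"
INTRA_NE, NA = "intraNE", "NA"

def build_structure(sent_of, mentions):
    """Token-level entity structure matrix S.

    sent_of:  list mapping every token index to its sentence id.
    mentions: list of (start, end, sent_id, ent_id) half-open token spans.
    Returns an n x n matrix of dependency labels.
    """
    n = len(sent_of)
    ent_of = [None] * n                  # entity id of each token, None for non-entity tokens
    for start, end, _sent_id, ent_id in mentions:
        for t in range(start, end):
            ent_of[t] = ent_id

    S = [[NA] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            same_sent = sent_of[i] == sent_of[j]
            if ent_of[i] is not None and ent_of[j] is not None:
                # mention-to-mention pairs follow Table 1
                if ent_of[i] == ent_of[j]:
                    S[i][j] = INTRA_COREF if same_sent else INTER_COREF
                else:
                    S[i][j] = INTRA_RELATE if same_sent else INTER_RELATE
            elif (ent_of[i] is None) != (ent_of[j] is None):
                # mention-to-non-entity pairs: intraNE within a sentence, NA otherwise
                S[i][j] = INTRA_NE if same_sent else NA
            # non-entity-to-non-entity pairs stay NA (an assumption of this sketch)
    return S
```

The resulting label matrix is mapped to integer ids and passed, together with the token sequence, to every structured self-attention layer described next.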
SSAN inherits the architecture of the Transformer (Vaswani et al. 2017) encoder, which is a stack of identical building blocks wrapped up with a feedforward network, residual connections, and layer normalization. As its core component, we propose the structured self-attention mechanism with two alternative transformation modules.

Given an input token sequence x = (x_1, x_2, ..., x_n), following the above formulation we introduce S = {s_ij} to represent its structure, where i, j ∈ {1, 2, ..., n} and s_ij ∈ {intra+coref, inter+coref, intra+relate, inter+relate, intraNE, NA} is a discrete variable that denotes the dependency from x_i to x_j. Note that here we extend the dependencies from mention-level to token-level for practical implementation. If a mention instance consists of multiple subwords (E_2 in Figure 2, S_1), we assign dependencies for each token accordingly. Within each mention, subword pairs should conform with intra+coref and thus are assigned as such.

In each layer l, the input representation x_i^l ∈ R^{d_in} is first projected into query / key / value vectors respectively:

    q_i^l = x_i^l W_Q^l, \quad k_i^l = x_i^l W_K^l, \quad v_i^l = x_i^l W_V^l    (1)

where W_Q^l, W_K^l, W_V^l ∈ R^{d_in × d_out}. Based on these inputs and the entity structure S, we compute an unstructured attention score and a structured attentive bias, and then aggregate them together to guide the final self-attention flow.

The unstructured attention score is produced by the query-key product as in standard self-attention:

    e_{ij}^l = \frac{q_i^l {k_j^l}^T}{\sqrt{d}}    (2)

Parallel to it, we employ an additional module to model the structural dependency, conditioned on the contextualized query / key representations. We parameterize it as transformations that project s_ij, along with the query vector q_i^l and key vector k_j^l, into an attentive bias, and then impose it upon e_{ij}^l:

    \tilde{e}_{ij}^l = e_{ij}^l + \frac{\mathrm{transformation}(q_i^l, k_j^l, s_{ij})}{\sqrt{d}}    (3)

The proposed transformation module regulates the attention flow from x_i to x_j. As a consequence, the model benefits from the guidance of structural dependencies.

After we obtain the regulated attention scores \tilde{e}_{ij}^l, a softmax operation is applied, and the value vectors are aggregated accordingly:

    z_i^{l+1} = \sum_{j=1}^{n} \frac{\exp \tilde{e}_{ij}^l}{\sum_{k=1}^{n} \exp \tilde{e}_{ik}^l} v_j^l    (4)

where z_i^{l+1} ∈ R^{d_out} is the updated contextual representation of x_i^l. Figure 2 gives the overview of SSAN. In the next section, we describe the transformation module.

Figure 2: The overall architecture of SSAN. Left illustrates structured self-attention as its basic building block. Right explains our entity structure formulation. This minimum example consists of two sentences, S_1 and S_2, and three entities, E_1, E_2 and E_3; N denotes non-entity tokens. The element in row i and column j represents the dependency from query token x_i to key token x_j; we distinguish dependencies using different colors.
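Putting Eqs. (1)-(4) together, a single structured self-attention head can be sketched in PyTorch as below. This is a minimal illustration rather than the released implementation: the class and argument names are our own, the bias computation is abstracted into a bias_fn that is instantiated by the transformation modules described next, and batching and multi-head details are omitted.

```python
import math
import torch
import torch.nn as nn

class StructuredSelfAttention(nn.Module):
    """One single-head structured self-attention block, following Eqs. (1)-(4)."""

    def __init__(self, d_in, d_out, bias_fn):
        super().__init__()
        self.q_proj = nn.Linear(d_in, d_out, bias=False)   # W_Q
        self.k_proj = nn.Linear(d_in, d_out, bias=False)   # W_K
        self.v_proj = nn.Linear(d_in, d_out, bias=False)   # W_V
        self.bias_fn = bias_fn   # transformation(q, k, s) -> (n, n) attentive bias
        self.d_out = d_out

    def forward(self, x, s):
        # x: (n, d_in) token representations; s: (n, n) LongTensor of dependency ids
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)            # Eq. (1)
        scores = q @ k.t()                                                  # Eq. (2), before scaling
        scores = (scores + self.bias_fn(q, k, s)) / math.sqrt(self.d_out)   # Eq. (3)
        attn = torch.softmax(scores, dim=-1)
        return attn @ v                                                     # Eq. (4)
```

In SSAN these blocks replace the vanilla self-attention of a pretrained Transformer, so the structural bias acts at every layer rather than only in a post-hoc graph module.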
To incorporate the discrete structure s_ij into an end-to-end trainable deep model, we instantiate each s_ij as neural layers with specific parameters, and train and apply them in a compositional fashion. As a result, for each input structure S composed of s_ij, we have a structured model composed of the corresponding layer parameters. As for the specific design of these neural layers, we propose two alternatives, Biaffine Transformation and Decomposed Linear Transformation:

    bias_{ij}^l = \mathrm{Biaffine}(s_{ij}, q_i^l, k_j^l) \quad \text{or} \quad bias_{ij}^l = \mathrm{Decomp}(s_{ij}, q_i^l, k_j^l)    (5)

Biaffine Transformation. The Biaffine Transformation computes the bias as:

    bias_{ij}^l = q_i^l A_{l,s_{ij}} {k_j^l}^T + b_{l,s_{ij}}    (6)

Here we parameterize the dependency s_ij as a trainable neural layer A_{l,s_{ij}} ∈ R^{d_out × 1 × d_out}, which attends to the query and key vectors simultaneously and directionally, and projects them into a single-dimensional bias. The second term b_{l,s_{ij}} directly models a prior bias for each dependency, independent of its context.

Decomposed Linear Transformation. Inspired by how Dai et al. (2019) decompose the word embedding and position embedding in the Transformer, we propose to introduce biases upon the query and key vectors respectively, so the bias is decomposed as:

    bias_{ij}^l = q_i^l K_{l,s_{ij}}^T + Q_{l,s_{ij}} {k_j^l}^T + b_{l,s_{ij}}    (7)

where K_{l,s_{ij}}, Q_{l,s_{ij}} ∈ R^{d} are also trainable neural layers. Intuitively, these three terms respectively represent: 1) a bias conditioned on the query token representation, 2) a bias conditioned on the key token representation, and 3) a prior bias.

The overall computation of structured self-attention is thus:

    \tilde{e}_{ij}^l = \frac{q_i^l {k_j^l}^T + \mathrm{transformation}(q_i^l, k_j^l, s_{ij})}{\sqrt{d}}
                     = \frac{q_i^l {k_j^l}^T + q_i^l A_{l,s_{ij}} {k_j^l}^T + b_{l,s_{ij}}}{\sqrt{d}}
        \quad \text{or} \quad \frac{q_i^l {k_j^l}^T + q_i^l K_{l,s_{ij}}^T + Q_{l,s_{ij}} {k_j^l}^T + b_{l,s_{ij}}}{\sqrt{d}}    (8)

As these transformation layers model structural dependencies adaptively according to context, we do not share them across different layers or different attention heads.

Previously, Shaw, Uszkoreit, and Vaswani (2018) proposed to model the relative position information of input token pairs within the Transformer. They first map the relative distance into an embedding, then add it to the key vectors before computing the attention score. Technically, such a design can be seen as a simplified version of our Decomposed Linear Transformation, with the query conditioned bias only.
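Under the same conventions as the sketch above (single head, no batching), the two transformation modules of Eqs. (6) and (7) could be implemented as follows. The class names and the zero initialization, which lets training start from the vanilla attention scores, are our own assumptions rather than details taken from the released code.

```python
import torch
import torch.nn as nn

class BiaffineBias(nn.Module):
    """Biaffine Transformation, Eq. (6): bias_ij = q_i A_{s_ij} k_j^T + b_{s_ij}."""

    def __init__(self, d, n_dep=6):
        super().__init__()
        # one d x d matrix and one scalar prior per dependency type
        self.A = nn.Parameter(torch.zeros(n_dep, d, d))
        self.b = nn.Parameter(torch.zeros(n_dep))

    def forward(self, q, k, s):
        # q, k: (n, d); s: (n, n) dependency ids
        A_s, b_s = self.A[s], self.b[s]                   # (n, n, d, d), (n, n)
        # memory-heavy for long documents; an efficient implementation would avoid
        # materializing A_s explicitly
        return torch.einsum('id,ijde,je->ij', q, A_s, k) + b_s


class DecomposedBias(nn.Module):
    """Decomposed Linear Transformation, Eq. (7):
    bias_ij = q_i K_{s_ij}^T + Q_{s_ij} k_j^T + b_{s_ij}."""

    def __init__(self, d, n_dep=6):
        super().__init__()
        self.K = nn.Parameter(torch.zeros(n_dep, d))
        self.Q = nn.Parameter(torch.zeros(n_dep, d))
        self.b = nn.Parameter(torch.zeros(n_dep))

    def forward(self, q, k, s):
        K_s, Q_s, b_s = self.K[s], self.Q[s], self.b[s]   # (n, n, d) x2, (n, n)
        query_bias = torch.einsum('id,ijd->ij', q, K_s)   # conditioned on the query token
        key_bias = torch.einsum('jd,ijd->ij', k, Q_s)     # conditioned on the key token
        return query_bias + key_bias + b_s
```

A layer is then assembled as, for example, StructuredSelfAttention(d_in=768, d_out=64, bias_fn=BiaffineBias(64)), with separate bias parameters per layer and per attention head, since the paper does not share them.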
The proposed SSAN model takes document text as input and builds its contextual representations under the guidance of entity structure, within and throughout the overall encoding stage. In this work, we simply use it for relation extraction with minimum design. After the encoding stage, we construct a fixed-dimensional representation for each target entity via average pooling, which we denote as e_i ∈ R^{d_e}. Then, for each entity pair, we compute the probability of relation r from the pre-specified relation schema as:

    P_r(e_s, e_o) = \mathrm{sigmoid}(e_s W_r e_o)    (9)

where W_r ∈ R^{d_e × d_e}. The model is trained using the cross entropy loss:

    \mathcal{L} = \sum_{\langle e_s, e_o \rangle} \sum_{r} \mathrm{CrossEntropy}(P_r(e_s, e_o), y_r(e_s, e_o))    (10)

where y_r(e_s, e_o) is the target label. Given N entities and a relation schema of size M, equation 9 is computed N × N × M times to give all predictions.
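A minimal sketch of this classification head follows, under the same illustrative conventions as above; e_s and e_o are assumed to have been obtained by average pooling over the encoder outputs of their mention tokens, and binary cross entropy is used as the per-relation instantiation of Eq. (10).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelationClassifier(nn.Module):
    """Bilinear scorer of Eq. (9) over a relation schema of size M."""

    def __init__(self, d_e, n_relations):
        super().__init__()
        self.W = nn.Parameter(torch.empty(n_relations, d_e, d_e))
        nn.init.xavier_uniform_(self.W)

    def forward(self, e_s, e_o):
        # e_s, e_o: (d_e,) pooled subject / object entity representations
        logits = torch.einsum('d,rde,e->r', e_s, self.W, e_o)
        return torch.sigmoid(logits)          # P_r(e_s, e_o) for every relation r

def pair_loss(probs, labels):
    # labels: (M,) multi-hot target y_r(e_s, e_o); Eq. (10) sums this over all entity pairs
    return F.binary_cross_entropy(probs, labels)
```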
Experiments

We evaluate the proposed approach on three popular document-level relation extraction datasets, namely DocRED (Yao et al. 2019), CDR (Li et al. 2016a) and GDA (Wu et al. 2019), all involving challenging relational reasoning over multiple entities across multiple sentences. We summarize their information in Appendix A.

DocRED
DocRED is a large-scale dataset constructed from Wikipedia and Wikidata. It provides comprehensive human annotations including entity mentions, entity types, relational facts, and the corresponding supporting evidence. There are 97 target relations in total and approximately 26 entities on average in each document. The data scale is 3053 documents for training, 1000 for development, and 1000 for test. Besides, DocRED also collects distantly supervised data for alternative research: a finetuned BERT model is used to identify entities and link them to Wikidata, and relation labels are then obtained via distant supervision, producing 101873 document instances.
CDR
The Chemical-Disease Reactions dataset is a biomedical dataset constructed using PubMed abstracts. It contains 1500 human-annotated documents in total, equally split into training, development, and test sets. CDR is a binary classification task that aims at identifying the induced relation from chemical entities to disease entities, which is of significant importance to biomedical research.
GDA
Like CDR, the Gene-Disease Associations dataset is also a binary relation classification task, identifying interactions between Gene and Disease concepts, but at a much larger scale, constructed by distant supervision from MEDLINE abstracts. It consists of 29192 documents for the training set and 1000 for the test set.
We initialize SSAN with different pretrained language models, including BERT (Devlin et al. 2019), RoBERTa (Liu et al. 2019) and SciBERT (Beltagy, Lo, and Cohan 2019).
BERT
BERT is one of the first works to demonstrate the success of Transformers for pretraining language models on large-scale corpora. Specifically, it is pretrained using Masked Language Modeling and Next Sentence Prediction on BooksCorpus and Wikipedia. BERT comes in two configurations, Base and Large, containing 12 and 24 self-attention layers respectively. It can be easily finetuned on various downstream tasks, producing competitive baselines.
RoBERTa
RoBERTa is an optimized version of BERT, which removes the Next Sentence Prediction task and adopts much larger text corpora as well as more training steps. It is currently one of the strongest pretrained language models, outperforming BERT on various downstream NLP tasks.
SciBERT
SciBERT adopts the same model architecture as BERT, but is trained on scientific text instead. It demonstrates considerable advantages on a series of scientific domain tasks. In this paper, we provide SciBERT-initialized SSAN on the two biomedical domain datasets.

Model                           Dev Ign F1 / F1    Test Ign F1 / F1
ContexAware (2019)              48.94 / 51.09      48.40 / 50.70
EoG* (2019)                     45.94 / 52.15      49.48 / 51.82
BERT Two-Phase (2019a)              - / 54.42          - / 53.92
GloVe+LSR (2020)                48.82 / 55.17      52.15 / 54.18
HINBERT (2020)                  54.29 / 56.31      53.70 / 55.60
CorefBERT Base (2020)           55.32 / 57.51      54.54 / 56.96
CorefBERT Large (2020)          56.73 / 58.88      56.48 / 58.70
BERT+LSR (2020)                 52.43 / 59.00      56.97 / 59.05
CorefRoBERTa (2020)             57.84 / 59.93      57.68 / 59.91
BERT Base Baseline              56.29 / 58.60      55.08 / 57.54
SSAN Decomp (BERT Base)
SSAN Biaffine (BERT Base)
RoBERTa Base Baseline           57.47 / 59.52      57.27 / 59.48
SSAN Decomp (RoBERTa Base)
SSAN Biaffine (RoBERTa Base)
RoBERTa Large Baseline          58.45 / 60.58      58.43 / 60.54
SSAN Decomp (RoBERTa Large)
SSAN Biaffine (RoBERTa Large)
  + Adaptation

Table 2: Results on DocRED. Decomp and Biaffine refer to the Decomposed Linear Transformation and the Biaffine Transformation, respectively. Test results are obtained by submitting to the official CodaLab. The result with * is from Nan et al. (2020).

On each dataset, we give comprehensive results of SSAN initialized with different pretrained language models, along with their corresponding baselines for fair comparison. The parameters of the newly introduced transformation modules are learned from scratch. All results are obtained using grid search over hyper-parameters (see Appendix B for details) on the development set; the best model is then selected to produce results on the test set. On DocRED, following the official baseline implementation (Yao et al. 2019), we utilize naive features including entity type and entity coreference, which are added to the input word embeddings. We also concatenate the entity relative distance embedding of each entity pair before the final classification. We preprocess the CDR and GDA datasets following Christopoulou, Miwa, and Ananiadou (2019). On CDR, after the best hyper-parameters are set, we merge the training and development sets to train the final model; on GDA, we split 20% of the training set for development.
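For the DocRED feature setup described above, the input composition can be pictured with a small sketch; the embedding names, sizes, and the way the distance feature is attached are illustrative assumptions rather than the exact released configuration.

```python
import torch.nn as nn

class DocREDInputFeatures(nn.Module):
    """Adds naive entity-type and coreference-id embeddings to the word embeddings."""

    def __init__(self, d_model, n_entity_types, max_entities):
        super().__init__()
        self.type_emb = nn.Embedding(n_entity_types, d_model)
        self.coref_emb = nn.Embedding(max_entities, d_model)

    def forward(self, word_emb, type_ids, coref_ids):
        # word_emb: (n, d_model) from the pretrained model's embedding layer
        return word_emb + self.type_emb(type_ids) + self.coref_emb(coref_ids)

# Before classification, the relative-distance embedding of each entity pair is
# concatenated to the pooled pair representation (not shown here).
```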
We conduct comprehensive and comparable experiments on the DocRED dataset. We report both F1 and Ign F1 according to Yao et al. (2019), where Ign F1 is computed by excluding relational facts that already appeared in the training set.

Model                            Dev F1   Test F1   Intra- / Inter- Test F1
(Gu et al. 2017)                   -       61.3        57.2 / 11.7
BRAN (2018)                        -       62.1           - / -
CNN+CNNchar (2018)                 -       62.3           - / -
GCNN (2019)                      57.2      58.6           - / -
EoG (2019)                       63.6      63.6        68.2 / 50.9
LSR (2020)                         -       61.2        66.2 / 50.3
LSR w/o MDP (2020)                 -       64.8        68.9 / 53.1
BERT (2020)                        -       60.5           - / -
SciBERT (2020)                     -       64.0           - / -
Methods using external resources:
(Peng, Wei, and Lu 2016)           -       63.1           - / -
(Li et al. 2016b)                  -       67.7        58.9 / -
(Panyam et al. 2018)               -       60.3        65.1 / 45.7
(Zheng et al. 2018)                -       61.5           - / -
BERT Base Baseline               61.7      61.4        69.3 / 44.9
SSAN Decomp (BERT Base)
SSAN Biaffine (BERT Base)
BERT Large Baseline              65.3      63.6        70.8 / 49.0
SSAN Decomp (BERT Large)
SSAN Biaffine (BERT Large)
SciBERT Baseline                 68.2      65.8        71.9 / 53.3
SSAN Decomp (SciBERT)
SSAN Biaffine (SciBERT)

Table 3: Results on CDR dev set and test set.
Model                            Dev F1   Test F1   Intra- / Inter- Test F1
EoG (2019)                       78.7      81.5        85.2 / 49.3
LSR (2020)                         -       79.6        83.1 / 49.6
LSR w/o MDP (2020)                 -       82.2        85.4 / 51.1
BERT Base Baseline               79.8      81.2        84.7 / 60.3
SSAN Decomp (BERT Base)
SSAN Biaffine (BERT Base)
SciBERT Baseline
SSAN Decomp (SciBERT)
SSAN Biaffine (SciBERT)

Table 4: Results on GDA dev set and test set.
As shown in Table 2, SSAN with both the Biaffine and the Decomp transformation consistently outperforms the corresponding baselines by a considerable margin. In most of the results, Biaffine brings a more considerable performance gain than Decomp, which demonstrates that the former has greater ability to model structural dependencies.

We compare our model with previous works that either do not consider entity structure or do not explicitly model it within and throughout the encoder. Specifically, ContexAware (Yao et al. 2019), BERT Two-Phase (Wang et al. 2019a) and HINBERT (Tang et al. 2020) do not consider the structural dependencies among entities. EoG (Christopoulou, Miwa, and Ananiadou 2019) and LSR (Nan et al. 2020) utilize graph methods to perform structure reasoning, but only after the BiLSTM or BERT encoder. CorefBERT and CorefRoBERTa (Ye et al. 2020) further pretrain BERT and RoBERTa with a coreference prediction task to enable implicit reasoning over coreference structure. The results in Table 2 show that SSAN performs better than these methods. Our best model, SSAN Biaffine built upon RoBERTa Large, is +2.41 / +1.79 Ign F1 better on the dev / test set than CorefRoBERTa Large (Ye et al. 2020), and +1.80 / +1.04 Ign F1 better than our baseline. In general, these results demonstrate both the usefulness of entity structure and the effectiveness of SSAN.

Although SSAN is well compatible with pretrained Transformer models, there still exists a distribution gap between the parameters of the newly introduced transformation layers and the already pretrained ones, which impedes the improvements of SSAN to a certain extent. In order to alleviate such distribution deviation, we also utilize the distantly supervised data from DocRED, which shares an identical format with the training set, to first pretrain SSAN before finetuning it on the annotated training set for better adaptation. Here we choose our best model, SSAN Biaffine built upon RoBERTa Large, and denote it as +Adaptation in Table 2 (see Appendix B for the hyper-parameter setting). The resulting performance is greatly improved, reaching the 1st position on the official leaderboard (https://competitions.codalab.org/competitions/20717) at the time of submission.

On the CDR and GDA datasets, besides BERT, we also adopt SciBERT for its superiority when dealing with biomedical domain texts. On the CDR test set (see Table 3), SSAN obtains gains of +1.3 F1 / +1.7 F1 based on BERT Base/Large and +2.9 F1 based on SciBERT, significantly outperforming the baselines and all existing works. On GDA (see Table 4), similar improvements can also be observed. These results demonstrate the strong applicability and generality of our approach.

We perform ablation studies of the proposed approach on DocRED. Again, we consider SSAN Biaffine built upon RoBERTa Large. Table 5 gives the results of SSAN when each structural dependency is excluded. It is clear that all five dependencies contribute to the final improvements, so we can conclude that the proposed entity structure formulation indeed provides helpful priors for document-level relation extraction. We can also see that intra+coref contributes the most among all dependencies.

Dependency                        Ign F1    F1
SSAN Biaffine (RoBERTa Large)     60.25     62.08
  − intra+coref
  − intra+relate
  − inter+coref
  − inter+relate
  − intraNE
  − all                           58.45     60.58

Table 5: Ablation of the entity structure formulation on the DocRED dev set: results when each dependency is excluded; "− all" degenerates to the RoBERTa Large baseline.

We also look into the design of the two transformation modules by testing each bias term respectively. As shown in Table 6, all bias terms improve the result over the baseline, including the prior bias + b_{s_ij}, which consists only of individual scalar values. Among all bias terms, the biaffine bias + q_i A_{s_ij} k_j^T is the most effective, bringing a +1.38 Ign F1 improvement on its own. For the Decomposed Linear Transformation, the key conditioned bias + Q_{s_ij} k_j^T produces better results than the query conditioned bias + q_i K_{s_ij}^T, which implies that the key vectors might be associated with more entity structure information.

Bias Term                                        Ign F1    F1
RoBERTa Large baseline (w/o bias)                58.45     60.58
  + b_{s_ij}
  + Q_{s_ij} k_j^T
  + q_i K_{s_ij}^T
  + q_i K_{s_ij}^T + Q_{s_ij} k_j^T + b_{s_ij}
  + q_i A_{s_ij} k_j^T
  + q_i A_{s_ij} k_j^T + b_{s_ij}

Table 6: Ablation of the bias terms on the DocRED dev set. The layer index l is omitted because the ablation is implemented across all layers.

As a key feature of SSAN is to formulate entity structure priors into attentive biases, it is instructive to explore how such attentive biases regulate the propagation of self-attention bottom-to-up. To this purpose, we collect all attentive biases produced by SSAN Biaffine (built upon RoBERTa Large) on the DocRED dev instances, categorized according to dependency types and averaged across all attention heads and all instances. Figure 3 (a) is the resultant heatmap, where each cell indicates the value of the averaged bias at each layer (horizontal axis) for each entity dependency type (vertical axis). We can observe meaningful patterns: 1) Along the horizontal axis, the bias is relatively small at the bottom layers, where the self-attention score is mainly decided by unstructured semantic context. It then grows gradually and reaches its maximum at the top-most layers, where the self-attention score is greatly regulated by the structural priors. 2) Along the vertical axis, at the top-most layers (inside the dotted bounding box), the bias from inter+coref is significantly positive. This conforms with the human intuition that coreferential mention pairs might act as a bridge for cross-sentence reasoning and thus should enable more information passing. Biases from intra+relate and inter+relate appear in contrast.

Figure 3: (a) Visualization of the learned attentive bias from different layers and different mention dependencies. Results are averaged over the entire dev set and over different attention heads. (b) Ablation on the number of layers to impose attentive biases.

Based on this discussion, we further investigate the effect of the number of layers on which attentive biases are imposed. As shown in Figure 3 (b), with only the top 4 layers (1/6 of the total layers) integrated with entity structure, SSAN keeps a +0.89 F1 gain, which confirms that these top-most layers with larger biases indeed have the most significant impact. In the meantime, with more layers included, the performance still improves, reaching the best result of +1.50 F1 with all 24 layers equipped with structured self-attention.

Related Work

Document-level RE
Recent years have seen growing interest in relation extraction beyond the single sentence (Quirk and Poon 2017; Peng et al. 2017a). Among the most influential works, many have proposed to introduce intra-sentential and inter-sentential syntactic dependencies (Peng et al. 2017b; Song et al. 2018; Gupta et al. 2019). More recently, document-level relation extraction tasks have been proposed (Li et al. 2016a; Yao et al. 2019), where the goal is to identify relations of multiple entity pairs from the entire document text, so that rich entity interactions are involved. In order to model these interactions, many graph-based methods have been proposed (Sahu et al. 2019; Christopoulou, Miwa, and Ananiadou 2019; Nan et al. 2020). However, these graph networks are built on top of a contextual encoder, which is different from our approach that models entity interactions within and throughout the system.
Entity Structure
Entity structure has been shown to be useful in many NLP tasks. In early work, Barzilay and Lapata (2008) propose an entity-grid representation for discourse analysis, where the document is summarized into a set of entity transition sequences that record distributional, syntactic, and referential information. Ji et al. (2017) introduce a set of symbolic variables and state vectors to encode mentions and their coreference relationships for the language modeling task. Dhingra et al. (2018) propose Coref-GRU, which incorporates mention coreference information for reading comprehension tasks. In general, many works have utilized entity structure in various formulations for different tasks.

For document-level relation extraction, entity structure is also an essential prior. For example, Verga, Strubell, and McCallum (2018) propose to merge predictions from coreferential mentions. Nan et al. (2020) propose to model entity interactions via latent structure reasoning. Christopoulou, Miwa, and Ananiadou (2019) construct a graph of mention nodes, entity nodes, and sentence nodes, and connect them using mention-mention coreference, mention-sentence residency, etc.; such a design provides much more comprehensive entity structure information. Based on the graph, they further utilize an edge-oriented method to iteratively refine the relation representation between target entity pairs, which is quite different from our approach.
Structured Networks
Neural networks that incorporate structural priors have been extensively explored. In previous works, many have investigated how to infuse tree-like syntactic structure into the classical LSTM encoder (Kim et al. 2017; Shen et al. 2019; Peng et al. 2017b). For the Transformer encoder, this is also a challenging and thriving research direction. Shaw, Uszkoreit, and Vaswani (2018) propose to incorporate the relative position information of input tokens in the form of attentive biases, which inspired part of this work. Wang et al. (2019b) further extend this method to the relation extraction task, where the relative position is adjusted into an entity-centric form.
Conclusion

In this work, we formalize entity structure for document-level relation extraction. Based on it, we propose SSAN to effectively incorporate such structural priors, performing contextual reasoning and structure reasoning over entities simultaneously and interactively. The resulting performance on three datasets demonstrates the usefulness of entity structure and the effectiveness of the SSAN model. For future work, we see two promising directions: 1) apply SSAN to more tasks such as reading comprehension, where the structure of entities or syntax is useful prior information; 2) extend the entity structure formulation to include more meaningful dependencies, such as more complex interactions based on discourse structure.

Dataset              Train     Dev    Test   Entities / Doc   Mentions / Doc   Mention / Sent   Relation
DocRED Annotated     3053      1000   1000   19.5             26.2             3.58             96
DocRED Distant       101873    -      -      19.3             25.1             3.43             96
CDR                  500       500    500    6.8              19.2             2.48             1
GDA                  29192     -      1000   4.8              18.5             2.28             1

Table 7: Summary of the DocRED, CDR and GDA datasets. For the column Mention / Sent, we exclude sentences that do not contain any entity mention.
Table 8: Hyper-parameters setting, covering the learning rate, training epochs, and batch size for DocRED (Base / Large / Distant Pretrain), CDR, and GDA (Base / Large).
Acknowledgments
We thank all anonymous reviewers for their valuable comments. This work is supported by the National Key Research and Development Project of China (No. 2018YFB1004300, No. 2018AAA0101900), and the National Natural Science Foundation of China (No. 61876223, No. U19A2057).
Appendix
A Datasets
Table 7 details the entity statistics along with other related information for the three selected datasets. We can see that all three datasets entail around two dozen mentions per document on average, with each sentence containing approximately three mentions on average. These statistics further demonstrate the complexity of entity structure in document-level relation extraction tasks.
B Hyper-parameters Setting
Table 8 details our hyper-parameter settings. All experimental results are obtained using grid search on the development set. All comparable results share the same search scope.
References
Barzilay, R.; and Lapata, M. 2008. Modeling local coherence: An entity-based approach. Computational Linguistics.
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP).
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP).
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers).
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers).
Database.
Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 6513–6520.
Ji, Y.; Tan, C.; Martschat, S.; Choi, Y.; and Smith, N. A. 2017. Dynamic Entity Representations in Neural Language Models. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing.
OpenReview.net. URL https://openreview.net/forum?id=HkE0Nvqlg.
Li, J.; Sun, Y.; Johnson, R. J.; Sciaky, D.; Wei, C.-H.; Leaman, R.; Davis, A. P.; Mattingly, C. J.; Wiegers, T. C.; and Lu, Z. 2016a. BioCreative V CDR task corpus: a resource for chemical disease relation extraction. Database.
994–1001. IEEE.
Liu, X.; Fan, J.; and Dong, S. 2020. Document-Level Biomedical Relation Extraction Leveraging Pretrained Self-Attention Structure and Entity Replacement: Algorithm and Pretreatment Method Validation Study. JMIR Medical Informatics.
arXiv preprint arXiv:1907.11692.
Nan, G.; Guo, Z.; Sekulic, I.; and Lu, W. 2020. Reasoning with Latent Structure Refinement for Document-Level Relation Extraction. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.
Proceedings of the BioNLP 2018 workshop.
Journal of biomedical semantics.
Transactions of the Association for Computational Linguistics.
Transactions of the Association for Computational Linguistics, 5: 101–115.
Peng, Y.; Wei, C.-H.; and Lu, Z. 2016. Improving chemical disease relation extraction with rich features and weakly labeled data. Journal of cheminformatics.
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers.
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers).
International Conference on Learning Representations. URL https://openreview.net/forum?id=B1l6qiR5F7.
Song, L.; Zhang, Y.; Wang, Z.; and Gildea, D. 2018. N-ary Relation Extraction using Graph-State LSTM. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing.
Pacific-Asia Conference on Knowledge Discovery and Data Mining, 197–209. Springer.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. In Advances in neural information processing systems, 5998–6008.
Verga, P.; Strubell, E.; and McCallum, A. 2018. Simultaneously Self-Attending to All Mentions for Full-Abstract Biological Relation Extraction. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers).
arXiv preprint arXiv:1909.11898.
Wang, H.; Tan, M.; Yu, M.; Chang, S.; Wang, D.; Xu, K.; Guo, X.; and Potdar, S. 2019b. Extracting Multiple-Relations in One-Pass with Pre-Trained Transformers. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.
International Conference on Research in Computational Molecular Biology, 272–284. Springer.
Yao, Y.; Ye, D.; Li, P.; Han, X.; Lin, Y.; Liu, Z.; Liu, Z.; Huang, L.; Zhou, J.; and Sun, M. 2019. DocRED: A Large-Scale Document-Level Relation Extraction Dataset. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.
arXiv preprint arXiv:2004.06870.
Zheng, W.; Lin, H.; Li, Z.; Liu, X.; Li, Z.; Xu, B.; Zhang, Y.; Yang, Z.; and Wang, J. 2018. An effective neural model extracting document level chemical-induced disease relations from biomedical literature.