A Trigger-Sense Memory Flow Framework for Joint Entity and Relation Extraction
Yongliang Shen, Zhejiang University, [email protected]
Xinyin Ma, Zhejiang University, [email protected]
Yechun Tang, Zhejiang University, [email protected]
Weiming Lu∗, Zhejiang University, [email protected]
ABSTRACT
Joint entity and relation extraction frameworks construct a unified model to perform entity recognition and relation extraction simultaneously, which can exploit the dependency between the two tasks to mitigate the error propagation problem suffered by the pipeline model. Current efforts on joint entity and relation extraction focus on enhancing the interaction between entity recognition and relation extraction through parameter sharing, joint decoding, or other ad-hoc tricks (e.g., modeling the task as a semi-Markov decision process or casting it as a multi-round reading comprehension task). However, two issues remain. First, the interaction utilized by most methods is still weak and uni-directional, which is unable to model the mutual dependency between the two tasks. Second, relation triggers, which can help explain why humans would extract a relation from a sentence, are essential for relation extraction but are ignored by most methods. To this end, we present a
Trigger-Sense Memory Flow Framework (TriMF) for joint entity and relation extraction. We build a memory module to remember category representations learned in the entity recognition and relation extraction tasks. Based on it, we design a multi-level memory flow attention mechanism to enhance the bi-directional interaction between entity recognition and relation extraction. Moreover, without any human annotations, our model can enhance relation trigger information in a sentence through a trigger sensor module, which improves model performance and makes model predictions more interpretable. Experiment results show that our proposed framework achieves state-of-the-art results, improving the relation F1 to 52.44% (+3.2%) on SciERC, 66.49% (+4.9%) on ACE05, 72.35% (+0.6%) on CoNLL04 and 80.66% (+2.3%) on ADE.
CCS CONCEPTS
• Computing methodologies → Information extraction.

ACM Reference Format:
Yongliang Shen, Xinyin Ma, Yechun Tang, and Weiming Lu. 2021. A Trigger-Sense Memory Flow Framework for Joint Entity and Relation Extraction. In Proceedings of the Web Conference 2021 (WWW '21), April 19–23, 2021, Ljubljana, Slovenia. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3442381.3449895

∗ Corresponding author.

This paper is published under the Creative Commons Attribution 4.0 International (CC-BY 4.0) license. Authors reserve their rights to disseminate the work on their personal and corporate Web sites with the appropriate attribution.
WWW '21, April 19–23, 2021, Ljubljana, Slovenia
© 2021 IW3C2 (International World Wide Web Conference Committee), published under Creative Commons CC-BY 4.0 License.
ACM ISBN 978-1-4503-8312-7/21/04.
https://doi.org/10.1145/3442381.3449895
INTRODUCTION

Entity recognition and relation extraction aim to extract structured knowledge from unstructured text and hold a critical role in information extraction and knowledge base construction. For example, given the text "Ruby shot Oswald to death with the 0.38-caliber Colt Cobra revolver in the basement of Dallas City Jail on Nov. 24, 1963, two days after President Kennedy was assassinated.", the goal is to recognize entities of types such as People and Location, and to extract relations such as Kill and Located in held between the recognized entities. Two things interest humans when carrying out this task. First, potential constraints between the relation type and the entity types, e.g., the head and tail entities of Kill are of the People type, and the tail entity of Located in is of the Location type. Second, triggers for relations, e.g., with the words shot and death, the fact (Ruby, Kill, Oswald) can be easily extracted from the above example.

Current entity recognition and relation extraction methods fall into two categories: pipeline methods and joint methods. Pipeline methods label entities in a sentence through an entity recognition model and then predict the relation between them through a relation extraction model [9, 25]. Although pipeline methods are flexible to build, they have two common issues. First, they are more susceptible to error propagation, wherein prediction errors from entity recognition can affect relation extraction. Second, they lack effective interaction between entity recognition and relation extraction, ignoring the intrinsic connection and dependency between the two tasks. To address these issues, many joint entity and relation extraction methods have been proposed and have achieved superior performance over traditional pipeline methods. In these methods, an entity recognition model and a relation extraction model are unified through different strategies, including constraint-based joint decoding [22, 34], parameter sharing [5, 11, 26], casting the task as a reading comprehension problem [23, 41], or hierarchical reinforcement learning [32]. Current joint extraction models have made great progress, but the following issues still remain:
(1) Trigger information is underutilized in entity recognition and relation extraction.
Before neural information extraction models, rule-based entity recognition and relation extraction frameworks were widely used. They were devoted to mining hard template-based rules or soft feature-based rules from text and matching them with instances [1–3, 13, 17, 19, 28]. Such methods provide good explanations for the extraction work, but the formulation of rules requires domain expert knowledge or automatic discovery from a large corpus, suffering from tedious data processing and incomplete rule coverage. End-to-end neural network methods have made great progress in the field of information extraction in recent years. To exploit the rules, many works have begun to combine traditional rule-based methods by introducing a neural matching module [24, 35, 43]. However, these methods still need to formulate seed rules or label seed relation triggers manually, and iteratively expand them.
(2) The interaction between entity recognition and relation extraction is insufficient and uni-directional.
Entity recognition and relation extraction are supposed to be mutually beneficial, but joint extraction methods do not take full advantage of the dependency between the two tasks. Most joint extraction models are based on parameter sharing, where different task modules share input features or internal hidden layer states. However, these methods usually use independent decoding algorithms, resulting in a weak interaction between the entity recognition module and the relation extraction module. Joint decoding-based extraction models strengthen the interaction between modules, but they require a trade-off between the richness of features for different tasks and joint decoding accuracy. Other joint extraction methods, such as modeling the task as a reading comprehension problem [23, 41] or a semi-Markov process [32], still suffer from a lack of bi-directional interaction due to the sequential order of subtasks. More specifically, if relation extraction follows entity recognition, the entity classification task will ignore the solution of the relation classification task.
(3) There is no distinction between the syntactic and semantic importance of words in a sentence.
We note that some words have a significant syntactic role but contribute little to the semantics of a sentence, such as prepositions and conjunctions, while other words are just the opposite: they contribute significantly to the semantics, such as nouns and notional verbs. When encoding context, most methods simply inject syntactic features into the word vector, ignoring the fact that words differ in their semantic and syntactic importance. For example, some methods concatenate part-of-speech tags of words onto their semantic vectors via an embedding layer [12, 29]. Other methods combine the word, lexical, and entity class features of the nodes on the shortest entity path in the dependency tree to get the final features, which are then concatenated onto the semantic vector [8, 29]. These methods do not distinguish the two roles a word plays for sentence semantics and syntax, but rather treat both roles of all words as equally important.

In this paper, we propose a novel framework for joint entity and relation extraction to address the issues mentioned above. First, our model makes full use of relation triggers, which can indicate a specific type of relation. Without any relation trigger annotations, our model can extract relation triggers in a sentence and provide them as an explanation for model predictions. Second, to enhance the bi-directional interaction between the entity recognition and relation extraction tasks, we design a Memory Flow Attention module. It stores the already learned entity category and relation category representations in memory. We then adopt a memory flow attention mechanism to compute memory-aware sentence encodings, and make the two subtasks mutually boosting by enhancing task-related information in a sentence. The Memory Flow Attention module can easily be extended to multiple language levels, enabling the interaction between the two subtasks at both the subword level and the word level. Finally, we distinguish the syntactic and semantic importance of a word in a sentence and propose a node-wise Graph Weighted Fusion module to dynamically fuse the syntactic and semantic information of words.

Our main contributions are as follows:
• Considering relation triggers, we propose the Trigger Sensor module, which implicitly extracts the relation triggers from a sentence and then aggregates the information of triggers into the span-pair representation. Thus, it improves model performance and strengthens model interpretability.
• To model the mutual dependency between entity recognition and relation extraction, we propose the Multi-level Memory Flow Attention module. This module constructs an entity memory and a relation memory to preserve the learned representations of entity and relation categories. Through the memory flow attention mechanism, it enables bi-directional interaction between the entity recognition and relation extraction tasks at multiple language levels.
• Since the semantic and syntactic roles that words play in a sentence differ in importance, we propose a node-wise Graph Weighted Fusion module to dynamically fuse semantic and syntactic information.
• Experiments show that our model achieves state-of-the-art performance consistently on the SciERC, ACE05, CoNLL04, and ADE datasets, and outperforms several competing baseline models on relation F1 score by 3.2% on SciERC, 4.9% on ACE05, 0.6% on CoNLL04 and 2.3% on ADE.
RELATED WORK

Traditional relation extraction methods utilize template-based rules [2, 13, 28], which are first formulated by domain experts or automatically generated from a large corpus with statistical methods. They then apply hard matching to extract the relation facts corresponding to the rules. Later on, some works change the template-based rules to feature-based rules (such as TF-IDF and CBOW features) and extract relations by soft matching [7, 18, 20, 40], but still cannot avoid mining the rule features from a large corpus using statistical methods. In short, rule-based relation extraction models typically suffer from a number of disadvantages, including tedious efforts on rule formulation, a lack of extensibility, and low accuracy due to incomplete rule coverage, but they provide a useful perspective for neural relation extraction systems.

Some recent efforts on neural extraction systems focus on rules or natural language explanations [35]. NERO [43]
explicitly exploits labeling rules over unmatched sentences as supervision for training RE models. It consists of a sentence-level relation classifier and a soft rule matcher. The former learns neural representations of sentences and classifies which relation a sentence talks about. The latter is a learnable module that produces matching scores for unmatched sentences against collected rules. NERO labels sentences according to predefined rules and makes full use of information from unmatched instances. However, formulating seed rules manually is still a tedious process, and the quality of the rules affects the performance of the entire system.
Previous entity and relation extraction models are pipelined [9, 25]. In these methods, an entity recognition model first recognizes entities of interest, and a relation extraction model then predicts the relation type between the recognized entities. Although pipeline models have the flexibility of integrating different model structures and learning algorithms, they suffer significantly from error propagation. To tackle this issue, joint learning models have been proposed. They fall into two main categories: parameter sharing and joint decoding methods.

Most methods jointly model the two tasks through parameter sharing [29, 42]. They unite the entity recognition and relation extraction modules by sharing input features or internal hidden layer states. Specifically, these methods use the same encoder to provide sentence encoding for both the entity recognition module and the relation extraction module. Some methods [4, 26, 27, 33] perform entity recognition first and then pair entities of interest for relation classification, while other methods [32, 38] are the opposite: they predict possible relations first and then recognize the entities in the sentence. DygIE [27] constructs a span graph and uses message propagation to enhance the interaction between entity recognition and relation extraction. HRL [32] models the joint extraction problem as a semi-Markov decision process and uses hierarchical reinforcement learning to extract entities and relations. CASREL [36] treats general relation classification as a tagging task: each relation corresponds to a tagger that recognizes the tail entities based on a head entity and the context. CopyMTL [39] casts the extraction task as a generation task and proposes an encoder-decoder model with a copy mechanism to extract relation tuples with overlapping entities. Although the entity recognition and relation extraction modules can adopt different structures in these methods, their independent decoding algorithms result in insufficient interaction between the two modules. Furthermore, subtasks are performed sequentially in these methods, so the interaction between the two tasks is uni-directional.

To enhance the bi-directional interaction between the entity recognition and relation extraction tasks, some joint decoding algorithms have been proposed. [37] proposes to use integer linear programming to enforce constraints on the prediction results of the entity and relation models. [21] uses conditional random fields for both the entity and relation models and obtains the entity and relation outputs with the Viterbi decoding algorithm. Although joint decoding-based extraction models strengthen the interaction between the two modules, they still require a trade-off between the richness of the features required for different tasks and the accuracy of joint decoding.
METHOD

In this section, we introduce the Trigger-Sense Memory Flow Framework (TriMF) for joint entity and relation extraction, which consists of five main modules: the Memory module, the Multi-Level Memory Flow Attention module, the Syntactic-Semantic Graph Weighted Fusion module, the Trigger Sensor module, and the Memory-Aware Classifier module. The overall architecture of TriMF is illustrated in Figure 2.

We first initialize the Memory, including an Entity Memory $\mathbf{M}^{E} \in \mathbb{R}^{n_e \times h_{me}}$ and a Relation Memory $\mathbf{M}^{R} \in \mathbb{R}^{n_r \times h_{mr}}$, where $n_e$ and $n_r$ denote the numbers of entity categories and relation categories, and $h_{me}$ and $h_{mr}$ denote the slot sizes of the entity memory and the relation memory.

Figure 1: Four-Level Encoding (subword, word, span, span-pair)
Our model performs four-level sentence encoding (subword, word, span, and span-pair, as shown in Figure 1) and two-step classification (entity classification and relation classification). More specifically, a sentence is encoded by BERT [10] to obtain a subword sequence encoding $\mathbf{E}^{d} \in \mathbb{R}^{m \times h}$, where $m$ denotes the number of subwords in the sentence and $h$ denotes the hidden state size of BERT. Based on $\mathbf{M}^{R}$, $\mathbf{M}^{E}$ and $\mathbf{E}^{d}$, we perform the first Memory Flow Attention at the subword level. Then we use $f_w$ to aggregate the subword sequence encoding into a word sequence encoding $\mathbf{E}^{w} \in \mathbb{R}^{n \times h_w}$, where $n$ denotes the number of words in the sentence and $h_w$ denotes the size of the word vector; for $f_w$, we adopt the max-pooling function. Based on $\mathbf{M}^{R}$, $\mathbf{M}^{E}$ and $\mathbf{E}^{w}$, we perform the second Memory Flow Attention at the word level. After that, the word sequence encoding is fed into the Syntactic-Semantic Graph Weighted Fusion module to fuse semantic and syntactic information at the word level. Then, we combine the word sequence encodings by $f_s$ to obtain the span sequence encodings $\mathbf{E}^{s} \in \mathbb{R}^{N \times h_s}$, where $N$ denotes the number of spans in the sentence and $h_s$ denotes the size of the span vector; for $f_s$, we concatenate a span-size embedding onto max-pooled word embeddings. We filter out the spans classified as the None category by a Memory-Aware Entity Classifier. After pairing the spans of interest, we compute a local-context representation $\mathbf{g}_{local}$ and a full-context, span-pair-specific trigger representation $\mathbf{g}_{trigger}$ using the Trigger Sensor. We combine the encodings of the head span, the tail span, $\mathbf{g}_{local}$ and $\mathbf{g}_{trigger}$ to obtain the encoding $\mathbf{E}^{r} \in \mathbb{R}^{M \times h_r}$,
where $\mathbf{E}^{r}_{(ij)}$ denotes the span-pair encoding consisting of the $i$-th and $j$-th spans, $M$ denotes the number of candidate span pairs, and $h_r$ denotes the size of the span-pair encoding. Lastly, we feed the candidate span-pair representations to the Memory-Aware Relation Classifier and predict the relation type between the two spans. In the next sections, we cover the five main modules of our model in detail.

Figure 2: Trigger-Sense Memory Flow Framework (TriMF) Overview

Memory

The Memory holds category representations learned from historical training examples, consisting of an entity memory and a relation memory. Each slot of these two memories corresponds to an entity category or a relation category, respectively. The category representation held in the corresponding memory slot can be used by the Memory Flow Attention module to enhance task-related information in a sentence, or by the Trigger Sensor module to sense triggers.

In the Memory module, we define two types of processes, the Memory Read Process and the Memory Write Process, to manipulate the memory.
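To make the memory layout concrete, here is a minimal sketch (ours, not the authors' released code) of the Memory as two slot matrices, one row per entity or relation category; the class name and the initialization scale are our assumptions.

```python
import torch

class Memory:
    """One learned slot per category; updated by the write process below."""
    def __init__(self, n_slots: int, slot_size: int):
        # The paper initializes memory from a normal distribution but does
        # not publish code, so the scale here is assumed.
        self.slots = torch.randn(n_slots, slot_size) * 0.02

# Example sizes: ACE05 defines 7 entity and 6 relation categories,
# and the paper sets all slot sizes to 768.
entity_memory = Memory(n_slots=7, slot_size=768)    # M^E
relation_memory = Memory(n_slots=6, slot_size=768)  # M^R
```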
Memory Read Process
Given an input $\mathbf{E}$ and a memory $\mathbf{M}$, we define two processes to read memory: a normal read process and an inverse read process. The normal read process takes the input as the query and the memory as the key and value. We first calculate the attention weights of the input $\mathbf{E}$ on the memory $\mathbf{M}$ with a bilinear similarity function, and then weight the memory by these weights:

$$\mathrm{A}_{norm}(\mathbf{E}, \mathbf{M}) = \mathrm{softmax}\left(\mathbf{E}\mathbf{W}\mathbf{M}^{T}\right) \quad (1)$$
$$\mathrm{Read}_{norm}(\mathbf{E}, \mathbf{M}) = \mathrm{A}_{norm}(\mathbf{E}, \mathbf{M})\,\mathbf{M} \quad (2)$$

where $\mathbf{W}$ is a learnable parameter for the bilinear attention mechanism. The inverse read process instead takes the memory as the query and the input as the key and value. We first compute a 2d attention weight matrix through a bilinear similarity function, and then sum the 2d attention weight matrix over the memory-slot dimension to obtain a 1d attention weight vector over the input $\mathbf{E}$; elements of the input that are more relevant to the memory receive larger weights. We then multiply the 1d attention weight vector with $\mathbf{E}$ to get a memory-aware sequence encoding:

$$\mathrm{A}_{inv}(\mathbf{E}, \mathbf{M}) = \sum_{i=1}^{|\mathbf{M}|} \mathrm{softmax}\left(\mathbf{M}_{i}\mathbf{W}\mathbf{E}^{T}\right) \quad (3)$$
$$\mathrm{Read}_{inv}(\mathbf{E}, \mathbf{M}) = \mathrm{A}_{inv}(\mathbf{E}, \mathbf{M})\,\mathbf{E} \quad (4)$$

where $\mathbf{W}$ is a learnable parameter for the bilinear attention mechanism and $|\mathbf{M}|$ denotes the number of slots in the memory $\mathbf{M}$.
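A minimal sketch of the two read processes in Eqs. (1)–(4), reading Eq. (4) as scaling each input position by its 1d attention weight; the tensor shapes and PyTorch realization are our assumptions.

```python
import torch
import torch.nn.functional as F

def read_normal(E: torch.Tensor, M: torch.Tensor, W: torch.Tensor) -> torch.Tensor:
    """Eqs. (1)-(2): input as query, memory as key/value.
    E: (seq_len, h), M: (n_slots, h_m), W: (h, h_m)."""
    A = F.softmax(E @ W @ M.T, dim=-1)   # (seq_len, n_slots) attention over slots
    return A @ M                          # (seq_len, h_m) memory summary per position

def read_inverse(E: torch.Tensor, M: torch.Tensor, W: torch.Tensor) -> torch.Tensor:
    """Eqs. (3)-(4): memory as query, input as key/value.
    W here maps slots to inputs: (h_m, h)."""
    scores = F.softmax(M @ W @ E.T, dim=-1)  # (n_slots, seq_len) 2d attention
    a = scores.sum(dim=0)                    # 1d weights: relevance of each position
    return a.unsqueeze(-1) * E               # memory-aware sequence encoding
```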
Memory Write Process

We write the entity memory using gradients of the entity classification loss and write the relation memory using gradients of the relation classification loss. If the gradient of the current instance's classification loss is large, the classified instance (span or span-pair) representation is far away from the corresponding memory slot (the entity or relation category representation of the ground truth) and closer to the memory slots of the other categories, so we assign a large weight to this instance when writing it into memory. This makes the category representations stored in memory more accurate. The write processes for the entity memory and the relation memory are:

$$\mathbf{M}^{E}_{e} = \mathbf{M}^{E}_{e} - \mathbf{E}^{s}_{i}\mathbf{W}_{e}\,\frac{\partial \mathcal{L}_{e}}{\partial logit_{e}}\, lr \quad (5)$$
$$\mathbf{M}^{R}_{r} = \mathbf{M}^{R}_{r} - \mathbf{E}^{r}_{(ij)}\mathbf{W}_{r}\,\frac{\partial \mathcal{L}_{r}}{\partial logit_{r}}\, lr \quad (6)$$
$$logit_{e} = \log\left(\frac{p(s_i = e)}{1 - p(s_i = e)}\right) \quad (7)$$
$$logit_{r} = \log\left(\frac{p(r_{ij} = r)}{1 - p(r_{ij} = r)}\right) \quad (8)$$

where $\mathcal{L}_e$ and $\mathcal{L}_r$ denote the entity classification loss and the relation classification loss, $lr$ denotes the learning rate, $\mathbf{W}_e$ and $\mathbf{W}_r$ are two weight matrices, $p(s_i = e)$ denotes the probability of span $s_i$ belonging to entity type $e$, $p(r_{ij} = r)$ denotes the probability of span pair $(s_i, s_j)$ having relation type $r$, and $\mathbf{E}^{s}_{i}$ and $\mathbf{E}^{r}_{(ij)}$ denote the candidate span and span-pair encodings, respectively. These symbols are defined precisely in the Memory-Aware Classifier section below.
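A sketch of the write update of Eqs. (5)–(6) for a single slot; treating the loss-gradient term as a precomputed scalar is our simplification.

```python
def memory_write(M: torch.Tensor, slot: int, instance: torch.Tensor,
                 W: torch.Tensor, grad_logit: float, lr: float) -> None:
    """Eqs. (5)-(6): move the ground-truth category's slot using the instance
    representation, weighted by d(loss)/d(logit) and the learning rate.
    M: (n_slots, h_m), instance: (h,), W: (h, h_m)."""
    with torch.no_grad():
        # A large loss gradient means the instance sits far from its slot,
        # so it is written into memory with a larger weight.
        M[slot] -= (instance @ W) * grad_logit * lr
```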
Multi-Level Memory Flow Attention

We perform a memory flow attention mechanism between the memory and the input sequence to enhance task-relevant information, such as entity surface names and trigger words. The entity memory and the relation memory enhance entity-related and relation-related information in the input instance for the two tasks respectively, and thus help strengthen the bi-directional interaction between the tasks.
Memory Flow Attention
To enhance the task-relevant information in a sentence, we design the Memory Flow Attention based on the Memory. Given a memory $\mathbf{M}$ and a sequence encoding $\mathbf{E}$, we calculate the memory-aware sequence encoding by running the memory inverse read process:

$$\mathrm{MFA}_{s}(\mathbf{E}, \mathbf{M}) = \mathrm{Read}_{inv}(\mathbf{E}, \mathbf{M}) \quad (9)$$

A single memory flow can be extended to multiple memory flows. We consider two types in our work: a relation memory flow and an entity memory flow. We therefore design a Multi-Memory Flow Attention mechanism, calculated as follows:

$$\mathrm{MFA}_{m}(\mathbf{E}, \mathbf{M}^{R}, \mathbf{M}^{E}) = \mathrm{mean}\left(\mathrm{MFA}_{s}(\mathbf{E}, \mathbf{M}^{R}), \mathrm{MFA}_{s}(\mathbf{E}, \mathbf{M}^{E})\right) \quad (10)$$

where $\mathbf{M}^{E}$ and $\mathbf{M}^{R}$ denote the entity memory and the relation memory, respectively. Language is hierarchical, and different levels represent semantic information at different levels of granularity. As shown in Figure 3, we therefore extend the multi-memory flow attention mechanism to multiple levels (subword level and word level) and design a Multi-Level Multi-Memory Flow Attention mechanism, formalized in Eqs. (11)–(13) below.
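Reusing read_inverse from the earlier sketch, the single and multi memory flows of Eqs. (9)–(10) reduce to a few lines; having a separate bilinear weight per memory is our assumption.

```python
def mfa_single(E, M, W):
    # Eq. (9): one memory flow is exactly an inverse read.
    return read_inverse(E, M, W)

def mfa_multi(E, M_rel, M_ent, W_rel, W_ent):
    # Eq. (10): average the relation-memory and entity-memory flows, so the
    # same encoding is reshaped by what both tasks have already learned.
    return (mfa_single(E, M_rel, W_rel) + mfa_single(E, M_ent, W_ent)) / 2
```

At the multi-level stage (Eqs. (11)–(13) below), mfa_multi is applied to the subword encoding, the result is max-pooled into words by $f_w$, and it is applied again at the word level.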
Figure 3: Multi-Level Multi-Memory Flow Attention

$$\mathbf{E}^{d} = \mathrm{MFA}_{m}(\mathbf{E}^{d}, \mathbf{M}^{R}, \mathbf{M}^{E}) \quad (11)$$
$$\mathbf{E}^{w} = f_{w}\left(\mathbf{E}^{d}\right) \quad (12)$$
$$\mathbf{E}^{w} = \mathrm{MFA}_{m}(\mathbf{E}^{w}, \mathbf{M}^{R}, \mathbf{M}^{E}) \quad (13)$$

where $\mathbf{E}^{d}$ and $\mathbf{E}^{w}$ denote the memory-aware sequence encodings at the subword level and the word level, respectively.

Syntactic-Semantic Graph Weighted Fusion

The semantic information and syntactic structure of a sentence are important for both entity recognition and relation extraction. We consider both by constructing a semantic graph and a syntactic graph from a sentence, with nodes in the graphs referring to words in the sentence. We update a node representation based on its neighbor nodes' representations and the graph structure in the two graphs. We note that some words have a significant syntactic role but contribute little to the semantics of a sentence, such as prepositions and conjunctions, while other words are just the opposite: they contribute significantly to the semantics, such as nouns and notional verbs. Therefore, we need to fuse the syntactic and semantic graphs based on the relative importance of the syntactic and semantic roles. First, the nodes in the two graphs are initialized as:

$$\mathbf{H}^{(0)} = \mathbf{E}^{w} \quad (14)$$
Syntactic Graph

We construct a directed syntactic graph from a sentence based on dependency parsing, with words as nodes and the dependencies between words as edges. We then use an R-GCN [31] to update the node representations. The node representations of the syntactic graph $\hat{\mathbf{H}}^{(l)}$ in the $l$-th layer are calculated as:

$$\hat{\mathbf{H}}^{(l)}_{i} = \sigma\left(\sum_{r \in \mathcal{R}_{dep}} \sum_{j \in \mathcal{N}^{r}_{i}} \frac{1}{c_{i,r}} \hat{\mathbf{W}}^{(l)}_{r} \mathbf{H}^{(l)}_{j} + \hat{\mathbf{W}}^{(l)}_{0} \mathbf{H}^{(l)}_{i}\right) \quad (15)$$

where $\hat{\mathbf{W}}^{(l)}_{r}$ and $\hat{\mathbf{W}}^{(l)}_{0}$ denote two learnable weight matrices, $c_{i,r}$ is a normalization constant, and $\mathcal{N}^{r}_{i}$ denotes the set of neighbor indices of node $i$ under relation $r \in \mathcal{R}_{dep}$.
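A small, loop-based sketch of the R-GCN update in Eq. (15); we simplify the per-relation normalizer $c_{i,r}$ to the node's neighbor count and assume ReLU for $\sigma$.

```python
import torch

def rgcn_layer(H, edges, W_rel, W_self):
    """Eq. (15) sketch. H: (n_words, h); edges: (i, j, r) triples meaning word j
    is a neighbor of word i under dependency relation r;
    W_rel: (n_dep_relations, h, h); W_self: (h, h)."""
    rows = []
    for i in range(H.size(0)):
        neighbors = [(j, r) for (ti, j, r) in edges if ti == i]
        agg = H.new_zeros(H.size(1))
        for j, r in neighbors:
            # Message from neighbor j, transformed by its relation's weight.
            agg = agg + (H[j] @ W_rel[r]) / len(neighbors)
        rows.append(agg + H[i] @ W_self)  # self-loop term
    return torch.relu(torch.stack(rows))
```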
Semantic Graph

We compute a dense adjacency matrix based on semantic similarity and randomly sample from the fully connected graph to construct the semantic graph:

$$\tilde{\alpha} = \mathrm{LeakyReLU}\left(\tilde{\mathbf{W}}\mathbf{H}^{(l)}\right)^{T} \mathrm{LeakyReLU}\left(\tilde{\mathbf{W}}\mathbf{H}^{(l)}\right) \quad (16)$$

where $\tilde{\mathbf{W}}$ denotes a trainable weight matrix. We then compute a weighted average to aggregate the neighbor nodes $\mathcal{N}(i)$, where the weights come from the normalized adjacency matrix $\alpha$. We update the node representations of the semantic graph $\tilde{\mathbf{H}}^{(l)}_{i}$ in the $l$-th layer as follows:

$$\alpha = \mathrm{softmax}(\tilde{\alpha}) \quad (17)$$
$$\tilde{\mathbf{H}}^{(l)}_{i} = \alpha_{i,i}\tilde{\mathbf{W}}\mathbf{H}^{(l)}_{i} + \sum_{j \in \mathcal{N}(i)} \alpha_{i,j}\tilde{\mathbf{W}}\mathbf{H}^{(l)}_{j} \quad (18)$$
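A matching sketch of Eqs. (16)–(18): a dense similarity adjacency, random sparsification of the fully connected graph, and one weighted aggregation; the sampling rate keep_prob is our assumption.

```python
import torch
import torch.nn.functional as F

def semantic_graph_layer(H, W_sem, keep_prob=0.5):
    """Eqs. (16)-(18) sketch. H: (n_words, h); W_sem: (h, h)."""
    Z = F.leaky_relu(H @ W_sem)                          # LeakyReLU(W~ H)
    scores = Z @ Z.T                                     # Eq. (16): dense adjacency
    keep = torch.rand_like(scores) < keep_prob           # random edge sampling
    keep |= torch.eye(scores.size(0), dtype=torch.bool)  # always keep alpha_ii
    scores = scores.masked_fill(~keep, float("-inf"))
    alpha = F.softmax(scores, dim=-1)                    # Eq. (17)
    return alpha @ (H @ W_sem)                           # Eq. (18): weighted aggregation
```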
Node-Wise Graph Weighted Fusion

We design a graph weighted fusion module to dynamically fuse the two graphs according to the relative semantic and syntactic importance of the words in a sentence. The [CLS] vector, denoted $\mathbf{e}_{cls}$, is often used for sentence-level tasks and contains information about the entire sentence. We first calculate the bilinear similarity between $\mathbf{e}_{cls}$ and each node of the semantic and syntactic graphs. We then normalize the similarity vectors across the two graphs to obtain two sets of weights, which indicate semantic and syntactic importance, respectively. Finally, we fuse all nodes across the graphs based on the weights:

$$\tilde{\mathbf{w}}, \hat{\mathbf{w}} = \mathrm{softmax}\left(\left\{\mathbf{e}_{cls}\mathbf{W}\tilde{\mathbf{H}}^{(l)}, \mathbf{e}_{cls}\mathbf{W}\hat{\mathbf{H}}^{(l)}\right\}\right) \quad (19)$$
$$\mathbf{H}^{(l+1)}_{i} = \tilde{\mathbf{w}}_{i} \cdot \tilde{\mathbf{H}}^{(l)}_{i} + \hat{\mathbf{w}}_{i} \cdot \hat{\mathbf{H}}^{(l)}_{i} \quad (20)$$

where $\mathbf{W}$ is a learnable weight matrix, and $\tilde{\mathbf{w}}$ and $\hat{\mathbf{w}}$ denote the node importance weights of the semantic and syntactic graphs, respectively. We then map the node representations $\mathbf{H}^{(l+1)}$ to the corresponding word representations $\mathbf{E}^{g}$ using mean-pooling:

$$\mathbf{E}^{g} = \mathrm{mean}\left(\mathbf{H}^{(l+1)}, \mathbf{E}^{w}\right) \quad (21)$$
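And a sketch of the node-wise fusion in Eqs. (19)–(20): bilinear scores against the [CLS] vector, a softmax across the two graphs per node, and a weighted sum.

```python
import torch
import torch.nn.functional as F

def node_wise_fusion(H_sem, H_syn, e_cls, W):
    """Eqs. (19)-(20) sketch. H_sem, H_syn: (n_words, h); e_cls: (h_c,);
    W: (h, h_c) bilinear weight."""
    s_sem = H_sem @ W @ e_cls                          # (n,) semantic importance
    s_syn = H_syn @ W @ e_cls                          # (n,) syntactic importance
    w = F.softmax(torch.stack([s_sem, s_syn]), dim=0)  # normalize across the two graphs
    return w[0].unsqueeze(-1) * H_sem + w[1].unsqueeze(-1) * H_syn
```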
Trigger Sensor

We know that a particular relation usually occurs in conjunction with a particular set of words, which we call relation triggers. They can help explain why humans would extract a relation from the sentence and play an essential role in relation extraction. We present a Trigger Sensor module that senses and enhances contextual trigger information without any trigger annotations.

Relation triggers typically appear in the local context between a pair of spans $(s_i, s_j)$, and some approaches encode the local context directly into the span-pair representation for relation classification. However, these approaches do not consider the case where the triggers are outside the span pair, causing the model to ignore useful information from the rest of the context. We design both a Local-Context Encoder and a Full-Context Trigger Sensor to compute the local-context representation $\mathbf{g}_{local}$ and the full-context trigger representation $\mathbf{g}_{trigger}$.

Local-Context Encoder

We aggregate the local-context information between spans of interest using max-pooling. The local-context representation $\mathbf{g}_{local}$ is calculated as:

$$\mathbf{g}_{local} = \max\left(\mathbf{E}^{g}_{k}, \mathbf{E}^{g}_{k+1}, \cdots, \mathbf{E}^{g}_{h}\right) \quad (22)$$

where $\mathbf{E}^{g}_{k}, \mathbf{E}^{g}_{k+1}, \cdots, \mathbf{E}^{g}_{h}$ are the encodings of the words between the two spans $(s_i, s_j)$.
Full-Context Trigger Sensor

The full-context trigger sensor aims to sense and enhance span-pair-specific triggers. Given a pair of spans $(s_i, s_j)$, we use the head span and the tail span as queries and execute the normal read process on the relation memory. After obtaining the two span-specific memory representations, we mean-pool them to get the span-pair-specific relation representation $\mathbf{m}^{r}_{(ij)}$:

$$\mathbf{m}^{r}_{(ij)} = \mathrm{mean}\left(\mathrm{Read}_{norm}\left(\mathbf{E}^{s}_{i}, \mathbf{M}^{R}\right), \mathrm{Read}_{norm}\left(\mathbf{E}^{s}_{j}, \mathbf{M}^{R}\right)\right) \quad (23)$$

We calculate the similarity between $\mathbf{m}^{r}_{(ij)}$ and each word representation of the word sequence, and then weight the word sequence to get the full-context trigger representation $\mathbf{g}_{trigger}$:

$$\mathbf{g}_{trigger} = \mathrm{softmax}\left(\mathbf{m}^{r}_{(ij)}\left(\mathbf{E}^{g}\right)^{T}\right)\mathbf{E}^{g} \quad (24)$$

We incorporate the local-context representation $\mathbf{g}_{local}$ and the full-context trigger representation $\mathbf{g}_{trigger}$ into the span-pair encoding $\mathbf{E}^{r}_{(ij)}$ using $f_r$:

$$\mathbf{E}^{r}_{(ij)} = f_{r}\left(\mathbf{E}^{s}_{i}, \mathbf{E}^{s}_{j}, \mathbf{g}_{local}, \mathbf{g}_{trigger}\right) \quad (25)$$

where for $f_r$ we adopt concatenation.
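Building on read_normal from the memory sketch, the full-context trigger sensor of Eqs. (23)–(24) can be sketched as follows; it relies on the word encoding size and relation memory slot size matching (all 768 in the reported setup).

```python
def full_context_trigger(E_si, E_sj, M_rel, W, E_g):
    """Eqs. (23)-(24) sketch. E_si, E_sj: (h_s,) head/tail span encodings;
    M_rel: (n_r, h_mr); W: (h_s, h_mr); E_g: (n_words, h_w) with h_w == h_mr."""
    m_i = read_normal(E_si.unsqueeze(0), M_rel, W)  # head-span memory read
    m_j = read_normal(E_sj.unsqueeze(0), M_rel, W)  # tail-span memory read
    m_r = (m_i + m_j) / 2                           # Eq. (23): mean-pooling
    att = F.softmax(m_r @ E_g.T, dim=-1)            # similarity with every word
    return (att @ E_g).squeeze(0)                   # Eq. (24): g_trigger
```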
Trigger Extraction

Using the trigger sensor, we can also extract relation triggers and provide a reasonable explanation for model predictions. We rank the words based on the similarity of each word representation with the span-pair-specific relation representation $\mathbf{m}^{r}_{(ij)}$. The top-ranked words can be used as relation triggers to explain the model's predictions. We show the trigger extraction ability of our model in the case study section.

Memory-Aware Classifier

Representations of the entity and relation categories are stored in the entity memory and the relation memory, respectively. Based on the bilinear similarity between an instance (span or span-pair) representation and the category representations, we compute the probability of candidate span $s_i$ being an entity of type $e$:

$$p(s_i = e) = \frac{\exp\left(\mathbf{E}^{s}_{i}\mathbf{W}_{e}\mathbf{M}^{E}_{e}\right)}{\sum_{k \in \mathcal{E}} \exp\left(\mathbf{E}^{s}_{i}\mathbf{W}_{e}\mathbf{M}^{E}_{k}\right)} \quad (26)$$

and the probability of candidate span pair $(s_i, s_j)$ having relation $r$:

$$p\left(r_{(ij)} = r\right) = \frac{\exp\left(\mathbf{E}^{r}_{(ij)}\mathbf{W}_{r}\mathbf{M}^{R}_{r}\right)}{\sum_{k \in \mathcal{R}} \exp\left(\mathbf{E}^{r}_{(ij)}\mathbf{W}_{r}\mathbf{M}^{R}_{k}\right)} \quad (27)$$

where $\mathbf{W}_{e} \in \mathbb{R}^{h_s \times h_{me}}$ and $\mathbf{W}_{r} \in \mathbb{R}^{h_r \times h_{mr}}$ denote two learnable weight matrices. Finally, we define a joint loss function for entity classification and relation classification: $\mathcal{L} = \mathcal{L}_s + \mathcal{L}_r$, where $\mathcal{L}_s$ denotes the cross-entropy loss over entity categories (including the None category), and $\mathcal{L}_r$ denotes the binary cross-entropy loss over relation categories.

Model Training

At the start of training, since the memory is randomly initialized, the Memory Flow Attention module and the Trigger Sensor module would introduce noise into the sequence encoding. This noise further corrupts the semantic information of the pre-trained BERT [10] through gradient descent. We therefore divide the model training procedure into two stages. In the first stage, we aim to learn more accurate category representations and store them in the corresponding memory slots: we only train the Memory-Aware Classifier and Graph Weighted Fusion modules and update the memory through the memory write process. In the second stage, we add the Memory Flow Attention and Trigger Sensor modules to the training procedure. Based on the more accurate category representations stored in the memory, we can strengthen the contextual task-related features and relation triggers through the memory read process.
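A sketch of the bilinear, memory-aware classification heads of Eqs. (26)–(27) above; since the paper trains relations with a per-category binary cross-entropy, the sigmoid reading in the second function is equally consistent.

```python
def entity_probs(E_s, M_ent, W_e):
    # Eq. (26): softmax over bilinear similarity between each span encoding
    # and every entity-category memory slot. E_s: (n_spans, h_s).
    return F.softmax(E_s @ W_e @ M_ent.T, dim=-1)

def relation_scores(E_r, M_rel, W_r):
    # Eq. (27) analogue; with the binary cross-entropy loss over relation
    # categories, per-category sigmoid scores are a natural reading.
    return torch.sigmoid(E_r @ W_r @ M_rel.T)
```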
EXPERIMENTS

We evaluate TriMF described above on the following four datasets:
• SciERC: The SciERC dataset [26] includes annotations for scientific entities, their relations, and coreference clusters for 500 scientific abstracts. The dataset defines 6 types for annotating scientific entities and 7 relation categories. We adopt the same data splits as in [26].
• ACE05: ACE05 was built upon ACE04 and is commonly used to benchmark NER and RE methods. ACE05 defines 7 entity categories and, for each pair of entities, 6 relation categories. We adopt the same data splits as in [29].
• CoNLL04: The CoNLL04 dataset [30] consists of 1,441 sentences with annotated entities and relations extracted from news articles. It defines 4 entity categories and 5 relation categories. We adopt the same data splits as in [14], which contain 910 training, 243 dev, and 288 test sentences.
• ADE: The Adverse Drug Events (ADE) dataset [15] consists of 4,272 sentences and 6,821 relations extracted from medical reports. These sentences describe the adverse effects arising from drug use. The ADE dataset contains two entity categories and a single relation category.

Table 1: Precision, Recall, and F1 scores on the SciERC, ACE05, CoNLL04 and ADE datasets (macro-average = *, boundary evaluation = †, strict evaluation = ‡)

Dataset  | Model                       | Entity P | Entity R | Entity F1 | Relation P | Relation R | Relation F1
SciERC   | SciIE †                     |    -     |    -     |   65.20   |     -      |     -      |   41.60
SciERC   | DYGIE++ †                   |    -     |    -     |   67.50   |     -      |     -      |   48.40
SciERC   | SpERT † (using SciBERT [6]) |  70.87   |  69.79   |   70.33   |   53.40    |   48.54    |   50.84
SciERC   | TriMF † (using SciBERT)     |  70.18   |          |           |            |            |   52.44
ACE05    | DyGIE †                     |    -     |    -     |   88.40   |     -      |     -      |   63.20
ACE05    | DYGIE++ †                   |    -     |    -     |   88.60   |     -      |     -      |   63.40
ACE05    | TriMF †                     |          |          |           |            |            |   66.49
ACE05    | Multi-turn QA ‡             |          |          |           |            |            |
CoNLL04  | Multi-head + AT [4] ‡       |          |          |           |            |            |
CoNLL04  | TriMF ‡                     |          |          |           |            |            |   72.35
ADE      | Multi-head + AT [4] ‡*      |    -     |    -     |   86.73   |     -      |     -      |   75.52
ADE      | SpERT ‡*                    |  88.99   |  89.59   |   89.28   |   77.77    |   79.96    |   78.84
ADE      | TriMF ‡*                    |  89.50   |  91.29   |           |            |            |   80.66
We compare our model with current advanced joint entity and relation extraction models, divided into three types: general parameter-sharing based models (Multi-head + AT, SPTree, SpERT, SciIE), span-graph based models (DyGIE, DyGIE++), and reading-comprehension based models (Multi-turn QA, MRC4ERE++).
Multi-head + AT [4] treats relation extraction as a multi-head selection problem. Each entity is combined with all other entities to form entity pairs, and the relations between each pair are predicted. In addition, instead of treating relation classification as a multi-class task with mutually exclusive categories, it treats it as multiple binary classification tasks where each relation is independent, which allows more than one relation to be predicted.
SPTree [29] shares the parameters of the encoder between the joint entity recognition and relation extraction tasks, which strengthens the correlation between the two tasks. SPTree is the first model that adopts a neural network to solve the joint extraction task for entities and relations.
SpERT [11] is a simple and effective model for joint entity and relation extraction. It uses BERT [10] to encode a sentence and enumerates all spans in the sentence. It then performs span classification and span-pair classification to extract entities and relations.
SciIE [26] is a framework for extracting entities and relations from the scientific literature. It reduces error propagation between tasks and leverages cross-sentence relations through coreference links by introducing a multi-task setup and a coreference disambiguation task.
DyGIE/DyGIE++ [27, 33] dynamically build a span graph and iteratively refine the span representations by propagating coreference and relation type confidences through the constructed span graph. DyGIE++ additionally takes event extraction into account.
Multi-turn QA [23] treats the joint entity and relation extraction task as a multi-round question-and-answer task. Each entity and each relation is depicted using a question-and-answer template, so that entities and relations can be extracted by answering the templated questions.
MRC4ERE++ [41] introduces a diverse question-answering mechanism based on Multi-turn QA. Two answer selection strategies are designed to integrate different answers. Moreover, MRC4ERE++ proposes to predict a subset of potential relations, filtering out irrelevant ones to generate questions effectively.
We evaluate these models on both the entity recognition and relation extraction tasks. An entity is considered correct if its predicted span and entity label match the ground truth. When evaluating the relation extraction task, previous works have used different metrics; for ease of comparison, we report multiple evaluation metrics consistent with them. We define a strict evaluation, where a relation is considered correct if its relation type as well as the two related entities are correct, and a boundary evaluation, where entity type correctness is not considered. We report strict relation F1 on CoNLL04 and ADE, boundary relation F1 on SciERC, and both on ACE05. Our experiments on these datasets all report micro-F1 scores, except for the ADE dataset, where we report macro-F1 scores.
In most experiments, we use BERT [10], pre-trained on an English corpus, as the encoder. On the SciERC dataset, we replace BERT with SciBERT [6]. We perform the four-level encoding with the subword encoding size $h$, word encoding size $h_w$, span encoding size $h_s$, span-pair encoding size $h_r$, entity memory slot size $h_{me}$, and relation memory slot size $h_{mr}$ all set to 768. We use a single graph neural network layer in the semantic and syntactic graphs. We initialize the entity memory and the relation memory from a normal distribution. We use the Adam optimizer with a linear warmup-decay learning rate schedule (with a peak learning rate of 5e-5), a dropout rate of 0.5 before the entity and relation bilinear classifiers, a batch size of 8, span width embeddings of 25 dimensions, and a max span size of 10. The training is divided into two stages, with a first stage of 18 epochs and a second stage of 12 epochs.
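A hedged sketch of the optimization setup just described (Adam, linear warmup-decay peaking at 5e-5, two training stages of 18 and 12 epochs); the number of warmup steps is our assumption, as the paper does not state it.

```python
from torch.optim import Adam
from torch.optim.lr_scheduler import LambdaLR

def build_optimizer(params, total_steps, warmup_steps=1000, peak_lr=5e-5):
    opt = Adam(params, lr=peak_lr)
    def lr_lambda(step):
        if step < warmup_steps:                      # linear warmup to peak_lr
            return step / max(1, warmup_steps)
        # linear decay from peak_lr toward 0 over the remaining steps
        return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return opt, LambdaLR(opt, lr_lambda)

STAGE1_EPOCHS = 18  # classifier + graph fusion only; memory written each step
STAGE2_EPOCHS = 12  # Memory Flow Attention and Trigger Sensor switched on
```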
Main Results

We report the average results over 5 runs on the SciERC, ACE05 and CoNLL04 datasets. For ADE, we report metrics averaged across the 10 folds. Table 1 shows the performance of the proposed method as well as the baseline models on the SciERC, ACE05, CoNLL04 and ADE datasets. Our model consistently outperforms the state-of-the-art models for both entity and relation extraction on all datasets. Specifically, the relation F1 scores of our model advance previous models by +3.2%, +4.9%, +0.6% and +2.3% on SciERC, ACE05, CoNLL04 and ADE, respectively. We attribute the improvement to three factors. First, our model can share learned information between tasks through the Memory module, enhancing task interactions in both directions (from NER to RE, and from RE to NER). Second, the Trigger Sensor module can enhance relation trigger information, which is essential for relation classification. Lastly, going a step beyond introducing structural information through syntactic graphs, we distinguish the semantic and syntactic importance of words and fuse the two sources of information through a dynamic Graph Weighted Fusion module. We conduct ablation studies to further investigate the effectiveness of these modules.
Table 2: Effect of Different Modules

SciERC                     | Entity F1 | Δ     | Relation F1 | Δ
TriMF                      | 70.17     | -     | 52.44       | -
w/o Graph Weighted Fusion  | 70.12     | -0.05 | 51.83       | -0.61
w/o Trigger Sensor         | 70.19     | +0.02 | 51.23       | -1.21
w/o Subword-level MFA      | 70.11     | -0.06 | 51.27       | -1.17
w/o Word-level MFA         | 70.21     | +0.04 | 51.78       | -0.66

ACE05                      | Entity F1 | Δ     | Relation F1 | Δ
TriMF                      | 87.61     | -     | 62.77       | -
w/o Graph Weighted Fusion  | 87.55     | -0.06 | 61.68       | -1.09
w/o Trigger Sensor         | 87.45     | -0.16 | 61.60       | -1.17
w/o Subword-level MFA      | 87.09     | -0.52 | 61.68       | -1.09
w/o Word-level MFA         | 87.42     | -0.19 | 62.02       | -0.75
Effect of Different Modules
To assess the effect of each proposed module, we conduct an ablation study. As shown in Table 2, all modules contribute to the final performance. Specifically, removing the Trigger Sensor module has the most significant effect, causing the relation F1 score to drop from 52.44% to 51.23% on SciERC and from 62.77% to 61.60% on ACE05. Comparing the effects of Memory Flow Attention at the subword level and the word level on the two datasets, we find that the improvement of MFA at the subword level is more significant; we thus believe that fine-grained semantic information is more effective for relation extraction. The performance of the Syntactic-Semantic Graph Weighted Fusion module varies widely across datasets, achieving an improvement of 1.09% on ACE05 but only 0.61% on SciERC. This may be related to the differing importance of syntactic information for relation extraction in different domains.
Effect of Interaction Between Two Subtasks
There is a mutual dependency between the entity recognition and relation extraction tasks. Our framework models this relationship through the Multi-level Memory Flow Attention module. Depending on the memory that the attention mechanism relies on, it can be divided into Relation-specific MFA and Entity-specific MFA. The Relation-specific MFA module enhances relation-related information based on the relation memory, allowing the entity recognition task to utilize information already captured by the relation extraction task; Entity-specific MFA works analogously. To verify that the Memory Flow Attention module can facilitate the interaction between entity recognition and relation extraction, we perform ablation studies, as shown in Table 3. On ACE05 and SciERC, both Entity-specific MFA and Relation-specific MFA bring significant performance improvements. In addition, Relation-specific MFA provides a larger improvement than Entity-specific MFA. We think the reason may be that our model performs entity recognition first and then relation extraction. This order determines that information from entity recognition is already used by relation extraction, while information from relation extraction is not otherwise fed back to entity recognition.
Table 3: Effect of Interaction between NER and RE

SciERC            | Entity F1 | Δ     | Relation F1 | Δ
TriMF             | 70.17     | -     | 52.44       | -
w/o MFA           | 70.04     | -0.13 | 50.78       | -1.66
w/o Relation MFA  | 70.07     | -0.10 | 51.28       | -1.16
w/o Entity MFA    | 70.17     | 0     | 51.84       | -0.60

ACE05             | Entity F1 | Δ     | Relation F1 | Δ
TriMF             | 87.61     | -     | 62.77       | -
w/o MFA           | 87.42     | -0.19 | 62.19       | -0.58
w/o Relation MFA  | 87.37     | -0.24 | 62.06       | -0.71
w/o Entity MFA    | 87.38     | -0.23 | 62.64       | -0.13
When using Relation-specific MFA, a bridge for bi-directional information flow is built between the two tasks. Furthermore, when we use both Entity-specific MFA and Relation-specific MFA, the experiment achieves the best performance, indicating that MFA can enhance the bi-directional interaction between entity recognition and relation extraction.
Effect of Different Graph Fusion Methods
Our proposed graph weighted fusion module employs a node-wise, attention-based weighted fusion approach, which enables a flexible fusion of node representations according to the syntactic and semantic importance of words. To demonstrate the effectiveness of our approach, we compare it with other node-wise fusion methods, including no fusion, max-fusion, mean-fusion and sum-fusion, as shown in Table 4. Comparing the two experiments that use only the semantic graph or only the syntactic graph, we find that the syntactic graph provides a greater improvement in model performance, probably because the initial encodings of the nodes already contain semantic information. Compared to max-fusion, mean-fusion, and sum-fusion, the node-wise weighted fusion method brings a larger improvement in relation F1 scores on both SciERC and ACE05, which demonstrates the effectiveness of our method.
Effect of Different Stage Divisions for Memory
We explore the effect of different two-stage divisions on relation classification, as shown in Figure 4 (the x-axis is the number of epochs in the first stage; the total number of epochs is 30). We note that if our model skips the first stage (x = 0) or the second stage (x = 30), performance degrades significantly. Specifically, as the proportion of first-stage epochs to total epochs increases, our model performs better, but past a certain point performance degrades significantly. We believe this is because the number of second-stage epochs decreases and the memory written in the first stage is not utilized effectively. Therefore, the two-stage training strategy is effective, and a good balance of the two stages brings out better model performance.
Effect of Different Gradient Flows to Memory
Our model primarily writes the memory in the Memory-Aware Classifier. Furthermore, we can also tune the memory in the MFA and Trigger Sensor modules through the backpropagation of gradients.
Table 4: Effect of Different Graph Fusion Methods

SciERC           | Entity P | Entity R | Entity F1 | Relation P | Relation R | Relation F1
No Graph         | 69.87    | 70.33    | 70.10     | 52.56      | 49.59      | 51.03
Semantic Graph   | 68.47    | 69.61    | 69.04     | 52.00      | 50.62      | 51.30
Syntactic Graph  | 72.18    | 70.68    |           |            |            |

ACE05            | Entity P | Entity R | Entity F1 | Relation P | Relation R | Relation F1
No Graph         | 87.24    | 87.18    | 87.21     | 60.11      | 61.83      | 60.96
Semantic Graph   | 87.57    | 87.69    | 87.63     | 59.45      | 62.47      | 60.92
Syntactic Graph  | 87.47    | 87.36    | 87.41     | 59.29      | 62.96      | 61.07
Mean-fusion      | 87.32    | 87.78    | 87.55     | 59.74      | 62.90      | 61.28
Sum-fusion       | 87.85    | 87.47    |           |            |            |

Figure 4: Effect of Train Stage Division
The gradient flows are divided into three types: Trigger Sensor gradients, Subword-level MFA gradients and Word-level MFA gradients; we investigate the effects of the different gradients, as shown in Table 6. On the ACE05 dataset, when we block any of the gradient flows, model performance decreases significantly, by 1.35%, 1.54%, and 0.92% in relation F1 score, which indicates that tuning the memory during the second stage is effective. On the SciERC dataset, however, there is no significant drop, and we believe the model has already learned accurate category representations in the first training stage.
Effect of Relation Filtering Threshold
The precision and recall of relation classification are correlated with the predefined threshold. We investigate the impact of the relation filtering threshold on relation F1. Figure 5 shows the relation F1 scores on the SciERC and ACE05 test sets, plotted against the relation filtering threshold. The performance of our model is stable across the choice of relation filtering threshold: it achieves good results on relation classification except at the extreme thresholds of 0.0 or 1.0. Therefore, within a reasonable range, our model is not sensitive to the choice of threshold.

Figure 5: Effect of Relation Filtering Threshold
Table 5: Results of Trigger Words Extraction

Original Text: "Urutigoechea and the others were arrested Wednesday in the cities of Bayonee and Bonloc in southwestern France in Poitiers in west-central France."
Relation: (Bonloc, Located in, France)
Top-5 Relation Triggers: southwestern, west-central, cities, of, in

Original Text: "Kleber Elias Gia Bustamante, accused by the police of being a member of the "Red Sun" central committee, has been living clandestinely since his escape from the Garcia Moreno Prison, where he was held accused of assassinating the industrialist, Jose Antonio Briz Lopez."
Relation: (Kleber Elias Gia Bustamante, Kill, Jose Antonio Briz Lopez)
Top-5 Relation Triggers: Prison, assassinating, held, of, accused
Table 6: Effect of Gradient Flow to Memory

SciERC                       | Entity F1 | Δ     | Relation F1 | Δ
TriMF                        | 70.17     | -     | 52.44       | -
w/o Trigger Sensor Grad.     | 70.14     | -0.03 | 52.28       | -0.16
w/o Subword-level MFA Grad.  | 70.25     | +0.08 | 52.03       | -0.41
w/o Word-level MFA Grad.     | 70.12     | -0.05 | 52.14       | -0.30

ACE05                        | Entity F1 | Δ     | Relation F1 | Δ
TriMF                        | 87.61     | -     | 62.77       | -
w/o Trigger Sensor Grad.     | 87.55     | -0.06 | 61.42       | -1.35
w/o Subword-level MFA Grad.  | 87.43     | -0.18 | 61.23       | -1.54
w/o Word-level MFA Grad.     | 87.34     | -0.27 | 61.85       | -0.92
Trigger Words Extraction
With the Trigger Sensor module, our model has the ability to extract relation triggers. We rank the similarity of each word representation with the span-pair-specific relation representation, which has already been calculated in the Trigger Sensor. After filtering out entity surface words and stopwords, the top-k words are picked as relation triggers and used to interpret the results of the relation extraction. We show two cases in Table 5.
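A sketch of this extraction procedure; the stopword list and the choice of k are ours.

```python
def extract_triggers(words, E_g, m_r, entity_indices, stopwords, k=5):
    """Rank words by similarity with the span-pair specific relation
    representation m_r (from Eq. (23)), drop entity surface words and
    stopwords, and return the top-k as relation triggers."""
    scores = (E_g @ m_r).tolist()                 # one similarity per word
    ranked = sorted(range(len(words)), key=scores.__getitem__, reverse=True)
    triggers = [words[i] for i in ranked
                if i not in entity_indices and words[i].lower() not in stopwords]
    return triggers[:k]
```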
Memory Flow Attention Visualization
We visualize the attention weights to provide a straightforward picture of how the entity and relation memory flow attention we designed enhances the interaction between entity recognition and relation extraction. It also enhances information about relation triggers in context, to some extent explaining the model's predictions. Figure 6 shows two cases of how attention weights on context from the relation memory flow can help the model recognize entities and highlight relation triggers. Each example is split into two visualizations, with the top showing the original attention weights and the bottom showing the attention weights after masking the entities. In the top visualization, we can see that the darker words belong to entities, for example, "Urutigoechea", "Bayonee", "Bonloc" in case 1 and "Dallas", "Jack Ruby" in case 2, illustrating that our relation memory flow attention can highlight relevant entity information. Consistent with [16], our attention distribution also illustrates that entity names provide more valid information for relation classification than context. To more clearly visualize the attention weights of different contextual words, we mask all entities, normalize the weights of the remaining words, and then visualize them. As shown in the bottom visualization, the relation memory flow pays more attention to the words that indicate the type of relation, i.e., relation triggers: "in", "southwestern", "west-central" in case 1 indicate the Located in relation, and "assassin", "murdering" in case 2 indicate the Kill relation. This shows that our relation memory flow is able to highlight relation triggers, helping the model achieve better performance on relation extraction.
Error Cases

In addition to visualizing Memory Flow Attention weights on true positives, we also analyze a number of false positives and false negatives. These error cases include relations requiring inference, ambiguous entity recognition and long entity recognition, as shown in Table 7. In the first case, although our model is able to recognize the four Location entities, it incorrectly extracts the relation (Guernsey, Located in, France) and does not extract the correct one, (Guernsey, Located in, Channel Islands). This is because the model does not infer the complex location relation among the four entities. Our model is also prone to mistakes when classifying ambiguous entities, and false positives and false negatives often occur together. For example, in the second row of Table 7, the model does not recognize "CBS News" as an Organization entity, but recognizes "CBS", which is not labeled in the test set. Furthermore, recognizing long entities is a challenge
Trigger-Sense Memory Flow Framework for Joint Entity and Relation Extraction WWW ’21, April 19–23, 2021, Ljubljana, Slovenia
Figure 6: Two case studies of relation memory flow attention during inference (each case shown in its original form and with entities masked). The darker cells have higher attention weights.
Table 7: Typical error examples. Red brackets indicate entities predicted by the model, blue brackets indicate true entities, and the labels in the lower right corner indicate the type of relation between the corresponding entities and whether the entity is the head or tail (T for the tail entity; H for the head entity).

1. The problem is not unusual in [[ Guernsey ]H_Located-in ]H_Located-in, one of [ Britain ]T_Located-in's [ Channel Islands ]T_Located-in off the coast of [[ France ]]T_Located-in ...
2. ... and former [[ CBS ]T_Work-for News ] commentator [[ Eric Sevareid ]H_Live-in ]H_Work-for, who was born in [ Velva ]T_Live-in, several miles southeast of [ Minot ].
3. Text of the statement issued by the [ Organization of the Oppressed on Earth ] claiming [[ U.S. ]T_Live-in ]T_Live-in Marine Lt. [[ William R. Higgins ]H_Live-in ]H_Live-in was hanged.
for our model, due to the fact that long entities are sparse in the dataset. For example, in the third row of Table 7, the model fails to recognize the long entity "Organization of the Oppressed on Earth".

CONCLUSION

In this paper, we propose a Trigger-Sense Memory Flow Framework (TriMF) for joint entity and relation extraction. We use the memory to boost task-related information in a sentence through the Multi-level Memory Flow Attention module. This module can effectively exploit the mutual dependency and enhance the bi-directional interaction between the entity recognition and relation extraction tasks. Focusing on relation triggers, we also design a Trigger Sensor to sense and enhance triggers based on the memory. Our model can extract relation triggers without any trigger annotations, which better assists relation extraction and provides an explanation. Furthermore, we distinguish the semantic and syntactic importance of a word in a sentence and fuse the semantic and syntactic graphs dynamically based on an attention mechanism. Experiments on the SciERC, ACE05, CoNLL04 and ADE datasets show that our proposed model TriMF achieves state-of-the-art performance.

In the future, we will improve our work along two directions. First, we plan to impose constraints on the representations of entity categories and relation categories written into the memory, since relations and entities substantively satisfy specific constraints at the ontology level. Second, to improve the model's ability to sense triggers, we plan to add weak supervision (e.g., word frequency, entity boundaries) to the Trigger Sensor module.
ACKNOWLEDGMENTS
This work is supported by the National Key Research and Development Project of China (No. 2018AAA0101900), the Fundamental Research Funds for the Central Universities, the Chinese Knowledge Center of Engineering Science and Technology (CKCEST) and the MOE Engineering Research Center of Digital Library.
REFERENCES
[1] Eugene Agichtein and Luis Gravano. 2000. Snowball: Extracting relations from large plain-text collections. In Proceedings of the Fifth ACM Conference on Digital Libraries. 85–94.
[2] Chinatsu Aone, Lauren Halverson, Tom Hampton, and Mila Ramos-Santacruz. 1998. SRA: Description of the IE2 system used for MUC-7. In Seventh Message Understanding Conference (MUC-7): Proceedings of a Conference Held in Fairfax, Virginia, April 29–May 1, 1998.
[3] David S Batista, Bruno Martins, and Mário J Silva. 2015. Semi-supervised bootstrapping of relationship extractors with distributional semantics. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. 499–504.
[4] Giannis Bekoulis, Johannes Deleu, Thomas Demeester, and Chris Develder. 2018. Adversarial training for multi-context joint entity and relation extraction. arXiv preprint arXiv:1808.06876 (2018).
[5] Giannis Bekoulis, Johannes Deleu, Thomas Demeester, and Chris Develder. 2018. Joint entity recognition and relation extraction as a multi-head selection problem. Expert Systems with Applications 114 (2018), 34–45.
[6] Iz Beltagy, Kyle Lo, and Arman Cohan. 2019. SciBERT: A pretrained language model for scientific text. arXiv preprint arXiv:1903.10676 (2019).
[7] Quoc-Chinh Bui, Sophia Katrenko, and Peter MA Sloot. 2011. A hybrid approach to extract protein–protein interactions. Bioinformatics 27, 2 (2011), 259–265.
[8] Razvan Bunescu and Raymond Mooney. 2005. A shortest path dependency kernel for relation extraction. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing. 724–731.
[9] Yee Seng Chan and Dan Roth. 2011. Exploiting syntactico-semantic structures for relation extraction. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. 551–560.
[10] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
[11] Markus Eberts and Adrian Ulges. 2019. Span-based joint entity and relation extraction with transformer pre-training. arXiv preprint arXiv:1909.07755 (2019).
[12] Tsu-Jui Fu, Peng-Hsuan Li, and Wei-Yun Ma. 2019. GraphRel: Modeling text as relational graphs for joint entity and relation extraction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 1409–1418.
[13] Katrin Fundel, Robert Küffner, and Ralf Zimmer. 2007. RelEx—Relation extraction using dependency parse trees. Bioinformatics 23, 3 (2007), 365–371.
[14] Pankaj Gupta, Hinrich Schütze, and Bernt Andrassy. 2016. Table filling multi-task recurrent neural network for joint entity and relation extraction. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers. 2537–2547.
[15] Harsha Gurulingappa, Abdul Mateen Rajput, Angus Roberts, Juliane Fluck, Martin Hofmann-Apitius, and Luca Toldo. 2012. Development of a benchmark corpus to support the automatic extraction of drug-related adverse effects from medical case reports. Journal of Biomedical Informatics 45, 5 (2012), 885–892.
[16] Xu Han, Tianyu Gao, Yankai Lin, Hao Peng, Yaoliang Yang, Chaojun Xiao, Zhiyuan Liu, Peng Li, Maosong Sun, and Jie Zhou. 2020. More data, more relations, more context and more openness: A review and outlook for relation extraction. arXiv preprint arXiv:2004.03186 (2020).
[17] Marti A Hearst. 1992. Automatic acquisition of hyponyms from large text corpora. In COLING 1992 Volume 2: The 15th International Conference on Computational Linguistics.
[18] Jing Jiang and ChengXiang Zhai. 2007. A systematic exploration of the feature space for relation extraction. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference. 113–120.
[19] Rosie Jones, Andrew McCallum, Kamal Nigam, and Ellen Riloff. 1999. Bootstrapping for text learning tasks. In IJCAI-99 Workshop on Text Mining: Foundations, Techniques and Applications, Vol. 1.
[20] Nanda Kambhatla. 2004. Combining lexical, syntactic, and semantic features with maximum entropy models for extracting relations. In Proceedings of the ACL 2004 on Interactive Poster and Demonstration Sessions. 22–es.
[21] Arzoo Katiyar and Claire Cardie. 2016. Investigating LSTMs for joint extraction of opinion entities and relations. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 919–929.
[22] Qi Li and Heng Ji. 2014. Incremental joint extraction of entity mentions and relations. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 402–412.
[23] Xiaoya Li, Fan Yin, Zijun Sun, Xiayu Li, Arianna Yuan, Duo Chai, Mingxin Zhou, and Jiwei Li. 2019. Entity-relation extraction as multi-turn question answering. arXiv preprint arXiv:1905.05529 (2019).
[24] Bill Yuchen Lin, Dong-Ho Lee, Ming Shen, Ryan Moreno, Xiao Huang, Prashant Shiralkar, and Xiang Ren. 2020. TriggerNER: Learning with entity triggers as explanations for named entity recognition. In ACL.
[25] Yankai Lin, Shiqi Shen, Zhiyuan Liu, Huanbo Luan, and Maosong Sun. 2016. Neural relation extraction with selective attention over instances. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2124–2133.
[26] Yi Luan, Luheng He, Mari Ostendorf, and Hannaneh Hajishirzi. 2018. Multi-task identification of entities, relations, and coreference for scientific knowledge graph construction. arXiv preprint arXiv:1808.09602 (2018).
[27] Yi Luan, Dave Wadden, Luheng He, Amy Shah, Mari Ostendorf, and Hannaneh Hajishirzi. 2019. A general framework for information extraction using dynamic span graphs. arXiv preprint arXiv:1904.03296 (2019).
[28] Scott Miller, Heidi Fox, Lance Ramshaw, and Ralph Weischedel. 2000. A novel use of statistical parsing to extract information from text.
[29] Makoto Miwa and Mohit Bansal. 2016. End-to-end relation extraction using LSTMs on sequences and tree structures. arXiv preprint arXiv:1601.00770 (2016).
[30] Dan Roth and Wen-tau Yih. 2004. A linear programming formulation for global inference in natural language tasks. Technical Report. Illinois Univ at Urbana-Champaign Dept of Computer Science.
[31] Michael Schlichtkrull, Thomas N Kipf, Peter Bloem, Rianne Van Den Berg, Ivan Titov, and Max Welling. 2018. Modeling relational data with graph convolutional networks. In European Semantic Web Conference. Springer, 593–607.
[32] Ryuichi Takanobu, Tianyang Zhang, Jiexi Liu, and Minlie Huang. 2019. A hierarchical framework for relation extraction with reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 7072–7079.
[33] David Wadden, Ulme Wennberg, Yi Luan, and Hannaneh Hajishirzi. 2019. Entity, relation, and event extraction with contextualized span representations. arXiv preprint arXiv:1909.03546 (2019).
[34] Shaolei Wang, Yue Zhang, Wanxiang Che, and Ting Liu. 2018. Joint extraction of entities and relations based on a novel graph scheme. In IJCAI. 4461–4467.
[35] Ziqi Wang, Yujia Qin, Wenxuan Zhou, Jun Yan, Qinyuan Ye, Leonardo Neves, Zhiyuan Liu, and Xiang Ren. 2019. Learning from explanations with neural execution tree. In International Conference on Learning Representations.
[36] Zhepei Wei, Jianlin Su, Yue Wang, Yuan Tian, and Yi Chang. 2020. A novel cascade binary tagging framework for relational triple extraction. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 1476–1488.
[37] Bishan Yang and Claire Cardie. 2013. Joint inference for fine-grained opinion extraction. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 1640–1649.
[38] Yue Yuan, Xiaofei Zhou, Shirui Pan, Qiannan Zhu, Zeliang Song, and Li Guo. 2020. A relation-specific attention network for joint entity and relation extraction. In International Joint Conference on Artificial Intelligence 2020. Association for the Advancement of Artificial Intelligence (AAAI), 4054–4060.
[39] Xiangrong Zeng, Daojian Zeng, Shizhu He, Kang Liu, and Jun Zhao. 2018. Extracting relational facts by an end-to-end neural model with copy mechanism. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 506–514.
[40] Min Zhang, Jie Zhang, and Jian Su. 2006. Exploring syntactic features for relation extraction using a convolution tree kernel. In Proceedings of the Human Language Technology Conference of the NAACL, Main Conference. 288–295.
[41] Tianyang Zhao, Zhao Yan, Y. Cao, and Zhoujun Li. 2020. Asking effective and diverse questions: A machine reading comprehension based framework for joint entity-relation extraction. In IJCAI.
[42] Suncong Zheng, Feng Wang, Hongyun Bao, Yuexing Hao, Peng Zhou, and Bo Xu. 2017. Joint extraction of entities and relations based on a novel tagging scheme. arXiv preprint arXiv:1706.05075 (2017).
[43] Wenxuan Zhou, Hongtao Lin, Bill Yuchen Lin, Ziqi Wang, Junyi Du, Leonardo Neves, and Xiang Ren. 2020. NERO: A neural rule grounding framework for label-efficient relation extraction. In Proceedings of The Web Conference 2020.