Reinforcement Learning-based N-ary Cross-Sentence Relation Extraction
Chenhan Yuan Ryan Rossi Andrew Katz Hoda Eldardiry Virginia Tech Adobe [email protected], [email protected], [email protected], [email protected]
Abstract
Models of n-ary cross-sentence relation extraction based on distant supervision assume that consecutive sentences mentioning n entities describe the relation of these n entities. However, on one hand, this assumption introduces noisily labeled data and harms the models' performance. On the other hand, some non-consecutive sentences also describe one relation, and these sentences cannot be labeled under this assumption. In this paper, we relax this strong assumption to a weaker distant supervision assumption to address the second issue, and propose a novel sentence distribution estimator model to address the first problem. This estimator, which selects correctly labeled sentences to alleviate the effect of noisy data, is a two-level agent reinforcement learning model. In addition, a novel universal relation extractor with a hybrid approach of attention mechanism and PCNN is proposed such that it can be deployed in any task, including both consecutive and non-consecutive sentences. Experiments demonstrate that the proposed model can reduce the impact of noisy data and achieve better performance on the general n-ary cross-sentence relation extraction task compared to baseline models.

Introduction
As a key step in constructing a knowledge graph, relation extraction is the task of extracting the relation between the entities expressed in a sentence. Previous work has largely focused on intra-sentence binary relation extraction, where the goal is to extract the relation between an entity pair in the sentence (Hu et al. 2019; Gupta et al. 2019). However, some relations involve more than two entities and may span multiple sentences, which is defined as n-ary cross-sentence relation extraction. In the example shown in Table 1, the relation "educate" includes four entities: the person's "name", "academic degree", "academic major" and "school". In addition, this relation spans four sentences in the example. Some prior works have applied a supervised learning approach to tackle this task, but they require large-scale labeled training data (Jia, Wong, and Poon 2019). To obtain large-scale annotated data, some work assumes that if consecutive sentences (a sentence group) contain the entities that have a relation in a knowledge base, these sentences as a whole describe that relation (Quirk and Poon 2016). This assumption is referred to as distant supervision in the n-ary cross-sentence relation extraction task.

Table 1: Distant supervision labeled sentences example. Pos. is the sentence position in the text. DS is the relation label given by distant supervision and R is the real label. The fact used here is {Alan Turing, PhD, Princeton, computer science}, which has the "educate" relation. "edu" denotes that the sentence represents the "educate" relation and "–" denotes that it does not.

Pos. | Sentence | DS | R
3 | Alan Turing worked on hyper computation in Princeton University. | edu | –
4 | He obtained his PhD in 1938. | edu | edu
18 | Alan Turing studied logic and computer science in Princeton. | – | edu
20 | His PhD advisor is Alonzo Church. | – | edu

Even though methods based on distant supervision can quickly annotate sentences, they still have two main limitations: 1) they suffer from a noisy labeling problem; 2) the strong distant supervision assumption does not consider non-consecutive sentences, which reduces the generalizability of the trained model. In the example shown in Table 1, the sentences at the 18th and 20th positions describe the fact but are not labeled by distant supervision because they are not consecutive. The first sentence is incorrectly labeled and is noisy labeled data: it describes Alan Turing's work instead of his education.

To address the first limitation, we propose to train a sentence distribution estimator (SDE), which is a two-level agent reinforcement learning model. This provides a well-trained model that can select high-quality labeled sentence groups and alleviate the impact of noisy data. There are previous works that apply reinforcement learning (RL) to remove binary intra-sentence noisy data and achieve state-of-the-art (SotA) performance (Feng et al. 2018; Yang et al. 2019; Qin, Xu, and Wang 2018). When applying RL to n-ary cross-sentence relation extraction, a key challenge is that the RL model should not only learn sentence features, but also know the context and relation between sentences. In this paper, the process of selecting sentences is influenced not only by the features of the sentence itself, but also by the indicators we define, which measure the semantic relationship between sentences. Moreover, whether a sentence is selected in a state affects the decision of the next state. This state transition property provides the ability to choose the best combination of sentences in each sentence group.

To address the second limitation, we relax the strong distant supervision assumption that lies at the heart of prior work by replacing it with a weaker distant supervision assumption.
The assumption is that a sentence that has at least one main entity or two supplementary entities is annotated with the relation of these entities. We follow the Wikidata Knowledge Base scheme, where a main entity is a "value" of a fact and a supplementary entity is a "qualifier" of a fact. This assumption admits some non-consecutive sentences, and we propose a novel universal relation extractor to encode both consecutive and non-consecutive sentence groups. This relation extractor has a self-attention and soft attention mechanism layer, which compares the similarity between the word-level features and the relation query vectors. The relation extractor also encodes each sentence via a Piece-wise Convolutional Neural Network (PCNN) layer. The PCNN output is used to learn how the information flows through sentences via a non-linear transformation layer.

Related Work
The dependency shortest path has been applied along with other preprocessing features for n-ary cross-sentence relation extraction (Li et al. 2015; Mesquita, Schmidek, and Barbosa 2013). With the rise of deep learning, some work encoded the dependency shortest path via graph neural networks. Peng et al. applied Graph-LSTM to encode the dependency shortest path and link each path (Peng et al. 2017). One dependency shortest path usually requires two Graph-LSTMs. Song et al. proposed the Graph-state LSTM so that only one Graph-LSTM is needed to encode a path (Song et al. 2018). Some work also implemented a Bi-LSTM directly to encode the whole sentence sequences without requiring any preprocessing (Mandya et al. 2018). The LSTM-CNN model they proposed achieved better performance on the PubMed dataset, but it cannot encode long sequences. Recently, this model has been improved by deploying a multi-head attention layer. The model is also enhanced by incorporating prior knowledge from a pre-trained Knowledge Base (Zhao et al. 2020).

The large-scale data used in these approaches are automatically labeled via distant supervision. As discussed in previous literature, distant supervision always introduces noisy, incorrectly labeled data (Mintz et al. 2009; Takamatsu, Sato, and Nakagawa 2012). In the binary relation extraction task, this problem is addressed by using a weaker distant supervision assumption. This assumption puts all labeled sentences with the same entity pair into a bag and assumes that only one sentence in this bag is correctly labeled (Ye and Ling 2019; Ji et al. 2017). Some work also trained an extra selector model, which selects the correctly labeled sentences as the training data of the relation extraction model. Most selectors are reinforcement learning (RL)-based models (Feng et al. 2018; Yang et al. 2019; Qin, Xu, and Wang 2018).
However, these reinforcement learning-based selectors cannot be applied to the n-ary cross-sentence relation extraction task, which is more challenging than binary intra-sentence relation extraction.

To the best of our knowledge, our proposed work is the first to apply reinforcement learning to the n-ary cross-sentence relation extraction task. We propose an RL-based two-level agent selector (sentence distribution estimator) to select correctly labeled sentence groups. We also propose a weaker distant supervision assumption to label both consecutive and non-consecutive sentences. To encode both, a novel universal relation extractor model is proposed, which is a hybrid of an attention mechanism and a context-learning process over sentence features.
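As an illustration of the bag construction used by these weaker-assumption methods, the grouping step can be sketched as follows (the data layout and function name are hypothetical, not taken from the cited papers):

```python
from collections import defaultdict

def group_into_bags(labeled_sentences):
    """Group distantly labeled sentences into bags keyed by entity set
    and relation; under the at-least-one assumption, only some
    sentences in each bag actually express the bag's relation."""
    bags = defaultdict(list)
    for sentence, entities, relation in labeled_sentences:
        bags[(frozenset(entities), relation)].append(sentence)
    return dict(bags)

data = [
    ("He obtained his PhD in 1938.", ("Alan Turing", "PhD"), "educate"),
    ("Alan Turing worked on hyper computation.", ("Alan Turing", "PhD"), "educate"),
]
bags = group_into_bags(data)   # one bag holding two candidate sentences
```

A downstream selector would then pick, per bag, which candidate sentences to trust as training instances.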
Problem Formulation
In a relation extraction task, a fact is defined as a collection of i entities and one corresponding relation, where i ≥ 2. Relations are verb phrases and describe the relationship among these entities. For every m sentences, a relation extraction model should give the relation among the entities expressed in these sentences. If m ≥ 2 and i ≥ 3, the task is the cross-sentence n-ary relation extraction problem.

In the distant supervision-based method, we decompose the cross-sentence n-ary relation extraction task into two sub-problems: sentence distribution estimation and relation extraction. Sentence distribution estimation is formulated as follows: given a set of sentence group and relation label pairs {(g_1, r_1), (g_2, r_2), ..., (g_n, r_n)}, where each sentence group g_i contains a variable number of sentences and r_i is the noisy relation label produced by distant supervision, the objective is to decide which sentences in each group truly describe the relation. In other words, the model tells which sentences are correctly labeled and should be selected as training instances. Relation extraction is to classify the relation r_i given a sentence group g_i.

Proposed Model
As shown in Fig. 1, the proposed model consists of two sub-models. The first is a sentence distribution estimator (SDE), which is used to measure the probability that a sentence is correctly labeled by distant supervision. This model outputs a group of sentences that describe a complete fact. The second model is the relation extraction model (RE). This model takes a group of sentences as input and infers the relation contained in these sentences.
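The division of labor between the two sub-models can be sketched as a simple pipeline; the callables here are hypothetical stand-ins for the trained SDE and RE models, not the paper's actual interfaces:

```python
def extract_relation(sentence_group, sde_select, re_classify):
    """Two-stage pipeline: the SDE filters out sentences it judges
    incorrectly labeled, then the RE model classifies the relation
    of the surviving group."""
    selected = [s for s in sentence_group if sde_select(s)]
    if not selected:
        return None               # SDE kept nothing to classify
    return re_classify(selected)

# Toy stand-ins: keep sentences mentioning "PhD", always predict "educate".
group = ["He obtained his PhD in 1938.", "He liked running."]
result = extract_relation(group, lambda s: "PhD" in s, lambda sel: "educate")
# result == "educate"
```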
Sentence Distribution Estimator (SDE)
The proposed model is a two-level agent reinforcement learning model, as shown in Fig. 1. We assume that for each group of sentences, there is one main sentence and some supplementary sentences. The main sentence, which contains main entities, gives the main information about the relation. The supplementary sentences supplement this information and are selected given the main sentence.
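This two-level structure can be sketched as a nested selection loop, with hypothetical boolean policies standing in for the two trained agents:

```python
def select_group(main_sents, supp_sents, main_policy, supp_policy):
    """Two-level selection sketch: the first-level agent decides
    whether to keep a main sentence; only then does the second-level
    agent pick supplementary sentences conditioned on that main
    sentence."""
    groups = []
    for main in main_sents:
        if not main_policy(main):
            continue                      # main sentence rejected
        supps = [s for s in supp_sents if supp_policy(main, s)]
        groups.append((main, supps))
    return groups

groups = select_group(
    ["m1", "m2"], ["s1", "s2"],
    main_policy=lambda m: m == "m1",
    supp_policy=lambda m, s: s == "s2",
)
# groups == [("m1", ["s2"])]
```

The conditioning of the second level on the first is what distinguishes this from selecting all sentences independently.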
Main Sentence-Level Policy

State
The vector representation of main sentence i is state s_i, and it is generated by the PCNN layer of the RE model.

Figure 1: The flowchart of the proposed model. On the right side, the sentence distribution estimator consists of the main policy and the supplementary policy. The relation extraction model is on the left side.

Action
The action set for this level is a_i ∈ {0, 1}, where 1 indicates that the program selects sentence i as a correctly labeled sentence. Note that this is a one-state RL problem and the reward is calculated once a_i is decided.

Policy
The policy π_θ represents the probability of selecting the input sentence given the encoding information s_i:

π_θ(a_i, s_i) = P(a_i | s_i) = σ(W⊤ s_i + b)    (1)

where σ is the sigmoid function, W ∈ ℝ^(d_s×1) is the weighting matrix, and d_s is the dimension of the state vector.

Reward
The reward is the classification accuracy of therelation extraction model, given selected input sentences.
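The main sentence-level policy (Eq. 1) and its sampled action can be sketched numerically as follows; the dimensions and parameter values are toy choices, and the state is a stand-in for a PCNN sentence encoding:

```python
import numpy as np

rng = np.random.default_rng(0)

def main_policy(state, W, b):
    """Sigmoid over a linear score of the sentence's state vector;
    the action a_i in {0, 1} is sampled from this probability."""
    p_select = 1.0 / (1.0 + np.exp(-(W @ state + b)))
    a = int(rng.random() < p_select)
    return a, p_select

d_s = 4                         # toy state dimension
state = np.ones(d_s)            # stand-in for a PCNN sentence encoding
W, b = np.zeros(d_s), 0.0       # untrained parameters
a, p = main_policy(state, W, b)
# with zero weights, p == 0.5 and the action is a fair coin flip
```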
Supplementary Sentence-Level Policy

State
The state m_j for this level comprises three indicators and the encoding information of sentence c_j. The first indicator, e^(−d), measures the distance between the current supplementary sentence and the main sentence, where d is the position distance. Position distance is defined as the number of sentences between the two sentences in the text. The second indicator, |{e | e ∈ sent_j ∧ e ∈ E}|, gives the variety of entities of the current sentence sent_j, where E is the set of entities of the fact. The third indicator, (c_j · s_i)/(||c_j|| · ||s_i||), measures the cosine similarity between the current sentence and the main sentence i. Along with the encoding information, we assume that these indicators can fully capture the context information needed when selecting supplementary sentences.

This is a multi-state RL problem. The transition function between each m_j must be defined so that the reward can be calculated once the end state is reached. The transition function is defined as follows. We first sort the supplementary sentences according to their corresponding second indicators, then let the agent decide whether to select the first sentence. The next state is the first sentence of the sorted remaining supplementary sentences according to |{e | e ∈ sent_j ∧ e ∈ E \ E_prev}|, where E_prev is the set of entities in the previously selected sentences.

Action
The action set for this level is b_j ∈ {0, 1}, where 1 indicates that the program selects the current sentence as a correctly labeled sentence, the label being the relation indicated by the sentence.

Policy
Eq. 2 shows that the policy π_γ considers the sentence-level indicators and the sentence encoding information simultaneously:

π_γ(b_j, m_j) = P(b_j | m_j) = σ( α (W_k⊤ k_j + b_k) + β (W_s⊤ s_j + b_c) )    (2)

where W_k ∈ ℝ^(3×1) and W_s ∈ ℝ^(d_s×1) are weighting matrices, k_j is the vector of the three real-number indicators, and d_s is the dimension of the encoding vector of the sentence. α and β are also learnable parameters.

Reward
Note that the reward, i.e., the accuracy of the results from the RE model, can only be calculated when all necessary sentences are given. Therefore, there is no intermediate reward that can be used directly for updating gradients in this level's policy. Similar to playing Go, we apply the Monte Carlo search algorithm to simulate possible future results and use the average of these results as the intermediate reward (Silver et al. 2016). More formally, given the current state and previous states m_{1:j}, the Monte Carlo search algorithm with π_γ as the roll-out policy is applied to sample the possible future state transitions m′_{j+1:M}, where M is the end state. The mathematical definition is given in Eq. 3:

Monte^{π_γ}(m_{1:j}; m′_{j+1:M}; N) = { m_{1:j} m′^{(1)}_{j+1:M}, m_{1:j} m′^{(2)}_{j+1:M}, ..., m_{1:j} m′^{(N)}_{j+1:M} }    (3)

where N is the number of sampling times. Based on this, the intermediate reward can be calculated via Eq. 4:

R(i, j) = (1/N) Σ_{n=1}^{N} e^{−Ce(RE(s_i, m^{(n)}_{1:M}))},  m^{(n)}_{1:M} ∈ Monte^{π_γ},  if j < M
R(i, j) = e^{−Ce(RE(s_i, m_{1:M}))},  if j = M    (4)

where Ce denotes the cross-entropy loss and RE is the relation extraction model. Note that an exponential function is applied to make sure that a sentence group with a lower cross-entropy loss receives a greater reward value.

Policy Gradient

After the agents decide a set of actions based on their policies, the relation extraction model gives the rewards. The objective of an RL algorithm is to maximize the overall rewards by updating policy parameters following a policy gradient strategy (Sutton et al. 2000). The gradients of our two agents can be computed using Eq. 5.
∇R_θ = Σ_τ Σ_ι R(ι) π_γ(τ, ι) π_θ(τ) ∇ log π_θ(τ)
      = E_{τ ∼ π_θ(τ)} [ Σ_ι R(ι) π_γ(τ, ι) ∇ log π_θ(τ) ]
      ≈ (1/N) Σ_{i=1}^{N} Σ_{j=1}^{M} R(i, j) π_γ(b_{i,j}, m_{i,j}) ∇ log π_θ(a_i, s_i)

∇R_γ = Σ_τ π_θ(τ) ∇ Σ_ι R(ι) π_γ(τ, ι) log π_γ(τ, ι)
      = Σ_τ π_θ(τ) E_{ι ∼ π_γ(ι)} [ R(ι) ∇ log π_γ(ι) ]
      ≈ (1/N) Σ_{i=1}^{N} π_θ(a_i, s_i) (1/M) Σ_{j=1}^{M} R(i, j) ∇ log π_γ(b_{i,j}, m_{i,j})    (5)

where ∇R_θ and ∇R_γ denote the derivative of the reward R w.r.t. the parameters of the main sentence policy π_θ and the parameters of the supplementary sentence policy π_γ, respectively. The parameters of π_θ and π_γ can then be updated via Eq. 6:

θ ← θ + α_θ ∇R_θ
γ ← γ + α_γ ∇R_γ    (6)

where α_θ and α_γ are the learning rates.

Relation Extraction Model (RE)
As shown in Fig. 1, the relation extraction (RE) model receives a group of sentences from the SDE and then encodes them using a Bidirectional-LSTM layer. We implement attention and self-attention mechanisms to incorporate these sentence encodings to classify the relation. Conventional cross-sentence relation extraction models encode consecutive sentences, so they do not consider the connection and context between sentences, which must be learned when encoding non-consecutive sentences. Therefore, in the proposed model, the attention mechanism is enriched with the output of the non-linear transformation (LSTM) layer through a gate layer. This layer learns how the information transforms across sentences, which is exactly the context information. The input of this transformation layer comes from the Piecewise Convolutional Neural Network (PCNN), which encodes each sentence as a feature vector. By applying this hybrid of attention mechanism and non-linear transformation, the proposed model is universal for the cross-sentence n-ary relation extraction task in many scenarios, including both non-consecutive and consecutive sentences.
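The gating idea described above can be sketched with simplified element-wise weights; the toy scalar parameters stand in for the learned weight matrices of the actual layer:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gate_combine(S_a, S_n, w_a, b_a, w_n, b_n):
    """Per-dimension gate: alpha blends the attention output S_a with
    the tanh-squashed context (LSTM) output S_n."""
    alpha = sigmoid(w_a * S_a + b_a)
    return alpha * S_a + (1.0 - alpha) * np.tanh(w_n * S_n + b_n)

S_a = np.array([1.0, -1.0])      # stand-in attention output
S_n = np.zeros(2)                # stand-in context output
out = gate_combine(S_a, S_n, w_a=0.0, b_a=0.0, w_n=1.0, b_n=0.0)
# with a neutral gate (alpha == 0.5) and zero context, out == 0.5 * S_a
```

The gate lets the model lean on attention when sentence-level evidence is strong and on the context path when cross-sentence flow matters.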
Sentence Encoding
A Bidirectional-LSTM is applied as the sentence encoding layer (Graves and Schmidhuber 2005). The input of the Bi-LSTM is a sequence of concatenations of the word embedding vector and the position encoding vector. We pre-train the word embeddings with dimension d_w using Word2Vec (Mikolov et al. 2013). The position encoding is implemented as follows. Supposing the ordered entity list of one sentence is e_1 e_2 ... e_n, we calculate the position distances between words and e_1, e_n. These position distances are then projected to a dense vector space of dimension d_p. Although we could use the position distances of all entities, more position encoding features decrease the classification accuracy (Mandya et al. 2018). The input dimension of the Bi-LSTM for one sentence is ℝ^(n_w × (d_w + 2 d_p)), where n_w is the number of words in each sentence. The output dimension of the Bi-LSTM for one sentence is ℝ^(n_w × d_b), where d_b is the hidden dimension of the Bi-LSTM.

PCNN and Non-linear Transformation Layer
The vector representation of each sentence is used in the non-linear transformation layer. We apply PCNN to take the output of the Bi-LSTM layer and produce the vector representation (Zeng et al. 2015). PCNN first uses n_f filters, each of kernel size ℝ^(n_s × d_b), to extract features, where n_s is the window size. The output of each filter f_i is then divided into three segments {f_i1, f_i2, f_i3} according to the positions of entities e_1, e_n, and max-pooling is applied within segments. Eq. 7 formally defines the piece-wise max-pooling layer:

p_ij = max(f_ij),  1 ≤ i ≤ n_f,  j = 1, 2, 3
p_i = p_i1 ⊕ p_i2 ⊕ p_i3    (7)

where ⊕ denotes concatenation and p_i ∈ ℝ^(1×3) is the piece-wise max-pooling result of the i-th filter. The output dimension of the PCNN for one sentence is then ℝ^(3 n_f).

The non-linear transformation layer is implemented with an LSTM cell (Hochreiter and Schmidhuber 1997). The sentence feature vector s_i coming from the PCNN layer is the input at each LSTM cell state. The hidden vector of the last state is the output of the non-linear transformation layer. The mathematical definition is shown in Eq. 8:

h_i, c_i = LSTM(s_i, h_{i−1}, c_{i−1}),  1 ≤ i ≤ n_se
h_0, c_0 ∼ N(0, 1)
q = h_{n_se}    (8)

where N(0, 1) denotes the standard normal distribution and n_se is the number of sentences in the sentence group. q ∈ ℝ^(1×d_h) is the output of the non-linear transformation layer, where d_h is the hidden dimension of the LSTM cell.

Attention and Self-attention Mechanism
Previous work reports that multi-head self-attention improves sentence-level relation extraction performance because of its ability to model long sequences (Zhao et al. 2020; Vaswani et al. 2017). This mechanism is applied via Eq. 9:

M_i = softmax( Q W_i^Q (K W_i^K)⊤ / √d ) V W_i^V
M = M_1 ⊕ M_2 ⊕ M_3 ⊕ ... ⊕ M_{n_he}
U = M W^O    (9)

where W_i^Q ∈ ℝ^(d_s × d_s/n_he), W_i^K ∈ ℝ^(d_s × d_s/n_he), W_i^V ∈ ℝ^(d_s × d_s/n_he), and W^O ∈ ℝ^(d_s × d_s) are learnable parameters, and Q ∈ ℝ^(n_se × d_s), K ∈ ℝ^(n_se × d_s), V ∈ ℝ^(n_se × d_s) are the query, key, and value vectors projected from the input vectors. n_he and d_s are the number of heads and the number of hidden units, respectively.

Another soft attention layer is applied to attend to the part of the input U that contributes the most to the classification of the relation. As shown in Eq. 10, this layer compares the relation vectors with the output vectors of the multi-head self-attention:

p_k = Σ_{j=0}^{m} ε_{k,j} u_j
ε_{k,j} = e^{c_{k,j}} / Σ_{i=0}^{m} e^{c_{k,i}}
c_{k,j} = r_k u_j⊤    (10)

where r_k ∈ ℝ^d is the learnable vector of the k-th relation and p_k ∈ ℝ^d is the attention result for relation r_k.

Gate Layer and Output Layer
As shown in Eq. 11, an element-wise gate layer is applied to incorporate the outputs of the attention layer and the non-linear transformation layer:

α = σ(W_a⊤ S_a + b_a)
S̃_n = tanh(W_n⊤ S_n + b_n)
S = α S_a + (1 − α) S̃_n    (11)

where S_a ∈ ℝ^(d_s) is the attention result and W_a ∈ ℝ^(d_s) is its weighting matrix; S_n ∈ ℝ^(d_s) is the LSTM's result and W_n ∈ ℝ^(d_s) is its weighting matrix.

Model Training
The RE and SDE models are trained iteratively. This is based on the following proposition: the proposed model can accurately classify correctly labeled data and remove the incorrectly labeled data via this training process. This is stated formally in Proposition 1 below.
Proposition 1.
After iteratively training SDE and RE, the probability P_{θ,γ}(1 | x_p) of selecting a sampled x_p as truly labeled data satisfies:

P_{θ,γ}(1 | x_p) ≫ P_{θ,γ}(1 | x_n)    (12)

where x_p represents the correctly labeled training data and x_n represents the incorrectly labeled data. We provide a proof sketch of this proposition in the appendix. The core idea of the proof is that if the number of samples of x_p is far greater than that of x_n, the RE model will converge to the x_p distribution. In other words, data assigned a high reward is generally also correctly labeled data. The SDE then assigns a higher probability of being selected to high-reward data in each iteration. In the end, the data assigned high probability is the correctly labeled data (Proposition 1).

Experiments
Datasets
PubMed
The PubMed dataset is created by automatically labeling biomedical literature with the Gene Drug Knowledge Database. The labeling process follows this rule: a candidate is retained only if there is no other co-occurrence of the same entities in an overlapping text span with a smaller number of consecutive sentences (Peng et al. 2017). In this dataset, there are 6,987 ternary drug-gene-mutation relation instances and 6,087 binary drug-mutation relation instances. There are 5 categories of relations: "resistance or nonresponse", "sensitivity", "response", "resistance" and "none". Following previous work (Peng et al. 2017), we binarize the multi-class relations by replacing the first four relations with "yes". We report experimental results on both binary relation extraction and multi-class relation extraction.
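The binarization step described above is a simple label mapping; a minimal sketch:

```python
def binarize(relation):
    """Collapse the four non-'none' PubMed relation classes to 'yes',
    as in the binarization described above."""
    positive = {"resistance or nonresponse", "sensitivity",
                "response", "resistance"}
    return "yes" if relation in positive else "none"

labels = ["sensitivity", "none", "resistance"]
binary = [binarize(r) for r in labels]
# binary == ["yes", "none", "yes"]
```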
WikiText
A complete fact appears not only in consecutive sentences but also in non-consecutive sentences. The strong distant supervision hypothesis used in PubMed only considers consecutive sentences. To cover both situations at the same time, and to test whether the proposed model can reduce the impact of noisy data in both, we also create a new dataset using a weaker distant supervision assumption. We first collect Wikipedia webpages under the "People" category and remove all non-text symbols (Vrandečić and Krötzsch 2014). Then Wikidata is used as a Knowledge Base to automatically label the relations for these webpages. In Wikidata, each fact consists of two values (main entities), n qualifiers (supplementary entities) with n roles, where n ≥ 1, and one property (relation). The labeling process follows this rule: if a sentence has at least one main entity or two supplementary entities that participate in one specific fact, this sentence possibly indicates the relation of that fact. Specifically, as in the example shown in Table 2, if a sentence has two main entities, this sentence is labeled as a main sentence of that relation. Others are labeled as supplementary sentences of that relation. Note that under this labeling process, some sentences may be labeled with more than one relation, which makes the task more challenging. Compared to the distant supervision used in the PubMed dataset, this labeling process is a weaker distant supervision assumption and does not restrict consecutiveness.

Statistically, there are 2,133 facts, 4,194 main sentences and 13,440 supplementary sentences in the WikiText dataset. The number of different relations is 55, while the number of different roles is 90. We select 20% of the main sentences and 20% of the supplementary sentences individually as the test dataset. In this randomized selection, we also make sure that any instance that has sentences in the test dataset also has sentences in the training dataset.
This selection process is applied five times and we report the average accuracy and standard deviation on this dataset.
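The constrained split described above can be sketched as a per-fact sampling step; the function name and data layout are illustrative assumptions, not the paper's actual code:

```python
import random

def split_by_fact(sentences_by_fact, test_frac=0.2, seed=0):
    """Per-fact split sketch: sample roughly test_frac of each fact's
    sentences for the test set while guaranteeing that any fact with
    test sentences also keeps at least one training sentence."""
    rng = random.Random(seed)
    train, test = [], []
    for fact, sents in sentences_by_fact.items():
        sents = list(sents)
        rng.shuffle(sents)
        # cap the test share so at least one sentence stays in train
        k = max(0, min(int(len(sents) * test_frac), len(sents) - 1))
        test.extend(sents[:k])
        train.extend(sents[k:])
    return train, test

train, test = split_by_fact({"fact_a": ["s1", "s2", "s3", "s4", "s5"],
                             "fact_b": ["s6"]})
# fact_a contributes one test sentence; fact_b stays entirely in train
```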
Following previous work (Zhao et al. 2020), the hyper-parameters are decided based on preliminary experiments on a small development set. For the RE model, we set the embedding size of words and positions to 200 and 25, respectively. The hidden dimension of the Bi-LSTM is 252 and the number of filters of the PCNN is 132. The window size of the PCNN is 5. For the SDE model, the number of sampling times for the Monte Carlo search is likewise set based on these preliminary experiments.

en.wikipedia.org/wiki/Wikipedia:Contents/People

Table 2: An example of using the weaker distant supervision to label WikiText, given the fact from the Wikidata Knowledge Base.

Fact: {relation: educated at; main entities: Marie Curie, University of Paris; supplementary entities: physics (major), Doctor of Science (degree)}
main sentence: In June 1903, Marie Curie was awarded her doctorate from the University of Paris.
main sentence: Marie Curie was the first woman to become a professor at the University of Paris.
supplementary sentence: In 1893, Marie Curie was awarded a degree in physics and began work in an industrial laboratory of Professor Gabriel Lippmann.

Table 3: Average test accuracy in five-fold validation on the PubMed dataset. Ternary denotes drug-gene-mutation interactions and Binary denotes binary drug-mutation interactions. "—" denotes that the value is not provided.

Model | Binary-class Ternary (Single / Cross) | Binary-class Binary (Single / Cross) | Multi-class Ternary (Cross) | Multi-class Binary (Cross)
Graph LSTM-EMBED (Peng et al. 2017) | 76.5 / 80.6 | 74.3 / 76.5 | — | —
Graph LSTM-FULL (Peng et al. 2017) | 77.9 / 80.7 | 75.6 / 76.7 | — | —
Graph LSTM MULTITASK (Peng et al. 2017) | — / 82 | — / 78.5 | — | —
LSTM-CNN (Mandya et al. 2018) | 79.6 / 82.9 | 85.8 / 88.5 | — | —
GCN (K=0) (Zhang, Qi, and Manning 2018) | 85.6 / 85.8 | 82.8 / 82.7 | 75.6 | 72.3
GS GLSTM (Song et al. 2018) | 80.3 / 83.2 | 83.5 / 83.6 | 71.7 | 71.7
AGGCN (Zhang, Guo, and Lu 2019) | 87.1 / 87 | 85.2 / 85.6 | 79.7 | 77.4
Multihead attention (Zhao et al. 2020) | 81.5 / 87.1 | 87.4 / … | … | …

Models
On the PubMed dataset, the proposed model is evaluated against the following baseline models: (a) Graph LSTM-based models, including Graph LSTM-EMBED/FULL/MULTITASK (Peng et al. 2017); (b) the Graph-state LSTM model (GS GLSTM) (Song et al. 2018); (c) the LSTM-CNN model, which first encodes sentences using an LSTM and then extracts features using a CNN (Mandya et al. 2018); (d) Graph Convolutional Networks (GCN) and Attention Guided GCN (AGGCN) (Zhang, Qi, and Manning 2018; Zhang, Guo, and Lu 2019); (e) the multi-head attention-based model (Zhao et al. 2020). Besides the baselines, the RE model is also tested individually as a variant of the proposed model.

On the WikiText dataset, since previous models do not address how to encode non-consecutive sentences, these models cannot be directly applied. Therefore, we select two SotA models, multi-head attention and LSTM-CNN, as the baseline models and implement them on the WikiText dataset. We also report the performance of variants of the proposed model: (a) the RE model only; (b) the proposed model with random supplementary sentence selection; (c) the proposed model without the three indicators.
Results
Evaluation on PubMed
We report the average test accuracy in five-fold validation on the PubMed dataset. As shown in Table 3, the performance of the proposed model is better than previous SotA baselines on most tasks. Specifically, the test accuracy of the RE model on all ternary relation tasks is higher than the baselines, which shows that the RE model is capable of multi-entity relation extraction. After training the SDE model and the RE model iteratively, the impact of noisy data on the training process of the RE model is greatly reduced, so that the accuracy of the proposed model is higher than that of the RE model on all tasks. Meanwhile, we notice that the accuracy of most baselines on multi-class tasks is much lower than on binary-class tasks, e.g., the accuracy of AGGCN is reduced by about 10%. However, our model still maintains a high accuracy even on multi-class tasks, which is 1.8% higher than SotA.

On the binary entity relation extraction tasks, the performance of our model drops a little. One possible reason is that we apply PCNN to extract the features of each sentence. In the binary relation data, there are many sentences with only one entity, which does not meet the conditions of PCNN. In the experiment, the second anchor of PCNN in this kind of sentence is set at the beginning of the sentence by default.

Table 4: The average test accuracy and standard deviation on the WikiText dataset.

Model | Accuracy (%)
LSTM-CNN | 37.9 ± …

Evaluation on WikiText
Since the baseline models are designed for consecutive sentences, the order of the input sentences is set so that the main sentence comes first, followed by all the supplementary sentences in order. This order is also used in our proposed model without the SDE model. As shown in Table 4, the test accuracy of the RE model is 7.1% higher than the best performance of the baselines. This shows that the RE model is more capable of encoding non-consecutive sentences and predicting the relations than previous models. Considering the results on both the WikiText and PubMed datasets, the proposed RE model is a universal model that fits both non-consecutive and consecutive cross-sentence n-ary relation extraction tasks. Note that the number of relations (classes) is 55, so the 66.4% test accuracy of the proposed model is a fairly strong result, and it is significantly better than the RE model alone. This indicates that with the help of the SDE agents, the RE model is more likely to learn the real relation distribution.

Figure 2: The probability distribution of sentences
Evaluation on SDE model
We first randomly select 100 main sentences from the test set and ask a graduate student to check whether the relation labeled by distant supervision is correct for these 100 sentences. Correctly labeled sentences are marked with "1", while others are marked with "0". The probability that the sentences are correctly labeled, as given by the main sentence-level agent, is also reported. As shown in Fig. 2, the probability distribution of the sentences given by the agent has a strong positive correlation with the results of the manual inspection. Specifically, most low-probability sentences are marked with "0" by human evaluation, which indicates that these sentences are incorrectly labeled by distant supervision, while most high-probability sentences are correctly labeled. This demonstrates that the well-trained main sentence-level agent can distinguish incorrectly labeled data from correctly labeled data.

Figure 3: The value of the weights at each training iteration

To investigate whether the three indicators selected for the supplementary sentence-level agent affect model performance, we tracked the changes of the two weights, α and β, during training and report them in Fig. 3. Both weights are initialized to 0.5 and their values change slightly during training. The weight of the three indicators does not approach 0, which demonstrates that the three selected indicators impact the model performance. Table 4 shows that the proposed model's test accuracy without these indicators is 1.3% lower than that of the original proposed model. This also indicates the positive impact of these indicators on model performance. To investigate the impact of the defined transition rule, we replace the transition rule with a random selection process, in which the next state of the supplementary sentence-level agent is randomly chosen from the remaining sentences. Table 4 shows that the model accuracy based on this process is 1.7% lower than the original model.
This demonstrates that the transition rule based on the variety of entities helps the proposed model alleviate the effect of noisy data.

Conclusion
We proposed (1) a sentence distribution estimator to alleviate the impact of noisy distant supervision labeled data for n-ary cross-sentence relation extraction; (2) a weaker distant supervision assumption, which considers non-consecutive sentences; and (3) a universal relation extractor, a hybrid of an attention mechanism and a non-linear transformation layer that encodes both non-consecutive and consecutive sentence groups. The experiments showed that the proposed model reduces the impact of noisy data and achieves significantly better performance for n-ary cross-sentence relation extraction compared to SotA models.

References
Bishop, C. M. 1995. Training with noise is equivalent to Tikhonov regularization. Neural Computation.

Feng, J.; Huang, M.; Zhao, L.; Yang, Y.; and Zhu, X. 2018. Reinforcement learning for relation classification from noisy data. In Thirty-Second AAAI Conference on Artificial Intelligence.

Graves, A.; and Schmidhuber, J. 2005. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks.

Gupta, P.; Rajaram, S.; Schütze, H.; and Runkler, T. 2019. Neural relation extraction within and across sentence boundaries. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 6513–6520.

Hochreiter, S.; and Schmidhuber, J. 1997. Long short-term memory. Neural Computation.

Hu, L.; Zhang, L.; Shi, C.; Nie, L.; Guan, W.; and Yang, C. 2019. Improving distantly-supervised relation extraction with joint label embedding. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 3812–3820.

Ji, G.; Liu, K.; He, S.; and Zhao, J. 2017. Distant supervision for relation extraction with sentence-level attention and entity descriptions. In AAAI, 3060–3066.

Jia, R.; Wong, C.; and Poon, H. 2019. Document-level n-ary relation extraction with multiscale representation learning. arXiv preprint arXiv:1904.02347.

Li, H.; Krause, S.; Xu, F.; Moro, A.; Uszkoreit, H.; and Navigli, R. 2015. Improvement of n-ary relation extraction by adding lexical semantics to distant-supervision rule learning. In ICAART (2), 317–324.

Mandya, A.; Bollegala, D.; Coenen, F.; and Atkinson, K. 2018. Combining long short term memory and convolutional neural network for cross-sentence n-ary relation extraction. In Automated Knowledge Base Construction (AKBC).

Mesquita, F.; Schmidek, J.; and Barbosa, D. 2013. Effectiveness and efficiency of open relation extraction. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, 447–457.

Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G. S.; and Dean, J. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, 3111–3119.

Mintz, M.; Bills, S.; Snow, R.; and Jurafsky, D. 2009. Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, 1003–1011.

Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. 2019. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, 8026–8037.

Peng, N.; Poon, H.; Quirk, C.; Toutanova, K.; and Yih, W.-t. 2017. Cross-sentence n-ary relation extraction with graph LSTMs. Transactions of the Association for Computational Linguistics 5: 101–115.

Qin, P.; Xu, W.; and Wang, W. Y. 2018. Robust distant supervision relation extraction via deep reinforcement learning. arXiv preprint arXiv:1805.09927.

Quirk, C.; and Poon, H. 2016. Distant supervision for relation extraction beyond the sentence boundary. arXiv preprint arXiv:1609.04873.

Silver, D.; Huang, A.; Maddison, C. J.; Guez, A.; Sifre, L.; Van Den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; et al. 2016. Mastering the game of Go with deep neural networks and tree search. Nature.

Song, L.; Zhang, Y.; Wang, Z.; and Gildea, D. 2018. N-ary relation extraction using graph state LSTM. arXiv preprint arXiv:1808.09101.

Sutton, R. S.; McAllester, D. A.; Singh, S. P.; and Mansour, Y. 2000. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, 1057–1063.

Takamatsu, S.; Sato, I.; and Nakagawa, H. 2012. Reducing wrong labels in distant supervision for relation extraction. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 721–729.

Van Der Walt, S.; Colbert, S. C.; and Varoquaux, G. 2011. The NumPy array: a structure for efficient numerical computation. Computing in Science & Engineering.

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008.

Vrandečić, D.; and Krötzsch, M. 2014. Wikidata: a free collaborative knowledgebase. Communications of the ACM.

In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 3216–3225.

Ye, Z.-X.; and Ling, Z.-H. 2019. Distant supervision relation extraction with intra-bag and inter-bag attentions. arXiv preprint arXiv:1904.00143.

Zeng, D.; Liu, K.; Chen, Y.; and Zhao, J. 2015. Distant supervision for relation extraction via piecewise convolutional neural networks. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 1753–1762.

Zhang, Y.; Guo, Z.; and Lu, W. 2019. Attention guided graph convolutional networks for relation extraction. arXiv preprint arXiv:1906.07510.

Zhang, Y.; Qi, P.; and Manning, C. D. 2018. Graph convolution over pruned dependency trees improves relation extraction. arXiv preprint arXiv:1809.10185.

Zhao, D.; Wang, J.; Zhang, Y.; Wang, X.; Lin, H.; and Yang, Z. 2020. Incorporating representation learning and multi-head attention to improve biomedical cross-sentence n-ary relation extraction. BMC Bioinformatics.
Appendix
Algorithm
The training procedure of the proposed model is shown in Algorithm 1:
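Algorithm 1 alternates between training the RE model on (selected) data and updating the SDE agents with the RE classification accuracy as reward. The Python sketch below mirrors that loop with hypothetical stand-in classes; the names `RelationExtractor`, `SentenceDistributionEstimator`, and all update rules are illustrative placeholders, not the paper's implementation.

```python
import random

# Hypothetical stand-ins for the RE and SDE models; class names, sampling,
# and update rules are placeholders, not the paper's actual code.
class RelationExtractor:
    def train_step(self, groups, labels):
        # would compute the cross-entropy loss between predictions
        # and labels, then update the RE parameters
        pass

    def accuracy(self, groups, labels):
        # stand-in for the classification accuracy used as the RL reward (Eq. 4)
        return random.random()

class SentenceDistributionEstimator:
    def sample(self, groups):
        # the two-level agents select a subset of sentence groups (Eq. 1)
        kept = [g for g in groups if random.random() > 0.5]
        return kept or list(groups[:1])

    def policy_update(self, reward):
        # REINFORCE-style policy-gradient update of the agents (Eqs. 5-6)
        pass

def train(H, M=2, J=2, K=2):
    groups, labels = zip(*H)
    extractor = RelationExtractor()
    estimator = SentenceDistributionEstimator()
    for _ in range(M):                      # pre-train RE on all labeled groups
        extractor.train_step(groups, labels)
    for _ in range(K):                      # joint training of SDE and RE
        for _ in range(J):                  # SDE phase: reward from RE accuracy
            sampled = estimator.sample(groups)
            estimator.policy_update(extractor.accuracy(sampled, labels))
        sampled = estimator.sample(groups)  # retrain RE on the selected subset
        for _ in range(M):
            extractor.train_step(sampled, labels)
    return extractor, estimator
```

The key design point reflected here is that the RE model is first pre-trained on all (noisy) data so that its accuracy provides a meaningful reward signal before the SDE agents start filtering.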
Theoretical Analysis
We theoretically show that the proposed model can identify correctly labeled data and remove incorrectly labeled data. To formalize this statement, we first define the training data distribution. Suppose we have a true distribution $p_X(x)$ and a noise distribution $\xi$. In general, $\xi$ has zero mean, and its samples are uncorrelated. Then $x_p \sim p_X(x)$ is the correctly labeled data and $x_n \sim p_X(x) + \xi$ is the incorrectly labeled data. The reasoning is that incorrectly labeled sentences share the same entity set with correctly labeled sentences, but their semantic information is shifted by some noise. The statement is then equivalent to Corollary 1 below.

Corollary 1.
After iteratively training the SDE and RE, we have:
$$P_{\theta,\gamma}(1 \mid x_p) \gg P_{\theta,\gamma}(1 \mid x_n) \quad (13)$$
where $P_{\theta,\gamma}(1 \mid x_p)$ denotes the probability of selecting a sampled $x_p$ as positively labeled data. To prove Corollary 1, we define two lemmas and give their proof sketches:

Lemma 1.
Let $r$ be the average reward. For the RL objective, we have
$$\max R \equiv \max R_p + \min R_n, \quad \text{where } R_p = R - r > 0,\;\; R_n = R - r < 0 \quad (14)$$

Proof of Lemma 1.
The objective of RL is to maximize the expected reward, as stated in Eq. 15:
$$R = \sum_{\tau} R(\tau)\,\pi_{\theta,\gamma}(\tau) = \sum_{\tau} \big(R(\tau) - r\big)\,\pi_{\theta,\gamma}(\tau) \quad (15)$$
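Eq. 15 relies on the standard policy-gradient fact that subtracting a constant baseline $r$ from the reward leaves the gradient estimate unbiased, since $\sum_{\tau} \pi(\tau)\,\nabla \log \pi(\tau) = 0$. A small numerical check of this identity for a toy softmax policy over three trajectories (the parameter and reward values below are made up for illustration):

```python
import numpy as np

# Toy softmax policy over 3 trajectories: pi_theta(tau) = softmax(theta)_tau.
theta = np.array([0.2, -0.1, 0.4])            # illustrative policy parameters
pi = np.exp(theta) / np.exp(theta).sum()
R = np.array([1.0, 3.0, -2.0])                # illustrative per-trajectory rewards
r = (pi * R).sum()                            # average reward, used as baseline

# For a softmax policy, grad_theta log pi(tau) = e_tau - pi (one row per tau).
grad_log_pi = np.eye(3) - pi

# Policy gradient E_tau[ grad log pi(tau) * reward ], with and without baseline.
g_plain = (pi[:, None] * grad_log_pi * R[:, None]).sum(axis=0)
g_base = (pi[:, None] * grad_log_pi * (R - r)[:, None]).sum(axis=0)

# The baseline term has zero expectation, so both gradients coincide.
assert np.allclose(g_plain, g_base)
```

Note that the objective values themselves differ by the constant $r$; it is the gradient, and hence the learned policy, that is unaffected by the baseline.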
Algorithm 1: Model Training

Input: a set of sentence group and relation label pairs $H = \{(g_1, r_1), (g_2, r_2), \cdots, (g_n, r_n)\}$
Output: the RE and SDE models trained on the input
Parameter: the number of training iterations for RE, for SDE, and for the whole model: $M$, $J$, and $K$, respectively

1: Initialize the parameters of the RE and SDE models
2: for $m = 1 \to M$ do
3: &nbsp;&nbsp; RE receives $G = \{g_1, g_2, \cdots, g_n\}$ as input
4: &nbsp;&nbsp; RE outputs the classification results $\hat{R} = \{\hat{r}_1, \hat{r}_2, \cdots, \hat{r}_n\}$
5: &nbsp;&nbsp; Calculate the cross-entropy loss based on $\hat{R}$ and $R = \{r_1, r_2, \cdots, r_n\}$
6: &nbsp;&nbsp; Update the parameters of the RE model
7: end for
8: for $k = 1 \to K$ do
9: &nbsp;&nbsp; for $j = 1 \to J$ do
10: &nbsp;&nbsp;&nbsp;&nbsp; SDE samples instances $G' = \{g'_1, g'_2, \cdots, g'_i\}$ from $G$ via Eq. 1
11: &nbsp;&nbsp;&nbsp;&nbsp; Run RE on the sampled instances to obtain classification results
12: &nbsp;&nbsp;&nbsp;&nbsp; Calculate the reward based on the classification accuracy of RE via Eq. 4
13: &nbsp;&nbsp;&nbsp;&nbsp; Calculate the policy gradient via Eq. 5
14: &nbsp;&nbsp;&nbsp;&nbsp; Update the parameters of the SDE model via Eq. 6
15: &nbsp;&nbsp; end for
16: &nbsp;&nbsp; SDE samples instances $G'$ from $G$
17: &nbsp;&nbsp; for $m = 1 \to M$ do
18: &nbsp;&nbsp;&nbsp;&nbsp; RE receives the sampled instances $G'$ as input
19: &nbsp;&nbsp;&nbsp;&nbsp; RE outputs the classification results $\hat{R}$
20: &nbsp;&nbsp;&nbsp;&nbsp; Calculate the cross-entropy loss based on $\hat{R}$ and $R$
21: &nbsp;&nbsp;&nbsp;&nbsp; Update the parameters of the RE model
22: &nbsp;&nbsp; end for
23: end for
24: return RE, SDE

Subtracting $r$ from $R$ is equivalent to the original reward function, because this is an unbiased estimation of the expectation. Based on this, maximizing $R$ is equivalent to:
$$\max R \propto \max\left(\sum_{\tau \in \tau_p} R_p(\tau)\,\pi_{\theta,\gamma}(\tau) + \sum_{\tau \in \tau_n} R_n(\tau)\,\pi_{\theta,\gamma}(\tau)\right) = \max \sum_{\tau \in \tau_p} R_p(\tau)\,\pi_{\theta,\gamma}(\tau) + \min\left(\sum_{\tau \in \tau_n} R_n(\tau)\,\pi_{\theta,\gamma}(\tau)\right) = \max R_p + \min R_n$$
where $\tau_p \in \{\tau \mid R(\tau) - r > 0\}$ and $\tau_n \in \{\tau \mid R(\tau) - r < 0\}$. (16) $\blacksquare$

By maximizing the first term $R_p$, Lemma 1 indicates that, after training the RL, data with a higher reward ($R - r > 0$) will be assigned a higher probability of being selected as truly labeled data.
Similarly, data with a lower reward ($R - r < 0$) will be assigned a lower probability by minimizing the second term $R_n$. Since the relation extraction task is a classification problem, we use cross-entropy as the loss function of the RE model:
$$\mathcal{L} = -\mathbb{E}_{x_p \sim p_X(x)}\big[y \log \hat{y}(x) + (1-y)\log(1-\hat{y}(x))\big] - \mathbb{E}_{x_n \sim p_X(x)+\xi}\big[y \log \hat{y}(x) + (1-y)\log(1-\hat{y}(x))\big] \quad (17)$$

Lemma 2.
After training the RE model, we have
$$\min \mathcal{L} \equiv \min \mathcal{L}_r + \upsilon \mathcal{L}_n, \quad \text{where } \mathcal{L}_r = -\sum_{x \in \{x_p,\, x_n - \xi\}} \big[y \log \hat{y}(x) + (1-y)\log(1-\hat{y}(x))\big]\, p(y \mid x)\, p(x), \qquad \int \xi_i\, \xi_j\, p(\xi)\, d\xi = \upsilon\, \delta_{ij} \quad (18)$$
$\mathcal{L}_n$ is positive definite and can be deemed a regularization term when $\upsilon$ is small. Lemma 2 indicates that the model will converge to the distribution of $x_p$ with regularization. In other words, the cross-entropy loss of $x_p$ is much lower than that of $x_n$.

Proof of Lemma 2.
We expand the loss function as a Taylor series in powers of $\xi$ and substitute the expansion into the loss function. The loss function can then be rewritten as:
$$\mathcal{L} = \mathcal{L}_r + \upsilon \mathcal{L}_e$$
$$\mathcal{L}_e = \frac{1}{2}\sum_{x \in x_n - \xi}\left\{\left[\frac{\hat{y}-y}{\hat{y}(1-\hat{y})}\right]\frac{\partial^2 \hat{y}}{\partial x^2} + \left[\frac{y(1-\hat{y})^2 + (1-y)\hat{y}^2}{\hat{y}^2(1-\hat{y})^2}\right]\left(\frac{\partial \hat{y}}{\partial x}\right)^2\right\} p(y \mid x)\, p(x) \quad (19)$$
where $\upsilon$ represents the amplitude of the noise, $\hat{y}$ denotes the relation label of $x$ predicted by the RE model, and $y$ denotes the real label. In general, $\hat{y}$ represents the predicted probability that $x$ should be labeled with the correct relation. As proved in previous literature (Bishop 1995), the terms involving $(\hat{y} - y)$ vanish after the model is trained. Then $\mathcal{L}_e$ is equivalent to:
$$\mathcal{L}_e = \frac{1}{2}\sum_{x \in x_n - \xi} y(1-\hat{y})\left(\frac{\partial \hat{y}}{\partial x}\right)^2 p(x) \quad (20)$$
Now $\mathcal{L}_e$ contains only first derivatives and is positive definite. In other words, $\mathcal{L}_e = \mathcal{L}_n$ and it can be deemed a regularizer. $\blacksquare$

From Lemma 1 and Lemma 2: upon completion of the iterative training of the SDE and RE models, the probability of selecting the correctly labeled data $x_p$ as positively labeled data ends up comparatively larger than that of the incorrectly labeled data $x_n$. That is, Corollary 1 holds.
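The noise assumption underlying Lemma 2 (zero-mean $\xi$ with uncorrelated components, so $\int \xi_i \xi_j\, p(\xi)\, d\xi = \upsilon\, \delta_{ij}$) can be checked empirically. A minimal NumPy sketch, assuming Gaussian noise with an arbitrary amplitude $\upsilon$ chosen for illustration:

```python
import numpy as np

# Empirical check of Lemma 2's noise model: zero-mean, uncorrelated noise xi
# satisfies E[xi_i xi_j] = upsilon * delta_ij. The amplitude upsilon and the
# Gaussian form are assumptions for this demo, not values from the paper.
rng = np.random.default_rng(0)
upsilon, dim, n = 0.25, 4, 200_000

xi = rng.normal(0.0, np.sqrt(upsilon), size=(n, dim))  # zero-mean noise samples
cov = xi.T @ xi / n                                    # empirical E[xi_i xi_j]

# Off-diagonal entries vanish; diagonal entries approach upsilon.
assert np.allclose(cov, upsilon * np.eye(dim), atol=0.01)
```

Any zero-mean distribution with independent components of equal variance would pass the same check; the Gaussian choice here is only for convenience.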