Relation Mention Extraction from Noisy Data with Hierarchical Reinforcement Learning
Jun Feng‡, Minlie Huang§*, Yijie Zhang§, Yang Yang†, and Xiaoyan Zhu§
‡State Grid Corporation of China
§State Key Lab. of Intelligent Technology and Systems, National Lab. for Information Science and Technology, Dept. of Computer Science and Technology, Tsinghua University, Beijing 100084, PR China
†College of Computer Science and Technology, Zhejiang University
[email protected], [email protected], [email protected], @zju.edu.cn, [email protected]

Abstract
In this paper we address the task of relation mention extraction from noisy data: extracting representative phrases for a particular relation from noisy sentences that are collected via distant supervision. Despite its significance and value in many downstream applications, this task is little studied on noisy data. The major challenges lie in 1) the lack of annotation of mention phrases, and, more severely, 2) handling noisy sentences which do not express the relation at all. To address these two challenges, we formulate the task as a semi-Markov decision process and propose a novel hierarchical reinforcement learning model. Our model consists of a top-level sentence selector to remove noisy sentences, a low-level mention extractor to extract relation mentions, and a reward estimator to provide signals that guide data denoising and mention extraction without explicit annotations. Experimental results show that our model is effective at extracting relation mentions from noisy data.
Introduction
The increasing demand for structured knowledge has significantly advanced the research of named entity recognition and relation extraction. Extensive prior research has studied extracting entities (Borthwick et al. 1998; Chiu and Nichols 2016; Xu, Jiang, and Watcharawittayakul 2017) and relations (Bunescu and Mooney 2005; Mintz et al. 2009; Zeng et al. 2014; Zheng et al. 2017) from plain text. Figure 1 illustrates an example of relation extraction, where the relation "place_of_birth" between the two entities "Barack Obama" and "Hawaii" is detected, since the expression "was born in" directly suggests the relation. Such representative expressions are referred to as relation mentions.

Relation mentions are valuable resources in many downstream tasks and benefit applications such as relation extraction, question answering, and language inference. Moreover, they offer good interpretability by revealing the textual evidence for a detected relation, and further, we can study the language variety of relation mentions: there are various phrases and ways to express the same relation. For instance, for the "place_of_birth" relation, there are many expressions such as "the birth place", "was born in", "hails from", and so on.

*Corresponding author: Minlie Huang, [email protected]

[Barack Obama]e1, the 44th president of the United States, was born in [Hawaii]e2 on August 4, 1961.
[Barack Obama]e1 went to [Hawaii]e2 on a vacation with the First Lady.
The birth place of [Barack Obama]e1 is [Hawaii]e2.
[Barack Obama]e1, who hails from [Hawaii]e2, was the president of the United States.
Entity 1: Barack Obama, Entity 2: Hawaii, Relation: place_of_birth

Figure 1: Illustration of relation mention extraction from noisy sentences. Words in red are relation mentions.

Relation mention extraction in this paper is defined as follows: given a relation r and a set of sentences containing an entity pair and associated with a noisy relation label r, the task is to extract a set of representative phrases for relation r (e.g., "place_of_birth"), such as "the birth place", "hails from", and "was born in". We term this task relation mention extraction. The relation label is automatically generated under the distant supervision assumption; "noisy" means that some sentences may not mention the automatically labeled relation r at all.

Many existing studies focus only on sentence-level relation classification, which predicts whether a sentence mentions a relation (Riedel, Yao, and McCallum 2010; Hoffmann et al. 2011; Li and Ji 2014; Miwa and Bansal 2016; Ren et al. 2017; Zheng et al. 2017). However, they do not concern the words or phrases that describe a relation. Our problem also differs from Open IE (Banko et al. 2007; Fader, Soderland, and Etzioni 2011; Angeli, Premkumar, and Manning 2015), in that such systems do not need to normalize different expressions (e.g., "the birth place" and "was born in") to the same canonical relation (e.g., "place_of_birth"), as shown in Figure 1. Some works deal with the noisy labeling issue on relation labels (Takamatsu, Sato, and Nakagawa 2012; Zeng et al. 2015; Feng et al. 2018), but they do not involve relation mention extraction.

There are two major challenges for relation mention extraction. First, the sentences for a relation are constructed by distant supervision (Mintz et al. 2009; Zeng et al. 2015) and are hence noisy: a sentence may not describe the relation at all. Extraction from noisy sentences will definitely lead to undesired, incorrect relation mentions. Second, it is too costly to annotate which words or phrases mention a relation in a sentence, particularly in the setting of large-scale relation mention extraction. Instead, there is only a very weak signal available, indicating that a sentence (noisy itself) might describe a relation.

To address these challenges, we devise a hierarchical reinforcement learning (Sutton, Precup, and Singh 1999) model for relation mention extraction from noisy sentences. The model consists of three components: a top-level sentence selector for selecting correctly labeled sentences that express a particular relation, a low-level mention extractor for identifying mention words in a selected sentence, and a reward estimator for providing signals to guide sentence denoising and mention extraction without explicit annotations. The intuition behind this model is as follows: if a high-quality sentence is selected, it will facilitate relation mention extraction, and in return, the extraction performance will signify the fitness of sentence selection.

Our model works as follows: at the top level, the agent decides whether a sentence should be selected from a sentence bag (a bag contains sentences labeled with the same relation); once the agent selects a sentence, it enters a low-level RL process for mention extraction. When the low-level process completes its task, the agent returns to the top-level process and continues to tackle the next sentence in the bag. Since we have no explicit annotations on either sentences (whether a sentence truly describes a relation) or words (which words are a relation mention), the problem can be formulated as a natural sequential decision problem, and policy learning in the high-level and low-level processes is guided by delayed rewards (the likelihood of relation classification), which is a weak, indirect supervision signal.

Our contributions are as follows:
• We study the task of relation mention extraction in new settings: from noisy sentences and with only weak supervision; that is, there are no explicit annotations on sentences or mention words.
• We propose a novel hierarchical reinforcement learning model which consists of a top-level sentence selector for removing noisy sentences, a low-level extractor for extracting relation mentions, and a reward estimator for offering supervision signals to guide data denoising and mention extraction.
Related Work
We deal with relation mention extraction in this paper. As closely related tasks, named entity recognition (NER) and relation extraction (RE) have attracted considerable research effort recently. NER locates entity mentions in plain text (Borthwick et al. 1998; Chiu and Nichols 2016; Xu, Jiang, and Watcharawittayakul 2017; Katiyar and Cardie 2017). As entity mentions are less diverse and it is easier to access high-quality labels for NER, this task is usually formulated as a fully supervised problem (e.g., sequence labeling). The goal of RE is to extract semantic relations between two given entities. Many researchers have explored models based on handcrafted features (Mooney and Bunescu 2005; Zhou et al. 2005) or deep neural networks (Socher et al. 2012; Zeng et al. 2014; dos Santos, Xiang, and Zhou 2015; Lin, Liu, and Sun 2017).

The most relevant line of work is Open IE (Banko et al. 2007; Wu and Weld 2010; Hoffmann et al. 2011; Angeli, Premkumar, and Manning 2015), which extracts triples that contain two entities and a relation mention. However, there is no need to normalize different expressions to a canonical relation in Open IE systems.

There is also a large body of work on sentence-level relation classification, which predicts whether a sentence describes a relation but without specifying a token span as mention (Riedel, Yao, and McCallum 2010; Hoffmann et al. 2011; Li and Ji 2014; Miwa and Bansal 2016; Ren et al. 2017; Zheng et al. 2017). Wang et al. (2016) and Huang et al. (2016) adopted attention mechanisms to highlight some words in a sentence as the clues of a relation. However, such methods can only detect separate words and do not consider the dependency between words.

Some works (Feng et al. 2018; Zeng et al. 2018) use reinforcement learning for relation extraction from noisy data. However, they target relation classification rather than mention extraction.
Our work is inspired by Feng et al. (2018), where an instance selector was used to remove noisy sentences. However, their supervision signal for sentence selection is sparse, as there is only a delayed reward available after all selections in a bag are completed. By contrast, our model is more direct: the top-level sentence selector receives an intermediate reward from the low-level mention extractor after each selection, and thus obtains immediate feedback to guide policy learning.
Methodology
Problem Definition
We formulate the task of relation mention extraction from noisy data as follows: given a relation r and a sequence of <sentence, relation> pairs X = {(x_1, r), (x_2, r), ..., (x_n, r)}, the goal is to extract a set of representative phrases for relation r. Each x_i is a sentence associated with two entities (h, t) and a noisy relation label r, produced by distant supervision (Mintz et al. 2009). In other words, a sentence x_i may not express relation r at all.

The challenges for relation mention extraction come from: 1) the relation labels are noisy, and 2) there is no word-level mention annotation.

Overview
As illustrated in Figure 2, the process of relation mention extraction works as follows: the agent first decides whether a sentence expresses a given relation; if so, it scans the words in the sentence one by one to identify the mention words; otherwise, the agent directly skips the current sentence. The agent continues to tackle the next sentence until all the sentences for the same entity pair are handled. This process can be naturally formulated as a semi-Markov decision process. We thus address the task in the framework of hierarchical reinforcement learning (Sutton, Precup, and Singh 1999; Dietterich 2000). The hierarchical reinforcement learning process has two tasks: a top-level RL task that takes options for data denoising, deciding whether a sentence should be selected; and a low-level RL task that takes primitive actions for mention extraction, deciding which words are part of a relation mention.

Figure 2: The hierarchical decision-making process to extract mentions for the relation "place_of_birth", over the example bag {"Obama was born in the United States", "Obama was the president of the United States", "The birthplace of Obama was the United States"}. Blue circles denote selected sentences (white circles denote unselected sentences), and green squares indicate mention words (white squares denote non-mention words). Words in red are mention words.

As shown in Figure 3, our model consists of three components: a top-level sentence selector, a low-level mention extractor, and a reward estimator. The sentence selector scans the sentences in a bag and takes options (top-level actions) to determine whether a sentence describes a relation. The mention extractor performs a sequential scan on a selected sentence and takes actions on whether a particular word in the sentence is part of a relation mention. As there is no explicit supervision for either the selector or the extractor, we pretrain a relation classifier as the reward estimator to guide policy learning in the two modules.
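As a concrete illustration, the decision process above can be sketched in Python. The policy and reward functions here are placeholder callables (assumptions for illustration), standing in for the learned components defined in the following sections:

```python
import random

def hrl_episode(bag, select_prob, extract_prob, mention_reward, rng=random.Random(0)):
    """One episode of the hierarchical process over a sentence bag.

    select_prob(sentence)     -> probability the selector picks the sentence (cf. Eq. 4)
    extract_prob(sentence, j) -> probability word j is a mention word (cf. Eq. 6)
    mention_reward(sentence, mention) -> delayed low-level reward (cf. Eq. 7)
    All three are placeholder callables standing in for the learned components.
    """
    mentions, selector_rewards = [], []
    for sentence in bag:                             # top-level scan over the bag
        if rng.random() < select_prob(sentence):     # option g_t = 1: select
            mention = [w for j, w in enumerate(sentence)
                       if rng.random() < extract_prob(sentence, j)]  # low-level actions
            mentions.append(mention)
            selector_rewards.append(mention_reward(sentence, mention))
        else:                                        # option g_t = 0: skip, reward 0
            selector_rewards.append(0.0)
    return mentions, selector_rewards

# toy usage with constant placeholder policies and a trivial reward
bag = [["obama", "was", "born", "in", "hawaii"],
       ["obama", "went", "to", "hawaii"]]
mentions, rewards = hrl_episode(bag, lambda s: 0.9, lambda s, j: 0.5,
                                lambda s, m: len(m) / len(s))
```

The nesting mirrors the semi-Markov structure: a top-level option either terminates immediately (skip) or spawns a full low-level episode before control returns to the selector.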
Reward Estimator
We adopt a CNN classifier to offer supervision signals that help estimate the rewards for the sentence selector and the mention extractor. The supervision signal is measured by the likelihood P(r|x; Φ) of relation classification for a given sentence x. Following (Feng et al. 2018), the CNN network has an input layer, a convolution layer, a max-pooling layer, and a non-linear layer whose representation is used for relation classification.

Figure 3: The hierarchical reinforcement learning model. The sentence selector passes options, the mention extractor takes actions, and the reward estimator returns rewards to both.
CNN Structure.
The CNN structure can be briefly described as below:

L = CNN(x)   (1)

where x is the input vectors and L ∈ R^{d_s} is the result of the max-pooling layer. In this structure, there is a convolution layer and a max-pooling layer. The convolution operation is performed on 3 consecutive words, and the number of feature maps d_s is set the same as in (Lin et al. 2016). Hence, the convolution parameters are W_f ∈ R^{d_s × 3d} and b_f ∈ R^{d_s}.

Then, the relation classifier estimates P(r|x; Φ) as follows:

P(r|x; Φ) = softmax(W_r · tanh(L) + b_r)   (2)

where W_r ∈ R^{n_r × d_s} and b_r ∈ R^{n_r} are the parameters of the fully-connected layer, n_r is the total number of relations, and Φ = {W_f, b_f, W_r, b_r}.

This probability P(r|x; Φ) is used to estimate the rewards for the sentence selector and the mention extractor; see Eq. 5 and Eq. 7.

Loss function.
Given a training set X, cross-entropy is used as the loss function to train the CNN classifier:

J(Φ) = −(1/|X|) Σ_{i=1}^{|X|} log P(r|x_i; Φ)   (3)

Top-Level Sentence Selector
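As a reference point for the selector, which consumes the estimator's probabilities, the reward estimator of Eqs. 1–3 can be sketched in NumPy. All dimensions and weights below are toy values (the actual model uses pretrained word and position embeddings):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def reward_estimator(x, Wf, bf, Wr, br):
    """Eqs. 1-2: convolution over 3-word windows, max pooling, tanh, softmax."""
    T = x.shape[0]
    windows = np.stack([x[i:i + 3].ravel() for i in range(T - 2)])  # (T-2, 3d)
    conv = windows @ Wf.T + bf             # (T-2, d_s) feature maps
    L = conv.max(axis=0)                   # max pooling over positions -> (d_s,)
    return softmax(Wr @ np.tanh(L) + br)   # P(r | x; Phi)

def nll_loss(probs, labels):
    """Eq. 3: average negative log-likelihood over the training set."""
    return -np.mean([np.log(p[r]) for p, r in zip(probs, labels)])

# toy dimensions (assumptions): word dim d=4, d_s=6 feature maps, n_r=3 relations
rng = np.random.default_rng(0)
d, ds, nr = 4, 6, 3
Wf, bf = rng.standard_normal((ds, 3 * d)), rng.standard_normal(ds)
Wr, br = rng.standard_normal((nr, ds)), rng.standard_normal(nr)
p = reward_estimator(rng.standard_normal((5, d)), Wf, bf, Wr, br)
```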
The top-level sentence selector aims to select sentences that truly mention the given relation. A selected sentence is then passed to the low-level mention extractor for mention extraction. As we do not have explicit supervision for the sentence selector, we measure the utility of the selected sentences as a whole using a final reward. Thus, this RL process terminates when all the sentences have been scanned. In what follows, the state s_t^h, option g_t^h, and reward r_t^h at step t (corresponding to the t-th sentence) are introduced.

State.
The state s_t^h encodes information about the current sentence, the already selected sentences, the relation label, and the relation mentions extracted from the previously selected sentences:
1) the vector representation of the current sentence, obtained from the non-linear layer of the CNN classifier for relation classification;
2) the average of the sentence representations of the chosen sentences;
3) the one-hot representation of the given relation;
4) the representation of the extracted relation mentions, which is the average of the word vectors of all the mention words.

Option.
The option g_t^h ∈ {0, 1}, where 1 means the t-th sentence is selected. We sample the value of g_t^h from the policy:

μ(g_t^h | s_t^h; θ_h) = σ(W_h · s_t^h + b_h)   (4)

where σ(·) is the sigmoid function and θ_h = {W_h, b_h}.

Reward.
At each step t, if the sentence is selected, the sentence selector receives an intermediate reward r_t^h, which is the delayed reward received by the low-level mention extractor on the t-th sentence, as defined by Eq. 7; otherwise, the intermediate reward is set to 0.

In addition to the intermediate rewards, a final reward is computed to measure the utility of all the chosen sentences, once the top-level selector completes its scan over all the sentences for a given relation:

r_final^h = (1/|X̂|) Σ_{x_j ∈ X̂} log P(r|x_j)   (5)

where X̂ (⊆ X) contains the selected sentences and r is the given relation. P(r|x_j) is provided by the reward estimator; see Eq. 2.

Low-Level Mention Extractor
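Before detailing the extractor, note that the selector's computations (Eqs. 4–5) amount to the following sketch; the states, weights, and likelihoods are toy inputs chosen for illustration:

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sample_option(state, Wh, bh, rng):
    """Eq. 4: g_t ~ Bernoulli(sigmoid(Wh . s_t + bh))."""
    prob = sigmoid(sum(w * s for w, s in zip(Wh, state)) + bh)
    return int(rng.random() < prob)

def final_reward(selected_likelihoods):
    """Eq. 5: average log-likelihood P(r | x_j) over the selected sentences."""
    return sum(math.log(p) for p in selected_likelihoods) / len(selected_likelihoods)

rng = random.Random(0)
g = sample_option([0.2, -0.1], [1.0, 0.5], 0.0, rng)   # toy state and weights
r_final = final_reward([0.9, 0.7, 0.8])                # toy P(r|x_j) values
```

Because the likelihoods are below 1, the final reward is always negative; selecting sentences the estimator classifies confidently pushes it toward zero.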
Once the top-level sentence selector chooses a sentence x_t, the low-level mention extractor scans the words in x_t sequentially to identify relation mention words for the given relation r. At each step j, the mention extractor decides whether the j-th word is part of the relation mention. This low-level RL process terminates after the last word is scanned.

State.
The state s_j^l encodes information about the current word, the already chosen words in the sentence, and the relation:
1) the vector representation of the current word;
2) the representation of the chosen mention words, which is the average of the word embeddings of all the chosen words;
3) the one-hot representation of the relation.

Action.
The action a_j^l ∈ {0, 1}, where 1 means the j-th word is selected as a mention word. We sample a_j^l from the policy:

π(a_j^l | s_j^l; θ_l) = σ(W_l · s_j^l + b_l)   (6)

where σ(·) is the sigmoid function and θ_l = {W_l, b_l}.

Reward.
As there is no annotation on which words constitute a relation mention, we design a delayed reward that measures the adequacy of the extracted mention words once all the words in sentence x_t have been scanned. The delayed reward consists of three terms: the word discriminability, the continuity of the relation mention, and the distance to the two entities.

Formally, suppose a mention m_t = w_{k_1}, w_{k_2}, ..., w_{k_L} is extracted from sentence x_t, where k_j (1 ≤ j ≤ L) is a word index in x_t and L is the number of words in the extracted mention. We denote the indices of the two entities as k_{e1} and k_{e2}, respectively.

The delayed reward is defined as:

r_final^l(x_t) = [P(r|x_t) − P(r|x'_t)] / P(r|x_t) − λ_1 (k_L − k_1)/L − λ_2 Σ_q (|k_q − k_{e1}| + |k_q − k_{e2}|)/L   (7)

where:
1) the first term is the word discriminability, which measures how well m_t can distinguish the relation; P(r|x_t), defined by Eq. 2 in the reward estimator, is the classification likelihood of sentence x_t, and x'_t is the sentence obtained by removing m_t from x_t;
2) the second term is the continuity reward, which encourages the extraction of a consecutive token span to a certain extent;
3) the third term is the distance reward, which encourages mention words to be close to the two entities.

The three rewards are soft constraints for mention extraction. For instance, the continuity reward encourages the extraction of consecutive words, but the model may still extract non-consecutive words as a mention. λ_1 and λ_2 are hyper-parameters that balance the three factors.

Training Objective and Optimization
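As a concrete reference for the optimization that follows, Eq. 7 can be transcribed directly; the normalization of the continuity and distance terms by L follows the reconstruction above and should be treated as an assumption:

```python
def mention_reward(p_full, p_ablated, indices, e1, e2, lam1=1.0, lam2=0.05):
    """Eq. 7: discriminability minus continuity and entity-distance penalties.

    p_full    = P(r | x_t); p_ablated = P(r | x_t with the mention removed)
    indices   = sorted word indices k_1..k_L of the extracted mention
    e1, e2    = word indices of the two entities
    """
    L = len(indices)
    discriminability = (p_full - p_ablated) / p_full
    continuity = (indices[-1] - indices[0]) / L          # span width penalty
    distance = sum(abs(k - e1) + abs(k - e2) for k in indices) / L
    return discriminability - lam1 * continuity - lam2 * distance

# toy case: mention "was born in" at positions 3-5, entities at positions 0 and 6
r = mention_reward(p_full=0.8, p_ablated=0.2, indices=[3, 4, 5], e1=0, e2=6)
```

Removing the mention drops the classification likelihood sharply (0.8 → 0.2), so the discriminability term is large; the penalties then trade that off against span width and entity distance.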
For the sentence selector, we aim to maximize the expected future cumulative rewards:

J(θ_h) = E_{g_t^h ∼ μ(g_t^h | s_t^h; θ_h)} [R(g_t^h)]   (8)

where R(g_t^h) is the future cumulative reward from state s_t^h. To compute R(g_t^h), we sample trajectories according to the current policy. Taking one trajectory (s_1^h, g_1^h, ..., s_n^h, g_n^h) as an example (n is the number of sentences in the top-level process),

R(g_t^h) = r_final^h + Σ_{k=t}^{n} γ^{k−t} r_final^l(x_k).

Note that the rewards received by the low-level mention extractor, r_final^l(x_k), are passed to the selector, providing feedback on how well sentence selection works.

Similarly, the mention extractor maximizes the expected cumulative rewards:

J(θ_l) = E_{a_t^l ∼ π(a_t^l | s_t^l; θ_l)} [R(a_t^l)]   (9)

where R(a_t^l) = r_final^l, since the mention extractor has no intermediate rewards but only a delayed final reward.

According to the policy gradient theorem (Sutton et al. 1999) and the REINFORCE algorithm (Williams 1992), we compute the gradient of the top-level sentence selector policy as:

∇_{θ_h} J(θ_h) = E_{g_t^h ∼ μ(g_t^h | s_t^h; θ_h)} [R(g_t^h) · ∇_{θ_h} log μ(g_t^h | s_t^h; θ_h)]   (10)

The policy gradient of the low-level mention extractor yields:

∇_{θ_l} J(θ_l) = E_{a_t^l ∼ π(a_t^l | s_t^l; θ_l)} [R(a_t^l) · ∇_{θ_l} log π(a_t^l | s_t^l; θ_l)]   (11)

For model learning, we first use all the sentences to pretrain a CNN classifier as the reward estimator and to pretrain the low-level mention extractor, according to Eq. 3 and Eq. 9 respectively. After that, with the rewards provided by the CNN classifier (parameters fixed), we train the hierarchical RL model. The details of the learning procedure are given in Algorithm 1.

Algorithm 1: Training Process of Hierarchical Reinforcement Learning
Input: Training data T, where each relation r has a sentence bag X_r.
foreach pair (r, X_r) ∈ T do
    foreach sentence x_t ∈ X_r do
        Sample an option for the selector: g_t^h ∼ μ(g_t^h | s_t^h; θ_h), see Eq. 4;
        r_t^h = 0;
        if g_t^h = 1 then
            Sample actions for the extractor on sentence x_t with θ_l: {a_1^l, ..., a_m^l}, a_j^l ∼ π(a_j^l | s_j^l), see Eq. 6;
            Obtain the final reward of the extractor, r_final^l(x_t), see Eq. 7;
            Update the parameter θ_l;
            Set the intermediate reward of the selector: r_t^h = r_final^l(x_t);
        end
    end
    Obtain the final reward of the selector, r_final^h, see Eq. 5;
    Update the parameter θ_h;
end

Relation Mention Ranking
Note that our goal is to extract a set of representative phrases for a relation. Since our model extracts a mention from each selected sentence, we need to rank the extracted mentions at the corpus level to construct a high-quality mention resource. Formally, an extracted mention m_i for a relation r is ranked by the score below, similar to (Angeli, Premkumar, and Manning 2015):

P(m_i | r) · P(r | m_i)   (12)

where P(m_i | r) = n(m_i, r)/n(r) and P(r | m_i) = n(m_i, r)/n(m_i). Here n(m_i, r) is the number of times mention m_i is extracted for relation r, n(r) is the number of sentences labeled with relation r, and n(m_i) is the number of times mention m_i is extracted from all the selected sentences. Finally, we select the top N mentions for each relation to construct the mention resource.

Method       Clean Data   Noisy Data
StanfordIE   0.30         0.11
ATT          0.27         0.02
N-gram       0.38         0.24
Single RL    –            –

Table 1: Sentence-level extraction accuracy for relation mention. Note that HRL is the same as Single RL on the clean data, and that StanfordIE is an unsupervised method.

Experiments
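Before turning to the experiments, note that the ranking of Eq. 12 reduces to simple counting. In the sketch below, n(r) is approximated by the number of extractions for relation r (an assumption; the paper counts labeled sentences), and the (mention, relation) pairs are hypothetical:

```python
from collections import Counter

def rank_mentions(extractions, top_n=10):
    """Eq. 12: score(m, r) = P(m|r) * P(r|m) = n(m,r)^2 / (n(r) * n(m))."""
    n_mr = Counter(extractions)                  # n(m, r)
    n_r = Counter(r for _, r in extractions)     # n(r), approximated by extraction counts
    n_m = Counter(m for m, _ in extractions)     # n(m)
    ranked = {}
    for (m, r), c in n_mr.items():
        ranked.setdefault(r, []).append((c * c / (n_r[r] * n_m[m]), m))
    return {r: [m for _, m in sorted(scores, reverse=True)[:top_n]]
            for r, scores in ranked.items()}

# hypothetical extractions from the selected sentences
ext = [("was born in", "place_of_birth")] * 3 + \
      [("hails from", "place_of_birth")] * 2 + \
      [("went to", "place_of_birth"), ("went to", "travel_to")]
top = rank_mentions(ext)
```

The product of the two conditionals penalizes both rare mentions (low P(m|r)) and ambiguous ones shared across relations (low P(r|m)), so "went to" ranks below the genuinely indicative phrases.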
Experimental Setup
Data Preparation
We evaluated our model on a cleandataset and a noisy dataset, respectively.
Clean dataset.
The clean dataset is adopted from SemEval-2010 (Hendrickx et al. 2009), which contains 10,717 sentences and 9 distinct relations. We took 8,000 sentences for training and the remainder for test.
Noisy dataset.
To validate the performance of mention extraction from noisy data, we adopted a widely used dataset from (Riedel, Yao, and McCallum 2010), available at http://iesl.cs.umass.edu/riedel/ecml/. This dataset contains 522,611 sentences, 281,270 entity pairs, and 18,252 relational facts in the training set; and 172,448 sentences, 96,678 entity pairs, and 1,950 relational facts in the test set. There are 39,528 unique entities and 53 unique relations. This dataset consists of noisy sentences which may not describe a fact at all.

Baselines

OpenIE (Angeli, Premkumar, and Manning 2015; Mausam et al. 2012). OpenIE systems are the most relevant to our work; they extract a triple that contains two entity mentions and a relation mention. As aforementioned, OpenIE systems do not normalize different expressions to a canonical relation. Thus, we mapped the extracted mentions to a relation following the algorithm described in (Angeli, Premkumar, and Manning 2015), which is trained on our training data. In our experiments, we use Stanford OpenIE (Angeli, Premkumar, and Manning 2015) as the baseline.
ATT (Huang and others 2016). ATT adopts word-level attention over the words in a sentence and assigns each word an attention weight. We selected the word with the largest weight as the relation mention.
Single RL.
This model only adopts the low-level mention extractor and ignores the top-level sentence selector. We compared this model with our HRL model on the noisy dataset. On the clean dataset, HRL is unnecessary since there are no noisy sentences.
N-gram.
To show the necessity of adopting reinforcement learning, we devised a new model named N-gram as our baseline, which searches over all n-grams in a sentence and chooses as the mention the one that provides the maximal reward. The reward is the same as the final reward of the low-level mention extractor (see Eq. 7).

Example-I: the Entity-Origin relation between name and address.
Sentence: The headquarters of the operation were at Berlin and the code [name]e1 for the program was derived from that [address]e2.
Output: ATT: derived | StanfordIE: N/A | HRL: derived from

Example-II: the Product-Producer relation between philosopher and writings.
Sentence: Andronicus wrote a work, the fifth book of which contained a complete list of the [philosopher]e1's [writings]e2.
Output: ATT: wrote | StanfordIE: of | HRL: 's

Table 2: Examples of the extracted mentions by ATT, StanfordIE, and our model. N/A means StanfordIE did not extract any word.

Parameter Settings
The parameters of our model differ on the clean and noisy datasets. For the clean dataset, we set the hyper-parameters λ_1 = 1 and λ_2 = 0.05; for the noisy dataset, λ_1 = 0.4 and λ_2 = 0.02. The learning rates and numbers of training episodes were set separately for the pretraining of the mention detector and for the training of HRL. The reward discount factor is γ = 0.999 on both datasets.

For the CNN classifier in the reward estimator, the word embedding dimension is d_w = 50 and the position embedding dimension is d_p = 5. The window size of the convolution layer is l = 3. The learning rate is α = 0.02, and the number of training episodes is L = 25. We employed a dropout strategy.

Quality of Extracted Relation Mentions
We evaluated the quality of the extracted relation mentions with two metrics. At the sentence level, accuracy is assessed by manually checking whether the phrase extracted from a sentence is indeed representative of the given relation r. At the mention level, Precision@K is assessed by ranking the extracted mentions according to their representative ability (see Eq. 12).

Sentence-level Evaluation
We sampled sentences from the clean and noisy datasets, respectively, and manually annotated the relation mentions for each sentence. As different baseline models extract relation mentions of different granularities, we annotated multiple relation mentions per sentence for a fair comparison, and we guaranteed that all the annotations are representative of the given relation. For instance, for the sentence "Muscle fatigue is the number one cause of arm muscle pain." with the relation label "Cause-Effect", the mention annotations are "is the number one cause of", "is the cause of", "the cause of", and "cause".

We then compared the extracted mentions with these manual annotations for each sentence to evaluate the extraction performance; thus, this is a sentence-level evaluation. The results shown in Table 1 reveal the following observations:

Method       P@1    P@2    P@5    P@10
StanfordIE   0.88   0.82   0.72   0.61
ATT          0.67   0.74   0.61   0.44
N-gram       0.83   0.75   0.67   0.56
Single RL    –      –      –      –

Table 3: Average Precision@K of the extracted mentions from the clean data (mention-level).
First, our proposed models (Single RL and HRL) outperform the baselines on both clean and noisy data. Compared to our model, ATT has two drawbacks: the word with the largest attention weight may not be a mention word, and it cannot identify a consecutive token span as a mention. As for StanfordIE, it failed to extract fact triples and thus did not extract any relation mention in many cases.
Second, HRL outperforms the baselines substantially on the noisy data, demonstrating the effectiveness of data denoising by the sentence selector. By contrast, StanfordIE, ATT, N-gram, and Single RL all suffer remarkably from the noisy data due to their inability to exclude noisy sentences. We also note that ATT drops much more than the other baselines on noisy data. Our investigation into the results shows that ATT is sensitive to sentence length: the longer the sentence, the more difficult it is for ATT to locate the correct relation mention words, and the average sentence length in the noisy data is much longer than that in the clean data.

Third, Single RL outperforms N-gram on both clean and noisy data, showing that our RL strategy is reasonable and effective.

We further present some exemplar mentions extracted by the models in Table 2. Interestingly, our model can not only identify typical phrases like "derived from", but also discover less typical representative words such as "'s". StanfordIE sometimes failed to extract any word or extracted undesirable results, while ATT is prone to producing wrong attention.
Mention-level Evaluation
We conducted a mention-level evaluation to assess the quality of the extracted mentions at the corpus level. For each relation, we chose the top 10 representative mentions, ranked by Eq. 12, and adopted Precision@K as the performance metric.

The results on the clean data and the noisy data are presented in Table 3 and Table 4, respectively. On the clean data, the top 5 mentions extracted by our model achieve a precision of more than 0.8, significantly higher than those obtained by StanfordIE and ATT (Table 3). On the noisy data, P@10 drops remarkably for all the methods, but HRL performs much better than the baselines. Moreover, HRL outperforms Single RL remarkably. All this evidence supports that the sentence selector effectively excludes noisy sentences (Table 4).

A concrete example of the ranked mentions is presented in Table 7. It shows that the extracted phrases are representative and meaningful. We also show the top 10 relation mentions for some relations in a supplementary file.

Method       P@1    P@2    P@5    P@10
StanfordIE   0.38   0.50   0.40   0.33
ATT          0.15   0.15   0.20   0.17
N-gram       0.38   0.38   0.42   0.37
Single RL    0.46   0.38   0.40   0.28
HRL          –      –      –      –

Table 4: Average Precision@K of the relation mentions extracted from the noisy data (mention-level).

Relation          Exemplar phrases
Cause-Effect      triggers, caused by, lead to, generated by, instigates
Product-Producer  hand-made by, co-founded by, makes, created by
Founder           founder of, chief executive of, managing director at, chairman of
Children          son of, daughter, father, son of minister

Table 5: Exemplar mention phrases for some sampled relations.
Utility of Extracted Relation Mentions
We evaluated whether the extracted mentions can facilitate downstream applications such as relation classification, on both clean and noisy data.

We first evaluated, on the clean data, how the extracted mentions can benefit relation classification as an additional feature. More specifically, we constructed a binary vector whose i-th dimension represents whether at least one extracted mention of the i-th relation occurs in a given sentence. The dimension of the binary vector equals the number of relations: for each sentence, if it contains a mention of a relation, the corresponding dimension is set to 1, and otherwise to 0.

To fully check the effectiveness of the extracted relation mentions, we used the binary vector in two ways. First, we directly used the binary vector for relation classification with a logistic regression classifier. Second, we used the binary vector along with a CNN classifier: we concatenated the binary vector with the output of the pooling layer of a CNN structure and fed the concatenated vector into a fully-connected layer for relation classification.

We compared the mention features generated by Open IE, ATT, RL, and HRL, respectively.

Mention features   CNN     Regression
StanfordIE         81.74   27.52
Ollie              81.28   18.32
ATT                81.48   20.61
N-gram             81.57   36.97
Single RL          –       –

Table 6: Macro F1 of relation classification on the clean data.

The results on the clean data are shown in Table 6. They demonstrate that the relation mentions from our model obtain better performance than those from the baseline models. In the CNN classifier, the mention features are used only as an additional feature, which may explain why only a slight improvement is observed.

Due to the page limit, we provide a supplementary file showing the experimental results on noisy data. We believe that the extracted mentions would also be beneficial for question answering, language inference, and more, which will be validated in future work.

Discussions
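The binary mention-feature vector described above can be constructed as follows; the relation inventory and mention lists are hypothetical examples:

```python
def mention_feature_vector(sentence, mentions_by_relation, relations):
    """One dimension per relation: 1 iff the sentence contains at least one
    extracted mention of that relation, else 0."""
    return [int(any(m in sentence for m in mentions_by_relation.get(r, ())))
            for r in relations]

relations = ["place_of_birth", "founder"]            # hypothetical inventory
mentions = {"place_of_birth": ["was born in", "hails from"],
            "founder": ["founder of", "co-founded"]}
v = mention_feature_vector("obama was born in hawaii", mentions, relations)
```

The resulting vector can be fed to a logistic regression classifier directly, or concatenated with a CNN's pooled representation as described above.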
Our model is advantageous in extracting relation mentions that are expressed explicitly by words. However, some relations are expressed implicitly, or sometimes semantic reasoning is needed to derive a relation.

The first example demonstrates an implicit relation mention: the sentence "I spent a year working for a [software]e1 [company]e2 to pay off my college loans." is labeled with the Product-Producer relation, which requires the knowledge that a software company sells software (note, however, that the Apple company does not produce apples).

The second example shows that relation mention detection sometimes requires semantic reasoning: the sentence "You'll get an instant overview of [Tallahassee]e1, which was chosen as [Florida]e2's capital for only one reason ..." is marked with the Contains relation, which needs to be inferred from the capital relationship. The third example, "[Nicola Sturgeon]e1, the newly elected first minister of [Scotland]e2, expressed concern that ...", labeled with the Nationality relation, also requires semantic reasoning from minister to derive the desired relation.

Our model has limitations in these cases, and we leave them as future work.
Conclusion
In this paper, we present a hierarchical reinforcement learning model for extracting relation mentions from noisy data. The model consists of a sentence selector to exclude noisy sentences, a mention extractor to identify mention words in a selected sentence, and a reward estimator to guide the policy learning of the selector and the extractor. The model learns from large-scale noisy data without explicit annotations at either the sentence level (whether a sentence truly describes a relation) or the word level (which words form a relation mention). Experiments show that our model outperforms the state-of-the-art baselines.

References

[Angeli, Premkumar, and Manning 2015] Angeli, G.; Premkumar, M. J. J.; and Manning, C. D. 2015. Leveraging linguistic structure for open domain information extraction. In ACL, volume 1, 344–354.
[Banko et al. 2007] Banko, M.; Cafarella, M. J.; Soderland, S.; Broadhead, M.; and Etzioni, O. 2007. Open information extraction from the web. In IJCAI, volume 7, 2670–2676.
[Borthwick et al. 1998] Borthwick, A.; Sterling, J.; Agichtein, E.; and Grishman, R. 1998. Exploiting diverse knowledge sources via maximum entropy in named entity recognition. In Sixth Workshop on Very Large Corpora.
[Bunescu and Mooney 2005] Bunescu, R. C., and Mooney, R. J. 2005. A shortest path dependency kernel for relation extraction. In EMNLP, 724–731.
[Chiu and Nichols 2016] Chiu, J. P., and Nichols, E. 2016. Named entity recognition with bidirectional LSTM-CNNs. TACL.
J. Artif. Intell. Res. (JAIR)
ACL, 626–634.
[Fader, Soderland, and Etzioni 2011] Fader, A.; Soderland, S.; and Etzioni, O. 2011. Identifying relations for open information extraction. In EMNLP, 1535–1545.
[Feng et al. 2018] Feng, J.; Huang, M.; Zhao, L.; Yang, Y.; and Zhu, X. 2018. Reinforcement learning for relation classification from noisy data. AAAI.
[Hendrickx et al. 2009] Hendrickx, I.; Kim, S. N.; Kozareva, Z.; Nakov, P.; Ó Séaghdha, D.; Padó, S.; Pennacchiotti, M.; Romano, L.; and Szpakowicz, S. 2009. SemEval-2010 task 8: Multi-way classification of semantic relations between pairs of nominals. In Proceedings of the Workshop on Semantic Evaluations: Recent Achievements and Future Directions, 94–99.
[Hoffmann et al. 2011] Hoffmann, R.; Zhang, C.; Ling, X.; Zettlemoyer, L.; and Weld, D. S. 2011. Knowledge-based weak supervision for information extraction of overlapping relations. In ACL, 541–550.
[Huang and others 2016] Huang, X., et al. 2016. Attention-based convolutional neural network for semantic relation extraction. In COLING, 2526–2536.
[Katiyar and Cardie 2017] Katiyar, A., and Cardie, C. 2017. Going out on a limb: Joint extraction of entity mentions and relations without dependency trees. In ACL, volume 1, 917–928.
[Li and Ji 2014] Li, Q., and Ji, H. 2014. Incremental joint extraction of entity mentions and relations. In ACL, volume 1, 402–412.
[Lin et al. 2016] Lin, Y.; Shen, S.; Liu, Z.; Luan, H.; and Sun, M. 2016. Neural relation extraction with selective attention over instances. In ACL, volume 1, 2124–2133.
[Lin, Liu, and Sun 2017] Lin, Y.; Liu, Z.; and Sun, M. 2017. Neural relation extraction with multi-lingual attention. In ACL, volume 1, 34–43.
[Mausam et al. 2012] Mausam; Schmitz, M.; Bart, R.; Soderland, S.; and Etzioni, O. 2012. Open language learning for information extraction. In EMNLP, 523–534.
[Mintz et al. 2009] Mintz, M.; Bills, S.; Snow, R.; and Jurafsky, D. 2009. Distant supervision for relation extraction without labeled data. In ACL-IJCNLP, 1003–1011.
[Miwa and Bansal 2016] Miwa, M., and Bansal, M. 2016. End-to-end relation extraction using LSTMs on sequences and tree structures. In ACL, 1105–1116.
[Mooney and Bunescu 2005] Mooney, R. J., and Bunescu, R. C. 2005. Subsequence kernels for relation extraction. In NIPS, 171–178.
[Ren et al. 2017] Ren, X.; Wu, Z.; He, W.; Qu, M.; Voss, C. R.; Ji, H.; Abdelzaher, T. F.; and Han, J. 2017. CoType: Joint extraction of typed entities and relations with knowledge bases. In WWW, 1015–1024.
[Riedel, Yao, and McCallum 2010] Riedel, S.; Yao, L.; and McCallum, A. 2010. Modeling relations and their mentions without labeled text. In ECML-PKDD, 148–163. Springer.
[Socher et al. 2012] Socher, R.; Huval, B.; Manning, C. D.; and Ng, A. Y. 2012. Semantic compositionality through recursive matrix-vector spaces. In EMNLP-CoNLL, 1201–1211.
[Sutton et al. 1999] Sutton, R. S.; McAllester, D.; Singh, S.; and Mansour, Y. 1999. Policy gradient methods for reinforcement learning with function approximation. In NIPS.
[Sutton, Precup, and Singh 1999] Sutton, R. S.; Precup, D.; and Singh, S. 1999. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence.
ACL, 721–729. ACL.
[Wang et al. 2016] Wang, L.; Cao, Z.; de Melo, G.; and Liu, Z. 2016. Relation classification via multi-level attention CNNs. In ACL, volume 1, 1298–1307.
[Williams 1992] Williams, R. J. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning.
ACL, 118–127.
[Xu, Jiang, and Watcharawittayakul 2017] Xu, M.; Jiang, H.; and Watcharawittayakul, S. 2017. A local detection approach for named entity recognition and mention detection. In ACL, volume 1, 1237–1247.
[Zeng et al. 2014] Zeng, D.; Liu, K.; Lai, S.; Zhou, G.; and Zhao, J. 2014. Relation classification via convolutional deep neural network. In COLING, 2335–2344.
[Zeng et al. 2015] Zeng, D.; Liu, K.; Chen, Y.; and Zhao, J. 2015. Distant supervision for relation extraction via piecewise convolutional neural networks. In EMNLP, 1753–1762.
[Zeng et al. 2018] Zeng, X.; He, S.; Liu, K.; and Zhao, J. 2018. Large scaled relation extraction with reinforcement learning. AAAI.
[Zheng et al. 2017] Zheng, S.; Wang, F.; Bao, H.; Hao, Y.; Zhou, P.; and Xu, B. 2017. Joint extraction of entities and relations based on a novel tagging scheme. In ACL, volume 1, 1227–1236.
[Zhou et al. 2005] Zhou, G.; Su, J.; Zhang, J.; and Zhang, M. 2005. Exploring various knowledge in relation extraction. In ACL, 427–434.
Supplementary Materials
Relation Mention Rankings
We present the top 10 relation mentions for some relations in Table 7. Although most of the phrases are representative and meaningful, some of them lack semantic meaning on their own, such as "by", "at", and "with". To interpret these cases, we need to put them back into their sentence context. In the 8th relation mention "UNK chief executive of" for the relation "Person-Company", "UNK" indicates the company name.
Utility of Extracted Relation Mentions
We evaluated whether the extracted mentions can facilitate downstream applications such as relation classification, on both clean and noisy data. The results on the clean data are shown in the main paper; here we show the results on the noisy data.
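The binary mention feature used in these experiments can be sketched as follows. The lexicon contents and function names below are illustrative assumptions, not the paper's actual extracted mentions:

```python
# Hypothetical mention lexicon: relation index -> extracted mention phrases.
MENTION_LEXICON = {
    0: ["was born in", "birth place of"],     # e.g. place_of_birth
    1: ["founder of", "chief executive of"],  # e.g. founder
}

def mention_feature_vector(sentence, lexicon, num_relations):
    """Dimension i is 1 iff at least one extracted mention of the
    i-th relation occurs in the sentence, otherwise 0."""
    vec = [0] * num_relations
    for rel, mentions in lexicon.items():
        if any(m in sentence for m in mentions):
            vec[rel] = 1
    return vec

feat = mention_feature_vector("Barack Obama was born in Hawaii.",
                              MENTION_LEXICON, num_relations=2)
print(feat)  # prints [1, 0]
```

For the CNN variant, this vector would be concatenated with the pooling-layer output before the fully-connected classification layer; for the logistic-regression variant, it is the sole input feature.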
Experiment on Noisy Data
Similar to the experiments on the clean data, we generated the binary vector for each sentence on the noisy data, concatenated it with the output of the pooling layer of a CNN, and fed the new vector into a fully-connected layer for relation classification.

As there is no manual annotation on the noisy data, we evaluated the results under the held-out evaluation configuration, which provides an approximate measure of relation extraction without expensive human labor.

We compared different mention features generated by HRL and the baseline models. We divided the baseline models into two groups. The first group consists of previously existing models, StanfordIE and ATT. The second group consists of simplified versions of HRL, Single RL and N-gram.

Figure 4 and Figure 5 show the results on the noisy data. Figure 4 shows that our HRL model outperforms the existing mention extraction models. Figure 5 shows that HRL outperforms Single RL, and Single RL outperforms N-gram. This demonstrates the necessity of removing noisy sentences for relation mention extraction.
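Held-out evaluation ranks predictions by confidence and compares them against facts already present in the knowledge base, tracing out precision-recall points as the confidence threshold is lowered. A minimal sketch under our own naming and toy data, not the paper's actual evaluation code:

```python
def held_out_pr_points(scored_preds, kb_facts):
    """scored_preds: list of (entity_pair, relation, score) triples.
    kb_facts: set of (entity_pair, relation) pairs known in the KB.
    Returns (precision, recall) points as the threshold is lowered."""
    ranked = sorted(scored_preds, key=lambda p: -p[2])
    points, correct = [], 0
    for i, (pair, rel, _) in enumerate(ranked, start=1):
        if (pair, rel) in kb_facts:
            correct += 1
        # Precision over the top-i predictions; recall against all KB facts.
        points.append((correct / i, correct / len(kb_facts)))
    return points

preds = [(("Obama", "Hawaii"), "place_of_birth", 0.9),
         (("Obama", "Hawaii"), "contains", 0.4)]
kb = {(("Obama", "Hawaii"), "place_of_birth")}
print(held_out_pr_points(preds, kb))  # prints [(1.0, 1.0), (0.5, 1.0)]
```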
[Precision-recall curves; legend: HRL, StanfordIE, ATT, None]
Figure 4: Comparison between HRL and the baselines.
[Precision-recall curves; legend: HRL, Single RL, N-gram]
Figure 5: Comparison between HRL and its simplified models.

Relation            Top 10 relation mentions