Denoising Distant Supervision for Relation Extraction via Instance-Level Adversarial Training
Xu Han, Zhiyuan Liu, Maosong Sun
Department of Computer Science and Technology, State Key Lab on Intelligent Technology and Systems, National Lab for Information Science and Technology, Tsinghua University, Beijing, China
Beijing Advanced Innovation Center for Imaging Technology, Capital Normal University, Beijing, China
Abstract
Existing neural relation extraction (NRE) models rely on distant supervision and suffer from wrong labeling problems. In this paper, we propose a novel adversarial training mechanism over instances for relation extraction to alleviate the noise issue. As compared with previous denoising methods, our proposed method can better discriminate those informative instances from noisy ones. Our method is also efficient and flexible to be applied to various NRE architectures. As shown in the experiments on a large-scale benchmark dataset in relation extraction, our denoising method can effectively filter out noisy instances and achieve significant improvements as compared with the state-of-the-art models.
1 Introduction

Relation extraction (RE) aims to extract relational facts from plain text via categorizing semantic relations between entities contained in text. For example, we can extract the fact (Mark Twain, PlaceOfBirth, Florida) from the sentence "Mark Twain was born in Florida". Many efforts have been devoted to RE, either early works based on handcrafted features (Zelenko et al., 2003; Mooney and Bunescu, 2006) or recent works based on neural networks (Zeng et al., 2014; Santos et al., 2015). These models all follow a supervised learning approach, which is effective, but the requirement for high-quality annotated data is a major bottleneck in practice.

It is time-consuming and human-intensive to manually annotate large-scale training data. Hence, Mintz et al. (2009) propose distant supervision to automatically generate training sentences via aligning KGs and text.

Figure 1: An example of distant supervision and adversarial denoising relation extraction. (The figure aligns the fact (Bob, PlaceOfBirth, New York) with sentences such as "[Bob] was born in [New York]", "[Bob] lived in [New York]", and "[Bob] traveled to [New York]", which are split into positive and negative instances.)

As shown in Figure 1, distant supervision assumes that if there is a relation between two entities in a KG, all sentences that contain the two entities will be labeled with that relation. Distant supervision is an effective approach to automatically obtain training data, but it inevitably suffers from wrong labeling problems.

To address the wrong labeling problem, Riedel et al. (2010) propose multi-instance learning (MIL), and Zeng et al. (2015) extend the idea of MIL to neural models. Lin et al. (2016) further propose a neural attention scheme over multiple instances to reduce the weights of noisy instances. These methods achieve significant improvements in RE but are still far from satisfactory. The reason is that most denoising methods simply calculate soft weights for each sentence in an unsupervised manner, which can only make a coarse-grained distinction between informative and noisy instances. Moreover, these methods cannot cope well with those entity pairs with insufficient sentences.

In order to better discriminate informative and noisy instances, inspired by the idea of adversarial learning (Goodfellow et al., 2014a), we apply adversarial training over instances to enhance RE performance. The idea of adversarial training was explored in relation extraction by generating adversarial examples with a perturbation added to sentence embeddings (Wu et al., 2017), which do not necessarily correspond to real-world sentences. On the contrary, we generate adversarial examples by sampling from existing training data, which may better locate real-world noise.

Our method contains two modules: a discriminator and a sampler, and the method will split the distantly supervised data into two parts, the confident part and the unconfident part.
The discriminator is applied to judge which sentences are more likely to be annotated correctly, with the confident data as positive instances and the unconfident data as negative instances. The sampler module is used to select the most confusing sentences from unconfident data to cheat the discriminator as much as possible. Moreover, during several training epochs, we also dynamically select the most informative and confident instances from the unconfident set to the confident set, so as to enrich the training instances for the discriminator.

The discriminator and the sampler are trained adversarially. As shown in Figure 1, during the training process, the actions of the sampler will admonish the discriminator to focus on improving those most confusing instances. Since noisy instances are ineffective to decrease the loss functions of both the sampler and the discriminator, the noise will be gradually filtered out during the adversarial training. Finally, the sampler can effectively distinguish those informative instances from the unconfident data, and the discriminator can well categorize relations between entities in text. As compared with the aforementioned MIL denoising methods, our method achieves more efficient noise detection in finer granularity.

We conduct experiments on a real-world dataset derived from the New York Times (NYT) corpus and Freebase. Experimental results demonstrate that our adversarial denoising method effectively reduces noise and significantly outperforms other baseline methods.

2 Related Work

Relation extraction is an important task in NLP, which aims to extract relational facts from text corpora. Many efforts are devoted to RE, especially in supervised RE, such as early kernel-based models (Zelenko et al., 2003; GuoDong et al., 2005; Mooney and Bunescu, 2006). Mintz et al. (2009) align plain text with KGs and propose a distantly supervised RE model, by assuming all sentences that mention two entities can describe their relations in KGs.

However, distant supervision inevitably comes with the wrong labeling problem. Riedel et al. (2010) and Hoffmann et al. (2011) apply the multi-instance learning (MIL) mechanism for RE, which considers the reliability of each instance and combines multiple sentences containing the same entity pair together to alleviate the noise problem.

In recent years, neural models (Zhang and Wang, 2015; Zeng et al., 2017; Miwa and Bansal, 2016) have been widely used in RE. These neural models are capable of accurately capturing textual relations without explicit linguistic analysis. Based on these neural architectures and the MIL mechanism, Lin et al. (2016) propose a sentence-level attention to reduce the influence of incorrectly labeled sentences. To summarize, these MIL models generally make soft weight adjustment for informative and noisy instances. Some works further adopt external information to improve denoising performance: Ji et al. (2017) incorporate external entity descriptions to enhance attention representations; Liu et al. (2017) manually set label confidences to denoise entity-pair level noises.

More sophisticated mechanisms, such as reinforcement learning (Feng et al., 2018; Zeng et al., 2018), have recently also been adapted to select positive sentences from noisy data. However, these complex mechanisms usually require much time to fine-tune, and their convergence is not yet well guaranteed in practice. In this paper, we propose a novel fine-grained denoising method for RE via adversarial training. The method is simple and effective to be applied in various neural architectures and to scale up to large-scale data.
Szegedy et al. (2013) propose to generate adversarial examples by adding noise in the form of small perturbations to the original data. These noise examples are often indistinguishable to humans but lead to models' wrong predictions.

Figure 2: The overall framework of the instance-level adversarial training model for relation extraction. The discriminator module is used to judge whether an instance is labeled correctly, and the instance will be considered coming from the confident set if the prediction is yes. The sampler module is used to select the most confusing instances from unconfident data to cheat the discriminator.

Goodfellow et al. (2014b) analyze adversarial examples and propose adversarial training for image classification tasks. Afterwards, Goodfellow et al. (2014a) propose a mature adversarial training framework and use the framework to train generative models.

Adversarial training has also been explored in NLP. Miyato et al. (2016) propose adversarial training for text classification by adding perturbations to word embeddings. The idea of perturbation addition has further been applied to other NLP tasks including language models (Xie et al., 2017) and relation extraction (Wu et al., 2017). Different from Wu et al. (2017), who generate pseudo adversarial examples by adding perturbations to instance embeddings, we perform adversarial training by sampling adversarial examples from real-world noisy data. The adversarial examples in our method can better correspond to the real-world scenario for RE. Hence our method is more favorable for solving the wrong labeling problem in distant supervision, which will be shown in the experiments.
3 Methodology

In this section, we introduce the details of our instance-level adversarial training model for denoising RE. For this model, we split the entire training data into two parts, the set of confident instances I_c and the set of unconfident instances I_u. A sentence encoder is adopted to represent sentence semantics with embeddings. The adversarial training framework consists of a sampler and a discriminator, corresponding to the noise filter and the relation classifier respectively.

3.1 Framework

As shown in Figure 2, the overall framework of our instance-level adversarial training model includes a discriminator D and a sampler S, in which S samples adversarial examples from the unconfident set I_u, and D learns to judge whether a given instance is from I_c or I_u.

We assume that each instance s ∈ I_c exposes implicit semantics of its labeled relation r_s. In contrast, those instances s ∈ I_u are not trusted to be labeled correctly during the adversarial training. Hence, we implement D as a function D(s, r_s) to judge whether a given instance s exposes implicit semantics of its labeled relation r_s: if yes, the instance comes from I_c; if no, the instance comes from I_u.

The training process is a min-max game and can be formalized as follows,

    φ = min_{p_u} max_D ( E_{s∼p_c}[log(D(s, r_s))] + E_{s∼p_u}[log(1 − D(s, r_s))] ),    (1)

where p_c is the confident data distribution, and the sampler S samples adversarial examples from the unconfident data according to the probability distribution p_u.

After sufficient training, S tends to sample those informative instances in I_u rather than those noisy instances, and D becomes a relation classifier of good robustness to noisy data. We will give the detailed introduction to the sampler in Section 3.2 and the discriminator in Section 3.3.
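As a concrete reading of Eq. 1, the sketch below (illustrative only; the function and variable names are ours, and the discriminator outputs are taken as given probabilities) evaluates the two expectation terms for fixed discriminator outputs:

```python
import math

def minmax_objective(conf_scores, unconf_scores, unconf_probs):
    """Value inside Eq. 1 for fixed discriminator outputs.

    conf_scores: D(s, r_s) for instances drawn uniformly from I_c.
    unconf_scores: D(s, r_s) for instances of I_u.
    unconf_probs: the sampler distribution p_u over those instances of I_u.
    """
    # E_{s ~ p_c}[log D(s, r_s)], approximated by a uniform sample average.
    term_c = sum(math.log(d) for d in conf_scores) / len(conf_scores)
    # E_{s ~ p_u}[log(1 - D(s, r_s))], weighted by the sampler distribution.
    term_u = sum(p * math.log(1.0 - d) for p, d in zip(unconf_probs, unconf_scores))
    return term_c + term_u
```

The discriminator tries to maximize this value, while the sampler shifts p_u toward instances on which D(s, r_s) is high, thereby minimizing it.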
3.2 Sampler

The sampler module aims to select the most confusing sentences from the unconfident set I_u to cheat the discriminator as much as possible by optimizing the probability distribution p_u. Hence, we need to calculate the confusing score for each instance in the unconfident set I_u.

Given an instance s, we can use neural sentence encoders to represent its semantic information as an embedding y. The details of the neural encoders will be introduced in Section 3.4. Here, we can simply calculate the confusing score according to the sentence embedding y as follows,

    C(s) = W · y,    (2)

where W is a separating hyperplane. We further define P_u(s) as the confusing probability over I_u,

    P_u(s) = exp(C(s)) / Σ_{s'∈I_u} exp(C(s')).    (3)

In the unconfident set, we regard those instances with high D(s, r_s) scores as the confusing instances, because they will fool the discriminator D into making wrong decisions. An optimized sampler will assign larger confusing scores to those most confusing instances. Hence, we formalize the loss function to optimize the sampler module as follows:

    L_S = − Σ_{s∈I_u} P_u(s) log(D(s, r_s)).    (4)

When optimizing the sampler, we regard the component P_u(s) as the parameters for updating.

Note that, when an instance is labeled as r_s = NA, it indicates the relation of this instance is not available, either unsure or having no relation. Since these instances are always wrongly predicted into other relations, in order to let the discriminator restrain this tendency, we specifically define D(s, NA) as the average score of the instance over all feasible relations:

    D(s, NA) = (1 / (|R| − 1)) Σ_{r∈R, r≠NA} D(s, r),    (5)

where R indicates the set of relations.

3.3 Discriminator

Given an instance s and its embedding y, the discriminator is responsible for judging whether its labeled relation r_s is correct.
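A minimal numeric sketch of Eqs. 2-5 in pure Python (toy dimensions; the trained quantities W, y, and the D(s, r) scores are stand-ins supplied by the caller):

```python
import math

def confusing_probs(embeddings, w):
    """Eqs. 2-3: C(s) = W . y, then a softmax over the unconfident set."""
    scores = [sum(wi * yi for wi, yi in zip(w, y)) for y in embeddings]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(c - peak) for c in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sampler_loss(probs, d_scores):
    """Eq. 4: the sampler puts probability mass on instances that D scores highly."""
    return -sum(p * math.log(d) for p, d in zip(probs, d_scores))

def d_score_na(non_na_scores):
    """Eq. 5: D(s, NA) is the average of D(s, r) over all non-NA relations."""
    return sum(non_na_scores) / len(non_na_scores)
```

Instances with equal confusing scores receive uniform probability, and the loss decreases as mass moves to instances with larger D(s, r_s).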
We implement the discriminator based on the semantic relatedness between r_s and y,

    D(s, r_s) = σ(r_s · y),    (6)

where σ(·) is the sigmoid function.

An optimized discriminator will assign high scores to those instances in I_c and low scores to those instances in I_u. Hence, we formalize the loss function to optimize the discriminator module as follows:

    L_D = − (1/|I_c|) Σ_{s∈I_c} log(D(s, r_s)) − Σ_{s∈I_u} P_u(s) log(1 − D(s, r_s)).    (7)

When optimizing the discriminator, we regard the component D(s, r_s) as the parameters for updating. Note that the objective functions of the sampler in Eq. 4 and the discriminator in Eq. 7 are adversarial to each other.

In practice, the data set is usually too large to be frequently traversed due to the intractably large amount of computation. For training efficiency, we can simply sample subsets to approximate the probability distribution. Hence, we formalize a new loss function for optimization:

    ˜L_D = − (1/|ˆI_c|) Σ_{s∈ˆI_c} log(D(s, r_s)) − Σ_{s∈ˆI_u} Q_u(s) log(1 − D(s, r_s)),    (8)

where ˆI_c and ˆI_u are subsets sampled from I_c and I_u respectively, and Q_u(s) is the corresponding approximation to P_u(s) in Eq. 3:

    Q_u(s) = exp(α C(s)) / Σ_{s'∈ˆI_u} exp(α C(s')).    (9)

Note that α is a hyper-parameter that controls the sharpness of the confusing probability distribution. For consistency, we also approximate L_S in Eq. 4 as:

    ˜L_S = − Σ_{s∈ˆI_u} Q_u(s) log(D(s, r_s)).    (10)

˜L_S and ˜L_D are used to optimize our adversarial training model.

3.4 Instance Encoder

Given an instance s containing two entities, we apply several neural network architectures to encode the sentence into continuous low-dimensional embeddings y, which are expected to capture the implicit semantics of the labeled relation between the two entities.

The input layer aims to map discrete language symbols (i.e., words) into continuous input embeddings.
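A small sketch of Eqs. 6-9 under toy inputs (we read the temperature α as multiplying the score inside the softmax; the vectors and scores below are purely illustrative):

```python
import math

def d_score(relation_emb, sentence_emb):
    """Eq. 6: D(s, r_s) = sigmoid(r_s . y)."""
    dot = sum(r * y for r, y in zip(relation_emb, sentence_emb))
    return 1.0 / (1.0 + math.exp(-dot))

def approx_confusing_probs(scores, alpha):
    """Eq. 9: Q_u(s), a sharpness-controlled softmax over a sampled subset of I_u."""
    peak = max(scores)
    exps = [math.exp(alpha * (c - peak)) for c in scores]
    total = sum(exps)
    return [e / total for e in exps]

def discriminator_loss(conf_scores, unconf_scores, q_probs):
    """Eq. 8: raise D(s, r_s) on the confident subset, lower it on sampled I_u."""
    loss_c = -sum(math.log(d) for d in conf_scores) / len(conf_scores)
    loss_u = -sum(q * math.log(1.0 - d) for q, d in zip(q_probs, unconf_scores))
    return loss_c + loss_u
```

A larger α concentrates Q_u on the single most confusing instance, while a small α flattens it toward uniform.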
Given an instance s containing n words {w_1, ..., w_n}, we use Skip-Gram (Mikolov et al., 2013) to embed all words into a k_w-dimensional space as {w_1, ..., w_n}. For each word w_i, we also embed its relative distances to the two entities into two k_p-dimensional vectors, and then concatenate them as a unified position embedding p_i (Zeng et al., 2014). We finally get the k_i-dimensional input embeddings for the following encoding layer,

    s = {x_1, ..., x_n} = {[w_1; p_1], ..., [w_n; p_n]}.    (11)

In the encoding layer, we select four typical architectures including CNN (Zeng et al., 2014), PCNN (Zeng et al., 2015), RNN (Zhang and Wang, 2015) and BiRNN (Zhang and Wang, 2015) to further encode the input embeddings of the instance into sentence embeddings.
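Concretely, the input layer of Eq. 11 can be sketched as follows (the distance-clipping window max_dist and the toy lookup table are our assumptions for illustration, not values from the paper):

```python
def clip(d, max_dist):
    """Clip a relative distance into [-max_dist, max_dist] (assumed convention)."""
    return max(-max_dist, min(max_dist, d))

def input_embeddings(word_vecs, head_pos, tail_pos, pos_table, max_dist=30):
    """Eq. 11: x_i = [w_i ; p_i], where p_i concatenates the embeddings of
    the word's relative distances to the head and tail entities."""
    xs = []
    for i, w in enumerate(word_vecs):
        p_head = pos_table[clip(i - head_pos, max_dist)]
        p_tail = pos_table[clip(i - tail_pos, max_dist)]
        xs.append(list(w) + list(p_head) + list(p_tail))
    return xs
```

Each x_i then has dimension k_i = k_w + 2 k_p, matching the concatenation in Eq. 11.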
CNN slides a convolution kernel with window size m over the input sequence {x_1, ..., x_n} to get the k_h-dimensional hidden embeddings,

    h_i = CNN(x_{i−(m−1)/2}, ..., x_{i+(m−1)/2}).    (12)

A max-pooling is then applied over these hidden embeddings to output the final instance embedding y as follows,

    [y]_j = max{[h_1]_j, ..., [h_n]_j}.    (13)

PCNN is an extension to CNN, which also adopts a convolution kernel with window size m to obtain hidden embeddings. Afterwards, PCNN divides the hidden embeddings into three segments {h_1, ..., h_{e_1}}, {h_{e_1+1}, ..., h_{e_2}}, and {h_{e_2+1}, ..., h_n}, where e_1 and e_2 are the entity positions. PCNN applies a piecewise max-pooling for each segment,

    [y_1]_j = max{[h_1]_j, ..., [h_{e_1}]_j},    (14)
    [y_2]_j = max{[h_{e_1+1}]_j, ..., [h_{e_2}]_j},
    [y_3]_j = max{[h_{e_2+1}]_j, ..., [h_n]_j}.

By concatenating all pooling results, PCNN eventually outputs a 3·k_h-dimensional instance embedding y as follows,

    y = [y_1; y_2; y_3].    (15)

RNN is designed for modeling sequential data, as it keeps its hidden state changing with the input embeddings at each time-step accordingly,

    h_i = RNN(x_i, h_{i−1}),    (16)

where RNN(·) is the recurrent unit and h_i ∈ R^{k_h} is the hidden embedding at time-step i. In this paper, we select the gated recurrent unit (GRU) (Cho et al., 2014) as the recurrent unit. We use the hidden embedding of the last time-step as the instance embedding, i.e., y = h_n.

Bi-RNN aims to incorporate information from both sides of the sentence sequence. Bi-RNN is adopted with forward and backward directions as follows,

    →h_i = RNN_f(x_i, →h_{i−1}),    (17)
    ←h_i = RNN_b(x_i, ←h_{i+1}),

where →h_i and ←h_i are the hidden states at position i of the forward and backward RNN respectively. We concatenate the hidden states from both the forward and backward RNN as the instance embedding y,

    y = [→h_n; ←h_1].    (18)
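The piecewise pooling of Eqs. 14-15 is simple to state directly; this toy version takes a list of hidden vectors and 0-indexed segment boundaries (our indexing convention), assuming each segment is non-empty:

```python
def piecewise_max_pool(hidden, e1, e2):
    """Eqs. 14-15: max-pool the three segments split at the entity positions,
    then concatenate into a 3 * k_h vector.  hidden is a list of k_h-dim rows;
    the segments are hidden[:e1], hidden[e1:e2], hidden[e2:] (all non-empty)."""
    k_h = len(hidden[0])
    pooled = []
    for seg in (hidden[:e1], hidden[e1:e2], hidden[e2:]):
        pooled.extend(max(row[j] for row in seg) for j in range(k_h))
    return pooled
```

Unlike the single max-pooling of Eq. 13, the piecewise variant preserves which side of each entity a strong feature came from.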
3.5 Optimization

Here we introduce the learning and optimization details for our adversarial training model. We define the optimization function as

    L = ˜L_D + λ ˜L_S,    (19)

where λ is a harmonic factor. In practice, both modules in adversarial training are optimized alternately using stochastic gradient descent (SGD). Since the framework of our model is much simpler than typical generative adversarial networks (GANs), we do not have to calibrate the alternating ratio between the loss functions, and hence we can simply use a fixed ratio. This makes our model efficient for learning on large-scale data. Moreover, we can also integrate λ into the learning rate of the sampler loss ˜L_S, so as to avoid adjusting the hyper-parameter λ.

At the start of adversarial training, we pre-train a relation classifier on the entire training data. The relation classifier splits the entire data into a small confident set and a large unconfident set. During the adversarial training, after every few training epochs, some instances from the unconfident set that are both recommended by the sampler and recognized by the discriminator will be selected to enrich the confident set.

4 Experiments

In this section, we carry out experiments to demonstrate the effectiveness of our instance-level adversarial training method. We first introduce the datasets and parameter settings. Afterwards, we compare the performance of our method with conventional neural methods and feature-based methods for RE. To further verify that our method can better discriminate those informative instances from noisy ones, we also conduct evaluations on those entity pairs with few sentences.
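The alternating schedule above might be sketched as follows (the step callables and the promote_every interval are placeholders; the paper only states that promotion happens "every few training epochs"):

```python
def adversarial_train(confident, unconfident, d_step, s_step, promote,
                      epochs, promote_every=5):
    """Alternate one discriminator update (~L_D) and one sampler update (~L_S)
    per epoch (with lambda folded into the sampler learning rate), and
    periodically move instances agreed on by both modules from I_u to I_c."""
    for epoch in range(1, epochs + 1):
        d_step(confident, unconfident)   # SGD step on ~L_D
        s_step(unconfident)              # SGD step on ~L_S
        if epoch % promote_every == 0:
            for inst in promote(unconfident):
                unconfident.remove(inst)
                confident.append(inst)
    return confident, unconfident
```

In this sketch, promote stands in for the joint sampler/discriminator agreement test; as training proceeds, the confident set grows while the unconfident set shrinks.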
4.1 Datasets and Settings

We conduct experiments on the benchmark dataset derived from the New York Times (NYT) corpus, which was first proposed by Mintz et al. (2009) and then widely used in various distantly supervised RE works (Riedel et al., 2010; Hoffmann et al., 2011; Surdeanu et al., 2012; Zeng et al., 2015; Lin et al., 2016; Wu et al., 2017). The dataset aligns entity pairs and their relations in the KG Freebase with the NYT corpus. After various essential data processing, there are 53 relation types including the NA relation in this dataset. The training data contains 522,611 sentences, 281,270 entity pairs and 18,252 relational facts. The test data contains 172,448 sentences, 96,678 entity pairs and 1,950 relational facts.

In our models, we select the learning rates α_d and α_s for training the discriminator and the sampler respectively from a small set of candidate values. For other parameters, we simply follow the settings used in (Zeng et al., 2014; Lin et al., 2016; Wu et al., 2017) so that we can fairly compare the results of our adversarial denoising models with these baselines. Table 1 shows all parameters used in the experiments. During training, we select the most informative and confident instances in the unconfident set to enrich the confident set every few training epochs.

    Discriminator learning rate α_d
    Sampler learning rate α_s
    Hidden layer dimension k_h for CNNs    230
    Hidden layer dimension k_h for RNNs    150
    Position dimension k_p for CNNs        5
    Position dimension k_p for RNNs        3
    Word dimension k_w
    Window size m

Table 1: Parameter settings.
4.2 Overall Evaluation

We follow Mintz et al. (2009) to conduct the held-out evaluation. We construct candidate triples by combining entity pairs in the test set with various relations and rank these triples according to their corresponding sentence representations. By regarding the triples in the KGs as correct and the others as incorrect, we evaluate different models with their precision-recall results.

The evaluation results are shown in Figure 3 and Table 2. We report the results of various neural architectures including CNN, PCNN, RNN and BiRNN with various denoising methods: +ATT is the selective attention method over instances (Lin et al., 2016); +ADV is the denoising method of adding a small adversarial perturbation to instance embeddings (Wu et al., 2017); +AN is our proposed adversarial training method. We also compare our methods with feature-based models, including Mintz (Mintz et al., 2009), MultiR (Hoffmann et al., 2011) and MIML (Surdeanu et al., 2012). The results of the baseline models all come from the data reported in their papers or their open-source code.

Figure 3: Aggregate precision/recall curves of different models. (a) Comparison of the proposed models and feature-based models. (b) Comparison of the proposed models and various CNN models. (c) Comparison of the proposed models and various RNN models.

    Method        0.1    0.2    0.3    Mean
    CNN+ATT       67.5   52.8   45.8   55.4
    CNN+AN
    RNN+ATT       63.9   54.4   48.0   55.4
    RNN+AN
    PCNN+ATT      69.4   60.6   51.6   60.5
    PCNN+ADV      71.7   58.9   51.1   60.6
    PCNN+AN
    BiRNN+ATT     66.8   58.6   52.4   64.2
    BiRNN+ADV     72.8   64.6
    BiRNN+AN

Table 2: Precision of various models at different recall levels (%).

From the figure and table, we observe that:

(1) As shown in Figure 3(a), neural models significantly outperform all feature-based models over the entire range of recall. When the recall gradually grows, the performance of feature-based models drops quickly. However, all the neural models still preserve stable and competitive precision. This demonstrates that human-designed features cannot work well in a noisy environment, and the inevitable errors brought by NLP tools further hurt performance. In contrast, instance embeddings learned automatically by neural models can effectively capture implicit relational semantics from noisy data for RE.

(2) Both for CNNs (CNN and PCNN) in Figure 3(b) and RNNs (RNN and BiRNN) in Figure 3(c), the models with adversarial training outperform the models with sentence-level attention. The sentence-level attention over multiple instances, which calculates soft weights for each sentence to reduce noise, only makes a coarse-grained distinction between informative and noisy instances. In contrast, the neural models trained with adversarial denoising methods generate or sample noisy adversarial examples and force the relation classifiers to overcome them. Hence, the models with adversarial training provide efficient noise reduction in finer granularity. In general, the models with our adversarial training method achieve the best results among the models using adversarial training. This indicates that, as compared to generating pseudo adversarial examples by adding perturbations, our method of sampling adversarial examples from real-world instances can better discriminate informative instances from noisy instances.

(3) To better compare the various denoising methods, we also show evaluation results in Table 2. Since we focus more on the performance of the top-ranked results, we show the precision scores when the recall is 0.1, 0.2 and 0.3, as well as their mean. We find that complicated neural models (PCNN, BiRNN) perform better than simple neural networks (CNN, RNN) when using the same denoising methods. Both CNNs and RNNs are significantly improved by adversarial training, and our method (AN) performs consistently better than the adversarial training baseline (ADV). The improvements brought by changing denoising methods are more significant than the improvements brought by modifying neural models. This indicates that the wrong labeling problem is the critical factor that prevents distantly supervised RE models from working effectively.
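The held-out metrics used here and in the next subsection reduce to simple ranking computations; a sketch (each candidate triple is represented only by whether it appears in the KG):

```python
def precision_recall_curve(ranked_hits, total_facts):
    """Precision/recall points over a ranked list of candidate triples.
    ranked_hits[i] is True iff the i-th ranked triple is in the KG."""
    precisions, recalls, correct = [], [], 0
    for rank, hit in enumerate(ranked_hits, start=1):
        correct += int(hit)
        precisions.append(correct / rank)
        recalls.append(correct / total_facts)
    return precisions, recalls

def precision_at(ranked_hits, n):
    """P@N as reported in Table 3."""
    return sum(int(h) for h in ranked_hits[:n]) / n
```

The precision-recall curves in Figure 3 plot these points, and the fixed-recall columns of Table 2 read precision off the curve at recall 0.1, 0.2 and 0.3.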
4.3 Evaluation on Entity Pairs with Few Instances

To further verify the effectiveness of our adversarial training method, we evaluate the RE performance of our method and conventional MIL denoising methods in a more challenging scenario, i.e., when entity pairs have few sentences.

For each entity pair, we randomly select one sentence, two sentences, and all sentences to construct three experimental settings respectively. We report P@100, P@200, P@300 and their mean in the held-out evaluation. Since PCNN is the best neural model in the above comparison, we simply use PCNN to compare our method (AN) with the recent state-of-the-art denoising method, sentence-level attention (ATT), as well as its naive versions +ONE and +AVG (Zeng et al., 2015; Lin et al., 2016). The evaluation results are shown in Table 3, and from the results we observe that:

    Test Settings   One                          Two                          All
    P@N             100   200   300   Mean       100   200   300   Mean       100   200   300   Mean
    PCNN            63.0  61.0  55.3  59.8       65.0  62.5  57.3  61.6       71.0  64.0  58.7  64.6
    PCNN+ONE        73.3  64.8  56.8  65.0       70.3  67.2  63.1  66.9       72.3  69.7  64.1  68.7
    PCNN+AVG        71.3  63.7  57.8  64.3       73.3  65.2  62.1  66.9       73.3  66.7  62.8  67.6
    PCNN+ATT        73.3  69.2  60.8  67.8       77.2  71.6  66.1  71.6       76.2  73.1  67.4  72.2
    PCNN+AN

Table 3: Top-N precision (P@N) for RE on entity pairs with different numbers of instances (%).

(1) Our method achieves consistent and significant improvements as compared to the ATT method and its naive versions, especially when each entity pair corresponds to only one or two sentences. The reason is that most MIL denoising methods including ATT typically assume that at least one instance mentioning the given entity pair can express their relation, and always select at least one informative sentence for the entity pair. This assumption is not always true, especially when entity pairs correspond to few sentences: it is more likely that no instance can express the relation of the given entity pair. In contrast, our adversarial training method is not restricted by this assumption. By operating on the instance level individually, our method remains effective even when the instances of each entity pair are few.

(2) When taking more instances into account, all models achieve better results. PCNN+ATT and PCNN+AN achieve larger improvements than the naive methods. The growth of distant supervision data brings more information for training RE models as well as more noise that may hurt performance. Our method keeps its degree of superiority over the ATT method as the data grows. This indicates that our method provides a more robust and reliable scheme to denoise distant supervision data.
4.4 Case Study

Table 4 shows examples sampled by the sampler. For the frequent relation LocationContains, we use the sampler to select positive and negative instances respectively. For each sentence, we mark the entities in brackets.

    Relation: LocationContains
    Positive:
      ... China's 10 most polluted cities, four, including [Datong], are in [Shanxi province] ...
      ... [Manhattan]'s [Chinatown] has fought off the forces of urban decline ...
    Negative:
      ... the senior commander of U.S. forces in [Baghdad], has figured out the obstacle to America's dream for [Iraq] ...
      ... after [Japan]'s defeat, he said, American soldiers drove jeeps onto his family's estate in [Iwate] ...

Table 4: Some examples sampled by the sampler in the NYT corpus.

From the table, we find that the positive examples clearly correspond to the relation LocationContains, while the negative examples fail to reflect this relation. These examples show that our sampler is effective in distinguishing informative and noisy instances.
5 Conclusion and Future Work

In this paper, we propose a denoising distantly supervised method for RE via instance-level adversarial training. By splitting the entire data into a confident set and an unconfident set, our method trains a sampler and a discriminator adversarially. The sampler aims to select the most confusing instances from the unconfident set, and the discriminator aims to distinguish whether an instance comes from the confident set or the unconfident set. In experiments, we apply our method to various neural architectures for RE. The experimental results show that our method achieves efficient noise reduction in finer granularity and significantly outperforms the state-of-the-art baselines. Our method is also robust for those long-tail entity pairs with few instances.

In the future, we plan to explore the following directions: (1) Inspired by Ji et al. (2017), it will be promising to adopt external knowledge, from either KBs or text, to help train more efficient samplers and discriminators for adversarial training. (2) We may also extend the instance-level adversarial training to the entity-pair level to further improve the robustness of RE models.

References
Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder-decoder approaches. In Proceedings of SSST.
Jun Feng, Minlie Huang, Li Zhao, Yang Yang, and Xiaoyan Zhu. 2018. Reinforcement learning for relation classification from noisy data. In Proceedings of AAAI.
Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014a. Generative adversarial nets. In Proceedings of NIPS.
Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. 2014b. Explaining and harnessing adversarial examples. In Proceedings of ICLR.
Zhou GuoDong, Su Jian, Zhang Jie, and Zhang Min. 2005. Exploring various knowledge in relation extraction. In Proceedings of ACL.
Raphael Hoffmann, Congle Zhang, Xiao Ling, Luke Zettlemoyer, and Daniel S. Weld. 2011. Knowledge-based weak supervision for information extraction of overlapping relations. In Proceedings of ACL.
Guoliang Ji, Kang Liu, Shizhu He, Jun Zhao, et al. 2017. Distant supervision for relation extraction with sentence-level attention and entity descriptions. In Proceedings of AAAI.
Yankai Lin, Shiqi Shen, Zhiyuan Liu, Huanbo Luan, and Maosong Sun. 2016. Neural relation extraction with selective attention over instances. In Proceedings of ACL.
Tianyu Liu, Kexiang Wang, Baobao Chang, and Zhifang Sui. 2017. A soft-label method for noise-tolerant distantly supervised relation extraction. In Proceedings of EMNLP.
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. In Proceedings of ICLR.
Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. 2009. Distant supervision for relation extraction without labeled data. In Proceedings of ACL-IJCNLP.
Makoto Miwa and Mohit Bansal. 2016. End-to-end relation extraction using LSTMs on sequences and tree structures. In Proceedings of ACL.
Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, Ken Nakae, and Shin Ishii. 2016. Distributional smoothing with virtual adversarial training. In Proceedings of ICLR.
Raymond J. Mooney and Razvan C. Bunescu. 2006. Subsequence kernels for relation extraction. In Proceedings of NIPS.
Sebastian Riedel, Limin Yao, and Andrew McCallum. 2010. Modeling relations and their mentions without labeled text. In Proceedings of ECML-PKDD.
Cicero Nogueira dos Santos, Bing Xiang, and Bowen Zhou. 2015. Classifying relations by ranking with convolutional neural networks. In Proceedings of ACL.
Mihai Surdeanu, Julie Tibshirani, Ramesh Nallapati, and Christopher D. Manning. 2012. Multi-instance multi-label learning for relation extraction. In Proceedings of EMNLP.
Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. 2013. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199.
Yi Wu, David Bamman, and Stuart Russell. 2017. Adversarial training for relation extraction. In Proceedings of EMNLP.
Ziang Xie, Sida I. Wang, Jiwei Li, Daniel Lévy, Aiming Nie, Dan Jurafsky, and Andrew Y. Ng. 2017. Data noising as smoothing in neural network language models. In Proceedings of ICLR.
Dmitry Zelenko, Chinatsu Aone, and Anthony Richardella. 2003. Kernel methods for relation extraction. In Proceedings of JMLR.
Daojian Zeng, Kang Liu, Yubo Chen, and Jun Zhao. 2015. Distant supervision for relation extraction via piecewise convolutional neural networks. In Proceedings of EMNLP.
Daojian Zeng, Kang Liu, Siwei Lai, Guangyou Zhou, and Jun Zhao. 2014. Relation classification via convolutional deep neural network. In Proceedings of COLING.
Wenyuan Zeng, Yankai Lin, Zhiyuan Liu, and Maosong Sun. 2017. Incorporating relation paths in neural relation extraction. In Proceedings of EMNLP.
Xiangrong Zeng, Shizhu He, Kang Liu, and Jun Zhao. 2018. Large scaled relation extraction with reinforcement learning. In Proceedings of AAAI.
Dongxu Zhang and Dong Wang. 2015. Relation classification via recurrent neural network. arXiv preprint arXiv:1508.01006.