Denoising Distant Supervision for Relation Extraction via Instance-Level Adversarial Training
Xu Han, Zhiyuan Liu, Maosong Sun
Department of Computer Science and Technology, State Key Lab on Intelligent Technology and Systems, National Lab for Information Science and Technology, Tsinghua University, Beijing, China
Beijing Advanced Innovation Center for Imaging Technology, Capital Normal University, Beijing, China
Abstract
Existing neural relation extraction (NRE) models rely on distant supervision and suffer from wrong labeling problems. In this paper, we propose a novel adversarial training mechanism over instances for relation extraction to alleviate the noise issue. As compared with previous denoising methods, our proposed method can better discriminate those informative instances from noisy ones. Our method is also efficient and flexible to be applied to various NRE architectures. As shown in the experiments on a large-scale benchmark dataset in relation extraction, our denoising method can effectively filter out noisy instances and achieve significant improvements as compared with the state-of-the-art models.
1 Introduction

Relation extraction (RE) aims to extract relational facts from plain text via categorizing semantic relations between entities contained in text. For example, we can extract the fact (Mark Twain, PlaceOfBirth, Florida) from the sentence "Mark Twain was born in Florida". Many efforts have been devoted to RE, either early works based on handcrafted features (Zelenko et al., 2003; Mooney and Bunescu, 2006) or recent works based on neural networks (Zeng et al., 2014; Santos et al., 2015). These models all follow a supervised learning approach, which is effective, but the requirement for high-quality annotated data is a major bottleneck in practice.

It is time-consuming and human-intensive to manually annotate large-scale training data. Hence, Mintz et al. (2009) propose distant supervision to automatically generate training sentences via aligning KGs and text.

Figure 1: An example of distant supervision and adversarial denoising relation extraction. (The figure aligns the fact (Bob, PlaceOfBirth, New York) with sentences such as "[Bob] was born in [New York]", "[Bob] lived in [New York]", and "[Bob] traveled to [New York]", which are split into positive and negative instances.)

As shown in Figure 1, distant supervision assumes that if there is a relation between two entities in a KG, all sentences that contain the two entities will be labeled with that relation. Distant supervision is an effective approach to automatically obtain training data, but it inevitably suffers from wrong labeling problems.

To address the wrong labeling problem, Riedel et al. (2010) propose multi-instance learning (MIL), and Zeng et al. (2015) extend the idea of MIL to neural models. Lin et al. (2016) further propose a neural attention scheme over multiple instances to reduce the weights of noisy instances. These methods achieve significant improvements in RE but are still far from satisfactory. The reason is that most denoising methods simply calculate soft weights for each sentence in an unsupervised manner, which can only make a coarse-grained distinction between informative and noisy instances. Moreover, these methods cannot cope well with those entity pairs with insufficient sentences.

In order to better discriminate informative and noisy instances, inspired by the idea of adversarial learning (Goodfellow et al., 2014a), we apply adversarial training over instances to enhance RE performance. The idea of adversarial training was explored in relation extraction by generating adversarial examples with a perturbation added to sentence embeddings (Wu et al., 2017), which do not necessarily correspond to real-world sentences. On the contrary, we generate adversarial examples by sampling from existing training data, which may better locate real-world noise.

Our method contains two modules: a discriminator and a sampler, and the method will split the distantly supervised data into two parts, the confident part and the unconfident part.
The discriminator is applied to judge which sentences are more likely to be annotated correctly, with the confident data as positive instances and the unconfident data as negative instances. The sampler module is used to select the most confusing sentences from unconfident data to cheat the discriminator as much as possible. Moreover, during several training epochs, we also dynamically select the most informative and confident instances from the unconfident set to the confident set, so as to enrich the training instances for the discriminator.

The discriminator and the sampler are trained adversarially. As shown in Figure 1, during the training process, the actions of the sampler will admonish the discriminator to focus on improving those most confusing instances. Since noisy instances are ineffective to decrease the loss functions of both the sampler and the discriminator, the noise will be gradually filtered out during the adversarial training. Finally, the sampler can effectively distinguish those informative instances from the unconfident data, and the discriminator can well categorize relations between entities in text. As compared with the aforementioned MIL denoising methods, our method achieves more efficient noise detection in finer granularity.

We conduct experiments on a real-world dataset derived from the New York Times (NYT) corpus and Freebase. Experimental results demonstrate that our adversarial denoising method effectively reduces noise and significantly outperforms other baseline methods.

2 Related Work

Relation extraction is an important task in NLP, which aims to extract relational facts from text corpora. Many efforts are devoted to RE, especially in supervised RE, such as early kernel-based models (Zelenko et al., 2003; GuoDong et al., 2005; Mooney and Bunescu, 2006). Mintz et al. (2009) align plain text with KGs and propose a distantly supervised RE model, by assuming all sentences that mention two entities can describe their relations in KGs.

However, distant supervision inevitably comes with the wrong labeling problem. Riedel et al. (2010) and Hoffmann et al. (2011) apply the multi-instance learning (MIL) mechanism for RE, which considers the reliability of each instance and combines multiple sentences containing the same entity pair together to alleviate the noise problem.

In recent years, neural models (Zhang and Wang, 2015; Zeng et al., 2017; Miwa and Bansal, 2016) have been widely used in RE. These neural models are capable of accurately capturing textual relations without explicit linguistic analysis. Based on these neural architectures and the MIL mechanism, Lin et al. (2016) propose a sentence-level attention to reduce the influence of incorrectly labeled sentences. To summarize, these MIL models generally make soft weight adjustment for informative and noisy instances. Some works further adopt external information to improve denoising performance: Ji et al. (2017) incorporate external entity descriptions to enhance attention representations; Liu et al. (2017) manually set label confidences to denoise entity-pair level noises.

More sophisticated mechanisms, such as reinforcement learning (Feng et al., 2018; Zeng et al., 2018), have recently also been adapted to select positive sentences from noisy data. However, these complex mechanisms usually require much time to fine-tune, and their convergence is not yet well guaranteed in practice. In this paper, we propose a novel fine-grained denoising method for RE via adversarial training. The method is simple and effective to be applied in various neural architectures and to scale up to large-scale data.
Szegedy et al. (2013) propose to generate adversarial examples by adding noise in the form of small perturbations to the original data. These noise examples are often indistinguishable to humans but lead to models' wrong predictions.

Figure 2: The overall framework of the instance-level adversarial training model for relation extraction. The discriminator module is used to judge whether an instance is labeled correctly, and the instance will be considered coming from the confident set if the prediction is yes. The sampler module is used to select the most confusing instances from unconfident data to cheat the discriminator.

Goodfellow et al. (2014b) analyze adversarial examples and propose adversarial training for image classification tasks. Afterwards, Goodfellow et al. (2014a) propose a mature adversarial training framework and use the framework to train generative models.

Adversarial training has also been explored in NLP. Miyato et al. (2016) propose adversarial training for text classification by adding perturbations to word embeddings. The idea of perturbation addition has further been applied to other NLP tasks including language models (Xie et al., 2017) and relation extraction (Wu et al., 2017). Different from Wu et al. (2017), who generate pseudo adversarial examples by adding perturbations to instance embeddings, we perform adversarial training by sampling adversarial examples from real-world noisy data. The adversarial examples in our method can better correspond to the real-world scenario for RE. Hence our method is more favorable for solving the wrong labeling problem in distant supervision, which will be shown in the experiments.
3 Methodology

In this section, we introduce the details of our instance-level adversarial training model for denoising RE. For this model, we split the entire training data into two parts, the set of confident instances I_c and the set of unconfident instances I_u. A sentence encoder is adopted to represent sentence semantics with embeddings. The adversarial training framework consists of a sampler and a discriminator, corresponding to the noise filter and the relation classifier respectively.

3.1 Framework

As shown in Figure 2, the overall framework of our instance-level adversarial training model includes a discriminator D and a sampler S, in which S samples adversarial examples from the unconfident set I_u, and D learns to judge whether a given instance is from I_c or I_u.

We assume that each instance s ∈ I_c exposes implicit semantics of its labeled relation r_s. In contrast, those instances s ∈ I_u are not trusted to be labeled correctly during the adversarial training. Hence, we implement D as a function D(s, r_s) to judge whether a given instance s exposes implicit semantics of its labeled relation r_s: if yes, the instance comes from I_c; if no, the instance comes from I_u.

The training process is a min-max game and can be formalized as follows,

    φ = min_{p_u} max_D ( E_{s∼p_c}[log(D(s, r_s))] + E_{s∼p_u}[log(1 − D(s, r_s))] ),    (1)

where p_c is the confident data distribution, and the sampler S samples adversarial examples from the unconfident data according to the probability distribution p_u.

After sufficient training, S tends to sample those informative instances in I_u rather than those noisy instances, and D becomes a relation classifier of good robustness to noisy data. We will give the detailed introduction to the sampler in Section 3.2 and the discriminator in Section 3.3.
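As a concrete reading of Eq. 1, the sketch below (illustrative only; the function and variable names are ours, and the discriminator outputs are taken as given probabilities) evaluates the two expectation terms for fixed discriminator outputs:

```python
import math

def minmax_objective(conf_scores, unconf_scores, unconf_probs):
    """Value inside Eq. 1 for fixed discriminator outputs.

    conf_scores: D(s, r_s) for instances drawn uniformly from I_c.
    unconf_scores: D(s, r_s) for instances of I_u.
    unconf_probs: the sampler distribution p_u over those instances of I_u.
    """
    # E_{s ~ p_c}[log D(s, r_s)], approximated by a uniform sample average.
    term_c = sum(math.log(d) for d in conf_scores) / len(conf_scores)
    # E_{s ~ p_u}[log(1 - D(s, r_s))], weighted by the sampler distribution.
    term_u = sum(p * math.log(1.0 - d) for p, d in zip(unconf_probs, unconf_scores))
    return term_c + term_u
```

The discriminator tries to maximize this value, while the sampler shifts p_u toward instances on which D(s, r_s) is high, thereby minimizing it.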
3.2 Sampler

The sampler module aims to select the most confusing sentences from the unconfident set I_u to cheat the discriminator as much as possible by optimizing the probability distribution p_u. Hence, we need to calculate the confusing score for each instance in the unconfident set I_u.

Given an instance s, we can use neural sentence encoders to represent its semantic information as an embedding y. The details of the neural encoders will be introduced in Section 3.4. Here, we can simply calculate the confusing score according to the sentence embedding y as follows,

    C(s) = W · y,    (2)

where W is a separating hyperplane. We further define P_u(s) as the confusing probability over I_u,

    P_u(s) = exp(C(s)) / Σ_{s'∈I_u} exp(C(s')).    (3)

In the unconfident set, we regard those instances with high D(s, r_s) scores as the confusing instances, because they will fool the discriminator D into making wrong decisions. An optimized sampler will assign larger confusing scores to those most confusing instances. Hence, we formalize the loss function to optimize the sampler module as follows:

    L_S = − Σ_{s∈I_u} P_u(s) log(D(s, r_s)).    (4)

When optimizing the sampler, we regard the component P_u(s) as the parameters for updating.

Note that, when an instance is labeled as r_s = NA, it indicates the relation of this instance is not available, either unsure or having no relation. Since these instances are always wrongly predicted into other relations, in order to let the discriminator restrain this tendency, we specifically define D(s, NA) as the average score of the instance over all feasible relations:

    D(s, NA) = (1 / (|R| − 1)) Σ_{r∈R, r≠NA} D(s, r),    (5)

where R indicates the set of relations.

3.3 Discriminator

Given an instance s and its embedding y, the discriminator is responsible for judging whether its labeled relation r_s is correct.
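A minimal numeric sketch of Eqs. 2-5 in pure Python (toy dimensions; the trained quantities W, y, and the D(s, r) scores are stand-ins supplied by the caller):

```python
import math

def confusing_probs(embeddings, w):
    """Eqs. 2-3: C(s) = W . y, then a softmax over the unconfident set."""
    scores = [sum(wi * yi for wi, yi in zip(w, y)) for y in embeddings]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(c - peak) for c in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sampler_loss(probs, d_scores):
    """Eq. 4: the sampler puts probability mass on instances that D scores highly."""
    return -sum(p * math.log(d) for p, d in zip(probs, d_scores))

def d_score_na(non_na_scores):
    """Eq. 5: D(s, NA) is the average of D(s, r) over all non-NA relations."""
    return sum(non_na_scores) / len(non_na_scores)
```

Instances with equal confusing scores receive uniform probability, and the loss decreases as mass moves to instances with larger D(s, r_s).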
We implement the discriminator based on the semantic relatedness between r_s and y,

    D(s, r_s) = σ(r_s · y),    (6)

where σ(·) is the sigmoid function.

An optimized discriminator will assign high scores to those instances in I_c and low scores to those instances in I_u. Hence, we formalize the loss function to optimize the discriminator module as follows:

    L_D = − (1/|I_c|) Σ_{s∈I_c} log(D(s, r_s)) − Σ_{s∈I_u} P_u(s) log(1 − D(s, r_s)).    (7)

When optimizing the discriminator, we regard the component D(s, r_s) as the parameters for updating. Note that the objective functions of the sampler in Eq. 4 and the discriminator in Eq. 7 are adversarial to each other.

In practice, the data set is usually too large to be frequently traversed due to the intractably large amount of computation. For training efficiency, we can simply sample subsets to approximate the probability distribution. Hence, we formalize a new loss function for optimization:

    ˜L_D = − (1/|ˆI_c|) Σ_{s∈ˆI_c} log(D(s, r_s)) − Σ_{s∈ˆI_u} Q_u(s) log(1 − D(s, r_s)),    (8)

where ˆI_c and ˆI_u are subsets sampled from I_c and I_u respectively, and Q_u(s) is the corresponding approximation to P_u(s) in Eq. 3:

    Q_u(s) = exp(α C(s)) / Σ_{s'∈ˆI_u} exp(α C(s')).    (9)

Note that α is a hyper-parameter that controls the sharpness of the confusing probability distribution. For consistency, we also approximate L_S in Eq. 4 as:

    ˜L_S = − Σ_{s∈ˆI_u} Q_u(s) log(D(s, r_s)).    (10)

˜L_S and ˜L_D are used to optimize our adversarial training model.

3.4 Instance Encoder

Given an instance s containing two entities, we apply several neural network architectures to encode the sentence into continuous low-dimensional embeddings y, which are expected to capture the implicit semantics of the labeled relation between the two entities.

The input layer aims to map discrete language symbols (i.e., words) into continuous input embeddings.
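A small sketch of Eqs. 6-9 under toy inputs (we read the temperature α as multiplying the score inside the softmax; the vectors and scores below are purely illustrative):

```python
import math

def d_score(relation_emb, sentence_emb):
    """Eq. 6: D(s, r_s) = sigmoid(r_s . y)."""
    dot = sum(r * y for r, y in zip(relation_emb, sentence_emb))
    return 1.0 / (1.0 + math.exp(-dot))

def approx_confusing_probs(scores, alpha):
    """Eq. 9: Q_u(s), a sharpness-controlled softmax over a sampled subset of I_u."""
    peak = max(scores)
    exps = [math.exp(alpha * (c - peak)) for c in scores]
    total = sum(exps)
    return [e / total for e in exps]

def discriminator_loss(conf_scores, unconf_scores, q_probs):
    """Eq. 8: raise D(s, r_s) on the confident subset, lower it on sampled I_u."""
    loss_c = -sum(math.log(d) for d in conf_scores) / len(conf_scores)
    loss_u = -sum(q * math.log(1.0 - d) for q, d in zip(q_probs, unconf_scores))
    return loss_c + loss_u
```

A larger α concentrates Q_u on the single most confusing instance, while a small α flattens it toward uniform.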
Given an instance s containing n words {w_1, ..., w_n}, we use Skip-Gram (Mikolov et al., 2013) to embed all words into a k_w-dimensional space as {w_1, ..., w_n}. For each word w_i, we also embed its relative distances to the two entities into two k_p-dimensional vectors, and then concatenate them as a unified position embedding p_i (Zeng et al., 2014). We finally get the k_i-dimensional input embeddings for the following encoding layer,

    s = {x_1, ..., x_n} = {[w_1; p_1], ..., [w_n; p_n]}.    (11)

In the encoding layer, we select four typical architectures including CNN (Zeng et al., 2014), PCNN (Zeng et al., 2015), RNN (Zhang and Wang, 2015) and BiRNN (Zhang and Wang, 2015) to further encode the input embeddings of the instance into sentence embeddings.
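Concretely, the input layer of Eq. 11 can be sketched as follows (the distance-clipping window max_dist and the toy lookup table are our assumptions for illustration, not values from the paper):

```python
def clip(d, max_dist):
    """Clip a relative distance into [-max_dist, max_dist] (assumed convention)."""
    return max(-max_dist, min(max_dist, d))

def input_embeddings(word_vecs, head_pos, tail_pos, pos_table, max_dist=30):
    """Eq. 11: x_i = [w_i ; p_i], where p_i concatenates the embeddings of
    the word's relative distances to the head and tail entities."""
    xs = []
    for i, w in enumerate(word_vecs):
        p_head = pos_table[clip(i - head_pos, max_dist)]
        p_tail = pos_table[clip(i - tail_pos, max_dist)]
        xs.append(list(w) + list(p_head) + list(p_tail))
    return xs
```

Each x_i then has dimension k_i = k_w + 2 k_p, matching the concatenation in Eq. 11.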
CNN slides a convolution kernel with window size m over the input sequence {x_1, ..., x_n} to get the k_h-dimensional hidden embeddings,

    h_i = CNN(x_{i−(m−1)/2}, ..., x_{i+(m−1)/2}).    (12)

A max-pooling is then applied over these hidden embeddings to output the final instance embedding y as follows,

    [y]_j = max{[h_1]_j, ..., [h_n]_j}.    (13)

PCNN is an extension to CNN, which also adopts a convolution kernel with window size m to obtain hidden embeddings. Afterwards, PCNN divides the hidden embeddings into three segments {h_1, ..., h_{e_1}}, {h_{e_1+1}, ..., h_{e_2}}, and {h_{e_2+1}, ..., h_n}, where e_1 and e_2 are the entity positions. PCNN applies a piecewise max-pooling for each segment,

    [y_1]_j = max{[h_1]_j, ..., [h_{e_1}]_j},    (14)
    [y_2]_j = max{[h_{e_1+1}]_j, ..., [h_{e_2}]_j},
    [y_3]_j = max{[h_{e_2+1}]_j, ..., [h_n]_j}.

By concatenating all pooling results, PCNN eventually outputs a 3·k_h-dimensional instance embedding y as follows,

    y = [y_1; y_2; y_3].    (15)

RNN is designed for modeling sequential data, as it keeps its hidden state changing with the input embeddings at each time-step accordingly,

    h_i = RNN(x_i, h_{i−1}),    (16)

where RNN(·) is the recurrent unit and h_i ∈ R^{k_h} is the hidden embedding at time-step i. In this paper, we select the gated recurrent unit (GRU) (Cho et al., 2014) as the recurrent unit. We use the hidden embedding of the last time-step as the instance embedding, i.e., y = h_n.

Bi-RNN aims to incorporate information from both sides of the sentence sequence. Bi-RNN is adopted with forward and backward directions as follows,

    →h_i = RNN_f(x_i, →h_{i−1}),    (17)
    ←h_i = RNN_b(x_i, ←h_{i+1}),

where →h_i and ←h_i are the hidden states at position i of the forward and backward RNN respectively. We concatenate the hidden states from both the forward and backward RNN as the instance embedding y,

    y = [→h_n; ←h_1].    (18)
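The piecewise pooling of Eqs. 14-15 is simple to state directly; this toy version takes a list of hidden vectors and 0-indexed segment boundaries (our indexing convention), assuming each segment is non-empty:

```python
def piecewise_max_pool(hidden, e1, e2):
    """Eqs. 14-15: max-pool the three segments split at the entity positions,
    then concatenate into a 3 * k_h vector.  hidden is a list of k_h-dim rows;
    the segments are hidden[:e1], hidden[e1:e2], hidden[e2:] (all non-empty)."""
    k_h = len(hidden[0])
    pooled = []
    for seg in (hidden[:e1], hidden[e1:e2], hidden[e2:]):
        pooled.extend(max(row[j] for row in seg) for j in range(k_h))
    return pooled
```

Unlike the single max-pooling of Eq. 13, the piecewise variant preserves which side of each entity a strong feature came from.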
3.5 Optimization

Here we introduce the learning and optimization details for our adversarial training model. We define the optimization function as

    L = ˜L_D + λ ˜L_S,    (19)

where λ is a harmonic factor. In practice, both modules in adversarial training are optimized alternately using stochastic gradient descent (SGD). Since the framework of our model is much simpler than typical generative adversarial networks (GANs), we do not have to calibrate the alternating ratio between the loss functions, and hence we can simply use a fixed ratio. This makes our model efficient for learning on large-scale data. Moreover, we can also integrate λ into the learning rate of the sampler loss ˜L_S, so as to avoid adjusting the hyper-parameter λ.

At the start of adversarial training, we pre-train a relation classifier on the entire training data. The relation classifier splits the entire data into a small confident set and a large unconfident set. During the adversarial training, after every few training epochs, some instances from the unconfident set that are both recommended by the sampler and recognized by the discriminator will be selected to enrich the confident set.

4 Experiments

In this section, we carry out experiments to demonstrate the effectiveness of our instance-level adversarial training method. We first introduce the datasets and parameter settings. Afterwards, we compare the performance of our method with conventional neural methods and feature-based methods for RE. To further verify that our method can better discriminate those informative instances from noisy ones, we also conduct evaluations on those entity pairs with few sentences.
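The alternating schedule above might be sketched as follows (the step callables and the promote_every interval are placeholders; the paper only states that promotion happens "every few training epochs"):

```python
def adversarial_train(confident, unconfident, d_step, s_step, promote,
                      epochs, promote_every=5):
    """Alternate one discriminator update (~L_D) and one sampler update (~L_S)
    per epoch (with lambda folded into the sampler learning rate), and
    periodically move instances agreed on by both modules from I_u to I_c."""
    for epoch in range(1, epochs + 1):
        d_step(confident, unconfident)   # SGD step on ~L_D
        s_step(unconfident)              # SGD step on ~L_S
        if epoch % promote_every == 0:
            for inst in promote(unconfident):
                unconfident.remove(inst)
                confident.append(inst)
    return confident, unconfident
```

In this sketch, promote stands in for the joint sampler/discriminator agreement test; as training proceeds, the confident set grows while the unconfident set shrinks.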
4.1 Datasets and Settings

We conduct experiments on the benchmark dataset derived from the New York Times (NYT) corpus, which was first proposed by Mintz et al. (2009) and then widely used in various distantly supervised RE works (Riedel et al., 2010; Hoffmann et al., 2011; Surdeanu et al., 2012; Zeng et al., 2015; Lin et al., 2016; Wu et al., 2017). The dataset aligns entity pairs and their relations in the KG Freebase with the NYT corpus. After various essential data processing, there are 53 relation types including the NA relation in this dataset. The training data contains 522,611 sentences, 281,270 entity pairs and 18,252 relational facts. The test data contains 172,448 sentences, 96,678 entity pairs and 1,950 relational facts.

In our models, we select the learning rates α_d and α_s for training the discriminator and the sampler respectively from a small set of candidate values. For other parameters, we simply follow the settings used in (Zeng et al., 2014; Lin et al., 2016; Wu et al., 2017) so that we can fairly compare the results of our adversarial denoising models with these baselines. Table 1 shows all parameters used in the experiments. During training, we select the most informative and confident instances in the unconfident set to enrich the confident set every few training epochs.

    Discriminator learning rate α_d
    Sampler learning rate α_s
    Hidden layer dimension k_h for CNNs    230
    Hidden layer dimension k_h for RNNs    150
    Position dimension k_p for CNNs        5
    Position dimension k_p for RNNs        3
    Word dimension k_w
    Window size m

Table 1: Parameter settings.
4.2 Overall Evaluation

We follow Mintz et al. (2009) to conduct the held-out evaluation. We construct candidate triples by combining entity pairs in the test set with various relations and rank these triples according to their corresponding sentence representations. By regarding the triples in the KGs as correct and the others as incorrect, we evaluate different models with their precision-recall results.

The evaluation results are shown in Figure 3 and Table 2. We report the results of various neural architectures including CNN, PCNN, RNN and BiRNN with various denoising methods: +ATT is the selective attention method over instances (Lin et al., 2016); +ADV is the denoising method of adding a small adversarial perturbation to instance embeddings (Wu et al., 2017); +AN is our proposed adversarial training method. We also compare our methods with feature-based models, including Mintz (Mintz et al., 2009), MultiR (Hoffmann et al., 2011) and MIML (Surdeanu et al., 2012). The results of the baseline models all come from the data reported in their papers or their open-source code.

Figure 3: Aggregate precision/recall curves of different models. (a) Comparison of the proposed models and feature-based models. (b) Comparison of the proposed models and various CNN models. (c) Comparison of the proposed models and various RNN models.

    Method        0.1    0.2    0.3    Mean
    CNN+ATT       67.5   52.8   45.8   55.4
    CNN+AN
    RNN+ATT       63.9   54.4   48.0   55.4
    RNN+AN
    PCNN+ATT      69.4   60.6   51.6   60.5
    PCNN+ADV      71.7   58.9   51.1   60.6
    PCNN+AN
    BiRNN+ATT     66.8   58.6   52.4   64.2
    BiRNN+ADV     72.8   64.6
    BiRNN+AN

Table 2: Precision of various models at different recall levels (%).

From the figure and table, we observe that:

(1) As shown in Figure 3(a), neural models significantly outperform all feature-based models over the entire range of recall. When the recall gradually grows, the performance of feature-based models drops quickly. However, all the neural models still preserve stable and competitive precision. This demonstrates that human-designed features cannot work well in a noisy environment, and the inevitable errors brought by NLP tools further hurt performance. In contrast, instance embeddings learned automatically by neural models can effectively capture implicit relational semantics from noisy data for RE.

(2) Both for CNNs (CNN and PCNN) in Figure 3(b) and RNNs (RNN and BiRNN) in Figure 3(c), the models with adversarial training outperform the models with sentence-level attention. The sentence-level attention over multiple instances, which calculates soft weights for each sentence to reduce noise, only makes a coarse-grained distinction between informative and noisy instances. In contrast, the neural models trained with adversarial denoising methods generate or sample noisy adversarial examples and force the relation classifiers to overcome them. Hence, the models with adversarial training provide efficient noise reduction in finer granularity. In general, the models with our adversarial training method achieve the best results among the models using adversarial training. This indicates that, as compared to generating pseudo adversarial examples by adding perturbations, our method of sampling adversarial examples from real-world instances can better discriminate informative instances from noisy instances.

(3) To better compare the various denoising methods, we also show evaluation results in Table 2. Since we focus more on the performance of the top-ranked results, we show the precision scores when the recall is 0.1, 0.2 and 0.3, as well as their mean. We find that complicated neural models (PCNN, BiRNN) perform better than simple neural networks (CNN, RNN) when using the same denoising methods. Both CNNs and RNNs are significantly improved by adversarial training, and our method (AN) performs consistently better than the adversarial training baseline (ADV). The improvements brought by changing denoising methods are more significant than the improvements brought by modifying neural models. This indicates that the wrong labeling problem is the critical factor that prevents distantly supervised RE models from working effectively.
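The held-out metrics used here and in the next subsection reduce to simple ranking computations; a sketch (each candidate triple is represented only by whether it appears in the KG):

```python
def precision_recall_curve(ranked_hits, total_facts):
    """Precision/recall points over a ranked list of candidate triples.
    ranked_hits[i] is True iff the i-th ranked triple is in the KG."""
    precisions, recalls, correct = [], [], 0
    for rank, hit in enumerate(ranked_hits, start=1):
        correct += int(hit)
        precisions.append(correct / rank)
        recalls.append(correct / total_facts)
    return precisions, recalls

def precision_at(ranked_hits, n):
    """P@N as reported in Table 3."""
    return sum(int(h) for h in ranked_hits[:n]) / n
```

The precision-recall curves in Figure 3 plot these points, and the fixed-recall columns of Table 2 read precision off the curve at recall 0.1, 0.2 and 0.3.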
4.3 Evaluation on Entity Pairs with Few Instances

To further verify the effectiveness of our adversarial training method, we evaluate the RE performance of our method and conventional MIL denoising methods in a more challenging scenario, i.e., when entity pairs have few sentences.

For each entity pair, we randomly select one sentence, two sentences, and all sentences to construct three experimental settings respectively. We report P@100, P@200, P@300 and their mean in the held-out evaluation. Since PCNN is the best neural model in the above comparison, we simply use PCNN to compare our method (AN) with the recent state-of-the-art denoising method, sentence-level attention (ATT), as well as its naive versions +ONE and +AVG (Zeng et al., 2015; Lin et al., 2016). The evaluation results are shown in Table 3, and from the results we observe that:

    Test Settings   One                          Two                          All
    P@N             100   200   300   Mean       100   200   300   Mean       100   200   300   Mean
    PCNN            63.0  61.0  55.3  59.8       65.0  62.5  57.3  61.6       71.0  64.0  58.7  64.6
    PCNN+ONE        73.3  64.8  56.8  65.0       70.3  67.2  63.1  66.9       72.3  69.7  64.1  68.7
    PCNN+AVG        71.3  63.7  57.8  64.3       73.3  65.2  62.1  66.9       73.3  66.7  62.8  67.6
    PCNN+ATT        73.3  69.2  60.8  67.8       77.2  71.6  66.1  71.6       76.2  73.1  67.4  72.2
    PCNN+AN

Table 3: Top-N precision (P@N) for RE on entity pairs with different numbers of instances (%).

(1) Our method achieves consistent and significant improvements as compared to the ATT method and its naive versions, especially when each entity pair corresponds to only one or two sentences. The reason is that most MIL denoising methods including ATT typically assume that at least one instance mentioning the given entity pair can express their relation, and always select at least one informative sentence for the entity pair. This assumption is not always true, especially when entity pairs correspond to few sentences: it is more likely that no instance can express the relation of the given entity pair. In contrast, our adversarial training method is not restricted by this assumption. By operating on the instance level individually, our method remains effective even when the instances of each entity pair are few.

(2) When taking more instances into account, all models achieve better results. PCNN+ATT and PCNN+AN achieve larger improvements than the naive methods. The growth of distant supervision data brings more information for training RE models as well as more noise that may hurt performance. Our method keeps its degree of superiority over the ATT method as the data grows. This indicates that our method provides a more robust and reliable scheme to denoise distant supervision data.
4.4 Case Study

Table 4 shows examples sampled by the sampler. For the frequent relation LocationContains, we use the sampler to select positive and negative instances respectively. For each sentence, we mark the entities in brackets.

    Relation: LocationContains
    Positive:
      ... China's 10 most polluted cities, four, including [Datong], are in [Shanxi province] ...
      ... [Manhattan]'s [Chinatown] has fought off the forces of urban decline ...
    Negative:
      ... the senior commander of U.S. forces in [Baghdad], has figured out the obstacle to America's dream for [Iraq] ...
      ... after [Japan]'s defeat, he said, American soldiers drove jeeps onto his family's estate in [Iwate] ...

Table 4: Some examples sampled by the sampler in the NYT corpus.

From the table, we find that the positive examples clearly correspond to the relation LocationContains, while the negative examples fail to reflect this relation. These examples show that our sampler is effective in distinguishing informative and noisy instances.
5 Conclusion and Future Work

In this paper, we propose a denoising distantly supervised method for RE via instance-level adversarial training. By splitting the entire data into a confident set and an unconfident set, our method trains a sampler and a discriminator adversarially. The sampler aims to select the most confusing instances from the unconfident set, and the discriminator aims to distinguish whether an instance comes from the confident set or the unconfident set. In experiments, we apply our method to various neural architectures for RE. The experimental results show that our method achieves efficient noise reduction in finer granularity and significantly outperforms the state-of-the-art baselines. Our method is also robust for those long-tail entity pairs with few instances.

In the future, we plan to explore the following directions: (1) Inspired by Ji et al. (2017), it will be promising to adopt external knowledge, from either KBs or text, to help train more efficient samplers and discriminators for adversarial training. (2) We may also extend the instance-level adversarial training to the entity-pair level to further improve the robustness of RE models.

References
Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder-decoder approaches. In Proceedings of SSST.
Jun Feng, Minlie Huang, Li Zhao, Yang Yang, and Xiaoyan Zhu. 2018. Reinforcement learning for relation classification from noisy data. In Proceedings of AAAI.
Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014a. Generative adversarial nets. In Proceedings of NIPS.
Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. 2014b. Explaining and harnessing adversarial examples. In Proceedings of ICLR.
Zhou GuoDong, Su Jian, Zhang Jie, and Zhang Min. 2005. Exploring various knowledge in relation extraction. In Proceedings of ACL.
Raphael Hoffmann, Congle Zhang, Xiao Ling, Luke Zettlemoyer, and Daniel S. Weld. 2011. Knowledge-based weak supervision for information extraction of overlapping relations. In Proceedings of ACL.
Guoliang Ji, Kang Liu, Shizhu He, Jun Zhao, et al. 2017. Distant supervision for relation extraction with sentence-level attention and entity descriptions. In Proceedings of AAAI.
Yankai Lin, Shiqi Shen, Zhiyuan Liu, Huanbo Luan, and Maosong Sun. 2016. Neural relation extraction with selective attention over instances. In Proceedings of ACL.
Tianyu Liu, Kexiang Wang, Baobao Chang, and Zhifang Sui. 2017. A soft-label method for noise-tolerant distantly supervised relation extraction. In Proceedings of EMNLP.
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. In Proceedings of ICLR.
Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. 2009. Distant supervision for relation extraction without labeled data. In Proceedings of ACL-IJCNLP.
Makoto Miwa and Mohit Bansal. 2016. End-to-end relation extraction using LSTMs on sequences and tree structures. In Proceedings of ACL.
Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, Ken Nakae, and Shin Ishii. 2016. Distributional smoothing with virtual adversarial training. In Proceedings of ICLR.
Raymond J. Mooney and Razvan C. Bunescu. 2006. Subsequence kernels for relation extraction. In Proceedings of NIPS.
Sebastian Riedel, Limin Yao, and Andrew McCallum. 2010. Modeling relations and their mentions without labeled text. In Proceedings of ECML-PKDD.
Cicero Nogueira dos Santos, Bing Xiang, and Bowen Zhou. 2015. Classifying relations by ranking with convolutional neural networks. In Proceedings of ACL.
Mihai Surdeanu, Julie Tibshirani, Ramesh Nallapati, and Christopher D. Manning. 2012. Multi-instance multi-label learning for relation extraction. In Proceedings of EMNLP.
Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. 2013. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199.
Yi Wu, David Bamman, and Stuart Russell. 2017. Adversarial training for relation extraction. In Proceedings of EMNLP.
Ziang Xie, Sida I. Wang, Jiwei Li, Daniel Lévy, Aiming Nie, Dan Jurafsky, and Andrew Y. Ng. 2017. Data noising as smoothing in neural network language models. In Proceedings of ICLR.
Dmitry Zelenko, Chinatsu Aone, and Anthony Richardella. 2003. Kernel methods for relation extraction. In Proceedings of JMLR.
Daojian Zeng, Kang Liu, Yubo Chen, and Jun Zhao. 2015. Distant supervision for relation extraction via piecewise convolutional neural networks. In Proceedings of EMNLP.
Daojian Zeng, Kang Liu, Siwei Lai, Guangyou Zhou, and Jun Zhao. 2014. Relation classification via convolutional deep neural network. In Proceedings of COLING.
Wenyuan Zeng, Yankai Lin, Zhiyuan Liu, and Maosong Sun. 2017. Incorporating relation paths in neural relation extraction. In Proceedings of EMNLP.
Xiangrong Zeng, Shizhu He, Kang Liu, and Jun Zhao. 2018. Large scaled relation extraction with reinforcement learning. In Proceedings of AAAI.
Dongxu Zhang and Dong Wang. 2015. Relation classification via recurrent neural network. arXiv preprint arXiv:1508.01006.