Improving Distantly-Supervised Relation Extraction through BERT-based Label & Instance Embeddings
Despina Christou
School of Informatics, Aristotle University of Thessaloniki, 54124, Greece
[email protected]

Grigorios Tsoumakas
School of Informatics, Aristotle University of Thessaloniki, 54124, Greece
[email protected]
Abstract
Distantly-supervised relation extraction (RE) is an effective method to scale RE to large corpora but suffers from noisy labels. Existing approaches try to alleviate noise through multi-instance learning and by providing additional information, but manage to recognize mainly the top frequent relations, neglecting those in the long-tail. We propose REDSandT (Relation Extraction with Distant Supervision and Transformers), a novel distantly-supervised transformer-based RE method that manages to capture a wider set of relations through highly informative instance and label embeddings for RE, by exploiting BERT's pre-trained model and the relationship between labels and entities, respectively. We guide REDSandT to focus solely on relational tokens by fine-tuning BERT on a structured input, including the sub-tree connecting an entity pair and the entities' types. Using the extracted informative vectors, we shape label embeddings, which we also use as an attention mechanism over instances to further reduce noise. Finally, we represent sentences by concatenating relation and instance embeddings. Experiments on the NYT-10 dataset show that REDSandT captures a broader set of relations with higher confidence, achieving state-of-the-art AUC (0.424).
1 Introduction

Relation Extraction (RE) aims to detect semantic relationships between entity pairs in natural texts and has proven to be crucial in various natural language processing (NLP) applications, including question answering and knowledge-base (KB) population. Most RE methods follow a supervised approach, with the required number of labeled training data rendering the whole process time- and labor-intensive. To automatically construct datasets for RE, Mintz et al. (2009) proposed to use distant supervision (DS) from a KB, assuming that if two entities exhibit a relationship in a KB, then all sentences mentioning these entities express this relation. Inevitably, this assumption generates false positives and leads distantly-created datasets to contain erroneous labels. To alleviate the wrong labeling problem, Riedel et al. (2010) relaxed this assumption so that it does not hold for all instances and, along with Hoffmann et al. (2011) and Surdeanu et al. (2012), proposed multi-instance learning. Under this setting, classification shifts from instance-level to bag-level, with a bag consisting of all instances that contain a specific entity pair.

Current state-of-the-art RE methods try to reduce the effect of noisy instances by: i) identifying valid instances through multi-instance learning and selective attention (Lin et al., 2016), ii) reducing inner-sentence noise by capturing long-range dependencies using syntactic information from dependency parses (Mintz et al., 2009; He et al., 2018; Liu et al., 2018), specialized models like piecewise CNN (PCNN) and graph CNN (GCNN), or word-level attention (He et al., 2018), and iii) enhancing model effectiveness using external knowledge (i.e., KB entity types (Vashishth et al., 2018), entity descriptions (Ji, 2017; Hu et al., 2019), relation phrases (Vashishth et al., 2018)) or transferring knowledge from pre-trained models (Alt et al., 2019).

The study of the above approaches led us to the following core observations. First, among all models used in the literature, a pre-trained transformer-based language model (LM) can help in recognizing a broader set of relations, even though at the expense of time and computational resources. Second, the relationship between label and entities can entail valuable information, but it is rarely used compared to external knowledge. Driven by these observations, we were inspired to develop a novel transformer-based model that can efficiently capture instance and label embeddings with low complexity, so as to drive RE in recognizing a broader set of relations.

We propose REDSandT (Relation Extraction with Distant Supervision and Transformers), a novel transformer-based RE model for distant supervision. To handle the problem of noisy instances, we guide REDSandT to focus solely on relational tokens by fine-tuning BERT on a structured input, including the sub-tree connecting an entity pair (STP) and the entities' types. The input's RE-specific formation, along with BERT's knowledge from unsupervised pre-training, results in REDSandT generating informative vectors. Using these vectors, we shape relation embeddings representing the entities' distance in vector space. Relation embeddings are then used as relation-wise attention over instance representations to reduce the effect of less-informative tokens.
Finally, REDSandT encodes sentences by concatenating relation and weighted-instance embeddings, with relation classification occurring at bag-level as a weighted sum over its sentences' predictions.

We chose BERT over other transformer-based models because it considers bidirectionality while training. We assume that this characteristic is important to efficiently capture entities' interactions without requiring an additional task that significantly increases complexity (i.e., fine-tuning an auxiliary objective in GPT (Alt et al., 2019)).

The main contributions of this paper can be summarized as follows:

• We extend BERT to handle multi-instance learning, to directly fine-tune the model in a DS setting and reduce error accumulation.

• Relation embeddings captured through BERT fine-tuned on our RE-specific input help to recognize a wider set of relations, including relations in the long-tail.

• Suppressing the input sentence to its relational tokens through STP encoding allows us to capture informative instance embeddings while preserving low complexity, enabling us to train our model on modest hardware.

• Experiments on the NYT-10 dataset show REDSandT to surpass state-of-the-art models (Vashishth et al., 2018; Alt et al., 2019) in AUC (by 1.0 and 0.2 units, respectively) and in performance at higher recall values, while achieving a 7-10% improvement in P@{100,300,500} over Alt et al. (2019).

2 REDSandT

Given a bag of sentences $\{s_1, s_2, ..., s_n\}$ that concern a specific entity pair, REDSandT generates a probability distribution over the set of possible relations. REDSandT utilizes BERT's pre-trained LM to capture the semantic and syntactic features of sentences by transferring pre-trained common-sense knowledge. We extend BERT to handle multi-instance learning, and we fine-tune the model to classify the relation linking the entity pair given the associated sentences.

During fine-tuning, we employ a structured, RE-specific input to minimize architectural changes to the model (Radford and Salimans, 2018). Each sentence is adapted to a structured text, including the sentence's tokens connecting the entity pair (STP) along with the entities' types. We transform the input into a (sub-)word-level distributed representation using BPE and positional embeddings from BERT fine-tuned on our corpora. Then, we form the final sentence representation by concatenating the relation embedding and the sentence representation weighted with the relation embedding. Lastly, we use attention over the bag's sentences to shape the bag representation, which is then fed to a softmax layer to get the bag's relation distribution.

REDSandT can be summarized in three components, namely the sentence encoder, the bag encoder, and model training. Each component is described in detail in the following sections, with the overall architecture shown in Figures 1 and 2.

2.1 Sentence Encoder

Given a sentence $x$ and an entity pair $\langle h, t \rangle$, REDSandT constructs a distributed representation of the sentence by concatenating relation and instance embeddings. The overall sentence encoding is represented in Figure 1, with the following sections examining the sentence encoder parts in a bottom-up way.

Relation extraction requires a structured input that can sufficiently capture the latent relation between an entity pair and its surrounding text.
Our input representation encodes each sentence as a sequence of tokens, depicted at the very bottom of Figure 1. It starts with the head entity type and token(s) followed by the delimiter [H-SEP], continues with the tail entity type and token(s) followed by the delimiter [T-SEP], and ends with the token sequence of the sentence's STP path. The whole input starts and ends with the special delimiters [CLS] and [SEP], respectively. In BERT, [CLS] typically acts as a pooling token representing the whole sequence for downstream tasks, such as RE.

Figure 1: Sentence representation in REDSandT. The input embedding h to BERT is created by summing the positional and byte-pair embeddings for each token in the structured input. States $h_t$ are obtained by self-attending over the states of the previous layer $h_{t-1}$. The final sentence representation is obtained by concatenating the relation embedding $r_{ht}$ and the final fine-tuned BERT layer $h_L$ weighted with relation attention $\alpha_r$. Head and tail tokens participating in the relation embedding formation are marked with bold and dashed lines, respectively.

Several other sentence encodings were attempted, with the presented one performing best. (Trials included encoding all sentence tokens, STP tokens only, SDP (Xu et al., 2015) tokens only, using a common $\langle h, t \rangle$ delimiter, using a single delimiter between entities and STP, and removing entity type information.) Moreover, the ablation studies in Section 4.2 reveal the importance of encoding entities' types and of compressing the original sentence to the STP path presented below. Below, we briefly present how we form the sub-tree parse of the input and the entity types.

Sub-tree parse of input sentence: We utilize the sub-tree parse (STP) of the input sentence in order to reduce the noisy words within the sentence and focus on the relational tokens. Precisely, STP preserves the path of the sentence that connects the two entities with their least common ancestor (LCA)'s parent. Compared to other implementations (Liu et al., 2018), which shape the final STP sequence by re-assigning the participating tokens to their original sequence order, we preserve the tokens' order within the STP, achieving a grammatical normalization of the original sentence.
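The following is a minimal sketch of STP extraction with spaCy, under our reading of the text: keep the tokens that connect each entity to the parent of their least common ancestor (LCA) in the dependency tree, preserving the original word order. Function and variable names are illustrative, not the authors' released code, and single-token entities are assumed for brevity.

```python
# Illustrative STP extraction; assumes single-token entity mentions.
import spacy

nlp = spacy.load("en_core_web_sm")

def path_to_root(token):
    """Token, its head, ..., up to the root (the root is its own head in spaCy)."""
    path = [token]
    while path[-1].head.i != path[-1].i:
        path.append(path[-1].head)
    return path

def extract_stp(sentence, head_text, tail_text):
    doc = nlp(sentence)
    head = next(t for t in doc if t.text == head_text)
    tail = next(t for t in doc if t.text == tail_text)
    h_path, t_path = path_to_root(head), path_to_root(tail)
    t_ids = {t.i for t in t_path}
    lca = next(t for t in h_path if t.i in t_ids)  # least common ancestor
    keep = {lca.head.i}                            # include the LCA's parent
    for path in (h_path, t_path):
        for tok in path:
            keep.add(tok.i)
            if tok.i == lca.i:                     # stop once the paths merge
                break
    # emit tokens in their original sentence order, not path order
    return [tok.text for tok in doc if tok.i in keep]

print(extract_stp("James grew up in a quiet neighborhood of Boston.",
                  "James", "Boston"))
```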
Entity type special tokens: To the extent that every relation puts some constraint on the type of the participating entities (Liu et al., 2014; Vashishth et al., 2018), we incorporate the entity type in the model's structured input (see the bottom of Figure 1). Precisely, we incorporate 18 generic entity types, captured by recognizing the NYT-10 sentences' entities with the spaCy model (https://spacy.io/models/en). We consider these types KB-independent and easily accessible, with our experiments in Section 4.2 indicating that their inclusion improves performance.

The input embedding h to BERT is created by summing over the positional and byte-pair embeddings for each token in the structured input.

Byte-pair token encoding: To make use of sub-word information, we tokenize the input using byte-pair encoding (BPE) (Sennrich et al., 2016). In particular, we use the tokenizer of the pre-trained model (30,000 tokens), which we extend with 20 task-specific tokens (e.g., [H-SEP], [T-SEP], and the 18 entity type tokens). The added tokens serve a special meaning in the input representation and thus are not split into sub-words by the tokenizer.
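A sketch of the structured input and tokenizer extension, assuming HuggingFace transformers: [H-SEP] and [T-SEP] are the paper's delimiters, while the entity-type token spellings (e.g., [PERSON], [GPE]) and the helper name are our assumptions.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
entity_type_tokens = ["[PERSON]", "[GPE]", "[ORG]"]  # ... 18 spaCy types in total
tokenizer.add_special_tokens(
    {"additional_special_tokens": ["[H-SEP]", "[T-SEP]"] + entity_type_tokens})
# after loading the model: model.resize_token_embeddings(len(tokenizer))

def build_input(h_type, h_tokens, t_type, t_tokens, stp_tokens):
    """[CLS] h_type head [H-SEP] t_type tail [T-SEP] STP ... [SEP]"""
    text = " ".join([h_type, *h_tokens, "[H-SEP]",
                     t_type, *t_tokens, "[T-SEP]", *stp_tokens])
    # the tokenizer adds [CLS]/[SEP] itself and BPE-splits only ordinary tokens
    return tokenizer(text, max_length=64, truncation=True, padding="max_length")

enc = build_input("[PERSON]", ["James"], "[GPE]", ["Boston"],
                  ["James", "neighborhood", "of", "Boston"])
print(tokenizer.convert_ids_to_tokens(enc["input_ids"])[:12])
```

Because the added tokens are registered as special tokens, they map to single embedding rows and are never split into sub-words, matching the behavior described above.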
Positional encoding: Positional encoding is an essential part of BERT's attention mechanism. Precisely, BERT learns a unique position embedding to represent each of the input (sub-word) token positions within the sequence.

The input sequence is transformed into feature vectors ($h_L$) using BERT's pre-trained language model, fine-tuned on our task. In spite of the common practice of representing the sentence by the [CLS] vector in $h_L$ (Alt et al., 2019), we argue that not all words contribute equally to the sentence representation. By encoding the underlying relation as a function of the examined entities and by giving attention to vectors related to this underlying relation, we can further reduce sentence noise and improve precision. The core modules are the relation embedding, the entities-wise attention, and the relation attention. We examine them below.

Relation Embedding: We formulate relation embeddings using the TransE model (Bordes et al., 2013). TransE regards the embedding of the underlying relation $l$ as the distance (difference) between the $h$ and $t$ embeddings ($l_i = t_i - h_i$), assuming that a relation $r$ holds between an entity pair $(h, t)$. We shape the relation embedding for each sentence $i$ by applying a linear transformation on the head and tail entity vectors, activated through a Tanh layer to capture possible nonlinearities:

$$l_i = \tanh(w_l (t_i - h_i) + b_l) \qquad (1)$$

where $w_l$ is the underlying relation weight matrix and $b_l \in \mathbb{R}^{d_t}$ is the bias vector. We mark the relation embedding as $l$ because it represents the possible underlying relation between the two entities and not the actual relationship $r$. The head $h_i$ and tail $t_i$ embeddings reflect only the entities' related tokens, which we capture through simple entities-wise attention, shown below.

Entities-wise Attention: The head and tail embeddings participating in the relation embedding are created by summing over the respective token vectors from BERT's last layer $h_L$. We capture these tokens through head- and tail-wise attention. Head-wise attention assigns the weight $\alpha^h_{it}$ to focus on head-related tokens, and tail-wise attention assigns the weight $\alpha^t_{it}$ to focus on tail-related tokens:

$$\alpha^h_{it} = \begin{cases} 1 & \text{if } t = \text{head in STP tokens} \\ 0 & \text{otherwise} \end{cases} \qquad (2)$$

$$\alpha^t_{it} = \begin{cases} 1 & \text{if } t = \text{tail in STP tokens} \\ 0 & \text{otherwise} \end{cases} \qquad (3)$$

The head $h_i$ and tail $t_i$ embeddings are then shaped as follows:

$$h_i = \sum_{t=1}^{T} \alpha^h_{it} \cdot h_{it} \qquad (4)$$

$$t_i = \sum_{t=1}^{T} \alpha^t_{it} \cdot h_{it} \qquad (5)$$

Relation Attention: Even though REDSandT is trained on the STP, which naturally preserves only relational tokens, we want to further reduce any remaining noise at sentence-level. For this reason, we use relation attention to emphasize the sentence tokens that are mostly related to the underlying relation $l_i$. We calculate the relation attention $\alpha_r$ by comparing each sentence representation against the learned representation $l_i$ for each sentence $i$:

$$\alpha_r = \frac{\exp(s_i l_i)}{\sum_{j=1}^{n} \exp(s_j l_i)} \qquad (6)$$

Then, we weight BERT's last hidden layer $h_L \in \mathbb{R}^{d_h}$ with the relation embedding:

$$h'_L = \sum_{t=1}^{T} \alpha_r \cdot h_{it} \qquad (7)$$

Finally, the sentence representation $s_i \in \mathbb{R}^{2 d_h}$ is computed as the concatenation of the relation embedding $l_i$ and the sentence's weighted hidden representation $h'_L$:

$$s_i = [l_i ; h'_L] \qquad (8)$$

Several other representation techniques were tested, with the presented method outperforming them.
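To make Eqs. (1)-(8) concrete, below is a minimal PyTorch sketch of the sentence encoder under our reading of the paper: BERT's last layer $h_L$ is assumed precomputed, the 0/1 masks implement the entities-wise attention of Eqs. (2)-(5), and Eqs. (6)-(7) are taken as token-level attention. All names are illustrative, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class SentenceEncoder(nn.Module):
    def __init__(self, d_h=768):
        super().__init__()
        self.w_l = nn.Linear(d_h, d_h)  # relation projection w_l, b_l (Eq. 1)

    def forward(self, h_L, head_mask, tail_mask):
        """h_L: (T, d_h) token states; masks: (T,) floats, 1 on head/tail tokens."""
        h = (head_mask.unsqueeze(1) * h_L).sum(dim=0)         # Eq. 4: head embedding
        t = (tail_mask.unsqueeze(1) * h_L).sum(dim=0)         # Eq. 5: tail embedding
        l = torch.tanh(self.w_l(t - h))                       # Eq. 1: TransE-style l_i
        alpha_r = torch.softmax(h_L @ l, dim=0)               # Eq. 6: relation attention
        h_weighted = (alpha_r.unsqueeze(1) * h_L).sum(dim=0)  # Eq. 7: weighted h_L
        return torch.cat([l, h_weighted])                     # Eq. 8: s_i, shape (2*d_h,)

encoder = SentenceEncoder()
h_L = torch.randn(64, 768)                     # 64 sub-word states from BERT
head_mask, tail_mask = torch.zeros(64), torch.zeros(64)
head_mask[1], tail_mask[5] = 1.0, 1.0          # hypothetical entity positions
s_i = encoder(h_L, head_mask, tail_mask)       # (1536,)
```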
2.2 Bag Encoder

Bag encoding, i.e., the aggregation of the sentence representations in a bag, aims to reduce the noise generated by the erroneously annotated relations that accompany DS. Assuming that not all sentences contribute equally to the bag representation, we use selective attention (Lin et al., 2016) to emphasize the sentences that better express the underlying relation:

$$B = \sum_{i} \alpha_i s_i \qquad (9)$$

As seen, selective attention represents the bag as a weighted sum of the individual sentences. The attention $\alpha_i$ is calculated by comparing each sentence representation against a learned representation $r$:

$$\alpha_i = \frac{\exp(s_i r)}{\sum_{j=1}^{n} \exp(s_j r)} \qquad (10)$$

Finally, the bag representation $B$ is fed to a softmax classifier to obtain the probability distribution over the relations:

$$p(r) = \mathrm{Softmax}(W_r \cdot B + b_r) \qquad (11)$$

where $W_r$ is the relation weight matrix and $b_r \in \mathbb{R}^{d_r}$ is the bias vector.
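A matching PyTorch sketch of Eqs. (9)-(11), again under stated assumptions: $r$ is modeled as a learned query vector and a linear layer realizes $W_r$ and $b_r$; the dimensions follow the concatenated sentence representation and NYT-10's 53 relations.

```python
import torch
import torch.nn as nn

class BagEncoder(nn.Module):
    def __init__(self, d_s=1536, n_relations=53):
        super().__init__()
        self.r = nn.Parameter(torch.randn(d_s))        # learned representation r
        self.classifier = nn.Linear(d_s, n_relations)  # W_r, b_r (Eq. 11)

    def forward(self, S):
        """S: (n_sentences, d_s) representations of one bag's sentences."""
        alpha = torch.softmax(S @ self.r, dim=0)       # Eq. 10: selective attention
        B = (alpha.unsqueeze(1) * S).sum(dim=0)        # Eq. 9: bag representation
        return torch.softmax(self.classifier(B), dim=-1)  # Eq. 11: p(r)

bag = BagEncoder()
p = bag(torch.randn(4, 1536))  # a bag of 4 sentences -> distribution over 53 relations
```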
2.3 Model Training

REDSandT utilizes a transformer model, precisely BERT, which is fine-tuned on our specific setup to capture the semantic features of relational sentences. Below, we present the overall process.

For our experiments, we use the pre-trained bert-base-cased language model (Devlin et al., 2018), which consists of 12 layers, 12 attention heads, and 110M parameters, with each layer being a bidirectional Transformer encoder (Vaswani et al., 2017). The model is trained on the cased English text of BooksCorpus and Wikipedia, with a total of 800M and 2,500M words, respectively. BERT is pre-trained using two unsupervised tasks: masked LM and next sentence prediction, with masked LM being its core novelty, as it enables the previously impossible bidirectional training.
We initialize REDSandT's weights with the pre-trained BERT model, and we fine-tune its last 4 layers under the multi-instance learning setting presented in Figure 2, given the specific input shown in Figure 1. We settled on fine-tuning only the last four layers after experimentation. During fine-tuning, we optimize the following objective:

$$L(D) = \sum_{i=1}^{|B|} \log P(l_i \mid B_i; \theta) \qquad (12)$$

i.e., over all entity-pair bags $|B|$ in the dataset, we want to maximize the probability of correctly predicting each bag's relation given its sentences' representations and the parameters $\theta$.
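A minimal sketch, assuming HuggingFace transformers, of wiring together the training choices described here and in the experimental settings below: initialize from bert-base-cased, freeze everything except the last four encoder layers, and optimize a class-weighted cross-entropy with Adam under a cosine schedule with warm-up. The learning-rate exponent and the step counts are assumptions; only the choices named in the text are the paper's.

```python
import torch
from transformers import BertModel, get_cosine_schedule_with_warmup

bert = BertModel.from_pretrained("bert-base-cased")
for name, param in bert.named_parameters():
    # train only encoder layers 8-11, i.e., the last four of twelve
    param.requires_grad = any(f"encoder.layer.{i}." in name for i in (8, 9, 10, 11))

optimizer = torch.optim.Adam(
    (p for p in bert.parameters() if p.requires_grad), lr=2e-5)  # exponent assumed
total_steps = 10_000  # in practice: (number of bags / batch size) * epochs
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.001 * total_steps),  # warm-up over 0.1% of updates
    num_training_steps=total_steps)
class_weights = torch.ones(53)  # per-relation weights for the unbalanced classes
loss_fn = torch.nn.CrossEntropyLoss(weight=class_weights)
```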
3 Experiments

We conduct experiments on the widely used benchmark dataset NYT-10 (Riedel et al., 2010), which was built by aligning triples in Freebase to the NYT corpus and contains 53 relations. There are 522,611 (172,448) sentences, 281,270 (96,678) entity pairs, and 18,252 (1,950) relation mentions in the train (test) set. We provide an enhanced dataset, NYT-10-enhanced, including both the STP and SDP versions of the input sentences as well as the head and tail entity types, to facilitate future implementations.
In our experiments, we utilize the bert-base-cased model with hidden layer dimension $D_h = 768$, and we fine-tune the model with a maximum sequence length of $D_t = 64$. We tune the model's hyper-parameters manually on the training set, based on the AUC score, selecting a batch size of 32, 3 training epochs, fine-tuning of the last 4 BERT layers (against fine-tuning all layers), a learning rate of 2e-5, and the classifier dropout and weight decay, each chosen over a small grid of candidate values. Moreover, we fine-tune our model using the Adam optimization scheme (Kingma and Lei Ba, 2015) with a cosine learning rate decay schedule and warm-up over 0.1% of the training updates. We minimize the loss using a cross-entropy criterion weighted on the dataset's classes, to handle the unbalanced training set. Experiments were conducted on a PC with 32 GB RAM, an Intel i7-7800X CPU @ 3.50GHz, and an NVIDIA GeForce GTX 1080 with 8 GB.

Figure 2: Transformer architecture (left) and training framework (right). Sentence representation $s_i$ is formed as shown in Figure 1.

3.3 State-of-the-art Models

For evaluating REDSandT, we compare against the following state-of-the-art models:
Mintz (Mintz et al., 2009): A multi-class logistic regression model under the distant supervision setting.

PCNN+ATT (Lin et al., 2016): A CNN model with instance-level attention.

RESIDE (Vashishth et al., 2018): A NN model that uses several types of side information (entity types, relational phrases) and employs a Graph-CNN to capture the syntactic information of instances.

DISTRE (Alt et al., 2019): A transformer model, GPT, fine-tuned for RE with an auxiliary objective under the distant supervision setting.
Figure 3: Precision-Recall curves.
4 Results

Figure 3 compares the precision-recall curves of REDSandT against the state-of-the-art models. We observe that: (1) The NN-based approaches outperform the probabilistic method (Mintz), showing the limitation of human-designed features against neural networks' automatically extracted features. (2) RESIDE, DISTRE, and REDSandT achieve better performance than PCNN+ATT, which, even though it exhibits the highest precision at the beginning, soon declines abruptly. This reveals the importance of both side information (i.e., entity types and relation aliases) and transferred knowledge. (3) RESIDE performs best at low recalls and generally performs well, which we attribute to the multitude of side information given. (Compared to our 18 KB-independent entity types, its authors use 38 Freebase-specific entity types.) (4) Although DISTRE exhibits 3.5% greater precision at medium-level recalls, it presents 2-12% lower precision at low recall values.
RE methods    AUC    P@100  P@300  P@500
Mintz         0.107  52.3   45.0   39.7
PCNN+ATT      0.341  73.0   67.3   63.6
RESIDE        0.415  81.8   74.3   69.7
DISTRE        0.422  68.0   65.3   65.0
REDSandT      0.424
Table 1: AUC and P@N evaluation results. P@N represents the precision calculated for the top N rated relation instances.

REDSandT follows a steady, downward trend, acting similarly to RESIDE at low and medium recalls and surpassing all baselines at very high recall values. We believe the reason is that we use potential label information both as an additional feature and as attention over the instance tokens. The learned label embeddings are of high quality, since they carry common knowledge from the pre-trained model fine-tuned on the specific dataset and task. Moreover, the chosen pre-trained model, BERT, considers bidirectionality while training and is thus able to efficiently capture the head and tail interaction.

Table 1, which presents the AUC and the precision at various points of the P-R curve, reveals our model's precision performance to be between that of RESIDE and DISTRE, while preserving the state-of-the-art AUC. Precisely, REDSandT's precision does not exceed RESIDE's, even though it comes close, which suggests that additional side information would further improve our model. Meanwhile, REDSandT surpasses DISTRE's precision, which we attribute to our selected pre-trained model that efficiently captures label embeddings. Consequently, our model is more consistent across the various points of the P-R curve.

Table 2 shows the distribution over relation types for the top 300 predictions of REDSandT and the baseline models. REDSandT encompasses 10 distinct relation types, two of which (/place_founded, /geographic_distribution) are recognized by none of the other models. PCNN+ATT's predictions are highly biased towards a set of only four relation types, while RESIDE captures three additional types. DISTRE and REDSandT manage to recognize more types than all other models, emphasizing the contribution of transferred knowledge. Moreover, REDSandT correctly does not predict the /location/country/capital relation that DISTRE does; DISTRE's authors found most errors in their manual evaluation to arise from this specific predicted relation.

Relation               red   dis   res   pcnn
/location/contains     176   168   182   214
/person/company         38    31    26    19
/person/nationality     26    32    65    59
/admin_div/country      25    13    12     6
/neighborhood_of        22    10     3     2
/person/children         5     -     6     -
/team/location           4     2     -     -
/founders                2     2     6     -
/place_founded           1     -     -     -
/geo_distribution        1     -     -     -
/country/capital         -    17     -     -
/person/place_lived      -    22     -     -

Table 2: Relation distribution over the top 300 predictions of the PCNN+ATT (pcnn), RESIDE (res), DISTRE (dis), and REDSandT (red) models.

Meanwhile, we highlight REDSandT's effectiveness in recognizing relations in the long-tail. Particularly, our model captures the founders (1.47%), neighborhood_of (1.06%), person/children (0.47%), and sports team/location (0.16%) relations. Relations are listed in descending order with regard to their population in the test set, with the respective percentages given in parentheses.
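For clarity, the P@N figures used throughout this section can be computed as follows: rank all predicted relation instances by confidence and measure the precision over the top N. A small, purely illustrative helper:

```python
import numpy as np

def precision_at_n(scores, labels, n):
    """scores: confidence per predicted relation instance;
    labels: 1 if the prediction matches a ground-truth fact, else 0."""
    order = np.argsort(scores)[::-1]           # highest confidence first
    return np.asarray(labels)[order][:n].mean()

scores = np.array([0.9, 0.8, 0.75, 0.6, 0.5])
labels = np.array([1, 1, 0, 1, 0])
print(precision_at_n(scores, labels, 3))       # 2/3 ~ 0.667
```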
Metrics              AUC   P@100  P@200  P@300
REDSandT w/o r_ht
REDSandT w/o ET
REDSandT w. SDP
REDSandT w/o a_r
REDSandT

Table 3: AUC and P@N evaluation results of the variant models on the NYT-10 dataset.
4.2 Ablation Study

To assess the effectiveness of the different modules of REDSandT, we create four ablation models:
REDSandT w/o ET: Removes entity types from the input sentence representation.

REDSandT w/o r_ht: Removes the relation embedding and the relation attention. We represent the sentence using the [CLS] token of BERT's last hidden layer $h_L$.

REDSandT w/o a_r: Removes the relation attention over instance tokens.

REDSandT w. SDP: Replaces STP with SDP (Xu et al., 2015) in the sentence encoding.

As shown in Table 3, all modules contribute to the final model's effectiveness. The greatest impact comes from the relation embeddings, with their removal resulting in the largest AUC (2 units) and P@300 (5.3%) drops. Meanwhile, P@100 goes up to 80%, with an inspection of the top 300 predictions revealing a focus on only 5 relation types, of which /location/contains makes up 79%. The simple integration of entity types in the input representation is the next most important feature that boosts our model. Next, "REDSandT w. SDP" shows STP's superiority, while a manual inspection of the model's top 300 predictions proves SDP's weakness in recognizing relations in the long tail, with the focus given to the /person/nationality relation. Finally, removing the relation attention over instance tokens exhibits the least effect in AUC (0.002) and precision.

Figure 4: Relation attention weights for the children (top) and neighborhood_of (bottom) long-tail relations.
Figure 4 shows a visualization of the relation attention weights for two long-tail relations, highlighting the different parts of the sentence that drive relation extraction. In both cases, we see that the special tokens preserve important information, while the entity type is also given more weight than the entity itself. Moreover, we see which tokens affect the relation most. The tokens "girlfriend", "son", and the repetition of the name "James" are predictive of the children relation, while the tokens "neighborhood", "was", and "in", along with a GPE entity type, indicate a probable neighborhood_of relation.
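An illustrative recipe for a Figure-4-style inspection renders the relation attention weights $\alpha_r$ over the structured-input tokens as a heat strip; the tokens and weights below are dummy values, not the paper's.

```python
import matplotlib.pyplot as plt

def plot_relation_attention(tokens, weights):
    """tokens: list of (sub-)word strings; weights: matching attention values."""
    fig, ax = plt.subplots(figsize=(0.6 * len(tokens), 1.2))
    ax.imshow([weights], cmap="Reds", aspect="auto")  # one row, one cell per token
    ax.set_xticks(range(len(tokens)))
    ax.set_xticklabels(tokens, rotation=45, ha="right")
    ax.set_yticks([])
    plt.tight_layout()
    plt.show()

plot_relation_attention(
    ["[CLS]", "[PERSON]", "James", "[H-SEP]", "[GPE]", "Boston", "[T-SEP]",
     "neighborhood", "of", "Boston", "[SEP]"],
    [0.05, 0.12, 0.06, 0.10, 0.14, 0.08, 0.09, 0.18, 0.08, 0.06, 0.04])
```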
5 Related Work

Our work is related to distant supervision, neural relation extraction (mainly pre-trained LMs), sub-tree parsing of the input, label embeddings, and entity-type side information.
Distant Supervision: DS plays a key role in RE, as it satisfies its need for extensive training data easily and inexpensively. The use of DS (Craven and Kumlien, 1999; Snow et al., 2005) to generate large training data for RE was proposed by Mintz et al. (2009), who assumed that all sentences that include an entity pair which exhibits a relationship in a KB express the same relation. However, this assumption comes with noisy labels, especially when the KB is not directly related to the domain at hand. Multi-instance learning methods were proposed to alleviate the issue by conducting relation classification at the bag level, with a bag including the instances that mention the same entity pair (Riedel et al., 2010; Hoffmann et al., 2011).
Neural Relation Extraction: While the performance of the above approaches heavily relies on handcrafted features (POS tags, named entity tags, morphological features, etc.), the advent of neural networks in RE set the focus on model architecture. Zeng et al. (2014) propose a CNN-based method to automatically capture the semantics of sentences, while PCNN (Zeng et al., 2015) became the common architecture to embed sentences. PCNN is used in several approaches that handle DS noisy patterns, such as intra-bag attention (Lin et al., 2016), inter-bag attention (Ye and Ling, 2019), soft labeling (Liu et al., 2017; Wang et al., 2018), and adversarial training (Wu et al., 2018; Qin et al., 2018). Moreover, Graph-CNNs proved an effective way to encode syntactic information from text (Vashishth et al., 2018).

The latest development of pre-trained LMs relying on the transformer architecture (Vaswani et al., 2017) has been shown to capture semantic and syntactic features better (Radford and Salimans, 2018). Howard and Ruder (2018) found that they significantly improve text classification performance, prevent overfitting, and increase sample efficiency. Shi and Lin (2019) fine-tuned BERT (Devlin et al., 2018) on the TACRED dataset, showing that simple NNs built on top of BERT improve performance. Meanwhile, Alt et al. (2019) extended GPT (Radford and Salimans, 2018) to the DS setting by incorporating a multi-instance training mechanism, proving that pre-trained LMs provide a stronger signal for DS than specific linguistic and side-information features (Vashishth et al., 2018).
Side Information: Apart from model architecture, several methods propose additional information to further reduce noise. Vashishth et al. (2018) use relation phrases and incorporate Freebase entity types, achieving state-of-the-art precision at higher recall values, while Ji (2017) and Hu et al. (2019) use entity descriptors to enhance entity and label embeddings, respectively.
Sub-Parses of Input: Xu et al. (2015) showed the importance of the shortest dependency path (SDP) in reducing words irrelevant to RE. Liu et al. (2018) further reduce the noise within sentences by preserving the sub-path of the sentence that connects the two entities with their least common ancestor's parent (STP). In contrast with Liu et al. (2018), who shape the final STP sequence by re-assigning the participating tokens to their original sequence order, we preserve the tokens' order within the STP to maintain the emerging grammatical information.
Label Embedding: Label embeddings aim to embed labels in the same space as word vectors. The idea comes from computer vision, with Wang et al. (2018) introducing it in text classification and Hu et al. (2019) using it as an attention mechanism over relational tokens in distantly-supervised RE. We make use of the TransE model (Bordes et al., 2013) to shape label embeddings as the entities' distance in BERT's vector space, and we show that their use, both as a feature and as attention over sentences, significantly improves RE.
6 Conclusion

We presented a novel transformer-based relation extraction model for distant supervision. REDSandT manages to acquire highly informative instance and label embeddings and is efficient at handling the noisy labeling problem of DS. REDSandT captures highly informative embeddings for RE by fine-tuning BERT on an RE-specific structured input that focuses solely on relational arguments, including the sub-tree connecting the entities along with the entities' types. It then utilizes these vectors to encode label embeddings, which are also used as an attention mechanism over instances to reduce the effect of less-informative tokens. Finally, relation extraction occurs at bag-level by concatenating label and weighted instance embeddings. Extensive experiments on the NYT-10 dataset illustrate REDSandT's effectiveness over the existing baselines in the current literature. Precisely, REDSandT manages to recognize relations that other methods fail to detect, including relations in the long-tail. Future work includes an investigation of whether, and to what extent, additional information such as entity descriptors influences REDSandT's performance, as well as whether the special token embeddings can act as global embeddings for RE.
References
Christoph Alt, Marc Hübner, and Leonhard Hennig. 2019. Fine-tuning pre-trained transformer language models to distantly supervised relation extraction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1388–1398. Association for Computational Linguistics.

Antoine Bordes, Nicolas Usunier, Alberto García-Durán, Jason Weston, and Oksana Yakhnenko. 2013. Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems, pages 2787–2795.

Mark Craven and Johan Kumlien. 1999. Constructing biological knowledge bases by extracting information from text sources. In Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology, pages 77–86.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Zhengqiu He, Wenliang Chen, Zhenghua Li, Meishan Zhang, Wei Zhang, and Min Zhang. 2018. SEE: Syntax-aware entity embedding for neural relation extraction. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, pages 5795–5802.

Raphael Hoffmann, Congle Zhang, Xiao Ling, Luke Zettlemoyer, and Daniel S. Weld. 2011. Knowledge-based weak supervision for information extraction of overlapping relations. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, volume 1, pages 541–550.

Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Long Papers), pages 328–339.

Linmei Hu, Luhao Zhang, Chuan Shi, Liqiang Nie, Weili Guan, and Cheng Yang. 2019. Improving distantly-supervised relation extraction with joint label embedding. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3821–3829.

Guoliang Ji, Kang Liu, Shizhu He, and Jun Zhao. 2017. Distant supervision for relation extraction with sentence-level attention and entity descriptions. In Thirty-First AAAI Conference on Artificial Intelligence.

Diederik P. Kingma and Jimmy Lei Ba. 2015. Adam: A method for stochastic optimization. In ICLR.

Yankai Lin, Shiqi Shen, Zhiyuan Liu, Huanbo Luan, and Maosong Sun. 2016. Neural relation extraction with selective attention over instances. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2124–2133.

Tianyi Liu, Xinsong Zhang, Wanhao Zhou, and Weijia Jia. 2018. Neural relation extraction via inner-sentence noise reduction and transfer learning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2195–2204. Association for Computational Linguistics.

Tianyu Liu, Kexiang Wang, Baobao Chang, and Zhifang Sui. 2017. A soft-label method for noise-tolerant distantly supervised relation extraction. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1790–1795.

Yang Liu, Kang Liu, Liheng Xu, and Jun Zhao. 2014. Exploring fine-grained entity type constraints for distantly supervised relation extraction. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 2107–2116.

Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. 2009. Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 1003–1011. Association for Computational Linguistics.

Pengda Qin, Weiran Xu, and William Yang Wang. 2018. DSGAN: Generative adversarial training for distant supervision relation extraction. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, pages 496–505.

Alec Radford and Tim Salimans. 2018. Improving language understanding by generative pre-training. OpenAI Technical Report, https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf, pages 1–12.

Sebastian Riedel, Limin Yao, and Andrew McCallum. 2010. Modeling relations and their mentions without labeled text. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, volume 6323 of LNAI, pages 148–163.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725.

Peng Shi and Jimmy Lin. 2019. Simple BERT models for relation extraction and semantic role labeling. arXiv preprint arXiv:1904.05255.

Rion Snow, Daniel Jurafsky, and Andrew Y. Ng. 2005. Learning syntactic patterns for automatic hypernym discovery. In Advances in Neural Information Processing Systems, pages 1297–1304.

Mihai Surdeanu, Julie Tibshirani, Ramesh Nallapati, and Christopher D. Manning. 2012. Multi-instance multi-label learning for relation extraction. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 455–465.

Shikhar Vashishth, Rishabh Joshi, Sai Suman Prayaga, Chiranjib Bhattacharyya, and Partha Talukdar. 2018. RESIDE: Improving distantly-supervised neural relation extraction using side information. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1257–1266.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Guoyin Wang, Chunyuan Li, Wenlin Wang, Yizhe Zhang, Dinghan Shen, Xinyuan Zhang, Ricardo Henao, and Lawrence Carin. 2018. Joint embedding of words and labels for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2321–2331. Association for Computational Linguistics.

Yi Wu, David Bamman, and Stuart Russell. 2018. Adversarial training for relation extraction. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1778–1783.

Yan Xu, Lili Mou, Ge Li, Yunchuan Chen, Hao Peng, and Zhi Jin. 2015. Classifying relations via long short term memory networks along shortest dependency paths. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1785–1794. Association for Computational Linguistics.

Zhi-Xiu Ye and Zhen-Hua Ling. 2019. Distant supervision relation extraction with intra-bag and inter-bag attentions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2810–2819. Association for Computational Linguistics.

Daojian Zeng, Kang Liu, Yubo Chen, and Jun Zhao. 2015. Distant supervision for relation extraction via piecewise convolutional neural networks. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1753–1762.

Daojian Zeng, Kang Liu, Siwei Lai, Guangyou Zhou, and Jun Zhao. 2014. Relation classification via convolutional deep neural network. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers.