An Improved Baseline for Sentence-level Relation Extraction
Wenxuan Zhou
University of Southern California [email protected]
Muhao Chen
University of Southern California [email protected]
Abstract
Sentence-level relation extraction (RE) aims at identifying the relationship between two entities in a sentence. Many efforts have been devoted to this problem, while the best performing methods are still far behind human performance. In this paper, we revisit two aspects of RE models that are not thoroughly studied, namely ENTITY REPRESENTATION and NA INSTANCE PREDICTION. Our improved baseline model, incorporating entity representations with typed markers and confidence-based classification for enhanced NA instance detection, achieves an F1 of 75.0 on TACRED, significantly outperforming previous SOTA methods.

1 Introduction

Relation extraction (RE) aims at identifying the relationship between two entities in a given piece of text from a pre-defined set of relationships of interest. It is one of the fundamental information extraction (IE) tasks. For example, given the sentence "Bill Gates founded Microsoft together with his friend Paul Allen in 1975" and an entity pair ("Bill Gates", "Microsoft"), an RE system is supposed to predict the relation to be
ORG:FOUNDED_BY. Pre-trained language models, such as BERT (Devlin et al., 2019), have gained significant success on this task, achieving SOTA performance.

Many recent efforts on RE focus on injecting external knowledge into language models. ERNIE (Zhang et al., 2019) and KnowBERT (Peters et al., 2019) take pre-trained entity embeddings as input to the transformer. K-Adapter (Wang et al., 2020) introduces a plug-in neural adapter that injects factual and linguistic knowledge into the language model. LUKE (Yamada et al., 2020) extends the pre-training objective of masked language modeling to entities and proposes an entity-aware self-attention mechanism. Another line of work focuses on fine-tuning pre-trained language models on entity-linked text using RE-oriented objectives. Specifically, BERT-MTB (Baldini Soares et al., 2019) proposes a matching-the-blanks objective that decides whether two relation instances share the same entities.

Despite being extensively studied, the performance of existing RE classifiers is still far from satisfactory. For example, on the TACRED dataset (Zhang et al., 2017), the SOTA result only increases from 70.1 (BERT_LARGE) to 72.7 (LUKE) after the application of pre-trained language models to this task. In this paper, we revisit existing methods and locate two overlooked aspects in previous work. First, relation instances contain several aspects of entity information, including entity names, spans, and types, while existing methods do not capture all of them in the input. Moreover, due to the inevitable limitation of the pre-defined ontology, it is natural for a portion of relation instances not to satisfy any known relation in the ontology; such instances are labeled as NA. These NA instances may be of varied semantics in practice, while existing methods typically model them as a single NA class and try to group them into the same cluster. To this end, we propose an improved baseline for sentence-level RE that incorporates techniques addressing the aforementioned problems.
First, we introduce the typed entity marker (Zhong and Chen, 2020), which leads to a promising improvement in performance over existing entity representation techniques. Second, we use confidence-based classification to model the NA instances. Instead of grouping them into the same cluster, we let the model output a confidence score of whether a relationship exists and return NA if the score is lower than a threshold. Our improved baseline achieves an F1 of 75.0 on the TACRED dataset, exceeding the SOTA result by 2.3. We also show that confidence-based classification is effective on the SemEval-2010 Task 8 dataset (Hendrickx et al., 2019).

2 Method

In this section, we first formally define the sentence-level RE task in Section 2.1, and then present our model architecture and entity representation techniques in Section 2.2 and Section 2.3. Finally, we evaluate whether our model can generalize to unseen entities in Section 2.4.
2.1 Problem Definition

In this paper, we focus on sentence-level RE. Specifically, given a sentence x and an entity pair (e_s, e_o), the task of sentence-level RE is to predict the relationship r between e_s and e_o from R ∪ {NA}, where R is a pre-defined set of relationships of interest, and e_s, e_o are identified as the subject and object entity, respectively. Entity pairs that do not express a relation from R are labeled NA.

2.2 Model Architecture

Our RE classifier is an extension of previous transformer-based (Devlin et al., 2019; Liu et al., 2019) RE models (Baldini Soares et al., 2019). Given the input sentence x, we first mark the entity spans and entity types using the techniques in Section 2.3, then feed the processed sentence into a pre-trained language model to obtain its contextual embedding. Finally, we take the first hidden states of the subject and object entities in the language model's last layer, i.e., h_subj and h_obj, as the input to the classifier:

    z = ReLU(W_p [h_subj; h_obj]),
    P(r = r̂) = exp(W_r̂ z) / Σ_{r′ ∈ R ∪ {NA}} exp(W_{r′} z),

where W_p ∈ R^{d×2d} and W_r̂ ∈ R^d are model parameters. In inference, the classifier returns the relation with the highest probability.

2.3 Entity Representation

For sentence-level RE, the spans and types of entities are provided and need to be represented in the input. Recent work (Baldini Soares et al., 2019; Zhong and Chen, 2020) shows that both types of information improve the performance of RE classifiers. In this paper, we evaluate the following representation methods on TACRED (Zhang et al., 2017):
• Entity mask (Zhang et al., 2017): This method uses a special [SUBJ-NER] or [OBJ-NER] token to replace the subject or object entity in the original sentence, where NER is the corresponding entity type.

• Entity marker (Baldini Soares et al., 2019): This method introduces special tokens [E1], [/E1], [E2], [/E2] to indicate the entity spans, and modifies the sentence to "[E1] SUBJ [/E1] ... [E2] OBJ [/E2]".

• Entity marker (punct) (Wang et al., 2020; Zhou et al., 2021): This method is a variant of the entity marker that marks the entity spans by modifying the original sentence to "@ SUBJ @ ... # OBJ #".

• Typed entity marker (Zhong and Chen, 2020): This method introduces special tokens ⟨S:NER⟩, ⟨/S:NER⟩, ⟨O:NER⟩, ⟨/O:NER⟩, and modifies the sentence to "⟨S:NER⟩ SUBJ ⟨/S:NER⟩ ... ⟨O:NER⟩ OBJ ⟨/O:NER⟩", where NER is the corresponding entity type.

• Typed entity marker (punct):
We propose a variant of the typed entity marker that marks the span and type of entities without introducing new special tokens. We enclose the subject entity with "@" and the object entity with "#", and prepend the entity types, marked with "*" for subjects and "∧" for objects, to the entity spans. The modified sentence is "@ * SUBJ-NER * SUBJ @ ... # ∧ OBJ-NER ∧ OBJ #", where SUBJ-NER and OBJ-NER are the corresponding entity types.
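As a concrete illustration, the entity representation methods above can be written as simple token-list rewrites. This is a hypothetical sketch, not the authors' released code: the function name, the end-inclusive (start, end) span convention, plain "^" in place of the paper's "∧", and ASCII "<S:NER>"-style strings in place of the paper's angle-bracket tokens are our own assumptions.

```python
# Illustrative sketch of the entity representation methods in Section 2.3.
# Not the authors' implementation: names and conventions are assumptions.

def mark_entities(tokens, subj_span, subj_type, obj_span, obj_type, method):
    """Return a new token list with the subject/object entities marked."""
    st, ot = subj_type, obj_type
    formats = {
        # method: (subj_left, subj_right, obj_left, obj_right, mask?)
        "entity_mask": (None, None, None, None, True),
        "entity_marker": (["[E1]"], ["[/E1]"], ["[E2]"], ["[/E2]"], False),
        "entity_marker_punct": (["@"], ["@"], ["#"], ["#"], False),
        "typed_entity_marker": (
            [f"<S:{st}>"], [f"</S:{st}>"], [f"<O:{ot}>"], [f"</O:{ot}>"], False),
        "typed_entity_marker_punct": (
            ["@", "*", st.lower(), "*"], ["@"],
            ["#", "^", ot.lower(), "^"], ["#"], False),
    }
    sl, sr, ol, orr, mask = formats[method]

    def rewrite(toks, span, left, right, repl):
        s, e = span  # end-inclusive token indices
        inner = repl if repl is not None else toks[s:e + 1]
        return toks[:s] + (left or []) + inner + (right or []) + toks[e + 1:]

    # Rewrite the later span first so the earlier indices stay valid.
    jobs = [(subj_span, sl, sr, [f"[SUBJ-{st}]"] if mask else None),
            (obj_span, ol, orr, [f"[OBJ-{ot}]"] if mask else None)]
    for span, left, right, repl in sorted(jobs, key=lambda j: -j[0][0]):
        tokens = rewrite(tokens, span, left, right, repl)
    return tokens

# The example sentence of Table 1:
sent = ["Bill", "was", "born", "in", "Seattle", "."]
masked = mark_entities(sent, (0, 0), "PERSON", (4, 4), "CITY", "entity_mask")
# masked == ["[SUBJ-PERSON]", "was", "born", "in", "[OBJ-CITY]", "."]
```

Rewriting the later span before the earlier one is the one subtle point: inserting marker tokens shifts all subsequent indices, so processing right-to-left keeps the untouched span's indices valid.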
Method                       Input Example                                                          BERT_BASE  BERT_LARGE  RoBERTa_LARGE
Entity mask                  [SUBJ-PERSON] was born in [OBJ-CITY] .                                 69.6       70.6        60.9
Entity marker                [E1] Bill [/E1] was born in [E2] Seattle [/E2] .                       68.4       69.7        70.7
Entity marker (punct)        @ Bill @ was born in # Seattle # .
Typed entity marker          [E1-PERSON] Bill [/E1-PERSON] was born in [E2-CITY] Seattle [/E2-CITY] .
Typed entity marker (punct)  @ * person * Bill @ was born in # ∧ city ∧ Seattle # .

Table 1: Test F1 (in %) of different entity representation methods on TACRED. For each method, we also provide the processed input of the example sentence "Bill was born in Seattle". Typed entity marker (original and punct) significantly outperforms other methods.

We report the test F1 in Table 1, from which we have two observations. (1) Our model achieves the best F1 with the typed entity marker (punct), which is significantly higher than the SOTA result of 72.7 (Yamada et al., 2020); it shows that all types of entity information are helpful to the RE task. (2) Introducing new special tokens deteriorates model performance on RoBERTa: the entity marker underperforms the entity marker (punct), the typed entity marker underperforms the typed entity marker (punct), while the entity mask gets a much worse result of 60.9.

2.4 Generalization to Unseen Entities

Some previous work (Zhang et al., 2018) claims that entity names may leak superficial cues of the relation types, allowing heuristics that hack the benchmark. They show that a neural RE classifier can achieve high evaluation results based only on the input subject and object entities, without the original sentence. They also suggest that RE classifiers trained without entity masks may not generalize to unseen entities. However, as the entity types provided in RE datasets are usually coarse-grained, using entity masks may lead to the loss of entity information. Using entity masks also contradicts existing methods of injecting entity knowledge into RE classifiers (Zhang et al., 2019; Peters et al., 2019; Wang et al., 2020): if RE classifiers should not consider entity names, they could not be improved by external knowledge bases.

To evaluate whether our RE classifier can generalize to unseen entities, we propose a filtered evaluation setting. Specifically, we remove all test instances containing entities from the training set, resulting in a filtered test set of 4,599 instances that only contain sentences with entities unseen during training. We present the evaluation results on the filtered test set in Table 2.
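The construction of this filtered test set can be sketched as below. The dict-based instance format and the "subj"/"obj" field names are our own illustrative assumptions, not TACRED's actual schema.

```python
# Hypothetical sketch of the filtered evaluation setting of Section 2.4:
# drop every test instance whose subject or object entity also appears
# as an entity in the training set. The instance format (dicts holding
# entity strings) is our own assumption, not TACRED's schema.

def build_filtered_test_set(train, test):
    seen = set()  # all entity strings observed during training
    for inst in train:
        seen.add(inst["subj"])
        seen.add(inst["obj"])
    # Keep only instances whose entities are both unseen in training.
    return [inst for inst in test
            if inst["subj"] not in seen and inst["obj"] not in seen]

train = [{"subj": "Bill Gates", "obj": "Microsoft"}]
test = [{"subj": "Bill Gates", "obj": "Seattle"},
        {"subj": "Paul Allen", "obj": "Seattle"}]
filtered = build_filtered_test_set(train, test)
# filtered == [{"subj": "Paul Allen", "obj": "Seattle"}]
```

Only the second test instance survives the filter, since "Bill Gates" already occurs as a training entity.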
Note that as the label distributions of the original and filtered test sets are different, their results may not be directly comparable. Still, Typed Entity Marker consistently outperforms Entity Mask, which shows that RE classifiers can learn from entity names and generalize to unseen entities. Our finding is consistent with Peng et al. (2020), which suggests that entity mentions can provide more information than entity types to improve the RE classifier.

Model                                         Test F1
BERT_BASE + entity mask                       75.4
BERT_LARGE + entity mask                      75.8
RoBERTa_LARGE + entity mask                   69.7
BERT_BASE + typed entity marker               75.9
BERT_LARGE + typed entity marker              76.8
RoBERTa_LARGE + typed entity marker (punct)   78.9

Table 2: Test F1 on the filtered test set of TACRED.

3 NA Instance Prediction

The NA instances, which are not labeled with any relation from R, usually constitute a large portion of the data in practice; in the TACRED dataset, 79.5% of instances are labeled as NA. Previous methods (Zhang et al., 2017; Baldini Soares et al., 2019) typically use an additional class in the classifier to model the NA instances, and predict the relation to be NA if P(r = NA) > P(r = r′) for all r′ ∈ R. In training, the representations of NA instances are then pushed to be similar to W_NA. However, as NA corresponds to an unbounded set of instances of varied semantics, it may not be well modeled by a single class.

In this paper, we explore identifying NA instances with confidence-based classification, where the classifier outputs a high-enough confidence score if a relationship holds, and returns NA if the score is lower than a threshold. Our method is closely related to open set classification (Bendale and Boult, 2016; Dhamija et al., 2018) and out-of-distribution detection (Liang et al., 2018; Vyas et al., 2018), where the classifier needs to identify and reject instances from unseen classes. Specifically, our classifier takes a sentence x, calculates the class probabilities p ∈ R^{|R|}, and returns the confidence score c = max_{r∈R} p_r, the maximum probability over the known relationships. In inference, an instance is predicted as NA if c < θ, where θ is a confidence threshold tuned on the development set.

To make sure our method functions correctly, we have two expectations: (1) NA instances should have low confidence scores, and (2) other instances should have high confidence scores. The latter expectation is already met by the cross-entropy loss.
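Putting the pieces of this section together, a minimal sketch of confidence-based classification follows. The helper names, the toy logits, and the use of accuracy rather than F1 when sweeping the threshold on the development set are our own simplifying assumptions; only the prediction rule c = max_r p_r with threshold θ and the surrogate loss on NA instances come from the text.

```python
import math

# Minimal sketch of confidence-based classification (Section 3).
# Helper names and the accuracy-based threshold sweep are assumptions;
# the paper tunes theta on the development set.

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def predict(logits, theta):
    """Return (relation_index, confidence); index None means NA.

    logits are scores over the |R| known relations only; the
    confidence is c = max_r p_r and the instance is rejected
    as NA whenever c < theta.
    """
    probs = softmax(logits)
    conf = max(probs)
    if conf < theta:
        return None, conf
    return probs.index(conf), conf

def na_confidence_loss(logits):
    """Surrogate loss on NA instances: L_conf = log(sum_r p_r^2)."""
    probs = softmax(logits)
    return math.log(sum(p * p for p in probs))

def tune_threshold(dev, thetas):
    """Pick the threshold with the most correct dev predictions."""
    def n_correct(theta):
        return sum(predict(logits, theta)[0] == gold
                   for logits, gold in dev)
    return max(thetas, key=n_correct)

dev = [([2.0, 0.1, 0.1], 0),      # a confident instance of relation 0
       ([0.0, 0.0, 0.0], None)]   # a maximally uncertain NA instance
theta = tune_threshold(dev, [0.4, 0.9])
# theta == 0.4: it accepts the first instance and rejects the second.
```

Note that the NA loss is smallest when the distribution over known relations is uniform and grows as the distribution becomes peaked, matching the expectation that NA instances should have low confidence.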
For the former, directly minimizing the confidence score causes inefficient optimization, as the relation corresponding to the maximum probability may keep changing. Thus, we propose to minimize a surrogate of the confidence score on NA instances:

    c_sup = Σ_{r∈R} p_r²,
    L_conf = log(c_sup),

where c ≤ √c_sup, since c² = (max_{r∈R} p_r)² ≤ Σ_{r∈R} p_r². Minimizing L_conf is similar to minimizing c while making the training more stable. The derivative of L_conf with respect to the logit l_r of relation r is:

    ∂L_conf/∂l_r = 2 p_r (p_r − Σ_{r′∈R} p_{r′}²) / Σ_{r′∈R} p_{r′}²,

from which we observe that: (1) L_conf is minimized when p_r = 1/|R| for all r ∈ R, i.e., the probability distribution over relations is uniform; (2) L_conf automatically weights the training instances, such that NA instances with higher confidence scores receive larger weights in training.

4 Experiments

In this section, we present the experimental evaluation of the proposed method.
4.1 Datasets

We evaluate our baseline model on two benchmarks: TACRED and SemEval 2010 Task 8. TACRED (Zhang et al., 2017) is a widely used sentence-level RE dataset consisting of 68,124 training instances, 22,631 development instances, and 15,509 test instances. This dataset provides the entity spans and entity types.
Model                                      Original Test F1   Relabeled Test F1
CNN (Zhang et al., 2017)                   59.5               66.5*
TRE (Alt et al., 2019)                     67.4               75.3*
BERT_LARGE (Baldini Soares et al., 2019)   70.1               -
SpanBERT (Joshi et al., 2020)              70.8               78.0*
KnowBERT (Peters et al., 2019)             71.5               79.3*
K-Adapter (Wang et al., 2020)              72.04              -
LUKE (Yamada et al., 2020)                 72.7               -
BERT_BASE + TEM                            71.5               79.4
BERT_LARGE + TEM                           73.0               81.3
RoBERTa_LARGE + TEM                        74.5               83.0
BERT_BASE + TEM + CBC                      71.8               80.1
BERT_LARGE + TEM + CBC                     73.3               81.7
RoBERTa_LARGE + TEM + CBC                  75.0               83.6

Table 3: F1 (in %) on the original and revisited test sets of TACRED. * marks re-implemented results from Alt et al. (2020).
Model              Test F1
BERT_BASE
BERT_LARGE
BERT_BASE + CBC    88.8
BERT_LARGE + CBC   89.7

Table 4:
Test F1 on SemEval 2010 Task 8.

We also evaluate our model on the relabeled test set of TACRED (Alt et al., 2020), which rectifies some noisy labels. SemEval 2010 Task 8 (Hendrickx et al., 2019; SemEval for short) is another RE dataset consisting of 8,000 training instances and 2,717 test instances. We further split the training set into an 80/20 split as the training and development sets. SemEval does not provide entity types; thus, we only apply the typed entity marker technique to TACRED.

4.2 Implementation Details

Our model is implemented based on Huggingface's Transformers (Wolf et al., 2019). The hyper-parameters follow Baldini Soares et al. (2019). Specifically, our model is optimized with Adam (Kingma and Ba, 2015), using a linear warmup (Goyal et al., 2017) at the start of training followed by a linear decay of the learning rate to 0. We use a batch size of 64 for all models. We fine-tune the TACRED model for 5 epochs, and the SemEval model for 10 epochs.

4.3 Results

The experimental results on TACRED and SemEval are shown in Table 3 and Table 4, respectively. We abbreviate the typed entity marker and confidence-based classification as TEM and CBC in the tables. We find that both techniques effectively improve the task in terms of F1. CBC consistently improves the F1 on all datasets. In particular, our RoBERTa_LARGE + TEM + CBC model achieves a new SOTA result of 75.0, outperforming all compared methods. Similarly, our BERT_LARGE + CBC model outperforms the previous baseline model (Baldini Soares et al., 2019) in F1.

5 Conclusion

In this paper, we propose an improved baseline for sentence-level RE, which incorporates two simple yet effective techniques: the typed entity marker and confidence-based classification. Experiments on two datasets demonstrate the effectiveness of our method. Future work includes evaluating our model on more pre-trained language models, and extending confidence-based classification to other tasks such as named entity recognition and question answering.
References
Christoph Alt, Aleksandra Gabryszak, and Leonhard Hennig. 2020. TACRED revisited: A thorough evaluation of the TACRED relation extraction task. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1558–1569, Online. Association for Computational Linguistics.

Christoph Alt, Marc Hübner, and Leonhard Hennig. 2019. Fine-tuning pre-trained transformer language models to distantly supervised relation extraction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1388–1398, Florence, Italy. Association for Computational Linguistics.

Livio Baldini Soares, Nicholas FitzGerald, Jeffrey Ling, and Tom Kwiatkowski. 2019. Matching the blanks: Distributional similarity for relation learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2895–2905, Florence, Italy. Association for Computational Linguistics.

Abhijit Bendale and Terrance E. Boult. 2016. Towards open set deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1563–1572.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Akshay Raj Dhamija, Manuel Günther, and Terrance E. Boult. 2018. Reducing network agnostophobia. In NeurIPS.

Priya Goyal, P. Dollár, Ross B. Girshick, P. Noordhuis, L. Wesolowski, Aapo Kyrola, Andrew Tulloch, Y. Jia, and Kaiming He. 2017. Accurate, large minibatch SGD: Training ImageNet in 1 hour. ArXiv, abs/1706.02677.

Iris Hendrickx, Su Nam Kim, Zornitsa Kozareva, Preslav Nakov, Diarmuid Ó Séaghdha, Sebastian Padó, Marco Pennacchiotti, Lorenza Romano, and Stan Szpakowicz. 2019. SemEval-2010 task 8: Multi-way classification of semantic relations between pairs of nominals. arXiv preprint arXiv:1911.10422.

Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, and Omer Levy. 2020. SpanBERT: Improving pre-training by representing and predicting spans. Transactions of the Association for Computational Linguistics, 8:64–77.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. CoRR, abs/1412.6980.

Shiyu Liang, Yixuan Li, and R. Srikant. 2018. Enhancing the reliability of out-of-distribution image detection in neural networks. In International Conference on Learning Representations.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Hao Peng, Tianyu Gao, Xu Han, Yankai Lin, Peng Li, Zhiyuan Liu, Maosong Sun, and Jie Zhou. 2020. Learning from context or names? An empirical study on neural relation extraction. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 3661–3672, Online. Association for Computational Linguistics.

Matthew E. Peters, Mark Neumann, Robert L. Logan IV, Roy Schwartz, V. Joshi, Sameer Singh, and Noah A. Smith. 2019. Knowledge enhanced contextual word representations. In EMNLP/IJCNLP.

Apoorv Vyas, Nataraj Jammalamadaka, Xia Zhu, Dipankar Das, Bharat Kaul, and Theodore L. Willke. 2018. Out-of-distribution detection using an ensemble of self supervised leave-out classifiers. In Proceedings of the European Conference on Computer Vision (ECCV), pages 550–564.

Ruize Wang, Duyu Tang, Nan Duan, Zhongyu Wei, Xuanjing Huang, Cuihong Cao, Daxin Jiang, Ming Zhou, et al. 2020. K-Adapter: Infusing knowledge into pre-trained models with adapters. arXiv preprint arXiv:2002.01808.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. Huggingface's Transformers: State-of-the-art natural language processing. ArXiv, abs/1910.03771.

Ikuya Yamada, Akari Asai, Hiroyuki Shindo, Hideaki Takeda, and Yuji Matsumoto. 2020. LUKE: Deep contextualized entity representations with entity-aware self-attention. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6442–6454, Online. Association for Computational Linguistics.

Yuhao Zhang, Peng Qi, and Christopher D. Manning. 2018. Graph convolution over pruned dependency trees improves relation extraction. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2205–2215, Brussels, Belgium. Association for Computational Linguistics.

Yuhao Zhang, Victor Zhong, Danqi Chen, Gabor Angeli, and Christopher D. Manning. 2017. Position-aware attention and supervised data improve slot filling. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 35–45, Copenhagen, Denmark. Association for Computational Linguistics.

Zhengyan Zhang, Xu Han, Zhiyuan Liu, Xin Jiang, Maosong Sun, and Qun Liu. 2019. ERNIE: Enhanced language representation with informative entities. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1441–1451, Florence, Italy. Association for Computational Linguistics.

Zexuan Zhong and Danqi Chen. 2020. A frustratingly easy approach for joint entity and relation extraction. arXiv preprint arXiv:2010.12812.

Wenxuan Zhou, Kevin Huang, Tengyu Ma, and Jing Huang. 2021. Document-level relation extraction with adaptive thresholding and localized context pooling. In