An Improved Baseline for Sentence-level Relation Extraction
Wenxuan Zhou
University of Southern California [email protected]
Muhao Chen
University of Southern California [email protected]
Abstract
Sentence-level relation extraction (RE) aims at identifying the relationship between two entities in a sentence. Many efforts have been devoted to this problem, while the best performing methods are still far behind human performance. In this paper, we revisit two aspects of RE models that are not thoroughly studied, namely ENTITY REPRESENTATION and NA INSTANCE PREDICTION. Our improved baseline model, incorporating entity representations with typed markers and confidence-based classification for enhanced NA instance detection, achieves an F1 of 75.0 on TACRED, significantly outperforming previous SOTA methods.

1 Introduction

Relation extraction (RE) aims at identifying the relationship between two entities in a given piece of text from a pre-defined set of relationships of interest. It is one of the fundamental information extraction (IE) tasks. For example, given the sentence "Bill Gates founded Microsoft together with his friend Paul Allen in 1975" and an entity pair ("Bill Gates", "Microsoft"), an RE system is supposed to predict the relation to be
ORG:FOUNDED_BY. Pre-trained language models, such as BERT (Devlin et al., 2019), have gained significant success on this task, achieving SOTA performance.

Many recent efforts on RE focus on injecting external knowledge into language models. ERNIE (Zhang et al., 2019) and KnowBERT (Peters et al., 2019) take pre-trained entity embeddings as input to the transformer. K-Adapter (Wang et al., 2020) introduces a plug-in neural adapter that injects factual and linguistic knowledge into the language model. LUKE (Yamada et al., 2020) extends the pre-training objective of masked language modeling to entities and proposes an entity-aware self-attention mechanism. Another line of work focuses on fine-tuning pre-trained language models on entity-linked text using RE-oriented objectives. Specifically, BERT-MTB (Baldini Soares et al., 2019) proposes a matching-the-blanks objective that decides whether two relation instances share the same entities.

Despite being extensively studied, the performance of existing RE classifiers is still far from satisfactory. For example, on the TACRED dataset (Zhang et al., 2017), the SOTA result only increases from 70.1 (BERT_LARGE) to 72.7 (LUKE) after the application of pre-trained language models to this task. In this paper, we revisit existing methods and locate two overlooked aspects in previous work. First, relation instances contain several aspects of entity information, including entity names, spans, and types, while existing methods do not capture all of them in the input. Moreover, due to the inevitable limitation of the pre-defined ontology, it is natural for a portion of relation instances not to satisfy any known relation in the ontology; such instances are labeled as NA. These NA instances may be of varied semantics in practice, while existing methods typically model them as a single NA class and try to group them into the same cluster. To this end, we propose an improved baseline for sentence-level RE that incorporates techniques addressing the aforementioned problems.
First, we introduce the typed entity marker (Zhong and Chen, 2020), which leads to a promising improvement in performance over existing entity representation techniques. Second, we use confidence-based classification to model the NA instances. Instead of grouping them into the same cluster, we let the model output a confidence score of whether a relationship exists and return NA if the score is lower than a threshold. Our improved baseline achieves an F1 of 75.0 on the TACRED dataset, exceeding the SOTA result by 2.3. We also show that confidence-based classification is effective on the SemEval-2010 Task 8 dataset (Hendrickx et al., 2019).

2 Method

In this section, we first formally define the sentence-level RE task in Section 2.1, and then present our model architecture and entity representation techniques in Section 2.2 and Section 2.3. Finally, we evaluate whether our model can generalize to unseen entities in Section 2.4.
2.1 Problem Definition

In this paper, we focus on sentence-level RE. Specifically, given a sentence x and an entity pair (e_s, e_o), the task of sentence-level RE is to predict the relationship r between e_s and e_o from R ∪ {NA}, where R is a pre-defined set of relationships of interest, and e_s, e_o are identified as the subject and object entity, respectively. Entity pairs that do not express a relation from R are labeled NA.

2.2 Model Architecture

Our RE classifier is an extension of previous transformer-based (Devlin et al., 2019; Liu et al., 2019) RE models (Baldini Soares et al., 2019). Given the input sentence x, we first mark the entity spans and entity types using the techniques in Section 2.3, then feed the processed sentence into a pre-trained language model to obtain its contextual embedding. Finally, we take the first hidden states of the subject and object entities in the language model's last layer, i.e., h_subj and h_obj, as the input to the classifier:

    z = ReLU(W_p [h_subj; h_obj]),
    P(r = r̂) = exp(W_r̂ z) / Σ_{r′ ∈ R ∪ {NA}} exp(W_{r′} z),

where W_p ∈ R^{d×2d} and W_r̂ ∈ R^d are model parameters. In inference, the classifier returns the relation with the highest probability.

2.3 Entity Representation

For sentence-level RE, the spans and types of entities are provided and need to be represented in the input. Recent work (Baldini Soares et al., 2019; Zhong and Chen, 2020) shows that both types of information improve the performance of RE classifiers. In this paper, we evaluate the following representation methods on TACRED (Zhang et al., 2017):
• Entity mask (Zhang et al., 2017): This method uses a special [SUBJ-NER] or [OBJ-NER] token to replace the subject or object entity in the original sentence, where NER is the corresponding entity type.

• Entity marker (Baldini Soares et al., 2019): This method introduces special tokens [E1], [/E1], [E2], [/E2] to indicate the entity spans, and modifies the sentence to "[E1] SUBJ [/E1] ... [E2] OBJ [/E2]".

• Entity marker (punct) (Wang et al., 2020; Zhou et al., 2021): This method is a variant of the entity marker that marks the entity spans by modifying the original sentence to "@ SUBJ @ ... # OBJ #".

• Typed entity marker (Zhong and Chen, 2020): This method introduces special tokens ⟨S:NER⟩, ⟨/S:NER⟩, ⟨O:NER⟩, ⟨/O:NER⟩, and modifies the sentence to "⟨S:NER⟩ SUBJ ⟨/S:NER⟩ ... ⟨O:NER⟩ OBJ ⟨/O:NER⟩", where NER is the corresponding entity type.

• Typed entity marker (punct):
We propose a variant of the typed entity marker that marks the span and type of entities without introducing new special tokens. We enclose the subject entity with "@" and the object entity with "#", and prepend the entity types, marked with "*" for subjects and "∧" for objects, to the entity spans. The modified sentence is "@ * SUBJ-NER * SUBJ @ ... # ∧ OBJ-NER ∧ OBJ #", where SUBJ-NER and OBJ-NER are the corresponding entity types.
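As a concrete illustration, the entity representation methods above can be written as simple token-list rewrites. This is a hypothetical sketch, not the authors' released code: the function name, the end-inclusive (start, end) span convention, plain "^" in place of the paper's "∧", and ASCII "<S:NER>"-style strings in place of the paper's angle-bracket tokens are our own assumptions.

```python
# Illustrative sketch of the entity representation methods in Section 2.3.
# Not the authors' implementation: names and conventions are assumptions.

def mark_entities(tokens, subj_span, subj_type, obj_span, obj_type, method):
    """Return a new token list with the subject/object entities marked."""
    st, ot = subj_type, obj_type
    formats = {
        # method: (subj_left, subj_right, obj_left, obj_right, mask?)
        "entity_mask": (None, None, None, None, True),
        "entity_marker": (["[E1]"], ["[/E1]"], ["[E2]"], ["[/E2]"], False),
        "entity_marker_punct": (["@"], ["@"], ["#"], ["#"], False),
        "typed_entity_marker": (
            [f"<S:{st}>"], [f"</S:{st}>"], [f"<O:{ot}>"], [f"</O:{ot}>"], False),
        "typed_entity_marker_punct": (
            ["@", "*", st.lower(), "*"], ["@"],
            ["#", "^", ot.lower(), "^"], ["#"], False),
    }
    sl, sr, ol, orr, mask = formats[method]

    def rewrite(toks, span, left, right, repl):
        s, e = span  # end-inclusive token indices
        inner = repl if repl is not None else toks[s:e + 1]
        return toks[:s] + (left or []) + inner + (right or []) + toks[e + 1:]

    # Rewrite the later span first so the earlier indices stay valid.
    jobs = [(subj_span, sl, sr, [f"[SUBJ-{st}]"] if mask else None),
            (obj_span, ol, orr, [f"[OBJ-{ot}]"] if mask else None)]
    for span, left, right, repl in sorted(jobs, key=lambda j: -j[0][0]):
        tokens = rewrite(tokens, span, left, right, repl)
    return tokens

# The example sentence of Table 1:
sent = ["Bill", "was", "born", "in", "Seattle", "."]
masked = mark_entities(sent, (0, 0), "PERSON", (4, 4), "CITY", "entity_mask")
# masked == ["[SUBJ-PERSON]", "was", "born", "in", "[OBJ-CITY]", "."]
```

Rewriting the later span before the earlier one is the one subtle point: inserting marker tokens shifts all subsequent indices, so processing right-to-left keeps the untouched span's indices valid.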
Method                       Input Example                                                          BERT_BASE  BERT_LARGE  RoBERTa_LARGE
Entity mask                  [SUBJ-PERSON] was born in [OBJ-CITY] .                                 69.6       70.6        60.9
Entity marker                [E1] Bill [/E1] was born in [E2] Seattle [/E2] .                       68.4       69.7        70.7
Entity marker (punct)        @ Bill @ was born in # Seattle # .
Typed entity marker          [E1-PERSON] Bill [/E1-PERSON] was born in [E2-CITY] Seattle [/E2-CITY] .
Typed entity marker (punct)  @ * person * Bill @ was born in # ∧ city ∧ Seattle # .

Table 1: Test F1 (in %) of different entity representation methods on TACRED. For each method, we also provide the processed input of the example sentence "Bill was born in Seattle". Typed entity marker (original and punct) significantly outperforms other methods.

We report the test F1 in Table 1, from which we have two observations. (1) Our model achieves the best F1 with the typed entity marker (punct), which is significantly higher than the SOTA result of 72.7 (Yamada et al., 2020); it shows that all types of entity information are helpful to the RE task. (2) Introducing new special tokens deteriorates model performance on RoBERTa: the entity marker underperforms the entity marker (punct), the typed entity marker underperforms the typed entity marker (punct), while the entity mask gets a much worse result of 60.9.

2.4 Generalization to Unseen Entities

Some previous work (Zhang et al., 2018) claims that entity names may leak superficial cues of the relation types, allowing heuristics that hack the benchmark. They show that a neural RE classifier can achieve high evaluation results based only on the input subject and object entities, without the original sentence. They also suggest that RE classifiers trained without entity masks may not generalize to unseen entities. However, as the entity types provided in RE datasets are usually coarse-grained, using entity masks may lead to the loss of entity information. Using entity masks also contradicts existing methods of injecting entity knowledge into RE classifiers (Zhang et al., 2019; Peters et al., 2019; Wang et al., 2020): if RE classifiers should not consider entity names, they could not be improved by external knowledge bases.

To evaluate whether our RE classifier can generalize to unseen entities, we propose a filtered evaluation setting. Specifically, we remove all test instances containing entities from the training set, resulting in a filtered test set of 4,599 instances that only contain sentences with entities unseen during training. We present the evaluation results on the filtered test set in Table 2.
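The construction of this filtered test set can be sketched as below. The dict-based instance format and the "subj"/"obj" field names are our own illustrative assumptions, not TACRED's actual schema.

```python
# Hypothetical sketch of the filtered evaluation setting of Section 2.4:
# drop every test instance whose subject or object entity also appears
# as an entity in the training set. The instance format (dicts holding
# entity strings) is our own assumption, not TACRED's schema.

def build_filtered_test_set(train, test):
    seen = set()  # all entity strings observed during training
    for inst in train:
        seen.add(inst["subj"])
        seen.add(inst["obj"])
    # Keep only instances whose entities are both unseen in training.
    return [inst for inst in test
            if inst["subj"] not in seen and inst["obj"] not in seen]

train = [{"subj": "Bill Gates", "obj": "Microsoft"}]
test = [{"subj": "Bill Gates", "obj": "Seattle"},
        {"subj": "Paul Allen", "obj": "Seattle"}]
filtered = build_filtered_test_set(train, test)
# filtered == [{"subj": "Paul Allen", "obj": "Seattle"}]
```

Only the second test instance survives the filter, since "Bill Gates" already occurs as a training entity.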
Note that as the label distributions of the original and filtered test sets are different, their results may not be directly comparable. Still, Typed Entity Marker consistently outperforms Entity Mask, which shows that RE classifiers can learn from entity names and generalize to unseen entities. Our finding is consistent with Peng et al. (2020), which suggests that entity mentions can provide more information than entity types to improve the RE classifier.

Model                                         Test F1
BERT_BASE + entity mask                       75.4
BERT_LARGE + entity mask                      75.8
RoBERTa_LARGE + entity mask                   69.7
BERT_BASE + typed entity marker               75.9
BERT_LARGE + typed entity marker              76.8
RoBERTa_LARGE + typed entity marker (punct)   78.9

Table 2: Test F1 on the filtered test set of TACRED.

3 NA Instance Prediction

The NA instances, which are not labeled with any relation from R, usually constitute a large portion of the data in practice; in the TACRED dataset, 79.5% of instances are labeled as NA. Previous methods (Zhang et al., 2017; Baldini Soares et al., 2019) typically use an additional class in the classifier to model the NA instances, and predict the relation to be NA if P(r = NA) > P(r = r′) for all r′ ∈ R. In training, the representations of NA instances are then pushed to be similar to W_NA. However, as NA corresponds to an unbounded set of instances of varied semantics, it may not be well modeled by a single class.

In this paper, we explore identifying NA instances with confidence-based classification, where the classifier outputs a high-enough confidence score if a relationship holds, and returns NA if the score is lower than a threshold. Our method is closely related to open set classification (Bendale and Boult, 2016; Dhamija et al., 2018) and out-of-distribution detection (Liang et al., 2018; Vyas et al., 2018), where the classifier needs to identify and reject instances from unseen classes. Specifically, our classifier takes a sentence x, calculates the class probabilities p ∈ R^{|R|}, and returns the confidence score c = max_{r∈R} p_r, the maximum probability over the known relationships. In inference, an instance is predicted as NA if c < θ, where θ is a confidence threshold tuned on the development set.

To make sure our method functions correctly, we have two expectations: (1) NA instances should have low confidence scores, and (2) other instances should have high confidence scores. The latter expectation is already met by the cross-entropy loss.
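Putting the pieces of this section together, a minimal sketch of confidence-based classification follows. The helper names, the toy logits, and the use of accuracy rather than F1 when sweeping the threshold on the development set are our own simplifying assumptions; only the prediction rule c = max_r p_r with threshold θ and the surrogate loss on NA instances come from the text.

```python
import math

# Minimal sketch of confidence-based classification (Section 3).
# Helper names and the accuracy-based threshold sweep are assumptions;
# the paper tunes theta on the development set.

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def predict(logits, theta):
    """Return (relation_index, confidence); index None means NA.

    logits are scores over the |R| known relations only; the
    confidence is c = max_r p_r and the instance is rejected
    as NA whenever c < theta.
    """
    probs = softmax(logits)
    conf = max(probs)
    if conf < theta:
        return None, conf
    return probs.index(conf), conf

def na_confidence_loss(logits):
    """Surrogate loss on NA instances: L_conf = log(sum_r p_r^2)."""
    probs = softmax(logits)
    return math.log(sum(p * p for p in probs))

def tune_threshold(dev, thetas):
    """Pick the threshold with the most correct dev predictions."""
    def n_correct(theta):
        return sum(predict(logits, theta)[0] == gold
                   for logits, gold in dev)
    return max(thetas, key=n_correct)

dev = [([2.0, 0.1, 0.1], 0),      # a confident instance of relation 0
       ([0.0, 0.0, 0.0], None)]   # a maximally uncertain NA instance
theta = tune_threshold(dev, [0.4, 0.9])
# theta == 0.4: it accepts the first instance and rejects the second.
```

Note that the NA loss is smallest when the distribution over known relations is uniform and grows as the distribution becomes peaked, matching the expectation that NA instances should have low confidence.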
For the former, directly minimizing the confidence score causes inefficient optimization, as the relation corresponding to the maximum probability may keep changing. Thus, we propose to minimize a surrogate of the confidence score on NA instances:

    c_sup = Σ_{r∈R} p_r²,
    L_conf = log(c_sup),

where c ≤ √c_sup, since c² = (max_{r∈R} p_r)² ≤ Σ_{r∈R} p_r². Minimizing L_conf is similar to minimizing c while making the training more stable. The derivative of L_conf with respect to the logit l_r of relation r is:

    ∂L_conf/∂l_r = 2 p_r (p_r − Σ_{r′∈R} p_{r′}²) / Σ_{r′∈R} p_{r′}²,

from which we observe that: (1) L_conf is minimized when p_r = 1/|R| for all r ∈ R, i.e., the probability distribution over relations is uniform; (2) L_conf automatically weights the training instances, such that NA instances with higher confidence scores receive larger weights in training.

4 Experiments

In this section, we present the experimental evaluation of the proposed method.
4.1 Datasets

We evaluate our baseline model on two benchmarks: TACRED and SemEval 2010 Task 8. TACRED (Zhang et al., 2017) is a widely used sentence-level RE dataset consisting of 68,124 training instances, 22,631 development instances, and 15,509 test instances. This dataset provides the entity spans and entity types.
Model                                      Original Test F1   Relabeled Test F1
CNN (Zhang et al., 2017)                   59.5               66.5*
TRE (Alt et al., 2019)                     67.4               75.3*
BERT_LARGE (Baldini Soares et al., 2019)   70.1               -
SpanBERT (Joshi et al., 2020)              70.8               78.0*
KnowBERT (Peters et al., 2019)             71.5               79.3*
K-Adapter (Wang et al., 2020)              72.04              -
LUKE (Yamada et al., 2020)                 72.7               -
BERT_BASE + TEM                            71.5               79.4
BERT_LARGE + TEM                           73.0               81.3
RoBERTa_LARGE + TEM                        74.5               83.0
BERT_BASE + TEM + CBC                      71.8               80.1
BERT_LARGE + TEM + CBC                     73.3               81.7
RoBERTa_LARGE + TEM + CBC                  75.0               83.6

Table 3: F1 (in %) on the original and revisited test sets of TACRED. * marks re-implemented results from Alt et al. (2020).
Model              Test F1
BERT_BASE
BERT_LARGE
BERT_BASE + CBC    88.8
BERT_LARGE + CBC   89.7

Table 4:
Test F1 on SemEval 2010 Task 8.

We also evaluate our model on the relabeled test set of TACRED (Alt et al., 2020), which rectifies some noisy labels. SemEval 2010 Task 8 (Hendrickx et al., 2019; SemEval for short) is another RE dataset consisting of 8,000 training instances and 2,717 test instances. We further split the training set into an 80/20 split as the training and development sets. SemEval does not provide entity types; thus, we only apply the typed entity marker technique to TACRED.

4.2 Implementation Details

Our model is implemented based on Huggingface's Transformers (Wolf et al., 2019). The hyper-parameters follow Baldini Soares et al. (2019). Specifically, our model is optimized with Adam (Kingma and Ba, 2015), using a linear warmup (Goyal et al., 2017) at the start of training followed by a linear decay of the learning rate to 0. We use a batch size of 64 for all models. We fine-tune the TACRED model for 5 epochs, and the SemEval model for 10 epochs.

4.3 Results

The experimental results on TACRED and SemEval are shown in Table 3 and Table 4, respectively. We abbreviate the typed entity marker and confidence-based classification as TEM and CBC in the tables. We find that both techniques effectively improve the task in terms of F1. CBC consistently improves the F1 on all datasets. In particular, our RoBERTa_LARGE + TEM + CBC model achieves a new SOTA result of 75.0, outperforming all compared methods. Similarly, our BERT_LARGE + CBC model outperforms the previous baseline model (Baldini Soares et al., 2019) in F1.

5 Conclusion

In this paper, we propose an improved baseline for sentence-level RE, which incorporates two simple yet effective techniques: the typed entity marker and confidence-based classification. Experiments on two datasets demonstrate the effectiveness of our method. Future work includes evaluating our model on more pre-trained language models, and extending confidence-based classification to other tasks such as named entity recognition and question answering.
References
Christoph Alt, Aleksandra Gabryszak, and Leonhard Hennig. 2020. TACRED revisited: A thorough evaluation of the TACRED relation extraction task. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1558–1569, Online. Association for Computational Linguistics.

Christoph Alt, Marc Hübner, and Leonhard Hennig. 2019. Fine-tuning pre-trained transformer language models to distantly supervised relation extraction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1388–1398, Florence, Italy. Association for Computational Linguistics.

Livio Baldini Soares, Nicholas FitzGerald, Jeffrey Ling, and Tom Kwiatkowski. 2019. Matching the blanks: Distributional similarity for relation learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2895–2905, Florence, Italy. Association for Computational Linguistics.

Abhijit Bendale and Terrance E. Boult. 2016. Towards open set deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1563–1572.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Akshay Raj Dhamija, Manuel Günther, and Terrance E. Boult. 2018. Reducing network agnostophobia. In NeurIPS.

Priya Goyal, P. Dollár, Ross B. Girshick, P. Noordhuis, L. Wesolowski, Aapo Kyrola, Andrew Tulloch, Y. Jia, and Kaiming He. 2017. Accurate, large minibatch SGD: Training ImageNet in 1 hour. ArXiv, abs/1706.02677.

Iris Hendrickx, Su Nam Kim, Zornitsa Kozareva, Preslav Nakov, Diarmuid Ó Séaghdha, Sebastian Padó, Marco Pennacchiotti, Lorenza Romano, and Stan Szpakowicz. 2019. SemEval-2010 task 8: Multi-way classification of semantic relations between pairs of nominals. arXiv preprint arXiv:1911.10422.

Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, and Omer Levy. 2020. SpanBERT: Improving pre-training by representing and predicting spans. Transactions of the Association for Computational Linguistics, 8:64–77.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. CoRR, abs/1412.6980.

Shiyu Liang, Yixuan Li, and R. Srikant. 2018. Enhancing the reliability of out-of-distribution image detection in neural networks. In International Conference on Learning Representations.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Hao Peng, Tianyu Gao, Xu Han, Yankai Lin, Peng Li, Zhiyuan Liu, Maosong Sun, and Jie Zhou. 2020. Learning from context or names? An empirical study on neural relation extraction. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 3661–3672, Online. Association for Computational Linguistics.

Matthew E. Peters, Mark Neumann, Robert L. Logan IV, Roy Schwartz, V. Joshi, Sameer Singh, and Noah A. Smith. 2019. Knowledge enhanced contextual word representations. In EMNLP/IJCNLP.

Apoorv Vyas, Nataraj Jammalamadaka, Xia Zhu, Dipankar Das, Bharat Kaul, and Theodore L. Willke. 2018. Out-of-distribution detection using an ensemble of self supervised leave-out classifiers. In Proceedings of the European Conference on Computer Vision (ECCV), pages 550–564.

Ruize Wang, Duyu Tang, Nan Duan, Zhongyu Wei, Xuanjing Huang, Cuihong Cao, Daxin Jiang, Ming Zhou, et al. 2020. K-Adapter: Infusing knowledge into pre-trained models with adapters. arXiv preprint arXiv:2002.01808.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. Huggingface's Transformers: State-of-the-art natural language processing. ArXiv, abs/1910.03771.

Ikuya Yamada, Akari Asai, Hiroyuki Shindo, Hideaki Takeda, and Yuji Matsumoto. 2020. LUKE: Deep contextualized entity representations with entity-aware self-attention. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6442–6454, Online. Association for Computational Linguistics.

Yuhao Zhang, Peng Qi, and Christopher D. Manning. 2018. Graph convolution over pruned dependency trees improves relation extraction. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2205–2215, Brussels, Belgium. Association for Computational Linguistics.

Yuhao Zhang, Victor Zhong, Danqi Chen, Gabor Angeli, and Christopher D. Manning. 2017. Position-aware attention and supervised data improve slot filling. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 35–45, Copenhagen, Denmark. Association for Computational Linguistics.

Zhengyan Zhang, Xu Han, Zhiyuan Liu, Xin Jiang, Maosong Sun, and Qun Liu. 2019. ERNIE: Enhanced language representation with informative entities. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1441–1451, Florence, Italy. Association for Computational Linguistics.

Zexuan Zhong and Danqi Chen. 2020. A frustratingly easy approach for joint entity and relation extraction. arXiv preprint arXiv:2010.12812.

Wenxuan Zhou, Kevin Huang, Tengyu Ma, and Jing Huang. 2021. Document-level relation extraction with adaptive thresholding and localized context pooling. In