Enriching Pre-trained Language Model with Entity Information for Relation Classification
Shanchan Wu
Alibaba Group (U.S.) Inc., Sunnyvale, CA [email protected]
Yifan He
Alibaba Group (U.S.) Inc., Sunnyvale, CA [email protected]
Abstract
Relation classification is an important NLP task to extract relations between entities. The state-of-the-art methods for relation classification are primarily based on Convolutional or Recurrent Neural Networks. Recently, the pre-trained BERT model achieves very successful results in many NLP classification / sequence labeling tasks. Relation classification differs from those tasks in that it relies on information of both the sentence and the two target entities. In this paper, we propose a model that both leverages the pre-trained BERT language model and incorporates information from the target entities to tackle the relation classification task. We locate the target entities and transfer the information through the pre-trained architecture, and incorporate the corresponding encoding of the two entities. We achieve significant improvement over the state-of-the-art method on the SemEval-2010 Task 8 relational dataset.
1 Introduction

The task of relation classification is to predict semantic relations between pairs of nominals. Given a sequence of text (usually a sentence) s and a pair of nominals e1 and e2, the objective is to identify the relation between e1 and e2 (Hendrickx et al., 2010). It is an important NLP task which is normally used as an intermediate step in a variety of NLP applications. The following example shows the Component-Whole relation between the nominals "kitchen" and "house": "The [kitchen]e1 is the last renovated part of the [house]e2."

Recently, deep neural networks have been applied to relation classification (Socher et al., 2012; Zeng et al., 2014; Yu et al., 2014; dos Santos et al., 2015; Shen and Huang, 2016; Lee et al., 2019). These methods usually use some features derived from lexical resources such as WordNet or from NLP tools such as dependency parsers and named entity recognizers (NER).

Language model pre-training has been shown to be effective for improving many natural language processing tasks (Dai and Le, 2015; Peters et al., 2017; Radford et al., 2018; Ruder and Howard, 2018; Devlin et al., 2018). The pre-trained BERT model proposed by (Devlin et al., 2018) has had an especially significant impact. It has been applied to multiple NLP tasks and obtains new state-of-the-art results on eleven tasks. The tasks that BERT has been applied to are typically modeled as classification or sequence labeling problems. It has also been applied to the SQuAD question answering problem (Rajpurkar et al., 2016), in which the objective is to find the starting and ending points of an answer span.

As far as we know, the pre-trained BERT model (Devlin et al., 2018) has not been applied to relation classification, which relies not only on the information of the whole sentence but also on the information of the specific target entities. In this paper, we apply the pre-trained BERT model to relation classification. We insert special tokens before and after the target entities before feeding the text to BERT for fine-tuning, in order to identify the locations of the two target entities and transfer that information into the BERT model. We then locate the positions of the two target entities in the output embedding from the BERT model, and use their embeddings as well as the sentence encoding (the embedding of the special first token in the setting of BERT) as the input to a multi-layer neural network for classification. In this way, the model captures both the semantics of the sentence and the two target entities to better fit the relation classification task.

Our contributions are as follows: (1) We put forward an innovative approach to incorporate entity-level information into the pre-trained language model for relation classification. (2) We achieve a new state-of-the-art result for the relation classification task.

2 Related Work

There has been some work with deep learning methods for relation classification, such as (Socher et al., 2012; Zeng et al., 2014; Yu et al., 2014; dos Santos et al., 2015).

The MVRNN model (Socher et al., 2012) applies a recursive neural network (RNN) to relation classification. They assign a matrix-vector representation to every node in a parse tree and compute the representation for the complete sentence from the bottom up according to the syntactic structure of the parse tree. (Zeng et al., 2014) propose a CNN model that incorporates both word embeddings and position features as input. They then concatenate lexical features and the output from the CNN into a single vector and feed it into a softmax layer for prediction.
(Yu et al., 2014) propose a Factor-based Compositional Embedding Model (FCM) by constructing sentence-level and substructure embeddings from word embeddings, through dependency trees and named entities. (Santos et al., 2015) tackle the relation classification task by ranking with a convolutional neural network named CR-CNN; their loss function is based on pairwise ranking. In our work, we take advantage of a pre-trained language model for the relation classification task, without relying on CNN or RNN architectures. (Shen and Huang, 2016) utilize a CNN encoder in conjunction with a sentence representation that weights the words by attention between the target entities and the words in the sentence to perform relation classification. (Wang et al., 2016) propose a convolutional neural network architecture with two levels of attention in order to catch the patterns in heterogeneous contexts to classify relations. (Lee et al., 2019) develop an end-to-end recurrent neural model which incorporates an entity-aware attention mechanism with latent entity typing for relation classification.

There is also related work on relation extraction based on distant supervision, for example (Mintz et al., 2009; Hoffmann et al., 2011; Lin et al., 2016; Ji et al., 2017; Wu et al., 2019). The difference between relation classification on regular data and on distantly supervised data is that the latter may contain a large number of noisy labels. In this paper, we focus on the regular relation classification problem, without noisy labels.
3 Methodology

The pre-trained BERT model (Devlin et al., 2018) is a multi-layer bidirectional Transformer encoder (Vaswani et al., 2017).

The input representation of BERT is designed to represent both a single text sentence and a pair of text sentences in one token sequence. The input representation of each token is constructed by summing the corresponding token, segment and position embeddings.

'[CLS]' is added to the beginning of each sequence as its first token. The final hidden state from the Transformer output corresponding to this first token is used as the sentence representation for classification tasks. In case there are two sentences in a task, '[SEP]' is used to separate the two sentences.

BERT pre-trains its parameters with the masked language model (MLM) objective, which randomly masks some of the tokens from the input and sets the optimization objective to predict the original vocabulary id of each masked word from its context. Unlike left-to-right language model pre-training, the MLM objective allows a state output to utilize both the left and the right context, which makes it possible to pre-train a deep bidirectional Transformer. Besides the masked language model, BERT is also trained on a "next sentence prediction" task that jointly pre-trains text-pair representations.
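To make the preceding description concrete, the following minimal sketch (our illustration, not code from the paper) obtains the per-token hidden states and the '[CLS]' vector from a pre-trained BERT encoder, assuming a recent version of the Hugging Face transformers library as a stand-in for the original BERT release:

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# The tokenizer prepends '[CLS]' and appends '[SEP]'; token, segment and
# position embeddings are summed inside the model.
inputs = tokenizer("The kitchen is the last renovated part of the house.",
                   return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

hidden_states = outputs.last_hidden_state   # shape: (batch, seq_len, hidden_size)
cls_vector = hidden_states[:, 0, :]         # final hidden state of '[CLS]'
```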
Figure 1 shows the architecture of our approach. For a sentence s with two target entities e1 and e2, to make the BERT module capture the location information of the two entities, we insert a special token '$' at both the beginning and the end of the first entity, and a special token '#' at both the beginning and the end of the second entity. We also add '[CLS]' to the beginning of each sequence. For example, after the special tokens are inserted, a sentence with target entities "kitchen" and "house" becomes: "[CLS] The $ kitchen $ is the last renovated part of the # house # ."
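As an illustration of this preprocessing step, here is a small sketch (ours; the function name and the character-offset inputs are assumptions, not part of the paper) that inserts the markers around two entity spans:

```python
def mark_entities(sentence, e1_span, e2_span):
    """Insert '$ ... $' around entity 1 and '# ... #' around entity 2.

    e1_span and e2_span are (start, end) character offsets, with entity 1
    occurring before entity 2. '[CLS]' is added later by the BERT tokenizer.
    """
    (s1, t1), (s2, t2) = e1_span, e2_span
    return (sentence[:s1] + "$ " + sentence[s1:t1] + " $"
            + sentence[t1:s2] + "# " + sentence[s2:t2] + " #"
            + sentence[t2:])

sent = "The kitchen is the last renovated part of the house."
print(mark_entities(sent, (4, 11), (46, 51)))
# -> The $ kitchen $ is the last renovated part of the # house #.
```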
Figure 1: The model architecture.

Given a sentence s with entities e1 and e2, suppose its final hidden state output from the BERT module is H. Suppose vectors H_i to H_j are the final hidden state vectors from BERT for entity e1, and H_k to H_m are the final hidden state vectors from BERT for entity e2. We apply an average operation to get a vector representation for each of the two target entities. Then, after an activation operation (i.e. tanh), we add a fully connected layer to each of the two vectors, and the outputs for e1 and e2 are H'_1 and H'_2 respectively. This process can be mathematically formalized as Equation (1):

H'_1 = W_1 \left[ \tanh\left( \frac{1}{j-i+1} \sum_{t=i}^{j} H_t \right) \right] + b_1
H'_2 = W_2 \left[ \tanh\left( \frac{1}{m-k+1} \sum_{t=k}^{m} H_t \right) \right] + b_2    (1)

We make W_1 and W_2, b_1 and b_2 share the same parameters; in other words, we set W_1 = W_2 and b_1 = b_2. For the final hidden state vector of the first token (i.e. '[CLS]'), we also add an activation operation and a fully connected layer, which is formally expressed as:

H'_0 = W_0 (\tanh(H_0)) + b_0    (2)

Matrices W_0, W_1, W_2 have the same dimensions, i.e. W_0 \in \mathbb{R}^{d \times d}, W_1 \in \mathbb{R}^{d \times d}, W_2 \in \mathbb{R}^{d \times d}, where d is the hidden state size from BERT.

We concatenate H'_0, H'_1, H'_2 and then add a fully connected layer and a softmax layer, which can be expressed as follows:

h'' = W_3 [\mathrm{concat}(H'_0, H'_1, H'_2)] + b_3
p = \mathrm{softmax}(h'')    (3)

where W_3 \in \mathbb{R}^{L \times 3d} (L is the number of relation types), and p is the probability output. In Equations (1), (2) and (3), b_0, b_1, b_2, b_3 are bias vectors.

We use cross entropy as the loss function, and we apply dropout before each fully connected layer during training. We call our approach R-BERT.
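The following PyTorch sketch (ours, not the authors' released implementation) shows one way Equations (1)-(3) can be realized on top of the BERT output; the 0/1 entity masks that select H_i..H_j and H_k..H_m are assumed inputs, and sharing ent_fc between the two entities mirrors the constraint W_1 = W_2, b_1 = b_2:

```python
import torch
import torch.nn as nn

class RBertHead(nn.Module):
    """Classification head of Equations (1)-(3): entity averaging, tanh + FC,
    concatenation with the '[CLS]' vector, and a final linear layer."""

    def __init__(self, hidden_size, num_labels, dropout=0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)                      # before each FC layer
        self.cls_fc = nn.Linear(hidden_size, hidden_size)       # W0, b0
        self.ent_fc = nn.Linear(hidden_size, hidden_size)       # W1 = W2, b1 = b2
        self.out_fc = nn.Linear(3 * hidden_size, num_labels)    # W3, b3

    @staticmethod
    def average_entity(hidden, mask):
        # Average the hidden states H_i..H_j selected by a 0/1 mask over tokens.
        mask = mask.unsqueeze(-1).float()                       # (batch, seq, 1)
        return (hidden * mask).sum(1) / mask.sum(1).clamp(min=1.0)

    def forward(self, hidden, e1_mask, e2_mask):
        # hidden: BERT output H of shape (batch, seq_len, hidden_size)
        h0 = self.cls_fc(self.dropout(torch.tanh(hidden[:, 0])))                          # H'_0
        h1 = self.ent_fc(self.dropout(torch.tanh(self.average_entity(hidden, e1_mask))))  # H'_1
        h2 = self.ent_fc(self.dropout(torch.tanh(self.average_entity(hidden, e2_mask))))  # H'_2
        logits = self.out_fc(self.dropout(torch.cat([h0, h1, h2], dim=-1)))               # h''
        return logits  # softmax + cross entropy applied via nn.CrossEntropyLoss in training
```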
4 Experiments

We use the SemEval-2010 Task 8 dataset in our experiments. The dataset contains nine semantic relation types and one artificial relation type, Other, which means that the relation does not belong to any of the nine types. The nine relation types are Cause-Effect, Component-Whole, Content-Container, Entity-Destination, Entity-Origin, Instrument-Agency, Member-Collection, Message-Topic and Product-Producer. The dataset contains 10,717 sentences, each containing two nominals e1 and e2 and the corresponding relation type. The relation is directional, which means that Component-Whole(e1, e2) is different from Component-Whole(e2, e1). The dataset has already been partitioned into 8,000 training instances and 2,717 test instances. We evaluate our solution by using the SemEval-2010 Task 8 official scorer script, which computes the macro-averaged F1-score over the nine actual relations (excluding Other) and takes directionality into account.
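For reference, the following sketch (ours) approximates this metric in Python: for each of the nine relation types, a prediction counts as correct only if both the relation and its direction match, and Other is excluded from the macro average. The official scorer script remains the authoritative implementation.

```python
RELATIONS = ["Cause-Effect", "Component-Whole", "Content-Container",
             "Entity-Destination", "Entity-Origin", "Instrument-Agency",
             "Member-Collection", "Message-Topic", "Product-Producer"]

def macro_f1(gold, pred):
    """gold/pred: lists of directed labels such as 'Cause-Effect(e1,e2)' or 'Other'."""
    def base(label):
        return label.split("(")[0]   # strip the direction suffix
    f1_scores = []
    for rel in RELATIONS:
        tp = sum(1 for g, p in zip(gold, pred) if base(g) == rel and g == p)
        n_pred = sum(1 for p in pred if base(p) == rel)
        n_gold = sum(1 for g in gold if base(g) == rel)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_gold if n_gold else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        f1_scores.append(f1)
    return sum(f1_scores) / len(f1_scores)
```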
Table 1 shows the major parameters used in our experiments.

Table 1: Parameter settings.

Batch size            16
Max sentence length   128
Adam learning rate    2e-5
Number of epochs      5
Dropout rate          0.1

We add dropout before each add-on layer. For the pre-trained BERT model, we use the uncased basic model. For the parameters of the pre-trained BERT model, please refer to (Devlin et al., 2018) for details.
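As a minimal sketch (ours, not the authors' released code) of how these settings might be wired up, assuming PyTorch, the Hugging Face transformers encoder, and the RBertHead module sketched above:

```python
import torch
from transformers import BertModel

BATCH_SIZE = 16
MAX_SEQ_LENGTH = 128
LEARNING_RATE = 2e-5
NUM_EPOCHS = 5
DROPOUT_RATE = 0.1

encoder = BertModel.from_pretrained("bert-base-uncased")   # uncased basic model
# 19 output labels for SemEval-2010 Task 8: 9 relation types x 2 directions + Other.
head = RBertHead(encoder.config.hidden_size, num_labels=19, dropout=DROPOUT_RATE)

optimizer = torch.optim.Adam(
    list(encoder.parameters()) + list(head.parameters()), lr=LEARNING_RATE)
loss_fn = torch.nn.CrossEntropyLoss()   # cross-entropy loss as described above
```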
We compare our method, R-BERT, against results by multiple methods recently published for the SemEval-2010 Task 8 dataset, including SVM, RNN, MVRNN, CNN+Softmax, FCM, CR-CNN, Attention CNN, Att-Pooling-CNN and Entity Attention Bi-LSTM. The SVM method by (Rink and Harabagiu, 2010) uses a rich feature set in a traditional way and was the best result during the SemEval-2010 Task 8 competition. Details of all other methods are briefly reviewed in Section 2.

Table 2 reports the results. We can see that R-BERT significantly beats all the baseline methods. The macro F1 value of R-BERT is 89.25, which is much better than the previous best result on this dataset.
Table 2: Comparison with results of previously published methods on the SemEval-2010 Task 8 test set.

Method                                        F1
SVM (Rink and Harabagiu, 2010)                82.2
RNN (Socher et al., 2012)                     77.6
MVRNN (Socher et al., 2012)                   82.4
CNN+Softmax (Zeng et al., 2014)               82.7
FCM (Yu et al., 2014)                         83.0
CR-CNN (Santos et al., 2015)                  84.1
Attention CNN (Shen and Huang, 2016)          85.9
Att-Pooling-CNN (Wang et al., 2016)           88.0
Entity Attention Bi-LSTM (Lee et al., 2019)   85.2
R-BERT                                        89.25

We have demonstrated the strong empirical results of the proposed approach. We further want to understand the specific contributions of its components besides the pre-trained BERT component. For this purpose, we create three more configurations.

The first configuration discards the special separate tokens (i.e. '$' and '#') before and after the two entities in the sentence, and also discards the hidden vector outputs of the two entities from the concatenation with the hidden vector output of the sentence. In other words, we add '[CLS]' at the beginning of the sentence, feed the sentence with the two entities into the BERT module, and use the first output vector for classification. We label this method BERT-NO-SEP-NO-ENT.
The second configuration keeps the hidden vector outputs of the two entities but discards the special separate tokens (i.e. '$' and '#') around the entities in the sentence. We label this method BERT-NO-SEP. The third configuration discards the hidden vector outputs of the two entities from the concatenation for classification, but keeps the special separate tokens. We label this method BERT-NO-ENT.

Table 3 reports the results of the ablation study with the above three configurations.

Table 3: Comparison of the BERT-based methods with different components.

Method                    F1
R-BERT-NO-SEP-NO-ENT      81.09
R-BERT-NO-SEP             87.98
R-BERT-NO-ENT             87.99
R-BERT                    89.25

We observe that all three methods perform worse than R-BERT. Of the three, BERT-NO-SEP-NO-ENT performs worst, with an F1 8.16 absolute points below R-BERT. This ablation study demonstrates that both the special separate tokens and the hidden entity vectors make important contributions to our approach.

In relation classification, the relation label depends on both the semantics of the sentence and the two target entities. BERT without the special separate tokens cannot locate the target entities and loses this key information. The special separate tokens help to improve the accuracy because they identify the locations of the two target entities and transfer that information into the BERT model, so that the BERT output contains the location information of the two entities. On the other hand, incorporating the output vectors of the target entities further enriches the information and helps to make more accurate predictions.
5 Conclusion

In this paper, we develop an approach for relation classification by enriching the pre-trained BERT model with entity information. We add special separate tokens to each target entity pair and utilize the sentence vector as well as the target entity representations for classification. We conduct experiments on the SemEval-2010 benchmark dataset, and our results significantly outperform the state-of-the-art methods. One possible direction for future work is to extend the model to distantly supervised relation extraction.
References
Andrew M. Dai and Quoc V. Le. 2015. Semi-supervised Sequence Learning. In Advances in Neural Information Processing Systems 28. 3079–3087.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. CoRR abs/1810.04805. arXiv:1810.04805.

Cícero Nogueira dos Santos, Bing Xiang, and Bowen Zhou. 2015. Classifying Relations by Ranking with Convolutional Neural Networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, ACL 2015. 626–634.

Iris Hendrickx, Su Nam Kim, Zornitsa Kozareva, Preslav Nakov, Diarmuid Ó Séaghdha, Sebastian Padó, Marco Pennacchiotti, Lorenza Romano, and Stan Szpakowicz. 2010. SemEval-2010 Task 8: Multi-way Classification of Semantic Relations Between Pairs of Nominals. In Proceedings of the 5th International Workshop on Semantic Evaluation (SemEval '10). 33–38.

Raphael Hoffmann, Congle Zhang, Xiao Ling, Luke S. Zettlemoyer, and Daniel S. Weld. 2011. Knowledge-Based Weak Supervision for Information Extraction of Overlapping Relations. In The 49th Annual Meeting of the Association for Computational Linguistics, ACL 2011. 541–550.

Guoliang Ji, Kang Liu, Shizhu He, and Jun Zhao. 2017. Distant Supervision for Relation Extraction with Sentence-Level Attention and Entity Descriptions. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, 2017. 3060–3066.

Joohong Lee, Sangwoo Seo, and Yong Suk Choi. 2019. Semantic Relation Classification via Bidirectional LSTM Networks with Entity-aware Attention using Latent Entity Typing. CoRR (2019).

Yankai Lin, Shiqi Shen, Zhiyuan Liu, Huanbo Luan, and Maosong Sun. 2016. Neural Relation Extraction with Selective Attention over Instances. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016.

Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. 2009. Distant Supervision for Relation Extraction Without Labeled Data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP (ACL '09). 1003–1011.

Matthew E. Peters, Waleed Ammar, Chandra Bhagavatula, and Russell Power. 2017. Semi-supervised sequence tagging with bidirectional language models. CoRR abs/1705.00108. arXiv:1705.00108.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding with unsupervised learning. Technical report, OpenAI.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ Questions for Machine Comprehension of Text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2383–2392.

Bryan Rink and Sanda Harabagiu. 2010. UTD: Classifying semantic relations by combining lexical and semantic resources. In Proceedings of the 5th International Workshop on Semantic Evaluation. 256–259.

Sebastian Ruder and Jeremy Howard. 2018. Universal Language Model Fine-tuning for Text Classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 1: Long Papers. 328–339.

Cicero Nogueira Dos Santos, Bing Xiang, and Bowen Zhou. 2015. Classifying Relations by Ranking with Convolutional Neural Networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, (ACL) 2015.

Yatian Shen and Xuanjing Huang. 2016. Attention-based Convolutional Neural Network for Semantic Relation Extraction. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers. 2526–2536.

Richard Socher, Brody Huval, Christopher D. Manning, and Andrew Y. Ng. 2012. Semantic Compositionality Through Recursive Matrix-vector Spaces. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL '12). Stroudsburg, PA, USA, 1201–1211.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.). 5998–6008.

Linlin Wang, Zhu Cao, Gerard de Melo, and Zhiyuan Liu. 2016. Relation Classification via Multi-Level Attention CNNs. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics.

Shanchan Wu, Fan Kai, and Qiong Zhang. 2019. Improving Distantly Supervised Relation Extraction with Neural Noise Converter and Conditional Optimal Selector. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence, 2019.

Mo Yu, Matthew R. Gormley, and Mark Dredze. 2014. Factor-based compositional embedding models. In NIPS Workshop on Learning Semantics.

Daojian Zeng, Kang Liu, Siwei Lai, Guangyou Zhou, and Jun Zhao. 2014. Relation Classification via Convolutional Deep Neural Network. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers.