Enriching Pre-trained Language Model with Entity Information for Relation Classification
Shanchan Wu
Alibaba Group (U.S.) Inc., Sunnyvale, CA [email protected]
Yifan He
Alibaba Group (U.S.) Inc., Sunnyvale, CA [email protected]
Abstract
Relation classification is an important NLP task to extract relations between entities. The state-of-the-art methods for relation classification are primarily based on Convolutional or Recurrent Neural Networks. Recently, the pre-trained BERT model achieves very successful results in many NLP classification / sequence labeling tasks. Relation classification differs from those tasks in that it relies on information of both the sentence and the two target entities. In this paper, we propose a model that both leverages the pre-trained BERT language model and incorporates information from the target entities to tackle the relation classification task. We locate the target entities and transfer the information through the pre-trained architecture, and incorporate the corresponding encoding of the two entities. We achieve significant improvement over the state-of-the-art method on the SemEval-2010 Task 8 relational dataset.
1 Introduction

The task of relation classification is to predict semantic relations between pairs of nominals. Given a sequence of text (usually a sentence) s and a pair of nominals e1 and e2, the objective is to identify the relation between e1 and e2 (Hendrickx et al., 2010). It is an important NLP task which is normally used as an intermediate step in a variety of NLP applications. The following example shows the Component-Whole relation between the nominals "kitchen" and "house": "The [kitchen]e1 is the last renovated part of the [house]e2."

Recently, deep neural networks have been applied to relation classification (Socher et al., 2012; Zeng et al., 2014; Yu et al., 2014; dos Santos et al., 2015; Shen and Huang, 2016; Lee et al., 2019). These methods usually use some features derived from lexical resources such as WordNet or from NLP tools such as dependency parsers and named entity recognizers (NER).

Language model pre-training has been shown to be effective for improving many natural language processing tasks (Dai and Le, 2015; Peters et al., 2017; Radford et al., 2018; Ruder and Howard, 2018; Devlin et al., 2018). The pre-trained BERT model proposed by (Devlin et al., 2018) has had an especially significant impact. It has been applied to multiple NLP tasks and obtains new state-of-the-art results on eleven tasks. The tasks that BERT has been applied to are typically modeled as classification or sequence labeling problems. It has also been applied to the SQuAD question answering problem (Rajpurkar et al., 2016), in which the objective is to find the starting and ending points of an answer span.

As far as we know, the pre-trained BERT model (Devlin et al., 2018) has not been applied to relation classification, which relies not only on the information of the whole sentence but also on the information of the specific target entities. In this paper, we apply the pre-trained BERT model to relation classification. We insert special tokens before and after the target entities before feeding the text to BERT for fine-tuning, in order to identify the locations of the two target entities and transfer that information into the BERT model. We then locate the positions of the two target entities in the output embedding from the BERT model, and use their embeddings as well as the sentence encoding (the embedding of the special first token in the setting of BERT) as the input to a multi-layer neural network for classification. In this way, the model captures both the semantics of the sentence and the two target entities to better fit the relation classification task.

Our contributions are as follows: (1) We put forward an innovative approach to incorporate entity-level information into the pre-trained language model for relation classification. (2) We achieve a new state-of-the-art result for the relation classification task.

2 Related Work

There has been some work with deep learning methods for relation classification, such as (Socher et al., 2012; Zeng et al., 2014; Yu et al., 2014; dos Santos et al., 2015).

The MVRNN model (Socher et al., 2012) applies a recursive neural network (RNN) to relation classification. They assign a matrix-vector representation to every node in a parse tree and compute the representation for the complete sentence from the bottom up according to the syntactic structure of the parse tree. (Zeng et al., 2014) propose a CNN model that incorporates both word embeddings and position features as input. They then concatenate lexical features and the output from the CNN into a single vector and feed it into a softmax layer for prediction.
(Yu et al., 2014) propose a Factor-based Compositional Embedding Model (FCM) by constructing sentence-level and substructure embeddings from word embeddings, through dependency trees and named entities. (Santos et al., 2015) tackle the relation classification task by ranking with a convolutional neural network named CR-CNN; their loss function is based on pairwise ranking. In our work, we take advantage of a pre-trained language model for the relation classification task, without relying on CNN or RNN architectures. (Shen and Huang, 2016) utilize a CNN encoder in conjunction with a sentence representation that weights the words by attention between the target entities and the words in the sentence to perform relation classification. (Wang et al., 2016) propose a convolutional neural network architecture with two levels of attention in order to catch the patterns in heterogeneous contexts to classify relations. (Lee et al., 2019) develop an end-to-end recurrent neural model which incorporates an entity-aware attention mechanism with latent entity typing for relation classification.

There is also related work on relation extraction based on distant supervision, for example (Mintz et al., 2009; Hoffmann et al., 2011; Lin et al., 2016; Ji et al., 2017; Wu et al., 2019). The difference between relation classification on regular data and on distantly supervised data is that the latter may contain a large number of noisy labels. In this paper, we focus on the regular relation classification problem, without noisy labels.
3 Methodology

The pre-trained BERT model (Devlin et al., 2018) is a multi-layer bidirectional Transformer encoder (Vaswani et al., 2017).

The input representation of BERT is designed to represent both a single text sentence and a pair of text sentences in one token sequence. The input representation of each token is constructed by summing the corresponding token, segment and position embeddings.

'[CLS]' is added to the beginning of each sequence as its first token. The final hidden state from the Transformer output corresponding to this first token is used as the sentence representation for classification tasks. In case there are two sentences in a task, '[SEP]' is used to separate the two sentences.

BERT pre-trains its parameters with the masked language model (MLM) objective, which randomly masks some of the tokens from the input and sets the optimization objective to predict the original vocabulary id of each masked word from its context. Unlike left-to-right language model pre-training, the MLM objective allows a state output to utilize both the left and the right context, which makes it possible to pre-train a deep bidirectional Transformer. Besides the masked language model, BERT is also trained on a "next sentence prediction" task that jointly pre-trains text-pair representations.
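To make the preceding description concrete, the following minimal sketch (our illustration, not code from the paper) obtains the per-token hidden states and the '[CLS]' vector from a pre-trained BERT encoder, assuming a recent version of the Hugging Face transformers library as a stand-in for the original BERT release:

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# The tokenizer prepends '[CLS]' and appends '[SEP]'; token, segment and
# position embeddings are summed inside the model.
inputs = tokenizer("The kitchen is the last renovated part of the house.",
                   return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

hidden_states = outputs.last_hidden_state   # shape: (batch, seq_len, hidden_size)
cls_vector = hidden_states[:, 0, :]         # final hidden state of '[CLS]'
```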
Figure 1 shows the architecture of our approach. For a sentence s with two target entities e1 and e2, to make the BERT module capture the location information of the two entities, we insert a special token '$' at both the beginning and the end of the first entity, and a special token '#' at both the beginning and the end of the second entity. We also add '[CLS]' to the beginning of each sequence. For example, after the special tokens are inserted, a sentence with target entities "kitchen" and "house" becomes: "[CLS] The $ kitchen $ is the last renovated part of the # house # ."
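As an illustration of this preprocessing step, here is a small sketch (ours; the function name and the character-offset inputs are assumptions, not part of the paper) that inserts the markers around two entity spans:

```python
def mark_entities(sentence, e1_span, e2_span):
    """Insert '$ ... $' around entity 1 and '# ... #' around entity 2.

    e1_span and e2_span are (start, end) character offsets, with entity 1
    occurring before entity 2. '[CLS]' is added later by the BERT tokenizer.
    """
    (s1, t1), (s2, t2) = e1_span, e2_span
    return (sentence[:s1] + "$ " + sentence[s1:t1] + " $"
            + sentence[t1:s2] + "# " + sentence[s2:t2] + " #"
            + sentence[t2:])

sent = "The kitchen is the last renovated part of the house."
print(mark_entities(sent, (4, 11), (46, 51)))
# -> The $ kitchen $ is the last renovated part of the # house #.
```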
Figure 1: The model architecture.

Given a sentence s with entities e1 and e2, suppose its final hidden state output from the BERT module is H. Suppose vectors H_i to H_j are the final hidden state vectors from BERT for entity e1, and H_k to H_m are the final hidden state vectors from BERT for entity e2. We apply an average operation to get a vector representation for each of the two target entities. Then, after an activation operation (i.e. tanh), we add a fully connected layer to each of the two vectors, and the outputs for e1 and e2 are H'_1 and H'_2 respectively. This process can be mathematically formalized as Equation (1):

H'_1 = W_1 \left[ \tanh\left( \frac{1}{j-i+1} \sum_{t=i}^{j} H_t \right) \right] + b_1
H'_2 = W_2 \left[ \tanh\left( \frac{1}{m-k+1} \sum_{t=k}^{m} H_t \right) \right] + b_2    (1)

We make W_1 and W_2, b_1 and b_2 share the same parameters; in other words, we set W_1 = W_2 and b_1 = b_2. For the final hidden state vector of the first token (i.e. '[CLS]'), we also add an activation operation and a fully connected layer, which is formally expressed as:

H'_0 = W_0 (\tanh(H_0)) + b_0    (2)

Matrices W_0, W_1, W_2 have the same dimensions, i.e. W_0 \in \mathbb{R}^{d \times d}, W_1 \in \mathbb{R}^{d \times d}, W_2 \in \mathbb{R}^{d \times d}, where d is the hidden state size from BERT.

We concatenate H'_0, H'_1, H'_2 and then add a fully connected layer and a softmax layer, which can be expressed as follows:

h'' = W_3 [\mathrm{concat}(H'_0, H'_1, H'_2)] + b_3
p = \mathrm{softmax}(h'')    (3)

where W_3 \in \mathbb{R}^{L \times 3d} (L is the number of relation types), and p is the probability output. In Equations (1), (2) and (3), b_0, b_1, b_2, b_3 are bias vectors.

We use cross entropy as the loss function, and we apply dropout before each fully connected layer during training. We call our approach R-BERT.
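The following PyTorch sketch (ours, not the authors' released implementation) shows one way Equations (1)-(3) can be realized on top of the BERT output; the 0/1 entity masks that select H_i..H_j and H_k..H_m are assumed inputs, and sharing ent_fc between the two entities mirrors the constraint W_1 = W_2, b_1 = b_2:

```python
import torch
import torch.nn as nn

class RBertHead(nn.Module):
    """Classification head of Equations (1)-(3): entity averaging, tanh + FC,
    concatenation with the '[CLS]' vector, and a final linear layer."""

    def __init__(self, hidden_size, num_labels, dropout=0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)                      # before each FC layer
        self.cls_fc = nn.Linear(hidden_size, hidden_size)       # W0, b0
        self.ent_fc = nn.Linear(hidden_size, hidden_size)       # W1 = W2, b1 = b2
        self.out_fc = nn.Linear(3 * hidden_size, num_labels)    # W3, b3

    @staticmethod
    def average_entity(hidden, mask):
        # Average the hidden states H_i..H_j selected by a 0/1 mask over tokens.
        mask = mask.unsqueeze(-1).float()                       # (batch, seq, 1)
        return (hidden * mask).sum(1) / mask.sum(1).clamp(min=1.0)

    def forward(self, hidden, e1_mask, e2_mask):
        # hidden: BERT output H of shape (batch, seq_len, hidden_size)
        h0 = self.cls_fc(self.dropout(torch.tanh(hidden[:, 0])))                          # H'_0
        h1 = self.ent_fc(self.dropout(torch.tanh(self.average_entity(hidden, e1_mask))))  # H'_1
        h2 = self.ent_fc(self.dropout(torch.tanh(self.average_entity(hidden, e2_mask))))  # H'_2
        logits = self.out_fc(self.dropout(torch.cat([h0, h1, h2], dim=-1)))               # h''
        return logits  # softmax + cross entropy applied via nn.CrossEntropyLoss in training
```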
4 Experiments

We use the SemEval-2010 Task 8 dataset in our experiments. The dataset contains nine semantic relation types and one artificial relation type, Other, which means that the relation does not belong to any of the nine types. The nine relation types are Cause-Effect, Component-Whole, Content-Container, Entity-Destination, Entity-Origin, Instrument-Agency, Member-Collection, Message-Topic and Product-Producer. The dataset contains 10,717 sentences, each containing two nominals e1 and e2 and the corresponding relation type. The relation is directional, which means that Component-Whole(e1, e2) is different from Component-Whole(e2, e1). The dataset has already been partitioned into 8,000 training instances and 2,717 test instances. We evaluate our solution by using the SemEval-2010 Task 8 official scorer script, which computes the macro-averaged F1-score over the nine actual relations (excluding Other) and takes directionality into account.
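For reference, the following sketch (ours) approximates this metric in Python: for each of the nine relation types, a prediction counts as correct only if both the relation and its direction match, and Other is excluded from the macro average. The official scorer script remains the authoritative implementation.

```python
RELATIONS = ["Cause-Effect", "Component-Whole", "Content-Container",
             "Entity-Destination", "Entity-Origin", "Instrument-Agency",
             "Member-Collection", "Message-Topic", "Product-Producer"]

def macro_f1(gold, pred):
    """gold/pred: lists of directed labels such as 'Cause-Effect(e1,e2)' or 'Other'."""
    def base(label):
        return label.split("(")[0]   # strip the direction suffix
    f1_scores = []
    for rel in RELATIONS:
        tp = sum(1 for g, p in zip(gold, pred) if base(g) == rel and g == p)
        n_pred = sum(1 for p in pred if base(p) == rel)
        n_gold = sum(1 for g in gold if base(g) == rel)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_gold if n_gold else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        f1_scores.append(f1)
    return sum(f1_scores) / len(f1_scores)
```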
Table 1 shows the major parameters used in our experiments.

Table 1: Parameter settings.

Batch size            16
Max sentence length   128
Adam learning rate    2e-5
Number of epochs      5
Dropout rate          0.1

We add dropout before each add-on layer. For the pre-trained BERT model, we use the uncased basic model. For the parameters of the pre-trained BERT model, please refer to (Devlin et al., 2018) for details.
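As a minimal sketch (ours, not the authors' released code) of how these settings might be wired up, assuming PyTorch, the Hugging Face transformers encoder, and the RBertHead module sketched above:

```python
import torch
from transformers import BertModel

BATCH_SIZE = 16
MAX_SEQ_LENGTH = 128
LEARNING_RATE = 2e-5
NUM_EPOCHS = 5
DROPOUT_RATE = 0.1

encoder = BertModel.from_pretrained("bert-base-uncased")   # uncased basic model
# 19 output labels for SemEval-2010 Task 8: 9 relation types x 2 directions + Other.
head = RBertHead(encoder.config.hidden_size, num_labels=19, dropout=DROPOUT_RATE)

optimizer = torch.optim.Adam(
    list(encoder.parameters()) + list(head.parameters()), lr=LEARNING_RATE)
loss_fn = torch.nn.CrossEntropyLoss()   # cross-entropy loss as described above
```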
We compare our method, R-BERT, against results by multiple methods recently published for the SemEval-2010 Task 8 dataset, including SVM, RNN, MVRNN, CNN+Softmax, FCM, CR-CNN, Attention CNN, Att-Pooling-CNN and Entity Attention Bi-LSTM. The SVM method by (Rink and Harabagiu, 2010) uses a rich feature set in a traditional way and was the best result during the SemEval-2010 Task 8 competition. Details of all other methods are briefly reviewed in Section 2.

Table 2 reports the results. We can see that R-BERT significantly beats all the baseline methods. The macro F1 value of R-BERT is 89.25, which is much better than the previous best result on this dataset.
Table 2: Comparison with results of previously published methods on the SemEval-2010 Task 8 test set.

Method                                        F1
SVM (Rink and Harabagiu, 2010)                82.2
RNN (Socher et al., 2012)                     77.6
MVRNN (Socher et al., 2012)                   82.4
CNN+Softmax (Zeng et al., 2014)               82.7
FCM (Yu et al., 2014)                         83.0
CR-CNN (Santos et al., 2015)                  84.1
Attention CNN (Shen and Huang, 2016)          85.9
Att-Pooling-CNN (Wang et al., 2016)           88.0
Entity Attention Bi-LSTM (Lee et al., 2019)   85.2
R-BERT                                        89.25

We have demonstrated the strong empirical results of the proposed approach. We further want to understand the specific contributions of its components besides the pre-trained BERT component. For this purpose, we create three more configurations.

The first configuration discards the special separate tokens (i.e. '$' and '#') before and after the two entities in the sentence, and also discards the hidden vector outputs of the two entities from the concatenation with the hidden vector output of the sentence. In other words, we add '[CLS]' at the beginning of the sentence, feed the sentence with the two entities into the BERT module, and use the first output vector for classification. We label this method BERT-NO-SEP-NO-ENT.
The second configuration keeps the hidden vector outputs of the two entities but discards the special separate tokens (i.e. '$' and '#') around the entities in the sentence. We label this method BERT-NO-SEP. The third configuration discards the hidden vector outputs of the two entities from the concatenation for classification, but keeps the special separate tokens. We label this method BERT-NO-ENT.

Table 3 reports the results of the ablation study with the above three configurations.

Table 3: Comparison of the BERT-based methods with different components.

Method                    F1
R-BERT-NO-SEP-NO-ENT      81.09
R-BERT-NO-SEP             87.98
R-BERT-NO-ENT             87.99
R-BERT                    89.25

We observe that all three methods perform worse than R-BERT. Of the three, BERT-NO-SEP-NO-ENT performs worst, with an F1 8.16 absolute points below R-BERT. This ablation study demonstrates that both the special separate tokens and the hidden entity vectors make important contributions to our approach.

In relation classification, the relation label depends on both the semantics of the sentence and the two target entities. BERT without the special separate tokens cannot locate the target entities and loses this key information. The special separate tokens help to improve the accuracy because they identify the locations of the two target entities and transfer that information into the BERT model, so that the BERT output contains the location information of the two entities. On the other hand, incorporating the output vectors of the target entities further enriches the information and helps to make more accurate predictions.
5 Conclusion

In this paper, we develop an approach for relation classification by enriching the pre-trained BERT model with entity information. We add special separate tokens to each target entity pair and utilize the sentence vector as well as the target entity representations for classification. We conduct experiments on the SemEval-2010 benchmark dataset, and our results significantly outperform the state-of-the-art methods. One possible direction for future work is to extend the model to distantly supervised relation extraction.
References
Andrew M. Dai and Quoc V. Le. 2015. Semi-supervised Sequence Learning. In Advances in Neural Information Processing Systems 28. 3079–3087.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. CoRR abs/1810.04805. arXiv:1810.04805.

Cícero Nogueira dos Santos, Bing Xiang, and Bowen Zhou. 2015. Classifying Relations by Ranking with Convolutional Neural Networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, ACL 2015. 626–634.

Iris Hendrickx, Su Nam Kim, Zornitsa Kozareva, Preslav Nakov, Diarmuid Ó Séaghdha, Sebastian Padó, Marco Pennacchiotti, Lorenza Romano, and Stan Szpakowicz. 2010. SemEval-2010 Task 8: Multi-way Classification of Semantic Relations Between Pairs of Nominals. In Proceedings of the 5th International Workshop on Semantic Evaluation (SemEval '10). 33–38.

Raphael Hoffmann, Congle Zhang, Xiao Ling, Luke S. Zettlemoyer, and Daniel S. Weld. 2011. Knowledge-Based Weak Supervision for Information Extraction of Overlapping Relations. In The 49th Annual Meeting of the Association for Computational Linguistics, ACL 2011. 541–550.

Guoliang Ji, Kang Liu, Shizhu He, and Jun Zhao. 2017. Distant Supervision for Relation Extraction with Sentence-Level Attention and Entity Descriptions. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, 2017. 3060–3066.

Joohong Lee, Sangwoo Seo, and Yong Suk Choi. 2019. Semantic Relation Classification via Bidirectional LSTM Networks with Entity-aware Attention using Latent Entity Typing. CoRR (2019).

Yankai Lin, Shiqi Shen, Zhiyuan Liu, Huanbo Luan, and Maosong Sun. 2016. Neural Relation Extraction with Selective Attention over Instances. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016.

Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. 2009. Distant Supervision for Relation Extraction Without Labeled Data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP (ACL '09). 1003–1011.

Matthew E. Peters, Waleed Ammar, Chandra Bhagavatula, and Russell Power. 2017. Semi-supervised sequence tagging with bidirectional language models. CoRR abs/1705.00108. arXiv:1705.00108.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding with unsupervised learning. Technical report, OpenAI.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ Questions for Machine Comprehension of Text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2383–2392.

Bryan Rink and Sanda Harabagiu. 2010. UTD: Classifying semantic relations by combining lexical and semantic resources. In Proceedings of the 5th International Workshop on Semantic Evaluation. 256–259.

Sebastian Ruder and Jeremy Howard. 2018. Universal Language Model Fine-tuning for Text Classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 1: Long Papers. 328–339.

Cicero Nogueira Dos Santos, Bing Xiang, and Bowen Zhou. 2015. Classifying Relations by Ranking with Convolutional Neural Networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, (ACL) 2015.

Yatian Shen and Xuanjing Huang. 2016. Attention-based Convolutional Neural Network for Semantic Relation Extraction. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers. 2526–2536.

Richard Socher, Brody Huval, Christopher D. Manning, and Andrew Y. Ng. 2012. Semantic Compositionality Through Recursive Matrix-vector Spaces. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL '12). Stroudsburg, PA, USA, 1201–1211.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.). 5998–6008.

Linlin Wang, Zhu Cao, Gerard de Melo, and Zhiyuan Liu. 2016. Relation Classification via Multi-Level Attention CNNs. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics.

Shanchan Wu, Fan Kai, and Qiong Zhang. 2019. Improving Distantly Supervised Relation Extraction with Neural Noise Converter and Conditional Optimal Selector. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence, 2019.

Mo Yu, Matthew R. Gormley, and Mark Dredze. 2014. Factor-based compositional embedding models. In NIPS Workshop on Learning Semantics.

Daojian Zeng, Kang Liu, Siwei Lai, Guangyou Zhou, and Jun Zhao. 2014. Relation Classification via Convolutional Deep Neural Network. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers.