ETH-DS3Lab at SemEval-2018 Task 7: Effectively Combining Recurrent and Convolutional Neural Networks for Relation Classification and Extraction
Jonathan Rotsztejn, Nora Hollenstein, Ce Zhang
Systems Group, ETH Zurich; IBM Research, Zurich
{rotsztej,noraho}@ethz.ch, [email protected]
Abstract
Reliably detecting relevant relations between entities in unstructured text is a valuable resource for knowledge extraction, which is why it has attracted significant interest in the field of Natural Language Processing. In this paper, we present a system for relation classification and extraction based on an ensemble of convolutional and recurrent neural networks that ranked first in 3 out of the 4 Subtasks at SemEval 2018 Task 7. We provide detailed explanations and grounds for the design choices behind the most relevant features and analyze their importance.
Introduction

One of the current challenges in analyzing unstructured data is to extract valuable knowledge by detecting the relevant entities and the relations between them. The focus of SemEval 2018 Task 7 is on relation classification (assigning a type of relation to an entity pair; Subtask 1) and relation extraction (detecting the existence of a relation between two entities and determining its type; Subtask 2). Moreover, the task distinguishes between relation classification on clean data (i.e., manually annotated entities; Subtask 1.1) and on noisy data (automatically annotated entities; Subtask 1.2). It addresses semantic relations from 6 categories, all of them specific to scientific literature. Relation instances are to be classified into one of the following classes: USAGE, RESULT, MODEL-FEATURE, PART-WHOLE, TOPIC, COMPARE, where the first five are asymmetrical relations and the last is order-independent (see Gábor et al. (2018) for a more detailed description of the task). Since the training data was provided by the task organizers, we focused on supervised methods for relation classification and extraction. Similar systems in the past have been based on Support Vector Machines (Uzuner et al., 2011; Minard et al., 2011), conditional random fields (Sutton and McCallum, 2006), naive Bayes classifiers (Zayaraz et al., 2015) and, more recently, neural networks (Socher et al., 2012; Nguyen and Grishman, 2015; Dligach et al., 2017; Fu et al., 2017; Peng et al., 2017; Zheng et al., 2017).
Figure 1: Feature addition study to evaluate the impact of the most relevant features on the F1 score of the 5-fold cross-validated training set of Subtasks 1.1 and 1.2

Figure 2 shows the full architecture of our system. Its main component is an ensemble of CNNs and RNNs. The CNN architecture closely follows Kim (2014) and Collobert et al. (2011). It consists of an initial embedding layer, followed by a convolutional layer with multiple filter widths and feature maps with a ReLU activation function, a max-pooling layer (applied over time) and a fully-connected layer that is trained with dropout and produces the output as logits, to which a softmax function is applied to obtain probabilities. The RNN consists of the same initial embedding layer, followed by two LSTM-based sequence models (Hochreiter and Schmidhuber, 1997), one in the forward and one in the backward direction of the sequence, which are dynamic (i.e., they work seamlessly for varying sequence lengths). The output and final hidden states of the forward and backward networks are then concatenated into a single vector. Finally, a fully-connected layer, trained with dropout, connects this vector to the logit outputs, to which a softmax function is applied analogously to obtain probabilities.

Figure 2: Full pipeline architecture

The complete architecture was replicated and trained independently several times (see Table 2) using different random seeds that ensured distinct initial values, sample ordering, etc., in order to form an ensemble of classifiers, whose output probabilities were averaged to obtain the final probabilities for each class. We analyzed and tried several deeper and more complex neural architectures, such as multiple stacked LSTMs (up to 4) and models with 2 to 4 hidden layers, but they did not achieve any significant improvements over the simpler models. Conclusively, the strategy that produced the best results consisted of adequately combining the individual predictions of the single models (see Section 4).
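To make the two classifier types and the ensemble averaging concrete, the following is a minimal sketch written by us in tf.keras; it is not the authors' released code. Layer sizes follow Table 2, while the vocabulary size, sequence length and class count are assumed placeholder values.

```python
# Minimal sketch (ours, not the authors' code) of the CNN and RNN classifiers
# and the ensemble averaging. Layer sizes follow Table 2; VOCAB_SIZE, MAX_LEN
# and N_CLASSES are assumed placeholders.
import tensorflow as tf

VOCAB_SIZE, EMB_DIM, MAX_LEN, N_CLASSES = 30000, 200, 50, 6

def build_cnn():
    inputs = tf.keras.Input(shape=(MAX_LEN,), dtype="int32")
    x = tf.keras.layers.Embedding(VOCAB_SIZE, EMB_DIM)(inputs)
    # Convolutions with multiple filter widths (2 to 7), max-pooled over time
    pooled = [
        tf.keras.layers.GlobalMaxPooling1D()(
            tf.keras.layers.Conv1D(192, width, activation="relu")(x)
        )
        for width in range(2, 8)
    ]
    x = tf.keras.layers.Dropout(0.5)(tf.keras.layers.Concatenate()(pooled))
    outputs = tf.keras.layers.Dense(N_CLASSES, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)

def build_rnn():
    inputs = tf.keras.Input(shape=(MAX_LEN,), dtype="int32")
    x = tf.keras.layers.Embedding(VOCAB_SIZE, EMB_DIM)(inputs)
    # Forward and backward LSTMs whose outputs are concatenated into one vector
    x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(600))(x)
    x = tf.keras.layers.Dropout(0.5)(x)
    outputs = tf.keras.layers.Dense(N_CLASSES, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)

def ensemble_probabilities(models, batch):
    # Average the output probabilities of all independently trained members
    return tf.reduce_mean(tf.stack([m(batch) for m in models]), axis=0)
```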
We collected additional domain-specific data from scientific NLP papers to train word embeddings. All arXiv cs.CL abstracts since 2010 (1 million tokens) and the ACL ARC corpus (90 million tokens; Bird et al., 2008) were downloaded and preprocessed. We used gensim (Řehůřek and Sojka, 2010) to train word2vec embeddings on these two data sources, and additionally on the sentences provided as training data for the SemEval task (in total: 91,304,581 tokens). We experimented with embeddings of 100, 200 and 300 dimensions, where 200 dimensions yielded the best performance for the task, as shown in Figure 3.
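The embedding training step itself is short; the sketch below assumes the gensim (version 4 or later) word2vec API, and the corpus file name is hypothetical.

```python
# Sketch of the embedding training, assuming gensim >= 4; the corpus file
# name is hypothetical (one preprocessed, tokenized sentence per line:
# arXiv cs.CL abstracts + ACL ARC corpus + SemEval training sentences).
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

sentences = LineSentence("nlp_corpus.txt")
model = Word2Vec(sentences, vector_size=200, window=5, min_count=5, workers=4)
model.wv.save("embeddings_200d.kv")
```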
Trimming sentences

Since the most relevant portion of text for determining the relation type is generally the one contained between and including the entities (Lee et al., 2017), we solely analyzed that part of the sentences and disregarded the surrounding words.
Figure 3: Effect of different word embedding types based on a simple CNN classifier for Subtask 1.1

Figure 4: Effect of max. length threshold on accuracy for a preliminary RNN-based classifier

For Subtask 2, we initially considered every entity pair contained within a single sentence as having a potential relation. Since the probability that a relation between two entities exists drops very rapidly with increasing word distance between them (see Figure 5), we only considered sentences that did not exceed a maximum length threshold (see Table 2) between entities, to diminish the chances of predicting false positives in long sentences. Various experiments with different thresholds between 7 and 23 words on the training set showed that the best results on sentences from scientific papers are achieved with a threshold of 19 words, as shown in Figure 4.
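The candidate filter reduces to a few lines; this sketch is ours and the function name is hypothetical.

```python
# Illustrative sketch (ours): keep only candidate entity pairs whose
# between-entity token distance does not exceed the 19-word threshold.
MAX_DISTANCE = 19  # best value found in the explored 7-23 range (Figure 4)

def candidate_pairs(entity_positions):
    """entity_positions: list of (entity_id, token_index) in one sentence."""
    pairs = []
    for i, (id_a, pos_a) in enumerate(entity_positions):
        for id_b, pos_b in entity_positions[i + 1:]:
            if abs(pos_b - pos_a) <= MAX_DISTANCE:
                pairs.append((id_a, id_b))  # potential relation to classify
    return pairs
```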
Figure 5: Word distance between entities in a relation for training data in Subtask 1.1

Figure 6: Example of a reversed sentence (PART-WHOLE class): reversing "<e> corpus <e> consists of independent <e> text <e>" yields "<e> text <e> independent of consists <e> corpus <e>", which structurally resembles the ordered instance "<e> texts <e> from a <e> target corpus <e>"

Cleaning sentences

Some of the automatically annotated samples contained nested entities, such as <entity id="L08-1220.16">signal <entity id="L08-1220.17">processing</entity></entity>. We flattened these structures into simple entities and considered all the entities separately for each train and test instance. Moreover, all tokens between brackets [] and parentheses () were deleted, and numbers that were not part of a proper noun were replaced with a single wildcard token.
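A rough sketch of the surface cleaning steps follows; the regular expressions, the wildcard token name and the (omitted) proper-noun check are our assumptions, not the authors' code.

```python
# Rough sketch (ours) of the surface cleaning: delete bracketed spans and
# replace free-standing numbers with a wildcard token ("<num>" is assumed;
# the proper-noun exception from the text is not handled here).
import re

BRACKETED = re.compile(r"\[[^\]]*\]|\([^)]*\)")        # [...] and (...) spans
NUMBER = re.compile(r"(?<![\w<])\d+(?:[.,]\d+)*(?![\w>-])")

def clean(sentence: str) -> str:
    sentence = BRACKETED.sub(" ", sentence)            # drop bracketed tokens
    sentence = NUMBER.sub("<num>", sentence)           # wildcard for numbers
    return re.sub(r"\s+", " ", sentence).strip()       # normalize whitespace

print(clean("We obtain 91.2 (see [3]) with 200-dimensional embeddings."))
# -> "We obtain <num> with 200-dimensional embeddings."
```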
Using entity tags
In order to provide the neural networks with explicit cues of where an entity started and ended, we used a single symbol, represented as an XML tag (<e>), to mark the boundaries of both entities (see the examples in Figure 6).
Relative order strategy & number of classes
As mentioned in Section 1, 5 out of the 6 relation types are asymmetrical, and the tagging is always done using the same order for the entities as the one found in the abstracts' text/title. For that reason, it was important to carefully devise a scheme that allowed generalization by exploiting the information from both ordered and reversed (words that will be treated here as antonyms) relations. Apart from using the relative position embeddings presented by Lee et al. (2017), for Subtask 1 we incorporated a full text reversal of those sentences in which a reverse relation was present, both at training and testing time. The results were instances that, although not corresponding to valid English grammar, frequently resembled their ordered counterparts more closely in structure. This is illustrated by an example of two instances belonging to the PART-WHOLE class in Figure 6. Thus, the system could operate using only the 6 originally specified relation types and merely learn how to identify ordered relations, rather than having to handle two different types of patterns or to add extra classes to describe both the ordered and the reversed versions of each class, which helped improve the overall accuracy of the classifier (+2.0% F1).

For Subtask 2, since no information regarding the ordering of the arguments was available (the extraction and the ordering were part of the task), we opted for a 12-class strategy: one for each of the 5 ordered and reversed relations, plus the symmetrical relation (COMPARE) and a NONE class for the negative instances, i.e., those that did not contain any relation at all. An alternative 6-class approach based on presenting the sentences both ordered and reversed to the network, computing two predictions for each, and afterwards consolidating both did not produce good results (-3.4% F1).
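The reversal itself is a simple token-order inversion of the trimmed span, entity tags included; the sketch below is our illustration, and the assertion reproduces the example from Figure 6.

```python
# Sketch (ours) of the full text reversal applied to instances annotated with
# a reverse relation: invert the token order of the span between and including
# the entities, so the instance resembles an ordered one (Figure 6).
def reverse_instance(tokens):
    return list(reversed(tokens))

assert reverse_instance(
    "<e> corpus <e> consists of independent <e> text <e>".split()
) == "<e> text <e> independent of consists <e> corpus <e>".split()
```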
Part-of-speech tags

We used the Stanford CoreNLP tagger (Manning et al., 2014) to obtain POS tags for each word in every sentence in the dataset and trained high-dimensional embeddings for the 36 possible tags defined by the Penn Treebank Project (Marcus et al., 1993). Moreover, the XML tags that identify the entities and the number wildcard received their own corresponding artificial POS tag embeddings (see Figure 2 for a detailed example).
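How the three embedding types come together per token can be sketched as follows; this is our illustration, with dimensionalities taken from Table 2 and vocabulary and sequence sizes assumed.

```python
# Sketch (ours) of the per-token input representation fed to both networks:
# word, POS-tag and relative-position embeddings, concatenated.
import tensorflow as tf

MAX_LEN = 50  # assumed padded sequence length
words    = tf.keras.Input(shape=(MAX_LEN,), dtype="int32")
pos_tags = tf.keras.Input(shape=(MAX_LEN,), dtype="int32")
rel_pos  = tf.keras.Input(shape=(MAX_LEN,), dtype="int32")

w_emb = tf.keras.layers.Embedding(30000, 200)(words)  # initialized from word2vec
p_emb = tf.keras.layers.Embedding(38, 30)(pos_tags)   # 36 PTB tags + 2 artificial
r_emb = tf.keras.layers.Embedding(101, 20)(rel_pos)   # positions rel. to entities
token_repr = tf.keras.layers.Concatenate()([w_emb, p_emb, r_emb])  # (None, 50, 250)
```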
One of the main challenges of the task was the limited size of the training set, which is a common drawback for many novel supervised machine learning tasks. To overcome it, we combined the provided datasets for Subtasks 1.1 and 1.2 to train the models for both Subtasks (+6.2% F1). Furthermore, we leveraged the predictions of our system for Subtasks 1.1 and 1.2 and added them as training data for Subtask 2 (+3.6% F1).

Due to the limited number of training sentences provided, we explored the following approach to augment the data: we generated automatically tagged artificial training samples for Subtask 1 by combining the entities that appeared in the test data with the text between entities and the relation labels of those from the training set (see Table 1). To evaluate the quality of the sentences and augment our data only with sensible instances, we estimated a language model using the KenLM Language Model Toolkit (Heafield, 2011) on the corpus of NLP-related text described in Section 2.2 and evaluated the generated sentences with it. Furthermore, we set a minimum threshold of 5 words for the length of the text between entities, limited the number of sentences generated from each of them to a single instance in order to promote variety, and only kept those sentences that scored a very high probability (above -21 in log scale) against the language model. This process yielded 61 additional samples on the development set (+0.7% F1).

Table 1: Generated sample
Dev set:    <e> predictive performance <e> of our <e> models <e>
Train set:  <e> methods <e> involve the use of probabilistic <e> generative models <e>
New sample: <e> predictive performance <e> involve the use of probabilistic <e> models <e>
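A sketch of the augmentation filter follows, assuming the kenlm Python bindings; the model file name and the exact generation loop (in particular, how the one-sample-per-source limit is applied) are our assumptions, while the -21 log-probability threshold and the 5-word minimum come from the text.

```python
# Sketch (ours) of the augmentation step: combine entities seen in the test
# data with between-entity text and labels from training instances, keeping
# only candidates the KenLM model scores above -21 (log10 probability).
import kenlm

lm = kenlm.Model("nlp_corpus.arpa")  # LM trained on the Section 2.2 corpus

def generate_samples(test_entity_pairs, train_middles):
    """train_middles: (between-entity text, relation label) from training data."""
    for e1, e2 in test_entity_pairs:
        for middle, label in train_middles:
            if len(middle.split()) < 5:     # minimum between-entity length
                continue
            candidate = f"<e> {e1} <e> {middle} <e> {e2} <e>"
            if lm.score(candidate) >= -21.0:
                yield candidate, label
                break                        # single instance, to promote variety
```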
To determine the optimal tuning for our richly parameterized models, we ran a grid search over the parameter space for those parameters that were part of our automatic pipeline. The final values and evaluated ranges are specified in Table 2.

The cross-entropy loss, defined as the cross-entropy between the probability distribution output by the classifier and the one implied by the correct prediction, is one of the most widely used objectives for training neural networks on classification problems (Janocha and Czarnecki, 2017). A shortcoming of this approach is that the cross-entropy loss usually constitutes only a conveniently decomposable proxy for the ultimate goal of the optimization (Eban et al., 2017): in this case, the macro-averaged F1 score. Motivated by the fact that individual instances of infrequent classes have a bigger impact on the final F1 score than those of more frequent ones (Manning et al., 2008), we opted for a weighted version of the cross-entropy as loss function, where each class had a weight $w_{class_i}$ that was inversely proportional to its frequency in the training set:

$$w_{class_i} = \frac{\sum_j \#class_j}{N_{classes} \cdot \#class_i}$$

where $\#$ indicates the count for a certain class and $N_{classes}$ is the total number of classes. The weights are scaled so as to preserve the expected value of the factor $k_i$ that accompanies the logarithm in the mathematical expression of the loss formula $L = -\sum_i k_i \log(y_i)$, where $k_i = w \cdot y'_i$ for the weighted cross-entropy and $k_i = y'_i$ for the unweighted version, with $y'_i = 1$ for the correct class and $0$ otherwise, and $y_i$ the predicted probability for that class. Illustrating this concept, a single instance of class TOPIC (support of only 6 instances) could account for up to 2.8% of the final score on the test set. This function proved to be a better surrogate for the global final score than the standard cross-entropy (+1.6% F1).
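In code, the weighting amounts to the following; this is our numpy illustration of the formula above, not the TensorFlow graph actually used.

```python
# Sketch (ours) of the class-weighted cross-entropy: weights are inversely
# proportional to class frequency, w_i = (sum_j n_j) / (N_classes * n_i).
import numpy as np

def class_weights(counts):
    counts = np.asarray(counts, dtype=float)   # training instances per class
    return counts.sum() / (len(counts) * counts)

def weighted_cross_entropy(probs, labels, weights):
    # probs: (batch, n_classes) softmax outputs; labels: integer class ids
    picked = probs[np.arange(len(labels)), labels]
    return -np.mean(weights[labels] * np.log(picked))
```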
Parameter                                          Final value   Experiment range
Word embedding dimensionality                      200           100-300
Embedding dimensionality for part-of-speech tags   30            10-50
Embedding dimensionality for relative positions    20            10-50
Number of CNN filters                              192           64-384
Sizes of CNN filters                               2 to 7        2-4 to 5-9
Norm regularization parameter (λ)                  0.01          0.0-1.0
Number of LSTM units (RNN)                         600           0-2400
Dropout probability (CNN and RNN)                  0.5           0.0-0.7
Initial learning rate                              0.01          0.001-0.1
Number of epochs (Subtask 1)                       200           20-400
Number of epochs (Subtask 2)                       10            5-40
Ensemble size                                      20            1-30
Training batch size                                64            32-192
Upsampling ratio (only Subtask 2)                  1.0           0.0-5.0
Max. sentence length (only Subtask 2)              19            7-23

Table 2: Final parameter values and their explored ranges
Figure 7: Class frequencies for Subtask 2 (classes shown: TOPIC-R, RESULT-R, COMPARE, RESULT, MODEL-FEAT-R, PART-WHOLE-R, TOPIC, PART-WHOLE, USAGE-R, MODEL-FEAT, USAGE, NONE; the NONE class constitutes the clear majority)
One of the challenges of our approach for Subtask 2 was the existence of a large imbalance between the target classes: the NONE class constituted the clear majority (Figure 7). To overcome it, we resorted to an upsampling scheme, for which we defined an arbitrary ratio of positive to negative examples to present to the networks for the combination of all positive classes (+12.2% F1).

The neural networks were trained using an Adam optimizer with parameter values $\beta_1 = 0.9$, $\beta_2 = 0.999$ and $\epsilon = 10^{-8}$ (the suggested default values in the TensorFlow library (Abadi et al., 2015)), with a step learning rate decay scheme on top of it. This consisted in halving the learning rate every 25 and 1 iterations through the whole dataset for Subtasks 1 and 2, respectively (note: the size of the upsampled dataset for Subtask 2 was about 25 times that of Subtask 1), starting from the initial value determined in Section 3.3. In order to avoid overfitting the development set of each Subtask, we evaluated the quality of our models by applying 5-fold cross-validation on the combined training data of Subtasks 1.1 and 1.2 and on the training data of Subtask 2.

Combining predictions

During development, we observed that similar F1 scores could be achieved by using either a convolutional neural network or a recurrent one separately, but the combination of both outperformed the individual models. Moreover, since the RNN-based architecture had a tendency to obtain better results than its CNN-based counterpart for long sequences, we combined both predictions in such a way that a higher weight was assigned to the RNN predictions for longer sentences by applying

$$w_{rnn,i} = 0.5 + \mathrm{sign}(s_i) \cdot s_i^2, \qquad s_i = \frac{length_i - \min_j(length_j)}{\max_j(length_j) - \min_j(length_j)} - 0.5$$

where $length_i$ is the length of the i-th sentence.
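A sketch of this length-dependent blending, in our own code and following the formula above:

```python
# Sketch (ours) of the length-dependent blending of CNN and RNN predictions:
# w_rnn = 0.5 + sign(s) * s^2, with s the min-max normalized sentence length
# shifted to [-0.5, 0.5], so longer sentences lean on the RNN.
import numpy as np

def blend(cnn_probs, rnn_probs, lengths):
    lengths = np.asarray(lengths, dtype=float)
    s = (lengths - lengths.min()) / (lengths.max() - lengths.min()) - 0.5
    w_rnn = (0.5 + np.sign(s) * s**2)[:, None]   # in [0.25, 0.75], per sentence
    return w_rnn * rnn_probs + (1.0 - w_rnn) * cnn_probs
```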
Post-processing

To enforce consistency with the text annotation scheme, some rules that were not built into the system had to be applied ex post. First, predictions of reversed relations should not be of type COMPARE, since it is the only symmetrical relation. When this condition occurred, we simply predicted the class that had the 2nd highest probability. Second, each entity could only be part of one relation. To address this for Subtask 2, we ran a conflict-solving algorithm that, in case of overlaps, always preferred short relations (cf. Figure 5) and broke ties by choosing the relation with the most frequent class in the training data, and at random when the tie persisted.
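A sketch of the conflict-solving step; the code and data layout are ours, while the preference order (shorter span, then more frequent class, then random) follows the text.

```python
# Sketch (ours) of the ex-post conflict resolution for Subtask 2: each entity
# may take part in at most one relation; overlaps prefer shorter relations,
# then more frequent classes, then a random choice.
import random

def resolve_conflicts(predictions, class_frequency):
    """predictions: list of (entity1, entity2, label, span_length) tuples."""
    ranked = sorted(
        predictions,
        key=lambda p: (p[3], -class_frequency[p[2]], random.random()),
    )
    used, kept = set(), []
    for e1, e2, label, _ in ranked:
        if e1 not in used and e2 not in used:
            kept.append((e1, e2, label))
            used.update((e1, e2))
    return kept
```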
We conducted a feature addition study to evaluate the impact of the most relevant features on the F1 score of the 5-fold cross-validated training/development set of Subtasks 1.1 and 1.2. The results were shown earlier in Figure 1. It can be observed from the plot that substantial gains can be obtained by applying standalone data manipulation techniques that are independent of the type of classifier used, such as combining the data of Subtasks 1.1 and 1.2 (CSD in Figure 1), reversing the sentences (RS), generating additional data (GD), and the preprocessing techniques from Section 2.3. Moreover, as in most machine learning problems, appropriately tuning the model hyperparameters also has a significant impact on the final score.

Table 3: Precision (P), recall (R) and F1-score in % on the test set, by Subtask

Relation type          P       R       F1
COMPARE                100.00  95.24   97.56
MODEL-FEATURE          71.01   74.24   72.59
PART-WHOLE             78.87   80.00   79.43
RESULT                 87.50   70.00   77.78
TOPIC                  50.00   100.00  66.67
USAGE                  87.86   86.86   87.36
Micro-averaged total   82.82   82.82   82.82
Macro-averaged total   79.21   84.39   81.72
Table 4: Detailed results (precision (P), recall (R) and F1-score) in % for each relation type on the test set for Subtask 1.1

After presenting and analyzing the impact of each system feature separately, we show the overall results in this section. The final results on the official test set are presented in Table 3; our system ranked 1st in Subtasks 1.1, 1.2 and 2.C (joint result of classification and extraction) and 2nd for 2.E (relation extraction only). Furthermore, Table 4 shows the differences in performance between relation types for Subtask 1.1.
Conclusion

In this article we presented the winning system of SemEval 2018 Task 7 for relation classification, which also achieved the 2nd place in the relation extraction scenario. Our system, based on an ensemble of CNNs and RNNs, ranked first on 3 out of the 4 Subtasks (relation classification on clean and noisy data, and relation extraction and classification on clean data combined). We tested various approaches to improve the system, such as generating additional training samples and experimenting with different order strategies for asymmetrical relation types. We demonstrated the effectiveness of preprocessing the samples by taking into account their length, marking the entities with explicit tags, defining an adequate surrogate optimization objective and effectively combining the outputs of several different models.

References
Martín Abadi, Ashish Agarwal, et al. 2015. TensorFlow: Large-scale machine learning on heterogeneous systems. Software available from tensorflow.org.

Steven Bird, Robert Dale, Bonnie J. Dorr, Bryan Gibson, Mark Thomas Joseph, Min-Yen Kan, Dongwon Lee, Brett Powley, Dragomir R. Radev, and Yee Fan Tan. 2008. The ACL Anthology Reference Corpus: A reference dataset for bibliographic research in computational linguistics. European Language Resources Association (ELRA).

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(Aug):2493–2537.

Dmitriy Dligach, Timothy Miller, Chen Lin, Steven Bethard, and Guergana Savova. 2017. Neural temporal relation extraction. EACL 2017, page 746.

Elad Eban, Mariano Schain, Alan Mackey, Ariel Gordon, Ryan Rifkin, and Gal Elidan. 2017. Scalable learning of non-decomposable objectives. In Artificial Intelligence and Statistics, pages 832–840.

Lisheng Fu, Thien Huu Nguyen, Bonan Min, and Ralph Grishman. 2017. Domain adaptation for relation extraction with domain adversarial neural network. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers), volume 2, pages 425–429.

Kata Gábor, Davide Buscaldi, Anne-Kathrin Schumann, Behrang QasemiZadeh, Haïfa Zargayouna, and Thierry Charnois. 2018. SemEval-2018 Task 7: Semantic relation extraction and classification in scientific papers. In Proceedings of the 12th International Workshop on Semantic Evaluation (SemEval-2018).

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197. Association for Computational Linguistics.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Katarzyna Janocha and Wojciech Marian Czarnecki. 2017. On loss functions for deep neural networks in classification. arXiv preprint arXiv:1702.05659.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.

Ji Young Lee, Franck Dernoncourt, and Peter Szolovits. 2017. MIT at SemEval-2017 Task 10: Relation extraction with convolutional neural networks. arXiv preprint arXiv:1704.01523.

Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze, et al. 2008. Introduction to Information Retrieval, volume 1. Cambridge University Press, Cambridge.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, pages 55–60.

Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.

Anne-Lyse Minard, Anne-Laure Ligozat, and Brigitte Grau. 2011. Multi-class SVM for relation extraction from clinical reports. In Proceedings of the International Conference Recent Advances in Natural Language Processing 2011, pages 604–609.

Thien Huu Nguyen and Ralph Grishman. 2015. Relation extraction: Perspective from convolutional neural networks. In Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing, pages 39–48.

Nanyun Peng, Hoifung Poon, Chris Quirk, Kristina Toutanova, and Wen-tau Yih. 2017. Cross-sentence n-ary relation extraction with graph LSTMs. arXiv preprint arXiv:1708.03743.

Radim Řehůřek and Petr Sojka. 2010. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta. ELRA. http://is.muni.cz/publication/884893/en.

Richard Socher, Brody Huval, Christopher D. Manning, and Andrew Y. Ng. 2012. Semantic compositionality through recursive matrix-vector spaces. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1201–1211. Association for Computational Linguistics.

Charles Sutton and Andrew McCallum. 2006. An introduction to conditional random fields for relational learning. In Introduction to Statistical Relational Learning, volume 2. MIT Press.

Özlem Uzuner, Brett R. South, Shuying Shen, and Scott L. DuVall. 2011. 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text. Journal of the American Medical Informatics Association, 18(5):552–556.

Godandapani Zayaraz et al. 2015. Concept relation extraction using naïve Bayes classifier for ontology-based question answering systems. Journal of King Saud University - Computer and Information Sciences, 27(1):13–24.

Suncong Zheng, Yuexing Hao, Dongyuan Lu, Hongyun Bao, Jiaming Xu, Hongwei Hao, and Bo Xu. 2017. Joint entity and relation extraction based on a hybrid neural network.