ETH-DS3Lab at SemEval-2018 Task 7: Effectively Combining Recurrent and Convolutional Neural Networks for Relation Classification and Extraction
Jonathan Rotsztejn, Nora Hollenstein, Ce Zhang
Systems Group, ETH Zurich; IBM Research, Zurich
{rotsztej,noraho}@ethz.ch, [email protected]
Abstract
Reliably detecting relevant relations between entities in unstructured text is a valuable resource for knowledge extraction, which is why it has attracted significant interest in the field of Natural Language Processing. In this paper, we present a system for relation classification and extraction based on an ensemble of convolutional and recurrent neural networks that ranked first in 3 out of the 4 Subtasks at SemEval 2018 Task 7. We provide detailed explanations and grounds for the design choices behind the most relevant features and analyze their importance.
Introduction

One of the current challenges in analyzing unstructured data is to extract valuable knowledge by detecting the relevant entities and the relations between them. The focus of SemEval 2018 Task 7 is on relation classification (assigning a type of relation to an entity pair; Subtask 1) and relation extraction (detecting the existence of a relation between two entities and determining its type; Subtask 2). Moreover, the task distinguishes between relation classification on clean data (i.e., manually annotated entities; Subtask 1.1) and on noisy data (automatically annotated entities; Subtask 1.2). It addresses semantic relations from 6 categories, all of them specific to scientific literature. Relation instances are to be classified into one of the following classes: USAGE, RESULT, MODEL-FEATURE, PART-WHOLE, TOPIC, COMPARE, where the first five are asymmetrical relations and the last is order-independent (see Gábor et al. (2018) for a more detailed description of the task). Since the training data was provided by the task organizers, we focused on supervised methods for relation classification and extraction. Similar systems in the past have been based on Support Vector Machines (Uzuner et al., 2011; Minard et al., 2011), conditional random fields (Sutton and McCallum, 2006), naive Bayes classifiers (Zayaraz et al., 2015) and, more recently, neural networks (Socher et al., 2012; Nguyen and Grishman, 2015; Dligach et al., 2017; Fu et al., 2017; Peng et al., 2017; Zheng et al., 2017).
Figure 1: Feature addition study to evaluate the impact of the most relevant features on the F1 score of the 5-fold cross-validated training set of Subtasks 1.1 and 1.2

Figure 2 shows the full architecture of our system. Its main component is an ensemble of CNNs and RNNs. The CNN architecture closely follows Kim (2014) and Collobert et al. (2011). It consists of an initial embedding layer, followed by a convolutional layer with multiple filter widths and feature maps with a ReLU activation function, a max-pooling layer (applied over time) and a fully-connected layer that is trained with dropout and produces the output as logits, to which a softmax function is applied to obtain probabilities. The RNN consists of the same initial embedding layer, followed by two LSTM-based sequence models (Hochreiter and Schmidhuber, 1997), one in the forward and one in the backward direction of the sequence, which are dynamic (i.e., they work seamlessly for varying sequence lengths). The output and final hidden states of the forward and backward networks are then concatenated into a single vector. Finally, a fully-connected layer, trained with dropout, connects this vector to the logit outputs, to which a softmax function is applied analogously to obtain probabilities.

Figure 2: Full pipeline architecture

The complete architecture was replicated and trained independently several times (see Table 2) using different random seeds that ensured distinct initial values, sample ordering, etc., in order to form an ensemble of classifiers, whose output probabilities were averaged to obtain the final probabilities for each class. We analyzed and tried several deeper and more complex neural architectures, such as multiple stacked LSTMs (up to 4) and models with 2 to 4 hidden layers, but they did not achieve any significant improvements over the simpler models. Conclusively, the strategy that produced the best results consisted of adequately combining the individual predictions of the single models (see Section 4).
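To make the two classifier types and the ensemble averaging concrete, the following is a minimal sketch written by us in tf.keras; it is not the authors' released code. Layer sizes follow Table 2, while the vocabulary size, sequence length and class count are assumed placeholder values.

```python
# Minimal sketch (ours, not the authors' code) of the CNN and RNN classifiers
# and the ensemble averaging. Layer sizes follow Table 2; VOCAB_SIZE, MAX_LEN
# and N_CLASSES are assumed placeholders.
import tensorflow as tf

VOCAB_SIZE, EMB_DIM, MAX_LEN, N_CLASSES = 30000, 200, 50, 6

def build_cnn():
    inputs = tf.keras.Input(shape=(MAX_LEN,), dtype="int32")
    x = tf.keras.layers.Embedding(VOCAB_SIZE, EMB_DIM)(inputs)
    # Convolutions with multiple filter widths (2 to 7), max-pooled over time
    pooled = [
        tf.keras.layers.GlobalMaxPooling1D()(
            tf.keras.layers.Conv1D(192, width, activation="relu")(x)
        )
        for width in range(2, 8)
    ]
    x = tf.keras.layers.Dropout(0.5)(tf.keras.layers.Concatenate()(pooled))
    outputs = tf.keras.layers.Dense(N_CLASSES, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)

def build_rnn():
    inputs = tf.keras.Input(shape=(MAX_LEN,), dtype="int32")
    x = tf.keras.layers.Embedding(VOCAB_SIZE, EMB_DIM)(inputs)
    # Forward and backward LSTMs whose outputs are concatenated into one vector
    x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(600))(x)
    x = tf.keras.layers.Dropout(0.5)(x)
    outputs = tf.keras.layers.Dense(N_CLASSES, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)

def ensemble_probabilities(models, batch):
    # Average the output probabilities of all independently trained members
    return tf.reduce_mean(tf.stack([m(batch) for m in models]), axis=0)
```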
We collected additional domain-specific data from scientific NLP papers to train word embeddings. All arXiv cs.CL abstracts since 2010 (1 million tokens) and the ACL ARC corpus (90 million tokens; Bird et al., 2008) were downloaded and preprocessed. We used gensim (Řehůřek and Sojka, 2010) to train word2vec embeddings on these two data sources, and additionally on the sentences provided as training data for the SemEval task (in total: 91,304,581 tokens). We experimented with embeddings of 100, 200 and 300 dimensions, where 200 dimensions yielded the best performance for the task, as shown in Figure 3.
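The embedding training step itself is short; the sketch below assumes the gensim (version 4 or later) word2vec API, and the corpus file name is hypothetical.

```python
# Sketch of the embedding training, assuming gensim >= 4; the corpus file
# name is hypothetical (one preprocessed, tokenized sentence per line:
# arXiv cs.CL abstracts + ACL ARC corpus + SemEval training sentences).
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

sentences = LineSentence("nlp_corpus.txt")
model = Word2Vec(sentences, vector_size=200, window=5, min_count=5, workers=4)
model.wv.save("embeddings_200d.kv")
```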
Trimming sentences

Since the most relevant portion of text for determining the relation type is generally the one contained between and including the entities (Lee et al., 2017), we solely analyzed that part of the sentences and disregarded the surrounding words.
Figure 3: Effect of different word embedding types based on a simple CNN classifier for Subtask 1.1

Figure 4: Effect of max. length threshold on accuracy for a preliminary RNN-based classifier

For Subtask 2, we initially considered every entity pair contained within a single sentence as having a potential relation. Since the probability that a relation between two entities exists drops very rapidly with increasing word distance between them (see Figure 5), we only considered sentences that did not exceed a maximum length threshold (see Table 2) between entities, to diminish the chances of predicting false positives in long sentences. Various experiments with different thresholds between 7 and 23 words on the training set showed that the best results on sentences from scientific papers are achieved with a threshold of 19 words, as shown in Figure 4.
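The candidate filter reduces to a few lines; this sketch is ours and the function name is hypothetical.

```python
# Illustrative sketch (ours): keep only candidate entity pairs whose
# between-entity token distance does not exceed the 19-word threshold.
MAX_DISTANCE = 19  # best value found in the explored 7-23 range (Figure 4)

def candidate_pairs(entity_positions):
    """entity_positions: list of (entity_id, token_index) in one sentence."""
    pairs = []
    for i, (id_a, pos_a) in enumerate(entity_positions):
        for id_b, pos_b in entity_positions[i + 1:]:
            if abs(pos_b - pos_a) <= MAX_DISTANCE:
                pairs.append((id_a, id_b))  # potential relation to classify
    return pairs
```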
Figure 5: Word distance between entities in a relation for training data in Subtask 1.1

Figure 6: Example of a reversed sentence (PART-WHOLE class): reversing "<e> corpus <e> consists of independent <e> text <e>" yields "<e> text <e> independent of consists <e> corpus <e>", which structurally resembles the ordered instance "<e> texts <e> from a <e> target corpus <e>"

Cleaning sentences

Some of the automatically annotated samples contained nested entities, such as <entity id="L08-1220.16">signal <entity id="L08-1220.17">processing</entity></entity>. We flattened these structures into simple entities and considered all the entities separately for each train and test instance. Moreover, all tokens between brackets [] and parentheses () were deleted, and numbers that were not part of a proper noun were replaced with a single wildcard token.
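A rough sketch of the surface cleaning steps follows; the regular expressions, the wildcard token name and the (omitted) proper-noun check are our assumptions, not the authors' code.

```python
# Rough sketch (ours) of the surface cleaning: delete bracketed spans and
# replace free-standing numbers with a wildcard token ("<num>" is assumed;
# the proper-noun exception from the text is not handled here).
import re

BRACKETED = re.compile(r"\[[^\]]*\]|\([^)]*\)")        # [...] and (...) spans
NUMBER = re.compile(r"(?<![\w<])\d+(?:[.,]\d+)*(?![\w>-])")

def clean(sentence: str) -> str:
    sentence = BRACKETED.sub(" ", sentence)            # drop bracketed tokens
    sentence = NUMBER.sub("<num>", sentence)           # wildcard for numbers
    return re.sub(r"\s+", " ", sentence).strip()       # normalize whitespace

print(clean("We obtain 91.2 (see [3]) with 200-dimensional embeddings."))
# -> "We obtain <num> with 200-dimensional embeddings."
```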
Using entity tags
In order to provide the neural networks with explicit cues of where an entity started and ended, we used a single symbol, represented as an XML tag (<e>), to mark the boundaries of both entities (see the examples in Figure 6).
Relative order strategy & number of classes
As mentioned in Section 1, 5 out of the 6 relation types are asymmetrical, and the tagging is always done using the same order for the entities as the one found in the abstracts' text/title. For that reason, it was important to carefully devise a scheme that allowed generalization by exploiting the information from both ordered and reversed (words that will be treated here as antonyms) relations. Apart from using the relative position embeddings presented by Lee et al. (2017), for Subtask 1 we incorporated a full text reversal of those sentences in which a reverse relation was present, both at training and testing time. The results were instances that, although not corresponding to valid English grammar, frequently resembled their ordered counterparts more closely in structure. This is illustrated by an example of two instances belonging to the PART-WHOLE class in Figure 6. Thus, the system could operate using only the 6 originally specified relation types and merely learn how to identify ordered relations, rather than having to handle two different types of patterns or to add extra classes to describe both the ordered and the reversed versions of each class, which helped improve the overall accuracy of the classifier (+2.0% F1).

For Subtask 2, since no information regarding the ordering of the arguments was available (the extraction and the ordering were part of the task), we opted for a 12-class strategy: one for each of the 5 ordered and reversed relations, plus the symmetrical relation (COMPARE) and a NONE class for the negative instances, i.e., those that did not contain any relation at all. An alternative 6-class approach based on presenting the sentences both ordered and reversed to the network, computing two predictions for each, and afterwards consolidating both did not produce good results (-3.4% F1).
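The reversal itself is a simple token-order inversion of the trimmed span, entity tags included; the sketch below is our illustration, and the assertion reproduces the example from Figure 6.

```python
# Sketch (ours) of the full text reversal applied to instances annotated with
# a reverse relation: invert the token order of the span between and including
# the entities, so the instance resembles an ordered one (Figure 6).
def reverse_instance(tokens):
    return list(reversed(tokens))

assert reverse_instance(
    "<e> corpus <e> consists of independent <e> text <e>".split()
) == "<e> text <e> independent of consists <e> corpus <e>".split()
```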
Part-of-speech tags

We used the Stanford CoreNLP tagger (Manning et al., 2014) to obtain POS tags for each word in every sentence in the dataset and trained high-dimensional embeddings for the 36 possible tags defined by the Penn Treebank Project (Marcus et al., 1993). Moreover, the XML tags that identify the entities and the number wildcard received their own corresponding artificial POS tag embeddings (see Figure 2 for a detailed example).
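How the three embedding types come together per token can be sketched as follows; this is our illustration, with dimensionalities taken from Table 2 and vocabulary and sequence sizes assumed.

```python
# Sketch (ours) of the per-token input representation fed to both networks:
# word, POS-tag and relative-position embeddings, concatenated.
import tensorflow as tf

MAX_LEN = 50  # assumed padded sequence length
words    = tf.keras.Input(shape=(MAX_LEN,), dtype="int32")
pos_tags = tf.keras.Input(shape=(MAX_LEN,), dtype="int32")
rel_pos  = tf.keras.Input(shape=(MAX_LEN,), dtype="int32")

w_emb = tf.keras.layers.Embedding(30000, 200)(words)  # initialized from word2vec
p_emb = tf.keras.layers.Embedding(38, 30)(pos_tags)   # 36 PTB tags + 2 artificial
r_emb = tf.keras.layers.Embedding(101, 20)(rel_pos)   # positions rel. to entities
token_repr = tf.keras.layers.Concatenate()([w_emb, p_emb, r_emb])  # (None, 50, 250)
```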
One of the main challenges of the task was the limited size of the training set, which is a common drawback for many novel supervised machine learning tasks. To overcome it, we combined the provided datasets for Subtasks 1.1 and 1.2 to train the models for both Subtasks (+6.2% F1). Furthermore, we leveraged the predictions of our system for Subtasks 1.1 and 1.2 and added them as training data for Subtask 2 (+3.6% F1).

Due to the limited number of training sentences provided, we explored the following approach to augment the data: we generated automatically tagged artificial training samples for Subtask 1 by combining the entities that appeared in the test data with the text between entities and the relation labels of those from the training set (see Table 1). To evaluate the quality of the sentences and augment our data only with sensible instances, we estimated a language model using the KenLM Language Model Toolkit (Heafield, 2011) on the corpus of NLP-related text described in Section 2.2 and evaluated the generated sentences with it. Furthermore, we set a minimum threshold of 5 words for the length of the text between entities, limited the number of sentences generated from each of them to a single instance in order to promote variety, and only kept those sentences that scored a very high probability (above -21 in log scale) against the language model. This process yielded 61 additional samples on the development set (+0.7% F1).

Table 1: Generated sample
Dev set:    <e> predictive performance <e> of our <e> models <e>
Train set:  <e> methods <e> involve the use of probabilistic <e> generative models <e>
New sample: <e> predictive performance <e> involve the use of probabilistic <e> models <e>
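A sketch of the augmentation filter follows, assuming the kenlm Python bindings; the model file name and the exact generation loop (in particular, how the one-sample-per-source limit is applied) are our assumptions, while the -21 log-probability threshold and the 5-word minimum come from the text.

```python
# Sketch (ours) of the augmentation step: combine entities seen in the test
# data with between-entity text and labels from training instances, keeping
# only candidates the KenLM model scores above -21 (log10 probability).
import kenlm

lm = kenlm.Model("nlp_corpus.arpa")  # LM trained on the Section 2.2 corpus

def generate_samples(test_entity_pairs, train_middles):
    """train_middles: (between-entity text, relation label) from training data."""
    for e1, e2 in test_entity_pairs:
        for middle, label in train_middles:
            if len(middle.split()) < 5:     # minimum between-entity length
                continue
            candidate = f"<e> {e1} <e> {middle} <e> {e2} <e>"
            if lm.score(candidate) >= -21.0:
                yield candidate, label
                break                        # single instance, to promote variety
```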
To determine the optimal tuning for our richly parameterized models, we ran a grid search over the parameter space for those parameters that were part of our automatic pipeline. The final values and evaluated ranges are specified in Table 2.

The cross-entropy loss, defined as the cross-entropy between the probability distribution output by the classifier and the one implied by the correct prediction, is one of the most widely used objectives for training neural networks on classification problems (Janocha and Czarnecki, 2017). A shortcoming of this approach is that the cross-entropy loss usually constitutes only a conveniently decomposable proxy for the ultimate goal of the optimization (Eban et al., 2017): in this case, the macro-averaged F1 score. Motivated by the fact that individual instances of infrequent classes have a bigger impact on the final F1 score than those of more frequent ones (Manning et al., 2008), we opted for a weighted version of the cross-entropy as loss function, where each class had a weight $w_{class_i}$ that was inversely proportional to its frequency in the training set:

$$w_{class_i} = \frac{\sum_j \#class_j}{N_{classes} \cdot \#class_i}$$

where $\#$ indicates the count for a certain class and $N_{classes}$ is the total number of classes. The weights are scaled so as to preserve the expected value of the factor $k_i$ that accompanies the logarithm in the mathematical expression of the loss formula $L = -\sum_i k_i \log(y_i)$, where $k_i = w \cdot y'_i$ for the weighted cross-entropy and $k_i = y'_i$ for the unweighted version, with $y'_i = 1$ for the correct class and $0$ otherwise, and $y_i$ the predicted probability for that class. Illustrating this concept, a single instance of class TOPIC (support of only 6 instances) could account for up to 2.8% of the final score on the test set. This function proved to be a better surrogate for the global final score than the standard cross-entropy (+1.6% F1).
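In code, the weighting amounts to the following; this is our numpy illustration of the formula above, not the TensorFlow graph actually used.

```python
# Sketch (ours) of the class-weighted cross-entropy: weights are inversely
# proportional to class frequency, w_i = (sum_j n_j) / (N_classes * n_i).
import numpy as np

def class_weights(counts):
    counts = np.asarray(counts, dtype=float)   # training instances per class
    return counts.sum() / (len(counts) * counts)

def weighted_cross_entropy(probs, labels, weights):
    # probs: (batch, n_classes) softmax outputs; labels: integer class ids
    picked = probs[np.arange(len(labels)), labels]
    return -np.mean(weights[labels] * np.log(picked))
```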
Parameter                                          Final value   Experiment range
Word embedding dimensionality                      200           100-300
Embedding dimensionality for part-of-speech tags   30            10-50
Embedding dimensionality for relative positions    20            10-50
Number of CNN filters                              192           64-384
Sizes of CNN filters                               2 to 7        2-4 to 5-9
Norm regularization parameter (λ)                  0.01          0.0-1.0
Number of LSTM units (RNN)                         600           0-2400
Dropout probability (CNN and RNN)                  0.5           0.0-0.7
Initial learning rate                              0.01          0.001-0.1
Number of epochs (Subtask 1)                       200           20-400
Number of epochs (Subtask 2)                       10            5-40
Ensemble size                                      20            1-30
Training batch size                                64            32-192
Upsampling ratio (only Subtask 2)                  1.0           0.0-5.0
Max. sentence length (only Subtask 2)              19            7-23

Table 2: Final parameter values and their explored ranges
Figure 7: Class frequencies for Subtask 2 (classes shown: TOPIC-R, RESULT-R, COMPARE, RESULT, MODEL-FEAT-R, PART-WHOLE-R, TOPIC, PART-WHOLE, USAGE-R, MODEL-FEAT, USAGE, NONE; the NONE class constitutes the clear majority)
One of the challenges of our approach for Subtask 2 was the existence of a large imbalance between the target classes: the NONE class constituted the clear majority (Figure 7). To overcome it, we resorted to an upsampling scheme, for which we defined an arbitrary ratio of positive to negative examples to present to the networks for the combination of all positive classes (+12.2% F1).

The neural networks were trained using an Adam optimizer with parameter values $\beta_1 = 0.9$, $\beta_2 = 0.999$ and $\epsilon = 10^{-8}$ (the suggested default values in the TensorFlow library (Abadi et al., 2015)), with a step learning rate decay scheme on top of it. This consisted in halving the learning rate every 25 and 1 iterations through the whole dataset for Subtasks 1 and 2, respectively (note: the size of the upsampled dataset for Subtask 2 was about 25 times that of Subtask 1), starting from the initial value determined in Section 3.3. In order to avoid overfitting the development set of each Subtask, we evaluated the quality of our models by applying 5-fold cross-validation on the combined training data of Subtasks 1.1 and 1.2 and on the training data of Subtask 2.

Combining predictions

During development, we observed that similar F1 scores could be achieved by using either a convolutional neural network or a recurrent one separately, but the combination of both outperformed the individual models. Moreover, since the RNN-based architecture had a tendency to obtain better results than its CNN-based counterpart for long sequences, we combined both predictions in such a way that a higher weight was assigned to the RNN predictions for longer sentences by applying

$$w_{rnn,i} = 0.5 + \mathrm{sign}(s_i) \cdot s_i^2, \qquad s_i = \frac{length_i - \min_j(length_j)}{\max_j(length_j) - \min_j(length_j)} - 0.5$$

where $length_i$ is the length of the i-th sentence.
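A sketch of this length-dependent blending, in our own code and following the formula above:

```python
# Sketch (ours) of the length-dependent blending of CNN and RNN predictions:
# w_rnn = 0.5 + sign(s) * s^2, with s the min-max normalized sentence length
# shifted to [-0.5, 0.5], so longer sentences lean on the RNN.
import numpy as np

def blend(cnn_probs, rnn_probs, lengths):
    lengths = np.asarray(lengths, dtype=float)
    s = (lengths - lengths.min()) / (lengths.max() - lengths.min()) - 0.5
    w_rnn = (0.5 + np.sign(s) * s**2)[:, None]   # in [0.25, 0.75], per sentence
    return w_rnn * rnn_probs + (1.0 - w_rnn) * cnn_probs
```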
Post-processing

To enforce consistency with the text annotation scheme, some rules that were not built into the system had to be applied ex post. First, predictions of reversed relations should not be of type COMPARE, since it is the only symmetrical relation. When this condition occurred, we simply predicted the class that had the 2nd highest probability. Second, each entity could only be part of one relation. To address this for Subtask 2, we ran a conflict-solving algorithm that, in case of overlaps, always preferred short relations (cf. Figure 5) and broke ties by choosing the relation with the most frequent class in the training data, and at random when the tie persisted.
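A sketch of the conflict-solving step; the code and data layout are ours, while the preference order (shorter span, then more frequent class, then random) follows the text.

```python
# Sketch (ours) of the ex-post conflict resolution for Subtask 2: each entity
# may take part in at most one relation; overlaps prefer shorter relations,
# then more frequent classes, then a random choice.
import random

def resolve_conflicts(predictions, class_frequency):
    """predictions: list of (entity1, entity2, label, span_length) tuples."""
    ranked = sorted(
        predictions,
        key=lambda p: (p[3], -class_frequency[p[2]], random.random()),
    )
    used, kept = set(), []
    for e1, e2, label, _ in ranked:
        if e1 not in used and e2 not in used:
            kept.append((e1, e2, label))
            used.update((e1, e2))
    return kept
```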
We conducted a feature addition study to evaluate the impact of the most relevant features on the F1 score of the 5-fold cross-validated training/development set of Subtasks 1.1 and 1.2. The results were shown earlier in Figure 1. It can be observed from the plot that substantial gains can be obtained by applying standalone data manipulation techniques that are independent of the type of classifier used, such as combining the data of Subtasks 1.1 and 1.2 (CSD in Figure 1), reversing the sentences (RS), generating additional data (GD), and the preprocessing techniques from Section 2.3. Moreover, as in most machine learning problems, appropriately tuning the model hyperparameters also has a significant impact on the final score.

Table 3: Precision (P), recall (R) and F1-score in % on the test set, by Subtask

Relation type          P       R       F1
COMPARE                100.00  95.24   97.56
MODEL-FEATURE          71.01   74.24   72.59
PART-WHOLE             78.87   80.00   79.43
RESULT                 87.50   70.00   77.78
TOPIC                  50.00   100.00  66.67
USAGE                  87.86   86.86   87.36
Micro-averaged total   82.82   82.82   82.82
Macro-averaged total   79.21   84.39   81.72
Table 4: Detailed results (precision (P), recall (R) and F1-score) in % for each relation type on the test set for Subtask 1.1

After presenting and analyzing the impact of each system feature separately, we show the overall results in this section. The final results on the official test set are presented in Table 3; our system ranked 1st in Subtasks 1.1, 1.2 and 2.C (joint result of classification and extraction) and 2nd for 2.E (relation extraction only). Furthermore, Table 4 shows the differences in performance between relation types for Subtask 1.1.
Conclusion

In this article we presented the winning system of SemEval 2018 Task 7 for relation classification, which also achieved the 2nd place in the relation extraction scenario. Our system, based on an ensemble of CNNs and RNNs, ranked first on 3 out of the 4 Subtasks (relation classification on clean and noisy data, and relation extraction and classification on clean data combined). We tested various approaches to improve the system, such as generating additional training samples and experimenting with different order strategies for asymmetrical relation types. We demonstrated the effectiveness of preprocessing the samples by taking into account their length, marking the entities with explicit tags, defining an adequate surrogate optimization objective and effectively combining the outputs of several different models.

References
Martín Abadi, Ashish Agarwal, et al. 2015. TensorFlow: Large-scale machine learning on heterogeneous systems. Software available from tensorflow.org.

Steven Bird, Robert Dale, Bonnie J. Dorr, Bryan Gibson, Mark Thomas Joseph, Min-Yen Kan, Dongwon Lee, Brett Powley, Dragomir R. Radev, and Yee Fan Tan. 2008. The ACL Anthology Reference Corpus: A reference dataset for bibliographic research in computational linguistics. European Language Resources Association (ELRA).

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(Aug):2493–2537.

Dmitriy Dligach, Timothy Miller, Chen Lin, Steven Bethard, and Guergana Savova. 2017. Neural temporal relation extraction. EACL 2017, page 746.

Elad Eban, Mariano Schain, Alan Mackey, Ariel Gordon, Ryan Rifkin, and Gal Elidan. 2017. Scalable learning of non-decomposable objectives. In Artificial Intelligence and Statistics, pages 832–840.

Lisheng Fu, Thien Huu Nguyen, Bonan Min, and Ralph Grishman. 2017. Domain adaptation for relation extraction with domain adversarial neural network. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers), volume 2, pages 425–429.

Kata Gábor, Davide Buscaldi, Anne-Kathrin Schumann, Behrang QasemiZadeh, Haïfa Zargayouna, and Thierry Charnois. 2018. SemEval-2018 Task 7: Semantic relation extraction and classification in scientific papers. In Proceedings of the 12th International Workshop on Semantic Evaluation (SemEval-2018).

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197. Association for Computational Linguistics.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Katarzyna Janocha and Wojciech Marian Czarnecki. 2017. On loss functions for deep neural networks in classification. arXiv preprint arXiv:1702.05659.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.

Ji Young Lee, Franck Dernoncourt, and Peter Szolovits. 2017. MIT at SemEval-2017 Task 10: Relation extraction with convolutional neural networks. arXiv preprint arXiv:1704.01523.

Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze, et al. 2008. Introduction to Information Retrieval, volume 1. Cambridge University Press, Cambridge.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, pages 55–60.

Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.

Anne-Lyse Minard, Anne-Laure Ligozat, and Brigitte Grau. 2011. Multi-class SVM for relation extraction from clinical reports. In Proceedings of the International Conference Recent Advances in Natural Language Processing 2011, pages 604–609.

Thien Huu Nguyen and Ralph Grishman. 2015. Relation extraction: Perspective from convolutional neural networks. In Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing, pages 39–48.

Nanyun Peng, Hoifung Poon, Chris Quirk, Kristina Toutanova, and Wen-tau Yih. 2017. Cross-sentence n-ary relation extraction with graph LSTMs. arXiv preprint arXiv:1708.03743.

Radim Řehůřek and Petr Sojka. 2010. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta. ELRA. http://is.muni.cz/publication/884893/en.

Richard Socher, Brody Huval, Christopher D. Manning, and Andrew Y. Ng. 2012. Semantic compositionality through recursive matrix-vector spaces. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1201–1211. Association for Computational Linguistics.

Charles Sutton and Andrew McCallum. 2006. An introduction to conditional random fields for relational learning. In Introduction to Statistical Relational Learning, volume 2. MIT Press.

Özlem Uzuner, Brett R. South, Shuying Shen, and Scott L. DuVall. 2011. 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text. Journal of the American Medical Informatics Association, 18(5):552–556.

Godandapani Zayaraz et al. 2015. Concept relation extraction using naïve Bayes classifier for ontology-based question answering systems. Journal of King Saud University - Computer and Information Sciences, 27(1):13–24.

Suncong Zheng, Yuexing Hao, Dongyuan Lu, Hongyun Bao, Jiaming Xu, Hongwei Hao, and Bo Xu. 2017. Joint entity and relation extraction based on a hybrid neural network.