Classifying Relations via Long Short Term Memory Networks along Shortest Dependency Paths
Yan Xu,† Lili Mou,† Ge Li,†∗ Yunchuan Chen,‡ Hao Peng,† Zhi Jin†∗
†Software Institute, Peking University, 100871, P. R. China
{xuyan14,lige,zhijin}@sei.pku.edu.cn, {doublepower.mou,penghao.pku}@gmail.com
‡University of Chinese Academy of Sciences, [email protected]
∗Corresponding authors.
Abstract
Relation classification is an important research arena in the field of natural language processing (NLP). In this paper, we present SDP-LSTM, a novel neural network to classify the relation of two entities in a sentence. Our neural architecture leverages the shortest dependency path (SDP) between two entities; multichannel recurrent neural networks, with long short term memory (LSTM) units, pick up heterogeneous information along the SDP. Our proposed model has several distinct features: (1) The shortest dependency paths retain most relevant information (to relation classification), while eliminating irrelevant words in the sentence. (2) The multichannel LSTM networks allow effective information integration from heterogeneous sources over the dependency paths. (3) A customized dropout strategy regularizes the neural network to alleviate overfitting. We test our model on the SemEval 2010 relation classification task, and achieve an F1-score of 83.7%, higher than competing methods in the literature.

1 Introduction

Relation classification is an important NLP task. It plays a key role in various scenarios, e.g., information extraction (Wu and Weld, 2010), question answering (Yao and Van Durme, 2014), medical informatics (Wang and Fan, 2014), ontology learning (Xu et al., 2014), etc. The aim of relation classification is to categorize into predefined classes the relations between pairs of marked entities in given texts. For instance, in the sentence "A trillion gallons of [water]e1 have been poured into an empty [region]e2 of outer space," the entities water and region are of relation
Entity-Destination(e1, e2).

Traditional relation classification approaches rely largely on feature representation (Kambhatla, 2004), or kernel design (Zelenko et al., 2003; Bunescu and Mooney, 2005). The former method usually incorporates a large set of features; it is difficult to improve the model performance if the feature set is not very well chosen. The latter approach, on the other hand, depends largely on the designed kernel, which summarizes all data information. Deep neural networks, emerging recently, provide a way of highly automatic feature learning (Bengio et al., 2013), and have exhibited considerable potential (Zeng et al., 2014; Santos et al., 2015). However, human engineering, that is, incorporating human knowledge into the network's architecture, is still important and beneficial.

This paper proposes a new neural network, SDP-LSTM, for relation classification. Our model utilizes the shortest dependency path (SDP) between two entities in a sentence; we also design a long short term memory (LSTM)-based recurrent neural network for information processing. The neural architecture is mainly inspired by the following observations.

• Shortest dependency paths are informative (Fundel et al., 2007; Chen et al., 2014). To determine the two entities' relation, we find it mostly sufficient to use only the words along the SDP: they concentrate on the most relevant information while diminishing less relevant noise. Figure 1 depicts the dependency parse tree of the aforementioned sentence. Words along the SDP form a trimmed phrase (gallons of water poured into region) of the original sentence, which conveys much information about the target relation. Other words, such as a, trillion, outer space, are less informative and may bring noise if not dealt with properly.

• Direction matters. Dependency trees are a kind of directed graph. The dependency relation between into and region is PREP; such a relation hardly makes any sense if the directed edge is reversed. Moreover, the entities' relation distinguishes its directionality, that is, r(a, b) differs from r(b, a), for a same given relation r and two entities a, b. Therefore, we think it necessary to let the neural model process information in a direction-sensitive manner. Out of this consideration, we separate an SDP into two sub-paths, each from an entity to the common ancestor node. The extracted features along the two sub-paths are concatenated to make the final classification.

• Linguistic information helps. For example, with prior knowledge of hyponymy, we know "water is a kind of substance." This is a hint that the entities, water and region, are more of an
Entity-Destination relation than, say,
Communication-Topic.

To gather heterogeneous information along the SDP, we design a multichannel recurrent neural network. It makes use of information from various sources, including words themselves, POS tags, WordNet hypernyms, and the grammatical relations between governing words and their children.

For effective information propagation and integration, our model leverages LSTM units during recurrent propagation. We also customize a new dropout strategy for our SDP-LSTM network to alleviate the problem of overfitting. To the best of our knowledge, we are the first to use LSTM-based recurrent neural networks for the relation classification task.

We evaluate our proposed method on the SemEval 2010 relation classification task, and achieve an F1-score of 83.7%, higher than competing methods in the literature.

In the rest of this paper, we review related work in Section 2. In Section 3, we describe our SDP-LSTM model in detail. Section 4 presents quantitative experimental results. Finally, we have our conclusion in Section 5.

Figure 1: The dependency parse tree corresponding to the sentence "A trillion gallons of water have been poured into an empty region of outer space." Red lines indicate the shortest dependency path between the entities water and region. An edge a → b refers to a being governed by b. Dependency types are labeled by the parser, but not presented in the figure for clarity.

2 Related Work

Relation classification is a widely studied task in the NLP community. Various existing methods mainly fall into three classes: feature-based, kernel-based, and neural network-based.

In feature-based approaches, different sets of features are extracted and fed to a chosen classifier (e.g., logistic regression). Generally, three types of features are often used. Lexical features concentrate on the entities of interest, e.g., entities per se, entity POS, entity neighboring information. Syntactic features include chunking, parse trees, etc. Semantic features are exemplified by the concept hierarchy, entity class, and entity mention. Kambhatla (2004) uses a maximum entropy model to combine these features for relation classification. However, different sets of handcrafted features are largely complementary to each other (e.g., hypernyms versus named-entity tags), and thus it is hard to improve performance in this way (Zhou et al., 2005).

Kernel-based approaches specify some measure of similarity between two data samples, without explicit feature representation. Zelenko et al. (2003) compute the similarity of two trees by utilizing their common subtrees. Bunescu and Mooney (2005) propose a shortest path dependency kernel for relation classification. Its main idea is that the relation strongly relies on the dependency path between two given entities. Wang (2008) provides a systematic analysis of several kernels and shows that relation extraction can benefit from combining convolution kernels and syntactic features. Plank and Moschitti (2013) introduce semantic information into kernel methods in addition to considering structural information only. One potential difficulty of kernel methods is that all data information is completely summarized by the kernel function (similarity measure), and thus designing an effective kernel becomes crucial.

Deep neural networks, emerging recently, can learn underlying features automatically, and have attracted growing interest in the literature. Socher et al.
(2011) propose a recursive neural network (RNN) along sentences' parse trees for sentiment analysis; such a model can also be used to classify relations (Socher et al., 2012). Hashimoto et al. (2013) explicitly weight phrases' importance in RNNs to improve performance. Ebrahimi and Dou (2015) rebuild an RNN on the dependency path between two marked entities. Zeng et al. (2014) explore convolutional neural networks, by which they utilize sequential information of sentences. Santos et al. (2015) also use the convolutional network; besides, they propose a ranking loss function with data cleaning, and achieve the state-of-the-art result in SemEval-2010 Task 8.

In addition to the above studies, which mainly focus on relation classification approaches and models, other related research trends include information extraction from Web documents in a semi-supervised manner (Bunescu and Mooney, 2007; Banko et al., 2007), dealing with small datasets without enough labels by distant supervision techniques (Mintz et al., 2009), etc.

3 The Proposed SDP-LSTM Model

In this section, we describe our SDP-LSTM model in detail. Subsection 3.1 delineates the overall architecture of our model. Subsection 3.2 presents the rationale of using SDPs. Four different information channels along the SDP are explained in Subsection 3.3. Subsection 3.4 introduces the recurrent neural network with long short term memory, which is built upon the dependency path. Subsection 3.5 customizes a dropout strategy for our network to alleviate overfitting. We finally present our training objective in Subsection 3.6.
3.1 Overview

Figure 2 depicts the overall architecture of our SDP-LSTM network.

First, a sentence is parsed to a dependency tree by the Stanford parser (http://nlp.stanford.edu/software/lex-parser.shtml); the shortest dependency path (SDP) is extracted as the input of our network. Along the SDP, four different types of information, referred to as channels, are used, including the words, POS tags, grammatical relations, and WordNet hypernyms. (See Figure 2a.) In each channel, discrete inputs, e.g., words, are mapped to real-valued vectors, called embeddings, which capture the underlying meanings of the inputs.

Two recurrent neural networks (Figure 2b) pick up information along the left and right sub-paths of the SDP, respectively. (The path is separated by the common ancestor node of the two entities.) Long short term memory (LSTM) units are used in the recurrent networks for effective information propagation. A max pooling layer thereafter gathers information from the LSTM nodes in each path.

The pooling layers from different channels are concatenated, and then connected to a hidden layer. Finally, we have a softmax output layer for classification. (See again Figure 2a.)

Figure 2: (a) The overall architecture of SDP-LSTM. (b) One channel of the recurrent neural networks built upon the shortest dependency path. The channels are words, part-of-speech (POS) tags, grammatical relations (abbreviated as GR in the figure), and WordNet hypernyms.

3.2 The Shortest Dependency Path

The dependency parse tree is naturally suitable for relation classification because it focuses on the action and agents in a sentence (Socher et al., 2014). Moreover, the shortest path between entities, as discussed in Section 1, condenses the most illuminating information for the entities' relation.

We also observe that the sub-paths, separated by the common ancestor node of the two entities, provide strong hints for the relation's directionality. Take Figure 1 as an example. The two entities water and region have their common ancestor node, poured, which separates the SDP into two parts:

[water]e1 → of → gallons → poured

and

poured ← into ← [region]e2

The first sub-path captures information of e1, whereas the second sub-path is mainly about e2. By examining the two sub-paths separately, we know e1 and e2 are of relation Entity-Destination(e1, e2), rather than Entity-Destination(e2, e1).

Following the above intuition, we design two recurrent neural networks, which propagate bottom-up from the entities to their common ancestor. In this way, our model is direction-sensitive.
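To make this splitting step concrete, the short Python sketch below (an illustration only, not the authors' implementation) walks from each entity up the head chain produced by a dependency parser and splits the shortest dependency path at the lowest common ancestor; the toy heads dictionary and token names are hypothetical.

    def shortest_dependency_path(heads, e1, e2):
        """Split the SDP between two entity tokens at their lowest common ancestor.

        heads: dict mapping each token to its head (governor); the root maps to None.
        e1, e2: the two marked entity tokens.
        Returns (left_sub_path, right_sub_path); both sub-paths end at the ancestor.
        """
        def ancestors(tok):
            chain = [tok]
            while heads[tok] is not None:
                tok = heads[tok]
                chain.append(tok)
            return chain

        chain1, chain2 = ancestors(e1), ancestors(e2)
        common = next(tok for tok in chain1 if tok in set(chain2))  # lowest common ancestor
        left = chain1[:chain1.index(common) + 1]    # e1 -> ... -> ancestor
        right = chain2[:chain2.index(common) + 1]   # e2 -> ... -> ancestor
        return left, right

    # Toy head map for the example sentence (unique tokens assumed):
    heads = {"water": "of", "of": "gallons", "gallons": "poured",
             "poured": None, "into": "poured", "region": "into"}
    print(shortest_dependency_path(heads, "water", "region"))
    # (['water', 'of', 'gallons', 'poured'], ['region', 'into', 'poured'])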
3.3 Channels

We make use of four types of information along the SDP for relation classification. We call them channels, as these information sources do not interact during recurrent propagation. Detailed channel descriptions are as follows.

• Word representations. Each word in a given sentence is mapped to a real-valued vector by looking it up in a word embedding table. Unsupervisedly trained on a large corpus, word embeddings are thought to be able to well capture words' syntactic and semantic information (Mikolov et al., 2013b).

• Part-of-speech tags. Since word embeddings are obtained on a generic corpus of a large scale, the information they contain may not agree with a specific sentence. We deal with this problem by allying each input word with its POS tag, e.g., noun, verb, etc. In our experiment, we only take into use a coarse-grained POS category, containing 15 different tags.

• Grammatical relations. The dependency relation between a governing word and its children makes a difference in meaning. A same word pair may have different dependency relation types. For example, "beats nsubj→ it" is distinct from "beats dobj→ it." Thus, it is necessary to capture such grammatical relations in SDPs. In our experiment, grammatical relations are grouped into 19 classes, mainly based on a coarse-grained classification (De Marneffe et al., 2006).

• WordNet hypernyms. As illustrated in Section 1, hyponymy information is also useful for relation classification. (Details are not repeated here.) To leverage WordNet hypernyms, we use a tool developed by Ciaramita and Altun (2006) (http://sourceforge.net/projects/supersensetag). The tool assigns a hypernym to each word, from 41 predefined concepts in WordNet, e.g., noun.food, verb.motion, etc. Given its hypernym, each word gains a more abstract concept, which helps to build a linkage between different but conceptually similar words.

As we can see, POS tags, grammatical relations, and WordNet hypernyms are also discrete (like words per se). However, no prevailing embedding learning method exists for, say, POS tags. Hence, we randomly initialize their embeddings, and tune them in a supervised fashion during training. We notice that these information sources contain many fewer symbols (15, 19, and 41, respectively) than the vocabulary size (greater than 25,000). Hence, we believe our strategy of random initialization is feasible, because these embeddings can be adequately tuned during supervised training.
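The per-channel lookup can be pictured with the following minimal sketch (an assumption-laden illustration, not the authors' code): the word table would in practice be loaded from pretrained word2vec vectors, while the POS, grammatical-relation, and WordNet tables are randomly initialized and tuned during training. The dimensionalities mirror those reported in Section 4.2, and the vocabulary size and ids are made up.

    import numpy as np

    rng = np.random.default_rng(0)

    def make_embedding_table(num_symbols, dim):
        # Randomly initialized table, to be tuned during supervised training.
        return rng.uniform(-0.1, 0.1, size=(num_symbols, dim))

    word_table = make_embedding_table(25000, 200)   # placeholder; really loaded from word2vec
    pos_table = make_embedding_table(15, 50)        # 15 coarse-grained POS tags
    gr_table = make_embedding_table(19, 50)         # 19 grammatical relation classes
    wordnet_table = make_embedding_table(41, 50)    # 41 WordNet hypernym concepts

    def lookup(table, symbol_ids):
        # Map a sub-path of discrete symbol ids to a (length, dim) matrix.
        return table[np.asarray(symbol_ids)]

    # e.g., the word channel for a hypothetical sub-path of four token ids:
    sub_path_word_vectors = lookup(word_table, [17, 4, 321, 9])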
3.4 Recurrent Neural Network with Long Short Term Memory Units

Figure 3: A long short term memory unit. h: hidden unit. c: memory cell. i: input gate. f: forget gate. o: output gate. g: candidate cell. ⊗: element-wise multiplication. ∼: activation function.

The recurrent neural network is suitable for modeling sequential data by nature, as it keeps a hidden state vector h, which changes with the input data at each step accordingly. We use the recurrent network to gather information along each sub-path in the SDP (Figure 2b).

The hidden state h_t, for the t-th word in the sub-path, is a function of its previous state h_{t-1} and the current word x_t. Traditional recurrent networks have a basic interaction, that is, the input is linearly transformed by a weight matrix and non-linearly squashed by an activation function. Formally, we have

h_t = f(W_in · x_t + W_rec · h_{t-1} + b_h)

where W_in and W_rec are weight matrices for the input and recurrent connections, respectively; b_h is a bias term for the hidden state vector, and f a non-linear activation function (e.g., tanh).

One problem of the above model is known as gradient vanishing or exploding. The training of neural networks requires gradient back-propagation. If the propagation sequence (path) is too long, the gradient may either grow or decay exponentially, depending on the magnitude of W_rec. This leads to the difficulty of training.

Long short term memory (LSTM) units are proposed in Hochreiter (1998) to overcome this problem. The main idea is to introduce an adaptive gating mechanism, which decides the degree to which LSTM units keep the previous state and memorize the extracted features of the current data input. Many LSTM variants have been proposed in the literature. We adopt in our method a variant introduced by Zaremba and Sutskever (2014), also used in Zhu et al. (2014).

Concretely, the LSTM-based recurrent neural network comprises four components: an input gate i_t, a forget gate f_t, an output gate o_t, and a memory cell c_t (depicted in Figure 3 and formalized through Equations 1–6 below).

The three adaptive gates i_t, f_t, and o_t depend on the previous state h_{t-1} and the current input x_t (Equations 1–3). An extracted feature vector g_t is also computed, by Equation 4, serving as the candidate memory cell.

i_t = σ(W_i · x_t + U_i · h_{t-1} + b_i)    (1)
f_t = σ(W_f · x_t + U_f · h_{t-1} + b_f)    (2)
o_t = σ(W_o · x_t + U_o · h_{t-1} + b_o)    (3)
g_t = tanh(W_g · x_t + U_g · h_{t-1} + b_g)    (4)

The current memory cell c_t is a combination of the previous cell content c_{t-1} and the candidate content g_t, weighted by the input gate i_t and the forget gate f_t, respectively (Equation 5).

c_t = i_t ⊗ g_t + f_t ⊗ c_{t-1}    (5)

The output of LSTM units is the recurrent network's hidden state, which is computed by Equation 6 as follows.

h_t = o_t ⊗ tanh(c_t)    (6)

In the above equations, σ denotes a sigmoid function; ⊗ denotes element-wise multiplication.
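For concreteness, here is a minimal NumPy sketch (not the authors' implementation) of Equations 1–6 for a single LSTM step, applied along one sub-path; the parameter dictionary, its key names, and the zero initial state are assumptions made for illustration.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x_t, h_prev, c_prev, p):
        # One LSTM step following Equations 1-6; p holds the W_*, U_*, b_* arrays.
        i_t = sigmoid(p["W_i"] @ x_t + p["U_i"] @ h_prev + p["b_i"])   # Eq. (1)
        f_t = sigmoid(p["W_f"] @ x_t + p["U_f"] @ h_prev + p["b_f"])   # Eq. (2)
        o_t = sigmoid(p["W_o"] @ x_t + p["U_o"] @ h_prev + p["b_o"])   # Eq. (3)
        g_t = np.tanh(p["W_g"] @ x_t + p["U_g"] @ h_prev + p["b_g"])   # Eq. (4)
        c_t = i_t * g_t + f_t * c_prev                                 # Eq. (5)
        h_t = o_t * np.tanh(c_t)                                       # Eq. (6)
        return h_t, c_t

    def run_sub_path(embeddings, hidden_dim, p):
        # Propagate the LSTM over one sub-path (a list of embedding vectors),
        # starting from a zero state, and return all hidden states.
        h = np.zeros(hidden_dim)
        c = np.zeros(hidden_dim)
        states = []
        for x_t in embeddings:
            h, c = lstm_step(x_t, h, c, p)
            states.append(h)
        return np.stack(states)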
3.5 Dropout Strategies

A good regularization approach is needed to alleviate overfitting. Dropout, proposed recently by Hinton et al. (2012), has been very successful on feed-forward networks. By randomly omitting feature detectors from the network during training, it can obtain less interdependent network units and achieve better performance. However, the conventional dropout does not work well with recurrent neural networks with LSTM units, since dropout may hurt the valuable memorization ability of memory units.

As there is no consensus on how to drop out LSTM units in the literature, we try several dropout strategies for our SDP-LSTM network:

• Dropout embeddings;
• Dropout inner cells in memory units, including i_t, g_t, o_t, c_t, and h_t; and
• Dropout the penultimate layer.

As we shall see in Section 4.2, dropping out LSTM units turns out to be inimical to our model, whereas the other two strategies boost performance.

The following equations formalize the dropout operations on the embedding layers, where D denotes the dropout operator. Each dimension in the embedding vector x_t is set to zero with a predefined dropout rate.

i_t = σ(W_i · D(x_t) + U_i · h_{t-1} + b_i)    (7)
f_t = σ(W_f · D(x_t) + U_f · h_{t-1} + b_f)    (8)
o_t = σ(W_o · D(x_t) + U_o · h_{t-1} + b_o)    (9)
g_t = tanh(W_g · D(x_t) + U_g · h_{t-1} + b_g)    (10)

3.6 Training Objective

The SDP-LSTM described above propagates information along a sub-path from an entity to the common ancestor node (of the two entities). A max pooling layer packs, for each sub-path, the recurrent network's states, h's, to a fixed vector by taking the maximum value in each dimension.

Such an architecture applies to all channels, namely, words, POS tags, grammatical relations, and WordNet hypernyms. The pooling vectors in these channels are concatenated, and fed to a fully connected hidden layer. Finally, we add a softmax output layer for classification. The training objective is the penalized cross-entropy error, given by

J = − Σ_{i=1}^{n_c} t_i log y_i + λ ( Σ_{i=1}^{ω} ||W_i||_F² + Σ_{i=1}^{υ} ||U_i||_F² )

where t ∈ R^{n_c} is the one-hot represented ground truth and y ∈ R^{n_c} is the estimated probability for each class by softmax (n_c is the number of target classes). ||·||_F denotes the Frobenius norm of a matrix; ω and υ are the numbers of weight matrices (for the W's and U's, respectively). λ is a hyperparameter that specifies the magnitude of the penalty on weights. Note that we do not add the ℓ2 penalty to bias parameters.

We pretrained word embeddings by word2vec (Mikolov et al., 2013a) on the English Wikipedia corpus; other parameters are initialized randomly. We apply stochastic gradient descent (with mini-batch size 10) for optimization; gradients are computed by standard back-propagation. Training details are further introduced in Section 4.2.
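As a rough illustration of this pipeline (a sketch under assumptions, not the released code), the snippet below max-pools the hidden states of each channel and sub-path, concatenates the results, and evaluates the penalized cross-entropy objective on a softmax output; the shapes, the small epsilon for numerical stability, and the variable names are all illustrative.

    import numpy as np

    def pool_and_concat(channel_states):
        # channel_states: list of (num_steps, hidden_dim) arrays, one per channel
        # and sub-path; max-pool each over time and concatenate into one vector.
        return np.concatenate([states.max(axis=0) for states in channel_states])

    def softmax(z):
        z = z - z.max()
        e = np.exp(z)
        return e / e.sum()

    def penalized_cross_entropy(y_prob, target_onehot, weight_matrices, lam):
        # Cross-entropy error plus an L2 (squared Frobenius) penalty on the
        # weight matrices, as in the training objective; biases are excluded.
        ce = -np.sum(target_onehot * np.log(y_prob + 1e-12))
        penalty = lam * sum(np.sum(W ** 2) for W in weight_matrices)
        return ce + penalty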
4 Experiments

In this section, we present our experiments in detail. Our implementation is built upon Mou et al. (2015). Section 4.1 introduces the dataset; Section 4.2 describes hyperparameter settings. In Section 4.3, we compare SDP-LSTM's performance with other methods in the literature. We also analyze the effect of different channels in Section 4.4.

4.1 Dataset

The SemEval-2010 Task 8 dataset is a widely used benchmark for relation classification (Hendrickx et al., 2010). The dataset contains 8,000 sentences for training, and 2,717 for testing. We split 1/10 of the samples out of the training set for validation.

The target contains 19 labels: 9 directed relations, and an undirected
Other class. The directed relations are listed below.

• Cause-Effect
• Component-Whole
• Content-Container
• Entity-Destination
• Entity-Origin
• Message-Topic
• Member-Collection
• Instrument-Agency
• Product-Producer
Two sample sentences with directed relations are illustrated below.

[People]e1 have been moving back into [downtown]e2.

Financial [stress]e1 is one of the main causes of [divorce]e2.

The target labels are Entity-Destination(e1, e2) and Cause-Effect(e1, e2), respectively.

The dataset also contains an undirected Other class. Hence, there are 19 target labels in total. The undirected
Other class takes in entities that do not fit into the above categories, as illustrated by the following example.

A misty [ridge]e1 uprises from the [surge]e2.

We use the official macro-averaged F1-score to evaluate model performance. This official measurement excludes the Other relation. Nonetheless, we have no special treatment of the Other class in our experiments, which is typical in other studies.
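As a simplified illustration of this metric (not the official SemEval scorer, which has its own handling of directionality), a macro-averaged F1 that skips the Other class can be computed as follows; the label strings and input format are assumptions.

    def macro_f1_excluding_other(gold, pred, other_label="Other"):
        # Macro-average the per-class F1 over all labels except the Other class.
        # gold, pred: parallel lists of label strings.
        labels = sorted(set(gold) | set(pred))
        f1_scores = []
        for label in labels:
            if label == other_label:
                continue
            tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
            fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
            fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
            precision = tp / (tp + fp) if tp + fp else 0.0
            recall = tp / (tp + fn) if tp + fn else 0.0
            f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
            f1_scores.append(f1)
        return sum(f1_scores) / len(f1_scores) if f1_scores else 0.0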
4.2 Hyperparameter Settings

This subsection presents hyperparameter tuning for our model. We set word embeddings to be 200-dimensional; POS, WordNet hyponymy, and grammatical relation embeddings are 50-dimensional. Each channel of the LSTM network contains the same number of units as its source embeddings (either 200 or 50). The penultimate hidden layer is 100-dimensional. As it is not feasible to perform a full grid search over all hyperparameters, the above values are chosen empirically. We add an ℓ2 penalty for weights; its coefficient was chosen by validation from a small set of candidate values.

We thereafter validate the dropout strategies proposed in Section 3.5. Since network units in different channels do not interact with each other during information propagation, we herein take one channel of the LSTM networks to assess their efficacy. Taking the word channel as an example, we first drop out word embeddings. Then, with a fixed dropout rate of word embeddings, we test the effect of dropping out LSTM inner cells and the penultimate units, respectively.

Figure 4: F1-scores versus dropout rates: (a) dropout of word embeddings; (b) dropout of inner cells of memory units; (c) dropout of the penultimate layer. We first evaluate the effect of dropping out word embeddings (a). Then the dropout of the inner cells (b) and the penultimate layer (c) is tested with word embeddings being dropped out by 0.5.

We find that dropout of LSTM units hurts the model, even if the dropout rate is small, say 0.1 (Figure 4b). Dropout of embeddings improves model performance by 2.16% (Figure 4a); dropout of the penultimate layer further improves it by 0.16% (Figure 4c). This analysis also provides, for other studies, some clues for dropout in LSTM networks.

4.3 Results

Table 1 compares our SDP-LSTM with other state-of-the-art methods. The first entry in the table presents the highest performance achieved by traditional feature engineering. Hendrickx et al. (2010) leverage a variety of handcrafted features, and use SVM for classification; they achieve an F1-score of 82.2%.

Neural networks are first used in this task in Socher et al. (2012). They build a recursive neural network (RNN) along a constituency tree for relation classification. They extend the basic RNN with matrix-vector interaction and achieve an F1-score of 82.4%.

Zeng et al. (2014) treat a sentence as sequential data and exploit the convolutional neural network (CNN); they also integrate word position information into their model. Santos et al. (2015) design a model called CR-CNN; they propose a ranking-based cost function and elaborately diminish the impact of the Other class, which is not counted in the official F1-measure. In this way, they achieve the state-of-the-art result with an F1-score of 84.1%. Without such special treatment, their F1-score is 82.7%.

Yu et al. (2014) propose a Feature-rich Compositional Embedding Model (FCM) for relation classification, which combines unlexicalized linguistic contexts and word embeddings. They achieve an F1-score of 83.0%.

Our proposed SDP-LSTM model yields an F1-score of 83.7%.
It outperforms existing competing approaches, in a fair condition of softmax with cross-entropy error.

It is worth noting that we have also conducted two controlled experiments: (1) a traditional RNN without LSTM units, achieving an F1-score of 82.8%; and (2) an LSTM network over the entire dependency path (instead of two sub-paths), achieving an F1-score of 82.2%. These results demonstrate the effectiveness of LSTM units and directionality in relation classification.

Classifier   Feature set                                                            F1
SVM          POS, WordNet, prefixes and other morphological features,               82.2
             dependency parse, Levin classes, PropBank, FrameNet,
             NomLex-Plus, Google n-gram, paraphrases, TextRunner
RNN          Word embeddings                                                         74.8
             Word embeddings, POS, NER, WordNet                                      77.6
MVRNN        Word embeddings                                                         79.1
             Word embeddings, POS, NER, WordNet                                      82.4
CNN          Word embeddings                                                         69.7
             Word embeddings, word position embeddings, WordNet                      82.7
Chain CNN    Word embeddings, POS, NER, WordNet                                      82.7
FCM          Word embeddings                                                         80.6
             Word embeddings, dependency parsing, NER                                83.0
CR-CNN       Word embeddings                                                         82.8†
             Word embeddings, position embeddings                                    82.7
             Word embeddings, position embeddings                                    84.1†
SDP-LSTM     Word embeddings                                                         82.4
             Word embeddings, POS embeddings, WordNet embeddings,                    83.7
             grammar relation embeddings

Table 1: Comparison of relation classification systems. The "†" remark refers to special treatment for the Other class.

4.4 Effect of Different Channels

This subsection analyzes how different channels affect our model. We first used word embeddings only as a baseline; then we added POS tags, grammatical relations, and WordNet hypernyms, respectively; we also combined all these channels into our model. Note that we did not try the latter three channels alone, because each single one of them (e.g., POS) does not carry much information.

We see from Table 2 that word embeddings alone in SDP-LSTM yield a remarkable performance of 82.35%, compared with CNNs' 69.7%, RNNs' 74.9–79.1%, and FCM's 80.6%.

Adding either grammatical relations or WordNet hypernyms outperforms other existing methods (data cleaning not considered here). POS tagging is comparatively less informative, but still boosts the F1-score by 0.63%.

We notice that the boosts are not simply additive when channels are combined. This suggests that these information sources are complementary to each other in some linguistic aspects. Nonetheless, incorporating all four channels further pushes the F1-score to 83.70%.

Channels                              F1
Word embeddings                       82.35
+ POS embeddings (only)               82.98
+ GR embeddings (only)                83.21
+ WordNet embeddings (only)           83.03
+ POS + GR + WordNet embeddings       83.70

Table 2: Effect of different channels.
5 Conclusion

In this paper, we propose a novel neural network model, named SDP-LSTM, for relation classification. It learns features for relation classification iteratively along the shortest dependency path. Several types of information (words themselves, POS tags, grammatical relations, and WordNet hypernyms) along the path are used. Meanwhile, we leverage LSTM units for long-range information propagation and integration. We demonstrate the effectiveness of SDP-LSTM by evaluating the model on the SemEval-2010 relation classification task, outperforming existing state-of-the-art methods (in a fair condition without data cleaning). Our results shed some light on the relation classification task as follows.

• The shortest dependency path can be a valuable resource for relation classification, covering mostly sufficient information of target relations.

• Classifying relations is a challenging task due to the inherent ambiguity of natural languages and the diversity of sentence expressions. Thus, integrating heterogeneous linguistic knowledge is beneficial to the task.

• Treating the shortest dependency path as two sub-paths, mapped to two separate neural networks, helps to capture the directionality of relations.

• LSTM units are effective in feature detection and propagation along the shortest dependency path.
Acknowledgments
This research is supported by the National Basic Research Program of China (the 973 Program) under Grant No. 2015CB352201 and the National Natural Science Foundation of China under Grant Nos. 61232015 and 91318301.
References

[Banko et al. 2007] M. Banko, M. J. Cafarella, S. Soderland, M. Broadhead, and O. Etzioni. 2007. Open information extraction for the web. In IJCAI, volume 7, pages 2670–2676.

[Bengio et al. 2013] Y. Bengio, A. Courville, and P. Vincent. 2013. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828.

[Bunescu and Mooney 2005] R. C. Bunescu and R. J. Mooney. 2005. A shortest path dependency kernel for relation extraction. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 724–731. Association for Computational Linguistics.

[Bunescu and Mooney 2007] R. Bunescu and R. Mooney. 2007. Learning to extract relations from the web using minimal supervision. In Annual Meeting of the Association for Computational Linguistics, volume 45, page 576.

[Chen et al. 2014] Yun-Nung Chen, Dilek Hakkani-Tur, and Gokan Tur. 2014. Deriving local relational surface forms from dependency-based entity embeddings for unsupervised spoken language understanding. In Spoken Language Technology Workshop (SLT), 2014 IEEE, pages 242–247. IEEE.

[Ciaramita and Altun 2006] M. Ciaramita and Y. Altun. 2006. Broad-coverage sense disambiguation and information extraction with a supersense sequence tagger. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 594–602. Association for Computational Linguistics.

[De Marneffe et al. 2006] M. C. De Marneffe, B. MacCartney, and C. D. Manning. 2006. Generating typed dependency parses from phrase structure parses. In Proceedings of LREC, volume 6, pages 449–454.

[Ebrahimi and Dou 2015] J. Ebrahimi and D. Dou. 2015. Chain based RNN for relation classification. In HLT-NAACL.

[Fundel et al. 2007] K. Fundel, R. Küffner, and R. Zimmer. 2007. RelEx—Relation extraction using dependency parse trees. Bioinformatics, 23(3):365–371.

[Hashimoto et al. 2013] K. Hashimoto, M. Miwa, Y. Tsuruoka, and T. Chikayama. 2013. Simple customization of recursive neural networks for semantic relation classification. In EMNLP, pages 1372–1376.

[Hendrickx et al. 2010] I. Hendrickx, S. N. Kim, Z. Kozareva, et al. 2010. SemEval-2010 task 8: Multi-way classification of semantic relations between pairs of nominals. In Proceedings of the Workshop on Semantic Evaluations: Recent Achievements and Future Directions. Association for Computational Linguistics.

[Hinton et al. 2012] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov. 2012. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580.

[Hochreiter 1998] S. Hochreiter. 1998. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6(02):107–116.

[Kambhatla 2004] N. Kambhatla. 2004. Combining lexical, syntactic, and semantic features with maximum entropy models for extracting relations. In Proceedings of the ACL 2004 Interactive Poster and Demonstration Sessions, page 22. Association for Computational Linguistics.

[Mikolov et al. 2013a] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. 2013a. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

[Mikolov et al. 2013b] T. Mikolov, W. T. Yih, and G. Zweig. 2013b. Linguistic regularities in continuous space word representations. In HLT-NAACL.

[Mintz et al. 2009] M. Mintz, S. Bills, R. Snow, and D. Jurafsky. 2009. Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2 - Volume 2, pages 1003–1011. Association for Computational Linguistics.

[Mou et al. 2015] L. Mou, H. Peng, G. Li, Y. Xu, L. Zhang, and Z. Jin. 2015. Discriminative neural sentence modeling by tree-based convolution. arXiv preprint arXiv:1504.01106.

[Plank and Moschitti 2013] B. Plank and A. Moschitti. 2013. Embedding semantic similarity in tree kernels for domain adaptation of relation extraction. In ACL (1), pages 1498–1507.

[Santos et al. 2015] C. N. d. Santos, B. Xiang, and B. Zhou. 2015. Classifying relations by ranking with convolutional neural networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics.

[Socher et al. 2011] R. Socher, J. Pennington, E. H. Huang, A. Y. Ng, and C. D. Manning. 2011. Semi-supervised recursive autoencoders for predicting sentiment distributions. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 151–161. Association for Computational Linguistics.

[Socher et al. 2012] R. Socher, B. Huval, C. D. Manning, and A. Y. Ng. 2012. Semantic compositionality through recursive matrix-vector spaces. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1201–1211. Association for Computational Linguistics.

[Socher et al. 2014] R. Socher, A. Karpathy, Q. V. Le, C. D. Manning, and A. Y. Ng. 2014. Grounded compositional semantics for finding and describing images with sentences. Transactions of the Association for Computational Linguistics, 2:207–218.

[Wang and Fan 2014] C. Wang and J. Fan. 2014. Medical relation extraction with manifold models. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 828–838, Baltimore, Maryland, June. Association for Computational Linguistics.

[Wang 2008] M. Wang. 2008. A re-examination of dependency path kernels for relation extraction. In IJCNLP, pages 841–846.

[Wu and Weld 2010] F. Wu and D. S. Weld. 2010. Open information extraction using Wikipedia. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 118–127. Association for Computational Linguistics.

[Xu et al. 2014] Y. Xu, G. Li, L. Mou, and Y. Lu. 2014. Learning non-taxonomic relations on demand for ontology extension. International Journal of Software Engineering and Knowledge Engineering, 24(08):1159–1175.

[Yao and Van Durme 2014] X. Yao and B. Van Durme. 2014. Information extraction over structured data: Question answering with Freebase. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 956–966, Baltimore, Maryland, June. Association for Computational Linguistics.

[Yu et al. 2014] M. Yu, M. R. Gormley, and M. Dredze. 2014. Factor-based compositional embedding models. In The NIPS 2014 Learning Semantics Workshop, December.

[Zaremba and Sutskever 2014] W. Zaremba and I. Sutskever. 2014. Learning to execute. arXiv preprint arXiv:1410.4615.

[Zelenko et al. 2003] D. Zelenko, C. Aone, and A. Richardella. 2003. Kernel methods for relation extraction. The Journal of Machine Learning Research, 3:1083–1106.

[Zeng et al. 2014] D. Zeng, K. Liu, S. Lai, G. Zhou, and J. Zhao. 2014. Relation classification via convolutional deep neural network. In Proceedings of COLING, pages 2335–2344.

[Zhou et al. 2005] G. D. Zhou, J. Su, J. Zhang, and M. Zhang. 2005. Exploring various knowledge in relation extraction. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pages 427–434. Association for Computational Linguistics.

[Zhu et al. 2014] X. Zhu, P. Sobhani, and H. Guo. 2014. Long short-term memory over tree structures. In