Improving Distant Supervised Relation Extraction by Dynamic Neural Network
Yanjie Gou^a, Yinjie Lei^a,*, Lingqiao Liu^b, Pingping Zhang^c, Xi Peng^d

a College of Electronics and Information Engineering, Sichuan University, China
b School of Computer Science, The University of Adelaide, Australia
c School of Information and Communication Engineering, Dalian University of Technology, China
d College of Computer Science, Sichuan University, China
Abstract
Distant Supervised Relation Extraction (DSRE) is usually formulated as the problem of classifying a bag of sentences that contain two query entities into predefined relation classes. Most existing methods treat those relation classes as distinct semantic categories while ignoring their potential connection to the query entities. In this paper, we propose to leverage this connection to improve relation extraction accuracy. Our key ideas are twofold: (1) For sentences belonging to the same relation class, the expression style, i.e., the choice of words, can vary with the query entities. To account for this style shift, the model should adjust its parameters in accordance with entity types. (2) Some relation classes are semantically similar, and the entity types that appear in one relation may also appear in others. Therefore, entity type information can be trained across different relation classes and further enhance those classes with few samples, i.e., long-tail classes. To unify these two arguments, we develop a novel Dynamic Neural Network for Relation Extraction (DNNRE). The network adopts a novel dynamic parameter generator that dynamically generates the network parameters according to the query entity types and relation classes. By this mechanism, the network can simultaneously handle the style shift problem and enhance prediction accuracy for long-tail classes. Through our experimental study, we demonstrate the effectiveness of the proposed method and show that it achieves superior performance over the state-of-the-art methods.

* Corresponding author
Email addresses: [email protected] (Yanjie Gou), [email protected] (Yinjie Lei), [email protected] (Lingqiao Liu), [email protected] (Pingping Zhang), [email protected] (Xi Peng)
Keywords:
Deep neural network, Distant supervision, Relation extraction, Dynamic parameter, Style shift, Long-tail relation
1. Introduction
Relation Extraction (RE) [1, 2] aims to extract relations between entities from sentences, which can automate the construction of Knowledge Bases (KBs) and benefit downstream applications such as question answering [3, 4] and web search [5]. Due to the difficulty of collecting a large amount of sentence-level annotations, most recent RE methods are based on the Distant Supervision (DS) framework [6], which automatically annotates adequate amounts of data by aligning texts with KBs: for a bag of sentences that contains two entities, if the KB contains this entity pair, the bag is labeled with the corresponding relation of the entity pair in the KB. With the DS framework, RE can thus be cast as the problem of classifying a bag of sentences that contain the same query entity pair into predefined relation classes. However, the DS framework has its own disadvantages, namely the noisy label and long-tail problems. Figure 1 shows how the label of a sentence bag is annotated and how the noisy label problem is introduced by the DS framework. Table 1 shows the long-tail problem on the NYT dataset, a widely used DSRE dataset [7].

[Figure 1: Knowledge Base, align and label. Relation triple: (Apple, /business/company/founders, Steve Jobs). A bag of sentences: S1: Steve Jobs is the founder of Apple. S2: Steve Jobs was expelled from Apple in 1985. S3: Steve Jobs is respected by Apple employees.]

Figure 1: An example of how the training instances are generated through distant supervision. S2 and S3 are wrongly labeled with the relation "/business/company/founders" since they do not express the relation.
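The labeling procedure itself is mechanical. The sketch below (with hypothetical variable names; not code from the paper) shows how bags are formed and labeled, and why every sentence mentioning a KB pair inherits the KB relation whether or not it actually expresses it:

```python
from collections import defaultdict

def build_bags(sentences, kb_triples):
    """Distant supervision: label a bag of sentences by KB lookup.

    sentences : list of (head, tail, text) tuples with entities already tagged
    kb_triples: dict mapping (head, tail) -> relation, e.g.
                ("Apple", "Steve Jobs") -> "/business/company/founders"
    """
    bags = defaultdict(list)
    for head, tail, text in sentences:
        bags[(head, tail)].append(text)

    labeled = {}
    for pair, texts in bags.items():
        # Every sentence mentioning the pair inherits the KB relation,
        # which is exactly how noisy labels (e.g., S2/S3 above) arise.
        labeled[pair] = (kb_triples.get(pair, "NA"), texts)
    return labeled
```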
In existing works, many efforts [7, 8, 9] have been devoted to reducing the effect of the noisy label problem by recognizing the valid sentences.

Table 1: The sample numbers of different relation classes in the NYT training set, which is highly class-imbalanced: half of the relation classes have fewer than 100 samples. [The listed classes include /location/location/contains, /people/person/nationality, /business/person/company, /location/us_state/capital, /business/company/place_founded, /location/it_region/capital, /business/company/locations, /broadcast/content/location, /location/jp_prefecture/capital, and /business/shopping_center/owner; the per-class counts were not recovered.]
[Figure 2: S1: "... Chase Carey, the president of DirecTV." with (Chase Carey, DirecTV) as (/Person, /Organization), relation /business/person/company. S2: "... Bob Woodruff, ABC News journalist." with (Bob Woodruff, ABC News) as (/Person, /News_agency), relation /business/person/company.]

Figure 2: An example of the style shift problem in DSRE. The keywords that convey this relation are in red font.

As Figure 2 shows, both sentences express the relation "/business/person/company", but the keywords that convey this relation are quite different due to the difference in query entity types, i.e., (Person, Organization) vs. (Person, News agency). It seems that we need to further consider the style shift problem of each relation class with respect to the entity types.

On the other hand, some relation definitions may be fine-grained, since many relations are semantically related and their representations may be similar in the feature space. For example:

• "/people/person/place_of_birth" and "/people/person/place_lived" both indicate that someone appears somewhere at a certain time.

Moreover, the head and tail entities mentioned in these two relation classes are of the "/person" and "/location" types, respectively. That is, the entity type information can be trained well across different relation bags. It can be expected that, based on this intuition, the well-trained entity type information can be utilized to further enhance the vanilla relation representations, which especially benefits the long-tail problem.

To unify these two arguments, we propose to use a dynamic neural network in which the class-dependent parameters (i.e., attention and classifier) are dynamically determined by the query entity type information. By doing so, we make our prediction model adaptive to the query entity types, which naturally deals with the style shift problem. Also, the long-tail relation representations can be derived from the entity type information, which further addresses the long-tail problem. By utilizing the type information this way, we can tackle DSRE more effectively.

Specifically, to dynamically generate class-dependent parameters (i.e., attention and classifier) for our model, we develop a novel dynamic parameter generation module. This module generates network parameters in two steps: (1) The entity types of one bag are first converted to type embeddings; then, a relation-aware attention mechanism is proposed to aggregate their important information for a specific relation class, since different relations may correspond to different entity types. (2) The aggregated entity type information is incorporated with the vanilla relation representations and further transformed into the dynamic class representations by a nonlinear mapping.

We conduct experiments on a widely used large-scale DSRE benchmark dataset, and the experimental results demonstrate the superior performance of the proposed method. They validate that the dynamic network design is beneficial for handling both the style shift and long-tail problems in DSRE.

In summary, the main contributions of this work are as follows:

• We are the first to utilize the relationship between relation classes and entity types to improve the performance of DSRE.

• We propose a novel dynamic parameter generator to build a dynamic neural network whose parameters are determined by the query entity types and relation classes.

• We propose a novel relation-aware attention over entity types to aggregate the discriminative information in the entity types.

• Our experiments on a widely used benchmark show that our method achieves a new state-of-the-art result.
2. Related Works
In its early years, most DSRE methods were based on hand-crafted features [6, 7, 21], e.g., POS tags, named entity tags, and dependency paths.
In recent years, researchers have turned to applying deep learning to DSRE due to its promising performance and generalization ability in various NLP applications. Many methods [9, 10, 22] follow the MIL paradigm, aiming to denoise the wrong labels generated by DS. [9] proposes a piecewise convolutional neural network and utilizes MIL to select the one sentence in a bag that best expresses the relation between the entity pair. However, this omits useful information in the other sentences, which can also express the relation. To solve this problem, [10] introduces a selective attention mechanism to capture the useful information in the other sentences. Afterwards, several variants [11, 12, 13, 23] were proposed: [11] introduces a multi-lingual neural relation extraction framework to better utilize information from various languages; [23] considers both intra-bag and inter-bag attentions in order to deal with noise at the sentence level and bag level, respectively; [12] extends selective attention to cross-relation cross-bag selective attention, making the model more noise-robust; [13] enhances the convolutional neural network with a self-attention mechanism to learn better sentence representations.

Besides, [24] proposes a tagging-based method that jointly extracts entities and relations. [25] converts the joint task into a directed graph by designing a graph scheme and proposes a transition-based approach to generate the directed graph. [26] reduces the noise via sub-tree parsing and utilizes transfer learning to improve the performance of DSRE. [27] proposes to embed textual relations with global statistics of relations. [28] proposes a joint representation learning framework for knowledge graph completion and relation extraction from text. [29] proposes to learn syntax-aware entity embeddings for DSRE. [30] proposes an attention regularization method to further reduce the noise in the dataset. [31] proposes a neural pattern diagnosis framework, which can automatically summarize and refine high-quality relational patterns from noisy data. Other learning strategies, such as adversarial training [15, 32, 33], capsule networks [34, 35], and reinforcement learning [36, 37], have also been applied to DSRE to further improve its performance.

Recently, due to the powerful capabilities of pre-trained language models, [38] utilizes GPT [39] for DSRE and predicts a larger set of distinct relation types with high confidence. In another direction, [40] studies how to learn a general-purpose embedding of textual relations and further improves DSRE performance.
Recently, other useful external information has been identified as beneficial for DSRE, e.g., KB information. [22] utilizes entity descriptions for DSRE, which provide rich background information about entities and help recognize relations. [19] explores fine-grained entity type constraints for DSRE. [41] utilizes annotations from heterogeneous information sources, e.g., knowledge bases and domain heuristics, to conduct relation extractor learning. [20] uses a set of side information, e.g., entity types and relation aliases, to boost DSRE performance. [42] proposes a label-free distant supervision method and only uses the prior knowledge derived from the KG to softly supervise the learning of the classifier. [43] leverages corpus-based and KG-based information and uses logic rules at the entity type level. [17] proposes a coarse-to-fine grained attention scheme based on hierarchical relation structures in the KB. Based on [17], [18] proposes a knowledge-aware attention scheme using Knowledge Graph Embedding (KGE). Besides, [44] combines distant supervision data with additional directly-supervised data to train a model for identifying valid sentences. [45] proposes a multi-layer attention-based model to improve DSRE with joint label embedding, which is obtained from entity descriptions and the KG.

However, all the above works ignore the style shift problem, whereas DNNRE uses the entity type information to address it and further improve DSRE performance. Note that some previous works also use entity types in their models [20]. However, they are quite different from ours: we utilize entity types to dynamically generate the parameters of our model to address the style shift and long-tail problems, whereas previous works just use entity types as input features.
[Figure 3: architecture diagram. A bag of n_s sentences passes through the sentence encoder (word embedding, position embedding, piecewise max-pooling) to give encoded sentences s_1, ..., s_{n_s}; the entity type mapping produces T_e^head and T_e^tail; the dynamic parameter generator (stage 1 aggregation, stage 2 generation) produces the parameters λ of the dynamic attention and dynamic classifier, which output the classification logits.]

Figure 3: Overview of DNNRE. The sentence encoder is in the top left. The dynamic neural network parts are in the top right, in which the attention (Dynamic Attention) and classifier (Dynamic Classifier) are generated by the dynamic parameter generator at the bottom. Note that the fully connected layers (FC) that map w_i to the dynamic attention and classifier modules do not share parameters.
3. Methodology
3.1. Overview

The primary idea of the proposed method is to build a network with DYNAMIC weights; that is, parts of the network parameters are dynamically generated from the entity types and relation classes. This is in contrast to traditional methods, which use STATIC models whose parameters are fixed during testing. Formally, the class-dependent parameters λ of the proposed network are generated by the following function:

λ = φ(T_e, w_i),    (1)

where T_e (i.e., T_e^head and T_e^tail) denotes the entity types, and w_i is the vanilla relation representation of the i-th class in the attention or classifier modules. The function φ is called the dynamic parameter generator; it transforms T_e and w_i into the network parameters λ.

Since T_e is a variable of the query entity types, the generated network parameters are adapted online at the test stage, which offers a solution to compensate for the style shift and long-tail problems.

The overall architecture of the proposed DNNRE is illustrated in Figure 3. In the top left, the
Sentence Encoder encodes a bag of sentences into sentence representations. Meanwhile, the Entity Type Mapping maps the entity pair to T_e for enhancing the relation representations. This information is then utilized to generate the dynamic parameters by the Dynamic Parameter Generator (bottom). Finally, the dynamic parameters build the Dynamic Neural Network (i.e., dynamic attention and classifier) in the top right. The Dynamic Attention aggregates the sentence representations into a bag representation, which is fed into the Dynamic Classifier to predict the corresponding relation class.

The remainder of this section is organized as follows:

• First, the Sentence Encoder is introduced briefly in subsection 3.2.

• Then, the Entity Type Mapping (i.e., T_e) is introduced in subsection 3.3.

• Next, the Dynamic Parameter Generator is elaborated in subsection 3.4.

• Finally, the Dynamic Neural Network is introduced in subsection 3.5.
3.2. Sentence Encoder

In the framework of DSRE, the input of the network is a bag of sentences. Similar to most DSRE methods [9, 10], we first convert each sentence S = {w_1, w_2, ..., w_|S|} into a fixed-length vector s_i using a sentence encoder. In this work, we use PCNN [9] for this task. The sentence encoder has three layers: an input representation layer, an encoding layer, and a pooling layer, which are introduced as follows.

3.2.1. Input Representation Layer

The input representation layer converts each word in a sentence by the word embedding and the position embedding. The word embedding represents each word token w_i by a pre-trained word embedding vector v_i, which is trained on the NYT corpus by the word2vec tool [46] (https://code.google.com/p/word2vec/). The position embedding consists of two fixed-dimension vectors representing the relative positions between w_i and the entity pair. We concatenate the position embeddings p_i^1 and p_i^2 to the word representation.

The word representations x_i = [v_i; p_i^1; p_i^2] ∈ R^{d_i} (d_i = d_w + 2 × d_p) are fed into the encoding layer, where m convolution kernels K = {k_1, ..., k_m} ∈ R^{n_w × d_i} slide over the input to capture features in each n_w-gram:

h_i = k_i ∗ x_{j−n_w+1:j},   1 ≤ i ≤ m,    (2)

where x_{j−n_w+1:j} denotes the word representations from index j − n_w + 1 to j. Afterwards, we obtain H = {h_1, ..., h_m}.

After this convolution operation, a piecewise max-pooling is adopted to aggregate word-level information. Supposing h_i is split into {h_i1, h_i2, h_i3} by the entity positions, this pooling method is described as:

q_i = [maxpool(h_ij)]_{j=1,2,3}.    (3)

We then obtain Q ∈ R^{m×3}, which is flattened to a vector and transformed into the sentence embedding s ∈ R^{d_s} by a non-linear layer.
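As a minimal sketch of Eqs. (2)-(3) (our illustration, not the authors' code; the padding and the tanh activation are assumptions), the encoder below convolves the input representations and max-pools each of the three segments delimited by the entity positions:

```python
import torch
import torch.nn as nn

class PCNNEncoder(nn.Module):
    """Sketch of the PCNN sentence encoder: 1-D convolution (Eq. (2))
    followed by piecewise max-pooling over three segments (Eq. (3))."""

    def __init__(self, d_i: int, m: int, n_w: int, d_s: int):
        super().__init__()
        self.conv = nn.Conv1d(d_i, m, kernel_size=n_w, padding=n_w - 1)
        self.fc = nn.Linear(3 * m, d_s)   # flattened Q in R^{m x 3} -> s in R^{d_s}

    def forward(self, x: torch.Tensor, pos1: int, pos2: int) -> torch.Tensor:
        # x: (seq_len, d_i), rows are the word representations [v_i; p_i^1; p_i^2];
        # assumes 0 < pos1 < pos2 < output length, so all three pieces are non-empty
        h = self.conv(x.t().unsqueeze(0)).squeeze(0)          # (m, L) feature maps H
        pieces = [h[:, :pos1], h[:, pos1:pos2], h[:, pos2:]]  # split by entity positions
        q = torch.cat([p.max(dim=1).values for p in pieces])  # (3m,) piecewise max-pool
        return torch.tanh(self.fc(q))                         # non-linear layer -> s
```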
3.3. Entity Type Mapping

The external information of entity types has been proved useful for the DSRE task as additional input features [19, 20]. Unlike those existing works, we use the entity type information T_e to dynamically determine the network parameters. Its representation is obtained as follows. We first obtain the entity types from the KB and further map them to the types predefined by FIGER [47], which are shown in Table 2.

Table 2: The entity types used in our model. The left column lists coarse-grained entity types and the right column lists the corresponding fine-grained entity types. The hard-categorized entity types are at the bottom. Note that we only show part of the 112 entity types due to the space constraint.

person: doctor, engineer, architect, coach, actor, religious leader, monarch, terrorist, artist, musician, athlete, director
organization: terrorist organization, airline, government, company, government agency, educational institution, military, fraternity sorority, political party, news agency
location: body of water, city, island, country, park, astral body, mountain, county, glacier, province, bridge, road
product: camera, engine, mobile phone, airplane, software, weapon, computer, ship, game, spacecraft, instrument, train, car
art: written work, film, newspaper, play, music
event: military conflict, attack, natural disaster, election, sports event, protest, terrorist attack
building: airport, dam, hospital, hotel, library, power station, restaurant, sports facility, theater
hard-categorized: chemical thing, website, color, biological thing, award, time, disease, god, broadcast network, medical treatment, broadcast program, title, drug, educational degree, tv channel, symptom, currency, law, algorithm, ethnicity, body part, language, living thing, food, animal, religion, programming language, stock exchange, transit system, transit line

Then, we create an embedding vector for each entity type. Note that in practice, one entity may correspond to multiple entity types; in such a case, we propose a relation-aware attention to selectively aggregate the important information in these entity type embeddings into a single aggregated vector. Besides, some entities may correspond to no entity type; in that case, we use an
"UNK" token to represent their entity type. Note that if the entity type information cannot be directly obtained from KBs, we can also use entity typing methods [47, 48] to obtain it. In this paper, we assume the entity type information is easy to access and obtain it directly from KBs.
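A small sketch of this mapping (variable names ours, not from the paper): each entity's FIGER types are looked up in a type vocabulary and embedded, with an "UNK" index for entities that have no known type:

```python
import torch
import torch.nn as nn

class EntityTypeMapping(nn.Module):
    """Sketch: map an entity's FIGER types to a set of type embeddings."""

    def __init__(self, type_vocab: dict, d_t: int):
        super().__init__()
        self.type_vocab = type_vocab                # e.g. {"/person": 0, "/city": 1, ...}
        self.unk = len(type_vocab)                  # index reserved for the "UNK" type
        self.emb = nn.Embedding(len(type_vocab) + 1, d_t)

    def forward(self, figer_types: list) -> torch.Tensor:
        # One entity may have several types; return all their embeddings.
        # Entities with no known type fall back to the single UNK embedding.
        ids = [self.type_vocab.get(t, self.unk) for t in figer_types] or [self.unk]
        return self.emb(torch.tensor(ids))          # (n_types, d_t)
```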
3.4. Dynamic Parameter Generator

In the following, the implementation of the dynamic parameter generator φ(T_e, w_i) is elaborated in detail. In our design, it achieves parameter generation in two stages: (1) In stage 1, since one entity may correspond to several entity types, we aggregate the information in these entity types that is most important for enhancing the relation representations. (2) In stage 2, the aggregated entity type information is utilized to generate the dynamic parameters.

3.4.1. Stage 1: Entity Type Information Aggregation

The first stage aggregates the type information T_e (i.e., T_e^head and T_e^tail) into two embedding vectors t_e^h and t_e^t of dimension d_n, which represent the aggregated head and tail entity type information, respectively. The reason to selectively aggregate the type information is that different relation classes may correspond to different entity types. For example, the location "Sacramento" can be mentioned by the relations "/location/us_state/capital" and "/people/person/place_of_birth": the former may attend more to its type "/city", while the latter may attend more to "/location".

This step is achieved by a relation-aware attention over the entity type embeddings, which enables the model to aggregate the most discriminative information related to a specific relation class. The attention weights over the type embeddings are calculated as follows:

α_i = t_i^h W_t w_k^T,   t_e^h = Σ_{i=1}^{|T_e^head|} ( exp(α_i) / Σ_j exp(α_j) ) t_i^h,    (4)

where W_t ∈ R^{d_t × d_r} is a learnable parameter matrix and w_k ∈ R^{d_r} is the static representation of the k-th class in the vanilla attention or classifier. t_i^h is one of the type embedding vectors in T_e^head, which is a set of entity type embeddings {t_1^h, ..., t_{|T_e^head|}^h}. t_e^t is aggregated from T_e^tail by the same strategy.

Note that we also investigated two other aggregation strategies (i.e., average pooling and max pooling) to compare with the proposed relation-aware attention over the entity types; see subsection 5.2 for a detailed discussion.

3.4.2. Stage 2: Dynamic Parameters Generation

After the entity type information has been aggregated into t_e^h and t_e^t, it is utilized to generate the dynamic class parameters by the following transformation:

w̃_k = f_d(w_k + f_t([t_e^h, t_e^t])),    (5)

where the term f_t([t_e^h, t_e^t]) can be considered a dynamic component generated from the head and tail entity type information, f_d(·) and f_t(·) both denote fully connected layers (a two-layer fully connected module is applied for f_d(·) in our design), and [·,·] denotes the concatenation of two vectors.

After the transformation, w̃_k is utilized as the network parameters of the k-th class in the attention and the classifier. In our work, the attention and classifier use two different f_d(·); that is, the dynamic parameter generators for the attention and classifier share their parameters except for the fully connected module f_d(·). For clarity, we denote them as φ_a(T_e, r_k) ∈ R^{d_r} and φ_c(T_e, w_k) ∈ R^{d_r}, where r_k and w_k are the vanilla static parameters of the k-th class in the attention and classifier, respectively, and d_r is the dimension of the parameters for each class.
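Putting the two stages together, the following is a minimal PyTorch sketch of the generator under the notation above (our illustration, not the authors' released code; the Tanh activation inside f_d is an assumption):

```python
import torch
import torch.nn as nn

class DynamicParameterGenerator(nn.Module):
    """Sketch of Eqs. (4)-(5): relation-aware aggregation of type embeddings
    (stage 1) followed by generation of dynamic class parameters (stage 2)."""

    def __init__(self, d_t: int, d_r: int):
        super().__init__()
        self.W_t = nn.Parameter(torch.randn(d_t, d_r))   # bilinear map of Eq. (4)
        self.f_t = nn.Linear(2 * d_t, d_r)               # maps [t_e^h, t_e^t]
        # two-layer fully connected module f_d; the Tanh is our assumption
        self.f_d = nn.Sequential(nn.Linear(d_r, d_r), nn.Tanh(), nn.Linear(d_r, d_r))

    def aggregate(self, type_embs: torch.Tensor, w_k: torch.Tensor) -> torch.Tensor:
        # Stage 1, Eq. (4): type_embs is (n_types, d_t), w_k is (d_r,)
        alpha = type_embs @ self.W_t @ w_k               # (n_types,) attention scores
        return torch.softmax(alpha, dim=0) @ type_embs  # aggregated type vector

    def forward(self, head_types, tail_types, w_k):
        # Stage 2, Eq. (5): w~_k = f_d(w_k + f_t([t_e^h, t_e^t]))
        t_h = self.aggregate(head_types, w_k)
        t_t = self.aggregate(tail_types, w_k)
        return self.f_d(w_k + self.f_t(torch.cat([t_h, t_t], dim=-1)))
```

In the full model, one such module with two separate f_d heads (one for the attention, one for the classifier) would realize φ_a and φ_c while sharing W_t and f_t, matching the parameter sharing described above.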
3.5. Dynamic Neural Network

After the sentences are encoded into vector representations, the next operation is to aggregate them into a bag representation by an attention mechanism. Finally, the bag representation is fed into a classifier. Attention and classifier both measure the similarity between features and relations, at the sentence and bag level, respectively. In that sense, the dynamic parameter generator can enhance both of them; the resulting dynamic attention and dynamic classifier modules are introduced in the following parts.
Given n_s sentences in a bag, whose features are extracted by PCNN as S = {s_1, ..., s_{n_s}}, it is common practice to use an attention mechanism to generate n_s weights that selectively attend to the most relevant sentences, and then aggregate the sentence features into a fixed-length bag representation. In our work, the attention parameters are generated by the dynamic parameter generator, and the attention weights are calculated as follows:

r̃_k = φ_a(T_e, r_k),   α_i = s_i r̃_k,   z_k = Σ_{i=1}^{n_s} ( exp(α_i) / Σ_{j=1}^{n_s} exp(α_j) ) s_i,    (6)

where r̃_k denotes the dynamic attention parameters for the k-th class and s_i is a sentence feature. Note that we run the dynamic attention n times (once per relation class) to obtain n aggregation results, i.e., [z_1, ..., z_n].

Each result z_k is classified by its corresponding classifier. In other words, the decision value for the k-th class is

w̃_k = φ_c(T_e, w_k),   v_k = w̃_k z_k + b_k,    (7)

where w̃_k denotes the dynamic classifier parameters for the k-th class and b_k is a bias term. Note that at the test stage we do not know the ground-truth relation category k, so we run the dynamic attention and the dynamic classifier n times with a hypothesis k each time. Each run produces a posterior probability for the k-th class, which is used for prediction and evaluation. The same operation has also been used in [10].
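For one hypothesized class k, Eqs. (6)-(7) amount to the following sketch (ours; it assumes the sentence dimension d_s matches the parameter dimension d_r so that the dot products are defined):

```python
import torch

def score_bag_for_class(S, r_k_dyn, w_k_dyn, b_k):
    """Sketch of Eqs. (6)-(7) for a single hypothesized class k.

    S        : (n_s, d_s) encoded sentences of one bag
    r_k_dyn  : (d_r,) dynamic attention parameters  phi_a(T_e, r_k)
    w_k_dyn  : (d_r,) dynamic classifier parameters phi_c(T_e, w_k)
    """
    alpha = S @ r_k_dyn                    # per-sentence attention scores
    z_k = torch.softmax(alpha, dim=0) @ S  # bag representation z_k, Eq. (6)
    return w_k_dyn @ z_k + b_k             # decision value v_k, Eq. (7)

# At test time the hypothesis loop runs once per relation class, as in [10]:
# logits = torch.stack([score_bag_for_class(S, R[k], W[k], b[k]) for k in range(n)])
```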
4. Experimental Results
In this section, we first describe the dataset and evaluation criteria. Then, we present our implementation details. Finally, we report our results compared with other existing methods.
We evaluate our method on a widely used dataset, NYT, developed by [7] (http://iesl.cs.umass.edu/riedel/ecml/). The NYT dataset is generated by aligning Freebase [49] relation facts with the New York Times corpus. The entities in sentences are recognized by the Stanford named entity tagger [50] and further matched against Freebase entities.

Following the existing works [6, 10], we use a held-out evaluation to assess the models: the predicted relation classes are compared with the ground truth. The Precision-Recall (PR) curves and the top-N precision (P@N) are reported for analysis. Moreover, to further evaluate our method on long-tail relations, we follow [17, 18] and apply the Hits@K metric. In addition, in the ablation study, we use AUC for quantitative analysis.
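For concreteness, the held-out metrics can be computed as in the sketch below (our implementation choices, not the paper's evaluation script; the truncated AUC follows the recall < 0.4 convention used in the ablation study of Section 5):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, auc

def precision_at_n(scores, labels, n):
    """P@N: precision among the n highest-confidence predictions."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    return labels[np.argsort(-scores)][:n].mean()

def hits_at_k(true_relation_ranks, k):
    """Hits@K: fraction of bags whose true relation is ranked within the top K."""
    return float(np.mean(np.asarray(true_relation_ranks) <= k))

def truncated_auc(scores, labels, max_recall=0.4):
    """Area under the PR curve restricted to the low-recall region."""
    precision, recall, _ = precision_recall_curve(labels, scores)
    keep = recall <= max_recall
    return auc(recall[keep], precision[keep])
```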
Table 3: Hyper-parameter settings used in our model. [Only the parameter symbols were recovered: d_w, d_p, d_t, d_s, d_r, w, B, λ_lr, p_d, λ_L; the corresponding values were lost in extraction.]

We use the same hyper-parameter settings as PCNN [9]. The dimensions of the entity type and relation tuple element embeddings are both set to 50, and the number of GCN layers is set to 2. The cross-entropy loss function is applied to train our model, with the Adadelta optimizer [51] under its default parameters. Moreover, the dropout strategy [52] is used at the classification layer, and L2-regularization is used to prevent the model from over-fitting. We implement the network in PyTorch. Table 3 reports all the hyper-parameter settings.
[Figure 4: PR curves (recall 0.00 to 0.40) of DNNRE, PCNNs+WN, RESIDE, PCNN+ATT, PCNN+MIL, MIML, MultiR, and Mintz.]

Figure 4: Performance comparison of the proposed model with other methods.
To evaluate the performance of DNNRE, we compare it against several existing hand-crafted-feature-based and deep-feature-based methods, which are as follows:

• Mintz [6]: a traditional multi-class logistic regression DSRE model.

• MultiR [21]: a probabilistic graphical model for multi-instance learning.

• MIML [8]: a graphical model that jointly models multiple instances and multiple labels.

• PCNN+MIL [9]: a convolutional neural network (CNN) adopting a piecewise max-pooling method for sentence representation.
• PCNN+ATT [10]: uses selective attention over instances to aggregate sentence embeddings into a bag-level embedding, which can utilize the information in all sentences.

• RESIDE [20]: utilizes side information (i.e., relation aliases and entity type information), with a GCN used to encode the syntactic information of instances.

• PCNNs+WN [53]: uses a linear attenuation simulation to weight the word embeddings, and a non-independent-and-identically-distributed relevance hypothesis to capture the relevance of sentences in the bag.

From the PR curves in Figure 4, it can be observed that DNNRE achieves superior performance compared with the state of the art. The precision of DNNRE is higher than that of the other methods at almost all recall values; in particular, when recall ranges from 0.10 to 0.40, there is a consistent margin between DNNRE and the other methods. Cross-referencing the P@N results in Table 4, it is clear that our method achieves a significant improvement over the compared methods.

To highlight, compared with RESIDE, a recent method that utilizes side information (e.g., entity types and relation aliases), DNNRE achieves higher precision by a large margin over all recall values in the PR curves and attains an improvement of 4.2% in P@N on average. This demonstrates that our way of using the entity type information is more effective.

The performance of DNNRE indicates that the dynamic network design can take advantage of the relationship between classes and the corresponding entity types: it can dynamically adapt its parameters to represent the relations more accurately. A case study evaluating the effectiveness of DNNRE on the style shift problem caused by keyword variation is reported in subsection 5.4.
We also evaluate the performance of DNNRE on long-tail relations, following the protocol of [17, 18]: (1) a subset of the test dataset is selected in which all relations have fewer than 100/200 training instances; (2) Hits@K (with K = {10, 15, 20}) is used as the evaluation metric, which measures the likelihood that the true relation falls within the first K candidate relations recommended by the model.

Table 4: P@N comparison of the proposed model with other methods. The best results are in bold font. [The DNNRE column was not recovered.]

P@N    PCNN+MIL  PCNN+ATT  RESIDE  PCNNs+WN
P@100  72.3      76.2      84.0    83.0
P@200  69.7      73.1      78.5    82.0
P@300  64.1      67.4      75.6    80.3
Mean   68.7      72.2      79.4    81.8
Table 5: Accuracy (%) of Hits@K on relations with fewer than 100/200 training instances. [The table body was not recovered.]

In Table 5, it is observed that our method outperforms PCNN+ATT [10], PCNN+HATT [17], and PCNN+KATT [18] in most of the Hits@K settings. In particular, DNNRE achieves 57.6% and 64.1% accuracy for Hits@10 on the two subsets. That is, for more than half of the long-tail samples, the true class is ranked within the first 10 candidate relations recommended by DNNRE, which outperforms the other works significantly, with at least 20% absolute improvement. This demonstrates that DNNRE can substantially boost the performance of long-tail relation classes.
5. Analysis and Discussion
In this section, we first conduct an ablation study to analyze the effect of each dynamic component of DNNRE (Subsection 5.1). Secondly, we compare our relation-aware attention over entity types with other aggregation strategies, i.e., average pooling and max pooling (Subsection 5.2). Then, we investigate the effect of different granularities of the entity types on the capability of the generated dynamic parameters (Subsection 5.3). Finally, a case study is given to demonstrate the effectiveness of DNNRE (Subsection 5.4).
[Figure 5: PR curves of DNNRE, w/o dynatt, and w/o dyncls.]

Figure 5: Ablation study of the different dynamic components of DNNRE.
5.1. Ablation Study

In this subsection, we conduct ablation studies to validate the effect of each dynamic component of DNNRE (i.e., the dynamic attention and the dynamic classifier). The evaluation metrics are the PR curve and AUC. Note that since some bags in the test set are noisy, we use AUC (recall < 0.4) to focus on the high-confidence bags in the low-recall region.

In Figure 5, we report different ablated versions of DNNRE, described as follows:

• w/o dynatt denotes a variant that removes the dynamic attention and only uses the vanilla static attention parameters.

• w/o dyncls denotes a variant that removes the dynamic classifier and only uses the vanilla static classifier parameters.

Merely using the dynamic attention (w/o dyncls) or the dynamic classifier (w/o dynatt) boosts the AUC to around 0.264 and 0.290, respectively. In particular, when the dynamic classifier is removed, the performance drops to 0.264, which indicates that the dynamic classifier is the more important of the two. Moreover, when both dynamic components are utilized, DNNRE achieves an AUC of 0.303, which means that the dynamic attention and classifier are complementary to each other. The results demonstrate that each dynamic module contributes to the superior performance of DNNRE; that is, the dynamic design of both the attention and the classifier is beneficial for relation recognition.

[Figure 6: PR curves of DNNRE, with avg-pooling, with max-pooling, and RESIDE.]
Figure 6: Performance comparison of different aggregation strategies.
5.2. Comparison of Aggregation Strategies

In this subsection, we investigate the effects of different aggregation strategies. Besides the relation-aware attention over entity types, we also apply two other aggregation strategies, i.e., average pooling and max pooling. As shown in Figure 6, with avg-pooling and with max-pooling denote variants in which the relation-aware attention is replaced with the average pooling operation and the max pooling operation, respectively. It is observed that the proposed relation-aware attention over the entity types is more robust in the PR curves, which indicates that this attention mechanism is capable of selecting the important information from entity types for different relation classes.

Note that we also report the PR curve of RESIDE [20] in Figure 6. DNNRE outperforms RESIDE by a large margin with any of the chosen aggregation strategies, which demonstrates that the dynamic design of our model using the entity type information is more effective.
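The two pooling baselines simply replace the relation-aware attention of Eq. (4) with parameter-free reductions over the same type embeddings, e.g. (a sketch, names ours):

```python
import torch

def aggregate_types(type_embs, strategy="attention", W_t=None, w_k=None):
    """type_embs: (n_types, d_t) -> one aggregated vector in R^{d_t}."""
    if strategy == "avg":                          # "with avg-pooling" variant
        return type_embs.mean(dim=0)
    if strategy == "max":                          # "with max-pooling" variant
        return type_embs.max(dim=0).values
    alpha = type_embs @ W_t @ w_k                  # relation-aware attention, Eq. (4)
    return torch.softmax(alpha, dim=0) @ type_embs
```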
[Figure 7: PR curves of DNNRE, coarse-grained, and RESIDE.]

Figure 7: Performance comparison of DNNRE with different granularities of entity types.
5.3. Effects of Entity Type Granularity

In this subsection, we investigate how the granularity of the entity types affects the discriminative capability of the generated dynamic parameters. The results are reported in Figure 7, in which coarse-grained denotes that the dynamic parameters of the attention and classifier are generated from the coarse-grained entity types, i.e., the 38 coarse types that form the first hierarchy of FIGER types.

On the one hand, DNNRE with fine-grained entity types outperforms DNNRE with coarse-grained entity types by a large margin. This means that fine-grained entity types are helpful for improving DSRE performance, which is also validated by [19].

On the other hand, considering that RESIDE [20] also utilizes the 38 coarse types of FIGER [47], we compare DNNRE (coarse-grained) with RESIDE. Note that RESIDE utilizes a lot of external information, including entity types, relation aliases, and the syntactic information of sentences, which is encoded by a Graph Convolutional Network [54]. However, even using only the 38 coarse types, the performance of our DNNRE is still comparable to RESIDE, which indicates that generating dynamic parameters from the entity types is more effective than utilizing them as input constraints [19, 20].
[Figure 8: two example bags. Example 1: entity pair (brett keller, priceline.com), types (/person, /company), relation /business/person/company; sentence: "Brett keller, chief marketing officer at priceline.com, said mr. shatner was chosen in 1997 because it was the dawn of the online era and the actor, best known for his 'star trek' role as captain james t. kirk, 'stood for the future, and we were a futuristic brand.'" Example 2: types (/person, /educational_institution), relation /business/person/company; sentences about "james heckman, a university of chicago economist who won the nobel in economic science in 2000". Confidence scores for DNNRE and the W/O Type variant are shown (e.g., 0.96 and 0.59).]

Figure 8: Examples evaluating DNNRE on the style shift problem. On the left side, the entity pairs, entity types, and relation classes of two bags are shown. On the right side, the estimated probabilities (confidence scores) for detecting the ground-truth relation are shown.
5.4. Case Study

Figure 8 uses two examples to show how DNNRE addresses the style shift problem. Two models are used in the case study, i.e., DNNRE and its variant (W/O Type), whose parameters do not incorporate the entity type information.

The first example expresses the relation through the keyword "chief marketing officer". In the second example, however, the tail entity type changes to "educational_institution", and the sentences express the same relation through a different keyword, i.e., "economist". In both examples, the proposed DNNRE is able to adjust the model parameters according to the entity type information and produce higher prediction scores: the confidence scores from DNNRE are 0.95 and 0.96 for the two examples, respectively, whereas DNNRE W/O Type only obtains 0.71 and 0.59. The results demonstrate that DNNRE can utilize the entity type information to enrich the relation representations for handling the style shift problem.

6. Conclusion
In this work, we propose a novel Dynamic Neural Network for Relation Extraction (DNNRE), whose parameters are determined by the query entity types and relation classes. The dynamic design of our model accounts for the potential style shift caused by keyword variation under different entity types. Besides, the entity type information can be trained across different relation classes, which further benefits the long-tail relation classes. Through extensive experiments, we demonstrate that the proposed method is effective for improving DSRE accuracy.

In future work, we will focus on the following directions: (1) exploring how to better model the style shift problem with different information inputs, e.g., entity description information; (2) further utilizing the connections between different relation classes to boost the performance of DSRE; (3) incorporating entity typing into DNNRE via multi-task training to further improve DSRE performance.
Acknowledgment
This work was supported by the Key Research and Development Program of Sichuan Province (2019YFG0409).