Deep Neural Network Based Relation Extraction: An Overview

Hailin Wang · Ke Qin · Rufai Yusuf Zakari · Guoming Lu · Jin Yin

Received: date / Accepted: date
Abstract
Knowledge is a formal way of understanding the world, providing human-level cognition and intelligence for the next-generation artificial intelligence (AI). One of the representations of knowledge is the semantic relations between entities. An effective way to automatically acquire this important knowledge, called Relation Extraction (RE), a sub-task of information extraction, plays a vital role in Natural Language Processing (NLP). Its purpose is to identify semantic relations between entities in natural language text. To date, several studies have addressed RE, and these works document that techniques based on Deep Neural Networks (DNNs) have become the prevailing approach in this research area. In particular, the supervised and distant supervision methods based on DNNs are the most popular and reliable solutions for RE. This article 1) introduces some general concepts, and further 2) gives a comprehensive overview of DNNs in RE from two points of view: supervised RE, which attempts to improve the standard RE systems, and distant supervision RE, which adopts DNNs to design the sentence encoder and the de-noise method. We further 3) cover some novel methods and recent trends as well as discuss possible future research directions for this task.
Keywords
Overview · Information Extraction · Relation Extraction · Neural Networks
Corresponding author: Ke Qin
Authors: Trusted Cloud Computing and Big Data Key Laboratory of Sichuan Province, School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China
E-mail: lynn [email protected], [email protected], [email protected], [email protected], [email protected]
Fig. 1
The General Framework of DNN-based RE.
Artificial intelligence (AI) integrating knowledge is a hot topic in current research; it provides human-like thinking for AI to solve complex tasks. One of the most important techniques supporting this research is knowledge acquisition, also called relation extraction (RE). One aim of RE is to process human language text and find unknown relational facts in plain text, organizing unstructured information into structured information. A well-constructed, large-scale knowledge base can be useful for many downstream applications and can empower knowledge-aware models with the ability of commonsense reasoning, thereby paving the way for AI.

RE builds a large-scale knowledge base by extracting relation triples from raw text. For example, consider the sentence: "<e1> Jobs </e1> is the founder of <e2> Apple </e2>." It marks the entities "Jobs" and "Apple" with pairs of XML tags. From this sentence, the RE model outputs a triple (Jobs, Apple, founded_by), which can be used for knowledge base construction.

Recently, RE has attracted extensive attention, but few researchers have reviewed DNN-based RE [1, 2]. While these articles have their own emphases, they lack a comprehensive, systematic introduction to DNN-based methods. Consequently, this paper presents an extensive survey and gives a comprehensive introduction of RE with respect to the prevalent DNN-based methods. To begin with, this paper introduces the premise of RE frameworks, including a general framework and some basic conceptions of RE. Second, a brief introduction of traditional methods and the variations of DNN-based methods are compared in detail. Third, the paper provides an analysis of open problems and proposes future research directions.

The general framework of DNN-based RE (Figure 1) consists of the following components:

Data sets: The supervised data sets (SemEval 2010-task8 [3] and FewRel [4, 5]) are often obtained by manual annotation, with high accuracy and low noise but small size. Instead, distant supervision data sets are usually acquired by a physical alignment of entities between a corpus and a knowledge base (KB); they have a bigger size and high multi-domain applicability (e.g., Riedel et al. [6]), but low accuracy and high noise. In a sense, this component is not a simple module in general DNN-based models, but in the RE field, especially in distant supervision, the physical alignment phase, constructing training triples, plays a vital role.
Sentence Representation: In the NLP field, for computers to understand human language, words are usually represented as a series of real-valued vectors, produced by methods such as word2vec and GloVe. Meanwhile, position embeddings are introduced to better express the positional relation between the words and the entity pair. Hence, the final representation of a word is the combination of its word vector and position embedding, and the final sentence representation is composed of these word representations.
Feature extraction: In general, the DNN-based methods are fed with the sentence representation above. With the annotated data sets, these methods produce a feature extractor by training, and this extractor can extract high-level features from the sentence representation.
Classifier: With the high-level features and a predefined relation inventory, the classifier outputs the relation between the entity pair in the sentence, and the result is then evaluated.

2.2 Basic Conception

In addition to the above general framework, the basic concepts commonly used in DNN-based RE systems are as follows:
Neural Networks have been widely used in image processing, language processing, and other fields in recent years, with remarkable results. Researchers have designed many kinds of DNNs, including Convolutional Neural Networks (CNNs) [7], Recurrent Neural Networks (RNNs) [8], Recursive Neural Networks [9], and Graph Neural Networks (GNNs) [10]. Different kinds of DNNs have different characteristics and advantages in dealing with various language tasks. For example, CNNs, with their parallel processing ability, are adept at processing local and structural information. RNNs, in contrast, have advantages in dealing with long texts and cope with time-series information by considering the factors before and after the data input.
Table 1
Example sentence with position indicators.

Example: <e1> Jobs </e1> is the founder of <e2> Apple </e2>.
Indicators: <e1>, </e1>, <e2>, </e2>
Fig. 2
Example of relative distance.

Moreover, the GNNs, another kind of neural network developed gradually in recent years, process data with a graphical structure. For example, the grammatical dependency parse tree, a general tool for RE, is well suited to GNNs. In addition to the commonly used networks mentioned above, some RNN-variant networks are also used in RE systems, such as LSTM (Long Short-Term Memory) networks [11–13] and the GRU (Gated Recurrent Unit) [14].
Word Embedding is a method used to represent words in NLP, leveraging uniform, low-dimensional, continuous, real-valued vectors to represent language. One of the earlier forms is the one-hot representation, which suffers from problems such as data sparsity, lack of meaning, and the curse of dimensionality. To solve these problems, Mikolov et al. [15] proposed a method called word2vec to overcome these disadvantages. In this way, all the word vectors are distributed, the dimensionality of a vector can be arbitrary (generally 50 to 100 dimensions), and the value of each element can be any real value. The greatest benefit of this approach is that the semantic and contextual information of words is captured, and the similarity of words can be calculated by simple addition and subtraction. Hence, word2vec is a common component in DNN-based RE. Aside from word2vec, some researchers have also designed other methods [16, 17].
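To make this concrete, the following is a minimal sketch of training word vectors and probing word similarity; the gensim library, the toy corpus, and the hyper-parameters here are illustrative assumptions, not part of any surveyed system:

    # Minimal word2vec sketch (assumes the gensim library; corpus and
    # hyper-parameters are illustrative).
    from gensim.models import Word2Vec

    sentences = [
        ["jobs", "is", "the", "founder", "of", "apple"],
        ["gates", "is", "the", "founder", "of", "microsoft"],
    ]
    model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

    vec = model.wv["founder"]                         # 50-dimensional real-valued vector
    print(model.wv.similarity("apple", "microsoft"))  # cosine similarity of two words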
Position Embedding provides a uniform way for the RE model to be aware of word positions. In the RE task, CNN-based models lack judgment on word location information. To address this issue, Zeng et al. [18] propose the position feature (PF), which is adopted in subsequent methods using CNNs [19–21], RNNs [22–24], and mixed frameworks [25, 26]. The PF is the combination of the relative distances of each word to the labeled entities in the sentence. For instance, given the labeled entities "Jobs" and "Apple" in the sentence "Jobs is the founder of Apple", the relative distance of the word "is" to "Jobs" is -1, and to "Apple" it is 4. In this way, the distances of the words around the entity words in the sentence can be expressed clearly. Furthermore, to make the PF easy for the model to use, the two real values are mapped into a new vector space; this is the position embedding process, and normally the dimension of this vector is 5. One example is shown in Figure 2.

In addition to PE, some RNN-based methods also use position indicators (PI) to further enhance the representation of the entity pair. In SemEval 2010-task8, four position indicators mark the entity pair in the sentence; an example is shown in Table 1.
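As a minimal sketch of the PF described above (assuming PyTorch; the clipping range and the toy sentence are illustrative), each word receives two relative distances that index a small trainable embedding table:

    # Position feature sketch: two relative distances per word, each mapped
    # to a trainable 5-dimensional embedding (PyTorch assumed).
    import torch
    import torch.nn as nn

    tokens = ["Jobs", "is", "the", "founder", "of", "Apple"]
    e1_idx, e2_idx = 0, 5                 # positions of "Jobs" and "Apple"
    max_dist = 30                         # clipping range (illustrative)

    # Distance convention follows the text: "is" -> "Jobs" is -1, -> "Apple" is 4.
    d1 = [max(-max_dist, min(max_dist, e1_idx - i)) for i in range(len(tokens))]
    d2 = [max(-max_dist, min(max_dist, e2_idx - i)) for i in range(len(tokens))]

    pos_emb = nn.Embedding(2 * max_dist + 1, 5)   # dimension 5, as noted above
    pe1 = pos_emb(torch.tensor(d1) + max_dist)    # shift indices to be non-negative
    pe2 = pos_emb(torch.tensor(d2) + max_dist)
    # Final word representation: concatenation [word vector; pe1; pe2].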
Shortest Dependency Path (SDP) is a word-level de-noising method derived from the grammatical dependency tree, which masks irrelevant words that influence the relation of the entities in a sentence. (The grammatical dependency tree can be obtained with the Stanford Parser: http://nlp.stanford.edu/software/lex-parser.shtml.) Bunescu and Mooney [27] first used the SDP to design a kernel-based method for the RE task, and many SDP-based models follow this work, such as SDP-LSTM [28], BRCNN [29], DesRC (BRCNN) [25], Att-RCNN [30], and FORESTFT-DDCNN [31]. For instance, the sentence "<e1> People </e1> have been moving back into <e2> downtown </e2>." can be parsed into the SDP "[People]e1 → moving → into → [downtown]e2". This example illustrates that the SDP captures the predicate-argument sequences. In a sense, these sequences have great benefits for RE: firstly, this method compresses the information content of sentences; secondly, it directly shows the dependency relations between the words; finally, it also provides a clearer relation direction between the entities.
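The SDP itself is easy to obtain from an off-the-shelf parser; a minimal sketch follows (assuming spaCy with the en_core_web_sm model and networkx, both illustrative choices rather than what the surveyed models used):

    # Shortest dependency path sketch: parse with spaCy, search with networkx.
    import spacy
    import networkx as nx

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("People have been moving back into downtown.")

    # Treat the dependency tree as an undirected graph over token indices.
    graph = nx.Graph((tok.i, child.i) for tok in doc for child in tok.children)

    e1 = next(t.i for t in doc if t.text == "People")
    e2 = next(t.i for t in doc if t.text == "downtown")
    path = nx.shortest_path(graph, source=e1, target=e2)
    print([doc[i].text for i in path])   # e.g., ['People', 'moving', 'into', 'downtown']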
Traditional RE approaches can be divided into five categories: hand-built pattern methods, semi-supervised methods, supervised methods, unsupervised methods, and distant supervision methods. In this paper, we refer to these five methods without DNNs as traditional methods.

Hand-built pattern methods require cooperation between domain experts and linguists to construct a knowledge set of patterns based on words, parts of speech, or semantics. With this linguistic and professional domain knowledge, RE can be realized by matching a preprocessed language fragment against the patterns: if they match, the statement can be said to hold the relation of the corresponding pattern [32, 33]. Table 2 is an example of a pattern for hyponymy.

Table 2
The example of the pattern "such Y as X".

Pattern: such Y as X
Corpus: ... works by such authors as Herrick, Goldsmith, and Shakespeare.
Relations: Hyponym("author", "Herrick"), Hyponym("author", "Goldsmith"), Hyponym("author", "Shakespeare")

Initial Seed Tuples → Occurrences of Seed Tuples → Generate Extraction Patterns → Generate New Seed Tuples
Fig. 3
The main idea of DIPRE.
Semi-supervised methods are pattern-based methods in essence. The typical technique is the bootstrapping algorithm, and the representative model is DIPRE (Dual Iterative Pattern Relation Expansion), proposed by Brin et al. [34]. The idea behind this method is to first find some seed tuples with high confidence; the bootstrapping algorithm then extracts patterns matching these tuples from a large unlabeled corpus, and these patterns are used in turn to extract new triples, as illustrated in Figure 3 and sketched below. Some other representative models are Snowball [35], KnowItAll [36], and TextRunner [37]. There is also a recent method proposed by Phi et al. [38].
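The following toy sketch conveys the bootstrapping loop in the spirit of DIPRE; the miniature corpus, the regular-expression patterns, and the number of rounds are all illustrative assumptions:

    # Schematic bootstrapping: seed tuples -> extraction patterns -> new tuples.
    import re

    corpus = [
        "Jobs is the founder of Apple.",
        "Gates is the founder of Microsoft.",
        "Page is the founder of Google.",
    ]
    tuples = {("Jobs", "Apple")}          # high-confidence seed tuple

    for _ in range(2):                    # a few bootstrapping rounds
        # 1) Induce a pattern from every sentence containing a known tuple.
        patterns = set()
        for e1, e2 in tuples:
            for s in corpus:
                if e1 in s and e2 in s:
                    p = re.escape(s).replace(e1, r"(\w+)", 1).replace(e2, r"(\w+)", 1)
                    patterns.add(p)       # e.g., "(\w+) is the founder of (\w+)\."
        # 2) Apply each pattern to the corpus to harvest new tuples.
        for p in patterns:
            for s in corpus:
                m = re.match(p, s)
                if m and len(m.groups()) == 2:
                    tuples.add(m.groups())

    print(tuples)   # grows to include (Gates, Microsoft) and (Page, Google)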
Unsupervised methods adopt a bottom-up information extraction strategy based on the assumption that the contexts of different entity pairs with the same semantic relation are relatively similar. An early unsupervised approach was proposed by Hasegawa et al. [39]. The extraction process can be divided into three steps: extract entity pairs and their contexts, cluster the entity pairs according to their contexts, and annotate the semantic relation of each cluster or describe the relation type.
Supervised methods consider RE as a multi-class classification problem. These approaches are classified into two types: feature-based and kernel-based [2]. In the feature-based methods [40, 41], each relation instance in the labeled data is used to train a classifier that is fed with subsequent new instances for classification. Generally, the features come from useful information (lexical, syntactic, semantic) extracted from an instance's context; without proper feature selection, it is difficult for a feature-based method to improve performance. Compared with the feature-based methods, the kernel-based methods [27, 42] rarely need explicit linguistic preprocessing steps, but they depend more on the performance of the designed kernel function: the key step in this approach becomes how to design an effective kernel.
Distant Supervision methods are a kind of knowledge-based or weakly supervised method proposed by Mintz et al. [43]. All of these works are based on the assumption: if two entities participate in a relation, all sentences that mention these two entities can express that relation. In other words, any sentence that contains a pair of entities participating in a KB is likely to express that relation. In this way, distant supervision attempts to extract the relations between entities from text by using a KB, such as Freebase, as the supervision source. When a sentence and a KB refer to the same entity pair, this method heuristically marks the sentence with the corresponding relation in the KB. For example, in the sentence "Jobs is the founder of Apple.", the person "Jobs" and the organization "Apple" appear in Freebase, and Freebase has a triple (entity1: Jobs, entity2: Apple, relation: founded_by) corresponding to the mentioned entity pair; therefore these entities are taken to express the relation founded_by. The process is shown in Figure 4.

Fig. 4
The process of distant supervision. The upper left of the figure is the knowledge base, and the lower left is the corpus source. After the text alignment process in the middle, the right side produces the bags corresponding to the knowledge base, representing various relations: each bag represents a relation label and contains several sentence instances.

3.2 Discussion

Although there are numerous ways to solve the problem of RE, this field has consistently shown that these methods face various obstacles. The hand-built pattern and semi-supervised methods require the manual exhaustion of all relation patterns, resulting in inevitable human errors. In the supervised methods, a variety of mature NLP toolkits [44] provide technical support, but both feature design and kernel design are still time-consuming and laborious. The clustering results generated by unsupervised methods are generally broad, and one of the main obstacles is defining an appropriate relation inventory; moreover, these methods have limited processing capacity for low-frequency entity pairs and lack a standard evaluation corpus or even unified evaluation criteria. Distant supervision can effectively label data for RE, yet it suffers from the wrong label and low accuracy problems. In addition, all of these approaches have domain limitations, error propagation, and a poor ability to learn underlying features.

To solve these problems, some scholars adopt DNN-based methods to improve the performance of RE. In fact, DNNs have been widely applied in other fields of NLP, such as machine translation, sentiment analysis, automatic summarization, question answering, and information recommendation, and all of them have achieved state-of-the-art performance. To date, DNN-based RE methods have been used in the supervised and distant supervision settings mentioned above. These DNN-based methods [18, 28, 45] can automatically learn features instead of relying on features manually designed with various NLP toolkits. At the same time, most of them have completely surpassed the traditional methods in effect. Table 3 shows a comparison of traditional RE methods and earlier DNN-based methods, which illustrates that DNN-based methods can obtain higher scores with fewer features.

Based on the five traditional methods mentioned above, the DNN-based methods introduced in this paper mainly focus on supervised methods and distant supervision methods.
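Referring back to the alignment process of Figure 4, a toy sketch of the heuristic labeling under the original assumption also makes the wrong-label problem visible; the miniature KB and corpus are illustrative:

    # Distant-supervision labeling sketch: any sentence mentioning both
    # entities of a KB triple is placed in that triple's bag.
    kb = {("Jobs", "Apple"): "founded_by"}
    corpus = [
        "Jobs is the founder of Apple.",
        "Jobs unveiled the new Apple phone.",   # noise: same pair, different relation
    ]

    bags = {}   # (e1, e2, relation) -> list of aligned sentences (a "bag")
    for (e1, e2), rel in kb.items():
        for sent in corpus:
            if e1 in sent and e2 in sent:
                bags.setdefault((e1, e2, rel), []).append(sent)

    print(bags)  # the second, wrongly labeled sentence lands in the same bag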
Table 3
The comparison of traditional methods and DNN-based methods.
Classifier | Feature Sets | F1
SVM | POS, stemming, syntactic patterns | 60.1
SVM | word pair, words in between | 72.5
SVM | POS, stemming, syntactic patterns | 74.8
MaxEnt | POS, morphological, noun compound, thesauri, Google n-grams, WordNet | 77.6
SVM | POS, prefixes, morphological, WordNet, dependency parse, Levin classes, PropBank, FrameNet, NomLex-Plus, Google n-gram, paraphrases, TextRunner | 82.2
RNN | POS, NER, WordNet | 77.6
MVRNN | POS, NER, WordNet | 82.4
CNN+softmax | word pair, words around word pair, WordNet | 82.7

Some traditional classifiers, their feature sets, and their F1-scores for RE [18].
For the sake of simplicity and clarity, this paper further subdivides the methods using different DNNs into three types (CNN, RNN or LSTM, and mix-structure), whose evolutionary process is shown in Figure 5 (a).
Fig. 5 (a) and (b) are the evolutionary process and the architecture of supervised methods, respectively.

Although further subdivision is carried out, the high-accuracy advantage of DNN-based supervised methods is consistent with that of the traditional supervised methods, and it has been further improved. To facilitate the analysis, the development of each type of model is divided into two sub-types according to the evolution of the model structure: structure-oriented and semantic-oriented. The structure-oriented classes improve the ability of feature extraction by changing the structure of the model; the semantic-oriented classes improve the ability of semantic representation by excavating the internal associations of the text. The architecture of the supervised methods is shown in Figure 5 (b).

4.1 CNN-based Methods

CNN-based models are general models for RE and have achieved excellent results. Several of their modules play a key role in the subsequent RE tasks, such as CNNs with multiple filters [19], piecewise convolution [46], the attention mechanism, and PE [18]. This subsection introduces these related models; some comparisons are shown in Table 4.
Table 4
The comparison of CNN-based methods.
Class | Author | Model name | Framework | Features set | Loss function | Optimization | F1
Str | Liu et al. [45] | - | fixed-size-filter-CNN | One-hot | - | - | -
Str | Zeng et al. [18] | - | fixed-size-filter-CNN | WE1+WA+WN+PE | Cross entropy | SGD | 82.7
Str | Zeng et al. [18] (variant) | - | fixed-size-filter-CNN | WE1+WA+WN | Cross entropy | SGD | 69.7
Str | Nguyen et al. [19] | - | multi-sizes-filter-CNN+max-pooling | WE2+PE | - | SGD | 82.8
Str | Santos et al. [20] | CR-CNN | fixed-size-filter-CNN+max-pooling | WE2+PE | Ranking loss | SGD | 84.1
Str | Santos et al. [20] (variant) | - | fixed-size-filter-CNN+max-pooling | WE2+PE | Cross entropy | SGD | 82.4
Sem | Xu et al. [48] | depLCNN+NS | fixed-size-filter-CNN+max-pooling+negative sampling | WE1+WA+WN+SDP | Cross entropy | SGD | 85.6
Sem | Xu et al. [48] (variant) | - | fixed-size-filter-CNN+max-pooling | WE1+PE | Cross entropy | SGD | 83.7
Sem | Wang et al. [21] | - | fixed-size-filter-CNN+two-level-CNN+two-level-attention+max-pooling | WE2+PE+WA | Distance function | SGD | 88.0
Sem | Wang et al. [21] | Att-Pooling-CNN | fixed-size-filter-CNN+two-level-CNN+max-pooling | WE2+PE+WA | Distance function | SGD | 86.1

In this table, Str and Sem refer to the structure-oriented and semantic-oriented classes. WE1, WE2, WE3 refer to the word embeddings proposed by Turian et al. [16], Mikolov et al. [15], and Pennington et al. [17], respectively. WA, WN, PE, PI refer to words around nominals, WordNet, position embedding, and position indicators. SDP, GR, WNSYN, Relative-DEP refer to the shortest dependency path, grammar relations, hypernyms, and relative dependency. The F1 values are based on SemEval 2010-task8 [3]. The table lists the best result of each model and its variants; subsequent tables follow this format.
Structure-oriented:
The first model using a CNN for RE is proposed by Liu et al. [45]. With a synonym dictionary and other lexical features, this model transforms the sentence into a series of word vectors, which are fed to a CNN and a softmax output layer to obtain a classification probability. This paper is just a first attempt to adopt the CNN for this task: although it achieves better results, the method still depends on NLP toolkits and barely considers semantics, model structure, and feature selection.

To improve the feature selection, Kim [46] introduces multiple filters and max-pooling modules for the CNN, which is probably one of the earliest such works in text classification. Based on this work, Kalchbrenner et al. [47] describe a dynamic CNN that uses a dynamic k-max-pooling operator to pick up features from the result of the CNN. Both of the above models [45, 47] achieve high performance on different tasks.

To finish this work with fewer NLP toolkits, Zeng et al. [18] creatively put forward the concept of PE between entities, combine the lexical information of the entities with the sentence-level features extracted by the CNN, and integrate these features into the network, achieving the state of the art on SemEval 2010-task8 in 2014. This model solves the cumbersome preprocessing problem in the task and avoids the error propagation problem to some extent. After this model, almost all CNN models use the PE method and try to extract information with fewer NLP toolkits, or even try to extract relations without using any information other than word embeddings. However, with a fixed-size-filter CNN, this model only focuses on local features and ignores global features.

To bring in more structural information, Nguyen et al. [19] apply the concept of multiple-window-size filters to RE based on Kim's [46] work. Compared with a fixed-size-filter CNN, this paper demonstrates that filters with multiple window sizes (multi-sizes-filter-CNN) bring more structural information to the model.
This would be an efficient way to improve the CNN architecture, and many subsequent models adopt this technique as well. Meanwhile, this paper also shows that pre-trained word vectors that change dynamically during model training help to improve performance, although dynamic word vectors are in fact not often used. The model structure is shown in Figure 6. Besides, all the above models deploy a softmax module, which cannot eliminate the influence of similar classes on the final classification result.
Fig. 6
The structure of the model [19] with multi window filters.
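A minimal sketch of such a multi-window-size convolutional encoder (PyTorch assumed; all dimensions, window sizes, and the random input are illustrative):

    # Multi-window-size CNN with max-pooling, in the spirit of [19].
    import torch
    import torch.nn as nn

    class MultiFilterCNN(nn.Module):
        def __init__(self, emb_dim=60, n_filters=150, windows=(2, 3, 4, 5), n_rel=19):
            super().__init__()
            self.convs = nn.ModuleList(
                nn.Conv1d(emb_dim, n_filters, k, padding=k - 1) for k in windows)
            self.fc = nn.Linear(n_filters * len(windows), n_rel)

        def forward(self, x):                # x: (batch, seq_len, emb_dim)
            x = x.transpose(1, 2)            # Conv1d expects (batch, channels, seq_len)
            pooled = [torch.relu(c(x)).max(dim=2).values for c in self.convs]
            return self.fc(torch.cat(pooled, dim=1))   # relation logits

    logits = MultiFilterCNN()(torch.randn(8, 40, 60))  # -> shape (8, 19)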
Santos et al. [20] improve the loss function instead of the softmax classifier. The model's parameters are trained by minimizing a new ranking loss function (CR-CNN) over the training set, giving a higher score to the correct class and lower scores to the wrong classes. This new loss function improves the model and could be used with other classifiers.
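A sketch of this pairwise ranking loss (PyTorch assumed; the margin and scaling values follow those reported for CR-CNN but should be treated as assumptions here):

    # CR-CNN-style ranking loss: push the gold-class score above a positive
    # margin and the best competing class below a negative margin.
    import torch

    def ranking_loss(scores, gold, gamma=2.0, m_pos=2.5, m_neg=0.5):
        # scores: (batch, n_classes) class scores; gold: (batch,) gold class ids
        s_pos = scores.gather(1, gold.unsqueeze(1)).squeeze(1)
        masked = scores.scatter(1, gold.unsqueeze(1), float("-inf"))
        s_neg = masked.max(dim=1).values           # most competitive wrong class
        return (torch.log1p(torch.exp(gamma * (m_pos - s_pos)))
                + torch.log1p(torch.exp(gamma * (m_neg + s_neg)))).mean()

    loss = ranking_loss(torch.randn(8, 19), torch.randint(0, 19, (8,)))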
Semantic-oriented:
To learn more robust relation representations, Xu et al. [48] propose a model combining a CNN with the SDP (discussed in Section 2.2). The model, which essentially takes the subject-to-object path of a sentence as input, removes the words that are irrelevant to relation discrimination, achieving high accuracy with simple negative sampling. This is the first DNN model using the SDP, and this technique and its variants are widely adopted in subsequent models.

Another way to improve the semantic representation is the attention mechanism. Wang et al. [21] use two levels of attention, called Multi-Level Attention CNNs, which enable end-to-end learning from task-specific labeled data. This multiple attention mechanism considers semantic information at both the word level and the sentence level, and the model barely uses any external semantic information at all. At the first level, the model constructs an entity-based attention matrix at the input level, which highlights the words related to the corresponding relation; at the second level, the mechanism captures more abstract, higher-level features to construct the final output matrix.
4.2 RNN-based Methods

Table 5
The comparison of RNN-based methods.
Class | Author | Model name | Framework | Features set | Loss function | Optimization | F1
Str | Zhang et al. [22] | RNN | Bi-RNN+max-pooling | WE1+PI | Cross entropy | SGD | 80.0
Str | Zhang et al. [22] (variant) | RNN | Bi-RNN+max-pooling | WE2+PI | Cross entropy | SGD | 82.5
Str | Zhang et al. [11] | BLSTM | BLSTM+max-pooling | WE3+PE+WN+NER+POS+WNSYN+Relative-DEP | - | - | 84.3
Str | Zhang et al. [11] (variant) | BLSTM | BLSTM+max-pooling | WE3+WA | - | - | 82.7
Sem | Xu et al. [28] | SDP-LSTM | SDP+LSTM | WE2+SDP | Cross entropy | SGD | 82.4
Sem | Xu et al. [28] (variant) | SDP-LSTM | SDP+LSTM | WE2+SDP+WA+POS+GR | Cross entropy | SGD | 83.7
Sem | Zhou et al. [49] | Att-BLSTM | BLSTM+attention | WE3+PI | Negative log-likelihood (cross entropy) | AdaDelta | 84.0
Sem | Zhou et al. [49] (variant) | BLSTM | BLSTM+max-pooling | WE3+WA | - | - | 82.7
Sem | Xu et al. [50] | DRNNs | multi-channel-RNN+max-pooling+augmentation | WE2+GR+POS+WN+SDP (augmentation) | Cross entropy | SGD | 86.1
Sem | Xu et al. [50] (variant) | DRNNs | multi-channel-RNN+max-pooling | WE2+GR+POS+WN+SDP | Cross entropy | SGD | 84.16
Sem | Xiao et al. [51] | BLSTM+BLSTM | 2-level-BLSTM+attention | WE2+WN+NER | Ranking loss | AdaGrad | 84.27
Sem | Xiao et al. [51] (variant) | BLSTM+BLSTM | 2-level-BLSTM+attention | WE2 | Ranking loss | AdaGrad | 83.9
Sem | Qin et al. [23] | EAtt-BiGRU | Bi-GRU+entity-attention | WE2+PE | - | AdaDelta | 84.7
Sem | Qin et al. [23] (variant) | - | Bi-GRU+random-attention | WE2+PE | - | AdaDelta | 83.6
Sem | Zhang et al. [24] | BiGRU-MCNN-ATT | BiGRU+multi-size-filter-CNN-attention | WE2+PE+SDP | - | AdaDelta | 84.7
Sem | Zhang et al. [24] (variant) | BiGRU-Random-ATT | BiGRU+random-attention | WE2+PE+SDP | - | AdaDelta | 84.2
Sem | Lee et al. [52] | BLSTM+LET | BLSTM+entity-aware-attention+latent-entity-type | WE2+PE+LET | Cross entropy | AdaDelta | 85.2
Sem | Lee et al. [52] (variant) | - | BLSTM+entity-aware-attention | WE2+PE | Cross entropy | AdaDelta | 84.7

Table symbols have the same meanings as in Table 4.
Structure-oriented:
For learning relations within a long context and considering timing information, Zhang et al. [22] apply a bi-directional RNN architecture to this task. The RNN combines the output of each hidden state and then represents the feature at the sentence level; at the end of the model, a max-pooling operation picks up a few trigger-word features for prediction. Although the max-pooling operation simplifies feature extraction, the effectiveness of these features remains to be discussed. Besides, the RNN model still has the gradient explosion problem. To solve this problem, LSTM [13] was proposed using the gate mechanism; based on this, Xu et al. [28] propose a model with LSTM (discussed under the semantic-oriented type).

Since complete sequential information about all words in the sentence is beneficial to RE, Zhang et al. [11] apply bi-directional long short-term memory networks (BLSTM) to obtain the sentence-level representation and also use several lexical features. The experimental results show that using word embeddings as input features alone is enough to achieve state-of-the-art results, and this study documents the effectiveness of the BLSTM. Although the method improves the representation of sentence-level features, there are still two problems: a large number of external artificial features are introduced, and there is no effective feature filtering mechanism.
Fig. 7
The structure of the model [49], BLSTM with an attention mechanism.
Semantic-oriented:
Based on the above problems, Xu et al. [28] propose a new DNN model called SDP-LSTM. This model leverages four types of information (word vectors, POS tags, grammatical relations, and WordNet hypernyms) to construct four channels that supply external information, and then concatenates the results of the four channels into the softmax layer for prediction. This model is a little more complex than Zhang et al.'s [22] because it considers a lot of additional syntactic and semantic information.

Following the SDP-LSTM work [28], to overcome the problem that shallow architectures can hardly represent the potential space at different network levels, Xu et al. [50] increase the number of neural network layers to tackle this challenge, with which the model captures abstract features along the two sub-paths of the SDP. Meanwhile, the small size of the SemEval 2010-task8 set combined with deeper neural networks can easily result in overfitting; hence, the authors augment the data set by adding the directivity of the data based on the original SDP, avoiding the overfitting problem.

The SDP filters the input text but cannot filter the extracted features. To tackle this issue, Zhou et al. [49] introduce an attention mechanism into the BLSTM, which automatically highlights the important features using only the raw text instead of NLP toolkits or lexical resources. This work is a representative BLSTM model, and its architecture is shown in Figure 7. Similar to the work of Zhou et al. [49], Xiao et al. [51] propose a two-level BLSTM architecture with a two-level attention mechanism to extract a high-level representation of the raw sentence.

Although the attention mechanism gives more weight to the important features extracted by the model, the attention weights in [49] are randomly initialized, which lacks consideration of prior knowledge. Therefore, the following works improve this model.

Fed with the entity pair and the sentence, EAtt-BiGRU, proposed by Qin et al. [23], leverages the entity pair as prior knowledge to form the attention weights. Different from Zhou et al.'s [49] work, EAtt-BiGRU applies a bi-directional GRU (BiGRU) instead of a BLSTM to reduce computation, which helps in obtaining the sentence representation, and adopts a one-way GRU to extract the prior knowledge of the entity pair. With the representation and prior knowledge, this model can generate the corresponding attention weights adaptively. This work improves the random attention mechanism, but how to better integrate the prior knowledge needs further study.

Zhang et al. [24] propose another kind of attention mechanism based on the SDP, which is another form of prior knowledge. This model uses a Bi-GRU to extract sentence-level features and attention weights to select features from a multi-channel CNN for classification. Compared with other random or entity-based attention mechanisms [23, 49], this model constructs better attention weights using the SDP.

Further research follows by Lee et al. [52], who propose a mixed model with BLSTM, self-attention, entity-aware attention, and a latent entity typing module, achieving the state of the art without any high-level features. In general, the instances of the data set carry no entity-type attribute, yet the type of the entity pair is closely related to the relation classes. Previous works can only obtain word-level or sentence-level attention and rarely capture the degree of correlation between the entities and other related words. Hence, this model introduces the latent entity typing, self-attention, and entity-aware attention modules, providing more prior knowledge.
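A minimal sketch of word-level attention over BLSTM states (PyTorch assumed; dimensions are illustrative, and the forward/backward states are concatenated here rather than summed as in the original model):

    # BLSTM with word-level attention, in the spirit of Att-BLSTM [49].
    import torch
    import torch.nn as nn

    class AttBLSTM(nn.Module):
        def __init__(self, emb_dim=100, hidden=100, n_rel=19):
            super().__init__()
            self.lstm = nn.LSTM(emb_dim, hidden, bidirectional=True, batch_first=True)
            self.att = nn.Linear(2 * hidden, 1, bias=False)  # scores each time step
            self.fc = nn.Linear(2 * hidden, n_rel)

        def forward(self, x):                    # x: (batch, seq_len, emb_dim)
            h, _ = self.lstm(x)                  # (batch, seq_len, 2*hidden)
            alpha = torch.softmax(self.att(torch.tanh(h)), dim=1)   # attention weights
            sent = (alpha * h).sum(dim=1)        # weighted sum over time steps
            return self.fc(torch.tanh(sent))     # relation logits

    logits = AttBLSTM()(torch.randn(4, 30, 100))   # -> shape (4, 19)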
4.3 Mix-structure based Methods

In addition to the above two types of models, some scholars combine these models based on their respective characteristics, which can be beneficial to the RE task. There are two ways to merge these models: simple combination (Structure-oriented) [26] and the attention mechanism (Semantic-oriented) [30]. A comparison of them is shown in Table 6.
Structure-oriented:
To integrate RNNs and CNNs, Zheng et al. [53] propose two neural networks based on a CNN and LSTM framework (MixCNN+CNN and MixCNN+LSTM) that jointly learn the entity semantics and the relation pattern. In this model, the semantic properties of an entity can be reflected by its surrounding words, which tackles unknown words in entities (the out-of-vocabulary problem), and the relation pattern is modeled by the sub-sentence between the given entities instead of the whole sentence. With the entity semantics and the relation pattern, the performance of RE can be improved. This research sheds new light on merging these two modules, showing their complementarity as well as the necessity of module integration.
Table 6
The comparison of methods based on mixed neural networks.
Class | Author | Model name | Framework | Features set | Loss function | Optimization | F1
Str | Zheng et al. [53] | MixCNN+CNN | multi-sizes-filter-CNN+fixed-size-filter-CNN+max-pooling | WE2+WA | Cross entropy | SGD | 84.8
Str | Zheng et al. [53] | MixCNN+LSTM | multi-sizes-filter-CNN+LSTM+max-pooling | WE2+WA | Cross entropy | SGD | 83.8
Str | Zhang et al. [26] | BLSTM-CNN | BLSTM+fixed-size-filter-CNN+max-pooling | WE2+PI+PE | - | - | 83.2
Str | Zhang et al. [26] | BLSTM-CNN+PF | BLSTM+fixed-size-filter-CNN+max-pooling | WE2+PE | - | - | 81.9
Str | Zhang et al. [26] | BLSTM-CNN+PI | BLSTM+fixed-size-filter-CNN+max-pooling | WE2+PI | - | - | 82.1
Sem | Cai et al. [29] | BRCNN | 2-level-LSTM+2-level-fixed-size-filter-CNN+SDP (2 directions) | WE2+POS+NER+SDP | Cross entropy | AdaDelta | 86.3
Sem | Cai et al. [29] (variant) | BRCNN | 2-level-LSTM+2-level-fixed-size-filter-CNN+SDP (2 directions) | WE2+SDP | Cross entropy | AdaDelta | 85.4
Sem | Ren et al. [25] | DesRC (BRCNN) | BRCNN+fixed-size-filter-CNN-description+attention | WE2+PE+WN+SDP | - | SGD | 87.4
Sem | Ren et al. [25] (variant) | - | BRCNN+fixed-size-filter-CNN | WE2+PE+SDP | - | SGD | 84.7
Sem | Guo et al. [30] | Att-RCNN | fixed-size-filter-CNN+max-pooling+BiGRU+attention | WE2+SDP | Pairwise logistic loss | SGD | 86.6
Sem | Guo et al. [30] (variant) | - | fixed-size-filter-CNN+max-pooling+BiGRU | WE2+SDP | Pairwise logistic loss | SGD | 85.1
Sem | Wang et al. [54] | Bi-SDP | fixed-size-filter-CNN+max-pooling+BLSTM+attention | WE2+PE+SDP | Cross entropy | SGD | 85.1

Table symbols have the same meanings as in Table 4.
Zhang et al. [26] introduce BLSTM-CNN, which, without any lexical attention mechanism or NLP toolkits, utilizes just three kinds of resources (word embedding, PE, and PI), showing that simply merging a BLSTM and CNNs can perform better than any single model. However, the PE and PI, both of which indicate the words around the nominals, serve the same function for RE, which may result in redundant input features.
Semantic-oriented:
Different from the above two works, Cai et al. [29] combine these two types of models based on the SDP. To improve the model's sense of relation directivity, this model, called BRCNN, learns sentence features from the SDP in both the positive and negative directions, which is beneficial for predicting the direction of the relation.

Following Cai et al.'s [29] work, Ren et al. [25] advance a further work, a traditional CNN architecture combined with BRCNN [29], using two kinds of attention ('intra' and 'cross') to combine the classification features that come from the original sentences and their corresponding descriptions. The description of an entity comes from external text, which can enrich the prior knowledge. Comparing different experiments and models, the results demonstrate that text descriptions can provide more features to the model and can replace WordNet to a certain extent in RE. In this line, this is the first RE method with entity description information, but the method of extracting descriptions is too simple: as external knowledge, the description information should be closely related to the original sentence, so the selection of descriptions should be more targeted.

Guo et al. [30] propose a novel Att-RCNN model to extract text features. This model leverages GRU units instead of LSTM units, which converge faster, and a more efficient CNN to extract high-level features. Meanwhile, its two-level attention mechanism is similar to [21].
The special part of this model is the introduction of a new de-noising method based on the SDP [28], which obtains a continuous fragment of the original text.

Wang et al. [54] further utilize the SDP to construct a Bi-SDP with parallel attention weights to capture the direction of the relation. In a sense, this method introduces more information about the SDP and interprets another function of prepositions in a sentence (giving a hint of the direction of a relation), which is ignored by previous works.

4.4 Discussion

As stated above, the structure-oriented and semantic-oriented models improve the RE task in different aspects: one improves the ability of feature extraction, and the other focuses on the semantic representation of the text. Tables 4, 5, and 6 show that semantic-oriented models are more effective than structure-oriented models. This phenomenon is widely observed across all of these models and reflects that the relational facts in a sentence are semantically strongly related to the sentence itself. In other words, we argue that the models constructed around the SDP, providing external semantic information, give an interpretable, human-level way of thinking and cognition for coping with the RE task. Apart from the above works, some other research fields also shed new light on RE, such as knowledge distillation [55] and auxiliary learning [56]. Utilizing knowledge distillation to generate soft labels and guide the student network to learn dark knowledge seems able to overcome the limitation of the relation inventory and the hard-label problem in supervised methods. Auxiliary learning provides a simple module to further dig out the latent semantic relations in wrongly classified results, which can alleviate the semantic gap between the sentence and the label. All of the above methods provide a solid foundation for the development of supervised RE. However, we have to admit that there are still many limitations in these approaches, which may be mitigated by distant supervision.
This paper introduced the conception of distant supervision in Section 3.2. In supervised RE, despite the excellent results, the insufficient training corpus hinders further development. To solve this problem, Mintz et al. [43] proposed the distant supervision RE method at the ACL conference in 2009. This method rarely needs manual annotation and generates large-scale training data sets (discussed in Section 3.2). It is strongly based on an assumption that plays an important role in the selection of training examples; to date, this assumption has evolved into three assumptions, from which the large-scale training data sets derive:
Table 7
The comparison of DNN-based distant supervision methods.
Class | Author | Model name | Framework | Features set | De-noise method | Loss function | Optimization
En | Zeng et al. [57] | PCNN | multi-fixed-size-filter-CNN+piecewise-max-pooling | WE2+PE | Multi-instance learning | Cross entropy | Adadelta
En | Jiang et al. [58] | MIMLCNN | multi-fixed-size-filter-CNN+piecewise-max-pooling+cross-sentence-max-pooling | WE2+PE | Output multi-label | Cross entropy | Adadelta
Re | Yang et al. [59] | BiGRU+2ATT | BiGRU+word-level-attention+sentence-level-attention | WE2+PE | Sentence-level attention | Cross entropy | Adam
Re | Lin et al. [60] | MNRE | multi-fixed-size-filter-CNN+max-pooling+mono-lingual attention+cross-lingual attention | WE1+PE | Mono-lingual (sentence-level) attention | - | SGD
Re | Lin et al. [61] | PCNN+ATT | multi-fixed-size-filter-CNN+piecewise-max-pooling+sentence-level-attention | WE2+PE | Selective attention | Cross entropy | SGD
Re | Banerjee et al. [62] | MEM | multi-channel-BLSTM | WE3+DP+POS | Co-occurrence statistics | Cross entropy | -
Re | Du et al. [63] | MLSSA | BLSTM+word-level-attention+sentence-level-attention | WE2+PE | Sentence-level attention | - | Adam
Ex | Ji et al. [64] | APCNNs+D | multi-fixed-size-filter-CNN+piecewise-max-pooling+fixed-size-filter-CNN-description+sentence-level-attention | WE2+PE | Sentence-level attention | Cross entropy | Adadelta
Ex | Wang et al. [65] | LFDS | multi-fixed-size-filter-CNN+piecewise-max-pooling+word-level-attention | WE2+PE | KG embedding | Margin loss | -
Ex | Vashishth et al. [66] | RESIDE | GCN+BiGRU+word-level-attention+sentence-level-attention | WE3+PE+DP | - | - | -
Pl | Qin et al. [67] | RL | fixed-size-filter-CNN+reinforcement learning | WE2 | Reinforcement learning | - | -
Pl | Qin et al. [68] | DSGAN | fixed-size-filter-CNN+GAN | WE2+PE | DSGAN | - | -

Table symbols have the same meanings as in Table 4. In addition, En, Re, Ex, Pl refer to sentence encoder, enhanced representation, external knowledge, and plug-and-play component.
Assumption 1:
If two entities participate in a relation, all sentences that mention these two entities express that relation.
Assumption 2:
If two entities participate in a relation, at least one sentence that mentions these two entities might express that relation.
Assumption 3:
A relation holding between two entities can be either expressed explicitly or inferred implicitly from all sentences that mention these two entities.
The first assumption [43] physically aligns the texts with a KB and heuristically trains a classifier using the existing triples in the KB. Riedel et al. [6] considered the first assumption too strong, leading to the wrong label (noise) problem, and proposed the second assumption, called "one sentence from one bag", which leads to more accurate results. The third assumption [58] can take more sentence features into account.

Based on the above three assumptions, distant supervision methods comprise two research directions:
Sentence Encoder optimizes the model and the performance of RE;
De-noise Algorithm improves the quality of the data sets, which can be further subdivided into three main ways: the first makes full use of the word-level and sentence-level features of the instances in a bag to enhance the representation; the second introduces external knowledge; and the third constructs a plug-and-play component. This section presents the related works and summarizes the evolution of this approach (shown in Figure 8 (a)). The whole architecture of DNN-based distant supervision is shown in Figure 8 (b). Hence, the rest of this chapter follows these four points: 1) encoder-based, 2) representation-based, 3) knowledge-based, and 4) plug-and-play based.
Fig. 8 (a) and (b) are the evolutionary process and the architecture of DNN-based distant supervision methods, respectively. Sub-figure (b) also shows the relationship between the sentence encoder and the de-noise method: the sentence encoder is responsible for encoding the sentence bags in the distant supervision data set into vectors, while the de-noise algorithm is responsible for selecting the sentences in each bag that correctly represent the relation.

In addition, different from the supervised methods, Table 7 does not include evaluation indicators, because most of the methods in Table 7 report only an approximate measure of precision.

5.1 Encoder-based Methods

The encoder-based methods, including the PCNN (Piecewise CNN), max-pooling, and multi-instance learning (MIL) modules, provide an infrastructure for distant supervision methods and a reference for the improvement of subsequent de-noise methods.
PCNN+MIL: Inspired by Zeng et al. [18], Zeng et al. [57] exploit the PCNN with MIL, which divides the sentence into three segments based on the positions of the two entities, to automatically extract the relevant features from a sentence and capture the important structural information. They then deploy MIL to train the model with the highest-confidence instance to reduce the noise (using one sentence from one bag). This model outperforms several competitive baselines but ignores the other instances in the bag.
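A sketch of the piecewise max-pooling step (PyTorch assumed; the number of filters, sequence length, and entity positions are illustrative):

    # Piecewise max-pooling (PCNN [57]): split the convolutional feature map
    # into three segments at the two entity positions, max-pool each segment.
    import torch

    def piecewise_max_pool(feat, e1_pos, e2_pos):
        # feat: (n_filters, seq_len) convolution output for one sentence
        segments = [feat[:, :e1_pos + 1],
                    feat[:, e1_pos + 1:e2_pos + 1],
                    feat[:, e2_pos + 1:]]
        pooled = [s.max(dim=1).values for s in segments if s.size(1) > 0]
        return torch.cat(pooled)          # (3 * n_filters,) sentence representation

    vec = piecewise_max_pool(torch.randn(230, 40), e1_pos=5, e2_pos=20)  # -> (690,)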
MIMLCNN: With the same purpose as above, Jiang et al. [58] propose a multi-instance multi-label CNN for distant supervision and introduce Assumption 3. This work leverages a CNN to extract features from a single sentence and then aggregates all sentence representations into an entity-pair-level representation by cross-sentence max-pooling. In this way, the model merges features from different sentences rather than taking one sentence from one bag. Meanwhile, the multi-relation pattern between entity pairs can be considered; hence, for a given entity pair, the model can predict multiple relations simultaneously.

5.2 Representation-based Methods

Unlike the previous simple encoder methods, Lin et al. [60, 61] introduce two novel attention models. The first uses the weights of an attention mechanism to score all instances in one bag and then applies the weighted sum of all sentence vectors to represent the bag. Hence, this model can identify the important instances among noisy sentences as well as utilize all the information in the bag to optimize performance. As a special case of MIL, this model effectively reduces the influence of wrongly labeled instances. The second model considers and leverages a multi-lingual corpus and is based on [61]. This work proposes a mono-lingual attention and a cross-lingual attention mechanism to excavate the diverse information hidden in data of different languages, and the experimental results show that it effectively models relation patterns among different languages. However, building a cross-attention matrix for every language does not seem realistic.

The works of Yang et al. [59] and Du et al. [63] are similar: both use two different kinds of attention mechanisms (two levels) to extract features. The first is used to represent the sentence, similar to [49], and alleviates the distraction of irrelevant words; the second, like [61], leverages an attention mechanism to choose the sentence weights with the highest probability.

In addition to attention mechanisms, there are also ideas for incorporating more information to enhance the model. Banerjee et al. [62] propose a simple co-occurrence-based strategy for calculating the highest confidence in a distant supervision bag, which uses the most frequent samples in the bag as the training label of the bag. This is another way to enhance the representation.

5.3 Knowledge-based Methods

With only the entity pair and its context, the extracted features are not sufficient to represent the sentence, which may result in low accuracy. In recent years, with the development of word vectors and knowledge graphs, additional background information has been introduced into RE, improving the external description of relation vectors [64–66].

In Ji et al.'s [64] work, they continue to choose the classic PCNN model to obtain the sentence feature vectors. But in the data source processing step, this work introduces the conception of the relation vector from the knowledge graph, r = e1 - e2, to represent the relation, and then combines these two kinds of vectors into a new vector by concatenation. With the new vector, this model generates an attention weight vector for the weighted sum of all sentence feature vectors as the bag's features. Meanwhile, to obtain additional background information, descriptions of entities from Freebase and Wikipedia pages are introduced to improve the entity representations. The experimental results show that this method brings more background knowledge to the entities.
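The common de-noising operation behind several of the models above [59, 61, 63, 64] is sentence-level selective attention over a bag; a minimal sketch (PyTorch assumed; dimensions and the random inputs are illustrative):

    # Selective attention over a bag: score each sentence vector against a
    # (learned) relation query and take the weighted sum as the bag vector.
    import torch

    def bag_representation(sent_vecs, rel_query):
        # sent_vecs: (n_sentences, dim) encoded sentences of one bag
        # rel_query: (dim,) query vector for the candidate relation
        scores = sent_vecs @ rel_query            # one score per sentence
        alpha = torch.softmax(scores, dim=0)      # attention weights
        return alpha @ sent_vecs                  # (dim,) de-noised bag vector

    bag_vec = bag_representation(torch.randn(7, 230), torch.randn(230))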
Following the above work [64], Wang et al. [65] also introduce the concept of the knowledge-graph relation vector r = e1 - e2 to represent the relation. But this model differs from the above: it derives the labels from the sentence itself, which means that all the labels of the training data are determined by the sentence and the aligned entity pairs. In the end, the sentences, forming sentence patterns, can be classified into diverse groups by the types of the aligned entities in the knowledge graph. Hence, this method also alleviates the wrong label problem and makes full use of the training corpus produced by distant supervision.

Since not all entities have description information, the above two methods may not apply to all entities, but some side information describing entity types or entity relations can be utilized. In Vashishth et al.'s [66] work, they consider relevant additional side information from the KB, such as entity types and relation aliases, and employ a GCN to encode syntactic information from the text. The entity type information is integrated into an embedding to represent the various entity types, and the relation aliases come from NLP toolkits (Stanford Open IE [69]) or the paraphrase database PPDB [70]. This kind of description can also be seen as knowledge improving the performance of the RE model.

5.4 Plug-and-play based Methods

Both the "enhanced representation" and the "external knowledge" methods mentioned above improve the RE model itself, which is not universal. Hence, directly constructing a plug-and-play component for distant supervision that reduces the noise of the data sets offers a new insight.

Qin et al. [67, 68] offer two approaches to constructing such a plug-and-play component. One is a reinforcement learning framework for distant supervision [67] that addresses the false-positive problem. The authors argue that wrongly labeled sentences must be filtered by a hard decision, not by the soft weights of attention; in this way, the model generates less noisy training data sets that can be used by any previous state-of-the-art model. The second is DSGAN [68], which uses the idea of GANs to obtain a generator that classifies the positive and negative samples in a bag from distant supervision. This is a kind of adversarial learning strategy that can detect the true-positive samples in the noisy distant supervision data sets. To some extent, these two methods alleviate the wrong label problem and form new high-confidence training data sets.
Fig. 9
The structure of the DSGAN model [68].

With these new data sets, some recent state-of-the-art models achieve further improvements in experiments. Hence, both the reinforcement learning framework and DSGAN can be seen as plug-and-play components in distant supervision. The structure of DSGAN is shown in Figure 9.

5.5 Discussion

To summarize, distant supervision adopts various methods to achieve excellent results. Figure 8 and Table 7 illustrate that most of these works focus on the de-noise algorithm. These works document that although distant supervision brings substantial benefits, its drawback, the noise in the data sets, is also obvious. The above three kinds of methods (representation-based, knowledge-based, and plug-and-play based) cope with the noise problem in different aspects, but they are still not optimal. Especially for knowledge-based methods, introducing external knowledge will endow models with the ability of commonsense reasoning, which is missing in the RE task. Meanwhile, fewer models are devoted to feature extraction, which means that the neural network variants have failed to significantly improve performance in pure text feature extraction. Consequently, whether improving the sentence encoder or the de-noise algorithm, these methods only add an extra burden to the basic model instead of solving the insufficient-training-corpus problem (we compare the characteristics of supervised and distant supervision methods in Table 8). Of course, some researchers are considering other solutions, which we discuss later.
All the above RE methods extract one relation for one entity pair. But in fact, one entity pair may have multiple relations in a sentence. To address this issue, Zhang et al. [71] come up with an RE approach based on capsule networks with an attention mechanism, which outputs multiple relations for one entity pair; to some extent, this method considers more detailed features. In addition, joint extraction methods [72–76], known as end-to-end extraction, are also a new insight. Unlike the methods discussed earlier, the joint extraction method integrates NER and RE into one task, outputting the relation and the entity pair together. In general, although the joint extraction method reduces the possibility of error propagation, compared with other methods its accuracy and usability still leave large room for further improvement.
Table 8
The comparison of supervised methods and distant supervision methods.
Item | Features | Supervision | Distant Supervision
Data sets | Annotation mode | Manually annotated | Distant alignment with a KB
Data sets | Accuracy of labeled data | High accuracy | Low accuracy
Data sets | Noise | Low noise | High noise
Data sets | Data size | Small | Large
Applicability | Model portability | Low | High
Applicability | Cross-domain applicability | Low | High
Accuracy of prediction | - | High | Low
Table 9
The comparison of data sets.
Data set | Relations | Data Amount (words/language)
SemEval-2010 Task8 | 18 | 10717
ACE04 | 24 | 350K
NYT+Freebase | 53 | 695059
FewRel | 100 | 70000
A variety of data sets exist for the different methods; common data sets are SemEval 2010-task8 [3], the ACE series 2003-2005, and NYT+Freebase [6]. SemEval 2010-task8 and the ACE series are commonly used for supervised learning classification tasks, while NYT+Freebase is used for distant supervision methods. Table 9 shows a comparison of four representative data sets. The following is a brief introduction to these data sets.
SemEval 2010-task8 was released in 2010 as an improvement on SemEval 2007-task4, and it provides a standard testbed for evaluating various methods of RE. This data set is widely used for evaluation; it contains 9 directional relations and an additional "other" relation, resulting in 19 relation classes in total. Most of the supervised methods discussed in this article use this data set. The relations are as follows:

– Cause-Effect: An event or object leads to an effect. (those cancers were caused by radiation exposures)
– Component-Whole: An object is a component of a larger whole. (my apartment has a large kitchen)
– Content-Container: An object is physically stored in a delineated area of space, the container. (Earth is located in the Milky Way)
– Entity-Destination: An entity is moving towards a destination. (the boy went to bed)
– Entity-Origin: An entity is coming or is derived from an origin, e.g., a position or material. (letters from foreign countries)
– Message-Topic: An act of communication, written or spoken, is about a topic. (the lecture was about semantics)
– Member-Collection: A member forms a nonfunctional part of a collection. (there are many trees in the forest)
– Instrument-Agency: An agent uses an instrument. (phone operator)
– Product-Producer: A producer causes a product to exist. (a factory manufactures suits)
– Other: None of the above nine relations appears to be suitable.

This data set contains 10717 labeled instances, of which 8000 are used for training and the remaining 2717 for testing. The nominals of the sentences in the data set are marked by XML tags. Table 10 shows examples of labeled sentences.

Table 10
Some examples from SemEval 2010-task8.

Example 1: <e1> People </e1> have been moving back into <e2> downtown </e2>. | Relation: Entity-Destination(e1,e2)
Example 2: Cieply's <e1> story </e1> makes a compelling <e2> point </e2> about modern-day studio economics. | Relation: Message-Topic(e1,e2)
ACE 2003-2005 Series come from the LDC (Linguistic Data Consortium) and consist of various types of annotations for entities and relations. Across the three years of the corpus (ACE 2003, 2004, 2005), the ACE tasks are more complex than SemEval 2010-task8. The corpus can be divided into broadcast news, newswire, and telephone conversations, including complete sets of English, Arabic, and Chinese training data, so the annotated entities in this corpus may be pronouns or other irregular words. In addition to the task of RE, the corpora can also be applied to the following five tasks: Entity Detection and Recognition, Entity Mention Detection, EDR Co-reference, Relation Mention Detection, and Relation Detection and Recognition of given reference entities. This data set is typically used by supervised models. Table 11 shows an introduction to the ACE data sets.

Table 11
An introduction to ACE data sets.

Corpus | Data Amount (words/language) | Tasks | Languages
ACE03 | 100K training, 50K evaluation | entities, relations | Chinese, English, Arabic
ACE04 | 300K training, 50K evaluation | entities, relations, events | Chinese, English, Arabic
ACE05 | 750K training, 150K evaluation | entities, relations, events | Chinese, English, Arabic
NYT+Freebase stands for New York Times + Freebase, the most common way to generate data sets for distant supervision. The distant supervision method produces data sets by heuristically aligning text with entities in the KB (as discussed in Section 3.2) and extracting relations from the aligned sentences. In this way, distant supervision creates large-scale training data automatically; the process is shown in Figure 4. The most widely used such data set was generated by Riedel et al. [6]: it has 522611 training sentences and 172448 test sentences, labeled with 53 candidate relations from Freebase plus an extra NA label (nearly 80% of the sentences in the training data are labeled NA).
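To make the heuristic alignment concrete, the following Python sketch (the toy triple set and corpus are hypothetical stand-ins for Freebase and the NYT text; the real pipeline also involves entity linking) labels any sentence mentioning both entities of a KB triple with that triple's relation, and everything else as NA. This is also precisely where the wrong-label noise comes from:

# A toy KB: (head entity, relation, tail entity) triples.
kb = {("Jobs", "founded", "Apple")}

def distant_label(sentences, kb):
    # Heuristic alignment: if a sentence mentions both entities of a
    # triple, assume it expresses that triple's relation; otherwise NA.
    labeled = []
    for sent in sentences:
        label = "NA"
        for head, rel, tail in kb:
            if head in sent and tail in sent:
                label = rel  # noisy: the sentence may not express rel
                break
        labeled.append((sent, label))
    return labeled

corpus = ["Jobs is the founder of Apple.",          # correctly labeled
          "Jobs was seen outside an Apple store."]  # wrong label
print(distant_label(corpus, kb))

The second sentence mentions both entities but does not express the "founded" relation, yet the heuristic labels it anyway; this is the wrong-label problem that de-noise methods try to mitigate.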
Other data sets include SemEval 2017-task10 [77], SemEval 2018-task7 [78], TACRED [79], KBP37 [22], DDIExtraction 2011 [80], DDIExtraction 2013 [81], FewRel [4], and FewRel 2.0 [5].
This article began by laying out the general framework and some basic concepts, focusing on DNN-based methods for supervised and distant supervision RE. As the above elaboration shows, supervised and distant supervision methods each have their own characteristics: supervised methods are better suited to specific domains, while distant supervision methods are better for generic domains. As a result, it is difficult to say which methods are currently the best; hence, we simply compare the characteristics of supervised and distant supervised methods in Table 5.5. In general, despite their long record of success, RE methods still face several problems, as follows:
Transfer learning: a cross-domain adaptive solution. It provides better domain-transfer capability for RE models, especially supervised ones, making a model more extensible to other corpora. World knowledge and the number of relations are not static, which makes the knowledge base dynamic. Nevertheless, most RE models make little use of this dynamic knowledge: they are explored under a predefined relation inventory and suffer from insufficient data sets. How to keep a model continuously receiving new training samples and making full use of other related corpora is worth exploring, though it should be noted that the results in this field are not yet significant [82-84].
Relationship Reasoning: a way to synthesize background knowledge and become aware of new knowledge through logical ability. For RE, given reasoning ability and some existing relational facts, we may use this logic to expand existing knowledge. At the same time, in the open domain, where no relation inventory is given in advance, obtaining relations directly from real-world corpora is very forward-looking work. Hence, using background knowledge to obtain relations by reasoning is an exciting direction. Several attempts [25, 64, 65] have been made to use a knowledge graph or to introduce entity description information into RE, but how to further improve the reasoning mechanism still needs study.
Relationship Framework: the RE task is limited to an established framework. Another problem is the diversity of relation inventories across data sets (an out-of-vocabulary problem): different data sets define different relation inventories, so a model trained on one data set has limited domain adaptability. If the community could construct a framework with a unified description of all kinds of relations and give RE a hierarchical structure, it might achieve better multi-domain adaptation. Predecessors have done a lot of work on defining relation inventories, but no agreement has been reached [3].
Cross-sentence RE: cross-sentence RE is likewise a significant field. The current RE task mainly focuses on entity-pair relations within a single sentence, but in practice many entities in a text express their relations across multiple sentences, and current models may not be able to handle such cases directly. In particular, distant supervision generates large-scale document-level corpora, which can only be handled by cross-sentence RE techniques. Some researchers [85-87] have proposed a series of novel methods to obtain entity relations across sentences; these studies offer important insights into cross-sentence RE.
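As a minimal illustration of the gap, the Python sketch below (the mention list and window size are assumptions made for the example) enumerates candidate entity pairs whose mentions sit in different sentences; a purely intra-sentence model never scores such pairs at all:

from itertools import combinations

def cross_sentence_pairs(mentions, max_sent_dist=2):
    # mentions: list of (entity, sentence_index) pairs for one document.
    # Yield candidate pairs whose mentions lie in different sentences
    # but no more than max_sent_dist sentences apart.
    for (e1, s1), (e2, s2) in combinations(mentions, 2):
        if 0 < abs(s1 - s2) <= max_sent_dist:
            yield e1, e2

mentions = [("Jobs", 0), ("Apple", 0), ("Cupertino", 1)]
print(list(cross_sentence_pairs(mentions)))
# -> [('Jobs', 'Cupertino'), ('Apple', 'Cupertino')]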
Others: beyond these research directions, several problems in the existing methods also need solving. One is error propagation in supervised methods; the other is the wrong-label problem in distant supervision. Especially for the latter, if the wrong-label problem can be well solved, a large amount of effective sample data will become available, which would make an important contribution to this field.
Acknowledgements
This work was supported by the National Natural Science Foundation of China (No. U19A2059) and by the Ministry of Science and Technology of Sichuan Province Program (No. 2021YFG0018 & No. 20ZDYF0343).
We sincerely thank Mr. Kombou Victor, Anto Leoba Jonathan, Rufai Yusuf Zakri and Owusu Wilson Jim for their helpful discussions.
References
1. Kumar S (2017) A survey of deep learning methods for relation extraction. arXiv preprint arXiv:1705.03645
2. Pawar S, Palshikar GK, Bhattacharyya P (2017) Relation extraction: A survey. arXiv preprint arXiv:1712.05191
3. Hendrickx I, Kim SN, Kozareva Z, Nakov P, Ó Séaghdha D, Padó S, Pennacchiotti M, Romano L, Szpakowicz S (2009) Semeval-2010 task 8: Multi-way classification of semantic relations between pairs of nominals. In: Proceedings of the Workshop on Semantic Evaluations: Recent Achievements and Future Directions, Association for Computational Linguistics, pp 94-99
4. Han X, Zhu H, Yu P, Wang Z, Yao Y, Liu Z, Sun M (2018) Fewrel: A large-scale supervised few-shot relation classification dataset with state-of-the-art evaluation. arXiv preprint arXiv:1810.10147
5. Gao T, Han X, Zhu H, Liu Z, Li P, Sun M, Zhou J (2019) Fewrel 2.0: Towards more challenging few-shot relation classification. arXiv preprint arXiv:1910.07124
6. Riedel S, Yao L, McCallum A (2010) Modeling relations and their mentions without labeled text. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Springer, pp 148-163
7. LeCun Y, Boser B, Denker JS, Henderson D, Howard RE, Hubbard W, Jackel LD (1989) Backpropagation applied to handwritten zip code recognition. Neural Computation 1(4):541-551
8. Elman JL (1991) Distributed representations, simple recurrent networks, and grammatical structure. Machine Learning 7(2-3):195-225
9. Socher R, Huval B, Manning CD, Ng AY (2012) Semantic compositionality through recursive matrix-vector spaces. In: Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, Association for Computational Linguistics, pp 1201-1211
10. Scarselli F, Gori M, Tsoi AC, Hagenbuchner M, Monfardini G (2008) The graph neural network model. IEEE Transactions on Neural Networks 20(1):61-80
11. Zhang S, Zheng D, Hu X, Yang M (2015) Bidirectional long short-term memory networks for relation classification. In: Proceedings of the 29th Pacific Asia Conference on Language, Information and Computation, pp 73-78
12. Sundermeyer M, Schlüter R, Ney H (2012) Lstm neural networks for language modeling. In: Thirteenth Annual Conference of the International Speech Communication Association
13. Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Computation 9(8):1735-1780
14. Chung J, Gulcehre C, Cho K, Bengio Y (2014) Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555
15. Mikolov T, Chen K, Corrado G, Dean J (2013) Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781
16. Turian J, Ratinov L, Bengio Y (2010) Word representations: a simple and general method for semi-supervised learning. In: Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, pp 384-394
17. Pennington J, Socher R, Manning C (2014) Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp 1532-1543
18. Zeng D, Liu K, Lai S, Zhou G, Zhao J, et al. (2014) Relation classification via convolutional deep neural network
19. Nguyen TH, Grishman R (2015) Relation extraction: Perspective from convolutional neural networks. In: Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing, pp 39-48
20. Santos CNd, Xiang B, Zhou B (2015) Classifying relations by ranking with convolutional neural networks. arXiv preprint arXiv:1504.06580
21. Wang L, Cao Z, De Melo G, Liu Z (2016) Relation classification via multi-level attention cnns. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp 1298-1307
22. Zhang D, Wang D (2015) Relation classification via recurrent neural network. arXiv preprint arXiv:1508.01006
23. Qin P, Xu W, Guo J (2017) Designing an adaptive attention mechanism for relation classification. In: 2017 International Joint Conference on Neural Networks (IJCNN), IEEE, pp 4356-4362
24. Zhang C, Cui C, Gao S, Nie X, Xu W, Yang L, Xi X, Yin Y (2019) Multi-gram cnn-based self-attention model for relation classification. IEEE Access 7:5343-5357
25. Ren F, Zhou D, Liu Z, Li Y, Zhao R, Liu Y, Liang X (2018) Neural relation classification with text descriptions. In: Proceedings of the 27th International Conference on Computational Linguistics, pp 1167-1177
26. Zhang L, Xiang F (2018) Relation classification via bilstm-cnn. In: International Conference on Data Mining and Big Data, Springer, pp 373-382
27. Mooney RJ, Bunescu RC (2006) Subsequence kernels for relation extraction. In: Advances in Neural Information Processing Systems, pp 171-178
28. Xu Y, Mou L, Li G, Chen Y, Peng H, Jin Z (2015) Classifying relations via long short term memory networks along shortest dependency paths. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp 1785-1794
29. Cai R, Zhang X, Wang H (2016) Bidirectional recurrent convolutional neural network for relation classification. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol 1, pp 756-765
30. Guo X, Zhang H, Yang H, Xu L, Ye Z (2019) A single attention-based combination of cnn and rnn for relation classification. IEEE Access 7:12467-12475
31. Jin L, Song L, Zhang Y, Xu K, Ma Wy, Yu D (2020) Relation extraction exploiting full dependency forests. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol 34, pp 8034-8041
32. Hearst MA (1992) Automatic acquisition of hyponyms from large text corpora. In: Proceedings of the 14th Conference on Computational Linguistics - Volume 2, Association for Computational Linguistics, pp 539-545
33. Berland M, Charniak E (1999) Finding parts in very large corpora. In: Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics
34. Brin S (1998) Extracting patterns and relations from the world wide web. In: International Workshop on the World Wide Web and Databases, Springer, pp 172-183
35. Agichtein E, Gravano L (2000) Snowball: Extracting relations from large plain-text collections. In: Proceedings of the Fifth ACM Conference on Digital Libraries, ACM, pp 85-94
36. Etzioni O, Cafarella M, Downey D, Kok S, Popescu AM, Shaked T, Soderland S, Weld DS, Yates A (2004) Web-scale information extraction in knowitall: (preliminary results). In: Proceedings of the 13th International Conference on World Wide Web, ACM, pp 100-110
37. Yates A, Cafarella M, Banko M, Etzioni O, Broadhead M, Soderland S (2007) Textrunner: open information extraction on the web. In: Proceedings of Human Language Technologies: The Annual Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, Association for Computational Linguistics, pp 25-26
38. Phi VT, Santoso J, Shimbo M, Matsumoto Y (2018) Ranking-based automatic seed selection and noise reduction for weakly supervised relation extraction. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp 89-95
39. Hasegawa T, Sekine S, Grishman R (2004) Discovering relations among named entities from large corpora. In: Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, Association for Computational Linguistics, p 415
40. Rink B, Harabagiu S (2010) Utd: Classifying semantic relations by combining lexical and semantic resources. In: Proceedings of the 5th International Workshop on Semantic Evaluation, Association for Computational Linguistics, pp 256-259
41. Kambhatla N (2004) Combining lexical, syntactic, and semantic features with maximum entropy models for extracting relations. In: Proceedings of the ACL 2004 on Interactive Poster and Demonstration Sessions, Association for Computational Linguistics, p 22
42. Bunescu RC, Mooney RJ (2005) A shortest path dependency kernel for relation extraction. In: Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, Association for Computational Linguistics, pp 724-731
43. Mintz M, Bills S, Snow R, Jurafsky D (2009) Distant supervision for relation extraction without labeled data. In: Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2 - Volume 2, Association for Computational Linguistics, pp 1003-1011
44. Manning C, Surdeanu M, Bauer J, Finkel J, Bethard S, McClosky D (2014) The Stanford CoreNLP natural language processing toolkit. In: Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp 55-60
45. Liu C, Sun W, Chao W, Che W (2013) Convolution neural network for relation extraction. In: International Conference on Advanced Data Mining and Applications, Springer, pp 231-242
46. Kim Y (2014) Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882
47. Kalchbrenner N, Grefenstette E, Blunsom P (2014) A convolutional neural network for modelling sentences. arXiv preprint arXiv:1404.2188
48. Xu K, Feng Y, Huang S, Zhao D (2015) Semantic relation classification via convolutional neural networks with simple negative sampling. arXiv preprint arXiv:1506.07650
49. Zhou P, Shi W, Tian J, Qi Z, Li B, Hao H, Xu B (2016) Attention-based bidirectional long short-term memory networks for relation classification. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), vol 2, pp 207-212
50. Xu Y, Jia R, Mou L, Li G, Chen Y, Lu Y, Jin Z (2016) Improved relation classification by deep recurrent neural networks with data augmentation. arXiv preprint arXiv:1601.03651
51. Xiao M, Liu C (2016) Semantic relation classification via hierarchical recurrent neural network with attention. In: Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pp 1254-1263
52. Lee J, Seo S, Choi YS (2019) Semantic relation classification via bidirectional lstm networks with entity-aware attention using latent entity typing. arXiv preprint arXiv:1901.08163
53. Zheng S, Xu J, Zhou P, Bao H, Qi Z, Xu B (2016) A neural network framework for relation extraction: Learning entity semantic and relation pattern. Knowledge-Based Systems 114:12-23
54. Wang H, Qin K, Lu G, Luo G, Liu G (2020) Direction-sensitive relation extraction using bi-sdp attention model. Knowledge-Based Systems p 105928
55. Zhang Z, Shu X, Yu B, Liu T, Zhao J, Li Q, Guo L (2020) Distilling knowledge from well-informed soft labels for neural relation extraction. In: AAAI, pp 9620-9627
56. Lyu S, Cheng J, Wu X, Cui L, Chen H, Miao C (2020) Auxiliary learning for relation extraction. IEEE Transactions on Emerging Topics in Computational Intelligence
57. Zeng D, Liu K, Chen Y, Zhao J (2015) Distant supervision for relation extraction via piecewise convolutional neural networks. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp 1753-1762
58. Jiang X, Wang Q, Li P, Wang B (2016) Relation extraction with multi-instance multi-label convolutional neural networks. In: Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pp 1471-1480
59. Yang L, Ng TLJ, Mooney C, Dong R (2017) Multi-level attention-based neural networks for distant supervised relation extraction. In: AICS, pp
73. Miwa M, Bansal M (2016) End-to-end relation extraction using lstms on sequences and tree structures. arXiv preprint arXiv:1601.00770
74. Li F, Zhang M, Fu G, Ji D (2017) A neural joint model for entity and relation extraction from biomedical text. BMC Bioinformatics 18(1):198
75. Zheng S, Wang F, Bao H, Hao Y, Zhou P, Xu B (2017) Joint extraction of entities and relations based on a novel tagging scheme. arXiv preprint arXiv:1706.05075
76. Xiao Y, Tan C, Fan Z, Xu Q, Zhu W (2020) Joint entity and relation extraction with a hybrid transformer and reinforcement learning based model. In: AAAI, pp 9314-9321
77. Bethard S, Carpuat M, Cer D, Jurgens D, Nakov P, Zesch T (2016) Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016). In: Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)
78. Gábor K, Buscaldi D, Schumann AK, QasemiZadeh B, Zargayouna H, Charnois T (2018) Semeval-2018 task 7: Semantic relation extraction and classification in scientific papers. In: Proceedings of The 12th International Workshop on Semantic Evaluation, pp 679-688
79. Zhang Y, Zhong V, Chen D, Angeli G, Manning CD (2017) Position-aware attention and supervised data improve slot filling. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP 2017), pp 35-45, URL https://nlp.stanford.edu/pubs/zhang2017tacred.pdf
80. Segura Bedmar I, Martinez P, Sánchez Cisneros D (2011) The 1st ddiextraction-2011 challenge task: Extraction of drug-drug interactions from biomedical texts
81. Segura Bedmar I, Martínez P, Herrero Zazo M (2013) Semeval-2013 task 9: Extraction of drug-drug interactions from biomedical texts (ddiextraction 2013). Association for Computational Linguistics
82. Di S, Shen Y, Chen L (2019) Relation extraction via domain-aware transfer learning. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp 1348-1357
83. Sun C, Wu Y (2019) Distantly supervised entity relation extraction with adapted manual annotations. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol 33, pp 7039-7046
84. Zhang N, Deng S, Sun Z, Chen J, Zhang W, Chen H (2019) Transfer learning for relation extraction via relation-gated adversarial learning. arXiv preprint arXiv:1908.08507
85. Sahu SK, Christopoulou F, Miwa M, Ananiadou S (2019) Inter-sentence relation extraction with document-level graph convolutional neural network. arXiv preprint arXiv:1906.04684
86. Guo Z, Zhang Y, Lu W (2019) Attention guided graph convolutional networks for relation extraction. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp 241-251, DOI 10.18653/v1/p19-1024
87. Zhang Y, Qi P, Manning CD (2018) Graph convolution over pruned dependency trees improves relation extraction. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp 2205-2215, DOI 10.18653/v1/d18-1244