Clustering-based Unsupervised Generative Relation Extraction
Chenhan Yuan
Virginia Tech, Blacksburg, [email protected]
Ryan Rossi
Adobe Research, San Jose, [email protected]
Andrew Katz
Virginia Tech, Blacksburg, [email protected]
Hoda Eldardiry
Virginia Tech, Blacksburg, [email protected]
ABSTRACT
This paper focuses on the problem of unsupervised relation extraction. Existing probabilistic generative model-based relation extraction methods work by extracting sentence features and using these features as inputs to train a generative model. This model is then used to cluster similar relations. However, these methods do not consider correlations between sentences with the same entity pair during training, which can negatively impact model performance. To address this issue, we propose a Clustering-based Unsupervised generative Relation Extraction (CURE) framework that leverages an "Encoder-Decoder" architecture to perform self-supervised learning so the encoder can extract relation information. Given multiple sentences with the same entity pair as inputs, self-supervised learning is deployed by predicting the shortest path between the entity pair on the dependency graph of one of the sentences. After that, we extract the relation information using the well-trained encoder. Then, entity pairs that share the same relation are clustered based on their corresponding relation information. Each cluster is labeled with a few words based on the words in the shortest paths corresponding to the entity pairs in that cluster; these labels also describe the meaning of the relation clusters. We compare the triplets extracted by our proposed framework (CURE) and baseline methods against a ground-truth Knowledge Base. Experimental results show that our model outperforms state-of-the-art models on both the New York Times (NYT) and United Nations Parallel Corpus (UNPC) standard datasets.
KEYWORDS
Relation extraction, unsupervised learning, generative model
1 INTRODUCTION

Since it was proposed in 2012 by Google, the Knowledge Graph (KG) has been deployed in many important AI tasks, such as search engines, recommender systems, and question answering [49, 52, 58]. There are many existing knowledge graph systems in both academia and industry, such as Wikidata [48], YAGO [45], and DBpedia [2]. Conventionally, constructing a knowledge graph such as those listed above from text is based on triplets. These triplets can be expressed as (subject, relation, object), which is similar to the RDF format [19], and they are extracted from raw text. Based on the extracted triplets, an integration process is implemented to merge repeated triplets and construct the knowledge graph [25].

For triplet extraction, some work uses Information Extraction (IE) methods to extract this information while other work deploys crowdsourcing approaches with help from volunteers [48]. As a vital process in knowledge graph construction, IE, also called Relation Extraction (RE) because most IE methods focus on how to extract a relation given a subject and an object, was initially explored in rule-based and supervised settings. In rule-based relation extraction, researchers have analyzed the syntactic structure of example text and proposed graph-search algorithms to automatically collect different linguistic patterns [23, 26, 44]. In supervised learning, the likelihood of a relation given entity pairs and their corresponding sentences is maximized to train the model [29, 43].

However, rule-based relation extraction does not accurately identify relations between entities in complex sentences, because most useful rules are manually labeled and relatively simple. Similarly, supervised relation extraction methods require prior knowledge about the text, such as marking the correct triplets in each sentence. This limits the use of supervised relation extraction, since most texts lack such supporting prior knowledge.
Lately, however, unsupervised and distant-supervised learning approaches have been introduced for the relation extraction problem [1, 14, 18, 34, 53]. These approaches address the lack of labeled training text data. In distant-supervised methods, most papers have used a small number of seed example triplets to annotate text and expand the training set. These researchers assumed that if the same entity pair appears in different sentences, then these sentences might describe the same relation. That is, such sentences are marked with the same relation as in the seed example [1, 14, 16, 34]. As for unsupervised learning approaches, some work used clustering techniques over selected features to find similar concept pairs and relations. Different groups were then assigned different labels, obtained either by manual labeling or by selecting common words [13, 18, 53, 54].

Nevertheless, using seed examples to expand the training dataset causes error propagation problems [27]. Unlike distant-supervised learning-based approaches, unsupervised relation extraction models do not consider the correlation between sentences with the same entity pair, which can negatively impact model performance. Meanwhile, predefined feature selections, such as trigger words [54] and keywords [36], may introduce biases and influence the final results of the models [42].

Table 1: Comparison between unsupervised relation extraction methods. Generative (G): probabilistic generative models. Feature-cluster (F): models that extract features from sentences, then cluster the features to find similar entity pairs. VI, EM and HAC denote Variational Inference, Expectation Maximization and Hierarchical Agglomerative Clustering, respectively. The Cluster Label column shows the relation-word selection method. The Data Correlation column shows whether the model considers the correlation among input sentences.

Model | Type | Input | Cluster | Cluster Label | Data Correlation
VAE [31] | generative (G) | pre-defined features | VI | trigger words | individual
Rel-LDA [54] | generative (G) | pre-defined features | EM | trigger words | individual
Open-RE [13] | feature-cluster (F) | pre-defined features | HAC | common words | individual
Hasegawa et al. [18] | feature-cluster (F) | pre-defined features | HAC | common words | individual
CURE (our model) | F&G | trained feature extractor | HAC | word vector similarity | joint

To alleviate the issues discussed above, we propose a novel self-supervised approach to train a generative model that can extract relation information accurately. Our model does not require labeling new data or pre-defining sentence features. Concretely, according to the dependency graph of a sentence, we first extract the shortest path between the entity pair in this graph. After that, we train an encoder and a decoder simultaneously, where the encoder extracts relation information from the shortest paths of the sentences with the same entity pair, and the decoder generates the shortest path of one of the sentences according to the extracted relation information. After training this model, a well-trained encoder, also referred to as the relation extractor, is obtained to extract relation information. Subsequently, a cluster-based method is used to cluster entity pairs based on their relation information. Finally, we label each cluster automatically by analyzing attributes of the words that appear in the shortest paths, such that the label of each cluster is exactly the relation words. These attributes include word frequency and word vector distance.
Summary of Contributions: The key contributions of this work are as follows:
• We propose a Clustering-based Unsupervised generative Relation Extraction (CURE) framework to extract relations from raw text. Our proposed framework includes novel mechanisms for (1) relation extractor training and (2) triplets clustering. Both proposed approaches outperform three state-of-the-art baseline approaches on a relation extraction task using two datasets.
• We propose a novel method for automatically training a relation information extractor based on shortest path prediction. Our method does not require labeling text or pre-specifying sentence features.
• Our proposed relation cluster labeling approach selects relation words based on word frequency and word vector distance. This enables a more accurate description of the relation than existing approaches that only select the most common words [13, 18].
• We compare our model to state-of-the-art baselines on two datasets: the standard NYT dataset and the United Nations Parallel Corpus (UNPC). The results show that our model outperforms baseline models by more correctly extracting relations across different topics and genres.
2 RELATED WORK

Information Extraction (IE) is an important step in KG construction. The goal of IE models is to extract triplets from text, where each triplet consists of two entities and the relation between them. For example, given a labeled dataset, Kambhatla et al. trained a maximum entropy classifier with a set of features generated from the data. In later work [17], more features were explored to train an SVM relation classifier, such as base phrase chunking and semantic resources. Besides these features, Nguyen et al. proposed to extract keywords of sentences first; a core tree was then built based on these keywords and combined with the dependency graph to train the classifier [36]. Chan and Roth [9] found that some relation types have similar syntactic structures that can be extracted by manually created rules. Jiang and Zhai analyzed the impact of selecting different feature sub-spaces, namely the dependency parse tree and the syntactic parse tree, on relation classifier performance [24]. Recently, with the rapid growth of deep learning, some work [30] has modeled the dependency shortest paths of entities using neural networks to predict the relation type of entity pairs. In a similar vein, other work achieved this with a Convolutional Neural Network (CNN) [56]. However, labeled text only has pre-defined relation types, which is a deficiency in the open-domain relation extraction task. Moreover, most texts are not labeled, which limits the use of supervised relation extraction.

To overcome the lack of human-labeled text in the open domain, researchers have also designed models to label data automatically based on seed examples, which is referred to as distant supervised learning. Wu and Weld took advantage of the info-box in Wikipedia to label training data automatically. They trained a pattern classifier to learn the linguistic features of labeled sentences [51].
In other distant supervised models [33, 35], sentences containing entity pairs that appear in Freebase were labeled with the same relations as in Freebase. Similarly, Craven and Kumlien labeled new text data based on existing knowledge and referred to the labeled data as "weakly training data" [11]. When labeling data using the same method, Bunescu and Mooney proposed that the model should be penalized more if it wrongly assigns positive-sample entity pairs rather than negative samples [7]. Some works also considered information from webpages, such as Wikipedia, as ground truth when labeling training data [28, 37]. In previous work [1, 22], the authors assumed that different sentences that include the same entity pair may share the same relation on that entity pair. Based on this assumption, they labeled training data given seed sample data. Romano et al. applied an unsupervised paraphrasing detector to expand existing relations [40]. However, these labeling methods may introduce noise into the training data. Takamatsu et al. presented a generative model that directly models the heuristic labeling process of distant supervision, such that the prediction of label assignments can be achieved by the hidden variables of the generative model [46]. Riedel et al. proposed to use a factor graph to decide whether the relation learned from distant-supervised learning is mentioned in the training sentence [39].

Similarly, some works have first labeled text data with heuristics, an approach referred to as self-supervised learning. For example, TextRunner used automatically labeled data to train a Naive Bayes classifier that can tell whether a parsed sentence is trustworthy. In this model, sentences were parsed with pre-defined linguistic constraints to extract candidate relations [55]. Following this approach, Fader et al.
proposed to add syntactic constraints and lexical constraints to enable the model to extract relations from more complex sentences [16]. Banko and Etzioni proposed to use Conditional Random Fields as the relation classifier instead of Naive Bayes [3]. The idea of using Wikipedia's info-box was also applied in an improvement of TextRunner [50].

Unsupervised relation extraction clusters entity pairs with the same relations and labels the clusters automatically or manually. Hasegawa et al. first proposed the concept of the context of entity pairs, which can be viewed as features extracted from sentences. After that, they clustered different relations based on feature similarity and selected common words in the context of all entity pairs to describe each relation [18]. Following this work, an extra unsupervised feature selection process was proposed to reduce the impact of noisy words in context [10]. Yan et al. proposed a two-step clustering algorithm to classify relations, which included a linguistic-pattern-based cluster and a surface-context cluster. The linguistic patterns here are pre-defined rules derived from the dependency tree [53]. Poon and Domingos also used dependency trees to cluster relations; the dependency trees are first transformed into quasi-logical forms, where lambda forms can be induced recursively [38]. Rosenfeld and Feldman, on the other hand, considered arguments and keywords to be relation patterns that can be learned from instances [41]. Their approach improved on the KnowItAll system, a fact extraction system focused more on entity extraction [15].

Some works have also treated unsupervised relation extraction as a probabilistic generation task. Latent Dirichlet Allocation (LDA) was applied to unsupervised relation extraction [4, 54]. Researchers replaced the topic distributions with triplet distributions and implemented the Expectation Maximization algorithm to cluster similar relations.
de Lacalle and Lapata applied this method to general domain knowledge, first encoding a Knowledge Base using First Order Logic rules and then combining it with LDA [12]. Marcheggiani et al. argued that previous generative models make too many independence assumptions about the extracted features, which may affect model performance. As a variant of an autoencoder [6],
Table 2: Summary of notation

Symbol | Description
(e_i, r_k, e_j) | a triplet, where e_i and e_j are two different entities and r_k is the relation
W, D, P | word, dependency tag and POS tag sequences on one semantic shortest path
Pa_i | the i-th semantic shortest path of each entity pair
C, R, r_i | cluster centroid set, candidate relation words set, the vector representation of the i-th word in R
n_h, n_h' | the hidden state sizes of the LSTM and inverse LSTM
n_w, n_l | the number of non-repeating words, the max length of all semantic shortest paths
W, U | bold capitals are weighting matrices
σ, tanh | the sigmoid and tanh activation functions
⊕, ⊙, ⊗ | concatenation, Hadamard product, matrix product
EI, ei | EI ∈ R^{(n_h+n_h')n_l} is the encoding information vector; ei ∈ R^{(n_h+n_h')n_l} is the encoding information for one semantic shortest path
h''_i, h_i | h''_i ∈ R^{(n_h+n_h')} is the output vector of the i-th Bi-LSTM; h_i ∈ R^{n_w} is the output vector of the i-th GRU with attention mechanism

they introduced a variational autoencoder (VAE) for relation extraction [31]. They first implemented two individual parts to predict the semantic relation given entity pairs and to reconstruct entities based on the prediction, respectively. Then they jointly trained the model to minimize the error in entity recovery. In unsupervised open-domain relation extraction [13], the authors used the corresponding sentences of entity pairs as features and then vectorized the features to evaluate the similarity of relations. These features include re-weighted word embedding vectors and entity types. A summary of the key differences among state-of-the-art unsupervised relation extraction models is shown in Table 1.

However, to the best of our knowledge, the correlation between sentences with the same entity pair has not been explicitly used to create a probabilistic generative relation extraction model.
Multiple sentences with the same entity pair often occur in large-scale corpora, which can be used to let the relation extraction model learn how to extract features from sentences and convert them into relation information. Therefore, we propose to train a relation extractor by giving multiple sentences with the same entity pair as inputs. The extractor is then expected to output correct relation information, which can be used to predict the shortest path between the entity pair on the dependency graph of one of the sentences.

In this work we focus on the problem of relation extraction (RE), which is a specific subproblem of the broader information extraction (IE) problem. Specifically, we tackle this subproblem using an unsupervised relation extraction approach.

We begin by formulating the problem as an information extraction task as follows. Given text T and external information I, such as labeled text and an info box, the IE model extracts triplets (e_i, r_k, e_j) from T, where e_i and e_j are two different entities and r_k is the relation between these two entities.

Figure 1: The architecture of the relation extractor training stage of CURE

As stated previously, we focus only on relation extraction and not on other information extraction methods such as Named Entity Recognition (NER). Note that unsupervised RE cannot obtain I. In unsupervised RE, the model learns and labels the clusters of different relations based on T. The problem of unsupervised RE can be defined as follows. Given T, the model should learn clusters of entity pairs based on their relation similarities. Then, given (e_i, e_j), the model selects the closest centroid from the cluster centroid set C and uses the label of that centroid as r_k. The notation used in this paper can be found in Table 2.

The proposed Clustering-based Unsupervised Generative Relation Extraction (CURE) model includes two stages. The first is the relation extractor training stage.
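The cluster-assignment step defined above (choose the closest centroid in C and reuse its label as r_k) can be sketched as follows. The vectors, cluster count, and the SciPy-based HAC call are illustrative assumptions; the paper only specifies HAC with Euclidean distance.

```python
# Illustrative sketch: HAC over toy "relation information" vectors,
# then nearest-centroid assignment for a new entity pair.
# All sizes and values here are assumptions, not the paper's settings.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
vecs = np.vstack([rng.normal(0.0, 0.1, (3, 4)),   # pairs sharing relation A
                  rng.normal(5.0, 0.1, (3, 4))])  # pairs sharing relation B

Z = linkage(vecs, method="average", metric="euclidean")
labels = fcluster(Z, t=2, criterion="maxclust")   # cluster ids in {1, 2}

# centroid set C: the mean vector of each cluster
centroids = {k: vecs[labels == k].mean(axis=0) for k in set(labels)}

def assign(vec):
    """Return the id of the closest centroid (Euclidean distance)."""
    return min(centroids, key=lambda k: np.linalg.norm(centroids[k] - vec))

print(assign(np.full(4, 5.0)))
```

A new entity pair is thus labeled by whichever existing relation cluster its encoding falls nearest to.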
We train a relation extraction model, which takes text and (e_i, e_j) as input and outputs vectorized relation representations. The second is the triplets clustering stage. In this stage, the relation extractor model is used to extract relation representations, and then the relations are clustered. After labeling each cluster centroid, for a given (e_i, e_j), the model selects the closest centroid from the cluster centroid set C and uses the label of that centroid as r_k.

We begin by introducing the Encoder-Decoder model that is used to train the relation extractor. This proposed model captures the relation information given (e_i, e_j) and text. The model architecture is shown in Figure 1. This training model first encodes the semantic shortest paths of one entity pair in various sentences. The encoding information generated by the encoder reflects the relation information of the input (e_i, e_j). The decoder uses the summation of this information to generate the predicted semantic shortest path of that entity pair. More formally, our model optimizes the decoder (D) and encoder (E), s.t.

argmax_{D_θ, E_γ} P(Pa_u | Pa_1, Pa_2, ..., Pa_{u−1})    (1)

where Pa_i is the i-th semantic shortest path of (e_i, e_j). The formal definition of the semantic shortest path is given in Section 3.3. Here, we briefly explain why the task of this stage is to predict Pa_u given the other semantic shortest paths. Note that it is necessary to build a well-trained encoder that can extract relation information from given semantic shortest paths. However, the training data does not provide the correct relation for each entity pair, so it is not possible to train the encoder in a supervised way. Similar to self-supervised learning techniques, the key idea is to find a "correct expected result" that the model can fit without labeling the data.
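Under the objective in Equation 1, training examples can be built without labels by holding out one semantic shortest path per entity pair as the target. A minimal sketch (the dictionary layout and helper name are assumptions):

```python
# Sketch: build (inputs, target) pairs for self-supervised training by
# holding out each semantic shortest path of an entity pair in turn.
def make_examples(paths_by_pair):
    """paths_by_pair maps (e_i, e_j) -> list of >= 2 semantic shortest paths."""
    examples = []
    for pair, paths in paths_by_pair.items():
        for u in range(len(paths)):
            inputs = paths[:u] + paths[u + 1:]   # the held-in paths
            target = paths[u]                    # Pa_u, the "correct result"
            examples.append((pair, inputs, target))
    return examples

demo = {("Reagan", "States"): [["served", "as", "president"],
                               ["was", "president", "of"]]}
for pair, inputs, target in make_examples(demo):
    print(pair, inputs, target)
```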
In our relation extraction scenario, since all the semantic shortest paths of one entity pair likely share similar relation information, we treat one of them as the "correct expected result", and the remaining semantic shortest paths are provided as input to the encoder-decoder training model. This "correct expected result" is then generated as the output of that model. This proposed semantic shortest path prediction approach provides a mechanism to train the encoder-decoder model while making sure the model can converge. A well-trained model indicates that the individual parts, D and E, are also well-trained, which satisfies our expectation for the relation extractor training stage.

In the triplets clustering stage of CURE, the well-trained encoder is used as the relation extractor. The procedure for using the relation extractor model is shown in Figure 2. This procedure first generates encoding information for the input entity pairs (e_i, e_j) using the pre-trained relation extractor. Then entity pairs are clustered based on their corresponding encoding information. After labeling each cluster centroid, each entity pair (e_i, e_j) is assigned a relation r_k, which is the cluster label. The details are discussed in Section 3.7.

Given the dependency tree of a sentence, the semantic shortest path (SSP) of two entities is defined as the shortest path from one entity (node) to the other entity (node) in the dependency tree. Bunescu et al. noted that the semantic shortest path can capture the relation information of entity pairs [8]. Table 3 shows an example in which, given an entity pair and a sentence, the semantic shortest path is the path from the start entity "Ronald Reagan" to the end entity "the United States".
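The SSP for the Table 3 example can be recovered with an off-the-shelf shortest-path search over the undirected dependency graph. The edge list below is hand-written for illustration rather than taken from an actual parser:

```python
# Illustrative sketch: the semantic shortest path is the shortest path
# between the two entity head words in the undirected dependency graph.
import networkx as nx

edges = [("served", "Reagan"), ("served", "as"), ("as", "president"),
         ("president", "of"), ("of", "States")]
G = nx.Graph(edges)
path = nx.shortest_path(G, source="Reagan", target="States")
print(path)  # ['Reagan', 'served', 'as', 'president', 'of', 'States']
```

The resulting word sequence matches the Word Path row of Table 3; in practice the D and P sequences are read off the same nodes.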
Since the words on this path alone may not be sufficient to capture the relation information, we save the dependency tags D, Part-Of-Speech (POS) tags P and words W to represent this path. However, since some entities are compound words, which can be split into different nodes by the dependency parser, we choose the word that has a "subjective", "objective" or "modifier" dependency relation as the representative. For example, we use "Reagan" as the start entity to find the path because the dependency tag of "Reagan" is "nsubj", while the dependency tag of "Ronald" is "compound".

For each semantic shortest path of a given entity pair (e_i, e_j), the D, P and W sequences are embedded into vectors with different dimensions. Since words have more variation than POS tags and dependency tags, we give more embedding dimensions to W. After the embedding process, the vector representations of W, P and D are concatenated in order.

We use a Long Short-Term Memory (LSTM) neural network [21] as the basic unit of the encoder model. The formal description of the LSTM is shown in Equation 2:

x_i = w_i ⊕ d_i ⊕ p_i
o_i = σ(W_o h_{i−1} + U_o x_i + b_o)
h_i = o_i ⊙ tanh(c_i)
c_i = f_i ⊙ c_{i−1} + i_i ⊙ ĉ_i
ĉ_i = tanh(W_c h_{i−1} + U_c x_i + b_c)
f_i = σ(W_f h_{i−1} + U_f x_i + b_f)
i_i = σ(W_i h_{i−1} + U_i x_i + b_i)    (2)

where σ is the element-wise sigmoid function and ⊙ is the element-wise product. w_i, d_i and p_i are the embedding vectors of the i-th element in the W, D and P sequences. x_i is the concatenation of w_i, d_i and p_i. h_i is the hidden state and i denotes the i-th node on the shortest path. The remaining variables are gate parameters that will be learned.

The original LSTM model only considers information from previous states. However, context should be considered in text data. Therefore, we use a Bi-directional LSTM (Bi-LSTM) [57] to encode this sequential data.
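One step of Equation 2 can be sketched in NumPy; the sizes and random weights below are toy assumptions for illustration, not trained parameters.

```python
# Minimal NumPy sketch of one LSTM step from Equation 2.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_h, n_x = 4, 6   # hidden size; size of x_i = w_i (+) d_i (+) p_i
W = {g: rng.normal(size=(n_h, n_h)) for g in "oifc"}  # applied to h_{i-1}
U = {g: rng.normal(size=(n_h, n_x)) for g in "oifc"}  # applied to x_i
b = {g: np.zeros(n_h) for g in "oifc"}

def lstm_step(x, h_prev, c_prev):
    f = sigmoid(W["f"] @ h_prev + U["f"] @ x + b["f"])      # forget gate
    i = sigmoid(W["i"] @ h_prev + U["i"] @ x + b["i"])      # input gate
    c_hat = np.tanh(W["c"] @ h_prev + U["c"] @ x + b["c"])  # candidate cell
    c = f * c_prev + i * c_hat
    o = sigmoid(W["o"] @ h_prev + U["o"] @ x + b["o"])      # output gate
    h = o * np.tanh(c)
    return h, c

h, c = lstm_step(rng.normal(size=n_x), np.zeros(n_h), np.zeros(n_h))
print(h.shape, c.shape)
```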
The Bi-LSTM model considers information from both directions of the text and then concatenates the outputs of the LSTMs in each direction. The output of the Bi-LSTM model is shown in Equation 3:

h''_i = lstm(x_i, h_{i−1}) ⊕ lstm'(x_i, h'_{i−1}) = (o_i ⊙ tanh(c_i)) ⊕ (o'_i ⊙ tanh(c'_i))    (3)

where lstm and lstm' are the LSTM and inverse LSTM functions described in Equation 2. o'_i, c'_i and h'_{i−1} denote the parameters of the inverse LSTM.

After all nodes on the shortest path are encoded, the encoder concatenates the hidden states in order. The encoding information is the summation of the encoding results of all shortest paths. The formal description is given in Equation 4:

ei = h''_1 ⊕ h''_2 ⊕ ... ⊕ h''_n
EI = Σ_{j=1}^{u−1} ei_j    (4)

where n is the length of each shortest path and ei_j is the encoding result of the j-th shortest path. EI is the encoding information of one entity pair.

In the decoder part, the words on the semantic shortest path must be generated correctly. If the model can generate the correct word sequence (W), this means the model has also correctly learned the complex syntax information. Therefore, we do not require the model to generate P and D in the decoder part.

We use a Gated Recurrent Unit (GRU) neural network [20] as the basic unit of our proposed decoder. The GRU architecture has characteristics similar to the LSTM, with the additional benefit of having fewer parameters. The mathematical definition of the GRU unit is shown in Equation 5:

z_t = σ(W_z x_t + U_z h_{t−1} + b_z)
r_t = σ(W_r x_t + U_r h_{t−1} + b_r)
h_t = z_t ⊙ h_{t−1} + (1 − z_t) ⊙ tanh(W_h x_t + U_h (r_t ⊙ h_{t−1}) + b_h)    (5)

where x_t is the input at time t and h_t is the hidden state that will be used in the next state. σ is an activation function and tanh is the hyperbolic tangent.

In order to allow the decoder to fully integrate the encoding information when generating W, we introduce an attention mechanism into the decoder.
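Analogously to the encoder sketch, one step of the GRU in Equation 5 can be written in NumPy; toy sizes and random weights are assumptions.

```python
# Minimal NumPy sketch of one GRU step from Equation 5.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
n_h, n_x = 4, 6
W = {g: rng.normal(size=(n_h, n_x)) for g in "zrh"}
U = {g: rng.normal(size=(n_h, n_h)) for g in "zrh"}
b = {g: np.zeros(n_h) for g in "zrh"}

def gru_step(x, h_prev):
    z = sigmoid(W["z"] @ x + U["z"] @ h_prev + b["z"])  # update gate
    r = sigmoid(W["r"] @ x + U["r"] @ h_prev + b["r"])  # reset gate
    h_tilde = np.tanh(W["h"] @ x + U["h"] @ (r * h_prev) + b["h"])
    return z * h_prev + (1.0 - z) * h_tilde  # as written in Equation 5

h1 = gru_step(rng.normal(size=n_x), np.zeros(n_h))
print(h1.shape)
```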
Attention mechanisms let the model attend only to the information related to the current generation task [47]. This enables the model to use the input information, in this case the encoding information, more efficiently. In general, as shown in Equation 6, the attention mechanism is achieved by using attention weights to incorporate the encoding information:

h_i = gru(h_{i−1}, q_{i−1})
q_{i−1} = attn_β((attn_α(h_{i−1}) ⊗ EI) ⊕ q_{i−2}) = W_β(((W_α ⊗ h_{i−1} + b_α) ⊗ EI) ⊕ q_{i−2})    (6)

where h_i is the output of the i-th GRU unit, which is the predicted probability distribution of the word at that position. q_{i−1} is the input of the GRU: the weighted combination of the previous state and the encoding information. gru is the GRU function described in Equation 5. attn_β and attn_α are two different attention matrices that will be learned.

Table 3: An example of path search

Original sentence | Ronald Reagan served as the 40th president of the United States.
Entity Pair | (Ronald Reagan, the United States)
Dep Path | ['nsubj', 'ROOT', 'prep', 'pobj', 'prep', 'pobj']
POS Path | ['PROPN', 'VERB', 'ADP', 'NOUN', 'ADP', 'PROPN']
Word Path | ['Reagan', 'served', 'as', 'president', 'of', 'States']

Figure 2: The triplets clustering stage of CURE

As discussed in the decoder section, each GRU unit outputs a vector that represents the probability distribution of the word at a given position, where the index of each element of the vector corresponds to the index of a candidate word. We design the loss function as the average cross-entropy between each predicted word and the correct word. The formal definition of the loss function is given in Equation 7:
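Equation 6 leaves the exact shapes of the attention products open; the sketch below adopts one shape-consistent reading in which attn_α produces a gate that scales EI elementwise and attn_β mixes the gated encoding with the previous decoder input. The sizes and the elementwise interpretation are assumptions, not the paper's specification.

```python
# Hedged sketch of one reading of Equation 6 (toy sizes, random weights).
import numpy as np

rng = np.random.default_rng(2)
n, m = 4, 5  # decoder hidden size; size of the encoding information EI
W_alpha = rng.normal(size=(m, n))
b_alpha = np.zeros(m)
W_beta = rng.normal(size=(m, 2 * m))

def attn_input(h_prev, q_prev, EI):
    gate = W_alpha @ h_prev + b_alpha      # attn_alpha(h_{i-1})
    attended = gate * EI                   # focus on relevant encoding info
    return W_beta @ np.concatenate([attended, q_prev])  # attn_beta(... (+) q)

q = attn_input(rng.normal(size=n), np.zeros(m), rng.normal(size=m))
print(q.shape)
```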
J(D_θ, E_γ) = (1/(mn)) Σ_{l=1}^{m} Σ_{i=1}^{n} −log( exp(h^{(l)}_{i,k}) / Σ_{j: j≠k} exp(h^{(l)}_{i,j}) )    (7)

where m is the batch size and n is the length of each semantic shortest path. h is the output tensor of the decoder; therefore, h^{(l)}_{i,k} denotes the value of the k-th element of the i-th vector belonging to the l-th semantic shortest path.

When training of the encoder-decoder model is complete, a well-trained relation extractor is obtained, which can extract relation information given semantic shortest paths. The relation extractor represents a relation r_k as a vector. Therefore, following the method introduced in Figure 2, we use Hierarchical Agglomerative Clustering (HAC) to cluster similar vectors together using Euclidean distance. The result of the HAC clustering is the same as the clustering result of the entity pairs that share similar relations. After obtaining these clusters, we extract the W corresponding to the entity pairs in each cluster, yielding a candidate relation word set R. Based on the set R, the relation word of each cluster (i.e., the cluster label) can be selected using Equation 8:

r̂_k = argmax_w ( wordvec(w) · v ) / ( ||wordvec(w)|| · ||v|| )
where v = Σ_{r_i ∈ R} Norm( Σ_{r_j ∈ R, j≠i} (1 − (r_i · r_j)/(||r_i|| · ||r_j||)) · Count(r_i) ) · r_i    (8)

where w is the selected relation word, r_i is the vector representation of the i-th word in R, and Count(r_i) is the number of occurrences of the i-th word in R. Norm(·) is the min-max normalization function. Our key idea is to first project the words into a high-dimensional space using a pre-trained Word2Vec model [32].
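The averaged cross-entropy in Equation 7 can be sketched as below. For numerical sanity the sketch uses the standard softmax normalizer over all candidate words, whereas the printed formula excludes j = k from the sum; the toy logits are assumptions.

```python
# Sketch of the Equation 7 loss: average cross-entropy of each predicted
# word on each semantic shortest path (standard softmax normalizer).
import numpy as np

def path_loss(H, targets):
    """H: (m, n, V) decoder logits for m paths of length n over V candidate
    words; targets: (m, n) indices of the correct words."""
    m, n, V = H.shape
    total = 0.0
    for l in range(m):
        for i in range(n):
            k = targets[l, i]
            p = np.exp(H[l, i]) / np.exp(H[l, i]).sum()
            total += -np.log(p[k])
    return total / (m * n)

H = np.zeros((1, 2, 3))
H[0, :, 0] = 2.0  # toy logits that favour candidate word 0
uniform = path_loss(np.zeros((1, 2, 3)), np.zeros((1, 2), dtype=int))
confident = path_loss(H, np.zeros((1, 2), dtype=int))
print(confident < uniform)  # sharper correct logits give a lower loss
```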
The vector summation of these words then gives the vector of the relation word. A direct summation of the word vectors would lose a lot of important information. Rather, the more occurrences a word has in R, the greater its weight should be in the summation. For example, suppose "locate" appears ten times and "citizen" appears once in R; this indicates that the cluster is more likely to describe "is located in" than "is citizen of", so the model should reduce the impact of "citizen". On the other hand, words with many occurrences in R may also be common words or stop words. Therefore, we add another factor, which measures the cosine similarity between the current word vector and the other word vectors in R. If the sum of the cosine similarities is high, the word is similar to many other words, so we lower the value of this factor. Here we make the assumption that words that are less similar to other words may be more meaningful. This assumption is based on our observation that many stop words, such as "to" and "from", are close in the vector space.

4 EXPERIMENTS

We perform extensive experiments on CURE and baseline methods to answer the following questions: (Q1) Does CURE cluster entity pairs with the same relation correctly on different datasets, and how does it compare to state-of-the-art methods? (See Sections 4.3-4.5.) (Q2) Does our proposed relation word selection method describe the relation better than traditional methods? (See Section 4.6.)

4.1 Baseline Models

We compare CURE to three state-of-the-art unsupervised relation extraction models. See Table 1 for a summary of these methods and the key differences from our proposed CURE approach.
(1)
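The weighting scheme above (frequency up-weights a word, high average similarity to other candidates down-weights it) can be sketched with toy embeddings. The vectors, counts, and exact normalization below are illustrative assumptions rather than trained Word2Vec values.

```python
# Sketch of the cluster-labeling heuristic (Equation 8): weight each word
# by count times the sum of (1 - cosine similarity) to the other words,
# min-max normalize the weights, sum the weighted vectors, and label the
# cluster with the candidate word closest to that sum.
import numpy as np

def cos(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def label_cluster(vec, counts):
    words = list(vec)
    raw = [counts[w] * sum(1 - cos(vec[w], vec[v]) for v in words if v != w)
           for w in words]
    lo, hi = min(raw), max(raw)
    weights = [(x - lo) / (hi - lo + 1e-12) for x in raw]
    v = sum(wt * vec[w] for wt, w in zip(weights, words))
    return max(words, key=lambda w: cos(vec[w], v))

# toy embeddings: "to"/"from" are stop-word-like and mutually similar
vec = {"locate": np.array([1.0, 0.0]),
       "to":     np.array([0.0, 1.0]),
       "from":   np.array([0.1, 1.0])}
counts = {"locate": 10, "to": 8, "from": 7}
print(label_cluster(vec, counts))  # "locate" wins despite frequent stop words
```

The mutually similar stop-word vectors suppress each other's weight, so the meaningful content word dominates the summed vector.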
Rel-LDA: the topic distribution in LDA is replaced with a triplet distribution, and similar relations are clustered using Expectation Maximization [54].
(2)
VAE: the variational autoencoder first predicts the semantic relation given entity pairs, then reconstructs the entities based on the prediction. The model is jointly trained to minimize the error in entity recovery [31].
(3)
Open-RE: the corresponding sentences of entity pairs are used as features, and the features are then vectorized to evaluate relation similarity [13].
We use the New York Times (NYT) dataset [39] and the United Nations Parallel Corpus (UNPC) dataset [59] to train and test our model and the unsupervised relation extraction baseline methods.
NYT dataset.
In the NYT dataset, following the preprocessing in Rel-LDA, 500K and 5K sentences were selected as the training and testing sets, respectively. Each sentence contains at least one entity pair. Note that only entity pairs that appear in at least two sentences were included in the training set, so the number of entity pairs in the training set is 60K. Furthermore, all entity pairs in the testing set have been matched to Freebase [5]. That is, for a given entity pair (e_i, e_j), we have a relation r_k from Freebase. UNPC dataset.
The UNPC dataset is a multilingual corpus thathas been manually curated. In this dataset, 3.2M sentences wererandomly selected from the aligned text of the English-Frenchcorpus and used as the training set. The number of entity pairs intraining set is 200k. We selected 2.6k sentences to use as the testingset. Each sentence also contains at least one entity pair. The numberof unique entity pairs is 1.5k in the testing set (previous work useda testing set with 1k unique entity pairs [53]). Similarly, all entitypairs in the testing set have been matched to YAGO.While previous state-of-the-art methods for this problem usedonly the NYT dataset for evaluation, we chose to additionally usethis corpus for further evaluation for two reasons: (1) The scale ofthis dataset is far greater than that of NYT dataset, so the modelis more likely to learn methods for extracting relation patterns. (2)To ensure model robustness and ensure that a model that achievesexcellent results on NYT is not over fitting to the dataset.
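The training-set construction described above keeps only entity pairs that appear in at least two sentences. A minimal sketch of that filter; the function name and the flat (head, tail, sentence) input format are hypothetical, not the paper’s actual preprocessing code.

```python
from collections import defaultdict


def build_training_pairs(examples, min_sentences=2):
    """Group sentences by entity pair and keep pairs seen in >= min_sentences.

    examples: iterable of (head_entity, tail_entity, sentence) triples --
    a hypothetical flat representation of the preprocessed corpus.
    Returns a dict mapping each retained entity pair to its sentences.
    """
    by_pair = defaultdict(list)
    for head, tail, sentence in examples:
        by_pair[(head, tail)].append(sentence)
    return {pair: sents for pair, sents in by_pair.items()
            if len(sents) >= min_sentences}
```

Pairs seen only once are dropped, since the self-supervised objective needs multiple sentences per entity pair.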
Table 4 shows the performance of each model on assigning relations to entity pairs, which involves relation extraction followed by clustering. We compare the models on selected relations that appear most frequently in the testing dataset, and report recall, precision, and F1 scores for each method in Table 4. Since the original Rel-LDA and VAE methods did not investigate automatic cluster labeling, we compare against variants of these methods in which the most frequent trigger word in each cluster is used as the label; trigger words are defined as the non-stop words on semantic shortest paths. A cluster (and each entity pair in that cluster) is labeled with the Freebase relation most similar to the most frequent trigger word in that cluster. For a given entity pair with two or more relations in Freebase, the predicted relation of this entity pair is considered accurate as long as it matches one of the corresponding relations in Freebase. Notably, CURE achieves the highest accuracy in assigning relations to entity pairs, as shown in Table 4. We also report the F1 gain in Figure 3; overall, CURE outperforms all other methods with an average F1-score gain of 10.47%.

Figure 3: % F-1 gain of CURE over baselines on NYT

While both our method and VAE involve an encoding and decoding process, there is a key difference between the two methods. CURE considers the correlation of sentences that share the same entity pair, while VAE directly projects the relation information into a high-dimensional space and reconstructs triplets from the projection to train the encoder. The results show that the CURE relation information extractor is more accurate than VAE’s. We conjecture that CURE’s accuracy improvement arises because adding sentence correlation to the model is equivalent to guiding the convergence direction when training the encoder. We note that it can be difficult to clearly distinguish some relations in a sentence. For example, the two clusters for “placeBirth” and “placeLived” partially overlap, so the F1 score of each model on these two relations is relatively low. In future work, we plan to further investigate and address this finding.
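The baseline labeling variant described above labels a cluster by its most frequent trigger word, where trigger words are the non-stop words on the shortest paths. A minimal sketch; the stop-word list here is a small illustrative subset, not the one actually used.

```python
from collections import Counter

# Illustrative subset of a stop-word list; the actual list used in the
# experiments is not specified here.
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "was"}


def most_frequent_trigger(cluster_paths):
    """Label a cluster by the most frequent non-stop word on its paths.

    cluster_paths: list of shortest dependency paths, each a list of
    tokens, for the sentences assigned to one cluster.
    Returns the most frequent trigger word, or None if every token is
    a stop word.
    """
    counts = Counter(
        tok.lower()
        for path in cluster_paths
        for tok in path
        if tok.lower() not in STOP_WORDS
    )
    if not counts:
        return None
    return counts.most_common(1)[0][0]
```

The cluster (and every entity pair in it) would then be matched to the Freebase relation most similar to this word.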
We use the same experimental settings and parameters as on the NYT dataset. Similarly, Table 5 reports recall, precision, and F1 scores and shows that our model achieves the best performance on most relations. Note that the genre of UNPC (records of political meetings) differs from that of NYT; the relations in UNPC are therefore mainly based on national relations and geographical location. Although CURE outperforms all the baselines overall, it did not perform well on some relations. In these cases, we notice that CURE performs more fine-grained clustering than needed: for example, given the relation “isPoliticianOf”, CURE divides entity pairs in this category into finer-grained subsets, such as “president” or “ambassador”. We also report the F1 gain in Figure 4. Overall, CURE outperforms the other methods with an average F1-score gain of 6.59%. The experiments on UNPC show that CURE outperforms state-of-the-art approaches on datasets of different genres and sizes and does not overfit to a particular dataset to obtain positive results.

Table 4: Experimental results on NYT (entries marked — were lost in extraction)

Relation     System    Rec.   Prec.  F1
company      CURE      48.2   60.4   53.6
             Open-RE   46.8   54.9   50.5
             Rel-LDA   39.4   50.7   44.3
             VAE       47.3   51.6   49.4
placeBirth   CURE      47.5   38.2   42.3
             Open-RE   38.4   31.3   34.5
             Rel-LDA   31.7   25.7   28.4
             VAE       43.2   32.9   37.4
capital      CURE      —      —      —
             Open-RE   53.2   —      —
—            CURE      56.7   —      —
             Open-RE   51.6   —      —
—            CURE      —      —      —
             Open-RE   36.4   62.8   46.1
             Rel-LDA   31.3   64.6   42.2
             VAE       —      —      —
—            CURE      43.9   45.1   44.5
             Open-RE   42.5   43.4   42.9
             Rel-LDA   33.8   38.6   36.0
             VAE       37.1   44.0   40.3
founders     CURE      46.4   —      —
             Open-RE   45.1   44.4   44.7
             Rel-LDA   35.9   43.9   39.5
             VAE       42.6   —      —
—            CURE      38.7   33.1   35.7
             Open-RE   37.4   27.6   31.8
             Rel-LDA   32.4   24.5   27.9
             VAE       35.3   32.9   34.0
children     CURE      —      —      —
             Open-RE   48.0   45.7   46.8
             Rel-LDA   44.3   42.3   43.3
             VAE       —      —      —
Figure 4: % F-1 gain of CURE over baselines on UNPC

Table 5: Experimental results on UNPC (entries marked — were lost in extraction)
Relation        Models    Rec.   Prec.  F1
dealsWith       CURE      —      —      —
                Open-RE   62.7   54.4   58.3
                Rel-LDA   60.3   50.3   54.8
                VAE       —      —      —
—               CURE      62.9   60.2   61.5
                Open-RE   60.5   58.1   59.3
                Rel-LDA   56.7   56.5   56.8
                VAE       61.6   58.3   59.9
hasNeighbor     CURE      68.5   56.7   62.0
                Open-RE   62.3   53.8   57.7
                Rel-LDA   61.4   52.6   56.6
                VAE       67.3   54.6   61.8
isCitizenOf     CURE      57.6   —      —
                Open-RE   55.2   39.5   46.0
                Rel-LDA   52.5   36.9   41.2
                VAE       53.1   —      —
—               CURE      71.9   46.7   56.6
                Open-RE   68.7   42.1   52.2
                Rel-LDA   66.0   39.4   49.3
                VAE       68.3   44.9   54.2
isPoliticianOf  CURE      47.5   41.1   44.1
                Open-RE   44.7   38.8   41.5
                Rel-LDA   39.2   35.7   37.2
                VAE       45.2   38.0   41.3
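The rand-index evaluation described next compares a KB-induced partition of the testing set with a model-induced one over all C(n, 2) pairs of entity pairs. A minimal pair-counting sketch; the cluster-id list representation is a hypothetical input format, not the paper’s code.

```python
from itertools import combinations


def rand_index(gold, predicted):
    """Rand index between two clusterings of the same n entity pairs.

    gold, predicted: lists where position i holds the cluster id assigned
    to entity pair i (by the knowledge base and by a model, respectively).
    Over all C(n, 2) pairs of entity pairs, counts how often the two
    partitions agree on same-cluster vs. different-cluster, and returns
    the fraction of agreements.
    """
    n = len(gold)
    assert len(predicted) == n
    agree = 0
    for i, j in combinations(range(n), 2):
        same_gold = gold[i] == gold[j]
        same_pred = predicted[i] == predicted[j]
        if same_gold == same_pred:
            agree += 1
    total = n * (n - 1) // 2
    return agree / total if total else 1.0
```

Identical partitions score 1.0; the more the m gold subsets and k predicted subsets disagree on pair co-membership, the lower the score.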
We evaluate the clustering performance of each model using the rand index. We implement the evaluation as follows: (1) we pair the n entity pairs in the testing set together, obtaining C(n, 2) = n(n-1)/2 pairs of entity pairs; (2) we partition the testing set into m subsets using Freebase or YAGO, and into k subsets using CURE and the baseline methods. Following the definition of the rand index, we then compare the m and k subsets to measure the similarity of the results of the two partitioning methods.

The rand index results are shown in Figure 5. Overall, CURE outperforms state-of-the-art methods on both datasets. CURE performs slightly better on NYT than on UNPC. One possible reason is that most sentences in the UNPC dataset do not directly state the relation between two entities, so some entity pairs are assigned to more general relations, such as “contains”.

Figure 5: Rand Index score of CURE and baselines

In this section, we compare the results of two approaches for selecting relation words: (1) selection based on word vector similarity (denoted WVS and used by CURE), and (2) selection based on common words (denoted CW and used by previous work [18]). Other approaches that rely on experts to manually specify relation words based on extracted trigger words are not included in this comparison. We implement this evaluation as follows: (1) for each relation r_f in Freebase, we count the number of entity pairs with the relation r_f in each cluster; (2) we select the cluster that contains the most entity pairs with the relation r_f; (3) WVS and CW are used to generate the label of the selected cluster; (4) we compare the top three generated cluster labels with the relation r_f, as shown in Table 6.

Table 6: Clustering label comparison between selecting relation words based on word vector similarity (WVS) and selecting relation words based on common words (CW)

Relation    WVS label words                CW label words
capital     metropolis, government, city   city, states, help
placeLived  live, stay, york               york, live, play
placeBirth  born, rise, country            country, city, live
neighborOf  near, neighbor, close          include, like, york
company     business, executive, group     group, expert, executive
contains    locate, include, states        states, country, city

The relation words selected by WVS capture the relations better than those selected by CW. For example, for the relation “contains”, WVS finds words that describe the relation between two geographic locations, such as “locate” and “include”, whereas CW only finds words related to geographical divisions, such as “states” and “country”. Moreover, the candidate word lists generated by WVS and CW are ordered differently. For example, for the relation “company”, CW regards “group” as the best word to describe the relation and puts “executive” last. This ordering is clearly inconsistent with the facts, because “company” in Freebase mainly emphasizes the relation between a company’s leader or owner and the company. WVS orders its candidate words more accurately, putting “business” first and “executive” second. Finally, both labeling methods are affected by noise in the text. For example, for the relation “placeLived”, both CW and WVS mistakenly include “york” as a candidate relation word, because “New York Times” appears many times in the NYT dataset.

In this paper, we proposed a Clustering-based Unsupervised Generative Relation Extraction (CURE) framework to extract relations from text. Our CURE training approach does not require labeled data. The CURE relation extractor is trained using the correlations between sentences with the same entity pair. The CURE clustering approach then uses the relation information identified by the relation extractor to cluster entity pairs that share similar relations.
Our experiments demonstrate that including sentence correlation improves unsupervised generative clustering performance. We demonstrate this by comparing our approach to three state-of-the-art baselines on two datasets. We chose baselines from two different categories: probabilistic generative models and sentence-feature-extraction-based methods. We compare model performance on the main relations in the testing dataset, and also report the clustering performance of each model using the rand index. The results show that our model achieves the best performance. We also demonstrate that our proposed relation word selection method describes relations better than existing methods: our method is based on word vector similarity, while existing methods are based on common words.

In the future, we will explore improving model effectiveness by better encoding syntactic structure information, for example by using tree- or graph-structured neural networks, such as Tree-LSTM, instead of LSTM. We also plan to explore using a variational autoencoder that leverages correlations between sentences with similar entity pairs, which may improve model accuracy.
REFERENCES [1] Gabor Angeli, Melvin Jose Johnson Premkumar, and Christopher D Manning.2015. Leveraging linguistic structure for open domain information extraction. In
Proceedings of the 53rd Annual Meeting of the Association for Computational Lin-guistics and the 7th International Joint Conference on Natural Language Processing(Volume 1: Long Papers) . 344–354.[2] Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak,and Zachary Ives. 2007. Dbpedia: A nucleus for a web of open data. In
Thesemantic web . Springer, 722–735.[3] Michele Banko and Oren Etzioni. 2008. The tradeoffs between open and tradi-tional relation extraction. In
Proceedings of ACL-08: HLT . 28–36.[4] David M Blei, Andrew Y Ng, and Michael I Jordan. 2003. Latent dirichlet allocation.
Journal of machine Learning research
3, Jan (2003), 993–1022.[5] Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor.2008. Freebase: a collaboratively created graph database for structuring humanknowledge. In
Proceedings of the 2008 ACM SIGMOD international conference onManagement of data . 1247–1250.[6] Denny Britz, Anna Goldie, Minh-Thang Luong, and Quoc Le. 2017. MassiveExploration of Neural Machine Translation Architectures. In
Proceedings of the2017 Conference on Empirical Methods in Natural Language Processing . 1442–1451.[7] Razvan Bunescu and Raymond Mooney. 2007. Learning to extract relations fromthe web using minimal supervision. In
Proceedings of the 45th Annual Meeting ofthe Association of Computational Linguistics . 576–583.[8] Razvan C Bunescu and Raymond J Mooney. 2005. A shortest path dependencykernel for relation extraction. In
Proceedings of the conference on human languagetechnology and empirical methods in natural language processing . Association forComputational Linguistics, 724–731.[9] Yee Seng Chan and Dan Roth. 2011. Exploiting syntactico-semantic structures forrelation extraction. In
Proceedings of the 49th Annual Meeting of the Association forComputational Linguistics: Human Language Technologies-Volume 1 . Associationfor Computational Linguistics, 551–560.[10] Jinxiu Chen, Donghong Ji, Chew Lim Tan, and Zheng-Yu Niu. 2005. Unsupervisedfeature selection for relation extraction. In
Companion Volume to the Proceedingsof Conference including Posters/Demos and tutorial abstracts .[11] Mark Craven, Johan Kumlien, et al. 1999. Constructing biological knowledgebases by extracting information from text sources.. In
ISMB , Vol. 1999. 77–86.
[12] Oier Lopez De Lacalle and Mirella Lapata. 2013. Unsupervised relation extraction with general domain knowledge. In
Proceedings of the 2013 Conference on EmpiricalMethods in Natural Language Processing . 415–425.[13] Hady Elsahar, Elena Demidova, Simon Gottschalk, Christophe Gravier, and Fred-erique Laforest. 2017. Unsupervised open relation extraction. In
European Se-mantic Web Conference . Springer, 12–16.[14] Oren Etzioni, Michele Banko, Stephen Soderland, and Daniel S Weld. 2008. Openinformation extraction from the web.
Commun. ACM
51, 12 (2008), 68–74.[15] Oren Etzioni, Michael Cafarella, Doug Downey, Ana-Maria Popescu, Tal Shaked,Stephen Soderland, Daniel S Weld, and Alexander Yates. 2005. Unsupervisednamed-entity extraction from the web: An experimental study.
Artificial intelli-gence
Proceedings of the conference on empiricalmethods in natural language processing . Association for Computational Linguis-tics, 1535–1545.[17] Zhou GuoDong, Su Jian, Zhang Jie, and Zhang Min. 2005. Exploring variousknowledge in relation extraction. In
Proceedings of the 43rd annual meeting on as-sociation for computational linguistics . Association for Computational Linguistics,427–434.[18] Takaaki Hasegawa, Satoshi Sekine, and Ralph Grishman. 2004. Discoveringrelations among named entities from large corpora. In
Proceedings of the 42ndannual meeting on association for computational linguistics . Association for Com-putational Linguistics, 415.[19] Philipp Heim, Sebastian Hellmann, Jens Lehmann, Steffen Lohmann, and TimoStegemann. 2009. RelFinder: Revealing relationships in RDF knowledge bases. In
International Conference on Semantic and Digital Media Technologies . Springer,182–187.[20] Sepp Hochreiter. 1998. The vanishing gradient problem during learning recurrentneural nets and problem solutions.
International Journal of Uncertainty, Fuzzinessand Knowledge-Based Systems
6, 02 (1998), 107–116.[21] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory.
Neuralcomputation
9, 8 (1997), 1735–1780.[22] Raphael Hoffmann, Congle Zhang, and Daniel S Weld. 2010. Learning 5000relational extractors. In
Proceedings of the 48th Annual Meeting of the Associationfor Computational Linguistics . 286–295.[23] Scott B Huffman. 1995. Learning information extraction patterns from examples.In
International Joint Conference on Artificial Intelligence . Springer, 246–260.[24] Jing Jiang and ChengXiang Zhai. 2007. A systematic exploration of the fea-ture space for relation extraction. In
Human Language Technologies 2007: TheConference of the North American Chapter of the Association for ComputationalLinguistics; Proceedings of the Main Conference . 113–120.[25] Natthawut Kertkeidkachorn and Ryutaro Ichise. 2017. T2KG: An end-to-endsystem for creating knowledge graph from unstructured text. In
Workshops atthe Thirty-First AAAI Conference on Artificial Intelligence .[26] Jun-Tae Kim and Dan I. Moldovan. 1995. Acquisition of linguistic patterns forknowledge-based information extraction.
IEEE transactions on knowledge anddata engineering
7, 5 (1995), 713–724.[27] Natalia Konstantinova. 2014. Review of Relation Extraction Methods: WhatIs New Out There?. In
Analysis of Images, Social Networks and Texts , Dmitry I.Ignatov, Mikhail Yu. Khachay, Alexander Panchenko, Natalia Konstantinova, andRostislav E. Yavorsky (Eds.). Springer International Publishing, Cham, 15–28.[28] Sebastian Krause, Hong Li, Hans Uszkoreit, and Feiyu Xu. 2012. Large-scalelearning of relation-extraction rules with distant supervision from the web. In
International Semantic Web Conference . Springer, 263–278.[29] ChunYang Liu, WenBo Sun, WenHan Chao, and Wanxiang Che. 2013. Con-volution neural network for relation extraction. In
International Conference onAdvanced Data Mining and Applications . Springer, 231–242.[30] Yang Liu, Furu Wei, Sujian Li, Heng Ji, Ming Zhou, and Houfeng Wang. 2015. Adependency-based neural network for relation classification. In . Association for Computational Linguistics(ACL), 285–290.[31] Diego Marcheggiani and Ivan Titov. 2016. Discrete-state variational autoencodersfor joint discovery and factorization of relations.
Transactions of the Association for Computational Linguistics. [32] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013). [33] Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. 2009. Distant supervision for relation extraction without labeled data. In
Proceedings of the Joint Conferenceof the 47th Annual Meeting of the ACL and the 4th International Joint Conferenceon Natural Language Processing of the AFNLP: Volume 2-Volume 2 . Association forComputational Linguistics, 1003–1011.[34] Ndapandula Nakashole, Gerhard Weikum, and Fabian Suchanek. 2012. PATTY: ataxonomy of relational patterns with semantic types. In
Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Com-putational Natural Language Learning . Association for Computational Linguistics,1135–1145.[35] Dat PT Nguyen, Yutaka Matsuo, and Mitsuru Ishizuka. 2007. Exploiting syntac-tic and semantic information for relation extraction from wikipedia. In
IJCAIWorkshop on Text-Mining & Link-Analysis (TextLink 2007) .[36] Dat PT Nguyen, Yutaka Matsuo, and Mitsuru Ishizuka. 2007. Relation extractionfrom wikipedia using subtree mining. In
Proceedings of the National Conferenceon Artificial Intelligence , Vol. 22. Menlo Park, CA; Cambridge, MA; London; AAAIPress; MIT Press; 1999, 1414.[37] Truc-Vien T Nguyen and Alessandro Moschitti. 2011. End-to-end relation extrac-tion using distant supervision from external semantic repositories. In
Proceedingsof the 49th Annual Meeting of the Association for Computational Linguistics: HumanLanguage Technologies: short papers-Volume 2 . Association for ComputationalLinguistics, 277–282.[38] Hoifung Poon and Pedro Domingos. 2009. Unsupervised semantic parsing. In
Proceedings of the 2009 Conference on Empirical Methods in Natural LanguageProcessing: Volume 1-Volume 1 . Association for Computational Linguistics, 1–10.[39] Sebastian Riedel, Limin Yao, and Andrew McCallum. 2010. Modeling relationsand their mentions without labeled text. In
Joint European Conference on MachineLearning and Knowledge Discovery in Databases . Springer, 148–163.[40] Lorenza Romano, Milen Kouylekov, Idan Szpektor, Ido Dagan, and Alberto Lavelli.2006. Investigating a generic paraphrase-based approach for relation extraction.In .[41] Benjamin Rosenfeld and Ronen Feldman. 2006. Ures: an unsupervised webrelation extraction system. In
Proceedings of the COLING/ACL on Main conferenceposter sessions . Association for Computational Linguistics, 667–674.[42] Binjamin Rozenfeld and Ronen Feldman. 2006. High-performance unsupervisedrelation extraction from large corpora. In
Sixth International Conference on DataMining (ICDM’06) . IEEE, 1032–1037.[43] Sunita Sarawagi and William W Cohen. 2005. Semi-markov conditional randomfields for information extraction. In
Advances in neural information processingsystems . 1185–1192.[44] Stephen Soderland, David Fisher, Jonathan Aseltine, and Wendy Lehnert. 1995.CRYSTAL: Inducing a conceptual dictionary. arXiv preprint cmp-lg/9505020 (1995).[45] Fabian M Suchanek, Gjergji Kasneci, and Gerhard Weikum. 2007. Yago: a core ofsemantic knowledge. In
Proceedings of the 16th international conference on WorldWide Web . 697–706.[46] Shingo Takamatsu, Issei Sato, and Hiroshi Nakagawa. 2012. Reducing wronglabels in distant supervision for relation extraction. In
Proceedings of the 50thAnnual Meeting of the Association for Computational Linguistics: Long Papers-Volume 1 . Association for Computational Linguistics, 721–729.[47] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones,Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is allyou need. In
Advances in neural information processing systems . 5998–6008.[48] Denny Vrandečić and Markus Krötzsch. 2014. Wikidata: a free collaborativeknowledgebase.
Commun. ACM
57, 10 (2014), 78–85.[49] Xiang Wang, Dingxian Wang, Canran Xu, Xiangnan He, Yixin Cao, and Tat-SengChua. 2019. Explainable reasoning over knowledge graphs for recommendation.In
Proceedings of the AAAI Conference on Artificial Intelligence , Vol. 33. 5329–5336.[50] Daniel S Weld, Raphael Hoffmann, and Fei Wu. 2009. Using wikipedia to bootstrapopen information extraction.
Acm Sigmod Record
37, 4 (2009), 62–68.[51] Fei Wu and Daniel S Weld. 2010. Open information extraction using Wikipedia.In
Proceedings of the 48th annual meeting of the association for computationallinguistics . Association for Computational Linguistics, 118–127.[52] Chenyan Xiong, Russell Power, and Jamie Callan. 2017. Explicit semantic rank-ing for academic search via knowledge graph embedding. In
Proceedings of the26th international conference on world wide web . International World Wide WebConferences Steering Committee, 1271–1279.[53] Yulan Yan, Naoaki Okazaki, Yutaka Matsuo, Zhenglu Yang, and Mitsuru Ishizuka.2009. Unsupervised relation extraction by mining wikipedia texts using infor-mation from the web. In
Proceedings of the Joint Conference of the 47th AnnualMeeting of the ACL and the 4th International Joint Conference on Natural Lan-guage Processing of the AFNLP: Volume 2-Volume 2 . Association for ComputationalLinguistics, 1021–1029.[54] Limin Yao, Aria Haghighi, Sebastian Riedel, and Andrew McCallum. 2011. Struc-tured relation discovery using generative models. In
Proceedings of the Conferenceon Empirical Methods in Natural Language Processing . Association for Computa-tional Linguistics, 1456–1466.[55] Alexander Yates, Michele Banko, Matthew Broadhead, Michael J Cafarella, OrenEtzioni, and Stephen Soderland. 2007. Textrunner: open information extraction onthe web. In
Proceedings of Human Language Technologies: The Annual Conferenceof the North American Chapter of the Association for Computational Linguistics(NAACL-HLT) . 25–26.[56] Daojian Zeng, Kang Liu, Siwei Lai, Guangyou Zhou, Jun Zhao, et al. 2014. Relationclassification via convolutional deep neural network. (2014).
[57] Dongxu Zhang and Dong Wang. 2015. Relation classification via recurrent neural network. arXiv preprint arXiv:1508.01006 (2015). [58] Yuyu Zhang, Hanjun Dai, Zornitsa Kozareva, Alexander J Smola, and Le Song. 2018. Variational reasoning for question answering with knowledge graph. In
Thirty-Second AAAI Conference on Artificial Intelligence .[59] Michał Ziemski, Marcin Junczys-Dowmunt, and Bruno Pouliquen. 2016. Theunited nations parallel corpus v1. 0. In
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16). 3530–3534.