Inductively Representing Out-of-Knowledge-Graph Entities by Optimal Estimation Under Translational Assumptions
Damai Dai, Hua Zheng, Fuli Luo, Pengcheng Yang, Baobao Chang, Zhifang Sui
Key Lab of Computational Linguistics, School of EECS, Peking University
Peng Cheng Laboratory, China
{daidamai,zhenghua,luofuli,yang_pc,chbb,szf}@pku.edu.cn

Abstract
Conventional Knowledge Graph Completion (KGC) assumes that all test entities appear during training. However, in real-world scenarios, Knowledge Graphs (KG) evolve fast with out-of-knowledge-graph (OOKG) entities added frequently, and we need to represent these entities efficiently. Most existing Knowledge Graph Embedding (KGE) methods cannot represent OOKG entities without costly retraining on the whole KG. To enhance efficiency, we propose a simple and effective method that inductively represents OOKG entities by their optimal estimation under translational assumptions. Given pretrained embeddings of the in-knowledge-graph (IKG) entities, our method needs no additional learning. Experimental results show that our method outperforms the state-of-the-art methods with higher efficiency on two KGC tasks with OOKG entities.
1 Introduction

Knowledge Graphs (KG) play a pivotal role in various NLP tasks, but generally suffer from incompleteness. To address this problem, Knowledge Graph Completion (KGC) aims to predict missing relations in a KG based on Knowledge Graph Embeddings (KGE) of entities. Conventional KGE methods such as TransE (Bordes et al., 2013) and RotatE (Sun et al., 2019) achieve success in conventional KGC, which assumes that all test entities appear during training. However, in real-world scenarios, KGs evolve fast with out-of-knowledge-graph (OOKG) entities added frequently. To represent OOKG entities, most conventional KGE methods need to retrain on the whole KG frequently, which is extremely time-consuming. Faced with this problem, we are in urgent need of an efficient method to tackle KGC with OOKG entities.
Figure 1: An example of KGC with OOKG entities. When an OOKG entity "TENET" is added, we need to represent it efficiently via information of its IKG neighbors to predict its missing relations with other entities.

Figure 1 shows an example of KGC with OOKG entities. Based on an existing KG, a new movie "TENET" is added as an OOKG entity with some auxiliary relations that connect it with some in-knowledge-graph (IKG) entities. To predict the missing relations between "TENET" and other entities, we need to obtain its embedding first. Being aware that "TENET" is directed by "Christopher Nolan", is an "action" movie, and is starred by "John David Washington", we can combine these clues to profile "TENET" and estimate its embedding. This embedding can then be used to predict whether its relation with "English" is "language".

To represent OOKG entities via IKG neighbor information instead of retraining, Hamaguchi et al. (2017); Wang et al. (2019); Bi et al. (2020); Zhao et al. (2020) adopt Graph Neural Networks (GNN) to aggregate IKG neighbors to obtain the OOKG entity embedding. Some other methods (Xie et al., 2016, 2017; Shi and Weninger, 2018) utilize external resources such as entity descriptions or images instead of IKG neighbor information to avoid retraining. However, GNN models require relatively complex calculations, and high-quality external resources are hard and expensive to acquire.

In this paper, we propose an inductive method that derives formulas to estimate OOKG entity embeddings from translational assumptions. Compared to existing methods, our method has simpler calculations and does not need external resources. For a triplet $(h, r, t)$, translational assumptions of KGE models suppose that embedding $\mathbf{h}$ can establish a connection with $\mathbf{t}$ via an $r$-specific operation. Assuming that $h$ is OOKG and $t$ is IKG, we show that if a translational assumption can derive a specific formula to compute $\mathbf{h}$ via pretrained $\mathbf{t}$ and $\mathbf{r}$, then there will be no other candidate for $\mathbf{h}$ that better fits this translational assumption. Therefore, the computed $\mathbf{h}$ is the optimal estimation of the OOKG entity under this translational assumption. Among existing typical KGE models, we discover that the translational assumptions of TransE and RotatE can derive specific estimation formulas. Therefore, based on them, we design two instances of our method, called InvTransE and InvRotatE, respectively. Note that our estimation formulas are settled, so our method needs no additional learning when given pretrained IKG embeddings.

Our contributions are summarized as follows: (1) We propose a simple and effective method to inductively represent OOKG entities by their optimal estimation under translational assumptions. (2) Our method needs no external resources. Given pretrained IKG embeddings, our method even needs no additional learning. (3) We evaluate our method on two KGC tasks with OOKG entities. Experimental results show that our method outperforms the state-of-the-art methods by a large margin with higher efficiency, and maintains a robust performance even under increasing OOKG entity ratios.
2 Problem Formulation

Let $\mathcal{E}$ denote the IKG entity set and $\mathcal{R}$ denote the relation set. $\mathcal{K}_{train}$ is the training set, where all entities are IKG. $\mathcal{K}_{aux}$ is the auxiliary set that connects OOKG and IKG entities at inference time, where each triplet contains one OOKG entity and one IKG entity. We define the $\mathcal{K}$-neighbor set of an entity $e$ as all its neighbor entities and relations in $\mathcal{K}$:

$$N_{\mathcal{K}}(e) = \{(r, t) \mid (e, r, t) \in \mathcal{K}\} \cup \{(h, r) \mid (h, r, e) \in \mathcal{K}\}.$$

Using the notations above, we formulate our problem as follows: given $\mathcal{K}_{aux}$ and IKG embeddings pretrained on $\mathcal{K}_{train}$, we need to utilize them to represent an OOKG entity $e \notin \mathcal{E}$ as an embedding. This embedding can then be used to tackle KGC with OOKG entities.
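As a concrete illustration of $N_{\mathcal{K}}(e)$, the following is a minimal Python sketch of neighbor-set construction from triplets; the function and variable names are our own illustrative choices, not from the paper's implementation.

```python
from collections import defaultdict

def build_neighbor_sets(triplets):
    """Collect N_K(e) for every entity e in a triplet set K:
    (r, t) pairs where e is the head, tagged "out", and
    (h, r) pairs where e is the tail, tagged "in"."""
    neighbors = defaultdict(set)
    for h, r, t in triplets:
        neighbors[h].add(("out", r, t))  # from (e, r, t) in K
        neighbors[t].add(("in", h, r))   # from (h, r, e) in K
    return neighbors

# Toy auxiliary set connecting the OOKG entity "TENET" to IKG entities.
K_aux = [
    ("TENET", "directed_by", "Christopher Nolan"),
    ("TENET", "genre", "action"),
    ("TENET", "star", "John David Washington"),
]
print(build_neighbor_sets(K_aux)["TENET"])
```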
3 Methodology

As shown in Figure 2, our proposed method is composed of an estimator and a reducer.

Figure 2: An illustration of our method. It is composed of an estimator and a reducer.

The estimator aims to compute a set of candidate embeddings for an OOKG entity via its IKG neighbor information. The reducer aims to reduce these candidates to the final embedding of the OOKG entity.
3.1 Estimator

For an OOKG entity $e$, given its IKG neighbors $N_{\mathcal{K}_{aux}}(e)$ with pretrained embeddings, the estimator aims to compute a set of candidate embeddings. Except for TransE and RotatE, other typical KGE models have relatively complex calculations in their translational assumptions. These complex calculations prevent their translational assumptions from deriving specific estimation formulas for OOKG entities. A detailed proof is included in Appendix A. Therefore, we design two sets of estimation formulas based on TransE and RotatE, respectively. To be specific, if $e$ is the head entity, we can obtain its optimal estimation $\tilde{\mathbf{e}}$ by the following formulas:

$$\tilde{\mathbf{e}} = \begin{cases} \mathbf{t} - \mathbf{r}, & \text{for InvTransE}, \\ \mathbf{t} \circ \mathbf{r}^{-1}, & \text{for InvRotatE}, \end{cases}$$

where $\circ$ denotes the element-wise product and $\mathbf{r}^{-1}$ denotes the element-wise inversion. Otherwise, if $e$ is the tail entity, we can obtain its optimal estimation $\tilde{\mathbf{e}}$ by the following formulas:

$$\tilde{\mathbf{e}} = \begin{cases} \mathbf{h} + \mathbf{r}, & \text{for InvTransE}, \\ \mathbf{h} \circ \mathbf{r}, & \text{for InvRotatE}. \end{cases}$$

3.2 Reducer

After the estimator computes $|N_{\mathcal{K}_{aux}}(e)|$ candidate embeddings, the reducer aims to reduce them to the final embedding of the OOKG entity by weighted average. We design two weighting functions.

Correlation-based weights are query-aware. Inspired by Wang et al. (2019), we first use the conditional probability to model the correlation between two relations:

$$P(r_1 \mid r_2) = \frac{\sum_{e \in \mathcal{E}} \mathbb{1}\left(r_1, r_2 \in N_{\mathcal{K}_{train}}(e)\right)}{\sum_{e \in \mathcal{E}} \mathbb{1}\left(r_2 \in N_{\mathcal{K}_{train}}(e)\right)}.$$

When the query relation $r_q$ is specified, we assign more weight to the candidate computed via a neighbor with a more relevant relation to $r_q$:

$$w_{corr}(\tilde{\mathbf{e}}) = \frac{P(r_{\tilde{e}} \mid r_q) + P(r_q \mid r_{\tilde{e}})}{Z_{corr}},$$

where $Z_{corr}$ is the normalization factor and $r_{\tilde{e}}$ is the neighbor relation via which $\tilde{\mathbf{e}}$ is computed.

Degree-based weights focus more on the entity with a higher degree in the training set:

$$w_{deg}(\tilde{\mathbf{e}}) = \frac{\log(d_{\tilde{e}} + \delta)}{Z_{deg}},$$

where $Z_{deg}$ is the normalization factor, $d_{\tilde{e}}$ is the degree of the neighbor entity via which $\tilde{\mathbf{e}}$ is computed, and $\delta$ is a smoothing factor.

Based on these weighting functions, the final embedding of the OOKG entity $e$ is computed by

$$\mathbf{e} = \sum_{\tilde{\mathbf{e}} \in \mathcal{C}} \tilde{\mathbf{e}} \cdot w_{corr/deg}(\tilde{\mathbf{e}}),$$

where $\mathcal{C}$ denotes the candidate embedding set.
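To make the estimator and reducer concrete, here is a minimal NumPy sketch of the two components. The data layout (RotatE embeddings as complex vectors with unit-modulus relations) and all helper names are our illustrative assumptions, not the authors' released code.

```python
import numpy as np
from collections import defaultdict

def estimate_candidate(nbr, ent_emb, rel_emb, model="InvTransE"):
    """Optimal estimate of an OOKG entity from one IKG neighbor.
    nbr is ("out", r, t) if the OOKG entity is the head of (e, r, t),
    or ("in", h, r) if it is the tail of (h, r, e)."""
    if nbr[0] == "out":
        _, r, t = nbr
        if model == "InvTransE":
            return ent_emb[t] - rel_emb[r]          # e~ = t - r
        return ent_emb[t] * np.conj(rel_emb[r])     # e~ = t o r^-1
    _, h, r = nbr
    if model == "InvTransE":
        return ent_emb[h] + rel_emb[r]              # e~ = h + r
    return ent_emb[h] * rel_emb[r]                  # e~ = h o r

def relation_cond_prob(train_triplets):
    """P(r1 | r2): fraction of entities whose neighborhood contains
    r2 that also contains r1, estimated on the training set."""
    rels_of = defaultdict(set)
    for h, r, t in train_triplets:
        rels_of[h].add(r)
        rels_of[t].add(r)
    rel_set = {r for _, r, _ in train_triplets}
    prob = {}
    for r2 in rel_set:
        has_r2 = [rs for rs in rels_of.values() if r2 in rs]
        for r1 in rel_set:
            if has_r2:
                prob[(r1, r2)] = sum(r1 in rs for rs in has_r2) / len(has_r2)
    return prob

def reduce_candidates(candidates, raw_weights):
    """Weighted average of candidate embeddings; the normalization
    here plays the role of Z_corr / Z_deg."""
    w = np.asarray(raw_weights, dtype=float)
    w = w / w.sum()
    return np.tensordot(w, np.stack(candidates), axes=1)
```

With the correlation-based weights, each candidate's raw weight would be $P(r_{\tilde{e}} \mid r_q) + P(r_q \mid r_{\tilde{e}})$; with the degree-based weights, $\log(d_{\tilde{e}} + \delta)$; `reduce_candidates` then normalizes and averages.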
4 Experiments

4.1 Datasets

We conduct experiments on two KGC tasks with OOKG entities: link prediction and triplet classification. For link prediction, we use two datasets released by Wang et al. (2019), built based on FB15k (Bordes et al., 2013). For triplet classification, we use nine datasets released by Hamaguchi et al. (2017), built based on WN11 (Socher et al., 2013). All datasets are built for KGC with OOKG entities and are composed of a training set, an auxiliary set, a validation set, and a test set. More details of these datasets are included in Appendix B.

4.2 Experimental Settings

We tune hyper-parameters for pretraining on the validation set. Generally, we use Adam (Kingma and Ba, 2015) as the optimizer. For link prediction, we use an embedding dimension of 1,000 and the correlation-based weights. For triplet classification, we use an embedding dimension of 300 and the degree-based weights. Details of the experimental settings are included in Appendix C.
MRR H@10 H@1 MRR H@10 H@1GNN-LSTM 0.254 42.9 16.2 0.219 37.3 14.3GNN-MEAN 0.310 48.0 22.2 0.251 41.0 17.1LAN 0.394 56.6 30.2 0.314 48.2 22.7
InvTransE 0.462 60.4 38.5
InvRotatE
Table 1: Evaluation results (MRR, Hits@k) of link pre-diction.
Bold is the best. Underline is the second best.
Method       WN11-Head    WN11-Tail    WN11-Both
InvTransE    87.8  80.1   86.3  78.4   74.6
InvRotatE    –

Table 2: Evaluation results (Accuracy) of triplet classification. Bold is the best. Underline is the second best.
4.3 Baselines

For link prediction, we compare our method with three GNN-based baselines. GNN-MEAN (Hamaguchi et al., 2017) uses a mean function to aggregate neighbors. GNN-LSTM adopts an LSTM for aggregation. LAN (Wang et al., 2019) adopts both a rule-based and a network-based attention mechanism for aggregation and maintains the best performance so far. For triplet classification, we compare with two more GNN-based baselines. ConvLayer (Bi et al., 2020) uses convolutional layers as the transition function. FCLEntity (Zhao et al., 2020) uses fully-connected networks as the transition function and adopts an attention-based aggregation.
4.4 Evaluation Metrics

For link prediction, we use Mean Reciprocal Rank (MRR) and the proportion of ground-truth entities ranked in the top k (Hits@k, $k \in \{1, 10\}$). All the metrics are filtered versions that exclude false negative candidates. For triplet classification, we use Accuracy. We determine relation-specific thresholds $\delta_r$ by maximizing the accuracy on the validation set.
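To make the filtered protocol concrete, here is a short sketch of filtered ranking and metric computation (our paraphrase of the standard procedure with illustrative names, not the authors' evaluation code), assuming lower distance scores are better:

```python
import numpy as np

def filtered_rank(scores, true_idx, known_true_idx):
    """Rank of the ground-truth entity after masking all other entities
    known to form true triplets (the 'filtered' setting).
    scores: one distance per candidate entity, lower = better."""
    masked = scores.astype(float)
    for i in known_true_idx:
        if i != true_idx:
            masked[i] = np.inf           # exclude false negative candidates
    return int(np.sum(masked < masked[true_idx])) + 1

def mrr_and_hits(ranks, ks=(1, 10)):
    ranks = np.asarray(ranks, dtype=float)
    metrics = {"MRR": float(np.mean(1.0 / ranks))}
    for k in ks:
        metrics[f"Hits@{k}"] = float(np.mean(ranks <= k))
    return metrics

# Toy usage with random scores over 100 candidate entities.
ranks = [filtered_rank(np.random.rand(100), 3, [3, 7, 42]) for _ in range(32)]
print(mrr_and_hits(ranks))
```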
4.5 Main Results

Evaluation results of link prediction are shown in Table 1. From the table, we have the following observations: (1) Both instances of our method significantly outperform all baselines, since our estimation formulas are optimal under translational assumptions. (2) GNN-LSTM performs the worst, since neighbors are unordered but LSTM captures ordered information. (3) LAN is the best baseline, since it adopts a complex attention mechanism to aggregate neighbors more comprehensively. For triplet classification, due to space limitation, we show the main part of the results in Table 2 and the complete results in Appendix D. From Table 2, we find that our method outperforms all baselines on all datasets due to our optimal estimation.

Method               MRR     H@10    H@1
InvTransE (Full)     0.462   60.4    38.5
Up to 32 Neighbors   0.447   59.2    37.2
Up to 8 Neighbors    0.386   52.0    31.3
Only 1 Neighbor      0.246   37.9    18.1
Uniform Weights      0.361   52.0    28.1

Table 3: Ablation experiment results for InvTransE on the FB15k-Head-10 dataset of link prediction.
4.6 Analysis

Do more neighbors help? We randomly select up to $k \in \{1, 8, 32\}$ IKG neighbors of the OOKG entities to use. As shown in Table 3, as the number of used neighbors decreases, the performance drops. This suggests that using more neighbors can enhance the robustness and thus lead to better performance.
Do our weighting functions matter?
We attempt to reduce candidates with uniform weights. As shown in Table 3, the performance without our weighting functions drops dramatically. This verifies the effectiveness of our weighting functions.
How does our method perform under increasing OOKG entity ratios?
We compare the triplet classification results of InvTransE, LAN, and GNN-MEAN under increasing OOKG entity ratios in Figure 3. We find that, as the OOKG entity ratio increases, the performance of our method drops the slowest. This suggests that our method is more robust to increasing OOKG entity ratios.

Figure 3: Results under increasing OOKG entity ratios.
Is our method more efficient?
We compare InvTransE with LAN to highlight the efficiency of our method. Theoretically, LAN requires $O(md)$ time to represent an entity, where $m$ is the number of neighbors and $d$ is the embedding dimension. By contrast, InvTransE requires only $O(d)$ and $O(md)$ time to represent an IKG and an OOKG entity, respectively. Empirically, under similar configurations, LAN costs about 15 times the time of InvTransE to train a model for triplet classification. This verifies that our simple method is much more efficient.
5 Related Work

Conventional transductive KGE methods map entities and relations to embeddings, and then use score functions to measure the salience of triplets. TransE (Bordes et al., 2013) pioneers translational distance methods and is the most widely used one. It has inspired a series of translational distance methods, such as TransH (Wang et al., 2014), TransR (Lin et al., 2015), and RotatE (Sun et al., 2019). Besides, semantic matching methods form another mainstream (Nickel et al., 2011; Yang et al., 2015; Trouillon et al., 2016; Nickel et al., 2016; Balazevic et al., 2019). These transductive KGE methods achieve success in conventional KGC, but fail to directly represent OOKG entities efficiently.

To represent OOKG entities more efficiently, some inductive methods adopt GNNs to aggregate IKG neighbors and inductively produce embeddings for OOKG entities (Hamaguchi et al., 2017; Wang et al., 2019; Bi et al., 2020; Zhao et al., 2020). These methods are effective, but need relatively complex calculations. Other inductive methods incorporate external resources to enrich embeddings and represent OOKG entities via only external resources (Xie et al., 2016, 2017; Shi and Weninger, 2018). However, high-quality external resources are hard and expensive to acquire.
6 Conclusion

This paper aims to address the problem of efficiently representing OOKG entities. We propose a simple and effective method that inductively represents OOKG entities by their optimal estimation under translational assumptions. Given pretrained IKG embeddings, our method needs no additional learning. Experimental results on two KGC tasks with OOKG entities show that our method outperforms the state-of-the-art methods by a large margin with higher efficiency, and maintains a robust performance under increasing OOKG entity ratios.

References
Ivana Balazevic, Carl Allen, and Timothy M. Hospedales. 2019. TuckER: Tensor factorization for knowledge graph completion. In EMNLP-IJCNLP 2019, pages 5184–5193.

Zhongqin Bi, Tianchen Zhang, Ping Zhou, and Yongbin Li. 2020. Knowledge transfer for out-of-knowledge-base entities: Improving graph-neural-network-based embedding using convolutional layers. IEEE Access, 8:159039–159049.

Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. 2013. Translating embeddings for modeling multi-relational data. In NeurIPS 2013, pages 2787–2795.

Takuo Hamaguchi, Hidekazu Oiwa, Masashi Shimbo, and Yuji Matsumoto. 2017. Knowledge transfer for out-of-knowledge-base entities: A graph neural network approach. In IJCAI 2017, pages 1802–1808.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In ICLR 2015.

Yankai Lin, Zhiyuan Liu, Maosong Sun, Yang Liu, and Xuan Zhu. 2015. Learning entity and relation embeddings for knowledge graph completion. In AAAI 2015, pages 2181–2187.

Maximilian Nickel, Lorenzo Rosasco, and Tomaso A. Poggio. 2016. Holographic embeddings of knowledge graphs. In AAAI 2016, pages 1955–1961.

Maximilian Nickel, Volker Tresp, and Hans-Peter Kriegel. 2011. A three-way model for collective learning on multi-relational data. In ICML 2011, pages 809–816.

Baoxu Shi and Tim Weninger. 2018. Open-world knowledge graph completion. In AAAI 2018, pages 1957–1964.

Richard Socher, Danqi Chen, Christopher D. Manning, and Andrew Ng. 2013. Reasoning with neural tensor networks for knowledge base completion. In NeurIPS 2013, pages 926–934.

Zhiqing Sun, Zhi-Hong Deng, Jian-Yun Nie, and Jian Tang. 2019. RotatE: Knowledge graph embedding by relational rotation in complex space. In ICLR 2019.

Théo Trouillon, Johannes Welbl, Sebastian Riedel, Éric Gaussier, and Guillaume Bouchard. 2016. Complex embeddings for simple link prediction. In ICML 2016, pages 2071–2080.

Peifeng Wang, Jialong Han, Chenliang Li, and Rong Pan. 2019. Logic attention based neighborhood aggregation for inductive knowledge graph embedding. In AAAI 2019, pages 7152–7159.

Zhen Wang, Jianwen Zhang, Jianlin Feng, and Zheng Chen. 2014. Knowledge graph embedding by translating on hyperplanes. In AAAI 2014, pages 1112–1119.

Ruobing Xie, Zhiyuan Liu, Jia Jia, Huanbo Luan, and Maosong Sun. 2016. Representation learning of knowledge graphs with entity descriptions. In AAAI 2016, pages 2659–2665.

Ruobing Xie, Zhiyuan Liu, Huanbo Luan, and Maosong Sun. 2017. Image-embodied knowledge representation learning. In IJCAI 2017, pages 3140–3146.

Bishan Yang, Wen-tau Yih, Xiaodong He, Jianfeng Gao, and Li Deng. 2015. Embedding entities and relations for learning and inference in knowledge bases. In ICLR 2015.

Ming Zhao, Weijia Jia, and Yusheng Huang. 2020. Attention-based aggregation graph networks for knowledge graph information transfer. In PAKDD 2020, pages 542–554.

Appendices
A Which Translational Assumptions Can Derive Specific Estimation Formulas for OOKG Entities?
For a triplet $(h, r, t)$, translational assumptions of KGE models suppose that $\mathbf{h}$ can establish a connection with $\mathbf{t}$ via an $r$-specific operation, which can be formulated by the following equation:

$$F_r(\mathbf{h}, \mathbf{t}) = 0, \tag{1}$$

where $F_r(\cdot)$ is an $r$-specific function that is determined by the specific KGE model. Without loss of generality, we may assume that $h$ is an OOKG entity and $t$ is an IKG entity. Under a translational assumption, we can obtain a specific estimation formula for $\mathbf{h}$ if and only if (1) we regard $\mathbf{h}$ as unknown, and its solution in Equation 1 exists, and (2) the solution is unique. If the above two conditions hold, the unique solution of $\mathbf{h}$ is the optimal estimation under the translational assumption, since no other candidate for $\mathbf{h}$ can better fit Equation 1. In the following parts, we analyze the translational assumptions of four KGE models (TransE, RotatE, TransH, TransR) as examples.

A.1 TransE
For TransE, its translational assumption is formulated by

$$F_r(\mathbf{h}, \mathbf{t}) = \|\mathbf{h} + \mathbf{r} - \mathbf{t}\|_{1/2} = 0. \tag{2}$$

In this case, we can obtain a unique solution of $\mathbf{h}$ by the following steps:

$$\|\mathbf{h} + \mathbf{r} - \mathbf{t}\|_{1/2} = 0 \tag{3}$$
$$\implies \mathbf{h} + \mathbf{r} - \mathbf{t} = \mathbf{0} \tag{4}$$
$$\implies \mathbf{h} = \mathbf{t} - \mathbf{r}. \tag{5}$$

This computed $\mathbf{h}$ is the optimal estimation under the translational assumption.

A.2 RotatE
For RotatE, its translational assumption is formulated by

$$F_r(\mathbf{h}, \mathbf{t}) = \|\mathbf{h} \circ \mathbf{r} - \mathbf{t}\|_{1/2} = 0. \tag{6}$$

In this case, we can obtain a unique solution of $\mathbf{h}$ by the following steps:

$$\|\mathbf{h} \circ \mathbf{r} - \mathbf{t}\|_{1/2} = 0 \tag{7}$$
$$\implies \mathbf{h} \circ \mathbf{r} - \mathbf{t} = \mathbf{0} \tag{8}$$
$$\implies \mathbf{h} = \mathbf{t} \circ \mathbf{r}^{-1}. \tag{9}$$

This computed $\mathbf{h}$ is the optimal estimation under the translational assumption.
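As a quick numerical sanity check of the two derivations above (a sketch under our own toy setup; the paper provides no code), the closed-form estimates satisfy the assumptions exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8

# TransE: with t and r fixed, h = t - r is the unique zero of ||h + r - t||.
t = rng.normal(size=d)
r = rng.normal(size=d)
h = t - r
assert np.allclose(h + r - t, 0.0)

# RotatE: embeddings are complex and each relation coordinate has unit
# modulus, so the element-wise inverse r^-1 is the complex conjugate,
# and h = t o r^-1 is the unique zero of ||h o r - t||.
r_c = np.exp(1j * rng.uniform(0.0, 2.0 * np.pi, size=d))  # |r_i| = 1
t_c = rng.normal(size=d) + 1j * rng.normal(size=d)
h_c = t_c * np.conj(r_c)
assert np.allclose(h_c * r_c - t_c, 0.0)
print("InvTransE and InvRotatE estimates fit the assumptions exactly.")
```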
A.3 TransH

For TransH, its translational assumption is formulated by

$$F_r(\mathbf{h}, \mathbf{t}) = \left\| (\mathbf{h} - \mathbf{w}_r^\top \mathbf{h} \mathbf{w}_r) + \mathbf{r} - (\mathbf{t} - \mathbf{w}_r^\top \mathbf{t} \mathbf{w}_r) \right\|_{1/2} = 0, \tag{10}$$

where $\mathbf{w}_r$ is the unit normal vector of the plane $P$ that $\mathbf{r}$ lies on. From the translational assumption, we can derive the following equations:

$$\left\| (\mathbf{h} - \mathbf{w}_r^\top \mathbf{h} \mathbf{w}_r) + \mathbf{r} - (\mathbf{t} - \mathbf{w}_r^\top \mathbf{t} \mathbf{w}_r) \right\|_{1/2} = 0 \tag{11}$$
$$\implies (\mathbf{h} - \mathbf{w}_r^\top \mathbf{h} \mathbf{w}_r) + \mathbf{r} - (\mathbf{t} - \mathbf{w}_r^\top \mathbf{t} \mathbf{w}_r) = \mathbf{0} \tag{12}$$
$$\implies \mathbf{h} - \mathbf{w}_r^\top \mathbf{h} \mathbf{w}_r = (\mathbf{t} - \mathbf{w}_r^\top \mathbf{t} \mathbf{w}_r) - \mathbf{r} \triangleq \mathbf{v}. \tag{13}$$

$\mathbf{h} - \mathbf{w}_r^\top \mathbf{h} \mathbf{w}_r$ is the projection of $\mathbf{h}$ on the plane $P$. From the translational assumption, we can only deduce that the projection of $\mathbf{h}$ is equal to $\mathbf{v}$. However, there exist infinitely many possible $\mathbf{h}$ that can satisfy this condition. Therefore, the solution of $\mathbf{h}$ is not unique, and we cannot obtain a specific estimation formula from the translational assumption of TransH.
A.4 TransR

For TransR, its translational assumption is formulated by

$$F_r(\mathbf{h}, \mathbf{t}) = \|\mathbf{M}_r \mathbf{h} + \mathbf{r} - \mathbf{M}_r \mathbf{t}\|_{1/2} = 0, \tag{14}$$

where $\mathbf{M}_r$ is an $r$-specific matrix. From the translational assumption, we can derive the following equations:

$$\|\mathbf{M}_r \mathbf{h} + \mathbf{r} - \mathbf{M}_r \mathbf{t}\|_{1/2} = 0 \tag{15}$$
$$\implies \mathbf{M}_r \mathbf{h} + \mathbf{r} - \mathbf{M}_r \mathbf{t} = \mathbf{0} \tag{16}$$
$$\implies \mathbf{M}_r \mathbf{h} = \mathbf{M}_r \mathbf{t} - \mathbf{r} \triangleq \mathbf{v}. \tag{17}$$

In this case, we derive a system of linear equations from the translational assumption. In this system, a solution for $\mathbf{h}$ exists if and only if the rank of the coefficient matrix $\mathbf{M}_r$ is equal to the rank of the augmented matrix $[\mathbf{M}_r; \mathbf{v}]$, and the solution is unique only if $\mathbf{M}_r$ additionally has full column rank. However, $\mathbf{M}_r$ is automatically learned by TransR without these restrictions. Therefore, we cannot guarantee that there exists a unique solution for $\mathbf{h}$, and we cannot obtain a specific estimation formula from the translational assumption of TransR.
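The rank argument above can be checked numerically; in the following sketch the dimensions are our illustrative assumption (TransR commonly projects a $d$-dimensional entity space into a $k$-dimensional relation space):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 6, 4                          # entity dim d, relation dim k
M_r = rng.normal(size=(k, d))        # learned without rank constraints
v = rng.normal(size=k)               # stands in for v = M_r t - r

rank_A = np.linalg.matrix_rank(M_r)
rank_Av = np.linalg.matrix_rank(np.hstack([M_r, v[:, None]]))

# A solution h exists iff rank(M_r) == rank([M_r; v]); it is unique only
# if rank(M_r) == d, which a k x d matrix with k < d can never satisfy.
print("solvable:", rank_A == rank_Av, "| unique:", rank_A == d)
```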
B Details of Datasets

Dataset           |K_train|   |K_valid|   |K_aux|    |K_test|   |R|     |E|      |E'|
FB15k-Head-10     108,854     11,339      249,798    2,811      1,170   10,336   2,082
FB15k-Tail-10     99,783      10,190      261,341    2,987      1,126   10,603   1,934
WN11-Head-1000    108,197     4,561       1,938      955        11      37,700   340
WN11-Head-3000    99,963      4,068       5,311      2,686      11      36,646   985
WN11-Head-5000    92,309      3,688       8,048      4,252      11      35,560   1,638
WN11-Tail-1000    96,968      3,864       6,674      852        11      36,771   811
WN11-Tail-3000    78,812      2,851       12,824     2,061      11      33,800   1,874
WN11-Tail-5000    68,040      2,258       15,414     2,968      11      31,311   2,589
WN11-Both-1000    93,683      3,625       7,875      873        11      36,277   1,136
WN11-Both-3000    71,618      2,436       14,453     2,242      11      32,254   2,805
WN11-Both-5000    58,923      1,788       16,660     3,218      11      28,979   3,934
Table 4: Statistics of the datasets with OOKG entities. These datasets are built based on FB15k or WN11 and named in the form of "Base-Pos-Num". Base denotes the base dataset. Pos denotes the position of OOKG entities in test triplets. Num distinguishes different numbers of OOKG entities, represented by |E'|.

For link prediction, we use two datasets released by Wang et al. (2019): FB15k-Head-10 and FB15k-Tail-10. These two datasets are built based on FB15k (Bordes et al., 2013). For triplet classification, we use nine datasets released by Hamaguchi et al. (2017): WN11-Head-1000, WN11-Head-3000, WN11-Head-5000, WN11-Tail-1000, WN11-Tail-3000, WN11-Tail-5000, WN11-Both-1000, WN11-Both-3000, and WN11-Both-5000. These nine datasets are built based on WN11 (Socher et al., 2013). Each of the datasets mentioned above is composed of four sets: a training set, an auxiliary set, a validation set, and a test set. Each triplet in the training and validation sets contains only IKG entities. Each triplet in the auxiliary set contains an OOKG entity and an IKG entity. Each triplet in the test set contains at least one OOKG entity. The statistics of the datasets are shown in Table 4.
C Details of Experimental Settings
Datasets       d       γ      α     n     L2     Training Steps
FB15k-based    1,000   24.0   1.0   256   N/A    100,000
WN11-based     300     0.5    1.0   128   –      –

Table 5: Hyper-parameters for the two categories of datasets. We use the same hyper-parameters for the two FB15k-based datasets and the same hyper-parameters for the nine WN11-based datasets. On each dataset, we use the same hyper-parameters for the two pretrained models. d denotes the embedding dimension, γ the margin, α the sampling temperature, and n the negative sampling size. L2 denotes the coefficient of L2 regularization, where N/A means no regularization.

To pretrain the TransE and RotatE models, we adopt the self-adversarial negative sampling loss proposed by Sun et al. (2019) in consideration of its good performance on training TransE and RotatE. The self-adversarial negative sampling loss $L$ is formulated as

$$L = -\log \sigma(\gamma - D(\mathbf{h}, \mathbf{r}, \mathbf{t})) - \sum_{i=1}^{n} p(h'_i, r, t'_i) \log \sigma\left(D(\mathbf{h}'_i, \mathbf{r}, \mathbf{t}'_i) - \gamma\right),$$

where $\sigma$ is the sigmoid function, $\gamma$ is the margin, $n$ is the negative sampling size, and $(h'_i, r, t'_i)$ is the $i$-th negative sample triplet. $D(\cdot)$ is the distance function: $D(\mathbf{h}, \mathbf{r}, \mathbf{t})$ is equal to $\|\mathbf{h} + \mathbf{r} - \mathbf{t}\|_{1/2}$ for TransE and to $\|\mathbf{h} \circ \mathbf{r} - \mathbf{t}\|_{1/2}$ for RotatE. $p$ is the self-adversarial weighting function, which gives more weight to high-scored negative samples:

$$p(h'_i, r, t'_i) \propto \exp\left(\alpha \cdot F(\mathbf{h}'_i, \mathbf{r}, \mathbf{t}'_i)\right),$$

where $\alpha$ is a hyper-parameter called the sampling temperature to be tuned, and $F(\cdot)$ is the score function, which is equal to $-D(\cdot)$.

We conduct each experiment on a single Nvidia GeForce GTX 1080 Ti GPU and tune hyper-parameters on the validation set. Generally, we use Adam (Kingma and Ba, 2015) as the optimizer. We choose the correlation-based weights for link prediction and the degree-based weights for triplet classification. Other hyper-parameters are shown in Table 5.
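For reference, a minimal NumPy sketch of this loss (our illustrative reimplementation of the formula above, treating the self-adversarial weights as fixed constants; not the authors' training code):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def self_adversarial_loss(d_pos, d_neg, gamma, alpha):
    """Self-adversarial negative sampling loss of Sun et al. (2019).
    d_pos: distance D(h, r, t) of the positive triplet (scalar).
    d_neg: distances of the n negative triplets (shape (n,)).
    gamma: margin; alpha: sampling temperature."""
    # p(h'_i, r, t'_i) is proportional to exp(alpha * F) with score F = -D
    p = np.exp(alpha * -d_neg)
    p = p / p.sum()
    positive_term = -np.log(sigmoid(gamma - d_pos))
    negative_term = -np.sum(p * np.log(sigmoid(d_neg - gamma)))
    return positive_term + negative_term

# Toy usage with the FB15k-based margin from Table 5.
print(self_adversarial_loss(d_pos=0.5,
                            d_neg=np.array([10.0, 18.0, 30.0]),
                            gamma=24.0, alpha=1.0))
```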
D Complete Evaluation Results of Triplet Classification

Method       WN11-Head    WN11-Tail    WN11-Both
InvTransE    89.2         87.8         87.0

Table 6: Complete evaluation results (Accuracy) of triplet classification.