Exploring the Limits of Few-Shot Link Prediction in Knowledge Graphs
Dora Jambor, Komal Teru, Joelle Pineau, William L. Hamilton
School of Computer Science, McGill University, Canada; Quebec AI Institute (Mila), Canada; Facebook AI Research (FAIR)
{dora.jambor, komal.teru}@mail.mcgill.ca, {jpineau, wlh}@cs.mcgill.ca

Abstract
Real-world knowledge graphs are often characterized by low-frequency relations—a challenge that has prompted an increasing interest in few-shot link prediction methods. These methods perform link prediction for a set of new relations, unseen during training, given only a few example facts of each relation at test time. In this work, we perform a systematic study on a spectrum of models derived by generalizing the current state of the art for few-shot link prediction, with the goal of probing the limits of learning in this few-shot setting. We find that a simple zero-shot baseline—which ignores any relation-specific information—achieves surprisingly strong performance. Moreover, experiments on carefully crafted synthetic datasets show that having only a few examples of a relation fundamentally limits models from using fine-grained structural information and only allows for exploiting the coarse-grained positional information of entities. Together, our findings challenge the implicit assumptions and inductive biases of prior work and highlight new directions for research in this area.
1 Introduction

A knowledge graph (KG) is a multi-relational graph that offers a structured way to organize facts about the world. Encoder-decoder approaches are commonly used to predict new facts from existing ones: entities and relations are embedded in a low-dimensional vector space via an encoder, and the likelihood of observing a new fact is then scored via a decoder (Nickel et al., 2015; Bordes et al., 2013; Trouillon et al., 2017; Dettmers et al., 2018).

It is well known that the performance of these methods can drop significantly when predicting relations that are observed in only a few example facts. However, link prediction for these low-frequency relations is very important: not only are these relations abundant in most knowledge graphs, they are also key for knowledge graph completion tasks, where new relations may appear after model training.
Figure 1: The few-shot link prediction task. Given a support set of K example facts for a new relation, e.g., (Richard Feynman, born_in, United States), (Albert Einstein, born_in, Germany), (Galileo Galilei, born_in, Italy), the model must transfer knowledge to predict query-set test facts such as (Paul Erdős, born_in, ?).
To study this low-frequency regime, Xiong et al. (2018) created the Nell-One and Wiki-One benchmarks, where the task is to predict new facts for a set of new relations at test time, and each relation is observed only a few times (as specified by some small fixed number $K$). Previous approaches have shown promising results using metric-based (Xiong et al., 2018) and gradient-based meta-learning techniques (Chen et al., 2019). However, we argue that the current task formulation limits these models to exploiting the coarse-grained positional signals (i.e., nodes belonging to the same community) that are abundant in these benchmarks, rather than leveraging structural signals (e.g., transitivity, symmetry).

Present work. In this work, we take a critical look at current approaches for few-shot link prediction over knowledge graphs. We posit that current meta-learning based approaches benefit largely from positional signals in entities, rather than utilizing information about the low-frequency relations. We corroborate these insights by conducting a systematic study on a spectrum of models with decreasing complexity. Interestingly, we find that a much simpler zero-shot variant of the state of the art—devoid of any meta-learning scheme—yields surprisingly competitive results, while not consuming any example facts about a relation. Motivated by these observations, we design a set of null models tailored to the different learning signals a model might utilize to drive effective link prediction. Empirically, we validate that existing meta-learning models are ill-equipped to infer logical patterns about the few-shot relations. These findings bring forth the shortcomings of the current task formulation and raise new questions in both task and model design, while highlighting new directions of research in few-shot link prediction.

2 Few-shot link prediction

2.1 Task formulation

The goal of few-shot link prediction is to predict missing links for a new relation by observing only $K$ example triples of that relation (Figure 1). Following the literature on few-shot classification (Ravi and Larochelle, 2016; Vinyals et al., 2016; Snell et al., 2017), we organize our dataset as a set of tasks, where a task corresponds to predicting links for a new relation. The sets of tasks for training and testing are disjoint, with the added constraint that entities in the test tasks are a subset of the entities in the train tasks. Let $V$ denote the set of entities in the knowledge graph. For each new relation $r_i$, we construct a support set $S_i = \{(h_k, r_i, t_k)\}_{k=1}^{K}$ containing $K$ example entity pairs $h_k, t_k \in V$ connected by relation $r_i$, and a query set $Q_i = \{(h_j, r_i, ?)\}_{j=1}^{J}$ containing $J$ query triples over entities in $V$. As shown in Figure 1, the goal is then to learn how to extract knowledge from the support set such that we can predict the missing tail entities in the query set.
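To make the episode setup concrete, the following is a minimal Python sketch of how a task could be represented; the FewShotTask container and make_task helper are hypothetical names of ours, not part of any released benchmark code.

    from dataclasses import dataclass
    from typing import List, Tuple

    Triple = Tuple[str, str, str]  # (head entity, relation, tail entity)

    @dataclass
    class FewShotTask:
        relation: str
        support: List[Triple]  # the K example facts S_i for this relation
        queries: List[Triple]  # the J facts Q_i whose tails must be predicted

    def make_task(relation: str, facts: List[Triple], k: int) -> FewShotTask:
        # Split all known facts of a new relation into a K-shot support
        # set and a disjoint query set.
        assert len(facts) > k, "need more than K facts to form a query set"
        return FewShotTask(relation, support=facts[:k], queries=facts[k:])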
2.2 A general framework

The foundation of our analysis is a generalization of the current state-of-the-art gradient-based meta-learning approach (Chen et al., 2019). This approach follows the encoder-decoder paradigm of embedding-based knowledge graph completion methods (Hamilton et al., 2017), where the entities and relations are embedded in a low-dimensional vector space and the embeddings are used to predict the likelihood of a given triple.

Encoder functions. The key idea in few-shot learning is to transfer knowledge from the support set to the query set by learning a function $\mathrm{RelLearner} : S_i \mapsto \mathbb{R}^d$ that maps a support set $S_i$, which characterizes the relation $r_i$, to a low-dimensional embedding. The entities themselves are embedded via an encoder function $E : V \mapsto \mathbb{R}^d$:

$$r_i = \mathrm{RelLearner}\big(\{(E(h_k), E(t_k))\}_{k=1}^{K}\big). \qquad (1)$$

The RelLearner function can vary from a simple MLP (Hastie et al., 2009) to more complicated recurrent architectures (Rumelhart et al., 1985; Jordan, 1997; Hochreiter and Schmidhuber, 1997). Further, the entity encoder $E$ can vary from TransE-style embeddings (Bordes et al., 2013; Sun et al., 2019) to a graph neural network (Schlichtkrull et al., 2018) that explicitly leverages the neighborhood information around entities.

Decoder and loss function. A decoder function ingests the embeddings of the entities $h, t$ and of the relation $r$ to score the likelihood of a given triple $(h, r, t)$. Using a simple TransE decoder (Bordes et al., 2013), the model is optimized to score positive triples higher than negative triples using a contrastive loss $\mathcal{L}$ (Dyer, 2014). In the few-shot setting we compute the support set loss $\mathcal{L}(S_i)$ and the final query set loss $\mathcal{L}(Q_i)$, which are used to update the model parameters (Chen et al., 2019).

Meta-gradient update. Instead of directly using the relation embedding $r_i$ from Equation (1) to compute the final query loss $\mathcal{L}(Q_i)$, we first update the relation embedding using the gradient of the support set loss $\mathcal{L}(S_i)$:

$$r_i' = r_i - \eta \nabla_{r_i} \mathcal{L}(S_i), \qquad (2)$$

where $\eta$ denotes the learning rate. This update encourages $r_i$ to effectively predict the support set triples by minimizing $\mathcal{L}(S_i)$.
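The following PyTorch sketch shows one way Equations (1) and (2) fit together; class and function names are ours, mean-pooling over the support pairs is one simple choice of RelLearner, and none of this should be read as the authors' released implementation.

    import torch
    import torch.nn as nn

    class FewShotKGModel(nn.Module):
        def __init__(self, num_entities: int, dim: int):
            super().__init__()
            self.entity_emb = nn.Embedding(num_entities, dim)  # encoder E
            # RelLearner: a 2-layer MLP over concatenated (head, tail) embeddings
            self.rel_learner = nn.Sequential(
                nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim)
            )

        def relation_from_support(self, heads, tails):
            # Equation (1): embed the K support pairs, then pool them into r_i.
            pairs = torch.cat([self.entity_emb(heads), self.entity_emb(tails)], dim=-1)
            return self.rel_learner(pairs).mean(dim=0)

        def score(self, heads, rel, tails):
            # TransE decoder: -||h + r - t||, so larger means more plausible.
            return -torch.norm(self.entity_emb(heads) + rel - self.entity_emb(tails), dim=-1)

    def margin_loss(model, rel, pos, neg, margin=1.0):
        # Contrastive loss: positive triples should outscore corrupted ones.
        return torch.relu(
            margin - model.score(pos[0], rel, pos[1]) + model.score(neg[0], rel, neg[1])
        ).mean()

    def meta_step(model, support_pos, support_neg, eta=1.0):
        # Equation (2): refine r_i with one gradient step on the support loss
        # before it is used to compute the query loss L(Q_i).
        r_i = model.relation_from_support(*support_pos)
        loss_s = margin_loss(model, r_i, support_pos, support_neg)
        (grad,) = torch.autograd.grad(loss_s, r_i, create_graph=True)
        return r_i - eta * grad

Here support_pos and support_neg are (heads, tails) index-tensor pairs. Passing create_graph=True keeps the update differentiable, so the query loss can backpropagate through the meta-gradient step.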
2.3 Model variants

Our objective is to probe how much models leverage the support set to perform the query task. To this end, we perform a systematic study of different model variants, each of which falls into the general framework described in Section 2.2.

MetaR follows Chen et al. (2019), where the
RelLearner is defined as a 2-layer MLP (Hastie et al., 2009) over the support set entity embeddings. The encoder $E$ simply maps each entity to a fixed learnable vector, as in Bordes et al. (2013).

SharedEmbed skips Equation (1) and instead sets $r_i = r_g$, where $r_g$ is a single learnable embedding shared across all relations. We propose this modification to measure the effect of representing all relations by the same embedding $r_g$, where the only information from the support set comes via the gradient update in Equation (2).

ZeroShot further removes the meta-gradient update in Equation (2) and lets $r_i' = r_g$. This effectively reduces the model to performing zero-shot link prediction on the relation's query set, without any relation-specific information from the support set.

R-GCN uses the same RelLearner as in MetaR, with the exception that the support set entity embeddings are learned via a multi-relational graph neural network, R-GCN (Schlichtkrull et al., 2018), instead of TransE-style embeddings. R-GCN learns entity representations by aggregating the 2-hop neighborhood of a given entity. With this model we probe the extent to which injecting structural bias into entity representations can influence link prediction performance.
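Continuing the sketch from Section 2.2, the four variants differ only in how the relation embedding handed to the decoder is produced; the variant flag and the shared vector r_g below are our own notation, not the paper's.

    def relation_embedding(model, variant, support_pos, support_neg, r_g, eta=1.0):
        if variant == "ZeroShot":
            return r_g  # no support-set information whatsoever
        if variant == "SharedEmbed":
            r_i = r_g  # skip Equation (1); all relations share one embedding
        else:  # "MetaR" or "R-GCN" (the latter swaps in a GNN encoder E)
            r_i = model.relation_from_support(*support_pos)
        # Equation (2): for SharedEmbed, this gradient step is the only
        # channel through which the support set can inform the prediction.
        loss_s = margin_loss(model, r_i, support_pos, support_neg)
        (grad,) = torch.autograd.grad(loss_s, r_i, create_graph=True)
        return r_i - eta * grad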
                     MRR             Hits@10         Hits@5          Hits@1
                 1-shot  5-shot   1-shot  5-shot   1-shot  5-shot   1-shot  5-shot
Nell-One
  MetaR          0.294   0.323    0.464   0.500    0.398   0.426    0.201   0.230
  SharedEmbed    0.276   0.311    0.454   0.495    0.382   0.420    0.173   0.205
  ZeroShot       0.199   0.219    0.342   0.365    0.283   0.303    0.116   0.136
  R-GCN          0.216   0.267    0.412   0.464    0.316   0.366    0.120   0.172
Wiki-One*
  MetaR          0.325   0.326    0.448   0.408    0.408   0.367    0.258   0.280
  SharedEmbed    0.290   0.311    0.399   0.415    0.348   0.378    0.238   0.254
  ZeroShot       0.279   0.289    0.361   0.367    0.337   0.341    0.234   0.246
  R-GCN          0.126   0.137    0.178   0.237    0.130   0.152    0.101   0.104

Table 1: Average metrics on Nell-One and Wiki-One few-shot link prediction tasks. *For Wiki-One we used pre-trained embeddings, similar to Chen et al. (2019).
3 Null models

In order to probe and understand the performance of different models, we introduce two null models, which are used to generate synthetic data satisfying certain properties. Motivated by recent literature on position- versus structure-aware methods in relational learning (You et al., 2019; Srinivasan and Ribeiro, 2020), we test the models' ability to learn from two key sources of information: structural information and positional information. In the context of knowledge graphs, structural information corresponds to fine-grained relational semantics: the logical patterns that are extracted by state-of-the-art rule induction systems such as RuleN (Meilicke et al., 2018). Positional information, on the other hand, corresponds to the coarse-grained community structure of the nodes in the graph; that is, two nodes are positionally 'close' if they belong to the same community (Newman, 2018).
The first type of null model contains synthetic relations that satisfy simple logical properties. For the sake of exposition, we focus on two simple logical patterns: symmetry and transitivity. For the purposes of all synthetic data generation, we only consider the largest connected component of the respective datasets, denoted $G_L$.

Synthetic symmetric relations. To generate $N$ edges connected by a symmetric relation $r^*_s$, we repeat the following steps $N$ times:
1. Uniformly sample a pair of unique entities $h_i, t_i$ from all the entities in $G_L$.
2. Add the two edges $(h_i, r^*_s, t_i)$ and $(t_i, r^*_s, h_i)$ to the set of synthetic symmetric edges.

Synthetic transitive relations. To sample $N$ edges connected by a transitive relation $r^*_t$, we generate three edges at a time. In particular, we repeat the following steps $N$ times:
1. Uniformly sample three unique entities $e_1$, $e_2$, and $e_3$ from all the entities in $G_L$.
2. Add the three edges $(e_1, r^*_t, e_2)$, $(e_2, r^*_t, e_3)$, and $(e_1, r^*_t, e_3)$ to our collection.

The second type of null model focuses on generating synthetic relations that depend on the underlying community structure of the graph. We call these relations positional because they depend on the relative global position of the entities, rather than on local structural properties. We first cluster the largest connected component $G_L$ into $K$ communities using a standard algorithm originally proposed by Blondel et al. (2008). Let $\{C_i\}_{i=1}^{K}$ denote the set of communities generated, where each community is a set of entities from $G_L$. To generate $N$ synthetic edges for a positional relation $r^*_p$, we repeat the following steps $N$ times (a compact sketch of all three generators is given below):
1. Uniformly sample a community index $i$ from the set $\{1, \ldots, K\}$.
2. Uniformly sample two unique entities $h, t$ from community $C_i$ and add $(h, r^*_p, t)$ to the set.
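Below is a compact Python sketch of the three generators; it assumes networkx (version 3 or later) for graph handling and its Louvain implementation of the Blondel et al. (2008) algorithm, though the paper does not tie the procedure to any particular library, so treat the calls as illustrative.

    import random
    import networkx as nx

    def symmetric_edges(G_L, N, rel="r_sym"):
        # Each round emits a fact and its inverse, so `rel` is symmetric
        # by construction.
        nodes, edges = list(G_L.nodes), []
        for _ in range(N):
            h, t = random.sample(nodes, 2)
            edges += [(h, rel, t), (t, rel, h)]
        return edges

    def transitive_edges(G_L, N, rel="r_trans"):
        # Each round emits a logically consistent transitive triangle.
        nodes, edges = list(G_L.nodes), []
        for _ in range(N):
            e1, e2, e3 = random.sample(nodes, 3)
            edges += [(e1, rel, e2), (e2, rel, e3), (e1, rel, e3)]
        return edges

    def positional_edges(G_L, N, rel="r_pos"):
        # Facts only ever hold within a community, so `rel` carries purely
        # positional (community-membership) signal.
        communities = [list(c) for c in nx.community.louvain_communities(G_L)]
        communities = [c for c in communities if len(c) >= 2]
        edges = []
        for _ in range(N):
            h, t = random.sample(random.choice(communities), 2)
            edges.append((h, rel, t))
        return edges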
Figure 2: Average Hits@10 on synthetically generated relations using our proposed null models (panels: 1-shot and 5-shot on Nell-One and Wiki-One; bars: Symmetric, Transitive, Positional).

Figure 3: Pearson's R between MRR and the log frequency of support set entities in the training graph (x-axis: log frequency of support set entities; y-axis: average MRR). We observe a strong correlation for Nell-One, but not for Wiki-One. For more details, see Appendix C.
4 Experiments

We followed the same experimental setup as in Chen et al. (2019), as described in Appendix A. We conducted our experiments on the Nell-One and Wiki-One datasets; for more details on these benchmarks, we refer the reader to Table 1 in Xiong et al. (2018). Similar to earlier work, we report MRR, Hits@1, Hits@5, and Hits@10 on our test relations, using a type-constrained candidate set. (The datasets can be downloaded under this link.)

As shown in Table 1, for Nell-One we find that the SharedEmbed model yields performance competitive with MetaR, with Hits@10 of 45.4% and 49.5%, as compared to MetaR's Hits@10 of 46.4% and 50.0%, for 1- and 5-shot respectively. The same observation holds for Wiki-One, where SharedEmbed yields 39.9% and 41.5% Hits@10, compared to MetaR's 44.8% and 40.8%, for 1- and 5-shot respectively. It is surprising how competitive SharedEmbed is, given that the only relation-specific information the model gets to observe comes via the meta-gradient update in Equation (2). In fact, we find that even in the absence of this gradient signal, i.e., without any relation-specific information, ZeroShot performs relatively well, with Hits@10 of 34.2% and 36.5% on Nell-One, and 36.1% and 36.7% on Wiki-One.

The nontrivial performance of these simple models suggests that they may exploit easily accessible positional signals around entities, without the need to learn meaningful representations for relations. Indeed, Figure 3 shows a high correlation between performance and the frequency of support set entities for Nell-One. We reconcile this observation by noting that as models observe more signals about entities, they rely less on the support set, and thus on the relation representations. Furthermore, contrary to our expectation, even when we equip models with structural biases, as done via an R-GCN, they do not yield better results.
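For reference, the ranking metrics reported above can be computed per query as in the generic sketch below; scoring and the type-constrained candidate filtering are assumed to happen upstream, and this is not the exact evaluation code used in our experiments.

    import numpy as np

    def ranking_metrics(scores: np.ndarray, true_idx: int) -> dict:
        # `scores` holds one score per type-constrained candidate tail;
        # the rank of the true tail is 1 plus the number of strictly
        # better-scored candidates (ties are resolved optimistically here).
        rank = 1 + int(np.sum(scores > scores[true_idx]))
        return {
            "MRR": 1.0 / rank,
            "Hits@1": float(rank <= 1),
            "Hits@5": float(rank <= 5),
            "Hits@10": float(rank <= 10),
        }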
Null model experiments. We probed the above trained models on the synthetically generated test tasks, following the procedure discussed in Section 3. As shown in Figure 2, we find a consistent trend: these models yield higher performance on tasks that rely on positional signals than on tasks that require logical inference. Indeed, in the current task formulation, where we are given a support set of K randomly sampled examples, it is unlikely that logically consistent patterns will be captured in the K-shot examples. For example, seeing conclusive evidence of transitivity when given only a small random sample of tuples is highly unlikely. In fact, as we show in Appendix B, one provably cannot learn certain logical patterns for some values of K in the K-shot setting.
5 Conclusion

We conducted a systematic study of various models to probe their limits in performing few-shot link prediction. Our experiments on both synthetic and real data show that the current task formulation encourages models to rely mainly on positional information around entities, rather than leveraging logical signals about relations. In fact, we empirically show that having only K examples of a relation fundamentally limits the types of logical patterns that can be learned. We argue that future work in few-shot link prediction should allow for a more careful construction of the support set, to scaffold the use of logical patterns in few-shot learning.

Acknowledgments
The authors would like to thank Priyesh Vijayan, Joey Bose, Lu Liu, and other members of the RLLab and Mila for their invaluable feedback and useful discussions on early drafts. Furthermore, the authors are grateful to the anonymous reviewers for their comments on the first draft of the paper. This research was supported in part by Canada CIFAR AI Chairs held by Prof. Pineau and Prof. Hamilton, as well as gift grants from Microsoft Research and Samsung AI.
References
Vincent D. Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. 2008. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10):P10008.

Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. 2013. Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems, pages 2787–2795.

Mingyang Chen, Wen Zhang, Wei Zhang, Qiang Chen, and Huajun Chen. 2019. Meta relational learning for few-shot link prediction in knowledge graphs. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4217–4226. Association for Computational Linguistics.

Tim Dettmers, Pasquale Minervini, Pontus Stenetorp, and Sebastian Riedel. 2018. Convolutional 2D knowledge graph embeddings. In Thirty-Second AAAI Conference on Artificial Intelligence.

Chris Dyer. 2014. Notes on noise contrastive estimation and negative sampling. arXiv preprint arXiv:1410.8251.

William L. Hamilton, Rex Ying, and Jure Leskovec. 2017. Representation learning on graphs: Methods and applications. IEEE Data Engineering Bulletin.

Trevor Hastie, Robert Tibshirani, and Jerome Friedman. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Science & Business Media.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Michael I. Jordan. 1997. Serial order: A parallel distributed processing approach. In Advances in Psychology, volume 121, pages 471–495. Elsevier.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. CoRR, abs/1412.6980.

Christian Meilicke, Manuel Fink, Yanjie Wang, Daniel Ruffinelli, Rainer Gemulla, and Heiner Stuckenschmidt. 2018. Fine-grained evaluation of rule- and embedding-based systems for knowledge graph completion. In International Semantic Web Conference, pages 3–20. Springer.

Mark Newman. 2018. Networks. Oxford University Press.

Maximilian Nickel, Kevin Murphy, Volker Tresp, and Evgeniy Gabrilovich. 2015. A review of relational machine learning for knowledge graphs. Proceedings of the IEEE, 104(1):11–33.

Sachin Ravi and Hugo Larochelle. 2016. Optimization as a model for few-shot learning. In International Conference on Learning Representations.

David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. 1985. Learning internal representations by error propagation. Technical report, California Univ San Diego La Jolla Inst for Cognitive Science.

Michael Schlichtkrull, Thomas N. Kipf, Peter Bloem, Rianne van den Berg, Ivan Titov, and Max Welling. 2018. Modeling relational data with graph convolutional networks. In European Semantic Web Conference, pages 593–607. Springer.

Jake Snell, Kevin Swersky, and Richard Zemel. 2017. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pages 4077–4087.

Balasubramaniam Srinivasan and Bruno Ribeiro. 2020. On the equivalence between positional node embeddings and structural graph representations. In International Conference on Learning Representations.

Zhiqing Sun, Zhi-Hong Deng, Jian-Yun Nie, and Jian Tang. 2019. RotatE: Knowledge graph embedding by relational rotation in complex space. In International Conference on Learning Representations.

Théo Trouillon, Christopher R. Dance, Éric Gaussier, Johannes Welbl, Sebastian Riedel, and Guillaume Bouchard. 2017. Knowledge graph completion via complex tensor factorization. The Journal of Machine Learning Research, 18(1):4735–4772.

Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. 2016. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pages 3630–3638.

Wenhan Xiong, Mo Yu, Shiyu Chang, Xiaoxiao Guo, and William Yang Wang. 2018. One-shot relational learning for knowledge graphs. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1980–1990. Association for Computational Linguistics.

Jiaxuan You, Rex Ying, and Jure Leskovec. 2019. Position-aware graph neural networks. In Proceedings of the 36th International Conference on Machine Learning, PMLR 97.

A Hyperparameters
As discussed, we followed the experimental setup described in Chen et al. (2019). We used the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 0.001, using a 1:3 ratio of positive to negative samples. During training, we used 3 queries per task on each dataset. We adapted the batch size to 1024, and the number of queries to test on to 3, based on their open-sourced codebase; these hyperparameters yielded the best performing models. Similarly, we used the same train/validation/test relation splits of 51:5:11 and 133:16:34 for Nell-One and Wiki-One, respectively.

For our R-GCN model, we considered [5, 10, 20] as the number of neighbors to sample at each message-passing step, and [2, 4] as the number of bases. Furthermore, we used 2 layers in the R-GCN. Finally, we used 50 and 20 as the R-GCN hidden-layer dimension for Nell-One and Wiki-One, respectively. These hyperparameters partly follow Schlichtkrull et al. (2018) and were decided upon consideration of our available compute infrastructure.

Our models were trained on a single Nvidia 1080Ti GPU, and each model training took between 13 and 18 hours, depending on the model and dataset settings.
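For convenience, the settings above can be collected into a single configuration object; the key names below are our own shorthand and do not mirror any particular codebase.

    # Hypothetical consolidation of the hyperparameters described above.
    CONFIG = {
        "optimizer": "Adam",                  # Kingma and Ba (2015)
        "learning_rate": 1e-3,
        "neg_per_pos": 3,                     # 1:3 positive-to-negative ratio
        "queries_per_task": 3,
        "batch_size": 1024,
        "rgcn": {
            "num_layers": 2,
            "neighbor_samples": [5, 10, 20],  # values searched
            "num_bases": [2, 4],              # values searched
            "hidden_dim": {"Nell-One": 50, "Wiki-One": 20},
        },
    }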
Figure 4: Limits of logical inference in the few-shot domain.
Figure 5: Entity degree distribution of Wiki-One (x-axis: entities; y-axis: number of triples per entity).
B Logical inference in the few-shot domain
Figure 4 shows five example support sets to demonstrate that inferring logical properties such as symmetry and transitivity requires a minimum number of carefully designed K-shot examples. For instance, a 1-shot support set for a symmetric relation contains a single directed fact $(h, r, t)$ and can never contain the reverse fact $(t, r, h)$ that would evidence symmetry; witnessing transitivity likewise requires at least three mutually consistent triples in the support set.
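One way to quantify this (a back-of-the-envelope argument of ours, complementing Figure 4): if a symmetric relation holds between $m$ entity pairs and is stored as $2m$ directed facts, the probability that a uniformly sampled K-shot support set (drawn without replacement) contains both directions of at least one pair, i.e., any direct evidence of symmetry, is

\[
P(\text{evidence}) \;=\; 1 \;-\; \prod_{i=0}^{K-1} \frac{2m - 2i}{2m - i}.
\]

For $K = 1$ this probability is exactly $0$, and it remains small for realistic sizes: with $m = 50$ pairs and $K = 5$ it is roughly $0.10$, so a random 5-shot support set will usually contain no paired facts at all.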