Meta Reasoning over Knowledge Graphs
Hong Wang†, Wenhan Xiong†, Mo Yu‡∗, Xiaoxiao Guo‡∗, Shiyu Chang‡, William Yang Wang†
† University of California, Santa Barbara   ‡ IBM Research
{hongwang600, xwhan, william}@cs.ucsb.edu, [email protected], {xiaoxiao.guo, shiyu.chang}@ibm.com

Abstract
The ability to reason over learned knowledge is an innate ability for humans, and humans can easily master new reasoning rules with only a few demonstrations. While most existing studies on knowledge graph (KG) reasoning assume enough training examples, we study the challenging and practical problem of few-shot knowledge graph reasoning under the paradigm of meta-learning. We propose a new meta-learning framework that effectively utilizes task-specific meta information such as local graph neighbors and reasoning paths in KGs. Specifically, we design a meta-encoder that encodes the meta information into task-specific initialization parameters for different tasks. This allows our reasoning module to have diverse starting points when learning to reason over different relations, which is expected to better fit the target task. On two few-shot knowledge base completion benchmarks, we show that the augmented task-specific meta-encoder yields a much better initial point than MAML and outperforms several few-shot learning baselines.
Introduction

Knowledge graphs (Auer et al., 2007; Bollacker et al., 2008; Vrandecic and Krötzsch, 2014) represent entities' relational knowledge in the form of triples, i.e., (subject, predicate, object), and have been proven essential and helpful in various downstream applications such as question answering (Yao and Van Durme, 2014; Bordes et al., 2015; Yih et al., 2015; Yu et al., 2017). Since most existing KGs are highly incomplete, many studies (Bordes et al., 2013; Trouillon et al., 2016; Lao and Cohen, 2010) have addressed automatically completing KGs, i.e., inferring missing triples. However, most of these studies focus only on frequent relations and ignore relations with limited training samples. In fact, a large portion of KG relations are long-tail, i.e., they have very few instances. It is therefore important to consider knowledge graph completion under the few-shot learning setting, where only limited instances are available for new tasks. Xiong et al. (2018) first propose a graph-network-based metric-learning framework for this problem, but the metric is learned upon graph embeddings and their method does not provide reasoning rationales for its predictions.

In contrast, we propose a meta reasoning agent that learns to make predictions along multi-hop reasoning chains, so the predictions of our model are fully explainable. In this problem setting, each task corresponds to a particular relation, and the goal is to infer the end entity given the start entity (i.e., the query). Following recent work on meta-learning (Ravi and Larochelle, 2017; Finn et al., 2017; Gu et al., 2018; Huang et al., 2018; Sung et al., 2018; Mishra et al., 2018), we aim to learn a reasoning agent that can effectively adapt to new relations with only a few examples. This is quite challenging since the model must learn to leverage its prior learning experience for fast adaptation while avoiding overfitting on the few-shot training examples. The model-agnostic meta-learning algorithm (MAML) (Finn et al., 2017) is a popular and general algorithm for this problem. It aims to learn an initial model that captures the common knowledge shared across tasks so that it can adapt to a new task quickly. One problem with MAML is that it learns only a single initial model, which cannot fit a new task without training and has limited power in the case of diverse tasks (Chen et al., 2019). Another problem is that MAML only learns the common knowledge shared across tasks without taking advantage of the relationships between them, since no task-specific information is used when learning the initial model.

In order to learn the relationships between tasks, the model must be aware of the identity of the current task, such as the query relation in our problem. But simply using the task identity is problematic, since there is no way to initialize the identity of a new task except randomly. We address this problem via a meta-encoder that learns the task representation from meta information which is also available for the new task. Specifically, the meta-encoder encodes the task-specific information and generates a representation of the task that serves as part of the parameters. In this way, different tasks have different representations, and thus different initial models. Moreover, since the representation of the task is available, the model can leverage the relationships between different tasks.
To apply this idea to our problem, we propose two meta-encoders that encode two different kinds of task-specific information. The first uses a neighbor encoder to encode the start entity and the end entity, and then uses the difference between the embeddings of the start entity and the end entity as the task representation. But this task-specific information is not robust when the number of neighbors is small. We therefore propose an alternative for this case, which encodes the paths from the start entity to the end entity. On two constructed few-shot multi-hop reasoning datasets, we show that the augmented meta-encoder yields a much better initial point and outperforms several few-shot learning baselines. The main contributions of this work include:

• We introduce few-shot learning on the task of multi-hop reasoning over knowledge graphs, and present two constructed datasets for this task.

• We propose to use a meta-encoder to encode task-specific information so as to generate a better task-dependent model for the new task.

• We apply a neighbor encoder and a path encoder to leverage the task-specific information in the multi-hop reasoning task, and experiments verify the effectiveness of the augmented meta-encoder.
Related Work

Reasoning over Knowledge Graphs
Knowledge graph reasoning aims to infer the existence of a query relation between two entities. There are two general approaches. Embedding-based approaches (Nickel et al., 2011; Bordes et al., 2013; Yang et al., 2015; Trouillon et al., 2016; Wu et al., 2016) learn representations of the relations and entities in the KG with heuristic self-supervised loss functions, while path-search-based approaches (Lao and Cohen, 2010; Neelakantan et al., 2015; Xiong et al., 2017; Das et al., 2018; Chen et al., 2018; Lin et al., 2018; Shen et al., 2018) solve the problem through multi-hop reasoning, i.e., finding a reasoning path between the two entities. Despite the superior performance of embedding-based methods, they cannot capture the complex reasoning patterns in the KG and lack explainability.

Owing to its explainability, multi-hop reasoning has been investigated extensively in recent years. The Path-Ranking Algorithm (PRA) (Lao and Cohen, 2010) is an early approach that learns random walkers to leverage complex path features. Gardner et al. (2013, 2014) improve upon PRA by computing feature similarity in vector space. Recursive random walks integrate the background KG and text (Wang and Cohen, 2015). Other methods use convolutional neural networks (Toutanova et al., 2015) and recurrent neural networks (Neelakantan et al., 2015). More recently, Xiong et al. (2017) first apply reinforcement learning to learning relational paths. Das et al. (2018) propose a more practical setting of predicting the end entity given the query relation and the start entity. Lin et al. (2018) reshape the rewards using a pre-trained embedding model. Shen et al. (2018) use Monte Carlo Tree Search to overcome the problem of sparse reward.
Meta-learning
Meta-learning aims to achieve fast adaptation to new tasks through meta-training on a set of tasks with abundant training examples. It has been widely applied in few-shot learning settings where limited samples are available (Gu et al., 2018; Huang et al., 2018). One important category of meta-learning approaches is initialization-based methods, which aim to find a good initial model that can quickly adapt to new tasks with limited samples (Finn et al., 2017; Nichol et al., 2018). However, they learn only a single initial model and do not leverage the relationships between tasks. Rusu et al. (2018) propose to learn a data-dependent latent generative representation of the model parameters and conduct the gradient-based adaptation procedure in this latent space. Another related work is the Relation Network (Sung et al., 2018), which consists of an embedding module to encode samples and a relation module to capture the relations between samples.
Background
In this section, we first introduce the multi-hop reasoning task. We then extend it to the meta-learning setting and introduce a popular framework (MAML) for few-shot learning.
In this problem, there is a background graph G and a set of query relations R. Each query relation has its own training and testing triples (e_s, r, e_t), where e_s and e_t are the start and end entities in the KB, and r is the query relation. Given the start entity e_s and the query relation r, the task is to predict the end entity e_t, along with a supporting reasoning path from e_s to e_t in G. The length of the path is fixed, and an additional STOP edge is added for each entity to point at itself so that the model is able to stay at the end entity.

We give an example to better explain this task. Consider the relation Nationality with a training triple (Obama, Nationality, American). Given the start entity and the query relation, (Obama, Nationality), the model is expected to find a path of fixed length in G from Obama to American. A general framework for this problem is to train an agent that, at each step, predicts the next relation based on the current entity, the query relation, and the visited path. In expectation, the agent should give the reasoning path (BornIn, CityIn, ProvinceIn) and predict the end entity as American.
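To make this concrete, here is a minimal sketch of such a fixed-length path-search environment in Python. The toy triples, the intermediate entities (Honolulu, Hawaii), and the trivial first-edge policy are hypothetical illustrations, not part of the paper's datasets or agent.

```python
from collections import defaultdict

STOP = "STOP"

class KGEnvironment:
    """Fixed-length path search over the background graph G; a STOP
    self-loop lets the agent stay once it has reached its answer."""

    def __init__(self, triples, path_length=3):
        self.out_edges = defaultdict(list)      # entity -> [(relation, entity)]
        for s, r, t in triples:
            self.out_edges[s].append((r, t))
        self.path_length = path_length

    def actions(self, entity):
        return self.out_edges[entity] + [(STOP, entity)]

    def rollout(self, start, query_rel, policy):
        entity, path = start, []
        for _ in range(self.path_length):       # the path length is fixed
            rel, entity = policy(entity, query_rel, path)
            path.append(rel)
        return path, entity                      # reward: +1 iff entity is the answer

# Toy walk for the running example (intermediate entities are hypothetical):
env = KGEnvironment([("Obama", "BornIn", "Honolulu"),
                     ("Honolulu", "CityIn", "Hawaii"),
                     ("Hawaii", "ProvinceIn", "American")])
first_edge = lambda entity, query_rel, path: env.actions(entity)[0]
print(env.rollout("Obama", "Nationality", first_edge))
# (['BornIn', 'CityIn', 'ProvinceIn'], 'American')
```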
For the multi-hop reasoning problem, we define a task as the inference of a specific relation's end entity conditioned on the start entity. It is easy to see that each relation forms an individual task. In the meta-learning framework, the tasks are divided into three disjoint sets called the meta-training, meta-dev, and meta-test sets, respectively. The goal of meta-learning is to train an agent that can quickly adapt to the new tasks in the meta-test set with limited data by leveraging prior learning experience.

Following the standard meta-learning setting as in Finn et al. (2017), our setting consists of two phases, the meta-training phase and the meta-test phase. In the meta-training phase, the agent learns on a set of meta-training tasks T = {T_1, T_2, ..., T_N}, where each task T_i has its own training and validation sets, denoted {D_i^train, D_i^valid}. By learning on the meta-training tasks T, the agent is expected to gain some knowledge about the reasoning process, which can help it learn faster on new tasks. In the meta-test phase, the trained agent is evaluated on a set of new tasks in the meta-dev/meta-test task set T' = {T'_1, T'_2, ..., T'_{N'}}. Each task T'_i has its own training and testing sets, denoted {D'_i^train, D'_i^test}, where D'_i^train has only limited training samples. The agent is fine-tuned on each task T'_i using D'_i^train for a fixed number of gradient steps, and is evaluated after each gradient step. The macro-average over all tasks in T' is reported as the meta-learning performance. Note that the number of fine-tuning steps should be chosen according to the model's performance on the meta-dev tasks and then used unchanged on the meta-test tasks, since the limited samples in a new task are not sufficient for choosing a suitable number of fine-tuning steps.

Let f denote the reasoning model in our setting, which maps an observation to an action, i.e., the next relation to be taken. The objective of MAML (Finn et al., 2017) is to find a good model initialization f_θ that can quickly adapt to new tasks after a few adaptation steps. We first introduce the objective function of MAML and then illustrate how to optimize it. Let θ denote the parameters of the current model, and θ'_i denote the updated parameters using samples from task T_i. For example, with one gradient update on T_i, we have

$$\theta'_i = \theta - \alpha \nabla_\theta \mathcal{L}_{\mathcal{T}_i}(f_\theta).$$

The meta-objective is to optimize the performance of f_{θ'_i} across tasks sampled from p(T). More formally,

$$\min_\theta \sum_{\mathcal{T}_i \sim p(\mathcal{T})} \mathcal{L}_{\mathcal{T}_i}\big(f_{\theta - \alpha \nabla_\theta \mathcal{L}_{\mathcal{T}_i}(f_\theta)}\big).$$

To optimize this objective, we sample a batch of tasks T_i ~ p(T). For each task T_i, two subsets of training examples, D_i and D'_i, are sampled independently. D_i is used to compute the updated parameters θ'_i, and θ is then optimized to minimize the objective on D'_i. Formally,

$$\theta'_i = \theta - \alpha \nabla_\theta \mathcal{L}^{D_i}_{\mathcal{T}_i}(f_\theta), \qquad \theta \leftarrow \theta - \beta \nabla_\theta \sum_{\mathcal{T}_i \sim p(\mathcal{T})} \mathcal{L}^{D'_i}_{\mathcal{T}_i}(f_{\theta'_i}).$$

The above optimization requires the computation of second-order gradients, which is computationally expensive.
In practice, a first-order update rule is usually used instead, which has similar performance but requires much less computation (Finn et al., 2017; Nichol et al., 2018):

$$\theta \leftarrow \theta - \beta \nabla_{\theta'_i} \sum_{\mathcal{T}_i \sim p(\mathcal{T})} \mathcal{L}^{D'_i}_{\mathcal{T}_i}(f_{\theta'_i}).$$

MAML learns a single initial model that does not depend on any task-specific information. It works by adapting the initial model through gradient updates on the target task. In other words, the initial model learns some common knowledge shared by the tasks so that it can adapt to new tasks quickly. However, MAML is not able to capture the relationships between different tasks because it lacks task-specific information. One easy way to inject task information is to use the task identity, such as the embedding of the query relation in our KB reasoning problem. But this solution incurs two problems. First, the model will learn some knowledge that applies only to a specific task, which is hard to transfer when adapting to new tasks. Second, when a new task arrives, we cannot easily initialize the task identity, e.g., the embedding of a new query relation. Therefore, we propose to use a meta-encoder to encode the task-specific information, which not only enables the model to learn the relationships between different tasks but also allows the model to adapt to the new task faster, since it can leverage the task-specific information of the new task as well.

Let x and x̂ denote the input data and the task-specific information, respectively. g is the meta-encoder that encodes x̂, and f is the model that takes both x and g(x̂) as inputs to make predictions, i.e., f(x, g(x̂)) is used for prediction. Note that we want g(x̂) to encode information about the whole target task instead of just x itself, so that g(x̂) also benefits other instances x' within the same task T_i, i.e., f(x', g(x̂)) should perform well for any x' ∈ T_i. This matters because the task-specific information may not be available for a testing sample; for example, the end entity we use as task-specific information is not available in new testing samples. To achieve this, we apply a meta-gradient method similar to MAML. Given a task T_i, we sample two subsets of instances, D_i and D'_i. The updated parameters are computed using D_i:

$$\theta'_i = \theta - \alpha \nabla_\theta \mathcal{L}_{\mathcal{T}_i}\big(f_\theta(D_i, g_\theta(\hat{D}_i))\big).$$

The meta-gradient is then computed using D̂_i and D'_i, where D̂_i is used for initialization:

$$\theta \leftarrow \theta - \beta \nabla_\theta \sum_{\mathcal{T}_i \sim p(\mathcal{T})} \mathcal{L}_{\mathcal{T}_i}\big(f_{\theta'_i}(D'_i, g_{\theta'_i}(\hat{D}_i))\big).$$

The first-order update rule can be written as

$$\theta \leftarrow \theta - \beta \nabla_{\theta'_i} \sum_{\mathcal{T}_i \sim p(\mathcal{T})} \mathcal{L}_{\mathcal{T}_i}\big(f_{\theta'_i}(D'_i, g_{\theta'_i}(\hat{D}_i))\big).$$

The details are shown in Algorithm 1.

Algorithm 1: MAML with Meta-Encoder
Require: p(T): the distribution over tasks
Require: α, β: learning rates for adaptation and meta-update
Require: k: the number of adaptation steps
Require: f, g: the reasoning model and the meta-encoder
1: Randomly initialize θ
2: for step = 0 : M−1 do
3:   for each task T_i in a batch sampled from p(T) do
4:     Sample task instances (D_i, D'_i) from T_i
5:     Compute task-specific information D̂_i
6:     Set θ'_i = θ
7:     for j = 0 : k−1 do
8:       θ'_i ← θ'_i − α ∇_{θ'_i} L_{T_i}(f_{θ'_i}(D_i, g_{θ'_i}(D̂_i)))
9:     end for
10:  end for
11:  θ ← θ − β ∇_θ Σ_{T_i ∼ p(T)} L_{T_i}(f_{θ'_i}(D'_i, g_{θ'_i}(D̂_i)))
12: end for

At first, a batch of tasks is sampled. For each task T_i, we sample two subsets of instances (D_i, D'_i) and compute the meta information D̂_i based on D_i, which for the multi-hop reasoning problem is the neighbors of the start and end entities or the reasoning paths between them. Next, the updated parameters θ'_i are computed for each task (lines 7-9). In the meta-update step (line 11), we update θ to minimize the loss of f_{θ'_i} using the new instances D'_i and the task representation D̂_i.

For testing on a new task T'_i, we obtain the task representation g(x̂) based on the few-shot samples x ∈ D'_i^train. We then fine-tune f and g using the data D'_i^train. The model makes predictions on testing samples x' ∈ D'_i^test using f(x', g(x̂)).
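Below is a minimal first-order sketch of Algorithm 1 in PyTorch. The toy loss and the pooling meta-encoder are hypothetical stand-ins: in our model f is the reasoning policy introduced below and g is the neighbor or path encoder; only the update structure follows the algorithm.

```python
import torch

def meta_encoder(params_g, meta_info):
    # Hypothetical g: pool the meta-information vectors into a task embedding.
    (w_g,) = params_g
    return torch.tanh(meta_info.mean(dim=0) @ w_g)

def reasoning_loss(params_f, task_emb, batch):
    # Hypothetical stand-in for the reasoning model f(x, g(x_hat)) and its loss.
    (w_f,) = params_f
    x, y = batch
    return torch.nn.functional.mse_loss((x + task_emb) @ w_f, y)

def meta_step(params, tasks, alpha=0.01, beta=0.001, k=1):
    """One meta-update of Algorithm 1 (first-order).

    params: [w_f, w_g]; tasks: list of (D_i, D_prime_i, D_hat_i), one per task.
    """
    meta_grads = [torch.zeros_like(p) for p in params]
    for D_i, D_prime_i, D_hat_i in tasks:
        # lines 6-9: adapt a copy theta'_i of theta on D_i, feeding it g(D_hat_i)
        adapted = [p.detach().clone().requires_grad_(True) for p in params]
        for _ in range(k):
            loss = reasoning_loss(adapted[:1], meta_encoder(adapted[1:], D_hat_i), D_i)
            grads = torch.autograd.grad(loss, adapted)
            adapted = [(p - alpha * g).detach().requires_grad_(True)
                       for p, g in zip(adapted, grads)]
        # line 11: accumulate the meta-gradient on the held-out instances D'_i
        loss = reasoning_loss(adapted[:1], meta_encoder(adapted[1:], D_hat_i), D_prime_i)
        grads = torch.autograd.grad(loss, adapted)
        meta_grads = [mg + g for mg, g in zip(meta_grads, grads)]
    # first-order meta-update: theta <- theta - beta * sum_i grad_{theta'_i} L
    return [p - beta * mg for p, mg in zip(params, meta_grads)]
```

At meta-test time the same inner loop is reused: g(D̂_i) is computed from the few-shot samples and both f and g are fine-tuned on D'_i^train.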
[Figure 1: The model we use for meta-reasoning over knowledge graphs. a) The general framework of the model: a reasoning agent augmented with a meta-encoder over meta information. b) Our neighbor encoder, whose meta information is the neighbors of the start and end entities, combined as NE_{e_t} − NE_{e_s}. c) Our path encoder, whose meta information is the paths from the start entity to the end entity. Running example: query triple (Barack Obama, Nationality, ???), support triple (Theresa May, Nationality, UK), reasoning path BornInCity → CityIn → ProvinceIn.]

The general framework of our model is shown in Figure 1. The original reasoning agent takes the start entity and the query relation as inputs and outputs the reasoning path and the end entity. But this agent does not work well under the meta-learning setting, where the embedding of a new query relation is hard to initialize. Our method replaces the query relation with a meta-encoder that encodes meta information about the task, information that is also available for a new task. In the following parts, we introduce the reasoning agent and the meta-encoder in more detail.
We use the policy proposed by Das et al. (2018), called MINERVA, which formulates the problem as a reinforcement learning problem. The state is defined as the combination of the query, the answer, and the current location (an entity in the KB). Since the answer is not observed, the observation only includes the query and the current location. The actions are defined as the outgoing edges of the current location. The reward is +1 if the agent reaches the answer, and 0 otherwise.

The policy uses an LSTM to encode the history information, i.e., the visited path:

$$h_t = \mathrm{LSTM}(h_{t-1}, [a_{t-1}; o_t]),$$

where h_{t−1} is the previous hidden state, a_{t−1} is the embedding of the relation chosen at time t−1, and o_t is the embedding of the current entity. The hidden state h_t of the LSTM is then concatenated with the embedding of the current entity o_t and the query relation r_q. The action distribution d_t is computed by applying a softmax to the matching scores between the action embeddings and a projection of the concatenated embedding:

$$d_t = \mathrm{softmax}\big(A_t (W_2\, \mathrm{ReLU}(W_1 [h_t; o_t; r_q]))\big).$$

The model structure is the same as in Das et al. (2018), which uses two linear layers (W_1 and W_2) to encode the observation. The next action is sampled from the action distribution d_t.

We can regard the embedding of the query relation used in the above MINERVA model as the task identity. But when a new task arrives, there is no good way to find an initial embedding for the new query relation that fits the reasoning model well. Therefore, we need a meta-encoder that leverages meta information about the new task and generates the embedding of the query relation, based on which the model can make reasonable predictions. We introduce two such task-specific encoders: a neighbor encoder and a path encoder.
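Before turning to the two encoders, here is a sketch of the MINERVA-style policy described above, in PyTorch. The embedding tables, dimensions, and single-query batching are placeholder assumptions, and for simplicity the action embedding here uses only the outgoing relation, whereas MINERVA concatenates it with the target entity's embedding.

```python
import torch
import torch.nn as nn

class ReasoningPolicy(nn.Module):
    """h_t = LSTM(h_{t-1}, [a_{t-1}; o_t]);
    d_t = softmax(A_t (W2 ReLU(W1 [h_t; o_t; r_q])))."""

    def __init__(self, n_relations, n_entities, dim=50, hidden=100):
        super().__init__()
        self.rel_emb = nn.Embedding(n_relations, dim)   # placeholder sizes
        self.ent_emb = nn.Embedding(n_entities, dim)
        self.lstm = nn.LSTMCell(2 * dim, hidden)
        self.w1 = nn.Linear(hidden + 2 * dim, hidden)   # W_1
        self.w2 = nn.Linear(hidden, dim)                # W_2

    def step(self, state, prev_rel, cur_ent, query_rel, cand_rels):
        a_prev = self.rel_emb(prev_rel)                 # a_{t-1}, shape (1, dim)
        o_t = self.ent_emb(cur_ent)                     # current entity, (1, dim)
        h, c = self.lstm(torch.cat([a_prev, o_t], dim=-1), state)
        r_q = self.rel_emb(query_rel)                   # query relation embedding
        proj = self.w2(torch.relu(self.w1(torch.cat([h, o_t, r_q], dim=-1))))
        A_t = self.rel_emb(cand_rels)                   # (num_actions, dim)
        d_t = torch.softmax(A_t @ proj.squeeze(0), dim=-1)
        return d_t, (h, c)
```

The next relation can then be sampled with torch.multinomial(d_t, 1), and the policy trained with REINFORCE using the terminal +1/0 reward.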
Neighbor Encoder
Given an instance, i.e., a triple (e_s, r, e_t), we use the difference between the embeddings of the start entity e_s and the end entity e_t as a representation of the query relation r (Bordes et al., 2013). To better represent an entity, we borrow the idea of the neighbor encoder from Xiong et al. (2018). Let N_e denote the neighbors of entity e. For each relation-entity pair (r_i, e_i) ∈ N_e, we compute the feature representation

$$C_{r_i,e_i} = W_c (v_{r_i} \oplus v_{e_i}) + b_c,$$

where v_{r_i} and v_{e_i} are the embeddings of r_i and e_i respectively, ⊕ denotes concatenation, and W_c and b_c are the parameters of a linear layer. The neighbor embedding of the given entity e is then computed as the average of the feature representations of all neighbors:

$$NE_e = \sigma\Big(\frac{1}{|\mathcal{N}_e|} \sum_{(r_i, e_i) \in \mathcal{N}_e} C_{r_i,e_i}\Big),$$

where σ = tanh is the activation function. The representation of the query relation is then defined as the difference between the neighbor embeddings of e_s and e_t, as in TransE (Bordes et al., 2013):

$$R_r = NE_{e_t} - NE_{e_s}.$$
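A sketch of the neighbor encoder under the definitions above; the embedding-table sizes are placeholder assumptions.

```python
import torch
import torch.nn as nn

class NeighborEncoder(nn.Module):
    """C_{r_i,e_i} = W_c (v_{r_i} ⊕ v_{e_i}) + b_c;
    NE_e = tanh(mean of C over all neighbors); R_r = NE_{e_t} - NE_{e_s}."""

    def __init__(self, n_relations, n_entities, dim=50):   # placeholder sizes
        super().__init__()
        self.rel_emb = nn.Embedding(n_relations, dim)
        self.ent_emb = nn.Embedding(n_entities, dim)
        self.linear = nn.Linear(2 * dim, dim)               # W_c and b_c

    def encode_entity(self, nbr_rels, nbr_ents):
        # nbr_rels, nbr_ents: (|N_e|,) ids of the (r_i, e_i) pairs around e
        feats = self.linear(torch.cat(
            [self.rel_emb(nbr_rels), self.ent_emb(nbr_ents)], dim=-1))
        return torch.tanh(feats.mean(dim=0))                # NE_e

    def forward(self, src_nbrs, dst_nbrs):
        # each argument is a (relation_ids, entity_ids) pair for e_s / e_t
        return self.encode_entity(*dst_nbrs) - self.encode_entity(*src_nbrs)
```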
Path Encoder

The neighbor encoder needs to encode the neighbors as representations of the start and end entities, and it does not work well when the number of neighbors is small. We therefore propose another encoder for this case, the path encoder. The path encoder takes into consideration the successful paths in the graph, i.e., the reasoning paths from the start entity to the end entity for a given query relation. Since not all paths from the start entity to the end entity are meaningful, this task-specific information is noisier than that of the neighbor encoder.

Let P_e denote all the paths from the start entity e_s to the end entity e_t. Any path p_i ∈ P_e has the form p_i = (r_i^1, ..., r_i^n), where r_i^j is the relation selected at step j of path p_i, and n is the maximum length of a reasoning path. We use an LSTM (Hochreiter and Schmidhuber, 1997) to encode each path:

$$h_t = \mathrm{LSTM}(h_{t-1}, r_i^t),$$

where h_t is the hidden state of the LSTM at step t and r_i^t is the embedding of the relation chosen at that step. The last hidden state h_n is used as the embedding C_{p_i} of path p_i, i.e., C_{p_i} = h_n. The final path embedding PE_e for the given triple (e_s, r, e_t) is the average embedding of all the paths:

$$PE_e = \frac{1}{|\mathcal{P}_e|} \sum_{p_i \in \mathcal{P}_e} C_{p_i}.$$
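A matching sketch of the path encoder, assuming the paths are supplied as relation-id sequences of the fixed maximum length n.

```python
import torch
import torch.nn as nn

class PathEncoder(nn.Module):
    """C_{p_i} = last LSTM hidden state over the relations of path p_i;
    PE = mean of C_{p_i} over all paths in P_e."""

    def __init__(self, n_relations, dim=50, hidden=100):    # placeholder sizes
        super().__init__()
        self.rel_emb = nn.Embedding(n_relations, dim)
        self.lstm = nn.LSTM(dim, hidden, batch_first=True)

    def forward(self, paths):
        # paths: (num_paths, n) LongTensor of relation ids, one row per path,
        # all of the fixed maximum length n
        _, (h_n, _) = self.lstm(self.rel_emb(paths))
        return h_n[-1].mean(dim=0)                           # average of C_{p_i}
```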
Experiments

To verify the effectiveness of the proposed methods, we compare them with several baselines on two knowledge completion datasets, FB15K-237 (Toutanova et al., 2015) and NELL (Mitchell et al., 2018). In the following, we describe how we construct the meta-learning setting for knowledge graph reasoning and the baselines we use, and then present the main results and further analysis.

We construct the meta-learning setting from two well-known knowledge completion datasets: FB15K-237 (Toutanova et al., 2015) and NELL (Mitchell et al., 2018). FB15K-237 is created from the original FB15K by removing various sources of test leakage. Every relation in the training set of FB15K-237 is regarded as an individual task. For the NELL dataset, we use the modified version from Xiong et al. (2018), which chooses relations with more than 50 but fewer than 500 triples as one-shot tasks; here we use those selected tasks as meta-learning tasks. The statistics of the two datasets are shown in Table 1.

[Table 1: Statistics of the datasets.]

Let D_train, D_dev, and D_test denote the training, validation, and test data in the original dataset, e.g., FB15K-237. We choose tasks with positive transfer (tasks that perform better when trained together with other tasks than when trained alone) as meta-dev and meta-test tasks. More specifically, we keep tasks whose positive transfer exceeds a threshold on FB15K-237 and NELL respectively, where the thresholds are carefully chosen so that we obtain enough tasks with reasonable positive transfer, and from those we only keep tasks with enough samples in the dev set. In this way we obtain the meta-dev and meta-test relations on FB15K-237 and NELL, and the remaining relations are used for meta-training. We denote the partitioned relation sets as R_meta-train / R_meta-dev / R_meta-test; each relation has its own training and test data. We compare our methods with the following baselines.
Random method trains a separate model for each task from random initialization.
Transfer method learns an initial model using samples from D^train_meta-train. MAML learns an initial point using the MAML training framework, with the task identity (the query relation) given. MAML-Mask uses the same training framework as MAML; the difference is that we mask the task identity by setting the query relation of all tasks to a shared masked value. Neighbor and Path denote our methods, which use the neighbor encoder and the path encoder, respectively, to encode the task-specific information.

We tuned the hyper-parameters for all the baselines and our methods. For Transfer, a single batch size is used in the pre-training phase; for MAML, MAML-Mask, Neighbor, and Path, a common batch size is used. For Path, a single number of adaptation steps is applied to compute the updated parameters, with its own α and β. For Neighbor, MAML, and MAML-Mask, 1 and 5 adaptation steps are applied on FB15K-237 and NELL respectively, with α set according to the number of adaptation steps k (one value for k = 1, another for k = 5) and a fixed β. Other parameters are set to their defaults as in Das et al. (2018).

We conduct our experiments under the few-shot learning setting, i.e., there are only a few training samples for each task in R_meta-dev and R_meta-test. We use the mean reciprocal rank (MRR) and Hits@K to evaluate each model; a short code sketch of these metrics is given at the end of this analysis. For each method, we first fine-tune and test the initial model on the meta-dev tasks, through which we choose the number of fine-tuning steps, and then fix that number for the meta-test tasks. For example, if a model performs best after a certain number of fine-tuning steps on the meta-dev tasks, then the model is tested after that same number of fine-tuning steps on the meta-test tasks. We report the best performance on the meta-test tasks for each method in Table 2 as the Best group. We also list the results using full data for better comparison.

Setting    Group      Method      FB15K-237                        NELL
                                  Hits@1  Hits@3  Hits@10  MRR     Hits@1  Hits@3  Hits@10  MRR
Full Data             MINERVA     .124    .146    .187     .142    .137    .176    .202     .163
Best       Baselines  Random      .017    .028    .043     .027    .047    .100    .165     .086
                      Transfer    .010    .012    .054     .019    .041    .070    .128     .066
                      MAML        .021    .041    .052     .035    .067    .086    .139     .087
                      MAML-Mask   .009    .023    .045     .019    .032    .054    .080     .058
           Ours       Neighbor    .065    .073    .128     .080    .045    .066    .106     .064
                      Path        .041    .067    .101     .060    .108    .141    .200     .137
Initial    Baselines  Random      .000    .000    .005     .002    .021    .074    .105     .056
                      Transfer    .000    .005    .023     .006    .037    .055    .077     .051
                      MAML        .005    .005    .023     .010    .017    .031    .054     .032
                      MAML-Mask   .000    .014    .045     .012    .021    .050    .081     .043
           Ours       Neighbor    .043    .054    .092     .056    .026    .047    .091     .045
                      Path        .000    .005    .058     .012    .082    .109    .164     .104

Table 2: The results of the few-shot experiments. We also report the performance of MINERVA on these tasks using full data for better comparison. Full Data denotes using the MINERVA algorithm on these tasks with full training data. Best denotes the best performance of each method after fine-tuning, and Initial denotes the performance of each method at the initial point. We report the average performance on the meta-test tasks. The best result for each evaluation metric is marked in bold.

From the results, we can see that the neighbor encoder and the path encoder achieve the best performance on the FB15K-237 and NELL datasets, respectively. It is reasonable that the neighbor encoder does not perform well on the NELL dataset, since the median outgoing degree on this dataset is very small. We also note that the path encoder outperforms the other baselines on FB15K-237, which verifies the consistent effectiveness of the task-specific encoder. Meanwhile, the other baselines do not differ much from the simple Random baseline, and sometimes even underperform it.

To show that our model attains a better initial point than the others, we report the performance of the initial point, without any fine-tuning, in Table 2 as the Initial group. We notice that the baselines have very poor initial performance on FB15K-237, which is reasonable since the model has never seen the new relation. From the results, we can see that the neighbor encoder and the path encoder achieve a much better initial point than the other baselines on FB15K-237 and NELL, respectively. On FB15K-237 the path encoder's initial performance is only on par with the best baseline, MAML-Mask; we believe the path encoder does not perform very well here because it is noisier than the neighbor encoder, as mentioned before.

[Figure 2: The change of performance with the number of few-shot samples for each method, on (a) FB15K-237 and (b) NELL. The MRR of each model after fine-tuning is reported.]

To investigate the impact of the few-shot size on model performance, we evaluate the models with various few-shot sizes; the results are shown in Figure 2. For MAML and MAML-Mask, performance remains nearly the same once the size passes a small value on the FB15K-237 dataset. The performance of MAML is not stable on the NELL dataset, while MAML-Mask keeps increasing; both methods underperform the Random baseline as the size increases. For the Transfer method, performance increases with the few-shot size on FB15K-237, but there is a large drop on NELL at one of the sizes, which indicates that it is not stable enough and is sensitive to noise in the data. The neighbor encoder has the best performance on the FB15K-237 dataset, but not on NELL, due to the small neighbor sizes. The path encoder appears less stable than the neighbor encoder, with a performance drop occurring once on each dataset, but it achieves the best performance on NELL and the second-best performance at most of the larger sizes.
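For reference, a minimal sketch of the two reported metrics, assuming the 1-based rank of the gold end entity is available for each test query.

```python
def mrr_and_hits(ranks, ks=(1, 3, 10)):
    """ranks: 1-based rank of the gold end entity for each test query."""
    n = len(ranks)
    mrr = sum(1.0 / r for r in ranks) / n
    hits = {k: sum(r <= k for r in ranks) / n for k in ks}
    return mrr, hits

# Example: ranks 1, 4, 12 -> MRR = (1 + 1/4 + 1/12) / 3 ≈ 0.444,
# Hits@1 = 1/3, Hits@3 = 1/3, Hits@10 = 2/3.
```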
Path encoder seems to be less stable com-pared with neighbor encoder since there is perfor- Setting FB15K-237Hits@1 Hits@3 Hits@10 MRREncoder-1-shot .047 .058 .117 .064Encoder-50-shot .049 .070 .128 .069
No-encoder .008 .035 .084 .032
Table 3: Comparison of models with different initialization on the FB15K-237 dataset. Encoder-1-shot and Encoder-50-shot denote using the neighbor encoder with 1 and 50 samples, respectively. No-encoder means using a random initialization. We report the average performance on the meta-test tasks. The best result for each evaluation metric is marked in bold.

To verify the effectiveness of the encoder, we compare the model using task-specific initialization with the model using random initialization at the initial point. We choose the neighbor encoder on the FB15K-237 dataset for this ablation study. The comparison results are shown in Table 3. The three models in the table use the same reasoning model; the only difference is the task representation. Encoder-1-shot and Encoder-50-shot apply the neighbor encoder to generate the task representation using 1 and 50 samples respectively, while No-encoder uses a randomly initialized representation. Comparing Encoder-1-shot with No-encoder, we see that the model achieves much better performance by encoding task-related information, even with only one sample, which also indicates that the generated task representations are meaningful. Moreover, better initialization can be achieved with more samples, since Encoder-50-shot outperforms Encoder-1-shot.

Conclusion

In this paper, we consider multi-hop reasoning over knowledge graphs under the few-shot learning setting, where only limited samples are available for new tasks. We improve upon MAML by using a meta-encoder to encode task-specific information. In this way, our method can create a task-dependent initial model that better fits the target task. A neighbor encoder and a path encoder are proposed for our problem. Experiments on FB15K-237 and NELL under the meta-learning setting show that our task-specific meta-encoder yields a better initial point and outperforms other baselines.

References
Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary G. Ives. 2007. DBpedia: A nucleus for a web of open data. In The Semantic Web, ISWC 2007 + ASWC 2007, volume 4825 of Lecture Notes in Computer Science, pages 722–735. Springer.

Kurt D. Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. 2008. Freebase: A collaboratively created graph database for structuring human knowledge. In Proceedings of SIGMOD 2008, pages 1247–1250. ACM.

Antoine Bordes, Nicolas Usunier, Sumit Chopra, and Jason Weston. 2015. Large-scale simple question answering with memory networks. CoRR, abs/1506.02075.

Antoine Bordes, Nicolas Usunier, Alberto García-Durán, Jason Weston, and Oksana Yakhnenko. 2013. Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems 26, pages 2787–2795.

Wei-Yu Chen, Yen-Cheng Liu, Zsolt Kira, Yu-Chiang Frank Wang, and Jia-Bin Huang. 2019. A closer look at few-shot classification. CoRR, abs/1904.04232.

Wenhu Chen, Wenhan Xiong, Xifeng Yan, and William Yang Wang. 2018. Variational knowledge graph reasoning. In Proceedings of NAACL-HLT 2018, Volume 1 (Long Papers), pages 1823–1832. Association for Computational Linguistics.

Rajarshi Das, Shehzaad Dhuliawala, Manzil Zaheer, Luke Vilnis, Ishan Durugkar, Akshay Krishnamurthy, Alex Smola, and Andrew McCallum. 2018. Go for a walk and arrive at the answer: Reasoning over paths in knowledge bases using reinforcement learning. In ICLR 2018. OpenReview.net.

Chelsea Finn, Pieter Abbeel, and Sergey Levine. 2017. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of ICML 2017, volume 70 of Proceedings of Machine Learning Research, pages 1126–1135. PMLR.

Matt Gardner, Partha Pratim Talukdar, Bryan Kisiel, and Tom M. Mitchell. 2013. Improving learning and inference in a large knowledge-base using latent syntactic cues. In Proceedings of EMNLP 2013, pages 833–838. ACL.

Matt Gardner, Partha Pratim Talukdar, Jayant Krishnamurthy, and Tom M. Mitchell. 2014. Incorporating vector space similarity in random walk inference over knowledge bases. In Proceedings of EMNLP 2014, pages 397–406. ACL.

Jiatao Gu, Yong Wang, Yun Chen, Victor O. K. Li, and Kyunghyun Cho. 2018. Meta-learning for low-resource neural machine translation. In Proceedings of EMNLP 2018, pages 3622–3631. Association for Computational Linguistics.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Po-Sen Huang, Chenglong Wang, Rishabh Singh, Wen-tau Yih, and Xiaodong He. 2018. Natural language to structured query generation via meta-learning. In Proceedings of NAACL-HLT 2018, Volume 2 (Short Papers), pages 732–738. Association for Computational Linguistics.

Ni Lao and William W. Cohen. 2010. Relational retrieval using a combination of path-constrained random walks. Machine Learning, 81(1):53–67.

Xi Victoria Lin, Richard Socher, and Caiming Xiong. 2018. Multi-hop knowledge graph reasoning with reward shaping. In Proceedings of EMNLP 2018, pages 3243–3253. Association for Computational Linguistics.

Nikhil Mishra, Mostafa Rohaninejad, Xi Chen, and Pieter Abbeel. 2018. A simple neural attentive meta-learner. In ICLR 2018. OpenReview.net.

Tom M. Mitchell, William W. Cohen, Estevam R. Hruschka Jr., Partha P. Talukdar, Bo Yang, Justin Betteridge, Andrew Carlson, Bhavana Dalvi Mishra, Matt Gardner, Bryan Kisiel, Jayant Krishnamurthy, Ni Lao, Kathryn Mazaitis, Thahir Mohamed, Ndapandula Nakashole, Emmanouil A. Platanios, Alan Ritter, Mehdi Samadi, Burr Settles, Richard C. Wang, Derry Wijaya, Abhinav Gupta, Xinlei Chen, Abulhair Saparov, Malcolm Greaves, and Joel Welling. 2018. Never-ending learning. Communications of the ACM, 61(5):103–115.

Arvind Neelakantan, Benjamin Roth, and Andrew McCallum. 2015. Compositional vector space models for knowledge base inference. AAAI Press.

Alex Nichol, Joshua Achiam, and John Schulman. 2018. On first-order meta-learning algorithms. CoRR, abs/1803.02999.

Maximilian Nickel, Volker Tresp, and Hans-Peter Kriegel. 2011. A three-way model for collective learning on multi-relational data. In Proceedings of ICML 2011, pages 809–816. Omnipress.

Sachin Ravi and Hugo Larochelle. 2017. Optimization as a model for few-shot learning. In ICLR 2017. OpenReview.net.

Andrei A. Rusu, Dushyant Rao, Jakub Sygnowski, Oriol Vinyals, Razvan Pascanu, Simon Osindero, and Raia Hadsell. 2018. Meta-learning with latent embedding optimization. CoRR, abs/1807.05960.

Yelong Shen, Jianshu Chen, Po-Sen Huang, Yuqing Guo, and Jianfeng Gao. 2018. M-Walk: Learning to walk over graphs using Monte Carlo Tree Search. In Advances in Neural Information Processing Systems 31 (NeurIPS 2018), pages 6787–6798.

Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip H. S. Torr, and Timothy M. Hospedales. 2018. Learning to compare: Relation network for few-shot learning. In CVPR 2018, pages 1199–1208. IEEE Computer Society.

Kristina Toutanova, Danqi Chen, Patrick Pantel, Hoifung Poon, Pallavi Choudhury, and Michael Gamon. 2015. Representing text for joint embedding of text and knowledge bases. In Proceedings of EMNLP 2015, pages 1499–1509. The Association for Computational Linguistics.

Théo Trouillon, Johannes Welbl, Sebastian Riedel, Éric Gaussier, and Guillaume Bouchard. 2016. Complex embeddings for simple link prediction. In Proceedings of ICML 2016, volume 48 of JMLR Workshop and Conference Proceedings, pages 2071–2080. JMLR.org.

Denny Vrandecic and Markus Krötzsch. 2014. Wikidata: A free collaborative knowledgebase. Communications of the ACM, 57(10):78–85.

William Yang Wang and William W. Cohen. 2015. Joint information extraction and reasoning: A scalable statistical relational learning approach. In Proceedings of ACL-IJCNLP 2015, Volume 1 (Long Papers), pages 355–364. The Association for Computer Linguistics.

Jiawei Wu, Ruobing Xie, Zhiyuan Liu, and Maosong Sun. 2016. Knowledge representation via joint learning of sequential text and knowledge graphs.

Wenhan Xiong, Thien Hoang, and William Yang Wang. 2017. DeepPath: A reinforcement learning method for knowledge graph reasoning. In Proceedings of EMNLP 2017, pages 564–573. Association for Computational Linguistics.

Wenhan Xiong, Mo Yu, Shiyu Chang, Xiaoxiao Guo, and William Yang Wang. 2018. One-shot relational learning for knowledge graphs. In Proceedings of EMNLP 2018, pages 1980–1990. Association for Computational Linguistics.

Bishan Yang, Wen-tau Yih, Xiaodong He, Jianfeng Gao, and Li Deng. 2015. Embedding entities and relations for learning and inference in knowledge bases. In ICLR 2015.

Xuchen Yao and Benjamin Van Durme. 2014. Information extraction over structured data: Question answering with Freebase. In Proceedings of ACL 2014, Volume 1 (Long Papers), pages 956–966. The Association for Computer Linguistics.

Wen-tau Yih, Ming-Wei Chang, Xiaodong He, and Jianfeng Gao. 2015. Semantic parsing via staged query graph generation: Question answering with knowledge base. In Proceedings of ACL-IJCNLP 2015, Volume 1 (Long Papers), pages 1321–1331. The Association for Computer Linguistics.

Mo Yu, Wenpeng Yin, Kazi Saidul Hasan, Cícero Nogueira dos Santos, Bing Xiang, and Bowen Zhou. 2017. Improved neural relation detection for knowledge base question answering. In Proceedings of ACL 2017. Association for Computational Linguistics.