[PDF] Differentiable Logic Machines

Abstract

The integration of reasoning, learning, and decision-making is key to build more general AI systems. As a step in this direction, we propose a novel neural-logic architecture that can solve both inductive logic programming (ILP) and deep reinforcement learning (RL) problems. Our architecture defines a restricted but expressive continuous space of first-order logic programs by assigning weights to predicates instead of rules. Therefore, it is fully differentiable and can be efficiently trained with gradient descent. Besides, in the deep RL setting with actor-critic algorithms, we propose a novel efficient critic architecture. Compared to state-of-the-art methods on both ILP and RL problems, our proposition achieves excellent performance, while being able to provide a fully interpretable solution and scaling much better, especially during the testing phase.

Full PDF

DDifferentiable Logic Machines

Matthieu Zimmer Xuening Feng Claire Glanois Zhaohui Jiang Jianyi Zhang Paul Weng

Jianye Hao Dong Li Wulong Liu Abstract

The integration of reasoning, learning, anddecision-making is key to build more general AIsystems. As a step in this direction, we propose anovel neural-logic architecture that can solve bothinductive logic programming (ILP) and deep rein-forcement learning (RL) problems. Our architec-ture deﬁnes a restricted but expressive continuousspace of ﬁrst-order logic programs by assigningweights to predicates instead of rules. Therefore,it is fully differentiable and can be efﬁcientlytrained with gradient descent. Besides, in thedeep RL setting with actor-critic algorithms, wepropose a novel efﬁcient critic architecture. Com-pared to state-of-the-art methods on both ILP andRL problems, our proposition achieves excellentperformance, while being able to provide a fullyinterpretable solution and scaling much better, es-pecially during the testing phase.

1. Introduction

Following the successes of deep learning and deep reinforce-ment learning, a research trend (Dong et al., 2019; Jiang &Luo, 2019; Manhaeve et al., 2018), whose goal is to combinereasoning, learning, and decision-making into one architec-ture has become very active. This research may unlock thenext generation of artiﬁcial intelligence (AI) (Lake et al.,2017; Marcus, 2018). Simultaneously, a second researchtrend has ﬂourished under the umbrella term of explainableAI (Barredo Arrieta et al., 2020). This trend is fueled by therealization that solutions obtained via deep learning-basedtechniques are difﬁcult to understand, debug, and deploy.Neural-logic approaches (see Section 2) have been proposedto integrate reasoning and learning, notably via ﬁrst-orderlogic and neural networks. Recent works have demonstratedgood achievements by using differentiable methods to learn UM-SJTU Joint Institute, Shanghai Jiao Tong University,China Department of Automation, Shanghai Jiao Tong University,Shanghai, China. Huawei Noah’s Ark Lab, China. Correspon-dence to: Paul Weng < [email protected] > . a logic program (Evans & Grefenstette, 2018) or by applyinga logical inductive bias to create a neural-logic architecture(Dong et al., 2019). The latter approach obtains the best per-formance at the cost of interpretability, while the former canyield an interpretable solution, but at the cost of scalability.In this paper, we propose a novel neural-logic architecture(see Section 3 for background notions and Section 4 forour proposition) that offers a better tradeoff in terms ofinterpretability vs performance and scalability. This archi-tecture deﬁnes a continuous relaxation over ﬁrst-order logicexpressions deﬁned on input predicates. In contrast to mostprevious approaches, one key idea is to assign learnableweights on predicates instead of template rules, which al-lows for a much better scalability. We also introduce severaltraining tricks to ﬁnd interpretable solutions. Besides, fordeep reinforcement learning (RL), we propose an adaptedcritic to train our architecture in an actor-critic scheme forfaster convergence.We experimentally compare our proposition with previously-proposed architectures on inductive logic programming(ILP) and RL tasks (see Section 5). Our architectureachieves state-of-the-art performances in ILP and RL taskswhile maintaining interpretability and achieving better scal-ability. Our proposition is superior to all interpretable meth-ods in terms of success rates, computational time, and mem-ory consumption. Compared to non-interpretable ones, ourmethod compares favorably, but can ﬁnd fully-interpretablesolutions (i.e., logic program) that are faster and use lessmemory during the testing phase.The contributions of this paper can be summarized as fol-lows: (1) a novel neural-logic architecture that can producean interpretable solution and that can scale better than state-of-the-art methods, (2) an algorithm to train this architectureand obtain an interpretable solution, (3) an adapted critic forthe deep RL setting, and (4) a thorough empirical evaluationin both ILP and RL tasks.

2. Related Work

The literature aiming at integrating reasoning and learn-ing (and decision-making more recently) is very rich (e.g.,(Raedt et al., 2020; Manhaeve et al., 2018; Lyu et al., 2019; a r X i v : . [ c s . A I] F e b ifferentiable Logic Machines Yi et al., 2018)). For space reasons, we only discuss therecent works closest to ours below. (Differentiable) ILP and their extensions to RL

Induc-tive Logic Programming (ILP) (Cropper et al., 2020) aims toextract lifted logical rules from examples. Since traditionalILP systems can not handle noisy, uncertain or ambiguousdata, they have been extended and integrated into neural anddifferentiable frameworks. For instance, Evans & Grefen-stette (2018) proposed B ILP, a model based on a continuousrelaxation of the logical reasoning process, such that the pa-rameters can be trained via gradient descent, by expressingthe satisﬁability problem of ILP as a binary classiﬁcationproblem. This relaxation is deﬁned by assigning weightsto templated rules. Jiang & Luo (2019) adapted B ILP toRL problems using vanilla policy gradient. Despite beinginterpretable, this approach does not scale well, which isnotably due to how the relaxation is deﬁned.Payani & Fekri (2019b) proposed differentiable NeuralLogic ILP (dNL-ILP), another ILP solver where in con-trast to B ILP, weights are placed on predicates like in ourapproach. Their architecture is organized as a sequenceof one layer of neural conjunction functions followed byone layer of neural disjunction functions to represent ex-pressions in Disjunctive Normal Form, which provides highexpressivity. Payani & Fekri (2019b) did not provide anyexperimental evaluation of dNL-ILP on any standard ILPbenchmarks. But, in our experiments, our best effort toevaluate it suggests that dNL-ILP performs worse than B ILP.Payani & Fekri (2020) extended their model to RL andshowed that initial predicates can be learned from imagesif sufﬁcient domain knowledge under the form of auxil-iary rules is provided to the agent. However, they do notshow that their approach can learn good policies withoutthis domain knowledge.Another way to tackle inductive learning and logic reason-ing, is to introduce some logical architectural inductive bias,as in Neural Logic Machine (NLM) (Dong et al., 2019).This approach departs from previous ones by learning ruleswith single-layer perceptrons (SLP), which prevents thismethod to provide any ﬁnal interpretable solution. NLMscan generalize and its inference time is signiﬁcantly im-proved compared to B ILP; by avoiding rules templates asin traditional neuro-symbolic approaches, it also gains inexpressivity. Our architecture is inspired by NLM, but weuse interpretable modules instead of SLPs.

Knowledge-Base (KB) reasoning

Another line of workin relational reasoning speciﬁcally targets KB reasoning.Although these works have demonstrated huge gain in scal-ability (w.r.t number of predicates or entities), they are usu- ally less concerned about predicate invention . Some recentworks (Yang et al., 2017; Yang & Song, 2020) extend themulti-hop reasoning framework to ILP problems. The lat-ter work is able to learn more expressive rules, with theuse of nested attention operators. In the KB completionliterature, a recurrent idea is to jointly learn sub-symbolicembeddings of entities and predicates, which are then usedfor approximate inference. However, the expressivity re-mains too limited for more complex ILP tasks and theseworks are typically more data-hungry.

3. Background

In this section, we present our notations in the context of In-ductive Logic Programming, recall the deﬁnition of MarkovDecision Process, and review the related deep RL algorithms(i.e., PPO), which we use to formulate our method.

Notations

For any ﬁnite set X , let ∆ p X q denote the setof probability distributions over X . For any n P N , let r n s denote the set t , , . . . , n u . Inductive Logic Programming (ILP) refers to the problem oflearning a logic program that entails a given set of positiveexamples and does not entail a given set of negative exam-ples. This logic program is generally written in (a fragmentof) ﬁrst-order logic.

First-Order Logic (FOL) is a formal language deﬁned withseveral elements: constants, variables, functions, predicates,and formulas.

Constants correspond to object in the world.Let C denote the set of m constants. They will be denotedin lowercase. Variables refer to unspeciﬁed objects. Theywill be denoted in uppercase. Like previous works, weconsider a fragment of FOL without any functions. An r -ary predicate P can be thought of as a relation between r constants, which can be evaluated as true T or false F .Predicates will be denoted in uppercase. Let P denote theset of predicates used for a given problem. An atom is an r -ary predicate with its arguments P p x , ¨ ¨ ¨ , x r q where x i ’s are either variables or constants. A formula is a logicalexpression composed of atoms, logical connectives (e.g.,negation (cid:32) , conjunction ^ , disjunction _ , implication Ð ),and possibly existential D and universal @ quantiﬁers.Since solving an ILP task involves searching an exponen-tially large space, this problem is generally handled by fo-cusing on formulas of restricted forms, such as a subset ofif-then rules , also referred to as clauses . A deﬁnite clause isa rule of the form: B Ð A ^ . . . ^ A k Typically, rules learned in multi-hop reasoning are chain-likerules (i.e., paths on graphs), which form a subset of Horn clauses: Q p X, Y q Ð P p X, Z q ^ P p Z , Z q ^ ¨ ¨ ¨ P r p Z r ´ , Z r q . ifferentiable Logic Machines which means that the head atom B is implied by the con-junction of the body atoms A , . . . , A k . More general rulescan be deﬁned by allowing logical operations (e.g., disjunc-tion or negation). A ground rule (resp. ground atom) is arule (resp. atom) whose variables have been all replacedby constants. By gathering the ground atoms attached toa speciﬁc predicate P of arity r , we can associate to P atensor P with values for F and for T , of dimension r with shape r m, . . . , m s .In ILP tasks, given some initial input predicates (e.g., zero p X q , succ p X, Y q for natural numbers), and a targetpredicate (e.g., even ), the goal is to learn a logical for-mula deﬁning the target predicate. It is usually judicious toproceed incrementally and introduce in P some auxiliarypredicates, for which we also have to learn an explicit andconsistent deﬁnition. Below, we show a simple example ofsuch predicate invention with created predicate succ : even p X q Ð zero p X q _ ` succ p X, Y q ^ even p Y q ˘ succ p X, Y q Ð succ p X, Z q ^ succ p Z, Y q The

Markov Decision Process (MDP) model (Bellman,1957) is deﬁned as a tuple p S , A , T, r, µ , γ q , where S isa set of states, A is a set of actions, T : S ˆ A Ñ ∆ p S q is a transition function, r : S ˆ A Ñ R is a reward func-tion, µ P ∆ p S q is a distribution over initial states, and γ P r , q is a discount factor. A (stationary Markov) policy π : S Ñ ∆ p A q is a mapping from states to distributionsover actions; π p a | s q stands for the probability of takingaction a given state s . We consider parametrized policies π θ with parameter θ (e.g., neural networks). The aim indiscounted MDP settings, is to ﬁnd a policy that maximizesthe expected discounted total reward: J p θ q “ E µ ,T,π θ ” ÿ t “ γ t r p s t , a t q ı (1)where E µ ,T,π θ is the expectation w.r.t. distribution µ , tran-sition function T , and policy π θ . The state value function of a policy π θ for a state s is deﬁned by: V θ p s q “ E T,π θ ” ÿ t “ γ t r p s t , a t q | s “ s ı (2)where E T,π θ is the expectation w.r.t. transition function T and policy π θ . The action value function is deﬁned by: Q θ p s, a q “ E T,π θ ” ÿ t “ γ t r p s t , a t q | s “ s, a “ a ı (3)and the advantage function is deﬁned by: A θ p s, a q “ Q θ p s, a q ´ V θ p s q . (4) Figure 1: Architecture of DLM zoomed in around breadth b , whereboxes represent logic modules (except for layer ), blue arrowscorrespond to reduction, and yellow arrows to expansion. The pred-icates corresponding to inputs of a logic module can be negatedand their arguments permuted. Reinforcement learning (RL), which is based on MDP, is theproblem of learning a policy that maximizes the expecteddiscounted sum of rewards without knowing the transitionand reward functions. Policy gradient (PG) methods con-stitute a widespread approach for tackling RL problems incontinuous or large state-action spaces. They are based oniterative updates of the policy parameter in the direction ofa gradient expressed as (Sutton & Barto, 2018): ∇ θ J p θ q “ E p s,a q„ d π θ r A θ p s, a q ∇ θ log π θ p a | s qs where the expectation is taken w.r.t to d π θ , the stationarydistribution of the Markov chain induced by policy π θ . Anefﬁcient way to perform the policy update is via an actor-critic (AC) scheme. In such a framework, both an actor ( π θ )and a critic (e.g., A θ or V θ ) are jointly learned.Proximal Policy Optimization (PPO) (Schulman et al., 2017)is a state-of-the-art actor-critic algorithm, which optimizesa clipped surrogate objective function J PPO p θ q deﬁned by: ÿ t “ min p ω t p θ q A ¯ θ p s t , a t q , clip p ω t p θ q , (cid:15) q A ¯ θ p s t , a t qq , (5)where ¯ θ is the current policy parameter, ω t p θ q “ π θ p a t | s t q π ¯ θ p a t | s t q ,and clip p¨ , (cid:15) q is the function to clip between r ´ (cid:15), ` (cid:15) s .This surrogate objective was motivated as an approximationof that used in TRPO (Schulman et al., 2015), which wasintroduced to ensure monotonic improvement after a policyparameter update. Some undeniable advantages of PPO overTRPO lies in its simplicity and lower sample complexity.

4. Differentiable Logic Machines

In this section, we present our novel neural-logic architec-ture, called Differentiable Logic Machines (DLM), whichoffers a good trade-off between expressivity and trainability. ifferentiable Logic Machines

Inspired by NLM’s architecture (Dong et al., 2019), DLMis comprised of learnable logic modules organized into L layers, each layer having a breadth B (see Figure 1). Byconvention, layer corresponds to all the initial predicates.At layer l P r L s , a breadth level b P r B s corresponds to alogic module containing n O invented predicates of arity b .Each of those predicates can be a disjunction or a conjunc-tion of n A predicates from layer l ´ and breadth b ´ , b or b ` . In order to provide their deﬁnitions, we introducesome notations.Let P lb be the set of all predicates of arity b at layer l P r L s .This set can be augmented with three operations: negation,expansion, and reduction: Negation:

The set of predicates in P lb and their negationsis denoted P lb . Expansion:

Any b -arity predicate P can be expanded intoa b ` -arity predicate, where the last argument does notplay any role in its truth value, i.e., ˆ P p X , . . . , X b ` q : “ P p X , . . . , X b q . The set of expanded predicates obtainedfrom P lb is denoted p P lb . By convention, p P l “ H . Reduction:

Any b ` -arity predicate P can be reducedinto a b -arity predicate P by marginalizing out its last ar-gument either with an existential quantiﬁer resp. universalone, i.e., P p X , . . . , X b q “ D X b ` P p X , . . . , X b ` q or P p X , . . . , X b q “ @ X b ` P p X , . . . , X b ` q . Those oper-ations on tensors are respectively performed by a max or a min on the corresponding index. The set of reduced predi-cates obtained from P lb ` is denoted q P lb ` . By convention, q P lb ` “ H if b “ B .Let S B be the set of all permutations of B . For agiven b -arity predicate and a given permutation σ P S b , P σ p X , . . . , X b q : “ P p X σ p q , . . . , X σ p b q q . Permuting argu-ments allow to build more expressive formulas.Typically, a logic module has half conjunctive predicatesand half disjunctive ones. For any predicate, we writeits corresponding tensor in bold. A conjunctive predicate P lb p X , . . . , X b q is computed with a fuzzy and (for n A “ ,which easily extends to n A ą ): P lb “ ˜ ÿ P P P l ´ w P P ¸ d ˜ ÿ P P P l ´ w P P ¸ (6)where d is the component-wise multiplication and P l ´ “ t T , F u Y t P σ | P P P l ´ b Y p P l ´ b ´ Y q P l ´ b ` , σ P S b u .The weights w P ’s (or w P ) are learned as a softmax ofparameters θ P (with temperature τ as a hyperparameter): w P “ exp p θ P { τ q ř P P P l ´ exp p θ P { τ q . (7)Similarly, a disjunctive predicate is deﬁned with a fuzzy or : P lb “ Q ` Q ´ Q d Q (8)where Q “ ř P P P l ´ w P P and Q “ ř P P P l ´ w P P .The number of parameters in one module grows as O p pn A n O q where p is the number of input predicates ofthe module, n A is the number of atoms used in a predicate, n O is the number of output predicates of the module. Incomparison, to obtain the same expressivity, B ILP (and thusNLRL) would need O p p ! ˆ n O q because the weights aredeﬁned for all permutations over p . In contrast, dNL-ILPand NLM would be better with only O p pn O q . However, themodules of dNL-ILP and NLM are not really comparable tothose of our model or B ILP. Indeed, the NLM modules arenot interpretable, and the dNL-ILP architecture amounts tolearning a CNF formula, where a module corresponds to acomponent of that formula. While expressive, the space oflogic program induced in dNL-ILP is much less constrainedthan in our architecture, making it much harder to train, asshown in our experiments (see Table 1).

DLM deﬁnes a continuous relaxation over ﬁrst-order logicprograms. Its expressivity can not only be controlled bysetting hyperparameters L (maximum number of layers), B (maximum arity), n A (number of body atoms to deﬁnea clause), and n O (number of outputs in logic module),but also by restricting the inputs of logic modules (e.g., nonegation or no existential or universal quantiﬁers). A prioriknowledge can be injected in this architecture by choosingdifferent values for B at each layer, different values for n O in each module, or removing some inputs of logic modules.Note that DLM is independent of the number of objects: itcan be trained on a small number of objects and generalizeto a larger number.For supervised learning tasks, with positive and negativeexamples, the loss is simply a binary cross-entropy loss. Asreported in previous neural-logic works, this loss functiongenerally admits many local optima. Besides, there may beglobal optima that reside in the interior of the continuousrelaxation (i.e., not interpretable). Therefore, if we train ourmodel with a standard supervised (or RL training) technique,there is no reason that an interpretable solution would beobtained, if we manage to completely solve the task at all.In order to help training and guide the model towards aninterpretable solution, we propose to use three tricks: (1)inject some noise in the softmax deﬁned in (7), (2) decreasetemperature τ during training, and (3) use dropout. For thenoise, we use a Gumbel distribution. Thus, the softmax in(7) is replaced by a Gumbel-softmax (Jang et al., 2017): w P “ exp pp G P ` θ P q{ τ q ř P P P l ´ exp pp G P ` θ P q{ τ q (9) ifferentiable Logic Machines P (X,Y) GRUGRU GRU P (Y,X) P (X) MLP P (·) V(s) Figure 2: Architecture of our critic with B “ . Additionalobservable predicates with lower arity can be introduced in thedeeper layers by concatenation. The architecture generalizes tohigher depth by introducing more initials GRUs, for instance, 3GRUs for depth 3, etc. The number of parameters of the criticis independent of the number of objects and only depends on thenumber of predicates. where G P (and G P ) are i.i.d. samples from a Gumbel distri-bution Gumbel p , β q . The injection of noise during trainingcorresponds to a stochastic smoothing technique: optimiz-ing by injecting noises with gradient descent amounts toperforming stochastic gradient descent of a smoothed lossfunction. This helps avoid early convergence to a local opti-mum and ﬁnd a global optimum. The decreasing tempera-ture favors the convergence towards interpretable solutions.To further help learn an interpretable solution, we addition-ally use dropout during our training procedure. The scale β of the Gumbel distribution and the dropout probability isalso decreased with the temperature during learning. For RL tasks, the objective function is deﬁned from thesparse reward obtained when a task is solved. Here, the goalis to learn a policy that maximizes the expected cumulativerewards. To the best of our knowledge, all previous neural-logic works rely on REINFORCE (Williams, 1992) insteadof an actor-critic (AC) algorithm, which can be generallymuch more sample-efﬁcient. One reason may be the dif-ﬁculty of designing a neural network architecture for thecritic that can directly receive as inputs the initial predicates.To solve this issue, we propose a recurrent neural networkarchitecture for the critic that can directly take as inputsthe initial predicates. The critic estimates the value of acurrent state described by the initial predicates, which arerepresented by a tensor. An r -ary predicate can be repre-sented as a tensor of dimension r with shape r m, . . . , m s .The set of r -ary predicates can be represented as a tensor P r of dimension r ` with shape r m, . . . , m, | P r |s . Thearchitecture is depicted in Figure 2. For arity r , r recurrentheads, implemented as a Gated Recurrent Unit (GRU) (Choet al., 2014), read the r -ary predicates, the i -th head readingthe i -th slice of tensor P r , yielding an output P ir of shape r m, | P r |s . Intuitively, the i -th head computes for each ob- Although not discussed in Jiang & Luo (2019)’s paper, wefound in their source code an attempt to apply an AC scheme, butthey do so by converting states into images, which may not onlybe unsuitable for some problems, but may also lose informationduring the conversion and prevent good generalization. ject o and each predicate P a summary of the objects that o is in relation with according to P when o is in the i -thposition of P . Outputs P ir ’s are then combined with anotherGRU to provide an output for arity r . All those outputsare then given as inputs to a single-layer perceptron thatestimates value of the current state. Note that this critic, likeDLM, is independent of the number of objects.For the actor, the output should ideally correspond to a predi-cate that evaluates to true for only one action and false for allother actions, which corresponds to a deterministic policy.For instance, in a blocks world domain, the target predicatewould be move p X, Y q and would be true for only one pairof objects, corresponding to the optimal action, and falsefor all other pairs. While not impossible to achieve (at leastfor certain small tasks), an optimal deterministic policy mayinvolve an unnecessarily complex logic program. Indeed,for instance, for the blocks world domain, in many states,there are several equivalent actions, which the deterministicpolicy would have to order.Thus, as done in previous works, we separate the reasoningpart and the decision-making part. The reasoning part fol-lows the architecture presented in Figure 1, which providesa tensor representing the target predicate corresponding tothe actions. A component of this tensor can be interpretedas whether the respective action is good or not. The deci-sion part takes as input this tensor and outputs a probabilitydistribution over the actions.In previous works, the softmax distribution has been con-sidered in NLM and dNL-ILP. However, as noticed by theauthors of NLRL, this distribution does not favor the emer-gence of interpretable policies: the reasoning part has noincentive to use values near 0 or 1. For instance, with a lowtemperature in the softmax, all the values in the reasoningpart can be very close. Instead, Jiang & Luo (2019) pro-posed to normalize the reasoning part by its sum to obtainthe distribution over actions. In our experiments, we haveused both distributions depending on the environment. During evaluation and deployment, both the time and spacecomplexity will increase quickly ( O p m B q ) as the numberof objects increases. To speed up inference and have aninterpretable model, we post-process the trained model toextract the logical formula instead of using it directly. Thiscan be done recursively from the output of the model. Foreach used module, we replace the Gumbel-softmax (9) byan argmax to choose the predicates deterministically. Thefuzzy operations can then be replaced by their correspondingBoolean ones. The extracted interpretable model can thenoperate on Boolean tensors, which further saves space andcomputation time. ifferentiable Logic Machines Table 1: Success rates (%) of dNL-ILP, B ILP, NLM, and DLM on the family tree and graph reasoning tasks.dNL-ILP B ILP NLM DLM (Ours)Family Tree m “ m “ m “ m “ m “ m “ m “ m “ HasFather

100 100 100 100 100 100 100 100

HasSister

100 100 100 100 100 100 100 100

IsGrandparent

100 100 100 100 100 100 100 100

IsUncle .

32 96 .

77 100 100 100 100 100 100

IsMGUncle . N/A

100 100 100 100 100 100

Graph m “ m “ m “ m “ m “ m “ m “ m “ AdjacentToRed

100 100 100 100 100 100 100 100 .

36 85 .

30 100 100 100 100 100 100 . N/A

100 100 100 100 100 100 .

00 78 .

44 100 100 100 100 100 100 .

39 8 . N/A N/A

100 100 100 100

5. Experimental Results

In this section, we experimentally compare our architec-ture with previous state-of-the-art methods on ILP and RLtasks. For ILP, we evaluate B ILP (Evans & Grefenstette,2018), NLM (Dong et al., 2019), dNL-ILP (Payani & Fekri,2019a), and our architecture DLM on the family tree andgraph reasoning tasks used in NLM. We did not include theapproaches in multi-hop reasoning (Yang et al., 2017; Yang& Song, 2020) because although they can scale well, therules they can learn are much less expressive, which preventthem to solve any complex ILP tasks in an interpretableway. Other differential architectures such as MEM-NN(Sukhbaatar et al., 2015) or DNC (Graves et al., 2016) arealso left out, since they have been shown to be inferior onILP tasks compared to NLM and they furthermore do notprovide any interpretable solutions. For RL, we compareDLM with the best baselines as measured on the ILP tasks,namely NLM and NLRL (Jiang & Luo, 2019), which is anextension of B ILP to RL, on several variants of blocks worldtasks from NLRL and NLM, in addition to two other tasks

Sorting and

Path from NLM. More details about the ILPand RL tasks are given below and in Appendix A. The spec-iﬁcations of the computers used for training are provided inAppendix B.1.The ﬁrst two series of experiments demonstrate how well ourmethod performs on ILP and RL tasks in terms of successrates, in terms of computational times and memory usageduring training and testing compared to the other baselines.The last series of experiments is an ablation study that justi-ﬁes the different components (i.e., critic, Gumbel-softmax,dropout, policy distribution) of our method.

Since the authors of B ILP did not release their source code,we use the same implementation of B ILP as in NLRL. For NLM and dNL-ILP, we use the source codes shared by theirauthors.

Task Performance

For the ILP tasks, we report in Ta-ble 1 the success rates of the different methods on twodomains: family tree and graph. In the family tree domain,different tasks are considered corresponding to different tar-get predicates to be learned from an input graph wherenodes representing individuals are connected with rela-tions:

IsM other p X, Y q , IsF ather p X, Y q , IsSon p X, Y q ,and IsDaughter p X, Y q . The target predicates are HasF ather , HasSister , IsGrandP arent , IsU ncle ,and

IsM GU ncle (i.e., maternal great uncle). In thegraph domain, the different target predicates to be learnedfrom an input graph are

AdjacentT oRed , - Connectivity , - Connectivity , - OutDegree , - OutDegree (See Ap-pendix A.1 for deﬁnitions).The success rates are computed as the average over 250 ran-dom instances (i.e., family tree or graph) of the best modelover 10 models trained over different seeds. We report inAppendix C the percentage of seeds that succeeded in ourexperiments. Depending on tasks, our approach succeeds atleast 2/3 of seeds.In Table 1, we report the performance results of B ILP andNLM as given by Dong et al. (2019). For dNL-ILP, Payani &Fekri (2019a) did not evaluate their method in any standardILP tasks. Using their source code, we did our best toﬁnd the best set of hyperparameters (see Appendix B.3)for each ILP task. N/A means that the method ran outof memory. For dNL-ILP, the memory issue comes fromthat, both learning auxiliary predicates and increasing thenumber of variables of predicates will increase memoryconsumption sharply with growing number of nodes.(seedetails in Appendix B.3.1).The experimental results demonstrate that previous inter-pretable methods dNL-ILP and B ILP do not scale to difﬁcult ifferentiable Logic Machines × T i m e ( s ) dNL-ILPNLMDLM (Ours) 3 4 5 6 7 8 9 10 11 12persons ( × M e m o r y ( M i B ) dNL-ILPNLMDLM (Ours)5 10 15 20 25 30 35persons ( × T i m e ( s ) dNL-ILPNLMDLM (Ours) 5 10 15 20 25 30 35persons ( × M e m o r y ( M i B ) dNL-ILPNLMDLM (Ours) Figure 3: Comparison during test in grandparent (top) and 2-outdegree (bottom): (left) Computational time; (right) Memoryusage. On 2-outdegree NLM is rapidly out of memory.

ILP tasks and to larger number of objects. However, ourmethod can solve all the ILP tasks like NLM, while ourmethod can in addition provide an interpretable rule in con-trast to NLM.

Computational Performance

We compare now the dif-ferent algorithms with respect to computational times andmemory usage during training (see Table 8 in Appendix)and testing (see Figure 3).

We evaluate NLRL, NLM and our method DLM on 6 RLdomains: On the ﬁrst three,

Stack , Unstack , and On (Jiang &Luo, 2019), the agent is trained to learn the binary predicate move p X, Y q which moves block X on block (or ﬂoor) Y .The observable predicates are: isF loor p X q , top p X q , and on p X, Y q with an additional predicate onGoal p X, Y q forthe On task only. In Stack , the agent needs to stack all theblocks whatever their order. In

Unstack , the agent needs toput all the blocks on the ﬂoor. In On , the agent needs toreach the goal speciﬁed by onGoal . The last three domainsare Sorting , Path and

Blocksworld (Dong et al., 2019). In

Sorting , the agent must learn swap p X, Y q where X and Y are two elements of a list to sort. The binary observable pred-icates are smallerIndex , sameIndex , greaterIndex , smallerV alue , sameV alue and greaterV alue . In Path ,the agent is given a graph as a binary predicate with asource node and a target node as two unary predicates.It must learn the shortest path with an unary predicate goT o p X q where X is the destination node. In Blocksworld ,it also learns move p X, Y q . This environment is the mostcomplex one, it features a target world and a source Table 2: Success rates (%) of NLRL, NLM, and DLM on RL tasks.NLRL NLM nIDLM DLMUnstack swap top 2

100 100 100 100

Stack swap right 2

100 100 100 100

On swap top 2

100 100 100 100 swap middle 2

100 100 100 100

Sorting m “

10 97 100 100 100 M “ N/A

100 100 100

Path m “ N/A

100 100 81 M “

50 100 100 16

Blocks m “ N/A

100 100 ´ world M “

50 100 100 ´ world with numbered blocks, which makes the numberof constants to be p m ` q where m is the number ofblocks and corresponds to the ﬂoor. The agent is re-warded if both worlds match exactly. The binary observ-able predicates are sameW orldID , smallerW orldID , largerW orldID , sameID , smallerID , largerID , lef t , sameX , right , below , sameY , and above .All those domains are sparse reward RL problems. Sincethe ﬁrst three domains are relatively simple, they can betrained and evaluated on ﬁxed instances with a ﬁxed numberof blocks. In contrast, for the last three domains, the trainingand testing instances are generated randomly. Those lastthree domains, which are much harder than the ﬁrst three,also require training with curriculum learning, which wasalso used by Dong et al. (2019). The difﬁculty of a lesson isdeﬁned by the number of objects. The maximum difﬁcultyis set to m “ . Further details about the training withcurriculum learning are provided in Appendix B.2. Aftertraining, we evaluate the learned model on instances of size m “ , but also M “ to assess its generalizability.Table 2 provides the success rates of all the algorithms ondifferent RL tasks. For our architecture, we provide theresults whether we learn an interpretable policy (DLM) ornot (nIDLM). Each subrow of an RL task corresponds tosome instance(s) on which a trained model is evaluated.The experimental results show that NLRL does not scaleto harder RL tasks, as expected. Interestingly, we can alsoobserve that NLM does not always generalize well when ifferentiable Logic Machines trained only on one instance, while our method does nothave this issue. Thus, our architecture is always superior toNLRL and better than NLM on problems with few instances.On Sorting where we can learn a fully-interpretable policy,DLM is much better than NLM in terms of computationaltime and memory usage during testing.For the harder RL tasks (

Path , Blocksworld ), our methodcan reach similar performances with a non-interpretablepolicy, i.e., if we do not enforce the convergence to an inter-pretable policy. However, obtaining an interpretable policywith curriculum learning (CL) reveals to be difﬁcult: thereis a contradiction between learning to solve a lesson andconverging to a ﬁnal interpretable policy that generalizes.Indeed, on the one hand, we can learn an interpretable pol-icy for a lesson with a small number of objects, howeverthat policy will probably not generalize and keep trainingthat interpretable policy on the next lesson is hard since thesoftmaxes are nearly argmaxes. On the other hand, we canlearn to solve all the lessons with a non-interpretable policy,but that ﬁnal policy is hard to turn into an interpretable one,because of the many local optima in the loss landscape. Thistraining difﬁculty explains the lower success rates for DLMon

Path and this is why we did not manage to learn an inter-pretable policy for

Blocksworld . We leave for future workthe investigation of alternative RL training methods thatscale better than CL for such sparse-reward RL problems.

As an illustration for ILP, we provide the logic programlearned by our method on the task

IsGrandParent . For betterlegibility, we give more meaningful names to the learnedrules and simpliﬁed it by removing redundant parts:

IsChild p a, b q Ð IsSon p a, b q _ IsDaughter p a, b q IsGP C p a, b, c q Ð IsChild p c, a q ^ IsChild p b, c q IsGrandP arent p a, b qp a, b q Ð D C, IsGP C p a, b, C q We observe that the target predicate has been perfectlylearned. The logic program extracted from the trained DLMhas redundant parts (e.g., P ^ P ), because we used a rela-tively large architecture to ensure sufﬁcient expressivity. Re-dundancy could be reduced by using a smaller architecture,or the redundant parts could be removed by preprocessingthe extracted logic program, as we did. We provide thenon-simpliﬁed logic program in Appendix C.1.As an illustration for RL, we provide the simpliﬁed logicprogram learned by our method on the task On , which cor-responds to the output of the reasoning part: move p a, b q Ðp onGoal p a, b q _ isF loor p b qq ^ (cid:32) on p a, b q ^ top p a q . Us-ing this program, the decision-making part (stochastically)moves blocks to the ﬂoor and moves the good block on itsgoal position when it can. The complete logic program isprovided in Appendix C.1.Being able to ﬁnd solutions in a large architecture is a de- Table 3: Average ratio of successful seeds on all the tasks of graphsand family trees leading to a 100% success rate during testing withinterpretable rules. Score computed with 5 seeds for each task.

Successful seeds (%)Softmax without noise 58Constant β and dropout prob. 68DLM - Dropout 70Gaussian noise 80DLM 95sirable feature when the designer does not know the solu-tion before hand. Besides, note that we directly output aninterpretable logic program. In contrast, with previous inter-pretable models, logic rules with high weights are extractedto be inspected. However, those rules may not generalize be-cause weights are usually not concentrated on one element. In the following part, we report the performance of ourmodel by removing some of its features. We tried to trainour model by using only softmax without injecting noise,without decreasing the noise over time, without having adropout noise and ﬁnally by replacing the Gumbel distri-bution with a Gaussian one. In those experiments, duringevaluation, we are still using an argmax to retrieve the inter-pretable rules. Table 3 shows that all our choices help ourmodel to reach interpretable rules.We also performed an ablation study on the effects of usinga critic both in NLM and in DLM. In both architectures,using a critic improved learning speed, which demonstratesthe quality of our critic. We also evaluated different criticarchitectures, the GRU-based critic was found to performthe best. For space reasons, we provide further details withplots in Appendix C.3.

6. Conclusion

We proposed a novel neural-logic architecture that is capableof learning an interpretable logic program. It obtains state-of-the-art results for inductive logic programming tasks,while retaining interpretability and scaling much better. Forreinforcement learning tasks, it is superior to previous inter-pretable models. Compared to non-interpretable models, itcan achieve comparable results up to some complexity level,but it generalizes better on problems with few instances,and more importantly, it scales much better in terms ofcomputational times and memory usage during testing.Learning a fully-interpretable policy in RL for more com-plex tasks is a hard problem. Solving it calls for alterna-tive training methods that deal with sparse rewards (e.g., ifferentiable Logic Machines

Hindsight Experience Replay (Andrychowicz et al., 2017)),which we plan to explore next in our future work.

References

Andrychowicz, M., Wolski, F., Ray, A., Schneider, J., Fong,R., Welinder, P., McGrew, B., Tobin, J., Abbeel, P., andZaremba, W. Hindsight experience replay. In

NeurIPS ,pp. 5049–5059, 2017.Barredo Arrieta, A., D´ıaz-Rodr´ıguez, N., Ser, J. D., Ben-netot, A., Tabik, S., Barbado, A., Garcia, S., Gil-Lopez,S., Molina, D., Benjamins, R., and et al. Explainable arti-ﬁcial intelligence (XAI): Concepts, taxonomies, opportu-nities and challenges toward responsible ai.

InformationFusion , 58:82–115, 2020.Bellman, R. A Markovian decision process.

Journal ofmathematics and mechanics , pp. 679–684, 1957.Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D.,Bougares, F., Schwenk, H., and Bengio, Y. Learningphrase representations using rnn encoder-decoder for sta-tistical machine translation. In

EMNLP , 2014.Cropper, A., Dumanˇci´c, S., and Muggleton, S. H. Turning30: New ideas in inductive logic programming. In

IJCAI ,pp. 4833–4839, 2020.Dong, H., Mao, J., Lin, T., Wang, C., Li, L., and Zhou, D.Neural logic machines. In

ICLR , 2019.Evans, R. and Grefenstette, E. Learning explanatory rulesfrom noisy data.

Journal of AI Research , 2018.Graves, A., Wayne, G., Reynolds, M., Harley, T., Danihelka,I., Grabska-Barwi´nska, A., Colmenarejo, S., Grefenstette,E., Ramalho, T., Agapiou, J., and et al. Hybrid computingusing a neural network with dynamic external memory.

Nature , 538(7626):471–476, 2016.Jang, E., Gu, S., and Poole, B. Categorical reparameteriza-tion with gumbel-softmax. In

ICLR , 2017.Jiang, Z. and Luo, S. Neural logic reinforcement learning.In

ICML , 2019.Lake, B. M., Ullman, T. D., Tenenbaum, J. B., and Gersh-man, S. J. Building machines that learn and think likepeople.

Behavioral and Brain Sciences , 40, 2017.Lyu, D., Yang, F., Liu, B., and Gustafson, S. SDRL: Inter-pretable and data-efﬁcient deep reinforcement learningleveraging symbolic planning. In

AAAI , pp. 2970–2977,2019.Manhaeve, R., Dumanˇci´c, S., Kimmig, A., Demeester, T.,and De Raedt, L. Deepproblog: Neural probabilistic logicprogramming. In

NeurIPS , 2018. Marcus, G. Deep learning: A critical appraisal. 2018. arXiv:1801.00631.Payani, A. and Fekri, F. Inductive logic program-ming via differentiable deep neural logic networks. arXiv:1906.03523 , 2019a.Payani, A. and Fekri, F. Learning algorithms via neurallogic networks. arXiv:1904.01554 , 2019b.Payani, A. and Fekri, F. Incorporating relational backgroundknowledge into reinforcement learning via differentiableinductive logic programming. arXiv:2003.10386 , 2020.Raedt, L. d., Dumanˇci´c, S., Manhaeve, R., and Marra, G.From statistical relational to neuro-symbolic artiﬁcialintelligence. In

IJCAI , pp. 4943–4950, 2020.Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz,P. Trust region policy optimization. In

ICML , 2015.Schulman, J., Wolski, F., Dhariwal, P., Radford, A., andKlimov, O. Proximal policy optimization algorithms.

CoRR , abs/1707.06347, 2017.Sukhbaatar, S., Szlam, A., Weston, J., and Fergus, R. End-to-end memory networks. In

NeurIPS , 2015.Sutton, R. S. and Barto, A. G.

Reinforcement learning: Anintroduction . MIT press, 2018.Williams, R. J. Simple statistical gradient-following algo-rithms for connectionist reinforcement learning.

MachineLearning , 8(3):229–256, 1992.Yang, F., Yang, Z., and Cohen, W. W. Differentiable learningof logical rules for knowledge base reasoning. In

NeurIPS ,2017.Yang, Y. and Song, L. Learn to explain efﬁciently via neurallogic inductive learning. In

ICLR , 2020.Yi, K., Wu, J., Gan, C., Torralba, A., Kohli, P., and Tenen-baum, J. B. Neural-symbolic VQA: Disentangling reason-ing from vision and language understanding. In

NeurIPS ,2018. ifferentiable Logic Machines

A. Tasks Description

A.1. ILP

A.1.1. F

AMILY T REE

For family tree tasks, they have the same backgroung predicates:

IsF ather p X, Y q , IsM other p X, Y q , IsSon p X, Y q and IsDaughter p X, Y q . IsF ather p X, Y q is T rue when Y is X ’s father. The other three predicates have the similar meaning.• HasFather: HasF ather p X q is T rue when X has father. It can be expressed by: HasF ather p X q Ð D Y, IsF ather p X, Y q • HasSister: HasSister p X q is T rue when X has at least one sister. It can be expressed by: HasSister p X q Ð D Y, IsSister p X, Y q IsSister p X, Y q Ð D Z, p IsDaughter p Z, Y q ^

IsMother p X, Z qq • IsGrandparent: IsGrandparent p X, Y q is T rue when Y is X ’s grandparent. It can be expressed by: IsGrandparent p X, Y q Ð D Z, pp IsSon p Y, Z q ^

IsF ather p X, Z qq _p

IsDaughter p Y, Z q ^

IsMother p X, Z qqq • IsUncle:

IsU ncle p X, Y q is T rue when Y is X ’s uncle. It can be expressed by: IsUncle p X, Y q Ð D Z, pp IsMother p X, Z q ^

IsBrother p Z, Y qqq _p

IsF ather p X, Z q ^

IsBrother p Z, Y qq IsBrother p X, Y q Ð D Z, pp IsSon p Z, Y q ^

IsSon p Z, X qq _p

IsSon p Z, Y q ^

IsDaughter p Z, X qqq • IsMGUncle:

IsM GU ncle p X, Y q is T rue when Y is X ’s maternal great uncle. It can be expressed by: IsMGUncle p X, Y q Ð D Z, p IsMother p X, Z q ^

IsUncle p Z, Y qq A.1.2. G

RAPH

For graph tasks,

HasEdge task have the same background predicates:

HasEdge p X, Y q . HasEdge p X, Y q is T rue whenthere is an undirected edge between node X and node Y .• AdjacentToRed: AdjacentT oRed p X q is T rue if node X has an edge with a red node. In this task, it also use Colors p X, Y q as another background predicate besides HasEdge p X, Y q . Color p X, Y q is T rue when the color ofnode X is Y . It can be expressed by: AdjacentT oRed p X q Ð D Y, p HasEdge p X, Y q ^

Color p Y, red qq • 4-Connectivity: ´ Connectivity p X, Y q is T rue if there exists a path between node X and node Y within 4 edges.It can be expressed by: ´ Connectivity p X, Y q Ð D Z, p HasEdge p X, Y q _

Distance p X, Y q _p

Distance p X, Z q ^

HasEdge p Z, Y qq _p

Distance p X, Z q ^

Distance p Z, Y qqq

Distance p X, Y q Ð D Z, p HasEdge p X, Z q ^

HasEdge p Z, Y qq • 6-Connectivity: ´ Connectivity p X, Y q is T rue if there exists a path between node X and node Y within 6 edges.It can be expressed by: ´ Connectivity p X, Y q Ð D Z, p HasEdge p X, Y q _

Distance p X, Y q _

Distance p X, Y q _p

Distance p X, Z q ^

Distance p Z, Y qq _p

Distance p X, Z q ^

Distance p Z, Y qq _p

Distance p X, Z q ^

Distance p Z, Y qqq

Distance p X, Y q Ð D Z, p HasEdge p X, Z q ^

HasEdge p Z, Y qq Distance p X, Y q Ð D Z, p HasEdge p X, Z q ^

Distance p Z, Y qq ifferentiable Logic Machines • 1-Outdegree: ´ Outdegree p X q is T rue if there the outdegree of node X is exactly 1. It can be expressed by: ´ Outdegree p X q Ð D Y, @ Z, p HasEdge p X, Y q ^ (cid:32)

HasEdge p X, Z qq • 2-Outdegree: ´ Outdegree p X q is T rue if there the outdegree of node X is exactly 2. It can be expressed by: ´ Outdegree p X q Ð D Y, @ Z, K p HasEdge p X, Y q ^ (cid:32)

HasEdge p X, Z q ^ (cid:32)

HasEdge p X, K qq A.2. RL

B. Experimental Set-Up

B.1. Computer Speciﬁcations

The experiments are ran by one thread on the computer with speciﬁcations shown in Table 4.

Table 4: Computer speciﬁcation.

Attribute SpeciﬁcationCPU 2 ˆ Intel(R) Xeon(R) CPU E5-2678 v3Threads 48Memory 64GB (4 ˆ ˆ GeForce GTX 1080 Ti

B.2. Curriculum Learning

Every 10 epochs, we test the performance of the agent over 100 instances. If he reaches 100% then it can move to the nextlesson. Our agents are trained only on one lesson at a time.

B.3. Hyperparameters

B.3.1. H

YPERPARAMETERS FOR D

NL-ILPFor dNL-ILP, we train each task with at most , iterations. Moreover, at each iteration, we use a new family tree orgraph as training data, which is randomly generated from the same data generator in NLM and DLM, as backgrounds fortraining the model.In dNL-ILP, a rule is deﬁned as a disjunction of terms . Arguments are atoms shown in target predicate.

V ariables areatoms shown in rules except arguments. Take

IsGrandparent task as an example:

IsGrandparent p X, Y q Ð D Z, pp IsSon p Y, Z q ^

IsF ather p X, Z qq _ p

IsDaughter p Y, Z q ^

IsM other p X, Z qqq (10)Here are 2 terms, 2 arguments and 1 variable in rule(10).For task

HasF ather , IsGrandparent and

AdjacentT oRed , dNL-ILP can achieve accuracy without learning anyauxiliary predicates. For other ILP tasks, it has to learn at least one auxiliary predicate to induct the target. In practice, theperformance decrease with increasing number of auxiliary predicates or variables, therefore here we only use at most oneauxiliary predicate and at most three variables. Table 5 shows hyperparameters for deﬁning rules in dNL-ILP that achievethe best performance. ifferentiable Logic Machines

Table 5: Hyperparameters for deﬁning dNL-ILP rules. Inference step is the number of forward chain. N arg , N var and N terms means thenumber of arguments, variables and terms, respectively. F am is the amalgamate function, refer to (Payani & Fekri, 2019a). Task Inferencestep Auxiliary Target N arg N var N terms F am N arg N var N terms F am HasFather 1 ´ ´ ´ ´ ´ ´ ´ ´ ´ ´ ´ ´ ´ ´ ´ ´

YPERPARAMETERS FOR

DLMWe have used ADAM with learning rate of . , 5 trajectories, with a clip of . in the PPO loss, λ “ . in GAE and avalue function clipping of . . We always used the NLRL distribution to get the DLM performance. To get the nIDLM onPath and Blockworld, we used a softmax with a temperature of . . Table 6: Architectures for DLM.

Depth Breadth n O n A IO residualFamily Tree HasFather 5 3 8 2HasSister 5 3 8 2IsGrandparent 5 3 8 2IsUncle 5 3 8 2IsMGUncle 9 3 8 2Graph AdjacentToRed 5 3 8 24-Connectivity 5 3 8 26-Connectivity 9 3 8 21-OutDegree 5 3 8 22-OutDegree 7 4 8 2NLRL tasks Unstack 4 2 8 2Stack 4 2 8 2On 4 2 8 2General Algorithm Sorting 4 3 8 2Path (nIDLM) 8 3 8 2 (cid:88)

Path 6 3 16 4Blocksworld (nIDLM) 8 2 8 2 (cid:88) ifferentiable Logic Machines

C. More details on Experiments

C.1. Examples of Interpretable Rules or Policies

As illustration for ILP, we provide the logic program learned by our method on the task

IsGrandParent . We used L “ layers, B “ breadth, n A “ atoms, and n O “ outputs per logic modules. For better legibility, we give more meaningfulnames to the learned rules and remove the expansions and reductions: IsChild p a, b q Ð IsSon p a, b q _ IsDaughter p a, b q IsChild p a, b q Ð IsSon p a, b q _ IsDaughter p a, b q IsGCP p a, b, c q Ð IsChild p a, c q ^ IsChild p c, b q IsGP C p a, b, c q Ð IsChild p c, a q ^ IsChild p b, c q IsGP C p a, b, c q Ð IsGP C p a, b, c q _ isGCP p b, a, c q IsGP p a, b q Ð D C, IsGP C p a, b, C q ^ D C, IsGP C p a, b, C q IsGrandP arent p a, b q Ð IsGP p a, b q ^ IsGP p a, b q We observe that the target predicates has been learned but the logic program has redundant parts, which could have beenavoided if we had used a smaller architecture. The redundant part could also be removed by preprocessing the extractedlogic program. Being able to ﬁnd solutions in a large architecture is a desirable feature when the designer does not know thesolution before hand.As illustration for RL, we provide the logic program learned by our method on the task On , which corresponds to the outputof the reasoning part: pred p a, b q Ð onGoal p b, a q _ isF loor p a q pred p a, b q Ð (cid:32) on p a, b q ^ top p a q pred p a, b q Ð pred p b, a q ^ pred p a, b q pred p a, b q Ð pred p b, a q ^ pred p b, a q move p a, b q Ð pred p a, b q ^ pred p a, b q Using this program, the decision-making part (stochastically) moves blocks to the ﬂoor and moves the good block on itsgoal position when it can.Here are other examples on the family tree domain: pred p a q Ð D B, IsF ather p a, B q ^ D B, IsMother p a, B q pred p a q Ð D B, IsF ather p a, B q _ D B, IsMother p a, B q pred p a q Ð D B, IsMother p a, B q _ D B, IsMother p a, B q pred p a q Ð pred p a q _ pred p a q pred p a q Ð pred p a q _ pred p a q pred p a q Ð pred p a q ^ pred p a q pred p a q Ð pred p a q _ pred p a q pred p a q Ð pred p a q _ pred p a q HasF ather p a q Ð pred p a q ^ pred p a q pred p a, b q Ð IsDaughter p b, a q ^ IsMother p a, b q pred p a, b q Ð IsDaughter p b, a q ^ IsF ather p a, b q pred p a, b q Ð IsDaughter p b, a q _ IsMother p a, b q pred p a, b q Ð D C, pred p b, a, C q ^ D C, pred p b, a, C q pred p a, b q Ð D C, pred p b, a, C q ^ D C, pred p b, a, C q pred p a q Ð D B, pred p a, B q ^ D B, pred p a, B q pred p a q Ð pred p a q _ pred p a q HasSister p a q Ð pred p a q ^ pred p a q ifferentiable Logic Machines pred p a, b q Ð IsSon p b, a q ^ IsSon p b, a q pred p a, b q Ð IsDaughter p b, a q _ IsSon p b, a q pred p a, b q Ð (cid:32) IsSon p b, a q _ IsMother p b, a q pred p a, b q Ð IsF ather p a, b q ^ IsF ather p a, b q pred p a, b, c q Ð (cid:32) IsMother p a, b q ^ IsMother p a, b q pred p a, b q Ð (cid:32) IsSon p b, a q ^ IsDaughter p b, a q pred p a q Ð D B, pred p a, B q _ D B, pred p a, B q pred p a, b q Ð D C, pred p a, b, C q _ D C, pred p a, b, C q pred p a, b q Ð (cid:32)D C, pred p b, a, C q ^ D C, pred p b, a, C q pred p a, b, c q Ð (cid:32) pred p b, a q _ pred p a, b q pred p a, b q Ð pred p a, b q ^ pred p b, a q pred p a, b, c q Ð (cid:32) pred p a, b q _ pred p b, c, a q pred p a, b q Ð (cid:32) pred p a, b q ^ @ C, pred p a, b, C q IsUncle p a, b q Ð pred p a, b q ^ pred p a, b q C.2. Percentage of successful seeds

Table 7: Percentage of seeds that reach 100 % over 10 seeds. dNL-ILP DLMFamily Tree HasFather 100 100HasSister 40 100IsGrandparent 80 100IsUncle { { { { { { C.3. Ablation Study: Critic ifferentiable Logic Machines

Table 8: Computational cost of dNL-ILP, B ILP, NLM, and DLM on the family tree and graph reasoning tasks.dNL-ILP B ILP NLM DLM (Ours)IsGrandparent T M T M T M T MTraining

201 30 1357 70 1629 382 m “

10 1 27 4 24 1 2 m “

20 1 30 4 70 1 24 m “

30 5 42 5 188 1 24 m “

40 13 99 5 414 2 74 m “

50 31 198 7 820 3 124 m “

60 65 358 8 1341 3 226 m “

70 119 596 8 2089 4 344 m “

80 192 932 11 3123 6 500 m “

90 303 1390 13 4434 8 724 m “

100 464 1994 17 6093 10 1002 m “

110 656 2771 21 8079 13 1321 m “

120 915 3751 27 10056 16 1710 m “

130 1247 4964

N/A N/A

19 2161 dNL-ILP B ILP NLM DLM (Ours)2-Outdegree T M T M T M T MTraining

966 22 ´ ´ m “ ´ ´ m “

10 1 22 ´ ´ m “

15 3 42 ´ ´ m “

20 9 112 ´ ´

N/A N/A m “

25 24 314 ´ ´

N/A N/A

10 1751 m “

30 47 682 ´ ´

N/A N/A

19 3594 m “

35 93 1332 ´ ´

N/A N/A

33 6666 T: time (s), M: Memory (MB). DLM used depth 4, breadth 3 for IsGrandparent, and depth 6,breadth 4 for 2-Outdegree.Table 9: Average sum of the rewards of NLRL, NLM, and DLM on RL tasks.

NLRL NLM nIDLM DLMUnstack test on 5 variations .

914 0 .

920 0 .

920 0 . Stack test on 5 variations .

877 0 .

920 0 .

920 0 . On test on 5 variations .

885 0 .

896 0 .

896 0 . Sorting m “

10 0 .

866 0 .

939 0 .

933 0 . M “ N/A .

556 0 .

367 0 . Path m “ N/A .

970 0 .

970 0 . M “

50 0 .

970 0 .

970 0 . Blocks m “ N/A .

888 0 . ´ world M “

50 0 .