Learning Logic Programs by Explaining Failures
Rolf Morel, Andrew Cropper
University of Oxford
{rolf.morel, andrew.cropper}@cs.ox.ac.uk

Abstract
Scientists form hypotheses and experimentally test them. If a hypothesis fails (is refuted), scientists try to explain the failure to eliminate other hypotheses. We introduce similar explanation techniques for inductive logic programming (ILP). We build on the ILP approach learning from failures. Given a hypothesis represented as a logic program, we test it on examples. If a hypothesis fails, we identify clauses and literals responsible for the failure. By explaining failures, we can eliminate other hypotheses that will provably fail. We introduce a technique for failure explanation based on analysing SLD-trees. We experimentally evaluate failure explanation in the POPPER ILP system. Our results show that explaining failures can drastically reduce learning times.
Introduction

The process of forming hypotheses, testing them on data, analysing the results, and forming new hypotheses is the foundation of the scientific method [Popper, 2002]. For instance, imagine that Alice is a chemist trying to synthesise a vial of the compound octiron from the substances thaum and slood. To do so, Alice can perform actions, such as fill a vial with a substance (fill(Vial,Sub)) or mix two vials (mix(V1,V2,V3)). One such hypothesis is:

synth(A,B,C) ← fill(V1,A), fill(V1,B), mix(V1,V1,C)

This hypothesis says that to synthesise compound C, fill vial V1 with substance A, fill vial V1 with substance B, and then mix vial V1 with itself to form C.

When Alice experimentally tests this hypothesis she finds that it fails. From this failure, Alice deduces that hypotheses that add more actions (i.e. literals) will also fail (C1). Alice can, however, go further and explain the failure as "vial V1 cannot be filled a second time", which allows her to deduce that any hypothesis that includes fill(V1,A) and fill(V1,B) will fail (C2). Clearly, conclusion C2 allows Alice to eliminate more hypotheses than C1; that is, by explaining failures Alice can better form new hypotheses.

Our main contribution is to introduce similar explanation techniques for inductive program synthesis, where the goal is to machine learn computer programs from data [Shapiro, 1983]. We build on the inductive logic programming (ILP) approach learning from failures and its implementation called POPPER [Cropper and Morel, 2021]. POPPER learns logic programs by iteratively generating and testing hypotheses. When a hypothesis fails on training examples, POPPER examines the failure to learn constraints that eliminate hypotheses that will provably fail as well. A limitation of POPPER is that it only derives constraints based on entire hypotheses (as Alice does for C1) and cannot explain why a hypothesis fails (cannot reason as Alice does for C2).

We address this limitation by explaining failures. The idea is to analyse a failed hypothesis to identify sub-programs that also fail. We show that, by identifying failing sub-programs and generating constraints from them, we can eliminate more hypotheses, which can in turn improve learning performance. By the Blumer bound [Blumer et al., 1987], searching a smaller hypothesis space should result in fewer errors compared to a larger space, assuming a solution is in both spaces.

Our approach builds on algorithmic debugging [Caballero et al., 2017]. We identify sub-programs of hypotheses by analysing paths in SLD-trees. In similar work [Shapiro, 1983; Law, 2018], only entire clauses can make up these sub-programs. By contrast, we can identify literals responsible for a failure within a clause. We extend POPPER with failure explanation and experimentally show that failure explanation can significantly improve learning performance.

Our contributions are:
• We relate logic programs that fail on examples to their failing sub-programs. For wrong answers we identify clauses. For missing answers we additionally identify literals within clauses.
• We show that hypotheses that are specialisations and generalisations of failing sub-programs can be eliminated.
• We prove that hypothesis space pruning based on sub-programs is more effective than pruning without them.
• We introduce an SLD-tree based technique for failure explanation. We introduce POPPERX, which adds the ability to explain failures to the POPPER ILP system.
• We experimentally show that failure explanation can drastically reduce (i) hypothesis space exploration and (ii) learning times.

Related Work
Inductive program synthesis systems automatically generate computer programs from specifications, typically input/output examples [Shapiro, 1983]. This topic interests researchers from many areas of machine learning, including Bayesian inference [Silver et al., 2020] and neural networks [Ellis et al., 2018]. We focus on ILP techniques, which induce logic programs [Muggleton, 1991]. In contrast to neural approaches, ILP techniques can generalise from few examples [Cropper et al., 2020]. Moreover, because ILP uses logic programming as a uniform representation for background knowledge (BK), examples, and hypotheses, it can be applied to arbitrary domains without the need for hand-crafted, domain-specific neural architectures. Finally, due to logic's similarity to natural language, ILP learns comprehensible hypotheses.

Many ILP systems [Muggleton, 1995; Blockeel and Raedt, 1998; Srinivasan, 2001; Ahlgren and Yuen, 2013; Inoue et al., 2014; Schüller and Benz, 2018; Law et al., 2020] either cannot or struggle to learn recursive programs. By contrast, POPPERX can learn recursive programs and thus programs that generalise to input sizes it was not trained on. Compared to many modern ILP systems [Law, 2018; Evans and Grefenstette, 2018; Kaminski et al., 2019; Evans et al., 2021], POPPERX supports large and infinite domains, which is important when reasoning about complex data structures, such as lists. Compared to many state-of-the-art systems [Cropper and Muggleton, 2016; Evans and Grefenstette, 2018; Kaminski et al., 2019; Hocquette and Muggleton, 2020; Patsantzis and Muggleton, 2021], POPPERX does not need metarules (program templates) to restrict the hypothesis space.

Algorithmic debugging [Caballero et al., 2017] explains failures in terms of sub-programs. Similarly, in databases, provenance is used to explain query results [Cheney et al., 2009]. In seminal work on logic program synthesis, Shapiro [1983] analysed debugging trees to identify failing clauses. By contrast, our failure analysis reasons about concrete SLD-trees. Both ILASP3 [Law, 2018] and the remarkably similar ProSynth [Raghothaman et al., 2020] induce logic programs by precomputing every possible clause and then using a select-test-and-constrain loop. This precompute step is infeasible for clauses with many literals and restricts their failure explanation to clauses. By contrast, POPPERX does not precompute clauses and can identify clauses, and literals within clauses, responsible for failure.

POPPER [Cropper and Morel, 2021] learns first-order constraints, which can be likened to conflict-driven clause learning [Silva et al., 2009]. Failure explanation in POPPERX can therefore be viewed as enabling POPPER to detect smaller conflicts, yielding smaller yet more general constraints that prune more effectively.
We now reiterate the LFF problem [Cropper and Morel, 2021] as well as the relation between constraints and failed hypotheses. We then introduce failure explanation in terms of sub-programs. We assume standard logic programming definitions [Lloyd, 2012].
To define the LFF problem, we first define predicate declarations and hypothesis constraints. LFF uses predicate declarations as a form of language bias, defining which predicate symbols may appear in a hypothesis. A predicate declaration is a ground atom of the form head_pred(p,a) or body_pred(p,a), where p is a predicate symbol of arity a. Given a set of predicate declarations D, a definite clause C is declaration consistent when two conditions hold: (i) if p/m is the predicate symbol in the head of C, then head_pred(p,m) is in D, and (ii) for all predicate symbols q/n in the body of C, body_pred(q,n) is in D.

To restrict the hypothesis space, LFF uses hypothesis constraints. Let L be a language that defines hypotheses, i.e. a meta-language. Then a hypothesis constraint is a constraint expressed in L. Let C be a set of hypothesis constraints written in a language L. A set of definite clauses H is consistent with C if, when written in L, H does not violate any constraint in C.

We now define the LFF problem, which is based on the ILP learning from entailment setting [Raedt, 2008]:

Definition 3.1 (LFF input). A LFF input is a tuple (E+, E−, B, D, C) where E+ and E− are sets of ground atoms denoting positive and negative examples respectively; B is a Horn program denoting background knowledge; D is a set of predicate declarations; and C is a set of hypothesis constraints.

A definite program is a hypothesis when it is consistent with both D and C. We denote the set of such hypotheses as H_{D,C}.
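To make an LFF input concrete, the following is a minimal sketch, in Prolog, of how the droplast example from Appendix A could be encoded. The pos/1 and neg/1 wrappers are assumed names for illustration, not POPPER's actual input format.

% Predicate declarations D: which symbols may appear in a hypothesis.
head_pred(droplast,2).
body_pred(empty,1).
body_pred(head,2).
body_pred(tail,2).
body_pred(cons,3).

% Positive and negative examples E+ and E- (pos/1 and neg/1 are assumed wrappers).
pos(droplast([1,2,3],[1,2])).
pos(droplast([1,2],[1])).
neg(droplast([1,2],[])).

% Background knowledge B.
empty([]).
head([H|_],H).
tail([_|T],T).
cons(H,T,[H|T]).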
We now define an LFF solution:

Definition 3.2 (LFF solution). Given an input tuple (E+, E−, B, D, C), a hypothesis H ∈ H_{D,C} is a solution when H is complete (∀e ∈ E+, B ∪ H ⊨ e) and consistent (∀e ∈ E−, B ∪ H ⊭ e).

If a hypothesis is not a solution then it is a failure and a failed hypothesis. A hypothesis H is incomplete when ∃e+ ∈ E+, H ∪ B ⊭ e+. A hypothesis H is inconsistent when ∃e− ∈ E−, H ∪ B ⊨ e−. A worked example of LFF is included in Appendix A.

The key idea of LFF is to learn constraints from failed hypotheses. Cropper and Morel [2021] introduce constraints based on subsumption [Plotkin, 1971] and theory subsumption [Midelfart, 1999]. A clause C1 subsumes a clause C2 if and only if there exists a substitution θ such that C1θ ⊆ C2. A clausal theory T1 subsumes a clausal theory T2, denoted T1 ⪯ T2, if and only if ∀C2 ∈ T2, ∃C1 ∈ T1 such that C1 subsumes C2. Subsumption implies entailment, i.e. if T1 ⪯ T2 then T1 ⊨ T2. A clausal theory T1 is a specialisation of a clausal theory T2 if and only if T2 ⪯ T1. A clausal theory T1 is a generalisation of a clausal theory T2 if and only if T1 ⪯ T2.

Hypothesis constraints prune the hypothesis space. Generalisation constraints only prune generalisations of inconsistent hypotheses. Specialisation constraints only prune specialisations of incomplete hypotheses. Generalisation and specialisation constraints are sound in that they do not prune solutions [Cropper and Morel, 2021].
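The following is a minimal Prolog sketch, not the authors' implementation, of the subsumption checks defined above. A clause is represented as a list of literals (head first) and a theory as a list of clauses.

% clause_subsumes(+C1, +C2): some substitution Theta makes C1*Theta a subset of C2.
clause_subsumes(C1, C2) :-
    copy_term(C1-C2, C1a-C2a),
    numbervars(C2a, 0, _),          % freeze C2's variables so they act as constants
    literals_match(C1a, C2a).

% every literal of C1 unifies with some literal of the frozen C2
literals_match([], _).
literals_match([L|Ls], C2) :-
    member(L, C2),
    literals_match(Ls, C2).

% theory_subsumes(+T1, +T2): every clause of T2 is subsumed by some clause
% of T1, so T1 is a generalisation of T2 and T2 a specialisation of T1.
theory_subsumes(T1, T2) :-
    forall(member(C2, T2),
           (member(C1, T1), clause_subsumes(C1, C2))).

For instance, a clause whose only body literal is empty(A) subsumes the clause droplast(A,B) ← empty(A),tail(A,B):

?- clause_subsumes([droplast(A,B), empty(A)], [droplast(X,Y), empty(X), tail(X,Y)]).
true.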
Missing and Incorrect Answers

We follow Shapiro [1983] in identifying examples as responsible for the failure of a hypothesis H given background knowledge B. A positive example e+ is a missing answer when B ∪ H ⊭ e+. Similarly, a negative example e− is an incorrect answer when B ∪ H ⊨ e−. We relate missing and incorrect answers to specialisations and generalisations. If H has a missing answer e+, then each specialisation of H has e+ as a missing answer, so the specialisations of H are incomplete and can be eliminated. If H has an incorrect answer e−, then each generalisation of H has e− as an incorrect answer, so the generalisations of H are inconsistent and can be eliminated.

Example 1 (Missing answers and specialisations). Consider the following droplast hypothesis:

H = { droplast(A,B) ← empty(A),tail(A,B) }

Both droplast([1,2,3],[1,2]) and droplast([1,2],[1]) are missing answers of H, so H is incomplete and we can prune its specialisations, e.g. programs that add literals to the clause.

Example 2 (Incorrect answers and generalisations). Consider the hypothesis H:

H = { droplast(A,B) ← tail(A,C),tail(C,B)
      droplast(A,B) ← tail(A,B) }

In addition to being incomplete, H is inconsistent because of the incorrect answer droplast([1,2],[]), so we can prune the generalisations of H, e.g. programs with additional clauses.

We now extend LFF by explaining failures in terms of failing sub-programs. The idea is to identify sub-programs that cause the failure. Consider the following two examples:
Example 3 (Explain incompleteness). Consider the positive example e+ = droplast([1,2],[1]) and the hypothesis H from Example 1. An explanation for why H does not entail e+ is that empty([1,2]) fails. It follows that the program H′ = { droplast(A,B) ← empty(A) } has e+ as a missing answer and is incomplete, so we can prune all specialisations of it.

Example 4 (Explain inconsistency). Consider the negative example e− = droplast([1,2],[]) and the hypothesis H from Example 2. The first clause of H always entails e−, regardless of the other clauses in the hypothesis. It follows that the program H′ = { droplast(A,B) ← tail(A,C),tail(C,B) } has e− as an incorrect answer and is inconsistent, so we can prune all generalisations of it.

We now define a sub-program:

Definition 3.3 (Sub-program). The definite program P is a sub-program of the definite program Q if and only if either:
• P is the empty set, or
• there exist Cp ∈ P and Cq ∈ Q such that Cp ⊆ Cq and P \ {Cp} is a sub-program of Q \ {Cq}.

In functional program synthesis, sub-programs (sometimes called partial programs) are typically defined by leaving out nodes in the parse tree of the original program [Feng et al., 2018]. Our definition generalises this idea by allowing for arbitrary orderings of clauses and literals.
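As a Prolog sketch of Definition 3.3 (assuming the same list-of-literals representation as above, and fixing literal order within a clause for simplicity, whereas the definition allows arbitrary orderings):

% sub_program(?P, +Q): every clause of P is obtained from a distinct
% clause of Q by dropping literals; the empty program is a sub-program.
sub_program([], _).
sub_program([Cp|Ps], Q) :-
    select(Cq, Q, Qrest),
    sub_clause(Cp, Cq),
    sub_program(Ps, Qrest).

% sub_clause(?C1, +C2): C1 is a sub-sequence of C2's literals.
sub_clause([], []).
sub_clause(C1, [_|C2]) :- sub_clause(C1, C2).
sub_clause([L|C1], [L|C2]) :- sub_clause(C1, C2).

For example, the failing program from Example 3 is a sub-program of the hypothesis from Example 1:

?- sub_program([[droplast(A,B), empty(A)]], [[droplast(A,B), empty(A), tail(A,B)]]).
true.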
We now define the failing sub-programs problem:

Definition 3.4 (Failing sub-programs problem). Given a definite program P and sets of examples E+ and E−, the failing sub-programs problem is to find all sub-programs of P that do not entail an example of E+ or entail an example of E−.

By definition, a failing sub-program is incomplete and/or inconsistent, so, by Section 3.2, we can always prune specialisations and/or generalisations of a failing sub-program.

Remark 1 (Undecidability). The failing sub-programs problem is undecidable in general, as deciding entailment can be reduced to it.

We show that sub-programs are effective at pruning:
Theorem 1 (Better pruning). Let H be a definite program that fails and P (≠ H) be a sub-program of H that fails. Specialisation and generalisation constraints for P can always achieve additional pruning versus those only for H.

Proof. Suppose H is a specialisation of P. If P is incomplete, then among the specialisations of P, which are all prunable, is H and its specialisations. If P is inconsistent, P's generalisations do not completely overlap with H's generalisations and specialisations (using that P ≠ H). Hence, pruning P's generalisations prunes programs not pruned by H. The case where H is a generalisation of P is analogous. In the remaining case, where H and P are not related by subsumption, it is immediate that the constraints derived for P prune a distinct part of the hypothesis space.

We now describe our failure explanation technique, which identifies sub-programs by identifying both clauses and literals within clauses responsible for failure. Subsequently we summarise the POPPER ILP system before introducing our extension of it: POPPERX.

In algorithmic debugging, missing and incorrect answers help characterise which parts of a debugging tree are wrong [Caballero et al., 2017]. Debugging trees can be seen as generalising SLD-trees, with the latter representing the search for a refutation [Nienhuys-Cheng and Wolf, 1997]. Exploiting their granularity, we analyse SLD-trees to address the failing sub-programs problem, identifying only a subset of the failing sub-programs.

A branch in an SLD-tree is a path from the root goal to a leaf. Each goal on a branch has a selected atom, on which resolution is performed to derive child goals. A branch that ends in an empty leaf is called successful, as such a path represents a refutation. Otherwise a branch is failing. Note that the selected atoms on a branch identify a subset of the literals of a program.

Let B be a Horn program, H be a hypothesis, and e be an atom. The SLD-tree T for B ∪ H ∪ {¬e}, with ¬e as the root, proves B ∪ H ⊨ e iff T contains a successful branch. Given a branch λ of T, we define the λ-sub-program of H. A literal L of H occurs in the λ-sub-program H′ if and only if L occurs as a selected atom in λ or L was used to produce a resolvent that occurs in λ. The former case is for literals in the body of clauses and the latter for head literals. Now consider the SLD-tree T′ for B ∪ H′ ∪ {¬e} with ¬e as root. As all literals necessary for λ occur in B ∪ H′, the branch λ must occur in T′ as well.

Suppose e− is an incorrect answer for hypothesis H. Then the SLD-tree for B ∪ H ∪ {¬e−}, with ¬e− as root, has a successful branch λ. The literals of H necessary for this branch are also present in the λ-sub-program H′, hence e− is also an incorrect answer of H′. Now suppose e+ is a missing answer of H. Let T be the SLD-tree for B ∪ H ∪ {¬e+}, with ¬e+ as root, and λ′ be any failing branch of T. The literals of H in λ′ are also present in the λ′-sub-program H′′. This is, however, insufficient for concluding that the SLD-tree for H′′ has no successful branch. Hence it is not immediate that e+ is a missing answer for H′′. In case H′′ is a specialisation of H, we can conclude that e+ is a missing answer.
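To make the λ-sub-program construction concrete, here is a minimal Prolog meta-interpreter sketch, an assumption for illustration rather than POPPERX's implementation, that proves an atom against a hypothesis plus background knowledge while recording which hypothesis body literals are selected on the branch. The names hyp_pred/2, hyp_clause/3 and prove/2 are illustrative.

% The hypothesis: clauses are hyp_clause(Id, Head, BodyLiterals);
% hyp_pred/2 declares which predicates the hypothesis defines.
hyp_pred(droplast, 2).
hyp_clause(c1, droplast(A,B), [tail(A,C), tail(C,B)]).
hyp_clause(c2, droplast(A,B), [tail(A,B)]).

% Background knowledge is ordinary Prolog.
tail([_|T], T).

% prove(+Atom, -Used): succeed iff BK plus hypothesis entails Atom, and
% return the hypothesis body literals selected on the successful branch,
% tagged with their clause id (most recently selected first).
prove(Atom, Used) :-
    prove(Atom, [], Used).

prove(Atom, Used0, Used) :-
    functor(Atom, P, N),
    (   hyp_pred(P, N)
    ->  hyp_clause(Id, Head, Body),
        copy_term(Head-Body, Atom-Body1),      % one SLD resolution step
        prove_body(Body1, Id, Used0, Used)
    ;   call(Atom),                            % delegate BK atoms to Prolog
        Used = Used0
    ).

prove_body([], _, Used, Used).
prove_body([L|Ls], Id, Used0, Used) :-
    prove(L, [Id-L|Used0], Used1),
    prove_body(Ls, Id, Used1, Used).

For the incorrect answer of Example 4, a successful branch only uses the first clause, so its λ-sub-program can be read off from Used:

?- prove(droplast([1,2],[]), Used).
Used = [c1-tail([2],[]), c1-tail([1,2],[2])].

Handling missing answers analogously requires recording the selected literals of failing branches, which we omit here for brevity.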
POPPER

POPPER tackles the LFF problem (Definition 3.1) using a generate, test, and constrain loop. A logical formula is constructed and maintained whose models correspond to Prolog programs. The first stage is to generate a model and convert it to a program. The program is tested on all positive and negative examples. The numbers of missing and incorrect answers determine whether specialisations and/or generalisations can be pruned. When a hypothesis fails, new hypothesis constraints (Section 3.2) are added to the formula, which eliminates models and thus prunes the hypothesis space. POPPER then loops back to the generate stage.

Smaller programs prune more effectively, which is partly why POPPER searches for hypotheses by their size (number of literals); the other reason is to find optimal solutions, i.e. those with the minimal number of literals. Yet there are many small programs, not considered well-formed by POPPER, that achieve significant, sound pruning. Consider the sub-program H′ = { droplast(A,B) ← empty(A) } from Example 3. POPPER does not generate H′ as it does not consider it a well-formed hypothesis (the head variable B does not occur in the body). Yet it is precisely because this sub-program has so few body literals that it is so effective at pruning specialisations.
POPPERX

We now introduce POPPERX, which extends POPPER with SLD-based failure explanation. Like POPPER, any generated hypothesis H is tested on the examples. Additionally, however, for each tested example we obtain the selected atoms on each branch of the example's SLD-tree, which correspond to sub-programs of H. As shown above, sub-programs derived from incorrect answers have the same incorrect answers. For each such identified inconsistent sub-program H′ of H, we tell the constrain stage to prune generalisations of H′. Sub-programs derived from missing answers are retested, now without obtaining their SLD-trees. If a sub-program H′′ of H is incomplete, we inform the constrain stage to prune specialisations of H′′. As in POPPER, we prune by elimination constraints if no positive examples are entailed (POPPER generates elimination constraints when a hypothesis entails none of the positive examples [Cropper and Morel, 2021]).
Pruning for sub-programs is in addition to the pruning that the constrain stage already does for H. This is important as H's failing sub-programs need not be specialisations or generalisations of H.

Experiments

We claim that failure explanation can improve learning performance. Our experiments therefore aim to answer the questions:

Q1: Can failure explanation prune more programs?
Q2: Can failure explanation reduce learning times?

A positive answer to Q1 does not imply a positive answer to Q2 because of the potential overhead of failure explanation. Identifying sub-programs requires computational effort and the additional constraints could potentially overwhelm a learner. For example, as well as identifying sub-programs, POPPERX needs to derive more constraints, ground them, and have a solver reason over them. These operations are all costly.

To answer Q1 and Q2, we compare POPPERX against POPPER. The addition of failure explanation is the only difference between the systems, and in all the experiments the settings for POPPERX and POPPER are identical. We do not compare against other state-of-the-art ILP systems, such as Metagol [Cropper and Muggleton, 2016] and ILASP3 [Law, 2018], because such a comparison cannot help us answer Q1 and Q2. Moreover, POPPER has been shown to outperform these two systems on problems similar to the ones we consider [Cropper and Morel, 2021].

We run the experiments on a 10-core server (at 2.2GHz) with 30 gigabytes of memory (note that POPPER and POPPERX only run on a single CPU). When testing individual examples, we use an evaluation timeout of 33 milliseconds.

The goal of this experiment is to evaluate whether failure explanation can improve performance when progressively increasing the size of the target program. We therefore need a problem where we can vary the program size. We consider a robot strategy learning problem. There is a robot that can move in four directions in a grid world, which we restrict to being a corridor. The robot starts in the lower left corner and needs to move to a position to its right. In this experiment, failure explanation should determine that any strategy that moves up, down, or left can never succeed and thus can never appear in a solution.

Settings.
An example is an atom f(s1,s2), with start (s1) and end (s2) states. A state is a pair of discrete coordinates (x,y). We provide four dyadic relations as BK: move_right, move_left, move_up, and move_down, which change the state, e.g. move_right((2,2),(3,2)). We allow one clause with up to 10 body literals and 11 variables. We use hypothesis constraints to ensure this clause is forward-chained [Kaminski et al., 2019], which means body literals modify the state one after another.
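For concreteness, the following Prolog sketch shows one possible encoding of this setting and what a learned forward-chained strategy looks like. The state representation and the BK definitions are assumptions for illustration (grid-boundary checks are omitted), not the exact experimental setup.

% States are pairs of coordinates; the four move relations are BK.
move_right((X,Y), (X1,Y)) :- X1 is X + 1.
move_left((X,Y),  (X1,Y)) :- X1 is X - 1.
move_up((X,Y),    (X,Y1)) :- Y1 is Y + 1.
move_down((X,Y),  (X,Y1)) :- Y1 is Y - 1.

% A forward-chained hypothesis for a three-move task: each body literal
% transforms the state produced by the previous one.
f(A,B) :- move_right(A,C), move_right(C,D), move_right(D,B).

?- f((0,0), S).
S = (3, 0).

In the experiment itself, where moves off the corridor fail, failure explanation can then prune every program whose first move is to the left, as described in the results below.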
Method.

The start state is (0,0) and the end state is (n,0), for increasing values of n. Each trial has only one (positive) example: f((0,0),(n,0)). We measure learning times and the number of programs generated. We enforce a timeout of 10 minutes per task. We repeat each experiment 10 times and plot the mean and (negligible) standard error.

Results.
Figure 1a shows that POPPERX substantially outperforms POPPER in terms of learning time. Whereas POPPERX needs around 80 seconds to find a 10-move solution, POPPER exceeds the 10-minute timeout when looking for a six-move solution. The reason for the improved performance is that POPPERX generates far fewer programs, as failure explanation will, for example, prune all programs whose first move is to the left. For instance, to find a five-literal solution, POPPER generates 1300 programs, whereas POPPERX only generates 62. When looking for a 10-move solution, POPPERX only generates 1404 programs in a hypothesis space of 1.4 million programs. These results show that, compared to POPPER, POPPERX generates substantially fewer programs and requires less learning time. The results from this experiment strongly suggest that the answer to questions Q1 and Q2 is yes.

Figure 1: Results of the robot planning experiment. (a) Learning time. (b) Number of programs. The x-axes denote the number of body literals in the solution, i.e. the number of moves required.

We now explore whether failure explanation can improve learning performance on real-world string transformation tasks. We use a standard dataset [Lin et al., 2014; Cropper, 2019] formed of 312 tasks, each with 10 input-output pair examples. For instance, task 81 has the following two input-output pairs:
Input                      Output
“Alex”,“M”,41,74,170       M
“Carly”,“F”,32,70,155      F
Settings.
As BK, we give each system the monadic predicates is_uppercase, is_empty, is_space, is_letter, is_number and the dyadic predicates mk_uppercase, mk_lowercase, skip1, copyskip1, copy1. For each monadic predicate we also provide a predicate that is its negation. We allow up to 3 clauses with 4 body literals and up to 5 variables per clause.
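As a rough illustration, the bias for these tasks could be declared along the following lines. This is a sketch in POPPER-style predicate declarations; the target name f/2, the not_-prefixed names for the negated monadic predicates, and the size-limit directive names are assumptions.

% Predicate declarations for the string-transformation tasks.
head_pred(f,2).
body_pred(is_uppercase,1).   body_pred(not_uppercase,1).
body_pred(is_empty,1).       body_pred(not_empty,1).
body_pred(is_space,1).       body_pred(not_space,1).
body_pred(is_letter,1).      body_pred(not_letter,1).
body_pred(is_number,1).      body_pred(not_number,1).
body_pred(mk_uppercase,2).
body_pred(mk_lowercase,2).
body_pred(skip1,2).
body_pred(copyskip1,2).
body_pred(copy1,2).

% Size limits: up to 3 clauses, 4 body literals and 5 variables per clause.
max_clauses(3).
max_body(4).
max_vars(5).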
Method.

The dataset has 10 positive examples for each problem. We perform cross-validation by selecting 10 distinct subsets of 5 examples for each problem, using the other 5 to test. We measure learning times and the number of programs generated. We enforce a timeout of 120 seconds per task. We repeat each experiment 10 times, once for each distinct subset, and record means and standard errors.

Figure 2: String transformation results. The ratio of the number of programs that POPPERX needs versus POPPER is plotted against the ratio of learning time needed on that problem.
Results.
For 52 problems both POPPER and POPPERX find solutions. On 11 tasks POPPER times out, and on 7 of these in all trials. POPPERX finds solutions on these same 11 tasks, with timeouts in some trials on only 6 tasks. As relational solutions are allowed, many solutions are not ideal, e.g. allowing for optionally copying over a character.

Figure 2 plots ratios of generated programs and learning times. Each point represents a single problem. The x-axis is the ratio of the number of programs that POPPERX generates versus the number of programs that POPPER generates. The y-value is the ratio of the learning time of POPPERX versus POPPER. These ratios are acquired by dividing means, the mean of POPPERX over that of POPPER.

Looking at the x-axis values, of the 52 problems plotted, 50 require fewer programs when run with POPPERX. Looking at the y-axis, the learning times of 51 problems are faster on POPPERX. Note that either failure explanation is very effective or its influence is rather limited, which we explore more in the next experiment.

Overall, these results show that, compared to POPPER, POPPERX almost always needs fewer programs and less time to learn programs. This suggests that the answer to questions Q1 and Q2 is yes.

This experiment evaluates whether failure explanation can improve performance when learning programs for recursive list problems, which are notoriously difficult for ILP systems. Indeed, other state-of-the-art ILP systems [Law, 2018; Evans and Grefenstette, 2018; Kaminski et al., 2019] struggle to solve these problems. We use the same 10 problems used by Cropper and Morel [2021] to show that POPPER drastically outperforms METAGOL [Cropper and Muggleton, 2016] and ALEPH [Srinivasan, 2001]. The 10 tasks include a mix of monadic (e.g. evens and sorted), dyadic (e.g. droplast and finddup), and triadic (dropk) target predicates. Some problems are functional (e.g. last and len) and some are relational (e.g. finddup and member). Note that these problems are very difficult: many of them do not have solutions given only our primitive BK and with the learned program restricted to defining a single predicate. Therefore, absolute performance should be ignored; the important result is the relative performance of the two systems.

Settings.
We provide as BK the monadic relations empty, zero, one, even, odd, the dyadic relations element, head, tail, increment, decrement, geq, and the triadic relation cons. We provide simple types and mark the arguments of predicates as either input or output. We allow up to two clauses with five body literals and up to five variables per clause.
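For reference, a minimal Prolog sketch (with assumed definitions) of what this list BK could look like:

empty([]).
zero(0).
one(1).
even(X) :- 0 is X mod 2.
odd(X)  :- 1 is X mod 2.
head([H|_], H).
tail([_|T], T).
element([X|_], X).
element([_|T], X) :- element(T, X).
increment(X, Y) :- Y is X + 1.
decrement(X, Y) :- Y is X - 1.
geq(X, Y) :- X >= Y.
cons(H, T, [H|T]).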
Method.

We generate 10 positive and 10 negative examples per problem. Each example is randomly generated from lists up to length 50, whose integer elements are sampled from 1 to 100. We test on 1000 positive and 1000 negative randomly sampled examples. We measure overall learning time, the number of programs generated, and predictive accuracy. We also measure the time spent in the three distinct stages of POPPER and POPPERX. We repeat each experiment 25 times and record the mean and standard error.

Table 1: Results for programming puzzles (len, dropk, finddup, member, last, evens, threesame, droplast, sorted, addhead). Left, the average number of programs generated by each system (POPPER, POPPERX) and their ratio. Right, the corresponding average time in seconds to find a solution and its ratio. We round values over one to the nearest integer. The error is standard error.
Results.
Both systems are equally accurate: except for sorted (where POPPER scores 98%), dropk, and finddup and threesame (both 99%), all problems have 100% accuracy.

Table 1 shows the learning times in relation to the number of programs generated. Crucially, it includes the ratio of the mean of POPPERX over the mean of POPPER. On these 10 problems, POPPERX always considers fewer hypotheses than POPPER. Only on three problems is over 90% of the original number of programs considered. On the len problem, POPPERX only needs to consider 10% of the number of hypotheses.

As seen from the ratio columns, the number of generated programs correlates strongly with the learning time (0.96 correlation coefficient). Only on three problems is POPPERX slightly slower than POPPER. Hence failure explanation can negatively impact performance; however, when POPPERX is faster, the speed-up can be considerable.

To illustrate how failure explanation can drastically improve pruning, consider the following hypothesis that POPPERX considers in the len problem:

f(A,B):- element(A,D),odd(D),even(D),tail(A,C),element(C,B).

Failure explanation identifies the failing sub-program:

f(A,B):- element(A,D),odd(D),even(D).

Generating constraints from this smaller failing program, which is not a POPPER hypothesis, leads to far more effective pruning.
Figure 3: Relative time spent in the three stages of POPPERX and POPPER. From bottom to top: testing (in red), generating hypotheses (in blue), and imposing constraints (in orange). Times are scaled by the total learning time of POPPER, with POPPER's average time(s) on the left and POPPERX's on the right. Bars are standard error.

Figure 3 shows the relative time spent in each stage of POPPERX and POPPER and that any of the stages can dominate the runtime. For addhead, it is hypothesis generation (predominantly spent searching for a model). For finddup, it is constraining (mostly spent grounding constraints). More pertinently, droplast, the only dyadic problem whose output is a list, is dominated by testing.

We can also infer the overhead of failure explanation, i.e. of analysing SLD-trees, from Figure 3. All problems from last to sorted have POPPERX spend more time on testing than POPPER. On both last and sorted, POPPERX incurs considerable testing overhead. Whilst for last this effort translates into more effective pruning constraints, for sorted this is not the case. Abstracting away from the implementation of failure explanation, we see that POPPER outfitted with zero-overhead sub-program identification would have been strictly faster.

Overall, these results strongly suggest that the answer to questions Q1 and Q2 is yes.

Conclusions

To improve the efficiency of ILP, we have introduced an approach for failure explanation. Our approach, based on SLD-trees, identifies failing sub-programs, including failing literals in clauses. We implemented our idea in POPPERX. Our experiments show that failure explanation can drastically reduce learning times.

Limitations.
We have shown that identifying failing sub-programs leads to more constraints and thus more pruning of the hypothesis space (Theorem 1), which our experiments empirically confirm. We have not, however, quantified the theoretical effectiveness of pruning by sub-programs, nor have we evaluated improvements in predictive accuracy, which are implied by the Blumer bound [Blumer et al., 1987]. Future work should address both of these limitations. Although we have shown that failure explanation can drastically reduce learning times, we can still significantly improve our approach. For instance, reconsider the failing sub-program f(A,B):- element(A,D),odd(D),even(D) from Section 5.3. We should be able to identify that the two literals odd(D) and even(D) can never both hold in the body of a clause, which would allow us to prune more programs. Finally, in future work, we want to explore whether our inherently interpretable failure explanations can aid explainable AI and ultra-strong machine learning [Michie, 1988; Muggleton et al., 2018].
References

[Ahlgren and Yuen, 2013] John Ahlgren and Shiu Yin Yuen. Efficient program synthesis using constraint satisfaction in inductive logic programming. JMLR, 2013.
[Blockeel and Raedt, 1998] Hendrik Blockeel and Luc De Raedt. Top-down induction of first-order logical decision trees. AIJ, 1998.
[Blumer et al., 1987] Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred K. Warmuth. Occam's razor. Inf. Process. Lett., 1987.
[Caballero et al., 2017] Rafael Caballero, Adrián Riesco, and Josep Silva. A survey of algorithmic debugging. ACM Comput. Surv., 2017.
[Cheney et al., 2009] James Cheney, Laura Chiticariu, and Wang Chiew Tan. Provenance in databases: Why, how, and where. Found. Trends Databases, 2009.
[Cropper and Morel, 2021] Andrew Cropper and Rolf Morel. Learning programs by learning from failures. Machine Learning, 2021. To appear.
[Cropper and Muggleton, 2016] Andrew Cropper and Stephen H. Muggleton. Learning higher-order logic programs through abstraction and invention. In IJCAI, 2016.
[Cropper et al., 2020] Andrew Cropper, Sebastijan Dumancic, and Stephen H. Muggleton. Turning 30: New ideas in inductive logic programming. In IJCAI, 2020.
[Cropper, 2019] Andrew Cropper. Playgol: Learning programs through play. In IJCAI, 2019.
[Ellis et al., 2018] Kevin Ellis, Lucas Morales, Mathias Sablé-Meyer, Armando Solar-Lezama, and Josh Tenenbaum. Learning libraries of subroutines for neurally-guided bayesian program induction. In NeurIPS, 2018.
[Evans and Grefenstette, 2018] Richard Evans and Edward Grefenstette. Learning explanatory rules from noisy data. JAIR, 2018.
[Evans et al., 2021] Richard Evans, José Hernández-Orallo, Johannes Welbl, Pushmeet Kohli, and Marek Sergot. Making sense of sensory input. Artificial Intelligence, 2021.
[Feng et al., 2018] Yu Feng, Ruben Martins, Osbert Bastani, and Isil Dillig. Program synthesis using conflict-driven learning. In PLDI, 2018.
[Hocquette and Muggleton, 2020] Céline Hocquette and Stephen H. Muggleton. Complete bottom-up predicate invention in meta-interpretive learning. In IJCAI, 2020.
[Inoue et al., 2014] Katsumi Inoue, Tony Ribeiro, and Chiaki Sakama. Learning from interpretation transition. Machine Learning, 2014.
[Kaminski et al., 2019] Tobias Kaminski, Thomas Eiter, and Katsumi Inoue. Meta-interpretive learning using hex-programs. In IJCAI, 2019.
[Law et al., 2020] Mark Law, Alessandra Russo, Elisa Bertino, Krysia Broda, and Jorge Lobo. Fastlas: scalable inductive logic programming incorporating domain-specific optimisation criteria. In AAAI, 2020.
[Law, 2018] Mark Law. Inductive learning of answer set programs. PhD thesis, Imperial College London, UK, 2018.
[Lin et al., 2014] Dianhuan Lin, Eyal Dechter, Kevin Ellis, Joshua B. Tenenbaum, and Stephen Muggleton. Bias reformulation for one-shot function induction. In ECAI, 2014.
[Lloyd, 2012] John W Lloyd. Foundations of logic programming. Springer Science & Business Media, 2012.
[Michie, 1988] Donald Michie. Machine learning in the next five years. In EWSL, 1988.
[Midelfart, 1999] Herman Midelfart. A bounded search space of clausal theories. In ILP, 1999.
[Muggleton et al., 2018] S.H. Muggleton, U. Schmid, C. Zeller, A. Tamaddoni-Nezhad, and T. Besold. Ultra-strong machine learning - comprehensibility of programs learned with ILP. Machine Learning, 2018.
[Muggleton, 1991] Stephen Muggleton. Inductive logic programming. New Generation Comput., 1991.
[Muggleton, 1995] Stephen Muggleton. Inverse entailment and progol. New Generation Comput., 1995.
[Nienhuys-Cheng and Wolf, 1997] Shan-Hwei Nienhuys-Cheng and Ronald de Wolf. Foundations of Inductive Logic Programming. Springer-Verlag New York, Inc., Secaucus, NJ, USA, 1997.
[Patsantzis and Muggleton, 2021] S. Patsantzis and Stephen H. Muggleton. Top program construction and reduction for polynomial time meta-interpretive learning. Machine Learning, 2021.
[Plotkin, 1971] G.D. Plotkin. Automatic Methods of Inductive Inference. PhD thesis, Edinburgh University, August 1971.
[Popper, 2002] K.R. Popper. Conjectures and Refutations: The Growth of Scientific Knowledge. Routledge, 2002.
[Raedt, 2008] Luc De Raedt. Logical and relational learning. Cognitive Technologies. Springer, 2008.
[Raghothaman et al., 2020] Mukund Raghothaman, Jonathan Mendelson, David Zhao, Mayur Naik, and Bernhard Scholz. Provenance-guided synthesis of datalog programs. PACMPL, 2020.
[Schüller and Benz, 2018] Peter Schüller and Mishal Benz. Best-effort inductive logic programming via fine-grained cost-based hypothesis generation. Machine Learning, 2018.
[Shapiro, 1983] Ehud Y. Shapiro. Algorithmic Program Debugging. MIT Press, Cambridge, MA, USA, 1983.
[Silva et al., 2009] João P. Marques Silva, Inês Lynce, and Sharad Malik. Conflict-driven clause learning SAT solvers. In Handbook of Satisfiability. 2009.
[Silver et al., 2020] Tom Silver, Kelsey R. Allen, Alex K. Lew, Leslie Pack Kaelbling, and Josh Tenenbaum. Few-shot bayesian imitation learning with logical program policies. In AAAI, 2020.
[Srinivasan, 2001] A. Srinivasan. The ALEPH manual. 2001.

h = { droplast(A,B):- empty(A),tail(A,B). }
h = { droplast(A,B):- empty(A),cons(C,D,A),tail(D,B). }
h = { droplast(A,B):- tail(A,C),tail(C,B).  droplast(A,B):- tail(A,B). }
h = { droplast(A,B):- empty(A),tail(A,B),head(A,C),head(B,C). }
h = { droplast(A,B):- tail(A,C),tail(C,B).  droplast(A,B):- tail(A,B),tail(B,A). }
h = { droplast(A,B):- tail(A,B),empty(B).  droplast(A,B):- cons(C,D,A),droplast(D,E),cons(C,E,B). }
h = { droplast(A,B):- tail(A,C),tail(C,B).  droplast(A,B):- tail(A,B).  droplast(A,B):- tail(A,C),droplast(C,B). }
Figure 4: LFF hypothesis space considered in Example 5.
A Appendix: LFF Example
Example 5.
To illustrate LFF, consider learning a droplast/2 program. Suppose our predicate declarations D are head_pred(droplast,2), denoting that we want to learn a droplast/2 relation, and body_pred(empty,1), body_pred(head,2), body_pred(tail,2), and body_pred(cons,3). Suitable definitions for the provided body predicate declarations constitute our background knowledge B. To allow for learning a recursive program, we also supply the predicate declaration body_pred(droplast,2). Let e+1 = droplast([1,2,3],[1,2]), e+2 = droplast([1,2],[1]) and e− = droplast([1,2],[]). Then E+ = {e+1, e+2} and E− = {e−} are the positive and negative examples, respectively. Our initial set of hypothesis constraints C only ensures that hypotheses are well-formed, e.g. that each variable that occurs in the head of a rule also occurs in the rule's body.

We now consider learning a solution for the LFF input (E+, E−, B, D, C), where, for demonstration purposes, we use the simplified hypothesis space H ⊆ H_{D,C} of Figure 4. The hypotheses are considered in order of their number of literals. Pruning is achieved by adding additional hypothesis constraints. First we learn by a generate-test-and-constrain loop without failure explanation. This first sequence is representative of POPPER's execution:

1. POPPER starts by generating h. B ∪ h fails to entail e+1 and e+2 and correctly does not entail e−. Hence only specialisations of h get pruned, namely h.
2. POPPER subsequently generates h. B ∪ h fails to entail e+1 and e+2 and is correct on e−. Hence specialisations of h get pruned, of which there are none in H.
3. POPPER subsequently generates h. B ∪ h does not entail the positive examples, but does entail the negative example e−. Hence specialisations and generalisations of h get pruned, meaning only generalisation h.
4. POPPER subsequently generates h. B ∪ h is correct on none of the examples. Hence specialisations and generalisations of h get pruned, of which there are none in H.
5. POPPER subsequently generates h. B ∪ h is correct on all the examples and hence is returned.

Now consider learning by a generate-test-and-constrain loop with failure explanation. The following execution sequence is representative of POPPERX:

1. POPPERX starts by generating h. B ∪ h fails to entail e+1 and e+2 and correctly does not entail e−. Failure explanation identifies the sub-program h′ = { droplast(A,B):- empty(A). }. h′ fails in the same way as h. Hence specialisations of both h and h′ get pruned, namely h and h.
2. POPPERX subsequently generates h. B ∪ h does not entail the positive examples, but does entail the negative example e−. Failure explanation identifies the sub-program h′ = { droplast(A,B):- tail(A,C),tail(C,B). }. B ∪ h′ fails in the same way as h. Hence specialisations and generalisations of h and h′ get pruned, meaning h and h.
3. POPPERX subsequently generates h6.
B ∪ h6 is correct on all the examples and hence is returned.