Inductive logic programming at 30: a new introduction
Journal of Artificial Intelligence Research 1 (1993) 1-15. Submitted 6/91; published 9/
Andrew Cropper
andrew.cropper@cs.ox.ac.uk, University of Oxford
Sebastijan Dumančić
sebastijan.dumancic@cs.kuleuven.be, KU Leuven
Abstract
Inductive logic programming (ILP) is a form of machine learning. The goal of ILP is to induce a hypothesis (a set of logical rules) that generalises given training examples. In contrast to most forms of machine learning, ILP can learn human-readable hypotheses from small amounts of data. As ILP approaches 30, we provide a new introduction to the field. We introduce the necessary logical notation and the main ILP learning settings. We describe the main building blocks of an ILP system. We compare several ILP systems on several dimensions. We describe in detail four systems (Aleph, TILDE, ASPAL, and Metagol). We document some of the main application areas of ILP. Finally, we summarise the current limitations and outline promising directions for future research.
1. Introduction
A remarkable feat of human intelligence is the ability to learn new knowledge. A key type of learning is induction: the process of forming general rules (hypotheses) from specific observations (examples). For instance, suppose you draw 10 red balls out of a bag, then you might induce a hypothesis (a rule) that all the balls in the bag are red. Having induced this hypothesis, you can predict the colour of the next ball out of the bag.

The goal of machine learning (ML) (Mitchell, 1997) is to automate induction. In other words, the goal of ML is to induce a hypothesis (also called a model) that generalises training examples (observations). For instance, given labelled images of cats and dogs, the goal is to induce a hypothesis that predicts whether an unlabelled image is a cat or a dog.

Inductive logic programming (ILP) (Muggleton, 1991; Muggleton & De Raedt, 1994) is a form of ML. As with other forms of ML, the goal of ILP is to induce a hypothesis that generalises training examples. However, whereas most forms of ML use tables to represent data (examples and hypotheses), ILP uses logic programs (sets of logical rules). Moreover, whereas most forms of ML learn functions, ILP learns relations.

We illustrate ILP through three toy scenarios.

Suppose you want to predict whether someone is happy. To do so, you ask four people (alice, bob, claire, and dave) whether they are happy. You also ask additional information, specifically their job, their company, and whether they like lego. Standard ML approaches, such as a decision tree or neural network learner, would represent this data as a table, such as Table 1. Using standard ML terminology, each row represents a training example, the first three columns (name, job, and enjoys lego) represent features, and the final column (happy) represents the label or classification.

Name     Job            Enjoys lego   Happy
alice    lego builder   yes           yes
bob      lego builder   no            no
claire   estate agent   yes           no
dave     estate agent   no            no

Table 1: A standard table representation of an ML task used by most forms of ML.

1. Table-based learning is attribute-value learning. See De Raedt (2008) for an overview of the hierarchy of representations.

©1993 AI Access Foundation. All rights reserved.

Given this table as input, the goal of table-based ML approaches is to induce a hypothesis to predict the label for unseen examples. For instance, given the table as input, a neural network learner (Rosenblatt, 1958) would learn a hypothesis as a table of numbers that weight the importance of the input features (or hidden features in a multi-layer network). We can then use the learned table of weights to make predictions for unseen examples, such as "If edna is an estate agent who likes lego, then is edna happy?".

In contrast to standard ML approaches, ILP does not represent data as tables. ILP instead represents data as logic programs, sets of logical rules. The fundamental concept in logic programming is an atom. An atom is of the form $p(x_1, \dots, x_n)$, where $p$ is a predicate symbol of arity $n$ (takes $n$ arguments) and each $x_i$ is a term. A logic program uses atoms to represent data. For instance, we can represent that alice enjoys lego as the atom enjoys_lego(alice) and that bob is a lego builder as lego_builder(bob).

An ILP learning task is formed of three sets ($B$, $E^+$, $E^-$). The first set ($B$) denotes background knowledge (BK). BK is similar to features but can contain relations and information indirectly associated with the examples. We discuss BK in Section 4.3. For the data in Table 1, we can represent $B$ as:

$B$ = {
    lego_builder(alice).
    lego_builder(bob).
    estate_agent(claire).
    estate_agent(dave).
    enjoys_lego(alice).
    enjoys_lego(claire).
}

ILP usually follows the closed world assumption (Reiter, 1977), so if anything is not explicitly true, we assume it is false.
With this assumption, we do not need to explicitly state that enjoys_lego(bob) and enjoys_lego(dave) are false.

The second and third sets denote positive ($E^+$) and negative ($E^-$) examples. For the data in Table 1, we can represent the examples as:

$E^+$ = { happy(alice). }
$E^-$ = { happy(bob). happy(claire). happy(dave). }

Given these three sets, the goal of ILP is to induce a hypothesis that with the BK logically entails as many positive and as few negative examples as possible. Rather than represent a hypothesis as a table of numbers, ILP represents a hypothesis $H$ as a set of logical rules, such as:

$H$ = { ∀A. lego_builder(A) ∧ enjoys_lego(A) → happy(A) }

This hypothesis contains one rule that says for all A, if lego_builder(A) and enjoys_lego(A) are true, then happy(A) must also be true. In other words, this rule says that if a person is a lego builder and enjoys lego then they are happy. Having induced a rule, we can deduce knowledge from it. For instance, this rule says if lego_builder(alice) and enjoys_lego(alice) are true then happy(alice) must also be true.

The above rule is written in a standard first-order logic notation. However, logic programs (Section 2) are usually written in reverse implication form:

head :- $body_1$, $body_2$, ..., $body_n$

A rule in this form states that the head atom is true when every $body_i$ atom is true. A comma denotes conjunction, so the body is a conjunction of atoms. In logic programming, every variable is assumed to be universally quantified, so we drop quantifiers. We also flip the direction of the implication symbol → to ← and then often replace it with :- because it is easier to use when writing computer programs. Therefore, in logic programming notation, the above hypothesis is:

$H$ = { happy(A):- lego_builder(A),enjoys_lego(A). }

It is important to note that logic programs are declarative, which means that the order of atoms in a rule does not matter. For instance, the above hypothesis is semantically identical to this one:

$H$ = { happy(A):- enjoys_lego(A),lego_builder(A). }

As this scenario illustrates, ILP learns human-readable hypotheses, which is crucial for explainable AI and ultra-strong ML (Michie, 1988). By contrast, the hypotheses induced by most other forms of ML are not human-readable.
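To make the coverage check in this scenario concrete, the following is a minimal Python sketch, not part of any ILP system. It stores the BK as a set of ground facts, applies the closed world assumption by treating absent facts as false, and tests which examples the rule happy(A):- lego_builder(A),enjoys_lego(A) covers. The names BK, POS, NEG, HYPOTHESIS_BODY, and covers are hypothetical and chosen only for illustration.

```python
# Background knowledge from Table 1, stored as a set of ground facts.
# Closed world assumption: any fact not in this set is treated as false.
BK = {
    ("lego_builder", "alice"), ("lego_builder", "bob"),
    ("estate_agent", "claire"), ("estate_agent", "dave"),
    ("enjoys_lego", "alice"), ("enjoys_lego", "claire"),
}
POS = {"alice"}                  # E+: happy(alice)
NEG = {"bob", "claire", "dave"}  # E-: happy(bob), happy(claire), happy(dave)

# Hypothesis: happy(A):- lego_builder(A),enjoys_lego(A).
# The body is a list of unary predicate names that must all hold for A.
HYPOTHESIS_BODY = ["lego_builder", "enjoys_lego"]


def covers(body, person, facts=BK):
    """True if every body atom holds for `person` under the closed world assumption."""
    return all((pred, person) in facts for pred in body)


covered_pos = {p for p in POS if covers(HYPOTHESIS_BODY, p)}
covered_neg = {p for p in NEG if covers(HYPOTHESIS_BODY, p)}
print("covered positives:", covered_pos)
print("covered negatives:", covered_neg)
```

Under these assumptions the rule covers the single positive example and none of the negative ones. Reversing the order of the body atoms gives the same result, which mirrors the remark that the order of atoms in a rule does not matter.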
Suppose you want to learn string transformation programs from input-output examples, such as a program that returns the last character of a string:

Input       Output
machine     e
learning    g
algorithm   m

Most forms of ML would represent these examples as a table, such as using dummy variables. By contrast, ILP represents these examples as atoms, such as:

$E^+$ = {
    last([m,a,c,h,i,n,e], e).
    last([l,e,a,r,n,i,n,g], g).
    last([a,l,g,o,r,i,t,h,m], m).
}

2. Also known as design variables or one-hot encoding.
The symbol last is the target predicate that we want to learn (the relation to generalise). The first argument of each atom represents an input list and the second argument represents an output value.

To induce a hypothesis for these examples, we need to provide an ILP system with suitable BK, such as common list operations:
Name        Description                    Example
empty(A)    A is an empty list             empty([]).
head(A,B)   B is the head of the list A    head([c,a,t],c).
tail(A,B)   B is the tail of the list A    tail([c,a,t],[a,t]).
Given the aforementioned examples and BK with the above list operations, an ILP system could induce the hypothesis:

$H$ = {
    last(A,B):- tail(A,C),empty(C),head(A,B).
    last(A,B):- tail(A,C),last(C,B).
}

This hypothesis contains two rules. The first rule says that the relation last(A,B) is true when the three atoms tail(A,C), empty(C), and head(A,B) are true. In other words, the first rule says that B is the last element of A when the tail of A is empty and B is the head of A. The second rule is recursive and says that last(A,B) is true when the two atoms tail(A,C) and last(C,B) are true. In other words, the second rule says that B is the last element of A when C is the tail of A and B is the last element of C.

As this scenario illustrates, ILP induces hypotheses that generalise beyond the training examples. For instance, this hypothesis generalises to lists of arbitrary length and elements not seen in the training examples. By contrast, many other forms of ML are notorious for their inability to generalise beyond the training data (Marcus, 2018; Chollet, 2019; Bengio et al., 2019).

Consider inducing sorting algorithms. Suppose you have the following positive and negative examples, again represented as atoms, where the first argument is an unsorted list and the second argument is a sorted list:

$E^+$ = { sort([2,1],[1,2]). sort([5,3,1],[1,3,5]). }
$E^-$ = { sort([2,1],[2,1]). sort([1,3,1],[1,1,1]). }

Also suppose that as BK we have the same empty, head, and tail relations from the string transformation scenario and two additional relations:

Name                     Description                                                     Example
partition(Pivot,A,L,R)   L is a sublist of A containing elements less than or equal to   partition(3,[4,1,5,2],[1,2],[4,5]).
                         Pivot and R is a sublist of A containing elements greater
                         than the pivot
append(A,B,C)            true when C is the concatenation of A and B                     append([a,b,c],[d,e],[a,b,c,d,e]).
Given these three sets, an ILP system could induce the hypothesis:

$H$ = {
    sort(A,B):- empty(A),empty(B).
    sort(A,B):- head(A,Pivot),partition(Pivot,A,L1,R1),sort(L1,L2),sort(R1,R2),append(L2,R2,B).
}

This hypothesis corresponds to the quicksort algorithm (Hoare, 1961) and generalises to lists of arbitrary length and elements not seen in the training examples. This scenario shows that ILP is a form of inductive program synthesis (Shapiro, 1983), where the goal is to automatically build executable computer programs.
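As a rough illustration of how such a hypothesis can be checked against the examples, here is a small Python sketch written in the spirit of the two sort rules. It is an assumption-laden sketch, not the output of an ILP system. One detail deliberately differs from the rule as printed: the pivot is removed from the list before partitioning and re-inserted when appending, so that each recursive call receives a strictly shorter list. The function names partition, sort, and entails are hypothetical.

```python
def partition(pivot, items):
    """Split items into (elements <= pivot, elements > pivot), preserving order."""
    left = [x for x in items if x <= pivot]
    right = [x for x in items if x > pivot]
    return left, right


def sort(items):
    """Quicksort mirroring the base rule and the recursive rule of the hypothesis."""
    if not items:                       # sort(A,B):- empty(A),empty(B).
        return []
    # head(A,Pivot); taking the tail here is an assumption of this sketch.
    pivot, rest = items[0], items[1:]
    left, right = partition(pivot, rest)
    # append(L2,R2,B), with the pivot re-inserted between the two sorted parts.
    return sort(left) + [pivot] + sort(right)


# Examples from the scenario, as (unsorted list, claimed sorted list) pairs.
POS = [([2, 1], [1, 2]), ([5, 3, 1], [1, 3, 5])]
NEG = [([2, 1], [2, 1]), ([1, 3, 1], [1, 1, 1])]


def entails(example):
    """An example sort(In, Out) is entailed when sorting In yields exactly Out."""
    unsorted, expected = example
    return sort(unsorted) == expected


print(all(entails(e) for e in POS), any(entails(e) for e in NEG))
```

In this sketch both positive examples are entailed and neither negative example is, which is the behaviour the learning task asks for.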
As these three toy scenarios illustrate, ILP is different to most ML approaches. Most ML approaches, such as decision tree, support vector, and neural network learners, rely on statistical inference. By contrast, ILP relies on logical inference and often techniques from automated reasoning and knowledge representation. Table 2 shows a vastly simplified comparison between ILP and statistical ML approaches. We now briefly discuss these differences.
                     Statistical ML              ILP
Examples             Many                        Few
Data                 Tables                      Logic programs
Hypotheses           Propositional/functions     First/higher-order relations
Explainability       Difficult                   Possible
Knowledge transfer   Difficult                   Easy

Table 2: A vastly simplified comparison between ILP and statistical ML approaches. This table is based on the table by Gulwani et al. (2015).

1.4.1 Examples
Many forms of ML are notorious for their inability to generalise from small numbers of training examples, notably deep learning (Marcus, 2018; Chollet, 2019; Bengio et al., 2019). As Evans and Grefenstette (2018) point out, if we train a neural system to add numbers with 10 digits, it might generalise to numbers with 20 digits, but when tested on numbers with 100 digits, the predictive accuracy drastically decreases (Reed & de Freitas, 2016; Kaiser & Sutskever, 2016). By contrast, ILP can induce hypotheses from small numbers of examples, often from a single example (Lin et al., 2014; Muggleton et al., 2018). Moreover, as the string transformation and sorting scenarios show, ILP induces hypotheses that generalise beyond training data. In both scenarios, the hypotheses generalise to lists of arbitrary lengths and arbitrary elements. This data-efficiency is important because we often only have small amounts of training data. For instance, Gulwani (2011) applies techniques similar to ILP to induce programs from user-provided examples in Microsoft Excel to solve string transformation problems, where it is infeasible to ask a user for thousands of examples. This data-efficiency has made ILP attractive in many real-world applications, especially in drug design, where millions of examples are not always easy to obtain (Section 7).

1.4.2 Data
In contrast to most forms of ML, which learn using finite tables of examples and features, ILP learns using BK represented as a logic program. Using logic programs to represent data allows ILP to learn with complex relational information and allows for easy integration of expert knowledge. If learning causal relations in causal networks, a user can encode constraints about the network (Inoue et al., 2013). If learning to recognise events, a user could provide the axioms of the event calculus as BK (Katzouris et al., 2015). If learning to recognise objects in images, a user can provide a theory of light as BK (Muggleton et al., 2018).

A key advantage of using relational BK is the ability to succinctly represent infinite relations. For instance, it is trivial to define a summation relation over the infinite set of natural numbers:

add(A,B,C):- C = A+B.
By contrast, table-based ML approaches are mostly restricted to finite data and cannot represent this information. For instance, it is impossible to provide a decision tree learner (Quinlan, 1986, 1993) this infinite relation because it would require an infinite feature table. Even if we restricted ourselves to a finite set of n natural numbers, a table-based approach would still need n features to represent the complete summation relation.

1.4.3 Hypotheses
Most forms of ML learn tables of numbers. By contrast, ILP induces logic programs. Using logic programs to represent hypotheses has many benefits. Because they are closely related to relational databases, logic programs naturally support relational data such as graphs. Because of the expressivity of logic programs, ILP can learn complex relational theories, such as cellular automata (Inoue et al., 2014; Evans et al., 2019), event calculus theories (Katzouris et al., 2015, 2016), Petri nets (Bain & Srinivasan, 2018), and, in general, complex algorithms (Cropper & Morel, 2020). Indeed, logic programs are Turing complete (Tärnlund, 1977), so can represent any computer program. Because of the symbolic nature of logic programs, ILP can reason about hypotheses, which allows it to learn optimal programs, such as minimal time-complexity programs (Cropper & Muggleton, 2019) and secure access control policies (Law et al., 2020). Moreover, because induced hypotheses have the same language as the BK, they can be stored in the BK, making transfer learning trivial (Lin et al., 2014).

1.4.4 Explainability
Because of logic's similarity to natural language, logic programs can be easily read by humans, which is crucial for explainable AI. Recent work by Muggleton et al. (2018) (also explored by Ai et al. (2020)) evaluates the comprehensibility of ILP hypotheses using Michie's (1988) notion of ultra-strong ML, where a learned hypothesis is expected to not only be accurate but to also demonstrably improve the performance of a human when provided with the learned hypothesis. Because of this interpretability, ILP has long been used by domain experts for scientific discovery (King et al., 1992; Srinivasan et al., 1996, 1997, 2006; Kaalia et al., 2016). For instance, the Robot Scientist (King et al., 2004) is a system that uses ILP to generate hypotheses to explain data and can also automatically devise experiments to test the hypotheses, physically run the experiments, interpret the results, and then repeat the cycle. Whilst researching yeast-based functional genomics, the Robot Scientist became the first machine to independently discover new scientific knowledge (King et al., 2009).

[Figure 1: An ILP system learns programs from examples and user-provided BK. The system learns by searching a space of possible programs which are constructed from BK. The figure shows the user-provided input (the three last/2 examples and the BK relations empty(A), head(A,B), and tail(A,B)), the ILP system, and the learning output (the two-rule last/2 program), together with a search tree over programs in which each node is a program, such as last(A,B):- tail(A,B).]

1.4.5 Knowledge transfer
Most ML algorithms are single-task learners and cannot reuse learned knowledge. For instance, although AlphaGo (Silver et al., 2016) has super-human Go ability, it cannot reuse this knowledge to play other games, nor the same game with a slightly different board. By contrast, because of its symbolic representation, ILP naturally supports lifelong and transfer learning (Torrey et al., 2007; Lin et al., 2014; Cropper, 2019, 2020), which is considered essential for human-like AI (Lake et al., 2016; Mitchell et al., 2018). For instance, when inducing solutions to a set of string transformation tasks, such as those in Scenario 2, Lin et al. (2014) show that an ILP system can automatically identify easier problems to solve, learn programs for them, and then reuse the learned programs to help learn programs for more difficult problems. Moreover, they show that this knowledge transfer approach leads to a hierarchy of reusable programs, where each program builds on simpler programs.
Building an ILP system (Figure 1) requires making several choices or assumptions. Understanding these assumptions is key to understanding ILP. We discuss these assumptions in Section 4 but briefly summarise them now.
Learning setting.
The central choice is how to represent examples. The examples in the three scenarios in this section are in the form of boolean concepts (lego_builder) or input-output examples (string transformation and sorting). Although boolean concepts and input-output examples are common representations, there are other representations, such as interpretations (Blockeel & De Raedt, 1998; Law et al., 2014) and transitions (Inoue et al., 2014; Evans et al., 2019). The representation determines the learning setting, which in turn defines what it means for a program to solve the ILP problem.

3. See the work of Muggleton (1999b) for a (slightly outdated) summary of scientific discovery using ILP.
Representation language.
ILP represents data as logic programs. There are, however, many different types of logic programs, each with strengths and weaknesses. For instance, Prolog is a Turing-complete logic programming language often used in ILP. Datalog is a syntactical subset of Prolog that sacrifices features (such as data structures) and expressivity (it is not Turing-complete) to gain efficiency and decidability. Choosing a suitable representation language is crucial in determining which problems an ILP system can solve.
Defining the hypothesis space.
The basic ILP problem is to search the hypothesis space for a suitable hypothesis. The hypothesis space contains all possible programs that can be built in the chosen representation language. Unrestricted, the hypothesis space is infinite, so it is important to restrict it to make the search feasible. The main way to restrict the hypothesis space is to enforce an inductive bias (Mitchell, 1997). The most common bias is a language bias, which enforces restrictions on hypotheses, such as how many variables or relations can be in a hypothesis. Choosing an appropriate language bias is a major challenge in ILP.
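The effect of a language bias on the size of the hypothesis space can be sketched with a toy enumeration. The Python below assumes a deliberately tiny bias for the happy scenario: a single rule with head happy(A), one variable, and at most a fixed number of body literals drawn from the three BK predicates. The names BODY_PREDICATES, hypothesis_space, and render are hypothetical, and real ILP systems use far richer bias languages.

```python
from itertools import combinations

# Predicates that the (assumed) language bias allows in a rule body.
BODY_PREDICATES = ["lego_builder", "estate_agent", "enjoys_lego"]


def hypothesis_space(max_body_literals):
    """Yield every single-rule hypothesis for happy(A) allowed by the bias.

    Bodies are unordered sets of distinct literals over the one variable A,
    so combinations (not permutations) are enumerated.
    """
    for size in range(1, max_body_literals + 1):
        for body in combinations(BODY_PREDICATES, size):
            yield ("happy", body)


def render(rule):
    """Format a rule in the reverse implication notation used in the text."""
    head, body = rule
    return f"{head}(A):- " + ",".join(f"{p}(A)" for p in body) + "."


for rule in hypothesis_space(2):
    print(render(rule))
```

With at most two body literals this toy space holds six rules, and allowing three literals adds only one more. Adding predicates or variables to the bias makes the space grow combinatorially, which is why choosing the bias matters so much.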
Search method.
Having defined the hypothesis space, the problem is to efficiently search it. There are many different ILP approaches to searching the hypothesis space. A classical way to distinguish between approaches is by whether they use a top-down or bottom-up approach. Top-down approaches (Quinlan, 1990; Blockeel & De Raedt, 1998; Bratko, 1999; Ribeiro & Inoue, 2014) start with an overly general hypothesis and try to specialise it, similar to a decision tree learner (Quinlan, 1986, 1993). Bottom-up approaches (Muggleton, 1987; Muggleton & Buntine, 1988; Muggleton & Feng, 1990; Inoue et al., 2014) start with an overly specific hypothesis and try to generalise it. Some approaches combine the two (Muggleton, 1995; Srinivasan, 2001). A third approach, called meta-level ILP, has recently emerged (Inoue et al., 2013; Muggleton et al., 2015; Inoue, 2016; Law et al., 2020; Cropper & Morel, 2020). This approach represents the ILP problem as a meta-level logic program, i.e. a program that reasons about programs. Meta-level approaches often delegate the search for a hypothesis to an off-the-shelf solver (Corapi et al., 2011; Athakravi et al., 2013; Muggleton et al., 2014; Law et al., 2014; Kaminski et al., 2018; Evans et al., 2019; Cropper & Dumančić, 2020; Cropper & Morel, 2020), after which the meta-level solution is translated back to a standard solution for the ILP task. Meta-level approaches are exciting because they (i) can often learn optimal and recursive programs, and (ii) use diverse techniques and technologies.
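The top-down idea described above can be illustrated with a toy breadth-first search over rule bodies for the happy scenario. This is a minimal sketch under simplifying assumptions (one rule, one variable, unary predicates only) and does not correspond to any particular ILP system. The names PREDICATES, covers, is_solution, and top_down_search are hypothetical.

```python
from collections import deque

BK = {
    ("lego_builder", "alice"), ("lego_builder", "bob"),
    ("estate_agent", "claire"), ("estate_agent", "dave"),
    ("enjoys_lego", "alice"), ("enjoys_lego", "claire"),
}
POS = ["alice"]
NEG = ["bob", "claire", "dave"]
PREDICATES = ["lego_builder", "estate_agent", "enjoys_lego"]


def covers(body, person):
    """True if every literal in the body holds for `person` in the BK."""
    return all((pred, person) in BK for pred in body)


def is_solution(body):
    """A body is a solution if it covers all positives and no negatives."""
    return all(covers(body, p) for p in POS) and not any(covers(body, n) for n in NEG)


def top_down_search():
    """Breadth-first specialisation starting from the empty (most general) body."""
    queue = deque([()])
    while queue:
        body = queue.popleft()
        if is_solution(body):
            return body
        # Specialise by appending one literal; keeping predicates in list
        # order avoids generating the same body twice.
        start = PREDICATES.index(body[-1]) + 1 if body else 0
        for pred in PREDICATES[start:]:
            candidate = body + (pred,)
            # Prune: a body that already misses a positive example cannot be
            # repaired by adding further literals.
            if all(covers(candidate, p) for p in POS):
                queue.append(candidate)
    return None


print(top_down_search())
```

The search starts from the rule with an empty body, which covers every person, and adds literals until no negative example is covered. On this data it returns the body of the rule from Scenario 1.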
As Sammut (1993) states, ILP has its roots in research dating back to at least the 1970s, with notable early contributions by Plotkin (1971) and Shapiro (1983). ILP as a field was founded in 1991 by Muggleton (Muggleton, 1991) and has since had an annual ILP conference. Since its founding, there have been several excellent ILP survey papers (Muggleton & De Raedt, 1994; Muggleton, 1999a; Page & Srinivasan, 2003; Muggleton et al., 2012) and books (Nienhuys-Cheng & Wolf, 1997; De Raedt, 2008). In this paper, our goal is to provide a new introduction to the field aimed at a general AI reader, although we assume some basic knowledge of logic and machine learning. We also differ from existing surveys by including, and mostly focusing on, recent developments (Cropper et al., 2020a), such as new methods for learning recursive programs, predicate invention, and meta-level search methods.

The rest of the paper is organised as follows:
• We describe necessary logic programming notation (Section 2).
• We define the standard ILP learning settings (Section 3).
• We describe the basic assumptions required to build an ILP system (Section 4).
• We compare many ILP systems and describe the features they support (Section 5).
• We describe four ILP systems in detail (Aleph, TILDE, ASPAL, and Metagol) (Section 6).
• We summarise some of the key application areas of ILP (Section 7).
• We briefly compare ILP to other forms of ML (Section 8).
• We conclude by outlining the main current limitations of ILP and suggesting directions for future research (Section 9).
2. Logic programming
ILP uses logic programs (Kowalski, 1974) to represent BK, examples, and hypotheses. A logic program is fundamentally different from an imperative program (e.g. C, Java, Python) and very different from a functional program (e.g. Haskell, OCaml). Imperative programming views a program as a sequence of step-by-step instructions where computation is the process of executing the instructions. By contrast, logic programming views a program as a logical theory (a set of logical rules) where computation consists of various forms of deduction over the theory, such as searching for a proof or a model of it. Another major difference is that a logic program is declarative (Lloyd, 1994) because it allows a user to state what a program should do, rather than how it should work. This declarative nature means that the order of rules in a logic program does not matter.

In the rest of this section, we introduce the basics of logic programming. We cover the syntax and semantics and briefly introduce different types of logic programs. We focus on concepts necessary for understanding ILP and refer the reader to more detailed expositions of logic programming (Nienhuys-Cheng & Wolf, 1997; De Raedt, 2008; Lloyd, 2012) and Prolog (Sterling & Shapiro, 1994; Bratko, 2012) for more information. Readers comfortable with logic can skip this section.
We first define the syntax of a logic program:
• A variable is a string of characters starting with an uppercase letter, e.g. A, B, and C.
• A function symbol is a string of characters starting with a lowercase letter.
• A predicate symbol is a string of characters starting with a lowercase letter, e.g. job or happy. The arity $n$ of a function or predicate symbol $p$ is the number of arguments it takes and is denoted as $p/n$, e.g. happy/1, head/2, and append/3.
• A constant symbol is a function symbol with zero arity, e.g. alice or bob.
• A term is a variable, a constant symbol, or a function symbol of arity $n$ immediately followed by a tuple of $n$ terms.
• A term is ground if it contains no variables.
• An atom is a formula $p(t_1, \dots, t_n)$, where $p$ is a predicate symbol of arity $n$ and each $t_i$ is a term, e.g. lego_builder(alice), where lego_builder is a predicate symbol of arity 1 and alice is a constant symbol.
• An atom is ground if all of its terms are ground, e.g. lego_builder(alice) is ground but lego_builder(A), where A is a variable, is not ground.
• The negation symbol is ¬.
• A literal is an atom A (a positive literal) or its negation ¬A (a negative literal). For instance, lego_builder(alice) is both an atom and a literal but ¬lego_builder(alice) is only a literal because it additionally includes the negation symbol ¬.

A clause is a finite (possibly empty) set of literals. A clause represents a disjunction of literals. For instance, the following set is a clause:

{ happy(A), ¬lego_builder(A), ¬enjoys_lego(A) }

The variables in a clause are implicitly universally quantified, so we do not use quantifiers. A clause is ground if it contains no variables. A clausal theory is a set of clauses. A Horn clause is a clause with at most one positive literal. In logic programming we represent clauses in reverse implication form:

$h \leftarrow b_1, b_2, \dots, b_n.$

The symbol $h$ is an atom (a positive literal) and is called the head of the clause. The symbols $b_i$ are literals and are called the body of the clause. The notation $b_1, b_2, \dots, b_n$ is shorthand for a conjunction of literals, i.e. $b_1 \wedge b_2 \wedge \dots \wedge b_n$. We sometimes use the name rule instead of clause. We often replace the symbol ← with :- to make it easier to write programs:

h :- $b_1$, $b_2$, ..., $b_n$.

Informally, a rule states that the head is true if the body is true, i.e. all of the body literals are true. For instance, recall from the introduction the rule:

happy(A):- lego_builder(A),enjoys_lego(A).
This rule says that happy(A) is true when both lego_builder(A) and enjoys_lego(A) are true, where A is a variable which can be bound to a person. We can use this rule to deduce knowledge. For instance, using the data from the introduction, this rule says that happy(alice) is true because both lego_builder(alice) and enjoys_lego(alice) are true. Table 3 shows specific types of clauses.

A key concept in logic programming is that of a substitution. Simultaneously replacing variables $v_1, \dots, v_n$ in a clause with terms $t_1, \dots, t_n$ is called a substitution and is denoted as $\theta = \{v_1/t_1, \dots, v_n/t_n\}$. For instance, applying the substitution $\theta = \{A/bob\}$ to loves(alice,A) results in loves(alice,bob). A substitution $\theta$ unifies atoms $A$ and $B$ in the case $A\theta = B\theta$. A clause $C$ θ-subsumes a clause $D$ whenever there exists a substitution $\theta$ such that $C\theta \subseteq D$.

A definite logic program is a set of definite clauses.
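Substitution and θ-subsumption can be made concrete with a short Python sketch. It assumes a simplified representation that is not taken from the paper: a literal is a triple (sign, predicate, arguments), arguments are only constants or variables (no nested function terms), and θ-subsumption is decided by brute force over all candidate bindings. The names is_variable, apply_substitution, and theta_subsumes are hypothetical.

```python
from itertools import product


def is_variable(term):
    """Variables start with an uppercase letter, as in the syntax above."""
    return term[0].isupper()


def apply_substitution(literal, theta):
    """Replace each variable in the literal by its binding in theta, if any."""
    sign, predicate, args = literal
    return (sign, predicate, tuple(theta.get(a, a) for a in args))


def theta_subsumes(c, d):
    """Return a substitution theta such that c*theta is a subset of d, else None.

    Brute force: try every mapping from the variables of c to the terms of d.
    """
    variables = sorted({a for _, _, args in c for a in args if is_variable(a)})
    terms = sorted({a for _, _, args in d for a in args})
    for binding in product(terms, repeat=len(variables)):
        theta = dict(zip(variables, binding))
        if {apply_substitution(lit, theta) for lit in c} <= set(d):
            return theta
    return None


# loves(alice,A) with theta = {A/bob} gives loves(alice,bob).
print(apply_substitution((True, "loves", ("alice", "A")), {"A": "bob"}))

# C = {happy(A), not lego_builder(A)}
C = {(True, "happy", ("A",)), (False, "lego_builder", ("A",))}
# D = {happy(alice), not lego_builder(alice), not enjoys_lego(alice)}
D = {
    (True, "happy", ("alice",)),
    (False, "lego_builder", ("alice",)),
    (False, "enjoys_lego", ("alice",)),
}
print(theta_subsumes(C, D))
```

Here C θ-subsumes D with the substitution {A/alice}, whereas D does not θ-subsume C because D contains a literal that has no counterpart in C.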
4. Also called a denial and hard constraint in ASP.
Name              Description                                              Example
Definite clause   A Horn clause with exactly one positive literal          qsort(A,B):- empty(A),empty(B).
Goal              A Horn clause with no head (no positive literal)         :- head(A,B),head(B,A).
Unit clause       A definite clause with no body (no negative literals).   loves(alice,X).
                  We often drop the implication arrow for unit clauses.
Fact              A ground unit clause. As with unit clauses, we often     loves(andy,laura).
                  drop the implication arrow for facts.

Table 3: Types of Horn clauses.
The semantics of logic programs is based on the concepts of a Herbrand universe, base, and interpretation. All three concepts build upon a given vocabulary $V$ containing all constant, function, and predicate symbols of a program. The Herbrand universe is the set of all ground terms that can be formed from the constant and function symbols in $V$. For instance, the Herbrand universe of the lego builder example (Section 1.1) is:

{alice, bob, claire, dave}

If the example also contained the function symbol age/1, then the Herbrand universe would be the infinite set:

{alice, bob, claire, dave, age(alice), age(bob), age(age(alice)), ...}
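Because the Herbrand universe becomes infinite as soon as a function symbol is present, it can only be enumerated up to some bound. The Python sketch below generates all ground terms up to a given nesting depth. It assumes unary function symbols only, and the function name herbrand_universe and its max_depth parameter are hypothetical.

```python
def herbrand_universe(constants, unary_functions, max_depth):
    """Ground terms over the vocabulary with function nesting depth <= max_depth."""
    terms = set(constants)
    frontier = set(constants)
    for _ in range(max_depth):
        # Wrap every term from the previous level in every function symbol.
        frontier = {f"{f}({t})" for f in unary_functions for t in frontier}
        terms |= frontier
    return terms


CONSTANTS = ["alice", "bob", "claire", "dave"]

# Without function symbols the universe is just the four constants.
print(sorted(herbrand_universe(CONSTANTS, [], max_depth=5)))

# With age/1, every extra level of nesting adds four more ground terms.
print(len(herbrand_universe(CONSTANTS, ["age"], max_depth=2)))
```

With no function symbols the result stays at four terms whatever the depth bound, while with age/1 the set keeps growing as the bound increases, which is the finite shadow of the infinite universe described above.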
The Herbrand base is the set of all ground atoms that can be formed from the predicate symbols in $V$ and the terms in the corresponding Herbrand universe. For instance, the Herbrand base of the lego builder example is:

{
    happy(alice), happy(bob), happy(claire), happy(dave),
    lego_builder(alice), lego_builder(bob), lego_builder(claire), lego_builder(dave),
    estate_agent(alice), estate_agent(bob), estate_agent(claire), estate_agent(dave),
    enjoys_lego(alice), enjoys_lego(bob), enjoys_lego(claire), enjoys_lego(dave)
}

A Herbrand interpretation assigns truth values to the elements of a Herbrand base. By convention, a Herbrand interpretation includes true ground atoms, assuming that every atom not included is false. For instance, the Herbrand interpretation corresponding to the example in Section 1.1 is:

{
    happy(alice), lego_builder(alice), lego_builder(bob), estate_agent(claire),
    estate_agent(dave), enjoys_lego(alice), enjoys_lego(claire)
}
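The Herbrand base and interpretation of the lego builder example are small enough to compute directly. The following Python sketch builds the base as the product of predicate symbols and constants and represents the interpretation as the set of true atoms. The names herbrand_base, interpretation, and truth_value are hypothetical, and atoms are encoded as plain strings for brevity.

```python
from itertools import product

CONSTANTS = ["alice", "bob", "claire", "dave"]
PREDICATES = ["happy", "lego_builder", "estate_agent", "enjoys_lego"]  # all of arity 1

# Herbrand base: every ground atom built from a predicate symbol and a constant.
herbrand_base = {f"{p}({c})" for p, c in product(PREDICATES, CONSTANTS)}

# Herbrand interpretation: the subset of the base that is true.
interpretation = {
    "happy(alice)", "lego_builder(alice)", "lego_builder(bob)",
    "estate_agent(claire)", "estate_agent(dave)",
    "enjoys_lego(alice)", "enjoys_lego(claire)",
}


def truth_value(atom):
    """An atom of the base is true iff it is listed in the interpretation."""
    if atom not in herbrand_base:
        raise ValueError(f"{atom} is not in the Herbrand base")
    return atom in interpretation


print(len(herbrand_base), len(interpretation))
print(truth_value("happy(alice)"), truth_value("happy(bob)"))
```

The base contains 16 ground atoms (four predicates times four constants), of which the interpretation marks seven as true. Every other atom of the base is false by convention.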
A Herbrand interpretation is a model for a clause if for all ground substitutions $\theta$, $head\,\theta$ is true whenever $body\,\theta$ is true. A Herbrand interpretation is a model for a set of clauses if it is a model for every clause in it. For instance, the Herbrand interpretation from the previous paragraph is a model for the clause:

happy(A):- lego_builder(A), enjoys_lego(A).

because every substitution that makes the body true (here only $\theta = \{A/alice\}$) also makes the head true. By contrast, the interpretation is not a model of the following clause because the substitution $\theta = \{A/dave\}$ makes the body true but not the head:

enjoys_lego(A):- estate_agent(A).

A clause $c$ is a logical consequence of a theory $T$ if every Herbrand model of $T$ is also a model of $c$. In this case $c$ is said to be entailed by $T$, written $T \models c$.

There are different types of logic programs. We now cover some of the most important ones for ILP.

2.3.1 Clausal logic
Logic programming is based on clausal logic. An advantage of clausal logic is the simple representation: sets of literals. Clausal programs are of the form:

$head_1$; ...; $head_m$ :- $body_1$, ..., $body_n$.

In this program, the symbol ; denotes disjunction. Clausal programs can have multiple consequences. For instance, stating that a human is either male or female can be expressed as:

female(X); male(X) :- human(X).

Another advantage of clausal logic is the existence of efficient inference engines. Robinson (1965) shows that a single rule of inference (the resolution principle) is both sound and refutation complete for clausal logic. Resolution now forms the foundation of logic in AI, and is, for instance, the fundamental operation in the DPLL procedure (Davis et al., 1962) which forms the basis for most efficient complete SAT solvers.

2.3.2 Horn logic
Most ILP systems induce Horn programs. In contrast to clausal logic, a Horn clause has at most one head literal. All programs mentioned in the introduction are Horn programs, such as the program for extracting the last element of the list:

    last(A,B):- tail(A,C),empty(C),head(A,B).
    last(A,B):- tail(A,C),last(C,B).
One reason for focusing on Horn theories, rather than full clausal theories, is SLD-resolution (Kowalski & Kuehner, 1971), an inference rule that sacrifices expressibility for efficiency. For instance, the clause p(a) ∨ q(a) cannot be expressed in Horn logic because it has two positive literals. Horn logic is, however, still Turing complete (Tärnlund, 1977). The efficiency benefits of Horn theories come from their properties when performing resolution on them. The resolvent of two Horn clauses is itself a Horn clause. The resolvent of a goal clause and a definite clause is a goal clause. These properties lead to greater efficiency during inference (basically, SLD-resolution needs to consider fewer options when searching for a proof).
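The two closure properties of Horn clauses under resolution can be checked on a small case. The Python sketch below is a simplification: it handles propositional clauses only, so there is no unification, and it is not an implementation of SLD-resolution. A clause is assumed to be a pair of sets (positive literals, negative literals); `clause`, `is_horn`, `is_goal`, and `resolve` are hypothetical helper names.

```python
# A propositional clause is a pair (positive_literals, negative_literals).
# For example, "happy :- sunny." is ({"happy"}, {"sunny"}).

def clause(pos=(), neg=()):
    return (frozenset(pos), frozenset(neg))


def is_horn(c):
    """A Horn clause has at most one positive literal."""
    return len(c[0]) <= 1


def is_goal(c):
    """A goal clause has no positive literal."""
    return len(c[0]) == 0


def resolve(c1, c2, atom):
    """Resolvent of c1 and c2 on atom, which is positive in c1 and negative in c2."""
    if atom not in c1[0] or atom not in c2[1]:
        raise ValueError("atom must be positive in c1 and negative in c2")
    return clause((c1[0] - {atom}) | c2[0], c1[1] | (c2[1] - {atom}))


rule = clause(pos={"happy"}, neg={"sunny"})  # happy :- sunny.
fact = clause(pos={"sunny"})                 # sunny.
goal = clause(neg={"happy"})                 # :- happy.

step1 = resolve(rule, goal, "happy")  # definite clause + goal gives the goal ":- sunny."
step2 = resolve(fact, step1, "sunny")  # empty clause: the goal is refuted
```

Each step resolves a definite clause with a goal and yields another goal, so a refutation only ever has one goal to track.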
2.3.3 Prolog
Prolog (Kowalski, 1988; Colmerauer & Roussel, 1993) is a logic programming language based on SLD-resolution. Pure Prolog is restricted to Horn clauses. Most Prolog implementations allow extra-logical features, such as cuts. Prolog is not purely declarative because of constructs like cut, which means that a procedural reading of a Prolog program is needed to understand it. In other words, the order of clauses in a Prolog program has a major influence on its execution and results. Computation in a Prolog program is proof search.

2.3.4 Datalog
Datalog is a subfragment of definite programs. The main two restrictions are (i) every variable in the head literal must also appear in a body literal, and (ii) complex terms as arguments of predicates are disallowed, e.g. p(f(1),2) or lists. Therefore, the list manipulation programs from previous sections cannot (easily) be expressed in Datalog. In contrast, Datalog is sufficient for the happy scenario because structured terms are not needed:

    happy(A):- lego_builder(A),enjoys_lego(A).
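To see why such a program is easy to evaluate, here is a minimal Python sketch of naive bottom-up evaluation applied to the happy rule. It is an illustration under simplifying assumptions, not how any particular Datalog engine works: atoms are (predicate, arguments) pairs, variables are strings starting with an uppercase letter, and the facts are those of the lego builder example. The function names are hypothetical.

```python
def is_var(term):
    return term[:1].isupper()


def match(atom, fact, subst):
    """Extend subst so that atom equals fact, or return None if impossible."""
    (pred, args), (fpred, fargs) = atom, fact
    if pred != fpred or len(args) != len(fargs):
        return None
    subst = dict(subst)
    for a, f in zip(args, fargs):
        if is_var(a):
            if subst.setdefault(a, f) != f:
                return None
        elif a != f:
            return None
    return subst


def evaluate(rules, facts):
    """Apply the rules repeatedly until no new ground atom is derived."""
    facts = set(facts)
    while True:
        new = set()
        for head, body in rules:
            substs = [{}]
            for atom in body:
                substs = [s2 for s in substs for f in facts
                          if (s2 := match(atom, f, s)) is not None]
            for s in substs:
                pred, args = head
                new.add((pred, tuple(s.get(a, a) for a in args)))
        if new <= facts:
            return facts
        facts |= new


facts = {
    ("lego_builder", ("alice",)), ("lego_builder", ("bob",)),
    ("estate_agent", ("claire",)), ("estate_agent", ("dave",)),
    ("enjoys_lego", ("alice",)), ("enjoys_lego", ("claire",)),
}
# happy(A) :- lego_builder(A), enjoys_lego(A).
rules = [(("happy", ("A",)), [("lego_builder", ("A",)), ("enjoys_lego", ("A",))])]

model = evaluate(rules, facts)
happy = {args[0] for pred, args in model if pred == "happy"}
```

Because there are no function symbols, only finitely many ground atoms can ever be derived, so the loop is guaranteed to stop.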
Compared to definite programs, the main advantage of Datalog is decidability (Dantsin et al., 2001). However, this decidability comes at the cost of expressivity as Datalog is not Turing complete. By contrast, definite programs with function symbols have the expressive power of Turing machines and are consequently undecidable (Tärnlund, 1977). Unlike Prolog, Datalog is purely declarative.

2.3.5 Non-monotonic logic
A logic is monotonic when adding a clause to a theory does not reduce the logical consequences of that theory. Definite programs are monotonic because anything that could be deduced before a clause is added can still be deduced after it is added. In other words, adding a clause to a definite program cannot remove the logical consequences of the program. A logic is non-monotonic if some conclusions can be removed by adding more knowledge. For instance, consider the following propositional program:

    sunny.
    happy:- sunny.

This program states it is sunny and that I am happy if it is sunny. We can therefore deduce that I am happy because it is sunny. Now suppose that I added another rule:

    sunny.
    happy:- sunny.
    happy:- rich.

This new rule states that I am also happy if I am rich. Note that by the closed world assumption, we know I am not rich. After adding this rule, we can still deduce that I am happy from the first rule.

Now consider the non-monotonic program:

    sunny.
    happy:- sunny, not weekday.

This program states it is sunny and I am happy if it is sunny and it is not a weekday. By the closed world assumption, we can deduce that it is not a weekday, so we can deduce that I am happy because it is sunny and it is not a weekday. Now suppose we added knowledge that it is a weekday:

    sunny.
    happy:- sunny, not weekday.
    weekday.
Then we can no longer deduce that I am happy. In other words, by adding knowledge that it is a weekday, the conclusion that I am happy no longer holds.

Definite programs with negation as failure (NAF) (Clark, 1977) are non-monotonic. The term normal logic program is often used to describe logic programs with NAF in the body. There are many different semantics ascribed to non-monotonic programs, including completion (Clark, 1977), well-founded (Gelder et al., 1991), and stable model (answer set) (Gelfond & Lifschitz, 1988) semantics. Discussing the differences of these semantics is beyond the scope of this paper.

2.3.6 Answer set programming
Answer set programming is a form of logic programming based on stable model (answer set) semantics (Gelfond & Lifschitz, 1988). Whereas a definite logic program has only one model (the least Herbrand model), an ASP program can have one, many, or even no models (answer sets). This makes ASP particularly attractive for expressing common-sense reasoning (Law et al., 2018a). Similar to Datalog, an answer set program is purely declarative. ASP also supports additional language features, such as aggregates and weak and hard constraints. Computation in ASP is the process of finding models. Answer set solvers perform the search and thus generate models. Most ASP solvers (Gebser et al., 2012), in principle, always terminate (unlike Prolog query evaluation, which may lead to an infinite loop). We refer the reader to the excellent book by Gebser et al. (2012) for more information.
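The claim that a program can have one, many, or no answer sets can be checked on tiny propositional programs. The following Python sketch enumerates stable models by brute force using the Gelfond-Lifschitz reduct. It is purely illustrative: real ASP solvers do not enumerate candidates this way, and the rule encoding (head, positive body, negative body) and the function names are assumptions made for this example.

```python
from itertools import chain, combinations

# A rule is (head, positive_body, negative_body) over propositional atoms.
# For example, "happy :- sunny, not weekday." is ("happy", ("sunny",), ("weekday",)).


def least_model(definite_rules):
    """Least model of a negation-free program given as (head, positive_body) pairs."""
    model = set()
    changed = True
    while changed:
        changed = False
        for head, pos in definite_rules:
            if set(pos) <= model and head not in model:
                model.add(head)
                changed = True
    return model


def stable_models(rules):
    """Return every set of atoms that equals the least model of its own reduct."""
    atoms = sorted({a for h, p, n in rules for a in (h, *p, *n)})
    candidates = chain.from_iterable(
        combinations(atoms, k) for k in range(len(atoms) + 1)
    )
    result = []
    for cand in map(set, candidates):
        # Reduct: drop rules whose negative body intersects the candidate,
        # then delete the remaining negative literals.
        reduct = [(h, p) for h, p, n in rules if not (set(n) & cand)]
        if least_model(reduct) == cand:
            result.append(cand)
    return result


before = [("sunny", (), ()), ("happy", ("sunny",), ("weekday",))]
after = before + [("weekday", (), ())]
two_models = [("p", (), ("q",)), ("q", (), ("p",))]  # p :- not q.  q :- not p.
no_models = [("p", (), ("p",))]                      # p :- not p.
```

On the weekday program of Section 2.3.5, `stable_models(before)` contains happy and `stable_models(after)` does not, which is the non-monotonic behaviour described there. The other two programs have two answer sets and none, respectively.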
3. Inductive logic programming
In the introduction, we described three toy ILP scenarios. In each case, the problem was formed of three sets B (background knowledge), E+ (positive examples), and E− (negative examples). We informally stated the ILP problem is to induce a hypothesis H that with B generalises E+ and E−. We now formalise this problem.

According to De Raedt (1997), there are three main ILP learning settings: learning from entailment (LFE), interpretations (LFI), and satisfiability (LFS). LFE and LFI are by far the most popular learning settings, so we only cover these two. Moreover, De Raedt (1997) showed that LFI reduces to LFE, which in turn reduces to LFS. Other recent work focuses on learning from transitions (Inoue et al., 2014; Evans et al., 2019; Ribeiro et al., 2020). We refer the reader to those works for an overview of that new learning setting.

In each setting, the symbol 𝒳 denotes the example space, the set of examples for which a concept is defined; ℬ denotes the language of background knowledge, the set of all clauses that could be provided as background knowledge; and ℋ denotes the hypothesis space, the set of all possible hypotheses.
LFE is by far the most popular ILP setting (Shapiro, 1983; Muggleton, 1987; Muggleton & Buntine, 1988; Muggleton & Feng, 1990; Quinlan, 1990; Muggleton, 1995; Bratko, 1999; Srinivasan, 2001; Ray, 2009; Ahlgren & Yuen, 2013; Muggleton et al., 2015; Cropper & Muggleton, 2016; Kaminski et al., 2018; Cropper & Morel, 2020). The LFE problem is:
Definition 1 (Learning from entailment). Given a tuple (B, E+, E−) where:
• B ⊆ ℬ denotes background knowledge
• E+ ⊆ 𝒳 denotes positive examples of the concept
• E− ⊆ 𝒳 denotes negative examples of the concept
The goal of LFE is to return a hypothesis H ∈ ℋ such that:
• ∀e ∈ E+, H ∪ B ⊨ e (i.e. H is complete)
• ∀e ∈ E−, H ∪ B ⊭ e (i.e. H is consistent)

Example 1. Consider the learning from entailment tuple:

    B = { lego_builder(alice). lego_builder(bob).
          estate_agent(claire). estate_agent(dave).
          enjoys_lego(alice). enjoys_lego(claire). }
    E+ = { happy(alice). }
    E− = { happy(bob). happy(claire). happy(dave). }

Also assume that we have the hypothesis space:

    ℋ = { h1: happy(A):- lego_builder(A).
          h2: happy(A):- estate_agent(A).
          h3: happy(A):- enjoys_lego(A).
          h4: happy(A):- lego_builder(A),estate_agent(A).
          h5: happy(A):- lego_builder(A),enjoys_lego(A).
          h6: happy(A):- estate_agent(A),enjoys_lego(A). }

Then we can consider which hypotheses an ILP system should return:
• B ∪ h1 ⊨ happy(bob) so h1 is inconsistent
• B ∪ h2 ⊭ happy(alice) so h2 is incomplete
• B ∪ h3 ⊨ happy(claire) so h3 is inconsistent
• B ∪ h4 ⊭ happy(alice) so h4 is incomplete
• B ∪ h5 is both complete and consistent
• B ∪ h6 ⊭ happy(alice) so h6 is incomplete
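The case analysis in Example 1 can be reproduced mechanically. The Python sketch below is a simplification that exploits the shape of this example: every hypothesis is a single non-recursive rule whose body is a conjunction of unary BK predicates over the same variable, so entailment reduces to set membership. It takes h3 to use enjoys_lego, the predicate that appears in B. The names `covered` and `classify` are hypothetical.

```python
# Background knowledge as predicate -> set of constants for which it holds.
background = {
    "lego_builder": {"alice", "bob"},
    "estate_agent": {"claire", "dave"},
    "enjoys_lego": {"alice", "claire"},
}
positives = {"alice"}                 # happy(alice)
negatives = {"bob", "claire", "dave"}  # happy(bob), happy(claire), happy(dave)

# Each hypothesis "happy(A) :- p1(A), ..., pk(A)" is stored as its body predicates.
hypotheses = {
    "h1": ["lego_builder"],
    "h2": ["estate_agent"],
    "h3": ["enjoys_lego"],
    "h4": ["lego_builder", "estate_agent"],
    "h5": ["lego_builder", "enjoys_lego"],
    "h6": ["estate_agent", "enjoys_lego"],
}


def covered(body):
    """Constants c such that B together with the rule entails happy(c)."""
    people = set().union(*background.values())
    return {p for p in people if all(p in background[pred] for pred in body)}


def classify(body):
    """Return (complete, consistent) for the rule with the given body."""
    entailed = covered(body)
    complete = positives <= entailed
    consistent = not (negatives & entailed)
    return complete, consistent


solutions = [name for name, body in hypotheses.items() if classify(body) == (True, True)]
```

Only h5 is both complete and consistent. Note that h2 and h6 are also inconsistent, since they entail a negative example, in addition to being incomplete.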
The LFE problem in Definition 1 is general. ILP systems impose strong restrictions on 𝒳, ℬ, and ℋ. For instance, some restrict 𝒳 to only contain atoms whereas others allow clauses. Some restrict ℋ to contain only Datalog clauses. We discuss these biases in Section 4.

According to Definition 1, a hypothesis must entail every positive example (be complete) and no negative examples (be consistent). However, training examples are often noisy, so it is difficult to find a hypothesis that is both complete and consistent. Therefore, most approaches relax this definition and try to find a hypothesis that covers as many positive and as few negative examples as possible. Precisely what this means depends on the system. For instance, the default cost function in Aleph (Srinivasan, 2001) is coverage, defined as the number of positive examples covered minus the number of negative examples covered. Other systems also consider the size of a hypothesis, typically the number of clauses or literals in it. We discuss noise handling in Section 5.1.

The second most popular (De Raedt & Dehaspe, 1997; Blockeel & De Raedt, 1998; Law et al., 2014) learning setting is LFI where an example is an interpretation, i.e. a set of facts. The LFI problem is:
Definition 2 (Learning from interpretations). Given a tuple (B, E+, E−) where:
• B ⊆ ℬ denotes background knowledge
• E+ ⊆ 𝒳 denotes positive examples of the concept, each example being a set of facts
• E− ⊆ 𝒳 denotes negative examples of the concept, each example being a set of facts
The goal of LFI is to return a hypothesis H ∈ ℋ such that:
• ∀e ∈ E+, e is a model of H ∪ B
• ∀e ∈ E−, e is not a model of H ∪ B

Example 2. To illustrate LFI, we use the example from De Raedt and Kersting (2008a). Consider the following BK:

    B = { father(henry,bill). father(alan,betsy). father(alan,benny).
          mother(beth,bill). mother(ann,betsy). mother(alice,benny). }

and the following examples:

    E+ = { e1 = { carrier(alan). carrier(ann). carrier(betsy). },
           e2 = { carrier(benny). carrier(alan). carrier(alice). } }
    E− = { e3 = { carrier(henry). carrier(beth). } }

Also assume the following hypothesis space:

    ℋ = { h1 = carrier(X):- mother(Y,X),carrier(Y),father(Z,X),carrier(Z).
          h2 = carrier(X):- mother(Y,X),father(Z,X). }

To solve the LFI problem (Definition 2), we need to find a hypothesis H such that e1 and e2 are models of H ∪ B and e3 is not. The hypothesis h1 covers both e1 and e2, i.e., for every substitution θ such that body(h1)θ ⊆ B ∪ e1 ∪ e2 holds, it also holds that head(h1)θ ⊆ B ∪ e1 ∪ e2. h1 does not cover e3 as there exists a substitution θ = {X/bill, Y/beth, Z/henry} such that the body holds but the head does not. For the same reason, h2 does not cover any of the examples.
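The coverage claims in Example 2 can be verified with a small model checker. The Python sketch below treats each example together with B as one interpretation and tests the clause against it. It is a minimal illustration: atoms are assumed to be flat tuples (predicate, arg1, ...), variables are strings starting with an uppercase letter, and `is_model` is a hypothetical helper rather than part of any ILP system.

```python
def is_var(term):
    return term[:1].isupper()


def match(atom, fact, subst):
    """Extend subst so that atom equals fact, or return None if impossible."""
    if atom[0] != fact[0] or len(atom) != len(fact):
        return None
    subst = dict(subst)
    for a, f in zip(atom[1:], fact[1:]):
        if is_var(a):
            if subst.setdefault(a, f) != f:
                return None
        elif a != f:
            return None
    return subst


def is_model(interpretation, clause):
    """True if every substitution that makes the body true also makes the head true."""
    head, body = clause
    substs = [{}]
    for atom in body:
        substs = [s2 for s in substs for fact in interpretation
                  if (s2 := match(atom, fact, s)) is not None]
    return all(
        (head[0], *(s.get(t, t) for t in head[1:])) in interpretation
        for s in substs
    )


B = {
    ("father", "henry", "bill"), ("father", "alan", "betsy"), ("father", "alan", "benny"),
    ("mother", "beth", "bill"), ("mother", "ann", "betsy"), ("mother", "alice", "benny"),
}
e1 = {("carrier", "alan"), ("carrier", "ann"), ("carrier", "betsy")}
e2 = {("carrier", "benny"), ("carrier", "alan"), ("carrier", "alice")}
e3 = {("carrier", "henry"), ("carrier", "beth")}

h1 = (("carrier", "X"),
      [("mother", "Y", "X"), ("carrier", "Y"), ("father", "Z", "X"), ("carrier", "Z")])
h2 = (("carrier", "X"), [("mother", "Y", "X"), ("father", "Z", "X")])

h1_covers = [is_model(B | e, h1) for e in (e1, e2, e3)]
h2_covers = [is_model(B | e, h2) for e in (e1, e2, e3)]
```

Here h1 covers e1 and e2 but not e3, so it solves the LFI problem, whereas h2 covers none of the three examples.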
4. Building an ILP system
Building an ILP system requires making several choices or assumptions, which are part of the inductive bias of a learner. An inductive bias is essential for tractable learning and all ML approaches impose an inductive bias (Mitchell, 1997). Understanding these assumptions is key to understanding ILP. The choices can be categorised as:
• Learning setting: how to represent examples
• Representation language: how to represent BK and hypotheses
• Language bias: how to define the hypothesis space
• Search method: how to search the hypothesis space

Table 4 shows the assumptions of some ILP systems. Please note that this table is not a complete listing of ILP systems. The table excludes many important ILP systems, including interactive systems, such as Marvin (Sammut, 1981), MIS (Shapiro, 1983), DUCE (Muggleton, 1987), Cigol (Muggleton & Buntine, 1988), and Clint (De Raedt & Bruynooghe, 1992), and probabilistic systems, such as SLIPCOVER (Bellodi & Riguzzi, 2015) and ProbFOIL (De Raedt et al., 2015). Covering all ILP systems is beyond the scope of this paper. We discuss these differences/assumptions.

The three main ILP learning settings are learning from entailment, interpretations, and satisfiability (Section 3). There is also a learning setting called learning from transitions (Inoue et al., 2014; Evans et al., 2019). Within the learning from entailment setting, there are further distinctions. Some systems, such as Progol (Muggleton, 1995), allow for clauses as examples. Most systems, however, learn from sets of facts, so this dimension of comparison is not useful.
5. The original FOIL setting is more restricted than the table shows and can only have BK in the form of facts and it does not allow for functions (De Raedt, 2008).
6. The FOIL paper does not discuss its language bias.
7. LFIT employs many implicit language biases.
8. The LFIT approach of Inoue et al. (2014) is bottom-up and the LFIT approach of Ribeiro and Inoue (2014) is top-down.
9. ∂ILP uses rule templates which can be seen as a generalisation of metarules.
System                              Setting  Hypotheses     BK        Language bias  Search method
FOIL (Quinlan, 1990)                LFE      Definite       Definite  n/a            TD
Progol (Muggleton, 1995)            LFE      Normal         Normal    Modes          BU+TD
TILDE (Blockeel & De Raedt, 1998)   LFI      Logical trees  Normal    Modes          TD
Aleph (Srinivasan, 2001)            LFE      Normal         Normal    Modes          BU+TD
XHAIL (Ray, 2009)                   LFE      Normal         Normal    Modes          BU
ASPAL (Corapi et al., 2011)         LFE      Normal         Normal    Modes          ML
Atom (Ahlgren & Yuen, 2013)         LFE      Normal         Normal    Modes          BU+TD
LFIT (Inoue et al., 2014)           LFT      Normal         None      n/a            BU+TD
ILASP (Law et al., 2014)            LFI      ASP            ASP       Modes          ML
Metagol (Muggleton et al., 2015)    LFE      Definite       Normal    Metarules      ML
∂ILP (Evans & Grefenstette, 2018)   LFE      Datalog        Facts     Metarules      ML
HEXMIL (Kaminski et al., 2018)      LFE      Datalog        Normal    Metarules      ML
Apperception (Evans et al., 2019)   LFT      Datalog⊃−      None      Types          ML
Popper (Cropper & Morel, 2020)      LFE      Definite       Normal    Declarations   ML
Table 4: Assumptions made by popular ILP systems. LFE stands for learning from entailment, LFI stands for learning from interpretations, and LFT stands for learning from transitions. TD stands for top-down, BU stands for bottom-up, and ML stands for meta-level. Please note that this table is meant to provide a very high-level overview of some ILP systems. Therefore, the table entries are coarse and should not be taken absolutely literally. For instance, Progol, Aleph, and ILASP support other types of language biases, such as constraints on clauses. Popper also, for instance, supports ASP programs as BK, but usually takes normal programs.
The clearest way to differentiate ILP systems is by the hypotheses they learn. The simplest distinction is between systems that learn propositional programs, such as Duce (Muggleton, 1987), and those that learn first-order programs. Almost all ILP systems learn first-order (or higher-order) programs, so this distinction is not useful. For systems that learn first-order programs, there are classes of programs that they learn. We now cover a selection of them.

4.2.1 Clausal logic
Some ILP systems induce full (unrestricted) clausal theories, such as Claudien (De Raedt & Dehaspe, 1997) and CF-induction (Inoue, 2004). However, reasoning about full clausal theories is computationally expensive, so most ILP systems learn sub-fragments (restrictions) of clausal logic.

4.2.2 Definite programs
Definite programs support complex data structures, such as lists. ILP systems that aim to induce algorithms and general-purpose programs (Shapiro, 1983; Bratko, 1999; Ahlgren & Yuen, 2013; Muggleton et al., 2015; Cropper & Muggleton, 2016, 2019; Cropper & Morel, 2020) often induce definite programs, typically as Prolog programs. A key motivation for inducing Prolog programs is that there are many efficient Prolog implementations, such as YAP (Costa et al., 2012) and SWI-Prolog (Wielemaker et al., 2012).

4.2.3 Datalog

Datalog is a syntactical subset of Horn logic. Datalog is a truly declarative language. By contrast, reordering clauses and literals in a Prolog program can change the results (and can easily lead to non-terminating programs). Whereas a Prolog query may never terminate, a Datalog query is guaranteed to terminate. This decidability, however, comes at the expense of not being a Turing-complete language. Because of this decidability, some ILP systems induce Datalog programs (Evans & Grefenstette, 2018; Kaminski et al., 2018; Evans et al., 2019).

4.2.4 Normal programs
Many ILP systems learn normal logic programs (Section 2.3.5). The main motivation for learning normal programs is that many practical applications require non-monotonic reasoning. Moreover, it is often simpler to express a concept with negation. For instance, consider the following ILP problem based on the paper by Ray (2009):

    B = { bird(A):- penguin(A)
          bird(alvin)
          bird(betty)
          bird(charlie)
          penguin(doris) }
    E+ = { flies(alvin)
           flies(betty)
           flies(charlie) }
    E− = { flies(doris) }

Without negation it is difficult to induce a general hypothesis for this problem. By contrast, with negation an ILP system could learn the hypothesis:

    H = { flies(A):- bird(A), not penguin(A) }

ILP approaches that learn normal logic programs can further be characterised by their semantics, such as whether they are based on completion (Clark, 1977), well-founded (Gelder et al., 1991), or stable model (answer set) (Gelfond & Lifschitz, 1988) semantics. Discussing the differences between these semantics is beyond the scope of this paper.

4.2.5 Answer set programs
Inducing ASP programs is a relatively new topic (Otero, 2001; Law et al., 2014). There are several benefits to learning ASP programs. When learning Prolog programs with negation, the programs must be stratified, or otherwise the learned program may loop under certain queries (Law et al., 2018a). ASP programs support rules that are not available in Prolog, such as choice rules and weak and hard constraints. For instance, ILASP (Law et al., 2014) can learn the following definition of a Hamiltonian graph (taken from Law et al. (2020)) represented as an ASP program:
This program contains features not found in Prolog programs, such as a choice rule (the first line), which states that the literal in(V0,V1) can be true, but need not be. The bottom two lines are hard constraints, which are also not usually found by ILP systems that learn Prolog programs.

Approaches to learning ASP programs can be divided into two categories: brave learners, which aim to learn a program such that at least one answer set covers the examples, and cautious learners, which aim to find a program which covers the examples in all answer sets. We refer to the work of Sakama and Inoue (2009) and Law et al. (2018a) for more information about these different approaches.

4.2.6 Higher-order programs
Most ILP systems learn first-order programs. However, as many programmers know, there are benefits in using higher-order representations. For instance, suppose you have some encrypted/decrypted strings represented as Prolog facts:

    E+ = { decrypt([d,b,u],[c,a,t])
           decrypt([e,p,h],[d,o,g])
           decrypt([h,p,p,t,f],[g,o,o,s,e]) }

Your goal is to induce a decryption program from these examples. Given these examples and suitable BK, an ILP system could learn the first-order program:

    H = { decrypt(A,B):- empty(A),empty(B)
          decrypt(A,B):- head(A,C),char_to_int(C,D),prec(D,E),int_to_char(E,F),head(B,F),tail(A,G),tail(B,H),decrypt(G,H) }

This program defines a Caesar cypher which shifts each character back once (e.g. z ↦ y, y ↦ x, etc). Although correct (ignoring the modulo operation for simplicity), this program is long and difficult to read. To overcome this limitation, some ILP systems (Cropper et al., 2020) learn higher-order logic programs, such as:

    H = { decrypt(A,B):- map(A,B,inv)
          inv(A,B):- char_to_int(A,C),prec(C,D),int_to_char(D,B) }

This program is higher-order because it allows for quantification over variables that can be bound to predicate symbols. In other words, this program is higher-order because it allows literals to take predicate symbols as arguments. The symbol inv is invented (we discuss predicate invention in Section 5.4) and is used as an argument for map in the first clause and as a predicate symbol in the second clause. The higher-order program is smaller than the first-order program because the higher-order background relation map abstracts away the need to learn a recursive program. Cropper et al. (2020) show that inducing higher-order programs can drastically improve learning performance in terms of predictive accuracy, sample complexity, and learning times.

4.2.7 Discussion
Why do systems learn so many different types of hypotheses? The systems differ mostly because of the problems they are designed to solve. For instance, Metagol and Popper are designed for program synthesis tasks, where the goal is to learn executable computer programs that reason over complex data structures and infinite domains, so both systems induce Prolog programs. By contrast, TILDE and ILASP focus on learning predictive rules, which typically do not require complex structures.

To illustrate a difference, reconsider the quicksort example from the introduction represented as a Prolog program:

    H = { sort(A,B):- empty(A),empty(B).
          sort(A,B):- head(A,Pivot),partition(Pivot,A,L1,R1),sort(L1,L2),sort(R1,R2),append(L2,R2,B). }

Given this program, we can use Prolog to query whether sort(A,B) holds for specific values of A and B. For instance, we can call sort([1,2,3],[1,2,3]), which Prolog will tell us is true. We can call sort([1,3,2],[1,3,2]), which Prolog will tell us is false. This type of querying can be seen as classification because we classify the truth or falsity of a particular ground atom. Prolog programs can also be queried with variables to ask for answer substitutions. For instance, we can call sort([1,3,2],A) and ask for an answer substitution for A and Prolog will return A=[1,2,3]. Because we are reasoning about relations we can also call sort(A,[1,2,3]) and ask for an answer substitution for A and, depending on the exact definitions of the body literals, Prolog may return A=[3,2,1]. In fact, Prolog could return all possible values for A, i.e. all unsorted lists that can be sorted to [1,2,3]. Such querying is not possible with standard ASP approaches because an ASP solver tries to find a model of the program. To make this distinction clear, suppose we have the above sort program and the unsorted list [5,4,7,8,2,1,6]. To deduce a sorted list we could query the Prolog program by calling sort([5,4,7,8,2,1,6],A) and Prolog will return the answer substitution [1,2,4,5,6,7,8]. By contrast, if we wanted to deduce a sorted list using a standard ASP approach (we ignore that ASP systems do not usually support lists), we would need to enumerate every possible list l and check whether sort([5,4,7,8,2,1,6],l) holds, or, to put it another way, would need to enumerate the unsorted/sorted list pairs.

4.3.1 Complex relations
ILP uses BK, which is similar to features used in other forms of machine learning. However, whereas features are finite tables, BK is a logic program. Using logic programs to represent data allows ILP to learn with complex relational information. For instance, if the task is to recognise objects in images, a user can provide a theory of light as BK (Muggleton et al., 2018). If learning to recognise events, a user could provide the axioms of the event calculus as BK (Katzouris et al., 2015). If learning causal relations in causal networks, a user can encode constraints about the network (Inoue et al., 2013).

If we want to learn list or string transformation programs, we might want to supply helper relations, such as head, tail, and last as BK:

    B = { head([H|_],H).
          tail([_|T],T).
          last([H],H).
          last([_|T1],A):- tail(T1,T2),last(T2,A). }

These relations hold for lists of any length and any type. By contrast, if we wanted to do the same with a table-based ML approach, we would need to pre-compute these facts for use in, for instance, a one-hot-encoding representation. Such an approach is infeasible for large domains and is impossible for infinite domains.

As a second example, suppose you want to learn the definition of a prime number. Then you might want to give an ILP system the ability to perform arithmetic reasoning, such as using the Prolog relations:

    B = { even(A):- 0 is mod(A,2).
          odd(A):- 1 is mod(A,2).
          sum(A,B,C):- C is A+B.
          gt(A,B):- A>B.
          lt(A,B):- A<B. }

4.3.2 Constraints

BK allows a human to encode prior knowledge of a problem. As a trivial example, if learning banking rules to determine whether two companies can lend to each other, you may encode a prior constraint to prevent two companies from lending to each other if they are owned by the same parent company:

    :- lend(CompanyA,CompanyB),parent(CompanyA,CompanyC),parent(CompanyB,CompanyC).
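The lending constraint can be read as an integrity check: any set of lend facts in which lender and borrower share a parent company is rejected. The following Python sketch shows that reading on made-up data. The company names and the function `violates_lending_constraint` are hypothetical, and the sketch only mirrors the logical condition, not how an ILP system would use the constraint during search.

```python
def violates_lending_constraint(lend, parent):
    """True if some lend(A, B) exists where A and B share a parent company.

    lend   -- set of (lender, borrower) pairs
    parent -- set of (company, parent_company) pairs
    """
    parents_of = {}
    for company, owner in parent:
        parents_of.setdefault(company, set()).add(owner)
    return any(
        parents_of.get(a, set()) & parents_of.get(b, set())
        for a, b in lend
    )


# Hypothetical ownership data: acme and bolt share the parent holdco.
parent = {("acme", "holdco"), ("bolt", "holdco"), ("crest", "otherco")}

allowed = {("acme", "crest")}
forbidden = {("acme", "crest"), ("acme", "bolt")}
```

A hypothesis whose consequences include `forbidden` would be pruned, whereas one that only yields `allowed` would be kept.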
A second example (taken from the Aleph manual (Srinivasan, 2001)) comes from a pharmaceutical application where a user expresses two constraints over possible hypotheses:

    :- hypothesis(Head,Body,_),has_pieces(Body,Pieces),length(Pieces,N),N =< 2.
    :- hypothesis(_,Body,_),has_pieces(Body,Pieces),incomplete_distances(Body,Pieces).

The first constraint states that hypotheses are unacceptable if they have fewer than three pieces. The second constraint states that hypotheses are unacceptable if they do not specify the distances between all pairs of pieces.

4.3.3 Discussion

As with choosing appropriate features in other forms of ML, choosing appropriate BK in ILP is crucial for good learning performance. ILP has traditionally relied on predefined and hand-crafted BK, often designed by domain experts. However, it is often difficult and expensive to obtain such BK. Indeed, the over-reliance on hand-crafted BK is a common criticism of ILP (Evans & Grefenstette, 2018). The difficulty is finding the balance of having enough BK to solve a problem, but not so much that a system becomes overwhelmed. We discuss these two issues.

Too little BK. If we use too little or insufficient BK then we may exclude the target hypothesis. For instance, reconsider the string transformation problem from the introduction, where we want to learn a program that returns the last character of a string from examples such as:

    E+ = { last([m,a,c,h,i,n,e], e)
           last([l,e,a,r,n,i,n,g], g)
           last([a,l,g,o,r,i,t,m], m) }

To induce a hypothesis from these examples, we need to provide an ILP system with suitable BK. For instance, we might provide BK that contains relations for common list/string operations, such as empty, head, and tail. Given these three relations, an ILP system could learn the program:

    H = { last(A,B):- tail(A,C),empty(C),head(A,B).
          last(A,B):- tail(A,C),last(C,B). }

However, suppose that the user had not provided tail as BK.
Then how could an ILP system learn the above hypothesis? This situation is a major problem for ILP systems.

Two recent avenues of research attempt to overcome this limitation. The first idea is to enable an ILP system to automatically invent new predicate symbols (Section 5.4), which has been shown to mitigate missing BK (Cropper & Muggleton, 2015). The second idea is to perform lifelong and transfer learning to discover knowledge that can be reused to help learn other programs (Lin et al., 2014; Cropper, 2019, 2020), which we discuss in Section 5.4.5. Despite these recent advances, ILP still relies on much human input to solve a problem. Addressing this limitation is a major challenge for ILP.

Too much BK. As with too little BK, a major challenge in ILP is too much irrelevant BK. Too many relations (assuming that they can appear in a hypothesis) is often a problem because the size of the hypothesis space is a function of the size of the BK. Empirically, too much irrelevant BK is detrimental to learning performance (Srinivasan et al., 1995, 2003; Cropper, 2020); this also applies to irrelevant language biases (Cropper & Tourret, 2020). Addressing the problem of too much BK has been under-researched. In Section 9, we suggest that this topic is a promising direction for future work, especially when considering the potential for ILP to be used for lifelong learning (Section 5.4.5).

The basic ILP problem is to search the hypothesis space for a suitable hypothesis. The hypothesis space contains all possible programs that can be built in the chosen representation language. Unrestricted, the hypothesis space is infinite, so it is important to restrict it to make the search feasible. The main way to restrict the hypothesis space is to enforce an inductive bias (Mitchell, 1997).
The most common bias is a language bias which enforces restrictions on hypotheses, such as to restrict the number of variables, literals, and clauses in a hypothesis. These restrictions can be categorised as either syntactic bias, restrictions on the form of clauses in a hypothesis, or semantic bias, restrictions on the behaviour of induced hypotheses (Adé et al., 1995).

In the happy example (Example 1.1), we assumed that a hypothesis only contains predicate symbols which appear in the BK or examples. However, we need to encode this bias to give it to an ILP system. There are several ways of encoding a language bias, such as grammars (Cohen, 1994a). We focus on mode declarations (Muggleton, 1995) and metarules (Cropper & Tourret, 2020), two popular language biases.

4.4.1 Mode declarations

Mode declarations are the most popular form of language bias (Muggleton, 1995; Blockeel & De Raedt, 1998; Srinivasan, 2001; Ray, 2009; Corapi et al., 2010, 2011; Athakravi et al., 2013; Ahlgren & Yuen, 2013; Law et al., 2014; Katzouris et al., 2015). Mode declarations state which predicate symbols may appear in a clause, how often, and also their argument types. In the mode language, modeh declarations denote which literals may appear in the head of a clause and modeb declarations denote which literals may appear in the body of a clause. A mode declaration is of the form:

    mode(recall, pred(m_1, m_2, ..., m_a))

The following are all valid mode declarations:

    modeh(1,happy(+person)).
    modeb(*,tail(+list,-list)).
    modeb(*,head(+list,-element)).
    modeb(2,add(+int,+int,-int)).

The first argument of a mode declaration is an integer denoting the recall. Recall is the maximum number of times that a mode declaration can be used in a clause. Another way of understanding recall is that it bounds the number of alternative solutions for a literal.
For instance, if learning the parent kinship relation, we could use the modeb declaration:

    modeb(2,parent(+person,-person)).

This declaration states that a person has at most two parents. If we were learning the grandparent relation, then we could set the recall to four. Providing a recall is a hint to an ILP system to ignore certain hypotheses. For instance, if we know that a relation is functional, such as succ, then we can bound the recall to one. The symbol * denotes no bound.

The second argument denotes the predicate symbol that may appear in the head (modeh) or body (modeb) of a clause and the type of arguments it takes. The symbols +, -, and # denote input, output, or ground arguments respectively. An input argument specifies that, at the time of calling the literal, the corresponding argument must be instantiated. In other words, the argument needs to be bound to a variable that already appears in the clause. An output argument specifies that the argument should be bound after calling the corresponding literal. A ground argument specifies that the argument should be ground and is often used to learn clauses with constant symbols in them.

Example 3 (Mode declarations). To illustrate mode declarations, consider the modes:

    modeh(1,target(+list,-char)).
    modeb(*,head(+list,-char)).
    modeb(*,tail(+list,-list)).
    modeb(1,member(+list,-char)).
    modeb(1,equal(+char,-char)).

Given these modes, the clause:

    target(A,B):- head(A,C),tail(C,B).

is not mode consistent because modeh(1,target(+list,-char)) requires that the second argument of target (B) is a char, whereas modeb(*,tail(+list,-list)) requires that the second argument of tail (B) is a list.

The clause:

    target(A,B):- empty(A),head(C,B).

is also not mode consistent because modeb(*,head(+list,-char)) requires that the first argument of head (C) is instantiated, but the variable C is never instantiated in the clause.

The clause:

    target(A,B):- member(A,B),member(A,C),equal(B,C).

is not mode consistent because modeb(1,member(+list,-char)) has a recall of one, so a member literal may appear at most once in the clause.

By contrast, the following clauses are all mode consistent:

    target(A,B):- tail(A,C),head(C,B).
    target(A,B):- tail(A,C),tail(C,D),equal(C,D),head(A,B).
    target(A,B):- tail(A,C),member(C,B).

Different ILP systems use mode declarations in slightly different ways. Progol and Aleph use mode declarations with input/output argument types because they induce Prolog programs, where the order of literals in a clause matters. By contrast, ILASP induces ASP programs, where the order of literals in a clause does not matter, so ILASP does not use input/output arguments.

4.4.2 Metarules

Metarules are another popular form of syntactic bias and are used by many systems (Emde et al., 1983; De Raedt & Bruynooghe, 1992; Flener, 1996; Kietz & Wrobel, 1992; Wang et al., 2014; Muggleton et al., 2015; Cropper & Muggleton, 2016; Kaminski et al., 2018; Evans & Grefenstette, 2018; Bain & Srinivasan, 2018). Metarules are second-order Horn clauses which define the structure of learnable programs, which in turn defines the hypothesis space (Cropper & Tourret, 2020). For instance, to learn the grandparent relation given the parent relation, the chain metarule would be suitable:

    P(A,B):- Q(A,C), R(C,B).

The letters P, Q, and R denote second-order variables (variables that can be bound to predicate symbols) and the letters A, B and C denote first-order variables (variables that can be bound to constant symbols).
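Instantiating the chain metarule amounts to searching for predicate symbols to substitute for Q and R. The following Python sketch illustrates this with a brute-force search. It is not how Metagol or any other system implements the search, and the parent facts, the examples, and the function names are all hypothetical.

```python
from itertools import product

# Hypothetical background knowledge: parent/2 facts (illustrative only).
BK = {"parent": {("ann", "bob"), ("bob", "carl"), ("bob", "dana")}}

# Hypothetical positive and negative examples of the target relation.
POS = {("ann", "carl"), ("ann", "dana")}
NEG = {("ann", "bob"), ("bob", "carl")}


def chain(q, r):
    """Pairs (A, B) derivable by the body Q(A,C), R(C,B)."""
    return {(a, b) for (a, c) in q for (c2, b) in r if c == c2}


def instantiate_chain(target, bk, pos, neg):
    """Try every substitution for Q and R drawn from the BK predicates and
    keep those that derive all positive and no negative examples."""
    solutions = []
    for q_name, r_name in product(sorted(bk), repeat=2):
        derived = chain(bk[q_name], bk[r_name])
        if pos <= derived and not (neg & derived):
            solutions.append(f"{target}(A,B):- {q_name}(A,C),{r_name}(C,B).")
    return solutions


rules = instantiate_chain("grandparent", BK, POS, NEG)
print(rules)
```

With a single background predicate there is only one candidate substitution. With k background predicates the chain metarule alone yields k * k candidates, which gives a feel for how metarules bound the hypothesis space.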
Given the chain metarule, the background parent relation, and examples of the grandparent relation, ILP approaches will try to find suitable substitutions for the second-order variables, such as the substitutions {P/grandparent, Q/parent, R/parent} to induce the theory:

    grandparent(A,B):- parent(A,C),parent(C,B).

The idea of using metarules to restrict the hypothesis space has been widely adopted by many non-ILP approaches (Albarghouthi et al., 2017; Rocktäschel & Riedel, 2017; Si et al., 2018; Raghothaman et al., 2020). However, despite their now widespread use, there is little work on determining which metarules to use for a given learning task. Instead, these approaches assume suitable metarules as input or use metarules without any theoretical guarantees. In contrast to other forms of bias in ILP, such as modes or grammars, metarules are themselves logical statements, which allows us to reason about them. For this reason, there is preliminary work in reasoning about metarules to identify universal sets suitable to learn certain fragments of logic programs (Cropper & Muggleton, 2014; Tourret & Cropper, 2019; Cropper & Tourret, 2020). Despite this preliminary work, deciding which metarules to use for a given problem is still a major challenge, which future work must address.

Footnote 10: Metarules are also called program schemata (Flener, 1996) and second-order schemata (De Raedt & Bruynooghe, 1992), amongst many other names.

4.4.3 Discussion

Choosing an appropriate language bias is essential to make an ILP problem tractable because it defines the hypothesis space. If the bias is too weak, then the problem is intractable. If the bias is too strong, then we risk excluding the correct solution from the hypothesis space. This trade-off is one of the major problems holding ILP back from being widely used.

To understand the impact of an inappropriate language bias, consider the string transformation example in Section 1.2.
Even if all necessary background relations are provided, not providing a recursive metarule (e.g. R(A,B):- P(A,C), R(C,B)) would prevent a metarule-based system from inducing a program that generalises to lists of any length. Similarly, not supplying a recursive mode declaration for the target relation would prevent a mode-based system from finding the correct hypothesis. For instance, it would be impossible to induce a correct program for the string transformation task in Section 1.2 if the first argument of the head relation is specified as an output argument and the second one as the input.

Different language biases offer different benefits. Mode declarations are expressive enough to enforce a strong bias that significantly prunes the hypothesis space. They are especially appropriate when a user has much knowledge about their data and can, for instance, determine suitable recall values. If a user does not have such knowledge, then it can be very difficult to determine suitable mode declarations. Moreover, if a user provides very weak mode declarations (for instance with infinite recall, a single type, and no input/output arguments), then the search quickly becomes intractable. Although there is some work on learning mode declarations (McCreath & Sharma, 1995; Ferilli et al., 2004; Picado et al., 2017), it is still a major challenge to choose appropriate ones.

The main strength of metarules is that they require little knowledge of the background relations, and a user does not need to provide recall values, types, or specify input/output arguments. Also, because they precisely define the form of hypotheses, they can greatly reduce the hypothesis space, especially if the user knows about the class of programs to be learned. However, as previously mentioned, the major downside with metarules is determining which metarules to use for an arbitrary learning task.
Although there is some preliminary work in identifying universal sets of metarules (Cropper & Muggleton, 2014; Tourret & Cropper, 2019; Cropper & Tourret, 2020), deciding which metarules to use for a given problem is a major challenge, which future work must address.

Footnote 11: The Blumer bound (Blumer et al., 1987) (the bound is a reformulation of Lemma 2.1) helps explain this trade-off. This bound states that given two hypothesis spaces, searching the smaller space will result in fewer errors compared to the larger space, assuming that the target hypothesis is in both spaces. Here lies the problem: how to choose a learner's hypothesis space so that it is large enough to contain the target hypothesis yet small enough to be efficiently searched.

Having defined the hypothesis space, the next problem is to efficiently search it. There are two traditional search methods: bottom-up and top-down. These methods rely on notions of generality, where one program is more general or more specific than another. Most approaches reason about the generality of hypotheses syntactically through θ-subsumption (or subsumption for short) (Plotkin, 1971):

Definition 3 (Clausal subsumption). A clause C_1 subsumes a clause C_2 if and only if there exists a substitution θ such that C_1 θ ⊆ C_2.

Example 4 (Clausal subsumption). Let C_1 and C_2 be the clauses:

    C_1 = f(A,B):- head(A,B)
    C_2 = f(X,Y):- head(X,Y),odd(Y)

Then C_1 subsumes C_2 because C_1 θ ⊆ C_2 with θ = {A/X, B/Y}.

Conversely, a clause C_2 is more specific than a clause C_1 if C_1 subsumes C_2.

A generality relation imposes an order over the hypothesis space. Figure 2 shows this order. An ILP system can exploit this ordering during the search for a hypothesis. For instance, if a clause does not entail a positive example, then there is no need to explore any of its specialisations because it is logically impossible for them to entail the example.
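The subsumption test of Definition 3 can be sketched as a small backtracking search for a substitution. This is a minimal Python illustration under stated assumptions, not an implementation taken from any ILP system: clauses are lists of signed literals, terms are flat (no nested function symbols), strings starting with an upper-case letter in the first clause are variables, and every term of the second clause is treated as a constant. All names are hypothetical.

```python
def is_var(term):
    """Assumption: variables are strings starting with an upper-case letter."""
    return isinstance(term, str) and term[:1].isupper()


def match(lit1, lit2, theta):
    """Extend theta so that lit1 under theta equals lit2, or return None."""
    (sign1, pred1, args1), (sign2, pred2, args2) = lit1, lit2
    if (sign1, pred1, len(args1)) != (sign2, pred2, len(args2)):
        return None
    theta = dict(theta)  # copy, so that backtracking leaves the caller's theta intact
    for s, t in zip(args1, args2):
        if is_var(s):
            if theta.setdefault(s, t) != t:
                return None  # s is already bound to a different term
        elif s != t:
            return None
    return theta


def subsumes(c1, c2, theta=None):
    """Return a substitution theta with c1*theta a subset of c2, else None."""
    theta = {} if theta is None else theta
    if not c1:
        return theta
    first, rest = c1[0], c1[1:]
    for lit in c2:
        extended = match(first, lit, theta)
        if extended is not None:
            result = subsumes(rest, c2, extended)
            if result is not None:
                return result
    return None


# The clauses of Example 4: "+" marks the head literal, "-" a body literal.
C1 = [("+", "f", ("A", "B")), ("-", "head", ("A", "B"))]
C2 = [("+", "f", ("X", "Y")), ("-", "head", ("X", "Y")), ("-", "odd", ("Y",))]

print(subsumes(C1, C2))  # a substitution exists, so C1 subsumes C2
print(subsumes(C2, C1))  # None: odd(Y) has no counterpart in C1
```

The search is exponential in the worst case, which is expected because deciding θ-subsumption is NP-complete.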
Likewise, if a clause entails a negative example, then there is no need to explore any of its generalisations because they will also entail the example.

[Figure 2 modes: modeh(1, son(+,-)). modeb(1, male(+)). modeb(1, parent(-,+)). modeb(1, parent(…]

Figure 2: The generality relation orders the hypothesis space into a lattice (an arrow connects a hypothesis with its specialisation). The hypothesis space is built from the modes and only shown partially (# indicates that a constant needs to be used as an argument; only claire is used as a constant here). The most general hypothesis sits on the top of the lattice, while the most specific hypotheses are at the bottom. The top-down lattice traversal starts at the top, with the most general hypothesis, and specialises it moving downwards through the lattice. The bottom-up traversal starts at the bottom, with the most specific hypothesis, and generalises it moving upwards through the lattice.

4.5.1 Top-down

Top-down algorithms, such as FOIL (Quinlan, 1990), TILDE (Blockeel & De Raedt, 1998), and HYPER (Bratko, 1999), start with a general hypothesis and then specialise it. For instance, HYPER searches a tree in which the nodes correspond to hypotheses. Each child of a hypothesis in the tree is more specific than or equal to its predecessor in terms of θ-subsumption, i.e. a hypothesis can only entail a subset of the examples entailed by its parent. The construction of hypotheses is based on hypothesis refinement (Shapiro, 1983; Nienhuys-Cheng & Wolf, 1997). If a hypothesis is considered that does not entail all the positive examples, it is immediately discarded because it can never be refined into a complete hypothesis.

4.5.2 Bottom-up

Bottom-up algorithms start with the examples and generalise them (Muggleton, 1987; Muggleton & Buntine, 1988; Muggleton & Feng, 1990; Muggleton et al., 2009b; Inoue et al., 2014). For instance, Golem (Muggleton & Feng, 1990) generalises pairs of examples based on relative least-general generalisation (Buntine, 1988).

Plotkin's (1971) notion of least-general generalisation (LGG) is the fundamental concept of bottom-up ILP methods. Given two clauses, the LGG operator returns the most specific single clause that is more general than both of them. To define the LGG of two clauses, we start with the LGG of terms:

• lgg(f(s_1, ..., s_n), f(t_1, ..., t_n)) = f(lgg(s_1, t_1), ..., lgg(s_n, t_n)).
• lgg(f(s_1, ..., s_n), g(t_1, ..., t_n)) = V (a variable), where f and g are different function symbols.

We define the LGG of literals:

• lgg(p(s_1, ..., s_n), p(t_1, ..., t_n)) = p(lgg(s_1, t_1), ..., lgg(s_n, t_n)).
• lgg(¬p(s_1, ..., s_n), ¬p(t_1, ..., t_n)) = ¬p(lgg(s_1, t_1), ..., lgg(s_n, t_n)).
• lgg(p(s_1, ..., s_n), q(t_1, ..., t_n)) is undefined.
• lgg(p(s_1, ..., s_n), ¬p(t_1, ..., t_n)) is undefined.
• lgg(¬p(s_1, ..., s_n), p(t_1, ..., t_n)) is undefined.

Finally, we define the LGG of two clauses:

    lgg(cl_1, cl_2) = { lgg(l_1, l_2) | l_1 ∈ cl_1, l_2 ∈ cl_2, lgg(l_1, l_2) is defined }

In other words, the LGG of two clauses is the set of LGGs of all pairs of literals of the two clauses for which the LGG is defined.

Buntine's (1988) notion of relative least-general generalisation (RLGG) computes an LGG of two examples relative to the BK (assumed to be a set of facts), i.e.:

    rlgg(e_1, e_2) = lgg(e_1 :- BK, e_2 :- BK).

Figure 3: Bongard problems

Example 5. To illustrate RLGG, consider the Bongard problems in Figure 3, where the goal is to spot the common factor in both images. Assume that the images are described with the BK:

    B = contains(1,o1).
        contains(2,o3).
        contains(1,o2).
        triangle(o1).
        triangle(o3).
        points(o1,down).
        points(o3,down).
        circle(o2).

We can use RLGG to identify the common factor, i.e., to find a program representing the common factor. We will denote the example images as bon(1) and bon(2).
We start by formulating the clauses describing the examples relative to the BK and removing irrelevant parts of the BK (the literals that describe the other image are dropped from each clause):

    lgg( (bon(1) :- contains(1,o1), contains(1,o2), triangle(o1), points(o1,down), circle(o2).),
         (bon(2) :- contains(2,o3), triangle(o3), points(o3,down).) )

We proceed by computing the LGG for the head and the body literals of the two clauses separately. The LGG of the head literals is lgg(bon(1),bon(2)) = bon(lgg(1,2)) = bon(X). An important thing to note here is that we have to use the same variable for the same ordered pair of terms everywhere. For instance, we have used the variable X for lgg(1,2) and we have to use the same variable every time we encounter the same pair of terms. To compute the LGG of the body literals, we compute the LGG for all pairs of body literals:

    { lgg(contains(1,o1),contains(2,o3)), lgg(contains(1,o1),triangle(o3)), lgg(contains(1,o1),points(o3,down)),
      lgg(contains(1,o2),contains(2,o3)), lgg(contains(1,o2),triangle(o3)), lgg(contains(1,o2),points(o3,down)),
      lgg(triangle(o1),contains(2,o3)), lgg(triangle(o1),triangle(o3)), lgg(triangle(o1),points(o3,down)),
      lgg(points(o1,down),contains(2,o3)), lgg(points(o1,down),triangle(o3)), lgg(points(o1,down),points(o3,down)),
      lgg(circle(o2),contains(2,o3)), lgg(circle(o2),triangle(o3)), lgg(circle(o2),points(o3,down)) }.

Computing the LGGs and removing the redundant literals gives us the clause:

    bon(X):- contains(X,Y),triangle(Y),points(Y,down).

We suggest the book by De Raedt (2008) for more information about generality orders.
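The LGG definitions and the Bongard computation of Example 5 can be reproduced with a short anti-unification routine. The Python sketch below makes several assumptions: compound terms are tuples of the form (functor, arg, ...), constants are strings, literals are (sign, predicate, arguments) triples, and fresh variables are named V0, V1, and so on. It performs no reduction, so redundant literals remain in the result, and all names are hypothetical.

```python
def lgg_term(s, t, table):
    """LGG of two terms; `table` maps each ordered pair of differing terms to
    one variable, so the same pair always yields the same variable."""
    if s == t:
        return s
    if (isinstance(s, tuple) and isinstance(t, tuple)
            and s[0] == t[0] and len(s) == len(t)):
        args = tuple(lgg_term(a, b, table) for a, b in zip(s[1:], t[1:]))
        return (s[0],) + args
    if (s, t) not in table:
        table[(s, t)] = f"V{len(table)}"
    return table[(s, t)]


def lgg_literal(l1, l2, table):
    """LGG of two literals, or None when it is undefined."""
    (sign1, pred1, args1), (sign2, pred2, args2) = l1, l2
    if (sign1, pred1, len(args1)) != (sign2, pred2, len(args2)):
        return None
    args = tuple(lgg_term(a, b, table) for a, b in zip(args1, args2))
    return (sign1, pred1, args)


def lgg_clause(c1, c2):
    """LGG of two clauses: the defined LGGs of all pairs of literals."""
    table, result = {}, []
    for l1 in c1:
        for l2 in c2:
            g = lgg_literal(l1, l2, table)
            if g is not None and g not in result:
                result.append(g)
    return result


# The two clauses of Example 5: "+" marks the head, "-" a body literal.
BON1 = [("+", "bon", ("1",)), ("-", "contains", ("1", "o1")),
        ("-", "contains", ("1", "o2")), ("-", "triangle", ("o1",)),
        ("-", "points", ("o1", "down")), ("-", "circle", ("o2",))]
BON2 = [("+", "bon", ("2",)), ("-", "contains", ("2", "o3")),
        ("-", "triangle", ("o3",)), ("-", "points", ("o3", "down"))]

generalisation = lgg_clause(BON1, BON2)
for literal in generalisation:
    print(literal)
```

The result contains bon(V0), contains(V0,V1), triangle(V1), and points(V1,down), matching the clause in Example 5 up to variable names, plus contains(V0,V2), which is the redundant literal that the example removes.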
Footnote 12: We can compute the LGG of the head and body literals separately because, when a clause is converted to the set representation, the literals in the body and head have different signs (body literals are negative, while the head literals are positive), which results in an undefined LGG.

4.5.3 Top-down and bottom-up

Progol is one of the most important ILP systems and has inspired many other ILP approaches (Srinivasan, 2001; Ray, 2009; Ahlgren & Yuen, 2013), including Aleph, which we cover in detail in Section 6.1. Progol is, however, slightly confusing because it is a top-down ILP system but it first uses a bottom-up approach to bound the search space. Indeed, many authors only consider it a top-down ILP approach. Progol is a set covering algorithm. Starting with an empty program, Progol picks an uncovered positive example to generalise. To generalise an example, Progol uses mode declarations (Section 4.4.1) to build the bottom clause (Muggleton, 1995), the logically most-specific clause that explains the example. The use of a bottom clause bounds the search from above (the empty set) and below (the bottom clause). In this way, Progol is a bottom-up approach because it starts with a bottom clause and tries to generalise it. However, to find a generalisation of the bottom clause, Progol uses an A* algorithm to search for a generalisation in a top-down (general-to-specific) manner and uses the other examples to guide the search. In this way, Progol is a top-down approach. When the clause search (the search for a generalisation of the bottom clause) has finished, Progol adds the clause to its hypothesis (and thus makes it more general) and removes any positive examples entailed by the new hypothesis. It repeats this process until there are no more positive examples uncovered.
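The outer covering loop described above can be sketched as follows. The sketch deliberately swaps Progol's bottom-clause construction and A* search for a greedy choice from a fixed list of candidate clauses, so it shows only the loop structure. The function names, the candidate clauses, and the toy numeric examples are hypothetical.

```python
def set_cover(positives, negatives, candidates, covers):
    """Greedy covering loop: add one clause at a time until no positive
    example is left uncovered."""
    hypothesis = []
    uncovered = set(positives)
    while uncovered:
        seed = min(uncovered)  # pick an uncovered positive example
        best_clause, best_gain = None, 0
        for clause in candidates:
            if not covers(clause, seed):
                continue  # the clause must explain the seed example
            if any(covers(clause, n) for n in negatives):
                continue  # the clause must be consistent
            gain = sum(1 for e in uncovered if covers(clause, e))
            if gain > best_gain:
                best_clause, best_gain = clause, gain
        if best_clause is None:
            uncovered.discard(seed)  # no consistent clause explains the seed
            continue
        hypothesis.append(best_clause)
        # Remove every positive example covered by the new clause.
        uncovered = {e for e in uncovered if not covers(best_clause, e)}
    return hypothesis


# Hypothetical toy task: examples are integers, clauses are named tests.
CANDIDATES = [
    ("even", lambda x: x % 2 == 0),
    ("small", lambda x: x < 4),
    ("large", lambda x: x > 10),
]
learned = set_cover({2, 4, 12, 15}, {3, 7}, CANDIDATES,
                    lambda clause, example: clause[1](example))
print([name for name, _ in learned])
```

Because each clause is chosen only with respect to the examples that remain uncovered, the resulting program is not guaranteed to be the smallest consistent one.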
In Section 6.1, we discuss this approach in more detail when we describe Aleph (Srinivasan, 2001), an ILP system that is very similar to Progol.

4.5.4 Meta-level

A third approach has recently emerged called meta-level ILP (Inoue et al., 2013; Muggleton et al., 2015; Inoue, 2016; Law et al., 2020; Cropper & Morel, 2020). There is no agreed-upon definition for what meta-level ILP means, but most approaches encode the ILP problem as a meta-level logic program, i.e. a program that reasons about programs. Such meta-level approaches often delegate the search for a hypothesis to an off-the-shelf solver (Corapi et al., 2011; Athakravi et al., 2013; Muggleton et al., 2014; Law et al., 2014; Kaminski et al., 2018; Evans et al., 2019; Cropper & Dumančić, 2020; Cropper & Morel, 2020), after which the meta-level solution is translated back to a standard solution for the ILP task. In other words, instead of writing a procedure to search in a top-down or bottom-up manner, meta-level approaches formulate the learning problem as a declarative problem, often as an ASP problem (Corapi et al., 2011; Athakravi et al., 2013; Muggleton et al., 2014; Law et al., 2014; Kaminski et al., 2018; Evans et al., 2019; Cropper & Dumančić, 2020; Cropper & Morel, 2020). For instance, ASPAL (Corapi et al., 2011) translates an ILP task into a meta-level ASP program which describes every example and every possible rule in the hypothesis space (defined by mode declarations). ASPAL then uses an ASP system to find a subset of the rules that cover all the positive but none of the negative examples. In other words, ASPAL delegates the search to an ASP solver. ASPAL uses an ASP optimisation statement to find the hypothesis with the fewest literals.

Meta-level approaches can often learn optimal and recursive programs. Moreover, meta-level approaches use diverse techniques and technologies.
For instance, Metagol (Muggleton et al., 2015; Cropper & Muggleton, 2016) uses a Prolog meta-interpreter to search for a proof of a meta-level Prolog program. ASPAL (Corapi et al., 2011), ILASP (Law et al., 2014), HEXMIL (Kaminski et al., 2018), and the Apperception Engine (Evans et al., 2019) translate an ILP problem into an ASP problem and use powerful ASP solvers to find a model of the problem (note that these systems all employ very different algorithms). ∂ILP (Evans & Grefenstette, 2018) uses neural networks to solve the problem. Overall, the development of meta-level ILP approaches is exciting because it has diversified ILP from the standard clause refinement approach of earlier ILP systems.

For more information about meta-level reasoning, we suggest the work of Inoue (2016), who provides a nice introduction to meta-level reasoning and learning. Law et al. (2020) also provide a nice overview of what they call conflict-driven ILP, which the ILP systems ILASP3 (Law, 2018) and Popper (Cropper & Morel, 2020) adopt.

Footnote 13: The A* search strategy employed by Progol can easily be replaced by alternative search algorithms, such as stochastic search (Muggleton & Tamaddoni-Nezhad, 2008).

4.5.5 Discussion

The different search methods discussed above have different advantages and disadvantages, and there is no 'best' approach. Moreover, as Progol illustrates, there are not necessarily clear distinctions between top-down, bottom-up, and meta-level approaches. We can, however, make some general observations about the different approaches.

Bottom-up approaches can be seen as being data- or example-driven. The major advantage of these approaches is that they are typically very fast.
However, as Bratko (1999) points out, there are several disadvantages of bottom-up approaches, such as (i) they typically use unnecessarily long hypotheses with many clauses, (ii) it is difficult for them to learn recursive hypotheses and multiple predicates simultaneously, and (iii) they do not easily support predicate invention.

The main advantages of top-down approaches are that they can more easily learn recursive programs and textually minimal programs. The major disadvantage is that they can be prohibitively inefficient because they can generate many hypotheses that do not cover even a single positive example. Another disadvantage of top-down approaches is their reliance on iterative improvements. For instance, TILDE keeps specialising every clause which leads to improvement (i.e., a clause covers fewer negative examples). As such, TILDE can get stuck with suboptimal solutions if the necessary clauses are very long and intermediate specialisations do not improve the score (coverage) of the clause. To avoid this issue, these systems rely on lookahead (Struyf et al., 2006), which increases the complexity of learning.

The main advantage of meta-level approaches is that they can learn recursive programs and optimal programs (Corapi et al., 2011; Law et al., 2014; Kaminski et al., 2018; Evans & Grefenstette, 2018; Evans et al., 2019; Cropper & Morel, 2020). They can also harness the state-of-the-art techniques in constraint solving, notably in ASP. However, some unresolved issues remain. A key issue is that many approaches encode an ILP problem as a single (often very large) ASP problem (Corapi et al., 2011; Law et al., 2014; Kaminski et al., 2018; Evans et al., 2019), so struggle to scale to problems with very large domains. Moreover, since most ASP solvers only work on ground programs (Gebser et al., 2014), pure ASP-based approaches are inherently restricted to tasks that have a small and finite grounding.
Although preliminary work attempts to tackle this issue (Cropper & Morel, 2020), work is still needed for these approaches to scale to very large problems. Many approaches also precompute every possible rule in a hypothesis (Corapi et al., 2011; Law et al., 2014), so struggle to learn programs with large rules, although preliminary work tries to address this issue (Cropper & Dumančić, 2020).

5. ILP systems

Table 5 compares the same ILP systems in Table 4 on a small number of dimensions. Again, Table 5 excludes many important ILP systems. It also excludes many other dimensions of comparison, such as whether a system supports non-observational predicate learning, where examples of the target relations are not directly given (Muggleton, 1995). We discuss these features in turn.

System                              Noise  Optimality  Infinite domains  Recursion  Predicate invention
FOIL (Quinlan, 1990)                Yes    No          Yes               Partly     No
Progol (Muggleton, 1995)            No     Yes         Yes               Partly     No
TILDE (Blockeel & De Raedt, 1998)   Yes    No          Yes               No         No
Aleph (Srinivasan, 2001)            No     Yes         Yes               Partly     No
XHAIL (Ray, 2009)                   Yes    No          Yes               Partly     No
ASPAL (Corapi et al., 2011)         No     Yes         No                Yes        No
Atom (Ahlgren & Yuen, 2013)         Yes    No          Yes               Partly     No
ILASP (Law et al., 2014)            Yes    Yes         No                Yes        Partly
LFIT (Inoue et al., 2014)           No     Yes         No                No         No
Metagol (Muggleton et al., 2015)    No     Yes         Yes               Yes        Yes
∂ILP (Evans & Grefenstette, 2018)   Yes    Yes         No                Yes        Partly
HEXMIL (Kaminski et al., 2018)      No     Yes         No                Yes        Yes
Apperception (Evans et al., 2019)   Yes    Yes         No                Yes        Partly
Popper (Cropper & Morel, 2020)      No     Yes         Yes               Yes        No

Table 5: A vastly simplified comparison of ILP systems. As with Table 4, this table is meant to provide a very high-level overview of some ILP systems. Therefore, the table entries are coarse and should not be taken absolutely literally. For instance, Metagol does not support noise, and thus has the value no in the noise column, but there is an extension (Muggleton et al., 2018) that samples examples to mitigate the issue of misclassified examples. ILASP and ∂ILP support predicate invention, but a restricted form. See Section 5.4 for an explanation. FOIL, Progol, and XHAIL can learn recursive programs when given sufficient examples. See Section 5.3 for an explanation.

Footnote 14: A logical decision tree learned by TILDE can be translated into a logic program that contains invented predicate symbols. However, TILDE is unable to reuse any invented symbols whilst learning.

Footnote 15: ILASP precomputes every rule defined by a given mode declaration M to form a rule space S_M. Given background knowledge B and an example E, ILASP requires that the grounding of B ∪ S_M ∪ E must be finite.

Footnote 16: LFIT does not support recursion in the rules but allows recursion in their usage. The input is a set of pairs of interpretations and the output is a logic program which can be recursively applied on its own output to produce sequences of interpretations.

Noise handling is important in machine learning. In ILP there can be different forms of noise. We distinguish between three types of noise:

• Noisy examples: where an example is misclassified
• Incorrect BK: where a relation holds when it should not (or does not hold when it should)
• Imperfect BK: where relations are missing or there are too many irrelevant relations

We discuss these three types of noise.

5.1.1 Noisy examples

The ILP problem definitions from Section 3 are too strong to account for noisy examples because they expect a hypothesis that entails all of the positive and none of the negative examples. Therefore, most ILP systems relax this constraint and accept a hypothesis that does not necessarily cover all positive examples or that covers some negative examples.

Most ILP systems based on set covering naturally support noise handling.
For instance, TILDE essentially extends a decision tree learner (Quinlan, 1986, 1993) to the first-order setting and uses the same information gain methods to induce hypotheses. Progol is similar because it employs a minimum-description length principle to perform set covering over the positive examples. The noise-tolerant version of ILASP (Law et al., 2018b) uses ASP's optimisation abilities to provably learn the program with the best coverage. In general, handling noisy examples is a well-studied topic in ILP.

5.1.2 Incorrect BK

Most ILP systems assume that the BK is perfect. In other words, most ILP approaches assume that atoms are true or false, and there is no room for uncertainty. This assumption is a major limitation because real-world data, such as images or speech, cannot always be easily translated into a purely noise-free symbolic representation. We discuss this limitation in Section 9.1.

One of the key appealing features of ∂ILP is that it takes a differentiable approach to ILP and can be given fuzzy or ambiguous data. Rather than an atom being true or false, ∂ILP gives atoms continuous semantics, which maps atoms to the real unit interval [0, 1]. The authors successfully demonstrate the approach on MNIST classification.

A natural way to handle (possibly) incorrect BK is to specify the uncertainty about BK, which leads to the probabilistic approaches to machine learning. To effectively utilise imperfect BK, statistical relational artificial intelligence (StarAI) (De Raedt & Kersting, 2008b; De Raedt et al., 2016) unites logic programming with probabilistic reasoning. StarAI formalisms allow a user to explicitly quantify the confidence in the correctness of the BK by annotating parts of BK with probabilities or weights.

Perhaps the simplest flavour of StarAI languages, and the one that directly builds upon logic programming and Prolog, is a family of languages based on distribution semantics (Sato, 1995; Sato & Kameya, 2001; De Raedt et al., 2007).
In contrast to logic programs which represent a deterministic program, probabilistic logic programs define a distribution over the possible executions of a program. Problog (De Raedt et al., 2007), a prominent member of this family, represents a minimal extension of Prolog that supports such stochastic execution. Problog introduces two types of probabilistic choices: probabilistic facts and annotated disjunctions.

Footnote 17: It is, unfortunately, a common misconception that ILP cannot handle mislabelled examples (Evans & Grefenstette, 2018).

Probabilistic facts are the most basic stochastic unit in Problog. They take the form of logical facts labelled with a probability p and represent a Boolean random variable that is true with probability p and false with probability 1 − p. For instance, the following probabilistic fact states that there is a 1% chance of an earthquake in Naples:

    0.01::earthquake(naples).

An alternative interpretation of this statement is that 1% of executions of the probabilistic program would observe an earthquake. Whereas probabilistic facts introduce non-deterministic behaviour on the level of facts, annotated disjunctions introduce non-determinism on the level of clauses. Annotated disjunctions allow for multiple literals in the head, but only one of the head literals can be true at a time. For instance, the following annotated disjunction, in which p1, p2, and p3 stand for the probabilities annotating the three head literals, states that a ball can be either green, red, or blue, but not a combination of colours:

    p1::colour(B,green); p2::colour(B,red); p3::colour(B,blue) :- ball(B).

Though StarAI frameworks allow for incorrect BK, they add another level of complexity to learning: besides identifying the right program (also called structure in StarAI), the learning task also consists of learning the corresponding probabilities of probabilistic choices (also called parameters).
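The reading of probabilistic facts and annotated disjunctions as random choices made once per execution can be sketched by sampling. The Python code below is only an illustration of that reading, not Problog's inference procedure. The function names are hypothetical, and the colour probabilities 0.5, 0.3, and 0.2 are assumed values chosen for the example.

```python
import random


def sample_fact(p, rng):
    """A probabilistic fact p::f is true in one execution with probability p."""
    return rng.random() < p


def sample_annotated_disjunction(choices, rng):
    """choices is a list of (probability, head) pairs whose probabilities sum
    to at most one. At most one head is selected per execution; None means
    that no head was selected."""
    r = rng.random()
    cumulative = 0.0
    for p, head in choices:
        cumulative += p
        if r < cumulative:
            return head
    return None


rng = random.Random(0)  # fixed seed, so that repeated runs agree
runs = 100_000

# 0.01::earthquake(naples): fraction of executions that observe an earthquake.
quake_rate = sum(sample_fact(0.01, rng) for _ in range(runs)) / runs

# Assumed probabilities for the three colours of a single ball.
COLOURS = [(0.5, "green"), (0.3, "red"), (0.2, "blue")]
colours = [sample_annotated_disjunction(COLOURS, rng) for _ in range(runs)]

print(quake_rate)
print({c: colours.count(c) / runs for c in ("green", "red", "blue")})
```

Each sampled execution assigns the ball exactly one colour, which mirrors the constraint that only one head literal of an annotated disjunction can be true at a time.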
Learning probabilistic logic programs is largely unexplored, with only a few existing approaches (De Raedt et al., 2015; Bellodi & Riguzzi, 2015).

5.1.3 Imperfect BK

Handling imperfect BK is an underexplored topic in ILP. We can distinguish between two types of imperfect BK: missing BK and too much BK, which we discussed in Section 4.3.3.

There are often multiple (sometimes infinitely many) hypotheses that solve the ILP problem (or have the same training error). In such cases, which hypothesis should we choose?

5.2.1 Occamist bias

Many ILP systems try to learn a textually minimal hypothesis. This approach is justified as following an Occamist bias (Schaffer, 1993). The most common interpretation of an Occamist bias is that amongst all hypotheses consistent with the data, the simplest is the most likely.

Footnote 18: Domingos (1999) points out that this interpretation is controversial, partly because Occam's razor is interpreted in two different ways. Following Domingos (1999), let the generalisation error of a hypothesis be its error on unseen examples and the training error be its error on the examples it was learned from. The formulation of the razor that is perhaps closest to Occam's original intent is: given two hypotheses with the same generalisation error, the simpler one should be preferred because simplicity is desirable in itself. The second formulation, which most ILP systems follow, is quite different and can be stated as: given two hypotheses with the same training error, the simpler one should be preferred because it is likely to have lower generalisation error. Domingos (1999) points out that the first razor is largely uncontroversial, but the second one, taken literally, is provably and empirically false (Zahálka & Zelezný, 2011). Many ILP systems did not distinguish between the two cases. We therefore also do not make any distinction.
Most approaches use an Occamist bias to find the smallest hypothesis, measured in terms of the number of clauses (Muggleton et al., 2015), literals (Law et al., 2014), or description length (Muggleton, 1995). Most ILP systems are not, however, guaranteed to induce the smallest programs. A key reason for this limitation is that many approaches learn a single clause at a time, leading to the construction of sub-programs which are sub-optimal in terms of program size and coverage. For instance, Aleph, described in detail in the next section, offers no guarantees about the program size and coverage.

Newer ILP systems address this limitation (Corapi et al., 2011; Law et al., 2014; Cropper & Muggleton, 2016; Kaminski et al., 2018; Cropper & Morel, 2020). The main development is to take a meta-level (global) view of the induction task, which we discussed in Section 4.5. In other words, rather than induce one clause at a time from a subset of the examples, the idea is to induce a whole program using all the examples. For instance, ASPAL (Corapi et al., 2011) is given as input a hypothesis space with a set of candidate clauses. The ASPAL task is to find a minimal subset of clauses that entails all the positive and none of the negative examples. ASPAL uses ASP's optimisation abilities to provably learn the program with the fewest literals.

5.2.2 Cost-minimal programs

The ability to learn optimal programs opens ILP to new problems. For instance, learning efficient logic programs has long been considered a difficult problem in ILP (Muggleton & De Raedt, 1994; Muggleton et al., 2012), mainly because there is no declarative difference between an efficient program, such as mergesort, and an inefficient program, such as bubble sort. To address this issue, Metaopt (Cropper & Muggleton, 2019) learns efficient programs. Metaopt maintains a cost during the hypothesis search and uses this cost to prune the hypothesis space.
To learn minimal time complexity logic programs, Metaopt minimises the number of resolution steps. For instance, imagine learning a find duplicate program, which finds a duplicate element in a list, e.g. [p,r,o,g,r,a,m] ↦ r and [i,n,d,u,c,t,i,o,n] ↦ i. Given suitable input data, Metagol induces the program:

    f(A,B):- head(A,B),tail(A,C),element(C,B).
    f(A,B):- tail(A,C),f(C,B).

This program goes through the elements of the list checking whether the same element exists in the rest of the list. Given the same input, Metaopt induces the program:

    f(A,B):- mergesort(A,C),f1(C,B).
    f1(A,B):- head(A,B),tail(A,C),head(C,B).
    f1(A,B):- tail(A,C),f1(C,B).

This program first sorts the input list and then goes through the list to check for duplicate adjacent elements. Although larger, both in terms of clauses and literals, the program learned by Metaopt is more efficient (O(n log n)) than the program learned by Metagol (O(n^2)).

FastLAS (Law et al., 2020) follows this idea and takes as input a custom scoring function and computes an optimal solution with respect to the given scoring function. The authors show that this approach allows a user to optimise domain-specific performance metrics on real-world datasets, such as access control policies.

Some ILP systems, mostly meta-level approaches, cannot handle infinite domains (Corapi et al., 2011; Athakravi et al., 2013; Law et al., 2014; Evans & Grefenstette, 2018; Kaminski et al., 2018; Evans et al., 2019). The reason that pure ASP-based systems (Corapi et al., 2011; Law et al., 2014; Kaminski et al., 2018; Evans et al., 2019) cannot handle infinite domains is that (most) current ASP solvers only work on ground programs. ASP systems (which combine a grounder and a solver), such as Clingo (Gebser et al., 2014), first take a first-order program as input, ground it using an ASP grounder, and then use an ASP solver to determine whether the ground problem is satisfiable.
A limitation of this approach is the intrinsic grounding problem: a problem must have a finite, and ideally small, grounding. This grounding issue is especially problematic when reasoning about complex data structures (such as lists) and real numbers (most ASP implementations do not natively support lists or real numbers). For instance, ILASP (Law et al., 2014) can represent real numbers as strings and can delegate reasoning to Python (via Clingo's scripting feature). However, in this approach, the numeric computation is performed when grounding the inputs, so the grounding must be finite, which makes it impractical. This grounding problem is not specific to ASP-based systems. For instance, ∂ILP is an ILP system based on a neural network, but it only works on BK in the form of a finite set of ground atoms. This grounding problem is essentially the fundamental problem faced by table-based ML approaches that we discussed in Section 4.3.

One approach to mitigate this problem is to use context-dependent examples (Law et al., 2016), where BK can be associated with specific examples, so that an ILP system need only ground part of the BK. Although this approach is shown to alleviate the grounding problem compared to not using context-dependent examples, it still needs a finite grounding for each example and still struggles as the domain size increases (Cropper & Morel, 2020).

The power of recursion is that an infinite number of computations can be described by a finite recursive program (Wirth, 1985). In ILP, recursion is often crucial for generalisation. We illustrate this importance with two examples.

Example 6 (Reachability). Consider learning the concept of reachability in a graph. Without recursion, an ILP system would need to learn a separate clause to define reachability for each path length.
For instance, to define reachability for depths 1-4 would require the program:

    reachable(A,B):- edge(A,B).
    reachable(A,B):- edge(A,C),edge(C,B).
    reachable(A,B):- edge(A,C),edge(C,D),edge(D,B).
    reachable(A,B):- edge(A,C),edge(C,D),edge(D,E),edge(E,B).

This program does not generalise because it does not define reachability for arbitrary depths. Moreover, most ILP systems would need examples of each depth to learn such a program. By contrast, an ILP system that supports recursion can learn the program:

    reachable(A,B):- edge(A,B).
    reachable(A,B):- edge(A,C),reachable(C,B).

Although smaller, this program generalises to reachability of any depth. Moreover, ILP systems can learn this definition from a small number of examples of arbitrary reachability depth.

Example 7 (String transformations). As a second example, reconsider the string transformation problem from the introduction (Section 1.2). As with the reachability example, without recursion, an ILP system would need to learn a separate clause to find the last element for each list of length n, such as:

    last(A,B):- tail(A,C),empty(C),head(A,B).
    last(A,B):- tail(A,C),tail(C,D),empty(D),head(C,B).
    last(A,B):- tail(A,C),tail(C,D),tail(D,E),empty(E),head(D,B).

By contrast, an ILP system that supports recursion can learn the compact program:

    last(A,B):- tail(A,C),empty(C),head(A,B).
    last(A,B):- tail(A,C),last(C,B).

Because of the symbolic representation and the recursive nature, this program generalises to lists of arbitrary length that contain arbitrary elements (e.g. integers and characters).

Without recursion it is often difficult for an ILP system to generalise from small numbers of examples (Cropper et al., 2015). Moreover, recursion is vital for many program synthesis tasks, such as the quicksort scenario from the introduction. Despite its importance, learning recursive programs has long been a difficult problem for ILP (Muggleton et al., 2012).
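The generalisation gap described in Example 6 can be checked directly. The Python sketch below mirrors the two reachable programs as ordinary functions and runs them on a chain graph; the function names and the cycle guard in the recursive version are assumptions of the sketch (the guard only keeps the Python search from looping and is not part of the logic program).

```python
def reachable_bounded(edges, a, b):
    """Mirror of the non-recursive program: one clause per path length 1..4."""
    frontier = {a}
    for _ in range(4):  # clauses for depths 1, 2, 3 and 4
        frontier = {y for (x, y) in edges if x in frontier}
        if b in frontier:
            return True
    return False

def reachable_recursive(edges, a, b, seen=None):
    """Mirror of:
        reachable(A,B):- edge(A,B).
        reachable(A,B):- edge(A,C),reachable(C,B).
    """
    seen = set() if seen is None else seen
    for (x, c) in edges:
        if x != a:
            continue
        if c == b:  # base clause
            return True
        if c not in seen:  # cycle guard (sketch-only assumption)
            seen.add(c)
            if reachable_recursive(edges, c, b, seen):  # recursive clause
                return True
    return False

chain = [(i, i + 1) for i in range(6)]  # 0 -> 1 -> ... -> 6

# Depth 4 is covered by both programs.
print(reachable_bounded(chain, 0, 4), reachable_recursive(chain, 0, 4))
# Depth 6 is only covered by the recursive program.
print(reachable_bounded(chain, 0, 6), reachable_recursive(chain, 0, 6))
```

The bounded version silently fails on any pair that is more than four edges apart, whereas the recursive version needs no change as the graph grows.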
Moreover, there are many negative theoretical results on the learnability of recursive programs (Cohen, 1995b). As Table 5 shows, many ILP systems cannot learn recursive programs, or can only learn them in a limited form.

A common limitation of many systems is that they rely on bottom clause construction (Muggleton, 1995), which we discuss in more detail in Section 6.1. In this approach, for each example, an ILP system creates the most specific clause that entails the example and then tries to generalise the clause to entail other examples. However, in this approach, an ILP system learns only a single clause per example. This covering approach requires examples of both the base and inductive cases, which means that such systems struggle to learn recursive programs, especially from small numbers of examples.

Footnote 19: This statement is not true for all ILP systems that employ bottom clause construction. XHAIL (Ray, 2009), for instance, can induce multiple clauses per example.

Interest in recursion has resurged recently with the introduction of meta-interpretive learning (MIL) (Muggleton et al., 2014, 2015; Cropper et al., 2020) and the MIL system Metagol (Cropper & Muggleton, 2016). The key idea of MIL is to use metarules (Section 4.4.2) to restrict the form of inducible programs and thus the hypothesis space. For instance, the chain metarule (P(A,B) ← Q(A,C), R(C,B)) allows Metagol to induce programs such as:

    f(A,B):- tail(A,C),head(C,B).

Footnote 20: Metagol can induce longer clauses through predicate invention, which we discuss in Section 5.4.

Metagol induces recursive programs using recursive metarules, such as the tail recursive metarule P(A,B) ← Q(A,C), P(C,B). Metagol can also learn mutually recursive programs, such as learning the definition of an even number by also inventing and learning the definition of an odd number (even_1):

    even(0).
    even(A):- successor(A,B),even_1(B).
    even_1(A):- successor(A,B),even(B).
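A direct Python transcription shows how the two definitions call each other until the base case even(0) is reached. The sketch assumes that successor(A,B) holds when A = B + 1 on the natural numbers, which is the reading under which the program terminates; the source does not spell out the argument order.

```python
def even(a):
    # even(0).
    if a == 0:
        return True
    # even(A):- successor(A,B),even_1(B).   (assumed reading: B = A - 1)
    return a > 0 and even_1(a - 1)

def even_1(a):
    # even_1(A):- successor(A,B),even(B).
    # There is no base case: 0 has no predecessor, so even_1(0) fails.
    return a > 0 and even(a - 1)

print([n for n in range(8) if even(n)])
print([n for n in range(8) if even_1(n)])
```

The invented predicate even_1 holds exactly for the odd numbers, although the learner was never given a definition of odd.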
Many ILP systems can now learn recursive programs (Evans & Grefenstette, 2018; Kaminski et al., 2018; Evans et al., 2019; Cropper & Morel, 2020). With recursion, ILP systems can generalise from small numbers of examples, often a single example (Lin et al., 2014; Cropper, 2019). For instance, Popper (Cropper & Morel, 2020) can learn common list transformation programs from just a handful of examples, such as a program to drop the last element of a list:

    droplast(A,B):- tail(A,B),empty(B).
    droplast(A,B):- tail(A,C),droplast(C,D),head(A,E),cons(E,D,B).

The ability to learn recursive programs has opened up ILP to new application areas, including learning string transformation programs (Lin et al., 2014), robot strategies (Cropper & Muggleton, 2015), regular and context-free grammars (Muggleton et al., 2014), answer set grammars (Law et al., 2019), and even efficient algorithms (Cropper & Muggleton, 2019).

5.4 Predicate invention

Most ILP systems assume that the given BK is suitable to induce a solution. This assumption may not always hold. Rather than expecting a user to provide all the necessary BK, the goal of predicate invention is for an ILP system to automatically invent new auxiliary predicate symbols, i.e. to introduce new predicate symbols in a hypothesis that are given in neither the examples nor the BK. This idea is similar to when humans create new functions when manually writing programs, such as to reduce code duplication or to improve readability. Predicate invention has repeatedly been stated as an important challenge in ILP (Muggleton & Buntine, 1988; Stahl, 1995; Muggleton et al., 2012). As Muggleton et al. (2012) state, predicate invention is attractive because it is a most natural form of automated discovery. Similarly, Russell and Norvig (2010) say "some of the deepest revolutions in science come from the invention of new predicates and functions - for example, Galileo's invention of acceleration, or Joule's invention of thermal energy. Once these terms are available, the discovery of new laws becomes (relatively) easy". Russell (2019) goes even further and argues that the automatic invention of new high-level concepts is the most important step needed to reach human-level AI.

A classical example of predicate invention is learning the definition of grandparent from only the background relations mother and father. Given suitable examples and no other background relations, an ILP system can learn the program:

    grandparent(A,B):- mother(A,C),mother(C,B).
    grandparent(A,B):- mother(A,C),father(C,B).
    grandparent(A,B):- father(A,C),mother(C,B).
    grandparent(A,B):- father(A,C),father(C,B).

Although correct, this program is large and has 4 clauses and 12 literals. By contrast, consider the program learned by a system which supports predicate invention:

    grandparent(A,B):- inv(A,C),inv(C,B).
    inv(A,B):- mother(A,B).
    inv(A,B):- father(A,B).

To learn this program, an ILP system has invented a new predicate symbol inv. This program is semantically equivalent to the previous one, but is shorter both in terms of the number of literals (7) and of clauses (3) and is arguably more readable. The invented symbol inv can be interpreted as parent. In other words, if we rename inv to parent we have the program:

    grandparent(A,B):- parent(A,C),parent(C,B).
    parent(A,B):- mother(A,B).
    parent(A,B):- father(A,B).

As this example shows, predicate invention can help learn smaller programs, which, in general, are preferable because most ILP systems struggle to learn large programs (Cropper et al., 2020b; Cropper & Dumančić, 2020).

To further illustrate this size reduction, consider learning the greatgrandparent relation, again from only the background relations mother and father.
Without predicate invention, an ILP system would need to learn the 8 clause program:

    greatgrandparent(A,B):- mother(A,C),mother(C,D),mother(D,B).
    greatgrandparent(A,B):- mother(A,C),mother(C,D),father(D,B).
    greatgrandparent(A,B):- mother(A,C),father(C,D),mother(D,B).
    greatgrandparent(A,B):- mother(A,C),father(C,D),father(D,B).
    greatgrandparent(A,B):- father(A,C),father(C,D),father(D,B).
    greatgrandparent(A,B):- father(A,C),father(C,D),mother(D,B).
    greatgrandparent(A,B):- father(A,C),mother(C,D),father(D,B).
    greatgrandparent(A,B):- father(A,C),mother(C,D),mother(D,B).

By contrast, an ILP system that supports predicate invention could again invent a new inv symbol (which again corresponds to parent) to learn a smaller program:

    greatgrandparent(A,B):- inv(A,C),inv(C,D),inv(D,B).
    inv(A,B):- mother(A,B).
    inv(A,B):- father(A,B).

Predicate invention has been shown to help reduce the size of programs, which in turn reduces sample complexity and improves predictive accuracy (Dumančić & Blockeel, 2017; Cropper, 2019; Cropper et al., 2020; Dumančić et al., 2019; Dumančić & Cropper, 2020).

The inductive general game playing (IGGP) problem (Cropper et al., 2020b) is to learn the symbolic rules of games from observations of gameplay, such as learning the rules of connect four. Although predicate invention is not strictly necessary to learn solutions to any of the IGGP problems, it can significantly reduce the size of the solutions, sometimes by several orders of magnitude. For instance, to learn accurate solutions for the connect four game, a learner needs to learn the concept of a line, which in turn requires learning the concepts of horizontal, vertical, and diagonal lines. Since these concepts are used in most of the rules, the ability to discover and reuse such concepts is crucial to learning performance.

Footnote 21: This use of the term semantically equivalent is imprecise. Whether these two programs are strictly equivalent depends on the definition of logical equivalence, for which there are many (Maher, 1988). Moreover, equivalence between the two programs is further complicated because they have different vocabularies (because of the invented predicate symbol). Our use of equivalence is based on the two programs having the same logical consequences for the target predicate symbol grandparent.

To further illustrate the power of predicate invention, imagine learning a droplasts program, which removes the last element of each sublist in a list, e.g. [alice,bob,carol] ↦ [alic,bo,caro]. Given suitable examples and BK, Metagol_ho (Cropper et al., 2020) learns the higher-order program:

    droplasts(A,B):- map(A,B,droplasts1).
    droplasts1(A,B):- reverse(A,C),tail(C,D),reverse(D,B).

To learn this program, Metagol_ho invents the predicate symbol droplasts1, which is used twice in the program: once as a term in the literal map(A,B,droplasts1) and once as a predicate symbol in the literal droplasts1(A,B). This higher-order program uses map to abstract away the manipulation of the list and to avoid the need to learn an explicitly recursive program (recursion is implicit in map).

Now consider learning a double droplasts program (ddroplasts), which extends the droplasts problem so that, in addition to dropping the last element from each sublist, it also drops the last sublist, e.g. [alice,bob,carol] ↦ [alic,bo]. Given suitable examples, metarules, and BK, Metagol_ho learns the program:

    ddroplasts(A,B):- map(A,C,ddroplasts1),ddroplasts1(C,B).
    ddroplasts1(A,B):- reverse(A,C),tail(C,D),reverse(D,B).

This program is similar to the aforementioned droplasts program, but additionally reuses the invented predicate symbol ddroplasts1 in the literal ddroplasts1(C,B).
This program illustrates the power of predicate invention to allow an ILP system to learn substantially more complex programs.

5.4.1 Predicate invention difficulty

Most early attempts at predicate invention were unsuccessful, and, as Table 5 shows, many popular ILP systems do not support it. As Kramer (1995) points out, predicate invention is difficult for at least three reasons:

• When should we invent a new symbol? There must be a reason to invent a new symbol, otherwise we would never invent one.
• How should we invent a new symbol? How many arguments should it have?
• How do we judge the quality of a new symbol? When should we keep an invented symbol?

There are many predicate invention techniques. We briefly discuss some approaches now.

5.4.2 Inverse resolution

Early work on predicate invention was based on the idea of inverse resolution (Muggleton & Buntine, 1988) and specifically W operators. Discussing inverse resolution in depth is beyond the scope of this paper. We refer the reader to the original work of Muggleton and Buntine (1988) or the overview books by Nienhuys-Cheng and Wolf (1997) and De Raedt (2008) for more information. Although inverse resolution approaches could support predicate invention, they never demonstrated completeness, partly because of the lack of a declarative bias to delimit the hypothesis space (Muggleton et al., 2015).

5.4.3 Placeholders

One approach to predicate invention is to predefine invented symbols through mode declarations, which Leban et al. (2008) call placeholders and which Law (2018) calls prescriptive predicate invention. For instance, to invent the parent relation, a suitable modeh declaration would be required, such as:

    modeh(1,inv(person,person)).
However, this placeholder approach is limited because it requires that a user manually specify the arity and argument types of a symbol (Law et al., 2014), which rather defeats the point, or requires generating all possible invented predicates (Evans & Grefenstette, 2018; Evans et al., 2019), which is computationally expensive.

5.4.4 Metarules

Interest in predicate invention has resurged largely due to meta-interpretive learning (MIL) (Muggleton et al., 2014, 2015; Cropper et al., 2020) and the Metagol implementation (Cropper & Muggleton, 2016). Metagol avoids the issues of older ILP systems by using metarules (Section 4.4.2) to define the hypothesis space and in turn reduce the complexity of inventing a new predicate symbol, i.e. Metagol uses metarules to drive predicate invention. As mentioned in Section 4.4.2, a metarule is a higher-order clause. For instance, the chain metarule (P(A,B) ← Q(A,C), R(C,B)) allows Metagol to induce programs such as:

    f(A,B):- tail(A,C),tail(C,B).

This program drops the first two elements from a list. To induce longer clauses, such as to drop the first three elements from a list, Metagol can use the same metarule but can invent a new predicate symbol and then chain their application, such as to induce the program:

    f(A,B):- tail(A,C),inv(C,B).
    inv(A,B):- tail(A,C),tail(C,B).

We could unfold (Tamaki & Sato, 1984) this program to remove the invented symbol to derive the program:

    f(A,B):- tail(A,C),tail(C,D),tail(D,B).

A side-effect of this metarule-driven approach to predicate invention is that problems are forced to be decomposed into smaller problems. For instance, suppose you wanted to learn a program that drops the first four elements of a list, then Metagol could learn the following program, where the invented predicate symbol inv is used twice:

    f(A,B):- inv(A,C),inv(C,B).
    inv(A,B):- tail(A,C),tail(C,B).

To learn this program, Metagol invents the predicate symbol inv and induces a definition for it using the chain metarule.
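The way the chain metarule composes relations, and why unfolding an invented symbol leaves the meaning of a program unchanged, can be illustrated with a small Python sketch. Each dyadic predicate is modelled here as a function from an input list to an output list that returns None when the predicate fails. The `chain` helper and this functional reading are assumptions of the sketch, not a description of how Metagol is implemented.

```python
def tail(xs):
    """tail(A,B): B is A without its first element; fails (None) on an empty or failed input."""
    return xs[1:] if xs else None

def chain(q, r):
    """Chain metarule P(A,B) <- Q(A,C), R(C,B), read functionally."""
    def p(a):
        c = q(a)
        return None if c is None else r(c)
    return p

# inv(A,B):- tail(A,C),tail(C,B).
inv = chain(tail, tail)

# f(A,B):- inv(A,C),inv(C,B).      (the invented symbol is used twice)
f = chain(inv, inv)

def f_unfolded(a):
    # f(A,B):- tail(A,C),tail(C,D),tail(D,E),tail(E,B).
    return tail(tail(tail(tail(a))))

print(f(list("abcdef")))           # drops the first four elements
print(f_unfolded(list("abcdef")))  # same result without the invented symbol
print(f(list("ab")))               # the list is too short, so the program fails
```

The two-clause program with inv never contains a clause with more than two body literals, which is what allows a single small metarule to express it.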
Metagol uses this new predicate symbol in the definition for the target predicate f.

5.4.5 Lifelong learning

The aforementioned techniques for predicate invention are aimed at single-task problems. There are several approaches that invent predicate symbols in a lifelong learning setting.

Dependent learning. Predicate invention can be performed by continually learning programs (meta-learning). For instance, Lin et al. (2014) use a technique called dependent learning to enable Metagol to learn string transformation programs over time. Given a set of 17 string transformation tasks, their learner automatically identifies easier problems, learns programs for them, and then reuses the learned programs to help learn programs for more difficult problems. They use predicate invention to reform the bias of the learner: after a solution is learned, not only is the target predicate added to the BK but also its constituent invented predicates. The authors experimentally show that their multi-task approach performs substantially better than a single-task approach because learned programs are frequently reused. Moreover, they show that this approach leads to a hierarchy of BK composed of reusable programs, where each builds on simpler programs. Figure 4 shows this approach. Subsequent work has extended the approach to handle thousands of tasks (Cropper, 2020).

[Figure 4 image not recovered: two node diagrams labelled "Dependent Learning" and "Independent Learning", with a "Size Bound" column and a "Time Out" region; nodes are numbered by task, 1 to 17.]

Figure 4: This figure is taken from the work of Lin et al. (2014). It shows the programs learned by dependent (left) and independent (right) learning approaches. The size bound column denotes the number of clauses in the induced program. The nodes correspond to programs and the numbers denote the task that the program solves. For the dependent learning approach, the arrows correspond to the calling relationships of the induced programs. For instance, the program to solve task 3 reuses the solution to task 12, which in turn reuses the solution to task 17, which in turn reuses the solution to task 15. Tasks 4, 5, and 16 cannot be solved using an independent learning approach, but can when using a dependent learning approach.

Self-supervised learning. The goal of Playgol (Cropper, 2019) is similar to that of Lin et al. (2014) in that it aims to automatically discover reusable general programs to improve learning performance. Playgol goes one step further by not requiring a large corpus of user-supplied tasks to learn from. Before trying to solve the set of user-supplied tasks, Playgol first plays by randomly sampling its own tasks and trying to solve them, adding any solutions to the BK, which can be seen as a form of self-supervised learning. After playing, Playgol tries to solve the user-supplied tasks by reusing solutions learned whilst playing. For instance, consider the program that Playgol learns for the string transformation task named build_95:

    build_95(A,B):- play_228(A,C),play_136_1(C,B).
    play_228(A,B):- play_52(A,B),uppercase(B).
    play_228(A,B):- skip(A,C),play_228(C,B).
    play_136_1(A,B):- play_9(A,C),mk_uppercase(C,B).
    play_9(A,B):- skip(A,C),mk_uppercase(C,B).
    play_52(A,B):- skip(A,C),copy(C,B).

The solution for this task is difficult to read mainly because it is almost entirely composed of invented predicate symbols. Only the definitions for uppercase, mk_uppercase, skip, and copy are provided as BK; all the others are invented by Playgol. The solution for this task reuses the solution to the self-supervised play task play_228 and the sub-program play_136_1 from the play task play_136, where play_136_1 is invented. The predicate play_228 is a recursive definition that corresponds to the concept of "skip to the first uppercase letter and then copy the letter to the output".
The predicate play_228 reuses the solution for another play task play_52.

5.4.6 (Unsupervised) compression

The aforementioned predicate invention approaches combine invention with induction, where the usefulness of an invented predicate symbol is measured by whether it can help solve a given learning task. However, a predicate symbol that does not help to solve the task immediately might still be useful. For instance, to learn the quicksort algorithm, the learner needs to be able to partition the list given a pivot element and append two lists. If partition and append are not provided in BK, the learner would need to invent them. While both relations are essential for the task, a learner that invents only the partition predicate would deem it useless because it is insufficient on its own to solve the task. Several predicate invention approaches decouple invention from induction and use alternative criteria to judge the usefulness of invented predicates. That criterion is often compression.

Footnote 22: Evaluating the usefulness of invented predicates via their ability to compress a theory goes back to some of the earliest work in ILP by the Duce system (Muggleton, 1987).

Auto-encoding logic programs. Auto-encoding logic programs (ALPs) (Dumančić et al., 2019) invent predicates by simultaneously learning a pair of logic programs: (i) an encoder that maps the examples given as interpretations to new interpretations defined entirely in terms of invented predicates, and (ii) a decoder that reconstructs the original interpretations from the invented ones. The invented interpretations compress the given examples and invent useful predicates by capturing regularities in the data. ALPs, therefore, change the representation of the problem. The most important implication of the approach is that the target programs are easier to express via the invented predicates. The authors experimentally show that learning from the representation invented by ALPs improves the learning performance of generative Markov logic networks (MLN) (Richards & Mooney, 1995). Generative MLNs learn a (probabilistic) logic program that explains all predicates in an interpretation, not a single target predicate. The predicates invented by ALPs therefore aid the learning of all predicates in the BK.

Footnote 23: The head of every clause in the encoder invents a predicate.

Program refactoring. Knorf (Dumančić & Cropper, 2020) pushes the idea of ALPs even further. After learning to solve user-supplied tasks in the lifelong learning setting, Knorf compresses the learnt program by removing redundancies in it. If the learnt program contains invented predicates, Knorf revises them and introduces new ones that would lead to a smaller program. By doing so, Knorf optimises the representation of obtained knowledge. The refactored program is smaller in size and contains less redundancy in clauses, both of which lead to improved performance. The authors experimentally demonstrate that refactoring improves learning performance in lifelong learning. More precisely, Metagol learns to solve more tasks when using the refactored BK, especially when BK is large. Moreover, the authors also demonstrate that Knorf substantially reduces the size of the BK program, reducing the number of literals in a program by 50% or more.

Theory refinement. All the aforementioned approaches are related to theory refinement (Wrobel, 1996), which aims to improve the quality of a theory. Theory revision approaches (Adé et al., 1994; Richards & Mooney, 1995) revise a program so that it entails missing answers or does not entail incorrect answers. Theory compression (De Raedt et al., 2008) approaches select a subset of clauses such that the performance is minimally affected with respect to certain examples. Theory restructuring changes the structure of a logic program to optimise its execution or its readability (Wrobel, 1996).
By contrast, the approaches described in the previous section aim to improve the quality of a theory by inventing new predicate symbols.

5.4.7 Connection to representation learning

Predicate invention changes the representation of a problem by introducing new predicate symbols. Predicate invention is thus closely related to representation learning (or feature learning) (Bengio et al., 2013), which has witnessed tremendous progress over the past decade in deep learning (LeCun et al., 2015). The central idea of representation learning coincides with the one behind predicate invention: improving learning performance by changing the representation of a problem. In contrast to the symbolic nature of ILP, representation learning operates on numerical principles and discovers data abstractions in tabular form. Moreover, a large subfield of representation learning focuses on representing structured data, including relational data, in such tabular form.

Despite strong connections, there is little interaction between predicate invention and representation learning. The main challenge in transferring the ideas from representation learning to predicate invention is their different operating principles. It is not clear how symbolic concepts can be invented through the table-based learning principles that current representation learning approaches use. Only a few approaches (Dumančić & Blockeel, 2017; Dumančić et al., 2019; Sourek et al., 2018) start from the core ideas in representation learning, strip them of numerical principles, and re-invent them from symbolic principles. A more common approach is to transform relational data into a propositional tabular form that can be used as an input to a neural network (Dash et al., 2018; Kaur et al., 2019, 2020). A disadvantage of the latter approaches is that they only apply to propositional learning tasks, not to first-order program induction tasks where infinite domains are impossible to propositionalise.
Approaches that force neural networks to invent symbolic constructs, such as ∂ILP and neural theorem provers (Rocktäschel & Riedel, 2017), do so by sacrificing the expressivity of logic (they can only learn short Datalog programs).

6. ILP systems

We now describe in detail four ILP systems: Aleph (Srinivasan, 2001), TILDE (Blockeel & De Raedt, 1998), ASPAL (Corapi et al., 2011), and Metagol (Cropper & Muggleton, 2016). It is important to note that these systems are not necessarily the best, nor the most popular, but use considerably different approaches and are relatively simple to explain.

6.1 Aleph

Progol (Muggleton, 1995) is arguably the most influential ILP system, having influenced many systems (Srinivasan, 2001; Ray, 2009; Ahlgren & Yuen, 2013), which in turn have inspired many other ILP systems (Katzouris et al., 2015, 2016). Aleph (Srinivasan, 2001) is an ILP system based on Progol. We discuss Aleph, rather than Progol, because the implementation, written in Prolog, is easier to use and the manual is more detailed.

6.1.1 Aleph setting

The Aleph problem setting is:

Given:
- A set of mode declarations M
- BK in the form of a normal program
- E+ positive examples represented as a set of facts
- E− negative examples represented as a set of facts

Return: A normal program hypothesis H such that:
- H is consistent with M
- ∀e ∈ E+, H ∪ B ⊨ e (i.e. is complete)
- ∀e ∈ E−, H ∪ B ⊭ e (i.e. is consistent)

We will discuss what "H is consistent with M" means later in this section.

6.1.2 Aleph algorithm

To find a hypothesis, Aleph uses the following set covering approach:

1. Select a positive example to be generalised. If none exists, stop; otherwise proceed to the next step.
2. Construct the most specific clause (the bottom clause) (Muggleton, 1995) that entails the selected example and that is consistent with the mode declarations.
3. Search for a clause more general than the bottom clause and that has the best score.
4. Add the clause to the current hypothesis and remove all the examples made redundant by it. Return to step 1.

We discuss the basic approaches to steps 2 and 3.

Step 2: bottom clause construction

The purpose of constructing a bottom clause is to bound the search in step 3. The bottom clause is the most specific clause that explains a single example. Having constructed a bottom clause, Aleph can ignore any clauses that are not more general than it. In other words, Aleph only considers clauses which are generalisations of the bottom clause, which must all entail the example. We use the bottom clause definition provided by De Raedt (2008):

Definition 4 (Bottom clause). Let B be a clausal hypothesis and C be a clause. Then the bottom clause ⊥(C) is the most specific clause such that:

    B ∪ ⊥(C) ⊨ C

Example 8 (Bottom clause). To illustrate bottom clauses, we use an example from De Raedt (2008). Let B be:

    polygon(A):- rectangle(A).
    rectangle(A):- square(A).

And let C be:

    pos(A):- red(A),square(A).

Then:

    ⊥(C) = pos(A):- red(A),square(A),rectangle(A),polygon(A).

This bottom clause contains the literal rectangle(A) because it is implied by square(A). The inclusion of rectangle(A) in turn implies the inclusion of polygon(A).

Any clause that is not more general than the bottom clause cannot entail C and so can be ignored. For instance, we can ignore the clause pos(A):- green(A) because it is not more general than ⊥(C).

We do not describe how to construct the bottom clause. See the paper by Muggleton (1995) or the book of De Raedt (2008) for good explanations. However, it is important to understand that, in general, a bottom clause can have infinite cardinality.
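For the simple BK of Example 8, the bottom clause can be reproduced by forward chaining over the body of C. The sketch below is a deliberate simplification: every BK rule has a single body literal over the same variable, so literals are represented by predicate names only, and the `bottom_clause_body` helper is hypothetical. It is not Aleph's saturation procedure, which is driven by mode declarations.

```python
def bottom_clause_body(body, rules):
    """Saturate a clause body with every literal the BK derives from it.

    `rules` maps a body predicate to the head predicates it implies,
    e.g. {"square": ["rectangle"]} encodes rectangle(A):- square(A).
    """
    saturated = list(body)
    queue = list(body)
    while queue:
        literal = queue.pop(0)
        for implied in rules.get(literal, []):
            if implied not in saturated:
                saturated.append(implied)
                queue.append(implied)
    return saturated

# B:  polygon(A):- rectangle(A).    rectangle(A):- square(A).
rules = {"rectangle": ["polygon"], "square": ["rectangle"]}

# C:  pos(A):- red(A),square(A).
body = bottom_clause_body(["red", "square"], rules)
print("pos(A):- " + ",".join(f"{p}(A)" for p in body) + ".")
```

The loop terminates here because the BK is finite and function-free; with BK that can keep deriving new terms, the same saturation would never stop, which is the infinite-cardinality issue noted above.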
To restrict the construction of the bottom clause (and in turn the hypothesis space), Aleph uses mode declarations (Section 4.4.1). Having constructed a bottom clause, Aleph then searches for generalisations of ⊥(C) in Step 3. In this way, Aleph can be seen as a bottom-up approach because it starts with the examples and tries to generalise them.

Step 3: clause search

In Step 3, Aleph is given a bottom clause of the form h:- b1, ..., bn and searches for generalisations of this clause. The importance of constructing the bottom clause is that it bounds the search space from below (the bottom clause). Aleph starts with the most general clause h:- and tries to specialise it by adding literals to it, which it selects from the bottom clause, or by instantiating variables. In this way, Aleph performs a top-down search. Each specialisation of a clause is called a refinement. Properties of refinement operators (Shapiro, 1983) are well-studied in ILP (Nienhuys-Cheng & Wolf, 1997; De Raedt, 2008), but are beyond the scope of this paper. The key thing to understand is that Aleph's search is bounded from above (the most general clause) and below (the most specific clause).

[Figure 5 content, one level of the lattice per group, as recovered from the figure]

Most general hypothesis:
    pos(A):-

    pos(A):- red(A)
    pos(A):- square(A)
    pos(A):- rectangle(A)
    pos(A):- polygon(A)

    pos(A):- red(A),square(A).
    pos(A):- red(A),rectangle(A).
    pos(A):- red(A),polygon(A).
    pos(A):- square(A),rectangle(A).
    pos(A):- square(A),polygon(A).
    pos(A):- rectangle(A),polygon(A).

    pos(A):- red(A),square(A),rectangle(A).
    pos(A):- red(A),square(A),polygon(A).
    pos(A):- square(A),rectangle(A),polygon(A).

Most specific hypothesis:
    pos(A):- red(A),square(A),rectangle(A),polygon(A).

Figure 5: Aleph bounds the hypothesis space from above (the most general hypothesis) and below (the most specific hypothesis). Aleph starts the search from the most general hypothesis and specialises it (by adding literals from the bottom clause) until it finds the best hypothesis.
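A bounded general-to-specific search of this kind can be sketched in a few lines of Python. The sketch enumerates every subset of the bottom clause body from Example 8, shortest first, and keeps the clause with the best score, taken here to be the number of positive examples covered minus the number of negative examples covered. The examples, the set-based coverage test, and the helper names are all assumptions made for illustration; Aleph's actual refinement operator also instantiates variables and respects mode declarations.

```python
from itertools import combinations

BOTTOM_BODY = ["red", "square", "rectangle", "polygon"]

def generalisations(bottom_body):
    """Yield clause bodies from the most general (empty) to the full bottom clause."""
    for size in range(len(bottom_body) + 1):
        for body in combinations(bottom_body, size):
            yield body

def covers(body, example):
    """A clause covers an example if every body literal holds for it."""
    return all(literal in example for literal in body)

def score(body, pos, neg):
    """Positives covered minus negatives covered."""
    return sum(covers(body, e) for e in pos) - sum(covers(body, e) for e in neg)

# Hypothetical examples, each described by the set of properties that hold for it.
pos = [{"red", "square", "rectangle", "polygon"},
       {"red", "square", "rectangle", "polygon"}]
neg = [{"square", "rectangle", "polygon"},  # a non-red square
       {"red", "polygon"}]                  # a red polygon that is not a square

# max() keeps the first best clause, so shorter clauses win ties.
best = max(generalisations(BOTTOM_BODY), key=lambda b: score(b, pos, neg))

print(len(list(generalisations(BOTTOM_BODY))))  # size of the bounded search space
print("pos(A):- " + ",".join(f"{p}(A)" for p in best) + ".")
```

Because every candidate is a subset of the bottom clause body, the search space has at most 2^n clauses for a bottom clause with n body literals, which is why Aleph additionally bounds clause length.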
Figure 5 illustrates the search space of Aleph when given the bottom clause ⊥(C).

Aleph performs a bounded breadth-first search to enumerate shorter clauses before longer ones, although a user can easily change the search strategy. The search is bounded by several parameters, such as a maximum clause size and a maximum proof depth. Aleph evaluates (assigns a score to) each clause in the search. Aleph's default evaluation function is coverage, defined as P − N, where P and N are the numbers of positive and negative examples respectively entailed by the clause. Aleph comes with 13 evaluation functions, such as entropy and compression.

Having found the best clause, Aleph adds the clause to the hypothesis, removes all the positive examples covered by the new hypothesis, and returns to Step 1. We do not have enough space to detail the whole clause search mechanism, and how the score is computed, so we refer the reader to the Progol tutorial by Muggleton and Firth (2001) for a more detailed introduction.

24. Progol, by contrast, uses an A* search (Muggleton, 1995).

6.1.3 Discussion

Advantages. Aleph is one of the most popular ILP systems because (i) it has a solid and easily available implementation with many options, and (ii) it has good empirical performance. Moreover, it is a single Prolog file, which makes it easy to download and use. Because it uses a bottom clause to bound the search, Aleph is also very efficient at identifying relevant constant symbols that may appear in a hypothesis, which is not the case for pure top-down approaches. Aleph also supports many other features, such as numerical reasoning, inducing constraints, and allowing user-supplied cost functions.

Disadvantages. Because it is based on inverse entailment, Aleph struggles to learn recursive programs and optimal programs and does not support predicate invention.
Another problem with Aleph is that it uses many parameters, such as parameters that change the search strategy when generalising a bottom clause (step 3) and parameters that change the structure of learnable programs (such as limiting the number of literals in the bottom clause). These parameters can greatly influence learning performance. Even for experts, it is non-trivial to find a suitable set of parameters for a problem.

6.2 TILDE

TILDE (Blockeel & De Raedt, 1998) is a first-order generalisation of decision trees, and specifically of the C4.5 (Quinlan, 1993) learning algorithm. TILDE learns from interpretations, rather than from entailment as Aleph does, and is an instance of the top-down methodology.

6.2.1 TILDE setting

The TILDE problem setting is:

Given:
- A set of classes C
- A set of mode declarations
- A set of examples E represented as a set of interpretations
- BK in the form of a definite program

Return: A (normal) logic program hypothesis H such that:
- ∀e ∈ E, H ∧ B ∧ e ⊨ c, c ∈ C, where c is the class of the example e
- ∀e ∈ E, H ∧ B ∧ e ⊭ c′, c′ ∈ C − {c}

6.2.2 TILDE algorithm

TILDE behaves almost exactly the same as C4.5 limited to binary attributes, meaning that it uses the same heuristics and pruning techniques. What TILDE does differently is the generation of candidate splits. Whereas C4.5 generates candidates as attribute-value pairs (or value inequalities in the case of continuous attributes), TILDE uses conjunctions of literals. The conjunctions are explored gradually from the most general to the most specific ones, where θ-subsumption (Section 2) is used as an ordering.

To find a hypothesis, TILDE employs a divide-and-conquer strategy recursively repeating the following steps:

25.
Courtesy of Fabrizio Riguzzi and Paolo Niccolò Giubelli, Aleph is now available as a SWIPL package at https: // / pack / list?p = aleph

• if all examples belong to the same class, create a leaf predicting that class
• for each candidate conjunction conj, find the normalised information gain when splitting on conj
  - if no candidate provides information gain, turn the previous node into a leaf predicting the majority class
• create a decision node n that splits on the candidate conjunction with the highest information gain
• recursively split on the subsets of data obtained by the splits and add those nodes as children of n

Example 9 (Machine repair example (Blockeel & De Raedt, 1998)). To illustrate TILDE's learning procedure, consider the following example. Each example is an interpretation (a set of facts) and it describes (i) a machine with parts that are worn out, and (ii) an action an engineer should perform: fix the machine, send it back to the manufacturer, or nothing if the machine is ok. These actions are the classes to predict.

    E = E1: {worn(gear). worn(chain). class(fix).}
        E2: {worn(engine). worn(chain). class(sendback).}
        E3: {worn(wheel). class(sendback).}
        E4: {class(ok).}

Background knowledge contains information on which parts are replaceable and which are not:

    B = replaceable(gear).
        replaceable(chain).
        not_replaceable(engine).
        not_replaceable(wheel).

Like any top-down approach, TILDE starts with the most general program (an empty program) and gradually refines it (specialises it) until satisfactory performance is reached. To refine the program, TILDE relies on mode declarations which define conjunctions that can be added to the current clause.

Assume the mode declarations:

    mode(replaceable(+X)).
    mode(not_replaceable(+X)).
    mode(worn(+X)).

Each mode declaration forms a candidate split:

    worn(X).
    replaceable(X).
    not_replaceable(X).
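The comparison of these three candidate splits can be sketched in Python. The sketch rests on assumptions that go beyond the text: each candidate is read as an existential query over an example together with the BK (is there some X for which the test succeeds?), and plain information gain is computed rather than the normalised variant. The dictionary encoding of examples and all names are illustrative only.

```python
from collections import Counter
from math import log2

# Examples from Example 9: (set of worn parts, class).
EXAMPLES = {
    "E1": ({"gear", "chain"}, "fix"),
    "E2": ({"engine", "chain"}, "sendback"),
    "E3": ({"wheel"}, "sendback"),
    "E4": (set(), "ok"),
}
REPLACEABLE = {"gear", "chain"}
NOT_REPLACEABLE = {"engine", "wheel"}

# Assumption: each test asks whether SOME X satisfies the literal, given the
# example's facts and the background knowledge.
TESTS = {
    "worn(X)": lambda worn: len(worn) > 0,
    "replaceable(X)": lambda worn: len(REPLACEABLE) > 0,
    "not_replaceable(X)": lambda worn: len(NOT_REPLACEABLE) > 0,
}


def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    total = len(labels)
    return -sum(c / total * log2(c / total) for c in Counter(labels).values())


def information_gain(test):
    """Entropy of all labels minus the weighted entropy after the split."""
    labels = [cls for _, cls in EXAMPLES.values()]
    yes = [cls for worn, cls in EXAMPLES.values() if test(worn)]
    no = [cls for worn, cls in EXAMPLES.values() if not test(worn)]
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in (yes, no) if part)
    return entropy(labels) - remainder


gains = {name: information_gain(test) for name, test in TESTS.items()}
best = max(gains, key=gains.get)
print(best, round(gains[best], 3))  # prints: worn(X) 0.811
```

Under this reading, replaceable(X) and not_replaceable(X) succeed in every example because the BK is shared, so they separate nothing and have zero gain, whereas worn(X) isolates E4.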
The conjunction worn(X) yields the highest information gain (calculated as with propositional C4.5) and is set as the root of the tree. For details about C4.5 and information gain, we refer the reader to the excellent machine learning book by Mitchell (1997).

[Figure 6: TILDE learns tree-shaped (normal) programs. Clauses in the program correspond to paths along the tree. The tree tests worn(X) at the root and not_replaceable(X) below it, with leaves sendback, fix, and ok.]

TILDE proceeds by recursively repeating the same procedure over both outcomes of the test: when worn(X) is true and when it is false. When the root test fails, the dataset contains a single example (E4); TILDE forms a branch by creating the leaf predicting the class ok. When the root test succeeds, not all examples (E1, E2, E3) belong to the same class. TILDE thus refines the root node further:

    worn(X), worn(X).
    worn(X), replaceable(X).
    worn(X), not_replaceable(X).
    worn(X), worn(Y).
    worn(X), replaceable(Y).
    worn(X), not_replaceable(Y).

The candidate refinement worn(X), not_replaceable(X) perfectly divides the remaining examples and thus not_replaceable(X) is added as the subsequent test. All examples are classified correctly, and thus the learning stops.

The final TILDE tree is (illustrated in Figure 6):

    class(X,sendback):- worn(X),not_replaceable(X),!.
    class(X,fix):- worn(X),!.
    class(X,ok).

Note the usage of the cut (!) operator, which is essential to ensure that only one branch of the decision tree holds for each example.

6.2.3 Discussion

Advantages. An interesting aspect of TILDE is that it learns normal logic programs (which include negation) instead of definite logic programs. This means that TILDE can learn more expressive programs than the majority of ILP systems. To match its expressivity, a system needs to support negation in the body.
Another advantage of TILDE is that, compared to other ILP systems, it supports both categorical and numerical data. Indeed, TILDE is an exception among ILP systems, which usually struggle to handle numerical data. At any refinement step, TILDE can add a literal of the form <(X,V), or equivalently X < V, with V being a value. TILDE's stepwise refinement keeps the number of inequality tests tractable.

Disadvantages. Although TILDE learns normal programs, it requires them to be in the shape of a tree and does not support recursion. Furthermore, TILDE inherits the limitations of top-down systems, such as generating many needless candidates. Another weakness of TILDE is the need for lookahead. Lookahead is needed when a single literal is useful only in conjunction with another literal. Consider, for instance, that the machine repair scenario has a relation number_of_components and the target rule that a machine needs to be fixed when a part consisting of more than three components is worn out:

    class(fix):- worn(X),number_of_components(X,Y),Y > 3.

To find this clause, TILDE would first refine the clause:

    class(fix):- worn(X).

into:

    class(fix):- worn(X),number_of_components(X,Y).

However, this candidate clause would be rejected as it yields no information gain (every example covered by the first clause is also covered by the second clause). The introduction of a literal with the number_of_components predicate is only helpful if it is introduced together with the inequality related to the second argument of the literal. Informing TILDE about this dependency is known as lookahead.

6.3 ASPAL

ASPAL (Corapi et al., 2011) was one of the first meta-level ILP systems, which directly influenced other ILP systems, notably ILASP. ASPAL is one of the simplest ILP systems to explain. It uses the mode declarations to build every possible clause that could be in a hypothesis. It adds a flag to each clause indicating whether the clause should be in a hypothesis.
It then formulates the problem of deciding which flags to turn on as an ASP problem.

6.3.1 ASPAL setting

The ASPAL problem setting is:

Given:
- A set of mode declarations M
- BK in the form of a normal program
- E+ positive examples represented as a set of facts
- E− negative examples represented as a set of facts
- A penalty function γ

Return: A normal program hypothesis H such that:
- H is consistent with M
- ∀e ∈ E+, H ∪ B ⊨ e (i.e. is complete)
- ∀e ∈ E−, H ∪ B ⊭ e (i.e. is consistent)
- The penalty function γ is minimal

6.3.2 ASPAL algorithm

ASPAL encodes an ILP problem as a meta-level ASP program. The answer sets of this meta-level program are solutions to the ILP task. The ASPAL algorithm is one of the simplest in ILP:

1. Generate all possible rules consistent with the given mode declarations. It assigns each rule a unique identifier and adds that as an abducible (guessable) literal in each rule.
2. Use an ASP solver to find a minimal subset of the rules.

Step 1 is a little more involved, and we explain why below. Also, similar to Aleph, ASPAL has several input parameters that constrain the size of the hypothesis space, such as the maximum number of body literals and the maximum number of clauses. Step 2 uses an ASP optimisation statement to learn a program with a minimal penalty.

6.3.3 ASPAL example

Example 10 (ASPAL). To illustrate ASPAL, we slightly modify the example from Corapi et al. (2011). We also ignore the penalty statement. ASPAL is given as input B, E+, E−, and M:

    B = bird(alice).
        bird(betty).
        can(alice,fly).
        can(betty,swim).
        ability(fly).
        ability(swim).

    E+ = { penguin(betty). }

    E− = { penguin(alice). }

    M = { modeh(penguin(+bird)).
          modeb(*,notcan(+bird, }

Given these modes, the possible rules are:

    penguin(X):- bird(X).
    penguin(X):- bird(X), not can(X,fly).
    penguin(X):- bird(X), not can(X,swim).
    penguin(X):- bird(X), not can(X,swim), not can(X,fly).
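The selection problem that ASPAL hands to the ASP solver can be imitated on this tiny example by brute force. The Python sketch below tries subsets of the four candidate rules, smallest first, and returns the first subset that derives penguin(betty) but not penguin(alice). It is only a stand-in under stated assumptions: no ASP solver is involved, the penalty is taken to be the number of rules, and the tuple encoding of rules and the names derived and solve are invented for illustration.

```python
from itertools import combinations

BIRDS = ["alice", "betty"]
CAN = {("alice", "fly"), ("betty", "swim")}

# Each candidate rule penguin(X):- bird(X), not can(X,a1), ... is stored as
# the tuple of abilities that appear under negation in its body.
CANDIDATES = [(), ("fly",), ("swim",), ("swim", "fly")]


def derived(rules):
    """Birds classified as penguins by at least one rule in `rules`."""
    return {x for x in BIRDS
            for negated in rules
            if all((x, a) not in CAN for a in negated)}


def solve(positives, negatives):
    """Return a smallest rule subset that is complete and consistent."""
    for size in range(len(CANDIDATES) + 1):
        for rules in combinations(CANDIDATES, size):
            covered = derived(rules)
            if positives <= covered and not (negatives & covered):
                return list(rules)
    return None  # no subset of the candidates solves the task


hypothesis = solve({"betty"}, {"alice"})
for negated in hypothesis:
    body = ["bird(X)"] + [f"not can(X,{a})" for a in negated]
    print("penguin(X):- " + ", ".join(body) + ".")
# prints: penguin(X):- bird(X), not can(X,fly).
```

Enumerating subsets explicitly is exponential in the number of candidate rules, which already hints at why precomputing every rule limits scalability.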
ASPAL generates skeleton rules which replace constants with variables and adds an extra literal to each rule as an abducible literal:

    penguin(X):- bird(X), rule(r1).
    penguin(X):- bird(X), not can(X,C1), rule(r2,C1).
    penguin(X):- bird(X), not can(X,C1), not can(X,C2), rule(r3,C1,C2).

ASPAL forms a meta-level ASP program from these rules that is passed to an ASP solver:

    bird(alice).
    bird(betty).
    can(alice,fly).
    can(betty,swim).
    ability(fly).
    ability(swim).
    penguin(X):- bird(X), rule(r1).
    penguin(X):- bird(X), not can(X,C1), rule(r2,C1).
    penguin(X):- bird(X), not can(X,C1), not can(X,C2), rule(r3,C1,C2).
    0 {rule(r1),rule(r2,fly),rule(r2,swim),rule(r3,fly,swim)} 4.
    goal :- penguin(betty), not penguin(alice).
    :- not goal.

The key statement in this meta-level program is:

    0 {rule(r1),rule(r2,fly),rule(r2,swim),rule(r3,fly,swim)} 4.

This statement is a choice rule, which states that none or at most four of the literals {rule(r1), rule(r2,fly), rule(r2,swim), rule(r3,fly,swim)} could be true. The job of the ASP solver is to determine which of those literals should be true, which corresponds to an answer set for this program:

    rule(r2,c(fly)).

Which is translated to a program:

    penguin(A):- not can(A,fly).

6.3.4 Discussion

Advantages. A major advantage of ASPAL is its sheer simplicity, which has inspired other approaches, notably ILASP. It also learns optimal programs by employing ASP optimisation constraints.

Disadvantages. The main limitation of ASPAL is scalability. It precomputes every possible rule in a hypothesis, which is infeasible on all but trivial problems. For instance, when learning game rules from observations (Cropper et al., 2020b), ASPAL performs poorly for this reason.

6.4 Metagol

An interpreter is a program that evaluates (interprets) programs. A meta-interpreter is an interpreter written in the same language that it evaluates.
Metagol (Muggleton et al., 2015; Cropper & Muggleton, 2016; Cropper et al., 2020) is a form of ILP based on a Prolog meta-interpreter.

6.4.1 Metagol setting

The Metagol problem setting is:

Given:
- A set of metarules M
- BK in the form of a normal program
- E+ positive examples represented as a set of facts
- E− negative examples represented as a set of facts

Return: A definite program hypothesis H such that:
- ∀e ∈ E+, H ∪ B ⊨ e (i.e. is complete)
- ∀e ∈ E−, H ∪ B ⊭ e (i.e. is consistent)
- ∀h ∈ H, ∃m ∈ M such that h = mθ, where θ is a substitution that grounds all the existentially quantified variables in m

The last condition ensures that a hypothesis is an instance of the given metarules. It is this condition that enforces the strong inductive bias in Metagol.

6.4.2 Metagol algorithm

Metagol uses the following procedure to find a hypothesis:

1. Select a positive example (an atom) to generalise. If none exists, stop, otherwise proceed to the next step.
2. Try to prove the atom by:
   (a) using given BK or an already induced clause
   (b) unifying the atom with the head of a metarule (Section 4.4.2), binding the variables in a metarule to symbols in the predicate and constant signatures, saving the substitutions, and then proving the body of the metarule through meta-interpretation (by treating the body atoms as examples and applying step 2 to them)
3. After proving all the positive examples, check the hypothesis against the negative examples. If the hypothesis does not entail any negative example, stop; otherwise backtrack to a choice point at step 2 and continue.

In other words, Metagol induces a logic program by constructing a proof of the positive examples. It uses metarules to guide the proof search. After proving all the examples, the induced program is complete by construction. Metagol checks the consistency of the induced program against the negative examples.
If the program is inconsistent, Metagol backtracks to explore different proofs (programs).

Metarules are fundamental to Metagol. For instance, the chain metarule is:

    P(A,B):- Q(A,C), R(C,B).

The letters P, Q, and R denote second-order variables. Metagol internally represents metarules as Prolog facts of the form:

    metarule(Name,Subs,Head,Body).

Here Name denotes the metarule name, Subs is a list of variables that Metagol should find substitutions for, and Head and Body are list representations of a clause. For example, the internal representation of the chain metarule is:

    metarule(chain,[P,Q,R], [P,A,B], [[Q,A,C],[R,C,B]]).

Metagol represents substitutions, which we will call metasubs, as Prolog facts of the form:

    sub(Name,Subs).

Here Name is the name of the metarule and Subs is a list of substitutions. For instance, binding the variables P, Q, and R with second, tail, and head respectively in the chain metarule would lead to the metasub sub(chain,[second,tail,head]) and the clause:

    second(A,B):- tail(A,C),head(C,B).

To learn optimal programs, Metagol enforces a bound on the program size (the number of metasubs). Metagol uses iterative deepening to search for hypotheses: if no hypothesis is found at depth d, the search continues at depth d + 1. At each iteration d, Metagol introduces d − 1 new predicate symbols and is allowed to use d clauses. New predicate symbols are formed by taking the name of the task and adding underscores and numbers. For example, if the task is f and the depth is 4 then Metagol will add the predicate symbols f_1, f_2, and f_3 to the predicate signature.

6.4.3 Metagol example

Example 11 (Kinship example). To illustrate Metagol, suppose you have the following BK:

    B = mother(ann,amy).
        mother(ann,andy).
        mother(amy,amelia).
        mother(amy,bob).
        mother(linda,gavin).
        father(steve,amy).
        father(steve,andy).
        father(andy,spongebob).
        father(gavin,amelia).

And the following metarules represented in Prolog:

    metarule(ident,[P,Q], [P,A,B], [[Q,A,B]]).
    metarule(chain,[P,Q,R], [P,A,B], [[Q,A,C],[R,C,B]]).
We can call Metagol with lists of positive (E+) and negative (E−) examples:

    E+ = grandparent(ann,amelia).
         grandparent(steve,amelia).
         grandparent(steve,spongebob).
         grandparent(linda,amelia).

    E− = { grandparent(amy,amelia). }

In Step 1, Metagol selects an atom (an example) to generalise. Suppose Metagol selects grandparent(ann,amelia).

In Step 2a, Metagol tries to prove this atom using the BK or an already induced clause. Since grandparent is not part of the BK and Metagol has not yet induced any clauses, this step fails.

In Step 2b, Metagol tries to prove this atom using a metarule. Metagol can, for instance, unify the atom with the head of the ident metarule to form the clause:

    grandparent(ann,amelia):- Q(ann,amelia).

Metagol saves a metasub for this clause:

    sub(ident,[grandparent,Q])

Note that the symbol Q in this metasub is still a variable.

Metagol then recursively tries to prove the atom Q(ann,amelia). Since there is no Q such that Q(ann,amelia) is true, this step fails.

Because the ident metarule failed, Metagol removes the metasub and backtracks to try a different metarule. Metagol unifies the atom with the chain metarule to form the clause:

    grandparent(ann,amelia):- Q(ann,C),R(C,amelia).

Metagol saves a metasub for this clause:

    sub(chain,[grandparent,Q,R])

Metagol then recursively tries to prove the atoms Q(ann,C) and R(C,amelia). Suppose the recursive call to prove Q(ann,C) succeeds by substituting Q with mother to form the atom mother(ann,amy). This successful substitution binds Q in the metasub to mother and binds C to amy, which is propagated to the other atom, which now becomes R(amy,amelia). Metagol also proves this second atom by substituting R with mother to form the atom mother(amy,amelia). The proof is now complete and the metasub is now:

    sub(chain,[grandparent,mother,mother])

This metasub essentially means that Metagol has induced the clause:

    grandparent(A,B):- mother(A,C),mother(C,B).
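The step from a metasub to a clause is purely mechanical, as the following Python sketch illustrates. It mirrors the list representation of metarules described earlier, but the tuple layout and the function name apply_metasub are assumptions made for this illustration and are not Metagol's actual Prolog code.

```python
def apply_metasub(metarule, substitution):
    """Instantiate the second-order variables of a metarule.

    metarule     -- (second_order_vars, head, body), with atoms given as
                    lists of the form [Predicate, Arg1, Arg2]
    substitution -- predicate symbols, in the order of second_order_vars
    """
    variables, head, body = metarule
    binding = dict(zip(variables, substitution))

    def show(atom):
        predicate, *args = atom
        return f"{binding[predicate]}({','.join(args)})"

    return f"{show(head)}:- {','.join(show(a) for a in body)}."


# The two metarules of Example 11, in a list form close to Metagol's.
IDENT = (["P", "Q"], ["P", "A", "B"], [["Q", "A", "B"]])
CHAIN = (["P", "Q", "R"], ["P", "A", "B"], [["Q", "A", "C"], ["R", "C", "B"]])

print(apply_metasub(CHAIN, ["grandparent", "mother", "mother"]))
# prints: grandparent(A,B):- mother(A,C),mother(C,B).
```

Because a hypothesis is fully determined by its metasubs, the search only needs to store and backtrack over these small substitution lists rather than whole clauses.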
After proving the example, Metagol moves to Step 1 to pick another example. Suppose it picks the example grandparent(steve,amelia). Then in Step 2 Metagol tries to generalise this atom. In Step 2a, Metagol tries to prove this atom using the BK, which again clearly fails, and then tries to prove this atom using an already induced clause. Since grandparent(steve,amelia):- mother(steve,C),mother(C,amelia) fails, this step fails. In Step 2b, Metagol tries to prove this atom again using a metarule. Metagol can again use the chain metarule but with different substitutions to form the metasub:

    sub(chain,[grandparent,father,mother])

This metasub corresponds to the clause:

    grandparent(A,B):- father(A,C),mother(C,B).

Metagol has now proven the first two examples by inducing the clauses:

    grandparent(A,B):- mother(A,C),mother(C,B).
    grandparent(A,B):- father(A,C),mother(C,B).

If given no bound on the program size, then Metagol would prove the other two examples the same way by inducing two more clauses to finally form the program:

    grandparent(A,B):- mother(A,C),mother(C,B).
    grandparent(A,B):- father(A,C),mother(C,B).
    grandparent(A,B):- father(A,C),father(C,B).
    grandparent(A,B):- mother(A,C),father(C,B).

In practice, however, Metagol would not learn this program. It would induce the following program:

    grandparent(A,B):- grandparent_1(A,C),grandparent_1(C,B).
    grandparent_1(A,B):- father(A,B).
    grandparent_1(A,B):- mother(A,B).

In this program, the symbol grandparent_1 is invented and corresponds to the parent relation. However, it is difficult to concisely illustrate predicate invention in this example. We, therefore, illustrate predicate invention in Metagol with an even simpler example.

Example 12 (Predicate invention). Suppose we have the single positive example:

    E+ = { f([i,l,p],p). }

Also suppose that we only have the chain metarule and the background relations head and tail.
Given this input, in Step 2b, Metagol will try to use the chain metarule to prove the example. However, using only the given BK and metarules, the only programs that Metagol can construct are combinations of the four clauses:

    f(A,B):- head(A,C),head(C,B).
    f(A,B):- head(A,C),tail(C,B).
    f(A,B):- tail(A,C),tail(C,B).
    f(A,B):- tail(A,C),head(C,B).

No combination of these clauses can prove the example, so Metagol must use predicate invention to learn a solution.

To use predicate invention, Metagol will try to prove the example using the chain metarule, which will lead to the construction of the program:

    f([i,l,p],p):- Q([i,l,p],C),R(C,p).

Metagol would save a metasub for this clause:

    sub(chain,[f,Q,R]).

Metagol will then try to recursively prove both Q([i,l,p],C) and R(C,p). To prove Q([i,l,p],C), Metagol will find that it cannot prove it using a relation in the BK, so it will try to invent a new predicate symbol, which leads to the new atom f_1([i,l,p],C) and the program:

    f([i,l,p],p):- f_1([i,l,p],C),R(C,p).

Note that this binds Q in the metasub to f_1.

Metagol then tries to prove the f_1([i,l,p],C) and R(C,p) atoms. To prove f_1([i,l,p],C), Metagol could use the chain metarule to form the clause:

    f_1([i,l,p],C):- Q2([i,l,p],D),R2(D,C).

Metagol would save another metasub for this clause:

    sub(chain,[f_1,Q2,R2]).

Metagol then tries to prove the Q2([i,l,p],D) and R2(D,C) atoms. Metagol can prove Q2([i,l,p],D) by binding Q2 to tail so that D is bound to [l,p]. Metagol can then prove R2([l,p],C) by binding R2 to tail so that C is bound to [p]. Remember that the binding of variables is propagated through the program, so the atom R(C,p) now becomes R([p],p). Metagol then tries to prove the remaining atom R([p],p), which it can by binding R to head. The proof of all the atoms is now complete and the final metasubs are:

    sub(chain,[f,f_1,head]).
    sub(chain,[f_1,tail,tail]).
These metasubs correspond to the program:

    f(A,B):- f_1(A,C),head(C,B).
    f_1(A,B):- tail(A,C),tail(C,B).

6.4.4 Discussion

Advantages. Metagol supports learning recursive programs and, as far as we are aware, is the only system that supports automatic predicate invention. Because it uses iterative deepening over the program size, it is guaranteed to learn the smallest program. Because it uses metarules, Metagol can tightly restrict the hypothesis space, which means that it is extremely efficient at finding solutions. Another advantage is in terms of the implementation. The basic Metagol implementation is less than 100 lines of Prolog code. This succinctness makes Metagol extremely easy to adapt, such as to add negation (Siebers & Schmid, 2018), to add types (Morel et al., 2019), to learn higher-order programs (Cropper et al., 2020), to learn efficient programs (Cropper & Muggleton, 2015, 2019), and to combine with Bayesian inference (Muggleton et al., 2013).

Disadvantages. As mentioned in Section 4.4.2, deciding which metarules to use for a given task is a major open problem. For some tasks, such as string transformations, it is relatively straightforward to choose a suitable set of metarules because one already knows the general form of hypotheses. However, when one has little knowledge of the target solutions, then Metagol is unsuitable. There is some preliminary work in identifying universal sets of metarules (Cropper & Muggleton, 2014; Tourret & Cropper, 2019; Cropper & Tourret, 2020). However, this work mostly focuses on dyadic logic. If a problem contains predicates of arities greater than two, then Metagol is almost certainly unsuitable. The Metagol search complexity is exponential (Lin et al., 2014) in the number of clauses, which makes it difficult to learn programs with many clauses. Another problem is that because Metagol works by constructing partial programs, it is highly sensitive to the size and order of the input (Cropper & Morel, 2020).
Finally, Metagol cannot handle noisy examples and struggles to learn large programs (Cropper, 2017; Cropper & Dumančić, 2020; Cropper & Morel, 2020).

7. Applications

We now briefly discuss application areas of ILP.

Bioinformatics and drug design. Perhaps the most prominent application of ILP is in bioinformatics and drug design. ILP is especially suitable for such problems because biological structures, including molecules and protein interaction networks, can easily be expressed as relations: molecular bonds define relations between atoms and interactions define relations between proteins. Moreover, as mentioned in the introduction, ILP induces human-readable models. ILP can, therefore, make predictions based on the (sub)structures present in biological structures, which domain experts can interpret. The types of task ILP has been applied to include identifying and predicting ligands (substructures responsible for medical activity) (Finn et al., 1998; Srinivasan et al., 2006; Kaalia et al., 2016), predicting mutagenic activity of molecules and identifying structural alerts for the causes of chemical cancers (Srinivasan et al., 1997, 1996), learning protein folding signatures (Turcotte et al., 2001), inferring missing pathways in protein signalling networks (Inoue et al., 2013), and modelling inhibition in metabolic networks (Tamaddoni-Nezhad et al., 2006).

Robot scientist. One of the most notable applications of ILP was in the Robot Scientist project (King et al., 2009). The Robot Scientist uses logical BK to represent the relationships between protein-coding sequences, enzymes, and metabolites in a pathway. The Robot Scientist uses ILP to automatically generate hypotheses to explain data, and then devises experiments to test the hypotheses, runs the experiments, interprets the results, and then repeats the cycle (King et al., 2004).
Whilst researching yeast-based functional genomics, the Robot Scientist became the first machine to independently discover new scientific knowledge (King et al., 2009).

Ecology. There has been much recent work on applying ILP in ecology (Bohan et al., 2011; Tamaddoni-Nezhad et al., 2014; Bohan et al., 2017). For instance, Bohan et al. (2011) use ILP to generate plausible and testable hypotheses for trophic relations ('who eats whom') from ecological data.

Program analysis. Due to the expressivity of logic programs as a representation language, ILP systems have found successful applications in software design. ILP systems have proven effective in learning SQL queries (Albarghouthi et al., 2017; Sivaraman et al., 2019) and programming language semantics (Bartha & Cheney, 2019). Other applications include code search (Sivaraman et al., 2019), in which an ILP system interactively learns a search query from examples, and software specification recovery from execution behaviour (Cohen, 1994b, 1995a).

Data curation and transformation. Another successful application of ILP is in data curation and transformation, which is again largely because ILP can learn executable programs. The most prominent examples of such tasks are string transformations, such as the example given in the introduction. There is much interest in this topic, largely due to success in synthesising programs for end-user problems, such as string transformations in Microsoft Excel (Gulwani, 2011). String transformations have become a standard benchmark for recent ILP papers (Lin et al., 2014; Cropper et al., 2020; Cropper & Dumančić, 2020; Cropper & Morel, 2020; Cropper, 2019). Other transformation tasks include extracting values from semi-structured data (e.g. XML files or medical records), extracting relations from ecological papers, and spreadsheet manipulation (Cropper et al., 2015).

Learning from trajectories.
Learning from interpretation transitions (LFIT) (Inoue et al., 2014) automatically constructs a model of the dynamics of a system from the observation of its state transitions. Given time-series data of discrete gene expression, it can learn gene interactions, thus making it possible to explain and predict state changes over time (Ribeiro et al., 2020). LFIT has been applied to learn biological models, like Boolean Networks, under several semantics: memory-less deterministic systems (Inoue et al., 2014; Ribeiro & Inoue, 2014), probabilistic systems (Martínez et al., 2015) and their multi-valued extensions (Ribeiro et al., 2015; Martínez et al., 2016). Martínez et al. (2015, 2016) combine LFIT with a reinforcement learning algorithm to learn probabilistic models with exogenous effects (effects not related to any action) from scratch. The learner was notably integrated in a robot to perform the task of clearing the tableware on a table. In this task external agents interacted: people brought new tableware continuously and the manipulator robot had to cooperate with mobile robots to take the tableware to the kitchen. The learner was able to learn a usable model in just five episodes of 30 action executions. Evans et al. (2019) apply the Apperception Engine to explain sequential data, such as rhythms and simple nursery tunes, image occlusion tasks, and sequence induction intelligence tests. They show that their system can achieve human-level performance.

Natural language processing. Many natural language processing tasks require an understanding of the syntax and semantics of the language.
ILP is well-suited for addressing such tasks for three reasons: (i) it is based on an expressive formal language which can capture/respect the syntax and semantics of the natural language, (ii) linguistic knowledge and principles can be integrated into ILP systems, and (iii) the learnt clauses are understandable to a linguist. ILP has been applied to learn grammars (Mooney & Califf, 1995; Muggleton et al., 2014; Law et al., 2019) and parsers (Zelle & Mooney, 1996, 1995; Mooney, 1999) from examples. For an extensive overview of language tasks that can benefit from ILP, see the paper by Dzeroski et al. (1999).

Physics-informed learning. A major strength of ILP is its ability to incorporate and exploit background knowledge. Several ILP applications solve problems from first principles: provided physical models of the basic primitives, ILP systems can induce the target hypothesis whose behaviour is derived from the basic primitives. For instance, ILP systems can use a theory of light to understand images (Dai et al., 2017; Muggleton et al., 2018). Similarly, simple electronic circuits can be constructed from the examples of the target behaviour and the physics of basic electrical components (Grobelnik, 1992), and models of simple dynamical systems can be learned given the knowledge about differential equations (Bratko et al., 1991).

26. The LFIT implementations are available at: https://github.com/Tony-sama/pylfit

Robotics. Similarly to the previous category, robotics applications often require incorporating domain knowledge or imposing certain requirements on the learnt programs. For instance, the Robot Engineer (Sammut et al., 2015) uses ILP to design tools for robots and even complete robots, which are tested in simulations and real-world environments. Metagol_O (Cropper & Muggleton, 2015) learns robot strategies considering their resource efficiency, and Antanas et al. (2015) recognise graspable points on objects through relational representations of objects.
Games. Inducing game rules has a long history in ILP, where chess has often been the focus (Goodacre, 1996; Morales, 1996; Muggleton et al., 2009a). For instance, Bain (1994) studies inducing rules to determine the legality of moves in the chess KRK (king-rook-king) endgame. Castillo and Wrobel (2003) use a top-down ILP system and active learning to induce a rule for when a square is safe in the game Minesweeper. Legras et al. (2018) show that Aleph and TILDE can outperform an SVM learner in the game of Bridge. Law et al. (2014) use ILASP to induce the rules of Sudoku and show that this more expressive formalism allows game rules to be expressed more compactly. Cropper et al. (2020b) introduce the ILP problem of inductive general game playing: the problem of inducing the rules of games, such as Checkers, Sokoban, and Connect Four, from observations.

Other. Other notable applications include learning event recognition systems (Katzouris et al., 2015, 2016), tracking the evolution of online communities (Athanasopoulos et al., 2018), and learning from the MNIST dataset (Evans & Grefenstette, 2018).

8. Related work

We now provide a more general background to ILP and try to connect it to other forms of machine learning.

ILP is a form of ML, which surprises many researchers who only associate machine learning with statistical techniques. However, the idea of machine learning dates back to Turing (1950), who anticipated the difficulty of programming a computer with human intelligence and instead suggested building computers that learn in a way similar to how a human child learns. Turing also suggested learning with a logical representation and BK, and hinted at the difficulty of learning without it (Muggleton, 2014). Most researchers now use Mitchell’s (1997) definition of ML:

Definition 5.
(Machine learning) A learning algorithm is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

If we follow this definition, then it is clear that ILP is no different from standard machine learning approaches: it improves given more examples. The confusion seems to come from ILP’s use of logic programs, rather than tables of numbers, as a representation for learning. But, as Domingos (2015) points out, there are generally five areas of ML: symbolists, connectionists, Bayesians, analogisers, and evolutionists. ILP is in the symbolic learning category.

Logic-based ML. Alan Turing can be seen as the first AI symbolist, as he proposed using a logical representation to build thinking machines (Muggleton, 1994). John McCarthy, however, made the first comprehensive proposal for the use of logic in AI with his highly ambitious advice taker idea (McCarthy, 1959). Much work on using logic specifically for machine learning soon followed. Recognising the limitations of table-based representations, Banerji (1964) proposed using predicate logic as a representation language for learning and even emphasised the importance of allowing new concepts. Michalski’s (1969) work on the AQ algorithm, which induces rules using a set covering algorithm, has greatly influenced many ILP systems, such as FOIL (Quinlan, 1990) and Progol (Muggleton, 1995).

Short ILP history. Plotkin’s (1971) work on subsumption and least general generalisation has influenced nearly all of ILP, and almost all of ILP theory is connected to the notion of subsumption. Other notable work includes Vera (1975) on induction algorithms for predicate calculus and Sammut’s (1981) MARVIN system, which was one of the first systems to learn executable programs.
Shapiro’s (1983) work on inducing Prolog programs made major contributions to ILP, including the concepts of backtracking and refinement operators. Quinlan’s (1990) FOIL system is one of the most well-known ILP systems; it is a natural extension of ID3 (Quinlan, 1986) from the propositional setting to the first-order setting and uses a similar notion of information gain. Other notable contributions around this time include the introduction of inverse resolution (Muggleton & Buntine, 1988), which was also one of the earliest approaches to predicate invention. The field of ILP was founded by Muggleton in 1991, who stated that it lies at the intersection of machine learning and knowledge representation (Muggleton, 1991). Muggleton also introduced many of the most popular early ILP systems, including Duce (Muggleton, 1987), CIGOL (Muggleton & Buntine, 1988), and, most notably, Progol (Muggleton, 1995), which introduced the idea of inverse entailment and which later inspired much research.

Because ILP induces programs, it is also a form of program synthesis (Manna & Waldinger, 1980; Shapiro, 1983), where the goal is to build a program from a specification. Universal induction methods, such as Solomonoff induction (Solomonoff, 1964a, 1964b) and Levin search (Levin, 1973), are forms of program synthesis. However, universal methods are impractical because they learn only from examples and, as Mitchell (1997) points out, bias-free learning is futile.

8.2.1 Deductive program synthesis

Program synthesis traditionally meant deductive synthesis (Manna & Waldinger, 1980), where the goal is to build a program from a full specification, that is, one that precisely states the requirements and behaviour of the desired program.
For example, to build a program that returns the last element of a non-empty list, we could provide a formal specification written in Z notation (Spivey & Abrial, 1992):

    last : seq_0 X –> X
    forall s : seq_0 X last s = s(

Deductive approaches can also take specifications written as a Prolog program:

    last([A],A).
    last([_|B],C):-last(B,C).

A drawback of deductive approaches is that formulating a specification is difficult and typically requires a domain expert. In fact, formulating a specification can be as difficult as finding a solution (Cropper, 2017). For example, formulating the specification for the following string transformations is non-trivial:

    ‘[email protected]’ => ‘Alan Turing’
    ‘[email protected]’ => ‘Alonzo Church’
    ‘[email protected]’ => ‘Kurt Godel’

8.2.2 Inductive program synthesis

Deductive approaches take full specifications as input and are efficient at building programs. Universal induction methods take only examples as input and are inefficient at building programs. There is an area in between called inductive program synthesis. Similar to universal induction methods, inductive program synthesis systems learn programs from incomplete specifications, typically input/output examples. In contrast to universal induction methods, inductive program synthesis systems use background knowledge; they are thus less general than universal methods but more practical, because the background knowledge is a form of inductive bias (Mitchell, 1997) which restricts the hypothesis space. When given no background knowledge, and thus no inductive bias, inductive program synthesis methods are equivalent to universal induction methods.

Early work on inductive program synthesis includes Plotkin (1971) on least generalisation, Vera (1975) on induction algorithms for predicate calculus, Summers (1977) on inducing Lisp programs, and Shapiro (1983) on inducing Prolog programs.
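As a rough illustration of how background knowledge acts as an inductive bias, the following Python sketch enumerates short compositions of a few string primitives and returns the first one consistent with all input/output examples. The primitives, the function names (run, synthesise), and the example.org addresses are assumptions made for this illustration; they are not taken from any ILP system, and the enumeration is deliberately naive.

```python
from itertools import product

# Hypothetical background knowledge: a handful of named string primitives.
BACKGROUND = {
    "capitalize_words": str.title,
    "dots_to_spaces": lambda s: s.replace(".", " "),
    "drop_domain": lambda s: s.split("@")[0],
    "reverse": lambda s: s[::-1],
}


def run(program, value):
    """Apply the named primitives to `value`, in order."""
    for name in program:
        value = BACKGROUND[name](value)
    return value


def synthesise(examples, max_length=3):
    """Return the first shortest sequence of primitives that maps every
    input to its output, or None if no such sequence exists."""
    for length in range(1, max_length + 1):
        for program in product(BACKGROUND, repeat=length):
            if all(run(program, x) == y for x, y in examples):
                return program
    return None


# Hypothetical examples in the spirit of the string transformations above.
examples = [
    ("alan.turing@example.org", "Alan Turing"),
    ("alonzo.church@example.org", "Alonzo Church"),
]
program = synthesise(examples)
print(program)
print(run(program, "kurt.godel@example.org"))
```

With four primitives and programs of length at most three, the search considers at most 4 + 16 + 64 = 84 candidates. The sketch only checks consistency with the examples; it does nothing to handle noise or to prefer hypotheses that generalise.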
Interest in inductive program synthesis has grown recently, partly due to applications in real-world problems, such as end-user programming (Gulwani, 2011).

Inductive program synthesis interests researchers from many areas of computer science, notably machine learning and programming languages (PL). The two major differences between machine learning and PL approaches are (i) the generality of solutions (synthesised programs) and (ii) noise handling. (Minor differences include the form of specification and theoretical results.) PL approaches often aim to find any program that fits the specification, regardless of whether it generalises. Indeed, PL approaches rarely evaluate the ability of their systems to synthesise solutions that generalise, i.e. they do not measure predictive accuracy (Feser et al., 2015; Osera & Zdancewic, 2015; Albarghouthi et al., 2017; Si et al., 2018; Raghothaman et al., 2020). By contrast, the major challenge in machine learning is learning hypotheses that generalise to unseen examples. Indeed, it is often trivial to learn an overly specific solution for a given problem. For instance, an ILP system can trivially construct the bottom clause (Muggleton, 1995) for each example. Similarly, noise handling is a major problem in machine learning, yet is rarely considered in the PL literature.

Besides ILP, inductive program synthesis has been studied in many areas of machine learning, including deep learning (Balog et al., 2017; Ellis et al., 2018, 2019). The main advantages of neural approaches are that they can handle noisy BK, as illustrated by ∂ILP, and can harness tremendous computational power (Ellis et al., 2019). However, neural methods often require many more examples (Reed & de Freitas, 2016; Dong et al., 2019) to learn concepts that symbolic ILP can learn from just a few. Another disadvantage of neural approaches is that they often require hand-crafted neural architectures for each domain.
For instance, the REPL approach (Ellis et al., 2019) needs a hand-crafted grammar, interpreter, and neural architecture for each domain. By contrast, because ILP uses logic programming as a uniform representation for examples, background knowledge, and hypotheses, it can easily be applied to arbitrary domains.

Program induction vs program synthesis. Confusingly, inductive program synthesis is often simply referred to as program synthesis. Moreover, the terms program induction (Kitzelmann & Schmid, 2006; Lin et al., 2014; Lake et al., 2015; Cropper, 2017; Ellis et al., 2018) and inductive programming (Gulwani et al., 2015) have traditionally meant inductive program synthesis. Gulwani et al. (2017) divide inductive program synthesis into two categories: (i) program induction, and (ii) program synthesis. They say that program induction approaches are neural architectures that learn a network capable of replicating the behaviour of a program. By contrast, they say that program synthesis approaches output or return an interpretable program.

9. Summary and future work

In a survey paper from a decade ago, Muggleton et al. (2012) proposed directions for future research. There have since been major advances in many of these directions, including in predicate invention (Section 5.4), using higher-order logic as a representation language (Section 4.4.2) and for hypotheses (Section 4.2.6), and applications in learning actions and strategies (Section 7). We think that these and other recent advances put ILP in a prime position to have a significant impact on AI over the next decade, especially in addressing the key limitations of standard forms of machine learning. There are, however, still many limitations that future work should address.

Muggleton et al. (2012) argue that a problem with ILP is the lack of well-engineered tools.
They state that whilst over 100 ILP systems have been built since the founding of ILP in 1991, fewer than a handful can be meaningfully used by ILP researchers. One reason is that ILP systems are often designed only as prototypes and are often not well engineered or maintained. Another major problem is that ILP systems are notoriously difficult to use: you often need a PhD in ILP to use any of the tools. Even then, it is still often only the developers of a system who know how to use it properly. This difficulty of use is compounded by ILP systems often using many different biases or even different syntax for the same biases. For instance, although they all use mode declarations, the way of specifying a learning task in Progol, Aleph, TILDE, and ILASP varies considerably. If it is difficult for ILP researchers to use ILP tools, then what hope do non-ILP researchers have? For ILP to be more widely adopted both inside and outside of academia, we must develop more standardised, user-friendly, and better-engineered tools.

28. Inductive program synthesis is also called programming by example and program synthesis from examples, amongst many other names.

Language biases. ILP allows a user to provide BK and a language bias. Both are important and powerful features, but only when used correctly. For instance, Metagol employs metarules (Section 4.4.2) to restrict the syntax of hypotheses and thus the hypothesis space. If a user can provide suitable metarules, then Metagol is extremely efficient. However, if a user cannot provide suitable metarules (which is often the case), then Metagol is almost useless. This same brittleness applies to ILP systems that employ mode declarations (Section 4.4.1). In theory, a user can provide very general mode declarations, such as only using a single type and allowing unlimited recall. In practice, however, weak mode declarations often lead to very poor performance.
For good performance, users of mode-based systems often need to manually analyse a given learning task to tweak the mode declarations, often through a process of trial and error. Moreover, if a user makes a small mistake with a mode declaration, such as giving the wrong argument type, then the ILP system is unlikely to find a good solution. This need for an almost perfect language bias is severely holding back ILP from being widely adopted. To address this limitation, we think that an important direction for future work is to develop techniques for automatically identifying suitable language biases. Although there is some work on mode learning (McCreath & Sharma, 1995; Ferilli et al., 2004; Picado et al., 2017) and on identifying suitable metarules (Cropper & Tourret, 2020), this area is largely under-researched.

Predicate invention and abstraction. Russell (2019) argues that the automatic invention of new high-level concepts is the most important step needed to reach human-level AI. New methods for predicate invention (Section 5.4) have improved the ability of ILP to invent such high-level concepts. However, predicate invention is still difficult and there are many challenges to overcome. For instance, in inductive general game playing (Cropper et al., 2020b), the task is to learn the symbolic rules of games from observations of gameplay, such as learning the rules of Connect Four. The reference solutions for the games come from the general game playing competition (Genesereth & Björnsson, 2013) and often contain auxiliary predicates to make them simpler. For instance, the rules for Connect Four are defined in terms of definitions for lines, which are themselves defined in terms of columns, rows, and diagonals.
Although these auxiliary predicates are not strictly necessary to learn the reference solution, inventing such predicates significantly reduces the size of the solution (sometimes by multiple orders of magnitude), which in turn makes it much easier to learn. Although new methods for predicate invention (Section 5.4) can invent high-level concepts, they are not yet sufficiently powerful to perform well on the inductive general game playing (IGGP) dataset. Making progress in this area would constitute a major advancement in ILP and a major step towards human-level AI.

Lifelong learning. Because of its symbolic representation, a key advantage of ILP is that learned knowledge can be remembered and explicitly stored in the BK. For this reason, ILP naturally supports lifelong (Silver et al., 2013), multi-task (Caruana, 1997), and transfer learning (Torrey & Shavlik, 2009), which are considered essential for human-like AI (Lake et al., 2016). The general idea behind all of these approaches is to reuse knowledge gained from solving one problem to help solve a different problem. Although early work in ILP explored this form of learning (Sammut, 1981; Quinlan, 1990), it was under-explored until recently (Lin et al., 2014; Cropper, 2019, 2020; Hocquette & Muggleton, 2020; Dumančić & Cropper, 2020), with the renewed interest mostly due to new techniques for predicate invention. For instance, Lin et al. (2014) learn 17 string transformation programs over time and show that their multi-task approach performs better than a single-task approach because learned programs are frequently reused. However, these approaches have only been demonstrated on a small number of tasks. To reach human-level AI, we would expect a learner to learn thousands or even millions of concepts. But handling the complexity of thousands of tasks is challenging because, as we explained in Section 4.3, ILP systems struggle to handle large amounts of BK.
This situation leads to the problem of catastrophic remembering (Cropper, 2020): the inability of a learner to forget knowledge. Although there is initial work on this topic (Cropper, 2020), we think that a key area for future work is handling the complexity of lifelong learning.

Relevance. The catastrophic remembering problem is essentially the problem of relevance: given a new ILP problem with lots of BK, how does an ILP system decide which BK is relevant? Although too much irrelevant BK is detrimental to learning performance (Srinivasan et al., 1995, 2003), there is almost no work in ILP on trying to identify relevant BK. One emerging technique is to train a neural network to score how relevant the programs in the BK are and then to use only the highest-scoring BK to learn programs (Balog et al., 2017; Ellis et al., 2018). However, the empirical efficacy of this approach has yet to be clearly demonstrated. Moreover, these approaches have only been demonstrated on small amounts of BK and it is unclear how they scale to BK with thousands of relations. Without efficient relevancy methods, it is unclear how lifelong learning can be achieved.

Noisy BK. Another issue related to lifelong learning is the underlying uncertainty associated with adding learned programs to the BK. By the inherent nature of induction, induced programs are not guaranteed to be correct (i.e. are expected to be noisy), yet they are the building blocks for subsequent induction. Building noisy programs on top of other noisy programs could lead to eventual incoherence of the learned program. This issue is especially problematic because, as mentioned in Section 5.1, most ILP approaches assume noiseless BK, i.e. a relation is true or false without any room for uncertainty. One of the appealing features of ∂ILP is that it takes a differentiable approach to ILP, where it can be provided with fuzzy or ambiguous data. Developing similar techniques to handle noisy BK is an under-explored topic in ILP.
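To make the concern about stacking noisy programs concrete, the following back-of-the-envelope Python sketch assumes, purely for illustration, that each learned program is correct independently with the same probability p. Under that assumption, a program built on a chain of n such programs is entirely correct with probability p^n. The independence assumption and the function name chain_reliability are assumptions of this sketch, not claims from the literature.

```python
def chain_reliability(p: float, depth: int) -> float:
    """Probability that `depth` stacked programs are all correct,
    assuming each is correct independently with probability `p`."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability in [0, 1]")
    if depth < 0:
        raise ValueError("depth must be non-negative")
    return p ** depth


# Illustrative per-program accuracy of 95% (an assumed figure).
for depth in (1, 5, 10, 50):
    print(depth, round(chain_reliability(0.95, depth), 3))
```

Under these assumptions, even at 95% per-program accuracy a chain of ten programs is fully correct only about 60% of the time, which gives some intuition for why noiseless-BK assumptions become fragile as learned programs accumulate.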
Probabilistic ILP. A principled way to handle noise is to unify logical and probabilistic reasoning, which is the focus of statistical relational artificial intelligence (StarAI) (De Raedt et al., 2016). While StarAI is a growing field, inducing probabilistic logic programs has received little attention, with few notable exceptions (Bellodi & Riguzzi, 2015; De Raedt et al., 2015), as inference remains the main challenge. Addressing this issue, i.e. unifying probability and logic in an inductive setting, would be a major achievement (Marcus, 2018).

Explainability. Explainability is one of the claimed advantages of a symbolic representation. Recent work (Muggleton et al., 2018; Ai et al., 2020) evaluates the comprehensibility of ILP hypotheses using Michie’s (1988) framework of ultra-strong machine learning, where a learned hypothesis is expected not only to be accurate but also to demonstrably improve the performance of a human being provided with the learned hypothesis. Muggleton et al. (2018) empirically demonstrate improved human understanding directly through learned hypotheses. However, more work is required to better understand the conditions under which this can be achieved, especially given the rise of predicate invention.

Learning from raw data. Most ILP systems require data in perfect symbolic form. However, much real-world data, such as images and speech, cannot easily be translated into a symbolic form. Perhaps the biggest challenge in ILP is to learn how to both perceive sensory input and learn a symbolic logic program to explain the input. For instance, consider the task of learning to perform addition from MNIST digits. Current ILP systems need to be given symbolic representations of the digits as BK, which could be achieved by first training a neural network to recognise the digits.
Ideally, we would not want to treat the two problems separately, but rather simultaneously learn how to recognise the digits and learn a program to perform the addition. A handful of approaches have started to tackle this problem (Manhaeve et al., 2018; Dai et al., 2019; Evans et al., 2019; Dai & Muggleton, 2020), but developing better ILP techniques that can both perceive sensory input and learn complex relational programs would be a major breakthrough not only for ILP, but for the whole of AI.

Further reading

For an introduction to the fundamentals of logic and automated reasoning, we recommend the book by Harrison (2009). To read more about ILP, we suggest starting with the founding paper by Muggleton (1991) and a survey paper that soon followed (Muggleton & De Raedt, 1994). For a detailed exposition of the theory of ILP, we thoroughly recommend the books of Nienhuys-Cheng and Wolf (1997) and De Raedt (2008).

Acknowledgements

We thank Céline Hocquette, Jonas Schouterden, Jonas Soenen, Tom Silver, and Tony Ribeiro for helpful comments and suggestions.

References

Adé, H., De Raedt, L., & Bruynooghe, M. (1995). Declarative bias for specific-to-general ILP systems. Machine Learning, (1-2), 119–154.
Adé, H., Malfait, B., & De Raedt, L. (1994). RUTH: an ILP theory revision system. In Ras, Z. W., & Zemankova, M. (Eds.), Methodologies for Intelligent Systems, 8th International Symposium, ISMIS ’94, Charlotte, North Carolina, USA, October 16-19, 1994, Proceedings, Vol. 869 of Lecture Notes in Computer Science, pp. 336–345. Springer.
Ahlgren, J., & Yuen, S. Y. (2013). Efficient program synthesis using constraint satisfaction in inductive logic programming. J. Machine Learning Res., (1), 3649–3682.
Ai, L., Muggleton, S. H., Hocquette, C., Gromowski, M., & Schmid, U. (2020). Beneficial and harmful explanatory machine learning. CoRR, abs/.
Albarghouthi, A., Koutris, P., Naik, M., & Smith, C. (2017). Constraint-based synthesis of datalog programs. In Beck, J. C.
(Ed.), Principles and Practice of Constraint Programming - 23rd International Conference, CP 2017, Melbourne, VIC, Australia, August 28 - September 1, 2017, Proceedings, Vol. 10416 of Lecture Notes in Computer Science, pp. 689–706. Springer.
Antanas, L., Moreno, P., & De Raedt, L. (2015). Relational kernel-based grasping with numerical features. In Inoue, K., Ohwada, H., & Yamamoto, A. (Eds.), Inductive Logic Programming - 25th International Conference, ILP 2015, Kyoto, Japan, August 20-22, 2015, Revised Selected Papers, Vol. 9575 of Lecture Notes in Computer Science, pp. 1–14. Springer.
Athakravi, D., Corapi, D., Broda, K., & Russo, A. (2013). Learning through hypothesis refinement using answer set programming. In Zaverucha, G., Costa, V. S., & Paes, A. (Eds.), Inductive Logic Programming - 23rd International Conference, ILP 2013, Rio de Janeiro, Brazil, August 28-30, 2013, Revised Selected Papers, Vol. 8812 of Lecture Notes in Computer Science, pp. 31–46. Springer.
Athanasopoulos, G., Paliouras, G., Vogiatzis, D., Tzortzis, G., & Katzouris, N. (2018). Predicting the evolution of communities with online inductive logic programming. In Alechina, N., Nørvåg, K., & Penczek, W. (Eds.), , Vol. 120 of LIPIcs, pp. 4:1–4:20. Schloss Dagstuhl - Leibniz-Zentrum für Informatik.
Bain, M. (1994). Learning logical exceptions in chess. Ph.D. thesis, University of Strathclyde.
Bain, M., & Srinivasan, A. (2018). Identification of biological transition systems using meta-interpreted logic programs. Machine Learning, (7), 1171–1206.
Balog, M., Gaunt, A. L., Brockschmidt, M., Nowozin, S., & Tarlow, D. (2017). Deepcoder: Learning to write programs. In . OpenReview.net.
Banerji, R. B. (1964). A language for the description of concepts. General Systems, (1), 135–141.
Bartha, S., & Cheney, J. (2019). Towards meta-interpretive learning of programming language semantics. In Kazakov, D., & Erten, C.
(Eds.), Inductive Logic Programming - 29th International Conference, ILP 2019, Plovdiv, Bulgaria, September 3-5, 2019, Proceedings, Vol. 11770 of Lecture Notes in Computer Science, pp. 16–25. Springer.
Bellodi, E., & Riguzzi, F. (2015). Structure learning of probabilistic logic programs by searching the clause space. Theory Pract. Log. Program., (2), 169–212.
Bengio, Y., Courville, A. C., & Vincent, P. (2013). Representation learning: A review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell., (8), 1798–1828.
Bengio, Y., Deleu, T., Rahaman, N., Ke, N. R., Lachapelle, S., Bilaniuk, O., Goyal, A., & Pal, C. J. (2019). A meta-transfer objective for learning to disentangle causal mechanisms. CoRR, abs/.
Blockeel, H., & De Raedt, L. (1998). Top-down induction of first-order logical decision trees. Artif. Intell., (1-2), 285–297.
Blumer, A., Ehrenfeucht, A., Haussler, D., & Warmuth, M. K. (1987). Occam’s razor. Inf. Process. Lett., (6), 377–380.
Bohan, D. A., Caron-Lormier, G., Muggleton, S., Raybould, A., & Tamaddoni-Nezhad, A. (2011). Automated discovery of food webs from ecological data using logic-based machine learning. PLoS One, (12), e29028.
Bohan, D. A., Vacher, C., Tamaddoni-Nezhad, A., Raybould, A., Dumbrell, A. J., & Woodward, G. (2017). Next-generation global biomonitoring: large-scale, automated reconstruction of ecological networks. Trends in Ecology & Evolution, (7), 477–487.
Bratko, I. (1999). Refining complete hypotheses in ILP. In Dzeroski, S., & Flach, P. A. (Eds.), Inductive Logic Programming, 9th International Workshop, ILP-99, Bled, Slovenia, June 24-27, 1999, Proceedings, Vol. 1634 of Lecture Notes in Computer Science, pp. 44–55. Springer.
Bratko, I. (2012). Prolog Programming for Artificial Intelligence, 4th Edition. Addison-Wesley.
Bratko, I., Muggleton, S., & Varsek, A. (1991). Learning qualitative models of dynamic systems. In Birnbaum, L., & Collins, G.
(Eds.), Proceedings of the Eighth International Workshop (ML91), Northwestern University, Evanston, Illinois, USA, pp. 385–388. Morgan Kaufmann.
Buntine, W. L. (1988). Generalized subsumption and its applications to induction and redundancy. Artif. Intell., (2), 149–176.
Caruana, R. (1997). Multitask learning. Machine Learning, (1), 41–75.
Castillo, L. P., & Wrobel, S. (2003). Learning minesweeper with multirelational learning. In Gottlob, G., & Walsh, T. (Eds.), IJCAI-03, Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence, Acapulco, Mexico, August 9-15, 2003, pp. 533–540. Morgan Kaufmann.
Chollet, F. (2019). On the measure of intelligence. CoRR, abs/.
Clark, K. L. (1977). Negation as failure. In Gallaire, H., & Minker, J. (Eds.), Logic and Data Bases, Symposium on Logic and Data Bases, Centre d’études et de recherches de Toulouse, France, 1977, Advances in Data Base Theory, pp. 293–322, New York. Plenum Press.
Cohen, W. W. (1994a). Grammatically biased learning: Learning logic programs using an explicit antecedent description language. Artif. Intell., (2), 303–366.
Cohen, W. W. (1994b). Recovering software specifications with inductive logic programming. In Hayes-Roth, B., & Korf, R. E. (Eds.), Proceedings of the 12th National Conference on Artificial Intelligence, Seattle, WA, USA, July 31 - August 4, 1994, Volume 1, pp. 142–148. AAAI Press / The MIT Press.
Cohen, W. W. (1995a). Inductive specification recovery: Understanding software by learning from example behaviors. Autom. Softw. Eng., (2), 107–129.
Cohen, W. W. (1995b). Pac-learning recursive logic programs: Negative results. J. Artif. Intell. Res., , 541–573.
Colmerauer, A., & Roussel, P. (1993). The birth of prolog. In Lee, J. A. N., & Sammet, J. E. (Eds.), , pp. 37–52. ACM.
Corapi, D., Russo, A., & Lupu, E. (2010). Inductive logic programming as abductive search. In Hermenegildo, M. V., & Schaub, T.
(Eds.), Technical Communications of the 26th International Conference on Logic Programming, ICLP 2010, July 16-19, 2010, Edinburgh, Scotland, UK, Vol. 7 of LIPIcs, pp. 54–63. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik.
Corapi, D., Russo, A., & Lupu, E. (2011). Inductive logic programming in answer set programming. In Muggleton, S., Tamaddoni-Nezhad, A., & Lisi, F. A. (Eds.), Inductive Logic Programming - 21st International Conference, ILP 2011, Windsor Great Park, UK, July 31 - August 3, 2011, Revised Selected Papers, Vol. 7207 of Lecture Notes in Computer Science, pp. 91–97. Springer.
Costa, V. S., Rocha, R., & Damas, L. (2012). The YAP prolog system. Theory Pract. Log. Program., (1-2), 5–34.
Cropper, A. (2017). Efficiently learning efficient programs. Ph.D. thesis, Imperial College London, UK.
Cropper, A. (2019). Playgol: Learning programs through play. In Kraus, S. (Ed.), Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI 2019, Macao, China, August 10-16, 2019, pp. 6074–6080. ijcai.org.
Cropper, A. (2020). Forgetting to learn logic programs. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, New York, NY, USA, February 7-12, 2020, pp. 3676–3683. AAAI Press.
Cropper, A., & Dumančić, S. (2020). Learning large logic programs by going beyond entailment. In Bessiere, C. (Ed.), Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI 2020, pp. 2073–2079. ijcai.org.
Cropper, A., Dumančić, S., & Muggleton, S. H. (2020a). Turning 30: New ideas in inductive logic programming. In Bessiere, C. (Ed.), Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI 2020, pp. 4833–4839. ijcai.org.
Cropper, A., Evans, R., & Law, M. (2020b). Inductive general game playing. Machine Learning, (7), 1393–1434.
Cropper, A., & Morel, R. (2020). Learning programs by learning from failures.
CoRR, abs/.
Cropper, A., Morel, R., & Muggleton, S. (2020). Learning higher-order logic programs. Machine Learning, (7), 1289–1322.
Cropper, A., & Muggleton, S. (2015). Can predicate invention compensate for incomplete background knowledge? In Nowaczyk, S. (Ed.), Thirteenth Scandinavian Conference on Artificial Intelligence - SCAI 2015, Halmstad, Sweden, November 5-6, 2015, Vol. 278 of Frontiers in Artificial Intelligence and Applications, pp. 27–36. IOS Press.
Cropper, A., & Muggleton, S. H. (2014). Logical minimisation of meta-rules within meta-interpretive learning. In Davis, J., & Ramon, J. (Eds.), Inductive Logic Programming - 24th International Conference, ILP 2014, Nancy, France, September 14-16, 2014, Revised Selected Papers, Vol. 9046 of Lecture Notes in Computer Science, pp. 62–75. Springer.
Cropper, A., & Muggleton, S. H. (2015). Learning efficient logical robot strategies involving composable objects. In Yang, Q., & Wooldridge, M. J. (Eds.), Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, IJCAI 2015, Buenos Aires, Argentina, July 25-31, 2015, pp. 3423–3429. AAAI Press.
Cropper, A., & Muggleton, S. H. (2016). Metagol system. https://github.com/metagol/metagol.
Cropper, A., & Muggleton, S. H. (2019). Learning efficient logic programs. Machine Learning, (7), 1063–1083.
Cropper, A., Tamaddoni-Nezhad, A., & Muggleton, S. H. (2015). Meta-interpretive learning of data transformation programs. In Inoue, K., Ohwada, H., & Yamamoto, A. (Eds.), Inductive Logic Programming - 25th International Conference, ILP 2015, Kyoto, Japan, August 20-22, 2015, Revised Selected Papers, Vol. 9575 of Lecture Notes in Computer Science, pp. 46–59. Springer.
Cropper, A., & Tourret, S. (2020). Logical reduction of metarules. Machine Learning, (7), 1323–1369.
Dai, W., Muggleton, S., Wen, J., Tamaddoni-Nezhad, A., & Zhou, Z. (2017). Logical vision: One-shot meta-interpretive learning from real images.
In Lachiche, N., & Vrain, C. (Eds.), Inductive Logic Programming - 27th International Conference, ILP 2017, Orléans, France,September 4-6, 2017, Revised Selected Papers , Vol. 10759 of Lecture Notes in Computer Sci-ence , pp. 46–62. Springer.Dai, W.-Z., & Muggleton, S. H. (2020). Abductive knowledge induction from raw data..Dai, W., Xu, Q., Yu, Y., & Zhou, Z. (2019). Bridging machine learning and logical reasoning byabductive learning. In Wallach, H. M., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox,E. B., & Garnett, R. (Eds.), Advances in Neural Information Processing Systems 32: AnnualConference on Neural Information Processing Systems 2019, NeurIPS 2019, 8-14 December2019, Vancouver, BC, Canada , pp. 2811–2822.Dantsin, E., Eiter, T., Gottlob, G., & Voronkov, A. (2001). Complexity and expressive power oflogic programming. ACM Comput. Surv. , (3), 374–425.Dash, T., Srinivasan, A., Vig, L., Orhobor, O. I., & King, R. D. (2018). Large-scale assessmentof deep relational machines. In Riguzzi, F., Bellodi, E., & Zese, R. (Eds.), Inductive LogicProgramming - 28th International Conference, ILP 2018, Ferrara, Italy, September 2-4, 2018,Proceedings , Vol. 11105 of Lecture Notes in Computer Science , pp. 22–37. Springer.Davis, M., Logemann, G., & Loveland, D. W. (1962). A machine program for theorem-proving. Commun. ACM , (7), 394–397.De Raedt, L. (1997). Logical settings for concept-learning. Artif. Intell. , (1), 187–201.De Raedt, L. (2008). Logical and relational learning . Cognitive Technologies. Springer.De Raedt, L., & Bruynooghe, M. (1992). Interactive concept-learning and constructive inductionby analogy. Machine Learing , , 107–150.De Raedt, L., & Dehaspe, L. (1997). Clausal discovery. Machine Learning , (2-3), 99–146.De Raedt, L., Dries, A., Thon, I., den Broeck, G. V., & Verbeke, M. (2015). Inducing probabilisticrelational rules from probabilistic examples. In Yang, Q., & Wooldridge, M. J. 
(Eds.), Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence,IJCAI 2015, Buenos Aires, Argentina, July 25-31, 2015 , pp. 1835–1843. AAAI Press.De Raedt, L., & Kersting, K. (2008a). Probabilistic inductive logic programming. In De Raedt,L., Frasconi, P., Kersting, K., & Muggleton, S. (Eds.), Probabilistic Inductive Logic Program-ming - Theory and Applications , Vol. 4911 of Lecture Notes in Computer Science , pp. 1–27.Springer.De Raedt, L., & Kersting, K. (2008b). Probabilistic Inductive Logic Programming , p. 1–27.Springer-Verlag, Berlin, Heidelberg.De Raedt, L., Kersting, K., Kimmig, A., Revoredo, K., & Toivonen, H. (2008). Compressingprobabilistic prolog programs. Machine Learning , (2-3), 151–168.De Raedt, L., Kersting, K., Natarajan, S., & Poole, D. (2016). Statistical Relational Artificial Intel-ligence: Logic, Probability, and Computation . Synthesis Lectures on Artificial Intelligenceand Machine Learning. Morgan & Claypool Publishers. NDUCTIVE LOGIC PROGRAMMING AT A NEW INTRODUCTION De Raedt, L., Kimmig, A., & Toivonen, H. (2007). Problog: A probabilistic prolog and its applica-tion in link discovery. In IJCAI 2007, Proceedings of the 20th International Joint Conferenceon Artificial Intelligence, Hyderabad, India, January 6-12, 2007 , pp. 2462–2467.Domingos, P. (2015). The master algorithm: How the quest for the ultimate learning machine willremake our world . Basic Books.Domingos, P. M. (1999). The role of occam’s razor in knowledge discovery. Data Min. Knowl.Discov. , (4), 409–425.Dong, H., Mao, J., Lin, T., Wang, C., Li, L., & Zhou, D. (2019). Neural logic machines. In . OpenReview.net.Dumanˇci´c, S., & Blockeel, H. (2017). Clustering-based relational unsupervised representationlearning with an explicit distributed representation. In Sierra, C. (Ed.), Proceedings ofthe Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI 2017, Mel-bourne, Australia, August 19-25, 2017 , pp. 1631–1637. 
ijcai.org.Dumanˇci´c, S., & Cropper, A. (2020). Knowledge refactoring for program induction. CoRR , abs / .Dumanˇci´c, S., Guns, T., Meert, W., & Blockeel, H. (2019). Learning relational representationswith auto-encoding logic programs. In Kraus, S. (Ed.), Proceedings of the Twenty-EighthInternational Joint Conference on Artificial Intelligence, IJCAI 2019, Macao, China, August10-16, 2019 , pp. 6081–6087. ijcai.org.Dzeroski, S., Cussens, J., & Manandhar, S. (1999). An introduction to inductive logic program-ming and learning language in logic. In Cussens, J., & Dzeroski, S. (Eds.), Learning Lan-guage in Logic , Vol. 1925 of Lecture Notes in Computer Science , pp. 3–35. Springer.Ellis, K., Morales, L., Sablé-Meyer, M., Solar-Lezama, A., & Tenenbaum, J. (2018). Learninglibraries of subroutines for neurally-guided bayesian program induction. In NeurIPS 2018 ,pp. 7816–7826.Ellis, K., Nye, M. I., Pu, Y., Sosa, F., Tenenbaum, J., & Solar-Lezama, A. (2019). Write, execute,assess: Program synthesis with a REPL. In Wallach, H. M., Larochelle, H., Beygelzimer, A.,d’Alché-Buc, F., Fox, E. B., & Garnett, R. (Eds.), Advances in Neural Information ProcessingSystems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS2019, 8-14 December 2019, Vancouver, BC, Canada , pp. 9165–9174.Emde, W., Habel, C., & Rollinger, C. (1983). The discovery of the equator or concept drivenlearning. In Bundy, A. (Ed.), Proceedings of the 8th International Joint Conference on Arti-ficial Intelligence. Karlsruhe, FRG, August 1983 , pp. 455–458. William Kaufmann.Evans, R., & Grefenstette, E. (2018). Learning explanatory rules from noisy data. J. Artif. Intell.Res. , , 1–64.Evans, R., Hernández-Orallo, J., Welbl, J., Kohli, P., & Sergot, M. J. (2019). Making sense ofsensory input. CoRR , abs / .Ferilli, S., Esposito, F., Basile, T. M. A., & Mauro, N. D. (2004). Automatic induction of first-orderlogic descriptors type domains from observations. In Camacho, R., King, R. 
D., & Srini-vasan, A. (Eds.), Inductive Logic Programming, 14th International Conference, ILP 2004, ROPPER AND D UMANCIC Porto, Portugal, September 6-8, 2004, Proceedings , Vol. 3194 of Lecture Notes in ComputerScience , pp. 116–131. Springer.Feser, J. K., Chaudhuri, S., & Dillig, I. (2015). Synthesizing data structure transformationsfrom input-output examples. In Proceedings of the 36th ACM SIGPLAN Conference on Pro-gramming Language Design and Implementation, Portland, OR, USA, June 15-17, 2015 , pp.229–239.Finn, P. W., Muggleton, S., Page, D., & Srinivasan, A. (1998). Pharmacophore discovery usingthe inductive logic programming system PROGOL. Machine Learning , (2-3), 241–270.Flener, P. (1996). Inductive logic program synthesis with DIALOGS. In Muggleton, S. (Ed.), In-ductive Logic Programming, 6th International Workshop, ILP-96, Stockholm, Sweden, August26-28, 1996, Selected Papers , Vol. 1314 of Lecture Notes in Computer Science , pp. 175–198.Springer.Gebser, M., Kaminski, R., Kaufmann, B., & Schaub, T. (2012). Answer Set Solving in Practice .Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & ClaypoolPublishers.Gebser, M., Kaminski, R., Kaufmann, B., & Schaub, T. (2014). Clingo = ASP + control: Prelimi-nary report. CoRR , abs / .Gebser, M., Kaufmann, B., & Schaub, T. (2012). Conflict-driven answer set solving: From theoryto practice. Artif. Intell. , , 52–89.Gelder, A. V., Ross, K. A., & Schlipf, J. S. (1991). The well-founded semantics for general logicprograms. J. ACM , (3), 620–650.Gelfond, M., & Lifschitz, V. (1988). The stable model semantics for logic programming. InKowalski, R. A., & Bowen, K. A. (Eds.), Logic Programming, Proceedings of the Fifth In-ternational Conference and Symposium, Seattle, Washington, USA, August 15-19, 1988 (2Volumes) , pp. 1070–1080. MIT Press.Genesereth, M. R., & Björnsson, Y. (2013). The international general game playing competition. AI Magazine , (2), 107–111.Goodacre, J. (1996). 
Inductive learning of chess rules using Progol . Ph.D. thesis, University ofOxford.Grobelnik, M. (1992). Markus: an optimized model inference system. In Proceedings of the LogicApproaches to Machine Learning Workshop .Gulwani, S. (2011). Automating string processing in spreadsheets using input-output examples.In Ball, T., & Sagiv, M. (Eds.), Proceedings of the 38th ACM SIGPLAN-SIGACT Symposium onPrinciples of Programming Languages, POPL 2011, Austin, TX, USA, January 26-28, 2011 ,pp. 317–330. ACM.Gulwani, S., Hernández-Orallo, J., Kitzelmann, E., Muggleton, S. H., Schmid, U., & Zorn, B. G.(2015). Inductive programming meets the real world. Commun. ACM , (11), 90–99.Gulwani, S., Polozov, O., Singh, R., et al. (2017). Program synthesis. Foundations and Trends®in Programming Languages , (1-2), 1–119.Harrison, J. (2009). Handbook of Practical Logic and Automated Reasoning (1st edition). Cam-bridge University Press, USA. NDUCTIVE LOGIC PROGRAMMING AT A NEW INTRODUCTION Hoare, C. A. R. (1961). Algorithm 64: Quicksort. Commun. ACM , (7), 321.Hocquette, C., & Muggleton, S. H. (2020). Complete bottom-up predicate invention in meta-interpretive learning. In Bessiere, C. (Ed.), Proceedings of the Twenty-Ninth InternationalJoint Conference on Artificial Intelligence, IJCAI 2020 , pp. 2312–2318. ijcai.org.Inoue, K. (2004). Induction as consequence finding. Machine Learning , (2), 109–135.Inoue, K. (2016). Meta-level abduction. FLAP , (1), 7–36.Inoue, K., Doncescu, A., & Nabeshima, H. (2013). Completing causal networks by meta-levelabduction. Machine Learning , (2), 239–277.Inoue, K., Ribeiro, T., & Sakama, C. (2014). Learning from interpretation transition. MachineLearning , (1), 51–79.Kaalia, R., Srinivasan, A., Kumar, A., & Ghosh, I. (2016). Ilp-assisted de novo drug design. Machine Learning , (3), 309–341.Kaiser, L., & Sutskever, I. (2016). Neural gpus learn algorithms. In Bengio, Y., & LeCun, Y.(Eds.), .Kaminski, T., Eiter, T., & Inoue, K. (2018). 
Exploiting answer set programming with externalsources for meta-interpretive learning. Theory Pract. Log. Program. , (3-4), 571–588.Katzouris, N., Artikis, A., & Paliouras, G. (2015). Incremental learning of event definitions withinductive logic programming. Machine Learning , (2-3), 555–585.Katzouris, N., Artikis, A., & Paliouras, G. (2016). Online learning of event definitions. TheoryPract. Log. Program. , (5-6), 817–833.Kaur, N., Kunapuli, G., Joshi, S., Kersting, K., & Natarajan, S. (2019). Neural networks forrelational data. In Kazakov, D., & Erten, C. (Eds.), Inductive Logic Programming - 29thInternational Conference, ILP 2019, Plovdiv, Bulgaria, September 3-5, 2019, Proceedings ,Vol. 11770 of Lecture Notes in Computer Science , pp. 62–71. Springer.Kaur, N., Kunapuli, G., & Natarajan, S. (2020). Non-parametric learning of lifted restrictedboltzmann machines. Int. J. Approx. Reason. , , 33–47.Kietz, J.-U., & Wrobel, S. (1992). Controlling the complexity of learning in logic through syn-tactic and task-oriented models. In Inductive logic programming . Citeseer.King, R. D., Muggleton, S., Lewis, R. A., & Sternberg, M. J. (1992). Drug design by machinelearning: the use of inductive logic programming to model the structure-activity relation-ships of trimethoprim analogues binding to dihydrofolate reductase. Proceedings of theNational Academy of Sciences , (23), 11322–11326.King, R. D., Rowland, J., Oliver, S. G., Young, M., Aubrey, W., Byrne, E., Liakata, M., Markham,M., Pir, P., Soldatova, L. N., et al. (2009). The automation of science. Science , (5923),85–89.King, R. D., Whelan, K. E., Jones, F. M., Reiser, P. G., Bryant, C. H., Muggleton, S. H., Kell, D. B.,& Oliver, S. G. (2004). Functional genomic hypothesis generation and experimentationby a robot scientist. Nature , (6971), 247–252. ROPPER AND D UMANCIC Kitzelmann, E., & Schmid, U. (2006). Inductive synthesis of functional programs: An explana-tion based generalization approach. J. Machine Learning Res. 
, , 429–454.Kowalski, R. A. (1974). Predicate logic as programming language. In Rosenfeld, J. L. (Ed.), In-formation Processing, Proceedings of the 6th IFIP Congress 1974, Stockholm, Sweden, August5-10, 1974 , pp. 569–574. North-Holland.Kowalski, R. A. (1988). The early years of logic programming. Commun. ACM , (1), 38–43.Kowalski, R. A., & Kuehner, D. (1971). Linear resolution with selection function. Artif. Intell. , (3 / Rapport technique OFAI-TR-95-32, Austrian Research Institute for Artificial Intelligence, Vienna .Lake, B. M., Salakhutdinov, R., & Tenenbaum, J. B. (2015). Human-level concept learningthrough probabilistic program induction. Science , (6266), 1332–1338.Lake, B. M., Ullman, T. D., Tenenbaum, J. B., & Gershman, S. J. (2016). Building machines thatlearn and think like people. CoRR , abs / .Law, M. (2018). Inductive learning of answer set programs . Ph.D. thesis, Imperial College London,UK.Law, M., Russo, A., Bertino, E., Broda, K., & Lobo, J. (2019). Representing and learning gram-mars in answer set programming. In The Thirty-Third AAAI Conference on Artificial Intel-ligence, AAAI 2019 , pp. 2919–2928. AAAI Press.Law, M., Russo, A., Bertino, E., Broda, K., & Lobo, J. (2020). Fastlas: Scalable inductive logic pro-gramming incorporating domain-specific optimisation criteria. In The Thirty-Fourth AAAIConference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applicationsof Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on EducationalAdvances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020 , pp.2877–2885. AAAI Press.Law, M., Russo, A., & Broda, K. (2014). Inductive learning of answer set programs. In Fermé, E.,& Leite, J. (Eds.), Logics in Artificial Intelligence - 14th European Conference, JELIA 2014,Funchal, Madeira, Portugal, September 24-26, 2014. Proceedings , Vol. 8761 of Lecture Notesin Computer Science , pp. 311–325. Springer.Law, M., Russo, A., & Broda, K. (2016). 
Iterative learning of answer set programs from contextdependent examples. Theory Pract. Log. Program. , (5-6), 834–848.Law, M., Russo, A., & Broda, K. (2018a). The complexity and generality of learning answer setprograms. Artif. Intell. , , 110–146.Law, M., Russo, A., & Broda, K. (2018b). Inductive learning of answer set programs from noisyexamples. CoRR , abs / .Law, M., Russo, A., & Broda, K. (2020). The ilasp system for inductive learning of answer setprograms. The Association for Logic Programming Newsletter .Leban, G., Zabkar, J., & Bratko, I. (2008). An experiment in robot discovery with ILP. In Zelezný,F., & Lavrac, N. (Eds.), Inductive Logic Programming, 18th International Conference, ILP2008, Prague, Czech Republic, September 10-12, 2008, Proceedings , Vol. 5194 of LectureNotes in Computer Science , pp. 77–90. Springer. NDUCTIVE LOGIC PROGRAMMING AT A NEW INTRODUCTION LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature , (7553), 436–444.Legras, S., Rouveirol, C., & Ventos, V. (2018). The game of bridge: A challenge for ILP. InRiguzzi, F., Bellodi, E., & Zese, R. (Eds.), Inductive Logic Programming - 28th InternationalConference, ILP 2018, Ferrara, Italy, September 2-4, 2018, Proceedings , Vol. 11105 of LectureNotes in Computer Science , pp. 72–87. Springer.Levin, L. A. (1973). Universal sequential search problems. Problemy Peredachi Informatsii , (3),115–116.Lin, D., Dechter, E., Ellis, K., Tenenbaum, J. B., & Muggleton, S. (2014). Bias reformulationfor one-shot function induction. In Schaub, T., Friedrich, G., & O’Sullivan, B. (Eds.), ECAI2014 - 21st European Conference on Artificial Intelligence, 18-22 August 2014, Prague, CzechRepublic - Including Prestigious Applications of Intelligent Systems (PAIS 2014) , Vol. 263 of Frontiers in Artificial Intelligence and Applications , pp. 525–530. IOS Press.Lloyd, J. W. (1994). Practical advtanages of declarative programming. In , pp. 18–30.Lloyd, J. W. (2012). Foundations of logic programming . 
Springer Science & Business Media.Maher, M. J. (1988). Equivalences of logic programs. In Minker, J. (Ed.), Foundations of Deduc-tive Databases and Logic Programming , pp. 627–658. Morgan Kaufmann.Manhaeve, R., Dumancic, S., Kimmig, A., Demeester, T., & De Raedt, L. (2018). Deepproblog:Neural probabilistic logic programming. In Bengio, S., Wallach, H. M., Larochelle, H.,Grauman, K., Cesa-Bianchi, N., & Garnett, R. (Eds.), Advances in Neural Information Pro-cessing Systems 31: Annual Conference on Neural Information Processing Systems 2018,NeurIPS 2018, 3-8 December 2018, Montréal, Canada , pp. 3753–3763.Manna, Z., & Waldinger, R. J. (1980). A deductive approach to program synthesis. ACM Trans.Program. Lang. Syst. , (1), 90–121.Marcus, G. (2018). Deep learning: A critical appraisal. CoRR , abs / .Martínez, D., Alenyà, G., Torras, C., Ribeiro, T., & Inoue, K. (2016). Learning relational dynamicsof stochastic domains for planning. In Coles, A. J., Coles, A., Edelkamp, S., Magazzeni, D.,& Sanner, S. (Eds.), Proceedings of the Twenty-Sixth International Conference on AutomatedPlanning and Scheduling, ICAPS 2016, London, UK, June 12-17, 2016 , pp. 235–243. AAAIPress.Martínez, D., Ribeiro, T., Inoue, K., Alenyà, G., & Torras, C. (2015). Learning probabilistic actionmodels from interpretation transitions. In Vos, M. D., Eiter, T., Lierler, Y., & Toni, F. (Eds.), Proceedings of the Technical Communications of the 31st International Conference on LogicProgramming (ICLP 2015), Cork, Ireland, August 31 - September 4, 2015 , Vol. 1433 of CEURWorkshop Proceedings . CEUR-WS.org.McCarthy, J. (1959). Programs with common sense. In Proceedings of the Teddington Conferenceon the Mechanization of Thought Processes , pp. 75–91, London. Her Majesty’s StationaryOffice.McCreath, E., & Sharma, A. (1995). Extraction of meta-knowledge to restrict the hypothesisspace for ilp systems. In Eighth Australian Joint Conference on Artificial Intelligence , pp.75–82. 
ROPPER AND D UMANCIC Michalski, R. S. (1969). On the quasi-minimal solution of the general covering problem..Michie, D. (1988). Machine learning in the next five years. In Sleeman, D. H. (Ed.), Proceedingsof the Third European Working Session on Learning, EWSL 1988, Turing Institute, Glasgow,UK, October 3-5, 1988 , pp. 107–122. Pitman Publishing.Mitchell, T. M. (1997). Machine learning . McGraw Hill series in computer science. McGraw-Hill.Mitchell, T. M., Cohen, W. W., Jr., E. R. H., Talukdar, P. P., Yang, B., Betteridge, J., Carlson, A.,Mishra, B. D., Gardner, M., Kisiel, B., Krishnamurthy, J., Lao, N., Mazaitis, K., Mohamed,T., Nakashole, N., Platanios, E. A., Ritter, A., Samadi, M., Settles, B., Wang, R. C., Wijaya,D., Gupta, A., Chen, X., Saparov, A., Greaves, M., & Welling, J. (2018). Never-endinglearning. Commun. ACM , (5), 103–115.Mooney, R. J. (1999). Learning for semantic interpretation: Scaling up without dumbing down.In Cussens, J., & Dzeroski, S. (Eds.), Learning Language in Logic , Vol. 1925 of Lecture Notesin Computer Science , pp. 57–66. Springer.Mooney, R. J., & Califf, M. E. (1995). Induction of first-order decision lists: Results on learningthe past tense of english verbs. J. Artif. Intell. Res. , , 1–24.Morales, E. M. (1996). Learning playing strategies in chess. Computational Intelligence , ,65–87.Morel, R., Cropper, A., & Ong, C. L. (2019). Typed meta-interpretive learning of logic programs.In Calimeri, F., Leone, N., & Manna, M. (Eds.), Logics in Artificial Intelligence - 16th Eu-ropean Conference, JELIA 2019, Rende, Italy, May 7-11, 2019, Proceedings , Vol. 11468 of Lecture Notes in Computer Science , pp. 198–213. Springer.Muggleton, S. (1987). Duce, an oracle-based approach to constructive induction. In McDermott,J. P. (Ed.), Proceedings of the 10th International Joint Conference on Artificial Intelligence.Milan, Italy, August 23-28, 1987 , pp. 287–292. Morgan Kaufmann.Muggleton, S. (1991). Inductive logic programming. 
New Generation Computing , (4), 295–318.Muggleton, S. (1994). Logic and learning: Turing’s legacy. In Machine Intelligence 13 , pp. 37–56.Muggleton, S. (1995). Inverse entailment and progol. New Generation Comput. , (3&4), 245–286.Muggleton, S. (1999a). Inductive logic programming: Issues, results and the challenge of learn-ing language in logic. Artif. Intell. , (1-2), 283–296.Muggleton, S. (1999b). Scientific knowledge discovery using inductive logic programming. Commun. ACM , (11), 42–46.Muggleton, S. (2014). Alan turing and the development of artificial intelligence. AI Commun. , (1), 3–10.Muggleton, S., & Buntine, W. L. (1988). Machine invention of first order predicates by invertingresolution. In Laird, J. E. (Ed.), Machine Learning, Proceedings of the Fifth InternationalConference on Machine Learning, Ann Arbor, Michigan, USA, June 12-14, 1988 , pp. 339–352. Morgan Kaufmann. NDUCTIVE LOGIC PROGRAMMING AT A NEW INTRODUCTION Muggleton, S., Dai, W., Sammut, C., Tamaddoni-Nezhad, A., Wen, J., & Zhou, Z. (2018). Meta-interpretive learning from noisy images. Machine Learning , (7), 1097–1118.Muggleton, S., & De Raedt, L. (1994). Inductive logic programming: Theory and methods. J.Log. Program. , / , 629–679.Muggleton, S., De Raedt, L., Poole, D., Bratko, I., Flach, P. A., Inoue, K., & Srinivasan, A. (2012).ILP turns 20 - biography and future challenges. Machine Learning , (1), 3–23.Muggleton, S., & Feng, C. (1990). Efficient induction of logic programs. In Algorithmic LearningTheory, First International Workshop, ALT ’90, Tokyo, Japan, October 8-10, 1990, Proceed-ings , pp. 368–381.Muggleton, S., & Firth, J. (2001). Relational rule induction with cp rogol 4.4: A tutorial intro-duction. In Relational data mining , pp. 160–188. Springer.Muggleton, S., Paes, A., Costa, V. S., & Zaverucha, G. (2009a). Chess revision: Acquiring therules of chess variants through FOL theory revision from examples. In De Raedt, L. 
(Ed.), Inductive Logic Programming, 19th International Conference, ILP 2009, Leuven, Belgium,July 02-04, 2009. Revised Papers , Vol. 5989 of Lecture Notes in Computer Science , pp. 123–130. Springer.Muggleton, S., Santos, J. C. A., & Tamaddoni-Nezhad, A. (2009b). Progolem: A system based onrelative minimal generalisation. In De Raedt, L. (Ed.), Inductive Logic Programming, 19thInternational Conference, ILP 2009, Leuven, Belgium, July 02-04, 2009. Revised Papers , Vol.5989 of Lecture Notes in Computer Science , pp. 131–148. Springer.Muggleton, S., & Tamaddoni-Nezhad, A. (2008). QG / GA: a stochastic search for progol. MachineLearning , (2-3), 121–133.Muggleton, S. H., Lin, D., Chen, J., & Tamaddoni-Nezhad, A. (2013). Metabayes: Bayesianmeta-interpretative learning using higher-order stochastic refinement. In Inductive LogicProgramming - 23rd International Conference, ILP 2013, Rio de Janeiro, Brazil, August 28-30, 2013, Revised Selected Papers , pp. 1–17.Muggleton, S. H., Lin, D., Pahlavi, N., & Tamaddoni-Nezhad, A. (2014). Meta-interpretive learn-ing: application to grammatical inference. Machine Learning , (1), 25–49.Muggleton, S. H., Lin, D., & Tamaddoni-Nezhad, A. (2015). Meta-interpretive learning of higher-order dyadic Datalog: predicate invention revisited. Machine Learning , (1), 49–73.Muggleton, S. H., Schmid, U., Zeller, C., Tamaddoni-Nezhad, A., & Besold, T. R. (2018). Ultra-strong machine learning: comprehensibility of programs learned with ILP. Machine Learn-ing , (7), 1119–1140.Nienhuys-Cheng, S.-H., & Wolf, R. d. (1997). Foundations of Inductive Logic Programming .Springer-Verlag New York, Inc., Secaucus, NJ, USA.Osera, P., & Zdancewic, S. (2015). Type-and-example-directed program synthesis. In Grove, D.,& Blackburn, S. (Eds.), Proceedings of the 36th ACM SIGPLAN Conference on ProgrammingLanguage Design and Implementation, Portland, OR, USA, June 15-17, 2015 , pp. 619–630.ACM.Otero, R. P. (2001). Induction of stable models. 
In Rouveirol, C., & Sebag, M. (Eds.), InductiveLogic Programming, 11th International Conference, ILP 2001, Strasbourg, France, September ROPPER AND D UMANCIC , Vol. 2157 of Lecture Notes in Computer Science , pp. 193–205.Springer.Page, D., & Srinivasan, A. (2003). ILP: A short look back and a longer look forward. J. MachineLearning Res. , , 415–430.Picado, J., Termehchy, A., Fern, A., & Pathak, S. (2017). Towards automatically setting lan-guage bias in relational learning. In Schelter, S., & Zadeh, R. (Eds.), Proceedings of the 1stWorkshop on Data Management for End-to-End Machine Learning, DEEM@SIGMOD 2017,Chicago, IL, USA, May 14, 2017 , pp. 3:1–3:4. ACM.Plotkin, G. (1971). Automatic Methods of Inductive Inference . Ph.D. thesis, Edinburgh University.Quinlan, J. R. (1986). Induction of decision trees. Machine Learning , (1), 81–106.Quinlan, J. R. (1990). Learning logical definitions from relations. Machine Learning , , 239–266.Quinlan, J. R. (1993). C4.5: Programs for Machine Learning . Morgan Kaufmann.Raghothaman, M., Mendelson, J., Zhao, D., Naik, M., & Scholz, B. (2020). Provenance-guidedsynthesis of datalog programs. Proc. ACM Program. Lang. , (POPL), 62:1–62:27.Ray, O. (2009). Nonmonotonic abductive inductive learning. J. Applied Logic , (3), 329–340.Reed, S. E., & de Freitas, N. (2016). Neural programmer-interpreters. In Bengio, Y., & LeCun,Y. (Eds.), .Reiter, R. (1977). On closed world data bases. In Gallaire, H., & Minker, J. (Eds.), Logic andData Bases, Symposium on Logic and Data Bases, Centre d’études et de recherches de Toulouse,France, 1977 , Advances in Data Base Theory, pp. 55–76, New York. Plemum Press.Ribeiro, T., Folschette, M., Magnin, M., & Inoue, K. (2020). Learning any semantics for dynam-ical systems represented by logic programs. working paper or preprint.Ribeiro, T., & Inoue, K. (2014). Learning prime implicant conditions from interpretation tran-sition. In Davis, J., & Ramon, J. 
(Eds.), Inductive Logic Programming - 24th InternationalConference, ILP 2014, Nancy, France, September 14-16, 2014, Revised Selected Papers , Vol.9046 of Lecture Notes in Computer Science , pp. 108–125. Springer.Ribeiro, T., Magnin, M., Inoue, K., & Sakama, C. (2015). Learning multi-valued biological modelswith delayed influence from time-series observations. In Li, T., Kurgan, L. A., Palade,V., Goebel, R., Holzinger, A., Verspoor, K., & Wani, M. A. (Eds.), , pp. 25–31. IEEE.Richards, B. L., & Mooney, R. J. (1995). Automated refinement of first-order horn-clause domaintheories. Machine Learning , (2), 95–131.Robinson, J. A. (1965). A machine-oriented logic based on the resolution principle. J. ACM , (1), 23–41.Rocktäschel, T., & Riedel, S. (2017). End-to-end differentiable proving. In Guyon, I., vonLuxburg, U., Bengio, S., Wallach, H. M., Fergus, R., Vishwanathan, S. V. N., & Garnett,R. (Eds.), Advances in Neural Information Processing Systems 30: Annual Conference onNeural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA ,pp. 3788–3800. NDUCTIVE LOGIC PROGRAMMING AT A NEW INTRODUCTION Rosenblatt, F. (1958). The perceptron: a probabilistic model for information storage and orga-nization in the brain.. Psychological review , (6), 386.Russell, S., & Norvig, P. (2010). Artificial Intelligence: A Modern Approach . Pearson, New Jersey.Third Edition.Russell, S. (2019). Human compatible: Artificial intelligence and the problem of control . Penguin.Sakama, C., & Inoue, K. (2009). Brave induction: a logical framework for learning from incom-plete information. Machine Learning , (1), 3–35.Sammut, C. (1981). Concept learning by experiment. In Hayes, P. J. (Ed.), Proceedings of the 7thInternational Joint Conference on Artificial Intelligence, IJCAI ’81, Vancouver, BC, Canada,August 24-28, 1981 , pp. 104–105. William Kaufmann.Sammut, C. (1993). The origins of inductive logic programming: A prehistoric tale. 
In Pro-ceedings of the 3rd international workshop on inductive logic programming , pp. 127–147.J. Stefan Institute.Sammut, C., Sheh, R., Haber, A., & Wicaksono, H. (2015). The robot engineer. In Inoue, K.,Ohwada, H., & Yamamoto, A. (Eds.), Late Breaking Papers of the 25th International Confer-ence on Inductive Logic Programming, Kyoto University, Kyoto, Japan, August 20th to 22nd,2015 , Vol. 1636 of CEUR Workshop Proceedings , pp. 101–106. CEUR-WS.org.Sato, T. (1995). A statistical learning method for logic programs with distribution semantics. InSterling, L. (Ed.), Logic Programming, Proceedings of the Twelfth International Conferenceon Logic Programming, Tokyo, Japan, June 13-16, 1995 , pp. 715–729. MIT Press.Sato, T., & Kameya, Y. (2001). Parameter learning of logic programs for symbolic-statisticalmodeling. J. Artif. Intell. Res. , , 391–454.Schaffer, C. (1993). Overfitting avoidance as bias. Machine Learning , , 153–178.Shapiro, E. Y. (1983). Algorithmic Program DeBugging . MIT Press, Cambridge, MA, USA.Si, X., Lee, W., Zhang, R., Albarghouthi, A., Koutris, P., & Naik, M. (2018). Syntax-guided synthe-sis of Datalog programs. In Leavens, G. T., Garcia, A., & Pasareanu, C. S. (Eds.), Proceedingsof the 2018 ACM Joint Meeting on European Software Engineering Conference and Sympo-sium on the Foundations of Software Engineering, ESEC / SIGSOFT FSE 2018, Lake BuenaVista, FL, USA, November 04-09, 2018 , pp. 515–527. ACM.Siebers, M., & Schmid, U. (2018). Was the year 2000 a leap year? step-wise narrowing theorieswith metagol. In Riguzzi, F., Bellodi, E., & Zese, R. (Eds.), Inductive Logic Programming -28th International Conference, ILP 2018, Ferrara, Italy, September 2-4, 2018, Proceedings ,Vol. 11105 of Lecture Notes in Computer Science , pp. 141–156. Springer.Silver, D. L., Yang, Q., & Li, L. (2013). Lifelong machine learning systems: Beyond learningalgorithms. 
In Lifelong Machine Learning, Papers from the 2013 AAAI Spring Symposium,Palo Alto, California, USA, March 25-27, 2013 , Vol. SS-13-05 of AAAI Technical Report .AAAI.Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser,J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. (2016). Mastering the game ofgo with deep neural networks and tree search. nature , (7587), 484. ROPPER AND D UMANCIC Sivaraman, A., Zhang, T., den Broeck, G. V., & Kim, M. (2019). Active inductive logic program-ming for code search. In Atlee, J. M., Bultan, T., & Whittle, J. (Eds.), Proceedings of the41st International Conference on Software Engineering, ICSE 2019, Montreal, QC, Canada,May 25-31, 2019 , pp. 292–303. IEEE / ACM.Solomonoff, R. J. (1964a). A formal theory of inductive inference. part I. Information andControl , (1), 1–22.Solomonoff, R. J. (1964b). A formal theory of inductive inference. part II. Information andControl , (2), 224–254.Sourek, G., Aschenbrenner, V., Zelezný, F., Schockaert, S., & Kuzelka, O. (2018). Lifted relationalneural networks: Efficient learning of latent relational structures. J. Artif. Intell. Res. , ,69–100.Spivey, J. M., & Abrial, J. (1992). The Z notation . Prentice Hall Hemel Hempstead.Srinivasan, A. (2001). The ALEPH manual. Machine Learning at the Computing Laboratory,Oxford University .Srinivasan, A., King, R. D., & Bain, M. (2003). An empirical study of the use of relevanceinformation in inductive logic programming. J. Machine Learning Res. , , 369–383.Srinivasan, A., King, R. D., Muggleton, S., & Sternberg, M. J. E. (1997). Carcinogenesis pre-dictions using ILP. In Lavrac, N., & Dzeroski, S. (Eds.), Inductive Logic Programming, 7thInternational Workshop, ILP-97, Prague, Czech Republic, September 17-20, 1997, Proceed-ings , Vol. 1297 of Lecture Notes in Computer Science , pp. 273–287. Springer.Srinivasan, A., Muggleton, S., Sternberg, M. J. E., & King, R. D. (1996). 
Theories for mutagenic-ity: A study in first-order and feature-based induction. Artif. Intell. , (1-2), 277–299.Srinivasan, A., Muggleton, S. H., & King, R. D. (1995). Comparing the use of backgroundknowledge by inductive logic programming systems. In Proceedings of the 5th Interna-tional Workshop on Inductive Logic Programming , pp. 199–230. Department of ComputerScience, Katholieke Universiteit Leuven.Srinivasan, A., Page, D., Camacho, R., & King, R. D. (2006). Quantitative pharmacophore modelswith inductive logic programming. Machine Learning , (1-3), 65–90.Stahl, I. (1995). The appropriateness of predicate invention as bias shift operation in ILP. Ma-chine Learning , (1-2), 95–117.Sterling, L., & Shapiro, E. Y. (1994). The art of Prolog: advanced programming techniques . MITpress.Struyf, J., Davis, J., & Jr., C. D. P. (2006). An efficient approximation to lookahead in relationallearners. In Fürnkranz, J., Scheffer, T., & Spiliopoulou, M. (Eds.), Machine Learning: ECML2006, 17th European Conference on Machine Learning, Berlin, Germany, September 18-22,2006, Proceedings , Vol. 4212 of Lecture Notes in Computer Science , pp. 775–782. Springer.Summers, P. D. (1977). A methodology for LISP program construction from examples. J. ACM , (1), 161–175.Tamaddoni-Nezhad, A., Bohan, D., Raybould, A., & Muggleton, S. (2014). Towards machinelearning of predictive models from ecological data. In Davis, J., & Ramon, J. (Eds.), Induc-tive Logic Programming - 24th International Conference, ILP 2014, Nancy, France, September NDUCTIVE LOGIC PROGRAMMING AT A NEW INTRODUCTION , Vol. 9046 of Lecture Notes in Computer Science , pp.154–167. Springer.Tamaddoni-Nezhad, A., Chaleil, R., Kakas, A. C., & Muggleton, S. (2006). Application of abduc-tive ILP to learning metabolic network inhibition from temporal data. Machine Learning , (1-3), 209–230.Tamaki, H., & Sato, T. (1984). Unfold / fold transformation of logic programs. In Tärnlund,S. 
(Ed.), Proceedings of the Second International Logic Programming Conference, Uppsala University, Uppsala, Sweden, July 2-6, 1984, pp. 127–138. Uppsala University.
Tärnlund, S. (1977). Horn clause computability. BIT, (2), 215–226.
Torrey, L., & Shavlik, J. (2009). Transfer learning. Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods, and Techniques, 242.
Torrey, L., Shavlik, J. W., Walker, T., & Maclin, R. (2007). Relational macros for transfer in reinforcement learning. In Blockeel, H., Ramon, J., Shavlik, J. W., & Tadepalli, P. (Eds.), Inductive Logic Programming, 17th International Conference, ILP 2007, Corvallis, OR, USA, June 19-21, 2007, Revised Selected Papers, Vol. 4894 of Lecture Notes in Computer Science, pp. 254–268. Springer.
Tourret, S., & Cropper, A. (2019). SLD-resolution reduction of second-order Horn fragments. In Calimeri, F., Leone, N., & Manna, M. (Eds.), Logics in Artificial Intelligence - 16th European Conference, JELIA 2019, Rende, Italy, May 7-11, 2019, Proceedings, Vol. 11468 of Lecture Notes in Computer Science, pp. 259–276. Springer.
Turcotte, M., Muggleton, S., & Sternberg, M. J. E. (2001). The effect of relational background knowledge on learning of protein three-dimensional fold signatures. Machine Learning, (1 / Mind, (236), 433–460.
Vera, S. (1975). Induction of concepts in the predicate calculus. In Advance Papers of the Fourth International Joint Conference on Artificial Intelligence, Tbilisi, Georgia, USSR, September 3-8, 1975, pp. 281–287.
Wang, W. Y., Mazaitis, K., & Cohen, W. W. (2014). Structure learning via parameter learning. In Li, J., Wang, X. S., Garofalakis, M. N., Soboroff, I., Suel, T., & Wang, M. (Eds.), Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management, CIKM 2014, Shanghai, China, November 3-7, 2014, pp. 1199–1208. ACM.
Wielemaker, J., Schrijvers, T., Triska, M., & Lager, T. (2012). SWI-Prolog. Theory Pract. Log. Program.
, (1-2), 67–96.
Wirth, N. (1985). Algorithms and data structures. Prentice Hall.
Wrobel, S. (1996). First-order theory refinement. In Advances in Inductive Logic Programming, pp. 14–33.
Zahálka, J., & Zelezný, F. (2011). An experimental test of Occam’s razor in classification. Machine Learning, (3), 475–481.
Zelle, J. M., & Mooney, R. J. (1995). Comparative results on using inductive logic programming for corpus-based parser construction. In Wermter, S., Riloff, E., & Scheler, G. (Eds.), Connectionist, Statistical, and Symbolic Approaches to Learning for Natural Language Processing, Vol. 1040 of Lecture Notes in Computer Science, pp. 355–369. Springer.
Zelle, J. M., & Mooney, R. J. (1996). Learning to parse database queries using inductive logic programming. In Clancey, W. J., & Weld, D. S. (Eds.), Proceedings of the Thirteenth National Conference on Artificial Intelligence and Eighth Innovative Applications of Artificial Intelligence Conference, AAAI 96, IAAI 96, Portland, Oregon, USA, August 4-8, 1996, Volume 2, pp. 1050–1055. AAAI Press / The MIT Press.
arXiv:2008.07912v3 [cs.AI] 13 Oct 2020