On the Parameterized Complexity of Learning First-Order Logic
Steffen van Bergerem, RWTH Aachen University, Germany
Martin Grohe, RWTH Aachen University, Germany
Martin Ritzert, RWTH Aachen University, Germany
Abstract
We analyse the complexity of learning first-order definable concepts in a model-theoretic framework for supervised learning introduced by Grohe and Turán (TOCS 2004). Previous research on the complexity of learning in this framework focussed on the question of when learning is possible in time sublinear in the background structure. Here we study the parameterized complexity of the learning problem. We obtain a hardness result showing that exactly learning first-order definable concepts is at least as hard as the corresponding model-checking problem, which implies that on general structures it is hard for the parameterized complexity class AW[*]. Our main contribution is a fixed-parameter tractable agnostic PAC learning algorithm for first-order definable concepts over effectively nowhere dense background structures.

We study the complexity of Boolean classification (a.k.a. concept learning) problems in an abstract setting where hypotheses are specified by formulas of first-order logic over finite structures. Boolean classification is a supervised learning problem: the goal is to learn an unknown Boolean function c*: X → {0, 1}, the target function or target concept, defined on an instance space X, from a sequence (x_1, λ_1), ..., (x_m, λ_m) ∈ X × {0, 1} of labelled examples, where (in the simplest setting) for all i the label λ_i is the target value c*(x_i). Given the labelled examples, the learner computes a hypothesis h: X → {0, 1} that is supposed to be close to the target concept c*. To make this problem feasible at all, we usually assume that the target concept is from some restricted concept class C, and the hypothesis from a hypothesis class H, which may or may not coincide with C. We mainly frame our results in Valiant's probably approximately correct (PAC) learning model [16] and the more general agnostic PAC learning model [11].
In the latter, we do not require a fixed target concept and instead just assume that there is an unknown probability distribution D on X × {0, 1}. Then the learner's goal is to find a hypothesis h in the hypothesis class that minimises the probability Pr_{(x,λ)∼D}(h(x) ≠ λ) of being wrong on a random (unseen) example. This probability is known as the generalisation error (or risk) of the hypothesis. It is known that, information-theoretically, PAC learning and agnostic PAC learning are possible if and only if the hypothesis class H has bounded VC dimension [3, 19]. It is also known that if this is the case, then we can cast the algorithmic problem as a minimisation problem known as empirical risk minimisation: find the hypothesis h ∈ H that minimises the training error (a.k.a. empirical risk)

(1/m) · |{i | 1 ⩽ i ⩽ m, h(x_i) ≠ λ_i}|.

An approximate minimisation with an additive error ε is sufficient for an (agnostic) PAC learning algorithm.

After this very brief review of some necessary definitions and facts from algorithmic learning theory (for more background, see [12, 14]), let us now describe the logical setting from [10] that we consider here. Hypotheses are defined by formulas of first-order logic or some other logic over finite structures. The instance space consists of all k-tuples ū of elements of a finite structure A, referred to as the background structure. Concepts and hypotheses are specified by formulas φ(x̄; ȳ) together with tuples v̄ of parameters, which are also elements of the structure A. Here, x̄, ȳ are tuples of variables of length k, ℓ, respectively, where ℓ is the length of the parameter tuple v̄.
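The empirical risk minimisation objective described above can be sketched for a finite hypothesis class. The instance space, the threshold hypotheses, and the sample below are toy stand-ins for illustration, not the first-order setting of this paper.

```python
# Empirical risk minimisation over a finite hypothesis class: a minimal
# sketch.  Hypotheses are Boolean functions h: X -> {0, 1}.

def training_error(h, examples):
    """Fraction of labelled examples (x, lam) on which h(x) != lam."""
    return sum(1 for x, lam in examples if h(x) != lam) / len(examples)

def erm(hypotheses, examples):
    """Return a hypothesis minimising the training error."""
    return min(hypotheses, key=lambda h: training_error(h, examples))

# Toy instance space X = {0, ..., 7}; hypothesis class = threshold functions.
thresholds = [lambda x, t=t: 1 if x >= t else 0 for t in range(9)]
examples = [(0, 0), (1, 0), (2, 0), (3, 1), (4, 1), (5, 1)]

best = erm(thresholds, examples)
# best labels exactly the sampled points with x >= 3 positively, so its
# training error on this sample is 0.
```

An approximate minimiser, as in the agnostic setting, would only need to come within an additive ε of the best achievable training error.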
The hypothesis specified by φ(x̄; ȳ) and v̄ is the Boolean function h_{φ,v̄} with h_{φ,v̄}(ū) = 1 if A ⊨ φ(ū; v̄) and h_{φ,v̄}(ū) = 0 otherwise.

A motivation for this framework is query learning in relational databases [1, 4], where the goal is to learn a query in some language such as SQL that matches given examples of tuples with labels indicating whether they are supposed to be in the query answer or not. The role of the parameters in this setting is that of constants that may appear in the query. Learning schema mappings for relational databases is a similar application scenario [15]. A general type of application our framework is designed for is to detect formally specified properties of a (black-box) system from an observed input-output behaviour, whether this is cast as verification or explainability (of the properties of a machine learning model). For a more extensive discussion of the framework, we refer the reader to [9].

The framework was introduced in [10], where the authors proved information-theoretic learning results for both first-order and monadic second-order logic obtained by restricting the background structures, for example, to be planar or of bounded tree width. The algorithmic question was first studied in [9], where it was proved that on graphs of maximum degree d, empirical risk minimisation for first-order definable hypotheses is possible in time polynomial in d and the number m of labelled examples the algorithm receives as input, independently of the size of the background structure A. This was generalised to first-order logic with counting [17] and with weight aggregation [18]. All these results are mainly interesting in structures of small, say, polylogarithmic degree, because there they yield learning algorithms running in time sublinear in the size of the background structure. It was shown in [9, 17] that sublinear learning is no longer possible if the degree is unrestricted.
To address this issue, it was proposed in [8] to introduce a preprocessing phase in which (before seeing any labelled examples) the background structure is converted to some data structure that supports sublinear learning later. This model was applied to monadic second-order logic on strings [8] and trees [6].

Our Contributions.
We study the complexity of learning first-order definable concepts in the logical framework. We consider two algorithmic problems: an exact learning problem FO-Learn and an approximate version FO-ALearn tailored toward the agnostic PAC learning framework (see Section 3 for details).

Both of these problems, FO-Learn and FO-ALearn, trivially have polynomial-time data complexity, that is, they can be solved in polynomial time if the dimension k, the number ℓ of parameters, and the quantifier rank q of the formulas are regarded as fixed constants. This is simply because the hypothesis space has polynomial size O(n^ℓ), where the constant in the big-Oh depends on k, q, ℓ, and a fixed first-order formula can be evaluated in polynomial time. As a lower bound, we have mentioned above that the problems cannot be solved in sublinear time if the degree of the structure A is unrestricted. The main question we study here is whether FO-Learn and FO-ALearn are fixed-parameter tractable, that is, solvable in time f(k, ℓ, q, 1/ε) · n^O(1) for some function f.

Our first result states that, at least, the exact learning problem FO-Learn is not fixed-parameter tractable (unless FPT = AW[*]).

▶ Theorem 1. FO-Learn is hard for the parameterized complexity class AW[*] under parameterized Turing reductions.

In view of this theorem, we look at restricted classes of background structures. The proof of Theorem 1 is a reduction from the model-checking problem for first-order logic, which is complete for AW[*] [2]. This reduction remains valid for restrictions of the input structure of the model-checking problem resp. background structure of the learning problem to classes C satisfying some mild closure conditions.
Thus, to find fixed-parameter tractable restrictions of the learning problem, we need to look at classes with a fixed-parameter tractable model-checking problem.

In fact, regardless of whether the learning problem is hard or not, it only makes sense to consider the problem in settings where the model-checking problem is tractable: even if we could learn a hypothesis with small error, it would not help us at all, because it would not be feasible to evaluate the hypothesis on a given instance.

It is relatively easy to see that for classes with a fixed-parameter tractable model-checking problem, again satisfying mild closure conditions, the unary version of the learning problem (where k = 1) is reducible to the model-checking problem and hence fixed-parameter tractable as well. Unfortunately, the simple reduction argument breaks down for k ⩾
2. By a result due to Laskowski [13], bounded VC dimension for the 1-dimensional hypothesis class implies bounded VC dimension for the higher-dimensional classes. This allows an extension from k = 1 to larger k for the information-theoretic learning problem. Unfortunately, Laskowski's result is not algorithmic, so it does not help us here.

Nevertheless, our main result is that for effectively nowhere dense classes of structures, which are among the largest known with a tractable FO-model-checking problem [7], the agnostic PAC learning problem is fixed-parameter tractable (for all values of k).

▶ Theorem 2. FO-ALearn is fixed-parameter tractable on nowhere dense classes of structures.

The proof of this theorem heavily depends on the characterisation of nowhere dense classes in terms of the so-called splitter game [7]. Interestingly, the parameters the learner chooses are essentially the vertices Splitter chooses in her winning strategy for the game.
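The splitter game just mentioned (defined formally in the preliminaries below) can be simulated directly. In the sketch below, `ball` computes the r-neighbourhood by BFS, and the Connector and Splitter strategies are hypothetical stand-ins, not the winning strategy from [7].

```python
# A toy simulation of the (r, s)-splitter game: Connector picks a vertex v,
# play moves to the r-neighbourhood of v, and Splitter deletes one vertex
# of that neighbourhood.  Splitter wins if the arena becomes empty.
from collections import deque

def ball(adj, v, r):
    """All vertices at distance at most r from v (breadth-first search)."""
    dist = {v: 0}
    queue = deque([v])
    while queue:
        u = queue.popleft()
        if dist[u] < r:
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
    return set(dist)

def splitter_wins(adj, r, s, connector, splitter):
    """Play at most s rounds; return True iff the arena becomes empty."""
    for _ in range(s):
        if not adj:
            return True
        v = connector(adj)                       # Connector picks a vertex
        arena = ball(adj, v, r)                  # play continues in N_r(v)
        arena.discard(splitter(adj, v, arena))   # Splitter deletes one vertex
        adj = {u: adj[u] & arena for u in arena}
    return not adj

path = {0: {1}, 1: {0, 2}, 2: {1}}
connector = lambda adj: min(adj)     # hypothetical greedy strategies,
splitter = lambda adj, v, arena: v   # not the winning strategy from [7]
```

On this 3-vertex path with r = 1, the stand-in Splitter strategy wins within two rounds but not within one.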
We use ℕ and ℕ_{⩾1} to denote the non-negative and the positive integers, respectively, and use the shorthand [m] = {1, ..., m}. All functions in this paper are computable. Additionally, all variables denote integers, except for ε, δ and their variations such as ε* and δ*. We expand the standard identity function to I(x, y, z) = x, which ignores everything but the first input.

For ease of presentation, all results will be given in terms of (coloured) graphs, which we formally define below, instead of general relational structures. All our results can be extended to also work with arbitrary relational structures instead of graphs.

A graph G consists of nodes V(G) and edges E(G). Formally, we view a (coloured) graph as a relational structure of some vocabulary τ = {E, P_1, ..., P_ℓ}, where E is binary and the P_j are unary. We assume graphs to be undirected and loop-free, that is, in a graph G the edge relation E(G) is symmetric and irreflexive. If we want to specify the vocabulary explicitly, we call such a graph a τ-coloured graph. For τ′ ⊇ τ, a τ′-expansion of a τ-coloured graph G is a τ′-coloured graph G′ such that V(G′) = V(G), E(G′) = E(G), and P(G′) = P(G) for all P ∈ τ. The size or order of a graph G is the number of its vertices |V(G)|. A graph G′ is a subgraph of G, written G′ ⊆ G, if V(G′) ⊆ V(G), E(G′) ⊆ E(G), and P(G′) ⊆ P(G) for all P ∈ τ. Given a set S ⊆ V(G), the induced subgraph G[S] is defined as the graph with nodes S and all relations restricted to the nodes in S.
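The graph encoding just described, nodes, a symmetric loop-free edge relation, unary colour relations, and the induced subgraph G[S], can be sketched as follows; the class name and representation are illustrative assumptions, not from the paper.

```python
# A coloured graph as a relational structure, with the induced subgraph
# G[S] restricting every relation to S.

class ColouredGraph:
    def __init__(self, nodes, edges, colours):
        self.nodes = set(nodes)
        # store the edge relation symmetrically and loop-free
        self.edges = {(u, v) for u, v in edges if u != v}
        self.edges |= {(v, u) for u, v in self.edges}
        # colours: unary relations P_j, given as a mapping name -> set of nodes
        self.colours = {p: set(vs) for p, vs in colours.items()}

    def induced(self, S):
        """The induced subgraph G[S]: nodes S, all relations restricted to S."""
        S = set(S) & self.nodes
        return ColouredGraph(
            S,
            {(u, v) for u, v in self.edges if u in S and v in S},
            {p: vs & S for p, vs in self.colours.items()},
        )

G = ColouredGraph(range(5), [(0, 1), (1, 2), (2, 3), (3, 4)], {"P1": {0, 2, 4}})
H = G.induced({0, 1, 2})
```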
In the proofs, we regularly need distances. The distance dist(u, v) between two nodes u, v ∈ V(G) is the length of a shortest path between them. The distance between a node u ∈ V(G) and a tuple v̄ ∈ (V(G))^k is defined as dist(u, v̄) = min_{v ∈ v̄} dist(u, v). Similarly, we define the distance between two tuples ū ∈ (V(G))^k, v̄ ∈ (V(G))^ℓ as dist(ū, v̄) = min_{u ∈ ū, v ∈ v̄} dist(u, v). The r-neighbourhood N_r^G(ā) = {u ∈ V(G) | dist(u, ā) ⩽ r} is the set of nodes at distance at most r from ā. We additionally need the r-neighbourhood of a set S ⊆ V(G), defined as N_r^G(S) = {u ∈ V(G) | dist(u, s) ⩽ r for some s ∈ S}. The structure 𝒩_r^G(ā) = G[N_r^G(ā)] induced by the neighbourhood of ā is called the (induced) r-neighbourhood structure. Neighbourhoods of different radii can be related; we will need the connection described in the next lemma. It is an easy consequence of the so-called Vitali Covering Lemma.

▶ Lemma 3.
Let G be a graph, let X ⊆ V(G), and let r ⩾ 1. Then there is a set Z ⊆ X and an R = 3^i · r, where 0 ⩽ i ⩽ |X| − 1, such that
(i) N_R^G(z) ∩ N_R^G(z′) = ∅ for all distinct z, z′ ∈ Z, and
(ii) N_r^G(X) ⊆ N_R^G(Z).

Proof.
For i ⩾ 0, let R_i := 3^i · r. We inductively construct a sequence Z_0 ⊃ Z_1 ⊃ ··· ⊃ Z_p of subsets of X such that for every i we have N_r^G(X) ⊆ N_{R_i}^G(Z_i), and for p we additionally have N_{R_p}^G(z) ∩ N_{R_p}^G(z′) = ∅ for all distinct z, z′ ∈ Z_p. Then p ⩽ |X| − 1, and Z_p is the desired set.

Let Z_0 := X. Now suppose that Z_i is defined. If N_{R_i}^G(z) ∩ N_{R_i}^G(z′) = ∅ for all distinct z, z′ ∈ Z_i, we set p := i and stop the construction. Otherwise, let Z_{i+1} ⊆ Z_i be inclusion-wise maximal such that N_{R_i}^G(z) ∩ N_{R_i}^G(z′) = ∅ for all distinct z, z′ ∈ Z_{i+1}. Clearly Z_{i+1} ⊂ Z_i, as otherwise we would have stopped the construction. Then for all y ∈ Z_i there is a z ∈ Z_{i+1} such that N_{R_i}^G(y) ∩ N_{R_i}^G(z) ≠ ∅, which implies that N_{R_i}^G(y) ⊆ N_{3·R_i}^G(z) = N_{R_{i+1}}^G(z). Thus, N_r^G(X) ⊆ N_{R_i}^G(Z_i) ⊆ N_{R_{i+1}}^G(Z_{i+1}).

In the worst case, consecutive sets Z_i, Z_{i+1} differ by exactly one element, and we actually need p = |X| − 1 with |Z_p| = 1. This case can be reached when each x_i is at position 3^{i−1} · r on a path and x_1 ∈ Z_i for all i. ◀

Nowhere Dense Graphs
Let G be a graph and r, s > 0. The (r, s)-splitter game on G is played by two players called Connector and Splitter. The game is played in a sequence of at most s rounds. We let G_0 := G. In round i + 1 of the game, Connector picks a vertex v_{i+1} ∈ V(G_i), and Splitter answers by picking w_{i+1} ∈ N_r^{G_i}(v_{i+1}). We let G_{i+1} := G_i[N_r^{G_i}(v_{i+1}) \ {w_{i+1}}]. Splitter wins if G_{i+1} = ∅; otherwise the play continues.

▶ Fact 4 ([7]). A class C of graphs is nowhere dense if and only if for every r > 0 there is an s > 0 such that for every G ∈ C, Splitter has a winning strategy for the (r, s)-splitter game on G.

In a modified version of the game, in each round Splitter not only picks a vertex, but also a radius r′ ⩽ r, and the game continues in N_{r′}(v). Clearly, this does not help Connector, and the fact remains true for the same s. Nevertheless, it will be convenient for us later to work with this modified version of the game.

A graph class is effectively nowhere dense if it is nowhere dense and the winning strategy of Splitter is computable. In this paper, we only consider effectively nowhere dense graph classes.

Types
Let us fix such a vocabulary τ in the following and assume graphs are τ-coloured. FO[τ] denotes the set of all formulas of first-order logic of vocabulary τ, and FO[τ, q] denotes the subset of FO[τ] consisting of all formulas of quantifier rank at most q. For coloured graphs, the vocabulary includes the edge relation symbol as well as all colouring relation symbols. The q-type of a tuple v̄ = (v_1, ..., v_k) ∈ (V(G))^k of arity k is the set tp_q(G, v̄) of all FO[τ, q]-formulas φ(x̄) such that G ⊨ φ(v̄). Moreover, for q, r ⩾ 0, the local (q, r)-type of v̄ is the set ltp_{q,r}(G, v̄) := tp_q(𝒩_r^G(v̄), v̄). A k-variable q-type (of vocabulary τ) is a set θ of FO[τ, q]-formulas whose free variables are among x_1, ..., x_k. Since up to logical equivalence there are only finitely many FO[τ, q]-formulas whose free variables are among x_1, ..., x_k, we can view types as finite sets of formulas. Formally, we syntactically define a normal form such that for all τ, q, k there are only finitely many FO[τ, q]-formulas in normal form with free variables among x_1, ..., x_k, together with an algorithm that transforms every formula into an equivalent formula in normal form without increasing the quantifier rank. Then, we view types as sets of formulas in normal form. We denote the set of all k-variable q-types of vocabulary τ by Tp[τ, k, q].

It is easy to see that for every FO[τ, q]-formula φ(x_1, ..., x_k) there is a set Φ ⊆ Tp[τ, k, q] such that for all τ-coloured graphs G and all v̄ ∈ (V(G))^k,

G ⊨ φ(v̄) ⟺ tp_q(G, v̄) ∈ Φ.

The following fact is a consequence of Gaifman's Theorem [5].

▶ Fact 5.
For all q ⩾ 0, there is an r := r(q) ∈ 2^{O(q)} such that for all k ⩾ 1, all vocabularies τ, all τ-coloured graphs G, and all v̄, v̄′ ∈ (V(G))^k: if ltp_{q,r}(G, v̄) = ltp_{q,r}(G, v̄′), then tp_q(G, v̄) = tp_q(G, v̄′).

It is important for us to note that this r can be chosen to depend only on q, independent of the vocabulary τ. From Fact 5, we directly get the following corollary.

▶ Corollary 6.
Let φ(x_1, ..., x_k) be an FO[τ, q]-formula, and let r := r(q) be chosen according to Fact 5. Then for every graph G, there is a set Φ of k-variable q-types such that for all v̄ ∈ (V(G))^k,

G ⊨ φ(v̄) ⟺ ltp_{q,r}(G, v̄) ∈ Φ.

The last lemma in this section shows how to combine types of tuples.

▶ Lemma 7.
Let v̄, v̄′ ∈ (V(G))^k, w̄ ∈ (V(G))^ℓ, and x̄ ∈ (V(G))^m. If ltp_{q,r}(G, v̄w̄) = ltp_{q,r}(G, v̄′w̄) and ltp_{q,r}(G, v̄x̄) = ltp_{q,r}(G, v̄′x̄), then ltp_{q,r}(G, v̄w̄x̄) = ltp_{q,r}(G, v̄′w̄x̄).

Parameterized Complexity

A parameterized problem (over a finite alphabet Σ) is a pair (L, κ), where L ⊆ Σ* and κ: Σ* → ℕ is a polynomial-time computable function, called the parametrization of Σ*. If the parametrization κ is clear from the context, we may omit it. We say that (L, κ) is fixed-parameter tractable, or (L, κ) ∈ FPT, if there is an algorithm A, a computable function f: ℕ → ℕ, and a polynomial p such that A decides L and runs in time f(κ(x)) · p(|x|) on input x ∈ Σ*. The class FPT essentially captures all tractable parameterized problems.
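For contrast with fixed-parameter tractability, the model-checking problem for first-order logic (FO-MC, formally defined below) can be solved naively in time n^{O(q)} by cycling through all assignments for each of the q quantifiers. The following sketch uses a nested-tuple encoding of formulas, which is an illustrative assumption, not the paper's encoding.

```python
# A naive FO model checker on graphs: each quantifier ranges over all n
# vertices, so a sentence of quantifier rank q is evaluated in n^{O(q)}
# steps.  Formulas are nested tuples such as ("exists", "x", body).

def holds(adj, phi, env=None):
    env = env or {}
    op = phi[0]
    if op == "E":                                 # edge atom E(x, y)
        return env[phi[2]] in adj[env[phi[1]]]
    if op == "eq":                                # equality atom x = y
        return env[phi[1]] == env[phi[2]]
    if op == "not":
        return not holds(adj, phi[1], env)
    if op == "and":
        return all(holds(adj, f, env) for f in phi[1:])
    if op == "or":
        return any(holds(adj, f, env) for f in phi[1:])
    if op == "exists":                            # try all n vertices
        return any(holds(adj, phi[2], {**env, phi[1]: v}) for v in adj)
    if op == "forall":
        return all(holds(adj, phi[2], {**env, phi[1]: v}) for v in adj)
    raise ValueError(op)

triangle = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
path3 = {0: {1}, 1: {0, 2}, 2: {1}}
# "every two distinct vertices are adjacent"
clique = ("forall", "x", ("forall", "y",
          ("or", ("eq", "x", "y"), ("E", "x", "y"))))
```

The point of the parameterized question is whether the quantifier-rank-dependent exponent can be moved out of the n term, i.e. whether f(|φ|) · n^O(1) is achievable; on general graphs this is precisely what AW[*]-hardness rules out under standard assumptions.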
In the context of machine learning algorithms, we usually think of training examples as being given as a list of tuples (v̄_1, λ_1), ..., (v̄_m, λ_m) ∈ V(G)^k × {0, 1}. However, in our abstract setting, it is often more convenient to consider them as sets of positive and negative examples Λ⁺, Λ⁻ with Λ⁺ = {v̄_i | λ_i = 1} and Λ⁻ = {v̄_i | λ_i = 0}. The training error of a concept (φ, w̄) on positive and negative examples Λ⁺, Λ⁻ ⊆ (V(G))^k is

err_{(Λ⁺,Λ⁻)}(φ, w̄) := (1/m) · |{v̄ ∈ Λ⁺ | G ⊭ φ(v̄; w̄)} ∪ {v̄ ∈ Λ⁻ | G ⊨ φ(v̄; w̄)}|,

where m is the number of training examples. We always assume that all examples are distinct.

Let L, Q: ℕ³ → ℕ be functions with L(ℓ, k, q) ⩾ ℓ and Q(q, k, ℓ) ⩾ q for all k, ℓ, q ∈ ℕ. The first learning problem we consider is the problem FO-Learn.

FO-Learn[L, Q]
Input
Graph G, sets of positive and negative training examples Λ⁺, Λ⁻ ⊆ (V(G))^k, k, ℓ ∈ ℕ, q ∈ ℕ
Parameter
k + ℓ + q + |τ|
Problem
Return an FO-formula φ(x̄; ȳ) of quantifier rank at most Q(q, k, ℓ) and a tuple w̄ ∈ (V(G))^{L(ℓ,k,q)} such that err_{(Λ⁺,Λ⁻)}(φ, w̄) = 0. The algorithm may reject if there is no FO-formula φ*(x̄; ȳ) of quantifier rank at most q and no tuple w̄* ∈ (V(G))^ℓ such that err_{(Λ⁺,Λ⁻)}(φ*, w̄*) = 0.

Note the usage of the quantifier rank in the learning problem. The learner returns a formula φ with qr(φ) ⩽ Q(q, k, ℓ) if there is a formula φ* with qr(φ*) ⩽ q dividing the examples, or rejects if there is no such φ*. A similar distinction is made for the number of parameters ℓ. Typically, the new quantifier rank only depends on the old quantifier rank, and in most cases Q(q, k, ℓ) = q. Similarly, in most cases L is the identity I, which means that the algorithm is allowed to use exactly as many parameters as the underlying concept. In most of the cases where the number of parameters or the quantifier rank is increased, this increase is linear in ℓ or q, respectively. Generally, the learning algorithm is allowed to use additional parameters and more complex formulas (with higher quantifier rank).

In the previous version of the learning problem, we expected the learner to always return a consistent hypothesis (φ, w̄). It is easy to generalise this to a setting where there is not necessarily a formula of the desired quantifier rank and a tuple of the given size that exactly fits the training data. Thus, if there is a formula with an error of ε*, we aim to find a formula (again with possibly larger quantifier rank and more parameters) which approximately matches this error. Since this variant of the learning problem is closely connected to the agnostic setting, we call it FO-ALearn.

FO-ALearn[L, Q]
Input
Graph G, sets of positive and negative training examples Λ⁺, Λ⁻ ⊆ (V(G))^k, k, ℓ ∈ ℕ, q ∈ ℕ, ε > 0
Parameter
k + ℓ + q + |τ| + 1/ε
Problem
Let ε* ⩾ 0 be minimal such that there is an FO-formula φ*(x̄; ȳ) of quantifier rank at most q and a tuple w̄* ∈ (V(G))^ℓ with err_{(Λ⁺,Λ⁻)}(φ*, w̄*) ⩽ ε*. Then the task is to return an FO-formula φ(x̄; ȳ) of quantifier rank at most Q(q, k, ℓ) and a tuple w̄ ∈ (V(G))^{L(ℓ,k,q)} such that err_{(Λ⁺,Λ⁻)}(φ, w̄) ⩽ ε* + ε.

In this section, we prove that the learning problem is at least as hard as the model-checking problem for first-order logic on the same class of background structures. In particular, this implies that learning FO-formulas on arbitrary graphs is AW[*]-hard, and thus not fixed-parameter tractable under reasonable complexity-theoretic assumptions. The model-checking problem for first-order logic is defined as follows.

FO-MC
Input
Graph G, FO-sentence φ
Parameter
|φ|
Problem
Decide whether G ⊨ φ holds.

It is well known that FO-MC is complete for the parameterized complexity class AW[*].

▶ Theorem 8.
There is an algorithm with access to an FO-Learn oracle that solves FO-MC in time O(g(q) · n²) for some function g: ℕ → ℕ.

Proof.
We describe a recursive algorithm for solving FO-MC with an oracle for FO-Learn. For the proof, we first describe an algorithm for the case L(0, k, q) = 0, that is, the formula returned by FO-Learn is not allowed to use parameters if we call FO-Learn with ℓ = 0. We will then extend the proof to the general case. Let G be a graph with n vertices and φ an FO-sentence of quantifier rank at most q. Without loss of generality, let φ = ∃x ψ(x) for some FO-formula ψ(x) of quantifier rank at most q − 1.

With O(n²) calls to the oracle FO-Learn, we compute a partition P_1, ..., P_s of V(G) such that for all i ∈ [s] and all vertices v, w ∈ P_i we have tp_{q−1}(G, v) = tp_{q−1}(G, w), and the number s of classes is bounded in terms of q (and does not depend on the size of the input graph). For assigning the nodes to the classes, we define a graph G_P with an edge between any two nodes v, w for which the oracle states on input Λ⁻ = {v} and Λ⁺ = {w} that v, w are indistinguishable. For the call of FO-Learn[L, Q], we use the parameters k = 1, ℓ = 0, and set q from the problem to q − 1. We then assign each connected component of G_P to one of the P_i. (Two nodes are joined by an edge only if no formula of quantifier rank q − 1 distinguishes them, so all nodes in a connected component have the same (q − 1)-type; the oracle, however, may separate nodes using formulas of quantifier rank up to Q(q − 1, k, ℓ), which is why we need the connected components of G_P.)

Up to logical equivalence, there are only finitely many FO-formulas χ(x) of quantifier rank at most Q(q − 1, k, ℓ). Hence, the number of classes s can be bounded in terms of q by some function f: ℕ → ℕ, i.e. s ⩽ f(q), independently of G and φ.

For every i ∈ [s], pick an arbitrary element v_i ∈ P_i to represent the class P_i. Note that G ⊨ φ if and only if G ⊨ ψ(v_i) for some i ∈ [s]. We simulate that v_i is a constant by substituting every atom in which v_i occurs. Let G_i be the graph G expanded with the unary relations E_{v_i} and =_{v_i}, where E_{v_i}(x) ↔ E(v_i, x) and =_{v_i}(x) ↔ v_i = x for all x. Let ψ_i := ψ(v_i), where all atoms in which v_i occurs are rewritten using E_{v_i}, =_{v_i}, and Boolean connectives. Then, G ⊨ ψ(v_i) if and only if G_i ⊨ ψ_i. We recursively call our algorithm for every i ∈ [s] to decide whether G_i ⊨ ψ_i.

The recursion tree has depth at most q, and its degree is bounded by f(q). Thus, there are at most O(f(q)^q) recursive calls, and in each of them the oracle is called at most O(n²) times. All in all, the running time of the algorithm is in O(g(q) · n²) for g(i) = f(i)^i.

So far, we only discussed the case L(0, k, q) = 0. If L(0, k, q) > 0, a single parameter could distinguish the two examples in each call of FO-Learn. This can be avoided by calling FO-Learn on L(0, k, q) + 1 disjoint copies of the input defined above. Concretely, instead of calling FO-Learn on G with Λ⁺ = {v} and Λ⁻ = {w}, we call it on

G′ = ⨄_{i=1}^{L(0,k,q)+1} G^(i),  Λ⁺ = {v^(1), ..., v^(L(0,k,q)+1)},  Λ⁻ = {w^(1), ..., w^(L(0,k,q)+1)},

where G^(i) is the i-th copy of G and v^(i), w^(i) are the copies of v, w in G^(i). Then, in at least one of the copies of G, there is no parameter which enforces the original behaviour. ◀

From the complexity of FO-MC, Theorem 1 follows directly.

▶ Theorem 1. FO-Learn is hard for the parameterized complexity class AW[*] under parameterized Turing reductions.

For a class C of graphs and a parameterized problem (M, κ), with M being FO-Learn[L, Q], FO-ALearn[L, Q], or FO-MC, we say that (M, κ) is fixed-parameter tractable on C if there is an algorithm A, a computable function f: ℕ → ℕ, and a polynomial p such that, on instances x where the encoded graph G is in C, A decides M and runs in time f(κ(x)) · p(|x|).

▶ Lemma 9.
Let C be a class of graphs such that FO-MC is fixed-parameter tractable on C. Let L, Q: ℕ³ → ℕ with L(ℓ, k, q) ⩾ ℓ and Q(q, k, ℓ) ⩾ q for all k, ℓ, q. Then, for every constant ℓ ∈ ℕ, the problems FO-Learn[L, Q] and FO-ALearn[L, Q] are fixed-parameter tractable on C, i.e. there is a function f: ℕ → ℕ and some c > 0 such that FO-Learn[L, Q] and FO-ALearn[L, Q] are solvable in time f(k + q + |τ|) · m · |V(G)|^c, where m is the number of training examples and G ∈ C is the input graph.

Here, a simple brute-force test of all |V(G)|^ℓ possible parameter configurations can be done within the given time bound. Note that we do not need to search over L(ℓ, k, q) parameters, as we assume the existence of a solution with only ℓ parameters, which we detect by the exhaustive search. A formal proof is given in the appendix.

▶ Lemma 10.
Let C be a class of graphs such that FO-MC is fixed-parameter tractable on C. Let L, Q: ℕ³ → ℕ with L(ℓ, k, q) ⩾ ℓ and Q(q, k, ℓ) ⩾ q for all k, ℓ, q. Then, for k = 1, the problem FO-Learn[L, Q] is fixed-parameter tractable on C, i.e. there is a function f: ℕ → ℕ and some c > 0 such that FO-Learn[L, Q] is solvable in time f(ℓ + q + |τ|) · m · |V(G)|^c, where m is the number of training examples and G ∈ C is the input graph.

Here, techniques similar to those used in [8, 6] can be applied to find a consistent parameter setting. Again, a formal proof is given in the appendix. Because the algorithm for Lemma 10 explicitly uses the consistency of the returned hypothesis, the result does not trivially extend to FO-ALearn.

The following result is a precise version of Theorem 2.

▶ Theorem 11.
There are functions L, Q: ℕ³ → ℕ such that the problem FO-ALearn[L, Q] is fixed-parameter tractable on every nowhere dense graph class C.

The theorem easily follows from the following lemma.

▶ Lemma 12.
There is an algorithm A with the following properties. Let C be an effectively nowhere dense graph class, k ⩾ 1, ℓ, q ⩾ 0, and let L, Q ⩾ 0 be values depending on k, ℓ, q. Given a graph G ∈ C with n vertices, positive and negative examples Λ⁺, Λ⁻ ⊆ (V(G))^k, and an ε > 0, A runs in time f(k + ℓ + q + |τ| + 1/ε) · (n + m)^O(1) for some function f, where m = |Λ⁺| + |Λ⁻|. Let ε* ⩾ 0 be minimal such that there is an FO-formula φ*(x̄; ȳ) of quantifier rank at most q with k + ℓ free variables and a tuple w̄* ∈ (V(G))^ℓ such that err_{(Λ⁺,Λ⁻)}(φ*, w̄*) ⩽ ε*. Then, the algorithm returns an FO-formula φ(x̄; ȳ) of quantifier rank at most Q with k + L free variables and a tuple w̄ ∈ (V(G))^L such that err_{(Λ⁺,Λ⁻)}(φ, w̄) ⩽ ε* + ε.

The remainder of this section is devoted to the proof of Lemma 12. In the following, we let C be an effectively nowhere dense graph class, and we fix k ⩾ 1 and q, ℓ ⩾ 0. We choose r = r(q) according to Fact 5 and let R := 3^{ℓ−1} · (k + 2)(2r + 1). The specific choice of R will be justified in the proof of Lemma 15 in the appendix. Let G ∈ C, and let s ⩾ 1 be such that Splitter has a winning strategy for the (R, s)-splitter game on G. The value s can be computed since C is effectively nowhere dense. We let V := V(G), E := E(G), n := |V|, L := ℓ · s, and Q := q + log R. Furthermore, let Λ⁺, Λ⁻ ⊆ V^k be disjoint sets of positive and negative examples, and let Λ := Λ⁺ ∪ Λ⁻ and m := |Λ|. Let ε* ⩾ 0 be minimal such that there is an FO-formula φ*(x̄; ȳ) of quantifier rank at most q with k + ℓ free variables and a tuple w̄* = (w*_1, ..., w*_ℓ) ∈ V^ℓ such that err_{(Λ⁺,Λ⁻)}(φ*, w̄*) ⩽ ε*.

Let ε > 0. Whenever we speak of an fpt algorithm in the following, we mean an algorithm running in time f(k, ℓ, q, s, 1/ε) · (n + m)^O(1). Our goal is to compute a concept (φ, w̄), our hypothesis, consisting of an FO-formula φ(x̄; ȳ) of quantifier rank at most Q with k + L free variables and a tuple w̄ ∈ V^L such that err_{(Λ⁺,Λ⁻)}(φ, w̄) ⩽ ε* + ε, using an fpt algorithm. This means that the hypothesis correctly classifies all but m · (ε* + ε) of the examples from Λ.

A tuple v̄ ∈ V^{ℓ′} is called ε′-discriminating if there is a formula ψ of quantifier rank at most Q such that err_{(Λ⁺,Λ⁻)}(ψ, v̄) ⩽ ε′. We then say that v̄ discriminates all but (at most) ε′ · m examples. If we have an (ε* + ε)-discriminating L-tuple w̄, then we can find the formula φ by an fpt algorithm that simply steps through all possible formulas, since model checking on nowhere dense graphs is in FPT and there are only finitely many formulas to check.

A conflict in (Λ⁺, Λ⁻) is a pair (v̄⁺, v̄⁻) ∈ Λ⁺ × Λ⁻ such that ltp_{q,r}(G, v̄⁺) = ltp_{q,r}(G, v̄⁻). The type of a conflict (v̄⁺, v̄⁻) is ltp_{q,r}(G, v̄⁺) = ltp_{q,r}(G, v̄⁻). Let Ξ be the set of all conflicts. To resolve a conflict (v̄⁺, v̄⁻) ∈ Ξ, we need to find parameters w̄ such that ltp_{q,r}(G, v̄⁺w̄) ≠ ltp_{q,r}(G, v̄⁻w̄). Only parameters w in the (2r + 1)-neighbourhood of v̄⁺ ∪ v̄⁻ have an effect on the local type. We say that we attend to the conflict (v̄⁺, v̄⁻) if we choose at least one parameter in N^G_{2r+1}(v̄⁺v̄⁻). Note that attending to a conflict is a necessary, but not a sufficient, condition for resolving the conflict. If we do not attend to a conflict, we ignore it. An example v̄ ∈ Λ is critical if it is involved in some conflict, that is, either v̄ ∈ Λ⁺ and there is a v̄⁻ ∈ Λ⁻ such that (v̄, v̄⁻) ∈ Ξ, or v̄ ∈ Λ⁻ and there is a v̄⁺ ∈ Λ⁺ such that (v̄⁺, v̄) ∈ Ξ. Let Γ be the set of all critical tuples.
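The conflicts Ξ and the critical set Γ can be computed directly once local types are available. In the sketch below, `ltp` is a precomputed labelling standing in for ltp_{q,r}(G, v̄); the concrete labels are illustrative assumptions.

```python
# Conflicts: pairs of a positive and a negative example with the same
# local type; critical examples: those involved in at least one conflict.

def conflicts_and_critical(pos, neg, ltp):
    """ltp maps each example tuple to (a stand-in for) its local type."""
    conflicts = {(vp, vn) for vp in pos for vn in neg if ltp[vp] == ltp[vn]}
    critical = {v for pair in conflicts for v in pair}
    return conflicts, critical

pos = {("a",), ("b",)}
neg = {("c",), ("d",)}
ltp = {("a",): 1, ("b",): 2, ("c",): 2, ("d",): 3}   # toy type labels

conf, crit = conflicts_and_critical(pos, neg, ltp)
```

Here only ("b",) and ("c",) share a local type, so they form the single conflict and are the only critical examples; all other examples can be classified by their local type alone, as described below.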
For every w ∈ V, let

Γ(w) := { v̄ ∈ Γ | w ∈ N^G_{2r+1}(v̄) }.

Note that if v̄ ∈ Γ(w), then w attends to all conflicts that v̄ is involved in. The algorithm now uses s steps, corresponding to moves in the splitter game, to find a tuple that discriminates all but m · (ε∗ + ε) examples. For this, we define a sequence of graphs G_0, . . . , G_s and sets of examples Λ_0, . . . , Λ_s with Λ_i = Λ+_i ⊎ Λ−_i ⊆ (V(G_i))^k. Let G_0 = G and Λ_0 = Λ. We will define the graphs G_i for i > 0 later. Every tuple that is not critical can be classified correctly without additional parameters by adding its local type explicitly to the hypothesis. Thus, in every step we only consider examples that have been involved in a conflict in all previous steps.

We now consider the i-th step of the algorithm. Let G_i be the current graph and Γ_i be the set of critical examples in the i-th step. In the following two lemmas, we strategically limit the search space for the parameters such that in each step the error increases only by ε/s. The next lemma shows that there is a small set X of neighbourhood centres such that most of the conflicts can be attended by nodes from N^{G_i}_{4r+2}(X).

▶ Lemma 13.
There is a set X ⊆ V(G_i) of size |X| ⩽ kℓs/ε such that

|Γ_i(u)| < (ε/(ℓ·s)) · |Γ|  for all u ∈ V(G_i) \ N^{G_i}_{4r+2}(X),

where Γ_i(u) is the set of critical tuples in G_i that are affected by u, and Γ is the set of critical tuples in the original graph G. Furthermore, there is an fpt algorithm computing such a set.

Proof.
We inductively define a sequence x_1, . . . , x_p ∈ V(G_i) as follows. For all j ⩾ 1, we choose x_j ∈ V(G_i) such that dist(x_j, x_{j′}) > 4r + 2 for all j′ < j and, subject to this condition, |Γ_i(x_j)| is maximum. If no such x_j exists, we let p = j − 1. For every v̄ ∈ Γ, there are at most k of the x_j such that v̄ ∈ Γ_i(x_j). This holds as for every entry v of v̄ ∈ (V(G_i))^k, there is at most one of the x_j in the (2r+1)-neighbourhood of v by the construction of X. Therefore,

Σ_{j=1}^{p} |Γ_i(x_j)| ⩽ k · |Γ|.

This means that there are at most kℓs/ε of the x_j with |Γ_i(x_j)| ⩾ (ε/(ℓs)) · |Γ|. As the x_j are sorted by decreasing |Γ_i(x_j)|, we have |Γ_i(x_j)| < (ε/(ℓs)) · |Γ| for all j > kℓs/ε. We let X = {x_1, . . . , x_{min{p, kℓs/ε}}}. ◀

In the following, we fix a set X according to the lemma. We show in the next lemma that there is always a tuple of parameters, consisting only of vertices in the neighbourhood of X, that induces only a small additional error.

▶ Lemma 14.
If there is a tuple v̄ ∈ (V(G_i))^ℓ that discriminates all but m · ε′ examples in G_i, then there is a tuple w̄ ∈ (N^{G_i}_{4r+2}(X))^ℓ that discriminates all but m · (ε′ + ε/s) examples in G_i.

Proof.
We split the entries v_j of v̄ into two sets W∗ and U∗, where W∗ contains all v_j that are contained in N^{G_i}_{4r+2}(X) and U∗ contains the remaining entries of v̄. By the choice of X, each u ∈ U∗ can only discriminate up to

|Γ_i(u)| ⩽ m · |Γ_i(u)| / |Γ| ⩽ m · ε/(ℓ·s)

examples, where Γ_i(u) is the set of critical tuples in G_i that are affected by u, and Γ is the set of critical examples in the original graph G. Thus, the elements of U∗ can discriminate at most m · ε/s examples in G_i. Since v̄ discriminates all but m · ε′ examples, the elements of W∗ must discriminate all but m · (ε′ + ε/s) examples. We let w̄ be an ℓ-tuple formed by the elements of W∗, for example choosing w_j = v_j for all v_j ∈ W∗ and setting w_j to a fixed element of W∗ for all other entries. Then w̄ ∈ (N^{G_i}_{4r+2}(X))^ℓ, and w̄ discriminates all but m · (ε′ + ε/s) examples in G_i. ◀

In the proof, w̄ is computed from v̄ by dropping all elements that are far from X. By the choice of X, this only introduces a small error.

We can now describe Step i and the way it is embedded in the overall algorithm. We know that in G_0, the vector w̄∗ discriminates all but m · ε∗ examples by the definition of w̄∗ and ε∗. In all later graphs G_i, there is a vector that discriminates all but m · (ε∗ + iε/s) of the examples Λ_i, which will be guaranteed by the construction of G_i and Λ_i in Lemma 15. By Lemma 14, there is also a tuple close to X that discriminates all but m · (ε∗ + (i+1)ε/s) examples. Thus, by further restricting the set of possible parameters, we increase the error in each step only slightly. In Step i, we choose ℓ parameters. Together with the parameters chosen in all following steps, this results in the desired tuple that discriminates all but (at most) m · (ε∗ + ε) examples.
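The construction of w̄ from v̄ in the proof of Lemma 14 is a simple keep-or-overwrite step. The following minimal sketch (not from the paper) abstracts the membership test for the neighbourhood of X into a predicate `in_nbhd`:

```python
def pad_to_neighbourhood(vbar, in_nbhd):
    """Keep the entries of vbar that lie in the neighbourhood of X
    (membership given by the predicate `in_nbhd`) and overwrite every
    remaining entry with one fixed kept entry, as in the proof of
    Lemma 14.  Returns None if no entry lies in the neighbourhood."""
    kept = [v for v in vbar if in_nbhd(v)]
    if not kept:
        return None
    return tuple(v if in_nbhd(v) else kept[0] for v in vbar)
```

The overwritten entries are exactly the elements of U∗, which by the choice of X only account for a small fraction of the discriminated examples.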
Our next goal is to find such a tuple w̄ ∈ (N^{G_i}_{4r+2}(X))^ℓ that (together with the parameters from the following steps) discriminates all but m · (ε∗ + ε) examples in G_i. Clearly, for each such tuple of arity ℓ, there is a subset Y ⊆ X of size |Y| ⩽ ℓ such that w̄ ∈ (N^{G_i}_{4r+2}(Y))^ℓ. We non-deterministically guess such a set Y = {y_1, . . . , y_{ℓ′}} and keep it fixed in the following. Simulating this non-deterministic guess by a deterministic algorithm adds a multiplicative cost of |X|^ℓ ⩽ (kℓs/ε)^ℓ, which is allowed in an fpt algorithm.

When searching for parameters in N^{G_i}_{4r+2}(Y), we can only attend to conflicts with at least one element in N^{G_i}_{6r+3}(Y), as all other conflicts cannot be attended by parameters from N^{G_i}_{4r+2}(Y). We apply Lemma 3 and obtain a set Z ⊆ Y and an R′ = 3^j · ((k+2)(2r+1)), where 0 ⩽ j ⩽ |Y| − 1 ⩽ ℓ − 1, such that

N^{G_i}_{R′}(z) ∩ N^{G_i}_{R′}(z′) = ∅ for all distinct z, z′ ∈ Z, and N^{G_i}_{(k+2)(2r+1)}(Y) ⊆ N^{G_i}_{R′}(Z).

Note that R′ ⩽ R, where R is the radius from the (R, s)-splitter game defined at the very beginning of the proof. We will see in the proof of Lemma 15 why we need ((k+2)(2r+1))-neighbourhoods. Suppose that Z = {z_1, . . . , z_{ℓ″}}, where ℓ″ ⩽ ℓ′ ⩽ ℓ. For every j ∈ [ℓ″], let w_j together with the radius R′ be Splitter's answer if Connector picks z_j in the (modified) (R, s)-splitter game on G_i. Note that we only consider possible picks z_j ∈ Z and not arbitrary choices. The reason why Splitter's answers to the z_j suffice is that if we can identify every node in N^{G_i}_{R′}(Z), then we can resolve (almost) all conflicts consisting of vectors of vertices from that set. Essentially, Splitter guarantees that after removing a certain set of points (her answers in the splitter game), every node in the neighbourhood can be identified in at most s − (i+1) steps if she had a winning strategy for the current graph in s − i steps (which is guaranteed by the construction of G_i).

We then choose the vertices ŵ_i = (w_1, . . . , w_ℓ) as parameters in Step i, where w_j := w_{ℓ″} for all j ∈ {ℓ″ + 1, . . . , ℓ}. With those parameters, we can now define the next graph G_{i+1} and the next set of examples Λ_{i+1}.

▶ Lemma 15.
There is a graph G_{i+1} and a set of examples Λ_{i+1} = Λ+_{i+1} ⊎ Λ−_{i+1} ⊆ (V(G_{i+1}))^k such that the following holds.
(1) V(G_{i+1}) = N^{G_i}_{R′}(Z) ⊎ U ⊎ U∗, where U is a set of fresh isolated vertices in G_{i+1} and U∗ is the set of all isolated vertices in G_i. |U| only depends on k, ℓ, q, but not on m or n.
(2) E(G_{i+1}) ⊆ E(G_i).
(3) Splitter has a winning strategy for the (R, s − (i+1))-splitter game on G_{i+1}.
(4) If there is a tuple w̄_i that discriminates all but m · ε′ examples in G_i, then there is a tuple w̄_{i+1} that discriminates all but m · (ε′ + ε/s) examples from Λ_{i+1} in G_{i+1}.
(5) If there is a tuple w̄_{i+1} from V(G_{i+1}) that discriminates all but c examples from Λ_{i+1}, then the tuple ŵ_i ū_{i+1} discriminates all but c examples from Λ_i, where ū_{i+1} is derived from w̄_{i+1} by dropping all entries not in V(G_i).
Furthermore, there is an fpt algorithm computing G_{i+1} and Λ_{i+1}.

The formal proof of the lemma is given in the appendix. From the lemma itself and its proof, we observe the following.

▶ Remark 16.
For the graph G_{i+1}, the following holds:
- The isolated vertices in G_{i+1} are not suitable parameters, as any conflict they resolve could also be resolved without the use of additional parameters. Thus, we can assume that all nodes from w̄_{i+1} also appear in G_i, and we have ū_{i+1} = w̄_{i+1} in Statement (5).
- Every conflict in the examples Λ_{i+1} in the graph G_{i+1} corresponds to a conflict between examples from Λ_i in the graph G_i. The number of conflicts can thus only decrease.
- G_{i+1} is nowhere dense because Splitter has a winning strategy.
- Λ_{i+1} contains exactly the critical examples from Λ_i.
- Examples in Λ_{i+1} are projections of examples in Λ_i into V(G_{i+1}), essentially changing all those nodes that are not included in N^{G_i}_{R′}(Z) to isolated vertices that represent types. The type each isolated vertex represents is encoded by its colour.

Note that only the last two items in the remark do not directly follow from the lemma (and are details of its proof). With the construction of the graph G_{i+1} and the corresponding set of examples Λ_{i+1}, we now have all the ingredients to prove Lemma 12.

Proof of Lemma 12.
We start the algorithm by computing the number of steps s in the splitter game. We set G_0 = G and Λ_0 = Λ, as described above. For i = 0 to s, we perform the following steps.

We use Lemma 13 to compute a set X. By Lemma 14, we know that there exists a tuple in the neighbourhood of X that discriminates all but m · (ε∗ + (i+1)ε/s) examples. This step is crucial, as only parameters in the neighbourhood of X have a high impact, and we thus limit our search for parameters to this area. Over all s steps, this additional error sums up to ε. Now, we non-deterministically guess a subset Y ⊆ X of size at most ℓ. Unrolling this non-deterministic guess adds a factor that only depends on the parameters of the problem, so this can be done by an fpt algorithm. Next, we apply Lemma 3 and obtain Z. By saving Splitter's answers in the splitter game to the picks z_j ∈ Z in the parameters w_j, we obtain the vector ŵ_i. With this vector ŵ_i, we can now apply Lemma 15 and compute the next graph G_{i+1} and the next set of examples Λ_{i+1} to proceed to Step i + 1.

In Step s, we know that the tuple ŵ_s discriminates all but m · (ε∗ + ε) examples from Λ_s in G_s. By concatenating all ŵ_i, we obtain a tuple w̄ of length L = ℓ · s that discriminates all but m · (ε∗ + ε) examples in the original graph G. We finish the computation by testing all possible formulas of quantifier rank Q with k + L free variables and are guaranteed to find at least one φ with err_{(Λ+,Λ−)}(φ, w̄) ⩽ ε∗ + ε.

Overall, the problem FO-ALearn is in FPT, since there is an fpt algorithm for model checking first-order formulas on nowhere dense graphs, and the bound on the number of such formulas we need to check only depends on k, L, and Q. All intermediate steps can also be performed by fpt algorithms. ◀

We have studied the parameterized complexity of learning first-order definable concepts over classes of finite structures.
Our main result is a fixed-parameter tractable agnostic PAC learning algorithm for first-order definable concepts over nowhere dense classes of structures. We also obtain a hardness result for the exact learning problem FO-Learn. It remains open whether this hardness result can be extended to the agnostic learning problem FO-ALearn.

Our fixed-parameter tractable learning algorithm increases all hyperparameters of the learning problem, that is, the quantifier rank, the number of parameters, and the approximation error. We leave it as an interesting and challenging open problem to determine whether all these increases are really necessary.

It would also be interesting to extend our results to richer logics such as the extensions of first-order logic with counting or weight aggregation. Similarly, it would be interesting to prove that concepts definable in monadic second-order logic over structures of bounded tree width or bounded clique width can be learned by fixed-parameter tractable algorithms.

Finally, while we know that first-order definable concepts admit no sublinear learning algorithms on nowhere dense graph classes, it might be possible to obtain such algorithms after a polynomial-time preprocessing phase (similar to the results of [8, 6] for monadic second-order logic on strings and trees).
References

[1] A. Abouzied, D. Angluin, C. H. Papadimitriou, J. M. Hellerstein, and A. Silberschatz. Learning and verifying quantified boolean queries by example. In R. Hull and W. Fan, editors, Proceedings of the 32nd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pages 49–60, 2013.
[2] K. A. Abrahamson, R. G. Downey, and M. R. Fellows. Fixed-parameter tractability and completeness IV: On completeness for W[P] and PSPACE analogs. Annals of Pure and Applied Logic, 73:235–276, 1995.
[3] A. Blumer, A. Ehrenfeucht, D. Haussler, and M. K. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM, 36:929–965, 1989.
[4] A. Bonifati, R. Ciucanu, and S. Staworko. Learning join queries from user examples. ACM Transactions on Database Systems, 40(4):24:1–24:38, 2016.
[5] Haim Gaifman. On local and non-local properties. In Jacques Stern, editor, Proceedings of the Herbrand Symposium, volume 107 of Studies in Logic and the Foundations of Mathematics, pages 105–135. North-Holland, 1982. doi:10.1016/S0049-237X(08)71879-2.
[6] Emilie Grienenberger and Martin Ritzert. Learning definable hypotheses on trees. In 22nd International Conference on Database Theory (ICDT 2019), pages 24:1–24:18, 2019. doi:10.4230/LIPIcs.ICDT.2019.24.
[7] Martin Grohe, Stephan Kreutzer, and Sebastian Siebertz. Deciding first-order properties of nowhere dense graphs. Journal of the ACM, 64(3):17:1–17:32, 2017. doi:10.1145/3051095.
[8] Martin Grohe, Christof Löding, and Martin Ritzert. Learning MSO-definable hypotheses on strings. In International Conference on Algorithmic Learning Theory, ALT 2017, 15-17 October 2017, Kyoto University, Kyoto, Japan, pages 434–451, 2017. URL: http://proceedings.mlr.press/v76/grohe17a.html.
[9] Martin Grohe and Martin Ritzert. Learning first-order definable concepts over structures of small degree. In 32nd Annual ACM/IEEE Symposium on Logic in Computer Science (LICS 2017), pages 1–12, 2017. doi:10.1109/LICS.2017.8005080.
[10] Martin Grohe and György Turán. Learnability and definability in trees and similar structures. Theory of Computing Systems, 37(1):193–220, 2004. doi:10.1007/s00224-003-1112-8.
[11] David Haussler. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100(1):78–150, 1992. doi:10.1016/0890-5401(92)90010-D.
[12] Michael J. Kearns and Umesh V. Vazirani. An Introduction to Computational Learning Theory. MIT Press, 1994. URL: https://mitpress.mit.edu/books/introduction-computational-learning-theory.
[13] M. C. Laskowski. Vapnik-Chervonenkis classes of definable sets. Journal of the London Mathematical Society (2), 45:377–384, 1992.
[14] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, New York, NY, USA, 2014.
[15] Balder ten Cate, Víctor Dalmau, and Phokion G. Kolaitis. Learning schema mappings. ACM Transactions on Database Systems, 38(4):28:1–28:31, 2013. doi:10.1145/2539032.2539035.
[16] Leslie G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, 1984. doi:10.1145/1968.1972.
[17] Steffen van Bergerem. Learning concepts definable in first-order logic with counting. In 34th Annual ACM/IEEE Symposium on Logic in Computer Science (LICS 2019), pages 1–13, 2019. doi:10.1109/LICS.2019.8785811.
[18] Steffen van Bergerem and Nicole Schweikardt. Learning concepts described by weight aggregation logic. In Christel Baier and Jean Goubault-Larrecq, editors, 29th EACSL Annual Conference on Computer Science Logic (CSL 2021), volume 183 of LIPIcs, pages 10:1–10:18. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2021. doi:10.4230/LIPIcs.CSL.2021.10.
[19] V. Vapnik and A. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16:264–280, 1971.
A Appendix

A.1 Proof of Lemma 9

▶ Lemma 9.
Let C be a class of graphs such that FO-MC is fixed-parameter tractable on C. Let L, Q : ℕ³ → ℕ with L(ℓ, k, q) ⩾ ℓ and Q(q, k, ℓ) ⩾ q for all k, ℓ, q. Then, for a constant ℓ ∈ ℕ, the problems FO-Learn[L, Q] and FO-ALearn[L, Q] are fixed-parameter tractable on C, i.e. there are a function f : ℕ → ℕ and some c > 0 such that FO-Learn[L, Q] and FO-ALearn[L, Q] are solvable in time f(k + q + |τ|) · m · |V(G)|^c, where m is the number of training examples and G ∈ C is the input graph.

Algorithm 1 Learning algorithm for constant ℓ

Require: sets of positive and negative training examples Λ+, Λ− ⊆ (V(G))^k with Λ+ ∩ Λ− = ∅
for all φ ∈ Φ∗ do
    for all v̄ ∈ (V(G))^ℓ do
        consistent ← true
        for all ū ∈ Λ+ ∪ Λ− do
            let G∗ be the graph with V(G∗) = V(G), R_i(G∗) = {u_i} for i ⩽ k, S_i(G∗) = {v_i} for i ⩽ ℓ, and R(G∗) = R(G) for all other relations R
            if (ū ∈ Λ− and G∗ ⊨ φ) or (ū ∈ Λ+ and G∗ ⊭ φ) then
                consistent ← false
        if consistent then
            return (φ′(x̄; ȳ), v̄), where φ′ is obtained from φ by replacing all occurrences of R_i x with x = x_i and all occurrences of S_i x with x = y_i
reject

Proof.
We start the proof of the learnability for the case that ℓ is constant by adding k + ℓ unary relations (colours) to τ: τ∗ := τ ⊎ {R_1, . . . , R_k, S_1, . . . , S_ℓ}. Then we have a finite set Φ∗ ⊆ FO[τ∗, q] of formulas in Gaifman normal form (here: sentences in Gaifman normal form) with quantifier rank at most q. We will search for a sentence that is consistent in the graph G∗ over the extended alphabet τ∗. This sentence is then translated back into a formula with k + ℓ free variables by substituting all occurrences of the additional relations. The translated formula is then returned by the algorithm solving the problem FO-Learn[L, Q]. The algorithm for finding a consistent hypothesis is given in pseudocode in Algorithm 1.

Let n = |V(G)|. Using the assumption that k, ℓ ⩽ n, Algorithm 1 runs in time

|Φ∗| · n^ℓ · m · (k + ℓ + n + g(q + |τ∗|) · n^c) ∈ O(f(k + q + |τ|) · m · n^{ℓ+c+1})

for some functions f and g. In this running time, the first three factors correspond to the for-loops of the algorithm, and the bracketed part comes from model checking.

The proof for FO-ALearn essentially works in the same way as the proof for FO-Learn. The main difference is that in the algorithm for FO-ALearn, the check for consistency is substituted by the computation of the error, and instead of returning the first consistent formula, the algorithm simply tests all pairs of formulas φ ∈ Φ∗ and parameters v̄ ∈ (V(G))^ℓ and returns the pair with the lowest error. Again, we perform the search over formulas in extended structures and translate them back into a formula in the end. The computation of the error is performed by counting the correct and incorrect classifications induced by the pair (φ, v̄). The running time of both algorithms is identical. ◀

A.2 Proof of Lemma 10

▶ Lemma 10.
Let C be a class of graphs such that FO-MC is fixed-parameter tractable on C. Let L, Q : ℕ³ → ℕ with L(ℓ, k, q) ⩾ ℓ and Q(q, k, ℓ) ⩾ q for all k, ℓ, q. Then, for k = 1, the problem FO-Learn[L, Q] is fixed-parameter tractable on C, i.e. there are a function f : ℕ → ℕ and some c > 0 such that FO-Learn[L, Q] is solvable in time f(ℓ + q + |τ|) · m · |V(G)|^c, where m is the number of training examples and G ∈ C is the input graph.

Algorithm 2 Learning algorithm for k = 1

Require: sets of positive and negative training examples Λ+, Λ− ⊆ V(G)
for all φ ∈ Φ∗ do
    consistent ← true
    for i = 1 to ℓ do
        φ_i(x, y_{i+1}, . . . , y_ℓ) := ∃y_1 . . . ∃y_i (⋀_{j=1}^{i} S_j y_j ∧ φ(x, y_1, . . . , y_ℓ))
        found ← false
        for all w ∈ V(G) do
            let G∗ be the graph with V(G∗) = V(G), S_j(G∗) = {v_j} for all j < i, S_i(G∗) = {w}, P+(G∗) = Λ+, P−(G∗) = Λ−, and R(G∗) = R(G) for all other relations R
            if G∗ ⊨ ∃y_{i+1} . . . ∃y_ℓ ∀x ((P+ x → φ_i(x, y_{i+1}, . . . , y_ℓ)) ∧ (P− x → ¬φ_i(x, y_{i+1}, . . . , y_ℓ))) then
                v_i ← w
                found ← true
                break
        if not found then
            consistent ← false
            break
    if consistent then
        return (φ(x; ȳ), v̄)
reject

Proof.
Let τ∗ := τ ⊎ {S_1, . . . , S_ℓ, P+, P−}. Again, similar to the setting in Lemma 9, we have a finite set Φ∗ ⊆ FO[τ∗, q] of formulas in Gaifman normal form with quantifier rank at most q and ℓ + 1 free variables. The algorithm uses a relation P+ for all positive and a relation P− for all negative examples. Since k = 1, those relations are both unary and do not change the graph structure. Using those unary relations for the examples, it is possible to test whether a given prefix of parameters can be extended to a consistent one. The algorithm thus starts by testing for every node u ∈ V(G) whether it can be extended to a consistent parameter setting. If it has found such a node u, it fixes v_1 = u and proceeds with the search for v_2. Let n = |V(G)|. After at most ℓ · n model checking steps, the algorithm has discovered a consistent parameter setting if there is one. As in Lemma 9, we use model checking for sentences only and expand the background structure to contain additional unary relations (colours) for the free variables of the formula. In our case, the free variables of the formulas we evaluate are exactly the parameter prefixes. The algorithm is given in pseudocode in Algorithm 2 and runs in time

|Φ∗| · ℓ · n · (n + m + ℓ + g(q + ℓ + |τ∗|) · n^c) ∈ O(f(ℓ + q + |τ|) · m · n^{c+1})

for some functions f and g. In this running time, the first three factors correspond to the for-loops of the algorithm, and the bracketed part comes from model checking. ◀

A.3 Proof of Lemma 15

▶ Lemma 15.
There is a graph G_{i+1} and a set of examples Λ_{i+1} = Λ+_{i+1} ⊎ Λ−_{i+1} ⊆ (V(G_{i+1}))^k such that the following holds.
(1) V(G_{i+1}) = N^{G_i}_{R′}(Z) ⊎ U ⊎ U∗, where U is a set of fresh isolated vertices in G_{i+1} and U∗ is the set of all isolated vertices in G_i. |U| only depends on k, ℓ, q, but not on m or n.
(2) E(G_{i+1}) ⊆ E(G_i).
(3) Splitter has a winning strategy for the (R, s − (i+1))-splitter game on G_{i+1}.
(4) If there is a tuple w̄_i that discriminates all but m · ε′ examples in G_i, then there is a tuple w̄_{i+1} that discriminates all but m · (ε′ + ε/s) examples from Λ_{i+1} in G_{i+1}.
(5) If there is a tuple w̄_{i+1} from V(G_{i+1}) that discriminates all but c examples from Λ_{i+1}, then the tuple ŵ_i ū_{i+1} discriminates all but c examples from Λ_i, where ū_{i+1} is derived from w̄_{i+1} by dropping all entries not in V(G_i).
Furthermore, there is an fpt algorithm computing G_{i+1} and Λ_{i+1}.

Proof of Lemma 15.
We let G_{i+1} be the graph obtained from the induced R′-neighbourhood structure N^{G_i}_{R′}(Z) as follows. We start with the construction; the role of each of the steps will become clear while proving the statements of the lemma.

1. Expand the graph by fresh colours D_{j,d} for j ∈ [ℓ′] and d ∈ {0, . . . , (k+2)(2r+1)}. We let D_{j,d}(G_{i+1}) be the set of all v such that dist_{G_i}(v, y_j) = d.
2. Expand the graph by fresh colours C_j for j ∈ [ℓ″]. Let C_j(G_{i+1}) := N^{G_i}(w_j).
3. Delete all edges incident with the w_j. Moreover, we add fresh colours B_j for j ∈ [ℓ″] and let B_j(G_{i+1}) := {w_j}.
4. For each nonempty set I ⊂ [k] and each |I|-variable q-type θ ∈ Tp[τ, |I|, q], we add an isolated vertex t_{I,θ} and a fresh colour A_{I,θ}, and let A_{I,θ}(G_{i+1}) = {t_{I,θ}}.

Thus, structurally, G_{i+1} consists of the neighbourhoods G_{i+1,j} := G_i[N^{G_i}_{R′}(z_j) \ {w_j}] for j ∈ [ℓ″] and the isolated vertices w_1, . . . , w_{ℓ″} and t_{I,θ} for all nonempty I ⊂ [k] and θ ∈ Tp[τ, |I|, q]. Hence, Statements (1) and (2) of Lemma 15 hold by the construction of G_{i+1}. Note that the neighbourhoods G_{i+1,j} and G_{i+1,j′} are disconnected for all j ≠ j′ by the construction of Z. Splitter's winning strategy on G_i is still valid on G_{i+1}, but one of the steps of the splitter game (removing a vertex Spoiler chose and continuing in the neighbourhood of a vertex Connector chose) has already been performed in each of the neighbourhoods G_{i+1,j} in the construction of G_{i+1}. Hence, Statement (3) follows.

Next, we define the set Λ_{i+1} = Λ+_{i+1} ⊎ Λ−_{i+1} ⊆ (V(G_{i+1}))^k of examples for the next round. For this step, we will need the isolated vertices introduced in Step 4 of the construction of G_{i+1}. We only need to consider examples in Γ_i, which are involved in a conflict; all other examples can be correctly classified without the use of parameters. For each tuple v̄ = (v_1, . . . , v_k) ∈ Γ_i with v̄ ∩ N^{G_i}_{6r+3}(Y) ≠ ∅, we define a tuple v̄′ = (v′_1, . . . , v′_k) ∈ (V(G_{i+1}))^k that we add to Λ+_{i+1} if v̄ ∈ Λ+_i and to Λ−_{i+1} if v̄ ∈ Λ−_i. For that, we consider the graph H_v̄ with vertex set V(H_v̄) := [k] and edges {a, b} for all a, b ∈ [k] with 1 ⩽ dist_{G_i}(v_a, v_b) ⩽ 2r + 1. Let I_1, . . . , I_p be the vertex sets of the connected components of H_v̄. For each component I_j, we proceed as follows. If there is some b ∈ I_j such that v_b ∈ N^{G_i}_{6r+3}(Y), we let v′_a := v_a for all a ∈ I_j. Note that for all a ∈ I_j we have dist_{G_i}(v_b, v_a) ⩽ (k−1)(2r+1), and hence dist_{G_i}(Y, v_a) ⩽ 6r + 3 + (k−1)(2r+1) = (k+2)(2r+1). Thus, v′_a ∈ N^{G_i}_{(k+2)(2r+1)}(Y) ⊆ N^{G_i}_{R′}(Z) ⊆ V(G_{i+1}). This is the point that determines the choice of R of the splitter game in the very beginning of the proof. Otherwise, if v_a ∉ N^{G_i}_{6r+3}(Y) for all a ∈ I_j, we consider the restriction v̄|_{I_j} of v̄ to the indices in I_j and let θ := ltp_{q,r}(G_i, v̄|_{I_j}). In this case, we let v′_a := t_{I_j,θ} for all a ∈ I_j. Note that for some examples this means that they only consist of isolated vertices and thus will never be correctly classified by our algorithm. Observe that two tuples v̄′_1, v̄′_2 ∈ Λ_{i+1} can only have the same type in G_{i+1} if their counterparts v̄_1, v̄_2 ∈ Λ_i have the same type in G_i. Hence, we do not create any new conflicts by the construction.

Let A = N^{G_i}_{R′}(Z), and let A′ be the modified version of this neighbourhood in G_{i+1}. It is easy to interpret A in G_{i+1} using the information encoded in the fresh colours added in Steps 1-3 of the construction of G_{i+1}. Using the parameters w_1, . . . , w_{ℓ″}, we can also interpret the modified neighbourhood A′ in G_i.
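The decomposition of an example tuple into the components of H_v̄ is purely combinatorial and can be sketched as follows (illustrative only; `dist` is assumed to compute distances in G_i, and the indices are 0-based):

```python
def components_of_H(vbar, dist, r):
    """Connected components of the auxiliary graph H_v on the index set
    {0, ..., k-1}, with an edge between a and b whenever
    1 <= dist(v_a, v_b) <= 2r+1."""
    k = len(vbar)
    adj = {a: [b for b in range(k)
               if a != b and 1 <= dist(vbar[a], vbar[b]) <= 2 * r + 1]
           for a in range(k)}
    seen, comps = set(), []
    for a in range(k):
        if a in seen:
            continue
        stack, comp = [a], set()     # depth-first search from index a
        while stack:
            c = stack.pop()
            if c in comp:
                continue
            comp.add(c)
            stack.extend(adj[c])
        seen |= comp
        comps.append(comp)
    return comps
```

Each component is then either kept verbatim (if it has an entry close to Y) or collapsed to a single type vertex, as described above.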
We use this connection to prove Statements (4) and (5). Note that for encoding the distance information, the increased quantifier rank Q might be necessary, which is why we chose Q = q + log R.

Assume there is a tuple w̄_i ∈ (V(G_i))^ℓ that discriminates all but m · ε′ examples in G_i. By Lemma 14, there is a tuple ū_i ∈ (N^{G_i}_{4r+2}(Z))^ℓ that discriminates all but m · (ε′ + ε/s) examples in G_i. Then ū_i also discriminates all but m · (ε′ + ε/s) examples in G_{i+1}, due to the way we projected the examples and the fact that we can emulate A in G_{i+1}. This shows Statement (4).

If we can distinguish v̄′_1, v̄′_2 in G_{i+1} using parameters w̄_{i+1}, then we can distinguish v̄_1, v̄_2 in G_i using w̄_{i+1} and the chosen parameters w_1, . . . , w_{ℓ″}. This holds as we can interpret A′ in G_i. Thus, if w̄_{i+1} discriminates all but c examples in Λ_{i+1}, then ŵ_i w̄_{i+1} discriminates all but c examples in Λ_i. This shows Statement (5).

To finish the proof, we need to show that all steps can be performed by an fpt algorithm. In the first three steps, the number of new colours only depends on the parameters k, ℓ, and q. In Step 4, the number of new colours only depends on k, q, and the size of the signature of G_i. Since the number of computed graphs is bounded by s (and therefore in terms of q, k, and ℓ), the total number of fresh colours only depends on the parameters. Furthermore, the computations are linear in the size of G_i. For the projection of the examples, the computation of the types can be performed by an fpt algorithm because G_i is nowhere dense. All other computations for the projection can be done by an fpt algorithm as well. ◀
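Putting the pieces together, the projection of a single example tuple into V(G_{i+1}) can be sketched as follows. This is illustrative only: `comps` are the components of H_v̄ (as computed above), `near_Y(v)` abstracts the membership test in the relevant neighbourhood of Y, and `type_vertex` is a hypothetical lookup returning the isolated vertex t_{I,θ} that encodes the local type θ of a subtuple.

```python
def project_example(vbar, comps, near_Y, type_vertex):
    """Replace every component of the tuple that is far from Y by its
    type vertex t_{I, theta}; keep components with an entry near Y."""
    vprime = list(vbar)
    for I in comps:
        if any(near_Y(vbar[a]) for a in I):
            continue                            # component kept as-is
        sub = tuple(vbar[a] for a in sorted(I)) # restriction of vbar to I
        t = type_vertex(frozenset(I), sub)      # isolated vertex encoding the type
        for a in I:
            vprime[a] = t
    return tuple(vprime)
```

Since equal local types map to the same type vertex, two projected tuples can only agree in G_{i+1} if their originals agreed in G_i, matching the observation that the construction creates no new conflicts.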