Towards Approximate Query Enumeration with Sublinear Preprocessing Time

Abstract

This paper aims at providing extremely efficient algorithms for approximate query enumeration on sparse databases, that come with performance and accuracy guarantees. We introduce a new model for approximate query enumeration on classes of relational databases of bounded degree. We first prove that on databases of bounded degree any local first-order definable query can be enumerated approximately with constant delay after a constant time preprocessing phase. We extend this, showing that on databases of bounded tree-width and bounded degree, every query that is expressible in first-order logic can be enumerated approximately with constant delay after a sublinear (more precisely, polylogarithmic) time preprocessing phase. Durand and Grandjean (ACM Transactions on Computational Logic 2007) proved that exact enumeration of first-order queries on databases of bounded degree can be done with constant delay after a linear time preprocessing phase. Hence we achieve a significant speed-up in the preprocessing phase. Since sublinear running time does not allow reading the whole input database even once, sacrificing some accuracy is inevitable for our speed-up. Nevertheless, our enumeration algorithms come with guarantees: With high probability, (1) only tuples are enumerated that are answers to the query or `close' to being answers to the query, and (2) if the proportion of tuples that are answers to the query is sufficiently large, then all answers will be enumerated. Here the notion of `closeness' is a tuple edit distance in the input database. For local first-order queries, only actual answers are enumerated, strengthening (1). Moreover, both the `closeness' and the proportion required in (2) are controllable. We combine methods from property testing of bounded degree graphs with logic and query enumeration, which we believe can inspire further research.


arXiv preprint [cs.DB]

TOWARDS APPROXIMATE QUERY ENUMERATION WITH SUBLINEAR PREPROCESSING TIME

ISOLDE ADLER AND POLLY FAHEY

University of Leeds, School of Computing, Leeds, UK

Abstract.

This paper aims at providing extremely efficient algorithms for approximate query enumeration on sparse databases, that come with performance and accuracy guarantees. We introduce a new model for approximate query enumeration on classes of relational databases of bounded degree. We first prove that on databases of bounded degree any local first-order definable query that has a sufficiently large answer set can be enumerated approximately with constant delay after a preprocessing phase with constant running time. We extend this, showing that on databases of bounded tree-width and bounded degree, every query that is expressible in first-order logic and has a sufficiently large answer set can be enumerated approximately with constant delay after a preprocessing phase with sublinear (more precisely, polylogarithmic) running time.

Durand and Grandjean (ACM Transactions on Computational Logic 2007) proved that exact enumeration of first-order queries on databases of bounded degree can be done with constant delay after a preprocessing phase with running time linear in the size of the input database. Hence we achieve a significant speed-up in the preprocessing phase. Since sublinear running time does not allow reading the whole input database even once, sacrificing some accuracy is inevitable for our speed-up. Nevertheless, our enumeration algorithm comes with guarantees: With high probability, (1) only tuples are enumerated that are answers to the query or 'close' to being answers to the query, and (2) if the proportion of tuples that are answers to the query is sufficiently large, then all answers will be enumerated. Here the notion of 'closeness' is a tuple edit distance in the input database. For local first-order queries, only actual answers are enumerated, strengthening (1). Moreover, both the 'closeness' and the proportion required in (2) are controllable.

Our algorithms only access the input database by sampling local parts, in a distributed fashion. While our preprocessing phase is simpler than the preprocessing phase for the exact algorithm, our enumeration phase is more involved, as we push parts of the computation into the enumeration phase, allowing us to keep on enumerating answers.

We combine methods from property testing of bounded degree graphs with logic and query enumeration, which we believe can inspire further research.

Key words and phrases:

Query Enumeration, Sublinear Time Algorithms, Constant Delay, Logic and Databases, Property Testing.

Preprint submitted to Logical Methods in Computer Science. © Isolde Adler and Polly Fahey. Creative Commons.

1. Introduction

Given the ubiquity and sheer size of stored data nowadays, there is an immense need for highly efficient algorithms to extract information from the data. When the input data is huge, many algorithms that are traditionally classified as 'efficient' become impractical. Hence in practice often heuristics are used, at the price of losing control over the quality of the computed information. In many application areas however, such as aviation, security, medicine, and research, accuracy guarantees regarding the computed output are crucial.

We address this by taking a step towards foundations for Approximate Query Processing [8]. We provide a new model for approximately enumerating the set of answers to queries on relational databases. This enables us to decrease the running time significantly compared to traditional algorithms while providing probabilistic accuracy guarantees.

Query enumeration.

Query evaluation plays a central role in database systems, and in the past decades it has received huge attention both from practical and theoretical perspectives. One of the central problems is query enumeration. Here we are given a database $\mathcal{D}$ and a query $q$, and the goal is to compute the set $q(\mathcal{D})$ of all answers to $q$ on $\mathcal{D}$. However, the set $q(\mathcal{D})$ could be exponential in the number of free variables of $q$, and even bigger than $\mathcal{D}$, hence the total running time required to enumerate all answers may not be a meaningful complexity measure. Taking this into account, models for query enumeration distinguish two phases, a preprocessing phase and an enumeration phase. Typically, in the preprocessing phase some form of data structure is computed from $\mathcal{D}$ and $q$, in such a way that in the enumeration phase all answers in $q(\mathcal{D})$ can be enumerated (without repetition) with only a small delay between any two consecutive answers. We focus on data complexity, i.e. we regard the query as being fixed, and the database being the input. Efficiency is measured both in terms of the running time of the preprocessing phase and the delay, i.e. the maximum time between the output of any two consecutive answers. For the delay we can hope for constant time at best, independent of the size of the database. For the preprocessing phase, the best we can hope for regarding exact algorithms is linear time.

Recent research has been very successful in providing exact enumeration algorithms for first-order queries on sparse relational databases. In 2007, Durand and Grandjean showed that on relational databases of bounded degree, every first-order query can be enumerated with constant delay after a linear time preprocessing phase [11]. This result triggered a number of papers [17, 12, 23], culminating in Schweikardt, Segoufin and Vigny's result that on nowhere dense databases, first-order queries can be enumerated with constant delay after a pseudo-linear time preprocessing phase [22].
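As a minimal illustration of the two-phase model, the following Python sketch handles one toy query, $q(x) = \exists y\, E(x, y)$, on a database with domain $[n] = \{1, \ldots, n\}$. The query, the function names and the list-based data structure are assumptions made for this example and are not constructions from the paper. The preprocessing pass is linear in the number of tuples, and the enumeration phase then does a constant amount of work between consecutive outputs.

```python
from typing import Iterator, List, Tuple


def preprocess(n: int, edges: List[Tuple[int, int]]) -> List[int]:
    """Preprocessing phase: one linear pass over the relation E.

    Returns the elements a of [n] = {1, ..., n} that satisfy the toy
    query q(a) = "there is some b with E(a, b)", in increasing order.
    """
    has_out_edge = [False] * (n + 1)  # index 0 is unused
    for a, _b in edges:
        has_out_edge[a] = True
    return [a for a in range(1, n + 1) if has_out_edge[a]]


def enumerate_answers(answers: List[int]) -> Iterator[int]:
    """Enumeration phase: constant work between consecutive outputs."""
    for a in answers:
        yield a
```

For instance, `list(enumerate_answers(preprocess(5, [(1, 2), (2, 1), (4, 2)])))` lists the elements 1, 2 and 4, each exactly once.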

Our contributions.

In this paper we aim at sublinear time preprocessing and constant delay in the enumeration phase. We consider databases $\mathcal{D}$ of bounded degree $d$, i.e. every element of the domain appears in at most $d$ tuples in relations of $\mathcal{D}$, and we identify conditions under which first-order definable queries can be enumerated approximately with constant delay after a sublinear preprocessing phase. We consider two different categories of first-order definable queries, local and general (including non-local) queries. A first-order query is local if, given any bounded degree database and tuple, it can be decided by only looking at the local (fixed radius) neighbourhood around the tuple whether the tuple is an answer to the query for the database. We show the following.

On input databases of bounded degree, every (fixed) local first-order definable query can be enumerated approximately with constant delay after a constant time preprocessing phase (Theorem 4.4).

On input databases of bounded degree and bounded tree-width, every (fixed) first-order definable query can be enumerated approximately with constant delay after a sublinear time preprocessing phase (Theorem 5.5).
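To make the first statement more concrete, here is a deliberately simplified Python sketch of sampling-based enumeration for a local query. It is an illustration under stated assumptions, not the algorithm behind Theorem 4.4: the callback `is_answer` stands in for a constant-time local check of a single tuple, a hash set replaces any constant-time lookup array, the budget parameters `rounds` and `samples_per_round` are hypothetical, and the sketch does not reproduce the bookkeeping needed to guarantee constant delay.

```python
import random
from typing import Callable, Iterator, Set, Tuple

Answer = Tuple[int, ...]


def approx_enumerate(
    n: int,
    k: int,
    is_answer: Callable[[Answer], bool],
    rounds: int,
    samples_per_round: int,
    rng: random.Random,
) -> Iterator[Answer]:
    """Sample k-tuples over [n] and output each sampled answer once.

    Only tuples accepted by is_answer are output, so every output is a
    genuine answer; answers that are never sampled are missed.
    """
    seen: Set[Answer] = set()  # tuples already examined (avoids duplicates)
    for _ in range(rounds):
        for _ in range(samples_per_round):
            candidate = tuple(rng.randint(1, n) for _ in range(k))
            if candidate in seen:
                continue
            seen.add(candidate)
            if is_answer(candidate):
                yield candidate
```

The sketch mirrors the intuition that a large answer set is hit quickly by uniform sampling, whereas a very small answer set may never be sampled within the budget.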

We also give generalisations of the two theorems above (Theorems 4.5, 5.6 and 7.4) and applications of our approach to further computational problems on databases (Theorems 7.6 and 7.8), which we will discuss below.

First, let us give some more details. For any local first-order query $q$, bounded degree database $\mathcal{D}$ and tuple $\bar{a}$ from $\mathcal{D}$ it can be decided in constant time whether $\bar{a}$ is an answer to $q$ on $\mathcal{D}$ (Lemma 3.5). Using this fact, we show that for any fixed local first-order definable query $q(\bar{x})$ with $|\bar{x}| =: k$ and $\gamma \in (0,1)$, there is an algorithm that is given a database $\mathcal{D}$ with domain of size $n$ as input and does the following. It enumerates a set of tuples that are answers to $q$ on $\mathcal{D}$, and with high probability it enumerates all answers to $q$ on $\mathcal{D}$ if the size of the answer set of $q$ on $\mathcal{D}$ is larger than $\gamma n^k$ (i.e. the number of answers to the query is larger than a fixed fraction of the total possible number of answers).

Towards reducing the minimum size of the answer set required to enumerate all answers to the query, we show we actually only require size $\gamma n^c$, where $c$ is the maximum number of connected components in the neighbourhood (of some fixed radius) of an answer to $q$ (Theorem 4.5). We argue that in practice, $c$ can be expected to be low for natural queries.

If a first-order query $q$ is non-local, then for a database $\mathcal{D}$ and a tuple $\bar{a}$, we can no longer decide in constant time whether $\bar{a}$ is an answer to $q$ on $\mathcal{D}$. However, using Hanf-locality of first-order logic [15] and a result from the area of property testing, we can approximately enumerate any first-order definable query on bounded degree and bounded tree-width databases with polylogarithmic preprocessing time and constant delay (Theorem 5.5). Let us now explain our notion of approximation, which is based on neighbourhood types.

For $d \in \mathbb{N}$, let $\mathcal{C}$ be a class of databases of degree at most $d$ over a fixed finite schema. Let $\mathcal{D} \in \mathcal{C}$, let $r \in \mathbb{N}$ and let $a$ be an element of the domain of $\mathcal{D}$. The $r$-neighbourhood type of $a$ in $\mathcal{D}$ is the isomorphism type of the sub-database of $\mathcal{D}$ induced by all elements of the domain whose distance to $a$ (in the underlying graph of $\mathcal{D}$) is at most $r$, expanded by $a$. The element $a$ is called the centre. This can be extended to define the $r$-neighbourhood type of a tuple $\bar{a}$ in $\mathcal{D}$, by considering the isomorphism type of the sub-database induced by the union of the $r$-neighbourhoods of all components of $\bar{a}$, expanded by $\bar{a}$. We call such an isomorphism type an $r$-type (with $|\bar{a}|$ centres). Given a database query $q(\bar{x})$ with $|\bar{x}| =: k$ and a database $\mathcal{D}$ with domain of size $n$, we say that a tuple $\bar{a}$ from $\mathcal{D}$ is $\epsilon$-close to being an answer of $q$ on $\mathcal{D}$ and $\mathcal{C}$, if $\mathcal{D}$ can be modified with tuple modifications (insertions and deletions) into a database $\mathcal{D}' \in \mathcal{C}$ with at most $\epsilon d n$ modifications, such that $\bar{a}$ is an answer of $q$ on $\mathcal{D}'$ and $\bar{a}$ has the same $r$-neighbourhood type (for some $r$) in $\mathcal{D}'$ and $\mathcal{D}$. We let $q(\mathcal{D}, \mathcal{C}, \epsilon)$ be the set of $k$-tuples $\bar{a}$ of elements of $\mathcal{D}$ that are $\epsilon$-close to being an answer of $q$ on $\mathcal{D}$ and $\mathcal{C}$. Note that for any local first-order query $q$, $q(\mathcal{D}, \mathcal{C}, \epsilon) = q(\mathcal{D})$.

We say that the enumeration problem $\mathrm{Enum}_{\mathcal{C}}(q)$ for $q$ on $\mathcal{C}$ can be solved approximately with preprocessing time $H(n)$ and constant delay for answer threshold function $f(n)$, if for every $\epsilon \in (0,1]$ there is an algorithm that is given oracle access to an input database $\mathcal{D} \in \mathcal{C}$ (for each given element of the domain, the algorithm can query the oracle for tuples in any of the relations containing the element, and we assume that oracle queries are answered in constant time), and is given the number $n$ of elements of the domain, and proceeds in two phases. First, a preprocessing phase that runs in time $H(n)$, followed by an enumeration phase where a set $S$ of pairwise distinct tuples is enumerated, with constant delay between any two consecutive tuples. In addition, we require that with probability at least $2/3$, (1) $S \subseteq q(\mathcal{D}) \cup q(\mathcal{D}, \mathcal{C}, \epsilon)$, and (2) if $|q(\mathcal{D})| \geq f(n)$, then $q(\mathcal{D}) \subseteq S$.

We consider database queries that are expressible in first-order logic. Note that our notion of approximation is designed specifically for first-order queries and sparse databases, and for other classes of queries and input databases alternative models may be necessary. We prove that for every first-order query $q(\bar{x})$ with $|\bar{x}| = k$ the problem $\mathrm{Enum}_{\mathcal{C}_d^t}(q)$ (where $\mathcal{C}_d^t$ is the class of all databases of $d$-bounded degree and $t$-bounded tree-width) can be solved approximately with polylogarithmic preprocessing time and constant delay with answer threshold function $f(n) = \gamma n^k$ for any $\gamma \in (0,1)$ (Theorem 5.5). As with local queries, we further prove that we can actually reduce the answer threshold function to $f(n) = \gamma n^c$, where $c \leq k$ is the maximum number of connected components in the neighbourhood (of some fixed radius) of an answer to $q$ (Theorem 5.6). We also identify a condition that is based on Hanf-locality of first-order logic [15], which we call Hanf-sentence testability, and we prove a general theorem (Theorem 7.4), that for every first-order query $q(\bar{x})$ with $|\bar{x}| = k$ that is Hanf-sentence testable on $\mathcal{C}$ in time $H(n)$, the problem $\mathrm{Enum}_{\mathcal{C}}(q)$ can be solved approximately with preprocessing time $O(H(n))$ and constant delay for answer threshold function $f(n) = \gamma n^c$ as above.

We illustrate our model throughout the paper with a running example which can be motivated by the problems of subgraph matching and inexact subgraph matching in social and biological networks (e.g. [26, 24]). We show that our running example is Hanf-sentence testable on the class of all bounded degree graphs in constant time, and hence by Theorem 7.4 it can be approximately enumerated with constant preprocessing time.

Our notion of approximation is based on 'structural' closeness and therefore our algorithms are aimed at applications where structural similarity is essential. For example, when given a new huge dataset (such as biological datasets or social networks), a first exploration of the approximate structure with some accuracy guarantees may be desirable to obtain initial insights quickly. These insights could then e.g. be used to make decisions regarding more time consuming analysis in a follow-up stage.

Property testing.

Before sketching the proof idea of Theorem 5.5, let us give some background on property testing. Property testing aims at providing highly efficient algorithms that derive global information on the structure of the input by only exploring a small number of local parts of it. These algorithms are randomised and allow for a small error. Nevertheless, they come with guarantees regarding both the quality of the solution and efficiency. Typically, the algorithms only look at a constant number of small parts of the input, and they run in constant or sublinear time. Even for problems that allow linear time exact algorithms, such as graph connectivity, reducing the running time (while sacrificing some accuracy) may become crucial if the networks are huge. Property testing can be seen as solving relaxed decision problems. Instead of deciding whether a given input has a certain property, the goal is to determine correctly, with high probability, whether the input has the property or is far from having it. Formally, a property $P$ is an isomorphism closed class of relational databases. For example, each Boolean database query $q$ defines a property $P_q$, the class of all databases satisfying $q$. A property testing algorithm (tester, for short) for $P$ determines whether a given database $\mathcal{D}$ has property $P$ (i.e. whether $\mathcal{D}$ is a member of $P$) or is $\epsilon$-far from having $P$. Testers are randomised and allow for a small constant error probability. The algorithms are parameterised by a distance measure $\epsilon$, where the distance measure depends on the model.

Property testing was first introduced in [21], in the context of Programme Checking. In this paper we build on the model for property testing on relational databases of bounded degree of [1], which is a generalisation of the bounded degree graph model [14]. This model assumes a uniform upper bound $d$ on the degree of the input databases. For $\epsilon \in [0,1]$, a database $\mathcal{D}$ with domain of size $n$ is $\epsilon$-close to satisfying $P$, if we can make $\mathcal{D}$ isomorphic to a member of $P$ by editing (inserting or removing) at most $\epsilon d n$ tuples in relations of $\mathcal{D}$ (i.e. at most an '$\epsilon$-fraction' of the maximum possible number $dn$ of tuples in relations). Otherwise, $\mathcal{D}$ is called $\epsilon$-far from $P$. An $\epsilon$-tester receives the size $n$ of the domain of the input, and has oracle access to the database.

Techniques.

To give a flavour of our techniques, we sketch the proof idea of Theorem 5.5. Let $\phi(\bar{x})$ be a first-order formula with $|\bar{x}| = k$ and let $\mathcal{D}$ be an input database from the class of databases $\mathcal{C}_d^t$ with bounded degree and bounded tree-width over a fixed finite schema. In the preprocessing phase, we first compute a formula $\chi$ that is equivalent to $\phi$ on $\mathcal{C}_d^t$, and $\chi$ is in a special type of Hanf normal form that groups the Hanf-sentences and sphere-formulas together (Lemma 3.2). We then run property testers on the sentence parts of $\chi$ and compute a set $T$ of $r$-types (where $r$ is the Hanf locality radius of $\phi$) such that, with high probability, for any $\mathcal{D} \in \mathcal{C}_d^t$ and $\bar{a} \in D^k$, if $\bar{a} \in \phi(\mathcal{D})$, then the $r$-type of $\bar{a}$ in $\mathcal{D}$ is in $T$, and if $\bar{a} \in D^k \setminus \phi(\mathcal{D}, \mathcal{C}_d^t, \epsilon)$, then the $r$-type of $\bar{a}$ in $\mathcal{D}$ is not in $T$. In the remainder of the preprocessing phase we randomly sample a constant number of $k$-tuples of elements of $\mathcal{D}$ and check whether their $r$-type is in $T$. Assuming that $|\phi(\mathcal{D})|$ is sufficiently large, with high probability we will have sampled at least one tuple whose $r$-type is in $T$, and we start the enumeration phase by enumerating this tuple. To keep the enumeration going, after each tuple that is enumerated, we sample a constant number of tuples from the input database. To avoid outputting duplicates we keep a record of which tuples we have already seen by using an array that can be updated and read in constant time. Finally, with high probability we will see every possible $k$-tuple of elements of $\mathcal{D}$.

Further related research.

So far, only a small number of results in database theory make use of models from property testing. Chen and Yoshida [10] study the testability of homomorphism inadmissibility in a model which is close to the general graph model (cf. e.g. [3]). Ben-Moshe et al. [5] study the testability of near-sortedness (a property of relations that states that most tuples are close to their place in some desired order). Our model differs from both of these, as it relies on a degree bound and uses a different type of oracle access. A conjunctive query (CQ) is a first-order formula constructed from atomic formulas using conjunctions and existential quantification only. CQ evaluation is closely related to solving constraint satisfaction problems (CSPs) [18]. CSPs have been studied under different models from property testing ([9, 25, 2]). Our work, however, is relevant for more complex queries, as enumerating CQs in our model basically amounts to sampling.

Our work is a step towards approximate enumeration on sparse databases. It would be interesting to study approximate enumeration on databases of bounded average degree. However, this would require different techniques.


Organisation.

In Section 2 we introduce notions used throughout the paper. In Section 3 we give some useful normal forms of first-order queries along with some results on local first-order queries. In Sections 4 and 5 we prove our main theorems on the enumeration of local and general first-order queries respectively. In Section 6, in an attempt to push the boundaries further, we prove strengthened versions of the theorems proved in Sections 4 and 5, showing how the required answer threshold can be reduced. Finally, in Section 7 we prove a generalisation of our main theorem on approximate enumeration of general first-order queries, showing that the assumption of bounded tree-width can be replaced with the weaker assumption of Hanf-sentence testability. We also provide results on approximate membership testing and approximate counting.

2. Preliminaries

We let $\mathbb{N}$ be the set of natural numbers including 0, and $\mathbb{N}_{\geq 1} = \mathbb{N} \setminus \{0\}$. For each $n \in \mathbb{N}_{\geq 1}$, we let $[n] = \{1, 2, \ldots, n\}$.

Databases.

A schema is a finite set $\sigma = \{R_1, \ldots, R_{|\sigma|}\}$ of relation names, where each $R \in \sigma$ has an arity $\mathrm{ar}(R) \in \mathbb{N}_{\geq 1}$. The size of a schema, denoted by $\|\sigma\|$, is the sum of the arities of its relation names. A database $\mathcal{D}$ of schema $\sigma$ ($\sigma$-db for short) is of the form $\mathcal{D} = (D, R_1^{\mathcal{D}}, \ldots, R_{|\sigma|}^{\mathcal{D}})$, where $D$ is a finite set, the set of elements of $\mathcal{D}$, and $R_i^{\mathcal{D}}$ is an $\mathrm{ar}(R_i)$-ary relation on $D$. The set $D$ is also called the domain of $\mathcal{D}$. An (undirected) graph $G$ is a tuple $G = (V(G), E(G))$ where $V(G)$ is a set of vertices and $E(G)$ is a set of 2-element subsets of $V(G)$ (the edges of $G$). For an edge $\{u, v\} \in E(G)$ we simply write $uv$. For a graph $G$ with $uv \in E(G)$ we let $G \setminus uv$ denote the graph obtained from $G$ by removing the edge $uv$ from $E(G)$. An undirected graph can be seen as an $\{E\}$-db, where $E$ is a binary relation name, interpreted by a symmetric, irreflexive relation.

We assume that all databases are linearly ordered or, equivalently, that $D = [n]$ for some $n \in \mathbb{N}$ (similar to [17]). We extend this linear ordering to a linear order on the relations of $\mathcal{D}$ via lexicographic ordering. The Gaifman graph of a $\sigma$-db $\mathcal{D}$ is the undirected graph $G(\mathcal{D}) = (V, E)$, with vertex set $V := D$ and an edge between vertices $a$ and $b$ whenever $a \neq b$ and there is an $R \in \sigma$ and a tuple $(a_1, \ldots, a_{\mathrm{ar}(R)}) \in R^{\mathcal{D}}$ with $a, b \in \{a_1, \ldots, a_{\mathrm{ar}(R)}\}$. The degree $\deg(a)$ of an element $a$ in a database $\mathcal{D}$ is the total number of tuples in all relations of $\mathcal{D}$ that contain $a$. We say the degree $\deg(\mathcal{D})$ of a database $\mathcal{D}$ is the maximum degree of its elements. A class of databases $\mathcal{C}$ has bounded degree, if there exists a constant $d \in \mathbb{N}$ such that for all $\mathcal{D} \in \mathcal{C}$, $\deg(\mathcal{D}) \leq d$. (We always assume that classes of databases are closed under isomorphism.) Let us remark that $\deg(\mathcal{D})$ and the (graph-theoretic) degree of $G(\mathcal{D})$ only differ by at most a constant factor (cf. e.g. [11]). Hence both measures yield the same classes of relational structures of bounded degree. We define the tree-width of a database $\mathcal{D}$ as the tree-width of its Gaifman graph. (See e.g. [13] for a discussion of tree-width in this context.) A class $\mathcal{C}$ of databases has bounded tree-width, if there exists a constant $t \in \mathbb{N}$ such that all databases $\mathcal{D} \in \mathcal{C}$ have tree-width at most $t$. Let $\mathcal{D}$ be a $\sigma$-db, and $M \subseteq D$. The sub-database of $\mathcal{D}$ induced by $M$ is the database $\mathcal{D}[M]$ with domain $M$ and $R^{\mathcal{D}[M]} := R^{\mathcal{D}} \cap M^{\mathrm{ar}(R)}$ for every $R \in \sigma$.
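The definitions of the Gaifman graph, the degree and the induced sub-database translate directly into code. The following Python sketch is one possible rendering; the dictionary-of-sets representation of a database and all function names are assumptions chosen for illustration, with the domain taken to be $[n] = \{1, \ldots, n\}$.

```python
from itertools import combinations
from typing import Dict, Set, Tuple

# Hypothetical representation: relation name -> set of tuples over [n].
Database = Dict[str, Set[Tuple[int, ...]]]


def gaifman_graph(n: int, db: Database) -> Dict[int, Set[int]]:
    """Adjacency sets of the Gaifman graph: distinct a, b are adjacent
    whenever they occur together in some tuple of some relation."""
    adj: Dict[int, Set[int]] = {a: set() for a in range(1, n + 1)}
    for tuples in db.values():
        for t in tuples:
            for a, b in combinations(set(t), 2):
                adj[a].add(b)
                adj[b].add(a)
    return adj


def degree(n: int, db: Database) -> int:
    """Maximum, over all elements, of the number of tuples containing it."""
    count = {a: 0 for a in range(1, n + 1)}
    for tuples in db.values():
        for t in tuples:
            for a in set(t):  # a tuple counts once per element it contains
                count[a] += 1
    return max(count.values(), default=0)


def induced(db: Database, m: Set[int]) -> Database:
    """Sub-database induced by m: keep the tuples lying entirely inside m."""
    return {
        name: {t for t in tuples if all(a in m for a in t)}
        for name, tuples in db.items()
    }
```

Note that the degree counts tuples rather than Gaifman neighbours, so the two quantities can differ, in line with the remark above that they agree only up to a constant factor.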

Database queries.

Let $\mathrm{var}$ be a countably infinite set of variables, and fix a relational schema $\sigma$. The set $\mathrm{FO}[\sigma]$ is built from atomic formulas of the form $x_1 = x_2$ or $R(x_1, \ldots, x_{\mathrm{ar}(R)})$, where $R \in \sigma$ and $x_1, \ldots, x_{\mathrm{ar}(R)} \in \mathrm{var}$, and is closed under Boolean connectives ($\neg, \vee, \wedge, \rightarrow, \leftrightarrow$) and existential and universal quantifications ($\exists, \forall$). The set $\mathrm{FO}[\{E\}]$ is the set of first-order formulas for undirected graphs. We let $\mathrm{FO} := \bigcup_{\sigma \text{ schema}} \mathrm{FO}[\sigma]$. We use $\exists^{\geq m} x\, \phi$ (and $\exists^{= m} x\, \phi$, respectively) as a shortcut for the FO formula expressing that the number of witnesses $x$ satisfying $\phi$ is at least $m$ (exactly $m$, resp.). A free variable of an FO formula is a variable that does not appear in the scope of a quantifier. For a tuple $\bar{x}$ of variables and a formula $\phi \in \mathrm{FO}$, we write $\phi(\bar{x})$ to indicate that the free variables of $\phi$ are exactly the variables in $\bar{x}$. An FO formula without free variables is called a sentence. An FO query (of arity $k \in \mathbb{N}$) is an FO formula $\phi(\bar{x})$ (with $|\bar{x}| = k$). Let $\mathcal{D}$ be a database and $\bar{a}$ be a tuple of elements of $\mathcal{D}$ of length $|\bar{x}|$. We write $\mathcal{D} \models \phi(\bar{a})$, if $\phi$ is true in $\mathcal{D}$ when we replace the free variables of $\phi$ with $\bar{a}$, and we say that $\bar{a}$ is an answer for $\phi$ on $\mathcal{D}$. We let $\phi(\mathcal{D}) := \{\bar{a} \in D^{|\bar{a}|} \mid \mathcal{D} \models \phi(\bar{a})\}$ be the set of all answers for $\phi$ on $\mathcal{D}$. Two formulas $\phi(\bar{x}), \psi(\bar{x}) \in \mathrm{FO}[\sigma]$, where $|\bar{x}| = k$, are $d$-equivalent (written $\phi(\bar{x}) \equiv_d \psi(\bar{x})$) if for all $\sigma$-dbs $\mathcal{D}$ with degree at most $d$ and all $\bar{a} \in D^k$, $\mathcal{D} \models \phi(\bar{a})$ iff $\mathcal{D} \models \psi(\bar{a})$. The quantifier rank of a formula $\phi$, denoted by $\mathrm{qr}(\phi)$, is the maximum nesting depth of quantifiers that occur in $\phi$. The size of a formula $\phi$, denoted by $\|\phi\|$, is the length of $\phi$ as a string over the alphabet $\sigma \cup \mathrm{var} \cup \{\exists, \forall, \neg, \vee, \wedge, \rightarrow, \leftrightarrow, =\} \cup \{,\} \cup \{(, )\}$.

Enumeration problems.

Let $\sigma$ be a relational schema, let $\mathcal{C}$ be a class of $\sigma$-dbs and let $\phi(\bar{x}) \in \mathrm{FO}[\sigma]$. The enumeration problem of $\phi$ over $\mathcal{C}$, denoted by $\mathrm{Enum}_{\mathcal{C}}(\phi)$, is, given a database $\mathcal{D} \in \mathcal{C}$, to output the elements of $\phi(\mathcal{D})$ one by one with no repetition. An enumeration algorithm for the enumeration problem $\mathrm{Enum}_{\mathcal{C}}(\phi)$ with input database $\mathcal{D} \in \mathcal{C}$ proceeds in two phases, a preprocessing phase and an enumeration phase. The enumeration phase outputs all the elements of $\phi(\mathcal{D})$ with no duplicates. Furthermore, the enumeration phase has full access to the output of the preprocessing phase but can use only a constant total amount of extra memory.

The delay of an enumeration algorithm is the maximum time between the start of the enumeration phase and the first output (or the 'end of enumeration message' if there are no answers), two consecutive outputs, and the last output and the 'end of enumeration message'.

Neighbourhoods and Hanf normal form.

For a $\sigma$-db $\mathcal{D}$ and $a, b \in D$, the distance between $a$ and $b$ in $\mathcal{D}$, denoted by $\mathrm{dist}_{\mathcal{D}}(a, b)$, is the length of a shortest path between $a$ and $b$ in $G(\mathcal{D})$. The distance between two tuples $\bar{a} = (a_1, \ldots, a_m)$ and $\bar{b} = (b_1, \ldots, b_l)$ of $\mathcal{D}$ is $\min\{\mathrm{dist}_{\mathcal{D}}(a_i, b_j) \mid 1 \leq i \leq m,\ 1 \leq j \leq l\}$. Let $r \in \mathbb{N}$. For a tuple $\bar{a} \in D^{|\bar{a}|}$, we let $N_r^{\mathcal{D}}(\bar{a})$ denote the set of all elements of $\mathcal{D}$ that are at distance at most $r$ from $\bar{a}$. The $r$-neighbourhood of $\bar{a}$ in $\mathcal{D}$, denoted by $\mathcal{N}_r^{\mathcal{D}}(\bar{a})$, is the tuple $(\mathcal{D}[N_r(\bar{a})], \bar{a})$ where the elements of $\bar{a}$ are called centres. We omit the superscript and write $N_r(\bar{a})$ and $\mathcal{N}_r(\bar{a})$, if $\mathcal{D}$ is clear from the context. Two $r$-neighbourhoods, $\mathcal{N}_r(\bar{a})$ and $\mathcal{N}_r(\bar{b})$, are isomorphic (written $\mathcal{N}_r(\bar{a}) \cong \mathcal{N}_r(\bar{b})$) if there is an isomorphism between $\mathcal{D}[N_r(\bar{a})]$ and $\mathcal{D}[N_r(\bar{b})]$ which maps $\bar{a}$ to $\bar{b}$. A $\cong$-equivalence-class of $r$-neighbourhoods with $k$ centres is called an $r$-neighbourhood type (or $r$-type for short) with $k$ centres. We let $T_r^{\sigma,d}(k)$ denote the set of all $r$-types with $k$ centres and degree at most $d$, over schema $\sigma$. Note that for fixed $d$ and $\sigma$, the cardinality $|T_r^{\sigma,d}(k)| =: c(r, k)$ is a constant, only depending on $r$ and $k$. We say that a tuple $\bar{a} \in D^{|\bar{a}|}$ has $r$-type $\tau$, if $\mathcal{N}_r^{\mathcal{D}}(\bar{a}) \in \tau$.

Let $r \in \mathbb{N}$ and $k \in \mathbb{N}_{\geq 1}$. A sphere-formula, denoted by $\mathrm{sph}_\tau(\bar{x})$ (where $|\bar{x}| = k$), is an FO formula which expresses that the $r$-type of $\bar{x}$ is $\tau$, where $\tau$ is some $r$-type with $k$ centres, and $r$ is called the locality radius of the sphere-formula. A Hanf-sentence is a sentence of the form $\exists^{\geq m} x\, \mathrm{sph}_\tau(x)$, where $\tau$ is an $r$-type with one centre, and $r$ is the locality radius of the Hanf-sentence. An FO formula is in Hanf normal form if it is a Boolean combination of Hanf-sentences and sphere-formulas. The

Hanf locality radius of an FO formula φ in Hanfnormal form is the maximum of the locality radii of the Hanf-sentences and sphere-formulasof φ . A well-known theorem by Hanf states that on databases of bounded degree, everyFO formula can be transformed into an equivalent formula in Hanf normal form [15]. Thistheorem was subsequently reﬁned as follows. Theorem 2.1 ([7]) . For any φ (¯ x ) ∈ FO and d ∈ N ≥ , there exists a d -equivalent formula ψ (¯ x ) in Hanf normal form with the same free variables as φ , and ψ can be computed in time d O ( k φ k ) from φ . Furthermore, the Hanf locality radius of ψ is at most qr ( φ ) . For each FO formula φ , we ﬁx a formula ψ (that is computed by Theorem 2.1) that is d -equivalent to φ and is in Hanf normal form. We then ﬁx the Hanf locality radius of φ tobe the Hanf locality radius of ψ (and so we can then refer to the Hanf locality radius of anFO formula). Local and non-local ﬁrst-order queries.

We call an $\mathrm{FO}[\sigma]$ formula $\phi(\bar{x})$ (with $k$ free variables) local if there exists some $r \in \mathbb{N}$ such that for any $\sigma$-dbs $\mathcal{D}_1$ and $\mathcal{D}_2$ and tuples $\bar{a}_1 \in D_1^k$ and $\bar{a}_2 \in D_2^k$, if $\mathcal{N}_r^{\mathcal{D}_1}(\bar{a}_1) \cong \mathcal{N}_r^{\mathcal{D}_2}(\bar{a}_2)$ then $\mathcal{D}_1 \models \phi(\bar{a}_1)$ if and only if $\mathcal{D}_2 \models \phi(\bar{a}_2)$. We call $r$ the locality radius of $\phi$. If an FO formula is not local we say it is non-local. We highlight that this notion of locality differs from that of Hanf locality and Gaifman locality of FO, and the notions should not be confused.

Proviso.

For the rest of the paper, we ﬁx a schema σ and numbers d, t ∈ N with d ≥ .From now on, all databases are σ -dbs and have degree at most d , unless stated otherwise.We use G d to denote the class of all graphs with degree at most d , C d to denote the classof all σ -dbs with degree at most d , C td to denote the class of all σ -dbs with degree at most d and tree-width at most t and ﬁnally we use C to denote a class of σ -dbs with degree at most d . Property Testing.

First, we note that we only use methods from property testing from Section 5 onwards. We use the model of property testing for bounded degree databases introduced in [1], which is a straightforward extension of the model for bounded degree graphs [14]. Property testing algorithms do not have access to the whole input database. Instead, they are given access via an oracle. Let $\mathcal{D}$ be an input $\sigma$-db on $n$ elements. A property testing algorithm receives the number $n$ as input, and it can make oracle queries of the form $(R, i, j)$, where $R \in \sigma$, $i \leq n$ and $j \leq \deg(\mathcal{D})$ (note that an oracle query is not a database query). The answer to $(R, i, j)$ is the $j$-th tuple in $R^{\mathcal{D}}$ containing the $i$-th element of $\mathcal{D}$, according to the assumed linear order on $D$ (if such a tuple does not exist then it returns $\bot$). We assume oracle queries are answered in constant time.

[Figure 1: The four 2-types of Examples 2.2 and 5.2. The vertices labelled $c_1$ and $c_2$ are the centres.]

Let $\mathcal{D}, \mathcal{D}'$ be two $\sigma$-dbs, both having $n$ elements. The distance between $\mathcal{D}$ and $\mathcal{D}'$, denoted by $\mathrm{dist}(\mathcal{D}, \mathcal{D}')$, is the minimum number of tuples that have to be inserted or removed from relations of $\mathcal{D}$ and $\mathcal{D}'$ to make $\mathcal{D}$ and $\mathcal{D}'$ isomorphic. For $\epsilon \in [0,1]$, we say $\mathcal{D}$ and $\mathcal{D}'$ are $\epsilon$-close if $\mathrm{dist}(\mathcal{D}, \mathcal{D}') \leq \epsilon d n$, and are $\epsilon$-far otherwise. A property is simply a class of databases. Note that every FO sentence $\phi$ defines a property $P_\phi = \{\mathcal{D} \mid \mathcal{D} \models \phi\}$. We call $P_\phi \cap \mathcal{C}$ the property defined by $\phi$ on $\mathcal{C}$. A $\sigma$-db $\mathcal{D}$ is $\epsilon$-close to a property $P$ if there exists a database $\mathcal{D}' \in P$ that is $\epsilon$-close to $\mathcal{D}$, otherwise $\mathcal{D}$ is $\epsilon$-far from $P$.

Let $P \subseteq \mathcal{C}$ be a property and $\epsilon \in (0,1]$ be the proximity parameter. An $\epsilon$-tester for $P$ on $\mathcal{C}$ is a probabilistic algorithm which is given oracle access to a $\sigma$-db $\mathcal{D} \in \mathcal{C}$ and it is given $n := |D|$ as auxiliary input. The algorithm does the following.

(1) If $\mathcal{D} \in P$, then the tester accepts with probability at least $2/3$.
(2) If $\mathcal{D}$ is $\epsilon$-far from $P$, then the tester rejects with probability at least $2/3$.

The query complexity of a tester is the maximum number of oracle queries made. A tester has constant query complexity, if the query complexity does not depend on the size of the input database. We say a property $P \subseteq \mathcal{C}$ is uniformly testable in time $f(n)$ on $\mathcal{C}$, if for every $\epsilon \in (0,1]$ there exists an $\epsilon$-tester for $P$ on $\mathcal{C}$ which has constant query complexity and whose running time on databases on $n$ elements is $f(n)$. Note that this tester must work for all $n$. We give an example below, which is also the basis of our running example.

Example 2.2.

On the class G d , consider the isomorphism types τ and τ of the 2-neighbourhoods ( N , ( c , c )) and ( N , ( c )) where N and N are the graphs shown inFigure 1 with centres ( c , c ) and ( c ). Let φ be the FO[ { E } ]-formula φ = ∃ x ∃ y sph τ ( x, y ) ∧¬∃ z sph τ ( z ) . Consider the property P := P φ . We show that on the class G d , P ∩ G d isuniformly testable with constant time. For this, let ǫ ∈ (0 , G ∈ G d and | V ( G ) | = n as an input, the ǫ -tester proceeds as follows:(1) If n < d /ǫ , do a full check of G and decide if G ∈ P .(2) Otherwise uniformly and independently sample α = log − ǫd/ / n ].(3) For each sampled vertex, compute its 2-neighbourhood.(4) If a vertex is found with 2-type τ then the tester rejects. Otherwise it accepts. Claim 2.3.

The above ǫ -tester accepts with probability at least / if G ∈ P and rejectswith probability at least / if G is ǫ -far from P . Furthermore, the ǫ -tester has constantquery complexity and runs in constant time.Proof: Note that in τ , every vertex has 1 or 3 neighbours. For showing correctness, ﬁrstassume G ∈ P . Then the tester will always accept as there exists no vertex with 2-type τ .Now assume G is ǫ -far from P . Then at least ǫdn edge modiﬁcations are necessary tomake G isomorphic to a graph in P . If n < d /ǫ then the tester will reject so assume otherwise. Inserting a copy of τ requires at most 8( d + 1) modiﬁcations (pick 8 vertices andremove all incident edges then add the 8 edges to make an isolated copy of τ ). Removing anedge uv from G will change the 2-type of any vertex in the set N G ( u ) ∩ N G ( v ). Lemma 3.2(a) of [6] states that | N G ( u ) | ≤ d and | N G ( v ) | ≤ d . Therefore, | N G ( u ) ∩ N G ( v ) | ≤ d and inserting a copy of τ could add at most 8 d many copies of τ . After inserting a copy of τ we need to remove all copies of τ . Let v ∈ V ( G ) be a vertex with 2-type τ . Let u be theneighbour of v with degree 1. If we remove the edge uv ∈ E ( G ), v ’s 2-type is no longer τ .Note that v has exactly 2 neighbours and u has 0 neighbours in G \ uv . Moreover, we claimthat by removing uv , we have introduced no new vertices with 2-type τ . To see this, observethat deleting uv will only aﬀect the 2-types of vertices in N G ( v ). But each vertex x ∈ N G ( v )will have a vertex with exactly two neighbours in its 2-neighbourhood in G \ uv . Hence thenew 2-type of x is not τ . This shows that there are at least ǫdn − d + 1) − d verticeswith 2-type τ . As n ≥ d /ǫ and d ≥

2, then 8( d + 1) ≤ d ≤ ǫdn/

3. The probabilitythat we sample a vertex with 2-type τ is therefore at least ǫdn/ n = ǫd/

3. Hence the probability that none of the α sampled vertices have 2-type τ is at most (1 − ǫd/ α = 1 / / n < d /ǫ then we can do a full check of the input graph in time only dependent on d and ǫ. Otherwise, note that the tester samples only a constant number of vertices in (2), and for each of the sampled vertices, the tester needs to make only a constant number of oracle queries to calculate its 2-neighbourhood in (3), because the degree is bounded. Therefore the tester has constant query complexity and constant running time. ∎

Adler and Harwath showed that, on the class of all databases with bounded degree and tree-width, every property definable in monadic second-order logic with counting (CMSO) is uniformly testable in polylogarithmic running time [1] (where a function is polylogarithmic in $n$ if it is a polynomial in $\log n$). The logic CMSO is an extension of FO and we therefore get the following result, which will be used as a subroutine in Section 5.

Theorem 2.4 ([1]). Each property $P \subseteq C_d^t$ definable in FO is uniformly testable on $C_d^t$ in polylogarithmic running time.

Model of Computation.

We use Random Access Machines (RAMs) and a uniform cost measure when analysing our algorithms, i.e. we assume all basic arithmetic operations including random sampling can be done in constant time, regardless of the size of the numbers involved. We assume that if we initialise an array, all entries are set to 0 and this can be done in constant time for an array of any length or dimension. This is achieved by using the lazy array initialisation technique (cf. e.g. [19]), where entries are only actually stored when they are first needed. We use one-based indexing for arrays. Let $A$ be a 1-dimensional array. We assume that for a number $a \in \mathbb{N}_{\geq 1}$, the entry $A[a]$ can be accessed in constant time.

3. Properties of first-order queries on bounded degree

In this section, we shall give some useful normal forms of FO queries. We shall then give a characterisation and some results for local FO queries.

General ﬁrst-order queries.

We make use of the following lemma to simplify Boolean combinations of sphere-formulas. We shall use this result to show we can write FO queries in a special type of Hanf normal form that groups the Hanf-sentences and the sphere-formulas in a convenient way.

Lemma 3.1 ([6]) . Let r, k, d ∈ N with k ≥ , d ≥ and let σ be a schema. For everyBoolean combination φ (¯ x ) of sphere-formulas of degree at most d and radius at most r ,there exists an I ⊆ T σ,dr ( k ) such that φ (¯ x ) is d -equivalent to W τ ∈ I sph τ (¯ x ) .Furthermore, given φ (¯ x ) , the set I can be computed in time poly( k φ k ) · ( kd r +1 ) O ( k σ k ) . In the following lemma, we show that we can write any FO query as a disjunction ofconjunctions of a sphere-formula and a boolean combination of Hanf-sentences. This normalform will be used in Lemma 5.4.

Lemma 3.2.

Let $\phi(\bar{x}) \in \mathrm{FO}$ and $|\bar{x}| = k$. Let $r$ be the Hanf locality radius of $\phi$. For every $d \in \mathbb{N}$ with $d \geq 2$ there exists a computable, $d$-equivalent formula to $\phi$ of the form

$$\chi(\bar{x}) = \bigvee_{i \in [m]} \left( \mathrm{sph}_{\tau_i}(\bar{x}) \wedge \psi_i^s \right) \qquad (3.1)$$

for some $m \in \mathbb{N}$, where for all $i \in [m]$, $\tau_i$ is an $r$-type with $k$ centres and $\psi_i^s$ is a conjunction of Hanf-sentences and negated Hanf-sentences. For each $\phi(\bar{x}) \in \mathrm{FO}$, we fix such a $d$-equivalent formula to $\phi$ (so we can refer to the $d$-equivalent formula of $\phi$ in the form (3.1)).

Proof. From $\phi$ we can construct a formula in the required form as follows. Firstly, by Theorem 2.1 we construct a $d$-equivalent formula in Hanf normal form. Next, we write the resulting formula in DNF to obtain a formula of the form

$$\chi'(\bar{x}) = \bigvee_{i \in [l]} \left( \psi_i^f(\bar{x}) \wedge \psi_i^s \right)$$

for some $l \in \mathbb{N}$, where for $i \in [l]$, $\psi_i^f(\bar{x})$ is a conjunction of sphere-formulas and $\psi_i^s$ is a conjunction of Hanf-sentences and negated Hanf-sentences. Then, by Lemma 3.1, we can replace each $\psi_i^f(\bar{x})$ with a $d$-equivalent formula $\bigvee_{t \in \lambda_i} \mathrm{sph}_t(\bar{x})$, where $\lambda_i$ is a set of $r$-types with $k$ centres. Finally, we replace each $\bigvee_{t \in \lambda_i} \mathrm{sph}_t(\bar{x}) \wedge \psi_i^s$ with $\bigvee_{t_i \in \lambda_i} (\mathrm{sph}_{t_i}(\bar{x}) \wedge \psi_i^s)$. The resulting formula is in the required form.

In Theorems 4.5 and 5.6, we reduce the minimum size of the answer set required to enumerate all answers to the query $\phi$ in our approximate enumeration algorithms. We show we only actually require an answer set of size $\gamma n^c$, where $c := \mathrm{conn}(\phi, d)$ is the maximum number of connected components in the $r$-neighbourhood (where $r$ is the Hanf locality radius of $\phi$) of an answer to $\phi$. We define $\mathrm{conn}(\phi, d)$ below.

Definition 3.3 ($\mathrm{conn}(\phi, d)$). Let $\phi(\bar{x}) \in \mathrm{FO}[\sigma]$ where $|\bar{x}| = k$ and let $\chi(\bar{x})$ be the formula in the form (3.1) of Lemma 3.2 that is $d$-equivalent to $\phi$. We define $\mathrm{conn}(\phi, d)$ as the maximum number of connected components of the neighbourhood types that appear in the sphere-formulas of $\chi$.
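To make the quantity in Definition 3.3 concrete, the following minimal sketch counts the connected components of the $r$-neighbourhood of a tuple in an undirected graph. The adjacency-dictionary representation and the function names `r_neighbourhood` and `count_components` are assumptions made for illustration only; they do not appear in the paper. Under these assumptions, $\mathrm{conn}(\phi, d)$ would be the maximum of this count over representatives of the types $\tau_i$ occurring in (3.1).

```python
from collections import deque


def r_neighbourhood(adj, centres, r):
    """Return all vertices at distance at most r from some centre (multi-source BFS)."""
    dist = {c: 0 for c in centres}
    queue = deque(dist)
    while queue:
        u = queue.popleft()
        if dist[u] == r:
            continue  # do not expand beyond radius r
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return set(dist)


def count_components(adj, centres, r):
    """Count connected components of the subgraph induced by the r-neighbourhood."""
    ball = r_neighbourhood(adj, centres, r)
    seen = set()
    components = 0
    for start in ball:
        if start in seen:
            continue
        components += 1
        seen.add(start)
        stack = [start]
        while stack:
            u = stack.pop()
            for v in adj.get(u, ()):
                if v in ball and v not in seen:
                    seen.add(v)
                    stack.append(v)
    return components


# Hypothetical example: the path 1 - 2 - 3 - 4 - 5 - 6 with centres (1, 6).
path = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4, 6], 6: [5]}
print(count_components(path, (1, 6), 1))  # the two radius-1 balls are disjoint
print(count_components(path, (1, 6), 2))  # the two radius-2 balls are joined by edge 3-4
```

Every component of the neighbourhood contains at least one centre, so the count never exceeds the number of centres.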
Note that $\mathrm{conn}(\phi, d) \leq k$.

Recall that we fix a formula $\chi$ in the form (3.1) of Lemma 3.2 for each FO formula $\phi$, and hence $\mathrm{conn}(\phi, d)$ is well defined.

Local first-order queries.

We shall start by showing that for any local FO query φ we can compute a set of r -types T (where r is the locality radius) such that for any σ -db D and tuple ¯ a , ¯ a is an answer to φ on D if and only if the r -type of ¯ a is in T . Lemma 3.4.

There is an algorithm that, given a local query $\phi(\bar{x}) \in \mathrm{FO}[\sigma]$ with $k$ free variables and given the locality radius $r$ of $\phi$, computes a set of $r$-types $T$ with $k$ centres such that for any $\sigma$-db $\mathcal{D}$ and tuple $\bar{a} \in D^k$, $\bar{a} \in \phi(\mathcal{D})$ if and only if the $r$-type of $\bar{a}$ in $\mathcal{D}$ is in $T$.

Proof. Let $T$ be an empty list. For each $r$-type $\tau$ with $k$ centres we do the following. Let $\mathcal{D}_\tau$ be the fixed representative $\sigma$-db of $\tau$, where $\bar{c}$ is the centre tuple. If $\mathcal{D}_\tau \models \phi(\bar{c})$, add $\tau$ to $T$. Then since $\phi$ is local and $r$ is the locality radius of $\phi$, for every $\sigma$-db $\mathcal{D}$ and tuple $\bar{a} \in D^k$, $\mathcal{D} \models \phi(\bar{a})$ if and only if the $r$-type of $\bar{a}$ in $\mathcal{D}$ is in $T$.

Using the previous lemma we shall show that for any local FO query $\phi$, $\sigma$-db $\mathcal{D}$ and tuple $\bar{a}$ from $\mathcal{D}$, it can be decided in constant time whether $\bar{a}$ is an answer to $\phi$ on $\mathcal{D}$. We will use this when approximately enumerating answers to local FO queries.

Lemma 3.5.

Let $\phi(\bar{x}) \in \mathrm{FO}[\sigma]$ be a local query with $k$ free variables. There is an algorithm that, given a $\sigma$-db $\mathcal{D}$ and a tuple $\bar{a} \in D^k$, decides whether $\bar{a} \in \phi(\mathcal{D})$ in constant time.

Proof. Let $r$ be the locality radius of $\phi$. First we compute the set of $r$-types $T$ as in Lemma 3.4. We then compute the $r$-type $\tau$ of $\bar{a}$ in $\mathcal{D}$. By Lemma 3.4, if $\tau \in T$ then $\bar{a} \in \phi(\mathcal{D})$, and if $\tau \notin T$ then $\bar{a} \notin \phi(\mathcal{D})$.

Since $r$ does not depend on $\mathcal{D}$, the $r$-type of $\bar{a}$ in $\mathcal{D}$ can be computed in constant time. Furthermore, computing the set $T$ does not depend on $\mathcal{D}$, and hence it can be decided in constant time whether $\bar{a} \in \phi(\mathcal{D})$.

We shall finish this section with the following characterisation of local FO queries. We do not make use of this characterisation but we include it to aid intuition. The proof of the observation is straightforward but we shall give it for completeness.

Observation 3.6.

Let $\phi(\bar{x}) \in \mathrm{FO}[\sigma]$. Then $\phi$ is local if and only if $\phi$ is $d$-equivalent to a Boolean combination of sphere-formulas.

Furthermore, for any local FO query $\phi$, the locality radius of $\phi$ is equal to the Hanf locality radius of $\phi$. Therefore, since the Hanf locality radius of an FO query is computable by Theorem 2.1, the locality radius of a local FO query is also computable.

Proof.

We will give a proof of the first part of the observation only. Let $|\bar{x}| = k$. First let us assume that $\phi$ is $d$-equivalent to an FO formula $\chi$ that is a Boolean combination of sphere-formulas. Let $r$ be the Hanf locality radius of $\chi$. Then since $\chi$ contains no Hanf-sentences, for any $\sigma$-dbs $\mathcal{D}_1$ and $\mathcal{D}_2$ and tuples $\bar{a}_1 \in D_1^k$ and $\bar{a}_2 \in D_2^k$, if $N_r^{\mathcal{D}_1}(\bar{a}_1) \cong N_r^{\mathcal{D}_2}(\bar{a}_2)$ then $\mathcal{D}_1 \models \phi(\bar{a}_1)$ if and only if $\mathcal{D}_2 \models \phi(\bar{a}_2)$. Hence $\phi$ is local and $r$ is the locality radius of $\phi$.

Now let us assume that $\phi$ is local. Let $T$ be the set of $r$-types as constructed in Lemma 3.4. Then $\phi$ is $d$-equivalent to the formula $\bigvee_{\tau \in T} \mathrm{sph}_\tau(\bar{x})$, which is in the required form.

4. Enumerating Answers to Local First-Order Queries

Assume $q$ is a local FO query with $k$ free variables and $\mathcal{D}$ is a $\sigma$-db, such that the set $q(\mathcal{D})$ is larger than a fixed proportion of all possible $k$-tuples, i.e. $|q(\mathcal{D})| \geq \mu |D|^k$ for some fixed $\mu \in (0, 1)$. Then the following simple algorithm enumerates $q(\mathcal{D})$ with amortized constant delay, i.e. the average delay between any two outputs is constant. For each tuple $\bar{a} \in D^k$ (processed in, say, lexicographical order), the algorithm tests if $\bar{a}$ is in $q(\mathcal{D})$ (which can be done in constant time by Lemma 3.5 as $q$ is local) and outputs $\bar{a}$ if $\bar{a} \in q(\mathcal{D})$. Since we are assuming that $|q(\mathcal{D})|$ is larger than a fixed proportion of all possible tuples, and the overall running time of the algorithm is $O(|D|^k)$, the algorithm has constant amortized delay. In this section we prove that we can de-amortize this algorithm using random sampling.

We begin this section by showing that there exists a randomised algorithm which does the following. The input is a set $V$ which is partitioned into two sets $V_1$ and $V_2$. We assume that the algorithm can test in constant time if a given element from $V$ is in $V_1$ or $V_2$. After a constant time preprocessing phase, the algorithm enumerates a set $S$ of elements with $S \subseteq V_1$, with constant delay. Furthermore, we show that if $|V_1|$ is large enough then with high probability $S = V_1$. We then use this result to prove our main theorem of this section (Theorem 4.4) on the approximate enumeration of the answers to a local query. In Theorem 4.5 we show that the relative size of the answer set can be reduced whilst still guaranteeing that with high probability we enumerate all answers to the query.

Lemma 4.1.

Fix µ ∈ (0 , and δ ∈ (0 , . There exists a randomised algorithm which doesthe following. The input is a set V which is partitioned into two sets V and V . We assumethat the algorithm is given access to the size of V and can decide in constant time whethera given element from V is in V or V . The algorithm outputs a set S ⊆ V such that if | V | ≥ µ | V | then, with probability at least δ , S = V .The algorithm has constant preprocessing time and enumerates S with no duplicatesand constant delay between any two consecutive outputs.Proof. Let | V | = n and let us assume that V comes with a linear order over its elements, orequivalently that V = [ n ]. If V does not come with a linear order over its elements then weuse the linear order induced by the encoding of V . Let q = min((1 − µ (1 − µ )) , (1 − δ ) / B of length n . The array B contains one entry for each element in[ n ] and it is used to record sampled elements. For an element a ∈ [ n ], the entry B [ a ] is1 if a has previously been sampled and it is 0 otherwise.(2) Initialise an empty queue Q , to store tuples to be enumerated.As discussed in Section 2 an array of any size can be initialised in constant time usinglazy initialisation and hence the preprocessing phase runs in constant time.Moving on to the enumeration phase, between each output the algorithm will samplea constant number of elements as well as going through a constant number of the elementsin [ n ] in order. The enumeration phase proceeds as follows:(1) Sample α = ⌈ log − µ (1 − µ ) q ⌉ many elements uniformly and independently from [ n ] andlet t be a list of these elements.(2) Add the next ⌈ /µ ⌉ elements from [ n ] to t . If there are less than ⌈ /µ ⌉ elementsremaining just add all the remaining elements to t . (3) For each element a in t , if B [ a ] = 1, skip this element. 
Otherwise, set B [ a ] = 1 and if a is in V add a to Q .(4) If Q = ∅ , output the next element from Q ; stop otherwise.(5) Repeat Steps 1-4 until there is no element to output in Step 4.In Steps 1 and 2 a list of elements is created which is of constant size. For each elementin this list, in Step 3, the algorithm can check whether it is in V in constant time and thearrays Q and B can be read and updated in constant time. Hence, each enumeration stepcan be done in constant time. This concludes the analysis of the running time. We nowprove correctness.Clearly, no duplicates will be enumerated due to the use of the array B and the onlyelements enumerated are those that are in V . Let S be the set of elements that areenumerated. We need to show that with probability at least 2 / | V | ≥ µ | V | , then S = V .In each enumeration step we take the next ⌈ /µ ⌉ elements from [ n ]. Assuming | V | ≥ µ | V | ,after ⌈ n · µ ⌉ ≤ ⌈ µ | V |⌉ enumeration steps the algorithm will have checked every element in[ n ] and therefore S = V . Let us ﬁnd a bound on the probability that we do at least ⌈ µ | V |⌉ enumeration steps. Claim 4.2.

For all q ∈ [0 , and m ∈ N ≥ , Q mi =1 (1 − q i +12 ) ≥ − q . Proof:

First let us prove that m Y i =1 (1 − q i +12 ) ≥ − q − q − q + q m +22 by induction on m .For the base case, let m = 1, then Y i =1 (1 − q i +12 ) = 1 − q ≥ − q − q − q + q as required.Now for the inductive step. Let us assume the claim is true for m and we shall showthe claim is true for m + 1. We have m +1 Y i =1 (1 − q i +12 ) = (cid:16) m Y i =1 (1 − q i +12 ) (cid:17) · (1 − q m +22 ) ≥ (1 − q − q − q + q m +22 )(1 − q m +22 )by the inductive hypothesis.(1 − q − q − q + q m +22 )(1 − q m +22 ) = 1 − q − q − q + q m +32 + q m +42 + q m +52 − q m +2 ≥ − q − q − q + q m +32 , as q ( m +4) / + q ( m +5) / − q m +2 ≥ m Y i =1 (1 − q i +12 ) ≥ − q − q − q + q m +22 ≥ − q as required (cid:4) Claim 4.3.

Assume that | V | ≥ µn . The probability that at least ⌈ µ | V |⌉ distinct elementsfrom V are enumerated is at least − q . OWARDS APPROXIMATE QUERY ENUMERATION WITH SUBLINEAR PREPROCESSING TIME 15

Proof:

We shall start by showing that for j ∈ N , where 1 ≤ j ≤ ⌈ µ | V |⌉ , the probabilitythat at least j distinct elements from V are enumerated is at least Q ji =1 (1 − q ( i +1) / ).We shall prove this by induction on j . For the base case, let j = 1. If an elementfrom V is sampled in the ﬁrst enumeration step, then at least one element from V will beenumerated. An element that is in V is sampled with probability | V | n ≥ µnn = µ ≥ µ (1 − µ ) . The probability that out of the α elements sampled in the ﬁrst enumeration step thereis none from V is at most (1 − µ (1 − µ )) α ≤ q as α = ⌈ log − µ (1 − µ ) q ⌉ ≥ log − µ (1 − µ ) q .Therefore with probability at least 1 − q at least one element from | V | is enumerated andhence we have proved the base case.For the inductive step, assume that the claim is true for j , where 1 ≤ j < ⌈ µ | V |⌉ , weshall show it is true for j + 1. Let us assume j distinct elements from V have already beenenumerated, and a total of at least ( j + 1) α elements have been sampled (of which at least j are from V ). The probability an element from V that was not already enumerated issampled is ( | V | − j ) /n. Therefore, the probability that exactly j unique elements from V have been sampled is at most (cid:16) − | V | − jn (cid:17) ( j +1) α − j < (1 − µ (1 − µ )) ( j +1) α − j , as | V | − j > | V | − µ | V | ≥ µn (1 − µ ). Then(1 − µ (1 − µ )) ( j +1) α − j ≤ q j +1 (1 − µ (1 − µ )) j ≤ q j +1 ( q ) j = q j +22 , as α = ⌈ log − µ (1 − µ ) q ⌉ ≥ log − µ (1 − µ ) q and as q ≤ (1 − µ (1 − µ )) . 
Therefore, the probabilitythat there are at least j + 1 elements from V in these sampled tuples is at least 1 − q ( j +2) / .Then by the inductive hypothesis, the probability that at least j + 1 elements from V areenumerated is at least (cid:16) j Y i =1 (1 − q j +12 ) (cid:17) · (1 − q j +22 ) = j +1 Y i =1 (1 − q i +12 )as required.Finally, by Claim 4.2, the probability that at least ⌈ µ | V |⌉ many distinct elements from V are enumerated is at least ⌈ µ | φ ( D ) |⌉ Y i =1 (1 − q i +12 ) ≥ − q . (cid:4) By Claim 4.3 the probability that S = V if | V | ≥ µn is at least (1 − q ) ≥ δ by thechoice of q . This completes the proof.We now use Lemma 4.1 to prove the following theorem. Theorem 4.4.

Let $\phi(\bar{x}) \in \mathrm{FO}[\sigma]$ be a local query with $k$ free variables and let $\gamma \in (0, 1)$. There exists an algorithm that is given a $\sigma$-db $\mathcal{D}$ as an input and that, after a constant time preprocessing phase, enumerates a set $S$ (with no duplicates) with constant delay between any two consecutive outputs, such that:
(1) $S \subseteq \phi(\mathcal{D})$, and
(2) if $|\phi(\mathcal{D})| \geq \gamma |D|^k$ (i.e. the number of answers to the query is larger than a fixed fraction of the total possible number of answers), then with probability at least $2/3$, $S = \phi(\mathcal{D})$.

Proof. Given a tuple $\bar{a} \in D^k$ we can test in constant time whether $\bar{a} \in \phi(\mathcal{D})$ or $\bar{a} \notin \phi(\mathcal{D})$ by Lemma 3.5. We can partition the set $D^k$ into two sets based on whether a tuple is in $\phi(\mathcal{D})$ or not. Therefore the algorithm from Lemma 4.1 (with $\delta = 2/3$, $\mu = \gamma$, $V = D^k$, $V_1 = \phi(\mathcal{D})$ and $V_2 = D^k \setminus \phi(\mathcal{D})$) meets the requirements in the theorem statement.

In our algorithms, in order to achieve constant preprocessing time and constant delay we require the number of answers to the query to be some fixed fraction of the total possible number of answers. Otherwise, with high probability the algorithm would not sample an answer in the enumeration phase and the algorithm would stop.

It seems natural to expect that for queries occurring in practice, the elements of an answer tuple are within a small distance of each other in the input database (i.e. the $r$-neighbourhood of the answer has few connected components). In such scenarios, we can strengthen our main theorem by reducing the number of answers required to output all answers to the query with high probability.

Theorem 4.5.

Let $\phi(\bar{x}) \in \mathrm{FO}[\sigma]$ be a local query with locality radius $r$ and let $\gamma \in (0, 1)$. Let $c := \mathrm{conn}(\phi, d)$, i.e. the maximum number of connected components in the $r$-neighbourhood of a tuple $\bar{a} \in \phi(\mathcal{D})$ for any $\sigma$-db $\mathcal{D}$. There exists an algorithm that, given a $\sigma$-db $\mathcal{D}$ as input, after a constant time preprocessing phase enumerates a set $S$ (with no duplicates) with constant delay between any two consecutive outputs, such that the following hold.
(1) $S \subseteq \phi(\mathcal{D})$, and
(2) if $|\phi(\mathcal{D})| \geq \gamma |D|^c$, then with probability at least $2/3$, $S = \phi(\mathcal{D})$.

We defer the proof of Theorem 4.5 to Section 6.5.
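The enumeration procedure from the proof of Lemma 4.1 can be sketched as a generator. This is an illustration under stated assumptions, not the exact algorithm of the proof: the name `enumerate_v1`, the membership test `in_v1`, and in particular the default block size and the number of samples per step are placeholder choices, since the proof fixes these constants in terms of $\mu$ and $\delta$. A Python set stands in for the lazily initialised array $B$, and a deque for the queue $Q$.

```python
import math
import random
from collections import deque


def enumerate_v1(n, in_v1, mu, samples_per_step=4, block=None, rng=None):
    """Yield elements of V1 (a subset of {1, ..., n}) without duplicates.

    Between two outputs, only a constant amount of work is done: a constant
    number of random samples plus a constant-size block of a linear scan.
    samples_per_step and block are ASSUMED placeholder constants.
    """
    if n == 0:
        return
    rng = rng or random.Random()
    if block is None:
        block = math.ceil(1 / mu)  # assumed block size, constant for fixed mu
    seen = set()       # role of array B: elements already examined
    queue = deque()    # role of queue Q: found answers not yet output
    next_scan = 1      # next position of the linear scan through [n]
    while True:
        # Step 1: sample a constant number of elements uniformly from [n].
        batch = [rng.randint(1, n) for _ in range(samples_per_step)]
        # Step 2: add the next block of elements of the linear scan.
        stop = min(n, next_scan + block - 1)
        batch.extend(range(next_scan, stop + 1))
        next_scan = stop + 1
        # Step 3: record unseen elements and queue those lying in V1.
        for a in batch:
            if a in seen:
                continue
            seen.add(a)
            if in_v1(a):
                queue.append(a)
        # Step 4: output one element if available, otherwise stop.
        if not queue:
            return
        yield queue.popleft()


# Hypothetical usage: V = {1, ..., 200}, V1 = the even numbers.
answers = list(enumerate_v1(200, lambda a: a % 2 == 0, mu=0.5, rng=random.Random(1)))
```

The sketch never outputs an element outside $V_1$ and never repeats an element; whether it outputs all of $V_1$ depends on the queue not running empty before the linear scan has covered $[n]$, which is what the probabilistic analysis of Lemma 4.1 controls.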

5. Enumerating Answers to General First-Order Queries

We now shift our focus to enumerating answers to general FO queries. These can be non-local, in the sense that we cannot check whether a tuple is an answer to the query by looking only at its neighbourhood. We are aiming at sublinear preprocessing time, hence we cannot read the whole input database and will therefore need to sacrifice some accuracy. We allow our algorithms to enumerate 'close' answers as well as actual answers. We start this section by defining our notion of approximation before proving our main result.

5.1. Our Notion of Approximation.

We shall start by deﬁning our notion of closeness.

Definition 5.1 ($\epsilon$-close answers to FO queries). Let $\mathcal{D} \in C$ be a $\sigma$-db and let $\epsilon \in (0, 1]$. Let $\phi(\bar{x}) \in \mathrm{FO}[\sigma]$ be a query with $k$ free variables and Hanf locality radius $r$. A tuple $\bar{a} \in D^k$ is $\epsilon$-close to being an answer of $\phi$ on $\mathcal{D}$ and $C$ if $\mathcal{D}$ can be modified (with tuple insertions and deletions) into a $\sigma$-db $\mathcal{D}' \in C$ with at most $\epsilon d |D|$ modifications (i.e. $\mathrm{dist}(\mathcal{D}, \mathcal{D}') \leq \epsilon d |D|$) such that $\bar{a} \in \phi(\mathcal{D}')$ and the $r$-type of $\bar{a}$ in $\mathcal{D}'$ is the same as the $r$-type of $\bar{a}$ in $\mathcal{D}$.

We denote the set of all tuples that are $\epsilon$-close to being an answer of $\phi$ on $\mathcal{D}$ and $C$ as $\phi(\mathcal{D}, C, \epsilon)$. Note that $\phi(\mathcal{D}) \subseteq \phi(\mathcal{D}, C, \epsilon)$.

We shall illustrate Definition 5.1 in the following example.

Example 5.2.

On the class G d , consider the isomorphism types τ , τ and τ of the 2-neighbourhoods ( N , ( c , c )), ( N , ( c , c )) and ( N , ( c , c )) shown in Figure 1. Let φ ∈ FO[ { E } ] be given by φ ( x, y ) := sph τ ( x, y ) ∨ (sph τ ( x, y ) ∧¬ ( ∃ z ∃ w sph τ ( z, w ))) . This formulamight be useful in scenarios where ideally we want to return pairs of vertices with a speciﬁc2-type τ but if there is no such pair then returning vertex pairs with a similar 2-type willsuﬃce.Let G ∈ G d be a graph on n vertices and ǫ ∈ (0 , u, v ) ∈ V ( G ) with 2-type τ , ( u, v ) ∈ φ ( G ) and hence ( u, v ) ∈ φ ( G , G d , ǫ ).Assume ( u, v ) ∈ V ( G ) has 2-type τ . Then ( u, v ) ∈ φ ( G ) if and only if G contains novertex pair of 2-type τ . The pair ( u, v ) is in φ ( G , G d , ǫ ) if and only if G can be modiﬁed(with edge modiﬁcations) into a graph G ′ ∈ G d with at most ǫdn modiﬁcations such that( u, v ) ∈ φ ( G ′ ) and the 2-type of ( u, v ) in G ′ is still τ .For example if G is at distance at most ǫdn − d − n is large enoughsuch that ǫdn − d − >

0) from a graph G ′′ ∈ G d such that G ′′ | = ∃ x ∃ y sph τ ( x, y ) ∧¬ ( ∃ z ∃ w sph τ ( z, w )) then ( u, v ) ∈ φ ( G , G d , ǫ ). To see this let us assume that such a graph G ′′ exists. Note that as G d is closed under isomorphism we can assume that G and G ′′ are on the same vertices. Then if ( u, v ) has 2-type τ in G ′′ , ( u, v ) ∈ φ ( G , G d , ǫ ) since ǫdn − d − ≤ ǫdn . So let us assume that ( u, v ) does not have 2-type τ in G ′′ . Let( u , v ) ∈ V ( G ′′ ) have 2-type τ (we know one exists). Then we remove every edge thathas u , v , u or v as an endpoint (there are at most 2 d + 3 such edges), and then for eachedge we removed we insert the same edge back in but swapping any endpoint u to u and v to v and vice versa (this requires at most 2(2 d + 3) many edge modiﬁcations in total).By doing this we have essentially just swapped the labels of the vertices u and u and v and v . Hence in the resulting graph G ′ , ( u, v ) has 2-type τ and G ′ still contains no pair ofvertices with 2-type τ . Therefore ( u, v ) ∈ φ ( G ′ ), and the distance between G and G ′ is atmost ǫdn − d − d + 3) = ǫdn .Finally, for any pair ( u, v ) ∈ V ( G ) with 2-type τ , ( u, v ) φ ( G ) and ( u, v ) φ ( G , G d , ǫ )as for every G ′ ∈ G d there does not exist a pair with 2-type τ that is in φ ( G ′ ).The set φ ( D , C , ǫ ) contains all tuples that are ǫ -close to being answers to φ . A tuple¯ a ∈ D k is in φ ( D , C , ǫ ) if only a relatively small (at most ǫdn ) number of modiﬁcations to D are needed to make ¯ a an answer to φ without changing ¯ a ’s neighbourhood type. Thiscan be seen as a notion of structural approximation. One might be tempted to deﬁne φ ( D , C , ǫ ) diﬀerently, namely as the set of tuples that can be turned into an answer to φ on D (without necessarily preserving the neighbourhood type) with at most ǫdn modiﬁcationsto D . 
However, if φ ( D ) = ∅ , say, ¯ a ∈ φ ( D ), then we can turn any tuple ¯ b ∈ D k into an answerfor φ on D with only a constant number of modiﬁcations. This can be done by exchanging¯ b ’s r -neighbourhood with ¯ a ’s, for some r depending on φ . This is not meaningful.Let χ be as in (3.1) of Lemma 3.2 for φ . Note that only tuples with a neighbourhoodtype that appears in χ can be in the set φ ( D , C , ǫ ). Nevertheless, the diﬀerence | φ ( D , C , ǫ ) |−| φ ( D ) | can be unbounded. The following example demonstrates this. Example 5.3.

Let φ , τ and τ be as in Example 5.2. For m ∈ N ≥ , let G ,m be thegraph that contains m disjoint copies of τ and 1 disjoint copy of τ . Note that G ,m has n = 8( m + 1) vertices. The graph G ,m can be modiﬁed with one edge modiﬁcation to forma graph which satisﬁes ∃ x ∃ y sph τ ( x, y ) ∧ ¬ ( ∃ z ∃ w sph τ ( z, w )) without modifying the 2-typeof any pair ( u, v ) ∈ V ( G ,m ) with 2-type τ in G ,m . Therefore if 1 ≤ ǫdn then every pair ( u, v ) ∈ V ( G ,m ) with 2-type τ is in φ ( G ,m , G d , ǫ ). Hence, assuming 1 ≤ ǫdn we have | φ ( G ,m , G d , ǫ ) | − | φ ( G ,m ) | = m + 1 − n ).While φ ( D , C , ǫ ) is a structural approximation of φ ( D ), Example 5.3 illustrates thatit may not be a numerical approximation. However, in scenarios where the focus lies onstructural closeness, this might not be an issue.We say that the problem Enum C ( φ ) can be solved approximately with O ( H ( n )) prepro-cessing time and constant delay for answer threshold function f ( n ), if for every parameter ǫ ∈ (0 , D ∈ C and | D | = n as an input, that proceeds in two steps.(1) A preprocessing phase that runs in time O ( H ( n )), and(2) an enumeration phase that enumerates a set S of distinct tuples with constant delaybetween any two consecutive outputs.Moreover, we require that with probability at least 2 / S ⊆ φ ( D ) ∪ φ ( D , C , ǫ ) and, if | φ ( D ) | ≥ f ( n ), then φ ( D ) ⊆ S . The algorithm can make oracle queries of the form ( R, i, j )as discussed in Section 2 which allows us to explore bounded radius neighbourhoods inconstant time. We call such an algorithm an ǫ -approximate enumeration algorithm .5.2. Main Results.

Before proving our main result of this section on the approximateenumeration of general ﬁrst-order queries, we start by proving the following lemma. Inthis lemma, we show that for a given database D and FO query φ we can compute a setof neighbourhood types in polylogarithmic time, that with high probability only containsthe neighbourhood types of tuples that are answers or close to being answers to φ on D .To compute this set we write φ in the form (3.1) as in Lemma 3.2 and then run propertytesters on the sentence parts to determine with high probability whether tuples with thecorresponding r -type (the r -type that appears in the sphere-formula) are answers to φ onthe input database or are far from being an answer to φ on the input database. Lemma 5.4.

Let $\phi(\bar{x}) \in \mathrm{FO}[\sigma]$ with $|\bar{x}| = k$ and Hanf locality radius $r$ and let $\epsilon \in (0, 1]$. There exists an algorithm $A_\epsilon$ which, given oracle access to a $\sigma$-db $\mathcal{D} \in C_d^t$ as input along with $|D| = n$, computes a set $T$ of $r$-types with $k$ centres such that with probability at least $5/6$, for any $\bar{a} \in D^k$,
(1) if $\bar{a} \in \phi(\mathcal{D})$, then the $r$-type of $\bar{a}$ in $\mathcal{D}$ is in $T$, and
(2) if $\bar{a} \in D^k \setminus \phi(\mathcal{D}, C_d^t, \epsilon)$, then the $r$-type of $\bar{a}$ in $\mathcal{D}$ is not in $T$.
Furthermore, $A_\epsilon$ runs in polylogarithmic time.

Proof. If $n < 8k/\epsilon$ then we do a full check of $\mathcal{D}$ and form the set $T$ exactly. Otherwise, $A_\epsilon$ starts by computing the formula $\chi(\bar{x})$ that is $d$-equivalent to $\phi$ and is in the form (3.1) as in Lemma 3.2. Let $m$ be as in Lemma 3.2. By Theorem 2.4, any sentence definable in FO is uniformly testable on $C_d^t$ in polylogarithmic time. Hence for every $i \in [m]$ there exists an $\epsilon/2$-tester that accepts with probability at least $2/3$ if the input satisfies $\exists \bar{x}\, \mathrm{sph}_{\tau_i}(\bar{x}) \wedge \psi_i^s$ and rejects with probability at least $2/3$ if the input is $\epsilon/2$-far from satisfying $\exists \bar{x}\, \mathrm{sph}_{\tau_i}(\bar{x}) \wedge \psi_i^s$. We can amplify this probability to $(5/6)^{1/m}$ by repeating the tester a constant number of times, and we denote the resulting $\epsilon/2$-tester by $\pi_i$. Next, $A_\epsilon$ computes the set $T$ as follows.
(1) Let $T = \emptyset$.
(2) For each $i \in [m]$, run $\pi_i$ with $\mathcal{D}$ as input, and if $\pi_i$ accepts, then add $\tau_i$ to $T$.

By Lemma 3.2, $\chi(\bar{x})$ can be computed in constant time (only dependent on $d$, $\|\phi\|$ and $\|\sigma\|$). Moreover, each $\epsilon/2$-tester $\pi_i$ runs in polylogarithmic time. Since $m$ is a constant, $A_\epsilon$ runs in polylogarithmic time.

It now only remains to prove correctness. Let $\bar{a} \in D^k$ and let $\tau$ be the $r$-type of $\bar{a}$ in $\mathcal{D}$. Let us assume that each $\pi_i$ correctly accepts if $\mathcal{D}$ satisfies $\exists \bar{x}\, \mathrm{sph}_{\tau_i}(\bar{x}) \wedge \psi_i^s$ and correctly rejects if $\mathcal{D}$ is $\epsilon/2$-far from satisfying $\exists \bar{x}\, \mathrm{sph}_{\tau_i}(\bar{x}) \wedge \psi_i^s$, which happens with probability at least $(5/6)^{(1/m) \cdot m} = 5/6$.

First let us assume that $\bar{a} \in \phi(\mathcal{D})$. We shall show that $\tau \in T$. Since $\mathcal{D} \models \phi(\bar{a})$, there exists at least one $i \in [m]$ such that $\mathcal{D} \models \mathrm{sph}_{\tau_i}(\bar{a}) \wedge \psi_i^s$ (as $\phi$ is $d$-equivalent to $\chi(\bar{x}) = \bigvee_{i \in [m]} (\mathrm{sph}_{\tau_i}(\bar{x}) \wedge \psi_i^s)$). Hence $\mathcal{D} \models \exists \bar{x}\, \mathrm{sph}_{\tau_i}(\bar{x}) \wedge \psi_i^s$ and, as we are assuming $\pi_i$ correctly accepted, the $r$-type $\tau_i$ will have been added to $T$. Since $\mathcal{D} \models \mathrm{sph}_{\tau_i}(\bar{a})$, $\tau_i = \tau$, and therefore $\tau \in T$.

Now let us assume that $\bar{a} \in D^k \setminus \phi(\mathcal{D}, C_d^t, \epsilon)$. We shall show that $\tau \notin T$. For a contradiction let us assume that $\tau \in T$, and hence there must exist some $i \in [m]$ such that $\mathcal{D}$ is $\epsilon/2$-close to satisfying $\exists \bar{x}\, \mathrm{sph}_{\tau_i}(\bar{x}) \wedge \psi_i^s$ on $C_d^t$ and $\tau_i = \tau$. By definition there exists a $\sigma$-db $\mathcal{D}' \in C_d^t$ such that $\mathcal{D}' \models \exists \bar{x}\, \mathrm{sph}_{\tau_i}(\bar{x}) \wedge \psi_i^s$ and $\mathrm{dist}(\mathcal{D}, \mathcal{D}') \leq \epsilon d n / 2$. Since any property defined by an FO sentence on $C_d^t$ is closed under isomorphism, we can assume that $\mathcal{D}'$ can be obtained from $\mathcal{D}$ with at most $\epsilon d n / 2$ tuple modifications. If in $\mathcal{D}'$ the $r$-type of $\bar{a}$ is no longer $\tau$, then we can modify $\mathcal{D}'$ with at most $4dk$ tuple modifications into a $\sigma$-db $\mathcal{D}'' \in C_d^t$ such that the $r$-type of $\bar{a}$ is $\tau$ in $\mathcal{D}''$ and $\mathcal{D}'' \cong \mathcal{D}'$ (and hence $\bar{a} \in \phi(\mathcal{D}'')$). To do this we choose a tuple $\bar{b}$ whose $r$-type is $\tau$ in $\mathcal{D}'$ and, for any tuple that contains an element from $\bar{a}$ or $\bar{b}$, delete it and add back the same tuple but with the elements from $\bar{a}$ exchanged for the corresponding elements from $\bar{b}$ and vice versa. This requires at most $4dk$ tuple modifications. Hence $\mathrm{dist}(\mathcal{D}, \mathcal{D}'') \leq \epsilon d n / 2 + 4dk \leq \epsilon d n$ if $n \geq 8k/\epsilon$ (which we can assume, as otherwise we do a full check of $\mathcal{D}$ and compute $T$ exactly), and so by definition $\bar{a} \in \phi(\mathcal{D}, C_d^t, \epsilon)$, which is a contradiction. Therefore $\tau \notin T$.

Hence with probability at least $5/6$, for every $\bar{a} \in D^k$, if $\bar{a} \in \phi(\mathcal{D})$, then the $r$-type of $\bar{a}$ in $\mathcal{D}$ is in $T$, and if $\bar{a} \in D^k \setminus \phi(\mathcal{D}, C_d^t, \epsilon)$, then the $r$-type of $\bar{a}$ in $\mathcal{D}$ is not in $T$.

We now use Lemmas 4.1 and 5.4 to prove our main result of this section (Theorem 5.5).

Theorem 5.5.

Let $\phi(\bar x) \in \mathrm{FO}[\sigma]$ where $|\bar x| = k$. Then $\mathrm{Enum}_{\mathcal{C}_d^t}(\phi)$ can be solved approximately with polylogarithmic preprocessing time and constant delay for answer threshold function $f(n) = \gamma n^k$ for any parameter $\gamma \in (0,1)$.

Proof. Let $\mathcal{D} \in \mathcal{C}_d^t$ with $|D| = n$, let $\epsilon \in (0,1]$ and let $\gamma \in (0,1)$. We construct an $\epsilon$-approximate enumeration algorithm for $\mathrm{Enum}_{\mathcal{C}_d^t}(\phi)$ that has answer threshold function $f(n) = \gamma n^k$, polylogarithmic preprocessing time and constant delay.

In the preprocessing phase, the algorithm starts by running the algorithm from Lemma 5.4 on $\mathcal{D}$ to compute a set $T$ of $r$-types with $k$ centres. Then the algorithm from the proof of Lemma 4.1 is run with $\mu = \gamma$, $\delta = 5/6$, $V = D^k$, $V_1 = \{\bar a \in D^k \mid \text{the } r\text{-type of } \bar a \text{ in } \mathcal{D} \text{ is in } T\}$ and $V_2 = D^k \setminus V_1$.

By Lemma 5.4, the set $T$ is computed in polylogarithmic time. Hence, as the preprocessing phase from the proof of Lemma 4.1 runs in constant time, the whole preprocessing phase runs in polylogarithmic time. By Lemma 4.1 there is constant delay between any two consecutive outputs. This concludes the analysis of the running time.

We now prove correctness. Let $S$ be the set of tuples enumerated. By Lemma 4.1 no duplicates are enumerated and $S \subseteq V_1 = \{\bar a \in D^k \mid \text{the } r\text{-type of } \bar a \text{ is in } T\}$. By Lemma 5.4, with probability at least $5/6$, for every $\bar a \in D^k$, if $\bar a \in \phi(\mathcal{D})$, then the $r$-type of $\bar a$ in $\mathcal{D}$ is in $T$, and if $\bar a \in D^k \setminus \phi(\mathcal{D}, \mathcal{C}_d^t, \epsilon)$, then the $r$-type of $\bar a$ in $\mathcal{D}$ is not in $T$. Therefore with probability at least $5/6$, $\phi(\mathcal{D}) \subseteq V_1$ and $V_1 \subseteq \phi(\mathcal{D}, \mathcal{C}_d^t, \epsilon)$. Hence with probability at least $5/6 > 2/3$, $S \subseteq \phi(\mathcal{D}) \cup \phi(\mathcal{D}, \mathcal{C}_d^t, \epsilon)$ as required.

As previously discussed, with probability at least $5/6$, $\phi(\mathcal{D}) \subseteq V_1$. Note that if $\phi(\mathcal{D}) \subseteq V_1$, then $|V_1| \ge |\phi(\mathcal{D})|$. If we assume that $\phi(\mathcal{D}) \subseteq V_1$ and $|\phi(\mathcal{D})| \ge \gamma n^k = \gamma |V|$, then $|V_1| \ge \gamma |V|$ and by Lemma 4.1 with probability at least $5/6$, $S = V_1$ and hence $\phi(\mathcal{D}) \subseteq S$. Therefore the probability that $\phi(\mathcal{D}) \subseteq S$ if $|\phi(\mathcal{D})| \ge \gamma n^k$ is at least $(5/6)^2 > 2/3$.

Theorem 5.6.

Let $\phi(\bar x) \in \mathrm{FO}[\sigma]$ and let $c := \mathrm{conn}(\phi, d)$. Then the problem $\mathrm{Enum}_{\mathcal{C}_d^t}(\phi)$ can be solved approximately with polylogarithmic preprocessing time and constant delay for answer threshold function $f(n) = \gamma n^c$ for any parameter $\gamma \in (0,1)$.

We defer the proof of Theorem 5.6 to Section 6.

6. Proofs of Theorems 4.5 and 5.6

Before we prove Theorems 4.5 and 5.6 we start with some definitions (which are based on those introduced by Kazana and Segoufin in [17]) and some lemmas.

For each type $\tau \in T_r^{\sigma,d}(k)$ we fix a representative for the corresponding $r$-type and fix a linear order among its elements (where, for technical reasons, the centre elements always come first). This way, we can speak of the first, second, ..., element of an $r$-type. Let $\mathcal{D}$ be a $\sigma$-db and let $\bar a$ be a tuple in $\mathcal{D}$ with $r$-type $\tau$. For technical reasons, if there are multiple isomorphism mappings from the $r$-neighbourhood of $\bar a$ to the fixed representative of $\tau$, we use the isomorphism mapping which is of smallest lexicographical order (recall that we assume that $\mathcal{D}$ comes with a linear ordering on its elements). The cardinality of $\tau$, denoted as $|\tau|$, is the number of elements in its representative.

Let $\mathcal{D}$ be a $\sigma$-db and $\bar a$ be a tuple of elements from $\mathcal{D}$. We say that $\bar a$ is $r$-connected if the $r$-neighbourhood of $\bar a$ in $\mathcal{D}$ is connected.

Let $s \in \mathbb{N}$, let $F = (\alpha_2, \dots, \alpha_m)$ be a sequence of elements from $[d^{s+1}]$ (recall that the maximum size of an $s$-neighbourhood is $d^{s+1}$), and let $\bar x = (x_1, \dots, x_m)$ be a tuple. We write $\bar x = F(x_1)$ for the fact that, for $j \in \{2, \dots, m\}$, $x_j$ is the $\alpha_j$-th element of the $s$-neighbourhood of $x_1$. We call each such $F$ an $s$-binding of $\bar x$. Given an $s$-type $\tau$, we say that an $s$-binding $F$ of $\bar x$ is $r$-good for $\tau$ if $F(x_1)$ is $r$-connected for every $x_1$ with type $\tau$.

For a given tuple $\bar x = (x_1, \dots, x_k)$, an $r$-split of $\bar x$ is a set of triples $C = \{(C_1, F_1, \tau_1), \dots, (C_\ell, F_\ell, \tau_\ell)\}$ where for each $i \in [\ell]$
• $\emptyset \ne C_i \subseteq \bar x$, $C_i \cap C_j = \emptyset$ for $i \ne j \in [\ell]$ and $\bigcup_{1 \le i \le \ell} C_i = \{x_1, \dots, x_k\}$,
• $\tau_i$ is a $3rk$-type with 1 centre, and
• $F_i = (\alpha_2, \dots, \alpha_{|C_i|})$ is a $3rk$-binding of a tuple with $|C_i|$ elements such that for each $j \in \{2, \dots, |C_i|\}$, $\alpha_j \in [|\tau_i|]$ and $F_i$ is $r$-good for $\tau_i$.
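To make the notion of an $s$-binding concrete, the following Python sketch applies a binding to an element of a graph given as an adjacency dictionary. It is only an illustration under stated assumptions: the function names `ordered_neighbourhood` and `apply_binding` are hypothetical, the database is simplified to its Gaifman graph, and the element order is a breadth-first order with ties broken by the element ordering, which merely stands in for the order induced by the fixed representative of the type.

```python
from collections import deque


def ordered_neighbourhood(adj, x, s):
    """Elements within distance s of x, in breadth-first order.

    Assumption: BFS order (ties broken by the element order) stands in for
    the linear order that the fixed representative of a type would induce.
    """
    seen = {x}
    order = [x]
    frontier = deque([(x, 0)])
    while frontier:
        v, dist = frontier.popleft()
        if dist == s:
            continue  # do not explore beyond radius s
        for w in sorted(adj[v]):
            if w not in seen:
                seen.add(w)
                order.append(w)
                frontier.append((w, dist + 1))
    return order


def apply_binding(adj, x, s, binding):
    """Return (x, y_2, ..., y_m), where y_j is the alpha_j-th element
    (1-indexed) of the ordered s-neighbourhood of x.

    Returns None if some index exceeds the neighbourhood size.
    """
    order = ordered_neighbourhood(adj, x, s)
    if any(alpha < 1 or alpha > len(order) for alpha in binding):
        return None
    return (x,) + tuple(order[alpha - 1] for alpha in binding)
```

On the path 1-2-3-4, for instance, the 1-neighbourhood of 2 is ordered as 2, 1, 3, so the binding `(3,)` applied to 2 selects the tuple `(2, 3)`. Because the neighbourhood has bounded size on bounded degree inputs, each application takes time independent of the database size.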

We write $\bar x_i$ to represent the variables from $C_i$, $x_i^1$ to represent the most significant variable from $C_i$ (i.e. the variable in $C_i$ which appears first in the tuple $\bar x$), $x_i^2$ to represent the second most significant variable from $C_i$ (i.e. the variable in $C_i$ which appears second in the tuple $\bar x$) and so on. We define the formula

$$\mathrm{Split}_r^C(\bar x) := \bigwedge_{1 \le i \ne j \le \ell} \big(N_r(\bar x_i) \cap N_r(\bar x_j) = \emptyset\big) \wedge \bigwedge_{(C_i, F_i, \tau_i) \in C} \big(\bar x_i = F_i(x_i^1) \wedge \mathrm{sph}_{\tau_i}(x_i^1)\big).$$

We let $S_r^{\sigma,d}(k)$ denote the set of $r$-splits of tuples with $k$ elements for $\sigma$-dbs with degree at most $d$. We denote the cardinality of $S_r^{\sigma,d}(k)$ as $s(r,k)$.

Remark 6.1. For any $r, k \in \mathbb{N}$, $\sigma$-db $\mathcal{D}$ and tuple $\bar a \in D^k$ there exists exactly one $r$-split $C$ such that $\mathcal{D} \models \mathrm{Split}_r^C(\bar a)$.

Let $\mathcal{D}$ be a $\sigma$-db, let $r, k, c \in \mathbb{N}$ where $c \le k$ and let $C$ be an $r$-split for a tuple with $k$ elements. For tuples $\bar a \in D^c$ and $\bar b \in D^k$ we say that $\bar b$ is found from $\bar a$ and $C$ if $c = |C|$, $\mathcal{D} \models \mathrm{Split}_r^C(\bar b)$ and for every $i \in [c]$, the element $b_i^1$ (from $\bar b$) according to $C$ is equal to $a_i$. Intuitively, $\bar a$ consists of the most significant elements from $\bar b$ according to $C$.

Remark 6.2. Let $\mathcal{D}$ be a $\sigma$-db and let $r, k \in \mathbb{N}$. For any $\bar b \in D^k$ there exists exactly one $r$-split $C$ (of a tuple with $k$ elements) and tuple $\bar a$ from $\mathcal{D}$ such that $\bar b$ is found from $\bar a$ and $C$.

Lemma 6.3.

Let $r, k, c \in \mathbb{N}$ where $c \le k$. There exists an algorithm which, given a $\sigma$-db $\mathcal{D}$, a tuple $\bar a \in D^c$ and an $r$-split $C$ of a tuple with $k$ elements as input, returns a tuple $\bar b \in D^k$ that is found from $\bar a$ and $C$ if one exists and returns false otherwise. Furthermore, if such a $\bar b$ exists then it is unique. The running time of the algorithm depends only on $r$, $|C|$, $k$, $\sigma$ and $d$.

Proof. Let $\mathcal{D}$ be a $\sigma$-db, let $\bar a \in D^c$ and let $C$ be an $r$-split of a tuple with $k$ elements. The following algorithm returns a tuple $\bar b \in D^k$ that is found from $\bar a$ and $C$ if one exists and returns false otherwise.
(1) If $|C| \ne c$ or $\mathcal{D} \not\models \bigwedge_{(C_i, F_i, \tau_i) \in C} \mathrm{sph}_{\tau_i}(a_i)$ then return false.
(2) For each $i \in [c]$, let $\bar b_i$ be the tuple whose first element is $a_i$ such that $\mathcal{D} \models (\bar b_i = F_i(a_i))$. Then let $\bar b$ be the tuple found by combining all the $\bar b_i$ according to $C$.
(3) If $\mathcal{D} \models \bigwedge_{1 \le i \ne j \le c} (N_r(\bar b_i) \cap N_r(\bar b_j) = \emptyset)$, return $\bar b$. Otherwise, return false.

The $3rk$-neighbourhood of an element can be computed in time dependent only on $r$, $k$, $\sigma$ and $d$. Hence Steps 1 and 2 run in time dependent only on $r$, $k$, $\sigma$, $d$ and $|C|$, since each $\bar b_i$ can be found by exploring the $3rk$-neighbourhood of $a_i$. In Step 3, for every $i \in [c]$, $N_r(\bar b_i)$ can be computed in time dependent only on $r$, $|\bar b_i| \le k$, $\sigma$ and $d$, and hence the running time of Step 3 also depends only on $r$, $k$, $\sigma$, $d$ and $|C|$. Therefore the overall running time of the algorithm depends only on $r$, $k$, $\sigma$, $d$ and $|C|$ as required.

Assume a tuple $\bar b$ is returned by the above algorithm from $C$ and $\bar a$. Then clearly $c = |C|$, $\mathcal{D} \models \mathrm{Split}_r^C(\bar b)$ and each $b_i^1$ according to $C$ is equal to $a_i$. Therefore $\bar b$ is found from $\bar a$ and $C$.

Now assume that there does exist a tuple $\bar b \in D^k$ that is found from $\bar a$ and $C$. Then $\bar b$ is unique, as there is only one way to choose each tuple $\bar b_i$ such that $\mathcal{D} \models (\bar b_i = F_i(a_i))$. Furthermore, it is easy to see that $\bar b$ will be output by the above algorithm.

Lemma 6.4.

Let $T \subseteq T_r^{\sigma,d}(k)$. We can compute a set of $r$-splits $S$ for $\bar x = (x_1, \dots, x_k)$ such that the following holds: For any $\sigma$-db $\mathcal{D}$ and tuple $\bar a \in D^k$, $\mathcal{D} \models \bigvee_{\tau \in T} \mathrm{sph}_\tau(\bar a)$ if and only if $\mathcal{D} \models \bigvee_{C \in S} \mathrm{Split}_r^C(\bar a)$.

Proof. The algorithm proceeds as follows. Let $S$ be an empty set. For each possible $r$-split $C = \{(C_1, F_1, \tau_1), \dots, (C_\ell, F_\ell, \tau_\ell)\}$ of the tuple $\bar x$ do the following. Let $\mathcal{D}$ be the disjoint union of the fixed representatives of each $\tau_i$. Let $\bar b \in D^k$ be a tuple such that $\mathcal{D} \models \mathrm{Split}_r^C(\bar b)$ (note that such a tuple exists by the definition of an $r$-split). Then if $\bar b$'s $r$-type in $\mathcal{D}$ is in $T$, add $C$ to $S$.

Towards correctness let $\mathcal{D}$ be a $\sigma$-db and let $\bar a \in D^k$. Let $C = \{(C_1, F_1, \tau_1), \dots, (C_\ell, F_\ell, \tau_\ell)\}$ be the $r$-split such that $\mathcal{D} \models \mathrm{Split}_r^C(\bar a)$ (note that $C$ is unique by Remark 6.1). Let $\mathcal{D}'$ be the disjoint union of the fixed representatives of each $3rk$-type that appears in $C$ and let $\bar b \in (D')^k$ be a tuple such that $\mathcal{D}' \models \mathrm{Split}_r^C(\bar b)$. It remains to show that $N_r^{\mathcal{D}}(\bar a) \cong N_r^{\mathcal{D}'}(\bar b)$. This completes the proof because, by the construction of $S$, it implies that $C \in S$ if and only if the $r$-type of $\bar a$ in $\mathcal{D}$ is in $T$ (i.e. $\mathcal{D} \models \bigvee_{\tau \in T} \mathrm{sph}_\tau(\bar a)$ if and only if $\mathcal{D} \models \bigvee_{C \in S} \mathrm{Split}_r^C(\bar a)$).

Recall that we use $a_i^j$ and $b_i^j$ to represent the elements from $\bar a$ and $\bar b$ respectively that are the elements from $C_i$ that appear $j$-th in the tuples $\bar a$ and $\bar b$ respectively. As $\mathcal{D} \models \mathrm{Split}_r^C(\bar a)$ and $\mathcal{D}' \models \mathrm{Split}_r^C(\bar b)$, by the definition of the formula $\mathrm{Split}_r^C(\bar x)$, it follows that $N_{3rk}^{\mathcal{D}}(a_i^1) \cong N_{3rk}^{\mathcal{D}'}(b_i^1)$ for every $i \in [\ell]$. For every $i \in [\ell]$ and $j \in [|C_i|]$, $a_i^j$ is at distance at most $(2r+1)(|C_i| - 1) \le (2r+1)(k-1) \le 3rk - r$ from $a_i^1$ in $\mathcal{D}$, and $b_i^j$ is at distance at most $(2r+1)(|C_i| - 1) \le (2r+1)(k-1) \le 3rk - r$ from $b_i^1$ in $\mathcal{D}'$ (since each $F_i$ is $r$-good for $\tau_i$).
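The distance bound used above can be checked directly. The short derivation below is a sketch that assumes $r \ge 1$ and $k \ge 1$, and assumes that the bound reads $(2r+1)(k-1) \le 3rk - r$; the constant 3 is inferred here from the $3rk$-neighbourhoods used throughout this section.

```latex
\begin{align*}
  3rk - r - (2r+1)(k-1)
    &= 3rk - r - (2rk - 2r + k - 1) \\
    &= rk + r - k + 1 \\
    &= (r-1)k + r + 1 \;\ge\; 2 \qquad \text{for } r \ge 1,\ k \ge 1.
\end{align*}
% Consequently, any element at distance at most r from a_i^j lies at
% distance at most (3rk - r) + r = 3rk from a_i^1.
```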
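Therefore for every $i \in [\ell]$, the $r$-neighbourhoods of $\bar a_i$ and $\bar b_i$ are contained in the $3rk$-neighbourhoods of $a_i^1$ and $b_i^1$ respectively and hence $N_r^{\mathcal{D}}(\bar a_i) \cong N_r^{\mathcal{D}'}(\bar b_i)$. Then since $N_r^{\mathcal{D}}(\bar a_i) \cap N_r^{\mathcal{D}}(\bar a_j) = \emptyset$ and $N_r^{\mathcal{D}'}(\bar b_i) \cap N_r^{\mathcal{D}'}(\bar b_j) = \emptyset$ for $i \ne j$ (as $\mathcal{D} \models \mathrm{Split}_r^C(\bar a)$ and $\mathcal{D}' \models \mathrm{Split}_r^C(\bar b)$), it follows that $N_r^{\mathcal{D}}(\bar a) \cong N_r^{\mathcal{D}'}(\bar b)$.

Let us first prove Theorem 5.6.

Proof of Theorem 5.6.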

Let $\mathcal{D} \in \mathcal{C}_d^t$ with $|D| = n$, let $\epsilon \in (0,1]$ and let $\gamma \in (0,1)$. We construct an $\epsilon$-approximate enumeration algorithm for $\mathrm{Enum}_{\mathcal{C}_d^t}(\phi)$ that has answer threshold function $f(n) = \gamma n^c$, polylogarithmic preprocessing time and constant delay.

In the preprocessing phase, the algorithm starts by running the algorithm from Lemma 5.4 on $\mathcal{D}$ to compute a set $T$ of $r$-types with $k$ centres. The algorithm then computes the set of $r$-splits $S$ from $T$ as in Lemma 6.4. An empty queue $Q$ is then initialised, which will store tuples to be output in the enumeration phase.

Let $V = \bigcup_{1 \le i \le c} D^i$. Let $V_1$ be the set that contains all $\bar a \in V$ such that there exist $C \in S$ and $\bar b \in D^k$ where $\bar b$ is found from $\bar a$ and $C$. Finally let $V_2 = V \setminus V_1$. Note that by Lemma 6.3, given a tuple $\bar a \in V$ it can be decided in constant time whether $\bar a \in V_1$.

The algorithm from Lemma 4.1 is then run with $\mu = \gamma/(c \cdot s(r,k))$, $\delta = 4/5$ and $V$, $V_1$ and $V_2$ as defined above. Once the enumeration phase of the algorithm from Lemma 4.1 starts we do the following.
(1) Each time a tuple $\bar a$ is enumerated from the algorithm from Lemma 4.1, for each $C \in S$: run the algorithm from Lemma 6.3 with $\bar a$ and $C$ and, if a tuple is returned, add it to $Q$.
(2) If $Q \ne \emptyset$, output the next tuple from $Q$; stop otherwise.
(3) Repeat Steps 1-2 until there is no tuple to output in Step 2.

From Lemma 5.4 the set $T$ can be computed in polylogarithmic time. The set $S$ can be constructed in constant time, as $|T|$ is a constant and the number of possible $r$-splits for a $k$-tuple is also a constant. Then, as the preprocessing phase from the algorithm from Lemma 4.1 runs in constant time, the overall running time of the preprocessing phase is polylogarithmic.

In the enumeration phase, by Lemma 4.1 there is constant delay between the outputs of the tuples $\bar a$ used in Step 1. For every such tuple, by the definition of the set $V_1$, there exists at least one $r$-split in $S$ that leads to a tuple being added to $Q$.
Then as $|S|$ is a constant and the algorithm from Lemma 6.3 runs in constant time, the enumeration phase has constant delay as required. This concludes the analysis of the running time. Let us now prove correctness.

By Lemma 4.1, in Step 1 of the enumeration phase no duplicate tuples $\bar a$ will be considered. Since for every tuple $\bar b \in D^k$ there exists exactly one $r$-split $C$ and tuple $\bar a$ from $\mathcal{D}$ such that $\bar b$ is found from $C$ and $\bar a$ (Remark 6.2), no duplicates will be enumerated.

Now let us assume that the set of $r$-types $T$ was computed correctly (i.e. for any $\bar a \in D^k$, if $\bar a \in \phi(\mathcal{D})$, then the $r$-type of $\bar a$ in $\mathcal{D}$ is in $T$, and if $\bar a \in D^k \setminus \phi(\mathcal{D}, \mathcal{C}_d^t, \epsilon)$, then the $r$-type of $\bar a$ in $\mathcal{D}$ is not in $T$), which happens with probability at least $5/6$. Let $\bar b \in D^k$ have $r$-type $\tau$ in $\mathcal{D}$ and let $C \in S_r^{\sigma,d}(k)$ be such that $\mathcal{D} \models \mathrm{Split}_r^C(\bar b)$.

If $\bar b \in D^k \setminus \phi(\mathcal{D}, \mathcal{C}_d^t, \epsilon)$, then $\tau \notin T$ and hence by Lemma 6.4, $C \notin S$ and so $\bar b$ will not be enumerated. Therefore with probability at least $5/6$, only tuples in $\phi(\mathcal{D}, \mathcal{C}_d^t, \epsilon)$ will be enumerated.

If $\bar b \in \phi(\mathcal{D})$, then $\tau \in T$ and hence by Lemma 6.4, $C \in S$. Let $\bar a$ be the tuple such that $\bar b$ is found from $\bar a$ and $C$. Note that as the maximum number of connected components in the $r$-neighbourhood of $\bar b$ in $\mathcal{D}$ is $c$, $|\bar a| \le c$ and hence $\bar a \in V$. Then by definition $\bar a \in V_1$. Hence if every tuple from $V_1$ is considered in Step 1 of the enumeration phase, every tuple in $\phi(\mathcal{D})$ will be enumerated. By Lemma 4.1, with probability at least $\delta$, if $|V_1| \ge \mu |V|$ then every tuple from $V_1$ will be considered in Step 1. We know that $|V_1| \ge |\phi(\mathcal{D})|/s(r,k)$, as every $\bar a \in V_1$ leads us to at most $|S| \le s(r,k)$ many tuples from $\phi(\mathcal{D})$ (since by Lemma 6.3 for any $r$-split $C \in S$ there is at most one tuple that is found from $\bar a$ and $C$). If $|\phi(\mathcal{D})| \ge \gamma n^c$ then $|V_1| \ge \gamma n^c / s(r,k) \ge \gamma |V| / (c \cdot s(r,k)) = \mu |V|$, as $|V| = \sum_{i=1}^{c} n^i \le c n^c$ and by the choice of $\mu$. Hence if $|\phi(\mathcal{D})| \ge \gamma n^c$, with probability at least $\delta \cdot 5/6 = 2/3$, every tuple in $\phi(\mathcal{D})$ will be enumerated.
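The enumeration phase described in Steps 1-3 of this proof can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: `candidates` stands in for the stream of tuples produced by the algorithm from Lemma 4.1, `find_tuple` stands in for the algorithm from Lemma 6.3 (returning `None` instead of false), and both names, like `enumerate_answers` itself, are hypothetical.

```python
from collections import deque
from typing import Callable, Hashable, Iterable, Iterator, Optional, Sequence, Tuple

Tup = Tuple[Hashable, ...]


def enumerate_answers(
    candidates: Iterable[Tup],
    splits: Sequence[Hashable],
    find_tuple: Callable[[Tup, Hashable], Optional[Tup]],
) -> Iterator[Tup]:
    """Interleave candidate generation with output, one output per round."""
    queue = deque()  # the queue Q of tuples waiting to be output
    source = iter(candidates)
    while True:
        # Step 1: fetch the next candidate (if any) and try every r-split.
        a = next(source, None)
        if a is not None:
            for split in splits:
                b = find_tuple(a, split)
                if b is not None:
                    queue.append(b)
        # Step 2: output one queued tuple, or stop if the queue is empty.
        # (In the proof, every candidate adds at least one tuple to Q, so
        # the queue only runs dry once the candidates are exhausted.)
        if not queue:
            return
        yield queue.popleft()
```

Each round performs at most `len(splits)` calls to `find_tuple` before producing an output, which mirrors the constant-delay argument: the number of $r$-splits and the cost of each call are both independent of the database size.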
This completes the proof.

We now prove Theorem 4.5, whose proof is similar to that of Theorem 5.6.

Proof of Theorem 4.5.

First let us note that if $\phi$ is local then by Lemma 3.4 we can compute a set $T$ of $r$-types (where $r$ is the locality radius of $\phi$) in constant time such that for any $\sigma$-db $\mathcal{D}$ and tuple $\bar a$ from $\mathcal{D}$, the $r$-type of $\bar a$ in $\mathcal{D}$ is in $T$ if and only if $\bar a \in \phi(\mathcal{D})$.

Then to construct an algorithm as in the theorem statement we can use the algorithm from the proof of Theorem 5.6 with two changes. Firstly, we allow the input class to be any class of bounded degree $\sigma$-dbs and secondly, we construct $T$ as discussed above. The only part of the algorithm from the proof of Theorem 5.6 that runs in non-constant time is the construction of $T$, and hence our algorithm has the required running times.

To prove correctness, first note that in the proof of Theorem 5.6 the only reason the input class was $\mathcal{C}_d^t$ was to allow the set $T$ to be computed efficiently and, with high probability, correctly. Now $T$ is computed exactly, and since the algorithm will only enumerate tuples that have $r$-type in $T$, only tuples that are answers to the query for the input database will be enumerated as required. The proof of (2) from the theorem statement is then very similar to the last paragraph in the proof of Theorem 5.6 (the only difference is that now, for local queries, this happens with higher probability as $T$ is computed exactly every time).

7. Further Results

In this section, we start by generalising our result on approximate enumeration of general FO queries (Theorem 5.6). We identify a condition that we call Hanf-sentence testability, which is a weakening of the bounded tree-width condition, under which we still get approximate enumeration algorithms with the same probabilistic guarantees as before. Finally, we discuss approximate versions of query membership testing and counting.

7.1. Generalising Theorem 5.6.

We first introduce Hanf-sentence testability, which is based on the Hanf normal form of a formula. It allows us to compute the set of $r$-types as in Lemma 5.4 efficiently. Theorem 7.4 below is the generalisation of Theorem 5.6, and Example 7.5 illustrates the use of this generalisation.

Definition 7.1 (Hanf-sentence testable). Let $\phi(\bar x) \in \mathrm{FO}[\sigma]$ and let $\chi(\bar x)$ be the formula in the form (3.1) of Lemma 3.2 that is $d$-equivalent to $\phi$. Let $m$ be the number of conjunctive clauses in $\chi$. We say that $\phi$ is Hanf-sentence testable on $\mathcal{C}$ in time $H(n)$ if for every $i \in [m]$, the formula $\exists \bar x\, \mathrm{sph}_{\tau_i}(\bar x) \wedge \psi_i^s$ is uniformly testable on $\mathcal{C}$ in time at most $H(n)$.

We shall illustrate Hanf-sentence testability in the following example.

Example 7.2.

Let φ be as in Example 5.2 and let G ∈ G d . If there exists ( u, v ) ∈ V ( G ) with 2-type τ then there exists a vertex with 2-type τ (where τ is as in Example 2.2)and vice versa. Hence, φ can be easily transformed into the form (3.1) of Lemma 3.2 byreplacing the subformula ¬ ( ∃ z ∃ w sph τ ( z, w )) with ¬∃ ≥ z sph τ ( z ). The resulting formulathen has two conjunctive clauses, sph τ ( x, y ) ∧ ¬∃ ≥ z sph τ ( z ) and sph τ ( x, y ). We saw inExample 2.2 that ∃ x ∃ y sph τ ( x, y ) ∧ ¬∃ ≥ z sph τ ( z ) is uniformly testable on G d in constanttime. The formula ∃ x ∃ y sph τ ( x, y ) is trivially testable in constant time on G d since we caninsert a copy of τ into a graph G ∈ G d with at most 8 d + 7 modiﬁcations and therefore if8 d + 7 ≤ ǫd | V ( G ) | we can always accept and otherwise (i.e. if | V ( G ) | < (8 d + 7) /ǫd ) we canjust do a full check of the graph for a copy of τ in constant time. Hence, φ is Hanf-sentencetestable on G d in constant time.Note that any FO query is Hanf sentence testable on C td in polylogarithmic time. Weshall now prove a result that is similar to Lemma 5.4 but works for any class C and FO query φ where φ is Hanf-sentence testable on C . This will then be used to show we can replacebounded tree-width with Hanf sentence testability and still obtain enumeration algorithmswith the same probabilistic guarantees. Lemma 7.3.

Let $\phi(\bar x) \in \mathrm{FO}[\sigma]$ with $|\bar x| = k$ and Hanf locality radius $r$ and let $\epsilon \in (0,1]$. If $\phi$ is Hanf-sentence testable on $\mathcal{C}$ in time $H(n)$ then there exists an algorithm $B_\epsilon$ that runs in time $O(H(n))$, which, given oracle access to a $\sigma$-db $\mathcal{D} \in \mathcal{C}$ as input along with $|D| = n$, computes a set $T$ of $r$-types with $k$ centres such that with probability at least $5/6$, for any $\bar a \in D^k$,
(1) if $\bar a \in \phi(\mathcal{D})$, then the $r$-type of $\bar a$ in $\mathcal{D}$ is in $T$, and
(2) if $\bar a \in D^k \setminus \phi(\mathcal{D}, \mathcal{C}, \epsilon)$, then the $r$-type of $\bar a$ in $\mathcal{D}$ is not in $T$.

Proof. The algorithm $B_\epsilon$ is nearly identical to the algorithm $A_\epsilon$ from Lemma 5.4. The only difference is that we replace the input class $\mathcal{C}_d^t$ with $\mathcal{C}$. The $\epsilon/2$-testers $\pi_i$ used now have input class $\mathcal{C}$ (rather than $\mathcal{C}_d^t$) and, as $\phi$ is Hanf-sentence testable on $\mathcal{C}$ in time $H(n)$, each $\pi_i$ runs in time $O(H(n))$ (rather than polylogarithmic time). As all other parts of the algorithm $A_\epsilon$ run in constant time, it follows that $B_\epsilon$ runs in time $O(H(n))$ as required. The proof of the correctness of $B_\epsilon$ is then identical to the proof of the correctness of $A_\epsilon$ (but with the input class $\mathcal{C}_d^t$ replaced with $\mathcal{C}$).

We shall now show that if an FO query $\phi$ is Hanf-sentence testable on a class $\mathcal{C}$ in time $H(n)$ then $\mathrm{Enum}_{\mathcal{C}}(\phi)$ can be solved approximately with preprocessing time $O(H(n))$ and constant delay. Note we are still able to reduce the answer threshold function.

Theorem 7.4.

Let $\phi(\bar x) \in \mathrm{FO}[\sigma]$ and let $c := \mathrm{conn}(\phi, d)$. If $\phi$ is Hanf-sentence testable on $\mathcal{C}$ in time $H(n)$, then $\mathrm{Enum}_{\mathcal{C}}(\phi)$ can be solved approximately with preprocessing time $O(H(n))$ and constant delay for answer threshold function $f(n) = \gamma n^c$ for any $\gamma \in (0,1)$.

Proof. Let $\epsilon \in (0,1]$, $\gamma \in (0,1)$ and let us assume that $\phi$ is Hanf-sentence testable on $\mathcal{C}$ in time $H(n)$. Take the $\epsilon$-approximate enumeration algorithm for $\mathrm{Enum}_{\mathcal{C}_d^t}(\phi)$ with answer threshold function $f(n) = \gamma n^c$ given in the proof of Theorem 5.6, which we shall denote by $E_{\phi, \mathcal{C}_d^t, \epsilon}$, and make the following changes: replace the input class $\mathcal{C}_d^t$ with $\mathcal{C}$, and use Lemma 7.3 instead of Lemma 5.4 to compute the set of $r$-types $T$. We argue that the resulting algorithm $E_{\phi, \mathcal{C}, \epsilon}$ is an $\epsilon$-approximate enumeration algorithm for $\mathrm{Enum}_{\mathcal{C}}(\phi)$ with preprocessing time $O(H(n))$ and constant delay for answer threshold function $f(n) = \gamma n^c$.

In the preprocessing phase of $E_{\phi, \mathcal{C}_d^t, \epsilon}$ the only part that runs in non-constant time is the construction of the set $T$ (which takes polylogarithmic time). In $E_{\phi, \mathcal{C}, \epsilon}$ it takes $O(H(n))$ time to compute $T$ and hence $E_{\phi, \mathcal{C}, \epsilon}$ has preprocessing time $O(H(n))$. Since $E_{\phi, \mathcal{C}_d^t, \epsilon}$ has constant delay, $E_{\phi, \mathcal{C}, \epsilon}$ also has constant delay.

Since the only differences between Lemma 5.4 and Lemma 7.3 are the running times and the input class, the proof of the correctness of $E_{\phi, \mathcal{C}, \epsilon}$ is the same as the proof of the correctness of $E_{\phi, \mathcal{C}_d^t, \epsilon}$ but with the input class $\mathcal{C}_d^t$ replaced with $\mathcal{C}$.

We now return to our running example and discuss an FO query and input class for which the previous theorems did not yield an approximate enumeration algorithm, but for which Theorem 7.4 does.

Example 7.5.

Let φ be the formula as in Example 5.2. We saw in Example 7.2 that φ isHanf-sentence testable on G d in constant time and that the formula in the form (3.1) ofLemma 3.2 that is d -equivalent to φ is χ ( x, y ) = sph τ ( x, y ) ∨ (sph τ ( x, y ) ∧ ¬∃ ≥ z sph τ ( z )).The maximum number of connected components of the neighbourhood types that appearin the sphere-formulas of χ is one. Hence, by Theorem 7.4, Enum G d ( φ ) can be solvedapproximately with constant preprocessing time and constant delay for answer thresholdfunction f ( n ) = γn for any parameter γ ∈ (0 , Approximate query membership testing.

The query membership testing problem for $\phi(\bar x) \in \mathrm{FO}[\sigma]$ over $\mathcal{C}$ is the computational problem where, for a database $\mathcal{D} \in \mathcal{C}$, we ask whether a given tuple $\bar a \in D^k$ satisfies $\bar a \in \phi(\mathcal{D})$. We call $\bar a$ the dynamical input and the answer (‘true’ or ‘false’) the dynamical answer. Similar to query enumeration, the goal is to obtain an algorithm that, after a preprocessing phase, can answer membership queries for dynamical inputs very efficiently. The preprocessing phase should also be very efficient. Kazana [16] shows that the query membership testing problem for any $\phi(\bar x) \in \mathrm{FO}[\sigma]$ over $\mathcal{C}$ can be solved by an algorithm with a linear time preprocessing phase, and an answering phase that, for a given dynamical input, computes the dynamical answer in constant time.

Given a local FO query, by Lemma 3.5, for any $\sigma$-db $\mathcal{D}$ and tuple $\bar a$ from $\mathcal{D}$ we can test in constant time whether $\bar a \in \phi(\mathcal{D})$. Hence in this section we shall focus on general queries.

We introduce an approximate version of the query membership testing problem. We say that the query membership testing problem for $\phi(\bar x) \in \mathrm{FO}[\sigma]$ over $\mathcal{C}$ can be solved approximately with an $O(H(n))$-time preprocessing phase and constant-time answering phase if for any $\epsilon \in (0,1]$ there exists an algorithm which receives oracle access to $\mathcal{D} \in \mathcal{C}$ and $|D| = n$ as an input, and proceeds in two phases.
(1) A preprocessing phase that runs in time $O(H(n))$.
(2) An answer phase where, given dynamical input $\bar a \in D^k$, the following is computed in constant time.
• If $\bar a \in \phi(\mathcal{D})$, the algorithm returns ‘true’, with probability at least $2/3$, and
• if $\bar a \notin \phi(\mathcal{D}, \mathcal{C}, \epsilon)$, the algorithm returns ‘false’, with probability at least $2/3$.

Theorem 7.6.

The query membership testing problem for $\phi(\bar x) \in \mathrm{FO}[\sigma]$ (where $|\bar x| = k$) over $\mathcal{C}_d^t$ can be solved approximately with a polylogarithmic preprocessing phase and constant-time answering phase.

Proof. Let $r$ be the Hanf locality radius of $\phi$. In the preprocessing phase a set $T$ of $r$-types as in Lemma 5.4 is computed. Then in the answer phase, given a tuple $\bar a \in D^k$, the $r$-type $\tau$ of $\bar a$ is computed. If $\tau \in T$ then the algorithm returns ‘true’, otherwise it returns ‘false’.

By Lemma 5.4 the set $T$ can be computed in polylogarithmic time and it takes constant time to calculate $\tau$. By Lemma 5.4 with probability at least $5/6 > 2/3$, if $\bar a \in \phi(\mathcal{D})$, then the $r$-type of $\bar a$ in $\mathcal{D}$ is in $T$, and if $\bar a \in D^k \setminus \phi(\mathcal{D}, \mathcal{C}_d^t, \epsilon)$, then the $r$-type of $\bar a$ in $\mathcal{D}$ is not in $T$. Therefore with probability at least $2/3$, if $\bar a \in \phi(\mathcal{D})$ the algorithm outputs ‘true’ and if $\bar a \notin \phi(\mathcal{D}, \mathcal{C}_d^t, \epsilon)$ the algorithm outputs ‘false’ as required.

Note that we can get a similar result for Hanf-sentence testable FO queries over any class of bounded degree graphs.

7.3. Approximate counting.

The counting problem for $\phi(\bar x) \in \mathrm{FO}[\sigma]$ over $\mathcal{C}$ is the problem of computing $|\phi(\mathcal{D})|$ for a given database $\mathcal{D} \in \mathcal{C}$. It was shown in [4] that the counting problem for any $\phi(\bar x) \in \mathrm{FO}[\sigma]$ over $\mathcal{C}$ can be solved in linear time.

Lemma 5.1 in [20] allows approximating the distribution of the $r$-types with one centre of an input graph by looking at a constant number of vertices. We can easily extend this to databases and neighbourhood types with multiple centres.

We fix an enumeration $\tau_1, \dots, \tau_{\mathrm{c}(r,k)}$ of the $r$-types in $T_r^{\sigma,d}(k)$. For a $\sigma$-db $\mathcal{D}$ with $|D| = n$, the $k$ centre $r$-neighbourhood distribution of $\mathcal{D}$ is the vector $\mathrm{dv}_{r,k}(\mathcal{D})$ of length $\mathrm{c}(r,k)$ whose $i$-th component (denoted by $\mathrm{dv}_{r,k}(\mathcal{D})[i]$) contains the number $t(i)/n^k$, where $t(i) \in \mathbb{N}$ is the number of tuples in $D^k$ whose $r$-type is $\tau_i$.

We let $\mathrm{EstimateFrequencies}_{r,s,k}$ be an algorithm with oracle access to an input database $\mathcal{D} \in \mathcal{C}$ that samples $s$ tuples from $D^k$ uniformly and independently and explores their $r$-neighbourhoods. $\mathrm{EstimateFrequencies}_{r,s,k}$ returns the distribution vector $\bar v$ of the $r$-types of this sample. $\mathrm{EstimateFrequencies}_{r,s,k}$ has constant running time, independent of $|D|$, and comes with the following guarantees.

Lemma 7.7.

Let

D ∈ C be a database on n elements, λ ∈ (0 , and r, k ∈ N . If s ≥ c( r, k ) /λ · ln(20 c( r, k )) , with probability at least / the vector ¯ v returned by EstimateFrequencies r,s,k on input D satisﬁes k ¯ v − dv r,k ( D ) k ≤ λ . By combining Lemmas 5.4 and 6.4 and Lemma 7.7 we get the following result.

Theorem 7.8.

Let φ (¯ x ) ∈ FO[ σ ] , let ǫ ∈ (0 , , let λ ∈ (0 , and let c := conn( φ, d ) .There exists an algorithm, which, given oracle access to D ∈ C td and | D | = n as an input,returns an estimate of | φ ( D ) | such that with probability at least / the estimate is withinthe range [ | φ ( D ) | − λcn c , | φ ( D ) ∪ φ ( D , C td , ǫ ) | + λcn c ] . Furthermore, the algorithm runs inpolylogarithmic time in n . We shall only give the proof idea of Theorem 7.8. To estimate | φ ( D ) | we can do thefollowing. We will start by computing a set of r -types T as in Lemma 5.4 and then useLemma 6.4 to compute the set of r -splits S from T . Then using Lemma 7.7, for every i ∈ [ c ] we can compute an estimate to the vector dv rk,i ( D ). For every i ∈ [ c ], we canalso compute a vector ¯ v i which has a component corresponding to each 3 rk -type τ with i centres. The component in ¯ v i that corresponds to the 3 rk -type τ is the number of r -splits C ∈ S such that for any tuple ¯ a in D with 3 rk -type τ there will exist exactly one tuplethat is found from ¯ a and C . We can then estimate | φ ( D ) | using the estimates to the vectorsdv rk,i ( D ) and the vectors ¯ v i . By Lemmas 5.4 and 7.7 this estimate can be computed inpolylogarithmic time.For correctness, ﬁrst note that by Remark 6.2 for any tuple ¯ b in D there exists exactlyone tuple ¯ a from D and one r -split C such that ¯ b is found from ¯ a and C . Therefore we willnot double count any tuple. By Lemmas 5.4 and 6.4 and the construction of the vectors ¯ v i ,with high probability we will get an estimation to the number of tuples in φ ( D ). By lookingat the two extreme cases, where T contains only r -types of tuples in φ ( D ), and T containsall r -types of tuples in φ ( D ) ∪ φ ( D , C , ǫ ), it is easy to see that the returned estimate will bewithin the desired range.The obvious limitation of Theorem 7.8 is that | φ ( D ) ∪ φ ( D , C td , ǫ ) | can be much largerthan | φ ( D ) | , as discussed in Example 5.3. 
Nevertheless, in applications where the focus is on structural closeness and very efficient running time, this might be tolerable.

Acknowledgement. We would like to thank Benny Kimelfeld for inspiring discussions during early stages of this work.

References

[1] Isolde Adler and Frederik Harwath. Property testing for bounded degree databases. Volume 96, page 6. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, 2018.
[2] Noga Alon, W Fernandez De La Vega, Ravi Kannan, and Marek Karpinski. Random sampling and approximation of max-csps. Journal of Computer and System Sciences, 67(2):212–243, 2003.
[3] Noga Alon, Tali Kaufman, Michael Krivelevich, and Dana Ron. Testing triangle-freeness in general graphs. SIAM Journal on Discrete Mathematics, 22(2):786–819, 2008.
[4] Guillaume Bagan, Arnaud Durand, Etienne Grandjean, and Frédéric Olive. Computing the jth solution of a first-order query. RAIRO-Theoretical Informatics and Applications, 42(1):147–164, 2008.
[5] Sagi Ben-Moshe, Yaron Kanza, Eldar Fischer, Arie Matsliah, Mani Fischer, and Carl Staelin. Detecting and exploiting near-sortedness for efficient relational query evaluation. In Proceedings of the 14th International Conference on Database Theory, pages 256–267. ACM, 2011.
[6] Christoph Berkholz, Jens Keppeler, and Nicole Schweikardt. Answering fo+ mod queries under updates on bounded degree databases. ACM Transactions on Database Systems (TODS), 43(2):7, 2018.
[7] Benedikt Bollig and Dietrich Kuske. An optimal construction of Hanf sentences. Journal of Applied Logic, 10(2):179–186, 2012.
[8] Surajit Chaudhuri, Bolin Ding, and Srikanth Kandula. Approximate query processing: No silver bullet. In Proceedings of the 2017 ACM International Conference on Management of Data, pages 511–519. ACM, 2017.
[9] Hubie Chen, Matt Valeriote, and Yuichi Yoshida. Constant-query testability of assignments to constraint satisfaction problems. SIAM Journal on Computing, 48(3):1022–1045, 2019.
[10] Hubie Chen and Yuichi Yoshida. Testability of homomorphism inadmissibility: Property testing meets database theory. In Proceedings of the 38th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, pages 365–382. ACM, 2019.
[11] Arnaud Durand and Etienne Grandjean. First-order queries on structures of bounded degree are computable with constant delay. ACM Transactions on Computational Logic (TOCL), 8(4):21, 2007.
[12] Arnaud Durand, Nicole Schweikardt, and Luc Segoufin. Enumerating answers to first-order queries over databases of low degree. In Richard Hull and Martin Grohe, editors, Proceedings of the 33rd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS'14, Snowbird, UT, USA, June 22-27, 2014, pages 121–131. ACM, 2014. doi:10.1145/2594538.2594539.
[13] Jörg Flum and Martin Grohe. Parameterized Complexity Theory (Texts in Theoretical Computer Science. An EATCS Series). Springer-Verlag, Berlin, Heidelberg, 2006.
[14] Oded Goldreich and Dana Ron. Property testing in bounded degree graphs. Algorithmica, 32(2):302–343, 2002.
[15] William Hanf. The Theory of Models, chapter Model-theoretic methods in the study of elementary logic, pages 132–145. North Holland, 1965.
[16] Wojciech Kazana. Query evaluation with constant delay. PhD thesis, École normale supérieure de Cachan, Paris, France, 2013.
[17] Wojciech Kazana and Luc Segoufin. First-order query evaluation on structures of bounded degree. Logical Methods in Computer Science, 7(2), 2011. doi:10.2168/LMCS-7(2:20)2011.
[18] Phokion G Kolaitis and Moshe Y Vardi. Conjunctive-query containment and constraint satisfaction. Journal of Computer and System Sciences, 61(2):302–332, 2000.
[19] Bernard M. E. Moret and Henry D. Shapiro. Algorithms from P to NP (Vol. 1): Design and Efficiency. Benjamin-Cummings Publishing Co., Inc., Redwood City, CA, USA, 1991.
[20] Ilan Newman and Christian Sohler. Every property of hyperfinite graphs is testable. SIAM Journal on Computing, 42(3):1095–1112, 2013.
[21] Ronitt Rubinfeld and Madhu Sudan. Robust characterizations of polynomials with applications to program testing. SIAM Journal on Computing, 25(2):252–271, 1996.
[22] Nicole Schweikardt, Luc Segoufin, and Alexandre Vigny. Enumeration for FO queries over nowhere dense graphs. In Jan Van den Bussche and Marcelo Arenas, editors, Proceedings of the 37th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, Houston, TX, USA, June 10-15, 2018, pages 151–163. ACM, 2018. doi:10.1145/3196959.3196971.
[23] Luc Segoufin and Alexandre Vigny. Constant delay enumeration for FO queries over databases with local bounded expansion. In Michael Benedikt and Giorgio Orsi, editors, volume 68 of LIPIcs, pages 20:1–20:16. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik, 2017. doi:10.4230/LIPIcs.ICDT.2017.20.
[24] Hanghang Tong, Christos Faloutsos, Brian Gallagher, and Tina Eliassi-Rad. Fast best-effort pattern matching in large attributed graphs. In Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 737–746. ACM, 2007.
[25] Yuichi Yoshida. Optimal constant-time approximation algorithms and (unconditional) inapproximability results for every bounded-degree csp. In Proceedings of the forty-third annual ACM symposium on Theory of computing, pages 665–674. ACM, 2011.
[26] Shijie Zhang, Shirong Li, and Jiong Yang. Gaddi: distance index based subgraph matching in biological networks. In Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology, pages 192–203. ACM, 2009.

This work is licensed under the Creative Commons Attribution License. To view a copy of this license, visit https://creativecommons.org/licenses/by/4.0/