Query Lifting: Language-integrated query for heterogeneous nested collections

Abstract

Language-integrated query based on comprehension syntax is a powerful technique for safe database programming, and provides a basis for advanced techniques such as query shredding or query flattening that allow efficient programming with complex nested collections. However, the foundations of these techniques are lacking: although SQL, the most widely-used database query language, supports heterogeneous queries that mix set and multiset semantics, these important capabilities are not supported by known correctness results or implementations that assume homogeneous collections. In this paper we study language-integrated query for a heterogeneous query language NRC_\lambda(Set,Bag) that combines set and multiset constructs. We show how to normalize and translate queries to SQL, and develop a novel approach to querying heterogeneous nested collections, based on the insight that ``local'' query subexpressions that calculate nested subcollections can be ``lifted'' to the top level analogously to lambda-lifting for local function definitions.


Wilmer Ricciotti and James Cheney

Laboratory for Foundations of Computer Science, University of Edinburgh, Edinburgh, United Kingdom
The Alan Turing Institute, London, United Kingdom


Keywords: language-integrated query · nested relations · multisets

Since the rise of relational databases as important software components in the 1980s, it has been widely appreciated that database programming is hard [13]. Databases offer efficient access to flat tabular data using declarative SQL queries, a computational model very different from that of most general-purpose languages. To get the best performance from the database, programmers typically need to formulate important parts of their program’s logic as queries, thus effectively programming in two languages: their usual general-purpose language (e.g. Java, Python, Scala) and SQL, with the latter query code typically constructed as unchecked, dynamic strings. Programming in two languages is more than twice as difficult as programming in one language [35]. The result is a hybrid programming model where important parts of the program’s functionality are not statically checked and may lead to run-time failures, or worse, vulnerabilities such as SQL injection attacks. This undesirable state of affairs was recognized by Copeland and Maier [13] who coined the term impedance mismatch for it.

Though higher-level wrapper libraries and tools such as object-relational mappings (ORM) can help ameliorate the impedance mismatch, they often come at a price of performance and lack of transparency, as high-level operations on in-memory objects representing database data are not always mapped efficiently to queries [44]. An alternative approach, which has almost as long a history as the impedance mismatch problem itself, is to elevate queries in the host language from unchecked strings to a typed, domain-specific sublanguage, whose interactions with the rest of the program can be checked and which can be mapped to database queries safely while providing strong guarantees.
This approach is nowadays typically called language-integrated query following Microsoft’s successful LINQ extensions to .NET languages such as C# [...] flat collections (i.e. tables of records without other collections nested inside field values) can always be translated to an equivalent query only using flat relations (i.e. can be expressed in an SQL-like language). Wong [54] subsequently generalized this result and gave a constructive proof, in which the translation from nested to flat queries is accomplished through a strongly normalizing rewriting system.

Wong’s work has informed a number of successful implementations, such as the influential Kleisli system [55] for biomedical data integration, and the Links programming language [12]. Although the implementation of LINQ in C# [...] nested results [25,8,21,52], by translating such queries to a bounded number of flat queries. This technique, currently implemented in Links and DSH, has several benefits: for example, it can be used to implement provenance tracking efficiently in queries [17,46]. Fowler et al. [19] showed that in some cases, Links’s support for nested query results decreased both the number of queries issued and the total query evaluation time by an order of magnitude or more compared to a Java database application.

Unfortunately, there is still a gap between the theory and practice of language-integrated query. Widely-used and practically important SQL features that mix set and multiset collections, such as duplicate elimination, are supported by some implementations, but without guarantees regarding correctness or reliability. So far, such results have only been proved for special cases [7,8], typically for homogeneous queries operating on one uniform collection type. For example, in Links, queries have multiset semantics and cannot use duplicate elimination or set-valued operations.
To the best of our knowledge the questions of how to correctly translate flat or nested heterogeneous queries to SQL are open problems.

In this paper, we solve both open problems. We study a heterogeneous query language NRC_\lambda(Set,Bag), which was introduced and studied in our recent work [42]. We have previously extended the key results on query normalization to NRC_\lambda(Set,Bag) [43], but unlike the homogeneous case, the resulting normal forms do not directly correspond to SQL. In this paper, we first show how flat NRC_\lambda(Set,Bag) queries can be translated to SQL, and we then develop a new approach for evaluating queries over nested heterogeneous collections.

The key (and, to us at least, surprising) insight is to recognize that these two subproblems are really just different facets of one problem. That is, when translating flat NRC_\lambda(Set,Bag) queries to SQL, the main obstacle is how to deal with query expressions that depend on local variables; when translating nested NRC_\lambda(Set,Bag) queries to equivalent flat ones, the main obstacle is also how to deal with query expressions that depend on local variables. We solve this problem by observing that such query subexpressions can be lifted, analogously to lambda-lifting of local function definitions in functional programming [30], by abstracting over their free variables. Unlike lambda-lifting, however, we lift such expressions by converting them to tabular functions, or graphs, which can be calculated using database query constructs.

The remainder of this paper presents our contributions as follows:

– In section 2 we review the most relevant prior work and present our approach at a high, and we hope accessible, level.
– In sections 3 and 4 we present the core languages NRC_\lambda(Set,Bag) and NRC_G which will be used in the rest of the paper.
– Section 5 presents our results on translation of flat NRC_\lambda(Set,Bag) queries to SQL, via NRC_G.
– Section 6 presents our results on translation of NRC_\lambda(Set,Bag) queries that construct nested results to a bounded number of flat NRC_G queries.
– Sections 7 and 8 discuss related work and conclude.

In this section we sketch our approach. We use Links syntax [12], which differs in superficial respects from the core calculus in the rest of the paper but is more readable. We rely without further comment on existing capabilities of language-integrated query in Links, which are described elsewhere [11,34,8]. Suppose, hypothetically, we are interested in certain presidential candidates and prescription drugs they may be taking (for example, to see whether drug interactions might explain erratic behavior such as rage tweeting, creeping authoritarianism, or creepiness more generally). In Links, an expression querying a small database of presidential candidates and their drug prescriptions can be written as follows:

    Q0 = for (c <- Cand, p <- Pres, d <- Drug)
         where (c.cid == p.cid && p.did == d.did)
         [(name=c.name, drug=d.drug)]

    Cand
    name  cid
    DJT   45
    JRB   46

    Pres
    cid  did  day
    45   101  Mon
    45   223  Tue
    45   223  Thu
    46   765  Fri

    Drug
    did  drug
    101  hydrochloroquine
    223  adderall
    765  caffeine

    Q_F
    in        out
    (DJT,45)  hydrochloroquine
    (DJT,45)  adderall
    (JRB,46)  caffeine

    Q1
    name  drug
    DJT   hydrochloroquine
    DJT   adderall
    JRB   caffeine

Fig. 1. Input tables Cand, Pres, Drug, intermediate result of Q_F and result of Q1.

Some (totally fictitious and not legally actionable) example data is shown in Figure 1; note that the prescriptions table Pres is a multiset containing duplicate entries. Executing this query in Links results in the following SQL query:

    SELECT c.name, d.drug
    FROM Cand c, Pres p, Drug d
    WHERE c.cid = p.cid AND p.did = d.did
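As a rough illustration (not part of Links or of the formal development), the comprehension Q0 can be emulated with a Python generator expression over in-memory copies of the example tables. The tuple layouts, the helper name `q0`, and the `did` values in `DRUG` are assumptions made for this sketch; `collections.Counter` stands in for a multiset.

```python
from collections import Counter

# In-memory copies of the example tables (column order is an assumption of this sketch).
CAND = [("DJT", 45), ("JRB", 46)]                      # (name, cid)
PRES = [(45, 101, "Mon"), (45, 223, "Tue"),
        (45, 223, "Thu"), (46, 765, "Fri")]            # (cid, did, day)
DRUG = [(101, "hydrochloroquine"), (223, "adderall"),
        (765, "caffeine")]                             # (did, drug); ids assumed


def q0():
    """Emulate Q0: a multiset of (name, drug) pairs built by a three-way join."""
    return Counter(
        (name, drug)
        for (name, c_cid) in CAND
        for (p_cid, p_did, _day) in PRES
        for (d_did, drug) in DRUG
        if c_cid == p_cid and p_did == d_did
    )


bag_result = q0()
# Duplicate elimination performed in memory, after the query has run.
set_result = set(bag_result)

print(bag_result[("DJT", "adderall")])  # multiplicity of (DJT, adderall) in the bag
print(len(set_result))                  # number of distinct pairs
```

In this sketch the multiplicity of (DJT, adderall) is greater than one because Pres lists prescription 223 twice for candidate 45; converting the bag to a Python set is the in-memory counterpart of SQL's DISTINCT.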

In Links, query results from the database are mapped back to list values non-deterministically, and the result of the above query Q0 will be a list containing two copies of the tuple (DJT, adderall) and one copy of each of the tuples (DJT, hydrochloroquine) and (JRB, caffeine). If we are just interested in which candidates take which drugs and not how many times each drug was taken, we want to remove these duplicates. This can be accomplished in a basic SQL query using the DISTINCT keyword after SELECT. Currently, in Links there is no way to generate queries involving DISTINCT, and this duplicate elimination can only be performed in-memory. While this is not hard to do when the duplicate elimination happens at the end of the query, it is not as clear how to handle deduplication operations correctly in arbitrary places inside queries. Furthermore, SQL has several other operations that can have either set or multiset semantics such as UNION and EXCEPT: how should they be handled?

To study this problem we introduced a core calculus NRC_\lambda(Set,Bag) [42] (reviewed in the next section) in which there are two collection types, sets and multisets (or bags); duplicate elimination maps a multiset to a set with the same elements, and promotion maps a set to the least multiset with the same elements.

We considered, but were not previously able to solve, two problems in the context of NRC_\lambda(Set,Bag) which are addressed in this paper. First, the fundamental results regarding normalization and translation to SQL have been studied only for homogeneous query languages with collections consisting of either sets, bags, or lists. We recently extended the normalization results to NRC_\lambda(Set,Bag) [43], but the resulting normal forms do not correspond directly to SQL queries if operations such as deduplication, promotion, or bag difference are present. Second, query expressions that construct nested collections cannot be translated directly to SQL and can be very expensive to execute in-memory using nested loops, leading to the N + 1 query problem (or query avalanche problem [26]) in which one query is performed for the outer loop and then another N queries are performed, one per iteration of the inner loop. Some techniques have been developed for translating nested queries to a fixed number of flat queries, but to date they either handle only homogeneous set or bag collections [53,8], or lack detailed correctness proofs [26,51].

Regarding the first problem, the closest work in this respect is by Libkin and Wong [33], who studied and related the expressiveness of comprehension-based homogeneous set and bag query languages but did not consider their heterogeneous combination or translation to SQL. The following query illustrates the fundamental obstacle:

    Q1 = for (c <- Cand)
         for (d <- dedup(for (p <- Pres, d <- Drug)
                         where (c.cid == p.cid && p.did == d.did)
                         [d.drug]))
         [(name=c.name, drug=d)]

This query is similar to Q0, but eliminates duplicates among the drugs for each candidate. The query contains a duplicate elimination operation (dedup) applied to another query subexpression that refers to c, which is introduced in an earlier generator. This is not directly supported in classic SQL: by default the subqueries in FROM clauses cannot refer to tuple variables introduced by earlier parts of the FROM clause. In fact, this query is expressible in SQL:1999 using the LATERAL keyword, which does allow such sideways information-passing:

    SELECT c.name, d.drug
    FROM Cand c, LATERAL (SELECT DISTINCT d.drug
                          FROM Pres p, Drug d
                          WHERE p.cid = c.cid AND p.did = d.did) d

(Without the LATERAL keyword, this query is not well-formed SQL.) However, such queries have only recently become widely supported, so are not available on legacy databases, and even when supported, are not typically optimized effectively; for example PostgreSQL will evaluate it as a nested loop, with quadratic complexity or worse.

Regarding the second problem, Van den Bussche [53] showed that any query returning nested set collections can be simulated by n flat queries, where n is the number of occurrences of the set collection type in the result. However, this translation has not been used as the basis for a practical system to our knowledge, and does not respect multiset semantics. Cheney et al. [8] provided an analogous shredding translation for nested multiset queries, but translated to a richer target language (including SQL:1999 features such as ROW_NUMBER) and did not handle operations such as multiset difference or duplicate elimination. Thus, neither approach handles the full expressiveness of a heterogeneous query language over bags and sets. The following query illustrates the fundamental obstacle:

    Q2 = for (x <- Cand)
         [(name=x.name,
           drugs=dedup(for (p <- Pres, d <- Drug)
                       where (x.cid == p.cid && p.did == d.did)
                       [d.drug]))]

Much like Q1, Q2 builds a multiset of pairs (name, drugs) but here drugs is a set of all of the drugs taken by candidate name. Such a query is, of course, not even syntactically expressible in SQL because it returns a nested collection; it is not expressible in previous work on nested query evaluation either, because the result is a multiset of records, one component of which is a set.

We will now illustrate how to translate Q1 to a plain SQL query (not using LATERAL), and how to translate Q2 to two flat queries such that the nested result can be constructed easily from their flat results. First, note that we can rewrite both queries as follows, introducing an abbreviation F(x) for a query subexpression parameterized by x:

    F(x) = for (p <- Pres, d <- Drug)
           where (x.cid == p.cid && p.did == d.did)
           [d.drug]
    Q1 = for (c <- Cand) for (d <- dedup(F(c))) [(name=c.name, drug=d)]
    Q2 = for (c <- Cand) [(name=c.name, drugs=dedup(F(c)))]

Next, observe that the set of all possible values for x appearing in some call to F(x) is finite, and can even be computed by a query. Therefore, we can write a closed query Q_F that builds a lookup table that calculates the graph of F (or at least, as much of it as is needed to evaluate the queries) as follows:

    Q_F = dedup(for (x <- Cand, y <- F(x)) [(in=x, out=y)])

Notice that the use of deduplication here is really essential to define Q_F correctly: if we did not deduplicate then there would be repeated tuples in Q_F, leading to incorrect results later. If we inline and simplify F(x) in the above query, we get the following:

    Q_F' = dedup(for (x <- Cand, y <- Pres, z <- Drug)
                 where (x.cid == y.cid && y.did == z.did)
                 [(in=x, out=z.drug)])

Finally we may replace the call to F(x) in Q1 with a lookup to Q_F', as follows:

    Q1' = for (c <- Cand, f <- Q_F')
          where (c == f.in)
          [(name=c.name, drug=f.out)]
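The lifting step can be checked on the example data with a small Python sketch. This is only an illustration under assumptions: tables are Python lists of tuples, `Counter` models multisets, Python sets model `dedup`, all function names are hypothetical, and the `did` values in `DRUG` are assumed.

```python
from collections import Counter

CAND = [("DJT", 45), ("JRB", 46)]                      # (name, cid)
PRES = [(45, 101, "Mon"), (45, 223, "Tue"),
        (45, 223, "Thu"), (46, 765, "Fri")]            # (cid, did, day)
DRUG = [(101, "hydrochloroquine"), (223, "adderall"),
        (765, "caffeine")]                             # (did, drug); ids assumed


def f(x):
    """F(x): the bag (here a list) of drugs prescribed to candidate x = (name, cid)."""
    _name, cid = x
    return [drug
            for (p_cid, p_did, _day) in PRES
            for (d_did, drug) in DRUG
            if cid == p_cid and p_did == d_did]


def q1_direct():
    """Q1: dedup is applied to a subquery that mentions the outer variable c."""
    return Counter((c[0], d) for c in CAND for d in set(f(c)))


def q_f():
    """Q_F: closed, deduplicated lookup table holding the graph of F over Cand."""
    return {(x, y) for x in CAND for y in f(x)}


def q_f_without_dedup():
    """The same table built as a bag, to show why deduplication matters."""
    return [(x, y) for x in CAND for y in f(x)]


def q1_lifted(table):
    """Q1': join Cand with the lookup table instead of calling F under dedup."""
    return Counter((c[0], out) for c in CAND for (inp, out) in table if c == inp)


same = q1_lifted(q_f()) == q1_direct()
broken = q1_lifted(q_f_without_dedup()) == q1_direct()
print(same, broken)
```

With the deduplicated table the lifted query agrees with the direct one; with the bag-valued table the pair (DJT, adderall) is counted twice, which is the kind of incorrect result the deduplication in Q_F guards against.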

The expression Q1' may now be translated directly to SQL, because the argument to dedup is now closed:

    SELECT c.name, f.drug
    FROM Cand c, (SELECT DISTINCT x.name, x.cid, z.drug
                  FROM Cand x, Pres y, Drug z
                  WHERE x.cid = y.cid AND y.did = z.did) f
    WHERE c.cid = f.cid AND c.name = f.name

Although this query looks a bit more complex than the one given earlier using LATERAL, it can be optimized more effectively; for example PostgreSQL generates a query plan that uses a hash join, giving quasi-linear complexity.

    Q_21
    name  drugs
    DJT   (DJT,45)
    JRB   (JRB,46)

    Q_22
    in        out
    (DJT,45)  hydrochloroquine
    (DJT,45)  adderall
    (JRB,46)  caffeine

    Q2
    name  drugs
    DJT   {hydrochloroquine, adderall}
    JRB   {caffeine}

Fig. 2. Intermediate results of Q_21, Q_22 and result of Q2.

On the other hand, to deal with Q2, we refactor it into two closed, flat queries Q_21, Q_22 and an expression Q2' that builds the nested result from their flat results (illustrated in Figure 2):

    Q_21 = for (x <- Cand) [(name=x.name, drugs=x)]
    Q_22 = Q_F
    Q2' = for (x <- Q_21)
          [(name=x.name,
            drugs=for (y <- Q_22) where (x.drugs == y.in) [y.out])]
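The refactoring of Q2 into two flat queries plus a stitching step can be mimicked in Python as follows. This is a sketch under assumptions, not the Links implementation: the tables are in-memory lists, `frozenset` models the nested set-valued field, the function names are hypothetical, the `did` values in `DRUG` are assumed, and the stitching is the naive nested loop.

```python
from collections import Counter

CAND = [("DJT", 45), ("JRB", 46)]                      # (name, cid)
PRES = [(45, 101, "Mon"), (45, 223, "Tue"),
        (45, 223, "Thu"), (46, 765, "Fri")]            # (cid, did, day)
DRUG = [(101, "hydrochloroquine"), (223, "adderall"),
        (765, "caffeine")]                             # (did, drug); ids assumed


def f(x):
    """F(x): the bag (here a list) of drugs prescribed to candidate x = (name, cid)."""
    _name, cid = x
    return [drug
            for (p_cid, p_did, _day) in PRES
            for (d_did, drug) in DRUG
            if cid == p_cid and p_did == d_did]


def q2_direct():
    """Q2: a bag of records whose drugs field is a set (a nested result)."""
    return [(c[0], frozenset(f(c))) for c in CAND]


def q_21():
    """Flat outer query: the nested field is replaced by the index x itself."""
    return [(x[0], x) for x in CAND]


def q_22():
    """Flat inner query (Q_F): pairs (in, out) tabulating F over Cand."""
    return {(x, y) for x in CAND for y in f(x)}


def stitch(outer, inner):
    """Q2': rebuild the nested result in memory with a naive nested loop."""
    return [(name, frozenset(out for (inp, out) in inner if inp == idx))
            for (name, idx) in outer]


nested = stitch(q_21(), q_22())
print(nested)
```

Only `q_21` and `q_22` would need to run on the database; `stitch` works purely on their flat results.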

Notice that in Q_21 we replaced the call to F with the argument x, while Q_22 is just Q_F again. The final expression Q2' builds the nested result (in the host language’s memory) by traversing Q_21 and computing the set value of each drugs field by looking up the appropriate values from Q_22. Thus, the original query result can be computed by first evaluating Q_21 and Q_22 on the database, and then evaluating the final stitching query expression in-memory. (In practice, as discussed in Cheney et al. [8], it is important for performance to use a more sophisticated stitching algorithm than the above naive nested loop, but in this paper we are primarily concerned with the correctness of the transformation.)

The above examples are a bit simplistic, but illustrate the key idea of query lifting. In the rest of this paper we place this approach on a solid foundation, and (partially inspired by Gibbons et al. [20]), to help clarify the reasoning we extend the calculus with a type of tabulated functions or graphs $\vec{\sigma} \blacktriangleleft \{\tau\}$, with graph abstraction introduction form $\mathcal{G}(-;-)$ and graph application $M \twoheadrightarrow \langle\vec{x}\rangle$. In our running example we could define $Q_F = \mathcal{G}(x \leftarrow R; F(x))$, and we would use the application operation $M \twoheadrightarrow \langle\vec{x}\rangle$ to extract the set of elements corresponding to x in Q_F. We will also consider tabular functions that return multisets rather than sets, in order to deal with queries that return nested multisets.

We recap the main points from [42], which introduced a calculus NRC_\lambda(Set,Bag) with the following syntax:

\[
\begin{array}{lrcl}
\text{Types} & \sigma, \tau & ::= & b \mid \langle \overrightarrow{\ell : \sigma} \rangle \mid \{\sigma\} \mid \Lbag \sigma \Rbag \mid \sigma \to \tau \\
\text{Terms} & M, N & ::= & x \mid t \mid c(\vec{M}) \mid \langle \overrightarrow{\ell = M} \rangle \mid M.\ell \mid \lambda x.M \mid M\,N \\
& & \mid & \emptyset \mid \{M\} \mid M \cup N \mid \bigcup\{M \mid \Theta\} \\
& & \mid & \mho \mid \Lbag M \Rbag \mid M \uplus N \mid M - N \mid \biguplus\Lbag M \mid \Theta \Rbag \\
& & \mid & \delta M \mid \iota M \mid M\ \mathsf{where}_{set}\ N \mid M\ \mathsf{where}_{bag}\ N \\
& & \mid & \mathsf{empty}_{set}(M) \mid \mathsf{empty}_{bag}(M) \\
\text{Generators} & \Theta & ::= & \overrightarrow{x \leftarrow M}
\end{array}
\]

We distinguish between (local) variables $x$ and (global) table names $t$, and assume standard primitive types $b$ and primitive operations $c(\vec{M})$ including respectively Booleans $\mathbf{B}$ and equality at every base type. The syntax for records and record projection $\langle \overrightarrow{\ell = M} \rangle$, $M.\ell$, and for lambda-abstraction and application $\lambda x.M$, $M\,N$ is standard; as usual, let-binding is definable. Set operations include empty set $\emptyset$, singleton construction $\{M\}$, union $M \cup N$, one-armed conditional $M\ \mathsf{where}_{set}\ N$, emptiness test $\mathsf{empty}_{set}(M)$, and comprehension $\bigcup\{M \mid \Theta\}$, where $\Theta$ is a sequence of generators $x \leftarrow M$. Similarly, multiset operations include empty bag $\mho$, singleton $\Lbag M \Rbag$, bag union $M \uplus N$, bag difference $M - N$, conditional $M\ \mathsf{where}_{bag}\ N$, emptiness test $\mathsf{empty}_{bag}(M)$. The syntax is completed by duplicate elimination $\delta M$ (converting a bag $M$ into a set with the same object type) and promotion $\iota M$ (which produces the bag containing all the elements of the set $M$, with multiplicity 1).

The one-way conditional operations $M\ \mathsf{where}_{set}\ N$ and $M\ \mathsf{where}_{bag}\ N$ evaluate Boolean test $N$, and return collection $M$ if $N$ is true, otherwise the empty set/bag; two-way conditionals can be supported without problems. Other set operations, such as intersection, membership, subset, and equality are also definable, as are bag operations such as intersection [4,33]. Also, we may define $\mathsf{empty}_{bag}(M)$ as $\mathsf{empty}_{set}(\delta(M))$ and $M\ \mathsf{where}_{set}\ N$ as $\delta(\iota(M)\ \mathsf{where}_{bag}\ N)$, but we prefer to include these constructs as primitives for symmetry. Generally, we will allow ourselves to write $M\ \mathsf{where}\ N$ and $\mathsf{empty}(M)$ without subscripts if the collection kind of these operations is irrelevant or made clear by the context. We freely use syntax for unlabeled tuples $\langle\vec{M}\rangle$, $M.i$ and tuple types $\vec{\sigma}$ and consider them to be syntactic sugar for labeled records.

The typing rules for the calculus are standard and provided in an appendix. For the purposes of this discussion, we will highlight two features of the type system. The first is that the calculus used here differs from our previous work by using constants and table names, whose types are described by a fixed signature $\Sigma$:

\[
\frac{\Sigma(c) = \vec{b} \to b \qquad (\Gamma \vdash M_i : \sigma_i)_{i=1,\ldots,n}}{\Gamma \vdash c(\vec{M}) : \tau}
\qquad
\frac{\Sigma(t) = \overrightarrow{\ell : b}}{\Gamma \vdash t : \Lbag \langle \overrightarrow{\ell : b} \rangle \Rbag}
\]

As usual, a typing judgment $\Gamma \vdash M : \sigma$ states that a term $M$ is well-typed of type $\sigma$, assuming that its free variables have the types declared in the typing context $\Gamma = x_1 : \sigma_1, \ldots, x_k : \sigma_k$. For the two rules above, note in particular that the primitive functions $c$ can only take inputs of base type and produce results at base type, and table constants $t$ are always multisets of records where the fields are of base type. We refer to a type of the form $\langle \overrightarrow{\ell : b} \rangle$ as flat; if $\sigma$ is flat, we refer to $\{\sigma\}$ and $\Lbag \sigma \Rbag$ as flat collection types.

The second is that our type system uses an approach à la Church, meaning that variable abstractions (in lambdas/comprehensions), empty sets and empty bags are annotated with their type in order to ensure the uniqueness of typing.

Lemma 1. In NRC_\lambda(Set,Bag), if $\Gamma \vdash M : \sigma$ and $\Gamma \vdash M : \tau$, then $\sigma = \tau$.

In the context of a larger language implementation, most of these type annotations can be elided and inferred by type inference.
We have chosen to dispense with these details in the main body of this paper to avoid unnecessary syntactic cluttering.

We will use a largely standard denotational semantics for NRC_\lambda(Set,Bag), in which sets and multisets are modeled as finitely-supported functions from their element types to Boolean values {0, 1} or natural numbers respectively. This approach follows the so-called K-relation semantics for queries [23,18] as used for example in the HoTTSQL formalization [10]. The full typing rules and semantics are included in the appendix.

NRC_\lambda(Set,Bag) subsumes previous systems including NRC [4,54], BQL [33] and NRC_\lambda [11,8]. In this paper, we restrict our attention to queries in which collection types taking part in $\delta$, $\iota$ or bag difference contain only flat records. There are various reasons for excluding function types from these operators: for starters, any concrete implementation that used function types in these positions would need to decide the equality of functions; secondly, our rewrite system can ensure that a term whose type does not contain function types has a normal form without lambda abstractions and applications only if any $\delta$, $\iota$, or bag difference used in that term are applied to first-order collections. We thus want to exclude terms such as:

\[
\biguplus \Lbag\, x\ \Lbag\,\Rbag\ \Lbag\,\Rbag \mid x \leftarrow \iota(\{\lambda y z.y\} \cup \{\lambda y z.z\}) \,\Rbag
\]

which do not have an SQL representation despite having a flat collection type.

In order to obtain simpler normal forms, in which comprehensions only reference generators with a flat collection type, we also disallow nested collections within $\delta$, $\iota$, and bag difference. We believe this is without loss of generality because of Libkin and Wong’s results showing that allowing such operations at nested types does not add expressiveness to BQL.

We have extended Wong’s normalizing rewrite rule system, so as to simplify queries to a form that is close to SQL, with no intermediate nested collections. Since our calculus is more liberal than Wong’s, allowing queries to be defined by mixing sets and bags and also using bag difference, we have added non-standard rules to take care of unwanted situations.
In particular, we use the following constrained eta-expansions for comprehensions:

\[
\begin{array}{rcl}
\bigcup\{\delta(M - N) \mid \Theta\} & \rightsquigarrow & \bigcup\{\{z\} \mid \Theta, z \leftarrow \delta(M - N)\} \\
\biguplus\Lbag \iota M \mid \Theta \Rbag & \rightsquigarrow & \biguplus\Lbag \Lbag z \Rbag \mid \Theta, z \leftarrow \iota M \Rbag \\
\biguplus\Lbag M - N \mid \Theta \Rbag & \rightsquigarrow & \biguplus\Lbag \Lbag z \Rbag \mid \Theta, z \leftarrow M - N \Rbag
\end{array}
\]

The rationale of these rules is that in order to achieve, for comprehensions, a form that can be easily translated to an SQL select query, we need to move all the syntactic forms that are blocking to most normalization rules (i.e. promotion and bag difference) from the head of the comprehension to a generator. In order for this strategy to work out, we also need to know that the type of these subexpressions is flat, as we previously mentioned.

In Figure 3 we show the grammar for the normal forms for terms of nested relational types, i.e. types of the following form:

\[
\sigma ::= b \mid \langle \overrightarrow{\ell : \sigma} \rangle \mid \{\sigma\} \mid \Lbag \sigma \Rbag
\]

\[
\begin{array}{llcl}
\text{General normal forms} & M & ::= & X \mid \langle \overrightarrow{\ell = M} \rangle \mid Q \mid R \\
\text{Base type terms} & X & ::= & x.\ell \mid c(\vec{X}) \mid \mathsf{empty}_{set}(Q^*) \mid \mathsf{empty}_{bag}(R^*) \\
\text{Set normal forms} & Q & ::= & \bigcup \vec{C} \\
& C & ::= & \bigcup\{\{M\}\ \mathsf{where}_{set}\ X \mid \overrightarrow{x \leftarrow F}\} \\
& F & ::= & \delta t \mid \delta(R^* - R^*) \\
\text{Bag normal forms} & R & ::= & \biguplus \vec{D} \\
& D & ::= & \biguplus\Lbag \Lbag M \Rbag\ \mathsf{where}_{bag}\ X \mid \overrightarrow{x \leftarrow G} \Rbag \\
& G & ::= & t \mid \iota Q^* \mid R^* - R^*
\end{array}
\]

Fig. 3. Nested relational normal forms.

For ease of presentation, the grammar actually describes a “standardized” version of the normal forms in which:

– $\emptyset$ is represented as the trivial union $\bigcup \vec{C}$ where $\vec{C}$ is the empty sequence; $\mho$ has a similar representation using a trivial disjoint union;
– comprehensions without a guard are considered to be the same as those with a trivial true guard: $\bigcup\{\{M\} \mid \Theta\} = \bigcup\{\{M\}\ \mathsf{where}\ \mathsf{true} \mid \Theta\}$;
– singletons that do not appear as the head of a comprehension are represented as trivial comprehensions: $\{M\} = \bigcup\{\{M\} \mid\ \}$.

Each normal form $M$ can be either a term of base type $X$, a tuple $\langle \overrightarrow{\ell = M} \rangle$, a set $Q$, or a bag $R$. The normal forms of sets and bags are rather similar, both being defined as unions of comprehensions with a singleton head.
The generators for set comprehensions $F$ include deduplicated tables and deduplicated bag differences; the generators for bag comprehensions $G$ must be either tables, promoted set queries, or bag differences.

The non-terminals used as the arguments of emptiness tests, promotion, and bag difference have been marked with a star to emphasize the fact that they must have a flat collection type. The corresponding grammar can be obtained from the grammar for nested normal forms by replacing the rule for $M$ with the following:

\[
M^* ::= \langle \overrightarrow{\ell = X} \rangle
\]

Normalized queries can be translated to SQL as shown in Figure 4 as long as they have a flat collection type. The translation uses SELECT DISTINCT and UNION where a set semantics is needed, and SELECT, UNION ALL and EXCEPT ALL in the case of bag semantics. Note that promotion expressions $\iota Q^*$ are translated simply by translating $Q^*$, because in SQL there is no type distinction between set and multiset queries: all query results are multisets, and sets are considered to be multisets having no duplicates.

\[
\begin{array}{rcl}
(\emptyset)^{sql} & = & \texttt{SELECT [...] WHERE [...]} \\
(\mho)^{sql} & = & \texttt{SELECT [...] WHERE [...]} \\
(x.\ell)^{sql} & = & x.\ell \\
(c(\vec{X}))^{sql} & = & (c)^{sql}(\overrightarrow{(X)^{sql}}) \\
(\langle \overrightarrow{\ell = X} \rangle)^{sql} & = & (X_1)^{sql}\ \texttt{AS}\ \ell_1, \ldots, (X_n)^{sql}\ \texttt{AS}\ \ell_n \\
(\mathsf{empty}_{set}(Q^*))^{sql} & = & \texttt{NOT EXISTS}\ (Q^*)^{sql} \\
(\mathsf{empty}_{bag}(R^*))^{sql} & = & \texttt{NOT EXISTS}\ (R^*)^{sql} \\
(Q_1^* \cup Q_2^*)^{sql} & = & (Q_1^*)^{sql}\ \texttt{UNION}\ (Q_2^*)^{sql} \\
(R_1^* \uplus R_2^*)^{sql} & = & (R_1^*)^{sql}\ \texttt{UNION ALL}\ (R_2^*)^{sql} \\
(t)^{sql} & = & \texttt{SELECT * FROM}\ t \\
(R_1^* - R_2^*)^{sql} & = & (R_1^*)^{sql}\ \texttt{EXCEPT ALL}\ (R_2^*)^{sql} \\
(\delta t)^{sql} & = & \texttt{SELECT DISTINCT * FROM}\ t \\
(\iota(Q^*))^{sql} & = & (Q^*)^{sql} \\
(\delta(R_1^* - R_2^*))^{sql} & = & \texttt{SELECT DISTINCT * FROM}\ ((R_1^*)^{sql}\ \texttt{EXCEPT ALL}\ (R_2^*)^{sql})\ r \\
(x \leftarrow F)^{sql} & = & \begin{cases} ((F)^{sql})\ x & (F \text{ closed}) \\ \texttt{LATERAL}\ ((F)^{sql})\ x & (\text{otherwise}) \end{cases} \\
(x \leftarrow G)^{sql} & = & \begin{cases} ((G)^{sql})\ x & (G \text{ closed}) \\ \texttt{LATERAL}\ ((G)^{sql})\ x & (\text{otherwise}) \end{cases} \\
(\bigcup\{\{M^*\}\ \mathsf{where}_{set}\ X \mid \overrightarrow{x \leftarrow F}\})^{sql} & = & \texttt{SELECT DISTINCT}\ (M^*)^{sql}\ \texttt{FROM}\ \overrightarrow{(x \leftarrow F)^{sql}}\ \texttt{WHERE}\ (X)^{sql} \\
(\biguplus\Lbag \Lbag M^* \Rbag\ \mathsf{where}_{bag}\ X \mid \overrightarrow{x \leftarrow G} \Rbag)^{sql} & = & \texttt{SELECT}\ (M^*)^{sql}\ \texttt{FROM}\ \overrightarrow{(x \leftarrow G)^{sql}}\ \texttt{WHERE}\ (X)^{sql}
\end{array}
\]

Fig. 4. Translation to SQL

The other main complication in this translation is in handling generators $x \leftarrow F$, $x \leftarrow G$ where $F$ or $G$ may be a non-closed expression $\iota(Q^*)$, $R^* - R^*$, or $\delta(R^* - R^*)$ containing references to other locally-bound variables. To deal with the resulting lateral variable references, we add the LATERAL keyword to such queries. As explained earlier, the use of LATERAL can be problematic and we will return to this issue in Section 5.
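The set and bag operators that the translation maps to SQL can be mimicked in Python to get a feel for the intended semantics. In this sketch (an illustration under assumptions, not the paper's formal semantics) a bag is a `collections.Counter` from elements to multiplicities, a set is a `frozenset`, and all function names are made up for the example; the SQL keyword in each docstring is the construct named in the text.

```python
from collections import Counter


def delta(bag):
    """Duplicate elimination: bag -> set (SELECT DISTINCT)."""
    return frozenset(x for x, n in bag.items() if n > 0)


def iota(s):
    """Promotion: set -> bag in which every element has multiplicity 1."""
    return Counter({x: 1 for x in s})


def bag_union(m, n):
    """Bag union adds multiplicities (UNION ALL)."""
    return m + n


def bag_difference(m, n):
    """Bag difference subtracts multiplicities, truncating at zero (EXCEPT ALL)."""
    return m - n


def set_union(p, q):
    """Set union (UNION)."""
    return p | q


def empty_set(s):
    """Emptiness test on sets."""
    return len(s) == 0


def empty_bag(bag):
    """Emptiness test on bags, written as empty_set(delta(bag))."""
    return empty_set(delta(bag))


m = Counter({"adderall": 2, "caffeine": 1})
n = Counter({"adderall": 1, "caffeine": 3})

print(bag_union(m, n))
print(bag_difference(m, n))
print(delta(bag_difference(m, n)))
print(iota(delta(m)))
```

Because promotion only fixes multiplicities at 1, `iota(delta(m))` forgets how often each element occurred in `m`, which is why heterogeneous queries cannot simply treat sets and bags interchangeably.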

Remark 1 (Record flattening). The above translations handle queries that take flat tables as input and produce flat results (collections of flat records $\langle \overrightarrow{\ell : b} \rangle$). It is straightforward to support queries that return nested records (i.e. records containing other records, but not collections). For example, a query $M : \lbag \langle b_1, \langle b_2, b_3 \rangle \rangle \rbag$ can be handled by defining both directions of the obvious isomorphism $N : \lbag \langle b_1, \langle b_2, b_3 \rangle \rangle \rbag \cong \lbag \langle b_1, b_2, b_3 \rangle \rbag : N^{-1}$, normalizing the flat query $N \circ M$, evaluating the corresponding SQL, and applying the inverse $N^{-1}$ to the results. Such record flattening is described in detail by Cheney et al. [9] and is implemented in Links, so we will use it from now on without further discussion.

$$\frac{(\Gamma, \overrightarrow{x_{i-1} : \sigma_{i-1}} \vdash L_i : \{\sigma_i\})_{i=1,\ldots,n} \qquad \Gamma, \overrightarrow{x : \sigma} \vdash M : \{\tau\}}{\Gamma \vdash \mathcal{G}_{set}(\overrightarrow{x \leftarrow L}; M) : \vec{\sigma} \blacktriangleleft \{\tau\}}$$

$$\frac{(\Gamma, \overrightarrow{x_{i-1} : \sigma_{i-1}} \vdash L_i : \{\sigma_i\})_{i=1,\ldots,n} \qquad \Gamma, \overrightarrow{x : \sigma} \vdash M : \lbag \tau \rbag}{\Gamma \vdash \mathcal{G}_{bag}(\overrightarrow{x \leftarrow L}; M) : \vec{\sigma} \blacktriangleleft \lbag \tau \rbag}$$

$$\frac{\Gamma \vdash M : \vec{\sigma} \blacktriangleleft \tau \qquad (\Gamma \vdash N_i : \sigma_i)_i}{\Gamma \vdash M \twoheadrightarrow (\vec{N}) : \tau}
\qquad
\frac{\Gamma \vdash M : \vec{\sigma} \blacktriangleleft \lbag \tau \rbag \qquad \Gamma \vdash N : \vec{\sigma} \blacktriangleleft \lbag \tau \rbag}{\Gamma \vdash M - N : \vec{\sigma} \blacktriangleleft \lbag \tau \rbag}$$

$$\frac{\Gamma \vdash M : \vec{\sigma} \blacktriangleleft \{\tau\} \qquad \Gamma \vdash N : \vec{\sigma} \blacktriangleleft \{\tau\}}{\Gamma \vdash M \cup N : \vec{\sigma} \blacktriangleleft \{\tau\}}
\qquad
\frac{\Gamma \vdash M : \vec{\sigma} \blacktriangleleft \lbag \tau \rbag \qquad \Gamma \vdash N : \vec{\sigma} \blacktriangleleft \lbag \tau \rbag}{\Gamma \vdash M \uplus N : \vec{\sigma} \blacktriangleleft \lbag \tau \rbag}$$

$$\frac{\Gamma \vdash M : \vec{\sigma} \blacktriangleleft \lbag \tau \rbag}{\Gamma \vdash \delta M : \vec{\sigma} \blacktriangleleft \{\tau\}}
\qquad
\frac{\Gamma \vdash M : \vec{\sigma} \blacktriangleleft \{\tau\}}{\Gamma \vdash \iota M : \vec{\sigma} \blacktriangleleft \lbag \tau \rbag}$$

Fig. 5. $\mathcal{NRC}_{\mathcal{G}}$ additional typing rules.

We now introduce $\mathcal{NRC}_{\mathcal{G}}$, an extension of the calculus $\mathcal{NRC}_\lambda(\mathit{Set}, \mathit{Bag})$ providing a new type of finite tabular function graphs (in the remainder of this paper, also called simply "graphs"; they are similar to the finite maps and tables of Gibbons et al. [20]). The syntax of $\mathcal{NRC}_{\mathcal{G}}$ is defined as follows:

$$\begin{array}{lrcl}
\text{Types} & \sigma, \tau & ::= & \cdots \mid \vec{\sigma} \blacktriangleleft \tau \\
\text{Terms} & M, N & ::= & \cdots \mid \mathcal{G}_{set}(\Theta; N) \mid \mathcal{G}_{bag}(\Theta; N) \mid M \twoheadrightarrow (\vec{N})
\end{array}$$

Semantically, the type of graphs $\vec{\sigma} \blacktriangleleft \tau$ will be interpreted as the set of finite functions from sequences of values of type $\vec{\sigma}$ to values in $\tau$: such functions can return non-trivial values only for a finite subset of their input type. In our setting, we will require the output type of graphs to be a collection type (i.e. $\tau$ shall be either $\{\tau'\}$ or $\lbag \tau' \rbag$ for some $\tau'$), and we will use $\emptyset$ or $\mho$ as the trivial value. The typing rules involving graphs are shown in Figure 5.

Graphs are created using the graph abstraction operations $\mathcal{G}_{set}(\Theta; N)$ and $\mathcal{G}_{bag}(\Theta; N)$, where $\Theta$ is a sequence of generators in the form $\overrightarrow{x \leftarrow M}$; the dual operation of graph application is denoted by $M \twoheadrightarrow (\vec{N})$. An expression of the form $\mathcal{G}_{set}(\overrightarrow{x \leftarrow M}; N)$ is used to construct a (finite) tabular function mapping each sequence of values $R_1, \ldots, R_n$ in the sets $M_1, \ldots, M_n$ to the set $N[\vec{R}/\vec{x}]$. If each $M_i$ has type $\{\sigma_i\}$ and $N$ has type $\{\tau\}$, then the graph has type $\vec{\sigma} \blacktriangleleft \{\tau\}$. Similarly, if $N$ has type $\lbag \tau \rbag$, $\mathcal{G}_{bag}(\overrightarrow{x \leftarrow M}; N)$ has type $\vec{\sigma} \blacktriangleleft \lbag \tau \rbag$. The terms $M_1, \ldots, M_n$ constitute the (finite) domain of this graph. When the kind of graph abstraction (set-based or bag-based) is clear from the context or unimportant, we will allow ourselves to write $\mathcal{G}(-;-)$ instead of $\mathcal{G}_{set}(-;-)$ or $\mathcal{G}_{bag}(-;-)$.

A graph $G$ of type $\vec{\sigma} \blacktriangleleft \tau$ can be applied to a sequence of terms $N_1, \ldots, N_n$ of type $\sigma_1, \ldots, \sigma_n$ to obtain a term of type $\tau$. If $G = \mathcal{G}(\overrightarrow{x \leftarrow L}; M)$, then we will want the semantics of $\mathcal{G}(\overrightarrow{x \leftarrow L}; M) \twoheadrightarrow (\vec{N})$ to be the same as that of $M[\vec{N}/\vec{x}]$, provided that each of the $N_i$ is in the corresponding element of the domain of the graph.
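The behaviour of graph abstraction and graph application can be sketched in Python by tabulating a body over a finite domain. This is only an illustration under simplifying assumptions: a single generator rather than a dependent sequence, dictionaries as finite maps, and hypothetical helper names `graph_set`, `graph_bag` and `apply_graph` that do not come from the paper.

```python
from collections import Counter

def graph_set(domain, body):
    """Set graph: tabulate body(x) as a set for each distinct x in the domain."""
    return {x: frozenset(body(x)) for x in set(domain)}

def graph_bag(domain, body):
    """Bag graph: tabulate body(x) as a bag for each distinct x in the domain."""
    return {x: Counter(body(x)) for x in set(domain)}

def apply_graph(graph, arg, trivial):
    """Graph application: arguments outside the domain give the trivial value."""
    return graph.get(arg, trivial)

# Hypothetical flat input: a bag of (name, department) rows and a set of departments.
employees = [("ann", "eng"), ("bob", "eng"), ("ann", "eng"), ("cy", "ops")]
depts = {"eng", "ops"}

def names_in(d):
    return [name for (name, dept) in employees if dept == d]

staff_bag = graph_bag(depts, names_in)
staff_set = graph_set(depts, names_in)

in_domain = apply_graph(staff_bag, "eng", Counter())      # same as evaluating names_in("eng") as a bag
out_of_domain = apply_graph(staff_bag, "hr", Counter())   # "hr" is not in the domain: empty bag
```

Applying the graph to an element of its domain agrees with substituting that element into the body, while an argument outside the domain yields the empty collection.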
The typing rule does not enforce this requirement and if any of the $N_i$ is not an element of $L_i$, the graph application will evaluate to an empty set or bag (depending on $\tau$).

Graphs can also be merged by union, using $\cup$ or $\uplus$ depending on their output collection kind. Furthermore, graphs that return bags can be subtracted from one another using bag difference; the deduplication and promotion operations also extend to graphs in the obvious way.

Lemma 2. In $\mathcal{NRC}_{\mathcal{G}}$, if $\Gamma \vdash M : \sigma$ and $\Gamma \vdash M : \tau$, then $\sigma = \tau$.

Whenever $M$ is well typed and its typing environment is made clear by the context, we will allow ourselves to write $ty(M)$ for the type of $M$. Furthermore, given a sequence of generators $\Theta = x_1 \leftarrow L_1, \ldots, x_n \leftarrow L_n$, such that for $i = 1, \ldots, n$ we have $x_1 : \sigma_1, \ldots, x_{i-1} : \sigma_{i-1} \vdash L_i : \{\sigma_i\}$, we will write $ty(\Theta)$ to denote the associated typing context:

$$ty(\Theta) := x_1 : \sigma_1, \ldots, x_n : \sigma_n$$

The semantics of $\mathcal{NRC}_\lambda(\mathit{Set}, \mathit{Bag})$ is extended to $\mathcal{NRC}_{\mathcal{G}}$ as follows:

$$\begin{array}{rcl}
\llbracket \mathcal{G}_{set}(\overrightarrow{x \leftarrow L}; M) \rrbracket \rho\, (\vec{u}, v) & = & \left(\bigwedge_i \llbracket L_i \rrbracket \rho[x_1 \mapsto u_1, \ldots, x_{i-1} \mapsto u_{i-1}]\, u_i\right) \wedge \llbracket M \rrbracket \rho[\overrightarrow{x \mapsto u}]\, v \\
\llbracket \mathcal{G}_{bag}(\overrightarrow{x \leftarrow L}; M) \rrbracket \rho\, (\vec{u}, v) & = & \left(\bigwedge_i \llbracket L_i \rrbracket \rho[x_1 \mapsto u_1, \ldots, x_{i-1} \mapsto u_{i-1}]\, u_i\right) \times \llbracket M \rrbracket \rho[\overrightarrow{x \mapsto u}]\, v \\
\llbracket M \twoheadrightarrow (\vec{N}) \rrbracket \rho\, v & = & \llbracket M \rrbracket \rho\, (\overrightarrow{\llbracket N \rrbracket \rho}, v)
\end{array}$$

In this definition, graph abstractions are interpreted as collections of pairs of values $(\vec{u}, v)$ where the $\vec{u}$ represent the input and $v$ the corresponding output of the graph; consequently, the semantics of a graph $\mathcal{G}_{set}(\overrightarrow{x \leftarrow L}; M)$ states that the multiplicity of $(\vec{u}, v)$ is equal to the multiplicity of $v$ in the semantics of $M$ (where each $x_i$ is mapped to $u_i$) if each $u_i$ is in the semantics of $L_i$, and zero otherwise. The semantics of bag graph abstractions is similar, with $\times$ substituted for $\wedge$ to allow multiplicities greater than one in the graph output.

For graph applications $M \twoheadrightarrow (\vec{N})$, the multiplicity of $v$ is obtained as the multiplicity of $(\overrightarrow{\llbracket N \rrbracket \rho}, v)$ in the semantics of $M$. The semantics of set and bag union, bag difference, bag deduplication, and set promotion, as defined in $\mathcal{NRC}_\lambda(\mathit{Set}, \mathit{Bag})$, are extended to graphs and remain otherwise unchanged in $\mathcal{NRC}_{\mathcal{G}}$.

In fact (as noted for example by Gibbons et al. [20]), the graph constructs of $\mathcal{NRC}_{\mathcal{G}}$ are just a notational convenience: we can translate $\mathcal{NRC}_{\mathcal{G}}$ back to $\mathcal{NRC}_\lambda(\mathit{Set}, \mathit{Bag})$ by translating types $\vec{\sigma} \blacktriangleleft \{\tau\}$ and $\vec{\sigma} \blacktriangleleft \lbag \tau \rbag$ to $\{\langle \vec{\sigma}, \tau \rangle\}$ and $\lbag \langle \vec{\sigma}, \tau \rangle \rbag$ respectively, and the term constructs are rewritten as follows:

$$\begin{array}{rcll}
\mathcal{G}_{set}(\overrightarrow{x \leftarrow L}; M) & \leadsto & \bigcup \{\{\langle \vec{x}, y \rangle\} \mid \overrightarrow{x \leftarrow L}, y \leftarrow M\} & \\
\mathcal{G}_{bag}(\overrightarrow{x \leftarrow L}; M) & \leadsto & \biguplus \lbag \lbag \langle \vec{x}, y \rangle \rbag \mid \overrightarrow{x \leftarrow \iota(L)}, y \leftarrow M \rbag & \\
M \twoheadrightarrow (\vec{N}) & \leadsto & \bigcup \{\{y\} \text{ where}_{set}\ \vec{x} = \vec{N} \mid \langle \vec{x}, y \rangle \leftarrow M\} & (M : \vec{\sigma} \blacktriangleleft \{\tau\}) \\
M \twoheadrightarrow (\vec{N}) & \leadsto & \biguplus \lbag \lbag y \rbag \text{ where}_{bag}\ \vec{x} = \vec{N} \mid \langle \vec{x}, y \rangle \leftarrow M \rbag & (M : \vec{\sigma} \blacktriangleleft \lbag \tau \rbag)
\end{array}$$

As explained at the end of Section 3, if a subexpression of the form $\iota(N)$ or $N_1 - N_2$ contains free variables introduced by other generators in the query (i.e. not globally-scoped table variables), such queries cannot be translated directly to SQL, unless the SQL:1999 LATERAL keyword is used.

More precisely, we can give the following definition of lateral variable occurrence.

Definition 1. Given a query containing a comprehension $\bigcup \{M \mid \Theta, x \leftarrow N, \Theta'\}$ or $\biguplus \lbag M \mid \Theta, x \leftarrow N, \Theta' \rbag$ as a subterm, we say that $x$ occurs laterally in $\Theta'$ if, and only if, there is a binding $y \leftarrow N'$ in $\Theta'$ such that $x \in \mathrm{FV}(N')$.

Since LATERAL is not implemented on all databases, and is sometimes implemented inefficiently, we would still like to avoid it. In this section we show how lateral occurrences can be eliminated even in the presence of bag promotion and bag difference, by means of a process we call delateralization.

Using the $\mathcal{NRC}_{\mathcal{G}}$ constructs, we can delateralize simple cases of deduplication or multiset difference as follows:

$$\begin{array}{rcl}
\biguplus \lbag M \mid x \leftarrow N, y \leftarrow \iota(P) \rbag & \leadsto & \biguplus \lbag M \mid x \leftarrow N, y \leftarrow \iota(\mathcal{G}(x \leftarrow \delta N; P)) \twoheadrightarrow x \rbag \\
\biguplus \lbag M \mid x \leftarrow N, y \leftarrow P_1 - P_2 \rbag & \leadsto & \biguplus \lbag M \mid x \leftarrow N, y \leftarrow (\mathcal{G}(x \leftarrow \delta N; P_1) - \mathcal{G}(x \leftarrow \delta N; P_2)) \twoheadrightarrow x \rbag \\
\bigcup \{M \mid x \leftarrow N, y \leftarrow \delta(P_1 - P_2)\} & \leadsto & \bigcup \{M \mid x \leftarrow N, y \leftarrow \delta(\mathcal{G}(x \leftarrow N; P_1) - \mathcal{G}(x \leftarrow N; P_2)) \twoheadrightarrow x\}
\end{array}$$

It is necessary to deduplicate $N$ in the first two rules to ensure that the results correctly represent finite maps from the distinct elements of $N$ to multisets of corresponding elements of $P$. (In any case, $N$ needs to be deduplicated in order to be used as a set in $\mathcal{G}(x \leftarrow \delta N; -)$.)

Given a query expression in normal form, the above rules together with standard equivalences (such as commutativity of independent generators) can be used to delateralize it: that is, remove all occurrences of free variables in subexpressions of the form $\iota(N)$, $M_1 - M_2$, or $\delta(M_1 - M_2)$.

Theorem 1. If $M$ is a flat query in normal form, then there exists $M'$ equivalent to $M$ with no lateral variable occurrences.

The proofs of correctness of the basic delateralization rules and of the above theorem are in the appendix.

To illustrate some subtleties of the translation, here is a trickier example:

$$\biguplus \lbag M \mid x \leftarrow N, y \leftarrow Q - \iota(P) \rbag$$

where $Q, P$ both depend on $x$. We proceed from the outside in, first delateralizing the difference:

$$\biguplus \lbag M \mid x \leftarrow N, y \leftarrow (\mathcal{G}(x \leftarrow \delta(N); Q) - \mathcal{G}(x \leftarrow \delta(N); \iota(P))) \twoheadrightarrow x \rbag$$

Note that this still contains a lateral subquery, namely $\iota(P)$ depends on $x$.
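The first basic delateralization rule (for a promoted subquery $\iota(P)$) can be checked on a small instance with the following Python sketch. It is an illustration under assumptions, not the paper's algorithm: bags are lists or `Counter` values, the graph is represented directly as a flat list of pairs, and the names `lateral`, `delateralized` and `without_dedup` are hypothetical.

```python
from collections import Counter

def lateral(n, p, m):
    """Original query: the generator for y mentions x (a lateral reference)."""
    return Counter(m(x, y) for x in n for y in p(x))

def delateralized(n, p, m):
    """Rewritten query: tabulate p once over the distinct elements of n, then join on x."""
    table = [(x, y) for x in set(n) for y in p(x)]  # closed subquery over delta(n)
    return Counter(m(x, y) for x in n for (x2, y) in table if x == x2)

def without_dedup(n, p, m):
    """Same rewrite but tabulating over n itself: multiplicities are inflated."""
    table = [(x, y) for x in n for y in p(x)]
    return Counter(m(x, y) for x in n for (x2, y) in table if x == x2)

n = [1, 1, 2]                  # a bag in which 1 occurs twice

def p(x):
    return {x, x + 10}         # a set-valued subquery depending on x

def m(x, y):
    return (x, y)              # the comprehension head

expected = lateral(n, p, m)
rewritten = delateralized(n, p, m)
inflated = without_dedup(n, p, m)
```

In this instance `expected` and `rewritten` coincide, while `inflated` counts each pair for `x = 1` four times instead of twice, which illustrates why the domain of the graph is deduplicated.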
After translating back to $\mathcal{NRC}_\lambda(\mathit{Set}, \mathit{Bag})$, and delateralizing $\iota(P)$, the query normalizes to:

$$\begin{array}{l}
Q_1 = \bigcup \{(x, z) \mid x \in \delta(N), z \leftarrow P\} \\
Q_2 = (\biguplus \lbag (x, z) \mid x \in \iota\delta(N), z \leftarrow Q \rbag) - (\biguplus \lbag (x, z) \mid x \in \iota\delta(N), (x', z) \leftarrow \iota(Q_1), x = x' \rbag) \\
\biguplus \lbag M \mid x \leftarrow N, (x', y) \leftarrow Q_2, x = x' \rbag
\end{array}$$

In the previous sections, we have discussed how to translate queries with flat collection input and output to SQL. The shredding technique, introduced in [8], can be used to convert queries with nested output (but flat input) to multiple flat queries that can be independently evaluated on an SQL database, then stitched together to obtain the required nested result. This section provides an improved version of shredding, extended to a more liberal setting mixing sets and bags and allowing bag difference operations, and described using the graph operations we have introduced, allowing an easier understanding of the shredding process.

We introduce, in Figure 6, a shredding judgment to denote the process by which, given a normalized $\mathcal{NRC}_\lambda(\mathit{Set}, \mathit{Bag})$ query, each of its subqueries having a nested collection type is lifted (in a manner analogous to lambda-lifting [30]) to an independent graph query: more specifically, shredding will produce a shredding environment (denoted by $\Phi, \Psi, \ldots$), which is a finite map associating special graph variables $\varphi, \psi$ to $\mathcal{NRC}_{\mathcal{G}}$ terms:

$$\Phi, \Psi, \ldots ::= [\overrightarrow{\varphi \mapsto M}]$$

The shredding judgment has the following form:

$$\Phi; \Theta \vdash M \Mapsto \breve{M} \mid \Psi$$

where the $\Mapsto$ symbol separates the input (to the left) from the output (to the right). The normalized $\mathcal{NRC}_\lambda(\mathit{Set}, \mathit{Bag})$ term $M$ is the query that is being considered for shredding; $M$ may contain free variables declared in $\Theta$, which must be a sequence of $\mathcal{NRC}_\lambda(\mathit{Set}, \mathit{Bag})$ set comprehension bindings. $\Theta$ is initially empty, but during shredding it is extended with parts of the input that have already been processed. Similarly, the input shredding environment $\Phi$ is initially empty, but will grow during shredding to collect shredded queries that have already been generated. It is crucial, for our algorithm to work, that $M$ be in the form previously described in Figure 3, as this allows us to make assumptions on its shape: in describing the judgment rules, we will use the same metavariables as are used in that grammar.

$$\frac{X \text{ is a base term}}{\Phi; \Theta \vdash X \Mapsto X \mid \Phi}
\qquad
\frac{(\Phi_{i-1}; \Theta \vdash M_i \Mapsto \breve{M}_i \mid \Phi_i)_{i=1,\ldots,n}}{\Phi_0; \Theta \vdash \langle \overrightarrow{\ell = M} \rangle \Mapsto \langle \overrightarrow{\ell = \breve{M}} \rangle \mid \Phi_n}$$

$$\frac{\varphi \notin \mathrm{dom}(\Phi_n) \qquad (\Phi_{i-1}; \Theta \vdash C_i \Mapsto \psi_i \twoheadrightarrow \mathrm{dom}(\Theta) \mid \Phi_i)_{i=1,\ldots,n}}{\Phi_0; \Theta \vdash \bigcup \vec{C} \Mapsto \varphi \twoheadrightarrow \mathrm{dom}(\Theta) \mid (\Phi_n \setminus \vec{\psi})[\varphi \mapsto \bigcup \overrightarrow{\Phi_n(\psi)}]}$$

$$\frac{\varphi \notin \mathrm{dom}(\Phi_n) \qquad (\Phi_{i-1}; \Theta \vdash D_i \Mapsto \psi_i \twoheadrightarrow \mathrm{dom}(\Theta) \mid \Phi_i)_{i=1,\ldots,n}}{\Phi_0; \Theta \vdash \biguplus \vec{D} \Mapsto \varphi \twoheadrightarrow \mathrm{dom}(\Theta) \mid (\Phi_n \setminus \vec{\psi})[\varphi \mapsto \biguplus \overrightarrow{\Phi_n(\psi)}]}$$

$$\frac{\varphi \notin \mathrm{dom}(\Psi) \qquad \Phi; \Theta, \overrightarrow{x \leftarrow F} \vdash M \Mapsto \breve{M} \mid \Psi}{\Phi; \Theta \vdash \bigcup \{\{M\} \text{ where } X \mid \overrightarrow{x \leftarrow F}\} \Mapsto \varphi \twoheadrightarrow \mathrm{dom}(\Theta) \mid \Psi[\varphi \mapsto \mathcal{G}(\Theta; \bigcup \{\{\breve{M}\} \text{ where } X \mid \overrightarrow{x \leftarrow F}\})]}$$

$$\frac{\varphi \notin \mathrm{dom}(\Psi) \qquad \Phi; \Theta, \overrightarrow{x \leftarrow G^\delta} \vdash M \Mapsto \breve{M} \mid \Psi}{\Phi; \Theta \vdash \biguplus \lbag \lbag M \rbag \text{ where } X \mid \overrightarrow{x \leftarrow G} \rbag \Mapsto \varphi \twoheadrightarrow \mathrm{dom}(\Theta) \mid \Psi[\varphi \mapsto \mathcal{G}(\Theta; \biguplus \lbag \lbag \breve{M} \rbag \text{ where } X \mid \overrightarrow{x \leftarrow G} \rbag)]}$$

$$G^\delta \triangleq \begin{cases} Q^* & \text{if } G = \iota Q^* \\ \delta G & \text{otherwise} \end{cases}
\qquad
\Phi \setminus \vec{\psi} \triangleq [(\varphi \mapsto N) \in \Phi \mid \varphi \notin \vec{\psi}]$$

Fig. 6. Shredding rules.

The output of shredding consists of a shredded term $\breve{M}$ and an output shredding environment $\Psi$. $\Psi$ extends $\Phi$ with the new queries obtained by shredding $M$; $\breve{M}$ is an output $\mathcal{NRC}_{\mathcal{G}}$ query obtained from $M$ by lifting its collection typed subqueries to independent queries defined in $\Psi$.

The rules for the shredding judgment operate as follows: the first rule expresses the fact that a normalized base term $X$ does not contain subexpressions with nested collection type, therefore it can be shredded to itself, leaving the shredding environment $\Phi$ unchanged; in the case of tuples, we perform shredding pointwise on each field, connecting the input and output shredding environments in a pipeline, and finally combining together the shredded subterms in the obvious way.

The shredding of collection terms (i.e. unions and comprehensions) is performed by means of query lifting: we turn the collection into a globally defined (graph) query, which will be associated to a fresh name $\varphi$ and instantiated to the local comprehension context by graph application. This operation is reminiscent of the lambda lifting and closure conversion techniques used in the implementation of functional languages to convert local function definitions into global ones. Thus, when shredding a collection, besides processing its subterms recursively, we will need to extend the output shredding environment with a definition for the new global graph $\varphi$. In the interesting case of comprehensions, $\varphi$ is defined by graph-abstracting over the comprehension context $\Theta$; notice that, since we are only shredding normalized terms, we know that they have a certain shape and, in particular, the judgment for bag comprehensions must ensure that generators $\vec{G}$ be converted into sets.

$$\frac{}{\vdash \cdot : \cdot}
\qquad
\frac{\vdash \Phi : \Gamma \qquad \Gamma \vdash M : \vec{\sigma} \blacktriangleleft \tau \qquad \varphi \notin \mathrm{dom}(\Gamma)}{\vdash \Phi[\varphi \mapsto M] : (\Gamma, \varphi : \vec{\sigma} \blacktriangleleft \tau)}$$

Fig. 7. Typing rules for shredding environments.

The shredding of set and bag unions is performed by recursion on the subterms, using the same plumbing technique we employed for tuples; additionally, we optimize the output shredding environment by removing the graph queries $\vec{\psi}$ resulting from recursion, since they are absorbed into the new graph $\varphi$.

Notice that since the comprehension generators of our normalized queries must have a flat collection type, they do not need to be processed recursively. Furthermore, since our normal forms ensure that promotion and bag difference terms can only appear as comprehension generators, we do not need to provide rules for these cases.

The shredding environments used by the shredding judgment must be well typed, in the sense described by the rules of Figure 7: the judgment $\vdash \Phi : \Gamma$ means that the graph variables of $\Phi$ are mapped to terms whose type is described by $\Gamma$. Whenever we add a mapping $[\varphi \mapsto M]$ to $\Phi$, we must make sure that $M$ is well typed (of graph type) in the typing environment $\Gamma$ associated to $\Phi$.

If $\vdash \Phi : \Gamma$, we will write $ty(\Phi)$ to refer to the typing environment $\Gamma$ associated to $\Phi$. The following result states that shredding preserves well-typedness:

Theorem 2. Let $\Theta$ be well-typed and $ty(\Theta) \vdash M : \sigma$. If $\Theta \vdash M \Mapsto \breve{M} \mid \Phi$, then:
– $\Phi$ is well-typed
– $ty(\Phi), ty(\Theta) \vdash \breve{M} : \sigma$

We now intend to prove the correctness of shredding: first, we state a lemma which we can use to simplify certain expressions involving the semantics of graph application:

Definition 2. Let $\Theta$ be a closed, well-typed sequence of generators. A substitution $\rho$ is a model of $\Theta$ (notation: $\rho \vDash \Theta$) if, and only if, for all $x \in \mathrm{dom}(\Theta)$, we have $\llbracket \Theta(x) \rrbracket\, \rho(x) > 0$.

Lemma 3.
1. $\llbracket (\bigcup \vec{G}) \twoheadrightarrow (\vec{N}) \rrbracket \rho = \bigvee_i \llbracket G_i \twoheadrightarrow (\vec{N}) \rrbracket \rho$
2. If $\rho \vDash \Theta$, then for all $M$ we have $\llbracket \mathcal{G}(\Theta; M) \twoheadrightarrow (\mathrm{dom}(\Theta)) \rrbracket \rho = \llbracket M \rrbracket \rho$.

To state the correctness of shredding, we need the following notion of shredding environment substitution.

Definition 3. For every well-typed shredding environment $\Phi$, the substitution of $\Phi$ into an $\mathcal{NRC}_{\mathcal{G}}$ term $M$ (notation: $M\Phi$) is defined as the operation replacing within $M$ every free variable $\varphi \in \mathrm{dom}(\Phi)$ with $(\Phi(\varphi))\Phi$ (i.e.: the value assigned by $\Phi$ to $\varphi$, after recursively substituting $\Phi$).

We can easily show that the above definition is well posed for well-typed $\Phi$. We now show that shredding preserves the semantics of the input term, in the sense that the term obtained by substituting the output shredding environment into the output term is equivalent to the input.

Theorem 3 (Correctness of shredding). Let $\Theta$ be well-typed and $ty(\Theta) \vdash M : \sigma$. If $\Phi; \Theta \vdash M \Mapsto \breve{M} \mid \Psi$, then, for all $\rho \vDash \Theta$, we have $\llbracket M \rrbracket \rho = \llbracket \breve{M}\Psi \rrbracket \rho$.

Proof. By induction on the shredding judgment. We comment on two representative cases:

– in the set comprehension case, we want to prove

$$\llbracket \bigcup \{\{M\} \text{ where } X \mid \overrightarrow{x \leftarrow F}\} \rrbracket \rho\, v = \llbracket (\varphi \twoheadrightarrow (\mathrm{dom}(\Theta)))\, \Psi[\varphi \mapsto \mathcal{G}(\Theta; \bigcup \{\{\breve{M}\} \text{ where } X \mid \overrightarrow{x \leftarrow F}\})] \rrbracket \rho\, v$$

where $\rho \vDash \Theta$. We rewrite the lhs as follows:

$$\llbracket \bigcup \{\{M\} \text{ where } X \mid \overrightarrow{x \leftarrow F}\} \rrbracket \rho\, v = \bigvee_{\vec{u}} (\llbracket M \rrbracket \rho_n = v) \wedge (\llbracket X \rrbracket \rho_n) \wedge (\llbracket F_i \rrbracket \rho_{i-1}\, u_i)_{i=1,\ldots,n}$$

where $\rho_i = \rho[x_1 \mapsto u_1, \ldots, x_i \mapsto u_i] \vDash \Theta, x_1 \leftarrow F_1, \ldots, x_i \leftarrow F_i$ for all $i = 1, \ldots, n$, and $u_i$ s.t. $\llbracket F_i \rrbracket \rho_{i-1}\, u_i$. By the definition of substitution and by Lemma 3, we rewrite the rhs:

$$\begin{array}{l}
\llbracket (\varphi \twoheadrightarrow (\mathrm{dom}(\Theta)))\, \Psi[\varphi \mapsto \mathcal{G}(\Theta; \bigcup \{\{\breve{M}\} \text{ where } X \mid \overrightarrow{x \leftarrow F}\})] \rrbracket \rho\, v \\
\quad = \llbracket (\mathcal{G}(\Theta; \bigcup \{\{\breve{M}\Psi\} \text{ where } X \mid \overrightarrow{x \leftarrow F}\})) \twoheadrightarrow (\mathrm{dom}(\Theta)) \rrbracket \rho\, v \\
\quad = \llbracket \bigcup \{\{\breve{M}\Psi\} \text{ where } X \mid \overrightarrow{x \leftarrow F}\} \rrbracket \rho\, v \\
\quad = \bigvee_{\vec{u}} (\llbracket \breve{M}\Psi \rrbracket \rho_n = v) \wedge (\llbracket F_i \rrbracket \rho_{i-1}\, u_i)_{i=1,\ldots,n} \wedge (\llbracket X \rrbracket \rho_n)
\end{array}$$

We can prove that for all $\vec{u}$ such that $\rho_n \nvDash \Theta, \overrightarrow{x \leftarrow F}$, $(\llbracket F_i \rrbracket \rho_{i-1}\, u_i)_{i=1,\ldots,n} = 0$. Therefore, we only need to consider those $\vec{u}$ such that $\rho_n \vDash \Theta, \overrightarrow{x \leftarrow F}$. Then, to prove the thesis, we only need to show:

$$\llbracket M \rrbracket \rho_n = \llbracket \breve{M}\Psi \rrbracket \rho_n$$

which follows by induction hypothesis, for $\rho_n \vDash \Theta, \overrightarrow{x \leftarrow F}$.
– in the set union case, we want to prove

$$\llbracket \bigcup \vec{C} \rrbracket \rho\, v = \llbracket (\varphi \twoheadrightarrow (\mathrm{dom}(\Theta)))\, (\Psi \setminus \vec{\psi})[\varphi \mapsto \bigcup \overrightarrow{\Psi(\psi)}] \rrbracket \rho\, v$$

where $\rho \vDash \Theta$. We rewrite the lhs as follows:

$$\llbracket \bigcup \vec{C} \rrbracket \rho\, v = \bigvee_i \llbracket C_i \rrbracket \rho\, v$$

By the definition of substitution and by Lemma 3, we rewrite the rhs:

$$\begin{array}{l}
\llbracket (\varphi \twoheadrightarrow (\mathrm{dom}(\Theta)))\, (\Psi \setminus \vec{\psi})[\varphi \mapsto \bigcup \overrightarrow{\Psi(\psi)}] \rrbracket \rho\, v \\
\quad = \llbracket (\bigcup \overrightarrow{(\Psi(\psi))\Psi}) \twoheadrightarrow (\mathrm{dom}(\Theta)) \rrbracket \rho\, v \\
\quad = \bigvee_i \llbracket (\Psi(\psi_i))\Psi \twoheadrightarrow (\mathrm{dom}(\Theta)) \rrbracket \rho\, v
\end{array}$$

By induction hypothesis and unfolding of definitions, we know for all $i$:

$$\llbracket C_i \rrbracket \rho = \llbracket (\psi_i \twoheadrightarrow (\mathrm{dom}(\Theta)))\Psi \rrbracket \rho = \llbracket (\Psi(\psi_i))\Psi \twoheadrightarrow (\mathrm{dom}(\Theta)) \rrbracket \rho$$

which proves the thesis. $\square$

The output of the shredding judgment is a stratified version of the input term, where each element of the output shredding environment provides a layer of collection nesting; furthermore, the output is ordered so that each element of the shredding environment only references graph variables defined to its left, which is convenient for evaluation. Our goal is to evaluate each shredded item as an independent query: however, these items are not immediately convertible to flat queries, partly because their type is still nested, and also due to the presence of graph operations introduced during shredding. We thus need to provide a translation operation capable of converting the output of shredding into independent flat terms of $\mathcal{NRC}_\lambda(\mathit{Set}, \mathit{Bag})$. This translation uses two main ingredients:

– an index function to convert graph variable references to a flat type $I$ of indices, such that $\varphi, \vec{x}$ are recoverable from $index(\varphi, \vec{x})$;
– a technique to express graphs as standard $\mathcal{NRC}_\lambda(\mathit{Set}, \mathit{Bag})$ relations.

The resulting translation, denoted by $\lfloor \cdot \rfloor$, is shown in Figure 8. Let us remark that the translation need be defined only for term forms that can be produced as the output of shredding: this allows us, for instance, not to consider terms such as $\iota M$ or $M - N$, which can only appear as part of flat generators of comprehensions or graphs.

$$\begin{array}{rcl}
\lfloor X \rfloor & = & X \\
\lfloor \langle \overrightarrow{\ell = M} \rangle \rfloor & = & \langle \overrightarrow{\ell = \lfloor M \rfloor} \rangle \\
\lfloor \bigcup \vec{C} \rfloor & = & \bigcup \overrightarrow{\lfloor C \rfloor} \\
\lfloor \biguplus \vec{D} \rfloor & = & \biguplus \overrightarrow{\lfloor D \rfloor} \\
\lfloor \varphi \twoheadrightarrow (\vec{x}) \rfloor & = & index(\varphi, \vec{x}) \\
\lfloor \bigcup \{\{M\} \text{ where } X \mid \overrightarrow{x \leftarrow F}\} \rfloor & = & \bigcup \{\{\lfloor M \rfloor\} \text{ where } X \mid \overrightarrow{x \leftarrow F}\} \\
\lfloor \biguplus \lbag \lbag M \rbag \text{ where } X \mid \overrightarrow{x \leftarrow G} \rbag \rfloor & = & \biguplus \lbag \lbag \lfloor M \rfloor \rbag \text{ where } X \mid \overrightarrow{x \leftarrow G} \rbag \\
\lfloor \mathcal{G}_{set}(\overrightarrow{x \leftarrow F}; M) \rfloor & = & \bigcup \{\langle x, y \rangle \mid \overrightarrow{x \leftarrow F}, y \leftarrow \lfloor M \rfloor\} \\
\lfloor \mathcal{G}_{bag}(\overrightarrow{x \leftarrow F}; M) \rfloor & = & \biguplus \lbag \langle x, y \rangle \mid \overrightarrow{x \leftarrow \iota F}, y \leftarrow \lfloor M \rfloor \rbag
\end{array}$$

Fig. 8. Flattening embedding of shredded queries into $\mathcal{NRC}_\lambda(\mathit{Set}, \mathit{Bag})$.

We discuss briefly the interesting cases of the definition of the flattening translation. Base expressions $X$ are expressible in $\mathcal{NRC}_\lambda(\mathit{Set}, \mathit{Bag})$, therefore they can be mapped to themselves (this is also true for $empty(M)$, since normalization ensures that the type of $M$ be a flat collection). Graph applications $\varphi \twoheadrightarrow (\vec{x})$, as we said, are translated with the help of an $index$ abstract operation: this is where the primary purpose of the translation is accomplished, by flattening a collection type to the flat type $I$, making it possible for a shredded query to be converted to SQL; although we do not specify the concrete implementation of $index$, it is worth noting that it must store the arguments of the graph application along with the (quoted) name of the graph variable $\varphi$. Tuples, unions, and comprehensions only require a recursive translation of their subterms: however the generators of comprehensions must have a flat collection type, so no recursion is needed there. Finally, we translate graphs as collections of the pairs obtained by associating elements of the domain of the graph to the corresponding output; it is simple to come up with a comprehension term building such a collection: set-valued graphs are translated using set comprehension, while bag-valued ones use bag comprehension (this also means that in the latter case the generators for the domain of the graph, which are set-typed, must be wrapped in a $\iota$).

We can prove that the flattening embedding produces flat-typed terms, as expected.

Definition 4. A well-typed set comprehension generator $\Theta$ is flat-typed if, and only if, for all $x \in \mathrm{dom}(\Theta)$, there exists a flat type $\sigma$ such that $ty(\Theta(x)) = \{\sigma\}$. A well-typed shredding environment $\Phi$ is flat-typed if, and only if, for all $\varphi \in \mathrm{dom}(\Phi)$, we have that $ty(\lfloor \Phi(\varphi) \rfloor)$ is a flat collection type.

Lemma 4. Suppose $\Phi; \Theta \vdash M \Mapsto \breve{M} \mid \Psi$, where $\Phi$ and $\Theta$ are flat-typed. Then, $\breve{M}$ and $\Psi$ are also flat-typed.

It is important to note that the composition of shredding and $\lfloor \cdot \rfloor$ does not produce normalized $\mathcal{NRC}_\lambda(\mathit{Set}, \mathit{Bag})$ terms: when we shred a comprehension, we add to the output shredding environment a graph returning a comprehension, and when we translate this to $\mathcal{NRC}_\lambda(\mathit{Set}, \mathit{Bag})$ we get two nested comprehensions:

$$\lfloor \mathcal{G}(x \leftarrow \delta t; \biguplus \lbag \lbag \breve{M} \rbag \mid y \leftarrow \iota Q^* \rbag) \rfloor = \biguplus \lbag \langle x, z \rangle \mid x \leftarrow \iota\delta t, z \leftarrow \biguplus \lbag \lbag \lfloor \breve{M} \rfloor \rbag \mid y \leftarrow \iota Q^* \rbag \rbag$$

$$\begin{array}{rcl}
\llparenthesis X : b \rrparenthesis_\Xi & \triangleq & X \quad \text{(if } X \text{ is not an index)} \\
\llparenthesis \langle \overrightarrow{\ell = \breve{N}} \rangle : \langle \overrightarrow{\ell : \tau} \rangle \rrparenthesis_\Xi & \triangleq & \langle \overrightarrow{\ell = \llparenthesis \breve{N} : \tau \rrparenthesis_\Xi} \rangle \\
\llparenthesis \langle \overrightarrow{\ell = \breve{N}} \rangle.\ell_i : \tau \rrparenthesis_\Xi & \triangleq & \llparenthesis \breve{N}_i : \tau \rrparenthesis_\Xi \\
\llparenthesis index(\varphi, \vec{V}) : \{\tau\} \rrparenthesis_\Xi & \triangleq & \bigcup \{\{\llparenthesis p.2 : \tau \rrparenthesis_\Xi\} \mid p \leftarrow \Xi(\varphi), p.1 = \langle \vec{V} \rangle\} \\
\llparenthesis index(\varphi, \vec{V}) : \lbag \tau \rbag \rrparenthesis_\Xi & \triangleq & \biguplus \lbag \lbag \llparenthesis p.2 : \tau \rrparenthesis_\Xi \rbag \mid p \leftarrow \Xi(\varphi), p.1 = \langle \vec{V} \rangle \rbag
\end{array}$$

Fig. 9. The stitching function.

In fact, not only is this term not in normal form, but it may even contain, within $Q^*$, a lateral reference to $x$; thus, after a flattening translation, we will always require the resulting queries to be renormalized and, if needed, delateralized.

Let $norm$ denote $\mathcal{NRC}_\lambda(\mathit{Set}, \mathit{Bag})$ normalization, and $S$ denote the evaluation of relational normal forms: we define the shredded value set $\Xi$ corresponding to a shredding environment $\Phi$ as follows:

$$\Xi \triangleq \{\varphi \mapsto S(norm(\lfloor M \rfloor)) \mid [\varphi \mapsto M] \in \Phi\}$$

The evaluation $S$ is ordinarily performed by a DBMS after converting the $\mathcal{NRC}_\lambda(\mathit{Set}, \mathit{Bag})$ query to SQL, as described in Section 5. The result of this evaluation is reflected in a programming language such as Links as a list of records.
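To make the shape of a shredded value set concrete, the following Python sketch evaluates a hand-shredded query that lists, for each department, the bag of its employees. Everything here is an assumption made for illustration: the `Index` tuple is one possible encoding of $index(\varphi, \vec{x})$ (the paper leaves it abstract), the tables are hypothetical, and list comprehensions stand in for the SQL evaluation performed by a DBMS.

```python
from collections import namedtuple

# Hypothetical encoding of index(phi, args): graph name plus the application arguments.
Index = namedtuple("Index", ["graph", "args"])

departments = ["eng", "ops"]                     # flat input table (a set)
employees = [("ann", "eng"), ("bob", "eng"),     # flat input table (a bag of rows)
             ("cy", "ops"), ("ann", "eng")]

# Top-level shredded query: the nested collection is replaced by an index.
top = [{"dept": d, "staff": Index("phi", (d,))} for d in departments]

def flat_phi():
    """Flattened graph for phi: pairs of (domain element, output element)."""
    return [((d,), name)
            for d in sorted(set(departments))    # domain of the graph, deduplicated
            for (name, dept) in employees if dept == d]

# The shredded value set maps each graph variable to the rows of its flat query.
xi = {"phi": flat_phi()}
```

Each entry of `xi` is a flat list of pairs, so it could be produced by an ordinary SQL query; the nesting is recorded only through the `Index` values in `top`.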

Given an $\mathcal{NRC}_\lambda(\mathit{Set}, \mathit{Bag})$ term with nested collections, we have first shredded it, obtaining a shredded $\mathcal{NRC}_{\mathcal{G}}$ term $\breve{M}$ and a shredding environment $\Phi$ containing $\mathcal{NRC}_{\mathcal{G}}$ graphs; then we have used a flattening embedding to reflect both $\breve{M}$ and $\Phi$ back into the flat fragment of $\mathcal{NRC}_\lambda(\mathit{Set}, \mathit{Bag})$; next we used normalization and DBMS evaluation to convert the shredding environment into a shredded value set $\Xi$. As the last step to evaluate $M : \tau$, we need to combine $\lfloor \breve{M} \rfloor$ and $\Xi$ together to reconstruct the correct nested value $\llparenthesis \lfloor \breve{M} \rfloor : \tau \rrparenthesis_\Xi$ by stitching together partial flat values.

The stitching function is shown in Figure 9: its job is to visit all the components of tuples and collections, ignoring atomic values other than indices along the way. The real work is performed when an $index(\varphi, \vec{V})$ is found: conceptually, the index should be replaced by the result of the evaluation of $\varphi \twoheadrightarrow (\vec{V})$. Remember that $\Xi$ contains the result of the evaluation of the graph function $\varphi$ after translation to $\mathcal{NRC}_\lambda(\mathit{Set}, \mathit{Bag})$, i.e. a collection of pairs associating each input of $\varphi$ to the corresponding output: then, to obtain the desired result, we can take $\Xi(\varphi)$, filter all the pairs $p$ whose first component is $\langle \vec{V} \rangle$, and return the second component of $p$ after a recursive stitching. Finally, observe that we track the result type argument in order to disambiguate whether to construct a set or multiset when we encounter an index.

Theorem 4 (Correctness of stitching). Let $\Theta$ be well-typed and $ty(\Theta) \vdash M : \sigma$. Let $\Phi$ be well-typed, and suppose $\Phi; \Theta \vdash M \Mapsto \breve{M} \mid \Psi$. Let $\Xi$ be the result of evaluating the flattened queries in $\Psi$ as above. Then $\llbracket \breve{M}\Psi \rrbracket \rho = \llbracket \llparenthesis \lfloor \breve{M} \rfloor : \tau \rrparenthesis_\Xi \rrbracket \rho$.

The full correctness result follows by combining Theorems 3 and 4.
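The type-directed stitching step can be sketched in Python as a recursive function over a flat value, its expected result type and a shredded value set. The representation is an assumption chosen for the example: types are small tagged tuples, records come in as dictionaries and go out as tuples of (label, value) pairs so that they can be stored in sets, bags are returned as sorted tuples, and `Index` is a hypothetical encoding of $index(\varphi, \vec{V})$.

```python
from collections import namedtuple

# Hypothetical encoding of index(phi, args).
Index = namedtuple("Index", ["graph", "args"])

def stitch(value, ty, xi):
    """Rebuild a nested value from a flat value, its type, and the value set xi."""
    kind = ty[0]
    if kind == "base":
        return value                              # atomic values are left untouched
    if kind == "record":
        fields = ty[1]                            # label -> field type
        return tuple((label, stitch(value[label], field_ty, xi))
                     for label, field_ty in fields.items())
    # Collection types: value is an Index; select the pairs whose first
    # component equals the index arguments, then stitch the second component.
    rows = [stitch(out, ty[1], xi)
            for (args, out) in xi[value.graph] if args == value.args]
    if kind == "set":
        return frozenset(rows)
    return tuple(sorted(rows, key=repr))          # bag, as a canonical sorted tuple

xi = {"phi": [(("eng",), "ann"), (("eng",), "bob"),
              (("eng",), "ann"), (("ops",), "cy")]}
row = {"dept": "eng", "staff": Index("phi", ("eng",))}

bag_ty = ("record", {"dept": ("base",), "staff": ("bag", ("base",))})
set_ty = ("record", {"dept": ("base",), "staff": ("set", ("base",))})

as_bag = stitch(row, bag_ty, xi)
as_set = stitch(row, set_ty, xi)
```

The same flat row and value set yield a bag with two copies of `"ann"` under `bag_ty` and a duplicate-free set under `set_ty`, which mirrors the role of the type argument in choosing between set and multiset construction.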

Corollary 1. For all $M$ such that $\vdash M : \tau$, suppose $\vdash M \Mapsto \breve{M} \mid \Psi$, and let $\Xi$ be the shredded value set obtained by evaluating the flattened queries in $\Psi$. Then $\llbracket M \rrbracket = \llbracket \llparenthesis \lfloor \breve{M} \rfloor : \tau \rrparenthesis_\Xi \rrbracket$.

Work on language-integrated query and comprehension syntax has taken place over several decades in both the database and programming language communities. We discuss the most closely related work below.

Comprehensions, normalization and language integration. The database community had already begun in the late 1980s to explore proposals for so-called non-first-normal-form relations in which collections could be nested inside other collections [45], but following Trinder and Wadler's initial work connecting database queries with monadic comprehensions [49], query languages based on these foundations were studied extensively, particularly by Buneman et al. [4,3]. For our purposes, Wong's work on query normalization and translation to SQL [54] is the most important landmark; this work provided the basis for practical implementations such as Kleisli and later Links. Almost as important is the later work by Libkin and Wong [33], studying the questions of expressiveness of bag query languages via a language BQL that extended basic $\mathcal{NRC}$ with deduplication and bag difference operators. They related this language to $\mathcal{NRC}$ with set semantics extended with aggregation (count/sum) operations, but did not directly address the question of normalizing and translating BQL queries to SQL. Grust and Scholl [28] were early advocates of the use of comprehensions mixing set, bag and other monadic collections for query rewriting and optimization, but did not study normalization or translatability properties.

Although comprehension-based queries began to be used in general-purpose programming languages with the advent of Microsoft LINQ [36] and Links [12], Cooper [11] made the next important foundational contribution by extending Wong's normalization result to queries containing higher-order functions and showing that an effect system could be used to safely compose queries using higher-order functions even in an ambient language with side-effects and recursive functions that cannot be used in queries. This work provided the basis for subsequent development of language-integrated query in Links [34] and was later adapted for use in F

Que Λ . However, on revisiting Cooper’s proof to extend it to heteroge-neous queries, we found a subtle gap in the proof, which was corrected in a recentpaper [43]; the original result was correct. As a result, in this paper we focus onﬁrst-order fragments of these languages without loss of generality. uery Lifting 23 Giorgidze et al. [22] have shown how to support non-recursive datatypes (i.e.sums) and Grust and Ulrich [29] built on this to show how to support functiontypes in query results using defunctionalization [29]. We considered using sumsto support a defunctionalization-style strategy for query lifting, but Giorgidzeet al. [22] map sum types to nested collections, which makes their approachunsuitable to our setting. Wong’s original normalization result also consideredsum types, but to the best of our knowledge normalization for

N RC λ ( Set , Bag )extended with sum types has not yet been proved.Recent work by Suzuki et al. [47] have outlined further extensions to lan-guage-integrated query in the

Que Λ system, which is based on ﬁnally-taglesssyntax [6] and employs Wong’s and Cooper’s rewrite rules; Katsushima and Kise-lyov’s subsequent short paper [31] outlined extensions to handling ordering andgrouping. Kiselyov and Katsushima [32] present an extension to Que Λ called Squr to handle ordering based on eﬀect typing, and they provide an eleganttranslation from

Squr queries to SQL based on normalization-by-evaluation.Okura and Kameyama [39] outline an extension to handle SQL-style groupingand aggregation operators in

Que Λ G ; however, their approach potentially gen-erates lateral variable occurrences inside grouping queries. These systems Que Λ , Squr and

Que Λ G consider neither heterogeneity nor nested results.Our adoption of tabulated functions ( graphs ) is inspired in part by Gibbonset al. [20], who provided an elegant rational reconstruction of relational algebrashowing how standard principles for reasoning about queries arise from adjunc-tions. They employed types for (ﬁnite) maps and tables to show how joins can beimplemented eﬃciently, and observed that such structures form a graded monad .We are interested in further exploring these structures and extending our workto cover ordering, grouping and aggregation. Query decorrelation and delateralization

There is a large literature on query decorrelation, for example to remove aggregation operations from SELECT or WHERE clauses (see e.g. [38,5] for further discussion). Delateralization appears related to decorrelation, but we are aware of only a few works on this problem, perhaps because most DBMSs only started to support LATERAL in the last few years. (Microsoft SQL Server has supported similar functionality for much longer through a keyword APPLY.) Our delateralization technique appears most closely related to Neumann and Kemper’s work on query unnesting [38]. In this context, unnesting refers to removal of “dependent join” expressions in a relational algebraic query language; such joins appear to correspond to lateral subqueries. This approach is implemented in the HyPER database system, but is not accompanied by a proof of correctness, nor does it handle nested query results. It would be interesting to formalize this approach (or others from the decorrelation literature) and relate it to delateralization.

Querying nested collections

Our approach to querying nested heterogeneous collections clearly specializes to the homogeneous cases for sets and multisets respectively, which have been studied separately. Van den Bussche’s work on simulating queries on nested sets using flat ones [53] has also inspired subsequent work on query shredding, flattening and (in this paper) lifting, though the simulation technique itself does not appear practical (as discussed in the extended version of Cheney et al. [9]). More recently, Benedikt and Pradic [1] presented results on representing queries on nested collections using a bounded number of interpretations (first-order logic formulas corresponding to definable flat query expressions) in the context of their work on synthesizing NRC queries from proofs. This approach considers set-valued NRC only, and its relationship to our approach should be investigated further.

Cheney et al.’s previous work on query shredding for multiset queries [8] is different in several important respects. In that work we did not consider deduplication and bag difference operations from BQL, which Libkin and Wong showed cannot be expressed in terms of other NRC operations. The shredding translation was given in several stages, and while each stage is individually comprehensible, the overall approach is not easy to understand. Finally, the last stages of the translation relied on SQL features not present (or expressible) in the source language, such as ordering and the SQL:1999 ROW_NUMBER construct, to synthesize uniform integer keys. Our approach, in contrast, handles set, bag, and mixed queries, and does not rely on any SQL:1999 features.

In a parallel line of work, Grust et al. [26,21,50,52,51] have developed a number of approaches to querying nested list data structures, first in the context of XML processing [24] and subsequently for NRC-like languages over lists. The earlier approach [26], named loop-lifting (not to be confused with query lifting!), made heavy use of SQL:1999 capabilities for numbering and indexing to decouple nested collections from their context, and was implemented in both Links [50] and earlier versions of the Database Supported Haskell library [21], both of which relied on an advanced query optimizer called Pathfinder [27] to optimize these queries. The more recent approach, implemented by Ulrich in the current version of DSH and described in detail in his thesis [51], is called query flattening and is instead based on techniques from nested data parallelism [2]. Both loop-lifting and query flattening are very powerful, and do not rely on an initial normalization stage, while supporting a rich source language with list semantics, ordering, grouping, aggregation, and deduplication which can in principle emulate set or multiset semantics. However, to the best of our knowledge no correctness proofs exist for either technique. We view finding correctness results for richer query languages as an important challenge for future work.

Another parallel line of work started by Fegaras and Maier [15,14] considers heterogeneous query languages based on monoid comprehensions, with set, list, and bag collections as well as grouping, aggregation and ordering operations, in the setting of object-oriented databases, and forms the basis for complex object database systems such as λDB [16] and Apache MRQL [14]. However, Wong-style normalization results or translations from flat or nested queries to SQL are not known for these calculi.

Lambda-lifting and closure conversion

Since Johnsson’s original work [30], lambda-lifting and closure conversion have been studied extensively for functional languages, with Minamide et al.’s typed closure conversion [37] of particular interest in compilers employing typed intermediate languages. We plan to study whether known optimizations in the lambda-lifting and closure conversion literature offer advantages for query lifting. The immediate important next step is to implement our approach and compare it empirically with previous techniques such as query shredding and query flattening. By analogy with lambda-lifting and closure conversion, we expect additional optimizations to be possible by a deeper analysis of how variables/fields are used in lifted subqueries. Another problem we have not resolved is how to deal with deduplication or bag difference at nested collection types in practice. Libkin and Wong [33] showed that such nesting can be eliminated from BQL queries, but their results do not provide a constructive algorithm for eliminating the nesting.

Monadic comprehensions have proved to be a remarkably durable foundation for database programming and language-integrated query, and have led to language support (LINQ for .NET, Quill for Scala) with widespread adoption. Recent work has demonstrated that techniques for evaluating queries over nested collections, such as query shredding or query flattening, can offer order-of-magnitude speedups in database applications [19] without sacrificing declarativity or readability. However, query shredding lacks the ability to express common operations such as deduplication, while query flattening is more expressive but lacks a detailed proof of correctness, and both techniques are challenging to understand, implement, or extend. We provide the first provably correct approach to querying nested heterogeneous collections involving both sets and multisets.

Our most important insight is that working in a heterogeneous language, with both set and multiset collection types, actually makes the problem easier, by making it possible to calculate finite maps representing the behavior of nested query subexpressions under all of the possible environments encountered at run time. Thus, instead of having to maintain or synthesize keys linking inner and outer collections, as is done in all previous approaches, we can use the values of variables in the closures of nested query expressions themselves as the keys. The same approach can be used to eliminate sideways information-passing. This is analogous to lambda-lifting or closure conversion in compilation of functional languages, but differs in that we lift local queries to (queries that compute) finite maps rather than ordinary function abstractions. We believe this idea may have broader applications and will next investigate its behavior in practice and applications to other query language features.
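The idea of using closure-variable values as keys can be illustrated with a small sketch. This is not the translation defined in the paper, only an in-memory analogy under stated assumptions: the tables `departments` and `employees`, and the functions `nested_direct` and `nested_lifted`, are hypothetical names invented for this example. The inner subquery depends on the outer variable `d`; the lifted version computes a flat finite map (a "graph") keyed by the deduplicated values of `d`, and then stitches it to the flat outer query.

```python
from collections import Counter, defaultdict

# Hypothetical input data. `departments` is a bag: duplicates matter.
departments = ["sales", "sales", "ops", "legal"]
employees = [("ann", "sales"), ("bob", "ops"), ("cy", "sales")]


def nested_direct():
    """Evaluate the nested query naively: the inner subquery is
    re-evaluated once per outer row (a lateral dependency on d)."""
    return [
        (d, Counter(name for name, dept in employees if dept == d))
        for d in departments
    ]


def nested_lifted():
    """Evaluate the same query by lifting the inner subquery to the top level."""
    # 1. Lifted inner query: a flat bag of (key, value) pairs, where the key
    #    ranges over the *deduplicated* outer generator (a set), so each
    #    environment for d is tabulated exactly once.
    keys = set(departments)
    graph = Counter(
        (d, name) for d in keys for name, dept in employees if dept == d
    )
    # 2. Flat outer query: keeps the multiplicities of the outer bag.
    outer = list(departments)
    # 3. Stitching: look up each outer row's nested result by its key.
    index = defaultdict(Counter)
    for (d, name), multiplicity in graph.items():
        index[d][name] += multiplicity
    return [(d, index[d]) for d in outer]


if __name__ == "__main__":
    print(nested_direct() == nested_lifted())
```

Note that the key set must be a set rather than a bag: tabulating the inner query over the raw outer bag would multiply inner multiplicities by the number of duplicate outer rows, which is one way to see why mixing set and bag constructs is convenient here.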

Acknowledgments

This work was supported by ERC Consolidator Grant Skye (grant number 682315), and by an ISCF Metrology Fellowship grant provided by the UK government’s Department for Business, Energy and Industrial Strategy (BEIS). We are grateful to Simon Fowler for feedback and to anonymous reviewers for constructive comments.

References

1. Benedikt, M., Pradic, P.: Generating collection transformations from proofs. Proc. ACM Program. Lang. (POPL) (Jan 2021), https://doi.org/10.1145/3434295
2. Blelloch, G.E.: Vector Models for Data-Parallel Computing. MIT Press (1990)
3. Buneman, P., Libkin, L., Suciu, D., Tannen, V., Wong, L.: Comprehension syntax. SIGMOD Record (1994)
4. Buneman, P., Naqvi, S., Tannen, V., Wong, L.: Principles of programming with complex objects and collection types. Theor. Comput. Sci. (1) (1995). https://doi.org/10.1016/0304-3975(95)00024-Q
5. Cao, B., Badia, A.: SQL query optimization through nested relational algebra. ACM Trans. Database Syst. (3), 18–es (Aug 2007). https://doi.org/10.1145/1272743.1272748
6. Carette, J., Kiselyov, O., Shan, C.: Finally tagless, partially evaluated: Tagless staged interpreters for simpler typed languages. J. Funct. Program. (5), 509–543 (2009). https://doi.org/10.1017/S0956796809007205
7. Cheney, J., Lindley, S., Wadler, P.: A practical theory of language-integrated query. In: ICFP (2013). https://doi.org/10.1145/2500365.2500586
8. Cheney, J., Lindley, S., Wadler, P.: Query shredding: efficient relational evaluation of queries over nested multisets. In: SIGMOD. pp. 1027–1038. ACM (2014). https://doi.org/10.1145/2588555.2612186
9. Cheney, J., Lindley, S., Wadler, P.: Query shredding: Efficient relational evaluation of queries over nested multisets (extended version). CoRR abs/1404.7078 (2014), http://arxiv.org/abs/1404.7078
10. Chu, S., Weitz, K., Cheung, A., Suciu, D.: HoTTSQL: Proving query rewrites with univalent SQL semantics. In: PLDI. pp. 510–524. ACM (2017). https://doi.org/10.1145/3062341.3062348
11. Cooper, E.: The script-writer’s dream: How to write great SQL in your own language, and be sure it will succeed. In: DBPL (2009). https://doi.org/10.1007/978-3-642-03793-1_3
12. Cooper, E., Lindley, S., Wadler, P., Yallop, J.: Links: web programming without tiers. In: FMCO (2007). https://doi.org/10.1007/978-3-540-74792-5_12
13. Copeland, G., Maier, D.: Making Smalltalk a database system. SIGMOD Rec. (2) (1984)
14. Fegaras, L.: An algebra for distributed big data analytics. J. Funct. Program., e27 (2017). https://doi.org/10.1017/S0956796817000193
15. Fegaras, L., Maier, D.: Optimizing object queries using an effective calculus. ACM Trans. Database Syst. (4), 457–516 (2000)
16. Fegaras, L., Srinivasan, C., Rajendran, A., Maier, D.: lambda-DB: An ODMG-based object-oriented DBMS. In: Chen, W., Naughton, J.F., Bernstein, P.A. (eds.) SIGMOD. p. 583. ACM (2000). https://doi.org/10.1145/342009.335494
17. Fehrenbach, S., Cheney, J.: Language-integrated provenance. Science of Computer Programming, 103–145 (2018)
18. Foster, J.N., Green, T.J., Tannen, V.: Annotated XML: queries and provenance. In: PODS. pp. 271–280 (2008)
19. Fowler, S., Harding, S., Sharman, J., Cheney, J.: Cross-tier web programming for curated databases: a case study. International Journal of Digital Curation (1) (2020). https://doi.org/10.2218/ijdc.v15i1.717, pre-print presented at IDCC 2020
20. Gibbons, J., Henglein, F., Hinze, R., Wu, N.: Relational algebra by way of adjunctions. Proc. ACM Program. Lang. (ICFP) (Jul 2018). https://doi.org/10.1145/3236781
21. Giorgidze, G., Grust, T., Schreiber, T., Weijers, J.: Haskell boards the Ferry - database-supported program execution for Haskell. In: IFL. pp. 1–18. No. 6647 in LNCS, Springer-Verlag (2010)
22. Giorgidze, G., Grust, T., Ulrich, A., Weijers, J.: Algebraic data types for language-integrated queries. In: DDFP. pp. 5–10 (2013)
23. Green, T.J., Karvounarakis, G., Tannen, V.: Provenance semirings. In: PODS (2007)
24. Grust, T., Mayr, M., Rittinger, J.: Let SQL drive the XQuery workhorse (XQuery join graph isolation). In: EDBT. pp. 147–158 (2010). https://doi.org/10.1145/1739041.1739062
25. Grust, T., Mayr, M., Rittinger, J., Schreiber, T.: Ferry: Database-supported program execution. In: SIGMOD (June 2009)
26. Grust, T., Rittinger, J., Schreiber, T.: Avalanche-safe LINQ compilation. PVLDB (1) (2010)
27. Grust, T., Rittinger, J., Teubner, J.: Pathfinder: XQuery off the relational shelf. IEEE Data Eng. Bull. (4) (2008)
28. Grust, T., Scholl, M.H.: How to comprehend queries functionally. J. Intell. Inf. Syst. (2-3), 191–218 (1999). https://doi.org/10.1023/A:1008705026446
29. Grust, T., Ulrich, A.: First-class functions for first-order database engines. In: DBPL (2013), http://arxiv.org/abs/1308.0158
30. Johnsson, T.: Lambda lifting: Transforming programs to recursive equations. In: FPCA. pp. 190–203 (1985). https://doi.org/10.1007/3-540-15975-4_37
31. Katsushima, T., Kiselyov, O.: Language-integrated query with ordering, grouping and outer joins (poster paper). In: PEPM. pp. 123–124 (2017)
32. Kiselyov, O., Katsushima, T.: Sound and efficient language-integrated query - maintaining the ORDER. In: APLAS 2017. pp. 364–383 (2017). https://doi.org/10.1007/978-3-319-71237-6_18
33. Libkin, L., Wong, L.: Query languages for bags and aggregate functions. J. Comput. Syst. Sci. (2) (1997). https://doi.org/10.1006/jcss.1997.1523
34. Lindley, S., Cheney, J.: Row-based effect types for database integration. In: TLDI (2012). https://doi.org/10.1145/2103786.2103798
35. Lindley, S., Wadler, P.: The audacity of hope: Thoughts on reclaiming the database dream. In: ESOP (2010)
36. Meijer, E., Beckman, B., Bierman, G.M.: LINQ: reconciling object, relations and XML in the .NET framework. In: SIGMOD (2006). https://doi.org/10.1145/1142473.1142552
37. Minamide, Y., Morrisett, J.G., Harper, R.: Typed closure conversion. In: POPL. pp. 271–283 (1996). https://doi.org/10.1145/237721.237791
38. Neumann, T., Kemper, A.: Unnesting arbitrary queries. In: Datenbanksysteme für Business, Technologie und Web (BTW). pp. 383–402 (2015)
39. Okura, R., Kameyama, Y.: Language-integrated query with nested data structures and grouping. In: FLOPS. pp. 139–158 (2020). https://doi.org/10.1007/978-3-030-59025-3_9
40. Paredaens, J., Van Gucht, D.: Converting nested algebra expressions into flat algebra expressions. ACM Trans. Database Syst. (1) (1992). https://doi.org/10.1145/128765.128768
41. Quill: Compile-time language integrated queries for Scala. Open source project, https://github.com/getquill/quill
42. Ricciotti, W., Cheney, J.: Mixing set and bag semantics. In: DBPL. pp. 70–73 (2019). https://doi.org/10.1145/3315507.3330202
43. Ricciotti, W., Cheney, J.: Strongly normalizing higher-order relational queries. In: FSCD. pp. 28:1–28:22 (2020). https://doi.org/10.4230/LIPIcs.FSCD.2020.28
44. Russell, C.: Bridging the object-relational divide. Queue (May 2008). https://doi.org/10.1145/1394127.1394139
45. Schek, H., Scholl, M.H.: The relational model with relation-valued attributes. Inf. Syst. (2), 137–147 (1986). https://doi.org/10.1016/0306-4379(86)90003-7
46. Stolarek, J., Cheney, J.: Language-integrated provenance in Haskell. The Art, Science, and Engineering of Programming (3), A11 (2018)
47. Suzuki, K., Kiselyov, O., Kameyama, Y.: Finally, safely-extensible and efficient language-integrated query. In: PEPM. pp. 37–48 (2016). https://doi.org/10.1145/2847538.2847542
48. Syme, D.: Leveraging .NET meta-programming components from F (1-2) (2001)
54. Wong, L.: Normal forms and conservative extension properties for query languages over collection types. J. Comput. Syst. Sci. (3) (1996). https://doi.org/10.1006/jcss.1996.0037
55. Wong, L.: Kleisli, a functional query system. J. Funct. Program. (1) (2000). https://doi.org/10.1017/S0956796899003585

A NRC_λ(Set, Bag)

A.1 Type system

We give here the full set of typing rules for NRC_λ(Set, Bag) that we omitted from the main body of the paper: they are shown in Figure 10.

\[
\frac{x : \sigma \in \Gamma}{\Gamma \vdash x : \sigma}
\qquad
\frac{\Sigma(c) = \overrightarrow{b_n} \to b \quad (\Gamma \vdash M_i : b_i)_{i=1,\dots,n}}{\Gamma \vdash c(\overrightarrow{M_n}) : b}
\]
\[
\frac{(\Gamma \vdash M_i : \sigma_i)_{i=1,\dots,n}}{\Gamma \vdash \langle \overrightarrow{\ell = M} \rangle : \langle \overrightarrow{\ell : \sigma} \rangle}
\qquad
\frac{\Gamma \vdash M : \langle \overrightarrow{\ell : \sigma} \rangle \quad i = 1,\dots,n}{\Gamma \vdash M.\ell_i : \sigma_i}
\]
\[
\frac{\Gamma, x : \sigma \vdash M : \tau}{\Gamma \vdash \lambda x^{\sigma}.M : \sigma \to \tau}
\qquad
\frac{\Gamma \vdash M : \sigma \to \tau \quad \Gamma \vdash N : \sigma}{\Gamma \vdash (M\ N) : \tau}
\]
\[
\frac{}{\Gamma \vdash \emptyset_{\sigma} : \{\sigma\}}
\qquad
\frac{\Gamma \vdash M : \sigma}{\Gamma \vdash \{M\} : \{\sigma\}}
\qquad
\frac{\Gamma \vdash M : \{\sigma\} \quad \Gamma \vdash N : \{\sigma\}}{\Gamma \vdash M \cup N : \{\sigma\}}
\]
\[
\frac{(\Gamma, x_1 : \sigma_1, \dots, x_{i-1} : \sigma_{i-1} \vdash N_i : \{\sigma_i\})_{i=1,\dots,n} \quad \Gamma, \overrightarrow{x : \sigma} \vdash M : \{\tau\}}{\Gamma \vdash \bigcup\{M \mid \overrightarrow{x \leftarrow N}\} : \{\tau\}}
\]
\[
\frac{\Gamma \vdash M : \{\sigma\}}{\Gamma \vdash \mathsf{empty}_{\mathsf{set}}(M) : \mathbf{B}}
\qquad
\frac{\Gamma \vdash M : \{\sigma\} \quad \Gamma \vdash N : \mathbf{B}}{\Gamma \vdash M\ \mathsf{where}_{\mathsf{set}}\ N : \{\sigma\}}
\]
\[
\frac{}{\Gamma \vdash \mho_{\sigma} : \lbag\sigma\rbag}
\qquad
\frac{\Gamma \vdash M : \sigma}{\Gamma \vdash \lbag M \rbag : \lbag\sigma\rbag}
\qquad
\frac{\Gamma \vdash M : \lbag\sigma\rbag \quad \Gamma \vdash N : \lbag\sigma\rbag}{\Gamma \vdash M \uplus N : \lbag\sigma\rbag}
\]
\[
\frac{(\Gamma, x_1 : \sigma_1, \dots, x_{i-1} : \sigma_{i-1} \vdash N_i : \lbag\sigma_i\rbag)_{i=1,\dots,n} \quad \Gamma, \overrightarrow{x : \sigma} \vdash M : \lbag\tau\rbag}{\Gamma \vdash \biguplus\lbag M \mid \overrightarrow{x \leftarrow N}\rbag : \lbag\tau\rbag}
\]
\[
\frac{\Gamma \vdash M : \lbag\sigma\rbag}{\Gamma \vdash \mathsf{empty}_{\mathsf{bag}}(M) : \mathbf{B}}
\qquad
\frac{\Gamma \vdash M : \lbag\sigma\rbag \quad \Gamma \vdash N : \mathbf{B}}{\Gamma \vdash M\ \mathsf{where}_{\mathsf{bag}}\ N : \lbag\sigma\rbag}
\]
\[
\frac{\Gamma \vdash M : \lbag\sigma\rbag}{\Gamma \vdash \delta M : \{\sigma\}}
\qquad
\frac{\Gamma \vdash M : \{\sigma\}}{\Gamma \vdash \iota M : \lbag\sigma\rbag}
\]

Fig. 10. Type system of NRC_λ(Set, Bag).

A.2 Normalization

We show in Figure 11 the rewrite system used to normalize NRC_λ(Set, Bag) queries.

A.3 Semantics

We follow the K-relation style of semantics, as introduced by Green et al. [23] and used for formalization by Chu et al. [10].

\[
\begin{array}{rcl}
(\lambda x.M)\ N &\leadsto& M[N/x] \\
\langle \dots, \ell = M, \dots \rangle.\ell &\leadsto& M \\[4pt]
\bigcup\{\emptyset \mid \Theta\} &\leadsto& \emptyset \\
\bigcup\{M \mid \Theta, x \leftarrow \emptyset, \Theta'\} &\leadsto& \emptyset \\
\bigcup\{M \mid \Theta, x \leftarrow \{N\}, \Theta'\} &\leadsto& \bigcup\{M[N/x] \mid \Theta, \Theta'[N/x]\} \\
\bigcup\{M \cup N \mid \Theta\} &\leadsto& \bigcup\{M \mid \Theta\} \cup \bigcup\{N \mid \Theta\} \\
\bigcup\{M \mid \Theta, x \leftarrow N \cup R, \Theta'\} &\leadsto& \bigcup\{M \mid \Theta, x \leftarrow N, \Theta'\} \cup \bigcup\{M \mid \Theta, x \leftarrow R, \Theta'\} \\
\bigcup\{M \mid \Theta, x \leftarrow \bigcup\{R \mid \Theta'\}, \Theta''\} &\leadsto& \bigcup\{M \mid \Theta, \Theta', x \leftarrow R, \Theta''\} \quad (\text{if } \mathrm{dom}(\Theta') \notin \mathrm{FV}(M, \Theta'')) \\
\bigcup\{\bigcup\{M \mid \Theta'\} \mid \Theta\} &\leadsto& \bigcup\{M \mid \Theta, \Theta'\} \\
\bigcup\{M \mid \Theta, x \leftarrow R\ \mathsf{where}_{\mathsf{set}}\ N, \Theta'\} &\leadsto& \bigcup\{M\ \mathsf{where}_{\mathsf{set}}\ N \mid \Theta, x \leftarrow R, \Theta'\} \quad (\text{if } x \notin \mathrm{FV}(N)) \\
\bigcup\{\delta(M - N) \mid \Theta\} &\leadsto& \bigcup\{\{z\} \mid \Theta, z \leftarrow \delta(M - N)\} \\
\bigcup\{\delta(M - N)\ \mathsf{where}_{\mathsf{set}}\ R \mid \Theta\} &\leadsto& \bigcup\{\{z\}\ \mathsf{where}_{\mathsf{set}}\ R \mid \Theta, z \leftarrow \delta(M - N)\} \quad (\text{if } z \notin \mathrm{FV}(R)) \\[4pt]
M\ \mathsf{where}_{\mathsf{set}}\ \mathsf{true} &\leadsto& M \\
M\ \mathsf{where}_{\mathsf{set}}\ \mathsf{false} &\leadsto& \emptyset \\
\emptyset\ \mathsf{where}_{\mathsf{set}}\ M &\leadsto& \emptyset \\
(N \cup R)\ \mathsf{where}_{\mathsf{set}}\ M &\leadsto& (N\ \mathsf{where}_{\mathsf{set}}\ M) \cup (R\ \mathsf{where}_{\mathsf{set}}\ M) \\
\bigcup\{N \mid \Theta\}\ \mathsf{where}_{\mathsf{set}}\ M &\leadsto& \bigcup\{N\ \mathsf{where}_{\mathsf{set}}\ M \mid \Theta\} \quad (\text{if } \mathrm{dom}(\Theta) \cap \mathrm{FV}(M) = \emptyset) \\
R\ \mathsf{where}_{\mathsf{set}}\ N\ \mathsf{where}_{\mathsf{set}}\ M &\leadsto& R\ \mathsf{where}_{\mathsf{set}}\ (M \wedge N) \\[4pt]
\biguplus\lbag \mho \mid \Theta\rbag &\leadsto& \mho \\
\biguplus\lbag M \mid \Theta, x \leftarrow \mho, \Theta'\rbag &\leadsto& \mho \\
\biguplus\lbag M \mid \Theta, x \leftarrow \lbag N\rbag, \Theta'\rbag &\leadsto& \biguplus\lbag M[N/x] \mid \Theta, \Theta'[N/x]\rbag \\
\biguplus\lbag M \uplus N \mid \Theta\rbag &\leadsto& \biguplus\lbag M \mid \Theta\rbag \uplus \biguplus\lbag N \mid \Theta\rbag \\
\biguplus\lbag M \mid \Theta, x \leftarrow N \uplus R, \Theta'\rbag &\leadsto& \biguplus\lbag M \mid \Theta, x \leftarrow N, \Theta'\rbag \uplus \biguplus\lbag M \mid \Theta, x \leftarrow R, \Theta'\rbag \\
\biguplus\lbag M \mid \Theta, x \leftarrow \biguplus\lbag R \mid \Theta'\rbag, \Theta''\rbag &\leadsto& \biguplus\lbag M \mid \Theta, \Theta', x \leftarrow R, \Theta''\rbag \quad (\text{if } \mathrm{dom}(\Theta') \notin \mathrm{FV}(M, \Theta'')) \\
\biguplus\lbag \biguplus\lbag M \mid \Theta'\rbag \mid \Theta\rbag &\leadsto& \biguplus\lbag M \mid \Theta, \Theta'\rbag \\
\biguplus\lbag M \mid \Theta, x \leftarrow R\ \mathsf{where}_{\mathsf{bag}}\ N, \Theta'\rbag &\leadsto& \biguplus\lbag M\ \mathsf{where}_{\mathsf{bag}}\ N \mid \Theta, x \leftarrow R, \Theta'\rbag \quad (\text{if } x \notin \mathrm{FV}(N)) \\
\biguplus\lbag \iota M \mid \Theta\rbag &\leadsto& \biguplus\lbag \lbag z\rbag \mid \Theta, z \leftarrow \iota M\rbag \\
\biguplus\lbag \iota M\ \mathsf{where}_{\mathsf{bag}}\ R \mid \Theta\rbag &\leadsto& \biguplus\lbag \lbag z\rbag\ \mathsf{where}_{\mathsf{bag}}\ R \mid \Theta, z \leftarrow \iota M\rbag \quad (\text{if } z \notin \mathrm{FV}(R)) \\
\biguplus\lbag M - N \mid \Theta\rbag &\leadsto& \biguplus\lbag \lbag z\rbag \mid \Theta, z \leftarrow M - N\rbag \\
\biguplus\lbag (M - N)\ \mathsf{where}_{\mathsf{bag}}\ R \mid \Theta\rbag &\leadsto& \biguplus\lbag \lbag z\rbag\ \mathsf{where}_{\mathsf{bag}}\ R \mid \Theta, z \leftarrow M - N\rbag \quad (\text{if } z \notin \mathrm{FV}(R)) \\[4pt]
M\ \mathsf{where}_{\mathsf{bag}}\ \mathsf{true} &\leadsto& M \\
M\ \mathsf{where}_{\mathsf{bag}}\ \mathsf{false} &\leadsto& \mho \\
\mho\ \mathsf{where}_{\mathsf{bag}}\ M &\leadsto& \mho \\
(N \uplus R)\ \mathsf{where}_{\mathsf{bag}}\ M &\leadsto& (N\ \mathsf{where}_{\mathsf{bag}}\ M) \uplus (R\ \mathsf{where}_{\mathsf{bag}}\ M) \\
\biguplus\lbag N \mid \Theta\rbag\ \mathsf{where}_{\mathsf{bag}}\ M &\leadsto& \biguplus\lbag N\ \mathsf{where}_{\mathsf{bag}}\ M \mid \Theta\rbag \quad (\text{if } \mathrm{dom}(\Theta) \cap \mathrm{FV}(M) = \emptyset) \\
R\ \mathsf{where}_{\mathsf{bag}}\ N\ \mathsf{where}_{\mathsf{bag}}\ M &\leadsto& R\ \mathsf{where}_{\mathsf{bag}}\ (M \wedge N) \\[4pt]
\delta\mho &\leadsto& \emptyset \\
\delta\lbag M\rbag &\leadsto& \{M\} \\
\delta(M \uplus N) &\leadsto& \delta M \cup \delta N \\
\delta\biguplus\lbag M \mid \Theta\rbag &\leadsto& \bigcup\{\delta M \mid \Theta^{\delta}\} \\
\delta(M\ \mathsf{where}_{\mathsf{bag}}\ N) &\leadsto& \delta M\ \mathsf{where}_{\mathsf{set}}\ N \\
\delta\iota M &\leadsto& M \\
\iota\emptyset &\leadsto& \mho \\
\iota(M\ \mathsf{where}_{\mathsf{set}}\ N) &\leadsto& \iota M\ \mathsf{where}_{\mathsf{bag}}\ N \\[4pt]
\mathsf{empty}_{\mathsf{set}}(M) &\leadsto& \mathsf{empty}_{\mathsf{set}}(\bigcup\{\{\langle\rangle\} \mid x \leftarrow M\}) \quad (\text{if } M \text{ is not a flat set}) \\
\mathsf{empty}_{\mathsf{bag}}(M) &\leadsto& \mathsf{empty}_{\mathsf{bag}}(\biguplus\lbag\lbag\langle\rangle\rbag \mid x \leftarrow M\rbag) \quad (\text{if } M \text{ is not a flat bag})
\end{array}
\]
\[
\begin{array}{rcl}
(\overrightarrow{x \leftarrow M})[N/y] &\triangleq& \overrightarrow{x \leftarrow M[N/y]} \quad (\text{if } x \neq y,\ \mathrm{FV}(N) \cap \overrightarrow{x} = \emptyset) \\
(\overrightarrow{x \leftarrow M})^{\delta} &\triangleq& \overrightarrow{x \leftarrow \delta M}
\end{array}
\]

Fig. 11. Query normalization

\[
\begin{array}{rcl}
\llbracket \emptyset \rrbracket\rho &=& \lambda u.\,0 \\
\llbracket \{M\} \rrbracket\rho &=& \lambda u.\,(\llbracket M \rrbracket\rho = u) \\
\llbracket M \cup N \rrbracket\rho &=& \lambda u.\,\llbracket M \rrbracket\rho\,u \vee \llbracket N \rrbracket\rho\,u \\
\llbracket \bigcup\{N \mid x \leftarrow M\} \rrbracket\rho &=& \lambda u.\,\bigvee_{v} \llbracket M \rrbracket\rho\,v \wedge \llbracket N \rrbracket\rho[x \mapsto v]\,u \\
\llbracket \mathsf{empty}_{\mathsf{set}}(M) \rrbracket\rho &=& \lambda u.\,\neg(\bigvee_{u} \llbracket M \rrbracket\rho\,u) \\
\llbracket M\ \mathsf{where}_{\mathsf{set}}\ N \rrbracket\rho &=& \lambda u.\,\llbracket M \rrbracket\rho\,u \wedge \llbracket N \rrbracket\rho \\
\llbracket \delta(M) \rrbracket\rho &=& \lambda u.\,\zeta(\llbracket M \rrbracket\rho\,u) \\[4pt]
\llbracket \mho \rrbracket\rho &=& \lambda u.\,0 \\
\llbracket \lbag M \rbag \rrbracket\rho &=& \lambda u.\,\chi(\llbracket M \rrbracket\rho = u) \\
\llbracket M \uplus N \rrbracket\rho &=& \lambda u.\,\llbracket M \rrbracket\rho\,u + \llbracket N \rrbracket\rho\,u \\
\llbracket \biguplus\lbag N \mid x \leftarrow M\rbag \rrbracket\rho &=& \lambda u.\,\sum_{v} \llbracket M \rrbracket\rho\,v \times \llbracket N \rrbracket\rho[x \mapsto v]\,u \\
\llbracket M\ \mathsf{where}_{\mathsf{bag}}\ N \rrbracket\rho &=& \lambda u.\,\llbracket M \rrbracket\rho\,u \times \chi(\llbracket N \rrbracket\rho) \\
\llbracket \mathsf{empty}_{\mathsf{bag}}(M) \rrbracket\rho &=& \lambda u.\,\neg(\zeta(\sum_{u} \llbracket M \rrbracket\rho\,u)) \\
\llbracket M - N \rrbracket\rho &=& \lambda u.\,\llbracket M \rrbracket\rho\,u \mathbin{\dot{-}} \llbracket N \rrbracket\rho\,u \\
\llbracket \iota(M) \rrbracket\rho &=& \lambda u.\,\chi(\llbracket M \rrbracket\rho\,u)
\end{array}
\]

Fig. 12. Semantics of set and multiset operations of NRC_λ(Set, Bag)

Basic types and records are represented by the usual interpretations of such types, and the details are elided. For set types, the interpretation of a set type $\{A\}$ is $\llbracket A \rrbracket \to_{fs} \{0,1\}$. Here $\to_{fs}$ denotes the set of finitely-supported functions from $\llbracket A \rrbracket$, where the support is the set of elements mapped to a nonzero value. We consider $\{0,1\}$ equipped with the usual structure of a Boolean algebra, with operations $\wedge, \vee, \neg$, and we consider equality and other meta-level predicates as functions returning Boolean values. Likewise, we consider bag types $\lbag A \rbag$ to be interpreted as finitely-supported functions $\llbracket A \rrbracket \to_{fs} \mathbb{N}$, where $\mathbb{N}$ is the set of natural numbers, equipped with the usual arithmetic operations $+, \dot{-}, \times$; here $\dot{-}$ is truncated subtraction, $m \mathbin{\dot{-}} n = \max(m - n, 0)$. We write $\chi : \{0,1\} \to \mathbb{N}$ for the “characteristic function” and $\zeta : \mathbb{N} \to \{0,1\}$ for the “nonzero test” function $x \mapsto (x > 0)$ that maps 0 to 0 and any nonzero value to 1. Note that $\zeta(\chi(n)) = n$ for $n \in \{0,1\}$.

Since we work with finitely-supported functions $f, p$, we write $\sum_u f(u)$ (resp. $\bigvee_u p(u)$) for the summation (resp. disjunction) over all possible $u$ of $f(u)$ (resp. $p(u)$). Although this summation or disjunction ranges over infinitely many values, the number of values of $u$ for which $f$ or $p$ can be nonzero is finite, so this is a finite sum or disjunction and thus well-defined. Finally, NRC_λ(Set, Bag) also includes function types, lambda abstraction, and application, but not recursion. Their addition poses no difficulty, and since these features can be normalized away prior to applying the results in this paper, we do not explicitly discuss them in the semantics.
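The finitely-supported-function reading of sets and bags can be made concrete with a short sketch. It is only an illustration under assumptions, not part of the formal development: a set is represented by its support (a `frozenset`), a bag by a `Counter` of multiplicities, and the function names `chi`, `zeta`, `monus`, `delta`, `iota`, `bag_union`, `bag_diff` and `bag_comprehension` are chosen here to mirror the symbols χ, ζ, truncated subtraction, δ, ι, bag union, bag difference and bag comprehension.

```python
from collections import Counter


def chi(b: bool) -> int:
    """Characteristic function chi : {0,1} -> N."""
    return 1 if b else 0


def zeta(n: int) -> bool:
    """Nonzero test zeta : N -> {0,1}."""
    return n > 0


def monus(m: int, n: int) -> int:
    """Truncated subtraction: max(m - n, 0)."""
    return max(m - n, 0)


def delta(bag: Counter) -> frozenset:
    """Deduplication: keep every element with nonzero multiplicity."""
    return frozenset(u for u, k in bag.items() if zeta(k))


def iota(s: frozenset) -> Counter:
    """Promotion: every member of the set gets multiplicity chi(True) = 1."""
    return Counter({u: chi(True) for u in s})


def bag_union(m: Counter, n: Counter) -> Counter:
    """Additive union: multiplicities are summed pointwise."""
    return Counter({u: m[u] + n[u] for u in set(m) | set(n)})


def bag_diff(m: Counter, n: Counter) -> Counter:
    """Bag difference: pointwise truncated subtraction."""
    out = Counter({u: monus(m[u], n[u]) for u in m})
    return +out  # unary plus drops entries with zero multiplicity


def bag_comprehension(gen: Counter, body) -> Counter:
    """Bag comprehension over one generator: the multiplicity of u is
    the sum over v of gen[v] * body(v)[u]."""
    out = Counter()
    for v, k in gen.items():
        for u, j in body(v).items():
            out[u] += k * j
    return out
```

For instance, with `m = Counter({"a": 2, "b": 1})`, the call `bag_comprehension(m, lambda v: Counter({v.upper(): 2}))` multiplies the generator multiplicities by those of the body, following the sum-of-products clause in Figure 12.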

B Proofs for Section 5

Lemma 5.

1. $\sum_t \chi(t = u) \times e(t, u) = e(u, u)$
2. $\chi(\zeta(e)) \times e = \chi(e > 0) \times e = e$

Proof. For part (1), all of the summands are zero except (possibly) when $t = u$. Part (2) follows by a simple case analysis on $e > 0$; if $e = 0$ then both sides are zero, while if $e > 0$ then $\chi(e > 0) \times e = 1 \times e = e$. □

Lemma 6 (Commutativity). Suppose $\{x, y\} \cap \mathrm{FV}(N, P) = \emptyset$. Then

\[
\begin{array}{rcl}
\biguplus\lbag M \mid x \leftarrow N, y \leftarrow P\rbag &\equiv& \biguplus\lbag M \mid y \leftarrow P, x \leftarrow N\rbag \\
\bigcup\{M \mid x \leftarrow N, y \leftarrow P\} &\equiv& \bigcup\{M \mid y \leftarrow P, x \leftarrow N\}
\end{array}
\]

Proof. Straightforward by unfolding definitions. □

Recall (for example from Buneman et al. [4]) that set membership $M \in N$ is definable as $\neg\,\mathsf{empty}_{\mathsf{set}}(\{x \mid x \leftarrow N, x = M\})$. It is straightforward to show that $\llbracket M \in N \rrbracket\rho = \llbracket N \rrbracket\rho(\llbracket M \rrbracket\rho)$, that is, the result is true iff the interpretation of $N$ returns true on the interpretation of $M$. We will use this as a primitive in the following proofs. First we observe that when $x$ was introduced by a generator $x \leftarrow N$, then it is redundant to check that $x \in \delta(N)$ (if $N$ is a bag) or $x \in N$ (if $N$ is a set).

Lemma 7.

\[
\begin{array}{rcl}
\biguplus\lbag M\ \mathsf{where}_{\mathsf{bag}}\ x \in \delta(N) \mid x \leftarrow N\rbag &\equiv& \biguplus\lbag M \mid x \leftarrow N\rbag \\
\bigcup\{M\ \mathsf{where}_{\mathsf{set}}\ x \in N \mid x \leftarrow N\} &\equiv& \bigcup\{M \mid x \leftarrow N\}
\end{array}
\]

Proof. For the first equation we reason as follows:

\[
\begin{array}{rl}
 & \llbracket \biguplus\lbag M\ \mathsf{where}_{\mathsf{bag}}\ x \in \delta(N) \mid x \leftarrow N\rbag \rrbracket\rho\,v \\
= & \sum_u \llbracket M\ \mathsf{where}_{\mathsf{bag}}\ x \in \delta(N) \rrbracket\rho[x \mapsto u]\,v \times \llbracket N \rrbracket\rho\,u \\
= & \sum_u \llbracket M \rrbracket\rho[x \mapsto u]\,v \times \chi(\llbracket x \in \delta(N) \rrbracket\rho[x \mapsto u]) \times \llbracket N \rrbracket\rho\,u \\
= & \sum_u \llbracket M \rrbracket\rho[x \mapsto u]\,v \times \chi(\llbracket \delta(N) \rrbracket\rho[x \mapsto u](\llbracket x \rrbracket\rho[x \mapsto u])) \times \llbracket N \rrbracket\rho\,u \\
= & \sum_u \llbracket M \rrbracket\rho[x \mapsto u]\,v \times \chi(\zeta(\llbracket N \rrbracket\rho\,u)) \times \llbracket N \rrbracket\rho\,u \\
= & \sum_u \llbracket M \rrbracket\rho[x \mapsto u]\,v \times \llbracket N \rrbracket\rho\,u \\
= & \llbracket \biguplus\lbag M \mid x \leftarrow N\rbag \rrbracket\rho\,v
\end{array}
\]

The fifth line follows from part (2) of Lemma 5. The proof of the second equation is similar, but simpler. □
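The first equation of Lemma 7 can be sanity-checked on a concrete instance. The sketch below is only a finite test on one example, not a substitute for the proof; the representation of bags as `Counter` objects and the helper names `bag_comprehension`, `where_bag` and `delta` are assumptions made for this illustration.

```python
from collections import Counter


def bag_comprehension(gen: Counter, body) -> Counter:
    """Bag comprehension: multiplicity of u is sum over x of gen[x] * body(x)[u]."""
    out = Counter()
    for x, k in gen.items():
        for u, j in body(x).items():
            out[u] += k * j
    return out


def where_bag(bag: Counter, cond: bool) -> Counter:
    """Bag-valued conditional: the bag itself if cond holds, else the empty bag."""
    return bag if cond else Counter()


def delta(bag: Counter) -> set:
    """Deduplication: the set of elements with nonzero multiplicity."""
    return {u for u, k in bag.items() if k > 0}


# Example generator bag N and a body M that depends on the bound variable x.
N = Counter({1: 2, 2: 1})


def M(x):
    return Counter({x * 10: 3})


# Left-hand side: the body is guarded by the redundant test x in delta(N).
lhs = bag_comprehension(N, lambda x: where_bag(M(x), x in delta(N)))
# Right-hand side: the same comprehension without the guard.
rhs = bag_comprehension(N, M)

if __name__ == "__main__":
    print(lhs == rhs)
```

The guard never fails because every `x` drawn from `N` has nonzero multiplicity in `N`, which is exactly the case covered by the χ(ζ(e)) × e = e step of the proof.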

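Lemma 8.

\[
\begin{array}{rcl}
\mathcal{G}_{\mathsf{bag}}(x \leftarrow N; M) \circledast O &\equiv& M[O/x]\ \mathsf{where}_{\mathsf{bag}}\ O \in N \\
\mathcal{G}_{\mathsf{set}}(x \leftarrow N; M) \circledast O &\equiv& M[O/x]\ \mathsf{where}_{\mathsf{set}}\ O \in N
\end{array}
\]

Proof. The proofs are similar; we show the first.

\[
\begin{array}{rl}
 & \llbracket \mathcal{G}(x \leftarrow N; M) \circledast O \rrbracket\rho\,v \\
= & \llbracket \mathcal{G}(x \leftarrow N; M) \rrbracket\rho\,(\llbracket O \rrbracket\rho, v) \\
= & \chi(\llbracket N \rrbracket\rho(\llbracket O \rrbracket\rho)) \times \llbracket M \rrbracket\rho[x \mapsto \llbracket O \rrbracket\rho]\,v \\
= & \llbracket M[O/x] \rrbracket\rho\,v \times \chi(\llbracket O \in N \rrbracket\rho) \\
= & \llbracket M[O/x]\ \mathsf{where}_{\mathsf{bag}}\ O \in N \rrbracket\rho\,v
\end{array}
\]
□

Next we show graph construction commutes with promotion, deduplication, union, multiset union and difference:

Lemma 9. $\iota(\mathcal{G}_{\mathsf{set}}(\overrightarrow{x \leftarrow N}; M)) \equiv \mathcal{G}_{\mathsf{bag}}(\overrightarrow{x \leftarrow N}; \iota(M))$

Proof.

\[
\begin{array}{rl}
 & \llbracket \iota(\mathcal{G}_{\mathsf{set}}(\overrightarrow{x \leftarrow N}; M)) \rrbracket\rho\,(\vec{u}, v) \\
= & \chi(\llbracket \mathcal{G}_{\mathsf{set}}(\overrightarrow{x \leftarrow N}; M) \rrbracket\rho\,(\vec{u}, v)) \\
= & \chi(\llbracket \vec{N} \rrbracket\rho\,\vec{u} \wedge \llbracket M \rrbracket\rho[\overrightarrow{x \mapsto u}]\,v) \\
= & \chi(\llbracket \vec{N} \rrbracket\rho\,\vec{u}) \times \chi(\llbracket M \rrbracket\rho[\overrightarrow{x \mapsto u}]\,v) \\
= & \chi(\llbracket \vec{N} \rrbracket\rho\,\vec{u}) \times \llbracket \iota(M) \rrbracket\rho[\overrightarrow{x \mapsto u}]\,v \\
= & \llbracket \mathcal{G}_{\mathsf{bag}}(\overrightarrow{x \leftarrow N}; \iota(M)) \rrbracket\rho\,(\vec{u}, v)
\end{array}
\]
□

Corollary 2. $\iota(\mathcal{G}_{\mathsf{set}}(x \leftarrow N; M)) \circledast O \equiv \iota(M[O/x])\ \mathsf{where}_{\mathsf{bag}}\ O \in N$

Lemma 10. $\delta(\mathcal{G}_{\mathsf{bag}}(\overrightarrow{x \leftarrow N}; M)) \equiv \mathcal{G}_{\mathsf{set}}(\overrightarrow{x \leftarrow N}; \delta(M))$

Proof.

\[
\begin{array}{rl}
 & \llbracket \delta(\mathcal{G}_{\mathsf{bag}}(\overrightarrow{x \leftarrow N}; M)) \rrbracket\rho\,(\vec{u}, v) \\
= & \zeta(\llbracket \mathcal{G}_{\mathsf{bag}}(\overrightarrow{x \leftarrow N}; M) \rrbracket\rho\,(\vec{u}, v)) \\
= & \zeta(\chi(\llbracket \vec{N} \rrbracket\rho\,\vec{u}) \times \llbracket M \rrbracket\rho[\overrightarrow{x \mapsto u}]\,v) \\
= & \zeta(\chi(\llbracket \vec{N} \rrbracket\rho\,\vec{u})) \wedge \zeta(\llbracket M \rrbracket\rho[\overrightarrow{x \mapsto u}]\,v) \\
= & \llbracket \vec{N} \rrbracket\rho\,\vec{u} \wedge \llbracket \delta(M) \rrbracket\rho[\overrightarrow{x \mapsto u}]\,v \\
= & \llbracket \mathcal{G}_{\mathsf{set}}(\overrightarrow{x \leftarrow N}; \delta(M)) \rrbracket\rho\,(\vec{u}, v)
\end{array}
\]
□

Lemma 11. $\mathcal{G}(\overrightarrow{x \leftarrow N}; M_1) \cup \mathcal{G}(\overrightarrow{x \leftarrow N}; M_2) \equiv \mathcal{G}(\overrightarrow{x \leftarrow N}; M_1 \cup M_2)$

Proof.

\[
\begin{array}{rl}
 & \llbracket \mathcal{G}(\overrightarrow{x \leftarrow N}; M_1) \cup \mathcal{G}(\overrightarrow{x \leftarrow N}; M_2) \rrbracket\rho\,(\vec{u}, v) \\
= & \llbracket \mathcal{G}(\overrightarrow{x \leftarrow N}; M_1) \rrbracket\rho\,(\vec{u}, v) \vee \llbracket \mathcal{G}(\overrightarrow{x \leftarrow N}; M_2) \rrbracket\rho\,(\vec{u}, v) \\
= & (\llbracket \vec{N} \rrbracket\rho\,\vec{u} \wedge \llbracket M_1 \rrbracket\rho[\overrightarrow{x \mapsto u}]\,v) \vee (\llbracket \vec{N} \rrbracket\rho\,\vec{u} \wedge \llbracket M_2 \rrbracket\rho[\overrightarrow{x \mapsto u}]\,v) \\
= & \llbracket \vec{N} \rrbracket\rho\,\vec{u} \wedge (\llbracket M_1 \rrbracket\rho[\overrightarrow{x \mapsto u}]\,v \vee \llbracket M_2 \rrbracket\rho[\overrightarrow{x \mapsto u}]\,v) \\
= & \llbracket \vec{N} \rrbracket\rho\,\vec{u} \wedge \llbracket M_1 \cup M_2 \rrbracket\rho[\overrightarrow{x \mapsto u}]\,v \\
= & \llbracket \mathcal{G}(\overrightarrow{x \leftarrow N}; M_1 \cup M_2) \rrbracket\rho\,(\vec{u}, v)
\end{array}
\]
□

Lemma 12. $\mathcal{G}(\overrightarrow{x \leftarrow N}; M_1) \uplus \mathcal{G}(\overrightarrow{x \leftarrow N}; M_2) \equiv \mathcal{G}(\overrightarrow{x \leftarrow N}; M_1 \uplus M_2)$

Proof.

\[
\begin{array}{rl}
 & \llbracket \mathcal{G}(\overrightarrow{x \leftarrow N}; M_1) \uplus \mathcal{G}(\overrightarrow{x \leftarrow N}; M_2) \rrbracket\rho\,(\vec{u}, v) \\
= & \llbracket \mathcal{G}(\overrightarrow{x \leftarrow N}; M_1) \rrbracket\rho\,(\vec{u}, v) + \llbracket \mathcal{G}(\overrightarrow{x \leftarrow N}; M_2) \rrbracket\rho\,(\vec{u}, v) \\
= & \chi(\llbracket \vec{N} \rrbracket\rho\,\vec{u}) \times \llbracket M_1 \rrbracket\rho[\overrightarrow{x \mapsto u}]\,v + \chi(\llbracket \vec{N} \rrbracket\rho\,\vec{u}) \times \llbracket M_2 \rrbracket\rho[\overrightarrow{x \mapsto u}]\,v \\
= & \chi(\llbracket \vec{N} \rrbracket\rho\,\vec{u}) \times (\llbracket M_1 \rrbracket\rho[\overrightarrow{x \mapsto u}]\,v + \llbracket M_2 \rrbracket\rho[\overrightarrow{x \mapsto u}]\,v) \\
= & \chi(\llbracket \vec{N} \rrbracket\rho\,\vec{u}) \times \llbracket M_1 \uplus M_2 \rrbracket\rho[\overrightarrow{x \mapsto u}]\,v \\
= & \llbracket \mathcal{G}(\overrightarrow{x \leftarrow N}; M_1 \uplus M_2) \rrbracket\rho\,(\vec{u}, v)
\end{array}
\]
□

Lemma 13. $\mathcal{G}(\overrightarrow{x \leftarrow N}; M_1) - \mathcal{G}(\overrightarrow{x \leftarrow N}; M_2) \equiv \mathcal{G}(\overrightarrow{x \leftarrow N}; M_1 - M_2)$

Proof.

\[
\begin{array}{rl}
 & \llbracket \mathcal{G}(\overrightarrow{x \leftarrow N}; M_1) - \mathcal{G}(\overrightarrow{x \leftarrow N}; M_2) \rrbracket\rho\,(\vec{u}, v) \\
= & \llbracket \mathcal{G}(\overrightarrow{x \leftarrow N}; M_1) \rrbracket\rho\,(\vec{u}, v) \mathbin{\dot{-}} \llbracket \mathcal{G}(\overrightarrow{x \leftarrow N}; M_2) \rrbracket\rho\,(\vec{u}, v) \\
= & \chi(\llbracket \vec{N} \rrbracket\rho\,\vec{u}) \times \llbracket M_1 \rrbracket\rho[\overrightarrow{x \mapsto u}]\,v \mathbin{\dot{-}} \chi(\llbracket \vec{N} \rrbracket\rho\,\vec{u}) \times \llbracket M_2 \rrbracket\rho[\overrightarrow{x \mapsto u}]\,v \\
= & \chi(\llbracket \vec{N} \rrbracket\rho\,\vec{u}) \times (\llbracket M_1 \rrbracket\rho[\overrightarrow{x \mapsto u}]\,v \mathbin{\dot{-}} \llbracket M_2 \rrbracket\rho[\overrightarrow{x \mapsto u}]\,v) \\
= & \chi(\llbracket \vec{N} \rrbracket\rho\,\vec{u}) \times \llbracket M_1 - M_2 \rrbracket\rho[\overrightarrow{x \mapsto u}]\,v \\
= & \llbracket \mathcal{G}(\overrightarrow{x \leftarrow N}; M_1 - M_2) \rrbracket\rho\,(\vec{u}, v)
\end{array}
\]
□

Corollary 3. $(\mathcal{G}(x \leftarrow N; M_1) - \mathcal{G}(x \leftarrow N; M_2)) \circledast O \equiv (M_1[O/x] - M_2[O/x])\ \mathsf{where}_{\mathsf{bag}}\ O \in N$

We can now use these equivalences to show the correctness of the delateralization rules for promotion and difference: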

Theorem 5. (cid:93) (cid:72) M | x ← N, y ← ι ( P ) (cid:73) ≡ (cid:93) (cid:72) M | x ← N, y ← ι ( G ( x ← N ; P )) (cid:16) x (cid:73) Proof.

We use Cor. 2 and Lemma 7 and standard equivalences for monadiccomprehensions: (cid:93) (cid:72) M | x ← N, y ← ι ( P ) (cid:73) ≡ (cid:93) (cid:72) (cid:93) (cid:72) M | y ← ι ( P ) (cid:73) | x ← N (cid:73) ≡ (cid:93) (cid:72) (cid:93) (cid:72) M | y ← ι ( P ) (cid:73) where bag x ∈ N | x ← N (cid:73) ≡ (cid:93) (cid:72) (cid:93) (cid:72) M | y ← ι ( P ) where bag x ∈ N (cid:73) | x ← N (cid:73) ≡ (cid:93) (cid:72) M | x ← N, y ← ι ( P ) where bag x ∈ N (cid:73) ≡ (cid:93) (cid:72) M | x ← N, y ← ι ( G ( x ← N ; P )) (cid:16) x (cid:73) (cid:117)(cid:116) Theorem 6. (cid:93) (cid:72) M | x ← N, y ← P − P (cid:73) ≡ (cid:93) (cid:72) M | x ← N, y ← G ( x ← δ ( N ); P ) − G ( x ← δ ( N ); P ) (cid:73) Proof.

We use Cor. 3 and Lemma 7 and standard equivalences for monadiccomprehensions: (cid:93) (cid:72) M | x ← N, y ← P − P (cid:73) ≡ (cid:93) (cid:72) (cid:93) (cid:72) M | y ← P − P (cid:73) | x ← N (cid:73) ≡ (cid:93) (cid:72) (cid:93) (cid:72) M | y ← P − P (cid:73) where bag x ∈ δ ( N ) | x ← N (cid:73) ≡ (cid:93) (cid:72) (cid:93) (cid:72) M | y ← ( P − P ) where bag x ∈ δ ( N ) (cid:73) | x ← N (cid:73) ≡ (cid:93) (cid:72) M | x ← N, y ← ( G ( x ← δN ; P ) − G ( x ← δN ; P )) (cid:16) x (cid:73) (cid:117)(cid:116) Theorem 7. (cid:91) { M | x ← N, y ← δ ( P − P ) } ≡ (cid:91) { M | x ← N, y ← δ ( G ( x ← N ; P ) − G ( x ← N ; P )) } Proof.

We use Lemmas 7, 10 and 13 and standard equivalences for monadic comprehensions:
\[
\begin{aligned}
& \bigcup \{ M \mid x \leftarrow N, y \leftarrow \delta(P_1 - P_2) \} \\
\equiv{}& \bigcup \{ \bigcup \{ M \mid y \leftarrow \delta(P_1 - P_2) \} \mid x \leftarrow N \} \\
\equiv{}& \bigcup \{ \bigcup \{ M \mid y \leftarrow \delta(P_1 - P_2) \} \;\mathsf{where}_{\mathsf{set}}\; x \in N \mid x \leftarrow N \} \\
\equiv{}& \bigcup \{ \bigcup \{ M \mid y \leftarrow \delta(P_1 - P_2) \;\mathsf{where}_{\mathsf{set}}\; x \in N \} \mid x \leftarrow N \} \\
\equiv{}& \bigcup \{ \bigcup \{ M \mid y \leftarrow \mathcal{G}(x \leftarrow N; \delta(P_1 - P_2)) \circledast x \} \mid x \leftarrow N \} \\
\equiv{}& \bigcup \{ \bigcup \{ M \mid y \leftarrow \delta(\mathcal{G}(x \leftarrow N; P_1 - P_2)) \circledast x \} \mid x \leftarrow N \} \\
\equiv{}& \bigcup \{ M \mid x \leftarrow N, y \leftarrow \delta(\mathcal{G}(x \leftarrow N; P_1) - \mathcal{G}(x \leftarrow N; P_2)) \circledast x \} \qquad \square
\end{aligned}
\]

Now, to prove that delateralization eventually terminates, we consider a metric on query expressions defined as follows: given an expression in normal form, for each subexpression of the form $\iota(N)$ or $M - N$, add up the number of free variables occurring in $M, N$.
\[
\begin{aligned}
\| \mho \| &= 0 \\
\| \{ M \} \| = \| \Lbag M \Rbag \| &= \| M \| \\
\| M \cup N \| = \| M \uplus N \| &= \| M \| + \| N \| \\
\| M - N \| &= \| M \| + \| N \| + |FV(M, N)| \\
\| \iota(M) \| &= \| M \| + |FV(M)| \\
\| \delta(M) \| &= \| M \| \\
\| \bigcup \{ M \mid x \leftarrow N \} \| = \| \biguplus \Lbag M \mid x \leftarrow N \Rbag \| &= \| M \| + \| N \| \\
\| M \;\mathsf{where}_{\mathsf{set}}\; N \| = \| M \;\mathsf{where}_{\mathsf{bag}}\; N \| &= \| M \| + \| N \| \\
\| M \| &= 0 \quad \text{otherwise}
\end{aligned}
\]

If the metric is zero, then the query is fully delateralized. Combining the basic delateralization steps above with commutativity, any expression with nonzero metric can be rewritten so as to decrease the metric (though possibly increasing the query size). We can also undo the effects of commutativity steps to restore the original order of generators, to preserve the query structure as much as possible for readability.

Theorem 8.

Given $M$ with $\| M \| > 0$, there exists an equivalent $M'$ with $\| M' \| < \| M \|$ that can be obtained by applying commutativity and basic rewrites. Hence, there exists an equivalent fully-delateralized $M''$ with $\| M'' \| = 0$.

Proof. The proof requires establishing that whenever $\| M \| > 0$, there exists at least one outermost subexpression $M_0$ of the form $\iota(N)$ or $N - P$ with $\| M_0 \| > 0$; $M_0$ should not be a subexpression of any larger such subexpression of $M$ having the same property. Moreover, $M_0$ must occur as a generator. We need to show that $M_0$ therefore contains at least one free record variable bound earlier in the same comprehension. We can show this by inspection of normal forms. Since this is the case, then (if the generator is not already adjacent) we can commute it to be adjacent to $M_0$ and then apply one of the delateralization rules, decreasing $\| M_0 \|$ and hence $\| M \|$. $\square$

C Proofs for Section 6

Lemma 14. If $\Phi; \Theta \vdash M \Mapsto \breve{M} \mid \Psi$, then $\Psi \supseteq \Phi$.

Lemma 15.

Let $M$ be an $\mathcal{N}RC_{\mathcal{G}}$ term and $\Phi$ a shredding set. If $FV(M) \subseteq \mathrm{dom}(\Phi)$, then for all $\Phi' \supseteq \Phi$ we have $M\,\Phi = M\,\Phi'$. Furthermore, let $\Xi$ and $\Xi'$ be the shredding value sets corresponding to $\Phi$ and $\Phi'$: then $\llparenthesis \lfloor M \rfloor \rrparenthesis\,\Xi = \llparenthesis \lfloor M \rfloor \rrparenthesis\,\Xi'$.

In the following proof, whenever $\Theta = \overrightarrow{x \leftarrow F}$, we use the abbreviation:
\[ \llbracket \Theta \rrbracket \rho\,\vec{v} = \bigwedge_i \llbracket F_i \rrbracket \rho[x_1 \mapsto v_1, \ldots, x_{i-1} \mapsto v_{i-1}]\,v_i \]

Theorem 4.