Query Lifting: Language-integrated query for heterogeneous nested collections
Wilmer Ricciotti and James Cheney
Laboratory for Foundations of Computer Science, University of Edinburgh, Edinburgh, United Kingdom
[email protected]@inf.ed.ac.uk
The Alan Turing Institute, London, United Kingdom
Abstract.
Language-integrated query based on comprehension syntax is a powerful technique for safe database programming, and provides a basis for advanced techniques such as query shredding or query flattening that allow efficient programming with complex nested collections. However, the foundations of these techniques are lacking: although SQL, the most widely-used database query language, supports heterogeneous queries that mix set and multiset semantics, these important capabilities are not supported by known correctness results or implementations that assume homogeneous collections. In this paper we study language-integrated query for a heterogeneous query language NRC_λ(Set, Bag) that combines set and multiset constructs. We show how to normalize and translate queries to SQL, and develop a novel approach to querying heterogeneous nested collections, based on the insight that “local” query subexpressions that calculate nested subcollections can be “lifted” to the top level analogously to lambda-lifting for local function definitions.
Keywords: language-integrated query · nested relations · multisets

Since the rise of relational databases as important software components in the 1980s, it has been widely appreciated that database programming is hard [13]. Databases offer efficient access to flat tabular data using declarative SQL queries, a computational model very different from that of most general-purpose languages. To get the best performance from the database, programmers typically need to formulate important parts of their program’s logic as queries, thus effectively programming in two languages: their usual general-purpose language (e.g. Java, Python, Scala) and SQL, with the latter query code typically constructed as unchecked, dynamic strings. Programming in two languages is more than twice as difficult as programming in one language [35]. The result is a hybrid programming model where important parts of the program’s functionality are not statically checked and may lead to run-time failures, or worse, vulnerabilities such as SQL injection attacks. This undesirable state of affairs was recognized by Copeland and Maier [13] who coined the term impedance mismatch for it.

Though higher-level wrapper libraries and tools such as object-relational mappings (ORM) can help ameliorate the impedance mismatch, they often come at a price of performance and lack of transparency, as high-level operations on in-memory objects representing database data are not always mapped efficiently to queries [44]. An alternative approach, which has almost as long a history as the impedance mismatch problem itself, is to elevate queries in the host language from unchecked strings to a typed, domain-specific sublanguage, whose interactions with the rest of the program can be checked and which can be mapped to database queries safely while providing strong guarantees.
This approach is nowadays typically called language-integrated query following Microsoft’s successful LINQ extensions to .NET languages such as C# […] flat collections (i.e. tables of records without other collections nested inside field values) can always be translated to an equivalent query only using flat relations (i.e. can be expressed in an SQL-like language). Wong [54] subsequently generalized this result and gave a constructive proof, in which the translation from nested to flat queries is accomplished through a strongly normalizing rewriting system.

Wong’s work has informed a number of successful implementations, such as the influential Kleisli system [55] for biomedical data integration, and the Links programming language [12]. Although the implementation of LINQ in C# […] nested results [25,8,21,52], by translating such queries to a bounded number of flat queries. This technique, currently implemented in Links and DSH, has several benefits: for example, it can be used to implement provenance-tracking efficiently in queries [17,46]. Fowler et al. [19] showed that in some cases, Links’s support for nested query results decreased both the number of queries issued and the total query evaluation time by an order of magnitude or more compared to a Java database application.

Unfortunately, there is still a gap between the theory and practice of language-integrated query. Widely-used and practically important SQL features that mix set and multiset collections, such as duplicate elimination, are supported by some implementations, but without guarantees regarding correctness or reliability. So far, such results have only been proved for special cases [7,8], typically for homogeneous queries operating on one uniform collection type. For example, in Links, queries have multiset semantics and cannot use duplicate elimination or set-valued operations.
To the best of our knowledge the questions of how to correctly translate flat or nested heterogeneous queries to SQL are open problems.

In this paper, we solve both open problems. We study a heterogeneous query language NRC_λ(Set, Bag), which was introduced and studied in our recent work [42]. We have previously extended the key results on query normalization to NRC_λ(Set, Bag) [43], but unlike the homogeneous case, the resulting normal forms do not directly correspond to SQL. In this paper, we first show how flat NRC_λ(Set, Bag) queries can be translated to SQL, and we then develop a new approach for evaluating queries over nested heterogeneous collections.

The key (and, to us at least, surprising) insight is to recognize that these two subproblems are really just different facets of one problem. That is, when translating flat NRC_λ(Set, Bag) queries to SQL, the main obstacle is how to deal with query expressions that depend on local variables; when translating nested NRC_λ(Set, Bag) queries to equivalent flat ones, the main obstacle is also how to deal with query expressions that depend on local variables. We solve this problem by observing that such query subexpressions can be lifted, analogously to lambda-lifting of local function definitions in functional programming [30], by abstracting over their free variables. Differently to lambda-lifting, however, we lift such expressions by converting them to tabular functions, or graphs, which can be calculated using database query constructs.

The remainder of this paper presents our contributions as follows:
– In section 2 we review the most relevant prior work and present our approach at a high, and we hope accessible, level.
– In sections 3 and 4 we present the core languages NRC_λ(Set, Bag) and NRC_G which will be used in the rest of the paper.
– Section 5 presents our results on translation of flat NRC_λ(Set, Bag) queries to SQL, via NRC_G.
– Section 6 presents our results on translation of NRC_λ(Set, Bag) queries that construct nested results to a bounded number of flat NRC_G queries.
– Sections 7 and 8 discuss related work and conclude.
In this section we sketch our approach. We use Links syntax [12], which differs in superficial respects from the core calculus in the rest of the paper but is more readable. We rely without further comment on existing capabilities of language-integrated query in Links, which are described elsewhere [11,34,8]. Suppose, hypothetically, we are interested in certain presidential candidates and prescription drugs they may be taking (for example, to see whether drug interactions might explain erratic behavior such as rage tweeting, creeping authoritarianism, or creepiness more generally). In Links, an expression querying a small database of presidential candidates and their drug prescriptions can be written as follows:

    Q0 = for (c <- Cand, p <- Pres, d <- Drug)
         where (c.cid == p.cid && p.did == d.did)
         [(name=c.name, drug=d.drug)]

    Cand                 Pres                   Drug
    name  cid            cid  did  day          did  drug
    DJT   45             45   101  Mon          101  hydrochloroquine
    JRB   46             45   223  Tue          223  adderall
                         45   223  Thu          765  caffeine
                         46   765  Fri

    Q_F                                 Q1
    in        out                       name  drug
    (DJT,45)  hydrochloroquine          DJT   hydrochloroquine
    (DJT,45)  adderall                  DJT   adderall
    (JRB,46)  caffeine                  JRB   caffeine

Fig. 1. Input tables Cand, Pres, Drug, intermediate result of Q_F and result of Q1.
Some (totally fictitious and not legally actionable) example data is shown in Figure 1; note that the prescriptions table Pres is a multiset containing duplicate entries. Executing this query in Links results in the following SQL query:

    SELECT c.name, d.drug
    FROM Cand c, Pres p, Drug d
    WHERE c.cid = p.cid AND p.did = d.did
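The multiset behaviour of this SQL query can be checked directly. The following is a minimal sketch in Python using the standard sqlite3 module as a stand-in database (an assumption: Links itself targets other database backends). The table contents follow the example data; the drug identifiers 101, 223 and 765 in Drug are inferred from the prescriptions table, and the names `q0_bag` and `q0_distinct` are ours.

```python
import sqlite3
from collections import Counter

# In-memory stand-in for the example database (assumed schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Cand (name TEXT, cid INTEGER);
CREATE TABLE Pres (cid INTEGER, did INTEGER, day TEXT);
CREATE TABLE Drug (did INTEGER, drug TEXT);
INSERT INTO Cand VALUES ('DJT', 45), ('JRB', 46);
INSERT INTO Pres VALUES (45, 101, 'Mon'), (45, 223, 'Tue'),
                        (45, 223, 'Thu'), (46, 765, 'Fri');
INSERT INTO Drug VALUES (101, 'hydrochloroquine'), (223, 'adderall'),
                        (765, 'caffeine');
""")

Q0_SQL = """
SELECT c.name, d.drug
FROM Cand c, Pres p, Drug d
WHERE c.cid = p.cid AND p.did = d.did
"""

# Bag (multiset) view of the result: each row with its multiplicity.
q0_bag = Counter(conn.execute(Q0_SQL).fetchall())

# Set view: the same query with DISTINCT after SELECT.
q0_distinct = set(
    conn.execute(Q0_SQL.replace("SELECT", "SELECT DISTINCT", 1)).fetchall()
)
```

Here `q0_bag` records how often each (name, drug) pair occurs, while `q0_distinct` is the deduplicated result that a DISTINCT query would return.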
In Links, query results from the database are mapped back to list values non-deterministically, and the result of the above query Q0 will be a list containing two copies of the tuple (DJT, adderall) and one copy of each of the tuples (DJT, hydrochloroquine) and (JRB, caffeine). If we are just interested in which candidates take which drugs and not how many times each drug was taken, we want to remove these duplicates. This can be accomplished in a basic SQL query using the DISTINCT keyword after SELECT. Currently, in Links there is no way to generate queries involving DISTINCT, and this duplicate elimination can only be performed in-memory. While this is not hard to do when the duplicate elimination happens at the end of the query, it is not as clear how to handle deduplication operations correctly in arbitrary places inside queries. Furthermore, SQL has several other operations that can have either set or multiset semantics such as UNION and EXCEPT: how should they be handled?

To study this problem we introduced a core calculus NRC_λ(Set, Bag) [42] (reviewed in the next section) in which there are two collection types, sets and multisets (or bags); duplicate elimination maps a multiset to a set with the same elements, and promotion maps a set to the least multiset with the same elements.

We considered, but were not previously able to solve, two problems in the context of NRC_λ(Set, Bag) which are addressed in this paper. First, the fundamental results regarding normalization and translation to SQL have been studied only for homogeneous query languages with collections consisting of either sets, bags, or lists. We recently extended the normalization results to NRC_λ(Set, Bag) [43], but the resulting normal forms do not correspond directly to SQL queries if operations such as deduplication, promotion, or bag difference are present. Second, query expressions that construct nested collections cannot be translated directly to SQL and can be very expensive to execute in-memory using nested loops, leading to the N + 1 query problem (or query avalanche problem [26]) in which one query is performed for the outer loop and then another N queries are performed, one per iteration of the inner loop. Some techniques have been developed for translating nested queries to a fixed number of flat queries, but to date they either handle only homogeneous set or bag collections [53,8], or lack detailed correctness proofs [26,51].

Regarding the first problem, the closest work in this respect is by Libkin and Wong [33], who studied and related the expressiveness of comprehension-based homogeneous set and bag query languages but did not consider their heterogeneous combination or translation to SQL. The following query illustrates the fundamental obstacle:

    Q1 = for (c <- Cand)
         for (d <- dedup(for (p <- Pres, d <- Drug)
                         where (c.cid == p.cid && p.did == d.did)
                         [d.drug]))
         [(name=c.name, drug=d)]
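The N + 1 query problem mentioned above can be made concrete with a small experiment. The sketch below is an illustration under our own assumptions (Python with sqlite3 as a stand-in database, a hypothetical `run` helper that counts issued queries, and drug identifiers inferred from the example data); it builds a nested result, each candidate paired with the set of drugs prescribed to them, using a naive nested loop.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Cand (name TEXT, cid INTEGER);
CREATE TABLE Pres (cid INTEGER, did INTEGER, day TEXT);
CREATE TABLE Drug (did INTEGER, drug TEXT);
INSERT INTO Cand VALUES ('DJT', 45), ('JRB', 46);
INSERT INTO Pres VALUES (45, 101, 'Mon'), (45, 223, 'Tue'),
                        (45, 223, 'Thu'), (46, 765, 'Fri');
INSERT INTO Drug VALUES (101, 'hydrochloroquine'), (223, 'adderall'),
                        (765, 'caffeine');
""")

queries_issued = 0

def run(sql, params=()):
    """Hypothetical helper: execute a query and count it."""
    global queries_issued
    queries_issued += 1
    return conn.execute(sql, params).fetchall()

# One query for the outer loop ...
nested = []
for name, cid in run("SELECT name, cid FROM Cand"):
    # ... and one further query per outer row for the inner collection.
    rows = run(
        "SELECT DISTINCT d.drug FROM Pres p, Drug d "
        "WHERE p.cid = ? AND p.did = d.did",
        (cid,),
    )
    nested.append((name, frozenset(drug for (drug,) in rows)))
```

With N = 2 candidates this issues 1 + N = 3 queries; the number grows linearly with the size of the outer table, which is the behaviour that translating to a fixed number of flat queries is meant to avoid.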
This query is similar to Q0, but eliminates duplicates among the drugs for each candidate. The query contains a duplicate elimination operation (dedup) applied to another query subexpression that refers to c, which is introduced in an earlier generator. This is not directly supported in classic SQL: by default the subqueries in FROM clauses cannot refer to tuple variables introduced by earlier parts of the FROM clause. In fact, this query is expressible in SQL:1999 using the LATERAL keyword, which does allow such sideways information-passing:

    SELECT c.name, d.drug
    FROM Cand c, LATERAL (SELECT DISTINCT d.drug
                          FROM Pres p, Drug d
                          WHERE p.cid = c.cid AND p.did = d.did) d

(Without the LATERAL keyword, this query is not well-formed SQL.) However, such queries have only recently become widely supported, so are not available on legacy databases, and even when supported, are not typically optimized effectively; for example PostgreSQL will evaluate it as a nested loop, with quadratic complexity or worse.

Regarding the second problem, Van den Bussche [53] showed that any query returning nested set collections can be simulated by n flat queries, where n is the number of occurrences of the set collection type in the result. However, this translation has not been used as the basis for a practical system to our knowledge, and does not respect multiset semantics. Cheney et al. [8] provided an analogous shredding translation for nested multiset queries, but translated to a richer target language (including SQL:1999 features such as ROW_NUMBER) and did not handle operations such as multiset difference or duplicate elimination. Thus, neither approach handles the full expressiveness of a heterogeneous query language over bags and sets. The following query illustrates the fundamental obstacle:

    Q2 = for (x <- Cand)
         [(name=x.name, drugs=dedup(for (p <- Pres, d <- Drug)
                                    where (x.cid == p.cid && p.did == d.did)
                                    [d.drug]))]

Much like Q1, Q2 builds a multiset of pairs (name, drugs) but here drugs is a set of all of the drugs taken by candidate name. Such a query is, of course, not even syntactically expressible in SQL because it returns a nested collection; it is not expressible in previous work on nested query evaluation either, because the result is a multiset of records, one component of which is a set.

We will now illustrate how to translate Q1 to a plain SQL query (not using LATERAL), and how to translate Q2 to two flat queries such that the nested result can be constructed easily from their flat results. First, note that we can rewrite both queries as follows, introducing an abbreviation F(x) for a query subexpression parameterized by x:

    F(x) = for (p <- Pres, d <- Drug)
           where (x.cid == p.cid && p.did == d.did)
           [d.drug]
    Q1 = for (c <- Cand) for (d <- dedup(F(c))) [(name=c.name, drug=d)]
    Q2 = for (c <- Cand) [(name=c.name, drugs=dedup(F(c)))]
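To make the role of the abbreviation F(x) concrete, the rewritten queries can be modelled in Python. This is only a sketch under our own conventions: bags are represented by `collections.Counter`, sets by `frozenset`, records by tuples, and the drug identifiers are inferred from the example data.

```python
from collections import Counter

# Example data; records are tuples, tables are lists (bags).
Cand = [("DJT", 45), ("JRB", 46)]                      # (name, cid)
Pres = [(45, 101, "Mon"), (45, 223, "Tue"),
        (45, 223, "Thu"), (46, 765, "Fri")]            # (cid, did, day)
Drug = [(101, "hydrochloroquine"), (223, "adderall"),
        (765, "caffeine")]                             # (did, drug)

def F(x):
    """Bag of drugs prescribed to candidate x; depends on the local variable x."""
    return Counter(
        drug
        for (p_cid, p_did, _day) in Pres
        for (d_did, drug) in Drug
        if x[1] == p_cid and p_did == d_did
    )

def dedup(bag):
    """Duplicate elimination: the set of elements with non-zero multiplicity."""
    return frozenset(bag)

# Q1: flat bag of (name, drug) pairs, deduplicated per candidate.
Q1 = Counter((c[0], d) for c in Cand for d in dedup(F(c)))

# Q2: bag of (name, set of drugs) pairs, a nested heterogeneous result.
Q2 = Counter((c[0], dedup(F(c))) for c in Cand)
```

In this model F(c) for DJT contains adderall twice, whereas both Q1 and Q2 mention it only once per candidate.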
Next, observe that the set of all possible values for x appearing in some call to F(x) is finite, and can even be computed by a query. Therefore, we can write a closed query Q_F that builds a lookup table that calculates the graph of F (or at least, as much of it as is needed to evaluate the queries) as follows:

    Q_F = dedup(for (x <- Cand, y <- F(x)) [(in=x, out=y)])

Notice that the use of deduplication here is really essential to define Q_F correctly: if we did not deduplicate then there would be repeated tuples in Q_F, leading to incorrect results later. If we inline and simplify F(x) in the above query, we get the following:

    Q_F' = dedup(for (x <- Cand, y <- Pres, z <- Drug)
                 where (x.cid == y.cid && y.did == z.did)
                 [(in=x, out=z.drug)])

Finally we may replace the call to F(x) in Q1 with a lookup to Q_F', as follows:

    Q1' = for (c <- Cand, f <- Q_F') where (c == f.in)
          [(name=c.name, drug=f.out)]

This expression may now be translated directly to SQL, because the argument to dedup is now closed:

    SELECT c.name, f.drug
    FROM Cand c, (SELECT DISTINCT x.name, x.cid, z.drug
                  FROM Cand x, Pres y, Drug z
                  WHERE x.cid = y.cid AND y.did = z.did) f
    WHERE c.cid = f.cid AND c.name = f.name
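The lifting step can be checked on the example data. The following Python sketch (our own modelling, with bags as `Counter`, sets as `frozenset`, and drug identifiers inferred from the example) tabulates the graph of F as Q_F, evaluates the lifted query Q1', and also shows what goes wrong if the deduplication in Q_F is omitted.

```python
from collections import Counter

Cand = [("DJT", 45), ("JRB", 46)]                      # (name, cid)
Pres = [(45, 101, "Mon"), (45, 223, "Tue"),
        (45, 223, "Thu"), (46, 765, "Fri")]            # (cid, did, day)
Drug = [(101, "hydrochloroquine"), (223, "adderall"),
        (765, "caffeine")]                             # (did, drug)

def F(x):
    """Bag of drugs prescribed to candidate x."""
    return Counter(drug for (cid, did, _) in Pres for (d, drug) in Drug
                   if x[1] == cid and did == d)

# Original query with a correlated, deduplicated subquery.
Q1 = Counter((c[0], d) for c in Cand for d in frozenset(F(c)))

# Closed query tabulating the graph of F: a set of (in, out) pairs.
Q_F = frozenset((x, y) for x in Cand for y in F(x).elements())

# Lifted query: a join against the lookup table instead of calling F(c).
Q1_lifted = Counter((c[0], out) for c in Cand for (inp, out) in Q_F
                    if c == inp)

# Same construction without deduplicating the lookup table.
Q_F_bag = Counter((x, y) for x in Cand for y in F(x).elements())
Q1_wrong = Counter((c[0], out) for c in Cand
                   for (inp, out) in Q_F_bag.elements() if c == inp)
```

`Q1_lifted` agrees with `Q1`, while `Q1_wrong` reports (DJT, adderall) twice, which illustrates why deduplicating Q_F matters.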
Although this query looks a bit more complex than the one given earlier using LATERAL, it can be optimized more effectively; for example, PostgreSQL generates a query plan that uses a hash join, giving quasi-linear complexity.

On the other hand, to deal with Q2, we refactor it into two closed, flat queries Q_21, Q_22 and an expression Q2' that builds the nested result from their flat results (illustrated in Figure 2):

    Q_21 = for (x <- Cand) [(name=x.name, drugs=x)]
    Q_22 = Q_F
    Q2'  = for (x <- Q_21)
           [(name=x.name, drugs=for (y <- Q_22) where (x.drugs == y.in) [y.out])]

    Q_21                  Q_22                          Q2
    name  drugs           in        out                 name  drugs
    DJT   (DJT,45)        (DJT,45)  hydrochloroquine    DJT   {hydrochloroquine, adderall}
    JRB   (JRB,46)        (DJT,45)  adderall            JRB   {caffeine}
                          (JRB,46)  caffeine

Fig. 2. Intermediate results of Q_21, Q_22 and result of Q2.
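The shredding of Q2 into two flat queries and the final stitching step can be sketched in Python as follows. The modelling choices are ours (lists for bags, `frozenset` for sets, tuples for records, drug identifiers inferred from the example data), and `stitch_indexed` is a hypothetical, slightly less naive alternative to the nested loop used by Q2'.

```python
from collections import Counter, defaultdict

Cand = [("DJT", 45), ("JRB", 46)]                      # (name, cid)
Pres = [(45, 101, "Mon"), (45, 223, "Tue"),
        (45, 223, "Thu"), (46, 765, "Fri")]            # (cid, did, day)
Drug = [(101, "hydrochloroquine"), (223, "adderall"),
        (765, "caffeine")]                             # (did, drug)

def F(x):
    return Counter(drug for (cid, did, _) in Pres for (d, drug) in Drug
                   if x[1] == cid and did == d)

# Reference result, computed with a correlated inner query.
Q2 = Counter((c[0], frozenset(F(c))) for c in Cand)

# Flat query 1: the call to F is replaced by its argument, used as an index.
Q_21 = [(x[0], x) for x in Cand]                       # (name, drugs-index)

# Flat query 2: the graph of F, as a set of (in, out) pairs.
Q_22 = frozenset((x, y) for x in Cand for y in F(x).elements())

# Naive stitching: a nested loop over the two flat results.
Q2_stitched = Counter(
    (name, frozenset(out for (inp, out) in Q_22 if inp == idx))
    for (name, idx) in Q_21
)

def stitch_indexed(outer, inner):
    """Stitch using a dictionary built in one pass over the inner result."""
    index = defaultdict(set)
    for inp, out in inner:
        index[inp].add(out)
    return Counter((name, frozenset(index[idx])) for (name, idx) in outer)
```

Both stitching strategies reconstruct the same nested bag as the direct evaluation of Q2; only Q_21 and Q_22 would need to be sent to the database.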
Notice that in Q_21 we replaced the call to F with the argument x, while Q_22 is just Q_F again. The final expression Q2' builds the nested result (in the host language’s memory) by traversing Q_21 and computing the set value of each drugs field by looking up the appropriate values from Q_22. Thus, the original query result can be computed by first evaluating Q_21 and Q_22 on the database, and then evaluating the final stitching query expression in-memory. (In practice, as discussed in Cheney et al. [8], it is important for performance to use a more sophisticated stitching algorithm than the above naive nested loop, but in this paper we are primarily concerned with the correctness of the transformation.)

The above examples are a bit simplistic, but illustrate the key idea of query lifting. In the rest of this paper we place this approach on a solid foundation, and (partially inspired by Gibbons et al. [20]), to help clarify the reasoning we extend the calculus with a type of tabulated functions or graphs $\vec{\sigma} \rightharpoonup \{\tau\}$, with graph abstraction introduction form $\mathcal{G}(-;-)$ and graph application $M \twoheadrightarrow \langle \vec{x} \rangle$. In our running example we could define $Q_F = \mathcal{G}(x \leftarrow R; F(x))$, and we would use the application operation $M \twoheadrightarrow \langle \vec{x} \rangle$ to extract the set of elements corresponding to x in Q_F. We will also consider tabular functions that return multisets rather than sets, in order to deal with queries that return nested multisets.

We recap the main points from [42], which introduced a calculus NRC_λ(Set, Bag) with the following syntax:

Types
$\sigma, \tau ::= b \mid \langle \overrightarrow{\ell : \sigma} \rangle \mid \{\sigma\} \mid \Lbag \sigma \Rbag \mid \sigma \to \tau$

Terms
$M, N ::= x \mid t \mid c(\vec{M}) \mid \langle \overrightarrow{\ell = M} \rangle \mid M.\ell \mid \lambda x.M \mid M\,N$
$\quad \mid \emptyset \mid \{M\} \mid M \cup N \mid \bigcup \{M \mid \Theta\}$
$\quad \mid \mho \mid \Lbag M \Rbag \mid M \uplus N \mid M - N \mid \biguplus \Lbag M \mid \Theta \Rbag$
$\quad \mid \delta M \mid \iota M \mid M\ \mathsf{where}_{\mathsf{set}}\ N \mid M\ \mathsf{where}_{\mathsf{bag}}\ N \mid \mathsf{empty}_{\mathsf{set}}(M) \mid \mathsf{empty}_{\mathsf{bag}}(M)$

Generators
$\Theta ::= \overrightarrow{x \leftarrow M}$

We distinguish between (local) variables $x$ and (global) table names $t$, and assume standard primitive types $b$ and primitive operations $c(\vec{M})$ including respectively Booleans $\mathbb{B}$ and equality at every base type. The syntax for records and record projection $\langle \overrightarrow{\ell = M} \rangle$, $M.\ell$, and for lambda-abstraction and application $\lambda x.M$, $M\,N$ is standard; as usual, let-binding is definable. Set operations include empty set $\emptyset$, singleton construction $\{M\}$, union $M \cup N$, one-armed conditional $M\ \mathsf{where}_{\mathsf{set}}\ N$, emptiness test $\mathsf{empty}_{\mathsf{set}}(M)$, and comprehension $\bigcup \{M \mid \Theta\}$, where $\Theta$ is a sequence of generators $x \leftarrow M$. Similarly, multiset operations include empty bag $\mho$, singleton $\Lbag M \Rbag$, bag union $M \uplus N$, bag difference $M - N$, conditional $M\ \mathsf{where}_{\mathsf{bag}}\ N$, emptiness test $\mathsf{empty}_{\mathsf{bag}}(M)$, and comprehension $\biguplus \Lbag M \mid \Theta \Rbag$. The syntax is completed by duplicate elimination $\delta M$ (converting a bag $M$ into a set with the same object type) and promotion $\iota M$ (which produces the bag containing all the elements of the set $M$, with multiplicity 1).

The one-way conditional operations $M\ \mathsf{where}_{\mathsf{set}}\ N$ and $M\ \mathsf{where}_{\mathsf{bag}}\ N$ evaluate Boolean test $N$, and return collection $M$ if $N$ is true, otherwise the empty set/bag; two-way conditionals can be supported without problems. Other set operations, such as intersection, membership, subset, and equality are also definable, as are bag operations such as intersection [4,33]. Also, we may define $\mathsf{empty}_{\mathsf{bag}}(M)$ as $\mathsf{empty}_{\mathsf{set}}(\delta(M))$ and $M\ \mathsf{where}_{\mathsf{set}}\ N$ as $\delta(\iota(M)\ \mathsf{where}_{\mathsf{bag}}\ N)$, but we prefer to include these constructs as primitives for symmetry.
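The set and bag operations above, and the two definability claims at the end of the paragraph, can be illustrated with a small Python model. This is a sketch under our own assumptions, not the calculus itself: sets are `frozenset` values, bags are `collections.Counter` values, and the function names are ours.

```python
from collections import Counter

def delta(bag):
    """Duplicate elimination: bag -> set of elements with positive count."""
    return frozenset(k for k, n in bag.items() if n > 0)

def iota(s):
    """Promotion: set -> bag in which every element has multiplicity 1."""
    return Counter({k: 1 for k in s})

def bag_union(m, n):
    """Additive bag union: multiplicities are summed."""
    return m + n

def bag_diff(m, n):
    """Bag difference: multiplicities are subtracted, never below zero."""
    return m - n

def where_set(m, cond):
    """One-armed conditional on sets."""
    return m if cond else frozenset()

def where_bag(m, cond):
    """One-armed conditional on bags."""
    return m if cond else Counter()

def empty_set(s):
    return len(s) == 0

def empty_bag(b):
    return sum(b.values()) == 0

# Derived forms, following the definitions suggested in the text.
def empty_bag_derived(b):
    return empty_set(delta(b))

def where_set_derived(s, cond):
    return delta(where_bag(iota(s), cond))
```

For example, `bag_diff(Counter("aab"), Counter("ab"))` leaves a single `"a"`, and the derived forms agree with the primitive ones on every input we try in the accompanying checks.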
Generally, we will allow ourselves to write $M\ \mathsf{where}\ N$ and $\mathsf{empty}(M)$ without subscripts if the collection kind of these operations is irrelevant or made clear by the context. We freely use syntax for unlabeled tuples $\langle \vec{M} \rangle$, $M.i$ and tuple types $\vec{\sigma}$ and consider them to be syntactic sugar for labeled records.

The typing rules for the calculus are standard and provided in an appendix. For the purposes of this discussion, we will highlight two features of the type system. The first is that the calculus used here differs from our previous work by using constants and table names, whose types are described by a fixed signature $\Sigma$:

$$\frac{\Sigma(c) = \vec{b} \to b \qquad (\Gamma \vdash M_i : b_i)_{i=1,\ldots,n}}{\Gamma \vdash c(\vec{M}) : b} \qquad\qquad \frac{\Sigma(t) = \overrightarrow{\ell : b}}{\Gamma \vdash t : \Lbag \langle \overrightarrow{\ell : b} \rangle \Rbag}$$

As usual, a typing judgment $\Gamma \vdash M : \sigma$ states that a term $M$ is well-typed of type $\sigma$, assuming that its free variables have the types declared in the typing context $\Gamma = x_1 : \sigma_1, \ldots, x_k : \sigma_k$. For the two rules above, note in particular that the primitive functions $c$ can only take inputs of base type and produce results at base type, and table constants $t$ are always multisets of records where the fields are of base type. We refer to a type of the form $\langle \overrightarrow{\ell : b} \rangle$ as flat; if $\sigma$ is flat, we refer to $\{\sigma\}$ and $\Lbag \sigma \Rbag$ as flat collection types.

The second is that our type system uses an approach à la Church, meaning that variable abstractions (in lambdas/comprehensions), empty sets and empty bags are annotated with their type in order to ensure the uniqueness of typing.

Lemma 1. In NRC_λ(Set, Bag), if $\Gamma \vdash M : \sigma$ and $\Gamma \vdash M : \tau$, then $\sigma = \tau$.

In the context of a larger language implementation, most of these type annotations can be elided and inferred by type inference. We have chosen to dispense with these details in the main body of this paper to avoid unnecessary syntactic cluttering.

We will use a largely standard denotational semantics for NRC_λ(Set, Bag), in which sets and multisets are modeled as finitely-supported functions from their element types to Boolean values {0, 1} or natural numbers respectively. This approach follows the so-called K-relation semantics for queries [23,18] as used for example in the HoTTSQL formalization [10]. The full typing rules and semantics are included in the appendix.

NRC_λ(Set, Bag) subsumes previous systems including NRC [4,54], BQL [33] and NRC_λ [11,8]. In this paper, we restrict our attention to queries in which collection types taking part in $\delta$, $\iota$ or bag difference contain only flat records. There are various reasons for excluding function types from these operators: for starters, any concrete implementation that used function types in these positions would need to decide the equality of functions; secondly, our rewrite system can ensure that a term whose type does not contain function types has a normal form without lambda abstractions and applications only if any $\delta$, $\iota$, or bag difference used in that term are applied to first-order collections. We thus want to exclude terms such as:

$$\biguplus \Lbag x\ \Lbag\,\Rbag\ \Lbag\,\Rbag \mid x \leftarrow \iota(\{\lambda yz.y\} \cup \{\lambda yz.z\}) \Rbag$$

which do not have an SQL representation despite having a flat collection type.

In order to obtain simpler normal forms, in which comprehensions only reference generators with a flat collection type, we also disallow nested collections within $\delta$, $\iota$, and bag difference. We believe this is without loss of generality because of Libkin and Wong’s results showing that allowing such operations at nested types does not add expressiveness to BQL.

We have extended Wong’s normalizing rewrite rule system, so as to simplify queries to a form that is close to SQL, with no intermediate nested collections. Since our calculus is more liberal than Wong’s, allowing queries to be defined by mixing sets and bags and also using bag difference, we have added non-standard rules to take care of unwanted situations.
In particular, we use the following constrained eta-expansions for comprehensions:

$$\bigcup \{\delta(M - N) \mid \Theta\} \rightsquigarrow \bigcup \{\{z\} \mid \Theta, z \leftarrow \delta(M - N)\}$$
$$\biguplus \Lbag \iota M \mid \Theta \Rbag \rightsquigarrow \biguplus \Lbag \Lbag z \Rbag \mid \Theta, z \leftarrow \iota M \Rbag$$
$$\biguplus \Lbag M - N \mid \Theta \Rbag \rightsquigarrow \biguplus \Lbag \Lbag z \Rbag \mid \Theta, z \leftarrow M - N \Rbag$$

The rationale of these rules is that in order to achieve, for comprehensions, a form that can be easily translated to an SQL select query, we need to move all the syntactic forms that are blocking to most normalization rules (i.e. promotion and bag difference) from the head of the comprehension to a generator. In order for this strategy to work out, we also need to know that the type of these subexpressions is flat, as we previously mentioned.

In Figure 3 we show the grammar for the normal forms for terms of nested relational types, i.e. types of the following form:

$$\sigma ::= b \mid \langle \overrightarrow{\ell : \sigma} \rangle \mid \{\sigma\} \mid \Lbag \sigma \Rbag$$

General normal forms   $M ::= X \mid \langle \overrightarrow{\ell = M} \rangle \mid Q \mid R$
Base type terms        $X ::= x.\ell \mid c(\vec{X}) \mid \mathsf{empty}_{\mathsf{set}}(Q^*) \mid \mathsf{empty}_{\mathsf{bag}}(R^*)$
Set normal forms       $Q ::= \bigcup \vec{C}$
                       $C ::= \bigcup \{\{M\}\ \mathsf{where}_{\mathsf{set}}\ X \mid \overrightarrow{x \leftarrow F}\}$
                       $F ::= \delta t \mid \delta(R^* - R^*)$
Bag normal forms       $R ::= \biguplus \vec{D}$
                       $D ::= \biguplus \Lbag \Lbag M \Rbag\ \mathsf{where}_{\mathsf{bag}}\ X \mid \overrightarrow{x \leftarrow G} \Rbag$
                       $G ::= t \mid \iota Q^* \mid R^* - R^*$

Fig. 3. Nested relational normal forms.

For ease of presentation, the grammar actually describes a “standardized” version of the normal forms in which:
– $\emptyset$ is represented as the trivial union $\bigcup \vec{C}$ where $\vec{C}$ is the empty sequence; $\mho$ has a similar representation using a trivial disjoint union;
– comprehensions without a guard are considered to be the same as those with a trivial true guard: $\bigcup \{\{M\} \mid \Theta\} = \bigcup \{\{M\}\ \mathsf{where}\ \mathsf{true} \mid \Theta\}$;
– singletons that do not appear as the head of a comprehension are represented as trivial comprehensions: $\{M\} = \bigcup \{\{M\} \mid\ \}$.

Each normal form $M$ can be either a term of base type $X$, a tuple $\langle \overrightarrow{\ell = M} \rangle$, a set $Q$, or a bag $R$. The normal forms of sets and bags are rather similar, both being defined as unions of comprehensions with a singleton head.
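The constrained eta-expansions for comprehensions introduced above are intended to preserve meaning: the blocking form moves from the head into a fresh generator, and the head becomes a singleton. A quick sanity check on sample data can be written in Python. The model below is our own sketch (bags as `Counter`, sets as `frozenset`, a single generator standing in for Θ, and arbitrary sample functions `S`, `A`, `B` in place of the subterms M and N).

```python
from collections import Counter

def delta(bag):
    return frozenset(k for k, n in bag.items() if n > 0)

def iota(s):
    return Counter({k: 1 for k in s})

def bag_comp(head, gen):
    """Bag comprehension: additive union of head(x) for each x in bag gen."""
    out = Counter()
    for x in gen.elements():
        out += head(x)
    return out

def set_comp(head, gen):
    """Set comprehension: union of head(x) for each x in set gen."""
    out = frozenset()
    for x in gen:
        out |= head(x)
    return out

# Sample generator and sample subterms depending on the bound variable x.
T = Counter({1: 2, 2: 1})                     # bag generator
S = lambda x: frozenset({x, x + 1})           # a set depending on x
A = lambda x: Counter({x: 2, 0: 1})           # bags depending on x
B = lambda x: Counter({x: 1})

# Rule for promotion in the head of a bag comprehension.
lhs_iota = bag_comp(lambda x: iota(S(x)), T)
rhs_iota = bag_comp(
    lambda x: bag_comp(lambda z: Counter([z]), iota(S(x))), T)

# Rule for bag difference in the head of a bag comprehension.
lhs_diff = bag_comp(lambda x: A(x) - B(x), T)
rhs_diff = bag_comp(
    lambda x: bag_comp(lambda z: Counter([z]), A(x) - B(x)), T)

# Rule for deduplicated bag difference in the head of a set comprehension.
TS = frozenset(T)
lhs_set = set_comp(lambda x: delta(A(x) - B(x)), TS)
rhs_set = set_comp(
    lambda x: set_comp(lambda z: frozenset({z}), delta(A(x) - B(x))), TS)
```

On this data each left-hand side coincides with the corresponding right-hand side; this is only a spot check on one input, not a proof of the rules.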
The generators for set comprehensions $F$ include deduplicated tables and deduplicated bag differences; the generators for bag comprehensions $G$ must be either tables, promoted set queries, or bag differences.

The non-terminals used as the arguments of emptiness tests, promotion, and bag difference have been marked with a star to emphasize the fact that they must have a flat collection type. The corresponding grammar can be obtained from the grammar for nested normal forms by replacing the rule for $M$ with the following:

$$M^* ::= \langle \overrightarrow{\ell = X} \rangle$$

$$\begin{array}{rcl}
(\emptyset)_{sql} &=& \texttt{SELECT […] WHERE […]} \\
(\mho)_{sql} &=& \texttt{SELECT […] WHERE […]} \\
(x.\ell)_{sql} &=& x.\ell \\
(c(\vec{X}))_{sql} &=& (c)_{sql}(\overrightarrow{(X)_{sql}}) \\
(\langle \overrightarrow{\ell = X} \rangle)_{sql} &=& (X_1)_{sql}\ \texttt{AS}\ \ell_1, \ldots, (X_n)_{sql}\ \texttt{AS}\ \ell_n \\
(\mathsf{empty}_{\mathsf{set}}(Q^*))_{sql} &=& \texttt{NOT EXISTS}\ (Q^*)_{sql} \\
(\mathsf{empty}_{\mathsf{bag}}(R^*))_{sql} &=& \texttt{NOT EXISTS}\ (R^*)_{sql} \\
(Q^*_1 \cup Q^*_2)_{sql} &=& (Q^*_1)_{sql}\ \texttt{UNION}\ (Q^*_2)_{sql} \\
(R^*_1 \uplus R^*_2)_{sql} &=& (R^*_1)_{sql}\ \texttt{UNION ALL}\ (R^*_2)_{sql} \\
(t)_{sql} &=& \texttt{SELECT * FROM}\ t \\
(R^*_1 - R^*_2)_{sql} &=& (R^*_1)_{sql}\ \texttt{EXCEPT ALL}\ (R^*_2)_{sql} \\
(\delta t)_{sql} &=& \texttt{SELECT DISTINCT * FROM}\ t \\
(\iota(Q^*))_{sql} &=& (Q^*)_{sql} \\
(\delta(R^*_1 - R^*_2))_{sql} &=& \texttt{SELECT DISTINCT * FROM}\ ((R^*_1)_{sql}\ \texttt{EXCEPT ALL}\ (R^*_2)_{sql})\ r \\
(x \leftarrow F)_{sql} &=& \begin{cases} ((F)_{sql})\ x & (F \text{ closed}) \\ \texttt{LATERAL}\ ((F)_{sql})\ x & (\text{otherwise}) \end{cases} \\
(x \leftarrow G)_{sql} &=& \begin{cases} ((G)_{sql})\ x & (G \text{ closed}) \\ \texttt{LATERAL}\ ((G)_{sql})\ x & (\text{otherwise}) \end{cases} \\
(\bigcup \{\{M^*\}\ \mathsf{where}_{\mathsf{set}}\ X \mid \overrightarrow{x \leftarrow F}\})_{sql} &=& \texttt{SELECT DISTINCT}\ (M^*)_{sql}\ \texttt{FROM}\ \overrightarrow{(x \leftarrow F)_{sql}}\ \texttt{WHERE}\ (X)_{sql} \\
(\biguplus \Lbag \Lbag M^* \Rbag\ \mathsf{where}_{\mathsf{bag}}\ X \mid \overrightarrow{x \leftarrow G} \Rbag)_{sql} &=& \texttt{SELECT}\ (M^*)_{sql}\ \texttt{FROM}\ \overrightarrow{(x \leftarrow G)_{sql}}\ \texttt{WHERE}\ (X)_{sql}
\end{array}$$

Fig. 4. Translation to SQL

Normalized queries can be translated to SQL as shown in Figure 4 as long as they have a flat collection type. The translation uses SELECT DISTINCT and UNION where a set semantics is needed, and SELECT, UNION ALL and EXCEPT ALL in the case of bag semantics. Note that promotion expressions $\iota Q^*$ are translated simply by translating $Q^*$, because in SQL there is no type distinction between set and multiset queries: all query results are multisets, and sets are considered to be multisets having no duplicates.

The other main complication in this translation is in handling generators $x \leftarrow F$, $x \leftarrow G$ where $F$ or $G$ may be a non-closed expression $\iota(Q^*)$, $R^* - R^*$, or $\delta(R^* - R^*)$ containing references to other locally-bound variables. To deal with the resulting lateral variable references, we add the LATERAL keyword to such queries. As explained earlier, the use of LATERAL can be problematic and we will return to this issue in Section 5.
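The shape of this translation for comprehensions can be sketched as a small string-building function in Python. This is an illustrative fragment under our own assumptions, not the translation used by any implementation: heads, guards and generator sources are passed as already-translated SQL text, and the caller states whether each generator is closed. The names `comprehension_sql` and `union_sql` are hypothetical.

```python
def comprehension_sql(fields, generators, guard="TRUE", as_set=False):
    """Build the SQL text for one normal-form comprehension.

    fields:     list of (sql_expression, label) pairs for the record head
    generators: list of (variable, source_sql, closed) triples
    guard:      SQL text of the where-condition
    as_set:     True for a set comprehension (SELECT DISTINCT), False for a bag
    """
    select = "SELECT DISTINCT" if as_set else "SELECT"
    columns = ", ".join(f"{expr} AS {label}" for expr, label in fields)
    sources = ", ".join(
        # Non-closed generators refer to earlier variables and need LATERAL.
        f"({src}) {var}" if closed else f"LATERAL ({src}) {var}"
        for var, src, closed in generators
    )
    return f"{select} {columns} FROM {sources} WHERE {guard}"

def union_sql(parts, as_set=False):
    """Combine translated comprehensions with UNION (sets) or UNION ALL (bags)."""
    return (" UNION " if as_set else " UNION ALL ").join(parts)

# A bag comprehension whose second generator depends on the variable c.
example = comprehension_sql(
    fields=[("c.name", "name"), ("d.drug", "drug")],
    generators=[
        ("c", "SELECT * FROM Cand", True),
        ("d", "SELECT DISTINCT d.drug FROM Pres p, Drug d "
              "WHERE p.cid = c.cid AND p.did = d.did", False),
    ],
)
```

The generated text for `example` places LATERAL only in front of the second source, since it is the only generator marked as not closed.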
Remark 1 (Record flattening). The above translations handle queries that take flat tables as input and produce flat results (collections of flat records $\langle \overrightarrow{\ell : b} \rangle$). It is straightforward to support queries that return nested records (i.e. records containing other records, but not collections). For example, a query $M : \Lbag \langle b, \langle b, b \rangle \rangle \Rbag$ can be handled by defining both directions of the obvious isomorphism $N : \Lbag \langle b, \langle b, b \rangle \rangle \Rbag \cong \Lbag \langle b, b, b \rangle \Rbag : N^{-1}$, normalizing the flat query $N \circ M$, evaluating the corresponding SQL, and applying the inverse $N^{-1}$ to the results. Such record flattening is described in detail by Cheney et al. [9] and is implemented in Links, so we will use it from now on without further discussion.

$$\frac{(\Gamma, \overrightarrow{x_{i-1} : \sigma_{i-1}} \vdash L_i : \{\sigma_i\})_{i=1,\ldots,n} \qquad \Gamma, \overrightarrow{x : \sigma} \vdash M : \{\tau\}}{\Gamma \vdash \mathcal{G}_{\mathsf{set}}(\overrightarrow{x \leftarrow L}; M) : \vec{\sigma} \rightharpoonup \{\tau\}} \qquad \frac{(\Gamma, \overrightarrow{x_{i-1} : \sigma_{i-1}} \vdash L_i : \{\sigma_i\})_{i=1,\ldots,n} \qquad \Gamma, \overrightarrow{x : \sigma} \vdash M : \Lbag \tau \Rbag}{\Gamma \vdash \mathcal{G}_{\mathsf{bag}}(\overrightarrow{x \leftarrow L}; M) : \vec{\sigma} \rightharpoonup \Lbag \tau \Rbag}$$

$$\frac{\Gamma \vdash M : \vec{\sigma} \rightharpoonup \tau \qquad (\Gamma \vdash N_i : \sigma_i)_i}{\Gamma \vdash M \twoheadrightarrow (\vec{N}) : \tau} \qquad \frac{\Gamma \vdash M : \vec{\sigma} \rightharpoonup \Lbag \tau \Rbag \qquad \Gamma \vdash N : \vec{\sigma} \rightharpoonup \Lbag \tau \Rbag}{\Gamma \vdash M - N : \vec{\sigma} \rightharpoonup \Lbag \tau \Rbag}$$

$$\frac{\Gamma \vdash M : \vec{\sigma} \rightharpoonup \{\tau\} \qquad \Gamma \vdash N : \vec{\sigma} \rightharpoonup \{\tau\}}{\Gamma \vdash M \cup N : \vec{\sigma} \rightharpoonup \{\tau\}} \qquad \frac{\Gamma \vdash M : \vec{\sigma} \rightharpoonup \Lbag \tau \Rbag \qquad \Gamma \vdash N : \vec{\sigma} \rightharpoonup \Lbag \tau \Rbag}{\Gamma \vdash M \uplus N : \vec{\sigma} \rightharpoonup \Lbag \tau \Rbag}$$

$$\frac{\Gamma \vdash M : \vec{\sigma} \rightharpoonup \Lbag \tau \Rbag}{\Gamma \vdash \delta M : \vec{\sigma} \rightharpoonup \{\tau\}} \qquad \frac{\Gamma \vdash M : \vec{\sigma} \rightharpoonup \{\tau\}}{\Gamma \vdash \iota M : \vec{\sigma} \rightharpoonup \Lbag \tau \Rbag}$$

Fig. 5. NRC_G additional typing rules.

We now introduce NRC_G, an extension of the calculus NRC_λ(Set, Bag) providing a new type of finite tabular function graphs (in the remainder of this paper, also called simply “graphs”; they are similar to the finite maps and tables of Gibbons et al. [20]). The syntax of NRC_G is defined as follows:

Types   $\sigma, \tau ::= \cdots \mid \vec{\sigma} \rightharpoonup \tau$
Terms   $M, N ::= \cdots \mid \mathcal{G}_{\mathsf{set}}(\Theta; N) \mid \mathcal{G}_{\mathsf{bag}}(\Theta; N) \mid M \twoheadrightarrow (\vec{N})$

Semantically, the type of graphs $\vec{\sigma} \rightharpoonup \tau$ will be interpreted as the set of finite functions from sequences of values of type $\vec{\sigma}$ to values in $\tau$: such functions can return non-trivial values only for a finite subset of their input type. In our setting, we will require the output type of graphs to be a collection type (i.e. $\tau$ shall be either $\{\tau'\}$ or $\Lbag \tau' \Rbag$ for some $\tau'$), and we will use $\emptyset$ or $\mho$ as the trivial value. The typing rules involving graphs are shown in Figure 5.

Graphs are created using the graph abstraction operations $\mathcal{G}_{\mathsf{set}}(\Theta; N)$ and $\mathcal{G}_{\mathsf{bag}}(\Theta; N)$, where $\Theta$ is a sequence of generators in the form $\overrightarrow{x \leftarrow M}$; the dual operation of graph application is denoted by $M \twoheadrightarrow (\vec{N})$. An expression of the form $\mathcal{G}_{\mathsf{set}}(\overrightarrow{x \leftarrow M}; N)$ is used to construct a (finite) tabular function mapping each sequence of values $R_1, \ldots, R_n$ in the sets $M_1, \ldots, M_n$ to the set $N[\vec{R}/\vec{x}]$. If each $M_i$ has type $\{\sigma_i\}$ and $N$ has type $\{\tau\}$, then the graph has type $\vec{\sigma} \rightharpoonup \{\tau\}$. Similarly, if $N$ has type $\Lbag \tau \Rbag$, $\mathcal{G}_{\mathsf{bag}}(\overrightarrow{x \leftarrow M}; N)$ has type $\vec{\sigma} \rightharpoonup \Lbag \tau \Rbag$. The terms $M_1, \ldots, M_n$ constitute the (finite) domain of this graph. When the kind of graph abstraction (set-based or bag-based) is clear from the context or unimportant, we will allow ourselves to write $\mathcal{G}(-;-)$ instead of $\mathcal{G}_{\mathsf{set}}(-;-)$ or $\mathcal{G}_{\mathsf{bag}}(-;-)$.

A graph $G$ of type $\vec{\sigma} \rightharpoonup \tau$ can be applied to a sequence of terms $N_1, \ldots, N_n$ of type $\sigma_1, \ldots, \sigma_n$ to obtain a term of type $\tau$. If $G = \mathcal{G}(\overrightarrow{x \leftarrow L}; M)$, then we will want the semantics of $\mathcal{G}(\overrightarrow{x \leftarrow L}; M) \twoheadrightarrow (\vec{N})$ to be the same as that of $M[\vec{N}/\vec{x}]$, provided that each of the $N_i$ is in the corresponding element of the domain of the graph.
The typing rule does not enforce this requirement and if any of the $N_i$ is not an element of $L_i$, the graph application will evaluate to an empty set or bag (depending on $\tau$).

Graphs can also be merged by union, using $\cup$ or $\uplus$ depending on their output collection kind. Furthermore, graphs that return bags can be subtracted from one another using bag difference; the deduplication and promotion operations also extend to graphs in the obvious way.

Lemma 2. In $\mathcal{NRC}_G$, if $\Gamma \vdash M : \sigma$ and $\Gamma \vdash M : \tau$, then $\sigma = \tau$.

Whenever $M$ is well typed and its typing environment is made clear by the context, we will allow ourselves to write $ty(M)$ for the type of $M$. Furthermore, given a sequence of generators $\Theta = x_1 \leftarrow L_1, \ldots, x_n \leftarrow L_n$, such that for $i = 1, \ldots, n$ we have $x_1 : \sigma_1, \ldots, x_{i-1} : \sigma_{i-1} \vdash L_i : \{\sigma_i\}$, we will write $ty(\Theta)$ to denote the associated typing context:

\[ ty(\Theta) := x_1 : \sigma_1, \ldots, x_n : \sigma_n \]

The semantics of $\mathcal{NRC}_\lambda(\mathit{Set},\mathit{Bag})$ is extended to $\mathcal{NRC}_G$ as follows:

\[
\begin{array}{l}
\llbracket \mathcal{G}_{\mathit{set}}(\overrightarrow{x \leftarrow L}; M) \rrbracket \rho\, (\vec{u}, v)
  = \Big( \bigwedge_i \llbracket L_i \rrbracket \rho[x_1 \mapsto u_1, \ldots, x_{i-1} \mapsto u_{i-1}]\, u_i \Big) \wedge \llbracket M \rrbracket \rho[\overrightarrow{x \mapsto u}]\, v \\[1ex]
\llbracket \mathcal{G}_{\mathit{bag}}(\overrightarrow{x \leftarrow L}; M) \rrbracket \rho\, (\vec{u}, v)
  = \Big( \bigwedge_i \llbracket L_i \rrbracket \rho[x_1 \mapsto u_1, \ldots, x_{i-1} \mapsto u_{i-1}]\, u_i \Big) \times \llbracket M \rrbracket \rho[\overrightarrow{x \mapsto u}]\, v \\[1ex]
\llbracket M \circledast (\vec{N}) \rrbracket \rho\, v
  = \llbracket M \rrbracket \rho\, (\overrightarrow{\llbracket N \rrbracket \rho}, v)
\end{array}
\]

In this definition, graph abstractions are interpreted as collections of pairs of values $(\vec{u}, v)$ where the $\vec{u}$ represent the input and $v$ the corresponding output of the graph; consequently, the semantics of a graph $\mathcal{G}_{\mathit{set}}(\overrightarrow{x \leftarrow L}; M)$ states that the multiplicity of $(\vec{u}, v)$ is equal to the multiplicity of $v$ in the semantics of $M$ (where each $x_i$ is mapped to $u_i$) if each $u_i$ is in the semantics of $L_i$, and zero otherwise. The semantics of bag graph abstractions is similar, with $\times$ substituted for $\wedge$ to allow multiplicities greater than one in the graph output.

For graph applications $M \circledast (\vec{N})$, the multiplicity of $v$ is obtained as the multiplicity of $(\overrightarrow{\llbracket N \rrbracket \rho}, v)$ in the semantics of $M$. The semantics of set and bag union, bag difference, bag deduplication, and set promotion, as defined in $\mathcal{NRC}_\lambda(\mathit{Set},\mathit{Bag})$, are extended to graphs and remain otherwise unchanged in $\mathcal{NRC}_G$.

In fact (as noted for example by Gibbons et al. [20]), the graph constructs of $\mathcal{NRC}_G$ are just a notational convenience: we can translate $\mathcal{NRC}_G$ back to $\mathcal{NRC}_\lambda(\mathit{Set},\mathit{Bag})$ by translating types $\vec{\sigma} \rightharpoonup \{\tau\}$ and $\vec{\sigma} \rightharpoonup \lbag \tau \rbag$ to $\{\langle \vec{\sigma}, \tau \rangle\}$ and $\lbag \langle \vec{\sigma}, \tau \rangle \rbag$ respectively, and the term constructs are rewritten as follows:

\[
\begin{array}{rcll}
\mathcal{G}_{\mathit{set}}(\overrightarrow{x \leftarrow L}; M) & \rightsquigarrow & \bigcup \{ \{\langle \vec{x}, y \rangle\} \mid \overrightarrow{x \leftarrow L}, y \leftarrow M \} \\[0.5ex]
\mathcal{G}_{\mathit{bag}}(\overrightarrow{x \leftarrow L}; M) & \rightsquigarrow & \biguplus \lbag \lbag \langle \vec{x}, y \rangle \rbag \mid \overrightarrow{x \leftarrow \iota(L)}, y \leftarrow M \rbag \\[0.5ex]
M \circledast (\vec{N}) & \rightsquigarrow & \bigcup \{ \{y\} \;\mathsf{where}_{\mathit{set}}\; \vec{x} = \vec{N} \mid \langle \vec{x}, y \rangle \leftarrow M \} & (M : \vec{\sigma} \rightharpoonup \{\tau\}) \\[0.5ex]
M \circledast (\vec{N}) & \rightsquigarrow & \biguplus \lbag \lbag y \rbag \;\mathsf{where}_{\mathit{bag}}\; \vec{x} = \vec{N} \mid \langle \vec{x}, y \rangle \leftarrow M \rbag & (M : \vec{\sigma} \rightharpoonup \lbag \tau \rbag)
\end{array}
\]

As explained at the end of Section 3, if a subexpression of the form $\iota(N)$ or $N_1 - N_2$ contains free variables introduced by other generators in the query (i.e. not globally-scoped table variables), such queries cannot be translated directly to SQL, unless the SQL:1999 LATERAL keyword is used.

More precisely, we can give the following definition of lateral variable occurrence.
Definition 1. Given a query containing a comprehension $\bigcup \{ M \mid \Theta, x \leftarrow N, \Theta' \}$ or $\biguplus \lbag M \mid \Theta, x \leftarrow N, \Theta' \rbag$ as a subterm, we say that $x$ occurs laterally in $\Theta'$ if, and only if, there is a binding $y \leftarrow N'$ in $\Theta'$ such that $x \in \mathrm{FV}(N')$.

Since LATERAL is not implemented on all databases, and is sometimes implemented inefficiently, we would still like to avoid it. In this section we show how lateral occurrences can be eliminated even in the presence of bag promotion and bag difference, by means of a process we call delateralization.

Using the $\mathcal{NRC}_G$ constructs, we can delateralize simple cases of deduplication or multiset difference as follows:

\[
\begin{array}{rcl}
\biguplus \lbag M \mid x \leftarrow N, y \leftarrow \iota(P) \rbag
  & \rightsquigarrow & \biguplus \lbag M \mid x \leftarrow N, y \leftarrow \iota(\mathcal{G}(x \leftarrow \delta N; P)) \circledast x \rbag \\[0.5ex]
\biguplus \lbag M \mid x \leftarrow N, y \leftarrow P_1 - P_2 \rbag
  & \rightsquigarrow & \biguplus \lbag M \mid x \leftarrow N, y \leftarrow (\mathcal{G}(x \leftarrow \delta N; P_1) - \mathcal{G}(x \leftarrow \delta N; P_2)) \circledast x \rbag \\[0.5ex]
\bigcup \{ M \mid x \leftarrow N, y \leftarrow \delta(P_1 - P_2) \}
  & \rightsquigarrow & \bigcup \{ M \mid x \leftarrow N, y \leftarrow \delta(\mathcal{G}(x \leftarrow N; P_1) - \mathcal{G}(x \leftarrow N; P_2)) \circledast x \}
\end{array}
\]

It is necessary to deduplicate $N$ in the first two rules to ensure that the results correctly represent finite maps from the distinct elements of $N$ to multisets of corresponding elements of $P$. (In any case, $N$ needs to be deduplicated in order to be used as a set in $\mathcal{G}(x \leftarrow \delta N; -)$.)

Given a query expression in normal form, the above rules together with standard equivalences (such as commutativity of independent generators) can be used to delateralize it: that is, remove all occurrences of free variables in subexpressions of the form $\iota(N)$, $M_1 - M_2$, or $\delta(M_1 - M_2)$.

Theorem 1. If $M$ is a flat query in normal form, then there exists $M'$ equivalent to $M$ with no lateral variable occurrences.

The proof of correctness of the basic delateralization rules and the above correctness theorem are in the appendix.

To illustrate some subtleties of the translation, here is a trickier example:

\[ \biguplus \lbag M \mid x \leftarrow N, y \leftarrow Q - \iota(P) \rbag \]

where $Q, P$ both depend on $x$. We proceed from the outside in, first delateralizing the difference:

\[ \biguplus \lbag M \mid x \leftarrow N, y \leftarrow (\mathcal{G}(x \leftarrow \delta(N); Q) - \mathcal{G}(x \leftarrow \delta(N); \iota(P))) \circledast x \rbag \]

Note that this still contains a lateral subquery, namely $\iota(P)$ depends on $x$. After translating back to $\mathcal{NRC}_\lambda(\mathit{Set},\mathit{Bag})$, and delateralizing $\iota(P)$, the query normalizes to:

\[
\begin{array}{l}
Q_1 = \bigcup \{ (x, z) \mid x \leftarrow \delta(N), z \leftarrow P \} \\[0.5ex]
Q_2 = (\biguplus \lbag (x, z) \mid x \leftarrow \iota\delta(N), z \leftarrow Q \rbag) - (\biguplus \lbag (x, z) \mid x \leftarrow \iota\delta(N), (x', z) \leftarrow \iota(Q_1), x = x' \rbag) \\[0.5ex]
\biguplus \lbag M \mid x \leftarrow N, (x', y) \leftarrow Q_2, x = x' \rbag
\end{array}
\]

In the previous sections, we have discussed how to translate queries with flat collection input and output to SQL. The shredding technique, introduced in [8], can be used to convert queries with nested output (but flat input) to multiple flat queries that can be independently evaluated on an SQL database, then stitched together to obtain the required nested result. This section provides an improved version of shredding, extended to a more liberal setting mixing sets and bags and allowing bag difference operations, and described using the graph operations we have introduced, allowing an easier understanding of the shredding process.

We introduce, in Figure 6, a shredding judgment to denote the process by which, given a normalized
$\mathcal{NRC}_\lambda(\mathit{Set},\mathit{Bag})$ query, each of its subqueries having a nested collection type is lifted (in a manner analogous to lambda-lifting [30]) to an independent graph query: more specifically, shredding will produce a shredding environment (denoted by $\Phi, \Psi, \ldots$), which is a finite map associating special graph variables $\varphi, \psi$ to $\mathcal{NRC}_G$ terms:

\[ \Phi, \Psi, \ldots ::= [\overrightarrow{\varphi \mapsto M}] \]

The shredding judgment has the following form:

\[ \Phi; \Theta \vdash M \Mapsto \breve{M} \mid \Psi \]

where the $\Mapsto$ symbol separates the input (to the left) from the output (to the right). The normalized $\mathcal{NRC}_\lambda(\mathit{Set},\mathit{Bag})$ term $M$ is the query that is being considered for shredding; $M$ may contain free variables declared in $\Theta$, which must be a sequence of $\mathcal{NRC}_\lambda(\mathit{Set},\mathit{Bag})$ set comprehension bindings. $\Theta$ is initially empty, but during shredding it is extended with parts of the input that have already been processed. Similarly, the input shredding environment $\Phi$ is initially empty, but will grow during shredding to collect shredded queries that have already been generated. It is crucial, for our algorithm to work, that $M$ be in the form previously described in Figure 3, as this allows us to make assumptions on its shape: in describing the judgment rules, we will use the same metavariables as are used in that grammar.

\[
\frac{X \text{ is a base term}}{\Phi; \Theta \vdash X \Mapsto X \mid \Phi}
\qquad
\frac{(\Phi_{i-1}; \Theta \vdash M_i \Mapsto \breve{M}_i \mid \Phi_i)_{i=1,\ldots,n}}
     {\Phi_0; \Theta \vdash \langle \overrightarrow{\ell = M} \rangle \Mapsto \langle \overrightarrow{\ell = \breve{M}} \rangle \mid \Phi_n}
\]
\[
\frac{\varphi \notin \mathrm{dom}(\Phi_n) \qquad (\Phi_{i-1}; \Theta \vdash C_i \Mapsto \psi_i \circledast \mathrm{dom}(\Theta) \mid \Phi_i)_{i=1,\ldots,n}}
     {\Phi_0; \Theta \vdash \bigcup \vec{C} \Mapsto \varphi \circledast \mathrm{dom}(\Theta) \mid (\Phi_n \setminus \vec{\psi})[\varphi \mapsto \bigcup \overrightarrow{\Phi_n(\psi)}]}
\]
\[
\frac{\varphi \notin \mathrm{dom}(\Phi_n) \qquad (\Phi_{i-1}; \Theta \vdash D_i \Mapsto \psi_i \circledast \mathrm{dom}(\Theta) \mid \Phi_i)_{i=1,\ldots,n}}
     {\Phi_0; \Theta \vdash \biguplus \vec{D} \Mapsto \varphi \circledast \mathrm{dom}(\Theta) \mid (\Phi_n \setminus \vec{\psi})[\varphi \mapsto \biguplus \overrightarrow{\Phi_n(\psi)}]}
\]
\[
\frac{\varphi \notin \mathrm{dom}(\Psi) \qquad \Phi; \Theta, \overrightarrow{x \leftarrow F} \vdash M \Mapsto \breve{M} \mid \Psi}
     {\Phi; \Theta \vdash \bigcup \{ \{M\} \;\mathsf{where}\; X \mid \overrightarrow{x \leftarrow F} \} \Mapsto \varphi \circledast \mathrm{dom}(\Theta) \mid \Psi[\varphi \mapsto \mathcal{G}(\Theta; \bigcup \{ \{\breve{M}\} \;\mathsf{where}\; X \mid \overrightarrow{x \leftarrow F} \})]}
\]
\[
\frac{\varphi \notin \mathrm{dom}(\Psi) \qquad \Phi; \Theta, \overrightarrow{x \leftarrow G^{\delta}} \vdash M \Mapsto \breve{M} \mid \Psi}
     {\Phi; \Theta \vdash \biguplus \lbag \lbag M \rbag \;\mathsf{where}\; X \mid \overrightarrow{x \leftarrow G} \rbag \Mapsto \varphi \circledast \mathrm{dom}(\Theta) \mid \Psi[\varphi \mapsto \mathcal{G}(\Theta; \biguplus \lbag \lbag \breve{M} \rbag \;\mathsf{where}\; X \mid \overrightarrow{x \leftarrow G} \rbag)]}
\]
\[
G^{\delta} \triangleq \begin{cases} Q^* & \text{if } G = \iota Q^* \\ \delta G & \text{otherwise} \end{cases}
\qquad
\Phi \setminus \vec{\psi} \triangleq [(\varphi \mapsto N) \in \Phi \mid \varphi \notin \vec{\psi}]
\]

Fig. 6. Shredding rules.

The output of shredding consists of a shredded term $\breve{M}$ and an output shredding environment $\Psi$. $\Psi$ extends $\Phi$ with the new queries obtained by shredding $M$; $\breve{M}$ is an output $\mathcal{NRC}_G$ query obtained from $M$ by lifting its collection-typed subqueries to independent queries defined in $\Psi$.

The rules for the shredding judgment operate as follows: the first rule expresses the fact that a normalized base term $X$ does not contain subexpressions with nested collection type, therefore it can be shredded to itself, leaving the shredding environment $\Phi$ unchanged; in the case of tuples, we perform shredding pointwise on each field, connecting the input and output shredding environments in a pipeline, and finally combining together the shredded subterms in the obvious way.

The shredding of collection terms (i.e. unions and comprehensions) is performed by means of query lifting: we turn the collection into a globally defined (graph) query, which will be associated to a fresh name $\varphi$ and instantiated to the local comprehension context by graph application. This operation is reminiscent of the lambda lifting and closure conversion techniques used in the implementation of functional languages to convert local function definitions into global ones. Thus, when shredding a collection, besides processing its subterms recursively, we will need to extend the output shredding environment with a definition for the new global graph $\varphi$. In the interesting case of comprehensions, $\varphi$ is defined by graph-abstracting over the comprehension context $\Theta$; notice that, since we are only shredding normalized terms, we know that they have a certain shape and, in particular, the judgment for bag comprehensions must ensure that generators $\vec{G}$ be converted into sets.

The shredding of set and bag unions is performed by recursion on the subterms, using the same plumbing technique we employed for tuples; additionally, we optimize the output shredding environment by removing the graph queries $\vec{\psi}$ resulting from recursion, since they are absorbed into the new graph $\varphi$.

Notice that since the comprehension generators of our normalized queries must have a flat collection type, they do not need to be processed recursively. Furthermore, since our normal forms ensure that promotion and bag difference terms can only appear as comprehension generators, we do not need to provide rules for these cases.

\[
\frac{}{\vdash \cdot : \cdot}
\qquad
\frac{\vdash \Phi : \Gamma \qquad \Gamma \vdash M : \vec{\sigma} \rightharpoonup \tau \qquad \varphi \notin \mathrm{dom}(\Gamma)}
     {\vdash \Phi[\varphi \mapsto M] : (\Gamma, \varphi : \vec{\sigma} \rightharpoonup \tau)}
\]

Fig. 7. Typing rules for shredding environments.

The shredding environments used by the shredding judgment must be well typed, in the sense described by the rules of Figure 7: the judgment $\vdash \Phi : \Gamma$ means that the graph variables of $\Phi$ are mapped to terms whose type is described by $\Gamma$. Whenever we add a mapping $[\varphi \mapsto M]$ to $\Phi$, we must make sure that $M$ is well typed (of graph type) in the typing environment $\Gamma$ associated to $\Phi$.

If $\vdash \Phi : \Gamma$, we will write $ty(\Phi)$ to refer to the typing environment $\Gamma$ associated to $\Phi$. The following result states that shredding preserves well-typedness:

Theorem 2.
Let $\Theta$ be well-typed and $ty(\Theta) \vdash M : \sigma$. If $\Theta \vdash M \Mapsto \breve{M} \mid \Phi$, then:
– $\Phi$ is well-typed
– $ty(\Phi), ty(\Theta) \vdash \breve{M} : \sigma$

We now intend to prove the correctness of shredding: first, we state a lemma which we can use to simplify certain expressions involving the semantics of graph application:

Definition 2. Let $\Theta$ be a closed, well-typed sequence of generators. A substitution $\rho$ is a model of $\Theta$ (notation: $\rho \vDash \Theta$) if, and only if, for all $x \in \mathrm{dom}(\Theta)$, we have $\llbracket \Theta(x) \rrbracket \rho(x) > 0$.

Lemma 3.
1. $\llbracket (\bigcup \vec{G}) \circledast (\vec{N}) \rrbracket \rho = \bigvee_i \llbracket G_i \circledast (\vec{N}) \rrbracket \rho$
2. If $\rho \vDash \Theta$, then for all $M$ we have $\llbracket \mathcal{G}(\Theta; M) \circledast (\mathrm{dom}(\Theta)) \rrbracket \rho = \llbracket M \rrbracket \rho$.

To state the correctness of shredding, we need the following notion of shredding environment substitution.
Definition 3. For every well-typed shredding environment $\Phi$, the substitution of $\Phi$ into an $\mathcal{NRC}_G$ term $M$ (notation: $M\Phi$) is defined as the operation replacing within $M$ every free variable $\varphi \in \mathrm{dom}(\Phi)$ with $(\Phi(\varphi))\Phi$ (i.e. the value assigned by $\Phi$ to $\varphi$, after recursively substituting $\Phi$).

We can easily show that the above definition is well posed for well-typed $\Phi$. We now show that shredding preserves the semantics of the input term, in the sense that the term obtained by substituting the output shredding environment into the output term is equivalent to the input.

Theorem 3 (Correctness of shredding). Let $\Theta$ be well-typed and $ty(\Theta) \vdash M : \sigma$. If $\Phi; \Theta \vdash M \Mapsto \breve{M} \mid \Psi$, then, for all $\rho \vDash \Theta$, we have $\llbracket M \rrbracket \rho = \llbracket \breve{M}\Psi \rrbracket \rho$.

Proof. By induction on the shredding judgment. We comment on two representative cases:

– in the set comprehension case, we want to prove
\[
\llbracket \bigcup \{ \{M\} \;\mathsf{where}\; X \mid \overrightarrow{x \leftarrow F} \} \rrbracket \rho\, v
= \llbracket (\varphi \circledast (\mathrm{dom}(\Theta)))\, \Psi[\varphi \mapsto \mathcal{G}(\Theta; \bigcup \{ \{\breve{M}\} \;\mathsf{where}\; X \mid \overrightarrow{x \leftarrow F} \})] \rrbracket \rho\, v
\]
where $\rho \vDash \Theta$. We rewrite the lhs as follows:
\[
\llbracket \bigcup \{ \{M\} \;\mathsf{where}\; X \mid \overrightarrow{x \leftarrow F} \} \rrbracket \rho\, v
= \bigvee_{\vec{u}} (\llbracket M \rrbracket \rho_n = v) \wedge (\llbracket X \rrbracket \rho_n) \wedge (\llbracket F_i \rrbracket \rho_{i-1}\, u_i)_{i=1,\ldots,n}
\]
where $\rho_i = \rho[x_1 \mapsto u_1, \ldots, x_i \mapsto u_i] \vDash \Theta, x_1 \leftarrow F_1, \ldots, x_i \leftarrow F_i$ for all $i = 1, \ldots, n$, and $u_i$ s.t. $\llbracket F_i \rrbracket \rho_{i-1}\, u_i$. By the definition of substitution and by Lemma 3, we rewrite the rhs:
\[
\begin{array}{l}
\llbracket (\varphi \circledast (\mathrm{dom}(\Theta)))\, \Psi[\varphi \mapsto \mathcal{G}(\Theta; \bigcup \{ \{\breve{M}\} \;\mathsf{where}\; X \mid \overrightarrow{x \leftarrow F} \})] \rrbracket \rho\, v \\[0.5ex]
\quad = \llbracket (\mathcal{G}(\Theta; \bigcup \{ \{\breve{M}\Psi\} \;\mathsf{where}\; X \mid \overrightarrow{x \leftarrow F} \})) \circledast (\mathrm{dom}(\Theta)) \rrbracket \rho\, v \\[0.5ex]
\quad = \llbracket \bigcup \{ \{\breve{M}\Psi\} \;\mathsf{where}\; X \mid \overrightarrow{x \leftarrow F} \} \rrbracket \rho\, v \\[0.5ex]
\quad = \bigvee_{\vec{u}} (\llbracket \breve{M}\Psi \rrbracket \rho_n = v) \wedge (\llbracket F_i \rrbracket \rho_{i-1}\, u_i)_{i=1,\ldots,n} \wedge (\llbracket X \rrbracket \rho_n)
\end{array}
\]
We can prove that for all $\vec{u}$ such that $\rho_n \nvDash \Theta, \overrightarrow{x \leftarrow F}$, we have $(\llbracket F_i \rrbracket \rho_{i-1}\, u_i)_{i=1,\ldots,n} = 0$. Therefore, we only need to consider those $\vec{u}$ such that $\rho_n \vDash \Theta, \overrightarrow{x \leftarrow F}$. Then, to prove the thesis, we only need to show:
\[ \llbracket M \rrbracket \rho_n = \llbracket \breve{M}\Psi \rrbracket \rho_n \]
which follows by induction hypothesis, for $\rho_n \vDash \Theta, \overrightarrow{x \leftarrow F}$.
– in the set union case, we want to prove
\[
\llbracket \bigcup \vec{C} \rrbracket \rho\, v
= \llbracket (\varphi \circledast (\mathrm{dom}(\Theta)))\, (\Psi \setminus \vec{\psi})[\varphi \mapsto \bigcup \overrightarrow{\Psi(\psi)}] \rrbracket \rho\, v
\]
where $\rho \vDash \Theta$. We rewrite the lhs as follows:
\[ \llbracket \bigcup \vec{C} \rrbracket \rho\, v = \bigvee_i \llbracket C_i \rrbracket \rho\, v \]
By the definition of substitution and by Lemma 3, we rewrite the rhs:
\[
\begin{array}{l}
\llbracket (\varphi \circledast (\mathrm{dom}(\Theta)))\, (\Psi \setminus \vec{\psi})[\varphi \mapsto \bigcup \overrightarrow{\Psi(\psi)}] \rrbracket \rho\, v \\[0.5ex]
\quad = \llbracket (\bigcup \overrightarrow{(\Psi(\psi))\Psi}) \circledast (\mathrm{dom}(\Theta)) \rrbracket \rho\, v \\[0.5ex]
\quad = \bigvee_i \llbracket (\Psi(\psi_i))\Psi \circledast (\mathrm{dom}(\Theta)) \rrbracket \rho\, v
\end{array}
\]
By induction hypothesis and unfolding of definitions, we know for all $i$:
\[
\llbracket C_i \rrbracket \rho = \llbracket (\psi_i \circledast (\mathrm{dom}(\Theta)))\Psi \rrbracket \rho = \llbracket (\Psi(\psi_i))\Psi \circledast (\mathrm{dom}(\Theta)) \rrbracket \rho
\]
which proves the thesis. $\Box$

The output of the shredding judgment is a stratified version of the input term, where each element of the output shredding environment provides a layer of collection nesting; furthermore, the output is ordered so that each element of the shredding environment only references graph variables defined to its left, which is convenient for evaluation. Our goal is to evaluate each shredded item as an independent query: however, these items are not immediately convertible to flat queries, partly because their type is still nested, and also due to the presence of graph operations introduced during shredding. We thus need to provide a translation operation capable of converting the output of shredding into independent flat terms of
$\mathcal{NRC}_\lambda(\mathit{Set},\mathit{Bag})$. This translation uses two main ingredients:
– an index function to convert graph variable references to a flat type $I$ of indices, such that $\varphi, \vec{x}$ are recoverable from $\mathit{index}(\varphi, \vec{x})$;
– a technique to express graphs as standard $\mathcal{NRC}_\lambda(\mathit{Set},\mathit{Bag})$ relations.

The resulting translation, denoted by $\lfloor \cdot \rfloor$, is shown in Figure 8. Let us remark that the translation need be defined only for term forms that can be produced as the output of shredding: this allows us, for instance, not to consider terms such as $\iota M$ or $M - N$, which can only appear as part of flat generators of comprehensions or graphs.

\[
\begin{array}{rcl}
\lfloor X \rfloor & = & X \\[0.5ex]
\lfloor \langle \overrightarrow{\ell = M} \rangle \rfloor & = & \langle \overrightarrow{\ell = \lfloor M \rfloor} \rangle \\[0.5ex]
\lfloor \bigcup \vec{C} \rfloor & = & \bigcup \overrightarrow{\lfloor C \rfloor} \\[0.5ex]
\lfloor \biguplus \vec{D} \rfloor & = & \biguplus \overrightarrow{\lfloor D \rfloor} \\[0.5ex]
\lfloor \varphi \circledast (\vec{x}) \rfloor & = & \mathit{index}(\varphi, \vec{x}) \\[0.5ex]
\lfloor \bigcup \{ \{M\} \;\mathsf{where}\; X \mid \overrightarrow{x \leftarrow F} \} \rfloor & = & \bigcup \{ \{\lfloor M \rfloor\} \;\mathsf{where}\; X \mid \overrightarrow{x \leftarrow F} \} \\[0.5ex]
\lfloor \biguplus \lbag \lbag M \rbag \;\mathsf{where}\; X \mid \overrightarrow{x \leftarrow G} \rbag \rfloor & = & \biguplus \lbag \lbag \lfloor M \rfloor \rbag \;\mathsf{where}\; X \mid \overrightarrow{x \leftarrow G} \rbag \\[0.5ex]
\lfloor \mathcal{G}_{\mathit{set}}(\overrightarrow{x \leftarrow F}; M) \rfloor & = & \bigcup \{ \langle \vec{x}, y \rangle \mid \overrightarrow{x \leftarrow F}, y \leftarrow \lfloor M \rfloor \} \\[0.5ex]
\lfloor \mathcal{G}_{\mathit{bag}}(\overrightarrow{x \leftarrow F}; M) \rfloor & = & \biguplus \lbag \langle \vec{x}, y \rangle \mid \overrightarrow{x \leftarrow \iota F}, y \leftarrow \lfloor M \rfloor \rbag
\end{array}
\]

Fig. 8. Flattening embedding of shredded queries into $\mathcal{NRC}_\lambda(\mathit{Set},\mathit{Bag})$.

We discuss briefly the interesting cases of the definition of the flattening translation. Base expressions $X$ are expressible in $\mathcal{NRC}_\lambda(\mathit{Set},\mathit{Bag})$, therefore they can be mapped to themselves (this is also true for $\mathit{empty}(M)$, since normalization ensures that the type of $M$ be a flat collection). Graph applications $\varphi \circledast (\vec{x})$, as we said, are translated with the help of an index abstract operation: this is where the primary purpose of the translation is accomplished, by flattening a collection type to the flat type $I$, making it possible for a shredded query to be converted to SQL; although we do not specify the concrete implementation of $\mathit{index}$, it is worth noting that it must store the arguments of the graph application along with the (quoted) name of the graph variable $\varphi$. Tuples, unions, and comprehensions only require a recursive translation of their subterms: however the generators of comprehensions must have a flat collection type, so no recursion is needed there. Finally, we translate graphs as collections of the pairs obtained by associating elements of the domain of the graph to the corresponding output; it is simple to come up with a comprehension term building such a collection: set-valued graphs are translated using set comprehension, while bag-valued ones use bag comprehension (this also means that in the latter case the generators for the domain of the graph, which are set-typed, must be wrapped in a $\iota$).

We can prove that the flattening embedding produces flat-typed terms, as expected.

Definition 4.
A well-typed set comprehension generator $\Theta$ is flat-typed if, and only if, for all $x \in \mathrm{dom}(\Theta)$, there exists a flat type $\sigma$ such that $ty(\Theta(x)) = \{\sigma\}$. A well-typed shredding environment $\Phi$ is flat-typed if, and only if, for all $\varphi \in \mathrm{dom}(\Phi)$, we have that $ty(\lfloor \Phi(\varphi) \rfloor)$ is a flat collection type.

Lemma 4. Suppose $\Phi; \Theta \vdash M \Mapsto \breve{M} \mid \Psi$, where $\Phi$ and $\Theta$ are flat-typed. Then, $\breve{M}$ and $\Psi$ are also flat-typed.

It is important to note that the composition of shredding and $\lfloor \cdot \rfloor$ does not produce normalized
$\mathcal{NRC}_\lambda(\mathit{Set},\mathit{Bag})$ terms: when we shred a comprehension, we add to the output shredding environment a graph returning a comprehension, and when we translate this to $\mathcal{NRC}_\lambda(\mathit{Set},\mathit{Bag})$ we get two nested comprehensions:

\[
\lfloor \mathcal{G}(x \leftarrow \delta t; \biguplus \lbag \lbag \breve{M} \rbag \mid y \leftarrow \iota Q^* \rbag) \rfloor
= \biguplus \lbag \langle x, z \rangle \mid x \leftarrow \iota\delta t, z \leftarrow \biguplus \lbag \lbag \lfloor \breve{M} \rfloor \rbag \mid y \leftarrow \iota Q^* \rbag \rbag
\]

In fact, not only is this term not in normal form, but it may even contain, within $Q^*$, a lateral reference to $x$; thus, after a flattening translation, we will always require the resulting queries to be renormalized and, if needed, delateralized.

\[
\begin{array}{rcl}
\llparenthesis X : b \rrparenthesis_\Xi & \triangleq & X \quad (\text{if } X \text{ is not an index}) \\[0.5ex]
\llparenthesis \langle \overrightarrow{\ell = \breve{N}} \rangle : \langle \overrightarrow{\ell : \tau} \rangle \rrparenthesis_\Xi & \triangleq & \langle \overrightarrow{\ell = \llparenthesis \breve{N} : \tau \rrparenthesis_\Xi} \rangle \\[0.5ex]
\llparenthesis \langle \overrightarrow{\ell = \breve{N}} \rangle.\ell_i : \tau \rrparenthesis_\Xi & \triangleq & \llparenthesis \breve{N}_i : \tau \rrparenthesis_\Xi \\[0.5ex]
\llparenthesis \mathit{index}(\varphi, \vec{V}) : \{\tau\} \rrparenthesis_\Xi & \triangleq & \bigcup \{ \{ \llparenthesis p.2 : \tau \rrparenthesis_\Xi \} \mid p \leftarrow \Xi(\varphi), p.1 = \langle \vec{V} \rangle \} \\[0.5ex]
\llparenthesis \mathit{index}(\varphi, \vec{V}) : \lbag \tau \rbag \rrparenthesis_\Xi & \triangleq & \biguplus \lbag \lbag \llparenthesis p.2 : \tau \rrparenthesis_\Xi \rbag \mid p \leftarrow \Xi(\varphi), p.1 = \langle \vec{V} \rangle \rbag
\end{array}
\]

Fig. 9. The stitching function.

Let $\mathit{norm}$ denote $\mathcal{NRC}_\lambda(\mathit{Set},\mathit{Bag})$ normalization, and $\mathcal{S}$ denote the evaluation of relational normal forms: we define the shredded value set $\Xi$ corresponding to a shredding environment $\Phi$ as follows:

\[ \Xi \triangleq \{ \varphi \mapsto \mathcal{S}(\mathit{norm}(\lfloor M \rfloor)) \mid [\varphi \mapsto M] \in \Phi \} \]

The evaluation $\mathcal{S}$ is ordinarily performed by a DBMS after converting the $\mathcal{NRC}_\lambda(\mathit{Set},\mathit{Bag})$ query to SQL, as described in Section 5. The result of this evaluation is reflected in a programming language such as Links as a list of records.
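To make the flattening of graphs concrete, the following Python sketch models a set-valued graph as a relation of (input, output) pairs and graph application as a filter on the input component. It is only an illustration on in-memory data, not the Links implementation: the names `Index`, `flatten_set_graph` and `apply_flat_graph`, as well as the toy tables, are assumptions introduced for this example, and Python sets stand in for query results.

```python
from collections import namedtuple

# Hypothetical stand-in for the abstract index(phi, xs) operation: it stores
# the (quoted) name of the graph variable together with the arguments.
Index = namedtuple("Index", ["graph", "args"])


def flatten_set_graph(domain, body):
    """Flatten G_set(x <- domain; body(x)) into a set of (input, output) pairs."""
    return {((x,), y) for x in domain for y in body(x)}


def apply_flat_graph(relation, args):
    """Graph application on the flattened form: keep outputs whose input matches."""
    return {y for (xs, y) in relation if xs == tuple(args)}


# Toy flat tables (assumed data): department names and (employee, department) rows.
departments = {"Sales", "Research"}
employees = {("Ann", "Sales"), ("Bob", "Sales"), ("Cy", "Research")}

# The inner query "employees of department d" is lifted to a graph phi that
# abstracts over the outer generator d <- departments.
phi = flatten_set_graph(
    departments,
    lambda d: {name for (name, dept) in employees if dept == d},
)

# The outer shredded query is flat: its nested field is replaced by an index.
top = {(d, Index("phi", (d,))) for d in departments}
```

Applying the flattened graph to an input outside its domain yields the empty set, in line with the semantics of graph application; the outer query carries only `Index` values, which is what allows it to be evaluated as a flat query.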
Given an $\mathcal{NRC}_\lambda(\mathit{Set},\mathit{Bag})$ term with nested collections, we have first shredded it, obtaining a shredded $\mathcal{NRC}_G$ term $\breve{M}$ and a shredding environment $\Phi$ containing $\mathcal{NRC}_G$ graphs; then we have used a flattening embedding to reflect both $\breve{M}$ and $\Phi$ back into the flat fragment of $\mathcal{NRC}_\lambda(\mathit{Set},\mathit{Bag})$; next we used normalization and DBMS evaluation to convert the shredding environment into a shredded value set $\Xi$. As the last step to evaluate $M : \tau$, we need to combine $\lfloor \breve{M} \rfloor$ and $\Xi$ together to reconstruct the correct nested value $\llparenthesis \lfloor \breve{M} \rfloor : \tau \rrparenthesis_\Xi$ by stitching together partial flat values.

The stitching function is shown in Figure 9: its job is to visit all the components of tuples and collections, ignoring atomic values other than indices along the way. The real work is performed when an $\mathit{index}(\varphi, \vec{V})$ is found: conceptually, the index should be replaced by the result of the evaluation of $\varphi \circledast (\vec{V})$. Remember that $\Xi$ contains the result of the evaluation of the graph function $\varphi$ after translation to $\mathcal{NRC}_\lambda(\mathit{Set},\mathit{Bag})$, i.e. a collection of pairs associating each input of $\varphi$ to the corresponding output: then, to obtain the desired result, we can take $\Xi(\varphi)$, filter all the pairs $p$ whose first component is $\langle \vec{V} \rangle$, and return the second component of $p$ after a recursive stitching. Finally, observe that we track the result type argument in order to disambiguate whether to construct a set or multiset when we encounter an index.

Theorem 4 (Correctness of stitching).
Let $\Theta$ be well-typed and $ty(\Theta) \vdash M : \sigma$. Let $\Phi$ be well-typed, and suppose $\Phi; \Theta \vdash M \Mapsto \breve{M} \mid \Psi$. Let $\Xi$ be the result of evaluating the flattened queries in $\Psi$ as above. Then $\llbracket \breve{M}\Psi \rrbracket \rho = \llbracket \llparenthesis \lfloor \breve{M} \rfloor : \sigma \rrparenthesis_\Xi \rrbracket \rho$.

The full correctness result follows by combining Theorems 3 and 4.
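The stitching function of Figure 9 can be sketched in Python as follows. This is a simplified illustration under stated assumptions rather than the actual implementation: records are positional tuples, nested sets are `frozenset`s, nested bags are tuples kept in a canonical order, and the type descriptions, the `Index` tuple and the contents of `xi` (playing the role of the shredded value set) are hypothetical.

```python
from collections import namedtuple

# Hypothetical representation of index(phi, V): graph name plus arguments.
Index = namedtuple("Index", ["graph", "args"])


def stitch(value, ty, xi):
    """Rebuild a nested value from a flat one, guided by its result type.

    ty is "base", ("record", [field types]), ("set", ty) or ("bag", ty).
    xi maps each graph name to a list of (args, output) pairs.
    """
    if ty == "base":
        return value  # atomic values other than indices are left unchanged
    kind, arg = ty
    if kind == "record":
        # Stitch every field of the record against its own type.
        return tuple(stitch(v, t, xi) for v, t in zip(value, arg))
    # Collection case: value is an Index. Keep the pairs whose first component
    # equals the index arguments and recursively stitch the second component.
    rows = [
        stitch(out, arg, xi)
        for (args, out) in xi[value.graph]
        if args == value.args
    ]
    if kind == "set":
        return frozenset(rows)  # set result: duplicates collapse
    return tuple(sorted(rows, key=repr))  # bag result: duplicates are kept


# Assumed shredded value set: one entry per lifted graph.
xi = {
    "top": [
        ((), ("Sales", Index("phi", ("Sales",)))),
        ((), ("Research", Index("phi", ("Research",)))),
    ],
    "phi": [
        (("Sales",), "Ann"),
        (("Sales",), "Bob"),
        (("Research",), "Cy"),
    ],
}

# Result type: a set of records pairing a base value with a nested set.
nested_ty = ("set", ("record", ["base", ("set", "base")]))
result = stitch(Index("top", ()), nested_ty, xi)
```

The type argument decides between `frozenset` and tuple at each index, mirroring the way the stitching function uses the result type to choose between building a set and building a multiset.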
Corollary 1. For all $M$ such that $\vdash M : \tau$, suppose $\vdash M \Mapsto \breve{M} \mid \Psi$, and let $\Xi$ be the shredded value set obtained by evaluating the flattened queries in $\Psi$. Then $\llbracket M \rrbracket = \llbracket \llparenthesis \lfloor \breve{M} \rfloor : \tau \rrparenthesis_\Xi \rrbracket$.

Work on language-integrated query and comprehension syntax has taken place over several decades in both the database and programming language communities. We discuss the most closely related work below.
Comprehensions, normalization and language integration
The database community had already begun in the late 1980s to explore proposals for so-called non-first-normal-form relations in which collections could be nested inside other collections [45], but following Trinder and Wadler’s initial work connecting database queries with monadic comprehensions [49], query languages based on these foundations were studied extensively, particularly by Buneman et al. [4,3]. For our purposes, Wong’s work on query normalization and translation to SQL [54] is the most important landmark; this work provided the basis for practical implementations such as Kleisli and later Links. Almost as important is the later work by Libkin and Wong [33], studying the questions of expressiveness of bag query languages via a language BQL that extended basic $\mathcal{NRC}$ with deduplication and bag difference operators. They related this language to $\mathcal{NRC}$ with set semantics extended with aggregation (count/sum) operations, but did not directly address the question of normalizing and translating BQL queries to SQL. Grust and Scholl [28] were early advocates of the use of comprehensions mixing set, bag and other monadic collections for query rewriting and optimization, but did not study normalization or translatability properties.

Although comprehension-based queries began to be used in general-purpose programming languages with the advent of Microsoft LINQ [36] and Links [12], Cooper [11] made the next important foundational contribution by extending Wong’s normalization result to queries containing higher-order functions and showing that an effect system could be used to safely compose queries using higher-order functions even in an ambient language with side-effects and recursive functions that cannot be used in queries. This work provided the basis for subsequent development of language-integrated query in Links [34] and was later adapted for use in F
QueΛ. However, on revisiting Cooper’s proof to extend it to heterogeneous queries, we found a subtle gap in the proof, which was corrected in a recent paper [43]; the original result was correct. As a result, in this paper we focus on first-order fragments of these languages without loss of generality.

Giorgidze et al. [22] have shown how to support non-recursive datatypes (i.e. sums) and Grust and Ulrich [29] built on this to show how to support function types in query results using defunctionalization [29]. We considered using sums to support a defunctionalization-style strategy for query lifting, but Giorgidze et al. [22] map sum types to nested collections, which makes their approach unsuitable to our setting. Wong’s original normalization result also considered sum types, but to the best of our knowledge normalization for $\mathcal{NRC}_\lambda(\mathit{Set},\mathit{Bag})$ extended with sum types has not yet been proved.

Recent work by Suzuki et al. [47] has outlined further extensions to language-integrated query in the QueΛ system, which is based on finally-tagless syntax [6] and employs Wong’s and Cooper’s rewrite rules; Katsushima and Kiselyov’s subsequent short paper [31] outlined extensions to handling ordering and grouping. Kiselyov and Katsushima [32] present an extension to QueΛ called Squr to handle ordering based on effect typing, and they provide an elegant translation from Squr queries to SQL based on normalization-by-evaluation. Okura and Kameyama [39] outline an extension to handle SQL-style grouping and aggregation operators in QueΛG; however, their approach potentially generates lateral variable occurrences inside grouping queries. These systems QueΛ, Squr and QueΛG consider neither heterogeneity nor nested results.

Our adoption of tabulated functions (graphs) is inspired in part by Gibbons et al. [20], who provided an elegant rational reconstruction of relational algebra showing how standard principles for reasoning about queries arise from adjunctions. They employed types for (finite) maps and tables to show how joins can be implemented efficiently, and observed that such structures form a graded monad. We are interested in further exploring these structures and extending our work to cover ordering, grouping and aggregation.

Query decorrelation and delateralization
There is a large literature on querydecorrelation , for example to remove aggregation operations from
SELECT or WHERE clauses (see e.g. [38,5] for further discussion). Delateralization appearsrelated to decorrelation, but we are aware of only a few works on this problem,perhaps because most DBMSs only started to support
LATERAL in the last fewyears. (Microsoft SQL Server has supported similar functionality for much longerthrough a keyword
APPLY .) Our delateralization technique appears most closelyrelated to Neumann and Kemper’s work on query unnesting [38]. In this con-text, unnesting refers to removal of “dependent join” expressions in a relationalalgebraic query language; such joins appear to correspond to lateral subqueries.This approach is implemented in the HyPER database system, but is not ac-companied by a proof of correctness, nor does it handle nested query results. Itwould be interesting to formalize this approach (or others from the decorrelationliterature) and relate it to delateralization.
Querying nested collections
Our approach to querying nested heterogeneous collections clearly specializes to the homogeneous cases for sets and multisets respectively, which have been studied separately. Van den Bussche’s work on simulating queries on nested sets using flat ones [53] has also inspired subsequent work on query shredding, flattening and (in this paper) lifting, though the simulation technique itself does not appear practical (as discussed in the extended version of Cheney et al. [9]). More recently, Benedikt and Pradic [1] presented results on representing queries on nested collections using a bounded number of interpretations (first-order logic formulas corresponding to definable flat query expressions) in the context of their work on synthesizing NRC queries from proofs. This approach considers set-valued NRC only, and its relationship to our approach should be investigated further.
Cheney et al.’s previous work on query shredding for multiset queries [8] is different in several important respects. In that work we did not consider deduplication and bag difference operations from BQL, which Libkin and Wong showed cannot be expressed in terms of other NRC operations. The shredding translation was given in several stages, and while each stage is individually comprehensible, the overall approach is not easy to understand. Finally, the last stages of the translation relied on SQL features not present (or expressible) in the source language, such as ordering and the SQL:1999 ROW_NUMBER construct, to synthesize uniform integer keys. Our approach, in contrast, handles set, bag, and mixed queries, and does not rely on any SQL:1999 features.
In a parallel line of work, Grust et al. [26,21,50,52,51] have developed a number of approaches to querying nested list data structures, first in the context of XML processing [24] and subsequently for NRC-like languages over lists. The earlier approach [26], named loop-lifting (not to be confused with query lifting!), made heavy use of SQL:1999 capabilities for numbering and indexing to decouple nested collections from their context, and was implemented in both Links [50] and earlier versions of the Database Supported Haskell library [21], both of which relied on an advanced query optimizer called Pathfinder [27] to optimize these queries. The more recent approach, implemented by Ulrich in the current version of DSH and described in detail in his thesis [51], is called query flattening and is instead based on techniques from nested data parallelism [2]. Both loop-lifting and query flattening are very powerful, and do not rely on an initial normalization stage, while supporting a rich source language with list semantics, ordering, grouping, aggregation, and deduplication which can in principle emulate set or multiset semantics. However, to the best of our knowledge no correctness proofs exist for either technique. We view finding correctness results for richer query languages as an important challenge for future work.
Another parallel line of work started by Fegaras and Maier [15,14] considers heterogeneous query languages based on monoid comprehensions, with set, list, and bag collections as well as grouping, aggregation and ordering operations, in the setting of object-oriented databases, and forms the basis for complex object database systems such as λDB [16] and Apache MRQL [14]. However, Wong-style normalization results or translations from flat or nested queries to SQL are not known for these calculi.
Lambda-lifting and closure conversion
Since Johnsson’s original work [30], lambda-lifting and closure conversion have been studied extensively for functional languages, with Minamide et al.’s typed closure conversion [37] of particular interest in compilers employing typed intermediate languages. We plan to study whether known optimizations in the lambda-lifting and closure conversion literature offer advantages for query lifting. The immediate important next step is to implement our approach and compare it empirically with previous techniques such as query shredding and query flattening. By analogy with lambda-lifting and closure conversion, we expect additional optimizations to be possible by a deeper analysis of how variables/fields are used in lifted subqueries. Another problem we have not resolved is how to deal with deduplication or bag difference at nested collection types in practice. Libkin and Wong [33] showed that such nesting can be eliminated from BQL queries, but their results do not provide a constructive algorithm for eliminating the nesting.
Monadic comprehensions have proved to be a remarkably durable foundation for database programming and language-integrated query, and have led to language support (LINQ for .NET, Quill for Scala) with widespread adoption. Recent work has demonstrated that techniques for evaluating queries over nested collections, such as query shredding or query flattening, can offer order-of-magnitude speedups in database applications [19] without sacrificing declarativity or readability. However, query shredding lacks the ability to express common operations such as deduplication, while query flattening is more expressive but lacks a detailed proof of correctness, and both techniques are challenging to understand, implement, or extend. We provide the first provably correct approach to querying nested heterogeneous collections involving both sets and multisets.
Our most important insight is that working in a heterogeneous language, with both set and multiset collection types, actually makes the problem easier, by making it possible to calculate finite maps representing the behavior of nested query subexpressions under all of the possible environments encountered at run time. Thus, instead of having to maintain or synthesize keys linking inner and outer collections, as is done in all previous approaches, we can instead use the values of variables in the closures of nested query expressions themselves as the keys. The same approach can be used to eliminate sideways information-passing. This is analogous to lambda-lifting or closure conversion in compilation of functional languages, but differs in that we lift local queries to (queries that compute) finite maps rather than ordinary function abstractions. We believe this idea may have broader applications and will next investigate its behavior in practice and applications to other query language features.
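The idea of using the values of closure variables as keys can be illustrated with a small sketch. The Python below is only an informal illustration under stated assumptions, not the translation developed in this paper: the tables `departments` and `employees`, the helper names `nested_direct` and `lifted`, and the use of in-memory collections in place of SQL are all hypothetical. The inner query depends on the outer variable `d`; lifting tabulates it once as a flat finite map keyed by the (deduplicated) values that `d` can take, and the nested result is then assembled by lookup.

```python
from collections import Counter, defaultdict

# Hypothetical flat input tables (bags of records); "Sales" occurs twice on purpose.
departments = [{"dept": "Sales"}, {"dept": "R&D"}, {"dept": "Ops"}, {"dept": "Sales"}]
employees = [
    {"name": "Ann", "dept": "Sales"},
    {"name": "Bob", "dept": "R&D"},
    {"name": "Cy", "dept": "Sales"},
]

def nested_direct():
    """Reference semantics: evaluate the inner query once per outer row."""
    return [
        (d["dept"], Counter(e["name"] for e in employees if e["dept"] == d["dept"]))
        for d in departments
    ]

def lifted():
    """Lifted evaluation: two flat queries plus a stitching step."""
    # Flat query 1: the outer collection, without its nested field.
    outer = [d["dept"] for d in departments]
    # Flat query 2: the graph of the inner query, tabulated over every
    # environment (value of the free variable) drawn from the deduplicated
    # outer generator. Keys are the variable values themselves.
    graph = Counter(
        (key, e["name"])
        for key in set(outer)
        for e in employees
        if e["dept"] == key
    )
    # Stitching: group the flat graph by key and look each outer row up.
    index = defaultdict(Counter)
    for (key, name), multiplicity in graph.items():
        index[key][name] += multiplicity
    return [(key, index.get(key, Counter())) for key in outer]
```

Both functions return the same nested bag here, including the empty inner bag for "Ops" and the repeated "Sales" row; no synthetic integer keys are needed to link inner and outer collections.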
Acknowledgments
This work was supported by ERC Consolidator Grant Skye (grant number 682315), and by an ISCF Metrology Fellowship grant provided by the UK government’s Department for Business, Energy and Industrial Strategy (BEIS). We are grateful to Simon Fowler for feedback and to anonymous reviewers for constructive comments.
References
1. Benedikt, M., Pradic, P.: Generating collection transformations from proofs. Proc. ACM Program. Lang. (POPL) (Jan 2021), https://doi.org/10.1145/3434295
2. Blelloch, G.E.: Vector Models for Data-Parallel Computing. MIT Press (1990)
3. Buneman, P., Libkin, L., Suciu, D., Tannen, V., Wong, L.: Comprehension syntax. SIGMOD Record (1994)
4. Buneman, P., Naqvi, S., Tannen, V., Wong, L.: Principles of programming with complex objects and collection types. Theor. Comput. Sci. (1) (1995). https://doi.org/10.1016/0304-3975(95)00024-Q
5. Cao, B., Badia, A.: SQL query optimization through nested relational algebra. ACM Trans. Database Syst. (3), 18–es (Aug 2007). https://doi.org/10.1145/1272743.1272748
6. Carette, J., Kiselyov, O., Shan, C.: Finally tagless, partially evaluated: Tagless staged interpreters for simpler typed languages. J. Funct. Program. (5), 509–543 (2009). https://doi.org/10.1017/S0956796809007205
7. Cheney, J., Lindley, S., Wadler, P.: A practical theory of language-integrated query. In: ICFP (2013). https://doi.org/10.1145/2500365.2500586
8. Cheney, J., Lindley, S., Wadler, P.: Query shredding: efficient relational evaluation of queries over nested multisets. In: SIGMOD. pp. 1027–1038. ACM (2014). https://doi.org/10.1145/2588555.2612186
9. Cheney, J., Lindley, S., Wadler, P.: Query shredding: Efficient relational evaluation of queries over nested multisets (extended version). CoRR abs/1404.7078 (2014), http://arxiv.org/abs/1404.7078
10. Chu, S., Weitz, K., Cheung, A., Suciu, D.: HoTTSQL: Proving query rewrites with univalent SQL semantics. In: PLDI. pp. 510–524. ACM (2017). https://doi.org/10.1145/3062341.3062348
11. Cooper, E.: The script-writer’s dream: How to write great SQL in your own language, and be sure it will succeed. In: DBPL (2009). https://doi.org/10.1007/978-3-642-03793-1_3
12. Cooper, E., Lindley, S., Wadler, P., Yallop, J.: Links: web programming without tiers. In: FMCO (2007). https://doi.org/10.1007/978-3-540-74792-5_12
13. Copeland, G., Maier, D.: Making Smalltalk a database system. SIGMOD Rec. (2) (1984)
14. Fegaras, L.: An algebra for distributed big data analytics. J. Funct. Program., e27 (2017). https://doi.org/10.1017/S0956796817000193
15. Fegaras, L., Maier, D.: Optimizing object queries using an effective calculus. ACM Trans. Database Syst. (4), 457–516 (2000)
16. Fegaras, L., Srinivasan, C., Rajendran, A., Maier, D.: lambda-DB: An ODMG-based object-oriented DBMS. In: Chen, W., Naughton, J.F., Bernstein, P.A. (eds.) SIGMOD. p. 583. ACM (2000). https://doi.org/10.1145/342009.335494
17. Fehrenbach, S., Cheney, J.: Language-integrated provenance. Science of Computer Programming, 103–145 (2018)
18. Foster, J.N., Green, T.J., Tannen, V.: Annotated XML: queries and provenance. In: PODS. pp. 271–280 (2008)
19. Fowler, S., Harding, S., Sharman, J., Cheney, J.: Cross-tier web programming for curated databases: a case study. International Journal of Digital Curation (1) (2020). https://doi.org/10.2218/ijdc.v15i1.717, pre-print presented at IDCC 2020
20. Gibbons, J., Henglein, F., Hinze, R., Wu, N.: Relational algebra by way of adjunctions. Proc. ACM Program. Lang. (ICFP) (Jul 2018). https://doi.org/10.1145/3236781
21. Giorgidze, G., Grust, T., Schreiber, T., Weijers, J.: Haskell boards the Ferry - database-supported program execution for Haskell. In: IFL. pp. 1–18. No. 6647 in LNCS, Springer-Verlag (2010)
22. Giorgidze, G., Grust, T., Ulrich, A., Weijers, J.: Algebraic data types for language-integrated queries. In: DDFP. pp. 5–10 (2013)
23. Green, T.J., Karvounarakis, G., Tannen, V.: Provenance semirings. In: PODS (2007)
24. Grust, T., Mayr, M., Rittinger, J.: Let SQL drive the XQuery workhorse (XQuery join graph isolation). In: EDBT. pp. 147–158 (2010). https://doi.org/10.1145/1739041.1739062
25. Grust, T., Mayr, M., Rittinger, J., Schreiber, T.: Ferry: Database-supported program execution. In: SIGMOD (June 2009)
26. Grust, T., Rittinger, J., Schreiber, T.: Avalanche-safe LINQ compilation. PVLDB (1) (2010)
27. Grust, T., Rittinger, J., Teubner, J.: Pathfinder: XQuery off the relational shelf. IEEE Data Eng. Bull. (4) (2008)
28. Grust, T., Scholl, M.H.: How to comprehend queries functionally. J. Intell. Inf. Syst. (2-3), 191–218 (1999). https://doi.org/10.1023/A:1008705026446
29. Grust, T., Ulrich, A.: First-class functions for first-order database engines. In: DBPL (2013), http://arxiv.org/abs/1308.0158
30. Johnsson, T.: Lambda lifting: Transforming programs to recursive equations. In: FPCA. pp. 190–203 (1985). https://doi.org/10.1007/3-540-15975-4_37
31. Katsushima, T., Kiselyov, O.: Language-integrated query with ordering, grouping and outer joins (poster paper). In: PEPM. pp. 123–124 (2017)
32. Kiselyov, O., Katsushima, T.: Sound and efficient language-integrated query - maintaining the ORDER. In: APLAS 2017. pp. 364–383 (2017). https://doi.org/10.1007/978-3-319-71237-6_18
33. Libkin, L., Wong, L.: Query languages for bags and aggregate functions. J. Comput. Syst. Sci. (2) (1997). https://doi.org/10.1006/jcss.1997.1523
34. Lindley, S., Cheney, J.: Row-based effect types for database integration. In: TLDI (2012). https://doi.org/10.1145/2103786.2103798
35. Lindley, S., Wadler, P.: The audacity of hope: Thoughts on reclaiming the database dream. In: ESOP (2010)
36. Meijer, E., Beckman, B., Bierman, G.M.: LINQ: reconciling object, relations and XML in the .NET framework. In: SIGMOD (2006). https://doi.org/10.1145/1142473.1142552
37. Minamide, Y., Morrisett, J.G., Harper, R.: Typed closure conversion. In: POPL. pp. 271–283 (1996). https://doi.org/10.1145/237721.237791
38. Neumann, T., Kemper, A.: Unnesting arbitrary queries. In: Datenbanksysteme für Business, Technologie und Web (BTW). pp. 383–402 (2015)
39. Okura, R., Kameyama, Y.: Language-integrated query with nested data structures and grouping. In: FLOPS. pp. 139–158 (2020). https://doi.org/10.1007/978-3-030-59025-3_9
40. Paredaens, J., Van Gucht, D.: Converting nested algebra expressions into flat algebra expressions. ACM Trans. Database Syst. (1) (1992). https://doi.org/10.1145/128765.128768
41. Quill: Compile-time language integrated queries for Scala. Open source project, https://github.com/getquill/quill
42. Ricciotti, W., Cheney, J.: Mixing set and bag semantics. In: DBPL. pp. 70–73 (2019). https://doi.org/10.1145/3315507.3330202
43. Ricciotti, W., Cheney, J.: Strongly normalizing higher-order relational queries. In: FSCD. pp. 28:1–28:22 (2020). https://doi.org/10.4230/LIPIcs.FSCD.2020.28
44. Russell, C.: Bridging the object-relational divide. Queue (May 2008). https://doi.org/10.1145/1394127.1394139
45. Schek, H., Scholl, M.H.: The relational model with relation-valued attributes. Inf. Syst. (2), 137–147 (1986). https://doi.org/10.1016/0306-4379(86)90003-7
46. Stolarek, J., Cheney, J.: Language-integrated provenance in Haskell. The Art, Science, and Engineering of Programming (3), A11 (2018)
47. Suzuki, K., Kiselyov, O., Kameyama, Y.: Finally, safely-extensible and efficient language-integrated query. In: PEPM. pp. 37–48 (2016). https://doi.org/10.1145/2847538.2847542
48. Syme, D.: Leveraging .NET meta-programming components from F (1-2) (2001)
54. Wong, L.: Normal forms and conservative extension properties for query languages over collection types. J. Comput. Syst. Sci. (3) (1996). https://doi.org/10.1006/jcss.1996.0037
55. Wong, L.: Kleisli, a functional query system. J. Funct. Program. (1) (2000). https://doi.org/10.1017/S0956796899003585
A NRC_λ(Set, Bag)
A.1 Type system
We give here the full set of typing rules for NRC_λ(Set, Bag) that we omitted from the main body of the paper: they are shown in Figure 10.
\[
\frac{x : \sigma \in \Gamma}{\Gamma \vdash x : \sigma}
\qquad
\frac{\Sigma(c) = \overrightarrow{b_n} \to b \quad (\Gamma \vdash M_i : b_i)_{i=1,\ldots,n}}{\Gamma \vdash c(\overrightarrow{M_n}) : b}
\]
\[
\frac{(\Gamma \vdash M_i : \sigma_i)_{i=1,\ldots,n}}{\Gamma \vdash \langle \overrightarrow{\ell = M} \rangle : \langle \overrightarrow{\ell : \sigma} \rangle}
\qquad
\frac{\Gamma \vdash M : \langle \overrightarrow{\ell : \sigma} \rangle \quad i = 1,\ldots,n}{\Gamma \vdash M.\ell_i : \sigma_i}
\]
\[
\frac{\Gamma, x : \sigma \vdash M : \tau}{\Gamma \vdash \lambda x^{\sigma}.M : \sigma \to \tau}
\qquad
\frac{\Gamma \vdash M : \sigma \to \tau \quad \Gamma \vdash N : \sigma}{\Gamma \vdash (M\ N) : \tau}
\]
\[
\frac{}{\Gamma \vdash \emptyset_{\sigma} : \{\sigma\}}
\qquad
\frac{\Gamma \vdash M : \sigma}{\Gamma \vdash \{M\} : \{\sigma\}}
\qquad
\frac{\Gamma \vdash M : \{\sigma\} \quad \Gamma \vdash N : \{\sigma\}}{\Gamma \vdash M \cup N : \{\sigma\}}
\]
\[
\frac{(\Gamma, x_1 : \sigma_1, \ldots, x_{i-1} : \sigma_{i-1} \vdash N_i : \{\sigma_i\})_{i=1,\ldots,n} \quad \Gamma, \overrightarrow{x : \sigma} \vdash M : \{\tau\}}{\Gamma \vdash \bigcup\{M \mid \overrightarrow{x \leftarrow N}\} : \{\tau\}}
\]
\[
\frac{\Gamma \vdash M : \{\sigma\}}{\Gamma \vdash \mathsf{empty}_{\mathsf{set}}(M) : \mathbf{B}}
\qquad
\frac{\Gamma \vdash M : \{\sigma\} \quad \Gamma \vdash N : \mathbf{B}}{\Gamma \vdash M\ \mathsf{where}_{\mathsf{set}}\ N : \{\sigma\}}
\]
\[
\frac{}{\Gamma \vdash \mho_{\sigma} : \lbag\sigma\rbag}
\qquad
\frac{\Gamma \vdash M : \sigma}{\Gamma \vdash \lbag M \rbag : \lbag\sigma\rbag}
\qquad
\frac{\Gamma \vdash M : \lbag\sigma\rbag \quad \Gamma \vdash N : \lbag\sigma\rbag}{\Gamma \vdash M \uplus N : \lbag\sigma\rbag}
\]
\[
\frac{(\Gamma, x_1 : \sigma_1, \ldots, x_{i-1} : \sigma_{i-1} \vdash N_i : \lbag\sigma_i\rbag)_{i=1,\ldots,n} \quad \Gamma, \overrightarrow{x : \sigma} \vdash M : \lbag\tau\rbag}{\Gamma \vdash \biguplus\lbag M \mid \overrightarrow{x \leftarrow N} \rbag : \lbag\tau\rbag}
\]
\[
\frac{\Gamma \vdash M : \lbag\sigma\rbag}{\Gamma \vdash \mathsf{empty}_{\mathsf{bag}}(M) : \mathbf{B}}
\qquad
\frac{\Gamma \vdash M : \lbag\sigma\rbag \quad \Gamma \vdash N : \mathbf{B}}{\Gamma \vdash M\ \mathsf{where}_{\mathsf{bag}}\ N : \lbag\sigma\rbag}
\]
\[
\frac{\Gamma \vdash M : \lbag\sigma\rbag}{\Gamma \vdash \delta M : \{\sigma\}}
\qquad
\frac{\Gamma \vdash M : \{\sigma\}}{\Gamma \vdash \iota M : \lbag\sigma\rbag}
\]
Fig. 10. Type system of NRC_λ(Set, Bag).
A.2 Normalization
We show in Figure 11 the rewrite system used to normalize NRC_λ(Set, Bag) queries.
A.3 Semantics
We follow the K-relation style of semantics, as introduced by Green et al. [23] and used for formalization by Chu et al. [10].
\[
\begin{array}{rcll}
(\lambda x.M)\ N & \leadsto & M[N/x] \\
\langle \ldots, \ell = M, \ldots \rangle.\ell & \leadsto & M \\[4pt]
\bigcup\{\emptyset \mid \Theta\} & \leadsto & \emptyset \\
\bigcup\{M \mid \Theta, x \leftarrow \emptyset, \Theta'\} & \leadsto & \emptyset \\
\bigcup\{M \mid \Theta, x \leftarrow \{N\}, \Theta'\} & \leadsto & \bigcup\{M[N/x] \mid \Theta, \Theta'[N/x]\} \\
\bigcup\{M \cup N \mid \Theta\} & \leadsto & \bigcup\{M \mid \Theta\} \cup \bigcup\{N \mid \Theta\} \\
\bigcup\{M \mid \Theta, x \leftarrow N \cup R, \Theta'\} & \leadsto & \bigcup\{M \mid \Theta, x \leftarrow N, \Theta'\} \cup \bigcup\{M \mid \Theta, x \leftarrow R, \Theta'\} \\
\bigcup\{M \mid \Theta, x \leftarrow \bigcup\{R \mid \Theta'\}, \Theta''\} & \leadsto & \bigcup\{M \mid \Theta, \Theta', x \leftarrow R, \Theta''\} & (\text{if } \mathrm{dom}(\Theta') \notin \mathrm{FV}(M, \Theta'')) \\
\bigcup\{\bigcup\{M \mid \Theta'\} \mid \Theta\} & \leadsto & \bigcup\{M \mid \Theta, \Theta'\} \\
\bigcup\{M \mid \Theta, x \leftarrow R\ \mathsf{where}_{\mathsf{set}}\ N, \Theta'\} & \leadsto & \bigcup\{M\ \mathsf{where}_{\mathsf{set}}\ N \mid \Theta, x \leftarrow R, \Theta'\} & (\text{if } x \notin \mathrm{FV}(N)) \\
\bigcup\{\delta(M - N) \mid \Theta\} & \leadsto & \bigcup\{\{z\} \mid \Theta, z \leftarrow \delta(M - N)\} \\
\bigcup\{\delta(M - N)\ \mathsf{where}_{\mathsf{set}}\ R \mid \Theta\} & \leadsto & \bigcup\{\{z\}\ \mathsf{where}_{\mathsf{set}}\ R \mid \Theta, z \leftarrow \delta(M - N)\} & (\text{if } z \notin \mathrm{FV}(R)) \\[4pt]
M\ \mathsf{where}_{\mathsf{set}}\ \mathsf{true} & \leadsto & M \\
M\ \mathsf{where}_{\mathsf{set}}\ \mathsf{false} & \leadsto & \emptyset \\
\emptyset\ \mathsf{where}_{\mathsf{set}}\ M & \leadsto & \emptyset \\
(N \cup R)\ \mathsf{where}_{\mathsf{set}}\ M & \leadsto & (N\ \mathsf{where}_{\mathsf{set}}\ M) \cup (R\ \mathsf{where}_{\mathsf{set}}\ M) \\
\bigcup\{N \mid \Theta\}\ \mathsf{where}_{\mathsf{set}}\ M & \leadsto & \bigcup\{N\ \mathsf{where}_{\mathsf{set}}\ M \mid \Theta\} & (\text{if } \mathrm{dom}(\Theta) \cap \mathrm{FV}(M) = \emptyset) \\
R\ \mathsf{where}_{\mathsf{set}}\ N\ \mathsf{where}_{\mathsf{set}}\ M & \leadsto & R\ \mathsf{where}_{\mathsf{set}}\ (M \wedge N) \\[4pt]
\biguplus\lbag \mho \mid \Theta \rbag & \leadsto & \mho \\
\biguplus\lbag M \mid \Theta, x \leftarrow \mho, \Theta' \rbag & \leadsto & \mho \\
\biguplus\lbag M \mid \Theta, x \leftarrow \lbag N \rbag, \Theta' \rbag & \leadsto & \biguplus\lbag M[N/x] \mid \Theta, \Theta'[N/x] \rbag \\
\biguplus\lbag M \uplus N \mid \Theta \rbag & \leadsto & \biguplus\lbag M \mid \Theta \rbag \uplus \biguplus\lbag N \mid \Theta \rbag \\
\biguplus\lbag M \mid \Theta, x \leftarrow N \uplus R, \Theta' \rbag & \leadsto & \biguplus\lbag M \mid \Theta, x \leftarrow N, \Theta' \rbag \uplus \biguplus\lbag M \mid \Theta, x \leftarrow R, \Theta' \rbag \\
\biguplus\lbag M \mid \Theta, x \leftarrow \biguplus\lbag R \mid \Theta' \rbag, \Theta'' \rbag & \leadsto & \biguplus\lbag M \mid \Theta, \Theta', x \leftarrow R, \Theta'' \rbag & (\text{if } \mathrm{dom}(\Theta') \notin \mathrm{FV}(M, \Theta'')) \\
\biguplus\lbag \biguplus\lbag M \mid \Theta' \rbag \mid \Theta \rbag & \leadsto & \biguplus\lbag M \mid \Theta, \Theta' \rbag \\
\biguplus\lbag M \mid \Theta, x \leftarrow R\ \mathsf{where}_{\mathsf{bag}}\ N, \Theta' \rbag & \leadsto & \biguplus\lbag M\ \mathsf{where}_{\mathsf{bag}}\ N \mid \Theta, x \leftarrow R, \Theta' \rbag & (\text{if } x \notin \mathrm{FV}(N)) \\
\biguplus\lbag \iota M \mid \Theta \rbag & \leadsto & \biguplus\lbag \lbag z \rbag \mid \Theta, z \leftarrow \iota M \rbag \\
\biguplus\lbag \iota M\ \mathsf{where}_{\mathsf{bag}}\ R \mid \Theta \rbag & \leadsto & \biguplus\lbag \lbag z \rbag\ \mathsf{where}_{\mathsf{bag}}\ R \mid \Theta, z \leftarrow \iota M \rbag & (\text{if } z \notin \mathrm{FV}(R)) \\
\biguplus\lbag M - N \mid \Theta \rbag & \leadsto & \biguplus\lbag \lbag z \rbag \mid \Theta, z \leftarrow M - N \rbag \\
\biguplus\lbag (M - N)\ \mathsf{where}_{\mathsf{bag}}\ R \mid \Theta \rbag & \leadsto & \biguplus\lbag \lbag z \rbag\ \mathsf{where}_{\mathsf{bag}}\ R \mid \Theta, z \leftarrow M - N \rbag & (\text{if } z \notin \mathrm{FV}(R)) \\[4pt]
M\ \mathsf{where}_{\mathsf{bag}}\ \mathsf{true} & \leadsto & M \\
M\ \mathsf{where}_{\mathsf{bag}}\ \mathsf{false} & \leadsto & \mho \\
\mho\ \mathsf{where}_{\mathsf{bag}}\ M & \leadsto & \mho \\
(N \uplus R)\ \mathsf{where}_{\mathsf{bag}}\ M & \leadsto & (N\ \mathsf{where}_{\mathsf{bag}}\ M) \uplus (R\ \mathsf{where}_{\mathsf{bag}}\ M) \\
\biguplus\lbag N \mid \Theta \rbag\ \mathsf{where}_{\mathsf{bag}}\ M & \leadsto & \biguplus\lbag N\ \mathsf{where}_{\mathsf{bag}}\ M \mid \Theta \rbag & (\text{if } \mathrm{dom}(\Theta) \cap \mathrm{FV}(M) = \emptyset) \\
R\ \mathsf{where}_{\mathsf{bag}}\ N\ \mathsf{where}_{\mathsf{bag}}\ M & \leadsto & R\ \mathsf{where}_{\mathsf{bag}}\ (M \wedge N) \\[4pt]
\delta\mho & \leadsto & \emptyset \\
\delta\lbag M \rbag & \leadsto & \{M\} \\
\delta(M \uplus N) & \leadsto & \delta M \cup \delta N \\
\delta\biguplus\lbag M \mid \Theta \rbag & \leadsto & \bigcup\{\delta M \mid \Theta^{\delta}\} \\
\delta(M\ \mathsf{where}_{\mathsf{bag}}\ N) & \leadsto & \delta M\ \mathsf{where}_{\mathsf{set}}\ N \\
\delta\iota M & \leadsto & M \\
\iota\emptyset & \leadsto & \mho \\
\iota(M\ \mathsf{where}_{\mathsf{set}}\ N) & \leadsto & \iota M\ \mathsf{where}_{\mathsf{bag}}\ N \\[4pt]
\mathsf{empty}_{\mathsf{set}}(M) & \leadsto & \mathsf{empty}_{\mathsf{set}}(\bigcup\{\{\langle\rangle\} \mid x \leftarrow M\}) & (\text{if } M \text{ is not a flat set}) \\
\mathsf{empty}_{\mathsf{bag}}(M) & \leadsto & \mathsf{empty}_{\mathsf{bag}}(\biguplus\lbag \lbag\langle\rangle\rbag \mid x \leftarrow M \rbag) & (\text{if } M \text{ is not a flat bag})
\end{array}
\]
\[
\begin{array}{rcll}
(\overrightarrow{x \leftarrow M})[N/y] & \triangleq & \overrightarrow{x \leftarrow M[N/y]} & (\text{if } x \neq y,\ \mathrm{FV}(N) \cap \overrightarrow{x} = \emptyset) \\
(\overrightarrow{x \leftarrow M})^{\delta} & \triangleq & \overrightarrow{x \leftarrow \delta M}
\end{array}
\]
Fig. 11. Query normalization
\[
\begin{array}{rcl}
\llbracket \emptyset \rrbracket\rho & = & \lambda u.\,0 \\
\llbracket \{M\} \rrbracket\rho & = & \lambda u.\,(\llbracket M \rrbracket\rho = u) \\
\llbracket M \cup N \rrbracket\rho & = & \lambda u.\,\llbracket M \rrbracket\rho\,u \vee \llbracket N \rrbracket\rho\,u \\
\llbracket \bigcup\{N \mid x \leftarrow M\} \rrbracket\rho & = & \lambda u.\,\bigvee_{v} \llbracket M \rrbracket\rho\,v \wedge \llbracket N \rrbracket\rho[x \mapsto v]\,u \\
\llbracket \mathsf{empty}_{\mathsf{set}}(M) \rrbracket\rho & = & \lambda u.\,\neg(\bigvee_{u} \llbracket M \rrbracket\rho\,u) \\
\llbracket M\ \mathsf{where}_{\mathsf{set}}\ N \rrbracket\rho & = & \lambda u.\,\llbracket M \rrbracket\rho\,u \wedge \llbracket N \rrbracket\rho \\
\llbracket \delta(M) \rrbracket\rho & = & \lambda u.\,\zeta(\llbracket M \rrbracket\rho\,u) \\[4pt]
\llbracket \mho \rrbracket\rho & = & \lambda u.\,0 \\
\llbracket \lbag M \rbag \rrbracket\rho & = & \lambda u.\,\chi(\llbracket M \rrbracket\rho = u) \\
\llbracket M \uplus N \rrbracket\rho & = & \lambda u.\,\llbracket M \rrbracket\rho\,u + \llbracket N \rrbracket\rho\,u \\
\llbracket \biguplus\lbag N \mid x \leftarrow M \rbag \rrbracket\rho & = & \lambda u.\,\sum_{v} \llbracket M \rrbracket\rho\,v \times \llbracket N \rrbracket\rho[x \mapsto v]\,u \\
\llbracket M\ \mathsf{where}_{\mathsf{bag}}\ N \rrbracket\rho & = & \lambda u.\,\llbracket M \rrbracket\rho\,u \times \chi(\llbracket N \rrbracket\rho) \\
\llbracket \mathsf{empty}_{\mathsf{bag}}(M) \rrbracket\rho & = & \lambda u.\,\neg(\zeta(\sum_{u} \llbracket M \rrbracket\rho\,u)) \\
\llbracket M - N \rrbracket\rho & = & \lambda u.\,\llbracket M \rrbracket\rho\,u \mathbin{\dot{-}} \llbracket N \rrbracket\rho\,u \\
\llbracket \iota(M) \rrbracket\rho & = & \lambda u.\,\chi(\llbracket M \rrbracket\rho\,u)
\end{array}
\]
Fig. 12. Semantics of set and multiset operations of NRC_λ(Set, Bag)
Basic types and records are represented by the usual interpretations of such types, and the details are elided. For set types, the interpretation of a set type $\{A\}$ is $\llbracket A \rrbracket \to_{fs} \{0, 1\}$. Here $\to_{fs}$ denotes the set of finitely-supported functions from $\llbracket A \rrbracket$, where the support is the set of elements mapped to a nonzero value. We consider $\{0, 1\}$ equipped with the usual structure of a Boolean algebra, with operations $\wedge, \vee, \neg$, and we consider equality and other meta-level predicates as functions returning Boolean values. Likewise, we consider bag types $\lbag A \rbag$ to be interpreted as finitely-supported functions $\llbracket A \rrbracket \to_{fs} \mathbb{N}$, where $\mathbb{N}$ is the set of natural numbers, equipped with the usual arithmetic operations $+, \mathbin{\dot{-}}, \times$; here $\dot{-}$ is truncated subtraction $m \mathbin{\dot{-}} n = \max(m - n, 0)$. We write $\chi : \{0, 1\} \to \mathbb{N}$ for the “characteristic function” and $\zeta : \mathbb{N} \to \{0, 1\}$ for the “nonzero test” function $x \mapsto (x > 0)$ that maps 0 to 0 and any nonzero value to 1. Note that $\zeta(\chi(n)) = n$.
Since we work with finitely-supported functions $f, p$, we write $\sum_{u} f(u)$ (resp. $\bigvee_{u} p(u)$) for the summation (resp. disjunction) over all possible $u$ of $f(u)$ (resp. $p(u)$). Although this summation or disjunction is infinite, the number of values of $u$ for which $f$ / $p$ can be nonzero is finite, so this is a finite sum or disjunction and thus well-defined. Finally, NRC_λ(Set, Bag) also includes function types, lambda abstraction, and application, but not recursion; their addition poses no difficulty, and since these features can be normalized away prior to applying the results in this paper, we do not explicitly discuss them in the semantics.
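As an informal illustration of this semantics, the following Python sketch represents a finitely-supported function into $\{0, 1\}$ by a Python `set` (its support) and a finitely-supported function into $\mathbb{N}$ by a `collections.Counter`. This encoding and all function names (`chi`, `zeta`, `bag_union`, `bag_diff`, `delta`, `iota`, `bag_comprehension`) are assumptions made for the example; they are not part of the formal development.

```python
from collections import Counter

def chi(b):
    """Characteristic function {0,1} -> N: maps false to 0 and true to 1."""
    return 1 if b else 0

def zeta(n):
    """Nonzero test N -> {0,1}: true exactly when n > 0."""
    return n > 0

def bag_union(m, n):
    """Multiset union: multiplicities are added pointwise."""
    return m + n

def bag_diff(m, n):
    """Bag difference: pointwise truncated subtraction max(m(u) - n(u), 0).
    Counter subtraction already discards non-positive counts."""
    return m - n

def delta(m):
    """Deduplication: the set of elements with nonzero multiplicity."""
    return {u for u, k in m.items() if zeta(k)}

def iota(s):
    """Promotion: every element of the set gets multiplicity chi(true) = 1."""
    return Counter({u: chi(True) for u in s})

def bag_comprehension(body, gen):
    """Bag comprehension over one generator: for each u, sum over v of
    gen(v) * body(v)(u). Only the finite support of gen is visited."""
    out = Counter()
    for v, k in gen.items():
        for u, j in body(v).items():
            out[u] += k * j
    return out
```

For instance, `delta(iota(s))` returns `s` for any finite set `s`, mirroring the normalization rule for deduplication applied to a promoted set.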
B Proofs for Section 5
Lemma 5.
1. $\sum_{t} \chi(t = u) \times e(t, u) = e(u, u)$
2. $\chi(\zeta(e)) \times e = \chi(e > 0) \times e = e$
Proof. For part (1), all of the summands are zero except (possibly) when $t = u$. Part (2) follows by a simple case analysis on $e > 0$; if $e = 0$ then both sides are zero, while if $e > 0$ then $\chi(e > 0) \times e = 1 \times e = e$. □
Lemma 6 (Commutativity). Suppose $\{x, y\} \cap \mathrm{FV}(N, P) = \emptyset$. Then
\[
\begin{array}{rcl}
\biguplus\lbag M \mid x \leftarrow N, y \leftarrow P \rbag & \equiv & \biguplus\lbag M \mid y \leftarrow P, x \leftarrow N \rbag \\
\bigcup\{M \mid x \leftarrow N, y \leftarrow P\} & \equiv & \bigcup\{M \mid y \leftarrow P, x \leftarrow N\}
\end{array}
\]
Proof. Straightforward by unfolding definitions. □
Recall (for example from Buneman et al. [4]) that set membership $M \in N$ is definable as $\neg\,\mathsf{empty}_{\mathsf{set}}(\{x \mid x \leftarrow N, x = M\})$. It is straightforward to show that $\llbracket M \in N \rrbracket\rho\,v = \llbracket N \rrbracket\rho\,(\llbracket M \rrbracket\rho)$, that is, the result is true iff the interpretation of $N$ returns true on the interpretation of $M$. We will use this as a primitive in the following proofs. First we observe that when $x$ was introduced by a generator $x \leftarrow N$, then it is redundant to check that $x \in \delta(N)$ (if $N$ is a bag) or $x \in N$ (if $N$ is a set).
Lemma 7.
\[
\begin{array}{rcl}
\biguplus\lbag M\ \mathsf{where}_{\mathsf{bag}}\ x \in \delta(N) \mid x \leftarrow N \rbag & \equiv & \biguplus\lbag M \mid x \leftarrow N \rbag \\
\bigcup\{M\ \mathsf{where}_{\mathsf{set}}\ x \in N \mid x \leftarrow N\} & \equiv & \bigcup\{M \mid x \leftarrow N\}
\end{array}
\]
Proof. For the first equation we reason as follows:
\[
\begin{aligned}
& \llbracket \biguplus\lbag M\ \mathsf{where}_{\mathsf{bag}}\ x \in \delta(N) \mid x \leftarrow N \rbag \rrbracket\rho\,v \\
&= \sum_{u} \llbracket M\ \mathsf{where}_{\mathsf{bag}}\ x \in \delta(N) \rrbracket\rho[x \mapsto u]\,v \times \llbracket N \rrbracket\rho\,u \\
&= \sum_{u} \llbracket M \rrbracket\rho[x \mapsto u]\,v \times \chi(\llbracket x \in \delta(N) \rrbracket\rho[x \mapsto u]) \times \llbracket N \rrbracket\rho\,u \\
&= \sum_{u} \llbracket M \rrbracket\rho[x \mapsto u]\,v \times \chi(\llbracket \delta(N) \rrbracket\rho[x \mapsto u]\,(\llbracket x \rrbracket\rho[x \mapsto u])) \times \llbracket N \rrbracket\rho\,u \\
&= \sum_{u} \llbracket M \rrbracket\rho[x \mapsto u]\,v \times \chi(\zeta(\llbracket N \rrbracket\rho\,u)) \times \llbracket N \rrbracket\rho\,u \\
&= \sum_{u} \llbracket M \rrbracket\rho[x \mapsto u]\,v \times \llbracket N \rrbracket\rho\,u \\
&= \llbracket \biguplus\lbag M \mid x \leftarrow N \rbag \rrbracket\rho\,v
\end{aligned}
\]
The fifth line uses part (2) of Lemma 5. The proof of the second equation is similar, but simpler. □
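The first equation of Lemma 7 can be checked on a concrete instance. The sketch below is a sanity check under assumptions, not a proof: bags are encoded as `collections.Counter` values, and the example bag `N`, the body `M`, and the helper names are hypothetical choices made for illustration.

```python
from collections import Counter

def bag_comprehension(body, gen):
    """Bag comprehension over one generator: multiplicities multiply."""
    out = Counter()
    for v, k in gen.items():
        for u, j in body(v).items():
            out[u] += k * j
    return out

def where_bag(m, cond):
    """M where_bag N: keep M if the Boolean condition holds, else the empty bag."""
    return m if cond else Counter()

def delta(m):
    """Deduplication: the set of elements with nonzero multiplicity."""
    return {u for u, k in m.items() if k > 0}

# Hypothetical generator bag N and body M(x); "a" occurs twice in N.
N = Counter({"a": 2, "b": 1})

def M(x):
    return Counter({x.upper(): 1, "z": 1})

# Left-hand side: the body is guarded by the redundant test x in delta(N).
guarded = bag_comprehension(lambda x: where_bag(M(x), x in delta(N)), N)
# Right-hand side: the same comprehension without the guard.
plain = bag_comprehension(M, N)
```

Because every `x` drawn from `N` has nonzero multiplicity in `N`, the guard always holds and the two bags coincide, multiplicities included.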
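Lemma 8.
\[
\begin{array}{rcl}
\mathcal{G}_{\mathsf{bag}}(x \leftarrow N; M) \circledast O & \equiv & M[O/x]\ \mathsf{where}_{\mathsf{bag}}\ O \in N \\
\mathcal{G}_{\mathsf{set}}(x \leftarrow N; M) \circledast O & \equiv & M[O/x]\ \mathsf{where}_{\mathsf{set}}\ O \in N
\end{array}
\]
Proof. The proofs are similar; we show the first.
\[
\begin{aligned}
\llbracket \mathcal{G}(x \leftarrow N; M) \circledast O \rrbracket\rho\,v
&= \llbracket \mathcal{G}(x \leftarrow N; M) \rrbracket\rho\,(\llbracket O \rrbracket\rho, v) \\
&= \chi(\llbracket N \rrbracket\rho\,(\llbracket O \rrbracket\rho)) \times \llbracket M \rrbracket\rho[x \mapsto \llbracket O \rrbracket\rho]\,v \\
&= \llbracket M[O/x] \rrbracket\rho\,v \times \chi(\llbracket O \in N \rrbracket\rho\,v) \\
&= \llbracket M[O/x]\ \mathsf{where}_{\mathsf{bag}}\ O \in N \rrbracket\rho\,v
\end{aligned}
\]
□
Next we show graph construction commutes with promotion, deduplication, union, multiset union and difference: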
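Lemma 9. $\iota(\mathcal{G}_{\mathsf{set}}(\overrightarrow{x \leftarrow N}; M)) \equiv \mathcal{G}_{\mathsf{bag}}(\overrightarrow{x \leftarrow N}; \iota(M))$
Proof.
\[
\begin{aligned}
\llbracket \iota(\mathcal{G}_{\mathsf{set}}(\overrightarrow{x \leftarrow N}; M)) \rrbracket\rho\,(\vec{u}, v)
&= \chi(\llbracket \mathcal{G}_{\mathsf{set}}(\overrightarrow{x \leftarrow N}; M) \rrbracket\rho\,(\vec{u}, v)) \\
&= \chi(\llbracket \vec{N} \rrbracket\rho\,\vec{u} \wedge \llbracket M \rrbracket\rho[\overrightarrow{x \mapsto u}]\,v) \\
&= \chi(\llbracket \vec{N} \rrbracket\rho\,\vec{u}) \times \chi(\llbracket M \rrbracket\rho[\overrightarrow{x \mapsto u}]\,v) \\
&= \chi(\llbracket \vec{N} \rrbracket\rho\,\vec{u}) \times \llbracket \iota(M) \rrbracket\rho[\overrightarrow{x \mapsto u}]\,v \\
&= \llbracket \mathcal{G}_{\mathsf{bag}}(\overrightarrow{x \leftarrow N}; \iota(M)) \rrbracket\rho\,(\vec{u}, v)
\end{aligned}
\]
□
Corollary 2. $\iota(\mathcal{G}_{\mathsf{set}}(x \leftarrow N; M)) \circledast O \equiv \iota(M[O/x])\ \mathsf{where}_{\mathsf{bag}}\ O \in N$
Lemma 10. $\delta(\mathcal{G}_{\mathsf{bag}}(\overrightarrow{x \leftarrow N}; M)) \equiv \mathcal{G}_{\mathsf{set}}(\overrightarrow{x \leftarrow N}; \delta(M))$
Proof.
\[
\begin{aligned}
\llbracket \delta(\mathcal{G}_{\mathsf{bag}}(\overrightarrow{x \leftarrow N}; M)) \rrbracket\rho\,(\vec{u}, v)
&= \zeta(\llbracket \mathcal{G}_{\mathsf{bag}}(\overrightarrow{x \leftarrow N}; M) \rrbracket\rho\,(\vec{u}, v)) \\
&= \zeta(\chi(\llbracket \vec{N} \rrbracket\rho\,\vec{u}) \times \llbracket M \rrbracket\rho[\overrightarrow{x \mapsto u}]\,v) \\
&= \zeta(\chi(\llbracket \vec{N} \rrbracket\rho\,\vec{u})) \wedge \zeta(\llbracket M \rrbracket\rho[\overrightarrow{x \mapsto u}]\,v) \\
&= \llbracket \vec{N} \rrbracket\rho\,\vec{u} \wedge \llbracket \delta(M) \rrbracket\rho[\overrightarrow{x \mapsto u}]\,v \\
&= \llbracket \mathcal{G}_{\mathsf{set}}(\overrightarrow{x \leftarrow N}; \delta(M)) \rrbracket\rho\,(\vec{u}, v)
\end{aligned}
\]
□
Lemma 11. $\mathcal{G}(\overrightarrow{x \leftarrow N}; M_1) \cup \mathcal{G}(\overrightarrow{x \leftarrow N}; M_2) \equiv \mathcal{G}(\overrightarrow{x \leftarrow N}; M_1 \cup M_2)$
Proof.
\[
\begin{aligned}
& \llbracket \mathcal{G}(\overrightarrow{x \leftarrow N}; M_1) \cup \mathcal{G}(\overrightarrow{x \leftarrow N}; M_2) \rrbracket\rho\,(\vec{u}, v) \\
&= \llbracket \mathcal{G}(\overrightarrow{x \leftarrow N}; M_1) \rrbracket\rho\,(\vec{u}, v) \vee \llbracket \mathcal{G}(\overrightarrow{x \leftarrow N}; M_2) \rrbracket\rho\,(\vec{u}, v) \\
&= (\llbracket \vec{N} \rrbracket\rho\,\vec{u} \wedge \llbracket M_1 \rrbracket\rho[\overrightarrow{x \mapsto u}]\,v) \vee (\llbracket \vec{N} \rrbracket\rho\,\vec{u} \wedge \llbracket M_2 \rrbracket\rho[\overrightarrow{x \mapsto u}]\,v) \\
&= \llbracket \vec{N} \rrbracket\rho\,\vec{u} \wedge (\llbracket M_1 \rrbracket\rho[\overrightarrow{x \mapsto u}]\,v \vee \llbracket M_2 \rrbracket\rho[\overrightarrow{x \mapsto u}]\,v) \\
&= \llbracket \vec{N} \rrbracket\rho\,\vec{u} \wedge \llbracket M_1 \cup M_2 \rrbracket\rho[\overrightarrow{x \mapsto u}]\,v \\
&= \llbracket \mathcal{G}(\overrightarrow{x \leftarrow N}; M_1 \cup M_2) \rrbracket\rho\,(\vec{u}, v)
\end{aligned}
\]
□
Lemma 12. $\mathcal{G}(\overrightarrow{x \leftarrow N}; M_1) \uplus \mathcal{G}(\overrightarrow{x \leftarrow N}; M_2) \equiv \mathcal{G}(\overrightarrow{x \leftarrow N}; M_1 \uplus M_2)$
Proof.
\[
\begin{aligned}
& \llbracket \mathcal{G}(\overrightarrow{x \leftarrow N}; M_1) \uplus \mathcal{G}(\overrightarrow{x \leftarrow N}; M_2) \rrbracket\rho\,(\vec{u}, v) \\
&= \llbracket \mathcal{G}(\overrightarrow{x \leftarrow N}; M_1) \rrbracket\rho\,(\vec{u}, v) + \llbracket \mathcal{G}(\overrightarrow{x \leftarrow N}; M_2) \rrbracket\rho\,(\vec{u}, v) \\
&= \chi(\llbracket \vec{N} \rrbracket\rho\,\vec{u}) \times \llbracket M_1 \rrbracket\rho[\overrightarrow{x \mapsto u}]\,v + \chi(\llbracket \vec{N} \rrbracket\rho\,\vec{u}) \times \llbracket M_2 \rrbracket\rho[\overrightarrow{x \mapsto u}]\,v \\
&= \chi(\llbracket \vec{N} \rrbracket\rho\,\vec{u}) \times (\llbracket M_1 \rrbracket\rho[\overrightarrow{x \mapsto u}]\,v + \llbracket M_2 \rrbracket\rho[\overrightarrow{x \mapsto u}]\,v) \\
&= \chi(\llbracket \vec{N} \rrbracket\rho\,\vec{u}) \times \llbracket M_1 \uplus M_2 \rrbracket\rho[\overrightarrow{x \mapsto u}]\,v \\
&= \llbracket \mathcal{G}(\overrightarrow{x \leftarrow N}; M_1 \uplus M_2) \rrbracket\rho\,(\vec{u}, v)
\end{aligned}
\]
□
Lemma 13. $\mathcal{G}(\overrightarrow{x \leftarrow N}; M_1) - \mathcal{G}(\overrightarrow{x \leftarrow N}; M_2) \equiv \mathcal{G}(\overrightarrow{x \leftarrow N}; M_1 - M_2)$
Proof.
\[
\begin{aligned}
& \llbracket \mathcal{G}(\overrightarrow{x \leftarrow N}; M_1) - \mathcal{G}(\overrightarrow{x \leftarrow N}; M_2) \rrbracket\rho\,(\vec{u}, v) \\
&= \llbracket \mathcal{G}(\overrightarrow{x \leftarrow N}; M_1) \rrbracket\rho\,(\vec{u}, v) \mathbin{\dot{-}} \llbracket \mathcal{G}(\overrightarrow{x \leftarrow N}; M_2) \rrbracket\rho\,(\vec{u}, v) \\
&= \chi(\llbracket \vec{N} \rrbracket\rho\,\vec{u}) \times \llbracket M_1 \rrbracket\rho[\overrightarrow{x \mapsto u}]\,v \mathbin{\dot{-}} \chi(\llbracket \vec{N} \rrbracket\rho\,\vec{u}) \times \llbracket M_2 \rrbracket\rho[\overrightarrow{x \mapsto u}]\,v \\
&= \chi(\llbracket \vec{N} \rrbracket\rho\,\vec{u}) \times (\llbracket M_1 \rrbracket\rho[\overrightarrow{x \mapsto u}]\,v \mathbin{\dot{-}} \llbracket M_2 \rrbracket\rho[\overrightarrow{x \mapsto u}]\,v) \\
&= \chi(\llbracket \vec{N} \rrbracket\rho\,\vec{u}) \times \llbracket M_1 - M_2 \rrbracket\rho[\overrightarrow{x \mapsto u}]\,v \\
&= \llbracket \mathcal{G}(\overrightarrow{x \leftarrow N}; M_1 - M_2) \rrbracket\rho\,(\vec{u}, v)
\end{aligned}
\]
□
Corollary 3. $(\mathcal{G}(x \leftarrow N; M_1) - \mathcal{G}(x \leftarrow N; M_2)) \circledast O \equiv (M_1[O/x] - M_2[O/x])\ \mathsf{where}_{\mathsf{bag}}\ O \in N$
We can now use these equivalences to show the correctness of the delateralization rules for promotion and difference: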
Theorem 5. (cid:93) (cid:72) M | x ← N, y ← ι ( P ) (cid:73) ≡ (cid:93) (cid:72) M | x ← N, y ← ι ( G ( x ← N ; P )) (cid:16) x (cid:73) Proof.
We use Cor. 2 and Lemma 7 and standard equivalences for monadiccomprehensions: (cid:93) (cid:72) M | x ← N, y ← ι ( P ) (cid:73) ≡ (cid:93) (cid:72) (cid:93) (cid:72) M | y ← ι ( P ) (cid:73) | x ← N (cid:73) ≡ (cid:93) (cid:72) (cid:93) (cid:72) M | y ← ι ( P ) (cid:73) where bag x ∈ N | x ← N (cid:73) ≡ (cid:93) (cid:72) (cid:93) (cid:72) M | y ← ι ( P ) where bag x ∈ N (cid:73) | x ← N (cid:73) ≡ (cid:93) (cid:72) M | x ← N, y ← ι ( P ) where bag x ∈ N (cid:73) ≡ (cid:93) (cid:72) M | x ← N, y ← ι ( G ( x ← N ; P )) (cid:16) x (cid:73) (cid:117)(cid:116) Theorem 6. (cid:93) (cid:72) M | x ← N, y ← P − P (cid:73) ≡ (cid:93) (cid:72) M | x ← N, y ← G ( x ← δ ( N ); P ) − G ( x ← δ ( N ); P ) (cid:73) Proof.
We use Cor. 3 and Lemma 7 and standard equivalences for monadiccomprehensions: (cid:93) (cid:72) M | x ← N, y ← P − P (cid:73) ≡ (cid:93) (cid:72) (cid:93) (cid:72) M | y ← P − P (cid:73) | x ← N (cid:73) ≡ (cid:93) (cid:72) (cid:93) (cid:72) M | y ← P − P (cid:73) where bag x ∈ δ ( N ) | x ← N (cid:73) ≡ (cid:93) (cid:72) (cid:93) (cid:72) M | y ← ( P − P ) where bag x ∈ δ ( N ) (cid:73) | x ← N (cid:73) ≡ (cid:93) (cid:72) M | x ← N, y ← ( G ( x ← δN ; P ) − G ( x ← δN ; P )) (cid:16) x (cid:73) (cid:117)(cid:116) Theorem 7. (cid:91) { M | x ← N, y ← δ ( P − P ) } ≡ (cid:91) { M | x ← N, y ← δ ( G ( x ← N ; P ) − G ( x ← N ; P )) } Proof.
We use Lemmas 7, 10 and 13 and standard equivalences for monadic comprehensions:
\[
\begin{aligned}
& \bigcup \{ M \mid x \leftarrow N, y \leftarrow \delta(P_1 - P_2) \} \\
\equiv{} & \bigcup \{ \bigcup \{ M \mid y \leftarrow \delta(P_1 - P_2) \} \mid x \leftarrow N \} \\
\equiv{} & \bigcup \{ \bigcup \{ M \mid y \leftarrow \delta(P_1 - P_2) \} \;\mathbf{where}_{\mathsf{set}}\; x \in N \mid x \leftarrow N \} \\
\equiv{} & \bigcup \{ \bigcup \{ M \mid y \leftarrow \delta(P_1 - P_2) \;\mathbf{where}_{\mathsf{set}}\; x \in N \} \mid x \leftarrow N \} \\
\equiv{} & \bigcup \{ \bigcup \{ M \mid y \leftarrow \mathcal{G}(x \leftarrow N; \delta(P_1 - P_2)) \circledast x \} \mid x \leftarrow N \} \\
\equiv{} & \bigcup \{ \bigcup \{ M \mid y \leftarrow \delta(\mathcal{G}(x \leftarrow N; P_1 - P_2)) \circledast x \} \mid x \leftarrow N \} \\
\equiv{} & \bigcup \{ M \mid x \leftarrow N, y \leftarrow \delta(\mathcal{G}(x \leftarrow N; P_1) - \mathcal{G}(x \leftarrow N; P_2)) \circledast x \}
\end{aligned}
\]
$\square$

Now, to prove that delateralization eventually terminates, we consider a metric on query expressions defined as follows: given an expression in normal form, for each subexpression of the form $\iota(N)$ or $M - N$, add up the number of free variables occurring in $M, N$.
\[
\begin{aligned}
\|\mho\| &= 0 \\
\|\{M\}\| = \|\Lbag M \Rbag\| &= \|M\| \\
\|M \cup N\| = \|M \uplus N\| &= \|M\| + \|N\| \\
\|M - N\| &= \|M\| + \|N\| + |FV(M, N)| \\
\|\iota(M)\| &= \|M\| + |FV(M)| \\
\|\delta(M)\| &= \|M\| \\
\|\bigcup \{ M \mid x \leftarrow N \}\| = \|\biguplus \Lbag M \mid x \leftarrow N \Rbag\| &= \|M\| + \|N\| \\
\|M \;\mathbf{where}_{\mathsf{set}}\; N\| = \|M \;\mathbf{where}_{\mathsf{bag}}\; N\| &= \|M\| + \|N\| \\
\|M\| &= 0 \quad \text{otherwise}
\end{aligned}
\]
If the metric is zero, then the query is fully delateralized. Combining the basic delateralization steps above with commutativity, any expression with nonzero metric can be rewritten so as to decrease the metric (though possibly increasing the query size). We can also undo the effects of commutativity steps to restore the original order of generators, to preserve the query structure as much as possible for readability.

Theorem 8.
Given $M$ with $\|M\| > 0$, there exists an equivalent $M'$ with $\|M'\| < \|M\|$ that can be obtained by applying commutativity and basic rewrites. Hence, there exists an equivalent fully-delateralized $M''$ with $\|M''\| = 0$.

Proof. The proof requires establishing that whenever $\|M\| > 0$, there exists at least one outermost subexpression $M_0$ of the form $\iota(N)$ or $N - P$ with $\|M_0\| > 0$; that is, $M_0$ should not be a subexpression of any larger such subexpression of $M$ having the same property. Moreover, $M_0$ must occur as a generator. We need to show that $M_0$ therefore contains at least one free record variable bound earlier in the same comprehension. We can show this by inspection of normal forms. Since this is the case, then (if the generator is not already adjacent) we can commute it to be adjacent to $M_0$ and then apply one of the delateralization rules, decreasing $\|M_0\|$ and hence $\|M\|$. $\square$

C Proofs for Section 6
Lemma 14. If $\Phi; \Theta \vdash M \Mapsto \breve{M} \mid \Psi$, then $\Psi \supseteq \Phi$.

Lemma 15.
Let $M$ be an $\mathcal{NRC}_{\mathcal{G}}$ term and $\Phi$ a shredding set. If $\mathrm{FV}(M) \subseteq \mathrm{dom}(\Phi)$, then for all $\Phi' \supseteq \Phi$ we have $M\,\Phi = M\,\Phi'$. Furthermore, let $\Xi$ and $\Xi'$ be the shredding value sets corresponding to $\Phi$ and $\Phi'$: then $\llparenthesis \lfloor M \rfloor \rrparenthesis\,\Xi = \llparenthesis \lfloor M \rfloor \rrparenthesis\,\Xi'$.

In the following proof, whenever $\Theta = \overrightarrow{x \leftarrow F}$, we use the abbreviation:
\[
\llbracket \Theta \rrbracket \rho\,\vec{v} = \bigwedge_i \llbracket F_i \rrbracket \rho[x_1 \mapsto v_1, \ldots, x_{i-1} \mapsto v_{i-1}]\,v_i
\]

Theorem 4.