[PDF] A Symbolic Approach to Proving Query Equivalence Under Bag Semantics

Abstract

In database-as-a-service platforms, automated verification of query equivalence helps eliminate redundant computation in the form of overlapping sub-queries. Researchers have proposed two pragmatic techniques to tackle this problem. The first approach consists of reducing the queries to algebraic expressions and proving their equivalence using an algebraic theory. The limitations of this technique are threefold. It cannot prove the equivalence of queries with significant differences in the attributes of their relational operators. It does not support certain widely-used SQL features. Its verification procedure is computationally intensive. The second approach transforms this problem to a constraint satisfaction problem and leverages a general-purpose solver to determine query equivalence. This technique consists of deriving the symbolic representation of the queries and proving their equivalence by determining the query containment relationship between the symbolic expressions. While the latter approach addresses all the limitations of the former technique, it only proves the equivalence of queries under set semantics. However, in practice, database applications use bag semantics. In this paper, we introduce a novel symbolic approach for proving query equivalence under bag semantics. We transform the problem of proving query equivalence under bag semantics to that of proving the existence of a bijective, identity map between tuples returned by the queries on all valid inputs. We implement this symbolic approach in SPES and demonstrate that SPES proves the equivalence of a larger set of query pairs (95/232) under bag semantics compared to the state-of-the-art tools based on algebraic (30/232) and symbolic approaches (67/232) under set and bag semantics, respectively. Furthermore, SPES is 3X faster than the symbolic tool that proves equivalence under set semantics.

Full PDF

SSPES: A Two-Stage Query Equivalence Verifier

Qi Zhou Joy Arulraj Shamkant Navathe William Harris Jinpeng [email protected] [email protected] [email protected]

Georgia Institute of Technology Galois.Inc Alibaba Group

ABSTRACT

In database-as-a-service platforms, automated verification of queryequivalence helps eliminate redundant computation across queries( i.e. , overlapping sub-queries). State-of-the-art tools for automateddetection of query equivalence adopt two different approaches. Thefirst technique is based on reducing queries to algebraic expressionsand proving their equivalence using an algebraic theory. The limita-tions of this approach are threefold. It cannot prove the equivalenceof queries with significant differences in the attributes of their re-lational operators ( e.g. , predicates in the filter operator). It doesnot support certain widely-used

SQL features ( e.g. , NULL values).Its verification procedure is computationally intensive. The secondtechnique is based on deriving the symbolic representation of thequeries and proving their equivalence using the satisfiability modulotheory. The limitations of this approach are twofold. It only provesthe equivalence of queries under set semantics. It cannot prove theequivalence of queries with significant structural differences in theirabstract syntax trees.In this paper, we present a novel two-stage approach to automatedverification of query equivalence that addresses the limitations ofthese individual techniques. The first stage consists of reducingqueries to a novel algebraic representation and then normalizing theresulting algebraic expressions to minimize structural differences.The second stage consists of applying a verification algorithm toconvert the normalized algebraic expressions to a novel query pairsymbolic representation and proving their equivalence under bagsemantics using satisfiability modulo theory. We implement ourtwo-stage approach in SPES. SPES proves the equivalence of alarger set of query pairs (90/232) under bag semantics comparedto the state-of-the-art tools based on algebraic (30/232) under bagsemantics and symbolic approaches (67/232) under set semantics.Furthermore, the average query equivalence verification time is 83 × and 3 × shorter than those tools, respectively. PVLDB Reference Format: . SPES: A Two-Stage Query Equivalence Verifier.

PVLDB , xx(yyy): xxxx-yyyy, 2019.DOI: https://doi.org/10.14778/xxxxxxx.xxxxxxx

1. INTRODUCTION

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. To view a copyof this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/. Forany use beyond those covered by this license, obtain permission by [email protected]. Copyright is held by the owner/author(s). Publication rightslicensed to the VLDB Endowment.

Proceedings of the VLDB Endowment,

Vol. xx, No. yyyISSN 2150-8097.DOI: https://doi.org/10.14778/xxxxxxx.xxxxxxx

Database-as-a-service (DBaaS) platforms ( e.g. , Alibaba’s Max-Compute [1], Microsoft’s Azure Data Lake [9], and Google’s Big-Query [11]) enable users to quickly deploy complex data processingpipelines consisting of SQL queries. In practice, these data process-ing pipelines exhibit a significant amount of computational overlap( i.e. , semantically equivalent sub-queries) [17, 50]. This results inhigher resource usage and longer query execution times.Researchers have developed techniques for minimizing redundantcomputation by materializing the overlapping sub-queries as viewsand rewriting the original queries to operate on these materializedviews [43, 33]. All of these techniques rely on an effective andefficient algorithm for automatically deciding the equivalence of apair of SQL queries. Two queries are equivalent if they always returnthe same output table for any given set of input tables. In general,proving query equivalence (QE) is an undecidable problem [14,18]. Given this constraint, prior efforts have focused on a subsetof SQL queries where this problem is decidable ( e.g. , SELECT - PROJECT - JOIN queries) [20, 45, 27, 39]. While this line of researchhas studied the theoretical underpinnings of this problem, thesetechniques are unable to identify overlap in complex SQL queries. P RIOR E FFORTS : Recently, researchers have formulated two prag-matic approaches for automatically proving QE. These efforts arebased on two different representations of a query: (1) algebraicrepresentation, and (2) symbolic representation.UDP is the state-of-the-art prover based on an algebraic ap-proach [23]. It determines QE using three steps. First, it transformsthe queries from an abstract syntax tree (AST) representation toan algebraic representation (AR). Next, it applies a set of rules forrewriting the ARs . Lastly, it attempts to find an isomorphism be-tween the vocabularies of these ARs to determine their equivalencevia proof construction. While this algebraic approach works wellfor proving queries with significant structural differences in theirASTs, it suffers from three limitations. First, UDP cannot provethe equivalence of queries when the attributes in their relationaloperators exhibit significant differences ( e.g. , predicates in the filteroperator). This is because it uses a set of syntax-driven re-writerules to construct the proof. Second, it does not support certainwidely used SQL features ( e.g. , NULL values). Third, its verificationprocedure is computationally intensive due to the long sequence ofre-writes required for proving equivalence.EQUITAS circumvents these limitations of the algebraic ap-proach by adopting an alternate approach based on symbolic repre-sentation [50]. It determines QE using two steps. First, it transformsthe queries from an AST representation to a symbolic representation(SR) ( i.e. , a set of first-order logic (FOL) formulae). A query’sSR symbolically represents the tuples that it returns. Next, it lever- We refer to the algebraic representation of a pair of queries as ARs.1 a r X i v : . [ c s . D B ] A p r ges a general-purpose solver based on satisfiability modulo theory(SMT) to determine the containment relationship between two SRs .If the containment relationship holds in both directions, then thequeries are equivalent. While this symbolic approach addresses thedrawbacks of the algebraic approach, it suffers from two limitations.First, it only proves equivalence of queries under set semantics ( i.e. ,output tables must not contain duplicate tuples [41]). In practice,queries rely on bag semantics ( i.e. , output tables may contain du-plicate tuples [16]) Proving QE under bag semantics is a strictlyharder problem that doing so under set. This is because if twoqueries are equivalent under bag semantics, then they are also equiv-alent under set semantics. However, the converse does not hold.Second, it cannot prove the equivalence of queries with significantstructural differences in their ASTs ( e.g. , aggregate and union oper-ators). These limitations constrain the applicability of the SR-basedapproach. O UR A PPROACH : In this paper, we present a two-stage queryequivalence prover that addresses the limitations of EQUITAS. Toaddress the first limitation, we introduce a novel SR and verificationalgorithm that directly proves QE under bag semantics (withoutdetermining the query containment relationship). To address the sec-ond limitation, we present a novel AR that normalizes the structuraldifferences between queries.We implemented our two-stage approach in SPES, a tool forautomatically verifying the equivalence of SQL queries under bagsemantics. We evaluate SPES using a collection of 232 pairs ofequivalent SQL queries available in the Apache C

ALCITE frame-work [3]. Each query pair is constructed by applying various opti-mization rules on complex

SQL queries with diverse features ( e.g. ,arithmetic operations, three-valued logic for supporting

NULL , sub-queries, grouping, and aggregate functions). Our evaluation showsthat SPES proves the semantic equivalence of a larger set of querypairs (90/232) compared to UDP (34/232) and EQUITAS (67/232).Furthermore, SPES is 83 × faster than UDP and 3 × faster thanEQUITAS on this benchmark. In addition to the C ALCITE bench-mark, we evaluate the efficacy of SPES on a cloud-scale workloadcomprising , real-world SQL queries from Ant Financial Ser-vices Group [2]. SPES automatically found that 27% of the queriesin this workload contain overlapping computation. These queriescontain complex relational operators ( e.g. , aggregate functions). C ONTRIBUTIONS : We make the following contributions: • We illustrate the limitations of AR- and SR-based approachesto determining the equivalence of

SQL queries in §2. • We present a novel AR of queries and a set of rules for normal-izing the structural differences between queries in §4. • We introduce a novel verification algorithm and SR to deter-mine the equivalence of normalized ARs in §5. • We implement this approach in SPES and evaluate its efficacy.We demonstrate that SPES proves the equivalence of a largerset of query pairs in C

ALCITE benchmark compared to UDPand EQUITAS in §7. More importantly, unlike EQUITAS,SPES demonstrates QE under bag semantics.

2. MOTIVATION

We motivate the need for a new approach to automated queryequivalence verification by illustrating the limitations of prior ap-proaches. We illustrate the limitations of the algebraic and symbolicapproaches using a set of examples based on these two tables: • EMP table: ⟨ EMP_ID , SALARY , DEPT_ID , LOCATION ⟩ We refer to the symbolic representation of a pair of queries as SRs. • DEPT table: ⟨ DEPT_ID , DEPT_NAME ⟩ UDP is the state-of-the-art tool for automatically determining QEusing an algebraic approach [23]. It first converts the given pair ofqueries to U -semirings, a family of algebraic structures, to obtainthe U -expression of each query ( i.e. , their ARs). It then applies aset of semantically equivalent algebraic re-write rules to normalizeand simplify each U -expression. UDP seeks to find isomorphismsbetween the vocabularies of the two U -expressions to prove thatthey are syntactically equivalent. It proves QE by demonstratingthat the ARs are equivalent.The algebraic approach cannot prove the equivalence of queriescontaining semantically-equivalent, syntactically-different predi-cates ( e.g. , predicates with arithmetic operators, predicates based onthree-valued logic for handling NULL ). E XAMPLE

1. S

YNTACTICALLY -D IFFERENT P REDICATES : Q1:

SELECT ∗ FROM ( SELECT ∗ FROM

EMP

WHERE

DEPT_ID = 10) AS T WHERE

T.DEPT_ID + 5 > T.EMP_ID;Q2:

SELECT ∗ FROM ( SELECT ∗ FROM

EMP

WHERE

DEPT_ID = 10) AS T WHERE

15 > T.EMP_ID; Q1 first chooses employee tuples whose department id is 10. Itthen applies another filter to retrieve the employee tuples where thedepartment id plus five is greater than employee id. Q2 containsthe same inner query as that in Q1 . The only difference is that Q2 chooses the tuples where the employee id is less than 15. Q1 and Q2 are semantically equivalent since the inner query retrieves tupleswhose department id is 10 (and 10 + 5 = 15). UDP converts thesequeries to the following ARs: Q1 : [ t . DEPT_ID = 10] × [ t . DEPT_ID + 5 > t . EMP_ID ] × EMP ( t ) Q2 : [ t . DEPT_ID = 10] × [15 > t . EMP_ID ] × EMP ( t ) Each algebraic expression is a function that returns the cardinalityof an arbitrary tuple t in the output table ( i.e. , number of timesthe tuple appears). Each predicate is a boolean function that eitherreturns zero or one depending on whether the tuple satisfies thepredicate. For example, [ t . DEPT_ID = 10] returns one when thegiven tuple t satisfies the condition that the department id equalsto 10. Each input table expression is a function that returns thecardinality of an arbitrary tuple t in the input table. For example, EMP ( t ) returns the cardinality of t in EMP table. × represents thearithmetic multiplication operation.UDP cannot prove the semantic equivalence of Q1 and Q2 sinceit cannot use its re-write rules to prove that the two predicates [ t . DEPT_ID + 5 > t . EMP_ID ] and [15 > t . EMP_ID ] are logicallyequivalent when the predicate [ t . DEPT_ID = 10] holds. We notethat UDP may handle this specific example by adding a re-writerule for constant propagation. However, due to the complexityof predicates in general, it is challenging to formulate a set ofre-write rules to normalize all possible predicates. Furthermore,it is challenging to normalize predicates in operators containingadditional attributes ( e.g. , group set in aggregate operator).

EQUITAS is the state-of-the-art tool for automatically determin-ing QE using a symbolic approach [50]. It first converts the givenpair of queries to their SR ( i.e. , a set of FOL formulae). Each SRsymbolically represents the output table of that query. It then uses anSMT solver to verify the relational properties between SRs to provethe containment relationship between queries. On all valid input2ables, if any tuple returned by Q2 is also returned by Q1 , then Q1 contains Q2 . EQUITAS proves QE by proving the containment rela-tionship in both directions ( i.e. , Q1 contains Q2 and Q2 contains Q1 ).Using the SRs and the SMT solver, EQUITAS proves the equiva-lence of queries containing semantically-equivalent, syntactically-different predicates (Listing 2.1).While it overcomes the limitations of the algebraic approach,it suffers from two limitations. First, EQUITAS cannot provethe equivalence of queries with significant structural differences interms of aggregation and outer-join operators. Second, it can onlyprove the equivalence of queries under set semantics if they containaggregation and outer-join operators. E XAMPLE

2. S

TRUCTURAL D IFFERENCES : Q1:

SELECT

EMP.DEPT_ID,

SUM (EMP.SALARY)

FROM

EMP, DEPT

WHERE

EMP.DEPT_ID = DEPT.DEPT_ID

AND

EMP.SALARY > 1000

GROUP BY

EMP.DEPT_ID ;Q2:

SELECT

T.DEPT_ID,

SUM (T.s)

FROM ( SELECT

EMP.DEPT_ID, EMP.LOCATION,

SUM (EMP.SALARY) as s FROM

DEPT, EMP

WHERE

EMP.DEPT_ID = DEPT.DEPT_ID

AND

EMP.SALARY + 1000 > 2000

GROUP BY

EMP.DEPT_ID, EMP.LOCATION) as T GROUP BY

T.DEPT_ID; Q1 is an aggregation query that calculates the sum of salaries ofemployees whose salary is greater than 1000 grouped by their de-partment id. Q2 is a nested query whose inner and outer queriesare both aggregation queries. The inner query calculates the sumof salaries of all employees whose salary plus 1000 is greater than2000, grouped by their department id and location. The outer querythen calculates the sum of salaries of those employees grouped bytheir department id. Since the set of GROUP BY columns in the outerquery of Q2 is a subset of the group set of the inner query, Q1 and Q2 are equivalent. EQUITAS reduces these queries to their SRs: Q1: COND1: (v2 > 1000 and !n2)COLS1: {(v3,n3),(v4,n4)}Q2: COND2: (v2 + 1000 > 2000 and !n2)COLS2: {(v3,n3),(v5,n5)}

Each SR contains two fields:

COND and ⃗ COLS . COND is a FOL for-mula that represents the constraints that a tuple must satisfy to bepresent in the output table. For example,

COND denotes that thevalue of salary column must be greater than 1000 and must not be NULL . ( v , n ) symbolically represents the salary column in an ar-bitrary EMP tuple. v represents the value and the boolean symbolicvariable n indicates if the value is NULL. ⃗ COLS is a vector of pairsof FOL formula that symbolically represent an arbitrary tuple in theoutput table. For example, ⃗ COLS represents the tuple returned by Q1 . Here, ( v , n and ( v , n represent the department id andsum of salaries, respectively.EQUITAS cannot prove that Q1 and Q2 are equivalent sinceit cannot prove that ( v , n is equivalent to ( v , n when bothconditions COND and COND hold. The reasons for this are twofold.First, the result of the summation operation is not constructed froma bounded number of tuples in the input table. So, EQUITASuses two different set of independent variables ( i.e. , ( v , n and ( v , n ) to symbolically represent the results of the summationoperations of Q1 and Q2 ’s outer query, respectively. Second, theinput tables for the aggregation operations in Q1 and Q2 ’s outerquery are different. So, EQUITAS cannot establish any relationbetween ( v , n and ( v , n . (a) Cardinal Equivalence (b)

Full Equivalence

Figure 1: Types of Query Equivalence –

Bijective maps implicitly con-structed by SPES to determine: (a) cardinal equivalence and (b) full equiva-lence of queries under bag semantics. E XAMPLE

3. S ET S EMANTICS : Q1:

SELECT

EMP.DEPT_ID, EMP.LOCATION

FROM

EMP;Q2:

SELECT

EMP.DEPT_ID, EMP.LOCATION

FROM

EMP

GROUP BY

EMP.DEPT_ID, EMP.LOCATION; Q1 selects the department id and location columns of all employees. Q2 selects the same columns from the employee table grouped bythe department id and location columns. Q1 and Q2 are semanticallyequivalent only under set semantics. They are not equivalent underbag semantics since the output tables would differ if there were twoemployees in the same department and location. EQUITAS reducesthese queries to their SRs: Q1: COND1:

True ; COLS1: {(v1,n1),(v4,n4)}Q2: COND2:

True ; COLS2: {(v1,n1),(v4,n4)}

EQUITAS proves that Q1 and Q2 contain each other under setsemantics. However, most database systems use bag semantics. So,it is critical to prove that they are not equivalent under bag semantics.

3. OVERVIEW

In this section, we first present an overview of the two-stageapproach in §3.1 We then use an example to illustrate how thisapproach proves QE of structurally different queries under bagsemantics in §3.2.

We decompose the previously monolithic equivalence provingproblem to two stages. ❶ In the first stage, SPES converts the two queries to their ARs.It then normalizes each AR using a set of rewrite rules. These rulesdiffer from those used in UDP in that they only focus on minimizingthe structural differences between the relational operators of thequeries ( e.g. , inner join, aggregation, and union). These rules donot transform the attributes of the operators in the queries ( e.g. ,complex predicates and projections). The output of this stage is anAR in union normal form (UNF) that represents the semantics of theoriginal query. We defer the description of UNF and normalizationrules to §4.2. These normalization rules enable SPES to prove theequivalence of structurally different queries. ❷ In the second stage, SPES prove the equivalence of the nor-malized ARs under bag semantics. This stage consists of two steps. C ARDINAL E QUIVALENCE : In the first step, SPES first verifies ifthe given pair of ARs are cardinally equivalent under bag semantics.Two queries are cardinally equivalent if and only if for all validinput tables, their output tables contain the same number of tuples.We defer a formal definition of cardinal equivalence to §5.1. If twoqueries are cardinally equivalent, then there exists a bijective map between the tuples returned by these two queries for all valid inputs,3s illustrated in Figure 1a. In this map, each tuple in first table ismapped to an unique tuple in the second table, and all of the tuplesin second table are covered by the map. We note that the contentsof the output tables of two cardinally equivalent queries may differ.SPES constructs a

Query Pair Symbolic Representation (QPSR) for two cardinally equivalent ARs to symbolically represent thebijective map between the returned tuples. It proves cardinal equiv-alence of two ARs by recursively constructing the QPSR of theirsub-ARs and using the SMT solver to verify specific propertiesbased on the semantics of different types of ARs. We defer a discus-sion of how SPES constructs QPSR to prove cardinal equivalenceto Sections 5.3 to 5.7. F ULL E QUIVALENCE : In the second step, SPES uses the con-structed QPSR to verify that the given pair of ARs are fully equiva-lent under bag semantics. Two queries are fully equivalent if andonly if for all valid input tables, their output tables contain the sametuples (ignoring the order of the tuples). We defer a formal definitionof full equivalence to §5.1. If two queries are fully equivalent, thenthere exists a bijective, identity map between the tuples returned bythese two queries for all valid inputs, as illustrated in Figure 1b. Inthis map, each tuple in first table is mapped to an unique, identicaltuple in the second table. Since the QPSR of two given ARs sym-bolically represents the bijective map between the returned tuples,SPES proves full equivalence of two ARs by using the SMT solverto show that the bijective map is an identity map.

SPES VS EQUITAS:

SPES differs from EQUITAS in the fol-lowing ways: • SPES constructs a QPSR for ARs of a pair of queries afterverifying that the ARs are cardinally equivalent. The QPSRsymbolically represents the bijective map between the tuples intheir output tables. In contrast, EQUITAS directly constructsa SR for each individual query that represents the tuples in itsoutput table. • SPES directly proves full equivalence without determining thequery containment relationship by showing that the bijectivemap represented by the QPSR is an identity map. EQUITASproves QE by showing that the query containment relationshipholds in both directions. • SPES decomposes the problem of proving equivalence of ARsinto smaller proofs of equivalence of their sub-ARs. It con-structs the bijective map between tuples in the final outputtables by recursively constructing the bijective maps betweentuples in all of the intermediate output tables. • SPES uses the SMT solver to verify conditions at each levelof the AR tree to verify cardinal equivalence. In contrast,EQUITAS uses the SMT solver to verify the conditions onlyafter constructing the SRs.These differences allows SPES prove QE under bag semantics. Asshown in Table 1, SPES supports a larger set of

SQL features incomparison to UDP and EQUITAS.

SMT S

OLVER : SPES leverages a SMT solver to proves cardinaland full equivalence of ARs [30]. An SMT solver determines if agiven FOL formula is satisfiable. For example, the solver decidesthat the following formula can be satisfied: x + 5 > ∧ x > when x is six. Similarly, it determines that the following formulacannot be satisfied: x + 5 > ∧ x < since there is no integralvalue of x for which this formula holds. A detailed description ofan SMT solver is available in [29]. EQUITAS UDP SPES

SPJ ✓ ✓ ✓

Aggregate ✓ ✓ ✓

Outer Join ✓ ✓

Complex Predicates ✓ ✓

Table Semantics set bag bagDISTINCT ✓ ✓

Union ✓ ✓

Table 1: Support for

SQL

Features –

Comparison of the

SQL features sup-ported by UDP, EQUITAS and SPES. ✓ denotes that the tool supports thisfeature. Complex predicates include those using: (1) arithmetic operations,(2) NULL , and (3)

CASE . Figure 2: Illustrative Example – The two-stage approach that SPES usesto prove QE under bag semantics.

We use Listing 2.2 (Example 2) to show how SPES proves equiv-alence of structurally different queries under bag semantics. F IRST S TAGE : SPES first converts the queries to ARs. Figure 2shows the ARs of two queries Q1 and Q2 .The AR of Q1 is an aggregate AR that takes a SELECT - PROJECT - JOIN (SPJ) AR as input. An aggregate AR contains three fields:input AR, group set, and a vector of aggregate operations. In thiscase, the group set contains department id and the aggregate opera-tion is sum of salaries. The input AR is an SPJ AR. An SPJ AR alsocontains three fields: a vector of input ARs, a filter predicate, and avector of projection expressions. In this case, the SPJ AR takes twotable ARs as input (

EMP and

DEPT ).The AR of Q2 is an aggregate AR that takes another aggregateAR as input. In this case, the group set contains department id andthe aggregate operation is sum of the sum of salaries computedby the input aggregate AR. The input aggregate AR takes an SPJAR as input. Its group set contains department id and location. Itsaggregate operation is sum of salaries. The SPJ AR is the sameas the SPJ AR in Q1 except that its filter predicate is different ( i.e. , EMP.SALARY + 1000 > 2000) and the order of input table ARs isreversed.To prove that Q1 and Q2 are semantically equivalent, SPES firstapplies a set of re-write rules to normalize these ARs. Specifically,it merges the two aggregate ARs within Q2 into a single one. Thisnormalized AR of Q2 is denoted by Q2 ’ in Figure 2. The AR of Q1 remains unchanged after normalization. We defer a discussion ofhow SPES normalizes ARs to §4. S ECOND S TAGE : In the second stage, SPES first verifies the cardi-nal equivalence of two aggregate ARs. In order to verify the cardinalequivalence of two aggregate ARs, SPES recursively constructs theQPSR of two SPJ ARs that the aggregate ARs take as inputs. Toverify the cardinal equivalence of two SPJ ARs, it constructs thebijective maps between each pair of its inputs and checks if they arecardinally equivalent. If that is the case, then it constructs a QPSRfor each pair of table ARs. SPES maps the

EMP table AR in Q1 withthe EMP table AR in Q2 ’, and the DEPT table AR in Q2 with the DEPT table AR in Q2 ’.4 PSR-1:

SPES constructs a QPSR for the pair of

EMP table ARs:

COND:

True

COLS1: {(v1,n1),(v2,n2),(v3,n3),(v4,n4)}COLS2: {(v1,n1),(v2,n2),(v3,n3),(v4,n4)}

Here,

COLS and COLS symbolically represent two correspond-ing tuples returned by the two cardinally equivalent table ARs,respectively. Each symbolic tuple is a vector of pairs of FOL terms.We present the formal definitions of COLS and COLS in §5.2. Thispair of symbolic tuples COLS and COLS defines the bijective mapbetween the tuples returned by the table ARs.Since these table ARs refer to same EMP input table, the bijectivemap is an identity map. { ( v1 , n1 ) , ( v2 , n2 ) , ( v3 , n3 ) , ( v4 , n4 ) } symbolically represents a tuple returned by the EMP table AR. Eachpair of symbolic variables represents a column in an arbitrary

EMP tuple. For instance, ( v1 , n1 ) denotes EMP_ID in this symbolic tuple. v1 represents the value of EMP_ID, the boolean symbolic variable n1 indicates if the value is NULL. The encoding scheme is the sameas the one used by EQUITAS [50]. COND is an FOL formula thatrepresents the two predicates in the SPJ AR. It must be satisfied forthe tuples to be present in the output table. Since these table ARsreturn all tuples,

COND is TRUE . QPSR-2:

SPES constructs a QPSR for the pair of

DEPT table ARs:

COND:

True

COLS1: {(v5,n5),(v6,n6)}COLS2: {(v5,n5),(v6,n6)} { ( v , n , ( v , n } symbolically represents a tuple is returned bythe DEPT table AR.

QPSR-3:

SPES uses these two QPSRs and leverages the SMTsolver to verify that predicates always returns the same booleanresults for the corresponding tuples in the join table to verify thatthe two SPJ ARs are cardinally equivalent. SPES then constructs aQPSR for these two SPJ ARs:

COND: (v2 + 1000 > 2000 and !n2) and (v2 > 1000 and !n2)COLS1: {(v1,n1),(v2,n2),(v3,n3),(v4,n4),(v5,n5),(v6,n6)}COLS2: {(v1,n1),(v2,n2),(v3,n3),(v4,n4),(v5,n5),(v6,n6)}

COLS and COLS symbolically represent the bijective map betweentuples in the output tables of two SPJ ARs. This bijective map preserves the two bijective maps in the two sub-QPSRs betweentheir input table ARs. In other words, if a tuple t1 is mapped toanother tuple t2 in QPSR-1, and a tuple t3 is mapped to anothertuple t4 in QPSR-2, then the join tuple of t1 and t2 maps to thatof t3 and t4 in QPSR-3. In this manner, the mapping in the lower-level QPSRs is preserved in the higher-level QPSR. COND is theconjunction of the filter predicates.

QPSR-4:

SPES uses QPSR-3 and the SMT solver to verify thatthe two aggregate ARs are cardinally equivalent. If so, it constructsa QPSR for the aggregate ARs ( i.e. , Q1 and Q2 ): COND: (v2 + 1000 > 2000 and !n2) and (v2 > 1000 and !n2)COLS1: {(v1,n1),(v7,n7)}COLS2: {(v1,n1),(v7,n7)}

Here,

COLS and COLS symbolically represent the bijective mapbetween tuples returned by Q1 and Q2 , respectively. ( v , n repre-sents the sum of salaries column. F ULL E QUIVALENCE : After determining cardinal equivalence,SPES proves the full equivalence of Q1 and Q2 by using an SMTsolver to verify the following property of QPSR-4: COND = ⇒ COLS = COLS . SPES feeds this formula to the SMT solver: COND ∧ ¬ ( COLS = COLS ) The solver determines that it cannot besatisfied, thereby showing that the paired symbolic tuples are always equivalent when

COND holds. Thus, the bijective map between thetuples returned by the ARs is an identity map. So, Q1 and Q2 arefully equivalent under bag semantics. S UMMARY : SPES first constructs QPSR-1 for

EMP table ARs andQPSR-2 for

DEPT table ARs. It then uses these QPSRs to determinethe cardinal equivalence of SPJ ARs. Next, it constructs QPSR-3for the SPJ ARs. SPES then uses QPSR-3 to determine the cardinalequivalence of aggregate ARs and constructs QPSR-4 for the overallqueries. Lastly, it uses QPSR-4 to decide the full equivalence of Q1 and Q2 . Thus, SPES only establishes cardinal equivalence beforeconstructing the QPSRs. It only checks full equivalence for thetop-level QPSR ( i.e. , QPSR-4).

4. ALGEBRAIC REPRESENTATION

In this section, we first define an AR system that captures thesemantics of SQL queries in §4.1. We then introduce the

UnionNormal Form (UNF) of an AR and how SPES converts each AR toUNF in §4.2. We finally present a minimal set of pre-defined rulesfor reducing an UNF AR to a simplified, semantically equivalentUNF AR in §4.3.

We first present the syntax of the AR. We then describe thesemantics of the AR based on the relationships between the inputand output tables. The formal definition of the semantics is givenin Appendix A. An AR e is defined thus: e ::= T ABLE ( n ) | SPJ ( ⃗ e , P , ⃗o ) | A GG ( e , ⃗g, ⃗agg ) | U NION ( ⃗ e ) In SPES, an AR can be: (1) a table AR, (2) an SPJ AR, (3) an aggre-gate AR, or (4) an union AR. We define the semantics of these fourtypes of AR in terms of the relationship between the input tables andthe output table. We consider a table to be a bag ( i.e. , multi-valuedset) of tuples as it best represents real-world databases. SPES sup-ports the

DISTINCT keyword for discarding duplicate tuples in a bag.Consequently, it also supports set semantics. This representation isbased on the AR presented in [23]. We next describe the semanticsof these ARs. T ABLE

AR: T ABLE ( n ) represents a table in a database. It containsonly one field: the name of the table ( n ). Given valid input tables Ts , this AR returns all the tuples in table n . SPJ AR:

This AR contains three fields: (1) a vector of input ARs( ⃗ e ), (2) a predicate that determines whether a tuple is selected ( P ),and (3) a vector of projection expressions that transform each se-lected tuple ( ⃗o ). Given a set of valid input tables Ts , the SPJ ARfirst evaluates the vector of input ARs on Ts to obtain a vector ofinput tables. For each tuple t in the cartesian product of the vectorof input tables, if t satisfies the given predicate p , it then applies thevector of expressions ⃗o on t and emits the transformed tuple. P REDICATE : A predicate may contain arithmetic operators, logicaloperators, and functions that check if a term is

NULL . SPES sup-ports higher-order predicates ( e.g. , EXISTS ) which are encoded asuninterpreted functions. P ROJECTION E XPRESSION : A projection expression may containcolumns, constant values,

NULL , arithmetic operations, user-definedfunctions, and the

CASE keyword. We present the formal definitionsof the syntax of predicate and projection expression in Appendix A. A GGREGATE

AR:

The aggregate AR contains three fields: (1) aninput AR ( e ), (2) a set of grouping attributes ( ⃗g ), and (3) a vector ofaggregate functions ( ⃗agg ). Given a set of valid input tables Ts , the5ggregate AR first evaluates the input AR on Ts to get an input table T . It then partitions the input table T into a set of bags of tuplesas defined by a set of grouping attributes ⃗g (tuples in each bag takethe same values for the grouping attributes). Lastly, for each bag oftuples, it applies the vector of aggregate functions and returns onetuple. Each aggregate function generates a column in that tuple. U NION

AR:

The union AR contains one field: a vector of input ARs( ⃗ e ). Given a set of valid input tables Ts , the union AR first evaluatesthe vector of input ARs on Ts to get a vector of input tables. It thenreturns all the tuples present in the input tables (without discardingduplicate tuples). The union AR captures the semantic of the UNIONALL operator [35]. M OTIVATION : The reasons for these definitions are twofold. First,it allows SPES to cover most of the frequently-observed SQLqueries. Second, since different types of ARs have different se-mantics, SPES can leverage AR-specific comparison functions. Wetreat SPJ queries as a separate category since the problem of deter-mining their equivalence is decidable [15]. We include aggregateand union ARs in our definitions since they are widely used SQLconstructs, and their semantics differs from that of SPJ queries. C OMPLEX

SQL C

ONSTRUCTS : SPES reduces certain SQL con-structs that do not directly map to these four categories to a combi-nation of these categories. Here are two examples: • SPES expresses the

LEFT OUTER JOIN operator as an

UNION expression that takes a vector of two SPJ expressions as input.The first SPJ expression represents the

INNER JOIN compo-nent of the

LEFT OUTER JOIN operator. The second SPJ ex-pression represents the

OUTER JOIN component of the

LEFTOUTER JOIN operator and uses

EXISTS in the predicate. • SPES expresses the

DISTINCT operator as an aggregate ex-pression where the

GROUP BY set contains all columns.

The syntax of an UNF AR is defined as follows:

UNF ::= U NION ( ⃗ SPJE ) SPJE ::=

SPJ ( ⃗ E , P , ⃗o ) E ::= T ABLE ( n ) | A GG ( UNF , ⃗g, ⃗agg ) The UNF AR is an union AR that takes a vector of normalized SPJARs as input ( ⃗ SPJE ). Each normalized SPJ AR takes a vector ofARs as input ( ⃗ E ). These ARs are either a table AR or a normalizedaggregate AR. Each normalized aggregate AR can recursively takean UNF AR as input. N ORMALIZATION R ULES : An AR can be normalized to UNF byrepeatedly applying a set of normalization rules. The number of ruleapplications is finite and the rules are not applied in a specific order.1. If an AR is SPJ ( e :: ⃗ e , p , ⃗o ) and e = SPJ ( ⃗ e , p , ⃗o ) ,then transform the AR to SPJ ( ⃗ e :: e , p ∧ p , ⃗o ◦ ⃗o ) . Here, :: denotes concatenation of two vectors and ◦ represents element-wise composition of two vectors of projection expressions.2. If an AR is SPJ ( e :: ⃗ e , p , ⃗o ) and e = U NION ( ⃗ e ) , thentransform the AR to U NION ( ⃗ e ) . Each AR in ⃗ e is SPJ ( e ′ :: ⃗ e , p , ⃗o ) where e ′ is an AR in e .3. If an AR is U NION ( e :: ⃗ e ) and e = U NION ( ⃗ e ) , thentransform the AR to U NION ( ⃗ e :: ⃗ e ) .4. If an AR is U NION ( e :: ⃗ e ) and e is aggregate or tableAR, then transforms e to SPJ ( e , true,⃗i ) . ⃗i represents theidentity map. 5. If an AR is A GG ( e , ⃗g, ⃗agg ) and e is SPJ or table AR, thentransforms e to U NION ( e ) .The first three rules merge SPJ and Union ARs. The fourth andfifth rules update the input AR of Union and Aggregate to satisfythe UNF. We now present a minimal set of pre-defined rules to furthersimplify the UNF ARs. These rules allow SPES to prove the equiv-alence of larger set of

SQL queries. E MPTY T ABLES : For this rule, we first define a special AR calledthe empty table . For any valid input, this AR always returns anempty table. It is only equivalent to another empty table AR.For an SPJ AR SPJ ( ⃗ e , p, ⃗o ) , if no tuple satisfies the predicate p ,then the rule transforms this SPJ AR to an empty table AR. For anSPJ AR, if there exists an empty table AR in its input, then the ruletransforms the SPJ AR to an empty table AR. For an union AR, therule removes all empty table ARs from its input. If the input of anunion AR is an empty vector, then the rule transforms the union ARto an empty table AR. For an aggregate AR, if the input is an emptytable AR and the group set is not empty, then the rule transforms theaggregate AR to an empty table AR. P REDICATE P USHDOWN : For an SPJ AR SPJ ( e :: ⃗ e , p ∧ p , ⃗o ) ,where e is an aggregate AR A GG ( e , ⃗g, ⃗agg ) , if p only dependson the group set ⃗g , then the rule updates the SPJ AR to SPJ ( e :: ⃗ e , p , ⃗o ) and updates the predicates of all SPJ expressions in e bytaking the conjunction of their original predicates with p . A GGREGATE M ERGE : Given an aggregate AR A GG ( e , ⃗g , ⃗agg ) ,if the input union AR only has one SPJ AR as its input, and the SPJAR only has one aggregate AR A GG ( e , ⃗g , ⃗agg ) as its input,then SPES checks the following conditions: (1) the predicate of theSPJ AR only depends on the group set ⃗g , (2) the group set ⃗g isa subset of the group set ⃗g , and (3) for each aggregate operationin ⃗agg , the operand is either in the group set ⃗g or in aggregateoperation ⃗agg such that they are the same aggregation function,and the aggregation function can only be MAX , MIN , SUM and

COUNT .If the aggregate AR satisfies all of these conditions, then SPESremoves A GG ( e , ⃗g , ⃗agg ) from the SPJ expression, and adds e to the input vector of the SPJ AR. SPES uses this rule to simplifythe AR of Q2 in Figure 2.

5. EQUIVALENCE VERIFICATION

In this section, we discuss how SPES verifies the equivalenceof two normalized ARs. We first present the formal definitions oftwo types of equivalence for ARs ( i.e. , algebraic representation ofa pair of queries) in §5.1. We then describe how SPES proves thefull equivalence of a pair of cardinally equivalent ARs using theirQPSR in §5.2. Lastly, we discuss how it decides if a pair of ARs arecardinally equivalent, and how it constructs QPSR when they arecardinally equivalent in Sections 5.3 to 5.7.

To define the full equivalence relationship between ARs, we firstdefine the cardinal equivalence relationship.D

EFINITION

1. C

ARDINAL E QUIVALENCE : Given a pair ofqueries Q1 and Q2 , Q1 and Q2 are cardinally equivalent if and onlyif, for all valid input tables, the output tables T and T of Q1 and Q2 contain the same number of tuples. If Q1 and Q2 are cardinally equivalent, for all valid inputs, eachtuple in T can be mapped to an unique tuple in T , and all tuples in6 are in the map. Thus, it is a bijective (one-to-one) map betweentuples in T and T . However, the two mapped tuples may differ intheir values, as shown in Figure 1a.D EFINITION

2. F

ULL E QUIVALENCE : Given a pair of queries Q1 and Q2 , Q1 and Q2 are fully equivalent if and only if, for allvalid input tables Ts , the output tables T and T of Q1 and Q2 areidentical. If Q1 and Q2 are fully equivalent, for all valid inputs, there existsa bijective map between tuples in T and T , and this bijective mapis an identity map. In other words, each tuple in T can always bemapped to an unique, identical tuple in T , and all tuples in T arein the map, as shown in Figure 1b. M OTIVATION

We first try to prove cardinal equivalence beforechecking for full equivalence. This is because if Q1 and Q2 are fullyequivalent, then they must be cardinally equivalent. To prove fullequivalence, we prove that the bijective map between tuples in theoutput tables is an identity map. In the rest of the paper, equivalent queries without any qualifier refer to fully-equivalent queries.SPES can prove that ARs are fully equivalent even if their sub-ARs are only cardinally equivalent. Consider the following queries: Q1:

SELECT

EMP.DEPT_ID,

SUM (EMP.SALARY)

FROM ( SELECT

DEPT_ID, SALARY

FROM

EMP)

GROUP BY

EMP.DEPT_ID;Q2:

SELECT

EMP.DEPT_ID,

SUM (EMP.SALARY)

FROM ( SELECT

DEPT_ID, SALARY, DEPT_ID+10

FROM

EMP)

GROUP BY

EMP.DEPT_ID;

While these queries are fully equivalent, their sub-queries are notfully equivalent. This is because the second sub-query returns threecolumns while the first one only returns two columns.

We now define the symbolic representation of normalized ARsthat SPES uses for proving equivalence. QPSR is an extension ofthe SR defined in EQUITAS to prove QE under bag semantics [50].In QPSR, we augment the SR to use a pair of symbolic tuples totrack a bijective map between the tuples that are returned by twocardinally equivalent ARs. QPSR of a pair of cardinally equivalentARs Q1 and Q2 is a tuple of the form: ⟨ ⃗ COLS , ⃗ COLS , COND , ASSIGN ⟩ ⃗ COLS is a vector of pairs of FOL terms that represent an arbitrarytuple returned by Q1 . Each element of this vector represents acolumn and is of the form: ( Val , Is-Null ) , where Val representsthe value of the column and

Is-Null denotes the nullability ofthe column. ⃗ COLS is another vector of pairs of FOL termsthat represents a tuple returned by Q2 . Since Q1 and Q2 must becardinally equivalent before SPES constructs their QPSR, the twosymbolic tuples ⃗ COLS and ⃗ COLS define a bijective map betweenthe returned tuples. COND is an FOL formula that represents theconstraints that must be satisfied for the symbolic tuples ⃗ COLS and ⃗ COLS to be returned by Q1 and Q2 , respectively. They encode thesemantics of the predicates in the queries. ASSIGN is another FOLformula that specifies the relational constraints between symbolicvariables used in ⃗ COLS , ⃗ COLS and COND . This formula is used forsupporting complex SQL operators, such as CASE. V ERIFYING FULL E QUIVALENCE : To prove that two cardinallyequivalent ARs Q1 and Q2 are fully equivalent, SPES needs to provethat the bijective map between returned tuples is an identity map.In other words, SPES needs to prove that, for an arbitrary tuple t returned by Q1 , the bijective map associates t to an identical tuplereturned by Q2 with the same values. SPES verifies this property Algorithm 1:

Procedure for verifying cardinal equivalence of ARs. Itconstructs the QPSR only if the ARs are cardinally equivalent.

Input :

A pair of ARs ( i.e. , Q1 and Q2 ) Output :

QPSR of Q1 and Q2 or NULL Procedure

VeriCard ( Q1 , Q2 ) switch TypeOf ( Q1 , Q2 ) do case T able do return

VeriTable ( Q1 , Q2 ) ; case SP J do return

VeriSPJ ( Q1 , Q2 ) ; case Union do return

VeriUnion ( Q1 , Q2 ) ; case Agg do return

VeriAgg ( Q1 , Q2 ) ; case Type Mismatch do return

NULL ; end using the QPSR of Q1 and Q2 . When both symbolic tuples satisfythe predicate ( i.e. , COND ), it must verify that ⃗ COLS is equivalent to ⃗ COLS . This property is formalized as: COND ∧ ASSIGN = ⇒ ⃗ COLS = ⃗ COLS SPES verifies this property using an SMT solver [30]. If the prop-erty does not hold, then the following formula is satisfiable:

COND ∧ ASSIGN ∧ ¬ ( ⃗ COLS = ⃗ COLS ) SPES feeds this formula into the SMT solver. If the solver deter-mines that this formula is unsatisfiable, then we prove that ⃗ COLS and ⃗ COLS are always identical. In this manner, we leverage theQPSR to prove full equivalence. Alg. 1 presents a recursive procedure

VeriCard for verifying thecardinal equivalence of two ARs. The

VeriCard procedure takes apair of ARs as inputs ( i.e. , Q1 ’s AR and Q2 ’s AR). It first checks thetypes of the given ARs. If they are of the same type, then it invokesthe appropriate sub-procedure for that particular type. We describethese four sub-procedures in Sections 5.4 to 5.7. If Q1 and Q2 arecardinally equivalent, then VeriCard returns their QPSR. If theseARs are of different types, it returns

NULL to indicate that it cannotdetermine their cardinal equivalence. This is because each type ofAR has different semantics (§4.1).Some sub-procedures recursively invoke

VeriCard to verify thecardinal equivalence between their sub-queries. It applies the nor-malization rules defined in §4 to transform the given two ARs sothat they are of the same type (and the sub-queries are also of thesame types recursively). This normalization process is incomplete ( i.e. , SPES may conclude that two ARs are not cardinally equivalentsince they cannot be normalized to the same type, even if they areactually cardinally equivalent). We discuss this limitation in §7.4.Each sub-procedure takes a pair of ARs of the same type as inputs.It first attempts to determine if they are cardinally equivalent. If theyare cardinally equivalent, then it constructs the QPSR of Q1 and Q2 .Otherwise, it returns NULL to indicate that it cannot determine theircardinal equivalence.In each of the following sub-sections, we first describe the condi-tions that are sufficient for proving cardinal equivalence based on thesemantics of the AR. We then describe how each sub-procedure ver-ifies these conditions to prove cardinal equivalence. We then discusshow SPES constructs the QPSR if they are cardinally equivalent.Lastly, we describe their soundness and completeness properties . Alg. 2 illustrates the

VeriTable procedure for table ARs. A sub-procedure P is sound if whenever it returns a QPSR, thegiven ARs are cardinally equivalent and the two symbolic tuplesdefine a bijective map. A sub-procedure P is complete if wheneverit returns NULL , the given ARs are not cardinally equivalent.7 lgorithm 2:

Comparison function for Table ARs

Input :

A pair of table ARs

Output :

QPSR of the table ARs or

NULL Procedure

VeriTable ( T ABLE ( n ) , T ABLE ( n )) if n = n then ⃗ COLS ← InitTuple ( T-Schema ( n )) ⃗ COLS ← ⃗ COLS return ( ⃗ COLS , ⃗ COLS , TRUE , TRUE ) else return NULL ; Algorithm 3:

Comparison function for SPJ ARs

Input :

A pair of SPJ ARs

Output :

QPSR of given SPJ ARs or

NULL Procedure

VeriSPJ ( SPJ ( ⃗e , p , ⃗o ) , SPJ ( ⃗e , p , ⃗o )) { ⃗QP SR } ← VeriVec ( ⃗e , ⃗e ) foreach ⃗QP SR ∈ { ⃗QP SR } do ( ⃗ COLS , ⃗ COLS , COND , ASSIGN ) ← Compose ( ⃗QP SR ) ( COND , ASSIGN ) ← ConstPred ( p , ⃗ COLS ) ( COND , ASSIGN ) ← ConstPred ( p , ⃗ COLS ) if COND ↔ COND then ( ⃗ COLS ′ , ASSIGN ) ← ConstExpr ( ⃗ COLS , ⃗o ) ( ⃗ COLS ′ , ASSIGN ) ← ConstExpr ( ⃗ COLS , ⃗o ) COND ← COND ∧ COND ∧ COND ASSIGN ← ASSIGN ∧ ASSIGN ∧ ASSIGN ∧ ASSIGN ∧ ASSIGN return ( ⃗ COLS ′ , ⃗ COLS ′ , COND , ASSIGN ) end end return NULL C ARDINAL E QUIVALENCE : L EMMA A pair of table ARs T ABLE ( n ) and T ABLE ( n ) arecardinally equivalent if and only if their input tables are same. (i.e., n = n ). Since the table AR returns all tuples from the input table, thus iftwo table ARs’ input tables are same, then they will always havethe same number of tuples. So

VeriTable compares the names ofthe two input tables ( i.e. , n and n ). SPES cannot show thattables with differing number of tuples are cardinally equivalent inthe presence of integrity constraints. QPSR:

We define the QPSR of the two cardinally equivalent tableARs using an identity map between the returned tuples ( e.g. , QPSR-1in Section 3.2).

VeriTable first constructs the symbolic tuple ⃗ COLS using a vector of new pairs of variables based on the table schema,and then sets the symbolic tuple ⃗ COLS to be the same as ⃗ COLS .These two equivalent tuples ⃗ COLS and ⃗ COLS define a bijective mapbetween returned tuples. VeriTable sets the

COND and

ASSIGN fieldsas

TRUE since there are no additional constraints that the tuples inthe table must satisfy. P ROPERTIES : VeriTable is sound and complete. These two prop-erties directly follow from Lemma 1. We present a formal proofin Appendix C.2.1.

Alg. 3 illustrates the

VeriSPJ procedure for SPJ ARs.

VeriSPJ leverages two procedures from [50]:

ConstExpr and

ConstPred . ConstExpr takes a vector of projection expressions and a sym-bolic tuple as inputs, and returns a new symbolic tuple with addi-tional constraints

ASSIGN that models the relation between variables.This new symbolic tuple represents the modified tuple based on thevector of projection expressions.

ConstPred takes a predicate and asymbolic tuple as the input and returns a boolean formula

COND with

Figure 3: SPJ ARs – Cardinally equivalent SPJ ARs. additional constraints

ASSIGN . COND symbolically represents the re-sult of evaluating the predicate on the symbolic tuples.

ConstPred supports higher-order predicates, such as

EXISTS , by encoding themas an uninterpreted function. C ARDINAL E QUIVALENCE : As covered in §4.1, an SPJ AR firstcomputes the cartesian product of all input ARs as the intermediatetable (

JOIN ). It then selects all tuples in the intermediate table thatsatisfy the predicate (

SELECT ), and applies the projection on eachselected tuple (

PROJECT ).L

EMMA A pair of SPJ ARs

SPJ ( ⃗e , p , ⃗o ) and SPJ ( ⃗e , p , ⃗o ) are cardinally equivalent if there is a bijective map m between tu-ples in intermediate join tables, such that the predicates p and p always return the same result for the corresponding tuples in m . To prove that there is a bijective map between the tuples in the twointermediate join tables,

VeriSPJ first uses the

VeriVec procedureto find a bijective map between sub-ARs such that each pair of sub-ARs are cardinally equivalent.

VeriVec exhaustively examines allpossible maps and recursively uses

VeriCard to verify the cardinalequivalence between two sub-ARs.

VeriVec returns all possible can-didate maps wherein each pair of sub-ARs are cardinally equivalent( { ⃗QP SR } ).Each candidate map is represented by a vector of QPSR ( ⃗QP SR ),wherein each QPSR defines a bijective map between tuples returnedby a pair of cardinally equivalent sub-ARs. VeriSPJ then uses the

Compose procedure to construct two symbolic tuples ⃗ COLS and ⃗ COLS (line 4) that represent a bijective map between the tuplesin the two intermediate join tables. These two symbolic tuplesare constructed by concatenating symbolic tuples from the QPSRsof sub-ARs based on the order of sub-ARs in the input vectors. Compose also constructs

COND and

ASSIGN by taking the conjunc-tion of

COND and

ASSIGN from the QPSRs of sub-ARs, respectively.

VeriSPJ then tries to prove that the two predicates always returnthe same result for the two symbolic tuples.

VeriSPJ first leveragesthe

ConstPred procedure to encode predicates p and p on ⃗ COLS and ⃗ COLS , respectively (line 6). VeriSPJ uses an SMT solver toprove this property under sub-conditions

COND and all relationalconstraints:

ASSIGN , ASSIGN , ASSIGN (line 7). If the propertyholds, then this formula is unsatisfiable: COND ∧ ASSIGN ∧ ASSIGN ∧ ASSIGN ∧ ¬ ( COND = COND ) VeriSPJ feeds this formula to an SMT solver. If the solver de-termines that this formula is unsatisfiable, then we prove

COND and COND are always equivalent when the relational constraints ASSIGN , ASSIGN , and ASSIGN and sub-conditions COND hold.Consider the cardinally equivalent SPJ ARs shown in Figure 3.In this case,

VeriSPJ first verifies that sub-AR E is cardinallyequivalent to sub-AR E , and sub-AR E is cardinally equiv-alent to sub-AR E . Thus, the two intermediate join tables ( i.e. ,cartesian product of sub-tables) are cardinally equivalent. VeriSPJ lgorithm 4: Comparison function for aggregate ARs

Input :

A pair of aggregate ARs

Output :

QPSR of given aggregate ARs or

NULL Procedure

VeriAgg ( A GG ( e , ⃗g , ⃗agg ) , A GG ( e , ⃗g , ⃗agg )) QP SR ← VeriCard ( e , e ) if QPSR !=

NULL then ( ⃗ COLS , ⃗ COLS , COND , ASSIGN ) ← QP SR if ⃗g ↔ ⃗g then ⃗ COLS ← InitAgg ( ⃗agg ) :: ⃗g ⃗ COLS ← ConstAgg ( ⃗agg , ⃗ COLS , ⃗agg ) :: ⃗g return ( ⃗ COLS , ⃗ COLS , TRUE , TRUE ) end end else return NULL ; constructs two symbolic tuples to represent the bijective map be-tween these intermediate join tables by leveraging the two bijectivemaps between the underlying tables. VeriSPJ then verifies that twocorresponding tuples in the map either both satisfy the predicateor not satisfy the predicate. Thus, the bijective map between thetuples in the intermediate join tables is the bijective map betweenthe tuples in the output tables before projection.

QPSR:

Since

VeriSPJ verifies that the given pair of SPJ ARs arecardinally equivalent, the two symbolic tuples ⃗ COLS and ⃗ COLS define a bijective map between tuples in the output tables beforeprojection. Projection does not change the bijective map betweentuples as it is applied separately on each tuple. Thus, VeriSPJ leverages

ConstExpr to construct new symbolic tuples ⃗ COLS ′ and ⃗ COLS ′ based on the vector of projection expressions and the givensymbolic tuples. The QPSR consists of the derived symbolic tuples ⃗ COLS ′ , ⃗ COLS ′ , the conjunction of COND , COND and COND , and theconjunction of all the relational constraints. P ROPERTIES : VeriSPJ is sound. Based on Lemma 2, if

VeriSPJ returns the QPSR, then the given SPJ ARs are cardinally equivalent.We present a formal proof in §C.2.2.In general,

VeriSPJ is not complete . The reasons are threefold.First, the SMT solver is only complete for linear operators. If thepredicates have non-linear operators ( e.g. , multiplication betweencolumns), then the solver may return UNKNOWN when it should re-turn

UNSAT [50]. Second, SPES encodes all user-defined functions,string operations, and higher-order predicates as uninterpreted func-tions. These encodings do not preserve the semantics of theseoperations. Third,

VeriCard is not complete (§5.3).

VeriSPJ procedure is complete if all input ARs for the giventwo SPJ ARs are table ARs and the SMT solver can determinethe satisfiability of the predicates. This is because the problemof deciding equivalence of two conjunctive ( i.e. , SPJ) queries isdecidable [26]. We present a formal proof in Appendix D.

Alg. 4 illustrates the

VeriAgg procedure for aggregate ARs. C ARDINAL E QUIVALENCE : An aggregate AR groups the tuplesin the input table based on the

GROUP BY column set, then appliesthe aggregate function on each group to generate a tuple in theoutput table.L

EMMA A given pair of aggregate ARs A GG ( e , ⃗g , ⃗agg ) and A GG ( e , ⃗g , ⃗agg ) are cardinally equivalent if these two con-ditions are satisfied: (1) the two input sub-ARs e and e are cardi-nally equivalent; (2) for any two pairs of corresponding tuples ina bijective map of the QPSR of e and e , two tuples in e belongto the same group as defined by g if and only if their associatedtuples in e belong to the same group as defined by g . Figure 4: Aggregate ARs – Cardinally equivalent aggregate ARs.

VeriAgg first recursively invokes the

VeriCard procedure to deter-mine the cardinal equivalence of the two input sub-ARs e and e (line 2). If VeriCard returns the QPSR of e and e , then VeriAgg has proved the first condition in Lemma 3.To prove the second condition,

VeriAgg collects the symbolictuples ⃗ COLS and ⃗ COLS from the QPSR. Since these two symbolictuples define a bijective map between tuples returned by e and e , VeriAgg replaces all variables in ⃗ COLS and ⃗ COLS by a set of freshvariables to generate a second pair of symbolic tuples ⃗ COLS ′ and ⃗ COLS ′ that represents the same bijective map with different tuples.We decompose the proof for the second condition into two stages(line 5). In the first stage, we want to prove that if ⃗ COLS and ⃗ COLS ′ belong to the same group, then ⃗ COLS and ⃗ COLS ′ also belongto the same group. To prove this, VeriAgg extracts the

GROUPBY column sets ⃗g , ⃗g ′ , ⃗g and ⃗g ′ from ⃗ COLS , ⃗ COLS ′ , ⃗ COLS and ⃗ COLS ′ , respectively. It then attempts to prove the property: ( COND ∧ ASSIGN ∧ ⃗g = ⃗g ′ ) = ⇒ ⃗g = ⃗g ′ VeriAgg sends the negation of this property to the solver : ( COND ∧ ASSIGN ∧ ⃗g = ⃗g ′ ) ∧ ¬ ⃗g = ⃗g ′ If the solver decides that this formula is unsatisfiable, then it isimpossible to find two tuples returned by e that are assigned to thesame group by ⃗g , such that their corresponding tuples returned by e are assigned to different groups by ⃗g . In the second stage, weuse the same technique in the reverse direction of the implication.Consider the cardinally equivalent aggregate ARs shown in Fig-ure 4. VeriAgg first verifies that the two input ARs E and E arecardinally equivalent, and then constructs the QPSR to represent thebijective map between their returned tuples. VeriAgg then verifiesthat if two arbitrary tuples in E belong to same group ( e.g. , first twotuples), then the two corresponding tuples in E also belong to thesame group. It also verifies that if two arbitrary tuples in E belongto different groups ( e.g. , first and third tuples), then the two cor-responding tuples in E also belong to different groups. VeriAgg verifies that the two aggregate ARs are cardinally equivalent byverifying that they emit the same number of groups.

QPSR:

VeriAgg constructs the QPSR of two given aggregate ARsafter proving they are cardinally equivalent. ⃗ COLS and ⃗ COLS definea bijective map between tuples returned by input ARs, and can alsobe used to define a bijective map between groups in two aggregateARs. If two aggregate functions in ⃗agg and ⃗agg are same andoperate on same values ( i.e. , input columns of the symbolic tuplesare same), then the aggregate values in the output tuples are same,since each group contains the same number of tuples. VeriAgg invokes the

InitAgg procedure on ⃗agg to construct avector of pairs of new symbolic variables as the symbolic tuplesfor aggregate functions. In each pair of symbolic variables, thefirst variable represents the aggregate value. The second variableindicates if the aggregate value is NULL . VeriAgg concatenates the

GROUP BY column set ⃗g with the symbolic tuple ⃗ COLS . VeriAgg then invokes the

ConstAgg procedure to construct the symboliccolumns for ⃗agg , and then concatenates with the GROUP BY column9 lgorithm 5:

Comparison function for Union ARs

Input :

A pair of union ARs

Output :

QPSR of given two Union ARs or

NULL Procedure

VeriUnion ( U NION ( ⃗e ) , U NION ( ⃗e )) { ⃗QP SR } ← VeriVec ( ⃗e , ⃗e ) if { ⃗QP SR } != ∅ then ⃗ COLS ← InitTuple () ⃗ COLS ← InitTuple () ⃗QP SR ← { ⃗QP SR } ASSIGN ← ConstAssign ( ⃗QP SR, ⃗ COLS , ⃗ COLS ) return ( ⃗ COLS , ⃗ COLS , TRUE , ASSIGN ) end else return NULL ; set ⃗g . ConstAgg uses the same pairs of symbolic variables forall aggregation operations in ⃗agg , where the aggregation functiontype and operand columns are the same in ⃗agg . VeriAgg sets

COND and

ASSIGN to TRUE since all tuples must be returned in case of anaggregation AR and there are no additional constraints, respectively. P ROPERTIES : VeriAgg is sound. Based on Lemma 3, if

VeriAgg returns the QPSR, then the two given aggregate ARs are cardinallyequivalent. This is because the two symbolic tuples ⃗ COLS and ⃗ COLS are constructed from corresponding groups. Thus, ⃗ COLS and ⃗ COLS define a bijective map between tuples returned by thetwo aggregate ARs. We present a formal proof in Appendix C.2.3. VeriAgg is not complete. The sources of incompleteness arethreefold: (1) incompleteness of

VeriCard , (2) limitations of theSMT solver, and (3) when

VeriCard returns the QPSR of two inputsub-ARs, the symbolic tuples in the QPSR define only one possiblebijective map between tuples in the input tables. If

VeriAgg fails toprove the second condition in Lemma 3, it is still possible that thereexists another bijective map that satisfies the second condition.

Alg. 5 illustrates the

VeriUnion procedure for union ARs. C ARDINAL E QUIVALENCE : An union AR returns all tuples ineach of the input tables.L

EMMA A given pair of union ARs U NION ( ⃗e ) and U NION ( ⃗e ) are cardinally equivalent if there exists a bijective map between thetwo input sub-ARs ⃗e and ⃗e , such that each pair of ARs are cardi-nally equivalent. The lemma follows from the semantics of the union AR.

VeriUnion procedure invokes

VeriVec (§5.5) to find a bijective map between ⃗e and ⃗e (line 2), such that each pair of ARs are cardinally equivalent. QPSR:

VeriVec finds all candidate bijective maps ( { ⃗QP SR } ) be-tween two input sub-ARs ⃗e and ⃗e , such that each pair of sub-ARs are cardinally equivalent. In each candidate bijective map( ⃗QP SR ), a vector of QPSRs is constructed such that each QPSRdefines a bijective map between tuples returned by a pair of sub-ARs. VeriUnion gets an arbitrary ⃗QP SR ( i.e. , one candidate bijectivemap between the sub-ARs). It seeks to construct a bijective mapbetween tuples returned by two union ARs that preserves all of thebijective maps between tuples returned by sub-ARs in that ⃗QP SR .It first constructs two fresh symbolic tuples ⃗ COLS and ⃗ COLS . Itthen invokes the ConstAssign procedure to set

ASSIGN such thatboth ⃗ COLS and ⃗ COLS are always equivalent to the symbolic tu-ples in one sub-QPSR returned by VeriVec . ConstAssign createsa vector of boolean variables to set these constraints.

VeriUnion returns these two symbolic tuples,

TRUE condition, and

ASSIGN asthe QPSR of the given union ARs.

Figure 5: Union ARs – Cardinally equivalent union ARs.

Consider the cardinally equivalent union ARs shown in Figure 5.

VeriSPJ verifies that sub-AR E is cardinally equivalent to sub-AR E , and sub-AR E is cardinally equivalent to sub-AR E .Thus, the two union ARs are cardinally equivalent. VeriSPJ thenconstructs the QPSR of two union ARs that defines the bijectivemap between tuples in union tables. This bijective map preservesthe bijective map between tuples in E and E , and the bijectivemap between tuples in E and E . P ROPERTIES : VeriUnion is sound. Based on Lemma 4, if

VeriUnion returns the QPSR, then the two union ARs are cardinally equiva-lent. The symbolic tuples ⃗ COLS and ⃗ COLS define a bijective mapbetween tuples returned by two union ARs that preserves all of thebijective maps between tuples in their cardinally equivalent sub-ARs.The formal proof is given in Appendix C.2.4. VeriUnion is incomplete. The sources of incompleteness arethreefold: (1) incompleteness of

VeriCard , (2) limitations of theSMT solver, and (3) two union ARs may be cardinally equivalenteven if there is no bijective map between their sub-ARs such thateach pair of sub-ARs is cardinally equivalent.

6. SOUNDNESS AND COMPLETENESS

We now discuss the soundness and completeness of SPES forverifying the equivalence of two queries. S OUNDNESS : SPES is sound. Given two queries Q1 and Q2 , ifSPES constructs the QPSR for two normalized ARs that represent Q1 and Q2 , and checks the formula holds for the QPSR: COND ∧ ASSIGN = ⇒ ⃗ COLS = ⃗ COLS then Q1 and Q2 are fully equivalent. P ROOF S KETCH : If VeriCard returns the QPSR of Q1 and Q2 ,then Q1 and Q2 are cardinally equivalent and the symbolic tuples inthe QPSR define the bijective map between the tuples returned by Q1 and Q2 . If SPES determines that the following formula holds for theQPSR: COND ∧ ASSIGN = ⇒ ⃗ COLS = ⃗ COLS , then the bijectivemaps is an identity map. Thus, Q1 and Q2 are fully equivalent. Wepresent a formal proof in Appendix C. C OMPLETENESS : In general, SPES is not complete. We discussedthe sources of incompleteness in Sections 5.3 to 5.7. However,SPES is complete for a pair of SPJ queries Q1 and Q2 that do nothave predicates or projection expressions whose satisfiability cannotbe determined by the SMT solver. P ROOF S KETCH : Since Q1 and Q2 are SPJ queries, after normal-ization, SPES represents them with SPJ ARs that only take tableARs as inputs. VeriSPJ is complete under these conditions. Wepresent a formal proof in Appendix D. Thus, if Q1 and Q2 are fullyequivalent, VeriSPJ returns the QPSR of ARs that represents Q1 and Q2 . Since Q1 and Q2 do not have predicates or projection expres-sions whose satisfiability cannot be determined by the SMT solver,the solver will verify whether the following formula holds for the10 alcite CompilerFirst Stage: ConverterSecond Stage: VeriﬁerSMT Solver: Z3 FOLs Satisﬁable?Logical Plan: Q1 = Q2?Normalized AR: E1 = E2?

SQL: Q1 = Q2? Decision

Figure 6: Query Equivalence Verification Pipeline - The pipeline fordetermining the equivalence of SQL queries.

QPSR:

COND ∧ ASSIGN = ⇒ ⃗ COLS = ⃗ COLS . Thus, SPES iscomplete.

7. EVALUATION

In this section we describe our implementation and evaluation ofSPES. We begin with a description of our implementation in §7.1.We next report the results of a comparative analysis of SPES againstUDP [23] and EQUITAS [50], the state-of-the-art automated QEverifiers based on AR and SR, respectively. We then present theresults of a comparative analysis of SPES against QE verification al-gorithms used in systems for leveraging materialized views. We nextquantify the efficacy of SPES in identifying overlapping queriesacross production SQL queries in §7.3. We conclude with the limi-tations of the current implementation of SPES in §7.4.

The architecture of SPES is illustrated in Figure 6. SPES takesa pair of SQL queries ( Q1 and Q2 ) as inputs and returns a booleandecision that indicates whether they are fully equivalent. The QEverification pipeline consists of three components: ❶ The compilerfirst converts the given queries to logical query execution plans. Weuse the open-source C

ALCITE framework [3]. ❷ SPES operateson these logical plans in two stages. First, it converts them to theirARs and normalizes these ARs. Next, it uses the third componentto verify the cardinal equivalence of ARs and then constructs theirQPSR. It also uses the third component for verifying the propertiesof QPSR to determine full equivalence. This component is imple-mented in Java (2,065 lines of code). ❸ The third component is anSMT solver that SPES leverages for determining the satisfiabilityof FOL formulae [12]. We will release the source code of SPESafter this paper is published. B ENCHMARK : We use queries in the test suite of Apache C AL - CITE [3] as our benchmark. This test suite contains 232 semanticallyequivalent query pairs. The reasons for using this benchmark aretwofold. First, the C

ALCITE optimizer is widely used in data pro-cessing engines [4, 5, 6, 7, 8]. So, it covers a wide range of SQLfeatures . Second, since UDP and EQUITAS are both evaluatedon this query pair benchmark [23, 50], we can quantitatively andqualitatively compare the efficacy of these tools. We send everyquery pair with the schemata of their input tables to SPES and askit to check their QE. We conduct this experiment on a commodityserver (Intel Core i7-860 processor and 16 GB RAM). A UTOMATED

SQL QE V

ERIFIERS : The results of this experi-ment are shown in Table 2. We compare SPES against EQUITAS The test cases used in this experiment were obtained from theopen-sourced COSETTE repository [10]. in the same environment. We present the results reported in the UDPpaper [23] . SPES proves the equivalence of a larger set of querypairs (90/232) compared to UDP (34/232) and EQUITAS (67/232).SPES currently supports 120 out of 232 pairs. The un-supportedqueries either: (1) contain SQL features that are not yet supported( e.g. , CAST ), or (2) cannot be compiled by C

ALCITE due to syntaxerrors. Among the 120 pairs supported by SPES, it proves that 90pairs (75%) are equivalent under bag semantics. In contrast, UDPproves the equivalence of 34 pairs under bag semantics. EQUITASproves the equivalence of 67 pairs, but only under set semantics. Wegroup the proved query pairs into three categories: • USPJ:

Queries that are union of

SELECT - PROJECT - JOIN . • Aggregate:

Queries containing at least one aggregate. • Outer-Join:

Queries containing at least one outer

JOIN .Table 2 reports the number of pairs proved by UDP and EQUI-TAS in each category. The number of proved pairs containing outer

JOIN is not known in case of UDP. SPES outperforms the othertools on queries containing aggregate and outer

JOIN operators.We next compare the average time taken by SPES, UDP andEQUITAS to prove the equivalence of a pair of queries in eachcategory. This is an important metric for a cloud-scale tool thatmust be deployed in a DBaaS platform. We only compute thismetric for the pairs that these tools can prove. SPES, UDP, andEQUITAS take 0.05 s, 4.16 s, and 0.15 s on average to prove QE.So, SPES is 83 × faster than UDP and 3 × faster than EQUITASon this benchmark. L EVERAGING M ATERIALIZED V IEWS : In this experiment, wecompare SPES against equivalence verification algorithms used insystems for leveraging materialized views: (1) M

INI C ON [43] and(2) V IEW M ATCHER [33].M

INI C ON only proves containment relationships between con-junctive queries ( i.e. , SPJ queries). So, it supports pairs of SPJqueries in the C ALCITE benchmark. In contrast, SPES proves that27 of these 30 query pairs are equivalent. It does not support pairssince their equivalence is conditioned on integrity constraints thatSPES currently does not support. SPES supports other types ofqueries in the C ALCITE benchmark that M

INI C ON cannot support.V IEW M ATCHER only proves containment relationships betweenSPJ queries and aggregate queries whose inputs are SPJ queries. Itleverages a syntactical comparison scheme to verify the containmentrelationship between queries. We implemented this comparisonscheme and found that it proves pairs of queries are equivalentin C ALCITE . Since SPES relies on semantic comparison whichsubsumes syntactical comparison, it supports all of them.

In this experiment, we quantify the efficacy of SPES in detectingoverlap in production

SQL queries. We leverage three set of realproduction queries from Ant Financial [2], a financial technologycompany. These queries are used to detect fraud in business trans-actions. In each set, we run SPES on each pair of queries thatoperate on the same set of input tables. If SPES decides that a givenpair of queries are not equivalent, then we check any constituentsub-queries that operate on the same input tables. We skip checkingqueries containing only table scans and those that only differ in theparameters passed on to their predicates. This is because SPEStrivially proves their equivalence and the computational resourcesneeded for evaluating such queries are negligible. We were unable to conduct a comparative performance analysis un-der the same environment since UDP is currently not open-sourced.11

ETool SupportedSemantics SupportedPairs ProvedPairs AverageTime (s) USPJPairs AverageTime (s) AggregatePairs AverageTime (s) Outer-JoinPairs AverageTime (s)

SPES Bag 120 90 0.05 39 0.03 42 0.06 20 0.09EQUITAS Set 91 67 0.15 28 0.10 32 0.19 9 0.19UDP Bag 39 34 4.16 21 2.7 11 6.9 – –

Table 2: Comparative analysis between SPES, EQUITAS, and UDP - The results include the number of query pairs in the C

ALCITE benchmark that thesetools support, the number of pairs whose equivalence they can prove, and the average time they take to determine QE.

QuerySet Number ofQueries Queries withOverlapping Computation HighestQuery Frequency ComparedQuery Pairs EquivalentQuery Pairs Equivalent Pairs withAggregate and Join

Set 1 3285 943 52 122900 3344 653Set 2 3633 984 97 55311 7225 4822Set 3 2568 664 30 15442 1521 356Total 9486 2591 (27%) – 193633 12090 5831 (48%)

Table 3: Efficacy of SPES on Production Queries - "Highest Query Frequency" indicates the highest frequency of a query in equivalent query pairs.“Compared Query Pairs" refers to number of query pairs that operate on the same set of input tables.

Figure 7: Complexity of Production Queries - We quantify the complexityof production queries in the Ant Financial workload by measuring the numberof algebraic expressions (sub-ARs) in each query.

Table 3 presents the results of this experiment. SPES effectivelyidentifies overlap between complex analytical queries. Among 9486queries, SPES finds overlapping computation between 2591 (27%)queries, while EQUITAS only finds overlapping computation be-tween 1126 (12%) queries. We also report the highest frequencyof queries present in these pairs that are repeatedly executed inthe workload. In practice, most of the computational resources areexpended on executing queries containing aggregate functions ordifferent types of join. Among 12090 equivalent pairs, 5831 (48%)contains join and aggregate operations. This illustrates that SPESworks well on queries containing these operators. Q UERY C OMPLEXITY : Figure 7 illustrates the complexity ofqueries in this workload. We compute the distribution of the numberof algebraic expressions ( i.e. , sub-ARs) in a given query (complexqueries will have a larger set of expressions). We found that the aver-age number of algebraic expressions in the Ant Financial workloadand the C

ALCITE benchmark is 45.38 and 5.37, respectively.

In general, the problem of deciding QE is undecidable [15].Among the 120 query pairs supported by SPES, it cannot prove theQE of 30 pairs. We classify them into three categories: (1) lack ofnormalization rules [22], (2) support for integrity constraints [7],and (3) support for type casting [1]. N ORMALIZATION R ULES : SPES can verify the cardinal equiva-lence of two ARs only if it can normalize them into the same type ofAR using a set of pre-defined semantically-equivalent re-write rules(§5.3). We will need to introduce additional normalization rules forARs with: (1) union and aggregate [15], (2) join and aggregate [7],and (3) multiple aggregates with a complex relationship [2]. Addingthese re-write rules in the normalization stage will enable SPES toprove the QE of these 22 pairs. However, that will also increase the average QE verification time. Furthermore, these rules are notrequired for supporting production queries discussed in §7.3. I NTEGRITY C ONSTRAINTS : SPES currently does not supportintegrity constraints ( e.g. , distinct values, foreign keys, and primarykeys). We will need to encode these integrity constraints in ournormalization rules. Updating these rules will enable SPES toprove the QE of 7 pairs. For example, we may normalize an

OUTERJOIN operation based on a foreign key to an

INNER JOIN operation.

8. RELATED WORK Q UERY E QUIVALENCE : The state-of-the-art QE verification toolsare based on either AR [24, 22, 25] or SR [50]. We highlighted thedifferences between SPES and these tools in §2. Prior efforts haveexamined the theoretical aspects of equivalence and containment re-lationships between queries. Since it is an undecidable problem [14,18], these efforts focused on determining categories of queries forwhich it is a decidable problem: (1) conjunctive queries [21], (2)conjunctive queries with additional constraints [19, 36, 28], and (3)conjunctive queries under bag semantics [37]. The problem of de-ciding containment relationship between conjunctive queries can bereduced to a constraint satisfiability problem [40]. Other proposalsinclude decision procedures for proving equivalence of a subset ofqueries under set [20, 45, 44] and bag semantics [27, 31, 32, 42]. S YMBOLIC E XECUTION IN

DBMS S : Researchers have lever-aged symbolic execution in DBMSs by reducing the given problemto a FOL satisfiability problem and then using an SMT solver tosolve it. These efforts include: (1) automatically generating testcases for database applications [46, 47, 13], (2) verifying the cor-rectness of database applications [48, 38, 34], (3) disproving theequivalence of SQL queries [24], and (4) finding the best application-aware memory layout [49]. SPES differs from these efforts in thatwe seek to address the limitations of symbolic approaches to QE.

9. CONCLUSION

In this paper, we presented the design and implementation ofSPES that takes a two-stage approach to query equivalence. We il-lustrated how it supports structurally-different queries with complexoperators under bag semantics. SPES uses a set of rules for normal-izing the differences between complex, structurally-different queries.It then converts these expressions to a QPSR and determines theirfull equivalence under bag semantics using an SMT solver. Ourevaluation shows that SPES proves the equivalence of a larger setof query pairs under bag semantics compared to the state-of-the-arttools based on algebraic and symbolic approaches.12

0. REFERENCES [1] Alibaba MaxCompute. .[2] Ant Financial Services Group. .[3] Apache Calcite project. http://calcite.apache.org/ .[4] Apache Drill project. http://drill.apache.org/ .[5] Apache Flink project. http://flink.apache.org/ .[6] Apache Hive project. http://hive.apache.org/ .[7] Apache Kylin project. http://kylin.apache.org/ .[8] Apache Phoenix project. http://phoenix.apache.org/ .[9] Azure Data Lake. https://azure.microsoft.com/en-us/solutions/data-lake/ .[10] Cosette: An automated SQL solver. https://github.com/uwdb/Cosette .[11] Google BigQuery. https://cloud.google.com/bigquery/ .[12] Z3prover: Z3 theorem prover. https://github.com/Z3Prover/z3 .[13] S. Abdul Khalek, B. Elkarablieh, Y. O. Laleye, andS. Khurshid. Query-aware test generation using a relationalconstraint solver. In

ASE , 09 2008.[14] S. Abiteboul, R. Hull, and V. Vianu.

Foundations ofdatabases: the logical level . Addison-Wesley LongmanPublishing Co., Inc., 1995.[15] A. V. Aho, Y. Sagiv, and J. Ullman. Equivalence amongrelational expressions.

SIAM Journal of Computing ,8(9):218–246, 1979.[16] J. Albert. Algebraic properties of bag data types. In

VLDB ,1991.[17] S. R. Alekh Jindal, Konstantions Karanasos and H. Patel.Selecting subexpressions to materialize at datacenter scale. In

VLDB , 2018.[18] B.A.Trakhtenbrot. Impossibility of an algorithm for thedecision problem in finite classes. In

Journal of SymbolicLogic , 1950.[19] D. Calvanese, G. D. Giacomo, and M. Lenzerini. Conjunctivequery containment and answering under description logicconstraints. In

TOCL , 2008.[20] A. K. Chandra and P. M. Merlin. Optimal implementation ofconjunctive queries in relational data bases. In

STOC , 1977.[21] C. Chekuri and A. Rajaraman. Conjunctive query containmentrevisited. In

ICDT , 1997.[22] S. Chu, D. Li, C. Wang, A. Cheung, and D. Suciu.Demonstration of the Cosette automated sql prover. In

SIGMOD , 2017.[23] S. Chu, B. Murphy, J. Roesch, A. Cheung, and D. Suciu.Axiomatic foundations and algorithms for deciding semanticequivalences of sql queries. In

VLDB , 2018.[24] S. Chu, C. Wang, K. Weitz, and A. Cheung. Cosette: Anautomated SQL prover. In

CIDR , 2017.[25] S. Chu, K. Weitz, A. Cheung, and D. Suciu. HoTTSQL:proving query rewrites with univalent sql semantics. In

PLDI ,2017.[26] S. Cohen, W. Nutt, and Y. Sagiv. Deciding equivalencesamong conjunctive aggregate queries. 1998.[27] S. Cohen, W. Nutt, and A. Serebrenik. Rewriting aggregatequeries using views. In

PODS , 1999.[28] G. De Giacomo, D. Calvanese, and M. Lenzerini. On thedecidability of query containment under constraints. In

PODS ,12 1999. [29] L. De Moura and N. Bjørner. Satisfiability modulo theories:Introduction and applications.

Commun. ACM , 54(9):69–77,2011.[30] L. M. de Moura and N. Bjørner. Z3: an efficient SMT solver.In

TACAS , 2008.[31] A. Deutsch, A. Nash, and J. Remmel. The chase revisited.2008.[32] A. Deutsch, L. Popa, and V. Tannen. Physical dataindependence, constraints, and optimization with universalplans. In

VLDB , 03 2002.[33] J. Goldstein and P. Larson. Optimizing queries usingmaterialized views: A practical, scalable solution. In

SIGMOD , pages 331–342, 06 2001.[34] S. Grossman, S. Cohen, S. Itzhaky, N. Rinetzky, and M. Sagiv.Verifying equivalence of spark programs. In

CAV , 2017.[35] P. Guagliardo and L. Libkin. A formal semantics of sqlqueries, its validation, and applications. 2017.[36] I. Horrocks, U. Sattler, S. Tessaris, and S. Tobies. How todecide query containment under constraints using adescription logic. In

LPAR , 2000.[37] Y. E. Ioannidis and R. Ramakrishnan. Containment ofconjunctive queries: Beyond relations as sets. In

TODS , 1995.[38] S. Itzhaky, T. Kotek, N. Rinetzky, M. Sagiv4, O. Tamir,H. Veith, and F. Zuleger. On the automated verification of webapplications with embedded sql. In

ICDT , 2017.[39] T. S. Jayram, P. G. Kolaitis, and E. Vee. The containmentproblem for real conjunctive queries with inequalities. In

PODS , 2006.[40] P. G. Kolaitis and M. Y. Vardi. Conjunctive-query containmentand constraint satisfaction. In

PODS , 1998.[41] M. Negri, G. Pelagatti, and L. Sbattella. Formal semantics ofsql queries. In

ACM Trans. Database Syst. , 1991.[42] L. Popa, A. Deutsch, A. Sahuguet, and V. Tannen. A chase toofar? In

SIGMOD , 2002.[43] R. Pottinger and A. Levy. A scalable algorithm for answeringqueries using views. In

VLDB , pages 484–495, 2000.[44] Y. Sagiv and M. Yannakakis. Equivalences among relationalexpressions with the union and difference operators. In

J.ACM , 1980.[45] V. Tannen and L. Popa. An equational chase forpath-conjunctive queries, constraints, and views. In

ICDT ,1999.[46] M. Veanes, P. Grigorenko, P. de Halleux, and N. Tillmann.Symbolic query exploration. In

FormaliSE , 2009.[47] M. Veanes, N. Tillmann, and J. de Halleux. Qex: Symbolic sqlquery explorer. In

LPAR , 2010.[48] Y. Wang, I. Dillig, S. K. Lahiri, and W. Cook. Verifyingequivalence of database-driven applications. In

PACMPL ,2017.[49] C. Yan and A. Cheung. Generating application-specific datalayouts for in-memory databases. In

VLDB , 2019.[50] Q. Zhou, J. Arulraj, S. B. Navathe, W. Harris, and D. Xu.Automated verification of query equivalence usingsatisfiability modulo theories.

PVLDB , 12(11):1276–1288,2019.13

PPENDIXA. SEMANTICS OF AR

We now formally define the semantics AR queries, using thefollowing formal notation. ⇓ is the evaluation symbol. The left sideof this symbol is an algebraic expression that is evaluated on validinput tables Ts . The right side of this symbol is the evaluation result,which is the output table. All output tables are bags ( i.e. , can containduplicate tuples). A horizontal line separates the pre- and the post-conditions. The pre-conditions on the top of the line include a set ofevaluation relations. The post-condition on the bottom side of theline is an evaluation relation. If all the relations in the pre-conditionshold, then the relation in the post-condition holds.E-T ABLE ⟨ T ABLE ( n )[ Ts ] ⟩ ⇓ [ t |∀ t ∈ n ] E-SPJ ⃗ e = e , e , . . . , e n ⟨ e [ Ts ] ⟩ ⇓ T . . . ⟨ e n [ Ts ] ⟩ ⇓ T n ⟨ SPJ ( ⃗ e , P , ⃗o )[ Ts ] ⟩ ⇓ [( ⃗o ( t ) |∀ t ∈ ( T × · · · × T n ) , p ( t )] E-A GG ⟨ e [ Ts ] ⟩ ⇓ T ⟨ A GG ( e , ⃗g, ⃗agg )[ Ts ] ⟩ ⇓ [ ⃗agg ( t ) |∀ t ∈ part ( T , ⃗g )] E-U

NION ⃗ e = e , e , . . . , e n ⟨ e [ Ts ] ⟩ ⇓ T . . . ⟨ e n [ Ts ] ⟩ ⇓ T n ⟨ U NION ⃗ e [ Ts ] ⟩ ⇓ [ t |∀ t ∈ T + · · · + T n ] Figure 8: Semantics –

Semantics of AR used in SPES • Given a set of valid input tables Ts , the table AR returns all thetuples in table n . • Given a set of valid input tables Ts , the SPJ AR first evaluatesthe vector of input ARs on Ts to obtain a vector of input tables.For each tuple t in the cartesian product of the vector of inputtables, if t satisfies the given predicate p , it then applies thevector of expressions −→ o on the selected tuple t and emits thetransformed tuple. • Given a set of valid input tables Ts , this aggregate AR firstevaluates the input AR on Ts to get an input table T . Then,it uses part to partition the input table T into a set of bagsof tuples as defined by a set of group set ⃗g (tuples in each bagtake the same values for the grouping attributes). Lastly, foreach bag of tuples, it applies the vector of aggregate functionsand returns one tuple. • Given a set of valid input tables Ts , this union AR first evaluatesthe vector of input ARs on Ts to get a vector of input tables.It then returns all the tuples present in the input tables, whichdoes not eliminate duplicate tuples. B. PREDICATE & PROJ. EXPRESSION

SPES supports the predicate and project expressions shownin Figure B. It uses the same encoding scheme as the one employedin EQUITAS (described in Section 3.4 of [50]).A projection expression E can either be a column that refer toa specific column, a constant value, NULL , a binary expression, anuninterpreted function, or an CASE expression (Eqn. 1). A predi-cate P can either be a binary comparison between two projectionexpression, a binary predicate that is composed by two predicates, E ::= Column i | Const v | NULL | Bin E OP E | Fun N ( −→ E ) | CASE (1)

CASE ::=

Pair E (2)

Pair ::= (

WHEN P E ) Pair | ϵ (3) OP ::= + | − | × | ÷ | mod (4) P ::= BinE E CP E | BinL P LOGIC P | Not P | IsNull E (5) CP ::= > | < | = | ≤ | ≥ (6) LOGIC ::=

AND | OR (7) Figure 9: Predicate & Projection Expressions –

Types of predicates andprojection expressions supported by SPES. a not predicate and a predicate decide if a projection expression is

NULL (Eqn. 5).

C. SOUNDNESS OF VERIFICATION

In this section, we give the formal proof of the soundness ofSPES. The overall proof is structured as follows: we introducesymbolic representations of bijections over tuples in (§C.1), provecorrectness of the procedure for generating symbolic representations(§C.2), and then prove correctness of the procedure for determiningequivalence (§C.3).

C.1 Symbolic bijections between queries

All definitions in this section are given with respect to arbitraryqueries Q1 and Q2 , whose columns are denoted ⃗ COLS and ⃗ COLS ,respectively.A cardinality-preserving binary relation between Q1 and Q2 is arelation R ⊆ ⃗ COLS × ⃗ COLS such that for each input table set I and all tuples ( t, u ) ∈ R , it holdsthat | t | Q1 ( I ) = | u | Q2 ( I ) where | u | T denotes the number of occurrences of tuple u in table T .Cardinality-preserving binary relations can act as witnesses offull equivalence (see Definition 2) between queries.L EMMA If the identity function is a cardinality-preserving bi-nary relation between Q1 and Q2 , then Q1 and Q2 are fully equivalent(denoted Q1 ≡ Q2 ). P ROOF . Let I be an arbitrary table set and let t be an arbitrarytuple over columns COLS . Then | t | Q1 ( I ) = | t | Q2 ( I ) by the fact that the identity function is cardinality-preserving. Thus Q1 ( I ) and Q2 ( I ) are equivalent under bag semantics by the defi-nition of bag semantics. Thus Q1 ≡ Q2 , by definition of equiva-lence.A symbolic representation of a binary relation R ⊆ ⃗ COLS × ⃗ COLS is an SMT formula over a vocabulary that extends ⃗ COLS and ⃗ COLS such that for each ( t, u ) ∈ R , there is some model m of φ such that t = m ( ⃗ COLS ) u = m ( ⃗ COLS ) where m ( ⃗ COLS ) is the tuple of interpretations of each column namein ⃗ COLS (and similarly for m ( ⃗ COLS ) ).Symbolic representations of cardinality-preserving bijections canbe viewed as QPSRs (defined in Section 5.2), collapsed into single14MT formulas. In particular each QPSR ( C , C , c, a ) of Q1 and Q2 corresponds to the symbolic relation ⃗ COLS = C ∧ ⃗ COLS = C ∧ c ∧ a Symbolic cardinality-preserving bijections can be conjoined withequivalent constraints over column fields to form new symboliccardinality-preserving bijections. In order to formalize this, wewill say that for each partial bijection b between ⃗ COLS and ⃗ COLS ,each formula φ over vocabulary COLS , and each formula φ overvocabulary φ , φ and φ are equivalent over b if m is a model of φ if and only if b ( m ) is a model of φ .L EMMA For each cardinality-preserving bijection b symboli-cally represented by φ and all ψ over ⃗ COLS and ψ over ⃗ COLS that are equivalent over b , φ ∧ ψ ∧ ψ is a symbolic cardinality-preserving bijection. P ROOF . b | ι ( φ ) is is the interpretation of φ ∧ ψ ∧ ψ . It isa cardinality-preserving bijection because it is a restriction of acardinality-preserving bijection.Because cardinality-preserving bijections can act as witnesses ofequivalence, their symbolic representations naturally can, as well.L EMMA If there is some symbolic cardinality-preserving bi-jection φ between Q1 and Q2 that entails ⃗ COLS = ⃗ COLS then Q1 ≡ Q2 . P ROOF . φ represents the identity function by the assumptionsthat it represents a total function and that it logically entails a sym-bolic representation of the identity relation. Thus, Q1 ≡ Q2 , byLemma 5. C.2 Synthesizing symbolic bijections

We now prove the soundness of the procedure

VeriCard . Theproof is defined using a set of lemmas per form of input query(Appendix C.2.1—Appendix C.2.4), each of which are predicatedon assumptions that

VeriCard is sound on smaller queries. Theproof for arbitrary queries combines the lemmas that concern eachform of query in a proof by induction on

VeriCard ’s input query(Appendix C.2.5).

C.2.1 Symbolic bijections between table queries

We now state and prove the soundness of

VeriTable , which isgiven in Algorithm 2. For a given pair of table ARs T

ABLE ( n ) and T ABLE ( n ) , VeriTable first checks if two table ARs have thesame name. If two table ARs have the same names, then

VeriTable uses procedure

InitTuple to create a new vector of pair of symbolicvariables based on the input table schema, and assign this new vectorto ⃗ COLS . VeriTable then sets ⃗ COLS is equal to ⃗ COLS . VeriTable returns the QPSR with ⃗ COLS and ⃗ COLS , where both COND and

ASSIGN are

TRUE . If two table ARs have different names, then

VeriTable returns

NULL .L EMMA If VeriTable , given table ARs q = T ABLE ( n ) and q = T ABLE ( n ) , returns some QPSR φ , then φ is symboliccardinality-preserving bijection between q and q . P ROOF . VeriTable determines that n = n , by the fact that VeriTable only returns a QPSR if n = n and by the assumptionthat VeriTable returns a QPSR. The QPSR returned by

VeriTable is the symbolic representation of the identity relation, and is thus asymbolic cardinality-preserving bijection.

C.2.2 Symbolic bijections between SPJ queries

We now formalize and prove the correctness of

VeriSPJ (seeSection 5.5).L

EMMA For vectors of queries −→ e and −→ e , if VeriCard ( e ′ , e ′ ) is an QPSR φ ′ only if φ ′ is a symbolic cardinality-preserving bi-jection between e ′ and e ′ for all e ′ ∈ −→ e and e ′ ∈ −→ e , thenthen VeriSPJ , given queries Q1 = SPJ ( −→ e , p , −→ o ) and Q2 = SPJ ( −→ e , p , −→ o ) , returns a QPSR φ only if φ is a symbolic cardinality-preserving bijection between Q1 and Q2 . P ROOF . There is some

QPSR in VeriVec ( −→ e , −→ e ) , by the def-inition of VeriSPJ (Algorithm 3, Line 2, Line 3, Line 12, andLine 15) and the semantics of

VeriVec . QPSR represents a sym-bolic cardinality-preserving bijection φ between the Cartesian prod-uct of −→ e and the Cartesian product of −→ e , by the assumption that VeriCard ( e ′ , e ′ ) is a symbolic cardinality-preserving bijection forall e ∈ −→ e and e ∈ −→ e .Thus φ ∧ COND ∧ COND is a symbolic cardinality-preservingbijection by Lemma 6 applied to COND and COND and the defini-tion of VeriSPJ (Line 7 and Line 12). Furthermore, it is a sym-bolic cardinality-preserving bijection of between the selection of theCartesian products of −→ e on p and the Cartesian product of −→ e on p . Thus the QPSR returned by VeriSPJ is a symbolic cardinality-preserving bijection between Q1 and Q2 by the definition of VeriSPJ (Line 8—Line 12).

C.2.3 Symbolic bijections between aggregate queries

We now formalize and prove the correctness of

VeriAgg (seeSection 5.6).L

EMMA If VeriAgg , given aggregate queries of the form q = A GG ( e , −→ g , −→ a ) and q = A GG ( e , −→ g , −→ a ) returns aQPSR φ , and if VeriCard given e and e , only returns a QPSR if itis a symbolic cardinality-preserving bijection, then φ is a symboliccardinality-preserving bijection. P ROOF . In this proof, let

COLS and COLS denote the columnsof e and e . VeriCard , given e and e , returns a symboliccardinality-preserving bijection b , by the definition of VeriAgg (see Algorithm 4, Line 3) and the assumption that if

VeriCard returns a QPSR, then it is a symbolic cardinality-preserving bijec-tion. The QPSR returned by

VeriAgg is a symbolic cardinality-preserving bijection whose interpretation is the composition ofbijections b = b − ◦ b E ◦ b (where b − denotes the inverse ofbijection b ), which are defined as follows. Let b be the bijectionfrom each equivalence class of COLS induced −→ g to its image underthe aggregate functions −→ a , and similarly for b .Let b E be the relation from each equivalence class E of COLS induced by −→ g to equivalence class E of COLS induced by −→ g ifthere is some t ∈ E such that b ( t ) ∈ E . b E is in fact a bijection,by the definition of VeriAgg (see Line 5 and Line 8, which checksthat the equivalence classes induced by −→ g and −→ g preserve b ). b is the interpretation of the QPSR returned by VeriAgg by thesemantics of

InitAgg and

ConstAgg and the definition of

VeriAgg (see Line 6—Line 8).

C.2.4 Symbolic bijections between union queries

We now formalize and prove soundness of

VeriUnion (see Sec-tion 5.7).L

EMMA If VeriUnion , given q = U NION ( −→ e ) and q = U NION ( −→ e ) returns some QPSR φ and if VeriCard ( e ′ , e ′ ) returnsa QPSR then it is a symbolic cardinality-preserving bijection for all e ′ ∈ e and e ′ ∈ e , then φ is a symbolic cardinality-preservingbijection between q and q . lgorithm 6: SPES: an equivalence verifier.

Input :

A pair of queries Q1 and Q2 Output :

An decision if two queries are fully equivalent Procedure

SPES ( Q1 , Q2 ) Q1 ′ ← normalize ( Q1 ) ; Q2 ′ ← normalize ( Q2 ) ; φ ← VeriCard ( Q1 ′ , Q2 ′ ) ; if φ ̸ = NULL then return isValid ( φ = ⇒ ⃗ COLS = ⃗ COLS ) else return FALSE ; P ROOF . The result of

VeriVec ( −→ e , −→ e ) is a vector of QPSRs,by the definition of VeriUnion and the assumption that

VeriUnion returns some QPSR. Each of the QPSRs is a symbolic cardinality-preserving bijection, by the inductive hypothesis. Thus, the QPSR φ returned by VeriVec is a symbolic cardinality-preserving bijection,by the definition of

ConstAssign . C.2.5 Symbolic bijections between arbitrary queries

The soundness of

VeriCard follows from the correctness proper-ties satisfied for each AR constructor, given above.L

EMMA If VeriCard , given ARs Q1 and Q2 , returns someQPSR φ , then φ is a symbolic representation of a cardinality-preserving bijection between Q1 and Q2 . P ROOF . Proof is by induction on Q1 (the choice of Q1 versus Q2 is basically arbitrary). Each case on the form of Q1 is proved byapplying one the above lemmas, to the inductive hypothesis in caseswhen Q1 is a composite of sub-queries. In particular: • If Q1 is a table name , then the proof follows immediately fromLemma 8. • If Q1 is an SPJ query, the proof follows immediately fromLemma 9. • If Q1 is a aggregate query of the form A GG ( Q1 ′ , −→ g , −→ a ) ,then Q2 is a union query of the form A GG ( Q2 ′ , −→ g , −→ a ) , bythe assumption that VeriCard returns some QPSR. The prooffollows from applying Lemma 10 to the inductive hypothesison queries Q1 ′ and Q2 ′ . • If Q1 is a union query of the form U NION ( −→ Q1 ′ ) , then Q2 isa union query of the form U NION ( −→ Q2 ′ ) , by the assumptionthat VeriCard returns some QPSR. The proof follows fromapplying Lemma 11 to the inductive hypothesis on all sub-queries in Q1 ′ and Q2 ′ . C.3 Soundness of

SPESWith the soundness of

VeriCard established, we are prepared tostate and prove the soundness of SPES.Given a pair of queries Q1 and Q2 , SPES uses the procedure normalize to converts each queries to algebraic expressions, anduses a set of semantic preserving rewrite rules to normalize them.These semantic preserving rules are defined in Section 4. ThenSPES uses the procedure VeriCard to constructs the QPSR of twonormalized queries Q1 ’ and Q2 ’. If VeriCard returns the QPSR (i.e.,the QPSR is not

NULL ), then SPES returns if the formula is valid.If

VeriCard doesn’t return the QPSR (i.e., the QPSR is

NULL ), thenSPES returns

FALSE .The soundness of SPES is formalized and proved in the followingtheorem:T

HEOREM If SPES , given queries Q1 and Q2 , returns true ,then Q1 ≡ Q2 . P ROOF . Q1 ′ is equivalent to Q1 and Q2 ′ is equivalent to Q2 be-cause the normalization rules that it applies preserve semantics(Section 4). QPSR is a symbolic cardinality-preserving bijection bythe assumption that SPES returns true and thus QPSR is not null ,and by Lemma 12. The formula φ = ⇒ ⃗ COLS = ⃗ COLS is valid by the assumption that SPES returns True . Thus φ entails ⃗ COLS = ⃗ COLS by the semantics of SMT. Thus, Q1 ≡ Q2 by Lemma 7. D. COMPLETENESS OF

SPESL

EMMA

For a pair of

SELECT - PROJECT - JOIN queries Q1 and Q2 that do not have predicates and projection expressions whosesatisfiability cannot be determined by the SMT solver, then SPES iscomplete. P ROOF . Since Q1 and Q2 are SELECT - PROJECT - JOIN queries,after normalization, Q1 ’ and Q2 ’ are SPJ ARs with all inputs sub-ARs are table ARs. Based on Lemma 14, if Q1 ’ and Q2 ’ are fullyequivalent, then VeriCard returns an QPSR φ , and φ is a symbolicrepresentation of a cardinality-preserving bijection between Q1 ′ and Q2 ′ . Because Q1 and Q2 do not have predicates and projectionexpressions whose satisfiability cannot be determined by the SMTsolver. Using SMT solver to verify the following formula holdsfor QPSR is complete: φ = ⇒ ⃗ COLS = ⃗ COLS . Thus, if SPESdecides this formula does not hold. Based on the model M SPESgenerates, we can constructs the inputs tables that Q1 and Q2 eachreturn one tuple that satisfy the model M . These two tuples are notidentical. Thus, Q1 and Q2 are not fully equivalent. By contradiction,SPES is complete.L EMMA

If all input ARs for the given two SPJ ARs are tableARs and the the SMT solver can determine the satisfiability of thepredicates, then

VeriSPJ procedure is complete. P ROOF . We prove this theorem using the method of contraposi-tion. Suppose that

VeriSPJ returns

NULL for a pair of SPJ ARs. Bythe definition of

VeriSPJ , there are two cases:

Case 1:

There is no bijective map between −→ e and −→ e , such thateach pair of ARs are cardinally equivalent. Since each input ARsare table ARs, there are two possible sub-cases.For the first sub-cases, −→ e has more input table ARs than −→ e .For this sub-cases, we can always construct the input such that theintermediate join table of ⃗e has more tuples than the intermediatejoin table of ⃗e . SPES have eliminated the case where the predicatesare F alse (§4.3). Thus. these constructed tuples in intermediatejoin table all can satisfy the predicate. Thus, the two SPJ ARs returndiffering number of tuples, and are hence not cardinally equivalent.For the second sub-cases, −→ e have different table ARs than −→ e .For this sub-case, all input tables in −→ e are empty tables. And thedifferent table in −→ e has one tuple that satisfy the predicate. Thus,the two SPJ ARs return differing number of tuples, and are hencenot cardinally equivalent. Case 2: