Data Migration using Datalog Program Synthesis
Yuepeng Wang, Rushi Shah, Abby Criswell, Rong Pan, Isil Dillig
Yuepeng Wang, University of Texas at Austin, [email protected]
Rushi Shah, University of Texas at Austin, [email protected]
Abby Criswell, University of Texas at Austin, [email protected]
Rong Pan, University of Texas at Austin, [email protected]
Isil Dillig, University of Texas at Austin, [email protected]
ABSTRACT
This paper presents a new technique for migrating data between different schemas. Our method expresses the schema mapping as a Datalog program and automatically synthesizes a Datalog program from simple input-output examples to perform data migration. This approach can transform data between different types of schemas (e.g., relational-to-graph, document-to-relational) and performs synthesis efficiently by leveraging the semantics of Datalog. We implement the proposed technique as a tool called Dynamite and show its effectiveness by evaluating Dynamite on 28 realistic data migration scenarios.
PVLDB Reference Format:
Yuepeng Wang, Rushi Shah, Abby Criswell, Rong Pan, Isil Dillig. Data Migration using Datalog Program Synthesis. PVLDB, 13(xxx): xxxx-yyyy, 2020.
DOI: https://doi.org/10.14778/xxxxxxx.xxxxxxx
1. INTRODUCTION
A prevalent task in today's "big data era" is the need to transform data stored in a source schema to a different target schema. For example, this task arises frequently when parties need to exchange or integrate data that are stored in different formats. In addition, as the needs of businesses evolve over time, it may become necessary to change the schema of the underlying database or move to a different type of database altogether. For instance, there are several real-world scenarios that necessitate shifting from a relational database to a non-SQL database or vice versa.

In this paper, we present a new programming-by-example technique for automatically migrating data from one schema to another. Given a small input-output example illustrating the source and target data, our method automatically synthesizes a program that transforms data in the source format to its corresponding target format. Furthermore, unlike prior programming-by-example efforts in this space [5, 38, 48], our method can transform data between several types
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/. For any use beyond those covered by this license, obtain permission by emailing [email protected]. Copyright is held by the owner/author(s). Publication rights licensed to the VLDB Endowment.
Proceedings of the VLDB Endowment, Vol. 13, No. xxx
ISSN 2150-8097. DOI: https://doi.org/10.14778/xxxxxxx.xxxxxxx
Figure 1: Schematic workflow of Dynamite.

of database schemas, such as from a graph database to a relational one or from a SQL database to a JSON document.

One of the key ideas underlying our method is to reduce the automated data migration problem to that of synthesizing a Datalog program from examples. Inspired by the similarity between Datalog rules and popular schema mapping formalisms, such as GLAV [5, 21] and tuple-generating dependencies [37], our method expresses the correspondence between the source and target schemas as a Datalog program in which extensional relations define the source schema and intensional relations represent the target. Then, given an input-output example (I, O), finding a suitable schema mapping boils down to inferring a Datalog program P such that (I, O) is a model of P. Furthermore, because a Datalog program is executable, we can automate the data migration task by simply executing the synthesized Datalog program P on the source instance.

While we have found Datalog programs to be a natural fit for expressing data migration tasks that arise in practice, automating Datalog program synthesis turns out to be a challenging task for several reasons: First, without some a-priori knowledge about the underlying schema mapping, it is unclear what the structure of the Datalog program would look like. Second, even if we "fix" the general structure of the Datalog rules, the search space over all possible Datalog programs is still very large. Our method deals with these challenges by employing a practical algorithm that leverages both the semantics of Datalog programs as well as our target application domain. As shown schematically in Figure 1, our proposed synthesis algorithm consists of three steps:
Attribute mapping inference.
The first step of our approach is to infer an attribute mapping Ψ, which maps each attribute in the source schema to a set of attributes that it may correspond to. While this attribute mapping does not uniquely define how to transform the source database to the target one, it substantially constrains the space of possible Datalog programs that we need to consider.
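To make this step concrete, the following Python sketch infers such an attribute mapping by value containment, the criterion Dynamite uses (Section 4): a source attribute maps to any other attribute whose values form a subset of its own. The flat column representation, function name, and data are illustrative assumptions, not Dynamite's actual code.

```python
# Hypothetical sketch of attribute-mapping inference: source attribute a is
# mapped to attribute a' whenever every value of a' also occurs under a.
# Columns are modeled as flat name -> list-of-values dicts for simplicity.

def infer_attr_mapping(src_cols, tgt_cols):
    mapping = {}
    for a, vals in src_cols.items():
        # Candidates: every other source attribute plus all target attributes.
        others = {k: v for k, v in src_cols.items() if k != a}
        others.update(tgt_cols)
        mapping[a] = {b for b, w in others.items() if set(w) <= set(vals)}
    return mapping

# Columns drawn from the running university example of Section 2.
src = {"id": [1, 2], "name": ["U1", "U2"],
       "uid": [1, 2, 2, 1], "count": [10, 50, 20, 40]}
tgt = {"grad": ["U1", "U1", "U2", "U2"],
       "ug": ["U1", "U2", "U2", "U1"],
       "num": [10, 50, 20, 40]}
print(infer_attr_mapping(src, tgt))
```

On this data the sketch reproduces the mapping Ψ discussed in Section 2 (id to uid, name to grad and ug, uid to id, count to num).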
Sketch generation.
In the next step, our method leverages the inferred attribute mapping Ψ to express the search space of all possible schema mappings as a Datalog program sketch Ω, where some of the arguments of the extensional relations are unknown. While such a sketch represents a finite search space, this space is exponentially large, making it infeasible to naively enumerate all programs defined by the sketch.

Sketch completion.
The final and most crucial ingredient of our method is the sketch completion step, which performs Datalog-specific deductive reasoning to dramatically prune the search space. Specifically, given a Datalog program that does not satisfy the input-output examples, our method performs logical inference to rule out many other Datalog programs from the search space. In particular, our method leverages a semantics-preserving transformation as well as a new concept called minimal distinguishing projection (MDP) to generalize from one incorrect Datalog program to many others.
Results.
We have implemented our proposed technique in a prototype called Dynamite and evaluate it on 28 data migration tasks between real-world datasets. These tasks involve transformations between different types of source and target schemas, including relational, document, and graph databases. Our experimental results show that Dynamite can successfully automate all of these tasks using small input-output examples that consist of just a few records. Furthermore, our method performs synthesis quite fast (averaging about 7 seconds per benchmark).

Contributions.
The contributions of this paper include:

• a formulation of the automated data migration problem in terms of Datalog program synthesis;
• a new algorithm for synthesizing Datalog programs;
• an implementation of our technique in a tool called Dynamite and experimental evaluation on 28 data exchange tasks between real-world datasets.
2. OVERVIEW
In this section, we give a high-level overview of our method using a simple motivating example. Specifically, consider a document database with the following schema:

Univ: [ { id: Int, name: String, Admit: [ { uid: Int, count: Int } ] } ]

This database stores a list of universities, where each university has its own id, name, and graduate school admission information. Specifically, the admission information consists of a university identifier and the number of undergraduate students admitted from that university. Now, suppose that we need to transform this data to the following alternative schema:

Admission: [ { grad: String, ug: String, num: Int } ]

This new schema stores admission information as tuples consisting of a graduate school grad, an undergraduate school ug, and an integer num that indicates the number of undergraduates from ug that went to graduate school at grad. As an example, Figure 2(a) shows a small subset of the data in the source schema, and Figure 2(b) shows its corresponding representation in the target schema. For this example, the desired transformation from the source to the target schema can be represented using the following simple Datalog program:

Admission(grad, ug, num) :- Univ(id1, grad, v), Admit(v, id2, num), Univ(id2, ug, _).

Univ: [ { id:1, name:"U1", Admit: [ { uid:1, count:10 }, { uid:2, count:50 } ] },
        { id:2, name:"U2", Admit: [ { uid:2, count:20 }, { uid:1, count:40 } ] } ]
(a) Input Documents

Admission: [ { grad:"U1", ug:"U1", num:10 }, { grad:"U1", ug:"U2", num:50 },
             { grad:"U2", ug:"U2", num:20 }, { grad:"U2", ug:"U1", num:40 } ]
(b) Output Documents

Figure 2: Example database instances.

Here, the relation
Univ corresponds to a university entity in the source schema, and the relation Admit denotes its nested Admit attribute. In the body of the Datalog rule, the third argument of the first Univ occurrence is the same as the first argument of Admit; this indicates that (id2, num) is nested inside the university entry (id1, grad). Essentially, this Datalog rule says the following: "If there exists a pair of universities with identifiers id1, id2 and names grad, ug in the source document, and if (id2, num) is a nested attribute of id1, then there should be an Admission entry (grad, ug, num) in the target database." In what follows, we explain how Dynamite synthesizes the above Datalog program given just the source and target schemas and the input-output example from Figure 2.
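The rule can be checked by hand with ordinary relational joins. The Python snippet below is an illustrative stand-in for a Datalog engine: the tuple layouts follow the fact encoding described in Section 3 (nested Admit records carry a reference to their parent Univ record), and the reference values "v1"/"v2" are made-up identifiers.

```python
# Hand-evaluation of the rule on the Figure 2 instance via plain joins.
univ = [(1, "U1", "v1"), (2, "U2", "v2")]      # Univ(id, name, admit_ref)
admit = [("v1", 1, 10), ("v1", 2, 50),         # Admit(parent_ref, uid, count)
         ("v2", 2, 20), ("v2", 1, 40)]

# Admission(grad, ug, num) :- Univ(id1, grad, v), Admit(v, id2, num), Univ(id2, ug, _).
admission = sorted((grad, ug, num)
                   for (id1, grad, v) in univ
                   for (w, id2, num) in admit if w == v
                   for (id3, ug, _) in univ if id3 == id2)
print(admission)
```

Running this yields exactly the four Admission tuples of Figure 2(b).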
Attribute Mapping.
As mentioned in Section 1, our method starts by inferring an attribute mapping Ψ, which specifies which attribute in the source schema may correspond to which other attributes (either in the source or target). For instance, based on the example provided in Figure 2,
Dynamite infers the following attribute mapping Ψ:

id → { uid }    name → { grad, ug }    uid → { id }    count → { num }

Since the values stored in the name attribute of Univ in the source schema are the same as the values stored in the grad and ug attributes of the target schema, Ψ maps source attribute name to both target attributes grad and ug. Observe that our inferred attribute mapping can also map source attributes to other source attributes. For example, since the values in the id field of Univ are the same as the values stored in the nested uid attribute, Ψ also maps id to uid and vice versa.

Sketch Generation.
In the next step, Dynamite uses the inferred attribute mapping Ψ to generate a program sketch Ω that defines the search space over all possible Datalog programs that we need to consider. Towards this goal, we introduce an extensional (resp. intensional) relation for each document in the source (resp. target) schema, including relations for nested documents. In this case, there is a single intensional relation Admission for the target schema; thus, we introduce the following single Datalog rule sketch with the Admission relation as its head:

Admission(grad, ug, num) :- Univ(??1, ??2, v), Admit(v, ??3, ??4), Univ(??5, ??6, _), Univ(??7, ??8, _).    (1)

??1, ??3, ??5, ??7 ∈ { id1, id2, id3, uid }    ??4 ∈ { num, count }
??2, ??6, ??8 ∈ { grad, ug, name1, name2, name3 }

Here, ??i represents a hole (i.e., unknown) in the sketch, and its domain is indicated as ??i ∈ { e1, ..., en }, meaning that hole ??i can be instantiated with an element drawn from { e1, ..., en }.

(a) Actual Result        (b) Expected Result
grad ug num              grad ug num
U1   U1 10               U1   U1 10
U2   U2 20               U1   U2 50
                         U2   U2 20
                         U2   U1 40

Figure 3: Actual and expected results of program P.

To see where this sketch is coming from, we make the following observations:

• According to Ψ, the grad attribute in the target schema comes from the name attribute of Univ; thus, we must have an occurrence of Univ in the rule body.
• Similarly, the ug attribute in the target schema comes from the name attribute of Univ in the source; thus, we may need another occurrence of Univ in the body.
• Since the num attribute comes from the count attribute in the nested Admit document, the body of the Datalog rule contains Univ(??1, ??2, v), Admit(v, ??3, ??4), denoting an Admit document stored inside some Univ entity (the nesting relation is indicated through variable v).
• The domain of each hole is determined by Ψ and the number of occurrences of each relation in the Datalog sketch. For example, since there are three occurrences of Univ, we have three variables id1, id2, id3 associated with the id attribute of Univ. The domain of hole ??1 is given by { id1, id2, id3, uid } because it refers to the id attribute of Univ, and id may be an "alias" of uid according to Ψ.

Sketch Completion.
While the Datalog program sketch Ω given above looks quite simple, it actually has 64,000 possible completions; thus, a brute-force enumeration strategy is intractable. To solve this problem, Dynamite utilizes a novel sketch completion algorithm that aims to learn from failed synthesis attempts. Towards this goal, we encode all possible completions of sketch Ω as a satisfiability-modulo-theory (SMT) constraint Φ where each model of Φ corresponds to a possible completion of Ω. For the sketch from Equation 1, our SMT encoding is the following formula Φ:

(x1 = id1 ∨ x1 = id2 ∨ x1 = id3 ∨ x1 = uid) ∧ (x2 = grad ∨ x2 = ug ∨ x2 = name1 ∨ ... ∨ x2 = name3) ∧ ... ∧ (x8 = grad ∨ x8 = ug ∨ ... ∨ x8 = name3)

Here, for each hole ??i in the sketch, we introduce a variable xi and stipulate that xi must be instantiated with exactly one of the elements in its domain. (In the SMT encoding, one should think of id1, id2, etc. as constants rather than variables.) Furthermore, since Datalog requires all variables in the head to occur in the rule body, we also conjoin the following constraint with Φ to enforce this requirement:

(x2 = grad ∨ x6 = grad ∨ x8 = grad) ∧ (x2 = ug ∨ x6 = ug ∨ x8 = ug) ∧ (x4 = num)

Next, we query the SMT solver for a model of this formula. In this case, one possible model σ of Φ is:

x1 = id1 ∧ x2 = grad ∧ x3 = id1 ∧ x4 = num ∧ x5 = id1 ∧ x6 = ug ∧ x7 = id2 ∧ x8 = name2    (2)

which corresponds to the following Datalog program P:

Admission(grad, ug, num) :- Univ(id1, grad, v), Admit(v, id1, num), Univ(id1, ug, _), Univ(id2, name2, _).

However, this program does not satisfy the user-provided example because evaluating it on the input yields a result that is different from the expected one (see Figure 3). Now, in the next iteration, we want the SMT solver to return a model that corresponds to a different Datalog program. Towards this goal, one possibility would be to conjoin the negation of σ with our SMT encoding, but this would rule out just a single program in our search space.
To make synthesis more tractable, we instead analyze the root cause of failure and try to infer other Datalog programs that also do not satisfy the examples. To achieve this goal, our sketch completion algorithm leverages two key insights: First, given a Datalog program P, we can obtain a set of semantically equivalent Datalog programs by renaming variables in an equality-preserving way. Second, since our goal is to rule out incorrect (rather than just semantically equivalent) programs, we can further enlarge the set of rejected Datalog programs by performing root-cause analysis. Specifically, we express the root cause of incorrectness as a minimal distinguishing projection (MDP), which is a minimal set of attributes that distinguishes the expected output from the actual output. For instance, consider the actual and expected outputs O and O′ shown in Figure 3. An MDP for this example is the singleton num because taking the projection of O and O′ on num yields different results.

Using these two key insights, our sketch completion algorithm infers 720 other Datalog programs that are guaranteed not to satisfy the input-output example and represents them using the following SMT formula:

(x4 = num ∧ x1 ≠ x2 ∧ x1 = x3 ∧ x1 ≠ x4 ∧ x1 = x5 ∧ x1 ≠ x6 ∧ x1 ≠ x7 ∧ x1 ≠ x8 ∧ ... ∧ x7 ≠ x8)    (3)

We can use the negation of this formula as a "blocking clause" by conjoining it with the SMT encoding and rule out many infeasible solutions at the same time. After repeatedly sampling models of the sketch encoding and adding blocking clauses as discussed above, Dynamite finally obtains the following model:

x1 = id1 ∧ x2 = grad ∧ x3 = id2 ∧ x4 = num ∧ x5 = id2 ∧ x6 = ug ∧ x7 = id3 ∧ x8 = name3

which corresponds to the following Datalog program (after some basic simplification):

Admission(grad, ug, num) :- Univ(id1, grad, v), Admit(v, id2, num), Univ(id2, ug, _).
This program is consistent with the provided examples and can automate the desired data migration task.
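The MDP concept above can be illustrated with a brute-force search over attribute subsets: find a smallest set of columns on which the actual and expected outputs already differ. Dynamite's real Analyze procedure is more involved; this sketch merely demonstrates the concept on the Figure 3 tables, and the function name is our own.

```python
# Illustrative computation of a minimal distinguishing projection (MDP).
from itertools import combinations

def minimal_distinguishing_projection(actual, expected, attrs):
    def project(rows, cols):
        idx = [attrs.index(c) for c in cols]
        return {tuple(row[i] for i in idx) for row in rows}
    # Try attribute sets of increasing size; the first that distinguishes
    # the two outputs is minimal by construction.
    for size in range(1, len(attrs) + 1):
        for cols in combinations(attrs, size):
            if project(actual, cols) != project(expected, cols):
                return set(cols)
    return None  # the outputs are identical as sets of tuples

actual = [("U1", "U1", 10), ("U2", "U2", 20)]                  # Figure 3(a)
expected = [("U1", "U1", 10), ("U1", "U2", 50),
            ("U2", "U2", 20), ("U2", "U1", 40)]                # Figure 3(b)
print(minimal_distinguishing_projection(actual, expected, ["grad", "ug", "num"]))
```

On the Figure 3 data this returns the singleton {num}, matching the MDP discussed above: projecting on grad or ug alone gives {U1, U2} for both tables, but the num columns differ.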
3. PRELIMINARIES
In this section, we review some preliminary information on Datalog and our schema representation; then, we explain how to represent data migration programs in Datalog.
We represent database schemas using (non-recursive) record types, which are general enough to express a wide variety of database schemas, including XML and JSON documents and graph databases. Specifically, a schema S is a mapping from type names N to their definition:

Schema S ::= N → T
Type T ::= τ | { N1, ..., Nn }

A type definition is either a primitive type τ or a set of named attributes { N1, ..., Nn }, and the type of attribute Ni is given by the schema S. An attribute N is a primitive attribute if S(N) = τ for some primitive type τ. Given a schema S, we write PrimAttrbs(S) to denote all primitive attributes in S, and we write parent(N) = N′ if N ∈ S(N′).

Example 1. Consider the JSON document schema from our motivating example in Section 2:

Univ: [ { id: Int, name: String, Admit: [ { uid: Int, count: Int } ] } ]

In our representation, this schema is represented as follows:

S(Univ) = { id, name, Admit }    S(Admit) = { uid, count }
S(id) = S(uid) = S(count) = Int    S(name) = String
Example 2. Consider the following relational schema:

User(id: Int, name: String, address: String)

In our representation, this corresponds to the following schema:

S(User) = { id, name, address }    S(id) = Int    S(name) = S(address) = String
Example 3. Consider the following graph schema, in which Actor nodes (aid: Int, name: String) are connected to Movie nodes (mid: Int, title: String) by ACT_IN edges carrying a role: String attribute.

To convert this schema to our representation, we first introduce two attributes source and target to denote the source and target nodes of the edge. Then, the graph schema corresponds to the following mapping in our representation:

S(Movie) = { mid, title }    S(Actor) = { aid, name }    S(ACT_IN) = { source, target, role }
S(mid) = S(aid) = S(source) = S(target) = Int
S(title) = S(name) = S(role) = String
As shown in Figure 4, a Datalog program consists of a list of rules, where each rule is of the form H :- B. Here, H is referred to as the head of the rule and B is the body. The head H is a single relation of the form R(v1, ..., vn), and the body B is a collection of predicates B1, B2, ..., Bn. In the remainder of this paper, we sometimes also write

H1, ..., Hm :- B1, B2, ..., Bn.

as shorthand for m Datalog rules with the same body. Predicates that appear only in the body are known as extensional relations and correspond to known facts. Predicates that appear in the head are called intensional relations and correspond to the output of the Datalog program.
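Rules of this shape can be executed bottom-up: starting from the extensional facts, apply every rule until no new facts are derived. The following minimal evaluator is an illustrative sketch (the "?x" variable convention and data structures are our own, not Dynamite's), demonstrated on the classic transitive-closure program.

```python
# A minimal bottom-up Datalog evaluator: atoms are (relation, args) pairs,
# and strings beginning with "?" are variables.

def match(args, fact_args, subst):
    # Extend substitution subst so that args unifies with the ground fact_args.
    s = dict(subst)
    for a, v in zip(args, fact_args):
        if isinstance(a, str) and a.startswith("?"):     # variable
            if a in s and s[a] != v:
                return None
            s[a] = v
        elif a != v:                                     # constant mismatch
            return None
    return s

def solve(body, facts, subst):
    # Enumerate all substitutions satisfying every body atom.
    if not body:
        yield subst
        return
    rel, args = body[0]
    for frel, fargs in facts:
        if frel == rel and len(fargs) == len(args):
            s = match(args, fargs, subst)
            if s is not None:
                yield from solve(body[1:], facts, s)

def evaluate(rules, facts):
    facts = set(facts)
    changed = True
    while changed:                                       # naive fixpoint iteration
        changed = False
        for (hrel, hargs), body in rules:
            for s in list(solve(body, facts, {})):
                fact = (hrel, tuple(s.get(a, a) for a in hargs))
                if fact not in facts:
                    facts.add(fact)
                    changed = True
    return facts

# Transitive closure, the classic (recursive) example:
rules = [(("path", ("?x", "?y")), [("edge", ("?x", "?y"))]),
         (("path", ("?x", "?z")), [("edge", ("?x", "?y")), ("path", ("?y", "?z"))])]
result = evaluate(rules, {("edge", (1, 2)), ("edge", (2, 3))})
print(sorted(f for f in result if f[0] == "path"))
```

The fixpoint loop computes the least model: here it derives path(1,2), path(2,3), and then path(1,3).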
Semantics.
The semantics of Datalog programs are typically given using Herbrand models of first-order logic formulas [13]. In particular, each Datalog rule R of the form H(x⃗) :- B(x⃗, y⃗) corresponds to a first-order formula ⟦R⟧ = ∀x⃗, y⃗. B(x⃗, y⃗) → H(x⃗), and the semantics of the Datalog program can be expressed as the conjunction of each rule-level formula. Then, given a Datalog program P and an input I (i.e., a set of ground formulas), the output corresponds to the least Herbrand model of ⟦P⟧ ∧ I.

Program ::= Rule+
Rule ::= Head :- Body.
Head ::= Pred
Body ::= Pred+
Pred ::= R(v+)    v ∈ Variable    R ∈ Relation

Figure 4: Syntax of Datalog programs.

We now discuss how to perform data migration using Datalog. The basic idea is as follows: First, given a source database instance D over schema S, we express D as a collection of Datalog facts over extensional relations R. Then, we express the target schema S′ using intensional relations R′ and construct a set of (non-recursive) Datalog rules, one for each intensional relation in R′. Finally, we run this Datalog program and translate the resulting facts into the target database instance. Since programs can be evaluated using an off-the-shelf Datalog solver, we only explain how to translate between database instances and Datalog facts.

From instances to facts.
Given a database instance D over schema S, we introduce an extensional relation symbol R_N for each record type with name N in S and assign a unique identifier Id(r) to every record r in the database instance. Then, for each instance r = { a1: v1, ..., an: vn } of record type N, we generate a fact R_N(c0, c1, ..., cn) where:

c_i = Id(parent(r)), if i = 0 and r is a nested record
c_i = v_i, if S(a_i) is a primitive type
c_i = Id(r), if S(a_i) is a record type

Intuitively, relation R_N has an extra argument that keeps track of its parent record in the database instance if N is nested in another record type. In this case, the first argument of R_N denotes the unique identifier for the record in which it is nested.

Example 4. For the JSON document from Figure 2(a), our method generates the following Datalog facts

Univ(1, "U1", id1)    Univ(2, "U2", id2)
Admit(id1, 1, 10)    Admit(id1, 2, 50)    Admit(id2, 2, 20)    Admit(id2, 1, 40)

where id1 and id2 are unique identifiers.

From facts to instances.
We convert Datalog facts to the target database instance using the inverse procedure. Specifically, given a Datalog fact R_N(c1, ..., cn) for record type N: { a1, ..., an }, we create a record instance using a function BuildRecord(R_N, N) = { a1: v1, ..., an: vn } where

v_i = c_i, if S(a_i) is a primitive type
v_i = BuildRecord(R_{a_i}, a_i), if S(a_i) is a record type and the first argument of R_{a_i} is c_i

Observe that the BuildRecord procedure builds the record recursively by chasing parent identifiers into other relations.
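The two translations above can be sketched together as a round trip. In the code below, to_facts adds a leading parent-identifier argument to nested records, and build_records inverts it by chasing those identifiers; the identifier naming ("r1", "r2", ...) and the schema encoding are our own simplifications, not Dynamite's.

```python
# Sketch of the instance/fact translation in both directions.
import itertools

def to_facts(rel, rows, parent=None, out=None, fresh=None):
    if out is None:
        out, fresh = [], itertools.count(1)
    for row in rows:
        rid = "r%d" % next(fresh)          # Id(r): unique record identifier
        args = [] if parent is None else [parent]
        nested = []
        for attr, val in row.items():
            if isinstance(val, list):      # record-typed attribute: c_i = Id(r)
                nested.append((attr, val))
                args.append(rid)
            else:                          # primitive attribute: c_i = v_i
                args.append(val)
        out.append((rel, tuple(args)))
        for attr, val in nested:
            to_facts(attr, val, rid, out, fresh)
    return out

def build_records(facts, rel, attrs, nested):
    # nested: record-typed attribute -> (child relation, child attribute names)
    recs = []
    for r, args in facts:
        if r != rel:
            continue
        rec = {}
        for attr, c in zip(attrs, args):
            if attr in nested:
                child_rel, child_attrs = nested[attr]
                rec[attr] = [dict(zip(child_attrs, cargs[1:]))
                             for cr, cargs in facts
                             if cr == child_rel and cargs[0] == c]
            else:
                rec[attr] = c
        recs.append(rec)
    return recs

docs = [{"id": 1, "name": "U1",
         "Admit": [{"uid": 1, "count": 10}, {"uid": 2, "count": 50}]}]
facts = to_facts("Univ", docs)
round_trip = build_records(facts, "Univ", ["id", "name", "Admit"],
                           {"Admit": ("Admit", ["uid", "count"])})
print(facts)
print(round_trip == docs)
```

Note how the parent identifier "r1" appears both as the Univ record's third argument and as the first argument of its nested Admit facts, which is exactly what lets build_records reassemble the document.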
4. DATALOG PROGRAM SYNTHESIS
In this section, we describe our algorithm for automatically synthesizing Datalog programs from an input-output example E = (I, O). Here, I corresponds to an example of the database instance in the source schema, and O demonstrates the desired target instance. We start by giving a high-level overview of the synthesis algorithm and then explain each of the key ingredients in more detail.

The top-level algorithm for synthesizing Datalog programs is summarized in Algorithm 1. The Synthesize procedure takes as input a source schema S, a target schema S′, and an input-output example E = (I, O). The return value is either a Datalog program P such that evaluating P on I yields O (i.e., ⟦P⟧_I = O) or ⊥ to indicate that the desired data migration task cannot be represented as a Datalog program.

Algorithm 1 Synthesizing Datalog programs
1: procedure Synthesize(S, S′, E)
Input: Source schema S, target schema S′, example E = (I, O)
Output: Datalog program P or ⊥ to indicate failure
2:   Ψ ← InferAttrMapping(S, S′, E);
3:   Ω ← SketchGen(Ψ, S, S′);
4:   Φ ← Encode(Ω);
5:   while SAT(Φ) do
6:     σ ← GetModel(Φ);
7:     P ← Instantiate(Ω, σ);
8:     O′ ← ⟦P⟧_I;
9:     if O′ = O then return P;
10:    Φ ← Φ ∧ Analyze(σ, O′, O);
11:  return ⊥;
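The loop of Algorithm 1 can be mimicked without an SMT solver by enumerating hole assignments and maintaining an explicit set of blocked assignments. In this toy rendering, the blocked set plays the role of blocking clauses, and the checker stands in for example-based verification plus Analyze; all names and the demo data are illustrative, not part of Dynamite.

```python
# Toy version of the Synthesize loop: enumerate sketch completions, test each
# candidate, and skip assignments blocked by earlier failure analysis.
from itertools import product

def synthesize(holes, domains, check):
    blocked = set()
    for values in product(*(domains[h] for h in holes)):
        if values in blocked:
            continue                      # pruned by an earlier "blocking clause"
        model = dict(zip(holes, values))
        ok, newly_blocked = check(model)  # stands in for evaluation + Analyze
        if ok:
            return model
        blocked |= newly_blocked
    return None                           # no completion satisfies the example

# Tiny demo: two holes; the checker accepts one assignment and, on failure,
# "generalizes" by blocking every assignment sharing the failing first hole.
domains = {"x1": ["id1", "id2"], "x2": ["grad", "ug"]}

def check(model):
    if model == {"x1": "id2", "x2": "ug"}:
        return True, set()
    if model["x1"] == "id1":             # root cause: wrong choice for x1
        return False, {("id1", v) for v in domains["x2"]}
    return False, set()

print(synthesize(["x1", "x2"], domains, check))
```

The point of the design is the same as in Algorithm 1: a single failure can eliminate many candidates at once, so the enumeration visits far fewer than |domain|^n assignments.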
Instantiate (Ω , σ ); O (cid:48) ← (cid:74) P (cid:75) I ; if O (cid:48) = O then return P ; Φ ← Φ ∧ Analyze ( σ, O (cid:48) , O ); return ⊥ ; As shown in Algorithm 1, the
Synthesize procedure firstinvokes the
InferAttrMapping procedure (line 2) to inferan attribute mapping Ψ. Specifically, Ψ is a mapping fromeach a ∈ PrimAttrbs ( S ) to a set of attributes { a , . . . , a n } where a i ∈ PrimAttrbs ( S ) ∪ PrimAttrbs ( S (cid:48) ) such that: a (cid:48) ∈ Ψ( a ) ⇔ Π a (cid:48) ( D ) ⊆ Π a ( I )where D stands for either I or O . Thus, InferAttrMap-ping is conservative and maps a source attribute a to an-other attribute a (cid:48) if the values contained in a (cid:48) are a subsetof those contained in a .Next, the algorithm invokes SketchGen (line 3) to gener-ate a Datalog program sketch Ω based on Ψ. As mentionedin Section 2, a sketch Ω is a Datalog program with unknownarguments in the rule body, and the sketch also determinesthe domain for each unknown. Thus, if the sketch contains n unknowns, each with k elements in its domain, then thesketch encodes a search space of k n possible programs.Lines 4-10 of the Synthesize algorithm perform lazy enu-meration over possible sketch completions. Given a sketchΩ, we first generate an SMT formula Φ whose models cor-respond to all possible completions of Ω (line 4). Then, theloop in lines 5-10 repeatedly queries a model of Φ (line 6),tests if the corresponding Datalog program is consistent withthe example (lines 7-9), and adds a blocking clause to Φ if itis not (line 10). The blocking clause is obtained via the callto the
Analyze procedure, which performs Datalog-specificdeductive reasoning to infer a whole set of programs thatare guaranteed not to satisfy the examples.In the remainder of this section, we explain the sketchgeneration and completion procedures in more detail.
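The k^n search-space estimate is easy to check against the concrete sketch of Section 2: four id-holes with 4 candidates each, one count-hole with 2, and three name-holes with 5 multiply out to the 64,000 completions quoted there.

```python
# Search-space size = product of the hole-domain sizes.
# Domain sizes taken from the sketch in Equation 1 of Section 2.
from math import prod

domain_sizes = [4, 4, 4, 4,   # ??1, ??3, ??5, ??7 over {id1, id2, id3, uid}
                2,            # ??4 over {num, count}
                5, 5, 5]      # ??2, ??6, ??8 over {grad, ug, name1..name3}
print(prod(domain_sizes))     # 64000
```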
Given an attribute mapping Ψ, the goal of sketch generation is to construct the skeleton of the target Datalog program. Our sketch language is similar to the Datalog syntax in Figure 4, except that it allows holes (denoted by ??) as special constructs indicating unknown expressions. As summarized in Algorithm 2, the SketchGen procedure iterates over each top-level record in the target schema and, for each record type, it generates a Datalog rule sketch using the helper procedure GenRuleSketch; the property of the generated sketch is characterized and proved in Appendix A. Conceptually, GenRuleSketch performs the following tasks: First, it generates a set of intensional predicates for each top-level
Algorithm 2 Generating Datalog program sketches
1: procedure SketchGen(Ψ, S, S′)
Input: Attribute mapping Ψ, source schema S, target schema S′
Output: Program sketch Ω
2:   Ω ← ∅;
3:   for each top-level record type N ∈ S′ do
4:     R ← GenRuleSketch(Ψ, S, S′, N);
5:     Ω ← Ω ∪ { R };
6:   return Ω;
7: procedure GenRuleSketch(Ψ, S, S′, N)
8:   H ← GenIntensionalPreds(S′, N); B ← ∅;
9:   for each a ∈ dom(Ψ) do
10:    repeat |{ a′ | a′ ∈ PrimAttrbs(N) ∧ a′ ∈ Ψ(a) }| times
11:      M ← RecName(a);
12:      B ← B ∪ GenExtensionalPreds(S, M);
13:  for each ??_a ∈ Holes(B) do
14:    V ← { v_a′ | a′ ∈ PrimAttrbs(N) ∧ a′ ∈ Ψ(a) };
15:    for each a′ ∈ (Ψ(a) ∪ { a }) ∩ PrimAttrbs(S) do
16:      n ← CopyNum(B, RecName(a′));
17:      V ← V ∪ { v_a′1, ..., v_a′n };
18:    B ← B[??_a ↦ ??_a ∈ V];
19:  return H :- B.;

(InPrim)       If S′(N) ∈ PrimType, then S′ ⊢ N ⇝ (v_N, ∅).

(InRecNested)  If S′(N) = { a1, ..., an }, isNested(N), and S′ ⊢ ai ⇝ (vi, Hi) for i = 1, ..., n,
               then S′ ⊢ N ⇝ (v_N, { R_N(v_N, v1, ..., vn) } ∪ H1 ∪ ... ∪ Hn).

(InRec)        If S′(N) = { a1, ..., an }, ¬isNested(N), and S′ ⊢ ai ⇝ (vi, Hi) for i = 1, ..., n,
               then S′ ⊢ N ⇝ (_, { R_N(v1, ..., vn) } ∪ H1 ∪ ... ∪ Hn).

Figure 5: Inference rules describing
GenIntensionalPreds.

record in the target schema (line 8). The intensional predicates do not contain any unknowns and only appear in the head of the Datalog rules. Next, the loop (lines 9–12) constructs the skeleton of each Datalog rule body by generating extensional predicates for the relevant source record types. The extensional predicates do contain unknowns, and there can be multiple occurrences of a relation symbol in the body. Finally, the loop in lines 13–18 generates the domain for each unknown used in the rule body.
Head generation.
Given a top-level record type N in the target schema, the procedure GenIntensionalPreds generates the head of the corresponding Datalog rule for N. If N does not contain any nested records, then the head consists of a single predicate, but, in general, the head contains as many predicates as are (transitively) nested in N.

In more detail, Figure 5 presents the GenIntensionalPreds procedure as inference rules that derive judgments of the form S′ ⊢ N ⇝ (v, H), where H corresponds to the head of the Datalog rule for record type N. As expected, these rules are recursive and build the predicate set H for N from those of its nested records. Specifically, given a top-level record N with attributes a1, ..., an, the rule InRec first generates predicates Hi for each attribute ai and then introduces an additional relation R_N(v1, ..., vn) for N itself. Predicate generation for nested relations (rule InRecNested) is similar, but we introduce a new variable v_N that is used for connecting N to its parent relation. The InPrim rule corresponds to the base case of GenIntensionalPreds and generates variables for attributes of primitive type.

(ExPrim)       If S(N) ∈ PrimType, then S ⊢ N ↪ (??_N, ∅).

(ExRecNested)  If S(N) = { a1, ..., an }, v_N is fresh, S ⊢ ai ↪ (hi, ∅) for i = 1, ..., n,
               N′ = parent(N), and S ⊢ N′ ↪ (_, B′),
               then S ⊢ N ↪ (v_N, { R_N(v_N, h1, ..., hn) } ∪ B′).

(ExRec)        If S(N) = { a1, ..., an }, ¬isNested(N), and S ⊢ ai ↪ (hi, ∅) for i = 1, ..., n,
               then S ⊢ N ↪ (_, { R_N(h1, ..., hn) }).

Figure 6: Inference rules for GenExtensionalPreds.
Body sketch generation.
We now consider sketch generation for the body of each Datalog rule (lines 9–12 in Algorithm 2). Given a record type N in the source schema and its corresponding predicate(s) R_N, the loop in lines 9–12 of Algorithm 2 generates as many copies of R_N in the rule body as there are head attributes that "come from" R_N according to Ψ. Specifically, Algorithm 2 invokes a procedure called GenExtensionalPreds, described in Figure 6, to generate each copy of the extensional predicate symbol.

Given a record type N in the source schema, GenExtensionalPreds generates predicates up until the top-level record that contains N. The rules in Figure 6 are of the form S ⊢ N ↪ (h, B), where B is the sketch body for record type N. The ExPrim rule is the base case to generate sketch holes for primitive attributes. Given a record N with attributes a1, ..., an and its parent N′, the rule ExRecNested recursively generates the body predicates B′ for the parent record N′ and adds an additional predicate R_N(v_N, h1, ..., hn) for N itself. Here v_N is a variable for connecting N and its parent N′, and hi is the hole or variable for attribute ai. In the case where N is a top-level record, the ExRec rule generates a singleton predicate R_N(h1, ..., hn).

Example 5. Suppose we want to generate the body sketch for the rule associated with record type T: { a′: Int, b′: Int } in the target schema. Also, suppose we are given the attribute mapping Ψ where Ψ(a) = a′ and Ψ(b) = b′, and source attributes a, b belong to the following record type in the source schema: C: { a: Int, D: { b: Int } }. According to Ψ, a′ comes from attribute a of record type C in the source schema, so we have a copy of R_C in the sketch body. Based on the rules of Figure 6, we generate predicate R_C(??_a1, v_D1), where ??_a1 is the hole for attribute a and v_D1 is a fresh variable. Similarly, since b′ comes from attribute b of record type D, we generate predicates R_C(??_a2, v_D2) and R_D(v_D2, ??_b1). Putting them together, we obtain the following sketch body:

R_C(??_a1, v_D1), R_C(??_a2, v_D2), R_D(v_D2, ??_b1)

Domain generation.
Having constructed the skeleton of the Datalog program, we still need to determine the set of variables that each hole in the sketch can be instantiated with. Toward this goal, the last part of GenRuleSketch (lines 13–18 in Algorithm 2) constructs the domain V for each hole as follows: First, for each attribute a of source relation R_N, we introduce as many variables v_a¹, ..., v_aᵏ as there are copies of R_N. Next, for the purposes of this discussion, let us say that attributes a and b “alias” each other if b ∈ Ψ(a) or vice versa. Then, given a hole ??_x associated with attribute x, the domain of ??_x consists of all the variables associated with attribute x or one of its aliases.

Example 6. Consider the same schemas and attribute mapping from Example 5 and the following body sketch:

R_C(??_a, v_D), R_C(??_a, v_D), R_D(v_D, ??_b)

Here, we have ??_a ∈ {v_a′, v_a¹, v_a²} and ??_b ∈ {v_b′, v_b}.

While the sketch generated by Algorithm 2 defines a finite search space of Datalog programs, this search space is still exponentially large. Thus, rather than performing naive brute-force enumeration, our sketch completion algorithm combines enumerative search with Datalog-specific deductive reasoning to learn from failed synthesis attempts. As explained in Section 4.1, the basic idea is to generate an SMT encoding of all possible sketch completions and then iteratively add blocking clauses to rule out incorrect Datalog programs. In the remainder of this section, we discuss how to generate the initial SMT encoding as well as the
Analyze procedure for generating useful blocking clauses.
Sketch encoding.
Given a Datalog program sketch Ω, our initial SMT encoding is constructed as follows: First, for each hole ??ᵢ in the sketch, we introduce an integer variable xᵢ, and for every variable vⱼ in the domain of some hole, we introduce a unique integer constant denoted Const(vⱼ). Then, our SMT encoding stipulates the following constraints to enforce that the sketch completion is well-formed:

• Every hole must be instantiated: For each hole of the form ??ᵢ ∈ {v₁, ..., vₙ}, we add a constraint

    ⋁_{j=1}^{n} xᵢ = Const(vⱼ)

• Head variables must appear in the body: In a well-formed Datalog program, every head variable must appear in the body. Thus, for each head variable v, we add:

    ⋁_i xᵢ = Const(v)    where v is in the domain of ??ᵢ

Since there is a one-to-one mapping between integer constants in the SMT encoding and sketch variables, each model of the SMT formula corresponds to a Datalog program.
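To make this concrete, the following self-contained Python sketch enumerates the well-formed completions of the example body sketch above. It checks the two constraints directly instead of handing them to an SMT solver, and the hole and variable names (e.g., `??a1`, `v_a'`) are our own illustrative choices, not Dynamite's actual identifiers:

```python
from itertools import product

# Holes of the body sketch R_C(??a1, vD), R_C(??a2, vD'), R_D(vD', ??b),
# with the domains computed during domain generation (names are ours).
holes = {
    "??a1": ["v_a'", "v_a1", "v_a2"],
    "??a2": ["v_a'", "v_a1", "v_a2"],
    "??b":  ["v_b'", "v_b"],
}
head_vars = ["v_a'", "v_b'"]  # variables appearing in the rule head

def well_formed(assignment):
    # Second constraint: every head variable must appear somewhere in the body.
    return all(v in assignment.values() for v in head_vars)

names = list(holes)
completions = [
    dict(zip(names, pick))
    for pick in product(*(holes[n] for n in names))  # first constraint: pick from domains
    if well_formed(dict(zip(names, pick)))
]
# Of the 3 * 3 * 2 = 18 candidate assignments, only 5 are well-formed.
```

A real implementation would express the same two constraint families as integer equalities over `Const(v)` values and let an SMT solver such as Z3 produce models one at a time.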
Adding blocking clauses.
Given a Datalog program P that does not satisfy the examples (I, O), our top-level synthesis procedure (Algorithm 1) invokes a function called Analyze to find useful blocking clauses to add to the SMT encoding. This procedure is summarized in Algorithm 3 and is built on two key insights. The first key insight is that the semantics of a Datalog program is unchanged under an equality-preserving renaming of variables:

Theorem. Let P be a Datalog program over variables X and let σ̂ be an injective substitution from X to another set of variables Y. Then, we have P ≅ P σ̂.

Proof. See Appendix A.

To see how this theorem is useful, let σ be a model of our SMT encoding. In other words, σ is a mapping from holes in the Datalog sketch to variables V. Now, let σ̂ be an injective renaming of variables in V. Then, using the above theorem, we know that any other assignment σ′ = σ σ̂ is also guaranteed to result in an incorrect Datalog program. Based on this insight, we can generalize from the specific assignment σ to a more general class of incorrect assignments as follows: if a hole is not assigned to a head variable, then it can be assigned to any variable in its domain as long as it respects the equalities and disequalities in σ. Concretely, given assignment σ, we generalize it as follows:

    Generalize(σ) = ⋀_{xᵢ ∈ dom(σ)} α(xᵢ, σ),  where

    α(x, σ) = { x = σ(x)                    if σ(x) is a head variable
              { ⋀_{xⱼ ∈ dom(σ)} x ⋆ xⱼ     otherwise

Here the binary operator ⋆ is defined to be equality if σ assigns both x and xⱼ to the same value, and disequality otherwise. Thus, rather than ruling out just the current assignment σ, we can instead use ¬Generalize(σ) as a much more general blocking clause that rules out several equivalent Datalog programs at the same time.
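As a concrete illustration, the Python sketch below (our own; hole names `x1..x4` and variable names are hypothetical) computes the equality pattern behind Generalize(σ) and uses it to decide whether a candidate assignment is pruned by the resulting blocking clause:

```python
from itertools import combinations

def generalize(sigma, head_vars):
    """Keep concrete values only for holes mapped to head variables;
    summarize every other hole by its (dis)equality pattern."""
    kept = {x: v for x, v in sigma.items() if v in head_vars}
    eqs, neqs = [], []
    for xi, xj in combinations(sigma, 2):
        if sigma[xi] in head_vars and sigma[xj] in head_vars:
            continue  # both already fixed by `kept`
        (eqs if sigma[xi] == sigma[xj] else neqs).append((xi, xj))
    return kept, eqs, neqs

def pruned(candidate, clause):
    """True iff `candidate` satisfies Generalize(sigma), i.e. it is
    blocked by the clause ~Generalize(sigma)."""
    kept, eqs, neqs = clause
    return (all(candidate[x] == v for x, v in kept.items())
            and all(candidate[i] == candidate[j] for i, j in eqs)
            and all(candidate[i] != candidate[j] for i, j in neqs))

sigma = {"x1": "id1", "x2": "grad", "x3": "id1", "x4": "uid"}
clause = generalize(sigma, head_vars={"grad"})
# An injective renaming of the non-head variables yields an equivalent
# (and equally incorrect) program, so it is blocked by the same clause:
renamed = {"x1": "id2", "x2": "grad", "x3": "id2", "x4": "id3"}
```

Here both `sigma` itself and `renamed` are pruned by the single clause, while an assignment with a different join pattern (say, one where `x1` and `x3` differ) is not.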
Example 7. Consider again the sketch from Section 2:

Admission(grad, ug, num) :- Univ(??₁, ??₂, v₁), Admit(v₁, ??₃, ??₄), Univ(??₅, ??₆, v₂), Univ(??₇, ??₈, v₃).
    ??₁, ??₃, ??₅, ??₇ ∈ {id₁, id₂, id₃, uid}
    ??₄ ∈ {num, count}
    ??₂, ??₆, ??₈ ∈ {grad, ug, name₁, name₂, name₃}

Suppose the variable for ??ᵢ is xᵢ and the assignment σ is:

x₁ = id₁ ∧ x₂ = grad ∧ x₃ = id₁ ∧ x₄ = num ∧ x₅ = id₂ ∧ x₆ = ug ∧ x₇ = id₃ ∧ x₈ = name₁

Since grad, ug, and num occur in the head, Generalize(σ) yields the following formula:

x₂ = grad ∧ x₄ = num ∧ x₆ = ug ∧ x₁ ≠ x₂ ∧ x₁ = x₃ ∧ x₁ ≠ x₄ ∧ x₁ ≠ x₅ ∧ x₁ ≠ x₆ ∧ x₁ ≠ x₇ ∧ ··· ∧ x₇ ≠ x₈    (4)

The other key insight underlying our sketch completion algorithm is that we can achieve even greater generalization power using the concept of minimal distinguishing projections (MDPs), defined as follows:

Definition (MDP). We say that a set of attributes A is a minimal distinguishing projection for Datalog program P and input-output example (I, O) if (1) Π_A(O) ≠ Π_A(P(I)), and (2) for any A′ ⊂ A, we have Π_{A′}(O) = Π_{A′}(P(I)).

In other words, the first condition ensures that, by just looking at attributes A, we can tell that program P does not satisfy the examples. On the other hand, the second condition ensures that A is minimal.

To see why minimal distinguishing projections are useful for pruning a larger set of programs, recall that our Generalize(σ) function from earlier retains a variable assignment x ↦ v if v corresponds to a head variable. However, if v does not correspond to an attribute in the MDP, then we will still obtain an incorrect program even if we rename x to something else; thus Generalize can drop the assignments to head variables that are not in the MDP.
Thus, given an MDP ϕ, we can obtain an improved generalization procedure Generalize(σ, ϕ) by using the following α(x, σ, ϕ) function instead of α(x, σ) from earlier:

    α(x, σ, ϕ) = { x = σ(x)                    if σ(x) ∈ ϕ
                 { ⋀_{xⱼ ∈ dom(σ)} x ⋆ xⱼ     otherwise

Because not all head variables correspond to an MDP attribute, performing generalization this way allows us to obtain a better blocking clause that rules out many more Datalog programs in one iteration.

Algorithm 3
Analyzing outputs to prune search space

1:  procedure Analyze(σ, O′, O)
    Input: Model σ, actual output O′, expected output O
    Output: Blocking clause φ
2:    φ ← true;
3:    Δ ← MDPSet(O′, O);
4:    for each ϕ ∈ Δ do
5:      ψ ← true;
6:      for each (xᵢ, xⱼ) ∈ dom(σ) × dom(σ) do
7:        if σ(xᵢ) = σ(xⱼ) then ψ ← ψ ∧ (xᵢ = xⱼ);
8:        else ψ ← ψ ∧ (xᵢ ≠ xⱼ);
9:      for each xᵢ ∈ dom(σ) do if σ(xᵢ) ∈ ϕ then ψ ← ψ ∧ (xᵢ = σ(xᵢ));
10:     φ ← φ ∧ ¬ψ;
11:   return φ;

Example 8. Consider the same sketch and assignment σ from Example 7, but now suppose we are given an MDP ϕ = {num}. Then the function Generalize(σ, ϕ) yields the following more general formula:

x₄ = num ∧ x₁ ≠ x₂ ∧ x₁ = x₃ ∧ x₁ ≠ x₄ ∧ x₁ ≠ x₅ ∧ x₁ ≠ x₆ ∧ x₁ ≠ x₇ ∧ x₁ ≠ x₈ ∧ ··· ∧ x₇ ≠ x₈    (5)

Note that (5) is more general (i.e., weaker) than (4) because it drops the constraints x₂ = grad and x₆ = ug. Therefore, the negation of (5) is a better blocking clause than the negation of (4), since it rules out more programs in one step.

Based on this discussion, we now explain the full
Analyze procedure in Algorithm 3. This procedure takes as input a model σ of the SMT encoding and the actual and expected outputs O′, O. Then, at line 3, it invokes the MDPSet procedure to obtain a set Δ of minimal distinguishing projections and uses each MDP ϕ ∈ Δ to generate a blocking clause as discussed above (lines 6–10).

The MDPSet procedure is shown in Algorithm 4 and uses breadth-first search to compute the set of all minimal distinguishing projections. Specifically, it initializes a queue W with a singleton projection {a} for each attribute a in the output (lines 2–5). Then, it repeatedly dequeues a projection L from W and checks whether L is an MDP (lines 6–14). In particular, if L can distinguish outputs O′ and O (line 14) and there is no existing projection L″ in the current MDP set Δ such that L″ ⊆ L, then L is an MDP. If L cannot distinguish outputs O′ and O (line 8), we enqueue all of its extensions L′ with one more attribute than L and move on to the next projection in queue W.

Example 9. Let us continue with Example 8 to illustrate how to prune incorrect Datalog programs using multiple MDPs. Suppose we obtain the MDP set Δ = {ϕ₁, ϕ₂}, where ϕ₁ = {num} and ϕ₂ = {grad, ug}. In addition to Generalize(σ, ϕ₁) (see formula (5) of Example 8), we also compute Generalize(σ, ϕ₂) as:

x₂ = grad ∧ x₆ = ug ∧ x₁ ≠ x₂ ∧ x₁ = x₃ ∧ x₁ ≠ x₄ ∧ x₁ ≠ x₅ ∧ x₁ ≠ x₇ ∧ x₁ ≠ x₈ ∧ ··· ∧ x₇ ≠ x₈

By adding both blocking clauses ¬Generalize(σ, ϕ₁) and ¬Generalize(σ, ϕ₂), we can prune even more incorrect Datalog programs.

Theorem. Let φ be a blocking clause returned by the call to Analyze at line 10 of Algorithm 1. If σ is a model of ¬φ, then σ corresponds to an incorrect Datalog program.

Proof.
See Appendix A.

Algorithm 4
Computing a set of MDPs

1:  procedure MDPSet(O′, O)
    Input: Actual output O′, expected output O
    Output: A set of minimal distinguishing projections Δ
2:    Δ ← ∅; V ← ∅; W ← EmptyQueue();
3:    for each a ∈ Attributes(O′) do
4:      W.Enqueue({a});
5:      V ← V ∪ {{a}};
6:    while ¬W.IsEmpty() do
7:      L ← W.Dequeue();
8:      if Π_L(O′) = Π_L(O) then
9:        for each a′ ∈ Attributes(O′) \ L do
10:         L′ ← L ∪ {a′};
11:         if L′ ∉ V then
12:           W.Enqueue(L′);
13:           V ← V ∪ {L′};
14:     else if ∄ L″ ∈ Δ. L″ ⊆ L then
15:       Δ ← Δ ∪ {L};
16:   return Δ;
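The BFS in Algorithm 4 can be sketched in a few lines of Python. This is our own re-implementation for illustration: relations are modeled as sets of tuples and attributes as column names, and the example data mirrors the Admission relation from the running example:

```python
from collections import deque

def project(rel, attrs, cols):
    """Relational projection of a set of tuples onto the named attributes."""
    idx = [cols.index(a) for a in attrs]
    return {tuple(t[i] for i in idx) for t in rel}

def mdp_set(actual, expected, cols):
    """Breadth-first search for all minimal distinguishing projections."""
    mdps, visited, queue = [], set(), deque()
    for a in cols:                      # seed with singleton projections
        queue.append((a,))
        visited.add((a,))
    while queue:
        L = queue.popleft()
        if project(actual, L, cols) == project(expected, L, cols):
            for a in cols:              # not distinguishing: grow by one attribute
                if a not in L:
                    L2 = tuple(sorted(L + (a,)))
                    if L2 not in visited:
                        visited.add(L2)
                        queue.append(L2)
        elif not any(set(M) <= set(L) for M in mdps):
            mdps.append(L)              # distinguishing and minimal so far
    return mdps

cols = ("grad", "ug", "num")
# A wrong join produced the wrong count, so only {num} distinguishes the outputs:
result = mdp_set({("CS", "EE", 20)}, {("CS", "EE", 10)}, cols)
```

If instead the grad and ug columns are swapped while num is correct, the search returns the two singleton MDPs {grad} and {ug}, matching the multiple-MDP scenario of Example 9.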
5. IMPLEMENTATION
We have implemented the proposed technique as a new tool called
Dynamite . Internally,
Dynamite uses the Z3 solver [17] for answering SMT queries and leverages the Souffle framework [29] for evaluating Datalog programs. In the remainder of this section, we discuss some extensions over the synthesis algorithm described in Section 4.
Interactive mode.
In Section 4, we presented our technique as returning a single program that is consistent with a given input-output example. However, in this non-interactive mode, Dynamite does not guarantee the uniqueness of the program consistent with the given example. To address this potential usability issue,
Dynamite can also be used in a so-called interactive mode where
Dynamite iteratively queries the user for more examples in order to resolve ambiguities. Specifically, when used in this interactive mode, Dynamite first checks if there are multiple programs P, P′ that are consistent with the provided examples (I, O) and, if so, Dynamite identifies a small differentiating input I′ such that P and P′ yield different outputs on I′. Then, Dynamite asks the user to provide the corresponding output for I′. More details on how we implement this interactive mode can be found in Appendix B.

Example.
Suppose the source database contains two relations Employee(name, deptId) and Department(id, deptName), and we want to obtain the relation WorksIn(name, deptName) by joining Employee and Department on deptId = id and then applying a projection. Suppose the user only provides the input example Employee(Alice, 11), Department(11, CS) and the output WorksIn(Alice, CS). Now Dynamite may return one of the following results:

(1) WorksIn(x, y) :- Employee(x, z), Department(z, y).
(2) WorksIn(x, y) :- Employee(x, z), Department(w, y).

Note that both Datalog programs are consistent with the input-output example, but only program (1) is the transformation the user wants. Since the program returned by Dynamite depends on the model sampled by the SMT solver, it is possible that Dynamite returns the incorrect solution (2) instead of the desired program (1). Using Dynamite in the interactive mode solves this problem. In this mode, Dynamite searches for an input that distinguishes the two programs shown above. In this case, such a distinguishing input is
Name      Size    Description
Yelp      4.7GB   Business and reviews from Yelp
IMDB      6.3GB   Movie and crew info from IMDB
Mondial   3.7MB   Geography information
DBLP      2.0GB   Publication records from DBLP
MLB       0.9GB   Pitch data of Major League Baseball
Airbnb    0.4GB   Berlin Airbnb data
Patent    1.7GB   Patent Litigation Data 1963–2015
Bike      2.7GB   Bike trip data in Bay Area
Tencent   1.0GB   User followers in Tencent Weibo
Retina    0.1GB   Biological info of mouse retina
Movie     0.1GB   Movie ratings from MovieLens
Soccer    0.2GB   Transfer info of soccer players

Table 1: Datasets used in the evaluation.
Employee(Alice, 11), Employee(Bob, 12), Department(11, CS), Department(12, EE), and
Dynamite asks the user to provide the corresponding output. Now, if the user provides the output WorksIn(Alice, CS), WorksIn(Bob, EE), only program (1) will be consistent and
Dynamite successfully eliminates the initial ambiguity.
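The distinguishing-input search can be illustrated on this example. In the hedged Python sketch below (ours, not Dynamite's actual implementation), the two candidate rules are encoded as join functions, and small candidate databases over a fixed tuple universe are enumerated until the programs disagree:

```python
def prog1(emp, dept):
    # WorksIn(x, y) :- Employee(x, z), Department(z, y).
    return {(x, y) for (x, z) in emp for (z2, y) in dept if z == z2}

def prog2(emp, dept):
    # WorksIn(x, y) :- Employee(x, z), Department(w, y).  (missing join condition)
    return {(x, y) for (x, _z) in emp for (_w, y) in dept}

def distinguishing_input(p, q):
    """Enumerate non-empty subsets of a small tuple universe until the
    two candidate programs yield different outputs."""
    emp_univ = [("Alice", 11), ("Bob", 12)]
    dept_univ = [(11, "CS"), (12, "EE")]
    for em in range(1, 1 << len(emp_univ)):
        for dm in range(1, 1 << len(dept_univ)):
            emp = {t for i, t in enumerate(emp_univ) if em >> i & 1}
            dept = {t for i, t in enumerate(dept_univ) if dm >> i & 1}
            if p(emp, dept) != q(emp, dept):
                return emp, dept
    return None
```

On the user's original example both programs agree, but the search quickly finds a database with a non-joining Department tuple on which they differ; this is exactly the kind of input the user is then asked to label. (The search may return a smaller distinguishing input than the one shown in the example above.)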
Filtering operation.
While the synthesis algorithm described in Section 4 does not support data filtering during migration,
Dynamite allows the target database instance to contain a subset of the data in the source instance. However, the filtering operations supported by
Dynamite are restricted to predicates that can be expressed as a conjunction of equalities. To see how
Dynamite supports such filtering operations, observe that if an extensional relation R uses a constant c as the argument of attribute aᵢ, this is the same as filtering out tuples where the corresponding value is not c. Based on this observation, Dynamite allows program sketches where the domain of a hole can include constants in addition to variables. These constants are drawn from values in the output example, and the sketch completion algorithm performs enumerative search over these constants.
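The observation can be demonstrated with a tiny pattern matcher (our own illustration; the argument-pattern encoding is hypothetical): instantiating a hole with a constant turns the corresponding predicate argument into an equality filter.

```python
def matches(pattern, tup):
    """Match a tuple against a predicate argument pattern, where each
    argument is either ('var', name) or ('const', value)."""
    env = {}
    for (kind, val), x in zip(pattern, tup):
        if kind == "const":
            if x != val:                       # constant argument: equality filter
                return False
        elif env.setdefault(val, x) != x:      # variables must bind consistently
            return False
    return True

listings = [("Alice", "CS"), ("Bob", "EE")]
# A hole filled with the constant "CS" keeps only the CS tuples:
pattern = (("var", "x"), ("const", "CS"))
filtered = [t for t in listings if matches(pattern, t)]
```

A conjunction of such equality filters corresponds to fixing several hole arguments to constants at once.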
Database instance construction.
Dynamite builds the target database instance from the output facts of the synthesized Datalog program, as described in Section 3.3. However,
Dynamite performs one optimization to make large-scale data migration practical: we leverage MongoDB [2] to build indices on attributes that connect records to their parents. This strategy allows
Dynamite to quickly look up the children of a given record and makes the construction of the target database more efficient.
6. EVALUATION
To evaluate
Dynamite, we perform experiments that are designed to answer the following research questions:
RQ1
Can
Dynamite successfully migrate real-world datasets given a representative set of records, and how good are the synthesized programs?
RQ2
How sensitive is the synthesizer to the number and quality of examples?
RQ3
How helpful is
Dynamite for users in practice, and how do users choose between multiple correct answers?
RQ4
Is the proposed sketch completion algorithm significantly more efficient than a simpler baseline?
RQ5
How does the proposed synthesis technique compare against prior techniques?

Benchmark   Source Schema          Target Schema
            (Type, #Recs, #Attrs)  (Type, #Recs, #Attrs)
Yelp-1      D  11   58             R   8   32
IMDB-1      D  12   21             R   9   26
DBLP-1      D  37   42             R   9   35
Mondial-1   D  37  113             R  25  110
MLB-1       R   5   83             D   7   85
Airbnb-1    R   4   30             D   6   24
Patent-1    R   5   49             D   7   50
Bike-1      R   4   48             D   7   47
Tencent-1   G   2    8             R   1    3
Retina-1    G   2   17             R   2   13
Movie-1     G   5   18             R   5   21
Soccer-1    G  10   30             R   7   21
Tencent-2   G   2    8             D   1    3
Retina-2    G   2   17             D   2   15
Movie-2     G   5   18             D   4   14
Soccer-2    G  10   30             D   7   23
Yelp-2      D  11   58             G   4   31
IMDB-2      D  12   21             G  11   19
DBLP-2      D  37   42             G  17   28
Mondial-2   D  37  113             G  27   78
MLB-2       R   5   83             G  12   90
Airbnb-2    R   4   30             G   7   32
Patent-2    R   5   49             G   8   49
Bike-2      R   4   48             G   6   52
MLB-3       R   5   83             R   4   75
Airbnb-3    R   4   30             R   7   33
Patent-3    R   5   49             R   8   52
Bike-3      R   4   48             R   5   52
Average     –                      –

Table 2: Statistics of benchmarks. “R” stands for relational, “D” stands for document, and “G” stands for graph.
Benchmarks.
To answer these research questions, we collected 12 real-world database instances (see Table 1 for details) and created 28 benchmarks in total. Specifically, four of these datasets (namely Yelp, IMDB, Mondial, and DBLP) are taken from prior work [48], and the remaining eight are taken from open dataset websites such as Kaggle [1]. For the document-to-relational transformations, we used exactly the same benchmarks as prior work [48]. For the remaining cases (e.g., document-to-graph or graph-to-relational), we used the source schemas in the original dataset but created a suitable target schema ourselves. As summarized in Table 2, our 28 benchmarks collectively cover a broad range of migration scenarios between different types of databases.

Experimental setup.
All experiments are conducted on a machine with an Intel Xeon(R) E5-1620 v3 quad-core CPU and 32GB of physical memory, running Ubuntu 18.04.
In this section, we evaluate RQ1 by using
Dynamite to migrate the datasets from Table 1 for the source and target schemas from Table 2. (Schemas for all benchmarks are available at https://bit.ly/schemas-dynamite.) To perform this experiment, we first constructed a representative set of input-output examples for each record in the source and target schemas. As shown in Table 3, across all benchmarks, the average number of records in the input (resp. output) example is 2.6 (resp. 2.2). We then used Dynamite to synthesize a migration script consistent with the given examples and ran it on the real-world datasets from Table 1. We now highlight the key take-away lessons from this experiment, whose results are summarized in Table 3.

Figure 7: Sensitivity analysis. Panels (a) Yelp-1, (b) IMDB-1, (c) DBLP-1, and (d) Mondial-1 plot synthesis time (s) and success rate (%).
Synthesis time.
Even though the search space of possible Datalog programs is very large (5.× on average), Dynamite can find a Datalog program consistent with the examples in about 7 seconds on average.

Statistics about synthesized programs.
As shown in Table 3, the average number of rules in the synthesized Datalog program is 8.0, and each rule contains an average of about 2 body predicates.

Quality of synthesized programs.
To evaluate the quality of the synthesized programs, we compared the synthesized Datalog programs against manually written ones (which we believe to be optimal). As shown in Table 3, many of the synthesized rules are syntactically identical to the manually-written ones. In cases where a synthesized rule differs from the manually-written one, we observed that the synthesized program contains redundant body predicates. In particular, if we quantify the distance between the two programs in terms of additional predicates, we found that the synthesized rules contain an average of 0.79 extra predicates (shown in the column labeled “Dist to Optim”). However, note that, even in cases where the synthesized rule differs syntactically from the manually-written rule, we confirmed that the synthesized and manual rules produce the exact same output for the given input relations in all cases.

Migration time and results.
For all 28 benchmarks, we confirmed that
Dynamite is able to produce the intended target database instance. As reported in the column labeled “Migration time”, the average time taken by
Dynamite to convert the source instance to the target one is about 12 minutes.

To answer RQ2, we perform an experiment that measures the sensitivity of
Dynamite to the number and quality of records in the provided input-output examples. (All input-output examples and synthesized programs are available at https://bit.ly/benchmarks-dynamite.)

Table 3: Main results. Average search space size is calculated by geometric mean; all other averages are arithmetic mean.

To perform this experiment, we first fix the number r of records in the input example. Then, we randomly generate 100 examples of size r and obtain the output example by running the “golden” program (written manually) on the randomly generated input example. Then, for each size r, we measure how often Dynamite synthesizes the correct program within 10 minutes.

The results of this experiment are summarized in Figure 7 for four representative benchmarks (the remaining 24 are provided in Appendix C). Here, the x-axis shows the number of records r in the input example, and the y-axis shows both (a) the average running time in seconds for each r (the blue line with circles) and (b) the percentage of correctly synthesized programs given r randomly-generated records (the red line with squares).

Overall, this experiment shows that Dynamite is not particularly sensitive to the number and quality of examples for 26 out of 28 benchmarks: it can synthesize the correct Datalog program in over 90% of the cases using 2–3 randomly-generated examples. Furthermore, synthesis time grows roughly linearly for 24 out of 28 benchmarks. For 2 of the remaining benchmarks (namely, IMDB-1 and Movie-2), synthesis time seems to grow exponentially in example size; however, since Dynamite can already achieve a success rate over 90% with just 2–3 examples, this growth in running time is not a major concern. Finally, for the last 2 benchmarks (namely, Retina-2 and Soccer-2),
Dynamite does not seem to scale beyond an example size of 3. For these benchmarks,
Dynamite seems to generate complicated intermediate programs with complex join structure, which causes the resulting output to be very large and causes MDP analysis to become very time-consuming. However, this behavior (which is triggered by randomly generated inputs) can be prevented by choosing more representative examples that allow
Dynamite to generate better sketches.
To answer question RQ3, we conduct a small-scale user study to evaluate whether Dynamite is helpful to users in practice. To conduct this user study, we recruited 10 participants (all of them graduate computer science students with advanced programming skills) and asked them to solve two of our benchmarks from Table 2, namely Tencent-1 and Retina-1, with and without using Dynamite. In order to avoid any potential bias, we randomly assigned users to solve each benchmark either using Dynamite or without it. For the setting where users were not allowed to use Dynamite, they were, however, permitted to use any programming language and library of their choice, and they were also allowed to consult search engines and on-line forums. In the setting where the participants did use Dynamite, we instructed them to use the tool in interactive mode (recall Section 5). Overall, exactly 5 randomly chosen participants solved each of Tencent-1 and Retina-1 using Dynamite, and 5 solved each benchmark without it.

The results of this study are provided in Figure 8. Specifically, Figure 8(a) shows the average time to solve the two benchmarks with and without using
Dynamite. As we can see from this figure, users are significantly more productive, namely by a factor of 6.2x on average, when migrating data with the aid of
Dynamite. Furthermore, as shown in Figure 8(b), participants always generate the correct target database instance when using Dynamite; however, they fail to do so in 50% of the cases when they write the program manually. Upon further inspection, we found that the manually written programs contain subtle bugs, such as failing to introduce newlines or quotation marks. We believe these results demonstrate that Dynamite can help users, including seasoned programmers, to successfully and more efficiently complete real-world data migration tasks.

Figure 8: Results of user study. (a) Completion time (Dynamite vs. manual); (b) program correctness.

Figure 9: Comparing Dynamite to the baseline and to Mitra. (a) Comparison with the enumerative baseline; (b) comparison with Mitra.
To answer RQ4, we compare
Dynamite against a baseline called Dynamite-Enum that uses enumerative search instead of the sketch completion technique described in Section 4.3. In particular, Dynamite-Enum uses the lazy enumeration algorithm based on SMT, but it does not use the Analyze procedure for learning from failed synthesis attempts. Specifically, whenever the SMT solver returns an incorrect assignment σ, Dynamite-Enum just uses ¬σ as a blocking clause. Thus, Dynamite-Enum essentially enumerates all possible sketch completions until it finds a Datalog program that satisfies the input-output example.

Figure 9(a) shows the results of the comparison when using the manually-provided input-output examples from Section 6.1. In particular, we plot the time in seconds that each version takes to solve the first n benchmarks. As shown in Figure 9(a), Dynamite can successfully solve all 28 benchmarks, whereas
Dynamite-Enum can only solve 22 of them (78.6%). Furthermore, Dynamite is 9.2x faster compared to Dynamite-Enum.

While there is no existing programming-by-example (PBE) tool that supports the full diversity of source/target schemas handled by
Dynamite, we compare our approach against two other tools, namely
Mitra and
Eirene, in two more specialized data migration scenarios.

Figure 10: Comparing Dynamite against Eirene. (a) Synthesis time; (b) mapping quality (distance to optimal).

Specifically,
Mitra [48] is a PBE tool that automates document-to-relational transformations, whereas Eirene [6] infers relational-to-relational schema mappings from input-output examples.
Comparison with Mitra.
Since
Mitra uses a domain-specific language that is customized for transforming tree-structured data into a tabular representation, we compare Dynamite against Mitra on the four data migration benchmarks from [48] that involve conversion from a document schema to a relational schema. The results of this comparison are summarized in Figure 9(b), which shows the synthesis time of each tool on all four benchmarks. In terms of synthesis time,
Dynamite outperforms Mitra by roughly an order of magnitude: in particular, Dynamite takes an average of 3 seconds to solve these benchmarks, whereas Mitra needs about 29 seconds. Mitra synthesizes 559 and 780 lines of JavaScript for Yelp and IMDB, and synthesizes 134 and 432 lines of XSLT for DBLP and Mondial. In contrast,
Dynamite synthesizes 13 Datalog rules on average. These statistics suggest that the programs synthesized by Dynamite are more easily readable compared to the JavaScript and XSLT programs synthesized by Mitra. Finally, if we compare
Dynamite and Mitra in terms of the efficiency of the synthesized programs, we observe that Dynamite-generated programs are 1.1x faster.
Comparison with Eirene.
Since
Eirene specializes in inferring relational-to-relational schema mappings, we compare Dynamite against Eirene on the four relational-to-relational benchmarks from Section 6.1 using the same input-output examples. As shown in Figure 10(a), Dynamite is, on average, 1.3x faster than Eirene in terms of synthesis time. We also compare Dynamite with Eirene in terms of the quality of inferred mappings, using the same “distance from optimal schema mapping” metric defined in Section 6.1. As shown in Figure 10(b), the schema mappings synthesized by Dynamite are closer to the optimal mappings than those synthesized by Eirene. In particular, Eirene-synthesized rules have 4.5x more redundant body predicates than the Dynamite-synthesized rules.
7. RELATED WORK
Schema mapping formalisms.
There are several different formalisms for expressing schema mappings, including visual representations [39, 38], schema modification operators [16, 15], and declarative constraints [5, 4, 6, 21, 30, 12, 8]. Some techniques require the user to express the schema mapping visually by drawing arrows between attributes in the source and target schemas [39, 37]. In contrast, schema modification operators express the schema mapping in a domain-specific language [16, 15]. Another common approach is to express the schema mapping using declarative specifications, such as Global-Local-As-View (GLAV) constraints [5, 4, 6, 21]. Similar to this third category, we express the schema mapping using a declarative, albeit executable, formalism.

(Footnote: To conduct this measurement, we manually wrote optimal schema mappings in the formalism used by Eirene.)
Schema mapping inference.
There is also a body of prior work on automatically inferring schema mappings [33, 49, 7, 20, 37, 25, 26, 5, 6, 38]. Clio [37, 25] infers schema mappings for relational and XML schemas given a value correspondence between atomic schema elements. Another line of work [9, 36] uses model management operators such as ModelGen [10] to translate schemas from one model to another. In contrast, Dynamite takes examples as input, which are potentially easier to construct for non-experts. There are also several other schema mapping techniques that use examples. For instance, Eirene [5, 6] interactively solicits examples to generate a GLAV specification. Eirene is restricted to relational-to-relational mappings and does not support data filtering, but its language can also express mappings that are not expressible in the Datalog fragment used in this work. Similarly, Bonifati et al. use example tuples to infer possible schema mappings and interact with the user via binary questions to refine the inferred mappings [11]. In contrast to Dynamite, [11] only focuses on relational-to-relational mappings [5]. Finally, MWeaver [38] provides a GUI to help users interactively generate attribute correspondences based on examples. MWeaver is also restricted to relational-to-relational mappings and disallows numerical attributes in the source database for performance reasons. Furthermore, it requires the entire source instance to perform mapping inference.
Program synthesis for data transformation.
There has been significant work on automating data transformations using program synthesis [48, 24, 14, 28, 47, 45, 43, 44]. Many techniques focus purely on table or string transformations [24, 28, 45, 14, 43], whereas Hades [47] (resp. Mitra [48]) focuses on document-to-document (resp. document-to-table) transformations. Our technique generalizes prior work by automating transformations between many different types of database schemas. Furthermore, as we demonstrate in Section 6.5, this generality does not come at the cost of practicality and, in fact, our approach performs synthesis faster.
Inductive logic programming.
Our work is related to inductive logic programming (ILP), where the goal is to synthesize a logic program consistent with a set of examples [34, 35, 40, 27, 31, 50, 19, 41]. Among ILP techniques, our work is most similar to recent work on Datalog program synthesis [3, 42]. In particular, Zaatar [3] encodes an under-approximation of Datalog semantics using the theory of arrays and reduces synthesis to SMT solving. However, this technique imposes an upper bound on the number of clauses and atoms in the Datalog program. The Alps tool [42] also performs Datalog program synthesis from examples but additionally requires meta-rule templates. In contrast, our technique focuses on a recursion-free subset of Datalog, but it does not require additional user input beyond examples and learns from failed synthesis attempts by using the concept of minimal distinguishing projections.
Learning from conflicts in synthesis. Our method bears similarities to recent work on conflict-driven learning in program synthesis [23, 14, 46], where the goal is to learn useful information from failed synthesis attempts. For example, Neo [23] and Trinity [14] use component specifications to perform root cause analysis and identify other programs that also cannot satisfy the specification. Dynamite's sketch completion approach is based on a similar insight, but it uses Datalog-specific techniques to perform inference. Another related work is Migrator [46], which automatically synthesizes a new version of a SQL program given its old version and a new relational schema. In contrast to Migrator, our method addresses the problem of migrating data rather than code and is not limited to relational schemas. In addition, while Migrator also aims to learn from failed synthesis attempts, it does so using testing as opposed to MDP analysis for Datalog.
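Schematically, these conflict-driven approaches share a common loop: enumerate candidates, and on each failure block not just the failing candidate but every candidate that shares the root cause. The sketch below is an illustrative abstraction, not the algorithm of any particular system; the `check` and `generalize` callbacks stand in for the tool-specific verification and root-cause analyses.

```python
def conflict_driven_search(candidates, check, generalize):
    """Enumerate candidate programs; on failure, prune the whole family of
    candidates that generalize() identifies as sharing the root cause."""
    blocked = set()
    for prog in candidates:
        if prog in blocked:
            continue  # already ruled out by an earlier conflict
        if check(prog):
            return prog
        blocked |= generalize(prog)  # learn from the failed attempt
    return None
```

In the toy usage below, one failing candidate from the "a" family rules out the entire family, so the search never re-checks `"a2"`.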
Universal and core solutions for data exchange. Our work is related to the data exchange problem [21], where the goal is to construct a target instance J given a source instance I and a schema mapping Σ such that (I, J) ⊨ Σ. Since such a solution J is not unique, researchers have developed the concepts of universal and core solutions to characterize generality and compactness [21, 22]. In contrast, during its data migration phase, Dynamite obtains a unique target instance by executing the synthesized Datalog program on the source instance. The target instance generated by Dynamite is the least Herbrand model of the Datalog rules and the source instance [13]. While the least Herbrand model also characterizes the generality and compactness of the target instance, the relationship between the least Herbrand model and the universal/core solutions for data exchange requires further theoretical investigation.
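For intuition, the least Herbrand model of a Datalog program over a finite input instance can be computed by naive bottom-up evaluation to a fixpoint. The sketch below is illustrative only (a toy evaluator with our own rule encoding, not Dynamite's evaluation engine): a rule is a triple of head predicate, head variables, and body atoms, and facts are ground `(predicate, tuple)` pairs.

```python
from itertools import product

def eval_datalog(rules, facts):
    """Naive bottom-up Datalog evaluation: fire every rule on the known
    facts until no new fact is derived (the least Herbrand model)."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for head_pred, head_vars, body in rules:
            # candidate ground tuples for each body atom
            candidates = [[t for p, t in facts if p == pred] for pred, _ in body]
            for combo in product(*candidates):
                binding, ok = {}, True
                for (_, vars_), tup in zip(body, combo):
                    for v, c in zip(vars_, tup):
                        if binding.setdefault(v, c) != c:
                            ok = False  # inconsistent variable binding
                            break
                    if not ok:
                        break
                if ok:
                    fact = (head_pred, tuple(binding[v] for v in head_vars))
                    if fact not in facts:
                        facts.add(fact)
                        changed = True
    return facts
```

Running it on the classic transitive-closure rules over `edge(1,2)` and `edge(2,3)` derives `path(1,3)` in addition to the direct paths.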
8. LIMITATIONS
Our approach has three limitations that we plan to address in future work. First, our synthesis technique does not provide any guarantees about the optimality of the synthesized Datalog programs, either in terms of performance or size. Second, we assume that the examples provided by the user are always correct; thus, our method does not handle any noise in the specification. Third, we assume that we can compare string values for equality when inferring the attribute mapping and obtain the proper matching using set containment. If the values are slightly changed, or if a different matching heuristic between attributes is required, our technique would not be able to synthesize the desired program. However, this shortcoming can be overcome by more sophisticated schema matching techniques [18, 32].
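A simplified, hypothetical rendering of the containment-based matching described above (the function name and dictionary encoding are ours, not the paper's InferAttrMapping procedure) might look as follows: each source attribute is mapped to the target attributes whose example value sets are contained in, or contain, its own.

```python
def infer_attr_mapping(source_cols, target_cols):
    """Match attributes by value-set containment, assuming exact string
    equality of values as discussed in the limitations above.
    Columns are given as {attribute_name: [example values]}."""
    mapping = {}
    for s_name, s_vals in source_cols.items():
        matches = [
            t_name
            for t_name, t_vals in target_cols.items()
            if set(t_vals) <= set(s_vals) or set(s_vals) <= set(t_vals)
        ]
        if matches:
            mapping[s_name] = matches
    return mapping
```

As the limitations note, even a single perturbed value breaks the containment test, which is why more robust schema matching techniques [18, 32] would help.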
9. CONCLUSION
We have proposed a new PBE technique that can synthesize Datalog programs to automate data migration tasks. We evaluated our tool, Dynamite, on 28 benchmarks that involve migrating data between different types of database schemas and showed that Dynamite can successfully automate the desired data migration task from small input-output examples.
10. ACKNOWLEDGMENTS
We would like to thank the anonymous reviewers and our shepherd for their useful comments and suggestions. This work is supported by NSF Awards
11. REFERENCES
[1] Kaggle, 2019.
[2] MongoDB, 2019.
[3] A. Albarghouthi, P. Koutris, M. Naik, and C. Smith. Constraint-based synthesis of Datalog programs. In Proceedings of CP, pages 689–706, 2017.
[4] B. Alexe, P. G. Kolaitis, and W.-C. Tan. Characterizing schema mappings via data examples. In Proceedings of PODS, pages 261–272, 2010.
[5] B. Alexe, B. ten Cate, P. G. Kolaitis, and W.-C. Tan. Designing and refining schema mappings via data examples. In Proceedings of SIGMOD, pages 133–144, 2011.
[6] B. Alexe, B. ten Cate, P. G. Kolaitis, and W.-C. Tan. Eirene: Interactive design and refinement of schema mappings via data examples. Proceedings of the VLDB Endowment, 4(12):1414–1417, 2011.
[7] Y. An, A. Borgida, R. J. Miller, and J. Mylopoulos. A semantic approach to discovering schema mapping expressions. In Proceedings of ICDE, pages 206–215, 2007.
[8] P. C. Arocena, B. Glavic, and R. J. Miller. Value invention in data exchange. In Proceedings of SIGMOD, pages 157–168, 2013.
[9] P. Atzeni, P. Cappellari, R. Torlone, P. A. Bernstein, and G. Gianforme. Model-independent schema translation. VLDB J., 17(6):1347–1370, 2008.
[10] P. A. Bernstein. Applying model management to classical meta data problems. In Proceedings of CIDR, 2003.
[11] A. Bonifati, U. Comignani, E. Coquery, and R. Thion. Interactive mapping specification with exemplar tuples. In Proceedings of SIGMOD, pages 667–682, 2017.
[12] B. ten Cate, P. G. Kolaitis, K. Qian, and W.-C. Tan. Approximation algorithms for schema-mapping discovery from data examples. ACM Transactions on Database Systems, 42(2):12, 2017.
[13] S. Ceri, G. Gottlob, and L. Tanca. What you always wanted to know about Datalog (and never dared to ask). IEEE TKDE, 1(1):146–166, 1989.
[14] J. Chen, R. Martins, Y. Chen, Y. Feng, and I. Dillig. Trinity: An extensible synthesis framework for data science. PVLDB, 2019.
[15] C. Curino, H. J. Moon, A. Deutsch, and C. Zaniolo. Automating the database schema evolution process. VLDB J., 22(1):73–98, 2013.
[16] C. Curino, H. J. Moon, and C. Zaniolo. Graceful database schema evolution: the PRISM workbench. PVLDB, 1(1):761–772, 2008.
[17] L. M. de Moura and N. Bjørner. Z3: An efficient SMT solver. In Proceedings of TACAS, pages 337–340, 2008.
[18] A. Doan, P. M. Domingos, and A. Y. Halevy. Reconciling schemas of disparate data sources: A machine-learning approach. In Proceedings of SIGMOD, pages 509–520, 2001.
[19] R. Evans and E. Grefenstette. Learning explanatory rules from noisy data (extended abstract). In Proceedings of IJCAI, pages 5598–5602, 2018.
[20] R. Fagin, L. M. Haas, M. A. Hernández, R. J. Miller, L. Popa, and Y. Velegrakis. Clio: Schema mapping creation and data exchange. In Conceptual Modeling: Foundations and Applications - Essays in Honor of John Mylopoulos, pages 198–236, 2009.
[21] R. Fagin, P. G. Kolaitis, R. J. Miller, and L. Popa. Data exchange: semantics and query answering. Theor. Comput. Sci., 336(1):89–124, 2005.
[22] R. Fagin, P. G. Kolaitis, and L. Popa. Data exchange: getting to the core. ACM Trans. Database Syst., 30(1):174–210, 2005.
[23] Y. Feng, R. Martins, O. Bastani, and I. Dillig. Program synthesis using conflict-driven learning. In Proceedings of PLDI, pages 420–435, 2018.
[24] Y. Feng, R. Martins, J. V. Geffen, I. Dillig, and S. Chaudhuri. Component-based synthesis of table consolidation and transformation tasks from examples. In Proceedings of PLDI, pages 422–436, 2017.
[25] A. Fuxman, M. A. Hernández, C. T. H. Ho, R. J. Miller, P. Papotti, and L. Popa. Nested mappings: Schema mapping reloaded. In Proceedings of VLDB, pages 67–78, 2006.
[26] G. Gottlob and P. Senellart. Schema mapping discovery from data instances. J. ACM, 57(2):6:1–6:37, 2010.
[27] S. Gulwani, J. Hernández-Orallo, E. Kitzelmann, S. H. Muggleton, U. Schmid, and B. G. Zorn. Inductive programming meets the real world. Communications of the ACM, 58(11):90–99, 2015.
[28] W. R. Harris and S. Gulwani. Spreadsheet table transformations from examples. In Proceedings of PLDI, pages 317–328, 2011.
[29] H. Jordan, B. Scholz, and P. Subotic. Soufflé: On synthesis of program analyzers. In Proceedings of CAV, pages 422–430, 2016.
[30] P. G. Kolaitis. Schema mappings, data exchange, and metadata management. In Proceedings of PODS, pages 61–75, 2005.
[31] D. Lin, E. Dechter, K. Ellis, J. B. Tenenbaum, and S. Muggleton. Bias reformulation for one-shot function induction. In Proceedings of ECAI, pages 525–530, 2014.
[32] J. Madhavan, P. A. Bernstein, and E. Rahm. Generic schema matching with Cupid. In Proceedings of VLDB, pages 49–58, 2001.
[33] R. J. Miller, L. M. Haas, and M. A. Hernández. Schema mapping as query discovery. In Proceedings of VLDB, pages 77–88, 2000.
[34] S. H. Muggleton, D. Lin, N. Pahlavi, and A. Tamaddoni-Nezhad. Meta-interpretive learning: application to grammatical inference. Machine Learning, 94(1):25–49, 2014.
[35] S. H. Muggleton, D. Lin, and A. Tamaddoni-Nezhad. Meta-interpretive learning of higher-order dyadic Datalog: predicate invention revisited. Machine Learning, 100(1):49–73, 2015.
[36] P. Papotti and R. Torlone. Heterogeneous data translation through XML conversion. J. Web Eng., 4(3):189–204, 2005.
[37] L. Popa, Y. Velegrakis, R. J. Miller, M. A. Hernández, and R. Fagin. Translating web data. In PVLDB, pages 598–609, 2002.
[38] L. Qian, M. J. Cafarella, and H. V. Jagadish. Sample-driven schema mapping. In Proceedings of SIGMOD, pages 73–84, 2012.
[39] E. Rahm and P. A. Bernstein. A survey of approaches to automatic schema matching. VLDB J., 10(4):334–350, 2001.
[40] T. Rocktäschel and S. Riedel. End-to-end differentiable proving. In Proceedings of NIPS, pages 3788–3800, 2017.
[41] X. Si, M. R. Lee, K. Heo, and M. Naik. Synthesizing Datalog programs using numerical relaxation. In Proceedings of IJCAI, 2019.
[42] X. Si, W. Lee, R. Zhang, A. Albarghouthi, P. Koutris, and M. Naik. Syntax-guided synthesis of Datalog programs. In Proceedings of ESEC/SIGSOFT FSE, pages 515–527, 2018.
[43] R. Singh. BlinkFill: Semi-supervised programming by example for syntactic string transformations. PVLDB, 9(10):816–827, 2016.
[44] R. Singh, V. V. Meduri, A. K. Elmagarmid, S. Madden, P. Papotti, J. Quiané-Ruiz, A. Solar-Lezama, and N. Tang. Synthesizing entity matching rules by examples. PVLDB, 11(2):189–202, 2017.
[45] C. Wang, A. Cheung, and R. Bodík. Synthesizing highly expressive SQL queries from input-output examples. In Proceedings of PLDI, pages 452–466, 2017.
[46] Y. Wang, J. Dong, R. Shah, and I. Dillig. Synthesizing database programs for schema refactoring. In Proceedings of PLDI, pages 286–300, 2019.
[47] N. Yaghmazadeh, C. Klinger, I. Dillig, and S. Chaudhuri. Synthesizing transformations on hierarchically structured data. In Proceedings of PLDI, pages 508–521, 2016.
[48] N. Yaghmazadeh, X. Wang, and I. Dillig. Automated migration of hierarchical data to relational tables using programming-by-example. PVLDB, 11(5):580–593, 2018.
[49] L. Yan, R. J. Miller, L. M. Haas, and R. Fagin. Data-driven understanding and refinement of schema mappings. In Proceedings of SIGMOD, pages 485–496, 2001.
[50] F. Yang, Z. Yang, and W. W. Cohen. Differentiable learning of logical rules for knowledge base reasoning. In Proceedings of NIPS, pages 2319–2328, 2017.
APPENDIX
A. THEOREMS AND PROOFS
Definition 2 (Equivalence of Datalog Programs). Given two Datalog programs P and P′, P and P′ are said to be equivalent, denoted by P ≃ P′, if and only if ⟦P⟧_I = ⟦P′⟧_I holds for any input instance I, i.e., P ≃ P′ ≜ ∀I. ⟦P⟧_I = ⟦P′⟧_I.
Proof of Theorem 1. Suppose ⟦P⟧ = ∀X. Φ(X). Since σ̂ is an injective mapping from variables X to variables Y, we have ⟦Pσ̂⟧ = (∀X. Φ(X))[Y/X] = ∀Y. Φ(Y) = ⟦P⟧. Based on the semantics of Datalog programs, program P is equivalent to Pσ̂, i.e., P ≃ Pσ̂.
Lemma 1 (Correctness of Generalize). Given two models σ, σ′ with the same domain, if σ′ ⊨ Generalize(σ), then (1) σ′ = σσ̂ where σ̂ is an injective substitution, and (2) for any x ∈ dom(σ), if σ(x) corresponds to a head variable, then σ(x) = σ′(x).
Proof. Since (2) holds by the definition of Generalize(σ), let us focus on property (1). Given that σ and σ′ have the same domain and σ′ = σσ̂, we only need to prove that σ̂ is injective. We prove this by contradiction. Suppose σ̂ is not injective and maps two different values v_1, v_2 (v_1 ≠ v_2) in the range of σ to one single value v, i.e., σ̂(v_1) = σ̂(v_2) = v. Since σ′ = σσ̂, v is in the range of σ′. On the one hand, suppose σ(x_1) = v_1 and σ(x_2) = v_2; it follows that σ′(x_1) = σ(x_1)σ̂ = v and σ′(x_2) = σ(x_2)σ̂ = v, so we have σ′(x_1) = σ′(x_2). On the other hand, we know σ′(x_1) ≠ σ′(x_2) from the definition of Generalize(σ) because v_1 ≠ v_2, which is a contradiction.
Definition 3 (Projection).
Let R be a non-recursive rule with head H(X) where X is a set of variables that correspond to attributes A of relation H. Assume that L ⊆ A is a subset of the attributes in H and X_L are their corresponding variables. The rule R(L) that substitutes head H(X) with H(X_L) in R is said to be the projection of R.
Lemma 2 (Projection Property). Assuming R is a non-recursive rule and R(L) is its projection on attributes L, it holds that ⟦R(L)⟧_I = Π_L(⟦R⟧_I) for any input I.
Proof. Without loss of generality, suppose rule R is of the form H(X) :- B(X, Y). Then ⟦R⟧ = ∀X, Y. B(X, Y) → H(X). Based on the definition of R(L), rule R(L) is of the form H(X_L) :- B(X, Y), so ⟦R(L)⟧ = ∀X, Y. B(X, Y) → H(X_L). According to the semantics of Datalog, we have ⟦R(L)⟧_I = Π_L(⟦R⟧_I) for any input I.
Lemma 3 (Correctness of Generalize with MDP). Consider a program sketch Ω, two models σ, σ′, and an MDP φ. Let P = Instantiate(Ω, σ) and P′ = Instantiate(Ω, σ′). If σ′ ⊨ Generalize(σ, φ), then Π_φ(⟦P⟧_I) = Π_φ(⟦P′⟧_I) holds for any input I.
Proof. Let O = ⟦P⟧_I, O′ = ⟦P′⟧_I, and consider the projection P(φ) of P and the projection P′(φ) of P′. Observing the difference between Generalize(σ, φ) and Generalize(σ), we know that projection P(φ) = Instantiate(Ω, σ) and projection P′(φ) = Instantiate(Ω, σ′). Furthermore, applying Generalize(σ, φ) to program P is equivalent to applying Generalize(σ) to program P(φ), because the head variables of P(φ) are the same as the variables corresponding to the attributes in φ for P. Then, by Lemma 1, it holds that (1) σ′ = σσ̂ where σ̂ is an injective substitution, and (2) for any x ∈ dom(σ), if σ(x) corresponds to a head variable of P(φ), then σ(x) = σ′(x). Thus, given that P′(φ) = P(φ)σ̂, we know P(φ) ≃ P′(φ) by Theorem 1. In other words, ⟦P(φ)⟧_I = ⟦P′(φ)⟧_I holds for any input I. Therefore, by Lemma 2, we have Π_φ(⟦P⟧_I) = ⟦P(φ)⟧_I = ⟦P′(φ)⟧_I = Π_φ(⟦P′⟧_I) for any input I.
Lemma 4 (Correctness of MDPSet).
Given a Datalog program P and an input-output example (I, O), suppose ⟦P⟧_I = O′ and O′ ≠ O. If Δ = MDPSet(O′, O), then for every element φ ∈ Δ, it holds that φ is an MDP for program P and example (I, O).
Proof. Observe that Lines 2–5 of the MDPSet procedure in Algorithm 4 establish the facts: (1) Δ = ∅, (2) V = {{a} | a ∈ Attrs(O′)}, and (3) W is a queue initialized to contain all the elements in V. Now consider the following invariants of the while loop at Lines 6–14:
I1. ∀φ ∈ Δ. Π_φ(O) ≠ Π_φ(⟦P⟧_I)
I2. ∀φ ∈ Δ. ∀A ⊂ φ. Π_A(O) = Π_A(⟦P⟧_I)
Let us prove these loop invariants in more detail:
• I1 trivially holds at the beginning of the loop because Δ is empty. To see why I1 holds after every iteration of the loop, observe Line 8 and Line 14 of the algorithm. A new element L can only be added to Δ if Π_L(O′) ≠ Π_L(O). Since ⟦P⟧_I = O′, we have Π_L(O) ≠ Π_L(⟦P⟧_I). Thus, I1 holds after every iteration of the loop.
• I2 trivially holds at the beginning of the loop because Δ is empty. Let us prove that I2 holds after every iteration of the loop. For an element L to be added to Δ at Line 14, we only need to prove ∀L′ ⊂ L. Π_L′(O) = Π_L′(⟦P⟧_I). We prove this by contradiction. Suppose there exists an attribute set A ⊂ L such that Π_A(O) ≠ Π_A(⟦P⟧_I). (i) If A ∈ Δ, then L would not be added to Δ by the condition at Line 14. Contradiction. (ii) If A ∉ Δ, it must hold that ∃A′ ⊂ A. A′ ∈ Δ by the condition at Line 14. Since A′ ⊂ A ⊂ L and A′ ∈ Δ, we know L would not be added to Δ. Contradiction.
Based on the definition of MDP and loop invariants I1 and I2, we have proved that every element φ ∈ Δ is an MDP for program P and example (I, O).
Proof of Theorem 2. Given an input-output example (I, O) and a candidate program P with ⟦P⟧_I = O′, observe that the Synthesize procedure in Algorithm 1 can only reach Analyze when O′ ≠ O. Thus, by Lemma 4, the return value Δ of MDPSet in Algorithm 3 contains a set of MDPs. For each MDP φ ∈ Δ, observe that ψ = Generalize(σ, φ) according to Lines 5–10 of the Analyze procedure in Algorithm 3, so it holds that σ′ ⊨ Generalize(σ, φ) for a model σ′ corresponding to formula ψ. Let us denote the program corresponding to σ′ by P′. Then, by Lemma 3, we have Π_φ(⟦P⟧_I) = Π_φ(⟦P′⟧_I). Since ⟦P⟧_I = O′ and O′ ≠ O, we know ⟦P′⟧_I ≠ O, i.e., P′ is an incorrect program. Given that the return value of Analyze is the conjunction of all ¬ψ, we have proved the theorem.
Definition 4 (Sketch Completion).
Given a sketch Ω with holes ??_1, …, ??_n, a program P is said to be a completion of Ω, denoted by P ∈ γ(Ω), if P = Ω[c_1/??_1, …, c_n/??_n], where c_i is a constant in the domain of ??_i.
Lemma 5 (Correctness of GenIntensionalPreds). Given a top-level record type N in target schema S′, invoking GenIntensionalPreds(S′, N) generates the following relations:
1. relation R_N(v_a1, …, v_an) for record type N, where v_ai is the variable corresponding to attribute a_i of N;
2. relation R_N′(v_N′, v_a1, …, v_an) for each nested record type N′ in N, where v_N′ is the variable corresponding to N′ and v_ai corresponds to attribute a_i of N′.
Proof. We prove this by structural induction on name N.
• Base case. N is a primitive attribute, i.e., S′(N) ∈ PrimType. According to the InPrim rule in Figure 5, GenIntensionalPreds generates a variable v_N for N and no head relations.
• Inductive case. N is a record type with attributes N_1, …, N_m, i.e., S′(N) = {N_1, …, N_m}. Suppose that for every name N_j with j ∈ [1, m], GenIntensionalPreds generates R_Nj(v_a1, …, v_an) if N_j is a top-level record type and otherwise generates relation R_N′j(v_N′j, v_a1, …, v_an) for each nested record type N′_j in N_j; we need to prove that the property holds for N. We perform a case split on name N. (1) If N is a top-level record type, the property holds by the InRec rule in Figure 5. (2) If N is not a top-level record type, the property holds by the InRecNested rule in Figure 5.
By the principle of structural induction, we have proved this lemma.
Lemma 6 (Correctness of GenExtensionalPreds).
Given a sequence of record types N_1, …, N_m in source schema S, where N_{i+1} is nested in N_i for i ∈ [1, m−1] and N_1 is at the top level, invoking GenExtensionalPreds(S, N_m) generates the following relations:
1. relation R_N1(h_a1, …, h_an), where h_ai is the hole corresponding to attribute a_i of N_1;
2. relation R_Nj(v_Nj, h_a1, …, h_an) for each N_j with j ∈ [2, m], where v_Nj is the variable corresponding to record type N_j, and h_ai is the hole corresponding to attribute a_i of N_j.
Proof. We prove this by induction on m.
• Base case: m = 1. In this case, we only have one top-level record N_1. According to the rules ExRec and ExPrim in Figure 6, GenExtensionalPreds generates a relation R_N1(h_a1, …, h_an), where h_ai is the hole corresponding to attribute a_i of N_1.
• Inductive case. Suppose the property holds for m = k − 1; let us prove that it holds for m = k. By the inductive hypothesis, we know that GenExtensionalPreds generates a relation R_N1(h_a1, …, h_an) for N_1 and generates R_Nj(v_Nj, h_aj1, …, h_ajn) for each j ∈ [2, k − 1]; one more application of the rules generates R_Nk(v_Nk, h_ak1, …, h_akn). Thus, the property holds.
By the principle of induction, we have proved this lemma.
Lemma 7 (Sketch of Rules).
Given a source and tar-get schema S , S (cid:48) , an example E = ( I , O ) , and a top levelrecord type N in S (cid:48) , let Ψ =
InferAttrMapping ( S , S (cid:48) , E ) and sketch Ω =
GenRuleSketch (Ψ , S , S (cid:48) , N ) . If a Datalogrule R satisfies the following conditions:1. every relation corresponding to a record type nested in N occurs exactly once in the head of R
2. all body relations correspond to record types in S (cid:74) R (cid:75) I = O N where O N only contains the output forrecord type N in O then there exists a program R (cid:48) ∈ γ (Ω) such that R (cid:48) (cid:39) R . roof. First of all, let us consider the head of rule R .Since the head of R contains exactly one relation for eachnested record in N and (cid:74) R (cid:75) I = O N , so the arguments ofthese relations should cover all the attributes in N . In ad-dition, considering the migration procedure in Section 3.3,each nested record type should have an extra attribute totrack its parent. Thus, the head of R contains (1) a relation R N for N and its arity is the number of attributes in N , and(2) relation R N (cid:48) for each nested record type in N and itsarity is one more than the number of attributes in N (cid:48) . ByLemma 5, we know the head of Ω is identical to the head of R up to variable renaming, and let us denote the variablerenaming by ˆ σ . Without loss of generality, suppose ˆ σ is en-forced to be applied on R , i.e. the head variable for targetattribute a is renamed to v a and so does the variables in thebody. Then the head of R and Ω are syntactically identical,say H , and we will use H as the head of R (cid:48) .Next let us focus on the body of R . Observe that thenatural bound on the number of copies for one source at-tribute is the number of aliasing attributes in the sourceand target schema, i.e. the maximum number of copies forattribute a in record type N (cid:48) is bounded by |{ a (cid:48) | a (cid:48) ∈ PrimAttrbs ( N (cid:48) ) ∧ a (cid:48) ∈ Ψ( a ) }| . Since adding more copiesbeyond it would eventually boil down to renaming variablesof existing predicates, we will prove there exists R (cid:48) ∈ γ (Ω)such that R (cid:48) (cid:39) R under this natural bound by induction. • Base case. R has only one relation in the body. Sup-pose R is of the form H ( v a (cid:48) , . . . , v a (cid:48) m ) :- B ( v , . . . 
, v n )where B corresponds to record type N B in S and v i is the variable for attribute a i . Since (cid:74) R (cid:75) I = O N ,for every attribute a (cid:48) j in S (cid:48) , there exists an attribute a k in S such that Π a k ( I ) = Π a (cid:48) j ( O N ). Thus, we have a (cid:48) j ∈ Ψ( a k ) based on the definition of InferAttrMap-ping . Observing Lines 9 – 12 of Algorithm 2 and com-bining with Lemma 6, we know that
GenRuleSketch generates at least m copies of predicate B ( h , . . . , h n ).Then according to Lines 13 – 18 of Algorithm 2, for anyattribute a i ∈ dom (Ψ), a (cid:48) j is included in the domainof h i . Furthermore, for any attribute a i (cid:54)∈ dom (Ψ),a fresh variable v a i is in the domain of h i . Thus wecan construct the body of R (cid:48) such that R (cid:48) ∈ γ (Ω), byinstantiating h i with v a (cid:48) j if a i ∈ dom (Ψ) and instanti-ating h i with v a i if a i ∈ dom (Ψ). In this way, R (cid:48) isequivalent to R up to variable renaming. • Inductive case. Assuming there exists R (cid:48) k ∈ γ (Ω) withat most k body predicates such that R (cid:48) k (cid:39) R k if R k is of the form H :- B , . . . , B k , let us prove thereexists R (cid:48) k +1 ∈ γ (Ω) with at most k + 1 body pred-icates such that R (cid:48) k +1 (cid:39) R k +1 if R k +1 is of the form H :- B , . . . , B k , B k +1 . For the purpose of illustration,suppose R k +1 is of the form H ( v a (cid:48) , . . . , v a (cid:48) m ) :- B , . . . , B k , B k +1 ( v , . . . , v n ), where B i corresponds to recordtype N i in S for i ∈ [1 , k + 1], and v j is the variable forattribute a j for j ∈ [1 , n ]. Note that N i and N i mayrefer to the same record type in general. Without lossof generality, we assume N k +1 is record type N B . Byinductive hypothesis, we know that R (cid:48) k (cid:39) R k with atmost k predicates in the body. To construct the newrule R (cid:48) k +1 , we only need to instantiate one more predi-cate B ( h , . . . , h n ) and add it to the body of R (cid:48) k , where B is the relation corresponding to N B . And such pred-icate sketch B ( h , . . . , h n ) exists in the body of Ω byLemma 6. To see how to construct the variable v (cid:48) i , let us consider the variable v i in B k +1 ( v , . . . , v n ) and itscorresponding attribute a i in schema S .(1) v i is a fresh variable. 
We can simply pick a variable v ja i that does not occur in R (cid:48) k .(2) v i occurs in the head and body predicates B , . . . , B k .Suppose the corresponding head attribute is a (cid:48) r , weknow Π a i ( I ) = Π a (cid:48) r ( O N ) from (cid:74) R k +1 (cid:75) I = O N .Based on the definition of InferAttrMapping ,we have a (cid:48) r ∈ Ψ( a i ), so v a (cid:48) r is in the domain of h i according to Lines 9 – 18 of Algorithm 2 andLemma 6. Thus, we can use v a (cid:48) r to instantiate h i .(3) v i does not occur in the head but occurs in bodypredicates B , . . . , B k . Suppose the correspondingsource attributes are a j , . . . , a j r . Since (cid:74) R k +1 (cid:75) I = O N = (cid:74) R k (cid:75) I , we have Π a js ( I ) ⊆ Π a i ( I ) or Π a js ( I ) ⊇ Π a i ( I ) for all s ∈ [1 , r ]. So by the definitionof InferAttrMapping , we have a i ∈ Ψ( a j s ) or a j s ∈ Ψ( a i ). Thus, v a js is in the domain of h i or v i is in the domain of the hole corresponding toattribute a j s . Therefore, we can instantiate theseholes accordingly.So far, we have constructed the new rule R (cid:48) k +1 suchthat R (cid:48) k +1 ∈ γ (Ω) and R (cid:48) k +1 is equivalent to R k +1 upto variable renaming.By the principle of induction, we have proved this lemma. Theorem 3 (Sketch Property).
Given a source and target schema S, S′, and an example E = (I, O), let Ψ = InferAttrMapping(S, S′, E) and sketch Ω = SketchGen(Ψ, S, S′). If a Datalog program P conforms to the syntax shown in Figure 4 and P satisfies the following conditions:
1. for each record type N in S′, there is exactly one rule in P such that its head corresponds to N;
2. for each rule R in P, the body relations of R correspond to record types in S;
3. ⟦P⟧_I = O;
then there exists a program P′ ∈ γ(Ω) such that ⟦P′⟧_I = O.
Proof. Observe that the SketchGen procedure simply invokes GenRuleSketch for each top-level record type N in target schema S′. This theorem directly follows from Lemma 7.
B. INTERACTIVE MODE
As mentioned in Section 5, Dynamite can be used in an interactive mode, where the tool queries the user for more input-output examples if there are multiple Datalog programs that are consistent with the given example. When used in this mode, Dynamite does not stop as soon as it finds a single program that satisfies the examples; rather, it keeps searching for additional programs (i.e., sketch completions) until either (a) it finds another program that is also consistent with the given example, or (b) it proves that there is no other program consistent with the given example (i.e., the SAT encoding from Algorithm 1 becomes unsatisfiable). If there is another program consistent with the example, Dynamite uses testing to find a smallest set of input tuples that distinguish the two programs. Specifically,
Dynamite enumerates test inputs in increasing order of size, where each tuple in the set comes from a validation set sampled from the source database instance. Then, given an input I on which the outputs of the two programs differ, Dynamite asks the user to provide the corresponding output.

Figure 11: Detailed sensitivity analysis. (Plots of synthesis time in seconds and success rate in percent for benchmarks (a) Yelp-1 through (x) Bike-2.)

Figure 12: Detailed sensitivity analysis (continued). (Benchmarks (a) MLB-3 through (d) Bike-3.)
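The distinguishing-input search used in the interactive mode can be sketched as follows, with the two candidate programs modeled as functions from an input set to an output set. This is a hypothetical helper for intuition, not Dynamite's implementation.

```python
from itertools import combinations

def find_distinguishing_input(prog1, prog2, validation_tuples, max_size=3):
    """Enumerate candidate inputs in increasing size, drawn from a sampled
    validation set, and return the smallest set of tuples on which the two
    synthesized programs disagree (None if they agree on all of them)."""
    for size in range(1, max_size + 1):
        for subset in combinations(validation_tuples, size):
            inp = frozenset(subset)
            if prog1(inp) != prog2(inp):
                return inp  # the user is asked for the output on this input
    return None
```

Because inputs are enumerated smallest-first, the user is shown a minimal disambiguating example rather than a large instance.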
C. SENSITIVITY ANALYSIS
Our detailed sensitivity analysis for all benchmarks in Section 6 is summarized in Figures 11 and 12. Here, the x-axis shows the input size r, and the y-axis shows both (a) the running time in seconds for each r (the blue line with circles) and (b) the percentage of correctly synthesized programs given r randomly-generated records (the red line with squares). To reduce the impact of random errors, we run Dynamite 100 times with different random examples for each r.