Towards Inferring Queries from Simple and Partial Provenance Examples
TTowards Inferring Queries from Simple and Partial ProvenanceExamples
Amir GiladTel Aviv [email protected] Yuval MoskovitchUniversity of [email protected]
Abstract
The field of query-by-example aims at inferringqueries from output examples given by non-expertusers, by finding the underlying logic that binds theexamples. However, for a very small set of examples,it is difficult to correctly infer such logic. To bridgethis gap, previous work suggested attaching expla-nations to each output example, modeled as prove-nance, allowing users to explain the reason behindtheir choice of example. In this paper, we explore theproblem of inferring queries from a few output exam-ples and intuitive explanations . We propose a twostep framework: (1) convert the explanations into(partial) provenance and (2) infer a query that gen-erates the output examples using a novel algorithmthat employs a graph based approach. This frame-work is suitable for non-experts as it does not requirethe specification of the provenance in its entirety oran understanding of its structure. We show promis-ing initial experimental results of our approach.
With the growing interest in data science and theubiquity of data in recent years, the need fortools that allow non-expert users to interact withdatabases becomes crucial. Different approacheshave been proposed to facilitate manners in whichnon-expert users can query the database, includingnatural language interfaces [9] and query by example[12, 1], where the user provides the system outputexamples in order to obtain a query that produces the same output. In this paper we focus on an al-ternative approach, based on the use of explanationsfor output examples. This notion relies on the per-ception that providing a small number of output ex-amples, along with explanations may be more intu-itive and easier for users than solely providing manyoutput examples. These explanations encapsulate in-formation on the query structure which significantlyreduces the number of examples needed to infer theintended query.
Example 1.1.
Consider the Microsoft AcademicDatabase (MAS) [10]. An instance with a similarschema is depicted in Figure 1. Also consider thequery returning all conferences in the database areaand authors from Tel Aviv University (TAU) whopublished papers in these conferences (shown as aConjunctive Query in Figure 3). A user who does notknow how to formulate it can input a few examples ofoutputs and explain their rational. For instance, thetwo output examples can be “SIGMOD, Alice” and“CIKM, Bob”, and their respective explanations say-ing that Alice published the papers “X” and “Y” inSIGMOD, which is a database conference, and thatBob published the paper “Z” in CIKM which is alsoa database conference, and both are from Tel AvivUniversity.
The idea of inferring queries from provenance wasfirst proposed in [5, 6]. A key property of anyprovenance based solution is its level of detail. In[6] we studied provenance polynomials, trio, positiveBoolean expressions, and why-provenance, each cor-responds to an explanation with a different granular-1 a r X i v : . [ c s . D B ] A ug ty. We showed that consistent conjunctive queriescan be inferred, even when given only a few outputexamples and their provenance. A notable prove-nance model that is missing in [6] is that of lineage [4]where the explanation of an output tuple t is the setof input tuples’ annotations that contribute to thegeneration of t . Example 1.2.
Consider again the database de-picted in Figure 1 and the query from Example 1.1.The lineage of the tuple “SIGMOD, Alice” is theset of tuples annotated (in the prov. column) by o , a , p , p , w , w , c , dc and d . Note that this setdepicts two different ways to obtain the output, oneusing the fact that she published “X” and the otherthat she published “Y”. Providing precise and “full” provenance as expla-nations may be tedious or even impossible for non-expert users, and requires an understanding of theschema. A more likely non-expert explanation wouldbe partial. For instance, a more natural explana-tion for the output “SIGMOD, Alice” from Exam-ple 1.1 may contain only the tuples o , a , p , c and d . Moreover, a major barrier in the usability ofthe provenance based query inference approach, isthat formulating explanations into provenance, evenin a partial form, may be a challenging task for non-expert users. To this end we propose a two phaseframework. Given a small set of output examplesalong with informal explanations, we first convertthese explanations into (partial) provenance. Then,we infer a query that generates each output examplefrom its corresponding provenance. In this paper we describe our ongoing work towarda solution for the problem of inferring queries basedon examples with intuitive explanations.
We proposeto use values as explanation, and leverage techniquesdeveloped in previous work [11] for the first part. Thebasic idea is to identify the tuples that contain thevalues given as explanations using a similarity score.
Example 1.3.
Consider again the database shown inFigure 1 and the output example ’SIGMOD, Alice’.A possible explanation may contain values such as‘Tel Aviv University’, ‘Alice’, “X”, “Y”, ‘SIGMOD’,and ‘Databases’. These values would then be mappedto the tuples o , a , p , p , c , d , respectively. Rel. org oid oname o o Rel. author aid aname oid a a a Rel. domain conf cid did dc
10 18 dc
11 18
Rel. pub wid cid ptitle pyear p p p Rel. writes aid wid w w w w Rel. conf cid cname c
10 CIKM c
11 SIGMOD
Rel. domain did dname d
18 DB
Figure 1: DB instanceThe second part of the proposed solution is infer-ring queries from output examples and partial lin-eage. An essential question that arises within thisnotion of partial provenance, is how partial is theprovenance. Possible characterization of the expla-nation’s form may include the presence or absenceof projected tuples or join tuples. In this paper, wedescribe our results for joinless lineage –e.g, lineagewith missing join tuples. We start with this frag-ment since it allows for intuitive explanations, andit overcomes an inherent limitation of the mappingcomponent where some tuples in the lineage are ab-sent in the mapping values function range. For in-stance, there is no value in Example 1.3 that can bemapped to the tuples w and w , although it is partof the provenance of the intended query, as shown inExample 1.2. Other possible forms of partial lineageare left for our ongoing and future work. Related Work
Query-by-example [2, 12, 11], isthe problem of inferring queries based on output ex-amples given by the user. Along side these works,various models of provenance have been proposed inthe literature [7, 4, 3]. Recent work [5, 6] has exploredthe case where a small number of positive examplesare provided, coupled with explanations for these ex-amples in the form of provenance. These assume thatoutputs with multiple explanations are separated intoindividual ones and either are given in full, or are2 utupt Full Prov. Partial Prov.SIGMOD, Alice o , a , p , p , w ,w , c , dc , d o , a , p , c , d EDBT, Bob o , a , p , w ,c , dc , d o , a , p , c , d Figure 2: prov-example q(cname, aname) :- author(aid, aname, oid),writes(aid, wid), pub(wid, cid, ptitle, pyear),conf(cid, cname), domain_conf(cid, did),domain(did, dname), oname = ‘TAU’, dname = ‘Databases’
Figure 3: Intended querymissing mention of duplicate uses of the same tuplein an explanation. Our approach attempts to fill inthe gap focusing on the lineage formalism [4], whichwas not included in [6], and thus relaxes the desider-ata of the explanations, allowing for missing tuplesand unseparated explanations for the same output.
We first describe our initial results for the query infer-ence phase. We formally define the problem of queryinference from partial lineage and present a novel al-gorithm for the joinless lineage case. We then brieflydiscuss our approach for provenance extraction.
We use the foundations laid in [6] to define the prob-lem of inferring queries by lineage. As in [6], we focuson Conjunctive Queries (CQ). An assignment of a CQ Q with respect to a database D is a mapping of therelational atoms of Q to tuples in D that respectsrelation names. The input to the problem is a setof output tuples, each with its (possibly partial) lin-eage. The lineage is given as a set of input tuplesidentifiers. We use ( I, O ) to denote a pair of outputexample O and it’s lineage I (i.e., a set of input tupleidentifiers). As a simple example, consider the tablein Figure 2, referring to the annotation of the tuplesin Figure 1. The output column along with the FullProv. column forms an example where the first andseconds rows can be denoted by ( I , O ) and ( I , O ).The Partial Prov. column depicts examples of partiallineage for the same outputs.We can then define the notion of consistency withrespect to an output example and (partial) lineage, conf pub writes authororgdomain conf domain (a) Join graph of the query c p w a o dc d p w (b) Prov. graph of row 1 Figure 4: (a) Query join graph in Figure 3, (b) prove-nance graph of row 1 in Figure 2and introduce our problem statement. Intuitively, welook for a query whose output contains the exampleoutput, and for each output tuple, its provenance is“reflected” in the computation of the tuple by thequery.
Definition 2.1 (adapted from [6]) . Given an ( I, O ) ,a database D , and a CQ Q , we say that Q is consis-tent with respect to the example if there exists S ⊆ D such that O ∈ Q ( I ∪ S ) and I ⊆ lin ( Q | O ( I ∪ S )) ,where lin ( Q | O ( I ∪ S )) is the lineage of the tuple O according Q evaluated on I ∪ S . The provenance specified in the explanation canbe partial but has to be a part of the assignmentand cannot includes irrelevant tuples. A partial ex-planation is any non-empty subset of the lineage ofthe output. For example, consider the the partial ex-planation, I , shown in the Partial Prov. column inthe first row in Figure 2. The query in Figure 3 isconsistent w.r.t. it since we have S = { w , dc } suchthat ( SIGM OD, Alice ) ∈ Q ( I ∪ S ) and I is a partialexplanation.We refer to a set of pairs ( I, O ) as a prov-example .Given a CQ Q , we say that Q is consistent with re-spect to this prov-example if it is consistent with eachone of the ( I, O ) pairs it contains. A consistent querycan be very general, in fact we showed in [6] that thereexists a set of (
I, O ) examples with an infinite numberof non-equivalent consistent queries. A major factorinfluencing the number of consistent queries is thelength of the query, and in particular, the number ofself-joins allowed. Therefore, a natural desideratumis a small number of joins. We utilize the concept of join graph for conjunctive queries [8]. In a join graphevery two atoms joined together in the CQ are joinedby an edge . For instance, the join graph of the queryin Figure 3 is depicted in Figure 4a. We say that aquery is connected if its join graph is connected.3 efinition 2.2.
A consistent query Q with respect toa given prov-example is consistent-minimal if (1) Q isconsistent and connected, and (2) for every query Q (cid:48) such that Q (cid:48) is consistent and connected, the numberof nodes in the join graph of Q is smaller or equal tothe number of nodes in the join graph of Q (cid:48) . Intuitively, consistent-minimal queries are morenatural as they offer a concise reason for the givenoutput according to the explanation. To demon-strate, consider the following query q(cname, aname) :- author(aid, aname, oid),writes(aid, wid1), pub(wid1, cid, ptitle1, pyear1),writes(aid, wid2), pub(wid2, cid, ptitle2, pyear2),conf(cid, cname), domain_conf(cid, did),domain(did, dname), oname = ‘TAU’, dname = ‘Databases’
It is consistent but not minimal. Conversely, thequery depicted in Figure 3 is consistent-minimal. Ourgoal is then to find a consistent-minimal query givena prov-example.
Given a pair (
I, O ) where I = { t , . . . , t n } , its prove-nance graph G P = ( V P , E P ) is defined as follows. V P are the tuple annotations and E P = {{ t i , t j } |∃ A, B. t i .A = t j .B } , i.e., there is an edge betweeneach two tuples who share a constant, including selfedges (e.g., the tuple R (1 ,
1) would have a self edge).The provenance graph of the first row of the prov-example in Figure 2 is shown in Figure 4b. We nextestablish the connection between the join and prove-nance graphs through the concept of graph homo-morphism.
Definition 2.3.
Let G J = ( V J , E J ) be a query joingraph and G P = ( V P , E P ) a provenance graph of an ( I, O ) . A graph homomorphism between G J and G P is a function h : V J → V P such that (1) ( h ( u ) , h ( v )) ∈ E P if ( u, v ) ∈ E J , particularly, for every variableshared by u, v at index i , h ( u ) , h ( v ) share a constantat the same index, (2) u, h ( u ) have the same relationname, and (3) if v ∈ V J has a projected variable atindex i to the head of Q at index j , then h ( v ) hasthe same constant at index i as the constant of O atindex j . Example 2.4.
Reconsider the join graph G J in Fig-ure 4a and the provenance graph G P in Figure 4b.An homomorphism h : V J → V P is h ( conf ) = c , h ( author ) = a , h ( pub ) = p , h ( writes ) = w , h ( org ) = o , h ( domain ) = d , h ( domain conf ) = dc . Given a set of homomorphisms H = { h , . . . , h k } between a join graph G J and a provenance graph G P ,we say that H covers G P if (cid:83) h ∈H h [ V J ] = V P . Let Q be a CQ with a join graph G J , let ( O, I ) be a rowof a prov-example with a provenance graph G P . Wecan show that. Proposition 2.5.
There exists a set H of homomor-phisms from G J to G P that covers G P iff Q is con-sistent with respect to ( O, I ) .Proof. Suppose we have a set of homomorphisms H that covers G P , so every h ∈ H maps all the nodesof G J to some of the nodes of G P such that Def-inition 2.3 holds. It is enough to show that every h ∈ H is equivalent to an assignment of the prove-nance tuples to Q that produces the otuput tuple,since (cid:83) h ∈H h [ V J ] = V P so all provenance tuples areused for some assignment to Q . Condition (1) inDefinition 2.3 defines that if two atoms share a vari-able, the two tuples to which they are mapped to alsoshare a constant in the same index. Condition (2)says that all homomorphisms map atoms to tupleswith the same relation name, and finally condition(3) says that the mapping produces the output tu-ple. Thus, every h ∈ H is essentially an assignmentto Q . Similarly for the other direction, assume Q isconsistent, then it uses all of the provenance tuples togenerate the output tuple. Every assignment definesa homomorphism that maps the join graph G J to theprovenance graph G P such that Definition 2.3 holds.Denote these homomorphisms by H . Since Q uses allprovenance tuples, H covers G P .For instance, the query in Figure 3 is consistentw.r.t. the prov-example in Figure 2 since there aretwo homomorphisms that map its join graph in Fig-ure 4a to the provenance graph of the first row inFigure 4b. One is described in Example 2.4.From Proposition 2.5 it follows that given a pair( I, O ) and a query Q such that (1) Q has the smallest4onnected join graph and (2) there is a set of homo-morphisms from the join graph of Q to the graph of( I, O ) that covers it, then Q is consistent-minimal.We use this observation when inferring queries as weshow next. We start by presenting a solution for the case wherea prov-example includes full lineage, and then relaxthis solution to account for joinless lineage. The in-put to the algorithm is a prov-example Ex and thedatabase schema graph, and it consists two parts asfollows.The first step of the algorithm is finding the pro-jected attributes of the query. To this end, we gen-erate a list of candidates from each row of Ex . Thecandidates C i of the row ( O i , I i ) are the attributeof the subset of relation names of tuples in I i thatshare a common constant with O i . The algorithmthen intersect the candidates of the projection to getthe projected attributes (cid:84) i C i . Example 2.6.
Reconsider the first row of the prov-example shown in Figure 2. The output values ‘SIG-MOD’ and ‘Alice’ are searched in the provenance tu-ples. Since the tuples annotated by c , a are the onlyones that contain them, they are the candidates fromthe first row. In the same process, we get the tu-ples c , a from the second row. For each output at-tribute, we intersect the attributes that are candidatefor projection, e.g., { conf.cname }∩{ conf.cname } = { conf.cname } for the first attribute. At a high level, the main idea of the second step ofthe algorithm is to generate all possible join graphs,and for each such graph, search for a set of homomor-phisms that covers the provenance graph of each oneof the rows in Ex . When such a set is found thecorresponding query is returned. The join graphsgeneration is done in a growing size order, whichguarantees that the first graph satisfying the cover-age condition is minimal. In fact, as shown in [6],if a consistent query exists, it must be of length atleast d and at most k + d · n , where k is the num-ber of attributes of the output tuple, d is the numberof distinct relations in the provenance, and n is the number of tuples in the largest provenance set. Thus,we consider only graphs with a least d nodes and atmost k + d · n that satisfy essential conditions fora consistent-minimal query: (1) they are connected,and (2) they include exactly the same relation namesappearing in the provenance of each one of the rowsin Ex . Finally, the algorithm adds selection condi-tions, by inspecting possible values for constants ineach attribute according to the tuples in the prov-example. For attributes that take a single value, itassigns a constant in the query. Example 2.7.
Reconsider the prov-example in Fig-ure 2 with the Full Prov. column and the schemaof the database in Figure 1. The provenance graphof the first row is depicted in Figure 4b. The algo-rithm finds the candidates for projection which are conf . cname and author . aname in both rows. Wethen start generating join graphs. We consider onlyconnected graphs, that include all the relations ap-pearing in the prov-example with at least nodes, asthe number of distinct relations in the prov-example.The only graph that fits this description is given inFigure 4a. As shown in Section 2.2, there is a set ofhomomoprphisms that covers the provenance graphof each one of the rows in the prov-example. Finally,the algorithm adds constants and returns the queryin Figure 3. Joinless lineage
We next present a heuristic toaccount for joinless lineage , a fragment of partial lin-eage where join relations containing only foreign keysis missing from the lineage. More formally, suppose { t , t , t , . . . , t k } ⊆ I constitute a full assignmentpresent in the provenance set of an output tuple O .Assume the relation of t is joined with the relationof t in the schema by a join table t such that theattributes of the table of t are only the primary keysof t and t , then I \ { t } is an acceptable explana-tion. This fragment is particularly appealing sinceit allows for intuitive explanations while nicely pre-serving the information encapsulated in the lineage.In this case, the algorithm will “complete” the expla-nation by finding the missing tuples. We are givena prov-example where the provenance I in each rowcontains all tuples except join relations that connect5wo tuples t i , t j that are present in I . We add a pro-cedure after finding the projection attributes, thatchecks whether the provenance graph G P of each rowis connected. If not, there is a gap of size one betweentwo tuples t i and t j (this can be detected by findingthe shortest path between them in the schema graph).We then look for a relevant tuple to connect t i and t j (with relations R i and R j resp.) through anotherrelation R v . This is done by querying the databaseto get the tuple t v . Example 2.8.
Consider the prov-example in Figure2 with the Partial Prov. column. We find that theprovenance graph is not connected, and specifically,that a and p do not share a constant and their re-lations, according to the schema, are connected by the writes table, so we query the writes relation in Fig-ure 1, with the aid of a and the wid of p . The resultis the tuple w so we add it to the provenance. Wedo the same for c and d to find dc . In the secondrow, we add the tuples w and dc and then continueto the graph generation step to find the query. To allow users to easily formulate explanations, weassume that the initial explanation is composed ofindividual values, and the values are then mapped totuples in the database. The challenge is finding tu-ples that most resemble the intended tuples. We willdraw on works such as [11] and use a value similar-ity score that compares the values inputted by theuser and the tuples in the database. The score ofa tuple t w.r.t an inputted value v can be definedas score ( t, v ) = max t [ a ] sim ( t [ a ] , v ), i.e., the maxi-mum similarity between a cell value in t , t [ a ], andthe value v . The sim function can be changed ac-cording to the context, e.g., string similarity or L norm for numbers. In our implementation, we haveused a greedy approach so that if a cell contains astring that exactly matches the inputted string, wereturn it immediately. Example 2.9.
Reconsider our running example withthe output examples depicted in Figure 2. An expla-nation for Alice, SIGMOD may be ‘TAU’, ‘Alice’,
Figure 5: Runtime breakdown for query inferencefrom prov-example with two rows. The red bars arethe time to extract the provenance tuples from thegiven values, the blue bars are the time to get themissing join tuples, and the green bars are the timeto infer the query from the completed prov. ‘X’, ‘Y’, ‘SIGMOD’, ‘Databases’. The values wouldthen be mapped the tuples o , a , p , p , c , d sincethese are the tuples most similar to the inputted val-ues as they have an attribute that is identical to them.For example, a [ aname ] = Alice . We have implemented our algorithm in Python 3.6with an underlying database engine in MySQL 5.7,and performed a preliminary evaluation of our ap-proach. Particularly we assessed the queries inferredby the algorithm using joinless provenance, and mea-sured the runtimes for different queries. For the firststage for inferring tuples from values we used the SQL‘LIKE’ function for strings. We evaluated queries 1–9(Table 1 depicts three of the queries) with the MASdataset [10]. The examples consisted of two outputexamples and their provenance values.We observed that the inferred queries containedadditional filters that were not included in the origi-nal query but are present in the prov-example. Thereason for that is the small number of examples givenas input. This observation raises the need for inter-activity in the system, that presents optional selec-tion criteria and allows the user select or refine thembased on the desired actual query. Apart from theseadditional filters, the algorithm inferred all queriescorrectly from a set of two rows. This is due to itsgreedy approach of trying the smaller join graphs firstthe returning the first query that has a set of homo-morphisms that covers all provenance graphs.The runtimes of the algorithm are shown in log-6able 1: Queries used in the experiments
Num. Query scale in Figure 5. The stage that was most costlywas inferring the initial provenance tuples from the database. The next stage was finding the connect-ing join tuples. Here the queries are more focusedso this stage is faster. The last stage of inferringthe query from the provenance using our homomor-phism algorithm is usually very fast. However, themore complex query 6 (shown in Table 1) contains7 joins took 9 seconds to infer. Since more joins areinvolved, the algorithm had to inspect more complexhomomorphisms and, thus, took longer to find thequery join graph and check its consistency.
We have presented an initial solution for the prob-lem of inferring queries based on output exampleswith partial explanations represented as values fromthe database. The problem requires extensive inves-tigation, which is the subject of our on-going work.In particular we consider other, more relaxed vari-ants of explanations. This problem calls for differenttechniques that are prevalent in the field of query-by-output where column mappings are performed tomatch an attribute in the output to an attribute inone of the database tables. We also consider a naturallanguage based implementation for the explanations.
References [1] A. Bonifati, R. Ciucanu, and S. Staworko. In-teractive join query inference with jim.
PVLDB ,7(13), 2014.[2] A. Bonifati, R. Ciucanu, and S. Staworko.Learning join queries from user examples.
ACMTrans. Database Syst. , 40(4), 2016.[3] P. Buneman, S. Khanna, and W. Tan. Why andwhere: A characterization of data provenance.In
ICDT , 2001.[4] Y. Cui, J. Widom, and J. L. Wiener. Tracingthe lineage of view data in a warehousing envi-ronment.
ACM Trans. Database Syst. , 2000.75] D. Deutch and A. Gilad. Qplain: Query by ex-planation. In
ICDE , 2016.[6] D. Deutch and A. Gilad. Reverse-engineeringconjunctive queries from provenance examples.In
EDBT , 2019.[7] T. J. Green, G. Karvounarakis, and V. Tannen.Provenance semirings. In
PODS , 2007.[8] D. V. Kalashnikov, L. V. Lakshmanan, andD. Srivastava. Fastqre: Fast query reverse en-gineering. In
SIGMOD , 2018.[9] F. Li and H. V. Jagadish. Constructing an inter-active natural language interface for relationaldatabases.
Proc. VLDB Endow. , 2014.[10] MAS. http://academic.research.microsoft.com/ .[11] F. Psallidas, B. Ding, K. Chakrabarti, andS. Chaudhuri. S4: Top-k spreadsheet-stylesearch for query discovery. In
SIGMOD , 2015.[12] Y. Shen, K. Chakrabarti, S. Chaudhuri, B. Ding,and L. Novik. Discovering queries based on ex-ample tuples. In