Explaining Natural Language Query Results
Daniel Deutch, Tel Aviv University, [email protected]
Nave Frost, Tel Aviv University, [email protected]
Amir Gilad, Tel Aviv University, [email protected]
Abstract
Multiple lines of research have developed Natural Language (NL) interfaces for formulating database queries. We build upon this work, but focus on presenting a highly detailed form of the answers in NL. The answers that we present are importantly based on the provenance of tuples in the query result, detailing not only the results but also their explanations. We develop a novel method for transforming provenance information to NL, by leveraging the original NL query structure. Furthermore, since provenance information is typically large and complex, we present two solutions for its effective presentation as NL text: one that is based on provenance factorization, with novel desiderata relevant to the NL case, and one that is based on summarization. We have implemented our solution in an end-to-end system supporting questions, answers and provenance, all expressed in NL. Our experiments, including a user study, indicate the quality of our solution and its scalability.
1 Introduction

In the context of databases, data provenance captures the way in which data is used, combined and manipulated by the system. Provenance information can for instance be used to reveal whether data was illegitimately used, to reason about hypothetical data modifications, to assess the trustworthiness of a computation result, or to explain the rationale underlying the computation.

As database interfaces constantly grow in use, in complexity and in the size of data they manipulate, provenance tracking becomes of paramount importance. In its absence, it is next to impossible to understand the system's operation and to follow the flow of data through the system, which in turn may be extremely harmful for the quality of results.

A setting where the lack of provenance, and consequently the lack of explanations, is of particular concern is that of database interfaces geared to be used by non-experts. Such non-expert users lack understanding of the system's inner workings, and are unable to verify that it has operated correctly. Indeed, an important component of such systems is the interface through which the non-expert communicates her needs/query to the system. But then, how does the system communicate its results back to the non-expert user? And how does it justify them in a manner that the non-expert can understand? For each system, developers currently need to develop dedicated solutions, if at all, and we are lacking a generic framework for explanations in this setting.

A particularly flourishing line of work for allowing non-experts to interact with a database is that of Natural Language Interfaces to Databases (NLIDBs). Multiple such interfaces have been developed in recent years (see e.g. [55, 7, 52, 76]). The accuracy of translation is constantly improving; still, it is far from perfect, since in general, automated translation of free text to a formal language is an extremely difficult task. Because the users of such systems are typically non-experts, they may have a hard time understanding the result or verifying its correctness. Consider for example a complex NL query over a publication database, of the form "return all organizations of authors who published in database conferences after 2005". After translating this query to SQL and running it using a query engine, the answer is a list of qualifying organizations. By looking at the answer, the user has no way of knowing whether the retrieved organizations really satisfy her specified constraints; a slight error in the translation process, e.g. misunderstanding "database conferences" or erroneously associating "after 2005" with the conference inauguration date, could result in a list of organizations that is completely wrong for the question asked.

In this work we complement the efforts of developing high-quality NLIDBs by developing a generic framework that explains the results of queries posed to NLIDBs. Explanations are based on provenance, but current provenance models are far too complex to allow for their direct presentation to non-experts. The novelty of our work is that we "translate" provenance into NL explanations of the query answers. The explanations that we provide elaborate upon answers with additional important information, and are helpful for understanding why each answer qualifies to the query criteria.

As an example, consider the Microsoft Academic Search database [1] and consider the NL query in Figure 1a. A state-of-the-art NL query engine,
NaLIR [55], is able to transform this NL query into the SQL query also shown (as a Conjunctive Query, which is the fragment that we focus on in this paper) in Figure 1b.

return the organization of authors who published papers in database conferences after 2005
(a) NL Query

query(oname) :- org(oid, oname), conf(cid, cname), pub(wid, cid, ptitle, pyear), author(aid, aname, oid), domainConf(cid, did), domain(did, dname), writes(aid, wid), dname = 'Databases', pyear > 2005
(b) CQ Q

Figure 1: NL Query and CQ Q

When evaluated using a standard database engine, the query returns the expected list of organizations. However, the answers (organizations) in the query result lack justification, which in this case would include the authors affiliated with each organization and details of the papers they have published (their titles, their publication venues and publication years). Such additional information, corresponding to the notion of provenance [41, 14, 17, 38, 39], can lead to a richer answer than simply providing the names of organizations: it allows users to also see relevant details of the qualifying organizations. Provenance information is also valuable for validation of answers: a user who sees an organization name as an answer is likely to have a harder time validating that this organization qualifies as an answer than if she was presented with the full details of publications. An understanding of the results also allows users to conclude whether their query was translated correctly and to reproduce the results if needed. There are several models of provenance previously suggested in the literature. The tuple-based model [38, 39] tracks the source tuples which participated in the computation of the results, while the value-based model [61, 18] is a more fine-grained model and follows the values of these tuples.

We propose a novel approach of presenting provenance information for answers of NL queries, again as sentences in Natural Language. There are several aspects to account for towards a solution, as follows.

• The provenance model needs to be very detailed. For example, the NL explanations that we aim for require storing not only which input tuples have contributed to the answer (in the above example these may e.g. be the author, organization and publication entries) but also the way in which they contributed to the answer. In our example, for generating the required explanations we need to store that the organization entry has matched the query "head" and was returned, that the author entry has been joined with it to find authors of the specific organization, that the publication entry was joined with the author entry, etc. Naturally, once the query is compiled from NL/examples to e.g. SQL, one could in principle track provenance as if the query was described in SQL to begin with. As we next explain, this would be sub-optimal.

• As we shall demonstrate, the way in which the user has phrased the question has a significant impact both on which parts of the computation need to be tracked and on the way in which users expect provenance information to be presented to them. In general, in works on provenance, there is typically a tight coupling between the query model and the provenance model. In particular, as we already observed, a suitable way for presenting the explanations is again as NL sentences, so that we obtain an end-to-end system where questions, answers and explanations are all expressed in Natural Language. Thus, the provenance model needs to keep track of which parts of the NL question have contributed to which parts of the computation. A major challenge in this respect is to design the model so that it correctly captures those parts of the provenance that are "important" based on the user question. As the basis for our provenance model, we use the value-based model of provenance (as opposed to the tuple-based one), as it is the one implicitly enforced by the NLIDB, which maps words to variables in the query. Once we have this mapping, we store the mappings between variables and values, to be able to reverse them and assemble an explanation sentence.

• Then, given the tracked provenance, we further need to translate it back from the formal model to an NL sentence. Generating NL sentences is a difficult task in general, but importantly, here we have the NL question that can guide us. A challenge is then how to "plug in" different parts of the provenance back into the NL question, to obtain a coherent, well-formed answer.

• Last, we need to address the challenge of provenance size. In particular, full information about the manner in which a result is obtained from the input data (and even a full description of the input data itself) is typically exhaustively long to present, especially to a non-expert.

TAU is the organization of Tova M. who published 'OASSIS...' in SIGMOD in 2014

Figure 2: Answer for a Single Assignment

The end result for our running example is demonstrated in Figure 2, which shows one of the explained answers outputted by our system in response to the NL query in Figure 1a. Having explained the overall approach and challenges, we next provide more details on each of our key contributions.
Provenance Tracking Based on the NL Query Structure
As mentioned above, a first key idea in our solution is to leverage the NL query structure in constructing NL provenance. Our solution is generic in that it is not specific to a concrete NL interface (we do have some requirements on the operation of the underlying interface, as we detail below). In our implementation, we use and modify NaLIR so that we store exactly which parts of the NL query translate to which parts of the formal query. Then, we evaluate the formal query using a provenance-aware engine (we use SelP [26]), further modified so that it stores which parts of the query "contribute" to which parts of the provenance. By composing these two "mappings" (text-to-query-parts and query-parts-to-provenance) we infer which parts of the NL query text are related to which provenance parts. Finally, we use the latter information in an "inverse" manner, to translate the provenance to NL text.
Factorization
A second key idea is related to the provenance size. In typical scenarios, a single answer may have multiple explanations (multiple authors, papers, venues and years in our example). A naïve solution is to formulate and present a separate sentence corresponding to each explanation. The result will however be, in many cases, very long and repetitive. As observed already in previous work [16, 62], different assignments (explanations) may have significant parts in common, and this can be leveraged in a factorization that groups together multiple occurrences. In our example, we can e.g. factorize explanations based on author, paper name, conference name or year. Importantly, we impose a novel constraint on the factorizations that we look for (which we call compatibility), intuitively capturing that their structure is consistent with a partial order defined by the parse tree of the question. This constraint is needed so that we can translate the factorization back to an NL answer whose structure is similar to that of the question. Even with this constraint, there may still be exponentially many (in the size of the provenance expression) compatible factorizations, and we look for the factorization with minimal size out of the compatible ones; for comparison, previous work looks for the minimal factorization with no such "compatibility constraint". The corresponding decision problem remains coNP-hard (again in the provenance size), but we devise an effective and simple greedy solution. We further translate factorized representations to concise NL sentences, again leveraging the structure of the NL query.

Summarization
We propose summarized explanations, replacing details of different parts of the explanation by their synopsis, e.g. presenting only the number of papers published by each author, the number of authors, or the overall number of papers published by authors of each organization. Such summarizations incur by nature a loss of information but are typically much more concise and easier for users to follow. Here again, while provenance summarization has been studied before (e.g. [5, 66]), the desiderata of a summarization needed for NL sentence generation are different, rendering previous solutions inapplicable here. We observe a tight correspondence between factorization and summarization: every factorization gives rise to multiple possible summarizations, each obtained by counting the number of sub-explanations that are "factorized together". We provide a robust solution, allowing to compute NL summarizations of the provenance, of varying levels of granularity.

Implementation and Experiments
We have implemented our solution in a system prototype called NLProv [23], forming an end-to-end NL interface to database querying where the NL queries, answers and provenance information are all expressed in NL. We have further conducted extensive experiments whose results indicate the scalability of the solution as well as the quality of the results, the latter through a user study.

This paper is an extended version of our PVLDB 2017 paper [24] and includes a new section on the translation of provenance to NL for UCQs, a new section on a generalized solution that is not specific to NaLIR and a discussion of the use of other provenance models, new and comprehensive experiments, and an extended in-depth review of related work.

We are extremely grateful to Fei Li and H.V. Jagadish for generously sharing with us the source code of NaLIR, and providing invaluable support.

Figure 3: Abstract Dependency Trees. (a) Verb Mod.: return, object, a verb modifier rooting a sub-sentence (properties, nsubj, others). (b) Non-Verb Mod.: return, object, a non-verb modifier rooting a sub-sentence (properties, others).
2 Preliminaries

We provide here the necessary preliminaries on Natural Language Processing, conjunctive queries and provenance.

We start by recalling some basic notions from NLP, as they pertain to the translation process of NL queries to a formal query language. We further recall a particular formal query language of interest, namely Union of Conjunctive Queries.

A key notion that we will use is that of the syntactic dependency tree of NL queries:
Definition 2.1
A dependency tree T = (V, E, L) is a node-labeled tree where labels consist of two components, as follows: (1) Part of Speech (POS): the syntactic role of the word [49, 57]; (2) Relationship (REL): the grammatical relationship between the word and its parent in the dependency tree [58].
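To make the data structures concrete, here is a minimal sketch (our own illustration, not NaLIR's API) of a dependency tree node carrying the two labels:

from dataclasses import dataclass, field
from typing import List

@dataclass
class DepNode:
    """A node of a dependency tree T = (V, E, L): a word with its (POS, REL) label."""
    word: str
    pos: str                       # Part of Speech, e.g. "NN", "VBD"
    rel: str                       # relation to the parent, e.g. "dobj", "prep"
    children: List["DepNode"] = field(default_factory=list)

    def descendants(self):
        """All nodes strictly below this one (used for the partial order <=_T later)."""
        for child in self.children:
            yield child
            yield from child.descendants()

# A fragment of the running example: "organization of authors"
authors = DepNode("authors", pos="NNS", rel="pobj")
of = DepNode("of", pos="IN", rel="prep", children=[authors])
organization = DepNode("organization", pos="NN", rel="dobj", children=[of])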
We focus on a sub-class of queries handled by NaLIR, namely that of Union of Conjunctive Queries, possibly with comparison operators (=, >, <) and logical combinations thereof (NaLIR further supports nested queries and aggregation). Formally, fix a database schema, i.e. a set of relation names along with their arities (number of attributes). A query is then defined with respect to a schema.
Definition 2.2 (From [2]) A Union of Conjunctive Queries Q is a set of Conjunctive Queries Q_i. In turn, a conjunctive query is an expression ans(u) ← R_1(u_1), ..., R_n(u_n), C where R_1, ..., R_n are relation names in the database schema, and u, u_1, ..., u_n are tuples with either variables or constants, with u_i conforming to the schema of R_i. Variables in u must appear in at least one of u_1, ..., u_n. Finally, C is a sequence of comparison constraints (=, >, <) over variables in u_1, ..., u_n and constants.

The corresponding NL queries in NaLIR follow one of the two (very general) abstract forms described in Figure 3: an object (noun) is sought for, that satisfies some properties, possibly described through a complex sub-sentence rooted by a modifier (which may or may not be a verb, a distinction whose importance will be made clear in our algorithms that follow).

Figure 4: Question and Answer Trees. (a) Query Tree: the dependency tree of the NL query, with nodes such as "organization" (POS=NN, REL=dobj), "of" (POS=IN, REL=prep), "authors" (POS=NNS, REL=pobj), "published" (POS=VBD, REL=rcmod), "conferences" (POS=NNS, REL=pobj), "database" (POS=NN, REL=nn), "after" (POS=IN, REL=prep) and "2005" (POS=CD, REL=pobj); nodes are annotated with (variable, value) pairs such as (oname, TAU), (aname, Tova M.), (ptitle, OASSIS...), (cname, SIGMOD), (pyear, 2014). (b) Answer Tree: "TAU (is the) organization of Tova M. who published 'OASSIS...' in SIGMOD in 2014".
Example 2.3
Reconsider the NL query in Figure 1a; its dependency tree is depicted in Figure 4a (ignore for now the arrows). The part-of-speech (POS) tag of each node reflects its syntactic role in the sentence; for instance, "organization" is a noun (denoted "NN"), and "published" is a verb in past tense (denoted "VBD"). The relation (REL) tag of each node reflects the semantic relation of its sub-tree with its parent. For instance, the REL of "of" is prep ("prepositional modifier"), meaning that the sub-tree rooted at "of" describes a property of "organization" while forming a complex sub-sentence. The tree in Figure 4a matches the abstract tree in Figure 3b since "organization" is the object and "of" is a non-verb modifier (its POS tag is IN, meaning "preposition or subordinating conjunction") rooting a sub-sentence describing "organization".
The dependency tree is transformed by NaLIR, based also on schema knowledge, to SQL. We focus in this work on NL queries that are compiled into Unions of Conjunctive Queries (UCQs), and discuss extensions to aggregates and nested queries below.
Example 2.4
Reconsider our running example NL query in Figure 1a; a counterpart Conjunctive Query is shown in Figure 1b. Some words of the NL query have been mapped by NaLIR to variables in the query, e.g., the word "organization" corresponds to the head variable (oname). Additionally, some parts of the sentence have been compiled to boolean conditions based on the MAS schema, e.g., the part "in database conferences" was translated to dname = 'Databases' in the CQ depicted in Figure 1b. Figure 4a shows the mapping of some of the nodes in the NL query dependency tree to variables of Q (ignore for now the values next to these variables).

The translation performed by NaLIR from an NL query to a formal one can be captured by a mapping from (some) parts of the sentence to parts of the formal query. This mapping is not a novel contribution of this paper, but we will employ this mapping to generate the NL explanation.

Definition 2.5
Given a dependency tree T = (V, E, L) and a CQ Q, a dependency-to-query-mapping τ : V → Vars(Q) is a partial function mapping a subset of the dependency tree nodes to the variables of Q.

After compiling a formal query corresponding to the user's NL query, we evaluate it and keep track of provenance, to be used in explanations. As explained in Section 1, there are essentially two explanation models that will come into play here. The first is a provenance model for the underlying formal query, in our case Union of Conjunctive Queries. We next discuss existing provenance models, then choose a particular model that fits our construction. The second, which we will discuss below, is coupled with the Natural Language model.

In terms of provenance for formal database queries, previous work has proposed a large number of different models (see Section 7 for an overview of related work). A basic distinction that we already need to make is between fine-grained and coarse-grained provenance. Generally speaking, the former keeps track of which tuples (or even cells) have contributed to each result, while the latter keeps track of the general input and output of each query/workflow operator, without necessarily connecting each input and output piece. Here our goal is to explain individual query results, and so a fine-grained provenance model is sought for.

In the context of database queries, capturing fine-grained provenance means that we keep track of the assignments of database tuples to query atoms. Assignments are the basic building block of query evaluation, and for UCQs they are defined as follows:
Definition 2.6
An assignment α for a query Q ∈ CQ with respect to a database instance D is a mapping of the relational atoms of Q to tuples in D that respects relation names and induces a mapping over variables/constants, i.e. if a relational atom R(x_1, ..., x_n) is mapped to a tuple R(a_1, ..., a_n) then we say that x_i is mapped to a_i (denoted α(x_i) = a_i, overloading notations) and we further say that the tuple was used in α. We require that any variable will be mapped to a single value, and that any constant will be mapped to itself. We further define α(head(Q)) as the tuple obtained from head(Q) by replacing each occurrence of a head variable x_i by α(x_i). The set of assignments to Q with respect to D is denoted by Γ(Q, D). Note that a single tuple in the query result may have been obtained by multiple assignments in Γ(Q, D).

Then, for a UCQ Q whose CQs are Q_1, ..., Q_n, the set of assignments to Q is defined as the union of the sets of assignments to its CQs, namely Γ(Q) = ⋃_{i=1}^{n} Γ(Q_i). The notion of a tuple being obtained from an assignment and the α notation immediately extend, noting that a single tuple may be obtained from assignments to multiple conjunctive queries.

Assignments allow for defining the semantics of CQs: a tuple t is said to appear in the query output if there exists an assignment α s.t. t = α(head(Q)). They will also be useful in defining provenance below.
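To ground the definition, here is a small self-contained sketch (our own illustration, with a toy two-relation instance rather than the full MAS schema) that enumerates the assignments of a CQ by exhaustive matching:

from itertools import product

# A toy database instance D: relation name -> list of tuples (hypothetical data).
D = {
    "org":    [(1, "UPENN"), (2, "TAU")],
    "author": [(3, "Susan D.", 1), (4, "Tova M.", 2)],
}

# CQ body as relational atoms: (relation, tuple of variable names), for
# query(oname) :- org(oid, oname), author(aid, aname, oid)
atoms = [("org", ("oid", "oname")), ("author", ("aid", "aname", "oid"))]

def assignments(atoms, D):
    """Enumerate all assignments: each atom is mapped to a tuple of its relation,
    such that every shared variable receives a single consistent value."""
    for choice in product(*(D[rel] for rel, _ in atoms)):
        binding = {}
        ok = True
        for (_, vars_), tup in zip(atoms, choice):
            for var, val in zip(vars_, tup):
                if binding.setdefault(var, val) != val:
                    ok = False  # variable already bound to a different value
                    break
            if not ok:
                break
        if ok:
            yield binding

for a in assignments(atoms, D):
    print(a["oname"], a)  # the head value alpha(head(Q)) plus the full assignment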
Example 2.7
Consider again the query Q in Figure 1b and the database in Figure 6. The tuple (TAU) is an output of Q when assigning the highlighted tuples to the atoms of Q. As part of this assignment, the tuple (2, TAU) (the second tuple in the org table) and (4, Tova M., 2) (the second tuple of the author table) are assigned to the first and second atom of Q, respectively. In addition to this assignment, there are 4 more assignments that produce the tuple (TAU) and one assignment that produces the tuple (UPENN).

Assignments may be used in provenance in multiple ways, varying in their granularities. For instance, the lineage [10] of a result tuple t is the set of input tuples appearing in some assignment yielding t; the why-provenance of t is the set of sets of tuples participating in such assignments, i.e. the contributing tuples are grouped based on the assignments they are used in. These approaches were shown in [38] to be concrete examples of a general algebraic construction, termed semiring provenance. At a high level, the idea there is that we introduce two symbolic operations "+" and "·", and use them to form algebraic representations of the provenance. Concretely, "+" is used for alternative derivations and "·" is used for combined derivation: for a given output tuple, we sum over the assignments that have yielded it, and each assignment is represented via a multiplication over the terms that have contributed to it. The idea is that assignments capture the reasons for a tuple to appear in the query result, with each assignment serving as an alternative such reason (indeed, the existence of a single assignment yielding the tuple suffices, according to the semantics, for its inclusion in the query result).

(oname,TAU) · (aname,Tova M.) · (ptitle,OASSIS...) · (cname,SIGMOD) · (pyear,14')
+ (oname,TAU) · (aname,Tova M.) · (ptitle,Querying...) · (cname,VLDB) · (pyear,06')
+ (oname,TAU) · (aname,Tova M.) · (ptitle,Monitoring..) · (cname,VLDB) · (pyear,07')
+ (oname,TAU) · (aname,Slava N.) · (ptitle,OASSIS...) · (cname,SIGMOD) · (pyear,14')
+ (oname,TAU) · (aname,Tova M.) · (ptitle,A sample...) · (cname,SIGMOD) · (pyear,14')
+ (oname,UPENN) · (aname,Susan D.) · (ptitle,OASSIS...) · (cname,SIGMOD) · (pyear,14')

Figure 5: Value-Level Provenance

In [38], the basic atomic units that appear in a provenance expression are the "annotations" (intuitively identifiers, or metadata associated with the tuples) of the tuples that contribute to an assignment. Here, in order to form a detailed explanation of the result of an NL query, we need to keep track of a finer-grained resolution. Within each assignment, we keep record of the value assigned to each variable, and note that the conjunction of these value assignments is required for the assignment to hold.

Definition 2.8
Let A(Q, D) be the set of assignments for a UCQ Q and a database instance D. We define the value-level provenance of Q w.r.t. D as

∑_{α ∈ A(Q,D)} ∏_{(x_i, a_i) s.t. α(x_i) = a_i} (x_i, a_i).

The reason for our use of a value-based rather than tuple-based provenance model is that, as we will next show, we wish to connect different pieces of the provenance back to the NL question, in order to form a detailed NL explanation.
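Continuing the sketch above (again our own illustration, not the paper's implementation), the value-level provenance is simply a sum over assignments of products of (variable, value) pairs:

# Builds on assignments() from the previous sketch.
def value_level_provenance(atoms, D):
    """Represent the provenance as a list of monomials; each monomial is a
    frozenset of (variable, value) pairs, one monomial per assignment."""
    return [frozenset(binding.items()) for binding in assignments(atoms, D)]

def group_by_answer(atoms, D, head_var):
    """Group monomials by the value of the head variable: one polynomial
    (list of monomials) per output tuple."""
    polys = {}
    for m in value_level_provenance(atoms, D):
        answer = dict(m)[head_var]
        polys.setdefault(answer, []).append(m)
    return polys

# e.g. group_by_answer(atoms, D, "oname") maps "TAU" to its alternative derivations.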
Example 2.9
Re-consider our running example query and consider the database in Figure 6. The value-level provenance is shown in Figure 5. Each of the 6 summands stands for a different assignment (i.e. an alternative reason for the tuple to appear in the result). Assignments are represented as multiplications of pairs of the form (var, val) such that var is assigned val in the particular assignment. We only show here variables to which a query word was mapped; these will be the relevant variables for formulating the answer.

It is important to note that we refer to provenance as the mapping between the variables in the query and the values in the database which occurs during the evaluation process. This process is completely separate from
NaLIR's framework. The provenance is stored as part of the evaluation of the formal query inferred by NaLIR over the database, and is therefore performed after NaLIR has completed the query inference process.

Rel. org:
oid | oname
1 | UPENN
2 | TAU

Rel. author:
aid | aname | oid
3 | Susan D. | 1
4 | Tova M. | 2
5 | Slava N. | 2

Rel. pub:
wid | cid | ptitle | pyear
6 | 10 | "OASSIS..." | 2014
7 | 10 | "A sample..." | 2014
8 | 11 | "Monitoring..." | 2007
9 | 11 | "Querying..." | 2006

Rel. writes:
aid | wid
4 | 6
3 | 6
5 | 6
4 | 7
4 | 8
4 | 9

Rel. conf:
cid | cname
10 | SIGMOD
11 | VLDB

Rel. domainConf:
cid | did
10 | 18
11 | 18

Rel. domain:
did | dname
18 | Databases

Figure 6: DB Instance
We now start describing our transformation of provenance to an NL sentence, leveraging the structure of the original question. We focus in this section on the case of a Conjunctive Query and a single assignment to its clauses. In subsequent sections we show how to extend the solution to multiple assignments and unions of conjunctive queries, where the solution presented in this section will serve as a building block.
Our first important observation is that words in the NL question can be connected to (variable, value) pairs in the provenance polynomial. For instance, "conference" corresponds to the assignment of cname to SIGMOD or VLDB, "author" corresponds to the assignment of aname to Tova M., and so on. The reason this connection is important is that it gives us strong hints on how to form a detailed answer in Natural Language: given this information we know for instance that SIGMOD should replace/reside next to "database conferences" (the decision of which of the two options to follow will be discussed below, based on the sentence structure). Fortunately, the choice of models we have made in the preliminaries gives us relatively straightforward means to derive this mapping. The idea is to marry the two mappings discussed in the previous section as a step towards generating an NL explanation: the dependency-to-query-mapping performed by NaLIR and the value-based provenance, to get a direct mapping from words to database values. First, we have the dependency-to-query-mapping (Definition 2.5) from the NL query's dependency tree (e.g. "author") to query variables (e.g. "aname"), which we get from the NLIDB. Second, we have, in the value-based provenance, a detailed account of the assignments of query variables to values from the database (e.g. "aname" to "Tova. M."). If we compose these mappings, we get a (partial) mapping from words in the NL question to data values.
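A minimal sketch of this composition (the names below are ours; the real dependency-to-query-mapping comes from the NLIDB):

# tau: dependency-tree node (word) -> query variable (Definition 2.5, partial).
tau = {"organization": "oname", "authors": "aname",
       "papers": "ptitle", "conferences": "cname", "2005": "pyear"}

# alpha: one assignment, query variable -> database value (Definition 2.6).
alpha = {"oname": "TAU", "aname": "Tova M.", "ptitle": "OASSIS...",
         "cname": "SIGMOD", "pyear": 2014}

# Composing the two yields the word -> value mapping used to build the answer.
word_to_value = {word: alpha[var] for word, var in tau.items() if var in alpha}
print(word_to_value)  # {'organization': 'TAU', 'authors': 'Tova M.', ...}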
Example 3.1
Continuing our running example, consider the assignment represented by the first monomial of Figure 5. Further reconsider Figure 4a, and now note that each node is associated with a pair (var, val) of the variable to which the node was mapped, and the value that this variable was assigned in this particular assignment. For instance, the node "organization" was mapped to the variable oname which was assigned the value "TAU".

3.2 Building an Answer Tree

Having established the mapping from words in the NL query to values in the provenance, we are ready to form a basic tree for the provenance-aware answer. The idea is now to follow the structure of the NL query dependency tree and generate an answer tree with the same structure, by replacing/modifying the words in the question with the values from the result and provenance that were mapped using the dependency-to-query-mapping and the assignment. Yet, note that simply replacing the values does not always result in a coherent sentence, as shown in the following example.
Example 3.2
Re-consider the dependency tree depicted in Figure 4a. If we were to replace the organization node with the value "TAU" mapped to it, the word "organization" would not appear in the answer, although it is needed to produce the coherent answer depicted in Figure 2. Without this word, it is unclear how to deduce the information about the connection between "Tova M." and "TAU".
We next account for these difficulties and present an algorithm that outputs the dependency tree of a detailed answer, under some plausible assumptions on the structure of the question tree.

Recall that we have assumed that the dependency tree of the NL query follows one of the abstract forms in Figure 3. We distinguish between two cases based on nodes whose REL (relationship with parent node) is modifier; in the first case, the clause begins with a verb modifier (e.g., the node "published" in Fig. 4a is a verb modifier) and in the second, the clause begins with a non-verb modifier (e.g., the node "of" in Fig. 4a is a non-verb modifier). Algorithm 1 considers these two forms of dependency tree and provides a tailored solution for each one, in the form of a dependency tree that fits the correct answer structure. It does so by augmenting the query dependency tree into an answer tree.

The algorithm operates as follows. We start with the dependency tree of the NL query, an empty answer tree T_A, a dependency-to-query-mapping, an assignment, and a node object from the query tree. We denote the set of all modifiers by MOD and the set of all verbs by VERB. The algorithm is recursive and handles several cases, depending on object and its children in the dependency tree. If the node object is a leaf (Line 2), we replace it with the value mapped to it by the dependency-to-query-mapping and the assignment, if such a mapping exists. Otherwise (it is a leaf without a mapping), it remains in the tree as it is. Second, if L(object).REL is a modifier (Line 5), we call the procedure Replace in order to replace its entire subtree with the value mapped to it, and add the suitable word for equality, depending on the type of its child (e.g., location, year, etc., taken from the pre-prepared table), as its parent (using procedure AddParent). The third case handles a situation where object has a non-verb modifier child (Line 9). We use the procedure Adjust with a false flag to copy T_Q into T_A, remove the return node and add the value mapped to object as its child in T_A. The difference in the fourth case (Line 12) is that the value of flag is now true. This means that instead of adding the value mapped to object as its child, the Adjust procedure replaces the node with its value. Finally, if object has a modifier child child (verb or non-verb), the algorithm makes a recursive call for all of the children of child (Line 16). This recursive call is needed here since a modifier node can be the root of a complex sub-tree (recall Example 2.3).
Algorithm 1: ComputeAnswerTree
input: A dependency tree T_Q, an answer tree T_A (empty in the first call), a dependency-to-query-mapping τ, an assignment α, a node object ∈ T_Q
output: Answer tree with explanations T_A
1:  child := null;
2:  if object is a leaf then
3:      value = α(τ(object));
4:      Replace(object, value, T_A);
5:  else if L(object).REL is mod then
6:      value = α(τ(child_TQ(object)));
7:      Replace(tree(object), value, T_A);
8:      AddParent(T_A, value);
9:  else if object has a child v s.t. L(v).REL ∈ MOD and L(v).POS ∉ VERB then
10:     Adjust(T_Q, T_A, τ, α, object, false);
11:     child := v;
12: else if object has a child v s.t. L(v).REL ∈ MOD and L(v).POS ∈ VERB then
13:     Adjust(T_Q, T_A, τ, α, object, true);
14:     child := v;
15: if child ≠ null then
16:     foreach u ∈ children_TQ(child) do
17:         ComputeAnswerTree(T_Q, T_A, τ, α, u);
18: return T_A;

Example 3.3
Re-consider Figure 4a, and note the mappings from the nodes to the variables and values as reflected in the boxes next to the nodes. To generate an answer, we follow the NL query structure, "plugging-in" mapped database values. We start with "organization", which is the first object node. Observe that "organization" has the child "of" which is a non-verb modifier, so we add "TAU" as its child and assign true to the hasMod variable. We then reach Line 15 where the condition holds and we make a recursive call to the children of "of", i.e., the node object is now "authors". Again we consider all cases until reaching the fourth (Line 12). The condition holds since the node "published" is a verb modifier, thus we replace "authors" with "Tova M.", mapped to it. Then, we make a recursive call for all children of "published" since the condition in Line 15 holds. The nodes "who" and "papers" are leaves so they satisfy the condition in Line 2. Only "papers" has a value mapped to it, so it is replaced by this value ("OASSIS..."). However, the nodes "after" and "in" are modifiers, so when the algorithm is invoked with object = "after" ("in"), the second condition holds (Line 5) and we replace the subtree of these nodes with the node mapped to their child (in the case of "after" it is "2014" and in the case of "in" it is "SIGMOD"), and we attach the node "in" as the parent of the node in both cases, as it is the suitable word for equality for years and locations. We obtain a tree representation of the answer (Fig. 4b).

So far we have augmented the NL query dependency tree to obtain the dependency tree of the answer. The last step is to translate this tree to a sentence. To this end, we recall that the original query, in the form of a sentence, was translated by
NaLIR to the NL query dependency tree. To translate the dependency tree to a sentence, we essentially "revert" this process, further using the mapping of NL query dependency tree nodes to (sets of) nodes of the answer. When generating the sentence, we have two different scenarios: when a word or phrase in the original dependency tree was replaced by the value to which it was mapped, we replace the word/phrase in the NL query with the value mapped to it. Otherwise, the value mapped to it was added as its child, and in this case we add it either before or after the mapped word/phrase, according to its POS, with the appropriate connecting word taken from a stored table.
Example 3.4
Converting the answer tree in Figure 4b to a sentence is done by replacing the words of the NL query with the values mapped to them, e.g., the word "authors" in the NL query (Figure 1a) is replaced by "Tova M." and the word "papers" is replaced by "OASSIS...". The word "organization" is not replaced (as it remains in the answer tree) but rather the words "TAU is the" are added prior to it, since its POS is not a verb and its REL is a modifier. Completing this process, we obtain the answer shown in Figure 2.
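The following sketch (ours; the real system reverses NaLIR's own linearization) illustrates the two scenarios, substituting a word or attaching the value with a connecting phrase:

def linearize(words, word_to_value, attach_before):
    """words: NL query tokens in order; word_to_value: the composed mapping;
    attach_before: words whose value is prepended (with a connecting phrase)
    rather than substituted."""
    out = []
    for w in words:
        if w in attach_before and w in word_to_value:
            out += [str(word_to_value[w]), "is the", w]  # "TAU is the organization"
        elif w in word_to_value:
            out.append(str(word_to_value[w]))            # "authors" -> "Tova M."
        else:
            out.append(w)
    return " ".join(out)

words = ["organization", "of", "authors", "who", "published", "papers",
         "in", "conferences", "in", "2014"]
print(linearize(words, {"organization": "TAU", "authors": "Tova M.",
                        "papers": "'OASSIS...'", "conferences": "SIGMOD"},
                attach_before={"organization"}))
# TAU is the organization of Tova M. who published 'OASSIS...' in SIGMOD in 2014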
Logical operators (and, or) and the part of the NL query they relate to will be converted by NaLIR to a logical predicate, which will be mapped by the assignment to one value that satisfies the logical statement (we consider here logical operators that are compiled by NaLIR into a single CQ; the UCQ case is considered in Section 6). To handle these parts of the query, we augment Algorithm 1 as follows: immediately following the first case (before the current Line 5), we add a condition checking whether the node object has a logical operator ("and" or "or") as a child. If so, we call Procedure HandleLogicalOps with the trees T_Q and T_A, the logical operator node as u, the dependency-to-query-mapping τ and the assignment α. The procedure initializes a set S to store the nodes whose subtree needs to be replaced by the value given to the logical predicate (Line 2). Procedure HandleLogicalOps first locates all nodes in T_Q that were mapped by the dependency-to-query-mapping to the same query variable as the sibling of the logical operator (denoted by u). Then, it removes the subtrees rooted at each of their parents (Line 8), adds the value (denoted by val) from the database mapped to all of them at the same level as their parents (Line 9), and finally, the suitable word for equality is added as the parent of val in the tree by the procedure AddParent (Line 10).
Procedure HandleLogicalOps
input: A dependency tree T_Q, T_A, u ∈ V_TA, a dependency-to-query-mapping τ and an assignment α
1:  w ← parent_TQ(u);
2:  S ← {w};
3:  var ← τ(children_TA(w) \ u);
4:  val ← α(τ(children_TA(w) \ u));
5:  for z ∈ siblings_TA(w) do
6:      if z has a child mapped by τ to var then
7:          S.Insert(z);
8:  parent_TA(w).children_TA().Remove(S);
9:  parent_TA(w).children_TA().Insert(val);
10: AddParent(T_A, val);

In the previous section we have considered the case where the provenance consists of a single assignment. In general, it may include multiple assignments; this is the case already for Conjunctive Queries, as illustrated in Section 2. We next generalize the construction to account for multiple
assignments. Note that a naïve solution in this respect is to generate a sentence for each individual assignment and concatenate the resulting sentences. However, already for the small-scale example presented here, this would result in a long and unreadable answer (recall Figure 5, consisting of six assignments). Instead, we propose two solutions: the first is based on the idea of provenance factorization [62, 16], and the second (in the following section) leverages factorization to provide a summarized form.

(a) f_1:
[TAU] · A([Tova M.] · B([VLDB] · ([2006] · [Querying...] + [2007] · [Monitoring...]) + [SIGMOD] · [2014] · ([OASSIS...] + [A Sample...]))B + [Slava N.] · [OASSIS...] · [SIGMOD] · [2014])A + [UPENN] · [Susan D.] · [OASSIS...] · [SIGMOD] · [2014]

(b) f_2:
[TAU] · ([SIGMOD] · [2014] · ([OASSIS...] · ([Tova M.] + [Slava N.]) + [Tova M.] · [A Sample...]) + [VLDB] · [Tova M.] · ([2006] · [Querying...] + [2007] · [Monitoring...])) + [UPENN] · [Susan D.] · [OASSIS...] · [SIGMOD] · [2014]

Figure 7: Provenance Factorizations
Provenance size and complexity is a known and well-studied issue, and various solutions were presented to reduce it [26, 8]. Observing that different assignments in the provenance expression typically share significant parts, one prominent approach [8, 62, 16] suggests using algebraic factorization. The idea is to regard the provenance as a polynomial (see Figure 5) and use distributivity to represent it in a more succinct way. For instance, the expression x · y + x · z can be factorized to the equivalent expression x · (y + z).

The main purpose of classical provenance factorization, as in algebraic factorization, is to reduce the size of the provenance by removing duplicate records and nodes. In our setting, different considerations come into play, as we shall show. We start by defining the notion of factorization in a standard way (see e.g. [62, 27]).

Definition 4.1
Let P be a provenance expression. We say that an expression f is a factorization of P if f may be obtained from P through (repeated) use of some of the following axioms: distributivity of summation over multiplication, associativity and commutativity of both summation and multiplication.

Example 4.2
Re-consider the provenance expression in Figure 5. Two possible factorizations are shown in Figure 7, keeping only the values and omitting the variable names for brevity (ignore the A, B brackets for now). In both cases, the idea is to avoid repetitions in the provenance expression, by taking out a common factor that appears in multiple summands. Different choices of which common factor to take out lead to different factorizations.
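Treating the provenance as a polynomial, an off-the-shelf computer algebra system can already perform such rewrites. The following toy illustration uses sympy (our own example, not the paper's factorization algorithm):

from sympy import symbols, factor, expand

x, y, z = symbols("x y z")
p = x*y + x*z          # two derivations sharing the factor x
f = factor(p)          # x*(y + z), obtained by distributivity
assert expand(f) == p  # the factorization preserves the set of derivations
print(f)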
How do we measure whether a possible factorization is suitable/preferable to others? Standard desiderata [62, 27] are that it should be short, or that the maximal number of appearances of an atom is minimal. On the other hand, we factorize here as a step towards generating an NL answer; to this end, it will be highly useful if the (partial) order of nesting of value annotations in the factorization is consistent with the (partial) order of corresponding words in the NL query. We will next formalize this intuition as a constraint over factorizations. We start by defining a partial order on nodes in a dependency tree:
Definition 4.3
Given a dependency tree T, we define ≤_T as the descendant partial order of nodes in T: for each two nodes x, y ∈ V(T), we say that x ≤_T y if x is a descendant of y in T.

Example 4.4
In our running example (Figure 4a) it holds in particular that authors ≤_T organization, years ≤_T authors, conferences ≤_T authors and papers ≤_T authors, but papers, years and conferences are incomparable.

Next we define a partial order over elements of a factorization, intuitively based on their nesting depth. To this end, we first consider the circuit form [13] of a given factorization:
Example 4.5
Consider the partial circuit of f_1 in Figure 8. The root, ·, has two children; the left child is the leaf "TAU" and the right is a + child whose subtree includes the part that is "deeper" than "TAU".

Given a factorization f and an element n in it, we denote by level_f(n) the distance of the node n from the root of the circuit induced by f, multiplied by (-1), so that level_f(n) is bigger for a node n closer to the circuit root.

Figure 8: Sub-Circuit of f_1. The root · has the leaf "TAU" and a + node as children; under the +, one · node combines "Tova M." with a nested sub-circuit, and a second · node combines "Slava N.", "OASSIS...", "SIGMOD" and "2014".

Our goal here is to define the correspondence between the level of each node in the circuit and the level of its "source" node in the NL query's dependency tree (note that each node in the query corresponds to possibly many nodes in the circuit: all values assigned to the variable in the different assignments). In the following definition we will omit the database instance for brevity and denote the provenance obtained for a query with dependency tree T by prov_T. Recall that the dependency-to-query-mapping maps the nodes of the dependency tree to the query variables, and the assignment maps these variables to values from the database (Definitions 2.5, 2.6, respectively).

Definition 4.6
Let T be a query dependency tree, let prov_T be a provenance expression, let f be a factorization of prov_T, let τ be a dependency-to-query-mapping and let {α_1, ..., α_n} be the set of assignments to the query. For each two nodes x, y in T we say that x ≤_f y if ∀ i ∈ [n]: level_f(α_i(τ(x))) ≤ level_f(α_i(τ(y))). We say that f is T-compatible if each pair of nodes x ≠ y ∈ V(T) that satisfy x ≤_T y also satisfy x ≤_f y.

Essentially, T-compatibility means that the partial order of nesting between values, for each individual assignment, must be consistent with the partial order defined by the structure of the question. Note that the compatibility requirement imposes constraints on the factorization, but it is in general far from dictating the factorization, since the order x ≤_T y is only partial, and there is no constraint on the order of each two provenance nodes whose "origins" in the query are unordered. Among the T-compatible factorizations, we will prefer shorter ones.
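The compatibility condition can be checked directly on the circuit form. Below is a small sketch of such a check (our own simplification: circuits are nested tuples, and each value's shallowest occurrence is compared):

# A circuit is either a leaf value or a tuple (op, children) with op in {'+', '*'}.
def levels(circuit, depth=0, out=None):
    """Map each leaf value to its list of levels, where level = -depth."""
    out = {} if out is None else out
    if isinstance(circuit, tuple):
        for child in circuit[1]:
            levels(child, depth + 1, out)
    else:
        out.setdefault(circuit, []).append(-depth)
    return out

def t_compatible(order_pairs, tau, alphas, circuit):
    """order_pairs: pairs (x, y) of query-tree nodes with x <=_T y.
    Checks level_f(alpha_i(tau(x))) <= level_f(alpha_i(tau(y))) for every
    assignment alpha_i, comparing shallowest occurrences as a simplification."""
    lv = levels(circuit)
    for x, y in order_pairs:
        for alpha in alphas:
            if max(lv[alpha[tau[x]]]) > max(lv[alpha[tau[y]]]):
                return False
    return True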
Definition 4.7
Let T be an NL query dependency tree and let prov_T be a provenance expression for the answer. We say that a factorization f of prov_T is optimal if f is T-compatible and there is no T-compatible factorization f' of prov_T such that |f'| < |f| (|f| is the length of f).

The following example shows that the T-compatibility constraint still allows much freedom in constructing the factorization. In particular, different choices can (and sometimes should, to achieve minimal size) be made for different sub-expressions, including ones leading to different answers and ones leading to the same answer through different assignments.

Example 4.8
Recall the partial order ≤_T imposed by our running example query, shown in part in Example 4.4. It implies that in every compatible factorization, the organization name must reside at the highest level, and indeed TAU was "pulled out" first in Figure 8; similarly the author name must be pulled out next. In contrast, since the query nodes corresponding to title, year and conference name are unordered, we may, within a single factorization, factor out e.g. the year in one part of the factorization and the conference name in another one. As an example, Tova M. has two papers published in VLDB ("Querying..." and "Monitoring...") so factorizing based on VLDB would be the best choice for that part. On the other hand, suppose that Slava N. had two papers published in 2014; then we could factorize them based on 2014. The factorization could, in that case, look like the following (where the parts taken out for Tova and Slava are shown in bold):

[TAU] · ([Tova M.] · ([VLDB] · ([2006] · [Querying...] + [2007] · [Monitoring...]) + [SIGMOD] · [2014] · ([OASSIS...] + [A Sample...])) + [Slava N.] · ([2014] · ([SIGMOD] · [OASSIS...] + [VLDB] · [Ontology...])))

The following example shows that in some cases, requiring compatibility can come at the cost of compactness. As a sanity check, note that the identity factorization that simply keeps the provenance intact is T-compatible. Further, T-compatible factorizations are factorizations that keep the answer (the object node in Figure 3) at the start of the sentence and only then refer to the provenance. When many answers and assignments are involved, it is thus possible to obtain a T-compatible factorization by factorizing each answer with its provenance by itself and then combining all of them under a joint root.

Example 4.9
Consider the query tree T depicted in Figure 4a and the factorizations prov_T (the identity factorization) depicted in Figure 5 and f_1, f_2 presented in Figure 7. prov_T is of length 30 and is 5-readable, i.e., the maximal number of appearances of a single variable is 5 (see [27]). f_1 is of length 20, while the length of f_2 is only 19. In addition, both f_1 and f_2 are 3-readable. Based on those measurements f_2 seems to be the best factorization, yet f_1 is T-compatible with the question and f_2 is not. For example, conferences ≤_T authors but "SIGMOD" appears higher than "Tova M." in f_2. Choosing a T-compatible factorization such as f_1 will lead (as shown below) to an answer whose structure resembles that of the question, and thus translates to a more coherent and fitting NL answer.

As mentioned above, the identity factorization is always T-compatible, so we are guaranteed at least one optimal factorization (but it is not necessarily unique). We next study the problem of computing such a factorization.

4.2 Computing Factorizations

Recall that our notion of compatibility restricts the factorization so that its structure resembles that of the question. Without this constraint, finding shortest factorizations is coNP-hard in the size of the provenance (i.e. a boolean expression) [40]. The compatibility constraint does not reduce the complexity, since it only restricts choices relevant to part of the expression, while allowing freedom for arbitrarily many other elements of the provenance. Also recall (Example 4.8) that the choice of which element to "pull out" needs in general to be done separately for each part of the provenance so as to optimize its size (which is the reason for the hardness in [40] as well). In general, obtaining the minimum-size T-compatible factorization of prov_T is coNP-hard, by a reduction from [40].

Greedy Algorithm.
Despite this result, the constraint of compatibility does help in practice, in that we can avoid examining choices that violate it. For choices that do maintain compatibility, we devise a simple algorithm that chooses greedily among them. More concretely, the input to Algorithm 2 is the query tree T_Q (with its partial order ≤_TQ) and the provenance prov_TQ. The algorithm output is a T_Q-compatible factorization f. Starting from prov, the progress of the algorithm is made in steps, where at each step the algorithm traverses the circuit induced by prov in a BFS manner from top to bottom, and takes out a variable that would lead to a minimal expression, out of the valid options that keep the current factorization T-compatible. Naturally, the algorithm does not guarantee an optimal factorization (in terms of length), but performs well in practice (see Section 8).

In more detail, we start by choosing the largest nodes according to ≤_TQ which have not been processed yet (Line 2). Afterwards, we sort the corresponding variables in a greedy manner, based on the number of appearances of each variable in the expression, using the procedure sortByFrequentVars (Line 3). In Lines 4–5, we iterate over the sorted variables and extract them from their sub-expressions. This is done while preserving the ≤_TQ order with the larger nodes, thus ensuring that the factorization will remain T_Q-compatible. We then add all the newly processed nodes to the set Processed, which contains all nodes that have already been processed (Line 6). Lastly, we check whether there are no more nodes to be processed, i.e., if the set Processed includes all the nodes of T_Q (denoted V(T_Q), see the condition in Line 7). If the answer is "yes", we return the factorization. Otherwise, we make a recursive call. In each such call, the set Processed becomes larger, until the condition in Line 7 holds.
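The greedy step can be made concrete with the following compact sketch (ours, simplified: monomials are dicts from variables to values, and the partial order is supplied as a list of variable groups, outermost first):

from collections import Counter

def greedy_factorize(monomials, var_levels):
    """Recursively pull out the most frequent (variable, value) pair, visiting
    variable groups in the order dictated by <=_T_Q."""
    if not var_levels:
        return [sorted(m.items()) for m in monomials]  # fully factored leaves
    frontier = var_levels[0]
    counts = Counter((v, m[v]) for m in monomials for v in frontier if v in m)
    if not counts:
        return greedy_factorize(monomials, var_levels[1:])
    (var, val), _ = counts.most_common(1)[0]           # greedy choice
    with_val = [dict(m) for m in monomials if m.get(var) == val]
    without = [m for m in monomials if m.get(var) != val]
    for m in with_val:
        del m[var]                                     # (var, val) is factored out
    result = [((var, val), greedy_factorize(with_val, var_levels))]
    if without:
        result += greedy_factorize(without, var_levels)
    return result

# Toy usage on a fragment of Figure 5:
monos = [{"oname": "TAU", "aname": "Tova M.", "cname": "VLDB"},
         {"oname": "TAU", "aname": "Tova M.", "cname": "SIGMOD"},
         {"oname": "UPENN", "aname": "Susan D.", "cname": "SIGMOD"}]
print(greedy_factorize(monos, [["oname"], ["aname"], ["cname"]]))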
Example 4.10
Consider the query tree T_Q depicted in Figure 4a, and the provenance prov in Figure 5. As explained above, the largest node according to ≤_TQ is organization, hence "TAU" will be taken out from the brackets multiplying all summands that contain it. Afterwards, the next node according to the order relation will be author, therefore we group by author, taking out "Tova M.", "Slava N." etc. The following choice (between conference, year and paper name) is then done greedily for each author, based on its number of occurrences. For instance, VLDB appears twice for Tova M. whereas each paper title and year appears only once, so it will be pulled out. The polynomial [Slava N.] · [OASSIS...] · [SIGMOD] · [2014] will remain unfactorized as all values appear once. Eventually, the algorithm will return the factorization f_1 depicted in Figure 7, which is T_Q-compatible and much shorter than the initial provenance expression.

Complexity
Denote the provenance size by n. The algorithm complexity is O(n² · log n): at each recursive call, we sort all nodes in O(n · log n) (Line 3) and then we handle (in Frontier) at least one node (in the case of a chain graph) or more. Hence, in the worst case we would have n recursive calls, each one costing O(n · log n).

Algorithm 2: GreedyFactorization
input: T_Q - the query tree, ≤_TQ - the query partial order, prov - the provenance, τ, α - dependency-to-query-mapping and assignment from nodes in T_Q to provenance variables, Processed - subset of nodes from V(T_Q) which were already processed (initially, ∅)
output: f - a T_Q-compatible factorization of prov_TQ
1:  f ← prov;
2:  Frontier ← {x ∈ V(T_Q) | ∀ y ∈ V(T_Q) \ Processed: x ≰_TQ y};
3:  vars ← sortByFrequentVars({α(τ(x)) | x ∈ Frontier}, f);
4:  foreach var ∈ vars do
5:      take out var from sub-expressions in f not including variables from {x | ∃ y ∈ Processed: x = α(τ(y))};
6:  Processed ← Processed ∪ Frontier;
7:  if |Processed| = |V(T_Q)| then
8:      return f;
9:  else
10:     return GreedyFactorization(T_Q, f, τ, α, Processed);

Optimization
Since T-compatible factorizations keep the answer (the object node in Figure 3) at the start of the sentence, we can utilize the abstract factorization structure of one answer in the factorization of all other answers. For this, we need to augment Algorithm 2 in the following manner. First, only the monomials that contain the first answer will be factorized using Algorithm 2. Then, an abstract factorization structure f_a can be inferred from this factorization by replacing some of the values with the variables mapped to them. The values that are replaced with variables are those that have a clear hierarchy between them in T_Q, while values that were mapped to words in the same level of T_Q are not part of the abstract factorization structure, as the hierarchy between them may vary based on the nature of the assignments each result has. Namely, if x ≠ y ∈ V(T_Q) satisfy x ≤_TQ y, the variables that x and y are mapped to will be part of f_a and will hold var(x) ≤_fa var(y), where var(x) is the variable x is mapped to. Intuitively, the circuit induced by f_a maintains the partial order of nodes in T_Q. Finally, given the provenance polynomial of another answer, we replace the variables in f_a with the corresponding constants and greedily factorize only the parts of the polynomial that are not included in f_a.

The final step is to turn the obtained factorization into an NL answer. Similarly to the case of a single assignment (Section 3), we devise a recursive algorithm that leverages the mappings and assignments to convert the query dependency tree into an answer tree. Intuitively, we follow the structure of a single answer, replacing each node there by either a single node, standing for a single word of the factorized expression, or by a recursively generated tree, standing for some brackets (sub-circuit) in the factorized expression.

In more detail, the algorithm operates as follows. We iterate over the children of root (the root of the current sub-circuit), distinguishing between two cases. First, for each leaf child p, we first (Line 4) assign to val the database value corresponding to the first element of p under the assignment α (recall that p is a pair (variable, value)). We then look up the node containing the value mapped to p's variable in the answer tree T_A and change its value to val in Lines 5, 6 (the value of p). Finally, in Line 7 we reorder nodes on the same level according to their order in the factorization (so that we output a semantically correct NL answer). Second, for each non-leaf child, the algorithm performs a recursive call in which the factorized answer subtree is computed (Line 9). Afterwards, the nodes of the resulting subtree aside from the nodes of T_A are attached to T_F under the node corresponding to their LCA in T_F (Lines 10–13). In this process, we attach the new nodes that were placed lower in the circuit in the most suitable place for them semantically (based on T_A), while also maintaining the structure of the factorization.
Algorithm 3: ComputeFactAnswerTree
input: α - an assignment to the NL query, T_A - answer dependency tree based on α, root - the root of the circuit induced by the factorized provenance
output: T_F - tree of the factorized answer
1:  T_F ← copy(T_A);
2:  foreach p ∈ children_f(root) do
3:      if p is a leaf then
4:          val ← α(var(p));
5:          node ← Lookup(var(p), α, T_A);
6:          ReplaceVal(val, node, T_F);
7:          Rearrange(node, T_A, T_F);
8:      else
9:          T_recF = ComputeFactAnswerTree(α, T_A, p);
10:         recNodes = V(T_recF) \ V(T_A);
11:         parent_recF ← LCA(recNodes);
12:         parent_F ← the node corresponding to parent_recF in T_F;
13:         attach recNodes to T_F under parent_F;
14: return T_F;

Example 4.11
Consider the factorization f_1 depicted in Figure 7, and the structure of the single-assignment answer depicted in Figure 4b, which was built based on Algorithm 1. Given this input, Algorithm 3 will generate an answer tree corresponding to the following sentence: "TAU is the organization of Tova M. who published in VLDB 'Querying...' in 2006 and 'Monitoring...' in 2007, and in SIGMOD in 2014 'OASSIS...' and 'A sample...', and Slava N. who published 'OASSIS...' in SIGMOD in 2014. UPENN is the organization of Susan D. who published 'OASSIS...' in SIGMOD in 2014."
Note that the query has two results: "TAU" and "UPENN". "UPENN" was produced with a single assignment, but there are 5 different assignments producing "TAU". We now focus on this sub-circuit, depicted in Figure 8. After initializing T_F, in Lines 3 – 7 the algorithm finds the value TAU and the node corresponding to it in T_A (which originally contained the variable organization). It then copies this node to T_F and assigns it the value "TAU". Next, the algorithm handles the + node with a recursive call in Line 9. This node has the two sub-circuits rooted at the two · nodes (Line 8); one containing [authors, Tova M.] and the other [authors, Slava N.]. When traversing the sub-circuit containing "Slava N.", the algorithm simply copies the subtree rooted at the authors node with the values from the circuit and arranges the nodes in the same order as the corresponding variable nodes were in T_A (Line 7), as they are all leaves on the same level. Those values will be attached under the LCA "of" (Lines 9 – 13). The sub-circuit of "Tova M." also has nested sub-circuits. Although the node paper appears before the nodes year and conference in the answer tree structure, the algorithm identifies that f extracted the values "VLDB", "SIGMOD" and "2014", so it changes their location so that they appear earlier in the final answer tree. Finally, recursive calls are made with the sub-circuit containing [authors, Tova M.]. Intuitively, "of" is indeed the root of a sub-tree specifying the authors in an institution in the structure of our answers.

(A) [TAU] · Size([Tova M.], [Slava N.]) · Size([VLDB], [SIGMOD]) · Size([Querying...], [Monitoring...], [OASSIS...], [A Sample...]) · Range([2006], [2007], [2014])
(B) [TAU] · ([Tova M.] · Size([VLDB], [SIGMOD]) · Size([Querying...], [Monitoring...], [OASSIS...], [A Sample...]) · Range([2006], [2007], [2014]) + [Slava N.] · [OASSIS...] · [SIGMOD] · [2014])

Figure 9: Summarized Factorizations

So far we have proposed a solution that factorizes multiple assignments, leading to a more concise answer. When there are many assignments and/or the assignments involve multiple distinct values, even an optimal factorized representation may be too long and convoluted for users to follow.
Example 5.1
Reconsider Example 4.11; if there are many authors from TAU, then even the compact representation of the result could be very long. In such cases we need to summarize the provenance in some way that preserves the "essence" of all assignments without actually specifying them, for instance by providing only the number of authors/papers for each institution.
To this end, we employ summarization, as follows. First, we note that a key to summarization is understanding which parts of the provenance may be grouped together. For that, we again use the mapping from nodes to query variables: we say that two nodes are of the same type if both were mapped to the same query variable. Now, let n be a node in the circuit form of a given factorization f. A summarization of the sub-circuit of n is obtained in two steps. First, we group the descendants of n according to their type. Then, we summarize each group separately. The latter is done in our implementation simply by either counting the number of distinct values in the group or by computing their range if the values are numeric. In general, one can easily adapt the solution to apply additional user-defined "summarization functions" such as "greater / smaller than X" (for numerical values) or "in continent Y" for countries.
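As an illustration, the following is a minimal Python sketch of this grouping-and-summarization step, assuming the leaves below a circuit node are given as (variable, value) pairs; the function name summarize and the output phrasing are ours, for exposition only.

from collections import defaultdict

def summarize(leaves):
    # Group the (variable, value) leaves by their type (the query variable
    # they were mapped to), then summarize each group: a range for numeric
    # values ("Range"), a distinct count otherwise ("Size").
    groups = defaultdict(list)
    for var, val in leaves:
        groups[var].append(val)
    summary = {}
    for var, vals in groups.items():
        if all(isinstance(v, (int, float)) for v in vals):
            summary[var] = f"{min(vals)} - {max(vals)}"
        else:
            summary[var] = f"{len(set(vals))} {var}s"
    return summary

# Summarizing the "TAU" sub-circuit at the level of authors (summarization "A"):
leaves = [("author", "Tova M."), ("author", "Slava N."),
          ("conference", "VLDB"), ("conference", "SIGMOD"),
          ("paper", "Querying..."), ("paper", "Monitoring..."),
          ("paper", "OASSIS..."), ("paper", "A sample..."),
          ("year", 2006), ("year", 2007), ("year", 2014)]
print(summarize(leaves))
# {'author': '2 authors', 'conference': '2 conferences',
#  'paper': '4 papers', 'year': '2006 - 2014'}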
Example 5.2
Re-consider the factorization f from Figure 7. We can summarize it at multiple levels: the highest level of authors (summarization "A"), the level of papers for each particular author (summarization "B"), the level of conferences, etc. Note that if we choose to summarize at some level, we must summarize its entire sub-circuit (e.g., if we summarize for Tova M. at the level of conferences, we cannot specify the paper titles and publication years).

(A) TAU is the organization of 2 authors who published 4 papers in 2 conferences in 2006 - 2014.
(B) TAU is the organization of Tova M. who published 4 papers in 2 conferences in 2006 - 2014 and Slava N. who published 'OASSIS...' in SIGMOD in 2014.

Figure 10: Summarized Sentences

Figure 9 presents the summarizations of sub-trees for the "TAU" answer, where "Size" is a summarization operator that counts the number of distinct values and "Range" is an operator over numeric values, summarizing them as their range. The summarized factorizations are further converted to NL sentences, which are shown in Figure 10. Summarizing at a higher level results in a shorter but less detailed summarization.
So far our solution has been limited to Conjunctive Queries; we now extend it to account for Unions thereof (UCQs). We next describe the necessary augmentations of the algorithms, illustrating them through examples. Recall that in the first step, the system takes a natural language query and translates it to a dependency tree, while maintaining the dependency-to-query mapping. The difference here is that a tree node can now be mapped to several variables. This implies a generalization of Definition 2.5:
Definition 6.1
Given a dependency tree T = (V, E, L) and a UCQ Q_1, . . . , Q_m, a dependency-to-UCQ-mapping is a set of dependency-to-query-mappings {τ_1, . . . , τ_m}, where τ_i : T → Q_i. Example 6.2
Consider the NL query "return the organization of authors who published papers in database conferences before 2005 or after 2015", whose dependency tree is depicted in Figure 11. The "or" here defines two different CQs (depicted in Figure 12). Since the two numerical values cannot form a conjunctive condition and thus cannot be compiled into a single boolean condition,
NaLIR translates this NL query into two CQs. The two CQs define two different dependency-to-query-mappings that map nodes from the single dependency tree to two different sets of variables. Consider an organization (e.g., TAU) which appears as an answer. It is mapped both to oname1 and to oname2. Thus, we would like to present the assignments to both queries as explanations. Here, the dependency-to-UCQ-mapping is {τ_1, τ_2}, where τ_1 maps the nodes from the dependency tree to the variables of the first query in Figure 12 and τ_2 maps the nodes from the dependency tree to the variables of the second query. Thus, the dependency-to-UCQ-mapping captures the assignments from both queries. Note that τ_1 differs from τ_2 for some words. Specifically, τ_1 maps the nodes "before" and "2005" to pyear1 < 2005, and τ_2 maps the nodes "after" and "2015" to pyear2 > 2015. We further give a unique integer identifier to each mapped word in the dependency tree, as exemplified by the superscripts in Figure 11, for reasons we explain in the sequel.

After determining the dependency-to-UCQ-mapping, we rely on an augmentation of Definition 2.8. The following definition essentially generalizes the definition for CQs by also summing the pairs of (word identifier, value) from all the CQs participating in the union.
Definition 6.3
Let A(Q, D) be the set of assignments for a UCQ Q = {Q_1, . . . , Q_m} and a database instance D, and let {τ_1, . . . , τ_m} be the dependency-to-UCQ-mapping of Q. We define the NL value-level provenance of Q w.r.t. D as

Σ_{Q_i ∈ Q} Σ_{α ∈ A(Q_i, D)} Π_{{x_i, a_i | α(x_i) = a_i}} (τ_i^{-1}(x_i), a_i).

Figure 11: Dependency Tree With "Or" Condition

query(oname1) :- org(oid1, oname1), author(aid1, aname1, oid1), pub(wid1, cid1, ptitle1, pyear1), conf(cid1, cname1), domainConf(cid1, did1), domain(did1, dname1), writes(aid1, wid1), dname1 = 'Databases', pyear1 < 2005

query(oname2) :- org(oid2, oname2), author(aid2, aname2, oid2), pub(wid2, cid2, ptitle2, pyear2), conf(cid2, cname2), domainConf(cid2, did2), domain(did2, dname2), writes(aid2, wid2), dname2 = 'Databases', pyear2 > 2015

Figure 12: Two CQs from the Same NL Query

Rel. org:        (oid, oname): (2, TAU)
Rel. author:     (aid, aname, oid): (4, Tova M., 2)
Rel. pub:        (wid, cid, ptitle, pyear): (6, 10, "Positive Active XML", 2004), (7, 11, "Rudolf...", 2016)
Rel. writes:     (aid, wid): (4, 6), (4, 7)
Rel. conf:       (cid, cname): (10, PODS), (11, VLDB)
Rel. domainConf: (cid, did): (10, 18), (11, 18)
Rel. domain:     (did, dname): (18, Databases)

Figure 13: DB Instance for Example 6.4
Example 6.4
Reconsider the UCQ defined by the union of the two CQs depicted in Figure 12 and the database in Figure 13, with tuples standing for two more publications by the author Tova Milo: "Positive Active XML" published in PODS in 2004 and "Rudolf: Interactive Rule Refinement System for Fraud Detection" published in VLDB in 2016. The first of the two summands in Figure 14 (in the "A" brackets) stands for an assignment to the top query in Figure 12, while the second summand (in the "B" brackets) stands for an assignment for the bottom query. Assignments are represented as multiplications of pairs of the form (id, val), so that id is the unique identifier of a word in the NL query mapped to the variable var in a specific query Q_i that is assigned val in the particular assignment.

(A) { (1, TAU) · (2, Tova M.) · (3, Positive Active XML) · (4, PODS) · (5, 04') } +
(B) { (1, TAU) · (2, Tova M.) · (3, Rudolf...) · (4, VLDB) · (6, 16') } + ...

Figure 14: Value-level Provenance for Example 6.4
We now have a polynomial containing sets of pairs where the first element is the unique word in the NL query and the second is the value from the database mapped to it. This allows us to consider explanations for the same answer regardless of the query from which they originated. By replacing the variable name in each pair with the unique word identifier from the NL query, we are able to treat the assignment of different variable names as relating to the same word or phrase in the NL query. This allows us to factorize the provenance of the different queries in the union in the context of a single NL query, for which we will build a single NL answer. Now, we can use the procedure described in Section 4 to produce a T-compatible factorization and summarization of the provenance. The only change needed in Algorithm 3 is to replace all nodes that form the logical "or" condition with the words mapped to them; in our example, this means replacing the subtrees rooted at "before" and "after" with the years from the provenance assignments.
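The effect of this identifier-based rewriting can be illustrated with a small Python sketch; the representation of assignments and the mapping tau_inverse below are toy stand-ins for our provenance structures, with word identifiers chosen as in Figure 14.

# tau_inverse maps each CQ's variables to word identifiers of the NL query;
# note that oname1 and oname2 collapse to the same identifier 1, while the
# two year conditions keep distinct identifiers 5 and 6.
tau_inverse = {
    "Q1": {"oname1": 1, "aname1": 2, "ptitle1": 3, "cname1": 4, "pyear1": 5},
    "Q2": {"oname2": 1, "aname2": 2, "ptitle2": 3, "cname2": 4, "pyear2": 6},
}

def rename(cq, assignment):
    # Replace each variable name by its word identifier, yielding one monomial.
    return [(tau_inverse[cq][var], val) for var, val in assignment.items()]

a1 = {"oname1": "TAU", "aname1": "Tova M.",
      "ptitle1": "Positive Active XML", "cname1": "PODS", "pyear1": 2004}
a2 = {"oname2": "TAU", "aname2": "Tova M.",
      "ptitle2": "Rudolf...", "cname2": "VLDB", "pyear2": 2016}

# The union's provenance polynomial is the sum of the renamed monomials; both
# now share the pairs (1, 'TAU') and (2, 'Tova M.'), so the factorization of
# Section 4 applies to them unchanged.
polynomial = [rename("Q1", a1), rename("Q2", a2)]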
NLProv is implemented in JAVA 8, extending NaLIR. Its web UI is built using HTML, CSS and JavaScript. It runs on Windows 8 and uses MySQL server as its underlying database management system (the source code is available in [34]). Figure 15a depicts the system architecture. First, the user enters a query in Natural Language. This NL sentence is fed to the augmented
NaLIR system, which interprets it and generates a formal query. This includes the following steps: a parser [58] generates the dependency tree for the NL query. Then, the nodes of the tree are mapped to attributes in the tables of the database and to functions, to form a formal query. In fact,
NaLIR may generate several candidate queries, from which it will choose the one that is ranked highest according to an internal ranking function. We use the highest ranked as the chosen query. As explained, to be able to translate the results and provenance to NL,
NLProv stores the mapping from the nodes of the dependency tree to the query variables. Once a query has been produced,
NLProv uses the
SelP system [26] to evaluate it while storing the provenance, keeping track of the mapping of dependency tree nodes to parts of the provenance. The provenance information is then factorized (see Algorithm 2) and the factorization is compiled to an NL answer (Algorithm 3) containing explanations. Finally, the factorized answer is shown to the user. If the answer contains excessive details and is too difficult to understand, the user may choose to view summarizations.
User Interface
We now discuss the user interface of
NLProv. First, the user writes a natural language question in the web interface. The question is input to the augmented
NaLIR box and converted to an SQL query, while storing the mapping from words to variables, and the query is evaluated over the database to obtain the query results. All results are then shown to the user, and each result can be further explored by viewing its natural language provenance, in each of the three forms described earlier: an explanation formed by a single assignment, an explanation which encapsulates all assignments as a factorized representation of the provenance, and a summarized explanation based on the factorization.
Our solution "marries", for the first time to our knowledge, two fields: (1) Natural Language Interfaces to Databases, and (2) Data Provenance. For each of these two, we have made choices in our implementation:
NaLIR for the NLIDB, as well as a semiring-like value-level provenance model. We next revisit these choices and discuss alternatives in detail.
As mentioned above,
NaLIR is a prominent interface for querying relational databases in Natural Language. Yet the problem of transforming an NL query into a formal query has been researched extensively, by both the database and NLP communities, and it includes a variety of different approaches for the solution. As the field keeps progressing, the question arises: how flexible is our approach of Natural Language Provenance with respect to NLIDB development? Namely, if an improved NLIDB is developed, can it be incorporated in our framework?

To address this question, we analyze our requirements and briefly discuss state-of-the-art algorithms for the problem of translating text to SQL or similar formal languages, reviewing their compatibility with these requirements and consequently the possibility of their binding to
NLProv. Further in-depth discussion of the works themselves appears in our review of related work in Section 10.

The DB community has been studying
Natural Language Interfaces to Databases for several decades. Many solutions focus on matching the query parts to the DB schema and inferring the SQL query based on these mappings, obtaining a matching from the natural language query to the DB schema in various ways, such as pattern matching, grammar matching, or intermediate representation languages (see Section 9 for more details). The NLP community has also extensively studied the translation of natural language questions to logical representations that query a knowledge base [79, 56, 11, 9]. One of the earliest statistical models for mapping text to SQL was the
PRECISE system [65, 64]; it was able to achieve high precision on a specific class of queries whose tokens could be linked to database values, attributes, and relations. However,
PRECISE did not attempt to generate SQL for questions outside this class. Later work considered generating queries based on relations extracted by a syntactic parser [33] and applying techniques from logical parsing research [63]. Recently there has been a flurry of work on generating SQL [78, 45, 80], typically applying Machine Learning methods such as seq2seq networks and reinforcement learning.
The NLProv architecture, as depicted in Figure 15a and explained previously, is coupled with an augmented version of
NaLIR in the following sense: we get from
NaLIR both its translation to a formal query and the dependency-to-query-mapping τ. These are the two essential factors for the operation of NLProv. That is,
NLProv can use any existing system that transforms a natural language question into a formal query and also returns a partial mapping from the dependency tree nodes to the query parts. Indeed, many of the other techniques for Natural Language Interface design mentioned above could be adapted to return τ and support our requirements. For example, PRECISE has a component called "matcher" that generates a mapping from tokens to database values, attributes, and relations. However, not all of the methods described above are designed in a way that allows generation of this mapping. E.g., a semantic parser that relies on a seq2seq Deep Neural Network may be unable to return this mapping. The DNN would be trained on a large corpus of natural language questions along with their relevant SQL queries, and its objective would be to generalize to new questions. Due to the network's complex representation, it may be hard to extract the desired mapping.

Figure 15: NLProv and General Architectures: (a) System Architecture; (b) Extended System Architecture

To this end, we propose an alternative architecture, depicted in Figure 15b. This architecture does not rely on the query builder to also generate the partial mapping τ from the dependency tree nodes to the query parts. Instead, we have added an additional block, Mapper, that receives as input the dependency tree along with the generated query and outputs the mapping τ. Note that generating the dependency tree may be done using existing tools such as the Stanford Parser, independently of whether the NLIDB generates it (as NaLIR does) or not (as is the case with semantic parsers).
Algorithm 4: Mapper
input: Dependency tree nodes V; Conjunctive Query Q; Similarity Threshold β
output: Partial Mapping τ
1   G.vertices := V ∪ VAR(Q);
2   G.edges := ∅;
3   foreach v ∈ V do
4       foreach q ∈ VAR(Q) do
5           if Sim(v, q) ≥ β then
6               e := (v, q);
7               e.weight := Sim(v, q);
8               G.edges := G.edges ∪ {e};
9   return MaximalMatching(G);

We then present Algorithm 4, responsible for the mapping generation. The algorithm is similar in spirit to the default mapping algorithms of NaLIR and
PRECISE, but could be used as a stand-alone component without these systems. It generates a bipartite graph, with the dependency tree nodes on one side and the query parts on the other. For each pair, the algorithm calculates the similarity between the two, and in case they are similar enough (the similarity is higher than the input constant β), an edge with the corresponding weight is generated. Eventually, the algorithm performs maximal matching and returns the mapping τ with the highest match score.

Note that the similarity threshold β balances between the mapping precision and recall. Low β values enable more edges in the bipartite graph, which results in higher recall. However, more edges may introduce noise, which in turn is harmful to the precision. For our use case it is crucial to have a mapping with high precision, hence high β values are used.
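A compact Python sketch of Algorithm 4 follows; for brevity a greedy weighted matching stands in for a true maximum matching, and sim is a deliberately crude placeholder for a real word-to-variable similarity function (e.g., one based on word embeddings or a lexicon).

def sim(word, variable):
    # Crude placeholder similarity: 1.0 on a (stemmed) substring match, else 0.
    w, v = word.lower().rstrip("s"), variable.lower()
    return 1.0 if w in v or v in w else 0.0

def mapper(tree_nodes, query_vars, beta):
    # Build the bipartite edges whose similarity is at least beta, then match
    # greedily by descending weight (a stand-in for MaximalMatching).
    edges = sorted(((sim(v, q), v, q)
                    for v in tree_nodes for q in query_vars
                    if sim(v, q) >= beta), reverse=True)
    tau, used_nodes, used_vars = {}, set(), set()
    for _, v, q in edges:
        if v not in used_nodes and q not in used_vars:
            tau[v] = q
            used_nodes.add(v)
            used_vars.add(q)
    return tau

# With a high beta, only confident matches survive (high precision, low recall):
print(mapper(["organization", "authors", "papers"],
             ["oname", "author_name", "paper_title"], beta=0.9))
# {'papers': 'paper_title', 'authors': 'author_name'}
# 'organization' stays unmapped here, echoing the precision/recall tradeoff above.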
Example 7.1
Recall our running example, and consider the two mapping functions presented in Figure 16. τ_1, depicted in the orange nodes, has high recall, as all of the relevant tree nodes are mapped to query parts; however, it does not have perfect precision, as the organization node is incorrectly mapped to aname and the authors node is mapped to oname. As a result, our answer will be: Tova M. is the organization of TAU who published 'OASSIS...' in SIGMOD in 2014
This answer makes no sense, and will cause the user to mistrust the answer and the system. On the other hand, τ_2, depicted in the green nodes, has perfect precision but low recall, as the papers and conferences nodes are not mapped to any variable; this will result in: TAU is the organization of Tova M. who published papers in database conferences in 2014
Even though the answer does not supply all relevant information, it is a coherent sentence, and clearly a better answer than the previous one.
Since the dependency tree can be artificially made by our system from the NL query, the only recommended component of an NLIDB is a mapping from words of the NL query to the parts of the formal query. Therefore, any NLIDB with such a component could work well with our system (e.g. PRECISE [65] and ATHENA [68]). But even this component can be replaced by Algorithm 4, which artificially generates such a mapping. If we do use this algorithm, any NLIDB can be fitted to the system.

Figure 16: Dependency-To-Query Partial Mappings
We have used a detailed value-level provenance model for UCQs, which we have leveraged to connect different pieces of the provenance with different parts of the NL question, eventually resulting in a detailed answer. We next briefly discuss alternatives and extensions.
Tuple-level provenance
Using value-level provenance has been essential in our construction of the NL representation of provenance, and consequently in the generation of answers with NL explanations. If the system is connected to a system that only allows coarser-grained provenance, then the mapping between words and values needs to be constructed otherwise. Namely, one could use provenance that is only at a tuple level, as is typically done in provenance models (including the standard semiring model [38], why-provenance [14], lineage [66], etc.). Then, considering all values in the tuples participating in the provenance, we can reconstruct the mapping to NL query words by alternative means, such as word embeddings and semantic similarity.
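A minimal Python sketch of such a reconstruction follows, under the assumption that provenance tuples are available as attribute-to-value dictionaries; the character-level similarity and the attribute names are naive placeholders for word embeddings or semantic similarity.

def similarity(a, b):
    # Jaccard similarity over character sets; a crude placeholder for word
    # embeddings or semantic similarity.
    a, b = set(a.lower()), set(b.lower())
    return len(a & b) / len(a | b)

def reconstruct_mapping(query_words, provenance_tuples, threshold=0.5):
    # Map each NL query word to the value (over all participating tuples)
    # whose attribute name is most similar to the word.
    mapping = {}
    for word in query_words:
        candidates = [(similarity(word, attr), val)
                      for tup in provenance_tuples
                      for attr, val in tup.items()]
        score, val = max(candidates, key=lambda c: c[0])
        if score >= threshold:
            mapping[word] = val
    return mapping

tuples = [{"organization_name": "TAU", "author_name": "Tova M."},
          {"conference_name": "SIGMOD", "publication_year": 2014}]
print(reconstruct_mapping(["organization", "authors", "conferences"], tuples))
# {'organization': 'TAU', 'authors': 'Tova M.', 'conferences': 'SIGMOD'}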
Operator-level provenance
An even coarser-grained view of the provenance is at the operator level. For instance, in the context of relational queries one may consider tracking the tuples that are the input and output of each operator in the query plan, while not necessarily keeping track of which input contributed to which output. Can such a form of provenance be useful in our setting?

One use-case of marrying operator-level provenance with NL queries is in the context of provenance for non-answers. Examining the set of input and output tuples of each operator in the query plan, the work of [15] defines the notion of a "picky" operator with respect to a tuple, as one that is responsible for its omission from the output. This opens up possibilities for explaining non-answers. In a recent preliminary work [25] we have combined
NaLIR's mapping of words to query operators with the work of [15] to identify the words that map to picky operators. Then, for each requested non-answer, we can highlight this word as "responsible".
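The following toy Python sketch conveys the idea in the spirit of [15], under strong simplifying assumptions: the plan is a linear sequence of filtering operators, and an oracle predicate tells which tuples are relevant to the requested non-answer (cf. Example 7.2 below).

def frontier_picky(plan, tuples, relevant):
    # Return the first operator whose input still contains relevant tuples
    # but whose output contains none: the frontier picky operator.
    current = tuples
    for name, predicate in plan:
        survivors = [t for t in current if predicate(t)]
        if any(map(relevant, current)) and not any(map(relevant, survivors)):
            return name
        current = survivors
    return None

plan = [("sigma_dname=Databases", lambda t: t["dname"] == "Databases"),
        ("sigma_pyear>2005", lambda t: t["pyear"] > 2005)]
tuples = [{"oname": "TAU", "dname": "Databases", "pyear": 2004}]
print(frontier_picky(plan, tuples, relevant=lambda t: t["oname"] == "TAU"))
# sigma_pyear>2005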
Example 7.2
Reconsider our running example, but this time assume it is executed on a smaller, dirty DB as depicted in Figure 17, where the papers "OASSIS..." and "A sample..." are erroneously associated with the publication year 2004 instead of 2014. Due to the errors in the database, "TAU" will not be returned as an answer to the query, and a user who expects to see "TAU" in the results screen will be interested to understand the reason for its absence (and fix the database accordingly). Consider the query evaluation plan in Figure 18, for the query in Figure 1a. The frontier picky operator for "TAU" is σ_{pyear>2005}, thus the system depicted in [25] will highlight the relevant part in the NL query and return: return the organization of authors who published papers in database conferences after 2005, indicating that "TAU" is a non-answer because the authors associated with it did not publish papers after 2005.
Similarly to the approach discussed here, the system utilizes the mappings from words to operators constructed by the NLIDB, and highlights the relevant term which filtered the queried tuple. A challenge arises when the filtering operator has no direct word or phrase in the NL query mapped to it.
Example 7.3
Continuing Example 7.2, if the query was about an organization whose authors did not publish in any database conference after 2005, the filtering operator would have been the join between the author and writes tables. Since there is no direct mapping between a word in the NL query and the join operator, it is unclear which word/phrase to highlight.
Provenance Beyond UCQs
Our work is limited to the SPJU fragment of SQL (UCQs), while NLIDBs have considerable success in handling questions that compile to far more expressive formalisms.
NaLIR in particular also supports nesting and aggregation, both lacking support in our solution. Provenance models for such formalisms do exist, from [6] and [62] for aggregate queries, [53] for nested queries, and [54] for queries with negation, to [35] for full SQL, to name just a few examples. By and large, these solutions are intended as internal representations. Presenting them as explanations is an important task that lacks satisfactory solutions. For instance, the model of [6] for aggregate results includes in a sense a record of all tuples participating in the aggregate computation, which may be far too many to show to the user. The work of [62] discusses a factorized circuit-based form, yet it is also too complicated to allow its presentation to a user. We view devising an effective way of showing such provenance instances to users – e.g. summarizing the contribution of individual tuples in aggregate queries – as an important goal for future work.

Rel. org:        (oid, oname): (1, UPENN), (2, TAU)
Rel. author:     (aid, aname, oid): (3, Susan D., 1), (4, Tova M., 2), (5, Slava N., 2)
Rel. pub:        (wid, cid, ptitle, pyear): (6, 10, "OASSIS...", 2004), (7, 10, "A sample...", 2004)
Rel. writes:     (aid, wid): (4, 6), (3, 6), (5, 6), (4, 7)
Rel. conf:       (cid, cname): (10, SIGMOD)
Rel. domainConf: (cid, did): (10, 18)
Rel. domain:     (did, dname): (18, Databases)

Figure 17: Faulty DB Instance

Figure 18: Query Plan with Frontier Picky
We have performed an experimental study to assess
NLProv through two prisms: (1) the quality of answers produced by the system, and (2) the efficiency of the algorithms in terms of execution time.
We have examined the usefulness of the system through a user study involving 22 non-expert users. The user study was conducted in two phases: in the first, we asked 15 users to evaluate the solution for SPJ queries, while in the second phase 7 different users were requested to evaluate the solution for union queries. For the SPJ evaluation we presented each user with 6 NL queries, namely No. 1–4, 6, and 7 from Table 1 (chosen as a representative sample), while for the union evaluation users were presented with queries 13–15. We have also allowed each user to freely formulate an NL query of her choice, related to the MAS database [1]. Two users did not provide a query at all, and for 5 users the query either did not parse well or involved aggregation (which is not supported), leading to a total of 119 successfully performed tasks. For each of the NL queries, users were shown the NL provenance computed by
NLProv for cases of single derivations, and factorized and summarized answers for multiple derivations (where applicable). Multiple derivations were relevant in 71 of the 119 cases. Examples of the results are shown in Table 2. We have asked users three questions about each case, asking them to rank the results on a 1–5 scale where 1 is the lowest score: (1) is the answer relevant to the NL query? (2) is the answer understandable? and (3) is the answer detailed enough, i.e. does it supply all relevant information? (asked only for answers including multiple assignments).

Table 1: NL queries
1. Return the homepage of SIGMOD
2. Return the papers whose title contains 'OASSIS'
3. Return the papers which were published in conferences in database area
4. Return the authors who published papers in SIGMOD after 2005
5. Return the authors who published papers in SIGMOD before 2015 and after 2005
6. Return the authors who published papers in database conferences
7. Return the organization of authors who published papers in database conferences after 2005
8. Return the authors from TAU who published papers in VLDB
9. Return the area of conferences
10. Return the authors who published papers in database conferences after 2005
11. Return the conferences that presented papers published in 2005 by authors from organization
12. Return the years of paper published by authors from IBM
13. Return the authors who published papers in VLDB or SIGMOD after 2005
14. Return the authors from TAU or HUJI who published papers in VLDB or SIGMOD
15. Return the papers published by authors from TAU or HUJI
Table 2: Sample use-cases and results (columns: Query; Single Assignment; Multiple Assignments - Summarized)
The results of our user study are summarized in Figure 19. In all cases, the user scores were in the range 3–5, with the summarized explanation receiving the highest scores on all accounts. Note in particular the difference in understandability score, where summarized sentences ranked as significantly more understandable than their factorized counterparts. Somewhat surprisingly, summarized sentences were even deemed by users as being more detailed than factorized ones (although technically they are of course less detailed), which may be explained by their better clarity (users who ranked a result lower on understandability have also tended to rank it low w.r.t. level of detail).

SPJ Queries (number of tasks ranked 3 / 4 / 5)
             Category         3    4    5
Single       Relevant         4   10   84
             Understandable   7   25   66
Factorized   Relevant         0    7   43
             Understandable   4   13   33
             Detailed         3    7   40
Summarized   Relevant         2    2   46
             Understandable   3    3   44
             Detailed         2    5   43

Figure 19: Users ranking
Another facet of our experimental study includes runtime experiments to examine the scalability of our algorithms. Here again we have used the MAS database, whose total size is 4.7 GB, and queries No. 1–15 from Table 1, running the algorithm to generate NL provenance for each individual answer. The experiments were performed on an i7 processor and 32GB RAM with Windows 8. As expected, when the provenance includes a single assignment per answer, the runtime is negligible (this is the case for queries No. 1–3). We thus show the results only for queries No. 4–15.
Table 3: Computation time (sec.) for the MAS database (columns: Query; Query Eval. Time; Fact. Time; Sentence Gen. Time; NLProv Time)
Table 3 includes, for each query, the runtime required by our algorithms to transform provenance to NL in factorized or summarized form, for all query results (as explained in Section 4, we can compute the factorizations independently for each query result). We show a breakdown of the execution time of our solution: factorization time, sentence generation time, and total time incurred by
NLProv (we note that the time to compute summarizations given a factorization was negligible). For an indication of the complexity level of the queries, we also report the time incurred by standard (provenance-oblivious) query evaluation, using the MySQL engine. We note that our algorithms perform quite well for all queries (overall
NLProv execution has a 15% overhead), even for fairly complex ones such as queries 7, 11, and 12.

Figure 20a presents the execution time of NL provenance computation for an increasing number of assignments per answer (up to 5000; note that the maximal number in the real data experiments was 4208). The provenance used for this set of experiments was such that the only shared value in all assignments was the result value, so the factorization phase is negligible in terms of execution time, taking only about one tenth of the total runtime in the multiple assignments case. Most computation time here is incurred by the answer tree structuring. We observe that the computation time increased moderately as a function of the number of assignments (and is negligible for the case of a single assignment). The execution time for 5K assignments with unique values was 1.5, 2, 1.9, 4.9, 0.006, 0.003, 2.6, 5.3, 3.7, 3.5, 5.7, and 3.7 seconds resp. for queries 4–15. Summarization time was negligible, less than 0.1 seconds in all cases.

Figure 20: Results for synthetic data. (a) Computation time as a function of the number of assignments; (b) computation time as a function of the number of unique values; (c) factorization size as a function of the number of unique values

Figure 21: Breakdown for synthetic experiments. (a) Factorization time; (b) sentence generation time
For the second set of experiments, we have fixed the number of assignments per answer at the maximum 5K and changed only the domain of unique values from which provenance expressions were generated. The domain size per answer, per query variable, varies from 0 to 5000 (it cannot exceed the number of assignments). Note that the running time increases as a function of the number of unique values: when there are more unique values, there are more candidates for factorization (so the number of steps of the factorization algorithm increases), each factorization step is in general less effective (as there are more unique values for a fixed size of provenance, i.e. the degree of value sharing across assignments decreases), and consequently the resulting factorized expression is larger, leading to a larger overhead for sentence generation. Indeed, as our breakdown analysis (Figure 21) shows, the increase in running time occurs both in the factorization and in the sentence generation time. Finally, Figure 20c shows the expected increase in the factorized expression size w.r.t. the number of unique values.

Figure 22: Computation time as a function of (a) NL query length, (b) depth of the NL query dependency tree, (c) number of query selection operations, (d) number of query join operations, and (e) number of provenance attributes

For the third scalability experiment we evaluated the computation time for different classes of queries. In this experiment we have used a larger set of queries, consisting of 45 different NL queries (available in [34]) which vary in their size, structure, and complexity. For each query we have fixed both the number of assignments per answer and the domain of unique values at 5K. NLProv computation time varied between 0.005 and 4.69 seconds, where the mean and median computation times were 1.73 and 1.… seconds. Figures 22a and 22b present the computation time as a function of the NL query length and the dependency tree depth, with Pearson correlations of 0.74 and 0.78 respectively. The impact of the formal query complexity on NLProv running time is presented in Figures 22c and 22d; notice that the query complexity influences the query evaluation time, but does not have a direct impact on the explanation generation, hence a lower correlation was measured (0.55 and 0.64 for the number of selection and join operations, respectively). Finally, Figure 22e depicts the impact of provenance size on the computation time; the number of provenance attributes is crucial for the factorization step, hence it is influential on Algorithm 3's running time, as exhibited by the 0.71 correlation.
In this section we review and compare our work to existing approaches in the context of database theory and database interfaces.

Provenance
The tracking, storage and presentation of provenance have been the subject of extensive research in the context of database queries, scientific workflows, and others (see e.g. [14, 41, 38, 17, 39, 22, 21, 37, 48]), while the field of provenance applications has also been broadly studied (e.g. [26, 59, 67]). A longstanding challenge in this context is the complexity of provenance expressions, leading to difficulties in presenting them in a user-comprehensible manner. Approaches in this respect include showing the provenance in a graph form [74, 60, 43, 31, 22, 20, 4], allowing user control over the level of granularity ("zooming" in and out [19]), or otherwise presenting different ways of provenance visualization [41]. Other works have studied allowing users to query the provenance (e.g. [47, 44]) or to a-priori request that only parts of the provenance are tracked (see for example [26, 35, 36]). Importantly, provenance factorization and summarization have been studied (e.g., [16, 8, 62, 66]) as means for compact representation of the provenance. Usually, the solutions proposed in these works aim at reducing the size of the provenance but naturally do not account for its presentation in NL; we have highlighted the different considerations in the context of factorization/summarization in our setting. We note that value-level provenance was studied in [61, 18] to achieve a fine-grained understanding of the data lineage, but again these works do not translate the provenance to NL.

Detailed Answers to Keyword Queries
There is an extensive line of work on answering keyword queries which focuses on providing not just the query answer (tuples that contain the queried value), but also comprehensive details about it. Works such as [3, 12, 42] focus on answering keyword queries over a relational database, by outputting tuples that are related to one or more of the queried keywords. In particular, [50, 73] study the subject of précis queries over relational databases. These queries are logical combinations of keywords. The query, along with constraints on the schema, is input to the system, and the returned answer should include the tuples most relevant to the keyword(s), according to the constraints, as relations that form a logical subset of the original database (i.e., containing not only items directly related to the given query terms but also items implicitly related to them). Still in the field of answering keyword queries, [28, 30, 29] deal with snippets of a database object, which is an entity that has its identity in the result tuple. In this scenario, the system provides a snippet which is a summary of the relevant information related to these objects. This information is taken from tuples that relate to the queried object's tuple and is prioritized in different manners (e.g., diversity and proportionality). All of these works, similarly to ours, provide the query answer along with further details about it; these details stem from tuples that relate in some defined manner to the answer. While there is a commonality between these works and ours, our work supports complex CQs formulated in NL, as opposed to keyword queries. Answers to keyword queries are not always explicitly specified in the query; e.g., we can ask about an author and get the name of her organization, or get tuples that contain this author from different tables. For CQs, users explicitly specify the form of answer they would like, from which relation they would like it, and what conditions it has to satisfy. Additionally, our work defines the related tuples by their membership in the provenance of the result, i.e., the query structure (formulated by the user) is a major factor in determining which tuples will be included in the explanation. Furthermore, our system generates the NL explanation based on the NL query given by the user, and not on a textual template.
Summarization of Database Content
There have been previous works that proposed a summarized presentation of query results. In the context of keyword queries, the approach of [30] gives short summaries of the information regarding a data object by limiting the number of related tuples (according to the schema), showing only the highest ranked. The summary is represented as a tree where the root is a tuple containing the keywords and the neighboring tuples are the related ones. In the context of top aggregate queries, [77] presents an approach that summarizes the results using clustering of the top-ranked results, by formulating the clustering problem as an optimization problem. The framework of this work is interactive and allows users to choose the number of clusters and other parameters. After observing the results, users can update the parameters and get different results. The summarization aims to serve as an overview of all query results through a summarized relation. Another related approach is [46], which devised the smart drill-down operator. The operator allows users to obtain interesting summarizations of the tuples in the relation. The summary is essentially the top-k clusters, according to a goal function, of tuples with don't-care values. Similarly to these approaches, we focus on tuples with the same or similar values. The SaintEtiQ system [69] summarizes entire database relations using background knowledge with a vocabulary to translate raw tuple values. Summarization is done using a clustering approach. Like our system, there is a notion of getting a more detailed and precise summary containing more information, and a less detailed one which is more compact. Provenance summarization has also been proposed by [5], yet it offers an approximated summary of the provenance based on distance, semantic constraints and size, with a possible loss of information. Our summarization technique compacts all the tuples in the provenance, through functions like SUM and RANGE, as opposed to showing/summarizing a few representative tuples. We base this summarization on the factorization of the provenance, done as an initial step. Furthermore, we focus here on UCQs and do not cover aggregate queries. Finally, we do not rely on background knowledge, a vocabulary, or other external constraints to summarize, but rather use the provenance factorization. Additionally, we translate the summarization into NL, which is geared towards non-expert users.
NL Interfaces
Multiple lines of work (e.g. [52, 7, 55, 76, 75, 3, 65]) have proposed NL interfaces for the formulation of database queries, and additional works [32] have focused on presenting the answers in NL, typically basing their translation on the schema of the output relation. Among these, works such as [7, 55] also harness the dependency tree in order to make the translation from NL to SQL by employing mappings from the NL query to formal terms. The work of [51] has focused on the complementary problem of translating SQL queries (rather than their answers or provenance) to NL. Another work has devised an interactive chatbot interface to drill down and zoom in on a specific part of the database which the user is interested in [70]. This work helps guide the user with NL but does not show the query answers and their explanations in natural language. In the context of answering keyword queries, [71] shows an approach that presents the results of précis queries as a narrative text so that the output is more user friendly. To do so, there is a need for predefined textual templates to embed the relevant tuples in. The templates are predefined by a designer or the administrator of the database. Synthesizing text directly from databases has also been explored in [72], which extended [71]. This work revolves around the generation of textual representations for database subsets. The text is generated based on templates rather than on user-formulated queries in NL. Moreover, the explanation is composed of tuples from related database tables, as opposed to tuples from the provenance, which provide a targeted explanation tailored to the specific details provided by a user in an NL query. To our knowledge, no previous work has focused on formulating the provenance of output tuples in NL. This requires fundamentally different techniques (e.g. that of factorization and summarization, building the sentence based on the input question structure, etc.) and leads to answers of much greater detail.
10 Conclusion
We have studied in this paper, for the first time to our knowledge, provenance for NL queries. We have devised a novel model of "word-to-provenance" mapping, thereby leveraging the structure of the original NL question for the generation of a new NL sentence that captures the answers along with their provenance-based explanations. Since there may be many explanations, even for a single answer, we have developed factorization and summarization techniques that are geared towards sentence generation, showing that they result in new criteria for preferring one factorized/summarized form over another. We have implemented the approach and demonstrated its effectiveness through use cases and experiments.

Our work presented a simple yet effective approach for generating NL explanations based on the user's NL query. We have demonstrated that by applying basic transformations on the original question we are able to get understandable and relevant NL explanations. Usage of more advanced Natural Language Generation techniques can further improve the explanation quality; this is an interesting direction for future work. Our implementation is based on a particular NL interface to databases and on a particular provenance model for UCQs, but we have also discussed at some depth the extension of our solution beyond these settings. This discussion provides an indication of the generic nature of the approach, but further research is required to fully realize its potential in these other settings. In particular, we believe that the need to handle more complex queries with nesting, aggregation etc. may lead to new and exciting research avenues.
Acknowledgements
This research was partially supported by the Israeli Science Foundation (ISF) and by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (Grant agreement No. 804302).
References

[1] MAS. http://academic.research.microsoft.com.
[2] S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases: The Logical Level. Addison-Wesley Longman Publishing Co., Inc., 1995.
[3] S. Agrawal, S. Chaudhuri, and G. Das. DBXplorer: A system for keyword-based search over relational databases. In ICDE, pages 5–16, 2002.
[4] A. Ailamaki, Y. E. Ioannidis, and M. Livny. Scientific workflow management by database management. In SSDBM, pages 190–199, 1998.
[5] E. Ainy, P. Bourhis, S. B. Davidson, D. Deutch, and T. Milo. Approximated summarization of data provenance. In CIKM, pages 483–492, 2015.
[6] Y. Amsterdamer, S. B. Davidson, D. Deutch, T. Milo, J. Stoyanovich, and V. Tannen. Putting lipstick on pig: Enabling database-style workflow provenance. Proc. VLDB Endow., 2011.
[7] Y. Amsterdamer, A. Kukliansky, and T. Milo. A natural language interface for querying general and individual knowledge. VLDB, pages 1430–1441, 2015.
[8] N. Bakibayev, D. Olteanu, and J. Zavodny. FDB: A query engine for factorised relational databases. PVLDB, pages 1232–1243, 2012.
[9] I. Beltagy, K. Erk, and R. Mooney. Semantic parsing using distributional semantics and probabilistic logic. In Proceedings of the ACL 2014 Workshop on Semantic Parsing, pages 7–11, 2014.
[10] O. Benjelloun, A. Sarma, A. Halevy, M. Theobald, and J. Widom. Databases with uncertainty and lineage. VLDB J., 2008.
[11] J. Berant and P. Liang. Semantic parsing via paraphrasing. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1415–1425, 2014.
[12] G. Bhalotia, A. Hulgeri, C. Nakhe, S. Chakrabarti, and S. Sudarshan. Keyword searching and browsing in databases using BANKS. In ICDE, pages 431–440, 2002.
[13] P. Bürgisser, M. Clausen, and M. A. Shokrollahi. Algebraic Complexity Theory. Springer Publishing Company, Incorporated, 2010.
[14] P. Buneman, S. Khanna, and W.-C. Tan. Why and where: A characterization of data provenance. In ICDT, pages 316–330, 2001.
[15] A. Chapman and H. V. Jagadish. Why not? In SIGMOD, pages 523–534, 2009.
[16] A. P. Chapman, H. V. Jagadish, and P. Ramanan. Efficient provenance storage. In SIGMOD, pages 993–1006, 2008.
[17] J. Cheney, L. Chiticariu, and W. C. Tan. Provenance in databases: Why, how, and where. Foundations and Trends in Databases, pages 379–474, 2009.
[18] L. Chiticariu, W. C. Tan, and G. Vijayvargiya. DBNotes: a post-it system for relational databases based on provenance. In SIGMOD, pages 942–944, 2005.
[19] S. Cohen-Boulakia, O. Biton, S. Cohen, and S. Davidson. Addressing the provenance challenge using ZOOM. Concurr. Comput.: Pract. Exper., pages 497–506, 2008.
[20] D. Cohn and R. Hull. Business artifacts: A data-centric approach to modeling business operations and processes. IEEE Data Eng. Bull., pages 3–9, 2009.
[21] S. B. Davidson, S. C. Boulakia, A. Eyal, B. Ludäscher, T. M. McPhillips, S. Bowers, M. K. Anand, and J. Freire. Provenance in scientific workflow systems. IEEE Data Eng. Bull., pages 44–50, 2007.
[22] S. B. Davidson and J. Freire. Provenance and scientific workflows: challenges and opportunities. In SIGMOD, pages 1345–1350, 2008.
[23] D. Deutch, N. Frost, and A. Gilad. NLProv: Natural language provenance. Proc. VLDB Endow., pages 1900–1903, 2016.
[24] D. Deutch, N. Frost, and A. Gilad. Provenance for natural language queries. PVLDB, 10(5):577–588, 2017.
[25] D. Deutch, N. Frost, A. Gilad, and T. Haimovich. NLProveNAns: natural language provenance for non-answers. Proceedings of the VLDB Endowment, 11(12):1986–1989, 2018.
[26] D. Deutch, A. Gilad, and Y. Moskovitch. Selective provenance for datalog programs using top-k queries. PVLDB, pages 1394–1405, 2015.
[27] K. Elbassioni, K. Makino, and I. Rauf. On the readability of monotone boolean formulae. JoCO, pages 293–304, 2011.
[28] G. J. Fakas. Automated generation of object summaries from relational databases: A novel keyword searching paradigm. In ICDE, pages 564–567, 2008.
[29] G. J. Fakas, Z. Cai, and N. Mamoulis. Versatile size-l object summaries for relational keyword search. IEEE Trans. on Knowl. and Data Eng., 26(4):1026–1038, 2014.
[30] G. J. Fakas, Z. Cai, and N. Mamoulis. Diverse and proportional size-l object summaries using pairwise relevance. VLDB J., 25(6):791–816, 2016.
[31] I. Foster, J. Vockler, M. Wilde, and A. Zhao. Chimera: A virtual data system for representing, querying, and automating data derivation. SSDBM, pages 37–46, 2002.
[32] E. Franconi, C. Gardent, X. I. Juarez-Castro, and L. Perez-Beltrachini. Quelo Natural Language Interface: Generating queries and answer descriptions. In Natural Language Interfaces for Web of Data, 2014.
[33] A. Giordani and A. Moschitti. Translating questions to SQL queries with generative parsers discriminatively reranked. Proceedings of COLING 2012: Posters, pages 401–410, 2012.
[34] https://github.com/navefr/NL_Provenance/.
[35] B. Glavic. Big data provenance: Challenges and implications for benchmarking. In Specifying Big Data Benchmarks - First Workshop, WBDB, pages 72–80, 2012.
[36] B. Glavic and G. Alonso. Perm: Processing provenance and data on the same data model through query rewriting. In ICDE, pages 174–185, 2009.
[37] B. Glavic, R. J. Miller, and G. Alonso. Using SQL for efficient generation and querying of provenance information. In In Search of Elegance in the Theory and Practice of Computation, pages 291–320. Springer, 2013.
[38] T. Green, G. Karvounarakis, and V. Tannen. Provenance semirings. In PODS, pages 31–40, 2007.
[39] T. J. Green. Containment of conjunctive queries on annotated relations. In ICDT, pages 296–309, 2009.
[40] E. Hemaspaandra and H. Schnoor. Minimization for generalized boolean formulas. In IJCAI, pages 566–571, 2011.
[41] M. Herschel and M. Hlawatsch. Provenance: On and behind the screens. In SIGMOD, pages 2213–2217, 2016.
[42] V. Hristidis and Y. Papakonstantinou. DISCOVER: keyword search in relational databases. In VLDB, pages 670–681, 2002.
[43] D. Hull et al. Taverna: a tool for building and running workflows of services. Nucleic Acids Res., pages 729–732, 2006.
[44] Z. G. Ives, A. Haeberlen, T. Feng, and W. Gatterbauer. Querying provenance for ranking and recommending. In TaPP, pages 9–9, 2012.
[45] S. Iyer, I. Konstas, A. Cheung, J. Krishnamurthy, and L. Zettlemoyer. Learning a neural semantic parser from user feedback. arXiv preprint arXiv:1704.08760, 2017.
[46] M. Joglekar, H. Garcia-Molina, and A. G. Parameswaran. Smart drill-down: A new data exploration operator. PVLDB, 8(12):1928–1931, 2015.
[47] G. Karvounarakis, Z. G. Ives, and V. Tannen. Querying data provenance. In SIGMOD, pages 951–962, 2010.
[48] B. Kenig, A. Gal, and O. Strichman. A new class of lineage expressions over probabilistic databases computable in P-time. In SUM, pages 219–232, 2013.
[49] D. Klein and C. D. Manning. Accurate unlexicalized parsing. In Annual Meeting on Association for Computational Linguistics, pages 423–430, 2003.
[50] G. Koutrika, A. Simitsis, and Y. E. Ioannidis. Précis: The essence of a query answer. In ICDE, pages 69–78, 2006.
[51] G. Koutrika, A. Simitsis, and Y. E. Ioannidis. Explaining structured queries in natural language. In ICDE, pages 333–344, 2010.
[52] D. Küpper, M. Storbel, and D. Rösner. NAUDA: A cooperative natural language interface to relational databases. SIGMOD, pages 529–533, 1993.
[53] N. Kwasnikowska and J. V. den Bussche. Mapping the NRC dataflow model to the open provenance model. In IPAW, pages 3–16, 2008.
[54] S. Lee, S. Köhler, B. Ludäscher, and B. Glavic. A SQL-middleware unifying why and why-not provenance for first-order queries. In ICDE, pages 485–496, 2017.
[55] F. Li and H. V. Jagadish. Constructing an interactive natural language interface for relational databases. Proc. VLDB Endow., pages 73–84, 2014.
[56] P. Liang, M. I. Jordan, and D. Klein. Learning dependency-based compositional semantics. Computational Linguistics, 39(2):389–446, 2013.
[57] M. P. Marcus, M. A. Marcinkiewicz, and B. Santorini. Building a large annotated corpus of English: The Penn Treebank. Comput. Linguist., pages 313–330, 1993.
[58] M. de Marneffe, B. MacCartney, and C. Manning. Generating typed dependency parses from phrase structure parses. In LREC, pages 449–454, 2006.
[59] A. Meliou, Y. Song, and D. Suciu. Tiresias: a demonstration of how-to queries. In SIGMOD, pages 709–712, 2012.
[60] P. Missier, N. W. Paton, and K. Belhajjame. Fine-grained and efficient lineage querying of collection-based workflow provenance. In EDBT, pages 299–310, 2010.
[61] T. Müller and T. Grust. Provenance for SQL through abstract interpretation: Value-less, but worthwhile. PVLDB, pages 1872–1875, 2015.
[62] D. Olteanu and J. Závodný. Factorised representations of query results: Size bounds and readability. In ICDT, pages 285–298, 2012.
[63] H. Poon. Grounded unsupervised semantic parsing. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 933–943, 2013.
[64] A.-M. Popescu, A. Armanasu, O. Etzioni, D. Ko, and A. Yates. Modern natural language interfaces to databases: Composing statistical parsing with semantic tractability. In Proceedings of the 20th International Conference on Computational Linguistics, page 141, 2004.
[65] A.-M. Popescu, O. Etzioni, and H. Kautz. Towards a theory of natural language interfaces to databases. In IUI, pages 149–157, 2003.
[66] C. Ré and D. Suciu. Approximate lineage for probabilistic databases. Proc. VLDB Endow., pages 797–808, 2008.
[67] S. Roy and D. Suciu. A formal approach to finding explanations for database queries. In SIGMOD, pages 1579–1590, 2014.
[68] D. Saha, A. Floratou, K. Sankaranarayanan, U. F. Minhas, A. R. Mittal, and F. Özcan. ATHENA: an ontology-driven system for natural language querying over relational data stores. PVLDB, 9(12):1209–1220, 2016.
[69] R. Saint-Paul, G. Raschia, and N. Mouaddib. Database summarization: The SaintEtiQ system. In ICDE, pages 1475–1476, 2007.
[70] T. Sellam and M. L. Kersten. Have a chat with Clustine, conversational engine to query large tables. In HILDA, page 2, 2016.
[71] A. Simitsis and G. Koutrika. Comprehensible answers to précis queries. In CAiSE, pages 142–156, 2006.
[72] A. Simitsis, G. Koutrika, Y. Alexandrakis, and Y. E. Ioannidis. Synthesizing structured text from logical database subsets. In EDBT, pages 428–439, 2008.
[73] A. Simitsis, G. Koutrika, and Y. E. Ioannidis. Précis: from unstructured keywords as queries to structured databases as answers. VLDB J., 17(1):117–149, 2008.
[74] Y. L. Simmhan, B. Plale, and D. Gannon. Karma2: Provenance management for data-driven workflows. Int. J. Web Service Res., pages 1–22, 2008.
[75] D. Song, F. Schilder, and C. Smiley. Natural language question answering and analytics for diverse and interlinked datasets. In NAACL, pages 101–105, 2015.
[76] D. Song, F. Schilder, C. Smiley, C. Brew, T. Zielund, H. Bretz, R. Martin, C. Dale, J. Duprey, T. Miller, and J. Harrison. TR Discover: A natural language interface for querying and analyzing interlinked datasets. In ISWC, pages 21–37, 2015.
[77] Y. Wen, X. Zhu, S. Roy, and J. Yang. Interactive summarization and exploration of top aggregate query answers. PVLDB, 11(13):2196–2208, 2018.
[78] S. W.-t. Yih, M.-W. Chang, X. He, and J. Gao. Semantic parsing via staged query graph generation: Question answering with knowledge base. In ACL, 2015.
[79] L. S. Zettlemoyer and M. Collins. Learning to map sentences to logical form: Structured classification with probabilistic categorial grammars. arXiv preprint arXiv:1207.1420, 2012.
[80] V. Zhong, C. Xiong, and R. Socher. Seq2SQL: Generating structured queries from natural language using reinforcement learning. arXiv preprint arXiv:1709.00103, 2017.