Approximate Summaries for Why and Why-not Provenance (Extended Version)
Seokki Lee
Illinois Institute of Technology
[email protected]

Bertram Ludäscher
University of Illinois, Urbana-Champaign
[email protected]

Boris Glavic
Illinois Institute of Technology
[email protected]
ABSTRACT
Why and why-not provenance have been studied extensively in recent years. However, why-not provenance, and to a lesser degree why provenance, can be very large, resulting in severe scalability and usability challenges. We introduce a novel approximate summarization technique for provenance to address these challenges. Our approach uses patterns to encode why and why-not provenance concisely. We develop techniques for efficiently computing provenance summaries that balance informativeness, conciseness, and completeness. To achieve scalability, we integrate sampling techniques into provenance capture and summarization. Our approach is the first to both scale to large datasets and generate comprehensive and meaningful summaries.
PVLDB Reference Format:
Seokki Lee, Bertram Ludäscher, Boris Glavic. Approximate Summaries for Why and Why-not Provenance.
PVLDB, 13(6): xxxx-yyyy, 2020.
DOI: https://doi.org/10.14778/3380750.3380760
1. INTRODUCTION
Provenance for relational queries [10] explains how the results of a query are derived from the query's inputs. In contrast, why-not provenance explains why a query result is missing. Specifically, instance-based [11] why-not provenance techniques determine which existing and missing data from a query's input is responsible for the failure to derive a missing answer of interest. In prior work, we have shown how why and why-not provenance can be treated uniformly for first-order queries using non-recursive Datalog with negation [16] and have implemented this idea in the
PUG system [18, 19]. Instance-based why-not provenance techniques either (i) enumerate all potential ways of deriving a result (all-derivations approach) or (ii) return only one possible, but failed, derivation or parts thereof (single-derivation approach). For instance, Artemis [12], Huang et al. [13], and PUG [20, 18, 19] are all-derivations approaches while the Y! system [32, 31] is a single-derivation approach.
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/. For any use beyond those covered by this license, obtain permission by emailing [email protected]. Copyright is held by the owner/author(s). Publication rights licensed to the VLDB Endowment.
Proceedings of the VLDB Endowment,
Vol. 13, No. 6 ISSN 2150-8097.
DOI: https://doi.org/10.14778/3380750.3380760

r : AL(N, R) :- L(I, N, T, R, queen anne, E), A(I, 2016-11-09, P)

[Figure 1 shows the example Airbnb database and query: the input relation Listing(Id, Name, Ptype, Rtype, NGroup, Neighbor), the input relation Availability(Id, Date, Price, Available), the output relation Listings(Name, Rtype) with rows such as (cozy homebase, private) and (modern view, entire), and the number of distinct values for each of the attributes Id, Name, Ptype, Rtype, NGroup, Neighbor, Date, and Price.]

Figure 1: Example Airbnb database and query
Example 1. Fig. 1 shows a sample of a real-world dataset recording Airbnb (bed and breakfast) listings and their availability. Each Listing has an id, name, property type (Ptype), room type (Rtype), neighborhood (Neighbor), and neighborhood group (NGroup). Neighborhood groups are larger areas comprising multiple neighborhoods. Availability stores ids of listings with available dates and a price for each date. We refer to this sample dataset as S-Airbnb and to the full dataset as F-Airbnb. Bob, an analyst at Airbnb, investigates a customer complaint about the lack of availability of shared rooms on Nov 9th, 2016 in Queen Anne (NGroup = queen anne). He first uses Datalog rule r from Fig. 1 to return all listings (names and room types) available on that date in Queen Anne. The query result confirms the customer's complaint, since none of the available listings are shared rooms. Bob now needs to investigate what led to this missing result.

We refer to such questions as provenance questions. A provenance question is a tuple with constants and placeholders (upper-case letters) which specifies a set of (missing) answers the user is interested in (all answers that agree with the provenance question on constant values). For example, Bob's question can be written as AL(N, shared). All-derivations approaches like PUG explain the absence of shared rooms by enumerating all derivations of missing answers that match Bob's question, i.e., all possible bindings of the variables of rule r to values from the active domain (the values that exist in the database) such that R is bound to shared and the tuple produced by the grounded rule is missing. While this explains why shared rooms are unavailable (there is no successful derivation for any tuple with R = shared), the number of possible bindings can be prohibitively large. Consider our toy S-Airbnb dataset. Assume that only values from the active domain of each attribute are considered for a variable bound to this attribute, to avoid nonsensical derivations, e.g., binding prices to names. The numbers of distinct values per attribute are shown at the bottom of Fig. 1. Under this assumption, the number of failed derivations matching AL(N, shared) is the product of these per-attribute domain sizes. For the full dataset F-Airbnb, the number of possible derivations grows by many orders of magnitude.

Example 2. Continuing with Ex. 1, assume that Bob uses PUG [18] to compute an explanation for the missing result AL(N, shared). A provenance graph fragment is shown in Fig. 2a.
This type of provenance graph connects rule derivations (box nodes) with the tuples (ovals) they derive, rule derivations with the goals in their body (rounded boxes), and goals with the tuples that justify their success or failure. Nodes are colored red/green to indicate failure/success (goal and rule nodes) or absence/existence (tuple nodes). For S-Airbnb, the graph produced by PUG consists of all failed derivations of missing answers that match AL(N, shared). The fragment shown in Fig. 2a encodes one of these derivations: the shared room of the existing listing Central Place (Id 8403) is not available on Nov 9th, 2016 at a price of $130. This derivation fails because the corresponding tuple for this listing, date, and price does not exist in relation Availability (the second goal failed).
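To make the combinatorial argument of Ex. 1 concrete: the number of candidate derivations is simply the product of the per-attribute active-domain sizes of the variables left free by the question. The sizes below are hypothetical stand-ins (the actual counts appear at the bottom of Fig. 1).

```python
from math import prod

# Hypothetical active-domain sizes for the variables of rule r that are not
# bound by the question AL(N, shared); the real counts are shown in Fig. 1.
domain_sizes = {"N": 6, "I": 6, "T": 3, "E": 4, "P": 5}

# With R fixed to 'shared' and NGroup fixed to 'queen anne' by the rule,
# every remaining variable ranges over its attribute's active domain, so the
# number of candidate derivations is the product of the domain sizes.
num_derivations = prod(domain_sizes.values())
print(num_derivations)  # 6 * 6 * 3 * 4 * 5 = 2160
```

Even for a toy instance the count grows multiplicatively with every unbound variable, which is why enumerating all failed derivations quickly becomes infeasible.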
Single-derivation approaches address the scalability issue of why-not provenance by returning only a single derivation (or parts thereof). However, this comes at the cost of incompleteness. For instance, a single-derivation approach may return the derivation shown in Fig. 2a. However, such an explanation is not sufficient for Bob's investigation. What about other prices for the same listing? Do other listings from this area have shared rooms that are not available for this date, or do they simply not have shared rooms? A single-derivation approach cannot answer such questions since it only provides one out of a vast number of failed derivations (or even only a sufficient reason for a derivation to fail, as in [32, 31]). For S-Airbnb, no shared rooms are available in Queen Anne on Nov 9th, 2016 because: (i) all existing shared rooms of apartments in Queen Anne are not available on the requested date, and (ii) no listings in the West Queen Anne neighborhood have shared rooms.
Thus, returning only one derivation is insufficient for justifying the missing answer, as only the collective failure of all possible derivations explains the missing answer.
Suppose that Bob has to explain the result of his investigation to his manager and is expected to propose possible ways to improve the availability of rooms in Queen Anne. His manager is unlikely to accept an explanation of the form "There are no shared rooms available on this date, because listing 8403 is not available for $130 on this day."
Summarizing Provenance.
In this paper, we present a novel approach that overcomes the drawbacks of both kinds of approaches. Specifically, we efficiently create summaries that compactly represent large amounts of provenance information. We focus on the algorithmic and formal foundations of this method as well as its experimental evaluation (we demonstrated a GUI frontend in [19] and our vision in [21]).

[Figure 2 shows: (a) a partial provenance graph for the missing answer AL(central place, shared), consisting of the failed derivation node r(central place, shared, 8403, apt, east, 130), its failed goal node g(8403, 2016-11-09, 130), and the missing tuple node A(8403, 2016-11-09, 130); (b) a provenance summary for AL(N, shared), consisting of the rule pattern node r(N, shared, I, apt, E, P) annotated with the fraction of the provenance it covers, its goal pattern g(I, 2016-11-09, P), and the tuple pattern A(I, 2016-11-09, P).]
Figure 2: Explanations for the missing results AL(N, shared)

Example 3. Our summarization approach encodes sets of nodes from a provenance graph using "pattern nodes", i.e., nodes with placeholders. A possible summary for AL(N, shared) is shown in Fig. 2b. The graph contains a rule pattern node r(N, shared, I, apt, E, P); N, I, E, and P are placeholders. For each such node, our approach reports the amount of provenance covered by the pattern (shown to the left of the node). This summary provides useful information to Bob: all shared rooms of apartments in Queen Anne are not available at any price on Nov 9th, 2016 (their ids do not appear in relation Availability). Over F-Airbnb, a large fraction of the derivations for AL(N, shared) match this pattern. The type of patterns we use here can also be modeled as selection queries and has been used to summarize provenance [30, 24] and for explanations in general [8, 9].
Selecting Meaningful Summaries.
The provenance of a (missing) answer can be summarized in many possible ways. Ideally, we want provenance summaries to be concise (small provenance graphs), complete (covering all provenance), and informative (providing new insights). We define informativeness as the number of constants in a pattern that are not enforced by the user's question. The intuition behind this definition is that patterns with more constants provide more detailed information about the data accessed by derivations. Finding a solution that optimizes all three metrics is typically not possible. Consider two extreme cases: (i) any provenance graph is itself a provenance summary (one without placeholders); provenance graphs are complete and informative, but not concise; (ii) at the other end of the spectrum, an arbitrary number of derivations of a rule r can be represented as a single pattern with only placeholders, resulting in a maximally concise summary. However, such a summary is not informative since it contains only placeholders. We design a summarization algorithm that returns a set of up to k patterns (guaranteeing conciseness) optimizing for a combination of completeness and informativeness. The rationale behind this approach is to ensure that summaries cover a sufficient fraction of the provenance and at the same time provide sufficiently detailed information. Most of our results, however, are independent of the choice of ranking metric.

Efficient Summarization.
While summarization of provenance has been studied in previous work, e.g., [2, 33], for why-not provenance we face the challenge that it is infeasible to generate the full provenance as input for summarization. For instance, over the F-Airbnb dataset, the number of derivations of missing answers matching Bob's question is far too large to enumerate. We overcome this problem by (i) integrating summarization with provenance capture and (ii) developing a method for sampling rule derivations from the why-not provenance without materializing it first. Our sampling technique is based on the observation that the number of missing answers is typically significantly larger than the number of existing answers. Thus, to create a sample of the why-not provenance of missing answers matching a provenance question, we can randomly generate derivations that match the provenance question. We then filter out derivations for existing answers. This approach is effective because a randomly generated derivation is much more likely to derive a missing than an existing answer. While sampling is necessary for performance, it is not sufficient: even for relatively small sample sizes, enumerating all possible sets of candidate patterns and evaluating their scores to find the set of size up to k with the highest score is not feasible. We introduce several heuristics and optimizations that together enable good performance. Specifically, we limit the number of candidate patterns, approximate the completeness of sets of patterns over our sample, and exploit provable upper and lower bounds for the score of candidate pattern sets when ranking such sets. (We deliberately use the term placeholder and not variable to avoid confusion with the variables of a rule.)

Contributions.
To the best of our knowledge, we are the first to address both the usability and the (computational) scalability challenges of why-not provenance through summaries. Specifically, we make the following contributions:

• Using patterns, we generate meaningful summaries for the why and why-not provenance of unions of conjunctive queries with negation and inequalities (UCQ¬<).
• We develop a summarization algorithm that applies sampling during provenance capture and avoids enumerating the full why-not provenance. Our approach outsources most computation to a database system.
• We experimentally compare our approach with a single-derivation approach and Artemis [12] and demonstrate that it efficiently produces high-quality summaries.

The remainder of this paper is organized as follows. We cover preliminaries in Sec. 2 and define the provenance summarization problem in Sec. 3. We present an overview of our approach in Sec. 4 and then discuss sampling, pattern candidate generation, and top-k summary construction (Sec. 5 to 8). We present experiments in Sec. 9, discuss related work in Sec. 10, and conclude in Sec. 11.
2. BACKGROUND

2.1 Datalog
A Datalog program Q consists of a finite set of rules r : R(X̄) :- g_1(X̄_1), ..., g_m(X̄_m) where X̄_j denotes a tuple of variables and/or constants. R(X̄) is the head of the rule, denoted as head(r), and g_1(X̄_1), ..., g_m(X̄_m) is the body (each g_j(X̄_j) is a goal). We use vars(r) to denote the set of variables in r. The set of relations in the schema over which Q is defined is referred to as the extensional database (EDB), while relations defined through rules in Q form the intensional database (IDB), i.e., the heads of rules. All rules r of Q have to be safe, i.e., every variable in r must occur in a positive literal in r's body. Here, we consider unions of conjunctive queries with negation and comparison predicates (UCQ¬<). Thus, all rules of a query have the same head predicate, and goals in the body are either literals, i.e., atoms L(X̄_j) or their negation ¬L(X̄_j), or comparisons of the form a ⋄ b where a and b are either constants or variables and ⋄ ∈ {<, ≤, ≠, ≥, >}. For example, considering the Datalog rule r from Fig. 1, head(r) is AL(N, R) and vars(r) is {I, N, T, R, E, P}. The rule is safe since all head variables occur in the body and all goals are positive. The active domain adom(D) of a database D (an instance of the EDB relations) is the set of all constants that appear in D. We assume the existence of a universal domain of values 𝔻 which is a superset of the active domain of every database. The result of evaluating Q over D, denoted as Q(D), contains all IDB tuples Q(t) for which there exists a successful rule derivation with head Q(t). A derivation of r is the result of applying a valuation ν : vars(r) → 𝔻 which maps the variables of r to constants such that all comparisons of the rule hold, i.e., for each comparison ψ(Ȳ) the expression ψ(ν(Ȳ)) evaluates to true.
Note that the set of all derivations of r is independent of D since the constants of a derivation are taken from 𝔻. Let c̄ be a list of constants from 𝔻, one for each variable of r. We use r(c̄) to denote the rule derivation that assigns constant c_i to variable X_i in r. Note that variables are ordered by the position of their first occurrence in r, e.g., the variable order for r (Fig. 1) is (N, R, I, T, E, P). A rule derivation is successful (failed) if all (at least one of) the goals in its body are successful (failed). A positive/negative literal goal is successful if the corresponding tuple exists/does not exist. A missing answer for Q and D is an IDB tuple Q(t) for which all derivations failed. For a given D and r, we use D ⊨ r(c̄) to denote that r(c̄) is successful over D. Typically, as mentioned in Sec. 1, not all failed derivations constructed in this way are sensible, e.g., a derivation may assign an integer to an attribute of type string. We allow users to control which values to consider for which attribute (see [18, 20]). For simplicity, however, we often assume a single universal domain 𝔻.

We now explain the provenance model introduced in Ex. 2. As demonstrated in [20], this provenance model is equivalent to the provenance semiring model for positive queries [10] and to its extension for first-order (FO) formulas [25]. In our model, existing IDB tuples are connected to the successful rule derivations that derive them while missing tuples are connected to all failed derivations that could have derived them. Successful derivations are connected to successful goals. Failed derivations are only connected to failed goals (which justify the failure). Nodes in provenance graphs carry two types of labels: (i) a label that determines the node type (tuple, rule, or goal) and additional information, e.g., the arguments and rule identifier of a derivation, and (ii) a label indicating success/failure. We encode (ii) as colors in visualizations of such graphs.
As shown in [18], provenance in this model can equivalently be represented as sets of successful and failed rule derivations as long as the success/failure states of the goals are known.
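The success semantics of rule derivations can be sketched as follows for rule r from Fig. 1. This is a minimal sketch, not PUG's implementation: the Listing tuple is the one from Fig. 2a, while the Availability instance and the date literal 2016-11-09 are assumptions.

```python
# Hypothetical mini-instance: the Listing tuple from Fig. 2a exists; the
# Availability tuple for that listing at $130 on the queried date does not.
L = {(8403, "central place", "apt", "shared", "queen anne", "east")}
A = {(8403, "2016-11-10", 150)}

def derivation_succeeds(n, r, i, t, e, p, date="2016-11-09"):
    """A derivation is successful iff all goals succeed: here, the grounded
    Listing tuple and the grounded Availability tuple must both exist."""
    return (i, n, t, r, "queen anne", e) in L and (i, date, p) in A

# The derivation from Fig. 2a fails: its first goal succeeds (the listing
# exists), but the availability tuple at price 130 is missing.
print(derivation_succeeds("central place", "shared", 8403, "apt", "east", 130))  # False
```

A derivation becomes successful only when every goal is backed by an existing tuple, matching the all-goals-succeed condition above.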
Definition 1 (Annotated Derivation). Let r be a Datalog rule Q(X̄) :- R_1(X̄_1), ..., R_l(X̄_l), ¬R_{l+1}(X̄_{l+1}), ..., ¬R_m(X̄_m), ψ_1(Ȳ_1), ..., ψ_k(Ȳ_k) where each ψ_i is a comparison, and let D be a database. An annotated derivation d = r(c̄) − (ḡ) of r consists of a list of constants c̄ and a list of goal annotations ḡ = (g_1, ..., g_m) such that (i) r(c̄) is a rule derivation, and (ii) g_i = T if i ≤ l ∧ D ⊨ R_i(c̄_i) or i > l ∧ D ⊭ R_i(c̄_i), and g_i = F otherwise.

An example failed annotated derivation of rule r (Fig. 1) from Fig. 2a is d = r(central place, shared, 8403, apt, east, 130) − (T, F). That is, while A(8403, 2016-11-09, 130) failed, L(8403, central place, apt, shared, queen anne, east) is successful. Using annotated derivations, we can explain the existence or absence of a (set of) query result tuple(s). We use A(Q, D, r) to denote all annotated derivations of rule r from Q according to D, A(Q, D) to denote ⋃_{r ∈ Q} A(Q, D, r), and A(Q, D, t) to denote the subset of A(Q, D) with head Q(t). Note that, by definition, valuations that violate any comparison of a rule are not considered to be rule derivations.

We now define provenance questions (PQs). Through the type of a PQ (Why or Whynot), the user specifies whether she is interested in missing or existing results. In addition, the user provides a tuple t of constants (from 𝔻) and placeholders to indicate which tuples she is interested in. We refer to such tuples as pattern tuples (p-tuples for short) and use bold font to distinguish them from tuples with constants only. We use capital letters to denote placeholders and variables, and lowercase letters to denote constants.
We say a tuple t matches a p-tuple 𝐭, written as t ⊑ 𝐭, if we can unify t with 𝐭 by applying a valuation ν that substitutes placeholders in 𝐭 with constants from 𝔻 such that ν(𝐭) = t, e.g., AL(plum, shared) ⊑ AL(N, shared) using ν := N → plum. The provenance of all existing (missing) tuples matching 𝐭 constitutes the answer of a Why (Whynot) PQ.
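The goal-annotation computation of Def. 1 can be sketched as follows. The generic encoding of goals and the database is an assumption of this sketch, not PUG's internal representation.

```python
# Positive goals are annotated T iff their grounded tuple exists in the
# database; negated goals are annotated T iff it does not (Def. 1).
def annotate(grounded_goals, db, num_positive):
    """grounded_goals: list of grounded atoms (relation name, tuple);
    the first num_positive goals are positive, the rest are negated."""
    annotations = []
    for idx, (rel, tup) in enumerate(grounded_goals):
        exists = tup in db.get(rel, set())
        ok = exists if idx < num_positive else not exists
        annotations.append("T" if ok else "F")
    return tuple(annotations)

# The failed annotated derivation from Def. 1's example: the Listing tuple
# exists, the Availability tuple does not.
db = {"L": {(8403, "central place", "apt", "shared", "queen anne", "east")},
      "A": set()}
goals = [("L", (8403, "central place", "apt", "shared", "queen anne", "east")),
         ("A", (8403, "2016-11-09", 130))]
print(annotate(goals, db, num_positive=2))  # ('T', 'F')
```

The resulting annotation (T, F) records that the first goal justifies nothing about the failure, while the second (failed) goal explains why the derivation fails.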
Definition 2 (Provenance Question).
Let Q be a query. A provenance question Φ over Q is a pair (𝐭, type) where 𝐭 is a p-tuple and type ∈ {Why, Whynot}.

Bob's question from Ex. 1 can be written as Φ_bob = (t_bob, Whynot) where t_bob = AL(N, shared), i.e., Bob wants an explanation for all missing answers where R = shared. The graph shown in Fig. 2a is part of the provenance for Φ_bob.

Definition 3 (Provenance).
Let D be a database, Q an n-ary UCQ¬< query, and 𝐭 an n-ary p-tuple. We define the why and why-not provenance of 𝐭 over Q and D as:

Why(Q, D, 𝐭) = ⋃_{t ⊑ 𝐭 ∧ t ∈ Q(D)} Why(Q, D, t)
Why(Q, D, t) = { d | d ∈ A(Q, D, t) ∧ D ⊨ d }
Whynot(Q, D, 𝐭) = ⋃_{t ⊑ 𝐭 ∧ t ∉ Q(D)} Whynot(Q, D, t)
Whynot(Q, D, t) = { d | d ∈ A(Q, D, t) ∧ D ⊭ d }

The provenance Prov(Φ) of a provenance question Φ is:

Prov(Φ) = Why(Q, D, 𝐭) if Φ = (𝐭, Why), and Prov(Φ) = Whynot(Q, D, 𝐭) if Φ = (𝐭, Whynot).
3. PROBLEM DEFINITION
We now formally define the problem addressed in this work: how to summarize the provenance
Prov(Φ) of a provenance question Φ. To this end, we introduce derivation patterns that concisely describe provenance and then define provenance summaries as sets of such patterns. We also develop quality metrics for such summaries that model completeness and informativeness as introduced in Sec. 1. A derivation pattern is an annotated rule derivation whose arguments can be both constants and placeholders.
Let r be a rule with n variables and m goals and let P be an infinite set of placeholders. A derivation pattern p = r(ē) − (ḡ) consists of a list ē of length n where e_i ∈ 𝔻 ∪ P, and a list ḡ of m booleans.

Consider the pattern p = r(N, shared, I, apt, E, P) − (T, F) for rule r (Fig. 1) shown in Fig. 2b. Pattern p represents the set of failed derivations matching AL(N, shared) where the listing is an apartment (apt) and for which the first goal succeeded (the listing exists in Queen Anne) and the second goal failed (the listing is not available on Nov 9th, 2016). We use p[i] to denote the i-th argument of pattern p and omit goal annotations when they are irrelevant to the discussion. A derivation pattern p represents the set of derivations that "match" the pattern. We define pattern matches as valuations that replace the placeholders in a pattern with constants from 𝔻. In the following, we use placeh(p) to denote the set of placeholders of a pattern p.
A derivation pattern p = r(ē) − (ḡ_1) matches an annotated rule derivation d = r(c̄) − (ḡ_2), written as p ⊑ d, if there exists a valuation ν : placeh(p) → 𝔻 such that ν(p) = d and ḡ_1 = ḡ_2.

Consider p = r(N, shared, I, apt, E, P) − (T, F) and d = r(central place, shared, 8403, apt, east, 130) − (T, F) from Fig. 2. We have p ⊑ d since the valuation N → central place, I → 8403, E → east, and P → 130 maps p to d, and the goal annotations (T, F) are the same for p and d. We call p a pattern for a p-tuple 𝐭 if p and 𝐭 agree on constants, e.g., p is a pattern for t_bob = AL(N, shared) since p[2] = t_bob[2] = shared. We use Pat(Q, 𝐭) to denote the set of all patterns for 𝐭 and Q.
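Pattern matching per Def. 5 can be sketched as follows, modeling placeholders as upper-case strings (a convention of this sketch) and checking that goal annotations agree and that repeated placeholders unify consistently.

```python
def pattern_matches(pattern, derivation):
    """p ⊑ d (Def. 5): the goal annotations must be equal and a single
    consistent valuation must map the pattern's placeholders to the
    derivation's constants."""
    p_args, p_goals = pattern
    d_args, d_goals = derivation
    if p_goals != d_goals or len(p_args) != len(d_args):
        return False
    valuation = {}
    for e, c in zip(p_args, d_args):
        if isinstance(e, str) and e[:1].isupper():        # placeholder
            if valuation.setdefault(e, c) != c:           # must bind consistently
                return False
        elif e != c:                                      # constant mismatch
            return False
    return True

# The pattern and derivation from Fig. 2 / Def. 5's example:
p = (("N", "shared", "I", "apt", "E", "P"), ("T", "F"))
d = (("central place", "shared", 8403, "apt", "east", 130), ("T", "F"))
print(pattern_matches(p, d))  # True
```

The valuation built here is exactly the ν of Def. 5; the `setdefault` check rejects derivations where the same placeholder would have to take two different constants.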
Let Q be a UCQ¬< query and Φ = (𝐭, type) a provenance question. A provenance summary S for Φ is a subset of Pat(Q, 𝐭).

Based on Def. 6, any subset of
Pat(Q, 𝐭) is a summary. However, summaries differ in conciseness, informativeness, and completeness. Consider a summary for Φ_bob consisting of p = r(N, shared, I, T, E, P) − (T, F) and p' = r(N, shared, I, T, E, P) − (F, F). This summary covers all of Prov(Φ_bob). However, the patterns consist only of placeholders and constants from Φ_bob; no new information is conveyed. A pattern consisting only of constants, e.g., one that fixes every argument of r to the constants of a single failed derivation for the listing plum, provides detailed information but covers only one derivation.
We now introduce a quality metric that combines completeness and informativeness. We define completeness as the fraction of Prov(Φ) matched by a pattern. For a question Φ, query Q, and database D, we use M(Q, D, p, Φ) to denote all derivations in Prov(Φ) that match a pattern p:

M(Q, D, p, Φ) := { d | d ∈ Prov(Φ) ∧ p ⊑ d }

Considering the pattern p from Fig. 2b and the derivation d from Fig. 2a, we have d ∈ M(r, D, p, Φ_bob). Pattern p'' = r(N, shared, I, T, E, P) − (F, T) has no matches, because non-existing listings cannot be available.
[Figure 3 illustrates the four phases of why-not provenance summarization: phase 1 samples derivations from the attribute domains and filters out derivations of existing answers to obtain why-not derivations (Sec. 5); phase 2 generates patterns from the sampled derivations (Sec. 6); phase 3 measures the quality of each pattern, i.e., its completeness (C) and informativeness (I) (Sec. 7); phase 4 selects the top-k (in the figure, top-2) patterns as the summary (Sec. 8).]
Figure 3: Overview of why-not provenance summarization
Definition 7 (Completeness).
Let Q be a query, D a database, p a pattern, and Φ a provenance question. The completeness of p is defined as cp(p) = |M(Q, D, p, Φ)| / |Prov(Φ)|.

We also define informativeness, which measures how much new information is conveyed by a pattern.
Definition 8 (Informativeness).
For a pattern p and question Φ with p-tuple 𝐭, let C(p) and C(𝐭) denote the number of constants in p and 𝐭, respectively. The informativeness info(p) of p is defined as info(p) = (C(p) − C(𝐭)) / (arity(p) − C(𝐭)).

For Bob's question Φ_bob and pattern p = r(N, shared, I, apt, E, P) − (T, F), we have info(p) = 0.2: C(p) is 2 (shared and apt), C(t_bob) is 1 (shared), and arity(p) is 6 (the total number of placeholders and constants). We generalize completeness and informativeness to sets of patterns (summaries) as follows. The completeness of a summary S is the fraction of Prov(Φ) covered by at least one pattern from S. For the patterns p and p' from Sec. 3.3, we have cp({p, p'}) = cp(p) + cp(p') = 1. Note that cp(S) may not equal the sum of cp(p) for p ∈ S since the sets of matches of two patterns may overlap. We will revisit overlap in Sec. 8. We define the informativeness of S as the average informativeness of the patterns in S:

cp(S) = |⋃_{p ∈ S} M(Q, D, p, Φ)| / |Prov(Φ)|
info(S) = (Σ_{p ∈ S} info(p)) / |S|

The score of a summary S is then defined as the harmonic mean of completeness and informativeness, i.e., sc(S) = 2 · cp(S) · info(S) / (cp(S) + info(S)). We are now ready to define the top-k provenance summarization problem which, given a provenance question Φ, returns the top-k patterns for Φ wrt. sc(S).

• Input: a query Q, database D, provenance question Φ = (𝐭, type), and k ∈ ℕ with k ≥ 1
• Output: S(Q, D, Φ, k) = argmax_{S ⊆ Pat(Q, 𝐭) ∧ |S| ≤ k} sc(S)
4. OVERVIEW
Before describing our approach in detail in the following sections, we first give a brief overview. To compute a top-k provenance summary S(Q, D, Φ, k) for a provenance question Φ, we have to (i) compute Prov(Φ), (ii) enumerate all patterns that could be used in summaries, (iii) calculate matches between derivations and patterns to determine the completeness of sets of patterns, and (iv) find a set of up to k patterns for Φ that has the highest score among all such sets. To compute the exact solution to this problem, we would need to enumerate all derivations from Prov(Φ). However, this is not feasible for why-not provenance questions since, as we will discuss in the following, the size of the why-not provenance Whynot(Q, D, 𝐭) is in O(|𝔻|^n), i.e., polynomial in the size of the data domain 𝔻, but exponential in n, the maximal number of variables of a rule from query Q that is not bound to constants by the p-tuple 𝐭 of Φ. Instead, we present a heuristic approach that uses sampling and outsources most of the computation to a database for scalability. Fig. 3 shows an overview of this approach.

Query: r_ex : Q_ex(X, Y) :- R(X, Z), R(Z, Y), X < Y
PQ: Φ_ex = (t_ex, Whynot) where t_ex = Q_ex(X, 4)
Query unified with p-tuple t_ex: r_{t_ex} : Q_{t_ex}(X, 4) :- R(X, Z), R(Z, 4), X < 4
[Fig. 4 also shows the instance of R, the query result Q_ex, and the existing and missing answers matching t_ex.]

Figure 4: Running example for summarization
Sampling Provenance (Phase 1, Sec. 5). As shown in Fig. 3 (phase 1), we develop a technique to compute a sample S of n_S derivations from Whynot(Q, D, 𝐭) that is unbiased with high probability. We create S by (i) randomly sampling a number n_OS > n_S of values from the domain of each attribute (e.g., A and B in Fig. 3) individually, (ii) zipping these samples to create derivations, and (iii) removing derivations of existing results (e.g., the derivation d highlighted in red) to compute a sample of Whynot(Q, D, 𝐭) that with high probability is at least of size n_S (in Fig. 3 we assume n_S = 3). For why-provenance, we sample directly from the full provenance for Φ computed using our query instrumentation technique from [20, 18].

Enumerating Pattern Candidates (Phase 2, Sec. 6). The number of patterns for a rule with m goals and n variables is in O((|𝔻| + n)^n · 2^m). Even if we only consider patterns that match at least one derivation from S, the number of patterns may still be a factor of 2^n larger than S. We adopt a heuristic from [8] that, in the worst case, generates quadratically many patterns (in the size of S). As shown in Fig. 3, we generate a pattern p for each pair of derivations d and d' from S: if d[i] = d'[i], then p[i] = d[i]; otherwise, p[i] is a fresh placeholder (shown as an empty box in Fig. 3).

Estimating Pattern Coverage (Phase 3, Sec. 7). To compute the completeness metric of a pattern set, which is required for scoring pattern sets in the last step, we need to determine which derivations are covered by which pattern and which of these belong to Whynot(Q, D, 𝐭). We estimate completeness based on S. The informativeness of a pattern can be computed directly from the pattern.

Computing the Top-k Summary (Phase 4, Sec. 8). In the last step (phase 4 in Fig. 3), we generate sets of up to k patterns from the patterns produced in the previous step, rank them by their scores, and return the set with the highest score as the top-k summary. We apply a heuristic best-first search method that utilizes efficiently computable bounds on the completeness of sets of patterns to prune the search space.
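The pairwise generalization of phase 2 can be sketched as follows over toy derivations (the sample below is hypothetical, and goal annotations are omitted for brevity): positions on which two sampled derivations agree keep their constant; all other positions become fresh placeholders.

```python
from itertools import combinations

def generalize(d1, d2):
    """Build one candidate pattern from a pair of derivations: keep agreeing
    constants, introduce a fresh placeholder for each disagreeing position."""
    counter = 0
    pattern = []
    for a, b in zip(d1, d2):
        if a == b:
            pattern.append(a)
        else:
            counter += 1
            pattern.append(f"P{counter}")   # fresh placeholder
    return tuple(pattern)

# One pattern per pair of sampled derivations: O(|S|^2) candidates overall.
sample = [("plum", "shared", "apt"), ("rose", "shared", "apt"), ("plum", "shared", "house")]
candidates = {generalize(d1, d2) for d1, d2 in combinations(sample, 2)}
print(candidates)
```

Because one pattern is produced per pair, the number of candidates is quadratic in the sample size, matching the worst-case bound of the heuristic described above.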
5. SAMPLING WHY-NOT PROVENANCE
In this section, we first discuss how to efficiently generate a sample S of annotated derivations of a given size n_S from the why-not provenance Whynot(Q, D, 𝐭) for a provenance question (PQ) Φ (phase 1 in Fig. 3). This sample will then be used in the following phases of our summarization algorithm. We introduce a running example in Fig. 4 and use it throughout Sec. 5 to Sec. 8. Consider the example query r_ex shown at the top of Fig. 4, which returns the start- and end-points of paths of length 2 in a graph with integer node labels such that the end-point is labeled with a larger number than the start-point. Evaluating r_ex over the example instance R from the same figure yields three results (relation Q_ex in Fig. 4), one of which is Q_ex(1, 4). Consider the provenance question with p-tuple t_ex = Q_ex(X, 4) from Fig. 4. Recall that Whynot(Q, D, 𝐭) for a p-tuple 𝐭 consists of all derivations of tuples t ∉ Q(D) where t ⊑ 𝐭. Assuming 𝔻 = {1, 2, 3, 4, 5, 6}, the bottom right of Fig. 4 shows all missing and existing answers matching t_ex (missing answers are shown with a red background).

To generate all derivations for missing answers, we can bind the variables of each rule r of a query Q to the constants from 𝐭 to ensure that only derivations of results matching the PQ's p-tuple 𝐭 are generated. We refer to this process as unifying Q with 𝐭. For our running example, this yields the rule r_{t_ex} shown in Fig. 4. The naive way to create a sample of derivations from Whynot(Q, D, 𝐭) using this rule is to repeatedly sample a value from 𝔻 for each variable, then check whether (i) the predicates of the rule are fulfilled and (ii) the resulting rule derivation computes a missing answer. For example, for r_{t_ex}, we may choose X = 2 and Z = 2 and get the derivation d = r_{t_ex}(2, 2). d fulfills the predicate X < 4 and Q_ex(2, 4) is a missing answer. Thus, d belongs to the why-not provenance of t_ex. Then, to get an annotated rule derivation, we determine its goal annotations by checking whether the tuples corresponding to the grounded goals of the rule exist in the database instance. For this example, d = r_{t_ex}(2, 2) − (F, T) since the first goal R(2, 2) fails, but the second goal R(2, 4) succeeds. There are two ways in which this process can fail to produce a derivation of Whynot(Q, D, 𝐭): (i) a predicate of the rule may be violated by the generated bindings (e.g., had we chosen X = 5, the predicate X < 4 would be violated); (ii) the derivation may derive an existing answer (e.g., for X = 1 and Z = 3, we get the failed derivation r_{t_ex}(1, 3) of the existing answer Q_ex(1, 4)).
If we repeat the process described above until it has returned n_S failed derivations, then this produces an unbiased sample of Whynot(Q, D, t). Note that, technically, there is no guarantee that the process will ever terminate since it may repeatedly produce derivations that do not fulfill a predicate or derive existing answers. Observe that, typically, the number of missing answers is significantly larger than the number of answers, i.e., |Whynot(Q, D, t)| ≫ |A(Q, D, t) − Whynot(Q, D, t)|. As a result, any randomly generated derivation is with high probability in Whynot(Q, D, t). We will explain how to deal with derivations that fail to fulfill predicates in Sec. 5.2.

Batch Sampling.
A major shortcoming of the naive sampling approach is that it requires us to evaluate queries to test for every produced derivation d whether it derives a missing answer (head(d) ∉ Q(D)) and to determine its goal annotations by checking for each grounded goal R(c̄) or ¬R(c̄) whether R(c̄) ∈ D. It would be more efficient to model sampling as a single batch computation that we can outsource to a database system and that can be fused into a single query with the other phases of the summarization process to avoid unnecessary round-trips between our system and the database. However, for batch sampling, we have to choose upfront how many samples to create, and not all such samples will end up being why-not provenance or fulfill the rule's predicates. To ensure with high probability that the batch computation returns at least n_S derivations from Whynot(Q, D, t), we use a larger sample size n_OS ≥ n_S such that the probability that the resulting sample contains at least n_S derivations from Whynot(Q, D, t) is higher than a configurable threshold P_success (e.g., 99.9%). We refer to this part of the process as over-sampling. We discuss how to generate a query that computes a sample of size n_OS in Sec. 5.2 and, then, discuss how to determine n_OS in Sec. 5.3. For simplicity, we limit the discussion to queries with a single rule, e.g., the query r_ex from Fig. 4. We discuss queries with multiple rules at the end of this section. The query we generate to produce a sample of size n_OS consists of three steps: generating derivations, filtering derivations of existing answers, and determining goal annotations.
1. Generating Derivations.
We first generate a query that creates a random sample OS of n_OS derivations (not annotated) for which there exists an annotated version in A(Q, D, t) (all annotated derivations that match head_Q(t)). Consider a single rule r with m literal goals, n variables, and h head variables: r: Q(X̄) :− g_1(X̄_1), ..., g_m(X̄_m), ψ_1(Ȳ_1), ..., ψ_l(Ȳ_l). Let R_i be the relation accessed by goal g_i, i.e., g_i(X̄_i) is either R_i(X̄_i) or ¬R_i(X̄_i). Let k be the number of head variables bound by the p-tuple t from the question Φ for which we are sampling. We use Z̄ = Z_1, ..., Z_u to denote the u = n − k variables of r that are not bound by t. Recall that, to only consider derivations matching t, we unify the rule with t by binding variables in the rule to the corresponding constants from t. We use r_t to denote the resulting unified rule. Note that we will describe our summarization techniques using derivations and patterns for r_t. Patterns for r can be trivially reconstructed from the results of summarization by plugging in constants from t. To generate derivations for such a query, we sample n_OS values for each unbound variable independently with replacement and, then, combine the resulting samples into a sample of A(Q, D, t) modulo goal annotations. Predicates comparing constants with variables, e.g., X < 4 in r_t_ex, are applied before sampling to remove values from the domain of a variable that cannot result in derivations fulfilling the predicates. Similar to [20, 18], we assume that the user specifies the domain D_A for each attribute A as a unary query that returns D_A (we provide reasonable defaults to avoid overloading the user). We extend the relational algebra with two operators to be able to express sampling. Operator Sample_n returns n samples which are chosen uniformly at random with replacement from the input of the operator. We use id_A to denote an operator that creates an integer identifier for each input row that is stored in a column A appended to the schema of the operator's input. For each variable X ∈ Z̄ with attrs(X) = {A_1, ..., A_j} (attrs(X) denotes the set of attributes that variable X is bound to by the rule containing X), we create a query Q_X that unions the domains of these attributes, then applies predicates that compare X with constants, and then samples n_OS values:

Q_X := id(Sample_{n_OS}(σ_{θ_X}(ρ_X(D_{A_1} ∪ ... ∪ D_{A_j}))))

Here, θ_X is a conjunction of all the predicates from r_t that compare X with a constant. The per-variable samples are then combined by joining them on their identifiers and applying the remaining predicates of r_t:

Q_bind := σ_{θ_join}(Q_{Z_1} ⋈ ... ⋈ Q_{Z_u})

Here, θ_join is a conjunction of all predicates from r_t that compare two variables. Note that the selectivity of θ_join has to be taken into account when computing n_OS (discussed in Sec. 5.3). Each tuple in the result of Q_bind encodes the bindings for one derivation d of a tuple t' ⊑ t.

Example. Consider unified rule r_t_ex from Fig. 4. Assume that D_A = D_B = Π_A(R) ∪ Π_B(R) and n_OS = 3. Variable X is bound to attribute A and Z is bound to both A and B. Thus, we generate the following queries:

Q_X := id(Sample_{n_OS}(σ_{X<4}(ρ_X(D_A))))
Q_Z := id(Sample_{n_OS}(ρ_Z(D_A ∪ D_B)))
Q_bind := σ_true(Q_X ⋈ Q_Z)

Evaluated over the example instance, these queries return a sample Q_X(id, X), a sample Q_Z(id, Z), and their combination Q_bind(id, X, Z) with one row per sampled derivation.
2. Filtering Derivations of Existing Answers.
We now construct a query Q_der which checks for each derivation d ∈ OS for a tuple t' ⊑ t whether t' ∉ Q(D) and only retains derivations passing this check. This is achieved by anti-joining (▷) Q_bind with Q which we restrict to tuples matching t since only such tuples can be derived by Q_bind.

Q_der := Q_bind ▷_{θ_der} σ_{θ_t}(Q)

The query Q_der uses condition θ_der which equates attributes from Q_bind that correspond to head variables of r_t with the corresponding attributes from Q and condition θ_t that filters out derivations not matching t by equating attributes with constants from t.

Example. From Q_ex in Fig. 4, we remove answers where Y ≠ 4 (since t_ex binds Y = 4) and anti-join on X, the only head variable of r_t_ex, with Q_bind from Ex. 4. The resulting query is shown below. Note that the tuple binding X = 1 was removed since it corresponds to a (failed) derivation of the existing answer Q_ex(1, 4).

Q_der := Q_bind ▷_{X=X} σ_{Y=4}(Q_ex)
3. Computing Goal Annotations.
Next, we determine goal annotations for each derivation to create a set of annotated derivations from Whynot(Q, D, t). Recall that a positive (negative) grounded goal is successful if the corresponding tuple exists (is missing). We can check this by outer-joining the derivations with the relations from the rule's body. Based on the existence of a join partner, we create boolean attributes storing g_i for 1 ≤ i ≤ |body(r)| (F is encoded as false). For a negated goal, we negate the result of the conditional expression such that F is used if a join partner exists. We construct the query Q_sample shown below to associate derivations in Q_der with their goal annotations ḡ.

Q_sample := δ(Π_{Z_1, ..., Z_u, e_1→g_1, ..., e_m→g_m}(Q_goals))
Q_goals := Q_der ⟕_{θ_1} Π_{R_1, 1→h_1}(R_1) ... ⟕_{θ_m} Π_{R_m, 1→h_m}(R_m)

Note that we use duplicate elimination to preserve set semantics. In projection expressions, we use e → a to denote projection on a scalar expression e whose result is stored in attribute a. Here, the join condition θ_i equates the attributes storing the values from X̄_i in r_t with the corresponding attributes from R_i. Attributes at positions that are bound to constants in r_t are equated with the constant. The net effect is that a tuple from Q_der corresponding to a rule derivation d has a join partner in R_i iff the tuple corresponding to the i-th goal of d exists in D. The expression e_i used in the projection of Q_sample then computes the boolean indicator for goal g_i as follows:

e_i := if(isnull(h_i)) then F else T   (if g_i is positive)
e_i := if(isnull(h_i)) then T else F   (otherwise)

Example. For our running example, we generate:

Q_sample := δ(Π_{X, Z, if(isnull(h_1)) then F else T → g_1, if(isnull(h_2)) then F else T → g_2}(Q_goals))
Q_goals := Q_der ⟕_{X=A ∧ Z=B} Π_{A,B,1→h_1}(R) ⟕_{Z=A ∧ B=4} Π_{A,B,1→h_2}(R)

Evaluating this query, the first result tuple corresponds to r_t_ex(2, 2) − (F, T), for which the first goal fails since R(2, 2) does not exist in R while the second goal succeeds because R(2, 4) exists. Similarly, the second tuple corresponds to r_t_ex(2, 4) − (T, F), for which the first goal succeeds since R(2, 4) exists while the second goal fails because R(4, 4) does not exist.

Queries With Multiple Rules.
For queries with multiple rules, we determine n_OS separately for each rule (recall that we consider UCQ¬< queries where every rule has the same head predicate) and create a separate query for each rule as described above. Patterns are also generated separately for each rule. In the final step, we then select the top-k summary from the union of the sets of patterns.

Complexity.
The runtime of our algorithm is linear in n_OS and |D|, which significantly improves over the naive algorithm which is in O(|D|^n).

Implementation.
Some DBMSs such as Oracle and Postgres support a sample operator out of the box which we can use to implement the Sample operator introduced above. However, these implementations of a sampling operator do not support sampling with replacement out of the box. We can achieve decent performance for sampling with replacement using a set-returning function that takes as input the result of applying the built-in sampling operator to generate a sample of size n_OS, caches this sample, and then samples n_OS times from the cached sample with replacement. The id operator can be implemented in SQL using ROW_NUMBER(). The expressions if(θ) then e_1 else e_2 and isnull() can be expressed in SQL using CASE WHEN and IS NULL, respectively.

5.3 Determining Over-sampling Size
We now discuss how to choose n_OS, the size of the sample OS produced by query Q_bind, such that the probability that OS contains at least n_S derivations from Whynot(Q, D, t) is higher than a threshold P_success, under the assumption that the sampling method we introduced above samples uniformly at random from A(Q, D, t). We then prove that our sampling method returns a uniform random sample. First, consider the probability p_prov that a derivation chosen uniformly at random from A(Q, D, t) is in Whynot(Q, D, t), which is equal to the fraction of derivations from A(Q, D, t) that are in Whynot(Q, D, t):

p_prov = |Whynot(Q, D, t)| / |A(Q, D, t)|

|A(Q, D, t)| can be computed from Q, t, and the attribute domains as explained in Sec. 2.2. For instance, consider D = {1, 2, 3, 4, 5, 6} as the universal domain and rule r_t_ex from our running example, but without the comparison predicate. Then, there are |D|^n = 6^2 = 36 possible derivations of this rule because neither variable X nor Z is bound by t_ex and D has 6 values. To determine |Whynot(Q, D, t)|, we need to know how many derivations in A(Q, D, t) correspond to missing tuples matching t. Since in most cases the number of missing answers is significantly larger than the number of existing tuples, it is more effective to compute the number of (successful and/or failed) derivations of t' ∈ Q(D) with t' ⊑ t, i.e., |{t' | t' ∈ Q(D) ∧ t' ⊑ t}|. This gives us the probability p_notProv that a derivation is not in Whynot(Q, D, t) and we get p_prov = 1 − p_notProv. Next, consider a random variable X that is the number of derivations from Whynot(Q, D, t) in OS. We want to compute the probability p(X ≥ n_S).
For that, consider first p(X = i), the probability that the sample OS we produce contains exactly i derivations from Whynot(Q, D, t). We can apply standard results from statistics for computing p(X = i), i.e., out of a sequence of n_OS picks with success probability p_prov we get i successes. The probability to get exactly i successes out of n picks is C(n, i) · p_prov^i · (1 − p_prov)^{n−i} based on the Binomial distribution. For i ≠ j, the events X = i and X = j are disjoint (it is impossible to have both exactly i and exactly j derivations from Whynot(Q, D, t) in OS). Thus, p(X ≥ n_S) is Σ p(X = i) for i ∈ {n_S, ..., n_OS}:

p(X ≥ n_S) = Σ_{i=n_S}^{n_OS} C(n_OS, i) · p_prov^i · (1 − p_prov)^{n_OS − i}

Given p_prov, n_S, and P_success, we can compute the sample size n_OS such that p(X ≥ n_S) is larger than P_success ([1, 28] present algorithms for finding the minimum such n_OS).

Handling Predicates.
Recall that we apply predicates that compare a variable with a constant before creating a sample for this variable. Thus, we do not need to consider these predicates when determining n_OS. Predicates comparing variables with variables are applied after the step of creating derivations. We estimate the selectivity of such predicates using standard techniques [15] to estimate how many derivations will be filtered out and, then, increase n_OS to compensate for this, i.e., we scale n_OS based on the inverse of the estimated selectivity of θ_join.

We now formally analyze whether our approach creates a uniform sample of Whynot(Q, D, t). We demonstrate this by analyzing the probability p(d ∈ S) for an arbitrary derivation d ∈ Whynot(Q, D, t) to be in the sample S. If our approach is unbiased, then this probability should be independent of which d is chosen and, for d' ≠ d, the events d ∈ S and d' ∈ S should be independent.

Theorem. Given derivations d, d' ∈ Whynot(Q, D, t) and sample sizes n_S and n_OS, p(d ∈ S) = c where c is a constant that is independent of the choice of d. Furthermore, the events d ∈ S and d' ∈ S are independent of each other.

Proof.
To prove the theorem, we have to demonstrate that none of the phases of sampling introduces bias. In the following, let A := A(Q, D, t), n_A := |A|, and n_P := |P| where P := Whynot(Q, D, t). Recall that the first phase of sampling generates a sample OS of size n_OS by independently creating samples for each unbound variable which are combined into a sample of A. Consider first the case where n_OS = 1, i.e., we pick a single value from each domain. Let D_i denote the domain for unbound variable Z_i in the single rule r of query Q. Since we sample uniformly from D_i, each of the |D_i| values has a probability of 1/|D_i| to be chosen. Since the sample for Z_i is chosen independently from Z_j for i ≠ j, the probability for any particular derivation d ∈ A to be in OS is p(d ∈ OS) = 1/(|D_1| × ... × |D_u|) = 1/|A|. For n_OS > 1, observe that each value in the sample of D_i is chosen independently. Thus, p(d ∈ OS) = 1 − p(d ∉ OS) = 1 − (1 − 1/n_A)^{n_OS} (the last equivalence is based on p(A ∩ B) = p(A) · p(B) when A and B are independent). Furthermore, this implies that d ∈ OS is independent of d' ∈ OS for d ≠ d'. So far, we have established that p(d ∈ OS) is constant and the events for picking particular derivations are mutually independent. It remains to be shown that the same holds for a derivation d ∈ P and the sample S we derive from OS. Since OS is sampled from A, it may contain derivations d' ∉ P. Our sampling algorithm filters out such derivations. Let S_np denote the set of all such derivations from OS and n_np = |S_np|. Observe that for i ≠ j the events n_np = i and n_np = j are obviously disjoint since OS contains a fixed number of derivations not in P. Furthermore, Σ_{i=0}^{n_OS} p(n_np = i) = 1 since OS has to contain anywhere from zero to n_OS such derivations. Thus, we can compute the probability p(d ∈ S) as the sum over i ∈ {0, ..., n_OS} of the probability that d is selected to be in S conditioned on the event that d ∈ OS (otherwise d cannot be in S) and that n_np = i:

p(d ∈ S) = Σ_{i=0}^{n_OS} p(d ∈ S | d ∈ OS ∩ n_np = i) · p(d ∈ OS ∩ n_np = i)

Now, consider the individual probabilities in this formula. Let p_1 = p(d ∈ S | d ∈ OS ∩ n_np = i). If d is in OS and n_np = i, then there are n_OS − i derivations from P in OS. Our sampling algorithm selects uniformly n_S derivations in P from OS − S_np if n_OS − i > n_S. Thus, the probability for our particular derivation d to be in the final result is:

p_1 = min(n_S, n_OS − i) / (n_OS − i)

Next, consider p_2 = p(d ∈ OS ∩ n_np = i). Based on our observation above, any subset of derivations of A has the same probability to be returned as OS. Thus, p_2 can be computed as the fraction of subsets of A of size n_OS that contain d and exactly i derivations not in P. Put differently, we count how many of the C(n_A, n_OS) possible samples that could be produced by our algorithm contain d and exactly i derivations from S_np:

p_2 = (C(n_A − n_P, i) · C(n_P − 1, n_OS − i − 1)) / C(n_A, n_OS)

Observe that the formulas for p_1 and p_2 only refer to constants that are independent of the choice of d. Thus, p(d ∈ S) is independent of the choice of d.
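The over-sampling size computation of Sec. 5.3 can be sketched as follows. This is a naive linear search for the minimum n_OS; [1, 28] describe more efficient algorithms, and the function names here are ours:

```python
from math import comb

def p_at_least(n_os, n_s, p_prov):
    """p(X >= n_s) for X ~ Binomial(n_os, p_prov): the probability that
    at least n_s of n_os sampled derivations are why-not provenance."""
    return sum(comb(n_os, i) * p_prov**i * (1 - p_prov)**(n_os - i)
               for i in range(n_s, n_os + 1))

def min_oversample_size(n_s, p_prov, p_success=0.999):
    """Smallest n_os such that the sample contains at least n_s
    why-not derivations with probability above p_success."""
    n_os = n_s
    while p_at_least(n_os, n_s, p_prov) <= p_success:
        n_os += 1
    return n_os
```

Since p_prov is typically close to 1 for why-not provenance, n_OS usually only needs to be slightly larger than n_S.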
6. GENERATING PATTERN CANDIDATES
We now explain the candidate generation step of our summarization approach (phase 2 in Fig. 3). Consider a provenance question (PQ) Φ = (t, Whynot) for a query Q. For any rule r of Q, let n be the number of unbound variables, i.e., |vars(r_t)| where r_t is the unified rule for r and t, and m be the number of goals in r. The number of possible patterns for r_t is in O((|D| + n)^n · 2^m), because for each variable of r_t we can choose either a placeholder or a value from D and for each goal we have to pick one of two possible annotations (F or T). Note that the names of placeholders are irrelevant to the semantics of a pattern, e.g., patterns p = (A, 3) and p' = (B, 3) are equivalent (matching the same derivations). That is, we only have to decide which arguments of a pattern are placeholders and which arguments share the same placeholder. Thus, it is sufficient to only consider n distinct placeholders P_i (where 1 ≤ i ≤ n) when creating patterns for a unified rule r_t with n variables.

Example. Consider rule r_t_ex from Fig. 4. Let D = {1, 2, 3, 4, 5, 6} and P = {P_1, P_2}. Let us for now ignore goal annotations. Note that, taking the predicate X < 4 into account, any pattern where X ≥ 4 cannot possibly match any derivations for this rule and, thus, we only have to consider patterns where X is bound to a constant less than 4 or a placeholder. The set of viable patterns is:

r_t_ex(P_1, P_2), r_t_ex(P_1, 1), ..., r_t_ex(P_1, 6), r_t_ex(1, P_2), ..., r_t_ex(3, P_2), r_t_ex(1, 1), ..., r_t_ex(1, 6), ..., r_t_ex(3, 1), ..., r_t_ex(3, 6)

This set contains 28 elements. Considering goal annotations (F, F), (F, T), and (T, F), we get 28 · 3 = 84 patterns.

Given the O((|D| + n)^n · 2^m) complexity, it is not feasible to enumerate all possible patterns. Instead, we adapt the Lowest Common Ancestor (LCA) method [8, 9] for our purpose which generates a number of pattern candidates from the derivations in a sample S that is at most quadratic in n_S. Thus, this approach sacrifices completeness to achieve better performance. Given a set of derivations (tuples in the work from [8, 9]), the LCA method computes the cross-product of this set with itself and generates candidate explanations by generalizing each such pair. The rationale is that each pattern generated in this fashion will at least match two derivations (or one derivation for the special case where a derivation is paired with itself). In our adaptation, we match derivations on the goal annotations such that only derivations with the same success/failure status of goals are paired. For each pair of derivations d_1 = (a_1, ..., a_n) − (ḡ) and d_2 = (b_1, ..., b_n) − (ḡ), we generate a pattern p = (c_1, ..., c_n) − (ḡ). We determine each element c_i in p as follows. If a_i = b_i then c_i = a_i. That is, constants on which d_1 and d_2 agree are retained. Otherwise, c_i is a fresh placeholder.

Example. Reconsider the unified rule r_t_ex and instance R from Fig. 4. Two example annotated rule derivations are d_1 = r_t_ex(2, 1) − (F, F) and d_2 = r_t_ex(2, 5) − (F, F). LCA generates a pattern p = r_t_ex(2, Z) − (F, F) to generalize d_1 and d_2 because d_1[1] = d_2[1] = 2 (and, thus, this constant is retained) and p[2] = Z since d_1[2] = 1 ≠ 5 = d_2[2].

We apply LCA to the sample S created using Q_sample from Sec. 5.2. Using LCA, we avoid generating exponentially many patterns and, thus, improve the runtime of pattern generation from O(|D|^n) to O(n_S^2) where typically n_S ≪ |D|. Furthermore, this optimization reduces the input size for the final stages of the summarization process, leading to additional performance improvements. Even though LCA is only a heuristic, we demonstrate experimentally in Sec. 9 that it performs well in practice.

Implementation.
We implement the LCA method as a query Q_lca joining the query Q_sample (the query producing S) with itself on a condition θ_lca := ⋀_{i=1}^{m} g_i = g'_i where m is the number of goals of the rule r of Q (recall that we create patterns for each rule of Q independently and merge in the final step) and primed attributes refer to the second copy of Q_sample. Patterns are generated using a projection on a list of expressions A_lca, where the i-th argument of a pattern is determined as if(X_i = X'_i) then X_i else NULL. Note that the LCA method never generates patterns where the same placeholder appears more than once. Thus, it is sufficient to encode placeholders as NULL values.

Q_lca := δ(Π_{A_lca}(Q_sample ⋈_{θ_lca} Q_sample))

The query generated for our running example is:

Q_lca := δ(Π_{e_X→X, e_Z→Z}(Q_sample ⋈_{(g_1=g'_1) ∧ (g_2=g'_2)} Q_sample))
e_X := if(X = X') then X else NULL
e_Z := if(Z = Z') then Z else NULL
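A minimal Python sketch of our LCA adaptation (derivations are represented as (arguments, goal-annotations) pairs and placeholders are encoded as None, mirroring the NULL encoding above; the helper names are ours):

```python
from itertools import combinations_with_replacement

def lca_pair(d1, d2):
    """Generalize a pair of annotated derivations into a pattern:
    agreeing constants are retained, disagreements become placeholders.
    Only derivations with identical goal annotations are paired."""
    (args1, goals1), (args2, goals2) = d1, d2
    if goals1 != goals2:
        return None
    args = tuple(a if a == b else None for a, b in zip(args1, args2))
    return (args, goals1)

def lca_candidates(sample):
    """Pattern candidates from all pairs, including a derivation paired
    with itself (which yields the derivation as a fully-constant pattern)."""
    return {p for d1, d2 in combinations_with_replacement(sample, 2)
            if (p := lca_pair(d1, d2)) is not None}
```

For d_1 = ((2, 1), ('F', 'F')) and d_2 = ((2, 5), ('F', 'F')), `lca_pair` yields ((2, None), ('F', 'F')), i.e., the pattern r_t_ex(2, Z) − (F, F) from the example.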
7. ESTIMATING COMPLETENESS
To generate a top-k summary in the next step, we need to calculate the informativeness (Def. 8) and completeness (Def. 7) quality metrics for sets of patterns. Informativeness can be computed from patterns without accessing the data. Recall that completeness is computed as the fraction of provenance matched by a pattern: cp(p) = |M(Q, D, p, Φ)| / |Prov(Φ)|. Since we can materialize neither M(Q, D, p, Φ) nor Prov(Φ) for the why-not provenance, we have to estimate their sizes. In this section, we focus on how to estimate the completeness of individual patterns. How to compute the completeness metric for sets of patterns will be discussed in Sec. 8. To determine whether a derivation d ∈ Prov(Φ) with goal annotations ḡ_1 matches a pattern p with goal annotations ḡ_2, i.e., whether d is in M(Q, D, p, Φ), we have to check that ḡ_1 = ḡ_2 and that a valuation exists that maps p to d. Then, we count the number of such derivations to compute |M(Q, D, p, Φ)|. The existence of a valuation can be checked in time linear in the number of arguments of p by fixing a placeholder order and, then, assigning to each placeholder in p the corresponding constant in d if a unique such constant exists. The valuation fails if p and d end up having two different constants at the same position.

Example. Continuing with Ex. 8, we compute the completeness of the pattern p = r_t_ex(2, Z) − (F, F). For the sake of the example, assume that Prov(Φ_ex) consists of:

d_1 = r_t_ex(2, 1) − (F, F)    d_2 = r_t_ex(2, 2) − (F, T)
d_3 = r_t_ex(2, 3) − (T, F)    d_4 = r_t_ex(2, 4) − (T, F)
d_5 = r_t_ex(2, 5) − (F, F)    d_6 = r_t_ex(2, 6) − (F, F)

The completeness of p is cp(p) = 3/6 = 1/2 because p matches all 3 derivations (d_1, d_5, and d_6) for which both goals have failed by assigning Z to 1, 5, and 6.

To estimate the completeness of a pattern p, we compute the number of matches of p with derivations from the sample S produced by Q_sample as discussed in Sec. 5. As long as S is an unbiased sample of Prov(Φ), the fraction of derivations from S matching the pattern is an unbiased estimate of the completeness of the pattern. Continuing with Ex. 9, assume that we created a sample S containing four of the six derivations above. Estimating the completeness of p based on S, the fraction of derivations from S matching p is our estimate of cp(p).

Implementation. We generate a query Q_match which joins the query Q_lca generating pattern candidates with Q_sample, the query generating the sample derivations. Let r_t be the rule for which we are generating patterns and A be the result attributes of Q_lca. We count the number of matches per pattern by grouping on A:

Q_match := γ_{A, count(*)}(Q_lca ⋈_{θ_match} Q_sample)

Recall that we encode placeholders as NULL values. Condition θ_match is a conjunction of conditions, one for each argument X of the pattern/derivation: X = X' ∨ isnull(X). Since the number of candidates produced by LCA is at most quadratic in n_S, matching is in O(n_S^2 · n_S) = O(n_S^3). For our running example, we would create the following query:

Q_match := γ_{X, Z, g_1, g_2, count(*)}(Q_lca ⋈_{θ_match} Q_sample)
θ_match := (X = X' ∨ isnull(X)) ∧ (Z = Z' ∨ isnull(Z))
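Matching and completeness estimation can be sketched in Python. Since LCA patterns never repeat a placeholder, a None-based encoding suffices and no explicit valuation needs to be constructed; the derivations below follow the example above, and the function names are ours:

```python
def matches(pattern, derivation):
    """A derivation matches a pattern iff the goal annotations agree and
    every non-placeholder argument of the pattern equals the derivation's."""
    (p_args, p_goals), (d_args, d_goals) = pattern, derivation
    return p_goals == d_goals and all(
        pa is None or pa == da for pa, da in zip(p_args, d_args))

def estimate_completeness(pattern, sample):
    """Unbiased estimate: fraction of sampled derivations matching the pattern."""
    return sum(matches(pattern, d) for d in sample) / len(sample)
```

Applied to the six derivations of the example, the pattern r_t_ex(2, Z) − (F, F) matches exactly the three derivations with annotation (F, F), giving a completeness of 1/2.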
8. COMPUTING TOP-K SUMMARIES
We now explain how to compute a top-k provenance summary for a provenance question Φ (phase 4 in Fig. 3). This is the only step that is evaluated on the client side. Its input is the set of patterns (denoted as Pat_lca) with completeness estimates returned by evaluating query Q_match (Sec. 7). We have to find the set S ⊆ Pat_lca of size k that maximizes sc(S). A brute-force solution would enumerate all such subsets, compute their scores (which requires us to compute the union of the matches for each pattern in the set to compute completeness), and return the one with the highest score. However, the number of candidates is C(|Pat_lca|, k) and this would require us to evaluate a query to compute matches for each candidate. Our solution uses lower and upper bounds on the completeness of pattern sets that can be computed based on the patterns and their completeness alone, avoiding additional queries. Furthermore, we use a heuristic best-first search method to incrementally build candidate sets, guiding the search using these bounds.

In general, the exact completeness of a set of patterns cannot be directly computed based on the completeness of the patterns of the set, because the sets of derivations matching two patterns may overlap. We present two conditions that allow us to determine in some cases whether the match sets of two patterns are disjoint or one is contained in the other. We say a pattern p_2 generalizes a pattern p_1, written as p_1 ⪯_p p_2, if ∀i: p_1[i] = p_2[i] ∨ p_2[i] ∈ P (the infinite set of placeholders) and they have the same goal annotations. For instance, (X, Y, a) − (F, F) generalizes (X, b, a) − (F, F). From Def. 5, it immediately follows that if p_1 ⪯_p p_2 then M(Q, D, p_1, Φ) ⊆ M(Q, D, p_2, Φ) since any derivation matching p_1 also matches p_2 and, thus, cp({p_1, p_2}) = cp(p_2). We say patterns p_1 and p_2 are disjoint, written as p_1 ⊥_p p_2, if (i) they are from different rules, (ii) they do not share the same goal annotations, or (iii) there exists an i such that p_1[i] = c_1 ≠ c_2 = p_2[i], i.e., the patterns have a different constant at the same position i. If p_1 ⊥_p p_2, then M(Q, D, p_1, Φ) ∩ M(Q, D, p_2, Φ) = ∅ and, thus, we have cp({p_1, p_2}) = cp(p_1) + cp(p_2). Note that for any S, cp(S) is trivially bounded from below by max_{p∈S} cp(p) (making the worst-case assumption that all patterns fully overlap) and by min(1, Σ_{p∈S} cp(p)) from above (completeness is maximized if there is no overlap). Using generalization and disjointness, we can refine these bounds. Note that generalization is transitive. To use generalization to find tighter upper bounds on completeness for a pattern set S, we compute the set S_ub = {p | p ∈ S ∧ ¬∃p' ∈ S: p' ≠ p ∧ p ⪯_p p'}. Any pattern not in S_ub is generalized by at least one pattern from S_ub. For disjointness, if we have a set of patterns S for which patterns are pairwise disjoint, then cp(S) = Σ_{p∈S} cp(p). Based on this observation, we find the subset S_lb of pairwise disjoint patterns from S that maximizes completeness, i.e., S_lb = argmax_{S'⊆S ∧ ∀p≠p'∈S': p ⊥_p p'} Σ_{p∈S'} cp(p). We use S_lb and S_ub to define a lower bound cp_lb(S) and an upper bound cp_ub(S) on the completeness of a pattern set S:

cp_lb(S) := Σ_{p∈S_lb} cp(p)        cp_ub(S) := Σ_{p∈S_ub} cp(p)

Example.
Consider the following patterns for r_t_ex: p = (2, Z) − (F, F), p' = (3, Z) − (F, F), p'' = (2, 1) − (F, F). Consider S = {p, p', p''} and observe that p ⊥_p p', p' ⊥_p p'', and p'' ⪯_p p. Thus, S_ub = {p, p'} (the pattern p'' is generalized by p) and S_lb = {p, p'} (while p' ⊥_p p'' also holds, cp(p) + cp(p') ≥ cp(p') + cp(p'') because p'' ⪯_p p implies cp(p'') ≤ cp(p)). We get cp_lb(S) = cp(p) + cp(p') = cp_ub(S), from which follows that cp(S) = cp(p) + cp(p') exactly. Note that, without using generalization and disjointness, we would have to settle for a lower bound of max_{p∈S} cp(p) and an upper bound of min(1, Σ_{p∈S} cp(p)).

We apply a best-first search approach to compute an approximate top-k summary given a set of patterns Pat_lca. Our approach maintains a priority queue of candidate sets sorted on a lower bound sc_lb for the score of candidate sets that we compute based on the completeness bound cp_lb introduced above. We also maintain an upper bound sc_ub. For a set S of size k, we can compute info(S) exactly. For incomplete candidates (size less than k), we bound the informativeness and completeness of any extension of the candidate into a set of size k using worst-case/best-case assumptions. For example, to bound completeness for an incomplete candidate S from above, we assume that the remaining patterns will not overlap with any pattern from S and have maximal completeness (max_{p∈Pat_lca} cp(p)). Note that computing S_lb is an instance of the intractable maximum-weight clique problem; for reasonably small k, we can solve the problem exactly and otherwise apply a greedy heuristic.

Figure 5: Queries used in the experiments (rules InvalidD, Fsenior, CasualWatch, three variants of Players, DirGen, TomKey, FavCom, ActMov, CommCrime, CrimeSince, Hops, and Custs over the LICENSE, MOVIES, CRIME, DBLP, and TPC-H schemas; see the figure for the rule bodies).
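The completeness bounds can be sketched as follows. Patterns are represented as (rule, arguments, goal-annotations) triples with None as placeholder; the exhaustive subset search for the lower bound is exponential and only meant for small sets, and the function names and cp values used below are illustrative assumptions:

```python
from itertools import combinations

def generalizes(p1, p2):
    """True iff p2 generalizes p1: same rule and goal annotations, and p2
    has a placeholder (None) or an equal constant at every position."""
    (r1, a1, g1), (r2, a2, g2) = p1, p2
    return r1 == r2 and g1 == g2 and all(
        y is None or x == y for x, y in zip(a1, a2))

def disjoint(p1, p2):
    """True iff the match sets of p1 and p2 are provably disjoint."""
    (r1, a1, g1), (r2, a2, g2) = p1, p2
    return r1 != r2 or g1 != g2 or any(
        x is not None and y is not None and x != y for x, y in zip(a1, a2))

def cp_bounds(patterns, cp):
    """Lower/upper completeness bounds for a pattern set, given per-pattern
    completeness estimates cp (a dict from pattern to estimate)."""
    # upper bound: sum over patterns not generalized by another pattern
    s_ub = [p for p in patterns
            if not any(q != p and generalizes(p, q) for q in patterns)]
    upper = min(1.0, sum(cp[p] for p in s_ub))
    # lower bound: best pairwise-disjoint subset (exhaustive search)
    lower = max((sum(cp[p] for p in sub)
                 for k in range(1, len(patterns) + 1)
                 for sub in combinations(patterns, k)
                 if all(disjoint(a, b) for a, b in combinations(sub, 2))),
                default=0.0)
    return lower, upper
```

For the three patterns of the example, both bounds coincide at cp(p) + cp(p'), so the set's completeness is known exactly.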
We initialize the priority queue with all singleton subsets of Pat_lca and then repeatedly take the incomplete candidate set with the highest sc_lb, extend it by one pattern from Pat_lca in all possible ways, and insert the new candidates into the queue. The algorithm terminates when a complete candidate S_best is produced whose sc_lb is higher than the highest sc_ub value of all candidates produced so far (efficiently maintained using a max-heap sorted on sc_ub). In this case, we return S_best since it is guaranteed to have the highest score (of course, completeness is only an estimate) even though we do not know its exact value. The algorithm also terminates when all candidates have been produced but no such S_best has been found. In this case, we apply the following heuristic: we return the set with the highest average of the two bounds, (sc_lb + sc_ub)/2.
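The best-first search described above can be sketched as follows. This is a simplified Python sketch: `sc_lb`/`sc_ub` are assumed callbacks implementing the score bounds, and for brevity the termination test compares against the candidates still queued rather than a separate max-heap over all candidates produced so far.

```python
import heapq
from itertools import count

def top_k_summary(patterns, k, sc_lb, sc_ub):
    """Best-first search for a top-k summary over candidate pattern sets.
    sc_lb(S) / sc_ub(S) are assumed to bound the score of the best size-k
    extension of candidate S (derived from the completeness bounds)."""
    tie = count()  # tie-breaker so the heap never compares sets directly
    queue = [(-sc_lb(frozenset([p])), next(tie), frozenset([p]))
             for p in patterns]
    heapq.heapify(queue)
    best, best_avg = None, float("-inf")
    while queue:
        neg_lb, _, cand = heapq.heappop(queue)
        if len(cand) == k:
            # a complete candidate wins once its lower bound beats the
            # upper bound of every candidate still waiting in the queue
            if not queue or -neg_lb >= max(sc_ub(c) for _, _, c in queue):
                return cand
            avg = (-neg_lb + sc_ub(cand)) / 2  # heuristic fallback score
            if avg > best_avg:
                best, best_avg = cand, avg
        else:
            for p in patterns:  # extend incomplete candidates by one pattern
                if p not in cand:
                    ext = cand | {p}
                    heapq.heappush(queue, (-sc_lb(ext), next(tie), ext))
    return best  # no provable winner: highest average (lb + ub) / 2
```

With exact scores supplied as both bounds, the search degenerates into plain best-first enumeration and returns an optimal size-k set.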
9. EXPERIMENTS
We evaluate (i) the performance of computing summaries and (ii) the quality of the summaries produced by our technique.
Experimental Setup.
All experiments were executed on a machine with 2 × 3.3 GHz AMD Opteron CPUs (12 cores) and 128 GB RAM running Oracle Linux 6.4. We use a commercial DBMS (name omitted due to licensing restrictions).
Q    Why                 Why-not
r    new york            swanton
r    brooklyn            delaware
r    drama (E)           family (E)
r    jack black          tom ford
r    steven spielberg    robert altman
r    mission (K)         spying (K)
r    forrest gump        babysitting
r    fight club          avalanche
r    battery             domestic violence
r    theft               ritualism
r    -                   xueni pan
r    -                   various
Figure 6: Bindings for why and why-not provenance questions used in the experiments.
Dataset   Attribute: number of distinct values
LICENSE   I: 16M, B: 118, G: 2, T: 64
CRIME     I: 6M, Y: 19, L: 181, C: 105
MOVIES    I: 45K, T: 42K, Y: 135, R: 350, P: 44K, B: 1K, V: 7K, E: 20;
          C: 61K; CASTS C: 24K, H: 170K, G: 3; RATINGS U: 270K, G: 10, S: 21M
Figure 7: Number of distinct values from the largest datasets for attributes of the experimental queries
Datasets.
We use TPC-H and several real-world datasets: (i) the New York State (NYS) license dataset (∼16M tuples), (ii) a movie dataset (∼26M tuples), (iii) a Chicago crime dataset (∼6M tuples), and (iv) a co-author graph relation extracted from DBLP. For each dataset, we created several subsets; Rx denotes a subset of R with x rows.

Queries.
Fig. 5 shows the queries used in the experiments. In Fig. 7, we provide the number of distinct values in the largest datasets of (i), (ii), and (iii) for the attributes of these queries. For the license dataset, we use InvalidD (r), which returns cities with invalid driver's licenses, and Fsenior (r), which returns cities with valid licenses held by female seniors. For the movie dataset, CasualWatch (r) returns movies with their genres and production companies if their runtime is less than 100 minutes and they have received high ratings (G ≥ 4). Players (r) computes actresses/actors who have been successful (rating higher than 4) in a romantic comedy after 1999. In addition, DirGen (r) computes the names of persons who have directed a movie with a budget of over 2M dollars, and TomKey (r) returns the titles, keywords, and genres of movies that Tom Cruise has played in. FavCom (r) computes popular comedy movies (those that have received ratings over 3), and ActMov (r) returns the titles of action movies that have been rated with the highest score. For the crime dataset, CommCrime (r) and CrimeSince (r) return the types of crimes without an arrest in the Austin community and anywhere since 2012, respectively. For the DBLP dataset, Hops (r) returns authors that are connected to each other by a path of length 6 in the co-author graph. Over TPC-H, Custs (r) returns the ids and nations of customers who have at least one order.

We consider samples of varying size (Sx denotes a sample with x rows). Furthermore, Full denotes using the full provenance as input for summarization. Missing bars indicate timed-out experiments (30 minute default timeout).

https://data.ny.gov/Transportation/Driver-License-Permit-and-Non-Driver-Identificatio/a4s2-d9tt
https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2
Figure 8: Measuring performance of generating summaries for why and why-not provenance for queries. Panels show runtime in seconds vs. dataset size; the numbers above each plot give the number of derivations in the provenance, bars are broken down into the Samp, PatG, Qual, and Topk phases, and the legend distinguishes S100, S1K, S10K, and FULL: (a) InvalidD, why (dataset sizes 1K-16M); (b) InvalidD, why-not; (c) CasualWatch, why (29K-26M); (d) CasualWatch, why-not; (e) Players, why; (f) Players, why-not; (g) DirGen, why; (h) DirGen, why-not; (i) TomKey, why; (j) TomKey, why-not; (k) FavCom, why (1K-16M); (l) FavCom, why-not.
Dataset Size.
We measure the runtime of our approach for computing top-3 summaries, varying the dataset and sample size, over the queries from Fig. 5. On the x-axis of the plots, we show both the provenance size and the dataset size. For r, we use a provenance question that binds C to new york (why) and swanton (why-not), respectively (the bindings for all provenance questions are shown in Fig. 6). Observe that, even for the largest dataset, we are able to generate summaries within reasonable time when using sampling. Overall, pattern generation dominates the runtime for why provenance. For why-not provenance, sampling dominates the runtime for smaller sample sizes while pattern generation is dominant for S10K. FULL does not finish even for the smallest license dataset (1K). For queries r (many joins with a negation) and r (a union of r′ and r′′), Fig. 8c and 8e show the runtimes for why, and Fig. 8d and 8f show the runtimes for why-not. We observe the same trend as for r even though the why-not provenance is significantly larger. This trend is preserved for the runtimes of the other queries (Fig. 8g, 8i, and 8k for why and Fig. 8h, 8j, and 8l for why-not, respectively).

Performance Comparison with Naive Approach.
We also compare the performance of our sample-based summaries to summarization over the full provenance (FULL). FULL is shown as '+' or as red bars in Fig. 8. For why provenance, the runtime of FULL increases quadratically in the number of successful derivations, while our summaries exhibit almost linear growth. Computing summaries over the FULL why-not provenance is not feasible within the allocated time for any dataset size (the bars are omitted in Fig. 8).

Figure 9: Runtime for computing top-k summaries when patterns are provided as input (x-axis: number of patterns, from 106 to 11996; legend: top1, top3, top5, top10).
Generating Top-k summaries. Fig. 9 shows the runtime of computing the top-k summary over the patterns produced by the first three steps (Sec. 5 through Sec. 7). We selected sets of patterns from different queries and sample sizes to get a roughly exponential progression of the number of patterns, and we vary k from 1 to 10. The runtime is roughly linear in k and in the number of patterns. Note that this is significantly better than the theoretical worst case of our algorithm (O(n^k), where n is the number of patterns).

Figure 10: Varying query complexity and structure. Panels show runtime in seconds (legend: S100, S1K, S10K): (a) chain query and (b) star query, varying the number of joins; (c) chain query and (d) star query, varying the number of bound variables; (e) r over DBLP, varying the path length; (f) r over TPC-H, varying the number of existential variables.
Query Complexity and Structure.
In this experiment, we vary the query complexity in terms of the number of joins and variables. We randomly generated synthetic queries whose join graph is either a star or a chain, and we compute the top-3 patterns for why-not provenance. In Fig. 10a and 10b, we vary the number of joins. The results confirm that our approach scales to very large provenance sizes (more than 10^60 derivations, cf. the axes of Fig. 10a and 10b) regardless of the join type. To evaluate the impact of the number of variables on performance, we use chain queries with 8-way joins and star queries with 5-way joins. We vary the number of variables bound to constants by the query from 1 to 16 (out of 25 variables); the head and join variables are never bound. The results shown in Fig. 10c and 10d confirm that our approach works well, even for queries with up to 24 unbound variables (provenance sizes of up to ∼5·10^89).

We now extend the evaluation with other queries and datasets. To further evaluate the impact of the number of variables, we use query r (with the join size fixed to 3) over TPC-H and compute summaries of why-not provenance. By binding an increasing number of variables from r to constants, we generate 6 rules that contain between 5 and 29 existential variables. The result shown in Fig. 10f confirms that our approach scales well over the TPC-H dataset. Using the DBLP dataset, we vary the number of joins (path length) of query r. For example, Hops(L) :- DBLP(L, R), DBLP(R, R1) is the query we use for a 2-way join. We use a provenance question that binds L = xueni pan (Fig. 6). Fig. 10e shows that, even for this real-world dataset (with a 6-way join where the provenance contains a very large number of derivations), we produce a summary for sample sizes S100 and S1K.

We now measure the difference between the quality metrics approximated using sampling and the exact values when using the full provenance. For why-not provenance, where it is not feasible to compute the full provenance, we compare against the largest sample size (S10K) instead.

Figure 11: Quality metric error caused by sampling. Panels show relative error in % (legend: S100, S1K, S10K): (a) why (InvalidD); (b) why-not (InvalidD); (c) why (CrimeSince); (d) why-not (CrimeSince).

Figure 12: Completeness, varying k. Panels show completeness for queries r1-r4 and k ∈ {1, 3, 5, 10}: (a) why; (b) why-not.
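The synthetic chain- and star-query workload used in the query-complexity experiment can be generated along the following lines. This is a hypothetical sketch: predicate arities, constant domains, and which variables are bindable are our assumptions and may differ from the actual generator.

```python
import random

def make_query(shape, n_joins, n_bound, seed=0):
    """Generate a synthetic Datalog rule whose join graph is a chain or a
    star, binding n_bound of the non-head, non-join variables to constants.
    shape: "chain" (R(X0,X1,..), R(X1,X2,..), ...) or
           "star"  (R(X0,X1,..), R(X0,X2,..), ...)."""
    rng = random.Random(seed)
    atoms, free = [], []
    for i in range(n_joins):
        # each atom R(A, B, C_i) carries one extra variable that may be bound
        a = "X0" if shape == "star" else f"X{i}"
        b = f"X{i + 1}"
        free.append(f"C{i}")
        atoms.append((a, b, f"C{i}"))
    bound = set(rng.sample(free, min(n_bound, len(free))))
    body = ", ".join(
        f"R({a}, {b}, {rng.randint(0, 9) if c in bound else c})"
        for a, b, c in atoms)
    return f"Q(X0) :- {body}"
```

Varying `n_joins` reproduces the join-count experiment, while varying `n_bound` reproduces the bound-variable experiment.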
Quality Metric Error.
Fig. 11a and 11b show the relative quality-metric error for query InvalidD over the license dataset, varying the sample size and k. The error is at most ∼2% and typically decreases in k. For query CrimeSince over the crime dataset, Fig. 11c and 11d show the results for why and why-not, respectively. Similarly, the overall relative error caused by sampling is quite low (below 1%) and decreases in k and with the sample size.

Summary Completeness.
Fig. 12 shows the completeness scores of the summaries returned by our approach for the queries from Fig. 5. We measure this by calculating the upper bound on the completeness of the set of top-k patterns for each query as described in Sec. 8. For k = 10, we achieve 100% completeness for why provenance and ∼75% completeness for why-not, except for r (Fig. 12b), for which the relatively large number of distinct values in the domains of the unbound variables prevents us from achieving better completeness scores.

We now compare our system against [12] (an all-derivations approach) and a single-derivation approach implemented in our system.

Figure 13: Performance comparisons for why-not. Panels show runtime in seconds: (a) Artemis vs. PUG-Summ (dataset sizes 1.2K to 20K); (b) SingleDer vs. PUG-Summ (dataset sizes 1K to 16M).
Artemis.
The authors of [12] made their system Artemis available as a virtual machine (VM). We ran both systems in this VM (4GB memory) and used Postgres as a backend since it is supported by both systems. We used a query from the VM installation that computes the names of witnesses (N) who saw a person with a particular cloth and hair color (C and H) perpetrating a crime of a particular type (T):

CrimeDesc(T, N, C, H) :- CRIME(T, S), WITNESS(N, S), SAWPERSON(N, H, C), PERSON(M, H, C), S > …

The provenance question binds T = 'trespassing', N = 'aaron-golden', C = 'midnightblue', and H = 'lavender'. The original CRIME dataset was scaled up (the dataset sizes are shown on the x-axis of Fig. 13a). We use ∼10% as the sample size (e.g., S2K for the 20K instance) and compute top-5 summaries. The result of this comparison is shown in Fig. 13a: our system (PUG-Summ) outperforms Artemis by a large margin. Moreover, the returned summaries are meaningful, e.g., the pattern p = (tresp., aaron-golden, midnightblue, lavender, S, M) with S > …, and the pattern p′ = (tresp., aaron-golden, midnightblue, lavender, ·, M), which covers ∼50% of the provenance.

Single Derivation Approach.
We implemented a simple single-derivation approach (SingleDer) in our system by applying n_S = 1. That is, the explanation is computed based on only one value from the associated domain for each unbound variable. We use query r from Fig. 5, sample size S1K, and compute a top-3 summary. As shown in Fig. 13b, SingleDer outperforms PUG-Summ by about an order of magnitude for small datasets. The gap between the two approaches is less significant for larger datasets (more than 1M tuples).
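The n_S-controlled sampling that distinguishes SingleDer (n_S = 1) from the full approach can be sketched as follows. This is a hypothetical simplification: `domains` maps each unbound variable to its associated domain, and the actual sampling in PUG-Summ may be more involved.

```python
import random
from itertools import product

def sample_bindings(unbound_vars, domains, n_s, seed=0):
    """Sample candidate bindings for the unbound variables of a failed rule
    derivation. With n_s = 1 this degenerates into a single-derivation
    approach (one value per unbound variable); larger n_s trades runtime
    for completeness of the resulting explanation."""
    rng = random.Random(seed)
    per_var = {v: rng.sample(sorted(domains[v]), min(n_s, len(domains[v])))
               for v in unbound_vars}
    keys = list(per_var)
    # the cross product of per-variable samples yields candidate derivations
    return [dict(zip(keys, combo))
            for combo in product(*(per_var[k] for k in keys))]
```

The number of candidate derivations grows as n_s to the power of the number of unbound variables, which is why the gap between SingleDer and larger sample sizes narrows only for large datasets.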
10. RELATED WORK
Compact Representation of Provenance.
The need for compressing provenance to reduce its size was recognized early on, e.g., [3, 7, 22]. However, the compressed representations produced by these approaches are often not semantically meaningful to users. More closely related to our work are techniques for generating higher-level explanations for binary outcomes [8, 29], missing answers [26], or query results [24, 30, 2], as well as methods for summarizing data or general annotations that may or may not encode provenance information [33]. Specifically, like [8, 29, 24, 30], we use patterns with placeholders. Some approaches use ontologies [26, 29] or logical constraints [24, 8, 30] to derive semantically meaningful and compact representations of a set of tuples. The use of constraints to compactly represent large or even infinite database instances has a long tradition [14, 17], and these techniques have been adopted to compactly explain missing answers [12, 23]. However, the compactness of these representations comes at the cost of computational intractability.
Missing Answers.
The missing-answer problem was first stated for query-based explanations (which parts of the query are responsible for the failure to derive the missing answer) in the seminal paper by Chapman et al. [6]. Most follow-up work [4, 5, 6, 27] is based on this notion. Huang et al. [13] first introduced an instance-based approach, i.e., determining which existing and missing input tuples caused the missing answer [12, 13, 18, 20]. Since then, several techniques have been developed to exclude spurious explanations and to support larger classes of queries [12]. As mentioned before, approaches for instance-based explanations use either the all-derivations approach (giving up performance) or the single-derivation approach (giving up completeness). In contrast, using summaries, we achieve performance by compactly representing large amounts of provenance without forsaking completeness. Artemis [12] uses c-tables to compactly represent sets of missing answers. However, this comes at the cost of additional computational complexity.
11. CONCLUSIONS
We have presented an approach for efficiently computing summaries of why and why-not provenance. Our approach uses sampling to generate summaries that are guaranteed to be concise while balancing completeness (the fraction of the provenance covered) and informativeness (new information provided by the summary). Thus, we overcome a severe limitation of prior work, which sacrifices either completeness or performance. We demonstrate experimentally that our approach efficiently produces meaningful summaries for provenance with very large numbers of derivations. In future work, we plan to investigate how to utilize additional information, e.g., integrity constraints, in the summarization process.
12. REFERENCES
[1] M. Abramowitz and I. A. Stegun. Handbook of mathematical functions: with formulas, graphs, and mathematical tables, volume 55. Courier Corporation, 1965.
[2] E. Ainy, P. Bourhis, S. Davidson, D. Deutch, and T. Milo. Approximated summarization of data provenance. In CIKM, pages 483-492, 2015.
[3] M. K. Anand, S. Bowers, T. McPhillips, and B. Ludäscher. Efficient Provenance Storage over Nested Data Collections. In EDBT, pages 958-969, 2009.
[4] N. Bidoit, M. Herschel, and K. Tzompanaki. Immutably answering why-not questions for equivalent conjunctive queries. In TaPP, 2014.
[5] N. Bidoit, M. Herschel, K. Tzompanaki, et al. Query-Based Why-Not Provenance with NedExplain. In EDBT, pages 145-156, 2014.
[6] A. Chapman and H. V. Jagadish. Why Not? In SIGMOD, pages 523-534, 2009.
[7] A. Chapman, H. V. Jagadish, and P. Ramanan. Efficient Provenance Storage. In SIGMOD, pages 993-1006, 2008.
[8] K. El Gebaly, P. Agrawal, L. Golab, F. Korn, and D. Srivastava. Interpretable and informative explanations of outcomes. PVLDB, 8(1):61-72, 2014.
[9] K. E. Gebaly, G. Feng, L. Golab, F. Korn, and D. Srivastava. Explanation tables. IEEE Data Eng. Bull., 41(3):43-51, 2018.
[10] T. Green, G. Karvounarakis, and V. Tannen. Provenance semirings. In PODS, pages 31-40, 2007.
[11] M. Herschel, R. Diestelkämper, and H. Ben Lahmar. A survey on provenance: What for? what form? what from? VLDB J., 26(6):881-906, 2017.
[12] M. Herschel and M. Hernandez. Explaining Missing Answers to SPJUA Queries. PVLDB, 3(1):185-196, 2010.
[13] J. Huang, T. Chen, A. Doan, and J. F. Naughton. On the provenance of non-answers to queries over extracted data. PVLDB, 1(1):736-747, 2008.
[14] T. Imieliński and W. Lipski Jr. Incomplete Information in Relational Databases. JACM, 31(4):761-791, 1984.
[15] Y. Ioannidis. The history of histograms (abridged). In VLDB, pages 19-30, 2003.
[16] S. Köhler, B. Ludäscher, and D. Zinn. First-order provenance games. In In Search of Elegance in the Theory and Practice of Computation, pages 382-399. Springer, 2013.
[17] G. M. Kuper, L. Libkin, and J. Paredaens, editors. Constraint Databases. Springer, 2000.
[18] S. Lee, S. Köhler, B. Ludäscher, and B. Glavic. A SQL-middleware unifying why and why-not provenance for first-order queries. In ICDE, pages 485-496, 2017.
[19] S. Lee, B. Ludäscher, and B. Glavic. Provenance summaries for answers and non-answers. PVLDB, 11(12):1954-1957, 2018.
[20] S. Lee, B. Ludäscher, and B. Glavic. PUG: a framework and practical implementation for why and why-not provenance. VLDB J., pages 1-25, 2018.
[21] S. Lee, X. Niu, B. Ludäscher, and B. Glavic. Integrating approximate summarization with provenance capture. In TaPP, 2017.
[22] D. Olteanu and J. Závodný. On factorisation of provenance polynomials. In TaPP, 2011.
[23] S. Riddle, S. Köhler, and B. Ludäscher. Towards constraint provenance games. In TaPP, 2014.
[24] S. Roy and D. Suciu. A formal approach to finding explanations for database queries. In SIGMOD, pages 1579-1590, 2014.
[25] V. Tannen. Provenance analysis for FOL model checking. ACM SIGLOG News, 4(1):24-36, 2017.
[26] B. ten Cate, C. Civili, E. Sherkhonov, and W.-C. Tan. High-level why-not explanations using ontologies. In PODS, pages 31-43, 2015.
[27] Q. T. Tran and C.-Y. Chan. How to conquer why-not questions. In SIGMOD, pages 15-26, 2010.
[28] E. Von Collani and K. Dräger. Binomial distribution handbook for scientists and engineers. Springer Science & Business Media, 2001.
[29] X. Wang, X. L. Dong, and A. Meliou. Data X-Ray: A diagnostic tool for data errors. In SIGMOD, pages 1231-1245, 2015.
[30] E. Wu and S. Madden. Scorpion: Explaining away outliers in aggregate queries. PVLDB, 6(8):553-564, 2013.
[31] Y. Wu, A. Haeberlen, W. Zhou, and B. T. Loo. Answering why-not queries in software-defined networks with negative provenance. In Proceedings of the Twelfth ACM Workshop on Hot Topics in Networks (HotNets), page 3, 2013.
[32] Y. Wu, M. Zhao, A. Haeberlen, W. Zhou, and B. T. Loo. Diagnosing missing events in distributed systems with negative provenance. In SIGCOMM, pages 383-394, 2014.
[33] D. Xiao and M. Y. Eltabakh. InsightNotes: Summary-based annotation management in relational databases. In