COBRA: Compression via Abstraction of Provenance for Hypothetical Reasoning
Daniel Deutch
Tel Aviv University
Yuval Moskovitch
Tel Aviv University
Noam Rinetzky
Tel Aviv University
Abstract
Data analytics often involves hypothetical reasoning: repeatedly modifying the data and observing the induced effect on the computation result of a data-centric application. Recent work has proposed to leverage ideas from data provenance tracking towards supporting efficient hypothetical reasoning: instead of a costly re-execution of the underlying application, one may assign values to a pre-computed provenance expression. A prime challenge in leveraging this approach for large-scale data and complex applications lies in the size of the provenance. To this end, we present a framework that reduces provenance size. Our approach is based on reducing the provenance granularity using abstraction. We propose a demonstration of COBRA, a system that allows users to examine the effect of the provenance compression on the anticipated analysis results. We will demonstrate the usefulness of COBRA in the context of business data analysis.
Hypothetical reasoning involves examining the effect on a query/application result of modifying its input. It is a particular type of data analytics that is of great importance to analysts aiming at achieving a better understanding of the data and applications at hand, thereby optimizing either or both.

Recent work [3] has proposed to leverage ideas from data provenance tracking towards supporting efficient hypothetical reasoning. The high-level idea is to instrument the data with symbolic variables, either at the cell or tuple level. Then, existing provenance models such as [5] or [2] define how these variables propagate through query evaluation, to form provenance polynomials, which may be regarded as a symbolic representation of the query result. The polynomial construction has the property that it commutes with variable valuations [2], i.e., applying valuations directly to the computed polynomials is guaranteed to yield the same result as replacing the variables with the corresponding values in the input and then re-executing the query. Importantly, the former (applying valuations to the polynomial) is typically much faster than the latter (re-running the query), and so the commutativity property serves as a correctness guarantee for the reasoning process.

A prime challenge in leveraging this approach for large-scale data and complex applications lies in the provenance size. The instrumentation process described above often results in very large provenance polynomials. While it is reasonable to expect rather powerful hardware, e.g., a cluster or a cloud, for generating the provenance, this assumption cannot be made for the actual interaction with the provenance expressions, since valuations may be applied by multiple analysts, possibly using weaker hardware. Thus, requiring each such analyst to store and manipulate large polynomials may be infeasible.

In a paper that is to appear in SIGMOD ’19 [4], we presented a framework for the reduction of provenance size for hypothetical reasoning. The framework is based on the notion of abstraction; the main idea is that instead of assigning a distinct variable per cell/tuple, we can often group variables together, forming an abstract “meta-variable”. By doing so, we decrease the degree of freedom for hypotheticals (because now we are forced to assign a single value to all grouped variables), but we also gain in provenance size: distinct monomials may become identical, in which case they are compactly represented by a single monomial (by summing their coefficients). Whether or not it makes sense to group variables together depends on their semantics; to enable meaningful abstraction we introduced in [4] the notion of abstraction trees, which resemble ontologies over the provenance variables, to guide and restrict the allowed groupings.

We propose to demonstrate our solution, which we implemented in a system called COBRA (for “COmpression using aBstRAction trees”). The system allows examining the effect of the provenance compression on the anticipated analysis results. The framework is based on the algorithm presented in [4], and is designed to assist the meta-analyst in determining the desired bound over the compressed provenance size and in constructing the abstraction tree. This is done by presenting the changes in the analysis query results using valuation of the compressed provenance with respect to valuation of the full provenance. In more detail,
COBRA gets as input provenance polynomials, generated by any provenance engine. The meta-analyst provides the system with a valuation for the provenance variables, an abstraction tree and a bound over the compressed provenance size. Once the abstraction tree and bound are set, COBRA computes an abstraction over the variables. The abstraction meta-variables are then presented to the user, who may assign values or use the default values (the average of the original values) set by the system. Finally, the system illustrates the effect of the compression on the analysis results by presenting the user with the query result using the full provenance compared with the result using the compressed provenance, the resulting provenance size, and the speedup in the assignment time.

We will demonstrate
COBRA in the context of business data analysis, using the synthetic telephony company database described below, as well as data generated by the TPC Benchmark H. We will walk the audience through the process of building the abstraction trees, and let them interactively examine the effect of the bound on the query results, provenance size and assignment time.

Cust
ID    Plan   Zip
...   ...    ...

Calls
CID   Mo    Dur
...   ...   ...

Plans
Plan                     Mo   Price
Plan A                   1    0.4
Family1 (F1)             1    0.35
Youth1 (Y1)              1    0.3
Veterans (V)             1    0.25
Small Business1 (SB1)    1    0.1
Small Business2 (SB2)    1    0.1
Enterprise (E)           1    0.05
...                      ...  ...
A                        3    0.5
F1                       3    0.35
Y1                       3    0.25
V                        3    0.2
SB1                      3    0.1
SB2                      3    0.15
E                        3    0.05
...                      ...  ...

Figure 1: Example database
Related Work
Provenance summarization was studied in multiple contexts, e.g., for probability computation [7] or explanations [6]. The main novel aspects of the present work are: (i) the problem setting, which includes the use of abstraction trees that both restrict and guide the summarization, and (ii) our novel compression algorithms and analysis that leverage the presence of such trees. Indeed, the way that we use these trees to define our optimization problem is geared towards hypothetical reasoning, where one wishes to optimize the remaining degrees of freedom for hypotheticals, and is aware of the scenarios intended to be examined.
We (informally) introduce the model underlying COBRA through a running example. The model and the example, as well as most of the text in this section, are taken verbatim or in a shortened form from [4]; they appear here for completeness.
Example 1 (Running example)
Our running example concerns a telephony company, whose database is illustrated in Figure 1. It includes a Cust table with information about the customers (ID, calling plan and zip code); a Calls table including the duration in minutes, totaled by month for each customer; and the Plans table including the price per minute (ppm) of every plan, where the ppm may vary from month to month. The company offers several calling plans: small business plans (SB1, SB2), an enterprise plan (E), plans for youth (Y1, Y2, Y3), for families (F1, F2) and for veterans (V), as well as standard plans (A, B). Each customer is subscribed to one calling plan.

Our example query computes the revenues of the company by summing the per-customer revenue, computed by multiplying the duration of calls by the ppm of the customer's plan, and aggregating the result per zip code:

    SELECT Zip, SUM(Calls.Dur * Plans.Price)
    FROM Calls, Cust, Plans
    WHERE Cust.Plan = Plans.Plan
      AND Cust.ID = Calls.CID
      AND Calls.Mo = Plans.Mo
    GROUP BY Cust.Zip
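For intuition, the query can be run on a toy instance with Python's built-in sqlite3 module. The sketch below is illustrative only: the Cust and Calls rows are hypothetical (Figure 1 elides them), only the prices of plans A and V are taken from Figure 1, and the helper names run_query, provenance and evaluate are our own. It also checks the commutation property mentioned in the introduction for a multiplicative per-month price change: scaling the prices in the database and re-running the query gives the same revenues as evaluating a precomputed symbolic expression.

```python
import sqlite3

QUERY = """
SELECT Zip, SUM(Calls.Dur * Plans.Price)
FROM Calls, Cust, Plans
WHERE Cust.Plan = Plans.Plan
  AND Cust.ID = Calls.CID
  AND Calls.Mo = Plans.Mo
GROUP BY Cust.Zip
"""

# Hypothetical customers and calls; prices of plans A and V as in Figure 1.
CUST = [(1, "A", "10001"), (2, "V", "10001"), (3, "A", "10002")]
CALLS = [(1, 1, 100), (1, 3, 50), (2, 1, 200), (3, 3, 80)]
PLANS = [("A", 1, 0.4), ("A", 3, 0.5), ("V", 1, 0.25), ("V", 3, 0.2)]


def run_query(month_factor):
    """Re-execute the query after scaling month mo's prices by month_factor[mo]."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE Cust (ID INTEGER, Plan TEXT, Zip TEXT)")
    con.execute("CREATE TABLE Calls (CID INTEGER, Mo INTEGER, Dur REAL)")
    con.execute("CREATE TABLE Plans (Plan TEXT, Mo INTEGER, Price REAL)")
    con.executemany("INSERT INTO Cust VALUES (?, ?, ?)", CUST)
    con.executemany("INSERT INTO Calls VALUES (?, ?, ?)", CALLS)
    scaled = [(p, mo, price * month_factor[mo]) for p, mo, price in PLANS]
    con.executemany("INSERT INTO Plans VALUES (?, ?, ?)", scaled)
    result = dict(con.execute(QUERY).fetchall())
    con.close()
    return result


def provenance():
    """Per zip code, a list of monomials (coefficient, month) built once."""
    plan_of = {cid: plan for cid, plan, _ in CUST}
    zip_of = {cid: zip_code for cid, _, zip_code in CUST}
    price = {(plan, mo): pr for plan, mo, pr in PLANS}
    polys = {}
    for cid, mo, dur in CALLS:
        coeff = dur * price[(plan_of[cid], mo)]
        polys.setdefault(zip_of[cid], []).append((coeff, mo))
    return polys


def evaluate(polys, month_factor):
    """Apply a valuation of the month variables to the stored monomials."""
    return {z: sum(c * month_factor[mo] for c, mo in monos)
            for z, monos in polys.items()}


# "What if all prices are decreased by 20% in March?"
scenario = {1: 1.0, 3: 0.8}
rerun = run_query(scenario)
fast = evaluate(provenance(), scenario)
```

Here `rerun` and `fast` agree up to floating-point rounding, but `fast` is obtained without touching the database again.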
An analyst working for the company may be interested in the effect of possible changes to the call prices on the company revenues. For example, what if the price per minute (ppm) of all plans is decreased by 20% in March? Or what if the ppm of the business calling plans is increased by a given percentage?

Provenance Polynomials
COBRA gets as input provenance polynomials. Given a set of indeterminates X, we use the standard notion of a polynomial over X as a sum of monomials, each of which is a product of indeterminates and/or rational numbers referred to as coefficients. An indeterminate may appear more than once in a monomial, in which case its number of occurrences is called its exponent. We assume that we are given a multiset of such polynomials, intuitively including all polynomials that appear in the provenance-aware result of query evaluation.

Example 2
To support the hypothetical scenarios given in Example 1, we can parameterize the (multiplicative) change in price, assigning, e.g., a distinct parameter $m_i$ to capture the change in month $i$. Similarly, the variables $p_1$, $f_1$, $y_1$, $v$, $b_1$, $b_2$ and $e$ are used to parameterize the plan prices based on the plan's type: $p_1$ is used to control the changes in the price of plan A, $f_1$ for plan F1, $y_1$ for Y1, and $v$ for the veterans plan. In this example we would then get as answer to the above query, instead of a single aggregate value, symbolic provenance expressions of the form

$$P_1 = 208. \cdot p_1 \cdot m_1 + 240 \cdot p_1 \cdot m_3 + 127. \cdot f_1 \cdot m_1 + 114. \cdot f_1 \cdot m_3 + 75. \cdot y_1 \cdot m_1 + 72. \cdot y_1 \cdot m_3 + 42 \cdot v \cdot m_1 + 24. \cdot v \cdot m_3$$

$$P_2 = 77. \cdot b_1 \cdot m_1 + 80. \cdot b_1 \cdot m_3 + 52. \cdot e \cdot m_1 + 56. \cdot e \cdot m_3 + 69. \cdot b_2 \cdot m_1 + 100. \cdot b_2 \cdot m_3$$

Abstraction Trees
COBRA reduces the provenance polynomial size so that its number of monomials is below a given threshold, while supporting maximal granularity for hypothetical reasoning. To this end, we allow the user to define abstraction trees over the variables, intuitively defining groups of variables which will be assigned the same values. The notion of abstraction trees is critical because determining which grouping “makes sense” depends on the semantics of the variables. The abstraction trees may be obtained by leveraging existing ontologies on the annotated data, which capture the semantics of the variables. The user may also manually construct/augment the trees based on the expected use of provenance, namely, form the trees so that variables that, based on the user's experience, are expected to be assigned the same value will be located in proximity to each other in the tree.

Plans
    Standard: $p_1$, $p_2$
    Special
        $v$
        Y: $y_1$, $y_2$, $y_3$
        F: $f_1$, $f_2$
    Business
        $e$
        SB: $b_1$, $b_2$

Figure 2: An abstraction tree of the plans variables

An abstraction is then represented by a cut in the tree separating the root from all leaves. The idea is that for every node in the chosen cut, all of its descendant leaves are replaced by a single meta-variable. Intuitively, such a choice means that for the subsequent hypothetical reasoning scenarios, all variables below each chosen node must be assigned the same value.
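The notions of abstraction tree and cut can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the tree of Figure 2 is encoded as nested dictionaries, the leaf names p1, p2, y1, ..., b2 stand for the subscripted variables, and the helpers `cuts`, `leaves` and `abstraction` are hypothetical names, not part of COBRA.

```python
from itertools import product

# Abstraction tree of Figure 2: inner node -> children, leaf -> {}.
TREE = {
    "Standard": {"p1": {}, "p2": {}},
    "Special": {
        "v": {},
        "Y": {"y1": {}, "y2": {}, "y3": {}},
        "F": {"f1": {}, "f2": {}},
    },
    "Business": {"e": {}, "SB": {"b1": {}, "b2": {}}},
}
ROOT = "Plans"


def index(name, children, out):
    """Record the children of every node so nodes can be looked up by name."""
    out[name] = children
    for child, sub in children.items():
        index(child, sub, out)
    return out


NODES = index(ROOT, TREE, {})


def leaves(name):
    """All leaf variables below (or equal to) the node `name`."""
    children = NODES[name]
    if not children:
        return [name]
    return [leaf for child in children for leaf in leaves(child)]


def cuts(name):
    """Yield every cut of the subtree rooted at `name` as a tuple of nodes."""
    yield (name,)  # cut at this node: one meta-variable for the whole subtree
    children = NODES[name]
    if children:
        # Otherwise, combine one cut from each child subtree.
        for combo in product(*(list(cuts(child)) for child in children)):
            yield tuple(node for part in combo for node in part)


def abstraction(cut):
    """Map each leaf variable to the meta-variable (cut node) covering it."""
    return {leaf: node for node in cut for leaf in leaves(node)}


all_cuts = list(cuts(ROOT))
coarse = abstraction(("Standard", "Special", "Business"))
```

For this small tree the enumeration yields 31 cuts; in general the number of cuts grows exponentially with the tree size, so exhaustive enumeration is shown here only to make the definition concrete.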
Example 3
In Example 1, the plans variables may be abstracted based on their type, e.g., by grouping the plans for small businesses, SB1 and SB2, or by further abstracting all Business plans, small businesses and enterprises. Similarly, the family plans may be abstracted using a single variable F, and the youth plans using the variable Y. We may also consider using a coarse abstraction that combines all special plans (families, youth and veterans) into a single variable. Figure 2 depicts the resulting abstraction tree.

Optimization Problem
The problem we have studied in [4] is as follows: given a provenance polynomial and an abstraction tree over (subsets of) its variables, find a choice of abstraction that reduces the provenance size, while maximizing the expressiveness of the abstraction; we next explain both measures. First, the provenance size is measured by the number of monomials in the resulting provenance polynomial. The number of monomials is indeed the dominant factor in the provenance size, since the size of each monomial is bounded by a typically small constant, independent of the database size (it may depend on the query or the number of hypothetical scenarios). As for the expressiveness of the abstraction, we aim at maximizing the degrees of freedom left for hypothetical analysis; naturally, every grouping limits the possible scenarios in the sense that it forces multiple variables to be assigned the same value. Consequently, we measure the expressiveness of the abstraction by the number of distinct variable names it defines. Our goal is to reduce the number of distinct monomials in the provenance, while maximizing the number of distinct variables.

Example 4
Consider the abstraction tree presented in Figure 2. The following cuts are possible abstractions:

$S_1 = \{\text{Business}, \text{Special}, \text{Standard}\}$
$S_2 = \{\text{SB}, e, f_1, f_2, \text{Y}, v, \text{Standard}\}$
$S_3 = \{b_1, b_2, e, \text{Special}, \text{Standard}\}$
$S_4 = \{\text{SB}, e, \text{F}, \text{Y}, v, p_1, p_2\}$
$S_5 = \{\text{Plans}\}$

Each choice of abstraction may entail a “loss” in terms of the granularity of hypothetical reasoning, in exchange for a reduction in the size of the polynomial: consider the polynomial $P_1$ for the revenues shown in Example 2. Using the abstraction $S_1$ we obtain the polynomial (we use St and Sp as shorthand for Standard and Special, respectively)

$$208. \cdot St \cdot m_1 + 240 \cdot St \cdot m_3 + 245. \cdot Sp \cdot m_1 + 211. \cdot Sp \cdot m_3,$$

with four different variables and four monomials, whereas using the abstraction $S_5$ the obtained polynomial

$$466. \cdot Plans \cdot m_1 + 451. \cdot Plans \cdot m_3$$

consists of two monomials and three variables.

In this demonstration, we consider the case of a single abstraction tree (even in this case, a monomial may still consist of multiple variables, but the abstraction may apply to at most one of them); note that there may still be exponentially many cuts in the tree. In this case the optimization problem is solvable in polynomial time. In a nutshell, the algorithm traverses the abstraction tree in a bottom-up fashion and, using dynamic programming, computes an abstraction for the sub-tree rooted at each of the inner nodes (see [4] for full details).

COBRA's back-end is implemented in Python 3. Its front-end is written in the AngularJS framework using the Bootstrap toolkit. It runs on Windows 10. The system architecture is depicted in Figure 4, and the user interface is shown in Figure 3. We next briefly explain the components of the system.
Back-end
As mentioned earlier, the input to COBRA is a set of provenance polynomials (generated by any provenance engine), a default assignment to the provenance variables, a bound over the provenance size and an abstraction tree (given by the user). The system then computes an optimal abstraction over the polynomials, namely, an abstraction that reduces the provenance size below the given bound while maximizing the number of variables. This is done using the algorithm presented in [4]. Once the abstraction is generated, the user may input a valuation for the variables of the compressed polynomials, and the system generates the query results under the scenario given by the assignment and presents the results to the user.

Figure 3: User Interface
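To make the compression step concrete, the following sketch applies a cut to a polynomial, merges monomials that become identical, and picks a cut that respects a bound. It is not the dynamic-programming algorithm of [4]: it simply tries the five cuts of Example 4. All names (`GROUPS`, `compress`, `best_cut`) are hypothetical, the coefficients are truncated integer stand-ins for those of $P_1$ (so the merged values differ slightly from Example 4), and we assume that "below the bound" means at most `bound` monomials.

```python
from collections import defaultdict

# P1 as (coefficient, plan variable, month variable); coefficients are
# illustrative integer stand-ins, not the exact values of Example 2.
P1 = [(208, "p1", "m1"), (240, "p1", "m3"), (127, "f1", "m1"), (114, "f1", "m3"),
      (75, "y1", "m1"), (72, "y1", "m3"), (42, "v", "m1"), (24, "v", "m3")]

# Leaves below each inner node of the tree in Figure 2.
GROUPS = {
    "Y": ["y1", "y2", "y3"], "F": ["f1", "f2"], "SB": ["b1", "b2"],
    "Standard": ["p1", "p2"],
    "Special": ["v", "y1", "y2", "y3", "f1", "f2"],
    "Business": ["e", "b1", "b2"],
}
GROUPS["Plans"] = GROUPS["Standard"] + GROUPS["Special"] + GROUPS["Business"]

# The cuts S1..S5 of Example 4.
CUTS = {
    "S1": ["Business", "Special", "Standard"],
    "S2": ["SB", "e", "f1", "f2", "Y", "v", "Standard"],
    "S3": ["b1", "b2", "e", "Special", "Standard"],
    "S4": ["SB", "e", "F", "Y", "v", "p1", "p2"],
    "S5": ["Plans"],
}


def compress(poly, cut):
    """Replace each plan variable by its meta-variable and merge monomials."""
    to_meta = {leaf: node for node in cut for leaf in GROUPS.get(node, [node])}
    merged = defaultdict(float)
    for coeff, plan_var, month_var in poly:
        merged[(to_meta[plan_var], month_var)] += coeff  # sum coefficients
    return dict(merged)


def num_variables(compressed):
    """Number of distinct variable names in the compressed polynomial."""
    return len({var for monomial in compressed for var in monomial})


def best_cut(poly, bound):
    """Among the candidate cuts with at most `bound` monomials, return the
    name of one with the most distinct variables (None if none fits)."""
    best_name, best_vars = None, -1
    for name, cut in CUTS.items():
        compressed = compress(poly, cut)
        if len(compressed) <= bound and num_variables(compressed) > best_vars:
            best_name, best_vars = name, num_variables(compressed)
    return best_name


s1 = compress(P1, CUTS["S1"])  # four monomials over St, Sp, m1, m3
s5 = compress(P1, CUTS["S5"])  # two monomials over Plans, m1, m3
```

With these stand-in coefficients, a bound of 2 leaves only $S_5$ feasible, whereas a bound of 4 admits $S_1$ and $S_3$, which keep four distinct variables each.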
Front-end
The interaction with COBRA is done via a dedicated interface shown in Figure 3. The user is presented with the query result under a default assignment to the input provenance variables. She can then construct the abstraction tree and set the bound over the provenance size. Once the abstracted polynomials are generated, the system presents the abstraction variables to the user, as shown in Figure 5. Each meta-variable in the abstraction is presented with the list of abstracted variables, each with its value in the original assignment, and a default value (the average over the abstracted variables' values). The user can then modify the assigned values of the meta-variables, and COBRA presents the query result under the given assignment, showing the changes from the initial result. In addition, the system provides the user with information about the resulting provenance size and the assignment speedup using the compressed polynomials.
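The default values of the meta-variables can be illustrated with a short sketch. The function name `default_meta_values`, the valuation and the grouping below are hypothetical; the only behaviour taken from the text is that a meta-variable defaults to the average of the original values of the variables it abstracts.

```python
from statistics import mean


def default_meta_values(valuation, leaf_to_meta):
    """Average the original values of the variables under each meta-variable.

    Variables that are not abstracted (absent from leaf_to_meta) keep
    their own name and value.
    """
    grouped = {}
    for var, value in valuation.items():
        grouped.setdefault(leaf_to_meta.get(var, var), []).append(value)
    return {meta: mean(values) for meta, values in grouped.items()}


# Hypothetical original assignment (multiplicative price changes) in which
# the two small-business variables are abstracted into the meta-variable SB.
valuation = {"b1": 1.0, "b2": 1.5, "e": 0.5, "m1": 1.0}
leaf_to_meta = {"b1": "SB", "b2": "SB"}
defaults = default_meta_values(valuation, leaf_to_meta)
```

The user would then be free to overwrite any of these defaults before the compressed polynomials are evaluated.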
Figure 4: System Architecture (components: Provenance Engine, Provenance Polynomials, Bound and Abstraction Trees, Provenance Compression, Abstracted Polynomials, Abstracted Variables, Assignment, Results)

Figure 5: Meta-variables Assignment Screen

We will demonstrate the usefulness of COBRA using both synthetic and real datasets. In the first phase, we will discuss the dataset. We will use the provenance generated for the query from our running example, where the plans price was parametrized by month and plan. In addition, we will demonstrate
COBRA in the context of TPC Benchmark H (TPC-H) [1], which consists of a suite of business-oriented queries. To this end, we will use the data generated by the benchmark and present a subset of its queries.

We will walk the audience through the process of building the abstraction trees, by presenting the underlying database and the query used to generate provenance. There are multiple reasonable abstractions for each query. For instance, in our running example, if the analyst knows that the prices are usually changed uniformly during each quarter, a natural abstraction tree would consist of quarter meta-variables $q_1, \ldots, q_4$ that can be used to group the monthly variables, i.e., the variables $m_1, \ldots, m_3$ are the children of $q_1$, $m_4, \ldots, m_6$ of $q_2$, etc. The abstraction tree given in Figure 2 is another plausible example. We will use predefined trees for each one of the datasets.

In the second phase, we will let the audience interactively examine the effect of the bound on the query results, provenance size and assignment time. As explained in Section 2, given the provenance polynomials, abstraction tree and bound, the system computes an abstraction. Once the abstraction is computed, COBRA presents the abstraction variables with their default assignment to the user, as shown in Figure 5. We will let the user select valuations for the abstraction variables and observe the results: the changes in the analysis query results using the compressed provenance.

Moreover, the system provides the user with information about the resulting provenance size and the assignment speedup. For example, the provenance size of the polynomials generated by our running example, using a database of one million customers parameterized using month variables and the leaves of the abstraction tree in Figure 2, is 139,[...],600 [...] the compressed provenance expression obtained is of size 88,[...],600 [...] results in provenance polynomials of size 37,980 and assignment speedup of 79%.

Finally, we will allow the audience to look “under the hood”. In particular, we will show the audience part of the provenance polynomials, intermediate results of the algorithm and the computational sequence that leads to the resulting abstraction.
Acknowledgements
This research has been funded by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No. 804302), the Israeli Ministry of Science, Technology and Space, Len Blavatnik and the Blavatnik Family foundation, Blavatnik Interdisciplinary Cyber Research Center at Tel Aviv University, and the Pazy Foundation. The contribution of Yuval Moskovitch is part of Ph.D. thesis research conducted at Tel Aviv University.
References
[1] TPC benchmark.
[2] Y. Amsterdamer, D. Deutch, and V. Tannen. Provenance for aggregate queries. In PODS, 2011.
[3] D. Deutch, Z. G. Ives, T. Milo, and V. Tannen. Caravan: Provisioning for what-if analysis. In CIDR, 2013.
[4] D. Deutch, Y. Moskovitch, and N. Rinetzky. Hypothetical reasoning via provenance abstraction. To appear in SIGMOD 2019.
[5] T. J. Green, G. Karvounarakis, and V. Tannen. Provenance semirings. In PODS, 2007.
[6] S. Lee, X. Niu, B. Ludäscher, and B. Glavic. Integrating approximate summarization with provenance capture. In TaPP, 2017.
[7] C. Ré and D. Suciu. Approximate lineage for probabilistic databases.