Stale View Cleaning: Getting Fresh Answers from Stale Materialized Views
Sanjay Krishnan, Jiannan Wang, Michael J. Franklin, Ken Goldberg (UC Berkeley); Tim Kraska (Brown University)
{sanjaykrishnan, jnwang, franklin, goldberg}@berkeley.edu, tim [email protected]

ABSTRACT
Materialized views (MVs), stored pre-computed results, are widely used to facilitate fast queries on large datasets. When new records arrive at a high rate, it is infeasible to continuously update (maintain) MVs, and a common solution is to defer maintenance by batching updates together. Between batches the MVs become increasingly stale, with incorrect, missing, and superfluous rows leading to increasingly inaccurate query results. We propose Stale View Cleaning (SVC), which addresses this problem from a data cleaning perspective. In SVC, we efficiently clean a sample of rows from a stale MV, and use the clean sample to estimate aggregate query results. While approximate, the estimated query results reflect the most recent data. As sampling can be sensitive to long-tailed distributions, we further explore an outlier indexing technique to give increased accuracy when the data distributions are skewed. SVC complements existing deferred maintenance approaches by giving accurate and bounded query answers between maintenance. We evaluate our method on a generated dataset from the TPC-D benchmark and a real video distribution application. Experiments confirm our theoretical results: (1) cleaning an MV sample is more efficient than full view maintenance, (2) the estimated results are more accurate than using the stale MV, and (3) SVC is applicable for a wide variety of MVs.
1. INTRODUCTION
Storing pre-computed query results, also known as materialization, is an extensively studied approach to reduce query latency on large data [9,22,31]. Materialized Views (MVs) are now supported by all major commercial vendors. However, as with any pre-computation or caching, the key challenge in using MVs is maintaining their freshness as base data changes. While there has been substantial work in incremental maintenance of MVs [9,27], eager maintenance (i.e., immediately applying updates) is not always feasible. In applications such as monitoring or visualization [35,49], analysts may create many MVs by slicing or aggregating over different dimensions. Eager maintenance requires updating all affected MVs for every incoming transaction, and thus, each additional MV reduces the available transaction throughput. This problem becomes significantly harder when the views are distributed and computational resources are contended by other tasks. As a result, in production environments, it is common to batch updates together to amortize overheads [9]. Batch sizes are set according to system constraints, and can vary from a few seconds to even nightly.

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/3.0/. Obtain permission prior to any use beyond those covered by the license. Contact copyright holder by emailing [email protected]. Articles from this volume were invited to present their results at the 41st International Conference on Very Large Data Bases, August 31st - September 4th 2015, Kohala Coast, Hawaii.
Proceedings of the VLDB Endowment, Vol. 8, No. 12
Copyright 2015 VLDB Endowment 2150-8097/15/08.

Figure 1: In SVC, we pose view maintenance as a sample-and-clean problem and show that we can use a sample of clean (up-to-date) rows from an MV to correct inaccurate query results on stale views.

While increasing the batching period gives the user more flexibility to schedule around system constraints, a disadvantage is that MVs are stale between maintenance periods. Other than an educated guess based on past data, the user has no way of knowing how incorrect their query results are. Some types of views and query workloads can be sensitive to even a small number of base data updates, for example, if updates disproportionately affect a subset of frequently queried rows. Thus, any amount of staleness is potentially dangerous, and this presents us with a dichotomy between facing the cost of eager maintenance or coping with the consequences of unknown inaccuracy. In this paper, we explore an intriguing middle ground: we can derive a bounded approximation of the correct answer for a fraction of the cost. With a small amount of up-to-date data, we can compensate for the error in aggregate query results induced by staleness.

Our method relies on modeling query answering on stale MVs as a data cleaning problem. A stale MV has incorrect, missing, or superfluous rows, which are problems that have been studied in the data cleaning literature (e.g., see Rahm and Do for a survey [45]). Increasing data volumes have led to the development of new, efficient sampling-based approaches for coping with dirty data. In our prior work, we developed the SampleClean framework for scalable aggregate query processing on dirty data [48].
Since data cleaning is often expensive, we proposed cleaning a sample of data and using this sample to improve the results of aggregate queries on the full dataset. Since stale MVs are dirty data, an approach similar to SampleClean raises a new possibility of using a sample of "clean" rows in the MVs to return more accurate query results.

Stale View Cleaning (SVC, illustrated in Figure 1) approximates aggregate query results from a stale MV and a small sample of up-to-date data. We calculate a relational expression that materializes a uniform sample of up-to-date rows. This expression can be interpreted as "cleaning" a stale sample of rows. We use the clean sample of rows to estimate a result for an aggregate query on the view. The estimates from this procedure, while approximate, reflect the most recent data. Approximation error is more manageable than staleness because: (1) the uniformity of sampling allows us to apply theory from statistics, such as the Central Limit Theorem, to give tight bounds on approximate results, and (2) the approximation error is parametrized by the sample size, which the user can control, trading off accuracy for computation. However, the MV setting presents new challenges that we did not consider in prior work. To summarize our contributions: (1) a hashing-based technique that efficiently materializes an up-to-date sample view, (2) algorithms for processing general aggregate queries on a sample view and bounding results in confidence intervals, (3) an outlier indexing technique to reduce sensitivity to skewed datasets that can push the index up to derived relations, and (4) an evaluation of this technique on real and synthetic datasets to show that SVC gives highly accurate results for a relatively small maintenance cost.

The paper is organized as follows: In Section 2, we give the necessary background for our work. Next, in Section 3, we formalize the problem.
In Sections 4 and 5, we describe the sampling and query processing of our technique. In Section 6, we describe the outlier indexing framework. Then, in Section 7, we evaluate our approach. We discuss Related Work in Section 8. Finally, we present our Conclusions in Section 10.
2. BACKGROUND

2.1 Motivation and Example
Materialized view maintenance can be very expensive, resulting in staleness. Many important use-cases require creating a large number of views, including visualization, personalization, privacy, and real-time monitoring. The problem with eager maintenance is that every view created by an analyst places a bottleneck on incoming transactions. There has been significant research on fast MV maintenance algorithms, most recently DBToaster [27], which uses SQL query compilation and higher-order maintenance. However, even with these optimizations, some materialized views are computationally difficult to incrementally maintain. For example, the cost of incremental maintenance of views with correlated subqueries can grow with the size of the data. It is also common to use the same infrastructure to maintain multiple MVs (along with other analytics tasks), adding further contention to computational resources and reducing overall available throughput. When faced with such challenges, one solution is to batch updates and amortize maintenance overheads.
Log Analysis Example: Suppose we are a video streaming company analyzing user engagement. Our database consists of two tables, Log and Video, with the following schema:

Log(sessionId, videoId)
Video(videoId, ownerId, duration)

The Log table stores each visit to a specific video with primary key (sessionId) and a foreign key to the Video table (videoId). For our analysis, we are interested in finding aggregate statistics on visits, such as the average visits per video and the total number of visits predicated on different subsets of owners. We could define the following MV that counts the visits for each videoId associated with owners and the duration:

CREATE VIEW visitView
AS SELECT videoId, ownerId, duration, count(1) as visitCount
FROM Log, Video
WHERE Log.videoId = Video.videoId
GROUP BY videoId

As the Log table grows, this MV becomes stale, and we denote the insertions to the table as:

LogIns(sessionId, videoId)
Staleness does not affect every query uniformly. Even when the number of new entries in LogIns is small relative to Log, some queries might be very inaccurate. For example, views to newly added videos may account for most of LogIns, so queries that count visits to the most recent videos will be more inaccurate. The amount of inaccuracy is unknown to the user, who can only estimate an expected error based on prior experience. This assumption may not hold in rapidly evolving data. We see an opportunity for approximation through sampling, which can give bounded query results for a reduced maintenance cost. In other words, a small amount of up-to-date data allows the user to estimate the magnitude of query result error due to staleness.
SampleClean is a framework for scalable aggregate query processing on dirty data. Traditionally, data cleaning has explored expensive, up-front cleaning of entire datasets for increased query accuracy. Those who were unwilling to pay the full cleaning cost avoided data cleaning altogether. We proposed SampleClean to add an additional trade-off to this design space by using sampling, i.e., bounded results for aggregate queries when only a sample of data is cleaned. The problem of high computational costs for accurate results mirrors the challenge faced in the MV setting with the tradeoff between immediate maintenance (expensive and up-to-date) and deferred maintenance (inexpensive and stale). Thus, we explore how samples of "clean" (up-to-date) data can be used for improved query processing on MVs without incurring the full cost of maintenance.

However, the metaphor of stale MVs as a Sample-and-Clean problem only goes so far, and there are significant new challenges that we address in this paper. In prior work, we modeled data cleaning as a row-by-row black-box transformation. This model does not work for missing and superfluous rows in stale MVs. In particular, our sampling method has to account for this issue, and we propose a hashing-based technique to efficiently materialize a uniform sample even in the presence of missing/superfluous rows. Next, we greatly expand the query processing scope of SampleClean beyond sum, count, and avg queries. Bounding estimates for queries other than sum, count, and avg is significantly more complicated. This requires new analytic tools such as statistical bootstrap estimation to calculate confidence intervals. Finally, we add an outlier indexing technique to improve estimates on skewed data.
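To make the bootstrap idea concrete, the following is a minimal Python sketch, not the SampleClean implementation; the sample data and function names are hypothetical. It bounds a non-sum/count/avg aggregate (here, the median) by resampling a cleaned sample with replacement:

```python
import random
import statistics

def bootstrap_ci(sample, estimator, trials=1000, alpha=0.05):
    """Estimate a (1 - alpha) confidence interval for an aggregate
    by resampling the cleaned sample with replacement."""
    n = len(sample)
    estimates = sorted(
        estimator([random.choice(sample) for _ in range(n)])
        for _ in range(trials)
    )
    lo = estimates[int((alpha / 2) * trials)]
    hi = estimates[int((1 - alpha / 2) * trials) - 1]
    return lo, hi

# Hypothetical cleaned sample of visitCount values from the running example.
random.seed(42)
clean_sample = [random.randint(1, 500) for _ in range(200)]
lo, hi = bootstrap_ci(clean_sample, statistics.median)
assert lo <= statistics.median(clean_sample) <= hi
```

The appeal of the bootstrap here is that it requires no closed-form variance formula, so the same procedure applies to medians, percentiles, and other aggregates where CLT-style bounds are awkward.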
3. FRAMEWORK OVERVIEW

3.1 Notation and Definitions
SVC returns a bounded approximation for aggregate queries onstale MVs for a flexible additional maintenance cost.
Materialized View: Let D be a database which is a collection of relations {R_i}. A materialized view S is the result of applying a view definition to D. View definitions are composed of standard relational algebra expressions: Select (σ_φ), Project (Π), Join (⋈), Aggregation (γ), Union (∪), Intersection (∩), and Difference (−). We use the following parametrized notation for joins, aggregations, and generalized projections:

• Π_{a1,a2,...,ak}(R): Generalized projection selects attributes {a1, a2, ..., ak} from R, allowing for adding new attributes that are arithmetic transformations of old ones (e.g., a1 + a2).
• ⋈_{φ(r1,r2)}(R1, R2): Join selects all tuples in R1 × R2 that satisfy φ(r1, r2). We use ⋈ to denote all types of joins, including extended outer joins (⟕, ⟖, ⟗).
• γ_{f,A}(R): Apply the aggregate function f to the relation R grouped by the distinct values of A, where A is a subset of the attributes. The DISTINCT operation can be considered a special case of the Aggregation operation.

The composition of the unary and binary relational expressions can be represented as a tree, which is called the expression tree. The leaves of the tree are the base relations for the view. Each non-leaf node is the result of applying one of the above relational expressions to a relation. To avoid ambiguity, we refer to tuples of the base relations as records and tuples of derived relations as rows.

Primary Key: We assume that each of the base relations has a primary key. If this is not the case, we can always add an extra column that assigns an increasing sequence of integers to each record. For the defined relational expressions, every row in a materialized view can also be given a primary key [14,52], which we will describe in Section 4. This primary key is formally a subset of attributes u ⊆ {a1, a2, ..., ak} such that all s ∈ S(u) are unique.

Staleness:
For each relation R_i there is a set of insertions ΔR_i (modeled as a relation) and a set of deletions ∇R_i. An "update" to R_i can be modeled as a deletion and then an insertion. We refer to the set of insertion and deletion relations as "delta relations", denoted by ∂D:

∂D = {ΔR_1, ..., ΔR_k} ∪ {∇R_1, ..., ∇R_k}

A view S is considered stale when there exist insertions or deletions to any of its base relations. This means that at least one of the delta relations in ∂D is non-empty.

Maintenance:
There may be multiple ways (e.g., incremental maintenance or recomputation) to maintain a view S, and we denote the up-to-date view as S'. We formalize the procedure to maintain the view as a maintenance strategy M. A maintenance strategy is a relational expression the execution of which will return S'. It is a function of the database D, the stale view S, and all the insertion and deletion relations ∂D. In this work, we consider maintenance strategies composed of the same relational expressions as the materialized views described above:

S' = M(S, D, ∂D)

Staleness as Data Error:
The consequences of staleness are incorrect, missing, and superfluous rows. Formally, for a stale view S with primary key u and an up-to-date view S':

• Incorrect: Incorrect rows are the set of rows (identified by the primary key) that are updated in S'. For s ∈ S, let s(u) be the value of the primary key. An incorrect row is one such that there exists an s' ∈ S' with s'(u) = s(u) and s ≠ s'.
• Missing: Missing rows are the set of rows (identified by the primary key) that exist in the up-to-date view but not in the stale view. For s' ∈ S', let s'(u) be the value of the primary key. A missing row is one such that there does not exist an s ∈ S with s(u) = s'(u).
• Superfluous: Superfluous rows are the set of rows (identified by the primary key) that exist in the stale view but not in the up-to-date view. For s ∈ S, let s(u) be the value of the primary key. A superfluous row is one such that there does not exist an s' ∈ S' with s(u) = s'(u).

Uniform Random Sampling:
We define a sampling ratio m ∈ [0, 1], and for each row in a view S, we include it in a sample with probability m. We use the "hat" notation (e.g., Ŝ) to denote sampled relations. The relation Ŝ is a uniform sample of S if (1) ∀s ∈ Ŝ: s ∈ S, and (2) Pr(s_i ∈ Ŝ) = Pr(s_j ∈ Ŝ) = m. We say a sample is clean if and only if it is a uniform random sample of the up-to-date view S'.

EXAMPLE 1. In this example, we summarize all of the key concepts and terminology pertaining to materialized views, stale data error, and maintenance strategies. Our example view, visitView, joins the Log table with the Video table and counts the visits for each video grouped by videoId. Since there is a foreign key relationship between the relations, this is just a visit count for each unique video with additional attributes. The primary keys of the base relations are: sessionId for Log and videoId for Video. If new records have been added to the Log table, the visitView is considered stale. Incorrect rows in the view are videos for which the visitCount is incorrect, and missing rows are videos that had not yet been viewed once at the time of materialization. While not possible in our running example, superfluous rows would be videos whose Log records have all been deleted. Formally, in this example our database is D = {Video, Log}, and the delta relations are ∂D = {LogIns}.

Suppose we apply the change-table IVM algorithm proposed in [22]:

1. Create a "delta view" by applying the view definition to LogIns. That is, calculate the visit count per video on the new logs: γ(Video ⋈ LogIns)
2. Take the full outer join of the "delta view" with the stale view visitView (equality on videoId): VisitView ⟗ γ(Video ⋈ LogIns)

3. Apply the generalized projection operator to add the visitCount in the delta view to each of the rows in visitView, where we treat a NULL value as 0: Π(VisitView ⟗ γ(Video ⋈ LogIns))

Therefore, the maintenance strategy is:

M({VisitView}, {Video, Log}, {LogIns}) = Π(VisitView ⟗ γ(Video ⋈ LogIns))

Formally, the workflow of SVC is:

1. We are given a view S.
2. M defines the maintenance strategy that updates S at each maintenance period.
3. The view S is stale between periodic maintenance, and the up-to-date view should be S'.
4. (Problem 1. Stale Sample View Cleaning) We find an expression C derived from M that cleans a uniform random sample of the stale view Ŝ to produce a "clean" sample of the up-to-date view Ŝ'.
5. (Problem 2. Query Result Estimation) Given an aggregate query q and the stale query result q(S), we use Ŝ' and Ŝ to estimate the up-to-date result.
6. We optionally maintain an index of outliers o for improved estimation in skewed data.

Stale Sample View Cleaning: The first problem addressed in this paper is how to clean a sample of the stale materialized view.

PROBLEM 1 (STALE SAMPLE VIEW CLEANING). We are given a stale view S, a sample of this stale view Ŝ with ratio m, the maintenance strategy M, the base relations D, and the insertion and deletion relations ∂D. We want to find a relational expression C such that Ŝ' = C(Ŝ, D, ∂D), where Ŝ' is a sample of the up-to-date view with ratio m.

Query Result Estimation:
The second problem addressed in this paper is query result estimation.

PROBLEM 2 (QUERY RESULT ESTIMATION). Let q be an aggregate query of the following form:

SELECT agg(a) FROM View WHERE Condition(A);

If the view S is stale, then the result will be incorrect by some value c:

q(S') = q(S) + c

Our objective is to find an estimator f such that:

q(S') ≈ f(q(S), Ŝ, Ŝ')

EXAMPLE 2. Suppose a user wants to know how many videos have received more than 100 views.
SELECT COUNT(1)
FROM visitView
WHERE visitCount > 100

Let us suppose the user runs the query and obtains a result. However, there have now been new records inserted into the Log table, making this result stale. First, we take a sample of visitView, and suppose this sample is a 5% sample. In Stale Sample View Cleaning (Problem 1), we apply updates, insertions, and deletions to the sample to efficiently materialize a 5% sample of the up-to-date view. In Query Result Estimation (Problem 2), we estimate aggregate query results based on the stale sample and the up-to-date sample.
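The end-to-end idea of the example can be sketched in a few lines of Python. This is an illustration under simplifying assumptions (a synthetic view, MD5-based sample inclusion, and a direct scale-up of the sample count by 1/m), not the system described in the rest of the paper; all names are hypothetical:

```python
import hashlib

def in_sample(key, m):
    """Deterministic membership test: hash the primary key to [0, 1)
    and keep the row if the hash falls below the sampling ratio m."""
    h = int(hashlib.md5(str(key).encode()).hexdigest(), 16)
    return (h % 10**6) / 10**6 < m

# Hypothetical up-to-date view: videoId -> visitCount.
up_to_date = {vid: (vid * 37) % 300 for vid in range(10000)}

m = 0.05  # a 5% sample, as in the example
clean_sample = {v: c for v, c in up_to_date.items() if in_sample(v, m)}

# Direct estimate of: SELECT COUNT(1) FROM visitView WHERE visitCount > 100,
# scaled up by 1/m to account for the sampling.
true_count = sum(1 for c in up_to_date.values() if c > 100)
est_count = sum(1 for c in clean_sample.values() if c > 100) / m
assert abs(est_count - true_count) / true_count < 0.25
```

Because inclusion is a deterministic function of the primary key, the same test applied to the stale and the up-to-date view selects corresponding rows, which is the property the next section develops.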
4. EFFICIENTLY CLEANING A SAMPLE
In this section, we describe how to find a relational expression C, derived from the maintenance strategy M, that efficiently cleans a sample of a stale view Ŝ to produce Ŝ'.

To set up the problem, we first consider two naive solutions that will not work. We could trivially apply M to the entire stale view S, update it to S', and then sample. While the result is correct according to our problem formulation, it does not save us any maintenance computation. We want to avoid materialization of up-to-date rows outside of the sample. However, the alternative naive solution is also flawed. For example, we could just apply M to the stale sample Ŝ and a sample of the delta relations ∂D̂. The challenge is that M does not always commute with sampling.

To understand the commutativity problem, consider maintaining a group-by aggregate view (for simplicity, we exclude the GROUP BY clause from the queries shown in the paper, as it can be modeled as part of the Condition):

Figure 2: Applying the rules described in Definition 2, we illustrate how to assign a primary key to a view.
SELECT videoId, count(1)
FROM Log
GROUP BY videoId
The resulting view has one row for every distinct videoId. We want to materialize a sample of S', that is, a sample of distinct videoIds. If we sample the base relation Log first, we do not get a sample of the view. Instead, we get a view where every count is partial. To achieve a sample of S', we need to ensure that for each s ∈ S', all rows in subexpressions contributing to s are also sampled. This is a problem of row provenance [14]. Provenance, also termed lineage, has been an important tool in the analysis of materialized views [14] and in approximate query processing [52].

DEFINITION 1 (PROVENANCE). Let r be a row in relation R, where R is derived from some other relation R = exp(U) and exp(·) is a relational expression composed of the expressions defined in Section 3.1. The provenance of row r with respect to U is p_U(r). This is defined as the set of rows in U such that for an update to any row u ∉ p_U(r), r is guaranteed to be unchanged.

For the relational expressions defined in the previous sections, this provenance is well defined and can be tracked using primary key rules that are enforced on each subexpression [14]. We recursively define a set of primary keys for all relations in the expression tree:

DEFINITION 2 (PRIMARY KEY GENERATION). For every relational expression R, we define the primary key attribute(s) of every expression to be:

• Base Case: All relations (leaves) must have an attribute p which is designated as a primary key.
• σ_φ(R): The primary key of the result is the primary key of R.
• Π_{a1,...,ak}(R): The primary key of the result is the primary key of R. The primary key must always be included in the projection.
• ⋈_{φ(r1,r2)}(R1, R2): The primary key of the result is the tuple of the primary keys of R1 and R2.
• γ_{f,A}(R): The primary key of the result is the group-by key A (which may be a set of attributes).
• R1 ∪ R2: The primary key of the result is the union of the primary keys of R1 and R2.
• R1 ∩ R2: The primary key of the result is the intersection of the primary keys of R1 and R2.
• R1 − R2: The primary key of the result is the primary key of R1.

For every node of the expression tree, these keys are guaranteed to uniquely identify a row.
These rules define a constructive procedure that can always be applied for our defined relational expressions.

EXAMPLE 3. A variant of our running example view that does not have a primary key is:
CREATE VIEW visitView
AS SELECT count(1) as visitCount
FROM Log, Video
WHERE Log.videoId = Video.videoId
GROUP BY videoId
We illustrate the key generation process in Figure 2. Suppose there is a base relation, such as Log, that is missing a primary key (sessionId). We can add this attribute by generating an increasing sequence of integers for each record in Log. Since both base tables Video and Log have primary keys, videoId and sessionId respectively, the result of the join will have the primary key (videoId, sessionId). Since the group-by attribute is videoId, that becomes the primary key of the view.
The primary keys allow us to determine the set of rows that contribute to a row r in a derived relation. If we have a deterministic way of mapping a primary key to a Boolean, we can ensure that all contributing rows are also sampled. To achieve this, we use a hashing procedure. Let us denote the hashing operator η_{a,m}(R). For all tuples in R, this operator applies a hash function whose range is [0, 1] to the primary key a (which may be a set) and selects those records with hash less than or equal to m.

In this work, we study uniform hashing, where the condition h(a) ≤ m implies that a fraction of approximately m of the rows is sampled. Such hash functions are utilized in other aspects of database research and practice (e.g., hash partitioning, hash joins, and hash tables). Hash functions in these applications are designed to be as uniform as possible to avoid collisions. Numerous empirical studies establish that many commonly applied hash functions (e.g., Linear, SDBM, MD5, SHA) have negligible differences with a true uniform random variable [25,32]. Cryptographic hashes work particularly well and are supported by most commercial and open source systems; for example, MySQL provides MD5 and SHA1.

To avoid materializing extra rows, we push down the hashing operator through the expression tree. The further we can push η down, the more operators (i.e., those above the sampling) can benefit. This push-down is analogous to the predicate push-down operations used in query optimizers. In particular, we are interested in finding an optimized relational expression that materializes an identical sample before and after the push-down. We formalize the push-down rules below:

DEFINITION 3 (HASH PUSH-DOWN). For a derived relation R, the following rules can be applied to push η_{a,m}(R) down the expression tree.

• σ_φ(R): Push η through the expression.
• Π_{a1,...,ak}(R): Push η through if a is in the projection.
• ⋈_{φ(r1,r2)}(R1, R2): No push-down in general. There are special cases below where push-down is possible.
• γ_{f,A}(R): Push η through if a is in the group-by clause A.
• R1 ∪ R2: Push η through to both R1 and R2.
• R1 ∩ R2: Push η through to both R1 and R2.
• R1 − R2: Push η through to both R1 and R2.

Special Case of Joins:
In general, a join R1 ⋈ R2 blocks the push-down of the hash operator η_{a,m}(R), since a possibly consists of attributes from both R1 and R2. However, when there is a constraint that enforces these attributes to be equal, push-down is possible.

Foreign Key Join. If we have a join of two relations linked by a foreign key, R1 (fact table with foreign key a) and R2 (dimension table with primary key b ⊆ a), and we are sampling the key a, then we can push the sampling down to R1. This is because we are guaranteed that for every r1 ∈ R1 there is only one r2 ∈ R2.

Equality Join. If the join is an equality join and a is one of the attributes in the equality join condition R1.a = R2.b, then η can be pushed down to both R1 and R2. On R1 the pushed-down operator is η_{a,m}(R1) and on R2 the operator is η_{b,m}(R2).

(Two footnotes from the original page apply here: it does not make sense for Video to be missing a primary key in our running example due to the foreign key relationship; and if the hash function returns, for example, a 32-bit unsigned integer, we can normalize by MAXINT to map it into [0, 1].)

EXAMPLE 4. We illustrate our hashing procedure in terms of SQL expressions on our running example. We can push down the hash function for the following expressions:
SELECT * FROM Video
WHERE Condition(·)

SELECT * FROM Video, Log
WHERE Video.videoId = Log.videoId

SELECT videoId, count(1)
FROM Log
GROUP BY videoId
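The group-by case above can be checked numerically: hashing the primary key videoId after aggregation selects exactly the same rows as hashing it on the base table before aggregation. A small Python sketch with synthetic data (all names hypothetical):

```python
import hashlib

def eta(rows, key_attr, m):
    """Hashing operator eta_{a,m}: keep rows whose hashed key lands below m."""
    def h(v):
        d = int(hashlib.md5(str(v).encode()).hexdigest(), 16)
        return (d % 10**6) / 10**6
    return [r for r in rows if h(r[key_attr]) < m]

def count_by_video(rows):
    counts = {}
    for r in rows:
        counts[r["videoId"]] = counts.get(r["videoId"], 0) + 1
    return counts

# Hypothetical Log table: many sessions per video.
log = [{"sessionId": s, "videoId": s % 500} for s in range(20000)]

m = 0.1
# Sample the view (hash applied after aggregation) ...
after = eta([{"videoId": v, "c": c} for v, c in count_by_video(log).items()],
            "videoId", m)
# ... versus pushing the hash down to the base table, then aggregating.
before = count_by_video(eta(log, "videoId", m))

assert {r["videoId"]: r["c"] for r in after} == before
```

The samples match exactly because inclusion depends only on the hash of videoId: either all of a video's log records survive the push-down, or none do, so every sampled group's count is complete.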
The following expressions are examples where we cannot push down the hash function:

SELECT * FROM Video, Log

SELECT c, count(1)
FROM (SELECT videoId, count(1) as c FROM Log GROUP BY videoId)
GROUP BY c

In Theorem 1, we prove the correctness of our push-down rules.

THEOREM 1. Given a derived relation R, a primary key a, and the sample η_{a,m}(R), let S be the sample created by applying η_{a,m} without push-down and S' be the sample created by applying the push-down rules to η_{a,m}(R). S and S' are identical samples with sampling ratio m.

PROOF SKETCH. We can prove this by induction. The base case is an expression tree of only one node, which trivially makes the claim true. Then, we can induct, considering one level of operators in the tree. σ, ∪, ∩, − clearly commute with hashing a. Π commutes only if a is in the projection. For ⋈, the sampling operator can be pushed down if a is in either side's key, or if there is a constraint that links the two keys. For group-by aggregates, if a is in the group clause (i.e., it is in the aggregate), then hashing the operand filters all rows that share a value of a, which is sufficient to materialize the derived row.

If we apply the hashing operator to M, we can get an optimized cleaning expression C that avoids materializing unnecessary rows. When applied to a stale sample of a view Ŝ, the database D, and the delta relations ∂D, it produces an up-to-date sample with sampling ratio m:

Ŝ' = C(Ŝ, D, ∂D)

Thus, it addresses Problem 1 from the previous section.
Figure 3: Applying the rules described in Section 4.4, we illustrate how to optimize the sampling of our example maintenance strategy.

EXAMPLE 5. We illustrate our proposed approach on our example view visitView, whose expression tree is shown in Figure 3. We start by applying the hashing operator to the primary key (videoId). The next operator we see in the expression tree is a projection that increments the visitCount in the view; this allows for push-down since the primary key is in the projection. The second expression is a hash of the equality join key, which merges the aggregate from the "delta view" into the old view, allowing us to push down on both branches of the tree using our special case for equality joins. On the left side, we reach the stale view, so we stop. On the right side, we reach the aggregate query (count) and, since the primary key is in the group-by clause, we can push down the sampling. Then, we reach another point where we hash the equality join key, allowing us to push down the sampling to the relations
LogIns and
Video . We started with a uniform random sample (cid:98) S of the stale view S . The hash push down allows us to efficiently materialize thesample (cid:98) S (cid:48) . (cid:98) S (cid:48) is a uniform random sample of the up-to-date viewS. While both of these samples are uniform random samples oftheir respective relations, the two samples are correlated since (cid:98) S (cid:48) is generated by cleaning (cid:98) S . In particular, our hashing techniqueensures that the primary keys in (cid:98) S (cid:48) depend on the primary keys in (cid:98) S . Statistically, this positively correlates the query result q ( (cid:98) S (cid:48) ) and q ( (cid:98) S ) . We will see how this property can be leveraged to improvequery estimation accuracy (Section 5.1).P ROPERTY
ORRESPONDENCE ). Suppose (cid:98) S (cid:48) and (cid:98) S areuniform samples of S (cid:48) and S , respectively. Let u denote the pri-mary key. We say (cid:98) S (cid:48) and (cid:98) S correspond if and only if: • Uniformity: (cid:98) S (cid:48) and (cid:98) S are uniform random samples of S (cid:48) and S respectively with a sampling ratio of m • Removal of Superfluous Rows: D = {∀ s ∈ (cid:98) S (cid:64) s (cid:48) ∈ S (cid:48) : s ( u ) = s (cid:48) ( u ) } , D ∩ (cid:98) S (cid:48) = ∅• Sampling of Missing Rows: I = {∀ s (cid:48) ∈ (cid:98) S (cid:48) (cid:64) s ∈ S : s ( u ) = s (cid:48) ( u ) } , E ( | I ∩ (cid:98) S (cid:48) | ) = m | I |• Key Preservation for Updated Rows: For all s ∈ (cid:98) S and notin D or I , s (cid:48) ∈ (cid:98) S (cid:48) : s (cid:48) ( u ) = s ( u ) .
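The hashing scheme behind η can be sketched in a few lines of Python (our own illustration, not the paper's implementation): because sample membership depends only on a hash of the primary key, sampling commutes with a selection on a non-key attribute, and samples of the stale and up-to-date views correspond.

```python
import hashlib

def in_sample(pk, m):
    """Deterministic membership test: hash the primary key into [0, 1)."""
    h = int(hashlib.md5(str(pk).encode()).hexdigest(), 16)
    return h / 16**32 < m

def eta(rows, m):
    """Hash-based sampling operator over rows keyed by 'videoId'."""
    return [r for r in rows if in_sample(r["videoId"], m)]

m = 0.2
S = [{"videoId": i, "visitCount": i % 50} for i in range(1000)]               # stale view
S_new = [{"videoId": i, "visitCount": i % 50 + 1} for i in range(100, 1100)]  # up-to-date view

# 1. Sampling commutes with a selection on a non-key attribute.
sel = lambda rows: [r for r in rows if r["visitCount"] > 10]
assert eta(sel(S), m) == sel(eta(S, m))

# 2. Correspondence: the two samples agree on shared primary keys.
keys = {r["videoId"] for r in eta(S, m)}
keys_new = {r["videoId"] for r in eta(S_new, m)}
# superfluous keys (deleted from the view) never survive into the new sample
assert all(k not in keys_new for k in keys if k < 100)
# key preservation: sampled keys still present in the view stay sampled
assert all(k in keys_new for k in keys if k >= 100)
```

Because the hash is deterministic, re-running η on the updated view reproduces exactly the keys of the old sample (plus an m-fraction of the inserted keys), which is what makes the two samples corresponding rather than independent.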
5. QUERY RESULT ESTIMATION
SVC returns two corresponding samples, Ŝ and Ŝ′. Ŝ is a "dirty" sample (a sample of the stale view) and Ŝ′ is a "clean" sample (a sample of the up-to-date view). In this section, we first discuss how to estimate query results using the two corresponding samples. Then, we discuss the bounds and guarantees on different classes of aggregate queries. Suppose we have an aggregate query q of the following form:

q(View) := SELECT f(attr) FROM View WHERE cond(*)

We quantify the staleness c of the aggregate query result as the difference between the query applied to the stale view S and to the up-to-date view S′:

q(S′) = q(S) + c

The objective of this work is to estimate q(S′). In the Approximate Query Processing (AQP) literature, sample-based estimates have been well studied [4,41]. This inspires our first estimation algorithm, SVC+AQP, which uses SVC to materialize a sample view and applies an AQP-style result estimation technique.

SVC+AQP:
Given a clean sample view Ŝ′, the query q, and a scaling factor s, we apply the query to the sample and scale it by s:

q(S′) ≈ s · q(Ŝ′)

For example, for sum and count the scaling factor is 1/m. For avg the scaling factor is 1. Refer to [4,41] for a detailed discussion of scaling factors.

SVC+AQP returns what we call a direct estimate of q(S′). We could, however, try to estimate c instead. Since we have the stale view S, we can run the query q on the full stale view and estimate the difference c using the samples Ŝ and Ŝ′. We call this approach SVC+CORR, as it calculates a correction to q(S) instead of a direct estimate.

SVC+CORR:
Given a clean sample Ŝ′, its corresponding dirty sample Ŝ, a query q, and a scaling factor s:
1. Apply SVC+AQP to Ŝ′: r_est_fresh = s · q(Ŝ′)
2. Apply SVC+AQP to Ŝ: r_est_stale = s · q(Ŝ)
3. Apply q to the full stale view: r_stale = q(S)
4. Take the difference between (1) and (2) and add it to (3): q(S′) ≈ r_stale + (r_est_fresh − r_est_stale)

A commonly studied property in the AQP literature is unbiasedness. An unbiased result estimate means that the expected value of the estimate over all samples of the same size is q(S′). We can prove that if SVC+AQP is unbiased (there is an AQP method that gives an unbiased result), then SVC+CORR also gives unbiased results.

LEMMA. If there exists an unbiased sample estimator for q(S′), then there exists an unbiased sample estimator for c.

PROOF SKETCH. Suppose we have an unbiased sample estimator e_q of q. Then, it follows that E[e_q(Ŝ′)] = q(S′). Substituting into the definition of c: c = E[e_q(Ŝ′)] − q(S). Applying the linearity of expectation: c = E[e_q(Ŝ′) − q(S)].

Some queries do not have unbiased sample estimators, but the bias of their sample estimators can be bounded; examples include median and percentile. A corollary to the previous lemma is that if we can bound the bias of our estimator, then we can achieve a bounded bias for c as well. (The avg query is considered conditionally unbiased in some works.)

EXAMPLE. We can formalize our earlier example query from Section 2 in terms of SVC+CORR and SVC+AQP. Suppose the initial query result is 45, new log records have since been inserted into the Log table, making the old result stale, and we are working with a sampling ratio of 5%. For SVC+AQP, we count the number of videos in the clean sample that currently have counts greater than 100 and scale that result by 1/0.05 = 20. If the count from the clean sample is 4, then the estimate for SVC+AQP is 4 · 20 = 80. For SVC+CORR, we also run SVC+AQP on the dirty sample. Suppose that there are only two videos in the dirty sample with counts above 100; then the result of running SVC+AQP on the dirty sample is 2 · 20 = 40. We take the difference of the two values, 80 − 40 = 40. This means that we should correct the old result by 40, resulting in the estimate of
45 + 40 = 85.

To bound our estimates in confidence intervals, we explore three cases: (1) aggregates that can be written as sample means, (2) aggregates that can be bounded empirically with a statistical bootstrap, and (3) min and max. For (1), sum, count, and avg can all be written as sample means: sum is the sample mean scaled by the relation size, and count is the mean of an indicator function scaled by the relation size. In this case, we can get analytic confidence intervals, which allows us to analyze the efficiency tradeoffs. In case (2), for example median, we lose this property and have to use an empirical technique to bound the results. Queries such as min and max fall into their own category as they cannot easily be bounded empirically [3], and we discuss these queries in our Technical Report [29].

The first case is aggregates that can be expressed as a sample mean (sum, count, and avg). Sample means for uniform random samples (sampling without replacement) converge to the population mean by the Central Limit Theorem (CLT). Let μ̄ be a sample mean calculated from k samples, σ² be the variance of the sample, and μ be the population mean. Then, the error (μ − μ̄) is normally distributed, N(0, σ²/k), and the confidence interval is given by:

μ̄ ± γ √(σ²/k)

where γ is the Gaussian tail probability value (e.g., 1.96 for 95%, 2.57 for 99%).

We discuss how to calculate this confidence interval in SQL for SVC+AQP. The first step is a query rewriting step where we move the predicate cond(*) into the SELECT clause (1 if true, 0 if false). Let attr be the aggregate attribute and m be the sampling ratio. We define an intermediate result trans, a table of transformed rows whose first column is the primary key and whose second column is defined in terms of the cond(*) statement and scaling. For sum:

trans = SELECT pk, 1.0/m * attr * cond(*) AS trans_attr FROM s

For count:

trans = SELECT pk, 1.0/m * cond(*) AS trans_attr FROM s

For avg, since there is no scaling, we do not need to rewrite the query:

trans = SELECT pk, attr AS trans_attr FROM s WHERE cond(*)

SVC+AQP:
The confidence interval on this result is defined as:
SELECT γ * stdev(trans_attr) / sqrt(count(1)) FROM trans
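A minimal Python sketch of the two estimators on the running example (the counts 4 and 2 and the stale result 45 mirror the worked example above; s = 1/m for a count query):

```python
def svc_aqp(q_clean_sample, m):
    """Direct estimate: scale the sample result by s = 1/m."""
    return q_clean_sample / m

def svc_corr(r_stale, q_clean_sample, q_dirty_sample, m):
    """Corrected estimate: stale result plus the estimated change c."""
    return r_stale + (q_clean_sample - q_dirty_sample) / m

m = 0.05
direct = svc_aqp(4, m)              # 4 qualifying rows in the clean sample
corrected = svc_corr(45, 4, 2, m)   # stale result 45; 2 qualifying dirty rows

assert direct == 80.0 and corrected == 85.0
```

Note that when the clean and dirty samples agree (no change to the qualifying rows), the correction term vanishes and SVC+CORR simply returns the stale result.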
To calculate the confidence intervals for SVC+CORR, we have to look at the statistics of the difference, i.e., c = q(S′) − q(S), from a sample. If every row in Ŝ had a counterpart in Ŝ′, we could use the associativity of addition and subtraction to rewrite this as c = q(S′ − S), where − is the row-by-row difference between S′ and S. The challenge is that the missing rows on either side make this ill-defined. Thus, we define the following null-handling subtraction operator, which we call ˙−.

DEFINITION (CORRESPONDENCE SUBTRACT). Given an aggregate query and two corresponding relations R₁ and R₂ with the schema (a₁, a₂, ...), where a₁ is the primary key for R₁ and R₂, and a₂ is the aggregation attribute for the query, ˙− is defined as a projection of the full outer join on equality of R₁.a₁ = R₂.a₁:

Π_{R₁.a₂ − R₂.a₂}(R₁ ⟗ R₂)

Null values ∅ are treated as zero. Using this operator, we can define a new intermediate result diff:

diff := trans(Ŝ′) ˙− trans(Ŝ)

SVC+CORR:
Then, as in SVC+AQP, we bound the result using the CLT:
SELECT γ * stdev(trans_attr) / sqrt(count(1)) FROM diff
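A small Python sketch (our own illustration) of the ˙− operator as a full outer join on the primary key with nulls treated as zero, followed by the CLT half-width on the differences:

```python
import math

def corr_subtract(trans_clean, trans_dirty):
    """Null-handling row-wise subtraction: full outer join on the
    primary key, treating a missing row's value as zero."""
    keys = set(trans_clean) | set(trans_dirty)
    return {k: trans_clean.get(k, 0.0) - trans_dirty.get(k, 0.0) for k in keys}

def clt_interval(values, gamma=1.96):
    """Half-width of the CLT confidence interval: gamma * sqrt(var / n)."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    return gamma * math.sqrt(var / n)

# trans(S_hat') and trans(S_hat) as pk -> trans_attr maps (illustrative values)
trans_clean = {1: 20.0, 2: 0.0, 4: 20.0}
trans_dirty = {1: 20.0, 3: 20.0}
diff = corr_subtract(trans_clean, trans_dirty)
assert diff == {1: 0.0, 2: 0.0, 3: -20.0, 4: 20.0}
half_width = clt_interval(list(diff.values()))
```

The key 3 appears only in the dirty side (a superfluous row) and key 4 only in the clean side (an inserted row); treating the missing counterpart as zero is exactly what makes the difference well-defined.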
In terms of these bounds, we can analyze how SVC+AQP compares to SVC+CORR for a fixed sample size k. SVC+AQP gives an estimate whose variance is proportional to the variance of the clean view: σ²_{S′}/k. SVC+CORR's variance is proportional to the variance of the differences: σ²_c/k. Since the change is the difference between the stale and up-to-date views, this can be rewritten as:

(σ²_S + σ²_{S′} − 2·cov(S, S′)) / k

Therefore, a correction will have less variance when:

σ²_S ≤ 2·cov(S, S′)

As we saw in the previous section, correspondence correlates the samples. If the difference is small, i.e., S is nearly identical to S′, then cov(S, S′) ≈ σ²_S. This result also shows that there is a point where the updates to the stale MV are significant enough that direct estimates are more accurate; when we cross that break-even point, we can switch from SVC+CORR to SVC+AQP. SVC+AQP does not depend on cov(S, S′), which is a measure of how much the data has changed, and thus it guarantees an estimate variance of at most σ²_{S′}/k. In our experiments (Figure 6(b)), we evaluate this break-even point empirically.

Let p be the selectivity of the query and k be the sample size; that is, a fraction p of the records in the relation satisfy the predicate. For such queries, we can model selectivity as a reduction of the effective sample size to k·p, making the estimate variance O(1/(k·p)). Thus, the confidence interval's size is scaled up by 1/√p. Just as there is a tradeoff between accuracy and maintenance cost, for a fixed accuracy there is also a tradeoff between answering more selective queries and maintenance cost.

Optimality in unbiased estimation theory is defined in terms of the variance of the estimate [13].

PROPOSITION. An estimator is called a minimum variance unbiased estimator (MVUE) if it is unbiased and the variance of the estimate is less than or equal to that of any other unbiased estimator.
A sampled relation R defines a discrete distribution. It is important to note that this distribution is different from the data-generating distribution: even if R has continuous-valued attributes, R still defines a discrete distribution, since our population is finite and we take a finite sample, so every sample takes on only a discrete set of values. In the general case, this distribution is described only by the set of all of its values (i.e., there is no smaller parametrized representation). In this setting, the sample mean is an MVUE. In other words, if we make no assumptions about the underlying distribution of values in R, SVC+AQP and SVC+CORR are optimal for their respective estimates (q(S′) and c). Since they estimate different quantities, even with optimality SVC+CORR might be more accurate than SVC+AQP and vice versa. There are, however, cases where the assumptions of this optimality condition, namely zero knowledge, do not hold. As a simple counterexample, if we knew our data lay exactly on a line, a sample of size two would be sufficient to answer any aggregate query. However, even for many parametric distributions, the sample mean estimators are still MVUEs, e.g., Poisson, Bernoulli, binomial, normal, and exponential. It is often difficult, and in many cases unknown how, to derive an MVUE other than a sample mean. Our approach is valid for any choice of estimator if one exists, even though we carry out the analysis for sample mean estimators, the setting in which that estimator is optimal.

In the second case, we explore bounding queries that cannot be expressed as sample means. We do not get analytic confidence intervals on our results, nor is it guaranteed that our estimates are optimal. In AQP, the commonly used technique to empirically bound results is the statistical bootstrap [4]. In this approach, we repeatedly subsample with replacement from our sample and apply the query to each subsample.
This gives us a technique to bound SVC+AQP, the details of which can be found in [3,4,52]. For SVC+CORR, we propose a variant of the bootstrap to bound the estimate of c: we repeatedly estimate c from subsamples and build an empirical distribution for c.

SVC+CORR:
To use the bootstrap to find a 95% confidence interval:
1. Subsample Ŝ′_sub and Ŝ_sub with replacement from Ŝ′ and Ŝ, respectively.
2. Apply SVC+AQP to Ŝ′_sub and Ŝ_sub.
3. Record the difference s · (aqp(Ŝ′_sub) − aqp(Ŝ_sub)).
4. Return to 1 for k iterations.
5. Return the 2.5% and the 97.5% percentiles of the distribution of results.
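The bootstrap procedure above can be sketched as follows (a simplified illustration for a sum query with scaling factor s; the sample values are synthetic):

```python
import random

def bootstrap_ci(clean, dirty, s, iters=1000, seed=0):
    """Empirical 95% interval for the correction c: resample both
    corresponding samples with replacement and record the scaled difference."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(iters):
        c_sub = [rng.choice(clean) for _ in clean]   # subsample of the clean sample
        d_sub = [rng.choice(dirty) for _ in dirty]   # subsample of the dirty sample
        diffs.append(s * (sum(c_sub) - sum(d_sub)))
    diffs.sort()
    return diffs[int(0.025 * iters)], diffs[int(0.975 * iters)]

gen = random.Random(1)
clean = [gen.gauss(10, 2) for _ in range(200)]   # clean sample values
dirty = [v - 1.0 for v in clean]                 # stale values, shifted down by 1
lo, hi = bootstrap_ci(clean, dirty, s=1.0)       # empirical interval for c
assert lo <= hi
```

The returned percentiles are the 2.5% and 97.5% points of the empirical distribution of c, so the interval covers the true correction with roughly 95% probability under the usual bootstrap assumptions.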
6. OUTLIER INDEXING
Sampling is known to be sensitive to outliers [7,10], and power laws and other long-tailed distributions are common in practice [10]. The basic idea of outlier indexing is to build an index of outlier records (records whose attribute values deviate greatly from the mean) and to ensure that these records are included in the sample, since such records greatly increase the variance of the data.
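To see why long tails hurt sampling, the following sketch (synthetic Pareto-distributed data, our own illustration) estimates a sum from repeated 1% samples and reports the median relative error:

```python
import random

rng = random.Random(42)
# long-tailed attribute: most values small, a few enormous outliers
data = [rng.paretovariate(1.2) for _ in range(10000)]

m = 0.01                      # 1% sampling ratio
true_sum = sum(data)
errors = []
for _ in range(200):
    sample = rng.sample(data, int(m * len(data)))
    est = sum(sample) / m     # scaled-up sum estimate
    errors.append(abs(est - true_sum) / true_sum)

print(f"median relative error of a 1% sample: {sorted(errors)[100]:.2%}")
```

Whether a given trial's sample happens to contain one of the few huge values dominates the estimate, which is exactly the variance that pinning the outliers into the sample removes.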
The first step is that the user selects an attribute of any base relation to index and specifies a threshold t and a size limit k. In a single pass over the updates (without maintaining the view), the index is built, storing references to the records with attribute values greater than t. If the size limit is reached, the incoming record is compared to the smallest indexed record, and if it is greater, we evict the smallest record. The same approach can be extended to attributes with tails in both directions by making the threshold t a range that captures the highest and the lowest values; in this section, however, we present the technique as a single threshold for clarity.

There are many approaches to selecting a threshold. We can use prior information from the base table, a calculation which can be done in the background during the periodic maintenance cycles. If our size limit is k, for a given attribute we can select the top-k records on that attribute, and the attribute value of the lowest such record becomes the threshold t. Alternatively, we can calculate the variance of the attribute and set the threshold at c standard deviations above the mean. This threshold can be adaptively reset at each maintenance period.

Given this index, the next question is how we can use this information in our materialized views. We need to propagate the indices upwards through the expression tree. We add the condition that the only eligible indices are ones on base relations that are being sampled (i.e., where we can push the hash operator down to that relation). Therefore, in the same iteration as sampling, we can also test the index threshold and add records to the outlier index. We formalize the propagation property recursively: every relation can have an outlier index, which is a set of attributes and a set of records that exceed the threshold value on those attributes.
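The size-limited, threshold-based index described above can be sketched as follows (our own illustration; the heap-based eviction is one natural way to track the smallest indexed record):

```python
import heapq

class OutlierIndex:
    """Size-limited index of records whose attribute exceeds a threshold t.
    When full, the smallest indexed record is evicted if a larger one arrives."""
    def __init__(self, t, limit):
        self.t, self.limit = t, limit
        self._n = 0       # insertion counter: tie-breaker so records never compare
        self._heap = []   # min-heap of (attr, seq, record)

    def insert(self, attr, record):
        if attr <= self.t:
            return                            # below threshold: not an outlier
        entry = (attr, self._n, record)
        self._n += 1
        if len(self._heap) < self.limit:
            heapq.heappush(self._heap, entry)
        elif attr > self._heap[0][0]:         # beats the smallest indexed record
            heapq.heapreplace(self._heap, entry)

    def attrs(self):
        return sorted(a for a, _, _ in self._heap)

idx = OutlierIndex(t=90.0, limit=3)
for v in [50.0, 95.0, 99.0, 91.0, 97.0, 100.0]:
    idx.insert(v, {"duration": v})
assert idx.attrs() == [97.0, 99.0, 100.0]   # three largest above-threshold values
```

The single-pass insert costs O(log k) per update, so maintaining the index adds little overhead on top of sampling.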
The main idea is to treat the indexed records as a sub-relation that gets propagated upwards with the maintenance strategy.

DEFINITION (OUTLIER INDEX PUSH-UP). Define an outlier index to be a tuple (I, O) of a set of indexed attributes I and a set of records O. The outlier index propagates upwards with the following rules:
• Base relations: outlier indices on base relations are pushed up only if that relation is being sampled, i.e., if the sampling operator can be pushed down to that relation.
• σ_φ(R): push up a new outlier index, applying the selection to the outliers: (I, σ_φ(O)).
• Π_{(a₁,...,a_k)}(R): push up (I ∩ {a₁, ..., a_k}, O).
• R₁ ⋈_φ R₂: push up (I₁ ∪ I₂, O₁ ⋈ O₂).
• γ_{f,A}(R): for group-by aggregates, we set I to be the aggregation attribute. For the outlier index, we do the following steps: (1) apply the aggregation to the outlier index, γ_{f,A}(O); (2) for each distinct A in O, select the row in γ_{f,A}(R) with the same A; (3) this selection is the new set of outliers O.
• R₁ ∪ R₂: push up a new outlier index (I₁ ∩ I₂, O₁ ∪ O₂). The sets of indexed attributes are combined with an intersection to avoid missed outliers.
• R₁ ∩ R₂: push up a new outlier index (I₁ ∩ I₂, O₁ ∩ O₂).
• R₁ − R₂: push up a new outlier index (I₁ ∪ I₂, O₁ − O₂).

For all outlier indices that can propagate to the view (i.e., to the top of the tree), we get a final set O of records. Given these rules, O is, in fact, a subset of our materialized view S′. Thus, our query processing can take advantage of the theory described in the previous section to incorporate the set O into our results. We implement the outlier index as an additional boolean attribute on our sample, flagging whether a row is an outlier-indexed record. If a row is contained both in the sample and in the outlier index, the outlier index takes precedence. This ensures that we do not double-count the outliers. For result estimation, we can think of our sample Ŝ′ and our outlier index O as two distinct parts.
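Two of the propagation rules above (selection and union) can be sketched as follows; the (I, O) tuples and attribute names are illustrative:

```python
def push_selection(index, pred):
    """sigma_phi(R): keep the indexed attributes, filter the outlier records."""
    attrs, outliers = index
    return (attrs, [o for o in outliers if pred(o)])

def push_union(index1, index2):
    """R1 ∪ R2: intersect the attribute sets, union the outlier records."""
    (a1, o1), (a2, o2) = index1, index2
    return (a1 & a2, o1 + o2)

ix1 = ({"duration"}, [{"duration": 5.0}, {"duration": 9.0}])
ix2 = ({"duration", "size"}, [{"duration": 7.0}])

# selection keeps only the outliers that satisfy the predicate
pushed = push_selection(ix1, lambda o: o["duration"] > 6)
assert pushed == ({"duration"}, [{"duration": 9.0}])

# union intersects the indexed attributes but keeps all outlier records
merged = push_union(ix1, ix2)
assert merged[0] == {"duration"} and len(merged[1]) == 3
```

Intersecting the attribute sets on union is the conservative choice: an attribute remains indexed only if both inputs indexed it, so no outlier on that attribute can be missed.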
Since O ⊂ S′, and we give membership in the outlier index precedence, our sample is actually a sample restricted to the set S′ − O. For a given query, let c_reg be the correction calculated on the sample of S′ − O using the technique proposed in the previous section, adjusting the sampling ratio m to account for outliers removed from the sample. We can also apply the technique to the outlier set O; since this set is deterministic, its sampling ratio is m = 1, and we call the result c_out. Let N be the count of records that satisfy the query's condition and l be the number of outliers that satisfy the condition. Then, we can merge the two corrections in the following way:

v = ((N − l)/N) · c_reg + (l/N) · c_out

For the queries in the previous section that are unbiased, this approach preserves unbiasedness: since we are averaging two unbiased estimates c_reg and c_out, the linearity of the expectation operator preserves this property. Furthermore, since c_out is deterministic (its bias and variance are 0), c_reg and c_out are uncorrelated, making the bounds described in the previous section applicable as well.

EXAMPLE. We choose an attribute in the base data to index, for example duration, with an example threshold of 1.5 hours. We apply the rules to push the index up, and this materializes the entire set of rows whose duration is longer than 1.5 hours. For SVC+AQP, we run the query on the set of clean rows with durations longer than 1.5 hours, and then use the update rule in Section 6.3 to update the result based on the number of records in the index and the total size of the view. For SVC+CORR, we additionally run the query on the set of dirty rows with durations longer than 1.5 hours and take the difference from the SVC+AQP result; as in SVC+AQP, we use the update rule in Section 6.3 to update the result based on the number of records in the index and the total size of the view.
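The merging rule can be sketched directly (hypothetical values for c_reg, c_out, N, and l):

```python
def merge_corrections(c_reg, c_out, N, l):
    """v = ((N - l) / N) * c_reg + (l / N) * c_out:
    weight the sampled correction and the deterministic outlier
    correction by how many qualifying rows each part covers."""
    return ((N - l) / N) * c_reg + (l / N) * c_out

# hypothetical numbers: 1000 qualifying rows, 50 covered by the outlier index
v = merge_corrections(c_reg=12.0, c_out=40.0, N=1000, l=50)
assert abs(v - (0.95 * 12.0 + 0.05 * 40.0)) < 1e-9
```

When l = 0 (no qualifying outliers) the rule reduces to the plain sampled correction, and when l = N it returns the exact deterministic value.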
7. RESULTS
We first evaluate SVC on a single-node MySQL database, measuring its accuracy, performance, and efficiency in a variety of materialized view scenarios. Then, we evaluate the outlier indexing approach in terms of improved query accuracy, and we measure the overhead associated with using the index. After the benchmark evaluation, we present an application to server log analysis with a dataset from a video streaming company, Conviva.
Single-node Experimental Setup:
Our single-node experiments are run on an r3.large Amazon EC2 node (2x Intel Xeon E5-2670, 15.25 GB memory, and 32GB SSD disk) with a MySQL version 5.6.15 database. These experiments evaluate views from a 10GB TPCD-Skew dataset. TPCD-Skew [8] is based on the Transaction Processing Council's benchmark schema (TPCD), but modified so that it generates a dataset with values drawn from a Zipfian distribution instead of uniformly. The Zipfian distribution [37] is a long-tailed distribution with a single parameter z ∈ {1, 2, 3, 4}, where a larger value means a more extreme tail and z = 1 corresponds to the basic TPCD benchmark. In our experiments, we use z = 2 unless otherwise noted. The incremental maintenance algorithm used in our experiments is the "change-table" or "delta-table" method used in numerous works on incremental maintenance [22,23,27]. In all of the applications, the updates are kept in memory in a temporary table, and we discount this loading time from our experiments. We build an index on the primary keys of the view, but not on the updates. Below we describe the view definitions and the queries on the views:

Join View:
In the TPCD specification, two tables receive insertions and updates: lineitem and orders. Out of the 22 parametrized queries in the specification, 12 are group-by aggregates of the join of lineitem and orders (Q3, Q4, Q5, Q7, Q8, Q9, Q10, Q12, Q14, Q18, Q19, Q21). Therefore, we define a materialized view of the foreign-key join of lineitem and orders, and compare incremental view maintenance and SVC. We treat the 12 group-by aggregates as queries on the view. (Refer to our extended paper for more details about the experimental setup [29].)

Complex Views:
Our goal is to demonstrate the applicability of SVC beyond simple materialized views, including views with nested queries and other more complex relational algebra. We take the TPCD schema, denormalize the database, and treat each of the 22 TPCD queries as views on this denormalized schema. The 22 TPCD queries are actually parametrized queries where parameters, such as the selectivity of the predicate, are randomly set by the TPCD qgen program. Therefore, we use the program to generate 10 random instances of each query and use each random instance as a materialized view. 10 out of the 22 sets of views can benefit from SVC. Of the 12 excluded views, 3 were static (i.e., there are no updates to the view under the TPCD workload), and the remaining 9 views have a small cardinality, making them unsuitable for sampling.

For each of the views, we generated queries on the views. Since the outer queries of our views were group-by aggregates, we picked a random attribute a from the group-by clause and a random attribute b from the aggregation. We use a to generate a predicate. For each attribute a, the domain is specified in the TPCD standard, and we select a random subset of this domain, e.g., if the attribute is country then the predicate can be a range of countryCode values. We generated 100 random sum, avg, and count queries for each view.
We evaluate SVC on ApacheSpark 1.1.0 with 1TB of logs from a video streaming company,Conviva [1]. This is a denormalized user activity log correspondingto video views and various metrics such as data transfer rates, andlatencies. Accompanying this data is a four month trace of queriesin SQL. We identified 8 common summary statistics-type queriesthat calculated engagement and error-diagnosis metrics. These 8queries defined the views in our experiments. We populated theseview definitions using the first 800GB of user activity log records.We then applied the remaining 200GB of user activity log recordsas the updates (i.e., in the order they arrived) in our experiments.We generated aggregate random queries over this view by takingeither random time ranges or random subsets of customers.
No maintenance (Stale):
The baseline for evaluation is not ap-plying any maintenance to the materialized view.
Incremental View Maintenance (IVM):
We apply incrementalview maintenance (change-table based maintenance [22,23,27]) tothe full view.
SVC+AQP:
We maintain a sample of the materialized view usingSVC and estimate the result with AQP-style estimation technique.
SVC+CORR:
We maintain a sample of the materialized view using SVC and process queries on the view using the correction, which applies AQP to both the clean and dirty samples and uses both estimates to correct a stale query result.

Since SVC has a sampling parameter, we denote a sample size of x% as SVC+CORR-x or SVC+AQP-x, respectively. To evaluate accuracy and performance, we define the following metrics:

Relative Error:
For a query result r and an incorrect result r′, the relative error is |r − r′| / |r|. When a query has multiple results (a group-by query), then, unless otherwise noted, the relative error is defined as the median over all the per-group errors.
Maintenance Time:
We define the maintenance time as the time needed to produce the up-to-date view in incremental view maintenance, and the time needed to produce the up-to-date sample in SVC.
In our first experiment, we evaluate how SVC performs on a ma-terialized view of the join of lineitem and orders . We generate a10GB base TPCD-Skew dataset with skew z = 2 , and derive theview from this dataset. We first generate 1GB (10% of the basedata) of updates (insertions and updates to existing records), andvary the sample size. Performance:
Figure 4(a) shows the maintenance time of SVC as a function of sample size. The bolded dashed line marks the time for full IVM. For this materialized view, sampling allows for significant savings in maintenance time, albeit for approximate answers: while full incremental maintenance takes 56 seconds, SVC with a 10% sample completes in 7.5 seconds.

In the next figure, Figure 4(b), we evaluate this speedup. We fix the sample size to 10% and plot the speedup of SVC compared to IVM while varying the size of the updates. On the x-axis is the update size as a percentage of the base data. For small update sizes, the speedup is smaller: 6.5x for a 2.5% (250MB) update size. As the update size gets larger, SVC becomes more efficient; for a 20% update size (2GB), the speedup is 10.1x. The super-linearity arises because this view is a join of lineitem and orders and we assume that there is no join index on the updates. Since both tables are growing, sampling reduces computation super-linearly.
Accuracy:
At the same design point with a 10% sample, weevaluate the accuracy of SVC. In Figure 5, we answer TPCDqueries with this view. The TPCD queries are group-by aggregatesand we plot the median relative error for SVC+CORR, No Main-tenance, and SVC+AQP. On average over all the queries, we foundthat SVC+CORR was 11.7x more accurate than the stale baseline,and 3.1x more accurate than applying SVC+AQP to the sample.
SVC+CORR vs. SVC+AQP:
While more accurate, SVC+CORR moves some of the computation from maintenance to query execution: it calculates a correction to a query on the full materialized view, so on top of the query time on the full view (as in IVM) there is additional time to calculate a correction from a sample. SVC+AQP, on the other hand, runs a query only on the sample of the view. We evaluate this overhead in Figure 6(a), where we compare the total maintenance time and query execution time. For a 10% sample, SVC+CORR required 2.69 secs to execute a sum over the whole view, IVM required 2.45 secs, and SVC+AQP required 0.25 secs. However, compared to the savings in maintenance time, this overhead is small.

SVC+CORR is most accurate when the materialized view is less stale, as predicted by our analysis in Section 5.2.2; its error grows in proportion to the staleness. SVC+AQP, on the other hand, is more robust to staleness and gives a consistent relative error. In Figure 6(b), we explore which query processing technique, SVC+CORR or SVC+AQP, should be used. For a 10% sample, we find that SVC+CORR is more accurate until the update size reaches 32.5% of the base data.
Figure 4: (a) On a 10GB view with 1GB of insertions and updates, we vary the sampling ratio and measure the maintenance time of SVC. (b) For a fixed sampling ratio of 10%, we vary the update size and plot the speedup compared to full incremental maintenance.

Figure 5: For a fixed sampling ratio of 10% and update size of 10% (1GB), we generate 100 of each of the TPCD parameterized queries and answer the queries using the stale materialized view, SVC+CORR, and SVC+AQP. We plot the median relative error for each query.

In this experiment, we demonstrate the breadth of views supported by SVC by using the TPCD queries as materialized views. We generate a 10GB base TPCD-Skew dataset with skew z = 2, and derive the views from this dataset. We first generate 1GB (10% of the base data) of updates (insertions and updates to existing records), and vary the sample size. Figure 7 shows the maintenance time for a 10% sample compared to the full view. This experiment illustrates how the view definitions play a role in the efficiency of our approach. For the last two views, V21 and V22, we see that sampling does not lead to as large a speedup as indicated in our previous experiments. This is because both of those views contain nested structures which block the pushdown of hashing. V21 contains a subquery in its predicate that does not involve the primary key, but still requires a scan of the base relation to evaluate.
V22 contains a string transformation of a key, blocking the push-down. These results are consistent with our previous experiments, showing that SVC is faster than IVM, and more accurate than SVC+AQP and no maintenance.
In our next experiment, we evaluate our outlier indexing with the top-k strategy described in Section 6. In this setting, outlier indexing significantly helps both SVC+AQP and SVC+CORR. We index the l_extendedprice attribute in the lineitem table and evaluate the outlier index on the complex TPCD views. We find that four views, V3, V5, V10, and V15, can benefit from this index with our push-up rules; these are the four views dependent on l_extendedprice that were also in the set of "Complex" views chosen before. In our first outlier indexing experiment (Figure 8(a)), we analyze V3. We set an index of 100 records, and applied SVC+CORR and SVC+AQP to views derived from a dataset with skew parameter z ∈ {1, 2, 3, 4}. We run the same queries as before, but this time we measure the error at the 75% quartile. We find that on the most skewed data, SVC with outlier indexing reduces query error by a factor of 2. Next, in Figure 8(b), we plot the overhead of outlier indexing for V3 with an index size of 0, 10, 100, and 1000. While there is an overhead, it is still small compared to the gains made by sampling the maintenance strategy. We note that none of the prior experiments used an outlier index. The caveat is that those experiments were done with moderately skewed data (Zipfian parameter z = 2); if this parameter is set to 4, then the 75% quartile query estimation error is nearly 20% (Figure 8(a)). Outlier indexing always improves query results, as we are reducing the variance of the estimation set; however, this reduction in variance is largest when there is a longer tail.

Figure 6: (a) For a fixed sampling ratio of 10% and update size of 10% (1GB), we measure the total time: incremental maintenance + query time. (b) SVC+CORR is more accurate than SVC+AQP until a break-even point.

Figure 7: (a) For a 1GB update size, we compare the maintenance time and accuracy of SVC with a 10% sample on different views. V21 and V22 do not benefit as much from SVC due to nested query structures. (b) For a 10% sample size and 10% update size, SVC+CORR is more accurate than SVC+AQP and No Maintenance.
We derive the views from 800GB of base data and add 80GB of updates. These views are stored and maintained using Apache Spark in a distributed environment. The goal of this experiment is to evaluate how SVC performs in a real-world scenario with a real dataset and a distributed architecture. In Figure 9(a), we show that, averaged over all the views, SVC-10% gives a 7.5x speedup. For one of the views, full incremental maintenance takes nearly 800 seconds even on a 10-node cluster, which is a very significant cost. In Figure 9(b), we show that SVC also gives highly accurate results, with an average error of 0.98%. These results are consistent with our results on the synthetic datasets. This experiment highlights a few salient benefits of SVC: (1) sampling is a relatively cheap operation, and the relative speedups in single-node and distributed environments are similar, (2) for analytic workloads like Conviva (i.e., user engagement analysis) a 10% sample gives results with 99% accuracy, and (3) savings are still significant in systems like Spark that do not support selective updates.
In our next experiment, we evaluate an aggregate view use case similar to a data cube. We generate a 10GB base TPCD dataset with skew z = 1, and derive the base cube as a materialized view from this dataset. We add 1GB of updates and apply SVC to estimate the results of all of the "roll-up" dimensions.

Figure 8: (a) For one view V3 and 1GB of updates, we plot the 75% quartile error with different techniques as we vary the skewness of the data. (b) While the outlier index adds an overhead, it is small relative to the total maintenance time.

Figure 9: (a) We compare the maintenance time of SVC with a 10% sample and full incremental maintenance, and find that, as with TPCD, SVC saves significant maintenance time. (b) We also evaluate the accuracy of the estimation techniques.

Performance:
We observed the same trade-off as in the previous experiment: sampling significantly reduces the maintenance time (Figure 10(a)). It takes 186 seconds to maintain the entire view, but a 10% sample can be maintained in 26 seconds. As before, we fix the sample size at 10% and vary the update size. We similarly observe that SVC becomes more efficient as the update size grows (Figure 10(b)); at an update size of 20%, the speedup is 8.7x.
Accuracy:
In Figure 11, we measure the accuracy of each of the "roll-up" aggregate queries on this view; that is, we take each dimension and aggregate over it. We fix the sample size at 10% and the update size at 10%. On average, SVC+Corr is 12.9x more accurate than the stale baseline and 3.6x more accurate than SVC+AQP (Figure 11). Since the data cubing operation is primarily constructed from group-by aggregates, we can also measure the max error for each of the aggregates. We see that while the median staleness is close to 10%, for some queries some of the group aggregates have nearly 80% error (Figure 12). SVC greatly mitigates this error to less than 12% for all queries.
Other Queries:
Finally, we also use the data cube to illustrate how SVC can support a broader range of queries beyond sum, count, and avg. We change all of the roll-up queries to use the median function (Figure 13). Both SVC+Corr and SVC+AQP are more accurate at estimating the median than they were at estimating sums, because the median is less sensitive to variance in the data.

Figure 10: (a) In the aggregate view case, sampling can save significant maintenance time. (b) As the update size grows, SVC tends toward an ideal speedup of 10x.

Figure 11: We measure the accuracy of each of the roll-up aggregate queries on this view. For a 10% sample size and 10% update size, we find that SVC+Corr is more accurate than SVC+AQP and No Maintenance.
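The robustness of the median noted above has a simple illustration (a generic sketch, not the paper's estimator): the median of a uniform sample is a consistent estimator of the view's median, and a single extreme value barely moves it, whereas it would dominate a scaled-up sampled sum.

```python
import random
import statistics

def estimate_median(view_values, sample_ratio, seed=0):
    """Estimate the median of a numeric column from a uniform sample of
    the up-to-date rows; the sample median is far less sensitive to
    extreme values than a scaled-up sampled sum."""
    rng = random.Random(seed)
    sample = [v for v in view_values if rng.random() < sample_ratio]
    return statistics.median(sample) if sample else None
```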
We devised an end-to-end experiment simulating a real integration with periodic maintenance. However, unlike the MySQL case, Apache Spark does not support selective updates and insertions, as the "views" are immutable. A further point is that the immutability of these views and Spark's fault tolerance require that the "views" be maintained synchronously. Thus, to avoid these significant overheads, we have to update these views in batches. Spark does have a streaming variant [51]; however, it does not support the complex SQL-derived materialized views used in this paper, and still relies on mini-batch updates. SVC and IVM run in separate threads, each with its own RDD materialized view. In this application, both SVC and IVM maintain their respective RDDs with batch updates. In this model, there are many parameters: the batch size for periodic maintenance, the batch size for SVC, the sampling ratio for SVC, and the fact that concurrent threads may reduce overall throughput. Our goal is to fix the throughput of the cluster, and then measure whether SVC+IVM or IVM alone leads to more accurate query answers.
Batch sizes:
In Spark, larger batch sizes amortize overheads better. In Figure 14(a), we show a trade-off between batch size and throughput of Spark for V2 and V5. Throughputs for small batches are nearly 10x smaller than the throughputs for the larger batches.
Concurrent SVC and IVM:
Next, we measure the reduction in throughput when running multiple threads. We run SVC-10 in a loop in one thread and IVM in another, and measure the reduction in cluster throughput relative to the previous batch-size experiment. In Figure 14(b), we plot the throughput against batch size when two maintenance threads are running. While for small batch sizes the throughput of the cluster is reduced by nearly a factor of 2, for larger sizes the reduction is smaller. As we found in later experiments (Figure 16), larger batch sizes are more amenable to parallel computation since there is more idle CPU time.
Choosing a Batch Size:
The results in Figure 14(a) and Figure 14(b) show that larger batch sizes are more efficient; however, larger batch sizes also lead to more staleness. Combining the results in Figure 14(a) and Figure 14(b), for both SVC+IVM and IVM, we get cluster throughput as a function of batch size. For a fixed throughput, we want to find the smallest batch size that achieves that throughput for both. For V2, we fixed this at 700,000 records/sec, and for V5 it was 500,000 records/sec. For IVM alone, the smallest batch size that met this throughput demand was 40GB for both V2 and V5; for SVC+IVM, the smallest batch size was 80GB for V2 and 100GB for V5. When running periodic maintenance alone, view updates can be more frequent; when it runs in conjunction with SVC, they are less frequent.
We run both of these approaches, SVC+IVM and IVM, in a continuous loop and measure their maximal error during a maintenance period. There is a further trade-off with the sampling ratio: larger samples give more accurate estimates, but they grow stale between SVC batches. We quantify the error in these approaches with the max error, that is, the maximum error in a maintenance period (Figure 15). These competing objectives lead to an optimal sampling ratio of 3% for V2 and 6% for V5. At this sampling point, we find that applying SVC gives results 2.8x more accurate for V2 and 2x more accurate for V5.
To give some intuition on why SVC gives more accurate results, in Figure 16 we plot the average CPU utilization of the cluster for both periodic IVM and SVC+periodic IVM. We find that SVC takes advantage of the idle times in the system, which are common during shuffle operations in a synchronous parallelism model.
In a way, these experiments present a worst-case application for SVC, yet it still gives improvements in terms of query accuracy. In many typical deployments, throughput demands are variable, forcing maintenance periods to be longer, e.g., nightly. The same way that SVC takes advantage of micro idle times during communication steps, it can provide large gains during controlled idle times when no maintenance is running concurrently.

Figure 12: For 1GB of updates, we plot the max error as opposed to the median error in the previous experiments. Even though updates are 10% of the dataset size, some queries are nearly 80% incorrect. SVC significantly mitigates this error.

Figure 13: We run the same experiment but replace the sum query with a median query. We find that SVC is similarly more accurate.
Figure 14: (a) Spark RDDs are most efficient when updated in batches. As batch sizes increase, the system throughput increases. (b) When running multiple threads, the throughput is reduced. However, larger batches are less affected by this reduction.

Figure 15: For a fixed throughput, SVC+Periodic Maintenance gives more accurate results for V2 and V5.

8. RELATED WORK
Addressing the cost of materialized view maintenance is the subject of many recent papers, which focus on perspectives including complex analytical queries [38], transactions [5], real-time analytics [34], and physical design [33]. The increased research focus parallels a major concern in industrial systems for incrementally updating pre-computed results and indices, such as Google Percolator [43] and Twitter's Rainbird [49]. The streaming community has also studied the view maintenance problem [2,18,20,24,28]. In Spark Streaming, Zaharia et al. studied how to exploit in-memory materialization [51], and in MonetDB, Liarou et al. studied how ideas from columnar storage can be applied to enable real-time analytics [34]. These works focus on correctness, consistency, and fault tolerance of materialized view maintenance. SVC proposes an alternative model for view maintenance in which we allow approximation error (with guarantees) for queries on materialized views in exchange for vastly reduced maintenance time. In many decision problems, exact results are not needed as long as the probability of error is boundable. Sampling has been well studied in the context of query processing [4,17,40]; both the problem of efficiently sampling relations [40] and that of processing complex queries [3] have been well studied. In SVC, we look at a new problem, where we efficiently sample from a maintenance strategy, a relational expression that updates a materialized view. We generalize uniform sampling procedures to work in this new context using lineage [14] and hashing. We also look at the problem of approximate query processing [3,4] from a different perspective, by estimating a "correction" rather than estimating query results. Srinivasan and Carey studied a related problem, which they called compensation-based query processing [46], in the context of concurrency control. However, that work did not consider applying the correction to a sample, as in SVC. The sampling in SVC introduces new challenges such as sensitivity to outliers, questions of bias, and estimator optimality.

Figure 16: SVC better utilizes idle times in the cluster by maintaining the sample.
Sampling has also been studied from the perspective of maintaining samples [42]. In [26], Joshi and Jermaine studied indexed materialized views that are amenable to random sampling. While similar in spirit (queries on the view are approximate), the goal of that work was to optimize query processing, not to address the cost of incremental maintenance. There has been work using sampled views in the limited context of cardinality estimation [30], which is a special case of our framework, namely the count query. Nirkhiwale et al. [39] studied an algebra for estimating confidence intervals in aggregate queries. The objective of that work is not sampling efficiency, as in SVC, but estimation. As a special case, where we consider only views constructed from select and project operators, SVC's hash pushdown yields the same results as their model. There has been theoretical work on the maintenance of approximate histograms, synopses, and sketches [12,19], which closely resemble aggregate materialized views. The objectives of this work (including techniques such as sketching and approximate counting) have been to reduce the required storage, not the required update time. Meliou et al. [36] proposed a technique to trace errors in an MV to the base data and find the responsible erroneous tuples. They do not, however, propose a technique to correct the errors as in SVC. Correcting general errors as in Meliou et al. is a hard constraint satisfaction problem. However, in SVC, through our formalization of staleness, we have a model of how updates to the base data (modeled as errors) affect MVs, which allows us to both trace errors and clean them.
Wu and Madden [50] proposed a model to correct "outliers" in an MV through deletion of records in the base data. This is a more restricted model of data cleaning than SVC, where the authors only consider changes to existing rows in an MV (no insertion or deletion) and do not handle the same generality of relational expressions (e.g., nested aggregates). Chalamalla et al. [6] proposed an approximate technique for specifying errors as constraints on a materialized view and proposing changes to the base data such that these constraints can be satisfied. While complementary, one major difference between the three works [6,36,50] and SVC is that they require an explicit specification of erroneous rows in a materialized view. Identifying whether a row is erroneous requires materialization, and thus specifying the errors is equivalent to full incremental maintenance. We use the formalism of a "maintenance strategy", the relational expression that updates the view, to allow us to sample rows that are not yet materialized. However, while not directly applicable to staleness, we see SVC as complementary to these works in the dirty-data setting. The sampling technique proposed in Section 4 of our paper could be used to approximate the data cleaning techniques in [6,36,50], and this is an exciting avenue of future work. Sampling has also been explored in the streaming community, and a similar idea of sampling from incoming updates has been applied in stream processing [16,44,47]. Some of these works studied problems similar to materialization; for example, the JetStream project (Rabkin et al. [44]) looks at how sampling can help with real-time analysis of aggregates. None of these works formally studied the class of views that can benefit from sampling or formalized queries on these views. However, there are ideas from Rabkin et al. that could be applied in SVC in future work; for example, their description of coarsening operations in aggregates is very similar to our experiments with the "roll-up" queries on aggregate views. There are a variety of other efforts proposing storage-efficient processing of aggregate queries on streams [15,21], which is similar to our problem setting and motivation.
9. LIMITATIONS AND OPPORTUNITIES
While our experiments show that SVC works for a variety of applications, there are a few limitations, which we summarize in this section. There are two primary limitations for SVC: the class of queries and the types of materialized views. In this work, we primarily focused on aggregate queries and showed that accuracy decreases as the selectivity of the query increases. Sample-based methods are fundamentally limited in the way they can support "point lookup" queries that select a single row. This is predicted by our theoretical result that accuracy degrades as p shrinks, where p is the fraction of rows that satisfy the predicate. In terms of view definitions, SVC does not support views with ordering or "top-k" clauses, as our sampling assumes no ordering on the rows of the MV and it is not clear how sampling commutes with general ordering operations. In the future, we will explore maintenance optimizations proposed in recent work; for example, DBToaster has two main components, higher-order delta processing and a SQL query compiler, both of which are complementary to SVC. SVC proposes a new approach for accurate query processing with MVs. Our results are promising and suggest many avenues for future work. In particular, we are interested in a deeper exploration of the multiple-MV setting. There are many interesting design problems, such as optimizing sampling ratios over all views given storage constraints and throughput demands. Furthermore, there is an interesting challenge around queries that join multiple sampled MVs managed by SVC. We are also interested in the possibility of sharing computation between MVs and maintenance of views derived from other views. Finally, our results suggest a relatively straightforward implementation of adaptive selection of the parameters in SVC, such as the view sampling ratio and the outlier index threshold.
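The selectivity limitation above can be seen in a small simulation (illustrative only; the function and its parameters are ours): for a predicate matched by a fraction p of the rows, a sample of ratio m contains only about m·N·p relevant rows, so the relative error of a scaled-up count grows as p shrinks.

```python
import random

def relative_count_error(n_rows, p, sample_ratio, trials=200, seed=0):
    """Mean relative error of a scaled-up sampled count for a predicate
    with selectivity p: only about n_rows * p * sample_ratio sampled rows
    match the predicate, so the error grows as p shrinks."""
    rng = random.Random(seed)
    true_count = int(n_rows * p)
    errs = []
    for _ in range(trials):
        # each matching row lands in the uniform sample independently
        hits = sum(1 for _ in range(true_count) if rng.random() < sample_ratio)
        estimate = hits / sample_ratio                 # scale up by 1/m
        errs.append(abs(estimate - true_count) / true_count)
    return sum(errs) / trials
```

Running this with p = 0.01 versus p = 0.5 at the same sampling ratio shows the selective predicate suffering markedly larger relative error.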
10. CONCLUSION
Materialized view maintenance is often expensive, and in practice, eager view maintenance is often avoided due to its costs. This leads to stale materialized views, which have incorrect, missing, and superfluous rows. In this work, we formalize the problem of staleness and view maintenance as a data cleaning problem. SVC uses a sample-based data cleaning approach to get accurate query results that reflect the most recent data at greatly reduced computational cost. To achieve this, we significantly extended our prior work in data cleaning, SampleClean [48], for efficient cleaning of stale MVs. This included processing a wider set of aggregate queries, handling missing-data errors, and proving for which queries optimality of the estimates holds. We presented both empirical and theoretical results showing that our sample-based data cleaning approach is significantly less expensive than full view maintenance for a large class of materialized views, while still providing accurate aggregate query answers that reflect the most recent data. Our results are promising and suggest many avenues for future work. In this work, we focused on aggregate queries and showed that accuracy decreases as the selectivity of the query increases. Sample-based methods are fundamentally limited in the way they can support "point lookup" queries that select a single row, and we believe we can address this problem with new results in non-parametric machine learning instead of single-parameter estimators. In particular, we are interested in a deeper exploration of the multiple-MV setting. There are also many interesting design problems, such as optimizing sampling ratios over all views given storage constraints and throughput demands.
We thank Kai Zeng for his advice and feedback on this paper. Thisresearch is supported in part by NSF CISE Expeditions Award CCF-1139158, LBNL Award 7076018, and DARPA XData Award FA8750-12-2-0331, and gifts from Amazon Web Services, Google, SAP, TheThomas and Stacey Siebel Foundation, Adatao, Adobe, Apple, Inc.,Blue Goji, Bosch, C3Energy, Cisco, Cray, Cloudera, EMC2, Ericsson,Facebook, Guavus, HP, Huawei, Informatica, Intel, Microsoft, NetApp,Pivotal, Samsung, Schlumberger, Splunk, Virdata and VMware.
11. REFERENCES
[1] Conviva.
[2] D. J. Abadi, D. Carney, U. Çetintemel, M. Cherniack, C. Convey, S. Lee, M. Stonebraker, N. Tatbul, and S. B. Zdonik. Aurora: a new model and architecture for data stream management. VLDB J., 12(2):120–139, 2003.
[3] S. Agarwal, H. Milner, A. Kleiner, A. Talwalkar, M. I. Jordan, S. Madden, B. Mozafari, and I. Stoica. Knowing when you're wrong: building fast and reliable approximate query processing systems. In SIGMOD Conference, pages 481–492, 2014.
[4] S. Agarwal, B. Mozafari, A. Panda, H. Milner, S. Madden, and I. Stoica. BlinkDB: queries with bounded errors and bounded response times on very large data. In EuroSys, pages 29–42, 2013.
[5] P. Bailis, A. Fekete, J. M. Hellerstein, A. Ghodsi, and I. Stoica. Scalable atomic visibility with RAMP transactions. In SIGMOD Conference, pages 27–38, 2014.
[6] A. Chalamalla, I. F. Ilyas, M. Ouzzani, and P. Papotti. Descriptive and prescriptive data cleaning. In SIGMOD Conference, pages 445–456, 2014.
[7] S. Chaudhuri, G. Das, M. Datar, R. Motwani, and V. R. Narasayya. Overcoming limitations of sampling for aggregation queries. In ICDE, pages 534–542, 2001.
[8] S. Chaudhuri and V. Narasayya. TPC-D data generation with skew. ftp.research.microsoft.com/users/viveknar/tpcdskew.
[9] R. Chirkova and J. Yang. Materialized views. Foundations and Trends in Databases, 4(4):295–405, 2012.
[10] A. Clauset, C. R. Shalizi, and M. E. J. Newman. Power-law distributions in empirical data. SIAM Review, 51(4):661–703, 2009.
[11] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms.
[12] G. Cormode, M. N. Garofalakis, P. J. Haas, and C. Jermaine. Synopses for massive data: Samples, histograms, wavelets, sketches. Foundations and Trends in Databases, 4(1-3):1–294, 2012.
[13] D. R. Cox and D. V. Hinkley. Theoretical Statistics. CRC Press, 1979.
[14] Y. Cui and J. Widom. Lineage tracing for general data warehouse transformations. VLDB J., 12(1):41–58, 2003.
[15] A. Dobra, M. N. Garofalakis, J. Gehrke, and R. Rastogi. Processing complex aggregate queries over data streams. In SIGMOD Conference, pages 61–72, 2002.
[16] M. Garofalakis, J. Gehrke, and R. Rastogi. Data Stream Management: Processing High-Speed Data Streams. Springer, 2011.
[17] M. N. Garofalakis and P. B. Gibbons. Approximate query processing: Taming the terabytes. In VLDB, 2001.
[18] T. M. Ghanem, A. K. Elmagarmid, P.-Å. Larson, and W. G. Aref. Supporting views in data stream management systems. ACM Transactions on Database Systems (TODS), 35(1):1, 2010.
[19] P. B. Gibbons, Y. Matias, and V. Poosala. Fast incremental maintenance of approximate histograms. ACM Trans. Database Syst., 27(3):261–298, 2002.
[20] L. Golab, T. Johnson, and V. Shkapenyuk. Scalable scheduling of updates in streaming data warehouses. IEEE Trans. Knowl. Data Eng., 24(6):1092–1105, 2012.
[21] M. Greenwald and S. Khanna. Space-efficient online computation of quantile summaries. In SIGMOD Conference, pages 58–66, 2001.
[22] A. Gupta and I. S. Mumick. Maintenance of materialized views: Problems, techniques, and applications. IEEE Data Eng. Bull., 18(2):3–18, 1995.
[23] H. Gupta and I. S. Mumick. Incremental maintenance of aggregate and outerjoin expressions. Information Systems, 31(6):435–464, 2006.
[24] B. He, M. Yang, Z. Guo, R. Chen, B. Su, W. Lin, and L. Zhou. Comet: batched stream processing for data intensive distributed computing. In SoCC, pages 63–74, 2010.
[25] C. Henke, C. Schmoll, and T. Zseby. Empirical evaluation of hash functions for packetid generation in sampled multipoint measurements. In Passive and Active Network Measurement, pages 197–206. Springer, 2009.
[26] S. Joshi and C. M. Jermaine. Materialized sample views for database approximation. IEEE Trans. Knowl. Data Eng., 20(3):337–351, 2008.
[27] C. Koch, Y. Ahmad, O. Kennedy, M. Nikolic, A. Nötzli, D. Lupei, and A. Shaikhha. DBToaster: higher-order delta processing for dynamic, frequently fresh views. VLDB J., 23(2):253–278, 2014.
[28] S. Krishnamurthy, M. J. Franklin, J. Davis, D. Farina, P. Golovko, A. Li, and N. Thombre. Continuous analytics over discontinuous streams. In SIGMOD Conference, pages 1081–1092, 2010.
[29] S. Krishnan, J. Wang, M. J. Franklin, K. Goldberg, and T. Kraska. Stale view cleaning: Getting fresh answers from stale materialized views. 2014.
[30] P.-A. Larson, W. Lehner, J. Zhou, and P. Zabback. Cardinality estimation using sample views with quality assurance. In SIGMOD, pages 175–186, 2007.
[31] P.-Å. Larson and H. Z. Yang. Computing queries from derived relations. In VLDB, pages 259–269, 1985.
[32] P. L'Ecuyer and R. Simard. TestU01: A C library for empirical testing of random number generators. ACM Transactions on Mathematical Software (TOMS), 33(4):22, 2007.
[33] J. LeFevre, J. Sankaranarayanan, H. Hacigümüş, J. Tatemura, N. Polyzotis, and M. J. Carey. Opportunistic physical design for big data analytics. In SIGMOD Conference, pages 851–862, 2014.
[34] E. Liarou, S. Idreos, S. Manegold, and M. L. Kersten. MonetDB/DataCell: Online analytics in a streaming column-store. PVLDB, 5(12):1910–1913, 2012.
[35] Z. Liu, B. Jiang, and J. Heer. imMens: Real-time visual querying of big data. Comput. Graph. Forum, 32(3):421–430, 2013.
[36] A. Meliou, W. Gatterbauer, S. Nath, and D. Suciu. Tracing data errors with view-conditioned causality. In SIGMOD Conference, pages 505–516, 2011.
[37] M. Mitzenmacher. A brief history of generative models for power law and lognormal distributions. Internet Mathematics, 1(2):226–251, 2003.
[38] M. Nikolic, M. Elseidy, and C. Koch. LINVIEW: incremental view maintenance for complex analytical queries. In SIGMOD Conference, pages 253–264, 2014.
[39] S. Nirkhiwale, A. Dobra, and C. M. Jermaine. A sampling algebra for aggregate estimation. PVLDB, 6(14):1798–1809, 2013.
[40] F. Olken. Random Sampling from Databases. PhD thesis, University of California, 1993.
[41] F. Olken and D. Rotem. Simple random sampling from relational databases. In VLDB, pages 160–169, 1986.
[42] F. Olken and D. Rotem. Maintenance of materialized views of sampling queries. In ICDE, pages 632–641, 1992.
[43] D. Peng and F. Dabek. Large-scale incremental processing using distributed transactions and notifications. In OSDI, pages 251–264, 2010.
[44] A. Rabkin, M. Arye, S. Sen, V. S. Pai, and M. J. Freedman. Aggregation and degradation in JetStream: Streaming analytics in the wide area. In NSDI, 2014.
[45] E. Rahm and H. H. Do. Data cleaning: Problems and current approaches. IEEE Data Eng. Bull., 23(4):3–13, 2000.
[46] V. Srinivasan and M. J. Carey. Compensation-based on-line query processing. In SIGMOD Conference, pages 331–340, 1992.
[47] N. Tatbul, U. Çetintemel, S. B. Zdonik, M. Cherniack, and M. Stonebraker. Load shedding in a data stream manager. In VLDB, pages 309–320, 2003.
[48] J. Wang, S. Krishnan, M. J. Franklin, K. Goldberg, T. Kraska, and T. Milo. A sample-and-clean framework for fast and accurate query processing on dirty data. In SIGMOD Conference, pages 469–480, 2014.
[49] K. Weil. Rainbird: Real-time analytics at Twitter. In Strata, 2011.
[50] E. Wu and S. Madden. Scorpion: Explaining away outliers in aggregate queries. PVLDB, 6(8):553–564, 2013.
[51] M. Zaharia, T. Das, H. Li, T. Hunter, S. Shenker, and I. Stoica. Discretized streams: fault-tolerant streaming computation at scale. In SOSP, pages 423–438, 2013.
[52] K. Zeng, S. Gao, B. Mozafari, and C. Zaniolo. The analytical bootstrap: a new method for fast error estimation in approximate query processing. In SIGMOD, pages 277–288, 2014.
12. APPENDIX
12.1 Extensions
min and max fall into their own category, since this is a canonical case where the bootstrap fails. We devise an estimation procedure that corrects these queries; however, we can only achieve a bound that has a slightly different interpretation than the confidence intervals seen before. We can calculate the probability that a larger (or smaller) element exists in the unsampled view.
We devise the following correction estimate for max: (1) for all rows in both S and S′, calculate the row-by-row difference, (2) let c be the max difference, and (3) add c to the max of the stale view. We can give weak bounds on the results using Cantelli's inequality. If X is a random variable with mean µ_x and variance var(X), then the probability that X exceeds its mean by a constant ε is bounded by

P(X ≥ µ_x + ε) ≤ var(X) / (var(X) + ε²)

Therefore, if we set ε to be the difference between the max value estimate and the average value, we can calculate the probability that we will see a higher value. The same estimator can be modified for min, with the corresponding bound

P(X ≤ µ_x − a) ≤ var(X) / (var(X) + a²)

This bound has a slightly different interpretation than the confidence intervals seen before: it gives the probability that a larger (or smaller) element exists in the unsampled view.
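The max-correction procedure and its Cantelli bound above can be sketched as follows (a schematic, assuming the stale and cleaned samples are given as aligned lists of values for the same rows; function names are ours):

```python
import statistics

def corrected_max(stale_max, stale_sample, clean_sample):
    """max correction: (1) row-by-row differences between the cleaned and
    stale sample rows, (2) c = max difference, (3) add c to the stale max."""
    c = max(new - old for old, new in zip(stale_sample, clean_sample))
    return stale_max + c

def prob_larger_exists(estimate, clean_sample):
    """Cantelli bound P(X >= mu + eps) <= var(X) / (var(X) + eps^2), with
    eps taken as the gap between the max estimate and the sample mean."""
    mu = statistics.fmean(clean_sample)
    var = statistics.pvariance(clean_sample)
    eps = estimate - mu
    return var / (var + eps ** 2) if eps > 0 else 1.0
```

Negating the values gives the corresponding min estimator and bound.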
In SVC, we also explore how to extend this correction procedure to Select queries. Suppose we have a Select query with a predicate:

SELECT * FROM View WHERE Condition(A);
We first run the Select query on the stale view, which returns a set of rows. This result has three types of data error: rows that are missing, rows that are falsely included, and rows whose values are incorrect. As in the sum, count, and avg query case, we can apply the query to the sample of the up-to-date view. From this sample, using the lineage defined earlier, we can quickly identify which rows were added, updated, and deleted. For the updated rows in the sample, we overwrite the out-of-date rows in the stale query result. For the new rows, we take a union of the sampled selection and the updated stale selection. For the missing rows, we remove them from the stale selection. To quantify the approximation error, we can rewrite the Select query as a count to get an estimate of the number of rows that were updated, added, or deleted (thus three "confidence" intervals).
In this work, we argue that hashing can be used for "sampling" a relational expression. However, from a complexity-theory perspective, hashing is not equivalent to random number generation (RNG). The existence of true one-way hash functions is a conjecture that would imply P ≠ NP; this conjecture is often taken as an assumption in cryptography. Of course, the ideal one-way hash functions required by the theory do not exist in practice. However, we find that existing hashes (e.g., linear hashes and SHA1) are sufficiently close to ideal that we can still take advantage of this theory. This assumption, that a practical hash function behaves like an ideal one, is called the Simple Uniform Hashing Assumption (SUHA) [11], and is widely used to analyze the performance of hash tables and hash partitioning. There is an interesting tradeoff between the latency of computing a hash and its uniformity: a SHA1 hash, for instance, is nearly an order of magnitude slower than simpler hashes but is much more uniform.
For example, a linear hash stored procedure in MySQL is nearly as fast as the pseudorandom number generation that would be used in a TABLESAMPLE operator; however, this hash exhibits some non-uniformity. A benefit of deterministic hashing is that when it is applied to the primary keys of a view, we get the Correspondence Property (Definition 1) for free.

PROPOSITION (HASHING CORRESPONDENCE). Suppose we have S, the stale view, and S′, the up-to-date view. Both views have the same schema and a primary key a. Let η_{a,m} be our hash function, applied to the primary key a:

Ŝ = η_{a,m}(S),  Ŝ′ = η_{a,m}(S′)

Then the two samples Ŝ′ and Ŝ correspond.

PROOF. There are four conditions for correspondence:
(1) Uniformity: Ŝ′ and Ŝ are uniform random samples of S′ and S, respectively, with a sampling ratio of m.
(2) Removal of superfluous rows: D = {∀s ∈ Ŝ, ∄s′ ∈ S′ : s(u) = s′(u)}, and D ∩ Ŝ′ = ∅.
(3) Sampling of missing rows: I = {∀s′ ∈ Ŝ′, ∄s ∈ S : s(u) = s′(u)}, and E(|I ∩ Ŝ′|) = m|I|.
(4) Key preservation for updated rows: for all s ∈ Ŝ not in D or I, s′ ∈ Ŝ′ : s′(u) = s(u).
Uniformity is satisfied by definition under SUHA (the Simple Uniform Hashing Assumption). Condition 2 is satisfied since if r is deleted, then r ∉ S′, which implies r ∉ Ŝ′. Condition 3 is the converse of Condition 2, so it is also satisfied. Condition 4 is satisfied since if r is in Ŝ then it was sampled, and since the primary key is consistent between S and S′, it will also be sampled in Ŝ′.

THEOREM. Given a derived relation R, a primary key a, and the sample η_{a,m}(R), let S be the sample created by applying η_{a,m} without push down and S′ the sample created by applying the push down rules to η_{a,m}(R). Then S and S′ are identical samples with sampling ratio m.

PROOF. We prove this by induction. The base case is where the expression tree is a single node, which trivially makes this true. Then, we induct by considering one level of operators in the tree.
σ, ∪, ∩, and − clearly commute with hashing on the key a, allowing for push down. Π commutes only if a is in the projection. For ⋈, a sampling operator can be pushed down if a is in either k_r or k_s, or if there is a constraint that links k_r to k_s. There are two cases in which this happens: a foreign-key relationship or an equality join on the same key. For group-by aggregates, if a is in the group clause (i.e., it is in the aggregate), then a hash of the operand filters all rows that share a value of a, which is sufficient to materialize the derived row. It is provably NP-hard to push down through a nested group-by aggregate such as:

  SELECT c, count(1)
  FROM (SELECT videoId, sum(1) as c
        FROM Log
        GROUP BY videoId)
  GROUP BY c

by reduction to the SUBSET-SUM problem.

We defined a concept of tuple lineage with primary keys. However, a curious property of the deterministic hashing technique is that we can actually hash any attribute while retaining the important statistical properties. This is because a uniformly random sample of any attribute (possibly not unique) still includes every individual row with the same probability. A consequence of this is that we can push down the hashing operator through arbitrary equality joins (not just many-to-one) by hashing the join key. We defer further exploration of this property to future work, as it introduces new tradeoffs. For example, sampling on a non-unique key, while unbiased in expectation, has higher variance in the size of the sample; happening to hash a large group may lead to decreased performance. Suppose our keys are duplicated µ_k times on average with variance σ_k. Then, for sampling fraction m, the variance of the sample size is:

  m(1 − m)µ_k + (1 − m)σ_k

This equation is derived from the formula for the variance of a mixture distribution. In this setting, our sampling would have to weigh this variance against the benefits of pushing the hash operator further down the query tree.
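The higher sample-size variance from hashing a non-unique key can be checked empirically. The following sketch (our own illustration with a hypothetical key distribution, not the paper's code) compares hash-sampling 1,000 rows with unique keys against the same 1,000 rows with each key duplicated 10 times; the mean sample size is the same, but co-sampling whole groups inflates its variance:

```python
import hashlib
from statistics import mean, variance

def keep(value, salt, m):
    """Deterministic inclusion test: hash (salt || value) and compare
    the 64-bit prefix of the digest against m * 2^64."""
    h = hashlib.sha1((salt + str(value)).encode()).digest()
    return int.from_bytes(h[:8], "big") < int(m * 2**64)

def sample_size(keys, salt, m):
    return sum(keep(k, salt, m) for k in keys)

m = 0.5
unique_keys = list(range(1000))                         # one row per key
dup_keys = [k for k in range(100) for _ in range(10)]   # each key x10

sizes_uniq = [sample_size(unique_keys, f"s{t}", m) for t in range(200)]
sizes_dup = [sample_size(dup_keys, f"s{t}", m) for t in range(200)]

# Both are unbiased: each individual row is included with probability m...
print(mean(sizes_uniq), mean(sizes_dup))
# ...but keys that are sampled bring their whole group with them,
# inflating the variance of the sample size.
print(variance(sizes_uniq), variance(sizes_dup))
```

Each row is still included with probability m, so estimates remain unbiased in expectation; the cost of pushing the hash below a many-to-many join is only this extra variance in sample size.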
In our first experiment, we materialize the join of lineitem and orders. We treat the TPCD queries as queries on the view, and we selected 12 of the 22 to include in our experiments. The other 10 queries did not make use of the join.
In this workload, there were annotated summary-statistics queries, and we filtered for the most common types. While we cannot give the details of the queries, we can present some of the high-level characteristics of 8 summary-statistics type views:
• V1. Counts of various error types grouped by resources, users, date
• V2. Sum of bytes transferred grouped by resource, users, date
• V3. Counts of visits grouped by an expression of resource tags, users, date
• V4. Nested query that groups users from similar regions/service providers together, then aggregates statistics
• V5. Nested query that groups users from similar regions/service providers together, then aggregates error types
• V6. Union query that is filtered on a subset of resources and aggregates visits and bytes transferred
• V7. Aggregate network statistics grouped by resources, users, date, with many aggregates
• V8. Aggregate visit statistics grouped by resources, users, date, with many aggregates
We defined the base cube as a materialized view:

  SELECT sum(l_extendedprice * (1 - l_discount)) as revenue,
         c_custkey, n_nationkey, r_regionkey, l_partkey
  FROM lineitem, orders, customer, nation, region
  WHERE l_orderkey = o_orderkey
    AND o_custkey = c_custkey
    AND c_nationkey = n_nationkey
    AND n_regionkey = r_regionkey
  GROUP BY c_custkey, n_nationkey, r_regionkey, l_partkey
Each of the queries was an aggregate over subsets of the dimensions of the cube, with a sum over the revenue column:
• Q1. all
• Q2. c_custkey
• Q3. n_nationkey
• Q4. r_regionkey
• Q5. l_partkey
• Q6. c_custkey, n_nationkey
• Q7. c_custkey, r_regionkey
• Q8. c_custkey, l_partkey
• Q9. n_nationkey, r_regionkey
• Q10. n_nationkey, l_partkey
• Q11. c_custkey, n_nationkey, r_regionkey
• Q12. c_custkey, n_nationkey, l_partkey
• Q13. n_nationkey, r_regionkey, l_partkey

When we experimented with the median query, we changed the sum to a median of the revenues.
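Each such query is a roll-up of the base cube: it can be answered by re-aggregating the pre-computed revenue over a subset of the dimension columns. A minimal sketch of a Q6-style roll-up, using sqlite3 and a hypothetical four-row stand-in for the cube (not the TPC-D data):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical stand-in for the materialized base cube (not TPC-D data).
conn.execute("""CREATE TABLE base_cube (
    revenue REAL, c_custkey INT, n_nationkey INT,
    r_regionkey INT, l_partkey INT)""")
conn.executemany(
    "INSERT INTO base_cube VALUES (?, ?, ?, ?, ?)",
    [(10.0, 1, 1, 1, 1), (20.0, 1, 1, 1, 2),
     (5.0, 2, 2, 1, 1), (7.5, 2, 2, 1, 3)])

# Q6: sum revenue over the (c_custkey, n_nationkey) subset of dimensions.
q6 = conn.execute("""SELECT c_custkey, n_nationkey, sum(revenue)
                     FROM base_cube
                     GROUP BY c_custkey, n_nationkey
                     ORDER BY c_custkey""").fetchall()
print(q6)  # [(1, 1, 30.0), (2, 2, 12.5)]
```

The other queries differ only in the GROUP BY column subset; Q1 ("all") drops the GROUP BY clause entirely.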
We denormalize the TPCD schema and treat each of the 22 queries as views on the denormalized schema. In our experiments, we evaluate 10 of these with SVC. Here, we provide a table of the queries and the reasons why a query was not suitable for our experiments. The main reason a query was not used was that the cardinality of its result was small. Since we sample from the view, if the result was small (e.g., < 10 rows), it would not make sense to apply SVC. Furthermore, in the TPCD specification, the only tables affected by updates are lineitem and orders; queries that do not depend on these tables do not change, and thus there is no need for maintenance. Listed below are the excluded queries and the reasons for their exclusion:
• Query 1. Result cardinality too small
• Query 2. The query was static
• Query 6. Result cardinality too small
• Query 7. Result cardinality too small
• Query 8. Result cardinality too small
• Query 11. The query was static
• Query 12. Result cardinality too small
• Query 14. Result cardinality too small
• Query 15. The query contains an inner query, which we treat as a view
• Query 16. The query was static
• Query 17. Result cardinality too small
• Query 19. Result cardinality too small