Approximate Partition Selection for Big-Data Workloads using Summary Statistics
Kexin Rong†∗, Yao Lu†, Peter Bailis∗, Srikanth Kandula†, Philip Levis∗
†Microsoft, ∗Stanford

ABSTRACT
Many big-data clusters store data in large partitions that support access at a coarse, partition-level granularity. As a result, approximate query processing via row-level sampling is inefficient, often requiring reads of many partitions. In this work, we seek to answer queries quickly and approximately by reading a subset of the data partitions and combining partial answers in a weighted manner, without modifying the data layout. We illustrate how to perform this query processing efficiently using a set of pre-computed summary statistics, which inform the choice of partitions and weights. We develop novel means of using the statistics to assess the similarity and importance of partitions. Our experiments on several datasets and data layouts demonstrate that, to achieve the same relative error compared to uniform partition sampling, our techniques offer a 2.7× to 70× reduction in the number of partitions read, and the statistics stored per partition require fewer than 100KB.
1. INTRODUCTION
Approximate Query Processing (AQP) systems allow users to trade accuracy for query execution speed. In applications such as data exploration and visualization, this trade-off is not only acceptable but often desirable. Sampling is a common approximation technique, wherein the query is evaluated on a subset of the data, and much of the literature focuses on row-level samples [13, 16, 23].

When data is stored in media that do not support random access (e.g., flat files in data lakes and columnar stores [1, 53]), constructing a row-level sample can be as expensive as scanning the entire dataset. For example, if data is split into partitions of 100 rows each, a 1% uniform row sample would in expectation require fetching 64% (1 − 0.99^100) of the partitions; a 10% uniform row sample would touch almost all partitions. As a result, recent work from a production AQP system shows that row-level sampling only offers significant speedups for complex queries where substantial query processing remains after the sampling [42].

In contrast to row-level sampling, the I/O cost of constructing a partition-level sample is proportional to the sampling fraction.
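To make the arithmetic concrete, the following back-of-the-envelope sketch (the partition size and sampling rates are those of the example above) reproduces these numbers:

# Expected fraction of partitions touched by a uniform row-level sample,
# assuming rows are assigned to fixed-size partitions.
def fraction_partitions_touched(row_sample_rate: float, rows_per_partition: int) -> float:
    # A partition is untouched only if every one of its rows is left out.
    p_untouched = (1.0 - row_sample_rate) ** rows_per_partition
    return 1.0 - p_untouched

print(fraction_partitions_touched(0.01, 100))  # ~0.634: a 1% row sample reads ~64% of partitions
print(fraction_partitions_touched(0.10, 100))  # ~0.99997: a 10% row sample touches almost all
# A 1% partition-level sample, by contrast, reads exactly 1% of the partitions.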
Figure 1: Our system PS makes novel use of summary statistics to perform importance- and similarity-aware sampling of partitions.

In our example above, a 1% partition-level sample would only read 1% of the data. We are especially interested in big-data clusters, where data is stored in chunks of tens to hundreds of megabytes, instead of disk blocks or pages, which are typically a few kilobytes [34, 53]. Partition-level sampling is already used in production due to its appealing performance: commercial databases create statistics using partition samples [5, 30] and several big-data stores allow sampling partitions of tables [2, 6, 8].

However, a key challenge remains in how to construct partition-level samples that can answer a given query accurately. Since all or none of the rows in a partition are included in the sample, the correlation between rows (e.g., due to layout) can lead to inaccurate answers. A uniformly random partition-level sample is not a representative sample of the dataset unless the rows are randomly distributed among partitions [24], which rarely happens in practice [20]. In addition, even a uniform random sample of rows can miss rare groups in the answer or miss the rows that contribute substantially to SUM-like aggregates. It is not known how to compute stratified [13] or measure-biased [32] samples over partitions, which would help with queries with group-bys and complex aggregates.

In this work, we introduce PS (Partition Selection with Summary Statistics), a system that supports AQP via weighted partition selection (Figure 1). Our primary use case is large-scale production query processing systems such as Spark [15], F1 [52], and SCOPE [21], where queries only read and datasets are bulk-appended. (In this paper, we use "partition" to refer to the finest granularity at which the storage layer maintains statistics.) Our goal is to minimize the approximation error given a sampling budget, i.e., a fraction of data that can be read. Motivated by observations from production clusters at Microsoft and in the literature that many datasets remain in the order in which they were ingested [43], PS does not require any specific layout or re-partitioning of data. Instead of storing precomputed samples [14, 16, 22], which requires significant storage budgets to offer good approximations for a wide range of queries [25, 44], PS performs sampling exclusively during query optimization. Finally, similar to the query scope studied in prior work [14, 48, 54], PS supports single-table queries with SUM, COUNT(*), and AVG aggregates,
GROUP BY on columnsets with moderate distinctiveness, and predicates that are conjunctions, disjunctions, or negations over single-column clauses.

To select partitions that are most relevant to a query, PS leverages the insight that partition-level summary statistics are relatively inexpensive to compute and store. The key question is which statistics to use. Systems such as Spark SQL and ZoneMaps already maintain statistics such as the maximum and minimum values of a column to assist in query optimization [43]. Following similar design considerations, we look for statistics with small space requirements that can be computed for each partition in one pass at ingest time. For functionality, we look for statistics that are discriminative enough to support decisions such as whether a partition contributes disproportionately large values of the aggregates.

We propose such a set of statistics for partition sampling (measures, heavy hitters, distinct values, and histograms), which includes but expands on conventional catalog-level statistics. The total storage overhead scales with the number of partitions instead of with the dataset size. We maintain only single-column statistics to keep the overhead low. The resulting storage overhead can be orders of magnitude smaller than approaches using auxiliary indices to reduce the cost of random access [32]. While the set of statistics is by no means complete, we show that each type of statistic contributes to the sampling performance and that, in aggregate, they deliver effective AQP results.

We illustrate three ways of using summary statistics to help with partition selection. First, if we knew which partitions contribute more to the query answer, we could sample these partitions more frequently. While it is challenging to manually design rules that relate summary statistics to partition contribution, a model can learn how much summary statistics matter through examples. Inspired by prior work that uses learning techniques to improve sampling efficiency for counting queries on row-level samples [56], we propose a learned importance-style sampling algorithm that works on aggregate queries with GROUP BY clauses and on partitions. The summary statistics serve as a natural feature representation for partitions, from which we can train models offline to learn a mapping from summary statistics to the relative importance of a partition. During query optimization, we use the trained models to classify partitions into several importance groups, and allocate the sampling budget across groups such that the more important groups get a greater proportion of the budget. The training overhead is a one-time cost for each dataset and workload, and for high-value datasets in clusters that are frequently queried, this overhead is amortized over time.

In addition, we leverage the redundancy and skewness of the partitions for further optimization. For two partitions that output similar answers to an input query, it suffices to include only one of them in the sample. While directly comparing the contents of two partitions is expensive, we can use the query-specific summary statistics as a proxy for the similarity between partitions. We also observe that datasets commonly exhibit significant skew in practice.
For example, in a prototypical production service request log dataset at Microsoft, the most popular application version out of the 167 distinct versions accounts for almost half of the dataset. Inspired by prior work in AQP that recognizes the importance of outliers [16, 23], we use summary statistics (e.g., the occurrences of heavy hitters in a partition) to identify a small number of partitions that are likely to contain rare groups, and dedicate a portion of the sampling budget to evaluating these partitions exactly.

In summary, this paper makes the following contributions:

1. We introduce PS, a system that makes novel use of summary statistics to perform weighted partition selection for many popular queries. Given the query semantics, summary statistics and a sampling budget, the system intelligently combines a few sampling techniques to produce a set of partitions to sample and the weight of each partition.

2. We propose a set of lightweight sketches for data partitions that are not only practical to implement, but can also produce rich partition summary statistics. While the sketches are well known, this is the first time the statistics are used for weighted partition selection.

3. We evaluate on a number of real-world datasets with real and synthetic workloads. Our evaluation shows that each component of PS contributes meaningfully to the final accuracy and that, together, the system outperforms alternatives across datasets and layouts, delivering from 2.7× to 70× reduction in data read given the same error compared to uniform partition sampling.
2. SYSTEM OVERVIEW
In this section, we give an overview of PS, including its major design considerations, supported queries, inputs and outputs, and the problem statement. We highlight a few design considerations in the system.
Layout Agnostic.
A random data layout would make the partition selection problem trivial, but maintaining a random layout requires additional effort and rarely happens in practice [20]. In read-only or append-only data stores, it is also expensive to modify the data layout. As a result, we observe that in practice, many datasets simply remain in the order in which they were ingested into the cluster. In addition, prior work [54] has shown that it is challenging and sometimes impossible to find a partitioning scheme that enables good data skipping for arbitrary input queries. Therefore, instead of requiring re-partitioning or a random layout, PS explicitly chooses to keep data in situ and tries to make the best of the given data layout. We show that PS works across different data layouts in the evaluation (§5).

Sampling on a single table.
To perform joins effectively, prior work [44] has shown that sampling each input relation independently is not enough and that the joint distribution must be taken into account. Handling the correlations between joined tables at the partition level is a research problem of its own, and is outside the scope of this paper. However, sampling on a single table can still offer non-trivial performance improvements even for queries that involve joining multiple tables. For example, in key–foreign key joins, fact tables are often much larger than dimension tables. Sampling the fact table therefore already captures most of the gains.
Generalization.
Prior works make various trade-offs between the efficiency and the generality of the queries that they support, ranging from having access to the entire workload [54] to being workload agnostic [38]. Our system falls in the middle of this spectrum: we make assumptions about the structure and distribution of the query workload. Specifically, we assume that the set of columns used in GROUP BYs and the aggregate functions are known a priori, with the scope defined below. PS is trained per data layout and workload, and generalizes to unseen queries sampled from the same distribution as the training workload. Overall, our system is best suited for commonly occurring queries, and should be retrained in case of major changes in the query workload such as the introduction of unseen group-by columns.

We do not consider generalization to unseen data layouts or datasets, and we view broader generalization as an exciting area for future work (§7). For example, statistics on a column X can distinguish partitions in a layout where the dataset is sorted by X, but may provide no information in a random layout.

Query Scope. We now define the scope of queries that PS supports. We support queries with an arbitrary combination of aggregates, predicates and group-bys. Although we do not directly support nested queries, many queries can be flattened using intermediate views [36]. Our techniques can also be used directly on the inner queries. Overall, our query scope covers 11 out of the 22 queries in the TPC-H workload (Appendix A.1 of the extended report).

• Aggregates. We support SUM and COUNT(*) (hence AVG) aggregates on columns, as well as simple linear projections of columns in the select clause. The projections include simple arithmetic operations (+, -) on one or more columns in the table. (In some cases, we also support multiplication and division using statistics computed over the logs of the columns.) We also support a subset of aggregates with CASE conditions that can be rewritten as an aggregate over a predicate.

• Predicates. Predicates include conjunctions, disjunctions and negations over clauses of the form c op v, where c denotes a column, op an operator and v a value. As clauses, we support equality and inequality comparisons on numerical and date columns, and equality checks with a value as well as the IN operator for string and categorical columns.

• Groups. We support GROUP BY clauses on one or more stored attributes. (To support derived attributes, we make a new column from the derived attribute and store its summary statistics.) We do not support GROUP BY on columns with large cardinality, since there is little gain from answering highly distinct queries over samples; one could either hardly perform any sampling without missing groups, or would only care about a limited number of groups with large aggregate values (e.g., TOP queries), which is out of the scope of this paper.

• Joins. Queries containing key–foreign key joins can be supported as queries over the corresponding denormalized table. For simplicity, our discussion in this paper is based on a denormalized table.

In the TPC-H workload, 16 of the 22 queries can be rewritten on a denormalized table, and 11 of those 16 are supported by our query scope. Of the 5 that are not supported, 4 involve group-bys on high-cardinality columns and 1 involves the MAX aggregate. A number of prior works have studied similar query scopes [14, 48, 54].

Inputs and Outputs. PS consists of two main components: the statistics builder and the partition picker (Figure 1). Below, we give an overview of the inputs and outputs of each component during preprocessing and at query time.

Preparation.
The statistics builder takes a data partition as input and outputs a number of lightweight sketches for each partition. The sketches are stored separately from the partitions. We describe the sketches in detail, including the time and space complexity of constructing and storing them, in §3.
Query Time. During query optimization, one can access the sketches without touching the raw data. Given an input query, the statistics builder combines pre-computed column statistics with query-specific statistics computed using the stored sketches, and produces a set of summary statistics for each partition and for each column used in the query.
Preparation.
In the preparation phase, the picker takes a specification of the workload in the form of a list of aggregate functions and columnsets that are used in GROUP BYs. We can sample a query from the workload by combining randomly generated predicates with randomly selected aggregate functions and group-by columnsets (0 or 1) from the specification. For each sampled query, we compute the summary statistics as well as the answer to the query on each partition as the training data, which the picker uses to learn the relevance of different summary statistics. The training is a one-time cost, and we train one model per workload to be used for all test queries. We elaborate on the design of the picker in §4.

Query Time.
The picker takes an input query, summary statistics and a sampling budget as inputs, and outputs a list of partitions to sample, as well as the weight of each partition in the sample. This has the net effect of replacing a table in the query execution plan with a set of weighted partition choices, at a small overhead (Table 5). Query execution should also be augmented to handle weights, similar to modifications suggested in prior work [44].

Table 1: Per partition, the time and space overheads to construct and store sketches for partitions with R_b rows. Small logarithmic factors are ignored.

Sketch         Construction       Storage
Histograms     O(R_b log R_b)     O(#buckets)
Measures       O(R_b)             O(1)
AKMV           O(R_b)             O(k)
Heavy Hitter   O(R_b)             O(support)

Problem Statement. Let N be the total number of partitions and M the dimension of the summary statistics. For an aggregation query Q, let G be the set of groups in the answer to Q. For each group g ∈ G, denote the aggregate values for the group as A_g ∈ R^d, where d is the number of aggregates, and denote the aggregates for group g on partition i as A_{g,i}. Given the input query Q, the summary statistics F ∈ R^{N×M}, and a sampling budget n in the form of the number of partitions to read, our system returns a set of weighted partition choices S = {(p_1, w_1), (p_2, w_2), ..., (p_n, w_n)}. The approximate answer Ã_g of group g for Q is computed as Ã_g = Σ_{j=1}^{n} w_j · A_{g,p_j}, for all g ∈ G.

Our goal is to produce the set of weighted partition choices S such that Ã_g is a good approximation of the true answer A_g for all groups g ∈ G. To assess the approximation quality across groups and aggregates of different sizes and magnitudes, we measure absolute and relative error, as well as the percentage of groups that are missed in the estimate.
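For illustration, a minimal sketch of this weighted combination follows; the data shapes ("selection" as the picker's output and "partial_answers" as the per-partition partial answers) are hypothetical stand-ins:

from collections import defaultdict

# selection: [(partition_id, weight), ...] produced by the picker.
# partial_answers[p][g]: the aggregate vector A_{g,p} from running Q on partition p alone.
def combine(selection, partial_answers, num_aggregates):
    estimate = defaultdict(lambda: [0.0] * num_aggregates)  # group -> Ã_g
    for p, w in selection:
        for g, agg in partial_answers[p].items():
            for j in range(num_aggregates):
                estimate[g][j] += w * agg[j]   # Ã_g = Σ_j w_j · A_{g,p_j}
    return dict(estimate)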
3. PARTITION SUMMARY STATISTICS
The high-level insight of our approach is that we want to differentiate partitions based on their contribution to the query answer, and that this contribution can be estimated using a rich set of partition-level summary statistics. As a simple example, for SUM-type aggregates, partitions with a higher average value of the aggregate should be preferred, all else being equal. We are unaware of prior work that uses partition-level summary statistics to perform non-uniform partition selection. In this section, we describe the design and implementation of the summary statistics.
Our primary use case, similar to columnar databases, is read-only or append-only stores. Summary statistics are constructed for each new data partition when the partition is sealed. The statistics should be simple, small in size, and computable incrementally in one pass over the data. They should also be discriminative enough to set partitions apart and rich enough to support sampling decisions such as estimating the number of rows in a partition that pass the predicate. We opt to use only single-column statistics to keep the memory overhead light, although more expensive statistics such as multi-column histograms could help estimate selectivity more accurately. These design considerations lead us to the following sketches:
Table 2: Summary statistics and the sketches used to compute them. Selectivity is computed per query; all other statistics are computed per column.

Summary Statistics                                               Sketch
mean(x), min(x), max(x), second moment of x, std(x)              Measures
mean(log x), second moment of log x, min(log x), max(log x)      Measures
number of distinct values                                        AKMV
avg/max/min/sum frequency of distinct values                     AKMV

• Measures: Minimum, maximum, as well as first and second moments are stored for each numeric column. For columns whose values are always positive, we also store measures on the log-transformed column.

• Histogram: We construct equal-depth histograms for each column. For string columns, the histogram is built over hashes of the strings. By default, each histogram has 10 buckets.

• AKMV: We use an AKMV (k-minimum values) sketch to estimate the number of distinct values [19]. The sketch keeps track of the k minimum hashed values of a column and the number of times these values appeared in the partition. We use k = 128 by default.

• Heavy Hitter: We maintain a dictionary of heavy hitters and their frequencies for each column in the partition using lossy counting [46]. By default, we only track heavy hitters that appear in at least 1% of the rows, so the dictionary has at most 100 items.

Table 1 summarizes the time complexity of constructing the sketches and the space overhead of storing them, ignoring small logarithmic factors. The sketches can be constructed in parallel for each partition. We do not claim that the above choices make a complete set of sketches for the purpose of partition selection. Our point is that these are a set of inexpensive sketches that can be easily deployed or might already be maintained in big-data systems [43], and that they can be used in new ways to improve partition sampling.
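To make the one-pass construction concrete, here is a minimal, illustrative sketch of the AKMV construction and its distinct-value estimate (not PS's production implementation; the hashing scheme is an assumption):

import hashlib

def _hash(v) -> float:
    # Deterministic hash of a value into [0, 1).
    h = hashlib.md5(str(v).encode()).hexdigest()
    return int(h[:15], 16) / float(1 << 60)

def akmv_sketch(column, k=128):
    """One-pass AKMV-style sketch: the k minimum hashed values of the
    column, together with how often each appeared in the partition."""
    counts = {}
    for v in column:
        hv = _hash(v)
        counts[hv] = counts.get(hv, 0) + 1
    kmin = sorted(counts)[:k]
    return {hv: counts[hv] for hv in kmin}

def estimate_distinct(sketch, k=128):
    # Standard KMV estimator: (k - 1) / (k-th smallest hash), or the exact
    # count when fewer than k distinct values were observed.
    if len(sketch) < k:
        return len(sketch)
    return (k - 1) / max(sketch)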
Given the set of sketches, we compute summary statistics for each partition, which can be used as feature vectors to discriminate partitions based on their contribution to the answer of a given query. The features consist of two parts: pre-computed per-column features and query-specific selectivity estimates (Table 2). We apply a query-dependent mask on the pre-computed column features: features associated with columns that are unused in the query are set to zero. In addition, for categorical columns where the measure-based sketches do not apply, we set the corresponding features to zero. The schema of the feature vector is determined entirely by the schema of the table, so queries on the same dataset share the same feature vector schema.

Overall, there are four types of features, based on the underlying sketches that generate them: measures, heavy hitters, distinct values and selectivity. Each type of feature captures different information about the partitions and the queries. Measures help identify partitions with disproportionately large values of the aggregates; heavy hitters and distinct values help discriminate partitions from each other; and selectivity helps assess the impact of the predicates. We found that all types of features are useful in PS, but the relative importance of each varies across datasets (§5).

Occurrence Bitmap.
We found that it is helpful to know not only the number of heavy hitters, but also which heavy hitters are present in the partition. To do so, we collect a set of k global heavy hitters for a column by combining the heavy hitters from each partition. For each partition, we compute a bitmap of size k, each bit representing whether the corresponding global heavy hitter is also a heavy hitter in the current partition. The feature is only computed for grouping columns, and we cap k at 25 for each column.
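A minimal sketch of the bitmap computation (the input shapes and example values are hypothetical):

def occurrence_bitmap(partition_hh, global_hh):
    """Bit i is set iff global heavy hitter i is also a heavy hitter
    in this partition (capped at k = 25 global heavy hitters)."""
    return [1 if v in partition_hh else 0 for v in global_hh]

# Illustrative example: global heavy hitters of a grouping column.
global_hh = ["v1.2", "v1.3", "v2.0"]
print(occurrence_bitmap({"v1.3": 0.41, "v2.0": 0.06}, global_hh))  # [0, 1, 1]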
Selectivity Estimates. The selectivity estimate is a real number between 0 and 1, designed to reflect the fraction of rows in the partition that satisfy the query predicate. The estimate supports predicates defined in our query scope (§2); clauses over the same column (e.g., X > 10 combined with X < v for some value v) are evaluated jointly. As a special case, if a string column has a small number of distinct values, all distinct values and their frequencies are stored exactly; this can support regex-style textual filters on the string column (e.g., '%promo%'). We use the following four features to represent the selectivity of predicates, which can be a conjunction or disjunction of individual clauses:

1. selectivity_upper: For ANDs, the selectivity is at most the minimum of the selectivities of the individual clauses; for ORs, the selectivity is at most 1 and at most the sum of the selectivities of the individual clauses.

2. selectivity_indep: This feature computes the selectivity assuming independence between predicate clauses. For ANDs, the feature is the product of the selectivities of the individual clauses; for ORs, the feature is the minimum of the selectivities of the individual clauses.

3. selectivity_min, selectivity_max: We store the minimum and maximum of the selectivities of the individual clauses.

If the upper bound of the selectivity is zero, the partition contains no rows that pass the predicate; if the upper bound is nonzero, however, the partition can have zero or more rows that pass the predicate. In other words, as a classifier for identifying partitions that satisfy the predicate, selectivity_upper > 0 has no false negatives but can have false positives. For simple predicates (e.g., X > 1), the precision is 100%; for complicated predicates involving conjunctions and disjunctions over many clauses and columns (e.g., TPC-H Q19), the precision can be as low as 10%.
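A minimal sketch of these four features for a predicate that is a single conjunction or disjunction (per-clause selectivity estimates are assumed given):

def selectivity_features(clause_sel, is_conjunction):
    """clause_sel: list of per-clause selectivity estimates in [0, 1]."""
    if is_conjunction:   # ANDs
        upper = min(clause_sel)
        indep = 1.0
        for s in clause_sel:
            indep *= s               # independence assumption
    else:                # ORs
        upper = min(1.0, sum(clause_sel))
        indep = min(clause_sel)      # as defined above
    return {
        "selectivity_upper": upper,
        "selectivity_indep": indep,
        "selectivity_min": min(clause_sel),
        "selectivity_max": max(clause_sel),
    }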
4. PARTITION PICKING
In this section, we describe PS's partition picker component and how it makes novel use of the summary statistics discussed above to realize weighted partition selection. To start, we give an overview of how the partition picker works. Recall that the picker takes a query, the summary statistics and a sampling budget as inputs, and outputs a list of partitions to evaluate the query on and the weight of each partition. Partial answers from the selected partitions are combined in a weighted manner, as described in §2.
Algorithm 1 Partition Picker
Input: partition features F, sampling budget n, group-by columns gb_col, models regrs, decay rate α
Output: selection: [(p_1, w_1), (p_2, w_2), ..., (p_n, w_n)]
1: outliers, inliers ← Outlier(F, gb_col)
2: n_o ← outliers.size()
3: selection.add(outliers, [1] * n_o)
4: groups ← ImportanceGroup(F, inliers, regrs)
5: n_c ← AllocateSamples(groups, n − n_o, α)
6: for i ← 1, ..., groups.size() do
7:     selection.add(Clustering(F[groups[i]], n_c[i]))
8: end for

The picker proceeds in three steps (Algorithm 1). First, it sets aside a portion of the budget to evaluate outlier partitions exactly. Second, it classifies the remaining partitions into several importance groups using trained models, and allocates the remaining budget such that the sampling rate decays by a factor of α from the i-th important to the (i+1)-th important group. Finally, given a sample size and a set of partitions in each importance group, it selects samples via clustering, described next.

We start by describing the sample selection procedure (line 7 in Algorithm 1), designed to leverage the redundancy between partitions. We use feature vectors to compute a similarity score between partitions, which enables us to choose dissimilar partitions as representatives of the dataset. Identical partitions will have identical summary statistics, but the converse does not hold; having summary statistics on multiple columns, as well as multiple statistics for each column, makes it less likely that dissimilar partitions have identical summary statistics.

We propose to use clustering as a sampling strategy: given a sampling budget of n partitions, we perform clustering on the feature vectors with a target of n clusters; an exemplar partition is chosen per cluster, with an assigned weight equal to the number of partitions in the cluster. Denote the answer to the query on cluster i's exemplar partition as A_i and the size of cluster i as s_i. The estimate of the query answer is given by Ã = Σ_{i=1}^{n} s_i · A_i.

Concretely, we measure partition similarity using Euclidean distances between the feature vectors. We zero out features for columns unused in the query so that they have no impact on the result; we also perform normalization so that the distance is not dominated by any single feature (Appendix B). Regarding the choice of clustering algorithm, we experimented with KMeans and agglomerative clustering and found that they perform similarly. Finally, the cluster exemplar is selected by picking the partition whose feature vector has the smallest distance to the median feature vector of the partitions in the cluster.

Our proposed scheme leads to a biased estimator that can be challenging to analyze. Specifically, given the median feature vector of a cluster, our estimator deterministically picks the partition that is closest to the median vector as the cluster exemplar. However, one could make a simple modification to unbias the estimator by selecting a random partition in the cluster as the exemplar instead. We include an empirical comparison of the accuracy of the two estimators, as well as a variance analysis for the unbiased estimator, in Appendix D. We have empirically found that the proposed scheme outperforms its unbiased counterpart when the sampling budget is limited.

Clustering effectively leverages the redundancy between partitions, especially when partitions have near-identical features. Although there is no guard against an adversary, in practice, having a large and diverse set of summary statistics makes it naturally difficult for dissimilar partitions to land in the same cluster. Clusters play a similar role as strata in stratified sampling.
The goal of clustering is to make partitions in the same stratum homogeneous so that the overall sampling variance is reduced. Finally, clustering results vary from query to query: the same partition can be in different clusters for different queries due to changes in the selectivity features and the query-dependent column masks.
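A minimal sketch of this selection step, using scikit-learn's KMeans on the masked, normalized feature vectors (an illustration under the assumptions above, not the production code):

import numpy as np
from sklearn.cluster import KMeans

def cluster_select(features, budget, seed=0):
    """Pick `budget` weighted exemplar partitions by clustering feature vectors."""
    km = KMeans(n_clusters=budget, n_init=10, random_state=seed).fit(features)
    selection = []
    for c in range(budget):
        members = np.flatnonzero(km.labels_ == c)
        median = np.median(features[members], axis=0)
        dists = np.linalg.norm(features[members] - median, axis=1)
        exemplar = members[np.argmin(dists)]            # closest to the median vector
        selection.append((int(exemplar), len(members)))  # weight = cluster size
    return selection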
Feature Selection. Clustering assumes that all features are equally relevant to partition similarity. To further improve clustering performance, we perform feature selection via a "leave-one-out" style test. For example, consider a table with columns X, Y and features min, max. We compare the clustering performance on the training set using {min(X), max(X), min(Y), max(Y)} as features against that from using only {max(X), max(Y)} as features. If the latter gives a smaller error, we subsequently exclude the min feature for all columns from clustering. We greedily remove features until converging to a local optimum, at which point excluding any remaining feature would hurt clustering performance. In an outer loop, we repeat the above greedy procedure multiple times, each time starting from a random ordering of the features. Our experiments show that feature selection consistently improves clustering performance across datasets. We provide the pseudocode of the procedure in Appendix B.1.
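A minimal sketch of the greedy procedure; "clustering_error" is a hypothetical callback that runs the clustering above on a feature subset and returns the training-set error:

import random

def greedy_feature_selection(feature_names, clustering_error, restarts=3, seed=0):
    best = list(feature_names)
    best_err = clustering_error(best)
    for r in range(restarts):
        rng = random.Random(seed + r)
        kept = list(feature_names)
        rng.shuffle(kept)                       # random ordering per restart
        err = clustering_error(kept)
        improved = True
        while improved and len(kept) > 1:
            improved = False
            for f in list(kept):
                trial_err = clustering_error([x for x in kept if x != f])
                if trial_err < err:             # excluding f helps; drop it
                    kept.remove(f)
                    err = trial_err
                    improved = True
        if err < best_err:
            best, best_err = kept, err
    return best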
Limitations. We briefly discuss two failure cases for clustering, in which PS can fall back to random sampling (details in Appendix B.1). First, clustering takes advantage of the redundancy among partitions. In the extreme case when the query groups by the primary key, no two partitions contribute similarly to the query, and any downsampling would result in missed groups. As discussed in §2, such high-cardinality group-bys are outside our query scope.

While clustering helps select partitions that are dissimilar, it makes no distinction between partitions that contribute more to the query and partitions that contribute less. Ideally, we would want to sample the more important partitions more frequently to reduce the variance of the estimate [37]. The feature vectors can help assess partition contribution. Consider the query:
SELECT SUM(X), Y FROM table WHERE Z > 1 GROUP BY Y
The subset of partitions that answer this query well should contain large values of X, many rows that satisfy the predicate, and many distinct values of Y. The feature vectors are correlated with these desired properties: measure statistics (e.g., max, std) can help reveal large values of X, selectivity measures the fraction of the partition that is relevant to the query, and heavy hitter and distinct value statistics summarize the distribution of groups. However, it is challenging to manually quantify how much each feature matters for each query. In our example, it is unclear whether a partition with a high variance of X but few rows that match the predicate should be prioritized over a partition with low variance and many rows that match the predicate.

While it is not obvious how to manually design rules that relate feature vectors to partition contribution, a model may learn to do so from examples. An intuitive design is to use partition features as inputs and predict partition weights as outputs, which turns out to be a non-traditional regression problem. The goal of the regressor is to assign a weight vector to N partitions such that the weighted partition choice produces a small approximation error. Given a sampling budget of n partitions, there are exponentially many choices of subsets of partitions of size n, and the optimal choice is discontinuous in n: a small change in n can completely change the set of partitions to pick and the weights to assign to them. In addition, the decision depends jointly on the set of partitions chosen; the weight assigned to one partition, for example, may depend on how many other partitions with nearly identical content are picked in the sample. Therefore, a simple per-partition regressor is unable to capture the combinatorial nature of the decision space. Existing solutions [18, 45] would require significantly more resources, and we pursue a lightweight alternative instead.

Given the challenges of directly using learned models to predict sampling probabilities, we propose a design that utilizes the models indirectly for sample size allocation; similar observations were made for using learned models to improve row-level sampling designs for count queries [56]. We consider classifying partitions, based on their relative importance to the query answer, into a few importance groups, and apply multiplicatively increasing sampling probabilities to the more important groups. We detail each of these steps next.

Figure 2: The trained regressors are used to classify input partitions into groups of different importance. The sampling rate decreases by a factor of α > 1 from the i-th important to the (i+1)-th important group.

Partition Contribution.
We consider the "contribution" of a partition to the answer of a query as its largest relative contribution to any group and any aggregate in the answer. Recall that we denote the aggregates for group g ∈ G as A_g ∈ R^d, where d is the number of aggregate functions, and the aggregates for group g on partition i as A_{g,i} ∈ R^d. Partition i's contribution is defined as max_{g∈G} max_{j=1..d} (A_{g,i}[j] / A_g[j]). There are several alternative definitions of contribution, such as using the average instead of the max of the ratios, or using absolute values instead of relative ones. Among all variants, the max of the relatives is perhaps the most generous: it recognizes a partition's importance if it helps with any aggregate in any group, and is not biased towards large groups or aggregates with large absolute values. We find that this simple definition already leads to good empirical results.

Algorithm 2 Group partitions by importance.
Input: partition features F
1: function ImportanceGroup(F, parts, regressors)
2:     groups.add(FilterByPredicate(F, parts))
3:     for regr ∈ regressors do
4:         to_examine ← groups[-1]
5:         to_pick ← p ∈ to_examine s.t. regr(F[p]) > 0
6:         groups[-1] ← to_examine.difference(to_pick)
7:         groups.add(to_pick)
8:     end for
9:     return groups
10: end function
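For concreteness, a minimal sketch of the contribution definition above (the input shapes are hypothetical; "partial" and "totals" map group -> aggregate vector on the partition and on the full dataset, respectively):

def partition_contribution(partial, totals):
    """max over groups g and aggregates j of A_{g,i}[j] / A_g[j]."""
    best = 0.0
    for g, agg in partial.items():
        for a_gi, a_g in zip(agg, totals[g]):
            if a_g != 0:
                best = max(best, a_gi / a_g)
    return best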
Training. Given the partition contributions for all queries in the training data, we train a set of k models to distinguish the relative importance of partitions. When k is large, training the set of models is equivalent to solving the regression problem of directly predicting partition contribution from the feature vector; when k is small, the training reduces to a simpler multiclass classification problem. The k models discretize partition contribution into k + 1 bins, and we choose exponentially spaced bin boundaries: the number of partitions that satisfy the i-th model increases exponentially from the number of partitions that satisfy the (i+1)-th model. In particular, the first model identifies all partitions that have non-zero contribution to the query, and the last model identifies partitions whose contribution is ranked in the top 1% of all partitions (the small number of positive examples makes it challenging to train an accurate model beyond 1%). We use the XGBoost regressor as our base model, and provide additional details of the training in Appendix B.2.
Testing.
During test time, we run partitions through a funnel that uses the set of trained models as filters and sorts partitions into different importance groups (Figure 2). The advantage of building a funnel is that it requires partitions to pass more filters as they advance to the more important groups, which helps limit the impact of inaccurate models. We list the procedure in Algorithm 2. We start from all partitions with a nonzero selectivity_upper feature; as discussed in §3, this filter has no false negatives.

We then split the sampling budget such that more important groups get a greater proportion of the budget. We implement a sampling rate that decays by a factor of α > 1 from the i-th important to the (i+1)-th important group. We investigate the impact of the decay rate α in the sensitivity analysis (Appendix C.2). In general, increasing α improves the overall performance, especially when the trained models are accurate, but the marginal benefit decreases as α becomes larger. If the trained models are completely random, however, a larger α increases the variance of the estimate. We have found that a decay rate of α = 2 with k = 4 models works well across a range of datasets and layouts empirically. However, it is possible to fine-tune α for each dataset to further improve performance, and we leave this fine-tuning to future work.
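A minimal sketch of this allocation; for brevity it ignores capping a group's allocation at the group size:

def allocate_samples(group_sizes, budget, alpha=2.0):
    """Split `budget` sampled partitions across importance groups so the
    per-partition sampling rate decays by a factor of alpha per group
    (groups ordered most-important-first)."""
    weights = [size * alpha ** -i for i, size in enumerate(group_sizes)]
    base = budget / sum(weights)                     # rate of the top group
    return [int(round(base * alpha ** -i * size))    # rate_i = base / alpha**i
            for i, size in enumerate(group_sizes)]

print(allocate_samples([10, 40, 150], budget=20, alpha=2.0))  # e.g. [3, 6, 11]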
Finally, we observe that datasets often exhibit significant skew in practice (example in §1): a small number of partitions may contain rare groups that matter for GROUP BY queries. These partitions are not representative of other partitions and should be excluded from clustering. To identify such partitions, we take advantage of the occurrence bitmap feature that tracks which heavy hitters are present in a partition. We put partitions with identical bitmap features for the columns in the GROUP BY clause in the same group, and consider a bitmap feature group outlying if its size is small both in absolute terms and relative to the largest group (< 10% of the size of the largest group). For example, if there are 100 such bitmap feature groups and 10 partitions per group, we do not consider any group as outlying, although the absolute size of each group is small. We allocate up to 10% of the sampling budget to evaluating outliers. We have empirically found that increasing the outlier budget further does not significantly improve performance using only the outliers we consider. Exploring alternative ways to identify outliers could be an interesting area for future work.
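A minimal sketch of this grouping; the absolute threshold is illustrative, since the paper's exact constant is not recoverable from the text:

from collections import defaultdict

def find_outliers(bitmaps, abs_threshold=5, rel_threshold=0.1):
    """Group partitions by their group-by-column occurrence bitmaps and flag
    groups that are small both absolutely and relative to the largest group."""
    groups = defaultdict(list)
    for pid, bm in enumerate(bitmaps):
        groups[tuple(bm)].append(pid)
    largest = max(len(ps) for ps in groups.values())
    outliers = []
    for ps in groups.values():
        if len(ps) < abs_threshold and len(ps) < rel_threshold * largest:
            outliers.extend(ps)
    return outliers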
5. EVALUATION
In this section, we evaluate the empirical performance of PS. Experiments show that:

1. PS consistently outperforms alternatives on a variety of real-world datasets, delivering a 2.7× to 70× reduction of data read to achieve the same average relative error compared to uniform partition sampling, with storage overhead ranging from 12KB to 103KB per partition.

2. Every component of PS and every type of feature contributes meaningfully to the final performance.

3. PS works across datasets, partitioning schemes and partition counts, and generalizes to unseen queries.

In this subsection, we describe the experimental methodology, which includes the datasets, query generation, methods of comparison and error metrics.

Datasets.
We evaluate on four real-world datasets, summarized below. We include a specification of the table schemas in Appendix A.
TPC-H*.
Data is generated from a Zipfian distribution with skewness of 1 and a scale factor of 1000 [7]. We denormalize all tables against the lineitem table. The resulting table has 6B rows, with 14 numeric columns and 31 categorical columns. Data is sorted by column L_SHIPDATE.

TPC-DS*. The catalog_sales table with a scale factor of 1 from TPC-DS, joined with the dimension tables item, date_dim, promotion and customer_demographics, with 4.3M rows, 21 numeric columns and 20 categorical columns. Data is sorted by columns year, month and day.

Aria.
Production service request log at Microsoft with 10M rows, 7 numeric columns and 4 categorical columns [11, 33]. Data is sorted by the categorical column TenantId.

KDD.
KDD Cup'99 dataset on network intrusion detection with 4.8M rows, 27 numeric columns and 14 categorical columns [17]. Data is sorted by the numeric column count.

By default, we use a partition count of 1000, the smallest size at which partition elimination becomes interesting.
The TPC-H* dataset (sf=1000) has 2844 partitions, with a partition size of about 2.5GB, consistent with the scale of big-data workloads seen in practice. In the sensitivity analysis, we further investigate the effect of the partition count and of different data layouts.

Queries. To train PS, we construct a training set of 400 queries for each dataset by sampling the following aspects at random:

• between 0 and 8 columns as the group-by columns
• between 0 and 5 predicate clauses, each of which picks a column, an operator and a constant at random
• between 1 and 3 aggregates over one or more columns

We generate a held-out set of 100 test queries in a similar way. For TPC-H*, we include an additional test set derived from 10 TPC-H queries for the generalization experiment.

Methods of Comparison. All methods except simple random sampling have access to feature vectors, and use the selectivity_upper feature to filter out partitions that do not satisfy the predicate before sampling. Recall that this filter has false positives but no false negatives. All methods have access to the same set of features. We report the average of 10 runs for methods that use random sampling.
Random Sampling.
Partitions are sampled uniformly at random. Aggregates in the answer are scaled up by the inverse of the sampling rate.
Random+Filter.
Same as random sampling, except that only partitions that pass the selectivity filter are sampled. This is only achievable with the use of summary statistics.
Learned Stratified Sampling (LSS).
A baseline inspired by prior work on learned row-level stratified sampling [56]. We rank partitions by the model's prediction and perform stratification such that each stratum covers partitions whose predictions fall into a consecutive range. We made three modifications to LSS to enable partition-level sampling: moving training from online to offline for I/O savings, changing the inputs and outputs to operate on partitions instead of rows, and adopting a different stratification strategy. We include a detailed description of the modifications in Appendix C.1.

PS. A prototype that matches the description given so far. Unless otherwise specified, the default parameter values for PS in all experiments are k = 4, α = 2, and up to a 10% sampling budget dedicated to outliers.

Error Metrics. Similar to prior work [13, 16, 42], we report multiple accuracy metrics. It is possible, for example, for a method to have a small absolute error but miss all small groups and small aggregate values. We therefore consider all three metrics below for a complete picture.
Missed Groups.
Percentage of groups in the true answer that are missed by the estimate.
Average Relative Error.
The average of the relative errors of each aggregate in each group. For missed groups, the relative error is counted as 1.
Absolute Error over True.
The average absolute error of an aggregate across groups, divided by the average true value of the aggregate across groups, averaged over multiple aggregates.
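For concreteness, a minimal sketch of the three metrics for a single query; it pools aggregates when averaging, a small simplification of the per-aggregate definition above:

def error_metrics(truth, estimate):
    """truth, estimate: dicts mapping group -> list of aggregate values.
    Missed groups count as relative error 1."""
    rel_errs, abs_errs, true_vals, missed = [], [], [], 0
    for g, t in truth.items():
        if g not in estimate:
            missed += 1
            rel_errs.extend([1.0] * len(t))
            continue
        for tv, ev in zip(t, estimate[g]):
            rel_errs.append(abs(ev - tv) / abs(tv) if tv else float(ev != tv))
            abs_errs.append(abs(ev - tv))
            true_vals.append(abs(tv))
    avg_true = sum(true_vals) / len(true_vals) if true_vals else 1.0
    return {
        "missed_groups": missed / len(truth),
        "avg_relative_error": sum(rel_errs) / len(rel_errs),
        "abs_error_over_true": (sum(abs_errs) / len(abs_errs)) / avg_true if abs_errs else 0.0,
    }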
We compare the performance of the methods of interest under varying sampling budgets on four datasets (Figure 3). The closer a curve is to the bottom left, the better the result. While the scales of the three error metrics differ, the ordering of the methods is relatively stable. Using the selectivity feature to filter out partitions that do not satisfy the predicate strictly improves the performance of all methods, except on datasets like TPC-DS* where most partitions pass the predicate. The modified LSS (green) clearly improves upon random sampling by leveraging the correlation between feature vectors and partition contribution, consistent with the findings of prior work.

Overall, PS consistently outperforms alternatives across datasets and error metrics. In our large-scale experiment on the TPC-H* data, PS achieves an average relative error of 1.5% with a 1% sampling rate. At the same 1% sampling rate, PS improves on the error by over 17× compared to random sampling, over 10× compared to random sampling with filter, and over 3× compared to LSS (read from the intersections between the baseline curves and a vertical line at a 1% sampling rate). To achieve an average relative error of 1.5%, PS reduces the fraction of data read by over 70× compared to random sampling, over 40× compared to random sampling with filter, and 5× compared to LSS (read from the intersections between the baseline curves and a horizontal line at a 1.5% error rate). We observe similar trends on the three smaller datasets, but the performance gap is smaller: PS reduces the data read by roughly 2× to 8× compared to simple random sampling to achieve at most a 10% average relative error.
We additionally show that the fraction of data read is a reliable proxy for the reduction in resources used, measured by total compute time. We evaluate example queries on the TPC-H* dataset using SCOPE clusters [21, 59], Microsoft's main batch analytics platform, which consists of tens of thousands of nodes.

Figure 3: Comparison of error under varying sampling budgets on four datasets; lower is better. PS (red) consistently outperforms others across datasets and error metrics.

Table 3: Average speedups for query latency and total compute time under different sampling rates on the TPC-H* dataset.

Sampling rate         1%        5%       10%
Query Latency         4.7×
Total Compute Time    105.3×    19.6×    11.4×

Table 4: Per-partition storage overhead of the summary statistics (in KB) for each dataset.
Dataset     Total    Histogram    HH    AKMV    Measure
TPC-H*
TPC-DS*
Aria
KDD
Table 3 shows that reading 1%, 5% and 10% of the partitions results in near-linear speedups of 105.3×, 19.6× and 11.4× in total compute time. The improvement in query latency, however, is less than linear and depends on stragglers and other concurrent jobs on the cluster.

We report the space overhead of storing the summary statistics in Table 4. The statistics are computed per column and therefore require a constant storage overhead per partition. The overheads range from 12KB to 103KB across the four datasets. The larger the partition size, the lower the relative storage overhead of the statistics. For example, with a partition size of 2.5GB, the storage overhead is below 0.003% for the TPC-H* dataset.
The AKMV sketch for estimating distinct values takes the most space of all the sketches. If the number of distinct values in a column is larger than k (we use k = 128), the sketch has a fixed size; otherwise, the sketch size is proportional to the number of distinct values.

Table 5: Range of the average picker overhead across sampling budgets for each dataset (in milliseconds).
          Aria       KDD      TPC-DS*    TPC-H*
Total     89.9 ±

The KDD dataset, for example, has more columns but a smaller AKMV sketch size than the Aria dataset, since a number of its columns are binary.
We also report the single-thread latency of the partition picker (Algorithm 1) in Table 5, measured on an Intel Xeon E5-2690 v4 CPU. Our prototype picker is implemented in Python using the XGBoost and Sklearn libraries. Overall, the overhead is a small fraction of the query time, ranging from 86.5ms to around 1s across datasets. In comparison, the average query takes tens of total compute hours on the TPC-H* dataset. As the number of partitions and the dimension of the feature vectors increase, the total overhead increases, and the clustering component takes up an increasing proportion of it. The overhead can be further reduced via optimizations such as performing clustering in parallel across different importance groups.
In this section, we take a closer look at the individual components of the picker and their impact on the final performance, as well as the importance of the partition features.
We inspect how the three components of the partition picker introduced in §4 (clustering, the learned importance groups, and outlier handling) contribute to the final performance via a lesion study and a factor analysis (Figure 4).
Figure 4: Lesion study (PS vs. w/o cluster, w/o outlier, w/o regressor) and factor analysis (random, +filter, +outlier, +regressor, +cluster) on the Aria dataset. Each component of our system contributes meaningfully to the final accuracy. Results are similar on other datasets.
In the lesion study, we disable one component at a time: for example, to disable the outlier component we remove the sampling budget dedicated to outliers, and to disable the regressor (§4) we treat all partitions that pass the selectivity filter (selectivity_upper > 0) as a single importance group.

We divide the partition features into four categories based on the sketches used to generate them: selectivity, heavy hitter, distinct value and measures. We investigate the contribution of the features in each component of PS. The outlier component depends exclusively on the heavy hitter features. The clustering component uses all four feature types, and we report the list of features selected for each dataset in Appendix B.1. For the learned component, we measure the regressors' feature importance via the "gain" metric, which reports the improvement in accuracy brought by a feature to the branches it is on [9]. For each dataset, we report the gain for the features in each category as a percentage of the total gain aggregated over all learned models. The larger the percentage, the more important the feature is to the final accuracy. We report the results in Figure 5.

Figure 5: Feature importance for the regressors across TPC-H*, TPC-DS*, Aria and KDD. The higher the percentage, the more important the statistics are to the regressor's accuracy.
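A minimal sketch of the gain measurement with XGBoost; the training data here is synthetic, standing in for (partition feature vector, contribution label) pairs:

import numpy as np
import xgboost as xgb

X = np.random.rand(1000, 8)                # 1000 partitions, 8 features
y = X[:, 0] * 2 + X[:, 3]                  # stand-in contribution labels
regr = xgb.XGBRegressor(n_estimators=50, max_depth=4).fit(X, y)

gain = regr.get_booster().get_score(importance_type="gain")
total = sum(gain.values())
for feat, g in sorted(gain.items(), key=lambda kv: -kv[1]):
    print(f"{feat}: {100 * g / total:.1f}% of total gain")  # e.g. f0, f3 dominate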
Figure 6: Our method consistently outperforms alternatives across datasets and data layouts (panels: TPCDS sorted by p_promo_sk and by cs_net_profit; Aria sorted by AppInfo_Version and by IngestionTime; KDD sorted by service+flag and by src_bytes+dst_bytes).

Overall, all four types of features contribute to the regressor accuracy, but their relative importance varies across datasets. Selectivity estimates, despite being less useful for the regressors, are useful for filtering out partitions that do not contain any rows satisfying the predicate.
In this section, we evaluate the sensitivity of the system's performance to changes in setups and parameters.
One of our design constraints is the ability to work with data in situ. To assess how PS performs on different data layouts, we evaluate two additional layouts for each dataset, using the same training and testing query sets as the earlier experiments (Figure 6). PS consistently outperforms alternatives across the board, but the sizes of the improvements vary across datasets and layouts. Overall, the more random/uniform the data layout is, the less room for improvement there is for importance-style sampling. For example, in the TPC-DS* dataset, the layout sorted by column cs_net_profit is more uniform than the layout sorted by column p_promo_sk, since random sampling achieves a much smaller error under the same sampling budget in the former layout. LSS is only marginally better than random in the former layout, indicating a weak correlation between features and partition importance.

As a special case, we explicitly evaluate PS on a random layout for the TPC-H* dataset with a scale factor of 1 (Figure 8, left). As expected, sampling partitions uniformly at random performs well on the random layout. PS underperforms random sampling in this setting, but the performance difference is small. Realistically, we do not expect PS to be used on random data layouts; users would have chosen random sampling were they paying the cost of maintaining a random layout [20].

Figure 7: Performance breakdown by query selectivity on the TPC-H* dataset (sf=1000).

Figure 8: Comparison of TPC-H* (sf=1) results on different data layouts (random vs. L_SHIPDATE) and total number of partitions (1k vs. 10k).

We next investigate how queries with different selectivities benefit from PS. Figure 7 reports the error breakdown by query selectivity for random partition-level sampling and PS on the TPC-H* dataset; other datasets show similar trends. Compared to naive random partition-level sampling (blue), PS offers larger improvements for the more selective queries; relative to random sampling with the selectivity filter, PS offers larger improvements for the non-selective queries.

In this subsection, we investigate the impact of the partition count on the final performance. We report results on the TPC-H* dataset (sf=1) with 1000 and 10,000 partitions in the middle and right plots of Figure 8. Compared to the results on the same dataset with fewer partitions, the percentage of partitions that can be skipped increases as the number of partitions grows. In addition, as the partition count increases, the error achieved under the same sampling fraction becomes smaller. However, the overheads of PS also increase with the number of partitions. Specifically, the storage overhead for per-partition statistics increases linearly with the number of partitions, and the latency of the partition picker also increases with the partition count. Perhaps more concerning is the increase in I/O costs: the larger the partition count, the smaller each partition. In the limit, when each partition contains only one row, partition-level sampling is equivalent to row-level sampling, which is expensive to construct, as discussed earlier (§1).
Figure 9: The average, worst (Q8) and best (Q1) results from a generalization test on unseen TPC-H queries (sf=1).

Table 6: AUC for different clustering algorithms; smaller is better.
          HAC(single)    HAC(ward)    KMeans
TPCDS
Aria
KDD       .71            .58          .55
To further assess the ability of the trained models to generalize to unseen queries, we test PS trained on the randomly generated training queries for the TPC-H schema (described above); the set of aggregate functions and group-by columnsets is shared between the training and test sets. We generate 20 random test queries for each TPC-H query template (Q4 is excluded since it operates on the orders table). We report the average, worst and best performance across the test queries on the TPC-H* dataset (sf=1000) in Figure 9. On average, PS still outperforms uniform partition sampling, despite the larger domain gap between the training and test sets compared to the earlier experiments.

We also evaluate the effect of the choice of clustering algorithm on clustering performance. We compare a bottom-up clustering algorithm (Hierarchical Agglomerative Clustering, or HAC) to a top-down algorithm (KMeans). For HAC, we evaluate two linkage metrics: the "single" linkage minimizes the minimum distance between all points of the two merged clusters, while the "ward" linkage minimizes the variance of the two merged clusters. For each dataset, we evaluate the average relative error of the query answer estimates, and report the area under the error curve across sampling budgets (Table 6). The smaller the area, the better the clustering performance. We also include results from a similar evaluation of the impact of the feature selection procedure in Appendix B.1.

HAC with the "ward" linkage and KMeans consistently produce similar results, suggesting that clustering performance does not depend on the choice of clustering algorithm. The single linkage metric, however, produces worse results, especially on the TPCDS dataset.

6. RELATED WORK
In this section, we discuss related work in sampling-based AQP, data skipping, and partition-level sampling.
Sampling-based AQP.
Sampling-based approximate query processing has been extensively studied; query results are estimated from a carefully chosen subset of the data [25]. A number of techniques have been proposed to improve upon simple row-level random sampling, for example by using auxiliary data structures such as an outlier index [23] or by biasing the samples towards small groups [12, 13]. Prior work has shown that, despite the improvements in sampling techniques, it is often difficult to construct a sample pool that offers good results for arbitrary queries within feasible storage budgets [44]. Instead of computing and storing samples a priori [14, 16, 22], our work makes sampling decisions exclusively during query optimization.

Prior works have also used learning to improve sampling efficiency for AQP. One line of work uses learning to model the dataset and reduce the number of samples required to answer queries [31]. Similarly, prior work tries to learn the underlying distribution that generates the dataset from queries, and relies on the learned model to reduce sample sizes over time [48]. Our work is closer to works that use learned models to improve the design of the sampling scheme. Recent work proposes a learned stratified sampling scheme wherein the model predictions are used as the stratification criterion [56]. However, that work focuses on row-level samples and on count queries; we support a broader scope of queries with aggregates and group-bys, and work with partition-level samples. In the evaluation, we compare against a scheme inspired by learned stratified sampling.
Data Skipping.
Our work is also closely related to prior work on data skipping, which studies the problems of optimizing data layouts [54, 55, 57, 58] and indexes [40, 41, 49] to improve data skipping given a query workload. Building on the observation that it is often difficult, if not impossible, to find a data layout that offers optimal data skipping for all queries, we instead choose to work with data in situ. Researchers and practitioners have also looked at ways to use statistics and metadata to prune partitions that are irrelevant to the query. The proposed approaches range from using simple statistics such as min and max to check for predicate ranges [3, 4] to deriving complex pruning rules for queries with joins [43]. Our work is directly inspired by this line of work and extends deterministic pruning to probabilistic partition selection.
Partition-level sampling.
Researchers have long recognized the I/O benefits of partition-level sampling over row-level sampling [39, 50]. Partition-level samples have been used to build statistics such as histograms and distinct-value estimates for query optimizers [24, 26]. Prior work has studied combining row-level and partition-level Bernoulli-style sampling for SUM, COUNT, and AVG queries, in which one can adjust the overall sampling rate but each sample is treated equally [35]. Our work more closely resembles importance sampling, where we sample more important partitions with higher probability.

Partition-level sampling is also studied in the context of online aggregation (OLA), where query estimates can be progressively refined as more data gets processed, and users can stop the processing when the answer reaches the target accuracy [28, 47, 51]. Classic work in OLA assumes that tuples are processed in a random order, which often requires random shuffling as an expensive preprocessing step [20]. Our approach does not require a random layout, and in fact, should not be used if the data layout is random. Prior work has also studied OLA over raw data, which requires an expensive tuple extraction step to process raw files [27]. PS can work with data stored in any format as long as per-partition statistics are available, and focuses on selecting fewer partitions instead of stopping processing early within a partition, since the most expensive operation in our setup is the I/O cost of reading a partition.
7. DISCUSSION AND FUTURE WORK
Our work shows promise as a first step towards using learning techniques to improve upon uniform partition-level sampling. We highlight a few important areas for future work below.

First, our system is designed mainly for read-only and append-only data stores, so the proposed set of sketches should be reconsidered if deletions and edits to the data must be supported. Furthermore, the partition picker logic must be retrained when the summary statistics of partitions change in a substantial way.

Second, our work only considers generalization to unseen queries in the same workload on the same dataset and data layout. Although retraining can help generalize to unseen columns in the same dataset and layout, supporting broader forms of generalization, such as to different data layouts, is non-trivial and requires further attention.

Third, our work demonstrates empirical advantages over uniform partition-level sampling on several real-world datasets but provides no a priori error guarantees. Developing error guarantees and diagnostic procedures for failure cases will be of immediate value to practitioners.
8. CONCLUSION
We introduce PS, a system that leverages lightweight summary statistics to perform weighted partition selection in big-data clusters. We propose a set of sketches (measures, heavy hitters, distinct values, and histograms) to generate partition-level summary statistics that help assess partition similarity and importance. We show that our prototype of PS provides sizable speedups compared to random partition selection with a small storage overhead.
9. ACKNOWLEDGEMENT
We thank Laurel Orr for her contributions to an early version of this project. We thank Surajit Chaudhuri and many members of the Stanford InfoLab for their valuable feedback. Kexin Rong, Peter Bailis and Philip Levis were supported in part by affiliate members and other supporters of the Stanford DAWN project (Ant Financial, Facebook, Google, Infosys, NEC, and VMware) as well as Toyota Research Institute, Northrop Grumman, Amazon Web Services, Cisco, the NSF under CAREER grant CNS-1651570, and the NSF under Grant No. CPS-1931750. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the NSF. Toyota Research Institute ("TRI") provided funds to assist the authors with their research, but this article solely reflects the opinions and conclusions of its authors and not TRI or any other Toyota entity.
10. REFERENCES

[1] https://bit.ly/2T8MsFj.
[2] Block Sampling in Hive. https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Sampling. Accessed: 2020-2-12.
[3] Impala Partition Pruning. https://docs.cloudera.com/runtime/7.0.3/impala-reference/topics/impala-partition-pruning.html. Accessed: 2020-2-12.
[4] MySQL Partition Pruning. https://dev.mysql.com/doc/mysql-partitioning-excerpt/8.0/en/partitioning-pruning.html. Accessed: 2020-2-12.
[5] Oracle Database optimizer statistics. https://docs.oracle.com/en/database/oracle/oracle-database/18/tgsql/optimizer-statistics-concepts.htm. Accessed: 2020-2-12.
[6] PostgreSQL 9.5.21 TABLESAMPLE. Accessed: 2020-2-12.
[7] Program for TPC-H Data Generation with Skew. Accessed: 2020-2-12.
[8] Snowflake SAMPLE / TABLESAMPLE. https://docs.snowflake.net/manuals/sql-reference/constructs/sample.html. Accessed: 2020-2-12.
[9] XGBoost Feature Importance. https://xgboost.readthedocs.io/en/latest/python/python_api.html. Accessed: 2020-2-12.
[10] Probability Sampling from a Finite Universe, chapter 1, pages 1–93. John Wiley & Sons, Ltd, 2009.
[11] F. Abuzaid, P. Kraft, S. Suri, E. Gan, E. Xu, A. Shenoy, A. Ananthanarayan, J. Sheu, E. Meijer, X. Wu, et al. Diff: a relational interface for large-scale data explanation. PVLDB, 12(4):419–432, 2018.
[12] S. Acharya, P. B. Gibbons, and V. Poosala. Aqua: A fast decision support system using approximate query answers. PVLDB, 1999.
[13] S. Acharya, P. B. Gibbons, and V. Poosala. Congressional samples for approximate answering of group-by queries. In ACM SIGMOD Record, volume 29, pages 487–498. ACM, 2000.
[14] S. Agarwal, B. Mozafari, A. Panda, H. Milner, S. Madden, and I. Stoica. BlinkDB: queries with bounded errors and bounded response times on very large data. In Proceedings of the 8th ACM European Conference on Computer Systems, pages 29–42. ACM, 2013.
[15] M. Armbrust, R. S. Xin, C. Lian, Y. Huai, D. Liu, J. K. Bradley, X. Meng, T. Kaftan, M. J. Franklin, A. Ghodsi, et al. Spark SQL: Relational data processing in Spark. In SIGMOD, pages 1383–1394, 2015.
[16] B. Babcock, S. Chaudhuri, and G. Das. Dynamic sample selection for approximate query processing. In SIGMOD, pages 539–550, 2003.
[17] S. D. Bay, D. Kibler, M. J. Pazzani, and P. Smyth. The UCI KDD archive of large data sets for data mining research and experimentation. ACM SIGKDD Explorations Newsletter, 2(2):81–85, 2000.
[18] I. Bello, H. Pham, Q. V. Le, M. Norouzi, and S. Bengio. Neural combinatorial optimization with reinforcement learning. arXiv preprint arXiv:1611.09940, 2016.
[19] K. Beyer, P. J. Haas, B. Reinwald, Y. Sismanis, and R. Gemulla. On synopses for distinct-value estimation under multiset operations. In SIGMOD, pages 199–210, 2007.
[20] P. G. Brown and P. J. Haas. Techniques for warehousing of sample data. In ICDE, pages 6–6, 2006.
[21] R. Chaiken, B. Jenkins, P. Larson, B. Ramsey, D. Shakib, S. Weaver, and J. Zhou. SCOPE: Easy and efficient parallel processing of massive datasets. PVLDB, 1(2):1265–1276, 2008.
[22] S. Chaudhuri, G. Das, and V. Narasayya. A robust, optimization-based approach for approximate answering of aggregate queries. In ACM SIGMOD Record, volume 30, pages 295–306. ACM, 2001.
[23] S. Chaudhuri, G. Das, M. Datar, R. Motwani, and V. Narasayya. Overcoming limitations of sampling for aggregation queries. In Proceedings of the 17th International Conference on Data Engineering, pages 534–542. IEEE, 2001.
[24] S. Chaudhuri, G. Das, and U. Srivastava. Effective use of block-level sampling in statistics estimation. In SIGMOD, pages 287–298, 2004.
[25] S. Chaudhuri, B. Ding, and S. Kandula. Approximate query processing: No silver bullet. In SIGMOD, pages 511–519. ACM, 2017.
[26] S. Chaudhuri, R. Motwani, and V. Narasayya. Random sampling for histogram construction: How much is enough? ACM SIGMOD Record, 27(2):436–447, 1998.
[27] Y. Cheng, W. Zhao, and F. Rusu. Bi-level online aggregation on raw data. In Proceedings of the 29th International Conference on Scientific and Statistical Database Management, pages 1–12, 2017.
[28] X. Ci and X. Meng. An efficient block sampling strategy for online aggregation in the cloud. In International Conference on Web-Age Information Management, pages 362–373. Springer, 2015.
[29] G. Cormode, M. Garofalakis, P. J. Haas, C. Jermaine, et al. Synopses for massive data: Samples, histograms, wavelets, sketches. Foundations and Trends in Databases, 4(1–3):1–294, 2011.
[30] G. Das, S. Chaudhuri, and U. Srivastava. Block-level sampling in statistics estimation, Oct. 6 2005. US Patent App. 10/814,382.
[31] A. Deshpande, C. Guestrin, S. R. Madden, J. M. Hellerstein, and W. Hong. Model-driven data acquisition in sensor networks. PVLDB, 30:588–599, 2004.
[32] B. Ding, S. Huang, S. Chaudhuri, K. Chakrabarti, and C. Wang. Sample + Seek: Approximating aggregates with distribution precision guarantee. In SIGMOD, pages 679–694. ACM, 2016.
[33] E. Gan, P. Bailis, and M. Charikar. CoopStore: Optimizing precomputed summaries for aggregation. PVLDB, 13(11):2174–2187, 2020.
[34] S. Ghemawat, H. Gobioff, and S.-T. Leung. The Google file system. In Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles, pages 29–43, 2003.
[35] P. J. Haas and C. König. A bi-level Bernoulli scheme for database sampling. In SIGMOD, pages 275–286.
[36] A. Y. Halevy. Answering queries using views: A survey. The VLDB Journal, 10(4):270–294, 2001.
[37] J. M. Hammersley and D. Handscomb. Percolation processes. In Monte Carlo Methods, pages 134–141. Springer, 1964.
[38] J. M. Hellerstein, P. J. Haas, and H. J. Wang. Online aggregation. In Proceedings of the 1997 ACM SIGMOD International Conference on Management of Data, pages 171–182, 1997.
[39] W.-C. Hou and G. Ozsoyoglu. Statistical estimators for aggregate relational algebra queries. ACM Transactions on Database Systems (TODS), 16(4):600–654, 1991.
[40] S. Idreos, M. L. Kersten, S. Manegold, et al. Database cracking. In CIDR, volume 7, pages 68–78, 2007.
[41] S. Idreos, S. Manegold, H. Kuno, and G. Graefe. Merging what's cracked, cracking what's merged: adaptive indexing in main-memory column-stores. PVLDB, 4(9):586–597, 2011.
[42] S. Kandula, K. Lee, S. Chaudhuri, and M. Friedman. Experiences with approximating queries in Microsoft's production big-data clusters. PVLDB, 12(12):2131–2142, 2019.
[43] S. Kandula, L. Orr, and S. Chaudhuri. Pushing data-induced predicates through joins in big-data clusters. PVLDB, 13(3):252–265, 2019.
[44] S. Kandula, A. Shanbhag, A. Vitorovic, M. Olma, R. Grandl, S. Chaudhuri, and B. Ding. Quickr: Lazily approximating complex ad-hoc queries in big-data clusters. In SIGMOD, pages 631–646, 2016.
[45] E. Khalil, H. Dai, Y. Zhang, B. Dilkina, and L. Song. Learning combinatorial optimization algorithms over graphs. In Advances in Neural Information Processing Systems, pages 6348–6358, 2017.
[46] G. S. Manku and R. Motwani. Approximate frequency counts over data streams. pages 346–357, 2002.
[47] N. Pansare, V. R. Borkar, C. Jermaine, and T. Condie. Online aggregation for large MapReduce jobs. PVLDB, 4(11):1135–1145, 2011.
[48] Y. Park, A. S. Tajik, M. Cafarella, and B. Mozafari. Database learning: Toward a database that becomes smarter every time. In SIGMOD, pages 587–602, 2017.
[49] F. M. Schuhknecht, A. Jindal, and J. Dittrich. The uncracked pieces in database cracking. PVLDB, 7(2):97–108, 2013.
[50] S. Seshadri and J. F. Naughton. Sampling issues in parallel database systems. In International Conference on Extending Database Technology, pages 328–343. Springer, 1992.
[51] Y. Shi, X. Meng, F. Wang, and Y. Gan. You can stop early with COLA: online processing of aggregate queries in the cloud. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management, pages 1223–1232. ACM, 2012.
[52] J. Shute, R. Vingralek, B. Samwel, B. Handy, C. Whipkey, E. Rollins, M. Oancea, K. Littlefield, D. Menestrina, S. Ellner, et al. F1: A distributed SQL database that scales. 2013.
[53] K. Shvachko, H. Kuang, S. Radia, and R. Chansler. The Hadoop distributed file system. IEEE, 2010.
[54] L. Sun, M. J. Franklin, S. Krishnan, and R. S. Xin. Fine-grained partitioning for aggressive data skipping. In SIGMOD, pages 1115–1126, 2014.
[55] L. Sun, M. J. Franklin, J. Wang, and E. Wu. Skipping-oriented partitioning for columnar layouts. PVLDB, 10(4):421–432, 2016.
[56] B. Walenz, S. Sintos, S. Roy, and J. Yang. Learning to sample: Counting with complex queries. PVLDB, 13(3):390–402, 2019.
[57] Z. Yang, B. Chandramouli, C. Wang, J. Gehrke, Y. Li, U. F. Minhas, P.-A. Larson, D. Kossmann, and R. Acharya. Qd-tree: Learning data layouts for big data analytics. In SIGMOD, pages 193–208, 2020.
[58] E. Zamanian, C. Binnig, and A. Salama. Locality-aware partitioning in parallel database systems. In SIGMOD, pages 17–30, 2015.
[59] J. Zhou, N. Bruno, M.-C. Wu, P.-A. Larson, R. Chaiken, and D. Shakib. SCOPE: parallel databases meet MapReduce. PVLDB, 21(5):611–636, 2012.
APPENDIX

A. DATA SCHEMA

A.1 TPC-H*
We provide the query used to denormalize the lineitem table in the TPC-H dataset below. This denormalized table can support 16 out of the 22 queries in the TPC-H benchmark (Q1, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14, 15, 17, 18, 19, 21). We additionally include two derived columns, L_YEAR and O_YEAR, in the view in order to support group-by clauses on these columns (Q7, 8, 9). Our generalization test (§ ) uses queries over this table.

CREATE TABLE denorm AS
SELECT lineitem.*, customer.*, orders.*, part.*,
       partsupp.*, supplier.*, n1.*, n2.*, r1.*, r2.*,
       datepart(yy, o_orderdate) AS o_year,
       datepart(yy, l_shipdate) AS l_year
FROM lineitem
JOIN partsupp ON ps_partkey = l_partkey AND ps_suppkey = l_suppkey
JOIN orders   ON o_orderkey = l_orderkey
JOIN part     ON p_partkey = ps_partkey
JOIN supplier ON s_suppkey = ps_suppkey
JOIN customer ON c_custkey = o_custkey
JOIN nation AS n1 ON n1.n_nationkey = c_nationkey
JOIN nation AS n2 ON n2.n_nationkey = s_nationkey
JOIN region AS r1 ON r1.r_regionkey = n1.n_regionkey
JOIN region AS r2 ON r2.r_regionkey = n2.n_regionkey

A.2 TPC-DS*
We provide the query used to denormalize the catalog_sales table below. The joined dataset contains 4.3M rows, 21 numeric columns, and 20 categorical columns.
CREATE TABLE denorm_cs AS
SELECT catalog_sales.*, cd.*, item.*, promo.*, date.*
FROM catalog_sales
JOIN item  ON cs_item_sk = i_item_sk
JOIN promo ON cs_promo_sk = p_promo_sk
JOIN date  ON cs_sold_date_sk = d_date_sk
JOIN cd    ON cs_ship_cdemo_sk = cd_demo_sk

Table 7: Area under the curve for the average relative error of clustering under different sampling budgets for Hierarchical Agglomerative Clustering (HAC) and KMeans clustering; smaller is better.

           HAC (ward)   +feat sel     KMeans   +feat sel
TPC-DS*
Aria
KDD           .58       .55 (-5%)      .55     .54 (-0.5%)

Algorithm 3: Feature Selection for Clustering
 1: feats ← (selectivity, occurrence bitmap, log(x), log²(x), min(log(x)), max(log(x)), x, x², std, min(x), max(x), ...)
 2: best ← []                          ▷ Features excluded from clustering
 3: for i ← 1 to N do
 4:   feats.shuffle()                  ▷ Explore features in random order
 5:   to_exclude ← []
 6:   for f ∈ feats do
 7:     new ← to_exclude + [f]
 8:     if ImproveCluster(to_exclude, new) then
 9:       to_exclude ← new
10:     end if
11:   end for
12:   if ImproveCluster(best, to_exclude) then
13:     best ← to_exclude
14:   end if
15: end for
16: return best
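Algorithm 3 translates directly into a small greedy search. The following Python rendering is our illustration; improves_clustering (the paper's ImproveCluster) and num_trials are assumed to be supplied by the caller, with the former re-running clustering under each candidate exclusion set and comparing the resulting estimation error:

import random

def select_features(feats, improves_clustering, num_trials=10):
    # Greedy search for features to EXCLUDE from clustering (Algorithm 3).
    # improves_clustering(old_excluded, new_excluded) -> bool is an assumed
    # callback; num_trials is a hypothetical bound on the outer loop.
    best = []  # features excluded from clustering
    for _ in range(num_trials):
        random.shuffle(feats)          # explore features in random order
        to_exclude = []
        for f in feats:
            candidate = to_exclude + [f]
            if improves_clustering(to_exclude, candidate):
                to_exclude = candidate
        if improves_clustering(best, to_exclude):
            best = to_exclude
    return best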
A.3 Aria

Aria is a production service request log dataset at Microsoft that was also used in prior work [11, 33]. The dataset contains the following columns: records_received_count, records_tried_to_send_count, records_sent_count, olsize, ol_w, infl, TenantId, AppInfo_Version, UserInfo_TimeZone, DeviceInfo_NetworkType, PipelineInfo_IngestionTime.

B. IMPLEMENTATION DETAILS
In this section, we provide additional implementation details for the partition picker.
B.1 Clustering
Normalization.
Prior to clustering, we normalize the summary statistics to make sure that the Euclidean distance is not dominated by any single statistic. We first apply a log transformation to reduce the overall skewness of all summary statistics except for the selectivity estimates; for the selectivity estimates, which are between 0 and 1, we use the cube root transformation instead. We then normalize each summary statistic by its average value in the training dataset. We choose the average instead of the max as the normalization factor since it is more robust to outliers. At test time, the statistics are normalized by their corresponding average values from the training dataset.
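A minimal sketch of this normalization in NumPy; the array layout, the function names, and the use of log1p (to keep zero-valued statistics finite) are our assumptions:

import numpy as np

def transform(stats, selectivity_cols):
    # stats: (num_partitions x num_stats) raw summary statistics.
    out = np.log1p(stats)  # log transform reduces skew (log1p is our choice)
    # Cube root for the [0, 1]-valued selectivity estimates.
    out[:, selectivity_cols] = np.cbrt(stats[:, selectivity_cols])
    return out

def fit_normalizer(train_stats, selectivity_cols):
    # Per-statistic normalization factors: averages over the training set,
    # which are more robust to outliers than the max.
    return transform(train_stats, selectivity_cols).mean(axis=0)

def normalize(stats, selectivity_cols, train_means):
    # Test-time statistics are scaled by the training-set averages.
    return transform(stats, selectivity_cols) / train_means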
Failure Cases.
As discussed in Section 4.2, clustering does not perform well when the predicate is highly selective. Although we can use the selectivity_upper feature as an upper bound on the true selectivity, in practice, we have seen that this upper bound can overestimate the true selectivity by over 10× for complex predicates (see Section 3.2). Therefore, we simply rely on the query semantics to estimate the complexity of the predicate. Specifically, if the predicate contains more than 10 clauses, we use random sampling instead of clustering to select sample partitions.

Feature Selection.
We provide pseudocode for the feature selection procedure in Algorithm 3. We report the features selected by the procedure on the four real-world datasets for the experiments reported in § :

• TPC-H*: selectivity_upper, selectivity_lower, min(x), max_hh, max_dv, hh_bitmap
• TPC-DS*: log²(x), x², sum_dv, hh_bitmap
• Aria: selectivity_indep, selectivity_max, min(log(x)), x², max(x), avg_hh
• KDD: selectivity_indep, x², max_dv

Only a small number of features are used for each dataset, but across datasets, all four types of features are represented. This again illustrates the need for all four sketches.

Finally, we measure the quantitative impact of the feature selection procedure on clustering performance in Table 7. Similar to the experiment in § , we evaluate the average relative error for estimating the query answer using different clustering procedures, and compare the total area under the error curve across sampling budgets. Overall, feature selection consistently improves clustering performance for both clustering methods, reducing the area by 0.5% to 15% across datasets.

B.2 Training
We use the XGBoost regressor as our base model and use the squared error as the loss function. Although our models are only used for binary classification, we train them as regressors instead of classifiers. This is to address the problem that the ratio of positive to negative examples differs across queries. Consider a query that has one partition with rows that satisfy the predicate versus a query with 100 such partitions. Missing one positive example has a much larger impact on the final accuracy for the first query than for the second. While a classifier can only handle class imbalance globally, with a regressor, we can scale the labels differently such that the positive examples weigh more in the first query. We provide pseudocode for the training setup in Algorithm 4.
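A condensed sketch of this setup; the per-query label scaling mirrors Algorithm 4, while the stand-in training data, the threshold t, and the scaling constant c are illustrative assumptions:

import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
# Hypothetical training data: one (F_i, A_i) pair per training query, as in
# Algorithm 4. F_i: per-partition features; A_i: normalized per-partition
# query answers.
training_set = [(rng.random((200, 16)), rng.random((200, 3)) * 0.02)
                for _ in range(5)]

def make_labels(A_i, t=0.01, c=1.0):
    # Which partitions contribute to the answer above threshold t.
    y = (A_i.max(axis=1) > t).astype(float)
    positive, n = y.sum(), len(y)
    # Per-query label scaling (Algorithm 4): positives weigh more when they
    # are rare, so missing one of few important partitions costs more under
    # the squared-error loss.
    return np.where(y == 1, np.sqrt(c / positive), -np.sqrt(c / (n - positive)))

F = np.vstack([F_i for F_i, _ in training_set])
Y = np.concatenate([make_labels(A_i) for _, A_i in training_set])
model = xgb.XGBRegressor(objective="reg:squarederror").fit(F, Y)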
C. ADDITIONAL RESULTS

C.1 Modified Learned Stratified Sampling
In this section, we present the three necessary modifications made to Learned Stratified Sampling (LSS) [56] in detail:

• We move the training from online to offline, and use one trained model per dataset and layout instead of per query. LSS performs training inline for each query, using a fixed portion of the sampling budget as the training data. Training on random row-level samples may invalidate the I/O gains and can already require a full scan over the data (§ ). Instead, we train offline on training queries sampled from the workload and use the same trained model for all test queries.

• We change the model's inputs and labels. LSS operates on rows, while we use partition features as inputs. LSS only considers count queries, so the label is either 0 or 1. To support aggregates and group-bys, we use the partition contribution defined in § .

• We use different stratification strategies. Prior work analyzes optimal choices of strata boundaries for proportional allocation of samples, in which the sample size allocated to each stratum is proportional to its size. The analysis does not extend to our setup, so we use equi-width strata instead (see the sketch after Table 8). To set the number of strata, we exhaustively sweep the strata sizes and select the one that minimizes average relative error on the training set. We report the selected strata sizes in Table 8.

Figure 10: Impact of the sampling decay rate α on the KDD dataset (left: learned regressor, right: oracle; α = 1.0 to 5.0). Larger α improves performance, but the marginal benefit decreases.

Algorithm 4: Training Label Generation
Input: threshold t ∈ [0, 1]; scaling constant c; number of partitions n; feature dimension m; query answer dimension d; for each input query i, partition features F_i ∈ R^(n×m) and normalized query answers on each partition A_i ∈ [0, 1]^(n×d)
Output: X, Y
 1: X ← [], Y ← []
 2: for each (F_i, A_i) in the training set do   ▷ For each query
 3:   ans ← Σ(A_i)                               ▷ Ground-truth query answer
 4:   for j ← 1 to n do
 5:     y[j] ← max(A_i[j]) > t                   ▷ Partition contribution
 6:   end for
 7:   positive ← Σ y
 8:   for j ← 1 to n do
 9:     if y[j] == 1 then
10:       y[j] ← √(c / positive)
11:     else
12:       y[j] ← −√(c / (n − positive))
13:     end if
14:   end for
15:   X.append(F_i)
16:   Y.append(y)
17: end for

Table 8: Strata sizes for the modified LSS algorithm, selected via exhaustive search.

             Sampling Budget (% data read)
           10    20    30    40    50    60    70    80    90
TPC-H*     15    50   100   250   260   580   430    50   730
TPC-DS*    55   120    85   130   160   250   395   170    10
Aria       75    80    55   150   260    70    80   130   190
KDD        90   160   295   230   360   430   220   410   820
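A sketch of the equi-width stratification used in the modified LSS baseline; the function names, the stratification over model scores, and the proportional allocation of the budget across strata are our assumptions:

import numpy as np

def equi_width_strata(scores, num_strata):
    # Assign each partition to an equi-width stratum of the score range.
    edges = np.linspace(scores.min(), scores.max(), num_strata + 1)
    # np.digitize maps each score to a stratum id in [0, num_strata - 1].
    return np.clip(np.digitize(scores, edges[1:-1]), 0, num_strata - 1)

def sample_within_strata(strata, budget, rng=np.random.default_rng(0)):
    # Draw a uniform sample of partitions from each stratum, allocating
    # the budget proportionally to stratum size (an assumed choice).
    chosen, n = [], len(strata)
    for s in np.unique(strata):
        members = np.flatnonzero(strata == s)
        k = max(1, round(budget * len(members) / n))
        chosen.extend(rng.choice(members, size=min(k, len(members)), replace=False))
    return np.array(chosen)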
C.2 Effect of Sampling Rate
We investigate the extent to which applying different sampling rates affects the performance of learned importance-style sampling. Recall that we tune the sampling rate via the parameter α, which is the ratio of sampling rates between the i-th most important and the (i+1)-th most important group. The larger α is, the more samples we allocate to the important groups. We report the results achieved under different values of α for the KDD dataset (Figure 10, left). Overall, the performance improves as α increases, but the marginal benefit decreases.

We repeat the experiment and replace the trained regressors with an oracle that has perfect precision and recall (Figure 10, right). This gives an upper bound on the improvements enabled by importance-style sampling. Compared to using learned models, the overall error decreases with the oracle, as expected. The performance gap between the learned and the oracle regressor widens as α increases. The comparison shows that the more accurate the regressor, the more benefit we get from using higher sampling rates for important groups. While we used a default value of α = 2 across the experiments, it is possible to further fine-tune α for each dataset to improve performance.
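To make the role of α concrete, here is a sketch of the geometric rate allocation it implies; the normalization so that the expected fraction of data read matches the budget, and the equal-size-group default, are our assumptions:

import numpy as np

def group_sampling_rates(num_groups, budget_fraction, alpha=2.0, group_sizes=None):
    # Sampling rate per importance group: each group is sampled at a rate
    # alpha times that of the next most important group, scaled so that the
    # expected fraction of data read matches budget_fraction.
    if group_sizes is None:
        group_sizes = np.ones(num_groups) / num_groups  # assume equal-size groups
    raw = alpha ** -np.arange(num_groups)   # 1, 1/a, 1/a^2, ... by importance
    scale = budget_fraction / np.dot(raw, group_sizes)
    return np.clip(scale * raw, 0.0, 1.0)

# e.g., 4 importance groups, 10% budget, default alpha = 2:
print(group_sampling_rates(4, 0.10))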
C.3 TPC-H Results

In this subsection, we report a detailed breakdown of the performance of PS and random partition-level sampling on the TPC-H queries from the generalization test (§ ). We treat a SUM aggregate with a CASE condition as an aggregate over the predicate. In addition, PS explicitly chooses to use random sampling instead of clustering to select samples for Q19, which has a complex predicate consisting of 21 clauses (§ B.1). Our training queries are sampled randomly according to the procedure described in § ; an example query over the TPC-H* schema is given below:
SELECT N1_NAME, SUM(L_EXTENDEDPRICE * L_TAX)
FROM denorm
WHERE P_SIZE ≥ ... AND L_COMMITDATE ≥ "1997-09-29"
GROUP BY N1_NAME;
Overall, PS significantly outperforms random partition selection on Q1, Q6 and Q7, and performs similarly to random partition selection on the other queries. In particular, Q1, Q6 and Q7 all have a small number of partitions with either rare groups or outlying aggregate values. While PS can identify such partitions via clustering and outlier detection, random partition selection can easily miss these important partitions, especially when the sampling budget is limited.
Unbiased picker.
We introduce an unbiased version ofour proposed estimator that lends well to analysis. As de-scribed in §
20 40 60 80 100 data read (%) A v g R e l E rr Q1 data read (%) Q5 data read (%) Q7 data read (%) Q8 data read (%) Q12random+filter PS data read (%) A v g R e l E rr Q6 data read (%) Q9 data read (%) Q14 data read (%)
Q17 data read (%)
Q19random+filter PS Figure 11: Detailed breakdown of results on TPC-H queries used in the generalization test ( § significantly outperforms random partition selection on Q1, Q6, Q7 and performs similarly to randompartition selection on other queries.Figure 12: Empirical comparison of the bias andunbiased version of the estimator. The biased esti-mator tends to outperform the unbiased when thesampling fraction is small. picks a cluster exemplar partition at random. We empir-ically compare the performances of the two estimators onfour real-world datasets in Figure 12. For each test query,we run the unbiased estimator 10 times and compute the av-erage error achieved to compare against the error achievedby the biased estimator.Overall, Figure 12 shows that the biased estimator achievessmaller error compared to the unbiased version when thesampling fraction is small, and that there are no significantdifferences in accuracy between the two estimators other-wise. In addition, for a given query, the biased version ofthe estimator has no variance . Therefore, in use cases whenthe sampling budget is limited or when users prefer getting a deterministic answer for a given query, the biased versionof the estimator might be preferred. Analysis.
Next, we analyze the unbiased version of the estimator using the framework of stratified sampling. Compared to a simple random sample of the same size, stratified sampling can produce an estimator with smaller variance if the elements within strata are homogeneous. In our case, each cluster is essentially a stratum; if clustering is effective, the partitions in a cluster are similar to each other, leading to a variance reduction.

Within each stratum, we perform simple random sampling without replacement (SRSWoR) to draw a sample of size 1; the variance formula for SRSWoR can be found in Chapter 2.5.2 of [29]. Note that since we only draw one sample from each cluster/stratum, in order to estimate the variance of a stratum, we would need to evaluate additional partitions per stratum. Finally, the total variance of the unbiased estimator is the sum of the variances from each stratum. When the central limit theorem holds, the 95% confidence interval of an estimator $Y$ is given by $Y \pm 1.96\sqrt{\sigma^2(Y)}$ [29], where $\sigma^2(Y)$ is the variance of the estimator described above.
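For reference, a sketch of the resulting variance under standard stratified-sampling notation (the notation is ours; the SRSWoR formula follows Chapter 2.5.2 of [29]). With $H$ strata, where stratum $h$ contains $N_h$ partitions with per-partition values $y_{hi}$, mean $\bar{Y}_h$, and a sample of size $n_h = 1$:

\[
\sigma^2(\hat{Y}) \;=\; \sum_{h=1}^{H} N_h^2 \left(1 - \frac{n_h}{N_h}\right) \frac{S_h^2}{n_h}
\;=\; \sum_{h=1}^{H} N_h (N_h - 1)\, S_h^2,
\qquad
S_h^2 = \frac{1}{N_h - 1} \sum_{i=1}^{N_h} \left( y_{hi} - \bar{Y}_h \right)^2 .
\]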
D.2 Partition-level vs. row-level sampling

In this subsection, we compare random partition-level sampling to random row-level sampling. We show that under the same sampling fraction, random partition-level sampling has much larger variance than random row-level sampling.
Setup.
We start with a description of the setup. For a group $G$ in the query, let $y_i$ be the value of the aggregate function on partition $i$. Let $\pi_i$ be the probability that partition $i$ is included in the sample, $\pi_{ij}$ be the probability that both partitions $i$ and $j$ are in the sample, $N$ be the total number of partitions, and $S$ be the set of sampled partitions. We wish to estimate the total value of the aggregate function for group $G$ over all partitions. For SUM and COUNT queries, the total value is $Y = \sum_{i=1}^{N} y_i$. If all partitions have a positive sampling probability ($\pi_i > 0, \forall i$), an unbiased Horvitz-Thompson estimator for $Y$ under Poisson sampling is:

\[
\hat{Y} = \sum_{i \in S} \frac{y_i}{\pi_i}
\]

The true variance of the estimator $\hat{Y}$ is:

\[
\sigma^2(\hat{Y}) = \sum_{i,j=1}^{N} \left( \frac{\pi_{ij}}{\pi_i \pi_j} - 1 \right) y_i y_j \tag{1}
\]

However, since $y_i$ is only available for partitions that are included in the sample, we cannot evaluate the true variance using Eq. 1 directly. Instead, we estimate the true variance using the sampled set of partitions $S$ [29]:

\[
\hat{\sigma}^2(\hat{Y}) = \sum_{i,j \in S} \left( \frac{1}{\pi_i \pi_j} - \frac{1}{\pi_{ij}} \right) y_i y_j \tag{2}
\]

If the second-order inclusion probability $\pi_{ij} > 0$ for all $i, j$, Eq. 2 is an unbiased estimator for Eq. 1, the true variance of $\hat{Y}$ [10].
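A small numerical sketch of the Horvitz-Thompson estimate and its variance estimator (Eq. 2) under Poisson sampling; the variable names and example values are ours:

import numpy as np

rng = np.random.default_rng(0)
y = rng.lognormal(mean=3.0, sigma=1.0, size=1000)   # per-partition aggregates
pi = np.full(1000, 0.1)                             # inclusion probabilities

sampled = rng.random(1000) < pi                     # Poisson sampling
y_s, pi_s = y[sampled], pi[sampled]

# Horvitz-Thompson estimate of the total Y = sum_i y_i.
Y_hat = np.sum(y_s / pi_s)

# Eq. 2 under independence (pi_ij = pi_i * pi_j for i != j): the cross
# terms vanish and only the diagonal i == j terms survive.
var_hat = np.sum((1.0 / pi_s**2 - 1.0 / pi_s) * y_s**2)
ci_95 = 1.96 * np.sqrt(var_hat)
print(f"estimate: {Y_hat:.0f} +/- {ci_95:.0f}, true total: {y.sum():.0f}")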
For random partition-level sampling, assume that each partition is selected into the sample with probability $p$. The expected size of $S$ is $Np$. Since the partitions are sampled independently, $\pi_{ij} = \pi_i \pi_j$ for $i \neq j$. Plugging the inclusion probabilities into Eq. 2, the estimator of the true variance is:

\[
\hat{\sigma}^2(\hat{Y}_{blk}) = \sum_{i \in S} \left( \frac{1}{p^2} - \frac{1}{p} \right) y_i^2 \tag{3}
\]

Similarly, assume that each tuple is sampled with probability $p$. Let $t_x$ be the total value that a tuple $x$ contributes towards the aggregate for group $G$, and $S_t$ be the set of sampled tuples. Following a derivation similar to Eq. 3, the estimator of the variance for random row-level sampling is:

\[
\hat{\sigma}^2(\hat{T}_{row}) = \sum_{x \in S_t} \left( \frac{1}{p^2} - \frac{1}{p} \right) t_x^2 \tag{4}
\]

Note that $y_i$ in Eq. 3 is simply the sum of the tuples in partition $i$. Let $b_x$ be the partition that contains tuple $x$; then $y_i = \sum_{b_x = i} t_x$, and therefore

\[
y_i^2 = \sum_{b_x = i} t_x^2 \;+\; 2 \sum_{\substack{x < x' \\ b_x = b_{x'} = i}} t_x t_{x'} .
\]

Compared to Eq. 4, Eq. 3 thus contains additional cross terms between tuples co-located in a partition; when co-located tuples are correlated, as is common in practical layouts, partition-level sampling has much larger variance than row-level sampling at the same sampling fraction.
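To illustrate, a short simulation comparing the two samplers on a clustered layout; all parameters and the data-generating process are illustrative:

import numpy as np

rng = np.random.default_rng(0)
num_partitions, rows_per_partition, p = 200, 100, 0.1

# Clustered layout: tuple values within a partition are correlated
# (each partition draws its rows around a partition-specific mean).
means = rng.lognormal(3.0, 1.0, size=num_partitions)
tuples = rng.normal(means[:, None], 1.0, (num_partitions, rows_per_partition))
partition_sums = tuples.sum(axis=1)

est_blk, est_row = [], []
for _ in range(2000):
    blk = rng.random(num_partitions) < p            # Bernoulli over partitions
    est_blk.append(partition_sums[blk].sum() / p)   # HT estimate, Eq. 3 setting
    row = rng.random(tuples.shape) < p              # Bernoulli over rows
    est_row.append(tuples[row].sum() / p)           # HT estimate, Eq. 4 setting

print("partition-level std:", np.std(est_blk))      # much larger on this layout
print("row-level std:      ", np.std(est_row))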