Approximate Query Processing for Group-By Queries based on Conditional Generative Models
Meifan Zhang (a), Hongzhi Wang (a,b)
(a) Department of Computer Science and Technology, Harbin Institute of Technology
(b) Peng Cheng Laboratory, Shenzhen, China
Abstract
The Group-By query is an important kind of query, which is common and widely used in data warehouses, data analytics, and data visualization. Approximate query processing is an effective way to increase the querying efficiency on big data. The answer to a group-by query involves multiple values, which makes it difficult to provide sufficiently accurate estimations for all the groups. Stratified sampling improves the accuracy compared with uniform sampling, but the samples chosen for some special queries cannot work for other queries. Online sampling chooses samples for the given query at query time, but it incurs a long latency. Thus, it is a challenge to achieve both accuracy and efficiency at the same time. Facing such a challenge, in this work, we propose a sample generation framework based on a conditional generative model. The sample generation framework can generate any number of samples for the given query without accessing the data. The proposed framework, based on a lightweight model, can be combined with stratified sampling and online aggregation to improve the estimation accuracy of group-by queries. The experimental results show that our proposed methods are both efficient and accurate.
Keywords:
Approximate query processing, Group-By queries, Sampling, Conditional generative model

∗ Corresponding author
Email: [email protected] (Hongzhi Wang)
Preprint submitted to Journal of LaTeX Templates, January 11, 2021

1. Introduction
Querying big data is costly and suffers from long latency. Approximate query processing (AQP) is proposed to increase the efficiency of querying big data. Aggregation queries can be estimated by AQP methods. Group-by aggregation queries play an important role in interactive data analytics [1, 2]. A group-by query has the following form:
SELECT A, AGGR(B) FROM D GROUP BY A;

In this template, A is the group-by attribute, B is the aggregation attribute, and AGGR is an aggregation function such as COUNT, SUM, AVG, VAR, or STD. A group-by query groups the data according to some columns and computes the aggregation of each group. The result of a group-by query includes multiple values. The answer to a group-by aggregation query can easily be transformed into a visualization, such as a bar chart, a pie chart, or a histogram, which is useful in data analytics for decision-making [3, 1].

Sampling-based AQP [1, 4, 5, 6, 7, 3, 8] is the most widely used AQP method, but sampling-based AQP for group-by queries is a challenging task. The answer to a group-by query includes multiple values, and it is difficult to get sufficiently accurate estimations for all the groups at the same time [1, 8]. Sampling-based AQP for group-by queries is either inaccurate, inflexible, or inefficient. Sampling methods are mainly divided into two categories: off-line sampling and on-line sampling.

Off-line sampling creates samples before queries arrive. Uniform sampling chooses each tuple as a sample with equal probability. It does not work well for group-by queries, since some small groups cannot get sufficient samples from uniform sampling [8, 9]. Thus, the estimation accuracy of each group varies greatly. Some extremely rare groups will even be missing in the uniform samples. Stratified sampling [10, 11, 12, 13, 14, 8] is a way to improve the accuracy. It samples from each group with different probabilities. Stratified sampling usually relies on the workload or some prior knowledge of the distribution. Some methods set the sampling probabilities with the goal of improving the estimation accuracy of the queries in the workload, but the stratified samples chosen for some special queries do not work well for other queries.
For example, if the aggregated attribute of a given query is not involved in the workload, the off-line stratified samples chosen based on the workload do not benefit the given query. Preparing samples for all the columns requires huge storage.

Online sampling [6, 3] creates samples for the given query after the query arrives. For queries unrelated to the workload, or for ad-hoc queries, the off-line samples are of little help; online samples chosen for the given queries are needed to improve the accuracy. For low-selectivity queries, choosing enough off-line samples is expensive, since only a few tuples satisfy the predicate of a low-selectivity query. Online sampling can choose enough samples for the given query predicate to improve the accuracy without maintaining a large off-line sample. Even though online sampling can improve the accuracy for the given query, its drawback is also significant. Sampling from big data at query time means frequently accessing the raw data and a huge I/O cost [1]. That is, online sampling for the given query results in high latency at query time, which is unacceptable in interactive analysis.

We summarize the pros and cons of these methods. Uniform sampling is simple, but it cannot provide sufficiently accurate estimations for the rare groups due to the uneven distribution of different groups. Stratified sampling is more accurate, but it cannot flexibly respond to variations of the query, because the stratified samples are chosen to optimize the accuracy of some given aggregations or of the queries in the workload. Online sampling can flexibly get samples for the given query, but it suffers from long latency caused by sampling from big data at query time. Thus, these methods cannot achieve accuracy, flexibility, and efficiency at the same time. We therefore attempt to get online stratified samples to improve the estimation accuracy of a group-by query, while avoiding the cost of sampling from the data at query time.
As we mentioned before, stratified sampling benefits group-by queries, but off-line samples cannot cope well with changes in aggregation attributes. Online sampling draws samples for the given query at the cost of high latency. If we can get stratified samples for the given query without accessing the data, both efficiency and accuracy can be achieved. However, getting stratified samples without accessing the data is still a challenging task. A generative model such as a GAN can learn the data distribution and generate samples for the entire dataset. However, it cannot generate targeted stratified samples to improve the estimation accuracy of group-by queries, so it still suffers from sampling error like uniform sampling. We thought of using conditional generative models instead of the raw data to generate stratified samples. A conditional generative model, such as the conditional generative adversarial network (CGAN) [15, 16, 17], can generate images of a given type. We want to enable the conditional generative model to generate samples for each given group, so that rare groups can easily get enough samples to improve the accuracy. We convert the group-by attribute values into labels, and adopt the conditional generative model to learn the conditional data distribution. After the model is well tuned, it allows targeted sample generation for a given group. Thus, we can directly get stratified samples from the model without accessing the raw data.

The proposed method can achieve both high efficiency and high accuracy. On the one hand, the samples can be obtained without scanning the data, which greatly reduces the response latency. On the other hand, the model can generate any number of samples for a given group; thus, small groups can easily get enough samples to improve the accuracy. The proposed method can be easily combined with many kinds of stratified sampling methods to allocate the sample size of each group.
The allocation can be computed according to some statistics, such as the frequency, mean, and variance of each group. These statistics can be computed in a single pass through the data in advance. Based on the proposed sample generation method, it is easy to generate enough samples for each group. However, for a group-by query with a predicate, the generated samples still need to be filtered by the predicate. In order to obtain enough samples that satisfy the predicate, a large number of useless samples would need to be generated, which wastes time. Thus, approximate query processing for a group-by query with a predicate is also a challenging task. The group-by query with a predicate has the following form:
SELECT A, AGGR(B) FROM D WHERE Condition GROUP BY A;

The predicate is a filtering condition, such as "l ≤ C ≤ r", and the answer to this kind of query is the aggregation of each group after filtering through the condition. If the tuples in the data rarely satisfy the predicate, in other words, if it is a low-selectivity query, then there are also very few tuples that satisfy the predicate in the generated samples. We could simply generate more samples to get enough that satisfy the predicate, but this is not an ideal method. In order to reduce the waste caused by generating useless samples, we attempt to enable the generative model to generate targeted samples for the given predicate. However, it is not trivial to label the data with the predicate, since a tuple belongs to all the ranges that cover it. In order to label the tuples with range information, we partition the data with a histogram, so that each tuple belongs to one bucket. We train the conditional generative model with both the bucket labels and the group labels. Once the model is well tuned, it can generate samples for a given bucket and group. This method improves the sample quality for low-selectivity queries.

We make the following contributions in this paper. (1) We propose a sample generation framework for group-by queries based on a conditional generative model. The proposed framework can generate any number of samples for a given group. It can be combined with stratified sampling and online aggregation to improve the estimation accuracy. (2) We propose a sample generation framework for group-by queries with predicates. The proposed framework can generate any number of samples for a sub-range of the data, which benefits the estimation of low-selectivity queries. (3) The experimental results show that our method can efficiently generate stratified samples to improve the accuracy, iteratively narrow down the estimation error like online aggregation, and work well for low-selectivity queries.
We also compare the proposed method with the state-of-the-art machine-learning-based method DBEst. The experimental results show that our method is more accurate while occupying less space. The remaining parts of this paper are organized as follows. In Section 2, we survey the related work. In Section 3, we introduce the sample generation framework for group-by queries and the methods of combining stratified sampling and online aggregation with the proposed method to improve the accuracy. In Section 4, we introduce the method to generate stratified samples for group-by queries with predicates. In Section 5, we show the experimental results. Finally, we conclude this study in Section 6.
2. Related Work
Sampling-based approximate query processing [1, 4, 5, 6, 7] is the most widely used AQP approach, since it is simple and can be adapted to most general queries. Many works have been proposed in the past. Uniform sampling does not work well for highly skewed data. The answer to a group-by query includes the aggregation of each group, and the distribution of the groups is not always uniform, so uniform sampling cannot collect enough samples for the rare groups. It demands a big sample to get a satisfactory estimation for a low-selectivity query or highly skewed data. Stratified sampling is designed to improve the accuracy, but it requires some foreknowledge such as the expected query workload and the distribution of the aggregated attribute. In addition, a stratified sample only benefits the given specific query [5]. Estimating queries over rare sub-populations with sampling-based AQP is still challenging [4]. Sample+Seek [1] proposed a new metric called 'distribution precision' to measure the accuracy of the visualization. It developed a measure-biased sampling method, which picks a row with probability proportional to its value on the measure attribute. CVOPT [8] is a recent stratified sampling method which allocates the sample size of each group according to the coefficients of variation, and the experimental results show that it achieves a lower maximum relative error compared with uniform sampling, congressional sampling [10], and the stratified sampling method proposed in reference [2]. Online aggregation [6, 7] is another way to do AQP. It iteratively accesses more samples to improve the estimation accuracy. Online aggregation draws samples at query time; thus, it suffers from the high latency caused by sampling from big data.

Progressive methods provide a way to iteratively improve the quality of estimations. IFOCUS [3] and INCVISAGE [18] incrementally improve the approximate estimations. IFOCUS [3] aims at guaranteeing the order of the estimations of different groups.
It keeps increasing the sample size until the confidence interval of each group does not overlap with the others. INCVISAGE [18] is designed to avoid the incorrect intermediate visualizations generated by incremental sampling-based methods such as SampleAction [19]. INCVISAGE is conservative at the beginning, and it keeps splitting one segment into two in order to refine the visualization. Different from SampleAction [19], it finally reveals the features of the eventual visualization.

Synopses such as histograms, sketches, and wavelets have also been used for AQP in the past, but most of them only store some summarizations, such as count, density, and average, for the sake of being lightweight. The synopsis-based AQP methods have been summarized in the surveys [20, 9]. A group-by query has a multi-dimensional result rather than a single value. Therefore, adopting synopses to estimate group-by queries would cost huge storage.

Making use of previous answers to improve the estimation of new queries is another typical idea. The AQP formulation in reference [21] answers new queries using the old results according to probability theory. It also builds a tail index for the rare subpopulations. Machine learning methods have been adopted in query processing in recent years [22, 23, 24, 25].
Database Learning [22] learns from past answers to improve new query answers. DeepDB [25] adopted the sum-product network to model the data probability distribution. EntropyDB [5] uses the principle of maximum entropy to generate a probabilistic representation of the data that can be used to give approximate query answers. Reference [24] proposed an AQP method using deep generative models to learn the data distribution and approximately answer queries by generating samples from the learned models. It reduces the latency, but it still suffers from sampling error. DBEst [23] is a machine-learning-based AQP method, which builds a probability density model and a regression model from samples of the data. It uses the models to answer queries directly. For group-by queries, it builds a model for each group, which increases the storage cost.
3. AQP for Group-By Queries based on Conditional Generative Model
In this section, we introduce the framework of sample generation for group-by queries based on a conditional generative model. The sample generation framework is introduced in Section 3.1. We label each tuple with its group-by attribute value, and feed the labeled data to the conditional generative model. After training, the generator of the model can generate samples, conditioned on the given group, that are similar to real samples of that group. In Section 3.2, we introduce methods to generate uniform samples and stratified samples based on the sample generation framework. We also provide methods for online aggregation based on the proposed sample generation framework.
In this part, we introduce the conditional sample generation framework for group-by queries. We attempt to generate stratified samples to improve the accuracy. There are many stratified sampling methods with different optimization goals, and they allocate the sample size of each group differently. We do not want to limit the sample generation framework to one kind of stratified sampling method, but rather to flexibly generate stratified samples according to different accuracy optimization goals, which increases the applicability of the sample generation framework. According to these requirements, we expect the sample generation model to generate any number of samples for any given group.

Fig 1 introduces the framework of generating samples for group-by queries based on a conditional generative model, CWGAN. CWGAN is the combination of the Conditional Generative Adversarial Nets [15] (CGAN) and the Wasserstein Generative Adversarial Nets [26] (WGAN). CGAN is an extension of the Generative Adversarial Nets (GAN), which adds conditional settings to both the generator and the discriminator of the GAN, and thus can generate data for a given condition. WGAN is a variation of GAN, which improves the stability of learning and gets rid of problems of GAN such as mode collapse. We adopt the CWGAN to generate samples for group-by queries in this framework.

The first step of this framework is to label the data. Since we want to enable the model to generate samples for the given group, we label the data with the group-by attribute values. We index the values and encode their indices with one-hot encoding, which is a type of vector representation widely used in many branches of computer science, especially machine learning and digital circuit design. The encoded group-by attribute values are regarded as the labels for the training data. We will use an example to illustrate the encoding later. The next step of this framework is to train the CWGAN with the labeled data.
It feeds each tuple x and its corresponding label y to the model, and trains both the generator G and the discriminator D with the labels. After the model is well tuned, the generator can generate, for a given label, a sample x′ similar to the real sample x. That is, the model can generate any number of samples for each targeted group. We use the following example to show how to encode the group-by attribute values.

Example 1.
Table 1 shows the way to encode the values of the group-by attribute 'gender'. The attribute has two distinct values, 'male' and 'female'. We index the values with '0' and '1', respectively. Since there are only two distinct values, the one-hot encoding adopts a two-digit register. 'male' and 'female' are finally encoded as '1 0' and '0 1'.

Figure 1: The framework of sample generation for group-by queries based on CWGAN. (The generator G maps a noise z and a group label y to a synthetic tuple x′; the discriminator D distinguishes x′ from the real data; the trained generator outputs samples S1, ..., Sn for groups G1, ..., Gn.)
Table 1: Example of encoding the conditional label

gender   index   one-hot encoding
male     0       1 0
female   1       0 1

After training the conditional generative model, we can use it to generate any number of samples for each group without accessing the data, meaning that sampling can be conducted independently of the data. We can flexibly allocate the sample size of each group to improve the estimation accuracy. We use the following example to show how to use the model to generate samples for a given group.
Example 2.
Suppose we require five samples for the group 'male'. The values of attribute 'gender' are encoded as shown in Table 1, and 'male' is encoded as '1 0'. Suppose the latent layer has two dimensions, and the noise z is generated from the normal distribution z ∼ N(0, 1). We feed the five two-dimensional noises and the group labels shown in Table 2 to the generator, and the output of the generator is the five samples for group 'male'.

Table 2: Input of Generator

noise z             group label
 0.0223  -0.7296    1 0
 1.3867   0.8051    1 0
-0.7237  -0.6542    1 0
 0.7410   0.5340    1 0
 1.2362   0.9117    1 0

Since the CWGAN-based sampling can generate samples for each group without accessing the data, it can easily be combined with various sampling methods. In this section, we introduce how to combine CWGAN with the uniform sampling method, a stratified sampling method CVOPT, and online aggregation. Generating uniform samples can restore the data distribution, and generating stratified samples can improve the estimation accuracy for group-by queries. Generating samples for online aggregation can iteratively increase the accuracy and allow the user to stop the processing when the estimation is accurate enough.
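The label construction of Examples 1 and 2 can be sketched in a few lines of NumPy. This is a minimal sketch: the trained generator itself is omitted, and the random seed and array shapes are illustrative assumptions.

```python
import numpy as np

def one_hot(index, num_classes):
    """Encode a group index as a one-hot vector, as in Table 1."""
    v = np.zeros(num_classes)
    v[index] = 1.0
    return v

# Index the distinct values of the group-by attribute 'gender'.
groups = {"male": 0, "female": 1}

# Build the generator input of Example 2: five 2-dimensional noise
# vectors z ~ N(0, 1), each paired with the label of group 'male'.
rng = np.random.default_rng(0)          # seed chosen for illustration
noise = rng.standard_normal((5, 2))
labels = np.tile(one_hot(groups["male"], len(groups)), (5, 1))

print(labels[0])      # [1. 0.]
print(noise.shape)    # (5, 2)
```

Feeding `noise` and `labels` to a trained generator would yield five synthetic tuples for the group 'male', as in Table 2.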
The proposed sample generation approach based on CWGAN can generate uniform samples by allocating the sample size of each group according to the frequency of each group. The sampling probability of each tuple is the same in uniform sampling. It means that the number of samples for each group corresponds to the frequency of this group in the whole dataset. Algorithm 1 introduces how to generate uniform samples for the data. The frequency of each group [COUNT_1, COUNT_2, ..., COUNT_g] is known in advance, and the algorithm allocates the SampleSize according to these frequencies. Random(z, SampleSize) means to get SampleSize random noises z (Line 3), and Repeat(One-hot(i), SampleSize) means to repeat the label of the i-th group SampleSize times (Line 4). It then feeds both the noises and the labels to the generator, and gets the samples S_i for the i-th group (Line 5).

Algorithm 1 CWGAN-based uniform sampling
Input: Generator G, TotalSampleSize m, number of groups g, frequencies of each group [COUNT_1, COUNT_2, ..., COUNT_g]
Output: Samples for each group S_1, S_2, ..., S_g
1: for i in 1 to g do
2:   SampleSize = m · COUNT_i / (Σ_{j=1}^{g} COUNT_j)
3:   noise ← Random(z, SampleSize)
4:   label ← Repeat(One-hot(i), SampleSize)
5:   S_i ← G(noise, label)
6: end for
Return: S_1, S_2, ..., S_g

Under the premise that the frequency of each group is known in advance, the generative model can generate uniform samples to restore the distribution of the data.
The proposed sample generation framework based on CWGAN can generate stratified samples by allocating the sample size of each group according to the standard deviation and mean of each group. In this part, we combine the CWGAN with a stratified sampling method, CVOPT, which allocates the samples according to the coefficient of variation of each group. The coefficient of variation (CV) is defined as CV[X] = standard deviation[X] / mean[X]. CVOPT is proved to construct a stratified sample optimizing the ℓ2 norm of the CVs of the different answers. Algorithm 2 introduces the way of combining CWGAN with CVOPT to generate stratified samples. The difference of this algorithm from Algorithm 1 is the SampleSize allocation for each group (Line 2).
Algorithm 2 CWGAN-based stratified sampling
Input: Generator G, TotalSampleSize m, number of groups g, means of each group [µ_1, µ_2, ..., µ_g], standard deviations of each group [σ_1, σ_2, ..., σ_g]
Output: Samples for each group S_1, S_2, ..., S_g
1: for i in 1 to g do
2:   SampleSize = m · (σ_i/µ_i) / (Σ_{j=1}^{g} σ_j/µ_j)
3:   noise ← Random(z, SampleSize)
4:   label ← Repeat(One-hot(i), SampleSize)
5:   S_i ← G(noise, label)
6: end for
Return: S_1, S_2, ..., S_g

Since CVOPT is reported to reduce the maximum relative error compared with uniform sampling, the samples generated according to the allocation of CVOPT will also increase the accuracy, on the assumption that the generative model is reliable. This algorithm requires statistics including the standard deviation and the mean of each group, and these statistics can easily be computed in one pass over the data in advance.
Since the generative model can generate any number of samples for a given group, we can apply the generated samples to online aggregation to iteratively improve the accuracy in an interactive way. Online aggregation keeps improving the accuracy and computing running confidence intervals by increasing the sample size. It allows the user to stop the processing of each group when the estimation is accurate enough. Algorithm 3 introduces the way of combining CWGAN with online aggregation. It adopts the width of the confidence interval as the termination condition: once the confidence interval of a group is narrower than the object width, the algorithm stops generating samples for that group. In this algorithm, ActiveGroup is used to store the indices of the groups requiring more samples. This algorithm adopts Round-Robin to iteratively generate one sample for each group, and it is convenient to observe the accuracy of each group after generating the same number of samples. Since the model easily generates samples for a given group, we can use it to generate samples for one group until its termination condition is reached.
Algorithm 3 CWGAN-based Online Aggregation
Input: Generator G, number of groups g, object confidence intervals [I_1, I_2, ..., I_g]
Output: Samples for each group S_1, S_2, ..., S_g
1: ActiveGroup ← [1, 2, ..., g]
2: Initialize S_1, S_2, ..., S_g to be empty sets.
3: while ActiveGroup ≠ ∅ do
4:   for i in ActiveGroup do
5:     noise ← Random(z, 1)
6:     label ← Repeat(One-hot(i), 1)
7:     s ← G(noise, label)
8:     S_i ← S_i ∪ s
9:     CurrentI_i ← confidence interval of S_i
10:    if CurrentI_i < I_i then
11:      ActiveGroup ← ActiveGroup \ i
12:    end if
13:  end for
14: end while
Return: S_1, S_2, ..., S_g

Our conditional sample generation framework can iteratively improve the accuracy by generating more samples, like online aggregation, but the sample generation does not require random I/O accesses, and thus largely reduces the latency at query time. The proposed framework can also be combined with other online sampling methods such as IFOCUS, which keeps increasing the sample size of the groups whose confidence intervals overlap with other groups, until the confidence interval of each group does not overlap with the others. The algorithm of CWGAN-based IFOCUS can be obtained by modifying the termination condition of each group in Algorithm 3 to the non-overlap of the confidence intervals.
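Algorithm 3's round-robin loop can be sketched as follows. The normal-approximation confidence interval, the batch size (the paper generates one sample per group per round), and the `stub` generator standing in for G are all illustrative assumptions.

```python
import numpy as np

def ci_half_width(samples, z=1.96):
    """Half-width of a 95% normal-approximation confidence interval
    for the mean of the generated samples."""
    s = np.asarray(samples, dtype=float)
    return z * s.std(ddof=1) / np.sqrt(len(s))

def online_aggregation(generator, g, targets, batch=10, max_rounds=1000):
    """Round-robin over the active groups, as in Algorithm 3: keep
    generating samples for a group until its confidence interval is
    narrower than the object width targets[i]."""
    samples = [[] for _ in range(g)]
    active = set(range(g))
    rounds = 0
    while active and rounds < max_rounds:
        rounds += 1
        for i in list(active):
            samples[i].extend(generator(i, batch))
            if len(samples[i]) > 1 and ci_half_width(samples[i]) < targets[i]:
                active.discard(i)      # this group is accurate enough
    return samples

# Stub generator standing in for G: group i's samples follow N(i, 1).
rng = np.random.default_rng(0)
stub = lambda i, n: list(rng.normal(i, 1.0, n))
result = online_aggregation(stub, g=2, targets=[0.2, 0.2])
print([len(s) for s in result])
```

With a sample standard deviation near 1, each group needs roughly (1.96/0.2)² ≈ 96 samples before its interval closes, which is why the loop runs for several rounds.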
4. CWGAN-based Sampling for Group-By Queries with Predicates
In this section, we propose a method to generate samples for group-by queries with predicates. The group-by query with a predicate has the following form:
SELECT A, AGGR(B) FROM D WHERE Condition GROUP BY A;

The 'Condition' in the predicate has a form like "l ≤ C ≤ r". This kind of query can be estimated in two ways. The first is to estimate a query based on the samples chosen from the pre-computed samples according to the predicate. But the estimations based on the chosen samples are not satisfactory for low-selectivity queries, since only a small part of the samples satisfies the predicate and contributes to the estimations. Another way is to separately choose samples for each query. The samples can be chosen from the data filtered by the query predicate. This method undoubtedly improves accuracy, but it requires sampling from the entire dataset several times, and query-time sampling causes unacceptably long latency. Even though the conditional sample generation framework can generate samples without accessing the data, we do not want to waste time generating samples that make no contribution to the estimation. We attempt to enable the conditional generative model to generate samples for the given query predicate. However, the conditional generative model cannot be simply adopted to solve this problem. The reason is that we cannot directly set a range label for each tuple like setting the group labels, since each tuple belongs to all the ranges covering it.

In order to give each tuple a clear range label, we consider bucketing the data range through a histogram. A histogram divides the data into several disjoint buckets. Each bucket contains a sub-range of the data, and each tuple corresponds to only one bucket. We add a bucket label to each tuple according to the index of the bucket it falls in. Thus, we can train the model by feeding the data with both group labels and bucket labels to the conditional generative model. Finally, the generator can generate samples for a given bucket and a given group. The framework of sample generation for group-by queries with predicates is shown in Fig 2.
It differs from the framework proposed in the previous section in that it uses two kinds of labels to train the conditional generative model. We also adopt the CWGAN as the conditional generative model. The input of the generator includes the random noises, bucket labels, and group labels. In this framework, we partition the data with a histogram. We do not limit which kind of histogram to use. Different histogram schemes, such as the equi-width histogram, equi-depth histogram, and V-optimal histogram, can be applied in this framework. We prefer the equi-depth histogram, which has the same number of tuples in each bucket. It benefits both the model training and the sample quality. On the one hand, it balances the quantity of training data corresponding to each bucket. On the other hand, it limits the number of generated samples that do not satisfy the query predicate.

Algorithm 4 introduces how to generate samples for a given query. In this algorithm, the model generates an equal number of samples for each bucket that intersects the query predicate. It adopts the CVOPT sampling method to allocate the sample size for each group. The mean and standard deviation of the aggregated attribute in each group should be known in advance. The algorithm first finds the buckets intersecting the query predicate (Lines 1-6). It allocates an equal number of samples to each bucket (Line 8). The SampleSize for each group is allocated according to CVOPT (Line 10). The algorithm then obtains the random noise (Line 11), the
BucketLabel (Line 12), and the GroupLabel (Line 13). The generator G finally generates the samples S_{i,j} for the i-th bucket and the j-th group.

Figure 2: The framework of sample generation for group-by queries with predicates. (A histogram assigns each tuple a bucket label; the CWGAN is trained with both the bucket labels and the group labels, and the generator produces samples for a query bucket B_i and group G_j.)

Algorithm 4 can generate targeted samples for the buckets intersecting the query predicate. Though the generated samples still need to be filtered by the predicate, this method greatly reduces the proportion of samples that do not contribute to the estimation accuracy.
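The equi-depth bucketing and the bucket-selection step of Algorithm 4 (Lines 1-6) can be sketched as follows; computing bucket boundaries from quantiles is one common way to build an equi-depth histogram, and the toy column is illustrative.

```python
import numpy as np

def equi_depth_edges(values, n_buckets):
    """Bucket boundaries such that each bucket holds (roughly) the same
    number of tuples, via quantiles of the bucketing attribute."""
    qs = np.linspace(0.0, 1.0, n_buckets + 1)
    return np.quantile(np.asarray(values, dtype=float), qs)

def bucket_labels(values, edges):
    """Assign each tuple the index of its single containing bucket."""
    idx = np.searchsorted(edges, values, side="right") - 1
    return np.clip(idx, 0, len(edges) - 2)

def intersecting_buckets(edges, lo, hi):
    """Indices of buckets overlapping the predicate lo <= C <= hi
    (Lines 1-6 of Algorithm 4)."""
    return [i for i in range(len(edges) - 1)
            if edges[i] <= hi and edges[i + 1] >= lo]

vals = np.arange(100)                 # uniform toy column
edges = equi_depth_edges(vals, 10)    # 10 equi-depth buckets
print(intersecting_buckets(edges, 25, 44))  # [2, 3, 4]
```

Only the buckets returned by `intersecting_buckets` receive a sample budget, which is how the algorithm avoids generating samples that the predicate would discard.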
5. Experimental Results
All the experiments were conducted on a laptop with an Intel Core i5 CPU with a 2.60GHz clock frequency and 8GB of RAM. We use Keras, a Python deep learning library running on top of TensorFlow, to build the conditional generative models.
Real dataset: The ROAD dataset was constructed by adding elevation information to a 2D road network in North Jutland, Denmark. The dataset contains
(http://archive.ics.uci.edu/ml/datasets/3D+Road+Network+(North+Jutland,+Denmark))

Algorithm 4 Generate samples for group-by queries with predicates
Input: Generator G, number of groups g, TotalSampleSize m, query Q, buckets B_1, B_2, ..., B_n, µ_{i,j} and σ_{i,j} for each bucket B_i and group j
Output: Samples S_{i,j}
1: RB ← ∅
2: for i in 1 to n do
3:   if B_i intersects with Q then
4:     RB ← RB ∪ {i}
5:   end if
6: end for
7: for i in RB do
8:   BSampleSize ← m / |RB|
9:   for j in 1 to g do
10:    SampleSize = BSampleSize · (σ_{i,j}/µ_{i,j}) / (Σ_{k=1}^{g} σ_{i,k}/µ_{i,k})
11:    noise ← Random(z, SampleSize)
12:    BucketLabel ← Repeat(One-hot(i), SampleSize)
13:    GroupLabel ← Repeat(One-hot(j), SampleSize)
14:    S_{i,j} ← G(noise, BucketLabel, GroupLabel)
15:  end for
16: end for
Return: Samples S_{i,j}, i ∈ RB, j = 1, 2, ..., g
{1, 2, 3, 4}. The attribute 'z' will be adopted as the aggregation attribute in the experiments. The values of 'z' of the tuples in the four distributions are chosen uniformly from the ranges [0, 10000], [0, 1000], [0, 100], and [0, 10], respectively. The numbers of tuples in these four distributions are 10M, 20M, 30M, and 40M. That is, the variance of the smaller group is higher.

We adopt two metrics to measure the accuracy of the query estimation. In the following equations, g, est_i, and true_i denote the number of groups, the estimated aggregate of the i-th group, and the true aggregate of the i-th group, respectively.

Average Relative Error:
ARE = (1 / |{true_i | true_i > 0, i ∈ [0, g]}|) · Σ_{i ∈ [0,g], true_i > 0} |est_i − true_i| / true_i    (1)

Mean Squared Error:
MSE = (1/g) · Σ_{i ∈ [0,g]} (est_i − true_i)^2    (2)

5.1.4. Implementations

We adopt the CWGAN as the conditional generative model for our experiments. CWGAN is the combination of the WGAN and CGAN. The generator of CWGAN has three fully connected hidden layers, and the numbers of neurons in the layers are 128, 64, and 32. The discriminator of CWGAN also has three fully connected hidden layers with 128, 64, and 32 neurons. For the group-by queries without predicates, the inputs of the generator are the random noises and the group labels. For the group-by queries with predicates, the inputs of the generator include the random noises, the group labels, and the bucket labels. We set the activation function to ReLU [27], which is a widely used activation function and yields better results compared to Sigmoid and Tanh. We set the optimizer to RMSprop [28] according to the recommendation of WGAN, with parameters (lr = 0.0001, rho = 0.95). The clip range for WGAN is set to [-0.1, 0.1]. The settings of the other parameters follow the recommendations in WGAN [26]. We adopt just a simple model with a few layers, and the settings mostly follow the WGAN.
More layers and more careful tuning may result in a better model, but they would also increase the difficulty of training and the size of the model. In practice, the structure and settings of the model should be adjusted according to the dataset. In this experiment, we are not committed to finding the optimal model, but to testing the idea of adopting a conditional generative model to approximate group-by queries. The following experimental results verify that this simple model is able to generate good enough samples.

In this experiment, we visualize the distributions of the random samples and the generated samples. This experiment is conducted on the ROAD and GAUSS datasets. We choose 1,000 random samples from each dataset, and generate 1,000 samples based on the generative model for each dataset. In order to restore the original data distribution, the number of samples generated in each group is allocated according to the number of tuples in that group of the whole dataset. That is, we adopt Algorithm 1 to generate uniform samples for each dataset.

Fig 3 and Fig 4 show the distributions of the random samples and the generated samples in the first two attributes of the ROAD and GAUSS datasets. As shown in these figures, the distributions of the generated samples and the real samples are visually similar in the first two attributes.

Figure 3: Random Sample & Generated Sample (ROAD dataset); panels: (a) Random Sample, (b) Generated Sample
In this experiment, we compare the accuracies of approximate query results based on the random samples and the generated samples. This experiment is conducted on the ROAD dataset. We test the accuracy of the estimations based on 1,000 random samples and 1,000 generated samples.

Figure 4: Random Sample & Generated Sample (GAUSS dataset); panels: (a) Random Sample, (b) Generated Sample
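The accuracy comparisons in these experiments use the ARE and MSE metrics defined in Equations (1) and (2); they can be computed directly (a minimal sketch, with plain Python lists standing in for the per-group aggregates):

```python
def average_relative_error(est, true):
    """ARE: mean relative error over groups whose true aggregate is positive."""
    pairs = [(e, t) for e, t in zip(est, true) if t > 0]
    return sum(abs(e - t) / t for e, t in pairs) / len(pairs)

def mean_squared_error(est, true):
    """MSE: mean squared error over all g groups."""
    return sum((e - t) ** 2 for e, t in zip(est, true)) / len(est)
```

For example, with estimates [9, 22] against true values [10, 20], the ARE is 0.1 and the MSE is 2.5.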
We partition the dataset into 10 equi-depth buckets according to the first attribute, meaning that the number of tuples in each bucket is the same. We generate 100 samples for each bucket; thus, we have 1,000 samples in total for the 10 buckets. The number of samples for each group inside a bucket is allocated according to the CVOPT sampling. That is, we adopt Algorithm 2 to generate stratified samples for each bucket. The query for the i-th bucket is as follows:

SELECT AVG(Elevation), GroupID FROM Bucket_i GROUP BY GroupID;
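Equi-depth partitioning can be obtained by cutting the sorted bucketing attribute at equally spaced ranks. A minimal sketch, assuming the attribute values fit in memory (the function name is illustrative, not the paper's implementation):

```python
def equi_depth_buckets(values, n_buckets):
    """Partition values into n_buckets of (near-)equal tuple counts.

    Returns a list of sorted value lists; sizes differ by at most one.
    """
    ordered = sorted(values)
    n = len(ordered)
    buckets = []
    for i in range(n_buckets):
        lo = i * n // n_buckets          # rank where bucket i starts
        hi = (i + 1) * n // n_buckets    # rank where bucket i ends
        buckets.append(ordered[lo:hi])
    return buckets
```

The bucket boundaries (min and max of each slice) are what the framework stores as the ranges B_1, ..., B_n used for predicate matching.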
Fig 5 shows the estimations and 95% confidence intervals based on the random samples and the generated samples. The red line shows the true aggregation results for the 10 groups numbered 0, 1, ..., 9, and the estimations are shown as blue dotted lines with error bars. We can learn from the figure that the confidence intervals of the estimations based on the random samples vary greatly across groups due to the skew of the data, whereas the accuracies of the estimations based on the generated samples are more stable across groups. The reason is that we combine the generative model with the CVOPT sampling to generate stratified samples for each group. The experimental results show that the conditional generative model can easily be combined with the stratified sampling method to increase the accuracy. In addition, generating samples from the model has an advantage over stratified sampling: stratified sampling requires accessing the entire dataset, while sample generation does not.
Figure 5: Estimations of group-by queries based on different samples; panels: (a) Random Sample, (b) Generated Sample

In this experiment, we show that the generative model can be applied to online aggregation, which continuously narrows down the confidence interval and increases the accuracy by accessing more samples. These experiments are conducted on the GAUSS dataset. For each group, we set the sample sizes from 1,000 to 10,000 to test the accuracy of samples of different sizes. Fig 6 shows the confidence intervals of the estimations based on samples of different sizes. It is clear that the confidence interval narrows down as the sample size increases. Fig 6(b) measures the accuracy by the ratio of the confidence-interval width to the mean, which bounds the relative error of the estimations. This accuracy also increases with the sample size. The results indicate that the generated samples can be adopted for online aggregation as well: generating more samples narrows down the confidence interval and increases the accuracy.

Figure 6: Online aggregation based on the generative model, for Group1-Group4; panels: (a) impact of sample size on the confidence-interval width, (b) impact of sample size on CI width / mean
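The narrowing behaviour follows from the usual normal approximation: the interval's half-width scales as 1/√n, so quadrupling the sample size halves the interval. A small sketch of the two measures (the z-value and function names are illustrative):

```python
import math

def ci_width(std, n, z=1.96):
    """Full width of the 95% confidence interval for a mean estimated from n samples."""
    return 2 * z * std / math.sqrt(n)

def relative_ci(std, mean, n, z=1.96):
    """CI width divided by the mean: a bound on the relative error of the estimate."""
    return ci_width(std, n, z) / mean
```

For a fixed per-group standard deviation, `ci_width(std, 4000)` is half of `ci_width(std, 1000)`, which is the 1/√n shrinkage visible as the sample size grows.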
In this experiment, we compare the accuracy of different methods and test the impact of selectivity on the accuracy. We compare the accuracy of the estimations based on the generated samples, the random samples, and DBEst. DBEst is a state-of-the-art ML-model-based AQP method, which builds probability density models according to samples from the data. This experiment is conducted on the ROAD dataset. We test the accuracies of 100 queries whose selectivities are below 0.05. The queries are in the following form:
SELECT AVG(Elevation), GroupID FROM ROAD WHERE a ≤ Longitude ≤ b GROUP BY GroupID;

We choose 1,000 random samples from the dataset. We adopt Algorithm 4 to generate 1,000 samples for each query according to its predicate. We randomly choose 1k, 10k, and 100k samples to build DBEst models, respectively. The result is shown in Fig 7. For both the random samples and the generated samples, the estimation error decreases as the query selectivity increases, and the estimations based on the generated samples are more accurate. The reason is that the generated samples contain a higher proportion of samples satisfying the query predicates, compared with the random samples chosen from the entire dataset. The queries in this experiment are low-selectivity queries, meaning that most of the samples chosen from the entire dataset do not satisfy the query predicates. Only a small part of the samples contributes to the estimation, which leads to low accuracy. Since the generative model can flexibly generate samples for sub-ranges, it can obtain more samples that contribute to the estimation.

Figure 7: The impact of selectivity on the accuracy, comparing RandomSample, GenSample, and DBEst (1k, 10k, 100k); panels: (a) MSE, (b) ARE
Since more samples lead to more accurate estimations, the generated samples can bring higher accuracy for a low-selectivity query. The accuracy of DBEst has not flattened out with the increased selectivity, which may be caused by the inaccurate probability density models built from the samples.

In this experiment, we compare the efficiency of generating samples and choosing random samples from the dataset. This experiment is conducted on the GAUSS dataset. We compare the time of obtaining 1,000 samples from datasets of different sizes, starting from 1M tuples.

Figure 8: The impact of data size on the sampling time (RandomSample vs. GenSample)

Figure 9: The space cost of the models in GenSample and DBEst (1k, 10k, 100k)
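The low-selectivity argument can be made concrete: with uniform samples, only about selectivity · n samples satisfy the predicate, while conditionally generated samples all do, so the effective sample size of the uniform approach collapses. A small simulation sketch (the uniform [0, 1) domain and seed are illustrative assumptions):

```python
import random

def effective_samples(n, selectivity, seed=0):
    """Count how many of n uniform samples fall inside a predicate range
    covering a `selectivity` fraction of the domain [0, 1)."""
    rng = random.Random(seed)
    return sum(rng.random() < selectivity for _ in range(n))
```

At selectivity 0.05, roughly 50 of 1,000 uniform samples contribute to the estimate, whereas all 1,000 generated samples do; since the standard error scales as 1/√n, this is the accuracy gap observed in Fig 7.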
In this experiment, we compare the space cost of the generative model with that of DBEst. We test the space cost of the DBEst models built from 1k, 10k, and 100k samples. The experimental result is shown in Fig 9. The space cost of the generative model, built according to the implementation introduced in Section 5.1.4, is 77KB. The figure shows that the space cost of the generative model is less than that of DBEst. The reason is that DBEst separately builds a model for each group, while different groups share the same model in our method. Thus, the conditional generative model in our framework is more compact than the DBEst models.

Conclusions
In this paper, we proposed a generative-model-based approximate query processing framework for group-by queries. Samples in each group can be generated from the conditional generative model without accessing the raw data, which avoids the latency caused by sampling from a large amount of data. The proposed framework can collaborate with uniform sampling to restore the data distribution, and with stratified sampling and online aggregation to improve the accuracy. In this work, we separate the data into buckets to build the conditional generative model, which can generate targeted samples for each small bucket. In future work, we will try to make the conditional generative model directly generate samples for a given range.
References

[1] B. Ding, S. Huang, S. Chaudhuri, K. Chakrabarti, C. Wang, Sample + Seek: Approximating aggregates with distribution precision guarantee, in: Proceedings of the 2016 ACM SIGMOD International Conference on Management of Data, San Francisco, CA, USA, 2016, pp. 679-694. doi:10.1145/2882903.2915249.

[2] P. Rösch, W. Lehner, Sample synopses for approximate answering of group-by queries, in: EDBT 2009, 12th International Conference on Extending Database Technology, Saint Petersburg, Russia, 2009, pp. 403-414. doi:10.1145/1516360.1516408.

[3] A. Kim, E. Blais, A. G. Parameswaran, P. Indyk, S. Madden, R. Rubinfeld, Rapid sampling for visualizations with ordering guarantees, PVLDB 8 (5) (2015) 521-532. doi:10.14778/2735479.2735485.

[4] S. Chaudhuri, B. Ding, S. Kandula, Approximate query processing: No silver bullet, in: Proceedings of the 2017 ACM SIGMOD International Conference on Management of Data, Chicago, IL, USA, 2017, pp. 511-519. doi:10.1145/3035918.3056097.

[5] L. J. Orr, M. Balazinska, D. Suciu, EntropyDB: a probabilistic approach to approximate query processing, VLDB J. 29 (1) (2020) 539-567. doi:10.1007/s00778-019-00582-9.

[6] J. M. Hellerstein, P. J. Haas, H. J. Wang, Online aggregation, in: Proceedings of the 1997 ACM SIGMOD International Conference on Management of Data, Tucson, Arizona, USA, 1997, pp. 171-182. doi:10.1145/253260.253291.

[7] F. Li, B. Wu, K. Yi, Z. Zhao, Wander Join: Online aggregation via random walks, in: Proceedings of the 2016 ACM SIGMOD International Conference on Management of Data, San Francisco, CA, USA, 2016, pp. 615-629. doi:10.1145/2882903.2915235.

[8] T. D. Nguyen, M. Shih, S. S. Parvathaneni, B. Xu, D. Srivastava, S. Tirthapura, Random sampling for group-by queries, in: 36th IEEE International Conference on Data Engineering, ICDE 2020, Dallas, TX, USA, 2020, pp. 541-552. doi:10.1109/ICDE48307.2020.00053.

[9] K. Li, G. Li, Approximate query processing: What is new and where to go?, Data Science and Engineering 3 (4) (2018). doi:10.1007/s41019-018-0074-4.

[10] S. Acharya, P. B. Gibbons, V. Poosala, Congressional samples for approximate answering of group-by queries, in: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, Dallas, Texas, USA, 2000, pp. 487-498. doi:10.1145/342009.335450.

[11] V. Ganti, M. Lee, R. Ramakrishnan, ICICLES: self-tuning samples for approximate query answering, in: VLDB 2000, Proceedings of the 26th International Conference on Very Large Data Bases, Cairo, Egypt, 2000, pp. 176-187.

[12] S. Chaudhuri, G. Das, V. R. Narasayya, Optimized stratified sampling for approximate query processing, ACM Trans. Database Syst. 32 (2) (2007) 9. doi:10.1145/1242524.1242526.

[13] L. Sidirourgos, M. L. Kersten, P. A. Boncz, SciBORQ: Scientific data management with bounds on runtime and quality, in: CIDR 2011, Fifth Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, 2011, pp. 296-301.

[14] S. Agarwal, B. Mozafari, A. Panda, H. Milner, S. Madden, I. Stoica, BlinkDB: queries with bounded errors and bounded response times on very large data, in: Eighth EuroSys Conference 2013, EuroSys '13, Prague, Czech Republic, 2013, pp. 29-42. doi:10.1145/2465351.2465355.

[15] M. Mirza, S. Osindero, Conditional generative adversarial nets, CoRR abs/1411.1784 (2014). arXiv:1411.1784.

[16] S. Qin, T. Jiang, Improved Wasserstein conditional generative adversarial network speech enhancement, EURASIP J. Wirel. Commun. Netw. 2018 (2018) 181. doi:10.1186/s13638-018-1196-0.

[17] Y. Luo, EEG data augmentation for emotion recognition using a conditional Wasserstein GAN, in: 40th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, EMBC 2018, Honolulu, HI, USA, 2018, pp. 2535-2538. doi:10.1109/EMBC.2018.8512865.

[18] S. Rahman, M. Aliakbarpour, H. Kong, E. Blais, K. Karahalios, A. G. Parameswaran, R. Rubinfeld, I've seen "enough": Incrementally improving visualizations to support rapid decision making, PVLDB 10 (11) (2017) 1262-1273. doi:10.14778/3137628.3137637.

[19] D. Fisher, I. O. Popov, S. M. Drucker, m. c. schraefel, Trust me, I'm partially right: incremental visualization lets analysts explore large datasets faster, in: CHI Conference on Human Factors in Computing Systems, CHI '12, Austin, TX, USA, 2012, pp. 1673-1682. doi:10.1145/2207676.2208294.

[20] G. Cormode, M. N. Garofalakis, P. J. Haas, C. Jermaine, Synopses for massive data: Samples, histograms, wavelets, sketches, Foundations and Trends in Databases 4 (1-3) (2012) 1-294. doi:10.1561/1900000004.

[21] A. Galakatos, A. Crotty, E. Zgraggen, C. Binnig, T. Kraska, Revisiting reuse for approximate query processing, PVLDB 10 (10) (2017) 1142-1153. doi:10.14778/3115404.3115418.

[22] Y. Park, A. S. Tajik, M. J. Cafarella, B. Mozafari, Database learning: Toward a database that becomes smarter every time, in: Proceedings of the 2017 ACM SIGMOD International Conference on Management of Data, Chicago, IL, USA, 2017, pp. 587-602. doi:10.1145/3035918.3064013.

[23] Q. Ma, P. Triantafillou, DBEst: Revisiting approximate query processing engines with machine learning models, in: Proceedings of the 2019 ACM SIGMOD International Conference on Management of Data, Amsterdam, The Netherlands, 2019, pp. 1553-1570. doi:10.1145/3299869.3324958.

[24] S. Thirumuruganathan, S. Hasan, N. Koudas, G. Das, Approximate query processing for data exploration using deep generative models, in: 36th IEEE International Conference on Data Engineering, ICDE 2020, Dallas, TX, USA, 2020, pp. 1309-1320. doi:10.1109/ICDE48307.2020.00117.

[25] B. Hilprecht, A. Schmidt, M. Kulessa, A. Molina, K. Kersting, C. Binnig, DeepDB: Learn from data, not from queries!, Proc. VLDB Endow. 13 (7) (2020) 992-1005.

[26] M. Arjovsky, S. Chintala, L. Bottou, Wasserstein GAN, CoRR abs/1701.07875 (2017). arXiv:1701.07875.

[27] V. Nair, G. E. Hinton, Rectified linear units improve restricted Boltzmann machines, in: Proceedings of the 27th International Conference on Machine Learning (ICML-10), Haifa, Israel, 2010, pp. 807-814.