BayesCard: Revitalizing Bayesian Networks for Cardinality Estimation

Ziniu Wu, Amir Shaikhha, Rong Zhu, Kai Zeng, Yuxing Han, Jingren Zhou
Alibaba Group, University of Edinburgh
{ziniu.wzn, red.zr, zengkai.zk, yuxing.hyx, jingren.zhou}@alibaba-inc.com, {amir.shaikhha}@ed.ac.uk

ABSTRACT
Cardinality estimation (CardEst) is an essential component in query optimizers and a fundamental problem in DBMS. A desired CardEst method should attain good algorithm performance, be stable to varied data settings, and be friendly to system deployment. However, no existing CardEst method can fulfill all three criteria at the same time. Traditional methods often have significant algorithm drawbacks such as large estimation errors. Recently proposed deep learning based methods largely improve the estimation accuracy, but their performance can be greatly affected by data, and they are often difficult to deploy in systems.

In this paper, we revitalize Bayesian networks (BNs) for CardEst by incorporating techniques from probabilistic programming languages. We present BayesCard, the first framework that inherits the advantages of BNs, i.e., high estimation accuracy and interpretability, while overcoming their drawbacks, i.e., low structure learning and inference efficiency. This makes BayesCard a perfect candidate for commercial DBMS deployment. Our experimental results on several single-table and multi-table benchmarks indicate BayesCard's superiority over existing state-of-the-art CardEst methods: BayesCard achieves comparable or better accuracy, 1-2 orders of magnitude faster inference time, 1-3 orders faster training time, 1-3 orders smaller model size, and 1-2 orders faster updates. Meanwhile, BayesCard keeps stable performance when varying data with different settings. We also deploy BayesCard into PostgreSQL. On the IMDB benchmark workload, it improves the end-to-end query time by 13.3%.

1 INTRODUCTION

Cardinality estimation (CardEst), which aims at predicting the result size of a SQL query without its actual execution, is a longstanding and fundamental problem in DBMS. It is the core component of query optimizers [26, 28, 42] to produce high-quality query plans. Although a variety of CardEst methods have been proposed in the last several decades, it remains a notoriously challenging problem in the DB community.
Status and challenges of CardEst. Given a table T on attributes {T_1, ..., T_n} and a query Q, CardEst is equivalent to estimating the probability of the tuples in T satisfying Q. Therefore, the core problem of CardEst is how to model the distribution of T to estimate the probability of Q. Based on existing work [49], we believe that an applicable CardEst method should satisfy criteria from three aspects, namely A (Algorithm), D (Data) and S (System). (A): the CardEst algorithm itself should have high estimation accuracy, fast inference (and training) time, lightweight storage cost, and an efficient updating process, in order to generate high-quality query plans [36, 58]. (D): the CardEst method should maintain stable performance for different data with varied distribution, attribute correlation, domain size, and number of attributes. (S): the CardEst method should be friendly for system deployment, with an interpretable model, predictable behaviors, reproducible results, and easy debugging [49].

Table 1: Status of CardEst methods according to the ADS criteria.

          |            Algorithm                          |              Data                     |            System
Method    | Accuracy Latency Training ModelSize Updating  | Distribution Correlation Domain Scale | Debug Interpret Predict Reproduce
Histogram |    −        ✓       ✓        ✓         ✓      |      ✓            −         ✓      −  |   ✓       ✓        ✓        ✓
Sampling  |    −        −       ✓        ✓         ✓      |      −            ✓         −      −  |   ✓       −        ✓        −
Naru      |    ✓        −       ✓        ✓         −      |      −            ✓         −      −  |   −       −        −        −
DeepDB    |    ✓        ✓       ✓        ✓         −      |      ✓            −         ✓      −  |   −       −        ✓        ✓
FLAT      |    ✓        ✓       ✓        ✓         −      |      ✓            ✓         ✓      −  |   −       −        ✓        ✓
MSCN      |    −        ✓       −        ✓         −      |      ✓            ✓         ✓      −  |   −       −        −        ✓
BN        |    ✓        −       −        ✓         ✓      |      ✓            ✓         ✓      ✓  |   ✓       ✓        ✓        ✓
BayesCard |    ✓        ✓       ✓        ✓         ✓      |      ✓            ✓         ✓      ✓  |   ✓       ✓        ✓        ✓

The simplest CardEst method assumes that all attributes are mutually independent and builds a histogram on each T_i. Its estimation latency is low, but the error is high since correlations between attributes are ignored. Another class of methods samples tuples from T for CardEst. They can be inaccurate on high-dimensional data or queries with small cardinality. These traditional CardEst methods have significant algorithm drawbacks and unstable performance w.r.t. varied data, but are friendly for system deployment.

Recently, numerous works attempt to utilize machine learning (ML), especially deep learning (DL), techniques for CardEst. They either build supervised models mapping a featurized query Q to its cardinality [1, 21], or learn unsupervised models of P_T, the joint probability distribution of table T, to support computing the probability of any query Q on T [20, 53, 58]. DL-based CardEst methods greatly improve the estimation accuracy but often sacrifice other algorithm aspects. More importantly, their performance can be greatly affected by the data, and they are often difficult to deploy in systems, because of, e.g., hyper-parameter tuning and the "black-box" property.

Table 1 summarizes the status of existing CardEst methods according to the ADS criteria. We can clearly see that no existing solution satisfactorily addresses this problem.
Our motivation. Recently, Bayesian networks (BNs), a classical model family, have re-attracted much attention in the ML community as a way to overcome the drawbacks of deep models [25, 54, 57], and they are naturally suitable for CardEst [16, 18, 48]. In comparison with other methods, BNs have significant advantages in terms of the ADS criteria. First, from the algorithm perspective, BNs are very compact and easy to update. Second, BNs reflect the intrinsic causal relations between attributes, which are robust to data changes; thus, they tend to maintain stable performance as the data varies in correlation, distribution, etc. Third, BNs are interpretable, and easy to predict, maintain, validate, and improve with expert knowledge, and thus friendly for system deployment. These attractive models were proposed decades ago [16, 48], but the NP-hard model construction process and intractable probability inference of BNs make them impractical for DBMS. In summary, as long as we can overcome the inefficiency of model construction and probability inference of BNs, we can obtain a desirable method for CardEst satisfying the ADS criteria simultaneously.
Our contributions. In this paper, we try to resolve the CardEst challenges by revitalizing BNs with new equipment. We propose BayesCard, a unified Bayesian framework for CardEst. The key idea of BayesCard is to build an ensemble of BNs to model the distributions of tables in a database, and to use the constructed BNs to estimate the cardinality of any query. BayesCard incorporates recent advances in probabilistic programming languages (PPLs) for building BNs [2, 3, 33, 38, 40, 46]. PPLs allow for a declarative specification of probabilistic models, within which each variable is defined as a probability distribution influenced by others. Based on PPLs, we can easily define BNs that support various structure learning, parameter learning, and inference algorithms. Therefore, BayesCard provides a user-friendly framework for building different BNs suitable for various data and system settings.

The key techniques of BayesCard overcome the deficiencies of existing BNs. First, based on PPLs, BayesCard designs the progressive sampling and compiled variable elimination probability inference algorithms, which significantly accelerate the traditional BN inference process. Moreover, BayesCard adapts its inference algorithms to efficiently handle multi-table join queries. Second, BayesCard designs an efficient model construction algorithm for building an ensemble of BNs. Furthermore, using PPLs, BayesCard can pre-specify constraints on the learned BN structure with prior knowledge to speed up the structure learning process. An accurate and lightweight BN structure can thus be obtained efficiently.

By our benchmark evaluation results, in comparison with DL-based CardEst methods, BayesCard achieves comparable or better accuracy, 1-2 orders of magnitude lower inference latency (near that of histograms) and update time, and 1-3 orders faster training time and smaller model size. Meanwhile, BayesCard keeps stable performance when varying data with different settings. We also integrate BayesCard into PostgreSQL. On the benchmark workload, it improves the end-to-end query time by 13.3%, very close to the optimal result of 14.
2% using the true cardinality.

In summary, the main contributions of this paper are as follows:
• We analyze the existing CardEst methods in terms of the ADS criteria that a good and practical CardEst method should meet. (Section 2)
• We propose BayesCard, a general framework that unifies the efforts behind PPLs for constructing BNs for CardEst. (Section 3)
• We develop algorithms and techniques in BayesCard using PPLs to improve inference latency and reduce model construction cost, which help BayesCard attain the desired properties of CardEst methods. (Sections 4 and 5)
• We conduct extensive experiments on benchmarks and integrate BayesCard into a real-world system to demonstrate its superiority under the ADS criteria. (Section 7)
In this section, we first formally define the CardEst problem from both database and statistical perspectives, and then examine the existing traditional methods and state-of-the-art DL-based methods for CardEst under the ADS criteria.
CardEst problem. Let T be a table with n attributes T_1, ..., T_n. For each 1 ≤ i ≤ n, let D(T_i) denote the domain (all unique values) of attribute T_i. Any selection query Q on T can be represented in a canonical form as Q = {T_1 ∈ R_Q(T_1) ∧ T_2 ∈ R_Q(T_2) ∧ ... ∧ T_n ∈ R_Q(T_n)}, where R_Q(T_i) ⊆ D(T_i) is the region specified by Q over attribute T_i. Without loss of generality, we have R_Q(T_i) = D(T_i) if Q has no constraint on attribute T_i.

Let C_Q denote the cardinality, i.e., the number of tuples in T satisfying query Q. From a statistical perspective, we can also regard all tuples in T as points sampled according to the joint distribution P_T = P_T(T_1, T_2, ..., T_n) of all attributes. Let P_T(Q) = P_T(T_1 ∈ R_Q(T_1), T_2 ∈ R_Q(T_2), ..., T_n ∈ R_Q(T_n)) be the probability specified by the region of Q. Then, we have C_Q = P_T(Q) · |T|. Thus, the CardEst problem can essentially be reduced to modeling the probability density function (PDF) P_T of table T. In this paper, we focus on data-driven CardEst methods, which model P_T directly; query-driven CardEst methods implicitly model P_T by building functions mapping Q to P_T(Q).
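To make the reduction concrete, the following toy sketch (a hypothetical five-row table, not from the paper) computes C_Q = P_T(Q) · |T| by brute force:

```python
# Toy table T with two attributes; the query region R_Q is given per
# attribute position. Unconstrained attributes would map to the full domain.
T = [(1, "a"), (1, "b"), (2, "a"), (2, "a"), (3, "b")]
R_Q = {0: {1, 2}, 1: {"a"}}

# P_T(Q): fraction of tuples whose values fall inside the region of Q
matches = sum(all(t[i] in r for i, r in R_Q.items()) for t in T)
p_Q = matches / len(T)

# C_Q = P_T(Q) * |T|; on this toy table it equals the exact count, 3
C_Q = p_Q * len(T)
```

Real CardEst methods, of course, avoid the full scan by modeling P_T; the point is only that estimating C_Q and estimating P_T(Q) are the same problem up to the factor |T|.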
Existing CardEst methods. We review the two traditional methods widely used by commercial DBMSs and four state-of-the-art (SOTA) DL-based methods.

Histogram [32] assumes all attributes in T are independent, so that P_T can be estimated as ∏_{i=1}^{n} P_T(T_i).

Sampling is a model-free method, which fetches tuples from T on-the-fly to estimate the probability of Q on the samples.

Naru [53], based on deep auto-regression models (DAR) [15], factorizes P_T as P_T(T_1) · ∏_{i=2}^{n} P_T(T_i | T_1, ..., T_{i−1}) and approximates each conditional PDF by a deep neural network (DNN).

DeepDB [20], based on sum-product networks (SPN) [37], approximates P_T by recursively decomposing it into local, simpler PDFs. Specifically, the tree-structured SPN contains sum nodes that split P_T into multiple P_{T′} over tuple subsets T′ ⊆ T, product nodes that decompose P_{T′} into P_{T′}(T_i) · P_{T′}(T_j) if attributes T_i and T_j are independent, and leaf nodes when P_T is a univariate PDF.

FLAT [58], based on factorize-split-sum-product networks (FSPN) [51], improves over SPN by adaptively decomposing P_T according to the attribute dependence level. It adds the factorize node to split P_T as P_T(W) · P_T(H | W), where H and W are the highly and weakly correlated attributes in T. P_T(W) is modeled in the same way as in an SPN. P_T(H | W) is decomposed into small PDFs by split nodes until H is locally independent of W. Then, the multi-leaf node models the multivariate PDF P_T(H) directly.

MSCN [21] is a query-driven method, which uses a set-convolutional DNN to learn mapping functions between an input query Q and its probability P_T(Q).

Analysis Results.
We elaborate the ADS criteria for the CardEst problem and analyze the aforementioned methods in detail. The results are summarized in Table 1. (Handling pattern matching queries or string predicates (e.g., "LIKE" queries) requires extensions such as q-grams [6], which we do not consider in this paper.)

• Algorithm.
From the algorithm's perspective, we consider five important metrics that are widely used in existing work [20, 58] to evaluate the performance of CardEst methods.
• Estimation accuracy is one of the priorities for CardEst, since inaccurate estimation typically leads to sub-optimal and slow query plans [26]. Unfortunately, the traditional methods frequently incur poor estimations: Histogram can cause large estimation errors in the presence of attribute correlations, and Sampling may be inaccurate on high-dimensional data with a limited sampling size. Query-driven methods, such as MSCN, also have poor accuracy if the target query does not follow the same distribution as the query workload the model is trained on. By existing evaluations [20, 53, 58], DL-based CardEst methods can produce accurate results.
• Inference latency is crucial since a CardEst method is executed numerous times during query optimization [27, 42]. As a result, high latency may degrade the end-to-end query time of plan generation and execution. The inference latency of Naru is high because of its large underlying DNN models and repetitive sampling process. Sampling is also not efficient when the sample size is large.
• Training cost refers to the CardEst model construction time for a given database. Query-driven methods, such as MSCN, are in general slow to train, since an enormous number of queries need to be executed to learn the models.
• Model size is related to the storage cost of models. In today's DBMSs, the space costs of all these CardEst methods are affordable.
• Update time is also important since table data changes frequently. Traditional methods are easy to update, while no existing DL-based method can keep up with fast data updates [49].
• Data.
Generally, a DBMS processes various data with different settings. Therefore, we analyze whether the CardEst methods have stable performance on four typical variations of data settings, namely data distribution, attribute correlation, attribute domain size, and the number of attributes (scale).

For traditional methods, Histogram's estimation error grows exponentially when data are highly correlated. Sampling's accuracy degrades on high-dimensional data with larger domain sizes and more attributes. In addition, for highly skewed data, the fetched samples tend to miss the query ranges with small probability, which also degrades its accuracy.

For DL-based methods, the poor performance stability of Naru, DeepDB and MSCN is demonstrated in a recent benchmark study [49]. In a nutshell, their accuracy decreases while inference and training cost increases with more attributes. Naru is also sensitive to data distribution and domain size, since a skewed or large PDF is more difficult to model. DeepDB has the intrinsic drawback of tending to generate large and inaccurate SPNs on highly correlated attributes [31]. FLAT overcomes this drawback of DeepDB, but its performance also degrades severely with more attributes.
• System.
The CardEst method should satisfy the following properties for friendly system deployment [49].
• Debuggability and ease of tuning are crucial to DB experts. The DL-based methods with "black-box" components may fail silently and carry a high risk of a missed bug [49].
• Interpretability is necessary when system developers would like to explain and validate the learned component, which is not satisfied by the DL-based methods [5].
• Predictability is important since system developers would like to predict performance before actual deployment. As Naru and MSCN contain DNNs with illogical behaviors [49], their performance is hard to predict.
• Reproducibility is necessary to locate system issues. As Sampling and Naru involve stochastic processes, their results cannot be reproduced by estimating the same query one more time.
Summary.
From Table 1, we observe that no existing CardEst method is satisfactory in all criteria. Our detailed experimental evaluation in Section 7 also verifies this observation. Therefore, we design a new CardEst framework, BayesCard, that successfully satisfies all criteria for the first time.
In this section, we first briefly review the background knowledge on BNs and PPLs in Section 3.1, which are the foundations of BayesCard. Then we give an overview of our new framework BayesCard for CardEst in Section 3.2.
Bayesian networks specify a probability distribution P_T of a table T whose attributes form a directed acyclic graph (DAG), such as Image (2.ii) in Figure 1. Each node of the DAG corresponds to an attribute, and each edge defines the causal dependency between two nodes. An attribute is dependent on its parents (the source nodes with edges directing to this attribute) and conditionally independent of all other attributes given its parents [22]. Thus, P_T can be compactly represented as P_T(T_1, ..., T_n) = ∏_{i=1}^{n} P_T(T_i | Par(T_i)), where Par(T_i) denotes the set of parents of T_i in the defined DAG.

Probabilistic programming languages are a general-purpose programming paradigm to specify probabilistic models and perform inference on the models automatically. Unlike in traditional programming languages (TPLs), each variable in a PPL is defined as a probability distribution, whose value can condition on a set of other variables. The compilers of PPLs are optimized to efficiently learn the parameters of variable distributions and to sample from these distributions. PPLs have been applied to various ML domains, such as computer vision [23], with remarkable performance.

To define a BN, for each attribute T_i, a PPL can define a variable whose distribution is conditioned on the variables in Par(T_i). For example, the first seven lines of the PPL program on the right side of Image (2.ii) in Figure 1 sufficiently define the BN on the left as seven variables. PPLs have the following properties. First, PPLs can define variables of any general distribution, including tabular and continuous distributions, which helps to build BNs with continuous attributes, whereas existing BNs for CardEst problems [16, 18, 48] only support discrete variables. Second, PPLs can efficiently learn the parameters using maximum likelihood estimation (MLE) [33]; e.g., the parameters of the example BN in Image (2.ii) can be derived by simply executing the last two lines of code.
Third, PPLs [41] also incorporate several mainstream algorithms for learning the BN's structure, which captures the causal pattern of attributes in the data. The structure learning procedure of PPLs supports pre-specifying sub-structures. Fourth, PPLs can efficiently generate samples from the distribution of each variable.

The PPL program of Figure 1-(2.ii), which defines BN_4 over tables H and K together with their fanout attributes, is:

    H1 = dist.categorical()
    K1 = dist.categorical()
    K2 = dist.gaussian().condition_on(H1, K1)
    F_HO = dist.categorical().condition_on(H1)
    H2 = dist.gaussian().condition_on(K2)
    F_KO = dist.categorical().condition_on(K1)
    F_OAD = dist.categorical().condition_on(F_KO, H2, K1)
    parameters = MLE(H1, K1, K2, F_HO, H2, F_KO, F_OAD)
    parameters.estimate_from_data(T)
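For intuition, the factorization P_T(T_1, ..., T_n) = ∏_i P_T(T_i | Par(T_i)) that such a program encodes can be evaluated in plain Python. This is a minimal sketch with a made-up three-attribute BN and hand-written CPTs, not BayesCard's actual representation:

```python
from math import prod

# Toy BN: A -> B, A -> C, with hand-made conditional probability tables.
# cpt[x] maps (parent values, value of x) -> P(x = value | parents).
parents = {"A": (), "B": ("A",), "C": ("A",)}
cpt = {
    "A": {((), 0): 0.4, ((), 1): 0.6},
    "B": {((0,), 0): 0.5, ((0,), 1): 0.5, ((1,), 0): 0.2, ((1,), 1): 0.8},
    "C": {((0,), 0): 0.9, ((0,), 1): 0.1, ((1,), 0): 0.3, ((1,), 1): 0.7},
}

def joint_prob(assign):
    """P_T(assign) = prod over attributes of P(T_i = v_i | Par(T_i))."""
    return prod(
        cpt[x][(tuple(assign[p] for p in parents[x]), assign[x])]
        for x in parents
    )

p = joint_prob({"A": 1, "B": 0, "C": 1})  # 0.6 * 0.2 * 0.7 = 0.084
```

The conditional-independence structure is what keeps the tables small: each attribute only needs a CPT indexed by its parents, never by all other attributes.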
Figure 1: An example workflow of BayesCard. [The figure shows: (1) the join tree of the tables in the DB; (2.i) adding fanout attributes; (2.ii) defining a single BN with a PPL program; (2.iii) tables H and K and their full outer join Ω; (3) probability inference over the ensemble of BNs, taking a query Q to its cardinality.]

In this paper, we propose BayesCard, a framework for CardEst. The key idea of
BayesCard is to build an ensemble of BNs to model the distributions of tables in a database and to use the constructed BNs to estimate the cardinality of any query. This framework, including model construction and probability inference of BNs, is implemented using PPLs in order to leverage their compiler and execution advantages for representing probability distributions.

Specifically, the inputs of
BayesCard are a DB D containing n tables and its join schema J. Following prior work's assumption [52, 56, 58], BayesCard only considers join schemas that form a tree, i.e., without self joins or cyclic joins. In the join tree J, each node represents a table and each edge represents a join relation between two tables. For example, Figure 1-(1) illustrates a DB with 11 tables and the join tree schema on the tables.

Given D and J, BayesCard constructs an ensemble of m BNs. Each BN models the joint distribution of a subset of connected tables in J. For example, in Figure 1-(1), BayesCard builds 5 BNs (BN_1, ..., BN_5 in the red circles) to characterize the distributions of the tables in the DB, where BN_4 is built to represent the joint distribution of tables H and K.

To accurately model the joint distribution of multiple tables T, BayesCard uses the fanout method as in prior works [20, 52, 58], by creating a BN on the full outer join results of T, along with additional fanout attributes. For example, as shown in Figure 1-(2.i), BN_4 models Ω, the full outer join of H and K (shown in Figure 1-(2.iii)), along with the added fanout attributes: F_{H→Ω}, indicating how many tuples in Ω a particular tuple in H fans out to; F_{K→Ω}, indicating how many tuples in Ω a particular tuple in K fans out to; and F_{Ω→{A,D}}, indicating how many tuples in the outer join table Ω ⋈ A ⋈ D a particular tuple in Ω fans out to.

Each BN can be represented as a PPL program, such as BN_4 in Figure 1-(2.ii). The probability P_T(Q) of any query Q on a subset of tables T can be estimated by combining the BNs covering the tables in T. The process of estimating the probability of a given query, P_T(Q), is called probability inference.

Challenges.
Existing PPLs are not optimized for CardEst tasks in terms of probability inference and model construction, both of which are addressed and optimized in BayesCard.

Probability inference.
After a PPL program is successfully declared to represent a BN, existing PPLs do not support using this program for efficient probability inference, which is the key to the CardEst problem. Therefore, BayesCard tailors existing PPLs and designs two efficient inference algorithms. Using PPLs' extremely efficient sampling process, BayesCard proposes the progressive sampling algorithm, which is guaranteed to run in linear time for estimating any query (Section 4.1). In addition, BayesCard invents compiled variable elimination to further accelerate the inference algorithm (Section 4.2). Furthermore, BayesCard adapts its inference algorithms to the fanout method to efficiently combine results from multiple BNs to estimate the probability of join queries (Section 4.3).
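As a side note, the fanout attributes of Section 3.2 can be derived directly from the join keys. A toy sketch (made-up keys, pure stdlib; not BayesCard's implementation):

```python
from collections import Counter

h_keys = [1, 2, 3]   # join-key values of the tuples in H
k_keys = [1, 1, 2]   # join-key values of the tuples in K

# In the full outer join Ω, a tuple of H with key h appears max(1, m) times,
# where m is the number of K tuples sharing that key (an unmatched tuple
# still appears once, padded with NULLs). F_{H→Ω} records exactly this count.
k_count = Counter(k_keys)
fanout_H = {h: max(1, k_count[h]) for h in h_keys}        # {1: 2, 2: 1, 3: 1}

h_count = Counter(h_keys)
fanout_K = {k: max(1, h_count[k]) for k in set(k_keys)}   # {1: 1, 2: 1}
```

These per-tuple counts are what let a single BN trained on Ω answer queries over H alone, K alone, or their join, by scaling probabilities with the fanout values.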
Model construction.
A database generally contains multiple tables, and deciding which ensemble of BNs (corresponding to a partition of the tables) to learn significantly affects the CardEst accuracy and efficiency. Therefore, BayesCard designs an ensemble construction algorithm to explore the optimal partition of all tables in the DB, optimizing the CardEst quality (Section 5.1). Furthermore, existing PPLs do not explore how to accelerate structure learning algorithms in DB scenarios. BayesCard tailors and speeds up these algorithms by exploring and exploiting functional dependencies and other user-defined expert knowledge (Section 5.2).
In this section, we address the probability inference in BayesCard. Specifically, we first propose two novel inference algorithms based on PPLs for a single BN model, namely progressive sampling (Section 4.1), which is guaranteed to return an approximate probability estimation in linear time, and compiled variable elimination (Section 4.2), which returns the exact probability with two orders of magnitude acceleration. Next, we present how to extend these two algorithms to multiple BNs to support join queries (Section 4.3).
Algorithm 1
Progressive Sampling Inference Algorithm
Input: a table T with n attributes, a query Q with region R_Q, and a PPL program defining the BN on P_T
1:  Align the attributes in topological order T_1, ..., T_n
2:  p ← 1
3:  S ← [ ]_{k×n}, a k × n matrix of samples
4:  for i ∈ {1, ..., n} do
5:      Take S[Par(T_i)], the columns in S corresponding to the attributes in Par(T_i)
6:      P̂_i(T_i) ← (1/k) Σ_{d ∈ S[Par(T_i)]} P_T(T_i | d)
7:      p ← p · P̂_i(T_i ∈ R_Q(T_i))
8:      Define a PPL variable P′_i by normalizing P̂_i(t_i | t_i ∈ R_Q(T_i)), and set S[i] ← k points sampled from P′_i
9:  end for
10: return p

We define the inference procedure for a simple case, where we have a query Q on tables T in a DB and a single BN that exactly models P_T on the full outer join of the tables T. In this case, estimating the cardinality of Q via P_T(Q) can be done directly on this BN. As defined in Section 2, a query Q takes the form {T_1 ∈ R_Q(T_1) ∧ T_2 ∈ R_Q(T_2) ∧ ... ∧ T_n ∈ R_Q(T_n)}, where R_Q is the region defined by Q over the attributes of T. Thus, we can represent the probability of Q as P_T(Q) = ∏_{i=1}^{n} P_T(T_i ∈ R_Q(T_i) | Par(T_i) ∈ R_Q(Par(T_i))) = ∏_{i=1}^{n} P_i, where R_Q(Par(T_i)) denotes the query region over the set of parent attributes Par(T_i), and we denote each term as P_i for simplicity. Therefore, to compute P_T(Q), we only need to compute or estimate each P_i.

In PPLs, accessing the probability P_T(T_i | s) for each fixed value assignment s ∈ R_Q(Par(T_i)) takes constant time. However, computing P_i is generally intractable, as there can be an exponential or infinite number of unique values in R_Q(Par(T_i)). Specifically, for large BNs with complex structures, the PPLs' existing inference algorithms have no efficiency guarantee, which is required for CardEst in a practical DBMS. Therefore,
BayesCard designs the progressive sampling inference algorithm, which uses a Monte Carlo approximation of P_i based on a sample S of R_Q(Par(T_i)) to ensure computational efficiency, i.e., P_i ≈ (1/|S|) Σ_{s ∈ S} P_T(R_Q(T_i) | s). The default sampling procedure in PPLs only supports sampling values from a variable's entire domain, and such samples are likely to fall outside the query range R_Q. Naively using this sampling procedure would produce an enormous number of ineffective points. Therefore, we leverage the learned model, create variables to materialize the distribution P(Par(T_i) | Par(T_i) ∈ R_Q(Par(T_i))), and progressively sample points from R_Q(Par(T_i)) accordingly, which greatly improves sample effectiveness. Algorithm description.
We present the details in Algorithm 1. Specifically, we first align the attributes of T in topological order T_1, ..., T_n, where T_1 is the root of the BN's DAG structure (line 1). We can directly obtain P_T(T_1) from the PPL, as T_1 does not depend on any other attribute, and compute P_1 = P_T(R_Q(T_1)). Then, we define a new PPL variable to represent the distribution P_T(t_1 | t_1 ∈ R_Q(T_1)) and generate a sample S_1 of R_Q(T_1) from this variable. Next, for each of the remaining attributes T_i, the samples of its parents Par(T_i) must already have been generated, because the attributes are aligned in topological order (line 5). We derive a new distribution P̂_i approximating P_T(T_i | R_Q(Par(T_i))) using these samples (line 6). This distribution P̂_i is used to estimate P_i (line 7) and to generate samples from R_Q(T_i) (line 8). At last, after we obtain the estimated value of each P_i, P_T(Q) is computed as their product (line 10). Analysis.
Sampling |S| points and evaluating the probability at each fixed point takes O(|S|) time to approximate each P_i. Therefore, the time complexity of progressive sampling on a BN of any structure is guaranteed to be O(|S| · n). This inference algorithm is very efficient because, in general, a small sample S suffices to make a very accurate estimation, and the sampling process is extremely efficient in PPLs. The progressive sampling algorithm in PPLs resembles the one in the DAR model proposed by Naru [53]. Our method differs from theirs in the following aspects: 1) Efficient sampling is naturally supported in PPLs for various continuous distributions, whereas the sampling procedure in DAR is post-equipped and supports categorical distributions only. 2) The progressive sampling in BayesCard estimates each P_i using the sample S during the sampling process, whereas in DAR, the samples S are used to directly compute P_T, which is less effective.
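A runnable sketch of the progressive sampling loop of Algorithm 1 over categorical toy CPTs (the data structures and helper layout are our own simplification, not BayesCard's API):

```python
import random

def progressive_sampling(topo, parents, cpt, region, k=2000):
    """Estimate P_T(Q) = prod_i P_i over a BN given in topological order.
    cpt[attr] = {"domain": [...], "table": {parent_values: {value: prob}}};
    region[attr] is the query region R_Q(attr)."""
    p, samples = 1.0, [dict() for _ in range(k)]
    for attr in topo:
        dom = cpt[attr]["domain"]
        # Monte Carlo estimate of P̂_i, averaged over the parent samples
        probs = {v: 0.0 for v in dom}
        for s in samples:
            row = cpt[attr]["table"][tuple(s[x] for x in parents[attr])]
            for v in dom:
                probs[v] += row[v] / k
        p_i = sum(probs[v] for v in dom if v in region[attr])
        if p_i == 0.0:
            return 0.0
        p *= p_i
        # normalize P̂_i over the query region and progressively sample from it
        vals = [v for v in dom if v in region[attr]]
        weights = [probs[v] / p_i for v in vals]
        for s, v in zip(samples, random.choices(vals, weights, k=k)):
            s[attr] = v
    return p

random.seed(0)
cpt = {
    "A": {"domain": [0, 1], "table": {(): {0: 0.5, 1: 0.5}}},
    "B": {"domain": [0, 1], "table": {(0,): {0: 1.0, 1: 0.0},
                                      (1,): {0: 0.0, 1: 1.0}}},
}
est = progressive_sampling(["A", "B"], {"A": (), "B": ("A",)},
                           cpt, {"A": {0, 1}, "B": {1}})
# true P(A unconstrained, B = 1) is 0.5; est is close for k = 2000
```

Note how each attribute is sampled only inside its query region, which is exactly what avoids the wasted samples of naive domain-wide sampling.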
Graph reduction optimization. To further accelerate the progressive sampling algorithm, BayesCard proposes the graph reduction optimization, which significantly speeds up inference latency for datasets with a large number of attributes.
Main idea.
In fact, the progressive sampling algorithm involves a large amount of redundant computation. For example, for an attribute T_i that is not constrained by predicates in Q, i.e., R_Q(T_i) = D(T_i), the estimate of P_i equals 1. If all the descendants T_j of T_i are also unconstrained in Q, there is no need to sample T_i, since each P_j equals 1 regardless of the samples. Therefore, we can reduce a large BN model to a much smaller one by removing these redundant attributes, and perform probability inference on it without affecting the estimation accuracy. Formulation.
First, we make the following rigorous definition of the reduced graph G′. Intuitively, G′ contains only the attributes constrained in the query, plus the other attributes necessary to connect them into a minimal BN. An example of a reduced graph can be found in Figure 2.

Definition 1. Given a BN representing a table T with attributes V = {T_1, ..., T_n}, its defined DAG G = (V, E), and a query Q = (T′_1 = t′_1 ∧ ... ∧ T′_k = t′_k) where T′_i ∈ V, we define the reduced graph G′ = (V′, E′) to be the sub-graph of G where V′ equals ∪_{1≤i≤k} Ancestor(T′_i), and E′ equals all edges in E with both endpoints in V′. Ancestor(T′_i) includes all parent nodes of T′_i and their parent nodes, recursively.

Based on this definition, we can reduce the original BN model (i.e., a PPL program with variables V) into a much smaller one (i.e., a PPL program with variables V′), and perform inference on it. The correctness of the graph reduction optimization is stated in Theorem 1. Due to space limits, we put the proofs of all theorems in Appendix A of the accompanying technical report [50].

Theorem 1. Given a BN B defining G, a query Q, and the reduced BN B′ defining G′ on Q, computing P_T(Q) on B′ is equivalent to computing P_T(Q) on B.
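Definition 1 translates into a short graph traversal. A sketch using the attribute names of Figure 2, where each constrained attribute is kept together with all of its ancestors; the parent sets below are our own reading of the figure (an assumption), not data from the paper:

```python
def reduce_graph(parents, constrained):
    """Compute G' = (V', E'): V' is each constrained node plus all of its
    ancestors (parents, recursively); E' keeps edges with both ends in V'."""
    keep, stack = set(), list(constrained)
    while stack:
        node = stack.pop()
        if node not in keep:
            keep.add(node)
            stack.extend(parents[node])
    edges = {(u, v) for v in keep for u in parents[v]}
    return keep, edges

# Assumed DAG of BN_4 (parent sets inferred from Figure 2's JIT listing)
parents = {"H1": [], "K1": [], "K2": ["H1", "K1"], "F_HO": ["H1"],
           "H2": ["K2", "F_HO"], "F_KO": ["K1"], "F_OAD": ["F_KO", "H2", "K1"]}
nodes, _ = reduce_graph(parents, {"K2", "H2"})
# F_KO and F_OAD are pruned: no constrained attribute depends on them
```

For the query constraining K2 and H2, this yields the red and green nodes of Figure 2, dropping F_KO and F_OAD entirely.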
Listing 1 (the PPL program defining BN_4):

    H1 = dist.categorical()
    K1 = dist.categorical()
    K2 = dist.gaussian().condition_on(H1, K1)
    F_HO = dist.categorical().condition_on(H1)
    H2 = dist.gaussian().condition_on(K2)
    F_KO = dist.categorical().condition_on(K1)
    F_OAD = dist.categorical().condition_on(F_KO, H2, K1)
    parameters = MLE(H1, K1, K2, F_HO, H2, F_KO, F_OAD)
    parameters.estimate_from_data(T)

Listing 2 (the compiled program generated by JIT):

    def JIT_exact_for_BN_4(R_Q):
        M_H1 = H1.params.as_mat()[R_Q(H1)]
        M_K1 = K1.params.as_mat()[R_Q(K1)]
        M_K2 = K2.params.as_mat()[R_Q(K2), R_Q(H1), R_Q(K1)]
        M_F_HO = F_HO.params.as_mat()[R_Q(F_HO), R_Q(H1)]
        M_H2 = H2.params.as_mat()[R_Q(H2), R_Q(K2), R_Q(F_HO)]
        exec(matmal(colsum(matmal(colsum(M_H2), M_F_HO) * matmal(M_K2, M_K1)), M_H1))
Figure 2: Graph reduction and the compiled program with JIT. The left image shows the graph reduction for a query with predicates on 𝐾₂ and 𝐻₂. The red nodes refer to the attributes in the query; the red and green nodes together with the red edges form the reduced graph 𝐺′.

Progressive sampling works for general PPL programs with any distribution type. For programs restricted to categorical distributions, we can further accelerate inference using an alternative approach: compiled variable elimination. Inspired by the impressive results of compilation for query processing [34, 43, 44], we investigate the use of just-in-time (JIT) compilation and compiler optimizations to improve inference latency.
Observation.
Let us revisit the example BN built on tables 𝐻 and 𝐾 in the left image of Figure 2. Consider a query 𝑄 with predicates on 𝐾₂ and 𝐻₂, where we remove all "black" attributes by the graph reduction technique based on Theorem 1. For the "green" attributes, we have 𝑅_𝑄(𝐻₁) = 𝐷(𝐻₁), 𝑅_𝑄(𝐾₁) = 𝐷(𝐾₁), and 𝑅_𝑄(𝐹_{𝐻→Ω}) = 𝐷(𝐹_{𝐻→Ω}). The variable elimination (VE) algorithm computes the probability 𝑃_𝑇(𝑄) by the following equation:

𝑃_𝑇(𝑄) = ∑_{ℎ₁∈𝑅_𝑄(𝐻₁)} ∑_{𝑘₁∈𝑅_𝑄(𝐾₁)} · · · ∑_{ℎ₂∈𝑅_𝑄(𝐻₂)} 𝑃_𝑇(ℎ₁) · 𝑃_𝑇(𝑘₁) · · · · · 𝑃_𝑇(ℎ₂ | 𝑘₂, 𝑓_{𝐻→Ω})

This computation can be very inefficient in PPLs and is repeated for estimating multiple queries. However, we observe that the VE algorithm involves only sums and products over attributes. If each variable in the PPL (attribute in the BN) is defined as a categorical conditional distribution, it can be materialized as a vector or matrix. Thus, the VE algorithm essentially defines a program of linear algebra operations, whose execution can be significantly accelerated by modern computing hardware. Furthermore, we observe that the linear algebra program computing VE is fixed for a target query as long as the elimination order is fixed.
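The observation can be checked on a tiny two-variable chain 𝐻₁ → 𝐾₂ with categorical distributions (loosely mirroring the example's variables; all parameters below are made up): the nested sums of VE and the sliced matrix formulation yield the same probability.

```python
# P(H1): prior over two values; P(K2 | H1): conditional table (rows
# indexed by the parent H1, columns by K2).
p_h1 = [0.6, 0.4]
p_k2_given_h1 = [[0.7, 0.3],   # P(K2 | H1 = 0)
                 [0.2, 0.8]]   # P(K2 | H1 = 1)

# Query region: K2 constrained to {1}; H1 unconstrained, R_Q(H1) = D(H1).
r_h1, r_k2 = [0, 1], [1]

# Variable elimination as nested sums over the query regions.
p_naive = sum(p_h1[h] * p_k2_given_h1[h][k] for h in r_h1 for k in r_k2)

# The same computation as linear algebra: slice the CPT to the query
# region, sum out K2 (column sum), then dot with the prior vector.
m_k2 = [[p_k2_given_h1[h][k] for k in r_k2] for h in r_h1]  # sliced CPT
summed = [sum(row) for row in m_k2]                         # sum out K2
p_linalg = sum(v * p for v, p in zip(summed, p_h1))         # sum out H1

assert abs(p_naive - p_linalg) < 1e-12
```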
JIT of VE.
For any query, BayesCard can first decide an optimal variable elimination order and then compile the learned BN from the PPL program into a static program containing only matrix or tensor operations, maximizing execution efficiency. Furthermore, this program can be reused to infer other queries with the same reduced graph by changing only the input query regions 𝑅_𝑄 (as shown in Figure 2). Therefore, JIT can remember the execution pattern for this query and reuse it to infer the probability of future queries for further speed-up.

An example program showing the JIT compilation of VE on the same query 𝑄 is shown in Figure 2. Specifically, for each variable 𝑇ᵢ of the PPL in the reduced graph 𝐺′, the JIT program first extracts the parameters of its distribution 𝑃_𝑇(𝑇ᵢ | 𝑃𝑎𝑟(𝑇ᵢ)). Since VE only supports categorical distributions, the extracted parameters of 𝑇ᵢ form a matrix 𝑀_{𝑇ᵢ}. Next, based on the query region 𝑅_𝑄, the JIT program can further reduce 𝑀_{𝑇ᵢ} by keeping only useful information, i.e., slicing its rows with 𝑅_𝑄(𝑇ᵢ) and its columns with 𝑅_𝑄(𝑃𝑎𝑟(𝑇ᵢ)) (lines 2-6 of the code in Figure 2). This reduction not only eliminates redundant computation but also enables a closed-form linear algebra expression.

Then, BayesCard determines an elimination order for these variables using the reversed topological order or a standard procedure [11]. A fixed program containing only linear algebra operations can be derived, like the one in line 8, where "matmul" refers to matrix multiplication, "colsum" refers to column sum, and ".T" refers to transpose. Finally, this generated static program executes efficiently, thanks to batch processing of tensor operations with various performance-tuning techniques (e.g., loop tiling, parallelization, and vectorization). By our evaluation, such a program can achieve up to two orders of magnitude speed-up over the original VE algorithm.
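Reusing the compiled program across queries with the same reduced graph amounts to memoizing the compiled artifact; a sketch with hypothetical names (not BayesCard's actual API):

```python
_compiled_cache = {}

def infer(reduced_graph_key, regions, compile_fn):
    """Run the compiled VE program for this reduced graph, compiling
    it on first use and reusing it for later queries."""
    prog = _compiled_cache.get(reduced_graph_key)
    if prog is None:
        prog = compile_fn(reduced_graph_key)    # expensive, done once
        _compiled_cache[reduced_graph_key] = prog
    return prog(regions)                        # cheap, reused per query

# A stand-in "compiler" that records invocations and returns a toy
# program summing the sizes of the query regions R_Q.
calls = []
def toy_compile(key):
    calls.append(key)
    return lambda regions: sum(len(r) for r in regions)

r1 = infer("G1", ([0, 1], [1]), toy_compile)
r2 = infer("G1", ([0], [0, 1]), toy_compile)   # same reduced graph
assert calls == ["G1"]                         # compiled exactly once
```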
Previous sections discuss the process of inferring the probability 𝑃_𝑇(𝑄) of a query 𝑄 on table(s) 𝑇 represented by exactly one BN. For a database with multiple tables, this process must be modified for two types of queries: (1) a query 𝑄 on tables that span multiple BNs (e.g., 𝑄 on 𝑇 = {𝐴, 𝐷, 𝐻, 𝐾} in Figure 1); (2) a query on tables that cover only a subset of a single BN (e.g., 𝑄 on 𝑇 = {𝐻}). In these cases, BayesCard does not contain an exact BN representing 𝑃_𝑇 with which to estimate 𝑄. Fortunately, based on the fanout method explained earlier in Section 3.2, we can calculate 𝑃_𝑇(𝑄) using the following theorem, which is proposed and proved in [58].

Theorem 2. Given a query 𝑄, let 𝑉 = {𝑉₁, 𝑉₂, . . . , 𝑉_𝑑} denote all vertices (nodes) in the join tree touched by 𝑄, and let 𝒱 denote the full outer join of all tables in 𝑉. On each node 𝑉ᵢ, let 𝐹 = {𝐹_{𝐴₁,𝐵₁}, 𝐹_{𝐴₂,𝐵₂}, . . . , 𝐹_{𝐴ₙ,𝐵ₙ}}, where each (𝐴ⱼ, 𝐵ⱼ) is a distinct join with 𝐵ⱼ not in 𝑄. Let 𝑓 = (𝑓₁, 𝑓₂, . . . , 𝑓ₙ), where 𝐹_{𝐴ⱼ,𝐵ⱼ} = 𝑓ⱼ for all 1 ≤ 𝑗 ≤ 𝑛, denote an assignment to 𝐹, and let dlm(𝑓) = ∏ⱼ₌₁ⁿ max{𝑓ⱼ, 1}. Let

𝑝ᵢ = (|𝒱ᵢ| / |𝒱|) · ∑_{𝑓,𝑣} ( 𝑃_{𝒱ᵢ}(𝑄ᵢ ∧ 𝐹 = 𝑓 ∧ 𝐹_{𝑉ᵢ,𝑉} = 𝑣) · max{𝑣, 1} / dlm(𝑓) ).   (1)

Then, the cardinality of 𝑄 is |𝒱| · ∏ᵢ₌₁ᵈ 𝑝ᵢ.

In short, since all the fanout attributes involved in this computation are pre-stored in the table 𝑉ᵢ and there exists a BN for 𝑃_{𝒱ᵢ}, BayesCard can directly use this theorem for probability inference of multi-table join queries.
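The fanout-weighted sums appearing in Equation (1) are expectations of (clamped) fanout values under the node's distribution; the factoring identity that the following algorithms exploit can be checked numerically (all probabilities below are made up):

```python
# Toy joint distribution over (Q holds?, fanout value F), as a dict
# from (q, f) to probability; the numbers are made up and sum to 1.
joint = {(True, 1): 0.2, (True, 2): 0.1, (True, 3): 0.1,
         (False, 1): 0.4, (False, 2): 0.2}

# Direct form: sum_f P(Q, F = f) * f.
lhs = sum(p * f for (q, f), p in joint.items() if q)

# Factored form: P(Q) * sum_f P(f | Q) * f, pulling out the common
# term P(Q) so it is not recomputed per fanout value.
p_q = sum(p for (q, f), p in joint.items() if q)
rhs = p_q * sum((p / p_q) * f for (q, f), p in joint.items() if q)

assert abs(lhs - rhs) < 1e-12   # the two formulations agree
```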
Efficient summation computation in
BayesCard. We can compute the summation ∑_{𝑓,𝑣} ( 𝑃_{𝒱ᵢ}(𝑄ᵢ ∧ 𝐹 = 𝑓 ∧ 𝐹_{𝑉ᵢ,𝑉} = 𝑣) · max{𝑣, 1} / dlm(𝑓) ) over all assignments of 𝑓 and 𝑣 as efficiently as computing the probability 𝑃_{𝒱ᵢ}(𝑄ᵢ) for any query. We explain the detailed procedure for calculating ∑_{𝑓∈𝐷(𝐹)} 𝑃_𝑇(𝑄, 𝐹 = 𝑓) · 𝑓 using progressive sampling and compiled variable elimination, where 𝐷(𝐹) denotes the domain of unique values of 𝐹. The procedure then generalizes naturally to more complex cases.

Figure 3: BayesCard ensemble learning algorithm demo. (a) 𝑘′ = 2; (b) 𝑘′ = 3; (c) result.

Our calculation procedure is motivated by the Bayesian rule: 𝑃_𝑇(𝑄, 𝐹 = 𝑓) = 𝑃_𝑇(𝐹 = 𝑓 | 𝑄) · 𝑃_𝑇(𝑄). We observe that 𝑃_𝑇(𝑄) is a fixed value independent of 𝐹, because the fanout attributes are artificial attributes that are never involved in 𝑄. Furthermore, by the properties of BNs, we know that 𝑃_𝑇(𝐹 | 𝑄) = 𝑃_𝑇(𝑓 | 𝑅_𝑄(𝑃𝑎𝑟(𝐹))), so we can derive the following equation. It factors out the common term 𝑃_𝑇(𝑄), so the calculation avoids repeatedly computing 𝑃_𝑇(𝑄):

∑_{𝑓∈𝐷(𝐹)} 𝑃_𝑇(𝑄, 𝐹 = 𝑓) · 𝑓 = 𝑃_𝑇(𝑄) · ( ∑_{𝑓∈𝐷(𝐹)} 𝑃_𝑇(𝑓 | 𝑅_𝑄(𝑃𝑎𝑟(𝐹))) · 𝑓 )

Progressive sampling.
Recall in Section 4.2,
BayesCard estimates 𝑃ᵢ = 𝑃_𝑇(𝑇ᵢ | 𝑅_𝑄(𝑃𝑎𝑟(𝑇ᵢ))) by drawing progressive samples from 𝑅_𝑄 and approximates 𝑃_𝑇(𝑄) as ∏ᵢ 𝑃ᵢ. After estimating 𝑃_𝑇(𝑄) with sample 𝑆, BayesCard can directly estimate ∑_{𝑓∈𝐷(𝐹)} 𝑃_𝑇(𝑓 | 𝑅_𝑄(𝑃𝑎𝑟(𝐹))) · 𝑓 using the same sample 𝑆, i.e., as ∑_{𝑓∈𝑆[𝐹]} 𝑃̂_𝑇(𝑓 | 𝑆[𝑃𝑎𝑟(𝐹)]) · 𝑓. The final result is obtained by multiplying these two terms together.

Compiled variable elimination.
Recall in Section 4.2,
BayesCard can specify a particular elimination order by choosing the fanout variable 𝐹 as the last variable to eliminate. Using PPL, the intermediate result after each elimination step is materialized as a distribution. Therefore, before the last elimination step of the VE algorithm for computing 𝑃_𝑇(𝑄), BayesCard can store the intermediate result, which represents the conditional distribution 𝑃_𝑇(𝐹 | 𝑄). Then, the summation ∑_{𝑓∈𝐷(𝐹)} 𝑃_𝑇(𝑓 | 𝑅_𝑄(𝑃𝑎𝑟(𝐹))) · 𝑓 equals 𝑃_𝑇(𝐹 | 𝑄) · 𝐷(𝐹), where · denotes the vector dot product. Therefore, similar to computing 𝑃_𝑇(𝑄), this process involves only linear algebra operations, which can be compiled and efficiently calculated using JIT.

BAYESCARD
In this section, we explain how
BayesCard constructs an ensembleof BNs for a multi-table database. Specifically, Section 5.1 first intro-duces the BN ensemble construction method with budget, whichclusters all tables in the database into several groups and buildsa single BN on each group of tables. Then, Section 5.2 introducessome optimizations for building a single BN using PPLs. Finally,Section 5.3 shows how to incrementally update the BN model.
Main idea.
Consider the example database in Figure 3 with 11 tables 𝐴, 𝐵, . . . , 𝐾 forming a join tree, where each node represents a table and each edge represents a possible join between two tables. A previous approach [20] suggests creating every possible two-table join result, examining the level of dependence between attributes across the two tables, and deciding whether to create one large model on their full outer join or two separate models. Since generating the full outer join of multiple tables could require exponential
memory, this approach normally cannot explore the possibility of creating a model on the join of more than three tables.

Another approach [52] generates an unbiased sample 𝑆 of the full outer join of all tables in the schema and builds a single large model on 𝑆 directly. As the resulting model covers all attributes in the database, model construction and probability inference can be very inefficient. Moreover, the size of 𝑆 is relatively small with respect to the full outer join size, implying a large amount of information loss, so the model learned on 𝑆 might not accurately represent the actual data distribution.

To balance estimation accuracy and inference efficiency, we want to explore the full space of BN ensembles in which each BN joins no more than a threshold number of tables. The resulting ensemble should capture as much inter-table dependence as possible while simultaneously keeping each BN in the ensemble as small as possible.

Algorithm 2: BN Ensemble Construction Algorithm
Input: a DB schema with 𝑛 tables 𝑇₁, · · · , 𝑇ₙ and a budget 𝑘
1:  Create the join tree T = (𝑉, 𝐸) for the schema
2:  Generate unbiased samples 𝑆 for the full outer join of the entire schema
3:  Initialize a dependence matrix 𝑀 ∈ ℝⁿˣⁿ
4:  for each pair of join tables 𝑒 = (𝑇ᵢ, 𝑇ⱼ) do
5:      Calculate the RDC dependence scores between all attributes in 𝑇ᵢ and all attributes in 𝑇ⱼ
6:      𝑤_𝑒 ← average RDC score
7:  end for
8:  if 𝑘 = 1 then
9:      return T and learn a single PRM for each table
10: end if
11: for 𝑘′ ← 2, · · · , 𝑘 do
12:     Sort 𝐸 in decreasing order of 𝑤_𝑒
13:     for 𝑒 = (𝑢, 𝑣) ∈ 𝐸 do
14:         if 𝑢 and 𝑣 contain exactly 𝑘′ tables in total then
15:             Update T by contracting nodes 𝑢, 𝑣 into a single node {𝑢, 𝑣}
16:         end if
17:     end for
18: end for
19: return T and learn a single PRM for each node in T

Algorithm description.
The details of the ensemble construction algorithm are given in Algorithm 2. First, we define a budget 𝑘 such that a single BN model can only be constructed on (a sample of) the full outer join of no more than 𝑘 tables. The budget 𝑘 is a hyper-parameter decided by the dataset, system, and computing resources. The algorithm works as follows:

1) Computing dependency between tables (lines 1-7).
Given a tree-structured join schema T, we first generate the unbiased sample 𝑆 of the full outer join of all tables according to [56]. Specifically, the join tree is regarded as a rooted tree, and the samples 𝑆 are obtained by scanning all tables in T in a bottom-up manner. Then, we calculate the randomized dependence coefficient (RDC) value [29] between each pair of join tables using 𝑆. The detailed computation method is given in Appendix B of our technical report [50]. In Figure 3, the RDC values are shown as red numbers on each edge.

2) Contracting nodes (lines 8-18).
Intuitively, we would like to build a model on the full outer join of tables with high dependency, so we iteratively contract the nodes (tables) with high RDC values in T in a greedy manner. For each 𝑘′ from 2 to 𝑘, we first sort all edges 𝑒 = (𝑢, 𝑣) (joins) in descending order of their RDC values. Following this order, we aggregate 𝑢 and 𝑣, the two endpoints of edge 𝑒, into a single node if they contain exactly 𝑘′ tables in total, and update the RDC values of the affected edges accordingly; the details are given in Appendix B of our technical report [50]. We iterate this process until 𝑘′ = 𝑘, and in the end we obtain a tree where each node contains at most 𝑘 tables. For example, in Figure 3, let the budget 𝑘 =
3. In the firstiteration where 𝑘 ′ =
2, the algorithm considers joining two tables together. The edge (𝐵, 𝐸) has the highest RDC value, so 𝐵 and 𝐸 are aggregated in the first step (① in Figure 3(a)). After the first iteration, the join schema T has been transformed into the new tree in Figure 3(b). Similarly, in the second iteration where 𝑘′ =
3, the node {𝐵, 𝐸} is first merged with the node 𝐼. Finally, the join tree is transformed into the tree in Figure 3(c).

3) Building BNs (line 19).
In the end,
BayesCard will construct a single BN model on (a sample of) the full outer join of the tables within each node, and fanout attributes will be added accordingly.
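The greedy contraction at the core of Algorithm 2 can be sketched with a small union-find; the RDC weights below are fixed made-up numbers, whereas the real algorithm recomputes them after each contraction:

```python
def build_ensemble(nodes, edges, k):
    """Greedy node contraction of Algorithm 2 (simplified sketch).

    nodes: table names (join-tree nodes); edges: (rdc_weight, u, v)
    join edges; k: budget -- one BN may cover at most k tables.
    Returns the table groups; BayesCard builds one BN per group.
    """
    rep = {n: n for n in nodes}       # union-find representative
    group = {n: [n] for n in nodes}   # tables merged into each node

    def find(n):
        while rep[n] != n:
            n = rep[n]
        return n

    for k_prime in range(2, k + 1):
        # visit edges in decreasing order of RDC weight
        for _, u, v in sorted(edges, reverse=True):
            ru, rv = find(u), find(v)
            if ru != rv and len(group[ru]) + len(group[rv]) == k_prime:
                rep[rv] = ru                     # contract {u, v}
                group[ru].extend(group.pop(rv))
    return list(group.values())

# Join tree A-B-C where (A, B) has the strongest dependence. With
# budget k = 2, A and B merge while C stays alone -> two BNs.
groups = build_ensemble(["A", "B", "C"],
                        [(0.9, "A", "B"), (0.3, "B", "C")], k=2)
```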
Time Complexity analysis.
As shown in [56], creating the samples 𝑆 of the full outer join of tables 𝑇₁, · · · , 𝑇ₙ takes 𝑂(∑ᵢ₌₁ⁿ |𝑇ᵢ|) time. Let 𝑚 be the number of attributes in the full outer join of the tables. Calculating the pairwise RDC values takes 𝑂(𝑚|𝑆| log |𝑆|). The rest of Algorithm 2 takes 𝑂(𝑘𝑛) time, since the algorithm terminates in 𝑘 iterations and each iteration only needs to check the tables defined by the two endpoints of each edge, of which there are at most 𝑛. Thus, the overall time complexity is 𝑂(∑ᵢ₌₁ⁿ |𝑇ᵢ| + 𝑚|𝑆| log |𝑆| + 𝑘𝑛).

The structure learning process, i.e., learning the causal structure of a single BN from data, is an NP-hard combinatorial optimization problem [7]. Current structure learning algorithms supported by PPLs produce either a general DAG structure or a simplified tree structure. We present optimization techniques for both as follows:
Optimization for DAG structure learning algorithms.
Exact DAG structure learning algorithms explore the super-exponential search space of all possible DAGs and select the best candidate [4, 9, 19, 55]. The learned structure is accurate, but the search is inefficient and only scales to tens of attributes. Approximate methods limit the search space with local heuristics (e.g., greedy algorithms [8, 14, 39]), but they may produce inaccurate results. Based on PPLs, BayesCard supports pre-specifying sub-structures before running the exact or greedy structure learning algorithms, which limits the DAG search space and makes structure learning much more efficient. Specifically, practical databases generally contain attributes with functional dependencies [13] or obvious causal relations between attributes, such as one's "age" determining one's "school level". First, users of
BayesCard can use their “expertknowledge” to pre-specify certain causal structures for subsets ofattributes. Then, the PPLs within
BayesCard can define the variables corresponding to these attributes and condition the variables on each other according to the pre-specified structure. Finally,
BayesCard can rely on the existing algorithms to construct the remaining causal structure over these variables. Since the algorithms are forced to maintain these sub-structures, the number of qualified DAG candidates is significantly curtailed, making the structure learning process more efficient without loss of accuracy.
Optimization for tree structure learning algorithms.
The tree structure learning algorithm learns a tree structure such as the
Chow-Liu tree [10], which sacrifices accuracy for efficiency.
BayesCard canalso improve the accuracy of a learned structure using the afore-mentioned “expert knowledge” after running the
Chow-Liu tree algorithm. This efficient algorithm forces the learned BN structure to be a tree, which could contain "false" causality or miss important attribute dependences. For example, intuitively we know that the number of "children" raised by someone depends largely on both one's "income" and one's "marital status", which cannot be captured simultaneously by a tree BN, since each node is allowed only one parent. Thus, after the structure is learned,
BayesCard can add the edge from “Income” to “Children” to improve its accu-racy. With PPLs, only the parameters of the affected sub-structure(the “Children” variable in this example) need to be updated.
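For concreteness, the Chow-Liu algorithm referenced above builds a maximum spanning tree over pairwise mutual information; a textbook sketch on toy categorical columns (this is the generic algorithm, not BayesCard's code):

```python
from collections import Counter
from math import log

def mutual_info(xs, ys):
    """Empirical mutual information I(X; Y) in nats."""
    n = len(xs)
    pxy, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum(c / n * log(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

def chow_liu_edges(columns):
    """Chow-Liu skeleton: maximum spanning tree over pairwise MI."""
    names = list(columns)
    scored = sorted(((mutual_info(columns[a], columns[b]), a, b)
                     for i, a in enumerate(names)
                     for b in names[i + 1:]), reverse=True)
    rep = {n: n for n in names}            # union-find for Kruskal
    def find(n):
        while rep[n] != n:
            n = rep[n]
        return n
    tree = []
    for _, a, b in scored:                 # strongest dependences first
        ra, rb = find(a), find(b)
        if ra != rb:
            rep[rb] = ra
            tree.append((a, b))
    return tree

# A determines B, while C is independent noise: the tree links A-B
# first, so the strongest dependence is always captured.
cols = {"A": [0, 0, 1, 1], "B": [0, 0, 1, 1], "C": [0, 1, 0, 1]}
tree = chow_liu_edges(cols)
```

An "expert edge" such as Income → Children can then be added on top of this skeleton, refitting only the affected conditional distribution.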
Most practical databases update their data frequently, requiring cardinality estimators to adjust their underlying models dynamically [49]. When the data distribution changes,
BayesCard can update its underlying BNs very efficiently. Specifically, thelearned structure of BN captures the intrinsic causal pattern of theattributes, which is not likely to change even in the case of massivedata updates. Therefore, in most cases,
BayesCard can preserve theoriginal BN structure and only incrementally update its distributionparameters. Such parameter updates are extremely efficient usingMLE in PPLs. By our testing, it generally takes less than one secondfor an insertion or deletion of a thousand tuples. In some rare casesinvolving the insertion or deletion of attributes, a new BN struc-ture should be constructed. Even in this case, the causal pattern ofthe original attributes is largely preserved. Therefore,
BayesCard can pre-specify some sub-structures and learn the new structure efficiently using the methods stated in the previous section.
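Incremental MLE updates for a categorical conditional distribution reduce to maintaining sufficient statistics (counts); a minimal sketch of the idea (an illustrative class, not BayesCard's actual update path):

```python
from collections import Counter

class CategoricalCPT:
    """P(X | parent) stored as counts so MLE updates are incremental."""

    def __init__(self):
        self.counts = Counter()          # (parent_value, x_value) -> n
        self.totals = Counter()          # parent_value -> n

    def update(self, parent_value, x_value, n=1):
        """Insert (n > 0) or delete (n < 0) tuples incrementally."""
        self.counts[(parent_value, x_value)] += n
        self.totals[parent_value] += n

    def prob(self, parent_value, x_value):
        total = self.totals[parent_value]
        return self.counts[(parent_value, x_value)] / total if total else 0.0

cpt = CategoricalCPT()
cpt.update("hot", "high", n=3)
cpt.update("hot", "low", n=1)
assert cpt.prob("hot", "high") == 0.75
cpt.update("hot", "high", n=4)           # a batch of inserted tuples
assert cpt.prob("hot", "high") == 7 / 8  # refreshed without relearning
```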
In this section, we analyze and demonstrate that
BayesCard satisfiesthe ADS criteria from all aspects, as shown in Table 1.
Algorithm.
A BN with an exactly learned structure can losslessly capture the data distribution, i.e., near-perfect estimation accuracy for all queries. We show empirically that even with an approximate tree structure,
BayesCard can achieve comparable or better accuracy than the current SOTA methods. The inference latency of BayesCard is roughly 1 ms per query (close to the Histogram method), thanks to our novel inference algorithms. Furthermore, as explained in Section 4,
BayesCard can learn a compact structure with small model size and fast training and update time.

Data.
Every dataset contains an inherent causal pattern, whichcan be discovered by
BayesCard . Building upon this structure,
BayesCard can represent its PDF accurately and efficiently. Specifically, the variables in PPL can characterize most data distribution types with varied domain sizes. Attribute correlation is merely a manifestation of the underlying causal pattern, which can be accurately represented. Moreover, for data with more attributes (larger scale), the proposed graph reduction technique can reduce a larger graph into a much smaller one for efficient inference. Therefore, the inference latency is also stable across various data settings.
System.
Both the structure and the distribution parameters of
the BayesCard model are interpretable and debuggable. Specifically, a DB expert can verify a learned structure based on prior knowledge of data causality (functional dependencies in DBs) and validate the learned parameters using basic probability rules (non-negative and summing to one). Since the probability inference of
BayesCard fol-lows the Bayesian rule, its performance is logical and predictable .Furthermore, the compiled VE does not contain any stochasticity,so the users’ error is reproducible . In this section, we empirically demonstrate the superiority of our
BayesCard over other
CardEst methods. In the following, Section 7.1first introduces the experimental setups. Next, Section 7.2 thor-oughly compares different
CardEst methods in terms of the
ADS criteria on single table datasets. Then, Section 7.3 evaluates theperformance and end-to-end query plan execution time on multi-table datasets. At last, Section 7.4 performs ablation studies on ourproposed algorithms and optimizations in
BayesCard method.
CardEst methods to compare with.
We compare our
BayesCard framework with the following
CardEst methods, including bothtraditional methods widely used in DBMS and four existing SOTADL-based methods. For each ML-based
CardEst method, we adopt the authors' source code and apply the same hyper-parameters as used in the original paper.
- Histogram is the simplest CardEst method, widely used in DBMSs such as Postgres [17].
- Sampling has been used in DBMSs such as MySQL [35]. In our testing, we randomly sample 1% of all tuples for CardEst.
- Naru [53] and NeuroCard [52] are DAR-based CardEst methods for single-table and multi-table join queries, respectively.
- DeepDB [20] is an SPN-based CardEst method.
- FLAT [58] is an FSPN-based CardEst method.
- MSCN [21] is the SOTA query-driven CardEst method. For each dataset, we train it with 10 queries generated in the same way as the workload.

Our BayesCard framework subsumes BNs with various combinations of structure learning and inference algorithms, as described in previous sections. In Sections 7.2 and 7.3, we use an exemplary BN with
the Chow-Liu tree structure learning algorithm and the compiled variable elimination inference algorithm with graph reduction optimizations. The comparison of different BNs realizable in
BayesCard and controlled ablation studies are deferred to Section 7.4.
Datasets and query workloads.
Our single-table experiments are performed on three datasets:

1)
DMV is a real-world dataset consisting of 11,575,483 tuples of vehicle registration information in New York. We use the same attributes as in [49, 53].

2)
CENSUS contains a population survey conducted by the U.S. Census Bureau in 1990. This dataset has 2,458,285 tuples and 68 attributes, including highly correlated attributes. Based on the RDC test [29], we find that more than half of the attributes are highly correlated with at least one other attribute. This dataset is very large in scale and has a very complicated distribution.

3)
SYNTHETIC datasets are a collection of human-generated datasets with varied data distribution skewness, attribute correlation, domain size, and number of attributes. We generated these datasets using a similar approach to a recent benchmark study [49]. They are used to evaluate models' stability w.r.t. changes in data.

For each dataset, we generate 1,500 selection queries as the workload. For each query 𝑄, we first select a subset of attributes as the filter attributes of 𝑄. For each selected attribute 𝑐, if it represents a continuous variable, we uniformly generate two values (𝑣₁, 𝑣₂) from its value domain and add the filter predicate "𝑣₁ ≤ 𝑐 ≤ 𝑣₂" to 𝑄. Otherwise, if 𝑐 is a categorical variable, we uniformly sample 𝑘 unique values {𝑣₁, 𝑣₂, · · · , 𝑣ₖ} from its domain and place the predicate "𝑐 IN {𝑣₁, · · · , 𝑣ₖ}" in 𝑄.

4) Multi-table IMDB:
We conduct the multi-table experiment on the Internet Movie Database (IMDB) benchmark. Prior work [26] shows that this DB contains complicated data structures and establishes it as a good test benchmark for cardinality estimators. We use the
JOB-light benchmark query workload with 70 queries proposedin the original paper [26] and create another workload of 1500
JOB-comp with more comprehensive and complicated queries.
JOB-light's IMDB schema contains six tables (title, cast_info, movie_info, movie_companies, movie_keyword, movie_info_idx) and five join operations in total, where every other table can only join with the primary table title. Each JOB-light query involves 3-6 tables with 1-4 filter predicates. The filter variety is not very diverse: all attributes take equality filters except the "title.production_year" attribute. In addition,
JOB-light's workload only contains 70 queries, which is not enough to account for the variance in model prediction. Thus, we synthesize 1,500
JOB-comp queries based onthe schema of
JOB-light with more filter predicates per query. Each
JOB-comp query involves 4-6 tables with 2-7 filter predicates. The queries are uniformly distributed across the joins of 4-6 tables. After determining the join graph, the filter predicate selection process is similar to the single-table case.
Evaluation metric:
We use the Q-error as our evaluation metric, defined as follows:

Q-error = max( Estimated Cardinality / True Cardinality, True Cardinality / Estimated Cardinality )

This evaluation metric is well recognized in the DBMS community and widely used in recent papers on cardinality estimation [16, 20, 47, 52, 53]. We report the 50% (median), 90%, 95%, and 100% (worst) Q-error quantiles to evaluate estimation accuracy.

Experimental environment:
All models are evaluated on Intel(R)Xeon(R) Platinum 8163 CPU with 64 cores, 128GB DDR4 mainmemory, and 1TB SSD. For a fair comparison, we compare themodel inference latency on CPU only since apart from the DARmodel (
Naru and
NeuroCard ) and
MSCN, the remaining methods' inference algorithms do not support GPU.
In this section, we compare the performance of
CardEst methodsin terms of
Algorithm and
Data criteria. iniu Wu , Amir Shaikhha , Rong Zhu , Kai Zeng , Yuxing Han , Jingren Zhou Table 2: Performance of
CardEst algorithms on single tables.
Dataset  Method     50%    90%    95%    100%   Latency (ms)
DMV      BayesCard
         Sampling   1.004  1.052  1.140  143.0  79
         Naru       1.003
         MSCN       1.210  2.263  4.507  151.8  3.4
CENSUS   BayesCard
         Histogram  5.561  259.8
         Sampling
We evaluate the
CardEst methods from fouraspects: estimation accuracy, inference latency, model size andtraining time, and updating effects.
Estimation accuracy:
The estimation accuracy on two real-worldsingle table datasets is reported in Table 2, where the color shadein each cell corresponds to the rank among different
CardEst meth-ods. When compared with traditional models (
Histogram and
Sam-pling ), BayesCard achieves 1–3 order of magnitude higher accuracythan both models. When compared with DL-based methods (
Naru , DeepDB and
FLAT), BayesCard has comparable or better estimation accuracy on the DMV dataset and is significantly more accurate on the CENSUS dataset. This is because these DL models can accurately represent the data distribution of DMV, which contains relatively little attribute correlation and fewer attributes. CENSUS, however, contains seven times as many attributes with more complex attribute correlations. As the learning space grows exponentially with the number of attributes,
Naru ’s accuracy droppedsignificantly. For
DeepDB and
FLAT, their SPN or FSPN structures cannot capture the data distribution well in the presence of a large number of highly correlated attributes, so their performance also degrades heavily.
Inference latency:
As shown in Table 2, apart from
Histogram ,which leverages the attribute independence assumption for fastinference,
BayesCard generally attains the best (1–2 orders of mag-nitude) inference latency among the result methods. Worth noticingthat we observe significant increase in latency from DMV to CEN-SUS datasets for all methods except for
BayesCard . BayesCard ’sinference time appears to be insensitive to the number of attributes,mainly because the novel graph reduction technique can reduce alarge CENSUS attribute graph to a much smaller one for inference.
Model size and training time:
As shown in Figure 4, apart fromthe traditional methods,
BayesCard achieves the smallest model size with the fastest training time, because the causal pattern of the datasets enables a compact representation of the data distribution. It is worth noticing that Sampling is a model-free method without model size or training time, so we do not include it in the figure.
Updating time:
We evaluate each method’s updating effects byfollowing a similar experimental setup of prior work [49]. Specif-ically, we create a copy of the original DMV dataset and sort thetuples based on the value of each column in an ascending order.Then, we take the first 20% of the data to train a stale model and use
Figure 4: Model storage and training time.

Table 3: Performance of model updates of different
CardEst methods on DMV. The baseline q-error is the 95% q-errorquoted from Table 2 for comparison.
Method            BayesCard  Histogram  Naru   DeepDB  FLAT
baseline q-error  1.049      143.6      1.035  1.193   1.066
95% q-error

the rest 80% as data insertion updates. This procedure ensures that the training dataset has a different data distribution than the testing dataset; otherwise, the stale model would perform well without model updates. Then, after each model finishes the updating process, we test it using the same query workload as in Table 2 and report the 95% q-errors and total update time in Table 3. Here, we refrain from comparing with the query-driven method MSCN because it requires a new query workload to update its model, which is unavailable in our experimental setting.
BayesCard , Histogram , and
DeepDB all preserve the original structure and only incrementally update the parameters, so in general they have the fastest update times. Among them,
Histogram has the fewest parameters to update, so it has the best update time. We use the method described in the original paper [58] to update
FLAT , which generates new sub-structures to fit the inserted datadistribution, so it is slightly slower than the previous three.
Naru uses the incoming data to fine-tune its pre-trained DNNs for three epochs, which is significantly slower than the others.

After the model updates, we observe that BayesCard has no drop in estimation accuracy, whereas the deep probabilistic models have degraded performance. The reasons can be summarized as follows: (1)
BayesCard ’s structure captures the data causal pattern, whichoften does not change after update; (2)
DeepDB ’s preserved structureis not robust against data distribution changes; (3) fine-tuning the
Naru's underlying DAR model overfits the information from the 20% of data it was previously trained on, leading to degraded performance.
Summary:
BayesCard attains comparable or better estimation accuracy, lower inference latency, smaller model size, and less training and update time than DL-based models. In addition, BayesCard is 1-3 orders of magnitude more accurate than traditional methods.
Data criteria.
We evaluate the stability of
CardEst methods interms of
Data criteria from four aspects: data distribution, attribute correlation, domain size, and number of attributes.

SYNTHETIC datasets are generated using a similar approach to a recent benchmark study [49]. Specifically, suppose we would like to generate a table 𝑇 with attributes {𝑇₁, . . . , 𝑇ₙ} and 10 tuples, where 𝑛 denotes the number of attributes (scale). We generate the first column of 𝑇 using a Pareto distribution (via the scipy.stats.pareto function), with a controlled skewness 𝑠 and domain size 𝑑. For each remaining attribute 𝑇ᵢ, we generate a column based on a previous attribute 𝑇ⱼ with 𝑗 < 𝑖 to control the correlation 𝑐: for each tuple (𝑡₁, . . . , 𝑡ₙ) in 𝑇, we set 𝑡ᵢ to 𝑡ⱼ with probability 𝑐, and set 𝑡ᵢ to a random value drawn from the Pareto distribution with probability 1 − 𝑐.

Figure 5: Comparing BayesCard's and DeepDB's stability: top-p% estimation q-error distributions under varied distribution skewness 𝑠 (𝑐 = 0.4, 𝑑 = 100, 𝑛 = 10), attribute correlation 𝑐 (𝑠 = 1.0, 𝑑 = 100, 𝑛 = 10), domain size 𝑑 (𝑠 = 1.0, 𝑐 = 0.4, 𝑛 = 10), and number of attributes 𝑛 (𝑠 = 1.0, 𝑐 = 0.4, 𝑑 = 100).

The experimental results on SYNTHETIC are shown in Figure 5. Due to space limits, we only plot the comparison between BayesCard and DeepDB on the estimation accuracy metric. Additional experimental results are reported in the appendix of the technical report [50]. We summarize our observations as follows.
Distribution (s):
Similar to the previous study [49], we find that increasing the Pareto distribution skewness severely degrades the performance of Naru and Sampling, but has only a mild effect on BayesCard and the other methods. This is because BayesCard, Histogram, FLAT, and DeepDB all use (multi-)histograms to represent distributions, which are robust against distribution changes.
Correlation (c):
The increase in c has no impact on BayesCard, a mild impact on Sampling, Naru, FLAT, and MSCN, and a severe impact on Histogram and DeepDB, which make local or global attribute independence assumptions. BayesCard is able to capture the causal pattern of the datasets, and thus can represent any attribute correlation accurately.
Domain (d):
The increase in the domain size degrades the estimation accuracy of all methods, because increasing d may increase the data complexity exponentially: there are d^n possible values that a tuple can take. Fortunately, except for Naru, the degradation in accuracy stays within a reasonable range for all other methods.
Scale (n):
Similar to domain size, increasing the number of attributes also increases the data complexity exponentially, and thus we expect to see a decrease in accuracy for all methods. Surprisingly, the performance of BayesCard was not affected by n at all. This is owing to the graph reduction technique, which significantly reduces the number of attributes involved during inference. This technique not only improves the inference latency but also increases the estimation accuracy, as potential modeling errors on the reduced attributes are eliminated as well. Apart from estimation accuracy, BayesCard also maintains very stable and robust performance in terms of inference latency, model size, and training time, which is analyzed in Appendix C [50].
Summary:
BayesCard is much more stable and robust than other
CardEst methods across datasets with varied data settings.
As reported in Table 4 and Figure 4, BayesCard achieves comparable performance with the current SOTAs on the two query workloads of the IMDB dataset and preserves its superior inference latency, lightweight model storage, and fast training. Specifically, the estimation accuracy of BayesCard is comparable to NeuroCard, slightly
better than DeepDB, and slightly worse than FLAT, but with up to 60× smaller model size and 10× faster training and inference.

Table 4: Performance of cardinality estimation algorithms on IMDB datasets with two query workloads (q-error at the 50%, 90%, 95%, and 100% quantiles, and inference latency in ms; e.g., on JOB-light, Histogram reaches 7.318 / 1006 / 5295 and Sampling 2.464 / 55.29 / 276.1 at the 50% / 90% / 95% quantiles).

Figure 6: End-to-end evaluation of BayesCard (average execution time in seconds vs. the number of tables in a query, for Postgres, FLAT, BayesCard, and true cardinality).

End-to-End evaluation on PostgreSQL.
Furthermore, we use theIMDB dataset to demonstrate
BayesCard ’s behavior in terms of
System criteria. The four aspects of
System criteria are rather conceptual and hard to compare quantitatively in experiments, so we incorporate
BayesCard into a commercial DBMS,
Postgres 9.6.6, to show that it can improve the query optimization process of a real system. Specifically, we evaluate the end-to-end query processing time for JOB-light queries as shown in Figure 6, and compare
BayesCard with the Postgres baseline,
FLAT, and the optimal result derived by inserting the true cardinality during query optimization. We do not compare with other methods since FLAT has established its SOTA performance in the same experiment, as reported in the original paper [58]. We observe that: 1) BayesCard improves the Postgres baseline by roughly 13%; with more accurate CardEst results, the query optimizer can generate better query plans with lower execution cost. 2) The improvement of
BayesCard is very close to that of the method using true cardinality in query compiling (14.2%). This verifies that the accuracy of
BayesCard is sufficient to generate high-quality query plans. Besides, even though BayesCard has slightly worse estimation accuracy, it still marginally outperforms FLAT. Both methods produce similar execution plans, and the marginal gain of BayesCard over FLAT is mainly credited to its faster inference latency. 3) The improvement of
BayesCard and
FLAT becomes more significant on queries joining more tables, because the execution plan for a query joining 2 or 3 tables is almost fixed. For queries joining more tables, however, the inaccurate Postgres baseline results may lead to a sub-optimal query plan, while BayesCard and FLAT, which provide more accurate CardEst results, can find a better plan. This phenomenon has also been observed and explained in [36, 58].

Table 5: Comparing different structure learning algorithms of BayesCard on CENSUS (95% q-error, inference time (s), model size (MB), training time (min), and update time (s) per structure learning algorithm).
Summary:
The integration of
BayesCard into Postgres validates it as a practical counterpart of the CardEst component in Postgres and also verifies that
BayesCard is a system-friendly
CardEst method.
BayesCard
In this section, we compare different
BayesCard's structure learning algorithms, perform ablation studies on the inference algorithms, and summarize take-home messages for using
BayesCard . Comparing structure learning algorithms.
We report the estimation accuracy, inference latency without any proposed techniques, training time, model size, and update time on the CENSUS dataset for various structure learning algorithms in Table 5. For the exact and greedy algorithms, we incorporated the "expert knowledge" described in Section 4.3; otherwise, these algorithms become intractable and cannot generate the BN's structure. We observe that with a more accurate structure learning algorithm (exact), the estimation accuracy improves significantly, but at a great cost to the other four dimensions. We did not report results for the DMV and IMDB datasets, which have far fewer attributes, because their data causal patterns are much simpler and different structure learning algorithms have similar performance.
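For reference, the Chow-Liu tree construction compared here can be sketched as follows. This is a minimal sketch following the classic recipe [10] (pairwise empirical mutual information plus a maximum spanning tree grown with Prim's algorithm); BayesCard's actual implementation, including the "expert knowledge" hooks, is more involved.

```python
import numpy as np
from collections import Counter

def mutual_information(x, y):
    """Empirical mutual information between two discrete columns."""
    n = len(x)
    pxy, px, py = Counter(zip(x, y)), Counter(x), Counter(y)
    mi = 0.0
    for (a, b), cnt in pxy.items():
        p_ab = cnt / n
        # p_ab * log(p_ab / (p_a * p_b)), with p_a = px[a]/n, p_b = py[b]/n
        mi += p_ab * np.log(p_ab * n * n / (px[a] * py[b]))
    return mi

def chow_liu_tree(data):
    """Directed edges (parent, child) of a Chow-Liu tree: a maximum
    spanning tree over pairwise MI, oriented away from attribute 0."""
    n_attrs = data.shape[1]
    mi = {(i, j): mutual_information(data[:, i], data[:, j])
          for i in range(n_attrs) for j in range(i + 1, n_attrs)}
    in_tree, edges = {0}, []
    while len(in_tree) < n_attrs:           # Prim's algorithm on -MI
        u, v = max(((i, j) for i in in_tree for j in range(n_attrs)
                    if j not in in_tree),
                   key=lambda e: mi[tuple(sorted(e))])
        edges.append((u, v))                # orient parent -> child
        in_tree.add(v)
    return edges
```

Because only pairwise statistics are needed, this runs in time quadratic in the number of attributes, which is why Chow-Liu stays tractable where exact structure learning does not.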
Ablation study of inference algorithms.
We compare the novel inference optimizations of
BayesCard with the original variable elimination (VE) and belief propagation (BP) algorithms on a model learned with the Chow-Liu tree algorithm on the CENSUS dataset, as shown in Table 6. We have the following observations: (1) the latency of the original VE and BP algorithms (780 ms per query) is unaffordable for practical systems; (2) the graph reduction (GR) and just-in-time compilation (JIT) optimizations do not affect the estimation accuracy; (3) GR and JIT alone improve the inference latency by 5 and 30 times respectively, and by 325 times when combined for VE; (4) the progressive sampling algorithm (PS) produces a 4 times larger estimation error but with a significant improvement in latency. It is worth noting that the inference latency of PS and PS+GR can be much faster than VE+GR+JIT for a BayesCard model with a complex structure (e.g., one learned by the exact structure learning algorithm).
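The GR step can be sketched as a simple ancestor closure over the BN's DAG: only the query attributes and their ancestors can affect the query probability, so VE on the reduced sub-graph returns the same answer (Theorem 1 in the appendix). A minimal sketch, with a dict-of-parents DAG representation that is our own scaffolding:

```python
def reduce_graph(parents, query_attrs):
    """Graph reduction: keep only the query attributes and their
    ancestors; VE on this sub-DAG yields the same probability.

    parents: dict mapping each node to the list of its parent nodes.
    """
    keep, stack = set(), list(query_attrs)
    while stack:
        node = stack.pop()
        if node not in keep:
            keep.add(node)
            stack.extend(parents[node])     # walk up into the ancestors
    # Keep only edges whose both endpoints survive.
    return {n: [p for p in parents[n] if p in keep]
            for n in parents if n in keep}

# A -> B -> D and C -> D; a query on B needs only {A, B}.
dag = {"A": [], "B": ["A"], "C": [], "D": ["B", "C"]}
print(reduce_graph(dag, ["B"]))   # {'A': [], 'B': ['A']}
```

The closure costs at most one traversal of the DAG, which is why GR adds essentially no overhead while shrinking the factors VE must process.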
Take-home messages for
BayesCard users. (1) The Chow-Liu tree structure learning algorithm can efficiently generate a compact model, which has improved inference latency and stable performance compared with other structure learning algorithms. The degradation in accuracy can be compensated for using the "expert knowledge" described in Section 4.3. (2) The
VE+GR+JIT inference algorithm efficiently produces exact estimation for BNs with discrete attributes, and it is debuggable, predictable, reproducible, and very friendly for system development.

Table 6: Ablation study of different inference algorithms of BayesCard on CENSUS (95% q-error and latency for VE, BP, VE+GR, VE+JIT, VE+GR+JIT, PS, and PS+GR).

However, PS+GR is a general approach that has guaranteed efficiency for any complex DAG-structured BN and supports continuous attributes with any distribution. (3)
BayesCard provides a general
CardEst framework for users to explore differenttrade-offs to suit their data and system settings.
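The exact-VE versus approximate-PS trade-off in messages (2) and (3) can be illustrated with a minimal progressive-sampling sketch. The dict-based CPD interface and the function signature below are our own scaffolding, not BayesCard's API: each attribute's conditional distribution, restricted to the query predicate, contributes its in-range probability mass to the sample weight, and the next value is drawn from the renormalized restricted distribution.

```python
import random

def progressive_sample(order, cpd, ranges, n_samples=100, seed=0):
    """Progressive-sampling estimate of P(attr in ranges[attr] for all attrs).

    order:  attributes in topological order.
    cpd:    cpd[attr](assignment) -> dict mapping value -> probability.
    ranges: attr -> set of accepted values (unconstrained attrs omitted).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        weight, assignment = 1.0, {}
        for attr in order:
            dist = cpd[attr](assignment)
            accepted = ranges.get(attr, dist)      # dict membership = all values
            mass = sum(p for v, p in dist.items() if v in accepted)
            if mass == 0.0:
                weight = 0.0
                break
            weight *= mass                         # in-range probability mass
            # Sample the next value from the renormalized restricted dist.
            vals = [(v, p / mass) for v, p in dist.items() if v in accepted]
            r, acc = rng.random(), 0.0
            for v, p in vals:
                acc += p
                if r <= acc:
                    assignment[attr] = v
                    break
            else:
                assignment[attr] = vals[-1][0]     # numeric guard
        total += weight
    return total / n_samples
```

For point queries on a chain the estimate is exact; for range predicates on wide BNs the sampling introduces the modest error but large latency gain reported in Table 6.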
We briefly revisit the existing
CardEst methods based on BN and the supervised
CardEst methods.
BN-based methods were explored for CardEst decades ago. Getoor et al. [16] used a greedy algorithm for BN structure learning, variable elimination for probability inference, and the referential integrity assumption for join estimation. Tzoumas et al. [47] learned an exact-structured BN and used belief propagation for inference. Halford et al. [18] adopted the Chow-Liu tree structure learning algorithm, the VE inference algorithm, and the uniformity assumption for join estimation. However, none of the practical DBMSes incorporates these methods, due to their impractical structure learning processes, intractable inference latency, or inaccurate estimation for join queries caused by over-simplified assumptions.
Supervised
CardEst methods use the feedback of past queries to train ML models that map a featurized query Q to its actual cardinality. The first approach using neural networks for cardinality estimation was published for UDF predicates [24]. Later on, a regression-based model [1] and a semi-automatic alternative [30] were presented. Recently, supervised DL-based approaches used a multi-set convolutional network (MSCN) [21], a tree-LSTM [45], and a lightweight XGBoost model [12] for CardEst. However, supervised learning approaches have two major drawbacks, as mentioned in [20]: (1) their models neglect the data itself and are not robust to changes in the query workload; (2) collecting the training data can be very expensive, and the training data has to be recollected when the workload changes. Therefore, in general, query-driven supervised ML methods for cardinality estimation are not as flexible and accurate as data-driven unsupervised ML methods.
This paper proposes
BayesCard, the first framework that unifies the existing efforts on PPLs and BNs and optimizes them for CardEst in different data and system settings. BayesCard revitalizes BNs with new machinery in model construction and probability inference, which makes it a desirable CardEst method satisfying the algorithm, data, and system criteria at the same time. Extensive experimental studies and end-to-end system deployment establish BayesCard's superiority over existing CardEst methods. Furthermore, BayesCard captures the underlying data causality, which benefits other data-related tasks. In future work, we plan to explore the possibility of using BayesCard for other tasks, such as data cleaning, entity matching, and approximate query processing.
REFERENCES
[1] Mert Akdere and Ugur Cetintemel. 2012. Learning-based query performance modeling and prediction. ICDE (2012).
[2] Ankur Ankan and Abinash Panda. 2015. pgmpy: Probabilistic graphical models using Python. In Proceedings of the 14th Python in Science Conference (SCIPY 2015). Citeseer.
[3] Eli Bingham, Jonathan P. Chen, Martin Jankowiak, Fritz Obermeyer, Neeraj Pradhan, Theofanis Karaletsos, Rohit Singh, Paul Szerlip, Paul Horsfall, and Noah D. Goodman. 2019. Pyro: Deep universal probabilistic programming. Journal of Machine Learning Research (2019).
[4] Remco R. Bouckaert. 1993. Probabilistic network construction using the minimum description length principle. ECSQARU (1993).
[5] Supriyo Chakraborty, Richard Tomsett, Ramya Raghavendra, Daniel Harborne, Moustafa Alzantot, Federico Cerutti, Mani Srivastava, Alun Preece, Simon Julier, Raghuveer M. Rao, Troy D. Kelley, Dave Braines, Murat Sensoy, Christopher J. Willis, and Prudhvi Gurram. 2017. Interpretability of deep learning models: A survey of results. 1–6.
[6] Surajit Chaudhuri, Venkatesh Ganti, and Luis Gravano. 2004. Selectivity estimation for string predicates: Overcoming the underestimation problem. In Proceedings of the 20th International Conference on Data Engineering. IEEE, 227–238.
[7] David Maxwell Chickering. 1996. Learning Bayesian networks is NP-complete. In Learning from Data: Artificial Intelligence and Statistics (1996).
[8] David Maxwell Chickering. 2002. Optimal structure identification with greedy search. JMLR (2002).
[9] David Maxwell Chickering and David Heckerman. 1997. Efficient approximations for the marginal likelihood of Bayesian networks with hidden variables. ML 29 (1997).
[10] C. K. Chow and C. N. Liu. 1968. Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory (1968).
[11] Adnan Darwiche. 2009. Modeling and Reasoning with Bayesian Networks. Cambridge University Press.
[12] Anshuman Dutt, Chi Wang, Azade Nazi, Srikanth Kandula, Vivek Narasayya, and Surajit Chaudhuri. 2019. Selectivity estimation for range predicates using lightweight models. Proceedings of the VLDB Endowment 12, 9 (2019), 1044–1057.
[13] Wenfei Fan, Floris Geerts, Jianzhong Li, and Ming Xiong. 2010. Discovering conditional functional dependencies. IEEE Transactions on Knowledge and Data Engineering 23, 5 (2010), 683–698.
[14] Fei Fu and Qing Zhou. 2013. Learning sparse causal Gaussian networks with experimental intervention: regularization and coordinate descent. JASA (2013).
[15] Mathieu Germain, Karol Gregor, Iain Murray, and Hugo Larochelle. 2015. MADE: Masked autoencoder for distribution estimation. International Conference on Machine Learning (2015), 881–889.
[16] Lise Getoor, Ben Taskar, and Daphne Koller. 2001. Selectivity estimation using probabilistic models. SIGMOD (2001).
[17] The PostgreSQL Global Development Group. 2018. Documentation PostgreSQL 10.3.
[18] Max Halford, Philippe Saint-Pierre, and Franck Morvan. 2019. An approach based on Bayesian networks for query selectivity estimation. DASFAA (2019).
[19] ML 20 (1995).
[20] Benjamin Hilprecht, Andreas Schmidt, Moritz Kulessa, Alejandro Molina, Kristian Kersting, and Carsten Binnig. 2020. DeepDB: Learn from Data, not from Queries! PVLDB (2020).
[21] Andreas Kipf, Thomas Kipf, Bernhard Radke, Viktor Leis, Peter Boncz, and Alfons Kemper. 2019. Learned cardinalities: Estimating correlated joins with deep learning. CIDR (2019).
[22] Daphne Koller and Nir Friedman. 2009. Probabilistic Graphical Models: Principles and Techniques. MIT Press.
[23] Tejas D. Kulkarni, Pushmeet Kohli, Joshua B. Tenenbaum, and Vikash Mansinghka. 2015. Picture: A probabilistic programming language for scene perception. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4390–4399.
[24] Seetha Lakshmi and Shaoyu Zhou. 1998. Selectivity estimation in extensible databases – A neural network. VLDB (1998).
[25] Hao-Chih Lee, Matteo Danieletto, Riccardo Miotto, Sarah T. Cherng, and Joel T. Dudley. 2019. Scaling structural learning with NO-BEARS to infer causal transcriptome networks. arXiv e-prints (2019), 1911.00081.
[26] Viktor Leis, Andrey Gubichev, Atanas Mirchev, Peter Boncz, Alfons Kemper, and Thomas Neumann. 2015. How good are query optimizers, really? Proc. VLDB Endow.
[27] The VLDB Journal (2018).
[28] Guy Lohman. 2014. Is query optimization a solved problem? SIGMOD (2014).
[29] David Lopez-Paz, Philipp Hennig, and Bernhard Schölkopf. 2013. The randomized dependence coefficient. In NIPS. 1–9.
[30] Tanu Malik, Randal Burns, and Nitesh Chawla. 2007. A black-box approach to query cardinality estimation. CIDR (2007).
[31] James Martens and Venkatesh Medabalimi. 2014. On the expressive efficiency of sum product networks. arXiv preprint arXiv:1411.7717 (2014).
[32] Yossi Matias, Jeffrey Scott Vitter, and Min Wang. 1998. Wavelet-based histograms for selectivity estimation. SIGMOD (1998).
[33] T. Minka, J. M. Winn, J. P. Guiver, Y. Zaykov, D. Fabian, and J. Bronskill. 2018. Infer.NET 0.3. Microsoft Research Cambridge. http://dotnet.github.io/infer.
[34] Thomas Neumann. 2011. Efficiently compiling efficient query plans for modern hardware. PVLDB 4, 9 (2011), 539–550.
[35] Oracle Corporation. 2020. MySQL 8.0 Reference Manual.
[36] Matthew Perron, Zeyuan Shang, Tim Kraska, and Michael Stonebraker. 2019. How I learned to stop worrying and love re-optimization. In ICDE. 1758–1761.
[37] Hoifung Poon and Pedro Domingos. 2011. Sum-product networks: A new deep architecture. IEEE International Conference on Computer Vision Workshops (2011).
[38] John Salvatier, Thomas V. Wiecki, and Christopher Fonnesbeck. 2016. Probabilistic programming in Python using PyMC3. PeerJ Computer Science (2016). https://doi.org/10.7717/peerj-cs.55
[39] Mauro Scanagatta, Giorgio Corani, Cassio P. De Campos, and Marco Zaffalon. 2016. Optimal structure identification with greedy search. NeurIPS (2016).
[40] Jacob Schreiber. 2018. Pomegranate: Fast and flexible probabilistic modeling in Python. Journal of Machine Learning Research 18, 164 (2018), 1–6.
[41] Jacob Schreiber. 2018. pomegranate: Fast and flexible probabilistic modeling in Python. JMLR (2018).
[42] P. G. Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. Lorie, and T. G. Price. 1979. Access path selection in a relational database management system. ACM SIGMOD International Conference on Management of Data (1979).
[43] Amir Shaikhha, Yannis Klonatos, and Christoph Koch. 2018. Building efficient query engines in a high-level language. ACM Transactions on Database Systems 43, 1, Article 4 (April 2018), 45 pages.
[44] Amir Shaikhha, Yannis Klonatos, Lionel Parreaux, Lewis Brown, Mohammad Dashti, and Christoph Koch. 2016. How to architect a query compiler. In Proceedings of the 2016 International Conference on Management of Data (San Francisco, California, USA) (SIGMOD'16). ACM, New York, NY, USA, 1907–1922.
[45] Ji Sun and Guoliang Li. 2019. An end-to-end learning-based cost estimator. arXiv preprint arXiv:1906.02560 (2019).
[46] Dustin Tran, Alp Kucukelbir, Adji B. Dieng, Maja Rudolph, Dawen Liang, and David M. Blei. 2016. Edward: A library for probabilistic modeling, inference, and criticism. arXiv preprint arXiv:1610.09787 (2016).
[47] Kostas Tzoumas, Amol Deshpande, and Christian S. Jensen. 2011. Lightweight graphical models for selectivity estimation without independence assumptions. Proceedings of the VLDB Endowment 4, 11 (2011), 852–863.
[48] Kostas Tzoumas, Amol Deshpande, and Christian S. Jensen. 2013. Efficiently adapting graphical models for selectivity estimation. Proceedings of the VLDB Endowment 1, 22 (2013).
[49] Xiaoying Wang, Changbo Qu, Weiyuan Wu, Jiannan Wang, and Qingqing Zhou. 2020. Are we ready for learned cardinality estimation? arXiv:2012.06743 [cs.DB]
[50] Ziniu Wu and Amir Shaikhha. 2020. BayesCard: Revitalizing Bayesian networks for cardinality estimation. arXiv:2012.14743 [cs.DB]
[51] Ziniu Wu, Rong Zhu, Andreas Pfadler, Yuxing Han, Jiangneng Li, Zhengping Qian, Kai Zeng, and Jingren Zhou. 2020. FSPN: A new class of probabilistic graphical model. arXiv:2011.09020 [cs.AI]
[52] Zongheng Yang, Amog Kamsetty, Sifei Luan, Eric Liang, Yan Duan, Xi Chen, and Ion Stoica. 2020. NeuroCard: One cardinality estimator for all tables. arXiv (2020).
[53] Zongheng Yang, Eric Liang, Amog Kamsetty, Chenggang Wu, Yan Duan, Xi Chen, Pieter Abbeel, Joseph M. Hellerstein, Sanjay Krishnan, and Ion Stoica. 2019. Deep unsupervised cardinality estimation. Proceedings of the VLDB Endowment (2019).
[54] Qiaoling Ye, Arash Amini, and Qing Zhou. 2020. Optimizing regularized Cholesky score for order-based learning of Bayesian networks. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020).
[55] Changhe Yuan, Brandon Malone, and Xiaojian Wu. 2011. Learning optimal Bayesian networks using A* search. IJCAI (2011).
[56] Zhuoyue Zhao, Robert Christensen, Feifei Li, Xiao Hu, and Ke Yi. 2018. Random sampling over joins revisited. In Proceedings of the International Conference on Management of Data (2018).
[57] Rong Zhu, Andreas Pfadler, Ziniu Wu, Yuxing Han, Xiaoke Yang, Feng Ye, Zhenping Qian, Jingren Zhou, and Bin Cui. 2021. Efficient and scalable structure learning for Bayesian networks: Algorithms and applications. ICDE (2021).
[58] Rong Zhu, Ziniu Wu, Yuxing Han, Kai Zeng, Andreas Pfadler, Zhengping Qian, Jingren Zhou, and Bin Cui. 2020. FLAT: Fast, lightweight and accurate method for cardinality estimation. arXiv preprint arXiv:2011.09022 (2020).

A PROOF OF THEOREM 1
Theorem 1. Given a BN with DAG $G = (V, E)$ representing a table $T$ with attributes $V = \{T_1, \cdots, T_n\}$, and a query $Q = (T'_1 = t'_1 \wedge \cdots \wedge T'_k = t'_k)$ where $T'_i \in V$, let $G' = (V', E')$ be the sub-graph of $G$ where $V' = \bigcup_{1 \le i \le k} \mathit{Ancestor}(T'_i)$ and $E'$ contains all edges in $E$ with both endpoints in $V'$; here $\mathit{Ancestor}(T'_i)$ includes $T'_i$ itself, all parent nodes of $T'_i$, and all parents of parents recursively. Then performing VE of the BN on the full graph $G$ is equivalent to running VE on the reduced graph $G'$.

Proof of Theorem 1: Given the probability query $Q$ on the original graph $G$ and the reduced graph $G'$ defined above, define $Q_V = \{T'_1, \cdots, T'_k\}$ and $V \setminus Q_V = \{T''_1, \cdots, T''_{n-k}\}$. In this proof, we only show that running VE on $G$ is equivalent to running VE on $G'$; the claim for progressive sampling then follows naturally, as it directly approximates the computation of VE.

First, by the law of total probability, we have Equation (2):

$$P_T(T'_1 = t'_1, \cdots, T'_k = t'_k) = \sum_{t''_1 \in D(T''_1)} \cdots \sum_{t''_{n-k} \in D(T''_{n-k})} \Big[ \prod_{T'_i \in Q_V} P_T(T'_i = t'_i \mid \mathit{Parents}(T'_i)) \cdot \prod_{T''_i \in V \setminus Q_V} P_T(T''_i = t''_i \mid \mathit{Parents}(T''_i)) \Big] \quad (2)$$

where $D(T''_i)$ denotes the domain of attribute $T''_i$ and $\mathit{Parents}(T''_i)$ denotes the parents of node $T''_i$ in graph $G$. For simplicity, we write $\mathit{Parents}(T''_i)$ for the event $(T''_j = t''_j,\ \forall T''_j \in \mathit{Parents}(T''_i))$. The VE algorithm essentially computes Equation (2) by summing out one attribute from $V \setminus Q_V$ at a time until all $T''_i \in V \setminus Q_V$ are eliminated [22].

Alternatively, by the law of total probability and the conditional independence assumption, we can derive Equation (3):
$$P_T(T'_1 = t'_1, \cdots, T'_k = t'_k) = \sum_{T''_i \in \bigcup_{1 \le j \le k} \mathit{Parents}(T'_j)} \; \sum_{t''_i \in D(T''_i)} \Big[ P_T\big(T'_1 = t'_1, \cdots, T'_k = t'_k \,\big|\, \bigcup_{1 \le j \le k} \mathit{Parents}(T'_j)\big) \cdot P_T\big(\bigcup_{1 \le j \le k} \mathit{Parents}(T'_j)\big) \Big]$$
$$= \sum_{T''_i \in \bigcup_{1 \le j \le k} \mathit{Parents}(T'_j)} \; \sum_{t''_i \in D(T''_i)} \Big[ \prod_{T'_j \in Q_V} P_T(T'_j = t'_j \mid \mathit{Parents}(T'_j)) \cdot P_T\big(\bigcup_{1 \le j \le k} \mathit{Parents}(T'_j)\big) \Big] \quad (3)$$

where $\mathit{Parents}(T''_i)$ denotes the parents of node $T''_i$ in graph $G$, which are the same as the parents of $T''_i$ in graph $G'$. By the definition of the reduced graph $G'$, $V' = \bigcup_{1 \le i \le k} \mathit{Ancestor}(T'_i)$, where $\mathit{Ancestor}(T'_i)$ includes all parent nodes of $T'_i$ and all parents of parents recursively. Let $|V'| = n'$ and $V' \setminus Q_V = \{T'''_1, \cdots, T'''_{n'-k}\}$. Recursively expanding $P_T\big(\bigcup_{1 \le j \le k} \mathit{Parents}(T'_j)\big)$ using Equation (3) yields Equation (4):

$$P_T(T'_1 = t'_1, \cdots, T'_k = t'_k) = \sum_{t'''_1 \in D(T'''_1)} \cdots \sum_{t'''_{n'-k} \in D(T'''_{n'-k})} \Big[ \prod_{T'_i \in Q_V} P_T(T'_i = t'_i \mid \mathit{Parents}(T'_i)) \cdot \prod_{T'''_i \in V' \setminus Q_V} P_T(T'''_i = t'''_i \mid \mathit{Parents}(T'''_i)) \Big] \quad (4)$$

Equation (4) has the same form as Equation (2), with fewer attributes in the summation. Thus the VE algorithm [22] can compute Equation (4) by eliminating one attribute from $V' \setminus Q_V$ at a time, and running VE on $G$ is equivalent to running VE on $G'$.

B COMPUTING THE DEPENDENCE LEVEL BETWEEN TABLES
We use the randomized dependence coefficient (RDC) [29] as a measure of the dependence level between two attributes. RDC is invariant with respect to marginal distribution transformations, has a low computational cost, and is widely used in many statistical methods [20, 51]. The complexity of RDC is roughly $O(n \log n)$, where $n$ is the sample size for the two attributes.

Figure 7: PRM ensemble learning algorithm demonstration (tables A–K; panels (a) $k' = 2$, (b) $k' = 3$, and (c) the result).
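The RDC measure used throughout this appendix can be sketched as follows: a rank-copula transform of each attribute, random sinusoidal features, and then the largest canonical correlation between the two feature sets, following [29]. This is a compact sketch; the feature count k and scale s are illustrative defaults, not BayesCard's settings.

```python
import numpy as np

def rdc(x, y, k=10, s=0.5, seed=0):
    """Randomized dependence coefficient of two 1-D samples [29]."""
    rng = np.random.default_rng(seed)

    def features(v):
        # Empirical copula: ranks scaled to [0, 1), plus a bias column.
        r = np.argsort(np.argsort(v)) / float(len(v))
        c = np.column_stack([r, np.ones_like(r)])
        w = rng.normal(0.0, s, (c.shape[1], k))
        return np.sin(c @ w)                 # random nonlinear projections

    fx = features(np.asarray(x, float))
    fy = features(np.asarray(y, float))
    # Largest canonical correlation via the joint covariance matrix.
    cov = np.cov(np.hstack([fx, fy]).T)
    cxx, cyy, cxy = cov[:k, :k], cov[k:, k:], cov[:k, k:]
    m = np.linalg.pinv(cxx) @ cxy @ np.linalg.pinv(cyy) @ cxy.T
    return float(np.sqrt(np.max(np.abs(np.linalg.eigvals(m)))))
```

Averaging this score over all attribute pairs of two tables then gives the table-level dependence of Equation (5) below.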
B.1 Calculating the pairwise RDC score between two tables
Recall Figure 7: we have a DB schema with 11 tables $A, \cdots, K$, whose join relationships are defined as a tree $\mathcal{T}$ (the left image). In addition, we have unbiased samples $\mathcal{S}$ of the full outer join of all tables in $\mathcal{T}$, obtained using the previously mentioned approach [56]. Now consider two tables $T, R \in \{A, \cdots, K\}$ in this schema, with attributes $T_1, \cdots, T_n$ and $R_1, \cdots, R_m$ respectively. We can compute the pairwise RDC score $RDC_{i,j}$ between attributes $T_i$ and $R_j$ based on $\mathcal{S}$, as described in [29]. We then take the average as the level of dependence between $T$ and $R$, as in Equation (5):

$$\sum_{1 \le i \le n} \sum_{1 \le j \le m} RDC_{i,j} \,/\, (n \cdot m) \quad (5)$$

Thus, we can compute the dependence level matrix $M$ of size $11 \times 11$, with each entry specifying the dependence level between two tables in the schema. The edge weights of the original $\mathcal{T}$ (the left image) can then be taken directly from $M$. The complexity of calculating $M$ is thus $O(m \cdot |\mathcal{S}| \cdot \log(|\mathcal{S}|))$, where $m$ is the total number of attributes across all tables.

B.2 Calculating the pairwise RDC score between two sets of tables
During the PRM ensemble construction procedure, we sometimes need to calculate the dependence level between two sets of tables, such as the dependence level of $\{A, D\}$ and $\{H, K\}$ in the right image of Figure 7. Similarly to the case in Section B.1, this value can be computed directly from $M$. Let $Attr(T)$ denote the set of attributes in table $T$. As in Equation (5), the level of dependence between $\{A, D\}$ and
$\{H, K\}$ is defined as in Equations (6) and (7):

$$\sum_{ad \in Attr(\{A,D\})} \; \sum_{hk \in Attr(\{H,K\})} RDC_{ad,hk} \,\Big/\, \big( (|Attr(A)| + |Attr(D)|) \cdot (|Attr(H)| + |Attr(K)|) \big)$$
$$= \Big( \sum_{a \in Attr(A)} \sum_{h \in Attr(H)} RDC_{a,h} + \sum_{a \in Attr(A)} \sum_{k \in Attr(K)} RDC_{a,k} + \sum_{d \in Attr(D)} \sum_{h \in Attr(H)} RDC_{d,h} + \sum_{d \in Attr(D)} \sum_{k \in Attr(K)} RDC_{d,k} \Big) \Big/ \big( (|Attr(A)| + |Attr(D)|) \cdot (|Attr(H)| + |Attr(K)|) \big) \quad (6)$$
$$= \Big( M[A,H] \cdot |Attr(A)| \cdot |Attr(H)| + M[A,K] \cdot |Attr(A)| \cdot |Attr(K)| + M[D,H] \cdot |Attr(D)| \cdot |Attr(H)| + M[D,K] \cdot |Attr(D)| \cdot |Attr(K)| \Big) \Big/ \big( (|Attr(A)| + |Attr(D)|) \cdot (|Attr(H)| + |Attr(K)|) \big) \quad (7)$$

Thus the weight of the edge can be updated quickly from the pre-computed $M$ and the number of attributes in each table.

C ADDITIONAL EXPERIMENTAL RESULTS
The additional experiments on SYNTHETIC comparing BayesCard with the other methods are reported in Table 7 and Table 8. Please note that we did not fine-tune the hyper-parameters of the DL-based methods, since training on all the datasets takes so long that we could not afford to explore different hyper-parameters; we believe the experimental results suffice to show the insights. On the other hand, BayesCard does not have hyper-parameters to fine-tune, which is another advantage of our method. We summarize our observations as follows.
Distribution (s):
For the inference latency and model size, we find that increasing the Pareto distribution skewness degrades the performance of DeepDB and FLAT, but has little effect on all other methods. This is because DeepDB and FLAT tend to generate larger models on more complex data. As a result, their training time also grows with the skewness level. The training time of Naru and MSCN also grows with the skewness level, as their underlying DNNs need more time to model the complex distributions.
Correlation (c):
For the inference latency and model size, the increase of c has a negative impact on DeepDB and FLAT, as their models become larger on more correlated data. However, FLAT behaves well on very highly correlated data, since it can split such attributes from the others to reduce the model size. The impact of c on BayesCard is mild, and its model size remains affordable. For the training time, the increase of c affects all algorithms except Histogram and Sampling; this is also reasonable, as they need more time to model the complex distribution. Note that, for each setting of c, the training time of our BayesCard is still much lower than theirs.
Domain (d):
For the inference latency, model size, and training time, the increase of d has a significant impact on Naru, DeepDB, and FLAT. This is because increasing the domain size increases the data complexity exponentially, so they need more neurons or nodes to model the distribution. In contrast, the impact on our BayesCard is much milder.
Scale (n):
Similar to domain size, increasing the number of attributes also increases the data complexity exponentially, and thus we expect to see an increase in latency, model size, and training time for almost all methods. However, in comparison with Naru, DeepDB, FLAT, and MSCN, the impact on BayesCard is not very significant.
Summary:
In comparison with DL-based
CardEst methods, our BayesCard attains very stable and robust performance in terms of inference latency, model size, and training time.

Table 7: Stability performance of different
CardEst methods w.r.t. changes in data distribution skewness and correlation. Each method (BayesCard, Histogram, Sampling, Naru, DeepDB, FLAT, MSCN) is evaluated on accuracy (95% q-error), latency (ms), model size (KB), and training time (s), varying s ∈ {0, 0.3, 0.6, 1.0, 1.5, 2.0} with c=0.4, d=100, n=10, and c ∈ {0, 0.2, 0.4, 0.6, 0.8, 1.0} with s=1.0, d=100, n=10.
Table 8: Scalability performance of different
CardEst methods w.r.t. changes in data domain size and number of attributes. Each method (BayesCard, Histogram, Sampling, Naru, DeepDB, FLAT, MSCN) is evaluated on accuracy (95% q-error), latency (ms), model size (KB), and training time (s), varying d ∈ {10, 100, 500, 1000, 5000, 10000} with s=1.0, c=0.4, n=10, and n ∈ {2, 5, 10, 50, 100, 200} with s=1.0, c=0.4, d=100.