A Survey on Advancing the DBMS Query Optimizer: Cardinality Estimation, Cost Model, and Plan Enumeration
Hai Lan
RMIT University, Melbourne, Australia
[email protected]

Zhifeng Bao
RMIT University, Melbourne, Australia
[email protected]

Yuwei Peng*
Wuhan University, Wuhan, China
[email protected]
ABSTRACT
The query optimizer is at the heart of database systems. The cost-based optimizer studied in this paper is adopted in almost all current database systems. A cost-based optimizer introduces a plan enumeration algorithm to find a (sub)plan, then uses a cost model to obtain the cost of that plan, and selects the plan with the lowest cost. In the cost model, cardinality, the number of tuples flowing through an operator, plays a crucial role. Due to the inaccuracy in cardinality estimation, errors in the cost model, and the huge plan space, the optimizer cannot find the optimal execution plan for a complex query in a reasonable time. In this paper, we first study in depth the causes behind the limitations above. Next, we review the techniques used to improve the quality of the three key components of the cost-based optimizer: cardinality estimation, cost model, and plan enumeration. We also provide our insights on future directions for each of the above aspects.
KEYWORDS
Query Optimizer; Cardinality Estimation; Cost Model; Plan Enumeration
1 INTRODUCTION

The query optimizer is at the heart of relational database management systems (RDBMSes) and some big data processing engines, e.g., SCOPE [7]. Given a query written in a declarative language (e.g., SQL), the optimizer finds the most efficient execution plan (also called the physical plan) and feeds it to the executor. Thus, most of the time, users only need to think about how to translate their requirements into a valid query, without analyzing how to run the query efficiently. Almost all systems adopt a cost-based optimizer based on the architecture of System R [79] or Volcano/Cascades [27, 28]. Figure 1 illustrates the three most important components in a cost-based optimizer: cardinality estimation (CE), cost model (CM), and plan enumeration (PE). CE uses statistics of the data and some assumptions about data distribution, column correlation, and join relationships to estimate the number of tuples generated by an intermediate operator, which is also crucial for other search problems, e.g., [101, 102]. (In some contexts, cardinality in the database area refers to the distinct count [35].) CM can be regarded as a complex function that maps the current state of the database and the estimated cardinalities to the cost of executing a (sub)plan. PE is an algorithm that explores the space of semantically equivalent join orders and finds the order with minimal cost. There are two principal approaches to finding an optimal join order: bottom-up join enumeration via dynamic programming and top-down join enumeration through memoization.
Figure 1: Query optimizer architecture. IS, HJ, NL, and TS refer to index scan, hash join, nestloop join, and table scan.
Theoretically, provided that the estimated cardinality and cost are accurate, and the plan enumeration component can efficiently walk through the huge search space, this architecture can obtain the optimal execution plan in a reasonable time. However, it fails in reality. Despite decades of work, cost-based query optimizers still make mistakes on “difficult” queries due to errors in CE, the difficulty of building an accurate CM, and the pain of finding the optimal join order (PE) for complex queries. The details, i.e., why the existing optimizer is still far from satisfactory, are presented in Section 2.

There is a large body of research on improving the capability of the optimizer, and in this paper we present a survey of it. Specifically, we review the publications that aim to improve the three key components of the optimizer, i.e., CE, CM, and PE. This paper makes the following contributions:
(1) We summarize the reasons why CE, CM, and PE do not perform well (Section 2).
(2) We review the studies proposed to estimate cardinality more accurately. According to the techniques used, we categorize them into synopsis-based methods, sampling-based methods, and learning-based methods (Section 3).
(3) We review the work on improving the cost model. We classify it into three groups: improvement of the existing cost model, cost model alternatives, and performance prediction for a single query (Section 4).
(4) We review the techniques used in plan enumeration and study the non-learning methods used to handle large queries. Besides, we review recently proposed methods that adopt reinforcement learning to select the join order (Section 5).
(5) In Sections 3-5, we present our insights on future directions, respectively.

There are two related surveys. In 1998, Chaudhuri [8] reviewed the work on query optimizers with non-learning methods. In the last two decades, many new methods have been proposed to improve the capability of the optimizer, so it is necessary to review this new work. Recently, Zhou et al. [107] investigated how AI is introduced into different parts of a DBMS, such as monitoring, tuning, and the optimizer. In this paper, we focus on the query optimizer and give a comprehensive survey on its three key components. We summarize the learning-based and non-learning methods at the same time, review these works in detail, and present possible future directions for each of them.
2 WHY THE OPTIMIZER DOES NOT PERFORM WELL

In this section, we summarize the reasons why cardinality estimation, the cost model, and plan enumeration do not perform well, respectively. The studies reviewed in this paper try to improve the quality of the optimizer by addressing these shortcomings.
2.1 Cardinality Estimation

Cardinality estimation is the task of estimating the number of tuples generated by an operator; its output is used in the cost model to calculate the cost of that operator. Lohman [62] points out that the cost model can introduce errors of at most 30%, while cardinality estimation can easily introduce errors of many orders of magnitude. Leis et al. [56] experimentally revisit the components CE, CM, and PE in classical optimizers with complex workloads. They focus on the quality of the physical plan for multi-join queries and reach the same conclusion as Lohman. The errors in cardinality estimation are mainly introduced in three cases (a concrete sketch of the first case follows this list):

(1) Error in a single table with predicates. Database systems usually take histograms as the approximate distribution of data. Histograms are smaller than the original data; thus, they cannot represent the true distribution entirely, and some assumptions (e.g., uniformity within a single attribute, independence among different attributes) are made. When those assumptions do not hold, estimation errors occur, leading to sub-optimal plans. Correlation among attributes in a table is not unusual. Multi-dimensional histograms have been proposed to capture it; however, they suffer from a large storage size.

(2) Error in multi-join queries. Correlations possibly exist in columns from different tables. However, there is no efficient way to get synopses across two or more tables. The inclusion principle has been introduced for this case: the cardinality of a join operator is calculated from the cardinalities of its children using the inclusion principle. It produces large errors when the underlying assumption does not hold. Besides, for a complex query with multiple tables, the estimation errors can propagate and amplify from the leaves to the root of the plan. The optimizers of commercial and open-source database systems still struggle with cardinality estimation for multi-join queries [56].

(3) Error in user-defined functions. Most database systems support user-defined functions (UDFs). When a UDF appears in a condition, there is no general method to estimate how many tuples satisfy it [8].
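To make the first failure mode above concrete, here is a small sketch of our own (the data and column names are hypothetical, not from any cited system): two perfectly correlated columns are combined under the attribute-value-independence assumption, and the estimate drifts away from the truth.

```python
# Sketch: selectivity estimation under the independence assumption.
# All data and names are hypothetical, for illustration only.
from collections import Counter

rows = [("Honda", "Civic")] * 80 + [("Toyota", "Corolla")] * 20  # correlated columns

def selectivity(col_idx, value):
    """Per-column selectivity, as a 1-dimensional histogram would report it."""
    freq = Counter(r[col_idx] for r in rows)
    return freq[value] / len(rows)

# Estimate |make=Honda AND model=Civic| assuming attribute-value independence:
est = selectivity(0, "Honda") * selectivity(1, "Civic") * len(rows)
true = sum(1 for r in rows if r == ("Honda", "Civic"))
print(f"estimated={est:.0f}, true={true}")  # estimated=64, true=80
```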
2.2 Cost Model

Cost-based optimizers use a cost model to estimate the cost of a (sub)query. The cost of a (sub)plan is the sum of the costs of all operators in it. The cost of an operator depends on the hardware where the database is deployed, the operator's implementation, the number of tuples processed by the operator, and the current database state (e.g., data in the buffer, concurrent queries) [65]. Thus, a large number of magic numbers must be determined when combining all factors, and errors in cardinality estimation also affect the quality of the cost model. Furthermore, when the cost-based optimizer is deployed in a distributed or parallel database, a cloud environment, or a cross-platform query engine, the complexity of the cost model increases dramatically. Moreover, even with true cardinalities, the estimated cost of a query is not linear in its running time, which may lead to a suboptimal execution plan [46, 81].
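As a minimal illustration of how such magic numbers and estimated cardinalities combine, the following sketch of ours mimics textbook disk/CPU cost formulas; the constants and the operator set are our own simplification, not taken from any specific system.

```python
# Sketch of an additive cost model; constants are illustrative "magic numbers",
# not taken from any real system.
SEQ_PAGE_COST, CPU_TUPLE_COST, HASH_BUILD_COST = 1.0, 0.01, 0.02

def scan_cost(pages, tuples):
    return pages * SEQ_PAGE_COST + tuples * CPU_TUPLE_COST

def hash_join_cost(build_card, probe_card, out_card):
    # build the hash table on one input, probe with the other, emit the result
    return (build_card * HASH_BUILD_COST
            + probe_card * CPU_TUPLE_COST
            + out_card * CPU_TUPLE_COST)

# Plan cost = sum over operators; estimated cardinalities feed every term,
# so a cardinality error propagates into every cost above it.
cost = scan_cost(1000, 50_000) + scan_cost(200, 8_000) \
       + hash_join_cost(8_000, 50_000, 120_000)
print(round(cost, 2))
```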
2.3 Plan Enumeration

The plan enumeration algorithm finds, among the space of semantically equivalent join orders, the one that minimizes the query cost. This has been proven to be an NP-hard problem [42]. Exhaustive query plan enumeration is prohibitive for large databases with multi-join queries. Thus, it is crucial to explore the right search space, which should contain the optimal or approximately optimal join orders, and to design an efficient enumeration algorithm. The join trees in the search space can be zig-zag trees, left-deep trees, right-deep trees, and bushy trees, or a subset of them; different systems consider different forms of join trees. There are three enumeration algorithms in traditional database systems: (1) bottom-up join enumeration via dynamic programming (DP) (e.g., System R [79]), (2) top-down join enumeration through memoization (e.g., Volcano/Cascades [27, 28]), and (3) randomized algorithms (e.g., the genetic algorithm in PostgreSQL [12] when numerous tables are joined).

Plan enumeration suffers from three limitations: (1) the errors in cardinality estimation and the cost model, (2) the rules used to prune the search space, and (3) dealing with queries over a large number of tables. When a query touches a large number of tables, optimizers have to sacrifice optimality and employ heuristics to keep the optimization time reasonable, like the genetic algorithm in PostgreSQL and the greedy method in DB2, which usually generate poor plans. We should note that the errors in cardinality propagate to the cost model and lead to suboptimal join orders. Eliminating or reducing the errors in cardinality is the first step to building a capable optimizer, as Lohman [62] says: “The root of all evil, the Achilles Heel of query optimization, is the estimation of the size of intermediate results, known as cardinalities”. In the following three sections, we summarize the research efforts made to handle the limitations in CE, CM, and PE, i.e., how to make the query optimizer good.
3 CARDINALITY ESTIMATION

At present, there are three major strategies for cardinality estimation, as shown in Figure 2. We only list some representative work for each category. Every method tries to approximate the distribution of data well with less storage. Some proposed methods combine different techniques, e.g., [91, 92].
Figure 2: A classification of cardinality estimation methods.
3.1 Synopsis-based Methods

Synopsis-based methods introduce new data structures to record statistics information. Histograms and sketches are the most widely adopted forms. A survey on synopses was published in 2012 [10]; it focuses on the aspects of synopses that are pertinent to Approximate Query Processing (AQP).
3.1.1 Histograms. There are two histogram types: 1-dimensional and d-dimensional histograms, where d ≥ 2. d-dimensional histograms can capture the correlation between different attributes. A 1-dimensional histogram on attribute a is constructed by partitioning the sorted tuples into B (B ≥ 1) mutually disjoint subsets, called buckets, and approximates the frequencies and values in each bucket in some common fashion, e.g., uniform distribution and continuous values. A d-dimensional histogram on an attribute group A is constructed by partitioning the joint data distribution of A. Because there is no order between different attributes, the partition rule needs to be more intricate. In 2003, Ioannidis [44] presented a comprehensive survey on histograms following the classification method in [77]. Gunopulos et al. [30] also proposed a survey in 2003, which focuses on the work used to estimate the selectivity over multiple attributes; they summarize the multi-dimensional histograms and kernel density estimators. After 2003, the work on histograms can be divided into three categories: (1) fast algorithms for histogram construction [1, 29, 33, 34, 43]; (2) new partition methods to divide the data into different buckets to achieve better accuracy [15, 59, 88]; (3) histogram construction based on query feedback [48, 58, 83]. Query feedback methods are also summarized in [10] (Section 3.5.1.2), and readers can refer to it for details.

Guha et al. [29] analyze the previous algorithm VODP [45] and find that some calculations on the minimal sum-of-squared-errors (SSE) can be reduced. They design an efficient algorithm, AHistL-Δ, with time complexity O(n + B³(lg n + ε⁻²)), while VODP takes O(n²B), where n is the domain size, B is the number of buckets, and ε is a precision parameter. Halim et al. [33, 34] propose GDY, a fast histogram construction algorithm based on greedy local search. GDY generates good sample boundaries, which are then used to construct the B final partitions optimally using VODP. This study compares GDY variants with AHistL-Δ [29] in minimizing the total error over all buckets and shows its superiority in resolving the efficiency-quality trade-off. Instead of scanning the whole dataset [29], Indyk et al. [43] design a greedy algorithm to construct the histogram on random samples from the dataset with time complexity O((B/ε) log n) and sample complexity O((B/ε) log n). Acharya et al. [1] study the same problem as [43] and propose a merging algorithm with time complexity O(1/ε). The methods in [1, 29, 43] can be extended to approximate distributions by piecewise polynomials.

Considering that tree-based indexes divide the data into different segments (nodes), which are quite similar to buckets in a histogram, Eavis and Lopez [15] build a multi-dimensional histogram based on the R-tree. They first build a naive R-tree histogram on the Hilbert sort of the data and then propose a sliding-window algorithm to enhance the naive histogram under a newly proposed metric, which seeks to minimize the dead space between bucket points. Lin et al. [59] design a two-level histogram for one attribute, quite similar to the idea of the B-tree index: the first level is used to locate which leaf histograms to use, and the leaf histograms store the statistics information. To et al. [88] construct a histogram based on the principle of minimizing the entropy reduction of the histogram. They design two different histograms for equality queries and an incremental algorithm to construct the histogram. However, it only considers one-dimensional histograms and does not handle range queries well.
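To illustrate the bucket-partitioning idea underlying these constructions, here is a minimal equi-depth histogram sketch of our own (not one of the surveyed algorithms); it assumes uniformity inside each bucket when answering a range predicate.

```python
# Sketch: equi-depth (equi-height) histogram with uniformity inside buckets.
def build_equidepth(values, B):
    """Partition sorted values into B buckets; store (low, high, count)."""
    vs = sorted(values)
    n, step = len(vs), len(vs) // B
    return [(vs[i], vs[min(i + step, n) - 1], min(i + step, n) - i)
            for i in range(0, n, step)]

def est_le(hist, x, n):
    """Estimate selectivity of `value <= x`, interpolating inside buckets."""
    sel = 0.0
    for low, high, cnt in hist:
        if x >= high:
            sel += cnt
        elif x >= low:
            sel += cnt * (x - low + 1) / (high - low + 1)  # uniformity assumption
    return sel / n

data = list(range(1, 101)) + [100] * 50   # skewed tail
hist = build_equidepth(data, B=5)
print(est_le(hist, 50, len(data)))        # compare with true 50/150 = 0.333
```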
3.1.2 Sketches. A sketch models a column as a vector or matrix to calculate the distinct count (e.g., HyperLogLog [22]) or the frequency of tuples with a value (e.g., Count-Min [11]). Rusu and Dobra [78] summarize how to use different sketches to estimate the join size. This work considers the case of two tables (or data streams) without filters. The basic idea is: (1) build the sketch (a vector or matrix) on the join attribute while ignoring all other attributes; (2) estimate the join size based on the multiplication of the vectors or matrices. These methods only support equi-joins on a single column. As shown in [94], a possible way to introduce one filter into a sketch is to build an imaginary table which only consists of the join values of tuples that satisfy the filter. However, this makes the estimation drastically worse. Skimmed sketch [24] is based on the idea of bifocal sampling [25] to estimate the join size; however, it requires knowing the frequencies of the most frequent join attribute values. Recent work [6] on join size estimation introduces a sketch to record the degree of a value.
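The following toy sketch of ours illustrates the core identity exploited by these methods: the size of an equi-join equals the inner product of the two join-attribute frequency vectors. We use exact counters for clarity; a real sketch such as AGMS or Count-Min approximates these vectors in sublinear space.

```python
# Sketch: join size of an equi-join as the inner product of frequency vectors.
# Real sketches approximate these vectors compactly; exact counts are used
# here only to show why the inner product equals the join size.
from collections import Counter

R = [1, 1, 2, 3, 3, 3]          # join-attribute values of table R
S = [1, 3, 3, 4]                # join-attribute values of table S

fR, fS = Counter(R), Counter(S)
join_size = sum(fR[v] * fS[v] for v in fR.keys() & fS.keys())
print(join_size)                # 2*1 + 3*2 = 8 matching tuple pairs
```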
3.1.3 Graph-based synopses. TuG [82] is a graph-based synopsis. A node of TuG represents a set of tuples from the same table or a set of values of the same attribute. An edge represents the join relationship between different tables or between attributes and values. The authors adopt a three-step algorithm to construct TuG and introduce histograms to summarize the value distribution in a node. When a new query comes, the selectivity is estimated by traversing TuG. The construction process is quite time-consuming and cannot be used on a large dataset. Without the relationships between different tables, TuG cannot be built.
3.2 Sampling-based Methods

It is quite difficult for synopsis-based methods to capture the correlation between different tables. Some researchers therefore use a specific sampling strategy to collect a set of samples (tuples) from the tables and then run the (sub)query over the samples to estimate the cardinality. As long as the distribution of the obtained samples is close to the original data, the cardinality estimate is believable. Thus, a lot of work has been proposed to design a good sampling approach, from independent sampling to correlated sampling techniques. Sampling-based methods are also summarized in [10]. Numerous studies after 2011 utilize sampling techniques; different from [10], we mainly summarize the new work. Moreover, we review the work in order of publication time and present the relationships between them, i.e., which shortcomings of the previous work the later work tries to overcome. A code sketch of the correlated sampling idea follows at the end of this subsection.

In 1993, Haas et al. [31] analyzed six different fixed-step (a pre-defined sample size) sampling methods for equi-join queries. They conclude that if there are indexes built on the join keys, page-level sampling combined with the index is the best way; otherwise, page-level cross-product sampling is the most efficient. Then, the authors extend the fixed-step methods to fixed-precision procedures.

Ganguly et al. [25] introduce bifocal sampling to estimate the size of an equi-join. They classify the values of the join attribute in each relation into two groups, sparse (s) and dense (d), based on their frequencies. Thus, the join type between tuples can be s-s, s-d, d-s, or d-d. The authors first adopt t_cross sampling [31] to estimate the join size of d-d, then adopt t_index to estimate the join size of the remaining cases, and finally add up all the estimates as the join size estimate. However, it needs an extra pass to determine the frequencies of different values and needs indexes to estimate the join size for s-s, s-d, and d-s. Without indexes, the process is time-consuming.

End-biased sampling [16] stores the pair (v, f_v) if f_v ≥ T, where v is a value in the join attribute domain, f_v is the number of tuples with value v, and T is a defined threshold. It applies a hash function h(v): v ↦ [0, 1]; if h(v) ≤ f_v/T, it stores (v, f_v), otherwise not. Different tables adopt the same hash function to correlate their sampling decisions for tuples with low frequencies. Then the join size can be estimated using the stored (v, f_v) pairs. However, it only supports equi-joins between two tables and cannot handle other filter conditions. Note that end-biased sampling is quite similar to bifocal sampling; the difference is that the former uses a hash function to sample correlated tuples while the latter uses indexes. Both require an extra pass through the data to compute the frequencies of the join attribute values.

Yu et al. [105] introduce correlated sampling as part of the CS2 algorithm. They (1) choose one of the tables in a join graph as the source table R_1, (2) use a random sampling method to obtain a sample set S_1 for R_1 (marking R_1 as visited), (3) follow an unvisited edge <R_i, R_j> (R_i is visited) in the join graph and collect the tuples from R_j that are joinable with tuples in S_i as S_j, and (4) estimate the join size over the samples. To support queries without source tables, they propose a reverse estimator, which tracks back to the source tables to estimate the join size. However, because it walks through the join graph many times, it is time-consuming without indexes. Furthermore, it requires an unpredictably large space to store the samples.

Vengerov et al. [94] propose a correlated sampling method that does not require prior knowledge of the frequencies of the join attributes, unlike [16, 25]. A tuple with join value v is included in the sample set if h(v) < p, where p = n/T, h(v) is a hash function similar to that in [16], n is the sample size, and T is the table size. The obtained samples can then be used to estimate the join size and handle specified filter conditions. Furthermore, the authors extend the method to joins over more tables and complex join conditions. In most cases, correlated sampling has lower variance than independent Bernoulli sampling (t_cross), but when the values of many join attributes occur with large frequencies, Bernoulli sampling is better. One possible solution the authors propose is to adopt a one-pass algorithm to detect the values with high frequencies, which goes back to the method in [16].

Through experiments, Chen and Yi [9] conclude that no single sampling method is suitable for all cases. They propose a two-level sampling method, which is based on independent Bernoulli sampling, end-biased sampling [16], and correlated sampling [94]. Level-one sampling samples a value v from the join attribute domain into the value set V if h(v) < p_v, where h is a hash function similar to that in [16] and p_v is a defined probability for value v. Before level-two sampling, they sample a random tuple, called the sentry, for every v in V into the tuple set. Level-two sampling samples tuples with value v (v ∈ V) with probability q. Then the join size can be estimated using the tuple samples. Obviously, the first level is correlated sampling and the second level is independent Bernoulli sampling. The authors analyze how to set p_v and q according to different join types and the frequencies of values in the join attributes.

Wang and Chan [95] extend [9] to a more general framework in terms of five parameters. Based on the new framework, they propose a new class of correlated sampling methods, called CSDL, based on a discrete learning algorithm. A variant of CSDL, CSDL-Opt, outperforms [9] when the samples are small or the join value density is small.

Wu et al. [100] adopt online sampling to correct possible errors in the plans generated by the optimizer.
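The hash-based correlated sampling idea of [16, 94] can be sketched as follows (our own toy version; the data, the hash construction, and the sampling rate are illustrative assumptions):

```python
# Sketch: correlated sampling for join size estimation (idea of [16, 94]).
# A shared hash h(v) makes the sampling decisions of R and S agree per value,
# so joining tuples survive together; all names/constants are illustrative.
import hashlib
from collections import Counter

def h(v):
    """Deterministic hash of a join value into [0, 1)."""
    d = hashlib.sha1(str(v).encode()).digest()
    return int.from_bytes(d[:8], "big") / 2**64

R = [i % 100 for i in range(10_000)]   # join-attribute values of R
S = [i % 100 for i in range(5_000)]    # join-attribute values of S
p = 0.3                                # sampling rate over the value domain

fR = Counter(v for v in R if h(v) < p)
fS = Counter(v for v in S if h(v) < p)

# Each distinct value survives with probability p and contributes fR[v]*fS[v]
# joining pairs, so the unbiased estimate scales the sampled sum by 1/p.
est = sum(fR[v] * fS[v] for v in fR.keys() & fS.keys()) / p
FR, FS = Counter(R), Counter(S)
true = sum(FR[v] * FS[v] for v in FR.keys() & FS.keys())
print(f"estimate={est:.0f}, true={true}")
```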
3.3 Learning-based Methods

Due to the capability of learning-based methods, many researchers have introduced learning-based models to capture the distribution and correlations of data. We classify them into (1) supervised methods and (2) unsupervised methods.

3.3.1 Supervised methods. Malik et al. [64] group queries into templates and adopt machine learning techniques (e.g., linear regression models, tree models) to learn the distribution of query result sizes for each family. The features used include query attributes, constants, operators, aggregates, and arguments to UDFs.

Park et al. [76] propose a model, QuickSel, in the query-driven paradigm, similar to [48, 58, 83], to estimate the selectivity of a query. Instead of adopting histograms, QuickSel introduces uniform mixture models to represent the distribution of the data. They train the model by minimizing the mean squared error between the mixture model and a uniform distribution.
Table 1: Learning-based methods for cardinality estimation.
Refs | Model | Model Count | Encoding | Multi-columns | Multi-tables | UDF | Workload Shift
[64] | LR | 1 Model / 1 Template | predicates, arguments | ✓ | ✓ | ✓ | ×
[76] | MixModel | 1 Model | predicates | ✓ | × | × | ✓
[91, 92] | BN | 1 Model | predicates | ✓ | ✓ | × | ×
[32] | BN | 1 Model / 1 Table | predicates | ✓ | × | × | ×
[54] | NN | 1 Model / 1 UDF | arguments | × | × | ✓ | ×
[61] | NN | 1 Model | predicates | ✓ | × | × | ×
[98] | NN/PR/MLR | 1 Model / 1 Subquery | predicates, input cardinalities | ✓ | ✓ | ✓ | ×
[50] | MSCN | 1 Model | predicates, tables, joins | ✓ | ✓ | × | ×
[14] | Tree-Ensemble/NN | 1 Model | predicates | ✓ | × | × | ×
[96, 97] | NN | 1 Model / 1 Template | predicates | ✓ | ✓ | × | ×
[75] | DNN/RNN/Tree | 1 Model | predicates, tables, joins | ✓ | ✓ | × | ×
[85] | tree-LSTM | 1 Model | predicate, operator, metadata | ✓ | ✓ | × | ×
[39] | KDE | 1 Model | samples | ✓ | × | × | ✓
[49] | KDE | 1 Model or 1 Model / 1 Table | samples | ✓ | ✓ | × | ✓
[41] | SPN | 1 Model | tuples; predicates | ✓ | ✓ | × | ✓
[36, 104] | Autoregression | 1 Model | tuples; predicates | ✓ | × | × | ✓
[103] | Autoregression | 1 Model | tuples; predicates | ✓ | ✓ | × | ✓

Table 2: A preliminary comparison of different methods for cardinality estimation.
Methods | Workload Shift | Data Change | Update Time | Storage Usage | Multi-Columns | Multi-Tables
1-histogram [44] | × | × | Short | Small | × | ×
d-histogram [30] | × | × | Short | Large | ✓ | ×
Sampling [9, 95, 100] | ✓ | × | Medium | Large | ✓ | ✓
Supervised Learning [50, 85] | × | × | Long | Small | ✓ | ✓
Unsupervised Learning [41, 103] | ✓ | × | Long | Small | ✓ | ✓

Tzoumas et al. [91, 92] build a Bayesian network and decompose the complex statistics over multiple attributes into small one-/two-dimensional statistics, which means the model captures dependencies between at most two relations. They build histograms for these small-dimensional statistics and adopt dynamic programming to calculate the selectivity for new queries. Unlike the previous method [26], it can handle more general joins and has a more efficient construction algorithm because it captures smaller dependencies. However, the authors do not verify their method on multi-table joins or on a large dataset. Moreover, constructing the two-dimensional statistics with attributes from different tables requires the join operation. Halford et al. [32] also introduce a method based on Bayesian networks. To construct the model quickly, they only factorize the distribution of attributes inside each relation and use the previous assumptions for joins. However, they do not show how well their method compares with [91, 92].

In 1998, Lakshmi and Zhou [54] first introduced neural networks (NNs) into the cardinality estimation of user-defined functions (UDFs), which histograms and other statistics cannot support. They design a two-layer neural network and employ back propagation to update the model.

Liu et al. [61] formalize a selectivity function:
Sel: R^{2N} ↦ R, (l_1, u_1, ..., l_N, u_N) ↦ c, where N is the number of attributes, and l_i and u_i are the lower and upper bounds on the i-th attribute for a query. They employ a 3-layer NN to learn the selectivity function. To support > and <, they add 2N small NNs to produce l_i and u_i.

Wu et al. [98] use a learning-based method for workloads in shared clouds, where the queries are often recurring and overlapping in nature. They first extract overlapping sub-graph templates from multiple query graphs. Then, they learn cardinality models for those sub-graph templates.

Kipf et al. [50] introduce the multi-set convolutional network (MSCN) model to estimate the cardinality of correlated joins. They represent a query as a collection of a set of tables T, joins J, and predicates P, and build a separate 2-layer NN for each of them. The outputs of the three NNs are averaged, concatenated, and fed into the final output network. Deep sketch [51] is built on [50] and is a wrapper around it.

Dutt et al. [14] formalize the estimation as a function similar to [61] and treat it as a regression problem. They adopt two different approaches for the regression problem: NN-based methods and tree-based ensembles. Unlike [61], the authors also use histograms and domain knowledge (e.g., AVI, EBO, and MinSel) as extra features in the models, which improves the estimation accuracy. Because the domain knowledge is quickly updated when the data distribution changes, the model is robust to updates on the datasets.

Woltmann et al. [97] argue that building a single NN, called a global model, over the entire database schema yields a sparse encoding and needs numerous samples to train. Thus, they build different models, called local models, for different query templates. Every local model adopts multi-layer perceptrons (MLPs) to produce the cardinality estimate. To collect the true cardinalities, many sample queries are issued during the training process, which is time-consuming. To mitigate this, Woltmann et al. [96] introduce the method of pre-aggregating the base data using the data cube concept and execute the example queries over this pre-aggregated data.

Ortiz et al. [75] empirically analyze various deep learning approaches used in cardinality estimation, including deep neural networks (DNN) and recurrent neural networks (RNN). The DNN model is similar to [97]. To adopt the RNN model, the authors focus on left-deep plans and model a query as a series of actions, where every action represents an operation (i.e., selection or join). At each timestamp t, the model receives two inputs: x_t, the encoding of the t-th operation, and h_{t-1}, the hidden state generated at timestamp t-1, which can be regarded as the encoding of a subquery and captures the important details about the intermediate results.

Sun and Li [85] introduce a tree-LSTM model to learn a representation of an operator and add an estimation layer on top of the tree-LSTM model to estimate the cardinality and cost simultaneously.
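To make the regression formulation of [61] and [14] concrete, here is a minimal sketch of our own; the feature encoding (range bounds only), the network size, and the synthetic data are illustrative assumptions, not the cited papers' exact setups.

```python
# Sketch: cardinality estimation as supervised regression (cf. [14, 61]).
# A query over N attributes is encoded as its range bounds (l1,u1,...,lN,uN);
# the label is the observed selectivity. Model size/features are illustrative.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
data = rng.uniform(0, 1, size=(10_000, 2))          # hypothetical 2-attribute table

def true_sel(q):
    l1, u1, l2, u2 = q
    m = (data[:, 0] >= l1) & (data[:, 0] <= u1) \
        & (data[:, 1] >= l2) & (data[:, 1] <= u2)
    return m.mean()

# Training workload: random range queries labeled with their true selectivity.
Q = np.sort(rng.uniform(0, 1, size=(2_000, 2, 2)), axis=2).reshape(2_000, 4)
y = np.array([true_sel(q) for q in Q])

model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500).fit(Q, y)
q = np.array([[0.1, 0.5, 0.2, 0.9]])
print(model.predict(q)[0], true_sel(q[0]))          # learned vs. true selectivity
```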
3.3.2 Unsupervised methods. Heimel et al. [39] introduce the Kernel Density Estimator (KDE) for estimating the selectivity on a single table with multiple predicates. They first adopt the Gaussian kernel with a bandwidth obtained by a certain rule to construct the initial KDE, and then use the history queries to choose the optimal bandwidth by minimizing the estimation error of the initial KDE. To support shifts in the workload and dataset, they update the bandwidth after each incoming query and design new sample maintenance methods for insert-only workloads and for workloads with updates/deletions. Furthermore, in 2017, Kiefer et al. [49] extended the method to estimating the selectivity of joins. They design two different models: a single model over the join samples, and models over the base tables, which do not need the join operation and estimate the selectivity of the join under the independence assumption.

Yang et al. [104] propose a model called Naru, which adopts a deep autoregressive model to produce n conditional densities P̂(x_i | x_<i) on a set of n-dimensional tuples. Then they estimate the selectivity using the product rule:

P̂(x) = P̂(x_1, x_2, ..., x_n) = P̂(x_n | x_1, ..., x_{n-1}) P̂(x_{n-1} | x_1, ..., x_{n-2}) ... P̂(x_2 | x_1) P̂(x_1).   (1)

To support range conditions, they introduce a progressive sampling method that samples points from the more meaningful regions according to the trained autoregressive model, which is robust to skewed data. Furthermore, they adopt wildcard-skipping to handle wildcard conditions.

Hasan et al. [36] also adopt deep autoregressive models and introduce an adaptive sampling method to support range queries. Compared with Naru, the authors adopt a binary encoding method and the sampling process runs in parallel, which makes the model smaller than Naru and the inference faster. Besides, it can incorporate the workload by assigning weights to the tuples according to the workload when defining the cross-entropy loss function.

Hilprecht et al. [41] introduce the Relational Sum-Product Network (RSPN) to capture the distribution of single attributes and the joint probability distribution. They focus on Tree-SPNs, where a leaf is the approximation of a single attribute and an internal node is a Sum node (splitting the rows into clusters) or a Product node (splitting the columns of one cluster). To support cardinality estimation of joins, they build the RSPN over the join results.

Yang et al. [103] extend their previous work, Naru, to support joins. They build an autoregressive model over the full outer join of all tables. They introduce lossless column factorization for large-cardinality columns and employ the join count table to support arbitrary queries on subsets of the tables.
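A minimal sketch (ours) of the product-rule idea behind these autoregressive estimators follows; exact conditional frequencies computed from a toy table stand in for the learned model, and only point predicates are handled (no progressive sampling).

```python
# Sketch: selectivity of a point predicate via the autoregressive product rule
# P(x1,...,xn) = P(x1) * P(x2|x1) * ... * P(xn|x1..xn-1), cf. Eq. (1).
# Exact conditionals computed from the data stand in for a learned model.
tuples = [("a", 1, "x"), ("a", 1, "y"), ("a", 2, "x"), ("b", 2, "y")]

def cond_prob(prefix, i, value):
    """P(x_i = value | x_<i = prefix), estimated by counting matching tuples."""
    match = [t for t in tuples if t[:i] == prefix]
    return sum(1 for t in match if t[i] == value) / len(match)

def sel(point):
    p = 1.0
    for i, v in enumerate(point):
        p *= cond_prob(tuple(point[:i]), i, v)
    return p

print(sel(("a", 1, "x")), 1 / len(tuples))   # both 0.25 on this toy table
```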
3.4 Summary and Future Directions

3.4.1 Summary. The basic histogram types (e.g., equi-width, equi-depth, d-dimensional) were introduced before 2000. Recent studies mainly focus on how to construct histograms quickly and how to improve their accuracy. Updating the histograms by query feedback is a good approach to improve their quality. However, there are still two limitations: (1) the storage size increases dramatically when building a d-dimensional histogram; (2) histograms cannot capture the correlation between attributes from different tables. Building a histogram for attributes from different tables requires the join operation, as in [91, 92], and it is difficult to update such a histogram. Sketches can be used to estimate the distinct count of an attribute or the cardinality of equi-join results. However, they cannot support more general cases well, e.g., joins with filters. Synopsis-based methods cannot estimate the size of final or intermediate relations when one or both of the child relations is an intermediate relation.

Sampling is a good approach to capture the correlations between different tables. However, when the tuples in the tables are updated, the samples may become out-of-date. Sampling-based methods also suffer from the storage used to store the samples and the time used to retrieve the samples, especially when the original data is large. Furthermore, current sampling methods only support equi-joins.

The supervised learning methods are mostly query-driven, which means the model is trained for a specific workload. If the workload shifts, the model needs to be retrained. Thus, data-driven (unsupervised learning) approaches have emerged, which can still estimate the cardinality even if the workload shifts. As shown in [104] (Section 6.3), Naru is robust to workload shift, while MSCN and KDE are sensitive to the training queries. Moreover, both supervised and unsupervised learning methods suffer from data changes. As presented in [103] (Section 7.6) and [85] (Section 7.5), both are sensitive to data changes, and the models must be updated in an incremental mode or retrained from scratch. However, they only consider the case where new tuples are appended to one table; delete and update operations are not considered.

Due to differences in experimental settings, we only present a preliminary comparison of the cardinality estimation methods, as shown in Table 2. The size of learning-based models is sometimes still not small, as presented in [103]. The state-of-the-art method is [103], which shows superiority over other methods in their experiments.
3.4.2 Future directions. There are several possible directions as follows:

(1) Learning-based methods. Many studies on cardinality estimation in the last two years are learning-based. The learning-based models currently integrated into real systems are light models or one model per (sub-)query graph [14, 98], which can be trained and updated quickly. However, the accuracy and generality of these models are limited. More complex models (achieving better accuracy) still suffer from long training and update times. Training time is influenced by the hardware; for example, it takes only several minutes in [103], while it takes 13 hours in [85] using a GPU with relatively poor performance. A database instance, especially in the cloud environment, runs in a resource-constrained environment, so how to train the model efficiently should be considered. The interaction between the models and the optimizer also needs to be considered, and should not put too much overhead on the database system [14]. As presented above, currently proposed methods for data change cannot handle delete or update operations. A possible approach is to adopt the idea of active learning to update the model [63].

(2) Hybrid methods. Query-driven methods are sensitive to the workload. Although data-driven methods can support more queries, they may not achieve the best accuracy for all queries. How to combine methods from these two different catalogs is a possible direction. Actually, the earlier query-feedback histograms are an instance of this case. Another interesting observation is that utilizing query feedback information can help the model become aware of data changes.

(3) Experimental study. Although many methods have been proposed, there is a lack of experimental studies verifying them. Different methods have different characteristics, as shown in Table 2. It is crucial to conduct a comprehensive experimental study of the proposed methods. We think the following aspects should be included: (1) whether the method is easy to integrate into a real database system; (2) the performance of the method under different workload patterns (e.g., static or dynamic workload, OLTP or OLAP) and different data scales and distributions; (3) the trade-off between storage usage and accuracy of cardinality estimation, and the trade-off between the efficiency of model updates and accuracy.
4 COST MODEL

In this section, we present the research proposed to address the limitations of the cost model. We classify the methods into three groups: (1) improving the capability of the existing cost model, (2) building a new cost model, and (3) predicting the query performance. We include the work on single-query performance prediction because the cost used in the optimizer is a metric for performance; these methods can be integrated into the cost-based optimizer to replace the cost model when estimating the cost of a (sub)plan, as in [67]. However, we do not consider query performance prediction under a concurrent context (e.g., [108]). On the one hand, the concurrent queries present during optimization may be quite different from the queries present during execution. On the other hand, it also needs to collect more information than a model predicting the performance of a single query. We list the representative work on the cost model in Figure 3.
Figure 3: A classification of cost estimation methods.
4.1 Improving the Existing Cost Model

Several studies try to estimate the cost of UDFs [3, 37, 38]. Boulos and Ono [3] execute the UDF several times with different input values, collect the different costs, and then use these costs to build a multi-dimensional histogram, stored in a tree structure. To estimate the cost of a UDF with specific parameters, they traverse the tree top-down to locate the leaf with parameters similar to the inputs. However, this method needs to know the upper and lower bounds of every parameter, and it cannot capture complex relations between input parameters and costs. Unlike the static histogram used in [3], He et al. [37] introduce a dynamic quadtree-based approach to store the UDF execution information: when a query is executed, the actual cost of executing the UDF is used to update the cost model. He et al. [38] introduce a memory-limited K-nearest-neighbors (MLKNN) method. They design a data structure, called PData, to store the execution costs, and a multi-dimensional index for fast retrieval of the k nearest PData for a given query point (the UDF parameters) and fast insertion of new PData.

Liu and Blanas [60] introduce a cost model for hash-based joins in main-memory databases. They model the response time of a query as proportional to the number of operations weighted by the costs of four basic access patterns. They first use microbenchmarks to get the cost of each access pattern and then model the cost of sequential scan, hash join, and hash joins in different orders by the basic access patterns.

Most of the previous cost models only consider the execution cost, which may not be reasonable in cloud environments, whose users care about the economic cost. Karampaglis et al. [47] first propose a bi-objective query cost model, which derives running time and monetary cost together in a multi-cloud environment. They model the execution time based on the method in [99]. For economic cost estimation, they first model the charging policies and then estimate the monetary cost by combining the policy with the time estimate.
4.2 Cost Model Alternatives

The cost model is a function mapping a (sub)plan with annotated information to a scalar (cost). Because a neural network primarily approximates an unknown underlying mapping function from inputs to outputs, most of the methods used to replace the original cost model are learning-based, especially NN-based.

Boulos et al. [4] first introduced neural networks for cost evaluation. They design two different models: a single large neural network for every query type, and a single small neural network for every operator. In the first model, they also train another model to classify a query into a certain type. The output of the first model is the cost of a (sub)plan, while the second model adds up the outputs of the small models to get the cost.

Sun and Li [85] adopt a tree-LSTM model to learn the representation of an operator and add an estimation layer on top of the tree-LSTM model to estimate the cost of the query plan.

Due to the difficulty of collecting statistics and the need to pick resources in big data systems, particularly in modern cloud data services, Siddiqui et al. [81] propose a learning-based cost model and integrate it into the optimizer of SCOPE [7]. They build a large number of small models to predict the costs of common (sub)queries, which are extracted from the workload history. The features encoded into the models are quite similar to [98]. Moreover, to support resource-aware query planning, they add the number of partitions allocated to the operator to the features. To improve the coverage of the models, they introduce operator-input models and operator-subgraphApprox models and employ a meta-ensemble model to combine the models above into the final model.
4.3 Query Performance Prediction

The performance of a single query mainly refers to its latency. Wu et al. [99] adopt offline profiling to calibrate the coefficients in the cost model under specific hardware and software conditions. Then, they adopt a sampling method to obtain the true cardinalities of the physical operators and predict the execution times.

Ganapathi et al. [23] adopt Kernel Canonical Correlation Analysis (KCCA) for resource estimation, e.g., CPU time. They only model plan-level information, e.g., the number of each physical operator type and their cardinalities, which is vulnerable.

To estimate resources (CPU time and logical I/O), Li et al. [57] train a boosted regression tree for every operator in the database; the consumption of the plan is the sum of the operators'. To make the model more robust, they train a separate scaling function for every operator and combine the scaling functions with the regression models to handle cases where the data distribution, data size, or query parameters are quite different from the training data. Unlike [23], this is an operator-level model.

Akdere et al. [2] propose learning-based models to predict query performance. They design a plan-level model for workloads known in advance, and an operator-level model. Considering that the plan-level model makes highly accurate predictions while the operator-level model generalizes well, for queries with low operator-level prediction accuracy they train models for specific query subplans using plan-level modeling and compose both types of models to predict the performance of the entire plan. However, the models adopted are linear.

Marcus and Papaemmanouil [68] introduce a plan-structured neural network to predict a query's latency. They design a small neural network, called a neural unit, for every logical operator, and any instance of the same logical operator shares the same network. These neural units are combined into a tree shape according to the plan tree. The output of one neural unit consists of two parts: the latency of the current operator and the information sent to its parent node. The latency output at the root neural unit of a plan is the plan's latency.

Neo [67] is a learning-based query optimizer which introduces a neural network, called the value network, to estimate the latency of a (sub)plan. The query-level encoding (join graph and columns with predicates) is fed through several fully-connected layers and then concatenated with the plan-level encoding, a tree vector representing the physical plan. The concatenated vector is then fed into a tree convolution and several more fully-connected layers to predict the latency of the input physical plan.
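The plan-structured idea of [68] can be sketched as follows (our own simplification; the per-operator "neural units" are replaced by trivial stub functions so that only the bottom-up composition over the plan tree is shown):

```python
# Sketch: plan-structured latency prediction in the spirit of [68]. Each
# operator type has its own "neural unit"; a unit consumes its children's
# outputs plus operator features and emits (latency, hidden vector for the
# parent). Real units are small NNs; trivial stubs are used here.
def scan_unit(feat, child_out):
    return 0.5 * feat["pages"], [feat["card"]]

def join_unit(feat, child_out):
    hidden = [h for _, hs in child_out for h in hs]   # children's hidden states
    return 0.001 * feat["card"] + 0.0001 * sum(hidden), [feat["card"]]

UNITS = {"scan": scan_unit, "join": join_unit}

def predict(plan):
    """plan = (op_type, features, children); returns (total latency, hidden)."""
    op, feat, children = plan
    child_out = [predict(c) for c in children]
    lat, hidden = UNITS[op](feat, child_out)
    return lat + sum(l for l, _ in child_out), hidden

plan = ("join", {"card": 120_000},
        [("scan", {"pages": 1000, "card": 50_000}, []),
         ("scan", {"pages": 200, "card": 8_000}, [])])
print(predict(plan)[0])   # predicted latency of the whole plan
```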
4.4 Summary

The methods that improve the existing cost model focus on different aspects, e.g., UDFs or hash joins in main memory. These studies leave us an important lesson: when introducing a new logical or physical operator, or re-implementing existing physical operators, we should consider how to add them into the optimization process and design the corresponding cost estimation formulas for them (e.g., [55, 71]).

Learning-based methods adopt a model to capture the complex relationship between the cost and its factors, while the traditional cost model is defined as a fixed formula by database experts. The NN-based methods used to predict performance, estimate cost, and estimate cardinality (Section 3.3.1) are quite similar in their features and model selection. For example, Sun and Li [85] use the same model to estimate cost and cardinality, and Neo [67] uses the latency (performance) of a (sub)plan as its cost. A model that captures the data itself, operator-level information, and subplan information can predict the cost accurately. For example, [85], one of the state-of-the-art methods, adopts the tree-LSTM model to capture the information mentioned above. However, all of them are supervised methods: if the workload shifts or the data is updated, the models need to be retrained from scratch.
4.5 Future Directions

There are two possible directions as follows:

(1) Cloud database systems. The users of cloud database systems need to meet their latency or throughput targets at the lowest price. Integrating the economic cost of running queries into the cost model is a possible direction, and it is interesting to consider such related information in the cost model. For example, Siddiqui et al. [81] consider the number of containers in their cost model.

(2) Learning-based methods. Learning-based methods for cost estimation suffer from the same problems as the learning-based cardinality estimation methods (Section 3.4.2). The model that has been adopted in a real system is a light model [81]. The trade-off between accuracy and training time is still a problem. The possible solutions adopted in cardinality estimation can also be used in the cost model.
5 PLAN ENUMERATION

In this section, we present the research published to handle the problems in plan enumeration. We classify the work on plan enumeration into two groups: non-learning methods and learning-based methods.
5.1 Non-learning Methods

In 1997, Steinbrunn et al. [84] presented a representative survey on selecting optimal join orders. Thus, we mainly focus on the research after 1997.
5.1.1 Bottom-up join enumeration via dynamic programming. Selinger et al. [79] propose a dynamic programming algorithm to select the optimal join order for a given conjunctive query. They generate plans in order of increasing size and restrict the search space to left-deep trees, which significantly speeds up optimization. Vance and Maier [93] propose a dynamic programming algorithm that finds the optimal join order by considering different partial table sets; they use it to generate optimal bushy join trees containing cross products. [79, 93] follow a generate-and-test paradigm, and most of the operations are used to check whether subgraphs are connected and whether two subgraphs can be combined; thus, neither meets the lower bound in [74]. Moerkotte and Neumann [69] propose a graph-based dynamic programming algorithm. They first introduce a graph-based method to generate the connected subgraphs, so it does not need to check connectivity and combinability and performs more efficiently; they then run DP over the connected subgraphs to generate the optimal bushy tree without cross products. Moerkotte and Neumann [70] extend the method in [69] to deal with non-inner joins and a more generalized graph, the hypergraph, where join predicates can involve more than two relations.
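A compact sketch of our own showing bottom-up join enumeration via DP over relation subsets, in the spirit of [79, 93]: it uses a toy cardinality model, allows bushy trees and cross products, and takes the sum of intermediate result sizes as the cost.

```python
# Sketch: bottom-up join ordering via DP over relation subsets (cf. [79, 93]).
# Toy model: cost of a plan = sum of its intermediate result sizes.
from itertools import combinations

CARD = {"A": 1000, "B": 100, "C": 10}                 # hypothetical base cardinalities
SEL = {frozenset("AB"): 0.01, frozenset("BC"): 0.1}   # pairwise join selectivities

def join_card(left_card, right_card, left_set, right_set):
    sel = 1.0
    for a in left_set:
        for b in right_set:
            sel *= SEL.get(frozenset((a, b)), 1.0)    # cross product if no predicate
    return left_card * right_card * sel

# best maps a relation set to (cardinality, cost, plan string)
best = {frozenset([r]): (CARD[r], 0.0, r) for r in CARD}

rels = sorted(CARD)
for size in range(2, len(rels) + 1):
    for subset in map(frozenset, combinations(rels, size)):
        for k in range(1, size):
            for left in map(frozenset, combinations(sorted(subset), k)):
                right = subset - left
                lc, lcost, lp = best[left]
                rc, rcost, rp = best[right]
                card = join_card(lc, rc, left, right)
                cost = lcost + rcost + card
                if subset not in best or cost < best[subset][1]:
                    best[subset] = (card, cost, f"({lp} JOIN {rp})")

print(best[frozenset(rels)])   # optimal bushy plan for {A, B, C}
```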
5.1.2 Top-down join enumeration. TDMinCutLazy, proposed by DeHaan and Tompa [13], is the first efficient top-down join enumeration algorithm. They utilize the idea of minimal cuts to partition a join graph and introduce two different pruning strategies, predicted cost bounding and accumulated cost bounding, into the top-down partitioning search, which can avoid exhaustive enumeration. The top-down method is almost as efficient as dynamic programming and has other tempting properties, e.g., pruning and interesting orders. Fender and Moerkotte [18] propose an alternative top-down join enumeration strategy, TDMinCutBranch. TDMinCutBranch introduces a graph-based enumeration strategy which only generates valid joins, i.e., cross-product-free partitions, unlike TDMinCutLazy, which adopts a generate-and-test approach. In the following year, Fender et al. [21] proposed another top-down enumeration strategy, TDMinCutConservative, which is easier to implement and gives better runtime performance than TDMinCutBranch. Furthermore, Fender and Moerkotte [19, 20] present a general framework to handle non-inner joins and a more generalized graph, the hypergraph, in top-down join enumeration. RS-Graph, a new set of join transformation rules based on top-down join enumeration, is presented in [80] to efficiently generate the space of cross-product-free join trees.
Figure 4: One possible join order episode
5.1.3 Handling large queries. For large queries, Greedy Operator Ordering [17] builds the bushy join tree bottom-up by adding the most profitable joins (those with the smallest intermediate join size) first; a code sketch follows at the end of this subsection. To support large queries, Kossmann and Stocker [52] propose two different incremental dynamic programming methods, IDP-1 and IDP-2. With a given size k, IDP-1 runs the algorithm in [79] to construct the cheapest plan of that size, regards it as a base relation, and repeats the process. With a given size k, in every iteration IDP-2 first performs a greedy algorithm to construct a join tree with k tables, then runs DP on the generated join tree to produce the optimal plan, regards that plan as a base relation, and repeats the process. Neumann [72] proposes a two-stage algorithm: first, it performs query simplification, restricting the query graph by the greedy heuristic until the graph becomes tractable for DP; then it runs a DP algorithm to find the optimal join order for the simplified join graph. Bruno et al. [5] introduce an enumerate-rank-merge framework which generalizes and extends the previous heuristics [17, 86]. The enumeration step considers bushy trees; the ranking step evaluates the best join pair at each step, adopting the min-size metric; the merging step constructs the selected join pair. Neumann and Radke [73] divide queries into three types, small, medium, and large, according to their query graph type and the number of tables. They adopt DP for small queries, restrict the DP by linearizing the search space for medium queries, and use the idea of [52] for large queries.
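Greedy Operator Ordering [17] can be sketched in a few lines (ours; same toy cardinality model as the DP sketch above):

```python
# Sketch: Greedy Operator Ordering (GOO, [17]). Build a bushy tree by always
# merging the pair of subtrees whose join has the smallest estimated size.
from itertools import combinations

def join_card(c1, c2, s1, s2, sel):
    f = 1.0
    for a in s1:
        for b in s2:
            f *= sel.get(frozenset((a, b)), 1.0)
    return c1 * c2 * f

def goo(cards, sel):
    trees = {frozenset([r]): (c, r) for r, c in cards.items()}
    while len(trees) > 1:
        s1, s2 = min(combinations(trees, 2),
                     key=lambda p: join_card(trees[p[0]][0], trees[p[1]][0],
                                             p[0], p[1], sel))
        c1, p1 = trees.pop(s1)
        c2, p2 = trees.pop(s2)
        trees[s1 | s2] = (join_card(c1, c2, s1, s2, sel), f"({p1} JOIN {p2})")
    return next(iter(trees.values()))

CARD = {"A": 1000, "B": 100, "C": 10, "D": 5000}
SEL = {frozenset("AB"): 0.01, frozenset("BC"): 0.1, frozenset("CD"): 0.001}
print(goo(CARD, SEL))
```

Each step scans all remaining pairs, so the whole ordering needs only polynomially many pair evaluations, versus the exponential subset enumeration of DP; the price is that a greedy choice can lock in a suboptimal tree shape.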
5.1.4 Other methods. Trummer and Koch [89] transform the join ordering problem into a mixed integer linear program (MILP) to minimize the cost of the plan and adopt existing MILP solvers to obtain a linear join tree (left-deep tree). To satisfy the linearity requirements of MILP, they approximate the cost of scan and join operations via linear functions.

Most existing OLAP systems mainly focus on star/snowflake join queries (PK-FK join relations) and generate left-deep binary join trees. When handling FK-FK joins, as in TPC-DS (a snowstorm schema), they incur a large number of intermediate results. Nam et al. [71] introduce a new n-ary join operator, which extends the plan space. They define a core graph to represent the FK-FK joins in the join graph and adopt the n-ary join operator to process it. They design a new cost model for this operator and integrate it into an existing OLAP system.
5.2 Learning-based Methods

All learning-based methods adopt reinforcement learning (RL). In RL, an agent interacts with the environment through actions and rewards. At each step t, the agent uses a policy π to choose an action a_t according to the current state s_t and transitions to a new state s_{t+1}. Then the environment applies the action a_t and returns a reward r_t to the agent. The goal of RL is to learn a policy π, a function that automatically takes an action based on the current state, with the maximum long-term reward. In join order selection, the state is the current set of sub-trees, and an action combines two sub-trees, as in Figure 4. The reward of an intermediate action is 0, and the reward of the last action is the cost or latency of the query; a code sketch of this episode structure closes this subsection.

ReJoin [66] adopts deep reinforcement learning (DRL), which has been widely adopted in other areas, e.g., influence maximization [87], to identify the optimal join order. The state in DRL represents the current subtrees, and each action combines two subtrees into a single tree. It uses the cost obtained from the cost model in the optimizer as the reward. ReJoin encodes the tree structure of the (sub)plan, join predicates, and selection predicates in the state. Unlike [66], Heitz and Stockinger [40] create a matrix representing a table or a subquery in each row and adopt the cost model in [56] to quickly obtain the cost of a query. DQ [53] is also a DRL-based method. It uses one-hot vectors to encode the visible attributes in the (sub)query and also encodes the choice of physical operator with another one-hot vector. When training the model, DQ first uses the cost observed from the cost model of the optimizer and then fine-tunes the model with the true running time. Yu et al. [106] adopt DRL and a tree-LSTM together for join order selection. Unlike the previous methods [40, 53, 66], the tree-LSTM can capture more of the structural information of the query tree. Similar to [53], they also use the cost to train the model and then switch to the running time as feedback for fine-tuning. Notably, they also discuss how to handle changes in the database schema, e.g., adding/deleting tables/columns. SkinnerDB [90] adopts UCT, a reinforcement learning algorithm, and learns from the current query, while the previous learning-based join order methods learn from previous queries. It divides the execution of one query into many small slices, where different slices may choose different join orders, and learns from the previously executed slices.
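The episode structure in Figure 4 can be emulated with a tiny tabular Q-learning loop (our own sketch; real systems [53, 66, 106] replace the table with neural function approximation over featurized states, and the toy reward here, the negative sum of intermediate sizes, arrives only at episode end):

```python
# Sketch: join ordering as an RL episode (cf. Figure 4). State = current set
# of subtrees; action = merge two subtrees; reward arrives only at the end.
# We store cost directly and pick the min-Q action, which is equivalent to
# maximizing reward = -cost. Tabular Q-learning stands in for neural policies.
import random
from itertools import combinations

CARD = {"A": 1000, "B": 100, "C": 10}
SEL = {frozenset("AB"): 0.01, frozenset("BC"): 0.1}

def join_size(s1, s2, c1, c2):
    f = 1.0
    for a in s1:
        for b in s2:
            f *= SEL.get(frozenset((a, b)), 1.0)
    return c1 * c2 * f

Q, alpha, eps = {}, 0.5, 0.2
for episode in range(2000):
    trees = {frozenset([r]): c for r, c in CARD.items()}
    taken, total_cost = [], 0.0
    while len(trees) > 1:
        state = frozenset(trees)
        actions = list(combinations(
            sorted(trees, key=lambda s: tuple(sorted(s))), 2))
        if random.random() < eps:                      # explore
            act = random.choice(actions)
        else:                                          # exploit: min expected cost
            act = min(actions, key=lambda a: Q.get((state, a), 0.0))
        size = join_size(act[0], act[1], trees[act[0]], trees[act[1]])
        total_cost += size                             # cost = sum of join sizes
        trees[act[0] | act[1]] = size
        del trees[act[0]], trees[act[1]]
        taken.append((state, act))
    for state, act in taken:                           # episode-end update
        old = Q.get((state, act), 0.0)
        Q[(state, act)] = (1 - alpha) * old + alpha * total_cost

start = frozenset(frozenset([r]) for r in CARD)
acts = list(combinations(sorted(start, key=lambda s: tuple(sorted(s))), 2))
print(min(acts, key=lambda a: Q.get((start, a), 0.0)))  # learned first merge
```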
5.3 Summary

The non-learning studies focus on improving the efficiency and the ability (to handle more general join cases) of the existing approaches. Compared with the dynamic programming approach, the top-down strategy is tempting due to its better extensibility, e.g., adding new transformation rules and branch-and-bound pruning. Both have been implemented in many database systems.

Compared with the non-learning methods, learning-based approaches have a fast planning time. All learning-based methods employ reinforcement learning. The main differences between them are: (1) which information is chosen as the state and how it is encoded, and (2) which model is adopted. A more complicated model with more related information can achieve better performance. The state-of-the-art method [106] adopts a tree-LSTM model similar to [85] to generate the representation of a subplan. Due to the inaccuracy of the cost model, the quality of the model can be improved by using the latency to fine-tune it. Although the current state-of-the-art method [106] outperforms the non-learning methods in their experiments, how to integrate learning-based methods into a real system remains to be solved.

5.4 Future Directions

There are two possible directions as follows:

(1) Handling large queries. All methods proposed to handle large queries are DP-based methods in the bottom-up manner. A question remains: how to make the top-down search strategy support large queries. Besides, the state-of-the-art method for large queries [73] cannot support general join cases.

(2) Learning-based methods. Current learned methods only focus on PK-FK joins, and the join type is the inner join. How to handle the other join cases is a possible direction. None of the proposed methods discuss how to integrate them into a real system. In their experiments, they implement the method as a separate component to get the right join order, which is then sent to the database; the database still needs to optimize it to get the final physical plan, and if a query has subqueries, they may interact multiple times. Reinforcement learning methods are trained in a certain environment, which refers to a certain database in the join order selection problem. How to handle changes in the table schemas and data is also a possible direction.
CONCLUSION
Cardinality estimation, cost model, and plan enumeration play critical roles in generating an optimal execution plan in a cost-based optimizer. In this paper, we reviewed the work proposed to improve their quality, covering both traditional and learning-based methods, and provided possible future directions for each of them.
We observe that more and more learning-based methods have been introduced and outperform traditional methods. However, they suffer from long training and updating times. How to make the models robust to workload shifts and data changes, or to update the models quickly, is still an open question. Traditional methods with theoretical guarantees are widely adopted in real systems, and there is great potential for improving them with new algorithms and data structures. Moreover, we believe the ideas behind the traditional methods can be used to enhance the learning-based methods.
ACKNOWLEDGMENTS
Zhifeng Bao is supported in part by ARC DP200102611, DP180102050, and a Google Faculty Award.
REFERENCES
[1] Jayadev Acharya, Ilias Diakonikolas, Chinmay Hegde, Jerry Zheng Li, and Ludwig Schmidt. 2015. Fast and Near-Optimal Algorithms for Approximating Distributions by Histograms. In PODS. 249–263.
[2] Mert Akdere, Ugur Çetintemel, Matteo Riondato, Eli Upfal, and Stanley B. Zdonik. 2012. Learning-based Query Performance Modeling and Prediction. In ICDE. 390–401.
[3] Jihad Boulos and Kinji Ono. 1999. Cost Estimation of User-Defined Methods in Object-Relational Database Systems. SIGMOD Rec. 28, 3 (1999), 22–28.
[4] Jihad Boulos, Yann Viemont, and Kinji Ono. 1997. A neural networks approach for query cost evaluation. Transaction of Information Processing Society of Japan 38, 12 (1997), 2566–2575.
[5] Nicolas Bruno, César A. Galindo-Legaria, and Milind Joshi. 2010. Polynomial heuristics for query optimization. In ICDE. 589–600.
[6] Walter Cai, Magdalena Balazinska, and Dan Suciu. 2019. Pessimistic Cardinality Estimation: Tighter Upper Bounds for Intermediate Join Cardinalities. In SIGMOD. 18–35.
[7] Ronnie Chaiken, Bob Jenkins, Per-Åke Larson, Bill Ramsey, Darren Shakib, Simon Weaver, and Jingren Zhou. 2008. SCOPE: easy and efficient parallel processing of massive data sets. VLDB 1, 2 (2008), 1265–1276.
[8] Surajit Chaudhuri. 1998. An Overview of Query Optimization in Relational Systems. In SIGMOD. 34–43.
[9] Yu Chen and Ke Yi. 2017. Two-Level Sampling for Join Size Estimation. In SIGMOD. 759–774.
[10] Graham Cormode, Minos N. Garofalakis, Peter J. Haas, and Chris Jermaine. 2012. Synopses for Massive Data: Samples, Histograms, Wavelets, Sketches. Found. Trends Databases 4, 1-3 (2012), 1–294.
[11] Graham Cormode and S. Muthukrishnan. 2004. An Improved Data Stream Summary: The Count-Min Sketch and Its Applications. In LATIN 2004.
SIGMOD. 785–796.
[14] Anshuman Dutt, Chi Wang, Azade Nazi, Srikanth Kandula, Vivek R. Narasayya, and Surajit Chaudhuri. 2019. Selectivity Estimation for Range Predicates using Lightweight Models. VLDB 12, 9 (2019), 1044–1057.
[15] Todd Eavis and Alex Lopez. 2007. Rk-hist: an r-tree based histogram for multidimensional selectivity estimation. In CIKM. 475–484.
[16] Cristian Estan and Jeffrey F. Naughton. 2006. End-biased Samples for Join Cardinality Estimation. In ICDE. 20.
[17] Leonidas Fegaras. 1998. A New Heuristic for Optimizing Large Queries. In DEXA. 726–735.
[18] Pit Fender and Guido Moerkotte. 2011. A new, highly efficient, and easy to implement top-down join enumeration algorithm. In ICDE. 864–875.
[19] Pit Fender and Guido Moerkotte. 2013. Counter Strike: Generic Top-Down Join Enumeration for Hypergraphs. VLDB 6, 14 (2013), 1822–1833.
[20] Pit Fender and Guido Moerkotte. 2013. Top down plan generation: From theory to practice. In ICDE. 1105–1116.
[21] Pit Fender, Guido Moerkotte, Thomas Neumann, and Viktor Leis. 2012. Effective and Robust Pruning for Top-Down Join Enumeration Algorithms. In ICDE. 414–425.
[22] Philippe Flajolet, Eric Fusy, Olivier Gandouet, and Frederic Meunier. 2007. HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm. Discrete Mathematics and Theoretical Computer Science (2007), 137–156.
[23] Archana Ganapathi, Harumi A. Kuno, Umeshwar Dayal, Janet L. Wiener, Armando Fox, Michael I. Jordan, and David A. Patterson. 2009. Predicting Multiple Metrics for Queries: Better Decisions Enabled by Machine Learning. In ICDE. 592–603.
[24] Sumit Ganguly, Minos N. Garofalakis, and Rajeev Rastogi. 2004. Processing Data-Stream Join Aggregates Using Skimmed Sketches. In EDBT. 569–586.
[25] Sumit Ganguly, Phillip B. Gibbons, Yossi Matias, and Abraham Silberschatz. 1996. Bifocal Sampling for Skew-Resistant Join Size Estimation. In SIGMOD. 271–281.
[26] Lise Getoor, Benjamin Taskar, and Daphne Koller. 2001. Selectivity Estimation using Probabilistic Models. In SIGMOD. 461–472.
[27] Goetz Graefe. 1995. The Cascades Framework for Query Optimization. IEEE Data Eng. Bull. 18, 3 (1995), 19–29.
[28] Goetz Graefe and William J. McKenna. 1993. The Volcano Optimizer Generator: Extensibility and Efficient Search. In ICDE. 209–218.
[29] Sudipto Guha, Nick Koudas, and Kyuseok Shim. 2006. Approximation and streaming algorithms for histogram construction problems. ACM Trans. Database Syst. 31, 1 (2006), 396–438.
[30] Dimitrios Gunopulos, George Kollios, Vassilis J. Tsotras, and Carlotta Domeniconi. 2005. Selectivity estimators for multidimensional range queries over real attributes. VLDB J. 14, 2 (2005), 137–154.
[31] Peter J. Haas, Jeffrey F. Naughton, S. Seshadri, and Arun N. Swami. 1993. Fixed-Precision Estimation of Join Selectivity. In PODS. 190–201.
[32] Max Halford, Philippe Saint-Pierre, and Franck Morvan. 2019. An Approach Based on Bayesian Networks for Query Selectivity Estimation. In DASFAA. 3–19.
[33] Felix Halim, Panagiotis Karras, and Roland H. C. Yap. 2009. Fast and effective histogram construction. In CIKM. 1167–1176.
[34] Felix Halim, Panagiotis Karras, and Roland H. C. Yap. 2010. Local Search in Histogram Construction. In AAAI.
[35] Hazar Harmouch and Felix Naumann. 2017. Cardinality Estimation: An Experimental Survey. VLDB 11, 4 (2017), 499–512.
[36] Shohedul Hasan, Saravanan Thirumuruganathan, Jees Augustine, Nick Koudas, and Gautam Das. 2020. Deep Learning Models for Selectivity Estimation of Multi-Attribute Queries. In SIGMOD. 1035–1050.
[37] Zhen He, Byung Suk Lee, and Robert R. Snapp. 2004. Self-tuning UDF Cost Modeling Using the Memory-Limited Quadtree. In EDBT. 513–531.
[38] Zhen He, Byung Suk Lee, and Robert R. Snapp. 2005. Self-tuning cost modeling of user-defined functions in an object-relational DBMS. TODS 30, 3 (2005), 812–853.
[39] Max Heimel, Martin Kiefer, and Volker Markl. 2015. Self-Tuning, GPU-Accelerated Kernel Density Models for Multidimensional Selectivity Estimation. In SIGMOD. 1477–1492.
[40] Jonas Heitz and Kurt Stockinger. 2019. Join Query Optimization with Deep Reinforcement Learning Algorithms. CoRR abs/1911.11689 (2019).
[41] Benjamin Hilprecht, Andreas Schmidt, Moritz Kulessa, Alejandro Molina, Kristian Kersting, and Carsten Binnig. 2020. DeepDB: Learn from Data, not from Queries! VLDB 13, 7 (2020), 992–1005.
[42] Toshihide Ibaraki and Tiko Kameda. 1984. On the Optimal Nesting Order for Computing N-Relational Joins. ACM Trans. Database Syst. 9, 3 (1984), 482–502.
[43] Piotr Indyk, Reut Levi, and Ronitt Rubinfeld. 2012. Approximating and testing k-histogram distributions in sub-linear time. In PODS. 15–22.
[44] Yannis E. Ioannidis. 2003. The History of Histograms (abridged). In VLDB. 19–30.
[45] H. V. Jagadish, Nick Koudas, S. Muthukrishnan, Viswanath Poosala, Kenneth C. Sevcik, and Torsten Suel. 1998. Optimal Histograms with Quality Guarantees. In VLDB. 275–286.
[46] Zoi Kaoudi, Jorge-Arnulfo Quiané-Ruiz, Bertty Contreras-Rojas, Rodrigo Pardo-Meza, Anis Troudi, and Sanjay Chawla. 2020. ML-based Cross-Platform Query Optimization. In ICDE. 1489–1500.
[47] Zisis Karampaglis, Anastasios Gounaris, and Yannis Manolopoulos. 2014. A Bi-objective Cost Model for Database Queries in a Multi-cloud Environment. In MEDES. 109–116.
[48] Raghav Kaushik and Dan Suciu. 2009. Consistent Histograms In The Presence of Distinct Value Counts. VLDB 2, 1 (2009), 850–861.
[49] Martin Kiefer, Max Heimel, Sebastian Breß, and Volker Markl. 2017. Estimating Join Selectivities using Bandwidth-Optimized Kernel Density Models. VLDB.
CIDR.
[51] Andreas Kipf, Dimitri Vorona, Jonas Müller, Thomas Kipf, Bernhard Radke, Viktor Leis, Peter A. Boncz, Thomas Neumann, and Alfons Kemper. 2019. Estimating Cardinalities with Deep Sketches. CoRR abs/1904.08223 (2019).
[52] Donald Kossmann and Konrad Stocker. 2000. Iterative dynamic programming: a new class of query optimization algorithms. TODS 25, 1 (2000), 43–82.
[53] Sanjay Krishnan, Zongheng Yang, Ken Goldberg, Joseph M. Hellerstein, and Ion Stoica. 2018. Learning to Optimize Join Queries With Deep Reinforcement Learning. CoRR abs/1808.03196 (2018).
[54] M. Seetha Lakshmi and Shaoyu Zhou. 1998. Selectivity Estimation in Extensible Databases - A Neural Network Approach. In VLDB. 623–627.
[55] Jyoti Leeka and Kaushik Rajan. 2019. Incorporating Super-Operators in Big-Data Query Optimizers. VLDB 13, 3 (2019), 348–361.
[56] Viktor Leis, Andrey Gubichev, Atanas Mirchev, Peter A. Boncz, Alfons Kemper, and Thomas Neumann. 2015. How Good Are Query Optimizers, Really? VLDB 9, 3 (2015), 204–215.
[57] Jiexing Li, Arnd Christian König, Vivek R. Narasayya, and Surajit Chaudhuri. 2012. Robust Estimation of Resource Consumption for SQL Queries using Statistical Techniques. VLDB 5, 11 (2012), 1555–1566.
[58] Lipyeow Lim, Min Wang, and Jeffrey Scott Vitter. 2003. SASH: A Self-Adaptive Histogram Set for Dynamically Changing Workloads. In VLDB. 369–380.
[59] Xudong Lin, Xiaoning Zeng, Xiaowei Pu, and Yanyan Sun. 2015. A Cardinality Estimation Approach Based on Two Level Histograms. J. Inf. Sci. Eng. 31, 5 (2015), 1733–1756.
[60] Feilong Liu and Spyros Blanas. 2015. Forecasting the cost of processing multi-join queries via hashing for main-memory databases. In SoCC. 153–166.
[61] Henry Liu, Mingbin Xu, Ziting Yu, Vincent Corvinelli, and Calisto Zuzarte. 2015. Cardinality estimation using neural networks. In CASCON. 53–59.
[62] Guy Lohman. 2014. Is Query Optimization a "Solved" Problem? http://wp.sigmod.org/?p=1075. Accessed June 10, 2020.
[63] Lin Ma, Bailu Ding, Sudipto Das, and Adith Swaminathan. 2020. Active Learning for ML Enhanced Database Systems. In SIGMOD. 175–191.
[64] Tanu Malik, Randal C. Burns, and Nitesh V. Chawla. 2007. A Black-Box Approach to Query Cardinality Estimation. In CIDR. 56–67.
[65] Stefan Manegold, Peter A. Boncz, and Martin L. Kersten. 2002. Generic Database Cost Models for Hierarchical Memory Systems. In VLDB. 191–202.
[66] Ryan Marcus and Olga Papaemmanouil. 2018. Deep Reinforcement Learning for Join Order Enumeration. In aiDM@SIGMOD. 3:1–3:4.
[67] Ryan C. Marcus, Parimarjan Negi, Hongzi Mao, Chi Zhang, Mohammad Alizadeh, Tim Kraska, Olga Papaemmanouil, and Nesime Tatbul. 2019. Neo: A Learned Query Optimizer. VLDB 12, 11 (2019), 1705–1718.
[68] Ryan C. Marcus and Olga Papaemmanouil. 2019. Plan-Structured Deep Neural Network Models for Query Performance Prediction. VLDB 12, 11 (2019), 1733–1746.
[69] Guido Moerkotte and Thomas Neumann. 2006. Analysis of Two Existing and One New Dynamic Programming Algorithm for the Generation of Optimal Bushy Join Trees without Cross Products. In VLDB. 930–941.
[70] Guido Moerkotte and Thomas Neumann. 2008. Dynamic programming strikes back. In SIGMOD. 539–552.
[71] Yoon-Min Nam, Donghyoung Han, and Min-Soo Kim. 2020. SPRINTER: A Fast n-ary Join Query Processing Method for Complex OLAP Queries. In SIGMOD. 2055–2070.
[72] Thomas Neumann. 2009. Query simplification: graceful degradation for join-order optimization. In SIGMOD. 403–414.
[73] Thomas Neumann and Bernhard Radke. 2018. Adaptive Optimization of Very Large Join Queries. In SIGMOD. 677–692.
[74] Kiyoshi Ono and Guy M. Lohman. 1990. Measuring the Complexity of Join Enumeration in Query Optimization. In VLDB. 314–325.
[75] Jennifer Ortiz, Magdalena Balazinska, Johannes Gehrke, and S. Sathiya Keerthi. 2019. An Empirical Analysis of Deep Learning for Cardinality Estimation. CoRR abs/1905.06425 (2019).
[76] Yongjoo Park, Shucheng Zhong, and Barzan Mozafari. 2020. QuickSel: Quick Selectivity Learning with Mixture Models. In SIGMOD. 1017–1033.
[77] Viswanath Poosala, Yannis E. Ioannidis, Peter J. Haas, and Eugene J. Shekita. 1996. Improved Histograms for Selectivity Estimation of Range Predicates. In SIGMOD. 294–305.
[78] Florin Rusu and Alin Dobra. 2008. Sketches for size of join estimation. ACM Trans. Database Syst. 33, 3 (2008), 15:1–15:46.
[79] Patricia G. Selinger, Morton M. Astrahan, Donald D. Chamberlin, Raymond A. Lorie, and Thomas G. Price. 1979. Access Path Selection in a Relational Database Management System. In SIGMOD. 23–34.
[80] Anil Shanbhag and S. Sudarshan. 2014. Optimizing Join Enumeration in Transformation-based Query Optimizers. VLDB 7, 12 (2014), 1243–1254.
[81] Tarique Siddiqui, Alekh Jindal, Shi Qiao, Hiren Patel, and Wangchao Le. 2020. Cost Models for Big Data Query Processing: Learning, Retrofitting, and Our Findings. In SIGMOD. 99–113.
[82] Joshua Spiegel and Neoklis Polyzotis. 2006. Graph-based synopses for relational selectivity estimation. In SIGMOD. 205–216.
[83] Utkarsh Srivastava, Peter J. Haas, Volker Markl, Marcel Kutsch, and Tam Minh Tran. 2006. ISOMER: Consistent Histogram Construction Using Query Feedback. In ICDE. 39.
[84] Michael Steinbrunn, Guido Moerkotte, and Alfons Kemper. 1997. Heuristic and Randomized Optimization for the Join Ordering Problem. VLDB J. 6, 3 (1997), 191–208.
[85] Ji Sun and Guoliang Li. 2020. An End-to-End Learning-based Cost Estimator. VLDB 13, 3 (2020), 307–319.
[86] Arun N. Swami. 1989. Optimization of Large Join Queries: Combining Heuristic and Combinatorial Techniques. In SIGMOD. 367–376.
[87] Shan Tian, Songsong Mo, Liwei Wang, and Zhiyong Peng. 2020. Deep Reinforcement Learning-Based Approach to Tackle Topic-Aware Influence Maximization. Data Sci. Eng. 5, 1 (2020), 1–11.
[88] Hien To, Kuorong Chiang, and Cyrus Shahabi. 2013. Entropy-based histograms for selectivity estimation. In CIKM. 1939–1948.
[89] Immanuel Trummer and Christoph Koch. 2017. Solving the Join Ordering Problem via Mixed Integer Linear Programming. In SIGMOD. 1025–1040.
[90] Immanuel Trummer, Junxiong Wang, Deepak Maram, Samuel Moseley, Saehan Jo, and Joseph Antonakakis. 2019. SkinnerDB: Regret-Bounded Query Evaluation via Reinforcement Learning. In SIGMOD. 1153–1170.
[91] Kostas Tzoumas, Amol Deshpande, and Christian S. Jensen. 2011. Lightweight Graphical Models for Selectivity Estimation Without Independence Assumptions. VLDB 4, 11 (2011), 852–863.
[92] Kostas Tzoumas, Amol Deshpande, and Christian S. Jensen. 2013. Efficiently adapting graphical models for selectivity estimation. VLDB J. 22, 1 (2013), 3–27.
[93] Bennet Vance and David Maier. 1996. Rapid Bushy Join-order Optimization with Cartesian Products. In SIGMOD. 35–46.
[94] David Vengerov, Andre Cavalheiro Menck, Mohamed Zaït, and Sunil Chakkappen. 2015. Join Size Estimation Subject to Filter Conditions. VLDB 8, 12 (2015), 1530–1541.
[95] TaiNing Wang and Chee-Yong Chan. 2020. Improved Correlated Sampling for Join Size Estimation. In ICDE. 325–336.
[96] Lucas Woltmann, Claudio Hartmann, Dirk Habich, and Wolfgang Lehner. 2020. Machine Learning-based Cardinality Estimation in DBMS on Pre-Aggregated Data. CoRR abs/2005.09367 (2020).
[97] Lucas Woltmann, Claudio Hartmann, Maik Thiele, Dirk Habich, and Wolfgang Lehner. 2019. Cardinality estimation with local deep learning models. In aiDM@SIGMOD. 5:1–5:8.
[98] Chenggang Wu, Alekh Jindal, Saeed Amizadeh, Hiren Patel, Wangchao Le, Shi Qiao, and Sriram Rao. 2018. Towards a Learning Optimizer for Shared Clouds. VLDB 12, 3 (2018), 210–222.
[99] Wentao Wu, Yun Chi, Shenghuo Zhu, Jun'ichi Tatemura, Hakan Hacigümüs, and Jeffrey F. Naughton. 2013. Predicting query execution time: Are optimizer cost models really unusable? In ICDE. 1081–1092.
[100] Wentao Wu, Jeffrey F. Naughton, and Harneet Singh. 2016. Sampling-Based Query Re-Optimization. In SIGMOD. 1721–1736.
[101] Yang Yang, Wenjie Zhang, Ying Zhang, Xuemin Lin, and Liping Wang. 2019. Selectivity Estimation on Set Containment Search. Data Sci. Eng. 4, 3 (2019), 254–268.
[102] Yang Yang, Wenjie Zhang, Ying Zhang, Xuemin Lin, and Liping Wang. 2019. Selectivity Estimation on Set Containment Search. In DASFAA, Part I. 330–349.
[103] Zongheng Yang, Amog Kamsetty, Sifei Luan, Eric Liang, Yan Duan, Xi Chen, and Ion Stoica. 2020. NeuroCard: One Cardinality Estimator for All Tables. CoRR abs/2006.08109 (2020).
[104] Zongheng Yang, Eric Liang, Amog Kamsetty, Chenggang Wu, Yan Duan, Peter Chen, Pieter Abbeel, Joseph M. Hellerstein, Sanjay Krishnan, and Ion Stoica. 2019. Deep Unsupervised Cardinality Estimation. VLDB 13, 3 (2019), 279–292.
[105] Feng Yu, Wen-Chi Hou, Cheng Luo, Dunren Che, and Mengxia Zhu. 2013. CS2: a new database synopsis for query estimation. In SIGMOD. 469–480.
[106] Xiang Yu, Guoliang Li, Chengliang Chai, and Nan Tang. 2020. Reinforcement Learning with Tree-LSTM for Join Order Selection. In ICDE. 1297–1308.
[107] X. Zhou, C. Chai, G. Li, and J. Sun. 2020. Database Meets Artificial Intelligence: A Survey. TKDE (2020), 1–1.
[108] Xuanhe Zhou, Ji Sun, Guoliang Li, and Jianhua Feng. 2020. Query Performance Prediction for Concurrent Queries using Graph Embedding.