Flow-Loss: Learning Cardinality Estimates That Matter

Parimarjan Negi, Ryan Marcus, Andreas Kipf, Hongzi Mao, Nesime Tatbul, Tim Kraska, Mohammad Alizadeh
MIT CSAIL, Intel Labs
{pnegi, rcmarcus, kipf, hongzi, tatbul, kraska, alizadeh}@mit.edu

ABSTRACT
Previous approaches to learned cardinality estimation have focused on improving average estimation error, but not all estimates matter equally. Since learned models inevitably make mistakes, the goal should be to improve the estimates that make the biggest difference to an optimizer. We introduce a new loss function, Flow-Loss, that explicitly optimizes for better query plans by approximating the optimizer's cost model and dynamic programming search algorithm with analytical functions. At the heart of Flow-Loss is a reduction of query optimization to a flow routing problem on a certain plan graph in which paths correspond to different query plans. To evaluate our approach, we introduce the Cardinality Estimation Benchmark (CEB), which contains the ground truth cardinalities for sub-plans of over 16K queries from 21 templates with up to 15 joins. We show that across different architectures and databases, a model trained with Flow-Loss improves the cost of plans (using the PostgreSQL cost model) and query runtimes despite having worse estimation accuracy than a model trained with Q-Error. When the test set queries closely match the training queries, both models improve performance significantly over PostgreSQL and are close to the optimal performance (using true cardinalities). However, the Q-Error trained model degrades significantly when evaluated on queries that are slightly different (e.g., similar but not identical query templates), while the Flow-Loss trained model generalizes better to such situations. For example, the Flow-Loss model achieves up to 1.× better runtimes on unseen templates compared to the Q-Error model, despite leveraging the same model architecture and training data.
1. INTRODUCTION
Cardinality estimation is a core task in query optimization for predicting the sizes of sub-plans, which are intermediate operator trees needed during query optimization. Query optimizers use these estimates to compare alternative query plans according to a cost model and find the cheapest plan. Recently, machine learning approaches to cardinality estimation have been successful in improving estimation accuracy [18, 48, 10, 14, 51], but they largely neglect the impact of improved estimates on the generated query plans. This is the first work (known to us) that learns cardinality estimates by directly optimizing for the cost of query plans generated by an optimizer.

All learned models will have non-trivial estimation errors due to limitations in model capacity, featurization, training
data, and differences between training and testing conditions (e.g., due to changing workloads). Therefore it is crucial to understand which errors are more acceptable for the optimizer. Unsupervised models learn from the data, but they will use model capacity for sub-plans that never occur since they treat every potential query as equally likely. Supervised models require representative workloads, but learn more efficiently by focusing model capacity on likely sub-plans. However, all estimates are not equally important. While an optimizer's decisions may be very sensitive to estimates for some sub-plans (e.g., the join of two large tables), other estimates may have no impact on its decisions.

Figure 1: For this example, we use the sum of the cardinalities as the cost of a plan. With true cardinality values, Plan 1 is cheaper than Plan 2. This is also the case with Estimator 1. Interestingly, however, although Estimator 2's cardinality values have smaller error than those of Estimator 1, they will mislead the optimizer to choose Plan 2.

As a drop-in replacement for the well-known Q-Error [31] loss function used to train supervised cardinality estimation models, we propose
Flow-Loss, a loss function that explicitly emphasizes estimates that matter to query performance for a given workload. Flow-Loss takes the idea of focusing model capacity to its logical extreme, encouraging better estimates only if they improve the resulting query plans. For instance, consider Figure 1: Estimator 2 corrects Estimator 1's estimate of A ⋈ C, but it actually leads to a worse plan (Plan 2), because the relative cardinalities (A ⋈ B vs. A ⋈ C) are incorrect. A loss function using Flow-Loss will show no error for Estimator 1, while nudging Estimator 2 to correct the relative cardinalities of these two joins.

At its core, Flow-Loss computes the gradient of the cost of a query plan w.r.t. the cardinality estimates used to generate the plan. To do this, we assume a simplified cost model and recast the dynamic programming (DP) algorithm for query optimization as a shortest path problem, which we approximate with a smooth and differentiable analytical objective function. This lets us use gradient descent based techniques to improve the estimates that are most relevant to improving the query plans. We show that improving cardinalities w.r.t. this objective also improves the quality of plans of the complex PostgreSQL cost model and optimizer.

There are two main benefits of training models to minimize Flow-Loss. First, similar to how attention-based models in natural language processing [45] treat certain parts of the input as more important than others, Flow-Loss highlights which sub-plans are most relevant to the query optimizer. This helps a model focus its limited capacity on robustly estimating the sizes of such sub-plans. Across various scenarios, we show that Flow-Loss trained models have worse average estimation accuracy than Q-Error trained models, but improve the cost of generated plans.
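The Figure 1 effect can be sketched numerically. The cardinalities and the first-join cost proxy below are illustrative stand-ins, not the figure's actual values:

```python
# Illustrative sketch of the Figure 1 point: smaller estimation error
# does not imply a better plan choice. All numbers are hypothetical.

true_cards = {"A⋈B": 100, "A⋈C": 500}
est1 = {"A⋈B": 10, "A⋈C": 900}    # large errors, correct relative order
est2 = {"A⋈B": 150, "A⋈C": 120}   # small errors, wrong relative order

def choose_plan(cards):
    # A toy optimizer: start with whichever two-way join looks cheapest.
    return min(cards, key=cards.get)

assert choose_plan(true_cards) == "A⋈B"
assert choose_plan(est1) == "A⋈B"   # Estimator 1 still picks the right plan
assert choose_plan(est2) == "A⋈C"   # Estimator 2: lower error, worse plan
```

What matters to the toy optimizer is only the relative order of the two estimates, which Estimator 2 gets wrong despite its smaller absolute errors.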
For instance, we show in an ablation study that models trained with Flow-Loss can adapt to removing various components of the featurization scheme, and still do equally well. Meanwhile, ablations cause the Q-Error models to get up to 2× worse w.r.t. PostgreSQL costs. Second, by having a larger tolerance for errors on less critical sub-plans, training with Flow-Loss can avoid overfitting the model to cardinalities for which precise estimates are not needed, thereby leading to simpler models without sacrificing query performance. Such simpler models typically generalize better. We show that models trained using Q-Error can be brittle, and can lead to significant regressions when the query workload diverges slightly from the training queries; for example, achieving up to 1.5× slower average runtimes than the default estimates in PostgreSQL, despite better overall estimation accuracy. Meanwhile, models trained using Flow-Loss do not see such drops in performance, and typically improve over PostgreSQL estimates, even across different workloads than those used in training. Our key contributions are:

• Postgres Plan Cost (PPC). Based on Moerkotte et al.'s [31] cost model-based proxy for the runtime of a query plan, we introduce PPC as a metric to evaluate the goodness of cardinality estimates in terms of their impact on query optimization. We show that it corresponds closely to runtimes, and provide an implementation to easily evaluate the performance of cardinality estimation models on PPC using PostgreSQL.

• Flow-Loss. We introduce Flow-Loss, a smooth and differentiable approximation of Postgres Plan Cost which can be optimized by any supervised learning model with gradient descent.

• Cardinality Estimation Benchmark (CEB). We create a new tool to generate challenging queries based on templates in a semi-automated way. We use this to create the Cardinality Estimation Benchmark, which is over 100× larger than the Join Order Benchmark (JOB) [22], and has more complex queries.

• Training with AQP estimates. A challenge for using supervised cardinality estimation models in practice is that collecting ground truth data is expensive. However, precise estimates are not needed for near-optimal plans. We show that almost equally good query plans can be generated using models trained with Flow-Loss on data collected using approximate query processing (AQP), which is 10−× faster than computing true values. Since Q-Error tends to overfit, it is less robust to noisy training data generated via AQP. In fact, when using AQP training data, at the 99th percentile, models trained with Q-Error have over 10× worse Q-Error and PostgreSQL costs than models trained with Flow-Loss.
2. RELATED WORK
For cardinality estimation, traditional approaches have used histograms [6], sampling [23], kernel density estimation [16], wavelets [30], or singular value decomposition [38]. Recently, machine learning approaches have shown high estimation accuracy. Many works focus on single-table selectivity estimates [36, 51, 13, 10], but while this is useful in other contexts, such as approximate query processing, it is non-trivial to extend such models to joins using join sampling [54]. Learned cardinality estimation for joins can be categorized into unsupervised (data-driven, independent of query workload) and supervised (query-driven) approaches. Unsupervised approaches for cardinality estimation include Probabilistic Graphical Models [12, 44], Sum-Product Networks [14], or deep autoregressive models [50]. NeuroCard [50] is the most advanced of these approaches, but it still does not support the complex analytical workloads studied in this work (e.g., queries with self joins). That being said, any unsupervised model can be integrated into our approach by providing their estimates as features.

Supervised approaches use queries with their true cardinalities as training data to build a regression model. Our work builds on the approach pioneered by Kipf et al. [18], which uses a single deep learning model for the whole workload. While several supervised learning-based works report improved estimation accuracy [34, 18, 48, 49, 10, 9], only a few actually demonstrate improved query performance [15, 35, 33]. Our approach seeks to learn the cardinalities used by a traditional DBMS optimizer, while using the optimizer's search and cost algorithms for query optimization.
Recently, there have been several other learning approaches to improve query performance which are complementary to our methods: learning the complete optimizer [28, 29, 20], learning to use the optimizer's hints [27], learning the cost model [41], re-optimization [37, 43], and bounding worst case cardinalities to avoid bad plans [7].
3. OVERVIEW
In this section, we will provide the high-level intuition behind our approach, which will be formalized in the next sections. We target supervised learning methods that use a parametric model, such as a neural network, to estimate cardinalities for the sub-plans required to optimize a given query. Today, such models are trained using loss functions that compare true and estimated cardinalities for a given sub-plan, such as Q-Error.
Definition 3.1.
Q-Error.

Q-Error(y_true, y_est) = max(y_true / y_est, y_est / y_true). (1)

Such a loss function treats every estimate as equally important. Instead, we want a loss function that will focus model capacity on improving the accuracy of estimates that matter most to the quality of the plans produced by the optimizer, while tolerating larger errors for other estimates.

(Figure 2 depicts the query SELECT * FROM A, B, C WHERE A.b1 = B.b1 AND A.c1 = C.c1 with cardinalities |A| = 4, |B| = 2, |C| = 2, |A ⋈ B| = 10, |A ⋈ C| = 8.)
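Equation 1 as a one-liner:

```python
def q_error(y_true, y_est):
    # Definition 3.1: symmetric multiplicative error, minimized at 1.
    return max(y_true / y_est, y_est / y_true)

assert q_error(100, 10) == 10.0   # 10x under-estimate
assert q_error(10, 100) == 10.0   # 10x over-estimate, same penalty
assert q_error(50, 50) == 1.0     # perfect estimate
```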
Figure 2:
The query optimization process has two non-differentiable components: the cost model and the plan search algorithm. We develop differentiable approximations for these so we can understand how sensitive query plans are to changes in cardinality estimates.

This loss function will need to be differentiable so we can optimize it using standard gradient descent methods.

To understand how cardinality estimates impact the resulting query plan, let us consider the basic structure of a query optimizer. There are two independent components, as highlighted in Figure 2: (i) a cost model, which outputs a cost for every join given the cardinality estimates for all sub-plans; (ii) a DP search algorithm, which finds the cheapest query plan. Our goal is to approximate both components using analytical functions that can be combined into a single, differentiable loss function:

Y_est --C(·)--> Join-Costs --S(·)--> Plan. (2)

Here C(·) maps the cardinality estimates, Y_est, to the cost of each join, and S(·) maps the join costs to the optimal plan. Approximating the cost model as an analytical function is straightforward since it is already represented using analytical expressions. In principle, we can make this function as precise as we want, but we found that a very simple approximation with terms to cost joins with or without indexes works well in our workloads (Definition 4.5).

However, the DP search algorithm is non-trivial to model analytically. Our key contribution is in developing a differentiable analytical function to approximate left-deep plan search. Left-deep plans join a single table to a sub-plan at each step. Our construction exploits a connection between left-deep plan search and the shortest path problem on a certain "plan graph".
While we focus on left-deep search for tractability, the resulting loss function improves the performance for all query plans, as the sub-plans required for costing left-deep plans are the same as those required for all plans. Figure 2 shows the plan graph corresponding to a simple query that joins three tables A, B, and C. Every edge in the plan graph represents a join and has a cost, and every path between two special nodes, S and D, represents a left-deep plan. The DP search algorithm outputs the cheapest plan, i.e., the shortest path. When cardinality estimates change, they change the cost of the edges in the plan graph, possibly changing the shortest path. Therefore, to capture the influence of cardinality estimates on the plan analytically, we need an expression to relate edge costs to the shortest path in the plan graph.

Figure 3:
Join graph and optimal plan for sample query Q on the IMDb database.

But this alone is not enough. The shortest path is insensitive to small changes to most edge costs (and hence, small changes to most cardinality estimates). For instance, consider any edge not on the shortest path; slightly increasing or decreasing the cost of that edge would not change the shortest path. Therefore an analytical function based on the shortest path would not have a gradient with respect to the cost of such edges. This would make it impossible for gradient-descent-based learning approaches to improve.

We tackle these challenges by using a soft approximation to the shortest path problem. In this formulation, the plan graph is viewed as an electrical circuit, with each edge having a resistance equal to its cost. One unit of current is sent from S to D, split across paths in a way that minimizes the total energy consumed. This formulation has two advantages over shortest path. First, it provides an explicit, closed-form expression relating the edge resistances (costs) to the amount of current on every path. Second, it does not suffer from the non-existent gradient problem described above. In an electrical circuit, the current is not exclusively sent on the path with the least resistance (i.e., the path corresponding to the cheapest plan). Instead, all low-resistance paths carry a non-negligible amount of current. Therefore, changing the resistance (cost) of an edge on any of these paths will affect the distribution of current across the entire circuit. The implication in our context is that all joins involved in low-cost query plans matter (even if they do not appear in the cheapest plan). This aligns with the intuition that the optimizer is sensitive to precisely these joins: changing their cost could easily change the plan it picks.
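The circuit intuition can be sketched numerically. This is a minimal illustration on a hypothetical four-node graph, solving for node potentials via the graph Laplacian and applying Ohm's law:

```python
import numpy as np

def electrical_flow(nodes, edges, resistances, src, dst):
    """Split one unit of current from src to dst across all paths."""
    idx = {n: i for i, n in enumerate(nodes)}
    L = np.zeros((len(nodes), len(nodes)))   # weighted graph Laplacian
    for (u, v), r in zip(edges, resistances):
        g = 1.0 / r                           # conductance = 1/resistance
        i, j = idx[u], idx[v]
        L[i, i] += g; L[j, j] += g
        L[i, j] -= g; L[j, i] -= g
    b = np.zeros(len(nodes))
    b[idx[src]], b[idx[dst]] = 1.0, -1.0      # inject / extract one unit
    phi = np.linalg.pinv(L) @ b               # node potentials
    # Ohm's law gives the current (flow) on each edge.
    return [(phi[idx[u]] - phi[idx[v]]) / r
            for (u, v), r in zip(edges, resistances)]

# Two S -> D paths: S-A-D with total resistance 2, S-B-D with 4.
flows = electrical_flow(["S", "A", "B", "D"],
                        [("S", "A"), ("A", "D"), ("S", "B"), ("B", "D")],
                        [1.0, 1.0, 1.0, 3.0], "S", "D")
assert abs(flows[0] - 2 / 3) < 1e-9   # cheaper path carries more current
assert abs(flows[2] - 1 / 3) < 1e-9   # costlier path still carries some
```

Note how neither path carries zero current: perturbing either path's resistance changes the split, which is exactly the gradient signal the hard shortest path lacks.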
4. DEFINITIONS
This section formally defines the plan graph and the concepts we use to develop our new loss function, Flow-Loss. As a running example, we will consider the query Q (Figure 3) on the Internet Movie Database (IMDb). Throughout this work, joins refer to inner joins, and we ignore cross-joins. For simplicity, we assume all joined columns have an index.

[Footnote: Electrical flows have been used to construct the most efficient polynomial-time algorithms for approximating the closely related maximum flow problem in graphs [8, 21, 26].]

Figure 4: Plan graph (Definition 4.2) for query Q. The cheapest path, P-Opt(Y_true), is highlighted. The edges are colored according to C(e, Y_true). The relative thickness of the edges represents the flows computed by Equation 6, F-Opt(Y_true).

Definition 4.1.
Sub-plan.
Given query Q, a sub-plan is a subset of tables in Q that can be joined using inner joins. In query Q (cf. Figure 3), kt ⋈ t is a sub-plan but kt ⋈ ci is not.

Definition 4.2.
Plan graph.
Given query Q, the plan graph is a directed acyclic graph (V, E) where V is the set of all sub-plans, and there is an edge corresponding to every join in Q between a sub-plan and a base table, i.e., (u, v) ∈ E if and only if v = u ⋈ b for a base table b. For convenience, we add a node S for the empty set, which has an edge to all nodes containing exactly one table. We use D to denote the node consisting of all tables. Figure 4 shows the plan graph for query Q.

Definition 4.3.
Path / Plan, P. A path (sequence of edges) from S to D in the plan graph. Any left-deep plan corresponds to a path from S to D. For instance, the plan (((t ⋈ kt) ⋈ ci) ⋈ n) ⋈ rt for query Q corresponds to: S → t → t ⋈ kt → t ⋈ kt ⋈ ci → t ⋈ kt ⋈ ci ⋈ n → D in Figure 4.

Definition 4.4.
Cardinality vector Y. The cardinalities for each node (sub-plan) in the plan graph. We use Y_true and Y_est to refer to true and estimated cardinalities.

Definition 4.5. C(e, Y). A cost model which takes as input an edge (join) e in the plan graph and assigns it a cost given the cardinality vector Y. In this paper we use the following simple cost model:

C((u, v), Y) = min(|u| + λ|b|, |u| · |b|) (3)

where b is a base table s.t. u ⋈ b = v, and |u|, |b| are the cardinalities of u and b given by Y. The term |u| · |b| models nested loop joins without an index, and λ = 0.001 is used to model an index on b. Figure 4 shows the cost of each edge in query Q. Flow-Loss can use a more precise cost model (e.g., with terms for other join operators such as hash join), but we found this simple model is effective in our workloads.

Definition 4.6.
P-Opt(Y). The cheapest path (plan) in the plan graph with edge costs given by C(e, Y):

P-Opt(Y) = argmin_P Σ_{e ∈ P} C(e, Y). (4)

For example, given Y_true, the cheapest path P-Opt(Y_true) is highlighted in Figure 4. We will use the terms "cheapest" and "shortest" path interchangeably.

Definition 4.7.
P-Cost(Y_est, Y_true). The true cost of the optimal path (plan) chosen based on cardinality vector Y_est:

P-Cost(Y_est, Y_true) = Σ_{e ∈ P-Opt(Y_est)} C(e, Y_true). (5)

P-Cost can be viewed as an alternative to loss functions like Q-Error to compare estimated and true cardinalities Y_est and Y_true. It finds the cheapest path using Y_est, i.e., P-Opt(Y_est), and then sums the true costs of the edges in this path using Y_true. Note that for a fixed Y_true, P-Cost takes its lowest value when Y_est = Y_true.

Remark.
As defined, P-Cost is not a distance metric [1] (e.g., it does not satisfy the symmetry property). However, this does not affect its use in our loss function. In an online appendix [3], we use P-Cost to construct a pseudometric [2] that computes a distance between two cardinality vectors.
5. FLOW-LOSS
While P-Cost captures the impact of cardinalities on query plans, it has an important drawback as a loss function: it cannot be minimized using gradient-based methods. In fact, the gradient of P-Cost with respect to Y_est is zero at almost all values of Y_est. To see why, notice that a small perturbation to Y_est is unlikely to change the path chosen by P-Opt(Y_est); the path would only change if there were multiple cheapest paths. Therefore P-Cost will also not be affected by a small perturbation to Y_est. In this section we define an alternative to P-Opt that has a gradient w.r.t. any cardinality in the plan graph, and use it to construct our loss function, Flow-Loss.

The problem with P-Opt is that it strictly selects the shortest (cheapest) path in the plan graph. Consider, instead, the following alternative that can be thought of as a "soft" variant of shortest path. Assume the plan graph is an electrical circuit, with edge e containing a resistor with resistance C(e, Y). Now suppose we send one unit of current from S to D. How will the current be split between the different paths from S to D?

In an electric circuit, paths with lower resistance (shorter paths) carry more current, but the current does not flow exclusively on the path with least resistance. Assuming all paths have a non-zero resistance, they will all carry some current. Importantly, every edge's resistance affects how current is split across paths.

[Footnote: For the purpose of this discussion, we view the resistance of a path as the sum of the resistances along its edges, which corresponds to the path's length when the resistance is viewed as a distance, or the path's cost when the resistance is viewed as the cost of an edge.]

The precise way in which current flows in the circuit can be obtained by solving the following energy minimization problem:

F-Opt(Y) = argmin_F Σ_{e ∈ E} C(e, Y) · F_e²  (6)
s.t. Σ_{e ∈ Out(S)} F_e = Σ_{e ∈ In(D)} F_e = 1  (7)
Σ_{e ∈ Out(V)} F_e = Σ_{e ∈ In(V)} F_e  (8)

Here the optimization variable F assigns a flow of current to each edge. Equation (7) enforces that one unit of flow is sent from S to D. Equation (8) is the conservation constraint for all nodes except S and D: it enforces that the amount of flow going in and out of a node should be the same. The thickness of the edges in Figure 4 shows the flows assigned to each edge by F-Opt(Y_true).

Computing F-Opt is a classical problem in circuit design [4, 8], and it has a simple closed-form expression as a function of the resistances C(e, Y). We provide the exact expression and its derivation online [3]. The solution has the following form:

F-Opt(Y) = A(Y) · B(Y)⁻¹ · i, (9)

where i ∈ R^N is a constant vector, and A ∈ R^{M×N}, B ∈ R^{N×N} are matrices with entries constructed as simple functions of the edge costs (resistances), C(e, Y), for a plan graph with M edges and N + 1 nodes. F-Opt is just a product of matrices and is thus clearly differentiable. We provide an explicit closed-form expression for the gradient of F-Opt online [3]. The closed-form expression allows us to reuse intermediate results (e.g., B(Y)⁻¹) to reduce the overhead of computing the gradient significantly. We are now ready to define our final loss function.

Definition 5.1. Flow-Loss.
Flow-Loss(Y_est, Y_true) = Σ_{e ∈ E} C(e, Y_true) · F-Opt(e, Y_est)²  (10)

Notice the similarity to P-Cost (Equation 5). P-Cost computed the sum of the true edge costs of the path chosen by P-Opt(Y_est), whereas Flow-Loss is a weighted sum of the true edge costs, where the weight of an edge is the square of F-Opt(e, Y_est). An alternative, intuitive interpretation of Flow-Loss is the true "energy dissipation" of the flows F-Opt(Y_est). This is minimized when the flows are chosen according to the true cardinalities, i.e., when Y_est = Y_true. Since F-Opt(·) and C(·) (Definition 4.5) are both differentiable, so is Flow-Loss, and we can use the chain rule to get the gradients of Flow-Loss w.r.t. Y_est.

Beyond left-deep plans.
P-Cost, and by extension, Flow-Loss are defined over left-deep plans. Extending Flow-Loss to bushy plans is more challenging: we would need to define a graph similar to the plan graph in which every valid bushy plan is a path, but this leads to an exponential increase in the number of paths. Fortunately, it does not seem necessary to consider bushy plans explicitly when optimizing for cardinality estimates. First, the best left-deep plan often has reasonable performance compared to the best overall plan [22]. Second, the same set of sub-plans required for optimizing a left-deep plan are also needed for bushy plans (excluding cross-joins). In particular, when indices are used, left-deep sub-plans are a prominent part of bushy plans. Hence, estimates that are important for choosing good left-deep plans are also important for bushy plans.

[Footnote: Recall that the energy dissipated when current I flows through a resistor with resistance R is RI² [4].]

Anchoring.
An unusual property of Flow-Loss compared to loss functions such as Q-Error is that it is not very sensitive to the absolute value of the cardinality estimates. Like an optimizer, Flow-Loss is affected more by the relative values of estimates for competing sub-plans. In particular, multiplying the cardinality estimates of all sub-plans of a query by a constant will often not change the cheapest path in the plan graph, because the costs computed using C (Definition 4.5) are linear in the cardinality estimates for most edges (specifically, the edges corresponding to joins that are cheaper with an index). The implication is that training a cardinality estimation model using Flow-Loss does not "anchor" the learned model's outputs to the true values (e.g., it may learn to estimate cardinalities that are all roughly 5× larger than the true values). It is possible to add explicit terms to the loss function that penalize large deviations from true values, or to use a more precise cost model that is sensitive to absolute cardinalities. Flow-Loss will optimize for whichever cost model we use. However, in our workloads we found that even without explicit anchoring, Flow-Loss learns cardinalities that perform well with PostgreSQL.
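Given per-edge flows, Equation 10 is a few lines. The sketch below supplies the flows directly for a hypothetical two-path graph rather than computing F-Opt, and illustrates why the true electrical split minimizes the loss:

```python
def flow_loss(true_costs, flows):
    # Equation 10: true-cost "energy" of the flows induced by Y_est.
    return sum(c * f * f for c, f in zip(true_costs, flows))

# Two parallel S -> D paths with true costs (resistances) 2 and 4.
costs = [2.0, 4.0]
optimal = [2 / 3, 1 / 3]   # the electrical split for these resistances

# Any other unit split dissipates more energy than the true split.
assert flow_loss(costs, optimal) < flow_loss(costs, [0.5, 0.5])
assert flow_loss(costs, optimal) < flow_loss(costs, [1.0, 0.0])
```

Routing all flow onto the single cheapest path is not the minimum: the energy objective deliberately rewards keeping some flow on every competitive path, which is what gives Flow-Loss gradients for near-optimal plans.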
6. FLOW-LOSS ANALYSIS
The goal of Flow-Loss is to learn cardinality estimation models that improve the query performance of a DBMS. As a concrete example, we focus specifically on improving PostgreSQL. In this section, we analyze the behavior of Flow-Loss using examples on PostgreSQL to understand how it improves on traditional loss functions like Q-Error.
P-Cost and Flow-Loss were defined using the simple cost model C (Definition 4.5). However, our ultimate goal is to improve the query performance of PostgreSQL, which we quantify using the actual PostgreSQL cost model.

Definition 6.1. Postgres Plan Cost (PPC).
PPC is the same as P-Cost (Definition 4.7), but uses the default PostgreSQL cost model and exhaustive search over all plans, not only left-deep plans. To compute PPC, we inject Y_est into the PostgreSQL optimizer to get the cheapest plan (join order and physical operators) for Y_est. Then we cost this plan using Y_true. We implement it using a modified version of the plugin pg_hint_plan [42]. We disable materialization and parallelism in PostgreSQL as they add complexity which makes it harder to analyze.

Flow-Loss is an approximation to P-Cost, which in turn is an approximation to PPC. For Flow-Loss to be useful, its cost model C must broadly reflect the behavior of the PostgreSQL cost model. Figure 5 shows a scatter plot of P-Cost versus PPC given true cardinalities for two workloads described in Section 8. The PostgreSQL cost model includes many terms that we do not model, thus we would not expect the scale of P-Cost and PPC to match precisely. Nonetheless, we observe that PPC and P-Cost mostly follow the same trends. It matters less that P-Cost is not very precise, since we are merely using it as a signal to improve the cardinality estimates that lead to high costs. To optimize queries, these cardinality estimates will be provided to the PostgreSQL optimizer with its full cost model.

[Footnote: For example, a cost model that accounts for spilling.]
[Footnote: https://github.com/parimarjan/pg_hint_plan]

Figure 5: P-Cost versus PPC given true cardinalities for the two workloads we used.

Figure 6: Comparing Q-Error (left) and Flow-Loss (right) as we vary the cardinality estimates of different sub-plans. For each data point we multiply or divide the true value (center) by 2.
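As a sketch of the injection step: pg_hint_plan exposes a Rows hint that overrides the optimizer's row estimate for a join. The helper below is hypothetical and only approximates what the paper's modified plugin does; the exact interface there may differ:

```python
def rows_hints(y_est):
    """Render estimated join cardinalities as pg_hint_plan Rows hints.

    y_est maps tuples of table aliases to estimated row counts;
    single-table estimates are skipped (Rows applies to joins).
    This is an illustrative helper, not the paper's implementation.
    """
    hints = ["Rows(%s #%d)" % (" ".join(tables), rows)
             for tables, rows in y_est.items() if len(tables) > 1]
    return "/*+ " + " ".join(hints) + " */"

hint = rows_hints({("t",): 2500000,
                   ("t", "kt"): 1210,
                   ("t", "kt", "ci"): 52000})
assert hint == "/*+ Rows(t kt #1210) Rows(t kt ci #52000) */"
```

The resulting comment block would be prepended to the query before EXPLAIN, so that PostgreSQL's own search and cost model run over the injected estimates.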
Next, we will compare the behavior of Q-Error (Definition 3.1), PPC (Definition 6.1), and Flow-Loss using our running example, query Q (Figure 3). Recall that Figure 4 shows the true cost of each edge, C(e, Y_true). As we change the cardinality of one node (sub-plan), u, the estimated costs of outgoing edges from u will change, affecting the overall cost of any path (plan) that passes through u.

Flow-Loss is sensitive to under-estimates of nodes on bad paths, and over-estimates of nodes on good paths.
Figure 6 shows three representative examples of how Q-Error and Flow-Loss change as we multiply or divide the cardinality of one node by increasing amounts while keeping the others fixed at their true values. Q-Error changes identically for all nodes (the lines overlap), but the behavior of Flow-Loss differs depending on the node. Node ci ⋈ t has multiple expensive paths that go through it (note the red edges in Figure 4). As we under-estimate its cardinality, Flow-Loss shoots up (blue line). This aligns with the intuition that under-estimating this node makes bad paths appear cheaper, which may cause the optimizer to choose one of them instead of the actual cheapest path. Over-estimating its cardinality, on the other hand, makes bad

Figure 7:
Comparing the shapes of Q-Error, PPC, and Flow-Loss as we vary the estimate of one sub-plan, while keeping others fixed at their true values. Each loss curve is plotted with its own scale (not shown). For each data point we multiply or divide the true value (center) by 2.
Figure 8:
Comparing Q-Error, PPC, and Flow-Loss when we vary estimates of two sub-plans at the same time. The colors go from dark (low errors) to light (high errors).

paths appear even more expensive, which is good as we want the optimizer to avoid these paths. Thus, it is sensible that Flow-Loss stays near its minimum in this case. The node ci ⋈ n ⋈ rt ⋈ t is on the cheapest path, while the node kt ⋈ t has two relatively good paths passing through it (cf. Figure 4). For these nodes, Flow-Loss remains at its minimum for under-estimates (since it makes good paths appear cheaper), and shoots up for over-estimates (since it makes good paths appear more expensive). Recall that Flow-Loss uses all relatively good paths, not just the cheapest, and therefore, it is impacted by both nodes.

Flow-Loss roughly tracks PPC decision boundaries.
Figure 7 compares the shapes of Q-Error, PPC, and Flow-Loss as we vary the cardinality of a single node. Each curve is plotted on its own scale as we are only interested in comparing their behavioral trends. Node ci ⋈ n ⋈ rt is already on the cheapest path (cf. Figure 4), so Flow-Loss is only sensitive to over-estimating its cardinality, like PPC. Node ci ⋈ rt is not on the cheapest path, and like PPC, Flow-Loss is a lot more sensitive to under-estimates, as they cause flow to be diverted to the paths containing this node from potentially cheaper paths. Node ci ⋈ kt ⋈ rt ⋈ t is an example of a case where Flow-Loss leads to a different behavior from PPC. For over-estimates, PPC is flat at its minimum while Flow-Loss blows up. ci ⋈ kt ⋈ rt ⋈ t is not on the cheapest path, but there are multiple nearly optimal paths using this node (cf. Figure 4). Since Flow-Loss routes a non-trivial amount of flow on such paths, it is sensitive to making them more expensive, even though the optimizer does not switch from the cheapest path (thus, PPC remains flat). This is a desirable property from the standpoint of robustness. It reflects the fact that any of the nearly optimal paths could become the cheapest path and get chosen by the optimizer if the cardinalities change slightly. For instance, although node ci ⋈ kt ⋈ rt ⋈ t is not on the cheapest path when all edges are costed using true cardinalities, it would be on the cheapest path if we underestimate the cost of the ci ⋈ kt ⋈ t → ci ⋈ kt ⋈ rt ⋈ t edge (or overestimate the cost of the actual cheapest path). In that case, PPC would have been sensitive to increasing the cardinality of this node. By considering all good paths simultaneously, Flow-Loss more robustly captures the behavior of the optimizer in response to such variations in cardinalities.
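The sensitivity asymmetry described above can be reproduced on a toy two-path plan graph (hypothetical costs). Flow-Loss computes flows from the estimated costs but prices them with the true costs, so under-estimating an expensive path is penalized more than over-estimating it:

```python
R_TRUE = (2.0, 4.0)   # true costs: a cheap path and an expensive path

def two_path_flow_loss(r_est):
    # Current split between two parallel paths, from *estimated* costs.
    g1, g2 = 1.0 / r_est[0], 1.0 / r_est[1]
    f1 = g1 / (g1 + g2)
    f2 = 1.0 - f1
    # Energy of that split priced with the *true* costs (Equation 10).
    return R_TRUE[0] * f1 ** 2 + R_TRUE[1] * f2 ** 2

at_truth = two_path_flow_loss(R_TRUE)          # minimum: true split
under = two_path_flow_loss((2.0, 4.0 / 8))     # bad path looks cheap
over = two_path_flow_loss((2.0, 4.0 * 8))      # bad path looks costly
assert at_truth < over < under   # under-estimating the bad path hurts more
```

Under-estimating the expensive path diverts flow onto it (high true energy), whereas over-estimating it merely concentrates flow on the cheap path, so the loss rises only mildly.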
As a further example, in Figure 8, we vary the cardinalities of two sub-plans simultaneously. Once again, we observe that Flow-Loss roughly reflects the behavior of PPC: it is highest when the cardinalities of both nodes are under-estimated (lower left quadrant in the figures).

In practice, cardinality estimation models face several challenges: limited model capacity (making it impossible to learn all the intricacies of the data distribution), limited training data (since collecting ground truth data is expensive), insufficient features (e.g., it may be hard to represent predicates on columns with a large number of categorical values), and noisy training data (cf. §).

Model capacity.
Lower capacity models, or less expressive features, make it harder for learned models to achieve high accuracy. Flow-Loss helps utilize the limited model capacity in a way that maximizes the model's impact on query performance (see the ablation study in §).

Domain-specific regularization.
A model trying to minimize Q-Error treats each estimate as equally important, which makes it easy to overfit to the training data. Regularization is a general approach to mitigate overfitting and improve generalization, but generic regularization techniques such as weight decay [5] simply bias towards learning simpler models (e.g., smoother functions) without taking advantage of the problem structure. Flow-Loss provides a stronger, guided regularization effect by utilizing domain-specific knowledge about query optimization. The key information is knowing which details of the training data can be ignored without impacting query performance. If estimation errors on a subset of sub-plans do not typically cause worse plans, then there is no need to learn a more complex model to correct them. This is precisely what Flow-Loss does by allowing a high tolerance to cardinality estimation errors for many sub-plans (see results in §).

Tolerance to noisy training data.
As a direct consequence of the previous point, by ignoring accuracy on less important subsets of the data, Flow-Loss can better handle noisy or missing training data, which can let us avoid the expensive process of executing all sub-plans to generate the true cardinalities. Instead, we can train well-performing models using approximate cardinalities obtained via fast sampling techniques [24] (cf. §). There are similar examples in other ML applications; e.g., Li et al. show that domain-specific loss functions for physics applications lead to improved generalization via implicit regularization [25].
7. FEATURIZATION AND MODELS
This section introduces the model architectures that we use to evaluate Flow-Loss. The implementations of these models are available online.

Featurization.
As described by Kipf et al. [18], a sub-plan q is mapped to three sets of input vectors: T_q, J_q, and P_q for the tables, joins, and predicates in the sub-plan. We augment these with a vector G_q that captures the properties of the sub-plan in the context of the plan graph. A one-hot vector encodes each table in the sub-plan (T_q), and a second one-hot vector encodes each join (J_q). For range predicates, we use min-max normalization [18, 35]. For in predicates, we use feature hashing [39]. This is a standard technique in ML applications where categorical features with large alphabet sizes are hashed to N bins. Even if N is much smaller than the alphabet size, it still provides a signal for the learned models. For like predicates, we use feature hashing with character n-grams [47]. This is useful to distinguish between extremely common and uncommon characters. For like, we also include the number of characters and the presence of a digit as additional features. We find that N = 10 bins for every column-operator pair works well on our workloads. As proposed by Dutt et al. [10], we add the cardinality estimate for each table (after applying its predicates) from PostgreSQL to that table's vector in T_q, which we found to be sufficient for our workload. For a stronger runtime signal, we could add sample bitmaps [18, 19] (i.e., bitmaps indicating qualifying sample tuples); however, as this would significantly increase the model's parameters and hence its memory requirements, we omit this optimization in this work. Similarly, we do not explicitly encode group by columns like earlier work does [17], and rely on PostgreSQL's estimates instead. G_q is a vector for the plan graph-based properties of a sub-plan.
This includes information about the immediate children of the sub-plan node in the plan graph (i.e., the nodes obtained by joining the sub-plan with a base table). Specifically: the number of children, the PostgreSQL cost of the join producing that child, and the relative cardinality of that child compared to the sub-plan. Intuitively, such information about neighboring plan graph nodes could be useful for generalizing to new queries. We also add the PostgreSQL cardinality and cost estimate for the sub-plan to G_q. For all cardinalities, we apply a log transformation [10].

Models.
To compare Q-Error and Flow-Loss, we train two representative neural network architectures with both loss functions. The Fully-Connected Neural Network (FCNN) was used by Ortiz et al. [35] and Dutt et al. [10]. It takes as input a 1-D feature vector that concatenates the vectors in T_q, J_q, P_q, and G_q. The Multi-Set Convolutional Network (MSCN) was proposed by Kipf et al. [18] based on the DeepSets architecture [52], and we extend it to include the G_q features as well. These are very different architectures, and represent important trade-offs: FCNN is a lightweight model that trains efficiently, but does not scale to increasing database sizes (the number of neural network weights grows with the number of columns), while MSCN uses a set-based formulation that is scalable but is less efficient to train. The implementations are available at github.com/parimarjan/learned-cardinalities.
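The predicate featurization described above can be sketched as follows. This is a minimal illustration, not the paper's actual code: the helper names and the choice of CRC32 as the hash function are assumptions; only the overall scheme (N = 10 bins per column-operator pair, n-gram hashing plus length and digit-presence features for like) follows the text.

```python
import zlib

N_BINS = 10  # bins per column-operator pair, as in the text


def hash_in_predicate(values, n_bins=N_BINS):
    """Feature-hash the value set of an `in` predicate into a fixed-size
    binary vector [39]. CRC32 is an assumed hash; any stable hash works."""
    vec = [0] * n_bins
    for v in values:
        vec[zlib.crc32(str(v).encode()) % n_bins] = 1
    return vec


def hash_like_predicate(pattern, n=3, n_bins=N_BINS):
    """Feature-hash character n-grams of a `like` pattern [47], then append
    the number of characters and a has-digit flag, as described in the text."""
    vec = [0] * n_bins
    for i in range(max(1, len(pattern) - n + 1)):
        gram = pattern[i:i + n]
        vec[zlib.crc32(gram.encode()) % n_bins] = 1
    return vec + [len(pattern), int(any(c.isdigit() for c in pattern))]
```

Even with only 10 bins, distinct value sets tend to light up distinct bin patterns, which is the "signal" referred to above; collisions merely blur the feature, they do not break it.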
8. CARDINALITY ESTIMATION BENCHMARK (CEB)

8.1 Dataset
Table 1: Comparing CEB with JOB (datasets: JOB (IMDb), CEB (IMDb), CEB (SE)).
Figure 9: TOML configuration file for generating queries based on a predefined template and rules.
Benchmark.
We create a tool to generate a large number of challenging queries based on predefined templates and rules. Using this tool, we generate the Cardinality Estimation Benchmark (CEB) [32], a workload on two different databases (IMDb [22] and StackExchange (SE) [40]) containing over 16K unique queries and true cardinalities for over 4M sub-plans, including count and group by aggregates, and range, in, and like predicates. Table 1 summarizes the key properties of CEB, and contrasts them with the Join Order Benchmark (JOB) [22]. Notice that for the 13K IMDb queries in CEB, there are over 2K unique plans generated by PostgreSQL with true cardinalities — showing that different predicates lead to a diverse collection of optimal query plans. CEB addresses the two major limitations of queries used in previous works [18, 35, 9]: First, past work on supervised cardinality estimation [18, 9, 35] evaluates on workloads with only up to six joins per query. CEB has much more complex queries, ranging from five to sixteen joins. Second, while JOB [22] contains challenging queries with up to 16 joins, it only has two to five queries per template. This is insufficient training data for supervised learning methods. CEB contains hundreds of queries per hand-crafted template with real-world interpretations.

Query generator.
Generating predicate values for query templates is challenging because predicates interact in complex ways, and sampling them independently would often lead to queries with zero or very few results. Our key insight is to generate interesting predicate values for one or multiple columns together, using predefined SQL queries that take into account correlations and other user-specified conditions. Figure 9 shows a complete template which generates queries with the same structure as our running example, Q. We will walk through the process of generating a sample query following the rules specified in this template. [basesql] is the SQL query to be generated, with a few unspecified predicates to be filled in. [predicates] are rules to choose the predicates for groups of one or more columns. The predicate on year is of type less than or equal to, and we choose a value uniformly from the given list. We sample filter values for the remaining three in predicates together because kind, role, and gender are highly correlated columns. For these, we also add year as a dependency, as the chosen year would influence the predicate selectivities for all these columns. We generate a list of candidate triples using a group by query. From this list, we sample 2 to 7 values for each in predicate.

Timeouts.
Some sub-plans in the StackExchange queries time out when collecting the true values. This is due to unusual join graphs which make certain sub-plans behave like cross-joins (see online appendix [3]). In such cases, we use a large constant value in place of the true cardinalities as the label for the timed-out sub-plans in the training data. We verified that the plans generated by injecting all known true cardinalities and this constant value into PostgreSQL lead to almost 10× faster runtimes than using the default PostgreSQL estimates. Thus, despite the timeouts, our labels for StackExchange are a good target for training a cardinality estimation model.

Intuitively, we may not need precise cardinality estimates to get the best plans — thus, approximate query processing techniques, such as wander join [24] or IBJS [23], should provide sufficient accuracy. However, we cannot use these techniques for query optimization because they are too slow to provide estimates for all sub-plans at runtime. But these techniques are much faster than generating the ground truth cardinality estimates for all sub-plans, which is by far the most expensive step in building a cardinality estimation model. We hence propose a new version of the wander join algorithm that is suited to the task of efficiently generating labeled training data for a large number of similar queries. Wander join estimates the cardinality of a given query by doing random walks on the join graph: you start with a random tuple in the first table, look up a tuple with a matching join key in the second table using an index, and so on. A walk may fail either due to predicates, or due to no matching join keys. Our key modification is to first apply the predicates on the base tables, and then do the random walks. This does not change the theory behind wander join, and makes the walks significantly faster.
But the cost of applying the predicates first is too expensive for AQP applications: to do the random walks efficiently, one would also need to create an index on the tables after applying the predicates. For generating training data, however, this is a one-time cost, after which we can reuse the filtered tables for all sub-plans in the query. Moreover, because CEB consists of predicates that are often re-used in different combinations across queries, we are able to re-use the filtered tables for many queries with the same template. We use this only as a proof of concept; our implementation is not optimized, and uses a mix of Python and SQL calls to do the random walks. Despite this, we generate the wander join estimates with speedups over generating ground truth data that range from 10× to 100× for different templates. For instance, for the largest template with around 3K sub-plans, generating all the ground truth data on a single core takes about 5 hours, while the wander join estimates take less than 5 minutes on average, and give almost equally good plans. In Section 9.1, we explore whether the wander join estimates are as good as true cardinalities for training learned models.
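The modified wander join procedure can be sketched as follows. This is a minimal, illustrative implementation for a chain of equi-joins, not the paper's actual Python + SQL code; the function name and the in-memory table representation are assumptions. The key point is that the per-step indexes are built once on the already-filtered tables, so they can be amortized across all sub-plans (and queries) sharing the same predicates.

```python
import random


def wander_join_estimate(tables, join_keys, n_walks=1000, rng=None):
    """Estimate the join cardinality of a chain of equi-joins by random
    walks (wander join [24]), assuming predicates have already been
    applied to each base table. `tables` is a list of lists of dict
    rows; `join_keys[i]` is the (left_col, right_col) pair joining
    tables[i] to tables[i+1]."""
    rng = rng or random.Random()
    # One-time cost: build an index on each filtered table. This is the
    # step that is too expensive for online AQP but amortizes well when
    # generating training labels for many similar queries.
    indexes = []
    for (left_col, right_col), right_table in zip(join_keys, tables[1:]):
        index = {}
        for row in right_table:
            index.setdefault(row[right_col], []).append(row)
        indexes.append((left_col, index))

    total = 0.0
    for _ in range(n_walks):
        row = rng.choice(tables[0])
        # Each walk's contribution is the inverse of its sampling
        # probability: |R1| times the fanout observed at every step.
        weight = float(len(tables[0]))
        for left_col, index in indexes:
            matches = index.get(row[left_col], [])
            if not matches:  # walk fails: no matching join partner
                weight = 0.0
                break
            weight *= len(matches)
            row = rng.choice(matches)
        total += weight
    return total / n_walks
```

Averaging the per-walk weights gives an unbiased estimate of the join size; applying predicates up front only shrinks the tables the walks run over, which is why the modification speeds up the walks without changing the estimator's theory.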
9. EXPERIMENTS
Setup.
We use PostgreSQL 12 with parallelism and materialization disabled for all our experiments. For the runtime experiments, we use an Amazon EC2 instance with an NVMe SSD drive and 8 GB RAM (m5ad.large for IMDb, and m5ad.xlarge for StackExchange).
Loss functions.
Our main focus is to compare the Q-Error and Flow-Loss loss functions for training the neural network models. We also compare with Prioritized Q-Error [33], our earlier exploratory work that tweaked the Q-Error loss function to focus on queries with high PPC. We use the true cardinalities and estimates from PostgreSQL as baselines against which to compare the learned models.
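For reference, Q-Error [31], the loss the baseline models minimize, is the multiplicative factor by which an estimate deviates from the true cardinality, in either direction. A minimal implementation (the clamping to 1, to avoid division by zero, is a common convention that the paper's code may handle differently):

```python
def q_error(estimate, true_card):
    """Q-Error [31]: max(est/true, true/est). Symmetric, always >= 1,
    and equal to 1 only for a perfect estimate."""
    estimate = max(float(estimate), 1.0)    # clamp: cardinalities below 1
    true_card = max(float(true_card), 1.0)  # are treated as 1
    return max(estimate / true_card, true_card / estimate)
```

For example, `q_error(10, 1000)` and `q_error(100000, 1000)` are both 100: under- and over-estimation by the same factor are penalized equally, regardless of whether the error matters for plan choice, which is exactly the behavior Flow-Loss departs from.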
Training and test sets.
We consider two scenarios:

1. Testing on seen templates. The model is evaluated on new queries from the same templates that it was trained on. We put 20% of the queries of each template into the validation set, and 40% each into the training and test sets. The hyperparameters are tuned on the validation set, and we report results from the test set.

2. Testing on unseen templates. The model is evaluated on different templates than the ones it was trained on. We split the templates equally into training and test templates. Since the number of templates is much smaller than the number of queries, we use ten-fold cross-validation for these experiments: the training/test set splits are done randomly using ten different seeds. This tests the robustness of these models to slight shifts in the workload.

Key results.
Figure 10a shows the results of all approaches w.r.t. PPC on IMDb. All models outperform PostgreSQL's estimator significantly on seen templates. However, only the Flow-Loss trained models do so consistently on unseen templates as well. For seen templates, the models trained using Flow-Loss do better on PPC than the models trained using Q-Error. All models get worse when evaluated on unseen templates, but the Flow-Loss models degrade more gracefully. When the queries are from seen templates, the difference in PPC does not translate into runtime improvements (cf. Figure 10b). However, on unseen templates, we see clear improvements in runtime as well. (This learning problem is similar to how obfuscation by adding random noise to training data is used to learn ML models while preserving privacy [53].)

Table 2 summarizes the results for the IMDb workload when trained and tested on all IMDb templates. Each experiment is repeated three times, and we show ± stddev for each statistic. All learned models improve significantly over PostgreSQL on all metrics, and do about equally well.

Worse Q-Error, better PPC, similar runtimes.
On Q-Error, the models trained using Flow-Loss do worse than the models trained using Q-Error. This is to be expected — our goal was to improve cardinality estimates only where they are crucial for query planning. The Flow-Loss trained models distinctly improve mean PPC over the Q-Error models, getting close to the PPC with true cardinalities. This suggests that Flow-Loss models better utilize their model capacity to focus on the sub-plans that are more crucial for PPC. It also shows that better Q-Error estimates do not directly translate into improved plans. However, in terms of runtimes, all models do equally well, and are very close to the optimal.
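The decoupling of Q-Error and PPC seen here is easy to illustrate with a toy abstraction of PPC (this is not PostgreSQL's actual planner interface): the optimizer ranks candidate plans by their cost under the estimated cardinalities, but the plan it picks is then charged its cost under the true cardinalities.

```python
def postgres_plan_cost(plans, est_cost, true_cost):
    """Toy PPC: pick the plan that looks cheapest under estimated
    cardinalities, then return that plan's cost under true
    cardinalities. `est_cost` and `true_cost` map a plan to a cost."""
    chosen = min(plans, key=est_cost)
    return true_cost(chosen)
```

If estimation errors flip the ranking of two plans, PPC jumps to the true cost of the mis-chosen plan; errors that leave the ranking intact cost nothing, no matter how large their Q-Error. For instance, if estimates say an index nested-loop plan costs 5 while a hash join costs 10, but the true costs are 50 and 8, the toy PPC is 50 even though the hash join's estimate had a modest Q-Error.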
PPC versus runtimes.
Figure 11 shows that the trends for PPC and runtimes are roughly correlated. Notice that the costs (x-axis) are shown on a log scale — thus, order-of-magnitude improvements in cost translate to faster runtimes.
Training with wander join estimates.
Figure 12 shows the Q-Error and PPC for models trained with true cardinalities and with wander join estimates (cf. §) on the CEB workload. Notice that training with wander join estimates is almost as good as training with true cardinalities. Moreover, the Flow-Loss models are robust when trained using the noisy wander join estimates — meanwhile, the Q-Error models trained using wander join estimates have much worse tail performance on both estimation accuracy (Q-Error) and PPC. This supports our hypothesis that the models trained using Flow-Loss are able to avoid overfitting to noisy data that may not be as relevant for query optimization (cf. §).

When we split the training set and test set by templates, each partition gives the models access to very different information — therefore we analyze the partitions individually. We only show results for the models trained with Flow-Loss or Q-Error (Prioritized Q-Error does clearly worse than Flow-Loss w.r.t. PPC; see online appendix [3]).
Flow-Loss generalizes better.
In Figure 13a, we look at the performance of a model trained with Flow-Loss compared to one trained with Q-Error w.r.t. PPC and query runtime. A single bar represents the same model architecture (FCNN or MSCN) trained and evaluated on one of the ten partitions in the unseen templates scenario. To avoid the runtime overhead, we omit the experiments where the PPC of the models trained with Q-Error and Flow-Loss were within 500K of each other. This figure highlights the overall trends across all unseen partition experiments: we see significant improvements on some partitions, relatively smaller regressions, and similar performance on many partitions.

Zooming in on partitions.
For the FCNN and MSCN models, we sort all the partitions by the difference in the mean runtimes between the Flow-Loss and the Q-Error models. We select the best, median, and worst partition for Flow-Loss and show the 50p, 90p, and 99p runtimes in Figure 14. For both architectures, the model trained with Flow-Loss significantly improves on all percentiles for the best partition, and has about the same performance on the worst partition. Even on the median partition, the Flow-Loss model does clearly better at the tail.

Figure 10: Comparing performance of all models on seen versus unseen templates. For unseen templates, we do ten experiments using ten different training/test template splits. (a) Mean Postgres Plan Cost (PPC). (b) Mean query runtimes.

Table 2: Models trained and evaluated on the same IMDb templates. We show ± stddev for Q-Error and PPC from three repeated experiments, and execute plans from one random run. Xp refers to the Xth percentile.

Figure 11: PPC versus runtimes for MSCN models trained with Q-Error or Flow-Loss and evaluated on seen templates.
Learning curves.
Figure 15 shows the MSCN model's learning curves for Q-Error (normalized during training as done by Dutt et al. [10]), Flow-Loss, and PPC on one partition (seed = 7) trained using Q-Error or Flow-Loss. We see that the Q-Error model has smooth training set curves for all metrics, but behaves erratically on the test set. This is because it is trying to minimize estimation error, but since the test set contains queries from unseen templates, that is much more challenging. Note that the Flow-Loss curves closely resemble the PPC curves. This similarity is particularly obvious for the Q-Error model on the test set.

Figure 12: Median Q-Error and mean PPC for FCNN models trained with true or wander join cardinalities.

Join Order Benchmark.
JOB is not suitable for training a supervised learning model as it has too few queries. But we can use it as an evaluation set for a model trained on all the templates from CEB. This is similar to the unseen templates scenario: the JOB queries are less challenging in terms of PPC (for instance, PostgreSQL estimates have 20× lower mean PPC on JOB than on CEB). However, they are more diverse: JOB has 31 templates, and includes predicates on columns not seen in CEB. We train on queries from all the CEB templates and evaluate on the JOB queries. Figure 16 summarizes the results of the Flow-Loss and Q-Error models for both architectures over three repeated runs. Both Flow-Loss models improve slightly on PPC over PostgreSQL while achieving similar runtimes. The FCNN model trained with Q-Error performs similarly, but the MSCN model trained with Q-Error shows much higher variance and does significantly worse. We use this experiment as a sanity check to show that even when the queries are very different, our models avoid disastrously bad estimates.

Figure 13: Each bar shows the PPC or runtime improvement/regression of Flow-Loss over Q-Error on an unseen partition and the same model. Lower is better. (a) IMDb workload, FCNN or MSCN model. (b) StackExchange workload, MSCN model.

Figure 14: Best, median, and worst partition for Flow-Loss models by runtime difference from the corresponding Q-Error model on unseen template partitions.

Domain-specific regularization effect.
Figure 17 shows the median, 90p, and 99p Q-Errors for the three scenarios we have looked at. We show results for the FCNN architecture and omit MSCN, which performs very similarly. For seen templates, the models trained with Q-Error only go up to 100 at the 99p over millions of unseen sub-plans. To achieve such low estimation errors, the model needs to get quite complex, and overfit to noisy patterns, like precise estimates for ilike predicates or for sub-plans with 10 tables that may anyway get pruned during the dynamic programming optimizer search. Often, we do not need such precise estimates. Models trained with Flow-Loss achieve better PPC despite an order of magnitude higher Q-Errors at the 99p, which suggests that Flow-Loss learns a simpler model that is more effective for the task of query optimization. More strikingly, as we consider the unseen templates scenario, the models trained with Q-Error get almost 10× worse at the 90p and 99p, while the models trained with Flow-Loss only get about 1.× worse. This pattern continues on to the JOB templates, where the Flow-Loss models even have better estimation accuracy than the Q-Error models. This supports our regularization hypothesis (cf. §).

Figure 15: Learning curves for one unseen templates partition showing the mean of all metrics for MSCN models.

Figure 16: Mean PPC and runtimes for all models trained with Q-Error or Flow-Loss on CEB, and evaluated on JOB.

Ablation study.
Next, we seek to understand the impact of the various components of the featurization (cf. §7) by an ablation study in which we remove key elements of the featurization, and evaluate the PPC on the seen templates, unseen templates, and JOB. We again focus on FCNN and omit MSCN, which follows similar trends. Figure 18 summarizes the results. There are two main highlights. First, on the seen templates, Flow-Loss models can adapt to the removal of various featurization components, and do as well as with the default features; meanwhile, the Q-Error models suffer significantly with worse featurization. This shows that when constrained with fewer resources, the Flow-Loss model can better use its capacity to minimize PPC. Second, PostgreSQL features are crucial for generalization. These include various cardinality and cost estimates (cf. §).

Figure 17: Median, 90p, and 99p Q-Error for FCNN models trained on queries from CEB and evaluated on seen templates, unseen templates, and JOB.

Figure 18: Ablation study with the FCNN model for seen templates (left), unseen templates (middle), and JOB (right) showing PPC when various components of the featurization scheme are removed. (Ablated variants: default; no PostgreSQL estimates; no plan graph features; no table features; no join features; no predicate features; no table, join, and predicate features.)

Training time.
Compared to Q-Error, there is a 3–× overhead for training either architecture with Flow-Loss, due to the additional calculations Flow-Loss requires — the bottleneck is computing B(Y)⁻¹ in Equation 9. On the CPU, when using Q-Error, the FCNN architecture trains for 10 epochs on the IMDb workload in under 1000 seconds, and the MSCN model takes up to 2500 seconds.
Inference time.
As in [18, 10], the inference times for these neural networks are on the order of a few milliseconds (after featurization), and hence fast enough for query optimization.
Model sizes.
The MSCN model is 2.×.

In the context of similar maximum flow algorithms on graphs, Christiano et al. [8] show ways to calculate B(Y)⁻¹ in polynomial time, utilizing the structure of the electrical flows formulation. We also expect it to be faster on GPUs with fast matrix inverse operations [11].

Figure 19: Mean PPC (five repeated runs) and runtimes (from one random run) for baselines and MSCN models evaluated on seen templates from the StackExchange workload.

Finally, we study the performance of the MSCN model on the StackExchange (SE) database using a workload that consists of fewer templates and queries than IMDb.
Differences with IMDb results.
The SE workload has many timed-out sub-plans, which we replace with a large constant (cf. §).

Similarities to IMDb results.
On seen templates, both loss functions improve significantly over the PostgreSQL estimates, and perform similarly to each other, although the Q-Error models exhibit more variance in PPC over five repeated experiments (cf. Figure 19). As in the IMDb workload, sufficiently large improvements in PPC translate to clear runtime improvements. On unseen templates, the Flow-Loss models improve significantly over the Q-Error models (cf. Figure 13b). The magnitude of the improvements is larger than on IMDb, with three partitions having improvements of over 20 seconds on average. This is partially because the database is larger than IMDb — thus better plans lead to more substantial improvements.
10. CONCLUSIONS
We showed that Postgres Plan Cost (PPC) is a useful proxy for runtimes, and is an important alternative to Q-Error when evaluating a cardinality estimator. This lets us view cardinality estimation through a new lens — and we developed Flow-Loss as a smooth, differentiable approximation to PPC that can be used to train models via gradient descent based learning techniques. Using a new Cardinality Estimation Benchmark, we provide evidence that Flow-Loss can guide learned models to better utilize their capacity to learn the cardinalities that have the most impact on query performance. Even more importantly, it can help models avoid overfitting to cardinality estimates that are unlikely to improve query performance — leading to more robust generalization when evaluated on queries from templates not seen in the training data, and helping models learn more robustly from training data generated using AQP techniques. Generating ground truth cardinalities in order to train a model is expensive; moreover, updates to the data would quickly make such training data stale. Thus, avoiding the overhead of generating true labeled data can significantly improve the adoption of learned cardinality estimation models in practice.
11. REFERENCES

[1] Metric (mathematics). https://en.wikipedia.org/wiki/Metric_(mathematics). [Online].
[2] Pseudometric space. https://en.wikipedia.org/wiki/Pseudometric_space. [Online].
[3] Optimizer-cost online appendix. https://parimarjan.github.io/flow_loss_appendix, 2020. [Online].
[4] A. Agarwal and J. Lang. Foundations of Analog and Digital Electronic Circuits. Elsevier, 2005.
[5] C. M. Bishop. Regularization and complexity control in feed-forward networks. 1995.
[6] N. Bruno, S. Chaudhuri, and L. Gravano. STHoles: A multidimensional workload-aware histogram. In Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data, Santa Barbara, CA, USA, May 21-24, 2001, pages 211–222, 2001.
[7] W. Cai, M. Balazinska, and D. Suciu. Pessimistic cardinality estimation: Tighter upper bounds for intermediate join cardinalities. In Proceedings of the 2019 International Conference on Management of Data, pages 18–35, 2019.
[8] P. Christiano, J. A. Kelner, A. Madry, D. A. Spielman, and S.-H. Teng. Electrical flows, Laplacian systems, and faster approximation of maximum flow in undirected graphs. In Proceedings of the Forty-Third Annual ACM Symposium on Theory of Computing, pages 273–282, 2011.
[9] A. Dutt, C. Wang, V. R. Narasayya, and S. Chaudhuri. Efficiently approximating selectivity functions using low overhead regression models. Proc. VLDB Endow., 13(11):2215–2228, 2020.
[10] A. Dutt, C. Wang, A. Nazi, S. Kandula, V. R. Narasayya, and S. Chaudhuri. Selectivity estimation for range predicates using lightweight models. PVLDB, 12(9):1044–1057, 2019.
[11] P. Ezzatti, E. S. Quintana-Ortí, and A. Remón. Using graphics processors to accelerate the computation of the matrix inverse. J. Supercomput., 58(3):429–437, 2011.
[12] L. Getoor, B. Taskar, and D. Koller. Selectivity estimation using probabilistic models. In Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data, Santa Barbara, CA, USA, May 21-24, 2001, pages 461–472, 2001.
[13] S. Hasan, S. Thirumuruganathan, J. Augustine, N. Koudas, and G. Das. Deep learning models for selectivity estimation of multi-attribute queries. In Proceedings of the 2020 International Conference on Management of Data, SIGMOD Conference 2020, online conference [Portland, OR, USA], June 14-19, 2020, pages 1035–1050, 2020.
[14] B. Hilprecht, A. Schmidt, M. Kulessa, A. Molina, K. Kersting, and C. Binnig. DeepDB: Learn from data, not from queries! Proc. VLDB Endow., 13(7):992–1005, 2020.
[15] O. Ivanov and S. Bartunov. Adaptive cardinality estimation. arXiv preprint arXiv:1711.08330, 2017.
[16] M. Kiefer, M. Heimel, S. Breß, and V. Markl. Estimating join selectivities using bandwidth-optimized kernel density models. Proc. VLDB Endow., 10(13):2085–2096, 2017.
[17] A. Kipf, M. Freitag, D. Vorona, P. Boncz, T. Neumann, and A. Kemper. Estimating filtered group-by queries is hard: Deep learning to the rescue. 2019.
[18] A. Kipf, T. Kipf, B. Radke, V. Leis, P. A. Boncz, and A. Kemper. Learned cardinalities: Estimating correlated joins with deep learning. In CIDR 2019, 9th Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, January 13-16, 2019, Online Proceedings.
[19] A. Kipf, D. Vorona, J. Müller, T. Kipf, B. Radke, V. Leis, P. A. Boncz, T. Neumann, and A. Kemper. Estimating cardinalities with deep sketches. In Proceedings of the 2019 International Conference on Management of Data, SIGMOD Conference 2019, Amsterdam, The Netherlands, June 30 - July 5, 2019, pages 1937–1940.
[20] S. Krishnan, Z. Yang, K. Goldberg, J. M. Hellerstein, and I. Stoica. Learning to optimize join queries with deep reinforcement learning. CoRR, abs/1808.03196, 2018.
[21] Y. T. Lee, S. Rao, and N. Srivastava. A new approach to computing maximum flows using electrical flows. In Proceedings of the Forty-Fifth Annual ACM Symposium on Theory of Computing, pages 755–764, 2013.
[22] V. Leis, A. Gubichev, A. Mirchev, P. A. Boncz, A. Kemper, and T. Neumann. How good are query optimizers, really? PVLDB, 9(3):204–215, 2015.
[23] V. Leis, B. Radke, A. Gubichev, A. Kemper, and T. Neumann. Cardinality estimation done right: Index-based join sampling. In CIDR 2017, 8th Biennial Conference on Innovative Data Systems Research, Chaminade, CA, USA, January 8-11, 2017, Online Proceedings, 2017.
[24] F. Li, B. Wu, K. Yi, and Z. Zhao. Wander join: Online aggregation via random walks. In Proceedings of the 2016 International Conference on Management of Data, pages 615–629, 2016.
[25] L. Li, S. Hoyer, R. Pederson, R. Sun, E. D. Cubuk, P. Riley, and K. Burke. Kohn-Sham equations as regularizer: Building prior knowledge into machine-learned physics. arXiv preprint arXiv:2009.08551, 2020.
[26] A. Madry. Computing maximum flow with augmenting electrical flows. Pages 593–602. IEEE, 2016.
[27] R. Marcus, P. Negi, H. Mao, N. Tatbul, M. Alizadeh, and T. Kraska. Bao: Learning to steer query optimizers. CoRR, abs/2004.03814, 2020.
[28] R. Marcus and O. Papaemmanouil. Towards a hands-free query optimizer through deep learning. In CIDR 2019, 9th Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, January 13-16, 2019, Online Proceedings, 2019.
[29] R. C. Marcus, P. Negi, H. Mao, C. Zhang, M. Alizadeh, T. Kraska, O. Papaemmanouil, and N. Tatbul. Neo: A learned query optimizer. PVLDB, 12(11):1705–1718, 2019.
[30] Y. Matias, J. S. Vitter, and M. Wang. Wavelet-based histograms for selectivity estimation. In Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, pages 448–459, 1998.
[31] G. Moerkotte, T. Neumann, and G. Steidl. Preventing bad plans by bounding the impact of cardinality estimation errors. PVLDB, 2(1):982–993, 2009.
[32] P. Negi, R. Marcus, A. Kipf, H. Mao, N. Tatbul, T. Kraska, and M. Alizadeh. Cardinality estimation dataset. https://github.com/cardinality-estimation-benchmark, 2020. [Online].
[33] P. Negi, R. Marcus, H. Mao, N. Tatbul, T. Kraska, and M. Alizadeh. Cost-guided cardinality estimation: Focus where it matters. Pages 154–157. IEEE, 2020.
[34] J. Ortiz, M. Balazinska, J. Gehrke, and S. S. Keerthi. Learning state representations for query optimization with deep reinforcement learning. In
PVLDB , 2(1):982–993, 2009.[32] P. Negi, R. Marcus, A. Kipf, M. Hongzi, N. Tatbul,T. Kraska, and M. Alizadeh. Cardinality estimationdataset. https://github.com/cardinality-estimation-benchmark ,2020. [Online;].[33] P. Negi, R. Marcus, H. Mao, N. Tatbul, T. Kraska,and M. Alizadeh. Cost-guided cardinality estimation:Focus where it matters. In , pages 154–157. IEEE, 2020.[34] J. Ortiz, M. Balazinska, J. Gehrke, and S. S. Keerthi.Learning state representations for query optimizationwith deep reinforcement learning. In
Proceedings ofthe Second Workshop on Data Management forEnd-To-End Machine Learning, DEEM@SIGMOD2018, Houston, TX, USA, June 15, 2018 , pages4:1–4:4, 2018.[35] J. Ortiz, M. Balazinska, J. Gehrke, and S. S. Keerthi.An empirical analysis of deep learning for cardinalityestimation.
CoRR , abs/1905.06425, 2019.[36] Y. Park, S. Zhong, and B. Mozafari. Quicksel: Quickselectivity learning with mixture models. arXivpreprint arXiv:1812.10568 , 2018.[37] M. Perron, Z. Shang, T. Kraska, and M. Stonebraker.How I learned to stop worrying and lovere-optimization. In , pages 1758–1761, 2019.[38] V. Poosala and Y. E. Ioannidis. Selectivity estimationwithout the attribute value independence assumption.In
VLDB , volume 97, pages 486–495, 1997.[39] D. D. Sarkar. Categorical data, Mar 2019.[40] StackExchange. Stackexchange data explorer, 2020.[41] J. Sun and G. Li. An end-to-end learning-based costestimator.
PVLDB , 13(3):307–319, 2019.[42] N. TELEGRAPH and T. CORPORATION. Pg hintplan, 2013.[43] I. Trummer, J. Wang, D. Maram, S. Moseley, S. Jo,and J. Antonakakis. SkinnerDB: Regret-boundedquery evaluation via reinforcement learning. In
Proceedings of the 2019 International Conference onManagement of Data, SIGMOD Conference 2019,Amsterdam, The Netherlands, June 30 - July 5, 2019 ,pages 1153–1170, 2019.[44] K. Tzoumas, A. Deshpande, and C. S. Jensen.Efficiently adapting graphical models for selectivityestimation.
VLDB J. , 22(1):3–27, 2013.[45] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit,L. Jones, A. N. Gomez, (cid:32)L. Kaiser, and I. Polosukhin.Attention is all you need. In
Advances in neuralinformation processing systems , pages 5998–6008,2017.[46] X. Wang, C. Qu, W. Wu, J. Wang, and Q. Zhou. Arewe ready for learned cardinality estimation? arXiv preprint arXiv:2012.06743 , 2020.[47] J. Wieting, M. Bansal, K. Gimpel, and K. Livescu.Charagram: Embedding words and sentences viacharacter n-grams. arXiv preprint arXiv:1607.02789 ,2016.[48] L. Woltmann, C. Hartmann, M. Thiele, D. Habich,and W. Lehner. Cardinality estimation with local deeplearning models. In
Proceedings of the SecondInternational Workshop on Exploiting ArtificialIntelligence Techniques for Data Management,aiDM@SIGMOD 2019, Amsterdam, The Netherlands,July 5, 2019 , pages 5:1–5:8, 2019.[49] C. Wu, A. Jindal, S. Amizadeh, H. Patel, W. Le,S. Qiao, and S. Rao. Towards a learning optimizer forshared clouds.
PVLDB , 12(3):210–222, 2018.[50] Z. Yang, A. Kamsetty, S. Luan, E. Liang, Y. Duan,X. Chen, and I. Stoica. Neurocard: One cardinalityestimator for all tables.
CoRR , abs/2006.08109, 2020.[51] Z. Yang, E. Liang, A. Kamsetty, C. Wu, Y. Duan,P. Chen, P. Abbeel, J. M. Hellerstein, S. Krishnan,and I. Stoica. Deep unsupervised cardinalityestimation.
PVLDB , 13(3):279–292, 2019.[52] M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Poczos,R. R. Salakhutdinov, and A. J. Smola. Deep sets. In
Advances in neural information processing systems ,pages 3391–3401, 2017.[53] T. Zhang, Z. He, and R. B. Lee. Privacy-preservingmachine learning through data obfuscation. arXivpreprint arXiv:1807.01860 , 2018.[54] Z. Zhao, R. Christensen, F. Li, X. Hu, and K. Yi.Random sampling over joins revisited. In
Proceedingsof the 2018 International Conference on Managementof Data, SIGMOD Conference 2018, Houston, TX,USA, June 10-15, 2018 , pages 1525–1539, 2018.14
APPENDIX

A. METRIC BASED ON PLAN-COST
P-Error (Definition 4.7, §4) is defined as:

    P-Error(Y_1, Y_2) = | P-Cost(Y_1, Y_true) − P-Cost(Y_2, Y_true) |    (11)

Notice that we need to pre-compute the constant vector Y_true for the given query. This is a pseudo-metric because it satisfies the three required properties:

1. d(x, x) = 0: clearly, P-Error(Y, Y) = 0. Also, notice that there can be other, distinct points Y_1 ≠ Y_2 such that P-Error(Y_1, Y_2) = 0, which is why P-Error is only a pseudo-metric.
2. d(x, y) = d(y, x) (symmetry): follows from the absolute value sign in the definition of P-Error.
3. d(x, z) ≤ d(x, y) + d(y, z) (triangle inequality): we present the proof below.

For notational convenience, we will use P_e to refer to Plan-Error, P_c to refer to Plan-Cost, and Y_t to refer to Y_true. The proof of the triangle inequality follows:

    P_e(Y_1, Y_2) = | P_c(Y_1, Y_t) − P_c(Y_2, Y_t) |
                  = | P_c(Y_1, Y_t) − P_c(Y_3, Y_t) + P_c(Y_3, Y_t) − P_c(Y_2, Y_t) |
                  ≤ | P_c(Y_1, Y_t) − P_c(Y_3, Y_t) | + | P_c(Y_3, Y_t) − P_c(Y_2, Y_t) |
                  = P_e(Y_1, Y_3) + P_e(Y_3, Y_2)                        (12)

The first line is the definition of P-Error. In the second line, we add and subtract the same value, P_c(Y_3, Y_t). The third line follows from the triangle inequality for absolute values, which gives exactly the statement we set out to prove.

B. FLOW-LOSS DETAILS

B.1 Computing Flows
Figure 20: Plan graph for the query shown in Figure 1.

We will use the example of the query shown in Figure 1, whose plan graph is shown in Figure 20, to derive the solution F, defined in Equation 6 in §5, which assigns a flow to every edge e in the plan graph. Throughout this derivation, we will assume access to the cost function C (Definition 4.5, §4), which assigns a cost C(e, Y) to any edge e in the plan graph given a cardinality vector Y. We rewrite the optimization program for F below:

    F-Opt(Y) = argmin_F  Σ_{e ∈ E} C(Y)_e · F_e^2                        (13)

    s.t.  Σ_{e ∈ Out(S)} F_e = Σ_{e ∈ In(D)} F_e = 1                     (14)

          Σ_{e ∈ Out(V)} F_e = Σ_{e ∈ In(V)} F_e  for every other node V  (15)

For simplicity, we will use F to refer to the vector solution of the optimization problem above. We can express the set of constraints specified above as a system of linear equations. Recall that in an electric circuit, every node has an associated voltage. Equation 13 describes an electric circuit in which the cost of each edge acts as its resistance. For our example (Figure 20), we can write out the linear equations implied by the constraints in terms of the 'voltage' of each node and the resistance (cost) of each edge, using Ohm's law:

    (v_S − v_A)/C((S,A), Y) + (v_S − v_B)/C((S,B), Y) + (v_S − v_C)/C((S,C), Y) = 1
    (v_B − v_AB)/C((B,AB), Y) + (v_B − v_S)/C((S,B), Y) = 0
    (v_C − v_AC)/C((C,AC), Y) + (v_C − v_S)/C((S,C), Y) = 0
    (v_A − v_AB)/C((A,AB), Y) + (v_A − v_AC)/C((A,AC), Y) + (v_A − v_S)/C((S,A), Y) = 0
    (v_AB − v_D)/C((AB,D), Y) + (v_AB − v_A)/C((A,AB), Y) + (v_AB − v_B)/C((B,AB), Y) = 0
    (v_AC − v_D)/C((AC,D), Y) + (v_AC − v_A)/C((A,AC), Y) + (v_AC − v_C)/C((C,AC), Y) = 0
    (v_D − v_AB)/C((AB,D), Y) + (v_D − v_AC)/C((AC,D), Y) = −1           (16)

Each term, such as (v_S − v_A)/C((S,A), Y), uses Ohm's law to compute the current flowing from S to A, and similarly for the other terms. Notice that the first and the last equations enforce the constraint in Equation 14: 1 unit of flow is outgoing from S, and 1 unit of flow is incoming to D.
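These node equations can be checked numerically. Below is a sketch that assembles one Ohm's-law equation per node and solves the resulting (rank-deficient) system with a least-squares solve; the edge costs are arbitrary illustrative values, not outputs of any real cost model:

```python
import numpy as np

# Edges of the example plan graph, with assumed illustrative costs C(e, Y).
cost = {("S", "A"): 1.0, ("S", "B"): 2.0, ("S", "C"): 4.0,
        ("A", "AB"): 3.0, ("B", "AB"): 1.0, ("A", "AC"): 2.0,
        ("C", "AC"): 5.0, ("AB", "D"): 2.0, ("AC", "D"): 6.0}
nodes = ["S", "A", "B", "C", "AB", "AC", "D"]
idx = {n: k for k, n in enumerate(nodes)}

# One equation per node: sum over incident edges of (v_node - v_neighbor)/cost,
# equal to 1 at S, -1 at D, and 0 at every intermediate node.
A = np.zeros((len(nodes), len(nodes)))
for (u, w), c in cost.items():
    g = 1.0 / c
    A[idx[u], idx[u]] += g
    A[idx[w], idx[w]] += g
    A[idx[u], idx[w]] -= g
    A[idx[w], idx[u]] -= g
rhs = np.zeros(len(nodes))
rhs[idx["S"]], rhs[idx["D"]] = 1.0, -1.0

# One equation is redundant (the system is rank-deficient), so use least squares.
v, *_ = np.linalg.lstsq(A, rhs, rcond=None)

# Ohm's law recovers the current (flow) on each edge.
flow = {(u, w): (v[idx[u]] - v[idx[w]]) / c for (u, w), c in cost.items()}
out_of_S = sum(f for (u, _), f in flow.items() if u == "S")
into_D = sum(f for (_, w), f in flow.items() if w == "D")
print(round(out_of_S, 6), round(into_D, 6))  # → 1.0 1.0
```

The solved currents automatically satisfy the conservation constraints at every intermediate node, which is exactly why the electrical-circuit view is a convenient analytical substitute for the combinatorial flow problem.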
All the intermediate equations represent the conservation constraint (Equation 15); this becomes obvious if you expand each equation and separate the terms being added (incoming current) from the terms being subtracted (outgoing current).

Next, we compactly represent the above linear constraints in terms of matrix operations. The system of linear equations above is over-determined; in practice, we therefore remove the first equation and solve the remaining ones. For simplicity, we will express the matrix operations without removing any constraint.

Let v ∈ R^N be the vector of voltages v_S, ..., v_D for all N nodes in the plan graph. Similarly, let i ∈ R^N be the vector (1, 0, ..., 0, −1), which is 1 at the source S, −1 at the destination D, and 0 everywhere else. Let X ∈ R^{N,m} be the incidence matrix of the plan graph, with rows indexed by vertices and columns indexed by edges:

    X_{n,e} = 1 if e ∈ Out(n);  −1 if e ∈ In(n);  0 otherwise.           (17)

Next, to simplify the notation, we define g_e = 1/C(e, Y), and let G(Y) ∈ R^{m,m} be the diagonal matrix for the m edges with G_{e,e} = g_e = 1/C(e, Y):

    g(Y) = (1/C(e_1, Y), ..., 1/C(e_m, Y));   G(Y) = diag(g(e_1), ..., g(e_m))   (18)

Thus, G is defined simply using the cost of each edge in the plan graph w.r.t. a given cardinality vector Y. Note that in g and G we use the inverse of the costs because in Equation 16 the cost terms appear in the denominator; this simplifies the formulas.

We define B(Y) ∈ R^{N,N} as a function of the costs w.r.t. cardinality vector Y:

    B(Y) = X · G(Y) · X^T                                                (19)

One can verify that each entry of B is given by the following piecewise function:

    B_{u,w} = Σ_{e ∈ In(u) ∪ Out(u)} 1/C(e, Y)  if u = w;
              −1/C(e, Y)                         if e = (u, w) is an edge;
              0                                  otherwise.

Now we are ready to compactly represent all the constraints from Equation 16:

    B(Y) v = i   ⟹   v = B(Y)^{−1} i                                     (20)

We do not know v, but B(Y) is a deterministic function given a cost model C and a cardinality vector Y, while i is a fixed vector. Thus, using the constraints, we have found a way to compute the values of v.
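As a concrete sketch of this construction, we can assemble X, G, and B for the example plan graph and solve for the voltages numerically. The edge costs below are arbitrary illustrative values (not from a real cost model), and we use a pseudo-inverse since B is singular (its rows sum to zero):

```python
import numpy as np

# Plan graph from Figure 20: S -> {A, B, C} -> {AB, AC} -> D.
nodes = ["S", "A", "B", "C", "AB", "AC", "D"]
edges = [("S", "A"), ("S", "B"), ("S", "C"),
         ("A", "AB"), ("B", "AB"), ("A", "AC"), ("C", "AC"),
         ("AB", "D"), ("AC", "D")]
cost = np.array([1.0, 2.0, 4.0, 3.0, 1.0, 2.0, 5.0, 2.0, 6.0])  # assumed C(e, Y)

N, m = len(nodes), len(edges)
idx = {n: k for k, n in enumerate(nodes)}

# Incidence matrix X (Equation 17): +1 where an edge leaves a node, -1 where it enters.
X = np.zeros((N, m))
for e, (u, w) in enumerate(edges):
    X[idx[u], e] = 1.0
    X[idx[w], e] = -1.0

G = np.diag(1.0 / cost)        # Equation 18: diagonal matrix of inverse costs
B = X @ G @ X.T                # Equation 19: a weighted graph Laplacian

# i routes 1 unit of flow from the source S to the destination D.
i = np.zeros(N)
i[idx["S"]], i[idx["D"]] = 1.0, -1.0

v = np.linalg.pinv(B) @ i      # Equation 20, via the pseudo-inverse
F = G @ (X.T @ v)              # Ohm's law per edge: F_e = g_e * (v_u - v_w)

# Sanity check: 1 unit leaves S, 1 unit enters D, and flow is conserved everywhere.
print(np.allclose(X @ F, i))   # → True
```

Because i sums to zero, it lies in the range of the singular matrix B, so the pseudo-inverse recovers an exact solution; the resulting edge flows F are unique even though the voltages are only determined up to an additive constant.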
Note that B(Y) does not have to be invertible, since we can use the pseudo-inverse instead.

Recall that the flows in Equation 13 correspond to currents in the context of electrical circuits. Using Ohm's law and the voltages calculated above, we can calculate the optimal flow (current) on an edge as:

    F_{(u,w)} = (v_u − v_w) · g_{(u,w)}                                  (21)

where g_{(u,w)} = 1/C((u,w), Y) is the inverse of the cost of edge (u, w), and v_u, v_w are the voltages associated with nodes u and w. The linear equations for the flow F on each edge can be represented as a matrix multiplication:

    F(Y) = G(Y) X^T v = G(Y) X^T B(Y)^{−1} i                             (22)

where the second equality uses Equation 20. In §5, Equation 9, we gave the general form of the solution for F(Y), which is satisfied by the precise definition in Equation 22.

B.2 Flow-Loss
Note that we found the flows F using the estimated cardinalities Y and the costs they induce on each edge. To compute the true costs, which are used to calculate Flow-Loss, we need the true cardinalities Y_true. For convenience, we define the following diagonal matrix with the true cost of each edge:

    C^true = diag(C(e_1, Y_true), ..., C(e_m, Y_true))                   (23)

We rewrite Flow-Loss, as also shown in Equation 10 in §5:

    Flow-Loss = Σ_e C^true_{e,e} · F_e^2                                 (24)

where we sum over all edges in the plan graph. In terms of matrix operators, we can write it as:

    F^T C^true F                                                         (25)

B.3 Flow-Loss gradient
We present the dependency structure of the computation of Flow-Loss below. Notice that the cardinalities Y only impact Flow-Loss through the costs, represented by g:

    Y  --C(·)-->  g  --Opt(·), C^true-->  Flow-Loss

For convenience, we use g to represent the costs; recall, g is just the inverse of the costs. The definitions of g and C^true are given in Equations 18 and 23. For simplicity, we refer to Y_est as simply Y. Recall that the gradient of an m-dimensional vector w.r.t. an n-dimensional vector is the n×m Jacobian matrix, while the gradient of a scalar w.r.t. an n-dimensional vector is again an n-dimensional vector. For simplicity, we also avoid explicitly writing out the dependencies as function arguments, as we have done so far; so instead of G(Y) (Equation 18), we will just write G.

    ∇_Y Flow-Loss = (∇_Y g)(∇_g Flow-Loss) = (∇_Y g)(∇_g F)(2 C^true F)  (26)

where F is the vector of flows for each edge, which we can compute using Equation 22. The first equality follows because, given g (the inverse estimated cost of each edge), computing Flow-Loss does not depend on Y; therefore, we can use the chain rule to separate the gradient into two independent factors. The second equality follows by considering the partial derivative of Flow-Loss (as in Equation 24) w.r.t. a single element of g:

    ∂Flow-Loss/∂g_j = 2 Σ_e C^true_{e,e} F_e (∂F_e/∂g_j)                 (27)

Here, 2 Σ_e C^true_{e,e} F_e corresponds to the term (2 C^true F) in Equation 26.

Next, we write out explicit formulas for each of the unknown terms in the equation above. First, ∇_Y g is the Jacobian matrix of the (inverse) cost function, whose inputs are the estimated cardinalities and whose outputs are the inverse costs:

    ∇_Y g = [ dg_1/dY_1  ...  dg_m/dY_1 ]
            [    ...     ...     ...    ]
            [ dg_1/dY_n  ...  dg_m/dY_n ]                                (28)

Notice that this is the only place where we need to take the gradient of the cost function C (Definition 4.5, §4). Thus, we could potentially use significantly more complex cost models, as long as this value can be approximated. Also, most terms in this Jacobian matrix end up being trivially zero, since the cost of a particular edge depends on only two elements of Y. This is one of the ways we can significantly speed up the gradient computation by explicitly coding up the formulas.

Calculating ∇_g F is more involved. Differentiating F = G X^T B^{−1} i (Equation 22) w.r.t. a single element g_j, and using ∂B/∂g_j = X (∂G/∂g_j) X^T, the j-th row of ∇_g F ∈ R^{m,m} comes out to be the transpose of:

    ∂F/∂g_j = (∂G/∂g_j) X^T B^{−1} i − G X^T B^{−1} X (∂G/∂g_j) X^T B^{−1} i   (29)

X was defined in Equation 17, B was defined in Equation 19, and G was defined in Equation 18. As we can see, many results are re-used from the computation of F; in terms of implementation, this means that the forward and backward passes of a neural network can reuse intermediate results, which leads to significantly more efficient code. Also, once again, many of the partial derivatives are zero and need not be computed when implementing these gradients.

C. DATASET

C.1 StackExchange timeouts
Figure 21: An example of a timed-out subquery on the StackExchange database.

On the StackExchange datasets, three of the six templates have a large proportion of timeouts when generating the ground truth data for the sub-plans. This is due to the unusual join graph, presented with a simplified query in Figure 21. As the accompanying query makes clear, there is a relationship between the joins on user u and answer a. But there is no relationship between question q and user u without a; still, the sub-plan q ⋈ s ⋈ u is not a cross-join, since all tables have a relationship with site s. Essentially, q ⋈ s ⋈ u
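The cross-join argument above amounts to a connectivity test on the join graph. The sketch below is a hypothetical helper (the table names and join-graph edges are read off the simplified query in Figure 21, with q–a and a–u joins and every table joining site s):

```python
# Join graph of the simplified StackExchange query: every table joins site s;
# answer a joins both question q and user u; q and u have no direct join.
join_graph = {
    "q": {"s", "a"},
    "a": {"s", "q", "u"},
    "u": {"s", "a"},
    "s": {"q", "a", "u"},
}

def is_cross_join(tables, join_graph):
    """A sub-plan is a cross join iff its tables do not induce a connected subgraph."""
    tables = set(tables)
    seen, stack = set(), [next(iter(tables))]
    while stack:
        t = stack.pop()
        if t in seen:
            continue
        seen.add(t)
        stack.extend((join_graph[t] & tables) - seen)
    return seen != tables

print(is_cross_join({"q", "s", "u"}, join_graph))  # → False (connected through s)
print(is_cross_join({"q", "u"}, join_graph))       # → True (no join without a or s)
```

This matches the observation in the text: q ⋈ s ⋈ u is technically not a cross join because s connects to everything, even though q and u are only meaningfully related through a.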