Selectivity Estimation with Attribute Value Dependencies using Linked Bayesian Networks
Max Halford
IRIT Laboratory, IMT Laboratory, University of Toulouse
[email protected]

Philippe Saint-Pierre
IMT Laboratory, University of Toulouse
[email protected]

Frank Morvan
IRIT Laboratory, University of Toulouse
[email protected]

Abstract
Relational query optimisers rely on cost models to choose between different query execution plans. Selectivity estimates are known to be a crucial input to the cost model. In practice, standard selectivity estimation procedures are prone to large errors. This is mostly because they rely on the so-called attribute value independence and join uniformity assumptions. Therefore, multidimensional methods have been proposed to capture dependencies between two or more attributes, both within and across relations. However, these methods require a large computational cost which makes them unusable in practice. We propose a method based on Bayesian networks that is able to capture cross-relation attribute value dependencies with little overhead. Our proposal is based on the assumption that dependencies between attributes are preserved when joins are involved. Furthermore, we introduce a parameter for trading between estimation accuracy and computational cost. We validate our work by comparing it with other relevant methods on a large workload derived from the JOB and TPC-DS benchmarks. Our results show that our method is an order of magnitude more efficient than existing methods, whilst maintaining a high level of accuracy.

Keywords: Query optimisation · Cost model · Selectivity estimation · Bayesian networks
A query optimiser is responsible for providing a good query execution plan (QEP) for incoming database queries. To achieve this, the optimiser relies on a cost model, which tells the optimiser how much a given QEP will cost. The cost model's estimates are in large part based on the selectivity estimates of each operator inside a QEP [1]. The issue is that selectivity estimation is a difficult task. In practice, huge mistakes are not exceptions but rather the norm [2]. In turn, this leads the cost model to produce cost estimates that can be wrong by several orders of magnitude [3]. The errors made by the cost model will inevitably result in using QEPs that are far from optimal in terms of memory usage and running time. Moreover, the cost model may also be used by other systems in addition to the query optimiser. For instance, service-level agreement (SLA) negotiation frameworks are based on the assumption that the cost of each query can accurately be estimated by the cost model [4]. Cost models are also used for admission control (should the query be run or not?), query scheduling (when to run a query?), progress monitoring (how long will a query take?), and system sizing (how many resources should be allocated to run the query?) [5]. Errors made by the cost model may thus have far-reaching consequences. Such errors are for the most part due to the inaccuracy of the selectivity estimates.

Selectivity estimates are usually wrong because of the many simplifying assumptions that are made by the cost model. These assumptions are known to be unverified in practice. Nonetheless, they allow the use of simple methods that have a low computational complexity. For example, the attribute value independence (AVI) assumption, which states that attributes are independent of each other, is ubiquitous. This justifies the widespread use of one-dimensional histograms for storing the distribution of attribute values. Another assumption which is omnipresent is the join uniformity assumption, which states that attributes preserve their distribution when they are part of a join. Although this is a major source of error, it rationalises the use of simple formulas that assume uniformity [6]. Producing accurate
selectivity estimates whilst preserving a low computational overhead is thus still an open research problem, even though many methods from various approaches have been proposed.

The standard approach to selectivity estimation is to build a statistical synopsis of the database. The synopsis is built at downtime and is used by the cost model when the query optimiser invokes it. The synopsis is composed of statistics that summarise each relation along with its attributes. Unidimensional constructs, e.g., histograms [7], can be used to summarise single attributes, but cannot capture dependencies between attributes. Multidimensional methods, e.g., multivariate histograms [8], can be used to summarise the distribution of two or more attributes. However, their spatial requirement grows exponentially with the number of attributes. Moreover, they often require a non-trivial construction phase that takes an inordinate amount of time. Another approach is to use sampling, where the idea is to run a query on a sample of the database and extrapolate the selectivity [9]. Sampling works very well for single relations. The problem is that sampling is difficult to apply in the case of joins. This is because the join of sampled relations has a high probability of being empty [10]. A different approach altogether is to acknowledge that the cost model is mostly wrong, and instead learn from its mistakes so as not to reproduce them. The most successful method in this approach is DB2's so-called learning optimiser (LEO) [11]. Such a memorising approach can be used in conjunction with any cost model. Although they are appealing, memorising approaches are of no help when queries that have not been seen in the past are issued. What's more, they are complementary to other methods. Finally, statistical approaches based on conditional distributions seem to strike the right balance between selectivity estimation accuracy and computational requirements [12]. A conditional distribution is a way of modifying a distribution of values based on the knowledge of another value – called the conditioning value. For example, if the values of attribute B depend on those of A, then we can write P(A, B) = P(B | A) × P(A). Conditional distributions can be organised in a so-called Bayesian network [13]. Bayesian networks thus factorise a multidimensional distribution into a product of lower-dimensional ones. If well chosen, these factorisations can preserve most of the information whilst consuming much less space. In [12], we proposed to use Bayesian networks to capture attribute value dependencies inside each relation of a database. The issue with Bayesian networks is their computational cost [14]. To alleviate this issue, we restricted our networks to possess tree topologies, which leads to simpler algorithms that run in linear time. The downside of using tree topologies is that our networks capture fewer dependencies than a general network. However, we showed in our benchmarks that our method was able to improve the overall selectivity estimation accuracy at a very reasonable cost. The downside of our work in [12] is that it completely ignores dependencies between attributes of different relations, which we address in the present work.

Bayesian networks that capture attribute value dependencies across relations have also been proposed. [15] were the first to apply them for selectivity estimation.
However, they used off-the-shelf algorithms that are standard for working with Bayesian networks, but which are costly and impractical in constrained environments. [16] extended the work of [15] to address the computational cost issues. Indeed, they proposed various constraints on the network structure of each relation's Bayesian network that reduced the overall complexity. However, this still results in a global Bayesian network with a complex structure, which requires a costly inference algorithm in order to produce selectivity estimates. Although the methods from [15] and [16] enable a competitive accuracy, they both incur a costly construction phase and are too slow at producing selectivity estimates. In light of this, our goal is to capture attribute value dependencies across relations with as little an overhead as possible. With this in mind, our method consists in measuring the distribution of a carefully selected set of attributes before and after a join. We do so by performing a small number of offline joins that exploit the topology of the relations. Effectively, we make use of the fact that joins mainly occur on primary/foreign key relationships, and thus warp the attribute value distributions in a predictable way. The contributions of our paper are as follows: (1) we introduce a new assumption which simultaneously softens the attribute value independence and join uniformity assumptions, (2) based on our assumption, we propose an algorithm for connecting individual Bayesian networks together into what we call a linked Bayesian network, (3) we show how such a linked Bayesian network can be used to efficiently estimate query selectivities both within and between relations, and (4) we introduce a parameter which allows us to generalise the trade-offs induced by existing methods based on Bayesian networks.

The rest of the paper is organised as follows. Section 2 presents the related work. Section 3 introduces some notions related to Bayesian networks and summarises the work we did in [12]. Section 4 introduces the methodology for combining individual Bayesian networks using the topology of a database's relations. In section 5, we compare our proposal with other methods on an extensive workload derived from the JOB [2] and TPC-DS [17] benchmarks. Finally, section 6 concludes the paper and hints at potential follow-ups.

Ever since the seminal work of Selinger et al. [6], query optimisation has largely relied on the use of cost models. Because the most important part of the cost model is the selectivity estimation module [18], a lot of effort has
been made across the decades. [19] first proposed the use of histograms to approximate the distribution P(x) of a single attribute x. Since then, a lot of work has gone into developing optimal histograms [7], which have been used ubiquitously in cutting-edge cost models. Smooth versions of histograms, e.g., kernel density estimators (KDEs) [20] and wavelets [21], have also been proposed. However, these methods are based on single attributes, and as such lose in accuracy what they gain in computational efficiency. Indeed, there is no way to capture a dependency between two attributes x and y if one only has the unidimensional distributions P(x) and P(y) available, regardless of their accuracy.

Multidimensional distributions, i.e., P(X_1, ..., X_n), are a way to catch dependencies between attributes. Methods based on such distributions are naturally more accurate because they soften the AVI assumption. However, they require a large amount of computational resources, which hinders their use in high-throughput settings. [8] first formalised the use of equi-depth multidimensional histograms and introduced an efficient construction algorithm. [22] proposed another construction algorithm based on Hilbert curves. Multidimensional KDEs have also been proposed [23], with somewhat the same complexity guarantees. In search of efficiency, [24] offered a workload-aware method where the idea is to only build histograms for attributes that are often queried together. Even though methods based on multidimensional distributions are costly, they are implemented in some database systems and are used when specified by a database administrator. However, these methods do not help whatsoever in capturing dependencies across relations, which is probably the biggest issue cost models have to deal with.

Sampling methods have also been proposed to perform selectivity estimation. The idea is to run a query on a sample of the database and extrapolate the selectivity [9]. Sampling works very well for single relations and has been adopted by some commercial database systems. However, off-the-shelf sampling procedures suffer from the fact that the join of sampled relations has a high probability of being empty [10]; in other words, a join has to be materialised before sampling can be done. This issue can be alleviated by the use of correlated sampling [25], where a deterministic hash function is used to ensure that samples from different relations will match with each other. Another technique is to use indexes when available [26], but this is only realistic for in-memory databases. [27] also proposed heuristics for maintaining statistics of join synopses. Overall, sampling is an elegant selectivity estimation method, not least because it can handle complex predicates which statistical summaries cannot (e.g., regex queries). However, sampling necessarily incurs a high computational cost. Indeed, even if the samples are obtained at downtime, they still have to be loaded in memory during the query optimisation phase.

Throughout the years, a lot of proposals have been made to relax the simplifying assumptions from [6]. All of these require compromises in terms of accuracy, speed, and memory usage. The general consensus is that each method shines in a particular use case, and thus combining different methods might be a good approach. [28] formalised this idea by using a maximum entropy approach. Recently, [29] proposed combining sampling and synopses.
Another approach altogether is to "give up" on the cost model and instead memorise the worst mistakes it makes so as not to reproduce them in the future [11]. There have also been proposals that make use of machine learning [30, 31, 32, 33], where a supervised learning algorithm is taught to predict selectivities based on features derived from a given query and the database's metadata. Recently, deep learning methods have been proposed to extract features that don't require rules written by humans. One of the most prominent papers that advocates the use of deep learning for selectivity estimation can be found in [34]. They proposed a neural network architecture, which they dubbed MSCN for multi-set convolutional network. Although approaches based on supervised machine learning have had great success in other domains, their performance for query selectivity estimation isn't competitive enough, yet.

Approaches that exploit attribute correlations in order to avoid storing redundant information have also been proposed. For example, [35] proposes to build a statistical interaction model that allows determining a relevant subset of multidimensional histograms to build. In other words, they propose to build histograms when attributes are correlated, and to make the AVI assumption if not. Bayesian networks can be seen through the same lens of exploiting redundant information. Essentially, they factorise the full probability distribution into a set of conditional distributions. A conditional distribution between two attributes implies a hierarchy whereby one of the attributes determines to some extent the other. Formally, a Bayesian network is a directed acyclic graph (DAG) where each node is an attribute and each arrow implies a conditional dependency. They can be used to summarise a probability distribution by breaking it up into smaller pieces. In comparison with the supervised learning based methods mentioned in the previous paragraph, Bayesian networks are an unsupervised learning method. What this means is that they directly learn by looking at the data, whereas supervised methods require a workload of queries and outputs in order to learn. In [12], we proposed to use Bayesian networks for capturing attribute value dependencies inside individual relations. [15] and [16] both proposed methods for using Bayesian networks to capture attribute value dependencies between different relations. Although this leads to more accurate selectivity estimates, it requires much more computation time and is infeasible in practice. This is due to the fact that they require the use of expensive belief propagation algorithms for performing inference. Meanwhile, our method is much faster because it restricts each Bayesian network to a tree topology, which allows the use of the variable elimination algorithm. However, our method completely ignores dependencies between attributes of different relations. Our goal in this paper is to reconcile both approaches. Effectively, we want to keep the
computational benefits of building and using individual Bayesian networks, but at the same time we want our method to capture some dependencies across relations.
In [12], we developed a methodology for constructing Bayesian networks to model the distribution of attribute values inside each relation of a database. A Bayesian network is a probabilistic model. As such, it is used for approximating the probability distribution of a dataset. The particularity of a Bayesian network is that it uses a directed acyclic graph (DAG) in order to do so. The graph contains one node per variable, whilst each directed edge represents a conditional dependency between two variables. Therefore, the graph is a factorisation of the full joint distribution:

P(X_1, ..., X_n) ≈ ∏_{X_i ∈ X} P(X_i | Parents(X_i))    (1)

The joint distribution P(X_1, ..., X_n) is the probability distribution over the entire set of attributes {X_1, ..., X_n}. Meanwhile, Parents(X_i) stands for the attributes that condition the value of X_i. The distribution P(X_i | Parents(X_i)) is thus the conditional distribution of attribute X_i's values. In practice, the full distribution is inordinately large, and is unknown to us. However, the total of the sizes of the conditional distributions P(X_i | Parents(X_i)) is much smaller.

Using standard rules of probability, such as Bayes' rule and the law of total probability [13], we are able to answer any selectivity estimation problem from a Bayesian network by converting a logical query into a product of conditional probabilities. Note, however, that a Bayesian network is necessarily an approximation of the full probability distribution because it makes assumptions about the generating process of the data. Finding the right graph structure of a Bayesian network is called structure learning [13]. This is usually done by maximising a scoring function, which is an expensive process that scales super-exponentially with the number of variables [14]. Approximate search methods as well as integer programming solutions have been proposed [36]. In our work in [12], we proposed to use the Chow-Liu algorithm [37]. This algorithm has the property of finding the best tree structure where nodes are restricted to have at most one parent. The obtained tree is the best in the sense of maximum likelihood estimation. In addition to this property, the Chow-Liu algorithm runs in O(p²) time, where p is the number of variables, and is simple to implement. It works by first computing the mutual information between each pair of variables, which can be seen as the strength of the relation between two variables. The next step is to find the maximum spanning tree (MST) using the mutual information weights, and then to derive a directed graph approximating the joint probability distribution by orienting the tree. We proposed an inference process based on the variable elimination algorithm [38], since inference can be done in linear time for trees. Our experiments indicated that competing approaches are much slower. Note that our inference process can further be accelerated by solving a Steiner tree problem [39].

In [12], we proposed a simple method which consists in building one Bayesian network per relation. On the one hand, this has the benefit of greatly reducing the computational burden in comparison with a single large Bayesian network, as is done in [15] and [16]. On the other hand, it ignores dependencies between attributes of different relations. We will now discuss how we can improve our work from [12] in order to capture some dependencies across relations.
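To make this construction concrete, the sketch below builds a Chow-Liu tree from a list of rows by computing empirical pairwise mutual information and extracting a maximum spanning tree with Prim's algorithm. It is a minimal illustration with toy data, assuming purely categorical attributes and a hand-picked root; it is not the implementation used in [12].

import math
from collections import Counter
from itertools import combinations

def mutual_information(rows, a, b):
    """Empirical mutual information (in nats) between attributes a and b."""
    n = len(rows)
    joint = Counter((r[a], r[b]) for r in rows)
    pa = Counter(r[a] for r in rows)
    pb = Counter(r[b] for r in rows)
    return sum(
        (c / n) * math.log((c / n) / ((pa[va] / n) * (pb[vb] / n)))
        for (va, vb), c in joint.items()
    )

def chow_liu_tree(rows, attributes, root):
    """Return a {child: parent} mapping describing a Chow-Liu tree.

    Steps: (1) weight every attribute pair by mutual information, (2) grow a
    maximum spanning tree with Prim's algorithm, (3) orient edges away from
    the chosen root.
    """
    weights = {
        frozenset(pair): mutual_information(rows, *pair)
        for pair in combinations(attributes, 2)
    }
    parents, in_tree = {}, [root]
    while len(in_tree) < len(attributes):
        # Attach the outside attribute with the strongest link to the tree.
        child, parent = max(
            ((a, t) for a in attributes if a not in in_tree for t in in_tree),
            key=lambda edge: weights[frozenset(edge)],
        )
        parents[child] = parent
        in_tree.append(child)
    return parents

# Toy usage: hair tracks nationality, and salary tracks hair more closely
# than it tracks nationality.
rows = [
    {"nationality": "Swedish", "hair": "Blond", "salary": "High"},
    {"nationality": "Swedish", "hair": "Blond", "salary": "Low"},
    {"nationality": "Swedish", "hair": "Blond", "salary": "High"},
    {"nationality": "American", "hair": "Brown", "salary": "Low"},
    {"nationality": "American", "hair": "Brown", "salary": "Low"},
    {"nationality": "American", "hair": "Blond", "salary": "High"},
]
print(chow_liu_tree(rows, ["nationality", "hair", "salary"], root="nationality"))
# {'hair': 'nationality', 'salary': 'hair'}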
The task of selectivity estimation is to determine the selectivity of a query over a set of attributes X_i that belong to a set of relations R_j. By making the AVI assumption, this comes down to measuring individual attribute distributions and multiplying them together, as so:

P(X_1, ..., X_n) ≈ ∏_{R_j} ( ∏_{X_i ∈ R_j} P(X_i) )    (2)

The methodology from [12] models the attribute value distribution of a database by building a tree-shaped Bayesian network for each relation. For efficiency reasons it purposefully only captures dependencies between attributes of a
single relation. As such, it ignores the many dependencies that exist between attributes of different relations and that are the bane of cost models. Essentially, this method boils down to factorising the full probability distribution as so:

P(X_1, ..., X_n) ≈ ∏_{R_j} ( ∏_{X_i ∈ R_j} P(X_i | Parent(X_i)) )    (3)

where {X_1, ..., X_n} is the entire set of attributes over all relations and X_i ∈ R_j are the attributes that belong to relation R_j. Parent(X_i) denotes the attribute on which the distribution of X_i is conditioned – because each Bayesian network is tree-shaped, each attribute excluding the root has a single parent. Although the work from [12] ignores dependencies between attributes of different relations, it is still much more relevant than the common assumption of full attribute value independence. Our goal in this paper is to model the full probability distribution by taking into account dependencies between attributes of different relations, which can be represented as:

P(X_1, ..., X_n) ≈ ∏_{X_i} P(X_i | Parent(X_i))    (4)

Note that equation 4 captures more information than equation 2. Modelling the data by taking into account conditional dependencies thus guarantees that the resulting selectivity estimates are at least as accurate as when assuming independence between attributes.

In [12], we made the assumption that attribute values of different relations are independent. Additionally, we assumed that each attribute value distribution remains the same when the relation it belongs to is joined with another relation. This is called the join uniformity assumption and is a huge source of error. Indeed, the distributions of an attribute's values before and after a join are not necessarily the same. For instance, imagine an e-commerce website where registered customers are stored in a database alongside their purchases. Each customer can make zero or more purchases whilst each purchase is made by exactly one customer. Some customers might be registered on the website but might not have made a purchase. If the customers and purchases relations are joined together, then the customers who have not made any purchase will not be included in the join. Therefore, the attributes from the customers relation will have different value distributions when joined with the purchases relation. Note however that the attribute value distributions from the purchases relation will not be modified. This stems from the fact that the join between the customers and purchases relations is a one-to-many join. We will now explain how we can use this property to capture attribute value dependencies across relations.

Let us assume we have two relations R and S that share a primary/foreign key relationship. That is, S contains a foreign key which references the primary key of R. This means that each tuple from R can be joined with zero or more tuples from S. A direct consequence is that the size of the join R ⋈ S is equal to |S|. The join uniformity assumption implies that the probability for a tuple r from relation R to be present in R ⋈ S follows a uniform distribution. In statistical terms, that is:

P(r ∈ R ⋈ S) ∼ U(1/|R|)    (5)

Consequently, the expected number of times each tuple from R will be part of R ⋈ S is |S|/|R|. Let us now denote by P_R(A) the value distribution of attribute A in relation R.
We will also define P_{R⋈S}(A) as the value distribution of attribute A in the join R ⋈ S. The join uniformity assumption thus implies that the distributions of A's values before and after R is joined with S are equal:

P_R(A) = P_{R⋈S}(A)    (6)

Furthermore, assume we have found and built the following factorised distribution over attributes A, B, and C from relation R:

P_R(A, B, C) ≈ P_R(A | B) × P_R(B | C) × P_R(C)    (7)

If we hold the join uniformity assumption to be true, then we can use the factorised distribution to estimate selectivities for queries involving A, B and C when R is joined with S without any further modification. The issue is that this is an idealised situation that has no reason to occur in practice. On the contrary, it is likely that some tuples from R will be more or less present than others. However, we may assume that after the join the attribute value dependencies implied by our factorisation remain valid within each relation. We call this the attribute value dependency preservation
assumption. The idea is that if attributes A, B and C are dependent in a certain way in relation R, then there is not much reason to believe that these dependencies will disappear once R is joined with S. Although this may not necessarily always occur in practice, it is still a much softer assumption than those usually made by cost models.

To illustrate, let us consider a toy database composed of the following relations: customers with attributes {nationality, hair, salary}, shops with attributes {name, city, size}, and purchases with attributes {day of week}. Moreover, assume that the purchases relation has two foreign keys, one that references the primary key of customers and another that references that of shops. The purchases relation can thus be seen as a fact table whilst customers and shops can be viewed as dimension tables. In what follows we will use the shorthand C for the customers relation, S for the shops relation, and P for the purchases relation.

In the customers relation, there are Swedish customers, and a lot of them have blond hair. We might capture this property in a Bayesian network with the conditional distribution P_C(hair | nationality), which indicates that hair colour is influenced by nationality. We could suppose that the fact that Swedish people have blond hair is still true once the customers relation is joined with the purchases relation. In other words, the hair colour shouldn't change the rate at which Swedish customers make purchases. However, we may rightly assume that the number of purchases will change according to the nationality of each customer. Mathematically, we are saying the following:

P_{C⋈P}(hair, nationality) = P_C(hair | nationality) × P_{C⋈P}(nationality)    (8)

In other words, because we assume that P_C(hair | nationality) is equal to P_{C⋈P}(hair | nationality), then we know P_{C⋈P}(hair, nationality) – i.e., we assume that the conditional distribution remains unchanged after the join. An immediate consequence is that we get to know the P_{C⋈P}(hair) distribution for free. Indeed, by summing over the nationalities, we obtain:

P_{C⋈P}(hair) = Σ_{nationality} P_C(hair | nationality) × P_{C⋈P}(nationality)    (9)

To demonstrate why our assumption is useful for the purpose of selectivity estimation, let us use the example data in tables 1 and 2.

Table 1: Customers relation (Customer, Nationality, Hair)
Table 2: Purchases relation (Shop, Customer)
Consider the following probability in C ⋈ P:

P_{C⋈P}(hair = Blond, nationality = Swedish) = 5/7    (10)

This fraction is the true proportion of purchases that were made by Swedish customers with blond hair – said otherwise, this is the selectivity of the query. Obtaining it requires scanning the rows resulting from the join of C with P. In practice this can be very burdensome, especially when queries involve many relations. If we assume that the join uniformity assumption holds – in other words, we assume that the value distributions of nationality and hair do not change – then we can simply reuse the Bayesian network of the customers relation:

P_{C⋈P}(Blond, Swedish) ≈ P_C(Blond | Swedish) × P_C(Swedish)    (11)
In this case, making the join uniformity assumption makes us underestimate the true selectivity by 44%. Some of this error is due to the fact that the nationality attribute values are not distributed in the same way once C and P are joined – indeed, in this toy example Swedish customers make more purchases than American ones. However, if we know the distribution of the nationality attribute values in the join, i.e., P_{C⋈P}(nationality), then we can enhance our estimate in the following manner:

P_{C⋈P}(Blond, Swedish) ≈ P_C(Blond | Swedish) × P_{C⋈P}(Swedish)    (12)

Now our underestimate has shrunk to 20%. The only difference with the previous equation is that we have replaced P_C(Swedish) with P_{C⋈P}(Swedish). Note that we did not have to precompute P_{C⋈P}(Blond, Swedish). Indeed, we assumed that the dependency between nationality and hair doesn't change once C and P are joined, which stems from our dependency preservation assumption. Note that, in our toy example, the assumption is slightly wrong because blond customers have a higher purchase rate than brown-haired ones, regardless of nationality. Regardless, our assumption is still much softer than the join uniformity and attribute value independence assumptions.

Our assumption is softer than the join uniformity assumption because it allows attribute value distributions to change after a join. Statistically speaking, instead of assuming that tuples appear in a join following a uniform distribution, we are saying that the distribution of the tuples is conditioned on a particular attribute (e.g., the nationality of the customers dictates the distribution of the customers in the join between customers and purchases). We also assume that attribute value dependencies within each relation are preserved through joins (e.g., hair colour is still dependent on nationality). The insight is that in a factorised distribution, the top-most attribute is part of any query. For instance, in the distribution P(A | B) × P(B | C) × P(C), every query involving any combination of A, B, and C will necessarily involve P(C). We will now see how our newly introduced attribute value dependency preservation assumption can be used to link Bayesian networks from different relations together, and as such relax the join uniformity and attribute value independence assumptions at the same time.

As explained in the previous subsection, if a purchases relation has a foreign key that references the primary key of another relation named customers, then the distribution of purchases' attribute values will not change after joining customers and purchases. However, the distribution of customers' attribute values will change if purchases' foreign key is skewed, which is always the case to some degree. If we use the method proposed by [12], then the Bayesian network built on customers would not be accurate when estimating selectivities for queries involving customers and purchases. This is because it would assume the distributions of the attribute values from customers are preserved after the join, which is a consequence of the join uniformity assumption. Moreover, because of the AVI assumption, we would not be capturing the existing dependencies between customers' attributes and purchases' attributes, since each relation's attributes are assumed to be independent of those of the opposite relation.
On the other hand, if we join customers and purchases and build a Bayesian network on top of the join, then we will capture the cross-relation attribute value dependencies, but at too high a computational cost [15, 16]. Up to now, we have only mentioned the case where a single join occurs, but the same kind of issues occur for many-way joins – including star-joins and chain-joins.

If the attribute value distributions of customers and purchases are estimated using Bayesian networks that possess a tree structure, then we only have to include the dependencies of a subset of customers' attributes with those of purchases. Specifically, we only have to include the root attribute of customers' Bayesian network into that of purchases. Indeed, because customers' Bayesian network is a tree, all of its nodes are necessarily linked to the root. If we know the distribution of the root attribute's values after customers is joined with purchases, then, by making the attribute value dependency preservation assumption introduced earlier, we automatically obtain the distribution of the rest of customers' attributes. In other words, if the distribution of an attribute's values is modified when the relation it belongs to is joined with another relation, then we assume that all the attributes that depend on it have their value distributions modified in the exact same manner. This is another way of saying that the conditional distributions remain the same.

We will show how this works on our toy database consisting of the relations customers, shops, and purchases. Following the methodology from [12], we would have built one Bayesian network per relation. Each Bayesian network would necessarily have been a tree as a consequence of using the Chow-Liu algorithm [37]. Depending on the specifics of the data, we might have obtained the Bayesian networks shown in figure 1.
Figure 1: Separate Bayesian networks of customers, shops, and purchases

Furthermore, let us consider the following SQL query:
SELECT *
FROM customers, shops, purchases
WHERE customers.id = purchases.customer_id
AND shops.id = purchases.shop_id
AND customers.nationality = 'Japanese'
AND customers.hair = 'Dark'
AND shops.name = 'Izumi'
If we were to estimate the amount of tuples that satisfy the above query using the Bayesian networks from figure 1, then we would estimate the query selectivity in the following manner:

P(Dark, Japanese, Izumi) = P_C(Dark | Japanese) × P_C(Japanese) × P_S(Izumi)    (13)

On the one hand, the conditional distribution P_C(Dark | Japanese) captures the fact that Japanese people tend to have dark hair inside the customers relation. Graphically this is represented by the arrow that points from the "Nationality" node to the "Hair" node in figure 1. On the other hand, our estimate ignores the fact that shops in Japan, including "Izumi", are mostly frequented by Japanese people. The reason why is that we have one Bayesian network per relation, instead of a global network spanning all relations, and are thus not able to capture this dependency. Regardless of the missed dependency, this simple method is still more accurate than assuming total independence. Indeed, the AVI assumption would neglect the dependency between hair and nationality, even though both attributes are part of the same relation. Meanwhile, assuming relational independence is convenient because it only requires capturing dependencies within relations, but it discards the dependency between nationality and city. We propose to capture said dependency by adding nodes from the Bayesian networks of customers and shops to the Bayesian network of purchases. Specifically, for reasons that will become clear further on, we add the roots of the Bayesian networks of customers and shops (i.e., nationality and name) to the Bayesian network of purchases. This results in the linked Bayesian network shown in figure 2.

Figure 2: Linked Bayesian network of customers, shops, and purchases

In this new configuration, we still have one Bayesian network per relation. The difference is that the Bayesian network of purchases includes the root attributes of both customers' and shops' Bayesian networks. In other words, we have joined the purchases relation with customers and shops, and we have then built a Bayesian network for purchases that now includes attributes from customers and shops. A linked Bayesian network is thus a set of separate Bayesian
networks where some of the attributes are duplicated in two related networks. In practice, this means that we now know the distribution of the nationality and name attribute values once the relations they belong to have been joined with purchases. Meanwhile, we also know their distributions when these relations are not joined with purchases. In other words, we store two distributions for each root attribute, one before the join and one afterwards. The distribution of a root attribute in a Bayesian network is nothing more than a one-dimensional histogram. This means that storing two distributions for each root attribute doesn't incur any significant memory burden.

The configuration shown in figure 2 has two immediate benefits over the one presented in figure 1. First of all, we are now able to determine whether the percentage of Japanese customers in the purchases relation is different from the one in the customers relation. Indeed, we do not have to assume the distribution remains the same after the join now that we know the distribution of nationality's values when customers is joined with purchases. A key observation is that we also get to know something about the distribution of the hair attribute values when customers is joined with purchases. That is to say, because we know how the distribution of nationality attribute values changes after the join, we also know something about the hair attribute values, because both attributes are dependent within the customers relation. This stems from the fact that we assume that the conditional distribution P(hair | nationality) is preserved after the join. Mathematically this translates to:

P_{C⋈P}(hair, nationality) = P_C(hair | nationality) × P_{C⋈P}(nationality)    (14)

Although, in practice, we expect the dependency preservation assumption to not always be verified, we argue that it is a much weaker assumption than assuming total relational independence. The second benefit is that we can now take into account the fact that Japanese people typically shop in Japanese shops, even though the involved attributes belong to relations that are not directly related. This happens because the name attribute is now part of purchases' Bayesian network as well as that of shops. Formally, the query selectivity can now be expressed as so:

P_{C⋈P⋈S}(Dark, Japanese, Izumi) = P_C(Dark | Japanese) × P_{P⋈S}(Izumi | Japanese) × P_{C⋈P}(Japanese)    (15)

Let us now consider the following SQL query, where the only difference with the previous query is that we are filtering by city instead of by name:

SELECT *
FROM customers, shops, purchases
WHERE customers.id = purchases.customer_id
AND shops.id = purchases.shop_id
AND customers.nationality = 'Japanese'
AND customers.hair = 'Dark'
AND shops.city = 'Osaka'
In this case, our linked Bayesian network would estimate the selectivity as so:

P(Dark, Japanese, Osaka) = P_C(Dark | Japanese) × ( Σ_{name} P_{P⋈S}(Osaka | name) × P_P(name | Japanese) ) × P_{C⋈P}(Japanese)    (16)

This is a simple application of Bayesian network arithmetic [13]. The reason why there is a sum is that we have to take into account all the shops that are located in Osaka, because none of them in particular has been specified in the SQL query. Note that our linked Bayesian network is still capable of estimating selectivities when only a single relation is involved. For example, we only need to use P_P(nationality) when the customers relation is joined with the purchases relation. If only the customers relation is involved in a query, then we can simply use P_C(nationality) instead of P_P(nationality). We discuss these two points in further detail in subsection 3.5.

Linked Bayesian networks thus retain the computational benefits of independent Bayesian networks, whilst softening the join uniformity assumption as well as the attribute value independence assumption. We will now discuss how one may obtain a linked Bayesian network in an efficient manner.
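Before moving on, let us make equation 16 concrete. The sketch below evaluates it on hand-written toy conditional probability tables; the numbers (and the shop "MegaMart") are invented purely for illustration and do not come from the example relations above.

# Toy conditional probability tables (invented values, purely illustrative).
P_C_hair_given_nat = {"Japanese": {"Dark": 0.8, "Blond": 0.2}}                # P_C(hair | nationality)
P_CP_nat = {"Japanese": 0.3, "Swedish": 0.2, "American": 0.5}                 # P_{C ⋈ P}(nationality)
P_P_name_given_nat = {"Japanese": {"Izumi": 0.6, "MegaMart": 0.4}}            # P_P(name | nationality)
P_PS_city_given_name = {"Izumi": {"Osaka": 1.0}, "MegaMart": {"Osaka": 0.1}}  # P_{P ⋈ S}(city | name)

def estimate_dark_japanese_osaka():
    """Evaluate equation 16 on the toy tables above.

    The inner sum runs over every shop name because the query filters on the
    city rather than on a specific shop.
    """
    nat, hair, city = "Japanese", "Dark", "Osaka"
    city_term = sum(
        P_PS_city_given_name[name].get(city, 0.0) * p_name
        for name, p_name in P_P_name_given_nat[nat].items()
    )
    return P_C_hair_given_nat[nat][hair] * city_term * P_CP_nat[nat]

print(estimate_dark_japanese_osaka())  # 0.8 * (1.0 * 0.6 + 0.1 * 0.4) * 0.3 = 0.1536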
A linked Bayesian network is essentially a set of Bayesian networks. Indeed, our method consists in taking individual Bayesian networks and linking them together in order to obtain one single Bayesian network. This linking process is detailed in the next subsection. In our case, by only including the root attribute of each relation into the Bayesian network of its parent relation, we ensure that the final network necessarily has a tree topology. Performing inference on a Bayesian network with a tree topology can be done in linear time using the sum-product algorithm [40]. Building a linked Bayesian network involves building the Bayesian networks of each relation in a particular order. Indeed, in our example, we first have to build the Bayesian networks of the customers and shops relations in order to determine the roots that are to be included in the Bayesian network of the purchases relation. To build the purchases Bayesian network, we first have to join the root attributes (i.e., nationality and name) of the first two Bayesian networks (i.e., customers and shops) with the purchases relation. Naturally, performing joins incurs an added computational cost. However, we argue that joins are unavoidable if one is to capture attribute value dependencies across relations. Indeed, if joins are disallowed altogether, then there is basically no hope of measuring dependencies between attributes of different relations. Our methodology requires performing one left join per primary/foreign key relationship, whilst only requiring the inclusion of one attribute per join, which is as cost-effective as possible.

The specifics of the procedure we used to build the linked Bayesian network are given in algorithm 1. We assume the algorithm is given a set of relations. In addition, the algorithm is provided with the set of primary/foreign key relationships in the database (e.g., purchases has a foreign key that references customers' primary key and another that references shops' primary key). This set of primary/foreign key relationships can easily be extracted from any database's metadata. The idea is to go through the set of relations and check whether the Bayesian networks of the dependent relations have been built. In this implementation a while loop is used to go through the relations in their topological order, from bottom to top. The Bayesian networks are built using the
BuildBN function, which was presented in [12]. The BuildBN function works in three steps:

1. Build a fully-connected, undirected, weighted graph, where each node is an attribute and each edge's weight is the mutual information between the two attributes it connects.
2. Find the maximum spanning tree (MST) of the graph.
3. Orient the MST in order to obtain a directed tree by choosing a root.

The BuildBN function produces a Bayesian network with a tree topology called a Chow-Liu tree [37]. This tree has the property of being the tree which stores the maximum amount of information out of all the legal trees. In our algorithm, the first pass of the while loop will build the Bayesian networks of the relations that have no dependencies whatsoever (e.g., those that do not reference any other relation through a foreign key, such as customers and shops in our example). The next pass will build the Bayesian networks of the relations whose foreign keys reference the primary keys of the relations covered in the first pass. The algorithm will necessarily terminate once each relation has an associated Bayesian network; it will take at most as many passes as there are relations in the database.
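Algorithm 1 below formalises this construction loop. As a complement, the following is a rough Python sketch of the same procedure, in which build_bn stands in for the BuildBN routine of [12], join_root stands in for the left join that pulls an already-built child network's root attribute into the parent relation, and the data structures are assumptions made for the example rather than the paper's actual implementation.

def build_linked_bn(relations, relationships, build_bn, join_root):
    """Build one Bayesian network per relation, in dependency order.

    `relations` maps a relation name to its table, `relationships` maps a
    relation to the set of child relations referenced by its foreign keys,
    `build_bn` is a placeholder for BuildBN and `join_root` is a placeholder
    for the left join that adds a child network's root attribute to the
    parent table.  The primary/foreign key graph is assumed to be acyclic.
    """
    lbn, built = {}, set()
    while len(lbn) < len(relations):
        for name in set(relations) - built:
            children = relationships.get(name, set())
            if children <= built:            # every referenced network exists
                table = relations[name]
                for child in children:       # pull each child's root attribute in
                    table = join_root(table, lbn[child])
                lbn[name] = build_bn(table)
                built.add(name)
    return lbn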
Algorithm 1 Linked Bayesian networks construction

function BUILDLINKEDBN(relations, relationships)
    lbn ← {}
    built ← {}                                ▷ Records which relations have been processed
    while |lbn| < |relations| do
        queue ← relations \ built             ▷ Relations which don't have a BN yet
        for each relation ∈ queue do
            if relationships[relation] \ built = ∅ then
                for each child ∈ relationships[relation] do
                    relation ← relation ⋈ child.root
                end for
                lbn ← lbn ∪ BuildBN(relation)
                built ← built ∪ {relation}
            end if
        end for
    end while
    return lbn
end function

Note that we can potentially use parallelism to speed up the execution of algorithm 1. Indeed, by using a priority queue and a worker pool, we can spawn processes in parallel to build the networks in the correct order. However, we consider this an implementation detail and did not take the time to implement it in our benchmark. Furthermore, this would have skewed our comparison with other methods. A linked Bayesian network doesn't require much additional space with respect to the method from [12]. Indeed, a linked Bayesian network is nothing more than a set of separate Bayesian
networks where some of the attributes are duplicated in two related networks. Once a linked Bayesian network has been built, it can be used to produce selectivity estimates. That is, given a linked Bayesian network, we want to be able to estimate the selectivity of an arbitrary SQL query. An efficient algorithm is required to perform so-called inference when many attributes are involved, which is the topic of the following subsection.

The algorithm for producing selectivity estimates using linked Bayesian networks is based on the selectivity estimation algorithm proposed in [12]. The key insight is that we can fuse linked Bayesian networks into a single Bayesian network. Indeed, in our building process we have to make sure to include the root attribute of each relation's Bayesian network into its parent's Bayesian network. This allows us to link each pair of adjacent Bayesian networks together via their shared attribute. In figure 2, these implicit links are represented with dotted lines. The purchases and customers relations have in common the nationality attribute, whereas the shops and purchases relations have in common the name attribute. The resulting "stitched" network is necessarily a tree because each individual Bayesian network is a tree and each shared attribute is located at the root of each child network.
Algorithm 2 Selectivity estimation using a linked Bayesian network

function INFERSELECTIVITY(lbn, query)
    relations ← ExtractRelations(query)
    relevant ← PruneLinkedBN(lbn, relations)
    linked ← LinkNetworks(relevant)
    selectivity ← ApplySumProduct(linked)
    return selectivity
end function

The pseudocode for producing selectivity estimates is given in algorithm 2. The first step of the selectivity estimation algorithm is to identify which relations are involved in a given query. Indeed, each SQL query will usually involve a subset of relations, and thus we only need to use the Bayesian networks that pertain to said subset. The
PruneLinkedBN function thus takes care of removing the unnecessary Bayesian networks from the entire set of Bayesian networks. Naturally, in practice, and depending on implementation details, this may simply involve loading the necessary Bayesian networks in memory. In any case, the next step is to connect the networks into a single one. This necessitates looping over the Bayesian networks in topological order – in the same exact fashion as algorithm 1 – and linking them along the way. Linking two Bayesian networks together simply involves replacing the attribute they have in common with the child Bayesian network. For instance, in figure 2, the nationality attribute from the purchases Bayesian network will be replaced by the customers Bayesian network. This is because we are interested in the distribution of the attributes after the join, not before. The resulting tree thus approximates the distribution of attribute values inside the (customers ⋈ purchases ⋈ shops) join, instead of estimating selectivities inside each relation independently, as is done in textbook cost models. The result of this linking process is exemplified in figure 3, which shows the unrolled version of the linked Bayesian network shown in figure 2. Finally, once the Bayesian networks have been linked together, the sum-product algorithm [40] can be used to output the desired selectivity. In fact, this final step is exactly the same as the one described in section 3.3 of [12].

Our method for estimating selectivities is very efficient. The main reason is that we only need to apply the sum-product algorithm once, whereas [12] has to apply it once per relation involved in the query at hand. This difference is made clear when comparing equations 2 and 4. Furthermore, the sum-product algorithm is much more efficient in the case of trees than the clique tree algorithm from [16]. We confirm these insights in the benchmarks section.

Figure 3: Unrolled version of figure 2
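The linking step itself is mostly structural. The sketch below stitches tree-shaped networks, each represented as a {node: parent} dictionary, by hanging a child network's nodes off the copy of its root attribute that the parent network already holds. The concrete shapes (in particular, treating day_of_week as the root of the purchases network) are assumptions made for the example, and the conditional distributions attached to each node are left out.

def link_networks(parent_tree, child_tree, shared_attr):
    """Stitch a child Bayesian network into its parent network.

    Both networks are {node: parent_node} dictionaries describing trees (the
    root has no entry).  `shared_attr` is the child network's root, which was
    duplicated into the parent network when the linked Bayesian network was
    built.  Only the graph structure is merged here; in the full method the
    conditional distributions are carried along as well.
    """
    merged = dict(parent_tree)
    for node, par in child_tree.items():
        if node != shared_attr:   # the shared root keeps its place in the parent
            merged[node] = par
    return merged

# Unrolling figure 2: the purchases network already holds `nationality` and
# `name`, so stitching in customers and shops yields the tree of figure 3.
purchases = {"nationality": "day_of_week", "name": "day_of_week"}
customers = {"hair": "nationality", "salary": "nationality"}  # root: nationality
shops = {"city": "name", "size": "name"}                      # root: name

tree = link_networks(purchases, customers, "nationality")
tree = link_networks(tree, shops, "name")
print(tree)
# {'nationality': 'day_of_week', 'name': 'day_of_week', 'hair': 'nationality',
#  'salary': 'nationality', 'city': 'name', 'size': 'name'}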
Our model assumes that the dependencies between attribute values within a relation are preserved when a join occurs. Indeed, we assume that tuples are uniformly distributed inside a join given each value of the root attribute. One may wonder why we have to stop at the root. Indeed, it turns out that we can include more attributes in addition to the root of each child Bayesian network when building a parent Bayesian network. For example, consider the linked Bayesian network shown in figure 4. In this configuration we include the salary attribute as well as the nationality attribute in the Bayesian network of the purchases relation. By doing so we obtain a new conditional distribution P(salary | nationality) which tells us the dependence between salary and nationality after customers has been joined with purchases.

Figure 4: Linked Bayesian network of customers and purchases

The linked Bayesian network shown in figure 4 is valid because we can unroll it in order to obtain a single tree, just as we did earlier on when we only included the nationality attribute. However, the salary attribute can be included in the purchases Bayesian network only because the nationality attribute is included as well. Indeed, if the nationality attribute was not included, then linking customers and purchases together would have resulted in a Bayesian network which would not necessarily be a tree. In this case, we would not be able to compute P(salary | nationality) in purchases' Bayesian network. In other words, a node can be included in a parent Bayesian network only if all of its conditioning attributes are included as well. Assuming a child Bayesian network has n nodes, we can include a number k ∈ {0, ..., n} of its nodes in the parent Bayesian network. If k = 0, then we simply keep each Bayesian network separate, which brings us back to the methodology from [12]. If k = 1, then we only include the root of each child Bayesian network, which is the case we have discussed up to now. If k = n, then we include all the child's attributes in the parent Bayesian network, which is somewhat similar to the global methods presented in [15] and [16]. On the one hand, increasing k will produce larger parent Bayesian networks that capture more attribute value dependencies but also incur a higher computational cost. On the other hand, lower values of k will necessitate less computation but will rely more strongly on the assumption that dependencies are preserved through joins. The k parameter is thus a practical means of compromising between selectivity estimation accuracy and computational requirements. Notice that different values of k can be used for each pair of relations. For instance, we might want to increase k if we notice that the cost model makes very bad estimates for a certain relation. This can be decided upon as deemed fit, be it manually or via an automated DBA [41].

The method we propose attempts to generalise existing selectivity estimation methods based on Bayesian networks. Following the methodology from [12], we build one Bayesian network per relation using Chow-Liu trees. The only difference is that we include a set of attributes from the child relations into the Bayesian network associated with each parent relation. The set of included attributes depends on a chosen parameter k and the structure of each child relation's Bayesian network.
Many distributions can be obtained for free because of the fact that each Bayesian network is a tree in which the root attribute conditions the rest of the attributes. This requires assuming that attribute value dependencies are preserved through joins. This assumption, although not always necessarily true, is much softer than the join uniformity as well as the attribute value independence assumptions. The resulting Bayesian networks are thus able to capture attribute value dependencies across relations, as well as inside individual relations. Although our method requires performing joins offline, we argue that joins are unavoidable if one is to capture any cross-relation dependency whatsoever. The major benefit of our method is that it only requires including a single attribute per join, and yet it brings a great deal of information for free through transitivity thanks to our newly introduced assumption. Moreover, our method can still benefit from the efficient selectivity estimation procedure presented in [12] because of the preserved tree structure. Finally, our method is able to generalise existing methods based on Bayesian networks
EPTEMBER
22, 2020through a single parameter which determines the amount of dependency to measure between the attributes of relationsthat share a primary/foreign key relationship.
We evaluate our proposal on an extensive workload derived from the JOB benchmark [2]. The JOB benchmark consistsof 113 SQL queries, along with an accompanying dataset extracted from the IMDb website. The dataset consistsof non-synthetic data, whereas other benchmarks such as TPC-DS [17] are based on synthetic data. The dataset ischallenging because it contains skewed distributions and exhibits many correlations between attributes, both across andinside relations. The JOB benchmark is now an established and reliable standard for evaluating and comparing costmodels. The dataset and the queries are publicly available . In addition, we have made a Docker image available foreasing future endeavours in the field , as well as code used in our experiments .During the query optimisation phase, the cost model has to estimate the selectivity of each query execution plan (QEP)enumerated by the query optimiser. Query optimisers usually build QEPs in a bottom-up fashion [42]. Initially, thecost model will have to estimate selectivities for simple QEPs that involve a single relation. It will then be asked toestimate selectivities for larger QEPs involving multiple joins and predicates. We decided to mimic this situation byenumerating all the possible sub-queries for each of the JOB benchmark’s queries, as detailed in [43]. For example, if aquery pertains to 4 relations, we will enumerate all the possible sub-queries involving 1, 2, 3, and all 4 relations. Wealso enumerate through all the combinations of filter conditions. To do so, we represented each query as a graph witheach node being an attribute and each edge a join. We then simply had to retrieve all the so-called induced subgraphs ,which are all the subgraphs that can be made up from a given graph. Each induced subgraph was then converted back toa valid SQL statement. This procedure only takes a few minutes and yields a fairly large amount of queries; indeeda total of 5,122,790 subqueries can be generated for the JOB benchmark’s 113 queries. Tables 3 and 4 provide anoverview of the contents of our workload.Table 3: Query spread per number of join conditionsJoins Amount0 8891-5 177,3096-10 1,175,12011-15 2,060,61416-20 1,320,68121-25 388,177 Table 4: Query spread per number of filter conditionsFilters Amount1 261,4402 763,3923 1,301,8404 1,380,3295 923,4816 384,2857 94,8558 12,4969 672The general goal of our experiments is to detail the pros and cons of our method with respect to the textbook approachfrom [6] and some state-of-the-art methods that we were able to implement. Most industrial databases still resortto using textbook approaches, which are thus important to be compared with. Specifically our experiments solelyfocus on the selectivity estimation module, not on the final query execution time. We assume that improving theselectivity estimates will necessarily have a beneficial impact on the accuracy of the cost model and thus on the queryexecution time. Naturally, the estimation has to remain reasonable. This seems to be a view shared by many in thequery optimisation community [2]. Indeed, many papers that deal with selectivity estimation, both established and new,do not measure the impact on the final query execution [44, 45, 22, 16, 25, 46].We compared our proposal with a few promising state-of-the-art methods as well as the cardinality estimation modulefrom the PostgreSQL database system. PostgreSQL’s cardinality estimation module is a fair baseline as it is a textbookimplementation of the decades old ideas from [6]. 
We used version 10.5 of PostgreSQL and did not tinker with the default settings. Additionally, we did not build any indexes, as these have no consequence on the selectivity estimation module. A viable selectivity estimation method should be at least as accurate as PostgreSQL, without introducing too much of a computational overhead. We implemented basic random sampling [47], which consists in executing a given query on a sample of each relation in order to extrapolate a selectivity estimate. Basic random sampling is simple to implement, but isn't suited for queries that involve joins because of the empty-join problem,
as explained in section 2. However, many sampling methods that take the empty-join problem into account have been proposed. We implemented one such method, namely correlated sampling [25]. Correlated sampling works by hashing related primary and foreign keys and discarding the tuples of linked relations where the hashes disagree. We also implemented MSCN, the deep learning method presented in [34]. Finally, we implemented the Bayesian network approach from [16]. The latter method differs from ours in that it is a global approach which builds a single Bayesian network over the entire set of relations. Although a global approach is able to capture more correlations than ours, it requires more computation. We compared our method with different values for the k parameter presented in section 3.6. Note that choosing k = 0 is equivalent to using the method from [12]. Increasing k is expected to improve the accuracy of the selectivity estimates but deteriorates the computational performance. The k parameter can thus be used to trade between accuracy and computational resources, depending on the use case and the constraints of the environment.

We first measured the accuracy of the selectivity estimates for each method by comparing their estimates with the true selectivity. The true selectivity can be obtained by executing the query and counting the number of tuples in the result. The appropriate metric for such a comparison is called the q-error [48, 18], and is defined as follows:

q(y, \hat{y}) = \frac{\max(y, \hat{y})}{\min(y, \hat{y})}    (17)

where y is the true selectivity and \hat{y} is the estimated selectivity. The q-error thus simply measures the multiplicative error between the estimate and the truth. The q-error has the property of being symmetric, and will thus be the same whether \hat{y} is an underestimation or an overestimation. Moreover, the q-error is scale agnostic (e.g., q(10, 100) = q(100, 1000) = 10), which helps in comparing errors over results with different scales.

[Figure 5: Sorted q-errors for all queries by method on the JOB workload]

Figure 5 shows the q-errors made by each method for all the queries of the workload derived from the JOB benchmark. The y axis represents the q-error associated with each query. Meanwhile, the x axis denotes the amount of queries that have less than a given q-error. For instance, PostgreSQL managed to estimate the selectivity of two million queries with a q-error of less than 10 for each query. The curves thus give us a detailed view into the distribution of the q-errors for each method. While the curves seem to exhibit a linear trend, one must note that the scale of the y axis is logarithmic. The figure gives us a global idea of the accuracy of each method in comparison with the others. The mean, maximum, and meaningful quantiles of the q-errors are given in table 5.

The overall worst method is the cost model used by PostgreSQL. This isn't a surprise, as it assumes total independence between attributes, both within and between relations. It is interesting to notice that the q-errors made by PostgreSQL's cost model can be extremely high, sometimes even reaching the tens of thousands. In this case, the query optimiser is nothing short of blind because the selectivity estimates are extremely unreliable.
Although this doesn’t necessarilymean that the query optimiser will not be able to find a good query execution plan, it does imply that finding a good14 PREPRINT - S
Table 5: q-error statistics for each method on the JOB workload

                      median   90th     95th     99th     max       average
PostgreSQL            7.32     77.01    185.84   707.21   10906.17  77.01
Sampling              4.79     16.45    33.17    81.34    1018.43   12.71
Correlated sampling   3.83     9.63     12.63    22.72    214.1     5.79
MSCN                  2.99     6.12     7.47     12.49    110.56    3.89
Global BN             1.95     2.92     3.22     4.01     7.45      1.99
Independent BN        4.0      15.36    32.9     76.91    820.46    11.82
Linked BN k = 1
Linked BN k = 2

Setting k equal to 0 is equivalent to the methodology proposed by [12]. Indeed, if no attributes are shared by the Bayesian networks of each relation, then it is as if we considered attribute value dependencies within each relation but not between relations. As expected, the performance is similar to that of random sampling, because both methods capture dependencies within a relation but not between relations. Correlated sampling performs a bit better because it is a join-aware sampling method. However, the rest of the implemented methods are more precise by an order of magnitude. The deep learning method, MSCN, outperforms correlated sampling, but it isn't as accurate as the Bayesian networks. It can probably reach a better level of performance by tuning some of the many parameters that it exposes. Meanwhile, using our method with k = 1 means that we include the root attribute of each child relation within the Bayesian network of each parent relation. This brings the benefits detailed in section 3. If k = 2, then an additional attribute from each child relation is included in the Bayesian network of each parent relation. We can see in figure 5 that the global accuracy increases with k, which is what one would expect. The most accurate method overall is the global Bayesian network presented in [16]. However, our method with k = 2 is not far off. This makes the case that our attribute value dependency preservation assumption is a realistic one.

[Figure 6: Sorted q-errors for all queries by method on the TPC-DS workload]

We have also benchmarked the methods on the TPC-DS benchmark. In contrast to the IMDb dataset used in the JOB benchmark, the TPC-DS dataset is synthetic. By nature, it contains fewer attribute dependencies than would be expected in a realistic use case. The TPC-DS dataset is therefore less realistic than the JOB benchmark. To produce a workload as we did for the JOB benchmark, we took the first 30 queries provided with the TPC-DS dataset and generated all possible sub-queries. This led to a total of 1,414,593 queries. The number of joins ranged from 2 to 15. The overall results are shown in table 6.

Table 6: q-error statistics for each method on the TPC-DS workload

                      median   90th     95th     99th     max       average
PostgreSQL            1.23     43.4     138.05   1025.49  82898.28  91.46
Sampling              3.69     13.39    26.34    66.83    669.58    9.87
Correlated sampling   2.89     8.24     10.63    19.32    170.63    4.51
MSCN                  1.82     4.23     5.32     9.24     78.01     2.54
Global BN             1.03     1.17     1.23     1.33     1.49      1.06
Independent BN        2.5      13.59    33.87    89.22    597.0     9.12
Linked BN k = 1
Linked BN k = 2

As expected, the q-errors for the TPC-DS benchmark are better across the board because the dataset exhibits fewer correlations between attributes. Nonetheless, the ranking of the methods remains largely the same. Our method very slightly outperforms the global Bayesian network, but we believe that this is just an implementation artifact. In any case, our method is much more accurate than any method that assumes independence between attributes of different relations. Even so, a viable selectivity estimation method also has to be able to produce estimates in a very short amount of time, which is a point we will now discuss.
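As an aside, the q-error used throughout the comparison above boils down to a one-line computation. The following helper is a minimal sketch, with a function name of our own choosing:

```python
def q_error(y: float, y_hat: float) -> float:
    """Multiplicative error between the true selectivity y and the estimate y_hat (equation 17)."""
    if min(y, y_hat) <= 0:
        raise ValueError("selectivities must be strictly positive")
    return max(y, y_hat) / min(y, y_hat)

# The q-error is symmetric and scale agnostic:
assert q_error(10, 100) == q_error(100, 10) == 10
assert q_error(10, 100) == q_error(1_000, 10_000)
```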
Naturally, we next sought to measure how fast each method was at producing selectivity estimates. In a high-throughput environment, the query optimiser isn't allowed to spend much time searching for an efficient QEP. In addition to using the cost model, the query optimiser also has to enumerate potential query execution plans and pick one of them [42]. Thus, only a fraction of the short amount of time allocated to the query optimiser can actually be consumed by the cost model. This means that any viable selectivity estimation method has to be extremely efficient, which is probably the main reason why current cost models are kept simple. We call the amount of time necessary to produce a selectivity estimate the inference time. During our experiments we recorded the inference time for each query and for each model. Table 7 shows the average inference time for each method, aggregated by the number of joins present in each query.

Table 7: Average inference time in milliseconds for each method with respect to the number of joins on the JOB workload

                      No joins    1 join    2 to 5 joins    6 joins or more
PostgreSQL
Sampling
Correlated sampling
MSCN
Global BN
Independent BN
Linked BN k = 1
Linked BN k = 2

It is important to mention that the inference time measured for PostgreSQL is simply the time it takes the database to execute the
ANALYZE statement for each query. This thus includes the optimisation time, on top of the time spent estimating selectivities. Even though they are already by far the best, the numbers displayed in our benchmark for PostgreSQL are pessimistic and are expected to be much lower in practice. It is also worth mentioning that we implemented the rest of the methods in Python, which is an interpreted language and thus slower than compiled languages such as C, in which PostgreSQL is written. If these methods were implemented in optimised C they would naturally be much faster. However, what matters here is the relative differences between each method, not the absolute ones.

We can clearly see from the results in table 7 that the global Bayesian network loses in speed what it gains in accuracy. This is because it uses a complex inference method called the clique-tree algorithm, which is the standard approach for Bayesian networks with arbitrary topologies. Although it is the most accurate method, it is much slower than our method, regardless of the k parameter we use. What's more, the inference time of our method doesn't increase dramatically when the number of joins increases. This is due to the fact that we use a lightweight inference algorithm called variable elimination [38], also used by [12]. The inference algorithm scales well because we are able to merge the Bayesian networks of each relation into a single tree. We can also see that correlated sampling is a relatively slow method, although its accuracy is competitive, as shown in the previous subsection. MSCN is the slowest method overall in our benchmark. This may be attributed to the fact that we implemented it from scratch, because no implementation was provided by its authors, and we therefore do not have the insights that they might have. We argue that even though our method is not as accurate as the method proposed by [16], it is much faster and is thus more likely to be used in practice. Naturally, we also have to take into account the amount of time it takes to build our method, as well as how much storage space it requires.
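To illustrate why variable elimination is cheap on tree-shaped networks, here is a minimal sketch (not the implementation used in our experiments) that marginalises a three-attribute chain A → B → C represented with plain Python dictionaries; the attribute names and probabilities are made up:

```python
# Toy chain A -> B -> C, i.e. P(A, B, C) = P(A) * P(B | A) * P(C | B).
# All values are illustrative placeholders.
P_A = {"a0": 0.6, "a1": 0.4}
P_B_given_A = {"a0": {"b0": 0.7, "b1": 0.3}, "a1": {"b0": 0.2, "b1": 0.8}}
P_C_given_B = {"b0": {"c0": 0.9, "c1": 0.1}, "b1": {"c0": 0.5, "c1": 0.5}}

def marginal_c() -> dict:
    """Compute P(C) by eliminating A first, then B."""
    # Eliminate A: P(b) = sum_a P(a) * P(b | a)
    p_b = {b: sum(P_A[a] * P_B_given_A[a][b] for a in P_A) for b in ("b0", "b1")}
    # Eliminate B: P(c) = sum_b P(b) * P(c | b)
    return {c: sum(p_b[b] * P_C_given_B[b][c] for b in p_b) for c in ("c0", "c1")}

print(marginal_c())  # {'c0': 0.7, 'c1': 0.3} with the placeholder values above
```

Each elimination step only touches one local table, which is why the cost grows with the number of attributes in the tree rather than with the size of the full joint distribution.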
The cost model uses metadata that is typically gathered when the database isn't being used, so that it doesn't have to be computed in real time during the query optimisation phase. This metadata has to be refreshed every so often in order for the cost model to use relevant figures. Typically, the metadata has to be refreshed when the underlying data distributions change significantly. For instance, if attributes become correlated when new data is inserted, then the metadata has to be refreshed to take this into account. Therefore, the amount of time it takes to collect the necessary information matters, as ideally we would like to refresh the metadata as often as possible. Additionally, any viable selectivity estimation method crucially has to make do with a small amount of storage space. Indeed, spatial complexity is a major reason why most methods proposed in the literature are not used in practice. These two computational requirements highlight the dilemma that cost models have to face: they have to be accurate whilst running with a very low footprint. Most multidimensional methods that have been proposed are simply unusable in this regard.

Table 8: Computational requirements of the construction phase per method on the JOB workload

                      Construction time        Storage size
PostgreSQL            5 seconds                12KB
Sampling              7 seconds                276MB
Correlated sampling   32 seconds               293MB
MSCN                  15 minutes 8 seconds     37MB
Global BN             24 minutes 45 seconds    429KB
Independent BN        55 seconds               217KB
Linked BN k = 1
Linked BN k = 2

The jump in construction time between k = 0 and k = 1 is due to the need to compute joins when k > 0. However, note that the jump is much smaller between k = 1 and k = 2. The reason is that the joins don't have to be repeated for each additional attribute included in every parent Bayesian network.

During the query optimisation phase, a cost model is invoked by the query optimiser to estimate the cost of query execution plans. In this context, the selectivity of operators is a crucial input to the cost model [2]. Inaccurate selectivity estimates lead to bad cost estimates which in turn have a negative impact on the overall running time of a query. Moreover, errors in selectivity estimation grow exponentially throughout a query execution plan [3]. Selectivity estimation is still an open research problem, even though many proposals have been made. This is down to the fact that the requirements in terms of computational resources are extremely tight, and one thus has to compromise between accuracy and efficiency.

Our method is based on Bayesian networks, which are a promising way to solve the aforementioned compromise. Although the use of Bayesian networks for selectivity estimation isn't new, previous propositions entail a prohibitive building cost and inference time. In order to address these issues, we extend the work of [12] to include the measurement of dependencies between attributes of different relations. We show how we can soften the relational independence assumption without requiring an inordinate amount of computational resources. We validate our method by comparing
it with other methods on an extensive workload derived from the JOB [2] and TPC-DS [17] benchmarks. Our results show that our method is only slightly less accurate than the global Bayesian network from [16], whilst being an order of magnitude less costly to build and execute. Additionally, our method is more accurate than join-aware sampling, whilst requiring significantly fewer storage and computational resources. In comparison with other methods which make more simplifying assumptions, our method is notably more accurate, whilst offering very reasonable guarantees in terms of computational time and space. In future work, we wish to extend our method to accommodate specific operators such as
GROUP BYs, as well as to verify the benefits of our method in terms of overall query response time as perceived by a query issuer.
References

[1] Yannis E Ioannidis. Query optimization. ACM Computing Surveys (CSUR), 28(1):121–123, 1996.
[2] Viktor Leis, Andrey Gubichev, Atanas Mirchev, Peter Boncz, Alfons Kemper, and Thomas Neumann. How good are query optimizers, really? Proceedings of the VLDB Endowment, 9(3):204–215, 2015.
[3] Yannis E Ioannidis and Stavros Christodoulakis. On the propagation of errors in the size of join results, volume 20. ACM, 1991.
[4] Shaoyi Yin, Abdelkader Hameurlain, and Franck Morvan. SLA definition for multi-tenant DBMS and its impact on query optimization. IEEE Transactions on Knowledge and Data Engineering, 2018.
[5] Wentao Wu, Yun Chi, Shenghuo Zhu, Junichi Tatemura, Hakan Hacigümüs, and Jeffrey F Naughton. Predicting query execution time: Are optimizer cost models really unusable? In IEEE International Conference on Data Engineering (ICDE), pages 1081–1092. IEEE, 2013.
[6] P Griffiths Selinger, Morton M Astrahan, Donald D Chamberlin, Raymond A Lorie, and Thomas G Price. Access path selection in a relational database management system. In Proceedings of the 1979 ACM SIGMOD International Conference on Management of Data, pages 23–34. ACM, 1979.
[7] Yannis Ioannidis. The history of histograms (abridged). In Proceedings 2003 VLDB Conference, pages 19–30. Elsevier, 2003.
[8] M Muralikrishna and David J DeWitt. Equi-depth multidimensional histograms. In ACM SIGMOD Record, volume 17, pages 28–36. ACM, 1988.
[9] Yu Chen and Ke Yi. Two-level sampling for join size estimation. In Proceedings of the 2017 ACM International Conference on Management of Data, pages 759–774. ACM, 2017.
[10] Surajit Chaudhuri, Rajeev Motwani, and Vivek Narasayya. On random sampling over joins. In ACM SIGMOD Record, volume 28, pages 263–274. ACM, 1999.
[11] Michael Stillger, Guy M Lohman, Volker Markl, and Mokhtar Kandil. LEO - DB2's learning optimizer. In VLDB, volume 1, pages 19–28, 2001.
[12] Max Halford, Philippe Saint-Pierre, and Franck Morvan. An approach based on Bayesian networks for query selectivity estimation. In International Conference on Database Systems for Advanced Applications, pages 3–19. Springer, 2019.
[13] Finn V Jensen et al. An introduction to Bayesian networks, volume 210. UCL Press London, 1996.
[14] Gregory F Cooper. The computational complexity of probabilistic inference using Bayesian belief networks. Artificial Intelligence, 42(2-3):393–405, 1990.
[15] Lise Getoor, Benjamin Taskar, and Daphne Koller. Selectivity estimation using probabilistic models. In ACM SIGMOD Record, volume 30, pages 461–472. ACM, 2001.
[16] Kostas Tzoumas, Amol Deshpande, and Christian S Jensen. Lightweight graphical models for selectivity estimation without independence assumptions. Proceedings of the VLDB Endowment, 4(11):852–863, 2011.
[17] Meikel Poess, Bryan Smith, Lubor Kollar, and Paul Larson. TPC-DS, taking decision support benchmarking to the next level. In Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, pages 582–587, 2002.
[18] Viktor Leis, Bernhard Radke, Andrey Gubichev, Atanas Mirchev, Peter Boncz, Alfons Kemper, and Thomas Neumann. Query optimization through the looking glass, and what we found running the join order benchmark. The VLDB Journal, pages 1–26, 2018.
[19] Robert Philip Kooi. The optimization of queries in relational databases. 1981.
[20] Björn Blohsfeld, Dieter Korus, and Bernhard Seeger. A comparison of selectivity estimators for range queries on metric attributes. In ACM SIGMOD Record, volume 28, pages 239–250. ACM, 1999.
[21] Yossi Matias, Jeffrey Scott Vitter, and Min Wang. Wavelet-based histograms for selectivity estimation. In ACM SIGMOD Record, volume 27, pages 448–459. ACM, 1998.
[22] Viswanath Poosala and Yannis E Ioannidis. Selectivity estimation without the attribute value independence assumption. In VLDB, volume 97, pages 486–495, 1997.
[23] Max Heimel, Martin Kiefer, and Volker Markl. Self-tuning, GPU-accelerated kernel density models for multidimensional selectivity estimation. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pages 1477–1492. ACM, 2015.
[24] Nicolas Bruno, Surajit Chaudhuri, and Luis Gravano. STHoles: a multidimensional workload-aware histogram. In ACM SIGMOD Record, volume 30, pages 211–222. ACM, 2001.
[25] David Vengerov, Andre Cavalheiro Menck, Mohamed Zait, and Sunil P Chakkappen. Join size estimation subject to filter conditions. Proceedings of the VLDB Endowment, 8(12):1530–1541, 2015.
[26] Viktor Leis, Bernhard Radke, Andrey Gubichev, Alfons Kemper, and Thomas Neumann. Cardinality estimation done right: Index-based join sampling. In CIDR, 2017.
[27] Swarup Acharya, Phillip B Gibbons, Viswanath Poosala, and Sridhar Ramaswamy. Join synopses for approximate query answering. In ACM SIGMOD Record, volume 28, pages 275–286. ACM, 1999.
[28] Volker Markl, Peter J Haas, Marcel Kutsch, Nimrod Megiddo, Utkarsh Srivastava, and Tam Minh Tran. Consistent selectivity estimation via maximum entropy. The VLDB Journal, 16(1):55–76, 2007.
[29] Magnus Müller, Guido Moerkotte, and Oliver Kolb. Improved selectivity estimation by combining knowledge from sampling and synopses. Proceedings of the VLDB Endowment, 11(9):1016–1028, 2018.
[30] Mert Akdere, Ugur Çetintemel, Matteo Riondato, Eli Upfal, and Stanley B Zdonik. Learning-based query performance modeling and prediction. In Data Engineering (ICDE), 2012 IEEE 28th International Conference on, pages 390–401. IEEE, 2012.
[31] Henry Liu, Mingbin Xu, Ziting Yu, Vincent Corvinelli, and Calisto Zuzarte. Cardinality estimation using neural networks. In Proceedings of the 25th Annual International Conference on Computer Science and Software Engineering, pages 53–59. IBM Corp., 2015.
[32] Oleg Ivanov and Sergey Bartunov. Adaptive cardinality estimation. arXiv preprint arXiv:1711.08330, 2017.
[33] Andreas Kipf, Thomas Kipf, Bernhard Radke, Viktor Leis, Peter Boncz, and Alfons Kemper. Learned cardinalities: Estimating correlated joins with deep learning. arXiv preprint arXiv:1809.00677, 2018.
[34] Andreas Kipf, Dimitri Vorona, Jonas Müller, Thomas Kipf, Bernhard Radke, Viktor Leis, Peter Boncz, Thomas Neumann, and Alfons Kemper. Estimating cardinalities with deep sketches. In Proceedings of the 2019 International Conference on Management of Data, pages 1937–1940, 2019.
[35] Amol Deshpande, Minos Garofalakis, and Rajeev Rastogi. Independence is good: Dependency-based histogram synopses for high-dimensional data. ACM SIGMOD Record, 30(2):199–210, 2001.
[36] Mark Bartlett and James Cussens. Integer linear programming for the Bayesian network structure learning problem. Artificial Intelligence, 244:258–271, 2017.
[37] C Chow and Cong Liu. Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, 14(3):462–467, 1968.
[38] Robert G Cowell, Philip Dawid, Steffen L Lauritzen, and David J Spiegelhalter. Probabilistic networks and expert systems: Exact computational methods for Bayesian networks. Springer Science & Business Media, 2006.
[39] Frank K Hwang, Dana S Richards, and Pawel Winter. The Steiner tree problem, volume 53. Elsevier, 1992.
[40] Frank R Kschischang, Brendan J Frey, Hans-Andrea Loeliger, et al. Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47(2):498–519, 2001.
[41] Dana Van Aken, Andrew Pavlo, Geoffrey J Gordon, and Bohan Zhang. Automatic database management system tuning through large-scale machine learning. In Proceedings of the 2017 ACM International Conference on Management of Data, pages 1009–1024. ACM, 2017.
[42] Surajit Chaudhuri. An overview of query optimization in relational systems. In Proceedings of the Seventeenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pages 34–43. ACM, 1998.
[43] Surajit Chaudhuri, Vivek Narasayya, and Ravi Ramamurthy. Exact cardinality query optimization for optimizer testing. Proceedings of the VLDB Endowment, 2(1):994–1005, 2009.
[44] Chungmin Melvin Chen and Nick Roussopoulos. Adaptive selectivity estimation using query feedback, volume 23. ACM, 1994.
[45] Viswanath Poosala, Peter J Haas, Yannis E Ioannidis, and Eugene J Shekita. Improved histograms for selectivity estimation of range predicates. In ACM SIGMOD Record, volume 25, pages 294–305. ACM, 1996.
[46] Anshuman Dutt, Chi Wang, Azade Nazi, Srikanth Kandula, Vivek Narasayya, and Surajit Chaudhuri. Selectivity estimation for range predicates using lightweight models. Proceedings of the VLDB Endowment, 12(9):1044–1057, 2019.
[47] Frank Olken and Doron Rotem. Simple random sampling from relational databases. 1986.
[48] Guido Moerkotte, Thomas Neumann, and Gabriele Steidl. Preventing bad plans by bounding the impact of cardinality estimation errors. Proceedings of the VLDB Endowment, 2(1):982–993, 2009.
[49] Joseph B Kruskal. On the shortest spanning subtree of a graph and the traveling salesman problem. Proceedings of the American Mathematical Society, 7(1):48–50, 1956.
The following is an unabbreviated version of our preliminary subsection. It contains additional examples that help to get a better understanding, but that were considered too lengthy to be part of the main article.

In [12], we developed a methodology for constructing Bayesian networks to model the distribution of attribute values inside each relation of a database. Once the Bayesian networks are constructed, we used them to produce selectivity estimates by converting the logical operator tree at hand into a probabilistic formula of sums and products. In what follows, we will give an overview of Bayesian networks. We will also explain the compromises we made in order to produce a method that is both reasonably accurate and efficient.

A Bayesian network is a probabilistic model. As such, it is used for approximating the probability distribution of a dataset. The particularity of a Bayesian network is that it uses a directed acyclic graph (DAG) in order to do so. The graph contains one node per variable, whilst each directed edge represents a conditional dependency between two variables. For instance, if nodes A and B are connected with an edge that points from A to B, then this stands for the conditional distribution P(B | A). A Bayesian network is a product of many such conditional dependencies, which formally is:

P(X_1, \dots, X_n) \simeq \prod_{X_i \in \mathcal{X}} P(X_i \mid Parents(X_i))    (18)

The term P(X_1, \dots, X_n) is the probability distribution over the entire set of attributes \{X_1, \dots, X_n\}. Meanwhile, Parents(X_i) stands for the attributes that condition the value of X_i. The distribution P(X_i \mid Parents(X_i)) is thus the conditional distribution of attribute X_i's values. In practice, the full distribution is inordinately large and is unknown to us. However, the total of the sizes of the conditional distributions P(X_i \mid Parents(X_i)) is much smaller. Indeed, for discrete attributes, each conditional distribution is a (p + 1)-way table, where p is the number of parents |Parents(X_i)|. If an attribute hair is conditioned by a single other attribute nationality, then that conditional relationship can be stored in a two-way table, as shown in table 9.

Table 9: Conditional distribution P(hair | nationality)

            Blond    Brown    Dark
American
Japanese
Swedish

If the nationality attribute is not conditioned by any other attribute, then we can represent it with a one-dimensional distribution, as shown in table 10.

Table 10: Distribution P(nationality)

American    Japanese    Swedish

Suppose that we want to estimate the fraction of tuples where hair equals "Blond" and nationality equals "Swedish". This can be obtained by applying Bayes' rule:

P(hair = Blond, nationality = Swedish) = P(Blond \mid Swedish) \times P(Swedish)    (19)
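To make this example concrete, here is a minimal Python sketch of the nationality → hair network; the probability values are invented for illustration and are not taken from the tables above:

```python
# Toy Bayesian network with a single edge: nationality -> hair.
# The probability values below are illustrative placeholders.
P_nationality = {"American": 0.5, "Japanese": 0.2, "Swedish": 0.3}
P_hair_given_nationality = {
    "American": {"Blond": 0.2, "Brown": 0.6, "Dark": 0.2},
    "Japanese": {"Blond": 0.05, "Brown": 0.15, "Dark": 0.8},
    "Swedish": {"Blond": 0.7, "Brown": 0.2, "Dark": 0.1},
}

def p_joint(hair: str, nationality: str) -> float:
    """Bayes' rule, as in equation 19: P(hair, nationality) = P(hair | nationality) * P(nationality)."""
    return P_hair_given_nationality[nationality][hair] * P_nationality[nationality]

def p_hair(hair: str) -> float:
    """Marginalise the nationality attribute away to recover P(hair), as discussed below."""
    return sum(p_joint(hair, n) for n in P_nationality)

print(p_joint("Blond", "Swedish"))  # fraction of blond Swedish tuples
print(p_hair("Blond"))              # fraction of blond tuples overall
```

Only the two small tables are stored, yet both joint and marginal queries can be answered from them, which is what makes the factorised representation compact.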
Note that we do not have direct access to the distribution of the hair attribute. Indeed, we only have the conditional distribution P(hair \mid nationality). However, we can obtain P(hair) by marginalising over the nationality attribute. For instance, if we want to obtain the fraction of tuples where hair equals "Blond":

P(hair = Blond) = \sum_{nationality} P(Blond, nationality) = \sum_{nationality} P(Blond \mid nationality) \times P(nationality)    (20)

which expands into the sum of P(Blond, American), P(Blond, Japanese), and P(Blond, Swedish). With a Bayesian network, we are thus able to answer any selectivity estimation problem by converting a logical query into a mathematical formula following standard rules of probability [13]. Note, however, that a Bayesian network is necessarily an approximation of the full probability distribution because it makes assumptions about the generating process of the data.

Finding the right graph structure of a Bayesian network is called structure learning [13]. This is usually done using a scoring function, which estimates the amount of information memorised by a network with a given structure. The time required to run an exhaustive search which maximises the scoring function is super-exponential in the number of variables [14]. Approximate search methods as well as integer programming solutions have been proposed [36], but they still require a large amount of time to run and have brittle performance guarantees. In our work in [12], we proposed to use the Chow-Liu algorithm [37]. This algorithm has the property of finding the best tree structure, where nodes are restricted to have at most one parent. The obtained tree is the best in the sense of maximum likelihood estimation. In other words, it is the tree that best memorises the given data. This is an important property, because for the purpose of selectivity estimation we are not interested in having a model that generalises well, but rather one that is good at memorising the data that it is shown. This is explained in further detail in section 4.1 of [15]. In addition to this property, the Chow-Liu algorithm only runs in O(p^2) time, where p is the number of variables, and is simple to implement. It works by first computing the mutual information between each pair of variables, which is defined as follows:

MI(X_i, X_j) = \sum_{x_i \in X_i} \sum_{x_j \in X_j} P(x_i, x_j) \times \log \frac{P(x_i, x_j)}{P(x_i) P(x_j)}    (21)

The mutual information can be seen as the strength of the relation between two variables, whether it be linear or not. The distribution P(X_i, X_j) contains the occurrence counts of each pair (x_i, x_j) in a relation R. It can be obtained with a SELECT COUNT(*) FROM R GROUP BY X_i, X_j statement in SQL. The distributions P(X_i) and P(X_j) can be obtained by marginalising over P(X_i, X_j) with respect to the other attribute. Once the mutual information for each pair of attributes is computed, the values are organised into a fully connected weighted graph, as shown in figure 7.

[Figure 7: Mutual information amounts for five attributes (nationality, country, city, eye colour, hair colour)]

The next step is to find the maximum spanning tree (MST) of the graph, which is the spanning tree whose sum of edge weights is maximal. A spanning tree is a subset of p − 1 edges that forms a tree.
Finding the maximum spanning tree can be done efficiently, for instance with Kruskal's algorithm [49].
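As a rough sketch of the whole procedure (pairwise mutual information followed by a maximum spanning tree), the following assumes pandas and networkx are available; the toy DataFrame and column names are ours, and this is not the code used in our experiments:

```python
import itertools
import numpy as np
import pandas as pd
import networkx as nx

# Toy relation; in practice P(X_i, X_j) would come from a GROUP BY query.
df = pd.DataFrame({
    "nationality": ["Swedish", "Swedish", "American", "Japanese", "American"],
    "hair": ["Blond", "Blond", "Brown", "Dark", "Dark"],
    "eyes": ["Blue", "Blue", "Brown", "Brown", "Brown"],
})

def mutual_information(df: pd.DataFrame, xi: str, xj: str) -> float:
    """Empirical mutual information between two discrete attributes (equation 21)."""
    joint = df.groupby([xi, xj]).size() / len(df)      # P(x_i, x_j)
    p_i = df[xi].value_counts(normalize=True)          # P(x_i)
    p_j = df[xj].value_counts(normalize=True)          # P(x_j)
    return sum(
        p * np.log(p / (p_i[a] * p_j[b]))
        for (a, b), p in joint.items()
    )

# Fully connected graph weighted by mutual information.
graph = nx.Graph()
for xi, xj in itertools.combinations(df.columns, 2):
    graph.add_edge(xi, xj, weight=mutual_information(df, xi, xj))

# The Chow-Liu structure is the maximum spanning tree of this graph.
tree = nx.maximum_spanning_tree(graph)
print(sorted(tree.edges(data="weight")))
```

In our setting, the joint counts would of course come from the SELECT COUNT(*) ... GROUP BY statement mentioned above rather than from an in-memory DataFrame.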