Consistent and Flexible Selectivity Estimation for High-Dimensional Data
Yaoshu Wang, Chuan Xiao, Jianbin Qin, Rui Mao, Makoto Onizuka, Wei Wang, Rui Zhang, and Yoshiharu Ishikawa
Shenzhen Institute of Computing Sciences, Osaka University, Nagoya University, University of New South Wales, University of Melbourne
{yaoshuw,jqin,mao}@sics.ac.cn, {chuanx,onizuka}@ist.osaka-u.ac.jp, [email protected], [email protected], [email protected]
ABSTRACT
Selectivity estimation aims at estimating the number of database objects that satisfy a selection criterion. Answering this problem accurately and efficiently is essential to many applications, such as density estimation, outlier detection, query optimization, and data integration. The estimation problem is especially challenging for large-scale high-dimensional data due to the curse of dimensionality, the large variance of selectivity across different queries, and the need to make the estimator consistent (i.e., the selectivity is non-decreasing in the threshold). We propose a new deep learning-based model that learns a query-dependent piecewise linear function as the selectivity estimator, which is flexible enough to fit the selectivity curve of any query object and threshold, while guaranteeing that the output is non-decreasing in the threshold. To improve the accuracy on large datasets, we propose to partition the dataset into multiple disjoint subsets and build a local model on each of them. We perform experiments on real datasets and show that the proposed model significantly outperforms state-of-the-art models in accuracy and is competitive in efficiency.
1 INTRODUCTION

In this paper, we consider the following selectivity estimation problem for high-dimensional data: given a query object x, a distance function d, and a distance threshold t, estimate the number of objects o in a database that satisfy d(x, o) ≤ t. It is an essential procedure in estimating local density [42] and outlierness [3], which are keys to density estimation in statistics and density-based outlier detection in data mining. It is also known as the cardinality estimation problem in the database area. Accurate estimation helps to find an optimal query execution plan in databases dealing with high-dimensional data [16]; e.g., hands-off entity matching systems [8] extract paths from random forests and take each path, a conjunction of similarity predicates over multiple attributes, as a blocking rule, and efficient blocking can be achieved if we find a good query execution plan [40]. In addition, many modern recommender systems resort to latent representations (embeddings) of users and/or items [24, 47]. To make a recommendation, a selection query is invoked to obtain a set of candidates to be further ranked by sophisticated models. Estimating the number of candidates helps to choose a proper ranker to improve the quality of recommendation.

Selectivity estimation for large-scale high-dimensional data is still an open problem due to the following factors: (1) Large variance of selectivity. The selectivity varies across queries and may differ by several orders of magnitude. A good estimator is supposed to predict accurately for both small and large selectivity values. (2)
Curse of dimensionality. Many methods that work well on low-dimensional data, such as histograms [18], are intractable when we seek an optimal solution, and they significantly lose accuracy as dimensionality increases. (3)
Consistency requirement. When the query object is fixed, the selectivity is non-decreasing in the threshold. Hence users may want the estimated selectivity to be non-decreasing and interpretable in applications such as density estimation. This requirement rules out many existing methods.

To address the above challenges, we propose a novel deep regression method that guarantees consistency. We holistically approximate the selectivity curve using a query-dependent piecewise linear function consisting of control points that are learned from training data. This function family is flexible in the sense that it can closely approximate the selectivity curve of any input query object, e.g., by using more control points for the part of the curve where the selectivity changes rapidly. Together with a robust loss function, we are able to alleviate the impact of large variance across different queries. To handle high dimensionality, we incorporate an autoencoder that learns the latent representation of the query object with respect to the data distribution. The query object and its latent representation are fed to a query-dependent control point model, enhancing the fit to the selectivity curve of the query object. To ensure consistency, we achieve the monotonicity of the estimated selectivity by converting the problem to a standard neural network prediction task, rather than imposing additional limitations such as restricting weights to be non-negative [7] or limiting the model to multi-linear functions [9]. To improve the accuracy on large-scale datasets, we propose a partition-based method to divide the database into multiple disjoint subsets and learn a local model on each of them. Since updates may occur in the database, we employ incremental learning to cope with them instead of training from scratch.

We perform experiments on three real datasets. The results show that our method outperforms various state-of-the-art models. Compared to the best existing model [40], the improvement in accuracy is up to 3.7 times in mean squared error and consistent across datasets, distance functions, and error metrics. The experiments also demonstrate that our method is competitive in estimation speed and robust against updates in the database.

The rest of the paper is organized as follows. Section 2 reviews related work. Section 3 introduces preliminaries. Section 4 summarizes our ideas. Section 5 presents our model. Section 6 discusses model complexity and the comparison with other models. We report experimental results in Section 7. Section 8 concludes the paper.

2 RELATED WORK

Traditional Estimation Models.
Selectivity estimation has been extensively studied in database systems, where prevalent approaches are based on histograms [18], sampling [43, 45], or sketches [6]. However, few of them are applicable to high-dimensional data due to the curse of dimensionality. Wu et al. [44] substantially improved the sampling complexity by using locality-sensitive hashing (LSH) as a means to perform importance sampling and attain an improved unbiased estimate of the selectivity. This approach is heavily tied to a handful of distance or similarity functions that have known LSH functions. Kernel density estimation (
KDE) [16, 27] has been proposed to handle selectivity estimation in metric space. Mattig et al. [27] alleviated the curse of dimensionality by focusing on the distribution in metric space. However, strong assumptions are usually imposed on the kernel function (e.g., only diagonal covariance matrices for Gaussian kernels), and one kernel function may be inadequate to model complex distributions in high-dimensional data.
Regression Models without Consistency Guarantee.
Selectivity estimation can be formalized as an ordinary regression problem with the query and threshold as input features, if the consistency requirement is not enforced. Non-deep-learning regression models (e.g., support vector regression) do not perform well for our task due to the high dimensionality. A recent trend is to use deep learning-based regression models. Vanilla deep regression [25, 36, 37] learns good representations of input patterns. The mixture-of-experts model (
MoE) [33] has a sparsely-gated mixture-of-experts layer that assigns data to proper experts (models), which leads to better generalization. The recursive model index (
RMI) [23] is a regression model that can be used to replace the B-tree index in relational databases. Deep regression has also been used to predict selectivities (cardinalities) [21, 35] in relational databases, amid a set of recent advances in learning methods for this task [15, 29, 30, 38, 46].
Models with Consistency Guarantee.
Gradient boosting trees (e.g.,
XGBoost [5] and
LightGBM [39]) support monotonic regression. Lattice regression [9, 11, 13, 48] uses a multi-linearly interpolated lookup table to solve low-dimensional regression problems. By enforcing constraints on its parameter values, it can guarantee monotonicity. To accommodate high-dimensional inputs, Fard et al. [9] proposed to build an ensemble of lattices using subsets of the input features. To further increase the modeling power, the deep lattice network (
DLN) [48] was proposed, interlacing non-linear calibration layers and ensembles of lattice layers. Recently, lattice regression has also been used to learn a spatial index [26]. Besides,
UMNN [41] is an autoregressive flow model which adopts Clenshaw-Curtis quadrature to achieve monotonicity. Other monotonic models include isotonic regression [14, 34] and
MinMaxNet [7].
CardNet [40] is a recently proposed method for monotonic selectivity estimation of similarity selection queries over various data types. It maps the original data to binary vectors and the threshold to an integer τ, and then predicts the selectivities for distances 0, 1, ..., τ, respectively, with (τ + 1) encoder-decoder models. When applied to high-dimensional data, it has the following drawbacks: the mapping from the input threshold to τ is not injective, i.e., multiple thresholds may be mapped to the same τ and the same selectivity is always output for them; and the overall accuracy is significantly affected if one of the (τ + 1) decoders is not accurate for some query.

3 PRELIMINARIES

Problem 1 (Selectivity Estimation for High-Dimensional Data).
Given a database of real-valued vectors D = {o_i}_{i=1}^{n}, a distance function d(·,·), a scalar threshold t, and a query object x, estimate the selectivity in the database, i.e., |{o | d(x, o) ≤ t, o ∈ D}|.

While we assume d is a distance function, it is easy to extend the definition to a similarity function by changing ≤ to ≥. In the rest of the paper, to describe our method, we focus on the case when d is a distance function, while we evaluate both Euclidean distance and cosine similarity in our experiments.

We can view the selectivity (i.e., the ground-truth label) y of a query object x and a threshold t as generated by a function y = f(x, t, D). We call f the value function. Our goal is to estimate f(x, t, D) using another function f̂(x, t, D). One unique requirement of our problem is that the estimator f̂ needs to be consistent: f̂ is consistent if and only if it is non-decreasing in the threshold t for every query object x; i.e., for all x and all t′ ≥ t, f̂(x, t′, D) ≥ f̂(x, t, D).

4 OVERVIEW

When |D| is large, it is hard to estimate f directly. One of the main challenges is that f may be non-smooth with respect to the input variables. In the worst case, we have:
• For any vector Δx ≠ 0, there exists a database D of n objects and a query (x, t) such that f(x, t, D) = 0 and f(x + Δx, t, D) = n.
• For any ϵ > 0, there exists a database D of n objects and a query (x, t) such that f(x, t, D) = 0 and f(x, t + ϵ, D) = n.
This means any model that directly approximates f is hard to make accurate. Our idea to mitigate this issue is to reduce n: instead of estimating one function f, we estimate multiple functions such that each function's output range is a small fraction of n. More specifically, we adopt the following two partitioning schemes.

Threshold Partitioning.
Assume the maximum threshold we support is t_max. We divide it with an increasing sequence of (L + 2) values [τ_0, τ_1, ..., τ_{L+1}] such that τ_i < τ_j if i < j, τ_0 = 0, and τ_{L+1} = t_max + ϵ, where ϵ is a small positive quantity used to cover the corner case of t = t_max in Eq. (1). Let g_i(x, t) be an interpolant function for the interval [τ_{i−1}, τ_i). Then we have

f̂(x, t, D) = Σ_{i=1}^{L+1} ⟦t ∈ [τ_{i−1}, τ_i)⟧ · g_i(x, t),    (1)

where ⟦·⟧ denotes the indicator function.

Data Partitioning.
We partition the database D into K disjoint parts D_1, ..., D_K, and let f_i denote the value function defined on the i-th part. Then we have f̂(x, t, D) = Σ_{i=1}^{K} f̂_i(x, t, D_i). In the next section, we will present a model that combines both partitioning schemes such that the partitions are adaptive to the query object and the database.

5 OUR MODEL

Our idea is to approximate f using a regression model f̂(x, t, D; Θ). Recall the sequence [τ_0, τ_1, ..., τ_{L+1}] in Section 4. We consider the family of continuous piecewise linear functions to implement the interpolation g_i(x, t), i ∈ [1, L+1]. A piecewise linear function is a continuous function of (L + 1) pieces, each being a linear function defined on [τ_{i−1}, τ_i). The τ_i values are called control points. Given a query object x, let p_i denote the estimated selectivity for the threshold τ_i. For the g_i function in Eq. (1), we have

g_i(x, t) = p_{i−1} + (t − τ_{i−1}) / (τ_i − τ_{i−1}) · (p_i − p_{i−1}).    (2)

Hence the regression model is parameterized by Θ ≜ {(τ_i, p_i)}_{i=0}^{L+1}. Note that the τ_i and p_i values are dependent on x; i.e., the piecewise linear function is query-dependent. The above design for Θ has the following property to guarantee consistency.

Lemma 1. Given a database D and a query object x, if p_i ≥ p_{i−1} for all i ∈ [1, L+1], then f̂(x, t, D; Θ) is non-decreasing in t. (Proof is provided in Appendix A.)

Another salient property of our model is that it is flexible in the sense that it can approximate the selectivity curve arbitrarily well. Piecewise linear functions have been well explored to fit one-dimensional curves [31]. With a sufficient number of control points, one can find an optimal piecewise linear function to fit any one-dimensional curve, because a small input range is highly likely to be linear in the output. When x and D are fixed, the selectivity only depends on t, and thus the value function can be treated as a one-dimensional curve. To distinguish different x, we design a deep learning approach to learn good control points and the corresponding selectivities. As such, our model not only inherits the good performance of piecewise linear functions but also handles different query objects.
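To make Eqs. (1) and (2) concrete, the following NumPy sketch evaluates the query-dependent piecewise linear estimator given already-computed control points; the function name and array layout are our own illustration, not the model's published code.

```python
import numpy as np

def piecewise_linear_estimate(tau, p, t):
    """Evaluate the piecewise linear selectivity estimator (Eqs. (1)-(2)).

    tau : array of shape (L+2,), non-decreasing control points with
          tau[0] = 0 and tau[-1] = t_max + eps.
    p   : array of shape (L+2,), estimated selectivities at the control
          points; non-decreasing p guarantees consistency (Lemma 1).
    t   : query threshold, assumed to lie in [tau[0], tau[-1]).
    """
    # Locate the interval [tau[i-1], tau[i]) that contains t.
    i = np.searchsorted(tau, t, side='right')
    i = min(max(i, 1), len(tau) - 1)
    # Linear interpolation inside the interval (Eq. (2)).
    frac = (t - tau[i - 1]) / (tau[i] - tau[i - 1])
    return p[i - 1] + frac * (p[i] - p[i - 1])

# Example: 3 linear pieces approximating a steep selectivity curve.
tau = np.array([0.0, 0.2, 0.3, 1.0])
p = np.array([0.0, 5.0, 80.0, 100.0])
print(piecewise_linear_estimate(tau, p, 0.25))  # 42.5
```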
Estimation Loss. In the regression model, the L learnable τ_i values and the (L + 2) p_i values are the parameters to be learned. We use the expected loss between f and f̂:

J_est(f̂) = Σ_{((x,t),y) ∈ T_train} ℓ(f(x, t, D), f̂(x, t, D)),    (3)

where T_train denotes the set of training data, and ℓ(y, ŷ) is a loss function between the true selectivity y and the estimated value ŷ of a query (x, t). We choose the Huber loss [17] applied to the logarithmic values of y and ŷ. To prevent numeric errors, we also pad the input by a small positive quantity ϵ. Let r ≜ ln(y + ϵ) − ln(ŷ + ϵ). Then

ℓ(y, ŷ) = { r²/2, if |r| ≤ δ; δ(|r| − δ/2), otherwise.

δ is set to 1.345 [10]. Intuitively, if we use an ℓ2 loss, it encourages the model to fit large selectivities well, and if we use an ℓ1 loss, it pays more attention to small selectivities. To achieve robust prediction, we reduce the value range by the logarithm and balance the two regimes with the Huber loss.
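For reference, a minimal NumPy sketch of this loss (log-transform with padding, then the Huber function); the names are ours, and the default δ = 1.345 is the classical tuning constant for Huber regression.

```python
import numpy as np

def huber_log_loss(y_true, y_pred, delta=1.345, eps=1e-6):
    """Huber loss applied to log-transformed selectivities.

    The logarithm compresses the large value range of selectivities;
    the Huber function behaves like a squared loss for small residuals
    and like an absolute loss for large ones.
    """
    r = np.log(y_true + eps) - np.log(y_pred + eps)
    quadratic = 0.5 * r ** 2
    linear = delta * (np.abs(r) - 0.5 * delta)
    return np.where(np.abs(r) <= delta, quadratic, linear).mean()

print(huber_log_loss(np.array([100.0, 5.0]), np.array([90.0, 20.0])))
```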
We choose a deep neural network to learn the piecewise linear function. It has the following advantages: (1) Deep learning is able to capture the complex patterns in control points and the corresponding selectivities for accurate estimation of different queries. (2) Deep learning generalizes well to queries that are not covered by the training data. (3) Training data for our problem can be acquired in unlimited quantity by running a selection algorithm on the dataset, and this favors deep learning, which often requires large training sets.

In our model, the τ_i and p_i values are generated separately for the input query object. We also require non-negative increments between consecutive parameters to ensure they are non-decreasing. In the following, we explain the learning of the τ_i s and p_i s, followed by the overall neural network architecture.

Control Points (τ_i s). Our idea is to learn the increments between the τ_i s:

τ_i(x) = Σ_{j=1}^{i} Δτ(x)[j],    (4)

where

Δτ(x) = Norm_l2(g^(τ)(x)) · t_max.    (5)

Norm_l2 is a normalized squared function defined as

Norm_l2(t) = [ (t_1² + ϵ/d) / (tᵀt + ϵ), ..., (t_d² + ϵ/d) / (tᵀt + ϵ) ],

where ϵ is a small positive quantity to avoid division by zero, d is the dimensionality of t, and t_i denotes the value of the i-th dimension of t. The model takes x as input and outputs L distinct thresholds in (0, t_max). g^(τ) is implemented by a neural network. Then we have a 𝜏 vector 𝜏 = [τ_1; τ_2; ...; τ_L; t_max].

One may consider using Softmax(t), which is widely used for multi-class classification and (self-)attention. We choose Norm_l2(t) rather than Softmax(t) for the following reasons: (1) Due to the exponential function in Softmax(t), a small change of t might lead to large variations of the output. (2) Softmax aims to highlight the important part rather than partition t, while our goal is to rationally partition the range [0, t_max] into several intervals such that the piecewise linear function can fit well.

Selectivities at Control Points (p_i s). We learn the (L + 2) p_i values in a similar fashion to the control points, using another neural network to implement g^(p):

p_i(x) = Σ_{j=0}^{i} Δp(x)[j],    (6)

where

Δp(x) = ReLU(g^(p)(x)).    (7)

Then we have a p vector p = [p_0; p_1; ...; p_{L+1}]. Here, we learn the (L + 2) increments (p_i − p_{i−1}) instead of directly learning the (L + 2) p_i s. Thereby, we do not have to enforce the constraint p_{i−1} ≤ p_i for i ∈ [1, L+1] in the learning process, and the learned model can better fit the selectivity curve.
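The following sketch shows how raw network outputs are turned into monotone control-point sequences via Eqs. (4)–(7); g_tau_out and g_p_out stand in for the outputs of g^(τ) and g^(p), and the output dimensions are illustrative assumptions.

```python
import numpy as np

def norm_l2(v, eps=1e-6):
    """Normalized squared function: non-negative entries that sum to 1."""
    d = len(v)
    return (v ** 2 + eps / d) / (v @ v + eps)

L, t_max = 50, 1.0
rng = np.random.default_rng(0)
g_tau_out = rng.normal(size=L + 1)    # stand-in for g^(tau)(x)
g_p_out = rng.normal(size=L + 2)      # stand-in for g^(p)(x): (L+2) increments

delta_tau = norm_l2(g_tau_out) * t_max   # Eq. (5): increments sum to t_max
tau = np.cumsum(delta_tau)[:L]           # Eq. (4): L thresholds in (0, t_max)
tau_vec = np.append(tau, t_max)          # the tau vector [tau_1; ...; tau_L; t_max]

delta_p = np.maximum(g_p_out, 0.0)       # Eq. (7): ReLU makes increments >= 0
p_vec = np.cumsum(delta_p)               # Eq. (6): non-decreasing p_0, ..., p_{L+1}
```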
Network Architecture. We illustrate the network architecture of our model in Figure 1.

Figure 1: Network architecture. (The input x is encoded by the AE into a latent code z; [x; z] feeds an FFN that produces the 𝜏 vector through Norm_l2, scaling by t_max, and the prefix-sum matrix M_psum, and a model M that produces the p vector through ReLU and M_psum; together with the threshold t, the Σ* operator outputs the estimate ŷ.)
The input x is first transformed to z, a latent representation obtained by an autoencoder (AE) trained on the entire dataset. The AE encourages the model to exploit latent data and query distributions in learning the piecewise linear function, and this helps the model generalize to query objects outside the training data. To learn the latent distributions of D, we pretrain the AE on all the objects of D, and then continue to train the AE with the queries in the training data. Due to the use of the AE, the final loss function is a linear combination of the estimation loss (Eq. (3)) and the loss of the AE on the training data (denoted by J_AE):

J(f̂) = J_est(f̂) + λ · J_AE.    (8)

x is concatenated with z, i.e., [x; z]. Then [x; z] is fed into two independent neural networks: a feed-forward network (FFN) and a model M (introduced later). Two multiplications, denoted by the S operators in Figure 1, are needed to separately convert the output of the FFN and the output of model M to the 𝜏 and p vectors, respectively. They use a scalar t_max and a matrix M_psum which, once multiplied with a vector, performs a prefix-sum operation on the vector:

M_psum = [ 1 0 ... 0
           1 1 ... 0
           ...
           1 1 ... 1 ].

The outputs of these networks, together with the threshold t, are fed into the operator Σ* in Figure 1, which is implemented by Eqs. (2), (5), and (7), to compute the output of the piecewise linear function, i.e., the estimated selectivity.

Model M. To achieve better performance, we learn p using an encoder-decoder model. In the encoder, an FFN is used to generate (L + 2) embeddings:

[h_0; h_1; ...; h_{L+1}] = FFN([x; z]),    (9)

where the h_i s are high-dimensional representations of the latent information of p. In the decoder, we adopt (L + 2) linear transformations with the ReLU activation function: k_i = ReLU(w_iᵀ h_i + b_i). Then we have p = [k_0, k_0 + k_1, ..., Σ_{i=0}^{L+1} k_i].

Figure 2: Data partitioning by cover tree.
To improve the accuracy of estimation on large-scale datasets, we divide the database into multiple disjoint subsets D_1, ..., D_K of approximately the same size, and build a local model on each of them. Let f̂_i denote each local model. Then the global model for selectivity estimation is f̂ = Σ_i f̂_i. We have considered several design choices and propose the following configuration, which achieves the best empirical performance: (1) Partitioning is obtained by a cover tree-based strategy. (2) We adopt the structure in Figure 1 so that all local models share the same input representation [x; z], but each has its own neural networks to learn the control points.

Partitioning Method.
We utilize a cover tree [19] to partition D into several parts. A partition ratio r is predefined such that the cover tree will not expand a node if the number of objects inside is smaller than r·|D|. Given a query (x, t), the valid region is the set of ball regions that intersect the ball with x as center and t as radius. For example, in Figure 2, x (the red point) and t form the red circle, and the data are partitioned into 6 regions; the valid region of (x, t) consists of the green circles that intersect the red circle. Albeit imposing constraints, the cover tree might still generate too many ball regions, i.e., leaf nodes, which lead to a large number of model parameters and difficulty in training, so reducing the number of ball regions is necessary. To remedy this, we adopt the following merging strategy. First, we partition D into K′ regions using the cover tree. Then we cluster these regions into K (K ≤ K′) clusters D_1, ..., D_K by a greedy strategy: the K′ regions are sorted in decreasing order of the number of objects inside; we begin with K empty clusters, scan each region, and assign it to the cluster with the smallest current size. The regions that belong to the same cluster are merged into one region. We consider an indicator f_c : (x, t) → {0, 1}^K such that f_c(x, t)[i] = 1 iff (x, t) intersects cluster D_i, and employ it in our model:

f̂(x, t, D) = Σ_{i=1}^{K} f_c(x, t)[i] · f̂_i(x, t, D_i).

Since cover trees require a metric space, for non-metric functions (e.g., cosine similarity), we equivalently convert them to a metric whenever possible (e.g., to Euclidean distance, as cos(u, v) = 1 − ∥u − v∥²/2 for unit vectors u and v); then the cover tree partitioning still works. For functions that cannot be equivalently converted to a metric, we adopt random partitioning instead of a cover tree and modify f_c to map every query to the all-ones vector 1^K.
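A minimal sketch of the greedy merging step, assuming the sizes of the cover-tree leaf regions are given; the heap keeps the currently smallest cluster at the front. Names are ours.

```python
import heapq

def merge_regions(region_sizes, K):
    """Greedily merge K' cover-tree regions into K balanced clusters.

    Regions are scanned in decreasing order of size; each region is
    assigned to the cluster with the smallest current total size.
    """
    order = sorted(range(len(region_sizes)), key=lambda i: -region_sizes[i])
    heap = [(0, k) for k in range(K)]  # (cluster size, cluster id)
    heapq.heapify(heap)
    clusters = [[] for _ in range(K)]
    for i in order:
        size, k = heapq.heappop(heap)
        clusters[k].append(i)
        heapq.heappush(heap, (size + region_sizes[i], k))
    return clusters

print(merge_regions([50, 40, 30, 20, 10, 5], K=3))
# [[0, 5], [1, 4], [2, 3]] -- cluster sizes 55, 50, 50
```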
Training Procedure. We have several choices on how to train the models for multiple partitions. The default is to directly train the global model f̂, with the advantage that no extra work is needed. Another choice is to train each local model independently, using the selectivity computed on the local partition as the training label. We propose yet another choice: we pretrain the local models for T epochs, and then train them jointly. In the joint training stage, we use the following loss function:

J_joint = J_est(f̂) + β · Σ_i J_est(f̂_i) + λ · J_AE.

The indicators f_c(·, ·) of all (x, t) are precomputed before training.

Handling Updates. When the dataset D is updated with insertions or deletions, we first check whether our model f̂(x, t, D) needs to be updated. In other words, when minor updates occur and f̂(x, t, D) is still accurate enough, we ignore them. To check the accuracy of f̂(x, t, D), we update the labels of all validation data and re-test the mean absolute error (MAE) of f̂(x, t, D). If the difference between the original MAE and the new one is no larger than a predefined threshold δ_U, we do not update our model. Otherwise, we adopt the following incremental learning approach. First, we update the labels in the training and validation data to reflect the update in the database. Second, we continue training our model with the updated training data until the validation error (MAE) does not decrease for 3 consecutive epochs. Here the training does not start from scratch but from the current model. We incrementally train our model with all the training data to prevent catastrophic forgetting.

6 MODEL ANALYSIS

6.1 Model Complexity

Assume an FFN has hidden layers a_1, ..., a_n. The complexity (number of weights) of an FFN with input x and output y is

|FFN(x, y)| = |x| · |a_1| + Σ_{i=1}^{n−1} |a_i| · |a_{i+1}| + |a_n| · |y|.

Our model contains three components: the AE, the FFN, and M. The complexity of the AE is |FFN(x, z)|. The complexity of the FFN is |FFN([x; z], t)|, where t is the L-dimensional vector before Norm_l2. Component M consists of an FFN and (L + 2) linear transformations; its complexity is |FFN([x; z], H)| + (L + 2) · |h_i| + (L + 2), where H = [h_0; ...; h_{L+1}]. Thus, the final model complexity is

|FFN(x, z)| + |FFN([x; z], t)| + |FFN([x; z], H)| + (L + 2) · |h_i| + (L + 2).
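The parameter count can be sanity-checked with a small helper; the layer widths and the latent dimension below are illustrative assumptions, not the exact experimental configuration.

```python
def ffn_weights(inp, hidden, out):
    """|FFN(x, y)| = |x||a_1| + sum_i |a_i||a_{i+1}| + |a_n||y| (weights only)."""
    dims = [inp] + hidden + [out]
    return sum(a * b for a, b in zip(dims, dims[1:]))

L, h_dim = 50, 100                  # control points and |h_i|
x_dim, z_dim = 300, 128             # input and latent sizes (assumed)
total = (
    ffn_weights(x_dim, [512, 256], z_dim)                      # AE encoder
    + ffn_weights(x_dim + z_dim, [512, 256], L)                # FFN for the tau vector
    + ffn_weights(x_dim + z_dim, [512, 256], (L + 2) * h_dim)  # encoder of model M
    + (L + 2) * h_dim + (L + 2)                                # (L+2) decoders (w_i, b_i)
)
print(total)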
6.2 Comparison with Lattice Regression

Lattice regression models are the latest deep learning architectures for monotonic regression. We provide a comparison between ours and them when applied to selectivity estimation. In order to perform an analytical comparison, we make the following simplifications: (1) We assume that x and D are fixed, so the selectivity only depends on t. (2) We consider a shallow version of DLN with one layer of calibrators and one layer of a single lattice.

With the above simplifications, DLN can be analytically represented as f̂_DLN(t) = h(g(t; w); θ_0, θ_1), where g : t ∈ [0, t_max] ↦ z ∈ [0, 1] and h(z; θ_0, θ_1) = (1 − z)·θ_0 + z·θ_1. Hence it degenerates to fitting a linear interpolation in a latent space. There is little learning for the function h, as its two parameters θ_0 and θ_1 are determined by the minimum and maximum selectivity values in the training data. Thus, the workhorse of the model is learning the non-linear mapping g. The calibrator also uses piecewise linear functions, with L control points equivalent to our (τ_i, p_i)_{i=1}^{L}. However, its τ_i s are equally spaced between 0 and t_max, and only the p_i s are learnable. This design is not flexible for many value functions; e.g., if the function values change rapidly within a small interval, the calibrator cannot adaptively allocate more control points to that area. We show this with 8 control points for both models learning the function y = f(t) = exp(t), t ∈ [0, 1]. The training data are 80 (t_i, f(t_i)) pairs where the t_i s are uniformly sampled in [0, 1]. We plot both models' estimation curves and their learned control points in Figure 3; the z values at the control points of DLN are shown on the right side of Figure 3(a). We observe: (1) The calibrator virtually determines the estimation, as h() degenerates to a simple scaling. (2) The calibrator's control points are evenly spaced in t, while our model learns to place more control points in the "interesting area", i.e., where the y values change rapidly. (3) As a result, our model approximates the value function much better than DLN.

Figure 3: Comparison of (a) simplified DLN and (b) our model, both with 8 control points, fitting y = f(t) = exp(t) for t ∈ [0, 1].

Further, for DLN, the non-linear mapping on t is independent of x (even though we do not model x here). Even in the full-fledged DLN model, the calibration is performed on each input dimension independently. The full-fledged DLN model is too complex to analyze, so we only study it in our empirical evaluation. Nonetheless, we believe that the above inherent limitations remain. Our empirical evaluation will also show that query-dependent fitting of the value function is critical in our problem and very effective for our model. Apart from DLN, recent studies also employ lattice regression and/or piecewise linear functions for learned indexes [22, 26]. Like DLN, their control points are also query-independent, albeit not equally spaced.
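Observation (2) above is easy to reproduce outside of any training loop: with the same budget of 8 knots, placing them where the curve bends (density proportional to sqrt(f″), a classical rule for piecewise linear approximation) beats equal spacing. The placement rule here is our own illustration, not the exact mechanism of either model.

```python
import numpy as np

f = np.exp
grid = np.linspace(0.0, 1.0, 10001)

def max_err(knots):
    """Max deviation of the piecewise linear interpolant of f from f."""
    return np.max(np.abs(np.interp(grid, knots, f(knots)) - f(grid)))

uniform = np.linspace(0.0, 1.0, 8)   # DLN-style equally spaced knots

# Knot density ~ sqrt(f'') = e^{t/2}: take equal steps in e^{t/2} and
# map them back with 2*ln, so knots concentrate where f bends faster.
adaptive = 2.0 * np.log(np.linspace(1.0, np.exp(0.5), 8))

print(max_err(uniform), max_err(adaptive))  # the adaptive knots fit tighter
```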
6.3 Comparison with Clenshaw-Curtis Quadrature Models

Clenshaw-Curtis quadrature is able to approximate the integral ∫_0^{t_max} ĝ(x, t, D) dt, where ĝ = ∂f̂(x, t, D)/∂t in our problem. UMNN [41] is a recent work that adopts this idea to solve the autoregressive flow problem and uses a neural network to model ĝ. In the Clenshaw-Curtis scheme [28], the cosine transform of ĝ(cos θ) is adopted, and the discrete finite cosine transform is sampled at equidistant points θ_s = πs/N, where s = 0, ..., N and N is the number of sample points. Similar to DLN, it adopts the same integral approximation for different queries and ignores that the integral points should depend on x. In contrast, our method addresses this issue by using a query-dependent model, thereby delivering more flexibility.
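For intuition, a UMNN-style monotone estimator can be sketched in a few lines: integrate a positive "derivative network" from 0 to the query threshold. The derivative function below is a stand-in for the learned network, and for brevity we use trapezoidal weights on Chebyshev-spaced nodes rather than the exact Clenshaw-Curtis cosine-transform weights.

```python
import numpy as np

def positive_derivative(t):
    """Stand-in for the learned derivative network; a positive output
    (e.g. enforced via ELU + 1) makes the integral monotone in t."""
    return np.exp(-((t - 0.3) ** 2) / 0.01) + 0.1

def monotone_estimate(t_query, N=64):
    """Integrate the positive derivative on [0, t_query] using the node
    layout of Clenshaw-Curtis quadrature (cosines of equidistant angles)."""
    theta = np.pi * np.arange(N + 1) / N
    nodes = 0.5 * t_query * (1.0 - np.cos(theta))  # maps [0, pi] -> [0, t_query]
    y = positive_derivative(nodes)
    return np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(nodes))  # trapezoidal rule

for t in (0.2, 0.4, 0.8):
    print(t, monotone_estimate(t))   # estimates are non-decreasing in t
```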
7 EXPERIMENTS

Datasets. We use three embedding datasets.
• fasttext is a pretrained word embedding dataset consisting of 1 million 300-dimensional vectors [1]. We evaluate cosine similarity and Euclidean distance and dub the settings fasttext-cos and fasttext-l2, respectively.
• face is the MS-Celeb face image dataset [12]. We randomly extract 2 million images and obtain 128-dimensional embedding vectors using the FaceNet model [32]. We only evaluate cosine similarity, dubbed face-cos, because the vectors are already normalized.
• YouTube is a collection of video records consisting of 0.35 million vectors with 1770 dimensions [2]. We only evaluate cosine similarity, dubbed YouTube-cos, because the vectors are already normalized.

Given a dataset D, we randomly sample 0.25 million vectors from D as query objects. We consider two settings to generate thresholds: (1) For the default setting, like the approach in [27], we uniformly generate a sequence of 40 selectivity values in the range [0, 1% · |D|] and calculate the corresponding thresholds; e.g., the cosine similarity threshold for selectivity 1% · |D| is around 0.5 on fasttext. (2) For the alternative setting, we generate cosine similarity thresholds in [0.5, 1] using the beta distribution with α = β = 0.5, in order to simulate the case where users are more interested in thresholds in the middle. The selectivity is up to around 10% · |D| on fasttext. The resulting query workload, denoted by Q, is uniformly split 8:1:1 (by query objects) into a training set T_train, a validation set T_valid, and a test set T_test, in line with [40]. This ensures that none of the test query objects has been seen by the model during training or validation. Note that labels (i.e., true selectivities) are computed on D, not Q. For each setting, we test on 5 sampled workloads to mitigate the effect of sampling error.
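A sketch of the two threshold generators on synthetic data; the 1% cap follows the range as reconstructed above, and all names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def default_thresholds(x, D, num=40):
    """Default setting: target selectivities uniform in [0, 1% * |D|],
    converted to distance thresholds via k-th nearest neighbor distances."""
    dists = np.sort(np.linalg.norm(D - x, axis=1))
    ks = np.linspace(1, max(1, int(0.01 * len(D))), num, dtype=int)
    return dists[ks - 1]

def beta_thresholds(num=40):
    """Alternative setting: cosine-similarity thresholds in [0.5, 1]
    drawn from Beta(0.5, 0.5)."""
    return 0.5 + 0.5 * rng.beta(0.5, 0.5, size=num)

D = rng.normal(size=(10000, 16))
print(default_thresholds(rng.normal(size=16), D)[:5])
print(beta_thresholds()[:5])
```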
Methods. We compare the following approaches; see Appendix B for model settings.
• LSH [44] is based on importance sampling using locality-sensitive hashing. It only works for cosine similarity due to the use of SimHash [4].
• KDE [27] is based on adaptive kernel density estimation for metric distance functions. To cope with cosine similarity, we normalize data to unit vectors and run KDE for Euclidean distance.
• LightGBM [39] is based on gradient boosting trees. We compare with the standard one (LightGBM) and the one with monotonicity enforced (LightGBM-m). XGBoost [5] has the same mechanism as LightGBM and was shown to deliver similar accuracy at slower speed [40], so we do not compare with it.
• Deep regression models: DNN, a vanilla feed-forward network; MoE [33], a mixture-of-experts model with sparse activation; RMI [23], a hierarchical mixture-of-experts model that achieves competitive performance on indexing for range queries; and CardNet [40], a feature extraction model followed by a regression model based on incremental prediction (we enable its accelerated estimation [40]).
• Lattice regression models: we adopt DLN [48] in this category.
• Clenshaw-Curtis quadrature model: we adopt UMNN [41].
• Our model is dubbed SelNet. The default setting of L is 50 and K is 3. δ_U for incremental learning is 20. We also evaluate two ablated models: (1) SelNet-ct is SelNet without the cover tree partitioning, and (2) SelNet-ad-ct is SelNet-ct without the query-dependent feature for control points (disabled by feeding a constant vector into the FFN that generates the 𝜏 vector).
Table 1: Accuracy on fasttext-cos.

Model          MSE                  MAE                  MAPE
               T_valid    T_test    T_valid    T_test    T_valid    T_test
LSH*           70.03      71.45     12.28      13.17     1.38       1.40
KDE*           64.13      64.22     11.71      11.72     1.03       1.01
LightGBM
LightGBM-m*    157.25     160.11    9.02       9.08      0.92       0.91
DNN
MoE
RMI
CardNet*
DLN*           56.22      58.57     10.66      10.54     1.07       1.06
UMNN*          23.22      24.69     6.67       6.71      1.21       1.22
SelNet*

Table 2: Accuracy on fasttext-l2.

Model          MSE                  MAE                  MAPE
               T_valid    T_test    T_valid    T_test    T_valid    T_test
KDE*
LightGBM
LightGBM-m*    171.46     101.22    10.13      9.84      1.22       1.20
DNN
MoE
RMI
CardNet*       24.53
DLN*           76.04      77.50     11.53      11.56     1.52       1.53
UMNN*          43.11      43.04     8.90       8.89      1.36       1.35
SelNet*
Error Metrics. We evaluate Mean Squared Error (MSE) and Mean Absolute Percentage Error (MAPE), in line with [40]. We also report Mean Absolute Error (MAE).

Environment.
Experiments were run on a server with an Intel Xeon E5-2640 @2.40GHz CPU and 256GB RAM, running Ubuntu 16.04.4 LTS. Models were implemented in Python and TensorFlow.
We report the accuracies of the competitors in Tables 1–4, where models that guarantee consistency are marked with *. Our model, SelNet, robustly and significantly outperforms the existing models. It achieves substantial error reduction against the best of the state-of-the-art methods, in all three error metrics and all settings. Compared to the runner-up model in each setting, the improvement is 2.0 – 3.3 times in MSE, 1.3 – 1.8 times in MAE, and 1.2 – 1.4 times in MAPE on test data.

Table 3: Accuracy on face-cos.

Model          MSE                  MAE                  MAPE
               T_valid    T_test    T_valid    T_test    T_valid    T_test
LSH*           277.82     277.35    24.44      25.36     1.25       1.28
KDE*           179.76     182.34    19.65      19.76     0.99       1.01
LightGBM
LightGBM-m*    112.44     114.62    10.43      10.46     0.48       0.49
DNN
MoE
RMI
CardNet*
DLN*           126.85     112.36    18.34      18.02     0.98       0.97
UMNN*          16.32      16.75     4.68       4.70      0.38       0.36
SelNet*

Table 4: Accuracy on YouTube-cos.

Model          MSE                  MAE                  MAPE
               T_valid    T_test    T_valid    T_test    T_valid    T_test
LSH*           30.25      28.54     1.85       1.83      0.78       0.76
KDE*           28.76      29.25     1.89       1.90      0.74       0.70
LightGBM
LightGBM-m*    43.83      54.26     2.12       2.19      0.64       0.65
DNN
MoE
RMI
CardNet*
DLN*           29.25      29.37     1.91       1.92      0.70       0.69
UMNN*          19.04      20.57     1.64       1.69      0.50       0.49
SelNet*

Table 5: Empirical monotonicity (%) on face-cos.

LSH*    KDE*    LightGBM    LightGBM-m*    DNN      MoE      RMI      CardNet*    DLN*    UMNN*    SelNet*
100     100     86.34       100            78.22    94.82    90.48    100         100     100      100
We examine each category of models, starting with the sampling-based methods. KDE works better than LSH in most settings. In fact, KDE even outperforms some deep learning regression-based methods in a few cases (e.g., MSE in Table 2). For the tree-based models, LightGBM and LightGBM-m do not perform well, and LightGBM-m cannot beat LightGBM in most cases. This indicates that the monotonic constraint, albeit offering better interpretability, decreases the performance of regression. Among the deep learning models (except ours), CardNet is generally the best, thanks to its incremental prediction for each threshold interval. MoE is not good at large selectivities (MSE in Table 2). The performance of DLN is mediocre; the main reason is analyzed in Section 6.2. The accuracy of UMNN, which uses the same integral points for different queries, though better than DLN, still trails behind ours by a large margin.
Empirical Monotonicity. We compute the empirical monotonicity measure [7] and show the results in Table 5. The measure is the percentage of estimated pairs that preserve monotonicity, averaged over 200 queries. For each query, we sampled 100 thresholds, which form (100 choose 2) = 4950 pairs. A low score indicates more inconsistent estimates. As expected, the models without a consistency guarantee cannot produce 100% monotonicity.
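The measure can be computed as follows; model stands for any estimator f̂(x, t), and all names are ours.

```python
import numpy as np
from itertools import combinations

def monotonicity_score(model, queries, n_thresholds=100, t_max=1.0, seed=0):
    """Percentage of threshold pairs (t < t') whose estimates satisfy
    model(x, t) <= model(x, t'), averaged over the query objects."""
    rng = np.random.default_rng(seed)
    scores = []
    for x in queries:
        ts = np.sort(rng.uniform(0.0, t_max, n_thresholds))
        preds = [model(x, t) for t in ts]
        ok = sum(preds[i] <= preds[j]
                 for i, j in combinations(range(n_thresholds), 2))
        scores.append(100.0 * ok / (n_thresholds * (n_thresholds - 1) / 2))
    return float(np.mean(scores))
```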
Table 6: Ablation study.

Dataset         Model           MSE                  MAE                  MAPE
                                T_valid    T_test    T_valid    T_test    T_valid    T_test
fasttext-cos    SelNet
                SelNet-ct
                SelNet-ad-ct
fasttext-l2     SelNet
                SelNet-ct
                SelNet-ad-ct
face-cos        SelNet
                SelNet-ct
                SelNet-ad-ct
YouTube-cos     SelNet
                SelNet-ct
                SelNet-ad-ct
Figure 4: Control points learned on fasttext-cos: (a) SelNet-ct on the 1st query; (b) SelNet-ad-ct on the 1st query; (c) SelNet-ct on the 2nd query; (d) SelNet-ad-ct on the 2nd query.

SelNet-ct vs. SelNet.
Table 6 shows that the partitioning improves the performance, especially on fasttext-l2, where a 1.6-times MSE improvement is observed on test data. This is because each local model deals with a subset of the dataset for a better fit, and the ground-truth label values for each local model are reduced, which makes it easier to fit our piecewise linear function with the same number of control points, as the value function is less steep.
SelNet-ct vs. SelNet-ad-ct. The difference between the two models is whether the control points depend on x. Table 6 shows this feature has a significant impact on accuracy across all settings and all error metrics. When query-dependent control points are employed to fit the selectivity curve, the improvements in MSE, MAE, and MAPE are up to 3.1, 2.0, and 3.6 times, respectively. To further illustrate the difference, we plot in Figure 4 the control points learned by the two models for two randomly picked queries. We can see that SelNet-ad-ct uses the same set of control points for both queries. This is not ideal: SelNet-ad-ct fails to fit the ground-truth curve closely, especially for quickly changing selectivity values, and this results in larger errors across all three metrics. In contrast, SelNet-ct devotes more points to the threshold intervals in which the selectivity changes rapidly.

Table 7: Average estimation time (milliseconds).
Model          fasttext-cos    fasttext-l2    face-cos    YouTube-cos
SimSelect
LSH*           1.59            -              1.08        2.35
KDE*           0.68            0.79           0.59        0.94
LightGBM
LightGBM-m*    0.28            0.28           0.19        0.50
DNN
MoE
RMI
CardNet*       0.20            0.19           0.14        0.31
DLN*           0.81            0.83           0.65        1.22
UMNN*          0.37            0.39           0.24        0.52
SelNet*        0.34            0.35           0.24        0.51

Table 8: Training time (hours).

Model          fasttext-cos    fasttext-l2    face-cos    YouTube-cos
KDE*
LightGBM
LightGBM-m*    1.5             1.4            1.2         1.1
DNN
MoE
RMI
CardNet*       3.5             3.5            3.1         2.9
DLN*           6.7             6.6            6.0         5.7
UMNN*          5.3             5.2            4.9         4.4
SelNet*        5.1             5.0            4.7         4.6
Estimation Time. Table 7 shows the estimation time on the test data. We also report the time of running a state-of-the-art selection algorithm [20] to obtain the exact cardinality (referred to as SimSelect). All the models except LSH are at least an order of magnitude faster than SimSelect. Our model is on a par with the other deep learning models (except DNN), and faster than LSH and KDE. Although DNN is very fast due to its simple model structure, its accuracy is much worse than ours, e.g., by 22 times in MSE on face-cos.

Training Time. Table 8 shows the training times of the competitors. As expected, traditional learning models are faster to train. Our models spend around 5 hours, similar to the other deep models. In Figure 5, we show the performance, measured by MSE, of the deep learning models when varying the amount of training examples from 20% to 100% of the original training data. All the models perform worse with less training data, but our models are more robust, showing only moderate accuracy loss.
Figure 5: Varying training data size: MSE of the deep learning models (SelNet, CardNet, RMI, MoE, DLN, UMNN) on (a) fasttext-cos, (b) fasttext-l2, (c) face-cos, and (d) YouTube-cos.
Figure 6: Data update: (a) MSE and (b) MAPE on face-cos and fasttext-cos over the sequence of update operations.

Table 9: Varying the number of control points on fasttext-l2.

Error Metric         Number of Control Points
                     10        50        90        130
MSE on T_test        13.06     7.65      7.93      10.47
MAE on T_test        4.85      3.51      3.56      3.92
MAPE on T_test

Data Update. We generate a stream of 100 update operations, each inserting or deleting 5 records, on face-cos and fasttext-cos to evaluate our incremental learning technique.
Figure 6 plots how MSE and MAPE change with the stream. The general trend is that the error decreases as more updates arrive, except for MAPE on fasttext-cos, which stays almost the same. The result indicates that incremental learning is able to keep up with the updated data. Besides, SelNet spends only 1.5 – 2.0 minutes on each incremental learning run, showcasing its ability to cope with updates quickly.

Table 10: Varying partition size on fasttext-l2.

Error Metric         Partition Size
                     1         3         6         9
MSE on T_test        13.21     7.65      6.82      6.75
MAE on T_test        4.33      3.51      3.36      3.11
MAPE on T_test

Table 11: Varying partitioning method on fasttext-l2.

Error Metric         CT (3)    RP (3)    KM (3)
MSE on T_test        7.87      8.02      9.14
MAE on T_test        3.56      3.57      3.64
MAPE on T_test

Table 12: Accuracy on fasttext-cos (thresholds follow beta distribution B(0.5, 0.5)).
Model          MSE                  MAE                  MAPE
               T_valid    T_test    T_valid    T_test    T_valid    T_test
LSH*           74.79      69.88     14.55      14.42     1.37       1.35
KDE*           69.53      69.18     14.00      13.90     0.99       1.01
LightGBM
LightGBM-m*    68.86      70.11     13.87      13.92     0.74       0.76
DNN
MoE
RMI
CardNet*       6.75       6.21
DLN*           48.26      47.89     11.50      11.39     1.11       1.13
UMNN*
SelNet*

Varying Parameters. Table 9 shows the accuracy when we vary the number of control points L on fasttext-l2. A small value leads to underfitting the selectivity curve over thresholds, while a large value increases the learning difficulty.
L = 50 achieves the best performance. Table 10 reports the accuracy when we vary the partition size K on fasttext-l2. There is no partitioning when K = 1. We observe that the partitioning is useful, but the improvement is small once the partition size exceeds 3, while the estimation time substantially increases. This means a small partition size (K = 3) suffices to achieve good performance. For the partitioning strategy, we compare cover tree partitioning (CT) with random partitioning (RP) and k-means partitioning (KM) in Table 11. CT delivers the best performance. KM is the worst because it tends to cause imbalance in the partitions.

Varying Threshold Distribution. We evaluate the case when thresholds are generated by the alternative beta-distribution setting. Table 12 reports the accuracy on fasttext-cos. Compared with the default setting (Table 1), all the models report larger MSE and MAE (note that the two tables use different scales), but most models perform better in MAPE. This is because the selectivities in this setting (in [0, 10% · |D|]) are usually larger than in the default one (in [0, 1% · |D|]), but the variance (when selectivities are normalized) is smaller. SelNet is still consistently better than the others: on test data, it improves MSE by 3.7 times, MAE by 2.0 times, and MAPE by 1.1 times over the runner-up model in each setting. This demonstrates that our model is robust against distribution changes in thresholds.
8 CONCLUSION

In this paper, we tackled the selectivity estimation problem for high-dimensional data using a deep learning architecture. Our method is based on learning a monotonic, query-dependent piecewise linear function. This gives our model the flexibility to approximate the selectivity curve while guaranteeing the consistency of estimation. We proposed a partitioning technique to cope with large-scale datasets and an incremental learning technique to handle updates. Our experiments demonstrated that the proposed model significantly outperforms state-of-the-art methods across a range of settings, including datasets, distance functions, and error metrics.
REFERENCES
[1] fastText pre-trained word vectors. https://fasttext.cc/.
[2] YouTube video embedding dataset.
[3] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. LOF: Identifying density-based local outliers. In SIGMOD, pages 93–104, 2000.
[4] M. S. Charikar. Similarity estimation techniques from rounding algorithms. In STOC, pages 380–388, 2002.
[5] T. Chen and C. Guestrin. XGBoost: A scalable tree boosting system. In KDD, pages 785–794, 2016.
[6] G. Cormode, M. N. Garofalakis, P. J. Haas, and C. Jermaine. Synopses for massive data: Samples, histograms, wavelets, sketches. Foundations and Trends in Databases, 4(1-3):1–294, 2012.
[7] H. Daniels and M. Velikova. Monotone and partially monotone neural networks. IEEE Transactions on Neural Networks, 21(6):906–917, 2010.
[8] S. Das, P. S. G. C., A. Doan, J. F. Naughton, G. Krishnan, R. Deep, E. Arcaute, V. Raghavendra, and Y. Park. Falcon: Scaling up hands-off crowdsourced entity matching to build cloud services. In SIGMOD, pages 1431–1446, 2017.
[9] M. M. Fard, K. Canini, A. Cotter, J. Pfeifer, and M. Gupta. Fast and flexible monotonic functions with ensembles of lattices. In NIPS, pages 2919–2927, 2016.
[10] J. Fox. Robust regression: Appendix to an R and S-Plus companion to applied regression, 2002.
[11] E. Garcia and M. Gupta. Lattice regression. In NIPS, pages 594–602, 2009.
[12] Y. Guo, L. Zhang, Y. Hu, X. He, and J. Gao. MS-Celeb-1M: A dataset and benchmark for large scale face recognition. In ECCV, 2016.
[13] M. Gupta, A. Cotter, J. Pfeifer, K. Voevodski, K. Canini, A. Mangylov, W. Moczydlowski, and A. Van Esbroeck. Monotonic calibrated interpolated look-up tables. The Journal of Machine Learning Research, 17(1):3790–3836, 2016.
[14] Q. Han, T. Wang, S. Chatterjee, and R. J. Samworth. Isotonic regression in general dimensions. arXiv preprint arXiv:1708.09468, 2017.
[15] S. Hasan, S. Thirumuruganathan, J. Augustine, N. Koudas, and G. Das. Deep learning models for selectivity estimation of multi-attribute queries. In SIGMOD, pages 1035–1050, 2020.
[16] M. Heimel, M. Kiefer, and V. Markl. Self-tuning, GPU-accelerated kernel density models for multidimensional selectivity estimation. In SIGMOD, pages 1477–1492, 2015.
[17] P. J. Huber et al. Robust estimation of a location parameter. The Annals of Mathematical Statistics, 35(1):73–101, 1964.
[18] Y. Ioannidis. The history of histograms (abridged). In VLDB, pages 19–30, 2003.
[19] M. Izbicki and C. R. Shelton. Faster cover trees. In ICML, pages 1162–1170, 2015.
[20] M. Izbicki and C. R. Shelton. Faster cover trees. In ICML, pages 1162–1170, 2015.
[21] A. Kipf, T. Kipf, B. Radke, V. Leis, P. A. Boncz, and A. Kemper. Learned cardinalities: Estimating correlated joins with deep learning. In CIDR, 2019.
[22] A. Kipf, R. Marcus, A. van Renen, M. Stoian, A. Kemper, T. Kraska, and T. Neumann. RadixSpline: A single-pass learned index. In aiDM@SIGMOD, pages 5:1–5:5, 2020.
[23] T. Kraska, A. Beutel, E. H. Chi, J. Dean, and N. Polyzotis. The case for learned index structures. In SIGMOD, pages 489–504, 2018.
[24] M. Kula. Metadata embeddings for user and item cold-start recommendations. CoRR, abs/1507.08439, 2015.
[25] S. Lathuilière, P. Mesejo, X. Alameda-Pineda, and R. Horaud. A comprehensive analysis of deep regression. arXiv preprint arXiv:1803.08450, 2018.
[26] P. Li, H. Lu, Q. Zheng, L. Yang, and G. Pan. LISA: A learned index structure for spatial data. In SIGMOD, pages 2119–2133, 2020.
[27] M. Mattig, T. Fober, C. Beilschmidt, and B. Seeger. Kernel-based cardinality estimation on metric data. In EDBT, pages 349–360, 2018.
[28] M. Novelinkova. Comparison of Clenshaw-Curtis and Gauss quadrature. In WDS, volume 11, pages 67–71, 2011.
[29] J. Ortiz, M. Balazinska, J. Gehrke, and S. S. Keerthi. An empirical analysis of deep learning for cardinality estimation. CoRR, abs/1905.06425, 2019.
[30] Y. Park, S. Zhong, and B. Mozafari. QuickSel: Quick selectivity learning with mixture models. In SIGMOD, pages 1017–1033, 2020.
[31] L. Prunty. Curve fitting with smooth functions that are piecewise-linear in the limit. Biometrics, pages 857–866, 1983.
[32] F. Schroff, D. Kalenichenko, and J. Philbin. FaceNet: A unified embedding for face recognition and clustering. In CVPR, pages 815–823, 2015.
[33] N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017.
[34] J. Spouge, H. Wan, and W. Wilbur. Least squares isotonic regression in two dimensions. Journal of Optimization Theory and Applications, 117(3):585–605, 2003.
[35] J. Sun and G. Li. An end-to-end learning-based cost estimator. PVLDB, 13(3):307–319, 2019.
[36] Y. Sun, X. Wang, and X. Tang. Deep convolutional network cascade for facial point detection. In CVPR, pages 3476–3483, 2013.
[37] A. Toshev and C. Szegedy. DeepPose: Human pose estimation via deep neural networks. In CVPR, pages 1653–1660, 2014.
[38] B. Walenz, S. Sintos, S. Roy, and J. Yang. Learning to sample: Counting with complex queries. Proc. VLDB Endow., 13(3):390–402, 2019.
[39] D. Wang, Y. Zhang, and Y. Zhao. LightGBM: An effective miRNA classification method in breast cancer patients. In ICCBB, pages 7–11, 2017.
[40] Y. Wang, C. Xiao, J. Qin, X. Cao, Y. Sun, W. Wang, and M. Onizuka. Monotonic cardinality estimation of similarity selection: A deep learning approach. In SIGMOD, pages 1197–1212, 2020.
[41] A. Wehenkel and G. Louppe. Unconstrained monotonic neural networks. In NeurIPS, pages 1543–1553, 2019.
[42] K.-Y. Whang, S.-W. Kim, and G. Wiederhold. Dynamic maintenance of data distribution for selectivity estimation. The VLDB Journal, 3(1):29–51, 1994.
[43] W. Wu, J. F. Naughton, and H. Singh. Sampling-based query re-optimization. In SIGMOD, pages 1721–1736, 2016.
[44] X. Wu, M. Charikar, and V. Natchu. Local density estimation in high dimensions. In ICML, pages 5293–5301, 2018.
[45] Y. Wu, D. Agrawal, and A. El Abbadi. Query estimation by adaptive sampling. In ICDE, pages 639–648, 2002.
[46] Z. Yang, E. Liang, A. Kamsetty, C. Wu, Y. Duan, P. Chen, P. Abbeel, J. M. Hellerstein, S. Krishnan, and I. Stoica. Deep unsupervised cardinality estimation. Proc. VLDB Endow., 13(3):279–292, 2019.
[47] H. Ying, F. Zhuang, F. Zhang, Y. Liu, G. Xu, X. Xie, H. Xiong, and J. Wu. Sequential recommender system based on hierarchical attention networks. In IJCAI, pages 3926–3932, 2018.
[48] S. You, D. Ding, K. Canini, J. Pfeifer, and M. Gupta. Deep lattice networks and partial monotonic functions. In NIPS, pages 2981–2989, 2017.
A PROOF
Lemma 1.
Proof. Assume t ∈ [τ_{i−1}, τ_i); then, for a sufficiently small ϵ > 0, t + ϵ is in [τ_{i−1}, τ_i) or in [τ_i, τ_{i+1}). In the first case, f̂(x, t + ϵ, D; Θ) − f̂(x, t, D; Θ) = ϵ / (τ_i − τ_{i−1}) · (p_i − p_{i−1}) ≥ 0. In the second case, f̂(x, t, D; Θ) ≤ p_i and f̂(x, t + ϵ, D; Θ) ≥ p_i. Therefore, f̂(x, t, D; Θ) is non-decreasing in t. □

B EXPERIMENT SETUP

B.1 Model Settings
Hyperparameter and training settings are given below.
• DNN is a vanilla FFN with four hidden layers of sizes 512, 512, 512, and 256.
• MoE consists of 30 expert models, each an FFN with three hidden layers of sizes 512, 512, and 512. We use the top-3 experts for prediction.
• RMI has three levels, with 1, 4, and 8 models, respectively. Each model is an FFN with four hidden layers of sizes 512, 512, 512, and 256.
• DLN is an architecture of six layers: calibrators, linear embedding, calibrators, ensemble of lattices, calibrators, and linear embedding.
• UMNN is an FFN with four hidden layers of sizes 512, 512, 512, and 256 to implement the derivative ∂f̂(x, t, D)/∂t; f̂(x, t, D) is computed by Clenshaw-Curtis quadrature with the learned derivative.
• SelNet: We use an FFN with two hidden layers to estimate 𝜏, and an FFN (Eq. (9)) with four hidden layers to estimate p. The encoder and decoder of the AE are implemented with FFNs with three hidden layers. For face-cos and YouTube-cos, the sizes of the first three (or two, if it only has two) hidden layers of these three FFNs are 512, and the sizes of all the other hidden layers are 256. For fasttext-cos and fasttext-l2, the size of the first hidden layer of these FFNs is 1024, and the others remain the same as above. The number of control points L is 50. The default partition size K is 3. t_max is 54 for Euclidean distance. For cosine similarity, we equivalently convert it to Euclidean distance on unit vectors and set t_max to 1. The learning rates for face-cos, fasttext-cos, fasttext-l2, and YouTube-cos are 0.00003, 0.00002, 0.00002, and 0.00003, respectively. |h_i| (0 ≤ i ≤ L+1) in model M is 100. The batch size is 512 for all datasets. We train all models for 1500 epochs and select the ones with the smallest validation errors. For training with data partitioning, we pretrain for T = 300 epochs; β and the other hyper-parameters are fine-tuned on the validation set. δ_U for incremental learning is 20.

For LSH and KDE, we use 2,000 samples to keep the estimation cost reasonable. For all the other models, we train them with the same Huber loss on the logarithms of the ground truth and the prediction; all hyper-parameters are fine-tuned according to the validation set. DNN, MoE, and RMI cannot directly handle the threshold t. We learn a non-linear transformation of t into an m-dimensional embedding vector, i.e., t̃ = ReLU(w·t), and concatenate it with x as the input to these models.

B.2 Evaluation Metrics
We evaluate Mean Squared Error (MSE), Mean Absolute Error (MAE), and Mean Absolute Percentage Error (MAPE). They are defined as:

MSE = (1/m) Σ_{i=1}^{m} (ŷ_i − y_i)²,
MAE = (1/m) Σ_{i=1}^{m} |ŷ_i − y_i|,
MAPE = (1/m) Σ_{i=1}^{m} |(ŷ_i − y_i) / y_i|,

where y_i is the ground-truth value and ŷ_i is the estimated value.
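Equivalently, in NumPy:

```python
import numpy as np

def mse(y, y_hat):
    return np.mean((y_hat - y) ** 2)

def mae(y, y_hat):
    return np.mean(np.abs(y_hat - y))

def mape(y, y_hat):
    return np.mean(np.abs((y_hat - y) / y))
```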