A Comparative Exploration of ML Techniques for Tuning Query Degree of Parallelism
Zhiwei Fan, Rathijit Sen, Paraschos Koutris, Aws Albarghouthi
University of Wisconsin-Madison; Microsoft Gray Systems Lab
{zhiwei, paris, aws}@cs.wisc.edu, {rathijit.sen}@microsoft.com

ABSTRACT
There is a large body of recent work applying machine learning (ML) techniques to query optimization and query performance prediction in relational database management systems (RDBMSs). However, these works typically ignore the effect of intra-parallelism, a key component used to boost the performance of OLAP queries in practice, on query performance prediction. In this paper, we take a first step towards filling this gap by studying the problem of tuning the degree of parallelism (DOP) via ML techniques in Microsoft SQL Server, a popular commercial RDBMS that allows an individual query to execute using multiple cores. In our study, we cast the problem of DOP tuning as a regression task, and examine how several popular ML models can help with query performance prediction in a multi-core setting. We explore the design space and perform an extensive experimental study comparing different models against a list of performance metrics, testing how well they generalize in different settings: (i) to queries from the same template, (ii) to queries from a new template, (iii) to instances of different scale, and (iv) to different instances and queries. Our experimental results show that a simple featurization of the input query plan that ignores cost model estimations can accurately predict query performance, capture the speedup trend with respect to the available parallelism, as well as help with automatically choosing an optimal per-query DOP.
1. INTRODUCTION
The rise of cloud-computing platforms has offered new capabilities and benefits to users, including the capability to provision the appropriate amount of resources necessary for a given task. In order to effectively utilize this capability, Infrastructure-as-a-Service (IaaS) and Platform-as-a-Service (PaaS) users need tools and techniques to quantify the trade-offs between resource costs and performance benefits. Such techniques are also invaluable to PaaS and Software-as-a-Service (SaaS) cloud providers, since they can reduce operational costs while meeting Service-Level Agreements (SLAs).

Identifying optimal or near-optimal resource configurations is a difficult task. The challenges are three-fold: unknown workload characteristics, a large state space, and costly evaluation of what-if scenarios. Although resource utilization may be observed for repeating jobs, one may have to resort to intrusive profiling (not feasible in production environments) or A/B testing (which requires additional resources) to estimate tradeoffs with different configurations. Resource provisioning for new jobs is even more challenging. For these reasons, configuration tuning is traditionally performed by database administrators. However, this approach is tedious, requires a high level of domain expertise, and does not scale to cloud platforms that need to handle millions of databases and different service tiers. In this paper, we tackle this problem by studying how
Machine Learning (ML) techniques can be applied to estimate cost-performance tradeoffs for query processing in an RDBMS. To focus our study, we use Microsoft SQL Server as the RDBMS, which is available as cloud IaaS and PaaS offerings [23] and as an on-premises solution [27]. We also choose to study the configuration of a particular resource, the query Degree Of Parallelism (DOP): this is the maximum number of hardware threads (logical cores) that can be used at any time for executing the query. Estimates of cost-benefit tradeoffs for different DOPs allow the provisioning for on-premises server configurations, IaaS compute sizes, and PaaS service tier selections. Azure SQL Database [20] offers different service tiers that allow users to choose a specific number of (logical) cores [22, 25], or a range of cores [21], or a pre-configured size with cores bundled with other resources [26, 24].
Example. Figure 1 shows the speedup and costup of 22 TPC-H1000 queries (on non-partitioned data) on SQL Server, running on our system (details in Section 4.1), for different values of the total number of logical cores allowed for the workload. Here the speedup and costup are normalized with respect to the performance and cost for a single core (DOP = 1).

Choosing a DOP

Query DOP is a critical knob that affects query performance and resource utilization on multicore servers. However, choosing the optimal or near-optimal DOP is not easy. For well-parallelizable queries, performance improves as DOP increases, but diminishing returns are observed beyond some point, depending on the database instance and query characteristics. For other queries, increasing DOP could be detrimental to performance due to overheads such as increased memory requirements, which can lead to disk spills at high DOPs (Section 2).
Figure 1: Speedup and costup of the TPC-H1000 queries, normalized to single-core (DOP = 1) execution.

In this work, we treat the query plan as a whole, without using operator-level execution time information, for ML-based DOP tuning. An interesting outcome of our work is that approaches that use a simple plan representation are very effective in many cases when applied to DOP tuning, despite having been dismissed in previous relevant work such as QPPNet [17].
Our Contribution
In this work, we study the problem of automating fine-grained DOP tuning using ML-based techniques. To the best of our knowledge, this is the first work to study this problem in the context of an RDBMS. Our contributions can be summarized as follows:
• We emphasize the importance of DOP tuning and show the potential benefits of fine-grained per-query DOP tuning (Section 2).
• We formulate the DOP tuning problem as a regression task and study how to best represent each query as a feature vector (Section 3). We perform a detailed experimental evaluation of different ML models using well-known decision support benchmarks (TPC-H [32] and TPC-DS [28]) consisting of queries generated from different query templates, database instances, and scale factors.
• In order to evaluate the performance of the ML models, we use four generalization levels based on different application scenarios: (i) to queries from the same template, (ii) to queries from a new template, (iii) to instances of different scale, and (iv) to different instances and queries. We also use both task-agnostic and task-specific performance metrics to evaluate the utility of each model. Our experiments show that a simple yet proper featurization along with tree-ensemble models is effective for DOP tuning in most cases. We also experimentally analyze the causes of systematic failures, and discuss possible improvements as future work (Section 4).
2. DEGREE OF PARALLELISM
In SQL Server, the query Degree Of Parallelism (DOP) is the maximum number of hardware threads or logical processors that can be used at any time to execute the query. Intra-query parallel execution is inspired by the Volcano [7] operator model. The physical operators in a query plan can be both serial and parallel, but all parallel operators have the same DOP, which is fixed at the start of the query execution. Being able to determine the optimal DOP in one shot at (or before) query startup is important, since a suboptimal DOP cannot be fixed without terminating query execution and restarting it with a different DOP. SQL Server provides mechanisms to configure the DOP both at query optimization time and query startup time. In this work, we focus only on the latter option. The default DOP in the version of SQL Server we use, and for the machines used for our experiments, is 64. Thus, we choose to use the query performance at DOP 64 as the baseline in our experiments. More details regarding our experimental setup are in Section 4.1.
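For concreteness, SQL Server exposes the startup-time mechanism through the MAXDOP query hint. The sketch below, using the pyodbc driver, shows how a per-query DOP choice could be applied; the connection string, table, and DOP value are illustrative assumptions, not part of our experimental harness.

```python
import pyodbc

# Hypothetical connection string; adjust server/database/credentials.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=localhost;DATABASE=tpcds1000;Trusted_Connection=yes;"
)

def run_with_dop(cursor, query, dop):
    """Execute `query` with a fixed degree of parallelism.

    The MAXDOP query hint caps the number of logical cores that the
    parallel operators in this statement may use; it must be chosen
    before execution starts, matching the one-shot setting studied here.
    """
    cursor.execute(f"{query} OPTION (MAXDOP {dop});")
    return cursor.fetchall()

cur = conn.cursor()
rows = run_with_dop(cur, "SELECT COUNT(*) FROM store_sales", dop=40)
```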
Figure 2: Performance profiles for 4 queries on DS1000.

As we discuss in the introduction, each query performs differently as the DOP increases. Figure 2 shows the performance profiles for 4 different queries on TPC-DS with scale factor 1000 (see Section 4 for details about our experimental setup and workloads). Queries Qb and Qc are well-parallelizable and their performance improves as DOP increases, but with diminishing returns beyond some point. Query Qa sees no benefit from parallelism. Query Qd loses performance at high DOPs due to disk spills caused by an increased memory requirement that exceeds the default limit for a single query. Query Qd(m) shows how Qd performs differently as the DOP increases when it is intentionally given enough memory to avoid disk spilling, indicating that memory is an important factor that can influence query behavior at different DOPs. One can also observe that there is no single DOP choice that is optimal across all four queries.
Figure 3: Workload speedup over workload-level DOP=64 (default value) for different workload-level DOPs.

Figure 3 shows workload performance (details in Section 4.1) as a function of the DOP value, with the same DOP for all parallel queries, relative to the performance with the default DOP of 64 for our setup. For most workloads, performance improves with DOP, starting from DOP=1, and attains the peak at DOP values that are less than the default (64) and the maximum value (80). The optimal value is workload-specific and can result in substantial performance gains over the default value for some workloads.
Figure 4: Workload speedup (Y-Axis) over workload-level DOP=64 (default value) with per-query optimal DOPs, workload-optimal DOP, and workload-level DOP=40.

Further gains are possible by selecting optimal DOPs on a per-query basis. Figure 4 compares workload-level speedups possible with per-query DOP selections, per-workload DOP selections, and a static selection of DOP=40 (equal to the number of physical cores in our server). All speedup values are with respect to the static (default) value of DOP=64. In line with observations from Figure 3, DOP=40 speeds up some workloads (e.g., H 1000) but slows down others (e.g., H 300), and a per-workload optimal DOP can significantly improve performance (e.g., DS 300). All workloads also show additional speedups, ranging from 4% (H 300) to 15% (DS 1000), beyond what is possible with a workload-optimal selection. Just as the per-workload optimal DOP differs across workloads, the per-query optimal DOP values within each workload also show substantial variation. Figure 5 shows more variation for the DS workloads than the H workloads, which is due to a larger variety of query templates in DS. No workload has a single per-query optimal DOP value, indicating the potential for speedup with per-query DOP selections, as we observed in Figure 4. The average and median shift towards larger DOP values as scale factors, and consequently dataset sizes, increase.
Figure 5: Distribution of the percentage of queries with respect to the optimal query DOP.

Based on the above, we can observe the following:
• The behaviors of different queries for different DOPs are different (Figure 2).
• The optimal DOP choices for different workloads (Figure 3) and queries (Figure 5) are different.
• While fine-grained DOP tuning (on a per-query level) is more challenging compared to DOP tuning at the workload level, it can lead to greater performance improvement (Figure 4).
3. THE MAIN COMPONENTS
In this section, we formally describe our approach to solving the DOP tuning problem. We first formally state the ML task we are solving. Then, we describe how we featurize the input data to the ML models and present the ML models we consider in this work. Finally, we discuss different ways to evaluate how well our ML models generalize under different application scenarios.
Regression over Classification
At first glance, it is natural to think of tuning query DOP as a classification problem: given a query and a set of DOP values, classify the query into the DOP class that achieves optimal performance. We choose instead to cast the problem as a regression problem for two reasons. First, the performance difference of a query at neighboring DOP values can be very small (see, for example, Figure 2), which means that accurate classification may not be possible. Second, formulating DOP tuning as a regression problem allows us to use the learned model in applications beyond DOP selection, such as resource provisioning. We also note that the capability to predict the estimated time of a query at different DOPs might be helpful in order to perform DOP selection for concurrent query execution, which we consider as future work.
We are given a database instance $I$ over a schema $R$, and a workload $W = (P_1, P_2, \ldots, P_n)$ over the instance $I$, where each $P_i$ is a compiled query plan generated by the RDBMS. For our purposes, a query plan is a tree where each node is a physical operator (e.g., Index Scan, Hash Join, Sort); additionally, for each node we are given certain information about the operator execution (e.g., estimated rows read, whether it runs in parallel mode, etc.). Figure 6 shows an example of such a query plan. In the next section, we will show how to map the query plan $P_i$ to a feature representation $ftr(P_i)$.

For each plan $P_i$, we also have a measurement of the execution time of the plan $P_i$ over instance $I$ with DOP $d$, where $d$ comes from a set $D$ of DOP values of interest. We denote this time with $t_{(P_i,d)}$. Every such measurement corresponds to a training point $(x, y)$, where $x = (ftr(P_i), d)$ and $y = t_{(P_i,d)}$.

We can now cast our problem as a regression task. In particular, the goal is to use the training points in order to learn a function $f$ (model instance) of a model $m$ that minimizes the quantity

$$\sum_{i=1}^{n} \sum_{d \in D} L\big(f(ftr(P_i), d),\, t_{(P_i,d)}\big)$$

where $L$ is a loss function. In this paper, we will use as a loss function the mean squared error (MSE), $L(a, b) = (a - b)^2$. We will use $\hat{t}_{(P_i,d)} = f(ftr(P_i), d)$ to denote the estimated execution time using the learned model instance.
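As a concrete illustration of this formulation, the following is a minimal sketch of assembling the training points and fitting an off-the-shelf regressor with a squared-error objective. It is not the paper's exact pipeline: `ftr` (the featurization of the next subsection), `plans`, and the measured latencies are assumed inputs, and the DOP set is illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

D = [1, 2, 4, 8, 16, 20, 32, 40, 64, 80]  # illustrative DOP set

def build_dataset(plans, D, latency, ftr):
    """One training point per (plan, DOP): x = (ftr(P_i), d), y = t_(P_i,d)."""
    X, y = [], []
    for i, plan in enumerate(plans):
        for d in D:
            X.append(np.append(ftr(plan), d))
            y.append(latency[(i, d)])
    return np.array(X), np.array(y)

# `latency[(i, d)]` holds the measured runtime of plan i at DOP d.
# X, y = build_dataset(plans, D, latency, ftr)
# model = RandomForestRegressor(n_estimators=1000).fit(X, y)  # squared-error loss

def predict_latency(model, plan, d, ftr):
    """Estimated execution time t_hat_(P,d) = f(ftr(P), d)."""
    return model.predict(np.append(ftr(plan), d).reshape(1, -1))[0]
```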
Using the learned model instance $f$, we can now solve the DOP tuning problem. We distinguish two subproblems.

DOP Selection at Workload Level

In practice, users in many cases configure the DOP at the workload level, in which the same DOP is selected for all the queries in the workload. In this case, the task is to choose a single DOP,

$$d_{\text{workload}} = \operatorname*{argmin}_{d \in D} \sum_{i=1}^{n} f(ftr(P_i), d).$$

DOP Selection at Per-Query Level
Further performance gains are possible if we perform DOP selection on a per-query basis. In this case, the task is to select for each query plan $P_i$ a DOP

$$d_i = \operatorname*{argmin}_{d \in D} f(ftr(P_i), d).$$

Featurization

In this section, we describe how we map a query plan $P$ to a feature vector $ftr(P)$ of fixed dimension, in a similar manner to what is presented in [4]. Compared to Ding et al. [4], our featurization contains richer information, by which the characteristics of a given query plan can be more accurately captured. We propose different featurization alternatives for a query plan; we study the effects of different featurization choices in Section 4.2.

We next detail our featurization process. A simplified example of the process is depicted in Figure 6. Recall that $P$ is a tree, where each node in the tree consists of a physical operator along with its processing mode and information about its estimated runtime performance. Each physical operator can execute in different modes: (i) parallel or serial, and (ii) batch or row. Since different modes handle parallelism differently, it is critical that we encode them in the feature vector. For example, it is important to distinguish between operators that use a parallel implementation versus the ones with a non-parallel implementation. To achieve this, we construct a composite key for each operator instance in the form of

(operator, batch/row, parallel/serial, ⟨optional⟩)

The vector ⟨optional⟩ adds additional attributes that are only applicable for certain operators. For example, IsAdaptive and EstimatedJoinType apply only for the operator AdaptiveJoin. In addition, in the case where the same physical operator can be used to implement different logical operators (e.g., HashMatch can be used for join, aggregation, or union), ⟨optional⟩ encodes the logical type as well.

For each composite key, the feature vector allocates space to encode different types of execution measures; each measure is summed across all nodes in the plan with the same composite key:
• Cardinality (card): cardinality estimates computed by the optimizer, expressed in bytes (e.g., estimated output size in bytes).
• Cost (cost): cost-based estimates calculated by the optimizer's cost model (e.g., estimated CPU cost, estimated I/O cost).
• Count (count): takes value 1 if the operator has the composite key, otherwise 0.
• Weight (weight): each node is assigned a weight which is computed recursively from the leaf nodes. The weight of a leaf node is its estimated output size in bytes, while for a non-leaf node it is the sum of the products of the weights and heights of all its children. The weight feature encodes structural information (see [4]). A simplified sketch of this aggregation is given below.
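In the sketch below, the node field names and the `slot_of` key-to-slot mapping are illustrative placeholders rather than the product representation; the weight recursion mirrors the definition above.

```python
from collections import defaultdict

def composite_key(op):
    # `op` is a dict-like plan node; the ⟨optional⟩ part (e.g., the logical
    # type of a Hash Match, or AdaptiveJoin attributes) is appended only
    # when applicable. Field names here are illustrative.
    return (op["name"], op["mode"], op["parallel"], op.get("logical", ""))

def height(node):
    return 1 + max((height(c) for c in node["children"]), default=0)

def weight(node):
    # Leaf: estimated output size in bytes; non-leaf: sum over children of
    # the child's weight times the child's height (structural information).
    if not node["children"]:
        return node["est_bytes"]
    return sum(weight(c) * height(c) for c in node["children"])

def featurize(root, slot_of):
    # `slot_of` maps each composite key to its region of the feature
    # vector; every measure is summed over all nodes sharing the key.
    feats = defaultdict(float)
    def visit(node):
        k = slot_of[composite_key(node)]
        feats[(k, "card")] += node["est_bytes"]   # cardinality (bytes)
        feats[(k, "count")] += 1                  # operator count
        feats[(k, "weight")] += weight(node)      # structural weight
        # cost (optimizer CPU/IO estimates) would be aggregated analogously
        for c in node["children"]:
            visit(c)
    visit(root)
    return feats
```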
Figure 6: Example of featurizing a query plan. The information of physical operators with different composite keys, e.g., ("Hash Match", "batch", 1, "join") and ("Clustered Index Scan", "batch", 1), is encoded into different parts of the feature vector; per-node estimates such as EstimateRows, AvgRowSize, the execution mode, and the estimated output size (EstimateRows × AvgRowSize) feed the corresponding components.

We briefly describe the ML models we use in our approach.
Linear Regression (LR)
We consider Linear Regression as one of the baseline ML models due to its simplicity and efficiency. Specifically, we use its regularized version, elastic net, which linearly combines L1 and L2 regularization.
Random Forest (RF)
Random Forest is an ensemble learning method exploiting bagging as its ensemble algorithm. Since multiple decision trees are built/traversed independently during the model construction/inference phase, this allows for efficient training/testing in a multicore setting. The fact that the multiple trees are built on different randomly sampled subsets of the data also makes it more robust compared to a single decision tree, enabling it to consider complex interactions among different features while mitigating overfitting.
XGBoost
XGBoost [3] is an efficient implementation of gradient boosting, with a set of optimizations that exploit any available parallelism, achieve faster model convergence by using the second-order gradient approach (Newton boosting), and mitigate the overfitting issue of gradient boosting by imposing better regularization techniques.

Multi-Layer Perceptron (MLP)
The multi-layer perceptron is a class of feedforward artificial neural networks. In this work, we use a fully connected neural network with 8 hidden layers, in which each layer has 512 neurons. ReLU ($\max(0, x)$) is used as the activation function. We use the Adam optimizer for training [9].
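Under default library settings, plausible instantiations of the four models might look as follows. This is a sketch, not the tuned configurations from Section 4; in particular, the paper's MLP was trained on a GPU, so the scikit-learn MLPRegressor here is only a CPU stand-in with the same shape.

```python
from sklearn.linear_model import ElasticNet
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor
from xgboost import XGBRegressor

models = {
    # Elastic net: linear baseline combining L1 and L2 regularization.
    "LR": ElasticNet(),
    # Bagging over independently built trees; parallelizes across cores.
    "RF": RandomForestRegressor(n_estimators=1000, n_jobs=-1),
    # Gradient boosting with second-order (Newton) updates.
    "XGBoost": XGBRegressor(n_estimators=1000),
    # Fully connected net: 8 hidden layers of 512 ReLU units, Adam optimizer.
    "MLP": MLPRegressor(hidden_layer_sizes=(512,) * 8,
                        activation="relu", solver="adam"),
}

# for name, m in models.items():        # X_train, y_train as in Section 3.1
#     m.fit(X_train, y_train)
#     mae = abs(m.predict(X_test) - y_test).mean()
```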
Exploration of Other Models

Motivated by the promising results in [17] with explicit exploitation of the query plan structure and operator-level query latency, we investigated a rich set of DNN-based models that are designed to capture the spatial (e.g., nearby pixels in images) or structural (e.g., syntactic tree structure in natural languages) information of the data. Specifically, we explored models that have been applied successfully to a set of natural language processing (NLP) related tasks, including convolutional neural networks (CNN) [8], long short-term memory (LSTM) based recurrent neural networks (RNN) [34], and Tree-LSTM [35], regarding each operator instance as a special word. We observed that none of these models outperform MLP, the simplest DNN-based model, considering both runtime efficiency and prediction performance. We also implemented and explored QPPNet [17], including the operator-level elapsed time. QPPNet failed to learn the relation between query latency and DOP (see the example given in Figure 7). The failure could be attributed to a series of possible reasons, such as the difference of the studied system environment (e.g., PostgreSQL vs. SQL Server), the assumptions QPPNet relies on as discussed in the introduction, the not-well-defined per-operator latency in our study environment, and the different sizes of query plans (more details in Figure 18). Our models exploit a schema- and database-instance-agnostic featurization, while QPPNet is tested on schema-dependent features. A deeper investigation of the failure of applying QPPNet to DOP tuning is future work. Due to the above observations, we choose to use MLP as the DNN-based model representative and only present the comparison of its results with those of other models.
Figure 7: Example result of QPPNet for QPP. (a) Training convergence: QPPNet does converge after training for fewer than 25 epochs on 80% of the TPC-DS1000 plan-DOP pairs (Generalization Lvl1). (b) Speedup trend prediction: QPPNet is ignorant of different DOP values, always giving the same latency prediction for the same query regardless of DOP changes, while RF is able to capture the actual speedup curve accurately.
Generalization Levels

Different application scenarios exhibit different degrees of commonality, i.e., how much similarity is observed across a query workload. The commonality here refers both to the queries themselves (e.g., SQL statements and query plans) and to the input data. Given a specific application, having a good understanding of the degree of commonality is crucial when attempting to apply ML-based techniques for DOP tuning, since it helps practitioners understand the capabilities and limitations of the models that are learned from the underlying data and queries. More specifically, given training data (pairs of query plans plus runtimes), we should first try to answer the following questions:
1. How similar are future queries to queries in the training data?
2. How much can the input relations change (in terms of both schema and scale) compared to the input relations in the training data?
The answers to the above questions are different when considering different applications. For example, a supermarket chain might run the same set of queries daily to analyze its sales, and the size and distribution of the input data to the queries might be relatively stable across most days, with the exception of a few promotion days such as Black Friday and Christmas. On the other hand, analysts in different teams might run different queries on the same internal data for different data analysis tasks. In addition, the scale of the input data may increase after a period of time as new data arrives daily. To capture this differentiation, we categorize the training data and test data into four generalization levels, considering the query templates, the schema of the input tables, and the scale of the data. Suppose that we train our model on instance $I$ with schema $R$, where the workload $W$ is drawn from a distribution of templates $dist(T)$ (see Figure 8).

Generalization Lvl1: same templates & same schema & same scale
The test workload comes from the same distribution of templates $dist(T)$ and runs on the same database instance $I$. As an example, the training and test data are gathered from the executions of two sets of TPC-DS queries generated from the same set of templates but with different random seeds, running against the same instance.

Generalization Lvl2: different templates & same schema & same scale
The test workload comes from a different distribution of templates $dist(T')$, but runs on the same instance $I$. As an example, the training and test data are gathered from running two sets of TPC-DS queries generated from different subsets of templates, running against the same instance.

Generalization Lvl3: same templates & same schema & different scale
The test workload comes from the same distribution of templates $dist(T)$, but runs on an instance $I'$ of different scale using the same schema $R$. As an example, the training and test data are gathered from running two sets of TPC-DS queries generated from the same set of query templates, running against the 300 and 1000 scale instances, respectively.

Generalization Lvl4: different templates & different schema
The test workload comes from a different distribution of templates $dist(T'')$ and runs on a different instance $I'$ defined on a different schema $R'$. As an example, the training and test data are gathered from running two sets of queries generated from TPC-DS templates and TPC-H templates, running against the corresponding database instances.
Figure 8: Overview of the training setup and generalization levels. We first train a model on a workload distribution and database instance; then, we test it on the different generalization levels. We use $dist(T)$ to denote a distribution of queries over a set of templates $T$.

We assume a fixed hardware configuration and do not study generalization to different hardware configurations in this work. Cloud platforms generally have a restricted set of known hardware configurations on which services are deployed, and separate models could be trained and deployed for each configuration.
4. EXPERIMENTAL EVALUATION
In this section, we present and discuss the results of our experimental evaluation. We design our experiments with the goal of answering the following questions:
• What are the effects of different featurization alternatives on model performance?
• How do different ML models perform at different generalization levels?
• What are the root causes behind wrong predictions of the ML models in our problem formulation?
• What trade-offs should be considered when choosing between different ML models?

At a high level, our experiments show that the featurization using the count, card, and weight features (excluding cost) leads to the overall best model prediction performance. RF shows the best overall performance considering both task-agnostic and task-specific metrics. However, when there is a significant distribution mismatch between the test queries and the training queries, no model is able to perform DOP selection that gives performance close to optimal. One concrete cause of such distribution mismatch is a difference in memory requirements. Comparatively speaking, hyper-parameter tuning of XGBoost and MLP is time-consuming, and additional regularization is hard to impose for better generalization (e.g., to achieve better performance on test data) when the training data is limited; RF is easier to configure and more robust, while it has relatively higher model inference overhead.

System Configuration
Our queries were executed on a dual-socket Intel Xeon Broadwell server with a total of 40 physical cores and 512 GB main memory. We have hyper-threading enabled, resulting in a total of 80 logical cores, which, consequently, is the maximum DOP possible on this server. We run SQL Server 2019 CTP 2.0 Developer edition. By default, SQL Server chooses a DOP value of 64 for queries running on this server. SQL Server sets aside part of the available memory for the buffer pool and shared objects; the remainder can be granted as working memory to queries, up to a certain limit (by default, 25% of the working memory).

Workload and Datasets
We use both the TPC-H and TPC-DS workloads for our experiments, with scale factors 100, 300, and 1000 for each. Table 1 gives the detailed information for the dataset construction.

Table 1: Workload and Dataset Statistics

Workload       Templates   Queries   Plan-DOP Pairs
TPC-H 100      22          1346      13460
TPC-H 300      22          1346      13460
TPC-H 1000     22          1346      13460
TPC-DS 100     103         841       8410
TPC-DS 300     106         853       8530
TPC-DS 1000    103         841       8410

For each TPC-H query template, we generated queries using 100 different random seeds, with duplicate queries being removed. TPC-DS queries are generated in a similar manner, but with 10 different random seeds used for each query template. We execute each query in the workloads for 10 different DOP values. We focus on warm-cache query execution, since caching is almost always used in practice whenever it is feasible. Thus, to obtain each data point, we perform one cold-cache query execution followed by ten warm-cache runs. The average time of the warm-cache runs is used as the ground truth of the latency for the executed query. For the purpose of our study, we execute one query at a time. We also only consider queries executed using query plans with at least one physical operator that uses intra-parallelism. We use the clustered columnstore index [19] organization for each database.
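A sketch of this measurement protocol (one discarded cold run, ten warm runs averaged) follows, reusing the hypothetical run_with_dop helper from the Section 2 sketch; the DOP set listed is illustrative, as the exact set is not reproduced here.

```python
import time

DOPS = [1, 2, 4, 8, 16, 20, 32, 40, 64, 80]  # illustrative set of 10 values

def measure_latency(cursor, query, dop, warm_runs=10):
    run_with_dop(cursor, query, dop)          # cold-cache run, discarded
    total = 0.0
    for _ in range(warm_runs):
        start = time.perf_counter()
        run_with_dop(cursor, query, dop)      # warm-cache run
        total += time.perf_counter() - start
    return total / warm_runs                  # ground-truth latency t_(P,d)

# latency = {(i, d): measure_latency(cur, q, d)
#            for i, q in enumerate(queries) for d in DOPS}
```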
Training and Testing Splits

We perform 5-fold cross-validation to evaluate the model performance for each generalization level. For Generalization Lvl1, each fold uses 80% of the TPC-DS 1000 queries for training and the remaining 20% for testing. For Generalization Lvl2, the data is split based on the query templates: each fold uses 80% of the query templates from TPC-DS 1000 for training, and the remaining 20% for testing. The folds for Generalization Lvl3 and Generalization Lvl4 are as follows.
Table 2: TPC Queries in Training/Testing Folds for Generalization Lvl3 and Generalization Lvl4.

Fold     Generalization Lvl3   Generalization Lvl4
Fold 1   DS1000/DS300          DS1000/H1000
Fold 2   DS1000/DS100          DS1000/H300
Fold 3   DS300/DS100           DS1000/H100
Fold 4   DS100/DS300           DS300/H1000
Fold 5   DS100/DS1000          DS300/H300
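For Lvl1 and Lvl2, the folds can be produced with standard utilities; grouping by query template, as sketched below with scikit-learn (an assumption about tooling, not the paper's stated implementation), keeps all queries from a template on one side of the Lvl2 split.

```python
from sklearn.model_selection import KFold, GroupKFold

# Lvl1: split queries 80/20; templates may overlap between train and test.
lvl1_folds = KFold(n_splits=5, shuffle=True, random_state=0)

# Lvl2: hold out whole templates. `template_ids[i]` (assumed available)
# names the template behind row i, so no template spans both sides.
lvl2_folds = GroupKFold(n_splits=5)

# for train_idx, test_idx in lvl2_folds.split(X, y, groups=template_ids):
#     model.fit(X[train_idx], y[train_idx])
#     mae = abs(model.predict(X[test_idx]) - y[test_idx]).mean()
```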
Hyper-Parameter Tuning
We tune the selected models in a standard manner, similar to [4], based on the mean absolute error in cross-validation on the training data. For the tree-ensemble models (XGBoost and RF), we limit the maximum number of trees to 1000. We note that spending additional effort tuning the parameters of different models might lead to a narrower gap between the evaluation metrics of these models. However, the main goal of this study is to explore the usability and trade-offs present in the compared techniques in a standard training configuration. Hence, for each model, we stop tuning the hyper-parameters (e.g., searching a larger hyper-parameter space) when reasonably satisfactory results are observed.
Comparison Metrics
To compare the performance of different ML models, we use both task-agnostic metrics (e.g., mean absolute error) and task-specific metrics that are tied to a specific application (e.g., workload throughput at optimal DOP selection). We provide the detailed definitions of these metrics in Table 3.
Effect of Featurization Choices

To compare the featurization alternatives, we run RF with all four features F = {card, cost, count, weight}, and also excluding one feature at a time (F\{card}, F\{cost}, F\{count}, F\{weight}). Tables 7-14 show the distributions of relative prediction error (RPE) and speedup prediction error (SPE). We observe that for Generalization Lvl1 there is little difference across the featurization alternatives (Tables 7, 8). For the other generalization levels, F\{cost} leads to the best performance: in the low relative error range, the fraction of plan-DOP pairs covered by F\{cost} is larger than for the other featurization alternatives. We also observe that including all features F is rarely optimal, while having count and weight is beneficial most of the time. As a result of our analysis, we present our experimental results using the features {count, card, weight} for the remaining experiments.

Model Overheads

In this section, we evaluate the training and inference overheads for the ML models we tested. It is critical that both overheads are small for the model to be usable in practice. We report below (Table 5) the training and inference overheads for Generalization Lvl1.

Table 5: Model overheads for Generalization Lvl1.
Model      Training (sec)   Inference (sec)
LR         -                -
MLP        81               0.009772
RF         17               0.419193
XGBoost    23               0.0013

The training overhead includes the time spent on preprocessing (e.g., featurization, training/test data splitting, etc.). MLP is trained on an NVIDIA GeForce GTX 1080 Ti, while the other models are trained on the same machine on which the queries are executed, with all 40 physical cores available. The inference time of all models is measured by running each model on a single data point using a single CPU core. We observe that the inference time of all models except for RF is very small, while RF spends noticeably more time (∼400 ms) on inference due to the traversal of a large number of deep trees. However, our work focuses on relatively long-running queries (e.g., OLAP), for which DOP tuning is mostly beneficial, and thus we see the inference overhead of RF as insignificant.
Mean Absolute Error
MAE represents the average absolute difference between the actual query latency and the predicted query latency over all plan-DOP pairs. A nice property of MAE is that it has the same unit as the regression target time, as also noted in [17]. In Table 4, we observe that except for LR, all other models exhibit relatively small MAEs on the training data after learning across all generalization levels, suggesting the necessity of nonlinearity and large model capacity. Among these models, XGBoost and RF show the best performance considering only MAEs on the training data, which are close to zero, suggesting that the number and depth of the trees in the tree-ensemble models are large and deep enough. Switching to test data (Table 6), the comparison between the results of LR and the other models in Generalization Lvl1 is relatively consistent with what is observed on the training data. However, we see much larger gaps across the different generalization levels in all models, and this suggests the different degrees of difficulty of applying ML models for DOP tuning at different generalization levels. The MAEs of the different testing folds in each generalization level suggest that RF generalizes better than XGBoost and MLP. This result can be explained by the fact that RF is less sensitive to overfitting compared to XGBoost and MLP, as well as easier to tune. While it is possible to apply better regularization techniques to XGBoost and MLP to further reduce the generalization error, we note that the current architectures/hyperparameters of XGBoost, MLP, and RF (which are tuned via cross-validation on the training data) show comparable performance.

Table 3: Comparison Metrics
Category: Task-Agnostic
• Mean Absolute Error: $MAE(W) = \frac{1}{|W||D|} \sum_{i=1}^{n} \sum_{d \in D} |\hat{t}_{(P_i,d)} - t_{(P_i,d)}|$
• Relative Prediction Error: $RPE(P) = \frac{1}{|D|} \sum_{d \in D} \frac{|\hat{t}_{(P,d)} - t_{(P,d)}|}{t_{(P,d)}}$
• Speedup Prediction Error: $SPE(P) = \frac{1}{|D|} \sum_{d \in D} \left| \frac{\hat{t}_{(P,d)}}{\hat{t}_{(P,1)}} - \frac{t_{(P,d)}}{t_{(P,1)}} \right|$

Category: Task-Specific
• Throughput with per-Query DOP: $TQ(W) = \frac{|W|}{\sum_{i=1}^{n} \min_{d \in D} \hat{t}_{(P_i,d)}}$
• Throughput with per-Workload DOP: $TW(W) = \frac{|W|}{\min_{d \in D} \sum_{i=1}^{n} \hat{t}_{(P_i,d)}}$
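A direct numpy transcription of these metrics follows (a sketch; the array layout, with column 0 holding the DOP = 1 measurements, is an assumption about data organization):

```python
import numpy as np

# t, t_hat: arrays of shape (n_plans, n_dops) with actual and predicted
# latencies; column 0 corresponds to DOP = 1.
def mae(t_hat, t):
    return np.abs(t_hat - t).mean()

def rpe(t_hat, t):
    # Per-plan mean relative error over all DOP values.
    return (np.abs(t_hat - t) / t).mean(axis=1)

def spe(t_hat, t):
    # Per-plan error between predicted and actual speedup curves, both
    # normalized by the corresponding DOP = 1 latency.
    return np.abs(t_hat / t_hat[:, [0]] - t / t[:, [0]]).mean(axis=1)

def tq(t_hat):
    # Throughput when each query runs at its (predicted) best DOP.
    return t_hat.shape[0] / t_hat.min(axis=1).sum()

def tw(t_hat):
    # Throughput at the single best workload-level DOP.
    return t_hat.shape[0] / t_hat.sum(axis=0).min()
```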
[Table 4: 5-Fold Mean Absolute Error on Training Data for LR, MLP, RF, and XGBoost across Generalization Lvl1-Lvl4; numeric values omitted.]
Relative Prediction Error
We show the distribution of the per-query RPE of each model at the different generalization levels in Figure 9. Not surprisingly, LR consistently exhibits high RPE (> 0.5) for most of the plan-DOP pairs across all generalization levels, due to its inability to capture the complex relationship between the plan-DOP pair features and the query latency. Looking at the relative error distributions of the other models, we observe that in most cases RF exhibits lower RPE for a larger portion of the test plan-DOP pairs compared to the other models.
Speedup Prediction Error
While relative error might be a good metric for evaluating the accuracy of the query latency prediction, it does not directly reflect the speedup of the query execution at different DOP values. Intuitively, when selecting the DOP for a single query execution, users should know the pattern or trend of the performance curve of the query at the DOP values of interest. We look at a simple metric called speedup prediction error (Table 3) that captures this property. Figure 10 presents the per-query SPE distribution of each model considering the different generalization levels. We observe that RF consistently shows the best performance for SPE. The relationships between relative error/speedup error and the query latency (at DOP 40) are shown in Figures 13 and 14. We observe that large errors are less likely to occur for long-running queries for both metrics. This observation is positive, since intra-parallelism is more beneficial to long-running queries.
We now consider the task, defined in Section 3.1, of selecting the DOP at the per-query level. We use the metric TQ(W) as defined in Table 3. We compare TQ(W) given by the different models against the query throughputs given by executing queries at DOP 40, DOP 80, the actual optimal DOP (OPT) for each $P_i \in W$, and the workload-optimal DOP (WORKLOAD) for all $P_i \in W$, with all results being normalized by the throughput at DOP 64 (the default value). The comparison results at the different generalization levels are shown in Figure 11. We first observe that in Generalization Lvl1, all models except LR lead to query throughput close to OPT.

At Generalization Lvl2, we observe that no model is able to predict DOPs that lead to throughput comparable to WORKLOAD in most of the cases. Though XGBoost seems to give performance much closer to WORKLOAD in some cases (Folds 2-5), the results are more likely to be accidental. After investigation, we have found that the root cause behind this performance gap is the mismatch between the data distributions of the training and test data: non-trivial disk spilling is observed in queries from the same query template that Qd (as shown in Figure 2) comes from (generated from different random seeds), due to the increase of memory requirements at high DOP values. We remove such Qd-type queries and the results are shown in Figure 12. Though the performance gain opportunity itself becomes significantly smaller (the performance gap between the actual optimal DOP values and the default DOP), in most cases most models do not select DOPs that result in performance regression, and the best-performing model RF often selects DOPs giving performance better than WORKLOAD.

At Generalization Lvl3, the performance given by the models except for LR often matches OPT or WORKLOAD (Folds 1-4). For Fold 5, since the models are trained on plan-DOP pairs generated by the TPC-DS 100 queries and database instance and tested on the TPC-DS 1000 queries and database instance (in which disk spilling is more severe than in TPC-DS 100), the distribution mismatch (similar to the previous reasoning for Generalization Lvl2) results in performance regression.

Figure 9: Relative Prediction Error Distribution: Percentage (Y-Axis) vs. RPE (X-Axis), for (a) Generalization Lvl1 through (d) Generalization Lvl4.
Figure 10: Speedup Prediction Error Distribution: Percentage (Y-Axis) vs. SPE (X-Axis), for (a) Generalization Lvl1 through (d) Generalization Lvl4.
[Table 6: 5-Fold Mean Absolute Error on Test Data for LR, MLP, RF, and XGBoost across Generalization Lvl1-Lvl4; numeric values omitted.]
Figure 11: DOP Selection at Individual Query Level: query throughput over DOP 64 (Y-Axis) for Folds 1-5, at (a) Generalization Lvl1 through (d) Generalization Lvl4 (series: LR, MLP, RF, XGBoost, OPT, WORKLOAD, DOP 40, DOP 80).
Figure 12: Generalization Lvl2M (Lvl2 with Qd-type queries removed): query throughput over DOP 64 (Y-Axis) for Folds 1-5.
Figure 13: RPE (Y-axis) vs. Query Latency at DOP 40 (X-axis), for (a) Generalization Lvl1 through (d) Generalization Lvl4.
Figure 14: SPE (Y-axis) vs. Query Latency at DOP 40 (X-axis), for (a) Generalization Lvl1 through (d) Generalization Lvl4.

At Generalization Lvl4, the opportunity for throughput improvement over DOP 64 is small (the gap between OPT and the throughput at DOP 64 is only about 5%), since the TPC-H queries,
from which the test data is generated, scale very well with increasing DOP on SQL Server. LR always chooses DOP 80 due to its linearity. At first, it may seem that the poor performance shown by the other models is due to the fact that the degree of commonality between the training and testing queries in Generalization Lvl4 is low; however, it is noticeable that there is no distinguishable difference between the SPE distributions of Generalization Lvl3 and Generalization Lvl4 (Figures 10c, 10d), while RF does a fairly good job in DOP selection for Generalization Lvl3. After careful investigation, we find that for most of the queries, RF and XGBoost are able to capture the speedup trend accurately, and the relatively poor performance seems more likely to be caused by the specific characteristics of the TPC-H queries (the testing queries in Generalization Lvl4): they scale really well with increasing DOP (see Figure 16e), implying a large performance gap between query executions at different DOP values and small latency at high DOP (e.g., nearly 30X speedup at DOP 80 compared to the performance at DOP 1). Keeping this fact in mind and considering the DS1000/H1000 training/testing split (Fold 2), RF and XGBoost failed to select the near-optimal DOP for two relatively long-running queries in the testing split (RF selects DOP 20 and XGBoost selects DOP 16, while the actual optimal DOPs are 80), resulting in the performance gaps observed in Figure 11.

Figure 15: Predicted Performance at Per-Query Optimal DOP: query throughput over DOP 64 (Y-Axis) vs. DOP (X-Axis), for (a) Generalization Lvl1, (b) Lvl2, (c) Lvl2M, (d) Lvl3, (e) Lvl4.
Figure 16: Performance at Predicted Per-Query Optimal DOP: query throughput over DOP 64 (Y-Axis) vs. DOP (X-Axis), for the same five settings.
Figure 17: Predicted Performance at Workload-Level DOP: query throughput over DOP 64 (Y-Axis) vs. DOP (X-Axis), for the same five settings.
While making the optimal choice of DOP configuration for each individual query is useful, selecting the optimal DOP at the workload level, although sub-optimal, could ease DOP configuration. Figure 17 visualizes the actual speedup curve and the predicted speedup curves given by the different models with increasing degree of parallelism. The speedup curve can be used for selecting the DOP for a given set of queries based on a resource budget, or by simply choosing the optimal point without any constraints. For example, looking at Figure 17a, the speedup from increasing the DOP from 32 to 40 is insignificant, but the hardware provisioning cost could increase by 25% (assuming the hardware provisioning cost is proportional to the number of logical cores).

We observe that the best-performing model RF is able to approximately capture the trend of the actual speedup curves in Generalization Lvl1, 2M, and 3, considering both the per-query optimal DOP (Figure 15) and the workload-level DOP (Figure 17). Besides, RF is also able to select per-query level DOPs for a given workload, leading to performance that is close to that of running each query at the actual optimal DOP, given the specified maximum DOP possible (Figure 16). However, there is a large gap between the actual speedup curve and the curve predicted by RF in Generalization Lvl2 and Generalization Lvl4, where XGBoost performs better in predicting the speedup curve in Generalization Lvl4 (Figures 15e, 17e), in contrast to our observations for the task-agnostic metrics. This gap is caused by heterogeneous predictions on different plan-DOP pairs in the test data: the prediction output of a few plan-DOP pairs dominates the predicted performance, and hence the actual performance of the other plan-DOP pairs is not accurately reflected. On the other hand, XGBoost fails to capture the performance difference (e.g., query performance improvements) for a few plan-DOP pairs that are important for DOP selection (Figure 16e), though these are not critical in predicting the overall speedup trend (Figures 15e, 17e). Based on this observation, we argue that merely looking at the task-agnostic metrics might not always be the best way to evaluate the performance of a model; rather, the task-specific ones should be considered together. For example, RF does well in terms of per-query DOP selection in Generalization Lvl4, while XGBoost is preferred when predicting the view of cost-performance trade-offs is important. Meanwhile, a better learning objective, one that forces the model to more accurately learn the absolute and relative differences between the query latencies at different DOPs, is critical to making ML-based techniques for query DOP tuning more feasible.
5. RELATED WORK
Query Performance Prediction (QPP) is a well-studied problem [1, 6, 13, 37, 17], but not in a setting with intra-query parallelism. Early work in DOP management [18] studied the exploitation of intra-operator parallelism in a multi-query environment for shared-nothing parallel database systems using a simplified simulation model. More recent works have developed analytical models for query parallelism in BigData execution frameworks [33]. To the best of our knowledge, this is the first paper to do a comparative study of ML techniques for QPP for multithreaded query execution and to evaluate query DOP tuning for a commercial-grade RDBMS on a modern multicore server.
ML for Query Performance Prediction
Prior work has explored the use of per-operator models [1, 13, 17] for QPP, where per-operator latency estimates can be added together or composed to derive query-level estimates.
[Table 7: RF distribution of Relative Prediction Error (RPE) for featurization alternatives F, F\{card}, F\{cost}, F\{count}, F\{weight} (Generalization Lvl1); values omitted.]
[Table 8: RF distribution of Speedup Prediction Error (SPE) for the same featurization alternatives (Generalization Lvl1); values omitted.]
[Tables 9-14: corresponding RPE and SPE distributions for the other generalization levels; values omitted.]
Figure 18: TPC-DS1000 plan size and depth distributions (plan percentage vs. number of operators, and vs. plan tree depth).

In our environment, however, per-operator latency is not well defined, which prevents such approaches from extracting and composing operator-level estimates to derive query-level estimates, as Figure 7 suggests.
Automated Database Tuning
Prior work has also explored techniques for automatically tuning database systems to improve query performance [2, 5, 31, 36]. Our work on DOP tuning can benefit such approaches by providing what-if analyses for different DOP settings, thereby potentially eliminating the need for runtime exploration of this parameter space.
Dynamic Parallelism
Recent work on new database systems uses dynamic/elastic parallelism, where the parallelism can be increased or decreased at runtime [12, 30]. Although our ML-based DOP selection approach for performance improvement focuses on a setting with static parallelism, we believe that our approach of providing cost-benefit tradeoff estimates would be useful for resource provisioning in elastic systems as well.
ML for Query Optimization
Conventional query optimization in RDBMSs is based on manually constructed cost models and rule-based heuristics. Such strategies usually make certain assumptions about the plan search space (e.g., left-deep trees) and rely heavily on the estimation of statistics required by the cost models, such as cardinality estimates, which can be quite inaccurate, leading to query plans that are far from optimal. In addition, query optimizers built this way never learn from past experience, and the same query plan with poor performance can be repeatedly generated [16]. Recent research efforts attempt to enhance and even replace the core optimizer components seen in most RDBMSs today using deep learning (DL) techniques. For example, [15, 11] exploit deep reinforcement learning (DRL) to optimize join queries, while [10, 29] propose to use DRL to improve cardinality estimation. Neo, a query optimizer built on a set of DL-based models [14], has been shown to even outperform state-of-the-art commercial optimizers in some cases. Compared to this line of research, our work focuses on optimizing query performance outside of the RDBMS (an example of resource tuning) rather than touching the optimizer internals, similar in scope to [31].
6. DISCUSSION
Our comparative exploration of ML techniques for tuning DOP in SQL Server indicates that performance gains are possible by using simple tree-ensemble models along with query plan featurization. The performance profiles of queries showing different behaviors at different DOP values suggest that DOP tuning is an important yet challenging problem. We have also identified a set of important issues that raise concerns when applying ML-based techniques for DOP tuning in practice, along with possible improvements to this work:
• Memory considerations: in our present study, we run queries in an environment in which the available memory is fixed (i.e., static). The utility of ML models learned without considering memory information is limited, preventing their use in new environments with different memory requirements (e.g., machines with different memory sizes, or concurrent query execution in which each stream is under strict memory constraints). We plan to address this deficiency as part of improvements to this work.
• Concurrent query execution: we plan to utilize the single-query-based ML models for DOP tuning in a concurrent environment via an analytical approach, expanding the applicability of our study.
• Model interpretability: the results given by a black-box ML algorithm should be interpretable in a way similar to that of a rule-based algorithm. DBAs should be able to see the decision path suggested by the algorithm in order to decide whether to accept or reject the model output. Meanwhile, interpretability could also help with identifying possible implementation issues inside the targeted RDBMS (e.g., poor parallelism could be caused by the behavior of operators of certain types).
• Studying the effects of hyper-threading: since enabling or disabling hyper-threading will affect query performance characteristics, it is important to study such differences and perhaps even encode this piece of information into the ML models.
7. CONCLUSION
We studied the problem of tuning the degree of parallelism via statistical machine learning. We focused our evaluation on Microsoft SQL Server, a popular commercial RDBMS that allows an individual query to execute on multiple cores. In our study, we cast the problem of DOP tuning as a regression task, and examined how several popular ML models can help with query performance prediction in a multi-core setting. We performed an extensive experimental study comparing these models against a list of performance metrics, and tested how well they generalize in different settings: (i) to queries from the same template, (ii) to queries from a new template, (iii) to instances of different scale, and (iv) to different instances and queries. Our experimental results show that a simple featurization of the input query plan that ignores cost model estimations can accurately predict query performance, capture the speedup trend with respect to the available parallelism, as well as help with automatically choosing an optimal per-query DOP.
8. ACKNOWLEDGEMENT
We thank Alan Halverson for insightful discussions about the DOP problem and about the baseline experimental setup during the initial phase of this work. Zhiwei was supported by Microsoft's Gray Systems Lab (GSL) through a summer internship and Microsoft RA-ships for this research work. We thank Carlo Curino and other members of GSL, and members of the SQL Server team, for discussions and feedback on this work. The manuscript was improved by detailed and thoughtful comments from Remzi Arpaci-Dusseau and other, anonymous, reviewers.
9. REFERENCES

[1] M. Akdere, U. Çetintemel, M. Riondato, E. Upfal, and S. B. Zdonik. Learning-based query performance modeling and prediction. In ICDE, pages 390-401. IEEE, 2012.
[2] S. Chaudhuri and V. Narasayya. Self-tuning database systems: A decade of progress. In Proceedings of the 33rd International Conference on Very Large Data Bases, VLDB '07, pages 3-14. VLDB Endowment, 2007.
[3] T. Chen and C. Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785-794. ACM, 2016.
[4] B. Ding, S. Das, R. Marcus, W. Wu, S. Chaudhuri, and V. R. Narasayya. AI meets AI: Leveraging query executions to improve index recommendations. In Proceedings of the 2019 International Conference on Management of Data, pages 1241-1258. ACM, 2019.
[5] S. Duan, V. Thummala, and S. Babu. Tuning database configuration parameters with iTuned. Proc. VLDB Endow., 2(1):1246-1257, Aug. 2009.
[6] A. Ganapathi, H. Kuno, U. Dayal, J. L. Wiener, A. Fox, M. Jordan, and D. Patterson. Predicting multiple metrics for queries: Better decisions enabled by machine learning. In ICDE, pages 592-603. IEEE, 2009.
[7] G. Graefe. Encapsulation of parallelism in the Volcano query processing system. ACM SIGMOD Record, 19(2):102-111, 1990.
[8] Y. Kim. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882, 2014.
[9] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[10] A. Kipf, T. Kipf, B. Radke, V. Leis, P. Boncz, and A. Kemper. Learned cardinalities: Estimating correlated joins with deep learning. arXiv preprint arXiv:1809.00677, 2018.
[11] S. Krishnan, Z. Yang, K. Goldberg, J. Hellerstein, and I. Stoica. Learning to optimize join queries with deep reinforcement learning. arXiv preprint arXiv:1808.03196, 2018.
[12] V. Leis, P. Boncz, A. Kemper, and T. Neumann. Morsel-driven parallelism: A NUMA-aware query evaluation framework for the many-core age. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, SIGMOD '14, pages 743-754, New York, NY, USA, 2014. ACM.
[13] J. Li, A. C. König, V. Narasayya, and S. Chaudhuri. Robust estimation of resource consumption for SQL queries using statistical techniques. Proceedings of the VLDB Endowment, 5(11):1555-1566, 2012.
[14] R. Marcus, P. Negi, H. Mao, C. Zhang, M. Alizadeh, T. Kraska, O. Papaemmanouil, and N. Tatbul. Neo: A learned query optimizer. arXiv preprint arXiv:1904.03711, 2019.
[15] R. Marcus and O. Papaemmanouil. Deep reinforcement learning for join order enumeration. In Proceedings of the First International Workshop on Exploiting Artificial Intelligence Techniques for Data Management, page 3. ACM, 2018.
[16] R. Marcus and O. Papaemmanouil. Towards a hands-free query optimizer through deep learning. arXiv preprint arXiv:1809.10212, 2018.
[17] R. Marcus and O. Papaemmanouil. Plan-structured deep neural network models for query performance prediction. arXiv preprint arXiv:1902.00132, 2019.
[18] M. Mehta and D. J. DeWitt. Managing intra-operator parallelism in parallel database systems. In Proceedings of the 21st International Conference on Very Large Data Bases, VLDB '95, pages 382-394, San Francisco, CA, USA, 1995.
[19] Microsoft. Columnstore indexes: Overview. https://docs.microsoft.com/en-us/sql/relational-databases/indexes/columnstore-indexes-overview?view=sql-server-ver15, 2018.
[20] Microsoft. Azure SQL Database - Platform as a Service. https://docs.microsoft.com/en-us/azure/sql-database/sql-database-paas, 2019.
[21] Microsoft. Azure SQL Database serverless (preview). https://docs.microsoft.com/en-us/azure/sql-database/sql-database-serverless, 2019.
[22] Microsoft. Choose among the vCore service tiers and migrate from the DTU service tiers. https://docs.microsoft.com/en-us/azure/sql-database/sql-database-service-tiers-vcore, 2019.
[23] Microsoft. Choose the right SQL Server option in Azure. https://docs.microsoft.com/en-us/azure/sql-database/sql-database-paas-vs-sql-server-iaas, 2019.
[24] Microsoft. Resource limits for single databases using the DTU-based purchasing model. https://docs.microsoft.com/en-us/azure/sql-database/sql-database-dtu-resource-limits-single-databases, 2019.
[25] Microsoft. Resource limits for single databases using the vCore-based purchasing model. https://docs.microsoft.com/en-us/azure/sql-database/sql-database-vcore-resource-limits-single-databases, 2019.
[26] Microsoft. Service tiers in the DTU-based purchase model. https://docs.microsoft.com/en-us/azure/sql-database/sql-database-service-tiers-dtu, 2019.
[27] Microsoft. SQL Server 2019. 2019.
[28] R. O. Nambiar and M. Poess. The making of TPC-DS. In Proceedings of the 32nd International Conference on Very Large Data Bases, pages 1049-1058. VLDB Endowment, 2006.
[29] J. Ortiz, M. Balazinska, J. Gehrke, and S. S. Keerthi. Learning state representations for query optimization with deep reinforcement learning. arXiv preprint arXiv:1803.08604, 2018.
[30] J. M. Patel, H. Deshmukh, J. Zhu, N. Potti, Z. Zhang, M. Spehlmann, H. Memisoglu, and S. Saurabh. Quickstep: A data platform based on the scaling-up approach. Proc. VLDB Endow., 11(6):663-676, Feb. 2018.
[31] A. Pavlo, G. Angulo, J. Arulraj, H. Lin, J. Lin, L. Ma, P. Menon, T. C. Mowry, M. Perron, I. Quah, et al. Self-driving database management systems. In CIDR, volume 4, page 1, 2017.
[32] M. Poess and C. Floyd. New TPC benchmarks for decision support and web commerce. ACM SIGMOD Record, 29(4):64-71, 2000.
[33] K. Rajan, D. Kakadia, C. Curino, and S. Krishnan. PerfOrator: Eloquent performance models for resource optimization. In Proceedings of the Seventh ACM Symposium on Cloud Computing, SoCC '16, pages 415-427, New York, NY, USA, 2016. ACM.
[34] M. Sundermeyer, R. Schlüter, and H. Ney. LSTM neural networks for language modeling. In Thirteenth Annual Conference of the International Speech Communication Association, 2012.
[35] K. S. Tai, R. Socher, and C. D. Manning. Improved semantic representations from tree-structured long short-term memory networks. arXiv preprint arXiv:1503.00075, 2015.
[36] D. Van Aken, A. Pavlo, G. J. Gordon, and B. Zhang. Automatic database management system tuning through large-scale machine learning. In Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD '17, pages 1009-1024, New York, NY, USA, 2017. ACM.
[37] W. Wu, Y. Chi, H. Hacıgümüş, and J. F. Naughton. Towards predicting query execution time for concurrent and dynamic database workloads. Proc. VLDB Endow., 2013.