A Comparative Exploration of ML Techniques for Tuning Query Degree of Parallelism
Zhiwei Fan, Rathijit Sen, Paraschos Koutris, Aws Albarghouthi
University of Wisconsin-Madison; Microsoft Gray Systems Lab
{zhiwei, paris, aws}@cs.wisc.edu, {rathijit.sen}@microsoft.com

ABSTRACT
There is a large body of recent work applying machine learning (ML) techniques to query optimization and query performance prediction in relational database management systems (RDBMSs). However, these works typically ignore the effect of intra-parallelism, a key component used to boost the performance of OLAP queries in practice, on query performance prediction. In this paper, we take a first step towards filling this gap by studying the problem of tuning the degree of parallelism (DOP) via ML techniques in Microsoft SQL Server, a popular commercial RDBMS that allows an individual query to execute using multiple cores. In our study, we cast the problem of DOP tuning as a regression task, and examine how several popular ML models can help with query performance prediction in a multi-core setting. We explore the design space and perform an extensive experimental study comparing different models against a list of performance metrics, testing how well they generalize in different settings: (i) to queries from the same template, (ii) to queries from a new template, (iii) to instances of different scale, and (iv) to different instances and queries. Our experimental results show that a simple featurization of the input query plan that ignores cost model estimations can accurately predict query performance, capture the speedup trend with respect to the available parallelism, as well as help with automatically choosing an optimal per-query DOP.
1. INTRODUCTION
The rise of cloud-computing platforms has offered new capabilities and benefits to users, including the capability to provision the appropriate amount of resources necessary for a given task. In order to effectively utilize this capability, Infrastructure-as-a-Service (IaaS) and Platform-as-a-Service (PaaS) users need tools and techniques to quantify the trade-offs between resource costs and performance benefits. Such techniques are also invaluable to PaaS and Software-as-a-Service (SaaS) cloud providers, since they can reduce operational costs while meeting Service-Level Agreements (SLAs).

Identifying optimal or near-optimal resource configurations is a difficult task. The challenges are three-fold: unknown workload characteristics, a large state space, and costly evaluation of what-if scenarios. Although resource utilization may be observed for repeating jobs, one may have to resort to intrusive profiling (not feasible in production environments) or A/B testing (which requires additional resources) to estimate tradeoffs with different configurations. Resource provisioning for new jobs is even more challenging. For these reasons, configuration tuning is traditionally performed by database administrators. However, this approach is tedious, requires a high level of domain expertise, and does not scale to cloud platforms that need to handle millions of databases and different service tiers. In this paper, we tackle this problem by studying how
Machine Learning (ML) techniques can be applied to estimate cost-performance tradeoffs for query processing in an RDBMS. To focus our study, we use Microsoft SQL Server as the RDBMS, which is available as cloud IaaS and PaaS offerings [23] and as an on-premises solution [27]. We also choose to study the configuration of a particular resource, the query Degree Of Parallelism (DOP): this is the maximum number of hardware threads (logical cores) that can be used at any time for executing the query. Estimates of cost-benefit tradeoffs for different DOPs allow the provisioning for on-premises server configurations, IaaS compute sizes, and PaaS service tier selections. Azure SQL Database [20] offers different service tiers that allow users to choose a specific number of (logical) cores [22, 25], or a range of cores [21], or a pre-configured size with cores bundled with other resources [26, 24].
Example. Figure 1 shows the speedup and costup of 22 TPC-H1000 queries (on non-partitioned data) on SQL Server, running on our system (details in Section 4.1), for different values of the total number of logical cores allowed for the workload. Here the speedup and costup are normalized with respect to the performance and cost for a single core (DOP = 1).

Choosing a DOP

Query DOP is a critical knob that affects query performance and resource utilization on multicore servers. However, choosing the optimal or near-optimal DOP is not easy. For well-parallelizable queries, performance improves as DOP increases, but diminishing returns are observed beyond some point, depending on the database instance and query characteristics. For other queries, increasing DOP could be detrimental to performance due to overheads such as increased memory requirements, which can lead to disk spills at high DOPs (Section 2).
Figure 1: Speedup and costup of the TPC-H1000 queries, normalized to single-core (DOP = 1) execution.

In this work, we treat the query plan as a whole, without using operator-level execution time information, for ML-based DOP tuning. An interesting outcome of our work is that approaches that use a simple plan representation are very effective in many cases when applied to DOP tuning, despite having been dismissed in previous relevant work such as QPPNet [17].
Our Contribution
In this work, we study the problem of automating fine-grained DOP tuning using ML-based techniques. To the best of our knowledge, this is the first work to study this problem in the context of an RDBMS. Our contributions can be summarized as follows:
• We emphasize the importance of DOP tuning and show the potential benefits of fine-grained per-query DOP tuning (Section 2).
• We formulate the DOP tuning problem as a regression task and study how to best represent each query as a feature vector (Section 3). We perform a detailed experimental evaluation of different ML models using well-known decision support benchmarks (TPC-H [32] and TPC-DS [28]) consisting of queries generated from different query templates, database instances, and scale factors.
• In order to evaluate the performance of the ML models, we use four generalization levels based on different application scenarios: (i) to queries from the same template, (ii) to queries from a new template, (iii) to instances of different scale, and (iv) to different instances and queries. We also use both task-agnostic and task-specific performance metrics to evaluate the utility of each model. Our experiments show that a simple yet proper featurization along with tree-ensemble models is effective for DOP tuning in most cases. We also experimentally analyze the causes of systematic failures, and discuss possible improvements as future work (Section 4).
2. DEGREE OF PARALLELISM
In SQL Server, the query Degree Of Parallelism (DOP) is the maximum number of hardware threads or logical processors that can be used at any time to execute the query. Intra-query parallel execution is inspired by the Volcano [7] operator model. The physical operators in a query plan can be both serial and parallel, but all parallel operators have the same DOP, which is fixed at the start of the query execution. Being able to determine the optimal DOP in one shot at (or before) query startup is important, since a suboptimal DOP cannot be fixed without terminating query execution and restarting it with a different DOP. SQL Server provides mechanisms to configure the DOP both at query optimization time and query startup time. In this work, we focus only on the latter option. The default DOP in the version of SQL Server we use, and for the machines used for our experiments, is 64. Thus, we choose to use the query performance at DOP 64 as the baseline in our experiments. More details regarding our experimental setup are in Section 4.1.
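For concreteness, SQL Server exposes the startup-time mechanism through the MAXDOP query hint. The sketch below, using the pyodbc driver, shows how a per-query DOP choice could be applied; the connection string, table, and DOP value are illustrative assumptions, not part of our experimental harness.

```python
import pyodbc

# Hypothetical connection string; adjust server/database/credentials.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=localhost;DATABASE=tpcds1000;Trusted_Connection=yes;"
)

def run_with_dop(cursor, query, dop):
    """Execute `query` with a fixed degree of parallelism.

    The MAXDOP query hint caps the number of logical cores that the
    parallel operators in this statement may use; it must be chosen
    before execution starts, matching the one-shot setting studied here.
    """
    cursor.execute(f"{query} OPTION (MAXDOP {dop});")
    return cursor.fetchall()

cur = conn.cursor()
rows = run_with_dop(cur, "SELECT COUNT(*) FROM store_sales", dop=40)
```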
Figure 2: Performance profiles for 4 queries on DS1000.

As we discuss in the introduction, each query performs differently as the DOP increases. Figure 2 shows the performance profiles for 4 different queries on TPC-DS with scale factor 1000 (see Section 4 for details about our experimental setup and workloads). Queries Qb and Qc are well-parallelizable and their performance improves as DOP increases, but with diminishing returns beyond some point. Query Qa sees no benefit from parallelism. Query Qd loses performance at high DOPs due to disk spills caused by an increased memory requirement that exceeds the default limit for a single query. Query Qd(m) shows how Qd performs differently as the DOP increases when it is intentionally given enough memory to avoid disk spilling, indicating that memory is an important factor that can influence query behavior at different DOPs. One can also observe that there is no single DOP choice that is optimal across all four queries.
Figure 3: Workload speedup over workload-level DOP=64 (default value) for different workload-level DOPs.

Figure 3 shows workload performance (details in Section 4.1) as a function of the DOP value, with the same DOP for all parallel queries, relative to the performance with the default DOP of 64 for our setup. For most workloads, performance improves with DOP, starting from DOP=1, and attains the peak at DOP values that are less than the default (64) and the maximum value (80). The optimal value is workload-specific and can result in substantial performance gains over the default value for some workloads.
Figure 4: Workload speedup (Y-Axis) over workload-level DOP=64 (default value) with per-query optimal DOPs, workload-optimal DOP, and workload-level DOP=40.

Further gains are possible by selecting optimal DOPs on a per-query basis. Figure 4 compares workload-level speedups possible with per-query DOP selections, per-workload DOP selections, and a static selection of DOP=40 (equal to the number of physical cores in our server). All speedup values are with respect to the static (default) value of DOP=64. In line with observations from Figure 3, DOP=40 speeds up some workloads (e.g., H 1000) but slows down others (e.g., H 300), and a per-workload optimal DOP can significantly improve performance (e.g., DS 300). All workloads also show additional speedups, ranging from 4% (H 300) to 15% (DS 1000), beyond what is possible with a workload-optimal selection. Just as the per-workload optimal DOP differs across workloads, the per-query optimal DOP values within each workload also show substantial variation. Figure 5 shows more variation for the DS workloads than the H workloads, which is due to a larger variety of query templates in DS. No workload has a single per-query optimal DOP value, indicating the potential for speedup with per-query DOP selections, as we observed in Figure 4. The average and median shift towards larger DOP values as scale factors, and consequently dataset sizes, increase.
Figure 5: Distribution of the percentage of queries with respect to the optimal query DOP.

Based on the above, we can observe the following:
• The behaviors of different queries for different DOPs are different (Figure 2).
• The optimal DOP choices for different workloads (Figure 3) and queries (Figure 5) are different.
• While fine-grained DOP tuning (on a per-query level) is more challenging compared to DOP tuning at the workload level, it can lead to greater performance improvement (Figure 4).
3. THE MAIN COMPONENTS
In this section, we formally describe our approach to solving the DOP tuning problem. We first formally state the ML task we are solving. Then, we describe how we featurize the input data to the ML models and present the ML models we consider in this work. Finally, we discuss different ways to evaluate how well our ML models generalize under different application scenarios.
Regression over Classification
At first glance, it is natural to think of tuning query DOP as a classification problem: given a query and a set of DOP values, classify the query into the DOP class that achieves optimal performance. We choose instead to cast the problem as a regression problem for two reasons. First, the performance difference of a query at neighboring DOP values can be very small (see, for example, Figure 2), which means that accurate classification may not be possible. Second, formulating DOP tuning as a regression problem allows us to use the learned model in applications beyond DOP selection, such as resource provisioning. We also note that the capability to predict the estimated time of a query at different DOPs might be helpful in order to perform DOP selection for concurrent query execution, which we consider as future work.
We are given a database instance $I$ over a schema $R$, and a workload $W = (P_1, P_2, \ldots, P_n)$ over the instance $I$, where each $P_i$ is a compiled query plan generated by the RDBMS. For our purposes, a query plan is a tree where each node is a physical operator (e.g., Index Scan, Hash Join, Sort); additionally, for each node we are given certain information about the operator execution (e.g., estimated rows read, whether it runs in parallel mode, etc.). Figure 6 shows an example of such a query plan. In the next section, we will show how to map the query plan $P_i$ to a feature representation $ftr(P_i)$.

For each plan $P_i$, we also have a measurement of the execution time of the plan $P_i$ over instance $I$ with DOP $d$, where $d$ comes from a set $D$ of DOP values of interest. We denote this time with $t_{(P_i,d)}$. Every such measurement corresponds to a training point $(x, y)$, where $x = (ftr(P_i), d)$ and $y = t_{(P_i,d)}$.

We can now cast our problem as a regression task. In particular, the goal is to use the training points in order to learn a function $f$ (model instance) of a model $m$ that minimizes the quantity

$$\sum_{i=1}^{n} \sum_{d \in D} L\big(f(ftr(P_i), d),\, t_{(P_i,d)}\big)$$

where $L$ is a loss function. In this paper, we will use as a loss function the mean squared error (MSE), $L(a, b) = (a - b)^2$. We will use $\hat{t}_{(P_i,d)} = f(ftr(P_i), d)$ to denote the estimated execution time using the learned model instance.
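As a concrete illustration of this formulation, the following is a minimal sketch of assembling the training points and fitting an off-the-shelf regressor with a squared-error objective. It is not the paper's exact pipeline: `ftr` (the featurization of the next subsection), `plans`, and the measured latencies are assumed inputs, and the DOP set is illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

D = [1, 2, 4, 8, 16, 20, 32, 40, 64, 80]  # illustrative DOP set

def build_dataset(plans, D, latency, ftr):
    """One training point per (plan, DOP): x = (ftr(P_i), d), y = t_(P_i,d)."""
    X, y = [], []
    for i, plan in enumerate(plans):
        for d in D:
            X.append(np.append(ftr(plan), d))
            y.append(latency[(i, d)])
    return np.array(X), np.array(y)

# `latency[(i, d)]` holds the measured runtime of plan i at DOP d.
# X, y = build_dataset(plans, D, latency, ftr)
# model = RandomForestRegressor(n_estimators=1000).fit(X, y)  # squared-error loss

def predict_latency(model, plan, d, ftr):
    """Estimated execution time t_hat_(P,d) = f(ftr(P), d)."""
    return model.predict(np.append(ftr(plan), d).reshape(1, -1))[0]
```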
Using the learned model instance $f$, we can now solve the DOP tuning problem. We distinguish two subproblems.

DOP Selection at Workload Level

In practice, users in many cases configure the DOP at the workload level, in which the same DOP is selected for all the queries in the workload. In this case, the task is to choose a single DOP,

$$d_{\text{workload}} = \operatorname*{argmin}_{d \in D} \sum_{i=1}^{n} f(ftr(P_i), d).$$

DOP Selection at Per-Query Level
Further performance gains are possible if we perform DOP selection on a per-query basis. In this case, the task is to select for each query plan $P_i$ a DOP

$$d_i = \operatorname*{argmin}_{d \in D} f(ftr(P_i), d).$$

Featurization

In this section, we describe how we map a query plan $P$ to a feature vector $ftr(P)$ of fixed dimension, in a similar manner to what is presented in [4]. Compared to Ding et al. [4], our featurization contains richer information, by which the characteristics of a given query plan can be more accurately captured. We propose different featurization alternatives for a query plan; we study the effects of different featurization choices in Section 4.2.

We next detail our featurization process. A simplified example of the process is depicted in Figure 6. Recall that $P$ is a tree, where each node in the tree consists of a physical operator along with its processing mode and information about its estimated runtime performance. Each physical operator can execute in different modes: (i) parallel or serial, and (ii) batch or row. Since different modes handle parallelism differently, it is critical that we encode them in the feature vector. For example, it is important to distinguish between operators that use a parallel implementation versus the ones with a non-parallel implementation. To achieve this, we construct a composite key for each operator instance in the form of

(operator, batch/row, parallel/serial, ⟨optional⟩)

The vector ⟨optional⟩ adds additional attributes that are only applicable for certain operators. For example, IsAdaptive and EstimatedJoinType apply only for the operator AdaptiveJoin. In addition, in the case where the same physical operator can be used to implement different logical operators (e.g., HashMatch can be used for join, aggregation, or union), ⟨optional⟩ encodes the logical type as well.

For each composite key, the feature vector allocates space to encode different types of execution measures; each measure is summed across all nodes in the plan with the same composite key:
• Cardinality (card): cardinality estimates computed by the optimizer, expressed in bytes (e.g., estimated output size in bytes).
• Cost (cost): cost-based estimates calculated by the optimizer's cost model (e.g., estimated CPU cost, estimated I/O cost).
• Count (count): takes value 1 if the operator has the composite key, otherwise 0.
• Weight (weight): each node is assigned a weight which is computed recursively from the leaf nodes. The weight of a leaf node is its estimated output size in bytes, while for a non-leaf node it is the sum of the products of the weights and heights of all its children. The weight feature encodes structural information (see [4]). A simplified sketch of this aggregation is given below.
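In the sketch below, the node field names and the `slot_of` key-to-slot mapping are illustrative placeholders rather than the product representation; the weight recursion mirrors the definition above.

```python
from collections import defaultdict

def composite_key(op):
    # `op` is a dict-like plan node; the ⟨optional⟩ part (e.g., the logical
    # type of a Hash Match, or AdaptiveJoin attributes) is appended only
    # when applicable. Field names here are illustrative.
    return (op["name"], op["mode"], op["parallel"], op.get("logical", ""))

def height(node):
    return 1 + max((height(c) for c in node["children"]), default=0)

def weight(node):
    # Leaf: estimated output size in bytes; non-leaf: sum over children of
    # the child's weight times the child's height (structural information).
    if not node["children"]:
        return node["est_bytes"]
    return sum(weight(c) * height(c) for c in node["children"])

def featurize(root, slot_of):
    # `slot_of` maps each composite key to its region of the feature
    # vector; every measure is summed over all nodes sharing the key.
    feats = defaultdict(float)
    def visit(node):
        k = slot_of[composite_key(node)]
        feats[(k, "card")] += node["est_bytes"]   # cardinality (bytes)
        feats[(k, "count")] += 1                  # operator count
        feats[(k, "weight")] += weight(node)      # structural weight
        # cost (optimizer CPU/IO estimates) would be aggregated analogously
        for c in node["children"]:
            visit(c)
    visit(root)
    return feats
```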
Figure 6: Example of featurizing a query plan. The information of physical operators with different composite keys, e.g., ("Hash Match", "batch", 1, "join") and ("Clustered Index Scan", "batch", 1), is encoded into different parts of the feature vector; per-node estimates such as EstimateRows, AvgRowSize, the execution mode, and the estimated output size (EstimateRows × AvgRowSize) feed the corresponding components.

We briefly describe the ML models we use in our approach.
Linear Regression (LR)
We consider Linear Regression as one of the baseline ML models due to its simplicity and efficiency. Specifically, we use its regularized version, elastic net, which linearly combines L1 and L2 regularization.
Random Forest (RF)
Random Forest is an ensemble learning method exploiting bagging as its ensemble algorithm. Since multiple decision trees are built/traversed independently during the model construction/inference phase, this allows for efficient training/testing in a multicore setting. The fact that the multiple trees are built on different randomly sampled subsets of the data also makes it more robust compared to a single decision tree, enabling it to consider complex interactions among different features while mitigating overfitting.
XGBoost
XGBoost [3] is an efficient implementation of gradient boosting, with a set of optimizations that exploit any available parallelism, achieve faster model convergence by using the second-order gradient approach (Newton boosting), and mitigate the overfitting issue of gradient boosting by imposing better regularization techniques.

Multi-Layer Perceptron (MLP)
The multi-layer perceptron is a class of feedforward artificial neural networks. In this work, we use a fully connected neural network with 8 hidden layers, in which each layer has 512 neurons. ReLU ($\max(0, x)$) is used as the activation function. We use the Adam optimizer for training [9].
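Under default library settings, plausible instantiations of the four models might look as follows. This is a sketch, not the tuned configurations from Section 4; in particular, the paper's MLP was trained on a GPU, so the scikit-learn MLPRegressor here is only a CPU stand-in with the same shape.

```python
from sklearn.linear_model import ElasticNet
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor
from xgboost import XGBRegressor

models = {
    # Elastic net: linear baseline combining L1 and L2 regularization.
    "LR": ElasticNet(),
    # Bagging over independently built trees; parallelizes across cores.
    "RF": RandomForestRegressor(n_estimators=1000, n_jobs=-1),
    # Gradient boosting with second-order (Newton) updates.
    "XGBoost": XGBRegressor(n_estimators=1000),
    # Fully connected net: 8 hidden layers of 512 ReLU units, Adam optimizer.
    "MLP": MLPRegressor(hidden_layer_sizes=(512,) * 8,
                        activation="relu", solver="adam"),
}

# for name, m in models.items():        # X_train, y_train as in Section 3.1
#     m.fit(X_train, y_train)
#     mae = abs(m.predict(X_test) - y_test).mean()
```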
Exploration of Other Models

Motivated by the promising results in [17] with explicit exploitation of the query plan structure and operator-level query latency, we investigated a rich set of DNN-based models that are designed to capture the spatial (e.g., nearby pixels in images) or structural (e.g., syntactic tree structure in natural languages) information of the data. Specifically, we explored models that have been applied successfully to a set of natural language processing (NLP) related tasks, including convolutional neural networks (CNN) [8], long short-term memory (LSTM) based recurrent neural networks (RNN) [34], and Tree-LSTM [35], regarding each operator instance as a special word. We observed that none of these models outperform MLP, the simplest DNN-based model, considering both runtime efficiency and prediction performance. We also implemented and explored QPPNet [17], including the operator-level elapsed time. QPPNet failed to learn the relation between query latency and DOP (see the example given in Figure 7). The failure could be attributed to a series of possible reasons, such as the difference of the studied system environment (e.g., PostgreSQL vs. SQL Server), the assumptions QPPNet relies on as discussed in the introduction, the not-well-defined per-operator latency in our study environment, and the different sizes of query plans (more details in Figure 18). Our models exploit a schema- and database-instance-agnostic featurization, while QPPNet is tested on schema-dependent features. A deeper investigation of the failure of applying QPPNet to DOP tuning is future work. Due to the above observations, we choose to use MLP as the DNN-based model representative and only present the comparison of its results with those of other models.
Figure 7: Example result of QPPNet for QPP. (a) Training convergence: QPPNet does converge after training for fewer than 25 epochs on 80% of the TPC-DS1000 plan-DOP pairs (Generalization Lvl1). (b) Speedup trend prediction: QPPNet is ignorant of different DOP values, always giving the same latency prediction for the same query regardless of DOP changes, while RF is able to capture the actual speedup curve accurately.
Generalization Levels

Different application scenarios exhibit different degrees of commonality, i.e., how much similarity is observed across a query workload. The commonality here refers both to the queries themselves (e.g., SQL statements and query plans) and to the input data. Given a specific application, having a good understanding of the degree of commonality is crucial when attempting to apply ML-based techniques for DOP tuning, since it helps practitioners understand the capabilities and limitations of the models that are learned from the underlying data and queries. More specifically, given training data (pairs of query plans plus runtimes), we should first try to answer the following questions:
1. How similar are future queries to queries in the training data?
2. How much can the input relations change (in terms of both schema and scale) compared to the input relations in the training data?
The answers to the above questions are different when considering different applications. For example, a supermarket chain might run the same set of queries daily to analyze its sales, and the size and distribution of the input data to the queries might be relatively stable across most days, with the exception of a few promotion days such as Black Friday and Christmas. On the other hand, analysts in different teams might run different queries on the same internal data for different data analysis tasks. In addition, the scale of the input data may increase after a period of time as new data arrives daily. To capture this differentiation, we categorize the training data and test data into four generalization levels, considering the query templates, the schema of the input tables, and the scale of the data. Suppose that we train our model on instance $I$ with schema $R$, where the workload $W$ is drawn from a distribution of templates $dist(T)$ (see Figure 8).

Generalization Lvl1: same templates & same schema & same scale
The test workload comes from the same distribution of templates $dist(T)$ and runs on the same database instance $I$. As an example, the training and test data are gathered from the executions of two sets of TPC-DS queries generated from the same set of templates but with different random seeds, running against the same instance.

Generalization Lvl2: different templates & same schema & same scale
The test workload comes from a different distribution of templates $dist(T')$, but runs on the same instance $I$. As an example, the training and test data are gathered from running two sets of TPC-DS queries generated from different subsets of templates, running against the same instance.

Generalization Lvl3: same templates & same schema & different scale
The test workload comes from the same distribution of templates $dist(T)$, but runs on an instance $I'$ of different scale using the same schema $R$. As an example, the training and test data are gathered from running two sets of TPC-DS queries generated from the same set of query templates, running against the 300 and 1000 scale instances, respectively.

Generalization Lvl4: different templates & different schema
The test workload comes from a different distribution of templates $dist(T'')$ and runs on a different instance $I'$ defined on a different schema $R'$. As an example, the training and test data are gathered from running two sets of queries generated from TPC-DS templates and TPC-H templates, running against the corresponding database instances.
Figure 8: Overview of the training setup and generalization levels. We first train a model on a workload distribution and database instance; then, we test it on the different generalization levels. We use $dist(T)$ to denote a distribution of queries over a set of templates $T$.

We assume a fixed hardware configuration and do not study generalization to different hardware configurations in this work. Cloud platforms generally have a restricted set of known hardware configurations on which services are deployed, and separate models could be trained and deployed for each configuration.
4. EXPERIMENTAL EVALUATION
In this section, we present and discuss the results of our experimental evaluation. We design our experiments with the goal of answering the following questions:
• What are the effects of different featurization alternatives on model performance?
• How do different ML models perform at different generalization levels?
• What are the root causes behind wrong predictions of the ML models in our problem formulation?
• What trade-offs should be considered when choosing between different ML models?

At a high level, our experiments show that the featurization using the count, card, and weight features (excluding cost) leads to the overall best model prediction performance. RF shows the best overall performance considering both task-agnostic and task-specific metrics. However, when there is a significant distribution mismatch between the test queries and the training queries, no model is able to perform DOP selection that gives performance close to optimal. One concrete cause of such distribution mismatch is a difference in memory requirements. Comparatively speaking, hyper-parameter tuning of XGBoost and MLP is time-consuming, and additional regularization is hard to impose for better generalization (e.g., to achieve better performance on test data) when the training data is limited; RF is easier to configure and more robust, while it has relatively higher model inference overhead.

System Configuration
Our queries were executed on a dual-socket Intel Xeon Broadwell server with a total of 40 physical cores and 512 GB main memory. We have hyper-threading enabled, resulting in a total of 80 logical cores, which, consequently, is the maximum DOP possible on this server. We run SQL Server 2019 CTP 2.0 Developer edition. By default, SQL Server chooses a DOP value of 64 for queries running on this server. SQL Server sets aside part of the available memory for the buffer pool and shared objects; the remainder can be granted as working memory to queries, up to a certain limit (by default, 25% of the working memory).

Workload and Datasets
We use both the TPC-H and TPC-DS workloads for our experiments, with scale factors 100, 300, and 1000 for each. Table 1 gives the detailed information for the dataset construction.

Table 1: Workload and Dataset Statistics

Workload       Templates   Queries   Plan-DOP Pairs
TPC-H 100      22          1346      13460
TPC-H 300      22          1346      13460
TPC-H 1000     22          1346      13460
TPC-DS 100     103         841       8410
TPC-DS 300     106         853       8530
TPC-DS 1000    103         841       8410

For each TPC-H query template, we generated queries using 100 different random seeds, with duplicate queries being removed. TPC-DS queries are generated in a similar manner, but with 10 different random seeds used for each query template. We execute each query in the workloads for 10 different DOP values. We focus on warm-cache query execution, since caching is almost always used in practice whenever it is feasible. Thus, to obtain each data point, we perform one cold-cache query execution followed by ten warm-cache runs. The average time of the warm-cache runs is used as the ground truth of the latency for the executed query. For the purpose of our study, we execute one query at a time. We also only consider queries executed using query plans with at least one physical operator that uses intra-parallelism. We use the clustered columnstore index [19] organization for each database.
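A sketch of this measurement protocol (one discarded cold run, ten warm runs averaged) follows, reusing the hypothetical run_with_dop helper from the Section 2 sketch; the DOP set listed is illustrative, as the exact set is not reproduced here.

```python
import time

DOPS = [1, 2, 4, 8, 16, 20, 32, 40, 64, 80]  # illustrative set of 10 values

def measure_latency(cursor, query, dop, warm_runs=10):
    run_with_dop(cursor, query, dop)          # cold-cache run, discarded
    total = 0.0
    for _ in range(warm_runs):
        start = time.perf_counter()
        run_with_dop(cursor, query, dop)      # warm-cache run
        total += time.perf_counter() - start
    return total / warm_runs                  # ground-truth latency t_(P,d)

# latency = {(i, d): measure_latency(cur, q, d)
#            for i, q in enumerate(queries) for d in DOPS}
```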
Training and Testing Splits

We perform 5-fold cross-validation to evaluate the model performance for each generalization level. For Generalization Lvl1, each fold uses 80% of the TPC-DS 1000 queries for training and the remaining 20% for testing. For Generalization Lvl2, the data is split based on the query templates: each fold uses 80% of the query templates from TPC-DS 1000 for training, and the remaining 20% for testing. The folds for Generalization Lvl3 and Generalization Lvl4 are as follows.
Table 2: TPC Queries in Training/Testing Folds for Generalization Lvl3 and Generalization Lvl4.

Fold     Generalization Lvl3   Generalization Lvl4
Fold 1   DS1000/DS300          DS1000/H1000
Fold 2   DS1000/DS100          DS1000/H300
Fold 3   DS300/DS100           DS1000/H100
Fold 4   DS100/DS300           DS300/H1000
Fold 5   DS100/DS1000          DS300/H300
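For Lvl1 and Lvl2, the folds can be produced with standard utilities; grouping by query template, as sketched below with scikit-learn (an assumption about tooling, not the paper's stated implementation), keeps all queries from a template on one side of the Lvl2 split.

```python
from sklearn.model_selection import KFold, GroupKFold

# Lvl1: split queries 80/20; templates may overlap between train and test.
lvl1_folds = KFold(n_splits=5, shuffle=True, random_state=0)

# Lvl2: hold out whole templates. `template_ids[i]` (assumed available)
# names the template behind row i, so no template spans both sides.
lvl2_folds = GroupKFold(n_splits=5)

# for train_idx, test_idx in lvl2_folds.split(X, y, groups=template_ids):
#     model.fit(X[train_idx], y[train_idx])
#     mae = abs(model.predict(X[test_idx]) - y[test_idx]).mean()
```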
Hyper-Parameter Tuning
We tune the selected models in a standard manner, similar to [4], based on the mean absolute error in cross-validation on the training data. For the tree-ensemble models (XGBoost and RF), we limit the maximum number of trees to 1000. We note that spending additional effort tuning the parameters of different models might lead to a narrower gap between the evaluation metrics of these models. However, the main goal of this study is to explore the usability and trade-offs present in the compared techniques in a standard training configuration. Hence, for each model, we stop tuning the hyper-parameters (e.g., searching a larger hyper-parameter space) when reasonably satisfactory results are observed.
Comparison Metrics
To compare the performance of different ML models, we use both task-agnostic metrics (e.g., mean absolute error) and task-specific metrics that are tied to a specific application (e.g., workload throughput at optimal DOP selection). We provide the detailed definitions of these metrics in Table 3.
Effect of Featurization Choices

To compare the featurization alternatives, we run RF with all four features F = {card, cost, count, weight}, and also excluding one feature at a time (F\{card}, F\{cost}, F\{count}, F\{weight}). Tables 7-14 show the distributions of relative prediction error (RPE) and speedup prediction error (SPE). We observe that for Generalization Lvl1 there is little difference across the featurization alternatives (Tables 7, 8). For the other generalization levels, F\{cost} leads to the best performance: in the low relative error range, the fraction of plan-DOP pairs covered by F\{cost} is larger than for the other featurization alternatives. We also observe that including all features F is rarely optimal, while having count and weight is beneficial most of the time. As a result of our analysis, we present our experimental results using the features {count, card, weight} for the remaining experiments.

Model Overheads

In this section, we evaluate the training and inference overheads for the ML models we tested. It is critical that both overheads are small for the model to be usable in practice. We report below (Table 5) the training and inference overheads for Generalization Lvl1.

Table 5: Model overheads for Generalization Lvl1.
Model      Training (sec)   Inference (sec)
LR         -                -
MLP        81               0.009772
RF         17               0.419193
XGBoost    23               0.0013

The training overhead includes the time spent on preprocessing (e.g., featurization, training/test data splitting, etc.). MLP is trained on an NVIDIA GeForce GTX 1080 Ti, while the other models are trained on the same machine on which the queries are executed, with all 40 physical cores available. The inference time of all models is measured by running each model on a single data point using a single CPU core. We observe that the inference time of all models except for RF is very small, while RF spends noticeably more time (∼400 ms) on inference due to the traversal of a large number of deep trees. However, our work focuses on relatively long-running queries (e.g., OLAP), for which DOP tuning is mostly beneficial, and thus we see the inference overhead of RF as insignificant.
Mean Absolute Error
MAE represents the average absolute difference between the actual query latency and the predicted query latency over all plan-DOP pairs. A nice property of MAE is that it has the same unit as the regression target time, as also noted in [17]. In Table 4, we observe that except for LR, all other models exhibit relatively small MAEs on the training data after learning across all generalization levels, suggesting the necessity of nonlinearity and large model capacity. Among these models, XGBoost and RF show the best performance considering only MAEs on the training data, which are close to zero, suggesting that the number and depth of the trees in the tree-ensemble models are large and deep enough. Switching to test data (Table 6), the comparison between the results of LR and the other models in Generalization Lvl1 is relatively consistent with what is observed on the training data. However, we see much larger gaps across the different generalization levels in all models, and this suggests the different degrees of difficulty of applying ML models for DOP tuning at different generalization levels. The MAEs of the different testing folds in each generalization level suggest that RF generalizes better than XGBoost and MLP. This result can be explained by the fact that RF is less sensitive to overfitting compared to XGBoost and MLP, as well as easier to tune. While it is possible to apply better regularization techniques to XGBoost and MLP to further reduce the generalization error, we note that the current architectures/hyperparameters of XGBoost, MLP, and RF (which are tuned via cross-validation on the training data) show comparable performance.

Table 3: Comparison Metrics
Category: Task-Agnostic
• Mean Absolute Error: $MAE(W) = \frac{1}{|W||D|} \sum_{i=1}^{n} \sum_{d \in D} |\hat{t}_{(P_i,d)} - t_{(P_i,d)}|$
• Relative Prediction Error: $RPE(P) = \frac{1}{|D|} \sum_{d \in D} \frac{|\hat{t}_{(P,d)} - t_{(P,d)}|}{t_{(P,d)}}$
• Speedup Prediction Error: $SPE(P) = \frac{1}{|D|} \sum_{d \in D} \left| \frac{\hat{t}_{(P,d)}}{\hat{t}_{(P,1)}} - \frac{t_{(P,d)}}{t_{(P,1)}} \right|$

Category: Task-Specific
• Throughput with per-Query DOP: $TQ(W) = \frac{|W|}{\sum_{i=1}^{n} \min_{d \in D} \hat{t}_{(P_i,d)}}$
• Throughput with per-Workload DOP: $TW(W) = \frac{|W|}{\min_{d \in D} \sum_{i=1}^{n} \hat{t}_{(P_i,d)}}$
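A direct numpy transcription of these metrics follows (a sketch; the array layout, with column 0 holding the DOP = 1 measurements, is an assumption about data organization):

```python
import numpy as np

# t, t_hat: arrays of shape (n_plans, n_dops) with actual and predicted
# latencies; column 0 corresponds to DOP = 1.
def mae(t_hat, t):
    return np.abs(t_hat - t).mean()

def rpe(t_hat, t):
    # Per-plan mean relative error over all DOP values.
    return (np.abs(t_hat - t) / t).mean(axis=1)

def spe(t_hat, t):
    # Per-plan error between predicted and actual speedup curves, both
    # normalized by the corresponding DOP = 1 latency.
    return np.abs(t_hat / t_hat[:, [0]] - t / t[:, [0]]).mean(axis=1)

def tq(t_hat):
    # Throughput when each query runs at its (predicted) best DOP.
    return t_hat.shape[0] / t_hat.min(axis=1).sum()

def tw(t_hat):
    # Throughput at the single best workload-level DOP.
    return t_hat.shape[0] / t_hat.sum(axis=0).min()
```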
[Table 4: 5-Fold Mean Absolute Error on Training Data for LR, MLP, RF, and XGBoost across Generalization Lvl1-Lvl4; numeric values omitted.]
Relative Prediction Error
We show the distribution of the per-query RPE of each model at the different generalization levels in Figure 9. Not surprisingly, LR consistently exhibits high RPE (> 0.5) for most of the plan-DOP pairs across all generalization levels, due to its inability to capture the complex relationship between the plan-DOP pair features and the query latency. Looking at the relative error distributions of the other models, we observe that in most cases RF exhibits lower RPE for a larger portion of the test plan-DOP pairs compared to the other models.
Speedup Prediction Error
While relative error might be a good metric for evaluating the accuracy of the query latency prediction, it does not directly reflect the speedup of the query execution at different DOP values. Intuitively, when selecting the DOP for a single query execution, users should know the pattern or trend of the performance curve of the query at the DOP values of interest. We look at a simple metric called speedup prediction error (Table 3) that captures this property. Figure 10 presents the per-query SPE distribution of each model considering the different generalization levels. We observe that RF consistently shows the best performance for SPE. The relationships between relative error/speedup error and the query latency (at DOP 40) are shown in Figures 13 and 14. We observe that large errors are less likely to occur for long-running queries for both metrics. This observation is positive, since intra-parallelism is more beneficial to long-running queries.
We now consider the task, defined in Section 3.1, of selecting the DOP at the per-query level. We use the metric TQ(W) as defined in Table 3. We compare TQ(W) given by the different models against the query throughputs given by executing queries at DOP 40, DOP 80, the actual optimal DOP (OPT) for each $P_i \in W$, and the workload-optimal DOP (WORKLOAD) for all $P_i \in W$, with all results being normalized by the throughput at DOP 64 (the default value). The comparison results at the different generalization levels are shown in Figure 11. We first observe that in Generalization Lvl1, all models except LR lead to query throughput close to OPT.

At Generalization Lvl2, we observe that no model is able to predict DOPs that lead to throughput comparable to WORKLOAD in most of the cases. Though XGBoost seems to give performance much closer to WORKLOAD in some cases (Folds 2-5), the results are more likely to be accidental. After investigation, we have found that the root cause behind this performance gap is the mismatch between the data distributions of the training and test data: non-trivial disk spilling is observed in queries from the same query template that Qd (as shown in Figure 2) comes from (generated from different random seeds), due to the increase of memory requirements at high DOP values. We remove such Qd-type queries and the results are shown in Figure 12. Though the performance gain opportunity itself becomes significantly smaller (the performance gap between the actual optimal DOP values and the default DOP), in most cases most models do not select DOPs that result in performance regression, and the best-performing model RF often selects DOPs giving performance better than WORKLOAD.

At Generalization Lvl3, the performance given by the models except for LR often matches OPT or WORKLOAD (Folds 1-4). For Fold 5, since the models are trained on plan-DOP pairs generated by the TPC-DS 100 queries and database instance and tested on the TPC-DS 1000 queries and database instance (in which disk spilling is more severe than in TPC-DS 100), the distribution mismatch (similar to the previous reasoning for Generalization Lvl2) results in performance regression.

Figure 9: Relative Prediction Error Distribution: Percentage (Y-Axis) vs. RPE (X-Axis), for (a) Generalization Lvl1 through (d) Generalization Lvl4.
Figure 10: Speedup Prediction Error Distribution: Percentage (Y-Axis) vs. SPE (X-Axis), for (a) Generalization Lvl1 through (d) Generalization Lvl4.
[Table 6: 5-Fold Mean Absolute Error on Test Data for LR, MLP, RF, and XGBoost across Generalization Lvl1-Lvl4; numeric values omitted.]
Figure 11: DOP Selection at Individual Query Level: query throughput over DOP 64 (Y-Axis) for Folds 1-5, at (a) Generalization Lvl1 through (d) Generalization Lvl4 (series: LR, MLP, RF, XGBoost, OPT, WORKLOAD, DOP 40, DOP 80).
Figure 12: Generalization Lvl2M (Lvl2 with Qd-type queries removed): query throughput over DOP 64 (Y-Axis) for Folds 1-5.
Figure 13: RPE (Y-axis) vs. Query Latency at DOP 40 (X-axis), for (a) Generalization Lvl1 through (d) Generalization Lvl4.
Figure 14: SPE (Y-axis) vs. Query Latency at DOP 40 (X-axis), for (a) Generalization Lvl1 through (d) Generalization Lvl4.

At Generalization Lvl4, the opportunity for throughput improvement over DOP 64 is small (the gap between OPT and the throughput at DOP 64 is only about 5%), since the TPC-H queries,
from which the test data is generated, scale very well with increasing DOP on SQL Server. LR always chooses DOP 80 due to its linearity. At first, it may seem that the poor performance shown by the other models is due to the fact that the degree of commonality between the training and testing queries in Generalization Lvl4 is low; however, it is noticeable that there is no distinguishable difference between the SPE distributions of Generalization Lvl3 and Generalization Lvl4 (Figures 10c, 10d), while RF does a fairly good job in DOP selection for Generalization Lvl3. After careful investigation, we find that for most of the queries, RF and XGBoost are able to capture the speedup trend accurately, and the relatively poor performance seems more likely to be caused by the specific characteristics of the TPC-H queries (the testing queries in Generalization Lvl4): they scale really well with increasing DOP (see Figure 16e), implying a large performance gap between query executions at different DOP values and small latency at high DOP (e.g., nearly 30X speedup at DOP 80 compared to the performance at DOP 1). Keeping this fact in mind and considering the DS1000/H1000 training/testing split (Fold 2), RF and XGBoost failed to select the near-optimal DOP for two relatively long-running queries in the testing split (RF selects DOP 20 and XGBoost selects DOP 16, while the actual optimal DOPs are 80), resulting in the performance gaps observed in Figure 11.

Figure 15: Predicted Performance at Per-Query Optimal DOP: query throughput over DOP 64 (Y-Axis) vs. DOP (X-Axis), for (a) Generalization Lvl1, (b) Lvl2, (c) Lvl2M, (d) Lvl3, (e) Lvl4.
Figure 16: Performance at Predicted Per-Query Optimal DOP: query throughput over DOP 64 (Y-Axis) vs. DOP (X-Axis), for the same five settings.
Figure 17: Predicted Performance at Workload-Level DOP: query throughput over DOP 64 (Y-Axis) vs. DOP (X-Axis), for the same five settings.
While making the optimal choice of DOP configuration for each individual query is useful, selecting the optimal DOP at the workload level, although sub-optimal, could ease DOP configuration. Figure 17 visualizes the actual speedup curve and the predicted speedup curves given by the different models with increasing degree of parallelism. The speedup curve can be used for selecting the DOP for a given set of queries based on a resource budget, or by simply choosing the optimal point without any constraints. For example, looking at Figure 17a, the speedup from increasing the DOP from 32 to 40 is insignificant, but the hardware provisioning cost could increase by 25% (assuming the hardware provisioning cost is proportional to the number of logical cores).

We observe that the best-performing model RF is able to approximately capture the trend of the actual speedup curves in Generalization Lvl1, 2M, and 3, considering both the per-query optimal DOP (Figure 15) and the workload-level DOP (Figure 17). Besides, RF is also able to select per-query level DOPs for a given workload, leading to performance that is close to that of running each query at the actual optimal DOP, given the specified maximum DOP possible (Figure 16). However, there is a large gap between the actual speedup curve and the curve predicted by RF in Generalization Lvl2 and Generalization Lvl4, where XGBoost performs better in predicting the speedup curve in Generalization Lvl4 (Figures 15e, 17e), in contrast to our observations for the task-agnostic metrics. This gap is caused by heterogeneous predictions on different plan-DOP pairs in the test data: the prediction output of a few plan-DOP pairs dominates the predicted performance, and hence the actual performance of the other plan-DOP pairs is not accurately reflected. On the other hand, XGBoost fails to capture the performance difference (e.g., query performance improvements) for a few plan-DOP pairs that are important for DOP selection (Figure 16e), though these are not critical in predicting the overall speedup trend (Figures 15e, 17e). Based on this observation, we argue that merely looking at the task-agnostic metrics might not always be the best way to evaluate the performance of a model; rather, the task-specific ones should be considered together. For example, RF does well in terms of per-query DOP selection in Generalization Lvl4, while XGBoost is preferred when predicting the view of cost-performance trade-offs is important. Meanwhile, a better learning objective, one that forces the model to more accurately learn the absolute and relative differences between the query latencies at different DOPs, is critical to making ML-based techniques for query DOP tuning more feasible.
5. RELATED WORK
Query Performance Prediction (QPP) is a well-studied problem [1, 6, 13, 37, 17], but not in a setting with intra-query parallelism. Early work in DOP management [18] studied the exploitation of intra-operator parallelism in a multi-query environment for shared-nothing parallel database systems using a simplified simulation model. More recent works have developed analytical models for query parallelism in BigData execution frameworks [33]. To the best of our knowledge, this is the first paper to do a comparative study of ML techniques for QPP for multithreaded query execution and to evaluate query DOP tuning for a commercial-grade RDBMS on a modern multicore server.
ML for Query Performance Prediction
Prior work has explored the use of per-operator models [1, 13, 17] for QPP, where per-operator latency estimates can be added together or composed to derive query-level estimates.
[Table 7: RF distribution of Relative Prediction Error (RPE) for featurization alternatives F, F\{card}, F\{cost}, F\{count}, F\{weight} (Generalization Lvl1); values omitted.]
[Table 8: RF distribution of Speedup Prediction Error (SPE) for the same featurization alternatives (Generalization Lvl1); values omitted.]
[Tables 9-14: corresponding RPE and SPE distributions for the other generalization levels; values omitted.]
Figure 18: TPC-DS1000 plan size and depth distributions (plan percentage vs. number of operators, and vs. plan tree depth).

In our environment, however, per-operator latency is not well defined, which prevents such approaches from extracting and composing operator-level estimates to derive query-level estimates, as Figure 7 suggests.
Automated Database Tuning
Prior work has also explored techniques for automatically tuning database systems to improve query performance [2, 5, 31, 36]. Our work on DOP tuning can benefit such approaches by providing what-if analyses for different DOP settings, thereby potentially eliminating the need for runtime exploration of this parameter space.
Dynamic Parallelism
Recent work on new database systems uses dynamic/elastic parallelism, where the parallelism can be increased or decreased at runtime [12, 30]. Although our ML-based DOP selection approach for performance improvement focuses on a setting with static parallelism, we believe that our approach of providing cost-benefit tradeoff estimates would be useful for resource provisioning in elastic systems as well.
ML for Query Optimization
Conventional query optimization in RDBMSs is based on manually constructed cost models and rule-based heuristics. Such strategies usually make certain assumptions about the plan search space (e.g., left-deep trees) and rely heavily on the estimation of statistics required by the cost models, such as cardinality estimates, which can be quite inaccurate, leading to query plans that are far from optimal. In addition, query optimizers built this way never learn from past experience, and the same query plan with poor performance can be repeatedly generated [16]. Recent research efforts attempt to enhance and even replace the core optimizer components seen in most RDBMSs today using deep learning (DL) techniques. For example, [15, 11] exploit deep reinforcement learning (DRL) to optimize join queries, while [10, 29] propose to use DRL to improve cardinality estimation. Neo, a query optimizer built on a set of DL-based models [14], has been shown to even outperform state-of-the-art commercial optimizers in some cases. Compared to this line of research, our work focuses on optimizing query performance outside of the RDBMS (an example of resource tuning) rather than touching the optimizer internals, similar in scope to [31].
6. DISCUSSION
Our comparative exploration of ML techniques for tuning DOP in SQL Server indicates that performance gains are possible by using simple tree-ensemble models along with query plan featurization. The performance profiles of queries showing different behaviors at different DOP values suggest that DOP tuning is an important yet challenging problem. We have also identified a set of important issues that raise concerns when applying ML-based techniques for DOP tuning in practice, along with possible improvements to this work:
• Memory considerations: in our present study, we run queries in an environment in which the available memory is fixed (i.e., static). The utility of ML models learned without considering memory information is limited, preventing their use in new environments with different memory requirements (e.g., machines with different memory sizes, or concurrent query execution in which each stream is under strict memory constraints). We plan to address this deficiency as part of improvements to this work.
• Concurrent query execution: we plan to utilize the single-query-based ML models for DOP tuning in a concurrent environment via an analytical approach, expanding the applicability of our study.
• Model interpretability: the results given by a black-box ML algorithm should be interpretable in a way similar to that of a rule-based algorithm. DBAs should be able to see the decision path suggested by the algorithm in order to decide whether to accept or reject the model output. Meanwhile, interpretability could also help with identifying possible implementation issues inside the targeted RDBMS (e.g., poor parallelism could be caused by the behavior of operators of certain types).
• Studying the effects of hyper-threading: since enabling or disabling hyper-threading will affect query performance characteristics, it is important to study such differences and perhaps even encode this piece of information into the ML models.
7. CONCLUSION
We studied the problem of tuning the degree of parallelism via statistical machine learning. We focused our evaluation on Microsoft SQL Server, a popular commercial RDBMS that allows an individual query to execute on multiple cores. In our study, we cast the problem of DOP tuning as a regression task, and examined how several popular ML models can help with query performance prediction in a multi-core setting. We performed an extensive experimental study comparing these models against a list of performance metrics, and tested how well they generalize in different settings: (i) to queries from the same template, (ii) to queries from a new template, (iii) to instances of different scale, and (iv) to different instances and queries. Our experimental results show that a simple featurization of the input query plan that ignores cost model estimations can accurately predict query performance, capture the speedup trend with respect to the available parallelism, as well as help with automatically choosing an optimal per-query DOP.
8. ACKNOWLEDGEMENT
We thank Alan Halverson for insightful discussions about the DOP problem and about the baseline experimental setup during the initial phase of this work. Zhiwei was supported by Microsoft's Gray Systems Lab (GSL) through a summer internship and Microsoft RA-ships for this research work. We thank Carlo Curino and other members of GSL, and members of the SQL Server team, for discussions and feedback on this work. The manuscript was improved by detailed and thoughtful comments from Remzi Arpaci-Dusseau and other, anonymous, reviewers.
9. REFERENCES

[1] M. Akdere, U. Çetintemel, M. Riondato, E. Upfal, and S. B. Zdonik. Learning-based query performance modeling and prediction. In ICDE, pages 390-401. IEEE, 2012.
[2] S. Chaudhuri and V. Narasayya. Self-tuning database systems: A decade of progress. In Proceedings of the 33rd International Conference on Very Large Data Bases, VLDB '07, pages 3-14. VLDB Endowment, 2007.
[3] T. Chen and C. Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785-794. ACM, 2016.
[4] B. Ding, S. Das, R. Marcus, W. Wu, S. Chaudhuri, and V. R. Narasayya. AI meets AI: Leveraging query executions to improve index recommendations. In Proceedings of the 2019 International Conference on Management of Data, pages 1241-1258. ACM, 2019.
[5] S. Duan, V. Thummala, and S. Babu. Tuning database configuration parameters with iTuned. Proc. VLDB Endow., 2(1):1246-1257, Aug. 2009.
[6] A. Ganapathi, H. Kuno, U. Dayal, J. L. Wiener, A. Fox, M. Jordan, and D. Patterson. Predicting multiple metrics for queries: Better decisions enabled by machine learning. In ICDE, pages 592-603. IEEE, 2009.
[7] G. Graefe. Encapsulation of parallelism in the Volcano query processing system. ACM SIGMOD Record, 19(2):102-111, 1990.
[8] Y. Kim. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882, 2014.
[9] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[10] A. Kipf, T. Kipf, B. Radke, V. Leis, P. Boncz, and A. Kemper. Learned cardinalities: Estimating correlated joins with deep learning. arXiv preprint arXiv:1809.00677, 2018.
[11] S. Krishnan, Z. Yang, K. Goldberg, J. Hellerstein, and I. Stoica. Learning to optimize join queries with deep reinforcement learning. arXiv preprint arXiv:1808.03196, 2018.
[12] V. Leis, P. Boncz, A. Kemper, and T. Neumann. Morsel-driven parallelism: A NUMA-aware query evaluation framework for the many-core age. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, SIGMOD '14, pages 743-754, New York, NY, USA, 2014. ACM.
[13] J. Li, A. C. König, V. Narasayya, and S. Chaudhuri. Robust estimation of resource consumption for SQL queries using statistical techniques. Proceedings of the VLDB Endowment, 5(11):1555-1566, 2012.
[14] R. Marcus, P. Negi, H. Mao, C. Zhang, M. Alizadeh, T. Kraska, O. Papaemmanouil, and N. Tatbul. Neo: A learned query optimizer. arXiv preprint arXiv:1904.03711, 2019.
[15] R. Marcus and O. Papaemmanouil. Deep reinforcement learning for join order enumeration. In Proceedings of the First International Workshop on Exploiting Artificial Intelligence Techniques for Data Management, page 3. ACM, 2018.
[16] R. Marcus and O. Papaemmanouil. Towards a hands-free query optimizer through deep learning. arXiv preprint arXiv:1809.10212, 2018.
[17] R. Marcus and O. Papaemmanouil. Plan-structured deep neural network models for query performance prediction. arXiv preprint arXiv:1902.00132, 2019.
[18] M. Mehta and D. J. DeWitt. Managing intra-operator parallelism in parallel database systems. In Proceedings of the 21st International Conference on Very Large Data Bases, VLDB '95, pages 382-394, San Francisco, CA, USA, 1995.
[19] Microsoft. Columnstore indexes: Overview. https://docs.microsoft.com/en-us/sql/relational-databases/indexes/columnstore-indexes-overview?view=sql-server-ver15, 2018.
[20] Microsoft. Azure SQL Database - Platform as a Service. https://docs.microsoft.com/en-us/azure/sql-database/sql-database-paas, 2019.
[21] Microsoft. Azure SQL Database serverless (preview). https://docs.microsoft.com/en-us/azure/sql-database/sql-database-serverless, 2019.
[22] Microsoft. Choose among the vCore service tiers and migrate from the DTU service tiers. https://docs.microsoft.com/en-us/azure/sql-database/sql-database-service-tiers-vcore, 2019.
[23] Microsoft. Choose the right SQL Server option in Azure. https://docs.microsoft.com/en-us/azure/sql-database/sql-database-paas-vs-sql-server-iaas, 2019.
[24] Microsoft. Resource limits for single databases using the DTU-based purchasing model. https://docs.microsoft.com/en-us/azure/sql-database/sql-database-dtu-resource-limits-single-databases, 2019.
[25] Microsoft. Resource limits for single databases using the vCore-based purchasing model. https://docs.microsoft.com/en-us/azure/sql-database/sql-database-vcore-resource-limits-single-databases, 2019.
[26] Microsoft. Service tiers in the DTU-based purchase model. https://docs.microsoft.com/en-us/azure/sql-database/sql-database-service-tiers-dtu, 2019.
[27] Microsoft. SQL Server 2019. 2019.
[28] R. O. Nambiar and M. Poess. The making of TPC-DS. In Proceedings of the 32nd International Conference on Very Large Data Bases, pages 1049-1058. VLDB Endowment, 2006.
[29] J. Ortiz, M. Balazinska, J. Gehrke, and S. S. Keerthi. Learning state representations for query optimization with deep reinforcement learning. arXiv preprint arXiv:1803.08604, 2018.
[30] J. M. Patel, H. Deshmukh, J. Zhu, N. Potti, Z. Zhang, M. Spehlmann, H. Memisoglu, and S. Saurabh. Quickstep: A data platform based on the scaling-up approach. Proc. VLDB Endow., 11(6):663-676, Feb. 2018.
[31] A. Pavlo, G. Angulo, J. Arulraj, H. Lin, J. Lin, L. Ma, P. Menon, T. C. Mowry, M. Perron, I. Quah, et al. Self-driving database management systems. In CIDR, volume 4, page 1, 2017.
[32] M. Poess and C. Floyd. New TPC benchmarks for decision support and web commerce. ACM SIGMOD Record, 29(4):64-71, 2000.
[33] K. Rajan, D. Kakadia, C. Curino, and S. Krishnan. PerfOrator: Eloquent performance models for resource optimization. In Proceedings of the Seventh ACM Symposium on Cloud Computing, SoCC '16, pages 415-427, New York, NY, USA, 2016. ACM.
[34] M. Sundermeyer, R. Schlüter, and H. Ney. LSTM neural networks for language modeling. In Thirteenth Annual Conference of the International Speech Communication Association, 2012.
[35] K. S. Tai, R. Socher, and C. D. Manning. Improved semantic representations from tree-structured long short-term memory networks. arXiv preprint arXiv:1503.00075, 2015.
[36] D. Van Aken, A. Pavlo, G. J. Gordon, and B. Zhang. Automatic database management system tuning through large-scale machine learning. In Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD '17, pages 1009-1024, New York, NY, USA, 2017. ACM.
[37] W. Wu, Y. Chi, H. Hacıgümüş, and J. F. Naughton. Towards predicting query execution time for concurrent and dynamic database workloads. Proc. VLDB Endow., 2013.