A Note On Operator-Level Query Execution Cost Modeling
Wentao Wu
Microsoft Research, Redmond [email protected]
ABSTRACT
External query execution cost modeling using query execution feedback has found its way into various database applications such as admission control and query scheduling. Existing techniques in general fall into two categories, plan-level cost modeling and operator-level cost modeling. It has been shown in the literature that operator-level cost modeling can often significantly outperform plan-level cost modeling. In this paper, we study operator-level cost modeling from a robustness perspective. We address two main challenges in practice regarding limited execution feedback (for certain operators) and mixed cost estimates due to the use of multiple cost modeling techniques. We propose a framework that deals with these issues and present a comprehensive analysis of this framework. We further provide a case study to demonstrate the efficacy of our framework in the context of index tuning, which is itself a new application of external cost modeling techniques.
1. INTRODUCTION
In recent years, there has been substantial work that tries to provide more accurate estimates of query execution cost [3, 4, 15, 17, 20, 26, 27]. Unlike early work that mainly focuses on improving cost estimates inside the optimizer (prominently, via improved cardinality estimates), this line of work chose to sit outside the query optimizer. That is, the cost models proposed in this work are external to the query optimizer and have no impact on its plan choice. The effectiveness of these external cost models has been demonstrated in various applications such as admission control [24] and query scheduling [2]. However, their applications in other areas, such as query optimization and index tuning, remain limited.

Existing external cost modeling approaches in general fall into two categories: plan-level and operator-level modeling. In plan-level modeling, one first converts the entire query plan into a feature vector and then trains a machine learning model that learns the relationship between the feature vector and the corresponding plan execution time. In operator-level modeling, one applies the same idea to each individual operator instead of the entire plan. The estimated operator costs are then combined (e.g., summed up) to generate the estimated cost of the entire plan. As was shown in previous work [4, 20], operator-level modeling can often significantly outperform plan-level modeling. There are, however, three major challenges in operator-level modeling.

(Challenge 1: Appropriate Training)
Operator-level models require appropriate training before they can be effective. This raises a question about training data: on which dataset should we train the models? A premise underlying these (and any) learning-based models is that the training set should be representative of the workload. As was demonstrated in previous work [4], learning-based models perform much better if they are trained and tested over the same database and workload (see Section 2.1 for more details). Therefore, the training data set needs to be harvested from past execution history over the same database.

(Challenge 2: Limited Execution Feedback)
Previous work on learning-based models presumes the availability of sufficient training data, which is often not the case in a real production setting. That is, it rarely happens that abundant execution information about a workload is available, especially for interactive workloads that often contain lots of ad-hoc queries.

(Challenge 3: Mixed Cost Estimates)
Operator-level modeling requires training a model for each operator that may appear in a query execution plan. Given the scarcity of past execution feedback, it is unlikely that each model can receive a similar amount of training data. For instance, for a database that contains very few or no indexes, the query execution plans may contain far fewer index-based nested-loop joins than hash joins. Therefore, in general we may have little (or even no) execution feedback for certain operators. Training models for such operators is impossible, and a natural solution is to fall back to the optimizer's cost estimates. Then, however, how do we combine cost estimates made by different modeling approaches? Note that the cost estimates may even have different semantics. For example, external cost models usually target elapsed time or resource consumption such as CPU time, whereas the optimizer's cost estimates are often more "abstract" and may not have concrete semantics. In such situations, we cannot directly combine them (e.g., by adding them up).

With the above challenges in mind, in this paper we propose a simple yet general framework that operates on top of limited execution feedback and consists of three steps. First, we identify a set of backbone operators that serve as the workhorse for almost all workload queries yet have relatively abundant execution feedback. In this work, we focus on using leaf operators, such as table scans, index scans, and index seeks, as backbone operators, though it is straightforward to include other operators such as joins. The reason for this choice is obvious: regardless of which query plan the optimizer chooses for a given query, it always accesses the same tables, whereas the internal operators such as joins can be quite different (e.g., different join orders may be chosen).
Feedback information such as input/output cardinality is therefore more likely reusable across query plans for leaf operators, which results in more reliable models. Second, we build an external cost model for each leaf operator using existing techniques, and use the models to estimate costs for leaf operators in each query plan. Third, we combine the cost estimates made by external cost models for the leaf operators with the optimizer's cost estimates for the internal operators, using a straightforward yet principled technique.

To understand when this framework can work and when it may not, we further conduct a theoretical analysis. Our analysis reveals that the effectiveness of this framework depends on how overwhelming the (diversity or variation in the) amount of work done by the backbone operators is compared with the other operators.

(Paper Organization) The rest of the paper is organized as follows. We study the practical challenges of utilizing execution feedback in detail, and propose a general framework that addresses these challenges (Section 2). We next present an analysis of the framework in Section 3. In Section 4, we further present a case study of the applicability of the proposed framework in the context of index tuning. We summarize related work in Section 5 and conclude the paper in Section 6.
2. THE FRAMEWORK
The high-level idea of utilizing execution feedback to build external, operator-level cost models is clear. However, as was discussed in the introduction, there are several practical challenges that may limit the effectiveness of the cost models. In the following, we start by understanding these challenges, in retrospect, from lessons learned by previous work. We then propose a general framework that addresses these challenges and present a specific implementation that will be evaluated in our experiments.
Existing external cost modeling techniques all rely on a certain degree of "learning." In the literature, people have tried either plan-level or operator-level cost modeling techniques. It has been shown that operator-level modeling is superior to plan-level modeling when the workload drifts [4]. Nonetheless, operator-level modeling is still sensitive to the training data in use.
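To make the operator-level idea concrete, the sketch below fits one tiny regressor per operator type from execution feedback. The single cardinality feature and the ordinary-least-squares model are deliberately simplified stand-ins for the richer features and models used in the literature [4, 20], and all numbers are invented for illustration:

```python
def fit_linear(xs, ys):
    """Ordinary least squares for y ~ w0 + w1 * x (pure Python)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    w1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    return my - w1 * mx, w1  # (intercept, slope)

# Illustrative execution feedback for one operator type:
# estimated input cardinality -> observed CPU time (ms).
card = [1_000.0, 5_000.0, 20_000.0, 100_000.0]
cpu = [12.0, 45.0, 160.0, 790.0]
w0, w1 = fit_linear(card, cpu)

# Cost estimate for an unseen operator instance of this type.
predict = lambda x: w0 + w1 * x
```

In operator-level modeling, one such model is trained per physical operator type, and the per-operator estimates are then combined (e.g., summed) into a plan-level estimate.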
Lessons from Previous Work.
To illustrate this sensitivity, we studied the results reported by [20], which, as far as we know, represents the state of the art in operator-level modeling. The authors of [20] compared various operator-level modeling techniques in their experimental evaluation. We observe the following two facts from their comparison results.

First, if we train and test the models using the same workload, the estimation accuracy is quite good even in the presence of cardinality estimation errors. The authors of [20] used the TPC-H benchmark to train and test the models, and found that more than 80% of the test cases have a ratio error below 1.5 (i.e., 50% relative error), and the percentage can be improved to 90% if true cardinality information is available.

Second, if we train and test the models over different workloads, the estimation accuracy drops dramatically. The authors of [20] used the TPC-H benchmark to train the models and tested the models on the TPC-DS benchmark as well as two real workloads. Based on their reported results, only 30% of the test cases now have a ratio error below 1.5 on TPC-DS, and the percentage is about 40% on the two real workloads. However, if true cardinality information is available, the percentage can be improved to 70% over all three workloads. Unfortunately, true cardinality information is usually not available for queries in the testing set, and previous work has reported worse results if true/estimated cardinalities are used in training whereas estimated/true cardinalities are used in testing [4].

[Figure 1: A running example. Operators O1 to O5 over tables R, S, and T, each annotated with its type, its optimizer cost estimate, and its measured execution time (N/A where no feedback is available): O1 = (Index Scan, 20, 10s), O2 = (Index Seek, 2, 5s), O3 = (Table Scan, 200, 20s), O4 = (Nested-Loop Join, 500, N/A), O5 = (Hash Join, 300, N/A).]

The above observations suggest that operator-level modeling approaches are sensitive to the training set. Only if the testing queries are drawn from the same training workload should we expect good estimation accuracy. Cardinality estimation error remains one of the major factors that prevent better generalization of the models across different workloads: fixing cardinality errors can give us a lift between 30% and 40% in terms of the ratio-error metric. Interestingly, within the same workload cardinality errors do not have a significant impact on model estimation accuracy. This has also been evidenced by other previous work [4]. One intuition for this is that cardinality estimation errors depend on the difference between the query optimizer's presumed data distribution and the actual data distribution. As long as this difference is consistent across different queries over the same database, cardinality estimation errors are like systematic biases that can be accounted for by the models learnt from training data. This is, however, usually not the case across workloads over different databases, and therefore the biases have to be learnt again by the model.
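For reference, the ratio-error metric used in such comparisons is commonly defined as max(est/actual, actual/est), so a ratio error below 1.5 corresponds to less than 50% relative error in the worse direction. A minimal sketch (the estimate/actual pairs below are invented for illustration, not from [20]):

```python
def ratio_error(est, act):
    """Ratio error: max(est/act, act/est); 1.0 means a perfect estimate."""
    return max(est / act, act / est)

def frac_below(errors, threshold=1.5):
    """Fraction of test cases whose ratio error is below the threshold."""
    return sum(1 for e in errors if e < threshold) / len(errors)

# Illustrative (estimated, actual) elapsed times in seconds.
pairs = [(10, 9), (50, 80), (3, 2), (7, 7.5), (120, 100)]
errors = [ratio_error(e, a) for e, a in pairs]
print(frac_below(errors))  # → 0.6
```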
Discussion.
Lessons from previous work suggest that it may be too ambitious to expect an external cost model that works "everywhere." A more realistic approach is to learn a cost model for a fixed database and workload, which, as has been demonstrated by various previous work, can outperform the optimizer's cost estimates (with some naive scaling). An even more interesting observation made by previous work is that the specific machine learning models do not matter too much on a fixed database and workload, as long as they capture both the linear and nonlinear factors in cost modeling [4, 20, 27].

Moreover, this approach is sensitive to the amount of execution feedback we possess, the collection of which may be expensive in many situations. In the worst case, we may not have any feedback at all, in which case the approach degenerates to using the optimizer's cost estimates.
Insufficient Feedback and Mixed Cost Estimates.
The above discussion naturally raises questions about insufficient feedback. It is quite likely that we do not have enough feedback to train models for certain operators. A natural solution is to use the optimizer's cost estimates for such operators. This results in a new challenge of combining two different types of cost estimates.

EXAMPLE 1. To understand the issue of mixed cost estimates better, we present an annotated query execution plan as a running example in Figure 1. Here R, S, and T are tables, whereas O1 to O5 are physical operators. We annotate each operator with its type and estimated cost. In this example, the cost estimates for O1, O2, and O3 come from external cost models, which indicate the estimated CPU time of these operators. On the other hand, the cost estimates for O4 and O5 are made by the optimizer, which uses some abstract metric, and they do not have concrete semantics.

We propose a general framework to address the challenge of mixed cost estimates that naturally arises in the presence of insufficient execution feedback. Algorithm 1 presents the details. We summarize the notation in Table 1.
Algorithm 1: Combine mixed cost estimates.

Input: P, a query plan; O, the set of operators with sufficient feedback; M, the operator-level models built with feedback from O.
Output: cost(P), the estimated cost of P.
 1: cost(P) ← 0;
 2: o_pivot ← PickPivot(O);  // Find the "pivot" (Algorithm 2).
 3: foreach operator o ∈ P do
 4:   if there is a model M ∈ M for o then
 5:     ext-cost(o) ← M(o);
 6:     cost(P) ← cost(P) + (ext-cost(o) / act-cost(o_pivot)) · opt-cost(o_pivot);
 7:   else
 8:     cost(P) ← cost(P) + opt-cost(o);
 9:   end
10: end
11: return cost(P);

The main idea for combining mixed cost estimates is simple. We choose one "pivot" operator o_pivot from the operators with execution feedback (line 2). We use the execution cost of o_pivot as a baseline, and compute the relative cost

    rel-cost(o) = ext-cost(o) / act-cost(o_pivot)    (1)

for any operator o in the given plan P for which we have an external operator-level cost model. We then scale the relative cost back using opt-cost(o_pivot). For any operator in P without sufficient feedback (i.e., for which we have not built a usable external cost model), we simply use the optimizer's cost estimate (lines 3 to 10).

EXAMPLE 2. Continue with the running example in Figure 1. Given the three operators O1 to O3 with available execution feedback, suppose that we choose O3 as the pivot operator. Assume that the external cost models for table scans, index scans, and index seeks are perfect, i.e., for any such operator o we would have ext-cost(o) = act-cost(o). The relative costs of O1, O2, and O3 can then be easily computed:

    rel-cost(O1) = ext-cost(O1)/act-cost(O3) = act-cost(O1)/act-cost(O3) = 10/20 = 0.5,
    rel-cost(O2) = ext-cost(O2)/act-cost(O3) = act-cost(O2)/act-cost(O3) = 5/20 = 0.25,
and
    rel-cost(O3) = ext-cost(O3)/act-cost(O3) = act-cost(O3)/act-cost(O3) = 20/20 = 1.

Meanwhile, the scaling factor is opt-cost(O3) = 200. Consequently, the adjusted estimated costs for O1, O2, and O3 are 0.5 × 200 = 100, 0.25 × 200 = 50, and 1 × 200 = 200. Therefore, the final estimated cost for the example plan P is cost(P) = 100 + 50 + 200 + 500 + 300 = 1150.

Notation        Description
o               An operator in the query plan
opt-cost(o)     The optimizer's estimated cost of o
act-cost(o)     The actual execution cost of o
ext-cost(o)     The cost estimate of o from external modeling
o_pivot         The pivot operator

Table 1: Terminology and notation.

The above procedure relies on the following assumption:

ASSUMPTION 3. The cost estimates made by external cost models are comparable and the relative costs are well defined. The same holds for the optimizer's cost estimates.

In other words, Assumption 3 states that the semantics of the cost estimates produced by the same model should be consistent: if a cost model estimates CPU time for one operator, then it should do the same for the others. In theory, a cost model is not required to conform to this restriction. For example, we could have a cost model that estimates the number of I/Os for table scans while estimating the CPU time for hash joins. We are not aware of such cost models in practice and do not consider them in this paper. Moreover, if a cost model needs to produce a single number (instead of a vector of numbers) as its cost estimate at the operator level, then it has to be consistent; otherwise it is unclear how to combine the operator-level cost estimates at the plan level.
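Algorithm 1's combination step can be sketched in a few lines of Python. The operator names and dictionary layout below are illustrative; the external models are assumed perfect on the running example of Figure 1, so ext-cost equals the measured cost for the leaf operators:

```python
def combine_costs(plan, ext_models, pivot):
    """Combine external and optimizer cost estimates (Algorithm 1).

    plan: list of (operator, optimizer cost estimate) pairs.
    ext_models: dict operator -> external cost estimate (ext-cost).
    pivot: (act_cost, opt_cost) of the pivot operator.
    """
    pivot_act, pivot_opt = pivot
    total = 0.0
    for op, opt_cost in plan:
        if op in ext_models:
            # Relative cost w.r.t. the pivot (Equation 1), scaled back
            # by the pivot's optimizer estimate.
            total += ext_models[op] / pivot_act * pivot_opt
        else:
            # No usable external model: fall back to the optimizer.
            total += opt_cost
    return total

# Running example: O1-O3 are leaves with (perfect) external estimates;
# O4 and O5 only have optimizer estimates. O3 is the pivot
# (act-cost = 20, opt-cost = 200).
plan = [("O1", 20), ("O2", 2), ("O3", 200), ("O4", 500), ("O5", 300)]
ext = {"O1": 10.0, "O2": 5.0, "O3": 20.0}
print(combine_costs(plan, ext, pivot=(20.0, 200.0)))  # → 1150.0
```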
Remarks.
Two remarks are in order. First, Algorithm 1 does not specify how to pick the pivot operator o_pivot. In theory, we could pick any operator with execution feedback. However, it is clear that the choice of o_pivot has an impact on the estimated cost, because we use opt-cost(o_pivot) as the scaling factor when combining with the optimizer's cost estimates. We study this impact in Section 3.

Second, Algorithm 1 assumes that the external models are already built and fixed. This need not be the case in reality. As we accumulate execution feedback, it makes sense to train the models again. Moreover, operators that initially lack external cost models can catch up if enough feedback becomes available. As a result, the inputs O and M will change dynamically.

Even if the inputs O and M are fixed, we still need to select the set O. In general, the selection depends on several factors:

• First, the operators in O should have a sufficient amount of execution feedback. It is unlikely that we can train a promising model with little training data.
• Second, as cardinality information is used as one prominent feature in almost all existing operator-level cost modeling techniques, the operators in O should have relatively robust cardinality estimates.
• Third, the operators should cover a significant amount of the work performed by an arbitrary query plan. Otherwise, even if we have perfect cost estimates for these operators, the impact on the overall cost estimate of the query plan is limited.

We therefore propose to focus on leaf operators, including table scans, index scans, index seeks, and so on (depending on the specific database system), which meet the above three criteria well. Nonetheless, there is no mandatory reason to exclude internal operators, though the decision is more workload-dependent. For example, for workloads with little data correlation, cardinality estimation errors may not be severe, and thus it is reasonable to also include join operators in O if they have enough feedback. In the rest of the paper, we call the operators in O the backbone operators.

Notation    Description
L           The set of leaf operators
I           The set of internal operators
P           Plan CPU time
L           Leaf CPU time
I           Internal CPU time
α           ρ(L, I), Pearson CC between L and I
σ_L         Standard deviation of L
σ_I         Standard deviation of I
η           η = σ_L / σ_I
P′          Estimated plan cost
L′          Estimated leaf cost
I′          Estimated internal cost
σ_L′        Standard deviation of L′
σ_I′        Standard deviation of I′
η′          η′ = σ_L′ / σ_I′
β           ρ(L, I′), Pearson CC between L and I′
γ           ρ(I, I′), Pearson CC between I and I′
ρ           ρ(P, P′), Pearson CC between P and P′

Table 2: Notation used in the formal analysis.
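The workload-level quantities in Table 2 are straightforward to compute from execution feedback. A small pure-Python sketch (the per-plan CPU times below are illustrative, not from the paper's workloads) that derives η = σ_L/σ_I and α = ρ(L, I):

```python
from statistics import mean, pstdev

def pearson(xs, ys):
    """Sample Pearson correlation coefficient."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / (pstdev(xs) * pstdev(ys))

# Per-plan CPU time (seconds) spent in leaf (L) vs. internal (I)
# operators; illustrative numbers only.
L = [12.0, 30.0, 7.0, 45.0, 22.0]
I = [1.0, 2.5, 0.8, 3.9, 2.1]

eta = pstdev(L) / pstdev(I)   # eta = sigma_L / sigma_I
alpha = pearson(L, I)         # alpha = rho(L, I)
print(round(eta, 2), round(alpha, 3))
```

On these made-up numbers the leaf work dominates the variation (large η) and leaf and internal work are strongly positively correlated (α close to 1), the situation the analysis in Section 3 identifies as favorable.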
3. ANALYSIS
In this section, we present an analysis of the framework illustrated in Algorithm 1. We first discuss target performance metrics and then formalize the problem we will study. Based on our problem formulation, we provide answers to the following questions:

• When does the approach work and when may it not work?
• What can impact its performance and how can we improve it?
The first question is how to measure the performance of our approach. One could aim for reducing cost estimation errors, just like previous work on external cost modeling. However, in many applications such as query optimization or index tuning, we are not very interested in the specific numbers returned by cost models; we only care about whether we can compare query plans based on these numbers (i.e., estimated plan costs). Therefore, as long as we can distinguish plans by their relative costs we are satisfied. For example, we perhaps only need to know that one plan is 20% better/cheaper than the other one. This suggests that we primarily consider the correlation between the cost estimates and the actual costs of the plans. Therefore, we use the well-known Pearson correlation coefficient (Pearson CC) as our performance metric.
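The point that only relative comparisons matter can be made concrete: an external model that is off by a constant factor everywhere still orders plans perfectly, and Pearson CC captures exactly this. A small self-contained illustration (plan costs invented):

```python
from statistics import mean, pstdev

def pearson(xs, ys):
    """Pearson correlation coefficient, computed from its definition."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / (pstdev(xs) * pstdev(ys))

actual = [5.0, 20.0, 8.0, 50.0]    # actual plan costs (e.g., seconds)
scaled = [10 * x for x in actual]  # huge absolute error, same ranking
print(pearson(actual, scaled))     # ≈ 1.0: still perfect for comparisons
```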
Throughout this section, assume that we use leaf operators as the backbone operators, for which we have external cost models built using execution feedback. We present the notation used in the following analysis in Table 2.

We use P, L, and I to represent the total CPU time spent on the whole plan, the leaf operators, and the internal operators, respectively. Clearly, P = L + I, where

    L = Σ_{o ∈ L} act-cost(o)    (2)
and
    I = Σ_{o ∈ I} act-cost(o).    (3)

Similarly, P′ = L′ + I′, where, by Algorithm 1,

    L′ = Σ_{o ∈ L} (ext-cost(o) / act-cost(o_pivot)) · opt-cost(o_pivot)    (4)
and
    I′ = Σ_{o ∈ I} opt-cost(o).    (5)

To simplify our analysis, assume that the external cost models are perfect, namely, ext-cost(o) = act-cost(o) for any o ∈ L. (We will study the impact of cost modeling errors later.) Moreover, define a constant

    λ = opt-cost(o_pivot) / act-cost(o_pivot).    (6)

By Equation 4, it follows that

    L′ = λ · Σ_{o ∈ L} act-cost(o) = λ · L.    (7)

We are interested in the Pearson correlation coefficient ρ(P, P′) between P and P′. Based on the previous formulation, we have

    ρ = ρ(P, P′) = ρ(L + I, L′ + I′) = ρ(L + I, λ·L + I′).    (8)

With the notation in Table 2, we can prove the following lemma.

LEMMA 4. ρ only depends on η, η′, α, β, and γ. Specifically,

    ρ = (ηη′ + αη′ + βη + γ) / (√(η² + 2αη + 1) · √(η′² + 2βη′ + 1)).    (9)

PROOF. By Equation 8, we have

    ρ = ρ(L + I, λL + I′) = Cov(L + I, λL + I′) / (σ_{L+I} · σ_{λL+I′}).

By the definition of covariance,

    Cov(L + I, λL + I′) = E[(L + I) − E(L + I)][(λL + I′) − E(λL + I′)].

Using simple arithmetic calculation, we can obtain

    Cov(L + I, λL + I′) = λσ_L² + λ·Cov(L, I) + Cov(L, I′) + Cov(I, I′).

On the other hand, by the definition of variance,

    σ²_{L+I} = E[(L + I) − E(L + I)]² = σ_L² + 2·Cov(L, I) + σ_I².

Similarly, we have

    σ²_{λL+I′} = λ²σ_L² + 2λ·Cov(L, I′) + σ_I′².

Using the relationships

    Cov(L, I) = ρ(L, I)·σ_L·σ_I = ασ_Lσ_I,
    Cov(L, I′) = ρ(L, I′)·σ_L·σ_I′ = βσ_Lσ_I′,
    Cov(I, I′) = ρ(I, I′)·σ_I·σ_I′ = γσ_Iσ_I′,

it then follows that

    Cov(L + I, λL + I′) = λσ_L² + λασ_Lσ_I + βσ_Lσ_I′ + γσ_Iσ_I′,
    σ²_{L+I} = σ_L² + 2ασ_Lσ_I + σ_I²,
    σ²_{λL+I′} = λ²σ_L² + 2λβσ_Lσ_I′ + σ_I′².

As a result, we have

    ρ = (λσ_L² + λασ_Lσ_I + βσ_Lσ_I′ + γσ_Iσ_I′) / (√(σ_L² + 2ασ_Lσ_I + σ_I²) · √(λ²σ_L² + 2λβσ_Lσ_I′ + σ_I′²)).

Dividing both the numerator and the denominator by σ_Iσ_I′, we obtain

    ρ = (λ(σ_L/σ_I)(σ_L/σ_I′) + λα(σ_L/σ_I′) + β(σ_L/σ_I) + γ) / (√((σ_L/σ_I)² + 2α(σ_L/σ_I) + 1) · √(λ²(σ_L/σ_I′)² + 2λβ(σ_L/σ_I′) + 1)).

Since η = σ_L/σ_I and η′ = σ_L′/σ_I′ = λσ_L/σ_I′, it follows that

    ρ = (ηη′ + αη′ + βη + γ) / (√(η² + 2αη + 1) · √(η′² + 2βη′ + 1)).

This completes the proof of the lemma.

We can make several interesting observations based on Lemma 4. First, we have the following lower bounds for ρ that only depend on η and η′ (Theorem 5 and Corollary 6).

THEOREM 5. Define a function

    f(η, η′) = (ηη′ − η′ − η − 1) / ((η + 1)(η′ + 1)) = (1 − 1/η − 1/η′ − 1/(ηη′)) / ((1 + 1/η)(1 + 1/η′)).

For any 0 ≤ η, η′ < ∞, we have ρ ≥ f(η, η′).

PROOF. We have −1 ≤ α, β, γ ≤ 1. Based on Equation 9,

    ηη′ + αη′ + βη + γ ≥ ηη′ − η′ − η − 1,
    √(η² + 2αη + 1) ≤ √(η² + 2η + 1) = η + 1,
    √(η′² + 2βη′ + 1) ≤ √(η′² + 2η′ + 1) = η′ + 1.

As a result, it follows that

    ρ ≥ (ηη′ − η′ − η − 1) / ((η + 1)(η′ + 1)) = f(η, η′).

This completes the proof of the theorem.

COROLLARY 6. Define a function

    g(η, η′) = (η/(η + 1)) · (η′/(η′ + 1)) = 1/((1 + 1/η)(1 + 1/η′)).

If 0 ≤ α, β, γ ≤ 1, then ρ ≥ g(η, η′).

The proof is very similar and thus omitted. Clearly, g(η, η′) > f(η, η′). Intuitively, positive α, β, and γ suggest positive correlations between L, I, and I′, which is usually the case in real workloads (see Section 3.3.2).

Based on Theorem 5 and Corollary 6, we immediately have the following important observation:

OBSERVATION 7. If η ≫ 1 and η′ ≫ 1, we have f(η, η′) ≈ 1 and g(η, η′) ≈ 1. As a result, ρ ≈ 1.

That is, when both η and η′ are sufficiently large, we should expect very strong correlation between the estimated cost (using Algorithm 1) and the actual cost of a plan. More generally, it is easy to see that both f(η, η′) and g(η, η′) are increasing functions with respect to η and η′. This implies that we need to increase both η and η′ to improve ρ.

Recall that η = σ_L/σ_I whereas η′ = λσ_L/σ_I′. Here σ_L and σ_I are the standard deviations of the actual leaf and internal operator CPU time, whereas σ_I′ is the standard deviation of the optimizer's estimated internal cost. All three are constants for a given workload, so we cannot change η in Algorithm 1. On the other hand, η′ depends on λ as well, which depends on our choice of the pivot operator. Because η′ increases as λ increases, we should pick the pivot operator that maximizes λ, as defined by Equation 6. Algorithm 2 presents the details of our selection strategy for the pivot operator based on this idea.

Algorithm 2:
Pick the pivot operator.

Input: O, the operators with execution feedback.
Output: o_pivot, the pivot operator.
 1: o_pivot ← nil;
 2: λ ← 0;
 3: foreach o ∈ O do
 4:   λ_o ← opt-cost(o) / act-cost(o);
 5:   if λ_o > λ then
 6:     λ ← λ_o;
 7:     o_pivot ← o;
 8:   end
 9: end
10: return o_pivot;

[Figure 2: The distributions of η and η′ on real workloads. (a) Distribution of η; (b) Distribution of η′.]

While the previous analysis characterizes the connection between ρ, η, and η′, it does not tell us what we should expect in practice. We thus studied 36 real workloads in the context of index tuning with various physical designs (e.g., both row store and column store with necessary indexes) and with at least 10 queries (see Section 4 for details of our settings in index tuning).

Figure 2 presents the distributions of η and η′. We computed η′ by using Algorithm 2 to pick the pivot operator and therefore λ. We observe that η′ is much larger than η. It is therefore safe to ignore the factor 1/η′ in both f(η, η′) and g(η, η′). Consequently, we have the following approximations:

OBSERVATION 8. If η′ ≫ 1, we have 1/η′ ≈ 0. As a result,

    f(η, η′) ≈ (1 − 1/η)/(1 + 1/η) and g(η, η′) ≈ 1/(1 + 1/η).

Define f(η) = (1 − 1/η)/(1 + 1/η) and g(η) = 1/(1 + 1/η). Figure 3 depicts these two functions as η grows. We observe that both functions increase quickly as η grows. For example, when η = 10, f(η) ≈ 0.82 and g(η) ≈ 0.91. At η ≈ 18 (the median we observed on our workloads), f(η) ≈ 0.89 and g(η) ≈ 0.95.

Under the assumption that η′ is very large, we can conduct a more detailed analysis, as we will see next.

η′ Is Very Large
Dividing the numerator and the denominator in Equation 9 by η′ and using 1/η′ ≈ 0, we obtain the following:

    ρ ≈ (η + α) / √(η² + 2αη + 1) = (1 + α/η) / √(1 + 2α/η + (1/η)²).    (10)

Again, if η is sufficiently large, then 1/η ≈ 0 and thus ρ ≈ 1. We next view ρ as a function of η and α.

[Figure 3: Plots of f(η) and g(η) with the growth of η.]

[Figure 4: Plots of η₀ as a function of α.]

LEMMA 9. Assume that Equation 10 holds and η + α > 0. For a given 0 < ε < 1, there exists some η₀ s.t. if η > η₀ then ρ > 1 − ε. Specifically,

    η₀ = √((1 − α²) / (1/(1 − ε)² − 1)) − α.    (11)

PROOF. Using Equation 10, ρ > 1 − ε implies

    η + α > (1 − ε) · √(η² + 2αη + 1).

Given that η + α > 0, it follows that

    (η + α)² > (1 − ε)² · ((η + α)² + (1 − α²)).

Since 0 < 1 − ε < 1, we have 1/(1 − ε)² > 1. As a result,

    (η + α)² > (1 − α²) / (1/(1 − ε)² − 1).

Since η + α > 0, taking the square root completes the proof.

Lemma 9 suggests that there is a minimum η₀ such that ρ can be sufficiently high as long as η > η₀. We study two examples below:

• If ε = 0.05, i.e., we want to have ρ > 1 − ε = 0.95, then η₀ = √((1 − α²)/(1/0.95² − 1)) − α.
• If ε = 0.01, i.e., we want to have ρ > 1 − ε = 0.99, then η₀ = √((1 − α²)/(1/0.99² − 1)) − α.

Figure 4 plots η₀ as a function of −1 ≤ α ≤ 1 in the above two cases. Apparently, η₀ has a maximum η₀^max over −1 ≤ α ≤ 1. As long as η > η₀^max, we will have ρ > 1 − ε regardless of α. In fact, this is easy to prove using Equation 11. Specifically, we have the following theorem.

[Figure 5: Plots of η₀^max as a function of ε.]

THEOREM 10. η₀ achieves its maximum η₀^max when α = −√(1 − (1 − ε)²). In more detail, we have

    η₀^max = 1 / √(1 − (1 − ε)²).    (12)

PROOF. We can view η₀ as a function of α, i.e., η₀ = η₀(α). Define a constant C = 1/√(1/(1 − ε)² − 1). By Equation 11, we have

    η₀(α) = C·√(1 − α²) − α.

Taking derivatives of η₀(α), we obtain

    η₀′(α) = −Cα/√(1 − α²) − 1,    η₀″(α) = −C/(1 − α²)^{3/2}.

Since C > 0 and |α| ≤ 1, we have η₀″(α) < 0. Therefore, η₀ achieves its maximum when η₀′(α) = 0. Letting η₀′(α) = 0 gives

    α = −1/√(C² + 1) = −√(1 − (1 − ε)²).    (13)

Substituting Equation 13 into Equation 11 gives

    η₀^max = √(1 − (1 − ε)²) + (1 − ε)²/√(1 − (1 − ε)²) = 1/√(1 − (1 − ε)²).

This completes the proof of the theorem.

Continuing with the previous examples, by Theorem 10 we have:

• For ε = 0.05, η₀^max ≈ 3.2 when α ≈ −0.31;
• For ε = 0.01, η₀^max ≈ 7.1 when α ≈ −0.14.

These results can be easily verified in Figure 4. Moreover, by Equation 13, we have α → 0 as ε → 0. Meanwhile, η₀^max increases as ε decreases. In particular, η₀^max → ∞ as ε → 0. Figure 5 further plots η₀^max with respect to ε.

So far we have focused on the general case where −1 ≤ α ≤ 1. In practice, it is reasonable to assume a positive α, i.e., 0 ≤ α ≤ 1. (For the 36 workloads we studied in Section 3.3.2, we observed only one workload with a negative α.) Therefore, similar to Corollary 6, we can improve the result given by Theorem 10 for the case when 0 ≤ α ≤ 1.

COROLLARY 11. If 0 ≤ α ≤ 1, η₀ achieves its maximum η₀^{max,p} when α = 0 (the superscript p indicates a positive α):

    η₀^{max,p} = (1 − ε) / √(1 − (1 − ε)²).    (14)

PROOF. By the proof of Theorem 10, we have η₀′(α) = −Cα/√(1 − α²) − 1, where C > 0. If α ≥ 0, we have η₀′(α) < 0. Therefore, η₀(α) is a decreasing function of α. As a result, η₀ achieves its maximum when α = 0. Setting α = 0 in Equation 11 gives Equation 14.

Comparing Equation 14 with Equation 12 shows that η₀^{max,p} < η₀^max. This means that in the case of a positive α, which is the common case in practice, one can have a less stringent requirement on η to expect a high ρ. Figure 5 illustrates this difference. When ε → 0, however, η₀^{max,p} → η₀^max.
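These closed forms are easy to sanity-check numerically. The sketch below evaluates ρ via Equation 10, computes η₀ from Equation 11, and confirms that η₀ peaks at α = −√(1 − (1 − ε)²) with value 1/√(1 − (1 − ε)²) as stated by Theorem 10 (the tolerances and grid are arbitrary choices):

```python
import math

def rho(eta, alpha):
    """Equation 10: rho ~ (eta + alpha) / sqrt(eta^2 + 2*alpha*eta + 1)."""
    return (eta + alpha) / math.sqrt(eta ** 2 + 2 * alpha * eta + 1)

def eta0(alpha, eps):
    """Equation 11: minimum eta guaranteeing rho > 1 - eps (eta + alpha > 0)."""
    c = 1 / (1 - eps) ** 2 - 1
    return math.sqrt((1 - alpha ** 2) / c) - alpha

eps = 0.05
# Just above eta0, rho indeed exceeds 1 - eps for various alpha.
for alpha in [-0.9, -0.3, 0.0, 0.4, 0.9]:
    assert rho(eta0(alpha, eps) + 1e-4, alpha) > 1 - eps

alpha_star = -math.sqrt(1 - (1 - eps) ** 2)   # Equation 13
eta0_max = 1 / math.sqrt(1 - (1 - eps) ** 2)  # Equation 12
grid = [i / 1000 - 1 for i in range(1999)]    # alpha in [-1, 0.998]
assert abs(eta0(alpha_star, eps) - eta0_max) < 1e-9
assert max(eta0(a, eps) for a in grid) <= eta0_max + 1e-6
print(round(eta0_max, 2), round(alpha_star, 3))  # → 3.2 -0.312
```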
In this section, we started with a formal analysis of Algorithm 1, focusing on the correlation between the cost estimate returned by Algorithm 1 and the actual execution cost. We developed a lower bound for the correlation (Theorem 5 and Corollary 6) that only depends on η and η′, two quantities determined by workload-level properties. We then studied many real workloads in the context of index tuning and found that η′ is typically very large, so we only need to focus on the part that depends on η. We further performed a more detailed analysis in the presence of very large η′, and found a lower bound for η for a given level of correlation we want to achieve (Theorem 10 and Corollary 11).

As was shown in Figure 2, there is huge variance in the distribution of η on real workloads. Although 25 out of the 36 workloads have relatively large η, there are still 11 workloads with relatively small η. A natural question is then how large ρ is over these real workloads. In Figure 6, we present the distributions of both Pearson CC and Spearman CC on the 36 real workloads. Compared with the optimizer's cost estimates, the cost estimates returned by Algorithm 1 improve both correlation coefficients on average.

Discussion.
Given that η plays an important role in determining ρ, we further analyze its impact in more detail. By Equation 10, for a given α, we can view ρ as a function of η, namely, ρ = ρ(η). We have the following simple result.

LEMMA 12. Assume that Equation 10 holds. For a given −1 ≤ α ≤ 1, ρ is a non-decreasing function of η (0 ≤ η < ∞).

PROOF. Taking the derivative of ρ (Equation 10), we obtain

ρ′(η) = (1 − α²)/w³, where w = √(η² + 2αη + 1).

Since |α| ≤ 1 and w > 0, we have ρ′(η) ≥ 0. Therefore ρ is a non-decreasing function of η.

THEOREM 13. Assuming that Equation 10 holds, α ≤ ρ ≤ 1.

PROOF. By Lemma 12, ρ(η) is a non-decreasing function of η. Given that 0 ≤ η < ∞, we have ρ(0) ≤ ρ(η) ≤ ρ(∞). By Equation 10, ρ(0) = α whereas ρ(∞) = 1. This completes the proof of the theorem.

(Spearman CC is the rank-based version of Pearson CC. Compared to Pearson CC, Spearman CC is more robust when there are outliers, but it ignores the relative differences between costs.)

Figure 6: The distributions of Pearson CC and Spearman CC on real workloads using the optimizer's estimates vs. Algorithm 1: (a) Pearson CC using the optimizer's estimates; (b) Pearson CC using Algorithm 1 (i.e., ρ); (c) Spearman CC using the optimizer's estimates; (d) Spearman CC using Algorithm 1.

In particular, when |α| < 1, ρ is a strictly increasing function of η. When α = 1, ρ = 1; when α = −1, ρ = 1 if η > 1 and ρ = −1 if η < 1. In any case Theorem 13 holds. Figure 7 further plots the two functions ρ(η) = η/√(η² + 1) for α = 0 and ρ(η) = (η + 0.5)/√(η² + η + 1) for α = 0.5. It is clear that ρ ≥ α in both cases.

In Figure 8 we summarize our analysis. In the presence of a very large η′, ρ only depends on η and α. Given a desired threshold 0 < ε < 1 and a given −1 ≤ α ≤ 1 − ε, along the spectrum 0 ≤ η < ∞ there exists some η₀ such that ρ ≥ 1 − ε when η > η₀. On the other hand, if η ≤ η₀, then the weaker bound α ≤ ρ ≤ 1 − ε applies. Note that the condition α ≤ 1 − ε is necessary for η₀ ≥ 0 (see Equation 11). Remarks.
We have several remarks in order. First, it is straightforward to extend the analysis to the general case of backbone operators versus the rest (not just leaf operators versus internal ones). Second, so far we have focused on the case when η′ is very large. This may not hold on certain workloads; we present more analysis for that situation in Section 3.4. Third, so far we have assumed that there are no cost modeling errors for backbone operators, which is unlikely to be the case in practice. It is straightforward to extend the analysis by considering modeling errors, though the analytic formulas presented in this section would become more complicated.

Figure 7: Plots of ρ as a function of η for a fixed α.

Figure 8: Summary of the correlation analysis.

3.4 When η′ Is Not Large
So far we have focused on the special case when η′ is very large (more accurately, η′ ≈ ∞). One may also be interested in the case when this does not hold. In the following, we study this case in more detail.

By Equation 9, we can also view ρ as a function of η′:

ρ = ρ(η′) = A(Bη′ + C)/√((η′)² + 2βη′ + 1),   (15)

where A = 1/√(η² + 2αη + 1), B = η + α, and C = βη + γ. (As in Lemma 9, we assume η + α ≥ 0. Note that this automatically holds if η ≥ 1.) Taking the derivative, we obtain

ρ′(η′) = [ABv² − u(η′ + β)]/v³,   (16)

where u = A(Bη′ + C) and v = √((η′)² + 2βη′ + 1). Note that the derivative of v satisfies v′(η′) = (η′ + β)/v.

Now let ρ′(η′) = 0. We obtain

η′₀ = (βC − B)/(βB − C)   (17)
    = [(1 − β²)η + (α − βγ)]/(γ − αβ)
    = (1 − β²)/(γ − αβ) · η + (α − βγ)/(γ − αβ).
Using the relation ABv²|η′=η′₀ = u|η′=η′₀ · (η′₀ + β), which gives Bv²|η′=η′₀ = (Bη′₀ + C)(η′₀ + β), it then follows that

ρ(η′₀) = A(Bη′₀ + C)/v|η′=η′₀   (18)
       = AB · v|η′=η′₀/(η′₀ + β)
       = AB · √((η′₀)² + 2βη′₀ + 1)/(η′₀ + β)
       = AB · √(1 + (1 − β²)/(η′₀ + β)²)
       = √{[(η + α)² + (γ − αβ)²/(1 − β²)] / [(η + α)² + (1 − α²)]}.

Furthermore, differentiating Equation 16 we have

ρ″(η′) = {[AB(η′ + β) − u] · v² − 3(η′ + β) · [ABv² − u(η′ + β)]}/v⁵.
Again using the relation ABv²|η′=η′₀ = u|η′=η′₀ · (η′₀ + β), it follows that

ρ″(η′₀) = [AB(η′₀ + β) − u|η′=η′₀]/v³|η′=η′₀ = A(αβ − γ)/v³|η′=η′₀.   (19)

Assume β > 0. If γ > αβ, then ρ″(η′₀) < 0 and thus ρ(η′) attains its maximum at η′₀. On the other hand, if γ < αβ, then ρ″(η′₀) > 0 and thus ρ(η′) attains its minimum at η′₀. Moreover, if γ > β, then γ/β > 1 ≥ α, so γ > αβ always holds and ρ(η′) attains its maximum at η′₀. If instead 0 < γ ≤ β, we need to compare α with γ/β; as the proof of Corollary 15 below shows, the conditions η ≥ 1 and 0 < β, γ < 1 (together with η′₀ ≥ 0) already imply γ ≥ αβ. Hence, we have proved the following result:

THEOREM 14. If η ≥ 1 and 0 < β, γ < 1, then ρ(η′) attains its maximum ρmax = ρ(η′₀) (Equation 18) at η′₀, and attains its minimum ρmin at either

ρ(0) = AC = (βη + γ)/√(η² + 2αη + 1)   or   ρ(∞) = AB = (η + α)/√(η² + 2αη + 1).

We have ρmin ≤ ρ ≤ ρmax.

As an exercise, let us verify that ρ(η′₀) ≥ ρ(0) and ρ(η′₀) ≥ ρ(∞) when η ≥ 1. Clearly ρ(η′₀) ≥ ρ(∞). We now prove that ρ(η′₀) ≥ ρ(0) under the conditions of Theorem 14.

COROLLARY 15. If η ≥ 1 and 0 < β, γ < 1, then ρ(η′₀) ≥ ρ(0).

PROOF. First notice that we must have η′₀ ≥ 0. By Equation 17,

[(1 − β²)η + (α − βγ)]/(γ − αβ) ≥ 0.

If γ < αβ, then we have (1 − β²)η + (α − βγ) ≤ 0. Note that, under the condition 0 < β, γ < 1, γ < αβ would imply 0 < γ < αβ and thus α > 0, and

η ≤ (βγ − α)/(1 − β²) ≤ (β · αβ − α)/(1 − β²) = α(β² − 1)/(1 − β²) = −α,

which contradicts η ≥ 1 and α > 0. Consequently, we must have γ ≥ αβ. Hence (1 − β²)η + (α − βγ) ≥ 0, which gives

η ≥ (βγ − α)/(1 − β²).   (20)

Now define

Δ = ρ²(η′₀) − ρ²(0)
  = A² · [(η + α)² + (γ − αβ)²/(1 − β²) − (βη + γ)²]
  = A²/(1 − β²) · [(1 − β²) · ((η + α)² − (βη + γ)²) + (γ − αβ)²].
Next, let us consider

δ = (η + α)² − (βη + γ)²   (21)
  = (1 − β²)η² + 2(α − βγ)η + α² − γ².

(Case 1) Suppose that βγ − α ≥ 0. Plugging Equation 20 into Equation 21 (the quadratic in η is minimized exactly at η = (βγ − α)/(1 − β²)), it follows that

δ ≥ (1 − β²) · (βγ − α)²/(1 − β²)² + 2(α − βγ) · (βγ − α)/(1 − β²) + α² − γ²
  = [−(α − βγ)² + (α² − γ²)(1 − β²)]/(1 − β²)
  = (2αβγ − γ² − α²β²)/(1 − β²)
  = −(γ − αβ)²/(1 − β²).

Therefore, we can obtain

Δ ≥ A²/(1 − β²) · [(1 − β²) · (−(γ − αβ)²/(1 − β²)) + (γ − αβ)²]
  = A²/(1 − β²) · [−(γ − αβ)² + (γ − αβ)²] = 0.

(Case 2) On the other hand, suppose that βγ − α < 0. This implies α > βγ and hence α² > β²γ². Consequently, by Equation 21 and the condition η ≥ 1,

δ ≥ (1 − β²)η² + γ²β² − γ² ≥ (1 − β²) + γ²(β² − 1) = (1 − β²)(1 − γ²) ≥ 0.

As a result, it follows that

Δ ≥ A²(γ − αβ)²/(1 − β²) ≥ 0.

Combining Case 1 and Case 2, it always holds that Δ ≥ 0. Therefore ρ(η′₀) ≥ ρ(0). This completes the proof.
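The behavior established above can also be checked numerically. The Python sketch below is our own verification aid, not part of the paper's framework; the parameter values are arbitrary choices satisfying η ≥ 1 and 0 < β, γ < 1. It evaluates ρ(η′) from Equation 15, locates the stationary point η′₀ of Equation 17, and confirms that η′₀ is the maximizer and that ρ(η′₀) dominates both ρ(0) and ρ(∞).

```python
import math

def rho_prime(etap, eta, alpha, beta, gamma):
    # rho as a function of eta' (Equation 15).
    A = 1 / math.sqrt(eta**2 + 2*alpha*eta + 1)
    B = eta + alpha
    C = beta*eta + gamma
    return A * (B*etap + C) / math.sqrt(etap**2 + 2*beta*etap + 1)

# Sample parameters with eta >= 1 and 0 < beta, gamma < 1 (made-up values).
eta, alpha, beta, gamma = 5.0, 0.3, 0.4, 0.6
B, C = eta + alpha, beta*eta + gamma
etap0 = (beta*C - B) / (beta*B - C)          # stationary point (Equation 17)
assert etap0 >= 0

grid = [i * 0.01 for i in range(0, 100001)]  # 0 <= eta' <= 1000
vals = [rho_prime(x, eta, alpha, beta, gamma) for x in grid]
peak = rho_prime(etap0, eta, alpha, beta, gamma)
# Theorem 14: the maximum is attained at etap0 ...
assert all(v <= peak + 1e-9 for v in vals)
# ... and Corollary 15: rho(etap0) >= rho(0); also rho(etap0) >= rho(infinity).
assert peak >= rho_prime(0, eta, alpha, beta, gamma)
assert peak >= rho_prime(1e9, eta, alpha, beta, gamma)
```

For these parameters, η′₀ = 8.875, which agrees with the equivalent form [(1 − β²)η + (α − βγ)]/(γ − αβ) = 4.26/0.48.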
4. CASE STUDY: INDEX TUNING
We now present a case study that applies the framework in Algorithm 1 to index tuning. We implemented a prototype system piggybacked on the index tuning framework presented in [10], which has been incorporated into Microsoft SQL Server [1, 12].
Figure 9: Architecture of index tuning using “what-if” utility.
Figure 10: Architecture of index tuning using feedback.
There has been extensive research in the area of index tuning over the past two decades [12]. Major commercial database systems, including Oracle [14], IBM DB2 [25], and Microsoft SQL Server [10], are all equipped with index tuning tools. At the highest level, index tuning is very similar to query optimization. Existing tools adopt a cost-based approach that picks, from a number of candidates, the index configuration with the minimum optimizer-estimated cost for a given workload consisting of multiple queries [10]. A basic step is then to estimate the cost of a query over a candidate index configuration. While different systems may have different implementations, they essentially rely on the so-called "what-if" utility, which allows the optimizer to generate query plans and estimate their costs for a given query using "hypothetical" or "virtual" indexes that are not actually materialized [11]. Figure 9 outlines this architecture. The index tuning component sends a query Q together with the description of a candidate index configuration C to the optimizer. The "what-if" utility simulates C by generating the hypothetical indexes that C contains. That is, it generates all metadata and statistics information about these indexes and makes them visible to the optimizer (only in this particular session). The key observation is that query optimization does not require physical persistence of these indexes: The optimizer only needs metadata and statistics to estimate costs for query plans that use these indexes. The best plan P chosen by the optimizer under the configuration C, along with the optimizer's cost estimate for P, is returned to the index tuning component.

Our idea of incorporating feedback into the architecture of index tuning is simple. As illustrated in Figure 10, after the optimizer returns the best plan P, we refine its cost estimate using external cost models built on top of available execution feedback (i.e., Algorithm 1).
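The feedback-refined loop of Figures 9 and 10 can be sketched as follows. This is a hypothetical illustration: `tune`, `what_if_optimize`, and `refine_cost` are placeholder names of ours, not actual Microsoft SQL Server or tuning-tool interfaces, and the costs in the demonstration are made up.

```python
def tune(workload, candidate_configs, what_if_optimize, refine_cost, feedback):
    """Pick the candidate configuration with minimum refined workload cost.

    `what_if_optimize(query, config)` stands in for the optimizer's what-if
    call of Figure 9 and returns (plan, optimizer_cost_estimate);
    `refine_cost(plan, feedback)` stands in for Algorithm 1.
    """
    best_config, best_cost = None, float("inf")
    for config in candidate_configs:
        total = 0.0
        for query in workload:
            plan, _optimizer_cost = what_if_optimize(query, config)
            # Passive refinement: the optimizer's plan choice is untouched;
            # only the cost estimate reported to the tuner is replaced.
            total += refine_cost(plan, feedback)
        if total < best_cost:
            best_config, best_cost = config, total
    return best_config

# Toy demonstration with fabricated plans and costs.
plans = {("q1", "I1"): ("p11", 10.0), ("q1", "I2"): ("p12", 8.0)}
what_if = lambda q, c: plans[(q, c)]
refine = lambda plan, fb: fb.get(plan, 1.0)   # feedback-corrected cost
print(tune(["q1"], ["I1", "I2"], what_if, refine, {"p11": 3.0, "p12": 9.0}))
# prints I1: the refined estimates reverse the optimizer's original ranking
```

Note how the refined estimates flip the preference between I1 and I2 relative to the raw optimizer estimates, which is exactly the effect exploited below.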
We then send the refined cost estimate, instead of the original cost estimate from the optimizer, to the index tuning component. Note that our approach is passive rather than proactive: We do not use external cost models inside the query optimizer to affect its plan choice. As a result, if the query optimizer ends up choosing a poor execution plan because of bad cost estimates, we cannot bail it out. However, by refining cost estimates afterwards, we increase the chance of detecting such bad plans and therefore avoiding the corresponding disastrous index configurations that may lead to serious query performance regression.

Name     Size    #Queries   η       η′
TPC-DS   10GB    99         56.6    …
TPC-H    10GB    22         51.2    …
Real-1   40GB    25         1.7     …
Real-2   60GB    12         429.9   …
Real-3   100GB   20         217.9   …

Table 3: Workloads used in experimental evaluation. η and η′ are averaged over all queries in the workload.

The effectiveness of Algorithm 1 relies on the following factors: (1) the backbone operators O; (2) the operator-level cost models M; and (3) the execution feedback F. For (1), as we have discussed in Section 2.3, we use leaf operators as backbone operators. For (2), we use the operator-level modeling approach presented in [20], as it represents the state of the art to the best of our knowledge. For (3), we assume a sufficient amount of execution feedback is available for leaf operators.

We used both synthetic and real database workloads in our evaluation. For synthetic data, we used both TPC-H and TPC-DS at the scale of 10GB; for real data, we used three customer databases Real-1, Real-2, and Real-3. Table 3 presents the details of the workloads we used and their characteristics; η′ is very large over all of these workloads. We focused on single-query workloads and conducted experiments on a workstation configured with two Intel E5-2650L CPUs and 192GB main memory, running Microsoft SQL Server 2017 under 64-bit Windows Server 2012. Initial Configuration.
Since index tuning needs to start from an initial configuration, we generated various initial configurations for our experiments in the following way. For each query q in the workload, we generated different index configurations by limiting the number of indexes recommended by the index tuning tool (without using execution feedback). Specifically, we kept asking the index tuning tool to return the next index until it ran out of recommendations. Suppose that the indexes recommended, in order, are i₁, ..., iₙ. We then have n configurations I₁ = {i₁}, I₂ = {i₁, i₂}, ..., and Iₙ = {i₁, i₂, ..., iₙ}. We used each of these n configurations as a different initial configuration. Execution Feedback.
We generated execution feedback in the following manner. For each initial configuration, we ran the query and collected its execution time. For this purpose, we enabled the "statistics XML" utility provided by Microsoft SQL Server to track operator-level execution information. We then randomly picked up to five executed query plans and added them to the execution feedback repository F. Performance Metrics.
We evaluate both the effectiveness of Algorithm 1 and the overall improvement of index tuning when execution feedback is utilized, using the following metrics:

1. (Effectiveness of Algorithm 1) Following our discussion in Section 3.1, we use the Pearson and Spearman correlation coefficients as our performance metrics.

2. (Overall improvement) We measure the relative improvement of the index configuration Inew returned by index tuning over the original index configuration Iold, defined as follows. Let c(q, I) and a(q, I) be the estimated and actual execution costs of q over a configuration I, respectively.
Figure 11: Comparison of correlation coefficients using cost estimates from Algorithm 1 versus using the optimizer's estimates.
(a) Index tuning without feedback.
(b) Index tuning with feedback.
Figure 12: The distributions of CPU time improvement over TPC-DS queries (estimated improvement threshold τ = 0).

We define the estimated improvement of Inew over Iold as

c(Iold, Inew) = (c(q, Iold) − c(q, Inew))/c(q, Iold) = 1 − c(q, Inew)/c(q, Iold).

We also define the actual improvement of Inew over Iold as

a(Iold, Inew) = (a(q, Iold) − a(q, Inew))/a(q, Iold) = 1 − a(q, Inew)/a(q, Iold).

We use the actual improvement as our metric, whereas the estimated improvement is useful for controlling the index tuning component, as we will see.
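The improvement metrics are simple ratios, as the following sketch shows. The costs here are made-up numbers for illustration only; `improvement` is our own helper name.

```python
def improvement(old_cost, new_cost):
    # Relative improvement of I_new over I_old: 1 - cost(new)/cost(old).
    return 1.0 - new_cost / old_cost

# Estimated improvement uses estimated costs c(q, I); actual improvement
# uses measured costs a(q, I). A configuration is recommended only if the
# estimated improvement exceeds the threshold tau (discussed below).
c_old, c_new = 100.0, 70.0    # estimated costs (made-up)
a_old, a_new = 90.0, 72.0     # actual costs (made-up)
tau = 0.2
est, act = improvement(c_old, c_new), improvement(a_old, a_new)
print(est, act, est > tau)    # estimated 30%, actual 20%, recommended
```

A negative improvement value indicates a performance regression, which is exactly the case the feedback-refined estimates aim to avoid.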
Results.
Figure 11 presents the correlation coefficients between estimated costs and actual CPU times. We compare the correlation coefficients using cost estimates produced by Algorithm 1 against those using the optimizer's estimates. We observe significant improvement on four of the five workloads. This implies that the cost estimates from Algorithm 1 are considerably better than the optimizer's original cost estimates.

In Figures 12, 13, and 14, we present the distributions of the actual improvement (averaged over all tested configurations) for TPC-DS queries when using execution feedback, compared with index tuning without feedback.

(a) Index tuning without feedback. (b) Index tuning with feedback.

Figure 13: The distributions of CPU time improvement over TPC-DS queries (estimated improvement threshold τ = 0.1).

(a) Index tuning without feedback. (b) Index tuning with feedback.

Figure 14: The distributions of CPU time improvement over TPC-DS queries (estimated improvement threshold τ = 0.2).

In index tuning, there is usually a threshold τ for estimated improvement, and a configuration is recommended only if its estimated improvement is above the threshold. In our experiments, we therefore varied this threshold from 0 to 0.2 (i.e., 20% estimated improvement). We have the following observations.

First, using execution feedback in index tuning significantly reduces the chance of query performance regression. The number of cases with 20% performance regression (i.e., −20% improvement) is reduced from 64 to 33 (48.4% reduction) when τ = 0, from 63 to 21 (66.7% reduction) when τ = 0.1, and from 61 to 13 (78.7% reduction) when τ = 0.2.

Second, when we increase the estimated improvement threshold τ, the chance of performance regression decreases for index tuning both with and without execution feedback. However, it decreases much faster for index tuning using execution feedback. This implies that, while index tuning with execution feedback can still misestimate performance improvement, the misestimation is marginal compared to that of the query optimizer's cost estimates. Such marginal misestimation is more likely to be overcome by slightly increasing the threshold for estimated improvement (20% is a rule of thumb in practice).

Third, by comparing the figures for index tuning with and without execution feedback, we also observe that the estimated improvement is diminished in more cases when execution feedback is utilized. (Notice that there are more cases in the bin with less than 20% estimated improvement.) However, cases with more significant improvement (≥ 20%) are largely preserved. Remarks.
Let X be the relative error between the estimated and actual costs of an individual query. That is,

X = (c(q, I) − a(q, I))/a(q, I) = c(q, I)/a(q, I) − 1.

We have c(q, I) = (1 + X) · a(q, I). Suppose that E[X] = ε. Taking expectations (and assuming that the relative error X is independent of the actual cost a(q, I)), we have

E[c(q, I)] = (1 + E[X]) · E[a(q, I)] = (1 + ε) · E[a(q, I)].

Since c(W, I) = Σₖ₌₁ⁿ c(qₖ, I) · wₖ, it follows that

c(W, I) = Σₖ₌₁ⁿ (1 + Xₖ) · a(qₖ, I) · wₖ.

Taking expectations, we have

E[c(W, I)] = Σₖ₌₁ⁿ (1 + E[Xₖ]) · E[a(qₖ, I)] · wₖ.

Since E[Xₖ] = ε regardless of k, it follows that

E[c(W, I)] = (1 + ε) · Σₖ₌₁ⁿ E[a(qₖ, I)] · wₖ = (1 + ε) · E[a(W, I)].

We have therefore proved the following lemma:

LEMMA 16. If the expected relative error at the query level is ε, then the expected relative error at the workload level is also ε.

Now consider the estimated and actual improvements of Inew over Iold, in terms of expectation:

c(Iold, Inew) = (E[c(q, Iold)] − E[c(q, Inew)])/E[c(q, Iold)] = 1 − E[c(q, Inew)]/E[c(q, Iold)],

and

a(Iold, Inew) = (E[a(q, Iold)] − E[a(q, Inew)])/E[a(q, Iold)] = 1 − E[a(q, Inew)]/E[a(q, Iold)].

Since E[c(q, Iold)] = (1 + ε) · E[a(q, Iold)] and E[c(q, Inew)] = (1 + ε) · E[a(q, Inew)], we have

c(Iold, Inew) = 1 − (1 + ε) · E[a(q, Inew)]/((1 + ε) · E[a(q, Iold)]) = 1 − E[a(q, Inew)]/E[a(q, Iold)] = a(Iold, Inew).

Therefore, we have the following observation:

LEMMA 17. In expectation, the estimated improvement observed at the query level carries over to the actual improvement.
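The observation above rests on the (1 + ε) factors canceling in the improvement ratio. The tiny sketch below illustrates this cancellation with made-up costs and an arbitrary ε; it is an illustration of ours, not part of the paper's evaluation.

```python
eps = 0.25   # relative error: every estimate is (1 + eps) times the actual cost

a_old, a_new = 80.0, 50.0                  # actual costs (made-up numbers)
c_old, c_new = (1 + eps) * a_old, (1 + eps) * a_new   # estimated costs

est = 1 - c_new / c_old                    # estimated improvement
act = 1 - a_new / a_old                    # actual improvement
assert abs(est - act) < 1e-12              # the (1 + eps) factors cancel
print(est, act)
```

Of course, this holds in expectation only; for any individual query the realized relative errors of the two configurations can differ, which is why the empirical distributions in Figures 12 through 14 still show residual misestimation.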
5. RELATED WORK
Recently, the problem of estimating query execution time has attracted substantial research attention. Existing work largely relies on statistical learning techniques [3, 4, 15, 17, 20, 26, 27]. The improved cost estimates are useful in a variety of applications, such as admission control [24] and query scheduling [2].

The idea of using execution feedback in query optimization has also been explored in the literature. Existing work focuses on improving cardinality estimates, using the exact cardinalities observed during execution (e.g., [18, 21]), statistics built on top of observed cardinalities (e.g., [5]), or sampling (e.g., [19, 28]). Although this line of work also aims at improving cost estimates (by improving cardinality estimates), its ultimate goal is to affect the decision made by the optimizer so that it may return a different, perhaps better, execution plan. This differs from the goal of the previous work on external cost modeling we have discussed, which does not aim to change the query plan.

One notable problem when applying feedback, as documented in the literature [13], is that partial execution feedback may result in inconsistent cost estimates that mislead the query optimizer. That is, if some plans receive improved cost estimates whereas the others do not, then the plan returned by the optimizer might be even worse. One reason is that, although the optimizer can estimate costs appropriately for query plans with execution feedback, it may underestimate costs for plans without feedback. Our analysis in Section 3 indicates that, as long as we have feedback for backbone operators (specifically, when both η and η′ are large), the comparison of plan costs remains reliable.
It would be interesting to study the efficacy of our framework when applied to query optimization.

The problem of index tuning (or, in general, database physical design tuning) has been studied for more than two decades [16]. Existing index tuning tools use a cost-based architecture that piggybacks on top of the query optimizer's cost estimates [10, 25]. At the highest level, index tuning tools search for an index configuration, from a number of candidates, that minimizes the total estimated workload cost. There has been work on different search strategies, such as bottom-up [10], top-down [6, 8], and constraint-based [9] approaches. However, to the best of our knowledge, there is no previous work that utilizes improved cost estimates in index tuning. The framework we studied in this paper can be easily combined with any of the previous search frameworks. Moreover, there has also been exploratory work towards continuous index tuning [7, 22, 23], though such frameworks have not been incorporated into existing index tuning tools as far as we know.
6. CONCLUSION AND FUTURE WORK
In this paper, we have studied the problem of operator-level cost modeling with query execution feedback. We focused on two major challenges, sparse feedback and mixed cost estimates, and proposed a general framework that deals with both. We analyzed this framework in detail and further studied its efficacy in the context of index tuning.

This work opens up a number of interesting directions for future exploration. First, as we have briefly discussed in Section 5, it is interesting to further investigate the applicability of our framework in query optimization, where it may help obtain better query plans. Second, although it is straightforward to integrate our framework with the existing index tuning architecture, doing so may have an impact on the search space of index tuning. In this work, we have only investigated this impact experimentally; a more formal analysis of the change in the search space is desirable for a deeper understanding. Third, although our work is orthogonal to the concrete external cost model being used, in practice different models can make a difference, and much work remains on model selection and tuning for a given workload. All of these are promising directions for future work.
7. REFERENCES

[1] S. Agrawal, S. Chaudhuri, L. Kollár, A. P. Marathe, V. R. Narasayya, and M. Syamala. Database tuning advisor for Microsoft SQL Server 2005: demo. In SIGMOD, pages 930–932, 2005.
[2] M. Ahmad, A. Aboulnaga, S. Babu, and K. Munagala. Interaction-aware scheduling of report-generation workloads. The VLDB Journal, 20:589–615, 2011.
[3] M. Ahmad, S. Duan, A. Aboulnaga, and S. Babu. Predicting completion times of batch query workloads using interaction-aware models and simulation. In EDBT, pages 449–460, 2011.
[4] M. Akdere, U. Çetintemel, M. Riondato, E. Upfal, and S. B. Zdonik. Learning-based query performance modeling and prediction. In ICDE, pages 390–401, 2012.
[5] N. Bruno and S. Chaudhuri. Exploiting statistics on query expressions for optimization. In SIGMOD, 2002.
[6] N. Bruno and S. Chaudhuri. Automatic physical database tuning: A relaxation-based approach. In SIGMOD, pages 227–238, 2005.
[7] N. Bruno and S. Chaudhuri. An online approach to physical design tuning. In ICDE, pages 826–835, 2007.
[8] N. Bruno and S. Chaudhuri. Physical design refinement: The 'merge-reduce' approach. ACM Trans. Database Syst., 32(4):28, 2007.
[9] N. Bruno and S. Chaudhuri. Constrained physical design tuning. PVLDB, 1(1):4–15, 2008.
[10] S. Chaudhuri and V. R. Narasayya. An efficient cost-driven index selection tool for Microsoft SQL Server. In VLDB, pages 146–155, 1997.
[11] S. Chaudhuri and V. R. Narasayya. AutoAdmin 'what-if' index analysis utility. In SIGMOD, pages 367–378, 1998.
[12] S. Chaudhuri and V. R. Narasayya. Self-tuning database systems: A decade of progress. In VLDB, pages 3–14, 2007.
[13] S. Chaudhuri, V. R. Narasayya, and R. Ramamurthy. A pay-as-you-go framework for query execution feedback. PVLDB, 1(1):1141–1152, 2008.
[14] B. Dageville, D. Das, K. Dias, K. Yagoub, M. Zaït, and M. Ziauddin. Automatic SQL tuning in Oracle 10g. In VLDB, pages 1098–1109, 2004.
[15] J. Duggan, U. Çetintemel, O. Papaemmanouil, and E. Upfal. Performance prediction for concurrent database workloads. In SIGMOD, 2011.
[16] S. J. Finkelstein, M. Schkolnick, and P. Tiberio. Physical database design for relational databases. ACM Trans. Database Syst., 13(1):91–128, 1988.
[17] A. Ganapathi, H. A. Kuno, U. Dayal, J. L. Wiener, A. Fox, M. I. Jordan, and D. A. Patterson. Predicting multiple metrics for queries: Better decisions enabled by machine learning. In ICDE, 2009.
[18] N. Kabra and D. J. DeWitt. Efficient mid-query re-optimization of sub-optimal query execution plans. In SIGMOD, pages 106–117, 1998.
[19] P. Larson, W. Lehner, J. Zhou, and P. Zabback. Cardinality estimation using sample views with quality assurance. In SIGMOD, pages 175–186, 2007.
[20] J. Li, A. C. König, V. R. Narasayya, and S. Chaudhuri. Robust estimation of resource consumption for SQL queries using statistical techniques. PVLDB, 5(11):1555–1566, 2012.
[21] V. Markl, V. Raman, D. E. Simmen, G. M. Lohman, and H. Pirahesh. Robust query processing through progressive optimization. In SIGMOD, pages 659–670, 2004.
[22] K. Sattler, I. Geist, and E. Schallehn. QUIET: continuous query-driven index tuning. In VLDB, 2003.
[23] K. Schnaitter, S. Abiteboul, T. Milo, and N. Polyzotis. COLT: continuous on-line tuning. In SIGMOD, pages 793–795, 2006.
[24] S. Tozer, T. Brecht, and A. Aboulnaga. Q-Cop: Avoiding bad query mixes to minimize client timeouts under heavy loads. In ICDE, 2010.
[25] G. Valentin, M. Zuliani, D. C. Zilio, G. M. Lohman, and A. Skelley. DB2 advisor: An optimizer smart enough to recommend its own indexes. In ICDE, pages 101–110, 2000.
[26] W. Wu, Y. Chi, H. Hacigümüs, and J. F. Naughton. Towards predicting query execution time for concurrent and dynamic database workloads. PVLDB, 6(10):925–936, 2013.
[27] W. Wu, Y. Chi, S. Zhu, J. Tatemura, H. Hacigümüs, and J. F. Naughton. Predicting query execution time: Are optimizer cost models really unusable? In ICDE, 2013.
[28] W. Wu, J. F. Naughton, and H. Singh. Sampling-based query re-optimization. In SIGMOD, pages 1721–1736, 2016.