ClassyTune: A Performance Auto-Tuner for Systems in the Cloud
Yuqing Zhu, Member, IEEE, and Jianxun Liu
Abstract—Performance tuning can improve system performance and thus enable a reduction of the cloud computing resources needed to support an application. Due to the ever-increasing number of parameters and complexity of systems, there is a need to automate performance tuning for complicated systems in the cloud. The state-of-the-art tuning methods adopt either the experience-driven tuning approach or the data-driven one. Data-driven tuning is attracting increasing attention, as it has wider applicability. But existing data-driven methods cannot fully address the challenges of sample scarcity and high dimensionality simultaneously. We present ClassyTune, a data-driven automatic configuration tuning tool for cloud systems. ClassyTune exploits the machine learning model of classification for auto-tuning. This exploitation enables the induction of more training samples without increasing the input dimension. Experiments on seven popular systems in the cloud show that ClassyTune can effectively tune system performance to seven times higher for a high-dimensional configuration space, outperforming expert tuning and the state-of-the-art auto-tuning solutions. We also describe a use case in which performance tuning enables the reduction of 33% of the computing resources needed to run an online stateless service.
Index Terms—Performance tuning, auto-tuning, autotuner, data-driven tuning, experience-driven tuning, performance modeling
1 INTRODUCTION

Cloud computing has facilitated the deployment of systems for big data analytics and Web services. For an efficient exploitation of the cloud computing resources, we can either choose for a specific task [1] the most cost-effective cloud configuration, i.e., the types and numbers of virtual machine instances; or, we can optimize the system performance for a specific deployment setting so as to reduce the total computing resources in demand [2]. In fact, modern systems are exposing an increasing number of configurable parameters that can have strong impacts on system performance and thus are denoted as
PerfConfs, e.g., innodb_buffer_pool_size and executor.cores in Figure 1. Well tuning the PerfConfs of a system can lead to multiple times of performance speedup [3], requiring no change to the system design. Unfortunately, to meet the diversity of applications and deployment settings, the number and the complexity of PerfConfs have increased to a level exceeding the comprehension capability of human beings [4]. We see an emerging need for automating the tuning of PerfConfs [5], [6], [7] for much higher system performance.

Existing solutions to auto-tuning PerfConfs for systems in the cloud are either experience-driven or data-driven. Approaches based on heuristics-guided search [8], [9] and analytical modeling [10], [11] rely heavily on human experience and knowledge, belonging to the experience-driven category. Experience-driven tuning requires human intervention for each specific case and has limited applicability. Approaches using Bayesian optimization [12] or other machine-learning models [13] exploit data to train models for optimization, thus falling into the data-driven category.

• Y. Zhu is the corresponding author. E-mail: [email protected].
• Y. Zhu is with the Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China.
• J. Liu is with UTuned Technology Company Limited, Beijing, China.

Manuscript received 25 Dec. 2018; revised 18 Jun. 2019.
Figure 1. Performance-PerfConf curves are nonlinear, nonsmooth, and system-/workload-/environment-specific: (a) MySQL performance under different workloads; (b) Spark performance under different environments.

Data-driven tuning can be applied wherever sufficient tuning samples are provided, thus attracting increasing popularity [7], [14]. However, running tuning tests in the cloud and collecting large samples are expensive due to the pay-as-you-go cost, while the sample size required is in proportion to the dimension of the configuration space [15]. Sample scarcity and high dimensionality pose two challenges to data-driven configuration tuning.

In this paper, we take the data-driven approach to address the problem of auto-tuning performance for systems in the cloud through adjusting PerfConf settings. Our main idea is to tackle performance tuning as a comparison problem and model the performance comparison relations of the limited samples. In contrast to the common exploitation of performance prediction models [16], we adopt the classification method for the comparison modeling, as it brings two benefits that directly address the sample scarcity challenge of data-driven tuning. First, the classification model for the comparison problem can have a training set quadratically as large as the original sample set, as it takes pairs of original samples as input and such pairs can be constructed by permuting every pair of the original samples. Second, we can generate even more training samples based on manual tuning experiences. As the manual tuning process usually goes through numerous trials and comparisons, the tuning experiences are usually summarized in comparison-based rules, e.g., increasing memory cache sizes leads to higher performances. We can generate more training samples for the classification models based on such tuning rules, while this is impossible for performance prediction modeling [17].

But two problems remain to be solved. The first is about dimensionality, i.e., how to effectively represent the input without increasing its dimensions. If we directly concatenate two PerfConf settings, the input dimension for the model is increased to twice the original one, leading again to sample scarcity [15]. If we take the division or difference of a PerfConf pair, different pairs will collide, leading to different inputs mapped to one same output. We propose to induce samples by constructing a bijection from a 2d-dimensional space to a d-dimensional one. The second is about model accuracy, i.e., how to find the best PerfConf setting using an imprecise model. This found PerfConf setting should lead to the best system performance within a given time and computing resources. Machine learning models are generally not a hundred percent accurate [15]. Even if we train a model with enough samples, this model can still mistakenly distinguish some comparison relations. We must robustly find a best PerfConf setting even if some predictions are incorrect. This best PerfConf setting should lead to better performance. We propose a clustering-based tuning algorithm that can exploit the imprecise classification model.

We thus present ClassyTune, which is, to the best of our knowledge, the first automatic performance tuning system that exploits a classification model to find the best PerfConf setting within a limited sample set. In ClassyTune, we use a classifier to predict whether one PerfConf setting has a better performance than another.
Taking this classification approach, ClassyTune can construct a useful model for auto-tuning with only a limited number of original PerfConf-performance samples, while common auto-tuning methods would require tens of times more samples [2], [12], [18]. The classifier model can make a prediction in a time multiple orders of magnitude shorter than a tuning test actually takes to run. We can thus use the model as a surrogate of the system and take a systematic approach towards tuning with an imprecise model.

ClassyTune consists of three components for sampling, modeling and searching respectively. The sampling component outputs a database of PerfConf-performance samples; the modeling component outputs a classification-based model; and the searching component finds the PerfConf setting with the highest performance in best effort. Decoupling the system into three components allows the reuse of the intermediate tuning outputs, i.e., the database and the model. As a result, ClassyTune can be used not only for tuning, but also for system analysis. The intermediate outputs, especially the model, can inform users about relations between PerfConfs and performance.

In this work, we make the following contributions:
• We propose a data-driven performance auto-tuning approach, unprecedentedly adopting a classification model for representing the performance comparison relations between PerfConf settings (§4).
• We propose to address the input dimension problem through sample induction that constructs a bijection based on Cantor's proof (§4.2).
• We propose a clustering-based auto-tuning method that exploits the imprecise classification model (§5).
• We implement the above solutions in ClassyTune (§6, §7).
• We present a customer's use case to show how ClassyTune can be used and help users reduce the cloud computing resources needed to run an online stateless service.

2 MOTIVATION AND RELATED WORK
This section examines the modeling challenges for the data-driven methods of automatic performance tuning based on PerfConf setting adjustments. These challenges motivate our work over the related works, which are summarized at the end of this section.
PerfConf-performance curves are formed by taking PerfConfs as input and the system performance as output. Different systems have different performance curves. In fact, this curve is not only related to the system, but also very sensitive to the workloads, the deployment environments and the computing resources [3]. Figure 1 plots the curves for the database system MySQL and the distributed online processing system Spark.

Among the four plotted curves, two for MySQL and two for Spark, none demonstrates linearity. The performance is not in direct proportion to the PerfConf input. For example, Figure 1a plots the throughput of MySQL under the two workloads of read-only and TPC-C, given buffer_pool_size as input. The throughputs of MySQL are not directly proportional to the size of the buffer pool. Figure 1b plots the job durations for Spark under the standalone and cluster deployments respectively. The performances demonstrate no linearity with the number of executor cores either.

Even for the same system, changes to the workload, the deployment environment, or the computing resources can also lead to different PerfConf-performance curves. Changing the workload from read-only to TPC-C leads to two completely different performance curves for MySQL, as shown in Figure 1a. Changing the deployment from the standalone mode to the cluster mode also changes the shape of Spark's performance curve, as illustrated in Figure 1b.

Generally, it would not be wise to use linear models to map PerfConf-performance relations due to non-linearity. As the system, workload, environment, and computing resources are factors influencing the curve shape, PerfConf-performance models should be constructed with regard to a specific combination of these factors, making model reuse infeasible. In sum, tuning tests and samples must be collected specifically for such a combination, leading to the sample scarcity challenge.

Figure 2. Highly inaccurate performance predictions due to limited samples, but adding samples reduces errors: (a) max differences between predicted and real performances; (b) errors reduced as samples are added (Hadoop-KMeans/RFR).

Bayesian optimization (BO) guides its sampling and search process with an acquisition function. The common application of BO adopts a Gaussian process prior to get a closed-form acquisition function. Unfortunately, this adoption requires the objective function to be differentiable. But not every objective function is differentiable. In fact, it has been shown that the performance surfaces of several popular cloud systems are non-differentiable [3]. The dissatisfaction of this assumption can invalidate an optimization process based on BO.
Data-driven auto-tuning methods commonly exploit machine learning algorithms for modeling. We illustrate the sample size challenge to the common performance-prediction based modeling [16]. We model the PerfConf-performance relation by three machine learning methods. As performance is a continuous value, these models are regression models, including the boosted decision tree (BCART), support vector regression (SVR) and random forest regression (RFR). The decision tree model CART is effective in performance modeling for simple systems [16] and has thus recently been applied to performance tuning [20]. SVR can increase the sample set to twice as large, partially alleviating the sample scarcity. As a robust ensemble model, RFR combines the advantages of statistical reasoning and machine learning approaches [17]. We have also tried linear regression, which has been used in a state-of-the-art related work for feature selection [12], but the model is too imprecise to be useful due to the reason described in Section 2.1.

We measure the above models using the max prediction error, which is the max difference between the real performances and the model predictions, divided by the corresponding real performance: $\max\left(\left\{\frac{|y_i - \hat{y}_i|}{y_i}\right\}_{i \in [0, n-1]}\right)$, where $n$ is the number of samples, $y_i$ is the real performance and $\hat{y}_i$ is the performance predicted by a model. We use 100 samples to construct each model over 10 PerfConfs.

As demonstrated in Figure 2a, the max prediction errors of these models can be very high, as much as twice more than the real performances. While the complexity of the PerfConf-performance curves is one reason, the scarcity of samples is the other. In fact, the model inaccuracy can be decreased given more samples (Figure 2b), but the cost of obtaining a large sample set can be high. Many tuning solutions require a database of thousands of samples for tuning 10 parameters [12]. Models based on neural networks would require more samples even for just two PerfConfs [13].
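For concreteness, the max prediction error above can be computed as in the following sketch; the random-forest configuration and the sample splits here are illustrative stand-ins rather than the exact experimental setup.

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    def max_prediction_error(y_true, y_pred):
        # max_i |y_i - yhat_i| / y_i, as defined above
        y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
        return float(np.max(np.abs(y_true - y_pred) / y_true))

    # Illustrative use: an RFR model over 10 PerfConfs, trained on part of 100 samples.
    rng = np.random.default_rng(0)
    X = rng.uniform(size=(100, 10))          # normalized PerfConf settings (stand-in data)
    y = rng.uniform(1.0, 10.0, size=100)     # measured performances (stand-in data)
    rfr = RandomForestRegressor(n_estimators=100, random_state=0).fit(X[:80], y[:80])
    print(max_prediction_error(y[80:], rfr.predict(X[80:])))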
Figure 3. Sample size matters: tuning Tomcat by BO with a GP prior: (a) better optimized performances for a larger initial training set; (b) better next-point predictions for a larger initial training set.

Worse still, these samples must be collected for each specific combination of system, workload, environment and computing resources. This makes precise prediction of system performance almost impossible, because collecting a large number of samples for every such combination is impractical, if not impossible. Hence, we are facing the problem of how to obtain proper samples for model construction.

Sample scarcity also has negative impacts on the tuning process of BO. With BO, the GP model can be trained with limited samples and later updated with more samples as the acquisition function drives the sampling process. However, with a GP model trained with limited samples, the tuning process based on BO can be very ineffective. As demonstrated in Figure 3, a BO model with very few samples cannot locate best points for sampling as one with more samples does.
Data-driven tuning methods like Bayesian optimization optimize and sample stepwise towards the final optimization goal [12]. In comparison, many other data-driven tuning methods train a model after taking a large number of samples and then optimize on the final model [20]. There exists a question on whether we should optimize stepwise or integrally.

We look into the optimization process of BO. At each step, BO algorithms determine the next sampling point by optimizing a carefully designed acquisition function [19]. Acquisition functions determine how to explore the input space. The commonly used acquisition function is the expected improvement (EI) function, which represents the expected improvement on sampling a given point. The prior probability model on f is needed in the EI computation. This probability model is usually assumed to be described by a Gaussian process (GP) [12]. Assuming the GP prior, a priori knowledge over f is required to set the covariance function and hyper-parameters. We take the common practice in the choice of the covariance function and hyper-parameters [19].

Figure 3b demonstrates how the BO method runs toward the final result by optimizing the EI acquisition function at each step. Even though the current EI acquisition function is optimized to find the next sample point at every step, the found point is not necessarily a better one. In fact, it is a worse one in many cases, as demonstrated in Figure 3b. When the total number of samples is small, the resulting model might even fail to find a better point in the following steps, e.g., the optimization process with a small initial sample set as represented by the dotted line in Figure 3. These facts indicate that we do not need to optimize at every step of the optimization process. We can wait till enough samples are collected. We should optimize integrally on a large sample set, instead of on a small sample set and in a stepwise way. Besides, instead of trying a single point at each step, we can simultaneously try multiple points. With these understandings, we design ClassyTune.
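For reference, the sketch below shows one common form of the stepwise EI-based selection discussed above, using a GP surrogate for a maximization objective; the library choice, the default kernel, and the random candidate generation are illustrative assumptions, not the exact setup evaluated in Figure 3.

    import numpy as np
    from scipy.stats import norm
    from sklearn.gaussian_process import GaussianProcessRegressor

    def expected_improvement(gp, X_cand, y_best, xi=0.01):
        # EI(x) = E[max(f(x) - y_best - xi, 0)] under the GP posterior (maximization)
        mu, sigma = gp.predict(X_cand, return_std=True)
        sigma = np.maximum(sigma, 1e-9)
        z = (mu - y_best - xi) / sigma
        return (mu - y_best - xi) * norm.cdf(z) + sigma * norm.pdf(z)

    # One BO step: fit the GP on the samples seen so far, then pick the candidate
    # setting that maximizes EI as the next tuning test.
    rng = np.random.default_rng(1)
    X_seen = rng.uniform(size=(20, 10))      # evaluated PerfConf settings (stand-in data)
    y_seen = rng.uniform(size=20)            # their measured performances (stand-in data)
    gp = GaussianProcessRegressor(normalize_y=True).fit(X_seen, y_seen)

    X_cand = rng.uniform(size=(1000, 10))
    next_setting = X_cand[np.argmax(expected_improvement(gp, X_cand, y_seen.max()))]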
Solutions to automatic performance tuning have been proposed for specific types of systems, e.g., storage systems [13] and databases [12]. Auto-tuning for general systems also exists, e.g., BestConfig [2], BOAT [7], and SmartConf [18]. Performance tuning requires the support of a flexible system architecture. Thus, auto-tuning systems for general systems implement system architectures supporting the whole process of auto-tuning PerfConfs, including manipulating the system under tune, running tuning tests and computing the optimization results. At the core of configuration tuning lies a black-box optimization problem. The solutions to this black-box optimization problem can be divided into two categories, i.e., experience-driven tuning and data-driven tuning.

Classic experience-driven tuning methods include the heuristics-based search approach [2], [8], [21] and the control-theory based approach [18]. Tuning based on manually specified models [10], [11] also belongs to this category. As heuristics are highly related to human experience, they might be useful for some systems but not for others. Besides, the search-based approach can only produce stable results when the searched space is large enough. Control theory based auto-tuning iteratively applies a change to inputs and monitors feedback to decide on the next step. This approach is only applicable to cases where the number of PerfConfs is only a handful. There also exist auto-tuning tools that decide the configuration settings based on expert-provided guidelines or experts' answers to a set of questions [22]. Like manually specified models, they have only limited applicability. Different heuristics-driven tuning methods can be assembled for usage in auto-tuning, as the OpenTuner framework does [14].

Data-driven tuning approaches exploit data to guide tuning, instead of experience-based heuristics or manually specified models. Such approaches typically train a model on a given data set and optimize the model towards the tuning objective [12]. Due to the large number of PerfConfs, the model-based approach demands a large sample set to train useful regression models on performance [15]. Bayesian optimization is a popular data-driven tuning approach [6], [12], [13], as it requires only a limited number of samples to train the optimization model. For the BO method with a GP prior, a priori knowledge over the black-box function is required to set the covariance function and hyper-parameters of the GP model. Unfortunately, such knowledge requires deep understanding of the optimization problem and the covariance function, which is a difficult task for common users. Facebook's Spiral system [23] is an industrial practice to integrate data-driven methods for predicting the current best setting of PerfConfs. A recent work, BOAT [7], enables the blending of experience-driven tuning and data-driven tuning. It proposes an optimization framework to integrate human knowledge into the Bayesian optimization process, making the black-box optimization partially white.

ClassyTune takes a classification approach to performance auto-tuning, which is completely different from previous works. ClassyTune addresses the sample scarcity problem in auto-tuning by two measures, i.e., permuting sample pairs to form inputs and generating samples from tuning experiences. Through data generation, ClassyTune transfers expert knowledge and experiences to the auto-tuning process.
Like BestConfig [2], ClassyTune has an architecture that can work with both experience-driven and data-driven tuning methods. The difference between these two architectures is that ClassyTune can save all collected tuning samples for future modeling purposes and expose the tuning model to inform users about PerfConf-performance relations, while BestConfig cannot. The classification model can be used effectively as the surrogate of the system in analysis. In comparison, models directly predicting performances are too imprecise to rely upon [20], while models like Bayesian optimization [6], [19] can only predict the next best points and cannot be used in such analysis.
3 DESIGN OVERVIEW
ClassyTune is a data-driven performance auto-tuning tool for systems in the cloud. It addresses the problem of auto-tuning system PerfConfs within a given number of tuning tests. A set of PerfConf-performance samples can be collected from the given number of tuning tests.

Taking a comparison-based perspective, ClassyTune models the relation between each pair of PerfConf-performance samples. This comparison-based modeling enables the generation of even more samples based on tuning experiences, further attacking the sample scarcity challenge. The modeling process trains a classifier for predicting whether the first PerfConf setting has a higher performance than the second in a pair of PerfConf settings. Section 4 presents the details of the comparison-based modeling based on classification. Unlike the performance-prediction based methods, ClassyTune does not need to assume whether the performance curve is linear or non-linear, thanks to its classification-based method. But, like other machine learning models, the trained classifier is not a hundred percent accurate. It is an imprecise classifier.

To tune with the imprecise classifier, ClassyTune adopts a clustering-based method. Naive exploitations of the imprecise classifier will fail to find the best PerfConf setting due to occasionally incorrect predictions. ClassyTune uses the trained classifier as the surrogate of the system. ClassyTune clusters a set of good PerfConf settings output by the classifier to locate promising spaces for searching for the best PerfConf setting. Section 5 presents the details of the tuning process based on an imprecise classifier.

The overall architecture and implementation of ClassyTune is presented in Section 6. ClassyTune consists of three main components, i.e., sampling, modeling and searching.
4 MODELING COMPARISONS
In this section, we first formulate the comparison-based view for performance tuning. We then detail how to induce training samples and how to model comparison relations by classification for the auto-tuning task.
We model the performance-comparison relations between pairs of PerfConf settings. This comparison-based model takes a pair of PerfConf settings $(X_1, X_2)$ as input and outputs whether the first setting has a performance better than the second, i.e., whether $f(X_1) - f(X_2) > 0$. Hence, it can be represented by the function $g$ defined as:

$$g(X_1, X_2) = \begin{cases} 1 & \text{if } f(X_1) - f(X_2) > 0, \\ 0 & \text{otherwise.} \end{cases} \quad (1)$$

We exploit the above comparison-based model to tackle the auto-tuning problem. We relate the comparison relation to each dimension difference between an input pair. We propose a mapping to encode this dimension difference and construct a new set of samples (§4.2).

First, the comparison-based view directly serves the tuning objective: the optimal PerfConf setting is exactly the one that wins when compared to all other PerfConf settings. Second, modeling the comparison relations is more robust than directly modeling performance. On sample collection, the performance measurements are in fact prone to noise, leading to a variance of measurements. But even if two measurements might not be accurate due to noise or fluctuation, their comparison result can still be correct. In case some comparisons do not have correct results due to a high variance of measurements, there still exist many other correct comparison relations to rely upon. In comparison, such high variance of measurements can completely divert the modeling of performance predictions.

Third, comparison-based modeling leads to a natural augmentation of the data set, partially alleviating the sample scarcity problem. With comparison-based modeling, the training set consists of PerfConf pairs and their performance comparison results. This training set must be mapped from the original set of PerfConf-performance samples. The mapping is a permutation of the original sample set. Thus, for the same sample collection effort, comparison-based modeling can have a training set quadratically as large as the direct modeling of performance can have. Besides, we can generate even more training samples based on manual tuning experiences, which are commonly expressed as comparison-based rules. This is impossible for performance prediction modeling.

Finally, comparison-based modeling provides straightforward means for users to gauge the influences of PerfConfs on the performance. In manual tuning, we would actually observe whether a change of PerfConf values leads to an increase or decrease of the performance. This is exactly a comparison process. In fact, when we analyze systems, we make similar comparison-based observations as well. Thus, comparison-based modeling aligns well with the thinking of human beings.

The performance comparison result can be viewed as the performance change result if the first PerfConf setting is changed to the second one. Put another way, the performance change is actually related to the first PerfConf setting and the value difference regarding the second PerfConf setting. Hence, we can represent a pair of PerfConf settings by encoding in each dimension the value of the first setting and the corresponding difference respectively. For each dimension, we need to construct a bijection for an effective encoding. With such a bijection, we can construct a larger sample set without increasing the input dimension.

Cantor's proof is the solution to constructing such a bijection [24]. Though it may sound counter-intuitive, it has been shown in cardinal arithmetic that the cardinality of the set $[0,1] \times [0,1]$ (the unit square) is equal to that of the set $[0,1]$ (the unit interval). The cardinality of a set is a measure of the number of elements of the set. The cardinality of a set is also called its size.
The cardinality of a finite set is the number of its elements. Two sets have the same cardinality if there exists a bijection between the two sets.
This result was first demonstrated by Cantor and later proved based on space-filling curves (SFC), which are curved lines twisting and turning enough to fill the whole of any finite space [24]. Space-filling curves provide one way of constructing a bijection from the unit square to the unit interval, mapping from the 2d-dimensional space to the d-dimensional space.

For each PerfConf, we thus construct the bijection from two values into one value using SFC, specifically the z-ordering method [24]. The mapped value in the unit interval is called the z-value. The z-value of a point in multiple dimensions is simply calculated by interleaving the binary representations of its coordinate values. For example, given the $i$th-dimension values $X_1^{(i)}$ and $X_2^{(i)}$ in binary representation, the z-value of $(X_1^{(i)}, X_2^{(i)})$ is obtained by interleaving their bits. The order of the two input variables actually matters: the z-value of $(X_2^{(i)}, X_1^{(i)})$ is in general a different value. Note that this z-ordering mapping can actually be modeled by a function with the modulo operator and simple arithmetic operators.

We construct a new sample set quadratically as large as the original set of PerfConf-performance samples by permuting every pair of original samples. The permutation generates $P_n^2 = n \times (n-1)$ samples from the original $n$ samples. On construction, we exploit the above SFC method to map pairs of PerfConf settings into a space with the same dimensions as the number of PerfConfs. It is common practice that inputs are normalized before training machine learning models. Assuming that $X_1, X_2$ are normalized and transformed into the unit interval $[0,1]$, the SFC-based bijection is $h(X_1, X_2) = \vec{X}_{1,2}$, with $X_1, X_2, \vec{X}_{1,2} \in [0,1]^d$.

We can generate even more training samples based on historical tuning experiences. Experiences useful for sample generation are comparison-based rules, for example, increasing the value of PerfConf X leads to a higher performance. For any given PerfConf setting, we can increase the value of PerfConf X and obtain pairs of PerfConf settings. We can then induce new training samples based on the above sample induction method. As long as the experience-based rule holds, we can generate as many training samples as needed. However, we must be careful of two things. First, the experience-based rule must be correct; otherwise, the model trained on the generated samples would be wrong. Second, we must introduce no data skewness and must take samples uniformly distributed in the input space; otherwise, the trained model can be misguiding.

We can model the comparison-based relations using the machine learning method of classification, the model of which is called a classifier. A classification problem is to decide which class a given input belongs to. Given pairs of PerfConfs, we classify their performance comparison results into two classes, i.e., the first better than the second, and otherwise. For example, a PerfConf pair $(X_1, X_2)$ is classified into one class if $X_1$ performs better than $X_2$, i.e., when $g(X_1, X_2) = 1$; otherwise, it is classified into the other class. With the sample induction $h(X_1, X_2) = \vec{X}_{1,2}$ as defined in Section 4.2, we can transform $g$ of Eq. (1) into the following function $g'$:

$$g'(h(X_1, X_2)) = \begin{cases} 1 & \text{if } f(X_1) - f(X_2) > 0, \\ 0 & \text{otherwise,} \end{cases} \quad (2)$$

where $g(X_1, X_2) = g'(h(X_1, X_2)) = g'(\vec{X}_{1,2})$. The input space of $g'$ has the same dimensions as that of $f$, i.e., half the input dimensions of $g$, but with quadratically as many training samples as those for $f$.
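The following sketch illustrates the sample induction described above: every ordered pair of original samples is encoded dimension by dimension into one z-value by interleaving bits (following the z-ordering example of interleaving the two settings' per-dimension values), and labeled with the comparison result of Eq. (1). The bit width and helper names are illustrative; the paper's implementation details may differ.

    import itertools
    import numpy as np

    def z_value(a, b, bits=16):
        # Map two normalized values in [0, 1] to one z-value in [0, 1) by bit interleaving.
        ia = min(int(a * (1 << bits)), (1 << bits) - 1)
        ib = min(int(b * (1 << bits)), (1 << bits) - 1)
        z = 0
        for i in range(bits):
            z |= ((ia >> i) & 1) << (2 * i + 1)   # bits of the first value
            z |= ((ib >> i) & 1) << (2 * i)       # bits of the second value
        return z / float(1 << (2 * bits))

    def induce_samples(X, y):
        # Permute every ordered pair of original samples; each PerfConf dimension of a
        # pair is encoded as one z-value, so the induced input keeps d dimensions.
        X, y = np.asarray(X, float), np.asarray(y, float)
        Xp, labels = [], []
        for i, j in itertools.permutations(range(len(X)), 2):
            Xp.append([z_value(a, b) for a, b in zip(X[i], X[j])])
            labels.append(1 if y[i] - y[j] > 0 else 0)   # g(X_i, X_j) of Eq. (1)
        return np.asarray(Xp), np.asarray(labels)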
We can now construct a classifier on the sample set $(\vec{X}, g'(\vec{X}))$ with enough samples.

We might also train a classifier for telling whether one configuration setting is better than the default configuration setting. But this way of constructing a classifier cannot solve the problem of sample scarcity. As our target is to exploit classifier models to solve the tuning problem, our focus is how to use the machine learning model, instead of improving the model. We do not tune the hyper-parameters of the classifier, as this is a problem as difficult as the one that the classifier is trained for. Rather, we bear in mind that the classifier is not precise. We thus design algorithms that can exploit imprecise predictions by such a classifier to fulfill tuning-related tasks.
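A minimal sketch of training the classifier $g'$ on the induced samples is given below; it reuses z_value and induce_samples from the previous sketch, and the XGBoost hyper-parameters and sample sizes are arbitrary placeholders (the hyper-parameters are deliberately not tuned, as discussed above).

    import numpy as np
    from xgboost import XGBClassifier   # any binary classifier could be substituted here

    rng = np.random.default_rng(2)
    X = rng.uniform(size=(50, 10))      # 50 original settings over 10 PerfConfs (stand-in data)
    y = rng.uniform(size=50)            # their measured performances (stand-in data)

    Xp, labels = induce_samples(X, y)   # quadratically many induced pairs (Section 4.2 sketch)
    clf = XGBClassifier(n_estimators=200, max_depth=4).fit(Xp, labels)

    # g'(h(X_a, X_b)) = 1 predicts that X_a performs better than X_b.
    pair = np.array([[z_value(a, b) for a, b in zip(X[0], X[1])]])
    print(clf.predict(pair)[0])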
Classification vs. ranking. As related works formulate tuning as an optimization problem, some would think that modeling tuning as a ranking problem [20] would be more natural than as a comparison one. We do not address the tuning problem with ranking models but with classification models, for two reasons. First, the input space of configuration tuning generally has continuous dimensions, so any given range contains more points than the total number of natural numbers. As ranking is in fact mapping natural numbers to inputs, this fact indicates that ranking is an inadequate way of modeling. Second, configuration tuning is to find the top input(s) in the set, rather than to order all inputs. While, given a ranking model, obtaining any comparison result is straightforward, given a classification model for comparison, finding the ranking is an NP-hard problem [25]. In other words, the ranking model incorporates more information than the classification model. That said, like directly predicting performance, performance ranking also does more than required.
5 TUNING WITH AN IMPRECISE CLASSIFIER
With the comparison-based classification model, ClassyTune can compare any pair of PerfConf settings. Since we can now use the trained model as the surrogate of the real system, our goal becomes to find the best one in a sufficiently large set of N PerfConf settings.
Strawman.
One naive solution is to sample N PerfConf settings and use the classifier to compare every pair of them. In order to find the optimal setting, N must be sufficiently large to cover the whole space of PerfConfs. Unfortunately, pairing every two of the N PerfConf settings would lead to a set with a daunting size of $\binom{N}{2}$. Even though the classifier can predict in a sufficiently short time, this processing time would add up to a long duration. Worse still, as the classifier is not a hundred percent correct, some results would be contradicting, making it impossible to deduce the real optimal.

A better strawman.
An alternative solution is to do a binary search among the huge set of N PerfConf settings. In each comparison, i.e., each prediction by the classifier, the winning PerfConf setting is kept for the next round of comparison, while the other one is discarded directly. After $\log N$ rounds of binary comparisons, we will finally reach the last pair of winning PerfConf settings, and the final winner will be the optimal. However, as we have mentioned, the classifier is not a hundred percent correct; thus, the actual optimal setting might have been discarded because it loses in just one false comparison.

Our solution.
ClassyTune takes a systematic approach towards tuning. Rather than trying to improve the precision of the model, ClassyTune recognizes that the trained model can only make a large portion of its predictions correctly. It exploits this fact and finds the top setting in best effort through three phases, i.e., finding a list of good PerfConf settings, locating promising areas containing optimal settings, and searching for the optimal setting.
ClassyTune does not compare every pair of PerfConf settings. Rather, in the training phase, it keeps the best PerfConf setting in the training set along with the trained model. When given the large set of N PerfConf settings, ClassyTune uses the trained model to compare each of the N settings with the best PerfConf setting in the training set. The list of settings that win in these comparisons is kept. Even though the trained model might not be completely correct in these comparisons, it is very likely that many of these winning settings are better than the best PerfConf setting in the training set.

We take a list of winning settings output by the imprecise classifier. We do not keep the single PerfConf setting that wins the most comparisons, contrasting the way that BO with the GP prior takes one optimal setting at each step. Given the same imprecision rate, finding a list of winning settings reduces the probability that we find no PerfConf setting better than the best one in the training set.
Furthermore, we do not directly output this list of winning PerfConf settings as optimal ones. Rather, we use them to locate some promising areas for finding the real optimal setting. The reasons include: 1) as the model is not a hundred percent accurate, some of the winning settings might not even be good settings; and 2) the space of PerfConfs is so large that the N settings might not be representative enough for finding the optimal one.

In fact, we believe good settings are close to each other and possibly located in a few promising areas. Generally, the optimal PerfConf setting is surrounded by good settings that are better than many others. Likewise, the areas where many good settings are located are promising places where the optimal setting might be found. We denote such areas as the promising subspaces.

For the set of winning PerfConf settings, ClassyTune uses the KMeans clustering algorithm to find out where the good PerfConf settings cluster. To determine the number of promising areas, i.e., the number of clusters, we exploit the elbow criterion [26] to find a best number k for clustering. We then run the KMeans algorithm to cluster the winning PerfConf settings into k clusters, whose centers are then computed. The promising subspaces are located around these centers.

Now we have the centers of the promising subspaces, but we have not yet set their boundaries. We set the boundaries of the promising subspaces based on the PerfConf settings that we have already evaluated. As we know that none of the evaluated settings is expected to be better than the list of winning settings, we should not consider those settings lying farther from a center than that center's closest evaluated setting. Hence, for each center, we find at each dimension its closest neighbor in the set of evaluated settings, and the value of this neighboring setting is used as the boundary of that dimension for the center. After doing so for each dimension of all centers, we have bounded all promising subspaces.

Within the specified number of tuning tests, we then sample in the promising subspaces so that a good coverage of the areas is guaranteed [2]. These sampled PerfConf settings are then evaluated in the system to decide which exactly is the best. The final best will be output as the suggested setting for an optimal performance.

6 THE CLASSYTUNE SYSTEM
Figure 4. ClassyTune: the architecture & the tuning process.

The overall architecture of ClassyTune is illustrated in Figure 4. Like BestConfig [2] and OtterTune [12], ClassyTune only needs the users to provide a list of PerfConfs along with their valid ranges, and scripts to set PerfConf values and to get system performances, for tuning a new system and its application workload. ClassyTune has three main components, i.e., sampling, modeling and searching. These components interact through data flows; thus they can be located on one same server or on multiple servers. The results of sampling and modeling are produced as the intermediate
outputs for reuse in following tasks. The two intermediate outputs are the database of PerfConf-performance samples and the classifier model. The final output of the tuning process is the best PerfConf setting found within the given number of tuning tests.
Sampling.
Different from common machine learning tasks, configuration tuning allows the learning process to freely choose the points to sample in the input space. As all values in a range are valid for a dimension, sample values on each dimension should spread across the corresponding range so that the underlying relations impacting comparison results can be represented and learned. According to our practical experience, we find the Latin hypercube sampling (LHS) method [27] used in ClassyTune very effective and fit for the purpose. It can (1) uniformly cover the whole range on each dimension and (2) sample a given number of points. In comparison, uniform random sampling does not necessarily cover the whole range, while grid sampling might not be able to sample a required number of points. Other sampling methods that satisfy the two properties like LHS can also be used with ClassyTune. The output of the sampling phase is a database of PerfConf-performance samples.
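A minimal LHS sketch over normalized PerfConf ranges is shown below; it is a generic textbook construction, not ClassyTune's exact sampler.

    import numpy as np

    def latin_hypercube(n_samples, dims, seed=None):
        # One point per stratum: split [0, 1) into n_samples equal strata per dimension,
        # draw a point in each stratum, then shuffle the strata independently per dimension.
        rng = np.random.default_rng(seed)
        u = (np.arange(n_samples)[:, None] + rng.uniform(size=(n_samples, dims))) / n_samples
        for d in range(dims):
            rng.shuffle(u[:, d])
        return u

    # e.g. 50 settings over 10 normalized PerfConfs; every column covers all 50 strata
    samples = latin_hypercube(50, 10, seed=0)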
Modeling.
ClassyTune exploits the database of PerfConf-performance samples to construct new samples for training the comparison-based model. ClassyTune tries different classification methods to train the comparison-based model.

Searching.
Based on the classifier, we search the configuration space thoroughly for a set of best points. The classifier is used to decide whether a configuration setting is superior to any other configuration.
This prediction takes much less time than actually evaluating a configuration setting for its performance.
Like other model-based tuning solutions [6], [12], [16], ClassyTune exploits the trained model as a surrogate. Different from some Bayesian-optimization based solutions that explicitly solve an optimization equation, ClassyTune takes a systematic approach to optimization, adopting a three-phase searching process. The found candidate settings are evaluated in the system for verification.
Algorithm 1: ClassyTune: classification-based tuning.
Input: X, y       // PerfConf settings and their measured performances
Input: m          // the number of tuning tests for the searching phase
Output: bestX     // the optimal PerfConf setting found

 1: clf = FIT(SET_INDUCE(X, y))            // induce samples, train classifier
 2: idxMax = ARGSORT(y)[-1]                // index of the best y
 3: S ← {X_i}                              // sample many points in the space; |S| is many times DIM(X)
 4: Xp = PAIR_INDUCE(S, X[idxMax])
 5: Yp = clf.PREDICT(Xp)
 6: idxList = IDX_WHERE(Yp, yp_i > 0)      // points predicted better than X[idxMax]
 7: Xs = S[idxList]
 8: k = BEST_CLUSTER_NUM(Xs)               // compute the best number of clusters
 9: C = KMEANS_FIT_AND_GET_CENTERS(k, Xs)  // cluster points into promising subspaces
10: X_candidates ← LHS(C, m/k)             // sample in promising subspaces by LHS
11: y_candidates ← EVALUATE(X_candidates)
12: idxMax = ARGSORT(y_candidates)[-1]
13: bestX ← X_candidates[idxMax]
14: return bestX

The whole tuning process of ClassyTune is implemented as illustrated in Algorithm 1. Given a set of PerfConf-performance samples as input, we first induce a new sample set for training a binary classifier (Line 1). Then, we find the best PerfConf setting in the original sample set (Line 2). Using the best PerfConf setting in the training set as the pivot, ClassyTune compares each of the N PerfConf settings with this pivot (Lines 3-5). All the winning settings are put in a winner set (Lines 6-7). Second, ClassyTune encloses the areas where the winner settings cluster (Lines 8-9). These areas are the promising subspaces where optimal settings might be located. Even though the classifier might have mispredicted some winners, the location of the promising subspaces might be shifted a little bit but would not be completely missed. Third, to actually find the optimal settings, ClassyTune resamples in these subspaces and evaluates the sampled PerfConf settings in the system (Lines 10-11). The best setting found is output as the solution (Lines 12-14).
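As a companion to Algorithm 1, the sketch below shows one possible Python rendering of the clustering and bounding steps (Lines 8-9 and the boundary rule of Section 5); the elbow heuristic, the interpretation of the boundary rule, and the variable names are illustrative assumptions, not the exact implementation.

    import numpy as np
    from sklearn.cluster import KMeans

    def best_cluster_num(X_winners, k_max=8, threshold=0.1):
        # Crude elbow heuristic: smallest k whose next split improves inertia by < threshold.
        ks = list(range(1, min(k_max, len(X_winners)) + 1))
        inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_winners).inertia_
                    for k in ks]
        for i in range(len(inertias) - 1):
            if inertias[i] <= 0 or (inertias[i] - inertias[i + 1]) / inertias[i] < threshold:
                return ks[i]
        return ks[-1]

    def bound_subspaces(centers, X_evaluated):
        # One reading of the boundary rule: per dimension, the evaluated values closest
        # to a center (from below and from above) bound that center's promising subspace.
        boxes = []
        for c in centers:
            lo, hi = np.zeros_like(c), np.ones_like(c)
            for d in range(len(c)):
                below = X_evaluated[:, d][X_evaluated[:, d] <= c[d]]
                above = X_evaluated[:, d][X_evaluated[:, d] >= c[d]]
                if below.size:
                    lo[d] = below.max()
                if above.size:
                    hi[d] = above.min()
            boxes.append((lo, hi))
        return boxes

    # X_winners: settings predicted to beat the best training sample (Lines 6-7);
    # X_evaluated: all settings already evaluated in the system.
    # k = best_cluster_num(X_winners)
    # centers = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_winners).cluster_centers_
    # boxes = bound_subspaces(centers, X_evaluated)   # then resample in each box by LHS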
Data types for sample representation.
One could notice that we need data types with higher precision to represent the induced samples. In our implementation, we use the double data type to represent the induced sample values and float for the original ones. However, the lengthy tail of a decimal is very likely to lose its significance in the model training process. Thanks to the sparsity of samples, it is rare that the induced inputs collide with the original ones. A disadvantage of the induction is that the latent relations between configuration pairs could become even more deeply hidden. However, as we have mentioned in Section 4.2, the sample induction can actually be modeled as a function of modulo and other simple arithmetic operators. Luckily, as demonstrated by many real-world applications, some classification algorithms can represent highly complex input data [28].
Other implementation details.
We implement ClassyTune using Python and R, with only about 2000 lines of code. The interactions with the system under tune are implemented through shell scripts. ClassyTune maximizes a scalar performance metric. The scalar performance metric can be defined and specified through some utility function [2], with user-concerned performance goals as inputs.

Table 1. The Evaluated Systems and Variables

System        Description        Lang.    Workloads
HDFS          Dist. filesystem   Java     PageRank, Join, KMeans
YARN          Dist. processing   Java     PageRank, Join, KMeans
Hive          Data analytics     Java     PageRank, Join, KMeans
Spark         Data processing    Scala    PageRank, TeraSort, KMeans
MySQL         DB server          C++      readOnly, readWrite, TPC-C
PostgreSQL    DB server          C        readOnly, readWrite, TPC-C
Cassandra     NoSQL DB           Java     readWrite (YCSB-a)
Tomcat        Web server         Java     Web exploration
7 EVALUATION
We evaluate ClassyTune over 7 cloud systems that are implemented in different languages and have supported a variety of applications. These systems are listed in Table 1. To provide an example of tuning co-deployed cloud systems, we tune Hive and Hadoop together for offline data analytical workloads. We choose these systems in accordance with related works [2], [12], [18] for an easy comparison. We believe our choice is representative of a large number of cloud systems.

We choose 14 application workloads following the choice of related works [2], [12], [18], as listed in Table 1. The cases of Tomcat and Cassandra are relatively simple as compared to the other systems, so only the workloads of Web exploration and read-write are chosen respectively. The other systems are evaluated on three typical workloads each. The distributed processing systems of Spark and Hive plus Hadoop are evaluated under analytical and machine learning workloads, generated by the HiBench benchmark. The transactional (readWrite) and readOnly workloads for the databases are generated by the SysBench benchmark. We also include the TPC-C workload, the current industrial standard for evaluating the performance of OLTP systems.

For each system, we choose 10 influential PerfConfs for tuning, unless mentioned otherwise. Related works taking the model-based approach typically use a similar number of parameters, around 7 to 16, with 8 achieving the best result on tuning with fixed parameters [6], [12]. We choose the PerfConfs to tune in accordance with related works. These PerfConfs control various aspects of systems, including but not limited to network, CPU, memory, storage, indexing, caching and buffering.

Performance metrics are application-specific. We adopt the performance metrics commonly used for the evaluated workloads. While workloads on Spark and Hive plus Hadoop are tuned for a shorter processing time (or task duration), workloads on the other systems are tuned for higher throughputs.
Figure 5. Percentage of winning settings found by different classifiers: XGB outperforms all the other classifiers, while the kernel method SVM, exploiting covariance functions, fails in most cases.
Our experimental platform consists of 12 servers. Each server has two 12-core Intel Xeon E5620 CPUs with 32GB RAM. CentOS 6.5 and JVM 1.7 are installed. For each evaluation, one server is used to generate workloads. Standalone SUTs run on one server, while distributed SUTs are hosted by four servers.
We empirically study which classifier is best to use with ClassyTune. There exist many machine learning methods to model the comparison relations, e.g., logistic regression (LR for short), decision tree (DT), support vector machine (SVM), neural networks (NN) and XGBoost. While the former three are classic methods for binary classification, neural networks have been applied to many real applications and have made significant progress in scenarios with big data. XGBoost (XGB for short) is in the algorithm family of gradient-boosted trees [29], which have been shown to be among the best classifiers [28]. In binary classification problems with small data, algorithms from the family of gradient-boosted trees are at the top among all. XGBoost has been used to achieve state-of-the-art results on many machine learning challenges.

In the comparison-based tuning, the key to success is to recognize the whole set of PerfConf settings that are better than, and thus win against, a given one. We evaluate the above five classifiers to see how well they can recognize the winning settings. We let each classifier be trained on a set of 50 original samples and tested on 20 samples. The 20 samples have performances higher than the best sample in the training set. We evaluate how many among the 20 samples can be recognized by a trained classifier. The results are plotted in Figure 5. From Figure 5, we can see that XGBoost can find almost all the winning settings for all systems. Therefore, we choose XGBoost as the classifier model in ClassyTune.
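The recognition test above can be scripted roughly as follows, reusing z_value from the Section 4.2 sketch; the pivot-based pairing mirrors how ClassyTune queries the classifier, while the data set and classifier object are assumed to come from elsewhere.

    import numpy as np

    def winning_recall(clf, X_train, y_train, X_test_better):
        # Fraction of held-out settings (all known to beat the best training sample)
        # that the classifier predicts to win against that best training sample.
        pivot = X_train[np.argmax(y_train)]
        pairs = np.array([[z_value(a, b) for a, b in zip(x, pivot)] for x in X_test_better])
        return float(clf.predict(pairs).mean())   # label 1 means "first setting is better"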
Compared to performance-prediction based tuning.
We have tried predicting winning settings using regression models on the same set of original samples as in Figure 5. We use the decision tree based regression model, which is shown to perform best in predicting system performances [16]. But the model trained on the same sample set fails to find any of the winning samples. This again demonstrates the validity of taking a comparison-based approach.
Compared to other auto-tuning methods. To demonstrate the tuning efficacy of ClassyTune, we compare ClassyTune with two state-of-the-art tuning approaches, i.e., the search-based approach [2] and the Gaussian-process (GP) based Bayesian optimization (BO) approach [6], [12]. Besides, these two approaches are among the few auto-tuning proposals that work on a limited number of samples. We do not compare with approaches based on control theory [18] or reinforcement learning [13] because they are only applicable to a handful of configuration parameters. We exploit the open-source implementation of BestConfig (http://github.com/zhuyuqing/bestconf/) for the evaluation of the search-based approach. As no open-source implementation can be found for the GP-based BO tuning approach [6], [12], we implemented it exploiting a Python package of GP-based BO (https://github.com/thuijskens/bayesian-optimization).

For each combination of tuning solutions, systems and workloads, we run the tuning experiment three times and report the average performance improvement. In each tuning experiment, we tune within 100 tests, following the evaluation methodology of related works [2].

Figure 6. ClassyTune/BestConfig/GP-based BO (GP-BO) improving performances over those under default settings: (a) throughputs of the Web server, NoSQL database, and databases; (b) running times of Spark and Hadoop jobs.

Figure 6 shows that ClassyTune can find configurations better than, and occasionally as good as, those output by the two state-of-the-art solutions. ClassyTune can improve throughputs to multiple times of those under the default setting, and decrease execution times by more than two thirds. Specifically, it has improved the throughputs of Tomcat by 76%, Cassandra by 4%, MySQL/transactions by 654%, MySQL/reads by 256%, PostgreSQL/transactions by 228% and PostgreSQL/reads by 33%. It reduces the execution time by 58% for Spark/PageRank, 72% for Spark/TeraSort, 50% for Spark/KMeans, 6% for Hive-Hadoop/PageRank, 7% for Hive-Hadoop/Join and 22% for Hive-Hadoop/KMeans.

Even for the complex co-deployed system of Hive-Hadoop, ClassyTune can still improve the performance by reducing as much as 22% of the execution time of the KMeans workload. In comparison, the search-based method and the GP-based BO method cannot tune such a complex system to a performance as good as ClassyTune does.

ClassyTune tunes several systems to a performance much higher than the state-of-the-art solutions, e.g., Spark/PageRank and MySQL/txns in Figure 6. For other systems, ClassyTune only wins over the state-of-the-art solutions by a small percentage. A system can in no way be tuned as well as one would wish by only changing PerfConf settings. There is an upper bound on the performance that tuning PerfConf settings can achieve, although this bound can hardly be figured out for the high-dimensional continuous space of PerfConfs. The performances that ClassyTune has tuned to are the best we have found for the corresponding
combinations of systems, workloads and environments. We have tried testing each combination over thousands of different PerfConf settings, but we never found one setting better than the one suggested by ClassyTune.

Figure 7. Auto-tuning compared to manual tuning: databases/TPC-C.

Figure 8. Promising subspaces (bounded by circles) with optimal settings (i.e., evaluated points) as located by ClassyTune: (a) PageRank on Spark; (b) PageRank on Hive-Hadoop.
Compared to manual and expert tuning.
To further demonstrate the effectiveness of ClassyTune, we also compare the performances tuned by ClassyTune with those tuned manually or by expert knowledge. We experiment with databases under the TPC-C workload. To enable the comparison, we adopt the setting suggested by the Internet and related works [12] as the manual setting. Before automatic tuning appeared, a common way of tuning databases was to use scripts written by experts based on their knowledge and expertise. We exploit two such tuning scripts, for MySQL (https://launchpad.net/mysql-tuning-primer) and PostgreSQL (http://pgfoundry.org/projects/pgtune/) respectively. These scripts are also evaluated in a related work [12]. We also demonstrate the tuning results of GP-based BO and BestConfig. Figure 7 presents the results.

ClassyTune can improve the system performance to well above that under the manually tuned configuration. In fact, human beings can hardly fully capture the characteristics of complicated workloads; thus auto-tuning methods find PerfConf settings with better performances than those under manually-tuned and script-tuned PerfConf settings. And, the latent relations between PerfConfs and performances are better captured if modeled in the way of ClassyTune than if modeled in the way of GP-based BO. Therefore, ClassyTune has an advantage in both database cases, while the BO-based and the search-based approaches perform slightly worse than the script-based approach on tuning PostgreSQL. We believe that the number of samples is an influential factor. ClassyTune acquires its advantage from the comparison-based modeling.

Have winning PerfConf settings been recognized?
We measure whether ClassyTune can correctly differentiate all PerfConf settings better than a given one. As plotted in Figure 5, the classifier model can almost perfectly identify the list of winning PerfConf settings when only 50 samples are provided. This fact supports our design choice of locating promising subspaces by clustering these winning PerfConf settings.
Are promising subspaces located?
We examine whether ClassyTune actually locates the promising subspaces. To better view the PerfConf-performance relations, we run a tuning experiment with 1000 tests for Spark/PageRank and Hive-Hadoop/PageRank respectively. We select the most influential PerfConf, spark.default.parallelism for Spark and mapreduce.job.maps for Hive-Hadoop. We plot all the sampled points in the sampling phase and the evaluated points in the searching phase. The results are shown in Figure 8. For both systems, the evaluated points cluster in a region of the space, which is circled out; and the clusters have short execution times, i.e., higher performances, compared to the other sampled points. In other words, ClassyTune has successfully located the promising subspaces and recognized a set of good settings.

Imprecision is alleviated by the systematic approach.
We further verify the impacts of the classifiers' imprecision on tuning. We choose to evaluate on Tomcat/WebExplore and PostgreSQL/reads because the classifiers display the most difference on the former and the least on the latter in Figure 5. XGB, DT and LR improve the performances to 1.76, 1.71 and 1.73 respectively for tuning Tomcat/WebExplore, while they improve the performances to 1.33, 1.25 and 1.24 respectively for PostgreSQL/reads. We find that the differences between the improved performances are not as large as those between the percentages of winning settings found.

In fact, the tuning results of ClassyTune do not rely solely on the precision of the classifier. Rather, after the classifier pins down the promising areas, we take a systematic approach by resampling in these areas using the LHS method. This result leads us to think that, while exploiting machine learning models is beneficial, taking a systematic approach to the goal also helps to reduce the effect brought about by the imprecision of machine learning models.
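Continuing the sketch above, the snippet below illustrates one way this systematic step could look: the predicted winning settings are clustered, and each cluster's bounding box is resampled with Latin hypercube sampling. The cluster count, the use of k-means, and SciPy's qmc module are our own assumptions rather than ClassyTune's exact implementation.

# Illustrative sketch: locate promising subspaces by clustering the predicted
# winning settings, then resample each cluster's bounding box with LHS.
import numpy as np
from sklearn.cluster import KMeans
from scipy.stats import qmc

def resample_promising(promising, n_clusters=3, n_per_cluster=20, seed=0):
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(promising)
    new_points = []
    for k in range(n_clusters):
        members = promising[labels == k]
        lo, hi = members.min(axis=0), members.max(axis=0)
        hi = np.where(hi > lo, hi, lo + 1e-9)   # avoid degenerate boxes
        lhs = qmc.LatinHypercube(d=promising.shape[1], seed=seed + k)
        new_points.append(qmc.scale(lhs.random(n_per_cluster), lo, hi))
    return np.vstack(new_points)

# Example: resample_promising(np.random.default_rng(0).random((60, 10)))
# Each returned point would then be evaluated on the real system and the
# best-performing setting reported as the tuning result.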
We evaluate whether the bijection-based sample induction actually performs better than the simple way of directly taking the difference (i.e., using the minus operation). We also compare our sample induction method with the direct concatenation of two PerfConf settings. We evaluate the three methods on the percentage of winning settings they can find. In the experiments, we use the XGBoost classifier for all three sample induction methods. Results are illustrated in Figure 9.

Figure 9. Percentage of winning settings found: sample induction based on Cantor's proof outperforms the others.

Our sample induction method based on Cantor's proof performs the best for all systems. As we have mentioned in Section 4.2, this sample induction method can be modeled as a function of modulo and simple arithmetic operators, although it is seemingly complicated. On the one hand, functions with modulo and simple arithmetic operators can easily be learned by common machine learning algorithms [15]. On the other hand, our sample induction method feeds the model with the real independent factors, i.e., PerfConfs. In comparison, the concatenation method mixes independent factors with correlated factors while also increasing the input dimension. And the difference method performs worse than our method because the difference computation can lead to collisions in the mapping.
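As an illustration of why such a bijection remains learnable, here is a minimal digit-interleaving sketch in the spirit of Cantor's proof. It is our own example, not the exact encoding of Section 4.2: two normalized values are merged into one value per dimension using only integer division, modulo and addition, so pairing two d-dimensional settings does not double the input dimension.

# Illustrative digit-interleaving bijection in the spirit of Cantor's proof:
# two values in [0, 1) are merged into one value in [0, 1) per dimension, so a
# pair of d-dimensional PerfConf settings still yields a d-dimensional sample.
# The precision and normalization choices here are our own assumptions.
def interleave(a, b, digits=6):
    """Merge a, b in [0, 1) by alternating their decimal digits."""
    ai = int(round(a * 10**digits))
    bi = int(round(b * 10**digits))
    merged = 0
    for k in range(digits):
        # take the k-th most significant digit of each number in turn
        da = (ai // 10**(digits - 1 - k)) % 10
        db = (bi // 10**(digits - 1 - k)) % 10
        merged = merged * 100 + da * 10 + db
    return merged / 10**(2 * digits)

# Example: interleave(0.123, 0.456) -> 0.142536, from which both inputs
# can be recovered, i.e., the mapping is collision-free.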
We demonstrate ClassyTune's advantages for tuning in a high-dimensional input space. We choose a tuning space with 30 PerfConfs and constrain the tuning within a limited number of tests. We compare ClassyTune to the two state-of-the-art auto-tuning methods, i.e., the search-based [2] and the GP-based BO [6], [12] approaches. Manual tuning is not applicable to high-dimensional tuning because it is very difficult, if not impossible, for human beings to comprehend relations in a high-dimensional space [4]. Script-based tuning is also based on human experience, making it inapplicable to high-dimensional tuning either. We tune MySQL and PostgreSQL under the TPC-C workload respectively.

The tuning results are presented in Figure 10a. First, increasing the dimension leads to a larger input space with possibly even better results, e.g., for MySQL/TPC-C: the performance improvements are higher than those in a 10-dimensional input space, as demonstrated in Figure 6. ClassyTune outperforms the other auto-tuning methods in both the high- and the low-dimensional cases. For high-dimensional tuning, the advantage of ClassyTune over the other methods is more obvious: ClassyTune improves the performance of MySQL/TPC-C by more than six times, while the GP-based BO and the search-based BestConfig can only improve it by four times. Second, some systems have only a limited number of effective PerfConfs, e.g., PostgreSQL/TPC-C, for which the performance improvements are similar for high- and low-dimensional tuning. Nevertheless, ClassyTune still has a slight advantage over the other auto-tuning methods.
We have mentioned that the GP-based BO method has high computation overhead. For the tuning results in Figure 10a, we record the tuning time for both ClassyTune and GP-based BO. The tuning time includes the time for model training and model optimization. As GP-based BO is a step-wise method, its tuning time sums up the computation time of all steps. We carry out the auto-tuning processes of ClassyTune and GP-based BO five times each, and we report the averages of the tuning results and the tuning times respectively. The results are plotted in Figure 10b.

Figure 10. High-dimensional tuning results: tuning 30 PerfConfs for databases/TPC-C. (a) ClassyTune outperforms the other auto-tuning methods. (b) Total tuning times for ClassyTune and GP-BO respectively.

ClassyTune involves a tuning time of no more than 200 seconds, while GP-based BO requires a tuning time of more than 550 seconds. Within a much shorter tuning time, ClassyTune finds a better PerfConf setting than the GP-based BO method. The GP-BO method has a heavy computation overhead because its tuning process involves the covariance matrix computation, and this computation is carried out stepwise. Taking an integral approach to auto-tuning, ClassyTune trains a model once and then spends the rest of its time searching the input space thoroughly based on the trained model. If necessary, ClassyTune can further reduce its tuning time by searching fewer points.
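For intuition on where this overhead comes from, the following is a generic sketch of a step-wise GP-BO tuning loop, not the exact baseline package we used: the GP is refit after every test, and each refit factorizes an n-by-n covariance matrix over all samples collected so far, so the cumulative cost grows with the number of steps. The benchmark function, candidate pool, and scikit-learn GP are placeholders of our own.

# Generic sketch of a step-wise GP-BO tuning loop (our own assumptions:
# scikit-learn GP, expected-improvement acquisition, random candidate pool).
# The GP is refit after every evaluation; each refit factorizes an n x n
# covariance matrix, which is why the cumulative tuning time grows per step.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def run_benchmark(x):                 # placeholder for a real system test
    return -np.sum((x - 0.3) ** 2)

rng = np.random.default_rng(0)
dim, budget = 10, 30
X = list(rng.random((5, dim)))        # initial random samples
y = [run_benchmark(x) for x in X]

for _ in range(budget):
    gp = GaussianProcessRegressor(normalize_y=True).fit(np.array(X), np.array(y))
    cand = rng.random((2000, dim))
    mu, sigma = gp.predict(cand, return_std=True)
    best = max(y)
    z = (mu - best) / np.maximum(sigma, 1e-9)
    ei = (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)   # expected improvement
    x_next = cand[np.argmax(ei)]
    X.append(x_next)
    y.append(run_benchmark(x_next))

print(max(y))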
ClassyTune can bring about the five benefits of automatic performance tuning [3], like related works [2], [12], [18], [30]. Here, we present a real use case of UTuned's customers to show how ClassyTune enables cloud resource reduction via performance tuning. In this case, ClassyTune is used to tune a small online querying service deployed in the cloud. The application workload accesses the service by connecting to a stateless Web service cluster running a Spring Boot (https://spring.io/projects/spring-boot) application, which sends user queries to the backend. Before tuning, the service is deployed on a three-node cluster, supporting a throughput of around 9000 composite operations per second. There is a resource planning question about whether all three nodes are needed or whether removing one node is possible, given that the workload throughput must be guaranteed.

To answer this resource planning question, we deploy the service on clusters of one to three nodes respectively. For each deployment, we test its performance under the default PerfConf setting. Then, we use ClassyTune to tune for the best performance. Under the tuned PerfConf setting, we test the service performance again. All the performance results are listed in Table 2.
Table 2
Service Throughputs (composite operations per second): Default vs. Tuned

Nodes | Default (err. rate) | ClassyTune (err. rate)
1     | …                   | …
2     | …                   | …
3     | …                   | 11905.2 (2.2%)

For the target workload, a two-node cluster with a well-tuned PerfConf setting is the most cost-effective. Without tuning, it would require one more node, i.e., 50% more computing resources, to satisfy the application workload. From Table 2, we can see that a one-node deployment, tuned or untuned, cannot support the application workload. While an untuned two-node deployment cannot meet the throughput requirement, it can perfectly support the workload after being tuned by ClassyTune. For a three-node deployment, performance tuning enables it to support an even heavier application workload. In sum, we have reduced the cloud resource requirements (and costs) of an online service by 33% through performance tuning by ClassyTune.
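As a toy illustration of the resource-planning decision above, the snippet below picks the smallest cluster whose tuned throughput still meets the required workload. The 9000 ops/s requirement comes from the use case, while the per-cluster throughput figures are purely hypothetical placeholders, not measured values.

# Toy sketch of the resource-planning decision in this use case: pick the
# smallest cluster whose tuned throughput still meets the required workload.
# All throughput figures except the 9000 ops/s requirement are hypothetical.
required = 9000                                    # composite operations per second
tuned_throughput = {1: 6800, 2: 9600, 3: 11900}    # hypothetical tuned results

feasible = [n for n, t in sorted(tuned_throughput.items()) if t >= required]
print(f"smallest feasible deployment: {feasible[0]} node(s)" if feasible
      else "no deployment meets the requirement")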
CONCLUSION
This paper proposes a data-driven auto-tuning system, ClassyTune, which can auto-tune the system performance by adjusting the PerfConfs within a limited number of tuning tests. ClassyTune exploits and models the comparison relations between PerfConfs by classification algorithms, instead of the typical performance-based model. Thanks to the comparison-based modeling, we can induce and generate more samples for training the classification model. Like other machine learning models, the classification model is not a hundred percent correct. If exploited naively, the imprecision of the model could divert the performance tuning process such that no better PerfConf can be found. To guarantee that a good PerfConf setting can still be found, we propose a clustering-based approach towards auto-tuning that exploits the imprecise classification model.

Extensive experiments on seven systems commonly used in the cloud show that ClassyTune outperforms expert tuning and the state-of-the-art auto-tuning solutions, especially for high-dimensional inputs, while the computation overhead of ClassyTune is much lighter than that of the state-of-the-art GP-based BO method. An illustrative use case is presented to show how performance tuning by ClassyTune improves the system performance and enables the reduction of 33% of the cloud computing resources for an online stateless service.

REFERENCES
IEEETransactions on Cloud Computing , 2017.[2] Y. Zhu, J. Liu, M. Guo, Y. Bao, W. Ma, Z. Liu, K. Song, and Y. Yang,“Bestconfig: Tapping the performance potential of systems viaautomatic configuration tuning,” in
Proceedings of ACM Symposiumon Cloud Computing 2017 . ACM, 2017, pp. 338–350.[3] Y. Zhu, J. Liu, M. Guo, W. Ma, and Y. Bao, “Acts in need:Automatic configuration tuning with scalability guarantees,” in
Proceedings of the 8th ACM APSys , 2017, p. 14.[4] P. Bernstein, M. Brodie, S. Ceri, D. DeWitt, M. Franklin, H. Garcia-Molina, J. Gray, J. Held, J. Hellerstein, H. Jagadish et al. , “Theasilomar report on database research,”
ACM Sigmod record , vol. 27,no. 4, pp. 74–80, 1998.[5] P. Balaprakash, J. Dongarra, T. Gamblin, M. Hall, J. K.Hollingsworth, B. Norris, and R. Vuduc, “Autotuning in high-performance computing applications,”
Proceedings of the IEEE ,no. 99, pp. 1–16, 2018.[6] Z. L. Li, C.-J. M. Liang, W. He, L. Zhu, W. Dai, J. Jiang, and G. Sun,“Metis: Robustly tuning tail latencies of cloud systems,” in , 2018, pp. 981–992.[7] V. Dalibard, M. Schaarschmidt, and E. Yoneki, “Boat: buildingauto-tuners with structured bayesian optimization,” in
Proceedingsof the 26th International Conference on WWW , 2017, pp. 479–488.[8] D. E. Goldberg and J. H. Holland, “Genetic algorithms and ma-chine learning,”
Machine learning , vol. 3, no. 2, pp. 95–99, 1988. [9] E. Real, S. Moore, A. Selle, S. Saxena, Y. L. Suematsu, J. Tan, Q. V.Le, and A. Kurakin, “Large-scale evolution of image classifiers,” in
International Conference on Machine Learning , 2017, pp. 2902–2911.[10] H. Herodotou, F. Dong, and S. Babu, “No one (cluster) sizefits all: automatic cluster sizing for data-intensive analytics,” in
Proceedings of the 2nd ACM SoCC , 2011, p. 18.[11] J. Chen, G. Soundararajan, S. Ghanbari, F. Iorio, A. B. Hashemi,and C. Amza, “Ensemble: A tool for performance modeling ofapplications in cloud data centers,”
IEEE Transactions on CloudComputing , vol. 4, no. 1, pp. 20–33, 2016.[12] D. Van Aken, A. Pavlo, G. J. Gordon, and B. Zhang, “Automaticdatabase management system tuning through large-scale machinelearning,” in
Proceedings of the 2017 ACM International Conferenceon Management of Data . ACM, 2017, pp. 1009–1024.[13] Y. Li, K. Chang, O. Bel, E. L. Miller, and D. D. Long, “Capes: unsu-pervised storage performance tuning using neural network-baseddeep reinforcement learning,” in
Proceedings of the InternationalConference for High Performance Computing, Networking, Storage andAnalysis . ACM, 2017, p. 42.[14] J. Ansel, S. Kamil, K. Veeramachaneni, J. Ragan-Kelley, J. Bosboom,U.-M. O’Reilly, and S. Amarasinghe, “Opentuner: An extensibleframework for program autotuning,” in
Proceedings of the 23rd in-ternational conference on Parallel architectures and compilation . ACM,2014, pp. 303–316.[15] M. J. Kearns and U. V. Vazirani,
An introduction to computationallearning theory . MIT press, 1994.[16] J. Guo, K. Czarnecki, S. Apel, N. Siegmund, and A. Wasowski,“Variability-aware performance prediction: A statistical learningapproach,” in
IEEE/ACM 28th International Conference on ASE .IEEE, 2013, pp. 301–311.[17] Z. Bei, Z. Yu, H. Zhang, W. Xiong, C. Xu, L. Eeckhout, and S. Feng,“Rfhoc: A random-forest approach to auto-tuning hadoop’s con-figuration,”
IEEE Transactions on Parallel and Distributed Systems ,vol. 27, no. 5, pp. 1470–1483, 2016.[18] S. Wang, C. Li, H. Hoffmann, S. Lu, W. Sentosa, and A. I. Kistijan-toro, “Understanding and auto-adjusting performance-sensitiveconfigurations,” in
Proceedings of the Twenty-Third InternationalConference on ASPLOS . ACM, 2018, pp. 154–168.[19] B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, and N. De Fre-itas, “Taking the human out of the loop: A review of bayesianoptimization,”
Proceedings of the IEEE , vol. 104, no. 1, pp. 148–175,2016.[20] V. Nair, T. Menzies, N. Siegmund, and S. Apel, “Using bad learnersto find good configurations,” in
ACM FSE , 2017, pp. 257–267.[21] P. J. Van Laarhoven and E. H. Aarts, “Simulated annealing,” in
Simulated Annealing: Theory and Applications . Springer, 1987, pp.7–15.[22] E. Kwan, S. Lightstone, A. Storm, and L. Wu, “Automatic con-figuration for ibm db2 universal database,” in
Proc. of IBM PerfTechnical Report , 2002.[23] V. BYCHKOVSKY, J. CIPAR, A. WEN, L. HU, and S. MOHAP-ATRA, “Spiral: Self-tuning services via real-time machine learn-ing,” https://code.fb.com/data-infrastructure/spiral-self-tuning-services-via-real-time-machine-learning/, 2018.[24] H. Sagan,
Space-filling curves . Springer Science & Business Media,2012.[25] W. W. Cohen, R. E. Schapire, and Y. Singer, “Learning to orderthings,” in
Advances in Neural Information Processing Systems , 1998,pp. 451–457.[26] T. S. Madhulatha, “An overview on clustering methods,” arXivpreprint arXiv:1205.1117 , 2012.[27] M. D. McKay, R. J. Beckman, and W. J. Conover, “A comparisonof three methods for selecting values of input variables in theanalysis of output from a computer code,”
Technometrics , vol. 42,no. 1, pp. 55–61, 2000.[28] R. Caruana and A. Niculescu-Mizil, “An empirical comparison ofsupervised learning algorithms,” in
Proceedings of the 23rd ICML .ACM, 2006, pp. 161–168.[29] J. H. Friedman, “Greedy function approximation: a gradient boost-ing machine,”
Annals of statistics , pp. 1189–1232, 2001.[30] B. Xi, Z. Liu, M. Raghavachari, C. H. Xia, and L. Zhang, “A smarthill-climbing algorithm for application server configuration,” in