Boosting Cloud Data Analytics using Multi-Objective Optimization

Fei Song†, Khaled Zaouk†, Chenghao Lyu‡, Arnab Sinha†, Qi Fan†, Yanlei Diao†‡, Prashant Shenoy‡
†Ecole Polytechnique; ‡University of Massachusetts, Amherst
†{fei.song, khaled.zaouk, arnab.sinha, qi.fan, yanlei.diao}@polytechnique.edu; ‡{chenghao, shenoy}@cs.umass.edu

ABSTRACT
Data analytics in the cloud has become an integral part of enterprise businesses. Big data analytics systems, however, still lack the ability to take user performance goals and budgetary constraints for a task, collectively referred to as task objectives, and automatically configure an analytic job to achieve these objectives. This paper presents a data analytics optimizer that can automatically determine a cluster configuration with a suitable number of cores as well as other system parameters that best meet the task objectives. At the core of our work is a principled multi-objective optimization (MOO) approach that computes a Pareto optimal set of job configurations to reveal tradeoffs between different user objectives, recommends a new job configuration that best explores such tradeoffs, and employs novel optimizations to enable such recommendations within a few seconds. We present efficient incremental algorithms based on the notion of a Progressive Frontier for realizing our MOO approach and implement them into a Spark-based prototype. Detailed experiments using benchmark workloads show that our MOO techniques provide a 2-50x speedup over existing MOO methods, while offering good coverage of the Pareto frontier. When compared to Ottertune, a state-of-the-art performance tuning system, our approach recommends configurations that yield 26%-49% reduction in running time of the TPCx-BB benchmark while adapting to different application preferences on multiple objectives.
1. INTRODUCTION
As the volume of data generated by enterprises has continued to grow, big data analytics has become commonplace for obtaining business insights from this voluminous data. Today, such big data analytics tasks often run on the enterprise's private cloud or on machines leased by the enterprise in the public cloud. Despite its wide adoption, current big data analytics systems remain best effort in nature and typically lack the ability to take user objectives such as performance goals or cost constraints into account.

Determining an optimal hardware and software configuration for a big-data analytic task based on user-specified objectives is a complex task and one that is largely performed manually. Consider an enterprise user who wishes to run a mix of analytic tasks on their private or the public cloud. First, the user needs to choose the server hardware configuration from the set of available choices. Amazon's EC2 public cloud platform currently offers over 190 hardware configurations [1], while Microsoft Azure offers over 30 different hardware configurations [3]. These configurations differ in the number of cores, RAM size, and special hardware available. After determining the hardware configuration, the user also needs to determine the software configuration for the task by choosing various runtime parameters. For the popular Spark runtime engine, for example, these runtime parameters include parallelism (for reduce-style transformations), number of executors, number of cores per executor, memory per executor, RDD compression (boolean), and memory fraction (of heap space), to name a few.

The choice of a configuration is further complicated by the need to optimize multiple, possibly conflicting, user objectives. Consider the following real-world use cases at several large data analytics companies and cloud providers (anonymized for confidentiality) that elaborate on these challenges and motivate our work:
Use Case 1 (Data-driven Business Users). A data-driven security company runs thousands of analytic tasks in the cloud every day. The engineers managing these tasks have two objectives: keep the latency low in order to quickly detect fraudulent behaviors, and also reduce cloud costs that impose substantial operational expenses on the company. For cloud analytics, task latency can always be reduced further by allocating more hardware resources, but this comes at the expense of higher cloud costs. Hence, the engineers face the challenge of deciding the cluster configuration and other runtime parameters that balance latency and cost.

Use Case 2 (Serverless Databases). Cloud providers now offer hosted databases in the form of serverless offerings (e.g., [2]) where the database is turned off during idle periods, dynamically turned on when new queries arrive, and scaled up or down as the load changes over time. A media company that uses this serverless database to run a news site sees peak loads in the morning or as news stories break, and a lighter load at other times. The news application specifies the minimum and maximum number of computing units (CUs) to service its workload across peak and off-peak periods; it prefers to minimize cost when the load is light and expects the cloud provider to dynamically scale CUs for the morning peak or breaking news. In this case, the cloud provider needs to balance between latency under different data rates and user cost, which directly depends on the number of CUs used. To do so, the cloud provider needs automated methods to choose appropriate configurations under different workloads that address both objectives.

Overall, choosing a configuration that balances multiple conflicting objectives is non-trivial.
Studies have shown that even expert engineers are often unable to choose between two cluster options for a single objective like latency [35], let alone choosing between dozens of cluster options for multiple competing objectives.

In this paper, we introduce a multi-objective optimizer that can automate the task of determining an optimal configuration for each task based on multiple task objectives. Such an optimizer takes as input an analytic task in the form of a dataflow program (which subsumes SQL queries) and a set of objectives, and produces as output a job configuration with a suitable number of cores as well as other runtime system parameters that best meet the task objectives. At the core of our work is a principled multi-objective optimization (MOO) approach that takes multiple, possibly conflicting, objectives and computes a Pareto-optimal set of job configurations. A final configuration is then chosen from this Pareto-optimal set.

We note several differences of our work from SQL optimization: First, our work is complementary to SQL optimization for database workloads. For a given query, SQL optimization chooses a query plan, viewed as a dataflow program, that is then mapped to cluster resources by choosing an appropriate number of cores and memory per core, and configured with many parameters of a distributed engine such as Spark. Our optimizer addresses this latter step of optimization, yielding a cluster execution plan.

Second, MOO for SQL queries [16, 21, 41, 42] examines a finite set of query plans based on relational algebra, and selects the Pareto optimal ones based on the estimated cost of each plan.
In contrast, our optimizer searches through a parameter space that mixes numerical and categorical parameters, with a potentially infinite set of possible configurations, and finds those that are Pareto optimal. To suit the properties of this parameter space, our optimizer employs a numerical optimization approach to MOO.

Third, our MOO-based optimizer aims to support a broad set of analytic tasks that are commonly mixed in data analytics pipelines, including SQL queries, ETL tasks based on SQL with UDFs, and machine learning tasks for deep analysis, all in the general paradigm of dataflow programs. To do so, our optimizer leverages recent machine learning based modeling approaches [43, 50] that can automatically learn a predictive model for each objective using the runtime behavior of a user task (i.e., runtime metrics), without necessarily requiring the use of query plans. In particular, we view our MOO work as a synergy with recent work on workload modeling. Such a synergy is reminiscent of past work on SQL optimization: our MOO framework is analogous to the Dynamic Programming based optimization framework, although it has been extended to the multi-objective setting due to the needs of today's cloud analytics, while recent modeling work is analogous to cost modeling of SQL query plans but extended to automated learning of such cost models from runtime observations. As we show, working with learned models brings new challenges for optimization.

More specifically, our design of a multi-objective optimizer addresses the following technical challenges:

1. Infinite Parameter Space: There are potentially infinite configurations in our parameter space, but only a small fraction of them belong to the Pareto set; most configurations are dominated by some Pareto optimal configuration for all objectives. Hence, we must address the challenge of efficiently searching through an infinite parameter space to find these Pareto optimal configurations.

2. Coverage of the Pareto Frontier: The Pareto set over the multi-objective space is also called the Pareto frontier. Since we aim to use the Pareto frontier to recommend a new configuration that best explores tradeoffs between different objectives, the frontier should provide good coverage of the overall objective space and have a fine resolution for the regions where the tradeoffs are significant. As we show later, classical MOO algorithms [30] often fail to provide sufficient coverage of the Pareto frontier.

3. Efficiency: Since our optimizer uses learned models of user objectives, it has to handle the high complexity of such models (e.g., using Deep Neural Networks) and frequent updates of these models as new training data becomes available. Before running a user task, the learned models may have changed and the optimizer may have to recompute the Pareto frontier in order to make recommendations. Therefore, the speed of computing the Pareto frontier, e.g., within a few seconds, is crucial for adapting to bursty data loads quickly (in the serverless database case) or reducing the delay of starting a recurrent workload at a scheduled time. Most existing MOO algorithms, including Weighted Sum [30], Normal Constraints [30], and Evolutionary Methods [13], are not designed to meet such stringent performance requirements.

By addressing the above challenges, our paper makes the following contributions:

(1) We address the infinite search space issue by presenting a new approach for incrementally transforming a MOO problem to a set of constrained optimization (CO) problems, where each CO problem can be solved individually to return a Pareto optimal point.

(2) We then address the coverage and efficiency challenges by designing Progressive Frontier (PF) algorithms to realize our approach. (i) Our first PF algorithm is designed to be incremental, i.e., gradually expanding the Pareto frontier as more computing time is invested, and uncertainty-aware, i.e., returning more points in regions of the frontier that lack sufficient information. (ii) We also develop an approximate PF algorithm that, given complex learned models, solves each CO problem efficiently based on these models.
(iii) We finally devise a parallel, approximate PF algorithm to further improve efficiency.

(3) We implement our algorithms into a Spark-based prototype. Evaluation results using benchmarks for batch and streaming workloads show that our approach produces a Pareto frontier in less than 2.5 seconds (2-50x faster than other MOO methods [30, 13]), provides greater coverage over the frontier, and enables exploration of tradeoffs such as cost-latency or latency-throughput. When compared to Ottertune [43], a state-of-the-art performance tuning system, our approach recommends configurations that yield 26%-49% reduction in total running time of the TPCx-BB benchmark [40] while adapting to different application preferences on multiple objectives and being able to accommodate a broader set of models. As database research continues to deliver new results on learned models [11, 28, 26, 43, 50], the generality of our optimizer allows it to achieve even better results once new learned models are made available.
2. BACKGROUND AND MOTIVATION
In this section, we discuss requirements and constraints from real-world use cases that motivate our system design.
Our work targets a broad set of analytic tasks such as SQL queries with user-defined functions (UDFs), ETL tasks based on SQL, and machine learning-based analytics. We model these analytic tasks as dataflow programs that are designed to run on big data systems such as Spark [46, 47] and Flink [12]. We assume that each task has user- or provider-specified objectives, referred to as task objectives, that need to be optimized during execution. Common objectives include latency, throughput, and resource (i.e., cloud) costs. A task can specify multiple objectives and it is possible for these objectives to conflict. To execute the task, the system needs to determine a cluster execution plan with system parameters instantiated. As stated before, these parameters control the degree of parallelism (the number of cores or executors), memory allocated to each executor or data buffer, granularity of scheduling, compression options, shuffling strategies, etc. An executing task using this plan is referred to as a job and the system parameters are collectively referred to as the job configuration.

The overall goal of our multi-objective optimizer is: given a user dataflow program and a set of objectives, compute a job configuration that optimizes these objectives and adapt the configuration quickly if either the task load or task objectives change.
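As a concrete illustration, a job configuration can be viewed as one point in a mixed numeric/categorical parameter space. The minimal Python sketch below is an assumption for illustration: the key names echo the Spark-style knobs discussed in the text, not the system's actual schema.

```python
# Hypothetical sketch of a "job configuration": one point x in the
# mixed numeric/categorical parameter space. Key names follow the
# Spark-style knobs mentioned in the text; the exact set is illustrative.
default_config = {
    "parallelism": 200,           # reduce-side parallelism
    "num_executors": 8,           # degree of parallelism across the cluster
    "cores_per_executor": 4,
    "memory_per_executor_gb": 16,
    "rdd_compression": True,      # boolean (categorical) parameter
    "memory_fraction": 0.6,       # fraction of heap for execution/storage
}

def total_cores(config):
    """Resource footprint of the plan, e.g., one input to a cost objective."""
    return config["num_executors"] * config["cores_per_executor"]
```

A cost objective would typically grow with `total_cores(default_config)` (here 32 cores), while a latency objective would typically shrink with it, which is exactly the kind of conflict the optimizer must navigate.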
In particular, our work is designed to address the following requirements of real-world analytics tasks.
Recurring workloads.
It is common for analytic environments to see repeated jobs in the form of daily or hourly batch jobs. In some cases, recurring jobs have dependencies that trigger other jobs upon completion, yielding a pipeline of repeating jobs. Sometimes stream jobs can be repeated as well: under the lambda architecture, the batch layer runs to provide perfectly accurate analytical results, while the speed layer offers fast approximate analysis over live streams; the results of these two layers are combined to serve a model. As old data is periodically rolled into the batch job, the streaming job is restarted over new data with a clean state. Our optimizer is designed to be invoked prior to each execution of a recurring job with the goal of choosing a configuration that improves performance towards target objectives.
Serverless database workloads.
As noted in Section 1, serverless databases (DBs) are becoming common in cloud environments. Each invocation of a serverless DB by the cloud platform requires a configuration that satisfies multiple objectives: to provide low query latency to end-users while using the least-cost cloud configuration for the expected load. Furthermore, auto-scaling features of serverless DBs imply that new configurations need to be computed quickly to react to load changes. Our optimizer is designed to support serverless DBs by quickly (re)computing a configuration at each invocation or auto-scaling to a larger cluster instance.

In both of the above scenarios, configurations need to be recommended under stringent time requirements, e.g., within a few seconds, in order to minimize the delay of starting a recurring job, or invoking or scaling a serverless DB. Such time constraints distinguish our MOO work for use by a cloud optimizer from the theoretical work on MOO [30, 13] in the optimization community.
Mixed workloads. Besides SQL queries, cloud analytics today often involves large ETL jobs for data cleaning, transformation, and integration, as well as machine learning tasks for deep analysis. Even SQL queries make extensive use of user-defined functions, as recently reported [35]. In order to support such diverse analytic workloads, our work leverages recent machine learning based modeling approaches [43, 50] that can automatically learn a predictive model for each task objective based on runtime observations collected from job execution. The implication for our design is that the optimizer takes as input predictive models for target objectives, where each model is learned to predict the value of a target objective based on the configuration of all system parameters. Designing a MOO solution over learned models that are highly complex (e.g., using Deep Neural Networks (DNNs)) and updated frequently is challenging given our coverage and efficiency requirements.

We next describe the relation of our MOO work to the most relevant prior work.
Single-objective performance tuning.
Recent systems such as OtterTune [43] and CDBTune [50] address performance tuning of SQL queries for objectives such as minimizing latency. They determine how to set the parameters of an RDBMS by modeling the objective as a function of the parameters and then iteratively exploring new configurations to update the model and move the observed performance toward the optimum of the objective. These systems differ from our work in two main aspects:

1) Single objective optimization. Both OtterTune and CDBTune are inherently designed for optimizing a single, fixed objective, while we seek to optimize multiple objectives. OtterTune can consider only one objective (e.g., latency) for optimization [43]. CDBTune considers both latency and throughput but uses a fixed, weighted approach to compose a single objective from them [50]. As we will show, using a weighted approach to reduce a MOO problem to a fixed, single objective (SO) optimization problem entails significant loss of information regarding tradeoffs between different objectives and yields suboptimal solutions.

2) Integrated modeling and optimization. OtterTune and CDBTune integrate modeling and optimization steps into a single, long tuning session for each query workload. Such a session couples modeling and optimization via iterative exploration of new configurations and takes 15-45 minutes to run [37, 43]. In contrast, our target applications impose stringent time constraints and require new configurations to be computed in seconds rather than tens of minutes. To meet these stringent time constraints, we argue for decoupling modeling and optimization into two separate steps. We assume that the time-consuming modeling step is performed asynchronously in the background whenever new training data becomes available. The MOO step runs separately on demand and uses the most recent models to compute a new configuration with a delay of a few seconds. Such decoupling allows fast computation of new configurations with latencies that are not possible in approaches that integrate the two steps. The effectiveness of such decoupling was recently shown in a preliminary demo of our system [48].

The implications of this change are two-fold: (i) It enables the MOO to be general and not tied to a specific modeling approach or tool. This frees the modeling engine to use any appropriate modeling tool, and any future improvements in modeling automatically improve the efficacy of the MOO as well. Integrated approaches do not allow this flexibility: OtterTune's optimization method is tied to Gaussian Process (GP) modeling, while CDBTune's optimization method is tied to its Reinforcement Learning (RL) model. As we shall show in this paper, our optimization solution works as long as the learned models can be represented as a regression function on system parameters, e.g., using a GP or a DNN. (ii) The modeling engine can train a new model in the background as new training data becomes available.
When the MOO solution needs to be run for a given task, the model for this task may have been updated, and hence the speed of computing a Pareto frontier based on the new model is a key performance goal.

Learning-based modeling.
Our MOO is designed to work with any applicable modeling approach that can produce predictive models for task objectives. Since our focus is on MOO rather than modeling, we leverage recent work on automatic model learning [11, 43, 50] for our MOO. Such modeling work makes two assumptions: (1) A model for each objective can be learned from runtime observations, e.g., latency as the response to different configurations of system parameters. (2) Models for different objectives can be built as regression functions over the full set of system parameters. Today's analytics engines have large numbers (10s or 100s) of parameters, as reported for DBMSs [43, 50] and as we observed for the Spark engine. These parameters may contribute differently to the objectives, which is captured by different weights in the learned models. Further, for a specific objective, unimportant parameters can be filtered via feature selection [43].

Among modeling work, OtterTune [43] best matches the needs of our modeling engine: it can automatically learn a model for each objective as a regression function on system parameters. It is better than previous work that employs hand-crafted models [35, 44], which are hard to generalize. It is also better than iTuned [11], which uses GP models to search for optimal configurations, but fails to train models using data from the history and hence offers inferior performance. Finally, CDBTune [50] employs an RL approach and hence is subject to the limitation that the reward of choosing a configuration is a scalar value. To cope with this, CDBTune uses fixed weights to combine latency and throughput into a single objective in order to build the reward function. As such, it cannot return a regression function for each objective as required for our MOO.

Figure 1: Overview of a dataflow optimizer. [Diagram: the Modeling Engine produces objective models (Ψ_1, ..., Ψ_k); these models, together with the objectives F_1, ..., F_k and their constraints, feed the Multi-Objective Optimization module of the Dataflow Optimizer, which outputs the Pareto frontier and a job configuration.]

Therefore, our current prototype uses OtterTune as the default approach for the modeling engine while still supporting other modeling solutions, e.g., DNNs. Our experiments demonstrate the benefits of our MOO solution over the optimization solution of OtterTune when both systems use the same set of learned models.
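To make the decoupling concrete, the optimizer only needs each learned model Ψ_i to behave as a regression function from a configuration to a predicted objective value. Below is a minimal sketch: the toy analytic formulas stand in for a trained GP or DNN and are assumptions for illustration, not the paper's actual models.

```python
# Minimal sketch: each learned model Psi_i is treated as an opaque
# function from a configuration x to a predicted objective value.
# The analytic formulas below are illustrative stand-ins for a
# trained GP or DNN.
def psi_latency(x):
    cores, mem_gb = x
    return 100.0 / cores + 2.0 / mem_gb   # latency shrinks with resources

def psi_cost(x):
    cores, mem_gb = x
    return 0.5 * cores + 0.1 * mem_gb     # cost grows with resources

models = [psi_latency, psi_cost]

def predict_objectives(x, models):
    """Evaluate every objective model at configuration x."""
    return [psi(x) for psi in models]
```

Because the MOO step only calls the models through this uniform interface, swapping a GP for a DNN requires no change to the optimizer, which is the flexibility argued for above.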
Figure 1 shows the design of our dataflow optimizer based on the above requirements. It takes as input a set of objectives for a task, denoted by (F_1, ..., F_k), and optionally value constraints on these objectives, [F_i^L, F_i^U]. It also takes a set of task-specific predictive models, (Ψ_1, ..., Ψ_k), one for each objective, and computes a job configuration that optimizes these objectives.

When a job runs for the first time, no predictive models will be available for the job. The job is assumed to run with a default configuration. Our optimizer assumes that a separate modeling engine will collect traces during the job execution and then use these traces to design task-specific models. Two types of traces are assumed to be collected: (i) system-level metrics, e.g., from the Spark engine, such as records read and written, bytes spilled to disk, and fetch wait time, and (ii) observed values of task objectives such as latency and compute cost. The modeling engine is assumed to use these traces to compute task-specific regression models (Ψ_1, ..., Ψ_k), one for each user objective. Our optimizer is designed to work with any modeling approach that can provide such models; our current implementation uses models learned by OtterTune [43] as well as our own custom DNN implementations. Regardless of the approach used, the model training is done offline and separately from the MOO path.

Given the task-specific predictive models, the multi-objective optimization (MOO) module searches through the space of configurations and computes a set of Pareto-optimal configurations for the job. Based on insights revealed in the Pareto frontier, the optimizer chooses a new configuration that meets all user constraints and best explores the tradeoffs among different objectives.
Future invocations of the task run with this new configuration.

If the user decides to adjust the constraints on the objectives (e.g., specifying new bounds [F_i^L, F_i^U] to increase the throughput requirement in order to adapt to higher data rates), the MOO can quickly return a new configuration from the computed Pareto frontier. As the modeling engine continues to collect additional samples from recurring executions of this task as well as others, it may periodically update the task's predictive models. Upon the next execution of the task, if updated models become available, the Pareto frontier will be recomputed by re-running the MOO and a new configuration will be chosen.
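The constraint-adjustment path just described can be sketched as selection over a cached Pareto frontier. The frontier entries, the (latency, cost) objective pairs, and the cheapest-feasible selection rule below are all assumptions for illustration, not the system's actual policy.

```python
# Sketch: pick a configuration from a cached Pareto frontier under
# (possibly updated) user constraints. Each frontier entry pairs a
# configuration with its predicted (latency, cost) objective vector.
def pick_configuration(frontier, latency_bounds, cost_bounds):
    lo_l, hi_l = latency_bounds
    lo_c, hi_c = cost_bounds
    feasible = [
        (cfg, (lat, cost))
        for cfg, (lat, cost) in frontier
        if lo_l <= lat <= hi_l and lo_c <= cost <= hi_c
    ]
    if not feasible:
        return None  # constraints too tight: the frontier must be recomputed
    # Illustrative policy: cheapest configuration among feasible ones.
    return min(feasible, key=lambda item: item[1][1])

frontier = [
    ({"cores": 32}, (2.0, 9.0)),
    ({"cores": 16}, (3.5, 5.0)),
    ({"cores": 8},  (6.0, 3.0)),
]
```

Tightening the latency bound, say from (0, 4) to (0, 2.5), shifts the selection from the 16-core to the 32-core entry without re-running the MOO, which is why caching the frontier makes constraint adjustments cheap.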
3. PROGRESSIVE FRONTIER APPROACH
In this section, we present a new Progressive Frontier framework for solving a multi-objective optimization problem, and then use this concept to design fast algorithms in the next section.
We begin with a mathematical definition of the multi-objective optimization problem for a given user task.
Problem 3.1. Multi-Objective Optimization (MOO).

    arg min_x  f(x) = [ F_1(x) = Ψ_1(x), ..., F_k(x) = Ψ_k(x) ]        (1)
    s.t.  x ∈ Σ ⊆ R^d,
          F_i^L ≤ F_i(x) ≤ F_i^U,  i = 1, ..., k

where x is the job configuration with d parameters and Σ ⊆ R^d denotes all possible job configurations. Further, F_i(x) generally denotes one of the k objective functions, while Ψ_i(x) refers particularly to the predictive model learned for this objective. If a task objective favors larger values, we add a minus sign to the objective function to transform it to a minimization problem. Optionally, there can be a number of inequality constraints on the parameter vector x. Note that our problem involves the d-dimensional parameter space, Σ, and the k-dimensional objective space, Φ, where each dimension corresponds to an objective.

In general, multi-objective optimization (MOO) leads to a set of solutions rather than a single optimal solution. Hence, the notion of optimality in MOO settings is based on the following concepts:

Definition 3.1. Pareto Domination:
In the objective space Φ, a point f′ Pareto-dominates another point f″ iff ∀i ∈ [1, k], f′_i ≤ f″_i and ∃j ∈ [1, k], f′_j < f″_j.

Definition 3.2. Pareto Optimal:
In the objective space Φ, a point f* is Pareto optimal iff there does not exist another point f′ that Pareto-dominates it.

Definition 3.3. Pareto Set (Frontier):
For a given job, the Pareto set (frontier) F includes all the Pareto optimal points in the objective space Φ, and is the solution to the MOO problem.

The MOO problem is characterized by a mapping (Ψ_1, ..., Ψ_k) from the numerical parameter space, Σ, to the numerical objective space, Φ. Through this mapping, we want to find those configurations in the parameter space Σ that lead to Pareto-optimal points in the objective space Φ.

Besides the above definition, we impose performance requirements to suit the needs of a multi-objective optimizer for cloud analytics. As noted in Section 1, the Pareto frontier needs to be computed with good coverage in order to enable recommendations of best configurations, and with high efficiency to respond to the need of starting a new database instance or adapting to bursty data loads quickly in the serverless database case, or to reduce the delay of starting a recurrent workload at a scheduled time.

Weighted Sum (WS) [30]: The WS method transforms the MOO problem into a single-objective optimization problem by distributing the weights (preferences) among different objectives. The WS method tries a number of weight distributions and for each computes the optimal solution. It returns the collection of solutions as the Pareto set. A major issue with WS is the poor coverage of the Pareto frontier [8, 31], which is undesirable if one wants to understand tradeoffs across the entire Pareto frontier.
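The dominance test of Definitions 3.1-3.3, which all of the methods discussed in this section rely on, translates directly into code. A minimal sketch over toy objective vectors (all objectives minimized):

```python
# Definitions 3.1-3.3 as code: f1 Pareto-dominates f2 iff it is no
# worse in every objective and strictly better in at least one; the
# Pareto frontier keeps exactly the non-dominated points.
def dominates(f1, f2):
    """True iff point f1 Pareto-dominates point f2 (minimization)."""
    return (all(a <= b for a, b in zip(f1, f2))
            and any(a < b for a, b in zip(f1, f2)))

def pareto_frontier(points):
    """Return the Pareto-optimal subset of a finite set of distinct points."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]
```

For example, `pareto_frontier([(1, 5), (2, 2), (3, 3)])` returns `[(1, 5), (2, 2)]`, since (3, 3) is dominated by (2, 2) while (1, 5) and (2, 2) trade off against each other.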
Normalized Constraints (NC) [30]: To address the coverage issue, the NC method [32] provides a set of evenly spaced Pareto optimal points on the frontier F. It divides the objective space into an evenly distributed grid and probes the grid points to obtain even coverage of the objective space. However, NC suffers from efficiency issues. First, it uses a pre-determined parameter, N_p, to specify how many Pareto points to explore, which can grow exponentially with the dimensionality of Φ. This parameter can impose high computational cost (if set too high) or produce inaccurate Pareto frontiers (if set too low). Second, NC is not an incremental algorithm: if the Pareto frontier built from k probes does not provide sufficient information about the tradeoffs, then the next run with k′ (> k) probes will start the computation from scratch.

Evolutionary Methods (Evo) [13] are randomized methods to approximately compute the frontier set. When evolutionary methods are used to build an optimizer in our setting, they suffer from two issues: First, they are not incremental algorithms. Like the NC method, if the Pareto frontier built from k probes does not provide sufficient information about the tradeoffs, then the next run with k′ (> k) probes will start from scratch. Since the optimizer does not know the sufficient number of probes in advance, it needs to start from a smaller value and try larger values if more information is needed. Hence, evolutionary methods suffer from an efficiency issue by not being able to reuse previous results. Second, a more severe problem is inconsistency: the Pareto frontier built from k′ probes can be inconsistent with that built from k probes, as our experimental results show.
This causes a major problem in recommending configurations: if the optimizer uses the Pareto frontier from k probes to make a recommendation due to stringent time requirements, it can be invalidated completely later as the Pareto frontier is recomputed to include more points.

To meet the aforementioned performance requirements, we introduce a new approach, called Progressive Frontier. It incrementally transforms the MOO problem into a series of constrained single-objective optimization problems, which can be solved individually. In this section, we formally describe our approach and show its correctness.

We first show how to generate a single Pareto optimal point by solving a constrained optimization (CO) problem derived from the MOO; later on, we use this building block to incrementally generate a series of CO problems to compute a number of Pareto-optimal points on the Pareto frontier. Our approach overcomes limitations of traditional CO by automatically choosing the constraint thresholds and incrementally improving the solution set.
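A minimal sketch of this CO building block follows: minimize one chosen objective while box constraints bound every objective. Exhaustive search over a small sampled set of configurations stands in for the actual numerical solver, and the two toy objectives (latency falling with cores, cost rising with cores) are assumptions for illustration.

```python
# Sketch of the constrained-optimization (CO) building block:
# minimize objectives[i] subject to bounds[j] = (C_L_j, C_U_j) on
# every objective j. Brute-force search over sampled configurations
# stands in for a real numerical solver.
def solve_co(configs, objectives, i, bounds):
    """Return (config, value) minimizing objectives[i] within the bounds,
    or (None, inf) if no sampled configuration is feasible."""
    best, best_val = None, float("inf")
    for x in configs:
        f = [obj(x) for obj in objectives]
        if all(lo <= fj <= hi for fj, (lo, hi) in zip(f, bounds)):
            if f[i] < best_val:
                best, best_val = x, f[i]
    return best, best_val

# Toy objectives over a 1-D "number of cores" knob (illustrative):
latency = lambda cores: 10.0 / cores
cost = lambda cores: float(cores)
```

With configurations [1, 2, 4, 8] and bounds of (0, 6) on latency and (0, 5) on cost, minimizing latency returns the 4-core configuration: 8 cores would be faster but violates the cost bound, which is exactly how the constraints carve a hyperrectangle containing one Pareto point.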
Problem 3.2. Constrained Optimization (CO).
Given the MOO defined in Formula 1, its constrained optimization is a single-objective optimization problem P_C^(i) defined as:

    x_C = arg min_x  F_i(x),  subject to  C_j^L ≤ F_j(x) ≤ C_j^U,  j ∈ [1, k]        (2)

where [C_j^L, C_j^U] is the constraint on the j-th objective, and C = {[C_j^L, C_j^U] | j ∈ [1, k]} is the set of constraints on all the objectives.

Proposition 3.1.
The solution to the above CO problem is Paretooptimal within the hyperrectangle formed by the constraints, C = { [ C Lj , C Uj ] , j ∈ [1 , k ] } .The result is easy to prove—within the hyperrectangle, we haveminimized the objective i and hence, no other points can dominatethis solution in the same hyperrectangle.We next explain how exactly we construct the CO problem given k objectives. To present our solution, we repeat some definitionsused in the NC method [32] below. Definition 3.4. Reference Point: r i is a special Pareto optimalpoint if it achieves the minimum for objective F i . F F 𝒇 𝐍 𝒇 𝐔 𝒇 𝑴 𝑵 𝑵 𝑼 𝑼 (a) Middle point probe. F F 𝒇 𝐍 𝒇 𝐔 𝒇 f 𝒇 (b) multiple Pareto points. Figure 2:
Uncertain space in two dimensions. $f^U$ and $f^N$ are the Utopia and Nadir points, respectively. In (a), $f^M$ is the solution to the middle point probe. In (b), the three marked points represent the solutions of a series of middle point probes. Definition 3.5. Utopia and Nadir Points: let $r^1, \ldots, r^k \in \Phi$ be the $k$ reference points. A point $f^U \in \Phi$ is a Utopia point iff for each $i = 1, \ldots, k$, $f_i^U = \min_{l=1}^{k} \{r_i^l\}$. A point $f^N \in \Phi$ is a Nadir point iff for each $i = 1, \ldots, k$, $f_i^N = \max_{l=1}^{k} \{r_i^l\}$. The Utopia point and Nadir point are indeed the two extreme points that form a hyperrectangle in the objective space. Every point in this hyperrectangle Pareto dominates the Nadir point and is dominated by the Utopia point. The Pareto frontier can be of arbitrary shape within this hyperrectangle. Proposition 3.2.
For the hyperrectangle enclosed by the Utopia and Nadir points, no Pareto optimal points outside the hyperrectangle (if they exist) can dominate any point within it. Further, in the 2-dimensional case, the same conclusion holds for any hyperrectangle enclosed by a pair of known Pareto optimal points. Due to space constraints, we refer the reader to our technical report [38] for the proof of this and other propositions. Next we describe a method to find one
Pareto point in the space enclosed by the Utopia point and the Nadir point.
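As a concrete sketch of Definition 3.5 (plain Python; the helper name is ours, not from the paper's implementation), the Utopia and Nadir points are simply the coordinate-wise minimum and maximum over the $k$ reference points:

```python
# Illustrative sketch: given the k reference points r^1..r^k (where r^i
# minimizes objective F_i), the Utopia point takes the per-objective minimum
# and the Nadir point the per-objective maximum.

def utopia_nadir(reference_points):
    """reference_points: list of k points, each a list of k objective values."""
    k = len(reference_points[0])
    utopia = [min(r[i] for r in reference_points) for i in range(k)]
    nadir = [max(r[i] for r in reference_points) for i in range(k)]
    return utopia, nadir

# Two-objective example (e.g., latency and cost): r1 minimizes F_1, r2 minimizes F_2.
refs = [[1.0, 9.0], [5.0, 2.0]]
print(utopia_nadir(refs))  # ([1.0, 2.0], [5.0, 9.0])
```

Every point of the Pareto frontier then lies inside the hyperrectangle spanned by these two points.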
Definition 3.6. Middle Point Probe:
Given a hyperrectangle formed by $f^U = (f_1^U, \ldots, f_i^U, \ldots, f_k^U)$ and $f^N = (f_1^N, \ldots, f_i^N, \ldots, f_k^N)$, the middle point $f^M$ bounded by $f^U$ and $f^N$ is defined as the solution to the constrained optimization:

$$x_C = \arg\min_x F_i(x), \quad \text{subject to } f_j^U \le F_j(x) \le \frac{f_j^U + f_j^N}{2}, \ j \in [1, k]. \quad (3)$$

As before, we use $C^M$ to denote the constraints in Eq. 3. In theory, we can choose any $i$ to be the objective of the middle point probe. Proposition 3.3.
If we cannot find the middle point $f^M$, then there are no Pareto optimal points within the hyperrectangle enclosed by the constraints $C^M$. In the presence of $f^M$, by Propositions 3.1 and 3.2, $f^M$ is a Pareto optimal point in the 2D case; in high-dimensional cases, it is a candidate point regarding Pareto optimality, subject to further filtering. The detailed proof is given in our technical report [38]. Note that in high-dimensional cases, our result is similar to classical MOO algorithms like NC [30] in that the algorithm can return only likely candidates for further pruning, due to the complexity of dividing high-dimensional space. To quantify the effect of finding $f^M$ on the objective space, we introduce the following notion: Definition 3.7. Uncertain Space:
Given a hyperrectangle formed by the Utopia point $f^U$ and Nadir point $f^N$ in the objective space, the uncertain space is defined as the space within this hyperrectangle that encloses all possible shapes of the Pareto frontier consistent with the current set of Pareto optimal points available. Intuitively, one can interpret the uncertain space as the subregions of the objective space where a Pareto point may exist anywhere within them. In the presence of $f^M$, the sub-hyperrectangle enclosed by $f^M$ and $f^U$ contains only points that dominate $f^M$, as depicted by the blue rectangle in Figure 2(a). Due to the Pareto optimality of $f^M$, we are certain that no Pareto points exist there. Similarly, the sub-hyperrectangle enclosed by $f^M$ and $f^N$, depicted by the red rectangle in Figure 2(a), contains only points dominated by $f^M$; hence no Pareto points can exist there either. Thus, by finding $f^M$, we reduce the uncertain space from the original hyperrectangle (enclosed by $f^U$ and $f^N$) by discarding these two colored sub-hyperrectangles. This notion extends naturally to higher dimensions: in a $k$-dimensional hyperrectangle, each Pareto point divides the hyperrectangle into $2^k$ sub-hyperrectangles, and those that are Pareto-dominated by $f^M$ can be discarded. Then we have the following result on the effect of finding $f^M$. Proposition 3.4.
In case the middle point probe returns no point, we can claim that no Pareto optimal points reside within the sub-hyperrectangle enclosed by its constraints, $C^M$, hence reducing the uncertain space. Otherwise, it is guaranteed to reduce the uncertain space of the current hyperrectangle by discarding the dominated sub-hyperrectangles. A Series of Constrained Optimization Problems.
Having explained that a middle point probe defines a CO problem, potentially leading to one Pareto optimal point, we next extend our solution to translate a MOO problem into a series of CO problems in order to cover all (or most) of the Pareto set. The middle point probe method described above offers not only a way to find a single Pareto point, but also a potential iterative method to find more Pareto points. The middle point $f^M$ divides the hyperrectangle into $2^k$ sub-hyperrectangles, formed by the $k$ hyperplanes that pass through $f^M$, each parallel to one dimension. Abusing notation slightly, each sub-hyperrectangle is enclosed by its own Utopia and Nadir points, which can be calculated from $f^U$, $f^N$, and $f^M$. As seen in Figure 2(a), after finding the middle point $f^M$, one can form two sub-hyperrectangles, enclosed by $(U^1, N^1)$ and $(U^2, N^2)$, respectively. Based on this observation, we derive an iterative method to find more Pareto points by recursively invoking the middle point probe for each sub-hyperrectangle formed. Iterative Middle Point Probes.
Given the initial Utopia and Nadir points, $f^U$ and $f^N$, in the $k$-dimensional objective space, the first middle point probe returns one Pareto point, $f^M$, which divides the hyperrectangle enclosed by $(f^U, f^N)$ into $2^k$ sub-hyperrectangles. The sub-hyperrectangles dominated by $f^M$ can be safely discarded, while the others are added to a queue; each has its own local Utopia and Nadir points, $(U^i, N^i)$, representing unexplored space with no knowledge of whether Pareto points exist there. For each unexplored sub-hyperrectangle, we remove it from the queue and apply the middle point probe again: if the probe returns a new Pareto point, we generate new sub-hyperrectangles and add them to the queue for later exploration. This process continues until the queue is empty or the maximum number of iterations is reached. In Figure 2(b), we illustrate the uncertain space given a set of Pareto optimal points in the 2D case. Each of the Pareto points divides the rectangle enclosed by $(f^U, f^N)$ into four parts, and the union of the colored regions is where no Pareto optimal points can exist. The complement of this union, the unfilled regions, is the uncertain space. In a $k$-dimensional hyperrectangle, again, if we take the union of all the regions nonviable for Pareto optimality, its complement forms the uncertain space. Following the previous propositions, we have the following result on our returned solution set. Proposition 3.5.
If we start the Iterative Middle Point Probes procedure from the initial Utopia and Nadir points and let it run until the uncertain space becomes empty, then in the 2D case our procedure is guaranteed to find all the Pareto points if they are finite in number. In high-dimensional cases, it is guaranteed to find a subset of the Pareto optimal points.
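The orthant bookkeeping behind this procedure can be sketched in plain Python (our own illustration; function and variable names are ours). A probe point $f^M$ splits the hyperrectangle $[f^U, f^N]$ into $2^k$ orthants, four in the 2D case of Figure 2, and the all-dominating and all-dominated orthants are dropped:

```python
from itertools import product

# Sketch: enumerate the 2^k sub-hyperrectangles around a probe point `middle`
# inside the hyperrectangle [utopia, nadir], discarding the orthant whose
# points all dominate f^M and the orthant whose points are all dominated by it.

def sub_hyperrectangles(utopia, nadir, middle):
    k = len(utopia)
    rects = []
    for choice in product([0, 1], repeat=k):
        # choice[j] == 0: take [utopia_j, middle_j]; == 1: take [middle_j, nadir_j]
        lo = [utopia[j] if choice[j] == 0 else middle[j] for j in range(k)]
        hi = [middle[j] if choice[j] == 0 else nadir[j] for j in range(k)]
        if all(c == 0 for c in choice):   # every point here dominates f^M
            continue
        if all(c == 1 for c in choice):   # every point here is dominated by f^M
            continue
        rects.append((lo, hi))
    return rects

# 2D example: 4 quadrants, of which 2 survive (Figure 2(a)).
print(len(sub_hyperrectangles([0, 0], [4, 4], [2, 2])))  # 2
```

In 2D this leaves exactly the two unexplored rectangles of Figure 2(a); in 3D, six of the eight orthants survive and are queued for further probes.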
Filtering of Candidate Points. As explained previously, in high-dimensional cases it is not guaranteed that the CO solutions returned are Pareto optimal. Therefore, we add a filter at the end of our Progressive Frontier approach to remove the solutions that are not Pareto optimal. This is similar to the NC method [32].
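This final filtering step can be sketched as a straightforward O(n^2) pass (plain Python; helper names are ours, not the paper's code) that drops any candidate dominated by another candidate:

```python
# Sketch of the Pareto filter: keep a point only if no other point in the
# candidate set dominates it (all objectives are minimized).

def dominates(a, b):
    """a dominates b if a is no worse in every objective and better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_filter(points):
    return [p for p in points
            if not any(dominates(q, p) for q in points if q is not p)]

# [3, 3] is dominated by [2, 2] and is removed.
print(pareto_filter([[1, 5], [2, 2], [3, 3], [5, 1]]))  # [[1, 5], [2, 2], [5, 1]]
```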
4. PF ALGORITHMS
In this section, we present algorithms that implement the Progressive Frontier (PF) approach. We note that most MOO algorithms suffer from exponential complexity with respect to the number of objectives, $k$. This is because the number of non-dominated points tends to grow very quickly with $k$, and the time complexity of computing the volume of dominated space grows super-polynomially with $k$ [13]. For this reason, the MOO literature refers to optimization with up to 3 objectives as multi-objective optimization, whereas optimization with more than 3 objectives is called many-objective optimization and is handled using different methods such as preference modeling [13] or fairness among different objectives [39]. Since our goal is to develop a practical solution for a cloud optimizer, most of our use cases fall within the scope of multi-objective optimization. However, a unique requirement in our problem setting is to compute the Pareto frontier within a few seconds, which has not been considered previously [13]. To achieve this performance goal, we present several techniques, including uncertainty-aware incremental computation, fast approximation, and parallel computation. We first present a deterministic, sequential algorithm that implements the Progressive Frontier (PF) approach, referred to as PF-S. This algorithm has two key features. First, it is incremental, in the sense that it first constructs a Pareto frontier $\tilde{F}$ using a small number of points and expands $\tilde{F}$ with more points as more time is invested. This feature is crucial because finding even one Pareto point is already expensive, requiring the solution of a mixed-integer nonlinear programming problem [5, 17]. Hence, one cannot expect the optimizer to find all Pareto points at once.
Instead, it produces $n_1$ points first (e.g., those that can be computed within the first second), then expands the frontier with an additional $n_2$ points, afterwards $n_3$ points, and so on. Second, this algorithm is uncertainty-aware, returning more points in the regions of the Pareto frontier that lack sufficient information. If the Pareto frontier includes a large number of points, under a time constraint we can return only a subset of them as an approximation. In this case, we would like to generate the next set of CO problems such that the points returned bring the most valuable information. To do so, we augment the Iterative Middle Point Probe method to choose the best sub-hyperrectangle to explore next, among those that have not been explored. We define a measure, the volume of uncertain space, to capture how much the current frontier $\tilde{F}$ may deviate from the true yet unknown frontier $F$. (Note that both $\tilde{F}$ and $F$ may include an infinite set of points.) The volume of uncertain space can be calculated from the related set of sub-hyperrectangles, which allows us to rank the sub-hyperrectangles that have not been explored; among those, the one with the largest volume of uncertain space is chosen to explore next. We are now ready to present the complete sequential algorithm. It is an incremental method that tries to reduce the uncertain space as fast as possible. Below we give a brief description, while the details are shown in Algorithm 1.

Algorithm 1: Progressive Frontier-Sequential (PF-S)
Require: k lower bounds (LOWER): lower_j; k upper bounds (UPPER): upper_j; number of points: M
 1: PQ ← ∅   {PQ is a priority queue sorted by hyperrectangle volume}
 2: plan_i ← optimize_i(LOWER, UPPER)   {the single-objective optimizer takes LOWER and UPPER as constraints and optimizes the i-th objective}
 3: Utopia, Nadir ← computeBounds(plan_1, ..., plan_k)
 4: volume ← computeVolume(Utopia, Nadir)
 5: seg ← (Utopia, Nadir, volume)
 6: PQ.put(seg)
 7: count ← k
 8: repeat
 9:   seg ← PQ.pop()
10:   Utopia ← seg.Utopia
11:   Nadir ← seg.Nadir
12:   Middle ← (Utopia + Nadir) / 2
13:   Middle_i ← optimize_i(Utopia, Middle)   {constrained optimization on the i-th objective}
14:   {plan} ← {plan} ∪ {Middle_i}
15:   count ← count + 1
16:   {rectangle} ← generateSubRectangles(Utopia, Middle, Nadir)   {returns the surviving sub-hyperrectangles, each represented by its own Utopia and Nadir}
17:   for each rectangle in {rectangle} do
18:     Utopia ← rectangle.Utopia
19:     Nadir ← rectangle.Nadir
20:     volume ← computeVolume(Utopia, Nadir)
21:     seg ← (Utopia, Nadir, volume)
22:     PQ.put(seg)
23:   end for
24: until count > M
25: output ← filter({plan})   {remove any plan dominated by another plan in the same set}

Init:
Find the reference points by solving $k$ single-objective optimization problems. Form the initial Utopia and Nadir points, $U$ and $N$, and construct the first hyperrectangle. Prepare a priority queue in decreasing order of hyperrectangle volume, initialized with the first hyperrectangle. Iterate:
Pop a hyperrectangle from the priority queue; it is formed by $U^i$ and $N^i$ and has the largest volume among all the existing hyperrectangles. Apply the middle point probe to find a Pareto point, $f^M$, in this hyperrectangle. Divide the current hyperrectangle into $2^k$ sub-hyperrectangles, discard those that are dominated by $f^M$, and calculate the volumes of the others. Put them into the priority queue. Terminate: when the desired number of solutions is reached.
Filter:
Check the result set, and remove any point dominated by another one in the result set.
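The Init/Iterate/Terminate/Filter loop above can be run end to end on a toy problem. The sketch below is our own simplification, not the paper's implementation: the probe solves each constrained optimization by brute-force scan over a small discrete configuration grid (standing in for the solver described next), sub-hyperrectangles are split at the returned Pareto point as in the 2D case of Figure 2(a), and all names are illustrative.

```python
import heapq

# Toy 2-objective problem over a discrete grid of configurations.
CONFIGS = [(x, y) for x in range(11) for y in range(11)]

def objectives(c):
    x, y = c
    return ((x - 10) ** 2 + y, x + (y - 10) ** 2)  # both minimized

def probe(lower, upper):
    """Middle-point probe: minimize objective 0 subject to box constraints."""
    feasible = [c for c in CONFIGS
                if all(lower[j] <= objectives(c)[j] <= upper[j] for j in range(2))]
    return min(feasible, key=lambda c: objectives(c)[0]) if feasible else None

def volume(lo, hi):
    v = 1.0
    for l, h in zip(lo, hi):
        v *= max(h - l, 0)
    return v

def pf_s(max_points=5):
    # Init: two single-objective optimizations give the reference points.
    refs = [min(CONFIGS, key=lambda c: objectives(c)[i]) for i in range(2)]
    pts = [objectives(r) for r in refs]
    utopia = [min(p[i] for p in pts) for i in range(2)]
    nadir = [max(p[i] for p in pts) for i in range(2)]
    frontier = list(pts)
    pq = [(-volume(utopia, nadir), utopia, nadir)]  # largest volume popped first
    while pq and len(frontier) < max_points:
        _, lo, hi = heapq.heappop(pq)
        mid = [(l + h) / 2 for l, h in zip(lo, hi)]
        c = probe(lo, mid)   # constraints: f_j^U <= F_j <= (f_j^U + f_j^N) / 2
        if c is None:
            continue         # no Pareto point in the probed sub-space
        fm = list(objectives(c))
        frontier.append(tuple(fm))
        # The two surviving sub-hyperrectangles in 2D (cf. Figure 2(a)).
        for nlo, nhi in (([lo[0], fm[1]], [fm[0], hi[1]]),
                         ([fm[0], lo[1]], [hi[0], fm[1]])):
            heapq.heappush(pq, (-volume(nlo, nhi), nlo, nhi))
    return frontier

print(sorted(pf_s()))
```

Each iteration probes the largest remaining uncertain region first, mirroring the volume-ordered priority queue of Algorithm 1.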
We next consider an important subroutine, optimize(), that solves each constrained optimization problem (line 13 of Algorithm 1). Recall that our objective functions are given by learned models, $\Psi_i(x)$, $i = 1 \ldots k$, where each model is likely to be non-linear and some variables among $x$ can be integers. Even when restricted to a single objective, this problem reduces to mixed-integer nonlinear programming (MINLP) and is NP-hard [5, 17]. There is no single general MINLP solver that works effectively for every nonlinear programming problem [33]. For example, many MINLP solvers [34] fail to run because they assume certain properties of the objective function $F$, e.g., that it is twice continuously differentiable (Bonmin [4]) or factorable into the sum-product of univariate functions (Couenne [6]), which do not hold for learned models such as Deep Neural Networks (DNNs). The most general MINLP solver, Knitro [22], runs on our learned models but very slowly, e.g., 42 minutes to solve a single-objective optimization problem when the learned model is a DNN, or 17 minutes when the model is a Gaussian Process (GP). Such a solution is too slow for us even for single-objective optimization, let alone the extension to multiple objectives. In this work, we propose a novel solver that employs a customized gradient descent approach to approximately solve our constrained optimization (CO) problems involving multiple objective functions. This CO problem is illustrated in Figure 3(a). In the first step, we transform the variables to prepare for optimization, following common practice in machine learning. Let $x$ be the original set of parameters, which can be categorical, integer, or continuous variables. If a variable is categorical, we use one-hot encoding to create dummy variables. For example, if $x_d$ takes values $\{a, b, c\}$, we create three boolean variables, $x_d^a$, $x_d^b$, and $x_d^c$, among which exactly one takes the value '1'.
Afterwards, all the variables are normalized to the range [0, 1], and boolean variables and (normalized) integer variables are relaxed to continuous variables in [0, 1]. As such, the CO problem deals only with continuous variables in [0, 1], which we denote as $x = (x_1, \ldots, x_D) \in [0, 1]^D$. After a solution is returned for the CO problem, we set the value of a categorical attribute based on the dummy variable with the highest value, and round the solution returned for a normalized integer variable to its closest integer. Our work also employs an optimization to support categorical variables using parallel processing, which we defer to the next subsection. Next, we focus on the CO problem itself. Our design of a Multi-Objective Gradient Descent (MOGD) solver uses carefully-crafted loss functions to guide gradient descent to find the minimum of a target objective while satisfying a variety of constraints, where both the target objective and the constraints can be specified over complex models, e.g., using DNNs, GPs, or other regression functions. 1.
Single objective optimization. As a base case, we consider single-objective optimization: minimize $F_1(x) = \Psi_1(x)$. For optimization, we set the loss function simply as $L_1(x) = F_1(x)$. Then, starting from an initial configuration $x^0$, gradient descent (GD) iteratively adjusts the configuration through a sequence $x^1, \ldots, x^n$ to minimize the loss, until it converges to a local minimum or reaches a maximum number of steps; we compute the gradient of the loss using the adaptive moment estimation (Adam) approach. To increase the chance of hitting a global minimum, we use a multi-start method that runs gradient descent from multiple initial points and finally chooses the $x^*$ that gives the smallest value among these trials. Finally, to cope with the constraint on each variable, $0 \le x_d \le 1$, we restrict the GD process such that when it tries to walk across a boundary of $x_d$, we simply set $x_d$ to the boundary value. In future iterations, GD may not be able to reduce the loss by pushing $x_d$ outside the boundaries, but it can still adjust the other variables until reaching the stopping criteria. 2. Constrained optimization. Next we consider a constrained optimization (CO) problem, as shown in Figure 3(a). Without loss of generality, we treat $F_1$ as the target objective and $F_j \in [F_j^L, F_j^U]$, $j = 1, \ldots, k$, as constraints. Recall that these constraints $[F_j^L, F_j^U]$ are set by the CO problems constructed in our Progressive Frontier approach.
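The base case above (loss $L_1(x) = F_1(x)$, multi-start, boundary clipping) can be sketched as follows. Everything here is our toy illustration: a hand-coded quadratic and its gradient stand in for a learned model $\Psi_1$ and its autodiff gradient, and we use a fixed step size rather than Adam.

```python
import random

# Sketch of the single-objective base case: multi-start gradient descent over
# [0,1]^D. The quadratic loss below is an illustrative stand-in for Psi_1.

def loss(x):
    return (x[0] - 0.7) ** 2 + (x[1] - 0.2) ** 2

def grad(x):
    return [2 * (x[0] - 0.7), 2 * (x[1] - 0.2)]

def descend(x, lr=0.1, steps=200):
    for _ in range(steps):
        g = grad(x)
        # If a step would walk across a boundary of x_d, snap x_d to the boundary.
        x = [min(1.0, max(0.0, xi - lr * gi)) for xi, gi in zip(x, g)]
    return x

def multi_start(n_starts=8, seed=0):
    rng = random.Random(seed)
    starts = [[rng.random(), rng.random()] for _ in range(n_starts)]
    return min((descend(s) for s in starts), key=loss)

x_best = multi_start()
print([round(v, 3) for v in x_best])  # -> [0.7, 0.2]
```

The multi-start loop mitigates the risk of stopping in a poor local minimum when the learned model is non-convex, at the cost of running GD several times.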
Single objective optimization . We begin our discussionwith a single objective optimization, minimize F ( x ) = ( x ),which is Part [1] in Figure X(a). To enable gradient descent,we set the loss function simply as, L = F ( x ). Let x n de-note the configuration computed after iteration n . Initially n = 0 and x is the default configuration. We can usethe model ( x n ) to compute the predicted value of theobjective F ( x n ) under the current configuration x n . Wethen use the loss L to estimate how well the configuration x n optimizes (e.g., minimizes) the value of the objective.We then compute the gradient of the loss function as r n x and use it to adjust the configuration for the next iterationso that it minimizes L . That is, we iteratively choose the( n + 1) th configuration as x n +1 = x n ↵ r n x . Currently, weuse the adaptive moment estimation SGD approach from[18] to compute the gradient of the loss function r n x . Theabove process repeats iteratively until we find the optimalconfiguration x ⇤ that minimizes loss and yields the optimalvalue of the target objective F ( x ⇤ ). [ This is obvious. ] Constrained optimization . Then consider a constrainedoptimization (CO) problem, where F as the target of op-timization and constraints are F i [ F Lj , F Uj ] , i = 1 , . . . , k ,which include Parts [1] and [2] of Figure X(a). We can use8current hyperrectangle, which is formed by U i and N i andwith largest volume among all the existing hyperrectangles.Divide the current hyperrectangle into 2 k sub-hyperrectangles,discard two of them, and calculate the volume of the others.Put them in to the priority queue. Terminate: when the desired number of probes is reached.
Filter:
Check the result set, and remove any point dom-inated by another one in the result set.
Iterate: Pop from the priority queue the current hyperrectangle, which is formed by U_i and N_i and has the largest volume among all existing hyperrectangles. Divide the current hyperrectangle into 2^k sub-hyperrectangles, discard two of them, calculate the volume of the others, and put them into the priority queue. Terminate: when the desired number of probes is reached. Filter: Check the result set, and remove any point dominated by another one in the result set.

Algorithm 1 Progressive Frontier-Sequential (PF-S)
Require: k lower bounds (LOWER): lower_j; k upper bounds (UPPER): upper_j; number of points: M
1:  PQ ← ∅  {PQ is a priority queue sorted by hyperrectangle volume}
2:  plan_i ← optimize_i(LOWER, UPPER)  {the single-objective optimizer takes LOWER and UPPER as constraints and optimizes the i-th objective, for each i = 1, ..., k}
3:  Utopia, Nadir ← computeBounds(plan_1, ..., plan_k)
4:  volume ← computeVolume(Utopia, Nadir)
5:  seg ← (Utopia, Nadir, volume)
6:  PQ.put(seg)
7:  count ← k
8:  repeat
9:    seg ← PQ.pop()
10:   Utopia ← seg.Utopia
11:   Nadir ← seg.Nadir
12:   Middle ← (Utopia + Nadir) / 2
13:   Middle_i ← optimize_i(Utopia, Middle)  {constrained optimization on the i-th objective}
14:   {plan} ← {plan} ∪ Middle_i
15:   count ← count + 1
16:   {rectangle} ← generateSubRectangles(Utopia, Middle, Nadir)  {returns 2^k − 2 sub-hyperrectangles}
17:   for each rectangle in {rectangle} do
18:     Utopia ← rectangle.Utopia
19:     Nadir ← rectangle.Nadir
20:     volume ← computeVolume(Utopia, Nadir)
21:     seg ← (Utopia, Nadir, volume)
22:     PQ.put(seg)
23:   end for
24: until count > M
25: output filter({plan})  {remove any plan dominated by another plan in the same set}

We next consider an important subroutine, optimize(), that solves each constrained optimization problem (line 13 of Algorithm 1). Recall that our objective functions are given by learned models, Ψ_i(x), i = 1, ..., k, where each model is likely to be non-linear and some variables among x can be integers. Even restricted to a single objective, this problem is known as a mixed-integer nonlinear programming (MINLP) problem and is NP-hard [6, 15]. There is no single general MINLP solver that works effectively for every nonlinear programming problem [30]. For example, many MINLP solvers [31] fail to run because they assume certain properties of the objective function F, e.g., being twice continuously differentiable (Bonmin [5]) or factorable into the sum-product of univariate functions (Couenne [7]), which do not suit our learned models, e.g., those represented as Deep Neural Networks (DNNs). The most general MINLP solver, Knitro [?], runs for our learned models but very slowly, e.g., 42 minutes for solving a single-objective optimization problem when the learned model is a DNN, or 17 minutes when the model is a Gaussian Process (GP). Such a solution is too slow for us to use even for single-objective optimization, let alone the extension to multiple objectives.
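The probe-and-subdivide loop of Algorithm 1 can be sketched as follows. This is a minimal illustration, not the paper's implementation: `optimize` is a stand-in for the constrained single-objective optimizer, returning one objective-space point inside the given bounds or None if the sub-region is infeasible.

```python
import heapq
import itertools

def pf_sequential(optimize, utopia, nadir, M, k=2):
    """Sketch of the PF-S probe-and-subdivide loop (Algorithm 1)."""
    def volume(u, n):
        v = 1.0
        for d in range(k):
            v *= n[d] - u[d]
        return v

    plans = []
    pq = []  # max-heap on hyperrectangle volume (heapq is a min-heap, so negate)
    heapq.heappush(pq, (-volume(utopia, nadir), tuple(utopia), tuple(nadir)))
    count = 0
    while pq and count < M:
        _, u, n = heapq.heappop(pq)           # largest-volume hyperrectangle
        mid = tuple((u[d] + n[d]) / 2 for d in range(k))
        point = optimize(u, mid)              # probe: constrained optimization
        if point is not None:
            plans.append(point)
        count += 1
        # Divide into 2^k sub-hyperrectangles; discard the all-lower and
        # all-upper corners, which cannot contain new Pareto points.
        for choice in itertools.product((0, 1), repeat=k):
            if all(c == 0 for c in choice) or all(c == 1 for c in choice):
                continue
            su = tuple(u[d] if choice[d] == 0 else mid[d] for d in range(k))
            sn = tuple(mid[d] if choice[d] == 0 else n[d] for d in range(k))
            heapq.heappush(pq, (-volume(su, sn), su, sn))
    # Filter: remove any point dominated by another point in the result set.
    def dominated(p, q):
        return q != p and all(q[d] <= p[d] for d in range(k))
    return [p for p in plans if not any(dominated(p, q) for q in plans)]
```

With a stub optimizer whose feasible points lie on a known Pareto curve, the returned set stays on that curve and contains no dominated points.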
In this work, we propose a novel solver that employs a customized gradient descent approach to approximately solve our constrained optimization problems involving multiple objective functions. The problem is illustrated in Figure ??.

x* = argmin_x F_1(x)                      [1]
subject to  F_1^L ≤ F_1(x) ≤ F_1^U        [2]
            ...
            F_k^L ≤ F_k(x) ≤ F_k^U
            0 ≤ x_d ≤ 1,  d = 1, ..., D   [3]

In the first step, we transform variables for optimization by following the common practice in machine learning: Let x be the original set of parameters, which can be categorical, integer, or continuous variables.
If a variable is categorical, we use one-hot encoding to create dummy variables. For example, if x_d takes values {a, b, c}, we create three boolean variables, x_d^a, x_d^b, and x_d^c, among which only one takes the value '1'. Afterwards all the variables are normalized to the range [0, 1], and boolean variables and (normalized) integer variables are relaxed to continuous variables in [0, 1]. As such, the constrained optimization (CO) problem deals with continuous variables in [0, 1], which we denote as x = (x_1, ..., x_D) ∈ [0, 1]^D. After a solution is returned for the CO problem, we set the value of a categorical attribute based on the dummy variable with the highest value, and round the solution for a normalized integer variable to its closest integer.

Next, we focus on the CO problem depicted in Figure X(a). Our design of a Multi-Objective Gradient Descent (MOGD) solver uses carefully-crafted loss functions to guide gradient descent to find the minimum of a target objective while satisfying a variety of constraints, where both the target objective and the constraints can be specified over complex models, e.g., using DNNs, GPs, or other regression functions.
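This variable transformation and its inverse can be sketched as an encode/decode pair. The parameter space here is hypothetical (a categorical knob with values {a, b, c}, an integer knob such as a core count in [1, 8], and one continuous knob); the real system has many more parameters.

```python
import numpy as np

# Hypothetical parameter space for illustration only.
CATEGORIES = ["a", "b", "c"]   # values of one categorical variable
INT_RANGE = (1, 8)             # e.g., number of cores

def encode(cat, k_int, c):
    """Map raw parameters to continuous variables in [0, 1]^D:
    one-hot the categorical, min-max normalize the integer."""
    onehot = [1.0 if cat == v else 0.0 for v in CATEGORIES]
    lo, hi = INT_RANGE
    return np.array(onehot + [(k_int - lo) / (hi - lo), c])

def decode(x):
    """Invert the relaxation after the CO problem is solved: pick the
    categorical value with the highest dummy variable, and round the
    normalized integer to its closest integer."""
    cat = CATEGORIES[int(np.argmax(x[:3]))]
    lo, hi = INT_RANGE
    k_int = int(round(x[3] * (hi - lo) + lo))
    return cat, k_int, float(x[4])
```

A round trip through encode/decode recovers the original raw configuration.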
Single objective optimization. As a base case, we consider single-objective optimization, minimize $F(\mathbf{x}) = F_1(\mathbf{x})$, which is Part [1] in Figure 3(a). For optimization, we set the loss function simply as $L(\mathbf{x}) = F_1(\mathbf{x})$. Then, starting from an initial configuration $\mathbf{x}^0$, gradient descent iteratively adjusts the configuration through a sequence $\mathbf{x}^1, \ldots, \mathbf{x}^n$ in order to minimize the loss, until it reaches a local minimum or the maximum number of steps allowed. To increase the chance of hitting a global minimum, we use a standard multi-start method that runs gradient descent from multiple initial values of x and finally chooses the $\mathbf{x}^*$ that gives the smallest objective value among these trials.

Let $\mathbf{x}^n$ denote the configuration computed after iteration n. Initially n = 0 and $\mathbf{x}^0$ is the default configuration. We use the learned model to compute the predicted value of the objective $F_1(\mathbf{x}^n)$ under the current configuration $\mathbf{x}^n$, and use the loss L to estimate how well $\mathbf{x}^n$ optimizes (e.g., minimizes) the value of the objective. We then compute the gradient of the loss function, $\nabla_{\mathbf{x}} L$, and use it to adjust the configuration for the next iteration so as to reduce L. That is, we iteratively choose the (n+1)-th configuration as $\mathbf{x}^{n+1} = \mathbf{x}^n - \alpha \nabla_{\mathbf{x}} L$, where $\alpha$ is the learning rate. Currently, we use the adaptive moment estimation (Adam) SGD approach from [17] to perform these gradient updates. The above process repeats until we find the configuration $\mathbf{x}^*$ that minimizes the loss and yields the optimal value of the target objective $F(\mathbf{x}^*)$.

Constrained optimization. Now consider a constrained optimization (CO) problem, where $F_1$ is the target of optimization and the constraints are $F_j \in [F_j^L, F_j^U]$, $j = 1, \ldots, k$, which together form Parts [1] and [2] of Figure 3(a).
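The single-objective base case above (multi-start gradient descent on the loss L = F) can be sketched as follows. This is an illustration under stated assumptions: F1 below is a toy non-convex objective, and plain fixed-step gradient descent with a finite-difference gradient stands in for the Adam optimizer used in the paper.

```python
# Minimal sketch of multi-start gradient descent for the single-objective
# base case. F1 is an assumed toy objective; fixed-step descent with a
# finite-difference gradient replaces Adam for simplicity.
import math

def grad_descent(F, x0, lr=0.01, steps=300, eps=1e-6):
    """Iterate x_{n+1} = x_n - lr * grad L(x_n) with L = F, clipped to [0, 1]."""
    x = x0
    for _ in range(steps):
        g = (F(x + eps) - F(x - eps)) / (2 * eps)  # numerical gradient of L
        x = min(1.0, max(0.0, x - lr * g))
    return x

def multi_start_minimize(F, starts):
    """Multi-start method: run gradient descent from several initial
    configurations and keep the one with the smallest objective value."""
    return min((grad_descent(F, x0) for x0 in starts), key=F)

# Toy objective with two local minima in [0, 1]; the global one is near 0.246.
F1 = lambda x: 0.5 * (1 + math.cos(4 * math.pi * x)) + 0.3 * x
x_star = multi_start_minimize(F1, starts=[0.1, 0.5, 0.9])
```

A single start from 0.9 would get stuck in the local minimum near 0.746; trying several starting points and keeping the best final value is what gives multi-start its robustness against local minima.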
(a) Constrained optimization with multiple objectives (e.g., F_1 is modeled by a DNN and F_2 by a GP); (b) loss term on obj1, φ = F̂_1; (c) loss term on obj2, φ = F̂_2; (d) L(x) on a univariate input x; (e) L(x_1, x_2) on bivariate inputs x_1 and x_2; (f) a GP function on x with expected values (the blue line) and model uncertainty (the pink region). Figure 3:
Constrained optimization with multiple objectives, and the loss function, L, on different objectives and input parameters x.

The constraint bounds, $F_j^L$ and $F_j^U$, come from our PF algorithm, which at each step aims to explore a specific region (a hyper-rectangle formed by these constraints) in the objective space by solving a particular CO problem.

To solve the CO problem, we need to design an appropriate loss function, $L(\mathbf{x})$, such that by minimizing this loss, we can minimize $F_1(\mathbf{x})$ while at the same time satisfying all the constraints. Our proposed loss function is:

$$L(\mathbf{x}) = \mathbb{1}\{0 \le \hat{F}_1(\mathbf{x}) \le 1\} \cdot \hat{F}_1(\mathbf{x})^2 + \sum_{j=1}^{k} \mathbb{1}\{\hat{F}_j(\mathbf{x}) > 1 \lor \hat{F}_j(\mathbf{x}) < 0\} \left[ \left( \hat{F}_j(\mathbf{x}) - \frac{1}{2} \right)^2 + P \right] \hspace{2em} (4)$$

where $\hat{F}_j(\mathbf{x}) = \frac{F_j(\mathbf{x}) - F_j^L}{F_j^U - F_j^L}$, for $j \in [1, k]$, denotes the normalized value of each objective, and P is a constant for extra penalty. Since the range of each objective function $F_j(\mathbf{x})$ varies, we first normalize each objective according to its upper and lower bounds, so that $\hat{F}_j^L = 0$, $\hat{F}_j^U = 1$, and a valid objective value satisfies $\hat{F}_j(\mathbf{x}) \in [0, 1]$.

Figures 3(b)-3(c) illustrate the breakdown of the terms of the above loss when we have a single target objective $F_1$ and constraints on objectives $F_1$ and $F_2$. The loss for $F_1$ has two terms. When $F_1$ falls in its valid region ($0 \le \hat{F}_1(\mathbf{x}) \le 1$), the first term of the loss penalizes a large value of $F_1$, and hence minimizing the loss helps reduce $F_1$. The second term aims to push the objective into its constraint region: if $F_j(\mathbf{x})$ cannot satisfy the constraints ($\hat{F}_j(\mathbf{x}) > 1 \lor \hat{F}_j(\mathbf{x}) < 0$), it contributes a loss according to its distance from the valid region. The extra penalty, P, is a large constant added to ensure that the loss for $F_1$ outside its valid region is much higher than the loss when $F_1$ lies inside it. Figure 3(b) shows the combined effect of these two terms on $F_1$. In comparison, the loss for $F_2$ has only the second term, i.e., the term pushing it to satisfy its constraint, as shown in Figure 3(c). The final loss combines the terms for both $F_1$ and $F_2$.

Figure 3(d) illustrates the loss function, L, over a univariate input x, assuming two specific models for $F_1$ and $F_2$, each a simple neural network with ReLU as the activation function, while Figure 3(e) shows the loss over a bivariate input $(x_1, x_2)$. Such a loss guides gradient descent so that, by minimizing L, it is likely to find the values of the input variables that minimize the target objective while satisfying the constraints. A final note is that GD usually assumes the loss function to be differentiable, but our loss function is not differentiable at certain points.
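The two-part loss of Eq. 4 can be sketched as follows. This is an illustration: the exact magnitude of the penalty constant P and the concrete bound values are assumptions made for the example.

```python
# Sketch of the MOGD loss in Eq. 4: a value term for the target objective
# inside its valid region, plus a distance-based term with an extra penalty P
# for any objective that violates its constraints.

def mogd_loss(F_vals, bounds, target=0, P=100.0):
    """F_vals: raw objective values F_j(x); bounds: (F_L_j, F_U_j) per objective;
    target: index of the objective being minimized."""
    # Normalize each objective so that a valid value lies in [0, 1].
    F_hat = [(f - lo) / (hi - lo) for f, (lo, hi) in zip(F_vals, bounds)]
    loss = 0.0
    for j, fh in enumerate(F_hat):
        if fh < 0.0 or fh > 1.0:
            # Constraint violated: distance-based term plus the extra penalty P.
            loss += (fh - 0.5) ** 2 + P
        elif j == target:
            # Target objective inside its valid region: penalize its value.
            loss += fh ** 2
    return loss

# Both objectives bounded in [0, 10]; the second case violates an upper bound.
in_range = mogd_loss([5.0, 8.0], [(0.0, 10.0), (0.0, 10.0)])
violated = mogd_loss([5.0, 12.0], [(0.0, 10.0), (0.0, 10.0)])
```

Because P dominates the value term, any constraint-violating configuration incurs a far larger loss than any feasible one, which is what steers gradient descent back into the valid region before it starts trading off the target objective.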
However, we can use the sub-derivative: at a point where the loss is not differentiable, we can choose a value between its left derivative and right derivative, which is called a subgradient. Machine learning libraries allow subgradients to be defined by the user program and can automatically handle common cases, including our loss functions for DNNs and GPs.

Handling model uncertainty. We have several extensions of our MOGD solver. In the interest of space, we highlight the most important one: support for model uncertainty. Since our objective functions use learned models, these models may not be accurate before sufficient training data is available. Hence, it is desirable that the optimization procedure take model uncertainty into account when recommending an optimal solution. Fortunately, advanced machine learning techniques can support a regression task, $F(\mathbf{x})$, with both an expected value, $E[F(\mathbf{x})]$, and a variance, $\mathrm{Var}[F(\mathbf{x})]$. Such techniques include Gaussian Processes [36], with an example shown in Figure 3(f), and Bayesian approximation for DNNs [15]. Given such information, we only need to replace each objective function, $F_j(\mathbf{x})$, with $\tilde{F}_j(\mathbf{x}) = E[F_j(\mathbf{x})] + \alpha \cdot \mathrm{std}[F_j(\mathbf{x})]$, where $\alpha$ is a small positive constant. As such, $\tilde{F}_j(\mathbf{x})$ offers a more conservative estimate of $F_j(\mathbf{x})$ for solving a minimization problem, given the model uncertainty. We then use the $\tilde{F}_j(\mathbf{x})$'s to build the loss function as in Eq. 4 and solve the CO problem.

Finally, we note three sources of approximation in our MOGD solver: (1) the relaxation of categorical and integer variables to continuous variables in [0, 1]; (2) the use of gradient descent to solve a non-convex optimization problem, whose solutions are likely to be local minima; and (3) the model uncertainty. As our benchmark results show, models learned by OtterTune [43] can have prediction errors of 10% or more compared to observed values.
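The uncertainty-adjusted objective $\tilde{F}(\mathbf{x}) = E[F(\mathbf{x})] + \alpha \cdot \mathrm{std}[F(\mathbf{x})]$ described above can be sketched as follows; the value of $\alpha$ and the example mean/std predictions are illustrative assumptions.

```python
# Sketch of the conservative, uncertainty-adjusted objective used when the
# learned model reports both a mean prediction and its standard deviation.

def conservative_value(mean, std, alpha=0.1):
    """Pessimistic estimate of a minimized objective under model uncertainty."""
    return mean + alpha * std

# Candidate B has a slightly lower mean but is far less certain, so the
# conservative estimate prefers candidate A.
a = conservative_value(mean=10.0, std=1.0)
b = conservative_value(mean=9.9, std=5.0)
best = min(a, b)
```

The adjustment only shifts each objective by a small multiple of its predictive standard deviation, so confident predictions are barely changed while highly uncertain ones are discounted.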
Thus, model uncertainty has a major impact on the performance achieved by the solution from the MOGD solver. This indicates that to build a practical system, it is crucial that the model be updated frequently from new training data and that our multi-objective optimizer refresh the Pareto frontier based on the new model in order to make more accurate recommendations; for this reason, we treat the speed of computing the Pareto frontier as a key performance goal.

Approximate Sequential Algorithm (PF-AS): When we implement single-objective optimization (Line 2 of Algorithm 1) and constrained optimization (Line 13) using our MOGD solver, we obtain a new algorithm called PF-Approximate Sequential. Note that this leads to an approximate Pareto set for the MOO problem because each solution of a CO problem can be suboptimal. In fact, the most powerful commercial solver, Knitro [22], also returns approximate solutions due to the complex, non-convex properties of our objective functions, despite its long running time.

Approximate Parallel Algorithm (PF-AP): We finally present a parallel version of the approximate algorithm, PF-Approximate Parallel (PF-AP). The main difference from the approximate sequential algorithm is that, given any hyperrectangle we aim to explore at the moment, we partition it into an $l^k$ grid and, for each grid cell, construct a CO problem using the Middle Point Probe (Eq. 3). We send these $l^k$ CO problems to our MOGD solver simultaneously. Internally, the solver solves these problems in parallel (using a multi-threaded implementation). Some of the cells will not return any Pareto point and hence are omitted. For each cell that returns a Pareto point, the Pareto point breaks the cell into a set of sub-hyperrectangles that need to be further explored, which are added to a priority queue as before. Afterwards, the sub-hyperrectangle with the largest volume is removed from the queue. To explore it, we further partition it into $l^k$ cells and ask the solver to solve their corresponding CO problems simultaneously. This process terminates when the queue becomes empty.

Parallel Models.
Our work also supports categorical attributes using parallel processing. As OtterTune [43] suggests, we can perform feature selection to focus on a small set (∼10) of variables that impact a specific model the most. If, among the selected variables, there are only a limited number of categorical values to consider, e.g., a boolean variable indicating whether to compress intermediate data, we can build value-specific models, one for each categorical value, and compute them using multiple threads simultaneously.
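As an illustration, the PF-AP exploration loop described above (grid partitioning plus a volume-ordered priority queue) can be sketched for two objectives. Here `solve_co` is a placeholder for the MO-GD constrained-optimization call on a cell's Middle Point Probe; the toy frontier, grid size, and iteration cap are assumptions for illustration, not our exact implementation:

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def volume(rect):
    (l1, l2), (u1, u2) = rect
    return (u1 - l1) * (u2 - l2)

def split_cells(rect, l):
    """Partition a 2-D rectangle in objective space into an l x l grid."""
    (lo1, lo2), (up1, up2) = rect
    s1, s2 = (up1 - lo1) / l, (up2 - lo2) / l
    return [((lo1 + i * s1, lo2 + j * s2),
             (lo1 + (i + 1) * s1, lo2 + (j + 1) * s2))
            for i in range(l) for j in range(l)]

def pf_ap(initial_rect, solve_co, l=2, max_iters=10):
    """Approximate parallel Progressive Frontier (2-objective sketch).

    solve_co(cell) returns a Pareto point inside the cell, or None.
    """
    pareto, queue = [], []
    heapq.heappush(queue, (-volume(initial_rect), initial_rect))
    with ThreadPoolExecutor() as pool:
        for _ in range(max_iters):
            if not queue:
                break
            _, rect = heapq.heappop(queue)   # largest-volume rectangle first
            cells = split_cells(rect, l)
            # Solve the l*l constrained-optimization probes simultaneously.
            for cell, point in zip(cells, pool.map(solve_co, cells)):
                if point is None:
                    continue                  # cell holds no Pareto point
                pareto.append(point)
                (lo1, lo2), (up1, up2) = cell
                f1, f2 = point
                # The Pareto point splits the cell into the two sub-rectangles
                # that may still contain undominated points (minimization).
                for sub in (((lo1, f2), (f1, up2)), ((f1, lo2), (up1, f2))):
                    if volume(sub) > 0:
                        heapq.heappush(queue, (-volume(sub), sub))
    return pareto

# Toy frontier f1 + f2 = 1: the probe returns the frontier point at the
# cell's middle f1 value if it falls inside the cell.
def solve_co(cell):
    (lo1, lo2), (up1, up2) = cell
    f1 = (lo1 + up1) / 2
    f2 = 1.0 - f1
    return (f1, f2) if lo2 <= f2 <= up2 else None

points = pf_ap(((0.0, 0.0), (1.0, 1.0)), solve_co, l=2, max_iters=3)
assert all(abs(f1 + f2 - 1.0) < 1e-9 for f1, f2 in points)
```

The real algorithm works with k objectives, uses MO-GD to solve each probe in the configuration space, and runs until the uncertain space is exhausted rather than for a fixed iteration budget.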
5. AUTOMATIC SOLUTION SELECTION
Once a Pareto set is computed for a workload, our optimizer employs a range of strategies to recommend a new configuration from the set. We highlight the most effective strategies below and describe other recommendation strategies in our technical report [38].

First, the Utopia Nearest (UN) strategy chooses the Pareto point closest to the Utopia point f^U, by computing the Euclidean distance of each point in the Pareto set F̃ to f^U and returning the point that minimizes the distance.

A variant is the Weighted Utopia Nearest (WUN) strategy, which uses a weight vector, w = (w_1, ..., w_k), to capture the relative importance of the different objectives; it is usually set based on the application preference. A further improvement is workload-aware WUN, motivated by our observation that expert knowledge about different objectives is available from the literature. For example, between latency and cost, it is known from the literature on Parallel Databases [10] that it is beneficial to allocate more resources to large queries (e.g., join queries) but less so for small queries (e.g., selection queries). In this case, we use historical data to divide workloads into three categories (low, medium, high) based on the observed latency under the default configuration. For long-running workloads, we give more weight to latency than to cost, hence encouraging more cores to be allocated; for short-running workloads, we give more weight to cost, limiting the cores to be used. We encode such expert knowledge using internal weights, w^I = (w^I_1, ..., w^I_k), and application preference using external weights, w^E = (w^E_1, ..., w^E_k). The final weights are w = (w^I_1 w^E_1, ..., w^I_k w^E_k).
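The WUN selection above can be sketched in a few lines. The weighted-Euclidean form of the distance and the toy Pareto set are illustrative assumptions (our system also normalizes objectives before measuring distance):

```python
import math

def weighted_utopia_nearest(pareto, utopia, internal_w, external_w):
    """Pick the Pareto point closest to the Utopia point under combined weights.

    Final weights are the element-wise product of expert (internal) and
    application (external) weights, as described above.
    """
    w = [wi * we for wi, we in zip(internal_w, external_w)]

    def dist(point):
        return math.sqrt(sum((wk * (fk - uk)) ** 2
                             for wk, fk, uk in zip(w, point, utopia)))

    return min(pareto, key=dist)

# Toy 2-D Pareto set over (latency, cost); the Utopia point takes the best
# value of each objective across the set.
pareto = [(1.0, 10.0), (2.0, 6.0), (5.0, 4.0)]
utopia = (1.0, 4.0)
# A strong external preference for latency selects the low-latency point.
assert weighted_utopia_nearest(pareto, utopia, (1.0, 1.0), (0.9, 0.1)) == (1.0, 10.0)
```

Flipping the external weights to (0.1, 0.9) shifts the recommendation to the low-cost point (5.0, 4.0), which is how the same Pareto set adapts to different application preferences.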
6. PERFORMANCE EVALUATION
In this section, we compare our MOO approach to popular MOO techniques and perform an end-to-end comparison to the state-of-the-art OtterTune [43] system.
System.
We chose Apache Spark as the data analytics system because it allows us to run analytics including both SQL queries and ML tasks. Within our dataflow optimizer, the modeling engine employs two modeling tools: GP models from OtterTune [43] and our custom deep neural network (DNN) models. Since DNN models are more complex than GP models, they increase the computational cost of MOO. Therefore, we use DNN models as the default when we evaluate the efficiency of MOO methods, and switch to GP models when we conduct the end-to-end comparison to OtterTune. Our modeling engine uses a mix of PyTorch, Keras, and Tensorflow to implement and train different models. The trained models interface with our MOO module through network sockets. Our MOO module is implemented in Java and invokes a solver for constrained optimization (CO). Our system supports several solvers, including our MO-GD solver.

Workloads.
We used two benchmarks for evaluation.
Batch Workloads (TPCx-BB): We used the TPCx-BB (BigBench) benchmark [40], designed to model mixed workloads in big data analytics, with a scale factor of 100G. TPCx-BB includes 30 templates: 14 SQL queries, 11 SQL with UDFs, and 5 ML tasks, which we modified to run on Spark. We parameterized the 30 templates to create 258 workloads and ran them under different configurations, totaling 19528 traces, each with 360 runtime metrics (OtterTune [43] suggests similar numbers of traces for training). These traces were then used to train workload-specific models for latency, CPU utilization, IO cost, etc. After hyperparameter tuning, our latency model has 4 hidden layers, each with 128 nodes, and uses ReLU as the activation function. Adaptive moment estimation (Adam) [20] was used to run backpropagation with learning rate = 0.1, weight decay = 0.1, max iterations = 100, and early-stop patience = 20. OtterTune [43] suggests using feature selection to focus on the most important (∼10) parameters for tuning; hence, we ran MOO over the 12 most important parameters of Spark, including parallelism, the number of executors, the number of cores per executor, memory per executor, shuffle compress, etc.
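The stated hyperparameters map directly onto a standard MLP regressor. The sketch below uses scikit-learn's `MLPRegressor` as an illustrative stand-in for our custom DNN implementation (the synthetic data and the use of sklearn's L2 penalty `alpha` in place of Adam weight decay are assumptions):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# 4 hidden layers of 128 ReLU units; Adam with learning rate 0.1,
# L2 penalty 0.1 (standing in for weight decay), at most 100 iterations,
# and an early-stop patience of 20 iterations.
latency_model = MLPRegressor(
    hidden_layer_sizes=(128, 128, 128, 128),
    activation="relu",
    solver="adam",
    learning_rate_init=0.1,
    alpha=0.1,
    max_iter=100,
    early_stopping=True,
    n_iter_no_change=20,
    random_state=0,
)

# Synthetic traces: 12 Spark parameters -> latency.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 12))
y = X @ rng.uniform(0.5, 2.0, size=12)
latency_model.fit(X, y)
assert latency_model.predict(X[:5]).shape == (5,)
```

In the actual system such a workload-specific model is trained on the collected runtime traces and then queried by the MOO module over the network.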
Streaming Workloads: We also created a streaming benchmark by extending a prior study [23] on click stream analysis, comprising 5 SQL templates with UDFs and 1 ML template. We then created 63 workloads from these templates using parameterization, and collected traces for training workload models for latency and throughput.
Hardware. Our system was deployed on a cluster with one gateway and 20 compute nodes. The compute nodes are CentOS-based with two Intel Xeon Gold 6130 processors (16 cores each), 768GB of memory, and RAID disks. Each MOO algorithm runs on a dedicated compute node, potentially with multi-threading.

Figure 4: Comparative results on multi-objective optimization using 258 batch workloads. (a) Uncertain space (job 9, 2D); (b) Frontier of WS and NC (job 9, 2D); (c) Frontier of PF (job 9, 2D); (d) Uncertain space (job 9, 2D); (e) Frontier of Evo (job 9, 2D); (f) Uncertain space of all 258 jobs.
We first compare our Progressive Frontier (PF) algorithms, PF-AS and PF-AP, to three major MOO methods: Weighted Sum (WS) [30], Normalized Constraints (NC) [32], and NSGA-II [9] from JMetal [18] in the family of Evolutionary (Evo) methods [13]. (As [13] points out, more recent Evo methods concern many-objective optimization and preference modeling, and are hence orthogonal to our current work.) For each algorithm, we request it to generate increasingly more Pareto points (10, 20, 30, 40, 50, 100, 150, 200), called probes, as more computing time is invested.

Expt 1: Batch 2D. We start with the batch workloads, where the objectives are latency and cost (simulated by the number of cores used). As results across different jobs are consistent, we first show details using job 9. To compare PF-AS and PF-AP to WS and NC, Fig. 4(a) shows the uncertain space as more Pareto points are requested over time. Here, the uncertain space is the percentage of the total objective space that the algorithm is uncertain about (Def. 3.7). Initially at 100%, it starts to shrink when the first Pareto set (of up to 10 points) is generated. A main observation is that WS and NC take a long time to run, e.g., about 47 seconds to generate the first Pareto set. Such delay means that they are not suitable for recommending job configurations under stringent time constraints. In comparison, our PF approximate sequential (PF-AS) and parallel (PF-AP) algorithms reduce the uncertain space much more quickly, e.g., with the first Pareto set generated under 1 second by PF-AP. PF-AS does not work as well as PF-AP because PF-AS is a sequential algorithm, and hence the quality of Pareto points found in the early stage has a severe impact on the later procedure. Given that our algorithm is approximate, one low-quality result in the early stage can lead to an overall low-quality Pareto frontier. In contrast, PF-AP is a parallel algorithm, and hence one low-quality result does not have as much impact as in PF-AS.

Fig. 4(b) shows the Pareto frontiers of WS and NC generated after 47 seconds, where the Utopia (hypothetically optimal) point is at the lower left corner. WS is shown to have poor coverage of the Pareto frontier, e.g., returning only 3 points although 10 were requested. NC generates more points (8) on the Pareto frontier. However, it still provides fewer points and less information than the Pareto frontier of PF-AP (12 points) shown in Fig. 4(c), constructed in only 3.2 seconds.
These frontiers also show that latency and cost do compete for resources, hence exhibiting tradeoffs.

We next compare PF-AP to Evo, as Fig. 4(d) shows. Although Evo runs faster than WS and NC, it still fails to generate the first Pareto set until after 2.6 seconds. As noted before, such a delay is still quite high for our target use case of serverless databases. Fig. 4(e) shows another issue of Evo: the Pareto frontiers generated over time are not consistent. For example, the frontier generated with 30 probes indicates that if one aims at a latency of 6 seconds, the cost is around 36 units. The frontier produced with 40 probes shows the cost to be as low as 20, while the frontier with 50 probes changes the cost to 28. Recommending configurations based on such inconsistent information is highly undesirable.

Finally, Fig. 4(f) compares PF-AP against Evo for all 258 jobs, when we impose a 1-second (or 2-second) constraint for making a configuration recommendation to balance cost and latency, e.g., when a serverless database needs to be started. Evo fails to generate any Pareto set under 1 second (with 100% uncertain space) and hence cannot make any recommendation. In contrast, our PF-AP can generate Pareto sets under 1 second for all 258 jobs, with a median of 9.2% uncertain space across all jobs. For a 2-second time constraint, Evo still cannot make any recommendation, while our PF-AP can generate Pareto sets under 2 seconds for all 258 jobs, with a median of 6.1% uncertain space.

Expt 2: Streaming 2D and 3D.
We next use the streaming workloads under 2 objectives, where the objectives are average latency (of output records) and throughput (the number of records per second), as well as under 3 objectives, where we add (simulated) cost as the 3rd objective. As the results for different jobs are similar, we illustrate them using job 54 under 3D, while additional results are available in our technical report [38]. Fig. 5(a) confirms that WS and NC take a long time, e.g., 74 and 75 seconds, respectively, to return the first Pareto set, whereas our PF-AP computes the first Pareto set within 1.5 seconds. The Evo method again fails to produce the first Pareto set under 2.2 seconds. Fig. 5(b) and Fig. 5(c) show that WS and NC again have poor coverage of the frontier (7 points only for each), while Fig. 5(d) shows that PF can offer much better coverage of the frontier using less time. Again, we observe that Evo returns inconsistent Pareto frontiers as more probes are made; these plots are left to our technical report [38] due to space limitations.

Fig. 5(e) and 5(f) summarize the running time of our PF-AP and Evo for the 2D and 3D cases. For all 2D jobs, Evo fails to meet the 1-second or 2-second constraint. Our PF-AP can generate Pareto sets for 5 jobs under 1 second, and for all 63 jobs under 2 seconds, with a median of 8% uncertain space. For 3D jobs, we give a slightly looser constraint, 2.5 seconds, to accommodate the increased dimensionality. PF-AP can generate Pareto sets for all 63 jobs under 2.5 seconds, with a median of 2% uncertain space, whereas Evo can generate Pareto sets for only 5 out of 63 jobs under 2.5 seconds.

Figure 5: Comparative results on multi-objective optimization using 63 streaming workloads. (a) Uncertain space (job 54, 3D); (b) Frontier of WS (job 54, 3D); (c) Frontier of NC (job 54, 3D); (d) Frontier of PF (job 54, 3D); (e) Uncertain space under 1 or 2 seconds (2D); (f) Uncertain space under 1 or 2.5 seconds (3D).
Summary.
Across all 321 workloads tested, WS and NC fail to generate Pareto sets under 40 seconds. Evo is faster but still fails to generate the first Pareto set for most jobs under the constraints of 2-2.5 seconds. In contrast, our PF-AP algorithm takes less than 2.5 seconds to generate a decent Pareto set for all jobs, offering a 2.6x to 50x speedup over the other methods in doing so. In addition, PF-AP overcomes the poor coverage (WS and NC) and inconsistency (Evo) issues of the other MOO algorithms.

Next, we perform an end-to-end comparison of the configurations recommended by our MOO against those by Ottertune [43], the state-of-the-art automatic tuning system described earlier.

Expt 3: Accurate models.
First, we assume that the learned models are accurate and treat the model-predicted value of an objective as the true value, for any given configuration. To ensure a fair comparison, we use the GP models from Ottertune in both systems. For each of the following 2D workloads, our system runs the PF algorithm to compute the Pareto set for each workload and then applies the Weighted Utopia Nearest (WUN) strategy (§5) to recommend a configuration. Since Ottertune performs single-objective (SO) optimization, we combine the k objective functions into a single objective, Σ_{i=1}^k w_i Ψ_i(x), with Σ_i w_i = 1, and then call Ottertune to solve the SO problem. The weight vector, w, is a hyperparameter that has to be set before running Ottertune for a workload.

It is important to note that the weights are used very differently in the two systems: PF-WUN applies w in the objective space to select one configuration from its Pareto set. In contrast, the weighted method for Ottertune applies w to construct a SO problem and hence obtains a single solution. It is known from the theory of Weighted Sum [30] that even if one tries many different values of w, the weighted method cannot find many diverse Pareto points, which is confirmed by the sparsity of the Pareto set in Fig. 5(b), where Weighted Sum tried different w values.

Batch 2D. For the 2D batch workloads, we compare the recommendations of PF-WUN to those of Ottertune. Fig. 6(a) shows the comparison when w = (0.5, 0.5), indicating that the application wants balanced results between latency and cost. Since TPCx-BB workloads have 2 orders of magnitude difference in latency, we normalize the latency of each workload (x-axis) by treating the slower of PF-WUN and Ottertune as 100% and the faster as a value less than 100%. The number of cores (y-axis) allowed in this experiment is [4, 58]. For all 30 workloads, Ottertune recommends the smallest number of cores (4), favoring small numbers of cores at the cost of high latency. PF-WUN is adaptive to the application requirement of balanced objectives, using 2-4 more cores for each workload to enable up to a 26% reduction in latency. When we change the weights to (0.1, 0.9), favoring cost over latency, both systems recommend using 4 cores; the plot is omitted in the interest of space. Fig. 6(b) shows the comparison when w = (0.9, 0.1), indicating a strong application preference for low latency. For 19 out of 30 workloads, Ottertune still recommends 4 cores because that is the solution returned even with the 0.9 weight for latency. In contrast, PF-WUN is more adaptive, achieving lower latency than Ottertune for all 30 workloads with up to a 61% reduction in latency. In addition, for 8 out of 30 workloads, PF-WUN dominates Ottertune in both objectives, saving up to 33% of latency while using fewer cores; in this case, Ottertune's solution is not Pareto optimal.

Streaming 2D. For 15 streaming workloads, Fig. 6(c) shows the comparison when w = (0.5, 0.5), where we normalize both latency (x-axis) and throughput (y-axis) using the larger value between PF-WUN and Ottertune. The two systems mainly indicate tradeoffs: Ottertune recommends low-latency, low-throughput configurations, while PF-WUN recommends high-latency, high-throughput configurations. When the application changes the preference to w = (0.9, 0.1), again PF-WUN is more adaptive, achieving lower latency for all 15 workloads with up to a 63% reduction in latency. For 2 out of 15 jobs, PF-WUN dominates Ottertune in both objectives, e.g., a 36% reduction in latency and a 30% increase in throughput. Again, Ottertune's solution is not Pareto optimal here.

Expt 4: Inaccurate models.
We next consider a realistic setting where the learned models are not accurate. In this case, our MO-GD solver uses the variance of the model prediction for a given objective to obtain a more conservative estimate during optimization. In addition, we obtained a more accurate latency model for the 30 TPCx-BB workloads using our customized DNN implementation. Note that our optimizer does not favor any particular model; our experiment simply uses the DNN latency model to demonstrate the flexibility of our optimizer, while Ottertune can only use its GP model. By simply changing the model, we reduce latency by 5% over Ottertune when solving 1D optimization on latency.

Next, we consider 2D optimization over latency and cost. For w = (0.5, 0.5), we take recommendations from both systems and measure the actual latency and cost (number of cores times latency) on our cluster. Fig. 6(e) shows detailed latency results for the top 12 long-running workloads. Since both systems use low numbers of cores, the cost plot is quite similar to the latency plot and hence is omitted here. Most importantly, to run the full TPCx-BB benchmark, our system outperforms Ottertune with 26% savings on running time while using 3% less cost. For w = (0.9, 0.1), Ottertune's recommendations vary from those for (0.5, 0.5) with only a 6% reduction in total running time, while our recommendations lead to a 35% reduction. As Fig. 6(f) shows, our system outperforms Ottertune by a 49% reduction in total running time, with a 48% increase in cost, which matches the application's strong preference for latency.

Figure 6: Comparative results to Ottertune on single-objective and multi-objective optimization. (a) Batch (0.5,0.5), accurate models; (b) Batch (0.9,0.1), accurate models; (c) Stream (0.5,0.5), accurate models; (d) Stream (0.9,0.1), accurate models; (e) Batch (0.5,0.5), inaccurate models; (f) Batch (0.9,0.1), inaccurate models.
7. RELATED WORK
Multi-objective optimization for SQL [16, 21, 41, 42] enumerates a finite set of query plans and selects Pareto-optimal ones based on cost, in contrast to our need to search through an infinite parameter space. MOO approaches for workflow scheduling [21] differ from our MOO in both the parameter space and the solution.
Resource management.
TEMPO [39] addresses resource management for SQL databases in MOO settings. When all Service-Level Objectives (SLOs) cannot be satisfied, it guarantees max-min fairness over SLO satisfaction; otherwise, it degrades to WS for recommending a single solution. Recent optimization for cluster and cloud computing [24, 37] focuses on the running time of SQL queries, but not dataflow problems or multiple objectives. Morpheus [19] addresses the tradeoff between cluster utilization and a job's performance predictability by codifying implicit user expectations as explicit SLOs and enforcing them using scheduling techniques.

WiseDB [27, 29] proposes learning-based techniques for cloud resource management. A decision tree is trained on a set of performance and cost-related features collected from minimum-cost schedules of sample workloads. Such minimum-cost schedules are not available in our case. Dhalion [14] uses learning methods to detect backpressure and resolve it by tuning the degree of parallelism, but does not consider optimization for user objectives. Li et al. [25] minimize end-to-end tuple processing time using deep reinforcement learning, which requires defining scheduling actions and the associated reward, neither of which is available in our problem.

More relevant is recent work on performance tuning and optimization. PerfOrator [35] and Ernest [44] are modeling tools that use hand-crafted models, while OtterTune [43, 49], which we leverage, learns more flexible models from the data. For optimizing a single objective, OtterTune builds a predictive model for each user query by mapping it to the most similar past query and, based on the model, runs Gaussian Process exploration to minimize the objective. CDBTune [50] solves a similar problem, but uses deep RL to learn predictive models and determine the best configuration.
Learning-based query optimization.
Cardlearner [45] uses an ML approach to learn cardinality models from previous job executions and uses them to predict cardinalities in future jobs. Recent work [28] has used neural networks to match the structure of any optimizer-selected query execution plan and predict its latency. Neo [26] is a DNN-based query optimizer that bootstraps its optimization model from existing optimizers and then learns from incoming queries.
8. CONCLUSIONS
We presented a Progressive Frontier-based multi-objective optimizer that constructs a Pareto-optimal set of job configurations for multiple task-specific objectives, and recommends new job configurations to best meet them. Using batch and streaming workloads, we showed that our MOO method outperforms existing MOO methods [30, 13] in both speed and coverage of the Pareto set, and outperforms Ottertune [43], a state-of-the-art approach, by a 26%-49% reduction in running time of the TPCx-BB benchmark, while adapting to different application preferences on multiple objectives. In future work, we plan to extend our optimizer to consider a pipeline of analytical tasks, and optimize for both task-specific and system-wide objectives such as utilization and throughput.

REFERENCES

[1] Amazon EC2 instance types. https://aws.amazon.com/ec2/instance-types/, 2019.
[2] Amazon Aurora serverless. https://aws.amazon.com/rds/aurora/serverless/.
[3] Virtual machine sizes in Azure. https://docs.microsoft.com/en-us/azure/virtual-machines/windows/sizes, 2019.
[4] Bonmin: Basic open-source nonlinear mixed integer programming.
[5] M. R. Bussieck and A. Pruessner. Mixed-integer nonlinear programming. SIAG/OPT Newsletter: Views & News, 14(1):19–22, 2003.
[6] Couenne: Convex over and under envelopes for nonlinear estimation. https://projects.coin-or.org/Couenne/.
[7] CPLEX optimizer.
[8] I. Das and J. E. Dennis. A closer look at drawbacks of minimizing weighted sums of objectives for Pareto set generation in multicriteria optimization problems. Structural Optimization, 14(1):63–69, 1997.
[9] K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan. A fast and elitist multiobjective genetic algorithm: NSGA-II. Trans. Evol. Comp, 6(2):182–197, Apr. 2002.
[10] D. DeWitt and J. Gray. Parallel database systems: the future of high performance database systems. Commun. ACM, 35(6):85–98, 1992.
[11] S. Duan, V. Thummala, and S. Babu. Tuning database configuration parameters with iTuned. PVLDB, 2(1):1246–1257, 2009.
[12] S. Dudoladov, C. Xu, S. Schelter, A. Katsifodimos, S. Ewen, K. Tzoumas, and V. Markl. Optimistic recovery for iterative dataflows in action. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pages 1439–1443, 2015.
[13] M. T. Emmerich and A. H. Deutz. A tutorial on multiobjective optimization: Fundamentals and evolutionary methods. Natural Computing, 17(3):585–609, Sept. 2018.
[14] A. Floratou, A. Agrawal, B. Graham, S. Rao, and K. Ramasamy. Dhalion: Self-regulating stream processing in Heron. Proc. VLDB Endow., 10(12):1825–1836, Aug. 2017.
[15] Y. Gal and Z. Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In Proceedings of The 33rd International Conference on Machine Learning, pages 1050–1059. PMLR, 2016.
[16] S. Ganguly, W. Hasan, and R. Krishnamurthy. Query optimization for parallel execution. In Proceedings of the 1992 ACM SIGMOD International Conference on Management of Data, pages 9–18, 1992.
[17] L. Liberti. Undecidability and hardness in mixed-integer nonlinear programming. Technical report, CNRS, Ecole Polytechnique, 2018.
[18] jMetal: an object-oriented Java-based framework for multi-objective optimization with metaheuristics. http://jmetal.sourceforge.net/.
[19] S. A. Jyothi, C. Curino, I. Menache, S. M. Narayanamurthy, A. Tumanov, J. Yaniv, R. Mavlyutov, I. Goiri, S. Krishnan, J. Kulkarni, and S. Rao. Morpheus: Towards automated SLOs for enterprise clusters. In , pages 117–134, 2016.
[20] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization, 2014.
[21] H. Kllapi, E. Sitaridi, M. M. Tsangaris, and Y. Ioannidis. Schedule optimization for data processing flows on the cloud. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, pages 289–300, 2011.
[22] Artelys Knitro user's manual.
[23] B. Li, Y. Diao, and P. J. Shenoy. Supporting scalable analytics with latency constraints. PVLDB, 8(11):1166–1177, 2015.
[24] J. Li, J. F. Naughton, and R. V. Nehme. Resource bricolage for parallel database systems. PVLDB, 8(1):25–36, 2014.
[25] T. Li, Z. Xu, J. Tang, and Y. Wang. Model-free control for distributed stream data processing using deep reinforcement learning. Proc. VLDB Endow., 11(6):705–718, Feb. 2018.
[26] R. Marcus, P. Negi, H. Mao, C. Zhang, M. Alizadeh, T. Kraska, O. Papaemmanouil, and N. Tatbul. Neo: A learned query optimizer. Proc. VLDB Endow., 12(11):1705–1718, July 2019.
[27] R. Marcus and O. Papaemmanouil. WiSeDB: A learning-based workload management advisor for cloud databases. PVLDB, 9(10):780–791, 2016.
[28] R. Marcus and O. Papaemmanouil. Plan-structured deep neural network models for query performance prediction. Proc. VLDB Endow., 12(11):1733–1746, July 2019.
[29] R. Marcus, S. Semenova, and O. Papaemmanouil. A learning-based service for cost and performance management of cloud databases. In , pages 1361–1362, 2017.
[30] R. Marler and J. S. Arora. Survey of multi-objective optimization methods for engineering. Structural and Multidisciplinary Optimization, 26(6):369–395, 2004.
[31] A. Messac. From dubious construction of objective functions to the application of physical programming. AIAA Journal, 38(1):155–163, 2012.
[32] A. Messac, A. Ismail-Yahaya, and C. A. Mattson. The normalized normal constraint method for generating the Pareto frontier. Structural and Multidisciplinary Optimization, 25(2):86–98, 2003.
[33] NEOS guide: Nonlinear programming software. https://neos-guide.org/content/nonlinear-programming.
[34] NEOS solvers: Nonlinear programming software. https://neos-server.org/neos/solvers/index.html.
[35] K. Rajan, D. Kakadia, C. Curino, and S. Krishnan. PerfOrator: eloquent performance models for resource optimization. In Proceedings of the Seventh ACM Symposium on Cloud Computing, pages 415–427, 2016.
[36] E. Schulz, M. Speekenbrink, and A. Krause. A tutorial on Gaussian process regression: Modelling, exploring, and exploiting functions. Journal of Mathematical Psychology, 85:1–16, August 2018.
[37] J. Shi, J. Zou, J. Lu, Z. Cao, S. Li, and C. Wang. MRTuner: A toolkit to enable holistic optimization for MapReduce jobs. PVLDB, 7(13):1319–1330, 2014.
[38] F. Song, K. Zaouk, C. Lyu, A. Sinha, Q. Fan, Y. Diao, and P. Shenoy. Boosting cloud data analytics using multi-objective optimization. Technical report, 2020. https://hal.inria.fr/hal-02549758.
[39] Z. Tan and S. Babu. Tempo: robust and self-tuning resource management in multi-tenant parallel databases. Proceedings of the VLDB Endowment, 9(10):720–731, 2016.
[40] TPCx-BB (BigBench) benchmark for big data analytics.
[41] I. Trummer and C. Koch. Approximation schemes for many-objective query optimization. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, pages 1299–1310, 2014.
[42] I. Trummer and C. Koch. An incremental anytime algorithm for multi-objective query optimization. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pages 1941–1953, 2015.
[43] D. Van Aken, A. Pavlo, G. J. Gordon, and B. Zhang. Automatic database management system tuning through large-scale machine learning. In Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD '17, pages 1009–1024, 2017.
[44] S. Venkataraman, Z. Yang, M. Franklin, B. Recht, and I. Stoica. Ernest: Efficient performance prediction for large-scale advanced analytics. In Proceedings of the 13th USENIX Conference on Networked Systems Design and Implementation, NSDI'16, pages 363–378, 2016.
[45] C. Wu, A. Jindal, S. Amizadeh, H. Patel, W. Le, S. Qiao, and S. Rao. Towards a learning optimizer for shared clouds. Proc. VLDB Endow., 12(3):210–222, Nov. 2018.
[46] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica. Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, NSDI'12, pages 2–2, 2012.
[47] M. Zaharia, T. Das, H. Li, T. Hunter, S. Shenker, and I. Stoica. Discretized streams: fault-tolerant streaming computation at scale. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, SOSP '13, pages 423–438, 2013.
[48] K. Zaouk, F. Song, C. Lyu, A. Sinha, Y. Diao, and P. J. Shenoy. UDAO: A next-generation unified data analytics optimizer (VLDB 2019 demo). PVLDB, 12(12):1934–1937, 2019.
[49] B. Zhang, D. V. Aken, J. Wang, T. Dai, S. Jiang, J. Lao, S. Sheng, A. Pavlo, and G. J. Gordon. A demonstration of the OtterTune automatic database management system tuning service. PVLDB, 11(12):1910–1913, 2018.
[50] J. Zhang, Y. Liu, K. Zhou, G. Li, Z. Xiao, B. Cheng, J. Xing, Y. Wang, T. Cheng, L. Liu, M. Ran, and Z. Li. An end-to-end automatic cloud database tuning system using deep reinforcement learning. In