Constrained Multi-Objective Optimization for Automated Machine Learning
Steven Gardner, Oleg Golovidov, Joshua Griffin, Patrick Koch, Wayne Thompson, Brett Wujek, Yan Xu
SAS Institute Inc., North Carolina, USA
{Steven.Gardner, Oleg.Golovidov, Joshua.Griffin, Patrick.Koch, Wayne.Thompson, Brett.Wujek, Yan.Xu}@sas.com

Abstract—Automated machine learning has gained a lot of attention recently. Building and selecting the right machine learning models is often a multi-objective optimization problem. General purpose machine learning software that simultaneously supports multiple objectives and constraints is scant, though the potential benefits are great. In this work, we present a framework called Autotune that effectively handles multiple objectives and constraints that arise in machine learning problems. Autotune is built on a suite of derivative-free optimization methods, and utilizes multi-level parallelism in a distributed computing environment for automatically training, scoring, and selecting good models. Incorporation of multiple objectives and constraints in the model exploration and selection process provides the flexibility needed to satisfy trade-offs necessary in practical machine learning applications. Experimental results from standard multi-objective optimization benchmark problems show that Autotune is very efficient in capturing Pareto fronts. These benchmark results also show how adding constraints can guide the search to more promising regions of the solution space, ultimately producing more desirable Pareto fronts. Results from two real-world case studies demonstrate the effectiveness of the constrained multi-objective optimization capability offered by Autotune.
Index Terms—Multi-objective Optimization; Automated Machine Learning; Distributed Computing System
I. INTRODUCTION
There has been increasing interest in automated machine learning (AutoML) for improving data scientists' productivity and reducing the cost of model building. A number of general or specialized AutoML systems have been developed [1]–[7], showing impressive results in creating good models with much less manual effort. Most of these systems only support a single objective, typically accuracy or error, to assess and compare models during the automation process. However, building and selecting machine learning models is inherently a multi-objective optimization problem, in which trade-offs between accuracy, complexity, interpretability, fairness, or inference speed are desired. There is a plethora of metrics for describing model performance [8], [9], such as precision, recall, F1 score, AUC, informedness, markedness, and correlation, to name a few. In general, each measure has an inherent bias [9], and we typically expect data scientists to compare different performance measures when selecting the best models from a set of candidates. A data scientist might desire relatively accurate models but with minimal memory footprints and/or faster inference speed. Alternatively, a data scientist might have business constraints that are difficult to incorporate into the machine learning model training algorithm itself. There could also be a number of segments inherent within the data where it is important to have comparable accuracy across all segments. When toggling between different performance measures and goals, what the data scientist is really doing is executing a manual multi-objective optimization. Arguably, they are mentally constructing a Pareto front and choosing the model that achieves the best compromise for their use case and criteria.

It is considered fruitless to search for a single measure that perfectly captures the multiple dimensions of interest in machine learning, as shown in Zitzler et al. [10] and paraphrased here:
Theorem 1. In general, solution quality for the m-objective optimization problem cannot be reduced to less than m performance measures.

To emphasize this observation, we include a hypothetical example. Consider the Matthews Correlation Coefficient (MCC) [11], which is considered a good metric to quantify performance on the binary classification problem even when data is unbalanced:
MCC = (TP × TN − FP × FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))

Now suppose we were to apply single-objective optimization and discover two models (model A and model B) with performances shown in Table I.
          TP   FP   FN   TN    ACC    MCC   FPR
model A   900  500  100  8500  94.0%  0.73  5.6%
model B   350  100  650  8900  92.5%  0.49  1.1%

TABLE I: Performance of models A and B. TP is true positives; FP is false positives; FN is false negatives; TN is true negatives; ACC is accuracy; FPR is false positive rate. Compared to model B, model A has better MCC, but worse FPR.

With MCC as the single objective to be maximized, an optimization algorithm would discard model B in preference for model A. However, the choice of which model is better depends entirely on context. For instance, if this is a credit card fraud case, we might also be interested in reducing the false positive rate (FPR) because false positives are very costly [12]. Thus, we would prefer to search around model B to attempt to improve MCC while trying to maintain FPR. However, with unconstrained single-objective optimization, this preference is difficult to enforce during the optimization process.

One approach to addressing this problem is aggregating multiple objectives into a single objective, usually accomplished by some linear weighting of the objectives. The main disadvantage of this approach is that many separate optimizations with different weighting factors need to be performed to examine the trade-offs among the objectives. Another popular approach is multi-objective optimization [13], [14], which generates multiple diverse Pareto-optimal models to achieve a desired trade-off among various performance metrics and goals. However, a potential drawback of pure multi-objective optimization is that the corresponding algorithms are designed to determine the entire Pareto front when, in practice, only part of the front may be desired. For example, if considering false negative rate and false positive rate together, the trivial models that predict always negative and always positive could be part of the Pareto front. It would be a waste of computational resources to train models to refine such regions of the Pareto front. Moreover, not all measures for assessing models can be easily formulated as objectives.
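The metrics in Table I follow directly from the confusion-matrix counts. A minimal plain-Python check (an illustration, not part of Autotune) that reproduces them:

```python
import math

def confusion_metrics(tp, fp, fn, tn):
    """Compute ACC, MCC, and FPR from confusion-matrix counts."""
    total = tp + fp + fn + tn
    acc = (tp + tn) / total
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    fpr = fp / (fp + tn)
    return acc, mcc, fpr

# Counts for model A and model B from Table I
acc_a, mcc_a, fpr_a = confusion_metrics(900, 500, 100, 8500)
acc_b, mcc_b, fpr_b = confusion_metrics(350, 100, 650, 8900)
```

Rounding these values recovers the table exactly: model A attains the higher MCC (0.73 vs. 0.49) while model B attains the lower FPR (1.1% vs. 5.6%).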
Therefore, it can be very beneficial to guide the model search to the desired area by using constraints.

In this work, we provide a constrained multi-objective optimization framework for automated machine learning. This framework is built on a suite of derivative-free search methods and supports multiple objectives and linear or nonlinear constraints. While the default search method works well in most settings, the hybrid framework is extensible so that other desirable search methods can be incorporated easily, in such a way that computing resources are shared to minimize and exploit inherent load imbalance. Moreover, redundant evaluations are intercepted and handled seamlessly to prevent similar algorithms within the hybrid strategy from performing redundant work. The approach works well on standard benchmark problems and shows promising results on real-world applications. Our main contributions in this work are:

• To the best of our knowledge, this is the first general extensible constrained multi-objective optimization framework specifically designed for automated machine learning.
• The Autotune framework embraces the no-free-lunch theorem in that new and diverse search algorithms fit well in the existing framework and may be added in a collaborative rather than a competitive manner, permitting resource sharing and making completed evaluations available to all internal solvers that are capable of using them.
• By supporting general constraints, we can aid users in focusing on specific segments of the Pareto front, saving the computational time of training models that are of little interest to the user. Further, in certain cases the multi-objective problem is really a nonlinearly constrained problem in disguise; for example, one might wish only to optimize specificity and sensitivity while ensuring overall accuracy does not degrade beyond a given threshold. The Autotune framework offers this flexibility.

II. RELATED WORK
Jin [13], [15] claims that machine learning is inherently a multi-objective task and provides a compilation of various multi-objective applications including feature extraction, accuracy, interpretability, and ensemble generation. He et al. [16] use reinforcement learning to balance the trade-off between accuracy and compression of neural networks. The approach is sequential and not targeted toward the general multi-objective problem. Asgari et al. [17] apply a specialized evolutionary algorithm to optimize the parameters of an auto-encoder with respect to two objectives: reconstruction error and classification error. Loeckx [18] stresses the need for multi-objective optimization in the context of machine learning applied to structural and energetic properties of models, emphasizing that such an approach provides a gateway to hierarchy and abstraction. A novel multi-objective evolutionary algorithm (ENORA) was created to search for and select the optimal feature subset in the context of a multi-class classification problem [19]. Shenfield and Rostami [20] apply an evolutionary algorithm that optimizes neural network weights, biases, and structures to simultaneously optimize both overall and individual class accuracy. In RapidMiner [21], an evolutionary framework is proposed where the user may manually design the evolutionary algorithm using drag-and-drop features.

A significant body of multi-objective research has been proposed in the context of neural architecture search (NAS). To simultaneously optimize accuracy and inference speed, Kim et al. [22] propose a multi-objective approach where neural architectures are encoded using integer variables and optimized using a customized evolutionary algorithm. Elsken et al. [23] develop a novel evolutionary algorithm (LEMONADE) to optimize both accuracy and several model complexity measures including the number of parameters.
They propose a Lamarckian inheritance mechanism for warmstarting children networks with parent network predictive performance. Dong et al. [24] adopt progressive search to optimize for both device-related objectives (inference speed and memory usage) and device-agnostic objectives (accuracy and model size). DVOLVER [25], an evolutionary approach inspired by NSGA-II [26], is created to find a family of convolutional neural networks with good accuracy and computational resource trade-offs.

Multi-objective optimization in machine learning seems to favor evolutionary algorithms. However, there have been enhancements made to many other derivative-free optimization approaches that are appropriate and have complementary properties that, if combined, may create robust, powerful hybrid approaches. The derivative-free optimization community has been successfully handling these scenarios in arguably similar if not identically complex and challenging conditions [26]–[29]. For instance, inspired by direct-search methods, Custódio et al. [30] propose a novel algorithm called direct multisearch for optimization problems with multiple black-box objectives. Deb and Sundar [31] combine a preference-based strategy with an evolutionary multi-objective optimization methodology and demonstrate that a preferred set of solutions near a reference point can be found in parallel (instead of one solution).

III. CONSTRAINED MULTI-OBJECTIVE OPTIMIZATION FRAMEWORK
Autotune is designed specifically to tune the hyperparameters and architectures of various machine learning model types including decision trees, forests, gradient boosted trees, neural networks, support vector machines, factorization machines, Bayesian network classifiers, and more. The tuning process utilizes customizable, hybrid strategies of search methods and multi-level parallelism (for both training and tuning). In this work, we focus on the two key features of Autotune: multiple objectives and constraints.
Fig. 1: The Autotune framework. Machine learning algorithms provide detailed model definitions. Search methods propose candidate configurations that are stored in a dedicated pool. The model evaluator utilizes a distributed computing system to train and evaluate models. The search manager supervises the whole search and evaluation process, and collects the best models found and other search information.

The Autotune framework is shown in Figure 1. An extendable suite of search methods (also called solvers) is driven by the search manager, which controls concurrent execution of the search methods. The search methods propose candidate configurations that are stored in a dedicated pool. New search methods can easily be added to the framework. The model evaluator utilizes a distributed computing system to train and evaluate candidate models. The search manager supervises the entire search and evaluation process and collects the best models found. The pseudocode in Algorithm 1 provides a high-level algorithmic view of the Autotune framework.
A. Derivative-Free Optimization Strategy
Autotune is able to perform optimization of general nonlinear functions over both continuous and integer variables. The functions do not need to be expressed in analytic closed
Algorithm 1 Multi-objective constrained optimization in Autotune

Require: Population size n_p and evaluation budget n_b.
Require: Number of centers n_c < n_p and initial step-size Δ̂.
Require: Sufficient decrease criterion α ∈ (0, 1).
  Generate initial parent-points P using LHS with |P| = n_p.
  Evaluate P asynchronously in parallel.
  Populate reference cache-tree R with unique points from P.
  Associate each point p ∈ P with step Δ_p initialized to Δ̂.
  Let F denote the current approximation of the Pareto front.
  while |R| ≤ n_b do
    Select A ⊂ F for local search, such that |A| = n_c.
    for p ∈ A do                          ▷ Search along compass directions
      Set T_p = {}
      for e_i ∈ I do
        T_p = T_p ∪ {p + Δ_p e_i} ∪ {p − Δ_p e_i}
      end for
    end for
    Generate child-points C via crossover and mutations on P.
    Set T = C ∪ (∪_{p ∈ A} T_p).
    Evaluate T ∩ R using fast tree-search look-up on R.
    Project T − R to the linear constraint manifold.
    Evaluate remaining T − R asynchronously in parallel.
    Add unique points from T − R to cache-tree R.
    Update P with new generation C and initial step Δ̂.
    for p ∈ A do
      if |T_p ∩ F| > 0 then
        Select new p ∈ F                  ▷ Pattern search success
      else
        Set Δ_p = Δ_p / 2                 ▷ Pattern search failure
      end if
    end for
  end while
Set ∆ p = ∆ p / (cid:46) Pattern search failure end if end for end while form (i.e., black-box integration is supported), and they canbe non-smooth, discontinuous, and computationally expensiveto evaluate. Problem types can be single objective or multi-objective. The system is designed to run in either singlemachine mode or distributed mode.Because of the limited assumptions that are made about theobjective and constraint functions, Autotune takes a parallel,hybrid, derivative-free approach similar to those used in Taddyet al. [32]; Plantenga [33]; Gray, Fowler, and Griffin [34];Griffin and Kolda [35]. Derivative-free methods are effectivewhether or not derivatives are available, provided that thenumber of variables is not too large (Gray and Fowler [36]). Asa rule of thumb, derivative-free algorithms are rarely appliedto black-box optimization problems that have more than 100variables. The term “black-box” emphasizes that the functionis used only as a mapping operator and makes no implicitassumption about the structure of the functions themselves.In contrast, derivative-based algorithms commonly require thenonlinear objectives and constraints to be continuous andmooth and to have an exploitable analytic representation.Autotune has the ability to simultaneously apply multipleinstances of global and local search algorithms in parallel.This ability streamlines the process of needing to first applya global algorithm in order to determine a good starting pointto initialize a local algorithm. For example, if the problemis convex, a local algorithm should be sufficient, and theapplication of the global algorithm would create unnecessaryoverhead. If the problem instead has many local minima,failing to run a global search algorithm first could result inan inferior solution. Rather than attempting to guess whichparadigm is best, the system simultaneously performs globaland local searches while continuously sharing computationalresources and function evaluations. 
The resulting run time and solution quality should be similar to having automatically selected the best global and local search combination, given a suitable number of threads and processors. Moreover, because information is shared among simultaneous searches, the robustness of this hybrid approach can be increased over other hybrid combinations that simply use the output of one algorithm to hot-start the second algorithm.

Autotune handles integer and categorical variables by using strategies and concepts similar to those in Griffin et al. [37]. This approach can be viewed as a genetic algorithm that includes an additional "growth" step, in which selected points from the population are allotted a small fraction of the total evaluation budget to improve their fitness score (that is, the objective function value) by using local optimization over the continuous variables.

Execution of the system is iterative in its processing, with each iteration repeating the following steps:
1) Acquire new points from the solvers
2) Evaluate each of those points by calling the appropriate black-box functions (model training and validation)
3) Return the evaluated point values (model assessment metrics) back to the solvers

The search manager exchanges points with each solver in the list. During this exchange, the solver receives back all the points that were evaluated in the previous iteration. Based upon those evaluated point values, the solver generates a new set of points it wants evaluated, and those new points get passed to the search manager to be submitted for evaluation. Any solvers capable of "cheating" may look at evaluated points that were submitted by a different solver.
As a result, search methods can learn from each other, discover new opportunities, and increase the overall robustness of the system.

To best utilize computing resources, Autotune supports multiple levels of parallelization that run simultaneously:
• Each evaluation can use multiple threads and multiple worker nodes, and
• Multiple evaluations can run concurrently.

Evaluation sessions can be configured to minimize the overlap of worker nodes but also allow resources to be shared. This design makes Autotune extremely powerful and capable of efficiently using compute grids of any size.
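The iterate-evaluate-return loop described above can be sketched with a thread pool standing in for the distributed evaluation sessions. This is a toy illustration: `evaluate_batch` and the quadratic objective are hypothetical stand-ins for model training and scoring, not Autotune's API.

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate_batch(points, objective, max_workers=4):
    """Evaluate a batch of proposed configurations concurrently,
    mirroring step 2 of the iteration loop (one evaluation session per
    point; a thread pool stands in for the compute grid). Results come
    back in submission order, ready to return to the solvers (step 3)."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        values = list(pool.map(objective, points))
    return list(zip(points, values))

# Toy objective standing in for model training + validation scoring
results = evaluate_batch([(0.1,), (0.5,), (0.9,)], lambda p: (p[0] - 0.5) ** 2)
```

Because all evaluated (point, value) pairs flow back through one manager, any solver may also inspect points proposed by a different solver, which is the "cheating" behavior mentioned above.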
B. Multi-Objective Optimization Approach
When attempting to find the best machine learning model, it is very common to have several objectives. For instance, we might want to build models that maximize accuracy while also minimizing model size so that the models can be deployed on mobile devices. The desired result for such problems is usually not a single solution but rather a range of solutions that we can use to identify an acceptable compromise. Ideally each solution represents a necessary compromise in the sense that no single objective can be improved without causing at least one remaining objective to deteriorate. The goal of Autotune in the multi-objective case is thus to provide to the decision maker a set of solutions that represent the continuum of best-case scenarios.

Mathematically, we can define multi-objective optimization in terms of dominance and Pareto optimality. For a k-objective minimizing optimization problem, a point (solution) x is dominated by a point y if f_i(x) ≥ f_i(y) for all i = 1, ..., k and f_j(x) > f_j(y) for some j = 1, ..., k.

Fig. 2: Example Pareto front. f_1(x) and f_2(x) are two functions to be minimized. Points a, b, c, and d constitute the Pareto frontier found.

A Pareto front contains only nondominated solutions. In Figure 2, a Pareto front is plotted with respect to minimization objectives f_1(x) and f_2(x), along with a corresponding population of 10 points that are plotted in the objective space. In this example, point a dominates {e, f, j}, b dominates {e, f, g, j}, c dominates {g, h, j}, and d dominates {i, j}. Although no other point in the population dominates point c, it has not yet converged to the true Pareto front. Thus there are points in a neighborhood of c that have smaller values of f_1 and f_2 that have not yet been identified.

In the constrained case, a point x is dominated by a point y if θ(x) > ε and θ(y) < θ(x), where θ(x) denotes the maximum constraint violation at point x and ε is the feasibility tolerance; thus feasibility takes precedence over objective function values.

Unlike common multi-objective optimization approaches that solely use metaheuristics [20], [23], [25], the default approach employed by Autotune is a novel hybrid strategy that combines the global search emphasis of metaheuristics [38] with lesser known, but efficient, direct local search methods [39]. The hybrid search strategy begins by creating a Latin hypercube sampling (LHS) of the search space. This LHS is used as the starting point for a genetic algorithm (GA) to search the solution space for promising configurations. GAs enable us to attack multi-objective problems directly in order to evolve a set of Pareto-optimal solutions in one run of the optimization process instead of solving multiple separate problems.
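The dominance relation and nondominated filtering described above can be sketched as follows. The population values are hypothetical, chosen only to mimic the qualitative layout of Figure 2 (four front points and six dominated points):

```python
def dominates(fy, fx):
    """True if objective vector fy dominates fx (minimization): fy is no
    worse in every objective and strictly better in at least one."""
    return (all(a <= b for a, b in zip(fy, fx))
            and any(a < b for a, b in zip(fy, fx)))

def pareto_front(points):
    """Return the nondominated subset of a population of objective vectors."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]

# Hypothetical 10-point population in (f_1, f_2) objective space
pop = [(1, 9), (2, 7), (4, 4), (8, 1),          # front points (like a, b, c, d)
       (2, 9), (3, 8), (5, 5), (6, 4), (9, 2), (7, 7)]
front = pareto_front(pop)
```

Here `front` recovers exactly the four nondominated points, analogous to points a, b, c, and d in Figure 2.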
In addition, Autotune conducts local searches using a generating set search (GSS) algorithm in neighborhoods around nondominated points to improve objective function values and reduce crowding distance. For measuring convergence, Autotune uses a variation of the averaged Hausdorff distance [40] that is extended for general constraints.

C. Constraint Handling
In real-world use cases, it is common to encounter constraints that impose limits on the predictive models being used. For example, consider the context of the Internet of Things (IoT). In the IoT setting, model size and inference speed are very important factors, as models are typically deployed to edge computing devices. If a model requires too much memory for storage or is very slow to score, then it is not a good fit for edge computing. For mobile devices, models that need many computations during inference will consume too much power and should be avoided. For these examples, it can be extremely powerful to add constraints when picking a model. The constraints can be used to focus on the parts of the solution space that satisfy the business needs.

Autotune uses different strategies to handle different types of constraints. Linear constraints are handled by using both linear programming and strategies similar to those in [41], where tangent directions to nearby constraints are constructed and used as search directions. In this case, trial points that violate the linear constraints are first projected back to the feasible region before being submitted for evaluation. Nonlinear constraints are handled by using smooth merit functions [42]. Nonlinear constraint violations are penalized with an L2-norm penalty term that is added to the objective value. In the context of constrained multi-objective optimization, when comparing points for domination, a feasible point is always favored over an infeasible one.
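The L2-norm penalty idea for nonlinear constraints can be illustrated as follows. This is only a sketch: the penalty weight `mu` and the encoding of violations as max(0, g_i(x)) for constraints of the form g_i(x) ≤ 0 are assumptions for illustration, not Autotune's internals.

```python
import math

def merit(objective_value, constraint_values, mu=10.0):
    """Penalized merit value: the objective plus an L2-norm penalty on
    nonlinear constraint violations. Each constraint is g_i(x) <= 0, so
    its violation is max(0, g_i(x)); mu is a hypothetical penalty weight."""
    violation = math.sqrt(sum(max(0.0, g) ** 2 for g in constraint_values))
    return objective_value + mu * violation

feasible = merit(1.0, [-0.2, 0.0])   # no violation, so merit == objective
infeasible = merit(1.0, [0.3, 0.4])  # penalized by mu * ||(0.3, 0.4)||_2
```

A smooth penalty like this lets the derivative-free solvers treat a nonlinearly constrained problem as an unconstrained one, while the separate feasibility-first dominance rule still prefers feasible points during Pareto comparisons.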
IV. EXPERIMENTAL RESULTS
While Autotune is designed specifically for automatically finding good machine learning models, the optimization process that drives it is applicable to general optimization problems. Therefore, to evaluate the performance of Autotune and its effectiveness at solving multi-objective optimization problems, we conducted a benchmark experiment by applying the Autotune system to a set of common multi-objective optimization benchmark problems. We present a sampling of the results here for two of the benchmark problems, ZDT1 and ZDT3, taken from [14]. For both of these problems, the true Pareto front is known, which provides a basis for comparison.

The mathematical formulation for ZDT1 is:

f_1(x) = x_1,
f_2(x) = g(x) (1 − sqrt(x_1 / g(x))),
g(x) = 1 + (9 / (n − 1)) Σ_{i=2}^{n} x_i,  ∀ x_i ∈ [0, 1],  n = 30

ZDT1 is a multi-objective optimization problem with two objectives (f_1, f_2) and 30 variables. Figure 3a shows Autotune's results when running with a sufficiently large evaluation budget of 25,000 evaluations. The plot shows that Autotune has completely captured the true Pareto front, and Autotune's Pareto markers completely cover the true Pareto front. Many times in real-world use cases, evaluation budgets are limited due to time and cost. Figure 3b shows Autotune's results when running with a limited evaluation budget of 5000 evaluations. In this case, we can see that Autotune's approximation of the Pareto front is not nearly as complete, and there are significant gaps when running with the limited evaluation budget. Constraints can be added to the optimization to focus the search on a particular region of the solution space. To demonstrate the power of constraints in the Autotune multi-objective optimization framework, Figure 3c shows the results of re-running Autotune against ZDT1, this time with a lower-bound constraint on f_1. Again, Autotune was given a limited budget of 5000 evaluations. This plot clearly shows how adding the constraint has focused the optimization on the lower-right section of the solution space. This has allowed Autotune to capture a much better representation of the true Pareto front in the constrained region.

Fig. 3: ZDT1 benchmark test problem. (a) 25,000 evals, no constraints; (b) 5000 evals, no constraints; (c) 5000 evals, with a lower-bound constraint on f_1. Each panel plots Autotune's Pareto points against the true Pareto front in the (f_1, f_2) objective space.

The mathematical formulation for ZDT3 is:

f_1(x) = x_1,
f_2(x) = g(x) (1 − sqrt(x_1 / g(x)) − (x_1 / g(x)) sin(10 π x_1)),
g(x) = 1 + (9 / (n − 1)) Σ_{i=2}^{n} x_i,  ∀ x_i ∈ [0, 1],  n = 30

ZDT3 has two objectives (f_1, f_2) and 30 variables. Figure 4a shows that Autotune is able to obtain the true Pareto front very well when given a sufficiently large evaluation budget of 25,000 objective evaluations. Figure 4b shows Autotune's results when running with a limited evaluation budget of 5000 objective evaluations. Autotune struggles to find a complete representation of the Pareto front when limited to 5000 evaluations. In particular, the left side of the plot only shows a few Pareto points that were found by Autotune. Figure 4c shows the results with the same limited evaluation budget of 5000 objective evaluations but with an added upper-bound constraint on f_1. The plot clearly shows that Autotune was able to do a much better job of representing the Pareto front in that area of the solution space.

Fig. 4: ZDT3 benchmark test problem. (a) 25,000 evals, no constraints; (b) 5000 evals, no constraints; (c) 5000 evals, with an upper-bound constraint on f_1. Each panel plots Autotune's Pareto points against the true Pareto front in the (f_1, f_2) objective space.

This experiment demonstrates that Autotune correctly captures the Pareto fronts of the benchmark problems when given adequate evaluation budgets.
By using constraints, Autotune is able to significantly improve search efficiency by focusing on the regions of the solution space that we are interested in.

V. CASE STUDIES
The case studies are much larger, real-world machine learning applications, using multi-objective optimization to tune a high-quality predictive model. The first data set comes from the Kaggle Donors Choose challenge. The second data set is a sales leads data set. After a preliminary study of different model types, including logistic regression, decision trees, forests, and gradient boosted trees, the gradient boosted tree model type was selected for both case studies, as the other model types all significantly underperformed. Table II presents the tuning hyperparameters of the gradient boosted tree, their ranges, and default values.
Hyperparameter    Lower  Default  Upper
Num Trees         100    100      500
Num Vars to Try   1      all      all
Learning Rate     0.01   0.1      1.0
Sampling Rate     0.1    0.5      1.0
Lasso             0.0    0.0      10.0
Ridge             0.0    0.0      10.0
Num Bins          20     20       50
Maximum Levels    2      6        7

TABLE II: Gradient Boosted Tree Hyperparameters

For both studies, Autotune's default hybrid strategy, which combines an LHS initial population with the GA and GSS algorithms, is used. The population size used is 50 and the maximum number of iterations is 20. The tuning process is executed on a compute cluster containing 100 worker nodes. Individual model training uses multiple worker nodes, and multiple models are trained in parallel.
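The search space in Table II could be encoded as simple box bounds per hyperparameter. The dictionary below is a hypothetical representation for illustration, not Autotune's configuration syntax; "Num Vars to Try" is omitted because its upper bound ("all") depends on the data set's feature count.

```python
# Hypothetical encoding of the Table II search space (bounds and defaults
# are copied from the table; the key names are illustrative).
search_space = {
    "n_trees":       {"type": "int",   "lb": 100,  "ub": 500,  "default": 100},
    "learning_rate": {"type": "float", "lb": 0.01, "ub": 1.0,  "default": 0.1},
    "sampling_rate": {"type": "float", "lb": 0.1,  "ub": 1.0,  "default": 0.5},
    "lasso":         {"type": "float", "lb": 0.0,  "ub": 10.0, "default": 0.0},
    "ridge":         {"type": "float", "lb": 0.0,  "ub": 10.0, "default": 0.0},
    "n_bins":        {"type": "int",   "lb": 20,   "ub": 50,   "default": 20},
    "max_levels":    {"type": "int",   "lb": 2,    "ub": 7,    "default": 6},
}

def clip_to_bounds(name, value):
    """Project a proposed value back into its box bounds, as a search
    method would before submitting a trial configuration."""
    spec = search_space[name]
    return min(max(value, spec["lb"]), spec["ub"])
```

Box bounds like these are what the LHS initialization samples from and what GA mutation and GSS steps must respect during tuning.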
A. Donors Choose Data
This case study involves building a model using data from the website DonorsChoose.org. This is a charity organization that provides a platform for teachers to request materials for projects. The business objective is to identify projects that are likely to attract donations based on the historical success of previous projects. Since DonorsChoose.org receives hundreds of thousands of proposals each year, automating the screening process and providing consistent vetting with a machine learning model allows volunteers to spend more time interacting with teachers to help develop successful projects. Properly classifying whether or not a project is exciting is a primary objective, but an important component of that is to minimize the number of projects improperly classified as exciting (false positives). This ensures that valuable human resources are not wasted vetting projects that are likely to be unsuccessful.

The data includes 24 attributes describing the project, including:
• the type of school (metro, charter, magnet, year-round, NLNS)
• school state/region
• average household income for the region
• grade level, subject, and focus area for the project
• teacher information
• various aspects of project cost

The data set contains 620,672 proposal records, of which roughly 18% were ultimately considered worthy of a review by the volunteers. A binary variable labeling whether or not the project was ultimately considered exciting is used as the target for predictive modeling. The data set was partitioned into 70% for training (434,470) and 30% for validation (186,202) for tuning the gradient boosted tree predictive model.

As mentioned in the study data set description, using misclassification rate as a single objective is not sufficient, and a successful predictive model is expected to also minimize the false positive rate. This makes the multi-objective optimization approach well suited for the study, with both misclassification rate and false positive rate (FPR) as the two objectives.
It is unlikely that using any one of the more traditional machine learning metrics for tuning the models would produce the desired results.

The default gradient boosted tree model uses the default hyperparameter configuration listed in Table II. Its confusion matrix is shown in Table III. The default model predicts 5,562 false positives, a significant amount. The FPR on the validation data set is 3.6%. The overall misclassification rate on the validation set is high, around 15%, and needs to be improved, ideally while also improving FPR.

TABLE III: Confusion Matrix - Validation Data - Default Model
Fig. 5: Donors Choose - All Evaluations (axes: Misclassification Rate vs. False Positive Rate; series: F1, Default, MCE, KS, AUC, Pareto, All)

The multi-objective tuning results for the Donors Choose data set are shown in Figures 5 and 6. In Figure 5 the entire set of evaluated configurations is displayed, along with the default model and the generated Pareto front, trading off the minimization of misclassification on the x-axis and the minimization of FPR on the y-axis. The cloud of points splits into two distinct branches, one trending toward a near-zero FPR value and the other trending toward lower misclassification values, resulting in a split set of Pareto points. The default configuration appears to be a near-equal compromise of the two objectives.

Several other tuning runs were executed with various traditional metrics (AUC, KS, MCE, and F1) as a single objective. The best model configurations for each of these runs are superimposed on Figures 5 and 6. Nearly all of the single-objective runs converged to similar values of misclassification and FPR. All of them sacrificed some FPR in the process, which is undesirable as defined by the conditions of this study.
Fig. 6: Donors Choose - Misclassification constrained (axes: Misclassification Rate vs. False Positive Rate; series: Best, F1, MCE, KS, AUC, Constrained, Pareto, All; constraint threshold lost in extraction)

B. Sales Leads Data
Marketers often rely on machine learning models to accurately predict which marketing actions and strategies are most likely to succeed. In this case study, we use a data set collected by the marketing department at SAS Institute Inc. A key goal of this study is to provide the sales team of the company with an updated list of quality candidate leads. Supervised models are built to identify and prioritize qualified leads across about 20 global regions. Machine learning qualifies leads by prioritizing known prospects and accounts based on their likelihood of acting.

TABLE IV: Confusion Matrix - Validation Data - "Best" Model (rows: Target False/True; columns: Predicted False/True; cell counts not recovered from the extraction)

The training data has about 200 candidate features through a four-year window. Web traffic data is a key feature category that includes page counts for several company websites as well as the referrer domain. Customer experience data, such as the number of whitepapers downloaded, webcasts watched, and live events attended, is also captured. A text analytics tool is used to standardize new features such as job function and department. Marketing labels the binary target for model training based on business rules and actual outcomes. The non-event class (not a lead) is down-sampled using stratified sampling to obtain a 10% target event rate. The data set contains 962,670 observations. For the tuning process, the observations were partitioned into 42% for training (404,297), 28% for validation (269,556), and 30% for test (288,817).

Purchase propensity models are very difficult to build due to the unbalanced nature of the training data. It is very important to deliver a scoring model that captures the event well yet minimizes false negatives so that sales opportunities are not overlooked. Typically with unbalanced data, overall misclassification rate is not the preferred measure of model quality. Here we investigate several model quality measures along with a multi-objective tuning strategy that incorporates both overall model accuracy and minimization of the false negative rate (FNR).

The confusion matrix for a default gradient boosted tree model is shown in Table V.
The default model predicts many more false negatives than false positives, which is the opposite of the desired scenario; in this case only 31% of true positives are captured.
TABLE V: Confusion Matrix - Holdout Data (rows: Target False/True; columns: Predicted False/True; cell counts not recovered from the extraction)

The multi-objective tuning results for the leads data set are shown in Figures 7 and 8. In Figure 7 the entire set of evaluated configurations is displayed, along with the default model and the generated Pareto front, trading off the minimization of misclassification on the x-axis and the minimization of FNR on the y-axis. The majority of the cloud of evaluations performs better than the default model with respect to both overall misclassification and FNR. The Pareto front represents a set of trade-off solutions, all of which are significantly better than the default model, cutting the FNR in half.
Fig. 7: Leads Data Results - All Evaluations (axes: Misclassification Rate vs. False Negative Rate; series: FNR, F1, Default, MCE, KS, AUC, Pareto, All)

The Pareto front is shown in more detail in Figure 8. It can be seen more clearly that the solution generated by maximizing only KS for this unbalanced data set, given the same evaluation budget, underperforms relative to the Pareto front of solutions. The overall misclassification of this solution is similar to that of the highest-misclassification solution on the Pareto front, and its FNR is higher than that of every solution on the Pareto front. When misclassification is minimized as a single-objective tuning effort, the resulting misclassification is similar to the lowest-misclassification solution on the Pareto front, but the FNR is higher. In reviewing the Pareto front, it is clear that the range of misclassification across the solutions is relatively small. If it is desirable to trade some false positives for a reduction in false negatives, an increase of over 300 sales leads can be obtained by sacrificing just 0.05% in overall misclassification.
Fig. 8: Leads Data - Pareto Front and Single Objective (axes: Misclassification Rate vs. False Negative Rate; series: Best, FNR, F1, MCE, KS, AUC, Pareto, All)

Constraints on both FNR and misclassification were applied in this problem in an attempt to identify more Pareto solutions with lower FNR. However, since the Pareto front is very narrow in this case study, with both objectives gravitating toward the lower left of the solution space, no additional preferred Pareto solutions were identified by adding constraints. With very little trade-off between objectives observed after running multi-objective optimization, a final attempt to further reduce FNR was executed as a single-objective constrained optimization problem. As Figure 8 shows, minimizing FNR directly as a single objective does not achieve results as desirable as those found when executing the multi-objective tuning process. The solution with the lowest FNR was chosen as the 'Best' model, and its confusion matrix is given in Table VI. The number of false negatives is reduced by 40% (3,007) compared to the default model. The FNR is 0.4343 on the holdout test data; 56.6% of the true positive leads are captured, a significant improvement over 31% with the default model.
TABLE VI: Confusion Matrix - Holdout Data - Lowest FNR (rows: Target False/True; columns: Predicted False/True; cell counts not recovered from the extraction)

VI. CONCLUSIONS
Automation in machine learning improves model-building efficiency and creates opportunities for more applications. This work extends the general framework Autotune by implementing two novel features: multi-objective optimization and constraints. With multi-objective optimization, instead of a single model, a set of models on a Pareto front is produced. The preferred model can then be selected by balancing the different objectives. Adding constraints is also important in the model tuning process. Constraints provide a way to enforce business restrictions or to improve search efficiency by pruning parts of the solution search space. The numerical experiments on benchmark problems demonstrate the effectiveness of our implementation of multi-objective optimization and constraint handling. The two case studies we presented show Autotune's ability to find models that appropriately balance multiple objectives while also adhering to constraints. Future work to enhance Autotune includes simplifying the user's experience when choosing metrics for objectives and constraints.

REFERENCES
[1] Google, "Cloud AutoML BETA."
Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, Eds. Curran Associates, Inc., 2015, pp. 2962–2970.
[5] L. Kotthoff, C. Thornton, H. Hoos, F. Hutter, and K. Leyton-Brown, "Auto-WEKA 2.0: Automatic model selection and hyperparameter optimization in WEKA," Journal of Machine Learning Research, vol. 18, no. 25, pp. 1–5, 2017.
[6] H2O, "AutoML: Automatic machine learning," 2018. [Online]. Available: http://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html
[7] H. Jin, Q. Song, and X. Hu, "Auto-Keras: Efficient neural architecture search with network morphism," 2018.
[8] C. Ferri, J. Hernández-Orallo, and R. Modroiu, "An experimental comparison of performance measures for classification," Pattern Recognition Letters, vol. 30, no. 1, pp. 27–38, 2009.
[9] D. Powers, "Evaluation: From precision, recall and F-factor to ROC, informedness, markedness & correlation," Machine Learning Technologies, vol. 2, 2008.
[10] E. Zitzler, M. Laumanns, L. Thiele, C. M. Fonseca, and V. G. da Fonseca, "Why quality assessment of multiobjective optimizers is difficult," in Proceedings of the 4th Annual Conference on Genetic and Evolutionary Computation, ser. GECCO '02. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2002, pp. 666–674.
[11] B. W. Matthews, "Comparison of the predicted and observed secondary structure of T4 phage lysozyme," Biochimica et Biophysica Acta, vol. 405, pp. 442–451, 1975.
[12] Bolt, "The limitations of fraud detection today, and its future with Bolt," 2018. [Online]. Available: https://medium.com/@bolt.com/the-limitations-of-fraud-detection-today-and-its-future-with-bolt-5cdac0114a2f
[13] Y. Jin and B. Sendhoff, "Pareto-based multiobjective machine learning: An overview and case studies," IEEE Transactions on Systems, Man, and Cybernetics, Part C, vol. 38, no. 3, pp. 397–415, 2008.
[14] E. Zitzler, K. Deb, and L. Thiele, "Comparison of multiobjective evolutionary algorithms: Empirical results," Evolutionary Computation, vol. 8, no. 2, pp. 173–195, 2000.
[15] Y. Jin, Ed., Multi-Objective Machine Learning, ser. Studies in Computational Intelligence. Springer, 2006, vol. 16.
[16] Y. He and S. Han, "ADC: Automated deep compression and acceleration with reinforcement learning," CoRR, vol. abs/1802.03494, 2018.
[17] S. A. Taghanaki, J. Kawahara, B. Miles, and G. Hamarneh, "Pareto-optimal multi-objective dimensionality reduction deep auto-encoder for mammography classification," Computer Methods and Programs in Biomedicine, vol. 145, pp. 85–93, 2017.
[18] J. Loeckx, "Beyond Mitchell: Multi-objective machine learning – minimal entropy, energy and error," 2015.
[19] M. Panda, "Big models for big data using multi objective averaged one dependence estimators," CoRR, vol. abs/1610.07752, 2016.
[20] A. Shenfield and S. Rostami, "Multi-objective evolution of artificial neural networks in multi-class medical diagnosis problems with class imbalance," in CIBCB. IEEE, 2017, pp. 1–8.
[21] RapidMiner, "RapidMiner documentation," 2019. [Online]. Available: https://docs.rapidminer.com
[22] Y.-H. Kim, B. Reddy, S. Yun, and C. Seo, "NEMO: Neuro-evolution with multiobjective optimization of deep neural network for speed and accuracy," 2017.
[23] T. Elsken, J. H. Metzen, and F. Hutter, "Efficient multi-objective neural architecture search via Lamarckian evolution," 2018. [Online]. Available: https://arxiv.org/pdf/1804.09081.pdf
[24] J.-D. Dong, A.-C. Cheng, D.-C. Juan, W. Wei, and M. Sun, "PPP-Net: Platform-aware progressive search for Pareto-optimal neural architectures," 2018.
[25] G. Michel, M. A. Alaoui, A. Lebois, A. Feriani, and M. Felhi, "DVOLVER: Efficient Pareto-optimal neural network architecture search," 2019.
[26] K. Deb, S. Agrawal, A. Pratap, and T. Meyarivan, "A fast elitist non-dominated sorting genetic algorithm for multi-objective optimization: NSGA-II," in Parallel Problem Solving from Nature PPSN VI, M. Schoenauer, K. Deb, G. Rudolph, X. Yao, E. Lutton, J. J. Merelo, and H.-P. Schwefel, Eds. Berlin, Heidelberg: Springer, 2000, pp. 849–858.
[27] C. Audet, G. Savard, and W. Zghal, "Multiobjective optimization through a series of single-objective formulations," SIAM Journal on Optimization, vol. 19, no. 1, pp. 188–210, 2008.
[28] P. Cao, Q. Shuai, and J. Tang, "A multi-objective DIRECT algorithm towards structural damage identification with limited dynamic response information," CoRR, 2017.
[29] A. L. Custódio and J. F. A. Madeira, "MultiGLODS: Global and local multiobjective optimization using direct search," Journal of Global Optimization, vol. 72, no. 2, pp. 323–345, 2018.
[30] A. L. Custódio, J. F. A. Madeira, A. I. F. Vaz, and L. N. Vicente, "Direct multisearch for multiobjective optimization," SIAM Journal on Optimization, vol. 21, no. 3, pp. 1109–1140, 2011.
[31] K. Deb and J. Sundar, "Reference point based multi-objective optimization using evolutionary algorithms," in Proceedings of the 8th Annual Conference on Genetic and Evolutionary Computation, ser. GECCO '06. New York, NY, USA: ACM, 2006, pp. 635–642.
[32] M. A. Taddy, H. K. H. Lee, G. A. Gray, and J. D. Griffin, "Bayesian guided pattern search for robust local optimization," Technometrics, vol. 51, pp. 389–401, 2009.
[33] T. Plantenga, "HOPSPACK 2.0 user manual (v 2.0.2)," Sandia National Laboratories, Tech. Rep., 2009.
[34] G. A. Gray, K. R. Fowler, and J. D. Griffin, "Hybrid optimization schemes for simulation-based problems," Procedia Computer Science, vol. 1, pp. 1349–1357, 2010.
[35] J. D. Griffin and T. G. Kolda, "Asynchronous parallel hybrid optimization combining DIRECT and GSS," Optimization Methods and Software, vol. 25, pp. 797–817, 2010.
[36] G. A. Gray and K. R. Fowler, "The effectiveness of derivative-free hybrid methods for black-box optimization," International Journal of Mathematical Modeling and Numerical Optimization, vol. 2, pp. 112–133, 2011.
[37] J. D. Griffin, K. R. Fowler, G. A. Gray, and T. Hemker, "Derivative-free optimization via evolutionary algorithms guiding local search (EAGLS) for MINLP," Pacific Journal of Optimization, vol. 7, pp. 425–443, 2011.
[38] D. E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning, 1st ed. Addison-Wesley Longman Publishing Co., Inc., 1989.
[39] J. D. Griffin, T. G. Kolda, and R. M. Lewis, "Asynchronous parallel generating set search for linearly constrained optimization," SIAM Journal on Scientific Computing, vol. 30, pp. 1892–1924, 2008.
[40] O. Schütze, X. Esquivel, A. Lara, and C. A. Coello Coello, "Using the averaged Hausdorff distance as a performance measure in evolutionary multiobjective optimization," IEEE Transactions on Evolutionary Computation, vol. 16, pp. 504–522, 2012.
[41] J. D. Griffin, T. G. Kolda, and R. M. Lewis, "Asynchronous parallel generating set search for linearly constrained optimization," SIAM Journal on Scientific Computing, vol. 30, pp. 1892–1924, 2008.
[42] J. D. Griffin and T. G. Kolda, "Nonlinearly constrained optimization using heuristic penalty methods and asynchronous parallel generating set search,"