Online Mixed-Integer Optimization in Milliseconds
Dimitris Bertsimas and Bartolomeo Stellato

July 1, 2020
Abstract
We propose a method to solve online mixed-integer optimization (MIO) problems at very high speed using machine learning. By exploiting the repetitive nature of online optimization, we are able to greatly speed up the solution time. Our approach encodes the optimal solution into a small amount of information denoted as strategy using the Voice of Optimization framework proposed in [BS20]. In this way the core part of the optimization algorithm becomes a multiclass classification problem which can be solved very quickly. In this work we extend that framework to real-time and high-speed applications focusing on parametric mixed-integer quadratic optimization (MIQO). We propose an extremely fast online optimization algorithm consisting of a feedforward neural network (NN) evaluation and a linear system solution where the matrix has already been factorized. Therefore, this online approach does not require any solver nor iterative algorithm. We show the speed of the proposed method both in terms of total computations required and measured execution time. We estimate the number of floating point operations (flops) required to completely recover the optimal solution as a function of the problem dimensions. Compared to state-of-the-art MIO routines, the online running time of our method is very predictable and can be lower than a single matrix factorization time. We benchmark our method against the state-of-the-art solver Gurobi obtaining from two to three orders of magnitude speedups on examples from fuel cell energy management, sparse portfolio optimization and motion planning with obstacle avoidance.

1 Introduction
Mixed-integer optimization (MIO) has become a powerful tool for modeling and solving real-world decision making problems; see [JLN+]. While most MIO problems are NP-hard and thus considered intractable, we are now able to solve instances with complexity and dimensions that were unthinkable just a decade ago. In [Bix10] the authors analyzed the impressive rate at which the computational power of MIO grew in the last 25 years, providing over a trillion times speedup. This remarkable progress is due to both algorithmic and hardware improvements. Despite these advances, MIO is still considered harder to solve than convex optimization and, therefore, it is more rarely applied to online settings.

Online optimization differs from general optimization by requiring, on the one hand, computing times strictly within the application limits and, on the other hand, limited computing resources. Fortunately, while online optimization problems are not the same between each solve, only some parameters vary and the structure remains unchanged. For this reason, online optimization falls into the broader class of parametric optimization where we can greatly exploit the repetitive structure of the problem instances. In particular, there is a significant amount of data that we can reuse from the previous solutions.

In a recent work [BS20], the authors constructed a framework to predict and interpret the optimal solution of parametric optimization problems using machine learning. By encoding the optimal solution into a small amount of information denoted as strategy, the authors convert the solution algorithm into a multiclass classification problem. Using interpretable machine learning predictors such as optimal classification trees (OCTs), Bertsimas and Stellato were able to understand and interpret how the problem parameters affect the optimal solutions.
Therefore they were able to give optimization a voice that the practitioner can understand.

In this paper we extend the framework from [BS20] to online optimization, focusing on speed and real-time applications instead of interpretability. This allows us to obtain an end-to-end approach to solve mixed-integer optimization problems online without the need of any solver nor linear system factorization. The online solution is extremely fast and can be carried out in less than a millisecond, reducing the online computation time by more than two orders of magnitude compared to state-of-the-art algorithms.

1.1 Contributions
In this work, by exploiting the structure of mixed-integer quadratic optimization (MIQO) problems, we derive a very fast online solution algorithm where the whole optimization is reduced to a neural network (NN) prediction and a single linear system solution. Even though our approach shares the same framework as [BS20], it is substantially different in the focus and the final algorithm. The focus is primarily speed and online optimization applications and not interpretability as in [BS20]. This is why for our predictions we use non-interpretable, but very fast, methods such as NNs. Furthermore, our final algorithm does not involve any convex optimization problem solution as in [BS20]. Instead, we just apply simple matrix-vector multiplications. Our specific contributions include:

1. We focus on the class of MIQO instead of dealing with general mixed-integer convex optimization (MICO) as in [BS20]. This allows us to replace the final step to recover the solution with a simple linear system solution based on the KKT optimality conditions of the reduced problem. Therefore, the whole procedure does not require any solver to run, compared to [BS20].

2. To reduce the number of strategies in larger examples, we reassign the samples to a lower number of selected strategies so that the average suboptimality and infeasibility do not increase above certain tolerances. We define this step as "strategy pruning" and formulate it as a large-scale mixed-integer linear optimization (MILO). To provide solutions in reasonable times, we develop an approximation algorithm that reassigns the training samples according to the strategies appearing most often.

3. In several practical applications of MIQO, the KKT matrix of the reduced problem does not change with the parameters. In this work we factorize it offline and cache the factorization for all the possible solution strategies appearing in the data.
By doing so, our online solution step becomes a sequence of simple forward-backward substitutions that we can carry out very efficiently. Hence, with the offline factorization, our overall online procedure does not even require a single matrix factorization. Compared to [BS20], this further simplifies the online solution step.

4. After the algorithm simplifications, we derive the precise complexity of the overall algorithm in terms of floating point operations (flops), which does not depend on the problem parameter values. This makes the execution time predictable and reliable compared to branch-and-bound (B&B) algorithms, which often get stuck in the tree search procedure.

5. We benchmark our method against the state-of-the-art MIQO solver Gurobi on sparse portfolio trading, fuel battery management and motion planning examples. Thanks to the strategy pruning, we obtain between a few hundred and fewer than 10,000 strategies for all the examples. This allows us to achieve high quality strategy predictions in terms of suboptimality and infeasibility. In particular, the average suboptimality is comparable to the one from Gurobi heuristics and the infeasibility is always within acceptable values for the applications considered. Timing comparisons show up to three orders of magnitude speedups compared to both the Gurobi global optimizer and Gurobi heuristics. The worst-case solution time of our method is also up to three orders of magnitude smaller than the one obtained with B&B schemes, enabling real-time implementations in milliseconds.
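The pipeline described in the contributions above (a feedforward ReLU network predicting the strategy, followed by forward-backward solves with a pre-factorized matrix) can be illustrated with a minimal sketch in Python. All data below is random stand-in data; the dimensions, the function names and the diagonal shift that keeps the toy matrices invertible are our own illustrative choices, not from the paper:

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(0)

# Hypothetical dimensions: parameter size p, hidden width h, M strategies,
# and a q x q linear system per strategy. Random weights stand in for training.
p, h, M, q = 3, 16, 5, 4

layers = [(rng.standard_normal((h, p)), rng.standard_normal(h)),
          (rng.standard_normal((h, h)), rng.standard_normal(h)),
          (rng.standard_normal((M, h)), rng.standard_normal(M))]

def predict_strategy(theta):
    # Forward pass; the softmax is skipped online since we only rank scores.
    y = theta
    for W, b in layers[:-1]:
        y = np.maximum(W @ y + b, 0.0)  # ReLU activation
    W, b = layers[-1]
    return int(np.argmax(W @ y + b))

# Offline: one (random, well-conditioned) matrix per strategy,
# factorized once and cached.
kkt_mats = [rng.standard_normal((q, q)) + q * np.eye(q) for _ in range(M)]
kkt_factors = [lu_factor(K) for K in kkt_mats]

def solve_online(theta, rhs):
    s = predict_strategy(theta)
    return s, lu_solve(kkt_factors[s], rhs)  # forward-backward solves only

theta, rhs = rng.standard_normal(p), rng.standard_normal(q)
s, x = solve_online(theta, rhs)
```

Everything expensive (training the network, building and factorizing one matrix per strategy) happens offline; the online step reduces to a few matrix-vector products plus the triangular solves.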
The structure of the paper is as follows. In Section 2, we review recent work on machine learning for optimization, outlining the relationships and limitations of other methods compared to the approach presented in this work. In addition, we outline the recent developments in high-speed online optimization and the limited advances that have appeared so far for MIO. In Section 3, we introduce the Voice of Optimization framework from [BS20] for general MIO, describing the concept of solution strategy. In Section 4, we describe the strategy selection problem as a multiclass classification problem and propose the NN architecture used in the prediction phase. We also introduce a strategy pruning scheme to reduce the number of strategies. Section 5 describes the computational savings that we can obtain with problems with a specific structure such as MIQO, and the worst-case complexity in terms of number of flops. In Section 6, we describe the overall algorithm with the implementation details. Benchmarks comparing our method to the state-of-the-art solver Gurobi on examples with realistic data appear in Section 7.
2 Related Work
Recently the operations research community started to focus on systematic ways to analyze and solve combinatorial optimization problems with the help of machine learning. For an extensive review on the topic we refer the reader to [BLP18].

Machine learning has so far helped optimization in two directions. The first one investigates heuristics to improve solution algorithms. Iterative routines deal with repeated decisions where the answers are based on expert knowledge and manual tuning. A common example is branching heuristics in B&B algorithms. In general these rules are hand tuned and encoded into the solvers. However, the hand tuning can be hard and is in general suboptimal, especially with complex decisions such as B&B algorithm behavior. To overcome these limitations, in [KBS+16] the authors learn the branching rules without the need of expert knowledge, showing comparable or even better performance than hand-tuned algorithms. Other promising results using ExtraTrees to learn branching rules appeared in [ALW17]. We refer the reader to [LZ17] for a review on the intersection of machine learning and branching.

Another example appears in [BLZ18] where the authors investigate whether it is faster to solve MIQOs directly or as second-order cone optimization (SOCO) problems by linearizing the cost function. This problem becomes a classification problem that offers an advantage based on previous data compared to how the decision is heuristically made inside off-the-shelf solvers.

The second direction poses combinatorial problems as control tasks that we can analyze under the reinforcement learning framework [SB18]. This is applicable to problems with multistage decisions such as network problems or knapsack-like problems. [DKZ+17] learn the heuristic criteria for stage-wise decisions in problems over graphs. In other words, they build a greedy heuristic framework, where they learn the node selection policy using a specialized neural network able to process graphs of any size [DDS16]. For every node, the authors feed a graph representation of the problem to the network and they receive an action-value pair suggesting the next node to select in the optimal path.

Even though these two directions introduce optimization to the benefits of machine learning and show promising results, they do not consider the parametric nature of problems solved in real-world applications. We often solve the same problem formulation with slightly varying parameters several times, generating a large amount of data describing how the parameters affect the solution.

Recently, this idea was used to devise a sampling scheme to collect all the optimal active sets appearing from the problem parameters in continuous convex optimization [MRN19]. While this approach is promising, it evaluates online all the combinations of the collected active sets without predicting the optimal ones using machine learning. Another related work appeared in [KKK19] where the authors warm-start online active set solvers using the predicted active sets from a machine learning framework. However, they do not provide probabilistic guarantees for that method and their sampling scheme is tailored to their specific application of quadratic optimization (QO) for model predictive control (MPC).

In our work, we propose a new method that exploits the amount of data generated by parametric problems to solve MIO online at high speed. In particular we study how efficient we can make the computations using a combination of machine learning and optimization. To the authors' knowledge this is the first time machine learning is used to both reduce and make more consistent the solution time of MIO algorithms.
Applications of online optimization span a wide variety of fields including scheduling [CaPM10], supply chain management [YG08], hybrid model predictive control [BM99], and signal decoding [DCB00].
Embedded optimization.
Over the last decade there has been significant attention from the community on tools for generating custom solvers for online parametric programs. CVXGEN [MB12] is a code generation software for parametric QO that produces a fast and reliable solver. However, its code size grows dramatically with the problem dimensions and it is not applicable to large problems. More recently, the OSQP solver [SBG+20] showed remarkable performance with a light first-order method greatly exploiting the structure of parametric programs to save computation time. The OSQP authors also proposed an optimized version for embedded applications in [BSM+17]. Other solvers tailored to embedded applications include qpOASES [Fer+14] for QO and ECOS [DCB13] for SOCO.
Parametric MIQOs.
However, all these approaches focus on continuous convex problems with no integer variables such as QO. This is because of two main reasons. On the one hand, mixed-integer optimization algorithms are far more complicated to implement than convex optimization ones since they feature a massive amount of heuristics and preprocessing. This is why there is still a huge gap in performance between open-source and commercial solvers for MIO. On the other hand, for many online applications the time required to solve MIO problems is still not compatible with the amount of time allowed. An example is hybrid MPC where, depending on the system dynamics, we have to solve MIO problems online in fractions of a second. Explicit hybrid MPC tackles this issue by precomputing offline the entire mapping from the parameter space to the optimal solution [BMDP02]. However, the memory required for storing such solutions grows exponentially with the problem dimensions and this approach easily becomes intractable.
Suboptimal heuristics.
Other approaches solve these problems only suboptimally, using heuristics to deal with insufficient time to compute the globally optimal solution. Examples include the Feasibility Pump heuristic [FGL05] that iteratively solves linear optimization (LO) subproblems and rounds their solutions until it finds a feasible solution for the original MIO problem. Another heuristic works by integrating the alternating direction method of multipliers (ADMM) [DTB18] with rounding steps to obtain integer feasible solutions for the original problem. The downside of these heuristics is that they do not exploit the large amount of data that we gain by solving the parametric problems over and over again in online settings.
Warm-starting.
In order to speed up subsequent solves, several works focus on warm-starting B&B algorithms [GHW15, RG06]. However, there can be three possible reasons for which significant solution time speedups cannot be achieved. First, previous solutions can be infeasible for the current problem and, therefore, not useful to create bounds to quickly prune branches in the B&B tree. Second, many commercial solvers apply fast heuristics that can quickly obtain good feasible solutions. In case these solutions are as good or better than the provided one, warm-starting does not bring any benefit; see, e.g., [GO20, Start variable attribute]. Third, in B&B algorithms the vast majority of time is usually spent to prove optimality and we are not able to significantly reduce it with a warm-started solution [GHW15, Section 4]. Instead of providing only the previous optimal solution, we can pass the previous B&B tree and adapt the nodes according to parameter changes [MT19]. This technique can sometimes greatly reduce the number of QOs solved. However, it still requires a B&B algorithm to complete, which might be too slow in fast real-time settings.
Value function approximations.
Parametric MIQO has also been studied in terms of how the optimal cost changes with the parameters, i.e., the value function. The authors of [HR14] propose an iterative scheme to dynamically generate points to construct approximations of the value function, with applications to stochastic integer and bilevel integer optimization problems. Constructing value functions to solve stochastic MIO has also been studied in [TPS19, TPS13]. Our approach also tries to map the parameters to the optimizer output, but with two main differences: first, it relies on data; second, it encodes the optimal solution and not the value function.
Our approach.
Despite all these efforts in solving MIO online, there is still a significant gap between the solution time and the real-time constraints of many applications. In this work we propose an alternative approach that exploits data coming from previous problem solutions with high-performance MIO solvers to reduce the online computation time, making it possible to apply MIO to online problems that were not approachable before. The approach proposed in this paper has already been applied to control problems in robotics in [CCS+20] by exploiting the application-specific structure of the constraints.
3 The Voice of Optimization

In this section we introduce the idea of the optimal strategy following the same framework first introduced in [BS20]. Given a parameter θ ∈ R^p, we define a strategy s(θ) as the complete information needed to efficiently recover the optimal solution of an optimization problem.

Consider the mixed-integer optimization problem

    minimize    f(θ, x)
    subject to  g(θ, x) ≤ 0,                               (1)
                x_I ∈ Z^d,

where x ∈ R^n is the decision variable and θ ∈ R^p defines the parameters affecting the problem. We denote the cost as f : R^p × R^n → R and the constraints as g : R^p × R^n → R^m. The vector x⋆(θ) denotes the optimal solution and f(θ, x⋆(θ)) the optimal cost function value given the parameter θ.

Optimal strategy.
We now define the optimal strategy as the set of tight constraints together with the value of the integer variables at the optimal solution. We denote the tight constraints T(θ) as the constraints that are equalities at optimality,

    T(θ) = { i ∈ {1, …, m} | g_i(θ, x⋆(θ)) = 0 }.          (2)

Hence, given T(θ), all the other constraints are redundant for the original problem.

If we assume linear independence constraint qualification (LICQ), the number of tight constraints is at most n, because the tight constraint gradients at the solution ∇g_i(θ, x⋆(θ)) ∈ R^n, i ∈ T(θ), are linearly independent [NW06, Section 12.2]. When LICQ does not hold, the number of tight constraints can be more than the number of decision variables, i.e., |T(θ)| > n. However, in practice the number of tight constraints is significantly lower than the number of constraints m, even for degenerate problems. This means that for many applications where the number of constraints is larger than the number of variables, by knowing T(θ) we can neglect a large number of redundant constraints at the optimal solution.

When some of the components of x are integer, we cannot easily compute the optimal solution by knowing only T(θ). This is because the solution retrieval would involve a MIO problem to identify the integer components of x. However, after fixing the integer components to their optimal values x⋆_I(θ), the tight constraints allow us to efficiently compute the optimal solution. Hence the strategy identifying the optimal solution is a tuple containing the indices of the tight constraints at optimality and the optimal value of the integer variables, i.e., s(θ) = (T(θ), x⋆_I(θ)).

Solution method.
Given the optimal strategy, solving (1) corresponds to solving the following optimization problem

    minimize    f(θ, x)
    subject to  g_i(θ, x) ≤ 0,  ∀i ∈ T(θ),                 (3)
                x_I = x⋆_I(θ).

Solving (3) is much easier than (1) because it is continuous, convex and has a smaller number of constraints. Note that we cannot in general enforce g_i(θ, x) = 0 for the tight constraints because it would make (3) nonconvex. However, we can enforce equalities when the g_i are linear in x [BV04]. Since it is a very fast and direct step, we will denote it as solution decoding. In Section 5 we describe the details of how to exploit the structure of (3) and compute the optimal solution online at very high speeds.

4 Machine Learning

In this section, we describe how we learn the mapping from the parameters θ to the optimal strategies s(θ). In this way, we can replace the hardest part of the optimization routine by a prediction step in a multiclass classification problem where each strategy is a class label.

Our classification problem features data points (θ_i, s_i), i = 1, …, N, where θ_i ∈ R^p are the parameters and s_i ∈ S the corresponding labels identifying the optimal strategies. Set S is the set of strategies of cardinality |S| = M. Our goal is to predict ŝ_i so that it is as close as possible to the true s_i given sample θ_i.

We solve the classification task using NNs. NNs have recently become the most popular machine learning method, radically changing the way we classify in fields such as computer vision [KSH12] and speech recognition [HDY+].

Architecture. We deploy a similar architecture as in [BS20], consisting of L layers defining a composition of functions of the form

    ŝ = h_L(h_{L−1}(… h_1(θ))),

where each layer consists of

    y_l = h_l(y_{l−1}) = σ_l(W_l y_{l−1} + b_l),  l = 1, …, L,    (4)

where y_l ∈ R^{n_l}.
We define the input layer as l = 1 and the output layer as l = L, so that y_0 = θ and y_L = ŝ. Each layer performs an affine transformation with parameters W_l ∈ R^{n_l × n_{l−1}} and b_l ∈ R^{n_l}. In addition, it includes an activation function σ_l : R^{n_l} → R^{n_l} to model nonlinearities. Inner layers feature a rectified linear unit (ReLU), defined as

    σ_l(x) = max(x, 0),  l = 1, …, L−1,

where the max operator is intended elementwise. The ReLU operator has become popular because it promotes sparsity in the model, outputting 0 for the negative components of x, and because it does not experience the vanishing gradient issues typical of sigmoid functions [GBC16].

The output layer features a softmax activation function σ_L(x) ∈ R^M to provide a normalized ranking between the strategies and quantify how likely they are to be the correct one. Softmax activation functions are very common in multiclass classification because of their smoothness and the probabilistic interpretation of their output as the relative importance between classes [GBC16, Section 6.2.2.3]. We can write the output layer as

    (σ_L(x))_j = e^{x_j} / Σ_{k=1}^M e^{x_k},

where 0 ≤ (σ_L(x))_j ≤ 1.

Learning.
In order to define a proper cost function and train the network, we rewrite the labels as a one-hot encoding, i.e., s_i ∈ R^M where M is the total number of classes and all the elements of s_i are 0 except the one corresponding to the class, which is 1. Then we define a smooth cost function, i.e., the cross-entropy loss, for which gradient descent-like algorithms work well,

    L_NN = Σ_{i=1}^N − s_i^T log(ŝ_i),

where log is intended elementwise. This loss can also be interpreted as the distance between the predicted probability density of the labels and the true one. The training step consists of applying classic stochastic gradient descent with the derivatives of the cost function obtained using the back-propagation rule.

Online predictions.
After we complete the model training, we can predict the optimal strategy given θ. However, the model can never be perfect and there can be situations where the prediction is not correct. To overcome these possible limitations, instead of considering only the best class predicted by the NN, we pick the k most-likely classes. Afterwards, we can evaluate in parallel their feasibility and objective value and pick the feasible one with the best objective.

It is difficult to estimate the amount of data required to accurately learn the classifier for problem (1). In particular, given N independent samples Θ_N = {θ_1, …, θ_N} drawn from an unknown discrete distribution, we find M different strategies S(Θ_N) = {s_1, …, s_M}. How likely is it to encounter new strategies in the next sample θ_{N+1}? Following the approach in [BS20], we use the same algorithm to iteratively sample and estimate the probability of finding unseen strategies,

    P(s(θ_{N+1}) ∉ S(Θ_N)),

using the Good-Turing estimator [Goo53]. Given desired probability guarantees ε > β > 0, the algorithm terminates when a bound on the probability of finding new strategies falls below ε.

For some problems, the number of strategies M can quickly grow, thereby making the multiclass classification task very difficult. Fortunately, different strategies are often redundant because they correspond to multiple global optima, and we can select only the relevant strategies to apply.

Mixed-integer linear optimization modelling.
For every sample i and strategy j, we can compute the solution to the reduced problem (3), obtaining an objective value f_ij. If the reduced problem is infeasible, we set f_ij = ∞. To simplify the notation, we refer to f̄_i = f(θ_i, x⋆(θ_i)) as the optimal objective value for sample θ_i. We model the sample-to-strategy assignments with variables z_ij ∈ {0, 1}, where z_ij = 1 if sample i is assigned to strategy j. Variables p_j ∈ {0, 1} describe if strategy j is chosen in the pruning. The objective is to minimize the number of strategies selected such that the relative cost degradation for each sample i is less than a tolerance ε. The pruning problem can be formulated as the following MILO,

    minimize    Σ_{j=1}^M p_j
    subject to  Σ_{j=1}^M f_ij z_ij ≤ f̄_i + ε|f̄_i|,   i = 1, …, N,
                Σ_{j=1}^M z_ij = 1,                     i = 1, …, N,     (5)
                Σ_{i=1}^N z_ij ≤ N p_j,                 j = 1, …, M,
                p ∈ {0, 1}^M,  z ∈ {0, 1}^{N×M},

where |·| is the absolute value. Unfortunately, it is very costly to construct problem (5) because it involves computing f_ij for every combination of samples and strategies. For example, if we have 100,000 samples and 2,000 strategies, we need to solve 200,000,000 reduced problems, which can be very challenging, even in the specialized cases from Section 5. In addition, despite recent advances in MILO solvers, instances of (5) with millions of binary variables are often intractable. Therefore, we implement a simpler pruning technique.

Frequency-based heuristic.
In most cases, the majority of the samples is assigned to a few strategies and the rest of the strategies appear very rarely. Therefore, if we select the most frequent strategies, we cover the majority of the samples. In this way we have to reassign only a small portion of the samples to the selected strategies, without having to compute f_ij for every sample-strategy combination. The whole procedure is outlined in Algorithm 2. Given α > 0, the function SelectFrequentStrategies in Algorithm 1 selects the most frequent strategies S_α, appearing in at least a 1 − α fraction of the samples. Then, Algorithm 2 selects the discarded samples Θ_d that were not assigned to any strategy in S_α. These samples are then reassigned by comparing their cost f_ij with every selected strategy in S_α. If there is at least one sample for which the best reassigned strategy cost f_i is above the tolerance, i.e., f_i > f̄_i + ε|f̄_i|, then α is reduced to α/2 and the procedure is repeated. If we reach the maximum number of iterations, it means that there is no feasible assignment given the specified tolerance and the last value of the fraction of discarded samples α.

Algorithm 1 (SelectFrequentStrategies) Select the most frequent strategies, assigned to at least a 1 − α fraction of the samples.
  input: α, {s(θ_i)}_{i=1}^N, S
  output: S_α
  t ← 0, S_α ← ∅
  for s ∈ S do
    q_s ← |{i | s(θ_i) = s, i = 1, …, N}|       ▷ Compute strategy occurrences
  q_sorted ← ReverseArgsort(q)                  ▷ Sort occurrences in decreasing order
  for ℓ ∈ q_sorted do
    t ← t + q_ℓ                                 ▷ Update number of covered samples
    S_α ← S_α ∪ {ℓ}
    if t > ⌈(1 − α)N⌉ then break

Algorithm 2 Prune strategies while keeping feasibility and low suboptimality.
  input: {(θ_i, s(θ_i))}_{i=1}^N, S
  output: S_α
  α ← α_0
  for it = 1, …, max_it do
    S_α ← SelectFrequentStrategies(α, {s(θ_i)}_{i=1}^N, S)
    Θ_d ← {θ_i | s(θ_i) ∉ S_α}                  ▷ Select discarded samples
    feasible ← true
    for θ_i ∈ Θ_d do
      for s_j ∈ S_α do
        f_ij ← Solve (3)                        ▷ Compute sample-strategy pairs
      f_i ← min_j f_ij                          ▷ Reassign sample i to best strategy
      if f_i > f̄_i + ε|f̄_i| then              ▷ Suboptimality condition not satisfied
        feasible ← false; break
    if feasible then return S_α
    α ← α/2

Compared to solving problem (5), this method relies on the sample-strategy computations for just a small portion of discarded samples. In addition, there is no need to solve any large-scale MILO, which makes it much more scalable to large settings. The downside of this heuristic approach is that the best strategy assignment uses only the most frequent strategies, which might not always be the optimal one. However, for small problems where the MILO is solvable, the heuristic solution always gives a similar number of pruned strategies as the MILO, while always satisfying, by construction, the cost function degradation constraint.

5 Online Optimization

Thanks to the learned predictor, our method offers great computational savings compared to solving each problem instance from scratch. In addition, when the problem offers a specific structure, we can gain even further speedups and replace the whole optimizer with a linear system solution.

Two major challenges in optimization.
By using our previous solutions, the learned predictor maps new parameters θ to the optimal strategy, replacing two of the arguably hardest tasks in numerical optimization algorithms:

Tight constraints.
Identifying the tight constraints at the optimal solution is in general a hard task because of the combinatorial complexity of searching over all the possible combinations of constraints. For this reason, the worst-case complexity of active-set methods (also called simplex methods for LO) is exponential in the number of constraints [BT97].
Integer variables.
It is well known that finding the optimal solution of mixed-integer programs is NP-hard [BW05]. Hence, solving problem (1) online might require a prohibitive computation time because of the combinatorial complexity of computing the optimal values of the integer variables.

We solve both these issues by evaluating our predictor, which outputs the optimal strategy s(θ), i.e., the tight constraints at optimality T(θ) and the value of the integer variables x⋆_I(θ). After evaluating the predictor, computing the optimal solution consists of solving (3), which we can achieve much more efficiently than solving (1), especially in case of special structure.

Special structure.
When g is a linear function in x, we can directly consider the tight constraints as equalities in (3) without losing convexity. In these cases (3) becomes a convex equality constrained problem that can be solved via Newton's method [BV04, Section 10.2]. We can further simplify the online solution in special cases such as MIQO (and also MILO) of the form

    minimize    (1/2) xᵀ P x + qᵀ x + r
    subject to  A x ≤ b,                                   (6)
                x_I ∈ Z^d,

with cost P ∈ S^n_+ (positive semidefinite), q ∈ R^n, r ∈ R, and constraints A ∈ R^{m×n} and b ∈ R^m. We omitted the dependency of the problem data on θ for ease of notation. Given the optimal strategy s(θ) = (T(θ), x⋆_I(θ)), computing the optimal solution to (6) corresponds to solving the following reduced problem from (3),

    minimize    (1/2) xᵀ P x + qᵀ x + r
    subject to  A_T(θ) x = b_T(θ),                         (7)
                x_I = x⋆_I(θ).

Since it is an equality constrained QO, we can compute the optimal solution by solving the linear system defined by its KKT conditions [BV04, Section 10.1.1],

    ⎡ P       A_T(θ)ᵀ  I_Iᵀ ⎤         ⎡ −q       ⎤
    ⎢ A_T(θ)     0       0  ⎥ ⎡ x ⎤ = ⎢ b_T(θ)   ⎥       (8)
    ⎣ I_I        0       0  ⎦ ⎣ ν ⎦   ⎣ x⋆_I(θ)  ⎦

Matrix I is the identity matrix. Vectors or matrices with a subscript index set identify only the rows corresponding to the indices in that set. ν are the dual variables of the reduced continuous problem. The dimensions of the KKT matrix are q × q, where

    q = n + |T(θ)| + d.                                    (9)

We can apply the same method to MILO by setting P = 0. In case no integer variables are present (d = 0), the strategy identifies only the tight constraints and the dimension of the linear system (8) reduces to n + |T(θ)|.

5.1 The Efficient Solution Computation

In case of MIQO, solving the linear system (8) corresponds to computing the solution to (6). Let us analyze the components involved in the online computations to further optimize the solution time.
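As a sanity check of the reduced problem and its KKT system, the following snippet assembles and solves the system with dense linear algebra on a toy instance. The data, the active set `tight`, the integer index `int_idx` and the optimal integer value `x_int` are all invented for illustration; in the method they would come from the strategy predictor:

```python
import numpy as np

# Toy data (invented): n = 3 variables, the last one integer.
P = np.diag([2.0, 1.0, 1.0])          # quadratic cost, positive definite here
q_vec = np.array([-1.0, 0.0, 0.5])    # linear cost (q_vec: q is the KKT size)
A = np.array([[ 1.0, 1.0,  0.0],
              [ 0.0, 1.0,  1.0],
              [-1.0, 0.0,  0.0],
              [ 0.0, 0.0, -1.0]])
b = np.array([1.0, 2.0, 0.0, 0.0])

tight = [0]                # T(θ): constraints tight at the optimum (given)
int_idx = [2]              # I: integer components
x_int = np.array([1.0])    # x*_I(θ): optimal integer values (given)

n, d, t = P.shape[0], len(int_idx), len(tight)
A_T = A[tight, :]
E_I = np.eye(n)[int_idx, :]           # rows of the identity selecting x_I

# Assemble the KKT system; its dimension is q = n + |T(θ)| + d.
K = np.block([[P,   A_T.T,            E_I.T],
              [A_T, np.zeros((t, t)), np.zeros((t, d))],
              [E_I, np.zeros((d, t)), np.zeros((d, d))]])
rhs = np.concatenate([-q_vec, b[tight], x_int])

sol = np.linalg.solve(K, rhs)
x, nu = sol[:n], sol[n:]              # primal solution and dual variables
```

By construction, the solution block `x` satisfies the tight constraints as equalities and fixes the integer components, as the reduced problem requires.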
Linear system solution.
The linear system (8) is sparse and symmetric, and we can solve it with both direct and indirect methods. Regarding direct methods, we compute a sparse permuted LDLᵀ factorization [Dav06] of the KKT matrix, where L is a square lower triangular matrix and D a diagonal matrix, both with dimensions q × q. The factorization step requires O(q³) flops, which can be expensive for large systems. After the factorization, the solution consists only of forward-backward solves, which can be evaluated very efficiently and in parallel with complexity O(q²). Alternatively, when the system is very large we can use an indirect method such as MINRES [PS82] to iteratively approximate the solution by using simple matrix-vector multiplications at each step, with complexity O(q²). Note that indirect methods, while more amenable for large problems, can suffer from bad scaling of the matrix data, requiring many steps before convergence.

Matrix caching.
In several cases θ does not affect the matrices P and A in (6). In other words, θ enters only in the linear part of the cost and the right-hand side of the constraints and does not affect the KKT matrix in (8). This means that, since we know all the strategies that appeared in the training phase, we can factor each of the KKT matrices and store the factors L and D offline. Therefore, whenever we predict the strategy related to a new parameter θ, we can just perform forward-backward solves to obtain the optimal solution without having to perform a new factorization. This step requires O(q²) flops – an order of magnitude less than factorizing the matrix.

Parallel strategy evaluation.
In Section 4.1 we explained how we transform the strategy selection into a multiclass classification problem. Since we have the ability to compare the quality of different strategies in terms of optimality and feasibility, online we choose the k most likely strategies from the predictor output and we compare their performance. Since the comparison is completely independent between the candidate strategies, we can perform it in parallel, saving additional computation time.

5.2 Online Complexity

We now measure the online complexity of the proposed approach in terms of flops. The first step consists in evaluating the neural network with L layers. As shown in (4), each layer consists in a matrix-vector multiplication and addition W_l y_{l−1} + b_l, which has order O(n_l n_{l−1}) operations [BV04, Section C.1.2]. The ReLU step in the layer does not involve any flop, since it is a simple truncation of the nonpositive components of the layer input. Summing these operations over all the layers, the complexity of evaluating the network becomes

  O(n_1 p + n_2 n_1 + ··· + M n_{L−1}) = O(M n_{L−1}),

where p is the dimension of the parameter θ and we assume that the number of strategies M is larger than the dimension of any layer. Note that the final softmax layer, while being very useful in the training phase and in its interpretation as a likelihood for each class, is unnecessary in the online evaluation since we just need to rank the best strategies. After the neural network evaluation we can decode the optimal solution by solving the KKT linear system (8) of dimension defined in (9). Since we already factorized it, the online computations are just simple forward-backward substitutions as discussed in Section 5.1. Therefore, the flops required to solve the linear system are

  O((n + |T(θ)| + d)²).

This dimension is in general much smaller than the number of constraints m and mostly depends only on the problem variables.
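A numpy sketch of the network evaluation just described (the weights are random stand-ins for a trained classifier): each layer is a matrix-vector product plus addition followed by a ReLU truncation, and the final softmax is skipped because ranking the raw output scores already identifies the best k strategies:

```python
import numpy as np

rng = np.random.default_rng(0)

p, hidden, M = 4, 8, 20        # parameter dim, hidden width, number of strategies
sizes = [p, hidden, hidden, M]

# Random stand-in weights; a trained network would be loaded here instead.
Ws = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
bs = [rng.standard_normal(m) for m in sizes[1:]]

def top_k_strategies(theta, k):
    y = theta
    for l, (W, b) in enumerate(zip(Ws, bs)):
        y = W @ y + b                      # O(n_l n_{l-1}) flops per layer
        if l < len(Ws) - 1:
            y = np.maximum(y, 0.0)         # ReLU: no flops, just a truncation
    # No softmax needed online: it is monotone, so ranking raw scores suffices.
    return np.argsort(y)[::-1][:k]

candidates = top_k_strategies(rng.standard_normal(p), k=3)
```

The k candidate strategies returned here would then be evaluated in parallel, as described above, to pick the best one.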
In addition, the KKT matrix (8) is sparse, and if the factorization matrices are sparse as well, we can further reduce the flops required. The overall complexity of the complete MIQO online solution, taking into account both NN prediction and solution decoding, becomes

  O(M n_{L−1} + (n + |T(θ)| + d)²),                          (10)

which does not depend on the cube of any of the problem dimensions. This means that with dense matrices, our method is asymptotically cheaper than factorizing a single linear system, since that operation would scale with the cube of its input data. For sparse matrices, we can make similar considerations based on the number of nonzero elements instead of the dimensions [Dav06].

Despite these high-speed considerations on the number of operations required, it is very important to remark on the reliability of the computation time required by our approach. The execution time of B&B algorithms greatly depends on how the solution search tree is analyzed and pruned. This can vary significantly when problem parameters change, making the whole online optimization procedure unreliable in real-time applications. On the contrary, our method performs a fixed number of operations every time we encounter a new parameter.

Our implementation extends the software tool machine learning optimizer (MLOPT) from [BS20], which is implemented in Python and integrated with CVXPY [DB16] to model the optimization problems. MLOPT is available at https://github.com/bstellato/mlopt. To speed up the repetitive canonicalizations of parametric MIQO we use the disciplined parametric program (DPP) language introduced in CVXPY 1.1 [AAB+19], so that each canonicalization for a new parameter θ_i consists of a matrix-vector multiplication. MLOPT relies on the Gurobi Optimizer [GO20] to solve the problems in the training phase with high accuracy and identify the tight constraints. After MLOPT collects the strategies, we apply Algorithm 2 to select the most frequent ones and reassign the samples accordingly.
This process is performed in parallel over multiple cores, exploiting the fact that the evaluations are independent. Then, MLOPT passes the data to PyTorch [PGC+17], using the PyTorch-Lightning [Fal19] library to define the architecture and train the NN to classify the strategies. We split the data into 80% training and 20% validation. We tune the neural network depth, width, learning rate, batch size, and number of epochs using the Optuna framework [ASY+19] to exploit high parallelization and early pruning. In addition to the MLOPT framework in [BS20], we include a specialized solution method for MIQO based on the techniques described in Section 5, where we factorize and cache the factorization of the KKT matrices (8) for each strategy of the problem to speed up the online computations.

Figure 1: Algorithm implementation. Offline: CVXPY modeling, strategy sampling, NN training, factorization and caching. Online: strategy prediction θ ↦ s(θ) followed by solution decoding s(θ) ↦ x*.
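The factorization-and-caching block of Figure 1 can be sketched as follows, using scipy's LU factorization as a stand-in for the permuted LDLᵀ factorization and random symmetric matrices in place of the per-strategy KKT matrices (both simplifications are ours, for illustration):

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(1)

# Offline: one (illustrative) KKT matrix per strategy, factorized once.
def random_kkt(n):
    S = rng.standard_normal((n, n))
    return S + S.T + n * np.eye(n)     # symmetric, well-conditioned stand-in

strategies = {s: random_kkt(6) for s in range(3)}
cache = {s: lu_factor(K) for s, K in strategies.items()}   # factors stored offline

# Online: the predictor picks a strategy; the parameter defines the right-hand side.
def decode(strategy, rhs):
    return lu_solve(cache[strategy], rhs)   # forward-backward solves only

rhs = rng.standard_normal(6)
x = decode(1, rhs)
```

Online, `decode` performs only forward-backward solves on cached factors, which is the O(q²) step discussed in Section 5; no factorization happens on the critical path.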
As outlined in Section 5, when we classify online using the NN we pick the best k strategies and evaluate them in parallel, in terms of objective function value and infeasibility, to choose the best one. This step requires minimal overhead since the matrices are all factorized and we can execute the evaluations in parallel. We also parallelize the training phase, where we collect data and solve the problems over multiple CPUs. The NN training takes place on a GPU, which greatly reduces the training time. An overview of the online and offline algorithms appears in Figure 1.

In this section, we benchmark our learning scheme on multiple parametric examples from continuous and mixed-integer problems. We compare the predictive performance and the computational time required by our method against the Gurobi Optimizer [GO20]. We run the experiments on the MIT SuperCloud facility in collaboration with the Lincoln Laboratory [RKB+18], exploiting 40 parallel Intel Xeon E5-2650 cores for the data collection involving the problem solutions and an NVIDIA Tesla V100 GPU with 16 GB for the neural network training. For each problem we sample 100,000 parameters θ_i to collect the strategies. We choose this number because the NN training works better with a large number of data points. Note that, thanks to the multiple cores and the code parallelizations, we were able to train the algorithm in between a few hours and less than a day, even for problems that take up to hundreds of seconds to solve with Gurobi. In addition, for all the examples the Good-Turing estimator condition from Section 4.2 was always satisfied for the chosen ε on the samples in the test set of the algorithm.
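For reference, the Good-Turing estimator [Goo53] underlying this condition bounds the probability of encountering a strategy unseen in training by the fraction of samples whose strategy was observed exactly once; a minimal sketch of this missing-mass estimate (toy data, our illustration of the idea):

```python
from collections import Counter

def good_turing_missing_mass(observed_strategies):
    """Estimate P(next sample uses an unseen strategy) as N1 / N,
    where N1 counts strategies observed exactly once."""
    counts = Counter(observed_strategies)
    n1 = sum(1 for c in counts.values() if c == 1)
    return n1 / len(observed_strategies)

# Toy sample: strategies 'a' and 'b' repeat; 'c' and 'd' appear once each.
mass = good_turing_missing_mass(["a", "a", "b", "b", "b", "c", "d"])
# mass = 2 / 7
```

Checking that this estimate stays below a tolerance is what makes the pruned strategy set trustworthy on unseen parameters.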
Infeasibility and suboptimality.
We follow the same criteria for calculating the relative suboptimality and infeasibility as in [BS20]. We repeat them here for completeness. After the learning phase, we compare the predicted solution x̂*_i to the optimal one x*_i obtained by solving the instances from scratch. Given a parameter θ_i, we say that the predicted solution is infeasible if the constraints are violated by more than a tolerance ε_inf, according to the infeasibility metric

  p(x̂*_i) = ‖(g(θ_i, x̂*_i))₊‖∞ / r(θ_i, x̂*_i),

where r(θ, x) normalizes the violation depending on the size of the summands of g. In case of MIQO, g(θ, x) = A(θ)x − b(θ) and r(θ, x) = ‖b(θ)‖∞. If the predicted solution x̂_i is feasible, we define its suboptimality as

  d(x̂*_i) = (f(θ_i, x̂*_i) − f(θ_i, x*_i)) / |f(θ_i, x*_i)|,

where f is the objective of our MIQO problem in (6). Note that d(x̂*_i) ≥ 0; we consider the predicted solution optimal when this value falls below a tolerance ε_sub. For each instance we report the average infeasibility and suboptimality over the samples.

Fuel cells are a green, highly efficient energy source that needs to be controlled to stay within admissible operating ranges. Too frequent switching between ON and OFF states can both reduce the lifespan of the energy cell and increase energy losses. This is why fuel cells are often paired with energy storage devices, such as capacitors, which help reduce the switching frequency during fast transients. In this example we would like to control the energy balance between a supercapacitor and a fuel cell in order to match the demanded power [FDM15]. The goal is to minimize the energy losses while maintaining the device within acceptable operating limits to prevent lifespan degradation. We can model the capacitor dynamics as

  E_{t+1} = E_t + τ(P_t − P^load_t),                         (11)

where τ > 0 is the sampling time and E_t ∈ [E_min, E_max] is the energy stored.
P_t ∈ [0, P_max] is the power provided by the fuel cell and P^load is the desired load power. At each time t we model the on-off state of the fuel cell with the binary variable z_t ∈ {0, 1}. When the cell is off (z_t = 0) we do not consume any energy, thus we have 0 ≤ P_t ≤ P_max z_t. When it is on (z_t = 1) it consumes αP_t² + βP_t + γ units of fuel, with α, β, γ > 0. We define the stage power cost as

  f(P, z) = αP² + βP + γz.

We now model the total sum of the switchings over a time window in order to constrain its value. To do so, we introduce the binary variable d_t ∈ {0, 1} determining whether the cell switches at time t, either from ON to OFF or vice versa. Additionally, we introduce the auxiliary variable w_t ∈ [−1, 1] accounting for the amount of change brought by d_t in the cell state z_t,

  w_t =  1,  if d_t = 1 ∧ z_{t+1} = 1,
        −1,  if d_t = 1 ∧ z_{t+1} = 0,                       (12)
         0,  otherwise.

We can model these logical relationships as the following linear inequality [FDM15], G(w_t, z_t, d_t) ≤ h, with

        [  1   0  −1 ]        [ 0 ]
  G  =  [ −1   0  −1 ] ,  h = [ 0 ] ,                        (13)
        [ −1  −2   1 ]        [ 0 ]
        [  1   2   1 ]        [ 2 ]

where the rows of G act on the stacked vector (w_t, z_t, d_t). Hence we can write the number of switchings s_{t+1} that appeared up to time t + 1 over the past time window of length T as

  s_{t+1} = s_t + d_t − d_{t−T},                             (14)

and impose the constraints s_t ≤ n_sw. The complete fuel cell problem becomes

  minimize    Σ_{t=0}^{T−1} f(P_t, z_t)
  subject to  E_{t+1} = E_t + τ(P_t − P^load_t),
              E_min ≤ E_t ≤ E_max,
              0 ≤ P_t ≤ z_t P_max,
              z_{t+1} = z_t + w_t,
              s_{t+1} = s_t + d_t − d_{t−T},
              s_t ≤ n_sw,
              G(w_t, z_t, d_t) ≤ h,
              E_0 = E_init,  z_0 = z_init,  s_0 = s_init,
              z_t ∈ {0, 1},  d_t ∈ {0, 1},  w_t ∈ [−1, 1].   (15)

The problem parameters are θ = (E_init, z_init, s_init, d_past, P^load), where d_past = (d_{−T}, ..., d_{−1}) and P^load = (P^load_0, ..., P^load_{T−1}). In order to properly control the dynamical system, we must solve (15) within each sampling time τ.
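The logic in (12) can be sanity-checked against a linear encoding by enumeration: its closed form is w_t = d_t(1 − 2z_t). The matrix G and vector h below are our own McCormick-style reconstruction of such an encoding (not necessarily the exact values in [FDM15]):

```python
import itertools

# Reconstructed linear encoding of w = d (1 - 2 z); rows act on (w, z, d).
G = [( 1,  0, -1),   #  w - d        <= 0
     (-1,  0, -1),   # -w - d        <= 0
     (-1, -2,  1),   # -w - 2z + d   <= 0
     ( 1,  2,  1)]   #  w + 2z + d   <= 2
h = [0, 0, 0, 2]

def feasible_w(z, d, grid=201):
    """All w in [-1, 1] (on a grid) satisfying G (w, z, d)^T <= h."""
    ws = [-1 + 2 * i / (grid - 1) for i in range(grid)]
    return [w for w in ws
            if all(g[0]*w + g[1]*z + g[2]*d <= hi + 1e-9 for g, hi in zip(G, h))]

# For every binary case the feasible set collapses to the single value d*(1 - 2*z).
results = {}
for z, d in itertools.product((0, 1), repeat=2):
    results[(z, d)] = feasible_w(z, d)
```

Enumerating all four (z_t, d_t) combinations confirms that the inequalities force exactly the w_t prescribed by the switching logic.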
We chose the cost parameters α, β, and γ from [FDM15], with γ = 80 W, and sampling time τ = 1 second. We define energy and power constraints with E_min = 5, E_max = 10, and P_max = 1, and initial conditions E_init = 7, z_init = 0, and s_init = 0. We randomly generated the load profile P^load. We obtain the offline samples by simulating a closed-loop trajectory of 10,000 time steps and storing the parameters for each component of θ along the trajectory, i.e., E_init, z_init, s_init, d_past, and P^load. We then sample from a uniform distribution over a hypersphere of small radius around these points.

Results.
Table 1 reports the problem dimensions and the maximum computation time needed with each technique. Figure 2 shows the performance of MLOPT for varying values of the top-k strategies in terms of computation time, suboptimality, infeasibility, and accuracy. As expected, the performance improves as k increases. For time horizons of T = 50 or T = 60, Gurobi takes on average longer than the allowed sampling time τ = 1 sec.

Figure 2: MLOPT average performance indicators for the fuel cell battery management example, for horizons T = 10 to T = 60. The dashed line indicates the sampling time.

As noted by the authors of [FDM15], to obtain good performance with this system, horizons of at least T = 60 need to be considered, and the respective solutions are not computable in less than τ = 1 sec with state-of-the-art algorithms. From Table 1, the maximum time of Gurobi is above the allowed sampling time for horizons T ≥ 20, and that of the Gurobi heuristic for horizons T ≥ 40. Therefore, these solution methods are not applicable for real-time optimization of this dynamical system.
Figure 3: Comparison between Gurobi, Gurobi heuristic, and MLOPT (k = 100) performance for the fuel cell battery management example, for T from 10 to 60. Average computation time and suboptimality. The dashed line indicates the sampling time.

Table 1: Fuel cell energy management problem dimensions and maximum times. Columns: T, n_var, n_constr, M, t_max MLOPT [s], t_max Gurobi [s], t_max Gurobi heuristic [s] (e.g., T = 10: n_var = 88, n_constr = 207, M = 709).

Consider the portfolio investment problem [Mar52, BBD+17] with weights w_t ∈ R^{n+1} at time t corresponding to each of the assets in the portfolio and a riskless asset at the (n+1)-th position denoting a cash account. We define the trade as the difference of consecutive weights w_t − w_{t−1}, where w_{t−1} is the given vector of current asset investments acting as a problem parameter. The problem is

  maximize    r̂_tᵀ w_t − γ ℓ^risk_t(w_t) − ℓ^hold_t(w_t) − ℓ^trade_t(w_t − w_{t−1})
  subject to  1ᵀ w_t = 1,
              card(w_t) ≤ c.                                 (16)

Four terms compose the stage rewards. First, we describe the returns r̂_tᵀ w_t as a function of the estimated stock returns r̂_t ∈ R^{n+1} at time t. Second, we define the risk cost as

  ℓ^risk_t(x) = xᵀ Σ̂_t x,

where Σ̂_t ∈ S^{(n+1)×(n+1)}₊ is the estimate of the covariance of the returns at time t. Third, we define the holding cost as

  ℓ^hold_t(x) = s_tᵀ (x)₋,

where (s_t)_i ≥ 0 is the borrowing fee for shorting asset i at time t. The fourth term describes a penalty on the trades, defined as

  ℓ^trade_t(x) = λ ‖x‖.

Parameter λ > 0 is the trade penalty weight. We constrain the number of nonzero asset investments, i.e., the cardinality, to be less than c ∈ Z_{>0}. For the complete model derivation without cardinality constraints see [BBD+17, Section 5.2].
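The stage-reward terms of (16) can be evaluated directly; below is a small numpy sketch with made-up data, a diagonal covariance estimate for brevity, and the ℓ₁ norm as the trade penalty (the choice of norm is an assumption on our part):

```python
import numpy as np

n = 3                                       # risky assets; position n is cash
w_prev = np.array([0.2, 0.3, 0.1, 0.4])     # current holdings w_{t-1}
w      = np.array([0.5, 0.0, -0.1, 0.6])    # candidate portfolio (shorts asset 3)

r_hat = np.array([0.01, 0.02, 0.015, 0.001])  # estimated returns (toy values)
Sigma = np.diag([0.04, 0.09, 0.02, 0.0])      # toy covariance estimate (diagonal)
s     = np.array([1e-4, 1e-4, 1e-4, 0.0])     # borrowing fees
gamma, lam = 100.0, 0.01

risk   = w @ Sigma @ w                      # risk cost: w' Sigma w
hold   = s @ np.maximum(-w, 0.0)            # holding cost on short positions
trade  = lam * np.abs(w - w_prev).sum()     # trade penalty (l1 norm assumed)
reward = r_hat @ w - gamma * risk - hold - trade

card = np.count_nonzero(w)                  # cardinality of the portfolio
```

The cardinality bound card(w_t) ≤ c is what introduces the integer variables and makes (16) an MIQO rather than a plain QO.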
Risk model.
We use a common risk model described as a k-factor model Σ_t = F_t Σ^F_t F_tᵀ + D_t, where F_t ∈ R^{(n+1)×k} is the factor loading matrix and Σ^F_t ∈ S^k₊ is an estimate of the covariance of the factor returns F_tᵀ r_t [BBD+17]. (F_t)_ij is the loading of asset i to factor j. D_t ∈ S^{(n+1)×(n+1)}₊ is a nonnegative diagonal matrix accounting for the additional variance in the asset returns, usually called idiosyncratic risk. We compute the factor model estimates Σ̂_t with 15 factors by using a similar method as in [BBD+17].

Return forecasts. In practice return forecasts are always proprietary and come from sophisticated prediction techniques based on a wide variety of data available to the trading companies. In this example, we simply add zero-mean noise to the realized returns to obtain the estimates and then rescale them to have a realistic mean square error for our prediction. While these estimates are not real, because they use the actual returns, they provide realistic values for the purpose of our computations. We assume to know the risk-free interest rates exactly, with (r̂_t)_{n+1} = (r_t)_{n+1}. The return estimates for the non-cash assets are (r̂_t)_{1:n} = α((r_t)_{1:n} + ε_t), where ε_t ∼ N(0, σ_ε I) and α > 0 is chosen to obtain a realistic mean square error E((r̂_t)_{1:n} − (r_t)_{1:n})² [BBD+17], with √α ≈ 0.15 typical of a proprietary return forecast prediction.
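A sketch of this forecast-generation scheme on toy returns (the noise scale σ_ε and the realized returns are illustrative values, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(3)

T, n = 250, 5                               # trading days, non-cash assets
r = 0.001 * rng.standard_normal((T, n))     # realized (toy) returns

sigma_eps = 0.02                            # noise scale (illustrative)
alpha = 0.15 ** 2                           # so that sqrt(alpha) = 0.15

eps = sigma_eps * rng.standard_normal((T, n))
r_hat = alpha * (r + eps)                   # noisy, rescaled return estimates

# The cash return is known exactly, so a full forecast would append it unchanged.
mse = np.mean((r_hat - r) ** 2)
```

Because the noise is added before rescaling, α jointly shrinks both the signal and the noise, which is what keeps the forecast error at a realistic level.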
Parameters.
The problem parameters are θ = (w_{t−1}, r̂_t, D_t, Σ^F_t, F_t) which, respectively, correspond to the previous asset allocation, the vector of returns, and the risk model matrices. Note that the risk model is updated at the beginning of every month. Since the parameters affect not only the problem vectors but also the matrices, we cannot exploit the offline factorization caching for linear system (8).
We simulate the trading system using the S&P100 values [Qua19] from 2008 to 2013 with risk cost γ = 100, borrow cost s_t, and trade penalty λ = 0.01. These values are similar to the benchmark values in [BBD+17]. We vary the cardinality limit c from 1 to 40. Afterwards, we collect data by sampling around the trajectory points from a hypersphere with radius 0.001 times the magnitude of the parameter. For example, for a vector of returns of magnitude r̄_t we sample around r_t with a radius 0.001 r̄_t. Even though this problem does not have to be solved in real time, in order to optimize the trading performance we must perform multiple expensive backtesting simulations.

Results.
Table 2 shows the problem dimensions and the maximum computation time needed with each technique. Figure 4 displays the performance of MLOPT for varying values of the top-k strategies in terms of computation time, suboptimality, infeasibility, and accuracy. For smaller c, although the total number of integer variable combinations is lower, the problem is harder to solve for every technique.

Figure 4: MLOPT average performance indicators for the portfolio example, for c = 10 to c = 40.

In this example, the performance does not significantly increase above k = 20. We suspect that further accuracy improvements would require larger k, in particular for c = 30 and c = 40. The performance comparison between the different methods appears in Figure 5. MLOPT shows up to three orders of magnitude speedups over Gurobi and the Gurobi heuristic, despite moderate suboptimality and infeasibility values, mostly for c = 10 and c = 20. With these computation time speedups, backtesting time can be significantly reduced and multiple parameter simulations can be executed to tune the problem parameters while evaluating the performance on historical data. This is crucial to obtain high quality portfolio trades.

We consider the problem of motion planning in the presence of obstacles. This problem has a wide variety of applications including autonomous vehicles, space robots, and unmanned aerial vehicles (UAVs) [LaV06]. These problems must usually be solved online within a few milliseconds.

Figure 5: Comparison between Gurobi, Gurobi heuristic, and MLOPT (k = 100) performance for the portfolio example. Average computation time and suboptimality. The dashed line indicates the sampling time.
Table 2: Portfolio problem dimensions and maximum times. Columns: c, n_var, n_constr, M, t_max MLOPT [s], t_max Gurobi [s], t_max Gurobi heuristic [s] (e.g., c = 10: n_var = 552, n_constr = 902, M = 4153).

We model the system in d dimensions; in practice d = 2 for planar systems or d = 3 for aerial systems. The state of the system is x_t = (p_t, v_t), where p_t ∈ R^d is the position at time t and v_t ∈ R^d the velocity. The input u_t ∈ R^d represents the forces produced by the system in every direction. For example, for a UAV, u_t corresponds to the thruster forces. The discrete-time linear system dynamics are described as

  (p_{t+1}, v_{t+1}) = A(p_t, v_t) + B u_t,   t = 0, ..., T − 1,

where A ∈ R^{2d×2d}, B ∈ R^{2d×d}, and τ > 0 is the sampling time. The initial state is (p_init, v_init). We define upper and lower bounds on the state and inputs as

  p̲ ≤ p_t ≤ p̄,  v̲ ≤ v_t ≤ v̄,   t = 0, ..., T,
  u̲ ≤ u_t ≤ ū,                 t = 0, ..., T − 1.

We model every obstacle i as a rectangle in R^d with upper bounds ō_i ∈ R^d and lower bounds o̲_i ∈ R^d, for i = 1, ..., n_obs. At every time t and for every obstacle i, we model the obstacle avoidance decisions with binary variables δ̄_it ∈ {0,1}^d for the obstacle upper bounds and δ̲_it ∈ {0,1}^d for the lower bounds. We can write obstacle avoidance as the following big-M conditions

  ō_i − M δ̄_it ≤ p_t ≤ o̲_i + M δ̲_it,   t = 0, ..., T,  i = 1, ..., n_obs.

In addition, we must impose that we cannot be at the same time on different sides of an obstacle,

  1ᵀ δ̄_it + 1ᵀ δ̲_it ≤ 2d − 1,   t = 0, ..., T,  i = 1, ..., n_obs,

where 1 is the vector of ones of dimension d. Our goal is to minimize the distance with respect to a desired position (p_t ≈ p_des) while expending minimal control effort (u_t ≈ 0). The complete motion planning problem becomes

  minimize    ‖p_T − p_des‖² + Σ_{t=0}^{T−1} (‖p_t − p_des‖² + γ‖u_t‖²)
  subject to  (p_{t+1}, v_{t+1}) = A(p_t, v_t) + B u_t,   t = 0, ..., T − 1,
              p_0 = p_init,  v_0 = v_init,
              ō_i − M δ̄_it ≤ p_t ≤ o̲_i + M δ̲_it,         t = 0, ..., T,  i = 1, ..., n_obs,
              1ᵀ δ̄_it + 1ᵀ δ̲_it ≤ 2d − 1,                t = 0, ..., T,  i = 1, ..., n_obs,
              δ̄_it ∈ {0,1}^d,  δ̲_it ∈ {0,1}^d,
              x_t ∈ [x̲, x̄],   t = 0, ..., T,
              u_t ∈ [u̲, ū],   t = 0, ..., T − 1,           (17)

with cost tradeoff weight γ > 0.

Parameters.
We consider the problem of computing the optimal trajectory from any point in the plane to the desired position p_des. In this example the rest of the problem data does not change. Therefore, the problem parameters are the initial state θ = p_init. Problem (17) must be solved within each sampling time τ to provide fast enough inputs to the dynamical system.

Problem setup.
We consider a discrete-time double-integrator dynamical system in two dimensions as in [SDFH01]. The optimizer must provide solutions within the sampling time 0.1 sec, and the horizon length is T = 60, which corresponds to 6 sec. The cost tradeoff parameter is γ = 0.01. The initial state is given by p_init = θ and v_init = 0. The desired position is p_des = (−·, −10) and the desired velocity is v_des = 0. Figure 6 shows an example trajectory for θ = (9, ·). To generate the parameters θ_i, we sample uniformly between the position bounds p̲ and p̄.

Results.
Table 3 shows the problem dimensions along with the maximum computation time taken by each technique. Figure 7 shows the average performance of MLOPT for varying top-k strategies selected. Even if the accuracy is always low, most of the time less than 10%, the suboptimality and infeasibility are acceptable for this application. As shown in Figure 6, there can be some constraint violations due to marginal violations of the obstacle boundaries.

Figure 6: Trajectory planning example with 10 obstacles and sampling time 0.1 sec. Circles indicate the optimal path computed with Gurobi in 18.18 sec and the squares indicate the optimal path computed with MLOPT with k = 100 in 0.06 sec. The filled dot is the starting position while the star is the desired position. The boundaries of the feasible positions are the dotted line while obstacles are displayed as rectangles.

We show the direct comparison of Gurobi, Gurobi heuristic, and MLOPT in Figure 8. The time improvements of MLOPT allow its real-time implementation, in contrast to state-of-the-art options.
The numerical examples show several benefits of our approach. First, the strategy pruning technique allows us to work with a total number of strategies between 100 and 10,000. These numbers allow high quality predictions in terms of suboptimality and infeasibility with NNs combined with top-k strategy selection. Second, despite low accuracy when comparing the predicted strategy to the exact optimal one, the average suboptimality is comparable to or better than that of the Gurobi heuristic. In addition, infeasibility is always within acceptable levels for the proposed applications. This means that, even in case of multiple global minimizers, our method is able to consistently predict optimal or close-to-optimal solutions. Third, time measurements show up to three orders of magnitude speedups of MLOPT compared to Gurobi and the Gurobi heuristic, both in terms of average and worst-case solution time. Since for the fuel cell energy management and the motion planning examples optimal solutions must be computed within hard real-time requirements, only fast optimization methods can be implemented in practice and the Gurobi-based approaches are too slow. On the contrary, MLOPT proves to be a much faster approach for computing online solutions.

Figure 7: MLOPT average performance indicators for the motion planning example, for n_obs = 2 to 10. The dashed line indicates the sampling time.

Figure 8: Comparison between Gurobi, Gurobi heuristic, and MLOPT (k = 100) performance for the motion planning example. Average computation time and suboptimality. The dashed line indicates the sampling time.

Table 3: Motion planning problem dimensions and maximum times. Columns: n_obstacles, n_var, n_constr, M, t_max MLOPT [s], t_max Gurobi [s], t_max Gurobi heuristic [s] (e.g., n_obstacles = 2: n_var = 1135, n_constr = 3773, M = 1371).

We proposed a machine learning method for solving online MIO at very high speed. By using the Voice of Optimization framework [BS20] we exploited the repetitive nature of online optimization problems to greatly reduce the solution time. Our method casts the core part of the optimization algorithm as a multiclass classification problem that we solve using a NN. In this work we considered the class of MIQO which, while covering the vast majority of real-world optimization problems, allows us to further reduce the computation time by exploiting its structure. For MIQO, we only have to solve a linear system to obtain the optimal solution after the NN prediction. In other words, our approach does not require any solver nor iterative routine to solve parametric MIQOs. Our method is not only extremely fast but also reliable. Compared to branch-and-bound methods our approach has a very predictable online solution time for which we can exactly bound the flops complexity.
This is of crucial importance for real-time applications. We benchmarked our approach against the state-of-the-art solver Gurobi on different real-world examples, showing 100- to 1000-fold speedups in online execution time.

There are several future research directions to improve the MLOPT framework. First, specialized neural network architectures might provide more accurate strategy classifiers for optimization problems with a specific structure. Second, the MLOPT training phase at the moment requires computing the optimal solution of problems with several different parameter combinations. Therefore, the current approach does not apply to problems that take hours to solve with Gurobi. Using suboptimal solutions in the training phase would allow us to significantly reduce the training time and apply MLOPT to such problems. Third, strategy reduction techniques combined with structure-exploiting classifiers would enable applications to large-scale optimization problems with only discrete variables. Finally, including the knowledge about the previous solution would improve the performance of MLOPT, similarly to how warm-starting techniques help reduce computations in MIO solvers.
Acknowledgment
The authors acknowledge the MIT SuperCloud and Lincoln Laboratory Supercomputing Center for providing High Performance Computing resources that have contributed to the research results reported within this paper.
References

[AAB+19] A. Agrawal, B. Amos, S. Barratt, S. Boyd, S. Diamond, and Z. Kolter. Differentiable convex optimization layers. In Advances in Neural Information Processing Systems, 2019.

[ALW17] A. M. Alvarez, Q. Louveaux, and L. Wehenkel. A machine learning-based approximation of strong branching. INFORMS Journal on Computing, 29(1):185–195, 2017.

[ASY+19] T. Akiba, S. Sano, T. Yanase, T. Ohta, and M. Koyama. Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2019.

[BBD+17] S. Boyd, E. Busseti, S. Diamond, R. N. Kahn, K. Koh, P. Nystrup, and J. Speth. Multi-period trading via convex optimization. Foundations and Trends in Optimization, 3(1):1–76, 2017.

[BDTD+16] M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, X. Zhang, J. Zhao, and K. Zieba. End to end learning for self-driving cars. arXiv:1604.07316, 2016.

[Bix10] R. E. Bixby. A brief history of linear and mixed-integer programming computation. Documenta Mathematica, pages 107–121, 2010.

[BLP18] Y. Bengio, A. Lodi, and A. Prouvost. Machine learning for combinatorial optimization: a methodological tour d'horizon. arXiv:1811.0612, 2018.

[BLZ18] P. Bonami, A. Lodi, and G. Zarpellon. Learning a classification of mixed-integer quadratic programming problems. In W.-J. van Hoeve, editor, Integration of Constraint Programming, Artificial Intelligence, and Operations Research, pages 595–604, Cham, 2018. Springer International Publishing.

[BM99] A. Bemporad and M. Morari. Control of systems integrating logic, dynamics, and constraints. Automatica, 35(3):407–427, 1999.

[BMDP02] A. Bemporad, M. Morari, V. Dua, and E. N. Pistikopoulos. The explicit linear quadratic regulator for constrained systems. Automatica, 38(1):3–20, 2002.

[BP08] D. Bertsimas and D. Pachamanova. Robust multiperiod portfolio management in the presence of transaction costs. Computers & Operations Research, 35(1):3–17, 2008.

[BS20] D. Bertsimas and B. Stellato. The Voice of Optimization. Machine Learning (to appear), 2020.

[BSM+17] G. Banjac, B. Stellato, N. Moehle, P. Goulart, A. Bemporad, and S. Boyd. Embedded code generation using the OSQP solver. In IEEE Conference on Decision and Control (CDC), 2017.

[BT97] D. Bertsimas and J. N. Tsitsiklis. Introduction to Linear Optimization. Athena Scientific, 1997.

[BV04] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

[BW05] D. Bertsimas and R. Weismantel. Optimization Over Integers. Dynamic Ideas Press, 2005.

[CaPM10] J. P. S. Catalão, H. M. I. Pousinho, and V. M. F. Mendes. Scheduling of head-dependent cascaded hydro systems: Mixed-integer quadratic programming approach. Energy Conversion and Management, 51(3):524–530, 2010.

[CCS+20] A. Cauligi, P. Culbertson, B. Stellato, D. Bertsimas, M. Schwager, and M. Pavone. Learning mixed-integer convex optimization strategies for robot planning and control, 2020.

[Dav06] T. A. Davis. Direct Methods for Sparse Linear Systems. Society for Industrial and Applied Mathematics, 2006.

[DB16] S. Diamond and S. Boyd. CVXPY: A Python-embedded modeling language for convex optimization. Journal of Machine Learning Research, 17(83):1–5, 2016.

[DCB00] O. Damen, A. Chkeif, and Belfiore. Lattice code decoder for space-time codes. IEEE Communications Letters, 4(5):161–163, May 2000.

[DCB13] A. Domahidi, E. Chu, and S. Boyd. ECOS: An SOCP solver for embedded systems. In European Control Conference (ECC), pages 3071–3076, 2013.

[DDS16] H. Dai, B. Dai, and L. Song. Discriminative embeddings of latent variable models for structured data. In Proceedings of the 33rd International Conference on Machine Learning (ICML), Volume 48, pages 2702–2711. JMLR.org, 2016.

[DKZ+17] H. Dai, E. B. Khalil, Y. Zhang, B. Dilkina, and L. Song. Learning combinatorial optimization algorithms over graphs. arXiv:1704.01665, 2017.

[DTB18] S. Diamond, R. Takapoui, and S. Boyd. A general system for heuristic minimization of convex functions over non-convex sets. Optimization Methods and Software, 33(1):165–193, 2018.

[Fal19] W. A. Falcon. PyTorch Lightning. github.com/PyTorchLightning/pytorch-lightning, 2019.

[FDM15] D. Frick, A. Domahidi, and M. Morari. Embedded optimization for mixed logical dynamical systems. Computers & Chemical Engineering, 72:21–33, 2015.

[FGL05] M. Fischetti, F. Glover, and A. Lodi. The feasibility pump. Mathematical Programming, 104(1):91–104, 2005.

[FKP+14] H. J. Ferreau, C. Kirches, A. Potschka, H. G. Bock, and M. Diehl. qpOASES: a parametric active-set algorithm for quadratic programming. Mathematical Programming Computation, 6(4):327–363, 2014.

[GBC16] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016.

[GHW15] G. Gamrath, B. Hiller, and J. Witzig. Reoptimization techniques for MIP solvers. In E. Bampis, editor, Experimental Algorithms, pages 181–192. Springer International Publishing, 2015.

[GO20] Gurobi Optimization, LLC. Gurobi Optimizer reference manual, 2020.

[Goo53] I. J. Good. The population frequencies of species and the estimation of population parameters. Biometrika, 40(3/4):237–264, 1953.

[HDG07] F. Herzog, G. Dondi, and H. P. Geering. Stochastic model predictive control and portfolio optimization. International Journal of Theoretical and Applied Finance (IJTAF), 10(02):203–233, 2007.

[HDY+12] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups.
IEEE Signal Processing Magazine , 29(6):82–97, Nov 2012.[HR14] A. Hassanzadeh and T. K. Ralph. On the value function of amixed integer linear optimization problem and an algorithm forits construction. Technical report, COR@L Laboratory Report14T-004, Lehigh University, 2014.[JLN +
10] M. J¨unger, T. M. Liebling, D. Naddef, G. L. Nemhauser, W. R.Pulleyblank, G. Reinelt, G. Rinaldi, and L. A. Wolsey.
50 Yearsof Integer Programming 1958-2008 . Springer Berlin Heidelberg,2010.[KBS +
16] E. B. Khalil, P. L. Bodic, L. Song, G. Nemhauser, and B. Dilk-ina. Learning to branch in mixed integer programming. In
Proceedings of the Thirtieth AAAI Conference on Artificial In-telligence , AAAI’16, pages 724–731. AAAI Press, 2016.[KKK19] M. Klauˇco, M. Kal´uz, and M. Kvasnica. Machine learning-basedwarm starting of active set methods in embedded model predic-tive control.
Engineering Applications of Artificial Intelligence ,77:1 – 8, 2019.[KSH12] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classi-fication with deep convolutional neural networks. In
Advancesin Neural Information Processing Systems , page 2012, 2012.[LaV06] Steven M. LaValle.
Planning Algorithms . Cambridge UniversityPress, USA, 2006.[LBH15] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning.
Nature ,521(7553):436–444, may 2015.[LZ17] A. Lodi and G. Zarpellon. On learning and branching: a survey.
TOP , 25(2):207–236, Jul 2017.[Mar52] H. Markowitz. Portfolio selection.
The Journal of Finance ,7(1):77–91, 1952. 39MB12] J. Mattingley and S. Boyd. CVXGEN: A code generator forembedded convex optimization.
Optimization and Engineering ,13(1):1–27, 2012.[MRN19] S. Misra, L. Roald, and Y. Ng. Learning for constrainedoptimization: Identifying optimal active constraint sets.arXiv:1802.09639, 2019.[MT19] Tobia Marcucci and Russ Tedrake. Warm start of mixed-integer programs for model predictive control of hybrid systems.arXiv:1910.08251, 2019.[NW06] J. Nocedal and S. J. Wright.
Numerical Optimization . SpringerSeries in Operations Research and Financial Engineering.Springer, Berlin, 2nd edition, 2006.[PGC +
17] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. De-Vito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automaticdifferentiation in PyTorch. In
NIPS-W , 2017.[PS82] C. C. Paige and M. A. Saunders. LSQR: An algorithm for sparselinear equations and sparse least squares.
ACM Trans. Math.Softw. , 8(1):43–71, March 1982.[Qua19] Quandl. WIKI S&P100 End-Of-Day Data, 2019.[RG06] T. L. Ralphs and M. G¨uzelsoy. Duality and warm starting in in-teger programming. In
The Proceedings of the 2006 NSF Design,Service, and Manufacturing Grantees and Research Conference ,2006.[RKB +
18] A. Reuther, J. Kepner, C. Byun, S. Samsi, W. Arcand,D. Bestor, B. Bergeron, V. Gadepally, M. Houle, M. Hubbell,M. Jones, A. Klein, L. Milechin, J. Mullen, A. Prout, A. Rosa,C. Yee, and P. Michaleas. Interactive supercomputing on 40,000cores for machine learning and data analysis. In , pages 1–6, Sep. 2018.[SB18] R. S. Sutton and A. G. Barto.
Reinforcement Learning: AnIntroduction . A Bradford Book, 2nd edition, 2018.40SBG +
20] B. Stellato, G. Banjac, P. Goulart, A. Bemporad, and S. Boyd.OSQP: An Operator Splitting Solver for Quadratic Programs,feb 2020.[SDFH01] T. Schouwenaars, B. De Moor, E. Feron, and J. How. Mixedinteger programming for multi-vehicle path planning. In , pages 2603–2608, 2001.[SSS +
17] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou,A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton,Y. Chen, L. Lillicrap, F. Hui, L. Sifre, G. van den Driessche,T. Graepel, and D. Hassabis. Mastering the game of go withouthuman knowledge.
Nature , 550(7676):354–359, oct 2017.[TPS13] A. C. Trapp, O. A. Prokopyev, and A. J. Schaefer. On a level-setcharacterization of the value function of an integer program andits application to stochastic programming.
Operations Research ,61(2):498–511, 2013.[TPS19] O. Tavaslo˘glu, O. A. Prokopyev, and A. J. Schaefer. Solvingstochastic and bilevel mixed-integer programs via a generalizedvalue function.
Operations Research , 67(6):1659–1677, 2019.[YG08] F. You and I. E. Grossmann. Mixed-integer nonlinear program-ming models and algorithms for large-scale supply chain designwith stochastic inventory management.