Comparing Inverse Optimization and Machine Learning Methods for Imputing a Convex Objective Function
Elaheh H. Iraj and Daria Terekhov
Department of Mechanical, Industrial and Aerospace Engineering, Gina Cody School of Engineering and Computer Science, Concordia University, Montréal, QC, Canada
Abstract
Inverse optimization (IO) aims to determine optimization model parameters from observed decisions. However, IO is not part of a data scientist's toolkit in practice, especially as many general-purpose machine learning packages are widely available as an alternative. When encountering IO, practitioners face the question of when, or even whether, investing in developing IO methods is worthwhile. Our paper provides a starting point toward answering these questions, focusing on the problem of imputing the objective function of a parametric convex optimization problem. We compare the predictive performance of three standard supervised machine learning (ML) algorithms (random forest, support vector regression and Gaussian process regression) to the performance of the IO model of Keshavarz, Wang, and Boyd (2011). While the IO literature focuses on the development of methods tailored to particular problem classes, our goal is to evaluate general 'out-of-the-box' approaches. Our experiments demonstrate that determining whether to use an ML or IO approach requires considering (i) the training set size, (ii) the dependence of the optimization problem on external parameters, (iii) the level of confidence with regards to the correctness of the optimization prior, and (iv) the number of critical regions in the solution space.
KEYWORDS inverse optimization; machine learning; data efficiency; parametric optimization; multi-parametric programming
1. Introduction
Inverse optimization (IO) imputes missing optimization model parameters from data that represents minimally sub-optimal solutions of that unknown optimization problem (e.g., past decisions of an optimizing agent). In the academic literature, IO has been shown successful in a variety of problems, including medical decision making (Chan, Craig, Lee, & Sharpe, 2014), electricity demand forecasting (Saez-Gallego & Morales, 2017), the household activity pattern problem (Chow & Recker, 2012) and economic lot-sizing (Egri, Kis, Kovács, & Váncza, 2014). However, IO is rarely used by practitioners.

Previous work has established an analogy between IO and regression (Chan, Lee, & Terekhov, 2019), and provided a statistical inference perspective on IO (Aswani, Shen, & Siddiq, 2018). More generally, one can notice that imputing parameter values from data can be viewed as a machine learning problem (Tan, Delong, & Terekhov, 2019).
Email address: [email protected], [email protected] (corresponding author)
2. Background and Related Work
To define an IO problem, we first need to postulate a forward optimization problem, i.e., the problem whose solutions, perhaps corrupted by noise, we observe. The initial focus of the IO literature was to impute coefficients of forward optimization problems that do not have any dependence on an external parameter (non-parametric). Burton and Toint (1992, 1994) study the inverse shortest path problem in which the goal is to find a minimal perturbation of the arc costs to make an observed set of paths optimal. Given an observed value for the decision variables x, Ahuja and Orlin (2001) find a c vector of minimal distance to a known vector ĉ that would make x optimal. Chan et al. (2014) and Chan et al. (2019) study the inverse linear programming problem when the observed data is noisy and no prior ĉ is given. Tavaslıoğlu, Lee, Valeva, and Schaefer (2018) characterize the inverse feasible region: the set of objectives that would make a given set of feasible solutions to a linear program optimal. Shahmoradi and Lee (2020) introduce a way to measure how sensitive the set of optimal cost vectors is to changes in a given data set.

A parametric optimization problem (also known as a multi-parametric programming problem) is a family of optimization problems parametrized by an independent parameter; the forward parametric problem involves finding a function that would map values of the independent parameter to optimal solutions (Pistikopoulos, Georgiadis, & Dua, 2011). As an example, consider the following parametric linear program with independent parameters (u_1, u_2):

    min  (c_1 + c_3 u_1) x_1 + (c_2 + c_4 u_2) x_2
    s.t. a_1 x_1 + a_2 x_2 ≥ b,

where instantiating u_1 and u_2 leads to a standard (non-parametric) optimization problem. There is substantial interest in the inverse parametric optimization problem (Aswani et al., 2018; Chow & Recker, 2012; Esfahani, Shafieezadeh-Abadeh, Hanasusanto, & Kuhn, 2018; Keshavarz et al., 2011; Kovács, 2019; Saez-Gallego & Morales, 2017; Tan et al., 2019; Tan, Terekhov, & Delong, 2020). In such a problem, pairs (û_k, x̂_k), where û_k is an independent parameter for observation k and x̂_k is an observed value for the decision variables x, are given; the goal is to impute objective function and/or constraint coefficients that would make the observed points optimal or near-optimal solutions of the given parametric forward optimization problem.

Classic IO methods (Ahuja & Orlin, 2001; Burton & Toint, 1992; Keshavarz et al., 2011) rely on some form of optimality conditions, such as the Karush-Kuhn-Tucker (KKT) conditions. More recently, methods based on a combination of ML and IO have emerged (Babier, Chan, Lee, Mahmood, & Terekhov, 2021; Tan et al., 2019, 2020). Aswani, Kaminsky, Mintz, Flowers, and Fukuoka (2019) and Fernández-Blanco, Morales, Pineda, and Porras (2019) include comparisons of an IO approach to several ML algorithms in the context of a particular application. Related work in the ML literature includes, most notably, structured prediction (Taskar, Chatalbashev, Koller, & Guestrin, 2005). Despite this related ML-based work, a comparison of classic methods on general parametric IO problems has not previously been done and the questions posed in the introduction have not been answered.
3. Problem Definition
We study IO for parametric optimization problems (POPs), which are 'compatible' with an ML perspective, unlike the non-parametric optimization ones. POPs are mathematical programs where the objective function and constraints are functions of one or multiple external parameters (Oberdieck et al., 2016; Pistikopoulos et al., 2011). The forward parametric optimization problem (FOP) of interest to us is:
    FOP(u, c):  minimize_x  f(u, x, c)
                subject to  g(u, x) ≤ 0,          (1)

where x ∈ R^n are the decision variables, u ∈ R^v are the independent parameters ('features' in ML parlance), and c ∈ R^n are the coefficients of the objective function f. We assume f and g are differentiable and convex in x for each value of u. We let J be the index set of variables with |J| = n, and I be the index set of constraints with |I| = m; we let 0 denote the column vector of zeros of the required dimension.

In an inverse parametric optimization problem, observations of pairs D = {(x̂_1, û_1), ..., (x̂_K, û_K)} are given; the goal is to find a c that would make x̂_k minimally sub-optimal for the corresponding FOP(û_k, c), for each k ∈ K = {1, ..., K}:

    minimize_c  { Σ_{k ∈ K} L(x̂_k, x_k^pred)  |  x_k^pred ∈ argmin_{x ∈ X_û_k} f(û_k, x, c) },          (2)

where L is any given loss function, x_k^pred is an optimal solution to the fitted forward optimization model instantiated at u = û_k, and X_û_k = {x | g(û_k, x) ≤ 0} is the feasible region corresponding to û_k. There are various ways to reformulate (2); the particular model we use here is the model proposed by Keshavarz et al. (2011), which formulates the optimality criterion via the KKT optimality conditions (see appendix).

The goal of IO could be explanatory, i.e., we aim to characterize the process that generated the data, or predictive, i.e., we want to fit a model that will predict the actions of an optimizing agent/process. That is, having imputed c and given u_new, we could use FOP(u_new, c) to predict x_new. Fitting a model to data can also result in better prescriptive performance, i.e., if instead of predicting what an optimizing agent/process would do, we want to prescribe an optimal decision under new circumstances u_new (this latter perspective being more consistent with classical optimization).
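As a concrete illustration of the prediction step, the following sketch solves a small hypothetical instance of FOP(u, c) numerically. The quadratic f and linear g below are our own toy choices (not an instance from this paper); the function name solve_fop is likewise illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def solve_fop(u, c):
    """Solve one toy instance of the forward problem FOP(u, c):
    minimize_x f(u, x, c) subject to g(u, x) <= 0.
    Here f(u, x, c) = (c[0] + u) x1^2 + c[1] x2^2 - 2 x1 - 2 x2, which is
    convex in x when c[0] + u > 0 and c[1] > 0, and g encodes the
    constraints x1 + x2 <= 4 and x >= 0."""
    f = lambda x: (c[0] + u) * x[0] ** 2 + c[1] * x[1] ** 2 - 2 * x[0] - 2 * x[1]
    cons = [
        {"type": "ineq", "fun": lambda x: 4.0 - x[0] - x[1]},  # -g_1(u, x) >= 0
        {"type": "ineq", "fun": lambda x: x},                  # x >= 0
    ]
    return minimize(f, x0=np.array([1.0, 1.0]), constraints=cons).x

# Having imputed c, solving FOP(u_new, c) yields the prediction x_new.
```

With c = (1, 1) and u = 0 the unconstrained minimizer (1, 1) is feasible, so the solver returns it; changing u moves the predicted decision.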
4. Inverse Optimization Problem as a Learning Problem
Learning is the capability to acquire knowledge by extracting patterns from raw data (Goodfellow, Bengio, & Courville, 2016); in this sense, IO fits within the broader conceptual framework of learning, i.e., we are acquiring knowledge about an optimization process that generated the data we observe. Applied to the IO problem described in Section 3, supervised ML would aim to find a vector-valued function h(u, β) from u (input variable or feature) to x (target variable) such that the loss function L is minimized (Russell & Norvig, 2016). Once h(u, β) is imputed, it can be used for prediction given new feature observations u_new. Figure 1 illustrates this observation: in the left panel, we show that IO takes pairs of (û, x̂) (i.e., data) as input, imputes the value of c (i.e., knowledge) and then, given u_new and an imputed c, makes a prediction x_new; in the right panel, we show that the same happens in a typical supervised ML framework, with the difference being the actual models and methods used to impute the unknown parameters and, consequently, the learned model that is used for prediction.

Figure 1: IO and ML perspectives applied to an IO problem.

The fundamental assumption in IO is that the observations are near-optimal or optimal solutions to an optimization problem, whereas general ML methods do not require such an assumption. Based on the description above, an IO problem can be seen as part of the same conceptual framework as machine learning, the main difference being that in IO data is assumed to come from an optimization process. Viewed another way, we can say that IO assumes a strong prior that the input data is coming from an optimization process, and IO methods leverage this prior knowledge.

Figure 2: A data set and the fitted optimization model (points A to D are the training set, point E is the test set).

Consider Figure 2 (a): points A to D are given and our goal is to predict the next observation.
Point E is an unseen test point for which we want to evaluate the quality of prediction. Given no additional information, a natural idea would be to fit a curve using simple linear or polynomial regression. The training error for these models may be low but the test error would be high, which would normally signal over-fitting to the training data. However, here the problem would not be over-fitting; rather, the issue is that the data is not coming from a linear or a polynomial relationship between x_1 and x_2, and also not from a linear or polynomial relationship between a feature u and each of the x_i. In reality, A, B, C and D are optimal solutions to the parametric optimization problem P(u): max (−u)x_1 − (3 + 10u)x_2 subject to the convex hull of (A, B, C, D, E), shown in Figure 2 (d), with each of A, B, C, D and E being the optimal solution over a distinct interval of u. This example was generated using the Wolfram Parametric Linear Programming app (Bunduchi & Mandric, 2011).

Just like we aim to fit a linear regression model when we conjecture the relationship is linear, or a polynomial one when we conjecture it is polynomial, we can fit an optimization model when we think the data is coming from an optimization problem. Prior knowledge of the process that generated the data can have a substantial impact on the ability to fit a good model; the above example illustrates specifically the effect of incorporating an optimization model prior. Next, we compare the predictive performance of IO methods, which assume an optimization prior, and ML methods, which do not, when data is generated from parametric convex optimization problems.
5. Experiments
Our goal is to evaluate 'out-of-the-box' approaches that can be easily implemented in practice: a KKT-based IO model (Keshavarz et al., 2011), random forest (Breiman, 2001), support vector regression (Drucker, Burges, Kaufman, Smola, & Vapnik, 1997) and Gaussian process regression (Rasmussen, 2004). We use the scikit-learn (Pedregosa et al., 2011) implementation of the ML methods. Keshavarz et al. (2011)'s KKT-based IO model is solved using Concert Technology of IBM ILOG CPLEX v12.7.0. All experiments are run on an HP server with 20 Intel(R) Xeon(R) CPU E5-2687W v3 @ 3.10GHz processors and 512 GB RAM under a Linux environment.

To show the value of prior knowledge for IO, we consider two types of IO models. In IO-perfect, we have perfect information about the form of the unknown objective function (but do not know its parameters). In IO-imperfect, the objective function is misspecified, and so differs from the true one due to modeling error.
Algorithm 1 Data Generation
    for k ← 1 to K do
        generate û_k from uniform distributions
        solve FOP_true(û_k, c_true) and obtain x̂_k
    end for
    D ← {(x̂_1, û_1), (x̂_2, û_2), ..., (x̂_K, û_K)}

Pairs (û_k, x̂_k) are generated using Algorithm 1 and divided into a training set and a test set. We use leave-one-out cross-validation on the training set for the ML methods: we pick one point as our validation set, fit a model to the remaining data and evaluate the error on the single held-out point. A validation error estimate is obtained by repeating this procedure for each of the training points and averaging the results. Doing so with several hyperparameter settings enables us to choose the hyperparameter setting with the lowest average cross-validation error. All of the ML experimental results presented below show the error on the test set for the best hyperparameter setting found during the leave-one-out cross-validation procedure.

We measure the deviation between the predicted x and x_true in the test set (of size M) using mean relative error (MRE):

    MRE = (1/M) Σ_{m=1}^{M} ||x^m − x_true^m|| / ||x_true^m||.          (3)

Consider imputing a customer's unknown utility function given observations of their past purchases (Bärmann, Pokutta, & Schneider, 2017; Keshavarz et al., 2011). We assume that customers solve an internal optimization problem in the course of purchasing goods to maximize their utility function:
    UFOP(p):  minimize_x  p^T x − U(x)
              subject to  x ≥ 0,          (4)

where x ∈ R^n and p ∈ R^n are the vectors of consumer demand and the associated price, respectively. Similarly to Keshavarz et al. (2011), we assume that the function U(x) is a concave utility function which is non-decreasing over [0, x_max], x_max being the maximum demand level found from past observations. Since U(x) is concave, the objective function of (4) is convex.

We generate each p_i from a uniform distribution and set U_true(x) = Σ_{i=1}^n √x_i. We use U_perfect(x) = Σ_{i=1}^n c_i √x_i in implementing IO-perfect (i.e., the model is correctly specified so that all observations can be made optimal solutions for their corresponding p_i). For IO-imperfect, we use U_imperfect(x) = Σ_{i=1}^n (q_i x_i^2 + 2 r_i x_i).

Based on the results of our hyperparameter search during leave-one-out cross-validation, we use the scikit-learn implementation of random forest with n_estimators=50, max_depth=None, max_features=None; support vector regression with C=0.1, gamma="auto" and an RBF kernel; and Gaussian process regression with an RBF kernel, alpha=1e-10, normalize_y=False, optimizer="fmin_l_bfgs_b" and n_restarts_optimizer=10.

Figure 3 shows how the relative test error of IO-perfect, IO-imperfect, RF, SVR and GP changes when we increase the training size. Whereas RF, SVR and GP performance improves, the performance of IO-imperfect and IO-perfect is not affected. GP achieves performance similar to that of IO-imperfect with only 20 observations, while RF and SVR get close to IO-imperfect at 60 and 100 points, respectively. The ML methods perform well because the underlying optimization process (resulting from model (4)) is constrained by only non-negativity constraints and is parametric only in the objective function: the relationship between the input and output of this process can be captured (learned) without knowledge of the underlying optimization model. In the next experiment, we aim to identify IO problems which would be more difficult for these ML methods and would require knowledge of the underlying optimization structure to make good predictions.

Figure 3: Comparing IO-imperfect, IO-perfect, RF, SVR, and GP for utility function estimation; each point is an average over 15 randomly-generated problems.

5.3. Experiment 2: Randomly-Generated Parametric Optimization Problems

Using the POP toolbox (Oberdieck et al., 2016), we generate POPs of the following form:
    TPOP(u):  min_x  (Qx + Hu + c)^T x
              s.t.   Ax ≤ b + Fu
                     x ∈ R^n
                     u ∈ U := {u ∈ R^q | CR_A u ≤ CR_b},          (5)

where Q ∈ R^{n×n} is positive definite, H ∈ R^{n×q} and c ∈ R^{n×1}. In the inequality constraints, A ∈ R^{m×n}, b ∈ R^{m×1} and F ∈ R^{m×q}. The last constraint of (5) restricts the parameter space, i.e., the admissible values of the external parameter (feature) u, where CR_A ∈ R^{2q×q} and CR_b ∈ R^{2q×1} allow for representations of constraints on u (in particular, lower and upper bounds, hence the number of rows is 2q), and q denotes the dimensionality of u. We assume q = 2 in our experiments.

To study (forward) POPs, it is standard to view the space of feasible parameters u as being partitioned into polyhedra called critical regions, each of which is uniquely defined by a set of optimal active constraints (Ahmadi-Moshkenani, Johansen, & Olaru, 2018; Oberdieck et al., 2016). It is known that if Q is positive definite and U is convex, the optimal objective function z(u) is continuous and piecewise quadratic, while if the problem is reduced to a multi-parametric linear program, then the optimal objective function z(u) is continuous, convex, and piecewise affine (see Oberdieck et al. (2016) and the references therein). Figure 4 shows an example of one of the instances we generate using the POP toolbox of Oberdieck et al. (2016) with 21 critical regions and a two-dimensional u; the critical regions are shown in the left panel while the corresponding optimal value function z(u) is shown on the right.

To be able to control the number of critical regions in our experiments, we manually pre-process the instances generated from the toolbox to define problems over smaller regions of the parameter space with a particular number of critical regions. For instance, from the example in Figure 4, we can generate a one-region, a three-region and a five-region problem by selecting the regions indicated by the black rectangle borders. Throughout this process, care is taken to select intervals where each critical region has close to a "fair share" of the region; in preliminary experiments, it was noted that the difficulty of learning over a parameter space interval that covers, say, five regions but is dominated by one particular region is similar to the case of one critical region (more discussion on the effect of critical regions on learning performance appears later in this section).

Figure 4: A POP instance defined over two parameters u_1 and u_2 generated using the POP toolbox of Oberdieck et al. (2016). The left panel shows 21 critical regions of the parameter space, while the right panel shows the piecewise quadratic optimal objective function z(u). Each critical region on the left corresponds to a quadratic 'piece' of z(u) on the right.

Varying the Training Set Size (Data Efficiency)
We compare the predictive performance of IO-perfect, IO-imperfect, SVR, RF and GP varying the training size from 1 to 500 on 20 randomly generated POPs of type (5) with two parameters u_1 and u_2, with 1 and 3 critical regions. For IO-imperfect we use the mean of the selected intervals of u in the (Hu)^T x term in the objective function instead of a parametric term.

Figures 5(a) and 5(b) illustrate the predictive (i.e., test set) performance of these methods (averaged over 20 instances) given data sets from one critical region and three critical regions, respectively. As expected, ML performance improves with increasing training set size in both cases. On the contrary, the IO methods are not affected: IO-perfect has nearly zero error while IO-imperfect incurs around 8% error, regardless of the training set size. Since IO is able to achieve low prediction error with a much smaller training size, we see that IO methods with correctly specified priors can be more data efficient than ML algorithms for problems where data is generated by an optimization process.

Because the IO approach we chose solves an optimization model, no matter how many points are given, IO-perfect is able to recover the parameters that make the observations optimal while IO-imperfect will maintain the same level of mis-specification. Insensitivity to the training set size is consistent with Aswani et al. (2018)'s results, which showed that Keshavarz et al. (2011)'s method is in general not risk consistent, i.e., it is not guaranteed to asymptotically provide the best possible predictions. However, we emphasize that despite being risk inconsistent, given the right prior, the Keshavarz et al. (2011) approach performs well, and can be the method of choice due to its data efficiency.

The test error for the ML methods increases for all training set sizes as we move from one critical region (Figure 5(a)) to three critical regions (Figure 5(b)). Comparing Figures 5(a) and 5(b), the curves for GP and RF in 5(a) intersect IO-imperfect after 200 points, but in 5(b) they require more points to be comparable to IO-imperfect. This observation suggests that the number of critical regions in the parameter space may be a factor that makes these problems more difficult for standard ML approaches, motivating the next experimental comparison.

Figure 5: Comparing IO-perfect, IO-imperfect, SVR, RF and GP varying training sample size and critical regions: (a) one critical region; (b) three critical regions.
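The qualitative shape of this comparison can be reproduced on synthetic data: an ML learner's error falls as the training set grows, while a fixed, slightly mis-specified model keeps a constant error floor. The target function and the 'imperfect' predictor below are our own toy stand-ins, not the paper's POPs.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
x_star = lambda u: 1.0 + np.abs(u - 0.5)     # toy optimizer function x*(u)
imperfect = lambda u: 1.0 + np.abs(u - 0.4)  # fixed, mis-specified model of x*(u)

u_test = np.linspace(0.0, 1.0, 200)

def mre(pred):
    """Mean relative error against the true x*(u) on the test grid, as in (3)."""
    return np.mean(np.abs(pred - x_star(u_test)) / x_star(u_test))

def rf_mre(n_train):
    u = rng.random(n_train)
    rf = RandomForestRegressor(n_estimators=50, random_state=0)
    rf.fit(u.reshape(-1, 1), x_star(u))
    return mre(rf.predict(u_test.reshape(-1, 1)))

# RF error shrinks as data grows; the mis-specified model's error never moves.
errors = {n: rf_mre(n) for n in (5, 50, 500)}
floor = mre(imperfect(u_test))
```

The dictionary of errors traces out a learning curve like the ML curves of Figure 5, while `floor` plays the role of the flat IO-imperfect line.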
Increasing the Number of Critical Regions

Figure 6 shows the average predictive performance of IO-imperfect and the three ML methods over 20 randomly generated POPs given a training set of 200 observations. For each of the 20 problem instances, we choose one interval of u with one critical region, one with three critical regions and one with five, keeping the chosen range of both u_1 and u_2 equal to two, as in Figure 4. We can see that IO-imperfect is not sensitive to the number of critical regions, while for all employed ML methods the test error increases with the number of critical regions. As mentioned above, every critical region is a polyhedron of u values over which the optimal objective function z(u) is quadratic (a quadratic 'piece' of the overall piecewise quadratic function) while the optimizer function x*(u) is affine (Oberdieck et al., 2016). It is not surprising that some of the ML methods can perform well with 200 points over one critical region, as they only have to learn one quadratic function. When the number of critical regions increases, the task for the chosen ML methods becomes more challenging: they have to learn multiple quadratic functions without any prior knowledge about when the transition from one region to another will occur. We do not rule out the possibility of creating new, custom learning algorithms to overcome this issue but doing so is beyond the scope of this paper.

Figure 6: Comparing IO-imperfect, SVR, RF and GP as the number of critical regions increases; training set size = 200, test set size = 200.
Varying the Functional Dependence of the Objective Function on u

Next, we investigate how the relationship between the TPOP objective function and the feature u affects the performance of ML and IO. We investigate this question in the context of one randomly generated POP (see Figure 7):

    z(u):  minimize_x  a_1 x_1^2 + (1 + u_1) x_1 + a_2 x_2^2 + (u_1 − u_2 + 1) x_2
           s.t.  Ax ≤ b + Fu,          (6)

where a_1, a_2 > 0 and the four linear constraints have right-hand sides that are affine in u_1 and u_2, with all numeric data (a_1, a_2, A, b, F) fixed by the toolbox-generated instance.

Figure 7: Problem (6) critical regions (left) and parametric objective function (right) generated using the POP Toolbox of Oberdieck et al. (2016).

We consider three objective functions for problem (6) defined through functions Ψ_1(u) and Ψ_2(u), which set the coefficients of x_1 and x_2:

    f = a_1 x_1^2 + Ψ_1(u) x_1 + a_2 x_2^2 + Ψ_2(u) x_2.          (7)

We start with a linear Ψ(u), followed by changing it to a multivariate rational function, and subsequently increasing the degree of the polynomials forming the rational function to create a relationship that could lead to a more challenging learning problem. As shown in Table 1, the test error of GP, SVR and RF increases as we move from the 'simple' (linear) dependence on u to making the coefficients of x_1 and x_2 multivariate rational functions of u. Since ML methods aim to find a mapping from u to x, a more complex Ψ(u) indeed makes learning more difficult. On the other hand, IO is not sensitive to the nature of Ψ(u), as this function is just the coefficient of x in the IO mathematical model.

Table 1: Comparing the predictive performance of ML and IO varying the form of the true objective function, with û_1 ∼ U(4, 6) and û_2 drawn uniformly from a negative interval; K is the training set size and the test set size is set equal to the training set size. The three rows correspond to true objective functions of form (7) with Ψ_1, Ψ_2 linear in u, multivariate rational in u, and rational with higher-degree polynomials, respectively; for each, MRE% at K = 100 and K = 200 is reported for IO-perfect (∼0 throughout), IO-imperfect, GP, SVR and RF.
Investigating the Impact of Correctness of Prior Knowledge

Above, we show that IO-perfect and IO-imperfect are insensitive to problem characteristics which significantly impact the performance of ML (training set size, number of critical regions and the nature of the dependence on u). However, a difference between IO-perfect and IO-imperfect was observable. To investigate the impact of the correctness of the objective function prior, we consider five objective function choices for problem (6) which encode different degrees of correctness (or mis-specification) of a prior.

In Table 2, given D, we aim to impute c_1 and c_2. The first objective function is the exact objective function of problem (6), i.e., the IO-perfect method. The subsequent functions deviate from the true function in the coefficients of x_1 and x_2 (but the goal is still to impute c_1 and c_2 from the data generated by the true objective function of problem (6)). Besides parametric objective functions (i.e., objective functions defined over u), we also consider two non-parametric choices. In the first non-parametric objective, we use the means of 1 + û_1 and û_1 − û_2 + 1 over the given training set û_1 and û_2; in the second non-parametric function, we completely eliminate the linear terms. We see that even slight differences in the function increase IO test error, although for the parametric functions the difference is small. The non-parametric mean-based function also does not lead to a significant increase, but this could change if a larger range of û is considered. Removing any encoding of the parametric parts of the objective function drastically increases the prediction error from 3.88% to 52.28%. These observations confirm the sensitivity of IO to the correctness of the objective function prior.

Table 2: Different parametrizations of the objective function of problem (6) and the corresponding test error, given û_1 ∼ U(4, 6) and û_2 drawn uniformly from a negative interval (K = 100). The five rows range from the exact parametric objective c_1 x_1^2 + (1 + u_1) x_1 + c_2 x_2^2 + (u_1 − u_2 + 1) x_2 (test error ∼0), through increasingly mis-specified parametric variants, to the non-parametric mean-based objective and the fully non-parametric objective c_1 x_1^2 + c_2 x_2^2.

Our experiments with randomly-generated POPs demonstrate the importance of four characteristics as determinants of the difficulty of objective function learning: (i) size of the training set, (ii) nature of the dependence of the optimization problem on the external parameters, (iii) level of confidence with regards to the correctness of the optimization prior, and (iv) number of critical regions in the parameter space of a POP. Increasing the size of the training set favours ML methods but not necessarily the IO methods. The contrast in performance between the problem in experiment 1 and the randomly-generated POPs suggests that ML methods are more likely to be successful on problems with no or few (parametric) constraints. Increasing the complexity of the relationship between u and the objective function also makes the problem more difficult for ML. On the other hand, the performance of IO is strongly dependent on the correctness of the objective function prior.

Perhaps the most important characteristic determining the difficulty of the objective function learning tasks is the number of critical regions. Over 20 randomly-generated POPs, we found a substantial increase in the relative test error for all three ML methods as the number of critical regions increased. In fact, we conjecture that the underlying reason why adding parametric constraints and/or increasing the complexity of the dependence on u made learning more difficult is because doing so also increased the number of critical regions.

Given these observations, a summary of recommendations of when to use ML or IO for learning the coefficients of a convex POP is needed. We summarize our recommendations in Table 3.
In each combination of a criterion and its value (low/small or high/large), we state whether we expect the classic ML methods or an IO method to be a better choice (or at least a better starting point) for solving the learning problem, while holding the other criteria fixed. For example, when the training set size is small, we expect IO to be a better option due to its data efficiency.

Table 3: Problem Characteristics and Suggested Methods.

                                         Low/Small    High/Large
    Size of Training Set                 IO           ML
    Dependence on External Parameter     ML           IO
    Confidence in Correctness of Prior   ML           IO
    Number of Critical Regions           ML           IO

The relative performance of the methods considered may not be the same for other types of learning problems where data was generated by an optimization problem (e.g., learning constraints, learning in discrete optimization problems). However, we conjecture that the analysis of the underlying structure of the value function or the optimizer function in terms of the external parameters u (i.e., analysis of the problem in terms of the critical regions) will be important both in gaining additional understanding of the challenges of learning from optimization data and in developing more sophisticated learning methods for such problems.
6. Conclusion
In this paper, we view inverse optimization as a problem of learning from decisions that are made through an unknown optimization process. We specifically focus on the problem of learning a convex objective function of a parametric optimization problem. We experimentally compare the predictive performance of an inverse optimization method with perfect and imperfect priors with three well-known machine learning algorithms: support vector regression, random forest and Gaussian processes. While we show that some inverse optimization problems can be tackled through classic machine learning approaches, we highlight the need for sophisticated inverse optimization models for problems where at least one of the following characteristics holds: (i) the size of the training set is small, (ii) both the constraints and the objective function of the problem in question are dependent on an external parameter (feature), particularly when that dependence is non-linear, (iii) we have high confidence that our knowledge of the parametric nature of the constraints and objective is correct, and (iv) the number of critical regions of the POP is large with no one region dominating. We believe that these observations provide practitioners with guidance on when to consider employing inverse optimization instead of, or in addition to, classical machine learning.
Acknowledgements

The authors are supported by the Natural Sciences and Engineering Research Council of Canada.

Appendix: KKT-based Inverse Optimization Model

The Keshavarz et al. (2011) IO model for (1) is

$$
\begin{aligned}
\mathrm{IO}(\mathcal{D}):\quad \min_{\boldsymbol{\lambda},\,\mathbf{c}} \quad & \sum_{k \in K} \phi\!\left(r^{\mathrm{stat}}_{k}, r^{\mathrm{comp}}_{k}\right) \\
\text{s.t.} \quad & r^{\mathrm{stat}}_{jk} = \frac{\partial f(\hat{\mathbf{u}}_k, \hat{\mathbf{x}}_k, \mathbf{c})}{\partial x_j} + \sum_{i=1}^{m} \lambda_{ki} \frac{\partial g_i(\hat{\mathbf{u}}_k, \hat{\mathbf{x}}_k)}{\partial x_j} \qquad \forall\, j \in J,\; k \in K, \\
& r^{\mathrm{comp}}_{ik} = -\lambda_{ik}\, g_i(\hat{\mathbf{u}}_k, \hat{\mathbf{x}}_k) \qquad \forall\, i \in I,\; k \in K, \\
& \boldsymbol{\lambda} \geq 0,
\end{aligned}
$$

where $\phi$ is some norm; $r^{\mathrm{stat}}$ and $r^{\mathrm{comp}}$ represent the stationarity and complementary slackness residuals, respectively, of the KKT conditions; and $\boldsymbol{\lambda}$ is the dual variable. The missing objective function coefficient vector $\mathbf{c}$ is a decision variable, and the pairs $(\hat{\mathbf{x}}_k, \hat{\mathbf{u}}_k)$ are the inputs.
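As a hedged illustration of this model (a sketch under assumed problem data, not code from the paper): for a toy quadratic forward problem f(u, x, c) = 0.5‖x − c‖² with a single affine constraint g(u, x) = x₁ + x₂ − u ≤ 0, choosing a squared norm for φ makes the IO model a bound-constrained linear least-squares problem in (c, λ), which scipy can solve directly. The forward problem, c_true, and the values of u are all illustrative assumptions.

```python
import numpy as np
from scipy.optimize import lsq_linear

# Assumed forward problem: minimize 0.5*||x - c||^2  s.t.  x1 + x2 <= u.
c_true = np.array([0.6, 0.9])


def forward(u):
    """Forward optimizer: project c_true onto {x : x1 + x2 <= u}."""
    excess = c_true.sum() - u
    return c_true - max(excess, 0.0) / 2.0 * np.ones(2)


us = [2.0, 1.0, 0.5]
xs = [forward(u) for u in us]          # observed optimal decisions x_hat_k

# KKT-residual IO model with squared-norm phi, as least squares in
# z = (c1, c2, lam_1, ..., lam_K):
#   r_stat_k = (x_hat_k - c) + lam_k * grad g,   grad g = (1, 1)
#   r_comp_k = lam_k * g(u_k, x_hat_k)
K, n = len(us), 2
A = np.zeros((K * n + K, n + K))
b = np.zeros(K * n + K)
for k, (u, xh) in enumerate(zip(us, xs)):
    for j in range(n):                  # stationarity rows
        A[k * n + j, j] = -1.0          # -c_j  (from grad f = x - c)
        A[k * n + j, n + k] = 1.0       # +lam_k (dg/dx_j = 1)
        b[k * n + j] = -xh[j]
    A[K * n + k, n + k] = xh.sum() - u  # complementarity row: lam_k * g(x_hat_k)

lb = np.concatenate([np.full(n, -np.inf), np.zeros(K)])  # lam >= 0
res = lsq_linear(A, b, bounds=(lb, np.full(n + K, np.inf)))
c_hat = res.x[:n]
print("imputed c:", c_hat)
```

The observation with u = 2.0 leaves the constraint slack, which pins down c exactly; the other observations then determine the dual variables, so the recovered c_hat matches c_true in this noiseless instance.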
References
Ahmadi-Moshkenani, P., Johansen, T. A., & Olaru, S. (2018). Combinatorial approach toward multiparametric quadratic programming based on characterizing adjacent critical regions. IEEE Transactions on Automatic Control, (10), 3221–3231.

Ahuja, R. K., & Orlin, J. B. (2001). Inverse optimization. Operations Research, (5), 771–783.

Aswani, A., Kaminsky, P., Mintz, Y., Flowers, E., & Fukuoka, Y. (2019). Behavioral modeling in weight loss interventions. European Journal of Operational Research, (3), 1058–1072.

Aswani, A., Shen, Z.-J., & Siddiq, A. (2018). Inverse optimization with noisy data. Operations Research, (3), 870–892.

Babier, A., Chan, T. C. Y., Lee, T., Mahmood, R., & Terekhov, D. (2021). An ensemble learning framework for model fitting and evaluation in inverse linear optimization. INFORMS Journal on Optimization, Articles in Advance, 1–20. Retrieved from https://doi.org/10.1287/ijoo.2019.0045

Bärmann, A., Pokutta, S., & Schneider, O. (2017). Emulating the expert: Inverse optimization through online learning. In International Conference on Machine Learning (pp. 400–410).

Breiman, L. (2001). Random forests. Machine Learning, (1), 5–32.

Bunduchi, E., & Mandric, I. (2011). Parametric linear programming. Retrieved from http://demonstrations.wolfram.com/ParametricLinearProgramming/

Burton, D., & Toint, P. L. (1992). On an instance of the inverse shortest paths problem. Mathematical Programming, (1-3), 45–61.

Burton, D., & Toint, P. L. (1994). On the use of an inverse shortest paths algorithm for recovering linearly correlated costs. Mathematical Programming, (1-3), 1–22.

Chan, T. C., Craig, T., Lee, T., & Sharpe, M. B. (2014). Generalized inverse multiobjective optimization with application to cancer therapy. Operations Research, (3), 680–695.

Chan, T. C., Lee, T., & Terekhov, D. (2019). Inverse optimization: Closed-form solutions, geometry, and goodness of fit. Management Science, (3), 1115–1135.

Chow, J., & Recker, W. (2012). Inverse optimization with endogenous arrival time constraints to calibrate the household activity pattern problem. Transportation Research Part B: Methodological, (3), 463–479.

Drucker, H., Burges, C., Kaufman, L., Smola, A., & Vapnik, V. (1997). Support vector regression machines. In Advances in Neural Information Processing Systems (pp. 155–161).

Egri, P., Kis, T., Kovács, A., & Váncza, J. (2014). An inverse economic lot-sizing approach to eliciting supplier cost parameters. International Journal of Production Economics, 80–88.

Esfahani, P., Shafieezadeh-Abadeh, S., Hanasusanto, G., & Kuhn, D. (2018). Data-driven inverse optimization with imperfect information. Mathematical Programming, (1), 191–234.

Fernández-Blanco, R., Morales, J. M., Pineda, S., & Porras, Á. (2019). EV-fleet power forecasting via kernel-based inverse optimization. arXiv preprint arXiv:1908.00399.

Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. MIT Press.

Keshavarz, A., Wang, Y., & Boyd, S. (2011). Imputing a convex objective function. In Intelligent Control (ISIC), 2011 IEEE International Symposium on (pp. 613–619).

Kovács, A. (2019). Parameter elicitation for consumer models in demand response management. Industrial & Engineering Chemistry Research, (33), 8979–8991. Retrieved from https://doi.org/10.1021/acs.iecr.6b01913

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., . . . Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 2825–2830.

Pistikopoulos, E. N., Georgiadis, M. C., & Dua, V. (2011). Multi-parametric programming (Vol. 1).

Rasmussen, C. (2004). Gaussian processes in machine learning. In Advanced Lectures on Machine Learning (pp. 63–71). Springer.

Russell, S. J., & Norvig, P. (2016). Artificial intelligence: A modern approach. Pearson Education Limited.

Saez-Gallego, J., & Morales, J. M. (2017). Short-term forecasting of price-responsive loads using inverse optimization. IEEE Transactions on Smart Grid, (5), 4805–4814.

Shahmoradi, Z., & Lee, T. (2020). Quantile inverse optimization: Improving stability in inverse linear programming.

Tan, Y., Delong, A., & Terekhov, D. (2019). Deep inverse optimization. In International Conference on Integration of Constraint Programming, Artificial Intelligence, and Operations Research (pp. 540–556).

Tan, Y., Terekhov, D., & Delong, A. (2020). Learning linear programs from optimal decisions. In Advances in Neural Information Processing Systems (Vol. 33).

Taskar, B., Chatalbashev, V., Koller, D., & Guestrin, C. (2005). Learning structured prediction models: A large margin approach. In Proceedings of the 22nd International Conference on Machine Learning (pp. 896–903).

Tavaslıoğlu, O., Lee, T., Valeva, S., & Schaefer, A. J. (2018). On the structure of the inverse-feasible region of a linear program. Operations Research Letters, (1), 147–152.