Comparing Inverse Optimization and Machine Learning Methods for Imputing a Convex Objective Function
Elaheh H. Iraj and Daria Terekhov
Department of Mechanical, Industrial and Aerospace Engineering, Gina Cody School of Engineering and Computer Science, Concordia University, Montréal, QC, Canada
Abstract
Inverse optimization (IO) aims to determine optimization model parameters from observed decisions. However, IO is not part of a data scientist's toolkit in practice, especially as many general-purpose machine learning packages are widely available as an alternative. When encountering IO, practitioners face the question of when, or even whether, investing in developing IO methods is worthwhile. Our paper provides a starting point toward answering these questions, focusing on the problem of imputing the objective function of a parametric convex optimization problem. We compare the predictive performance of three standard supervised machine learning (ML) algorithms (random forest, support vector regression and Gaussian process regression) to the performance of the IO model of Keshavarz, Wang, and Boyd (2011). While the IO literature focuses on the development of methods tailored to particular problem classes, our goal is to evaluate general 'out-of-the-box' approaches. Our experiments demonstrate that determining whether to use an ML or IO approach requires considering (i) the training set size, (ii) the dependence of the optimization problem on external parameters, (iii) the level of confidence with regards to the correctness of the optimization prior, and (iv) the number of critical regions in the solution space.
KEYWORDS inverse optimization; machine learning; data efficiency; parametric optimization; multi-parametric programming
1. Introduction
Inverse optimization (IO) imputes missing optimization model parameters from data that represents minimally sub-optimal solutions of that unknown optimization problem (e.g., past decisions of an optimizing agent). In the academic literature, IO has been shown successful in a variety of problems, including medical decision making (Chan, Craig, Lee, & Sharpe, 2014), electricity demand forecasting (Saez-Gallego & Morales, 2017), the household activity pattern problem (Chow & Recker, 2012) and economic lot-sizing (Egri, Kis, Kovács, & Váncza, 2014). However, IO is rarely used by practitioners.

Previous work has established an analogy between IO and regression (Chan, Lee, & Terekhov, 2019), and provided a statistical inference perspective on IO (Aswani, Shen, & Siddiq, 2018). More generally, one can notice that imputing parameter values from data can be viewed as a machine learning problem (Tan, Delong, & Terekhov, 2019).
Email address: [email protected], [email protected] (corresponding author)
2. Background and Related Work
To define an IO problem, we first need to postulate a forward optimization problem, i.e., the problem whose solutions, perhaps corrupted by noise, we observe. The initial focus of the IO literature was to impute coefficients of forward optimization problems that do not have any dependence on an external parameter (non-parametric). Burton and Toint (1992, 1994) study the inverse shortest path problem in which the goal is to find a minimal perturbation of the arc costs to make an observed set of paths optimal. Given an observed value for the decision variables x, Ahuja and Orlin (2001) find a c vector of minimal distance to a known vector ĉ that would make x optimal. Chan et al. (2014) and Chan et al. (2019) study the inverse linear programming problem when the observed data is noisy and no prior ĉ is given. Tavaslıoğlu, Lee, Valeva, and Schaefer (2018) characterize the inverse feasible region: the set of objectives that would make a given set of feasible solutions to a linear program optimal. Shahmoradi and Lee (2020) introduce a way to measure how sensitive the set of optimal cost vectors is to changes in a given data set.

A parametric optimization problem (also known as a multi-parametric programming problem) is a family of optimization problems parametrized by an independent parameter; the forward parametric problem involves finding a function that would map values of the independent parameter to optimal solutions (Pistikopoulos, Georgiadis, & Dua, 2011). As an example, consider the following parametric linear program with independent parameters (u_1, u_2):

    min  (c_1 + c_3 u_1) x_1 + (c_2 + c_4 u_2) x_2
    s.t. a_1 x_1 + a_2 x_2 ≥ b,

where instantiating u_1 and u_2 leads to a standard (non-parametric) optimization problem. There is substantial interest in the inverse parametric optimization problem (Aswani et al., 2018; Chow & Recker, 2012; Esfahani, Shafieezadeh-Abadeh, Hanasusanto, & Kuhn, 2018; Keshavarz et al., 2011; Kovács, 2019; Saez-Gallego & Morales, 2017; Tan et al., 2019; Tan, Terekhov, & Delong, 2020). In such a problem, pairs (û_k, x̂_k), where û_k is an independent parameter for observation k and x̂_k is an observed value for the decision variables x, are given; the goal is to impute objective function and/or constraint coefficients that would make the observed points optimal or near-optimal solutions of the given parametric forward optimization problem.

Classic IO methods (Ahuja & Orlin, 2001; Burton & Toint, 1992; Keshavarz et al., 2011) rely on some form of optimality conditions, such as the Karush-Kuhn-Tucker (KKT) conditions. More recently, methods based on a combination of ML and IO have emerged (Babier, Chan, Lee, Mahmood, & Terekhov, 2021; Tan et al., 2019, 2020). Aswani, Kaminsky, Mintz, Flowers, and Fukuoka (2019) and Fernández-Blanco, Morales, Pineda, and Porras (2019) include comparisons of an IO approach to several ML algorithms in the context of a particular application. Related work in the ML literature includes, most notably, structured prediction (Taskar, Chatalbashev, Koller, & Guestrin, 2005). Despite this related ML-based work, a comparison of classic methods on general parametric IO problems has not previously been done and the questions posed in the introduction have not been answered.
3. Problem Definition
We study IO for parametric optimization problems (POPs), which are 'compatible' with an ML perspective, unlike the non-parametric optimization ones. POPs are mathematical programs where the objective function and constraints are functions of one or multiple external parameters (Oberdieck et al., 2016; Pistikopoulos et al., 2011). The forward parametric optimization problem (FOP) of interest to us is:
    FOP(u, c):  minimize_x  f(u, x, c)
                subject to  g(u, x) ≤ 0,          (1)

where x ∈ R^n are the decision variables, u ∈ R^v are the independent parameters ('features' in ML parlance), and c ∈ R^n are the coefficients of the objective function f. We assume f and g are differentiable and convex in x for each value of u. We let J be the index set of variables with |J| = n, and I be the index set of constraints with |I| = m; we let 0 denote the column vector of zeros of the required dimension.

In an inverse parametric optimization problem, observations of pairs D = {(x̂_1, û_1), ..., (x̂_K, û_K)} are given; the goal is to find a c that would make x̂_k minimally sub-optimal for the corresponding FOP(û_k, c), for each k ∈ K = {1, ..., K}:

    minimize_c  { Σ_{k ∈ K} L(x̂_k, x_k^pred)  |  x_k^pred ∈ argmin_{x ∈ X_û_k} f(û_k, x, c) },          (2)

where L is any given loss function, x_k^pred is an optimal solution to the fitted forward optimization model instantiated at u = û_k, and X_û_k = {x | g(û_k, x) ≤ 0} is the feasible region corresponding to û_k. There are various ways to reformulate (2); the particular model we use here is the model proposed by Keshavarz et al. (2011), which formulates the optimality criterion via the KKT optimality conditions (see appendix).

The goal of IO could be explanatory, i.e., we aim to characterize the process that generated the data, or predictive, i.e., we want to fit a model that will predict the actions of an optimizing agent/process. That is, having imputed c and given u_new, we could use FOP(u_new, c) to predict x_new. Fitting a model to data can also result in better prescriptive performance, i.e., if instead of predicting what an optimizing agent/process would do, we want to prescribe an optimal decision under new circumstances u_new (this latter perspective being more consistent with classical optimization).
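As a concrete illustration of the prediction step, the following sketch solves a small hypothetical instance of FOP(u, c) numerically. The quadratic f and linear g below are our own toy choices (not an instance from this paper); the function name solve_fop is likewise illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def solve_fop(u, c):
    """Solve one toy instance of the forward problem FOP(u, c):
    minimize_x f(u, x, c) subject to g(u, x) <= 0.
    Here f(u, x, c) = (c[0] + u) x1^2 + c[1] x2^2 - 2 x1 - 2 x2, which is
    convex in x when c[0] + u > 0 and c[1] > 0, and g encodes the
    constraints x1 + x2 <= 4 and x >= 0."""
    f = lambda x: (c[0] + u) * x[0] ** 2 + c[1] * x[1] ** 2 - 2 * x[0] - 2 * x[1]
    cons = [
        {"type": "ineq", "fun": lambda x: 4.0 - x[0] - x[1]},  # -g_1(u, x) >= 0
        {"type": "ineq", "fun": lambda x: x},                  # x >= 0
    ]
    return minimize(f, x0=np.array([1.0, 1.0]), constraints=cons).x

# Having imputed c, solving FOP(u_new, c) yields the prediction x_new.
```

With c = (1, 1) and u = 0 the unconstrained minimizer (1, 1) is feasible, so the solver returns it; changing u moves the predicted decision.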
4. Inverse Optimization Problem as a Learning Problem
Learning is the capability to acquire knowledge by extracting patterns from raw data (Goodfellow, Bengio, & Courville, 2016); in this sense, IO fits within the broader conceptual framework of learning, i.e., we are acquiring knowledge about an optimization process that generated the data we observe. Applied to the IO problem described in Section 3, supervised ML would aim to find a vector-valued function h(u, β) from u (input variable or feature) to x (target variable) such that the loss function L is minimized (Russell & Norvig, 2016). Once h(u, β) is imputed, it can be used for prediction given new feature observations u_new. Figure 1 illustrates this observation: in the left panel, we show that IO takes pairs of (û, x̂) (i.e., data) as input, imputes the value of c (i.e., knowledge) and then, given u_new and an imputed c, makes a prediction x_new; in the right panel, we show that the same happens in a typical supervised ML framework, with the difference being the actual models and methods used to impute the unknown parameters and, consequently, the learned model that is used for prediction.

Figure 1: IO and ML perspectives applied to an IO problem.

The fundamental assumption in IO is that the observations are near-optimal or optimal solutions to an optimization problem, whereas general ML methods do not require such an assumption. Based on the description above, an IO problem can be seen as part of the same conceptual framework as machine learning, the main difference being that in IO data is assumed to come from an optimization process. Viewed another way, we can say that IO assumes a strong prior that the input data is coming from an optimization process, and IO methods leverage this prior knowledge.

Figure 2: A data set and the fitted optimization model (points A to D are the training set, point E is the test set).

Consider Figure 2 (a): points A to D are given and our goal is to predict the next observation.
Point E is an unseen test point for which we want to evaluate the quality of prediction. Given no additional information, a natural idea would be to fit a curve using simple linear or polynomial regression. The training error for these models may be low but the test error would be high, which would normally signal over-fitting to the training data. However, here the problem would not be over-fitting; rather, the issue is that the data is not coming from a linear or a polynomial relationship between x_1 and x_2, and also not from a linear or polynomial relationship between a feature u and each of the x_i. In reality, A, B, C and D are optimal solutions to the parametric optimization problem P(u): max (−u)x_1 − (3 + 10u)x_2 subject to the convex hull of (A, B, C, D, E), shown in Figure 2 (d), with each of A, B, C, D and E being the optimal solution over a distinct interval of u. This example was generated using the Wolfram Parametric Linear Programming app (Bunduchi & Mandric, 2011).

Just like we aim to fit a linear regression model when we conjecture the relationship is linear, or a polynomial one when we conjecture it is polynomial, we can fit an optimization model when we think the data is coming from an optimization problem. Prior knowledge of the process that generated the data can have a substantial impact on the ability to fit a good model; the above example illustrates specifically the effect of incorporating an optimization model prior. Next, we compare the predictive performance of IO methods, which assume an optimization prior, and ML methods, which do not, when data is generated from parametric convex optimization problems.
5. Experiments
Our goal is to evaluate 'out-of-the-box' approaches that can be easily implemented in practice: a KKT-based IO model (Keshavarz et al., 2011), random forest (Breiman, 2001), support vector regression (Drucker, Burges, Kaufman, Smola, & Vapnik, 1997) and Gaussian process regression (Rasmussen, 2004). We use the scikit-learn (Pedregosa et al., 2011) implementation of the ML methods. Keshavarz et al. (2011)'s KKT-based IO model is solved using Concert Technology of IBM ILOG CPLEX v12.7.0. All experiments are run on an HP server with 20 Intel(R) Xeon(R) CPU E5-2687W v3 @ 3.10GHz processors and 512 GB RAM under a Linux environment.

To show the value of prior knowledge for IO, we consider two types of IO models. In IO-perfect, we have perfect information about the form of the unknown objective function (but do not know its parameters). In IO-imperfect, the objective function is misspecified, and so differs from the true one due to modeling error.
Algorithm 1 Data Generation
    for k ← 1 to K do
        generate û_k from uniform distributions
        solve FOP_true(û_k, c_true) and obtain x̂_k
    end for
    D ← {(x̂_1, û_1), (x̂_2, û_2), ..., (x̂_K, û_K)}

Pairs (û_k, x̂_k) are generated using Algorithm 1 and divided into a training set and a test set. We use leave-one-out cross-validation on the training set for the ML methods: we pick one point as our validation set, fit a model to the remaining data and evaluate the error on the single held-out point. A validation error estimate is obtained by repeating this procedure for each of the training points and averaging the results. Doing so with several hyperparameter settings enables us to choose the hyperparameter setting with the lowest average cross-validation error. All of the ML experimental results presented below show the error on the test set for the best hyperparameter setting found during the leave-one-out cross-validation procedure.

We measure the deviation between the predicted x and x_true in the test set (of size M) using mean relative error (MRE):

    MRE = (1/M) Σ_{m=1}^{M} ||x^m − x_true^m|| / ||x_true^m||.          (3)

Consider imputing a customer's unknown utility function given observations of their past purchases (Bärmann, Pokutta, & Schneider, 2017; Keshavarz et al., 2011). We assume that customers solve an internal optimization problem in the course of purchasing goods to maximize their utility function:
    UFOP(p):  minimize_x  p^T x − U(x)
              subject to  x ≥ 0,          (4)

where x ∈ R^n and p ∈ R^n are the vectors of consumer demand and the associated price, respectively. Similarly to Keshavarz et al. (2011), we assume that the function U(x) is a concave utility function which is non-decreasing over [0, x_max], x_max being the maximum demand level found from past observations. Since U(x) is concave, the objective function of (4) is convex.

We generate each p_i from a uniform distribution and set U_true(x) = Σ_{i=1}^n √x_i. We use U_perfect(x) = Σ_{i=1}^n c_i √x_i in implementing IO-perfect (i.e., the model is correctly specified so that all observations can be made optimal solutions for their corresponding p_i). For IO-imperfect, we use U_imperfect(x) = Σ_{i=1}^n (q_i x_i^2 + 2 r_i x_i).

Based on the results of our hyperparameter search during leave-one-out cross-validation, we use the scikit-learn implementation of random forest with n_estimators=50, max_depth=None, max_features=None; support vector regression with C=0.1, gamma="auto" and an RBF kernel; and Gaussian process regression with an RBF kernel, alpha=1e-10, normalize_y=False, optimizer="fmin_l_bfgs_b" and n_restarts_optimizer=10.

Figure 3 shows how the relative test error of IO-perfect, IO-imperfect, RF, SVR and GP changes when we increase the training size. Whereas RF, SVR and GP performance improves, the performance of IO-imperfect and IO-perfect is not affected. GP achieves performance similar to that of IO-imperfect with only 20 observations, while RF and SVR get close to IO-imperfect at 60 and 100 points, respectively. The ML methods perform well because the underlying optimization process (resulting from model (4)) is constrained by only non-negativity constraints and is parametric only in the objective function: the relationship between the input and output of this process can be captured (learned) without knowledge of the underlying optimization model. In the next experiment, we aim to identify IO problems which would be more difficult for these ML methods and would require knowledge of the underlying optimization structure to make good predictions.

Figure 3: Comparing IO-imperfect, IO-perfect, RF, SVR, and GP for utility function estimation; each point is an average over 15 randomly-generated problems.

5.3. Experiment 2: Randomly-Generated Parametric Optimization Problems

Using the POP toolbox (Oberdieck et al., 2016), we generate POPs of the following form:
    TPOP(u):  min_x  (Qx + Hu + c)^T x
              s.t.   Ax ≤ b + Fu
                     x ∈ R^n
                     u ∈ U := {u ∈ R^q | CR_A u ≤ CR_b},          (5)

where Q ∈ R^{n×n} is positive definite, H ∈ R^{n×q} and c ∈ R^{n×1}. In the inequality constraints, A ∈ R^{m×n}, b ∈ R^{m×1} and F ∈ R^{m×q}. The last constraint of (5) restricts the parameter space, i.e., the admissible values of the external parameter (feature) u, where CR_A ∈ R^{2q×q} and CR_b ∈ R^{2q×1} allow for representations of constraints on u (in particular, lower and upper bounds, hence the number of rows is 2q), and q denotes the dimensionality of u. We assume q = 2 in our experiments.

To study (forward) POPs, it is standard to view the space of feasible parameters u as being partitioned into polyhedra called critical regions, each of which is uniquely defined by a set of optimal active constraints (Ahmadi-Moshkenani, Johansen, & Olaru, 2018; Oberdieck et al., 2016). It is known that if Q is positive definite and U is convex, the optimal objective function z(u) is continuous and piecewise quadratic, while if the problem is reduced to a multi-parametric linear program, then the optimal objective function z(u) is continuous, convex, and piecewise affine (see Oberdieck et al. (2016) and the references therein). Figure 4 shows an example of one of the instances we generate using the POP toolbox of Oberdieck et al. (2016) with 21 critical regions and a two-dimensional u; the critical regions are shown in the left panel while the corresponding optimal value function z(u) is shown on the right.

To be able to control the number of critical regions in our experiments, we manually pre-process the instances generated from the toolbox to define problems over smaller regions of the parameter space with a particular number of critical regions. For instance, from the example in Figure 4, we can generate a one-region, a three-region and a five-region problem by selecting the regions indicated by the black rectangle borders. Throughout this process, care is taken to select intervals where each critical region has close to a "fair share" of the region; in preliminary experiments, it was noted that the difficulty of learning over a parameter space interval that covers, say, five regions but is dominated by one particular region is similar to the case of one critical region (more discussion on the effect of critical regions on learning performance appears later in this section).

Figure 4: A POP instance defined over two parameters u_1 and u_2 generated using the POP toolbox of Oberdieck et al. (2016). The left panel shows 21 critical regions of the parameter space, while the right panel shows the piecewise quadratic optimal objective function z(u). Each critical region on the left corresponds to a quadratic 'piece' of z(u) on the right.

Varying the Training Set Size (Data Efficiency)
We compare the predictive performance of IO-perfect, IO-imperfect, SVR, RF and GP varying the training size from 1 to 500 on 20 randomly generated POPs of type (5) with two parameters u_1 and u_2, with 1 and 3 critical regions. For IO-imperfect we use the mean of the selected intervals of u in the (Hu)^T x term in the objective function instead of a parametric term.

Figures 5(a) and 5(b) illustrate the predictive (i.e., test set) performance of these methods (averaged over 20 instances) given data sets from one critical region and three critical regions, respectively. As expected, ML performance improves with increasing training set size in both cases. On the contrary, the IO methods are not affected: IO-perfect has nearly zero error while IO-imperfect incurs around 8% error, regardless of the training set size. Since IO is able to achieve low prediction error with a much smaller training size, we see that IO methods with correctly specified priors can be more data efficient than ML algorithms for problems where data is generated by an optimization process.

Because the IO approach we chose solves an optimization model, no matter how many points are given, IO-perfect is able to recover the parameters that make the observations optimal while IO-imperfect will maintain the same level of mis-specification. Insensitivity to the training set size is consistent with Aswani et al. (2018)'s results, which showed that Keshavarz et al. (2011)'s method is in general not risk consistent, i.e., it is not guaranteed to asymptotically provide the best possible predictions. However, we emphasize that despite being risk inconsistent, given the right prior, the Keshavarz et al. (2011) approach performs well, and can be the method of choice due to its data efficiency.

The test error for the ML methods increases for all training set sizes as we move from one critical region (Figure 5(a)) to three critical regions (Figure 5(b)). Comparing Figures 5(a) and 5(b), the curves for GP and RF in 5(a) intersect IO-imperfect after 200 points, but in 5(b) they require more points to be comparable to IO-imperfect. This observation suggests that the number of critical regions in the parameter space may be a factor that makes these problems more difficult for standard ML approaches, motivating the next experimental comparison.

Figure 5: Comparing IO-perfect, IO-imperfect, SVR, RF and GP varying training sample size and critical regions: (a) one critical region; (b) three critical regions.
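The qualitative shape of this comparison can be reproduced on synthetic data: an ML learner's error falls as the training set grows, while a fixed, slightly mis-specified model keeps a constant error floor. The target function and the 'imperfect' predictor below are our own toy stand-ins, not the paper's POPs.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
x_star = lambda u: 1.0 + np.abs(u - 0.5)     # toy optimizer function x*(u)
imperfect = lambda u: 1.0 + np.abs(u - 0.4)  # fixed, mis-specified model of x*(u)

u_test = np.linspace(0.0, 1.0, 200)

def mre(pred):
    """Mean relative error against the true x*(u) on the test grid, as in (3)."""
    return np.mean(np.abs(pred - x_star(u_test)) / x_star(u_test))

def rf_mre(n_train):
    u = rng.random(n_train)
    rf = RandomForestRegressor(n_estimators=50, random_state=0)
    rf.fit(u.reshape(-1, 1), x_star(u))
    return mre(rf.predict(u_test.reshape(-1, 1)))

# RF error shrinks as data grows; the mis-specified model's error never moves.
errors = {n: rf_mre(n) for n in (5, 50, 500)}
floor = mre(imperfect(u_test))
```

The dictionary of errors traces out a learning curve like the ML curves of Figure 5, while `floor` plays the role of the flat IO-imperfect line.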
Increasing the Number of Critical Regions

Figure 6 shows the average predictive performance of IO-imperfect and the three ML methods over 20 randomly generated POPs given a training set of 200 observations. For each of the 20 problem instances, we choose one interval of u with one critical region, one with three critical regions and one with five, keeping the chosen range of both u_1 and u_2 equal to two, as in Figure 4. We can see that IO-imperfect is not sensitive to the number of critical regions, while for all employed ML methods the test error increases with the number of critical regions. As mentioned above, every critical region is a polyhedron of u values over which the optimal objective function z(u) is quadratic (a quadratic 'piece' of the overall piecewise quadratic function) while the optimizer function x*(u) is affine (Oberdieck et al., 2016). It is not surprising that some of the ML methods can perform well with 200 points over one critical region, as they only have to learn one quadratic function. When the number of critical regions increases, the task for the chosen ML methods becomes more challenging: they have to learn multiple quadratic functions without any prior knowledge about when the transition from one region to another will occur. We do not rule out the possibility of creating new, custom learning algorithms to overcome this issue but doing so is beyond the scope of this paper.

Figure 6: Comparing IO-imperfect, SVR, RF and GP as the number of critical regions increases; training set size = 200, test set size = 200.
Varying the Functional Dependence of the Objective Function on u

Next, we investigate how the relationship between the TPOP objective function and the feature u affects the performance of ML and IO. We investigate this question in the context of one randomly generated POP (see Figure 7):

    z(u):  minimize_x  a_1 x_1^2 + (1 + u_1) x_1 + a_2 x_2^2 + (u_1 − u_2 + 1) x_2
           s.t.  Ax ≤ b + Fu,          (6)

where a_1, a_2 > 0 and the four linear constraints have right-hand sides that are affine in u_1 and u_2, with all numeric data (a_1, a_2, A, b, F) fixed by the toolbox-generated instance.

Figure 7: Problem (6) critical regions (left) and parametric objective function (right) generated using the POP Toolbox of Oberdieck et al. (2016).

We consider three objective functions for problem (6) defined through functions Ψ_1(u) and Ψ_2(u), which set the coefficients of x_1 and x_2:

    f = a_1 x_1^2 + Ψ_1(u) x_1 + a_2 x_2^2 + Ψ_2(u) x_2.          (7)

We start with a linear Ψ(u), followed by changing it to a multivariate rational function, and subsequently increasing the degree of the polynomials forming the rational function to create a relationship that could lead to a more challenging learning problem. As shown in Table 1, the test error of GP, SVR and RF increases as we move from the 'simple' (linear) dependence on u to making the coefficients of x_1 and x_2 multivariate rational functions of u. Since ML methods aim to find a mapping from u to x, a more complex Ψ(u) indeed makes learning more difficult. On the other hand, IO is not sensitive to the nature of Ψ(u), as this function is just the coefficient of x in the IO mathematical model.

Table 1: Comparing the predictive performance of ML and IO varying the form of the true objective function, with û_1 ∼ U(4, 6) and û_2 drawn uniformly from a negative interval; K is the training set size and the test set size is set equal to the training set size. The three rows correspond to true objective functions of form (7) with Ψ_1, Ψ_2 linear in u, multivariate rational in u, and rational with higher-degree polynomials, respectively; for each, MRE% at K = 100 and K = 200 is reported for IO-perfect (∼0 throughout), IO-imperfect, GP, SVR and RF.
Investigating the Impact of Correctness of Prior Knowledge

Above, we show that IO-perfect and IO-imperfect are insensitive to problem characteristics which significantly impact the performance of ML (training set size, number of critical regions and the nature of the dependence on u). However, a difference between IO-perfect and IO-imperfect was observable. To investigate the impact of the correctness of the objective function prior, we consider five objective function choices for problem (6) which encode different degrees of correctness (or mis-specification) of a prior.

In Table 2, given D, we aim to impute c_1 and c_2. The first objective function is the exact objective function of problem (6), i.e., the IO-perfect method. The subsequent functions deviate from the true function in the coefficients of x_1 and x_2 (but the goal is still to impute c_1 and c_2 from the data generated by the true objective function of problem (6)). Besides parametric objective functions (i.e., objective functions defined over u), we also consider two non-parametric choices. In the first non-parametric objective, we use the means of 1 + û_1 and û_1 − û_2 + 1 over the given training set û_1 and û_2; in the second non-parametric function, we completely eliminate the linear terms. We see that even slight differences in the function increase IO test error, although for the parametric functions the difference is small. The non-parametric mean-based function also does not lead to a significant increase, but this could change if a larger range of û is considered. Removing any encoding of the parametric parts of the objective function drastically increases the prediction error from 3.88% to 52.28%. These observations confirm the sensitivity of IO to the correctness of the objective function prior.

Table 2: Different parametrizations of the objective function of problem (6) and the corresponding test error, given û_1 ∼ U(4, 6) and û_2 drawn uniformly from a negative interval (K = 100). The five rows range from the exact parametric objective c_1 x_1^2 + (1 + u_1) x_1 + c_2 x_2^2 + (u_1 − u_2 + 1) x_2 (test error ∼0), through increasingly mis-specified parametric variants, to the non-parametric mean-based objective and the fully non-parametric objective c_1 x_1^2 + c_2 x_2^2.

Our experiments with randomly-generated POPs demonstrate the importance of four characteristics as determinants of the difficulty of objective function learning: (i) size of the training set, (ii) nature of the dependence of the optimization problem on the external parameters, (iii) level of confidence with regards to the correctness of the optimization prior, and (iv) number of critical regions in the parameter space of a POP. Increasing the size of the training set favours ML methods but not necessarily the IO methods. The contrast in performance between the problem in experiment 1 and the randomly-generated POPs suggests that ML methods are more likely to be successful on problems with no or few (parametric) constraints. Increasing the complexity of the relationship between u and the objective function also makes the problem more difficult for ML. On the other hand, the performance of IO is strongly dependent on the correctness of the objective function prior.

Perhaps the most important characteristic determining the difficulty of the objective function learning tasks is the number of critical regions. Over 20 randomly-generated POPs, we found a substantial increase in the relative test error for all three ML methods as the number of critical regions increased. In fact, we conjecture that the underlying reason why adding parametric constraints and/or increasing the complexity of the dependence on u made learning more difficult is because doing so also increased the number of critical regions.

Given these observations, a summary of recommendations of when to use ML or IO for learning the coefficients of a convex POP is needed. We summarize our recommendations in Table 3.
In each combination of a criterion and its value (low/small or high/large), we state whether we expect the classic ML methods or an IO method to be a better choice (or at least a better starting point) for solving the learning problem, while holding the other criteria fixed. For example, when the training set size is small, we expect IO to be a better option due to its data efficiency.

Table 3: Problem Characteristics and Suggested Methods.

                                         Low/Small    High/Large
    Size of Training Set                 IO           ML
    Dependence on External Parameter     ML           IO
    Confidence in Correctness of Prior   ML           IO
    Number of Critical Regions           ML           IO

The relative performance of the methods considered may not be the same for other types of learning problems where data was generated by an optimization problem (e.g., learning constraints, learning in discrete optimization problems). However, we conjecture that the analysis of the underlying structure of the value function or the optimizer function in terms of the external parameters u (i.e., analysis of the problem in terms of the critical regions) will be important both in gaining additional understanding of the challenges of learning from optimization data and in developing more sophisticated learning methods for such problems.
6. Conclusion
In this paper, we view inverse optimization as a problem of learning from decisions that are made through an unknown optimization process. We specifically focus on the problem of learning a convex objective function of a parametric optimization problem. We experimentally compare the predictive performance of an inverse optimization method with perfect and imperfect priors with three well-known machine learning algorithms: support vector regression, random forest and Gaussian processes. While we show that some inverse optimization problems can be tackled through classic machine learning approaches, we highlight the need for sophisticated inverse optimization models for problems where at least one of the following characteristics holds: (i) the size of the training set is small, (ii) both the constraints and the objective function of the problem in question are dependent on an external parameter (feature), particularly when that dependence is non-linear, (iii) we have high confidence that our knowledge of the parametric nature of the constraints and objective is correct, and (iv) the number of critical regions of the POP is large with no one region dominating. We believe that these observations provide practitioners with guidance on when to consider employing inverse optimization instead of, or in addition to, classical machine learning.
Acknowledgements

The authors are supported by the Natural Sciences and Engineering Research Council of Canada.

Appendix: KKT-based Inverse Optimization Model

The Keshavarz et al. (2011) IO model for (1) is

$$
\begin{aligned}
\mathrm{IO}(\mathcal{D}):\quad \min_{\boldsymbol{\lambda},\,\mathbf{c}} \quad & \sum_{k \in K} \phi\!\left(r^{\mathrm{stat}}_{k}, r^{\mathrm{comp}}_{k}\right) \\
\text{s.t.} \quad & r^{\mathrm{stat}}_{jk} = \frac{\partial f(\hat{\mathbf{u}}_k, \hat{\mathbf{x}}_k, \mathbf{c})}{\partial x_j} + \sum_{i=1}^{m} \lambda_{ki} \frac{\partial g_i(\hat{\mathbf{u}}_k, \hat{\mathbf{x}}_k)}{\partial x_j} \qquad \forall\, j \in J,\; k \in K, \\
& r^{\mathrm{comp}}_{ik} = -\lambda_{ik}\, g_i(\hat{\mathbf{u}}_k, \hat{\mathbf{x}}_k) \qquad \forall\, i \in I,\; k \in K, \\
& \boldsymbol{\lambda} \geq 0,
\end{aligned}
$$

where $\phi$ is some norm; $r^{\mathrm{stat}}$ and $r^{\mathrm{comp}}$ represent the stationarity and complementary slackness residuals, respectively, of the KKT conditions; and $\boldsymbol{\lambda}$ is the dual variable. The missing objective function coefficient vector $\mathbf{c}$ is a decision variable, and the pairs $(\hat{\mathbf{x}}_k, \hat{\mathbf{u}}_k)$ are the inputs.
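As a hedged illustration of this model (a sketch under assumed problem data, not code from the paper): for a toy quadratic forward problem f(u, x, c) = 0.5‖x − c‖² with a single affine constraint g(u, x) = x₁ + x₂ − u ≤ 0, choosing a squared norm for φ makes the IO model a bound-constrained linear least-squares problem in (c, λ), which scipy can solve directly. The forward problem, c_true, and the values of u are all illustrative assumptions.

```python
import numpy as np
from scipy.optimize import lsq_linear

# Assumed forward problem: minimize 0.5*||x - c||^2  s.t.  x1 + x2 <= u.
c_true = np.array([0.6, 0.9])


def forward(u):
    """Forward optimizer: project c_true onto {x : x1 + x2 <= u}."""
    excess = c_true.sum() - u
    return c_true - max(excess, 0.0) / 2.0 * np.ones(2)


us = [2.0, 1.0, 0.5]
xs = [forward(u) for u in us]          # observed optimal decisions x_hat_k

# KKT-residual IO model with squared-norm phi, as least squares in
# z = (c1, c2, lam_1, ..., lam_K):
#   r_stat_k = (x_hat_k - c) + lam_k * grad g,   grad g = (1, 1)
#   r_comp_k = lam_k * g(u_k, x_hat_k)
K, n = len(us), 2
A = np.zeros((K * n + K, n + K))
b = np.zeros(K * n + K)
for k, (u, xh) in enumerate(zip(us, xs)):
    for j in range(n):                  # stationarity rows
        A[k * n + j, j] = -1.0          # -c_j  (from grad f = x - c)
        A[k * n + j, n + k] = 1.0       # +lam_k (dg/dx_j = 1)
        b[k * n + j] = -xh[j]
    A[K * n + k, n + k] = xh.sum() - u  # complementarity row: lam_k * g(x_hat_k)

lb = np.concatenate([np.full(n, -np.inf), np.zeros(K)])  # lam >= 0
res = lsq_linear(A, b, bounds=(lb, np.full(n + K, np.inf)))
c_hat = res.x[:n]
print("imputed c:", c_hat)
```

The observation with u = 2.0 leaves the constraint slack, which pins down c exactly; the other observations then determine the dual variables, so the recovered c_hat matches c_true in this noiseless instance.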
References
Ahmadi-Moshkenani, P., Johansen, T. A., & Olaru, S. (2018). Combinatorial approach toward multiparametric quadratic programming based on characterizing adjacent critical regions. IEEE Transactions on Automatic Control, (10), 3221–3231.

Ahuja, R. K., & Orlin, J. B. (2001). Inverse optimization. Operations Research, (5), 771–783.

Aswani, A., Kaminsky, P., Mintz, Y., Flowers, E., & Fukuoka, Y. (2019). Behavioral modeling in weight loss interventions. European Journal of Operational Research, (3), 1058–1072.

Aswani, A., Shen, Z.-J., & Siddiq, A. (2018). Inverse optimization with noisy data. Operations Research, (3), 870–892.

Babier, A., Chan, T. C. Y., Lee, T., Mahmood, R., & Terekhov, D. (2021). An ensemble learning framework for model fitting and evaluation in inverse linear optimization. INFORMS Journal on Optimization, Articles in Advance, 1–20. Retrieved from https://doi.org/10.1287/ijoo.2019.0045

Bärmann, A., Pokutta, S., & Schneider, O. (2017). Emulating the expert: Inverse optimization through online learning. In International Conference on Machine Learning (pp. 400–410).

Breiman, L. (2001). Random forests. Machine Learning, (1), 5–32.

Bunduchi, E., & Mandric, I. (2011). Parametric linear programming. Retrieved from http://demonstrations.wolfram.com/ParametricLinearProgramming/

Burton, D., & Toint, P. L. (1992). On an instance of the inverse shortest paths problem. Mathematical Programming, (1-3), 45–61.

Burton, D., & Toint, P. L. (1994). On the use of an inverse shortest paths algorithm for recovering linearly correlated costs. Mathematical Programming, (1-3), 1–22.

Chan, T. C., Craig, T., Lee, T., & Sharpe, M. B. (2014). Generalized inverse multiobjective optimization with application to cancer therapy. Operations Research, (3), 680–695.

Chan, T. C., Lee, T., & Terekhov, D. (2019). Inverse optimization: Closed-form solutions, geometry, and goodness of fit. Management Science, (3), 1115–1135.

Chow, J., & Recker, W. (2012). Inverse optimization with endogenous arrival time constraints to calibrate the household activity pattern problem. Transportation Research Part B: Methodological, (3), 463–479.

Drucker, H., Burges, C., Kaufman, L., Smola, A., & Vapnik, V. (1997). Support vector regression machines. In Advances in Neural Information Processing Systems (pp. 155–161).

Egri, P., Kis, T., Kovács, A., & Váncza, J. (2014). An inverse economic lot-sizing approach to eliciting supplier cost parameters. International Journal of Production Economics, 80–88.

Esfahani, P., Shafieezadeh-Abadeh, S., Hanasusanto, G., & Kuhn, D. (2018). Data-driven inverse optimization with imperfect information. Mathematical Programming, (1), 191–234.

Fernández-Blanco, R., Morales, J. M., Pineda, S., & Porras, Á. (2019). EV-fleet power forecasting via kernel-based inverse optimization. arXiv preprint arXiv:1908.00399.

Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. MIT Press.

Keshavarz, A., Wang, Y., & Boyd, S. (2011). Imputing a convex objective function. In Intelligent Control (ISIC), 2011 IEEE International Symposium on (pp. 613–619).

Kovács, A. (2019). Parameter elicitation for consumer models in demand response management. Industrial & Engineering Chemistry Research, (33), 8979–8991. Retrieved from https://doi.org/10.1021/acs.iecr.6b01913

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., . . . Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 2825–2830.

Pistikopoulos, E. N., Georgiadis, M. C., & Dua, V. (2011). Multi-parametric programming (Vol. 1).

Rasmussen, C. (2004). Gaussian processes in machine learning. In Advanced Lectures on Machine Learning (pp. 63–71). Springer.

Russell, S. J., & Norvig, P. (2016). Artificial intelligence: A modern approach. Pearson Education Limited.

Saez-Gallego, J., & Morales, J. M. (2017). Short-term forecasting of price-responsive loads using inverse optimization. IEEE Transactions on Smart Grid, (5), 4805–4814.

Shahmoradi, Z., & Lee, T. (2020). Quantile inverse optimization: Improving stability in inverse linear programming.

Tan, Y., Delong, A., & Terekhov, D. (2019). Deep inverse optimization. In International Conference on Integration of Constraint Programming, Artificial Intelligence, and Operations Research (pp. 540–556).

Tan, Y., Terekhov, D., & Delong, A. (2020). Learning linear programs from optimal decisions. In Advances in Neural Information Processing Systems (Vol. 33).

Taskar, B., Chatalbashev, V., Koller, D., & Guestrin, C. (2005). Learning structured prediction models: A large margin approach. In Proceedings of the 22nd International Conference on Machine Learning (pp. 896–903).

Tavaslıoğlu, O., Lee, T., Valeva, S., & Schaefer, A. J. (2018). On the structure of the inverse-feasible region of a linear program. Operations Research Letters, (1), 147–152.